# ASL Model Evaluation Framework

This notebook provides a framework for evaluating multiple machine learning models on American Sign Language (ASL) recognition tasks. It can compare large vision-language models and traditional computer vision models on their ability to recognize ASL hand signs.

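The framework supports:
- Multiple models (GPT-4 Turbo, GPT-4o, Gemini Flash and Flash Lite, Llama 90B Vision, Llama Maverick, Llama Scout, Pixtral, and Granite Vision)
- Multiple prompting strategies (zero-shot, few-shot, chain-of-thought, visual grounding, and contrastive)
- Various evaluation metrics (accuracy, confusion matrix, and per-class precision, recall, and F1-score)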

### Please refer to evaluate_models.py for the script as actually implemented


In [1]:
# Import required libraries
import os
import argparse
import base64
import io
import json
import time
from pathlib import Path
import requests
from PIL import Image
from tqdm import tqdm
import torch
import torchvision.transforms as transforms
from dotenv import load_dotenv
import logging
import traceback
import random
import torch.nn.functional as F
import numpy as np
import sys
import re
import string
from sklearn.metrics import confusion_matrix, classification_report, top_k_accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Any


## Setup and Configuration

First, we'll set up logging, import the model predictors, and define our evaluation parameters.

In [2]:
# Set up logging
timestamp = time.strftime("%Y%m%d_%H%M%S")
log_dir = "evaluation_logs"
os.makedirs(log_dir, exist_ok=True)
log_file = os.path.join(log_dir, f"evaluation_{timestamp}.log")

# Configure logging to write to both file and console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()
    ]
)

logging.info(f"Logging to file: {log_file}")

# Initialize global results dictionary
results = {
    "timestamp": timestamp,  # reuse the timestamp from the log file name
    "dataset_path": "",
    "sample_size": 0,
    "results": {}
}

# Initialize global variables for token management
watsonx_token = None
watsonx_token_expiry = None

2025-05-13 11:24:59,274 - INFO - Logging to file: evaluation_logs/evaluation_20250513_112459.log


In [3]:
# --- Import Model Predictors --- #

# Keep these imports active
AVAILABLE_MODELS = {}

try:
    from test_gpt4_turbo import get_asl_prediction as get_gpt4_turbo_prediction
    AVAILABLE_MODELS["gpt4_turbo"] = get_gpt4_turbo_prediction
except ImportError as e:
    logging.warning(f"GPT-4 Turbo model not available: {e}")

try:
    from test_gpt4o import get_asl_prediction as get_gpt4o_prediction
    AVAILABLE_MODELS["gpt4o"] = get_gpt4o_prediction
except ImportError as e:
    logging.warning(f"GPT-4o model not available: {e}")

try:
    from test_gemini_2_flash import get_asl_prediction as get_gemini_flash_prediction
    AVAILABLE_MODELS["gemini_flash"] = get_gemini_flash_prediction
except ImportError as e:
    logging.warning(f"Gemini Flash model not available: {e}")

try:
    from test_gemini_2_flash_lite import get_asl_prediction as get_gemini_flash_lite_prediction
    AVAILABLE_MODELS["gemini_flash_lite"] = get_gemini_flash_lite_prediction
except ImportError as e:
    logging.warning(f"Gemini Flash Lite model not available: {e}")

try:
    from test_llama_90b_vision import get_asl_prediction as get_llama_90b_prediction
    AVAILABLE_MODELS["llama_90b"] = get_llama_90b_prediction
except ImportError as e:
    logging.warning(f"Llama 90B Vision model not available: {e}")

try:
    from test_llama_maverick_17b import get_asl_prediction as get_llama_maverick_prediction
    AVAILABLE_MODELS["llama_maverick"] = get_llama_maverick_prediction
except ImportError as e:
    logging.warning(f"Llama Maverick model not available: {e}")

try:
    from test_llama_scout_17b import get_asl_prediction as get_llama_scout_prediction
    AVAILABLE_MODELS["llama_scout"] = get_llama_scout_prediction
except ImportError as e:
    logging.warning(f"Llama Scout model not available: {e}")

try:
    from test_pixtral_12b import get_asl_prediction as get_mistral_prediction
    AVAILABLE_MODELS["mistral"] = get_mistral_prediction
except ImportError as e:
    logging.warning(f"Mistral (Pixtral) model not available: {e}")

try:
    from test_granite_vision import get_asl_prediction as get_granite_vision_prediction
    AVAILABLE_MODELS["granite_vision"] = get_granite_vision_prediction
except ImportError as e:
    logging.warning(f"Granite Vision model not available: {e}")

# Define models to evaluate based on availability
MODELS_TO_EVALUATE = list(AVAILABLE_MODELS.keys())
logging.info(f"Available models: {MODELS_TO_EVALUATE}")

if not MODELS_TO_EVALUATE:
    logging.warning("No models available for evaluation. Please check your environment variables and dependencies.")

Loading .env file from: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env


2025-05-13 11:25:01,263 - INFO - Available models: ['gpt4_turbo', 'gpt4o', 'gemini_flash', 'gemini_flash_lite', 'llama_90b', 'llama_maverick', 'llama_scout', 'mistral', 'granite_vision']



In [4]:
# Load environment variables from backend/.env
dotenv_path = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), 'backend', '.env')
load_dotenv(dotenv_path=dotenv_path)
logging.info(f"Attempting to load .env file from: {dotenv_path}")

# Define valid classes and class mapping
VALID_CLASSES = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # All ASL letters
CLASS_MAP = {label: i for i, label in enumerate(VALID_CLASSES)}
INDEX_TO_CLASS = {i: label for label, i in CLASS_MAP.items()}

# Define available prompt strategies (use all 5)
PROMPT_STRATEGIES = [
    "zero_shot",
    "few_shot",
    "chain_of_thought",
    "visual_grounding",
    "contrastive"
]

2025-05-13 11:25:01,270 - INFO - Attempting to load .env file from: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/backend/.env
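
As a small illustration of how the class mapping above can be used, the sketch below converts letters to integer indices and back. The helper names `encode_labels` and `decode_labels` are hypothetical and not part of the notebook; the three mapping objects are redefined here only so the snippet runs on its own.

```python
import string

VALID_CLASSES = list(string.ascii_uppercase)
CLASS_MAP = {label: i for i, label in enumerate(VALID_CLASSES)}
INDEX_TO_CLASS = {i: label for label, i in CLASS_MAP.items()}


def encode_labels(letters):
    """Map letters to class indices; unknown labels become -1 (hypothetical helper)."""
    return [CLASS_MAP.get(letter, -1) for letter in letters]


def decode_labels(indices):
    """Map class indices back to letters; unknown indices become None (hypothetical helper)."""
    return [INDEX_TO_CLASS.get(i) for i in indices]


encoded = encode_labels(["A", "Z", "?"])
decoded = decode_labels([0, 25, 99])
print(encoded)
print(decoded)
```

Returning a sentinel for unknown labels, rather than raising, is one possible choice; it keeps malformed model output from interrupting a long evaluation run.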


## Helper Functions

We'll define helper functions for authentication, image processing, and model predictions.

In [5]:
def get_watsonx_token(api_key):
    """Gets or refreshes the IBM Cloud IAM token."""
    global watsonx_token, watsonx_token_expiry
    now = time.time()

    if watsonx_token and watsonx_token_expiry and now < watsonx_token_expiry:
        return watsonx_token

    logging.info("Refreshing WatsonX authentication token...")
    auth_url = "https://iam.cloud.ibm.com/identity/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    data = {"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": api_key}
    try:
        response = requests.post(auth_url, headers=headers, data=data, timeout=15)
        response.raise_for_status()
        token_data = response.json()
        watsonx_token = token_data.get("access_token")
        expires_in = token_data.get("expires_in", 3600)
        watsonx_token_expiry = now + expires_in - 300  # 5 min buffer
        return watsonx_token
    except requests.exceptions.RequestException as e:
        logging.error(f"Authentication failed: {e}")
        return None
    except Exception as e:
        logging.error(f"Error refreshing token: {e}")
        return None

def encode_image_base64(image_path, resize_dim=(256, 256)):
    """Reads, resizes, and returns the base64 encoded string of an image."""
    try:
        with Image.open(image_path) as img:
            if img.mode != 'RGB':
                img = img.convert('RGB')

            if resize_dim:
                img = img.resize(resize_dim)

            buffer = io.BytesIO()
            img.save(buffer, format="JPEG", quality=95)
            base64_str = base64.b64encode(buffer.getvalue()).decode('utf-8')
            base64_str = ''.join(base64_str.split())
            return base64_str
    except Exception as e:
        logging.error(f"Error encoding/resizing image {image_path}: {e}")
        traceback.print_exc()
        return None
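
The caching logic in `get_watsonx_token` (reuse the token until five minutes before it expires) can also be expressed without module-level globals. The following is only a sketch of that idea: the `TokenCache` class, the `fetch` callable, and the injectable `clock` are assumptions made for illustration, and the demo uses a fake fetcher instead of calling the IAM endpoint.

```python
import time


class TokenCache:
    """Caches a token until shortly before it expires (illustrative helper)."""

    def __init__(self, fetch, buffer_seconds=300, clock=time.time):
        self._fetch = fetch            # callable returning (token, expires_in_seconds)
        self._buffer = buffer_seconds  # refresh this many seconds before expiry
        self._clock = clock            # injectable so the logic can be tested
        self._token = None
        self._expiry = 0.0

    def get(self):
        now = self._clock()
        if self._token is not None and now < self._expiry:
            return self._token         # cached token is still considered fresh
        token, expires_in = self._fetch()
        self._token = token
        self._expiry = now + expires_in - self._buffer
        return token


# Demo with a fake fetcher and a controllable clock (no network access).
calls = []
fake_now = [1000.0]


def fake_fetch():
    calls.append(1)
    return f"token-{len(calls)}", 3600


cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()    # fetches a new token
second = cache.get()   # served from the cache
fake_now[0] += 3400    # move past expiry minus the 5-minute buffer
third = cache.get()    # triggers a refresh
```

Passing the clock in as a parameter makes the expiry behavior checkable without waiting for a real token to age.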

In [6]:
def load_dataset_sample(dataset_path_str, sample_size=30):
    """Load random samples from the dataset for each letter."""
    dataset_path = Path(dataset_path_str)
    if not dataset_path.exists():
        raise ValueError(f"Dataset path {dataset_path} does not exist")
    
    samples = []
    for letter in VALID_CLASSES:
        letter_dir = dataset_path / letter
        if not letter_dir.exists():
            logging.warning(f"Directory for letter {letter} not found: {letter_dir}")
            continue
            
        # Get all image files in the directory
        image_files = list(letter_dir.glob("*.jpg"))
        if not image_files:
            logging.warning(f"No images found for letter {letter} in {letter_dir}")
            continue
            
        # Randomly sample images
        if len(image_files) < sample_size:
            logging.warning(f"Not enough images for letter {letter}. Found {len(image_files)}, requested {sample_size}")
            selected_images = image_files
        else:
            selected_images = random.sample(image_files, sample_size)
            
        # Add selected images to samples list
        for img_path in selected_images:
            samples.append((str(img_path), letter))
    
    logging.info(f"Loaded {len(samples)} images across all letters")
    return samples

def get_prediction(model_name, image_path, prompt_strategy="zero_shot"):
    """Get prediction from the specified model."""
    start_time = time.time()
    try:
        if model_name not in AVAILABLE_MODELS:
            raise ValueError(f"Model {model_name} is not available")
            
        # Get the prediction function for this model
        predict_func = AVAILABLE_MODELS[model_name]
        
        # Handle different parameter naming conventions: the Llama and
        # Pixtral wrappers take `strategy`; all others take `prompt_strategy`
        if model_name in ["llama_90b", "llama_maverick", "llama_scout", "mistral"]:
            result = predict_func(image_path, strategy=prompt_strategy)
        else:  # GPT-4, Gemini, and Granite Vision models
            result = predict_func(image_path, prompt_strategy=prompt_strategy)
        
        response_time = time.time() - start_time
        
        # Extract token usage if available
        token_usage = 0
        if isinstance(result, dict):
            if "metadata" in result:
                if "total_tokens" in result["metadata"]:
                    token_usage = result["metadata"]["total_tokens"]
                elif "tokens" in result["metadata"] and "total" in result["metadata"]["tokens"]:
                    token_usage = result["metadata"]["tokens"]["total"]
            
            # If the result has a "prediction" key, use that as the actual result
            if "prediction" in result:
                result = result["prediction"]
        
        return result, response_time, token_usage
    except Exception as e:
        logging.error(f"Error getting prediction from {model_name}: {str(e)}")
        response_time = time.time() - start_time
        return {"error": str(e)}, response_time, 0
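
`load_dataset_sample` draws from the global `random` state, so two runs generally evaluate different images. If repeatable samples are wanted, one option is a seeded, local random generator. The sketch below is an assumption about how that could look, not the notebook's method: `sample_per_class` and its `seed` parameter are hypothetical, and the demo runs on a temporary directory of empty placeholder files.

```python
import random
import tempfile
from pathlib import Path


def sample_per_class(dataset_dir, classes, sample_size, seed=0):
    """Return (path, label) pairs, at most sample_size per class, deterministically."""
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    samples = []
    for label in classes:
        # Sort first so the sampling input does not depend on filesystem order
        files = sorted((Path(dataset_dir) / label).glob("*.jpg"))
        chosen = files if len(files) <= sample_size else rng.sample(files, sample_size)
        samples.extend((str(path), label) for path in chosen)
    return samples


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for label in ["A", "B"]:
        (Path(tmp) / label).mkdir()
        for i in range(5):
            (Path(tmp) / label / f"{label}_{i}.jpg").touch()
    run_1 = sample_per_class(tmp, ["A", "B"], sample_size=3, seed=42)
    run_2 = sample_per_class(tmp, ["A", "B"], sample_size=3, seed=42)

print(len(run_1), run_1 == run_2)
```

With the same seed and the same files, every model and strategy would be scored on an identical set of images, which makes comparisons across runs easier to interpret.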

In [7]:
def handle_result(model_name, prediction_result, true_letter, image_path, response_time=None, token_usage=None, prompt_strategy="zero_shot"):
    """Handle prediction result and update global results."""
    if model_name not in results["results"]:
        results["results"][model_name] = {
            "predictions": [],
            "ground_truth": [],
            "response_times": [],
            "token_usage": [],
            "prompt_strategy_results": {}
        }
    
    # Add to overall results
    results["results"][model_name]["predictions"].append(prediction_result)
    results["results"][model_name]["ground_truth"].append(true_letter)
    results["results"][model_name]["response_times"].append(response_time if response_time else 0)
    results["results"][model_name]["token_usage"].append(token_usage if token_usage else 0)
    
    # Add to strategy-specific results
    if prompt_strategy not in results["results"][model_name]["prompt_strategy_results"]:
        results["results"][model_name]["prompt_strategy_results"][prompt_strategy] = {
            "predictions": [],
            "ground_truth": [],
            "response_times": [],
            "token_usage": []
        }
    
    strategy_data = results["results"][model_name]["prompt_strategy_results"][prompt_strategy]
    strategy_data["predictions"].append(prediction_result)
    strategy_data["ground_truth"].append(true_letter)
    strategy_data["response_times"].append(response_time if response_time else 0)
    strategy_data["token_usage"].append(token_usage if token_usage else 0)
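
The `response_times` and `token_usage` lists recorded by `handle_result` make simple efficiency summaries possible. The sketch below shows one way to aggregate them for a single model entry; `summarize_usage` is a hypothetical helper and the example entry is made-up data in the same shape as `results["results"][model_name]`. Note that failed calls are stored with a token count of 0, which lowers any per-call average.

```python
from statistics import mean, median


def summarize_usage(model_entry):
    """Summarize latency and token usage for one model's results entry."""
    times = model_entry.get("response_times", [])
    tokens = model_entry.get("token_usage", [])
    return {
        "n_calls": len(times),
        "mean_response_time": mean(times) if times else None,
        "median_response_time": median(times) if times else None,
        "total_tokens": sum(tokens),
    }


# Made-up entry for illustration only.
example_entry = {
    "predictions": ["A", "B", "C"],
    "ground_truth": ["A", "B", "D"],
    "response_times": [1.0, 2.0, 6.0],
    "token_usage": [100, 120, 0],
}
summary = summarize_usage(example_entry)
print(summary)
```

Reporting the median alongside the mean is a design choice: a single slow API call can dominate the mean when the sample is small.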

## Metrics Calculation and Visualization

These functions calculate evaluation metrics and generate visualizations for the results.

In [8]:
def calculate_statistics(results: Dict[str, Any], misclassified: Dict[str, List[Tuple[str, str]]]) -> Dict[str, Any]:
    """Calculate statistics for each model and strategy."""
    for model_name, model_data in results["results"].items():
        # Calculate overall metrics
        predictions = model_data["predictions"]
        ground_truth = model_data["ground_truth"]
        
        # Calculate overall statistics
        correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
        total = len(predictions)
        errors = total - correct
        
        model_data.update({
            "correct": correct,
            "total": total,
            "errors": errors,
            "metrics": calculate_metrics(predictions, ground_truth)
        })
        
        # Calculate strategy-specific statistics
        for strategy, strategy_data in model_data["prompt_strategy_results"].items():
            strategy_predictions = strategy_data["predictions"]
            strategy_ground_truth = strategy_data["ground_truth"]
            
            strategy_correct = sum(1 for p, g in zip(strategy_predictions, strategy_ground_truth) if p == g)
            strategy_total = len(strategy_predictions)
            
            strategy_data.update({
                "correct": strategy_correct,
                "total": strategy_total,
                "metrics": calculate_metrics(strategy_predictions, strategy_ground_truth)
            })
    
    return results

def calculate_metrics(predictions: List[str], ground_truth: List[str], prediction_probs: np.ndarray = None) -> Dict[str, Any]:
    """Calculate comprehensive evaluation metrics."""
    if not predictions or not ground_truth:
        return {"error": "No predictions or ground truth data available"}
    
    metrics = {}
    
    # Filter out invalid predictions (ensure they're all strings and within VALID_CLASSES)
    valid_indices = []
    for i, (pred, truth) in enumerate(zip(predictions, ground_truth)):
        if isinstance(pred, str) and isinstance(truth, str) and pred in VALID_CLASSES and truth in VALID_CLASSES:
            valid_indices.append(i)
    
    if not valid_indices:
        return {"error": "No valid predictions available for metrics calculation"}
    
    # Use only the valid predictions and ground truth values
    filtered_preds = [predictions[i] for i in valid_indices]
    filtered_truth = [ground_truth[i] for i in valid_indices]
    
    # Convert to numpy arrays for easier calculations
    preds = np.array(filtered_preds)
    truth = np.array(filtered_truth)
    
    # 1. Basic Accuracy
    metrics["accuracy"] = np.mean(preds == truth)
    
    # 2. Confusion Matrix (if we have enough data)
    try:
        cm = confusion_matrix(truth, preds, labels=VALID_CLASSES)
        metrics["confusion_matrix"] = cm.tolist()
    except Exception as e:
        logging.error(f"Error calculating confusion matrix: {e}")
        metrics["confusion_matrix"] = None
    
    # 3. Classification Report (if we have enough data)
    try:
        report = classification_report(truth, preds, labels=VALID_CLASSES, output_dict=True)
        metrics["classification_report"] = report
        
        # Extract macro and weighted averages
        metrics["macro_avg"] = {
            "precision": report["macro avg"]["precision"],
            "recall": report["macro avg"]["recall"],
            "f1_score": report["macro avg"]["f1-score"]
        }
        
        metrics["weighted_avg"] = {
            "precision": report["weighted avg"]["precision"],
            "recall": report["weighted avg"]["recall"],
            "f1_score": report["weighted avg"]["f1-score"]
        }
    except Exception as e:
        logging.error(f"Error calculating classification report: {e}")
        metrics["classification_report"] = None
        metrics["macro_avg"] = None
        metrics["weighted_avg"] = None
    
    return metrics
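
One detail worth keeping in mind when reading the numbers: `calculate_statistics` counts `correct` over every recorded prediction, while `calculate_metrics` first drops entries that are not valid letters (for example the error dictionaries returned by `get_prediction`). The two accuracy figures can therefore differ. The sketch below makes the difference explicit on made-up data; `accuracy_views` is a hypothetical helper, not part of the notebook.

```python
import string

VALID = set(string.ascii_uppercase)


def accuracy_views(predictions, ground_truth):
    """Compare accuracy over all samples with accuracy over valid predictions only."""
    pairs = list(zip(predictions, ground_truth))
    if not pairs:
        return {"overall_accuracy": None, "valid_only_accuracy": None, "invalid_rate": None}

    correct = sum(1 for p, g in pairs if p == g)
    # Keep only pairs where both sides are single valid letters
    valid = [
        (p, g) for p, g in pairs
        if isinstance(p, str) and isinstance(g, str) and p in VALID and g in VALID
    ]
    valid_correct = sum(1 for p, g in valid if p == g)

    return {
        "overall_accuracy": correct / len(pairs),
        "valid_only_accuracy": valid_correct / len(valid) if valid else None,
        "invalid_rate": 1 - len(valid) / len(pairs),
    }


# Made-up predictions: the third call failed and returned an error dictionary.
preds = ["A", "B", {"error": "timeout"}, "C"]
truth = ["A", "C", "D", "C"]
views = accuracy_views(preds, truth)
print(views)
```

A model that fails often can look better under the valid-only view than under the overall view, so reporting the invalid rate next to either figure helps avoid misreading the results.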

In [9]:
def evaluate_subset(predictions: List[str], ground_truth: List[str], image_paths: List[str], subset_type: str) -> Dict[str, Any]:
    """Evaluate model performance on a specific subset of images."""
    subset_indices = []
    for i, path in enumerate(image_paths):
        if subset_type == 'grayscale' and 'grayscale' in path.lower():
            subset_indices.append(i)
        elif subset_type == 'flipped' and 'flipped' in path.lower():
            subset_indices.append(i)
    
    if not subset_indices:
        return {
            "count": 0,
            "accuracy": None,
            "error_rate": None
        }
    
    # Only include indices that have corresponding predictions
    valid_indices = [i for i in subset_indices if i < len(predictions)]
    
    if not valid_indices:
        return {
            "count": 0,
            "accuracy": None,
            "error_rate": None
        }
    
    subset_preds = [predictions[i] for i in valid_indices]
    subset_truth = [ground_truth[i] for i in valid_indices]
    
    accuracy = np.mean(np.array(subset_preds) == np.array(subset_truth))
    error_rate = 1 - accuracy
    
    return {
        "count": len(valid_indices),
        "accuracy": accuracy,
        "error_rate": error_rate
    }

def save_intermediate_results(output_dir: str, model_name: str) -> None:
    """Save intermediate results to a temporary file."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    temp_file = os.path.join(output_dir, f"temp_results_{model_name}_{timestamp}.json")
    with open(temp_file, 'w') as f:
        json.dump(results, f, indent=2)
    logging.info(f"Saved intermediate results to {temp_file}")


In [10]:
def plot_confusion_matrix(cm, labels, title, output_path):
    """Plot and save a confusion matrix as a heatmap."""
    plt.figure(figsize=(12, 10))

    # Normalize the confusion matrix for display
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    cm_normalized = np.nan_to_num(cm_normalized)  # Replace NaN with 0

    # Create the heatmap
    sns.heatmap(cm_normalized, annot=cm, fmt='d', cmap='Blues',
                xticklabels=labels, yticklabels=labels)

    plt.title(title)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')

    # Save the figure
    plt.tight_layout()
    plt.savefig(output_path, dpi=300)
    plt.close()

    return output_path

def generate_confusion_matrices(data, output_dir):
    """Generate confusion matrices for each model and strategy."""
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Get all possible labels (A-Z)
    labels = list(string.ascii_uppercase)
    
    for model_name, model_data in data['results'].items():
        # Check if we have prediction data
        if 'predictions' not in model_data or not model_data['predictions']:
            logging.warning(f"No prediction data for model {model_name}")
            continue
            
        # Overall model confusion matrix
        try:
            cm = confusion_matrix(
                model_data['ground_truth'],
                model_data['predictions'],
                labels=labels
            )
            plot_confusion_matrix(
                cm, 
                labels,
                f'Confusion Matrix - {model_name}',
                os.path.join(output_dir, f'confusion_matrix_{model_name.lower()}.png')
            )
        except Exception as e:
            logging.error(f"Error generating confusion matrix for {model_name}: {e}")
        
        # Strategy-specific confusion matrices
        if 'prompt_strategy_results' in model_data:
            for strategy, strategy_data in model_data['prompt_strategy_results'].items():
                if not strategy_data.get('predictions'):
                    continue
                    
                try:
                    cm = confusion_matrix(
                        strategy_data['ground_truth'],
                        strategy_data['predictions'],
                        labels=labels
                    )
                    plot_confusion_matrix(
                        cm,
                        labels,
                        f'Confusion Matrix - {model_name} - {strategy}',
                        os.path.join(output_dir, f'confusion_matrix_{model_name.lower()}_{strategy.lower()}.png')
                    )
                except Exception as e:
                    logging.error(f"Error generating confusion matrix for {model_name} - {strategy}: {e}")
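
`plot_confusion_matrix` colors each cell by its row-normalized value, so a row reads as "of all images of this true letter, what fraction went to each prediction". Letters with no samples produce a zero row sum, which is why the NumPy version replaces NaN with 0. The dependency-free sketch below illustrates the same normalization on a made-up 3x3 matrix; `normalize_rows` is a hypothetical helper written for this example.

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows that sum to zero stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            normalized.append([0.0] * len(row))  # no samples for this true class
        else:
            normalized.append([value / total for value in row])
    return normalized


# Made-up counts: rows are true classes, columns are predicted classes.
counts = [
    [8, 2, 0],  # true class 0: 8 correct, 2 confused with class 1
    [1, 3, 0],  # true class 1: 3 correct, 1 confused with class 0
    [0, 0, 0],  # true class 2 never appeared in the sample
]
rates = normalize_rows(counts)
for row in rates:
    print(row)
```

The diagonal of the normalized matrix is the per-class recall, which is often the quickest way to spot letters a model struggles with.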

## Main Evaluation Function

This section defines the main evaluation function, which runs every available model with every prompting strategy on the sampled images.

In [14]:
def evaluate_models(dataset_path: str, sample_size: int = 30, output_dir: str = "evaluation_results") -> Dict[str, Any]:
    """Evaluate all available models on the dataset."""
    # Update global results with dataset info
    results["dataset_path"] = dataset_path
    results["sample_size"] = sample_size
    
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Load dataset samples
    samples = load_dataset_sample(dataset_path, sample_size)
    
    # Evaluate each model
    for model_name in MODELS_TO_EVALUATE:
        logging.info(f"Evaluating model: {model_name}")
        
        # Evaluate each prompt strategy
        for strategy in PROMPT_STRATEGIES:
            logging.info(f"Using prompt strategy: {strategy}")
            
            # Evaluate each sample
            for image_path, true_letter in tqdm(samples, desc=f"{model_name} - {strategy}"):
                try:
                    # Get prediction (get_prediction measures the response time itself)
                    prediction_result, response_time, token_usage = get_prediction(model_name, image_path, strategy)
                    
                    # Handle the result
                    handle_result(
                        model_name,
                        prediction_result,
                        true_letter,
                        image_path,
                        response_time,
                        token_usage,
                        strategy
                    )
                    
                except Exception as e:
                    logging.error(f"Error evaluating {model_name} on {image_path}: {e}")
                    continue
            
            # Save intermediate results after each strategy is complete
            save_intermediate_results(output_dir, f"{model_name}_{strategy}")
    
    # Calculate final statistics
    final_results = calculate_statistics(results, {})
    
    # Generate confusion matrices
    # generate_confusion_matrices(final_results, output_dir)
    
    # Save results
    output_file = os.path.join(output_dir, f"evaluation_results_{time.strftime('%Y%m%d_%H%M%S')}.json")
    with open(output_file, 'w') as f:
        json.dump(final_results, f, indent=2)
    
    return final_results

def main():
    # Define parameters
    dataset_path = '/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison/data'
    sample_size = 30
    output_dir = 'evaluation_results'
    quick_test = True  # Set to False for full evaluation

    if quick_test:
        # For quick test, override the sample loading to get just one random image
        def quick_test_sample(dataset_path_str, sample_size=None):
            dataset_path = Path(dataset_path_str)
            all_images = []
            for letter in VALID_CLASSES:
                letter_dir = dataset_path / letter
                if letter_dir.exists():
                    image_files = list(letter_dir.glob("*.jpg"))
                    if image_files:
                        all_images.append((str(random.choice(image_files)), letter))
            if all_images:
                return [random.choice(all_images)]
            return []
        
        # Declare globals before modifying
        global PROMPT_STRATEGIES
        global load_dataset_sample
        
        # Override the load_dataset_sample function temporarily
        original_load_dataset = load_dataset_sample
        load_dataset_sample = quick_test_sample
        
        # Override the prompt strategies temporarily
        original_strategies = PROMPT_STRATEGIES
        PROMPT_STRATEGIES = ["zero_shot"]
        
        try:
            # Run evaluation with quick test settings
            results = evaluate_models(dataset_path, sample_size, output_dir)
        finally:
            # Restore original functions
            load_dataset_sample = original_load_dataset
            PROMPT_STRATEGIES = original_strategies
    else:
        # Run normal evaluation
        results = evaluate_models(dataset_path, sample_size, output_dir)

if __name__ == "__main__":
    main() 

2025-05-13 11:27:51,408 - INFO - Evaluating model: gpt4_turbo
2025-05-13 11:27:51,408 - INFO - Using prompt strategy: zero_shot
gpt4_turbo - zero_shot:   0%|          | 0/1 [00:00<?, ?it/s]2025-05-13 11:27:51,418 - INFO - Image encoded successfully. Base64 size: 49920 characters
2025-05-13 11:27:55,773 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-13 11:27:55,776 - INFO - Generated text: {
  "letter": "V",
  "confidence": 0.95,
  "feedback": "The gesture shows the person holding up their hand with the index and middle fingers raised and separated, which corresponds to the ASL sign for the letter 'V'."
}
gpt4_turbo - zero_shot: 100%|██████████| 1/1 [00:04<00:00,  4.37s/it]
2025-05-13 11:27:55,780 - INFO - Saved intermediate results to evaluation_results/temp_results_gpt4_turbo_zero_shot_20250513_112755.json
2025-05-13 11:27:55,781 - INFO - Evaluating model: gpt4o
2025-05-13 11:27:55,782 - INFO - Using prompt strategy: zero_shot
gpt4o - 
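
The log output above shows a model replying with a JSON object that has `letter`, `confidence`, and `feedback` fields. The per-model wrapper modules that turn such replies into predictions are not shown in this notebook, so the following is only a sketch, under that assumption, of how a reply could be normalized to a single uppercase letter. The function name `extract_letter` and its fallback rule are hypothetical.

```python
import json
import re
import string


def extract_letter(generated_text):
    """Return the predicted uppercase letter from a model reply, or None."""
    # First try: the reply (or a {...} span inside it) is a JSON object.
    match = re.search(r"\{.*\}", generated_text, flags=re.DOTALL)
    if match:
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict):
            letter = str(payload.get("letter", "")).strip().upper()
            if len(letter) == 1 and letter in string.ascii_uppercase:
                return letter
    # Fallback: the whole reply is a single letter, e.g. "v" or " V ".
    stripped = generated_text.strip().upper()
    if len(stripped) == 1 and stripped in string.ascii_uppercase:
        return stripped
    return None


reply = '{\n  "letter": "V",\n  "confidence": 0.95,\n  "feedback": "Index and middle fingers raised."\n}'
print(extract_letter(reply))
```

Returning `None` for anything that is not exactly one letter keeps malformed replies out of the accuracy count instead of silently mapping them to a class.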

## Conclusion

This notebook provides a framework for evaluating ASL recognition models under several prompting strategies. The evaluation measures accuracy, precision, recall, and F1-score, records response time and token usage, and can visualize the results as confusion-matrix heatmaps.

Key insights from such evaluations can help identify:
- Which models perform best for ASL recognition
- Which prompting strategies yield the best results for each model
- Common misclassifications and error patterns
- Efficiency metrics like token usage and response time