# GPT-4o Vision ASL Recognition

This notebook demonstrates how to use GPT-4o's vision capabilities to recognize American Sign Language (ASL) letters from images. The code includes:

- Setting up the OpenAI API client
- Processing images for analysis
- Implementing different prompting strategies
- Handling model responses and extracting predictions

### Note: This notebook requires OpenAI API credentials in the `backend/.env` file.
### Refer to `test_gpt4o.py` for the script that was actually used.

In [1]:
import base64
import io
import json
import logging
import os
import time
from pathlib import Path
from typing import Literal

import openai
from dotenv import load_dotenv
from PIL import Image

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## Load Environment Variables and Initialize Client

Load API credentials from .env file and initialize the OpenAI client.

In [2]:
# Load environment variables from backend/.env
# This absolute path is specific to the author's machine; update it before running
dotenv_path = "/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env"
load_dotenv(dotenv_path=dotenv_path)
print(f"Loading .env file from: {dotenv_path}")

# Get OpenAI credentials
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
print(f"API key found: {'Yes' if OPENAI_API_KEY else 'No'}")
if not OPENAI_API_KEY:
    raise ValueError("OpenAI API key not found in environment variables. Please check your .env file.")

# Set OpenAI API key
openai.api_key = OPENAI_API_KEY

Loading .env file from: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env
API key found: Yes
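
The absolute path above only works on the original author's machine. A minimal sketch of one alternative, assuming the notebook runs from somewhere below the repository root and that `backend/.env` sits at that root. `find_env_file` is a hypothetical helper, not part of python-dotenv:

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, relative: str = "backend/.env") -> Optional[Path]:
    """Walk upward from `start` and return the first existing `relative` path."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for directory in (start, *start.parents):
        candidate = directory / relative
        if candidate.is_file():
            return candidate
    return None
```

The returned path could then be passed to `load_dotenv(dotenv_path=...)` in place of the hard-coded string, for example `find_env_file(Path.cwd())`.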


## Prompt Templates

Define prompt templates for different strategies.

In [3]:
# Prompt templates
PROMPT_TEMPLATES = {
"zero_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Respond only with a valid JSON object, using this format:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "A short explanation of how the gesture maps to the predicted letter"
}
Be precise and avoid adding anything outside the JSON response.""",

"few_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Here are some known ASL hand signs:
- A: Fist with thumb resting on the side
- B: Flat open hand, fingers extended upward, thumb across the palm
- C: Hand curved into the shape of the letter C
- D: Index finger up, thumb touching middle finger forming an oval
- E: Fingers bent, thumb tucked under

Respond only with a JSON object like this:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Why this gesture matches the predicted letter"
}
Only return the JSON object. No explanations before or after.""",

"chain_of_thought": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image step-by-step to identify the ASL letter (A-Z).

1. Describe the hand shape
2. Describe the finger and thumb positions
3. Compare these to known ASL letter signs
4. Identify the most likely letter

Then output your answer as JSON:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Summarize your reasoning in one sentence"
}
Return only the JSON object with no extra text.""",

"visual_grounding": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image of a hand gesture and determine which ASL letter (A–Z) it represents.

To guide your analysis, consider the following:
- Which fingers are extended or bent?
- Is the thumb visible, and where is it positioned?
- What is the orientation of the palm (facing forward, sideways, etc.)?
- Are there any unique shapes formed (e.g., circles, fists, curves)?

Now, based on this visual inspection, provide your prediction in the following JSON format:

{
  "letter": "predicted letter (A-Z)",
  "confidence": "confidence score (0–1)",
  "feedback": "brief explanation describing the observed hand shape and reasoning"
}

Be precise, use visual clues from the image, and avoid guessing without justification.""",

"contrastive": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image of a hand gesture and identify the correct ASL letter.

Consider the following candidate letters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
(These letters are visually similar and often confused.)

Step-by-step:
1. Observe the hand shape, finger positions, and thumb placement.
2. Compare the observed gesture against the typical signs for each candidate letter.
3. Eliminate unlikely candidates based on visible differences.
4. Choose the most plausible letter and explain your reasoning.

Format your response as JSON:

{
  "letter": "predicted letter from candidates",
  "confidence": "confidence score (0–1)",
  "feedback": "why this letter was selected over the others"
}

Be analytical and compare carefully to avoid misclassification."""
}
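
Every template asks for the same three fields, so a parsed reply can be checked before it is trusted. A minimal sketch, assuming the reply has already been parsed into a dict. `validate_prediction` is a hypothetical helper and not part of `test_gpt4o.py`. It accepts the confidence as either a number or a numeric string, because the templates describe that field inside quotes:

```python
import string


def validate_prediction(result: dict) -> list:
    """Return a list of problems found in a parsed model reply (empty if valid)."""
    problems = []

    # The letter must be exactly one uppercase ASCII character.
    letter = result.get("letter")
    if not (isinstance(letter, str) and len(letter) == 1 and letter in string.ascii_uppercase):
        problems.append(f"letter must be one uppercase A-Z character, got {letter!r}")

    # The confidence must be convertible to a float in [0, 1].
    try:
        confidence = float(result.get("confidence"))
    except (TypeError, ValueError):
        problems.append("confidence is not numeric")
    else:
        if not 0.0 <= confidence <= 1.0:
            problems.append(f"confidence {confidence} is outside [0, 1]")

    if not isinstance(result.get("feedback"), str):
        problems.append("feedback must be a string")

    return problems
```

An empty list means the reply matches the requested format; any other return value lists what to log or retry on.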

## Image Processing Functions

Define functions to process images for the OpenAI API.

In [4]:
def encode_image_base64(image_path):
    """Encode image to base64 string."""
    try:
        with Image.open(image_path) as img:
            # Convert to RGB if necessary
            if img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Downscale so neither side exceeds 1024 px (aspect ratio is preserved)
            max_size = (1024, 1024)
            img.thumbnail(max_size, Image.Resampling.LANCZOS)
            
            # Save to bytes
            buffer = io.BytesIO()
            img.save(buffer, format="JPEG", quality=95)
            return base64.b64encode(buffer.getvalue()).decode('utf-8')
    except Exception as e:
        logging.error(f"Error encoding image: {e}")
        raise
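
Base64 expands the payload: every 3 bytes of JPEG data become 4 characters. The sketch below uses only the standard library and is an illustration rather than part of the pipeline. It shows that relationship and how a data URL of the kind sent to the API is assembled. `base64_length` and `to_data_url` are hypothetical helper names:

```python
import base64
import math


def base64_length(num_bytes: int) -> int:
    """Number of characters standard (padded) base64 produces for num_bytes of input."""
    return 4 * math.ceil(num_bytes / 3)


def to_data_url(raw: bytes, mime: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a data URL."""
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


payload = bytes(range(10))  # stand-in for real JPEG bytes
url = to_data_url(payload)
encoded_part = url.split(",", 1)[1]
print(len(encoded_part), base64_length(len(payload)))
```

By the same rule, the 146312-character string logged in the sample run corresponds to roughly 110 kB of JPEG data.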

## ASL Prediction Function

Define the main function for getting ASL predictions from the GPT-4o model.

In [5]:
def get_asl_prediction(image_path: str, prompt_strategy: Literal["zero_shot", "few_shot", "chain_of_thought", "visual_grounding", "contrastive"] = "zero_shot") -> dict:
    """Get ASL prediction from GPT-4o model."""
    start_time = time.time()
    
    try:
        # Encode image
        image_base64 = encode_image_base64(image_path)
        logging.info(f"Image encoded successfully. Base64 size: {len(image_base64)} characters")
        
        # Get appropriate prompt template
        prompt = PROMPT_TEMPLATES[prompt_strategy]
        
        # Create the client with the key loaded above, then send the prompt and image
        client = openai.OpenAI(api_key=OPENAI_API_KEY)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a professional ASL interpreter. Your task is to analyze hand gestures and identify the correct ASL letter."
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_base64}"
                            }
                        }
                    ]
                }
            ],
            max_tokens=300,
            temperature=0.05,
            top_p=1.0
        )
        
        # Calculate response time
        response_time = time.time() - start_time

        # Extract the generated text and collect timing and token usage once
        generated_text = response.choices[0].message.content
        metadata = {
            "response_time_seconds": round(response_time, 3),
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens
        }

        logging.info(f"Generated text: {generated_text}")

        # Try to parse the JSON response
        try:
            # Find the JSON object in the response (this also skips Markdown code fences)
            json_start = generated_text.find('{')
            json_end = generated_text.rfind('}') + 1
            if json_start >= 0 and json_end > json_start:
                result = json.loads(generated_text[json_start:json_end])
                # Add timing and token information to the result
                result["metadata"] = metadata
                return result
            logging.warning("No JSON object found in response")
            return {"error": "No JSON found in response", "metadata": metadata}
        except json.JSONDecodeError:
            logging.warning("Failed to parse JSON from response")
            return {"error": "Invalid JSON response", "metadata": metadata}
            
    except Exception as e:
        response_time = time.time() - start_time
        logging.error(f"Error testing GPT-4o: {e}")
        return {
            "error": str(e),
            "metadata": {
                "response_time_seconds": round(response_time, 3)
            }
        }
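
Slicing from the first `{` to the last `}` works for replies like the fenced JSON that GPT-4o returned in the sample run, but it fails if the text after the object contains another brace. One alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, parses the first complete object and ignores whatever follows. The helper name `extract_first_json_object` is hypothetical:

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    position = text.find("{")
    while position != -1:
        try:
            # raw_decode parses one JSON value starting at `position`
            # and tolerates trailing text after it.
            value, _end = decoder.raw_decode(text, position)
            return value
        except json.JSONDecodeError:
            # That brace did not start a valid object; try the next one.
            position = text.find("{", position + 1)
    return None
```

This is a drop-in candidate for the `find`/`rfind` step inside `get_asl_prediction`, with a `None` return taking the place of the "No JSON found" branch.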

## Test with Sample Image

Test the model with a sample image using different prompting strategies.

In [6]:
# Update this path to your image
base_path = Path("/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison")
image_path = base_path / "data/V/V_0_20250428_114109.jpg"

# Fall back to the first .jpg found in the dataset if the default image is missing
if not image_path.exists():
    print(f"Image not found: {image_path}")
    fallback = next((base_path / "data").glob("*/*.jpg"), None)
    if fallback is not None:
        image_path = fallback
        print(f"Using alternative image: {image_path}")

In [7]:
# Test with a specific strategy
def test_single_strategy(strategy="zero_shot"):
    """Test the model with a single prompting strategy."""
    print(f"\nTesting with {strategy} strategy...")
    print(f"Using GPT-4o on image: {image_path}")
    
    result = get_asl_prediction(str(image_path), strategy)
    print(f"\nResult:\n{json.dumps(result, indent=2)}")
    
    return result

In [8]:
# Run the test with the zero_shot strategy
result = test_single_strategy("zero_shot")

# Uncomment to test all strategies
# def test_all_strategies():
#     """Test the model with all prompting strategies."""
#     results = {}
#     
#     for strategy in PROMPT_TEMPLATES.keys():
#         print(f"\nTesting with {strategy} strategy...")
#         result = get_asl_prediction(str(image_path), strategy)
#         results[strategy] = result
#         
#         print(f"Result:\n{json.dumps(result, indent=2)}")
#         
#     return results
# 
# all_results = test_all_strategies()

2025-05-13 12:01:16,250 - INFO - Image encoded successfully. Base64 size: 146312 characters



Testing with zero_shot strategy...
Using GPT-4o on image: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison/data/V/V_0_20250428_114109.jpg


2025-05-13 12:01:20,414 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-13 12:01:20,422 - INFO - Generated text: ```json
{
  "letter": "V",
  "confidence": 0.95,
  "feedback": "The gesture shows the index and middle fingers extended and separated, resembling the ASL letter 'V'."
}
```



Result:
{
  "letter": "V",
  "confidence": 0.95,
  "feedback": "The gesture shows the index and middle fingers extended and separated, resembling the ASL letter 'V'.",
  "metadata": {
    "response_time_seconds": 4.233,
    "prompt_tokens": 897,
    "completion_tokens": 49,
    "total_tokens": 946
  }
}
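
In this run the prediction matches the folder the image came from (`data/V/`). Assuming the dataset keeps that layout, with each image stored under a directory named after its letter, results from several images could be scored as sketched below. `expected_letter` and `accuracy` are hypothetical helpers, and the paths and predictions in the example dict are invented for illustration, not measured results:

```python
from pathlib import Path


def expected_letter(image_path: str) -> str:
    """Ground-truth label: the name of the image's parent directory, uppercased."""
    return Path(image_path).parent.name.upper()


def accuracy(results: dict) -> float:
    """Fraction of images whose predicted letter equals the directory label.

    `results` maps an image path to the dict returned by get_asl_prediction.
    Replies without a "letter" key (errors) count as wrong.
    """
    if not results:
        return 0.0
    correct = sum(
        1
        for path, result in results.items()
        if str(result.get("letter", "")).upper() == expected_letter(path)
    )
    return correct / len(results)


# Illustrative input only; these are not measured results.
example = {
    "data/V/img_a.jpg": {"letter": "V", "confidence": 0.95},
    "data/U/img_b.jpg": {"letter": "V", "confidence": 0.60},
    "data/A/img_c.jpg": {"error": "No JSON found in response"},
}
print(accuracy(example))
```

Running the same scoring once per prompt strategy would give a like-for-like comparison of the five templates.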
