# Gemini 2.0 Flash-Lite ASL Recognition

This notebook demonstrates how to use the Gemini 2.0 Flash-Lite model for ASL letter recognition. The code includes:

- Setting up the Gemini API client
- Processing images for analysis
- Implementing different prompting strategies
- Handling model responses and extracting predictions

### Note: This notebook requires a Google API key (`GOOGLE_API_KEY`) in the backend/.env file
### Please refer to test_gemini_2_flash_lite.py for the script that was actually used

In [25]:
import os
import json
import logging
import base64
import argparse
from pathlib import Path
from dotenv import load_dotenv
from google import genai
from google.genai import types
from PIL import Image
import io
import time
from typing import Literal

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## Load Environment Variables and Initialize Client

Load the API credentials from the .env file and initialize the Gemini client.

In [26]:
# Load environment variables from backend/.env
# Use an absolute path to the .env file
dotenv_path = "/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env"
load_dotenv(dotenv_path=dotenv_path)
print(f"Loading .env file from: {dotenv_path}")

# Get Google API credentials
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
print(f"API key found: {'Yes' if GOOGLE_API_KEY else 'No'}")
if not GOOGLE_API_KEY:
    raise ValueError("Google API key not found in environment variables. Please check your .env file.")

# Initialize the Gemini client
client = genai.Client(api_key=GOOGLE_API_KEY)

Loading .env file from: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env
API key found: Yes
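
The hard-coded absolute path above only works on the author's machine. A minimal sketch of one alternative, assuming the notebook runs somewhere inside a project tree that contains `backend/.env`: walk up from a starting directory until that file is found. The helper name `find_env_file` is hypothetical and is not part of the project or of python-dotenv.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, relative: str = "backend/.env") -> Optional[Path]:
    """Return the first `relative` file found in `start` or any of its parents."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / relative
        if candidate.is_file():
            return candidate
    return None  # hypothetical convention: the caller decides how to handle a miss
```

The result could then be passed to `load_dotenv(dotenv_path=...)` in place of the literal string, for example with `find_env_file(Path.cwd())`.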


## Model Configuration and Prompt Templates

Define model configuration and prompt templates for different strategies.

In [27]:
# Model configurations
MODEL_CONFIGS = {
    "flash-lite": {
        "name": "gemini-2.0-flash-lite",
        "display_name": "Gemini 2.0 Flash-Lite",
        "rate_limit_delay": 3,  # 3 seconds between requests (faster than Flash)
        "max_retries": 3
    }
}

# Prompt templates
PROMPT_TEMPLATES = {
"zero_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Respond only with a valid JSON object, using this format:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "A short explanation of how the gesture maps to the predicted letter"
}
Be precise and avoid adding anything outside the JSON response.""",

"few_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Here are some known ASL hand signs:
- A: Fist with thumb resting on the side
- B: Flat open hand, fingers extended upward, thumb across the palm
- C: Hand curved into the shape of the letter C
- D: Index finger up, thumb touching middle finger forming an oval
- E: Fingers bent, thumb tucked under

Respond only with a JSON object like this:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Why this gesture matches the predicted letter"
}
Only return the JSON object. No explanations before or after.""",

"chain_of_thought": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image step-by-step to identify the ASL letter (A-Z).

1. Describe the hand shape
2. Describe the finger and thumb positions
3. Compare these to known ASL letter signs
4. Identify the most likely letter

Then output your answer as JSON:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Summarize your reasoning in one sentence"
}
Return only the JSON object with no extra text.""",

"visual_grounding": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image of a hand gesture and determine which ASL letter (A–Z) it represents.

To guide your analysis, consider the following:
- Which fingers are extended or bent?
- Is the thumb visible, and where is it positioned?
- What is the orientation of the palm (facing forward, sideways, etc.)?
- Are there any unique shapes formed (e.g., circles, fists, curves)?

Now, based on this visual inspection, provide your prediction in the following JSON format:

{
  "letter": "predicted letter (A-Z)",
  "confidence": "confidence score (0–1)",
  "feedback": "brief explanation describing the observed hand shape and reasoning"
}

Be precise, use visual clues from the image, and avoid guessing without justification.""",

"contrastive": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image of a hand gesture and identify the correct ASL letter.

Consider the following candidate letters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
(Some of these letters are visually similar and often confused.)

Step-by-step:
1. Observe the hand shape, finger positions, and thumb placement.
2. Compare the observed gesture against the typical signs for each candidate letter.
3. Eliminate unlikely candidates based on visible differences.
4. Choose the most plausible letter and explain your reasoning.

Format your response as JSON:

{
  "letter": "predicted letter from candidates",
  "confidence": "confidence score (0–1)",
  "feedback": "why this letter was selected over the others"
}

Be analytical and compare carefully to avoid misclassification."""
}
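
`MODEL_CONFIGS` defines `max_retries`, but no retry loop appears in this notebook. The sketch below shows one way the setting might be used, under the assumption that a failed call raises an exception and that waiting `rate_limit_delay` seconds (doubling on each attempt) is acceptable. `call_with_retries` and its parameters are hypothetical names, not part of the google-genai SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying up to `max_retries` times after the first failure."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error to the caller
            # Exponential backoff: 3s, 6s, 12s with the default values.
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A request could be wrapped as `call_with_retries(lambda: client.models.generate_content(...), config["max_retries"], config["rate_limit_delay"])`. Retrying on every exception is a simplification; normally only transient errors such as rate limits would be retried.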

## Image Processing Functions

Define functions to process images for the Gemini API.

In [28]:
def encode_and_convert_image(image_path, target_format="JPEG", quality=95, max_size=(512, 512)):
    """Process an image and return a (mime_type, image_bytes) tuple."""
    try:
        with Image.open(image_path) as img:
            # Log original image format
            logging.debug(f"Original image format: {img.format}")
            
            # Convert to RGB if necessary
            if img.mode != 'RGB':
                logging.debug(f"Converting image from {img.mode} to RGB")
                img = img.convert('RGB')
            
            # Resize if needed
            if max(img.size) > max(max_size):
                logging.debug(f"Resizing image from {img.size} to max {max_size}")
                img.thumbnail(max_size)  # Resize while maintaining aspect ratio
            
            # Save to bytes with specified format
            buffer = io.BytesIO()
            img.save(buffer, format=target_format, quality=quality)
            image_bytes = buffer.getvalue()
            
            # Get the correct MIME type
            mime_type = f"image/{target_format.lower()}"
            logging.debug(f"Using MIME type: {mime_type}")
            logging.debug(f"Image processing successful. Size: {len(image_bytes)} bytes")
            
            return mime_type, image_bytes
    except Exception as e:
        logging.error(f"Error processing image {image_path}: {e}")
        raise
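
`Image.thumbnail` shrinks an image to fit inside `max_size` while keeping its aspect ratio, and it never enlarges the image. The arithmetic is roughly the following. This is an illustrative sketch with a hypothetical helper name (`fit_within`), and Pillow's own rounding may differ by a pixel in some cases.

```python
from typing import Tuple


def fit_within(size: Tuple[int, int], max_size: Tuple[int, int] = (512, 512)) -> Tuple[int, int]:
    """Scale (width, height) down to fit inside max_size, preserving aspect ratio."""
    width, height = size
    max_width, max_height = max_size
    # The tighter of the two constraints decides the scale; never scale up.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For a 1024×768 photo and the default limit, this gives 512×384; an image that is already smaller than the limit is left unchanged.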

## ASL Prediction Function

Define the main function for getting ASL predictions from the Gemini model.

In [29]:
def get_asl_prediction(image_path: str, prompt_strategy: Literal["zero_shot", "few_shot", "chain_of_thought", "visual_grounding", "contrastive"] = "zero_shot") -> dict:
    """Get ASL prediction from Gemini 2.0 Flash-Lite model for a given image path, including visibility check."""
    config = MODEL_CONFIGS["flash-lite"]
    model_name = config["name"]
    display_name = config["display_name"]
    delay_seconds = config["rate_limit_delay"]
    max_retries = config["max_retries"]  # NOTE: read here but not used below; no retry loop is implemented
    
    # Initialize timing
    start_time = time.time()
    
    try:
        # Process the image and get raw bytes and MIME type
        mime_type, image_bytes = encode_and_convert_image(image_path)
        logging.debug(f"Image processed successfully as {mime_type}. Size: {len(image_bytes)} bytes")

        # Step 1: Visibility Check
        logging.debug(f"Sending visibility check request to {display_name}...")
        visibility_contents = [
            {
                "role": "user",
                "parts": [
                    types.Part.from_bytes(
                        mime_type=mime_type,
                        data=image_bytes
                    ),
                    {
                        "text": "Can you see the image I'm sending? Please respond with ONLY 'yes' or 'no'."
                    }
                ]
            }
        ]
        
        # Count tokens for visibility check
        visibility_tokens = client.models.count_tokens(
            model=model_name,
            contents=visibility_contents
        )
        
        visibility_response = client.models.generate_content(
            model=model_name,
            contents=visibility_contents,
            config=types.GenerateContentConfig(
                temperature=0.05,
                top_p=1.0,
                max_output_tokens=300
            )
        )
        
        # response.text can be None when the model returns no text parts
        visibility_text = (visibility_response.text or "").strip().lower()
        logging.debug(f"Visibility response from {display_name}: {visibility_text}")
        if "yes" not in visibility_text:
            logging.warning(f"{display_name} visibility check failed for {image_path}. Response: {visibility_text}")
            return {
                "error": "Model cannot see the image",
                "metadata": {
                    "response_time_seconds": round(time.time() - start_time, 3),
                    "visibility_check_tokens": visibility_tokens.total_tokens
                }
            }
            
        logging.debug(f"{display_name} confirmed image visibility.")

        # Step 2: Proceed with ASL recognition
        logging.debug(f"Sending ASL recognition request to {display_name}...")
        
        # Get appropriate prompt template
        prompt = PROMPT_TEMPLATES[prompt_strategy]
        
        # Prepare contents for ASL recognition
        asl_contents = [
            {
                "role": "user",
                "parts": [
                    types.Part.from_bytes(
                        mime_type=mime_type,
                        data=image_bytes
                    ),
                    {
                        "text": prompt
                    }
                ]
            }
        ]
        
        # Count tokens for ASL recognition
        asl_tokens = client.models.count_tokens(
            model=model_name,
            contents=asl_contents
        )
        
        # Make prediction
        response = client.models.generate_content(
            model=model_name,
            contents=asl_contents,
            config=types.GenerateContentConfig(
                temperature=0.05,
                top_p=1.0,
                max_output_tokens=300
            )
        )
        
        # Calculate total response time
        response_time = time.time() - start_time
        
        # Extract the response text
        # response.text can be None when the model returns no text parts
        response_text = (response.text or "").strip()
        logging.debug(f"Raw ASL response from {display_name}: {response_text}")
        
        # Try to parse the JSON response
        try:
            # Find the JSON object in the response
            json_start = response_text.find('{')
            json_end = response_text.rfind('}') + 1
            if json_start >= 0 and json_end > json_start:
                json_str = response_text[json_start:json_end]
                result = json.loads(json_str)
                
                # Add timing and token information to the result
                result["metadata"] = {
                    "response_time_seconds": round(response_time, 3),
                    "visibility_check_tokens": visibility_tokens.total_tokens,
                    "asl_recognition_tokens": asl_tokens.total_tokens,
                    "total_tokens": visibility_tokens.total_tokens + asl_tokens.total_tokens
                }
                return result
            else:
                logging.warning(f"No JSON object found in {display_name} ASL response for {image_path}")
                return {
                    "error": "No JSON found in response",
                    "metadata": {
                        "response_time_seconds": round(response_time, 3),
                        "visibility_check_tokens": visibility_tokens.total_tokens,
                        "asl_recognition_tokens": asl_tokens.total_tokens,
                        "total_tokens": visibility_tokens.total_tokens + asl_tokens.total_tokens
                    }
                }
        except json.JSONDecodeError:
            logging.warning(f"Failed to parse JSON from {display_name} ASL response for {image_path}")
            return {
                "error": "Invalid JSON response",
                "metadata": {
                    "response_time_seconds": round(response_time, 3),
                    "visibility_check_tokens": visibility_tokens.total_tokens,
                    "asl_recognition_tokens": asl_tokens.total_tokens,
                    "total_tokens": visibility_tokens.total_tokens + asl_tokens.total_tokens
                }
            }
            
    except Exception as e:
        response_time = time.time() - start_time
        logging.error(f"Error getting prediction from {display_name} for {image_path}: {e}")
        # Use the exception's `message` attribute when it has one (e.g. API errors); otherwise fall back to str(e)
        error_response = {
            "error": f"API Error: {e.message}" if hasattr(e, 'message') else str(e),
            "metadata": {
                "response_time_seconds": round(response_time, 3)
            }
        }
        return error_response
    finally:
        # Add a delay to respect rate limits
        logging.debug(f"Waiting {delay_seconds}s before next {display_name} request...")
        time.sleep(delay_seconds)
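
The prompts ask for a single uppercase letter and a confidence between 0 and 1, but the model may return the confidence as a number or as a string, or surround the JSON with extra text. A hedged sketch of how the parsed result might be validated before use follows. `parse_prediction` is a hypothetical helper, not part of the original script, and the validation rules are assumptions based on the prompt format.

```python
import json
from typing import Optional


def parse_prediction(response_text: str) -> Optional[dict]:
    """Extract the JSON object between the first '{' and last '}' and normalise it."""
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    if start < 0 or end <= start:
        return None
    try:
        data = json.loads(response_text[start:end])
    except json.JSONDecodeError:
        return None

    # Accept lowercase letters, but reject anything that is not exactly one of A-Z.
    letter = str(data.get("letter", "")).strip().upper()
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        return None

    # Confidence may arrive as 0.95 or "0.95"; anything unparseable becomes None.
    try:
        confidence = float(data.get("confidence"))
    except (TypeError, ValueError):
        confidence = None
    if confidence is not None:
        confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]

    return {
        "letter": letter,
        "confidence": confidence,
        "feedback": str(data.get("feedback", "")),
    }
```

Returning `None` for anything malformed keeps the caller's error handling in one place; whether out-of-range confidences should be clamped or rejected is a design choice left open here.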

## Test with Sample Image

Test the model with a sample image using different prompting strategies.

In [30]:
# Update this path to your image
base_path = Path("/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison")
image_path = base_path / "data/V/V_0_20250428_114109.jpg"

# Make sure the path exists
if not image_path.exists():
    print(f"Image not found: {image_path}")
    # Try to find any image in the dataset
    data_dir = base_path / "data"
    if data_dir.exists():
        for letter_dir in data_dir.glob("*"):
            if letter_dir.is_dir():
                for img_file in letter_dir.glob("*.jpg"):
                    image_path = img_file
                    print(f"Using alternative image: {image_path}")
                    break
            if image_path.exists():
                break

In [32]:
# Test with a specific strategy
def test_single_strategy(strategy="zero_shot"):
    """Test the model with a single prompting strategy."""
    print(f"\nTesting with {strategy} strategy...")
    config = MODEL_CONFIGS["flash-lite"]
    print(f"Using {config['display_name']} on image: {image_path}")
    
    result = get_asl_prediction(str(image_path), strategy)
    print(f"\nResult:\n{json.dumps(result, indent=2)}")
    
    return result

In [33]:
# Run the test with the zero_shot strategy
result = test_single_strategy("zero_shot")

# To test all strategies, define test_all_strategies() first (it is not defined
# in this notebook), then uncomment the line below
# all_results = test_all_strategies()


Testing with zero_shot strategy...
Using Gemini 2.0 Flash-Lite on image: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison/data/V/V_0_20250428_114109.jpg


2025-05-13 11:50:40,084 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-lite:countTokens "HTTP/1.1 200 OK"
2025-05-13 11:50:40,086 - INFO - AFC is enabled with max remote calls: 10.
2025-05-13 11:50:41,015 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-lite:generateContent "HTTP/1.1 200 OK"
2025-05-13 11:50:41,017 - INFO - AFC remote call 1 is done.
2025-05-13 11:50:41,077 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-lite:countTokens "HTTP/1.1 200 OK"
2025-05-13 11:50:41,078 - INFO - AFC is enabled with max remote calls: 10.
2025-05-13 11:50:42,060 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-lite:generateContent "HTTP/1.1 200 OK"
2025-05-13 11:50:42,062 - INFO - AFC remote call 1 is done.



Result:
{
  "letter": "V",
  "confidence": 0.95,
  "feedback": "The index and middle fingers are extended and separated, while the other fingers are closed, forming the letter V in ASL.",
  "metadata": {
    "response_time_seconds": 2.133,
    "visibility_check_tokens": 280,
    "asl_recognition_tokens": 366,
    "total_tokens": 646
  }
}
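
The commented-out call to `test_all_strategies()` refers to a helper that is not defined in this notebook. A minimal sketch of what such a helper might look like, assuming it simply runs one prediction per prompt strategy and collects the results. The name `run_all_strategies` and its signature (a `predict` callable plus an image path) are assumptions made so the example is self-contained.

```python
from typing import Callable, Dict, Iterable

STRATEGIES = ("zero_shot", "few_shot", "chain_of_thought", "visual_grounding", "contrastive")


def run_all_strategies(
    predict: Callable[[str, str], dict],
    image_path: str,
    strategies: Iterable[str] = STRATEGIES,
) -> Dict[str, dict]:
    """Run `predict(image_path, strategy)` for each strategy and collect the results."""
    results = {}
    for strategy in strategies:
        result = predict(image_path, strategy)
        results[strategy] = result
        # Failed calls carry an "error" key instead of a "letter" key.
        summary = result.get("letter", result.get("error"))
        print(f"{strategy}: {summary}")
    return results
```

In this notebook it could be called as `run_all_strategies(get_asl_prediction, str(image_path))`. Because each prediction ends with the 3-second rate-limit delay, five strategies would take at least 15 seconds.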
