# Llama 3.2 90B Vision ASL Recognition

This notebook demonstrates how to use the Llama 3.2 90B Vision model for ASL letter recognition via WatsonX. The code includes:

- Setting up the WatsonX API client for Llama 3.2 90B Vision
- Processing images for analysis
- Implementing different prompting strategies
- Handling model responses and extracting predictions

### Note: This notebook requires WatsonX API credentials in the backend/.env file.
### Please refer to test_llama_90b_vision.py for the script that was actually used.

In [1]:
import os
import json
import logging
import base64
import requests
from pathlib import Path
from dotenv import load_dotenv
from PIL import Image
import io
import time
import re
from typing import Dict, Any, List
from tenacity import retry, stop_after_attempt, wait_exponential
import argparse

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## Load Environment Variables and Initialize Client

Load the WatsonX API credentials from the .env file and fail early if any are missing.

In [2]:
# Load environment variables from backend/.env
# Use an absolute path to the .env file
dotenv_path = "/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env"
load_dotenv(dotenv_path=dotenv_path)
print(f"Loading .env file from: {dotenv_path}")

# Get WatsonX credentials
WATSONX_API_KEY = os.getenv("WATSONX_API_KEY")
WATSONX_PROJECT_ID = os.getenv("WATSONX_PROJECT_ID")
WATSONX_URL = os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com")

print(f"API key found: {'Yes' if WATSONX_API_KEY else 'No'}")
print(f"Project ID found: {'Yes' if WATSONX_PROJECT_ID else 'No'}")

if not all([WATSONX_API_KEY, WATSONX_PROJECT_ID]):
    raise ValueError("WatsonX credentials not found in environment variables. Please check your .env file.")

# Model ID
MODEL_ID = "meta-llama/llama-3-2-90b-vision-instruct"

Loading .env file from: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env
API key found: Yes
Project ID found: Yes
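
The cell above hard-codes an absolute path, which only works on one machine. As a minimal sketch of an alternative, the helper below walks upward from a starting directory until it finds `backend/.env`. The function name `find_env_file` is hypothetical, and the sketch assumes the notebook runs somewhere inside the project tree.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, relative: str = "backend/.env") -> Optional[Path]:
    """Walk upward from `start` and return the first `<ancestor>/<relative>` that exists."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / relative
        if candidate.is_file():
            return candidate
    return None  # nothing found anywhere up the tree


# Possible usage in the notebook (assumption: cwd is inside the project):
# dotenv_path = find_env_file(Path.cwd())
# load_dotenv(dotenv_path=dotenv_path)
```

If the helper returns `None`, the existing credential check would still raise the same `ValueError`, so the failure mode stays unchanged.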


## Prompt Templates

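Define one prompt template per strategy: zero-shot, few-shot, chain-of-thought, visual grounding, and contrastive. Each template asks the model to answer with a JSON object containing letter, confidence, and feedback.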

In [3]:
# Prompt templates
PROMPT_TEMPLATES = {
"zero_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Respond only with a valid JSON object, using this format:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "A short explanation of how the gesture maps to the predicted letter"
}
Be precise and avoid adding anything outside the JSON response.""",

"few_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Here are some known ASL hand signs:
- A: Fist with thumb resting on the side
- B: Flat open hand, fingers extended upward, thumb across the palm
- C: Hand curved into the shape of the letter C
- D: Index finger up, thumb touching middle finger forming an oval
- E: Fingers bent, thumb tucked under

Respond only with a JSON object like this:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Why this gesture matches the predicted letter"
}
Only return the JSON object. No explanations before or after.""",

"chain_of_thought": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image step-by-step to identify the ASL letter (A-Z).

1. Describe the hand shape
2. Describe the finger and thumb positions
3. Compare these to known ASL letter signs
4. Identify the most likely letter

Then output your answer as JSON:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Summarize your reasoning in one sentence"
}
Return only the JSON object with no extra text.""",

"visual_grounding": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image of a hand gesture and determine which ASL letter (A–Z) it represents.

To guide your analysis, consider the following:
- Which fingers are extended or bent?
- Is the thumb visible, and where is it positioned?
- What is the orientation of the palm (facing forward, sideways, etc.)?
- Are there any unique shapes formed (e.g., circles, fists, curves)?

Now, based on this visual inspection, provide your prediction in the following JSON format:

{
  "letter": "predicted letter (A-Z)",
  "confidence": "confidence score (0–1)",
  "feedback": "brief explanation describing the observed hand shape and reasoning"
}

Be precise, use visual clues from the image, and avoid guessing without justification.""",

"contrastive": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image of a hand gesture and identify the correct ASL letter.

Consider the following candidate letters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
(These letters are visually similar and often confused.)

Step-by-step:
1. Observe the hand shape, finger positions, and thumb placement.
2. Compare the observed gesture against the typical signs for each candidate letter.
3. Eliminate unlikely candidates based on visible differences.
4. Choose the most plausible letter and explain your reasoning.

Format your response as JSON:

{
  "letter": "predicted letter from candidates",
  "confidence": "confidence score (0–1)",
  "feedback": "why this letter was selected over the others"
}

Be analytical and compare carefully to avoid misclassification."""
}
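
Because every template embeds a JSON example that the model is expected to imitate, it may be worth checking that those examples are themselves well-formed. The following is a minimal sketch, not part of the original script: `malformed_templates` is a hypothetical helper, and the `demo` dictionary is illustrative data rather than the real templates. It locates the outermost braces the same way the response parser in this notebook does.

```python
import json
from typing import Dict, List


def malformed_templates(templates: Dict[str, str]) -> List[str]:
    """Return the names of templates whose embedded JSON example does not parse."""
    bad = []
    for name, text in templates.items():
        start, end = text.find("{"), text.rfind("}") + 1
        if start < 0 or end <= start:
            bad.append(name)  # no JSON example at all
            continue
        try:
            json.loads(text[start:end])
        except json.JSONDecodeError:
            bad.append(name)  # example is present but not valid JSON
    return bad


# Tiny illustrative templates (hypothetical, not the ones defined above)
demo = {
    "ok": 'Reply as JSON:\n{\n  "letter": "A-Z",\n  "confidence": "0-1"\n}\nNo extra text.',
    "broken": 'Reply as JSON:\n{\n  "letter": "A-Z",\n  "confidence": "0-1\n}\nNo extra text.',
}
print(malformed_templates(demo))  # → ['broken']
```

In the notebook, the same check could be run as `malformed_templates(PROMPT_TEMPLATES)` right after the templates are defined.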

## API and Authentication Functions

Define functions to handle WatsonX API authentication and requests with retry logic.

In [4]:
def estimate_tokens(text: str) -> int:
    """Estimate the number of tokens in a text string (1 token ≈ 4 characters)."""
    return len(text) // 4

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def get_watsonx_token(api_key: str) -> str:
    """Get a token for WatsonX API authentication with retry logic."""
    auth_url = "https://iam.cloud.ibm.com/identity/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    data = {"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": api_key}
    
    try:
        # A timeout keeps a stalled connection from blocking the retry logic
        response = requests.post(auth_url, headers=headers, data=data, timeout=30)
        response.raise_for_status()
        return response.json().get("access_token")
    except Exception as e:
        logging.error(f"Error getting WatsonX token: {e}")
        raise

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def make_api_request(token: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Make API request with retry logic."""
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"
    }
    
    response = requests.post(
        f"{WATSONX_URL}/ml/v1/text/chat?version=2023-05-29",
        headers=headers,
        json=payload,
        timeout=60
    )
    response.raise_for_status()
    return response.json()
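
The functions above request a fresh token for every prediction. When many images are processed, reusing the token until shortly before it expires would avoid repeated round trips. The sketch below illustrates one way to do that. `TokenCache` is a hypothetical class, and it assumes the fetch callable returns both the token and its lifetime in seconds.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse a bearer token until shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        margin: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch      # returns (token, lifetime_in_seconds)
        self._margin = margin    # refresh this many seconds early
        self._clock = clock      # injectable so the cache can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = now + lifetime
        return self._token
```

In this notebook, `fetch` could be a variant of `get_watsonx_token` that returns both the `access_token` and `expires_in` fields of the IAM response. That the response carries `expires_in` is an assumption here and should be checked against the IAM documentation.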

## Image Processing Functions

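Define a function that prepares an image for the WatsonX API: convert it to RGB, downscale it to fit within 1024x1024 pixels, and encode it as a base64 JPEG string.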

In [5]:
def encode_image_base64(image_path: str) -> str:
    """Encode image to base64 string."""
    try:
        with Image.open(image_path) as img:
            # Convert to RGB if necessary
            if img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Resize image if needed
            max_size = (1024, 1024)
            img.thumbnail(max_size, Image.Resampling.LANCZOS)
            
            # Save to bytes
            buffer = io.BytesIO()
            img.save(buffer, format="JPEG", quality=95)
            return base64.b64encode(buffer.getvalue()).decode('utf-8')
    except Exception as e:
        logging.error(f"Error encoding image: {e}")
        raise
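
Base64 encodes every 3 bytes as 4 characters, so the encoded string is about one third larger than the JPEG itself. The sketch below works through that arithmetic and wraps the bytes in the data URL form that the chat payload uses. Both helper names are hypothetical, and the sample bytes are stand-in data, not a real image.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes."""
    return 4 * math.ceil(n_bytes / 3)


def to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in a base64 data URL."""
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("utf-8")


sample = bytes(range(256)) * 4  # 1024 stand-in bytes, not a real JPEG
prefix = "data:image/jpeg;base64,"
url = to_data_url(sample)

# The encoded part of the URL has exactly the predicted length
assert len(url) - len(prefix) == base64_length(len(sample))
print(base64_length(len(sample)))  # → 1368
```

Read in reverse, a logged base64 size of N characters corresponds to roughly 3N/4 bytes of JPEG data, which is a quick way to judge how large the request body will be.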

## ASL Prediction Function

Define the main function for getting ASL predictions from the Llama 90B Vision model.

In [6]:
def get_asl_prediction(image_path: str, strategy: str = "zero_shot") -> Dict[str, Any]:
    """Get ASL prediction from Llama 90B Vision model with specified prompting strategy."""
    start_time = time.time()
    
    try:
        # Get authentication token
        token = get_watsonx_token(WATSONX_API_KEY)
        
        # Encode image
        image_base64 = encode_image_base64(image_path)
        logging.info(f"Image encoded successfully. Base64 size: {len(image_base64)} characters")
        
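        # Get appropriate prompt template; unknown strategies fall back to zero_shot,
        # and the metadata then records the strategy that was actually used
        if strategy not in PROMPT_TEMPLATES:
            logging.warning(f"Unknown strategy '{strategy}', falling back to zero_shot")
            strategy = "zero_shot"
        prompt = PROMPT_TEMPLATES[strategy]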
        
        # Create the message with image and prompt
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}"
                        }
                    }
                ]
            }
        ]
        
        payload = {
            "model_id": MODEL_ID,
            "project_id": WATSONX_PROJECT_ID,
            "messages": messages,
            "temperature": 0.05,
            "top_p": 1.0,
            "max_tokens": 300
        }
        
        # Make the API request
        result = make_api_request(token, payload)
        
        # Extract the generated text
        generated_text = result.get("choices", [{}])[0].get("message", {}).get("content", "")
        logging.info(f"Generated text: {generated_text}")
        
        # Calculate response time and token estimates (text only; image tokens are not counted)
        response_time = time.time() - start_time
        prompt_tokens = estimate_tokens(prompt)
        response_tokens = estimate_tokens(generated_text)
        total_tokens = prompt_tokens + response_tokens
        
        # Try to parse the JSON response
        try:
            # Find the JSON object in the response
            json_start = generated_text.find('{')
            json_end = generated_text.rfind('}') + 1
            if json_start >= 0 and json_end > json_start:
                json_str = generated_text[json_start:json_end]
                prediction = json.loads(json_str)
                
                # Add metadata to the response
                final_result = {
                    "prediction": prediction,
                    "metadata": {
                        "model": "llama-90b-vision",
                        "strategy": strategy,
                        "response_time": round(response_time, 3),
                        "prompt_tokens": prompt_tokens,
                        "response_tokens": response_tokens,
                        "total_tokens": total_tokens
                    }
                }
                return final_result
            else:
                raise ValueError("No JSON object found in response")
        except json.JSONDecodeError as e:
            logging.error(f"Error parsing JSON response: {e}")
            return {
                "error": f"Invalid JSON response: {str(e)}",
                "metadata": {
                    "response_time": round(response_time, 3),
                    "prompt_tokens": prompt_tokens,
                    "response_tokens": response_tokens,
                    "total_tokens": total_tokens
                }
            }
            
    except Exception as e:
        response_time = time.time() - start_time
        logging.error(f"Error in ASL prediction: {e}")
        return {
            "error": str(e),
            "metadata": {
                "response_time": round(response_time, 3)
            }
        }
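
The function returns whatever JSON the model produced, so the letter field may vary in case or whitespace, and confidence may arrive as a number or as a numeric string. A minimal sketch of a post-processing step is shown below. `normalize_prediction` is a hypothetical helper, not part of the original script, and it assumes the three field names requested in the prompt templates.

```python
from typing import Any, Dict


def normalize_prediction(prediction: Dict[str, Any]) -> Dict[str, Any]:
    """Validate and coerce the fields of a parsed model response."""
    letter = str(prediction.get("letter", "")).strip().upper()
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError(f"Unexpected letter value: {prediction.get('letter')!r}")

    # The model may return confidence as a number or as a numeric string
    try:
        confidence = float(prediction.get("confidence", 0.0))
    except (TypeError, ValueError):
        raise ValueError(f"Unexpected confidence value: {prediction.get('confidence')!r}")
    confidence = min(max(confidence, 0.0), 1.0)  # clamp into [0, 1]

    return {
        "letter": letter,
        "confidence": confidence,
        "feedback": str(prediction.get("feedback", "")),
    }
```

It could be applied to the `prediction` entry of a successful result before the results are compared across strategies.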

## Test with Sample Image

Test the model with a sample image using different prompting strategies.

In [7]:
# Update this path to your image
base_path = Path("/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison")
image_path = base_path / "data/V/V_0_20250428_114109.jpg"

# Make sure the path exists; otherwise fall back to the first .jpg in the dataset
if not image_path.exists():
    print(f"Image not found: {image_path}")
    data_dir = base_path / "data"
    if data_dir.exists():
        # Search every letter directory (data/<letter>/*.jpg) for any image
        fallback = next(data_dir.glob("*/*.jpg"), None)
        if fallback is not None:
            image_path = fallback
            print(f"Using alternative image: {image_path}")

In [8]:
# Test with a specific strategy
def test_single_strategy(strategy="zero_shot"):
    """Test the model with a single prompting strategy."""
    print(f"\nTesting with {strategy} strategy...")
    print(f"Using Llama 3.2 90B Vision on image: {image_path}")
    
    result = get_asl_prediction(str(image_path), strategy)
    print(f"\nResult:\n{json.dumps(result, indent=2)}")
    
    return result

In [9]:
# Run the test with the zero_shot strategy
result = test_single_strategy("zero_shot")

# Uncomment to test all strategies
# def test_all_strategies():
#     """Test the model with all prompting strategies."""
#     results = {}
#     
#     for strategy in PROMPT_TEMPLATES.keys():
#         print(f"\nTesting with {strategy} strategy...")
#         result = get_asl_prediction(str(image_path), strategy)
#         results[strategy] = result
#         
#         print(f"Result:\n{json.dumps(result, indent=2)}")
#         
#     return results
# 
# all_results = test_all_strategies()


Testing with zero_shot strategy...
Using Llama 3.2 90B Vision on image: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison/data/V/V_0_20250428_114109.jpg


2025-05-13 12:04:33,588 - INFO - Image encoded successfully. Base64 size: 146312 characters
2025-05-13 12:04:37,156 - INFO - Generated text: {
  "letter": "V",
  "confidence": 0.9,
  "feedback": "The handshape and orientation of the fingers match the ASL sign for the letter 'V'. The index and middle fingers are extended and separated, while the other fingers are closed."
}



Result:
{
  "prediction": {
    "letter": "V",
    "confidence": 0.9,
    "feedback": "The handshape and orientation of the fingers match the ASL sign for the letter 'V'. The index and middle fingers are extended and separated, while the other fingers are closed."
  },
  "metadata": {
    "model": "llama-90b-vision",
    "strategy": "zero_shot",
    "response_time": 4.242,
    "prompt_tokens": 109,
    "response_tokens": 58,
    "total_tokens": 167
  }
}
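
A single image only shows that the pipeline works. To compare strategies, the results would need to be scored over many images. The sketch below is one hedged way to do that. It assumes the ground-truth letter is the name of the image's parent directory (as in `data/V/V_0_20250428_114109.jpg`), and `true_label` and `accuracy_by_strategy` are hypothetical helpers. Results that contain an error instead of a prediction count as incorrect.

```python
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, Iterable, Tuple


def true_label(image_path: str) -> str:
    """Ground-truth letter, taken from the parent directory name (assumed layout)."""
    return Path(image_path).parent.name.upper()


def accuracy_by_strategy(
    records: Iterable[Tuple[str, str, Dict[str, Any]]]
) -> Dict[str, float]:
    """records: (image_path, strategy, result) triples; result as returned by get_asl_prediction."""
    correct, total = defaultdict(int), defaultdict(int)
    for image_path, strategy, result in records:
        total[strategy] += 1
        predicted = str(result.get("prediction", {}).get("letter", "")).upper()
        if predicted == true_label(image_path):
            correct[strategy] += 1
    return {strategy: correct[strategy] / total[strategy] for strategy in total}


# Illustrative records (made-up results, not measured data)
records = [
    ("data/V/V_0.jpg", "zero_shot", {"prediction": {"letter": "V"}}),
    ("data/A/A_0.jpg", "zero_shot", {"prediction": {"letter": "S"}}),
    ("data/B/B_1.jpg", "few_shot", {"error": "Invalid JSON response"}),
]
print(accuracy_by_strategy(records))  # → {'zero_shot': 0.5, 'few_shot': 0.0}
```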
