# Test Pixtral 12B for ASL Recognition

This notebook tests the Pixtral 12B model for American Sign Language (ASL) letter recognition on the IBM WatsonX platform.

The notebook includes:
- Setting up environment and authentication
- Image processing and encoding
- Multiple prompting strategies
- API request handling with retries
- Result parsing and evaluation

### Note: This notebook requires WatsonX API credentials in the backend/.env file
### Please refer to test_llama_90b_vision.py for the script that was actually implemented

In [1]:
import os
import json
import logging
import base64
import requests
from pathlib import Path
from dotenv import load_dotenv
from PIL import Image
import io
import time
import argparse
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import Dict, Any, Literal

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## Load Environment Variables

Load the WatsonX API credentials from the environment variables file.

In [2]:
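# Load environment variables from backend/.env.
# Notebooks do not define __file__, so the string literal "__file__" is resolved against the
# current working directory; the two dirname() calls then yield the parent of that directory.
dotenv_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath("__file__"))), 'backend', '.env')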
load_dotenv(dotenv_path=dotenv_path)
print(f"Loading .env file from: {dotenv_path}")

# Get WatsonX credentials
WATSONX_API_KEY = os.getenv("WATSONX_API_KEY")
WATSONX_PROJECT_ID = os.getenv("WATSONX_PROJECT_ID")
WATSONX_URL = os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com")

if not all([WATSONX_API_KEY, WATSONX_PROJECT_ID]):
    raise ValueError("WatsonX credentials not found in environment variables. Please check your .env file.")

# Model ID
MODEL_ID = "mistralai/pixtral-12b"

Loading .env file from: /Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/backend/.env


## Define Prompt Templates

Five prompting strategies are defined for the model: zero-shot, few-shot, chain-of-thought, visual grounding, and contrastive.

In [3]:
# Prompt templates
PROMPT_TEMPLATES = {
"zero_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Respond only with a valid JSON object, using this format:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "A short explanation of how the gesture maps to the predicted letter"
}
Be precise and avoid adding anything outside the JSON response.""",

"few_shot": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image and identify the ASL letter being signed (A-Z).

Here are some known ASL hand signs:
- A: Fist with thumb resting on the side
- B: Flat open hand, fingers extended upward, thumb across the palm
- C: Hand curved into the shape of the letter C
- D: Index finger up, thumb touching middle finger forming an oval
- E: Fingers bent, thumb tucked under

Respond only with a JSON object like this:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Why this gesture matches the predicted letter"
}
Only return the JSON object. No explanations before or after.""",

"chain_of_thought": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image step-by-step to identify the ASL letter (A-Z).

1. Describe the hand shape
2. Describe the finger and thumb positions
3. Compare these to known ASL letter signs
4. Identify the most likely letter

Then output your answer as JSON:
{
  "letter": "A single uppercase letter (A-Z)",
  "confidence": "confidence score (0-1)",
  "feedback": "Summarize your reasoning in one sentence"
}
Return only the JSON object with no extra text.""",

"visual_grounding": """You are an expert in American Sign Language (ASL) recognition. Carefully analyze the provided image of a hand gesture and determine which ASL letter (A–Z) it represents.

To guide your analysis, consider the following:
- Which fingers are extended or bent?
- Is the thumb visible, and where is it positioned?
- What is the orientation of the palm (facing forward, sideways, etc.)?
- Are there any unique shapes formed (e.g., circles, fists, curves)?

Now, based on this visual inspection, provide your prediction in the following JSON format:

{
  "letter": "predicted letter (A-Z)",
  "confidence": "confidence score (0–1)",
  "feedback": "brief explanation describing the observed hand shape and reasoning"
}

Be precise, use visual clues from the image, and avoid guessing without justification.""",

"contrastive": """You are an expert in American Sign Language (ASL) recognition. Analyze the provided image of a hand gesture and identify the correct ASL letter.

Consider the following candidate letters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
(These letters are visually similar and often confused.)

Step-by-step:
1. Observe the hand shape, finger positions, and thumb placement.
2. Compare the observed gesture against the typical signs for each candidate letter.
3. Eliminate unlikely candidates based on visible differences.
4. Choose the most plausible letter and explain your reasoning.

Format your response as JSON:

{
  "letter": "predicted letter from candidates",
  "confidence": "confidence score (0–1)",
  "feedback": "why this letter was selected over the others"
}

Be analytical and compare carefully to avoid misclassification."""
}
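
To compare the strategies above on a single image, each key of `PROMPT_TEMPLATES` can be passed in turn to a prediction function. The sketch below is an illustration only and is not part of the original notebook: `run_strategies` and its `predict_fn` parameter are hypothetical names, and `predict_fn` is assumed to take `(image_path, strategy)` like the `get_asl_prediction` function defined later in this notebook.

```python
from typing import Any, Callable, Dict, Iterable


def run_strategies(
    image_path: str,
    strategies: Iterable[str],
    predict_fn: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Call predict_fn once per strategy and collect the results by strategy name.

    predict_fn is a hypothetical stand-in for a function with the signature
    (image_path, strategy) -> result dict.
    """
    results: Dict[str, Dict[str, Any]] = {}
    for strategy in strategies:
        results[strategy] = predict_fn(image_path, strategy)
    return results
```

Once the prediction function exists, a call such as `run_strategies(image_path, PROMPT_TEMPLATES.keys(), get_asl_prediction)` would issue one request pair per strategy.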

## Utility Functions

Helper functions for token estimation, authentication, image encoding, and API requests.

In [4]:
def estimate_tokens(text: str) -> int:
    """Estimate the number of tokens in a text string (1 token ≈ 4 characters)."""
    return len(text) // 4

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def get_watsonx_token(api_key: str) -> str:
    """Get a token for WatsonX API authentication with retry logic."""
    auth_url = "https://iam.cloud.ibm.com/identity/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    data = {"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": api_key}
    
    try:
        response = requests.post(auth_url, headers=headers, data=data, timeout=15)
        response.raise_for_status()
        token_data = response.json()
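        # Index directly so a response without a token raises (and is retried)
        # instead of silently returning None.
        return token_data["access_token"]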
    except Exception as e:
        logging.error(f"Error getting token: {e}")
        raise

def encode_image_base64(image_path: str) -> str:
    """Reads and returns the base64 encoded string of an image."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except Exception as e:
        logging.error(f"Error encoding image: {e}")
        raise

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def make_api_request(token: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Make API request with retry logic."""
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"
    }
    
    try:
        response = requests.post(
            f"{WATSONX_URL}/ml/v1/text/chat?version=2023-05-29",
            headers=headers,
            json=payload,
            timeout=60
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as e:
        logging.error(f"HTTP error response: {e.response.status_code} - {e.response.text}")
        raise
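
The request payloads in the next section embed the image as a `data:image/jpeg;base64,...` URL, which assumes every input file is a JPEG. If other formats are tested, the MIME type could be guessed from the file extension instead. The following is a minimal sketch under that assumption; `image_to_data_url` and its `default_mime` parameter are hypothetical names that do not appear in the original notebook.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Encode an image file as a data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    # Fall back to the default when the extension is unknown or not an image type.
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = default_mime
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be used directly as the `url` value of an `image_url` content item.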

## Main Prediction Function

Function to get ASL predictions from the Pixtral 12B model.

In [5]:
def get_asl_prediction(image_path: str, strategy: str = "zero_shot") -> Dict[str, Any]:
    """Get ASL prediction from Pixtral model with specified prompting strategy."""
    start_time = time.time()
    
    try:
        # Get authentication token
        token = get_watsonx_token(WATSONX_API_KEY)
        
        # Encode image
        image_base64 = encode_image_base64(image_path)
        logging.info(f"Image encoded successfully. Base64 size: {len(image_base64)} characters")
        
        # First, test if the model can see the image
        visibility_test_payload = {
            "model_id": MODEL_ID,
            "project_id": WATSONX_PROJECT_ID,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Look at the image I provided. Can you see a hand gesture? If yes, describe what you see. If no, say 'No, I cannot see the image'."
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_base64}"
                            }
                        }
                    ]
                }
            ],
            "temperature": 0.05,
            "top_p": 1.0,
            "max_tokens": 300
        }
        
        # Make the visibility test API request
        logging.info("Testing image visibility...")
        visibility_result = make_api_request(token, visibility_test_payload)
        
        # Extract the visibility response
        visibility_text = visibility_result.get("choices", [{}])[0].get("message", {}).get("content", "")
        logging.info(f"Visibility test response: {visibility_text}")
        
        if "no, i cannot see the image" in visibility_text.lower():
            raise ValueError("Model cannot see the image")
        
        # If visibility test passed, proceed with the ASL sign recognition
        logging.info("Proceeding with ASL sign recognition...")
        
        # Select prompt based on strategy
        prompt = PROMPT_TEMPLATES.get(strategy, PROMPT_TEMPLATES["zero_shot"])
        
        # Create the message with image and prompt
        asl_payload = {
            "model_id": MODEL_ID,
            "project_id": WATSONX_PROJECT_ID,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_base64}"
                            }
                        }
                    ]
                }
            ],
            "temperature": 0.05,
            "top_p": 1.0,
            "max_tokens": 300
        }
        
        # Make the API request
        result = make_api_request(token, asl_payload)
        
        # Extract the generated text
        generated_text = result.get("choices", [{}])[0].get("message", {}).get("content", "")
        logging.info(f"Generated text: {generated_text}")
        
        # Calculate response time and rough text-only token estimates
        # (the image itself is not counted, and the time includes the visibility test)
        response_time = time.time() - start_time
        prompt_tokens = estimate_tokens(prompt)
        response_tokens = estimate_tokens(generated_text)
        total_tokens = prompt_tokens + response_tokens
        
        # Try to parse the JSON response
        try:
            # Find the JSON object in the response
            json_start = generated_text.find('{')
            json_end = generated_text.rfind('}') + 1
            if json_start >= 0 and json_end > json_start:
                json_str = generated_text[json_start:json_end]
                prediction = json.loads(json_str)
                
                # Add metadata to the response
                final_result = {
                    "prediction": prediction,
                    "metadata": {
                        "model": "pixtral-12b",
                        "strategy": strategy,
                        "response_time": round(response_time, 3),
                        "tokens": {
                            "prompt": prompt_tokens,
                            "response": response_tokens,
                            "total": total_tokens
                        }
                    }
                }
                return final_result
            else:
                raise ValueError("No JSON object found in response")
        except json.JSONDecodeError as e:
            raise ValueError(f"Failed to parse JSON from response: {e}")
            
    except Exception as e:
        logging.error(f"Error in ASL prediction: {e}")
        return {
            "error": str(e),
            "metadata": {
                "model": "pixtral-12b",
                "strategy": strategy,
                "response_time": round(time.time() - start_time, 3)
            }
        }
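
The prompts ask for `letter` and `confidence` in a fixed format, but the function above returns whatever JSON the model produced; in the sample run later in this notebook the reply is wrapped in a Markdown code fence and the confidence arrives as the string `"0.95"`. A validation step could normalize these fields before evaluation. The sketch below is one possible approach, not the notebook's implementation; `parse_prediction` is a hypothetical helper name.

```python
import json
import re
from typing import Any, Dict


def parse_prediction(generated_text: str) -> Dict[str, Any]:
    """Extract the JSON object from a model reply and normalize its fields."""
    # Same extraction approach as get_asl_prediction: first '{' to last '}'.
    start = generated_text.find("{")
    end = generated_text.rfind("}") + 1
    if start < 0 or end <= start:
        raise ValueError("No JSON object found in response")
    prediction = json.loads(generated_text[start:end])

    # The letter must be exactly one character in A-Z.
    letter = str(prediction.get("letter", "")).strip().upper()
    if not re.fullmatch(r"[A-Z]", letter):
        raise ValueError(f"Unexpected letter value: {letter!r}")

    # The confidence may be a number or a numeric string; coerce it to float.
    try:
        confidence = float(prediction["confidence"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("Missing or non-numeric confidence") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"Confidence out of range: {confidence}")

    return {
        "letter": letter,
        "confidence": confidence,
        "feedback": str(prediction.get("feedback", "")),
    }
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, callers only need to handle `ValueError` for every failure mode of this helper.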

## Test the Model

Run a test prediction with the model on a sample image.

In [6]:
# Path to the test image
image_path = "/Users/henrikjacobsen/Desktop/CBS/Semester 2/Artifical Intelligence and Machine Learning/Final Project/AIML25-shared/model_comparison/data/V/V_17_20250428_114126.jpg"
strategy = "zero_shot"

# Check if the image exists
if not os.path.exists(image_path):
    print(f"Image file not found: {image_path}")
else:
    # Run prediction
    print(f"Testing Pixtral 12B model with {strategy} strategy...")
    result = get_asl_prediction(image_path, strategy)
    print(f"\nPixtral 12B {strategy} result:")
    print(json.dumps(result, indent=2))
    
    # Save results
    timestamp = int(time.time())
    results_file = f"pixtral_12b_{strategy}_result_{timestamp}.json"
    with open(results_file, 'w') as f:
        json.dump(result, f, indent=2)
    print(f"\nResult saved to {results_file}")

Testing Pixtral 12B model with zero_shot strategy...


2025-05-13 12:14:08,075 - INFO - Image encoded successfully. Base64 size: 196332 characters
2025-05-13 12:14:08,075 - INFO - Testing image visibility...
2025-05-13 12:14:10,123 - INFO - Visibility test response: Yes, I can see the image. The person in the image is making a hand gesture with their right hand, forming a "V" sign with their index and middle fingers.
2025-05-13 12:14:10,124 - INFO - Proceeding with ASL sign recognition...
2025-05-13 12:14:11,528 - INFO - Generated text: ```json
{
  "letter": "F",
  "confidence": "0.95",
  "feedback": "The hand position with the index finger and thumb forming a 'V' shape is indicative of the ASL letter 'F'."
}
```



Pixtral 12B zero_shot result:
{
  "prediction": {
    "letter": "F",
    "confidence": "0.95",
    "feedback": "The hand position with the index finger and thumb forming a 'V' shape is indicative of the ASL letter 'F'."
  },
  "metadata": {
    "model": "pixtral-12b",
    "strategy": "zero_shot",
    "response_time": 3.91,
    "tokens": {
      "prompt": 109,
      "response": 44,
      "total": 153
    }
  }
}

Result saved to pixtral_12b_zero_shot_result_1747131251.json
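
The run above can be scored against a ground-truth label. The sample image sits in a directory named `V` and the visibility test describes a "V" sign, while the JSON prediction is `F`. Assuming the parent directory name encodes the true letter (an assumption about the dataset layout, not something the notebook states), this result would count as a misclassification. The helper names `expected_letter` and `is_correct` below are hypothetical.

```python
from pathlib import Path
from typing import Any, Dict


def expected_letter(image_path: str) -> str:
    """Return the label implied by the parent directory name.

    Assumes a layout like data/<LETTER>/<file>.jpg.
    """
    return Path(image_path).parent.name.upper()


def is_correct(result: Dict[str, Any], image_path: str) -> bool:
    """Compare the predicted letter in a result dict with the directory label."""
    # Error results have no "prediction" key and are treated as incorrect.
    predicted = str(result.get("prediction", {}).get("letter", "")).strip().upper()
    return predicted != "" and predicted == expected_letter(image_path)
```

Applied to the saved result, `is_correct(result, image_path)` would report whether the prediction matches the folder label, which makes it straightforward to aggregate accuracy over many images.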
