# Ollama Model Performance - Sentiment Analysis

## Objective
Test and compare the performance and accuracy of configurable Ollama models for sentiment analysis using the mental health dataset.

## Checklist
- Load and validate the mental health dataset with proper handling of missing values
- Configure Ollama models, sample sizes, and test parameters
- Query Ollama models for sentiment predictions on sampled data
- Calculate accuracy, precision, recall, and F1 scores for each model
- Record results in an append-only JSON structure with timestamps
- Display server specifications and comparative performance metrics

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import requests
import time
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Configuration Parameters

In [2]:
# Configurable Parameters
CONFIG = {
    "ollama_url": "http://localhost:11434/api/chat",
    "models_to_test": ["pilardi/sentiment-analysis:gemma3", "pilardi/sentiment-analysis:llama3", "pilardi/sentiment-analysis:phi3", "pilardi/sentiment-analysis:qwen3.4b"],  # List of Ollama models
    # "models_to_test": ["r1-1776:latest", "deepseek-r1:70b", "llama3-gradient:70b"],  # List of Ollama models
    # "models_to_test": ["mvkvl/sentiments:qwen2", "mvkvl/sentiments:phi3", "mvkvl/sentiments:mistral", "mvkvl/sentiments:llama3", "mvkvl/sentiments:aya", "mvkvl/sentiments:gemma"],  # List of Ollama models
    "dataset_path": "./datasets/Mental Health Dataset.csv",
    "sample_size": 100,  # Number of records to sample for testing
    "random_seed": 42,  # For reproducibility
    "batch_size": 10,  # Batch size for predictions
    "test_entire_dataset": False,  # Set to True to test entire dataset
    "results_file": "results.json",  # File to store test results
    "timeout_seconds": 30,  # Timeout for API calls
    "store_sample_predictions": True,  # Store sample predictions for debugging
    "sample_predictions_count": 5,  # Number of sample predictions to store
    "unload_models_between_tests": True,  # Automatically unload models to free VRAM
    "memory_management_pause": 2  # Seconds to wait after unloading a model
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

# Memory management recommendation
if CONFIG["unload_models_between_tests"]:
    print(f"\n💡 Memory Management: Enabled")
    print(f"   Models will be automatically unloaded between tests to free VRAM")
else:
    print(f"\n⚠️  Memory Management: Disabled") 
    print(f"   All models will remain in memory (may cause VRAM issues)")

Configuration:
  ollama_url: http://localhost:11434/api/chat
  models_to_test: ['pilardi/sentiment-analysis:gemma3', 'pilardi/sentiment-analysis:llama3', 'pilardi/sentiment-analysis:phi3', 'pilardi/sentiment-analysis:qwen3.4b']
  dataset_path: ./datasets/Mental Health Dataset.csv
  sample_size: 100
  random_seed: 42
  batch_size: 10
  test_entire_dataset: False
  results_file: results.json
  timeout_seconds: 30
  store_sample_predictions: True
  sample_predictions_count: 5
  unload_models_between_tests: True
  memory_management_pause: 2

💡 Memory Management: Enabled
   Models will be automatically unloaded between tests to free VRAM
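
`CONFIG["ollama_url"]` holds the full chat endpoint, while the memory-management helpers (`/api/ps`, `/api/generate`) need the server's base URL. The notebook derives it with a string replace on `"/api/chat"`. The sketch below shows one alternative that parses the URL instead. The helper name `ollama_base_url` is hypothetical, and it assumes Ollama is served from the host root (any path prefix added by a reverse proxy would be dropped).

```python
from urllib.parse import urlsplit, urlunsplit


def ollama_base_url(endpoint_url: str) -> str:
    """Return scheme://host:port for a full Ollama endpoint URL (hypothetical helper)."""
    parts = urlsplit(endpoint_url)
    # Keep only the scheme and network location; drop path, query, and fragment
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


print(ollama_base_url("http://localhost:11434/api/chat"))  # http://localhost:11434
```

Parsing the URL keeps working if the configured endpoint has a trailing slash or points at a different API path, which a literal string replace would miss.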


## 3. Ollama API Integration

In [3]:
def query_ollama_simple(text: str, model: str, ollama_url: str, timeout: int = 30) -> Optional[str]:
    """
    Simple Ollama query for connectivity testing.
    Returns: raw response text or None on error
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": text}
        ]
    }
    
    try:
        response = requests.post(ollama_url, json=payload, timeout=timeout, stream=True)
        response.raise_for_status()
        
        # Handle streaming NDJSON response
        assembled = ''
        for line in response.iter_lines(decode_unicode=False):
            if not line:
                continue
            try:
                obj = json.loads(line.decode('utf-8', errors='replace'))
                if isinstance(obj, dict):
                    msg = obj.get('message')
                    if isinstance(msg, dict) and 'content' in msg:
                        assembled += str(msg['content'])
            except Exception:
                continue
        
        return assembled.strip() if assembled else None
    except Exception as e:
        print(f"Error querying model: {e}")
        return None

def get_loaded_models(ollama_base_url: str = "http://localhost:11434", timeout: int = 10) -> List[str]:
    """
    Get list of currently loaded Ollama models.
    
    Args:
        ollama_base_url: Base URL for Ollama API
        timeout: Request timeout in seconds
    
    Returns:
        List of loaded model names
    """
    try:
        # Use the /api/ps endpoint to list running models
        ps_url = f"{ollama_base_url}/api/ps"
        response = requests.get(ps_url, timeout=timeout)
        response.raise_for_status()
        
        data = response.json()
        models = data.get('models', [])
        return [model.get('name', 'unknown') for model in models]
        
    except Exception as e:
        print(f"⚠️ Failed to get loaded models: {e}")
        return []

def unload_ollama_model(model: str, ollama_base_url: str = "http://localhost:11434", timeout: int = 10) -> bool:
    """
    Unload a specific Ollama model from memory to free up VRAM.
    
    Args:
        model: The name of the model to unload (e.g., "gemma3:4b")
        ollama_base_url: Base URL for Ollama API (without /api/chat)
        timeout: Request timeout in seconds
    
    Returns:
        True if successful, False otherwise
    """
    try:
        # Use the /api/generate endpoint with keep_alive=0 to unload model
        generate_url = f"{ollama_base_url}/api/generate"
        payload = {
            "model": model,
            "keep_alive": 0
        }
        
        response = requests.post(generate_url, json=payload, timeout=timeout)
        response.raise_for_status()
        
        print(f"✓ Model '{model}' unloaded from memory")
        return True
        
    except Exception as e:
        print(f"⚠️ Failed to unload model '{model}': {e}")
        return False

def unload_all_models(ollama_base_url: str = "http://localhost:11434", timeout: int = 10) -> int:
    """
    Unload all currently loaded Ollama models.
    
    Args:
        ollama_base_url: Base URL for Ollama API
        timeout: Request timeout in seconds
    
    Returns:
        Number of models successfully unloaded
    """
    loaded_models = get_loaded_models(ollama_base_url, timeout)
    
    if not loaded_models:
        print("No models currently loaded")
        return 0
    
    print(f"Unloading {len(loaded_models)} loaded model(s): {', '.join(loaded_models)}")
    
    success_count = 0
    for model in loaded_models:
        if unload_ollama_model(model, ollama_base_url, timeout):
            success_count += 1
    
    return success_count

def query_ollama_sentiment(text: str, model: str, ollama_url: str, timeout: int = 30) -> Optional[str]:
    """
    Query Ollama model for sentiment analysis.
    Returns: sentiment label (very_negative/negative/neutral/positive) or None on error
    """
    prompt = f"""Analyze the sentiment of the following text and respond with ONLY one of these four words: 'very_negative', 'negative', 'neutral', or 'positive'.

Guidelines:
- very_negative: Extremely negative, distressing, or severely critical
- negative: Generally negative or critical but not extreme
- neutral: Balanced, factual, or neither positive nor negative
- positive: Generally positive, satisfied, or optimistic

Text: {text}

Sentiment:"""
    
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ]
    }
    
    try:
        response = requests.post(ollama_url, json=payload, timeout=timeout, stream=True)
        response.raise_for_status()
        
        # Handle streaming NDJSON response
        assembled = ''
        for line in response.iter_lines(decode_unicode=False):
            if not line:
                continue
            try:
                obj = json.loads(line.decode('utf-8', errors='replace'))
                if isinstance(obj, dict):
                    msg = obj.get('message')
                    if isinstance(msg, dict) and 'content' in msg:
                        assembled += str(msg['content'])
            except Exception:
                continue
        
        # Normalize: lowercase, and treat spaces and hyphens as underscores so
        # "Very Negative" and "very-negative" both become "very_negative"
        sentiment = assembled.strip().lower().replace('-', '_').replace(' ', '_')
        
        # Substring match; "very_negative" is listed first so it is not
        # mistaken for "negative"
        valid_sentiments = ["very_negative", "negative", "neutral", "positive"]
        
        for valid in valid_sentiments:
            if valid in sentiment:
                return valid
        
        # No recognizable label in the response
        return None
    except Exception as e:
        print(f"Error querying model: {e}")
        return None

print("✓ Ollama API integration ready with memory management")

✓ Ollama API integration ready with memory management
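
Both query functions read the chat response as newline-delimited JSON and concatenate the `message.content` fragments. The sketch below isolates that step so it can be tried without a running server. The function name `assemble_chat_stream` and the sample chunks are illustrative assumptions (the chunks are hand-written, not captured from Ollama).

```python
import json
from typing import Iterable


def assemble_chat_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content from NDJSON chat chunks, skipping bad lines."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # blank keep-alive line
        try:
            obj = json.loads(raw.decode("utf-8", errors="replace"))
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        message = obj.get("message") if isinstance(obj, dict) else None
        if isinstance(message, dict) and "content" in message:
            parts.append(str(message["content"]))
    return "".join(parts).strip()


# Hand-written sample chunks for illustration only
sample = [
    b'{"message": {"role": "assistant", "content": "very_"}, "done": false}',
    b"",
    b"not json",
    b'{"message": {"role": "assistant", "content": "negative"}, "done": true}',
]
print(assemble_chat_stream(sample))  # very_negative
```

Factoring the loop out this way would also let `query_ollama_simple` and `query_ollama_sentiment` share one implementation instead of two copies.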


## 4. Test Ollama Connectivity

In [4]:
print('OLLAMA_URL:', CONFIG["ollama_url"])
print('MODEL_NAME:', CONFIG["models_to_test"][0])
print('\n-- Testing Ollama connectivity --')

try:
    test_response = query_ollama_simple(
        "Please reply with the single token: PONG", 
        CONFIG["models_to_test"][0],
        CONFIG["ollama_url"],
        CONFIG["timeout_seconds"]
    )
    print(f'\nResponse: {test_response}')
    
    if test_response and 'PONG' in str(test_response).upper():
        print('\n✅ SUCCESS: Ollama is reachable and responding correctly!')
    elif test_response is None:
        print('\n⚠️ WARNING: Got no response from Ollama')
    else:
        print(f'\n⚠️ WARNING: Got unexpected response: {test_response}')
        
except requests.exceptions.ConnectionError as ce:
    print(f'❌ Connection error: {ce}')
    print('Please verify OLLAMA_URL is correct and Ollama is running.')
except requests.exceptions.Timeout as te:
    print(f'❌ Request timed out: {te}')
except Exception as e:
    print(f'❌ Unexpected error: {e}')

OLLAMA_URL: http://localhost:11434/api/chat
MODEL_NAME: pilardi/sentiment-analysis:gemma3

-- Testing Ollama connectivity --

Response: PONG

✅ SUCCESS: Ollama is reachable and responding correctly!


## 5. Server Specifications

Record the hardware and system specifications used for testing.

In [5]:
import platform
import subprocess

def get_server_specs() -> Dict[str, Any]:
    """
    Gather server specifications. Set unknown values to None.
    """
    specs = {
        "cpu": None,
        "ram_gb": None,
        "gpu_count": None,
        "gpu_type": None,
        "gpu_bus_speed_gbps": None,
        "os": platform.platform()
    }
    
    # Try to get CPU info
    try:
        if platform.system() == "Linux":
            cpu_info = subprocess.check_output("lscpu | grep 'Model name'", shell=True).decode().strip()
            specs["cpu"] = cpu_info.split(":")[1].strip() if ":" in cpu_info else None
        elif platform.system() == "Darwin":  # macOS
            cpu_info = subprocess.check_output("sysctl -n machdep.cpu.brand_string", shell=True).decode().strip()
            specs["cpu"] = cpu_info
    except Exception:
        # CPU lookup is best-effort; leave the value as None on failure
        pass
    
    # Try to get RAM
    try:
        if platform.system() == "Linux":
            mem_info = subprocess.check_output("free -g | grep Mem | awk '{print $2}'", shell=True).decode().strip()
            specs["ram_gb"] = int(mem_info)
        elif platform.system() == "Darwin":
            mem_info = subprocess.check_output("sysctl hw.memsize | awk '{print $2}'", shell=True).decode().strip()
            specs["ram_gb"] = int(int(mem_info) / (1024**3))
    except Exception:
        # RAM lookup is best-effort; leave the value as None on failure
        pass
    
    # Try to get GPU info (NVIDIA)
    try:
        gpu_info = subprocess.check_output("nvidia-smi --query-gpu=name --format=csv,noheader", shell=True).decode().strip()
        gpu_list = gpu_info.split("\n")
        specs["gpu_count"] = len(gpu_list)
        specs["gpu_type"] = gpu_list[0] if gpu_list else None
        
        # Try to get PCIe link speed (current)
        try:
            pcie_speed = subprocess.check_output("nvidia-smi --query-gpu=pcie.link.gen.current --format=csv,noheader", shell=True).decode().strip()
            pcie_width = subprocess.check_output("nvidia-smi --query-gpu=pcie.link.width.current --format=csv,noheader", shell=True).decode().strip()
            
            # Also get max supported values for comparison
            pcie_speed_max = subprocess.check_output("nvidia-smi --query-gpu=pcie.link.gen.max --format=csv,noheader", shell=True).decode().strip()
            pcie_width_max = subprocess.check_output("nvidia-smi --query-gpu=pcie.link.width.max --format=csv,noheader", shell=True).decode().strip()
            
            # Get first GPU's speed
            gen = int(pcie_speed.split("\n")[0]) if pcie_speed else None
            width = int(pcie_width.split("\n")[0]) if pcie_width else None
            gen_max = int(pcie_speed_max.split("\n")[0]) if pcie_speed_max else None
            width_max = int(pcie_width_max.split("\n")[0]) if pcie_width_max else None
            
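            # Calculate approximate bandwidth in GB/s (gigabytes per second) from the PCIe generation
            # Per lane: PCIe 1.0: ~0.25 GB/s, 2.0: ~0.5, 3.0: ~1.0, 4.0: ~2.0, 5.0: ~4.0
            # Note: the "gpu_bus_speed_gbps" key stores this GB/s figure, not gigabits per second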
            speed_per_lane = {1: 0.25, 2: 0.5, 3: 1.0, 4: 2.0, 5: 4.0}
            if gen and width and gen in speed_per_lane:
                specs["gpu_bus_speed_gbps"] = round(speed_per_lane[gen] * width, 1)
            
            # Store detailed PCIe info for verbose output
            specs["_pcie_details"] = {
                "current_gen": gen,
                "current_width": width,
                "max_gen": gen_max,
                "max_width": width_max
            }
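        except Exception:
            # PCIe details are optional; keep the GPU name and count
            pass
    except Exception:
        # nvidia-smi is missing or failed; leave GPU fields as None
        pass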
    
    return specs

SERVER_SPECS = get_server_specs()
print("Server Specifications:")
print(json.dumps(SERVER_SPECS, indent=2))

# Print verbose PCIe information if available
if "_pcie_details" in SERVER_SPECS and SERVER_SPECS["_pcie_details"]:
    details = SERVER_SPECS["_pcie_details"]
    print("\nPCIe Configuration Details:")
    print(f"  Current: PCIe Gen {details['current_gen']} x{details['current_width']}")
    print(f"  Maximum: PCIe Gen {details['max_gen']} x{details['max_width']}")
    
    if details['current_gen'] and details['max_gen']:
        if details['current_gen'] < details['max_gen'] or details['current_width'] < details['max_width']:
            print("\n  ⚠️  WARNING: GPU is not running at maximum PCIe speed!")
            print("  Possible causes:")
            print("    - GPU is in a PCIe slot with lower bandwidth")
            print("    - Power saving mode is enabled")
            print("    - BIOS PCIe settings are not optimal")
            print("    - GPU is idle (speeds may increase under load)")
            
            # Calculate potential max speed
            speed_per_lane = {1: 0.25, 2: 0.5, 3: 1.0, 4: 2.0, 5: 4.0}
            if details['max_gen'] in speed_per_lane:
                max_speed = round(speed_per_lane[details['max_gen']] * details['max_width'], 1)
                print(f"    - Potential max bandwidth: {max_speed} GB/s")
    
    # Remove internal details before saving to results
    del SERVER_SPECS["_pcie_details"]

Server Specifications:
{
  "cpu": "13th Gen Intel(R) Core(TM) i9-13900K",
  "ram_gb": 62,
  "gpu_count": 2,
  "gpu_type": "NVIDIA GeForce RTX 3090",
  "gpu_bus_speed_gbps": 16.0,
  "os": "Linux-5.15.0-157-generic-x86_64-with-glibc2.35",
  "_pcie_details": {
    "current_gen": 4,
    "current_width": 8,
    "max_gen": 4,
    "max_width": 16
  }
}

PCIe Configuration Details:
  Current: PCIe Gen 4 x8
  Maximum: PCIe Gen 4 x16

  Possible causes:
    - GPU is in a PCIe slot with lower bandwidth
    - Power saving mode is enabled
    - BIOS PCIe settings are not optimal
    - GPU is idle (speeds may increase under load)
    - Potential max bandwidth: 32.0 GB/s
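
The bandwidth figures above follow from multiplying an approximate per-lane rate by the link width. The sketch below reproduces that arithmetic for the reported Gen 4 x8 and Gen 4 x16 links, using the same per-lane table as `get_server_specs()`. The function name `pcie_bandwidth_gb_s` is hypothetical.

```python
from typing import Optional

# Approximate per-lane bandwidth in GB/s, same table as get_server_specs()
GB_PER_S_PER_LANE = {1: 0.25, 2: 0.5, 3: 1.0, 4: 2.0, 5: 4.0}


def pcie_bandwidth_gb_s(gen: int, width: int) -> Optional[float]:
    """Approximate one-direction PCIe bandwidth in GB/s, or None if unknown."""
    per_lane = GB_PER_S_PER_LANE.get(gen)
    if per_lane is None or width <= 0:
        return None
    return round(per_lane * width, 1)


current = pcie_bandwidth_gb_s(4, 8)    # link as currently negotiated
maximum = pcie_bandwidth_gb_s(4, 16)   # link the card supports
print(current, maximum)                # 16.0 32.0
print(f"{current / maximum:.0%}")      # 50%
```

On this machine the link runs at half of its potential bandwidth, which is why the width comparison in the cell above triggers the warning branch.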


## 6. Data Loading and Validation

In [6]:
def load_and_validate_data(dataset_path: str) -> Tuple[pd.DataFrame, int]:
    """
    Load the dataset and validate/clean the data.
    Returns: (cleaned_dataframe, skipped_rows_count)
    """
    # Load dataset
    df = pd.read_csv(dataset_path)
    print(f"Loaded dataset with {len(df)} rows")
    
    initial_count = len(df)
    
    # Remove rows with missing or blank 'posts' field
    df = df.dropna(subset=['posts'])
    df = df[df['posts'].str.strip() != '']
    
    # Remove rows with missing 'predicted' or 'intensity' values
    df = df.dropna(subset=['predicted', 'intensity'])
    
    skipped_count = initial_count - len(df)
    
    print(f"Cleaned dataset: {len(df)} rows (skipped {skipped_count} rows)")
    print(f"Label distribution: {df['predicted'].value_counts().to_dict()}")
    
    return df, skipped_count

# Load data
df_full, skipped_rows = load_and_validate_data(CONFIG["dataset_path"])
print(f"\n✓ Data loaded and validated")

Loaded dataset with 10392 rows
Cleaned dataset: 10391 rows (skipped 1 rows)
Label distribution: {'neutral': 4374, 'negative': 4112, 'very negative': 1155, 'positive': 750}

✓ Data loaded and validated
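
The label distribution above spells one class as `very negative` (with a space), whereas the prompt in `query_ollama_sentiment` asks for `very_negative`. A quick check like the sketch below can flag such mismatches before any metrics are computed. `PROMPT_LABELS`, `normalize_label`, and `unmatched_labels` are hypothetical names, and the label list is copied from the printed distribution rather than read from the dataset.

```python
from typing import Dict, Iterable, Set

# Labels the prompt asks the model to return
PROMPT_LABELS: Set[str] = {"very_negative", "negative", "neutral", "positive"}


def normalize_label(label: str) -> str:
    """Lowercase a label and replace spaces with underscores."""
    return label.strip().lower().replace(" ", "_")


def unmatched_labels(dataset_labels: Iterable[str]) -> Dict[str, Set[str]]:
    """Report dataset labels that do not match the prompt labels, before and after normalizing."""
    raw = set(dataset_labels)
    return {
        "before": {label for label in raw if label not in PROMPT_LABELS},
        "after": {label for label in raw if normalize_label(label) not in PROMPT_LABELS},
    }


labels = ["neutral", "negative", "very negative", "positive"]
report = unmatched_labels(labels)
print(report["before"])  # {'very negative'}
print(report["after"])   # set()
```

If `before` is non-empty and `after` is empty, normalizing the ground-truth column is enough to make it comparable with the model output.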


## 7. Data Sampling

In [7]:
def sample_data(df: pd.DataFrame, sample_size: int, random_seed: int, test_entire: bool = False) -> Tuple[pd.DataFrame, List[int]]:
    """
    Sample data from the dataset.
    Returns: (sampled_dataframe, list_of_indices)
    """
    if test_entire:
        sampled_df = df.copy()
        indices = df.index.tolist()
    else:
        sample_size = min(sample_size, len(df))
        sampled_df = df.sample(n=sample_size, random_state=random_seed)
        indices = sampled_df.index.tolist()
    
    print(f"Sampled {len(sampled_df)} records for testing")
    return sampled_df, indices

# Sample data
df_sample, sample_indices = sample_data(
    df_full, 
    CONFIG["sample_size"], 
    CONFIG["random_seed"],
    CONFIG["test_entire_dataset"]
)

print(f"Sample indices: {sample_indices[:10]}...")
print(f"\n✓ Data sampled")

Sampled 100 records for testing
Sample indices: [3625, 3037, 2574, 1488, 3677, 8861, 7884, 1692, 39, 3441]...

✓ Data sampled
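
Fixing `random_seed` is what makes every model see the same 100 rows. The sketch below illustrates the idea with the standard library only. It is an analogy, not a re-implementation: `df.sample(random_state=...)` uses NumPy's generator, so the indices produced here differ from the ones printed above. The function name `sample_indices` is hypothetical.

```python
import random
from typing import List


def sample_indices(n_rows: int, sample_size: int, seed: int) -> List[int]:
    """Draw a seeded sample of row positions without replacement."""
    k = min(sample_size, n_rows)  # same clamp as sample_data()
    return random.Random(seed).sample(range(n_rows), k)


first = sample_indices(10391, 100, 42)
second = sample_indices(10391, 100, 42)
print(first == second)   # True: the same seed gives the same rows
print(len(set(first)))   # 100: no row is drawn twice
```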


## 8. Batch Prediction

In [8]:
def predict_sentiments(df: pd.DataFrame, model: str, config: Dict) -> Tuple[List[Optional[str]], int]:
    """
    Predict sentiments for all texts in the dataframe.
    Returns: (predictions_list, error_count)
    """
    predictions = []
    error_count = 0
    
    for idx, row in df.iterrows():
        text = row['posts']
        prediction = query_ollama_sentiment(
            text, 
            model, 
            config["ollama_url"],
            config["timeout_seconds"]
        )
        
        if prediction is None:
            error_count += 1
        
        predictions.append(prediction)
        
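        # Progress update every config["batch_size"] predictions
        # (requests are still sent one at a time)
        progress_every = max(1, int(config.get("batch_size", 10)))
        if len(predictions) % progress_every == 0:
            print(f"Progress: {len(predictions)}/{len(df)} predictions completed (errors: {error_count})")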
    
    return predictions, error_count

print("✓ Batch prediction function ready")

✓ Batch prediction function ready


## 9. Evaluation Metrics

In [9]:
def calculate_metrics(y_true: List[str], y_pred: List[Optional[str]], 
                     simplified: bool = False) -> Dict[str, float]:
    """
    Calculate accuracy, precision, recall, and F1 score.
    Handles None predictions by filtering them out.
    
    Args:
        y_true: True labels
        y_pred: Predicted labels (may contain None)
        simplified: If True, map very_negative -> negative for 3-class evaluation
    """
    # Filter out None predictions
    valid_pairs = [(true, pred) for true, pred in zip(y_true, y_pred) if pred is not None]
    
    if not valid_pairs:
        return {
            "accuracy": 0.0,
            "precision": 0.0,
            "recall": 0.0,
            "f1": 0.0,
            "class_count": 0
        }
    
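    # Normalize ground-truth labels so dataset values such as "very negative"
    # match the model's "very_negative" output format
    def _normalize(label: Any) -> str:
        return str(label).strip().lower().replace(' ', '_')
    
    y_true_valid = [_normalize(pair[0]) for pair in valid_pairs]
    y_pred_valid = [pair[1] for pair in valid_pairs]
    
    # Apply simplification if requested
    if simplified:
        y_true_valid = ['negative' if label == 'very_negative' else label for label in y_true_valid]
        y_pred_valid = ['negative' if label == 'very_negative' else label for label in y_pred_valid]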
    
    accuracy = accuracy_score(y_true_valid, y_pred_valid)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_valid, 
        y_pred_valid, 
        average='weighted',
        zero_division=0
    )
    
    # Count unique classes
    all_labels = set(y_true_valid + y_pred_valid)
    
    return {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "class_count": len(all_labels)
    }

print("✓ Evaluation metrics function ready")

✓ Evaluation metrics function ready
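
`calculate_metrics` drops `None` predictions before scoring, so its accuracy covers only the rows the model answered. The sketch below shows how that figure can differ from one that counts failed predictions as wrong, and reports coverage alongside both. The function name `accuracy_with_coverage` and the toy labels are illustrative assumptions, not notebook results.

```python
from typing import List, Optional, Tuple


def accuracy_with_coverage(
    y_true: List[str], y_pred: List[Optional[str]]
) -> Tuple[float, float, float]:
    """Return (accuracy on answered rows, accuracy counting failures as wrong, coverage)."""
    total = len(y_true)
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    correct = sum(1 for t, p in answered if t == p)
    acc_answered = correct / len(answered) if answered else 0.0
    acc_all = correct / total if total else 0.0
    coverage = len(answered) / total if total else 0.0
    return acc_answered, acc_all, coverage


# Toy data: one of four predictions failed (None)
y_true = ["negative", "neutral", "positive", "negative"]
y_pred = ["negative", None, "positive", "neutral"]
acc_answered, acc_all, coverage = accuracy_with_coverage(y_true, y_pred)
print(f"answered-only: {acc_answered:.2f}, all rows: {acc_all:.2f}, coverage: {coverage:.2f}")
```

Reading accuracy together with the stored `error_count` avoids favoring a model that answers few rows but gets those right.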


## 10. Results Storage

In [10]:
def load_results(results_file: str) -> Dict:
    """
    Load existing results from JSON file or create new structure.
    """
    results_path = Path(results_file)
    
    if results_path.exists():
        with open(results_path, 'r') as f:
            return json.load(f)
    else:
        return {
            "runs": [],
            "server_specs": SERVER_SPECS
        }

def save_results(results: Dict, results_file: str):
    """
    Save results to JSON file.
    """
    with open(results_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"Results saved to {results_file}")

def add_test_run(results: Dict, run_data: Dict):
    """
    Add a new test run to results (append-only).
    """
    results["runs"].append(run_data)
    # Ensure chronological order (oldest first)
    results["runs"] = sorted(results["runs"], key=lambda x: x["timestamp"])

print("✓ Results storage functions ready")

✓ Results storage functions ready
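
`save_results` rewrites the whole file on every call, so an interruption during the write could leave `results.json` truncated. One possible safeguard, sketched below under the assumption that the results file lives on a local filesystem, is to write to a temporary file and swap it into place with `os.replace`. The function name `save_results_atomic` is hypothetical and is not part of the notebook.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_results_atomic(results: Dict[str, Any], results_file: str) -> None:
    """Write results to a temp file, then replace the target in one step."""
    target = Path(results_file)
    # Temp file goes in the same directory so os.replace stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


# Usage in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.json")
    save_results_atomic({"runs": []}, path)
    with open(path, encoding="utf-8") as f:
        print(json.load(f))  # {'runs': []}
```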


## 11. Main Test Execution

In [11]:
def run_model_test(model: str, df: pd.DataFrame, indices: List[int], config: Dict) -> Dict:
    """
    Run a complete test for a single model.
    Returns: test run data dictionary
    """
    print(f"\n{'='*60}")
    print(f"Testing model: {model}")
    print(f"{'='*60}")
    
    start_time = time.time()
    error_message = None
    
    try:
        # Get predictions
        predictions, error_count = predict_sentiments(df, model, config)
        
        # Calculate metrics for both 4-class and 3-class scenarios
        y_true = df['predicted'].tolist()
        stats_4class = calculate_metrics(y_true, predictions, simplified=False)
        stats_3class = calculate_metrics(y_true, predictions, simplified=True)
        
        # Prepare sample predictions for debugging
        sample_predictions = []
        if config.get("store_sample_predictions", False):
            sample_count = min(config.get("sample_predictions_count", 5), len(df))
            for i in range(sample_count):
                row = df.iloc[i]
                sample_predictions.append({
                    "index": int(indices[i]),
                    "input": row['posts'][:100] + "..." if len(row['posts']) > 100 else row['posts'],
                    "true_label": row['predicted'],
                    "pred": predictions[i]
                })
        
    except Exception as e:
        error_message = str(e)
        predictions = []
        error_count = len(df)
        stats_4class = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "class_count": 0}
        stats_3class = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0, "class_count": 0}
        sample_predictions = []
        print(f"ERROR: {error_message}")
    
    runtime = round(time.time() - start_time, 2)
    
    # Create test run data
    run_data = {
        "timestamp": datetime.utcnow().isoformat() + "Z",
        "model": model,
        "sample_size": len(df),
        "dataset_indices": indices,
        "config": {
            "random_seed": config["random_seed"],
            "batch_size": config["batch_size"],
            "timeout_seconds": config["timeout_seconds"]
        },
        "stats_4class": stats_4class,
        "stats_3class": stats_3class,
        "runtime_sec": runtime,
        "skipped_rows": skipped_rows,
        "error_count": error_count,
        "error_message": error_message,
        "sample_predictions": sample_predictions
    }
    
    # Print summary
    print(f"\nTest completed in {runtime} seconds")
    print(f"4-Class Results (very_negative/negative/neutral/positive):")
    print(f"  Accuracy: {stats_4class['accuracy']:.4f}")
    print(f"  Precision: {stats_4class['precision']:.4f}")
    print(f"  Recall: {stats_4class['recall']:.4f}")
    print(f"  F1 Score: {stats_4class['f1']:.4f}")
    print(f"3-Class Results (negative/neutral/positive):")
    print(f"  Accuracy: {stats_3class['accuracy']:.4f}")
    print(f"  Precision: {stats_3class['precision']:.4f}")
    print(f"  Recall: {stats_3class['recall']:.4f}")
    print(f"  F1 Score: {stats_3class['f1']:.4f}")
    print(f"Errors: {error_count}")
    
    return run_data

print("✓ Main test execution function ready")

✓ Main test execution function ready


## 12. Execute Tests for All Models

Run the test for each configured model in turn, append every run to the JSON results file, and unload the model afterwards when `unload_models_between_tests` is enabled.

In [12]:
# Load existing results
results = load_results(CONFIG["results_file"])

# Extract base URL for memory management functions
base_url = CONFIG["ollama_url"].replace("/api/chat", "")

# Show initial memory status
print("📊 Initial Memory Status:")
loaded_models = get_loaded_models(base_url)
if loaded_models:
    print(f"   Currently loaded models: {', '.join(loaded_models)}")
else:
    print("   No models currently loaded")

# Test each model with optional memory management
for i, model in enumerate(CONFIG["models_to_test"]):
    print(f"\n{'='*80}")
    print(f"Starting test {i+1}/{len(CONFIG['models_to_test'])}: {model}")
    print(f"{'='*80}")
    
    # Run the test
    run_data = run_model_test(model, df_sample, sample_indices, CONFIG)
    add_test_run(results, run_data)
    save_results(results, CONFIG["results_file"])
    print(f"\n✓ Results for {model} saved to {CONFIG['results_file']}")
    
    # Memory management between tests
    if CONFIG.get("unload_models_between_tests", True) and i < len(CONFIG["models_to_test"]) - 1:
        print(f"\n🧹 Memory Management - Unloading {model}...")
        unload_success = unload_ollama_model(model, base_url, timeout=10)
        
        if unload_success:
            print("✓ Memory freed successfully")
            # Brief pause to ensure model is fully unloaded
            pause_time = CONFIG.get("memory_management_pause", 2)
            if pause_time > 0:
                print(f"⏳ Waiting {pause_time} seconds for cleanup...")
                time.sleep(pause_time)
        else:
            print("⚠️ Model unloading failed - continuing anyway")
            
        # Show updated memory status
        remaining_loaded = get_loaded_models(base_url)
        if remaining_loaded:
            print(f"📊 Remaining loaded models: {', '.join(remaining_loaded)}")
        else:
            print("📊 All models unloaded")
            
    elif not CONFIG.get("unload_models_between_tests", True):
        print(f"\n📝 Memory Management: Disabled - {model} remains loaded")
    else:
        print(f"\n✓ Final model test completed - keeping {model} loaded")

print(f"\n{'='*60}")
print("All tests completed!")
print(f"{'='*60}")

# Final memory status and summary
print(f"\n💡 Memory Management Summary:")
print(f"  - Tested {len(CONFIG['models_to_test'])} models sequentially")

if CONFIG.get("unload_models_between_tests", True):
    print(f"  - Unloaded {len(CONFIG['models_to_test'])-1} models to free VRAM")
    print(f"  - Final model '{CONFIG['models_to_test'][-1]}' remains loaded")
    
    final_loaded = get_loaded_models(base_url)
    if final_loaded:
        print(f"  - Currently loaded: {', '.join(final_loaded)}")
    else:
        print(f"  - No models currently loaded")
else:
    print(f"  - Memory management disabled - all models may remain loaded")
    current_loaded = get_loaded_models(base_url)
    if current_loaded:
        print(f"  - Currently loaded: {', '.join(current_loaded)}")

print(f"\n💡 Tip: To manually clear all models from memory, run: unload_all_models('{base_url}')")

📊 Initial Memory Status:
   Currently loaded models: pilardi/sentiment-analysis:gemma3

Starting test 1/4: pilardi/sentiment-analysis:gemma3

Testing model: pilardi/sentiment-analysis:gemma3
Progress: 10/100 predictions completed (errors: 0)
Progress: 20/100 predictions completed (errors: 1)
Progress: 30/100 predictions completed (errors: 1)
Progress: 40/100 predictions completed (errors: 3)
Progress: 50/100 predictions completed (errors: 3)
Progress: 60/100 predictions completed (errors: 4)
Progress: 70/100 predictions completed (errors: 4)
Progress: 80/100 predictions completed (errors: 5)

## 12.5. Manual Memory Management

Use these utilities to manually check and manage Ollama model memory usage.

In [13]:
# Manual Memory Management Utilities
base_url = CONFIG["ollama_url"].replace("/api/chat", "")

print("🔧 Memory Management Utilities")
print("=" * 50)

# Check currently loaded models
print("\n1. Currently Loaded Models:")
loaded_models = get_loaded_models(base_url)
if loaded_models:
    for i, model in enumerate(loaded_models, 1):
        print(f"   {i}. {model}")
    print(f"\n   Total: {len(loaded_models)} model(s) loaded")
else:
    print("   No models currently loaded")

# Manual unload options
print(f"\n2. Manual Unload Options:")
print(f"   To unload a specific model:")
print(f"   → unload_ollama_model('MODEL_NAME', '{base_url}')")
print(f"   ")
print(f"   To unload all models:")
print(f"   → unload_all_models('{base_url}')")

# Memory recommendations
if loaded_models:
    print(f"\n💡 Memory Recommendations:")
    if len(loaded_models) > 1:
        print(f"   ⚠️  Multiple models loaded - consider unloading unused ones")
        print(f"   💾 Estimated VRAM usage: ~{len(loaded_models) * 4}GB+ (varies by model size)")
    else:
        print(f"   ✓ Only 1 model loaded - memory usage optimized")
        
    print(f"\n   Quick actions:")
    print(f"   🧹 Unload all: unload_all_models('{base_url}')")
    if len(loaded_models) > 1:
        print(f"   🎯 Keep only first: [unload_ollama_model(m, '{base_url}') for m in {loaded_models}[1:]]")

# Example usage
print(f"\n3. Example Usage:")
print(f"   # Check memory status")
print(f"   loaded = get_loaded_models('{base_url}')")
print(f"   print(f'Loaded: {{loaded}}')")
print(f"   ")
print(f"   # Unload specific model")
print(f"   unload_ollama_model('gemma3:4b', '{base_url}')")
print(f"   ")
print(f"   # Clean all memory")
print(f"   unload_all_models('{base_url}')")

# Uncomment the line below to unload all models now:
# unload_all_models(base_url)

🔧 Memory Management Utilities

1. Currently Loaded Models:
   1. pilardi/sentiment-analysis:qwen3.4b

   Total: 1 model(s) loaded

2. Manual Unload Options:
   To unload a specific model:
   → unload_ollama_model('MODEL_NAME', 'http://localhost:11434')
   
   To unload all models:
   → unload_all_models('http://localhost:11434')

💡 Memory Recommendations:
   ✓ Only 1 model loaded - memory usage optimized

   Quick actions:
   🧹 Unload all: unload_all_models('http://localhost:11434')

3. Example Usage:
   # Check memory status
   loaded = get_loaded_models('http://localhost:11434')
   print(f'Loaded: {loaded}')
   
   # Unload specific model
   unload_ollama_model('gemma3:4b', 'http://localhost:11434')
   
   # Clean all memory
   unload_all_models('http://localhost:11434')


## 13. View Results Summary

In [14]:
# Load and display results
results = load_results(CONFIG["results_file"])

print("\nTest Results Summary")
print("=" * 100)

if results["runs"]:
    # Create summary DataFrame
    summary_data = []
    for run in results["runs"]:
        # Handle both old and new result formats for backward compatibility
        if "stats_4class" in run and "stats_3class" in run:
            # New format with 4-class and 3-class metrics
            summary_data.append({
                "Timestamp": run["timestamp"],
                "Model": run["model"],
                "Sample Size": run["sample_size"],
                "4-Class Accuracy": run["stats_4class"]["accuracy"],
                "4-Class F1": run["stats_4class"]["f1"],
                "3-Class Accuracy": run["stats_3class"]["accuracy"], 
                "3-Class F1": run["stats_3class"]["f1"],
                "Runtime (s)": run["runtime_sec"],
                "Errors": run["error_count"]
            })
        else:
            # Old format with single stats - treat as 3-class equivalent
            stats = run.get("stats", {"accuracy": 0.0, "f1": 0.0})
            summary_data.append({
                "Timestamp": run["timestamp"],
                "Model": run["model"],
                "Sample Size": run["sample_size"],
                "4-Class Accuracy": "N/A",
                "4-Class F1": "N/A",
                "3-Class Accuracy": stats["accuracy"], 
                "3-Class F1": stats["f1"],
                "Runtime (s)": run["runtime_sec"],
                "Errors": run["error_count"]
            })
    
    df_summary = pd.DataFrame(summary_data)
    print(df_summary.to_string(index=False))
    
    # Display server specs
    print("\n" + "=" * 100)
    print("Server Specifications:")
    print(json.dumps(results["server_specs"], indent=2))
    
    # Show label distribution in the latest run
    # (no second emptiness check needed: we are already inside `if results["runs"]`)
    latest_run = results["runs"][-1]
    print("\n" + "=" * 100)
    print("Sample Predictions Analysis:")
    if latest_run.get("sample_predictions"):
        true_labels = [pred["true_label"] for pred in latest_run["sample_predictions"]]
        pred_labels = [pred["pred"] for pred in latest_run["sample_predictions"]]
        print(f"True label distribution in sample: {dict(pd.Series(true_labels).value_counts())}")
        # dropna=False keeps failed (None) predictions visible in the distribution
        print(f"Predicted label distribution in sample: {dict(pd.Series(pred_labels).value_counts(dropna=False))}")
else:
    print("No test runs found.")


Test Results Summary
                  Timestamp                               Model  Sample Size 4-Class Accuracy 4-Class F1  3-Class Accuracy  3-Class F1  Runtime (s)  Errors
2025-10-13T16:52:23.557816Z                           gemma3:4b          100              N/A        N/A            0.0000      0.0000         0.14     100
2025-10-13T21:47:32.627160Z                           gemma3:4b          100              N/A        N/A            0.0000      0.0000         0.16     100
2025-10-13T21:52:20.087253Z                           gemma3:4b          100              N/A        N/A            0.0000      0.0000         0.16     100
2025-10-13T21:52:39.882826Z                           gemma3:4b          100              N/A        N/A            0.0000      0.0000         0.15     100
2025-10-13T21:55:59.603946Z                           gemma3:4b          100              N/A        N/A            0.3500      0.2543        12.52       0
2025-10-14T16:22:36.048610Z               
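
The summary reports both 4-class and 3-class metrics. One way such a pair could be derived is sketched below. The collapse mapping (folding "very negative" into "negative") is an assumption made for illustration; the notebook's actual mapping lives in an earlier cell. The helper names `collapse` and `accuracy` and the example labels are likewise hypothetical.

```python
from typing import List, Optional

# Assumed mapping from 4-class to 3-class labels (illustrative only)
COLLAPSE = {"very negative": "negative"}


def collapse(label: Optional[str]) -> Optional[str]:
    """Map a 4-class label to its 3-class equivalent; None stays None."""
    if label is None:
        return None
    return COLLAPSE.get(label, label)


def accuracy(y_true: List[str], y_pred: List[Optional[str]]) -> float:
    """Fraction of exact matches; a missing (None) prediction counts as wrong."""
    if not y_true:
        return 0.0
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up labels for demonstration
y_true = ["neutral", "very negative", "very negative", "neutral"]
y_pred = ["negative", "negative", None, "neutral"]

acc_4 = accuracy(y_true, y_pred)
acc_3 = accuracy([collapse(t) for t in y_true], [collapse(p) for p in y_pred])
print(f"4-class accuracy: {acc_4:.2f}, 3-class accuracy: {acc_3:.2f}")
```

Under this mapping a "very negative" record predicted as "negative" is wrong in the 4-class view but right in the 3-class view, which is why 3-class scores can only match or exceed their 4-class counterparts.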

## 14. OpenAI Expert Analysis

Request a review of the test results from OpenAI's GPT-4o model, prompted to act as a senior ML engineer.

In [15]:
import os
from openai import OpenAI

def get_openai_analysis(results_summary: str, server_specs: dict, sample_predictions: list) -> str:
    """
    Get expert analysis from OpenAI GPT-4o on the model testing results.
    """
    try:
        # Check for the API key first: creating the client without a key raises,
        # which would hide the friendlier message below
        api_key = os.getenv('OPENAI_API_KEY')
        if not api_key:
            return "❌ OpenAI API key not found in environment variables. Please set OPENAI_API_KEY."
        
        # Initialize OpenAI client with API key from environment
        client = OpenAI(api_key=api_key)
        
        # Prepare the analysis prompt
        prompt = f"""You are a senior ML engineer with expertise in LLM evaluation and sentiment analysis. 
Review these Ollama model performance test results and provide professional insights.

TEST RESULTS SUMMARY:
{results_summary}

SERVER SPECIFICATIONS:
{json.dumps(server_specs, indent=2)}

SAMPLE PREDICTIONS (latest run):
{json.dumps(sample_predictions[:3], indent=2) if sample_predictions else "No sample predictions available"}

Please provide a comprehensive analysis covering:

1. **Performance Assessment**: How do these models compare? Which performs best and why?

2. **4-Class vs 3-Class Results**: What do the differences tell us about model capability?

3. **Technical Observations**: Any concerns about accuracy, precision, recall, or F1 scores?

4. **Infrastructure Analysis**: How do the server specs impact performance? Any bottlenecks?

5. **Sample Prediction Quality**: Do the sample predictions show good understanding?

6. **Recommendations**: 
   - Which model should be used for production?
   - What optimizations would you suggest?
   - Are there red flags or areas needing attention?

7. **Next Steps**: What additional testing or validation would you recommend?

Provide specific, actionable insights as if briefing a technical team. Be honest about limitations and risks."""

        # Call OpenAI API
        response = client.chat.completions.create(
            model="gpt-4o",  # Model used for the review; another chat model can be substituted
            messages=[
                {"role": "system", "content": "You are a senior ML engineer specializing in LLM evaluation and sentiment analysis systems."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=2000,
            temperature=0.1  # Low temperature for consistent, factual analysis
        )
        
        return response.choices[0].message.content
        
    except Exception as e:
        return f"❌ Error getting OpenAI analysis: {str(e)}"

# Generate expert analysis
if results["runs"]:
    print("🤖 Requesting expert analysis from OpenAI GPT-4...")
    print("=" * 80)
    
    # Prepare summary data for analysis
    summary_text = df_summary.to_string(index=False) if 'df_summary' in locals() else "No summary data available"
    latest_sample_predictions = latest_run.get("sample_predictions", []) if 'latest_run' in locals() else []
    
    expert_analysis = get_openai_analysis(summary_text, results["server_specs"], latest_sample_predictions)
    
    print(expert_analysis)
    print("\n" + "=" * 80)
    print("💡 Analysis complete. Use these insights to guide your model selection and optimization efforts.")
    
else:
    print("⚠️ No test results available for analysis. Run some model tests first.")

🤖 Requesting expert analysis from OpenAI GPT-4...
### Comprehensive Analysis of Model Performance and Recommendations

#### 1. **Performance Assessment**

The test results indicate varying performance across different models. Here's a summary of the key findings:

- **gemma3:12b** and **gemma3:27b** models generally perform better than the smaller **gemma3:4b** model, particularly in 3-class sentiment analysis, with accuracies around 0.45 and F1 scores close to 0.39.
- **mvkvl/sentiments:llama3** and **pilardi/sentiment-analysis:gemma3** also show competitive performance, with 4-class accuracies around 0.32 and 0.37, respectively.
- **gemma3:4b** consistently underperforms, with 3-class accuracy around 0.33 and 4-class accuracy as low as 0.07.

**Best Performing Model**: Based on the results, **gemma3:12b** and **gemma3:27b** are the top performers, with **gemma3:12b** being slightly more efficient in terms of runtime.

#### 2. **4-Class vs 3-Class Results**

The models generally perfo

## 15. View Sample Predictions

Display sample predictions from the most recent test run for debugging.
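
When there are more than a handful of samples, tallying the (true label, prediction) pairs can surface systematic confusions faster than reading records one by one. The sketch below assumes each record has the `true_label` and `pred` keys used in this notebook; `tally_predictions` is a hypothetical helper, and the four example records mirror the labels of the first four samples printed in this section.

```python
from collections import Counter
from typing import Any, Dict, List


def tally_predictions(sample_predictions: List[Dict[str, Any]]) -> Counter:
    """Count (true_label, pred) pairs; a missing prediction is tallied as 'no answer'."""
    pairs: Counter = Counter()
    for pred in sample_predictions:
        predicted = pred.get("pred") or "no answer"
        pairs[(pred["true_label"], predicted)] += 1
    return pairs


samples = [
    {"true_label": "neutral", "pred": "negative"},
    {"true_label": "very negative", "pred": "negative"},
    {"true_label": "very negative", "pred": None},
    {"true_label": "neutral", "pred": "negative"},
]

# Most frequent confusions first
for (true_label, predicted), count in tally_predictions(samples).most_common():
    print(f"{true_label!r} -> {predicted!r}: {count}")
```

Separating "no answer" from an actual wrong label helps distinguish parsing or request failures from genuine misclassifications.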

In [16]:
if results["runs"]:
    latest_run = results["runs"][-1]
    print(f"Sample predictions from: {latest_run['model']}")
    print("=" * 80)
    
    for i, pred in enumerate(latest_run.get("sample_predictions", []), 1):
        print(f"\nSample {i}:")
        print(f"  Index: {pred['index']}")
        print(f"  Input: {pred['input']}")
        print(f"  True Label: {pred['true_label']}")
        print(f"  Predicted: {pred['pred']}")
        print(f"  Match: {'✓' if pred['true_label'] == pred['pred'] else '✗'}")
else:
    print("No test runs available.")

Sample predictions from: pilardi/sentiment-analysis:qwen3.4b

Sample 1:
  Index: 3625
  Input: look at liver function bloodwork http www dummy com how to content look at liver function bloodwork ...
  True Label: neutral
  Predicted: negative
  Match: ✗

Sample 2:
  Index: 3037
  Input: hey I am new here try to find a place to express what I am go through and find other to talk to who ...
  True Label: very negative
  Predicted: negative
  Match: ✗

Sample 3:
  Index: 2574
  Input: I am a new member reach out and willing to give back to the extend of my ability my father was recen...
  True Label: very negative
  Predicted: None
  Match: ✗

Sample 4:
  Index: 1488
  Input: I am out the change to this site is not user friendly at all not easy to navigate and the regular ha...
  True Label: neutral
  Predicted: negative
  Match: ✗

Sample 5:
  Index: 3677
  Input: http hepatitiscresearchandnewsupdate blogspot com 2012 04 seven step to healthy liver html
  True Label: neutral
  Predicted: