# Inference and Evaluation in BERT

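This notebook covers how to run inference with a BERT model and evaluate its performance. The examples use `MiniBERT`, which is randomly initialized here, so the outputs illustrate the workflow rather than trained behavior.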

## What You'll Learn:
1. Running inference on new text
2. Different evaluation metrics
3. Probing tasks to understand what BERT learned
4. Performance benchmarking
5. Analyzing model behavior and limitations
6. Evaluation best practices and common pitfalls

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys
sys.path.append('..')

from model import MiniBERT
from tokenizer import WordPieceTokenizer
from mlm import mask_tokens
# No separate metrics module is assumed; the metrics are implemented inline below
import time

np.random.seed(42)
try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    try:
        plt.style.use('seaborn-darkgrid') 
    except OSError:
        plt.style.use('default')

## Part 1: Basic Inference

How to use BERT to make predictions on new text.
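
The core of the cell below is a numerically stable softmax over the vocabulary axis, followed by a top-k lookup at each masked position. Here is a minimal sketch of that step on toy logits. The six-word vocabulary, the logit values, and the helper names `softmax` and `top_k` are made up for illustration, and no `MiniBERT` call is involved.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def top_k(probs, k):
    """Return (indices, probabilities) of the k largest entries, best first."""
    idx = np.argsort(probs)[-k:][::-1]
    return idx, probs[idx]

# Hypothetical vocabulary and logits for a single masked position
toy_vocab = ["mat", "dog", "sofa", "moon", "floor", "idea"]
toy_logits = np.array([3.0, 0.5, 2.0, -1.0, 2.5, -2.0])

toy_probs = softmax(toy_logits)
toy_idx, toy_top = top_k(toy_probs, k=3)
toy_tokens = [toy_vocab[i] for i in toy_idx]

for token, p in zip(toy_tokens, toy_top):
    print(f"{token}: {p:.3f}")
```

Subtracting the row maximum before exponentiating leaves the result unchanged and avoids overflow for large logits.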

In [None]:
# Instantiate the model and load the tokenizer vocabulary.
# MiniBERT() is randomly initialized here; no trained weights are loaded.
model = MiniBERT()
tokenizer = WordPieceTokenizer()
tokenizer.load_model('../tokenizer_8k.pkl')

def run_inference(text, model, tokenizer, mask_token='[MASK]'):
    """
    Run inference on text with masked tokens.
    """
    print(f"Input text: '{text}'")
    
    # Check for mask tokens with a substring test: splitting on whitespace
    # would miss a mask that is followed by punctuation, such as "[MASK]."
    if mask_token not in text:
        print(f"No {mask_token} tokens found in input text.")
        return []
    
    # Encode with tokenizer
    input_ids = tokenizer.encode(text)
    input_batch = np.array([input_ids])
    
    # Forward pass
    logits, cache = model.forward(input_batch)
    probabilities = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    probabilities = probabilities / np.sum(probabilities, axis=-1, keepdims=True)
    
    # Find mask token ID positions in tokenized sequence
    mask_token_id = tokenizer.vocab[mask_token]
    tokenized_sequence = input_ids
    mask_positions = [i for i, token_id in enumerate(tokenized_sequence) if token_id == mask_token_id]
    
    # Create inverse vocab mapping for display
    inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
    
    # Make predictions for each mask
    predictions = []
    for pos in mask_positions:
        pos_probs = probabilities[0, pos]
        top_k = 5
        top_indices = np.argsort(pos_probs)[-top_k:][::-1]
        top_probs = pos_probs[top_indices]
        
        # Get token strings
        top_tokens = []
        for idx in top_indices:
            token = inv_vocab.get(idx, f'ID_{idx}')
            top_tokens.append(token)
        
        predictions.append({
            'position': pos,
            'top_predictions': list(zip(top_tokens, top_probs))
        })
    
    # Display results
    print(f"\nPredictions for {len(mask_positions)} masked position(s):")
    for i, pred in enumerate(predictions):
        print(f"\nMask {i+1} (position {pred['position']}):")
        for rank, (token, prob) in enumerate(pred['top_predictions']):
            print(f"  {rank+1}. '{token}' ({prob*100:.2f}%)")
    
    return predictions

# Test inference
test_sentences = [
    "The cat sat on the [MASK].",
    "I love [MASK] learning.",
    "The [MASK] is shining brightly.",
    "She is a [MASK] student."
]

print("=" * 60)
print("INFERENCE EXAMPLES")
print("=" * 60)

for i, sentence in enumerate(test_sentences[:2]):  # Run the first two examples
    print(f"\nExample {i+1}:")
    print("-" * 30)
    predictions = run_inference(sentence, model, tokenizer)
    print("\n" + "=" * 40)

## Part 2: Evaluation Metrics

How to measure BERT's performance quantitatively.
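
The two headline numbers computed in the cell below are top-k accuracy and perplexity. Here is a small worked sketch on hand-written probabilities. The function names `topk_accuracy` and `perplexity` and the toy arrays are assumptions for illustration; they are not part of the notebook's modules.

```python
import numpy as np

def topk_accuracy(probs, targets, k):
    """Fraction of rows whose target index is among the k most probable."""
    topk = np.argsort(probs, axis=-1)[:, -k:]
    hits = [t in row for t, row in zip(targets, topk)]
    return float(np.mean(hits))

def perplexity(probs, targets, eps=1e-10):
    """exp of the mean negative log-probability assigned to the targets."""
    target_probs = probs[np.arange(len(targets)), targets]
    return float(np.exp(-np.mean(np.log(target_probs + eps))))

# Three hypothetical masked positions over a 4-token vocabulary
toy_probs = np.array([
    [0.70, 0.10, 0.15, 0.05],   # target 0 is the top prediction
    [0.15, 0.50, 0.25, 0.10],   # target 2 is the second prediction
    [0.40, 0.30, 0.20, 0.10],   # target 3 is the last prediction
])
toy_targets = np.array([0, 2, 3])

top1 = topk_accuracy(toy_probs, toy_targets, k=1)
top2 = topk_accuracy(toy_probs, toy_targets, k=2)
ppl = perplexity(toy_probs, toy_targets)

print(f"top-1: {top1:.3f}, top-2: {top2:.3f}, perplexity: {ppl:.3f}")
```

A useful reference point: a uniform distribution over V tokens has perplexity exactly V, so an untrained model should score close to the vocabulary size.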

In [None]:
def comprehensive_evaluation(model, tokenizer, test_texts, mask_prob=0.15):
    """
    Comprehensive evaluation of BERT model.
    """
    results = {
        'total_examples': 0,
        'total_masked_tokens': 0,
        'correct_predictions': 0,
        'losses': [],
        'top_k_accuracies': {1: 0, 3: 0, 5: 0},
        'inference_times': []
    }
    
    print(f"Evaluating on {len(test_texts)} examples...")
    
    for text_idx, text in enumerate(test_texts):
        # Tokenize and mask
        token_ids = tokenizer.encode(text)
        masked_ids, target_ids, mask_positions = mask_tokens(
            np.array([token_ids]),
            vocab_size=len(tokenizer.vocab),
            mask_id=tokenizer.vocab['[MASK]'],
            p_mask=mask_prob
        )
        
        # Extract actual masked positions and targets
        actual_positions = []
        actual_targets = []
        
        if isinstance(target_ids, np.ndarray) and target_ids.shape == masked_ids.shape:
            sentinel = -100
            for pos in range(len(target_ids[0])):
                if target_ids[0][pos] != sentinel:
                    actual_positions.append(pos)
                    actual_targets.append(target_ids[0][pos])
        
        if not actual_positions:
            continue
        
        results['total_examples'] += 1
        
        # Time inference (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        logits, _ = model.forward(masked_ids)
        inference_time = time.perf_counter() - start_time
        results['inference_times'].append(inference_time)
        
        # Convert to probabilities
        probabilities = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
        probabilities = probabilities / np.sum(probabilities, axis=-1, keepdims=True)
        
        # Evaluate each masked token
        for pos, target_id in zip(actual_positions, actual_targets):
            pos_probs = probabilities[0, pos]
            
            # Top-k predictions
            top_indices = np.argsort(pos_probs)[::-1]
            
            # Check top-k accuracy
            for k in [1, 3, 5]:
                if target_id in top_indices[:k]:
                    results['top_k_accuracies'][k] += 1
            
            # Compute loss
            token_loss = -np.log(pos_probs[target_id] + 1e-10)
            results['losses'].append(token_loss)
            
            results['total_masked_tokens'] += 1
    
    # Compute final metrics
    if results['total_masked_tokens'] > 0:
        for k in results['top_k_accuracies']:
            results['top_k_accuracies'][k] /= results['total_masked_tokens']
        
        results['average_loss'] = np.mean(results['losses'])
        results['perplexity'] = np.exp(results['average_loss'])
    
    if results['inference_times']:
        results['avg_inference_time'] = np.mean(results['inference_times'])
        results['total_inference_time'] = np.sum(results['inference_times'])
    
    return results

# Create evaluation dataset
evaluation_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is revolutionizing artificial intelligence.",
    "The weather today is sunny and warm.",
    "She finished her homework before dinner.",
    "The library has many interesting books to read.",
    "Computer science requires logical thinking skills.",
    "The concert was absolutely amazing last night.",
    "Fresh vegetables are important for good health."
]

# Run evaluation
print("Running comprehensive evaluation...")
eval_results = comprehensive_evaluation(model, tokenizer, evaluation_texts)

# Display results
print("\n" + "=" * 60)
print("EVALUATION RESULTS")
print("=" * 60)

print(f"Total examples evaluated: {eval_results['total_examples']}")
print(f"Total masked tokens: {eval_results['total_masked_tokens']}")

if eval_results['total_masked_tokens'] > 0:
    print("\nAccuracy Metrics:")
    for k, acc in eval_results['top_k_accuracies'].items():
        print(f"  Top-{k} accuracy: {acc:.3f} ({acc*100:.1f}%)")
    
    print("\nLoss Metrics:")
    print(f"  Average loss: {eval_results['average_loss']:.4f}")
    print(f"  Perplexity: {eval_results['perplexity']:.2f}")

if eval_results['inference_times']:
    print("\nPerformance Metrics:")
    print(f"  Average inference time: {eval_results['avg_inference_time']*1000:.2f} ms")
    print(f"  Total inference time: {eval_results['total_inference_time']:.2f} seconds")
    
    # Only masked positions are counted here, so this is the rate of scored
    # tokens rather than of all tokens passed through the model
    tokens_per_second = eval_results['total_masked_tokens'] / eval_results['total_inference_time']
    print(f"  Masked tokens scored per second: {tokens_per_second:.1f}")

## Part 3: Probing Tasks

Understanding what linguistic knowledge BERT has learned.
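
One way to turn centroid comparisons into a single number is a nearest-centroid probe: fit one centroid per tag, assign each held-out vector to the tag with the most cosine-similar centroid, and measure agreement with the gold tags. The sketch below uses toy 2-D vectors in place of hidden states. The helper names `cosine`, `fit_centroids`, and `predict_tag` are illustrative and do not come from the notebook.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def fit_centroids(vectors, tags):
    """Average the vectors that share a tag."""
    by_tag = {}
    for vec, tag in zip(vectors, tags):
        by_tag.setdefault(tag, []).append(vec)
    return {tag: np.mean(vecs, axis=0) for tag, vecs in by_tag.items()}

def predict_tag(vector, centroids):
    """Pick the tag whose centroid is most similar to the vector."""
    return max(centroids, key=lambda tag: cosine(vector, centroids[tag]))

# Hypothetical "training" representations with gold tags
train_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
train_tags = ["NOUN", "NOUN", "VERB", "VERB"]
centroids = fit_centroids(train_vecs, train_tags)

# Held-out vectors that were not used to fit the centroids
test_vecs = np.array([[0.8, 0.3], [0.0, 1.0]])
test_tags = ["NOUN", "VERB"]
preds = [predict_tag(vec, centroids) for vec in test_vecs]
probe_accuracy = sum(p == t for p, t in zip(preds, test_tags)) / len(test_tags)

print(f"Probe accuracy: {probe_accuracy:.2f}")
```

Scoring on vectors that were held out from centroid fitting keeps the probe from simply memorizing its inputs.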

In [None]:
def simple_pos_probe(model, tokenizer, sentences_with_pos):
    """
    Simple POS tagging probe to test syntactic knowledge.
    """
    print("Running POS Tagging Probe...")
    
    # Extract representations
    all_representations = []
    all_pos_tags = []
    
    for sentence, pos_tags in sentences_with_pos:
        # Tokenize
        token_ids = tokenizer.encode(sentence)
        input_batch = np.array([token_ids])
        
        # Get representations from final layer
        logits, cache = model.forward(input_batch)
        
        # Extract hidden states (before final projection)
        # We'll use the logits as a proxy for final hidden states
        # In a real implementation, you'd extract the actual hidden states
        representations = logits[0]  # [seq_len, vocab_size]
        
        # Align with POS tags (simplified)
        tokens = sentence.split()
        min_len = min(len(tokens), len(pos_tags), representations.shape[0])
        
        for i in range(min_len):
            all_representations.append(representations[i])
            all_pos_tags.append(pos_tags[i])
    
    # Simple analysis: cluster representations by POS tag
    pos_tag_reps = {}
    for rep, tag in zip(all_representations, all_pos_tags):
        if tag not in pos_tag_reps:
            pos_tag_reps[tag] = []
        pos_tag_reps[tag].append(rep)
    
    # Compute average representation for each POS tag
    pos_centroids = {}
    for tag, reps in pos_tag_reps.items():
        if len(reps) > 0:
            pos_centroids[tag] = np.mean(reps, axis=0)
    
    print(f"\nAnalyzed {len(all_representations)} tokens")
    print(f"Found {len(pos_centroids)} POS tags: {list(pos_centroids.keys())}")
    
    # Compute pairwise similarities between POS centroids
    if len(pos_centroids) > 1:
        print("\nPOS Tag Similarities (cosine similarity):")
        tags = list(pos_centroids.keys())
        
        for i, tag1 in enumerate(tags):
            for j, tag2 in enumerate(tags):
                if i < j:
                    vec1 = pos_centroids[tag1]
                    vec2 = pos_centroids[tag2]
                    
                    # Cosine similarity
                    similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
                    print(f"  {tag1} - {tag2}: {similarity:.3f}")
    
    return pos_centroids

# Simple POS-tagged sentences for probing
pos_examples = [
    ("The cat sat", ["DET", "NOUN", "VERB"]),
    ("She runs quickly", ["PRON", "VERB", "ADV"]),
    ("Big dogs bark", ["ADJ", "NOUN", "VERB"]),
    ("The quick fox", ["DET", "ADJ", "NOUN"]),
]

pos_centroids = simple_pos_probe(model, tokenizer, pos_examples)

# Semantic similarity probe
def semantic_similarity_probe(model, tokenizer, word_pairs):
    """
    Test semantic similarity understanding.
    """
    print("\n" + "=" * 50)
    print("SEMANTIC SIMILARITY PROBE")
    print("=" * 50)
    
    for pair_type, pairs in word_pairs.items():
        print(f"\n{pair_type.upper()} pairs:")
        
        similarities = []
        for word1, word2 in pairs:
            # Get representations for each word in context
            sentence1 = f"The {word1} is here"
            sentence2 = f"The {word2} is here"
            
            # Tokenize and get representations
            ids1 = tokenizer.encode(sentence1)
            ids2 = tokenizer.encode(sentence2)
            
            logits1, _ = model.forward(np.array([ids1]))
            logits2, _ = model.forward(np.array([ids2]))
            
            # Find word positions (simplified - assumes single token)
            # In practice, you'd need more sophisticated alignment
            if len(ids1) > 2 and len(ids2) > 2:
                rep1 = logits1[0, 1]  # Assume word is at position 1
                rep2 = logits2[0, 1]
                
                # Cosine similarity
                similarity = np.dot(rep1, rep2) / (np.linalg.norm(rep1) * np.linalg.norm(rep2))
                similarities.append(similarity)
                
                print(f"  {word1} - {word2}: {similarity:.3f}")
        
        if similarities:
            print(f"  Average similarity: {np.mean(similarities):.3f}")

# Semantic word pairs
word_pairs = {
    "synonyms": [("big", "large"), ("small", "tiny"), ("happy", "glad")],
    "antonyms": [("hot", "cold"), ("big", "small"), ("good", "bad")],
    "related": [("dog", "cat"), ("car", "truck"), ("book", "read")],
    "unrelated": [("dog", "computer"), ("happy", "table"), ("car", "music")]
}

semantic_similarity_probe(model, tokenizer, word_pairs)

## Part 4: Performance Analysis

Analyzing computational performance and efficiency.
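
Single-shot timings are noisy, so a common pattern is to run a few untimed warm-up calls and then report the median over several repeats. Here is a minimal sketch of that pattern using only the standard library. `dummy_forward` is a made-up stand-in workload, not a model call, and `time_call` is a hypothetical helper.

```python
import statistics
import time

def time_call(fn, *args, warmup=2, repeats=10):
    """Return per-call timings in seconds after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return timings

def dummy_forward(n):
    """Stand-in workload whose cost is quadratic in n (not a real model)."""
    return sum(i * j for i in range(n) for j in range(n))

timings = time_call(dummy_forward, 50)
median_ms = statistics.median(timings) * 1000
print(f"median: {median_ms:.3f} ms over {len(timings)} runs")
```

The median is less sensitive than the mean to an occasional slow run caused by other activity on the machine.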

In [None]:
def performance_benchmark(model, tokenizer, test_cases):
    """
    Benchmark model performance across different scenarios.
    """
    print("=" * 60)
    print("PERFORMANCE BENCHMARK")
    print("=" * 60)
    
    results = {}
    
    for test_name, test_data in test_cases.items():
        print(f"\nTesting: {test_name}")
        print("-" * 30)
        
        times = []
        memory_usage = []
        
        for text in test_data:
            # Tokenize
            token_ids = tokenizer.encode(text)
            input_batch = np.array([token_ids])
            
            # Time the forward pass (perf_counter is a monotonic, high-resolution clock)
            start_time = time.perf_counter()
            logits, cache = model.forward(input_batch)
            end_time = time.perf_counter()
            
            inference_time = end_time - start_time
            times.append(inference_time)
            
            # Estimate memory usage (rough approximation)
            memory_estimate = logits.nbytes + sum(
                v.nbytes for v in cache.values() if isinstance(v, np.ndarray)
            )
            memory_usage.append(memory_estimate)
        
        # Compute statistics
        avg_time = np.mean(times)
        std_time = np.std(times)
        avg_memory = np.mean(memory_usage)
        
        # Tokens per second
        total_tokens = sum(len(tokenizer.encode(text)) for text in test_data)
        tokens_per_second = total_tokens / sum(times)
        
        results[test_name] = {
            'avg_time_ms': avg_time * 1000,
            'std_time_ms': std_time * 1000,
            'avg_memory_mb': avg_memory / (1024 * 1024),
            'tokens_per_second': tokens_per_second,
            'num_examples': len(test_data)
        }
        
        print(f"  Examples: {len(test_data)}")
        print(f"  Avg time: {avg_time*1000:.2f} ± {std_time*1000:.2f} ms")
        print(f"  Avg memory: {avg_memory/(1024*1024):.2f} MB")
        print(f"  Throughput: {tokens_per_second:.1f} tokens/sec")
    
    return results

# Create performance test cases
performance_tests = {
    "Short sentences": [
        "The cat sat.",
        "She runs fast.",
        "Dogs bark loudly.",
        "Birds fly high.",
        "Cars drive slowly."
    ],
    "Medium sentences": [
        "The quick brown fox jumps over the lazy dog.",
        "Machine learning models require large amounts of training data.",
        "The weather today is sunny and warm with clear skies.",
        "She finished her homework before dinner and watched television.",
        "The library has many interesting books to read and study."
    ],
    "Long sentences": [
        "The artificial intelligence system demonstrated remarkable capabilities in natural language understanding and generation tasks across multiple domains.",
        "Recent advances in deep learning have revolutionized computer vision, natural language processing, and many other fields of artificial intelligence research.",
        "The comprehensive evaluation showed that the model achieved state-of-the-art performance on several benchmark datasets while maintaining computational efficiency."
    ]
}

# Run benchmark
benchmark_results = performance_benchmark(model, tokenizer, performance_tests)

# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

test_names = list(benchmark_results.keys())
avg_times = [benchmark_results[name]['avg_time_ms'] for name in test_names]
avg_memory = [benchmark_results[name]['avg_memory_mb'] for name in test_names]
throughput = [benchmark_results[name]['tokens_per_second'] for name in test_names]

# Inference time
bars1 = axes[0].bar(test_names, avg_times, color='skyblue')
axes[0].set_ylabel('Time (ms)')
axes[0].set_title('Average Inference Time')
axes[0].tick_params(axis='x', rotation=45)

for bar, time_val in zip(bars1, avg_times):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{time_val:.1f}', ha='center', va='bottom')

# Memory usage
bars2 = axes[1].bar(test_names, avg_memory, color='lightcoral')
axes[1].set_ylabel('Memory (MB)')
axes[1].set_title('Average Memory Usage')
axes[1].tick_params(axis='x', rotation=45)

for bar, mem_val in zip(bars2, avg_memory):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{mem_val:.2f}', ha='center', va='bottom')

# Throughput
bars3 = axes[2].bar(test_names, throughput, color='lightgreen')
axes[2].set_ylabel('Tokens/Second')
axes[2].set_title('Throughput')
axes[2].tick_params(axis='x', rotation=45)

for bar, throughput_val in zip(bars3, throughput):
    axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{throughput_val:.1f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nPerformance Summary:")
print("• Longer sentences take more time (self-attention cost grows quadratically with sequence length)")
print("• Memory usage scales with sequence length")
print("• Throughput varies with sequence length")

## Part 5: Model Analysis and Limitations

Understanding what the model does well and where it struggles.

In [None]:
def analyze_model_behavior(model, tokenizer):
    """
    Analyze model behavior on various linguistic phenomena.
    """
    print("=" * 60)
    print("MODEL BEHAVIOR ANALYSIS")
    print("=" * 60)
    
    # Test cases for different linguistic phenomena
    test_cases = {
        "Simple Prediction": [
            "The cat sat on the [MASK].",
            "I drink [MASK] in the morning.",
            "The [MASK] is yellow."
        ],
        "Grammar": [
            "She [MASK] to school yesterday.",  # verb tense
            "The [MASK] are flying.",           # plural agreement
            "He [MASK] very tall.",             # is/am/are
        ],
        "World Knowledge": [
            "The capital of France is [MASK].",
            "Shakespeare wrote [MASK].",
            "The largest planet is [MASK]."
        ],
        "Context Dependency": [
            "The man went to the bank to [MASK] money.",        # financial context
            "The river bank was covered with [MASK].",          # geographical context
            "She couldn't see the movie because [MASK] was tall." # pronoun reference
        ]
    }
    
    analysis_results = {}
    
    for category, examples in test_cases.items():
        print(f"\n{category.upper()}:")
        print("-" * 40)
        
        category_results = []
        
        for example in examples:
            print(f"\nInput: {example}")
            
            # Tokenize
            token_ids = tokenizer.encode(example)
            input_batch = np.array([token_ids])
            
            # Find mask position
            mask_token_id = tokenizer.vocab['[MASK]']
            mask_positions = [i for i, token_id in enumerate(token_ids) if token_id == mask_token_id]
            
            if mask_positions:
                # Forward pass
                logits, _ = model.forward(input_batch)
                probabilities = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
                probabilities = probabilities / np.sum(probabilities, axis=-1, keepdims=True)
                
                # Get top predictions
                pos = mask_positions[0]
                pos_probs = probabilities[0, pos]
                top_indices = np.argsort(pos_probs)[-3:][::-1]  # Top 3
                
                predictions = []
                for idx in top_indices:
                    try:
                        token = tokenizer.decode([idx]).strip()
                    except Exception:
                        # Fall back to a placeholder if the ID cannot be decoded
                        token = f'UNK_{idx}'
                    predictions.append((token, pos_probs[idx]))
                
                print("Predictions:")
                for i, (token, prob) in enumerate(predictions):
                    print(f"  {i+1}. '{token}' ({prob*100:.1f}%)")
                
                category_results.append({
                    'input': example,
                    'predictions': predictions
                })
        
        analysis_results[category] = category_results
    
    return analysis_results

# Run analysis
behavior_analysis = analyze_model_behavior(model, tokenizer)

print("\n" + "=" * 60)
print("ANALYSIS SUMMARY")
print("=" * 60)

print("\nCapabilities This Analysis Targets (expected only after training):")
print("• Basic word prediction from context")
print("• Grammatical awareness such as tense and agreement")
print("• Context-sensitive predictions")

print("\nModel Limitations:")
print("• Limited world knowledge (untrained model)")
print("• May struggle with complex reasoning")
print("• Predictions based on statistical patterns only")
print("• No real understanding of meaning")

print("\nNote: This is an untrained/randomly initialized model.")
print("With proper training on large corpora, performance would improve significantly.")

## Part 6: Evaluation Best Practices

Guidelines for proper model evaluation.
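
The checklist in the cell below recommends reporting confidence intervals. One simple way to get an interval for an accuracy figure is the percentile bootstrap: resample the per-example outcomes with replacement and read off the spread of the resampled means. The sketch below uses made-up outcomes (30 correct out of 100); `bootstrap_ci` is a hypothetical helper, not something the notebook's modules provide.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Draw n outcomes with replacement and record their mean
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical per-token outcomes: 1 = correct prediction, 0 = incorrect
outcomes = [1] * 30 + [0] * 70
point_estimate = sum(outcomes) / len(outcomes)
ci_low, ci_high = bootstrap_ci(outcomes)

print(f"accuracy = {point_estimate:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

With only a handful of evaluation sentences, as in this notebook, the interval will be wide, which is itself useful information when comparing models.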

In [None]:
def evaluation_checklist():
    """
    Display evaluation best practices.
    """
    print("=" * 60)
    print("EVALUATION BEST PRACTICES CHECKLIST")
    print("=" * 60)
    
    practices = {
        "Data Preparation": [
            "Use held-out test data (never seen during training)",
            "Ensure test data is representative of intended use",
            "Balance test set across different domains/topics",
            "Document data collection and preprocessing steps"
        ],
        "Evaluation Metrics": [
            "Use multiple complementary metrics",
            "Report confidence intervals where possible",
            "Include both automatic and human evaluation",
            "Consider task-specific metrics"
        ],
        "Experimental Setup": [
            "Fix random seeds for reproducibility",
            "Run multiple evaluation rounds",
            "Compare against meaningful baselines",
            "Document hyperparameters and model configuration"
        ],
        "Analysis": [
            "Analyze failure cases and error patterns",
            "Test on edge cases and challenging examples",
            "Evaluate computational efficiency",
            "Consider ethical implications and biases"
        ],
        "Reporting": [
            "Report both strengths and limitations",
            "Include statistical significance tests",
            "Provide examples of model outputs",
            "Make evaluation code and data available"
        ]
    }
    
    for category, items in practices.items():
        print(f"\n{category.upper()}:")
        for item in items:
            print(f"  ✓ {item}")
    
    print("\n" + "=" * 60)
    print("COMMON EVALUATION PITFALLS TO AVOID")
    print("=" * 60)
    
    pitfalls = [
        "Training on test data (data leakage)",
        "Overfitting hyperparameters to test set",
        "Using only a single metric",
        "Ignoring computational costs",
        "Not testing on diverse examples",
        "Cherry-picking results",
        "Inadequate baseline comparisons",
        "Not reporting confidence intervals",
        "Ignoring model biases and fairness",
        "Not validating evaluation metrics themselves"
    ]
    
    for pitfall in pitfalls:
        print(f"  ❌ {pitfall}")

# Display checklist
evaluation_checklist()

# Summary of metrics used in this notebook
print("\n" + "=" * 60)
print("METRICS USED IN THIS NOTEBOOK")
print("=" * 60)

metrics_summary = {
    "Accuracy Metrics": {
        "Top-k Accuracy": "Percentage of times true token is in top-k predictions",
        "Exact Match": "Percentage of exactly correct predictions"
    },
    "Loss Metrics": {
        "Cross-entropy Loss": "Standard loss for classification tasks",
        "Perplexity": "Exponential of cross-entropy, measures uncertainty"
    },
    "Performance Metrics": {
        "Inference Time": "Time to process input and generate output",
        "Memory Usage": "RAM required during inference",
        "Throughput": "Tokens processed per second"
    },
    "Linguistic Metrics": {
        "POS Centroid Similarity": "Cosine similarity between average representations of part-of-speech categories",
        "Semantic Similarity": "Cosine similarity between word representations"
    }
}

for category, metrics in metrics_summary.items():
    print(f"\n{category}:")
    for metric, description in metrics.items():
        print(f"  • {metric}: {description}")

## Summary: Key Evaluation Concepts

### **1. Types of Evaluation**
- **Intrinsic**: MLM accuracy, perplexity
- **Extrinsic**: Downstream task performance
- **Probing**: Test specific linguistic knowledge
- **Human**: Qualitative assessment by people

### **2. Essential Metrics**
- **Accuracy**: How often predictions are correct
- **Perplexity**: Exponential of the average cross-entropy; lower values mean less uncertainty
- **Efficiency**: Speed and memory usage
- **Robustness**: Performance on edge cases

### **3. Evaluation Process**
1. **Prepare**: Clean, representative test data
2. **Measure**: Multiple complementary metrics
3. **Analyze**: Error patterns and limitations
4. **Compare**: Against baselines and humans
5. **Report**: Transparent, complete results

### **4. Key Insights**
- No single metric tells the full story
- Performance varies across different tasks
- Computational efficiency matters in practice
- Understanding failures is as important as successes

### **5. Next Steps**
- Train model on real data for better performance
- Implement more sophisticated evaluation metrics
- Test on standardized benchmarks
- Conduct human evaluation studies

## Exercises

1. **Custom Metrics**: Implement a metric to measure how well the model handles negation ("not good" vs "good").

2. **Bias Analysis**: Test the model for gender, racial, or other biases in its predictions.

3. **Domain Transfer**: Evaluate how well the model performs on different domains (medical, legal, technical).

4. **Multilingual**: Test the model's behavior on non-English text if your tokenizer supports it.

In [None]:
# Space for your experiments
