# Exercise 4: BERT Classification

Welcome to modern NLP with BERT! You'll learn how to fine-tune state-of-the-art language models for classification tasks.

## Learning Objectives
By the end of this exercise, you will be able to:
1. **BERT Architecture**: Understand transformer-based language models and attention mechanisms
2. **German BERT Models**: Work with pre-trained German BERT variants (GBERT, DistilBERT)
3. **Fine-tuning Process**: Adapt pre-trained models for specific classification tasks
4. **Tokenization**: Handle BERT's WordPiece tokenization for German text
5. **Performance Comparison**: Compare BERT with traditional ML approaches
6. **Model Evaluation**: Assess BERT model performance with appropriate metrics

## What You'll Build
- German sentiment classifier using BERT
- Performance comparison framework
- BERT model fine-tuning pipeline
- Attention visualization system
- Production-ready classification API

## Applications
- **Advanced Sentiment Analysis**: More nuanced emotion detection
- **Document Classification**: High-accuracy topic categorization
- **Intent Recognition**: Chatbot and voice assistant improvements
- **Content Moderation**: Sophisticated harmful content detection

**Ready to harness the power of transformers?** 🤖⚡

## Exercise 1: German BERT Classification Pipeline

**Goal**: Build and fine-tune a German BERT model for sentiment classification.

**Your Tasks**: 
1. Load and explore pre-trained German BERT models
2. Prepare data for BERT fine-tuning
3. Fine-tune BERT on German sentiment data
4. Compare BERT performance with traditional methods

**Hints**:
- Use 'dbmdz/bert-base-german-cased' for German text
- BERT requires special tokenization with [CLS] and [SEP] tokens
- Fine-tuning typically needs only a few epochs
- Learning rates for BERT fine-tuning are usually much smaller than for training from scratch (e.g., 2e-5)
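
The `[CLS]`/`[SEP]` hint can be made concrete with a small sketch. It is plain Python, not the Hugging Face tokenizer; the helper name `build_bert_input` and the tiny `max_length` are assumptions chosen for illustration. It shows how a token list is wrapped in special tokens, padded, and paired with an attention mask.

```python
def build_bert_input(tokens, max_length=8):
    """Wrap tokens in [CLS]/[SEP], truncate, pad, and build the attention mask."""
    # Reserve two positions for the special tokens, then truncate the rest.
    body = tokens[: max_length - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(sequence)
    padding = max_length - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask


tokens, mask = build_bert_input(["Das", "ist", "toll", "!"], max_length=8)
print(tokens)
print(mask)
```

In the notebook itself the tokenizer does all of this in one call when `padding` and `truncation` are set.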

### Setup and Imports

In [None]:
# Core imports used throughout this notebook.
# If an import fails, install the missing package, e.g.:
#   pip install torch transformers datasets scikit-learn pandas matplotlib seaborn
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from datasets import Dataset as HFDataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
    pipeline,
)

warnings.filterwarnings('ignore')

# Smoke test: try to load a ready-made German sentiment pipeline
try:
    classifier = pipeline("sentiment-analysis", model="oliverguhr/german-sentiment-bert")
    print("German BERT model loaded successfully!")
except Exception as e:
    print(f"Could not load the German sentiment pipeline: {e}")
    classifier = None

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print("Libraries imported successfully!")
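
Before running the later cells, it can help to check which packages are actually installed. The following is a minimal sketch using only the standard library; the package list is an assumption based on the libraries this notebook uses, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Import name -> pip package name (assumed list, based on this notebook's cells).
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "datasets": "datasets",
    "sklearn": "scikit-learn",
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def missing_packages(required=REQUIRED):
    """Return pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages. Install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

`find_spec` only looks the module up, so the check is fast and does not import heavy libraries such as `torch`.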

### Step 1: Load German BERT Model and Tokenizer

In [None]:
def load_german_bert_model(model_name="bert-base-german-cased", num_labels=3):
    """
    Load German BERT model and tokenizer.
    
    Args:
        model_name (str): Name of the pre-trained model
        num_labels (int): Number of classification labels
    
    Returns:
        tuple: (tokenizer, model)
    """
    # TODO: Load and configure German BERT model:
    # 1. Load tokenizer for German text processing
    # 2. Load pre-trained model for sequence classification
    # 3. Configure for the specific number of labels
    # 4. Move model to appropriate device (GPU/CPU)
    
    print(f"Loading German BERT model: {model_name}")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"Tokenizer vocabulary size: {tokenizer.vocab_size}")
    
    # Load model for sequence classification
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )
    
    # Move to device
    model.to(device)
    
    print(f"Model loaded successfully!")
    print(f"Model parameters: {model.num_parameters():,}")
    
    return tokenizer, model

def test_tokenizer(tokenizer, sample_texts):
    """
    Test the tokenizer with sample German texts.
    
    Args:
        tokenizer: Loaded tokenizer
        sample_texts (list): Sample texts to tokenize
    """
    print("\nTokenizer Testing:")
    print("=" * 30)
    
    for i, text in enumerate(sample_texts):
        print(f"\nSample {i+1}: {text}")
        
        # Tokenize
        tokens = tokenizer.tokenize(text)
        print(f"Tokens: {tokens}")
        
        # Encode
        encoded = tokenizer.encode(text, add_special_tokens=True, max_length=64, truncation=True)
        print(f"Token IDs: {encoded}")
        
        # Decode back
        decoded = tokenizer.decode(encoded)
        print(f"Decoded: {decoded}")

# Load German BERT model
tokenizer, model = load_german_bert_model()

# Test tokenizer with German samples
sample_texts = [
    "Das ist ein großartiger Film!",
    "Ich bin sehr enttäuscht von diesem Produkt.",
    "Ein durchschnittliches Restaurant mit normalem Service."
]

test_tokenizer(tokenizer, sample_texts)
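
To see why the tokenizer output contains pieces prefixed with `##`, here is a simplified sketch of greedy longest-match-first WordPiece splitting. The vocabulary below is a hypothetical toy set, not the real German BERT vocabulary, and the real tokenizer adds normalization and pre-tokenization steps that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"groß", "##artig", "##er", "Film", "ent", "##täuscht"}
print(wordpiece_tokenize("großartiger", toy_vocab))
print(wordpiece_tokenize("enttäuscht", toy_vocab))
print(wordpiece_tokenize("Kino", toy_vocab))
```

Long German compounds tend to split into several pieces, which is worth keeping in mind when choosing `max_length`.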

### Step 2: Create Enhanced Dataset

In [None]:
def create_enhanced_sentiment_dataset():
    """
    Create a larger, more diverse German sentiment dataset.
    
    Returns:
        pandas.DataFrame: Enhanced dataset
    """
    # TODO: Create a comprehensive German sentiment dataset:
    # 1. Include more diverse examples
    # 2. Add different domains (movies, products, services)
    # 3. Ensure balanced classes
    # 4. Include varying text lengths
    
    data = {
        'text': [
            # Positive examples (movies)
            "Dieser Film ist absolut fantastisch und sehr bewegend!",
            "Ein Meisterwerk des deutschen Kinos mit hervorragenden Schauspielern.",
            "Brillante Regie und eine fesselnde Geschichte von Anfang bis Ende.",
            "Ich war begeistert von der visuellen Pracht und der emotionalen Tiefe.",
            "Ein Film, den man immer wieder sehen kann - einfach wunderbar!",
            
            # Positive examples (products)
            "Dieses Produkt übertrifft alle meine Erwartungen bei weitem.",
            "Ausgezeichnete Qualität und sehr benutzerfreundlich.",
            "Ich bin absolut zufrieden mit diesem Kauf und kann es nur empfehlen.",
            "Hervorragende Verarbeitung und tolles Design.",
            "Das beste Produkt in dieser Preisklasse, ohne Zweifel.",
            
            # Positive examples (services)
            "Der Kundenservice war außergewöhnlich hilfsbereit und kompetent.",
            "Schnelle Lieferung und perfekte Verpackung.",
            "Das Personal war sehr freundlich und professionell.",
            "Eine rundum positive Erfahrung, die ich gerne wiederholen würde.",
            "Exzellenter Service mit persönlicher Betreuung.",
            
            # Negative examples (movies)
            "Dieser Film ist langweilig und vorhersehbar.",
            "Schlechte Schauspieler und eine verwirrende Handlung.",
            "Zwei Stunden verschwendete Zeit - absolut enttäuschend.",
            "Die Dialoge sind schlecht und die Effekte wirken billig.",
            "Ein Film ohne Seele und ohne jeden künstlerischen Wert.",
            
            # Negative examples (products)
            "Das Produkt ist völlig unbrauchbar und schlecht verarbeitet.",
            "Schlechte Qualität für einen so hohen Preis.",
            "Nach einer Woche bereits kaputt - nie wieder!",
            "Funktioniert nicht wie beschrieben und der Support ist unhöflich.",
            "Geldverschwendung - würde es nicht einmal verschenken.",
            
            # Negative examples (services)
            "Furchtbarer Kundenservice und sehr lange Wartezeiten.",
            "Unprofessionelles Personal und schlechte Organisation.",
            "Verspätete Lieferung und beschädigte Ware.",
            "Unfreundlich und inkompetent - eine Katastrophe.",
            "Der schlechteste Service, den ich je erlebt habe.",
            
            # Neutral examples (mixed)
            "Das Produkt erfüllt seinen Zweck, mehr aber auch nicht.",
            "Ein durchschnittlicher Film mit einigen guten Momenten.",
            "Der Service war in Ordnung, nichts Besonderes.",
            "Akzeptable Qualität für den Preis.",
            "Weder besonders gut noch besonders schlecht.",
            "Ein normales Restaurant mit standardmäßigem Essen.",
            "Das Personal war höflich, aber nicht sehr aufmerksam.",
            "Funktioniert wie erwartet, ohne Überraschungen.",
            "Ein gewöhnliches Produkt ohne besondere Merkmale.",
            "Mittelmäßige Leistung in allen Bereichen.",
            "Eine durchschnittliche Erfahrung, die man vergisst.",
            "Solide Qualität, aber nichts Außergewöhnliches.",
            "Der Film war okay, aber nicht unvergesslich.",
            "Angemessener Preis f√ºr angemessene Leistung.",
            "Eine neutrale Bewertung f√ºr ein neutrales Erlebnis."
        ],
        'sentiment': (
            ['positive'] * 15 + 
            ['negative'] * 15 + 
            ['neutral'] * 15
        ),
        'domain': (
            ['movie'] * 5 + ['product'] * 5 + ['service'] * 5 +  # positive
            ['movie'] * 5 + ['product'] * 5 + ['service'] * 5 +  # negative
            ['mixed'] * 15  # neutral
        )
    }
    
    df = pd.DataFrame(data)
    
    # Add text length information
    df['text_length'] = df['text'].apply(len)
    df['word_count'] = df['text'].apply(lambda x: len(x.split()))
    
    return df

# Create enhanced dataset
df = create_enhanced_sentiment_dataset()

print(f"Dataset created with {len(df)} samples")
print(f"\nClass distribution:")
print(df['sentiment'].value_counts())
print(f"\nDomain distribution:")
print(df['domain'].value_counts())
print(f"\nText length statistics:")
print(df['text_length'].describe())

# Display sample data
print("\nSample data:")
print(df.head())
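
The "balanced classes" requirement can be checked without pandas. This is a small sketch with a hypothetical helper `class_balance`; it reports counts, shares, and the ratio between the largest and smallest class, using the same 15/15/15 label layout as the dataset in this step.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts, per-class shares, and the max/min count ratio."""
    counts = Counter(labels)
    total = len(labels)
    shares = {label: count / total for label, count in counts.items()}
    # A ratio of 1.0 means perfectly balanced classes.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return counts, shares, imbalance_ratio


labels = ["positive"] * 15 + ["negative"] * 15 + ["neutral"] * 15
counts, shares, ratio = class_balance(labels)
print(dict(counts))
print({label: round(share, 3) for label, share in shares.items()})
print(f"Imbalance ratio: {ratio:.2f}")
```

A ratio well above 1.0 would suggest using class weights or a macro-averaged metric rather than plain accuracy.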

### Step 3: Prepare Data for BERT

In [None]:
class SentimentDataset(Dataset):
    """
    Custom Dataset class for sentiment analysis with BERT.
    """
    
    def __init__(self, texts, labels, tokenizer, max_length=128):
        """
        Initialize the dataset.
        
        Args:
            texts (list): List of text samples
            labels (list): List of labels
            tokenizer: BERT tokenizer
            max_length (int): Maximum sequence length
        """
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        
        # Create label mapping
        self.label_to_id = {'negative': 0, 'neutral': 1, 'positive': 2}
        self.id_to_label = {v: k for k, v in self.label_to_id.items()}
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        """
        Get a single sample.
        
        Args:
            idx (int): Sample index
        
        Returns:
            dict: Tokenized input with labels
        """
        # TODO: Implement data preparation for BERT:
        # 1. Tokenize text with proper padding and truncation
        # 2. Convert labels to numerical format
        # 3. Return tensors in the correct format
        
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        # Tokenize text
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.label_to_id[label], dtype=torch.long)
        }

def prepare_datasets(df, tokenizer, test_size=0.2, val_size=0.1, max_length=128):
    """
    Prepare train, validation, and test datasets.
    
    Args:
        df (pandas.DataFrame): Input dataframe
        tokenizer: BERT tokenizer
        test_size (float): Test set proportion
        val_size (float): Validation set proportion
        max_length (int): Maximum sequence length
    
    Returns:
        tuple: (train_dataset, val_dataset, test_dataset)
    """
    # TODO: Split data and create datasets:
    # 1. Split into train/validation/test sets
    # 2. Ensure stratified sampling for balanced classes
    # 3. Create Dataset objects for each split
    
    # First split: separate test set
    train_val_texts, test_texts, train_val_labels, test_labels = train_test_split(
        df['text'].tolist(),
        df['sentiment'].tolist(),
        test_size=test_size,
        random_state=42,
        stratify=df['sentiment'].tolist()
    )
    
    # Second split: separate train and validation
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        train_val_texts,
        train_val_labels,
        test_size=val_size/(1-test_size),  # Adjust for already removed test set
        random_state=42,
        stratify=train_val_labels
    )
    
    print(f"Dataset splits:")
    print(f"  Train: {len(train_texts)} samples")
    print(f"  Validation: {len(val_texts)} samples")
    print(f"  Test: {len(test_texts)} samples")
    
    # Create Dataset objects
    train_dataset = SentimentDataset(train_texts, train_labels, tokenizer, max_length)
    val_dataset = SentimentDataset(val_texts, val_labels, tokenizer, max_length)
    test_dataset = SentimentDataset(test_texts, test_labels, tokenizer, max_length)
    
    return train_dataset, val_dataset, test_dataset

# Prepare datasets
train_dataset, val_dataset, test_dataset = prepare_datasets(df, tokenizer)

# Test dataset loading
print("\nSample from training dataset:")
sample = train_dataset[0]
print(f"Input IDs shape: {sample['input_ids'].shape}")
print(f"Attention mask shape: {sample['attention_mask'].shape}")
print(f"Label: {sample['labels']}")
print(f"Decoded text: {tokenizer.decode(sample['input_ids'], skip_special_tokens=True)}")
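
The two-stage split rescales the validation share because the second split only sees what is left after the test set is removed. The sketch below reproduces that arithmetic in plain Python. It assumes the held-out share is rounded up and the remainder is kept, which is how scikit-learn is understood to size float splits; treat the exact counts as an estimate and compare them with the sizes printed above.

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.1):
    """Estimate train/validation/test sizes for a two-stage split."""
    # Assumption: the held-out part is rounded up, the rest stays in the pool.
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test
    # The second split works on the remaining data, so rescale the share.
    adjusted_val = val_size / (1 - test_size)
    n_val = math.ceil(adjusted_val * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(45)  # 45 samples, as in the dataset above
print(f"train={n_train}, val={n_val}, test={n_test}")
```

With so few validation samples, each misclassified example moves validation accuracy by a large amount, so the metric is noisy.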

### Step 4: Fine-tune BERT Model

In [None]:
def setup_training_arguments(output_dir="./bert-sentiment-german"):
    """
    Setup training arguments for BERT fine-tuning.
    
    Args:
        output_dir (str): Directory to save model outputs
    
    Returns:
        TrainingArguments: Configured training arguments
    """
    # TODO: Configure training hyperparameters:
    # 1. Set learning rate, batch size, number of epochs
    # 2. Configure evaluation strategy
    # 3. Set up model saving and logging
    # 4. Add early stopping if needed
    
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,  # Small number for demonstration
        per_device_train_batch_size=8,  # Adjust based on GPU memory
        per_device_eval_batch_size=8,
        learning_rate=2e-5,  # Common learning rate for BERT fine-tuning
        weight_decay=0.01,
        logging_dir=f'{output_dir}/logs',
        logging_steps=10,
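        # Evaluate and save once per epoch: with this tiny dataset a 50-step
        # interval would never be reached. Newer transformers releases name
        # this argument eval_strategy.
        evaluation_strategy="epoch",
        save_strategy="epoch",  # Must match the evaluation strategy for load_best_model_at_end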
        load_best_model_at_end=True,
        metric_for_best_model="eval_accuracy",
        greater_is_better=True,
        save_total_limit=2,
        seed=42,
        report_to="none",  # Disable wandb/tensorboard for simplicity
    )
    
    return training_args

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics.
    
    Args:
        eval_pred: Evaluation predictions
    
    Returns:
        dict: Computed metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    
    return {
        'accuracy': accuracy,
    }

def fine_tune_bert(model, tokenizer, train_dataset, val_dataset):
    """
    Fine-tune BERT model for sentiment classification.
    
    Args:
        model: Pre-trained BERT model
        tokenizer: BERT tokenizer
        train_dataset: Training dataset
        val_dataset: Validation dataset
    
    Returns:
        Trainer: Trained model
    """
    # TODO: Implement BERT fine-tuning:
    # 1. Setup training arguments
    # 2. Create Trainer object
    # 3. Add callbacks (early stopping)
    # 4. Train the model
    # 5. Evaluate performance
    
    print("Setting up BERT fine-tuning...")
    
    # Setup training arguments
    training_args = setup_training_arguments()
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )
    
    print("Starting training...")
    print("Note: This may take several minutes depending on your hardware.")
    
    # Train the model
    trainer.train()
    
    print("Training completed!")
    
    # Evaluate on validation set
    eval_results = trainer.evaluate()
    print(f"\nValidation Results:")
    for key, value in eval_results.items():
        print(f"  {key}: {value:.4f}")
    
    return trainer

# Fine-tune BERT model
print("Starting BERT fine-tuning process...")
trainer = fine_tune_bert(model, tokenizer, train_dataset, val_dataset)
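
Step-based intervals for logging, evaluation, or saving only trigger if they do not exceed the total number of optimizer steps. The following sketch counts those steps, assuming a single device, no gradient accumulation, and roughly 31 training samples (an estimate derived from the 45-sample dataset and the split ratios above).

```python
import math


def training_steps(n_train, batch_size, epochs):
    """Return (steps per epoch, total optimizer steps) for a simple setup."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(n_train=31, batch_size=8, epochs=3)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")

# How often would a step-based interval fire during this run?
for interval in (10, 50):
    print(f"interval={interval}: fires {total_steps // interval} time(s)")
```

On datasets this small, epoch-based evaluation is usually the more predictable choice.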

### Step 5: Evaluate Model Performance

In [None]:
def evaluate_bert_model(trainer, test_dataset, tokenizer):
    """
    Comprehensive evaluation of the fine-tuned BERT model.
    
    Args:
        trainer: Trained Trainer object
        test_dataset: Test dataset
        tokenizer: BERT tokenizer
    
    Returns:
        dict: Evaluation results
    """
    # TODO: Implement comprehensive model evaluation:
    # 1. Evaluate on test set
    # 2. Generate predictions and probabilities
    # 3. Create confusion matrix
    # 4. Generate classification report
    # 5. Analyze errors and model behavior
    
    print("Evaluating BERT model on test set...")
    
    # Evaluate on test set
    test_results = trainer.evaluate(test_dataset)
    
    print(f"\nTest Set Results:")
    for key, value in test_results.items():
        print(f"  {key}: {value:.4f}")
    
    # Get predictions
    predictions = trainer.predict(test_dataset)
    y_pred = np.argmax(predictions.predictions, axis=1)
    y_true = predictions.label_ids
    
    # Label mapping
    id_to_label = {0: 'negative', 1: 'neutral', 2: 'positive'}
    label_names = ['negative', 'neutral', 'positive']
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred, target_names=label_names))
    
    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    
    plt.figure(figsize=(10, 4))
    
    # Plot confusion matrix
    plt.subplot(1, 2, 1)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=label_names, yticklabels=label_names)
    plt.title('BERT Confusion Matrix')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    
    # Plot class distribution
    plt.subplot(1, 2, 2)
    unique_labels, counts = np.unique(y_true, return_counts=True)
    plt.bar([label_names[i] for i in unique_labels], counts, color=['red', 'gray', 'green'])
    plt.title('Test Set Class Distribution')
    plt.xlabel('Sentiment Class')
    plt.ylabel('Count')
    
    plt.tight_layout()
    plt.show()
    
    # Error analysis
    print("\nError Analysis:")
    print("=" * 30)
    
    errors = []
    for i, (true_label, pred_label) in enumerate(zip(y_true, y_pred)):
        if true_label != pred_label:
            # Get original text
            sample = test_dataset[i]
            text = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
            
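            # Convert raw logits to softmax probabilities so that
            # "confidence" is a value between 0 and 1
            logits = predictions.predictions[i]
            probs = np.exp(logits - np.max(logits))
            probs = probs / probs.sum()

            errors.append({
                'text': text,
                'true_label': id_to_label[int(true_label)],
                'predicted_label': id_to_label[int(pred_label)],
                'confidence': float(np.max(probs))
            })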
    
    print(f"Total errors: {len(errors)} out of {len(y_true)} samples")
    print(f"Error rate: {len(errors)/len(y_true)*100:.2f}%")
    
    # Show some error examples
    if errors:
        print("\nSample errors:")
        for i, error in enumerate(errors[:5]):
            print(f"\nError {i+1}:")
            print(f"  Text: {error['text'][:100]}...")
            print(f"  True: {error['true_label']}, Predicted: {error['predicted_label']}")
            print(f"  Confidence: {error['confidence']:.3f}")
    
    return {
        'test_results': test_results,
        'predictions': predictions,
        'confusion_matrix': cm,
        'errors': errors
    }

# Evaluate the model
evaluation_results = evaluate_bert_model(trainer, test_dataset, tokenizer)
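
The values returned by `trainer.predict` are raw logits, not probabilities. A minimal sketch of the softmax conversion in plain Python follows; the logits are hypothetical numbers for a single sample.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ["negative", "neutral", "positive"]
logits = [2.0, 1.0, 0.1]  # hypothetical logits for one sample
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{label_names[best]} with probability {probs[best]:.3f}")
```

Softmax probabilities are easier to read than logits, although they are not guaranteed to be well calibrated.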

### Step 6: Compare with Traditional ML

In [None]:
def compare_with_traditional_ml(df, bert_results):
    """
    Compare BERT performance with traditional ML approaches.
    
    Args:
        df (pandas.DataFrame): Original dataset
        bert_results (dict): BERT evaluation results
    
    Returns:
        dict: Comparison results
    """
    # TODO: Implement comparison with traditional methods:
    # 1. Train TF-IDF + Logistic Regression baseline
    # 2. Train TF-IDF + SVM baseline
    # 3. Compare accuracy, precision, recall, F1-score
    # 4. Analyze training time and computational requirements
    # 5. Visualize performance comparison
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    import time
    
    print("Comparing BERT with Traditional ML Methods:")
    print("=" * 50)
    
    # Prepare data
    X = df['text'].tolist()
    y = df['sentiment'].tolist()
    
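    # Same test split as BERT (same test_size, seed, and stratification);
    # the BERT validation samples stay in the training portion here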
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # TF-IDF Vectorization
    print("\n1. Training TF-IDF + Logistic Regression...")
    start_time = time.time()
    
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X_train_tfidf = tfidf.fit_transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)
    
    # Logistic Regression
    lr_model = LogisticRegression(random_state=42, max_iter=1000)
    lr_model.fit(X_train_tfidf, y_train)
    lr_pred = lr_model.predict(X_test_tfidf)
    
    lr_train_time = time.time() - start_time
    lr_accuracy = accuracy_score(y_test, lr_pred)
    lr_precision, lr_recall, lr_f1, _ = precision_recall_fscore_support(
        y_test, lr_pred, average='weighted'
    )
    
    print(f"  Training time: {lr_train_time:.2f} seconds")
    print(f"  Accuracy: {lr_accuracy:.4f}")
    print(f"  F1-score: {lr_f1:.4f}")
    
    # SVM
    print("\n2. Training TF-IDF + SVM...")
    start_time = time.time()
    
    svm_model = SVC(kernel='linear', random_state=42)
    svm_model.fit(X_train_tfidf, y_train)
    svm_pred = svm_model.predict(X_test_tfidf)
    
    svm_train_time = time.time() - start_time
    svm_accuracy = accuracy_score(y_test, svm_pred)
    svm_precision, svm_recall, svm_f1, _ = precision_recall_fscore_support(
        y_test, svm_pred, average='weighted'
    )
    
    print(f"  Training time: {svm_train_time:.2f} seconds")
    print(f"  Accuracy: {svm_accuracy:.4f}")
    print(f"  F1-score: {svm_f1:.4f}")
    
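    # Get BERT metrics from the stored test-set predictions
    bert_accuracy = bert_results['test_results']['eval_accuracy']
    bert_pred = np.argmax(bert_results['predictions'].predictions, axis=1)
    _, _, bert_f1, _ = precision_recall_fscore_support(
        bert_results['predictions'].label_ids, bert_pred, average='weighted'
    )
    
    # Create comparison visualization
    methods = ['Logistic Regression', 'SVM', 'BERT']
    accuracies = [lr_accuracy, svm_accuracy, bert_accuracy]
    f1_scores = [lr_f1, svm_f1, bert_f1]
    # The BERT time is a placeholder estimate, not a measurement
    train_times = [lr_train_time, svm_train_time, 300]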
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Accuracy comparison
    axes[0].bar(methods, accuracies, color=['blue', 'green', 'red'])
    axes[0].set_title('Accuracy Comparison')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_ylim(0, 1)
    for i, v in enumerate(accuracies):
        axes[0].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')
    
    # F1-score comparison
    axes[1].bar(methods, f1_scores, color=['blue', 'green', 'red'])
    axes[1].set_title('F1-Score Comparison')
    axes[1].set_ylabel('F1-Score')
    axes[1].set_ylim(0, 1)
    for i, v in enumerate(f1_scores):
        axes[1].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')
    
    # Training time comparison (log scale)
    axes[2].bar(methods, train_times, color=['blue', 'green', 'red'])
    axes[2].set_title('Training Time Comparison')
    axes[2].set_ylabel('Training Time (seconds)')
    axes[2].set_yscale('log')
    for i, v in enumerate(train_times):
        axes[2].text(i, v * 1.1, f'{v:.0f}s', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Summary
    print("\nComparison Summary:")
    print("=" * 30)
    best_traditional = max(lr_accuracy, svm_accuracy)
    diff = bert_accuracy - best_traditional
    # Signed formatting keeps the output correct when BERT is worse than the baseline
    print(f"BERT Accuracy: {bert_accuracy:.4f} ({diff:+.4f} vs. best traditional)")
    print(f"Best Traditional: {best_traditional:.4f}")
    if best_traditional > 0:
        print(f"BERT vs Traditional: {diff / best_traditional * 100:+.1f}% relative change")
    
    return {
        'logistic_regression': {'accuracy': lr_accuracy, 'f1': lr_f1, 'time': lr_train_time},
        'svm': {'accuracy': svm_accuracy, 'f1': svm_f1, 'time': svm_train_time},
        'bert': {'accuracy': bert_accuracy, 'time': 300}
    }

# Compare with traditional methods
comparison_results = compare_with_traditional_ml(df, evaluation_results)
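
Accuracy and weighted F1 are different quantities, so one should not stand in for the other when comparing models. The sketch below computes a support-weighted F1 by hand on hypothetical predictions; `weighted_f1` is an illustrative helper, intended to follow the same definition as the `average='weighted'` option used above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f}, weighted F1={weighted_f1(y_true, y_pred):.4f}")
```

Even on this small example the two numbers differ slightly, and the gap can grow with class imbalance.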

### Step 7: Attention Visualization (Optional)

In [None]:
def visualize_attention_patterns(model, tokenizer, sample_text):
    """
    Visualize BERT attention patterns for interpretability.
    Note: This is a simplified visualization. Full attention analysis requires additional tools.
    
    Args:
        model: Fine-tuned BERT model
        tokenizer: BERT tokenizer
        sample_text (str): Text to analyze
    """
    # TODO: Implement attention visualization:
    # 1. Get model attention weights
    # 2. Visualize attention patterns
    # 3. Identify important tokens
    # 4. Analyze layer-wise attention
    
    print(f"Analyzing attention patterns for: '{sample_text}'")
    print("=" * 60)
    
    # Tokenize input
    inputs = tokenizer(
        sample_text,
        return_tensors='pt',
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True
    )
    
    # Move to device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get model predictions and attention weights
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    
    # Get predictions
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = torch.max(predictions).item()
    
    label_names = ['negative', 'neutral', 'positive']
    
    print(f"Prediction: {label_names[predicted_class]} (confidence: {confidence:.3f})")
    print(f"Class probabilities:")
    for i, prob in enumerate(predictions[0]):
        print(f"  {label_names[i]}: {prob:.3f}")
    
    # Get tokens and drop padding, using the attention mask to find the
    # real sequence length instead of a hard-coded '[PAD]' string
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    actual_length = int(inputs['attention_mask'][0].sum().item())
    tokens = tokens[:actual_length]
    
    print(f"\nTokens ({len(tokens)}): {tokens}")
    
    # Simple attention analysis (last layer, first head)
    if outputs.attentions:
        # Get attention from last layer, first head
        last_layer_attention = outputs.attentions[-1][0, 0, :actual_length, :actual_length]
        
        # Average attention received by each token
        token_importance = torch.mean(last_layer_attention, dim=0).cpu().numpy()
        
        print(f"\nToken Importance (averaged attention):")
        for token, importance in zip(tokens, token_importance):
            print(f"  {token}: {importance:.3f}")
        
        # Simple visualization
        plt.figure(figsize=(12, 6))
        
        # Token importance plot
        plt.subplot(1, 2, 1)
        bars = plt.bar(range(len(tokens)), token_importance[:len(tokens)])
        plt.xticks(range(len(tokens)), tokens, rotation=45, ha='right')
        plt.title('Token Importance (Attention)')
        plt.ylabel('Attention Score')
        
        # Color bars by importance
        max_importance = max(token_importance[:len(tokens)])
        for bar, importance in zip(bars, token_importance[:len(tokens)]):
            bar.set_color(plt.cm.Reds(importance / max_importance))
        
        # Attention heatmap (simplified)
        plt.subplot(1, 2, 2)
        attention_matrix = last_layer_attention.cpu().numpy()
        sns.heatmap(
            attention_matrix,
            xticklabels=tokens,
            yticklabels=tokens,
            cmap='Blues',
            cbar=True
        )
        plt.title('Attention Matrix (Last Layer, Head 1)')
        plt.xticks(rotation=45, ha='right')
        plt.yticks(rotation=0)
        
        plt.tight_layout()
        plt.show()
    else:
        print("\nAttention weights not available (model may not output attentions)")

# Analyze attention for sample texts
sample_texts = [
    "Dieser Film ist absolut fantastisch und sehr bewegend!",
    "Das Produkt ist völlig unbrauchbar und schlecht verarbeitet.",
    "Ein durchschnittliches Restaurant mit normalem Service."
]

print("Attention Pattern Analysis:")
print("=" * 40)

for i, text in enumerate(sample_texts):
    print(f"\nSample {i+1}:")
    visualize_attention_patterns(trainer.model, tokenizer, text)
    if i < len(sample_texts) - 1:
        print("\n" + "-"*60)
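
The "token importance" score used in this step is the average attention each token receives, that is, the column mean of the attention matrix. Here is a small sketch in plain Python; the attention weights are hypothetical, and each row sums to 1 like a softmax output.

```python
def attention_received(attention):
    """Column means of a square attention matrix (rows = queries, columns = keys)."""
    n = len(attention)
    return [sum(row[j] for row in attention) / n for j in range(n)]


tokens = ["[CLS]", "Film", "toll", "[SEP]"]
# Hypothetical attention weights for one head.
attention = [
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.15, 0.70, 0.10],
    [0.10, 0.30, 0.50, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
scores = attention_received(attention)
for token, score in zip(tokens, scores):
    print(f"{token}: {score:.4f}")
```

A single head in a single layer gives only a partial view, so such scores are better read as a hint than as an explanation of the prediction.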

## Exercise Tasks

Complete the following tasks to deepen your understanding:

1. **Model Variants**:
   - Try different German BERT variants (distilbert-base-german-cased, bert-base-multilingual-cased)
   - Compare model sizes, speed, and performance
   - Experiment with RoBERTa or ELECTRA architectures

2. **Hyperparameter Tuning**:
   - Optimize learning rate, batch size, and number of epochs
   - Implement learning rate scheduling
   - Add dropout and weight decay regularization

3. **Advanced Evaluation**:
   - Implement cross-validation for BERT
   - Add more evaluation metrics (ROC-AUC, precision-recall curves)
   - Perform statistical significance testing

4. **Interpretability Analysis**:
   - Implement LIME or SHAP for BERT explanations
   - Analyze attention patterns across layers and heads
   - Create token importance visualizations

5. **Production Deployment**:
   - Optimize model for inference (quantization, distillation)
   - Create REST API for sentiment classification
   - Implement batch processing and caching
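
For the learning rate scheduling task, a common starting point is linear warmup followed by linear decay. The following is a minimal sketch in plain Python, assuming a base rate of 2e-5; `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate with linear warmup, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 10, 55, 100):
    lr = linear_schedule_lr(step, total_steps=100, warmup_steps=10)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Warmup keeps the first updates small while the freshly initialized classification head is still far from a reasonable solution.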

## Reflection Questions

1. What makes BERT more effective than traditional ML approaches?
2. How does bidirectional context improve understanding?
3. What are the computational trade-offs of using BERT?
4. How can attention weights help interpret model decisions?
5. When might traditional ML still be preferable to BERT?

## Next Steps

- Explore task-specific BERT variants (sentiment-specific models)
- Learn about BERT for other NLP tasks (NER, QA, etc.)
- Study recent transformer developments (GPT, T5, etc.)
- Investigate multilingual and cross-lingual applications