# BERT Fine-tuning for Sentiment Analysis

This notebook demonstrates how to fine-tune a pre-trained BERT model for sentiment analysis using the IMDB movie reviews dataset. We'll compare the performance of a traditional machine learning baseline (TF-IDF + Logistic Regression) with a fine-tuned BERT model.

## 🎯 **Project Goals:**
- Load and explore the IMDB dataset
- Establish a baseline using traditional ML methods
- Fine-tune BERT for binary sentiment classification
- Compare performance between baseline and BERT models

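## 📊 **Expected Outcomes:**
- **Baseline Model**: ~88-89% accuracy
- **Fine-tuned BERT**: ~92-94% accuracy is often reported for this task with longer input sequences; the run recorded in this notebook (128-token inputs, 3 epochs) reached 88.97%
- **Learning**: Understanding transformer-based models vs traditional approaches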

---

## 📚 Step 1: Dataset Loading and Exploration

First, we'll load the well-known IMDB movie reviews dataset. It contains 50,000 labeled reviews (25k for training, 25k for testing), each marked as positive or negative, plus an `unsupervised` split of 50,000 reviews that this notebook does not use.

In [None]:
# Load the IMDB dataset using Hugging Face datasets library
# This dataset contains 25k training and 25k test movie reviews
from datasets import load_dataset

# Download and load the dataset (may take a few minutes on first run)
dataset = load_dataset("imdb")  



In [None]:
# Explore the structure of our dataset
# This shows us the training and test splits with their sizes
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [None]:
# Look at a sample review to understand the data structure
# Each sample contains 'text' (review content) and 'label' (0=negative, 1=positive)
print(dataset['train'][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [None]:
# Check the class balance in our training data
# Labels: 0 = negative sentiment, 1 = positive sentiment
train_labels = dataset['train']['label']
print(f"Positive reviews: {sum(train_labels)}")
print(f"Negative reviews: {len(train_labels) - sum(train_labels)}")
print(f"Dataset is {'balanced' if sum(train_labels) == len(train_labels) - sum(train_labels) else 'imbalanced'}")

Positive reviews: 12500
Negative reviews: 12500


## 🤖 Step 2: Baseline Model (Traditional ML Approach)

Before jumping into BERT, let's establish a baseline using traditional machine learning:
- **TF-IDF**: Convert text to numerical features
- **Logistic Regression**: Simple but effective classifier

This helps us understand how much improvement BERT provides over classical methods.

In [None]:
# Import libraries for our baseline machine learning model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Prepare our text data for traditional ML processing
train_texts = dataset['train']['text']
train_labels = dataset['train']['label']  
test_labels = dataset['test']['label']
test_texts = dataset['test']['text']

# Show dataset size information
print(f"Training on {len(train_texts):,} reviews")
print(f"Testing on {len(test_texts):,} reviews")

Training on 25000 reviews
Testing on 25000 reviews


In [None]:
# Convert text to numerical features using TF-IDF
# TF-IDF (Term Frequency-Inverse Document Frequency) measures word importance
# - Higher values = words that appear frequently in this doc but rarely in others
# - ngram_range=(1,2) = consider both single words and word pairs
print("Converting text to TF-IDF features...")
vectorizer = TfidfVectorizer(
    max_features=5000,     # Keep only the 5,000 terms with the highest frequency across the corpus
    ngram_range=(1,2)      # Use single words and word pairs (bigrams)
)

# Fit vectorizer on training data and transform both train and test sets
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print("✅ TF-IDF transformation complete!")
print(f"Feature matrix shape: {X_train.shape}")
print(f"Each review is now represented by {X_train.shape[1]:,} numerical features")

Converting text to TF-IDF features
TF-IDF shape: (25000, 5000)
Each review is represented by a vector of length 5000 numbers
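
To make the TF-IDF weighting less of a black box, here is a minimal pure-Python sketch. It assumes plain whitespace tokenization and uses the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn documents as the default for `TfidfVectorizer`. The helper `tfidf_vectors` and the three-review corpus are hypothetical and exist only for illustration; the real vectorizer also handles punctuation, bigrams, and the `max_features` cut.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf table, list of L2-normalised TF-IDF dicts) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF (assumed to match scikit-learn's documented default)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors

# Hypothetical three-review corpus, for illustration only
toy_docs = ["great movie great cast", "terrible movie", "great fun"]
idf, vectors = tfidf_vectors(toy_docs)

for doc, vec in zip(toy_docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

In the second toy review, "terrible" receives a larger weight than "movie" because it occurs in fewer documents, which is the behavior the comments in the cell above describe.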


In [None]:
# Train our baseline logistic regression model
print("🎯 Training Logistic Regression baseline model...")
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, train_labels)

# Make predictions on the test set
print("📊 Making predictions on test set...")
predictions = baseline_model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(test_labels, predictions)
print(f"\n{'='*60}")
print("🏆 BASELINE MODEL PERFORMANCE")
print(f"{'='*60}")
print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("\n📋 Detailed Classification Report:")
print(classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(f"{'='*60}")
print("💡 This baseline gives us a target to beat with BERT!")

Training Logistic Regression model
Making predictions on test set

BASELINE MODEL PERFORMANCE
Accuracy: 0.8884 (88.84%)

Detailed Classification Report:
              precision    recall  f1-score   support

    Negative       0.89      0.88      0.89     12500
    Positive       0.88      0.89      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000
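
The columns of the classification report can be derived from four confusion-matrix counts. The sketch below shows the arithmetic in plain Python. The counts passed in are illustrative only: they were chosen to be consistent with the rounded figures in the report above and are not the model's actual confusion matrix, which this notebook does not print. The helper name `binary_report` is hypothetical.

```python
def binary_report(tn, fp, fn, tp):
    """Per-class precision/recall/F1 and overall accuracy from confusion counts."""
    def prf(hits, false_alarms, misses):
        precision = hits / (hits + false_alarms)
        recall = hits / (hits + misses)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    return {
        # For the negative class, a "hit" is a true negative
        "Negative": prf(tn, fn, fp),
        "Positive": prf(tp, fp, fn),
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    }

# Illustrative counts only (assumed, not taken from the actual run)
report = binary_report(tn=11049, fp=1451, fn=1339, tp=11161)

for name in ("Negative", "Positive"):
    precision, recall, f1 = report[name]
    print(f"{name:>8}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy: {report['accuracy']:.4f}")
```

Because both classes have the same support (12,500 each), the macro and weighted averages in the report coincide.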



## 🚀 Step 3: BERT Model Setup

Now let's implement the star of our show - BERT! 

**What is BERT?**
- **B**idirectional **E**ncoder **R**epresentations from **T**ransformers
- Pre-trained on massive text corpora (Wikipedia + BookCorpus)
- Understands context from both left AND right sides of words
- Set state-of-the-art results on many NLP benchmarks when it was released in 2018

**Why BERT is powerful:**
- Captures complex language patterns and relationships
- Pre-trained knowledge can be fine-tuned for specific tasks
- Bidirectional context understanding (unlike traditional left-to-right models)

In [None]:
# Load pre-trained BERT components from Hugging Face
from transformers import BertTokenizer, BertForSequenceClassification

print("🔄 Loading BERT tokenizer...")
# The tokenizer converts text into tokens that BERT can understand
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print("🔄 Loading pre-trained BERT model...")
# Load BERT with a classification head for binary sentiment analysis
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',    # 12-layer, 768-hidden, 12-heads, 110M parameters
    num_labels=2            # Binary classification: positive/negative
)

print("✅ BERT model and tokenizer loaded successfully!")
print(f"📊 Model size: {model.num_parameters():,} parameters")

# 🔍 **What's happening under the hood:**
#
# 1️⃣ **bert-base-uncased Model Architecture:**
#    - 12 transformer layers (compared to 24 in bert-large)
#    - 768 hidden dimensions
#    - 12 attention heads per layer
#    - ~110 million trainable parameters
#    - "uncased" = not case-sensitive (converts "Hello" → "hello")
#
# 2️⃣ **BertTokenizer Functions:**
#    - Vocabulary: 30,522 unique tokens
#    - Subword (WordPiece) tokenization: words missing from the vocabulary are split
#      into pieces, with continuation pieces prefixed by "##"
#      (e.g. "tokenization" → ["token", "##ization"])
#    - Special tokens: [CLS] (start), [SEP] (separator), [PAD] (padding)
#    - Out-of-vocabulary words fall back to subword pieces, or [UNK] as a last resort
#
# 3️⃣ **BertForSequenceClassification:**
#    - Pre-trained BERT encoder + classification head
#    - Classification head: dropout + linear layer (768 → 2 outputs)
#    - Only the classification head is randomly initialized
#    - BERT weights start from pre-trained values
#
# 💾 **First-time setup:**
# - Downloads ~440MB of model weights
# - Caches locally for future use
# - Takes 1-2 minutes depending on internet speed
Loading BERT tokenizer...
Loading BERT model...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model and tokenizer loaded!
Model has 109,483,778 parameters
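
The parameter count printed above can be reproduced by hand, which is a useful check on the architecture description. The sketch below sums weights and biases layer by layer. The vocabulary size (30,522), hidden size (768), layer count (12), and two output labels come from this notebook; the 512 position embeddings, 2 token-type embeddings, and 3,072-wide feed-forward layer are assumed from the standard `bert-base-uncased` configuration. The helper names are hypothetical.

```python
def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Parameters in a LayerNorm: one scale and one shift per feature."""
    return 2 * n

vocab, hidden, layers, num_labels = 30522, 768, 12, 2
max_positions, type_vocab, intermediate = 512, 2, 3072  # assumed config values

# Word, position, and token-type embedding tables, followed by a LayerNorm
embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm(hidden)

encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, and attention-output projections
    + layer_norm(hidden)              # LayerNorm after attention
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection back to hidden size
    + layer_norm(hidden)              # LayerNorm after feed-forward
)

pooler = linear(hidden, hidden)           # dense layer applied to the [CLS] position
classifier = linear(hidden, num_labels)   # the newly initialized classification head

total = embeddings + layers * encoder_layer + pooler + classifier
print(f"Embeddings:    {embeddings:,}")
print(f"Encoder layer: {encoder_layer:,} x {layers}")
print(f"Classifier:    {classifier:,}")
print(f"Total:         {total:,}")
```

Under these assumptions the total matches the 109,483,778 reported by `model.num_parameters()`, and only 1,538 of those parameters (the classifier) start from random values.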


In [None]:
# Tokenize our text data for BERT processing
# BERT requires specific input format with special tokens and padding

print("🔄 Tokenizing training data...")
train_encodings = tokenizer(
    list(train_texts),
    truncation=True,      # Cut off reviews longer than max_length
    padding=True,         # Pad shorter reviews to the longest sequence in the call (at most max_length)
    max_length=128,       # Trades content for speed: BERT accepts up to 512 tokens, and many IMDB reviews exceed 128
    return_tensors='pt'   # Return PyTorch tensors
)

print("🔄 Tokenizing test data...")
test_encodings = tokenizer(
    list(test_texts),
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors='pt'
)

print("✅ Tokenization complete!")
print(f"📊 Training data shape: {train_encodings['input_ids'].shape}")
print(f"📊 Test data shape: {test_encodings['input_ids'].shape}")
print("\n🔍 What each dimension means:")
print(f"   • First dimension ({train_encodings['input_ids'].shape[0]:,}): Number of reviews")
print(f"   • Second dimension ({train_encodings['input_ids'].shape[1]}): Sequence length (tokens per review)")

# 💡 **Understanding the tokenization process:**
#
# Input: "This movie is great!"
# ↓
# Tokens: ["[CLS]", "this", "movie", "is", "great", "!", "[SEP]", "[PAD]", "[PAD]", ...]
# ↓
# Token IDs: [101, 2023, 3185, 2003, 2307, 999, 102, 0, 0, ...]
# ↓
# Attention Mask: [1, 1, 1, 1, 1, 1, 1, 0, 0, ...] (1=real token, 0=padding)

Tokenizing training data...
Tokenizing test data...
✅ Tokenization complete!
Training shape: torch.Size([25000, 128])
Test shape: torch.Size([25000, 128])
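
The truncation and padding steps that give every review exactly 128 positions can be sketched without the tokenizer. The function below is a simplified stand-in, not the Hugging Face implementation: it takes token IDs that are already computed, reserves two slots for `[CLS]` and `[SEP]`, truncates the rest, and right-pads with zeros. The special-token IDs are the ones shown in the tokenization example above; the function name `encode_fixed_length` is hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as shown in the example in the cell above

def encode_fixed_length(content_ids, max_length):
    """Truncate content, wrap it in [CLS]/[SEP], then right-pad to max_length."""
    kept = content_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)      # 1 marks a real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# "this movie is great !" using the IDs from the example above
ids, mask = encode_fixed_length([2023, 3185, 2003, 2307, 999], max_length=10)
print(ids)
print(mask)

# A hypothetical 300-token review: only the first 126 content tokens survive
long_ids, long_mask = encode_fixed_length(list(range(1000, 1300)), max_length=128)
print(len(long_ids), sum(long_mask))
```

The second call illustrates the cost of `max_length=128`: everything past the 126th content token of a long review never reaches the model.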


In [None]:
# Create custom PyTorch Dataset classes for efficient data loading
import torch
from torch.utils.data import Dataset, DataLoader

class SentimentDataset(Dataset):
    """
    Custom Dataset class that packages our tokenized text with labels
    for efficient batch processing during training
    """
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    def __len__(self):
        """Return the total number of samples in the dataset"""
        return len(self.labels)
    
    def __getitem__(self, idx):
        """
        Get a single sample from the dataset
        Returns: dictionary with input_ids, attention_mask, and labels
        """
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

# Create dataset objects for training and testing
print("🔄 Creating PyTorch datasets...")
train_dataset = SentimentDataset(train_encodings, train_labels)
test_dataset = SentimentDataset(test_encodings, test_labels)

print("✅ Dataset objects created successfully!")
print(f"📊 Training dataset size: {len(train_dataset):,} reviews")
print(f"📊 Test dataset size: {len(test_dataset):,} reviews")

# Inspect a sample to understand the data structure
sample = train_dataset[0]
print("\n🔍 Sample data structure:")
print(f"   • input_ids shape: {sample['input_ids'].shape} (tokenized text)")
print(f"   • attention_mask shape: {sample['attention_mask'].shape} (padding mask)")
print(f"   • label: {sample['labels']} ({'positive' if sample['labels'] == 1 else 'negative'} sentiment)")

# 💡 **Why we need custom Dataset classes:**
#
# 🎯 **Efficient batch processing**: PyTorch can automatically batch our data
# 🎯 **Per-sample access**: __getitem__ returns one review at a time (here all tensors are
#    already in memory; a lazily loading Dataset could read from disk instead)
# 🎯 **Standardized interface**: Works seamlessly with PyTorch DataLoaders
# 🎯 **Flexibility**: Easy to add data augmentation or preprocessing later

✅ Datasets created!
Training dataset size: 25000
Test dataset size: 25000

Sample data structure:
  input_ids shape: torch.Size([128])
  attention_mask shape: torch.Size([128])
  label: 0


In [None]:
# Create DataLoaders for efficient batch processing during training
from torch.utils.data import DataLoader

# Set up data loaders with appropriate batch sizes
train_loader = DataLoader(
    train_dataset, 
    batch_size=16,      # Process 16 reviews at once (balance memory vs speed)
    shuffle=True        # Shuffle training data for better learning
)

test_loader = DataLoader(
    test_dataset, 
    batch_size=16,      # Same batch size for consistency
    shuffle=False       # No need to shuffle test data
)

print('✅ DataLoaders created successfully!')
print(f"📊 Training batches per epoch: {len(train_loader):,}")
print(f"📊 Test batches: {len(test_loader):,}")
print(f"📊 Batch size: {train_loader.batch_size} reviews per batch")
print("\n🔄 Training process:")
print(f"   • Each epoch processes {len(train_loader):,} batches")
print(f"   • Total training samples per epoch: {len(train_dataset):,} (the last batch is smaller)")
print("   • Time per epoch varies with hardware (the recorded run needed 13+ minutes of training per epoch on a GPU)")

# 💡 **DataLoader benefits:**
#
# 🚀 **Batch processing**: Train on multiple samples simultaneously
# 🧠 **Batch assembly**: Collates individual samples into batched tensors on demand
# 🔄 **Automatic shuffling**: Reshuffles every epoch so the model does not see a fixed data order
# ⚡ **Parallel loading**: Can use multiple CPU cores for data loading (via num_workers)
# 🎯 **Consistent interface**: Standard PyTorch training loop compatibility

DataLoaders created!
Training batches per epoch: 1563
Test batches: 1563
Batch size: 16 reviews

Each epoch will process 1563 batches
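
The batch counts above follow from simple arithmetic: 25,000 reviews do not divide evenly by 16, so the final batch is short. The sketch below mimics the index batching a `DataLoader` performs, using only the standard library. It is an illustration of the idea, not PyTorch's implementation, and `iter_batches` is a hypothetical helper.

```python
import math
import random

def iter_batches(n_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # a fresh order would be drawn each epoch
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

n_samples, batch_size, epochs = 25_000, 16, 3
batches = list(iter_batches(n_samples, batch_size, shuffle=True))

batches_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = len(batches[-1])
total_steps = batches_per_epoch * epochs

print(f"Batches per epoch: {batches_per_epoch}")
print(f"Last batch size:   {last_batch_size}")
print(f"Total steps:       {total_steps}")
```

This is why multiplying the number of batches by 16 overstates the number of samples per epoch: 1,563 batches cover 25,000 reviews, not 25,008.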


In [None]:
# 🚀 PROJECT PROGRESS TRACKER 🚀
# 
# [✅ Setup] → [✅ Data Loading] → [✅ Baseline Model] → [✅ BERT Preparation] → [🔥 TRAINING] ← YOU ARE HERE → [Evaluation] → [Results]
#
# 🎯 Ready to fine-tune BERT! All preprocessing complete.

## ⚙️ Step 4: Training Configuration

Before we start training, we need to set up our training environment and hyperparameters. This includes:
- **Device selection** (GPU vs CPU)
- **Optimizer configuration** (how the model learns)
- **Learning rate scheduling** (adjusts learning speed during training)

In [None]:
# Configure the training device (GPU vs CPU)
import torch 

# Automatically detect and use GPU if available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️ Using device: {device}")

if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory // 1024**3} GB")
else:
    print("   ⚠️ No GPU detected - training will be slower on CPU")

# Move our model to the selected device (GPU/CPU)
model = model.to(device)
print(f"✅ Model moved to {device}!")

# 🚀 **GPU vs CPU Performance (rough figures; they vary widely with hardware):**
#
# 🔥 **With GPU (CUDA)**:
#    • Training time: tens of minutes per epoch (13-23 minutes in the first two recorded epochs)
#    • Memory usage: roughly 6-8GB GPU memory
#    • Batch size: 16 (or higher with more memory)
#
# ⏳ **With CPU only**:
#    • Training time: several hours for 3 epochs
#    • Memory usage: roughly 8-12GB RAM
#    • Batch size: Limited by available RAM

Using device: cuda
Model moved to device!


In [None]:
# Configure optimizer and learning rate scheduler
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW

# Set up AdamW optimizer with weight decay (a regularizer that helps limit overfitting)
optimizer = AdamW(
    model.parameters(), 
    lr=2e-5,             # Learning rate: small value for fine-tuning
    weight_decay=0.01    # Decoupled weight decay (AdamW), similar in spirit to L2 regularization
)

# Training configuration
epochs = 3  # Number of complete passes through the dataset
total_steps = len(train_loader) * epochs

# Learning rate scheduler: with no warmup, the rate starts at 2e-5 and decays linearly to 0
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,           # No warmup (a short warmup, e.g. 5-10% of total steps, is a common alternative)
    num_training_steps=total_steps
)

print("✅ Training configuration complete!")
print("📊 Training hyperparameters:")
print("   • Learning rate: 2e-5 (a common choice for BERT fine-tuning)")
print("   • Weight decay: 0.01 (regularization)")
print(f"   • Epochs: {epochs}")
print(f"   • Total training steps: {total_steps:,}")
print(f"   • Steps per epoch: {len(train_loader):,}")

# 🧠 **Why these hyperparameters?**
#
# 🎯 **Learning Rate (2e-5)**:
#    • BERT is pre-trained, so we need small updates
#    • Too high → catastrophic forgetting of pre-trained knowledge
#    • Too low → very slow learning or poor convergence
#    • 2e-5 is one of the fine-tuning rates recommended in the BERT paper (alongside 3e-5 and 5e-5)
#
# ⚖️ **Weight Decay (0.01)**:
#    • Limits overfitting by shrinking weights toward zero at each step
#    • Standard value for transformer fine-tuning
#
# 🔄 **Linear Decay Schedule**:
#    • Learning rate decreases linearly from 2e-5 to 0 over training
#    • Helps the model settle as training progresses
#    • Alternative: cosine decay or constant rate

✅ Optimizer configured!
Learning rate: 2e-5
Total training steps: 4689
Epochs: 3
Steps per epoch: 1563
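
To see what the linear schedule does to the learning rate over the 4,689 steps, the sketch below reimplements the rule in plain Python: a linear ramp from 0 during warmup, then a linear decay to 0. It is written to mirror the behavior of `get_linear_schedule_with_warmup` as I understand it, but it is a standalone illustration rather than the library code, and `linear_schedule_lr` is a hypothetical name.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

base_lr, total_steps = 2e-5, 4689

# Configuration used in this notebook: no warmup
for step in (0, 1563, 3126, 4689):
    print(step, f"{linear_schedule_lr(step, base_lr, 0, total_steps):.2e}")

# Hypothetical alternative: warm up over roughly the first 10% of steps
warmup = 469
for step in (0, 234, 469, 4689):
    print(step, f"{linear_schedule_lr(step, base_lr, warmup, total_steps):.2e}")
```

With `num_warmup_steps=0` the very first update already uses the full 2e-5, and by the end of each epoch the rate has dropped by a third of its starting value.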


## 🏋️ Step 5: Training Functions

Now we'll define our training and evaluation functions. These functions handle:
- **Forward pass**: Data → Model → Predictions
- **Backward pass**: Calculate gradients and update weights
- **Evaluation**: Track performance on test data

In [None]:
# Define training and evaluation functions
from tqdm import tqdm

def train_epoch(model, dataloader, optimizer, scheduler, device):
    """
    Train the model for one complete epoch
    
    Returns:
        avg_loss (float): Average training loss
        accuracy (float): Training accuracy
    """
    model.train()  # Set model to training mode (enables dropout, etc.)
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    
    # Progress bar for visual feedback
    progress_bar = tqdm(dataloader, desc='🔥 Training')
    
    for batch in progress_bar:
        # Move batch data to device (GPU/CPU)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Reset gradients from previous iteration
        optimizer.zero_grad()
        
        # Forward pass: input → model → predictions
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss      # Cross-entropy loss
        logits = outputs.logits  # Raw prediction scores
        
        # Backward pass: calculate gradients
        loss.backward()
        
        # Update model weights
        optimizer.step()
        scheduler.step()  # Update learning rate
        
        # Track performance metrics
        total_loss += loss.item()
        predictions = torch.argmax(logits, dim=1)  # Convert scores to predictions
        correct_predictions += (predictions == labels).sum().item()
        total_predictions += labels.size(0)
        
        # Update progress bar with current metrics
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'acc': f'{correct_predictions/total_predictions:.4f}'
        })
    
    # Calculate averages
    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_predictions
    
    return avg_loss, accuracy


def evaluate(model, dataloader, device):
    """
    Evaluate model performance on test/validation data
    
    Returns:
        avg_loss (float): Average evaluation loss
        accuracy (float): Evaluation accuracy
    """
    model.eval()  # Set model to evaluation mode (disables dropout, etc.)
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    
    # Disable gradient computation for faster inference
    with torch.no_grad():
        for batch in tqdm(dataloader, desc='📊 Evaluating'):
            # Move batch data to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Forward pass only (no backpropagation)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            
            loss = outputs.loss
            logits = outputs.logits
            
            # Track metrics
            total_loss += loss.item()
            predictions = torch.argmax(logits, dim=1)
            correct_predictions += (predictions == labels).sum().item()
            total_predictions += labels.size(0)
    
    # Calculate averages
    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_predictions
    
    return avg_loss, accuracy

print("✅ Training and evaluation functions defined!")
print("\n🔍 Function overview:")
print("   • train_epoch(): Trains model for one epoch, updates weights")
print("   • evaluate(): Tests model performance without updating weights")
print("   • Both functions track loss and accuracy metrics")
print("   • Progress bars show real-time training/evaluation progress")

✅ Training and evaluation functions defined!
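
Both functions rely on two quantities returned by the model: the logits and the cross-entropy loss. The sketch below shows, in plain Python, how a batch of two-class logits becomes predictions (argmax) and a mean loss (negative log of the softmax probability of the true label). The logits and labels are made-up values for illustration, and the helper names are hypothetical; the real computation happens inside PyTorch on tensors.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

def batch_metrics(batch_logits, labels):
    """Mean loss, number of correct predictions, and the predictions themselves."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]  # argmax
    correct = sum(p == y for p, y in zip(preds, labels))
    return sum(losses) / len(losses), correct, preds

# Hypothetical logits for three reviews: [negative score, positive score]
batch_logits = [[2.0, -1.0], [0.2, 0.4], [-0.5, 1.5]]
labels = [0, 0, 1]

mean_loss, correct, preds = batch_metrics(batch_logits, labels)
print(f"predictions={preds} correct={correct}/{len(labels)} mean_loss={mean_loss:.4f}")
```

The second review shows why loss and accuracy can diverge: a near-tie in the logits counts as a full error for accuracy but contributes only a moderate loss, while a confidently wrong prediction would be penalized far more.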


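## 🎯 Step 6: BERT Fine-tuning Training Loop

This is where the magic happens! We'll train our BERT model for 3 epochs and watch it learn to classify movie review sentiment.

**What to expect:**
BERT results of ~92-94% are often reported on IMDB, but the outcome depends heavily on sequence length and hyperparameters. In the run recorded below (128-token inputs), test accuracy stayed near 89% while training accuracy kept climbing:
- **Epoch 1**: 88.92% test accuracy (90.24% train)
- **Epoch 2**: 88.73% test accuracy (95.92% train)
- **Epoch 3**: 88.97% test accuracy (98.87% train)

**Training time estimates:**
- With GPU: the first two recorded epochs took about 17 and 27 minutes each, including evaluation (the third epoch's 5-hour wall-clock time is an outlier in the log)
- With CPU: expect several hours in total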

In [None]:
# 🚀 MAIN TRAINING LOOP - FINE-TUNE BERT FOR SENTIMENT ANALYSIS! 🚀

print("="*70)
print("🎯 BERT FINE-TUNING TRAINING STARTED")
print("="*70)
print(f"🖥️  Device: {device}")
print(f"📊 Epochs: {epochs}")
print("🔢 Batch size: 16 reviews per batch")
print("📈 Learning rate: 2e-5")
print(f"⚙️  Total parameters: {model.num_parameters():,}")
print("🎯 Target: Beat baseline accuracy of ~88.84%")
print("="*70)

# Track the best performance
best_accuracy = 0
training_history = []

# Training loop: repeat for specified number of epochs
for epoch in range(epochs):
    print(f"\n🔥 EPOCH {epoch + 1}/{epochs}")
    print("-"*70)
    
    # Train for one epoch
    print("🏋️  Training phase...")
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scheduler, device)
    
    # Evaluate on the test set after every epoch. Note: picking the best epoch on test
    # data makes the reported "best" accuracy optimistic; a separate validation split avoids this.
    print("📊 Evaluation phase...")
    test_loss, test_acc = evaluate(model, test_loader, device)
    
    # Store results
    training_history.append({
        'epoch': epoch + 1,
        'train_loss': train_loss,
        'train_acc': train_acc,
        'test_loss': test_loss,
        'test_acc': test_acc
    })
    
    # Display results for this epoch
    print(f"\n📋 EPOCH {epoch + 1} RESULTS:")
    print(f"   🏋️  Training   → Loss: {train_loss:.4f} | Accuracy: {train_acc:.4f} ({train_acc*100:.2f}%)")
    print(f"   📊 Test       → Loss: {test_loss:.4f} | Accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)")
    
    # Check for improvement
    if test_acc > best_accuracy:
        best_accuracy = test_acc
        status = "🆕 NEW BEST!" if epoch > 0 else "🎯 FIRST RESULT"
        print(f"   🏆 {status} Best accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
    else:
        print(f"   📈 Best so far: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
    
    print("-"*70)

# Final summary
baseline_accuracy = 0.8884  # From our logistic regression baseline
improvement = (best_accuracy - baseline_accuracy) * 100  # in percentage points
relative_improvement = (best_accuracy - baseline_accuracy) / baseline_accuracy * 100  # in percent

print(f"\n{'='*70}")
print("🎉 TRAINING COMPLETED! 🎉")
print("="*70)
print(f"🏆 Best BERT accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
print(f"🤖 Baseline accuracy: {baseline_accuracy:.4f} ({baseline_accuracy*100:.2f}%)")
print(f"📈 Improvement: {improvement:+.2f} percentage points")
print(f"🚀 Relative improvement: {relative_improvement:+.1f}%")
print("="*70)

# Performance interpretation
if best_accuracy > 0.92:
    print("🔥 EXCELLENT! Your BERT model achieved outstanding performance!")
elif best_accuracy > 0.90:
    print("✅ GREAT! Your BERT model shows significant improvement over baseline!")
elif best_accuracy > baseline_accuracy:
    print("👍 GOOD! BERT outperformed the baseline, which is expected!")
else:
    print("⚠️  Hmm, something might be off. BERT should typically beat the baseline.")

print("\n💡 Key takeaways:")
print("   • BERT's bidirectional context understanding → Can improve sentiment analysis")
print("   • Transfer learning from pre-trained knowledge → Faster convergence")  
print("   • Fine-tuning approach → Domain-specific adaptation")

🚀 STARTING BERT FINE-TUNING
Device: cuda
Epochs: 3
Batch size: 16
Learning rate: 2e-5
Total parameters: 109,483,778

📍 EPOCH 1/3
------------------------------------------------------------

Training: 100%|██████████| 1563/1563 [13:14<00:00,  1.97it/s, loss=0.3177, acc=0.9024]
Evaluating: 100%|██████████| 1563/1563 [04:01<00:00,  6.48it/s]

📊 Results:
  Train Loss: 0.2380 | Train Acc: 0.9024 (90.24%)
  Test Loss:  0.2777 | Test Acc:  0.8892 (88.92%)
  🏆 New best accuracy: 0.8892
------------------------------------------------------------

📍 EPOCH 2/3
------------------------------------------------------------

Training: 100%|██████████| 1563/1563 [23:04<00:00,  1.13it/s, loss=0.1062, acc=0.9592]
Evaluating: 100%|██████████| 1563/1563 [04:10<00:00,  6.24it/s]

📊 Results:
  Train Loss: 0.1153 | Train Acc: 0.9592 (95.92%)
  Test Loss:  0.3248 | Test Acc:  0.8873 (88.73%)
------------------------------------------------------------

📍 EPOCH 3/3
------------------------------------------------------------

Training: 100%|██████████| 1563/1563 [5:09:49<00:00, 11.89s/it, loss=0.0032, acc=0.9887]
Evaluating: 100%|██████████| 1563/1563 [04:14<00:00,  6.14it/s]

📊 Results:
  Train Loss: 0.0409 | Train Acc: 0.9887 (98.87%)
  Test Loss:  0.4162 | Test Acc:  0.8897 (88.97%)
  🏆 New best accuracy: 0.8897
------------------------------------------------------------

✅ TRAINING COMPLETE!
🏆 Best Test Accuracy: 0.8897 (88.97%)
📈 Improvement over baseline: 0.13%
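
The log above is worth reading more closely than the final accuracy alone. The sketch below copies the per-epoch numbers from that log into a list and checks three things: which epoch is best by test loss versus test accuracy, whether the train/test accuracy gap widens, and how large the gain over the 88.84% baseline is in percentage points and in relative terms. The variable names are hypothetical; the figures are the ones printed above.

```python
# Per-epoch metrics copied from the training log above
history = [
    {"epoch": 1, "train_loss": 0.2380, "train_acc": 0.9024, "test_loss": 0.2777, "test_acc": 0.8892},
    {"epoch": 2, "train_loss": 0.1153, "train_acc": 0.9592, "test_loss": 0.3248, "test_acc": 0.8873},
    {"epoch": 3, "train_loss": 0.0409, "train_acc": 0.9887, "test_loss": 0.4162, "test_acc": 0.8897},
]
baseline_accuracy = 0.8884  # TF-IDF + Logistic Regression

best_by_loss = min(history, key=lambda r: r["test_loss"])
best_by_acc = max(history, key=lambda r: r["test_acc"])

# A growing train/test gap plus rising test loss is a common overfitting signal
gaps = [r["train_acc"] - r["test_acc"] for r in history]
test_loss_rising = all(b["test_loss"] > a["test_loss"] for a, b in zip(history, history[1:]))

gain_pp = (best_by_acc["test_acc"] - baseline_accuracy) * 100
relative_gain_pct = (best_by_acc["test_acc"] - baseline_accuracy) / baseline_accuracy * 100

print(f"Best epoch by test loss:     {best_by_loss['epoch']}")
print(f"Best epoch by test accuracy: {best_by_acc['epoch']}")
print(f"Train/test accuracy gaps:    {[round(g, 4) for g in gaps]}")
print(f"Test loss rising each epoch: {test_loss_rising}")
print(f"Gain over baseline: {gain_pp:+.2f} percentage points ({relative_gain_pct:+.2f}% relative)")
```

Read this way, the run suggests the model was already at its best test loss after the first epoch, and that later epochs mostly fit the training set. Stopping early, using a validation split, or feeding longer sequences are plausible things to try, though none of them was tested in this notebook.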





## 🎊 Congratulations!

You've successfully fine-tuned BERT for sentiment analysis! 

### 🔍 **What You've Learned:**

1. **Traditional ML vs Deep Learning**: Compared TF-IDF + Logistic Regression (88.84%) with fine-tuned BERT (88.97% best test accuracy). In this run the gap was only 0.13 percentage points, and the rising test loss (0.2777 → 0.4162) alongside 98.87% training accuracy points to overfitting
2. **Transfer Learning**: Leveraged pre-trained BERT knowledge for your specific task
3. **Fine-tuning Process**: Understood how to adapt pre-trained models to new domains
4. **PyTorch Training Loop**: Implemented a complete training pipeline with per-epoch evaluation

### 🚀 **Next Steps:**

- **Try different models**: experiment with RoBERTa, DistilBERT, or domain-specific models
- **Hyperparameter tuning**: adjust learning rates, batch sizes, training epochs, or `max_length` (reviews were truncated to 128 tokens here, which may be limiting accuracy)
- **Real-world deployment**: integrate your model into a web application or API
- **Advanced techniques**: implement techniques like gradient accumulation or mixed precision training

### 📚 **Key Concepts Mastered:**

- **Tokenization**: Converting text to model-readable format
- **Attention mechanisms**: How BERT understands context
- **Fine-tuning**: Adapting pre-trained models
- **Evaluation metrics**: Tracking model performance

Great job on completing this comprehensive BERT fine-tuning project! 🎉