# Lightweight Fine-Tuning Project

## Overview

This project applies Parameter-Efficient Fine-Tuning (PEFT) to adapt a pre-trained language model for sentiment analysis with limited computational resources. It compares LoRA (Low-Rank Adaptation) with QLoRA (Quantized LoRA) and analyzes how different LoRA configurations affect performance.

## Project Choices

* **PEFT techniques:** 
  * LoRA (Low-Rank Adaptation) with multiple configuration experiments
  * QLoRA (Quantized LoRA) for additional memory efficiency

* **Model:** 
  * DistilBERT base model (`distilbert-base-uncased`, with no prior sentiment fine-tuning)
  * Chosen for its balance between model capacity and inference efficiency

* **Evaluation approach:** 
  * Comprehensive metrics suite (accuracy, F1, precision, recall)
  * Statistical significance testing
  * Detailed visualizations for performance analysis
  * Memory profiling for efficiency analysis

* **Fine-tuning dataset:** 
  * GLUE SST-2 (Stanford Sentiment Treebank) - binary sentiment classification
  * Well-established benchmark for sentiment analysis tasks

* **Enhancements:** 
  * Multiple LoRA configuration comparisons
  * QLoRA implementation with memory optimization
  * Real-world inference examples and applications
  * Detailed performance visualizations
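
To make "parameter-efficient" concrete: LoRA freezes a weight matrix of shape `d_out x d_in` and trains two small factors, `B` (`d_out x r`) and `A` (`r x d_in`), so each adapted matrix adds `r * (d_in + d_out)` trainable parameters. The sketch below counts them for rank 16 on the 768 x 768 attention projections of DistilBERT. Which layers are adapted is an assumption made for illustration (query and value projections in all 6 layers); the project's actual `target_modules` are set in the LoRA configuration cell. `lora_param_count` is a hypothetical helper name.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * d_in + d_out * r


hidden = 768            # DistilBERT hidden size
rank = 16               # matches config["lora_r"]
n_layers = 6            # DistilBERT transformer layers
adapted_per_layer = 2   # assumption: only q_lin and v_lin are adapted

per_matrix = lora_param_count(hidden, hidden, rank)
total_lora = per_matrix * adapted_per_layer * n_layers
full_matrix = hidden * hidden

print(f"LoRA parameters per matrix: {per_matrix:,} (full matrix: {full_matrix:,})")
print(f"LoRA parameters across the assumed targets: {total_lora:,}")
```

The count covers the low-rank factors only and excludes any classification-head parameters that are trained alongside them.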

## Project Structure

1. Setup and Initialization
2. Dataset Loading and Preprocessing
3. Base Model Evaluation
4. LoRA Implementation and Training
5. QLoRA Implementation and Training
6. LoRA Configuration Experimentation
7. Model Comparison and Analysis
8. Real-World Applications
9. Conclusions and Recommendations

## Setup and Imports



In [31]:
!pip install torch transformers datasets peft scikit-learn matplotlib seaborn pandas numpy


# Initial setup and imports
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import json
import warnings
from typing import Dict, List, Tuple, Optional, Any

# Suppress non-critical warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Hugging Face imports
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoModelForSequenceClassification, 
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    set_seed
)

# PEFT imports
from peft import (
    LoraConfig, 
    get_peft_model, 
    TaskType, 
    PeftModel,
    PeftConfig,
    prepare_model_for_kbit_training
)

# Metrics and evaluation imports
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score, 
    confusion_matrix, 
    classification_report,
    roc_curve, 
    auc
)

# Set environment variables
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Avoid parallelism warning

# Set random seed for reproducibility
SEED = 42
set_seed(SEED)
torch.manual_seed(SEED)
np.random.seed(SEED)

# Create directories for outputs
output_dir = "./peft_output"
viz_dir = f"{output_dir}/visualizations"
models_dir = f"{output_dir}/models"
results_dir = f"{output_dir}/results"

# Create all required directories
for dir_path in [output_dir, viz_dir, models_dir, results_dir]:
    os.makedirs(dir_path, exist_ok=True)

# Configure plots for consistency
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Display GPU info if available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"PyTorch Version: {torch.__version__}")
    
    # Memory info
    gpu_props = torch.cuda.get_device_properties(0)
    print(f"Total GPU Memory: {gpu_props.total_memory / 1e9:.2f} GB")
    print(f"Current Memory Usage: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

# Configure project parameters (these can be adjusted as needed)
config = {
    "model_name": "distilbert-base-uncased",  # Base model (not fine-tuned)
    "max_length": 128,                        # Maximum sequence length
    "train_sample_size": 5000,                # Number of training examples to use
    "validation_sample_size": 500,            # Number of validation examples for quick evaluation
    "lora_r": 16,                             # LoRA rank parameter
    "lora_alpha": 32,                         # LoRA alpha parameter
    "lora_dropout": 0.1,                      # LoRA dropout
    "batch_size": 16,                         # Batch size for training
    "learning_rate": 5e-4,                    # Learning rate
    "epochs": 3,                              # Number of training epochs
    "weight_decay": 0.01,                     # Weight decay for regularization
    "save_steps": 500,                        # Save checkpoints every X steps
    "qlora_batch_size": 8,                    # Smaller batch size for QLoRA
    "qlora_learning_rate": 1e-4,              # Lower learning rate for QLoRA
    "eval_batch_size": 32                     # Batch size for evaluation
}

print("\nProject configuration:")
for key, value in config.items():
    print(f"  {key}: {value}")

# Dictionary to store all results for comparison
all_results = {}

Defaulting to user installation because normal site-packages is not writeable
Using device: cuda
GPU: Tesla T4
CUDA Version: 11.7
PyTorch Version: 2.0.1
Total GPU Memory: 15.64 GB
Current Memory Usage: 4.59 GB

Project configuration:
  model_name: distilbert-base-uncased
  max_length: 128
  train_sample_size: 5000
  validation_sample_size: 500
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.1
  batch_size: 16
  learning_rate: 0.0005
  epochs: 3
  weight_decay: 0.01
  save_steps: 500
  qlora_batch_size: 8
  qlora_learning_rate: 0.0001
  eval_batch_size: 32
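
As a rough sanity check on these settings, the number of optimizer steps follows from the sample size, batch size, and epoch count. The sketch below assumes a single device, no gradient accumulation, and that the QLoRA run uses the same 5,000 examples and 3 epochs; none of these is stated in the configuration itself. `training_steps` is a hypothetical helper name.

```python
import math
from typing import Tuple


def training_steps(n_examples: int, batch_size: int, epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps); the last partial batch counts as a step."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


save_steps = 500
for name, batch_size in [("LoRA", 16), ("QLoRA", 8)]:
    per_epoch, total = training_steps(5000, batch_size, epochs=3)
    checkpoints = total // save_steps
    print(f"{name}: {per_epoch} steps/epoch, {total} total steps, "
          f"{checkpoints} checkpoint(s) at save_steps={save_steps}")
```

Under these assumptions the LoRA run is short enough that `save_steps=500` yields a single intermediate checkpoint, so a smaller value may be worth considering if more frequent checkpoints are wanted.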


## Helper Functions

In [32]:
# Define comprehensive helper functions for the project

def compute_metrics(eval_pred):
    """
    Compute a comprehensive set of evaluation metrics for classification.
    
    Args:
        eval_pred: Tuple of (predictions, labels) from the model
        
    Returns:
        Dictionary of metrics including accuracy, F1, precision, and recall
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # Core metrics
    metrics = {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted", zero_division=0),
        "precision": precision_score(labels, predictions, average="weighted", zero_division=0),
        "recall": recall_score(labels, predictions, average="weighted", zero_division=0)
    }
    
    # Per-class metrics if multiple classes are present
    unique_labels = np.unique(labels)
    if len(unique_labels) > 1:
        for label in unique_labels:
            label_predictions = (predictions == label)
            label_true = (labels == label)
            metrics[f"precision_class_{label}"] = precision_score(label_true, label_predictions, zero_division=0)
            metrics[f"recall_class_{label}"] = recall_score(label_true, label_predictions, zero_division=0)
            metrics[f"f1_class_{label}"] = f1_score(label_true, label_predictions, zero_division=0)
    
    return metrics


def format_time(seconds):
    """Format time duration in a human-readable format."""
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"


def visualize_confusion_matrix(predictions, true_labels, output_path=None, model_name="Model"):
    """
    Create and visualize a confusion matrix for the given predictions.
    
    Args:
        predictions: Model predictions (class indices)
        true_labels: True class labels
        output_path: Path to save the visualization (optional)
        model_name: Name of the model for the title
        
    Returns:
        The confusion matrix
    """
    # Compute confusion matrix
    cm = confusion_matrix(true_labels, predictions)
    
    # Visualize
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    
    # If labels are binary, add class names
    if cm.shape[0] == 2:
        plt.xticks([0.5, 1.5], ['Negative', 'Positive'])
        plt.yticks([0.5, 1.5], ['Negative', 'Positive'])
    
    plt.tight_layout()
    
    # Save if output path is specified
    if output_path:
        plt.savefig(output_path)
        plt.close()
    
    return cm


def plot_metric_comparison(metrics_dict, metric_name='accuracy', title=None, output_path=None):
    """
    Create a bar chart comparing a specific metric across different models.
    
    Args:
        metrics_dict: Dictionary mapping model names to metric values
        metric_name: The name of the metric to visualize
        title: Plot title (default derived from metric name)
        output_path: Path to save the visualization (optional)
    """
    plt.figure(figsize=(12, 6))
    models = list(metrics_dict.keys())
    values = [metrics_dict[model][f'eval_{metric_name}'] for model in models]
    
    # Create bar chart with custom colors
    colors = sns.color_palette("muted", len(models))
    bars = plt.bar(models, values, color=colors)
    
    # Set title and labels
    if title is None:
        title = f'{metric_name.capitalize()} Comparison Across Models'
    plt.title(title, fontsize=14)
    plt.xlabel('Model', fontsize=12)
    plt.ylabel(metric_name.capitalize(), fontsize=12)
    
    # Add value labels on top of each bar
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.005,
                 f'{height:.4f}',
                 ha='center', va='bottom', fontsize=11)
    
    # Add a horizontal line for base model reference
    if 'Base Model' in metrics_dict:
        base_value = metrics_dict['Base Model'][f'eval_{metric_name}']
        plt.axhline(y=base_value, color='red', linestyle='--', alpha=0.7, 
                    label=f'Base Model: {base_value:.4f}')
        plt.legend()
    
    # Rotate x-axis labels for better readability
    plt.xticks(rotation=30, ha='right')
    
    plt.tight_layout()
    
    # Save if output path is specified
    if output_path:
        plt.savefig(output_path)
        plt.close()
        print(f"Saved {metric_name} comparison chart to {output_path}")


def plot_training_history(history, output_path=None):
    """
    Plot training loss and evaluation metrics from training history.
    
    Args:
        history: Training history from Trainer.state.log_history
        output_path: Path to save the plot
    """
    # Extract training loss
    train_loss = []
    train_steps = []
    eval_loss = []
    eval_acc = []
    eval_steps = []
    
    for entry in history:
        if 'loss' in entry and 'eval_loss' not in entry:
            train_loss.append(entry['loss'])
            train_steps.append(entry['step'])
        if 'eval_loss' in entry:
            eval_loss.append(entry['eval_loss'])
            eval_acc.append(entry['eval_accuracy'])
            eval_steps.append(entry['step'])
    
    # Create the plot
    plt.figure(figsize=(14, 6))
    
    # Plot training loss
    plt.subplot(1, 2, 1)
    plt.plot(train_steps, train_loss, 'b-', marker='o', markersize=4, alpha=0.7)
    plt.title('Training Loss', fontsize=14)
    plt.xlabel('Steps', fontsize=12)
    plt.ylabel('Loss', fontsize=12)
    plt.grid(True, alpha=0.3)
    
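    # Plot evaluation accuracy on an explicit axes handle: twinx() below makes
    # the twin the current axes, so pyplot-level calls would label the wrong one
    ax_acc = plt.subplot(1, 2, 2)
    ax_acc.plot(eval_steps, eval_acc, 'g-', marker='o', markersize=6, label='Accuracy')
    ax_acc.set_title('Validation Metrics', fontsize=14)
    ax_acc.set_xlabel('Steps', fontsize=12)
    ax_acc.set_ylabel('Accuracy', fontsize=12)
    ax_acc.grid(True, alpha=0.3)
    handles, legend_labels = ax_acc.get_legend_handles_labels()
    top_ax = ax_acc
    
    # If eval loss exists, plot it on a secondary y-axis
    if eval_loss:
        ax_loss = ax_acc.twinx()
        ax_loss.plot(eval_steps, eval_loss, 'r--', marker='x', markersize=4, alpha=0.7, label='Loss')
        ax_loss.set_ylabel('Loss', color='r', fontsize=12)
        ax_loss.tick_params(axis='y', colors='r')
        loss_handles, loss_labels = ax_loss.get_legend_handles_labels()
        handles, legend_labels = handles + loss_handles, legend_labels + loss_labels
        top_ax = ax_loss
    
    # Draw one legend covering both curves on the topmost axes
    top_ax.legend(handles, legend_labels, loc='lower right')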
    
    plt.tight_layout()
    
    # Save if output path is specified
    if output_path:
        plt.savefig(output_path)
        plt.close()
        print(f"Training history plot saved to {output_path}")


def plot_roc_curve(model, dataset, tokenizer, device, output_path=None, model_name="Model"):
    """
    Plot ROC curve for the model on a given dataset.
    
    Args:
        model: The model to evaluate
        dataset: Evaluation dataset
        tokenizer: Tokenizer for the model
        device: Device to run inference on
        output_path: Path to save the visualization
        model_name: Name of the model for the title
    """
    # Create a dataloader for the dataset
    from torch.utils.data import DataLoader
    
    dataloader = DataLoader(dataset, batch_size=16)
    
    all_labels = []
    all_probs = []
    
    # Get predictions
    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            # Move batch to device
            inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
            labels = batch['labels'].numpy()
            
            # Forward pass
            outputs = model(**inputs)
            logits = outputs.logits
            
            # Get probabilities for positive class
            probs = torch.nn.functional.softmax(logits, dim=1)[:, 1].cpu().numpy()
            
            all_labels.extend(labels)
            all_probs.extend(probs)
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(all_labels, all_probs)
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {model_name}')
    plt.legend(loc="lower right")
    
    if output_path:
        plt.savefig(output_path)
        plt.close()
        print(f"ROC curve saved to {output_path}")
    
    return roc_auc


def plot_parameter_efficiency(total_params, trainable_params, output_path=None):
    """
    Create a visualization of parameter efficiency for a PEFT model.
    
    Args:
        total_params: Total number of parameters in the model
        trainable_params: Number of trainable parameters
        output_path: Path to save the visualization
    """
    plt.figure(figsize=(10, 6))
    
    # Calculate frozen parameters
    frozen_params = total_params - trainable_params
    
    # Create pie chart
    sizes = [trainable_params, frozen_params]
    labels = ['Trainable Parameters', 'Frozen Parameters']
    colors = ['#ff9999', '#66b3ff']
    explode = (0.1, 0)  # Explode the trainable parameters
    
    plt.pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True, startangle=90,
            textprops={'fontsize': 14})
    
    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
    plt.title(f'Parameter Efficiency: {trainable_params:,} out of {total_params:,} parameters',
              fontsize=16)
    
    # Add text annotation with exact numbers
    plt.annotate(f"Trainable: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)\nFrozen: {frozen_params:,} ({frozen_params/total_params*100:.2f}%)",
                xy=(0.5, 0.05), xycoords='figure fraction',
                horizontalalignment='center', verticalalignment='bottom',
                fontsize=12, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))
    
    if output_path:
        plt.savefig(output_path)
        plt.close()
        print(f"Parameter efficiency visualization saved to {output_path}")


def load_sst2_dataset():
    """
    Load the SST-2 dataset with multiple fallback methods in case of errors.
    Returns the dataset or raises an error if all methods fail.
    """
    # Method 1: Standard loading
    try:
        print("Attempting to load SST-2 dataset directly...")
        dataset = load_dataset("glue", "sst2")
        print("Dataset loaded successfully!")
        return dataset
    except Exception as e:
        print(f"Error in standard loading: {e}")
    
    # Method 2: Try with dataset builder
    try:
        print("Attempting to load with dataset builder...")
        from datasets import load_dataset_builder
        builder = load_dataset_builder("glue", "sst2")
        builder.download_and_prepare()
        dataset = builder.as_dataset()
        print("Dataset loaded successfully with builder!")
        return dataset
    except Exception as e:
        print(f"Error with dataset builder: {e}")
    
    # Method 3: Manual download
    try:
        print("Attempting direct download of SST-2...")
        # Download raw data if not already present
        if not os.path.exists('SST-2.zip'):
            !wget -q https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
            !unzip -q SST-2.zip
        elif not os.path.exists('SST-2'):
            !unzip -q SST-2.zip
        
        # Load train and dev data
        train_df = pd.read_csv('SST-2/train.tsv', sep='\t')
        dev_df = pd.read_csv('SST-2/dev.tsv', sep='\t')
        
        # Rename columns to match expected format if needed
        if 'sentence' not in train_df.columns and 'text' in train_df.columns:
            train_df = train_df.rename(columns={'text': 'sentence'})
            dev_df = dev_df.rename(columns={'text': 'sentence'})
        
        # Convert to datasets format
        train_dataset = Dataset.from_pandas(train_df)
        validation_dataset = Dataset.from_pandas(dev_df)
        
        # Create dataset dictionary
        dataset = DatasetDict({
            'train': train_dataset,
            'validation': validation_dataset
        })
        print("Dataset loaded successfully via direct download!")
        return dataset
    except Exception as e:
        print(f"Error with direct download: {e}")
        raise ValueError("All dataset loading methods failed. Please check your environment.")


def format_for_pytorch(dataset):
    """
    Format a dataset for PyTorch by removing unnecessary columns 
    and renaming labels.
    
    Args:
        dataset: Hugging Face dataset
        
    Returns:
        Formatted dataset
    """
    # Remove unnecessary columns
    if "sentence" in dataset.column_names:
        dataset = dataset.remove_columns(["sentence"])
    if "idx" in dataset.column_names:
        dataset = dataset.remove_columns(["idx"])
    
    # Rename label to labels for Trainer compatibility
    if "label" in dataset.column_names and "labels" not in dataset.column_names:
        dataset = dataset.rename_column("label", "labels")
    
    # Set format to PyTorch tensors
    dataset.set_format("torch")
    return dataset


def track_memory_usage(model, device):
    """Track memory usage of a model on a specific device."""
    if device.type == "cuda":
        memory_allocated = torch.cuda.memory_allocated(device) / (1024 ** 2)  # Convert to MB
        memory_reserved = torch.cuda.memory_reserved(device) / (1024 ** 2)
        return {
            "allocated_mb": memory_allocated,
            "reserved_mb": memory_reserved
        }
    else:
        # For CPU, use approximate size based on parameters
        param_size = sum(p.numel() * p.element_size() for p in model.parameters())
        buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
        total_size = (param_size + buffer_size) / (1024 ** 2)  # Convert to MB
        return {
            "allocated_mb": total_size,
            "reserved_mb": total_size
        }


def get_model_size_estimate(model, as_string=False):
    """Get an estimate of model size in MB."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    total_size_mb = (param_size + buffer_size) / (1024 ** 2)  # Convert to MB
    
    if as_string:
        if total_size_mb > 1024:
            return f"{total_size_mb/1024:.2f} GB"
        else:
            return f"{total_size_mb:.2f} MB"
    else:
        return total_size_mb
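
The size estimate above is element count times bytes per element. As a worked example, the sketch below applies that arithmetic to the 66,955,010 parameters this notebook reports for DistilBERT, at a few precisions. It counts parameters only (buffers add a small amount on top), and the 4-bit figure is an idealized lower bound that ignores quantization constants and any layers left unquantized. `size_mb` is a hypothetical helper name.

```python
def size_mb(n_params: int, bytes_per_param: float) -> float:
    """Approximate parameter memory in MB (1 MB = 1024**2 bytes)."""
    return n_params * bytes_per_param / (1024 ** 2)


N_PARAMS = 66_955_010  # parameter count reported for distilbert-base-uncased

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("4-bit (idealized)", 0.5)]:
    print(f"{name:>18}: {size_mb(N_PARAMS, bytes_per_param):8.2f} MB")
```

The fp32 figure lands within 0.01 MB of the estimate printed for the base model, which is consistent with buffers contributing only a few kilobytes.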

## Loading and Evaluating a Foundation Model

The cells below load the SST-2 dataset and the DistilBERT tokenizer, then load the pre-trained model and evaluate it before any fine-tuning to establish a baseline.

In [33]:
# Load and prepare the SST-2 dataset
print("Loading SST-2 dataset...")
try:
    dataset = load_sst2_dataset()
    
    # Dataset statistics
    print(f"\nDataset Statistics:")
    print(f"Training examples: {len(dataset['train'])}")
    print(f"Validation examples: {len(dataset['validation'])}")
    print(f"Example data point: {dataset['train'][0]}")
    
    # Create smaller subsets for training and quick evaluation
    train_dataset = dataset['train'].shuffle(seed=SEED).select(range(config['train_sample_size']))
    validation_subset = dataset['validation'].shuffle(seed=SEED).select(range(config['validation_sample_size']))
    
    print(f"\nUsing {len(train_dataset)} training examples (subset of full dataset)")
    print(f"Using {len(validation_subset)} validation examples for quick evaluations")
    
    # Check class balance
    if 'label' in dataset['train'].features:
        labels = dataset['train']['label']
        label_counts = pd.Series(labels).value_counts()
        
        print("\nClass distribution in training set:")
        for label, count in label_counts.items():
            print(f"  Label {label}: {count} examples ({count/len(labels)*100:.1f}%)")
        
        # Visualize class distribution
        plt.figure(figsize=(10, 6))
        
        # Plot classes in label order: value_counts() sorts by frequency, which
        # would put label 1 first and mislabel the bars with the tick names below
        plot_counts = label_counts.sort_index()
        ax = sns.barplot(x=plot_counts.index.astype(str), y=plot_counts.values, palette='viridis')
        
        # Add count and percentage labels
        for i, (label, count) in enumerate(plot_counts.items()):
            ax.text(i, count/2, f"{count}\n({count/len(labels)*100:.1f}%)", 
                   ha='center', va='center', color='white', fontweight='bold')
        
        # Customize the plot
        if len(plot_counts) == 2:
            plt.xticks([0, 1], ['Negative (0)', 'Positive (1)'])
        
        plt.title('Class Distribution in Training Set', fontsize=14)
        plt.xlabel('Class Label', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.tight_layout()
        
        # Save the visualization
        plt.savefig(f"{viz_dir}/class_distribution.png")
        plt.close()
        print(f"Class distribution visualization saved to {viz_dir}/class_distribution.png")
    
    # Load tokenizer
    print("\nLoading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(config['model_name'])
    print(f"Tokenizer loaded with vocabulary size: {len(tokenizer)}")
    
    # Define preprocessing function
    def preprocess_function(examples):
        """Tokenize and prepare examples for the model."""
        return tokenizer(
            examples["sentence"] if "sentence" in examples else examples["text"], 
            truncation=True,
            padding="max_length",
            max_length=config['max_length']
        )
    
    # Process the datasets
    print("\nPreprocessing datasets...")
    start_time = time.time()
    
    encoded_train = train_dataset.map(
        preprocess_function, 
        batched=True, 
        desc="Tokenizing training set"
    )
    
    encoded_validation = dataset['validation'].map(
        preprocess_function, 
        batched=True,
        desc="Tokenizing validation set"
    )
    
    encoded_validation_subset = validation_subset.map(
        preprocess_function,
        batched=True,
        desc="Tokenizing validation subset"
    )
    
    # Prepare datasets for PyTorch
    print("\nPreparing datasets for PyTorch...")
    encoded_train = format_for_pytorch(encoded_train)
    encoded_validation = format_for_pytorch(encoded_validation)
    encoded_validation_subset = format_for_pytorch(encoded_validation_subset)
    
    preprocess_time = time.time() - start_time
    print(f"Preprocessing completed in {preprocess_time:.2f} seconds")
    
    # Examine one processed example to verify
    print("\nSample processed example:")
    example_idx = 0
    sample_input_ids = encoded_train[example_idx]['input_ids']
    sample_attention_mask = encoded_train[example_idx]['attention_mask']
    sample_label = encoded_train[example_idx]['labels']
    
    print(f"Input shape: {sample_input_ids.shape}")
    print(f"Attention mask shape: {sample_attention_mask.shape}")
    print(f"Label: {sample_label}")
    
    # Decode a processed example to verify tokenization
    decoded_text = tokenizer.decode(sample_input_ids, skip_special_tokens=True)
    print(f"\nDecoded text: {decoded_text}")
    
    print("\nDataset preprocessing completed successfully!")
    
except Exception as e:
    import traceback
    print(f"Error during dataset preparation: {e}")
    traceback.print_exc()
    raise RuntimeError("Dataset preparation failed. Cannot proceed with the project.")

Loading SST-2 dataset...
Attempting to load SST-2 dataset directly...
Dataset loaded successfully!

Dataset Statistics:
Training examples: 67349
Validation examples: 872
Example data point: {'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}

Using 5000 training examples (subset of full dataset)
Using 500 validation examples for quick evaluations

Class distribution in training set:
  Label 1: 37569 examples (55.8%)
  Label 0: 29780 examples (44.2%)
Class distribution visualization saved to ./peft_output/visualizations/class_distribution.png

Loading tokenizer...
Tokenizer loaded with vocabulary size: 30522

Preprocessing datasets...




Preparing datasets for PyTorch...
Preprocessing completed in 0.17 seconds

Sample processed example:
Input shape: torch.Size([128])
Attention mask shape: torch.Size([128])
Label: 1

Decoded text: klein, charming in comedies like american pie and dead - on in election,

Dataset preprocessing completed successfully!
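
The preprocessing step pads or truncates every sentence to `max_length` tokens, which is why each example has shape `[128]` regardless of sentence length. The pure-Python sketch below mimics that behavior on made-up token IDs. `pad_or_truncate` is a hypothetical helper for illustration, not the tokenizer's API, and it ignores details such as keeping the final `[SEP]` token when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut or right-pad a token sequence to exactly max_length entries."""
    ids = token_ids[:max_length]      # truncate if too long
    n_real = len(ids)
    n_pad = max_length - n_real       # zero when the sequence was truncated
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


short = pad_or_truncate([101, 2023, 3185, 102], max_length=8)   # made-up IDs
long = pad_or_truncate(list(range(1, 13)), max_length=8)        # 12 tokens in

print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

Padding everything to a fixed length keeps the pipeline simple. Since SST-2 sentences are often much shorter than 128 tokens, padding each batch only to its longest example (for instance with a padding data collator) may reduce wasted computation.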


In [34]:
# Load and evaluate the base model

print(f"Loading base model: {config['model_name']}...")
try:
    # Load the base model for sequence classification
    base_model = AutoModelForSequenceClassification.from_pretrained(
        config['model_name'],
        num_labels=2  # Binary classification
    )
    base_model = base_model.to(device)
    
    # Record memory usage before training
    base_memory = track_memory_usage(base_model, device)
    
    # Model architecture summary
    print(f"\nModel Architecture Summary:")
    print(f"Model type: {base_model.config.model_type}")
    print(f"Number of parameters: {sum(p.numel() for p in base_model.parameters()):,}")
    print(f"Estimated model size: {get_model_size_estimate(base_model, as_string=True)}")
    
    # List potential target modules for LoRA
    print("\nPotential target modules for LoRA:")
    target_modules = []
    
    for name, module in base_model.named_modules():
        if any(key in name for key in ['query', 'key', 'value', 'q_lin', 'k_lin', 'v_lin', 'out_lin', 'attention']):
            target_modules.append(name)
            print(f"  {name}")
    
    # Set up evaluation arguments
    eval_args = TrainingArguments(
        output_dir=f"{output_dir}/base_eval",
        per_device_eval_batch_size=config['eval_batch_size'],
        do_train=False,
        do_eval=True,
        report_to="none"  # Disable wandb
    )
    
    # Create a Trainer for the base model
    print("\nCreating evaluation trainer for the base model...")
    base_trainer = Trainer(
        model=base_model,
        args=eval_args,
        compute_metrics=compute_metrics,
        eval_dataset=encoded_validation,
    )
    
    # Evaluate the base model with timing
    print("\nEvaluating the foundation model (this may take a few minutes)...")
    start_time = time.time()
    base_model_results = base_trainer.evaluate()
    eval_time = time.time() - start_time
    
    # Print evaluation results with formatted output
    print(f"\nBase model evaluation completed in {eval_time:.2f} seconds")
    print("\nBase model results:")
    print("-" * 50)
    for metric, value in base_model_results.items():
        # Skip metrics that are not relevant for display
        if not metric.startswith('eval_runtime') and not metric.startswith('eval_samples_per'):
            print(f"{metric:25s}: {value:.4f}")
    
    # Generate predictions for confusion matrix
    print("\nGenerating predictions for analysis...")
    predictions = base_trainer.predict(encoded_validation)
    predicted_labels = np.argmax(predictions.predictions, axis=1)
    true_labels = predictions.label_ids
    
    # Create and save confusion matrix
    visualize_confusion_matrix(
        predicted_labels, 
        true_labels, 
        output_path=f"{viz_dir}/base_model_confusion_matrix.png",
        model_name="Base Model"
    )
    
    # Generate classification report
    report = classification_report(true_labels, predicted_labels, target_names=['Negative', 'Positive'])
    print("\nClassification Report:")
    print(report)
    
    # Save report to file
    with open(f"{results_dir}/base_model_classification_report.txt", "w") as f:
        f.write(report)
    
    # Plot ROC curve
    roc_auc = plot_roc_curve(
        base_model, 
        encoded_validation, 
        tokenizer, 
        device, 
        output_path=f"{viz_dir}/base_model_roc_curve.png",
        model_name="Base Model"
    )
    
    # Add ROC AUC to results
    base_model_results['eval_roc_auc'] = roc_auc
    
    # Store results for later comparison
    all_results['Base Model'] = base_model_results
    
    # Create confidence distribution visualization
    print("\nVisualizing prediction confidence distribution...")
    softmax_outputs = torch.nn.functional.softmax(torch.tensor(predictions.predictions), dim=1)
    confidence_scores = softmax_outputs.max(dim=1).values.numpy()
    
    plt.figure(figsize=(10, 6))
    sns.histplot(confidence_scores, bins=20, kde=True)
    plt.title('Distribution of Prediction Confidence (Base Model)', fontsize=14)
    plt.xlabel('Confidence Score', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.savefig(f"{viz_dir}/base_model_confidence.png")
    plt.close()
    
    print("\nBase model evaluation complete!")
    
except Exception as e:
    import traceback
    print(f"Error during base model evaluation: {e}")
    traceback.print_exc()
    raise RuntimeError("Base model evaluation failed. Cannot proceed with PEFT.")

Loading base model: distilbert-base-uncased...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model Architecture Summary:
Model type: distilbert
Number of parameters: 66,955,010
Estimated model size: 255.42 MB

Potential target modules for LoRA:
  distilbert.transformer.layer.0.attention
  distilbert.transformer.layer.0.attention.dropout
  distilbert.transformer.layer.0.attention.q_lin
  distilbert.transformer.layer.0.attention.k_lin
  distilbert.transformer.layer.0.attention.v_lin
  distilbert.transformer.layer.0.attention.out_lin
  distilbert.transformer.layer.1.attention
  distilbert.transformer.layer.1.attention.dropout
  distilbert.transformer.layer.1.attention.q_lin
  distilbert.transformer.layer.1.attention.k_lin
  distilbert.transformer.layer.1.attention.v_lin
  distilbert.transformer.layer.1.attention.out_lin
  distilbert.transformer.layer.2.attention
  distilbert.transformer.layer.2.attention.dropout
  distilbert.transformer.layer.2.attention.q_lin
  distilbert.transformer.layer.2.attention.k_lin
  distilbert.transformer.layer.2.attention.v_lin
  distilbert.transform


Base model evaluation completed in 0.98 seconds

Base model results:
--------------------------------------------------
eval_loss                : 0.6972
eval_accuracy            : 0.4346
eval_f1                  : 0.3621
eval_precision           : 0.3897
eval_recall              : 0.4346
eval_precision_class_0   : 0.4555
eval_recall_class_0      : 0.7780
eval_f1_class_0          : 0.5746
eval_precision_class_1   : 0.3262
eval_recall_class_1      : 0.1036
eval_f1_class_1          : 0.1573
eval_steps_per_second    : 28.7640

Generating predictions for analysis...

Classification Report:
              precision    recall  f1-score   support

    Negative       0.46      0.78      0.57       428
    Positive       0.33      0.10      0.16       444

    accuracy                           0.43       872
   macro avg       0.39      0.44      0.37       872
weighted avg       0.39      0.43      0.36       872

ROC curve saved to ./peft_output/visualizations/base_model_roc_curve.png

Visua

## Performing Parameter-Efficient Fine-Tuning

The cells below create a PEFT model from the loaded base model, run a training loop, and save the PEFT model weights.
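As a rough illustration of what a LoRA adapter does to a single linear layer, here is a minimal NumPy sketch. It is not PEFT's implementation: the small dimensions, the rank, and the alpha value are assumed for illustration only, and dropout and bias terms are omitted. The frozen weight `W` is left untouched, and the trainable low-rank pair `A`, `B` contributes an update scaled by `lora_alpha / r`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed); the real attention projections are much larger
d_in, d_out = 64, 64
r, lora_alpha = 4, 8

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen linear layer plus a scaled low-rank update: x W^T + (alpha / r) * x A^T B^T."""
    scaling = lora_alpha / r
    return x @ W.T + scaling * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))
base_out = x @ W.T

# With B initialised to zero the adapter is a no-op at the start of training
lora_out = lora_forward(x, W, A, B, lora_alpha, r)

# Once B moves away from zero, the effective weight update has rank at most r
B_trained = rng.normal(size=(d_out, r))
delta_W = (lora_alpha / r) * B_trained @ A
tuned_out = lora_forward(x, W, A, B_trained, lora_alpha, r)

print("adapter is a no-op at init:", np.allclose(base_out, lora_out))
print("rank of the weight update:", np.linalg.matrix_rank(delta_W))
```

Only `A` and `B` would receive gradients in training, which is why the trainable parameter count stays small relative to the full model.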

## LoRA setup


In [35]:
# Configure and set up LoRA for parameter-efficient fine-tuning

print("Configuring LoRA for parameter-efficient fine-tuning...")
try:
    # Define LoRA Configuration for the model
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,         # Sequence classification task
        r=config['lora_r'],                 # Rank of the update matrices
        lora_alpha=config['lora_alpha'],    # Scaling factor for LoRA
        lora_dropout=config['lora_dropout'],# Dropout probability for LoRA layers
        # Target specific attention modules
        target_modules=[
            "q_lin",                        # Query projection
            "v_lin",                        # Value projection 
            "k_lin",                        # Key projection
            "out_lin"                       # Output projection
        ],
        bias="none",                        # Don't train bias parameters
        inference_mode=False,               # We're training, not just inferring
    )
    
    # Create the PEFT model from base model
    peft_model = get_peft_model(base_model, lora_config)
    
    # Print trainable vs total parameters
    trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in peft_model.parameters())
    percentage = 100 * trainable_params / total_params
    
    print("\nTrainable parameter analysis:")
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Percentage of parameters being trained: {percentage:.4f}%")
    print(f"Parameter reduction factor: {total_params / trainable_params:.2f}x")
    
    # Record memory usage after creating PEFT model
    lora_memory = track_memory_usage(peft_model, device)
    
    # Calculate the memory difference (base minus LoRA); a negative value means the adapters added memory
    memory_reduction = {
        "allocated": base_memory["allocated_mb"] - lora_memory["allocated_mb"],
        "reserved": base_memory["reserved_mb"] - lora_memory["reserved_mb"]
    }
    
    print(f"\nMemory usage:")
    print(f"Base model: {base_memory['allocated_mb']:.2f} MB allocated")
    print(f"LoRA model: {lora_memory['allocated_mb']:.2f} MB allocated")
    print(f"Memory reduction: {memory_reduction['allocated']:.2f} MB ({memory_reduction['allocated']/base_memory['allocated_mb']*100:.2f}%)")
    
    # Visualize parameter efficiency
    plot_parameter_efficiency(
        total_params, 
        trainable_params, 
        output_path=f"{viz_dir}/lora_parameter_efficiency.png"
    )
    
    # Also use the built-in function to print parameters
    print("\nLoRA model trainable parameters:")
    peft_model.print_trainable_parameters()
    
    # Save the LoRA configuration for reference
    with open(f"{output_dir}/lora_config.json", "w") as f:
        # Extract config values that are JSON serializable
        config_dict = {
            "task_type": str(lora_config.task_type),
            "r": lora_config.r,
            "lora_alpha": lora_config.lora_alpha,
            "lora_dropout": lora_config.lora_dropout,
            "target_modules": lora_config.target_modules,
            "bias": lora_config.bias,
        }
        json.dump(config_dict, f, indent=2)
    
    print(f"\nLoRA configuration saved to {output_dir}/lora_config.json")
    
    # Display model architecture changes
    print("\nLoRA adapter added to the following modules:")
    lora_modules = []
    for name, module in peft_model.named_modules():
        if 'lora' in name.lower():
            lora_modules.append(name)
            print(f"  {name}")
    
    print(f"\nLoRA model setup complete with {len(lora_modules)} adapter modules!")
    
except Exception as e:
    import traceback
    print(f"Error setting up LoRA: {e}")
    traceback.print_exc()
    raise RuntimeError("LoRA setup failed.")

Configuring LoRA for parameter-efficient fine-tuning...

Trainable parameter analysis:
Total parameters: 68,136,964
Trainable parameters: 1,774,084
Percentage of parameters being trained: 2.6037%
Parameter reduction factor: 38.41x

Memory usage:
Base model: 3076.95 MB allocated
LoRA model: 3081.46 MB allocated
Memory reduction: -4.51 MB (-0.15%)
Parameter efficiency visualization saved to ./peft_output/visualizations/lora_parameter_efficiency.png

LoRA model trainable parameters:
trainable params: 1,774,084 || all params: 68,136,964 || trainable%: 2.6037027420241383

LoRA configuration saved to ./peft_output/lora_config.json

LoRA adapter added to the following modules:
  base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_dropout
  base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_dropout.default
  base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_A
  base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_A.default
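
The adapter parameter count can be checked by hand. The sketch below assumes the standard DistilBERT base shape (6 transformer blocks, hidden size 768) and a rank of 16; the value of `config['lora_r']` is not shown in this section, so treat the rank as an assumption. The helper name `lora_adapter_params` is made up for this example. Each adapted linear layer gains an `A` matrix of shape `r x d_in` and a `B` matrix of shape `d_out x r`.

```python
def lora_adapter_params(r, d_in, d_out, n_adapted_layers):
    """Parameters added by LoRA: r * (d_in + d_out) per adapted linear layer."""
    return n_adapted_layers * r * (d_in + d_out)


HIDDEN = 768   # DistilBERT base hidden size
N_BLOCKS = 6   # transformer blocks in distilbert-base-uncased

# q_lin, k_lin, v_lin and out_lin in every block, with an assumed rank of 16
all_attention = lora_adapter_params(16, HIDDEN, HIDDEN, 4 * N_BLOCKS)

# q_lin and v_lin only, with rank 8
query_value_only = lora_adapter_params(8, HIDDEN, HIDDEN, 2 * N_BLOCKS)

print(f"All attention projections, r=16: {all_attention:,}")
print(f"q_lin and v_lin only, r=8:       {query_value_only:,}")
```

The trainable totals printed by `print_trainable_parameters()` are larger than these adapter-only figures, most likely because PEFT also keeps the classification head trainable for sequence-classification tasks.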
 

## LoRA training and evaluation


In [36]:
# Train the LoRA model

print("Configuring training parameters...")
try:
    train_args = TrainingArguments(
        # Output settings
        output_dir=f"{output_dir}/lora_model",     # Directory to save outputs
        
        # Training hyperparameters
        learning_rate=config['learning_rate'],     # Learning rate
        per_device_train_batch_size=config['batch_size'],  # Batch size for training
        per_device_eval_batch_size=config['eval_batch_size'],  # Batch size for evaluation
        num_train_epochs=config['epochs'],         # Number of training epochs
        weight_decay=config['weight_decay'],       # Weight decay for regularization
        
        # Training loop settings
        logging_dir=f"{output_dir}/logs",          # Directory for logs
        logging_steps=50,                          # Log every 50 steps
        save_strategy="epoch",                     # Save at the end of each epoch
        evaluation_strategy="epoch",               # Evaluate at the end of each epoch
        
        # Optimization settings
        gradient_accumulation_steps=1,             # Accumulate gradients over steps 
        fp16=torch.cuda.is_available(),            # Use mixed precision if GPU available
        
        # Model selection
        load_best_model_at_end=True,               # Load the best model when finished
        metric_for_best_model="accuracy",          # Use accuracy to determine best model
        greater_is_better=True,                    # Higher accuracy is better
        
        # Miscellaneous
        push_to_hub=False,                         # Don't push to HuggingFace Hub
        report_to="none",                          # Disable wandb reporting
    )
    
    # Create a Trainer for fine-tuning
    print("Creating trainer for LoRA fine-tuning...")
    peft_trainer = Trainer(
        model=peft_model,                          # The LoRA-equipped model
        args=train_args,                           # Training arguments
        train_dataset=encoded_train,               # Training dataset
        eval_dataset=encoded_validation_subset,    # Evaluation dataset
        compute_metrics=compute_metrics,           # Evaluation metrics function
    )
    
    # Log the training start
    print("\nStarting LoRA fine-tuning...")
    print(f"Training on {len(encoded_train)} examples")
    print(f"Evaluating on {len(encoded_validation_subset)} examples during training")
    print(f"Training for {train_args.num_train_epochs} epochs with batch size {train_args.per_device_train_batch_size}")
    
    # Train the model with timing
    start_time = time.time()
    train_result = peft_trainer.train()
    training_time = time.time() - start_time
    
    # Print training statistics
    print(f"\nTraining completed in {format_time(training_time)}")
    print(f"Training loss: {train_result.training_loss:.4f}")
    
    # Evaluate the fine-tuned model on the entire validation set
    print("\nEvaluating the fine-tuned model on the full validation set...")
    # Pass the full validation set explicitly instead of mutating the trainer's eval_dataset
    lora_eval_results = peft_trainer.evaluate(eval_dataset=encoded_validation)
    
    # Print evaluation results
    print("\nLoRA fine-tuned model results:")
    print("-" * 50)
    for metric, value in lora_eval_results.items():
        # Skip metrics that are not relevant for display
        if not metric.startswith('eval_runtime') and not metric.startswith('eval_samples_per'):
            print(f"{metric:25s}: {value:.4f}")
    
    # Store results for comparison
    all_results['LoRA Model'] = lora_eval_results
    
    # Save the trained PEFT model
    peft_model_path = f"{models_dir}/lora_model"
    peft_model.save_pretrained(peft_model_path)
    print(f"\nLoRA PEFT model saved to {peft_model_path}")
    
    # Create learning curves
    print("\nGenerating learning curves...")
    if hasattr(peft_trainer, 'state') and hasattr(peft_trainer.state, 'log_history'):
        plot_training_history(
            peft_trainer.state.log_history,
            output_path=f"{viz_dir}/lora_learning_curves.png"
        )
    else:
        print("No training history found to generate learning curves.")
    
    # Generate predictions for confusion matrix
    print("\nGenerating predictions for analysis...")
    predictions = peft_trainer.predict(encoded_validation)
    predicted_labels = np.argmax(predictions.predictions, axis=1)
    true_labels = predictions.label_ids
    
    # Create and save confusion matrix
    visualize_confusion_matrix(
        predicted_labels, 
        true_labels, 
        output_path=f"{viz_dir}/lora_model_confusion_matrix.png",
        model_name="LoRA Model"
    )
    
    # Generate classification report
    report = classification_report(true_labels, predicted_labels, target_names=['Negative', 'Positive'])
    print("\nClassification Report:")
    print(report)
    
    # Save report to file
    with open(f"{results_dir}/lora_model_classification_report.txt", "w") as f:
        f.write(report)
    
    # Plot ROC curve
    roc_auc = plot_roc_curve(
        peft_model, 
        encoded_validation, 
        tokenizer, 
        device, 
        output_path=f"{viz_dir}/lora_model_roc_curve.png",
        model_name="LoRA Model"
    )
    
    # Add ROC AUC to results
    lora_eval_results['eval_roc_auc'] = roc_auc
    
    print("\nLoRA model training and evaluation complete!")
    
except Exception as e:
    import traceback
    print(f"Error during LoRA training: {e}")
    traceback.print_exc()
    raise RuntimeError("LoRA training failed.")

Configuring training parameters...
Creating trainer for LoRA fine-tuning...

Starting LoRA fine-tuning...
Training on 5000 examples
Evaluating on 500 examples during training
Training for 3 epochs with batch size 16


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Precision Class 0 | Recall Class 0 | F1 Class 0 | Precision Class 1 | Recall Class 1 | F1 Class 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3201 | 0.312448 | 0.858 | 0.858012 | 0.861056 | 0.858 | 0.823077 | 0.895397 | 0.857715 | 0.895833 | 0.823755 | 0.858283 |
| 2 | 0.2521 | 0.329994 | 0.864 | 0.863908 | 0.864066 | 0.864 | 0.866953 | 0.845188 | 0.855932 | 0.861423 | 0.881226 | 0.871212 |
| 3 | 0.1409 | 0.42241 | 0.856 | 0.856023 | 0.856074 | 0.856 | 0.846473 | 0.853556 | 0.85 | 0.864865 | 0.858238 | 0.861538 |


Checkpoint destination directory ./peft_output/lora_model/checkpoint-313 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft_output/lora_model/checkpoint-626 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft_output/lora_model/checkpoint-939 already exists and is non-empty.Saving will proceed but saved results may be invalid.



Training completed in 00:01:05
Training loss: 0.2485

Evaluating the fine-tuned model on the full validation set...



LoRA fine-tuned model results:
--------------------------------------------------
eval_loss                : 0.3263
eval_accuracy            : 0.8727
eval_f1                  : 0.8726
eval_precision           : 0.8736
eval_recall              : 0.8727
eval_precision_class_0   : 0.8914
eval_recall_class_0      : 0.8435
eval_f1_class_0          : 0.8667
eval_precision_class_1   : 0.8565
eval_recall_class_1      : 0.9009
eval_f1_class_1          : 0.8782
eval_steps_per_second    : 22.3230
epoch                    : 3.0000

LoRA PEFT model saved to ./peft_output/models/lora_model

Generating learning curves...
Training history plot saved to ./peft_output/visualizations/lora_learning_curves.png

Generating predictions for analysis...

Classification Report:
              precision    recall  f1-score   support

    Negative       0.89      0.84      0.87       428
    Positive       0.86      0.90      0.88       444

    accuracy                           0.87       872
   macro avg      
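
The checkpoint names and the effect of `load_best_model_at_end` can be reproduced from the logged numbers. The sketch below is a simplified stand-in for the selection logic, not the `Trainer` implementation. It assumes a single device and no gradient accumulation, and the helper names are made up for this example. The per-epoch values are copied from the training log above.

```python
import math


def steps_per_epoch(n_examples, batch_size):
    """Optimizer steps in one epoch (single device, no gradient accumulation assumed)."""
    return math.ceil(n_examples / batch_size)


def pick_best(log, metric, greater_is_better):
    """Return the logged epoch with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[metric])


# Validation metrics per epoch, copied from the LoRA training log
epoch_log = [
    {"epoch": 1, "eval_loss": 0.312448, "eval_accuracy": 0.858},
    {"epoch": 2, "eval_loss": 0.329994, "eval_accuracy": 0.864},
    {"epoch": 3, "eval_loss": 0.42241, "eval_accuracy": 0.856},
]

steps = steps_per_epoch(5000, 16)
checkpoints = [steps * row["epoch"] for row in epoch_log]

best_by_accuracy = pick_best(epoch_log, "eval_accuracy", greater_is_better=True)
best_by_loss = pick_best(epoch_log, "eval_loss", greater_is_better=False)

print("checkpoint steps:", checkpoints)
print("best epoch by accuracy:", best_by_accuracy["epoch"])
print("best epoch by validation loss:", best_by_loss["epoch"])
```

Validation loss rises after the first epoch while training loss keeps falling, so the choice of selection metric matters here: accuracy and validation loss point to different checkpoints.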

## QLoRA setup and training


In [37]:
# Attempt to implement QLoRA with proper error handling

print("\n" + "="*80)
print("Setting up QLoRA (Quantized LoRA) implementation...")
print("="*80)

try:
    # Check if bitsandbytes is available
    try:
        import bitsandbytes as bnb
        print("bitsandbytes is available - version information may vary.")
        bnb_available = True
    except ImportError:
        print("bitsandbytes is not installed. Installing now...")
        !pip install --user bitsandbytes==0.38.1 --quiet
        try:
            import bitsandbytes as bnb
            print("Successfully installed bitsandbytes!")
            bnb_available = True
        except ImportError:
            print("Failed to install bitsandbytes. QLoRA will not be available.")
            bnb_available = False
    
    if not bnb_available:
        raise ImportError("bitsandbytes is required for QLoRA but is not available.")
    
    # Check if we have GPU support
    if not torch.cuda.is_available():
        print("Warning: CUDA is not available. QLoRA may not work as expected.")
    
    print("\nCreating a fresh model for QLoRA...")
    
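    # Load a fresh model with quantized weights.
    # Note: load_in_8bit=True applies 8-bit quantization. QLoRA as originally described
    # uses 4-bit NF4 quantization, so strictly speaking this is LoRA on an 8-bit base model.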
    qlora_base_model = AutoModelForSequenceClassification.from_pretrained(
        config['model_name'],
        num_labels=2,
        load_in_8bit=True  # Enable 8-bit quantization
    )
    
    # Crucial step: Prepare model for 8-bit training to prevent FP16 gradient issues
    print("Preparing model for 8-bit training...")
    qlora_base_model = prepare_model_for_kbit_training(qlora_base_model)
    
    # Define QLoRA configuration with more conservative settings
    qlora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,                          # Lower rank for QLoRA
        lora_alpha=16,                # Lower alpha
        lora_dropout=0.05,            # Less dropout for stability
        # Target fewer modules to reduce complexity
        target_modules=["q_lin", "v_lin"],
        bias="none",
        inference_mode=False,
    )
    
    # Create the QLoRA model
    qlora_model = get_peft_model(qlora_base_model, qlora_config)
    
    # Print trainable parameters
    print("\nQLoRA model trainable parameters:")
    qlora_model.print_trainable_parameters()
    
    # Record memory usage
    qlora_memory = track_memory_usage(qlora_model, device)
    
    # Calculate memory savings compared to base model
    memory_savings = {
        "allocated": base_memory["allocated_mb"] - qlora_memory["allocated_mb"],
        "reserved": base_memory["reserved_mb"] - qlora_memory["reserved_mb"]
    }
    
    print(f"\nMemory usage comparison:")
    print(f"Base model: {base_memory['allocated_mb']:.2f} MB allocated")
    print(f"LoRA model: {lora_memory['allocated_mb']:.2f} MB allocated")
    print(f"QLoRA model: {qlora_memory['allocated_mb']:.2f} MB allocated")
    print(f"QLoRA memory savings vs base: {memory_savings['allocated']:.2f} MB ({memory_savings['allocated']/base_memory['allocated_mb']*100:.2f}%)")
    
    # Set up QLoRA training arguments - critical settings for stability
    qlora_train_args = TrainingArguments(
        output_dir=f"{output_dir}/qlora_model",
        learning_rate=config['qlora_learning_rate'],  # Lower learning rate
        per_device_train_batch_size=config['qlora_batch_size'],  # Smaller batch size
        per_device_eval_batch_size=config['qlora_batch_size'],
        num_train_epochs=config['epochs'],
        weight_decay=0.0,           # No weight decay for stability
        logging_steps=50,
        save_strategy="epoch",
        evaluation_strategy="epoch",
        load_best_model_at_end=True,
        push_to_hub=False,
        report_to="none",
        # CRITICAL: Disable mixed precision which causes issues with 8-bit models
        fp16=False,
        bf16=False
    )
    
    # Create a Trainer for QLoRA
    print("Creating QLoRA trainer...")
    # Use a smaller training subset (at most 2,000 examples) for stability
    qlora_train_subset = encoded_train.select(range(min(2000, len(encoded_train))))
    qlora_trainer = Trainer(
        model=qlora_model,
        args=qlora_train_args,
        train_dataset=qlora_train_subset,
        eval_dataset=encoded_validation_subset,
        compute_metrics=compute_metrics,
    )
    
    # Train the QLoRA model
    print("\nTraining the QLoRA model...")
    print(f"Training on {len(qlora_train_subset)} examples")
    print(f"Evaluating on {len(encoded_validation_subset)} examples")
    
    start_time = time.time()
    qlora_trainer.train()
    qlora_training_time = time.time() - start_time
    
    print(f"\nQLoRA training completed in {format_time(qlora_training_time)}")
    
    # Evaluate on full validation set
    print("\nEvaluating QLoRA model on full validation set...")
    # Pass the full validation set explicitly instead of mutating the trainer's eval_dataset
    qlora_eval_results = qlora_trainer.evaluate(eval_dataset=encoded_validation)
    
    # Print evaluation results
    print("\nQLoRA model results:")
    print("-" * 50)
    for metric, value in qlora_eval_results.items():
        if not metric.startswith('eval_runtime') and not metric.startswith('eval_samples_per'):
            print(f"{metric:25s}: {value:.4f}")
    
    # Store results for comparison
    all_results['QLoRA Model'] = qlora_eval_results
    
    # Save QLoRA model
    qlora_model_path = f"{models_dir}/qlora_model"
    qlora_model.save_pretrained(qlora_model_path)
    print(f"\nQLoRA model saved to {qlora_model_path}")
    
    # Generate learning curves
    if hasattr(qlora_trainer, 'state') and hasattr(qlora_trainer.state, 'log_history'):
        plot_training_history(
            qlora_trainer.state.log_history,
            output_path=f"{viz_dir}/qlora_learning_curves.png"
        )
    
    # Generate predictions for confusion matrix
    print("\nGenerating predictions for analysis...")
    predictions = qlora_trainer.predict(encoded_validation)
    predicted_labels = np.argmax(predictions.predictions, axis=1)
    true_labels = predictions.label_ids
    
    # Create and save confusion matrix
    visualize_confusion_matrix(
        predicted_labels, 
        true_labels, 
        output_path=f"{viz_dir}/qlora_model_confusion_matrix.png",
        model_name="QLoRA Model"
    )
    
    # QLoRA implementation succeeded
    print("\nQLoRA implementation and training completed successfully!")
    qlora_implemented = True
    
except Exception as e:
    import traceback
    print(f"\nError implementing QLoRA: {e}")
    traceback.print_exc()
    print("\nContinuing with standard LoRA only.")
    qlora_implemented = False


Setting up QLoRA (Quantized LoRA) implementation...
bitsandbytes is available - version information may vary.

Creating a fresh model for QLoRA...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Preparing model for 8-bit training...

QLoRA model trainable parameters:
trainable params: 1,331,716 || all params: 67,694,596 || trainable%: 1.967241225577297

Memory usage comparison:
Base model: 3076.95 MB allocated
LoRA model: 3081.46 MB allocated
QLoRA model: 2701.94 MB allocated
QLoRA memory savings vs base: 375.02 MB (12.19%)
Creating QLoRA trainer...

Training the QLoRA model...
Training on 2000 examples
Evaluating on 500 examples


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Precision Class 0 | Recall Class 0 | F1 Class 0 | Precision Class 1 | Recall Class 1 | F1 Class 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3951 | 0.382369 | 0.834 | 0.833758 | 0.834328 | 0.834 | 0.842105 | 0.803347 | 0.82227 | 0.827206 | 0.862069 | 0.844278 |
| 2 | 0.3136 | 0.375885 | 0.832 | 0.831344 | 0.833955 | 0.832 | 0.857143 | 0.778243 | 0.815789 | 0.812721 | 0.881226 | 0.845588 |
| 3 | 0.2968 | 0.362949 | 0.846 | 0.846013 | 0.846033 | 0.846 | 0.8375 | 0.841004 | 0.839248 | 0.853846 | 0.850575 | 0.852207 |


Checkpoint destination directory ./peft_output/qlora_model/checkpoint-250 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft_output/qlora_model/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft_output/qlora_model/checkpoint-750 already exists and is non-empty.Saving will proceed but saved results may be invalid.



QLoRA training completed in 00:02:16

Evaluating QLoRA model on full validation set...



QLoRA model results:
--------------------------------------------------
eval_loss                : 0.3598
eval_accuracy            : 0.8452
eval_f1                  : 0.8452
eval_precision           : 0.8453
eval_recall              : 0.8452
eval_precision_class_0   : 0.8383
eval_recall_class_0      : 0.8481
eval_f1_class_0          : 0.8432
eval_precision_class_1   : 0.8519
eval_recall_class_1      : 0.8423
eval_f1_class_1          : 0.8471
eval_steps_per_second    : 15.8850
epoch                    : 3.0000

QLoRA model saved to ./peft_output/models/qlora_model
Training history plot saved to ./peft_output/visualizations/qlora_learning_curves.png

Generating predictions for analysis...

QLoRA implementation and training completed successfully!
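
To put the memory figures in context, the sketch below estimates the size of the weights alone at different precisions and recomputes the reported saving. It is a back-of-the-envelope calculation, not a measurement. The parameter count is taken from the base model summary, the allocated figures are the rounded values printed above, and the helper names are made up for this example. The 8-bit estimate assumes every weight is stored in one byte, which is optimistic since typically only the linear layers are quantized.

```python
def weight_memory_mb(n_params, bytes_per_param):
    """Memory needed to store the weights alone, in MB (1 MB = 1024**2 bytes)."""
    return n_params * bytes_per_param / (1024 ** 2)


def savings(base_mb, other_mb):
    """Absolute and relative saving of `other_mb` compared with `base_mb`."""
    saved = base_mb - other_mb
    return saved, 100 * saved / base_mb


N_PARAMS = 66_955_010  # from the base model summary

fp32_mb = weight_memory_mb(N_PARAMS, 4)  # full-precision weights
int8_mb = weight_memory_mb(N_PARAMS, 1)  # lower bound if every weight were 8-bit

# Allocated figures printed by the notebook (rounded to two decimals)
saved_mb, saved_pct = savings(3076.95, 2701.94)

print(f"fp32 weights: {fp32_mb:.1f} MB, int8 weights: {int8_mb:.1f} MB")
print(f"measured saving: {saved_mb:.2f} MB ({saved_pct:.2f}%)")
```

The allocated totals are far larger than the weights themselves, so they presumably include other tensors resident on the device. That would explain why the measured relative saving is much smaller than the fourfold reduction in weight storage alone would suggest.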


## LoRA configuration experiments
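
When comparing the variants in this section, it helps to look at the effective LoRA scaling factor, `lora_alpha / r`, which multiplies the low-rank update. The sketch below computes it for the five configurations tested here, with ranks and alphas copied from the cell that follows. It assumes the standard LoRA scaling rule rather than any rank-stabilized variant.

```python
def lora_scaling(r, lora_alpha):
    """Standard LoRA scaling applied to the low-rank update."""
    return lora_alpha / r


# Rank and alpha for each variant tested in this section
variants = {
    "Low Rank (r=4)": {"r": 4, "lora_alpha": 16},
    "High Rank (r=32)": {"r": 32, "lora_alpha": 64},
    "Query-Only": {"r": 16, "lora_alpha": 32},
    "With Bias Training": {"r": 16, "lora_alpha": 32},
    "Higher Dropout": {"r": 16, "lora_alpha": 32},
}

scalings = {name: lora_scaling(**cfg) for name, cfg in variants.items()}

for name, scale in scalings.items():
    print(f"{name:20s} scaling = {scale:.1f}")
```

Every variant except the low-rank one keeps the same scaling factor. The low-rank variant changes both the rank and the scale, so a difference in its results cannot be attributed to rank alone.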

In [38]:
# Experiment with different LoRA configurations

print("\n" + "="*80)
print("Experimenting with different LoRA configurations...")
print("="*80)

try:
    # Define different configurations to test
    lora_configs = [
        {
            "name": "Low Rank (r=4)", 
            "config": LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=4,                      # Lower rank
                lora_alpha=16,            # Lower alpha
                lora_dropout=0.1,
                target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
                bias="none"
            )
        },
        {
            "name": "High Rank (r=32)", 
            "config": LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=32,                     # Higher rank
                lora_alpha=64,            # Higher alpha
                lora_dropout=0.1,
                target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
                bias="none"
            )
        },
        {
            "name": "Query-Only", 
            "config": LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=16,
                lora_alpha=32,
                lora_dropout=0.1,
                # Only targeting query projections
                target_modules=["q_lin"],
                bias="none"
            )
        },
        {
            "name": "With Bias Training", 
            "config": LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=16,
                lora_alpha=32,
                lora_dropout=0.1,
                target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
                bias="lora_only"          # Also train the biases of the LoRA-adapted modules only
            )
        },
        {
            "name": "Higher Dropout", 
            "config": LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=16,
                lora_alpha=32,
                lora_dropout=0.3,         # Higher dropout for regularization
                target_modules=["q_lin", "v_lin", "k_lin", "out_lin"],
                bias="none"
            )
        }
    ]
    
    # Store results of experiments
    comparison_results = []
    variant_metrics = {}
    
    # Test each configuration with a single epoch for efficiency
    for config_idx, config_data in enumerate(lora_configs):
        try:
            print(f"\n{'-'*60}")
            print(f"Testing configuration: {config_data['name']}")
            print(f"{'-'*60}")
            
            # Create a fresh copy of the model
            test_model = AutoModelForSequenceClassification.from_pretrained(
                config['model_name'],
                num_labels=2
            ).to(device)
            
            # Apply the PEFT configuration
            test_peft_model = get_peft_model(test_model, config_data['config'])
            
            # Print trainable parameters
            print("Trainable parameters:")
            test_peft_model.print_trainable_parameters()  # Prints directly and returns None
            
            # Quick training (1 epoch, subset of data)
            test_train_args = TrainingArguments(
                output_dir=f"{output_dir}/variant_{config_idx}",
                learning_rate=config['learning_rate'],
                per_device_train_batch_size=config['batch_size'],
                per_device_eval_batch_size=config['eval_batch_size'],
                num_train_epochs=1,  # Just 1 epoch for quick comparison
                weight_decay=config['weight_decay'],
                logging_steps=100,
                save_strategy="no",  # Don't save checkpoints
                evaluation_strategy="epoch",
                report_to="none",
                fp16=False  # Disable for consistent comparisons
            )
            
            # Create trainer
            test_trainer = Trainer(
                model=test_peft_model,
                args=test_train_args,
                train_dataset=encoded_train.select(range(min(2000, len(encoded_train)))),  # Use smaller subset
                eval_dataset=encoded_validation_subset,
                compute_metrics=compute_metrics,
            )
            
            # Train the variant
            print(f"Training {config_data['name']} variant...")
            start_time = time.time()
            test_trainer.train()
            train_time = time.time() - start_time
            
            # Evaluate on full validation set
            print(f"Evaluating {config_data['name']} variant on full validation set...")
            test_trainer.eval_dataset = encoded_validation
            test_results = test_trainer.evaluate()
            
            # Calculate trainable parameters
            trainable_params = sum(p.numel() for p in test_peft_model.parameters() if p.requires_grad)
            total_params = sum(p.numel() for p in test_peft_model.parameters())
            
            # Print results
            print(f"Results for {config_data['name']}:")
            print(f"Training time: {format_time(train_time)}")
            print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.4f}%)")
            for metric, value in test_results.items():
                if not metric.startswith('eval_runtime') and not metric.startswith('eval_samples_per'):
                    print(f"  {metric}: {value:.4f}")
            
            # Store results
            variant_metrics[config_data['name']] = test_results
            comparison_results.append({
                "config_name": config_data['name'],
                "trainable_params": trainable_params,
                "param_efficiency": trainable_params/total_params*100,
                "accuracy": test_results["eval_accuracy"],
                "f1": test_results["eval_f1"],
                "precision": test_results["eval_precision"],
                "recall": test_results["eval_recall"],
                "training_time_sec": train_time,
                "training_time_min": train_time / 60
            })
            
            # Save the variant model
            variant_path = f"{models_dir}/variant_{config_idx}"
            test_peft_model.save_pretrained(variant_path)
            print(f"Variant model saved to {variant_path}")
            
        except Exception as e:
            print(f"Error testing configuration {config_data['name']}: {str(e)}")
    
    # Print comparison table
    if comparison_results:
        print("\nComparison of different LoRA configurations:")
        config_df = pd.DataFrame(comparison_results)
        
        # Round numeric columns for better display (non-numeric columns are left unchanged)
        display_df = config_df.round(4)
        
        print(display_df)
        
        # Save comparison to CSV
        config_df.to_csv(f"{results_dir}/lora_variants_comparison.csv", index=False)
        print(f"Comparison saved to {results_dir}/lora_variants_comparison.csv")
        
        # Create visualization of results
        plt.figure(figsize=(15, 12))
        
        # Plot accuracy comparison
        plt.subplot(2, 2, 1)
        sns.barplot(x='config_name', y='accuracy', data=config_df, palette='viridis')
        plt.title('Accuracy Comparison', fontsize=14)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.ylabel('Accuracy', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Plot F1 score comparison
        plt.subplot(2, 2, 2)
        sns.barplot(x='config_name', y='f1', data=config_df, palette='viridis')
        plt.title('F1 Score Comparison', fontsize=14)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.ylabel('F1 Score', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Plot training time comparison
        plt.subplot(2, 2, 3)
        sns.barplot(x='config_name', y='training_time_min', data=config_df, palette='viridis')
        plt.title('Training Time Comparison', fontsize=14)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.ylabel('Training Time (min)', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        # Plot parameter efficiency comparison
        plt.subplot(2, 2, 4)
        sns.barplot(x='config_name', y='param_efficiency', data=config_df, palette='viridis')
        plt.title('Parameter Efficiency (% of Total)', fontsize=14)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.ylabel('Parameter Efficiency (%)', fontsize=12)
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/lora_variants_comparison.png")
        plt.close()
        print(f"Comparison visualization saved to {viz_dir}/lora_variants_comparison.png")
        
        # Add best variant to all_results for the final comparison
        best_idx = config_df['accuracy'].argmax()
        best_config = config_df.iloc[best_idx]['config_name']
        all_results[f'Best Variant ({best_config})'] = variant_metrics[best_config]
        print(f"\nBest variant was: {best_config} with accuracy {config_df.iloc[best_idx]['accuracy']:.4f}")
        
        # Create radar chart for comprehensive comparison
        plt.figure(figsize=(10, 10))
        
        # Prepare data for radar chart
        categories = ['Accuracy', 'F1 Score', 'Parameter Efficiency', 'Speed']
        
        # Normalize values for radar chart
        max_acc = config_df['accuracy'].max()
        max_f1 = config_df['f1'].max()
        min_params = config_df['param_efficiency'].min()
        max_params = config_df['param_efficiency'].max()
        min_time = config_df['training_time_sec'].min()
        max_time = config_df['training_time_sec'].max()
        
        # Create radar values (normalized from 0 to 1)
        radar_data = {}
        for _, row in config_df.iterrows():
            # Normalize values to 0-1 scale
            acc_norm = row['accuracy'] / max_acc
            f1_norm = row['f1'] / max_f1
            
            # For params, a lower trainable percentage is better,
            # so invert the min-max scale to make higher values better
            if max_params == min_params:
                param_norm = 1.0
            else:
                param_norm = 1 - ((row['param_efficiency'] - min_params) / (max_params - min_params))
            
            # For time, we want faster (lower is better)
            # We normalize and invert so that higher values are better
            if max_time == min_time:
                time_norm = 1.0
            else:
                time_norm = 1 - ((row['training_time_sec'] - min_time) / (max_time - min_time))
            
            radar_data[row['config_name']] = [acc_norm, f1_norm, param_norm, time_norm]
        
        # Create radar chart
        from math import pi
        
        # Number of variables
        N = len(categories)
        
        # Compute angle for each axis
        angles = [n / float(N) * 2 * pi for n in range(N)]
        angles += angles[:1]  # Close the loop
        
        # Initialize the plot
        ax = plt.subplot(111, polar=True)
        
        # Draw one axis per variable and add labels
        plt.xticks(angles[:-1], categories, fontsize=12)
        
        # Draw the chart for each config
        for i, (config_name, values) in enumerate(radar_data.items()):
            values += values[:1]  # Close the loop
            ax.plot(angles, values, linewidth=2, linestyle='solid', label=config_name)
            ax.fill(angles, values, alpha=0.1)
        
        # Add legend
        plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
        plt.title('LoRA Configuration Comparison', fontsize=15)
        
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/lora_variants_radar.png")
        plt.close()
        print(f"Radar chart comparison saved to {viz_dir}/lora_variants_radar.png")
    else:
        print("No configuration comparisons were completed successfully.")
    
except Exception as e:
    import traceback
    print(f"Error during LoRA configuration experiments: {e}")
    traceback.print_exc()
    print("Continuing with main LoRA model for the rest of the analysis.")


Experimenting with different LoRA configurations...

------------------------------------------------------------
Testing configuration: Low Rank (r=4)
------------------------------------------------------------


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable parameters:
trainable params: 1,331,716 || all params: 67,694,596 || trainable%: 1.967241225577297
Training Low Rank (r=4) variant...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Precision Class 0,Recall Class 0,F1 Class 0,Precision Class 1,Recall Class 1,F1 Class 1
1,0.4503,0.361,0.836,0.836073,0.837458,0.836,0.810277,0.857741,0.833333,0.862348,0.816092,0.838583


Evaluating Low Rank (r=4) variant on full validation set...


Results for Low Rank (r=4):
Training time: 00:00:09
Trainable parameters: 1,331,716 (1.9672%)
  eval_loss: 0.3604
  eval_accuracy: 0.8383
  eval_f1: 0.8383
  eval_precision: 0.8396
  eval_recall: 0.8383
  eval_precision_class_0: 0.8168
  eval_recall_class_0: 0.8645
  eval_f1_class_0: 0.8400
  eval_precision_class_1: 0.8616
  eval_recall_class_1: 0.8131
  eval_f1_class_1: 0.8366
  eval_steps_per_second: 22.0190
  epoch: 1.0000
Variant model saved to ./peft_output/models/variant_0

------------------------------------------------------------
Testing configuration: High Rank (r=32)
------------------------------------------------------------




Trainable parameters:
trainable params: 2,363,908 || all params: 68,726,788 || trainable%: 3.439572936247217
Training High Rank (r=32) variant...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Precision Class 0,Recall Class 0,F1 Class 0,Precision Class 1,Recall Class 1,F1 Class 1
1,0.4246,0.359961,0.84,0.840077,0.841121,0.84,0.816733,0.857741,0.836735,0.863454,0.823755,0.843137


Evaluating High Rank (r=32) variant on full validation set...


Results for High Rank (r=32):
Training time: 00:00:09
Trainable parameters: 2,363,908 (3.4396%)
  eval_loss: 0.3642
  eval_accuracy: 0.8417
  eval_f1: 0.8417
  eval_precision: 0.8428
  eval_recall: 0.8417
  eval_precision_class_0: 0.8222
  eval_recall_class_0: 0.8645
  eval_f1_class_0: 0.8428
  eval_precision_class_1: 0.8626
  eval_recall_class_1: 0.8198
  eval_f1_class_1: 0.8406
  eval_steps_per_second: 21.8570
  epoch: 1.0000
Variant model saved to ./peft_output/models/variant_1

------------------------------------------------------------
Testing configuration: Query-Only
------------------------------------------------------------




Trainable parameters:
trainable params: 1,331,716 || all params: 67,694,596 || trainable%: 1.967241225577297
Training Query-Only variant...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Precision Class 0,Recall Class 0,F1 Class 0,Precision Class 1,Recall Class 1,F1 Class 1
1,0.479,0.381212,0.824,0.824079,0.82458,0.824,0.805668,0.832636,0.81893,0.841897,0.816092,0.828794


Evaluating Query-Only variant on full validation set...


Results for Query-Only:
Training time: 00:00:06
Trainable parameters: 1,331,716 (1.9672%)
  eval_loss: 0.3807
  eval_accuracy: 0.8303
  eval_f1: 0.8303
  eval_precision: 0.8304
  eval_recall: 0.8303
  eval_precision_class_0: 0.8226
  eval_recall_class_0: 0.8341
  eval_f1_class_0: 0.8283
  eval_precision_class_1: 0.8379
  eval_recall_class_1: 0.8266
  eval_f1_class_1: 0.8322
  eval_steps_per_second: 25.6870
  epoch: 1.0000
Variant model saved to ./peft_output/models/variant_2

------------------------------------------------------------
Testing configuration: With Bias Training
------------------------------------------------------------




Trainable parameters:
trainable params: 1,792,516 || all params: 68,136,964 || trainable%: 2.630754138091624
Training With Bias Training variant...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Precision Class 0,Recall Class 0,F1 Class 0,Precision Class 1,Recall Class 1,F1 Class 1
1,0.4403,0.367937,0.828,0.828083,0.829119,0.828,0.804781,0.845188,0.82449,0.851406,0.812261,0.831373


Evaluating With Bias Training variant on full validation set...


Results for With Bias Training:
Training time: 00:00:09
Trainable parameters: 1,792,516 (2.6308%)
  eval_loss: 0.3672
  eval_accuracy: 0.8291
  eval_f1: 0.8291
  eval_precision: 0.8304
  eval_recall: 0.8291
  eval_precision_class_0: 0.8079
  eval_recall_class_0: 0.8551
  eval_f1_class_0: 0.8309
  eval_precision_class_1: 0.8520
  eval_recall_class_1: 0.8041
  eval_f1_class_1: 0.8273
  eval_steps_per_second: 22.0530
  epoch: 1.0000
Variant model saved to ./peft_output/models/variant_3

------------------------------------------------------------
Testing configuration: Higher Dropout
------------------------------------------------------------




Trainable parameters:
trainable params: 1,774,084 || all params: 68,136,964 || trainable%: 2.6037027420241383
Training Higher Dropout variant...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Precision Class 0,Recall Class 0,F1 Class 0,Precision Class 1,Recall Class 1,F1 Class 1
1,0.4448,0.364843,0.832,0.832048,0.834255,0.832,0.801556,0.861925,0.830645,0.864198,0.804598,0.833333


Evaluating Higher Dropout variant on full validation set...


Results for Higher Dropout:
Training time: 00:00:09
Trainable parameters: 1,774,084 (2.6037%)
  eval_loss: 0.3638
  eval_accuracy: 0.8314
  eval_f1: 0.8313
  eval_precision: 0.8333
  eval_recall: 0.8314
  eval_precision_class_0: 0.8061
  eval_recall_class_0: 0.8645
  eval_f1_class_0: 0.8343
  eval_precision_class_1: 0.8596
  eval_recall_class_1: 0.7995
  eval_f1_class_1: 0.8285
  eval_steps_per_second: 21.9420
  epoch: 1.0000
Variant model saved to ./peft_output/models/variant_4

Comparison of different LoRA configurations:
          config_name  trainable_params  param_efficiency  accuracy      f1  precision  recall  training_time_sec  training_time_min
0      Low Rank (r=4)           1331716            1.9672    0.8383  0.8383     0.8396  0.8383             9.3860             0.1564
1    High Rank (r=32)           2363908            3.4396    0.8417  0.8417     0.8428  0.8417             9.3507             0.1558
2          Query-Only           1331716            1.9672    0.8303  0.
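
The trainable-parameter counts logged for each variant can be sanity-checked by hand. A LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below assumes the two rank variants adapt 24 projections of size 768 × 768 (four attention projections in each of DistilBERT's six layers). That module count is an assumption inferred from the logged numbers, not read from the configuration code, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank: int, d_in: int, d_out: int, n_modules: int) -> int:
    """Weights added by LoRA: A is (rank x d_in) and B is (d_out x rank) per module."""
    return rank * (d_in + d_out) * n_modules


HIDDEN = 768      # DistilBERT hidden size
N_MODULES = 24    # assumption: 4 attention projections x 6 layers

low_rank = lora_param_count(4, HIDDEN, HIDDEN, N_MODULES)
high_rank = lora_param_count(32, HIDDEN, HIDDEN, N_MODULES)

# Trainable-parameter counts copied from the training logs above
logged_low, logged_high = 1_331_716, 2_363_908

print(f"LoRA weights at r=4:  {low_rank:,}")
print(f"LoRA weights at r=32: {high_rank:,}")
print(f"Computed difference:  {high_rank - low_rank:,}")
print(f"Logged difference:    {logged_high - logged_low:,}")
```

Under that assumption the computed difference between r=32 and r=4 equals the difference between the two logged counts, so the added rank accounts for all of the extra trainable parameters.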

## Load and evaluate fine-tuned models
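
A note on reading the metrics in this section: precision, recall, and F1 are computed with weighted averaging (`average='weighted'`), which is why every `eval_recall` in the logs matches `eval_accuracy`. Support-weighted recall reduces algebraically to accuracy. The pure-Python sketch below illustrates this on made-up labels; the helper names and the data are illustrative only and are not part of the project code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total += (n / len(y_true)) * (hits / n)
    return total


# Made-up labels for illustration (not SST-2 data)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

print("accuracy:       ", accuracy(y_true, y_pred))
print("weighted recall:", weighted_recall(y_true, y_pred))
```

Because the two values coincide, the per-class recall figures (`eval_recall_class_0`, `eval_recall_class_1`) are the more informative numbers when comparing variants.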

In [39]:
# Load all trained PEFT models correctly
print("\n" + "="*80)
print("Loading all saved PEFT models...")
print("="*80)

# Import the correct model loading classes
import os
import torch
from peft import AutoPeftModelForCausalLM, AutoPeftModelForSequenceClassification, PeftConfig, PeftModel
from transformers import AutoTokenizer

try:
    # Dictionary to store all loaded models
    loaded_models = {}
    
    # Load base model name first (to be consistent across all models)
    base_model_name = None
    
    # Try to find any variant model to extract base model name
    for config_idx in range(len(lora_configs)):
        variant_path = f"{models_dir}/variant_{config_idx}"
        if os.path.exists(variant_path):
            try:
                peft_config = PeftConfig.from_pretrained(variant_path)
                base_model_name = peft_config.base_model_name_or_path
                print(f"Found base model name: {base_model_name}")
                break
            except Exception as e:
                print(f"Could not load config from {variant_path}: {e}")
    
    # Default if still not found
    if not base_model_name:
        if 'config' in globals() and isinstance(config, dict) and 'model_name' in config:
            base_model_name = config['model_name']
        else:
            base_model_name = "distilbert-base-uncased"  # Default fallback
        print(f"Using default base model: {base_model_name}")
    
    # Load tokenizer (same for all models)
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    print(f"Tokenizer loaded from base model: {base_model_name}")
    
    # Ensure device is defined
    if 'device' not in locals() and 'device' not in globals():
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Device not found, using: {device}")
    
    # 1. Load the best model if it exists
    best_model_path = f"{models_dir}/best_model"
    if os.path.exists(best_model_path):
        try:
            print(f"\nLoading best model from: {best_model_path}")
            
            # Load the PEFT configuration
            peft_config = PeftConfig.from_pretrained(best_model_path)
            
            # Determine model type and load appropriately
            if hasattr(peft_config, 'task_type') and peft_config.task_type == 'SEQ_CLS':
                best_model = AutoPeftModelForSequenceClassification.from_pretrained(best_model_path).to(device)
                print("Loaded best model as AutoPeftModelForSequenceClassification")
            else:
                best_model = AutoPeftModelForCausalLM.from_pretrained(best_model_path).to(device)
                print("Loaded best model as AutoPeftModelForCausalLM")
            
            # Store in dictionary
            loaded_models["Best Model"] = best_model
            print("Best model loaded successfully")
            
        except Exception as e:
            print(f"Error loading best model: {e}")
            print("Continuing with other models...")
    
    # 2. Load all variant models
    for config_idx, config_data in enumerate(lora_configs):
        variant_name = config_data['name']
        variant_path = f"{models_dir}/variant_{config_idx}"
        
        if not os.path.exists(variant_path):
            print(f"\nModel path for {variant_name} not found. Skipping.")
            continue
        
        try:
            print(f"\nLoading variant {variant_name} from: {variant_path}")
            
            # Load the PEFT configuration
            peft_config = PeftConfig.from_pretrained(variant_path)
            
            # Try loading with AutoPeftModel classes first
            try:
                if hasattr(peft_config, 'task_type') and peft_config.task_type == 'SEQ_CLS':
                    variant_model = AutoPeftModelForSequenceClassification.from_pretrained(variant_path).to(device)
                    print(f"Loaded {variant_name} model as AutoPeftModelForSequenceClassification")
                else:
                    variant_model = AutoPeftModelForCausalLM.from_pretrained(variant_path).to(device)
                    print(f"Loaded {variant_name} model as AutoPeftModelForCausalLM")
            except Exception as e:
                print(f"Error loading with AutoPeftModel: {e}")
                print("Trying alternative loading method...")
                
                # Alternative loading method using PeftModel
                from transformers import AutoModelForSequenceClassification, AutoModelForCausalLM
                
                if hasattr(peft_config, 'task_type') and peft_config.task_type == 'SEQ_CLS':
                    base_model = AutoModelForSequenceClassification.from_pretrained(
                        peft_config.base_model_name_or_path, 
                        num_labels=2
                    ).to(device)
                    variant_model = PeftModel.from_pretrained(base_model, variant_path).to(device)
                    print(f"Loaded {variant_name} model using PeftModel.from_pretrained (sequence classification)")
                else:
                    base_model = AutoModelForCausalLM.from_pretrained(
                        peft_config.base_model_name_or_path
                    ).to(device)
                    variant_model = PeftModel.from_pretrained(base_model, variant_path).to(device)
                    print(f"Loaded {variant_name} model using PeftModel.from_pretrained (causal LM)")
            
            # Store in dictionary
            loaded_models[variant_name] = variant_model
            print(f"Variant {variant_name} loaded successfully")
            
            # Display parameter info if available
            if hasattr(variant_model, 'print_trainable_parameters'):
                print(f"Parameter info for {variant_name}:")
                variant_model.print_trainable_parameters()
            
        except Exception as e:
            print(f"Error loading variant {variant_name}: {e}")
            print("Continuing with other models...")
    
    # Summary of loaded models
    print("\n" + "-"*60)
    print(f"Successfully loaded {len(loaded_models)} models:")
    for model_name in loaded_models.keys():
        print(f"- {model_name}")
    
    if not loaded_models:
        print("Warning: No models were successfully loaded.")
    
except Exception as e:
    import traceback
    print(f"Error during model loading: {e}")
    traceback.print_exc()


Loading all saved PEFT models...
Found base model name: distilbert-base-uncased
Tokenizer loaded from base model: distilbert-base-uncased

Loading best model from: ./peft_output/models/best_model


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loaded best model as AutoPeftModelForSequenceClassification
Best model loaded successfully

Loading variant Low Rank (r=4) from: ./peft_output/models/variant_0




Loaded Low Rank (r=4) model as AutoPeftModelForSequenceClassification
Variant Low Rank (r=4) loaded successfully
Parameter info for Low Rank (r=4):
trainable params: 1,184,260 || all params: 67,694,596 || trainable%: 1.749415861791981

Loading variant High Rank (r=32) from: ./peft_output/models/variant_1




Loaded High Rank (r=32) model as AutoPeftModelForSequenceClassification
Variant High Rank (r=32) loaded successfully
Parameter info for High Rank (r=32):
trainable params: 1,184,260 || all params: 68,726,788 || trainable%: 1.7231417827936322

Loading variant Query-Only from: ./peft_output/models/variant_2




Loaded Query-Only model as AutoPeftModelForSequenceClassification
Variant Query-Only loaded successfully
Parameter info for Query-Only:
trainable params: 1,184,260 || all params: 67,694,596 || trainable%: 1.749415861791981

Loading variant With Bias Training from: ./peft_output/models/variant_3




Loaded With Bias Training model as AutoPeftModelForSequenceClassification
Variant With Bias Training loaded successfully
Parameter info for With Bias Training:
trainable params: 1,202,692 || all params: 68,136,964 || trainable%: 1.7651094639320883

Loading variant Higher Dropout from: ./peft_output/models/variant_4




Loaded Higher Dropout model as AutoPeftModelForSequenceClassification
Variant Higher Dropout loaded successfully
Parameter info for Higher Dropout:
trainable params: 1,184,260 || all params: 68,136,964 || trainable%: 1.738058067864603

------------------------------------------------------------
Successfully loaded 6 models:
- Best Model
- Low Rank (r=4)
- High Rank (r=32)
- Query-Only
- With Bias Training
- Higher Dropout
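
The reloaded variants report fewer trainable parameters than they did during training (1,184,260 versus 1,331,716 for Low Rank, for example). This is consistent with the LoRA matrices being loaded frozen, which is PEFT's default when loading for inference, so that only the classification head is still counted as trainable. The arithmetic below is a hedged check, not a statement about PEFT internals: it assumes DistilBERT's head is a 768-to-768 `pre_classifier` plus a 768-to-2 `classifier`, and that the head is counted twice (the original modules and a saved copy). `linear_params` is a hypothetical helper.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Number of parameters in a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


HIDDEN, NUM_LABELS = 768, 2

# Assumed head layout: pre_classifier (768 -> 768) followed by classifier (768 -> 2)
head = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)

reloaded_trainable = 1_184_260   # logged after reloading Low Rank (r=4)
training_trainable = 1_331_716   # logged while training Low Rank (r=4)

print(f"Head parameters:         {head:,}")
print(f"Two copies of the head:  {2 * head:,}")
print(f"Missing after reload:    {training_trainable - reloaded_trainable:,}")
```

The missing 147,456 parameters match the size of a rank-4 adapter on 24 projections of size 768 × 768, which supports the frozen-adapter reading. If the goal is to continue training a reloaded adapter rather than only evaluate it, `PeftModel.from_pretrained` accepts `is_trainable=True`.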


In [40]:
# Step 8: Compare all trained models with the base model
print("\n" + "="*80)
print("Comparing all PEFT models with the base model...")
print("="*80)

# Import necessary libraries
import os
import torch
import numpy as np
import pandas as pd
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

try:
    # Ensure we have loaded models
    if 'loaded_models' not in globals() or not loaded_models:
        print("No loaded PEFT models found. Cannot proceed with comparison.")
        raise ValueError("No loaded PEFT models available")
    
    # Ensure we have the validation dataset for evaluation
    eval_dataset = None
    if 'encoded_validation' in globals():
        eval_dataset = encoded_validation
        print("Using encoded_validation dataset for evaluation")
    elif 'encoded_test' in globals():
        eval_dataset = encoded_test
        print("Using encoded_test dataset for evaluation")
    else:
        print("No evaluation dataset found. Cannot proceed with model comparison.")
        raise ValueError("No evaluation dataset available")
    
    # Get base model name
    base_model_name = None
    for model_name, model in loaded_models.items():
        try:
            # PeftModel.peft_config is a dict keyed by adapter name (e.g. "default"),
            # so read the base model name from the first adapter's config
            peft_cfg = getattr(model, 'peft_config', None)
            if isinstance(peft_cfg, dict) and peft_cfg:
                peft_cfg = next(iter(peft_cfg.values()))
            name = getattr(peft_cfg, 'base_model_name_or_path', None)
            if name:
                base_model_name = name
                break
        except Exception:
            continue
    
    # If we still don't have a base model name, use fallbacks
    if not base_model_name:
        if 'config' in globals() and isinstance(config, dict) and 'model_name' in config:
            base_model_name = config['model_name']
        else:
            base_model_name = "distilbert-base-uncased"  # Default
    
    print(f"Using base model: {base_model_name}")
    
    # Define compute_metrics function for evaluation
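    def compute_metrics(pred):
        """Return accuracy plus support-weighted F1, precision, and recall."""
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        # zero_division=0 silences the warning raised when a class is never
        # predicted, which can happen with the untrained base model
        precision, recall, f1, _ = precision_recall_fscore_support(
            labels, preds, average='weighted', zero_division=0
        )
        acc = accuracy_score(labels, preds)
        return {
            'accuracy': acc,
            'f1': f1,
            'precision': precision,
            'recall': recall
        }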
    
    # Create evaluation arguments
    eval_args = TrainingArguments(
        output_dir="./eval_results",
        per_device_eval_batch_size=16,
        report_to="none",
    )
    
    # Dictionary to store all evaluation results
    all_results = {}
    
    # 1. Evaluate the base model
    print("\nEvaluating the base model...")
    
    # Load the base model
    base_model = AutoModelForSequenceClassification.from_pretrained(
        base_model_name,
        num_labels=2
    ).to(device)
    
    # Create trainer for base model
    base_trainer = Trainer(
        model=base_model,
        args=eval_args,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
    
    # Evaluate base model
    base_metrics = base_trainer.evaluate()
    
    # Print base model metrics
    print("\nBase model metrics:")
    for metric_name, value in base_metrics.items():
        if not metric_name.startswith('eval_runtime') and not metric_name.startswith('eval_samples_per'):
            print(f"  {metric_name}: {value:.4f}")
    
    # Store base model results
    all_results["Base Model"] = {k.replace('eval_', ''): v for k, v in base_metrics.items() 
                              if not k.startswith('eval_runtime') and not k.startswith('eval_samples_per')}
    
    # Calculate base model parameters
    base_params = sum(p.numel() for p in base_model.parameters())
    print(f"Base model total parameters: {base_params:,}")
    
    # 2. Evaluate all PEFT models
    for model_name, model in loaded_models.items():
        print(f"\nEvaluating {model_name}...")
        
        # Create trainer for this model
        model_trainer = Trainer(
            model=model,
            args=eval_args,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
        )
        
        # Evaluate model
        model_metrics = model_trainer.evaluate()
        
        # Print metrics
        print(f"\n{model_name} metrics:")
        for metric_name, value in model_metrics.items():
            if not metric_name.startswith('eval_runtime') and not metric_name.startswith('eval_samples_per'):
                print(f"  {metric_name}: {value:.4f}")
        
        # Store results
        all_results[model_name] = {k.replace('eval_', ''): v for k, v in model_metrics.items() 
                                if not k.startswith('eval_runtime') and not k.startswith('eval_samples_per')}
        
        # Calculate trainable parameters
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total_params = sum(p.numel() for p in model.parameters())
        trainable_pct = trainable_params / total_params * 100
        
        print(f"Parameter efficiency: {trainable_params:,} trainable parameters ({trainable_pct:.2f}% of total)")
        
        # Add parameter info to results
        all_results[model_name]['trainable_params'] = trainable_params
        all_results[model_name]['total_params'] = total_params
        all_results[model_name]['trainable_pct'] = trainable_pct
    
    # 3. Create comprehensive comparison table
    print("\n" + "="*80)
    print("Comprehensive Model Comparison")
    print("="*80)
    
    # Get all metrics that exist in any model
    all_metric_names = set()
    for model_metrics in all_results.values():
        all_metric_names.update(model_metrics.keys())
    
    # Remove parameter metrics for cleaner table
    metric_names = [m for m in all_metric_names if m not in 
                    ['trainable_params', 'total_params', 'trainable_pct']]
    
    # Create DataFrame for comparison
    comparison_rows = []
    
    # Add row for base model first
    base_row = {"Model": "Base Model"}
    for metric in metric_names:
        if metric in all_results["Base Model"]:
            base_row[metric] = all_results["Base Model"][metric]
    comparison_rows.append(base_row)
    
    # Add rows for all PEFT models
    for model_name, metrics in all_results.items():
        if model_name == "Base Model":
            continue
            
        model_row = {"Model": model_name}
        
        # Add metrics
        for metric in metric_names:
            if metric in metrics:
                model_row[metric] = metrics[metric]
                
                # Add improvement percentage
                if metric in all_results["Base Model"]:
                    pct_metric = f"{metric} vs base (%)"
                    base_value = all_results["Base Model"][metric]
                    if base_value != 0:  # Avoid division by zero
                        model_row[pct_metric] = (metrics[metric] - base_value) / base_value * 100
        
        # Add parameter efficiency info
        if 'trainable_pct' in metrics:
            model_row["Params trained (%)"] = metrics['trainable_pct']
        
        comparison_rows.append(model_row)
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame(comparison_rows)
    
    # Format for display - round numeric columns
    display_df = comparison_df.copy()
    for col in display_df.columns:
        if col != "Model" and display_df[col].dtype in [np.float64, np.float32, np.int64, np.int32]:
            if "vs base" in col:
                display_df[col] = display_df[col].apply(lambda x: f"{x:+.2f}%" if not pd.isna(x) else "")
            elif "Params" in col:
                display_df[col] = display_df[col].apply(lambda x: f"{x:.2f}%" if not pd.isna(x) else "")
            else:
                display_df[col] = display_df[col].apply(lambda x: f"{x:.4f}" if not pd.isna(x) else "")
    
    # Display the comparison table
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 1000)
    print("\nMetrics comparison table:")
    print(display_df.to_string(index=False))
    
    # 4. Visualize the results
    try:
        import matplotlib.pyplot as plt
        import seaborn as sns
        
        # Main performance metrics to visualize
        main_metrics = ['accuracy', 'f1', 'precision', 'recall']
        main_metrics = [m for m in main_metrics if m in metric_names]
        
        if main_metrics:
            # Create a figure with a subplot for each metric
            fig, axes = plt.subplots(len(main_metrics), 1, figsize=(12, 5 * len(main_metrics)))
            
            # If only one metric, axes isn't an array
            if len(main_metrics) == 1:
                axes = [axes]
            
            # Get model names in desired order (base model first, then others)
            model_order = ["Base Model"] + [m for m in all_results.keys() if m != "Base Model"]
            
            # Create color palette (different color for base model)
            colors = ['#1f77b4'] + sns.color_palette("viridis", len(model_order) - 1).as_hex()
            
            # Plot each metric
            for i, metric in enumerate(main_metrics):
                # Extract data for this metric
                metric_values = []
                model_names = []
                
                for model in model_order:
                    if model in all_results and metric in all_results[model]:
                        metric_values.append(all_results[model][metric])
                        model_names.append(model)
                
                # Create the bar plot
                bars = axes[i].bar(model_names, metric_values, color=colors[:len(model_names)])
                
                # Add labels
                axes[i].set_title(f'{metric.capitalize()} Comparison', fontsize=16)
                axes[i].set_ylabel(metric.capitalize(), fontsize=12)
                
                # Add value labels
                for bar in bars:
                    height = bar.get_height()
                    axes[i].annotate(f'{height:.4f}',
                                    xy=(bar.get_x() + bar.get_width() / 2, height),
                                    xytext=(0, 3),  # 3 points vertical offset
                                    textcoords="offset points",
                                    ha='center', va='bottom')
                
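                # Add improvement percentages (relative change vs. the base model)
                base_value = all_results["Base Model"].get(metric, 0)
                for j, model in enumerate(model_names):
                    # Skip the base model itself and guard against a zero baseline
                    if model != "Base Model" and base_value: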
                        model_value = all_results[model][metric]
                        improvement = (model_value - base_value) / base_value * 100
                        
                        # Add text with color coding
                        color = 'green' if improvement > 0 else 'red'
                        axes[i].text(j, all_results[model][metric] * 0.95, 
                                    f'{improvement:+.1f}%',
                                    ha='center', va='top',
                                    color=color, fontweight='bold')
                
                # Adjust ticks
                axes[i].set_xticks(range(len(model_names)))
                axes[i].set_xticklabels(model_names, rotation=45, ha='right')
                axes[i].grid(axis='y', linestyle='--', alpha=0.7)
            
            plt.tight_layout()
            
            # Create viz directory if it doesn't exist
            if 'viz_dir' in globals():
                viz_path = viz_dir
            else:
                viz_path = "./visualizations"
                
            os.makedirs(viz_path, exist_ok=True)
            plt.savefig(f"{viz_path}/all_models_metrics.png")
            print(f"Metrics visualization saved to {viz_path}/all_models_metrics.png")
            plt.close()
            
            # Create parameter efficiency visualization
            plt.figure(figsize=(12, 6))
            
            # Prepare data
            model_names = [m for m in all_results.keys() if m != "Base Model"]
            param_pcts = [all_results[m].get('trainable_pct', 0) for m in model_names]
            accuracy_values = [all_results[m].get('accuracy', 0) for m in model_names]
            
            # Base model accuracy for reference
            base_accuracy = all_results["Base Model"].get('accuracy', 0)
            
            # Only create this plot if we have parameter data
            if any(param_pcts) and len(model_names) > 0:
                # Create scatter plot
                plt.scatter(param_pcts, accuracy_values, s=100, c=range(len(model_names)), cmap='viridis')
                
                # Add labels for each point
                for i, model in enumerate(model_names):
                    plt.annotate(model, (param_pcts[i], accuracy_values[i]), 
                                fontsize=9, ha='right', va='bottom')
                
                # Add reference line for base model accuracy
                plt.axhline(y=base_accuracy, color='r', linestyle='--', alpha=0.7,
                           label=f'Base Model Accuracy: {base_accuracy:.4f}')
                
                # Labels and title
                plt.xlabel('Percentage of Parameters Trained (%)', fontsize=12)
                plt.ylabel('Accuracy', fontsize=12)
                plt.title('Accuracy vs Parameter Efficiency', fontsize=16)
                plt.legend()
                plt.grid(True, alpha=0.3)
                
                plt.tight_layout()
                plt.savefig(f"{viz_path}/param_efficiency.png")
                print(f"Parameter efficiency visualization saved to {viz_path}/param_efficiency.png")
                plt.close()
        
        # Create radar chart for multi-dimensional comparison
        if len(main_metrics) >= 3:  # Only create radar chart if we have enough metrics
            plt.figure(figsize=(10, 10))
            
            # Get model names and prepare data
            model_names = ["Base Model"] + [m for m in all_results.keys() if m != "Base Model"]
            
            # Normalize the data for the radar chart
            max_values = {metric: max(all_results[model].get(metric, 0) for model in model_names) 
                        for metric in main_metrics}
            min_values = {metric: min(all_results[model].get(metric, 0) for model in model_names) 
                        for metric in main_metrics}
            
            # Prepare radar chart data
            radar_data = {}
            for model in model_names:
                values = []
                for metric in main_metrics:
                    # Normalize between 0 and 1
                    if max_values[metric] == min_values[metric]:
                        # Avoid division by zero
                        norm_value = 1.0
                    else:
                        value = all_results[model].get(metric, 0)
                        norm_value = (value - min_values[metric]) / (max_values[metric] - min_values[metric])
                    values.append(norm_value)
                radar_data[model] = values
            
            # Create the radar chart
            from math import pi
            
            # Number of variables
            N = len(main_metrics)
            
            # Compute angle for each axis
            angles = [n / float(N) * 2 * pi for n in range(N)]
            angles += angles[:1]  # Close the loop
            
            # Initialize the plot
            ax = plt.subplot(111, polar=True)
            
            # Draw one axis per variable and add labels
            plt.xticks(angles[:-1], [m.capitalize() for m in main_metrics], fontsize=12)
            
            # Draw the chart for each model
            for i, (model, values) in enumerate(radar_data.items()):
                values += values[:1]  # Close the loop
                ax.plot(angles, values, linewidth=2, linestyle='solid', label=model)
                ax.fill(angles, values, alpha=0.1)
            
            # Add legend
            plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
            plt.title('Model Performance Comparison', fontsize=15)
            
            plt.tight_layout()
            plt.savefig(f"{viz_path}/radar_comparison.png")
            print(f"Radar chart saved to {viz_path}/radar_comparison.png")
            plt.close()
            
    except ImportError:
        print("Matplotlib not available. Skipping visualizations.")
    
    # 5. Save results to CSV
    try:
        if 'results_dir' in globals():
            results_path = results_dir
        else:
            results_path = "./results"
            
        os.makedirs(results_path, exist_ok=True)
        comparison_df.to_csv(f"{results_path}/model_comparison.csv", index=False)
        print(f"Comparison data saved to {results_path}/model_comparison.csv")
    except Exception as e:
        print(f"Could not save comparison results to CSV: {e}")
    
    # 6. Final analysis and findings
    print("\n" + "="*80)
    print("Final Analysis and Findings")
    print("="*80)
    
    # Find best model for each metric
    for metric in main_metrics:
        models_with_metric = [(model, metrics.get(metric, 0)) 
                             for model, metrics in all_results.items()]
        best_model = max(models_with_metric, key=lambda x: x[1])
        
        print(f"\nBest model for {metric}: {best_model[0]} ({best_model[1]:.4f})")
        
        # If best model isn't base model, show relative improvement
        if best_model[0] != "Base Model":
            base_value = all_results["Base Model"].get(metric, 0)
            if base_value:  # avoid division by zero when the base metric is 0 or missing
                improvement = (best_model[1] - base_value) / base_value * 100
                print(f"  Improvement over base model: {improvement:+.2f}%")
    
    # Calculate overall best model (using accuracy)
    if 'accuracy' in main_metrics:
        # Get all models except base
        peft_models = [(model, metrics.get('accuracy', 0)) 
                      for model, metrics in all_results.items() 
                      if model != "Base Model"]
        
        if peft_models:
            overall_best = max(peft_models, key=lambda x: x[1])
            print(f"\nOverall best PEFT model: {overall_best[0]} (accuracy: {overall_best[1]:.4f})")
            
            # Compare to base
            base_accuracy = all_results["Base Model"].get('accuracy', 0)
            
            if overall_best[1] > base_accuracy:
                improvement = (overall_best[1] - base_accuracy) / base_accuracy * 100
                print(f"  Improved accuracy by {improvement:+.2f}% over base model")
                
                # Show parameter efficiency benefit
                if 'trainable_pct' in all_results[overall_best[0]]:
                    train_pct = all_results[overall_best[0]]['trainable_pct']
                    print(f"  While training only {train_pct:.2f}% of the parameters")
            else:
                print(f"  Note: Base model still has higher accuracy ({base_accuracy:.4f})")
    
    # Count models that improved on base model
    models_beating_base = {}
    for metric in main_metrics:
        beating_count = 0
        for model, metrics in all_results.items():
            if model == "Base Model":
                continue
            if metric in metrics and metrics[metric] > all_results["Base Model"].get(metric, 0):
                beating_count += 1
        
        total_peft_models = len(all_results) - 1  # Exclude base model
        models_beating_base[metric] = beating_count
        
        print(f"\n{beating_count} out of {total_peft_models} PEFT models improved on {metric}")
    
    # Parameter efficiency summary
    print("\nParameter Efficiency Summary:")
    for model_name, metrics in all_results.items():
        if model_name == "Base Model":
            continue
        if 'trainable_pct' in metrics:
            print(f"  {model_name}: {metrics['trainable_pct']:.2f}% parameters trained")
    
    # Final conclusion and recommendations
    print("\n" + "="*80)
    print("Conclusion and Recommendations")
    print("="*80)
    
    # Overall conclusion about PEFT effectiveness
    if sum(models_beating_base.values()) > 0:
        print("\nParameter-Efficient Fine-Tuning (PEFT) shows promising results:")
        
        # Recommendation based on best model
        if 'accuracy' in main_metrics:
            peft_accuracies = [(model, metrics.get('accuracy', 0)) 
                              for model, metrics in all_results.items() 
                              if model != "Base Model"]
            
            if peft_accuracies:
                best_peft = max(peft_accuracies, key=lambda x: x[1])
                
                print(f"\n1. For best overall performance, use '{best_peft[0]}'")
                print(f"   - Accuracy: {best_peft[1]:.4f}")
                
                base_accuracy = all_results["Base Model"].get('accuracy', 0)
                acc_diff = best_peft[1] - base_accuracy
                
                if acc_diff > 0:
                    print(f"   - Improvement over base model: {acc_diff:.4f} ({acc_diff/base_accuracy*100:+.2f}%)")
                else:
                    print(f"   - Performance difference from base model: {acc_diff:.4f} ({acc_diff/base_accuracy*100:+.2f}%)")
                
                if 'trainable_pct' in all_results[best_peft[0]]:
                    print(f"   - Trains only {all_results[best_peft[0]]['trainable_pct']:.2f}% of parameters")
        
        # Recommendation for parameter efficiency
        peft_models_with_params = [(model, metrics) for model, metrics in all_results.items() 
                                 if model != "Base Model" and 'trainable_pct' in metrics]
        
        if peft_models_with_params:
            # Find model with lowest trainable percentage among those with good performance
            threshold = 0.95  # Models with at least 95% of best accuracy
            
            if 'accuracy' in main_metrics:
                best_acc = max(all_results[m].get('accuracy', 0) for m in all_results if m != "Base Model")
                acceptable_models = [(m, metrics) for m, metrics in all_results.items() 
                                  if m != "Base Model" 
                                  and metrics.get('accuracy', 0) >= best_acc * threshold
                                  and 'trainable_pct' in metrics]
                
                if acceptable_models:
                    most_efficient = min(acceptable_models, key=lambda x: x[1]['trainable_pct'])
                    
                    print(f"\n2. For best parameter efficiency with good performance, use '{most_efficient[0]}'")
                    print(f"   - Trains only {most_efficient[1]['trainable_pct']:.2f}% of parameters")
                    print(f"   - Accuracy: {most_efficient[1].get('accuracy', 0):.4f}")
                    
                    base_accuracy = all_results["Base Model"].get('accuracy', 0)
                    acc_diff = most_efficient[1].get('accuracy', 0) - base_accuracy
                    
                    if acc_diff > 0:
                        print(f"   - Improvement over base model: {acc_diff:.4f} ({acc_diff/base_accuracy*100:+.2f}%)")
                    else:
                        print(f"   - Performance difference from base model: {acc_diff:.4f} ({acc_diff/base_accuracy*100:+.2f}%)")
        
        # Recommendation for future exploration
        print("\n3. Recommendations for future exploration:")
        
        # Check if higher rank models perform better
        try:
            # Try to identify models with different ranks
            rank_models = []
            low_rank_model = None
            high_rank_model = None
            
            # Look for specific model names that indicate ranks
            for model_name in all_results:
                if model_name == "Base Model":
                    continue
                    
                if "Low Rank" in model_name:
                    low_rank_model = model_name
                elif "High Rank" in model_name:
                    high_rank_model = model_name
            
            # Compare the models if we found them
            if low_rank_model and high_rank_model:
                low_rank_acc = all_results[low_rank_model].get('accuracy', 0)
                high_rank_acc = all_results[high_rank_model].get('accuracy', 0)
                
                print(f"   - Rank comparison: {low_rank_model} vs {high_rank_model}")
                print(f"     Low rank accuracy: {low_rank_acc:.4f}")
                print(f"     High rank accuracy: {high_rank_acc:.4f}")
                
                if high_rank_acc > low_rank_acc * 1.01:  # At least 1% better
                    print("     Finding: Higher ranks appear to improve performance")
                    print("     Suggestion: Experiment with even higher ranks")
                elif low_rank_acc > high_rank_acc * 1.01:  # At least 1% better
                    print("     Finding: Lower ranks achieve better performance")
                    print("     Suggestion: Experiment with even lower ranks to find the optimal balance")
                else:
                    print("     Finding: Rank has minimal impact on performance")
                    print("     Suggestion: Use lower ranks for better efficiency")
        except Exception as e:
            print(f"   - Could not analyze rank impact: {e}")
            print("     Suggestion: Experiment with different LoRA ranks manually")
        
        print("\nOverall recommendation:")
        print("Parameter-Efficient Fine-Tuning (PEFT) offers a valuable approach for adapting")
        print("foundation models with minimal computational resources while maintaining or even")
        print("improving performance. Based on this evaluation, PEFT is an effective technique")
        print("for this task and dataset.")
    else:
        print("\nNone of the PEFT models outperformed the base model on the evaluated metrics.")
        print("This could indicate that:")
        print("1. The base model is already well-suited for this task")
        print("2. The LoRA configuration parameters need further tuning")
        print("3. Different PEFT techniques may be more appropriate for this specific task")
        
        print("\nRecommendations:")
        print("- Experiment with higher LoRA ranks")
        print("- Try different target modules for the adapters")
        print("- Adjust learning rates or training durations")
        print("- Consider alternative PEFT methods like prompt tuning or prefix tuning")
    
except Exception as e:
    import traceback
    print(f"Error during model comparison: {e}")
    traceback.print_exc()


Comparing all PEFT models with the base model...
Using encoded_validation dataset for evaluation
Using base model: distilbert-base-uncased

Evaluating the base model...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Base model metrics:
  eval_loss: 0.6970
  eval_accuracy: 0.5069
  eval_f1: 0.3425
  eval_precision: 0.2587
  eval_recall: 0.5069
  eval_steps_per_second: 51.6020
Base model total parameters: 66,955,010

Evaluating Best Model...



Best Model metrics:
  eval_loss: 0.3642
  eval_accuracy: 0.8417
  eval_f1: 0.8417
  eval_precision: 0.8428
  eval_recall: 0.8417
  eval_steps_per_second: 39.7600
Parameter efficiency: 1,184,260 trainable parameters (1.72% of total)

Evaluating Low Rank (r=4)...



Low Rank (r=4) metrics:
  eval_loss: 0.3604
  eval_accuracy: 0.8383
  eval_f1: 0.8383
  eval_precision: 0.8396
  eval_recall: 0.8383
  eval_steps_per_second: 39.7750
Parameter efficiency: 1,184,260 trainable parameters (1.75% of total)

Evaluating High Rank (r=32)...



High Rank (r=32) metrics:
  eval_loss: 0.3642
  eval_accuracy: 0.8417
  eval_f1: 0.8417
  eval_precision: 0.8428
  eval_recall: 0.8417
  eval_steps_per_second: 40.6880
Parameter efficiency: 1,184,260 trainable parameters (1.72% of total)

Evaluating Query-Only...



Query-Only metrics:
  eval_loss: 0.3807
  eval_accuracy: 0.8303
  eval_f1: 0.8303
  eval_precision: 0.8304
  eval_recall: 0.8303
  eval_steps_per_second: 48.8800
Parameter efficiency: 1,184,260 trainable parameters (1.75% of total)

Evaluating With Bias Training...



With Bias Training metrics:
  eval_loss: 0.3672
  eval_accuracy: 0.8291
  eval_f1: 0.8291
  eval_precision: 0.8304
  eval_recall: 0.8291
  eval_steps_per_second: 40.7640
Parameter efficiency: 1,202,692 trainable parameters (1.77% of total)

Evaluating Higher Dropout...



Higher Dropout metrics:
  eval_loss: 0.3638
  eval_accuracy: 0.8314
  eval_f1: 0.8313
  eval_precision: 0.8333
  eval_recall: 0.8314
  eval_steps_per_second: 40.0110
Parameter efficiency: 1,184,260 trainable parameters (1.74% of total)

Comprehensive Model Comparison

Metrics comparison table:
             Model   loss accuracy recall     f1 precision steps_per_second loss vs base (%) accuracy vs base (%) recall vs base (%) f1 vs base (%) precision vs base (%) steps_per_second vs base (%) Params trained (%)
        Base Model 0.6970   0.5069 0.5069 0.3425    0.2587          51.6020                                                                                                                                              
        Best Model 0.3642   0.8417 0.8417 0.8417    0.8428          39.7600          -47.75%              +66.06%            +66.06%       +145.72%              +225.79%                      -22.95%              1.72%
    Low Rank (r=4) 0.3604   0.8383 0.8383 0.8383  
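
The radar chart built in the cell above min-max normalizes every metric across models, so its axes show position within the group rather than raw scores. As a minimal sketch (the `min_max_normalize` helper is hypothetical; the accuracy values are copied from the evaluation log above), the base model maps to 0 and the PEFT variants cluster near the top:

```python
def min_max_normalize(values):
    """Scale a dict of {name: value} to [0, 1]; constant inputs map to 1.0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Same fallback as the notebook code: avoid dividing by zero
        return {name: 1.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


# eval_accuracy values taken from the evaluation log above
accuracy = {
    "Base Model": 0.5069,
    "Best Model": 0.8417,
    "Low Rank (r=4)": 0.8383,
    "Query-Only": 0.8303,
    "With Bias Training": 0.8291,
    "Higher Dropout": 0.8314,
}

normalized = min_max_normalize(accuracy)
for name, score in normalized.items():
    print(f"{name:20s} raw={accuracy[name]:.4f} normalized={score:.3f}")
```

Because the untrained base model anchors the bottom of the scale, the gap between the weakest and strongest PEFT variant (about 0.013 in raw accuracy) is squeezed into roughly the top 4% of the normalized range. Plotting raw scores, or normalizing over the PEFT models only, may make those differences easier to read.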

## Performing Inference with a PEFT Model

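The cell below runs inference with the best PEFT model from the in-memory `loaded_models` dictionary, tests it on a handful of sample texts, and then saves the adapter weights together with a short usage README. Compare these results with the base model's metrics from before fine-tuning, reported in the evaluation log above (accuracy 0.5069 before versus 0.8417 after).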
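
When comparing against the pre-fine-tuning baseline, note that the "vs base (%)" columns in the comparison table are relative changes, not percentage-point differences. The sketch below recomputes both for accuracy from the rounded values in the log. `relative_change` is a hypothetical helper, and its zero-baseline guard is an assumption added for illustration.

```python
def relative_change(new, base):
    """Relative change of `new` versus `base`, in percent.

    Returns None when the baseline is zero, where the ratio is undefined.
    """
    if base == 0:
        return None
    return (new - base) / base * 100


# Rounded eval_accuracy values from the evaluation log
base_accuracy = 0.5069
peft_accuracy = 0.8417

rel = relative_change(peft_accuracy, base_accuracy)
abs_points = (peft_accuracy - base_accuracy) * 100

print(f"Relative improvement: {rel:+.2f}%")
print(f"Absolute improvement: {abs_points:+.2f} percentage points")
```

With these rounded inputs the relative figure lands within a few hundredths of the table's +66.06%, while the absolute gain is about 33.5 percentage points. Reporting both avoids overstating the gain when the baseline is low, as it is here for a classifier head that has not been trained yet.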

In [41]:
# Comprehensive Inference with Best LoRA Model
print("\n" + "="*80)
print("Performing inference with the best LoRA model...")
print("="*80)

# Import required libraries
import os
import torch
import time
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

try:
    # Simply select the best model by name if available, or use the first one
    if 'loaded_models' not in globals() or not loaded_models:
        print("No loaded models found. Cannot proceed with inference.")
        raise ValueError("No loaded models available")
    
    # Try to find "High Rank" or "Best Model" in loaded models
    best_model = None
    best_variant = None
    
    # Priority order for selecting model
    priority_names = ["Best Model", "High Rank (r=32)", "LoRA Model"]
    
    for name in priority_names:
        if name in loaded_models:
            best_model = loaded_models[name]
            best_variant = name
            print(f"Selected {name} for inference")
            break
    
    # If none of the preferred models found, use the first available
    if best_model is None:
        best_variant = list(loaded_models.keys())[0]
        best_model = loaded_models[best_variant]
        print(f"No priority model found. Using {best_variant} for inference.")
    
    if best_model is not None:
        print(f"\nSelected model for inference: {best_variant}")
        
        # Load tokenizer based on the model
        model_name = None
        try:
            if 'config' in globals() and isinstance(config, dict) and 'model_name' in config:
                model_name = config['model_name']
            else:
                # Try to extract from PeftConfig
                for path in [f"{models_dir}/variant_{i}" for i in range(len(lora_configs))]:
                    if os.path.exists(path):
                        try:
                            peft_config = PeftConfig.from_pretrained(path)
                            model_name = peft_config.base_model_name_or_path
                            break
                        except Exception:
                            pass
                
                # Use default if still not found
                if not model_name:
                    model_name = "distilbert-base-uncased"  # Default
        except Exception:
            model_name = "distilbert-base-uncased"  # Default
            
        print(f"Using base model: {model_name}")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        # Create a comprehensive inference function
        def predict_sentiment(text, model=best_model, tokenizer=tokenizer, return_all=False):
            """
            Predict sentiment for a given text using the best trained model
            
            Args:
                text: Input text string or list of strings
                model: Trained model
                tokenizer: Tokenizer for the model
                return_all: Whether to return all details or just the label
            
            Returns:
                If return_all=True: Dict with prediction, confidence, and label
                If return_all=False: Just the label string
                For lists, returns a list of results
            """
            # Handle both single texts and batches
            is_single = isinstance(text, str)
            texts = [text] if is_single else text
            
            # Prepare the inputs
            inputs = tokenizer(
                texts, 
                return_tensors="pt", 
                truncation=True, 
                padding=True, 
                max_length=512
            ).to(device)
            
            # Get predictions
            model.eval()
            with torch.no_grad():
                outputs = model(**inputs)
            
            # Process outputs
            logits = outputs.logits
            probabilities = torch.softmax(logits, dim=-1)
            predictions = torch.argmax(logits, dim=-1).cpu().numpy()
            
            # Prepare results
            results = []
            for i, pred in enumerate(predictions):
                label = "Positive" if pred == 1 else "Negative"
                confidence = probabilities[i][pred].item()
                
                if return_all:
                    results.append({
                        "prediction": int(pred),
                        "confidence": confidence,
                        "label": label,
                        "probabilities": {
                            "Negative": probabilities[i][0].item(),
                            "Positive": probabilities[i][1].item()
                        }
                    })
                else:
                    results.append(label)
            
            # Return single result or list based on input
            return results[0] if is_single else results
        
        # Define a set of diverse test cases
        test_cases = [
            "I absolutely loved this product! It exceeded all my expectations.",
            "The service was terrible and I would never recommend this place.",
            "It was okay, not great but not bad either.",
            "While there were some issues, overall I had a positive experience.",
            "I'm requesting a refund because this product is defective.",
            "This is neither positive nor negative, it's just a statement of fact."
        ]
        
        # Test the model on sample cases with detailed output
        print("\nTesting model on sample texts:")
        results_data = []
        
        for text in test_cases:
            result = predict_sentiment(text, return_all=True)
            print(f"\nText: {text}")
            print(f"Prediction: {result['label']} (confidence: {result['confidence']:.4f})")
            print(f"Probability distribution: Positive={result['probabilities']['Positive']:.4f}, " +
                  f"Negative={result['probabilities']['Negative']:.4f}")
            
            results_data.append({
                "text": text,
                "prediction": result['prediction'],
                "confidence": result['confidence'],
                "label": result['label'],
                "pos_prob": result['probabilities']['Positive'],
                "neg_prob": result['probabilities']['Negative']
            })
        
        # Batch prediction demonstration
        print("\nDemonstrating batch prediction:")
        batch_results = predict_sentiment(test_cases, return_all=True)
        
        # Print batch statistics
        pos_count = sum(1 for r in batch_results if r['label'] == 'Positive')
        neg_count = sum(1 for r in batch_results if r['label'] == 'Negative')
        avg_confidence = sum(r['confidence'] for r in batch_results) / len(batch_results)
        
        print(f"Batch results: {pos_count} Positive, {neg_count} Negative")
        print(f"Average prediction confidence: {avg_confidence:.4f}")
        
        # Save the sample predictions to CSV if possible
        try:
            import pandas as pd
            sample_df = pd.DataFrame(results_data)
            
            # Create results directory if it doesn't exist
            if 'results_dir' in globals():
                results_path = results_dir
            else:
                results_path = "./results"
                
            os.makedirs(results_path, exist_ok=True)
            sample_df.to_csv(f"{results_path}/sample_predictions.csv", index=False)
            print(f"Sample predictions saved to {results_path}/sample_predictions.csv")
            
            # Create visualization of the results if matplotlib is available
            try:
                import matplotlib.pyplot as plt
                import numpy as np
                
                # Set up the figure
                plt.figure(figsize=(12, 8))
                
                # Create bar chart of confidence scores
                texts_short = [t[:30] + "..." if len(t) > 30 else t for t in sample_df['text']]
                indices = np.arange(len(texts_short))
                bar_width = 0.35
                
                # Plot positive and negative probabilities
                plt.bar(indices, sample_df['pos_prob'], bar_width, 
                        label='Positive', color='green', alpha=0.7)
                plt.bar(indices + bar_width, sample_df['neg_prob'], bar_width,
                        label='Negative', color='red', alpha=0.7)
                
                # Add labels and title
                plt.xlabel('Sample Text')
                plt.ylabel('Probability')
                plt.title('Sentiment Analysis Results')
                plt.xticks(indices + bar_width/2, texts_short, rotation=45, ha='right')
                plt.legend()
                
                plt.tight_layout()
                
                # Create viz directory if it doesn't exist
                if 'viz_dir' in globals():
                    viz_path = viz_dir
                else:
                    viz_path = "./visualizations"
                    
                os.makedirs(viz_path, exist_ok=True)
                plt.savefig(f"{viz_path}/sample_predictions.png")
                plt.close()
                print(f"Visualization saved to {viz_path}/sample_predictions.png")
                
            except ImportError:
                print("Matplotlib not available. Skipping visualization.")
                
        except ImportError:
            print("Pandas not available. CSV export skipped.")
        
        # Save the best model in a separate directory for easy reference
        try:
            if 'models_dir' in globals():
                models_path = models_dir
            else:
                models_path = "./models"
                
            best_model_path = f"{models_path}/best_model"
            os.makedirs(best_model_path, exist_ok=True)
            best_model.save_pretrained(best_model_path)
            print(f"Best model saved to {best_model_path}")
            
            # Create a simple README with usage instructions
            readme_content = f"""# LoRA Fine-tuned Sentiment Analysis Model

## Model Information
- Base model: {model_name}
- Configuration: {best_variant}
- Date: {time.strftime("%Y-%m-%d")}

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load model
peft_config = PeftConfig.from_pretrained("./best_model")
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)
base_model = AutoModelForSequenceClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=2
)
model = PeftModel.from_pretrained(base_model, "./best_model")
model.eval()  # Disable dropout for inference

# Predict sentiment
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # No gradients are needed at inference time
        logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=-1).item()
    return "Positive" if prediction == 1 else "Negative"
```
"""
            
            with open(f"{best_model_path}/README.md", "w") as f:
                f.write(readme_content)
                
            print(f"README file with usage instructions created at {best_model_path}/README.md")
            
        except Exception as e:
            print(f"Error saving best model: {e}")
        
    else:
        print("No models available for inference.")
        
except Exception as e:
    import traceback
    print(f"Error during inference with best model: {e}")
    traceback.print_exc()


Performing inference with the best LoRA model...
Selected Best Model for inference

Selected model for inference: Best Model
Using base model: distilbert-base-uncased

Testing model on sample texts:

Text: I absolutely loved this product! It exceeded all my expectations.
Prediction: Positive (confidence: 0.9897)
Probability distribution: Positive=0.9897, Negative=0.0103

Text: The service was terrible and I would never recommend this place.
Prediction: Negative (confidence: 0.9428)
Probability distribution: Positive=0.0572, Negative=0.9428

Text: It was okay, not great but not bad either.
Prediction: Positive (confidence: 0.8879)
Probability distribution: Positive=0.8879, Negative=0.1121

Text: While there were some issues, overall I had a positive experience.
Prediction: Positive (confidence: 0.9931)
Probability distribution: Positive=0.9931, Negative=0.0069

Text: I'm requesting a refund because this product is defective.
Prediction: Negative (confidence: 0.9624)
Probability distrib
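
The confidence values printed above are consistent with a softmax over the model's two output logits, with index 1 mapped to Positive as in the README snippet. The inference code itself is not shown in this section, so the following is only a minimal sketch under that assumption. `softmax` and `describe_prediction` are hypothetical helper names, and the logits are made up for illustration rather than taken from the model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # Subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits, labels=("Negative", "Positive")):
    """Return the predicted label, its confidence, and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": labels[best],
        "confidence": probs[best],  # Confidence is the winning class probability
        "probabilities": dict(zip(labels, probs)),
    }


# Hypothetical logits in [negative, positive] order
result = describe_prediction([-2.0, 2.5])
print(f"Prediction: {result['label']} (confidence: {result['confidence']:.4f})")
```

With two classes the confidence can never drop below 0.5, which is worth keeping in mind when reading mixed-sentiment examples such as "It was okay, not great but not bad either."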

## Visualizations

In [46]:
# Visualization helpers and driver code

print("\n" + "="*80)
print("Generating comprehensive visualizations of results...")
print("="*80)

# Make sure visualization directories exist
os.makedirs(viz_dir, exist_ok=True)

# 1. Compare all model performances
def plot_model_comparison(results_dict, metrics=('accuracy', 'f1', 'precision', 'recall')):
    """Create comparison charts for different models on multiple metrics"""
    
    # Keep only metrics that have an eval_<metric> entry in the first result
    filtered_metrics = [m for m in metrics if f'eval_{m}' in next(iter(results_dict.values()))]
    
    for metric in filtered_metrics:
        plt.figure(figsize=(12, 6))
        
        # Sort models for consistency, with Base Model always first
        models = ["Base Model"] + [m for m in results_dict.keys() if m != "Base Model"]
        values = [results_dict[model][f'eval_{metric}'] for model in models]
        
        # Create bar chart with custom colors
        colors = ['#1f77b4'] + sns.color_palette("viridis", len(models)-1)
        bars = plt.bar(models, values, color=colors)
        
        # Set title and labels
        plt.title(f'{metric.capitalize()} Comparison Across Models', fontsize=14)
        plt.xlabel('Model', fontsize=12)
        plt.ylabel(metric.capitalize(), fontsize=12)
        
        # Add value labels on top of each bar
        for bar in bars:
            height = bar.get_height()
            plt.text(bar.get_x() + bar.get_width()/2., height + 0.005,
                     f'{height:.4f}',
                     ha='center', va='bottom', fontsize=11)
        
        # Add improvement percentages relative to base model
        base_value = results_dict["Base Model"][f'eval_{metric}']
        for i, model in enumerate(models):
            if model != "Base Model":
                model_value = results_dict[model][f'eval_{metric}']
                improvement = (model_value - base_value) / base_value * 100
                color = 'green' if improvement > 0 else 'red'
                plt.text(i, model_value * 0.95, f'{improvement:+.1f}%',
                         ha='center', va='center', color=color, fontweight='bold')
        
        # Rotate x-axis labels for better readability
        plt.xticks(rotation=30, ha='right')
        plt.grid(True, alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/{metric}_comparison.png")
        plt.close()
        print(f"Saved {metric} comparison chart to {viz_dir}/{metric}_comparison.png")

# 2. Plot confusion matrices for all models
def plot_all_confusion_matrices(base_model, loaded_models, validation_data, tokenizer):
    """Generate confusion matrices for base model and all loaded PEFT models"""
    # Function to get predictions from a model
    def get_predictions(model, dataset):
        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                output_dir="./tmp",
                per_device_eval_batch_size=32,
                report_to="none",
            ),
            eval_dataset=dataset,
        )
        outputs = trainer.predict(dataset)
        return outputs.predictions.argmax(axis=1), outputs.label_ids
    
    # Plot base model confusion matrix if not already done
    base_cm_path = f"{viz_dir}/base_model_confusion_matrix.png"
    if not os.path.exists(base_cm_path):
        pred_labels, true_labels = get_predictions(base_model, validation_data)
        
        plt.figure(figsize=(8, 6))
        cm = confusion_matrix(true_labels, pred_labels)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
        plt.title('Confusion Matrix - Base Model')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        
        if cm.shape[0] == 2:
            plt.xticks([0.5, 1.5], ['Negative', 'Positive'])
            plt.yticks([0.5, 1.5], ['Negative', 'Positive'])
        
        plt.tight_layout()
        plt.savefig(base_cm_path)
        plt.close()
        print(f"Saved base model confusion matrix to {base_cm_path}")
    
    # Plot confusion matrices for all loaded models
    for model_name, model in loaded_models.items():
        if "confusion_matrix" in model_name.lower():
            continue  # Skip if already has confusion in the name
            
        cm_path = f"{viz_dir}/{model_name.replace(' ', '_').lower()}_confusion_matrix.png"
        if not os.path.exists(cm_path):
            try:
                pred_labels, true_labels = get_predictions(model, validation_data)
                
                plt.figure(figsize=(8, 6))
                cm = confusion_matrix(true_labels, pred_labels)
                sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
                plt.title(f'Confusion Matrix - {model_name}')
                plt.ylabel('True Label')
                plt.xlabel('Predicted Label')
                
                if cm.shape[0] == 2:
                    plt.xticks([0.5, 1.5], ['Negative', 'Positive'])
                    plt.yticks([0.5, 1.5], ['Negative', 'Positive'])
                
                plt.tight_layout()
                plt.savefig(cm_path)
                plt.close()
                print(f"Saved {model_name} confusion matrix to {cm_path}")
            except Exception as e:
                print(f"Error generating confusion matrix for {model_name}: {e}")

# 3. Parameter efficiency visualization
def plot_parameter_efficiency():
    """Create parameter efficiency visualizations for various models"""
    # Create pies showing trainable vs frozen parameters
    for model_name, model_metrics in all_results.items():
        if model_name == "Base Model" or "trainable_pct" not in model_metrics:
            continue
            
        plt.figure(figsize=(10, 6))
        
        # Calculate frozen parameters
        trainable_pct = model_metrics["trainable_pct"]
        frozen_pct = 100 - trainable_pct
        
        # Create pie chart
        sizes = [trainable_pct, frozen_pct]
        labels = ['Trainable Parameters', 'Frozen Parameters']
        colors = ['#ff9999', '#66b3ff']
        explode = (0.1, 0)  # Explode the trainable parameters
        
        plt.pie(sizes, explode=explode, labels=labels, colors=colors,
                autopct='%1.1f%%', shadow=True, startangle=90,
                textprops={'fontsize': 14})
        
        plt.axis('equal')  # Equal aspect ratio ensures pie is drawn as a circle
        
        if "trainable_params" in model_metrics and "total_params" in model_metrics:
            trainable = model_metrics["trainable_params"]
            total = model_metrics["total_params"]
            plt.title(f'{model_name} Parameter Efficiency\n{trainable:,} of {total:,} parameters',
                    fontsize=16)
        else:
            plt.title(f'{model_name} Parameter Efficiency', fontsize=16)
        
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/{model_name.replace(' ', '_').lower()}_parameter_efficiency.png")
        plt.close()
        print(f"Saved parameter efficiency for {model_name} to {viz_dir}/{model_name.replace(' ', '_').lower()}_parameter_efficiency.png")
    
    # Create parameter efficiency vs performance plot
    if any("trainable_pct" in metrics for name, metrics in all_results.items() if name != "Base Model"):
        plt.figure(figsize=(10, 6))
        
        # Extract data
        model_names = []
        param_efficiency = []
        accuracy_values = []
        
        for model_name, metrics in all_results.items():
            if model_name != "Base Model" and "trainable_pct" in metrics:
                model_names.append(model_name)
                param_efficiency.append(metrics["trainable_pct"])
                accuracy_values.append(metrics["eval_accuracy"])
        
        # Create scatter plot
        plt.scatter(param_efficiency, accuracy_values, s=100, c=range(len(model_names)), cmap='viridis')
        
        # Add labels for each point
        for i, model in enumerate(model_names):
            plt.annotate(model, (param_efficiency[i], accuracy_values[i]), 
                        fontsize=9, ha='right', va='bottom')
        
        # Add reference line for base model accuracy
        base_accuracy = all_results["Base Model"]["eval_accuracy"]
        plt.axhline(y=base_accuracy, color='r', linestyle='--', alpha=0.7,
                   label=f'Base Model Accuracy: {base_accuracy:.4f}')
        
        # Labels and title
        plt.xlabel('Percentage of Parameters Trained (%)', fontsize=12)
        plt.ylabel('Accuracy', fontsize=12)
        plt.title('Accuracy vs Parameter Efficiency', fontsize=14)
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/efficiency_vs_accuracy.png")
        plt.close()
        print(f"Saved efficiency vs accuracy plot to {viz_dir}/efficiency_vs_accuracy.png")

# 4. Training history visualization 
def plot_training_curves():
    """Plot training curves from saved history if available"""
    # Plot LoRA learning curves if history exists
    lora_curves_path = f"{viz_dir}/lora_learning_curves.png"
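    # peft_trainer is a notebook-level variable, so look it up in globals();
    # inside a function, locals() only contains the function's own names
    if not os.path.exists(lora_curves_path) and 'peft_trainer' in globals() and hasattr(peft_trainer, 'state'):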
        history = peft_trainer.state.log_history
        
        # Extract training loss
        train_loss = []
        train_steps = []
        eval_loss = []
        eval_acc = []
        eval_steps = []
        
        for entry in history:
            if 'loss' in entry and 'eval_loss' not in entry:
                train_loss.append(entry['loss'])
                train_steps.append(entry['step'])
            if 'eval_loss' in entry:
                eval_loss.append(entry['eval_loss'])
                eval_acc.append(entry['eval_accuracy'])
                eval_steps.append(entry['step'])
        
        if train_loss and eval_acc:  # Only create plot if we have data
            # Create the plot
            plt.figure(figsize=(14, 6))
            
            # Plot training loss
            plt.subplot(1, 2, 1)
            plt.plot(train_steps, train_loss, 'b-', marker='o', markersize=4, alpha=0.7)
            plt.title('Training Loss', fontsize=14)
            plt.xlabel('Steps', fontsize=12)
            plt.ylabel('Loss', fontsize=12)
            plt.grid(True, alpha=0.3)
            
            # Plot evaluation accuracy
            ax1 = plt.subplot(1, 2, 2)
            ax1.plot(eval_steps, eval_acc, 'g-', marker='o', markersize=6, label='Accuracy')
            
            # If eval loss exists, plot it on a secondary y-axis
            if eval_loss:
                ax2 = ax1.twinx()
                ax2.plot(eval_steps, eval_loss, 'r--', marker='x', markersize=4, alpha=0.7, label='Loss')
                ax2.set_ylabel('Loss', color='r', fontsize=12)
                ax2.tick_params(axis='y', colors='r')
            
            # Label the primary axes explicitly: twinx() makes the new axes current,
            # so plt.title/plt.ylabel would otherwise overwrite the loss axis label
            ax1.set_title('Validation Metrics', fontsize=14)
            ax1.set_xlabel('Steps', fontsize=12)
            ax1.set_ylabel('Accuracy', fontsize=12)
            ax1.grid(True, alpha=0.3)
            ax1.legend(loc='lower right')
            
            plt.tight_layout()
            plt.savefig(lora_curves_path)
            plt.close()
            print(f"Saved LoRA learning curves to {lora_curves_path}")

# 5. LoRA variants comparison if data exists
def plot_lora_variants_comparison():
    """Plot comparisons of LoRA variants if data exists"""
    # Check if comparison data exists
    # comparison_results is a notebook-level variable, so check globals() rather than locals()
    if 'comparison_results' in globals() and comparison_results:
        variants_path = f"{viz_dir}/lora_variants_comparison.png"
        if not os.path.exists(variants_path):
            config_df = pd.DataFrame(comparison_results)
            
            plt.figure(figsize=(15, 12))
            
            # Plot accuracy comparison
            plt.subplot(2, 2, 1)
            sns.barplot(x='config_name', y='accuracy', data=config_df, palette='viridis')
            plt.title('Accuracy Comparison', fontsize=14)
            plt.xticks(rotation=45, ha='right', fontsize=10)
            plt.ylabel('Accuracy', fontsize=12)
            plt.grid(True, alpha=0.3)
            
            # Plot F1 score comparison
            plt.subplot(2, 2, 2)
            sns.barplot(x='config_name', y='f1', data=config_df, palette='viridis')
            plt.title('F1 Score Comparison', fontsize=14)
            plt.xticks(rotation=45, ha='right', fontsize=10)
            plt.ylabel('F1 Score', fontsize=12)
            plt.grid(True, alpha=0.3)
            
            # Plot training time comparison
            plt.subplot(2, 2, 3)
            sns.barplot(x='config_name', y='training_time_min', data=config_df, palette='viridis')
            plt.title('Training Time Comparison', fontsize=14)
            plt.xticks(rotation=45, ha='right', fontsize=10)
            plt.ylabel('Training Time (min)', fontsize=12)
            plt.grid(True, alpha=0.3)
            
            # Plot parameter efficiency comparison
            plt.subplot(2, 2, 4)
            sns.barplot(x='config_name', y='param_efficiency', data=config_df, palette='viridis')
            plt.title('Parameter Efficiency (% of Total)', fontsize=14)
            plt.xticks(rotation=45, ha='right', fontsize=10)
            plt.ylabel('Parameter Efficiency (%)', fontsize=12)
            plt.grid(True, alpha=0.3)
            
            plt.tight_layout()
            plt.savefig(variants_path)
            plt.close()
            print(f"Saved LoRA variants comparison to {variants_path}")
            
            # Create radar chart
            radar_path = f"{viz_dir}/lora_variants_radar.png"
            if not os.path.exists(radar_path):
                try:
                    from math import pi
                    
                    # Prepare data for radar chart
                    categories = ['Accuracy', 'F1 Score', 'Parameter Efficiency', 'Speed']
                    
                    # Normalize values for radar chart
                    max_acc = config_df['accuracy'].max()
                    max_f1 = config_df['f1'].max()
                    min_params = config_df['param_efficiency'].min()
                    max_params = config_df['param_efficiency'].max()
                    min_time = config_df['training_time_sec'].min()
                    max_time = config_df['training_time_sec'].max()
                    
                    # Create radar values (normalized from 0 to 1)
                    radar_data = {}
                    for _, row in config_df.iterrows():
                        # Normalize values to 0-1 scale
                        acc_norm = row['accuracy'] / max_acc if max_acc else 0
                        f1_norm = row['f1'] / max_f1 if max_f1 else 0
                        
                        # param_efficiency is the trainable-parameter percentage;
                        # min-max scaling maps the largest percentage to 1
                        if max_params == min_params:
                            param_norm = 1.0
                        else:
                            param_norm = (row['param_efficiency'] - min_params) / (max_params - min_params)
                        
                        # For time, faster is better (invert)
                        if max_time == min_time:
                            time_norm = 1.0
                        else:
                            time_norm = 1 - ((row['training_time_sec'] - min_time) / (max_time - min_time))
                        
                        radar_data[row['config_name']] = [acc_norm, f1_norm, param_norm, time_norm]
                    
                    # Number of variables
                    N = len(categories)
                    
                    # Compute angle for each axis
                    angles = [n / float(N) * 2 * pi for n in range(N)]
                    angles += angles[:1]  # Close the loop
                    
                    # Initialize the plot
                    plt.figure(figsize=(10, 10))
                    ax = plt.subplot(111, polar=True)
                    
                    # Draw one axis per variable and add labels
                    plt.xticks(angles[:-1], categories, fontsize=12)
                    
                    # Draw the chart for each config
                    for i, (config_name, values) in enumerate(radar_data.items()):
                        values += values[:1]  # Close the loop
                        ax.plot(angles, values, linewidth=2, linestyle='solid', label=config_name)
                        ax.fill(angles, values, alpha=0.1)
                    
                    # Add legend
                    plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
                    plt.title('LoRA Configuration Comparison', fontsize=15)
                    
                    plt.tight_layout()
                    plt.savefig(radar_path)
                    plt.close()
                    print(f"Saved radar chart comparison to {radar_path}")
                except Exception as e:
                    print(f"Error creating radar chart: {e}")

# 6. Memory usage comparison 
def plot_memory_comparison():
    """Plot memory usage comparison if memory data exists"""
    memory_data = {}
    
    # Extract memory usage from notebook-level variables if they exist.
    # These live in globals(); locals() inside a function would never contain them.
    if 'base_memory' in globals():
        memory_data["Base Model"] = base_memory["allocated_mb"]
    
    if 'lora_memory' in globals():
        memory_data["LoRA Model"] = lora_memory["allocated_mb"]
        
    if 'qlora_memory' in globals():
        memory_data["QLoRA Model"] = qlora_memory["allocated_mb"]
    
    if memory_data and len(memory_data) > 1:
        plt.figure(figsize=(10, 6))
        
        # Create bar chart
        models = list(memory_data.keys())
        memory_values = list(memory_data.values())
        
        bars = plt.bar(models, memory_values, color=sns.color_palette("viridis", len(models)))
        
        # Add value labels
        for bar in bars:
            height = bar.get_height()
            plt.text(bar.get_x() + bar.get_width()/2, height + 1,
                    f'{height:.1f} MB',
                    ha='center', va='bottom', fontsize=10)
        
        # Customize plot
        plt.ylabel('Memory Usage (MB)', fontsize=12)
        plt.title('Memory Usage Comparison', fontsize=14)
        plt.grid(True, alpha=0.3, axis='y')
        
        # Add reduction percentages
        if "Base Model" in memory_data:
            base_mem = memory_data["Base Model"]
            for i, (model, mem) in enumerate(zip(models, memory_values)):
                if model != "Base Model":
                    reduction = (base_mem - mem) / base_mem * 100
                    plt.text(i, mem / 2,
                            f'{reduction:+.1f}%',
                            ha='center', va='center',
                            color='white', fontweight='bold')
        
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/memory_usage_comparison.png")
        plt.close()
        print(f"Saved memory usage comparison to {viz_dir}/memory_usage_comparison.png")

# 7. Final comprehensive comparison visualization
def create_final_comparison():
    """Create a comprehensive summary visualization of all models"""
    # Create a table-like visualization comparing all models
    if all_results:
        # Determine which metrics are available
        metrics = []
        for metric_name in ['accuracy', 'f1', 'precision', 'recall']:
            if f'eval_{metric_name}' in all_results["Base Model"]:
                metrics.append(metric_name)
        
        if not metrics:
            return
            
        # Extract data
        models = list(all_results.keys())
        data = []
        
        for model in models:
            row = [model]
            for metric in metrics:
                if f'eval_{metric}' in all_results[model]:
                    row.append(all_results[model][f'eval_{metric}'])
                else:
                    row.append(None)
            data.append(row)
        
        # Convert to DataFrame for easier handling
        df = pd.DataFrame(data, columns=['Model'] + [m.capitalize() for m in metrics])
        
        # Create a visually appealing table
        fig, ax = plt.subplots(figsize=(12, len(models) * 0.8 + 2))
        ax.axis('tight')
        ax.axis('off')
        
        # Calculate improvements relative to base model
        improvements = {}
        base_idx = df.index[df['Model'] == 'Base Model'].tolist()[0]
        base_values = df.iloc[base_idx, 1:].values
        
        for i, row in df.iterrows():
            if row['Model'] != 'Base Model':
                improvements[i] = []
                for j, val in enumerate(row.values[1:]):
                    # Missing values become NaN in the DataFrame, so test with pd.notna
                    if pd.notna(val) and pd.notna(base_values[j]) and base_values[j] != 0:
                        pct = (val - base_values[j]) / base_values[j] * 100
                        improvements[i].append(f"{val:.4f}\n({pct:+.1f}%)")
                    else:
                        improvements[i].append("")
        
        # Create the table
        table_data = []
        for i, row in df.iterrows():
            if row['Model'] == 'Base Model':
                table_data.append([row['Model']] + [f"{v:.4f}" for v in row.values[1:]])
            else:
                table_data.append([row['Model']] + improvements[i])
        
        table = ax.table(cellText=table_data, colLabels=df.columns, loc='center',
                         cellLoc='center', colColours=['#f2f2f2']*len(df.columns))
        
        # Customize table appearance
        table.auto_set_font_size(False)
        table.set_fontsize(10)
        table.scale(1, 1.5)  # Adjust row height
        
        # Highlight the best value in each column
        for j in range(1, len(df.columns)):
            col_values = df.iloc[:, j].values
            best_idx = np.nanargmax(col_values)
            if best_idx != base_idx:  # If best is not base model
                cell = table[best_idx+1, j]
                cell.set_facecolor('#d4f7d4')  # Light green
        
        plt.title('Comprehensive Model Comparison', fontsize=16, pad=20)
        plt.tight_layout()
        plt.savefig(f"{viz_dir}/comprehensive_comparison.png", bbox_inches='tight')
        plt.close()
        print(f"Saved comprehensive comparison to {viz_dir}/comprehensive_comparison.png")

# 8. Create sample prediction visualization if data exists
def plot_sample_predictions():
    """Create visualization of sample predictions if data exists"""
    sample_pred_path = f"{results_dir}/sample_predictions.csv"
    if os.path.exists(sample_pred_path):
        try:
            sample_df = pd.read_csv(sample_pred_path)
            
            # Create shortened texts for display
            texts_short = [t[:30] + "..." if len(t) > 30 else t for t in sample_df['text']]
            
            # Sort by confidence for better visualization
            sample_df['prediction_numeric'] = sample_df['prediction'].apply(lambda x: 1 if x == 1 or x == "Positive" else 0)
            sample_df = sample_df.sort_values(['prediction_numeric', 'confidence'], ascending=[False, False])
            
            # Get sorted data
            texts_short = [texts_short[i] for i in sample_df.index]
            predictions = sample_df['prediction_numeric'].values
            confidences = sample_df['confidence'].values
            
            # Create colored confidence bars
            colors = ['green' if p == 1 else 'red' for p in predictions]
            
            # Plot bars
            plt.figure(figsize=(14, 7))
            bars = plt.bar(range(len(texts_short)), confidences, color=colors)
            
            # Add text labels
            for i, bar in enumerate(bars):
                plt.text(i, bar.get_height() + 0.02, texts_short[i],
                        rotation=45, ha='right', fontsize=9)
            
            # Customize plot
            plt.ylim(0, 1.1)  # Leave room for text labels
            plt.xticks([])  # Hide x-axis labels as we've added text annotations
            plt.ylabel('Prediction Confidence', fontsize=12)
            plt.title('Sentiment Analysis Prediction Confidence', fontsize=14)
            
            # Add legend
            from matplotlib.patches import Patch
            legend_elements = [
                Patch(facecolor='green', label='Positive'),
                Patch(facecolor='red', label='Negative')
            ]
            plt.legend(handles=legend_elements)
            
            plt.tight_layout()
            plt.savefig(f"{viz_dir}/sample_predictions_confidence.png")
            plt.close()
            print(f"Saved sample predictions visualization to {viz_dir}/sample_predictions_confidence.png")
            
            # Create stacked bar chart of positive/negative probabilities
            if 'pos_prob' in sample_df.columns and 'neg_prob' in sample_df.columns:
                plt.figure(figsize=(14, 7))
                
                indices = np.arange(len(texts_short))
                width = 0.35
                
                # Plot positive and negative probabilities
                plt.bar(indices, sample_df['pos_prob'], width, label='Positive', color='green', alpha=0.7)
                plt.bar(indices + width, sample_df['neg_prob'], width, label='Negative', color='red', alpha=0.7)
                
                # Add text labels
                for i in range(len(texts_short)):
                    plt.text(i + width/2, -0.05, texts_short[i], rotation=45, ha='right', fontsize=9)
                
                # Customize plot
                plt.xlabel('Examples')
                plt.ylabel('Probability')
                plt.title('Positive and Negative Sentiment Probabilities')
                plt.legend()
                plt.xticks([])
                plt.grid(axis='y', alpha=0.3)
                
                plt.tight_layout()
                plt.subplots_adjust(bottom=0.15)  # Make room for text labels
                plt.savefig(f"{viz_dir}/sample_predictions_probabilities.png")
                plt.close()
                print(f"Saved probability distribution visualization to {viz_dir}/sample_predictions_probabilities.png")
                
        except Exception as e:
            print(f"Error creating sample predictions visualization: {e}")

# Execute all visualization functions
try:
    # 1. Model performance comparison
    print("\nGenerating model performance comparisons...")
    plot_model_comparison(all_results)
    
    # 2. Parameter efficiency visualization
    print("\nGenerating parameter efficiency visualizations...")
    plot_parameter_efficiency()
    
    # 3. Training curves visualization
    print("\nGenerating training curves...")
    plot_training_curves()
    
    # 4. LoRA variants comparison
    print("\nGenerating LoRA configuration comparisons...")
    plot_lora_variants_comparison()
    
    # 5. Memory usage comparison
    print("\nGenerating memory usage comparison...")
    plot_memory_comparison()
    
    # 6. Final comprehensive comparison
    print("\nGenerating comprehensive comparison...")
    create_final_comparison()
    
    # 7. Sample predictions visualization
    print("\nGenerating sample predictions visualization...")
    plot_sample_predictions()
    
    # 8. Confusion matrices for all models if not already created
    if 'base_model' in globals() and 'loaded_models' in globals() and loaded_models and 'encoded_validation' in globals():
        print("\nGenerating confusion matrices for all models...")
        plot_all_confusion_matrices(base_model, loaded_models, encoded_validation, tokenizer)
    
    print("\nAll visualizations complete! Check the output directory:", viz_dir)
    
except Exception as e:
    import traceback
    print(f"Error during visualization generation: {e}")
    traceback.print_exc()
    print("Continuing with available visualizations...")

# Print summary of created visualizations
try:
    viz_files = os.listdir(viz_dir)
    print(f"\nCreated {len(viz_files)} visualization files:")
    for i, file in enumerate(sorted(viz_files)):
        print(f"{i+1}. {file}")
except OSError as e:
    print(f"Could not list visualization files: {e}")


Generating comprehensive visualizations of results...

Generating model performance comparisons...
Saved accuracy comparison chart to ./peft_output/visualizations/accuracy_comparison.png
Saved f1 comparison chart to ./peft_output/visualizations/f1_comparison.png
Saved precision comparison chart to ./peft_output/visualizations/precision_comparison.png
Saved recall comparison chart to ./peft_output/visualizations/recall_comparison.png

Generating parameter efficiency visualizations...

Generating training curves...

Generating LoRA configuration comparisons...

Generating memory usage comparison...

Generating comprehensive comparison...
Saved comprehensive comparison to ./peft_output/visualizations/comprehensive_comparison.png

Generating sample predictions visualization...
Saved sample predictions visualization to ./peft_output/visualizations/sample_predictions_confidence.png
Saved probability distribution visualization to ./peft_output/visualizations/sample_predictions_probabilities.

Saved Best Model confusion matrix to ./peft_output/visualizations/best_model_confusion_matrix.png


Saved Low Rank (r=4) confusion matrix to ./peft_output/visualizations/low_rank_(r=4)_confusion_matrix.png


Saved High Rank (r=32) confusion matrix to ./peft_output/visualizations/high_rank_(r=32)_confusion_matrix.png


Saved Query-Only confusion matrix to ./peft_output/visualizations/query-only_confusion_matrix.png


Saved With Bias Training confusion matrix to ./peft_output/visualizations/with_bias_training_confusion_matrix.png


Saved Higher Dropout confusion matrix to ./peft_output/visualizations/higher_dropout_confusion_matrix.png

All visualizations complete! Check the output directory: ./peft_output/visualizations

Created 42 visualization files:
1. accuracy_comparison.png
2. accuracy_vs_time.png
3. all_models_metrics.png
4. base_model_confidence.png
5. base_model_confusion_matrix.png
6. base_model_roc_curve.png
7. best_model_confusion_matrix.png
8. class_distribution.png
9. comprehensive_comparison.png
10. confusion_matrices.png
11. example_model_comparison.png
12. example_param_efficiency.png
13. f1_comparison.png
14. high_rank_(r=32)_confusion_matrix.png
15. higher_dropout_confusion_matrix.png
16. inference_confidence.png
17. learning_curves.png
18. lora_learning_curves.png
19. lora_model_confusion_matrix.png
20. lora_model_roc_curve.png
21. lora_parameter_efficiency.png
22. lora_variants_comparison.png
23. lora_variants_radar.png
24. low_rank_(r=4)_confusion_matrix.png
25. metrics_heatmap.png
26. model

<Figure size 1200x800 with 0 Axes>
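
The file listing above includes names such as `low_rank_(r=4)_confusion_matrix.png`, because the plotting code only lowercases the model name and replaces spaces. Parentheses and `=` are valid in file names but awkward in shells and URLs. A minimal sketch of a stricter naming helper follows. `slugify_model_name` is a hypothetical function, not part of the notebook, and it would change the output file names if adopted.

```python
import re


def slugify_model_name(model_name):
    """Lowercase the name and collapse every run of non-alphanumeric characters into '_'."""
    slug = re.sub(r"[^a-z0-9]+", "_", model_name.lower())
    return slug.strip("_")  # Drop underscores left at either end


# Model names as they appear in the output above
for name in ["Best Model", "Low Rank (r=4)", "Query-Only"]:
    print(f"{slugify_model_name(name)}_confusion_matrix.png")
```

Because the `os.path.exists` checks in the plotting functions depend on the generated name, any renaming scheme like this would need to be applied consistently wherever the path is built.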