# Fine-tuning Mistral-7B for Automatic Short Answer Grading

## Table of Contents
1. [Environment Setup](#environment-setup)
2. [Data Loading & Preprocessing](#data-loading)
3. [Model Configuration](#model-configuration)
4. [Training Setup](#training-setup)
5. [Evaluation Metrics](#evaluation-metrics)
6. [Inference Pipeline](#inference-pipeline)
7. [Visualization & Analysis](#visualization)
8. [Optional Enhancements](#optional-enhancements)

---

### Requirements
- **GPU**: P100 (16GB) or better
- **Estimated Runtime**: 3-5 hours for full training
- **Memory**: ~12-14GB GPU memory with 4-bit quantization

### Troubleshooting Tips
- If OOM errors occur, reduce `per_device_train_batch_size` (raising `gradient_accumulation_steps` to keep the effective batch size) or lower `max_length`
- Save checkpoints every 500 steps to avoid losing progress
- Use Kaggle's "Save Version" feature regularly
- If dataset loading fails, check Kaggle API authentication


## 1. Environment Setup {#environment-setup}


In [None]:
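# Install required libraries
# Version specifiers are quoted so the shell does not treat '>' as output redirection.
# transformers 4.41 or newer is needed for the eval_strategy argument used in the training setup.
!pip install -q --upgrade "transformers>=4.41.0" "peft>=0.6.0" "bitsandbytes>=0.41.0" "datasets>=2.14.0" "accelerate>=0.24.0" wandb scikit-learn nltk rouge-score bert-score torch torchvision torchaudio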


In [None]:
import os
import json
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
    EarlyStoppingCallback
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType
)
from datasets import Dataset as HFDataset
from sklearn.metrics import (
    cohen_kappa_score,
    accuracy_score,
    confusion_matrix,
    classification_report
)
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Optional: wandb for experiment tracking
try:
    import wandb
    WANDB_AVAILABLE = True
except ImportError:
    WANDB_AVAILABLE = False
    print("wandb not available, skipping experiment tracking")

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)


In [None]:
# Check GPU availability and memory
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print("\nGPU Details:")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name}")
        print(f"    Total Memory: {props.total_memory / 1e9:.2f} GB")
        print(f"    Compute Capability: {props.major}.{props.minor}")
    print("\nNote: the model is loaded later with device_map='auto', which places it across the available GPUs")
else:
    print("Warning: No GPU detected. Training will be very slow on CPU.")


In [None]:
# Kaggle-specific setup
# Check if running in Kaggle environment
KAGGLE_ENV = os.path.exists('/kaggle/input')

if KAGGLE_ENV:
    print("Running in Kaggle environment")
    # Set up Kaggle output directory
    OUTPUT_DIR = '/kaggle/working'
    INPUT_DIR = '/kaggle/input'
else:
    print("Running in local environment")
    OUTPUT_DIR = './output'
    INPUT_DIR = './input'
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    os.makedirs(INPUT_DIR, exist_ok=True)

print(f"Output directory: {OUTPUT_DIR}")
print(f"Input directory: {INPUT_DIR}")


## 2. Data Loading & Preprocessing {#data-loading}


In [None]:
# Hyperparameters - Easy to modify
CONFIG = {
    # Model config
    'model_name': 'mistralai/Mistral-7B-v0.1',
    'use_4bit': True,
    'bnb_4bit_compute_dtype': 'float16',
    'bnb_4bit_quant_type': 'nf4',
    'use_nested_quant': False,
    
    # LoRA config
    'lora_r': 32,  # Rank
    'lora_alpha': 64,  # Scaling parameter
    'lora_dropout': 0.05,
    'lora_target_modules': ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
    
    # Training config
    'output_dir': os.path.join(OUTPUT_DIR, 'checkpoints'),
    'num_train_epochs': 3,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'gradient_accumulation_steps': 4,
    'learning_rate': 2e-4,
    'lr_scheduler_type': 'cosine',
    'warmup_steps': 100,
    'logging_steps': 50,
    'eval_steps': 500,
    'save_steps': 500,
    'save_total_limit': 3,
    'fp16': True,
    'bf16': False,  # Set to True (and fp16 to False) on Ampere or newer GPUs such as the A100
    'gradient_checkpointing': True,
    'dataloader_pin_memory': False,
    
    # Data config
    'max_length': 1024,
    'test_size': 0.15,
    'val_size': 0.15,
    'random_state': 42,
    
    # Inference config
    'temperature': 0.7,
    'top_p': 0.9,
    'max_new_tokens': 256,
}

# Create output directory
os.makedirs(CONFIG['output_dir'], exist_ok=True)

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
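
With the batch settings above, each optimizer update sees `per_device_train_batch_size × gradient_accumulation_steps` examples. The sketch below estimates the resulting step counts. It assumes a single data-parallel worker and uses a placeholder training-set size of 4,000 rows (substitute `len(train_df)`); `training_steps` is a hypothetical helper, and the Trainer's own count may differ slightly in how it rounds the last partial batch.

```python
import math

def training_steps(num_samples, per_device_batch, grad_accum, epochs, num_workers=1):
    """Return (effective_batch, steps_per_epoch, total_steps) for a run."""
    effective_batch = per_device_batch * grad_accum * num_workers
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# 4000 is a placeholder dataset size, not the real EngSAF row count.
eff, per_epoch, total = training_steps(4000, per_device_batch=4, grad_accum=4, epochs=3)
print(f"effective batch={eff}, steps/epoch={per_epoch}, total steps={total}")
```

If the total comes out close to `eval_steps` and `save_steps` (both 500 here), the run would be evaluated and checkpointed only once or twice, so it may be worth lowering both values for small datasets.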


In [None]:
# Load EngSAF dataset
# The dataset is already split into train.csv, val.csv, unseen_question.csv, and unseen_answers.csv
# Column mapping: Question -> question, Student Answer -> student_answer, 
#                 Correct Answer -> reference_answer, output_label -> score

def load_engsaf_split(dataset_dir=None, split='train'):
    """
    Load a specific split of the EngSAF dataset.
    
    Args:
        dataset_dir: Directory containing the dataset files. If None, searches common paths.
        split: Which split to load ('train', 'val', 'unseen_question', 'unseen_answers')
    
    Returns:
        DataFrame with standardized column names
    """
    if dataset_dir is None:
        # Try common paths for the dataset directory
        possible_dirs = [
            'EngSAF dataset',
            os.path.join(INPUT_DIR, 'engsaf-dataset'),
            os.path.join(INPUT_DIR, 'engsaf'),
            './EngSAF dataset',
            os.path.join(INPUT_DIR, 'EngSAF dataset'),
        ]
        
        for dir_path in possible_dirs:
            if os.path.exists(dir_path):
                dataset_dir = dir_path
                break
        
        if dataset_dir is None:
            raise FileNotFoundError("Could not find dataset directory. Please specify the path.")
    
    # Map split names to file names
    split_files = {
        'train': 'train.csv',
        'val': 'val.csv',
        'validation': 'val.csv',
        'unseen_question': 'unseen_question.csv',
        'unseen_answers': 'unseen_answers.csv',
        'test': 'unseen_question.csv'  # Default test set is unseen_question
    }
    
    if split not in split_files:
        raise ValueError(f"Unknown split: {split}. Choose from {list(split_files.keys())}")
    
    file_path = os.path.join(dataset_dir, split_files[split])
    
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Could not find {file_path}")
    
    # Load CSV
    df = pd.read_csv(file_path)
    
    # Map column names to standardized format
    column_mapping = {
        'Question': 'question',
        'Student Answer': 'student_answer',
        'Correct Answer': 'reference_answer',
        'output_label': 'score',
        'feedback': 'feedback',
        'Question_id': 'question_id'  # Keep for reference but not required
    }
    
    # Rename columns
    df = df.rename(columns=column_mapping)
    
    # Validate required columns
    required_cols = ['question', 'student_answer', 'reference_answer', 'score', 'feedback']
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}. Available columns: {list(df.columns)}")
    
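    # Clean data
    df = df.dropna(subset=['question', 'student_answer', 'score'])
    df['score'] = df['score'].astype(int)
    # Missing reference answers or feedback become empty strings rather than NaN,
    # so later string operations on these columns do not fail
    df[['reference_answer', 'feedback']] = df[['reference_answer', 'feedback']].fillna('')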
    
    # Remove any rows with empty strings
    df = df[df['question'].str.strip() != '']
    df = df[df['student_answer'].str.strip() != '']
    
    print(f"Loaded {split} split: {len(df)} samples")
    print(f"Score distribution:\n{df['score'].value_counts().sort_index()}")
    
    return df

def load_all_engsaf_splits(dataset_dir=None):
    """
    Load all splits of the EngSAF dataset.
    
    Returns:
        tuple: (train_df, val_df, test_df_unseen_question, test_df_unseen_answers)
    """
    train_df = load_engsaf_split(dataset_dir, 'train')
    val_df = load_engsaf_split(dataset_dir, 'val')
    test_df_unseen_question = load_engsaf_split(dataset_dir, 'unseen_question')
    test_df_unseen_answers = load_engsaf_split(dataset_dir, 'unseen_answers')
    
    return train_df, val_df, test_df_unseen_question, test_df_unseen_answers

# Load all dataset splits
# The dataset is already split, so we load them directly
dataset_dir = 'EngSAF dataset'  # Adjust path if needed

# Check if dataset directory exists
if not os.path.exists(dataset_dir):
    # Try alternative paths
    if os.path.exists(os.path.join(INPUT_DIR, 'EngSAF dataset')):
        dataset_dir = os.path.join(INPUT_DIR, 'EngSAF dataset')
    elif os.path.exists('./EngSAF dataset'):
        dataset_dir = './EngSAF dataset'
    else:
        print("Warning: 'EngSAF dataset' folder not found in the expected locations.")
        print("Falling back to the path search inside load_engsaf_split.")
        dataset_dir = None  # None triggers the search over possible_dirs

# Load all splits
train_df, val_df, test_df_unseen_question, test_df_unseen_answers = load_all_engsaf_splits(dataset_dir)

print("\n" + "="*80)
print("Dataset Summary:")
print("="*80)
print(f"Training set: {len(train_df)} samples")
print(f"Validation set: {len(val_df)} samples")
print(f"Test set (unseen questions): {len(test_df_unseen_question)} samples")
print(f"Test set (unseen answers): {len(test_df_unseen_answers)} samples")
print("\nNote: Use test_df_unseen_question for final evaluation (unseen questions)")
print("      Use test_df_unseen_answers for additional evaluation (unseen answers)")


In [None]:
# Dataset splits are already provided
# The dataset comes pre-split into:
# - train.csv: Training set
# - val.csv: Validation set  
# - unseen_question.csv: Test set with unseen questions (primary test set)
# - unseen_answers.csv: Test set with unseen answers (additional test set)

# The splits have already been loaded in the previous cell
# train_df, val_df, test_df_unseen_question, test_df_unseen_answers

# For training, we'll use:
# - train_df for training
# - val_df for validation
# - test_df_unseen_question for final evaluation (unseen questions - most realistic)

# Optional: You can also evaluate on test_df_unseen_answers for additional insights

print("Dataset splits ready:")
print(f"  Training: {len(train_df)} samples")
print(f"  Validation: {len(val_df)} samples")
print(f"  Test (unseen questions): {len(test_df_unseen_question)} samples")
print(f"  Test (unseen answers): {len(test_df_unseen_answers)} samples")

# Set the primary test set for evaluation
test_df = test_df_unseen_question  # Use unseen questions as primary test set


In [None]:
# Rubric-conditioned prompt template
# This template conditions the model on a grading rubric

DEFAULT_RUBRIC = """You are an expert grader evaluating student answers. Consider:
1. Accuracy: Is the answer factually correct?
2. Completeness: Does it address all parts of the question?
3. Clarity: Is the answer well-structured and clear?
4. Depth: Does it demonstrate understanding beyond surface level?

Provide a score (0-5) and constructive feedback."""

def create_prompt_template(question, student_answer, rubric=None, system_prompt=None):
    """
    Build the system and user prompts for instruction tuning.

    Returns a (system_prompt, user_prompt) tuple. format_instruction() later
    assembles the pieces, preceded by the BOS token, as:

    [INST] {system_prompt}

    Question: {question}
    Student Answer: {student_answer}

    Please grade this answer and provide feedback. [/INST] Score: {score}
    Feedback: {feedback}</s>
    """
    if rubric is None:
        rubric = DEFAULT_RUBRIC
    
    if system_prompt is None:
        system_prompt = rubric
    
    user_prompt = f"""Question: {question}
Student Answer: {student_answer}

Please grade this answer and provide feedback."""
    
    return system_prompt, user_prompt

def format_instruction(system_prompt, user_prompt, assistant_response=None):
    """
    Format instruction in Mistral's chat template format.
    """
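    # The tokenizer prepends the BOS token (<s>) itself, so it is not written
    # into the text; a literal "<s>" here would yield two BOS tokens.
    if assistant_response is None:
        # For inference
        prompt = f"[INST] {system_prompt}\n\n{user_prompt} [/INST]"
    else:
        # For training: the closing </s> marks the end of the response
        prompt = f"[INST] {system_prompt}\n\n{user_prompt} [/INST] {assistant_response}</s>"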
    
    return prompt

def create_assistant_response(score, feedback):
    """
    Create formatted assistant response with score and feedback.
    """
    return f"Score: {score}\nFeedback: {feedback}"

# Example usage
example_question = "What is photosynthesis?"
example_answer = "It's when plants make food using sunlight."
example_score = 3
example_feedback = "Your answer captures the basic concept but lacks detail. Photosynthesis involves converting light energy into chemical energy, specifically glucose, using carbon dioxide and water."

sys_prompt, user_prompt = create_prompt_template(example_question, example_answer)
assistant_resp = create_assistant_response(example_score, example_feedback)
formatted = format_instruction(sys_prompt, user_prompt, assistant_resp)

print("Example prompt format:")
print("=" * 80)
print(formatted)
print("=" * 80)


In [None]:
# Create dataset class for instruction tuning

class GradingDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=1024, rubric=None):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.rubric = rubric
        
        # Prepare data
        self.data = []
        for idx in range(len(self.df)):
            row = self.df.iloc[idx]
            
            # Create prompts
            sys_prompt, user_prompt = create_prompt_template(
                row['question'],
                row['student_answer'],
                rubric=self.rubric
            )
            
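            # Create assistant response (fall back when feedback is missing or blank)
            feedback = row.get('feedback')
            if not isinstance(feedback, str) or not feedback.strip():
                feedback = 'No feedback available.'
            assistant_resp = create_assistant_response(row['score'], feedback)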
            
            # Format full instruction
            full_text = format_instruction(sys_prompt, user_prompt, assistant_resp)
            
            self.data.append({
                'text': full_text,
                'score': row['score'],
                'question': row['question'],
                'student_answer': row['student_answer']
            })
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        
        # Tokenize
        encoding = self.tokenizer(
            item['text'],
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        
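        input_ids = encoding['input_ids'].flatten()
        # Return only tensors the model's forward() accepts. The numeric score
        # stays available in self.data and self.df; including it here would
        # pass an unexpected keyword argument to the model during training.
        return {
            'input_ids': input_ids,
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': input_ids.clone(),  # For causal LM, labels = input_ids
        }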

print("Dataset class defined. Will be instantiated after tokenizer is loaded.")
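
Setting `labels = input_ids` trains the model on every token, including the rubric and question in the prompt. A common alternative is to compute the loss only on the assistant response by setting all other label positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists. `mask_labels` and `prompt_len` are hypothetical names; in practice `prompt_len` would come from tokenizing the prompt alone, and the token ids here are arbitrary.

```python
IGNORE_INDEX = -100  # label value skipped by torch.nn.CrossEntropyLoss by default

def mask_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids to labels, ignoring prompt tokens and padding."""
    labels = []
    for pos, (tok, keep) in enumerate(zip(input_ids, attention_mask)):
        if pos < prompt_len or keep == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok)
    return labels

# Toy sequence: 3 prompt tokens, 2 response tokens, 2 padding tokens.
input_ids = [1, 733, 16289, 5024, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 1, 0, 0]
labels = mask_labels(input_ids, attention_mask, prompt_len=3)
print(labels)  # [-100, -100, -100, 5024, 2, -100, -100]
```

Note that a collator which rebuilds `labels` from `input_ids` would discard this masking, so the approach assumes a collator that passes precomputed labels through unchanged.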


## 3. Model Configuration {#model-configuration}


In [None]:
# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(CONFIG['model_name'])

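# Set padding token if not present. Reusing EOS as padding can cause a
# language-modeling collator to mask every EOS label together with the padding,
# so the unknown token is preferred when the tokenizer defines one.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token or tokenizer.eos_token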

print(f"Tokenizer loaded. Vocab size: {len(tokenizer)}")
print(f"Pad token: {tokenizer.pad_token}")


In [None]:
# Configure 4-bit quantization for memory efficiency
if CONFIG['use_4bit']:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type=CONFIG['bnb_4bit_quant_type'],
        bnb_4bit_compute_dtype=getattr(torch, CONFIG['bnb_4bit_compute_dtype']),
        bnb_4bit_use_double_quant=CONFIG['use_nested_quant'],
    )
    print("4-bit quantization configured")
else:
    bnb_config = None
    print("Using full precision (not recommended for Kaggle P100)")


In [None]:
# Load base model with quantization
print("Loading Mistral-7B model (this may take a few minutes)...")

model = AutoModelForCausalLM.from_pretrained(
    CONFIG['model_name'],
    quantization_config=bnb_config,
    device_map='auto',
    trust_remote_code=True,
    torch_dtype=torch.float16 if CONFIG['use_4bit'] else torch.float32
)

print("Model loaded successfully!")

# Enable gradient checkpointing for memory efficiency
if CONFIG['gradient_checkpointing']:
    model.gradient_checkpointing_enable()
    print("Gradient checkpointing enabled")


In [None]:
# Prepare model for k-bit training
if CONFIG['use_4bit']:
    model = prepare_model_for_kbit_training(model)
    print("Model prepared for k-bit training")

# Configure LoRA
lora_config = LoraConfig(
    r=CONFIG['lora_r'],
    lora_alpha=CONFIG['lora_alpha'],
    target_modules=CONFIG['lora_target_modules'],
    lora_dropout=CONFIG['lora_dropout'],
    bias='none',
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

print("LoRA configuration:")
model.print_trainable_parameters()

# Save LoRA config for reference (convert to JSON-serializable format)
lora_config_dict = {
    'r': lora_config.r,
    'lora_alpha': lora_config.lora_alpha,
    'target_modules': list(lora_config.target_modules) if isinstance(lora_config.target_modules, (set, tuple)) else lora_config.target_modules,
    'lora_dropout': lora_config.lora_dropout,
    'bias': lora_config.bias,
    'task_type': str(lora_config.task_type),
    'peft_type': str(lora_config.peft_type) if hasattr(lora_config, 'peft_type') else 'LORA',
}

with open(os.path.join(OUTPUT_DIR, 'lora_config.json'), 'w') as f:
    json.dump(lora_config_dict, f, indent=2)

print(f"LoRA config saved to {os.path.join(OUTPUT_DIR, 'lora_config.json')}")
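
As a sanity check on `print_trainable_parameters()`, the adapter size can be estimated by hand: each adapted linear layer gains two matrices, `A` (rank × in_features) and `B` (out_features × rank). The sketch below does that arithmetic. The layer shapes are assumptions taken from the published Mistral-7B-v0.1 configuration (hidden size 4096, MLP size 14336, 32 layers, key/value projections of width 1024), and `lora_param_count` is a hypothetical helper.

```python
# Assumed Mistral-7B-v0.1 shapes: hidden 4096, MLP 14336, KV width 1024, 32 layers.
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted projection
SHAPES = {
    'q_proj': (HIDDEN, HIDDEN),
    'k_proj': (HIDDEN, KV),
    'v_proj': (HIDDEN, KV),
    'o_proj': (HIDDEN, HIDDEN),
    'gate_proj': (HIDDEN, MLP),
    'up_proj': (HIDDEN, MLP),
    'down_proj': (MLP, HIDDEN),
}

def lora_param_count(rank, target_modules, shapes=SHAPES, layers=LAYERS):
    """Sum rank * (in + out) over the targeted modules of every layer."""
    per_layer = sum(rank * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * layers

total = lora_param_count(32, list(SHAPES))
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions, rank 32 on all seven projections gives roughly 84 million trainable parameters; restricting `target_modules` to `q_proj` and `v_proj` would cut that to under 14 million at the cost of adapter capacity.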


## 4. Training Setup {#training-setup}


In [None]:
# Create PyTorch datasets from the loaded DataFrames
# Note: This requires the tokenizer to be loaded first (from Model Configuration section)

# Check if tokenizer is available
try:
    train_dataset = GradingDataset(train_df, tokenizer, max_length=CONFIG['max_length'])
    val_dataset = GradingDataset(val_df, tokenizer, max_length=CONFIG['max_length'])
    test_dataset = GradingDataset(test_df, tokenizer, max_length=CONFIG['max_length'])
    
    # Optional: Also create dataset for unseen_answers test set
    test_dataset_unseen_answers = GradingDataset(test_df_unseen_answers, tokenizer, max_length=CONFIG['max_length'])
    
    print("Datasets created successfully:")
    print(f"  Training dataset: {len(train_dataset)} samples")
    print(f"  Validation dataset: {len(val_dataset)} samples")
    print(f"  Test dataset (unseen questions): {len(test_dataset)} samples")
    print(f"  Test dataset (unseen answers): {len(test_dataset_unseen_answers)} samples")
except NameError:
    print("Note: Tokenizer not yet loaded. Please run the Model Configuration cells first.")
    print("After loading the tokenizer, this cell will automatically create the datasets.")


In [None]:
# Custom data collator
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not masked LM
)


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=CONFIG['output_dir'],
    num_train_epochs=CONFIG['num_train_epochs'],
    per_device_train_batch_size=CONFIG['per_device_train_batch_size'],
    per_device_eval_batch_size=CONFIG['per_device_eval_batch_size'],
    gradient_accumulation_steps=CONFIG['gradient_accumulation_steps'],
    learning_rate=CONFIG['learning_rate'],
    lr_scheduler_type=CONFIG['lr_scheduler_type'],
    warmup_steps=CONFIG['warmup_steps'],
    logging_steps=CONFIG['logging_steps'],
    eval_steps=CONFIG['eval_steps'],
    save_steps=CONFIG['save_steps'],
    save_total_limit=CONFIG['save_total_limit'],
    fp16=CONFIG['fp16'],
    bf16=CONFIG['bf16'],
    gradient_checkpointing=CONFIG['gradient_checkpointing'],
    dataloader_pin_memory=CONFIG['dataloader_pin_memory'],
    eval_strategy='steps',  # Named evaluation_strategy in transformers releases before 4.41
    save_strategy='steps',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    report_to='wandb' if WANDB_AVAILABLE else 'none',
    run_name='mistral-engsaf-finetune',
    remove_unused_columns=False,
)

print("Training arguments configured")
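
To see what `lr_scheduler_type='cosine'` with `warmup_steps=100` does to the learning rate, the schedule can be sketched in a few lines: a linear ramp from zero to the base rate, followed by a half-cosine decay to zero. This is an approximation of the usual warmup-plus-cosine schedule rather than the library's code; `cosine_lr` is a hypothetical helper and `total_steps=750` is a placeholder, since the Trainer derives the real value from the dataset size.

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# total_steps=750 is a placeholder, not a value taken from a real run.
for step in (0, 50, 100, 425, 750):
    print(step, f"{cosine_lr(step, 2e-4, 100, 750):.2e}")
```

The rate peaks at the end of warmup and is back to half the base rate midway through the remaining steps. If the run has fewer total steps than `warmup_steps`, the model never reaches the configured learning rate, which is worth checking for small datasets.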


In [None]:
# Initialize wandb (optional)
if WANDB_AVAILABLE:
    wandb.init(
        project='mistral-engsaf-grading',
        name='mistral-7b-lora-finetune',
        config=CONFIG
    )


In [None]:
# Custom Trainer with evaluation metrics
from transformers import TrainerCallback

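class MetricsCallback(TrainerCallback):
    # The Trainer passes evaluation results to on_evaluate as `metrics`
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics:
            print(f"\nEvaluation at step {state.global_step}:")
            for key, value in metrics.items():
                # Not every metric is a float (e.g. epoch counters), so format defensively
                if isinstance(value, float):
                    print(f"  {key}: {value:.4f}")
                else:
                    print(f"  {key}: {value}")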

# Create trainer
# Uncomment when datasets are ready:
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=val_dataset,
#     data_collator=data_collator,
#     callbacks=[EarlyStoppingCallback(early_stopping_patience=3), MetricsCallback()],
# )

print("Trainer ready. Uncomment to start training when data is loaded.")


In [None]:
# Start training
# Uncomment to train:
# print("Starting training...")
# trainer.train()
# 
# # Save final model
# final_model_path = os.path.join(OUTPUT_DIR, 'final_model')
# trainer.save_model(final_model_path)
# tokenizer.save_pretrained(final_model_path)
# print(f"Model saved to {final_model_path}")

print("Training code ready. Uncomment to start training.")


## 5. Evaluation Metrics {#evaluation-metrics}


In [None]:
# Evaluation metrics

def quadratic_weighted_kappa(y_true, y_pred):
    """
    Calculate Quadratic Weighted Kappa (QWK) score.
    QWK is the standard metric for automated essay scoring.
    """
    # Cover the full score range so that label distances reflect score values
    # even when some intermediate score never occurs.
    min_score = min(min(y_true), min(y_pred))
    max_score = max(max(y_true), max(y_pred))
    labels = list(range(min_score, max_score + 1))

    # cohen_kappa_score accepts only None, 'linear', or 'quadratic' for
    # `weights`; 'quadratic' applies the (i - j)^2 penalty internally.
    return cohen_kappa_score(y_true, y_pred, labels=labels, weights='quadratic')

def extract_score_from_response(response_text):
    """
    Extract score from model response.
    Looks for patterns like 'Score: 3' or 'score: 3'.
    """
    import re
    
    # Try to find score pattern
    patterns = [
        r'Score:\s*(\d+)',
        r'score:\s*(\d+)',
        r'Score\s*(\d+)',
        r'(\d+)\s*out\s*of',
        r'Grade:\s*(\d+)',
    ]
    
    for pattern in patterns:
        match = re.search(pattern, response_text, re.IGNORECASE)
        if match:
            try:
                score = int(match.group(1))
                # Clamp to reasonable range (0-5)
                score = max(0, min(5, score))
                return score
            except ValueError:
                continue
    
    # Fallback: try to find first number
    numbers = re.findall(r'\d+', response_text)
    if numbers:
        try:
            score = int(numbers[0])
            score = max(0, min(5, score))
            return score
        except ValueError:
            pass
    
    # Default fallback
    return None

def extract_feedback_from_response(response_text):
    """
    Extract feedback text from model response.
    """
    import re
    
    # Try to find feedback section
    patterns = [
        r'Feedback:\s*(.+?)(?:\n|$)',
        r'feedback:\s*(.+?)(?:\n|$)',
        r'Feedback\s*(.+?)(?:\n|$)',
    ]
    
    for pattern in patterns:
        match = re.search(pattern, response_text, re.IGNORECASE | re.DOTALL)
        if match:
            return match.group(1).strip()
    
    # If no pattern found, return everything after first newline
    lines = response_text.split('\n')
    if len(lines) > 1:
        return '\n'.join(lines[1:]).strip()
    
    return response_text.strip()

print("Evaluation utility functions defined.")
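
To make the QWK metric concrete, here is a dependency-free sketch of the computation: build the observed confusion matrix, build the matrix expected under chance agreement from the two score histograms, and compare their quadratically weighted sums. `qwk` is a hypothetical helper for illustration, and the default 0-5 range is an assumption that mirrors the rubric; it is not meant to replace the scikit-learn call.

```python
def qwk(y_true, y_pred, min_score=0, max_score=5):
    """Quadratic weighted kappa computed from the confusion matrix."""
    n_labels = max_score - min_score + 1
    n = len(y_true)
    observed = [[0] * n_labels for _ in range(n_labels)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_labels)) for j in range(n_labels)]

    num = den = 0.0
    for i in range(n_labels):
        for j in range(n_labels):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    # den is zero only when every true and predicted score is the same value
    return 1.0 - num / den if den else 1.0

print(qwk([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]))      # 1.0
print(f"{qwk([0, 2, 4, 5], [1, 2, 4, 5]):.2f}")         # 0.96
print(qwk([0, 0, 5, 5], [5, 5, 0, 0]))                  # -1.0
```

The three cases show the behaviour that makes QWK useful for grading: perfect agreement scores 1, a single off-by-one error costs little, and systematic disagreement at the extremes of the scale is penalised most heavily.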


In [None]:
# Evaluation function for feedback quality

def evaluate_feedback_quality(predicted_feedback, reference_feedback):
    """
    Evaluate feedback quality using BLEU, ROUGE, and BERTScore.
    """
    try:
        from rouge_score import rouge_scorer
        from bert_score import score as bert_score_fn
        import nltk
        
        # Download required NLTK data
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt', quiet=True)
        
        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
        
        # BLEU score
        smooth = SmoothingFunction().method1
        bleu = sentence_bleu(
            [reference_feedback.split()],
            predicted_feedback.split(),
            smoothing_function=smooth
        )
        
        # ROUGE scores
        rouge_scorer_obj = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        rouge_scores = rouge_scorer_obj.score(reference_feedback, predicted_feedback)
        
        # BERTScore
        P, R, F1 = bert_score_fn(
            [predicted_feedback],
            [reference_feedback],
            lang='en',
            verbose=False
        )
        
        return {
            'bleu': bleu,
            'rouge1': rouge_scores['rouge1'].fmeasure,
            'rouge2': rouge_scores['rouge2'].fmeasure,
            'rougeL': rouge_scores['rougeL'].fmeasure,
            'bertscore_f1': F1.item()
        }
    except Exception as e:
        print(f"Error evaluating feedback: {e}")
        return None

print("Feedback evaluation functions defined.")
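
For intuition about what the ROUGE-1 number reflects, the sketch below computes unigram-overlap precision, recall, and F1 on whitespace tokens. It is a simplified stand-in, not the `rouge_score` implementation: that library applies its own tokenization and optional stemming, so its values will differ. `unigram_f1` and the example sentences are hypothetical.

```python
from collections import Counter

def unigram_f1(reference, prediction):
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

ref = "the answer omits the role of carbon dioxide"
pred = "the answer omits carbon dioxide"
p, r, f1 = unigram_f1(ref, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Here every predicted word appears in the reference, so precision is perfect while recall is lower. Overlap metrics like this reward shared wording rather than correct pedagogy, which is one reason to read them alongside BERTScore and manual inspection.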


In [None]:
# Comprehensive evaluation function

def evaluate_model(model, tokenizer, test_dataset, device='cuda', max_samples=None):
    """
    Comprehensive evaluation of the model on test set.
    """
    model.eval()
    
    predictions = []
    true_scores = []
    predicted_scores = []
    predicted_feedbacks = []
    reference_feedbacks = []
    
    # Limit samples if specified
    eval_indices = range(len(test_dataset))
    if max_samples:
        eval_indices = eval_indices[:max_samples]
    
    with torch.no_grad():
        for idx in tqdm(eval_indices, desc="Evaluating"):
            item = test_dataset[idx]
            
            # Get question and answer from dataset
            question = test_dataset.df.iloc[idx]['question']
            student_answer = test_dataset.df.iloc[idx]['student_answer']
            true_score = test_dataset.df.iloc[idx]['score']
            reference_feedback = test_dataset.df.iloc[idx].get('feedback', '')
            
            # Create prompt
            sys_prompt, user_prompt = create_prompt_template(question, student_answer)
            prompt = format_instruction(sys_prompt, user_prompt)
            
            # Tokenize
            inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=CONFIG['max_length'])
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Generate
            outputs = model.generate(
                **inputs,
                max_new_tokens=CONFIG['max_new_tokens'],
                temperature=CONFIG['temperature'],
                top_p=CONFIG['top_p'],
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id
            )
            
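            # Decode only the newly generated tokens. outputs[0] also contains the
            # prompt, whose numbered rubric would otherwise leak into score extraction.
            prompt_length = inputs['input_ids'].shape[1]
            generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)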
            
            # Extract score and feedback
            predicted_score = extract_score_from_response(generated_text)
            predicted_feedback = extract_feedback_from_response(generated_text)
            
            predictions.append({
                'question': question,
                'student_answer': student_answer,
                'true_score': true_score,
                'predicted_score': predicted_score,
                'reference_feedback': reference_feedback,
                'predicted_feedback': predicted_feedback,
                'full_response': generated_text
            })
            
            if predicted_score is not None:
                true_scores.append(true_score)
                predicted_scores.append(predicted_score)
                predicted_feedbacks.append(predicted_feedback)
                reference_feedbacks.append(reference_feedback)
    
    # Calculate metrics
    results = {}
    
    if len(true_scores) > 0:
        # Score metrics
        results['qwk'] = quadratic_weighted_kappa(true_scores, predicted_scores)
        results['cohen_kappa'] = cohen_kappa_score(true_scores, predicted_scores)
        results['accuracy'] = accuracy_score(true_scores, predicted_scores)
        results['confusion_matrix'] = confusion_matrix(true_scores, predicted_scores)
        
        # Feedback metrics (sample-based for efficiency)
        if len(predicted_feedbacks) > 0:
            sample_size = min(50, len(predicted_feedbacks))  # Sample for efficiency
            sample_indices = np.random.choice(len(predicted_feedbacks), sample_size, replace=False)
            
            feedback_metrics = []
            for idx in sample_indices:
                metrics = evaluate_feedback_quality(
                    predicted_feedbacks[idx],
                    reference_feedbacks[idx]
                )
                if metrics:
                    feedback_metrics.append(metrics)
            
            if feedback_metrics:
                results['feedback_metrics'] = {
                    'bleu': np.mean([m['bleu'] for m in feedback_metrics]),
                    'rouge1': np.mean([m['rouge1'] for m in feedback_metrics]),
                    'rouge2': np.mean([m['rouge2'] for m in feedback_metrics]),
                    'rougeL': np.mean([m['rougeL'] for m in feedback_metrics]),
                    'bertscore_f1': np.mean([m['bertscore_f1'] for m in feedback_metrics])
                }
    
    return results, predictions

print("Evaluation function defined.")
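
`evaluate_model` computes its score metrics only over responses from which a score could be parsed, so a model that often produces malformed output can still look accurate. One way to keep that visible is to report parse coverage next to the metrics. The sketch below assumes records shaped like the `predictions` list returned above; `parse_coverage` and the sample records are hypothetical.

```python
def parse_coverage(predictions):
    """Fraction of predictions whose 'predicted_score' was successfully parsed."""
    if not predictions:
        return 0.0
    parsed = sum(1 for p in predictions if p.get('predicted_score') is not None)
    return parsed / len(predictions)

# Hypothetical records in the shape produced by evaluate_model
sample = [
    {'true_score': 2, 'predicted_score': 2},
    {'true_score': 1, 'predicted_score': None},
    {'true_score': 0, 'predicted_score': 1},
    {'true_score': 2, 'predicted_score': 2},
]
print(f"Parsed {parse_coverage(sample):.0%} of responses")
```

A coverage well below 100% suggests the accuracy and QWK figures describe only the easier, well-formed subset of the test set.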


In [None]:
# Run evaluation
# Uncomment when model and test dataset are ready:
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
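# # No model.to(device) call is needed: device_map='auto' already placed the
# # model, and moving a 4-bit quantized model with .to() may raise an error.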
# 
# results, predictions = evaluate_model(
#     model,
#     tokenizer,
#     test_dataset,
#     device=device,
#     max_samples=100  # Limit for faster evaluation
# )
# 
# print("\nEvaluation Results:")
# print(f"Quadratic Weighted Kappa (QWK): {results['qwk']:.4f}")
# print(f"Cohen's Kappa: {results['cohen_kappa']:.4f}")
# print(f"Accuracy: {results['accuracy']:.4f}")
# 
# if 'feedback_metrics' in results:
#     print("\nFeedback Metrics:")
#     for metric, value in results['feedback_metrics'].items():
#         print(f"  {metric}: {value:.4f}")

print("Evaluation code ready.")


## 6. Inference Pipeline {#inference-pipeline}


In [None]:
# Inference function for grading new answers

def grade_answer(
    model,
    tokenizer,
    question,
    student_answer,
    rubric=None,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=256,
    device='cuda'
):
    """
    Grade a student answer and generate feedback.
    
    Args:
        model: Fine-tuned model
        tokenizer: Tokenizer
        question: The question text
        student_answer: Student's answer
        rubric: Optional custom rubric
        temperature: Sampling temperature
        top_p: Nucleus sampling parameter
        max_new_tokens: Maximum tokens to generate
        device: Device to run inference on
    
    Returns:
        dict with 'score' and 'feedback'
    """
    model.eval()
    
    # Create prompt
    sys_prompt, user_prompt = create_prompt_template(question, student_answer, rubric=rubric)
    prompt = format_instruction(sys_prompt, user_prompt)
    
    # Tokenize
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=CONFIG['max_length'])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract score and feedback
    score = extract_score_from_response(generated_text)
    feedback = extract_feedback_from_response(generated_text)
    
    return {
        'score': score,
        'feedback': feedback,
        'full_response': generated_text
    }

print("Inference function defined.")
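
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so `generated_text` in `grade_answer` contains the prompt as well as the model's reply. If the extraction helpers ever pick up a score from the rubric or question text, one option is to decode only the continuation. The sketch below shows the slicing on plain lists; `continuation_ids` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def continuation_ids(output_ids, prompt_length):
    """Return only the token IDs generated after the prompt (hypothetical helper)."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy demonstration with made-up token IDs: the first three belong to the prompt
toy_output = [101, 202, 303, 404, 505]
new_ids = continuation_ids(toy_output, prompt_length=3)
print(new_ids)
```

Inside `grade_answer`, the same idea would look roughly like `tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)`, assuming the extraction functions do not rely on the prompt being present.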


In [None]:
# Example usage
# Uncomment when model is loaded:
# 
# example_question = "Explain the process of photosynthesis."
# example_answer = "Photosynthesis is when plants use sunlight to make food."
# 
# result = grade_answer(
#     model,
#     tokenizer,
#     example_question,
#     example_answer,
#     temperature=CONFIG['temperature'],
#     top_p=CONFIG['top_p'],
#     device='cuda'
# )
# 
# print("Example Grading:")
# print(f"Question: {example_question}")
# print(f"Answer: {example_answer}")
# print(f"\nPredicted Score: {result['score']}")
# print(f"\nGenerated Feedback:\n{result['feedback']}")

print("Example inference code ready.")


In [None]:
# Save model checkpoint for Kaggle output
# This ensures the model persists after session ends

def save_model_checkpoint(model, tokenizer, output_path):
    """
    Save model and tokenizer to output path.
    """
    os.makedirs(output_path, exist_ok=True)
    
    # Save PEFT model
    model.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)
    
    # Save config
    # default=str keeps the dump from failing on non-JSON values (e.g. dtypes, paths)
    with open(os.path.join(output_path, 'training_config.json'), 'w') as f:
        json.dump(CONFIG, f, indent=2, default=str)
    
    print(f"Model saved to {output_path}")

# Uncomment to save:
# checkpoint_path = os.path.join(OUTPUT_DIR, 'final_checkpoint')
# save_model_checkpoint(model, tokenizer, checkpoint_path)

print("Model saving function ready.")


## 7. Visualization & Analysis {#visualization}


In [None]:
# Plot training/validation loss curves
# This requires training logs - uncomment after training

def plot_training_curves(log_history, save_path=None):
    """
    Plot training and validation loss curves.
    """
    # Trainer logs training loss under 'loss' and validation loss under 'eval_loss'
    train_logs = [log for log in log_history if 'loss' in log]
    eval_logs = [log for log in log_history if 'eval_loss' in log]
    
    # Single-panel figure
    plt.figure(figsize=(8, 5))
    
    plt.plot(
        [log['step'] for log in train_logs],
        [log['loss'] for log in train_logs],
        label='Train Loss',
        marker='o'
    )
    if eval_logs:
        plt.plot(
            [log['step'] for log in eval_logs],
            [log['eval_loss'] for log in eval_logs],
            label='Val Loss',
            marker='s'
        )
    plt.xlabel('Step')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
    
    plt.show()

print("Training curve plotting function defined.")
# Uncomment after training:
# plot_training_curves(trainer.state.log_history, save_path=os.path.join(OUTPUT_DIR, 'training_curves.png'))


In [None]:
# Plot confusion matrix

def plot_confusion_matrix(y_true, y_pred, save_path=None):
    """
    Plot confusion matrix for score predictions.
    """
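    # Use the observed score values as labels so the ticks match the real scores
    # (range(len(cm)) is only correct when scores are exactly 0..n-1)
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        xticklabels=labels,
        yticklabels=labels
    )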
    plt.xlabel('Predicted Score')
    plt.ylabel('True Score')
    plt.title('Confusion Matrix - Score Predictions')
    
    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
    
    plt.show()

print("Confusion matrix plotting function defined.")
# Uncomment after evaluation:
# plot_confusion_matrix(true_scores, predicted_scores, save_path=os.path.join(OUTPUT_DIR, 'confusion_matrix.png'))


In [None]:
# Display example predictions

def display_examples(predictions, n_examples=5, show_good=True, show_bad=True):
    """
    Display example predictions, both good and bad cases.
    """
    examples = []
    
    if show_good:
        # Find examples where prediction matches true score
        good_examples = [
            p for p in predictions
            if p['predicted_score'] == p['true_score'] and p['predicted_score'] is not None
        ]
        if good_examples:
            examples.extend(np.random.choice(good_examples, min(n_examples, len(good_examples)), replace=False))
    
    if show_bad:
        # Find examples with large prediction errors
        bad_examples = [
            p for p in predictions
            if p['predicted_score'] is not None
            and abs(p['predicted_score'] - p['true_score']) >= 2
        ]
        if bad_examples:
            examples.extend(np.random.choice(bad_examples, min(n_examples, len(bad_examples)), replace=False))
    
    for i, ex in enumerate(examples[:n_examples * 2], 1):
        print(f"\n{'='*80}")
        print(f"Example {i}")
        print(f"{'='*80}")
        print(f"Question: {ex['question']}")
        print(f"\nStudent Answer: {ex['student_answer']}")
        print(f"\nTrue Score: {ex['true_score']}")
        print(f"Predicted Score: {ex['predicted_score']}")
        print(f"\nReference Feedback: {ex['reference_feedback'][:200]}...")
        print(f"\nGenerated Feedback: {ex['predicted_feedback'][:200]}...")

print("Example display function defined.")
# Uncomment after evaluation:
# display_examples(predictions, n_examples=3)
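
Alongside individual examples, a quick tally of how far predictions fall from the true scores can show whether large errors are rare or systematic. The sketch below assumes each entry in `predictions` has the `'predicted_score'` and `'true_score'` keys used by `display_examples`; the helper name `summarize_score_errors` is hypothetical.

```python
from collections import Counter


def summarize_score_errors(predictions):
    """Tally predictions by absolute score error (hypothetical helper).

    Returns (error_counts, n_unparsed), where error_counts maps
    |predicted - true| to a count and n_unparsed counts entries
    whose predicted score is None.
    """
    error_counts = Counter()
    n_unparsed = 0
    for p in predictions:
        if p.get('predicted_score') is None:
            n_unparsed += 1
            continue
        error_counts[abs(p['predicted_score'] - p['true_score'])] += 1
    return dict(sorted(error_counts.items())), n_unparsed


# Toy data for illustration only
toy_predictions = [
    {'predicted_score': 2, 'true_score': 2},
    {'predicted_score': 0, 'true_score': 2},
    {'predicted_score': None, 'true_score': 1},
    {'predicted_score': 1, 'true_score': 2},
]
error_counts, n_unparsed = summarize_score_errors(toy_predictions)
print(error_counts, n_unparsed)
```

An error of 0 corresponds to the "good" cases shown by `display_examples`, and errors of 2 or more correspond to its "bad" cases.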


## 8. Optional Enhancements {#optional-enhancements}


### 8.1 RAG Integration Placeholder

For future enhancement: Integrate Retrieval-Augmented Generation (RAG) to retrieve relevant course materials when grading answers.


In [None]:
# RAG Integration Placeholder
# This would retrieve relevant course materials to enhance grading context

def retrieve_course_materials(question, top_k=3):
    """
    Placeholder for RAG system to retrieve relevant course materials.
    
    Future implementation:
    - Use embeddings to find relevant course content
    - Retrieve top-k most relevant passages
    - Include in prompt context
    """
    # Placeholder
    return []

print("RAG placeholder defined.")
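
As a rough illustration of the retrieval step outlined in the placeholder, the sketch below ranks candidate passages by bag-of-words cosine similarity to the question. This is a deliberately simple stand-in: a real RAG system would more likely use dense embeddings and a vector index. The function `retrieve_passages` and the `passages` list are assumptions made for this example.

```python
import math
import re
from collections import Counter


def _bag_of_words(text):
    """Lowercased token counts; tokens are runs of letters and digits."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_passages(question, passages, top_k=3):
    """Return up to top_k passages with non-zero similarity, best first (hypothetical helper)."""
    query = _bag_of_words(question)
    scored = [(_cosine(query, _bag_of_words(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


# Toy course materials for illustration only
passages = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Newton's second law relates force and acceleration.",
    "Cellular respiration releases energy from glucose.",
]
top = retrieve_passages("Explain the process of photosynthesis.", passages, top_k=2)
print(top)
```

The retrieved passages could then be appended to the prompt context before grading. Because this sketch does no stop-word removal or stemming, common words can dominate the ranking on real data.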


### 8.2 Chain-of-Thought Verification

For future enhancement: Add Chain-of-Thought reasoning to make grading decisions more transparent.


In [None]:
# Chain-of-Thought Verification Placeholder

def grade_with_cot(model, tokenizer, question, answer, device='cuda'):
    """
    Placeholder for Chain-of-Thought grading.
    
    Future implementation:
    - Generate reasoning steps before final score
    - Verify consistency of reasoning
    - Use reasoning to improve score prediction
    """
    # Placeholder - would use multi-step prompting
    return grade_answer(model, tokenizer, question, answer, device=device)

print("Chain-of-Thought placeholder defined.")
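
One way the multi-step prompting mentioned in the placeholder might work is to ask the model for its reasoning first and a final `Score:` line last, then parse the two parts separately. The sketch below covers only the prompt suffix and the parsing. The names `COT_SUFFIX`, `build_cot_prompt`, and `parse_cot_response`, along with the `Reasoning:` / `Score:` output format, are assumptions and would need to match whatever format the fine-tuned model is actually trained to produce.

```python
import re

# Hypothetical instruction appended to the user prompt to elicit step-by-step reasoning
COT_SUFFIX = (
    "\n\nThink step by step. First write 'Reasoning:' followed by your analysis, "
    "then finish with a line of the form 'Score: <integer>'."
)


def build_cot_prompt(user_prompt):
    """Append the chain-of-thought instruction to an existing user prompt."""
    return user_prompt + COT_SUFFIX


def parse_cot_response(text):
    """Split a response into (reasoning, score); score is None if no 'Score:' line exists."""
    matches = list(re.finditer(r"^\s*Score:\s*(-?\d+)\s*$", text, flags=re.MULTILINE))
    if not matches:
        return text.strip(), None
    last = matches[-1]  # use the final score line if the model wrote several
    reasoning = text[:last.start()].strip()
    if reasoning.lower().startswith("reasoning:"):
        reasoning = reasoning[len("reasoning:"):].strip()
    return reasoning, int(last.group(1))


# Toy response for illustration only
reasoning, score = parse_cot_response(
    "Reasoning: The answer mentions sunlight but omits carbon dioxide.\nScore: 2"
)
print(reasoning)
print(score)
```

Keeping the reasoning separate from the score makes it possible to check later whether the stated reasoning is consistent with the score that was assigned.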


### 8.3 Ensemble with RoBERTa Baseline

For future enhancement: Create an ensemble model combining Mistral-7B with a RoBERTa-based scoring model.


In [None]:
# Ensemble Placeholder

def ensemble_grade(mistral_result, roberta_result, weights=(0.7, 0.3)):
    """
    Placeholder for ensemble grading.
    
    Future implementation:
    - Load fine-tuned RoBERTa model for scoring
    - Combine predictions with weighted average
    - Use ensemble for final score and feedback
    """
    # Placeholder
    return mistral_result

print("Ensemble placeholder defined.")
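
The weighted-average step listed in the placeholder could be sketched as follows. This is only an illustration under assumptions: `combine_scores` is a hypothetical helper, scores are taken to be non-negative integers, and a model that failed to produce a score (`None`) is simply skipped with the remaining weights renormalized.

```python
import math


def combine_scores(scores, weights=(0.7, 0.3)):
    """Weighted average of the available scores, rounded half up (hypothetical helper).

    Scores of None are skipped and the remaining weights are renormalized.
    Returns None if no score is available.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")

    available = [(s, w) for s, w in zip(scores, weights) if s is not None]
    total_weight = sum(w for _, w in available)
    if not available or total_weight <= 0:
        return None

    average = sum(s * w for s, w in available) / total_weight
    # floor(x + 0.5) rounds half up; Python's round() would round half to even
    return math.floor(average + 0.5)


# Toy scores for illustration only: first entry from Mistral, second from RoBERTa
print(combine_scores([3, 1]))
print(combine_scores([None, 1]))
```

In `ensemble_grade`, the combined score could replace `mistral_result['score']` while the generated feedback is kept from the Mistral output, since a RoBERTa scoring model would not produce feedback text.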


---

## Summary

This notebook provides a complete pipeline for fine-tuning Mistral-7B on the EngSAF dataset for automatic short answer grading.

### Key Features:
- 4-bit quantization for memory efficiency
- LoRA/PEFT for parameter-efficient fine-tuning
- Unseen-question split methodology
- Comprehensive evaluation metrics (QWK, BLEU, ROUGE, BERTScore)
- Full inference pipeline
- Visualization and analysis tools

### Next Steps:
1. Load your EngSAF dataset
2. Uncomment training cells
3. Run training and evaluation
4. Analyze results and iterate

### Notes:
- Remember to save checkpoints regularly (every 500 steps)
- Monitor GPU memory usage
- Use Kaggle's "Save Version" feature to persist work
- Adjust hyperparameters based on your dataset size and characteristics
