# Grammar Scoring Engine - V5 Knowledge Distillation Inference

**Competition:** SHL Intern Hiring Assessment 2025

**Author:** Uday Bhatia

**Model Version:** V5 (Knowledge Distillation from V2)

**Date:** November 9, 2025

---

## Executive Summary

V5 uses **knowledge distillation** to combine the strengths of V2 and V4:
- **Teacher:** V2 models (0.533 test RMSE, correct prediction distribution)
- **Student:** V4 comparative architecture (high capacity, but it overfit in V4)
- **Training:** The student learns from V2's predictions, the true labels, and comparative pairs
- **Result:** A student that pairs V4's capacity with V2's generalization

### Training Performance (V5)

**OOF RMSE:** 0.2603

**Pearson Correlation:** 0.9412
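
For reference, both metrics can be computed from out-of-fold (OOF) predictions as in the sketch below. The toy labels and predictions are made up for illustration and are not competition data; the helper names `rmse` and `pearson` are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy labels and OOF predictions (illustrative only)
labels = [1.0, 2.0, 3.0]
oof_preds = [1.0, 2.0, 4.0]
print(f"RMSE: {rmse(labels, oof_preds):.4f}, Pearson: {pearson(labels, oof_preds):.4f}")
```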

### Comparison with Previous Versions

| Version | Strategy | OOF RMSE | RMSE reduction vs. V2 |
|---------|----------|----------|-----------------------|
| V2 | Enhanced LoRA | 0.5380 | Baseline |
| V4 | Comparative | 0.5106 | 5.1% |
| **V5** | **Distillation** | **0.2603** | **51.6%** |
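
The percentages in the last column of the table are relative reductions in OOF RMSE against the V2 baseline. A short check of that arithmetic, using only the values from the table (the helper name `rmse_reduction_pct` is hypothetical):

```python
def rmse_reduction_pct(baseline, candidate):
    """Relative RMSE reduction of `candidate` versus `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100.0

# OOF RMSE values taken from the comparison table
oof_rmse = {"V2": 0.5380, "V4": 0.5106, "V5": 0.2603}

for version in ("V4", "V5"):
    pct = rmse_reduction_pct(oof_rmse["V2"], oof_rmse[version])
    print(f"{version}: {pct:.1f}% lower OOF RMSE than V2")
```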

### Why V5 Works

1. **V2's Wisdom:** Learns from V2's predictions (soft targets with correct distribution)
2. **V4's Capacity:** Powerful comparative learning architecture
3. **Best of Both:** Combines V2's generalization with V4's capacity
4. **Temperature Scaling:** Softer targets (T=3.0) for better knowledge transfer

---

## Environment Setup

Install required packages and load libraries.

In [None]:
# Install Java 17 (required for LanguageTool)
!apt-get update -qq
!apt-get install -y openjdk-17-jdk-headless > /dev/null 2>&1
!update-alternatives --set java /usr/lib/jvm/java-17-openjdk-amd64/bin/java

# Verify Java version
!java -version

In [None]:
# Install dependencies
!pip install -q --upgrade pip
!pip install -q numpy==1.23.5 scipy==1.10.1
!pip install -q transformers==4.44.0 peft==0.12.0 accelerate sentencepiece protobuf
!pip install -q faster-whisper language-tool-python textstat
!pip install -q spacy==3.7.5
!python -m spacy download en_core_web_sm
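
Because several packages are pinned, it can help to confirm that the running kernel actually sees those versions (a kernel restart is sometimes needed after `pip install`). The sketch below uses only the standard library; `PINNED` and `check_pins` are hypothetical names, and the pins are copied from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above
PINNED = {
    "numpy": "1.23.5",
    "scipy": "1.10.1",
    "transformers": "4.44.0",
    "peft": "0.12.0",
    "spacy": "3.7.5",
}

def check_pins(pins):
    """Return {package: (expected, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```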

In [None]:
# Import libraries
import os
import gc
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast

from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model
from faster_whisper import WhisperModel
import spacy
import language_tool_python
import textstat

warnings.filterwarnings('ignore')

# Configuration
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Set paths
if os.path.exists('/kaggle/input'):
    BASE_DIR = Path('/kaggle/input/shl-intern-hiring-assessment-2025/dataset')
    MODEL_DIR = Path('/kaggle/input/grammar-scoring-models-v5-distillation')
    CACHE_DIR = Path('/kaggle/working/cache')
    print("Running on Kaggle")
else:
    BASE_DIR = Path('/home/azureuser/shl2/dataset')
    MODEL_DIR = Path('/home/azureuser/shl2/v5_distillation/models')
    CACHE_DIR = Path('/home/azureuser/shl2/cache')
    print("Running locally")

DATA_DIR = BASE_DIR / 'csvs'
AUDIO_DIR = BASE_DIR / 'audios'
CACHE_DIR.mkdir(exist_ok=True, parents=True)

print(f"DATA_DIR: {DATA_DIR}")
print(f"AUDIO_DIR: {AUDIO_DIR}")
print(f"MODEL_DIR: {MODEL_DIR}")
print(f"CACHE_DIR: {CACHE_DIR}")

# Load test data only
test_df = pd.read_csv(DATA_DIR / 'test.csv')

print(f"\n✓ Test samples: {len(test_df)}")
print(f"\nTest columns: {test_df.columns.tolist()}")
print(f"\nSample data:")
print(test_df.head())

---

## Model Architecture (V5 Distillation)

### Knowledge Distillation Strategy

**V5 = V4 Architecture + V2 Knowledge**

**Training Process:**
1. Load V2 teacher models (5 folds)
2. Generate V2 predictions on training data (soft targets)
3. Train V4 student with 3 objectives:
   - **Distillation Loss (50%):** Match V2's predictions
   - **Hard Label Loss (30%):** Learn from true labels
   - **Comparative Loss (20%):** Learn pairwise relationships

**Key Parameters:**
- Temperature: 3.0 (softer targets for better knowledge transfer)
- 12 epochs with alternating training (single samples ↔ pairs)
- Early stopping when validation stops improving
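
The three objectives and their weights can be sketched as a weighted sum. This is a minimal sketch under assumptions the notebook does not confirm: each term is taken to be a mean squared error, the comparative term is taken to regress the score difference within a pair, and the T=3.0 temperature is left out because the source does not say how it is applied to scalar scores. The name `v5_loss` and its arguments are hypothetical, and since training alternates between single samples and pairs, the real training loop may not sum all three terms in one step.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def v5_loss(student, teacher, labels, pair_diff_pred, pair_diff_true,
            w_distill=0.5, w_hard=0.3, w_comp=0.2):
    """Weighted sum of the three objectives (MSE for every term is an assumption)."""
    distill = mse(student, teacher)             # match V2's predictions
    hard = mse(student, labels)                 # learn from true labels
    comp = mse(pair_diff_pred, pair_diff_true)  # pairwise score differences
    return w_distill * distill + w_hard * hard + w_comp * comp

# Toy values for illustration only
loss = v5_loss(
    student=[3.0, 4.0],
    teacher=[3.5, 4.0],
    labels=[3.0, 5.0],
    pair_diff_pred=[1.0],
    pair_diff_true=[2.0],
)
print(f"combined loss: {loss:.4f}")
```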

### Architecture Components

Same as V4:
1. **Encoder:** DeBERTa-v3-large with LoRA (r=16, α=32) on the query, key, and value projections of the top 6 layers
2. **Pooling:** Mean pooling
3. **Absolute Head:** 3-layer MLP for single text scoring
4. **Comparative Head:** For pairwise learning (training only)

**Inference:** Uses only absolute head (single text → score)
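
To give a sense of how small the trainable encoder footprint is, the LoRA parameter count can be worked out by hand. The sketch assumes the standard LoRA factorization (two matrices of rank r per adapted linear layer) and DeBERTa-v3-large's hidden size of 1024; it counts encoder-side LoRA weights only, not the MLP heads. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(hidden_size, rank, n_layers, projections_per_layer):
    """Count LoRA weights for square hidden_size x hidden_size projections."""
    # Each adapted Linear gains A (rank x hidden_size) and B (hidden_size x rank)
    per_projection = rank * (hidden_size + hidden_size)
    return per_projection * projections_per_layer * n_layers

# r=16 on query, key, and value projections of the top 6 layers
trainable = lora_param_count(hidden_size=1024, rank=16, n_layers=6, projections_per_layer=3)
print(f"LoRA parameters in the encoder: {trainable:,}")  # 589,824
```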

---

In [None]:
# Define V5 model architecture

class MeanPool(nn.Module):
    """Mean pooling over sequence dimension."""
    def forward(self, last_hidden_state, attention_mask):
        mask_expanded = attention_mask.unsqueeze(-1).float()
        return (last_hidden_state * mask_expanded).sum(1) / mask_expanded.sum(1).clamp(min=1.0)

class V5StudentModel(nn.Module):
    """V5: Distilled from V2 teacher with V4 architecture."""
    def __init__(self, model_name='microsoft/deberta-v3-large', dropout=0.3):
        super().__init__()
        self.tok = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        
        # Attach LoRA to top 6 layers
        self._attach_lora_top_layers(last_n_layers=6)
        
        # Freeze non-LoRA parameters
        for n, p in self.encoder.named_parameters():
            if "lora_" not in n:
                p.requires_grad = False
        
        hidden_size = self.encoder.config.hidden_size
        self.pool = MeanPool()
        
        # Absolute score head (used for inference)
        self.absolute_head = nn.Sequential(
            nn.Linear(hidden_size, 2048),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(2048, 512),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(512, 1)
        )
        
        # Comparative head (used during training only)
        self.comparative_head = nn.Sequential(
            nn.Linear(hidden_size * 3, 1024),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(1024, 256),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(256, 1)
        )
    
    def _attach_lora_top_layers(self, last_n_layers=6):
        """Apply LoRA to top 6 layers with r=16, alpha=32."""
        n_layers = len(self.encoder.encoder.layer)
        keep_layers = set(range(n_layers - last_n_layers, n_layers))
        
        target_modules = []
        for i in keep_layers:
            target_modules.extend([
                f"encoder.layer.{i}.attention.self.query_proj",
                f"encoder.layer.{i}.attention.self.key_proj",
                f"encoder.layer.{i}.attention.self.value_proj"
            ])
        
        cfg = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.1,
            bias="none",
            target_modules=target_modules,
            modules_to_save=[]
        )
        
        self.encoder = get_peft_model(self.encoder, cfg)
    
    def encode(self, input_ids, attention_mask):
        """Encode a single text into embedding"""
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.pool(outputs.last_hidden_state, attention_mask)
        return pooled
    
    def forward(self, batch):
        """Forward pass for single text (inference mode)."""
        emb = self.encode(batch['input_ids'], batch['attention_mask'])
        return self.absolute_head(emb).squeeze(-1)

print("✅ V5 Distillation model architecture defined")
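
As a dependency-free illustration of what `MeanPool` computes, the sketch below averages token vectors for a single sequence while ignoring padded positions. The function name `masked_mean` and the toy vectors are hypothetical; the real module does the same thing on batched tensors.

```python
def masked_mean(hidden_states, attention_mask):
    """Average token vectors where mask == 1 (one sequence, plain lists)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    for vec, m in zip(hidden_states, attention_mask):
        for j in range(dim):
            total[j] += vec[j] * m
    # Same guard as clamp(min=1.0): avoids division by zero for an all-padding row
    count = max(float(sum(attention_mask)), 1.0)
    return [t / count for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last position is padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

The padded vector does not influence the result, which is why mean pooling stays stable when every transcript is padded to the same length.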

In [None]:
# Load all 5 V5 distilled models
print("Loading V5 distilled models...")
text_models = []
for fold in range(5):
    model = V5StudentModel()
    # Load weights on CPU first; model.to(DEVICE) below moves them to the GPU once
    state_dict = torch.load(MODEL_DIR / f'model_fold{fold}_best.pth', map_location='cpu')
    model.load_state_dict(state_dict)
    model.to(DEVICE)
    model.eval()
    text_models.append(model)
    print(f"  ✓ Fold {fold} loaded")

tokenizer = text_models[0].tok
print("✅ All V5 distilled models loaded")

---

## Audio Transcription

Convert test audio files to text using faster-whisper.

---

In [None]:
def transcribe_audio_files(df, audio_dir, cache_path=None):
    """Transcribe audio files using faster-whisper."""
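    if cache_path and os.path.exists(cache_path):
        print(f"Loading cached transcripts from {cache_path}")
        cached = pd.read_csv(cache_path)
        # Empty transcripts round-trip through CSV as NaN; restore them as empty strings
        cached['transcript'] = cached['transcript'].fillna('')
        return cached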
    
    print("Initializing Whisper model (large-v3)...")
    whisper = WhisperModel(
        "large-v3",
        device="cuda" if torch.cuda.is_available() else "cpu",
        compute_type="float16" if torch.cuda.is_available() else "float32"
    )
    
    transcripts = []
    print(f"Transcribing {len(df)} audio files...")
    
    # Count rows explicitly so progress is correct even if the index is not 0..n-1
    for i, (_, row) in enumerate(df.iterrows(), start=1):
        audio_path = Path(audio_dir) / f"{row['filename']}.wav"
        segments, info = whisper.transcribe(str(audio_path), beam_size=5, language="en")
        text = " ".join([seg.text for seg in segments])
        transcripts.append(text.strip())
        
        if i % 50 == 0:
            print(f"  Processed {i}/{len(df)} files")
    
    df['transcript'] = transcripts
    
    if cache_path:
        df.to_csv(cache_path, index=False)
        print(f"✓ Cached transcripts to {cache_path}")
    
    del whisper
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    return df

print("✅ Transcription function defined")

In [None]:
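# Free GPU memory before transcription
print("Freeing GPU memory...")
# Drop every reference to the loaded weights, including the loop variables
# left over from the loading cell, so the tensors can actually be released
del text_models, model, state_dict
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("✓ GPU memory cleared\n")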

# Transcribe test data
print("=" * 60)
print("TRANSCRIBING TEST DATA")
print("=" * 60)
test_df = transcribe_audio_files(
    test_df,
    AUDIO_DIR / 'test',
    CACHE_DIR / 'test_transcripts.csv'
)

print(f"\n✅ Transcription complete")
print(f"\nSample transcripts:")
for i in range(min(3, len(test_df))):
    print(f"\n[{i+1}] {test_df.iloc[i]['transcript'][:200]}...")

In [None]:
# Reload V5 distilled models for inference
print("Reloading V5 distilled models for inference...")
text_models = []
for fold in range(5):
    model = V5StudentModel()
    # Load weights on CPU first; model.to(DEVICE) below moves them to the GPU once
    state_dict = torch.load(MODEL_DIR / f'model_fold{fold}_best.pth', map_location='cpu')
    model.load_state_dict(state_dict)
    model.to(DEVICE)
    model.eval()
    text_models.append(model)
    print(f"  ✓ Fold {fold} loaded")

tokenizer = text_models[0].tok
print("✅ V5 distilled models reloaded")

---

## Inference Pipeline

Run predictions using V5 distilled models and ensemble them.
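
The ensemble step is a plain average of the five fold predictions for each sample. A minimal sketch with made-up numbers (the name `ensemble_mean` is hypothetical; the pipeline itself does this with a NumPy array and `mean(axis=1)`):

```python
def ensemble_mean(fold_preds):
    """fold_preds[k][i] is fold k's prediction for sample i; returns per-sample means."""
    n_folds = len(fold_preds)
    return [sum(sample) / n_folds for sample in zip(*fold_preds)]

# Five folds, two samples (illustrative values only)
fold_preds = [
    [3.0, 4.5],
    [3.5, 4.0],
    [2.5, 5.0],
    [3.0, 4.5],
    [3.0, 4.5],
]
print(ensemble_mean(fold_preds))  # [3.0, 4.5]
```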

---

In [None]:
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.float)
        }

print("✅ Dataset class defined")

In [None]:
def predict_v5(df, models, tokenizer):
    """Generate predictions from V5 distilled models (all 5 folds)."""
    all_preds = np.zeros((len(df), 5))
    
    for fold, model in enumerate(models):
        print(f"  Fold {fold}: Predicting...")
        
        dataset = TextDataset(df['transcript'].values, np.zeros(len(df)), tokenizer)
        loader = DataLoader(dataset, batch_size=16, shuffle=False)
        
        fold_preds = []
        with torch.no_grad():
            for batch in loader:
                input_ids = batch['input_ids'].to(DEVICE)
                attn_mask = batch['attention_mask'].to(DEVICE)
                
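                # torch.autocast replaces the deprecated torch.cuda.amp.autocast; still GPU-only here
                with torch.autocast(device_type=DEVICE.type, dtype=torch.bfloat16, enabled=DEVICE.type == 'cuda'):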
                    preds = model({'input_ids': input_ids, 'attention_mask': attn_mask})
                
                fold_preds.append(preds.float().cpu())
        
        fold_preds = torch.cat(fold_preds).numpy()
        all_preds[:, fold] = fold_preds
        
        print(f"    ✓ Complete (mean: {fold_preds.mean():.3f}, std: {fold_preds.std():.3f})")
    
    return all_preds.mean(axis=1)

print("✅ Prediction function defined")

In [None]:
# Predict on TEST data
print("=" * 60)
print("PREDICTING ON TEST DATA (V5 Distilled Models)")
print("=" * 60)

print("\nV5 Distillation (V2 Teacher + V4 Architecture):")
test_final_preds = predict_v5(test_df, text_models, tokenizer)

print(f"\n  ✓ Final predictions (mean: {test_final_preds.mean():.3f}, std: {test_final_preds.std():.3f})")
print("\n✅ Test predictions complete")

---

## Submission Generation

Create final submission file for Kaggle.
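
Before writing the file, it may be worth checking the predictions explicitly, since clipping alone passes NaN values through unchanged. The sketch below is one possible sanity check, not part of the original pipeline; `validate_scores` is a hypothetical helper, and the [0, 5] range is the one used for clipping in this notebook.

```python
import math

def validate_scores(scores, expected_count, low=0.0, high=5.0):
    """Clip scores to [low, high]; raise on a row-count mismatch or non-finite values."""
    if len(scores) != expected_count:
        raise ValueError(f"expected {expected_count} predictions, got {len(scores)}")
    if any(not math.isfinite(s) for s in scores):
        raise ValueError("non-finite prediction found")
    return [min(max(s, low), high) for s in scores]

# Toy predictions for illustration only
print(validate_scores([-0.2, 2.5, 5.3], expected_count=3))  # [0.0, 2.5, 5.0]
```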

---

In [None]:
# Create submission DataFrame
submission = pd.DataFrame({
    'filename': test_df['filename'],
    'label': test_final_preds
})

# Clip predictions to valid range [0, 5]
submission['label'] = submission['label'].clip(0, 5)

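# Save submission (Kaggle working directory when available, otherwise the current directory)
out_dir = Path('/kaggle/working') if os.path.exists('/kaggle/working') else Path('.')
submission_path = out_dir / 'submission.csv'
submission.to_csv(submission_path, index=False)

print("=" * 60)
print("V5 KNOWLEDGE DISTILLATION SUBMISSION CREATED")
print("=" * 60)
print(f"File: {submission_path}")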
print(f"Samples: {len(submission)}")
print(f"\nPrediction statistics:")
print(submission['label'].describe())
print(f"\nFirst 10 predictions:")
print(submission.head(10))
print("=" * 60)
print("\n✅ Ready to submit!")
print("\nV5 Training Performance:")
print("  OOF RMSE: 0.2603 (51.6% lower than V2)")
print("  Pearson: 0.9412")
print("\n  V2 (teacher) OOF RMSE: 0.5380")
print("  V4 (comparative) OOF RMSE: 0.5106")
print("  V5 (distilled) OOF RMSE: 0.2603 ⭐")
print("\nKnowledge distillation combined V2's generalization with V4's capacity.")

In [None]:
# Visualize test predictions
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.hist(test_final_preds, bins=30, alpha=0.7, color='gold', edgecolor='black')
plt.xlabel('Predicted Grammar Score')
plt.ylabel('Frequency')
plt.title('V5 Knowledge Distillation - Test Set Prediction Distribution')
plt.axvline(test_final_preds.mean(), color='red', linestyle='--', label=f'Mean: {test_final_preds.mean():.3f}')
plt.axvline(np.median(test_final_preds), color='blue', linestyle='--', label=f'Median: {np.median(test_final_preds):.3f}')
plt.legend()
plt.grid(alpha=0.3)
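# Save to the Kaggle working directory when available, otherwise the current directory
plot_dir = Path('/kaggle/working') if os.path.exists('/kaggle/working') else Path('.')
plt.savefig(plot_dir / 'test_predictions_v5.png', dpi=100, bbox_inches='tight')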
plt.show()

print("✅ V5 test prediction visualization saved")