🧠 Grammar Scoring Engine using Whisper + BERT

This project implements a Grammar Scoring Engine for spoken English audio using OpenAI Whisper for transcription and a fine-tuned BERT regressor for grammar quality prediction. The system evaluates spoken sentences and outputs a Mean Opinion Score (MOS) between 0 and 5, based on grammar correctness.

🚀 Model Pipeline Overview

ASR (Automatic Speech Recognition)

Uses OpenAI Whisper (various model sizes supported) to convert audio to text.

Text Embedding + Regression

Uses a BERT-based regression model fine-tuned on transcribed text to predict the grammar score.

Evaluation

Model is trained using K-Fold Cross Validation.

Evaluation metric: Pearson Correlation Coefficient.
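
As a minimal sketch of what this metric measures, the snippet below computes the Pearson coefficient by hand with the standard library. The notebook itself relies on `scipy.stats.pearsonr`. The function name `pearson` and the sample scores are illustrative assumptions, not values from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (the shared 1/n factor cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores, for illustration only.
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [1.5, 2.5, 3.5, 4.5]  # same spacing, shifted by a constant 0.5
r = pearson(predicted, actual)
print(round(r, 4))
```

Because the coefficient only captures linear agreement, a constant offset between predictions and labels does not lower it. This is one reason to track the MSE loss alongside it.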

🧪 Results

Model: Whisper (small) + BERT Regressor

Training Epochs: 10

K-Fold Splits: 5

Best Pearson Score: 0.778

🧰 Requirements

Python 3.10+

PyTorch

Transformers (HuggingFace)

OpenAI Whisper

Librosa

Scikit-learn

Pandas, Numpy, tqdm

🔁 Training Details

5-Fold Cross Validation using KFold

Fine-tuning bert-base-uncased on text transcripts

Loss Function: MSELoss

Optimizer: AdamW

Outputs clipped between [0, 5]
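
The loss and clipping steps can be illustrated with plain Python. This is a sketch only: the notebook uses `nn.MSELoss` and `np.clip`, and the helper names `clip_scores` and `mse` as well as the numbers are hypothetical.

```python
def clip_scores(scores, low=0.0, high=5.0):
    """Force every score into the closed interval [low, high]."""
    return [min(max(s, low), high) for s in scores]

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical raw regressor outputs and labels, for illustration only.
raw = [-0.4, 2.5, 5.6]
targets = [0.0, 3.0, 5.0]

clipped = clip_scores(raw)
print(clipped)
print(mse(raw, targets), mse(clipped, targets))
```

Clipping cannot increase the squared error when the labels themselves lie in [0, 5], since it only moves out-of-range predictions closer to the valid range.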

🙌 Acknowledgements

OpenAI Whisper

HuggingFace Transformers

Organizers of the Grammar Scoring Competition



In [None]:
!pip install -q transformers datasets torchaudio librosa sentencepiece accelerate
!pip install -q -U openai-whisper

**1. Importing the Required Modules**

In [None]:
import os
import pandas as pd
import numpy as np
import torch
import librosa
import whisper
from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
from transformers import BertTokenizer, BertModel, BertConfig
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
import matplotlib.pyplot as plt

**2. Loading the Dataset**

In [None]:
train_df = pd.read_csv('/kaggle/input/shl-dataset/dataset/train.csv')
test_df = pd.read_csv('/kaggle/input/shl-dataset/dataset/test.csv')
submission_df = pd.read_csv('/kaggle/input/shl-dataset/dataset/sample_submission.csv')


In [None]:
print(train_df.columns)


**3. Transcribing Audio with Whisper**

In [None]:
whisper_model = whisper.load_model("large-v2")

def transcribe(audio_path):
    result = whisper_model.transcribe(audio_path, fp16=False)
    return result['text']

train_transcripts = []
for fname in tqdm(train_df['filename']):
    path = f"/kaggle/input/shl-dataset/dataset/audios_train/{fname}"
    try:
        text = transcribe(path)
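    except Exception as exc:
        # Fall back to an empty transcript, but report the failure instead of hiding it
        print(f"Transcription failed for {fname}: {exc}")
        text = ""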
    train_transcripts.append(text)
train_df['transcript'] = train_transcripts

test_transcripts = []
for fname in tqdm(test_df['filename']):
    path = f"/kaggle/input/shl-dataset/dataset/audios_test/{fname}"
    try:
        text = transcribe(path)
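    except Exception as exc:
        # Fall back to an empty transcript, but report the failure instead of hiding it
        print(f"Transcription failed for {fname}: {exc}")
        text = ""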
    test_transcripts.append(text)
test_df['transcript'] = test_transcripts
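
The two loops above share the same shape, so one option is a single helper that also records which files failed. The sketch below is an assumption about how that could look, not the notebook's code: `transcribe_all` and `fake_transcribe` are hypothetical names, and the stub stands in for the real `transcribe` function so the example runs without audio files.

```python
import os

def transcribe_all(filenames, audio_dir, transcribe_fn):
    """Transcribe each file and return (transcripts, failures).

    Failed files get an empty transcript, and the error is kept for inspection.
    """
    transcripts, failures = [], []
    for fname in filenames:
        path = os.path.join(audio_dir, fname)
        try:
            text = transcribe_fn(path).strip()
        except Exception as exc:
            failures.append((fname, repr(exc)))
            text = ""
        transcripts.append(text)
    return transcripts, failures

def fake_transcribe(path):
    """Stub transcriber used only for this illustration."""
    if path.endswith("broken.wav"):
        raise RuntimeError("could not decode audio")
    return " hello world "

texts, failed = transcribe_all(["a.wav", "broken.wav"], "audios", fake_transcribe)
print(texts)
print(failed)
```

In the notebook, the same helper could be called once for the training file names and once for the test file names, passing the existing `transcribe` function as `transcribe_fn`.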

**4. Tokenization with BERT**

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class GrammarDataset(Dataset):
    def __init__(self, texts, targets=None):
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
        self.targets = targets

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        if self.targets is not None:
            item['labels'] = torch.tensor(self.targets[idx], dtype=torch.float)
        return item

**5. BERT Regression Model**

In [None]:
class BERTRegressor(nn.Module):
    def __init__(self):
        super(BERTRegressor, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.pooler_output
        return self.regressor(cls_output).squeeze(1)

**6. K-Fold Training**

In [None]:
kf = KFold(n_splits=10, shuffle=True, random_state=42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EPOCHS = 20
fold = 0
all_val_preds = []
all_val_labels = []
train_losses, val_pearsons = [], []

for train_index, val_index in kf.split(train_df):
    fold += 1
    print(f"\n----- Fold {fold} -----")
    train_texts = train_df.iloc[train_index]['transcript'].tolist()
    val_texts = train_df.iloc[val_index]['transcript'].tolist()
    train_targets = train_df.iloc[train_index]['label'].tolist()
    val_targets = train_df.iloc[val_index]['label'].tolist()

    train_dataset = GrammarDataset(train_texts, train_targets)
    val_dataset = GrammarDataset(val_texts, val_targets)

    train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=8)

    model = BERTRegressor().to(device)
    optimizer = AdamW(model.parameters(), lr=2e-5)
    criterion = nn.MSELoss()

    for epoch in range(EPOCHS):
        model.train()
        total_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        model.eval()
        val_preds = []
        val_labels = []
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                outputs = model(input_ids, attention_mask)
                val_preds.extend(outputs.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())

        val_preds = np.clip(val_preds, 0, 5)
        pearson = pearsonr(val_preds, val_labels)[0]
        avg_loss = total_loss / len(train_loader)  # mean training loss per batch
        print(f"Epoch {epoch+1}/{EPOCHS} - Train Loss: {avg_loss:.4f} - Val Pearson: {pearson:.4f}")

        train_losses.append(avg_loss)
        val_pearsons.append(pearson)
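
The loop above appends one Pearson value per epoch to a single flat list across all folds, so `max(val_pearsons)` reports the best epoch of the best fold. A per-fold view can be recovered by slicing that list into chunks of `EPOCHS` values. The sketch below assumes every fold ran the same number of epochs. The helper name `best_per_fold` and the scores are hypothetical.

```python
def best_per_fold(scores, epochs):
    """Split a flat per-epoch score list into folds and return each fold's best score."""
    if epochs <= 0 or len(scores) % epochs != 0:
        raise ValueError("scores must contain a whole number of folds")
    folds = [scores[i:i + epochs] for i in range(0, len(scores), epochs)]
    return [max(fold) for fold in folds]

# Hypothetical values: 2 folds x 3 epochs, for illustration only.
scores = [0.50, 0.62, 0.60, 0.41, 0.55, 0.58]
best = best_per_fold(scores, epochs=3)
mean_best = sum(best) / len(best)
print(best, round(mean_best, 4))
```

Averaging the per-fold bests is usually a steadier summary than the single global maximum, although picking the best epoch on the validation split still makes the figure somewhat optimistic.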

**7. Training Visualization**

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_losses)
plt.title("Training Loss over Epochs")
plt.xlabel("Epoch (all folds concatenated)")
plt.ylabel("Loss")

plt.subplot(1, 2, 2)
plt.plot(val_pearsons)
plt.title("Validation Pearson Correlation")
plt.xlabel("Epoch (all folds concatenated)")
plt.ylabel("Pearson")

plt.tight_layout()
plt.show()


**8. Inference on Test Set**

In [None]:
# Load the sample_submission.csv to get correct column names
sample_df = pd.read_csv("/kaggle/input/shl-dataset/dataset/sample_submission.csv")
print(sample_df.columns)


In [None]:
test_dataset = GrammarDataset(test_df['transcript'].tolist())
test_loader = DataLoader(test_dataset, batch_size=8)

# `model` is the network left over from the last fold's final epoch
model.eval()
predictions = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask)
        predictions.extend(outputs.cpu().numpy())

# Clip predictions to the valid score range [0, 5]
predictions = np.clip(predictions, 0, 5)

# Build the submission DataFrame using correct column names
submission_df = pd.DataFrame({
    "filename": test_df["filename"],
    "label": predictions
})

# Save submission file
submission_df.to_csv("submission.csv", index=False)
print("✅ Final submission file saved successfully!")
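
The inference cell uses whichever `model` object remains after the training loop, that is, the last fold's model. A common alternative, which this notebook does not implement, is to keep one list of test predictions per fold and average them. The sketch below shows only the averaging step. The helper name `average_folds` and the numbers are hypothetical.

```python
def average_folds(fold_predictions):
    """Average per-sample predictions across folds.

    fold_predictions: list of lists, one inner list of test predictions per fold.
    """
    if not fold_predictions:
        raise ValueError("need predictions from at least one fold")
    n_samples = len(fold_predictions[0])
    if any(len(preds) != n_samples for preds in fold_predictions):
        raise ValueError("every fold must predict the same number of samples")
    n_folds = len(fold_predictions)
    # zip(*...) groups the predictions for each test sample across folds.
    return [sum(sample) / n_folds for sample in zip(*fold_predictions)]

# Hypothetical predictions from two folds on two test samples.
ensemble = average_folds([[2.0, 4.0], [3.0, 5.0]])
print(ensemble)
```

The averaged scores would still need to be clipped to [0, 5] before being written to the submission file.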



**9. Summary Report**

In [None]:
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

printmd("## 📘 Grammar Scoring Engine Report")
printmd("**Model:** Whisper (ASR) + BERT + Regression")
printmd("**Dataset:** 444 Train Audio Samples with MOS Grammar Scores")
printmd("**ASR:** Whisper large-v2 for transcript generation")
printmd("**Text Modeling:** BERT + Linear Regression head")
printmd("**Evaluation:** Pearson Correlation (per fold)")
printmd(f"**Best Pearson:** {max(val_pearsons):.4f}")