# Grammar Scoring Competition

## 1. Approach Overview
The objective of this competition is to predict a grammar score, which is fundamentally a **regression problem**. The approach first converts the provided audio files to transcripts using Whisper, a pre-trained automatic speech recognition (ASR) model. The transcripts then serve as the training data for fine-tuning DeBERTa, a transformer-based model, on the grammar scoring task.

## 2. Preprocessing Steps
The initial preprocessing pipeline standardizes the audio data before transcription:
* **Package Installation:** All required libraries (`torchaudio`, `wordfreq`, etc.) are installed.
* **Audio Standardization:** The core `load_and_resample` function ensures all raw audio files are loaded and resampled to a consistent rate of **16,000 Hz**.
* **Mono Conversion:** Multi-channel audio (e.g., stereo) is converted to **mono** (single channel) by averaging the channels, which is standard practice for speech processing.


## 3. Pipeline Architecture
The machine learning pipeline follows a standard supervised learning flow:
1.  **Raw Input:** Audio File + Ground Truth Score.
2.  **Audio Preprocessing:** Resampling (16kHz) and Mono Conversion.
3.  **Speech to Text:** The audio is converted to raw text using the ASR model **Whisper**. The generated transcripts are saved as a CSV file.
4.  **Model Training:** Fine-tuning the **transformer** on the saved transcript CSV.
5.  **Evaluation:** Performance is assessed using **Root Mean Square Error (RMSE)**.

## 4. Evaluation Results
The final results from the model run are presented below.

| Transformer model | Train RMSE | Leaderboard Score |
| :--- | :--- | :--- |
| DeBERTa | 0.16188851 | 0.600 |
| DeBERTa-small | 0.16698851 | 0.599 |
| DeBERTa-large | 0.1449279 | 0.735 |
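
For reference, the RMSE column follows the usual definition: the square root of the mean squared difference between predicted and true scores. The sketch below is a minimal, dependency-free illustration. The function name `rmse` and the sample scores are hypothetical and are not taken from the competition data.

```python
import math

def rmse(predictions, targets):
    """Root mean square error between two equal-length sequences of scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical scores, for illustration only
example_preds = [3.0, 4.0, 2.0]
example_true = [3.0, 5.0, 4.0]
print(rmse(example_preds, example_true))
```

Because the error is squared before averaging, a single badly mispredicted transcript affects the score more than several small misses.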

In [None]:
!pip install wordfreq

In [None]:
!pip install evaluate

In [None]:
!pip install TorchCodec

In [None]:
!pip install transformers==4.57.1

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Visualizing waveform

In [None]:
import matplotlib.pyplot as plt
import torch
import torchaudio
import torchaudio.transforms as T
import torchaudio.functional as F

In [None]:
# Function to handle the actual plotting logic for waveform or spectrogram.
def _plot(waveform, sample_rate, title):
  """
  Internal helper function to plot the waveform or spectrogram.
  Handles conversion to numpy and setting up the plot axes.
  """
  # Convert PyTorch tensor to NumPy array for plotting
  waveform = waveform.numpy()

  num_channels, num_frames = waveform.shape
  # Create a time axis based on the number of frames and sample rate
  time_axis = torch.arange(0, num_frames) / sample_rate

  figure, axes = plt.subplots(num_channels, 1)
  if num_channels == 1:
    axes = [axes]
  for c in range(num_channels):
    # Plot the waveform vs time for the Waveform visualization
    if title == "Waveform":
      axes[c].plot(time_axis, waveform[c], linewidth=1)
      axes[c].grid(True)
    # Plot the spectrogram (frequency vs time)
    else:
      axes[c].specgram(waveform[c], Fs=sample_rate)
    if num_channels > 1:
      axes[c].set_ylabel(f'Channel {c+1}')
  figure.suptitle(title)
  plt.show(block=False)

# Public function to display the audio waveform (amplitude over time).
def plot_waveform(waveform, sample_rate):
  """Plots the time-domain waveform of the audio signal."""
  _plot(waveform, sample_rate, title="Waveform")

# Public function to display the audio spectrogram (frequency content over time).
def plot_specgram(waveform, sample_rate):
  """Plots the spectrogram of the audio signal (currently not used but included for completeness)."""
  _plot(waveform, sample_rate, title="Spectrogram")

In [None]:
# Purpose: Ensure all audio files have the same format:
# ✔ Same sample rate (e.g., 16kHz)
# ✔ Converted to mono (1 channel)

def load_and_resample(path, target_sr=16000):
    # Load the audio file from the specified path, obtaining the waveform tensor and original sample rate (sr).
    waveform, sr = torchaudio.load(path)  # shape: [channels, time]

    # Check if the original sample rate (sr) matches the target rate.
    if sr != target_sr:
        # Initialize the Resample transform from torchaudio.
        resampler = T.Resample(orig_freq=sr, new_freq=target_sr)
        # Apply the resampling transformation to the waveform.
        waveform = resampler(waveform)

    # Convert to mono if the audio has multiple channels (e.g., stereo)
    if waveform.shape[0] > 1:
        # Average the channels along the first dimension to create a single mono channel
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    return waveform, target_sr

In [None]:
file_path = "drive/MyDrive/grammar_scoring/audios/train/audio_1.wav"
waveform, sample = load_and_resample(file_path)
plot_waveform(waveform, sample)
waveform = waveform.squeeze() #always squeeze waveform to avoid dimension related errors

## Speech to Text

### Whisper
Loading and testing the Whisper model.

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
# # load model and processor

#The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration.
#However, by using a chunking algorithm, it can be used to transcribe audio samples of up to arbitrary length.
#This is possible through Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline.

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-medium",
  chunk_length_s=30,
  stride_length_s=2,
  device=device,
)
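
To make the chunking settings above more concrete, here is a simplified sketch of how a long recording could be split into overlapping windows. It is an illustration, not the library's internal code. It assumes the stride is applied equally on both sides of a chunk, so consecutive windows advance by `chunk_length_s - 2 * stride_length_s` seconds. The helper name `chunk_windows` is hypothetical.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=2.0):
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        # Stop once the current window reaches the end of the recording
        if start + chunk_length_s >= duration_s:
            break
        start += step
    return windows

# A hypothetical 70-second recording
for window in chunk_windows(70.0):
    print(window)
```

With these settings, neighbouring windows share 4 seconds of audio. That overlap gives the model context on both sides of each boundary when the partial transcripts are stitched together.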

In [None]:
#this is the main function where transcripts are generated from audio
def transcript(file_name):
  file_path = "drive/MyDrive/grammar_scoring/audios/train/" + file_name
  waveform, sample = load_and_resample(file_path)
  waveform = waveform.squeeze()
  # input_features = processor(waveform, sampling_rate=sample, return_tensors="pt").input_features
  # predicted_ids = model.generate(input_features)
  # transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
  transcription = pipe(waveform, batch_size=8, return_timestamps=True)["text"]
  # print(transcription)
  return transcription


In [None]:
# #testing with an audio file
file_path = "audio_1.wav"
print(transcript(file_path))

### Generating csv
After loading and testing the pre-trained Whisper model, each audio file in the training set is converted to its transcript, and the results are saved as a CSV file.

In [None]:
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
#creating dataframe
df = pd.DataFrame(columns=["filename", "transcript"])

In [None]:
dir_path = "drive/MyDrive/grammar_scoring/audios/train"
for files in os.listdir(dir_path):

  name = os.path.splitext(os.path.basename(files))[0]
  print(name)

  # speech-to-text conversion
  trans = transcript(files)
  print(trans)

  df.loc[len(df)] = {'filename': name, 'transcript': trans}
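
One caveat with the loop above: `os.listdir` returns entries in arbitrary order and includes any non-audio files that happen to be in the folder. The sketch below shows one way to obtain a filtered, sorted list of file names. It assumes the audio files use the `.wav` extension (as `audio_1.wav` does), and the helper name `list_audio_files` is hypothetical.

```python
from pathlib import Path

def list_audio_files(dir_path, suffix=".wav"):
    """Return the sorted names of regular files in dir_path with the given suffix."""
    return sorted(
        entry.name
        for entry in Path(dir_path).iterdir()
        if entry.is_file() and entry.suffix.lower() == suffix
    )
```

The returned names could be passed to `transcript` in place of the raw `os.listdir` output. Sorting is lexicographic, so `audio_10.wav` comes before `audio_2.wav`. That is enough for reproducible runs, though not for numeric ordering.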


In [None]:
df.to_csv("drive/MyDrive/grammar_scoring/csvs/transcript_train.csv", index=False)

In [None]:
df

### Data cleaning

In [None]:
import pandas as pd

In [None]:
df_main = pd.read_csv("drive/MyDrive/grammar_scoring/csvs/train.csv")
df_train = pd.read_csv("drive/MyDrive/grammar_scoring/csvs/transcript_train.csv")

In [None]:
df_final = pd.merge(df_train, df_main, on="filename", how="inner")

In [None]:
df_final

In [None]:
# Remove rows that contain non-ASCII characters.
# The pattern matches only strings made up entirely of ASCII characters (code points 0-127).
import re
pattern = re.compile(r'^[\x00-\x7F]*$')

# Function to test each cell (convert to string to avoid errors)
def is_clean(value):
    return bool(pattern.match(str(value)))

# Keep rows where **all columns** satisfy the condition
clean_df = df_final[df_final.apply(lambda row: all(is_clean(x) for x in row), axis=1)]
clean_df

In [None]:
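# Drop the filename column from the filtered frame (not from df_final, which would discard the filtering above)
clean_df = clean_df.drop(columns=["filename"])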
clean_df = clean_df.rename(columns={"label": "labels"})
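
To clarify what the ASCII filter in the data cleaning step accepts and rejects, here is a small standalone sketch. It uses the same regular expression. The helper name `is_ascii_only` and the sample strings are illustrative only.

```python
import re

# Same pattern as in the cleaning step: the whole string must be ASCII
ASCII_ONLY = re.compile(r'^[\x00-\x7F]*$')

def is_ascii_only(value):
    """True if the string form of value contains only ASCII characters."""
    return bool(ASCII_ONLY.match(str(value)))

# Hypothetical sample values
samples = ["Hello, world! 123", "naïve café", "こんにちは", ""]
for s in samples:
    print(repr(s), is_ascii_only(s))
```

The check is stricter than "English only" in one direction and looser in the other. Accented Latin letters and curly quotes are rejected. Any ASCII punctuation or control character is accepted.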

## Transformer model for regression
A pre-trained DeBERTa model was fine-tuned as a regressor for the grammar scoring task. General steps such as loading and tokenizing follow the Hugging Face documentation.

In [None]:
import pandas as pd
from datasets import Dataset, Value
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import torch
import evaluate

In [None]:
dataset_train = Dataset.from_pandas(clean_df)
dataset_train = dataset_train.cast_column("labels", Value("float32"))

In [None]:
model_name = "microsoft/deberta-v3-large"   # recommended for grammar scoring
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def tokenize_fn(batch):
    return tokenizer(
        batch["transcript"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

train_ds = dataset_train.map(tokenize_fn, batched=True)
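
The effect of `padding="max_length"` together with `truncation=True` can be pictured with the simplified sketch below. It is not the tokenizer's implementation: it ignores special tokens, and the pad id of `0` and the helper name `pad_or_truncate` are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Cut sequences longer than max_length and right-pad shorter ones.

    Returns (input_ids, attention_mask), both exactly max_length long.
    """
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)      # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad  # 0 marks padding

# Short sequence: padded up to the fixed length
print(pad_or_truncate([5, 6, 7], max_length=5))
# Long sequence: everything past max_length is dropped
print(pad_or_truncate(list(range(10)), max_length=4))
```

One consequence is that any transcript longer than 128 tokens loses its tail, so grammar errors late in a long response would not be visible to the model.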

In [None]:
train_ds = train_ds.remove_columns(
    [col for col in train_ds.column_names if col not in ["input_ids","attention_mask","labels"]]
)

train_ds.set_format(type="torch")

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,                 # regression
    problem_type="regression"
)

In [None]:
mse = evaluate.load("mse")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.squeeze()
    return {"mse": mse.compute(predictions=preds, references=labels)["mse"]}

In [None]:
training_args = TrainingArguments(
    output_dir="./grammar_model",
    num_train_epochs=40,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    logging_steps=50,
    save_steps=500,          # optional
    load_best_model_at_end=False,   # important: no eval → cannot pick “best”
)


In [None]:
class RMSETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Extract labels
        labels = inputs.pop("labels")

        # Forward pass
        outputs = model(**inputs)
        # Squeeze only the last dimension so a batch of size 1 keeps shape [1] and matches labels
        logits = outputs.logits.squeeze(-1)

        # MSE loss
        mse = torch.nn.functional.mse_loss(logits, labels)

        # RMSE = sqrt(MSE)
        rmse = torch.sqrt(mse)

        return (rmse, outputs) if return_outputs else rmse
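
Training on RMSE rather than MSE keeps the same minimiser but rescales the gradient: for each prediction the derivative is `(p_i - t_i) / (n * RMSE)`. The sketch below checks this against a finite-difference estimate in plain Python. All function names and values are illustrative and are not part of the training code.

```python
import math

def rmse(preds, targets):
    n = len(preds)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / n)

def rmse_gradient(preds, targets):
    """Analytic gradient of RMSE w.r.t. each prediction: (p_i - t_i) / (n * RMSE)."""
    n = len(preds)
    value = rmse(preds, targets)
    if value == 0.0:
        raise ZeroDivisionError("RMSE gradient is undefined when all errors are zero")
    return [(p - t) / (n * value) for p, t in zip(preds, targets)]

def numeric_gradient(preds, targets, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(preds)):
        up, down = list(preds), list(preds)
        up[i] += eps
        down[i] -= eps
        grads.append((rmse(up, targets) - rmse(down, targets)) / (2 * eps))
    return grads

# Hypothetical predictions and labels
preds, targets = [2.5, 3.0, 4.5], [3.0, 3.0, 4.0]
print(rmse_gradient(preds, targets))
print(numeric_gradient(preds, targets))
```

The division by RMSE means the gradient is undefined when every prediction in a batch is exact. A common safeguard, not used in the original code, is to add a small constant under the square root.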

In [None]:
trainer = RMSETrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    tokenizer=tokenizer
)

In [None]:
# The number of epochs is set by num_train_epochs in TrainingArguments above
trainer.train()

In [None]:
trainer.save_model("drive/MyDrive/grammar_scoring/grammar_model_dabert_large4")

## Full model pipeline for inferencing

In [None]:
#step 1: Whisper for audio to text conversion
import torch
import torchaudio
import torchaudio.transforms as T
import torchaudio.functional as F
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

#waveform load function
def load_and_resample(path, target_sr=16000):
    waveform, sr = torchaudio.load(path)  # shape: [channels, time]
    if sr != target_sr:
        resampler = T.Resample(orig_freq=sr, new_freq=target_sr)
        waveform = resampler(waveform)
    # convert to mono (average channels)
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    return waveform, target_sr

#Whisper model definition
pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-medium",
  chunk_length_s=30,
  stride_length_s=2,
  device=device
)

#audio to text function
def transcript(file_name):
  file_path = "drive/MyDrive/grammar_scoring/audios/test/" + file_name
  waveform, sample = load_and_resample(file_path)
  waveform = waveform.squeeze()
  # input_features = processor(waveform, sampling_rate=sample, return_tensors="pt").input_features
  # predicted_ids = model.generate(input_features)
  # transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
  transcription = pipe(waveform, batch_size=8, return_timestamps=True)["text"]
  # print(transcription)
  return transcription


In [None]:
# fine-tuned transformer (DeBERTa) for text tokenization and grammar scoring
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

model_name = "/content/drive/MyDrive/grammar_scoring/grammar_model_dabert_large3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# def score_batch(sentences):
#     inputs = tokenizer(
#         sentences,
#         return_tensors="pt",
#         padding=True,
#         truncation=True,
#         max_length=128,
#     )

#     with torch.no_grad():
#         outputs = model(**inputs)

#     scores = outputs.logits.squeeze().tolist()
#     return scores

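def score_text(sentence):
    # Tokenize one transcript, then move the tensors to the model's device.
    # (The tokenizer call itself does not take a `device` argument.)
    inputs = tokenizer(
        sentence,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    ).to(model.device)

    # Inference only: no gradients needed
    with torch.no_grad():
        outputs = model(**inputs)

    # The regression head returns a single logit; convert it to a Python float
    score = outputs.logits.squeeze().item()
    return score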


In [None]:
import os
import pandas as pd

file_path = "drive/MyDrive/grammar_scoring/audios/test"
test_df =  pd.DataFrame(columns=["filename", "label"])

for files in os.listdir(file_path):
  name = os.path.splitext(os.path.basename(files))[0]
  print(name)
  # speech-to-text conversion
  trans = transcript(files)
  print(trans)
  score = score_text(trans)
  print(score)

  test_df.loc[len(test_df)] = {'filename': name, 'label': score}



In [None]:
test_df.to_csv("output4.csv", index=False)