In [1]:
# Grammar Scoring Engine for Spoken Audios
**SHL Intern Hiring Assessment 2025**

In this notebook, I build a simple machine learning pipeline that predicts grammar scores from spoken English audio samples.
The goal is to explore how well acoustic features can be used to estimate grammar quality.

## Problem Statement

The task is to build a Grammar Scoring Engine that can predict a grammar score between 0 and 5 from spoken English audio recordings.

Each audio file is around 45 to 60 seconds long and is sampled at 16 kHz.
Using these audio samples, the model should learn patterns related to grammar quality and output a continuous score.

## Dataset Description

The dataset is provided as part of the SHL Intern Hiring Assessment 2025 and is hosted on Kaggle.

It includes:
- WAV audio files containing spoken English responses
- A training CSV file with audio filenames and grammar scores
- A test CSV file with audio filenames but without labels

For this notebook, the dataset is assumed to be already available locally in the following structure:

dataset/
├── audios/
│   ├── train/
│   └── test/
└── csvs/
    ├── train.csv
    └── test.csv


In [11]:
import os
import numpy as np
import pandas as pd
import librosa

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr


In [5]:
BASE_PATH = "dataset"

TRAIN_AUDIO_PATH = f"{BASE_PATH}/audios/train/"
TEST_AUDIO_PATH  = f"{BASE_PATH}/audios/test/"

TRAIN_CSV_PATH = f"{BASE_PATH}/csvs/train.csv"
TEST_CSV_PATH  = f"{BASE_PATH}/csvs/test.csv"

train_df = pd.read_csv(TRAIN_CSV_PATH)
test_df  = pd.read_csv(TEST_CSV_PATH)


In [None]:
## Feature Extraction

To represent the audio signals numerically, Mel-Frequency Cepstral Coefficients (MFCCs) are used.

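For each audio file:
- 13 MFCCs are computed for every short-time frame of the signal
- The mean and standard deviation of each coefficient are calculated across all frames
- Concatenating them gives a fixed-length feature vector of size 26 (13 means + 13 standard deviations)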

This approach helps convert variable-length audio signals into a consistent feature format that can be used for model training.
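
The pooling step can be illustrated without any audio files. The sketch below is a minimal illustration, assuming random matrices stand in for MFCC output of shape `(n_mfcc, n_frames)`; the helper name `pool_mean_std` and the frame counts are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

def pool_mean_std(frames):
    """Collapse a (n_coeffs, n_frames) matrix into per-coefficient means and stds."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

rng = np.random.default_rng(0)

# Random stand-ins for the MFCC matrices of a shorter and a longer clip
# (assumption: illustrative values only, not real audio).
short_clip = rng.normal(size=(13, 1400))
long_clip = rng.normal(size=(13, 1900))

short_vec = pool_mean_std(short_clip)
long_vec = pool_mean_std(long_clip)

# Both clips map to the same vector length despite different frame counts.
print(short_vec.shape, long_vec.shape)  # (26,) (26,)
```

Pooling over time discards the ordering of frames, which is the price paid for a fixed-length input.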


In [6]:
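def extract_mfcc_features(audio_path, sr=16000, n_mfcc=13):
    """Return a 2 * n_mfcc vector: per-coefficient mean and std over all frames."""
    # Load the audio, resampling to `sr` if the file uses a different rate.
    y, sr = librosa.load(audio_path, sr=sr)
    # mfcc has shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Pool over the time axis so every clip yields the same vector length.
    mfcc_mean = np.mean(mfcc, axis=1)
    mfcc_std  = np.std(mfcc, axis=1)

    return np.concatenate([mfcc_mean, mfcc_std])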


In [7]:
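X = []
y = []

# Build one feature vector per training clip, paired with its grammar score.
for _, row in train_df.iterrows():
    audio_file = os.path.join(TRAIN_AUDIO_PATH, row["filename"] + ".wav")
    features = extract_mfcc_features(audio_file)
    X.append(features)
    y.append(row["label"])

X = np.array(X)  # shape: (n_samples, 26)
y = np.array(y)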


  y, sr = librosa.load(audio_path, sr=sr)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)

In [8]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [None]:
## Model Architecture

A Random Forest Regressor is used as the baseline model for this task.

This model was chosen because it is easy to implement, needs little tuning, works reasonably well with small datasets, and can capture non-linear relationships between the extracted audio features and grammar scores.
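
On a small dataset, a single hold-out split can give a noisy performance estimate. As a hedged sketch of one alternative (not what this notebook does, which uses a single 80/20 split), k-fold cross-validation averages the error over several splits. The data below is synthetic, and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical stand-ins for the real features, labels, and model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for 26-dimensional MFCC features and 0-5 labels
# (assumption: made-up data, used only to show the mechanics).
X_demo = rng.normal(size=(100, 26))
y_demo = np.clip(2.5 + X_demo[:, 0] + 0.3 * rng.normal(size=100), 0, 5)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
demo_model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
neg_rmse = cross_val_score(
    demo_model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
fold_rmse = -neg_rmse

print(f"RMSE per fold: {np.round(fold_rmse, 3)}")
print(f"Mean RMSE: {fold_rmse.mean():.3f} (std {fold_rmse.std():.3f})")
```

The spread across folds gives a rough sense of how much a single validation score might vary with the choice of split.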


In [9]:
model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

model.fit(X_train, y_train)


In [12]:
y_pred = model.predict(X_val)

mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
pearson_corr, _ = pearsonr(y_val, y_pred)
print(f"Validation RMSE: {rmse:.4f}")
print(f"Pearson Correlation: {pearson_corr:.4f}")

Validation RMSE: 0.6784
Pearson Correlation: 0.5052


In [None]:
## Evaluation Metrics

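Model performance is evaluated using Root Mean Squared Error (RMSE) and Pearson correlation.

RMSE is the square root of the mean squared difference between the predicted and actual grammar scores. It is expressed in score points and penalizes large errors more heavily than small ones.
A lower RMSE value indicates better performance.

Pearson correlation measures how closely the predicted scores follow the same linear trend as the true grammar scores; values closer to 1 are better.

On the validation split, the model reaches an RMSE of 0.6784 and a Pearson correlation of 0.5052.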
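
To make the two metrics concrete, here is a small worked sketch that computes them directly with NumPy. The scores are made up for illustration and are not taken from the dataset; the helper names `rmse` and `pearson` are hypothetical. A constant predictor that always outputs the mean score is included as a reference point for the RMSE.

```python
import numpy as np

# Toy scores (assumption: illustrative values, not dataset values).
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_hat = np.array([2.5, 3.0, 3.5, 4.0])

def rmse(a, b):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def pearson(a, b):
    """Covariance of the centered values divided by the product of their norms."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float(np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2)))

model_rmse = rmse(y_true, y_hat)
# Reference point: always predict the mean of the true scores.
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))
corr = pearson(y_true, y_hat)

print(f"RMSE: {model_rmse:.4f}")
print(f"Mean-predictor RMSE: {baseline_rmse:.4f}")
print(f"Pearson: {corr:.4f}")
```

In this toy case the predictions are an exact linear function of the true scores, so the Pearson correlation is 1 even though the RMSE is non-zero. This is why the two metrics are reported together: one measures the size of the errors, the other the agreement in trend.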

## Conclusion

This notebook demonstrates a complete workflow for predicting grammar scores from spoken audio data.

The workflow covers audio loading, MFCC feature extraction, and training and evaluating a regression model with RMSE and Pearson correlation.
This approach provides a baseline (validation RMSE 0.6784, Pearson correlation 0.5052); performance could likely be improved with more advanced audio features or deep learning-based methods.
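
The cells above stop at validation, and the loaded test CSV is not scored. One possible way to finish the workflow is sketched below. It assumes the test CSV has a `filename` column like the training CSV and that predictions go into a column named `label`; both are assumptions, as is the helper name `predict_test_scores`. The feature extractor and model are passed in as arguments so the function does not depend on any particular implementation.

```python
import os
import numpy as np
import pandas as pd

def predict_test_scores(test_df, audio_dir, extract_fn, model):
    """Return filenames with a predicted 'label' column (column name is an assumption)."""
    features = np.array([
        extract_fn(os.path.join(audio_dir, name + ".wav"))
        for name in test_df["filename"]
    ])
    out = test_df[["filename"]].copy()
    # Grammar scores are defined on 0-5, so keep predictions inside that range.
    out["label"] = np.clip(model.predict(features), 0.0, 5.0)
    return out
```

With the objects defined earlier in this notebook, a call might look like `predict_test_scores(test_df, TEST_AUDIO_PATH, extract_mfcc_features, model)`, and the result could then be written out with `to_csv`.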
