
# Grammar Scoring Engine - Report

## Overview
This notebook tackles the Grammar Scoring Engine Kaggle challenge, where the goal is to predict grammar scores (0–5) from .wav audio recordings using machine learning.

## Approach
1. **Audio Transcription**: We use OpenAI's Whisper (the `large` checkpoint from `openai-whisper`) to convert audio to text.
2. **Grammar Feature Extraction**: Using `language_tool_python`, we count the grammar, style, and spelling issues that LanguageTool flags in each transcript.
3. **Feature Engineering**: We compute text-based features: word count, sentence count, average sentence length, error density (errors per word), and the relative change in length after auto-correction.
4. **Modeling**: A LightGBM `LGBMRegressor` is trained on these features to predict grammar scores.
5. **Evaluation**: The model is evaluated with RMSE and MAE on the training set and with 5-fold cross-validated MAE. Test predictions are rounded to the nearest integer and clipped to the range 1–5.
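
As a rough illustration of the evaluation step, the sketch below computes RMSE by hand and applies the round-then-clip post-processing in plain Python. The helper names `rmse` and `to_score` and the sample values are hypothetical, chosen only for this example; the notebook itself relies on scikit-learn and NumPy for these steps.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def to_score(pred, low=1, high=5):
    """Round a raw prediction and clip it to the valid score range."""
    return min(max(round(pred), low), high)

# Hypothetical labels and raw model outputs, for illustration only.
labels = [3.0, 4.0, 5.0, 2.0]
raw = [2.5, 4.2, 5.7, 0.3]

print(rmse(labels, raw))
print([to_score(p) for p in raw])
```

One detail worth keeping in mind: Python's `round`, like `np.round`, rounds exact halves to the nearest even integer, so a raw prediction of 2.5 becomes 2 while 3.5 becomes 4.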

## Preprocessing
- Audio files are transcribed using Whisper.
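- Transcripts are stripped of surrounding whitespace and checked with LanguageTool.
- Extracted features include:
  - Grammar errors
  - Style errors
  - Spelling errors
  - Total errors and error density (errors per word)
  - Sentence count
  - Word count
  - Average sentence length
  - Relative length difference between the original and auto-corrected text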
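
The text statistics above can be sketched without Whisper or LanguageTool installed. In the example below, `text_features` is a hypothetical helper and `n_errors` stands in for the number of issues a grammar checker would report; the sentence count uses the same punctuation-counting heuristic as the notebook.

```python
def text_features(text, n_errors):
    """Compute simple text statistics for one transcript.

    `n_errors` is the number of issues reported by a grammar checker;
    it is passed in so this sketch runs without LanguageTool installed.
    """
    words = text.split()
    word_count = len(words)
    # Count sentence-ending punctuation; assume at least one sentence.
    sentence_count = max(sum(text.count(ch) for ch in ".?!"), 1)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "avg_sent_len": word_count / sentence_count,
        "error_density": n_errors / word_count if word_count else 0.0,
    }

# Made-up transcript and error count, for illustration only.
feats = text_features("She go to school. He like apples!", n_errors=2)
print(feats)
```

Counting punctuation is a coarse heuristic: abbreviations such as "e.g." and ellipses inflate the sentence count, which in turn lowers the average sentence length.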

## Pipeline Architecture
1. Load audio and labels
2. Transcribe using Whisper
3. Analyze with LanguageTool
4. Feature extraction
5. Model training on the full training set
6. Evaluation (training RMSE/MAE and 5-fold cross-validated MAE)
7. Test-set prediction and submission formatting
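
The notebook's evaluation cell uses scikit-learn's `KFold` with five shuffled splits. The pure-Python sketch below illustrates the same idea; `kfold_indices` is a hypothetical helper, not scikit-learn's implementation, and its shuffle order will differ from `KFold`'s.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, val_idx

# 444 matches the number of training rows reported in the LightGBM log.
folds = list(kfold_indices(444))
print([(len(tr), len(va)) for tr, va in folds])
```

With 444 samples and five folds, four folds hold out 89 samples and train on 355, which agrees with the "355" data points reported in the cross-validation log output.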

## Results
The initial model achieved a leaderboard RMSE of **~0.398**. Future improvements may include:
- Fine-tuning Whisper or using larger models
- Incorporating prosodic/audio features (e.g., pitch, pace, pauses)
- Experimenting with more advanced regressors or ensemble methods
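
As one possible direction for the pause-related features mentioned above, the sketch below estimates the fraction of low-energy frames in a waveform. Everything here is an assumption made for illustration: the helper name `pause_ratio`, the 25 ms frame length, and the energy threshold are untuned placeholders, and the input is a synthetic signal rather than competition audio.

```python
import math

def pause_ratio(samples, sample_rate, frame_ms=25, threshold=0.01):
    """Fraction of fixed-length frames whose RMS energy is below `threshold`.

    `samples` is a mono waveform scaled to [-1, 1]. The frame length and
    threshold are illustrative defaults, not tuned values.
    """
    frame_len = max(int(sample_rate * frame_ms / 1000), 1)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return 0.0
    silent = 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms < threshold:
            silent += 1
    return silent / n_frames

# Synthetic signal: 1 s of a 220 Hz tone followed by 1 s of silence at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
signal = tone + [0.0] * sr
print(pause_ratio(signal, sr))
```

A value like this could be appended to the existing feature table as an extra column, though real recordings would need a threshold chosen relative to their background noise level.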


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/shl-intern-hiring-assessment/dataset/sample_submission.csv
/kaggle/input/shl-intern-hiring-assessment/dataset/train.csv
/kaggle/input/shl-intern-hiring-assessment/dataset/test.csv
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_885.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_698.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_1176.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_1215.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_66.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_386.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_1026.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_330.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_72.wav
/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test/audio_858.wav
/kaggle/input/shl-intern-hiring-ass

In [None]:
# Step 1: Install Dependencies (Kaggle)
!pip install -q openai-whisper language-tool-python lightgbm
!apt update && apt install ffmpeg -y

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for openai-whisper (pyproject.toml) ... done
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 https://r2u.stat.illinois.edu/u

In [None]:
# Step 2: Imports
import whisper
import language_tool_python
import pandas as pd
import numpy as np
import os
from tqdm.notebook import tqdm
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
import lightgbm as lgb
import torch

In [None]:
# Step 3: Load Data
train_df = pd.read_csv('/kaggle/input/shl-intern-hiring-assessment/dataset/train.csv')
test_df = pd.read_csv('/kaggle/input/shl-intern-hiring-assessment/dataset/test.csv')
train_audio_dir = '/kaggle/input/shl-intern-hiring-assessment/dataset/audios_train'
test_audio_dir = '/kaggle/input/shl-intern-hiring-assessment/dataset/audios_test'


In [None]:
# Step 4: Load Whisper and Language Tool
asr_model = whisper.load_model("large")
tool = language_tool_python.LanguageTool('en-US')

100%|█████████████████████████████████████| 2.88G/2.88G [01:03<00:00, 48.5MiB/s]
  checkpoint = torch.load(fp, map_location=device)
Downloading LanguageTool 6.5: 100%|██████████| 248M/248M [00:03<00:00, 81.5MB/s]


In [None]:
# Step 5: Feature Extraction

def extract_advanced_features(path):
    try:
        text = asr_model.transcribe(path)['text'].strip()
        matches = tool.check(text)
        corrected = language_tool_python.utils.correct(text, matches)
        words = text.split()
        wc = len(words)
        sc = max(text.count('.') + text.count('?') + text.count('!'), 1)

        grammar_errors = [m for m in matches if m.ruleIssueType == 'grammar']
        style_errors = [m for m in matches if m.ruleIssueType == 'style']
        spelling_errors = [m for m in matches if m.ruleIssueType == 'misspelling']

        return {
            'filename': os.path.basename(path),
            'word_count': wc,
            'sentence_count': sc,
            'avg_sent_len': wc / sc,
            'grammar_error_count': len(grammar_errors),
            'style_error_count': len(style_errors),
            'spelling_error_count': len(spelling_errors),
            'total_errors': len(matches),
            'error_density': len(matches) / wc if wc else 0,
            'corrected_length_diff': abs(len(corrected) - len(text)) / len(text) if len(text) > 0 else 0
        }
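    except Exception as e:
        print(f"Error processing {path}: {e}")
        # Fall back to zeroed features, but keep the real filename so the row
        # can still be matched to its audio file in the submission.
        fallback = {k: 0 for k in ['word_count', 'sentence_count', 'avg_sent_len',
                                   'grammar_error_count', 'style_error_count', 'spelling_error_count',
                                   'total_errors', 'error_density', 'corrected_length_diff']}
        fallback['filename'] = os.path.basename(path)
        return fallback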


In [None]:
# Step 6: Extract Features
train_features = []
for _, row in tqdm(train_df.iterrows(), total=len(train_df)):
    path = os.path.join(train_audio_dir, row['filename'])
    feats = extract_advanced_features(path)
    feats['label'] = row['label']
    train_features.append(feats)
train_features_df = pd.DataFrame(train_features)

  0%|          | 0/444 [00:00<?, ?it/s]

In [None]:
# Test Features
test_features = []
for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
    path = os.path.join(test_audio_dir, row['filename'])
    feats = extract_advanced_features(path)
    test_features.append(feats)
test_features_df = pd.DataFrame(test_features)

  0%|          | 0/195 [00:00<?, ?it/s]

In [None]:
# Step 7: Model Training
feature_cols = ['word_count', 'sentence_count', 'avg_sent_len',
                'grammar_error_count', 'style_error_count', 'spelling_error_count',
                'total_errors', 'error_density', 'corrected_length_diff']
X_train = train_features_df[feature_cols]
y_train = train_features_df['label']
X_test = test_features_df[feature_cols]

model = lgb.LGBMRegressor(n_estimators=300, max_depth=7, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002380 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 566
[LightGBM] [Info] Number of data points in the train set: 444, number of used features: 9
[LightGBM] [Info] Start training from score 3.617117


In [None]:
# Step 8: Evaluation
train_preds = model.predict(X_train)
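# Take the square root explicitly; the `squared` argument was deprecated and
# later removed in newer scikit-learn releases.
rmse = np.sqrt(mean_squared_error(y_train, train_preds))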
mae = mean_absolute_error(y_train, train_preds)
print(f"Training RMSE: {rmse:.4f}")
print(f"Training MAE: {mae:.4f}")

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='neg_mean_absolute_error')
print(f"Cross-Validated MAE: {-np.mean(scores):.4f}")

Training RMSE: 0.5439
Training MAE: 0.4359
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000075 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 458
[LightGBM] [Info] Number of data points in the train set: 355, number of used features: 9
[LightGBM] [Info] Start training from score 3.635211
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000080 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 467
[LightGBM] [Info] Number of data points in the train set: 355, number of used features: 9
[LightGBM] [Info] Start training from score 3.622535
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000063 seconds.
You can set `force_row_wise=true` to remove the o

In [None]:
# Step 9: Predict Test Set
preds = model.predict(X_test)
preds = np.clip(np.round(preds), 1, 5).astype(int)

In [None]:
# Step 10: Create Submission
submission = pd.DataFrame({
    'filename': test_features_df['filename'],
    'label': preds
})
submission.to_csv('submission.csv', index=False)