In [None]:
!pip install openai-whisper librosa language-tool-python scikit-learn pandas numpy


Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting language-tool-python
  Downloading language_tool_python-2.9.2-py3-none-any.whl.metadata (54 kB)
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper)


## Step 1: Extracting the Dataset

The dataset is provided as a `.zip` archive.
We first extract all of its files into a dedicated directory for further processing.

This step unzips:
- `train.csv`
- `test.csv`
- `sample_submission.csv`
- `audios_train/` and `audios_test/` folders with the audio files
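
Before extracting, it can help to list the archive's members and confirm the expected files are present. The sketch below builds a tiny in-memory archive so it runs anywhere; the helper name `missing_members` is hypothetical, and the `dataset/` prefix is an assumption that mirrors the paths read later in the notebook.

```python
import io
import zipfile

def missing_members(zf, expected):
    """Return the expected names that are absent from the open archive."""
    names = set(zf.namelist())
    return [name for name in expected if name not in names]

# Stand-in archive; the real file would be opened with zipfile.ZipFile(zip_path).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("dataset/train.csv", "filename,label\n")
    zf.writestr("dataset/test.csv", "filename\n")

expected = [
    "dataset/train.csv",
    "dataset/test.csv",
    "dataset/sample_submission.csv",
]
with zipfile.ZipFile(buffer) as zf:
    missing = missing_members(zf, expected)

print("Missing from archive:", missing)
```

An empty list means every expected file is in the archive and extraction can proceed.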


In [None]:
import zipfile

zip_path = "/shl-intern-hiring-assessment.zip"
extract_path = "/shl_data"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("✅ Zip file extracted successfully.")


✅ Zip file extracted successfully.


##  Step 2: Loading the Whisper Model

We use OpenAI's `whisper` model to transcribe the spoken audio into text. 
This transcription will later be used to extract grammatical features and predict grammar scores.

Currently, we use the `"base"` version of the model, which balances speed and accuracy.
It can be swapped for `"medium"` or `"large"` for better transcription quality, at the cost of speed and memory.



In [6]:
import whisper
model = whisper.load_model("base")


100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 165MiB/s]


##  Step 3: Load Training Data

We load the training data from `train.csv`, which contains the list of audio file names and their respective grammar score labels.

We also create a new empty column called `"transcription"` in the DataFrame.
This column will later be filled with the text transcriptions generated from each audio file using the Whisper model.


In [8]:
import pandas as pd
train_df = pd.read_csv("/shl_data/dataset/train.csv")
train_df["transcription"] = ""


### Viewing Column Names

We inspect the columns in the training dataset to understand what features are available.


In [10]:
train_df.columns


Index(['filename', 'label', 'transcription'], dtype='object')

##  Step 4: Transcribing Audio Files using Whisper

In this step, we transcribe each audio file in the training dataset using the Whisper `base` model loaded in Step 2.
The transcriptions are stored in the `transcription` column created in Step 3.

This transcription will later be used to extract text-based features for grammar scoring.


In [11]:
import os
from tqdm import tqdm

audio_dir = "/shl_data/dataset/audios_train"

for i in tqdm(range(len(train_df))):
    file_name = train_df.loc[i, "filename"]
    file_path = os.path.join(audio_dir, file_name)

    result = model.transcribe(file_path)
    train_df.loc[i, "transcription"] = result["text"]


100%|██████████| 444/444 [3:56:48<00:00, 32.00s/it]
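
The run above takes close to four hours, so an interruption is costly. One way to make the loop resumable is to persist each transcript as soon as it is produced. This is a minimal sketch, not the notebook's method: `transcribe_with_cache` and the JSON cache file are hypothetical, and `fake_transcribe` stands in for a call such as `model.transcribe(path)["text"]` so the example runs without Whisper.

```python
import json
import os
import tempfile

def transcribe_with_cache(file_paths, transcribe_fn, cache_path):
    """Transcribe each path once, persisting results so a rerun can resume."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            cache = json.load(fh)
    for path in file_paths:
        if path in cache:
            continue  # already transcribed in an earlier run
        cache[path] = transcribe_fn(path)
        # Rewrite the cache after every file so an interruption loses little work.
        with open(cache_path, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)
    return cache

# Stand-in transcriber (assumption): records which files were actually processed.
calls = []
def fake_transcribe(path):
    calls.append(path)
    return f"text for {os.path.basename(path)}"

cache_file = os.path.join(tempfile.mkdtemp(), "transcripts.json")
first = transcribe_with_cache(["a.wav", "b.wav"], fake_transcribe, cache_file)
second = transcribe_with_cache(["a.wav", "b.wav", "c.wav"], fake_transcribe, cache_file)
print(len(calls))  # 3: the second run only transcribes c.wav
```

The returned dictionary maps file paths to transcripts, so it can be copied into the `transcription` column once the run completes.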


### Saving Transcriptions to CSV

In [12]:
train_df.to_csv("/content/train_with_transcription.csv", index=False)
print("✅ Transcriptions saved to CSV.")


✅ Transcriptions saved to CSV.


##  Step 5: Preprocessing Transcriptions (Text Cleaning)

Now that we have the transcriptions, we clean the text to make it easier for the model to learn meaningful patterns.

The cleaning steps include:
- Converting text to lowercase
- Removing punctuation and special characters
- Removing extra spaces

The cleaned version of each transcript is saved in a new column called `clean_text`.


In [13]:
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaner to the transcription column
train_df["clean_text"] = train_df["transcription"].apply(clean_text)

print("✅ Text preprocessing done.")
train_df[["transcription", "clean_text"]].head()


✅ Text preprocessing done.


|   | transcription | clean_text |
|---|---|---|
| 0 | 1.5% 1.5% 1.5% 1.5% 1.5% 1.5% 1.5% 1.5% 1.5% ... | 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1... |
| 1 | The playground looks like very clear and neat... | the playground looks like very clear and neat ... |
| 2 | My Girl is to become an Electrical Employee a... | my girl is to become an electrical employee an... |
| 3 | My favorite place is in Andhra Pradesh. It is... | my favorite place is in andhra pradesh it is i... |
| 4 | My favorite places, my favorite places, Mutti... | my favorite places my favorite places mutti an... |
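
Row 0 in the preview is a single token repeated, which suggests the transcription did not capture usable speech for that file. A minimal sketch for flagging such rows follows; the helper `unique_token_ratio` and the 0.2 threshold are illustrative assumptions, not part of the original pipeline.

```python
def unique_token_ratio(text):
    """Share of distinct tokens in a whitespace-tokenised string (0.0 if empty)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

samples = [
    "15 15 15 15 15 15 15 15 15 15",
    "the playground looks like very clear and neat",
]
for text in samples:
    ratio = unique_token_ratio(text)
    flag = "suspicious" if ratio < 0.2 else "ok"  # threshold is a guess
    print(f"{ratio:.2f} {flag}")
```

Applied to `train_df["clean_text"]`, a check like this would give a list of files worth listening to before trusting their features.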


##  Step 6: Feature Extraction using TF-IDF

We convert the cleaned text into numerical features using **TF-IDF (Term Frequency-Inverse Document Frequency)**.

This helps the model understand the importance of words across all transcripts.

- We use `TfidfVectorizer` from scikit-learn.
- `max_features=1000` limits the vocabulary to the 1000 terms with the highest term frequency across the corpus.

The result is a matrix where each row represents an audio sample and each column represents a word (with the default settings, single words only, not phrases).
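
To make the weighting concrete, here is a small pure-Python sketch of the computation. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 row normalisation, which correspond to scikit-learn's default settings. The function name `tfidf_rows` is hypothetical, and tokenisation is simplified to whitespace splitting, so the numbers will not match `TfidfVectorizer` on every input.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [
    "the playground looks clean",
    "the place is in andhra pradesh",
    "the place is clean",
]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first document, `the` appears in every transcript and so receives a lower weight than `playground`, which appears in only one.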


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(train_df["clean_text"])

print("✅ TF-IDF features extracted. Shape:", X.shape)


✅ TF-IDF features extracted. Shape: (444, 1000)


##  Step 7: Train a Random Forest Regression Model

Now we train a **Random Forest Regressor** to predict the grammar score based on TF-IDF features.

Steps:
- `train_test_split` is used to split the data into training and validation sets (80/20).
- `RandomForestRegressor` is used for modeling. It’s a good baseline model that works well with tabular data.
- Finally, we evaluate it using **Mean Squared Error (MSE)** to measure how far the predictions are from the actual scores.

This gives us a basic idea of how well our features are performing.


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Target variable
y = train_df["label"]

# Split data for training and validation (80/20)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
# Note: this rebinds `model`, which previously held the Whisper model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the validation split and evaluate
y_pred = model.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
print("✅ Model trained. Validation MSE:", mse)


✅ Model trained. Validation MSE: 0.9502742411979295
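
MSE is expressed in squared score units, so its square root (RMSE) is often easier to interpret. The sketch below computes MSE by hand on made-up values and then converts the validation MSE reported above; `mean_squared_error_py` is a hypothetical helper, not the scikit-learn function.

```python
import math

def mean_squared_error_py(y_true, y_pred):
    """Average of squared differences between paired values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values, not taken from the dataset.
toy_true = [3.0, 4.5, 2.0, 5.0]
toy_pred = [3.5, 4.0, 3.0, 4.0]
print(mean_squared_error_py(toy_true, toy_pred))  # 0.625

# RMSE for the validation MSE reported above.
reported_mse = 0.9502742411979295
print(round(math.sqrt(reported_mse), 3))  # 0.975
```

An RMSE of about 0.975 means the validation predictions are off by roughly one point on the label scale.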


##  Step 8: Transcribe Test Set Audio Files

In this step, we apply the **Whisper ASR model** to transcribe the audio files from the test set.

- We load the test set CSV (`test.csv`).
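- For each audio file, we use the same Whisper `base` checkpoint to convert speech into text. It is reloaded as `whisper_model` because `model` now refers to the Random Forest regressor.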
- Transcriptions are stored in a new column called `"transcription"`.

This ensures that both train and test data go through the same transcription pipeline before feature extraction.


In [18]:
import whisper

whisper_model = whisper.load_model("base")

test_df = pd.read_csv("/shl_data/dataset/test.csv")
test_df["transcription"] = ""

audio_test_dir = "/shl_data/dataset/audios_test"

for i in tqdm(range(len(test_df))):
    file_name = test_df.loc[i, "filename"]
    file_path = os.path.join(audio_test_dir, file_name)
    result = whisper_model.transcribe(file_path)
    test_df.loc[i, "transcription"] = result["text"]


100%|██████████| 195/195 [1:20:40<00:00, 24.82s/it]


##  Step 9: Clean Transcribed Text (Test Set)

We now clean the **transcriptions** from the test set with the same rules used for the training data.

- Convert text to lowercase
- Remove punctuation and special characters
- Normalize whitespace (remove extra spaces)

This ensures consistency between the training and test data for feature extraction.


In [20]:
import re

def clean_text(text):
    # Same steps and patterns as the training-set cleaner in Step 5
    text = text.lower()
    # [^\w\s] keeps letters, digits, and underscores, matching the training data
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

test_df["clean_text"] = test_df["transcription"].apply(clean_text)
print("✅ Cleaned test transcriptions.")


✅ Cleaned test transcriptions.
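
The character class in the cleaning regex decides which tokens reach the vectorizer, so the training and test cleaners need to share it. The comparison below runs `[^\w\s]` and a narrower `[^a-z\s]` on a made-up sentence; `clean_with` is a hypothetical helper written only for this illustration.

```python
import re

def clean_with(pattern, text):
    """Lowercase, drop characters matching `pattern`, collapse whitespace."""
    text = re.sub(pattern, "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

sample = "Room 12 costs 1.5% more, breakfast included."
print(clean_with(r"[^\w\s]", sample))   # room 12 costs 15 more breakfast included
print(clean_with(r"[^a-z\s]", sample))  # room costs more breakfast included
```

The narrower pattern drops digits entirely, so a numeric token learned from one split would never appear in a split cleaned with the other pattern.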


### Transforming Test Data with the TF-IDF Vectorizer

We use the **same TF-IDF vectorizer** (fitted on training data) to transform the cleaned test transcriptions into numerical feature vectors.

This step ensures the test data is in the same format as the training data before making predictions.
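
The reason `transform` is used instead of `fit_transform` is that the column space is fixed by the training vocabulary: test words never seen in training have no column and are dropped. A simplified sketch with plain term counts (not TF-IDF weights) illustrates this; `fit_vocabulary` and `transform_counts` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each training token to a column index."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform_counts(docs, vocabulary):
    """Count tokens per document, using only columns known from training."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

vocabulary = fit_vocabulary(["my favorite place", "the place is clean"])
test_rows = transform_counts(["my favorite city is clean"], vocabulary)
print(len(vocabulary), test_rows)
```

Here `city` is absent from the training vocabulary, so the test row still has six columns and the unseen word contributes nothing.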


In [21]:
X_test = vectorizer.transform(test_df["clean_text"])
print("✅ Transformed test data shape:", X_test.shape)


✅ Transformed test data shape: (195, 1000)


### Predicting Grammar Scores on Test Data

Now that the test data is transformed, we use the trained model to predict grammar scores for each test sample.
These predictions will be used in the final submission.


In [22]:
test_preds = model.predict(X_test)
print("✅ Predictions done.")


✅ Predictions done.


##  Step 10: Create Submission File

We prepare the final submission file in the required format with filenames and their predicted grammar scores.
The file is saved as `submission.csv`, ready to be uploaded.


In [23]:
submission = pd.DataFrame({
    "filename": test_df["filename"],
    "label": test_preds
})

submission.to_csv("submission.csv", index=False)
print("✅ Submission file saved as submission.csv")


✅ Submission file saved as submission.csv


##  Step 11: Evaluate Model with Pearson Correlation

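Since the competition is evaluated using Pearson correlation, we calculate it here from the actual grammar scores and the model’s predictions on the full training set. Because 80% of these rows were also used to fit the model, this figure is optimistic; the correlation on the held-out validation split is a better estimate of test performance.
A higher correlation means the predictions track the real scores more closely.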
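
For reference, Pearson correlation is the covariance of two series divided by the product of their standard deviations. The hand-written version below is a sketch on toy numbers (`pearson_r` is a hypothetical helper; the notebook itself uses `scipy.stats.pearsonr`). It also shows that the metric ignores scale and offset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

scores = [1.0, 2.0, 3.0, 4.0, 5.0]            # toy labels, not from the dataset
compressed = [2.5 + 0.2 * s for s in scores]  # right ordering, wrong scale
print(round(pearson_r(scores, compressed), 4))  # 1.0
```

Predictions squeezed into a narrow band still reach a correlation of 1.0 here, so a high Pearson score can coexist with a large MSE; the two metrics are worth reading together.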


In [25]:
from scipy.stats import pearsonr

# Actual labels from train_df
y_true = train_df["label"].values

# Predicted labels from the model on the full training data
y_pred = model.predict(X)

# Pearson correlation between actual and predicted labels
corr, _ = pearsonr(y_true, y_pred)
print(f"📈 Pearson Correlation (Train Set): {corr:.4f}")


📈 Pearson Correlation (Train Set): 0.8977


##  Step 12: Visualization - True vs Predicted Grammar Scores

This scatter plot helps visualize the model's predictions compared to the actual grammar labels from the training data.  
If the model performs well, most of the points should lie close to a diagonal line, showing strong correlation between predictions and true scores.


In [None]:
import matplotlib.pyplot as plt

plt.scatter(y_true, y_pred, alpha=0.5)
plt.xlabel("True Labels")
plt.ylabel("Predicted Labels")
plt.title("📈 True vs Predicted Grammar Scores")
plt.grid(True)
plt.show()


##  Step 13: Save Final Predictions - submission.csv

We write the final `submission.csv` containing the filenames and their predicted grammar scores. This cell produces the same `filename` and `label` columns as Step 10 and overwrites that file.  
This is the file to be uploaded for the competition.


In [None]:
#  Create submission.csv
submission_df = pd.DataFrame({
    "filename": test_df["filename"],
    "label": test_preds
})
submission_df.to_csv("submission.csv", index=False)
print("✅ submission.csv created!")


## Summary

- Used Whisper ASR (`base`) to transcribe audio files.
- Cleaned text using regex and lowercase normalization.
- Extracted features using TF-IDF (top 1000 features).
- Trained a RandomForestRegressor model (validation MSE: 0.9503).
- Evaluated with Pearson correlation (0.8977 on the training set).
- Created `submission.csv` with predictions on the test set.
- Visualized predictions vs true labels.

