Grammar Scoring Engine


# 🎯 Grammar Scoring Engine for Voice Samples

This project builds an engine that evaluates spoken English by transcribing it to text and checking the transcript's grammar. It uses OpenAI's Whisper for transcription and LanguageTool for grammar checking, and produces a grammar score from the number of detected issues relative to the number of words.


## 🔧 Installing Required Libraries

We install:
- **OpenAI Whisper** for speech-to-text transcription.
- **language-tool-python** for grammar checking.
- **ffmpeg** for audio processing support (required by Whisper).
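
Because Whisper calls the `ffmpeg` binary to decode audio, it can help to confirm that the binary is on the `PATH` before transcribing. The following is a minimal sketch using only the standard library; the helper name `has_executable` is illustrative and not part of the notebook.

```python
import shutil

def has_executable(name):
    """Return True if `name` resolves to an executable on the current PATH."""
    return shutil.which(name) is not None

# Warn early instead of failing later inside the transcription call
if not has_executable("ffmpeg"):
    print("ffmpeg not found; install it before calling Whisper.")
```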


In [9]:
!pip install -q openai-whisper
!pip install -q language-tool-python
!sudo apt update && sudo apt install ffmpeg -y



## 🗂️ Rename Uploaded Audio File

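We upload the audio file through Colab and store its name in `filename` for the later cells. The cells below use the uploaded name as-is; renaming the file to a standard name (`grammar_sample.wav`) is optional and only makes it easier to reference.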
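
The rename itself could be sketched as follows, assuming the file has already been uploaded to the working directory. The helper `standardize_filename` is a hypothetical name introduced here for illustration.

```python
import os

def standardize_filename(src, dst="grammar_sample.wav"):
    """Rename `src` to `dst` (overwriting any existing `dst`) and return `dst`."""
    # Skip the rename when the file already has the target name
    if os.path.abspath(src) != os.path.abspath(dst):
        os.replace(src, dst)
    return dst
```

After the upload cell, this would be used as `filename = standardize_filename(filename)`.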


In [10]:
from google.colab import files
uploaded = files.upload()


Saving Grammar_scoring_incorrect_audio.wav to Grammar_scoring_incorrect_audio.wav


In [11]:
filename = list(uploaded.keys())[0]
print("Uploaded file:", filename)


Uploaded file: Grammar_scoring_incorrect_audio.wav


## 🗣️ Transcribe Audio to Text using Whisper

This step loads the audio file and uses OpenAI's Whisper model to transcribe it into text.
- The model used is `"base"`, one of the smaller Whisper checkpoints (a 139 MB download in the run below), which suits lightweight transcription.
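
Besides `"text"`, the dictionary returned by `transcribe` also carries a `"segments"` list with start and end times. A small sketch for printing those segments is shown here; `format_segments` is an illustrative helper, and the `demo` dictionary is a hand-made stand-in rather than real model output.

```python
def format_segments(result):
    """Render each segment of a Whisper-style result as '[start - end] text'."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text'].strip()}")
    return lines

# Stand-in with the same keys as a Whisper result (not real output)
demo = {"text": " Hello there.", "segments": [{"start": 0.0, "end": 2.5, "text": " Hello there."}]}
for line in format_segments(demo):
    print(line)
```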


In [12]:
import whisper

model = whisper.load_model("base")
result = model.transcribe(filename)

transcript = result['text']
print("🎤 Transcript:", transcript)


100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 127MiB/s]


🎤 Transcript:  The boy run quickly through the park when he suddenly trip over a rock. He had chased his dog which bark excited and wagged tail.


In [13]:
!pip install language-tool-python
import language_tool_python




In [14]:
!sudo apt-get update
!sudo apt-get install openjdk-17-jre -y



In [15]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["PATH"] = "/usr/lib/jvm/java-17-openjdk-amd64/bin:" + os.environ["PATH"]


## 🛠️  Check Grammar with LanguageTool

We use LanguageTool to:
- Analyze grammar in the transcribed text.
- Count the number of issues.
- Display all detected grammar issues.
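
The counting and display steps can also be combined into one summary structure. The sketch below relies only on the `message` and `replacements` attributes that this notebook reads from each match; `summarize_issues` is a hypothetical helper, and the `demo` object is a stand-in so the snippet runs without LanguageTool installed.

```python
from types import SimpleNamespace

def summarize_issues(matches):
    """Collect the message and the first suggestion (if any) for each match."""
    summary = []
    for m in matches:
        top = m.replacements[0] if m.replacements else None
        summary.append({"message": m.message, "top_suggestion": top})
    return summary

# Stand-in object mimicking the two attributes used above (not a real match)
demo = [SimpleNamespace(message="Possible agreement error", replacements=["trips", "tripped"])]
print(summarize_issues(demo))
```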


In [16]:
# Install the required package and restart the runtime if necessary.
!pip install language-tool-python
# Import the module after installing it.
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')
matches = tool.check(transcript)
grammar_issues = len(matches)

print(f"🔍 Grammar Issues Found: {grammar_issues}")



Downloading LanguageTool latest: 100%|██████████| 252M/252M [00:21<00:00, 11.5MB/s]
INFO:language_tool_python.download_lt:Unzipping /tmp/tmpg8b4gp2u.zip to /root/.cache/language_tool_python.
INFO:language_tool_python.download_lt:Downloaded https://internal1.languagetool.org/snapshots/LanguageTool-latest-snapshot.zip to /root/.cache/language_tool_python.


🔍 Grammar Issues Found: 1


In [17]:
import language_tool_python

def correct_grammar(text):
    """Corrects grammar errors in a given text using language_tool_python."""
    tool = language_tool_python.LanguageTool('en-US')
    try:
        # tool.correct() applies the first suggested replacement for each match
        # at the match's position in the text. Replacing `match.context` would
        # not work, because the context is a truncated excerpt around the error.
        return tool.correct(text)
    finally:
        # Shut down the background LanguageTool server started by this instance
        tool.close()

In [18]:
from difflib import SequenceMatcher

# Correct reference (you'll define this manually or from a set)
correct_script = """The boy was running quickly through the park when he suddenly tripped over a rock.
He had been chasing his dog, which barked excitedly and wagged its tail."""

def similarity_score(predicted, reference):
    return round(SequenceMatcher(None, predicted.lower(), reference.lower()).ratio() * 100, 2)

similarity = similarity_score(transcript, correct_script)
print(f"📋 Similarity with Reference: {similarity}%")


📋 Similarity with Reference: 88.42%


In [19]:
for issue in matches:
    print("✏️ Issue:", issue.message)
    print("💡 Suggestions:", issue.replacements)
    print("📍 In context:", issue.context)
    print()


✏️ Issue: The pronoun ‘he’ is usually used with a third-person or a past tense verb.
💡 Suggestions: ['trips', 'tripped']
📍 In context: ...ickly through the park when he suddenly trip over a rock. He had chased his dog whic...



## 📊 Calculate Grammar Score

We assign a grammar score based on:
- Total words
- Grammar issues found

Formula used:
\[
\text{Score} = 100 \times \left(1 - \frac{\text{Grammar Issues}}{\text{Total Words}}\right)
\]
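
As a worked instance, the sample transcript above has 25 whitespace-separated words and LanguageTool reports 1 issue, so:

\[
\text{Score} = 100 \times \left(1 - \frac{1}{25}\right) = 96
\]

The implementation also clamps the result at 0, so a text with more issues than words does not produce a negative score.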

In [20]:
def grammar_score(text, issues):
    words = len(text.split())
    if words == 0:
        return 0
    error_ratio = issues / words
    score = max(0, 100 - (error_ratio * 100))  # Deducts points per issue
    return round(score, 2)

score = grammar_score(transcript, grammar_issues)
print(f"📊 Grammar Score: {score}/100")


📊 Grammar Score: 96.0/100


## 📌 Final Output

Displays:
- 🔊 Transcription
- 🔧 Number of grammar issues
- 📊 Grammar score
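
If the results need to be stored or passed to another tool, the same three values could be bundled into a dictionary and serialized. This is only a sketch; `build_report` and its key names are assumptions made for illustration, and the sample values are made up.

```python
import json

def build_report(transcript, issues, score):
    """Bundle the three final values into a JSON-serializable dict."""
    return {
        "transcript": transcript.strip(),  # Whisper output starts with a space
        "grammar_issues": issues,
        "grammar_score": score,
    }

# Made-up values, only to show the shape of the report
report = build_report(" The boy run quickly.", 1, 75.0)
print(json.dumps(report, indent=2))
```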


In [21]:
print("🔊 Final Transcript:\n", transcript)
print("🔧 Grammar Issues:", grammar_issues)
print("📊 Grammar Score:", score)


🔊 Final Transcript:
  The boy run quickly through the park when he suddenly trip over a rock. He had chased his dog which bark excited and wagged tail.
🔧 Grammar Issues: 1
📊 Grammar Score: 96.0


## 📦 Installing Transformers Library

We install the `transformers` library from Hugging Face, which provides a wide range of pre-trained language models. In this case, we’ll use it for grammar correction using a T5 model.


In [1]:
!pip install transformers




## 🧠 Loading the Grammar Correction Model

We use a fine-tuned version of the T5 model: `vennify/t5-base-grammar-correction`. This model is specifically trained to correct grammatical errors in English text.

The process includes:
- Initializing the tokenizer and model.
- Defining a function that takes a sentence, formats it for the model, and returns a grammatically corrected version.
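
Since the function is described as taking a sentence, a longer transcript could be split and corrected one sentence at a time. The sketch below assumes a naive split on `.`, `!` and `?` (abbreviations such as "Dr." would be split incorrectly); `correct_by_sentence` is a hypothetical wrapper, and `corrector` stands for any function that maps a string to a string, such as the T5 function defined in the next cell.

```python
import re

def correct_by_sentence(text, corrector):
    """Split on sentence-ending punctuation and correct each sentence separately."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(corrector(s) for s in sentences)

# str.upper is a placeholder corrector so the sketch runs without the model
print(correct_by_sentence("the boy run. he trip!", str.upper))
```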


In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the T5 grammar correction model
tokenizer = AutoTokenizer.from_pretrained("vennify/t5-base-grammar-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("vennify/t5-base-grammar-correction")

def correct_grammar_t5(text):
    input_text = f"grammar: {text}"
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=512, num_beams=4, early_stopping=True)
    corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return corrected_text



## 📝 Running Grammar Correction on a Sample Sentence

Here, we test our correction function on a short sample passage with several grammatical errors. The model will attempt to return a cleaner, grammatically correct version.


In [3]:
transcript = "The boy run quickly through the park when he suddenly trip over a rock. He had chase his dog, which bark excited and wag it tail."
corrected = correct_grammar_t5(transcript)
print("🔊 Transcript:", transcript)
print("✅ Corrected:", corrected)


🔊 Transcript: The boy run quickly through the park when he suddenly trip over a rock. He had chase his dog, which bark excited and wag it tail.
✅ Corrected: The boy ran quickly through the park when he suddenly trip over a rock. He had chased his dog, which barks excited and wags its tail.


In [4]:
from difflib import SequenceMatcher

def grammar_score(original, corrected):
    similarity = SequenceMatcher(None, original, corrected).ratio()
    score = round(similarity * 100, 2)
    return score


## 📐 Calculating Grammar Score

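To quantify the difference between the original and corrected text, we use a similarity metric. `SequenceMatcher.ratio()` measures how closely the two strings match at the character level, and we convert it into a percentage-based grammar score. A higher score means the model changed less of the original text. Note that this `grammar_score(original, corrected)` replaces the earlier `grammar_score(text, issues)` used with LanguageTool.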
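
A single percentage does not show which words were changed. As a complementary sketch, the same `difflib` module can list the word-level edits between the two texts; `word_edits` is an illustrative helper and not part of the notebook.

```python
from difflib import SequenceMatcher

def word_edits(original, corrected):
    """List (operation, original_words, corrected_words) for each changed span."""
    a, b = original.split(), corrected.split()
    edits = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            edits.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits

print(word_edits("The boy run quickly", "The boy ran quickly"))
# [('replace', 'run', 'ran')]
```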


In [5]:
score = grammar_score(transcript, corrected)
print("📊 Grammar Score:", score)


📊 Grammar Score: 97.71
