<a href="https://colab.research.google.com/github/Nuwantha97/Sinhala_spell_and_grammer_checker/blob/Notebooks/Grammer_transformer_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Mount Google Drive so the dataset and the model directory are accessible

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [4]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [5]:
import pandas as pd
from datasets import Dataset
from transformers import (
    XLMRobertaTokenizer,
    XLMRobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
import numpy as np
import torch
import evaluate
from typing import Dict, List, Tuple
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

class SinhalaGrammarChecker:
    def __init__(self):
        self.model_path = "/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/spell check/model"
        self.tokenizer = None
        self.model = None

    def preprocess_text(self, text: str) -> str:
        """Clean and normalize text"""
        return text.strip()

    def create_dataset(self, texts: List[str], labels: List[int]) -> Dataset:
        """Create a HuggingFace dataset"""
        return Dataset.from_dict({
            'text': [self.preprocess_text(str(text)) for text in texts],
            'label': labels
        })

    def prepare_training_data(self, file_path: str) -> Tuple[Dataset, Dataset]:
        """Prepare training and validation datasets"""
        df = pd.read_csv(file_path)

        texts = []
        labels = []

        # Add incorrect sentences (label 1)
        incorrect_sentences = df['incorrect_sentence'].tolist()
        texts.extend(incorrect_sentences)
        labels.extend([1] * len(incorrect_sentences))

        # Add correct sentences (label 0)
        correct_sentences = df['correct_sentence'].tolist()
        texts.extend(correct_sentences)
        labels.extend([0] * len(correct_sentences))

        # Shuffle the data (fixed seed so the split is reproducible)
        combined = list(zip(texts, labels))
        np.random.default_rng(42).shuffle(combined)
        texts, labels = zip(*combined)

        # Create a 90/10 train/validation split
        split_idx = int(0.9 * len(texts))
        train_texts, val_texts = texts[:split_idx], texts[split_idx:]
        train_labels, val_labels = labels[:split_idx], labels[split_idx:]

        return (
            self.create_dataset(train_texts, train_labels),
            self.create_dataset(val_texts, val_labels)
        )

    def tokenize_function(self, examples: Dict) -> Dict:
        """Tokenize the texts and prepare for training"""
        tokenized = self.tokenizer(
            examples['text'],
            truncation=True,
            max_length=128,
            padding='max_length'
        )
        tokenized['labels'] = examples['label']
        return tokenized

    def compute_metrics(self, eval_pred: Tuple) -> Dict:
        """Compute evaluation metrics"""
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)

        metrics = {}

        # Calculate accuracy
        accuracy = evaluate.load("accuracy")
        metrics.update(accuracy.compute(predictions=predictions, references=labels))

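        # Calculate precision, recall, and F1 score for the "has error" class (label 1).
        # zero_division=0 returns 0.0 without a warning when no positives are predicted.
        metrics['precision'] = float(precision_score(labels, predictions, average='binary', zero_division=0))
        metrics['recall'] = float(recall_score(labels, predictions, average='binary', zero_division=0))
        metrics['f1'] = float(f1_score(labels, predictions, average='binary', zero_division=0))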

        return metrics

    def train(self, train_file: str):
        """Train the model"""
        print("Preparing datasets...")
        train_dataset, val_dataset = self.prepare_training_data(train_file)

        print("Initializing tokenizer...")
        self.tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

        print("Tokenizing datasets...")
        tokenized_train = train_dataset.map(
            self.tokenize_function,
            batched=True,
            remove_columns=train_dataset.column_names
        )
        tokenized_val = val_dataset.map(
            self.tokenize_function,
            batched=True,
            remove_columns=val_dataset.column_names
        )

        data_collator = DataCollatorWithPadding(tokenizer=self.tokenizer)

        print("Initializing model...")
        self.model = XLMRobertaForSequenceClassification.from_pretrained(
            'xlm-roberta-base',
            num_labels=2
        )

        training_args = TrainingArguments(
            output_dir=self.model_path,
            learning_rate=1e-5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=3,
            weight_decay=0.01,
            evaluation_strategy="steps",
            eval_steps=100,
            save_strategy="steps",
            save_steps=100,
            load_best_model_at_end=True,
            metric_for_best_model="accuracy",
            greater_is_better=True,
            push_to_hub=False,
            warmup_ratio=0.1,
            logging_steps=50,
            gradient_accumulation_steps=2,
            fp16=True
        )

        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_val,
            tokenizer=self.tokenizer,
            data_collator=data_collator,
            compute_metrics=self.compute_metrics
        )

        print("Training model...")
        trainer.train()

        print("Saving model...")
        trainer.save_model(self.model_path)
        self.tokenizer.save_pretrained(self.model_path)

        print("\nFinal Evaluation Metrics:")
        final_metrics = trainer.evaluate()
        for key, value in final_metrics.items():
            print(f"{key}: {value:.4f}")

    def get_correction(self, text: str, df: pd.DataFrame) -> str | None:
        """Look up the correction for an exact match in the dataset; None if absent"""
        match = df[df['incorrect_sentence'] == text]
        if not match.empty:
            return match.iloc[0]['correct_sentence']
        return None

    def check_grammar(self, text: str, df: pd.DataFrame) -> Dict:
        """Check grammar and provide correction"""
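        # Load the saved model lazily; compare against None rather than relying on truthiness
        if self.model is None or self.tokenizer is None:
            self.tokenizer = XLMRobertaTokenizer.from_pretrained(self.model_path)
            self.model = XLMRobertaForSequenceClassification.from_pretrained(self.model_path)

        device = torch.device('cpu')
        self.model = self.model.to(device)
        self.model.eval()  # make sure dropout is disabled during inference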

        text = self.preprocess_text(text)

        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=128,
            padding='max_length'
        )

        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.softmax(outputs.logits, dim=1)
            has_error = torch.argmax(predictions).item()
            confidence = predictions[0][has_error].item()

        correction = None
        if has_error == 1:
            correction = self.get_correction(text, df)

        return {
            'text': text,
            'has_error': bool(has_error),
            'confidence': confidence,
            'correction': correction,
            'suggestion': correction if correction else ('Grammatical error detected' if has_error else 'No grammatical errors detected.')
        }
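
One property of `prepare_training_data` is worth noting: incorrect and correct sentences are pooled before shuffling, so the two halves of a row can land on opposite sides of the 90/10 split, and validation sentences may be near-duplicates of training sentences. A minimal sketch of a pair-aware alternative follows. `split_pairs`, its parameters, and the toy data are hypothetical names for illustration, not part of the notebook, and plain tuples stand in for the DataFrame rows.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]      # (incorrect_sentence, correct_sentence)
Example = Tuple[str, int]   # (text, label), label 1 = has error


def split_pairs(
    pairs: List[Pair], val_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[Example], List[Example]]:
    """Split by row first, then expand each side into labelled examples."""
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    val_idx = set(indices[: int(len(pairs) * val_fraction)])

    def expand(rows: List[Pair]) -> List[Example]:
        examples: List[Example] = []
        for incorrect, correct in rows:
            examples.append((incorrect, 1))
            examples.append((correct, 0))
        return examples

    train_rows = [p for i, p in enumerate(pairs) if i not in val_idx]
    val_rows = [p for i, p in enumerate(pairs) if i in val_idx]
    return expand(train_rows), expand(val_rows)


# Toy usage: 20 hypothetical rows give 36 training and 4 validation examples.
toy_pairs = [(f"bad {i}", f"good {i}") for i in range(20)]
toy_train, toy_val = split_pairs(toy_pairs)
print(len(toy_train), len(toy_val))
```

Both sentences of a row stay on the same side, so the validation score is less likely to be inflated by sentences that differ from a training sentence only in the injected error.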



In [6]:
def evaluate_model(checker, test_df):
    """Evaluate model performance with balanced testing"""
    all_predictions = []
    all_labels = []
    results = []

    print("\nEvaluating model performance...")

    # Test both incorrect and correct sentences
    for _, row in test_df.iterrows():
        # Test incorrect sentence
        result = checker.check_grammar(row['incorrect_sentence'], test_df)
        all_predictions.append(int(result['has_error']))
        all_labels.append(1)
        results.append({
            'sentence': row['incorrect_sentence'],
            'expected': 1,
            'predicted': int(result['has_error']),
            'confidence': result['confidence'],
            'correction': result['correction']
        })

        # Test correct sentence
        result = checker.check_grammar(row['correct_sentence'], test_df)
        all_predictions.append(int(result['has_error']))
        all_labels.append(0)
        results.append({
            'sentence': row['correct_sentence'],
            'expected': 0,
            'predicted': int(result['has_error']),
            'confidence': result['confidence'],
            'correction': result['correction']
        })

    # Calculate metrics
    accuracy = sum(1 for x, y in zip(all_predictions, all_labels) if x == y) / len(all_labels)
    precision = precision_score(all_labels, all_predictions, average='binary')
    recall = recall_score(all_labels, all_predictions, average='binary')
    f1 = f1_score(all_labels, all_predictions, average='binary')

    print("\nTest Metrics:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Display confusion matrix
    cm = confusion_matrix(all_labels, all_predictions)
    print("\nConfusion Matrix:")
    print("TN FP")
    print("FN TP")
    print(cm)

    # Show sample predictions
    print("\nSample Predictions (5 correct and 5 incorrect sentences):")
    correct_samples = [r for r in results if r['expected'] == 0][:5]
    incorrect_samples = [r for r in results if r['expected'] == 1][:5]

    print("\nCorrect Sentences:")
    for sample in correct_samples:
        print(f"\nInput: {sample['sentence']}")
        print(f"Predicted has error: {bool(sample['predicted'])}")
        print(f"Confidence: {sample['confidence']:.2f}")

    print("\nIncorrect Sentences:")
    for sample in incorrect_samples:
        print(f"\nInput: {sample['sentence']}")
        print(f"Predicted has error: {bool(sample['predicted'])}")
        print(f"Confidence: {sample['confidence']:.2f}")
        if sample['correction']:
            print(f"Suggested correction: {sample['correction']}")

def main():
    # Initialize checker
    checker = SinhalaGrammarChecker()

    # Load and split dataset
    print("Loading and splitting dataset...")
    full_df = pd.read_csv('/content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/spell check/merged_sentences.csv')

    # Shuffle and split the dataset
    train_df = full_df.sample(frac=0.8, random_state=42)
    test_df = full_df.drop(train_df.index)

    # Save splits
    train_df.to_csv('train_data.csv', index=False)
    test_df.to_csv('test_data.csv', index=False)

    print(f"Dataset split: {len(train_df)} training samples, {len(test_df)} test samples")

    # Train model
    print("\nTraining model...")
    checker.train('train_data.csv')

    # Evaluate model
    evaluate_model(checker, test_df)

if __name__ == "__main__":
    main()

Loading and splitting dataset...
Dataset split: 12041 training samples, 3010 test samples

Training model...
Preparing datasets...
Initializing tokenizer...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Tokenizing datasets...


Map:   0%|          | 0/21673 [00:00<?, ? examples/s]

Map:   0%|          | 0/2409 [00:00<?, ? examples/s]
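
The two `Map` counters (21673 and 2409 examples) follow from the split logic: each of the 12041 training rows contributes one incorrect and one correct sentence, and 90% of the pooled examples go to training. The sketch below reproduces that arithmetic and also works out the effective batch size implied by the `TrainingArguments`. It assumes a single device, which the log does not state explicitly.

```python
import math

train_rows = 12041                    # rows in train_data.csv, from the log
examples = 2 * train_rows             # one incorrect + one correct sentence per row
split_idx = int(0.9 * examples)       # same expression as in prepare_training_data

train_examples = split_idx
val_examples = examples - split_idx
print(train_examples, val_examples)   # matches the two Map counters

per_device_train_batch_size = 16
gradient_accumulation_steps = 2
# Assumption: one device, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
print(effective_batch, batches_per_epoch)
```

With two batches per optimizer update, this gives roughly 677 updates per epoch and a little over 2000 for three epochs; the exact count depends on how the `Trainer` version in use handles the last partial accumulation window.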

Initializing model...



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training model...

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
100,1.4235,0.698791,0.507264,0.0,0.0,0.0
200,1.3881,0.648312,0.64093,0.822,0.346251,0.487255
300,1.242,0.594068,0.730178,0.75743,0.665543,0.70852
400,1.1023,0.51518,0.747198,0.881266,0.562763,0.686889
500,1.0515,0.514698,0.752179,0.903005,0.556866,0.6889
600,1.0834,0.474658,0.781237,0.86105,0.663016,0.749167
700,1.011,0.458069,0.789539,0.84,0.707666,0.768176
800,0.9322,0.465415,0.789124,0.876804,0.665543,0.756705
900,0.9734,0.441565,0.792445,0.882943,0.667228,0.760077
1000,0.9601,0.433122,0.796181,0.911348,0.649537,0.758485
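
Because `load_best_model_at_end=True` is combined with `metric_for_best_model="accuracy"`, the checkpoint that gets restored depends on which metric is used for ranking. The sketch below parses the logged rows shown here (only the first 1000 steps appear in this log) and compares the best step by accuracy with the best step by F1. The variable names are illustrative only.

```python
import csv
import io

# Rows copied from the training log shown in this notebook output.
LOG = """Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
100,1.4235,0.698791,0.507264,0.0,0.0,0.0
200,1.3881,0.648312,0.64093,0.822,0.346251,0.487255
300,1.242,0.594068,0.730178,0.75743,0.665543,0.70852
400,1.1023,0.51518,0.747198,0.881266,0.562763,0.686889
500,1.0515,0.514698,0.752179,0.903005,0.556866,0.6889
600,1.0834,0.474658,0.781237,0.86105,0.663016,0.749167
700,1.011,0.458069,0.789539,0.84,0.707666,0.768176
800,0.9322,0.465415,0.789124,0.876804,0.665543,0.756705
900,0.9734,0.441565,0.792445,0.882943,0.667228,0.760077
1000,0.9601,0.433122,0.796181,0.911348,0.649537,0.758485
"""

rows = list(csv.DictReader(io.StringIO(LOG)))

# Rank the logged evaluation steps by two different metrics.
best_by_accuracy = max(rows, key=lambda r: float(r["Accuracy"]))
best_by_f1 = max(rows, key=lambda r: float(r["F1"]))

print("best by accuracy:", best_by_accuracy["Step"], best_by_accuracy["Accuracy"])
print("best by F1:      ", best_by_f1["Step"], best_by_f1["F1"])
```

Within these ten rows the two metrics disagree: accuracy peaks at step 1000, where precision is highest but recall has dropped, while F1 peaks at step 700. The all-zero precision, recall, and F1 at step 100 mean the model had no true positives at that point.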



Could not locate the best model at /content/drive/MyDrive/Projects/Sinhala Spell and Grammer checker/spell check/model/checkpoint-1800/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.


Saving model...

Final Evaluation Metrics:


eval_loss: 0.4141
eval_accuracy: 0.8257
eval_precision: 0.8999
eval_recall: 0.7270
eval_f1: 0.8043
eval_runtime: 13.0129
eval_samples_per_second: 185.1240
eval_steps_per_second: 11.6040
epoch: 2.9963

Evaluating model performance...

Test Metrics:
Accuracy: 0.8221
Precision: 0.8975
Recall: 0.7272
F1 Score: 0.8035

Confusion Matrix:
TN FP
FN TP
[[2760  250]
 [ 821 2189]]
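
The reported test metrics can be re-derived directly from the four cells of this confusion matrix, which is a useful sanity check on the `TN FP / FN TP` layout. A short worked example, using only the counts printed here:

```python
# Counts copied from the confusion matrix (rows: true label, columns: predicted label).
tn, fp = 2760, 250    # correct sentences: kept as correct / flagged as erroneous
fn, tp = 821, 2189    # incorrect sentences: missed / detected

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of the sentences flagged, how many really had an error
recall = tp / (tp + fn)       # of the sentences with an error, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The values agree with the printed test metrics to four decimal places. The gap between precision and recall comes from the 821 incorrect sentences that the model classified as correct, compared with only 250 correct sentences flagged as erroneous.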

Sample Predictions (5 correct and 5 incorrect sentences):

Correct Sentences:

Input: මිනිසුන් දෙදෙනෙකු මෝටර් රථ ක්‍රීඩාවක යෙදෙයි
Predicted has error: False
Confidence: 0.96

Input: මෝටර්සයිකල් තරඟයකදී තරඟකරුවෙක් අනෙකා හඹා යයි
Predicted has error: False
Confidence: 0.91

Input: මෝටර්සයිකල් තරඟයකදී තරඟකරුවෙක් අනෙකා හඹා යයි
Predicted has error: False
Confidence: 0.91

Input: ක්‍රීඩකයන් දෙදෙනෙකු මෝටර් රථයක් පදවයි
Predicted has error: False
Confidence: 0.95

Input: මිනිසෙකු වාහනයක් අලුත් වැඩියා කරයි
Predicted has error: False
Confidence: 0.94

Incorrect Sentences:

Input: මිනිසුන් දෙදෙනෙකු මෝටර් රථ ක්‍රීඩාාවක යෙදෙයි
Predic