<a href="https://colab.research.google.com/github/IshaSarangi/Edureka_Notes/blob/main/Edureka_Machine_Translation_Eng_French_using_MarianMT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://colab.research.google.com/drive/1hHhmI65QRQhlRV-E65SwnzOciycH5__r?usp=sharing

###Demo: Automatic Evaluation and Error Detection in English-to-French Machine Translation using Neural Models

Problem Statement:

Machine Translation (MT) systems often produce grammatically and semantically acceptable outputs, yet they can make subtle errors such as lexical mismatches, incorrect tenses, or unintended pragmatic shifts. In critical domains like healthcare, law, or literature, these errors can distort meaning or compromise accuracy.

There is a need to build a robust evaluation framework that not only generates translations but also:
*   Evaluates translation quality using standard metrics like BLEU.
*   Detects errors in meaning, terminology, and tone/formality.
*   Provides explainable output for model improvement and human feedback.

###Objective
*   Translate English text to French using a pre-trained neural model (MarianMT).
*   Automatically evaluate translation accuracy using BLEU score.
*   Detect lexical, semantic, and pragmatic errors in translation.
*   Use multilingual sentence embeddings to measure meaning preservation.
*   Normalize text (e.g., via lemmatization) to reduce grammatical bias in evaluation.
*   Generate detailed outputs for translation quality assessment and debugging.

###Dataset Overview:
| ID | English (`en`)                          | French (`fr`)                          |
| -- | --------------------------------------- | -------------------------------------- |
| 0  | She looked out the window.              | Elle regarda par la fenêtre.           |
| 1  | He opened the book and started reading. | Il ouvrit le livre et commença à lire. |
| 2  | The sun was shining brightly.           | Le soleil brillait de mille feux.      |
| …  | …                                       | …                                      |

###Workflow:

Step 1: Data Loading
*   Load a multilingual dataset (EN-FR sentence pairs) from .parquet format.
*   Extract the English (en) and French (fr) columns from a JSON-like dictionary.
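
The extraction in Step 1 could look like the sketch below. It assumes each row carries a `translation` dictionary with `en` and `fr` keys; that field name and the helper `extract_pairs` are illustrative assumptions, not taken from the notebook. With pandas, rows of this shape could be obtained via `pd.read_parquet(path).to_dict("records")`.

```python
# Hypothetical rows, shaped like records read from a parquet file whose
# "translation" column holds one dict per row (an assumed layout).
records = [
    {"id": "0", "translation": {"en": "She looked out the window.",
                                "fr": "Elle regarda par la fenêtre."}},
    {"id": "1", "translation": {"en": "He opened the book and started reading.",
                                "fr": "Il ouvrit le livre et commença à lire."}},
]

def extract_pairs(rows, field="translation"):
    """Return (english, french) tuples, skipping rows missing either side."""
    pairs = []
    for row in rows:
        entry = row.get(field) or {}
        en, fr = entry.get("en"), entry.get("fr")
        if en and fr:
            pairs.append((en, fr))
    return pairs

pairs = extract_pairs(records)
print(pairs[0])
```

Skipping incomplete rows keeps the source and reference lists aligned, which the later BLEU computation depends on.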

Step 2: Translation
*   Use HuggingFace MarianMT model: Helsinki-NLP/opus-mt-en-fr
*   Translate English sentences to French.

Step 3: Evaluation
*   Use SacreBLEU to calculate the corpus-level BLEU score.
*   Compare predicted vs. reference translation.
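
To give a feel for what BLEU rewards, the sketch below computes clipped n-gram precision, the building block of the metric, for one sentence pair from the dataset. It is a simplification under stated assumptions: tokens are split on whitespace, and there is no brevity penalty and no combination across n-gram orders, so the numbers are not SacreBLEU scores. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(reference, prediction, n):
    """Share of predicted n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    ref_counts = ngrams(reference.lower().split(), n)
    pred_counts = ngrams(prediction.lower().split(), n)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

ref = "Elle regarda par la fenêtre."
pred = "Elle a regardé par la fenêtre."
p1 = clipped_precision(ref, pred, 1)
p2 = clipped_precision(ref, pred, 2)
print(f"unigram precision: {p1:.2f}, bigram precision: {p2:.2f}")
```

Here the shift from passé simple to passé composé leaves a third of the predicted unigrams unmatched, even though the meaning is preserved. This is one reason the workflow pairs BLEU with a semantic check.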

Step 4: Error Detection
*   Lexical Errors: Check whether expected terms (from a glossary) are missing from the prediction.
*   Semantic Errors: Use LaBSE embeddings to compute the cosine similarity between the predicted and reference French sentences, and flag pairs that score below the threshold (0.9 in the code).
*   Pragmatic Errors: Detect tone or formality mismatches using simple heuristics (e.g., "vous" vs. "tu").
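
The semantic check reduces to a cosine similarity followed by a cutoff. The sketch below shows that decision with made-up 3-dimensional vectors standing in for sentence embeddings; real LaBSE vectors are much longer, and the helper names are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_semantic_error(ref_vec, pred_vec, threshold):
    """Flag the pair when similarity falls below the threshold."""
    return cosine_similarity(ref_vec, pred_vec) < threshold

# Toy vectors (made up), not real embeddings.
ref_vec = [1.0, 0.0, 0.0]
close_vec = [0.9, 0.1, 0.0]   # nearly the same direction as ref_vec
far_vec = [0.5, 0.5, 0.5]     # points in a clearly different direction

# The cutoff is tunable; 0.9 matches the notebook's check_semantic.
print(is_semantic_error(ref_vec, close_vec, threshold=0.9))
print(is_semantic_error(ref_vec, far_vec, threshold=0.9))
```

Raising the threshold flags more sentence pairs; lowering it tolerates looser paraphrases.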

Step 5: Export Results
*   Output predictions, errors, and the BLEU score into a structured CSV.
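
One possible shape for the export step, using only the standard `csv` module. The column names and the `export_results` helper are assumptions for illustration. The error flags in the sample row mirror the notebook's output for that sentence, while the BLEU value is a placeholder rather than a measured score.

```python
import csv
import io

def export_results(rows, bleu_score, handle):
    """Write one CSV row per sentence, repeating the corpus BLEU score."""
    fieldnames = ["src", "ref", "pred", "lexical_error",
                  "semantic_error", "pragmatic_error", "bleu"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "bleu": round(bleu_score, 2)})

rows = [{
    "src": "The sun was shining brightly.",
    "ref": "Le soleil brillait de mille feux.",
    "pred": "Le soleil brillait brillamment.",
    "lexical_error": False,
    "semantic_error": True,
    "pragmatic_error": False,
}]

# Write to an in-memory buffer; bleu_score=50.0 is a placeholder value.
buffer = io.StringIO()
export_results(rows, bleu_score=50.0, handle=buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file handle opened with `newline=""` and `encoding="utf-8"` so accented characters and line endings are preserved.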

In [1]:
#Step 1: Install Dependencies
!pip install -q transformers sentencepiece sacrebleu nltk sentence-transformers spacy
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' o

In [2]:
#Step 2: Import Libraries
import torch
import pandas as pd
import nltk
import spacy
from transformers import MarianMTModel, MarianTokenizer
from sentence_transformers import SentenceTransformer, util
import sacrebleu
nltk.download('punkt')
nlp_fr = spacy.load("fr_core_news_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
#Step 3: Load MarianMT Translation Model (EN -> FR)
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [4]:
#Step 4: BLEU Score with SacreBLEU
def compute_bleu_sacrebleu(references, predictions):
    return sacrebleu.corpus_bleu(predictions, [references]).score

In [5]:
#Step 5: Semantic Similarity using LaBSE (French-French)
semantic_model = SentenceTransformer("sentence-transformers/LaBSE")

def normalize_french(text):
    doc = nlp_fr(text)
    return " ".join([token.lemma_ for token in doc])

def check_semantic(ref, pred):
    ref_norm = normalize_french(ref)
    pred_norm = normalize_french(pred)
    embs = semantic_model.encode([ref_norm, pred_norm])
    score = util.cos_sim(embs[0], embs[1]).item()
    return score < 0.9 #Higher threshold for stricter evaluation


In [6]:
#Step 6: Lexical Error Detection
glossary = {
    "physician": "médecin",
    "heart attack": "crise cardiaque",
    "blood pressure": "pression artérielle"
}

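def check_lexical(src, trans):
    # Flag the sentence if any glossary term found in the source
    # lacks its expected French equivalent in the translation.
    for term, correct in glossary.items():
        if term.lower() in src.lower() and correct.lower() not in trans.lower():
            return True
    # Return False only after every glossary term has been checked.
    return False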
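
For more explainable output, a variant of the glossary check could report which expected terms are missing rather than a single boolean. The sketch below is one way to do that; the function name `missing_glossary_terms` and the sample sentence pair are illustrative assumptions, not part of the original notebook.

```python
def missing_glossary_terms(src, trans, glossary):
    """Return the expected French terms that are absent from the translation."""
    src_lower, trans_lower = src.lower(), trans.lower()
    return [
        expected
        for term, expected in glossary.items()
        if term.lower() in src_lower and expected.lower() not in trans_lower
    ]

glossary = {
    "physician": "médecin",
    "heart attack": "crise cardiaque",
    "blood pressure": "pression artérielle",
}

# Made-up sentence pair for illustration only.
src = "The physician measured her blood pressure."
trans = "Le docteur a mesuré sa pression artérielle."
missing = missing_glossary_terms(src, trans, glossary)
print(missing)
```

An empty list means no lexical error; a non-empty list names exactly which glossary terms to review.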

In [7]:
#Step 7: Pragmatic Error (Heuristic)
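import re

def check_pragmatic(src, trans):
    # Match "tu" as a whole word; a plain substring test would also
    # fire on words such as "situation".
    if "<formal>" in src and re.search(r"\btu\b", trans.lower()):
        return True
    return False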

In [9]:
#Step 8: Load and Process Dataset
df = pd.read_csv("/content/opus_books_en-fr_test.csv")
subset = df[['en', 'fr']].rename(columns={'en':'src', 'fr':'ref'})

preds = [translate(src) for src in subset['src']]
refs = list(subset['ref'])

In [10]:
print(preds)

['Elle a regardé par la fenêtre.', 'Il a ouvert le livre et a commencé à lire.', 'Le soleil brillait brillamment.', 'Ils marchaient le long de la rivière.', "C'était un après-midi tranquille.", 'Elle portait une écharpe rouge.', 'Il lui chuchotait un secret.', 'Les enfants riaient et jouaient.', 'La chambre sentait des fleurs fraîches.', 'Elle tourna la page lentement.']


In [11]:
print(subset.head())

                                       src  \
0               She looked out the window.   
1  He opened the book and started reading.   
2            The sun was shining brightly.   
3             They walked along the river.   
4                It was a quiet afternoon.   

                                      ref  
0            Elle regarda par la fenêtre.  
1  Il ouvrit le livre et commença à lire.  
2       Le soleil brillait de mille feux.  
3   Ils marchaient le long de la rivière.  
4       C'était un après-midi tranquille.  


In [12]:
#Step 9: Evaluate and Display
bleu_score = compute_bleu_sacrebleu(refs, preds)

print("=== Translation Outputs & Error Analysis ===\n")

for i in range(len(subset)):
    src = subset.iloc[i]['src']
    ref = subset.iloc[i]['ref']
    pred = preds[i]

    print(f"English: {src}")
    print(f"Reference: {ref}")
    print(f"Prediction: {pred}")

    if check_lexical(src, pred):
        print("Lexical Error Detected!")
    if check_semantic(ref, pred):
        print("Semantic Error Detected!")
    if check_pragmatic(src, pred):
        print("Pragmatic Error Detected!")

    print("-"*60)

print(f"BLEU Score: {bleu_score:.2f}")

=== Translation Outputs & Error Analysis ===

English: She looked out the window.
Reference: Elle regarda par la fenêtre.
Prediction: Elle a regardé par la fenêtre.
------------------------------------------------------------
English: He opened the book and started reading.
Reference: Il ouvrit le livre et commença à lire.
Prediction: Il a ouvert le livre et a commencé à lire.
------------------------------------------------------------
English: The sun was shining brightly.
Reference: Le soleil brillait de mille feux.
Prediction: Le soleil brillait brillamment.
Semantic Error Detected!
------------------------------------------------------------
English: They walked along the river.
Reference: Ils marchaient le long de la rivière.
Prediction: Ils marchaient le long de la rivière.
------------------------------------------------------------
English: It was a quiet afternoon.
Reference: C'était un après-midi tranquille.
Prediction: C'était un après-midi tranquille.
---------------------