# Task 3: Machine Translation

## Goal
Explore how multilingual and specialist translation models perform across datasets and genres, and evaluate their quality using standard metrics.

**Language Pair:** English ↔ Russian

---

## 1. Setup and Dependencies

In [23]:
%pip install transformers sacrebleu datasets sentencepiece accelerate deep-translator torchaudio huggingface_hub -q

In [24]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, NllbTokenizer
from datasets import load_dataset
import sacrebleu
from deep_translator import GoogleTranslator
import pandas as pd
from tqdm.auto import tqdm
import os

# Optional: Hugging Face authentication for gated datasets
try:
    from huggingface_hub import login
    HF_HUB_AVAILABLE = True
except ImportError:
    HF_HUB_AVAILABLE = False
    print("Note: huggingface_hub not installed. Install with: pip install huggingface_hub")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [25]:
# Hugging Face authentication (optional - only needed for gated datasets like FLORES)
hf_token = None
if HF_HUB_AVAILABLE:
    hf_token = os.environ.get('HUGGING_FACE_HUB_TOKEN') or os.environ.get('HF_TOKEN')

    if hf_token:
        try:
            login(token=hf_token)
            print("✓ Authenticated with Hugging Face")
        except Exception as e:
            print(f"Note: Could not authenticate with Hugging Face: {e}")
    else:
        print("Note: No Hugging Face token found. FLORES dataset may require authentication.")
        print("To access FLORES:")
        print("  1. Get a token from https://huggingface.co/settings/tokens")
        print("  2. Request access at https://huggingface.co/datasets/openlanguagedata/flores_plus")
        print("  3. Set environment variable: export HUGGING_FACE_HUB_TOKEN='your_token'")
else:
    print("Note: huggingface_hub not available. FLORES dataset may require authentication.")

Note: No Hugging Face token found. FLORES dataset may require authentication.
To access FLORES:
  1. Get a token from https://huggingface.co/settings/tokens
  2. Request access at https://huggingface.co/datasets/openlanguagedata/flores_plus
  3. Set environment variable: export HUGGING_FACE_HUB_TOKEN='your_token'


## 2. Baseline Model: NLLB-200

**Model:** [`facebook/nllb-200-distilled-600M`](https://huggingface.co/facebook/nllb-200-distilled-600M)

NLLB-200 is a multilingual translation model covering about 200 languages; this checkpoint is the distilled 600M-parameter variant.

In [26]:
# Load NLLB Model and Tokenizer
model_name_nllb = "facebook/nllb-200-distilled-600M"
tokenizer_nllb = NllbTokenizer.from_pretrained(model_name_nllb)
model_nllb = AutoModelForSeq2SeqLM.from_pretrained(model_name_nllb).to(device)
print(f"✓ Loaded {model_name_nllb}")

✓ Loaded facebook/nllb-200-distilled-600M


In [27]:
# Helper function for NLLB translation
def translate_nllb(texts, src_lang, tgt_lang, model, tokenizer, batch_size=8):
    # NLLB codes: English -> eng_Latn, Russian -> rus_Cyrl
    lang_codes = {
        'en': 'eng_Latn',
        'ru': 'rus_Cyrl'
    }

    tokenizer.src_lang = lang_codes[src_lang]
    tgt_lang_code = lang_codes[tgt_lang]

    # Get the target language token ID. `convert_tokens_to_ids` does not raise for
    # unknown tokens; it returns the unknown-token ID, so check for that explicitly.
    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang_code)
    if forced_bos_token_id is None or forced_bos_token_id == tokenizer.unk_token_id:
        raise ValueError(f"Could not find token ID for language code: {tgt_lang_code}")

    translations = []
    for i in tqdm(range(0, len(texts), batch_size), desc=f"Translating {src_lang}->{tgt_lang}"):
        batch = texts[i:i+batch_size]
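        # Move inputs to wherever the passed-in model lives instead of relying on the global `device`
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(model.device)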
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                forced_bos_token_id=forced_bos_token_id,
                max_length=128
            )
        decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        translations.extend(decoded)
    return translations

## 3. Second Model: Helsinki-NLP OPUS-MT

**Models:**
- [`Helsinki-NLP/opus-mt-en-ru`](https://huggingface.co/Helsinki-NLP/opus-mt-en-ru)
- [`Helsinki-NLP/opus-mt-ru-en`](https://huggingface.co/Helsinki-NLP/opus-mt-ru-en)

Two specialist bilingual models, one per translation direction (En→Ru and Ru→En).

In [28]:
# Load Helsinki Models
model_name_en_ru = "Helsinki-NLP/opus-mt-en-ru"
tokenizer_en_ru = AutoTokenizer.from_pretrained(model_name_en_ru)
model_en_ru = AutoModelForSeq2SeqLM.from_pretrained(model_name_en_ru).to(device)
print(f"✓ Loaded {model_name_en_ru}")

model_name_ru_en = "Helsinki-NLP/opus-mt-ru-en"
tokenizer_ru_en = AutoTokenizer.from_pretrained(model_name_ru_en)
model_ru_en = AutoModelForSeq2SeqLM.from_pretrained(model_name_ru_en).to(device)
print(f"✓ Loaded {model_name_ru_en}")



✓ Loaded Helsinki-NLP/opus-mt-en-ru
✓ Loaded Helsinki-NLP/opus-mt-ru-en


In [29]:
def translate_helsinki(texts, src_lang, tgt_lang, batch_size=32):
    if src_lang == 'en' and tgt_lang == 'ru':
        model = model_en_ru
        tokenizer = tokenizer_en_ru
    elif src_lang == 'ru' and tgt_lang == 'en':
        model = model_ru_en
        tokenizer = tokenizer_ru_en
    else:
        raise ValueError("Unsupported direction for Helsinki models loaded")

    translations = []
    for i in tqdm(range(0, len(texts), batch_size), desc=f"Translating {src_lang}->{tgt_lang} (Helsinki)"):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            generated_tokens = model.generate(**inputs, max_length=128)
        decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        translations.extend(decoded)
    return translations
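
The `if`/`elif` dispatch in `translate_helsinki` grows with every direction that is added. One possible alternative is a lookup table keyed by `(source, target)`. The sketch below works on checkpoint names only and loads no models; `HELSINKI_CHECKPOINTS` and `resolve_checkpoint` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical registry: maps a (source, target) pair to an OPUS-MT checkpoint name.
HELSINKI_CHECKPOINTS = {
    ("en", "ru"): "Helsinki-NLP/opus-mt-en-ru",
    ("ru", "en"): "Helsinki-NLP/opus-mt-ru-en",
}


def resolve_checkpoint(src_lang, tgt_lang):
    """Return the checkpoint name for a direction, or raise for unsupported pairs."""
    try:
        return HELSINKI_CHECKPOINTS[(src_lang, tgt_lang)]
    except KeyError:
        raise ValueError(f"Unsupported direction: {src_lang}->{tgt_lang}") from None


print(resolve_checkpoint("en", "ru"))
```

The same table could hold `(model, tokenizer)` tuples instead of names, so that adding a language pair means adding one entry rather than another branch.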

## 4. Custom Test Set

**50 parallel sentences** (25 news + 25 fiction) for genre-specific evaluation.

In [30]:
# Define Custom Dataset
custom_dataset = [
    # News / Journalism (25 sentences)
    {"en": "The summit ended with a joint declaration on climate change.", "ru": "Саммит завершился принятием совместной декларации об изменении климата.", "genre": "news"},
    {"en": "Local authorities have announced new measures to combat traffic congestion.", "ru": "Местные власти объявили о новых мерах по борьбе с пробками.", "genre": "news"},
    {"en": "Scientists discovered a new species of orchid in the rainforest.", "ru": "Ученые обнаружили новый вид орхидей в тропическом лесу.", "genre": "news"},
    {"en": "The stock market reacted positively to the latest economic reports.", "ru": "Фондовый рынок положительно отреагировал на последние экономические отчеты.", "genre": "news"},
    {"en": "Education reform is a top priority for the new government.", "ru": "Реформа образования является главным приоритетом нового правительства.", "genre": "news"},
    {"en": "The international conference will address global security challenges.", "ru": "Международная конференция рассмотрит глобальные вызовы безопасности.", "genre": "news"},
    {"en": "Healthcare workers demand better working conditions and higher wages.", "ru": "Медицинские работники требуют улучшения условий труда и повышения зарплат.", "genre": "news"},
    {"en": "The new policy aims to reduce carbon emissions by 40% by 2030.", "ru": "Новая политика направлена на сокращение выбросов углерода на 40% к 2030 году.", "genre": "news"},
    {"en": "Tech companies are investing billions in artificial intelligence research.", "ru": "Технологические компании инвестируют миллиарды в исследования искусственного интеллекта.", "genre": "news"},
    {"en": "The peace negotiations have reached a critical stage.", "ru": "Мирные переговоры достигли критической стадии.", "genre": "news"},
    {"en": "Unemployment rates have dropped to their lowest level in a decade.", "ru": "Уровень безработицы упал до самого низкого уровня за десятилетие.", "genre": "news"},
    {"en": "The archaeological discovery sheds new light on ancient civilizations.", "ru": "Археологическое открытие проливает новый свет на древние цивилизации.", "genre": "news"},
    {"en": "The central bank raised interest rates to combat inflation.", "ru": "Центральный банк повысил процентные ставки для борьбы с инфляцией.", "genre": "news"},
    {"en": "The new legislation will come into effect next month.", "ru": "Новое законодательство вступит в силу в следующем месяце.", "genre": "news"},
    {"en": "The research team published groundbreaking findings in the medical journal.", "ru": "Исследовательская группа опубликовала революционные результаты в медицинском журнале.", "genre": "news"},
    {"en": "The trade agreement between the two countries was signed yesterday.", "ru": "Торговое соглашение между двумя странами было подписано вчера.", "genre": "news"},
    {"en": "The city council approved the construction of a new metro line.", "ru": "Городской совет одобрил строительство новой линии метро.", "genre": "news"},
    {"en": "The investigation revealed serious violations of safety regulations.", "ru": "Расследование выявило серьезные нарушения правил безопасности.", "genre": "news"},
    {"en": "The festival attracted thousands of visitors from around the world.", "ru": "Фестиваль привлек тысячи посетителей со всего мира.", "genre": "news"},
    {"en": "The company announced plans to expand its operations to Asia.", "ru": "Компания объявила о планах расширить свою деятельность в Азии.", "genre": "news"},
    {"en": "The environmental group organized a protest against deforestation.", "ru": "Экологическая группа организовала протест против вырубки лесов.", "genre": "news"},
    {"en": "The new vaccine has shown promising results in clinical trials.", "ru": "Новая вакцина показала многообещающие результаты в клинических испытаниях.", "genre": "news"},
    {"en": "The sports team won the championship for the third consecutive year.", "ru": "Спортивная команда выиграла чемпионат третий год подряд.", "genre": "news"},
    {"en": "The government launched a new initiative to support small businesses.", "ru": "Правительство запустило новую инициативу по поддержке малого бизнеса.", "genre": "news"},
    {"en": "The documentary film received critical acclaim at the international festival.", "ru": "Документальный фильм получил признание критиков на международном фестивале.", "genre": "news"},

    # Fiction / Social (25 sentences)
    {"en": "She looked out the window, wondering if he would ever return.", "ru": "Она смотрела в окно, гадая, вернется ли он когда-нибудь.", "genre": "fiction"},
    {"en": "The old house creaked in the wind, as if whispering secrets.", "ru": "Старый дом скрипел на ветру, словно нашептывая секреты.", "genre": "fiction"},
    {"en": "'I can't believe you said that!' she exclaimed.", "ru": "— Не могу поверить, что ты это сказал! — воскликнула она.", "genre": "fiction"},
    {"en": "He picked up the sword, feeling its weight in his hand.", "ru": "Он поднял меч, чувствуя его тяжесть в руке.", "genre": "fiction"},
    {"en": "The stars shone brightly in the clear night sky.", "ru": "Звезды ярко сияли на чистом ночном небе.", "genre": "fiction"},
    {"en": "She walked through the garden, her mind lost in memories of the past.", "ru": "Она шла по саду, ее мысли были погружены в воспоминания о прошлом.", "genre": "fiction"},
    {"en": "The mysterious letter arrived on a rainy Tuesday morning.", "ru": "Таинственное письмо пришло дождливым вторничным утром.", "genre": "fiction"},
    {"en": "'Why did you leave me?' he whispered into the darkness.", "ru": "— Почему ты оставил меня? — прошептал он в темноту.", "genre": "fiction"},
    {"en": "The ancient book contained secrets that could change everything.", "ru": "Древняя книга содержала секреты, которые могли изменить все.", "genre": "fiction"},
    {"en": "She felt a strange sensation, as if someone was watching her.", "ru": "Она почувствовала странное ощущение, словно кто-то наблюдает за ней.", "genre": "fiction"},
    {"en": "The music filled the room, bringing tears to her eyes.", "ru": "Музыка наполнила комнату, вызывая слезы на ее глазах.", "genre": "fiction"},
    {"en": "He had never seen such a beautiful sunset in all his years.", "ru": "Он никогда не видел такого красивого заката за все свои годы.", "genre": "fiction"},
    {"en": "The old man smiled, knowing that his time had finally come.", "ru": "Старик улыбнулся, зная, что его время наконец пришло.", "genre": "fiction"},
    {"en": "She opened the door slowly, afraid of what she might find inside.", "ru": "Она медленно открыла дверь, боясь того, что может найти внутри.", "genre": "fiction"},
    {"en": "The forest seemed to come alive as the moon rose above the trees.", "ru": "Лес, казалось, оживал, когда луна поднималась над деревьями.", "genre": "fiction"},
    {"en": "'Everything will be alright,' he said, though he didn't believe it himself.", "ru": "— Все будет хорошо, — сказал он, хотя сам в это не верил.", "genre": "fiction"},
    {"en": "The photograph brought back memories she had tried so hard to forget.", "ru": "Фотография вернула воспоминания, которые она так старалась забыть.", "genre": "fiction"},
    {"en": "He could hear the sound of footsteps approaching from behind.", "ru": "Он мог слышать звук шагов, приближающихся сзади.", "genre": "fiction"},
    {"en": "The coffee tasted bitter, just like her mood that morning.", "ru": "Кофе был горьким, как и ее настроение в то утро.", "genre": "fiction"},
    {"en": "She found herself standing at a crossroads, unsure which path to take.", "ru": "Она оказалась на перекрестке, не зная, какой путь выбрать.", "genre": "fiction"},
    {"en": "The old photograph showed a family she had never known.", "ru": "Старая фотография показывала семью, которую она никогда не знала.", "genre": "fiction"},
    {"en": "He whispered her name, and she turned around with a smile.", "ru": "Он прошептал ее имя, и она обернулась с улыбкой.", "genre": "fiction"},
    {"en": "The storm raged outside, but inside the house, all was calm.", "ru": "Буря бушевала снаружи, но внутри дома все было спокойно.", "genre": "fiction"},
    {"en": "She had waited her whole life for this moment, and now it was here.", "ru": "Она ждала этого момента всю свою жизнь, и теперь он наступил.", "genre": "fiction"},
    {"en": "The last words he spoke would haunt her for the rest of her days.", "ru": "Последние слова, которые он произнес, будут преследовать ее до конца дней.", "genre": "fiction"},
]

df_custom = pd.DataFrame(custom_dataset)
print(f"Custom dataset loaded: {len(df_custom)} sentences")
print(f"News: {len(df_custom[df_custom['genre'] == 'news'])} sentences")
print(f"Fiction: {len(df_custom[df_custom['genre'] == 'fiction'])} sentences")

Custom dataset loaded: 50 sentences
News: 25 sentences
Fiction: 25 sentences
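
A hand-written test set is easy to get subtly wrong (a duplicated sentence, an empty side, an unbalanced genre split). The following is a minimal sanity check, assuming rows shaped like the `custom_dataset` entries above; `check_parallel_dataset` is a hypothetical helper and the two sample rows are shortened for illustration.

```python
from collections import Counter


def check_parallel_dataset(rows, expected_per_genre=25):
    """Return a list of human-readable problems found in {en, ru, genre} rows."""
    problems = []
    # Genre balance: every genre should have the expected number of rows.
    genre_counts = Counter(row["genre"] for row in rows)
    for genre, count in genre_counts.items():
        if count != expected_per_genre:
            problems.append(f"genre {genre!r} has {count} rows, expected {expected_per_genre}")
    # Empty sides and duplicated English sources.
    seen = set()
    for i, row in enumerate(rows):
        if not row["en"].strip() or not row["ru"].strip():
            problems.append(f"row {i} has an empty side")
        if row["en"] in seen:
            problems.append(f"row {i} duplicates an earlier English sentence")
        seen.add(row["en"])
    return problems


sample = [
    {"en": "The stars shone brightly.", "ru": "Звезды ярко сияли.", "genre": "fiction"},
    {"en": "The stars shone brightly.", "ru": "Звезды ярко сияли.", "genre": "fiction"},
]
print(check_parallel_dataset(sample, expected_per_genre=2))
```

On the notebook's data the call would be `check_parallel_dataset(custom_dataset)`; an empty list means none of these particular problems were found.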


## 5. Commercial System: Google Translate

Using Google Translate via `deep-translator` for comparison with open-source models.

In [31]:
def translate_google(texts, src_lang, tgt_lang):
    translator = GoogleTranslator(source=src_lang, target=tgt_lang)
    translations = []
    for text in tqdm(texts, desc=f"Translating {src_lang}->{tgt_lang} (Google)"):
        try:
            translations.append(translator.translate(text))
        except Exception as e:
            print(f"Error translating: {e}")
            translations.append("")
    return translations
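
`translate_google` records an empty string whenever a request fails, and an empty hypothesis is then scored as a wrong translation. One way to reduce that effect is to retry transient failures before giving up. The sketch below is generic and stands apart from `deep-translator`; `call_with_retries` and `flaky_translate` are hypothetical names, and the flaky function only simulates a failing service.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**k seconds and retry.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated service that fails twice before succeeding.
calls = {"n": 0}


def flaky_translate(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return text.upper()


result = call_with_retries(flaky_translate, "hello", attempts=3, base_delay=0.0)
print(result)
```

In the notebook this could wrap the call as `call_with_retries(translator.translate, text)`, keeping the existing `except` branch as the final fallback.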

## 6. Evaluation Metrics

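Scores are computed at corpus level with `sacrebleu`: **BLEU** and a character n-gram F-score reported under the label **chrF++**.

In [32]:
def compute_metrics(predictions, references):
    # sacrebleu expects a list of reference streams, hence [references]
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    # NOTE: corpus_chrf defaults to word_order=0, which is plain chrF.
    # True chrF++ needs word_order=2; the scores printed in this notebook
    # were produced with the default, so they are chrF despite the label.
    chrf = sacrebleu.corpus_chrf(predictions, [references])
    return {"BLEU": bleu.score, "chrF++": chrf.score}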
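
To make the character-level metric less of a black box, here is a simplified sentence-level sketch of the idea behind chrF: average the precision and recall of character n-grams (n = 1..6) and combine them into an F-score with β = 2. It is an illustration only, not `sacrebleu`'s implementation, and its values will not match `sacrebleu` exactly; `char_ngrams` and `simple_chrf` are hypothetical names.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace (a simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in the range 0..100."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "Саммит завершился принятием совместной декларации об изменении климата."
hypothesis = "Саммит завершился совместной декларацией о изменении климата."
print(round(simple_chrf(hypothesis, reference), 2))
```

Because matching happens on characters, a hypothesis that differs only in case endings still earns partial credit, which is one reason character-level metrics are often preferred for morphologically rich languages such as Russian.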

## 7. Evaluate on FLORES Dataset

**Dataset:** [FLORES devtest subset](https://huggingface.co/datasets/openlanguagedata/flores_plus)

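FLORES is a multilingual benchmark of professionally translated sentences taken from Wikimedia sources (Wikinews, Wikijunior, Wikivoyage). Every sentence exists in all covered languages, so both translation directions can be evaluated on the same content.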

In [33]:
# Load FLORES dataset for English-Russian
# Note: `trust_remote_code` is no longer supported by `datasets`, so it is not passed.
dataset_flores = None

try:
    # First choice: the original facebook/flores repository
    dataset_flores = load_dataset("facebook/flores", "eng-rus", split="devtest")
    print("✓ Loaded FLORES from facebook/flores")
except Exception:
    try:
        # Fallback: the gated FLORES+ repository (uses the token if one was found)
        load_kwargs = {"token": hf_token} if hf_token else {}
        dataset_flores = load_dataset("openlanguagedata/flores_plus", "default", split="devtest", **load_kwargs)
        print("✓ Loaded FLORES with default config")
    except Exception as e:
        print(f"\n⚠ Could not load FLORES dataset. Error: {e}")
        print("\nContinuing with custom dataset only (this is sufficient for the task).")
        dataset_flores = None

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'facebook/flores' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'openlanguagedata/flores_plus' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.

✓ Loaded FLORES with default config


In [34]:
# Extract English and Russian sentences from FLORES
if dataset_flores is not None:
    cols = list(dataset_flores.column_names)
    print(f"FLORES columns available: {cols[:10]}...")

    # Check if this is the new FLORES structure with language metadata
    if 'text' in cols and ('iso_639_3' in cols or 'iso_15924' in cols):
        print("Detected new FLORES structure with language metadata")

        try:
            df_flores = dataset_flores.to_pandas()

            # Filter English and Russian sentences
            if 'iso_639_3' in cols and 'iso_15924' in cols:
                eng_mask = (df_flores['iso_639_3'] == 'eng') & (df_flores['iso_15924'] == 'Latn')
                rus_mask = (df_flores['iso_639_3'] == 'rus') & (df_flores['iso_15924'] == 'Cyrl')
            elif 'iso_639_3' in cols:
                eng_mask = df_flores['iso_639_3'] == 'eng'
                rus_mask = df_flores['iso_639_3'] == 'rus'
            else:
                raise ValueError("Cannot determine language from available columns")

            flores_en_df = df_flores[eng_mask].sort_values('id') if 'id' in cols else df_flores[eng_mask]
            flores_ru_df = df_flores[rus_mask].sort_values('id') if 'id' in cols else df_flores[rus_mask]

            # Align by id to ensure parallel sentences
            if 'id' in cols:
                merged = pd.merge(flores_en_df, flores_ru_df, on='id', suffixes=('_en', '_ru'))
                flores_en = merged['text_en'].tolist()
                flores_ru = merged['text_ru'].tolist()
            else:
                min_len = min(len(flores_en_df), len(flores_ru_df))
                flores_en = flores_en_df['text'].head(min_len).tolist()
                flores_ru = flores_ru_df['text'].head(min_len).tolist()

            print(f"✓ Extracted {len(flores_en)} parallel English-Russian sentence pairs")
        except Exception as e:
            print(f"⚠ Error processing FLORES structure: {e}")
            dataset_flores = None

    # Fallback: check for direct language columns (old structure).
    # At notebook top level, the new-structure branch above defines `flores_en`
    # and `flores_ru` as globals when it succeeds.
    flores_extracted = (
        dataset_flores is not None
        and 'flores_en' in globals()
        and 'flores_ru' in globals()
    )

    if dataset_flores is not None and not flores_extracted:
        if 'eng_Latn' in cols and 'rus_Cyrl' in cols:
            flores_en = dataset_flores['eng_Latn']
            flores_ru = dataset_flores['rus_Cyrl']
            print("✓ Using eng_Latn and rus_Cyrl columns")
        else:
            print(f"⚠ Could not extract English/Russian sentences. Available columns: {cols[:20]}")
            dataset_flores = None

    # Limit to first 100 sentences for faster evaluation.
    # If dataset_flores is still set here, one of the branches above defined
    # flores_en and flores_ru.
    if dataset_flores is not None:
        eval_size = min(100, len(flores_en))
        flores_en_eval = flores_en[:eval_size]
        flores_ru_eval = flores_ru[:eval_size]
        print(f"Evaluating on {eval_size} FLORES sentences")
    else:
        print("FLORES dataset not loaded or extraction failed. Skipping FLORES evaluation.")
else:
    print("FLORES dataset not loaded. Skipping FLORES evaluation.")

FLORES columns available: ['id', 'iso_639_3', 'iso_15924', 'glottocode', 'variant', 'text', 'url', 'domain', 'topic', 'has_image']...
Detected new FLORES structure with language metadata
✓ Extracted 1012 parallel English-Russian sentence pairs
Evaluating on 100 FLORES sentences
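
The new FLORES layout stores one row per sentence and language, so parallel pairs have to be rebuilt by joining on `id`. The idea can be shown without pandas on a few toy rows. The column names `id`, `iso_639_3`, and `text` mirror the ones printed above; the rows themselves are invented, and `align_by_id` is a hypothetical helper.

```python
def align_by_id(rows, src_lang="eng", tgt_lang="rus"):
    """Pair up source and target texts that share the same sentence id."""
    src = {r["id"]: r["text"] for r in rows if r["iso_639_3"] == src_lang}
    tgt = {r["id"]: r["text"] for r in rows if r["iso_639_3"] == tgt_lang}
    # Keep only ids present on both sides, in a stable order.
    shared_ids = sorted(src.keys() & tgt.keys())
    return [(src[i], tgt[i]) for i in shared_ids]


rows = [
    {"id": 2, "iso_639_3": "rus", "text": "Второе предложение."},
    {"id": 1, "iso_639_3": "eng", "text": "First sentence."},
    {"id": 1, "iso_639_3": "rus", "text": "Первое предложение."},
    {"id": 2, "iso_639_3": "eng", "text": "Second sentence."},
    {"id": 3, "iso_639_3": "eng", "text": "No Russian counterpart."},
]
pairs = align_by_id(rows)
print(pairs)
```

Sentences without a counterpart are dropped, which matches the behaviour of an inner join; row order in the input does not matter.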


In [35]:
# Evaluate NLLB on FLORES
if dataset_flores is not None:
    print("\n=== Evaluating NLLB on FLORES ===")

    print("Translating En->Ru...")
    nllb_en_ru_flores = translate_nllb(flores_en_eval, 'en', 'ru', model_nllb, tokenizer_nllb)
    metrics_nllb_en_ru_flores = compute_metrics(nllb_en_ru_flores, flores_ru_eval)
    print(f"NLLB En->Ru: BLEU={metrics_nllb_en_ru_flores['BLEU']:.2f}, chrF++={metrics_nllb_en_ru_flores['chrF++']:.2f}")

    print("Translating Ru->En...")
    nllb_ru_en_flores = translate_nllb(flores_ru_eval, 'ru', 'en', model_nllb, tokenizer_nllb)
    metrics_nllb_ru_en_flores = compute_metrics(nllb_ru_en_flores, flores_en_eval)
    print(f"NLLB Ru->En: BLEU={metrics_nllb_ru_en_flores['BLEU']:.2f}, chrF++={metrics_nllb_ru_en_flores['chrF++']:.2f}")

    # Evaluate Helsinki on FLORES
    print("\n=== Evaluating Helsinki on FLORES ===")

    print("Translating En->Ru...")
    helsinki_en_ru_flores = translate_helsinki(flores_en_eval, 'en', 'ru')
    metrics_helsinki_en_ru_flores = compute_metrics(helsinki_en_ru_flores, flores_ru_eval)
    print(f"Helsinki En->Ru: BLEU={metrics_helsinki_en_ru_flores['BLEU']:.2f}, chrF++={metrics_helsinki_en_ru_flores['chrF++']:.2f}")

    print("Translating Ru->En...")
    helsinki_ru_en_flores = translate_helsinki(flores_ru_eval, 'ru', 'en')
    metrics_helsinki_ru_en_flores = compute_metrics(helsinki_ru_en_flores, flores_en_eval)
    print(f"Helsinki Ru->En: BLEU={metrics_helsinki_ru_en_flores['BLEU']:.2f}, chrF++={metrics_helsinki_ru_en_flores['chrF++']:.2f}")


=== Evaluating NLLB on FLORES ===
Translating En->Ru...


NLLB En->Ru: BLEU=28.79, chrF++=56.05
Translating Ru->En...


NLLB Ru->En: BLEU=31.81, chrF++=60.12

=== Evaluating Helsinki on FLORES ===
Translating En->Ru...


Helsinki En->Ru: BLEU=28.16, chrF++=55.95
Translating Ru->En...


Helsinki Ru->En: BLEU=27.51, chrF++=57.72
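
The notebook builds a summary table for the custom dataset but not for FLORES. A small sketch of an equivalent summary, using the four score pairs copied from the output above (the second score is the one labeled `chrF++` there); `flores_results` and `best_by_direction` are hypothetical names.

```python
# (model, direction, BLEU, score labeled chrF++) copied from the printed output
flores_results = [
    ("NLLB-200", "En->Ru", 28.79, 56.05),
    ("NLLB-200", "Ru->En", 31.81, 60.12),
    ("Helsinki-NLP", "En->Ru", 28.16, 55.95),
    ("Helsinki-NLP", "Ru->En", 27.51, 57.72),
]


def best_by_direction(results):
    """Return {direction: (model, BLEU)} for the highest-BLEU model per direction."""
    best = {}
    for model, direction, bleu, _ in results:
        if direction not in best or bleu > best[direction][1]:
            best[direction] = (model, bleu)
    return best


for model, direction, bleu, chrf in flores_results:
    print(f"{model:<14}{direction:<8}{bleu:>7.2f}{chrf:>8.2f}")
print(best_by_direction(flores_results))
```

On this 100-sentence subset NLLB-200 is ahead in both directions, though the En→Ru gap (28.79 vs 28.16 BLEU) is small enough that it may not hold on the full devtest set.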


## 8. Evaluate on Custom Dataset

Evaluating all models on our custom 50-sentence dataset (both directions).

In [36]:
# Extract English and Russian sentences from custom dataset
custom_en = df_custom['en'].tolist()
custom_ru = df_custom['ru'].tolist()

print(f"Evaluating on {len(custom_en)} custom sentences")

Evaluating on 50 custom sentences


In [37]:
# Evaluate NLLB on custom dataset
print("\n=== Evaluating NLLB on Custom Dataset ===")

print("Translating En->Ru...")
nllb_en_ru_custom = translate_nllb(custom_en, 'en', 'ru', model_nllb, tokenizer_nllb)
metrics_nllb_en_ru_custom = compute_metrics(nllb_en_ru_custom, custom_ru)
print(f"NLLB En->Ru: BLEU={metrics_nllb_en_ru_custom['BLEU']:.2f}, chrF++={metrics_nllb_en_ru_custom['chrF++']:.2f}")

print("Translating Ru->En...")
nllb_ru_en_custom = translate_nllb(custom_ru, 'ru', 'en', model_nllb, tokenizer_nllb)
metrics_nllb_ru_en_custom = compute_metrics(nllb_ru_en_custom, custom_en)
print(f"NLLB Ru->En: BLEU={metrics_nllb_ru_en_custom['BLEU']:.2f}, chrF++={metrics_nllb_ru_en_custom['chrF++']:.2f}")


=== Evaluating NLLB on Custom Dataset ===
Translating En->Ru...


NLLB En->Ru: BLEU=57.53, chrF++=79.98
Translating Ru->En...


NLLB Ru->En: BLEU=60.97, chrF++=78.28


In [38]:
# Evaluate Helsinki on custom dataset
print("\n=== Evaluating Helsinki on Custom Dataset ===")

print("Translating En->Ru...")
helsinki_en_ru_custom = translate_helsinki(custom_en, 'en', 'ru')
metrics_helsinki_en_ru_custom = compute_metrics(helsinki_en_ru_custom, custom_ru)
print(f"Helsinki En->Ru: BLEU={metrics_helsinki_en_ru_custom['BLEU']:.2f}, chrF++={metrics_helsinki_en_ru_custom['chrF++']:.2f}")

print("Translating Ru->En...")
helsinki_ru_en_custom = translate_helsinki(custom_ru, 'ru', 'en')
metrics_helsinki_ru_en_custom = compute_metrics(helsinki_ru_en_custom, custom_en)
print(f"Helsinki Ru->En: BLEU={metrics_helsinki_ru_en_custom['BLEU']:.2f}, chrF++={metrics_helsinki_ru_en_custom['chrF++']:.2f}")


=== Evaluating Helsinki on Custom Dataset ===
Translating En->Ru...


Helsinki En->Ru: BLEU=55.11, chrF++=77.43
Translating Ru->En...


Helsinki Ru->En: BLEU=55.14, chrF++=74.59


In [39]:
# Evaluate Google Translate on custom dataset
print("\n=== Evaluating Google Translate on Custom Dataset ===")

print("Translating En->Ru...")
google_en_ru_custom = translate_google(custom_en, 'en', 'ru')
metrics_google_en_ru_custom = compute_metrics(google_en_ru_custom, custom_ru)
print(f"Google En->Ru: BLEU={metrics_google_en_ru_custom['BLEU']:.2f}, chrF++={metrics_google_en_ru_custom['chrF++']:.2f}")

print("Translating Ru->En...")
google_ru_en_custom = translate_google(custom_ru, 'ru', 'en')
metrics_google_ru_en_custom = compute_metrics(google_ru_en_custom, custom_en)
print(f"Google Ru->En: BLEU={metrics_google_ru_en_custom['BLEU']:.2f}, chrF++={metrics_google_ru_en_custom['chrF++']:.2f}")


=== Evaluating Google Translate on Custom Dataset ===
Translating En->Ru...


Google En->Ru: BLEU=72.66, chrF++=87.87
Translating Ru->En...


Google Ru->En: BLEU=78.79, chrF++=89.23


## 9. Results Summary and Analysis

In [40]:
# Create results table for custom dataset
results_custom = pd.DataFrame({
    'Model': ['NLLB-200', 'NLLB-200', 'Helsinki-NLP', 'Helsinki-NLP', 'Google Translate', 'Google Translate'],
    'Direction': ['En→Ru', 'Ru→En', 'En→Ru', 'Ru→En', 'En→Ru', 'Ru→En'],
    'BLEU': [
        metrics_nllb_en_ru_custom['BLEU'],
        metrics_nllb_ru_en_custom['BLEU'],
        metrics_helsinki_en_ru_custom['BLEU'],
        metrics_helsinki_ru_en_custom['BLEU'],
        metrics_google_en_ru_custom['BLEU'],
        metrics_google_ru_en_custom['BLEU']
    ],
    'chrF++': [
        metrics_nllb_en_ru_custom['chrF++'],
        metrics_nllb_ru_en_custom['chrF++'],
        metrics_helsinki_en_ru_custom['chrF++'],
        metrics_helsinki_ru_en_custom['chrF++'],
        metrics_google_en_ru_custom['chrF++'],
        metrics_google_ru_en_custom['chrF++']
    ]
})

print("=== Results on Custom Dataset ===")
print(results_custom.to_string(index=False))

=== Results on Custom Dataset ===
           Model Direction      BLEU    chrF++
        NLLB-200     En→Ru 57.533910 79.984840
        NLLB-200     Ru→En 60.972688 78.279954
    Helsinki-NLP     En→Ru 55.110922 77.432304
    Helsinki-NLP     Ru→En 55.142280 74.587597
Google Translate     En→Ru 72.664367 87.867028
Google Translate     Ru→En 78.789387 89.234167
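
With only 50 sentences, a gap of a few points between two systems may be noise. A common way to probe this is paired bootstrap resampling: resample the test set with replacement many times and count how often one system beats the other. The sketch below averages per-sentence scores for simplicity, which is an assumption; corpus BLEU is not a mean of sentence scores, so a faithful version would recompute the corpus metric on each resample. `paired_bootstrap` is a hypothetical helper and the scores are invented.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A's mean score beats system B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same sentences")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same sentence indices for both systems (paired test).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples


# Invented per-sentence scores for two systems on six sentences.
system_a = [0.9, 0.4, 0.7, 0.6, 0.8, 0.5]
system_b = [0.7, 0.5, 0.6, 0.6, 0.9, 0.3]
print(paired_bootstrap(system_a, system_b))
```

A win rate close to 1.0 suggests that A's advantage is stable under resampling; a value near 0.5 suggests the two systems cannot be separated on this test set.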


In [41]:
# Evaluate by genre
print("\n=== Results by Genre ===")

# News genre
news_mask = df_custom['genre'] == 'news'
news_en = df_custom[news_mask]['en'].tolist()
news_ru = df_custom[news_mask]['ru'].tolist()

# Fiction genre
fiction_mask = df_custom['genre'] == 'fiction'
fiction_en = df_custom[fiction_mask]['en'].tolist()
fiction_ru = df_custom[fiction_mask]['ru'].tolist()

# Translate news
nllb_news_en_ru = translate_nllb(news_en, 'en', 'ru', model_nllb, tokenizer_nllb)
helsinki_news_en_ru = translate_helsinki(news_en, 'en', 'ru')
google_news_en_ru = translate_google(news_en, 'en', 'ru')

# Translate fiction
nllb_fiction_en_ru = translate_nllb(fiction_en, 'en', 'ru', model_nllb, tokenizer_nllb)
helsinki_fiction_en_ru = translate_helsinki(fiction_en, 'en', 'ru')
google_fiction_en_ru = translate_google(fiction_en, 'en', 'ru')

# Compute metrics by genre
results_by_genre = pd.DataFrame({
    'Model': ['NLLB-200', 'Helsinki-NLP', 'Google Translate', 'NLLB-200', 'Helsinki-NLP', 'Google Translate'],
    'Genre': ['News', 'News', 'News', 'Fiction', 'Fiction', 'Fiction'],
    'BLEU': [
        compute_metrics(nllb_news_en_ru, news_ru)['BLEU'],
        compute_metrics(helsinki_news_en_ru, news_ru)['BLEU'],
        compute_metrics(google_news_en_ru, news_ru)['BLEU'],
        compute_metrics(nllb_fiction_en_ru, fiction_ru)['BLEU'],
        compute_metrics(helsinki_fiction_en_ru, fiction_ru)['BLEU'],
        compute_metrics(google_fiction_en_ru, fiction_ru)['BLEU']
    ],
    'chrF++': [
        compute_metrics(nllb_news_en_ru, news_ru)['chrF++'],
        compute_metrics(helsinki_news_en_ru, news_ru)['chrF++'],
        compute_metrics(google_news_en_ru, news_ru)['chrF++'],
        compute_metrics(nllb_fiction_en_ru, fiction_ru)['chrF++'],
        compute_metrics(helsinki_fiction_en_ru, fiction_ru)['chrF++'],
        compute_metrics(google_fiction_en_ru, fiction_ru)['chrF++']
    ]
})

print(results_by_genre.to_string(index=False))


=== Results by Genre ===


           Model   Genre      BLEU    chrF++
        NLLB-200    News 73.930300 88.763310
    Helsinki-NLP    News 59.098926 82.759841
Google Translate    News 79.651011 93.476668
        NLLB-200 Fiction 44.779740 69.102928
    Helsinki-NLP Fiction 51.427874 70.739468
Google Translate Fiction 67.207074 80.892794
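
The genre cell translates the news and fiction sentences a second time, although En→Ru translations of all 50 sentences already exist from section 8. One alternative is to split the existing predictions by genre, which would also avoid a second round of Google requests. A minimal sketch, assuming the three input lists are aligned item by item; `group_by_genre` is a hypothetical helper and the strings are placeholders.

```python
from collections import defaultdict


def group_by_genre(genres, predictions, references):
    """Split aligned predictions and references into per-genre lists."""
    if not (len(genres) == len(predictions) == len(references)):
        raise ValueError("All inputs must have the same length")
    grouped = defaultdict(lambda: {"predictions": [], "references": []})
    for genre, pred, ref in zip(genres, predictions, references):
        grouped[genre]["predictions"].append(pred)
        grouped[genre]["references"].append(ref)
    return dict(grouped)


genres = ["news", "fiction", "news"]
predictions = ["pred 1", "pred 2", "pred 3"]
references = ["ref 1", "ref 2", "ref 3"]
by_genre = group_by_genre(genres, predictions, references)
print(by_genre["news"]["predictions"])
```

In the notebook this could be called as `group_by_genre(df_custom['genre'].tolist(), nllb_en_ru_custom, custom_ru)`, passing each genre's two lists to `compute_metrics`.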


## 10. Example Translation Analysis

In [42]:
# Select a few examples for detailed analysis
example_indices = [0, 5, 10, 20, 30, 40]  # Mix of news and fiction

print("=== Example Translations ===")
print("\n" + "="*80)

for idx in example_indices:
    if idx < len(df_custom):
        row = df_custom.iloc[idx]
        print(f"\nExample {idx+1} ({row['genre']}):")
        print(f"Source (EN): {row['en']}")
        print(f"Reference (RU): {row['ru']}")
        print(f"NLLB: {nllb_en_ru_custom[idx]}")
        print(f"Helsinki: {helsinki_en_ru_custom[idx]}")
        print(f"Google: {google_en_ru_custom[idx]}")
        print("-"*80)

=== Example Translations ===


Example 1 (news):
Source (EN): The summit ended with a joint declaration on climate change.
Reference (RU): Саммит завершился принятием совместной декларации об изменении климата.
NLLB: Саммит завершился совместной декларацией о изменении климата.
Helsinki: Саммит завершился принятием совместной декларации об изменении климата.
Google: Саммит завершился принятием совместной декларации по изменению климата.
--------------------------------------------------------------------------------

Example 6 (news):
Source (EN): The international conference will address global security challenges.
Reference (RU): Международная конференция рассмотрит глобальные вызовы безопасности.
NLLB: Международная конференция будет рассматривать проблемы глобальной безопасности.
Helsinki: На этой международной конференции будут рассмотрены проблемы глобальной безопасности.
Google: Международная конференция будет посвящена проблемам глобальной безопасности.
------------------------

In [43]:
# Find best and worst examples
def simple_similarity(pred, ref):
    """Fraction of unique reference words that also appear in the prediction (set-based recall)."""
    pred_words = set(pred.lower().split())
    ref_words = set(ref.lower().split())
    if len(ref_words) == 0:
        return 0
    return len(pred_words & ref_words) / len(ref_words)

# Calculate similarities for NLLB
nllb_similarities = [simple_similarity(nllb_en_ru_custom[i], custom_ru[i]) for i in range(len(custom_ru))]
best_nllb_idx = nllb_similarities.index(max(nllb_similarities))
worst_nllb_idx = nllb_similarities.index(min(nllb_similarities))

print("\n=== Best NLLB Translation ===")
print(f"Source (EN): {custom_en[best_nllb_idx]}")
print(f"Reference (RU): {custom_ru[best_nllb_idx]}")
print(f"NLLB: {nllb_en_ru_custom[best_nllb_idx]}")
print(f"Genre: {df_custom.iloc[best_nllb_idx]['genre']}")

print("\n=== Worst NLLB Translation ===")
print(f"Source (EN): {custom_en[worst_nllb_idx]}")
print(f"Reference (RU): {custom_ru[worst_nllb_idx]}")
print(f"NLLB: {nllb_en_ru_custom[worst_nllb_idx]}")
print(f"Genre: {df_custom.iloc[worst_nllb_idx]['genre']}")


=== Best NLLB Translation ===
Source (EN): Education reform is a top priority for the new government.
Reference (RU): Реформа образования является главным приоритетом нового правительства.
NLLB: Реформа образования является главным приоритетом для нового правительства.
Genre: news

=== Worst NLLB Translation ===
Source (EN): The old house creaked in the wind, as if whispering secrets.
Reference (RU): Старый дом скрипел на ветру, словно нашептывая секреты.
NLLB: Старый дом кричал в ветре, как будто шепчут секреты.
Genre: fiction


## 11. Final Results Summary

In [44]:
# Create comprehensive results summary
print("="*80)
print("COMPREHENSIVE RESULTS SUMMARY")
print("="*80)

# Custom dataset results
print("\n### Custom Dataset Results (50 sentences: 25 news + 25 fiction)")
print(results_custom.to_string(index=False))

# FLORES results (if available)
if dataset_flores is not None:
    print("\n### FLORES Dataset Results (devtest subset)")
    results_flores = pd.DataFrame({
        'Model': ['NLLB-200', 'NLLB-200', 'Helsinki-NLP', 'Helsinki-NLP'],
        'Direction': ['En→Ru', 'Ru→En', 'En→Ru', 'Ru→En'],
        'BLEU': [
            metrics_nllb_en_ru_flores['BLEU'],
            metrics_nllb_ru_en_flores['BLEU'],
            metrics_helsinki_en_ru_flores['BLEU'],
            metrics_helsinki_ru_en_flores['BLEU']
        ],
        'chrF++': [
            metrics_nllb_en_ru_flores['chrF++'],
            metrics_nllb_ru_en_flores['chrF++'],
            metrics_helsinki_en_ru_flores['chrF++'],
            metrics_helsinki_ru_en_flores['chrF++']
        ]
    })
    print(results_flores.to_string(index=False))

# Genre-specific results
print("\n### Results by Genre (Custom Dataset)")
print(results_by_genre.to_string(index=False))

print("\n" + "="*80)
print("Evaluation Complete!")
print("="*80)

COMPREHENSIVE RESULTS SUMMARY

### Custom Dataset Results (50 sentences: 25 news + 25 fiction)
           Model Direction      BLEU    chrF++
        NLLB-200     En→Ru 57.533910 79.984840
        NLLB-200     Ru→En 60.972688 78.279954
    Helsinki-NLP     En→Ru 55.110922 77.432304
    Helsinki-NLP     Ru→En 55.142280 74.587597
Google Translate     En→Ru 72.664367 87.867028
Google Translate     Ru→En 78.789387 89.234167

### FLORES Dataset Results (devtest subset)
       Model Direction      BLEU    chrF++
    NLLB-200     En→Ru 28.791971 56.048323
    NLLB-200     Ru→En 31.812177 60.117728
Helsinki-NLP     En→Ru 28.163911 55.954486
Helsinki-NLP     Ru→En 27.508495 57.718405

### Results by Genre (Custom Dataset)
           Model   Genre      BLEU    chrF++
        NLLB-200    News 73.930300 88.763310
    Helsinki-NLP    News 59.098926 82.759841
Google Translate    News 79.651011 93.476668
        NLLB-200 Fiction 44.779740 69.102928
    Helsinki-NLP Fiction 51.427874 70.739468
Google Translate Fiction 67.207074 80.892794

## 12. Discussion and Analysis

### Overall Performance

On the custom dataset, **Google Translate** outperforms both open-source models in both directions, reaching 72.66 BLEU for En→Ru and 78.79 for Ru→En, against 57.53 and 60.97 for NLLB-200 and 55.11 and 55.14 for Helsinki-NLP. Google Translate was not run on FLORES, so this comparison rests on the 50 custom sentences only. A lead for the commercial system is plausible, since such systems typically draw on larger training data and more engineering effort than openly released models, but this evaluation cannot isolate the cause.

Among the open-source models, **NLLB-200** scores higher than Helsinki-NLP in every aggregate comparison, though the margin depends on direction more than on dataset. On FLORES, NLLB-200 reaches 28.79 (En→Ru) and 31.81 (Ru→En) BLEU against Helsinki-NLP's 28.16 and 27.51, a gap of 0.63 and 4.30 points. On the custom dataset the scores are 57.53 and 60.97 against 55.11 and 55.14, a gap of 2.42 and 5.83 points. The En→Ru differences are small enough that they may not be meaningful on test sets of this size. NLLB-200's advantage may stem from its multilingual training across roughly 200 languages, but Helsinki-NLP's dedicated bilingual models remain competitive and score higher than NLLB-200 on the fiction subset (51.43 vs 44.78 BLEU).
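The gaps between the two open-source models can be read off the reported scores. The following is a minimal sketch, with BLEU values copied by hand from the summary tables above; the dictionary layout and the `nllb_minus_helsinki` helper are illustrative and not part of the notebook.

```python
# BLEU scores copied from the summary tables above (rounded to two decimals).
bleu = {
    ("FLORES", "En→Ru"): {"NLLB-200": 28.79, "Helsinki-NLP": 28.16},
    ("FLORES", "Ru→En"): {"NLLB-200": 31.81, "Helsinki-NLP": 27.51},
    ("Custom", "En→Ru"): {"NLLB-200": 57.53, "Helsinki-NLP": 55.11},
    ("Custom", "Ru→En"): {"NLLB-200": 60.97, "Helsinki-NLP": 55.14},
}


def nllb_minus_helsinki(scores):
    """Return the NLLB-200 advantage in BLEU points per (dataset, direction)."""
    return {
        key: round(row["NLLB-200"] - row["Helsinki-NLP"], 2)
        for key, row in scores.items()
    }


gaps = nllb_minus_helsinki(bleu)
for (dataset, direction), gap in gaps.items():
    print(f"{dataset:6s} {direction}: {gap:+.2f} BLEU")
```

Laying the four gaps side by side makes it easier to see whether the difference tracks the dataset or the translation direction.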

### Genre and Direction Effects

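A clear **genre effect** appears for all three systems: **news scores higher than fiction**. NLLB-200 drops from 73.93 BLEU on news to 44.78 on fiction, a difference of about 29 points. Google Translate drops by about 12 points (79.65 to 67.21) and Helsinki-NLP by about 8 (59.10 to 51.43). With only 25 sentences per genre, these gaps are indicative rather than statistically established. They are consistent with the expectation that news text, with its formulaic structure and standard vocabulary, is easier to translate than fiction, which relies more on creative language, metaphor, and stylistic variation. Fiction also tends to admit a wider range of valid translations, which single-reference metrics penalize.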
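With 25 sentences per genre, it is worth checking how stable a genre gap is under resampling. The following is a minimal sketch of a percentile bootstrap over per-sentence scores. The `bootstrap_gap_ci` helper and the two score lists are hypothetical placeholders (for example, sentence-level chrF values), not numbers produced by this notebook. Note also that corpus-level BLEU is not a mean of sentence scores, so this is a simplification.

```python
import random
from statistics import mean


def bootstrap_gap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sentence scores, for illustration only.
news_scores = [88.0, 91.5, 79.2, 95.1, 84.3, 90.0, 86.7, 92.4]
fiction_scores = [70.1, 65.4, 81.0, 58.9, 74.2, 69.5, 77.3, 62.8]

low, high = bootstrap_gap_ci(news_scores, fiction_scores)
print(f"95% interval for the news-fiction gap: [{low:.1f}, {high:.1f}]")
```

If the interval excludes zero, the gap is unlikely to be an artifact of which sentences happened to be sampled. An interval that straddles zero would suggest collecting more sentences before drawing conclusions.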

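Regarding **translation direction**, BLEU is higher for **Ru→En than En→Ru** for NLLB-200 (60.97 vs 57.53 on the custom dataset) and for Google Translate (78.79 vs 72.66), while Helsinki-NLP is essentially symmetric (55.14 vs 55.11). chrF++ does not fully agree: for NLLB-200 and Helsinki-NLP on the custom dataset it is higher for En→Ru (79.98 vs 78.28 and 77.43 vs 74.59). Scores computed against references in different target languages are not directly comparable, so the asymmetry should be read with caution. It may reflect training data distribution, or the difficulty of matching a single Russian reference word for word given Russian's rich morphology and flexible word order.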

### Example Analysis

**Excellent Translation Example:** The news sentence "Education reform is a top priority for the new government" was translated almost exactly as in the reference by NLLB-200: "Реформа образования является главным приоритетом для нового правительства." The only difference is the added preposition "для" ("for"), where the reference uses the bare genitive "нового правительства". Both versions are grammatical. This illustrates the model's strength on formal, structured text with standard political terminology.

**Poor Translation Example:** The fiction sentence "The old house creaked in the wind, as if whispering secrets" was handled poorly by NLLB-200, which produced "Старый дом кричал в ветре, как будто шепчут секреты" (roughly "The old house shouted in the wind, as if they whisper secrets"). The model rendered "creaked" as "кричал" (shouted) instead of "скрипел" (creaked) and used the unidiomatic "в ветре" rather than "на ветру". It also replaced "whispering", which refers to the house, with the plural verb "шепчут", so the personification of the house is lost. This shows how figurative language suffers when a system translates word by word.
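The best and worst examples were selected with `simple_similarity`, which measures set-based recall only, so a long hypothesis that happens to contain the reference words is not penalized for its extra words. One possible alternative is a token-level F1 that balances precision and recall. The `token_f1` helper below is a hypothetical sketch, not part of the notebook, and like `simple_similarity` it uses naive whitespace tokenization that leaves punctuation attached to words.

```python
from collections import Counter


def token_f1(pred, ref):
    """Token-level F1 between a hypothesis and a reference (whitespace tokens)."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each reference token can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The "worst NLLB" example from the output above.
ref = "Старый дом скрипел на ветру, словно нашептывая секреты."
hyp = "Старый дом кричал в ветре, как будто шепчут секреты."
print(f"token F1: {token_f1(hyp, ref):.3f}")
```

For ranking individual sentences in practice, a sentence-level metric from `sacrebleu` would probably be a better fit than either word-overlap heuristic, since it matches the corpus-level metrics reported elsewhere in this notebook.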

### Commercial vs Open-Source Systems

On the custom dataset, Google Translate leads the open-source models by roughly 15 to 24 BLEU points, depending on model and direction. The gap is much smaller for NLLB-200 on news (79.65 vs 73.93). Commercial systems can draw on proprietary training data and extensive computational resources, which likely explains part of the difference. Open-source models like NLLB-200 remain valuable for research, privacy-sensitive applications, and offline deployment. NLLB-200's score of over 70 BLEU on the news subset suggests that open-source models can be highly effective in specific domains, even if they trail commercial systems overall.

### Conclusion

In this evaluation the commercial system holds a clear advantage, while the open-source models reach a quality that is likely sufficient for many practical applications, particularly in formal domains such as news. Scores on FLORES are roughly half those on the custom dataset (27.5 to 31.8 BLEU versus 55.1 to 61.0 for the open-source models), which underscores the value of established benchmarks for revealing model limitations. Future work on open-source translation should focus on creative language, domain adaptation, and closing the gap with commercial systems.