# English to Nepali Translation with Transformer Models
## Using mBART-50, NLLB-200 & Understanding mBERT Limitations

**BUS 405: Foundations of Big Data Analytics**

---

## Important: Why NOT mBERT or RoBERTa for Translation?

### ❌ mBERT and RoBERTa are NOT translation models!

| Model | Architecture | Purpose | Can Translate? |
|-------|--------------|---------|----------------|
| **mBERT** | Encoder-only | Understanding multilingual text | ❌ No |
| **XLM-RoBERTa** | Encoder-only | Cross-lingual understanding | ❌ No |
| **mBART-50** | Encoder-Decoder | Multilingual translation | ✅ Yes |
| **NLLB-200** | Encoder-Decoder | 200+ language translation | ✅ Yes |

### Why Encoder-Only Models Can't Translate:
- **mBERT/RoBERTa** are trained to **understand** text, not **generate** it
- They produce **embeddings**, not **translated sentences**
- Translation requires a model that can **generate** text; the translation models in this notebook do that with an **encoder-decoder** architecture
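
To make the understand-versus-generate distinction concrete, here is a toy sketch in plain Python. It is not how mBERT or mBART are implemented: the `encode` function, the `NEXT_TOKEN` table, and the placeholder tokens are all hypothetical stand-ins. The only point is that an encoder returns one vector of numbers per input token, while a decoder loops and emits one output token at a time until it produces an end marker.

```python
# Toy illustration only: hypothetical stand-ins, not real model code.

def encode(tokens):
    """Encoder-style output: one vector of numbers per input token."""
    return [[float(len(tok)), float(i)] for i, tok in enumerate(tokens)]

# Hypothetical next-token table standing in for a trained decoder.
NEXT_TOKEN = {"<s>": "A", "A": "B", "B": "</s>"}

def greedy_decode(start="<s>", end="</s>", max_steps=10):
    """Decoder-style loop: emit one token at a time until the end marker."""
    output, current = [], start
    for _ in range(max_steps):
        current = NEXT_TOKEN[current]
        if current == end:
            break
        output.append(current)
    return output

vectors = encode(["hello", "world"])
generated = greedy_decode()
print(vectors)    # numbers, one vector per input token
print(generated)  # a token sequence produced step by step
```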

### What mBERT/XLM-RoBERTa ARE Good For:
- Cross-lingual text classification
- Multilingual named entity recognition (NER)
- Cross-lingual similarity/search
- Multilingual question answering

---

In this notebook, we will:
1. Demonstrate why mBERT can't translate (and what it does instead)
2. Use **mBART-50** for English → Nepali translation
3. Use **NLLB-200** for English → Nepali translation
4. Compare translation quality
5. Build a complete English-Nepali translator

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q transformers torch sentencepiece protobuf accelerate

In [None]:
# Import libraries
import torch
import warnings
warnings.filterwarnings('ignore')

from transformers import (
    # mBERT (for demonstration of what it can/cannot do)
    BertModel,
    BertTokenizer,

    # mBART-50 for translation
    MBartForConditionalGeneration,
    MBart50TokenizerFast,

    # NLLB for translation
    AutoModelForSeq2SeqLM,
    AutoTokenizer,

    # Pipeline for easy use
    pipeline
)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Why mBERT Cannot Translate (Demonstration)

Let's see what mBERT actually does - it creates **embeddings**, not translations!

In [None]:
# Load mBERT (Multilingual BERT)
print("Loading mBERT...")
mbert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
mbert_model = BertModel.from_pretrained('bert-base-multilingual-cased')
print("mBERT loaded!")

In [None]:
# Demonstrate what mBERT outputs
print("="*70)
print("WHAT mBERT ACTUALLY DOES (NOT Translation!)")
print("="*70)

english_text = "Hello, how are you?"

# Tokenize and get embeddings
inputs = mbert_tokenizer(english_text, return_tensors='pt')

with torch.no_grad():
    outputs = mbert_model(**inputs)

print(f"\nInput text: '{english_text}'")
print(f"\nmBERT Output Shape: {outputs.last_hidden_state.shape}")
print(f"  - Batch size: {outputs.last_hidden_state.shape[0]}")
print(f"  - Sequence length: {outputs.last_hidden_state.shape[1]}")
print(f"  - Hidden dimension: {outputs.last_hidden_state.shape[2]}")

print("\n❌ mBERT outputs EMBEDDINGS (numbers), NOT translated text!")
print("❌ There is NO way to get 'तपाईंलाई कस्तो छ?' from these embeddings directly!")
print("\n💡 For translation, we need ENCODER-DECODER models like mBART or NLLB.")

In [None]:
# What mBERT IS good for: Cross-lingual similarity
print("\n" + "="*70)
print("WHAT mBERT IS GOOD FOR: Cross-lingual Similarity")
print("="*70)

# Same meaning in different languages
texts = [
    "Hello, how are you?",           # English
    "नमस्ते, तपाईंलाई कस्तो छ?",        # Nepali
    "Bonjour, comment allez-vous?",   # French
    "I love pizza."                   # Different meaning
]

def get_sentence_embedding(text):
    """Get sentence embedding using [CLS] token."""
    inputs = mbert_tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = mbert_model(**inputs)
    # Use [CLS] token embedding
    return outputs.last_hidden_state[:, 0, :]

# Get embeddings
embeddings = [get_sentence_embedding(text) for text in texts]

# Calculate cosine similarity
from torch.nn.functional import cosine_similarity

print("\nCosine Similarity between sentences:")
print("-"*70)
print(f"English vs Nepali (same meaning):  {cosine_similarity(embeddings[0], embeddings[1]).item():.4f}")
print(f"English vs French (same meaning):  {cosine_similarity(embeddings[0], embeddings[2]).item():.4f}")
print(f"English vs 'I love pizza' (diff):  {cosine_similarity(embeddings[0], embeddings[3]).item():.4f}")

print("\n✅ If the same-meaning pairs score higher, mBERT is placing them closer together.")
print("✅ This is what makes it useful for cross-lingual search, classification, etc.")
print("❌ But it still cannot GENERATE translations!")
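
For reference, the cosine similarity reported above is the dot product of two vectors divided by the product of their lengths. Here is a minimal plain-Python sketch with made-up two-dimensional vectors; real `bert-base-multilingual-cased` embeddings have 768 dimensions, and the numbers below are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(round(same_direction, 4), round(orthogonal, 4))
```

One caveat when reading the mBERT numbers: raw `[CLS]` vectors from a model that was not fine-tuned for sentence similarity often score high for every pair, so the relative ordering of the scores tends to be more informative than their absolute values.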

## 3. mBART-50: English to Nepali Translation

**mBART-50** is a multilingual translation model that supports **50 languages including Nepali**!

Nepali language code: `ne_NP`

In [None]:
# Load mBART-50 model
print("Loading mBART-50 (this may take a minute)...")

mbart_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-one-to-many-mmt"
)
mbart_tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-one-to-many-mmt",
    src_lang="en_XX"  # Source language is English
)

# Move to GPU if available
mbart_model = mbart_model.to(device)

print("mBART-50 loaded successfully!")

In [None]:
# Check supported languages in mBART-50
print("mBART-50 Supported Languages (50 languages):")
print("="*70)

# Some key languages
mbart_languages = {
    'en_XX': 'English',
    'ne_NP': 'Nepali',
    'hi_IN': 'Hindi',
    'bn_IN': 'Bengali',
    'zh_CN': 'Chinese',
    'ja_XX': 'Japanese',
    'ko_KR': 'Korean',
    'fr_XX': 'French',
    'de_DE': 'German',
    'es_XX': 'Spanish',
    'ar_AR': 'Arabic',
    'ru_RU': 'Russian'
}

for code, name in mbart_languages.items():
    print(f"  {code}: {name}")

print("\n✅ Nepali (ne_NP) is supported!")

In [None]:
def translate_with_mbart(text, target_lang="ne_NP"):
    """
    Translate English text to target language using mBART-50.

    Args:
        text: English text to translate
        target_lang: Target language code (default: ne_NP for Nepali)

    Returns:
        Translated text
    """
    # Tokenize
    inputs = mbart_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate translation
    with torch.no_grad():
        generated_tokens = mbart_model.generate(
            **inputs,
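            # Language codes are regular vocabulary tokens, so this lookup does not
            # depend on the tokenizer's lang_code_to_id attribute
            forced_bos_token_id=mbart_tokenizer.convert_tokens_to_ids(target_lang),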
            max_length=128,
            num_beams=5,
            early_stopping=True
        )

    # Decode
    translation = mbart_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    return translation
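
The `num_beams=5` argument asks `generate` to use beam search, which keeps several partial translations alive at each step instead of committing to the single most likely next token. The sketch below is a toy beam search over a hypothetical table of next-token probabilities. It is not the Hugging Face implementation (it omits length penalties, among other things), and the tokens and probabilities are invented for illustration.

```python
import math

# Hypothetical next-token probabilities, invented for illustration.
PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=5):
    """Keep the num_beams highest-scoring partial sequences at each step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            last = tokens[-1]
            if last == "</s>":
                candidates.append((score, tokens))  # finished; carry forward
                continue
            for tok, p in PROBS[last].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1]

greedy = beam_search(num_beams=1)  # commits to "a" because 0.6 > 0.4
wider = beam_search(num_beams=2)   # keeps "b" alive and finds the better path
print(greedy)
print(wider)
```

With one beam the search follows the locally best first token and ends with total probability 0.6 × 0.55 = 0.33; with two beams it also explores the other first token and finds the sequence with probability 0.4 × 0.9 = 0.36. Wider beams cost more compute, so the value of 5 used here is a common compromise rather than a requirement.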

In [None]:
# Test English to Nepali translation with mBART
print("="*70)
print("ENGLISH TO NEPALI TRANSLATION (mBART-50)")
print("="*70)

english_sentences = [
    "Hello, how are you?",
    "My name is Ram.",
    "Nepal is a beautiful country.",
    "I love Nepali food.",
    "Mount Everest is in Nepal.",
    "Good morning!",
    "Thank you very much.",
    "What is your name?",
    "The weather is nice today.",
    "I am learning the Nepali language."
]

print("\n🇬🇧 English → 🇳🇵 Nepali Translations:")
print("-"*70)

for sentence in english_sentences:
    nepali = translate_with_mbart(sentence, "ne_NP")
    print(f"🇬🇧 {sentence}")
    print(f"🇳🇵 {nepali}")
    print()

## 4. NLLB-200: English to Nepali Translation

**NLLB (No Language Left Behind)** by Meta supports **200+ languages** and was designed with low-resource languages like Nepali in mind.

Nepali language code in NLLB: `npi_Deva` (Nepali in Devanagari script)
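
NLLB codes follow the FLORES-200 pattern: a three-letter ISO 639-3 language code, an underscore, and a four-letter ISO 15924 script code. A small helper can split a code and catch obvious mix-ups, such as passing an mBART-style code, before the tokenizer sees it. The function name `split_nllb_code` is hypothetical, and the check covers only the shape of the code, not whether the model actually knows it.

```python
def split_nllb_code(code):
    """Split an NLLB-style code such as 'npi_Deva' into (language, script)."""
    language, sep, script = code.partition("_")
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"Not an NLLB-style code: {code!r}")
    return language, script

print(split_nllb_code("npi_Deva"))
print(split_nllb_code("eng_Latn"))
```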

In [None]:
# Load NLLB model (smaller distilled version)
print("Loading NLLB-200 (distilled 600M)...")

nllb_model_name = "facebook/nllb-200-distilled-600M"
nllb_tokenizer = AutoTokenizer.from_pretrained(nllb_model_name)
nllb_model = AutoModelForSeq2SeqLM.from_pretrained(nllb_model_name)

# Move to GPU if available
nllb_model = nllb_model.to(device)

print("NLLB-200 loaded successfully!")

In [None]:
def translate_with_nllb(text, src_lang="eng_Latn", tgt_lang="npi_Deva"):
    """
    Translate text using NLLB-200.

    Args:
        text: Text to translate
        src_lang: Source language code (default: eng_Latn for English)
        tgt_lang: Target language code (default: npi_Deva for Nepali)

    Returns:
        Translated text
    """
    # Set source language
    nllb_tokenizer.src_lang = src_lang

    # Tokenize
    inputs = nllb_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate translation
    with torch.no_grad():
        generated_tokens = nllb_model.generate(
            **inputs,
            forced_bos_token_id=nllb_tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=128,
            num_beams=5,
            early_stopping=True
        )

    # Decode
    translation = nllb_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    return translation

In [None]:
# Test English to Nepali translation with NLLB
print("="*70)
print("ENGLISH TO NEPALI TRANSLATION (NLLB-200)")
print("="*70)

print("\n🇬🇧 English → 🇳🇵 Nepali Translations:")
print("-"*70)

for sentence in english_sentences:
    nepali = translate_with_nllb(sentence)
    print(f"🇬🇧 {sentence}")
    print(f"🇳🇵 {nepali}")
    print()

## 5. Compare mBART vs NLLB Translations

In [None]:
# Side-by-side comparison
print("="*70)
print("COMPARISON: mBART-50 vs NLLB-200")
print("="*70)

comparison_sentences = [
    "Hello, how are you?",
    "Nepal is a beautiful country in the Himalayas.",
    "I am learning to speak Nepali.",
    "The food is very delicious.",
    "Where is the nearest hospital?"
]

print("-"*70)

for sentence in comparison_sentences:
    mbart_trans = translate_with_mbart(sentence, "ne_NP")
    nllb_trans = translate_with_nllb(sentence)

    # Stacked layout: Devanagari text does not align reliably in fixed-width columns
    print(f"\n🇬🇧 {sentence}")
    print(f"   mBART: {mbart_trans}")
    print(f"   NLLB:  {nllb_trans}")
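
Printing translations side by side relies on a reader who knows Nepali. When reference translations are available, an automatic overlap score can give a rough second opinion. The sketch below computes a simplified character-bigram F1 score in plain Python. It is inspired by chrF but is not the official metric (for a real evaluation, a library such as sacreBLEU would be the usual choice), and the example strings are placeholders rather than real translations.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def ngram_f1(hypothesis, reference, n=2):
    """F1 over character n-grams shared by hypothesis and reference."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("abcd", "abcd"))  # identical strings
print(ngram_f1("abcd", "wxyz"))  # no shared bigrams
print(ngram_f1("abcd", "abxy"))  # partial overlap
```

Python strings are sequences of Unicode code points, so in Devanagari text a consonant and its attached vowel sign count as separate characters. That is acceptable for a rough comparison, but it is one more reason to treat the score as a hint rather than a verdict.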

## 6. Complete English-Nepali Translator Class

In [None]:
class EnglishNepaliTranslator:
    """
    A complete English to Nepali translator using multiple models.
    """

    def __init__(self, model_type='nllb'):
        """
        Initialize translator.

        Args:
            model_type: 'nllb' or 'mbart'
        """
        self.model_type = model_type
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        if model_type == 'nllb':
            print("Loading NLLB-200...")
            self.tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
            self.model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
            self.src_lang = "eng_Latn"
            self.tgt_lang = "npi_Deva"
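        elif model_type == 'mbart':
            print("Loading mBART-50...")
            self.tokenizer = MBart50TokenizerFast.from_pretrained(
                "facebook/mbart-large-50-one-to-many-mmt",
                src_lang="en_XX"
            )
            self.model = MBartForConditionalGeneration.from_pretrained(
                "facebook/mbart-large-50-one-to-many-mmt"
            )
            self.tgt_lang = "ne_NP"
        else:
            # Fail early on typos instead of silently loading mBART
            raise ValueError(f"model_type must be 'nllb' or 'mbart', got {model_type!r}")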

        self.model = self.model.to(self.device)
        print(f"Translator ready! Using {model_type.upper()}")

    def translate(self, text, num_beams=5, max_length=128):
        """
        Translate English text to Nepali.

        Args:
            text: English text to translate
            num_beams: Beam search width
            max_length: Maximum output length

        Returns:
            Nepali translation
        """
        if self.model_type == 'nllb':
            self.tokenizer.src_lang = self.src_lang
            inputs = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                output = self.model.generate(
                    **inputs,
                    forced_bos_token_id=self.tokenizer.convert_tokens_to_ids(self.tgt_lang),
                    max_length=max_length,
                    num_beams=num_beams,
                    early_stopping=True
                )
        else:
            inputs = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                output = self.model.generate(
                    **inputs,
                    # Language codes are regular vocabulary tokens, so this lookup works for mBART-50 too
                    forced_bos_token_id=self.tokenizer.convert_tokens_to_ids(self.tgt_lang),
                    max_length=max_length,
                    num_beams=num_beams,
                    early_stopping=True
                )

        return self.tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    def translate_batch(self, texts):
        """
        Translate multiple texts.
        """
        return [self.translate(text) for text in texts]
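
The `translate_batch` method above calls the model once per sentence. Because the tokenizer call already uses `padding=True`, several sentences can usually go through `generate` together, which tends to be faster on a GPU. The sketch below shows only the chunking logic. Both `translate_in_batches` and `translate_fn` are hypothetical names: `translate_fn` stands for any callable that takes a list of strings and returns a list of translations, so the example runs without loading a model.

```python
def translate_in_batches(texts, translate_fn, batch_size=8):
    """Send texts to translate_fn in chunks of batch_size, preserving order."""
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        results.extend(translate_fn(chunk))
    return results

# Stand-in for a real model call: upper-cases instead of translating.
def fake_translate(chunk):
    return [text.upper() for text in chunk]

out = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=2)
print(out)
```

With the class above, a real `translate_fn` would tokenize the whole chunk in one call and return the full list from `batch_decode(...)` rather than only its first element.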

In [None]:
# Test the complete translator
print("="*70)
print("COMPLETE ENGLISH-NEPALI TRANSLATOR")
print("="*70)

# Create translator (using NLLB)
translator = EnglishNepaliTranslator(model_type='nllb')

# Nepal-related sentences
nepal_sentences = [
    "Kathmandu is the capital of Nepal.",
    "Pokhara is famous for its beautiful lakes.",
    "Mount Everest is the tallest mountain in the world.",
    "Nepali people are very friendly and hospitable.",
    "Dal Bhat is the traditional food of Nepal.",
    "The Himalayan mountains are majestic.",
    "Buddha was born in Lumbini, Nepal.",
    "I want to visit Nepal someday."
]

print("\n🇬🇧 English → 🇳🇵 Nepali:")
print("-"*70)

for sentence in nepal_sentences:
    translation = translator.translate(sentence)
    print(f"🇬🇧 {sentence}")
    print(f"🇳🇵 {translation}")
    print()

## 7. Translation to Other South Asian Languages

Both mBART and NLLB support other South Asian languages too!

In [None]:
# Translate to multiple South Asian languages using NLLB
print("="*70)
print("ENGLISH TO SOUTH ASIAN LANGUAGES (NLLB)")
print("="*70)

test_text = "Nepal is a beautiful country."

# NLLB language codes for South Asian languages
south_asian_langs = {
    'npi_Deva': '🇳🇵 Nepali',
    'hin_Deva': '🇮🇳 Hindi',
    'ben_Beng': '🇧🇩 Bengali',
    'urd_Arab': '🇵🇰 Urdu',
    'tam_Taml': '🇮🇳 Tamil',
    'sin_Sinh': '🇱🇰 Sinhala'
}

print(f"\nOriginal (English): {test_text}\n")

for lang_code, lang_name in south_asian_langs.items():
    try:
        translation = translate_with_nllb(test_text, tgt_lang=lang_code)
        print(f"{lang_name}: {translation}")
    except Exception as e:
        print(f"{lang_name}: Error - {str(e)[:30]}")

## 8. Interactive Translator

In [None]:
def interactive_translator():
    """
    Interactive English to Nepali translator.
    """
    print("\n" + "="*60)
    print("INTERACTIVE ENGLISH-NEPALI TRANSLATOR")
    print("="*60)
    print("Enter English text to translate to Nepali.")
    print("Type 'quit' to exit.\n")

    while True:
        text = input("🇬🇧 English: ")

        # Ignore surrounding whitespace and case when checking for the exit command
        if text.strip().lower() == 'quit':
            print("धन्यवाद! (Thank you!)")
            break

        if text.strip():
            nepali = translator.translate(text)
            print(f"🇳🇵 Nepali:  {nepali}\n")

# Uncomment to run interactive translator
# interactive_translator()

## 9. Common Nepali Phrases

In [None]:
# Common phrases translation
print("="*70)
print("COMMON ENGLISH-NEPALI PHRASES")
print("="*70)

common_phrases = [
    # Greetings
    "Hello",
    "Good morning",
    "Good night",
    "How are you?",
    "I am fine",

    # Basic
    "Thank you",
    "You are welcome",
    "Please",
    "Sorry",
    "Yes",
    "No",

    # Questions
    "What is your name?",
    "Where are you from?",
    "How much does this cost?",
    "Where is the bathroom?",

    # Useful
    "I don't understand",
    "Can you help me?",
    "I love Nepal",
    "The food is delicious",
    "See you later"
]

print("\n{:<35} | {}".format("English", "Nepali"))
print("-"*70)

for phrase in common_phrases:
    nepali = translator.translate(phrase)
    print(f"{phrase:<35} | {nepali}")

## 10. Summary

### Key Takeaways:

| Model | Architecture | For Translation? | Nepali Support |
|-------|--------------|------------------|----------------|
| **mBERT** | Encoder-only | ❌ No | ✅ Understanding |
| **XLM-RoBERTa** | Encoder-only | ❌ No | ✅ Understanding |
| **mBART-50** | Encoder-Decoder | ✅ Yes | ✅ ne_NP |
| **NLLB-200** | Encoder-Decoder | ✅ Yes | ✅ npi_Deva |

### For English to Nepali Translation:
1. **Use NLLB-200** - Designed with low-resource languages like Nepali in mind; a sensible first choice
2. **Use mBART-50** - Good alternative with 50-language support
3. **Don't use mBERT/RoBERTa** - They can't generate translations!

### Language Codes:
- **mBART-50**: English = `en_XX`, Nepali = `ne_NP`
- **NLLB-200**: English = `eng_Latn`, Nepali = `npi_Deva`
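
Because the two models use different code schemes, passing an mBART code to NLLB (or the reverse) is an easy mistake. One way to guard against it is a small lookup table keyed by language name. The table below contains only codes that appear in this notebook; the `LANG_CODES` name and the `get_code` helper are hypothetical.

```python
LANG_CODES = {
    "english": {"mbart": "en_XX", "nllb": "eng_Latn"},
    "nepali":  {"mbart": "ne_NP", "nllb": "npi_Deva"},
    "hindi":   {"mbart": "hi_IN", "nllb": "hin_Deva"},
    "bengali": {"mbart": "bn_IN", "nllb": "ben_Beng"},
}

def get_code(language, model_type):
    """Return the language code for the given model family ('mbart' or 'nllb')."""
    try:
        return LANG_CODES[language.lower()][model_type.lower()]
    except KeyError:
        raise ValueError(f"No code for {language!r} with {model_type!r}") from None

print(get_code("Nepali", "nllb"))
print(get_code("Nepali", "mbart"))
```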

In [None]:
print("\n" + "="*70)
print("✅ NOTEBOOK COMPLETE!")
print("="*70)
print("\nYou have learned:")
print("  ✓ Why mBERT/RoBERTa cannot translate (encoder-only)")
print("  ✓ What mBERT is actually good for (cross-lingual similarity)")
print("  ✓ Using mBART-50 for English-Nepali translation")
print("  ✓ Using NLLB-200 for English-Nepali translation")
print("  ✓ Building a complete translator class")
print("\n🇳🇵 धन्यवाद! (Thank you!)")