# 🌐 English-Assamese Translation Model Training (Updated)

This notebook fine-tunes Meta's NLLB-200 model (`facebook/nllb-200-distilled-600M`) for English-to-Assamese translation using the **Helsinki-NLP/opus-100** dataset.

## Setup Instructions:
1. **Runtime → Change runtime type → GPU (T4 recommended)**
2. **Run all cells in order**
3. **Monitor training progress**
4. **Download the trained model**

---

## 1. Environment Setup

In [None]:
print("Installing required packages...")
!pip install -q torch transformers datasets accelerate sentencepiece
!pip install -q pandas numpy tqdm sacrebleu

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Clone Repository and Setup

In [None]:
# Clone the repository (replace with your actual repo URL)
!git clone https://github.com/Chandan735729/Machine-Translation-.git
%cd Machine-Translation-

## 3. Dataset Loading with Helsinki-NLP/opus-100

In [None]:
from datasets import load_dataset
import pandas as pd
from transformers import AutoTokenizer
import torch

print("🔄 Loading Helsinki-NLP/opus-100 dataset for English-Assamese...")

# Load the opus-100 dataset with English-Assamese pair
try:
    # The config is named "as-en"; each row carries both languages, so it also serves English→Assamese
    dataset = load_dataset("Helsinki-NLP/opus-100", "as-en")
    print("✅ Successfully loaded Helsinki-NLP/opus-100 dataset")
    print(f"Available splits: {list(dataset.keys())}")
    print(f"Train samples: {len(dataset['train'])}")
    print(f"Test samples: {len(dataset['test'])}")
    
    # Show sample data
    print("\n📝 Sample data:")
    for i in range(3):
        sample = dataset['train'][i]
        print(f"Sample {i+1}:")
        print(f"  English: {sample['translation']['en']}")
        print(f"  Assamese: {sample['translation']['as']}")
        print()
        
except Exception as e:
    print(f"❌ Error loading opus-100: {e}")
    print("🔄 Creating fallback dataset...")
    
    # Fallback to sample dataset with proper structure
    sample_data_train = [
        {'en': 'Hello, how are you?', 'as': 'নমস্কাৰ, আপুনি কেনে আছে?'},
        {'en': 'Thank you very much.', 'as': 'বহুত ধন্যবাদ।'},
        {'en': 'Good morning.', 'as': 'শুভ ৰাতিপুৱা।'},
        {'en': 'How can I help you?', 'as': 'মই আপোনাক কেনেকৈ সহায় কৰিব পাৰো?'},
        {'en': 'Education is very important.', 'as': 'শিক্ষা অতি গুৰুত্বপূৰ্ণ।'},
        {'en': 'Health is wealth.', 'as': 'স্বাস্থ্যই সম্পদ।'},
        {'en': 'Water is essential for life.', 'as': 'জীৱনৰ বাবে পানী অপৰিহাৰ্য।'},
        {'en': 'Children need proper nutrition.', 'as': 'শিশুসকলৰ উপযুক্ত পুষ্টিৰ প্ৰয়োজন।'},
        {'en': 'Clean environment is important.', 'as': 'পৰিষ্কাৰ পৰিৱেশ গুৰুত্বপূৰ্ণ।'},
        {'en': 'Technology helps development.', 'as': 'প্ৰযুক্তিয়ে উন্নয়নত সহায় কৰে।'},
        {'en': 'Women empowerment is crucial.', 'as': 'মহিলা সৱলীকৰণ অতি গুৰুত্বপূৰ্ণ।'},
        {'en': 'Agriculture feeds the nation.', 'as': 'কৃষিয়ে দেশক খুৱায়।'},
        {'en': 'Peace brings prosperity.', 'as': 'শান্তিয়ে সমৃদ্ধি আনে।'},
        {'en': 'Knowledge is power.', 'as': 'জ্ঞানেই শক্তি।'},
        {'en': 'Unity in diversity.', 'as': 'বৈচিত্ৰ্যৰ মাজত ঐক্য।'},
        {'en': 'Hard work pays off.', 'as': 'কঠোৰ পৰিশ্ৰমৰ ফল পোৱা যায়।'},
        {'en': 'Time is precious.', 'as': 'সময় অমূল্য।'},
        {'en': 'Respect your elders.', 'as': 'বয়োজ্যেষ্ঠসকলক সন্মান কৰক।'},
        {'en': 'Nature is beautiful.', 'as': 'প্ৰকৃতি সুন্দৰ।'},
        {'en': 'Love your country.', 'as': 'নিজৰ দেশক ভাল পাওক।'}
    ]
    
    sample_data_val = [
        {'en': 'Good evening.', 'as': 'শুভ সন্ধিয়া।'},
        {'en': 'See you tomorrow.', 'as': 'কাইলৈ লগ পাম।'},
        {'en': 'Take care of yourself.', 'as': 'নিজৰ যত্ন লওক।'},
        {'en': 'Have a nice day.', 'as': 'দিনটো ভাল কটাওক।'},
        {'en': 'Welcome to Assam.', 'as': 'অসমলৈ স্বাগতম।'}
    ]
    
    from datasets import Dataset, DatasetDict
    dataset = DatasetDict({
        'train': Dataset.from_list(sample_data_train),
        'validation': Dataset.from_list(sample_data_val)
    })
    print(f"✅ Created fallback dataset with {len(dataset['train'])} training samples")
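
The two branches above produce different row layouts: opus-100 nests each pair under `translation`, while the fallback uses flat `en`/`as` columns. A minimal sketch of normalising both into plain pairs and dropping empty ones is shown below. `to_pair` and `drop_empty_pairs` are hypothetical helpers written for illustration; they are not part of the repository or of `datasets`.

```python
def to_pair(row):
    """Return (english, assamese) from either row layout used in this notebook."""
    # opus-100 rows nest the texts: {'translation': {'en': ..., 'as': ...}}
    # Fallback rows are flat: {'en': ..., 'as': ...}
    pair = row["translation"] if "translation" in row else row
    return pair["en"].strip(), pair["as"].strip()


def drop_empty_pairs(rows):
    """Keep only pairs where both sides contain non-whitespace text."""
    pairs = [to_pair(r) for r in rows]
    return [(en, asm) for en, asm in pairs if en and asm]


# Illustrative rows only (two taken from the fallback data, one deliberately blank)
rows = [
    {"translation": {"en": "Good morning.", "as": "শুভ ৰাতিপুৱা।"}},
    {"en": "Knowledge is power.", "as": "জ্ঞানেই শক্তি।"},
    {"en": "   ", "as": "জ্ঞানেই শক্তি।"},
]
clean_pairs = drop_empty_pairs(rows)
print(f"Kept {len(clean_pairs)} of {len(rows)} rows")
```

On the real data, something like `drop_empty_pairs(dataset['train'].select(range(1000)))` would give a quick feel for how many rows are unusable, assuming the split is large enough.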

## 4. Data Preprocessing

In [None]:
# Cell 4: Data Preprocessing
print("🔄 Setting up tokenizer and preprocessing...")

# Initialize tokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Language codes for NLLB
source_lang = "eng_Latn"  # English
target_lang = "asm_Beng"  # Assamese

# Set source and target languages for the tokenizer
tokenizer.src_lang = source_lang
tokenizer.tgt_lang = target_lang

def preprocess_function(examples):
    """Preprocess the dataset for training"""
    # Extract source and target texts
    if 'translation' in examples:
        # Handle opus-100 format
        inputs = [ex['en'] for ex in examples['translation']]  # English as input
        targets = [ex['as'] for ex in examples['translation']]  # Assamese as target
    else:
        # Handle fallback format (direct en/as columns)
        inputs = examples['en']
        targets = examples['as']

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding=True
    )

    # Tokenize targets
    labels = tokenizer(
        text_target=targets,
        max_length=128,
        truncation=True,
        padding=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
print("📊 Preprocessing dataset...")
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset['train'].column_names
)

print("✅ Preprocessing completed!")
print(f"Train samples: {len(tokenized_dataset['train'])}")
print(f"Validation samples: {len(tokenized_dataset.get('validation', []))}")
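
One detail worth checking: with `padding=True`, the label sequences contain the tokenizer's pad token id, and the loss in Hugging Face seq2seq models skips only label positions set to `-100`. Whether the repository's `train.py` already handles this is not visible from this notebook, so the following is a sketch under the assumption that it does not. `mask_pad_labels` is a hypothetical helper, and the token ids in the example are made up.

```python
def mask_pad_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence with ignore_index."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Made-up ids: pretend 1 is the pad id and the first sequence was padded by two
batch_labels = [[256, 17, 2, 1, 1], [256, 9, 44, 81, 2]]
masked = mask_pad_labels(batch_labels, pad_token_id=1)
print(masked)  # [[256, 17, 2, -100, -100], [256, 9, 44, 81, 2]]
```

Inside `preprocess_function`, this could be applied as `model_inputs["labels"] = mask_pad_labels(labels["input_ids"], tokenizer.pad_token_id)`.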

## 5. Model Training

In [None]:
# Import training modules
from train import TranslationTrainer
import os

# Initialize trainer
print("🤖 Initializing translation trainer...")
trainer_obj = TranslationTrainer()

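# Load processed data
# Note: this rebinds `dataset` to whatever train.py loads; the
# `tokenized_dataset` built in section 4 is not passed to the trainer here.
print("📂 Loading processed dataset...")
dataset = trainer_obj.load_processed_data()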

print("\n📋 Training Configuration:")
print("  Model: facebook/nllb-200-distilled-600M")
print("  Dataset: Helsinki-NLP/opus-100 (English-Assamese)")
print(f"  Train samples: {len(dataset['train'])}")
print(f"  Validation samples: {len(dataset.get('validation', []))}")
print(f"  Device: {trainer_obj.device}")

# Start training
print("\n🚀 Starting model training...")
print("This may take 30-60 minutes depending on your GPU and dataset size.")

trainer, model_path = trainer_obj.train_model(dataset)

print(f"\n✅ Training completed!")
print(f"📁 Model saved to: {model_path}")
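
The runtime estimate above depends mostly on how many optimizer steps the run performs. A rough way to reason about it is sketched below. The batch size, gradient accumulation and epoch count are hypothetical values chosen for illustration; the real ones are set in `train.py` and are not shown in this notebook, and the result is an approximation rather than the exact count a trainer would report.

```python
import math


def optimizer_steps(num_samples, batch_size, grad_accum_steps, epochs):
    """Approximate number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * epochs


# Hypothetical settings, not taken from train.py
steps = optimizer_steps(num_samples=100_000, batch_size=8, grad_accum_steps=4, epochs=3)
print(steps)  # 9375
```

Multiplying the step count by the seconds per step observed in the first few minutes of training gives a more grounded estimate than a fixed time range.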

## 6. Model Evaluation

In [None]:
# Evaluate the trained model
print("📊 Evaluating model performance...")
eval_results = trainer_obj.evaluate_model(trainer, dataset)

print(f"\n📈 Evaluation Results:")
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

## 7. Test Translation

In [None]:
# Test the trained model
from translate import EnglishToAssameseTranslator

# Initialize translator with trained model
print("🔄 Loading trained model for testing...")
translator = EnglishToAssameseTranslator(model_path)

# Test sentences
test_sentences = [
    "Community health workers are the backbone of our medical system.",
    "Education is the key to development.",
    "Clean water is essential for good health.",
    "Vaccination protects children from diseases.",
    "Women's empowerment leads to stronger communities.",
    "Hello, how are you?",
    "Thank you for your help.",
    "The weather is nice today."
]

print("\n🧪 Testing translations with trained model:")
print("=" * 80)

for i, sentence in enumerate(test_sentences, 1):
    print(f"\n{i}. English: {sentence}")
    translation = translator.translate(sentence)
    print(f"   Assamese: {translation}")
    print("-" * 60)

print("\n✅ Translation testing completed!")

## 8. Download Trained Model

In [None]:
# Create a zip file of the trained model
import shutil
import os

model_dir = "models/nllb-finetuned-en-to-asm-final"

if os.path.exists(model_dir):
    print("📦 Creating model archive...")
    
    # Create zip file
    shutil.make_archive("trained_model_sangraha", 'zip', model_dir)
    
    print("✅ Model archived as 'trained_model_sangraha.zip'")
    print("📥 Download it from the Files panel on the left")
    
    # Show file size
    size_mb = os.path.getsize("trained_model_sangraha.zip") / (1024 * 1024)
    print(f"📊 Archive size: {size_mb:.1f} MB")
    
    # Also save training info
    import json
    training_info = {
        "dataset": "Helsinki-NLP/opus-100",
        "base_model": "facebook/nllb-200-distilled-600M",
        "language_pair": "English-Assamese",
        "training_date": str(pd.Timestamp.now()),
        "model_path": model_dir
    }
    
    with open("training_info.json", "w") as f:
        json.dump(training_info, f, indent=2)
    
    print("📄 Training info saved to 'training_info.json'")
    
else:
    print("❌ Model directory not found. Training may have failed.")
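
Before downloading, it can be worth confirming that the archive actually contains model files. The sketch below uses only the standard library. `summarize_archive` is a hypothetical helper, and the demo builds a throwaway archive in a temporary directory so the snippet runs on its own.

```python
import os
import shutil
import tempfile
import zipfile


def summarize_archive(zip_path):
    """Return (sorted file names, total uncompressed bytes) for a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        infos = [info for info in zf.infolist() if not info.is_dir()]
        names = sorted(info.filename for info in infos)
        total = sum(info.file_size for info in infos)
    return names, total


# Demo on a throwaway directory; the file name and content are placeholders
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "model")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "config.json"), "w") as f:
        f.write("{}")
    archive = shutil.make_archive(os.path.join(tmp, "demo_model"), "zip", demo_dir)
    names, total_bytes = summarize_archive(archive)

print(names, total_bytes)
```

On the real artifact, `summarize_archive("trained_model_sangraha.zip")` would list what was packed; an empty list would suggest the model directory was empty when it was archived.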

## 9. Optional: Upload to Google Drive

In [None]:
# Optional: Mount Google Drive and upload model
from google.colab import drive
import os  # needed for os.makedirs / os.path.exists if this cell runs on its own
import shutil

# Mount Google Drive
print("🔗 Mounting Google Drive...")
drive.mount('/content/drive')

# Copy model to Drive
drive_path = "/content/drive/MyDrive/translation_models/"
os.makedirs(drive_path, exist_ok=True)

if os.path.exists("trained_model_sangraha.zip"):
    shutil.copy("trained_model_sangraha.zip", f"{drive_path}trained_model_sangraha.zip")
    shutil.copy("training_info.json", f"{drive_path}training_info.json")
    print(f"✅ Model and info uploaded to Google Drive: {drive_path}")
else:
    print("❌ Model archive not found")

## 🎉 Training Complete!

### What's New:
- ✅ **Updated Dataset**: Loads `Helsinki-NLP/opus-100` (`as-en` config) in section 3
- ✅ **Fixed Tokenization**: Uses `text_target=` instead of the deprecated `as_target_tokenizer` context manager
- ✅ **Better Error Handling**: Falls back to a small built-in sample dataset if `opus-100` cannot be loaded
- ✅ **Flexible Column Mapping**: Handles both the nested `translation` format and flat `en`/`as` columns

### Next Steps:
1. **Download** the trained model (`trained_model_sangraha.zip`)
2. **Extract** it to your local project's `models/` directory
3. **Update** the model path in your local `translate.py` if needed
4. **Test** the model locally using the FastAPI backend

### Model Usage:
```python
from translate import EnglishToAssameseTranslator

translator = EnglishToAssameseTranslator("models/nllb-finetuned-en-to-asm-final")
result = translator.translate("Hello, how are you?")
print(result)
```

### Deployment:
- Place the model in your project directory
- Run the FastAPI server: `python run_server.py`
- Access the web interface at `http://localhost:8000`

---
**Happy translating with the opus-100 dataset! 🌐**