# 🌐 English-Assamese Translation Model Training (Updated)

This notebook fine-tunes Meta's NLLB model (`facebook/nllb-200-distilled-600M`) for English-to-Assamese translation using the **ai4bharat/sangraha** dataset.

## Setup Instructions:
1. **Runtime → Change runtime type → GPU (T4 recommended)**
2. **Run all cells in order**
3. **Monitor training progress**
4. **Download the trained model**

---

## 1. Environment Setup

In [None]:
# Install required packages
!pip install -q torch transformers datasets accelerate sentencepiece
!pip install -q pandas numpy tqdm

# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Clone Repository and Setup

In [None]:
# Clone the repository (replace with your actual repo URL)
!git clone https://github.com/your-username/Machine-Translation-.git
%cd Machine-Translation-

# List files to verify
!ls -la

## 3. Data Preparation with Sangraha Dataset

In [None]:
# Import and run data preparation
import sys
sys.path.append('src')

from data_preparation import DataPreparator
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)

# Initialize data preparator with sangraha dataset
print("🔄 Initializing data preparation with ai4bharat/sangraha dataset...")
preparator = DataPreparator()

# Load and prepare dataset
print("📥 Loading sangraha dataset...")
raw_dataset = preparator.load_dataset("ai4bharat/sangraha")

# Check dataset structure
print("📊 Dataset structure:")
print(f"Available splits: {list(raw_dataset.keys())}")
if 'train' in raw_dataset:
    print(f"Train columns: {raw_dataset['train'].column_names}")
    print(f"Train size: {len(raw_dataset['train'])}")
    
    # Show sample data
    print("\n📝 Sample data:")
    for i in range(min(3, len(raw_dataset['train']))):
        sample = raw_dataset['train'][i]
        print(f"Sample {i+1}: {sample}")

print("\n⚙️ Processing dataset...")
processed_dataset = preparator.prepare_datasets(raw_dataset)

# Save processed data
print("💾 Saving processed data...")
preparator.save_processed_data(processed_dataset)

# Print statistics
stats = preparator.get_data_stats(processed_dataset)
print("\n📊 Dataset Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

## 4. Dataset Exploration (Optional)

In [None]:
# Explore the dataset structure in more detail
from datasets import load_dataset

print("🔍 Exploring ai4bharat/sangraha dataset...")

# Try to load and inspect the dataset
try:
    # Check available configurations
    from datasets import get_dataset_config_names
    configs = get_dataset_config_names("ai4bharat/sangraha")
    print(f"Available configurations: {configs}")
    
    # Try different configurations for English-Assamese
    possible_configs = ['eng-asm', 'en-as', 'english-assamese']
    
    for config in possible_configs:
        if config in configs:
            print(f"\n✅ Found configuration: {config}")
            dataset = load_dataset("ai4bharat/sangraha", config)
            print(f"Splits: {list(dataset.keys())}")
            if 'train' in dataset:
                print(f"Columns: {dataset['train'].column_names}")
                print(f"Sample: {dataset['train'][0]}")
            break
    else:
        print("❌ No suitable English-Assamese configuration found")
        print("Available configs:", configs)
        
except Exception as e:
    print(f"Error exploring dataset: {e}")
    print("Will use fallback sample dataset")

## 5. Model Training

In [None]:
# Import training modules
from train import TranslationTrainer
import os

# Initialize trainer
print("🤖 Initializing translation trainer...")
trainer_obj = TranslationTrainer()

# Load processed data
print("📂 Loading processed dataset...")
dataset = trainer_obj.load_processed_data()

print("\n📋 Training Configuration:")
print("  Model: facebook/nllb-200-distilled-600M")
print("  Dataset: ai4bharat/sangraha (English-Assamese)")
print(f"  Train samples: {len(dataset['train'])}")
print(f"  Validation samples: {len(dataset.get('validation', []))}")
print(f"  Device: {trainer_obj.device}")

# Start training
print("\n🚀 Starting model training...")
print("This may take 30-60 minutes depending on your GPU and dataset size.")

trainer, model_path = trainer_obj.train_model(dataset)

print("\n✅ Training completed!")
print(f"📁 Model saved to: {model_path}")

## 6. Model Evaluation

In [None]:
# Evaluate the trained model
print("📊 Evaluating model performance...")
eval_results = trainer_obj.evaluate_model(trainer, dataset)

print("\n📈 Evaluation Results:")
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

## 7. Test Translation

In [None]:
# Test the trained model
from translate import EnglishToAssameseTranslator

# Initialize translator with trained model
print("🔄 Loading trained model for testing...")
translator = EnglishToAssameseTranslator(model_path)

# Test sentences
test_sentences = [
    "Community health workers are the backbone of our medical system.",
    "Education is the key to development.",
    "Clean water is essential for good health.",
    "Vaccination protects children from diseases.",
    "Women's empowerment leads to stronger communities.",
    "Hello, how are you?",
    "Thank you for your help.",
    "The weather is nice today."
]

print("\n🧪 Testing translations with trained model:")
print("=" * 80)

for i, sentence in enumerate(test_sentences, 1):
    print(f"\n{i}. English: {sentence}")
    translation = translator.translate(sentence)
    print(f"   Assamese: {translation}")
    print("-" * 60)

print("\n✅ Translation testing completed!")

## 8. Download Trained Model

In [None]:
# Create a zip file of the trained model
import shutil
import os

model_dir = "models/nllb-finetuned-en-to-asm-final"

if os.path.exists(model_dir):
    print("📦 Creating model archive...")
    
    # Create zip file
    shutil.make_archive("trained_model_sangraha", 'zip', model_dir)
    
    print("✅ Model archived as 'trained_model_sangraha.zip'")
    print("📥 Download it from the Files panel on the left")
    
    # Show file size
    size_mb = os.path.getsize("trained_model_sangraha.zip") / (1024 * 1024)
    print(f"📊 Archive size: {size_mb:.1f} MB")
    
    # Also save training info
    import json
    import pandas as pd  # provides pd.Timestamp for the training date below
    training_info = {
        "dataset": "ai4bharat/sangraha",
        "base_model": "facebook/nllb-200-distilled-600M",
        "language_pair": "English-Assamese",
        "training_date": str(pd.Timestamp.now()),
        "model_path": model_dir
    }
    
    with open("training_info.json", "w") as f:
        json.dump(training_info, f, indent=2)
    
    print("📄 Training info saved to 'training_info.json'")
    
else:
    print("❌ Model directory not found. Training may have failed.")

## 9. Optional: Upload to Google Drive

In [None]:
# Optional: Mount Google Drive and upload model
import os
import shutil

from google.colab import drive

# Mount Google Drive
print("🔗 Mounting Google Drive...")
drive.mount('/content/drive')

# Copy model to Drive
drive_path = "/content/drive/MyDrive/translation_models/"
os.makedirs(drive_path, exist_ok=True)

if os.path.exists("trained_model_sangraha.zip"):
    shutil.copy("trained_model_sangraha.zip", f"{drive_path}trained_model_sangraha.zip")
    shutil.copy("training_info.json", f"{drive_path}training_info.json")
    print(f"✅ Model and info uploaded to Google Drive: {drive_path}")
else:
    print("❌ Model archive not found")

## 🎉 Training Complete!

### What's New:
- ✅ **Updated Dataset**: Now using `ai4bharat/sangraha` instead of PMIndia
- ✅ **Fixed Tokenization**: Resolved the deprecated `as_target_tokenizer` warning
- ✅ **Better Error Handling**: Multiple fallback options for dataset loading
- ✅ **Flexible Column Mapping**: Handles different dataset column formats
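
The tokenization and column-mapping changes are easiest to see in code. The sketch below is not the repository's implementation: `pick_column`, `preprocess`, and the candidate column names are hypothetical, chosen only for illustration. The one real API it relies on is the `text_target` argument of Hugging Face tokenizers, which replaces the deprecated `as_target_tokenizer` context manager.

```python
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical candidate names; the actual dataset columns may differ.
SOURCE_CANDIDATES = ("english", "en", "eng", "src", "source")
TARGET_CANDIDATES = ("assamese", "as", "asm", "tgt", "target")


def pick_column(column_names: Sequence[str], candidates: Sequence[str]) -> str:
    """Return the first candidate found in column_names (case-insensitive)."""
    lowered = {name.lower(): name for name in column_names}
    for candidate in candidates:
        if candidate in lowered:
            return lowered[candidate]  # hand back the original spelling
    raise KeyError(f"None of {list(candidates)} found in {list(column_names)}")


def preprocess(
    batch: Dict[str, List[str]],
    tokenizer: Callable[..., Any],
    src_col: str,
    tgt_col: str,
    max_length: int = 128,
) -> Any:
    """Tokenize sources and targets in a single call via text_target."""
    return tokenizer(
        batch[src_col],
        text_target=batch[tgt_col],  # replaces `with tokenizer.as_target_tokenizer():`
        max_length=max_length,
        truncation=True,
    )
```

With a real tokenizer, for example `AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="asm_Beng")`, the returned encoding should carry `input_ids` for the source text and `labels` for the target text. Resolving the column names once with `pick_column(dataset.column_names, ...)` keeps the tokenization step independent of how a particular dataset names its fields.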

### Next Steps:
1. **Download** the trained model (`trained_model_sangraha.zip`)
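2. **Extract** it into `models/nllb-finetuned-en-to-asm-final/` in your local project; the archive stores the model files at its root, so extracting straight into `models/` would not recreate that folder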
3. **Update** the model path in your local `translate.py` if needed
4. **Test** the model locally using the FastAPI backend
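
The extraction step can be scripted. This is a minimal sketch, assuming the archive was produced by the `shutil.make_archive` call in section 8 (which places the model files at the archive root) and that the fine-tuned model was saved in the usual Hugging Face layout with a `config.json`. `extract_model` and its `required` parameter are hypothetical helpers, not part of the repository.

```python
import zipfile
from pathlib import Path
from typing import List, Sequence


def extract_model(
    archive: str,
    dest: str,
    required: Sequence[str] = ("config.json",),
) -> List[str]:
    """Unzip `archive` into `dest` and return the `required` files that are missing."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)  # create models/... if needed
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_path)
    # An empty list means every required file is present after extraction.
    return [name for name in required if not (dest_path / name).is_file()]
```

Calling `extract_model("trained_model_sangraha.zip", "models/nllb-finetuned-en-to-asm-final")` from the project root should return an empty list if the download is complete; a non-empty list suggests the archive is truncated or training did not save a full model.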

### Model Usage:
```python
from translate import EnglishToAssameseTranslator

translator = EnglishToAssameseTranslator("models/nllb-finetuned-en-to-asm-final")
result = translator.translate("Hello, how are you?")
print(result)
```

### Deployment:
- Place the model in your project directory
- Run the FastAPI server: `python run_server.py`
- Access the web interface at `http://localhost:8000`
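
Before opening the browser, it can help to confirm that the server is actually listening. The helper below is a sketch using only the standard library; `server_is_up` is a hypothetical name, and the URL is the one given above. It treats any HTTP response, including an error status, as evidence that the server is running, and it bypasses system proxy settings because the target is local.

```python
import urllib.error
import urllib.request


def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if anything answers HTTP at `url` within `timeout` seconds."""
    # An empty ProxyHandler disables proxies, so localhost is contacted directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with a 4xx/5xx status, so it is reachable.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, name resolution failure, or timeout.
        return False
```

For example, `server_is_up("http://localhost:8000")` can be polled in a short loop after starting `python run_server.py`, since loading the model may delay startup.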

---
**Happy translating with the Sangraha dataset! 🌐**