# 🚀 English-Vietnamese Translation Model Training

**Transformer-based Neural Machine Translation from Scratch**

This notebook trains an EN-VI translation model using:
- **PhoMT Dataset** (~3M sentence pairs)
- **SentencePiece Tokenization**
- **Transformer Architecture** with Pre-LayerNorm
- **Beam Search** for inference
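
Beam search, listed above as the inference strategy, keeps the `beam_size` highest-scoring partial translations at each step instead of only the single best one. The sketch below is a toy illustration in plain Python, not the repository's decoder: `next_token_probs` is a hypothetical stand-in for the model, its probabilities are made up, and no length normalization is applied.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token
# probabilities. A real decoder would compute these with the Transformer.
# Toy tokens are written without diacritics.
def next_token_probs(prefix):
    table = {
        (): {"xin": 0.6, "chao": 0.4},
        ("xin",): {"chao": 0.9, "</s>": 0.1},
        ("chao",): {"</s>": 0.7, "ban": 0.3},
        ("xin", "chao"): {"</s>": 0.8, "ban": 0.2},
    }
    return table.get(prefix, {"</s>": 1.0})


def beam_search(step_fn, beam_size=4, max_len=10, eos="</s>"):
    """Return the (tokens, log_prob) hypothesis with the highest score."""
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, p in step_fn(tokens).items():
                candidates.append((tokens + (tok,), score + math.log(p)))
        # Keep only the beam_size best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that were cut off at max_len
    return max(finished, key=lambda c: c[1])


best_tokens, best_score = beam_search(next_token_probs, beam_size=2)
print(" ".join(best_tokens), round(best_score, 3))
```

Production decoders usually add a length penalty so that longer hypotheses are not unfairly penalized by their accumulated log-probabilities. Whether this project does so is not shown in the notebook.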

---

## 1. ⚙️ Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/TranKien2005/EV_Translate_Modle_NLP_Project.git
%cd EV_Translate_Modle_NLP_Project

In [None]:
# Install dependencies. PyTorch is skipped on purpose: Kaggle already ships
# a build that matches its CUDA version, so do not reinstall it.
!pip install -q datasets sentencepiece sacrebleu google-generativeai python-dotenv tqdm tensorboard seaborn pyyaml

In [None]:
# Verify PyTorch and CUDA
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA device: {torch.cuda.get_device_name(0)}')

In [None]:
# Create .env file with API keys
# ⚠️ IMPORTANT: Replace with your actual keys!

GEMINI_API_KEY = "YOUR_GEMINI_API_KEY_HERE"  # Get from Google AI Studio
HF_TOKEN = "YOUR_HF_TOKEN_HERE"  # Get from Hugging Face

with open('.env', 'w') as f:
    f.write(f'GEMINI_API_KEY={GEMINI_API_KEY}\n')
    f.write(f'HF_TOKEN={HF_TOKEN}\n')

print('✓ .env file created')

## 2. 📥 Download & Preprocess Data

In [None]:
# Download PhoMT dataset (~500MB)
!python scripts/download_phomt.py

In [None]:
# Preprocess data: tokenize, filter by length, save to .pt files
# This will train SentencePiece tokenizers and process train/val/test sets
!python scripts/preprocess_data.py
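
The length filtering mentioned in the preprocessing step can be pictured with the sketch below. It is not the code in `scripts/preprocess_data.py`: the function name `keep_pair`, the `max_len` and `max_ratio` thresholds, and the length-ratio check itself are assumptions chosen for illustration.

```python
def keep_pair(src_ids, tgt_ids, max_len=128, max_ratio=3.0):
    """Return True if a tokenized sentence pair passes simple length checks.

    max_len and max_ratio are hypothetical defaults, not the project's values.
    """
    if not src_ids or not tgt_ids:
        return False  # drop pairs with an empty side
    if len(src_ids) > max_len or len(tgt_ids) > max_len:
        return False  # drop pairs that exceed the maximum sequence length
    longer = max(len(src_ids), len(tgt_ids))
    shorter = min(len(src_ids), len(tgt_ids))
    return longer / shorter <= max_ratio  # drop badly mismatched lengths


pairs = [
    ([1, 2, 3], [4, 5, 6, 7]),   # passes every check
    ([1] * 200, [2] * 180),      # too long
    ([1, 2], [3] * 10),          # length ratio 5.0 exceeds max_ratio
    ([], [1]),                   # empty source
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
print(f"kept {len(filtered)} of {len(pairs)} pairs")
```

Filtering on token counts after tokenization, rather than on raw characters, keeps the threshold comparable to the model's `max_seq_len`.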

## 3. 🔧 Configuration Check

In [None]:
import sys
sys.path.insert(0, '.')

from src.config import load_config
from src.models import Transformer

config = load_config()

print("="*50)
print("📋 Configuration Summary")
print("="*50)
print(f"\n🔹 Model:")
print(f"   d_model: {config.d_model}")
print(f"   layers: {config.num_encoder_layers} enc + {config.num_decoder_layers} dec")
print(f"   d_ff: {config.d_ff}")
print(f"   dropout: {config.dropout}")
print(f"\n🔹 Training:")
print(f"   epochs: {config.epochs}")
print(f"   batch_size: {config.batch_size} (effective: {config.batch_size * config.gradient_accumulation_steps})")
print(f"   learning_rate: {config.learning_rate}")
print(f"   warmup_steps: {config.warmup_steps}")
print(f"\n🔹 Data:")
print(f"   vocab_size: {config.src_vocab_size} (src) / {config.tgt_vocab_size} (tgt)")
print(f"   max_seq_len: {config.max_seq_len}")
print(f"   data_source: {config.data_source}")
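
The `warmup_steps` value printed above controls how long the learning rate ramps up before it starts to decay. The notebook does not show which scheduler the project uses, so the sketch below assumes the inverse-square-root schedule from the original Transformer paper. The function name `noam_lr` and the default values are illustrative only.

```python
def noam_lr(step, d_model=512, warmup_steps=4000, factor=1.0):
    """Learning rate at a given optimizer step (assumed schedule).

    Rises linearly for warmup_steps updates, then decays proportionally
    to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for s in (1, 1000, 4000, 16000):
    print(f"step {s:>6}: lr = {noam_lr(s):.6f}")
```

With this schedule the peak learning rate is reached exactly at `warmup_steps`. If the project instead combines a fixed `learning_rate` with a warmup, the shape of the ramp is similar but the peak is set directly by the config.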

In [None]:
# Count model parameters
model = Transformer(
    src_vocab_size=config.src_vocab_size,
    tgt_vocab_size=config.tgt_vocab_size,
    d_model=config.d_model,
    num_heads=config.num_heads,
    num_encoder_layers=config.num_encoder_layers,
    num_decoder_layers=config.num_decoder_layers,
    d_ff=config.d_ff
)

total_params = sum(p.numel() for p in model.parameters())
print(f"\n📊 Model Parameters: {total_params:,} ({total_params/1e6:.1f}M)")
# 4 bytes per parameter assumes float32 weights
print(f"📦 Model Size: ~{total_params * 4 / 1024 / 1024:.1f} MB")
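
The size estimate above multiplies the parameter count by 4 bytes, which corresponds to float32 weights. As a rough sketch, the same arithmetic for other precisions looks like this. The helper `model_size_mb` is hypothetical, and the parameter count is the approximate ~18.4M figure quoted in this notebook's notes.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(num_params, dtype="float32"):
    """Approximate size of the raw weights in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 / 1024


num_params = 18_400_000  # approximate count, taken from the notes
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{model_size_mb(num_params, dtype):.1f} MB")
```

A training checkpoint is usually larger than this estimate because it also stores optimizer state. With an Adam-style optimizer, which keeps two extra values per parameter, the file is roughly three times the size of the weights alone.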

## 4. 🏋️ Training

In [None]:
# Switch to processed data for faster training
import yaml

with open('config/config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)

cfg['data']['source'] = 'processed'  # Use pre-tokenized data

# Rewriting the file drops any YAML comments; sort_keys=False keeps key order
with open('config/config.yaml', 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False, sort_keys=False)

print('✓ Config updated to use processed data')

### ⚡ Resume Training from Checkpoint (Optional)

If you have a checkpoint from a previous training run, upload it as a Kaggle input and enter its path below.

**How to use:**
1. Upload the checkpoint file (`.pt`) to a Kaggle Dataset
2. Add the dataset to this notebook
3. Enter the path in the cell below (for example: `/kaggle/input/my-checkpoint/best_model.pt`)
4. If the path is left empty or the file does not exist, training starts from scratch

In [None]:
#@title 📂 Enter the checkpoint path to resume training
#@markdown Leave empty to train from scratch

import os

# ========================================
# 👇 ENTER THE CHECKPOINT PATH HERE 👇
# ========================================
RESUME_CHECKPOINT_PATH = ""  # Example: "/kaggle/input/my-model/best_model.pt"

# Check whether the checkpoint exists
if RESUME_CHECKPOINT_PATH and os.path.exists(RESUME_CHECKPOINT_PATH):
    print(f"✅ Checkpoint found: {RESUME_CHECKPOINT_PATH}")
    print(f"   Size: {os.path.getsize(RESUME_CHECKPOINT_PATH) / 1024 / 1024:.1f} MB")
    RESUME_FROM = RESUME_CHECKPOINT_PATH
else:
    if RESUME_CHECKPOINT_PATH:
        print(f"⚠️ Checkpoint NOT found: {RESUME_CHECKPOINT_PATH}")
    print("📌 Will train from scratch")
    RESUME_FROM = None

In [None]:
# Start training!
from src.train import Trainer

trainer = Trainer()
trainer.setup()

# Train with or without a checkpoint
if RESUME_FROM:
    print(f"\n🔄 Resuming training from: {RESUME_FROM}")
    trainer.train(resume_from=RESUME_FROM)
else:
    print("\n🚀 Starting training from scratch")
    trainer.train()

## 5. 📊 Evaluation

In [None]:
# Evaluate on test set with BLEU score
!python -m src.evaluate \
    --checkpoint checkpoints/best_model.pt \
    --vocab-src checkpoints/tokenizers/tokenizer_src.model \
    --vocab-tgt checkpoints/tokenizers/tokenizer_tgt.model \
    --config config/config.yaml \
    --test

In [None]:
# Evaluate with Gemini Score (optional, requires API key)
!python -m src.evaluate \
    --checkpoint checkpoints/best_model.pt \
    --vocab-src checkpoints/tokenizers/tokenizer_src.model \
    --vocab-tgt checkpoints/tokenizers/tokenizer_tgt.model \
    --config config/config.yaml \
    --test --gemini

## 6. 🔮 Interactive Translation

In [None]:
from src.evaluate import load_translator
from src.config import load_config

config = load_config()

# Load trained model
translator = load_translator(
    checkpoint_path='checkpoints/best_model.pt',
    vocab_src_path='checkpoints/tokenizers/tokenizer_src.model',
    vocab_tgt_path='checkpoints/tokenizers/tokenizer_tgt.model',
    config_path='config/config.yaml'
)

In [None]:
# Test translation!
test_sentences = [
    "Hello, how are you?",
    "The weather is nice today.",
    "I love learning new languages.",
    "Machine translation is improving rapidly.",
    "Can you help me with this problem?"
]

print("="*60)
print("🌐 Translation Examples")
print("="*60)

for sentence in test_sentences:
    translation = translator.translate(sentence, beam_size=4)
    print(f"\n🔹 EN: {sentence}")
    print(f"🔹 VI: {translation}")

## 7. 💾 Save Model to Kaggle Output

In [None]:
import shutil
import os

# Copy model files to Kaggle output
output_dir = '/kaggle/working/model_output'
os.makedirs(output_dir, exist_ok=True)

# Copy checkpoint
shutil.copy('checkpoints/best_model.pt', output_dir)

# Copy tokenizers
shutil.copytree('checkpoints/tokenizers', f'{output_dir}/tokenizers', dirs_exist_ok=True)

# Copy config
shutil.copy('config/config.yaml', output_dir)

# Copy evaluation results
if os.path.exists('logs'):
    shutil.copytree('logs', f'{output_dir}/logs', dirs_exist_ok=True)

print(f'\n✓ Model saved to {output_dir}')
print('\nFiles:')
for f in os.listdir(output_dir):
    print(f'  - {f}')

---

## 📝 Notes

**Model Info:**
- Architecture: Transformer with Pre-LayerNorm
- Parameters: ~18.4M  
- Training time: ~8-10 hours on Kaggle GPU

**Expected Results:**
- BLEU Score: 15-25 (depends on training time)
- Gemini Score: 50-70
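
To give a feel for what the BLEU figure measures, here is a simplified, single-sentence sketch of the metric: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is not sacreBLEU and will not reproduce the evaluation script's numbers. The function `simple_bleu` is hypothetical, it splits on whitespace, uses one reference, and applies no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection clips each n-gram to its count in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


score = simple_bleu("the cat sat on the mat", "the cat sat on a mat")
print(f"BLEU: {score:.2f}")
```

Real evaluations compute BLEU over the whole test corpus with standardized tokenization, which is why a dedicated tool such as sacreBLEU should be used for any reported score.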

**Tips:**
1. Use a GPU accelerator (P100 or T4) for faster training
2. If the session is running out of time, reduce the number of epochs in the config
3. Save checkpoints frequently so that training can resume if the session is interrupted
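
As a rough aid for the second tip, the sketch below estimates how many epochs fit into a session's time budget. Every number in the example call is a hypothetical placeholder: measure `seconds_per_batch` by timing a few batches on your own accelerator, and take the batch size and accumulation steps from your config. The helper name `training_plan` is also an assumption.

```python
import math


def training_plan(num_examples, batch_size, accumulation_steps,
                  seconds_per_batch, time_budget_hours):
    """Estimate per-epoch cost and how many full epochs fit in the budget."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update happens every accumulation_steps batches.
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    hours_per_epoch = batches_per_epoch * seconds_per_batch / 3600
    return {
        "effective_batch_size": batch_size * accumulation_steps,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "hours_per_epoch": hours_per_epoch,
        "max_epochs": int(time_budget_hours // hours_per_epoch),
    }


# Hypothetical inputs for illustration only.
plan = training_plan(
    num_examples=3_000_000,
    batch_size=64,
    accumulation_steps=4,
    seconds_per_batch=0.25,
    time_budget_hours=9,
)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Leave some headroom in the budget for validation, checkpoint saving, and the final evaluation cells.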