# 🚀 English-Vietnamese Translation Model Training

**Transformer-based Neural Machine Translation: EN → VI**

This notebook trains a translation model from **scratch**:
- **Source**: English 🇬🇧
- **Target**: Vietnamese 🇻🇳

---

## 1. ⚙️ Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/TranKien2005/EV_Translate_Modle_NLP_Project.git
%cd EV_Translate_Modle_NLP_Project

In [None]:
# Install dependencies
!pip install -q datasets sentencepiece sacrebleu google-generativeai python-dotenv tqdm tensorboard seaborn pyyaml

In [None]:
# Verify PyTorch and CUDA
import torch
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA device: {torch.cuda.get_device_name(0)}')

In [None]:
# Create .env file with API keys
GEMINI_API_KEY = "YOUR_GEMINI_API_KEY_HERE"
HF_TOKEN = "YOUR_HF_TOKEN_HERE"

with open('.env', 'w') as f:
    f.write(f'GEMINI_API_KEY={GEMINI_API_KEY}\n')
    f.write(f'HF_TOKEN={HF_TOKEN}\n')

print('✓ .env file created')
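
The cell above writes `GEMINI_API_KEY` and `HF_TOKEN` verbatim, so it is easy to run it with the placeholders still in place. A minimal sketch of a sanity check, assuming the `KEY=VALUE` layout written above; the helper name `find_placeholder_keys` and the `YOUR_..._HERE` pattern it looks for are illustrative and not part of the repository:

```python
def find_placeholder_keys(env_text):
    """Return the keys in KEY=VALUE text whose values still look like placeholders."""
    unset = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, value = line.split('=', 1)
        value = value.strip().strip('"\'')
        # Assumption: placeholders follow the YOUR_..._HERE pattern used above.
        if not value or (value.startswith('YOUR_') and value.endswith('_HERE')):
            unset.append(key.strip())
    return unset


sample = 'GEMINI_API_KEY=YOUR_GEMINI_API_KEY_HERE\nHF_TOKEN=hf_example\n'
print(find_placeholder_keys(sample))
```

In the notebook this could be called on the contents of `.env` right after the file is written, so that a forgotten key surfaces before the download and training steps.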

## 2. 🔧 Configure Paths for Kaggle

In [None]:
# ⚠️ IMPORTANT: Configure paths for Kaggle
import yaml

CONFIG_FILE = 'config/config.yaml'

with open(CONFIG_FILE, 'r') as f:
    cfg = yaml.safe_load(f)

# Update paths for Kaggle
cfg['paths'] = {
    'data_dir': '/kaggle/working/data',
    'checkpoint_dir': '/kaggle/working/checkpoints',
    'log_dir': '/kaggle/working/logs'
}

with open(CONFIG_FILE, 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False, allow_unicode=True)

print('✓ Config paths updated for Kaggle:')
print(f"  data_dir: {cfg['paths']['data_dir']}")
print(f"  checkpoint_dir: {cfg['paths']['checkpoint_dir']}")
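
Updating the config does not create the directories themselves; the notebook only creates `data_dir` explicitly, with `mkdir -p` in the next section. A small sketch for creating all of them up front, assuming the training code does not already do this for `checkpoint_dir` and `log_dir` (the notebook does not say either way); `ensure_dirs` is a hypothetical helper:

```python
from pathlib import Path


def ensure_dirs(paths):
    """Create every directory in a {name: path} mapping and return the Path objects."""
    created = {}
    for name, raw in paths.items():
        directory = Path(raw)
        # parents=True mirrors `mkdir -p`; exist_ok=True makes reruns harmless.
        directory.mkdir(parents=True, exist_ok=True)
        created[name] = directory
    return created
```

With the config loaded as above, the call would be `ensure_dirs(cfg['paths'])`.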

## 3. 📥 Download & Preprocess Data

In [None]:
!mkdir -p /kaggle/working/data
!python scripts/download_phomt.py

In [None]:
# Preprocess data (EN-VI direction)
!python scripts/preprocess_data.py --config config/config.yaml
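
A parallel corpus is only usable if the English and Vietnamese sides stay line-aligned through preprocessing. A quick sketch of that check, assuming the preprocessed data is stored as two plain-text files with one sentence per line; the notebook shows neither the storage format nor the file names, so both are assumptions, and `count_lines` and `is_aligned` are hypothetical helpers:

```python
def count_lines(path):
    """Count lines in a UTF-8 text file without loading it into memory."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)


def is_aligned(src_path, tgt_path):
    """Return True when source and target files have the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)
```

If `preprocess_data.py` writes a different format, for example a single binary or JSON file, this check does not apply as written.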

## 4. 🔍 Configuration Check

In [None]:
import sys
sys.path.insert(0, '.')

from src.config import load_config

config = load_config('config/config.yaml')

print("="*50)
print("📋 Configuration Summary (EN → VI)")
print("="*50)
print(f"\n🔹 Paths:")
print(f"   data_dir: {config.paths.data_dir}")
print(f"   checkpoint_dir: {config.paths.checkpoint_dir}")
print(f"\n🔹 Model:")
print(f"   d_model: {config.d_model}")
print(f"   layers: {config.num_encoder_layers} enc + {config.num_decoder_layers} dec")
print(f"\n🔹 Training:")
print(f"   epochs: {config.epochs}")
print(f"   batch_size: {config.batch_size}")
print(f"   learning_rate: {config.learning_rate}")

## 5. 🏋️ Training

In [None]:
# Switch to processed data
import yaml

with open('config/config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)

cfg['data']['source'] = 'processed'

with open('config/config.yaml', 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False, allow_unicode=True)

print('✓ Config updated to use processed data')

In [None]:
# Start training!
from src.train import Trainer

trainer = Trainer(config_path='config/config.yaml')
trainer.setup()

print("\n" + "="*50)
print("🇬🇧 → 🇻🇳 English to Vietnamese Translation")
print("="*50)

trainer.train()