# 🔄 Resume Training: Vietnamese-English Model

Resume training from an existing checkpoint.

**Prerequisites:**
- Checkpoint file (`best_model.pt`)
- Tokenizer files (`tokenizer_vi.model`, `tokenizer_en.model`)
- Processed data (`train.pt`, `val.pt`)

---

## 1. ⚙️ Setup

In [None]:
# Clone repository
!git clone https://github.com/TranKien2005/EV_Translate_Modle_NLP_Project.git
%cd EV_Translate_Modle_NLP_Project

In [None]:
!pip install -q datasets sentencepiece sacrebleu google-generativeai python-dotenv tqdm tensorboard pyyaml

In [None]:
# Create .env
GEMINI_API_KEY = "YOUR_GEMINI_API_KEY_HERE"
HF_TOKEN = "YOUR_HF_TOKEN_HERE"

with open('.env', 'w') as f:
    f.write(f'GEMINI_API_KEY={GEMINI_API_KEY}\n')
    f.write(f'HF_TOKEN={HF_TOKEN}\n')
print('✓ .env created')

## 2. 🔧 Configure Paths for Kaggle

In [None]:
# ⚠️ IMPORTANT: Define paths
import yaml

# Path to your uploaded model dataset on Kaggle
KAGGLE_INPUT_MODEL = '/kaggle/input/vi-en-model'  # CHANGE THIS to your dataset name

CONFIG_FILE = 'config/config_vi_en.yaml'

with open(CONFIG_FILE, 'r') as f:
    cfg = yaml.safe_load(f)

# Update paths for Kaggle
cfg['paths'] = {
    'data_dir': '/kaggle/working/data',
    'checkpoint_dir': '/kaggle/working/checkpoints_vi_en',
    'log_dir': '/kaggle/working/logs_vi_en'
}

# Use processed data
cfg['data']['source'] = 'processed'

with open(CONFIG_FILE, 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False, allow_unicode=True)

print('✓ Config paths updated for Kaggle')
print(f"  Input model: {KAGGLE_INPUT_MODEL}")
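
Assigning a new dict to `cfg['paths']` replaces every key already under `paths`, so any other path entries in the repository's config would be dropped. If that is not intended, a merge keeps them. The sketch below is one way to do it; `merge_paths` is a hypothetical helper (not part of the repository), and the `extra_dir` key in the example is invented purely for illustration.

```python
def merge_paths(cfg, overrides):
    """Return a shallow copy of cfg with overrides merged into cfg['paths'].

    Hypothetical helper: existing keys under 'paths' are kept unless overridden.
    """
    merged = dict(cfg)
    merged['paths'] = {**(cfg.get('paths') or {}), **overrides}
    return merged


# Illustrative config; 'extra_dir' is a made-up key, not from the project.
example = {'paths': {'data_dir': 'data', 'extra_dir': 'extra'},
           'data': {'source': 'raw'}}
updated = merge_paths(example, {'data_dir': '/kaggle/working/data'})
print(updated['paths'])
```

In the cell above, this would correspond to `cfg = merge_paths(cfg, {...})` with the three Kaggle paths as overrides.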

## 3. 📂 Copy Files from Input Dataset

In [None]:
import shutil
import os

# Create directories
!mkdir -p /kaggle/working/data/processed_vi_en
!mkdir -p /kaggle/working/checkpoints_vi_en/tokenizers

# Copy processed data
shutil.copy(f'{KAGGLE_INPUT_MODEL}/train.pt', '/kaggle/working/data/processed_vi_en/train.pt')
shutil.copy(f'{KAGGLE_INPUT_MODEL}/val.pt', '/kaggle/working/data/processed_vi_en/val.pt')
print('✓ Processed data copied')

# Copy tokenizers
shutil.copy(f'{KAGGLE_INPUT_MODEL}/tokenizer_vi.model', '/kaggle/working/checkpoints_vi_en/tokenizers/tokenizer_vi.model')
shutil.copy(f'{KAGGLE_INPUT_MODEL}/tokenizer_en.model', '/kaggle/working/checkpoints_vi_en/tokenizers/tokenizer_en.model')
print('✓ Tokenizers copied')

# Copy checkpoint
shutil.copy(f'{KAGGLE_INPUT_MODEL}/best_model.pt', '/kaggle/working/checkpoints_vi_en/best_model.pt')
print('✓ Checkpoint copied')
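
If any of the five files is missing from the input dataset, `shutil.copy` raises `FileNotFoundError` partway through the cell, after some files have already been copied. A pre-check that reports all missing files at once may be easier to act on. This is a minimal sketch; `find_missing` and `REQUIRED_FILES` are hypothetical names, and the file list simply mirrors the copy calls above.

```python
from pathlib import Path

# File names taken from the copy cell above.
REQUIRED_FILES = [
    'train.pt',
    'val.pt',
    'tokenizer_vi.model',
    'tokenizer_en.model',
    'best_model.pt',
]


def find_missing(input_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in input_dir."""
    base = Path(input_dir)
    return [name for name in required if not (base / name).is_file()]


# Usage in the notebook (KAGGLE_INPUT_MODEL is defined in section 2):
#   missing = find_missing(KAGGLE_INPUT_MODEL)
#   if missing:
#       raise FileNotFoundError(f"Missing from input dataset: {missing}")
```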

## 4. 🔍 Verify Files

In [None]:
import sys
sys.path.insert(0, '.')

from src.config import load_config
import torch

config = load_config('config/config_vi_en.yaml')

# Check files exist
print("📁 Checking files...")
print(f"  data_dir: {config.paths.data_dir}")
print(f"  checkpoint_dir: {config.paths.checkpoint_dir}")

# Check checkpoint
checkpoint_path = config.paths.checkpoint_dir / 'best_model.pt'
checkpoint = torch.load(checkpoint_path, map_location='cpu')
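print(f"\n✓ Checkpoint loaded from epoch {checkpoint.get('epoch', '?')}")
# Format the loss only when it is present; ':.4f' raises ValueError on a string fallback
best_val_loss = checkpoint.get('best_val_loss')
if best_val_loss is None:
    print("  Best val loss: N/A")
else:
    print(f"  Best val loss: {best_val_loss:.4f}")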
print(f"  Has scheduler state: {'scheduler_state_dict' in checkpoint}")
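
The checks above read three keys from the checkpoint: `epoch`, `best_val_loss`, and `scheduler_state_dict`. A small helper can report which of them are absent in one pass, before training resumes. This is only a sketch: `missing_checkpoint_keys` is a hypothetical name, the key list is limited to the keys this notebook already inspects, and the demo values are made up.

```python
# Keys inspected by the verification cell above.
EXPECTED_KEYS = ('epoch', 'best_val_loss', 'scheduler_state_dict')


def missing_checkpoint_keys(checkpoint, expected=EXPECTED_KEYS):
    """List the expected keys that are absent from a checkpoint dict."""
    return [key for key in expected if key not in checkpoint]


# Illustrative checkpoint dict with made-up values and no scheduler state.
demo = {'epoch': 7, 'best_val_loss': 2.31}
print(missing_checkpoint_keys(demo))  # ['scheduler_state_dict']
```

Applied to the loaded `checkpoint`, an empty list means all three keys are present.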

## 5. 🏋️ Resume Training

In [None]:
from src.train import Trainer

trainer = Trainer(config_path='config/config_vi_en.yaml')
trainer.setup()

print("\n" + "="*50)
print("🔄 Resuming VI → EN Training")
print("="*50)

# Resume from checkpoint
resume_path = str(config.paths.checkpoint_dir / 'best_model.pt')
trainer.train(resume_from=resume_path)