# 🧪 Test Notebook - 100K Samples Only

**Quick test to verify everything works before full training**

- Uses only **100,000 samples** instead of ~3M
- Runs **2 epochs** only
- Should complete in **~30 minutes**

---

## 1. ⚙️ Setup

In [None]:
# Clone repository
!rm -rf EV_Translate_Modle_NLP_Project
!git clone https://github.com/TranKien2005/EV_Translate_Modle_NLP_Project.git
%cd EV_Translate_Modle_NLP_Project

In [None]:
# Install dependencies (skip torch - already on Kaggle)
!pip install -q datasets sentencepiece sacrebleu google-generativeai python-dotenv tqdm tensorboard seaborn pyyaml

In [None]:
# Verify PyTorch + CUDA
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'Device: {torch.cuda.get_device_name(0)}')

In [None]:
# Create .env (replace with your keys)
GEMINI_API_KEY = "YOUR_GEMINI_API_KEY"  # Optional
HF_TOKEN = "YOUR_HF_TOKEN"  # Get from huggingface.co

with open('.env', 'w') as f:
    f.write(f'GEMINI_API_KEY={GEMINI_API_KEY}\n')
    f.write(f'HF_TOKEN={HF_TOKEN}\n')
print('✓ .env created')
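
The cell above writes the placeholder strings verbatim if they are not edited first, so a quick check before downloading can save a failed run. The following is a minimal sketch and not part of the repository: `find_placeholder_keys` is a hypothetical helper, and treating any value that starts with `YOUR_` as unset is an assumption based on the placeholders shown above.

```python
from pathlib import Path


def find_placeholder_keys(env_text, prefix="YOUR_"):
    """Return the names of KEY=VALUE entries whose value still starts with `prefix`."""
    unset = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Ignore optional surrounding quotes before checking the prefix (assumed convention).
        if value.strip().strip('"\'').startswith(prefix):
            unset.append(key.strip())
    return unset


env_path = Path(".env")
if env_path.exists():
    for key in find_placeholder_keys(env_path.read_text()):
        print(f"Warning: {key} still holds a placeholder value")
```

Since the Gemini key is marked optional, a warning for `GEMINI_API_KEY` can probably be ignored; a warning for `HF_TOKEN` is worth fixing before the download step.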

## 2. 📥 Download Data

In [None]:
# Download PhoMT dataset
!python scripts/download_phomt.py

## 3. 🔧 Preprocess (100K samples only)

In [None]:
# Preprocess with ONLY 100K samples for quick test
!python scripts/preprocess_data.py --max-samples 100000

In [None]:
# Verify processed data exists
from src.config import load_config
config = load_config()
print(f"Data dir: {config.paths.data_dir}")
!ls -la {config.paths.data_dir}/processed/
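
The `ls` listing above has to be read by eye. If a hard failure is preferred when preprocessing produced nothing, a check along these lines could be used. It is a sketch under assumptions: `list_files_with_sizes` and `assert_not_empty` are hypothetical helpers, and nothing is assumed about which file names the preprocessing script writes.

```python
from pathlib import Path


def list_files_with_sizes(directory):
    """Return {file name: size in bytes} for regular files directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}


def assert_not_empty(directory):
    """Raise if the directory has no files or contains a zero-byte file; else return the sizes."""
    sizes = list_files_with_sizes(directory)
    if not sizes:
        raise RuntimeError(f"No files found in {directory}")
    empty = [name for name, size in sizes.items() if size == 0]
    if empty:
        raise RuntimeError(f"Zero-byte files in {directory}: {empty}")
    return sizes
```

In this notebook it would be called as `assert_not_empty(Path(config.paths.data_dir) / "processed")`, reusing the `config` object loaded in the previous cell.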

## 4. 🏋️ Quick Training (2 epochs)

In [None]:
# Modify config for quick test: 2 epochs only
import yaml

with open('config/config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)

# Quick test settings
cfg['data']['source'] = 'processed'
cfg['training']['epochs'] = 2  # Only 2 epochs for test
cfg['training']['save_every'] = 1

with open('config/config.yaml', 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False)

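print('✓ Config updated for quick test')
# Print the values actually written, rather than hard-coded strings
print(f"  - epochs: {cfg['training']['epochs']}")
print(f"  - source: {cfg['data']['source']}")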

In [None]:
# Start training!
from src.train import Trainer

trainer = Trainer()
trainer.setup()
trainer.train()

## 5. 📊 Quick Evaluation

In [None]:
# Load trained model
from src.evaluate import load_translator

translator = load_translator(
    checkpoint_path='checkpoints/best_model.pt',
    vocab_src_path='checkpoints/tokenizers/tokenizer_src.model',
    vocab_tgt_path='checkpoints/tokenizers/tokenizer_tgt.model',
    config_path='config/config.yaml'
)

In [None]:
# Test translations
test_sentences = [
    "Hello, how are you?",
    "I love you.",
    "What is your name?",
    "The weather is nice today.",
    "Thank you very much."
]

print("="*50)
print("🌐 Translation Test")
print("="*50)

for sentence in test_sentences:
    # Greedy (fast)
    greedy = translator.translate(sentence, beam_size=1)
    # Beam search (better)
    beam = translator.translate(sentence, beam_size=4)
    
    print(f"\n🔹 EN: {sentence}")
    print(f"   Greedy: {greedy}")
    print(f"   Beam-4: {beam}")
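
After only 2 epochs, greedy and beam outputs may well be identical or equally poor, so it can help to count how often the two decoding modes disagree instead of comparing them by eye. The sketch below assumes only the `translate(sentence, beam_size=...)` call used in the loop above. `compare_decoding` is a hypothetical helper, and `fake_translate` is a stand-in so the block runs without a trained checkpoint.

```python
def compare_decoding(translate, sentences, beam_size=4):
    """Translate each sentence with greedy and beam search.

    Returns (rows, differing): rows are (sentence, greedy, beam, is_different)
    tuples, and differing is the number of sentences where the outputs differ.
    """
    rows = []
    for sentence in sentences:
        greedy = translate(sentence, beam_size=1)
        beam = translate(sentence, beam_size=beam_size)
        rows.append((sentence, greedy, beam, greedy != beam))
    differing = sum(1 for row in rows if row[3])
    return rows, differing


# Stand-in translator (not the real model): lowercases for greedy, uppercases for beam.
def fake_translate(sentence, beam_size=1):
    return sentence.upper() if beam_size > 1 else sentence.lower()


rows, differing = compare_decoding(fake_translate, ["Hello", "ok"])
print(f"{differing}/{len(rows)} outputs differ between greedy and beam search")
```

With the model loaded earlier, the call would be `compare_decoding(translator.translate, test_sentences)`.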

In [None]:
# Quick BLEU on 100 samples
!python -m src.evaluate \
    --checkpoint checkpoints/best_model.pt \
    --vocab-src checkpoints/tokenizers/tokenizer_src.model \
    --vocab-tgt checkpoints/tokenizers/tokenizer_tgt.model \
    --config config/config.yaml \
    --val

## ✅ Success!

If you see translations above, everything works!

**Next steps:**
1. Use `train_kaggle.ipynb` for full training
2. Or increase epochs/samples in this notebook

**Expected results with 2 epochs on 100K:**
- BLEU: 1-5 (very low, just testing)
- Translations may be nonsense (not enough training)