# 🎤 Vietnamese ASR Training - Google Colab

**Vietnamese speech recognition with Wav2Vec2**

---

## 📋 Setup Checklist
- [ ] Runtime → Change runtime type → **GPU (T4)**
- [ ] Mount Google Drive
- [ ] Upload the dataset to Drive
- [ ] Run all cells

---

## 1️⃣ Check GPU & Environment

In [None]:
import torch
import sys

print("="*60)
print("🔧 Environment Info")
print("="*60)
print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("\n✅ GPU Ready!")
else:
    print("\n⚠️ WARNING: GPU not available!")
    print("Go to: Runtime → Change runtime type → Hardware accelerator → GPU")

## 2️⃣ Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Drive
drive.mount('/content/drive')

# Create the working directory
DRIVE_ROOT = "/content/drive/MyDrive/VietnameseASR"
os.makedirs(DRIVE_ROOT, exist_ok=True)

print(f"\n✓ Drive mounted at: {DRIVE_ROOT}")
print("\n📂 Recommended directory structure:")
print(f"{DRIVE_ROOT}/")
print("  ├── data/               # Dataset files")
print("  │   ├── train.jsonl")
print("  │   ├── validation.jsonl")
print("  │   └── test.jsonl")
print("  ├── models/             # Checkpoints (auto-created)")
print("  └── final_model/        # Final output (auto-created)")

## 3️⃣ Install Dependencies

In [None]:
%%capture
# Install packages (silent mode)
!pip install -q transformers datasets evaluate jiwer soundfile librosa accelerate tensorboard

In [None]:
# Verify installation
import transformers
import datasets
import evaluate

print("✅ All packages installed successfully!")
print(f"   - transformers: {transformers.__version__}")
print(f"   - datasets: {datasets.__version__}")
print(f"   - evaluate: {evaluate.__version__}")

## 4️⃣ Upload Source Code

**Option 1: Clone from GitHub (recommended)**

In [None]:
# Clone repository
!git clone https://github.com/YOUR_USERNAME/vietnamese-asr.git
%cd vietnamese-asr

# Or, if the repository is already cloned, just pull
# !git pull origin main

**Option 2: Load from Drive**

In [None]:
# Use this if the source code has already been uploaded to Drive
import sys
CODE_DIR = f"{DRIVE_ROOT}/code"  # Folder containing src/
sys.path.insert(0, CODE_DIR)

print(f"✓ Source code loaded from: {CODE_DIR}")

## 5️⃣ Check Dataset

**Dataset upload instructions:**
1. Run `python prepare_vivos.py` on your local machine
2. Upload the `processed_data_vivos/` folder to Google Drive
3. Place it in `MyDrive/VietnameseASR/data/`, so that `train.jsonl`, `validation.jsonl`, and `test.jsonl` sit directly in that folder
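
Before uploading, it can help to confirm that each manifest is valid JSON Lines. The sketch below is an illustration only and is not part of `prepare_vivos.py`: the helper name `inspect_jsonl` is hypothetical, and it deliberately assumes nothing about field names. It only reports how many records parsed, which keys appear, and which lines failed.

```python
import json
from collections import Counter


def inspect_jsonl(path):
    """Count records, key usage, and unparsable lines in a JSON Lines file."""
    key_counts = Counter()
    bad_lines = []
    n_records = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are ignored
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_no)
                continue
            if not isinstance(record, dict):
                # A manifest row is expected to be a JSON object
                bad_lines.append(line_no)
                continue
            n_records += 1
            key_counts.update(record.keys())
    return n_records, dict(key_counts), bad_lines
```

Calling `inspect_jsonl("processed_data_vivos/train.jsonl")` locally returns the record count, a key-to-count mapping, and the 1-based numbers of any lines that are not JSON objects.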

In [None]:
from pathlib import Path

# Dataset path
DATA_DIR = Path(f"{DRIVE_ROOT}/data")

# Check that the required files exist
required_files = ['train.jsonl', 'validation.jsonl', 'test.jsonl']
missing = [f for f in required_files if not (DATA_DIR / f).exists()]

if missing:
    print("❌ Missing dataset files:")
    for f in missing:
        print(f"   - {f}")
    print(f"\n📁 Expected location: {DATA_DIR}")
    print("\n💡 Upload dataset files to Google Drive first!")
else:
    print("✅ All dataset files found!")
    # Count samples
    for file in required_files:
        with open(DATA_DIR / file, 'r', encoding='utf-8') as f:
            count = sum(1 for _ in f)
        print(f"   - {file}: {count:,} samples")

## 6️⃣ Training Configuration

In [None]:
import json

# Configuration - optimized for Colab GPU
config = {
    'pretrained_model': 'nguyenvulebinh/wav2vec2-base-vietnamese-250h',
    'num_train_epochs': 30,          # Number of epochs
    'batch_size': 16,                # T4 GPU has ~16 GB of VRAM
    'gradient_accumulation_steps': 1,
    'learning_rate': 3e-4,
    'use_fp16': True,                # Mixed precision training
    'apply_quantization': False,     # Do not quantize during training
    'save_steps': 500,               # Save a checkpoint every 500 steps
    'eval_steps': 500,               # Evaluate every 500 steps
}

# Output directories
OUTPUT_DIR = Path(f"{DRIVE_ROOT}/models/wav2vec2-vietnamese")
FINAL_MODEL_DIR = Path(f"{DRIVE_ROOT}/final_model")

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
FINAL_MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Save config
with open(OUTPUT_DIR / 'config.json', 'w') as f:
    json.dump(config, f, indent=2)

print("✅ Configuration:")
for key, value in config.items():
    print(f"   - {key}: {value}")
print(f"\n📁 Output: {OUTPUT_DIR}")

## 7️⃣ Load Processor & Datasets

In [None]:
from transformers import Wav2Vec2Processor
from src.data.preprocessing import load_and_prepare_datasets

print("Loading processor...")
processor = Wav2Vec2Processor.from_pretrained(config['pretrained_model'])

print("\nLoading datasets...")
train_dataset, val_dataset, test_dataset = load_and_prepare_datasets(
    str(DATA_DIR / 'train.jsonl'),
    str(DATA_DIR / 'validation.jsonl'),
    str(DATA_DIR / 'test.jsonl'),
    processor
)

print(f"\n✅ Datasets loaded:")
print(f"   - Train: {len(train_dataset):,} samples")
print(f"   - Validation: {len(val_dataset):,} samples")
print(f"   - Test: {len(test_dataset):,} samples")

## 8️⃣ Create Model

In [None]:
from src.training.train_wav2vec2 import create_model

print("Creating model...")
vocab_size = len(processor.tokenizer)
model = create_model(vocab_size, config['pretrained_model'])

# Move to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Model info
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n✅ Model ready on {device}")
print(f"   - Total parameters: {total_params:,}")
print(f"   - Trainable: {trainable_params:,}")
print(f"   - Frozen: {total_params - trainable_params:,}")

## 9️⃣ Start Training

**⏱️ Estimated time: 15-20 hours on T4 GPU**

In [None]:
from src.training.train_wav2vec2 import train_model

print("="*60)
print("🚀 Starting Training...")
print("="*60)
print("\n⚠️ IMPORTANT:")
print("   - Keep this tab open!")
print("   - Colab timeout: ~12 hours")
print("   - Checkpoints auto-saved to Drive every 500 steps")
print("\n" + "="*60 + "\n")

# Train
trainer = train_model(
    model=model,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    processor=processor,
    output_dir=str(OUTPUT_DIR),
    num_train_epochs=config['num_train_epochs'],
    batch_size=config['batch_size'],
    gradient_accumulation_steps=config['gradient_accumulation_steps'],
    learning_rate=config['learning_rate'],
    use_fp16=config['use_fp16']
)

## 🔟 Save Final Model

In [None]:
print("Saving final model...")

# Save model
trainer.save_model(str(FINAL_MODEL_DIR))
processor.save_pretrained(str(FINAL_MODEL_DIR))

# Save training history
import pandas as pd
if hasattr(trainer.state, 'log_history'):
    history_df = pd.DataFrame(trainer.state.log_history)
    history_df.to_csv(f"{DRIVE_ROOT}/training_history.csv", index=False)
    print(f"✓ Training history saved")

print(f"\n✅ Training completed!")
print(f"📦 Final model: {FINAL_MODEL_DIR}")
print(f"\n💡 The model is saved to Google Drive. You can:")
print(f"   1. Download it to your machine from Drive")
print(f"   2. Use it directly from Drive in another notebook")
print(f"   3. Upload it to the HuggingFace Hub")

## 📊 Monitor Training (Optional)

Run this cell while training is in progress to monitor it.

In [None]:
# TensorBoard
%load_ext tensorboard
%tensorboard --logdir {OUTPUT_DIR}/runs

In [None]:
# GPU monitoring
!nvidia-smi

In [None]:
# Check latest checkpoint
!ls -lh {OUTPUT_DIR}/checkpoint-*/ | tail -5

## 🎯 Test Model (After Training)

In [None]:
# Load trained model
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import soundfile as sf

model = Wav2Vec2ForCTC.from_pretrained(str(FINAL_MODEL_DIR))
processor = Wav2Vec2Processor.from_pretrained(str(FINAL_MODEL_DIR))
model = model.to(device)
model.eval()

print("✅ Model loaded for inference")
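
The transcription cell that follows takes the argmax over the logits and hands the resulting IDs to `processor.batch_decode`. For a CTC model, that decoding step conceptually collapses consecutive repeated IDs and then drops the blank token. The sketch below illustrates only that idea: `ctc_greedy_collapse`, the toy vocabulary, and the choice of `0` as the blank ID are assumptions for illustration, and the real tokenizer additionally handles word delimiters and special tokens.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep an ID only when it starts a new run and is not the blank
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Toy vocabulary for illustration only; 0 is assumed to be the CTC blank.
toy_vocab = {1: "c", 2: "a", 3: "o"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frames))
print(decoded)  # prints "cao"
```

A blank between two identical IDs keeps both, which is how CTC represents doubled characters: `[2, 0, 2]` collapses to `[2, 2]`.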

In [None]:
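# Transcribe audio file
def transcribe(audio_path, target_sr=16000):
    # Load audio; soundfile returns shape (samples,) or (samples, channels)
    speech, sr = sf.read(audio_path, dtype="float32")

    # Down-mix multi-channel audio to mono
    if speech.ndim > 1:
        speech = speech.mean(axis=1)

    # Resample if the file is not at the rate the model expects
    if sr != target_sr:
        import librosa
        speech = librosa.resample(speech, orig_sr=sr, target_sr=target_sr)

    # Process
    inputs = processor(speech, sampling_rate=target_sr, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Predict
    with torch.no_grad():
        logits = model(**inputs).logits

    # Decode
    pred_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(pred_ids)[0]

    return transcription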

# Test
# audio_file = "/path/to/audio.wav"
# result = transcribe(audio_file)
# print(f"Transcription: {result}")
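
Once `transcribe` returns text, a common way to score it against a reference transcript is word error rate (WER). The `jiwer` and `evaluate` packages installed earlier provide this metric; the pure-Python sketch below only illustrates the computation. The helper name `word_error_rate` is hypothetical and the sentences are made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("xin chào việt nam", "xin chào viet nam"))  # 0.25
```

WER can exceed 1.0 when the hypothesis contains many extra words, so it is an error rate rather than a percentage bounded at 100.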

---

## 📌 Tips

### Avoiding Colab timeouts:
- Training takes ~15-20h, while free Colab sessions time out after ~12h
- **Solution:** Split training across multiple sessions
  ```python
  # Session 1: train 10 epochs
  config['num_train_epochs'] = 10
  
  # Session 2: resume from a checkpoint and train 10 more epochs (20 total)
  # 'checkpoint-5000' is an example; use the latest checkpoint folder saved on Drive
  config['resume_from_checkpoint'] = str(OUTPUT_DIR / 'checkpoint-5000')
  config['num_train_epochs'] = 20
  ```
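
To choose the checkpoint to resume from without hard-coding a step number, one option is to pick the `checkpoint-<step>` folder with the highest step. This is a sketch under assumptions: the helper name `find_latest_checkpoint` is hypothetical, it assumes the `checkpoint-<step>` folder naming used in the monitoring cell, and how `train_model` consumes the resulting path is not shown in this notebook.

```python
import re
from pathlib import Path


def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the largest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if match and child.is_dir():
            # Compare steps numerically so 10000 sorts after 5000
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

If the function returns a path, it could be stored as `config['resume_from_checkpoint'] = str(path)`; if it returns `None`, no checkpoint exists yet and training starts from scratch.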

### Colab Pro:
- Timeout: ~24h
- Better GPU: A100/V100
- Training time: ~8-10h

### Auto-save to Drive:
Checkpoints are saved to Drive automatically every 500 steps, so progress is preserved if Colab disconnects!
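
To relate the 500-step save interval to epochs, the arithmetic can be sketched as follows. The sample count here is a placeholder, not a measured figure for this dataset; substitute the line count printed by the dataset check. `steps_per_epoch` is a hypothetical helper, and it assumes a single GPU and that the last partial batch still counts as a step, so treat the result as approximate.

```python
import math


def steps_per_epoch(num_samples, batch_size, gradient_accumulation_steps=1):
    """Approximate optimizer steps per epoch on a single device."""
    batches = math.ceil(num_samples / batch_size)
    return math.ceil(batches / gradient_accumulation_steps)


num_samples = 10_000  # placeholder; use the real train.jsonl line count
per_epoch = steps_per_epoch(num_samples, batch_size=16)
total_steps = per_epoch * 30      # num_train_epochs from the config
checkpoints = total_steps // 500  # one save at every 500th step

print(per_epoch, total_steps, checkpoints)  # 625 18750 37
```

With these placeholder numbers, a disconnect loses at most 499 steps of work, which is less than one epoch.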

---