# 🚀 BERT Training with Google Colab

**A Complete Guide to Training a BERT Model for a Chatbot**

---

## 📋 Steps:

1. **Click 'Runtime' > 'Change runtime type' and select the T4 GPU**
2. **Run the first cell** - Install dependencies (2-3 minutes)
3. **Restart the session** once the install finishes
4. **Upload the dataset** or clone it from GitHub
5. **Choose a model** (3 options)
6. **Run training** - Wait ~7-60 minutes (depending on the dataset)
7. **Download the model** - Extract it to `data/bert_model/`
8. **Run the server** locally

---

## ⏱️ Time Estimates:
- Install dependencies: **2-3 minutes**
- Upload dataset: **1-2 minutes**
- BERT training: **7-60 minutes** (depending on dataset size)
- Download model: **2-3 minutes**
- **Total: ~15-70 minutes**

---

## ⚠️ IMPORTANT:
✅ Make sure the GPU (T4) is active  
✅ Restart the session after installing dependencies  
✅ Have the `dataset_training.csv` file ready  
✅ Save the model once training finishes
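
The training cell later in this notebook reads the `patterns` and `tag` columns from `dataset_training.csv`, and its stratified split needs at least two rows per tag. A minimal check to run before uploading, assuming a plain comma-separated UTF-8 file with a header row, might look like this (the helper name `check_dataset` is hypothetical, not part of the notebook):

```python
import csv
from collections import Counter

# Column names read by the training cell of this notebook
REQUIRED_COLUMNS = ("patterns", "tag")

def check_dataset(path):
    """Return a list of human-readable problems found in the CSV (empty if none)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        # Count how many rows each tag has
        tag_counts = Counter((row["tag"] or "").strip() for row in reader)

    problems = []
    if not tag_counts:
        problems.append("file has a header but no data rows")
    # A stratified train/validation split cannot place a single-row tag in both parts
    rare = sorted(tag for tag, count in tag_counts.items() if count < 2)
    if rare:
        problems.append(f"tag(s) with fewer than 2 rows: {', '.join(rare)}")
    return problems

# Example usage:
# for problem in check_dataset("dataset_training.csv"):
#     print("Problem:", problem)
```

An empty list means the file passed these two checks; it does not say anything about the quality of the data itself.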

# 📦 STEP 1: Install Dependencies

**Copy-paste the code below and run it**

⏱️ Time: ~2-3 minutes

⚠️ **IMPORTANT:** When it finishes, **RESTART THE SESSION** to apply the changes!
- Click: **Runtime > Restart session**

In [None]:
# ==================== INSTALL DEPENDENCIES ====================
# Copy-paste this into the first Colab cell

print("🔄 Installing dependencies...")
print("⏱️  Estimated time: 2-3 minutes\n")

# Install the required packages (no versions are pinned, so pip picks the latest releases)
!pip install -q transformers datasets accelerate torch pandas
!pip install -q scikit-learn numpy tqdm

print("\n" + "="*60)
print("🚀 Colab environment ready for BERT training!")
print("✅ All dependencies installed successfully!")
print("="*60)
print("\n⚠️  IMPORTANT: Restart the session now!")
print("    Click: Runtime > Restart session")
print("="*60)

# 🔍 STEP 2: Verify GPU & Import Libraries

After the **session restart**, run this cell to:
- ✅ Verify that a GPU is available
- ✅ Import libraries
- ✅ Check CUDA availability

In [None]:
# ==================== VERIFY GPU ====================
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import os
from datetime import datetime

# Check GPU
print("🔍 Checking GPU availability...")
print("="*60)

if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"✅ CUDA Version: {torch.version.cuda}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    device = "cuda"
else:
    print("⚠️  GPU not available! Training will be slow.")
    print("   Make sure Runtime Type = GPU T4")
    device = "cpu"

print(f"✅ Device: {device}")
print("="*60)

# Verify GPU with nvidia-smi
!nvidia-smi

# 📂 STEP 3: Upload Dataset

**Choose one of these methods:**

### Method 1: Manual Upload (Recommended)
Run the cell below, then upload the `dataset_training.csv` file

### Method 2: Clone from GitHub
Use this if the dataset is already in a repository
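
For Method 2, a shallow `git clone` is usually enough. The sketch below only assumes that `git` is available in the Colab runtime; the repository URL, the destination folder, and the location of the CSV inside the repository are placeholders to replace with your own, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

def clone_command(repo_url, dest="repo"):
    """Build a shallow-clone command (commit history is not needed for training)."""
    return ["git", "clone", "--depth", "1", repo_url, dest]

def clone_dataset(repo_url, csv_relpath="dataset_training.csv", dest="repo"):
    """Clone the repository and return the path to the dataset CSV inside it."""
    subprocess.run(clone_command(repo_url, dest), check=True)
    csv_path = Path(dest) / csv_relpath
    if not csv_path.is_file():
        raise FileNotFoundError(f"{csv_relpath} not found in {dest}")
    return csv_path

# Hypothetical usage (placeholder URL, replace with your repository):
# import shutil
# csv_path = clone_dataset("https://github.com/<user>/<repo>.git")
# shutil.copy(csv_path, "dataset_training.csv")  # later cells read it from the working directory
```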

In [None]:
# ==================== UPLOAD DATASET ====================
from google.colab import files

print("📂 Upload file dataset_training.csv")
print("="*60)

# Upload dataset
uploaded = files.upload()

# Verify upload
if 'dataset_training.csv' in uploaded:
    print("\n✅ Dataset uploaded successfully!")
    
    # Preview dataset
    df = pd.read_csv('dataset_training.csv')
    print(f"\n📊 Dataset Info:")
    print(f"   - Total rows: {len(df)}")
    print(f"   - Columns: {list(df.columns)}")
    print(f"\n🔍 Sample data:")
    print(df.head(3))
    print(f"\n📈 Intent distribution:")
    print(df['tag'].value_counts().head(10))
else:
    print("\n❌ Dataset not found! Please upload dataset_training.csv")

# 🎯 STEP 4: Choose a BERT Model

**Three models are available:**

| Model | Size | Speed | Accuracy | Recommended for |
|-------|------|-------|----------|-----------------|
| **IndoBERT Base** | ~500MB | Medium | High | ✅ **Production** |
| **IndoBERT Lite** | ~200MB | Fast | Medium | Testing |
| **bert-base-indonesian-522M** | ~700MB | Slow | High | Indonesian-language alternative |

**Choose a model by changing the `MODEL_CHOICE` variable:**
- `1` = IndoBERT Base (Recommended)
- `2` = IndoBERT Lite  
- `3` = bert-base-indonesian-522M (`cahya/bert-base-indonesian-522M`)
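
Because `MODEL_CHOICE` is used directly as a dictionary key, a value other than 1, 2, or 3 fails with a bare `KeyError`. One possible guard, sketched here with a hypothetical helper `select_model` and a trimmed copy of the model table from the cell below, reports the valid options instead:

```python
# Trimmed copy of the notebook's model table: choice number -> Hugging Face model ID
MODELS = {
    1: "indobenchmark/indobert-base-p1",
    2: "indobenchmark/indobert-lite-base-p1",
    3: "cahya/bert-base-indonesian-522M",
}

def select_model(choice, models=MODELS):
    """Return the model ID for `choice`, or raise a ValueError listing the valid choices."""
    if choice not in models:
        valid = ", ".join(str(key) for key in sorted(models))
        raise ValueError(f"MODEL_CHOICE must be one of {valid}; got {choice!r}")
    return models[choice]

MODEL_NAME = select_model(1)
print(MODEL_NAME)
```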

In [None]:
# ==================== PILIH MODEL ====================

# 🎯 CHANGE THE NUMBER BELOW TO CHOOSE A MODEL (1, 2, or 3)
MODEL_CHOICE = 1  # Default: IndoBERT Base

# Model configurations
MODELS = {
    1: {
        'name': 'indobenchmark/indobert-base-p1',
        'display_name': 'IndoBERT Base',
        'description': '🏆 Best for production - High accuracy',
        'size': '~500MB'
    },
    2: {
        'name': 'indobenchmark/indobert-lite-base-p1',
        'display_name': 'IndoBERT Lite',
        'description': '⚡ Fast training - Good for testing',
        'size': '~200MB'
    },
    3: {
        'name': 'cahya/bert-base-indonesian-522M',
        'display_name': 'bert-base-indonesian-522M',
        'description': 'Indonesian-language support',
        'size': '~700MB'
    }
}

# Get selected model
selected_model = MODELS[MODEL_CHOICE]
MODEL_NAME = selected_model['name']

print("="*60)
print(f"📌 Selected model: {selected_model['display_name']}")
print(f"   {selected_model['description']}")
print(f"   Size: {selected_model['size']}")
# No per-model 'training_time' is defined above, so fall back to the notebook's overall estimate
print(f"   Estimated training time: {selected_model.get('training_time', '~7-60 minutes, depending on dataset size')}")
print(f"   Hugging Face: {MODEL_NAME}")
print("="*60)

# 🏋️ STEP 5: Training BERT Model

**Run the cell below to start training**

📊 **Process:**
1. Load & preprocess data
2. Tokenize dataset
3. Train model (3 epochs)
4. Save model

💡 **Tips:**
- Do not close the Colab tab during training
- Monitor GPU usage
- Training can be interrupted at any time (Ctrl+M I)
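
The number of optimizer steps depends on the dataset size, which matters for the `warmup_steps=500` setting in the training cell: with a small dataset the whole run can be shorter than the warm-up, so the learning rate never reaches its configured value. A back-of-the-envelope calculation, assuming a single GPU, no gradient accumulation, and the 80/20 split used below (the sample count of 1,000 is only an illustration, and `total_training_steps` is a hypothetical helper):

```python
import math

def total_training_steps(num_samples, batch_size=16, epochs=3, val_percent=20):
    """Estimate how many optimizer steps a run takes on a single device."""
    # The validation part is rounded up; the remainder is used for training
    val_samples = math.ceil(num_samples * val_percent / 100)
    train_samples = num_samples - val_samples
    # The last, possibly smaller, batch of each epoch still counts as one step
    steps_per_epoch = math.ceil(train_samples / batch_size)
    return steps_per_epoch * epochs

steps = total_training_steps(1000)
print(steps)  # 150
```

With 1,000 samples the run has 150 steps in total, well below 500. If your dataset is in that range, a smaller `warmup_steps` value is worth considering.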

In [None]:
# ==================== BERT TRAINING ====================
import warnings
warnings.filterwarnings('ignore')

print("🏋️  Starting BERT Training...")
print("="*60)

# Configuration
NUM_EPOCHS = 3
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
MAX_LENGTH = 128
OUTPUT_DIR = './bert_model'

print(f"📋 Training Configuration:")
print(f"   Model: {MODEL_NAME}")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Batch Size: {BATCH_SIZE}")
print(f"   Learning Rate: {LEARNING_RATE}")
print(f"   Max Length: {MAX_LENGTH}")
print("="*60)

# Load dataset
print("\n📂 Loading dataset...")
df = pd.read_csv('dataset_training.csv')

# Prepare data
texts = df['patterns'].astype(str).tolist()
labels = df['tag'].astype(str).tolist()

# Encode labels
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
num_labels = len(label_encoder.classes_)

print(f"✅ Dataset loaded:")
print(f"   Total samples: {len(texts)}")
print(f"   Unique intents: {num_labels}")

# Split data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, encoded_labels, test_size=0.2, random_state=42, stratify=encoded_labels
)

print(f"   Training samples: {len(train_texts)}")
print(f"   Validation samples: {len(val_texts)}")

# Load tokenizer and model
print(f"\n🔄 Loading model: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)

print("✅ Model loaded successfully!")

# Tokenize data
print("\n🔄 Tokenizing dataset...")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=MAX_LENGTH)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=MAX_LENGTH)

# Create PyTorch dataset
class IntentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IntentDataset(train_encodings, train_labels)
val_dataset = IntentDataset(val_encodings, val_labels)

print("✅ Tokenization complete!")

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Metrics
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

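# Start training
print("\n" + "="*60)
print("🚀 TRAINING STARTED!")
print("="*60)
# No per-model 'training_time' is defined in MODELS, so fall back to the notebook's overall estimate
print(f"⏱️  Estimated time: {selected_model.get('training_time', '~7-60 minutes, depending on dataset size')}")
print("💡 You can monitor the GPU with: !nvidia-smi")
print("="*60 + "\n")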

start_time = datetime.now()

# Train
trainer.train()

end_time = datetime.now()
duration = (end_time - start_time).total_seconds() / 60

print("\n" + "="*60)
print("✅ TRAINING COMPLETED!")
print("="*60)
print(f"⏱️  Total time: {duration:.2f} minutes")
print(f"📁 Model saved to: {OUTPUT_DIR}")
print("="*60)

# Save model and tokenizer
print("\n💾 Saving final model...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# Save label encoder
import pickle
with open(f'{OUTPUT_DIR}/label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

print("✅ Model, tokenizer, and label encoder saved!")
print("\n🎉 Training completed successfully!")
print(f"📂 Output directory: {OUTPUT_DIR}")

# 🧪 STEP 6: Test Model (Optional)

Test the model on a few sample questions to sanity-check its predictions
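
The confidence reported by the test cell is the largest softmax probability over the intent logits. The dependency-free sketch below shows that calculation on made-up logits and adds a threshold check; the 0.5 threshold, the `fallback` label, and the intent names are illustrative assumptions, not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (max is subtracted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_intent(logits, labels, threshold=0.5, fallback="fallback"):
    """Return (label, confidence), or the fallback label when confidence is below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return fallback, probs[best]
    return labels[best], probs[best]

labels = ["jam_buka", "ktp", "pajak"]  # hypothetical intent names
print(top_intent([2.0, 0.1, -1.0], labels))  # one clearly dominant logit
print(top_intent([0.1, 0.0, 0.1], labels))   # nearly uniform logits, so low confidence
```

A high softmax value only means the model prefers one intent over the others it knows; it is not a calibrated probability of being correct.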

In [None]:
# ==================== TEST MODEL ====================
print("🧪 Testing trained model...")
print("="*60)

# Test samples
test_queries = [
    "jam buka bappenda",
    "cara buat ktp",
    "syarat nikah",
    "bayar pajak online"
]

def predict(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    
    # Move to device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)
    
    # Predict
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=1).item()
        confidence = predictions[0][predicted_class].item()
    
    # Decode label
    intent = label_encoder.inverse_transform([predicted_class])[0]
    
    return intent, confidence

# Test predictions
print("\n📝 Test Predictions:\n")
for query in test_queries:
    intent, confidence = predict(query)
    print(f"Query: '{query}'")
    print(f"  ➜ Intent: {intent}")
    print(f"  ➜ Confidence: {confidence:.4f} ({confidence*100:.2f}%)")
    print()

print("="*60)
print("✅ Model testing complete!")

# 📥 STEP 7: Download Model

**Download the trained model**

The file is downloaded as `bert_model.zip` (~500-700 MB)

⏱️ Download time: ~2-5 minutes (depending on your internet connection)

In [None]:
# ==================== DOWNLOAD MODEL ====================
from google.colab import files
import shutil

print("📦 Preparing model for download...")
print("="*60)

# Zip the model directory
print("🔄 Compressing model files...")
shutil.make_archive('bert_model', 'zip', OUTPUT_DIR)

print("✅ Model compressed successfully!")
print(f"📦 File: bert_model.zip")

# Get file size
import os
file_size = os.path.getsize('bert_model.zip') / (1024 * 1024)
print(f"📊 Size: {file_size:.2f} MB")

print("\n🚀 Starting download...")
print("="*60)

# Download
files.download('bert_model.zip')

print("\n" + "="*60)
print("✅ MODEL DOWNLOADED SUCCESSFULLY!")
print("="*60)
print("\n📋 Next Steps:")
print("1. Extract bert_model.zip")
print("2. Copy the folder contents to: 'data/bert_model/'")
print("3. The folder structure should look like:")
print("   data/bert_model/")
print("   ├── config.json")
print("   ├── model.safetensors")
print("   ├── tokenizer.json")
print("   ├── label_encoder.pkl")
print("   └── ...")
print("\n4. Test the model locally:")
print("   python -c \"from main import get_hybrid_nlu; print('OK')\"")
print("\n5. Run the server:")
print("   uvicorn app:app --reload")
print("="*60)
print("\n🎉 DONE! The model is ready to use locally!")
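
After extracting the archive locally, it can help to confirm that the expected files are present before starting the server. This sketch checks for the file names listed in the next steps above; treat that list as an assumption, since the exact files written by `save_pretrained` vary between library versions (for example `pytorch_model.bin` instead of `model.safetensors`). The helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the folder structure printed by the download cell
EXPECTED_FILES = ("config.json", "model.safetensors", "tokenizer.json", "label_encoder.pkl")

def missing_model_files(model_dir="data/bert_model", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]

missing = missing_model_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected model files are present.")
```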

# 💾 BONUS: Backup to Google Drive (Optional)

Use this if you want to keep a backup copy of the model in Google Drive

In [None]:
# ==================== BACKUP TO GOOGLE DRIVE ====================
from google.colab import drive

# Mount Google Drive
print("📂 Mounting Google Drive...")
drive.mount('/content/drive')

# Create backup directory
backup_dir = '/content/drive/MyDrive/fira-bot-backup'
!mkdir -p "{backup_dir}"

# Copy model
print("\n💾 Backing up model to Google Drive...")
!cp -r {OUTPUT_DIR} "{backup_dir}/"

print("\n✅ Model backed up successfully!")
print(f"📁 Location: {backup_dir}/bert_model")
print("\n💡 The model stays in your Google Drive after the Colab session ends")