# 🎯 Konkani NER Training - Google Colab

Train a custom Named Entity Recognition (NER) model for Konkani.

**Steps:**
1. Install dependencies
2. Mount Google Drive
3. Upload code files
4. Auto-label data (15 min)
5. Train NER model (2-3 hours)

---

## 📦 Cell 1: Install Dependencies

In [None]:
print("📦 Installing dependencies...\n")
!pip install -q torch transformers pytorch-crf tqdm
print("✅ Dependencies installed!\n")

# Verify GPU
import torch
print(f"🔍 GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 💾 Cell 2: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✅ Google Drive mounted!")

## 📤 Cell 3: Upload Code & Data

**Option A:** Upload from your Mac (recommended)

In [None]:
from google.colab import files
import zipfile
import os

print("📤 Please upload these files from your Mac:")
print("   1. transcripts_konkani_cleaned.json")
print("   2. scripts/auto_label_ner.py")
print("   3. models/konkani_ner.py")
print("   4. train_konkani_ner.py")
print("\nOr create a zip with all files and upload that.\n")

uploaded = files.upload()

# If zip uploaded, extract it
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"\n📂 Extracting {filename}...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('/content/')
        print("✅ Extracted!")

print("\n📋 Files in /content/:")
!ls -la /content/
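
**Note on individual uploads:** `files.upload()` saves each file under its bare name in the current working directory, so files uploaded one by one end up flat in `/content/` rather than in the `scripts/` and `models/` folders that the later cells read from. The sketch below moves them into place. The helper name `arrange_uploads` and the `EXPECTED_LAYOUT` mapping are illustrative assumptions, derived only from the paths used in Cells 4, 5 and 7.

```python
from pathlib import Path
import shutil

# Flat upload name -> location the later cells expect (relative to the root).
# This mapping is an assumption based on the paths used elsewhere in this notebook.
EXPECTED_LAYOUT = {
    "auto_label_ner.py": "scripts/auto_label_ner.py",
    "konkani_ner.py": "models/konkani_ner.py",
    "train_konkani_ner.py": "train_konkani_ner.py",
    "transcripts_konkani_cleaned.json": "transcripts_konkani_cleaned.json",
}

def arrange_uploads(root="/content", layout=EXPECTED_LAYOUT):
    """Move flat uploads under `root` into the expected sub-folders.

    Returns the expected files that are still missing afterwards.
    """
    root = Path(root)
    missing = []
    for flat_name, relative_target in layout.items():
        source = root / flat_name
        target = root / relative_target
        # Only move when the file is not already where it should be.
        if not target.exists() and source.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))
        if not target.exists():
            missing.append(relative_target)
    return missing
```

Calling `arrange_uploads()` right after the upload and checking that it returns an empty list is a quick way to catch a forgotten file before starting the 15-minute labeling step.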

## 🏷️ Cell 4: Auto-Label NER Data

**This will take ~15 minutes for 2,500 samples**

In [None]:
%%time

print("🏷️ Starting auto-labeling...\n")
print("This uses a pre-trained multilingual NER model to label your Konkani text.")
print("Expected time: ~15 minutes\n")

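# Create the output folder first in case the labeling script does not do it itself
!mkdir -p /content/data
!python3 /content/scripts/auto_label_ner.py \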
    --input /content/transcripts_konkani_cleaned.json \
    --output /content/data/ner_labeled_data.json

print("\n✅ Auto-labeling complete!")
print("\n📊 Checking output...")
!ls -lh /content/data/ner_labeled_data*
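
**Optional sanity check:** labels produced by a pre-trained multilingual model can be noisy, so it may be worth looking at the tag distribution and at malformed tag sequences before training. The sketch below assumes a BIO tagging scheme (tags such as `B-LOC`, `I-LOC`, `O`), which would be consistent with the 9 tags used later but is not confirmed by this notebook. The field names inside `ner_labeled_data.json` are not shown here either, so the helpers take plain lists of tags, and both function names are illustrative.

```python
from collections import Counter

def label_counts(tag_sequences):
    """Count how often each tag occurs across all sentences."""
    return Counter(tag for tags in tag_sequences for tag in tags)

def bio_violations(tags):
    """Return the positions where an I- tag does not continue an entity of the same type."""
    bad_positions = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                bad_positions.append(i)
        previous = tag
    return bad_positions

# Hand-written tags for illustration only, not real output of the labeling script.
example = ["B-LOC", "I-LOC", "O", "I-ORG"]
print(label_counts([example]))
print(bio_violations(example))  # [3]: "I-ORG" follows "O"
```

A dataset where almost every tag is `O`, or where many sentences have violations, is a hint that the auto-labeler struggled with the text and that a manual spot check would be worthwhile.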

## 🚀 Cell 5: Train Custom NER Model

**This will take ~2-3 hours on GPU**

In [None]:
%%time

print("="*70)
print("🚀 STARTING NER TRAINING")
print("="*70)
print("\nConfiguration:")
print("  • Device: CUDA (GPU)")
print("  • Batch size: 32")
print("  • Epochs: 20")
print("  • Model: BiLSTM-CRF")
print("  • Expected time: 2-3 hours")
print("\n" + "="*70 + "\n")

!python3 /content/train_konkani_ner.py \
    --data_file /content/data/ner_labeled_data.json \
    --batch_size 32 \
    --num_epochs 20 \
    --learning_rate 0.001 \
    --device cuda \
    --checkpoint_dir /content/checkpoints/ner \
    --use_crf

print("\n" + "="*70)
print("✅ TRAINING COMPLETE!")
print("="*70)
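
**Rough pacing check:** the configuration above implies a fixed number of optimizer steps, which gives a way to judge early on whether a run is on track for the 2-3 hour estimate. This is a back-of-the-envelope sketch. It assumes all ~2,500 samples are used for training (no validation split), one step per batch, and that the last partial batch is kept. The training script may do things differently.

```python
import math

def training_steps(num_samples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for full passes over num_samples."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

steps_per_epoch, total_steps = training_steps(2500, 32, 20)
print(steps_per_epoch, total_steps)  # 79 1580

# Average time per step implied by a 2 or 3 hour run.
for hours in (2, 3):
    print(f"{hours} h -> {hours * 3600 / total_steps:.1f} s per step")
# 2 h -> 4.6 s per step
# 3 h -> 6.8 s per step
```

If the first epoch finishes much faster or slower than these figures suggest, the total time will shift accordingly.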

## 📊 Cell 6: Check Results

In [None]:
print("📊 Training Results\n")
print("="*70)

print("\n💾 Saved Models:\n")
!ls -lh /content/checkpoints/ner/

print("\n📈 Model Size:")
!du -h /content/checkpoints/ner/best_ner_model.pt

print("\n✅ Model ready to use!")

## 🧪 Cell 7: Test the Model

In [None]:
import torch
import json
import sys
sys.path.append('/content')

from models.konkani_ner import create_ner_model

print("🧪 Testing NER Model\n")
print("="*70)

# Load vocabularies
with open('/content/checkpoints/ner/vocabularies.json', 'r', encoding='utf-8') as f:
    vocabs = json.load(f)
    word2id = vocabs['word2id']
    char2id = vocabs['char2id']

# Load label map
with open('/content/data/ner_labeled_data_label_map.json', 'r', encoding='utf-8') as f:
    label_map = json.load(f)
    id2label = {int(k): v for k, v in label_map['id2label'].items()}

# Create model
model = create_ner_model(
    vocab_size=len(word2id),
    char_vocab_size=len(char2id),
    num_tags=9,
    use_crf=True
)

# Load weights
checkpoint = torch.load('/content/checkpoints/ner/best_ner_model.pt')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
model = model.to('cuda')

print("✅ Model loaded!\n")

# Test on sample text
test_texts = [
    "मी मुंबईंत गूगलांत काम करतां",
    "गोयांत कलंगुट बीच आसा",
    "माझे नाव स्टाविन फर्नांडिस"
]

for text in test_texts:
    print(f"\n📝 Text: {text}")
    
    # Tokenize
    tokens = text.split()
    
    # Convert to IDs (words missing from the vocabulary fall back to ID 1, presumably the unknown-word index)
    word_ids = torch.tensor([[word2id.get(t, 1) for t in tokens]]).to('cuda')
    
    # Character IDs
    char_ids = []
    for token in tokens:
        char_id_list = [char2id.get(c, 1) for c in token]
        char_ids.append(char_id_list)
    
    max_char_len = max(len(chars) for chars in char_ids)
    char_ids_padded = torch.zeros(1, len(tokens), max_char_len, dtype=torch.long).to('cuda')
    for i, word_chars in enumerate(char_ids):
        char_ids_padded[0, i, :len(word_chars)] = torch.tensor(word_chars, dtype=torch.long)
    
    # Predict
    with torch.no_grad():
        predictions = model(word_ids, char_ids_padded)
    
    # Decode predictions (int() also handles tensor elements, not just plain ints)
    pred_labels = [id2label[int(p)] for p in predictions[0]]
    
    # Show results
    print("\n   Entities found:")
    for token, label in zip(tokens, pred_labels):
        if label != 'O':
            print(f"      {token:20s} → {label}")
    
    if all(label == 'O' for label in pred_labels):
        print("      (No entities detected)")

print("\n" + "="*70)
print("✅ Testing complete!")
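
**Merging tokens into entities:** the loop above prints one label per token, so a two-word name shows up as two separate lines. If the label map follows a BIO scheme (an assumption, since the tag names are not listed in this notebook), consecutive `B-`/`I-` tokens can be grouped into whole entities. A minimal sketch, with `extract_entities` as an illustrative helper name:

```python
def extract_entities(tokens, labels):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        elif label.startswith(("B-", "I-")):
            # New entity; a stray I- tag is treated as a start as well.
            flush()
            current_tokens, current_type = [token], label[2:]
        else:
            flush()
    flush()
    return entities

# Labels written by hand for illustration, not actual model output.
tokens = "माझे नाव स्टाविन फर्नांडिस".split()
labels = ["O", "O", "B-PER", "I-PER"]
print(extract_entities(tokens, labels))  # [('स्टाविन फर्नांडिस', 'PER')]
```

Treating a stray `I-` tag as the start of a new entity is one lenient choice among several. Dropping such tokens would be the stricter alternative.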

## 💾 Cell 8: Backup to Google Drive

In [None]:
from datetime import datetime
import shutil

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_path = f"/content/drive/MyDrive/konkanivani_training/ner_backup_{timestamp}"

print("💾 Backing up NER model to Google Drive...\n")
print(f"Backup location: {backup_path}\n")

!mkdir -p {backup_path}
!cp -r /content/checkpoints/ner/* {backup_path}/
!cp /content/data/ner_labeled_data* {backup_path}/

print("\n✅ Backup complete!\n")
print("📋 Backed up files:\n")
!ls -lh {backup_path}/
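
**A stricter alternative:** shell commands started with `!` do not raise a Python error when they fail, so the cell above prints "Backup complete!" even if `cp` could not copy anything (for example when Drive is not mounted). The same backup can be sketched in plain Python, where a failure stops the cell. The helper name `backup_directory` and its `prefix` parameter are illustrative assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_directory(source_dir, backup_root, prefix="ner_backup"):
    """Copy source_dir into a new timestamped folder under backup_root and return its path."""
    source_dir = Path(source_dir)
    if not source_dir.is_dir():
        raise FileNotFoundError(f"Nothing to back up: {source_dir} does not exist")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = Path(backup_root) / f"{prefix}_{timestamp}"
    # copytree creates the target (including parents) and fails if it already exists.
    shutil.copytree(source_dir, target)
    return target
```

In this notebook the call would look like `backup_directory('/content/checkpoints/ner', '/content/drive/MyDrive/konkanivani_training')`. The labeled data files would still need to be copied separately.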

## 📥 Cell 9: Download Model

In [None]:
from google.colab import files
import zipfile

print("📦 Creating download package...\n")

# Create zip
!cd /content && zip -r ner_model.zip checkpoints/ner/ data/ner_labeled_data*

print("\n📊 Package size:")
!ls -lh /content/ner_model.zip

print("\n📥 Downloading...")
files.download('/content/ner_model.zip')

print("\n✅ Download complete!")
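
**Checking the package:** before relying on the downloaded zip, it can help to confirm that it actually contains the files Cell 7 needed to load the model. The sketch below uses the standard `zipfile` module. The `REQUIRED` list is an assumption assembled from the paths used earlier in this notebook, and `missing_from_archive` is an illustrative helper name.

```python
import zipfile

# Member names as they appear when zipping from inside /content (assumed from earlier cells).
REQUIRED = [
    "checkpoints/ner/best_ner_model.pt",
    "checkpoints/ner/vocabularies.json",
    "data/ner_labeled_data.json",
    "data/ner_labeled_data_label_map.json",
]

def missing_from_archive(zip_path, required=REQUIRED):
    """Return the required member names that are absent from the zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine.
        corrupt = archive.testzip()
        if corrupt is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {corrupt}")
        names = set(archive.namelist())
    return [name for name in required if name not in names]
```

An empty list from `missing_from_archive('/content/ner_model.zip')` suggests the package is complete. Any names it returns point to files that were never created or were left out of the zip.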

---

## ✅ Summary

After running all cells, you'll have:

1. ✅ Auto-labeled NER dataset (~2,500 samples)
2. ✅ Trained custom NER model (BiLSTM-CRF)
3. ✅ Model backed up to Google Drive
4. ✅ Model downloaded to your computer

**Next steps:**
- Integrate NER into your complete audio analyzer
- Test with real Konkani audio
- Deploy to Hugging Face Spaces

---