# Custom Keyboard Transformer Model Training

Train a lightweight custom transformer (3.7M params) optimized for keyboard suggestions.

**Features:**
1. Word Completion: "hel" → ["hello", "help", "held"]
2. Next-Word Prediction: "how are" → ["you", "they", "we"]
3. Typo Correction: "thers" → ["there", "theirs"]

**Model Specifications:**
- Architecture: Custom Transformer (6 layers, 128 hidden, 4 heads)
- Parameters: 3.7M
- Vocabulary: 10,000 words (keyboard-optimized)
- Model Size: 14MB (FP32), 4MB (INT8)
- Expected Accuracy: 80-85%
- Training Time: 30-40 minutes on Colab GPU (T4)
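
The parameter count and model sizes above can be sanity-checked with some quick arithmetic. The sketch below is a rough estimate, not the repository's actual model definition. It assumes a standard encoder layout with learned positional embeddings, a feed-forward width of 4 × hidden (512), a maximum sequence length of 16, and an untied output projection; none of these are confirmed by the specification list.

```python
def count_params(vocab=10_000, hidden=128, layers=6, ff_dim=512, max_len=16):
    """Rough parameter count for a standard encoder-style transformer.

    ff_dim, max_len, and the untied output head are assumptions made
    for this estimate; they are not taken from the training code.
    """
    tok_emb = vocab * hidden                    # token embedding table
    pos_emb = max_len * hidden                  # learned positional embeddings
    attn = 4 * (hidden * hidden + hidden)       # Q, K, V, output projections with biases
    ffn = (hidden * ff_dim + ff_dim) + (ff_dim * hidden + hidden)
    norms = 2 * 2 * hidden                      # two LayerNorms, weight + bias each
    head = hidden * vocab + vocab               # untied output projection with bias
    return tok_emb + pos_emb + layers * (attn + ffn + norms) + head


total = count_params()
print(f"{total:,} parameters")
print(f"FP32: {total * 4 / 2**20:.1f} MiB, INT8: {total / 2**20:.1f} MiB")
```

Under these assumptions the total comes to roughly 3.76M parameters, which is about 14 MiB at 4 bytes per weight and under 4 MiB at 1 byte per weight, in line with the figures listed above. Note that the number of attention heads does not change the count, since the heads split the same projection matrices.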

**Why Custom Model?**
- ✅ Small vocabulary (10k vs. 50k): fewer output classes to learn and a smaller embedding table
- ✅ Trained from scratch on keyboard data
- ✅ Optimized for mobile deployment

---

**Instructions:**
1. Runtime → Change runtime type → GPU (T4)
2. Run all cells in order
3. Model will be saved to Google Drive
4. Download the CoreML model (iOS) or the ONNX model (Android; convert it to TFLite separately)

## 1. Environment Setup

In [None]:
# Mount Google Drive
from google.colab import drive
import os

drive.mount('/content/drive')

# Define directories
DRIVE_DIR = '/content/drive/MyDrive/Keyboard-Suggestions-ML-Colab'
DATA_DIR = f"{DRIVE_DIR}/data/datasets"
PROCESSED_DIR = f"{DRIVE_DIR}/data/processed"
MODEL_DIR = f"{DRIVE_DIR}/models/custom_keyboard"

# Create directories
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(PROCESSED_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

print("✓ Google Drive mounted")
print(f"✓ Data directory: {DATA_DIR}")
print(f"✓ Model directory: {MODEL_DIR}")

In [None]:
# Install dependencies
!pip install -q torch transformers tqdm coremltools tensorflow

In [None]:
# Clone repository to get custom model code
!git clone https://github.com/MinhPhuPham/Keyboard-Suggestions-ML-Colab.git /content/repo

# Copy custom model scripts
import shutil
shutil.copytree('/content/repo/scripts/custom-model', '/content/custom_model', dirs_exist_ok=True)

print("✓ Repository cloned")
print("✓ Custom model code copied to /content/custom_model")

## 2. Verify Datasets

Upload these files to Google Drive at `Keyboard-Suggestions-ML-Colab/data/datasets/`:
- `single_word_freq.csv`
- `keyboard_training_data.txt`
- `misspelled.csv`
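
Before running the pipeline, it can help to look at the first few lines of each uploaded file to confirm it is the expected data. The helper below is a minimal sketch; `preview_file` is a hypothetical name, and it deliberately makes no assumption about column names or delimiters, since the file formats are not described here.

```python
import tempfile
from itertools import islice
from pathlib import Path


def preview_file(path, n_lines=3):
    """Return (size_in_bytes, first n_lines lines) for a text file."""
    path = Path(path)
    with path.open("r", encoding="utf-8", errors="replace") as f:
        head = [line.rstrip("\n") for line in islice(f, n_lines)]
    return path.stat().st_size, head


# Demo on a throwaway file; in the notebook, pass a real path such as
# f"{DATA_DIR}/misspelled.csv" instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("line one\nline two\nline three\nline four\n", encoding="utf-8")
    size, head = preview_file(sample)
    print(size, head)
```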

In [None]:
# Verify datasets exist
required_files = [
    f"{DATA_DIR}/single_word_freq.csv",
    f"{DATA_DIR}/keyboard_training_data.txt",
    f"{DATA_DIR}/misspelled.csv"
]

print("Checking datasets...")

all_exist = True
for file_path in required_files:
    exists = os.path.exists(file_path)
    status = "✓" if exists else "❌"
    print(f"{status} {os.path.basename(file_path)}: {exists}")
    if not exists:
        all_exist = False

if not all_exist:
    print("\n⚠️  Please upload missing datasets to Google Drive!")
    print(f"   Upload to: {DATA_DIR}/")
else:
    print("\n✅ All datasets found!")

## 3. Prepare Training Data

In [None]:
# Make the custom model code importable (the test cells later import tokenizer and model from it)
import sys
sys.path.insert(0, '/content/custom_model')

# Run data preparation
!cd /content/custom_model && python prepare_data.py \
    --data-dir {DATA_DIR} \
    --output-dir {PROCESSED_DIR} \
    --max-completion 50000 \
    --max-nextword 100000 \
    --max-typo 20000

## 4. Train Custom Model

Training will take approximately 30-40 minutes on GPU.
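
The training command passes `--device cuda`, which fails if the runtime has no GPU attached. A minimal sketch for checking this first is shown below; `pick_device` is a hypothetical helper, and whether `train.py` accepts `--device cpu` is an assumption that is not confirmed here.

```python
def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # installed by the setup cell in Colab
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training device: {device}")
```

If this prints `cpu` in Colab, the runtime type is probably not set to GPU, and the 30-40 minute estimate will not hold.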

In [None]:
# Train model
!cd /content/custom_model && python train.py \
    --data-dir {PROCESSED_DIR} \
    --save-dir {MODEL_DIR} \
    --num-epochs 20 \
    --batch-size 64 \
    --device cuda

## 5. Test Model - 10 Test Cases

Test the trained model with 10 cases covering all features.
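
The test helper in this section builds its input by encoding the text, appending one mask token for the word to predict, and padding to a fixed length of 16. The sketch below reproduces that layout in plain Python so it can be inspected without the model; the token ids used (57, 912, mask id 2, pad id 0) are made up for illustration and do not come from the real tokenizer.

```python
def build_inputs(token_ids, mask_id, pad_id, max_length=16):
    """Append a mask token, pad to max_length, and derive the attention mask."""
    ids = list(token_ids)[: max_length - 1]      # leave room for the mask token
    ids.append(mask_id)                          # position the model should fill in
    ids += [pad_id] * (max_length - len(ids))    # right-pad to the fixed length
    attention_mask = [0 if i == pad_id else 1 for i in ids]
    return ids, attention_mask


# Hypothetical ids for a two-token input such as "how are"
ids, mask = build_inputs([57, 912], mask_id=2, pad_id=0)
print(ids)
print(mask)
```

The attention mask is 1 for the real tokens and the mask token and 0 for padding, so padding positions are ignored when the model attends over the sequence.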

In [None]:
# Load trained model for testing
import torch
from tokenizer import KeyboardTokenizer
from model import KeyboardTransformer

# Load tokenizer
tokenizer = KeyboardTokenizer.load(f"{MODEL_DIR}/tokenizer.pkl")

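# Load model
# NOTE: these hyperparameters must match the ones train.py used, or
# load_state_dict() will fail with shape mismatches. They differ from the
# specification listed at the top (6 layers, 128 hidden, 4 heads); check
# train.py and adjust if needed.
model = KeyboardTransformer(
    vocab_size=len(tokenizer),
    hidden_size=256,
    num_layers=8,
    num_heads=8,
    ff_dim=1024,
    max_length=16
)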

checkpoint = torch.load(f"{MODEL_DIR}/best_model.pt", map_location='cuda')
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to('cuda')
model.eval()

print("✓ Model loaded successfully")
print(f"✓ Val Loss: {checkpoint.get('val_loss', 'N/A')}")
print(f"✓ Val Accuracy: {checkpoint.get('val_accuracy', 0)*100:.2f}%")

In [None]:
# Test function
def test_prediction(input_text, top_k=5):
    """Test model prediction"""
    # Encode input
    input_ids = tokenizer.encode(input_text, max_length=15, padding=False)
    input_ids.append(tokenizer.mask_token_id)
    
    while len(input_ids) < 16:
        input_ids.append(tokenizer.pad_token_id)
    
    input_tensor = torch.tensor([input_ids], dtype=torch.long).to('cuda')
    attention_mask = torch.tensor(
        [[1 if idx != tokenizer.pad_token_id else 0 for idx in input_ids]],
        dtype=torch.long
    ).to('cuda')
    
    # Predict
    with torch.no_grad():
        top_tokens, top_probs = model.predict(input_tensor, attention_mask, top_k=top_k*2)
    
    # Decode
    predictions = []
    for token_id, prob in zip(top_tokens[0], top_probs[0]):
        word = tokenizer.idx2word.get(token_id.item(), tokenizer.unk_token)
        if word not in [tokenizer.pad_token, tokenizer.unk_token, tokenizer.mask_token]:
            predictions.append((word, prob.item() * 100))
        if len(predictions) >= top_k:
            break
    
    return predictions

In [None]:
# 10 Test Cases
test_cases = [
    # Word Completion (4 cases)
    ("hel", "Word Completion", "Should suggest: hello, help, held"),
    ("prod", "Word Completion", "Should suggest: product, production, produce"),
    ("beau", "Word Completion", "Should suggest: beautiful, beauty, beautifully"),
    ("comp", "Word Completion", "Should suggest: complete, computer, company"),
    
    # Next-Word Prediction (4 cases)
    ("how are", "Next-Word Prediction", "Should suggest: you, they, we"),
    ("thank", "Next-Word Prediction", "Should suggest: you, for, god"),
    ("good morning", "Next-Word Prediction", "Should suggest: to, and, everyone"),
    ("see you", "Next-Word Prediction", "Should suggest: later, soon, tomorrow"),
    
    # Typo Correction (2 cases)
    ("thers", "Typo Correction", "Should suggest: there, theirs"),
    ("recieve", "Typo Correction", "Should suggest: receive"),
]

print("="*80)
print("TESTING CUSTOM KEYBOARD MODEL - 10 TEST CASES")
print("="*80)

for i, (input_text, task, expected) in enumerate(test_cases, 1):
    print(f"\nTest {i}/10: {task}")
    print(f"Input: '{input_text}'")
    print(f"Expected: {expected}")
    
    predictions = test_prediction(input_text, top_k=3)
    
    print("Predictions:")
    if predictions:
        for j, (word, prob) in enumerate(predictions, 1):
            confidence = "🟢" if prob > 50 else "🟡" if prob > 20 else "🔴"
            print(f"  {j}. {word:15s} {confidence} {prob:5.1f}%")
    else:
        print("  (no predictions)")
    
    print("-" * 80)

print("\n" + "="*80)
print("✅ ALL TESTS COMPLETE")
print("="*80)

## 6. Export to CoreML (iOS)
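
The export cell in this section quantizes the weights with `linear_symmetric` mode, `int8` dtype, and `per_channel` granularity. The sketch below illustrates the general idea of symmetric per-channel quantization on a tiny weight matrix in plain Python. It is a conceptual illustration only and not how coremltools implements it internally; the function names and sample weights are made up.

```python
def quantize_row(row):
    """Symmetric INT8 quantization of one output channel (one weight row)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0   # one scale per channel
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [v * scale for v in q]


# Two channels with very different magnitudes
weights = [[0.50, -0.25, 0.10], [0.02, -0.01, 0.005]]
for row in weights:
    q, scale = quantize_row(row)
    restored = dequantize_row(q, scale)
    err = max(abs(a - b) for a, b in zip(row, restored))
    print(q, f"scale={scale:.6f}", f"max_err={err:.6f}")
```

Because each row gets its own scale, the second channel keeps its resolution even though its weights are much smaller than those of the first; a single scale for the whole matrix would round most of them to zero.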

In [None]:
# Export to CoreML
import coremltools as ct
import numpy as np

print("Exporting to CoreML...")

# Prepare model for export
model.eval()
model = model.to('cpu')

# Create dummy input
dummy_input = torch.randint(0, len(tokenizer), (1, 16))

# Trace model
traced_model = torch.jit.trace(model, dummy_input)

# Convert to CoreML
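mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 16), dtype=np.int32)],
    compute_units=ct.ComputeUnit.ALL,
    compute_precision=ct.precision.FLOAT16,
    # The ML Program format (needed for FLOAT16 compute precision, the
    # .mlpackage output, and the weight quantization below) requires iOS 15+;
    # iOS16 is used here because the quantized-weight ops target iOS 16.
    minimum_deployment_target=ct.target.iOS16
)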

# Add metadata
mlmodel.author = "MinhPhuPham"
mlmodel.short_description = "Custom keyboard transformer model"
mlmodel.version = "1.0"

# Quantize to INT8
print("Quantizing to INT8...")
import coremltools.optimize.coreml as cto

op_config = cto.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int8",
    granularity="per_channel"
)
config = cto.OptimizationConfig(global_config=op_config)
mlmodel_int8 = cto.linear_quantize_weights(mlmodel, config=config)

# Save
coreml_path = f"{MODEL_DIR}/CustomKeyboard.mlpackage"
mlmodel_int8.save(coreml_path)

print(f"✓ CoreML model saved: {coreml_path}")
print("✓ Model size: ~4-6MB (INT8)")
print("✓ Expected RAM: 12-15MB")
print("✓ Expected latency: <50ms")

In [None]:
# Save vocabulary for iOS
import json

vocab_data = {
    'word2idx': tokenizer.word2idx,
    'idx2word': {str(k): v for k, v in tokenizer.idx2word.items()},
    'vocab_size': len(tokenizer),
    'pad_token_id': tokenizer.pad_token_id,
    'unk_token_id': tokenizer.unk_token_id,
    'mask_token_id': tokenizer.mask_token_id
}

vocab_path = f"{MODEL_DIR}/vocabulary.json"
with open(vocab_path, 'w', encoding='utf-8') as f:
    json.dump(vocab_data, f, ensure_ascii=False, indent=2)

print(f"✓ Vocabulary saved: {vocab_path}")
print(f"✓ Vocab size: {len(tokenizer):,} words")

## 7. Export to TFLite (Android)
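
This section exports the model to ONNX and leaves the final TFLite step to a separate tool. As a hedged sketch of that last step, the function below converts a TensorFlow SavedModel directory into a `.tflite` file with the standard TensorFlow Lite converter. It assumes the ONNX-to-TensorFlow conversion has already produced a SavedModel directory; `saved_model_to_tflite` and both paths are hypothetical, and some transformer ops may need extra converter settings that are not shown.

```python
def saved_model_to_tflite(saved_model_dir, tflite_path):
    """Convert a TensorFlow SavedModel directory to a .tflite file.

    Returns the size of the written file in bytes.
    """
    import tensorflow as tf  # imported lazily; installed by the setup cell

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Default optimizations apply dynamic-range quantization to the weights.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()
    with open(tflite_path, "wb") as f:
        f.write(tflite_bytes)
    return len(tflite_bytes)


# Example call (paths are placeholders, not files this notebook creates):
# saved_model_to_tflite("/content/custom_keyboard_tf", "/content/custom_keyboard.tflite")
```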

In [None]:
# Export to TFLite
import tensorflow as tf

print("Exporting to TFLite...")

# Export the PyTorch model to ONNX first (TFLite conversion happens separately)
onnx_path = f"{MODEL_DIR}/custom_keyboard.onnx"
torch.onnx.export(
    model,
    dummy_input,
    onnx_path,
    input_names=['input_ids'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch_size'},
        'logits': {0: 'batch_size'}
    },
    opset_version=12
)

print(f"✓ ONNX model saved: {onnx_path}")

# Note: Full ONNX → TFLite conversion requires onnx-tf package
# For now, we'll save the ONNX model which can be converted separately
print("\n📝 Note: Convert ONNX to TFLite using:")
print("   pip install onnx-tf")
print("   onnx-tf convert -i custom_keyboard.onnx -o custom_keyboard.pb")
print("   Then use TensorFlow Lite converter")

## 8. Training Summary

In [None]:
# Display training summary
import json

history_path = f"{MODEL_DIR}/training_history.json"
if os.path.exists(history_path):
    with open(history_path, 'r') as f:
        history = json.load(f)
    
    print("="*80)
    print("TRAINING SUMMARY")
    print("="*80)
    
    print(f"\nFinal Results:")
    print(f"  Train Loss: {history['train_loss'][-1]:.4f}")
    print(f"  Val Loss:   {history['val_loss'][-1]:.4f}")
    print(f"  Val Accuracy: {history['val_accuracy'][-1]*100:.2f}%")
    
    print(f"\nBest Results:")
    best_val_loss = min(history['val_loss'])
    best_epoch = history['val_loss'].index(best_val_loss) + 1
    print(f"  Best Val Loss: {best_val_loss:.4f} (Epoch {best_epoch})")
    print(f"  Best Val Accuracy: {max(history['val_accuracy'])*100:.2f}%")
    
    print(f"\nModel Files:")
    print(f"  PyTorch Model: {MODEL_DIR}/best_model.pt")
    print(f"  CoreML Model: {MODEL_DIR}/CustomKeyboard.mlpackage")
    print(f"  ONNX Model: {MODEL_DIR}/custom_keyboard.onnx")
    print(f"  Vocabulary: {MODEL_DIR}/vocabulary.json")
    
    print("\n" + "="*80)
    print("✅ TRAINING COMPLETE!")
    print("="*80)
    print("\nNext Steps:")
    print("1. Download models from Google Drive")
    print("2. Integrate CoreML model into iOS app")
    print("3. Convert ONNX to TFLite for Android")
    print("4. Test on actual devices")
else:
    print("⚠️  Training history not found. Training may not have completed.")

## 9. Download Models

Download these files from Google Drive for mobile deployment:

**For iOS:**
- `CustomKeyboard.mlpackage` (CoreML model)
- `vocabulary.json` (Vocabulary file)

**For Android:**
- `custom_keyboard.onnx` (ONNX model - convert to TFLite)
- `vocabulary.json` (Vocabulary file)

**Location:** `Google Drive/Keyboard-Suggestions-ML-Colab/models/custom_keyboard/`
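
On the consuming side, `vocabulary.json` stores `idx2word` with string keys, because JSON object keys are always strings. The sketch below shows one way to load the file and map predicted token ids back to words. The helper names, the `<unk>` fallback string, and the tiny demo vocabulary are assumptions for illustration; a mobile app would do the equivalent in Swift or Kotlin.

```python
import json
import tempfile
from pathlib import Path


def load_vocabulary(path):
    """Load vocabulary.json and restore integer keys for idx2word."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    idx2word = {int(k): v for k, v in data["idx2word"].items()}
    return data["word2idx"], idx2word


def decode(token_ids, idx2word, unk="<unk>"):
    """Map token ids to words, falling back to a placeholder for unknown ids."""
    return [idx2word.get(i, unk) for i in token_ids]


# Demo with a made-up three-word vocabulary in the same layout as the export cell
demo = {
    "word2idx": {"hello": 3, "help": 4, "held": 5},
    "idx2word": {"3": "hello", "4": "help", "5": "held"},
    "vocab_size": 3,
}
with tempfile.TemporaryDirectory() as tmp:
    vocab_file = Path(tmp) / "vocabulary.json"
    vocab_file.write_text(json.dumps(demo), encoding="utf-8")
    word2idx, idx2word = load_vocabulary(vocab_file)
    words = decode([4, 3, 99], idx2word)
    print(words)
```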