# English Keyboard Suggestion Model Training

This notebook trains an English keyboard suggestion model using Microsoft Phi-3 Mini with LoRA fine-tuning.

**Target Specifications:**
- Model Size: 20-30 MB (after optimization)
- Latency: < 50 ms
- Perplexity: < 20
- Top-3 Accuracy: > 85%
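
The perplexity and top-3 accuracy targets can be made concrete with a small sketch. The helper names and toy values below are illustrative assumptions, not part of this repository. Perplexity is taken as the exponential of the mean negative log-probability per token, and top-3 accuracy as the fraction of positions where the true next word appears among the three highest-ranked suggestions.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def top_k_accuracy(ranked_suggestions, targets, k=3):
    """Fraction of targets found among the first k suggestions."""
    hits = sum(t in s[:k] for s, t in zip(ranked_suggestions, targets))
    return hits / len(targets)

# Toy case: the model assigns probability 0.25 to every true token.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)

# Toy ranked suggestions for four positions, with the true next words.
suggestions = [
    ["the", "a", "my"],
    ["is", "was", "day"],
    ["nice", "good", "bad"],
    ["today", "now", "here"],
]
targets = ["the", "day", "great", "today"]
acc = top_k_accuracy(suggestions, targets)

print(f"perplexity={ppl:.2f}, top-3 accuracy={acc:.2f}")
```

A uniform probability of 0.25 per token corresponds to a perplexity of about 4, which is one way to read the `< 20` target: on average the model should be less uncertain than a uniform choice among 20 tokens.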

## 1. Environment Setup

In [None]:
# Clone repository (if running in Colab)
import os
if 'COLAB_GPU' in os.environ:
    !git clone https://github.com/MinhPhuPham/Keyboard-Suggestions-ML-Colab.git
    %cd Keyboard-Suggestions-ML-Colab
else:
    print("Running locally")

In [None]:
# Install dependencies
!pip install -q -r requirements.txt

In [None]:
# Import libraries
import sys
sys.path.append('./src')

import torch
from transformers import AutoTokenizer
from datasets import load_dataset

from data_prep import clean_english_text, augment_with_emojis, split_dataset
from model_utils import (
    load_model_with_lora,
    train_causal_lm,
    evaluate_perplexity,
    prune_model,
    quantize_model,
    merge_lora_weights,
)
from export_utils import (
    export_to_onnx,
    export_to_coreml,
    verify_model_size,
    benchmark_latency,
    package_for_download,
)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Data Preparation

In [None]:
# Download SwiftKey corpus
# Manual download required: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
# Extract en_US folder to ./data/english/

# For this example, we'll use a sample dataset
print("Loading sample English dataset...")

# You can also use a Hugging Face dataset as a substitute
# dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')

In [None]:
# Prepare training data
# This is a simplified example - replace with actual SwiftKey corpus processing

sample_sentences = [
    "Today is a beautiful day",
    "I love programming in Python",
    "The weather is nice today",
    "Let's meet tomorrow morning",
    "Thank you for your help",
]

# Clean text
cleaned = [clean_english_text(s) for s in sample_sentences]

# Augment with emojis
augmented = augment_with_emojis(cleaned, emoji_ratio=0.2)

print("Sample augmented sentences:")
for sent in augmented[:3]:
    print(f"  {sent}")
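
The implementations of `clean_english_text` and `augment_with_emojis` live in `src/data_prep.py` and are not shown in this notebook. As a rough idea of what such steps might do, here is a minimal sketch. The function names, the cleaning rules, the emoji list, and the `seed` parameter are all hypothetical and may differ from the repository's code.

```python
import random
import re

def sketch_clean_text(text):
    """Hypothetical cleaner: drop URLs, collapse whitespace, strip the ends."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def sketch_augment_with_emojis(sentences, emoji_ratio=0.2,
                               emojis=("🙂", "👍", "🎉"), seed=0):
    """Hypothetical augmenter: append an emoji to roughly emoji_ratio of sentences."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    augmented = []
    for sentence in sentences:
        if rng.random() < emoji_ratio:
            sentence = f"{sentence} {rng.choice(emojis)}"
        augmented.append(sentence)
    return augmented

cleaned_demo = sketch_clean_text("  Thank   you  https://example.com ")
augmented_demo = sketch_augment_with_emojis([cleaned_demo], emoji_ratio=1.0)
print(augmented_demo)
```

With only five sample sentences and `emoji_ratio=0.2`, a random augmenter of this kind may well leave every sentence unchanged, so the printed samples need not contain any emoji.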

## 3. Model Setup and Fine-Tuning

In [None]:
# Load model with LoRA
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

model, tokenizer = load_model_with_lora(
    model_name=MODEL_NAME,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.1
)
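
LoRA freezes each targeted weight matrix `W` and learns a low-rank update, so the effective weight is `W + (lora_alpha / lora_r) * B @ A`, with `B` of shape `d_out x r` and `A` of shape `r x d_in`. The arithmetic below shows why this keeps the trainable parameter count small. The hidden size of 3072 is an assumption used for illustration, and which layers `load_model_with_lora` actually targets is not specified in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A (r x d_in) + B (d_out x r)."""
    return r * (d_in + d_out)

HIDDEN = 3072    # assumed hidden size, for illustration only
LORA_R = 8       # matches lora_r above
LORA_ALPHA = 16  # matches lora_alpha above

full_params = HIDDEN * HIDDEN  # one dense square projection
adapter_params = lora_param_count(HIDDEN, HIDDEN, LORA_R)
scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update

print(f"full matrix:  {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({adapter_params / full_params:.2%} of the full matrix)")
print(f"update scaling alpha/r = {scaling}")
```

Under these assumptions the adapter trains roughly half a percent of the parameters of the matrix it adapts, and the update is scaled by 2.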

In [None]:
# Prepare dataset for training
# This is a simplified example - use actual SwiftKey data

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=8)

# Create a simple dataset from our sample
from datasets import Dataset
train_data = Dataset.from_dict({'text': augmented})
train_dataset = train_data.map(tokenize_function, batched=True)

print(f"Training dataset size: {len(train_dataset)}")

In [None]:
# Train model
trainer = train_causal_lm(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    output_dir="./checkpoints/english",
    num_epochs=3,
    batch_size=8,  # Adjust based on GPU memory
    learning_rate=1e-5,
    max_seq_length=8,
    save_steps=100
)
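
Causal language-model training teaches the model to predict each token from the tokens before it, which is the same task a keyboard performs when it suggests the next word. The sketch below spells out the prefix and target pairs implied by one sentence. It uses whitespace splitting as a stand-in for the real tokenizer, and `next_token_pairs` is a hypothetical helper, not how `train_causal_lm` is implemented.

```python
def next_token_pairs(tokens, max_seq_length=8):
    """List (prefix, next_token) pairs after truncating to max_seq_length tokens."""
    tokens = tokens[:max_seq_length]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace split stands in for the subword tokenizer used in the notebook.
pairs = next_token_pairs("Thank you for your help".split())
for prefix, target in pairs:
    print(f"{' '.join(prefix)!r} -> {target!r}")
```

Because real tokenizers split text into subword pieces and may add special tokens, `max_seq_length=8` generally covers fewer than eight whole words.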

## 4. Optimization and Export

In [None]:
# Merge LoRA weights
model = merge_lora_weights(model)

In [None]:
# Prune model
model = prune_model(model, amount=0.3)
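
The notebook does not show how `prune_model` selects weights. One common approach is unstructured magnitude pruning, where the fraction `amount` of weights with the smallest absolute value is set to zero. The following is a minimal sketch of that idea on a plain list, assuming that technique. The function name is hypothetical.

```python
def magnitude_prune(weights, amount=0.3):
    """Zero the `amount` fraction of weights with the smallest absolute value."""
    n_prune = round(len(weights) * amount)
    # Indices ordered by |weight|; the first n_prune entries get zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.02, 0.6, -0.1]
pruned = magnitude_prune(weights, amount=0.3)
print(pruned)
```

Zeroed weights still occupy space in a dense tensor, so pruning of this kind reduces file size only if the export format stores the result sparsely or the file is compressed afterwards.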

In [None]:
# Quantize model
model = quantize_model(model, dtype=torch.qint8)
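
Quantizing to 8-bit integers stores each weight as a small integer plus a shared scale factor. The sketch below shows symmetric per-tensor quantization on a plain list to illustrate the idea and the rounding error it introduces. It is an assumption about the general technique, not a description of what `quantize_model` or PyTorch does internally.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: value ~= scale * q, with q in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return [scale * x for x in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("codes:   ", q)
print("restored:", [round(v, 4) for v in restored])
```

Storing one byte per weight instead of four cuts weight storage by roughly a factor of four, at the cost of a rounding error of at most half the scale per weight.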

In [None]:
# Export to ONNX
os.makedirs("./models/english", exist_ok=True)
onnx_path = export_to_onnx(
    model=model,
    tokenizer=tokenizer,
    output_path="./models/english/english_model.onnx",
    max_seq_length=8
)

In [None]:
# Export to Core ML (for iOS)
coreml_path = export_to_coreml(
    onnx_path=onnx_path,
    output_path="./models/english/english_model.mlmodel",
    model_name="EnglishKeyboardSuggestion"
)

## 5. Verification

In [None]:
# Verify model size
size_mb, meets_req = verify_model_size(
    model_path=onnx_path,
    max_size_mb=30
)
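
The size check can also be reproduced with the standard library, which is handy when inspecting an export outside this notebook. This is a minimal sketch with hypothetical helper names. It accepts a directory as well as a file, on the assumption that an exporter may write weights as separate external-data files next to the `.onnx` file.

```python
import os

def path_size_mb(path):
    """Size of a file, or the total of all files under a directory, in MiB."""
    if os.path.isdir(path):
        total = sum(
            os.path.getsize(os.path.join(root, name))
            for root, _dirs, names in os.walk(path)
            for name in names
        )
    else:
        total = os.path.getsize(path)
    return total / (1024 * 1024)

def meets_size_budget(path, max_size_mb=30):
    """Return (size in MiB, whether it fits within max_size_mb)."""
    size_mb = path_size_mb(path)
    return size_mb, size_mb <= max_size_mb

# Only runs if the export directory from this notebook exists.
if os.path.exists("./models/english"):
    size_mb, ok = meets_size_budget("./models/english", max_size_mb=30)
    print(f"{size_mb:.1f} MiB, within budget: {ok}")
```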

In [None]:
# Benchmark latency
# Note: this cell only prints a reminder and does not run a benchmark.
# Quantized model latency should be tested on an actual device.
print("Note: Benchmark on CPU - actual mobile latency will differ")
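
For a rough CPU-side number to compare against the 50 ms target, a generic timing harness along these lines could be used. This is a sketch under stated assumptions: `benchmark` and `fake_predict` are hypothetical names, and `fake_predict` is a placeholder workload that should be replaced with a real single-suggestion inference call.

```python
import statistics
import time

def benchmark(fn, warmup=5, runs=50):
    """Time fn() over several runs and report median and 95th-percentile latency in ms."""
    for _ in range(warmup):  # warm-up runs are not measured
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    timings_ms.sort()
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * (len(timings_ms) - 1))],
    }

def fake_predict():
    """Placeholder workload standing in for one model inference."""
    sum(i * i for i in range(1000))

stats = benchmark(fake_predict)
print(f"median: {stats['median_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
```

Reporting a high percentile alongside the median helps here, since a keyboard feels slow when occasional suggestions lag even if the typical one is fast.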

## 6. Save and Download

In [None]:
# Package model for download
zip_path = package_for_download(
    model_dir="./models/english",
    output_zip="english_model.zip"
)

In [None]:
# Download (Colab only)
if 'COLAB_GPU' in os.environ:
    from google.colab import files
    files.download(zip_path)
else:
    print(f"Model saved to: {zip_path}")

## Next Steps

1. Download the model zip file
2. Extract on your local machine
3. Integrate into iOS/Android keyboard app
4. Test on actual devices
5. Iterate based on performance