# Japanese Keyboard Suggestion Model Training

This notebook fine-tunes Qwen2-1.5B-Instruct with LoRA to build a Japanese keyboard suggestion model, then prunes, quantizes, and exports it for on-device use.

**Target Specifications:**
- Model Size: 40-60 MB (after optimization)
- Latency: < 80 ms
- Perplexity: < 20
- Top-3 Accuracy: > 80%
- IME Support: Romaji → Kanji conversion

## 1. Environment Setup

In [None]:
# Clone repository (if running in Colab)
import os
if 'COLAB_GPU' in os.environ:
    !git clone https://github.com/YOUR_USERNAME/KeyboardSuggestionsML.git
    %cd KeyboardSuggestionsML
else:
    print("Running locally")

In [None]:
# Install dependencies
!pip install -q -r requirements.txt

In [None]:
# Download UniDic for Japanese morphological analysis
!python -m unidic download

In [None]:
# Import libraries
import sys
sys.path.append('./src')

import torch
from transformers import AutoTokenizer
from datasets import load_dataset
import fugashi

from data_prep import clean_japanese_text, prepare_japanese_data
from model_utils import (
    load_model_with_lora,
    train_causal_lm,
    evaluate_perplexity,
    prune_model,
    quantize_model,
    merge_lora_weights,
)
from export_utils import (
    export_to_onnx,
    export_to_coreml,
    verify_model_size,
    benchmark_latency,
    package_for_download,
)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Data Preparation

In [None]:
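# Load Japanese CC100 dataset (streaming)
# Split slicing such as 'train[:10%]' is not supported in streaming mode,
# so stream the full split and bound the sample with .take() instead.
print("Loading CC100 Japanese dataset (streaming)...")

dataset = load_dataset(
    'cc100',
    lang='ja',
    split='train',
    streaming=True
)

# SAMPLE_SIZE is a placeholder value; pick it to match your data budget
SAMPLE_SIZE = 100_000
dataset = dataset.take(SAMPLE_SIZE)

print("Dataset loaded (streaming mode)")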

In [None]:
# Initialize Japanese morphological analyzer
tagger = fugashi.Tagger('-Owakati')

# Test morphological analysis
test_text = "今日は昨日より良い日だ"
print(f"Original: {test_text}")
print(f"Morphemes: {tagger.parse(test_text)}")

In [None]:
# Prepare sample training data
sample_sentences = [
    "今日は良い天気ですね",
    "明日会議があります",
    "ありがとうございます",
    "お疲れ様でした",
    "よろしくお願いします",
]

# Clean text
cleaned = [clean_japanese_text(s) for s in sample_sentences]

print("Sample sentences:")
for sent in cleaned[:3]:
    print(f"  {sent}")
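
The implementation of `clean_japanese_text` lives in `src/data_prep.py` and is not shown in this notebook. As a rough illustration of what cleaning Japanese text commonly involves, here is a minimal sketch; the function name `clean_text_sketch` and the specific steps (NFKC normalization, whitespace collapsing, control-character removal) are assumptions, not the project's actual behavior.

```python
import re
import unicodedata


def clean_text_sketch(text: str) -> str:
    """Hypothetical cleaner; the real clean_japanese_text may differ."""
    # NFKC folds full-width Latin letters and digits to ASCII and
    # half-width katakana to full-width katakana.
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop remaining control characters (Unicode category "Cc").
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


print(clean_text_sketch("  ＡＢＣ　１２３　ｶﾀｶﾅ \n"))
```

Normalizing before tokenization keeps visually identical inputs (for example full-width and half-width digits) from being treated as different strings.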

## 3. Model Setup and Fine-Tuning

In [None]:
# Load model with LoRA
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"

model, tokenizer = load_model_with_lora(
    model_name=MODEL_NAME,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"]  # Attention query, key, and value projections
)
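
To get a feel for how small the trainable LoRA adapter is with `lora_r=8`, the parameter count can be estimated by hand: for each adapted weight of shape `d_out x d_in`, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the layer shapes commonly reported for Qwen2-1.5B (hidden size 1536, key/value projection width 256, 28 layers); these numbers are assumptions and should be checked against `model.config`.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted weight matrix."""
    return r * (d_in + d_out)


# Assumed shapes for Qwen2-1.5B; verify against model.config before relying on them.
HIDDEN = 1536     # assumed hidden size (q_proj input and output width)
KV_DIM = 256      # assumed k_proj / v_proj output width (grouped-query attention)
NUM_LAYERS = 28   # assumed number of transformer layers
R = 8             # matches lora_r above

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, R)    # q_proj
    + lora_param_count(HIDDEN, KV_DIM, R)  # k_proj
    + lora_param_count(HIDDEN, KV_DIM, R)  # v_proj
)
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {total:,}")
```

Under these assumptions the adapter is on the order of 1.5 million parameters, a small fraction of the base model.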

In [None]:
# Prepare dataset for training
def tokenize_function(examples):
    # max_length should match max_seq_length used for training and export
    return tokenizer(examples['text'], truncation=True, max_length=8)

# Create a simple dataset from our sample
from datasets import Dataset
train_data = Dataset.from_dict({'text': cleaned})
train_dataset = train_data.map(tokenize_function, batched=True)

print(f"Training dataset size: {len(train_dataset)}")
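
For a causal language model, each tokenized sentence yields one training signal per position: given the tokens so far, predict the next one. The sketch below illustrates this with a truncation length of 8, matching `max_length` above. It uses single characters as stand-in tokens purely for illustration; the real tokenizer produces subword IDs, and `next_token_pairs` is a hypothetical helper, not part of `model_utils`.

```python
def next_token_pairs(tokens, max_length=8):
    """Return (prefix, next_token) pairs after truncating to max_length."""
    truncated = tokens[:max_length]
    return [(truncated[:i], truncated[i]) for i in range(1, len(truncated))]


# Characters stand in for tokens here; a real tokenizer yields subword IDs.
stand_in_tokens = list("ありがとうございます")

for prefix, target in next_token_pairs(stand_in_tokens):
    print("".join(prefix), "->", target)
```

With a limit of 8 tokens, anything past the eighth token is never seen during training, so longer phrases contribute only their beginnings.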

In [None]:
# Train model
trainer = train_causal_lm(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    output_dir="./checkpoints/japanese",
    num_epochs=3,
    batch_size=8,  # Adjust based on GPU memory
    learning_rate=5e-6,  # Conservative LR; LoRA runs often use values near 1e-4, so tune as needed
    max_seq_length=8,
    save_steps=100
)
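
The target specifications list perplexity below 20 and top-3 accuracy above 80%. The notebook imports `evaluate_perplexity` from `model_utils`, whose implementation is not shown here. As a reference for what the two metrics mean, here is a minimal, framework-free sketch; the function names and the toy inputs are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def top_k_accuracy(ranked_predictions, targets, k=3):
    """Fraction of targets that appear among the first k ranked predictions."""
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Toy data: the model assigns probability 0.25 to every correct token.
log_probs = [math.log(0.25)] * 4
print(f"Perplexity: {perplexity(log_probs):.2f}")

# Toy data: ranked suggestions per position and the token actually typed.
ranked = [["は", "が", "の", "を"], ["です", "だ", "でした"]]
typed = ["の", "ます"]
print(f"Top-3 accuracy: {top_k_accuracy(ranked, typed):.2f}")
```

In practice the log-probabilities and ranked suggestions would come from the fine-tuned model on a held-out set rather than from hand-written lists.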

## 4. Optimization and Export

In [None]:
# Merge LoRA weights
model = merge_lora_weights(model)

In [None]:
# Prune model (40% of weights; more aggressive because the base model is large)
model = prune_model(model, amount=0.4)
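
How `prune_model` selects weights is defined in `src/model_utils.py` and not shown here. One common approach is unstructured magnitude pruning, which zeroes the fraction `amount` of weights with the smallest absolute value. The sketch below shows that idea on a plain Python list; whether the project uses this exact criterion is an assumption.

```python
def magnitude_prune(weights, amount):
    """Zero out the `amount` fraction of weights with the smallest |w|."""
    n_prune = int(round(amount * len(weights)))
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.5, -0.1, 0.05, -0.8, 0.3]
pruned = magnitude_prune(weights, amount=0.4)
print(pruned)
```

Note that zeroed entries in a dense tensor still occupy storage, so this kind of pruning reduces file size only when combined with sparse storage or compression.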

In [None]:
# Quantize model
model = quantize_model(model, dtype=torch.qint8)
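
A back-of-the-envelope estimate helps relate the quantized model to the 40-60 MB size target. The sketch below multiplies a parameter count by the bits stored per parameter; it assumes dense storage, ignores file-format overhead, and uses the nominal figure of 1.5 billion parameters, which is an approximation of the real count.

```python
def estimated_size_mb(num_params: int, bits_per_param: int) -> float:
    """Dense storage estimate in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bits_per_param / 8 / 1_000_000


# Nominal "1.5B" figure; the exact count should be read from the loaded model.
ASSUMED_PARAMS = 1_500_000_000

for bits in (32, 16, 8, 4):
    size = estimated_size_mb(ASSUMED_PARAMS, bits)
    print(f"{bits:>2}-bit: ~{size:,.0f} MB")
```

Comparing these estimates with the `max_size_mb=60` limit used in the verification step gives an early indication of how much additional compression the export would need.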

In [None]:
# Export to ONNX
os.makedirs("./models/japanese", exist_ok=True)
onnx_path = export_to_onnx(
    model=model,
    tokenizer=tokenizer,
    output_path="./models/japanese/japanese_model.onnx",
    max_seq_length=8
)

In [None]:
# Export to Core ML (for iOS)
coreml_path = export_to_coreml(
    onnx_path=onnx_path,
    output_path="./models/japanese/japanese_model.mlmodel",
    model_name="JapaneseKeyboardSuggestion"
)

## 5. Verification

In [None]:
# Verify model size
size_mb, meets_req = verify_model_size(
    model_path=onnx_path,
    max_size_mb=60
)
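
The latency target of under 80 ms can be checked in a similar way. The notebook imports `benchmark_latency` from `export_utils`, but its signature is not shown here, so the sketch below is a generic stand-in: it times any zero-argument callable with warm-up runs and reports the median and 95th percentile. `benchmark_sketch` and `dummy_predict` are hypothetical names; in practice the callable would wrap one inference call on the exported model.

```python
import statistics
import time


def benchmark_sketch(fn, warmup=3, runs=20):
    """Time fn() over several runs and summarize in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not recorded
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "runs": runs,
    }


def dummy_predict():
    # Placeholder workload standing in for a single model inference.
    sum(i * i for i in range(10_000))


stats = benchmark_sketch(dummy_predict)
print(f"median: {stats['median_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
```

Reporting a high percentile alongside the median matters for keyboards, where occasional slow suggestions are noticeable while typing. Numbers measured in Colab are only indicative; the target applies to the actual device.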

In [None]:
# Test IME functionality (romaji → kanji)
test_inputs = [
    "kyouha",  # 今日は
    "arigatou",  # ありがとう
]

print("IME Test (requires additional IME layer):")
for inp in test_inputs:
    print(f"  {inp} → [IME conversion needed]")
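
The IME layer itself is outside the scope of this notebook. Its first stage, romaji to hiragana, is typically a table lookup with longest-match-first scanning; the later kana to kanji stage needs a dictionary or a model and is not sketched here. The example below is a toy version with a deliberately tiny table covering only the two test inputs; `KANA_TABLE` and `romaji_to_hiragana` are hypothetical names, and a real IME table is far larger and handles cases such as "n" and doubled consonants.

```python
# Toy table: only enough entries to convert the two test inputs above.
KANA_TABLE = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ga": "が", "ha": "は",
    "ri": "り", "to": "と", "kyo": "きょ",
}
MAX_KEY_LEN = max(len(key) for key in KANA_TABLE)


def romaji_to_hiragana(text: str) -> str:
    """Greedy longest-match conversion; unknown characters pass through."""
    out = []
    pos = 0
    while pos < len(text):
        for length in range(MAX_KEY_LEN, 0, -1):
            chunk = text[pos:pos + length]
            if chunk in KANA_TABLE:
                out.append(KANA_TABLE[chunk])
                pos += len(chunk)
                break
        else:
            # No table entry starts here; keep the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


for romaji in ["kyouha", "arigatou"]:
    print(f"{romaji} -> {romaji_to_hiragana(romaji)}")
```

The hiragana output would then be passed to a kana to kanji converter to produce candidates such as 今日は.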

## 6. Save and Download

In [None]:
# Package model for download
zip_path = package_for_download(
    model_dir="./models/japanese",
    output_zip="japanese_model.zip"
)

In [None]:
# Download (Colab only)
if 'COLAB_GPU' in os.environ:
    from google.colab import files
    files.download(zip_path)
else:
    print(f"Model saved to: {zip_path}")

## Next Steps

1. Download the model zip file
2. Extract on your local machine
3. Add IME layer for romaji → kanji conversion
4. Integrate into iOS/Android keyboard app
5. Test on actual devices with Japanese input
6. Iterate on the model based on measured latency, perplexity, and top-3 accuracy