# GRU Japanese Keyboard Model (2026) - Smart Word + Emoji Suggestions

Train a GRU model for **Japanese keyboard suggestions** with:
- **Word-level predictions**: ご → ごはん, ございます
- **Prefix completion**: ござ → ございます, ございました
- **Phrase suggestions**: ございます → ありがとうございます
- **Emoji suggestions**: ありがとう → 🙏, 笑 → 😊

**Workflow:**
1. Setup & Config
2. Load Data (zenz-v2.5-dataset)
3. Build Word Vocabulary with Prefix Index (+ Emoji)
4. Create Training Data (Input → Output pairs)
5. Build & Train Model
6. Visualize Training
7. Save Model
8. Export TFLite (Android)
9. Export CoreML Weights (iOS)
10. Export Mobile Resources (Prefix Index, Word Lists, Emoji)
11. Verification Test

**Key Features:**
- Uses zenz-v2.5-dataset (kana → kanji conversion pairs)
- Word/morpheme-level tokenization for complete word suggestions
- Prefix matching for smart autocomplete
- **Emoji support** - suggests emojis based on context from dataset
- 6000 vocabulary limit for mobile optimization

---
**Instructions:**
1. Runtime → Change runtime type → GPU (T4)
2. Set `TESTING_MODE = True` for quick test
3. Set `TESTING_MODE = False` for full training

## 1. Environment Setup

In [None]:
# Mount Google Drive and setup directories
from google.colab import drive
import os

drive.mount('/content/drive')

DRIVE_DIR = '/content/drive/MyDrive/Keyboard-Suggestions-ML-Colab'
MODEL_DIR = f"{DRIVE_DIR}/models/gru_keyboard_japanese"
os.makedirs(MODEL_DIR, exist_ok=True)
print(f"✓ Model directory: {MODEL_DIR}")

In [None]:
# Install dependencies (including regex for emoji support)
!pip install -q tensorflow keras datasets pandas numpy scikit-learn tqdm regex

In [None]:
# ============================================================
# CONFIGURATION
# ============================================================

TESTING_MODE = True  # ← Change to False for full training

if TESTING_MODE:
    NUM_EPOCHS = 5
    BATCH_SIZE = 256
    VOCAB_SIZE_LIMIT = 6000  # Capped to keep the mobile model small
    SEQUENCE_LENGTH = 15    # Longer for word sequences
    MAX_SAMPLES = 200000     # Limited for testing
else:
    NUM_EPOCHS = 20
    BATCH_SIZE = 256
    VOCAB_SIZE_LIMIT = 6000
    SEQUENCE_LENGTH = 15
    MAX_SAMPLES = 300000    # More samples for full training

# Model architecture
EMBEDDING_DIM = 128
GRU_UNITS = 256

# Special tokens
PAD_TOKEN = '<PAD>'
UNK_TOKEN = '<UNK>'
BOS_TOKEN = '<BOS>'  # Beginning of sequence
EOS_TOKEN = '<EOS>'  # End of sequence

print(f"Config: vocab={VOCAB_SIZE_LIMIT:,}, seq={SEQUENCE_LENGTH}, epochs={NUM_EPOCHS}")
print(f"Max samples: {MAX_SAMPLES:,}")

## 2. Load Datasets

The zenz-v2.5-dataset contains:
- `input`: Hiragana/Katakana input (what user types)
- `output`: Kanji-mixed output (the conversion result)
- `left_context`: Previous text for context

These kana-to-kanji pairs and their left context supply the text used below to build the word vocabulary and the next-word training sequences.

In [None]:
from datasets import load_dataset
import re
import regex  # For emoji support
from collections import Counter, defaultdict

print("Loading zenz-v2.5-dataset from Hugging Face...")
print("="*60)

# Load dataset
try:
    dataset = load_dataset(
        "Miwa-Keita/zenz-v2.5-dataset",
        data_files="train_wikipedia.jsonl",
        split=f"train[:{MAX_SAMPLES}]"
    )
    print(f"✓ Loaded {len(dataset):,} samples from Wikipedia subset")
except Exception as e:
    print(f"Wikipedia subset not available ({type(e).__name__}), trying full dataset...")
    dataset = load_dataset(
        "Miwa-Keita/zenz-v2.5-dataset",
        split=f"train[:{MAX_SAMPLES}]"
    )
    print(f"✓ Loaded {len(dataset):,} samples")

# Show sample data
print("\nSample entries:")
for i in range(min(5, len(dataset))):
    item = dataset[i]
    print(f"  Input: {item['input'][:30]}...")
    print(f"  Output: {item['output'][:30]}...")
    print(f"  Context: {str(item.get('left_context', ''))[:30]}...")
    print()

## 3. Build Word Vocabulary with Prefix Index (+ Emoji)

Build a vocabulary of common Japanese words and emojis:
- ご → [ごはん, ごめんなさい, ご協力, ...]
- ござ → [ございます, ござる, ...]
- ありがとう → [🙏, ございます, ...]
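
To make the target data structure concrete, here is a minimal sketch of a prefix index built from a handful of words. The counts are made up for illustration and `build_prefix_index` is a hypothetical helper, not part of the notebook; the actual cell below stores word indices rather than strings.

```python
from collections import defaultdict

def build_prefix_index(word_counts, top_k=3):
    """Map every prefix of every word to its most frequent completions."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        for end in range(1, len(word) + 1):
            index[word[:end]].append((count, word))
    # Keep only the top_k completions per prefix, most frequent first
    return {
        prefix: [w for _, w in sorted(entries, key=lambda e: -e[0])[:top_k]]
        for prefix, entries in index.items()
    }

# Illustrative counts only (not taken from the dataset)
toy_counts = {"ごはん": 50, "ございます": 120, "ござる": 5, "ごめんなさい": 30}
toy_index = build_prefix_index(toy_counts)

print(toy_index["ご"])    # ['ございます', 'ごはん', 'ごめんなさい']
print(toy_index["ござ"])  # ['ございます', 'ござる']
```

Because each word contributes one entry per prefix, the index grows with total word length; capping the completions per prefix (20 in the notebook, 3 here) keeps the exported JSON small.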

In [None]:
import re
import regex
from collections import Counter, defaultdict

print("Building word vocabulary with emoji support...")
print("="*60)

# Emoji pattern (covers most common emojis)
EMOJI_PATTERN = regex.compile(r'[\p{Emoji_Presentation}\p{Extended_Pictographic}]')

def is_emoji(char):
    """Check if character is an emoji."""
    return bool(EMOJI_PATTERN.match(char))

def extract_emojis(text):
    """Extract all emojis from text."""
    return EMOJI_PATTERN.findall(text)

def segment_japanese(text):
    """Simple Japanese word segmentation with emoji support."""
    # Longer particles come first: regex alternation takes the first alternative that matches
    particles = r'(から|まで|より|です|ます|は|が|を|に|で|と|の|へ|や|も|か|ね|よ|わ|な|ら|し|て|た|だ)'
    
    # Split by punctuation and spaces (but NOT emojis!)
    segments = re.split(r'[。、！？\s\n・「」『』（）【】]', text)
    
    words = []
    for seg in segments:
        if not seg:
            continue
        
        # Extract emojis from this segment
        emojis_in_seg = extract_emojis(seg)
        
        # Remove emojis temporarily for word splitting
        text_only = EMOJI_PATTERN.sub('', seg)
        
        if text_only:
            if len(text_only) <= 6:
                words.append(text_only)
            else:
                parts = re.split(particles, text_only)
                words.extend([p for p in parts if p])
        
        # Add emojis as separate tokens (they follow the word)
        words.extend(emojis_in_seg)
    
    return [w for w in words if w and len(w) <= 20]

# Collect all words from output text
word_counts = Counter()
all_outputs = []

for item in dataset:
    output = item.get('output', '')
    if output:
        all_outputs.append(output)
        words = segment_japanese(output)
        word_counts.update(words)
    
    # Also use left_context
    context = item.get('left_context', '')
    if context:
        all_outputs.append(context)
        words = segment_japanese(context)
        word_counts.update(words)

print(f"✓ Found {len(word_counts):,} unique words/tokens (including emoji)")

# Count emojis found
emoji_count = sum(1 for w in word_counts if len(w) <= 2 and EMOJI_PATTERN.match(w))
print(f"✓ Found {emoji_count:,} unique emojis in dataset")

In [None]:
# Filter to valid Japanese words and emojis
def is_valid_japanese_word(word):
    """Check if word contains valid Japanese characters or emoji."""
    if not word or len(word) < 1:
        return False
    
    # Single emoji is valid
    if len(word) <= 2 and EMOJI_PATTERN.match(word):
        return True
    
    for char in word:
        code = ord(char)
        if not (0x3040 <= code <= 0x309F or  # Hiragana
                0x30A0 <= code <= 0x30FF or  # Katakana
                0x4E00 <= code <= 0x9FFF or  # Kanji
                0x3400 <= code <= 0x4DBF or  # CJK Extension
                is_emoji(char) or            # Emoji
                char in 'ー〜'):
            return False
    return True

# Get top words by frequency
valid_words = [(word, count) for word, count in word_counts.most_common()
               if is_valid_japanese_word(word)]

# Limit vocabulary
valid_words = valid_words[:VOCAB_SIZE_LIMIT - 4]  # Reserve 4 for special tokens

# Create word_to_index
word_to_index = {
    PAD_TOKEN: 0,
    UNK_TOKEN: 1,
    BOS_TOKEN: 2,
    EOS_TOKEN: 3
}

for idx, (word, count) in enumerate(valid_words, start=4):
    word_to_index[word] = idx

index_to_word = {idx: word for word, idx in word_to_index.items()}
vocab_size = len(word_to_index)

print(f"\n✓ Vocabulary size: {vocab_size:,}")
print(f"\nTop 20 words:")
for i, (word, count) in enumerate(valid_words[:20], 1):
    idx = word_to_index[word]
    emoji_mark = "📎" if EMOJI_PATTERN.match(word) else ""
    print(f"  {i:2d}. '{word}' {emoji_mark} (idx={idx}, count={count:,})")

# Show emojis in vocabulary
emojis_in_vocab = [(w, c) for w, c in valid_words if len(w) <= 2 and EMOJI_PATTERN.match(w)][:10]
print(f"\nTop emojis in vocab:")
for emoji, count in emojis_in_vocab:
    print(f"  {emoji} (count={count:,})")

In [None]:
# Build Prefix Index for smart suggestions
# Maps prefix → [word_indices sorted by frequency]

print("Building prefix index...")
print("="*60)

prefix_index = defaultdict(list)

# For each word, add to all its prefixes
for word, count in valid_words:
    idx = word_to_index[word]
    # Generate prefixes (min 1 char, max full word)
    for prefix_len in range(1, len(word) + 1):
        prefix = word[:prefix_len]
        # Store (count, idx) for sorting
        prefix_index[prefix].append((count, idx))

# Sort each prefix's words by frequency and limit to top 20
for prefix in prefix_index:
    prefix_index[prefix].sort(reverse=True)  # Higher count first
    prefix_index[prefix] = [idx for count, idx in prefix_index[prefix][:20]]

print(f"✓ Created prefix index with {len(prefix_index):,} prefixes")

# Test examples
test_prefixes = ['ご', 'ござ', 'ございま', 'あり', 'ありがと', 'おは', 'こんにち']
print("\nPrefix suggestions:")
for prefix in test_prefixes:
    if prefix in prefix_index:
        suggestions = [index_to_word[idx] for idx in prefix_index[prefix][:5]]
        print(f"  '{prefix}' → {suggestions}")
    else:
        print(f"  '{prefix}' → (no matches)")

## 4. Create Training Data

Create sequences for training:
- Input: Previous words in context
- Target: Next word to predict (can be emoji!)
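
The pairing scheme described above can be sketched in plain Python. This is an illustrative stand-in, assuming left padding with the `<PAD>` id 0 as in the notebook; `make_pairs` and the token ids are hypothetical, and the actual cell below uses `pad_sequences` instead.

```python
def make_pairs(seq, max_len, pad_id=0):
    """Return (left-padded context, next token) pairs for one sequence."""
    pairs = []
    for i in range(1, len(seq)):
        context = seq[:i][-max_len:]              # keep the most recent tokens
        padded = [pad_id] * (max_len - len(context)) + context
        pairs.append((padded, seq[i]))
    return pairs

# Hypothetical token ids standing in for a four-word sentence
for context, target in make_pairs([7, 12, 5, 9], max_len=3):
    print(context, "->", target)
# [0, 0, 7] -> 12
# [0, 7, 12] -> 5
# [7, 12, 5] -> 9
```

A sequence of n tokens yields n - 1 training pairs, which is why the number of pairs is much larger than the number of sequences.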

In [None]:
import tensorflow as tf
import numpy as np

print("Creating training sequences...")
print("="*60)

# Tokenize all outputs into word sequences
all_sequences = []

for output in all_outputs:
    words = segment_japanese(output)
    # Convert to indices
    seq = [word_to_index.get(w, 1) for w in words]  # 1 = UNK
    if len(seq) >= 2:  # Need at least input + target
        all_sequences.append(seq)

print(f"✓ Created {len(all_sequences):,} sequences")

# Create input-target pairs
# [w1, w2, w3, w4] → input: [w1,w2,w3], target: w4
X_data = []
y_data = []

for seq in all_sequences:
    for i in range(1, len(seq)):
        input_seq = seq[:i]
        target = seq[i]
        # Pad/truncate to SEQUENCE_LENGTH
        if len(input_seq) > SEQUENCE_LENGTH:
            input_seq = input_seq[-SEQUENCE_LENGTH:]
        X_data.append(input_seq)
        y_data.append(target)

print(f"✓ Created {len(X_data):,} training pairs")

# Pad sequences
X_padded = tf.keras.preprocessing.sequence.pad_sequences(
    X_data, maxlen=SEQUENCE_LENGTH, padding='pre'
)
y_array = np.array(y_data)

# Shuffle once with NumPy, then split 90/10 before building the datasets.
# (Splitting a reshuffled tf.data pipeline with take/skip would mix validation
# samples into training on later epochs.)
rng = np.random.default_rng(42)
perm = rng.permutation(len(X_padded))
X_padded, y_array = X_padded[perm], y_array[perm]

val_size = max(1, len(X_padded) // 10)
train_size = len(X_padded) - val_size

train_dataset = (
    tf.data.Dataset.from_tensor_slices((X_padded[:train_size], y_array[:train_size]))
    .shuffle(10000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)
val_dataset = (
    tf.data.Dataset.from_tensor_slices((X_padded[train_size:], y_array[train_size:]))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

print(f"✓ Train: {train_size:,} samples, Val: {val_size:,} samples")

## 5. Build & Train GRU Model

In [None]:
from tensorflow.keras import mixed_precision
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, GRU, Dense, Dropout
from tensorflow.keras.optimizers import AdamW

# Enable Mixed Precision
mixed_precision.set_global_policy('mixed_float16')

# Build model
inputs = Input(shape=(SEQUENCE_LENGTH,), name='input')
x = Embedding(vocab_size, EMBEDDING_DIM, name='embedding')(inputs)
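# Note: recurrent_dropout > 0 makes Keras fall back from the fused cuDNN GRU kernel, so GPU training is slower
x = GRU(GRU_UNITS, dropout=0.2, recurrent_dropout=0.2, name='gru')(x)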
x = Dropout(0.3, name='dropout')(x)
outputs = Dense(vocab_size, activation='softmax', dtype='float32', name='output')(x)

model = Model(inputs=inputs, outputs=outputs, name='gru_keyboard_japanese')

model.compile(
    optimizer=AdamW(learning_rate=1e-3, weight_decay=1e-4),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()
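
As a rough size check, the parameter count of this architecture can be worked out by hand. The sketch below assumes the vocabulary reaches the full 6,000 entries and that the GRU layer uses the Keras default `reset_after=True` (two bias vectors per gate); `gru_keyboard_params` is a hypothetical helper, and its result should be compared against `model.summary()` rather than treated as authoritative.

```python
def gru_keyboard_params(vocab_size=6000, embedding_dim=128, gru_units=256):
    """Estimate trainable parameters for Embedding -> GRU -> Dense(softmax)."""
    embedding = vocab_size * embedding_dim
    # 3 gates, each with an input kernel, a recurrent kernel and two biases
    gru = 3 * gru_units * (embedding_dim + gru_units + 2)
    dense = gru_units * vocab_size + vocab_size
    return {"embedding": embedding, "gru": gru, "dense": dense,
            "total": embedding + gru + dense}

counts = gru_keyboard_params()
print(counts)
print(f"float32 ~ {counts['total'] * 4 / 1024**2:.1f} MB, "
      f"float16 ~ {counts['total'] * 2 / 1024**2:.1f} MB")
```

Under these assumptions the total is about 2.6 million parameters, and the output layer alone accounts for more than half of them, so the vocabulary limit is the main lever on model size.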

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

callbacks = [
    ModelCheckpoint(
        f'{MODEL_DIR}/best_model.keras',
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    ),
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
]

history = model.fit(
    train_dataset,
    epochs=NUM_EPOCHS,
    validation_data=val_dataset,
    callbacks=callbacks,
    verbose=1
)

## 6. Visualize Training

In [None]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(history.history['loss'], label='Train')
ax1.plot(history.history['val_loss'], label='Val')
ax1.set_title('Loss')
ax1.legend()
ax1.grid(True)

ax2.plot(history.history['accuracy'], label='Train')
ax2.plot(history.history['val_accuracy'], label='Val')
ax2.set_title('Accuracy')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

print(f"\nFinal: Val Acc={history.history['val_accuracy'][-1]*100:.2f}%")

## 7. Save Model

In [None]:
import json

# Save Keras model
model.save(f'{MODEL_DIR}/gru_model.keras')

# Save word_to_index
with open(f'{MODEL_DIR}/word_to_index.json', 'w', encoding='utf-8') as f:
    json.dump(word_to_index, f, ensure_ascii=False, separators=(',', ':'))

# Save prefix_index
with open(f'{MODEL_DIR}/prefix_index.json', 'w', encoding='utf-8') as f:
    json.dump(dict(prefix_index), f, ensure_ascii=False, separators=(',', ':'))

# Save config
config = {
    'vocab_size': vocab_size,
    'sequence_length': SEQUENCE_LENGTH,
    'embedding_dim': EMBEDDING_DIM,
    'gru_units': GRU_UNITS,
    'language': 'japanese',
    'tokenization': 'word-level',
    'emoji_support': True,
    'special_tokens': {
        'PAD': 0, 'UNK': 1, 'BOS': 2, 'EOS': 3
    }
}
with open(f'{MODEL_DIR}/model_config.json', 'w') as f:
    json.dump(config, f, indent=2)

print("✓ Saved: gru_model.keras, word_to_index.json, prefix_index.json, model_config.json")

## 8. Export TFLite (Android)

In [None]:
import tensorflow as tf
import numpy as np
import time

print("Exporting TFLite models...")
print("="*60)

# Option 1: Standard TFLite with Flex ops
print("\n[1] Creating TFLite with Flex ops...")
try:
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS
    ]
    converter._experimental_lower_tensor_list_ops = False
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    
    tflite_model = converter.convert()
    tflite_path = f'{MODEL_DIR}/gru_model.tflite'
    with open(tflite_path, 'wb') as f:
        f.write(tflite_model)
    
    size_mb = len(tflite_model) / (1024 * 1024)
    print(f"   ✓ gru_model.tflite ({size_mb:.2f}MB)")
    
except Exception as e:
    print(f"   ✗ Error: {str(e)[:100]}")
    tflite_path = None

# Option 2: FP16 quantized (smaller)
print("\n[2] Creating FP16 quantized TFLite...")
try:
    converter_fp16 = tf.lite.TFLiteConverter.from_keras_model(model)
    converter_fp16.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS
    ]
    converter_fp16._experimental_lower_tensor_list_ops = False
    converter_fp16.optimizations = [tf.lite.Optimize.DEFAULT]
    converter_fp16.target_spec.supported_types = [tf.float16]
    
    tflite_fp16 = converter_fp16.convert()
    fp16_path = f'{MODEL_DIR}/gru_model_fp16.tflite'
    with open(fp16_path, 'wb') as f:
        f.write(tflite_fp16)
    
    size_mb = len(tflite_fp16) / (1024 * 1024)
    print(f"   ✓ gru_model_fp16.tflite ({size_mb:.2f}MB)")
    tflite_path = fp16_path
    
except Exception as e:
    print(f"   ✗ FP16 error: {str(e)[:100]}")

# Benchmark
print("\n[3] Running latency benchmark...")
if tflite_path:
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()[0]
    
    # Build one random input matching the interpreter's expected dtype
    test_input = np.random.randint(0, vocab_size, (1, SEQUENCE_LENGTH)).astype(input_details['dtype'])
    
    # Warm-up runs (not timed)
    for _ in range(10):
        interpreter.set_tensor(input_details['index'], test_input)
        interpreter.invoke()
    
    latencies = []
    for _ in range(50):
        start = time.perf_counter()
        interpreter.set_tensor(input_details['index'], test_input)
        interpreter.invoke()
        latencies.append((time.perf_counter() - start) * 1000)
    
    print(f"   ✓ Latency: avg={np.mean(latencies):.2f}ms, min={np.min(latencies):.2f}ms")

print("\nNOTE: Android needs TensorFlow Lite Flex delegate.")

## 9. Export CoreML Weights (iOS)

In [None]:
import numpy as np

print("Exporting weights for CoreML conversion...")
print("="*60)

# Export weights
weights_list = model.get_weights()
weights_path = f'{MODEL_DIR}/gru_weights.npz'
np.savez(weights_path, *weights_list)

print(f"✓ gru_weights.npz ({len(weights_list)} arrays)")
for i, w in enumerate(weights_list):
    print(f"   Weight {i}: {w.shape}")

print(f"\n→ Run on Mac: python scripts/convert_to_coreml.py")

## 10. Export Mobile Resources (+ Emoji)

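Export optimized data structures for iOS/Android:
- `word_to_index.json` - Word to index mapping (saved in section 7)
- `index_to_word.json` - Index to word mapping
- `prefix_index.json` - Prefix → word indices for smart suggestions (saved in section 7)
- `phrase_suggestions.json` - Word → most frequent next-word indices
- `emoji_suggestions.json` - Word → emoji associations (ありがとう → 🙏)
- `japanese_keyboard.json` - Kana row mapping (row head → remaining kana)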

In [None]:
import json
import os

print("Exporting mobile resources (with emoji)...")
print("="*60)

# 1. Export index_to_word
print("\n[1/5] Exporting index_to_word...")
path = f'{MODEL_DIR}/index_to_word.json'
i2w_str_keys = {str(k): v for k, v in index_to_word.items()}
with open(path, 'w', encoding='utf-8') as f:
    json.dump(i2w_str_keys, f, ensure_ascii=False, separators=(',', ':'))
size_kb = os.path.getsize(path) / 1024
print(f"   ✓ index_to_word.json ({len(index_to_word):,} words, {size_kb:.1f}KB)")

# 2. Verify prefix_index (already written in section 7)
print("\n[2/5] Verifying prefix_index...")
path = f'{MODEL_DIR}/prefix_index.json'
size_kb = os.path.getsize(path) / 1024
print(f"   ✓ prefix_index.json ({len(prefix_index):,} prefixes, {size_kb:.1f}KB)")

# 3. Build common phrase completions
print("\n[3/5] Building phrase suggestions...")
word_pairs = defaultdict(Counter)
for seq in all_sequences[:10000]:
    for i in range(len(seq) - 1):
        word_pairs[seq[i]][seq[i+1]] += 1

phrase_suggestions = {}
for word_idx, next_counts in word_pairs.items():
    if word_idx < 4:
        continue
    word = index_to_word.get(word_idx)
    if not word:
        continue
    top_next = next_counts.most_common(10)
    phrase_suggestions[word] = [next_idx for next_idx, count in top_next]

path = f'{MODEL_DIR}/phrase_suggestions.json'
with open(path, 'w', encoding='utf-8') as f:
    json.dump(phrase_suggestions, f, ensure_ascii=False, separators=(',', ':'))
size_kb = os.path.getsize(path) / 1024
print(f"   ✓ phrase_suggestions.json ({len(phrase_suggestions):,} words, {size_kb:.1f}KB)")

# 4. Build word → emoji associations
print("\n[4/5] Building word→emoji associations...")
word_emoji_map = defaultdict(Counter)
for seq in all_sequences[:10000]:
    for i in range(len(seq) - 1):
        word = index_to_word.get(seq[i])
        next_token = index_to_word.get(seq[i+1])
        if word and next_token and EMOJI_PATTERN.match(next_token):
            word_emoji_map[word][next_token] += 1

emoji_suggestions = {}
for word, emoji_counts in word_emoji_map.items():
    if emoji_counts:
        emoji_suggestions[word] = [emoji for emoji, count in emoji_counts.most_common(5)]

path = f'{MODEL_DIR}/emoji_suggestions.json'
with open(path, 'w', encoding='utf-8') as f:
    json.dump(emoji_suggestions, f, ensure_ascii=False, separators=(',', ':'))
size_kb = os.path.getsize(path) / 1024
print(f"   ✓ emoji_suggestions.json ({len(emoji_suggestions):,} word→emoji pairs, {size_kb:.1f}KB)")

# Show sample emoji associations
print("\n   Sample word→emoji:")
for word, emojis in list(emoji_suggestions.items())[:5]:
    print(f"     '{word}' → {emojis}")

# 5. Export Japanese keyboard layout
print("\n[5/5] Exporting keyboard layout...")
JAPANESE_KEYBOARD = {
    'あ': 'いうえお', 'か': 'きくけこ', 'さ': 'しすせそ',
    'た': 'ちつてと', 'な': 'にぬねの', 'は': 'ひふへほ',
    'ま': 'みむめも', 'や': 'ゆよ', 'ら': 'りるれろ',
    'わ': 'をんー', 'が': 'ぎぐげご', 'ざ': 'じずぜぞ',
    'だ': 'ぢづでど', 'ば': 'びぶべぼ', 'ぱ': 'ぴぷぺぽ'
}
path = f'{MODEL_DIR}/japanese_keyboard.json'
with open(path, 'w', encoding='utf-8') as f:
    json.dump(JAPANESE_KEYBOARD, f, ensure_ascii=False)
print("   ✓ japanese_keyboard.json")

print("\n" + "="*60)
print("EXPORT COMPLETE")
print("="*60)
print(f"\nFiles in {MODEL_DIR}/:")
for f in sorted(os.listdir(MODEL_DIR)):
    size = os.path.getsize(f'{MODEL_DIR}/{f}') / 1024
    print(f"   {f}: {size:.1f}KB")

## 11. Verification Test

Test the model with real Japanese input examples + emoji.

In [None]:
import json
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

print("="*60)
print("VERIFICATION TEST - Smart Word + Emoji Suggestions")
print("="*60)

# Load exported mappings
with open(f'{MODEL_DIR}/word_to_index.json', 'r', encoding='utf-8') as f:
    loaded_w2i = json.load(f)
with open(f'{MODEL_DIR}/index_to_word.json', 'r', encoding='utf-8') as f:
    loaded_i2w = {int(k): v for k, v in json.load(f).items()}
with open(f'{MODEL_DIR}/prefix_index.json', 'r', encoding='utf-8') as f:
    loaded_prefix = json.load(f)
with open(f'{MODEL_DIR}/emoji_suggestions.json', 'r', encoding='utf-8') as f:
    loaded_emoji = json.load(f)

def get_prefix_suggestions(prefix, top_k=5):
    """Get word suggestions for a prefix."""
    if prefix not in loaded_prefix:
        return []
    indices = loaded_prefix[prefix][:top_k]
    return [(loaded_i2w.get(idx, '?'), 100 / (i + 1)) for i, idx in enumerate(indices)]

def get_emoji_suggestions(word, top_k=5):
    """Get emoji suggestions for a word."""
    if word not in loaded_emoji:
        return []
    return loaded_emoji[word][:top_k]

def predict_next_word(context_words, top_k=5):
    """Predict next word using the model."""
    seq = [loaded_w2i.get(w, 1) for w in context_words]
    seq = pad_sequences([seq], maxlen=SEQUENCE_LENGTH, padding='pre')
    
    preds = model.predict(seq, verbose=0)[0]
    top_idx = np.argsort(preds)[-top_k:][::-1]
    
    results = []
    for idx in top_idx:
        if idx in loaded_i2w and idx >= 4:
            results.append((loaded_i2w[idx], preds[idx] * 100))
    return results

# Test 1: Prefix completion
print("\n📝 TEST 1: Prefix Completion")
print("-"*40)

prefix_tests = ['ご', 'ござ', 'ございま', 'あり', 'ありがと', 'おは', 'こんにち']
for prefix in prefix_tests:
    results = get_prefix_suggestions(prefix, top_k=5)
    print(f"  '{prefix}' → {[r[0] for r in results] if results else '(no matches)'}")

# Test 2: Word → Emoji suggestions
print("\n\n😊 TEST 2: Word → Emoji Suggestions")
print("-"*40)

emoji_tests = ['ありがとう', 'おめでとう', 'かわいい', 'たのしい', '笑']
for word in emoji_tests:
    emojis = get_emoji_suggestions(word)
    print(f"  '{word}' → {emojis if emojis else '(no emoji associations)'}")

# Test 3: Combined flow (prefix → word → emoji)
print("\n\n🔄 TEST 3: Complete Flow (Prefix → Word → Emoji)")
print("-"*40)

flow_tests = ['ありが', 'おめでと', 'かわい']
for prefix in flow_tests:
    print(f"\n  User types: '{prefix}'")
    words = get_prefix_suggestions(prefix, top_k=3)
    if words:
        word = words[0][0]
        print(f"  → Prefix suggestions: {[w[0] for w in words]}")
        print(f"  → User selects: '{word}'")
        emojis = get_emoji_suggestions(word)
        if emojis:
            print(f"  → Emoji suggestions: {emojis}")
        else:
            print(f"  → No emoji suggestions")

print("\n" + "="*60)
print("✅ VERIFICATION COMPLETE")
print("   - Prefix index provides instant word completion")
print("   - Emoji suggestions from dataset associations")
print("   - Complete flow: prefix → word → emoji")
print("="*60)

## Usage Guide

### Mobile Integration Flow

```
User types: "ありが"
        ↓
1. Check prefix_index.json
   → ["ありがとう", "ありがたい"...]
        ↓
2. User selects "ありがとう"
        ↓
3. Check emoji_suggestions.json
   → ["🙏", "😊"...]
   Check phrase_suggestions.json
   → ["ございます", "ね"...]
        ↓
4. Show combined: [🙏, ございます, 😊, ね]
```
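
Step 4 of the flow above can be sketched as a simple interleave of the two lists. The helper name `combine_suggestions`, the `limit` parameter, and the sample data are assumptions for illustration. Note that `phrase_suggestions.json` as exported in section 10 stores word indices, so they are decoded through an index-to-word mapping (with integer keys, as in `loaded_i2w`) first.

```python
from itertools import zip_longest

def combine_suggestions(emojis, phrase_indices, index_to_word, limit=4):
    """Interleave emoji and next-word suggestions, emoji first."""
    # Decode indices, dropping special tokens (0-3) and unknown ids
    phrases = [index_to_word[i] for i in phrase_indices
               if i >= 4 and i in index_to_word]
    combined = []
    for emoji, phrase in zip_longest(emojis, phrases):
        for item in (emoji, phrase):
            if item is not None and item not in combined:
                combined.append(item)
    return combined[:limit]

# Hypothetical lookups for the selected word "ありがとう"
index_to_word = {4: "ございます", 5: "ね"}
print(combine_suggestions(["🙏", "😊"], [4, 1, 5], index_to_word))
# ['🙏', 'ございます', '😊', 'ね']
```

Interleaving is only one possible ranking; a client could equally weight the two sources by their stored order or by the model's predicted probabilities.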

### Files for Mobile
- `prefix_index.json` - Fast prefix completion
- `emoji_suggestions.json` - Word → emoji associations
- `phrase_suggestions.json` - Word → next word predictions
- `gru_model.tflite` - ML model for context-aware predictions
- `word_to_index.json` - Vocabulary for tokenization
- `index_to_word.json` - Decode predictions to words