# 🌍 Part 3: Multilingual Model Comparison

Not all embedding models support all languages! In this notebook, we'll explore what happens when you use:
- An **English-only model** with Thai text
- A **multilingual model** with Thai text

## What You'll Learn:
1. Why language support matters for embeddings
2. What happens when you use an unsupported language
3. How to choose the right model for multilingual RAG

## 📦 Install Dependencies

In [None]:
# Install required packages (run this in Colab)
!pip install sentence-transformers plotly seaborn scikit-learn -q

## 📚 Import Libraries

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

---
## 📝 Define Bilingual Word List (English + Thai)

We'll use the **same words** in both English and Thai to see how models handle them.

In [None]:
# ============================================================
# 🔧 BILINGUAL WORD LIST (English + Thai)
# ============================================================

bilingual_words = {
    # Fruits
    "🍎 Fruits (EN)": ["apple", "banana", "orange", "mango", "grape"],
    "🍎 ผลไม้ (TH)": ["แอปเปิ้ล", "กล้วย", "ส้ม", "มะม่วง", "องุ่น"],

    # Animals
    "🐾 Animals (EN)": ["dog", "cat", "elephant", "bird", "fish"],
    "🐾 สัตว์ (TH)": ["สุนัข", "แมว", "ช้าง", "นก", "ปลา"],

    # Colors
    "🎨 Colors (EN)": ["red", "blue", "green", "yellow", "white"],
    "🎨 สี (TH)": ["สีแดง", "สีน้ำเงิน", "สีเขียว", "สีเหลือง", "สีขาว"],

    # Food
    "🍜 Food (EN)": ["rice", "noodle", "chicken", "soup", "salad"],
    "🍜 อาหาร (TH)": ["ข้าว", "ก๋วยเตี๋ยว", "ไก่", "ซุป", "สลัด"]
}

# Flatten
words = []
categories = []
languages = []

for category, word_list in bilingual_words.items():
    words.extend(word_list)
    categories.extend([category] * len(word_list))
    lang = "Thai" if "(TH)" in category else "English"
    languages.extend([lang] * len(word_list))

print(f"📊 Total words: {len(words)}")
print(f"🇬🇧 English words: {languages.count('English')}")
print(f"🇹🇭 Thai words: {languages.count('Thai')}")
print("\n📝 Sample words:")
for cat, word_list in list(bilingual_words.items())[:4]:
    print(f"   {cat}: {word_list[:3]}...")
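
Because each English list and its Thai counterpart are written in the same order, translation pairs can be derived from the dictionary rather than typed by hand. The sketch below assumes that convention (an `(EN)` key immediately followed by its `(TH)` key, both lists index-aligned). `build_translation_pairs` is a hypothetical helper, not a library function, and it is shown on a two-category sample.

```python
def build_translation_pairs(word_groups):
    """Pair each English word with the Thai word at the same position.

    Assumes every "(EN)" list is immediately followed by its "(TH)"
    counterpart and that both lists are in the same order.
    """
    pairs = []
    items = list(word_groups.items())
    # Walk the dictionary two entries at a time: (EN list, TH list).
    for (en_key, en_words), (th_key, th_words) in zip(items[0::2], items[1::2]):
        if "(EN)" not in en_key or "(TH)" not in th_key:
            raise ValueError(f"Expected an EN/TH pair, got {en_key!r} and {th_key!r}")
        if len(en_words) != len(th_words):
            raise ValueError(f"Length mismatch between {en_key!r} and {th_key!r}")
        pairs.extend(zip(en_words, th_words))
    return pairs


# Small sample in the same shape as the notebook's word dictionary.
sample_groups = {
    "Fruits (EN)": ["apple", "banana"],
    "Fruits (TH)": ["แอปเปิ้ล", "กล้วย"],
    "Animals (EN)": ["dog", "cat"],
    "Animals (TH)": ["สุนัข", "แมว"],
}

sample_pairs = build_translation_pairs(sample_groups)
for en_word, th_word in sample_pairs:
    print(f"{en_word} -> {th_word}")
```

Applied to `bilingual_words`, the same call would yield all 20 pairs, provided the lists stay aligned when you edit them.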

---
## 🤖 Define Models to Compare

We'll compare:
1. **English-only model** - Does NOT support Thai
2. **Multilingual model** - Supports 50+ languages including Thai

In [None]:
# ============================================================
# 🔧 MODELS TO COMPARE
# ============================================================

models_config = {
    "all-MiniLM-L6-v2": {
        "description": "English-only model (22M params)",
        "supports_thai": False,
        "languages": "English only",
        "icon": "🇬🇧"
    },
    "paraphrase-multilingual-MiniLM-L12-v2": {
        "description": "Multilingual model (118M params, 50+ languages)",
        "supports_thai": True,
        "languages": "50+ languages including Thai",
        "icon": "🌍"
    }
}

print("📋 Models to compare:")
print("=" * 70)
for name, config in models_config.items():
    icon = config['icon']
    thai_support = "✅ Yes" if config['supports_thai'] else "❌ No"
    print(f"\n{icon} {name}")
    print(f"   {config['description']}")
    print(f"   Thai Support: {thai_support}")
    print(f"   Languages: {config['languages']}")

## 🔄 Load Models & Generate Embeddings

In [None]:
# Store results
model_embeddings = {}
loaded_models = {}

for model_name in models_config.keys():
    config = models_config[model_name]
    print(f"\n{'='*60}")
    print(f"{config['icon']} Loading: {model_name}")
    print(f"{'='*60}")
    
    model = SentenceTransformer(model_name)
    loaded_models[model_name] = model
    
    print(f"   Embedding dimension: {model.get_sentence_embedding_dimension()}")
    print(f"   Generating embeddings...")
    
    embeddings = model.encode(words, show_progress_bar=True)
    model_embeddings[model_name] = embeddings
    
    print(f"   ✅ Done! Shape: {embeddings.shape}")

print(f"\n\n🎉 All models loaded!")

---
## 🌐 3D Visualization: English vs Thai Clustering

**Key Question:** Do English and Thai translations of the same concept cluster together?

In [None]:
# Apply t-SNE to each model
tsne_results = {}
perplexity = min(30, len(words) - 1)

for model_name, embeddings in model_embeddings.items():
    print(f"🔄 Applying t-SNE for {model_name}...")
    
    # The iteration count is left at scikit-learn's default (1000). The keyword
    # was renamed from n_iter to max_iter in scikit-learn 1.5, so passing either
    # name explicitly fails on some installed versions.
    tsne = TSNE(
        n_components=3,
        perplexity=perplexity,
        random_state=42,
        learning_rate='auto',
        init='pca'
    )
    
    embeddings_3d = tsne.fit_transform(embeddings)
    tsne_results[model_name] = embeddings_3d
    print(f"   ✅ Done!")

print("\n🎉 All t-SNE transformations complete!")

In [None]:
# Create 3D plots for each model - colored by LANGUAGE
for model_name, embeddings_3d in tsne_results.items():
    config = models_config[model_name]
    
    df = pd.DataFrame({
        'word': words,
        'category': categories,
        'language': languages,
        'x': embeddings_3d[:, 0],
        'y': embeddings_3d[:, 1],
        'z': embeddings_3d[:, 2]
    })
    
    thai_support = "✅ Supports Thai" if config['supports_thai'] else "❌ Does NOT support Thai"
    
    fig = px.scatter_3d(
        df,
        x='x', y='y', z='z',
        color='language',
        symbol='language',
        text='word',
        title=f"{config['icon']} {model_name}<br><sub>{thai_support}</sub>",
        labels={'x': 't-SNE 1', 'y': 't-SNE 2', 'z': 't-SNE 3'},
        height=600,
        color_discrete_map={'English': '#3498db', 'Thai': '#e74c3c'}
    )
    
    fig.update_traces(
        marker=dict(size=10, line=dict(width=1, color='white')),
        textposition='top center',
        textfont=dict(size=9)
    )
    
    fig.update_layout(
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=-0.15,
            xanchor="center",
            x=0.5
        ),
        margin=dict(l=0, r=0, b=100, t=80)
    )
    
    fig.show()
    print(f"\n{'─'*60}\n")

### 🔍 What Do You Notice?

**English-only model (all-MiniLM-L6-v2):**
- Thai words are likely clustered randomly or separately
- No semantic understanding of Thai text
- "แอปเปิ้ล" (apple) won't be near "apple"

**Multilingual model (paraphrase-multilingual-MiniLM-L12-v2):**
- Thai and English translations should cluster together!
- "แอปเปิ้ล" (apple) should be near "apple"
- Cross-lingual semantic understanding
---
## 🎯 Cross-Language Similarity Test

The key test: Do translations have high similarity?
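
"Similarity" here means cosine similarity: the dot product of two vectors divided by the product of their lengths, so it depends on direction and not on magnitude. Here is a dependency-free sketch using made-up 3-dimensional vectors (the embeddings from both models in this notebook have 384 dimensions):

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # unrelated
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # opposed

print(f"same direction: {same_direction:.3f}")
print(f"orthogonal:     {orthogonal:.3f}")
print(f"opposite:       {opposite:.3f}")
```

Scores range from -1 to 1, so the 0.7 and 0.4 cut-offs used in this notebook are rules of thumb for this comparison rather than universal thresholds.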

In [None]:
# Define translation pairs to test
translation_pairs = [
    ("apple", "แอปเปิ้ล"),
    ("banana", "กล้วย"),
    ("dog", "สุนัข"),
    ("cat", "แมว"),
    ("elephant", "ช้าง"),
    ("red", "สีแดง"),
    ("blue", "สีน้ำเงิน"),
    ("rice", "ข้าว"),
    ("chicken", "ไก่"),
]

print("🎯 CROSS-LANGUAGE SIMILARITY TEST")
print("="*80)
print("\nDo English-Thai translation pairs have high similarity?\n")

# Calculate similarity for each model
similarity_matrices = {}
for model_name, embeddings in model_embeddings.items():
    similarity_matrices[model_name] = cosine_similarity(embeddings)

# Header
print(f"{'English':<12} {'Thai':<12}", end="")
for model_name in models_config.keys():
    short_name = model_name.split('-')[0][:8]
    print(f"{short_name:^20}", end="")
print("\n" + "-"*80)

# Compare each pair
model_avg_scores = {name: [] for name in models_config.keys()}

for en_word, th_word in translation_pairs:
    if en_word in words and th_word in words:
        idx_en = words.index(en_word)
        idx_th = words.index(th_word)
        
        print(f"{en_word:<12} {th_word:<12}", end="")
        
        for model_name in models_config.keys():
            sim = similarity_matrices[model_name][idx_en, idx_th]
            model_avg_scores[model_name].append(sim)
            
            # Color code
            if sim > 0.7:
                indicator = "🟢"
            elif sim > 0.4:
                indicator = "🟡"
            else:
                indicator = "🔴"
            
            print(f"{indicator} {sim:.3f}            ", end="")
        print()

# Summary
print("\n" + "="*80)
print("📊 AVERAGE CROSS-LANGUAGE SIMILARITY:")
print("-"*80)

for model_name in models_config.keys():
    config = models_config[model_name]
    avg = np.mean(model_avg_scores[model_name])
    
    if avg > 0.6:
        grade = "✅ EXCELLENT - Model understands both languages!"
    elif avg > 0.3:
        grade = "🟡 PARTIAL - Some cross-language understanding"
    else:
        grade = "❌ POOR - Model doesn't understand Thai properly"
    
    print(f"\n{config['icon']} {model_name}")
    print(f"   Average: {avg:.3f}")
    print(f"   Result: {grade}")
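
An average score does not say whether the translation is the *closest* Thai word, which is what retrieval depends on. One way to check this is a top-1 test: for each English word, is its most similar Thai word the correct translation? The sketch below assumes a square similarity matrix in the same row order as `words`. `translation_top1_accuracy` is a hypothetical helper, and both 4×4 matrices are invented.

```python
def translation_top1_accuracy(sim_matrix, words, languages, pairs):
    """Fraction of English words whose most similar Thai word is the translation."""
    thai_indices = [i for i, lang in enumerate(languages) if lang == "Thai"]
    hits = 0
    for en_word, th_word in pairs:
        en_idx = words.index(en_word)
        # Thai candidate with the highest similarity to the English word.
        best_thai = max(thai_indices, key=lambda j: sim_matrix[en_idx][j])
        hits += words[best_thai] == th_word
    return hits / len(pairs)


# Toy data, invented for illustration: two English and two Thai words.
toy_words = ["dog", "cat", "สุนัข", "แมว"]
toy_languages = ["English", "English", "Thai", "Thai"]
toy_pairs = [("dog", "สุนัข"), ("cat", "แมว")]

# Similarity matrix of a hypothetical model that aligns translations.
aligned = [
    [1.0, 0.6, 0.9, 0.5],
    [0.6, 1.0, 0.5, 0.8],
    [0.9, 0.5, 1.0, 0.6],
    [0.5, 0.8, 0.6, 1.0],
]
# Similarity matrix of a hypothetical model that ranks "แมว" first for both.
misaligned = [
    [1.0, 0.6, 0.1, 0.3],
    [0.6, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.3, 0.9, 1.0],
]

acc_aligned = translation_top1_accuracy(aligned, toy_words, toy_languages, toy_pairs)
acc_misaligned = translation_top1_accuracy(misaligned, toy_words, toy_languages, toy_pairs)
print(f"aligned:    {acc_aligned:.2f}")
print(f"misaligned: {acc_misaligned:.2f}")
```

With the notebook's variables, the call would be `translation_top1_accuracy(similarity_matrices[model_name], words, languages, translation_pairs)`.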

---
## 🔥 Side-by-Side Heatmaps

Let's compare the full similarity matrices.

In [None]:
# Create heatmaps for each model
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

for ax, (model_name, sim_matrix) in zip(axes, similarity_matrices.items()):
    config = models_config[model_name]
    
    sns.heatmap(
        sim_matrix,
        xticklabels=words,
        yticklabels=words,
        cmap='RdYlBu_r',
        vmin=0,
        vmax=1,
        ax=ax,
        cbar_kws={'shrink': 0.8}
    )
    
    thai_support = "✅ Thai" if config['supports_thai'] else "❌ No Thai"
    ax.set_title(f"{config['icon']} {model_name}\n{thai_support}", fontsize=11)
    ax.tick_params(axis='x', rotation=90, labelsize=7)
    ax.tick_params(axis='y', rotation=0, labelsize=7)

plt.suptitle('Similarity Comparison: English-Only vs Multilingual Model', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### 🔍 Reading the Heatmaps

Look at the **cross-language blocks** (where English words meet Thai words):

**English-only model:**
- Low similarity (blue) between English and Thai translations
- Thai words might have random similarities

**Multilingual model:**
- High similarity (red) between translations
- "apple" and "แอปเปิ้ล" should show bright red
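
The cross-language blocks can also be summarised as numbers instead of colours. This is a minimal sketch, assuming a square similarity matrix whose rows follow the same order as a `languages` list. `block_means` is a hypothetical helper, and the 4×4 matrix is invented to mimic the pattern an English-only model might produce.

```python
def block_means(sim_matrix, languages):
    """Mean similarity for EN-EN, TH-TH and EN-TH pairs (self-pairs excluded)."""
    totals = {
        "English-English": [0.0, 0],
        "Thai-Thai": [0.0, 0],
        "English-Thai": [0.0, 0],
    }
    n = len(languages)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each pair counted once
            if languages[i] == languages[j]:
                key = f"{languages[i]}-{languages[i]}"
            else:
                key = "English-Thai"
            totals[key][0] += sim_matrix[i][j]
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items() if count}


# Invented matrix: rows/columns are two English words, then two Thai words.
toy_languages = ["English", "English", "Thai", "Thai"]
toy_matrix = [
    [1.0, 0.5, 0.1, 0.2],
    [0.5, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

means = block_means(toy_matrix, toy_languages)
for block, value in means.items():
    print(f"{block:<16} {value:.3f}")
```

A high Thai-Thai mean combined with a low English-Thai mean is the numeric counterpart of a model grouping Thai words by script rather than by meaning.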

---
## 🧪 What Happens with Unsupported Languages?

Let's investigate what the English-only model "sees" when given Thai text.

In [None]:
print("🔬 INVESTIGATING: What does an English-only model see in Thai?")
print("="*70)

# Get the English-only model
english_model = loaded_models["all-MiniLM-L6-v2"]
english_sim = similarity_matrices["all-MiniLM-L6-v2"]

# Find what Thai words are most similar to
thai_words_idx = [i for i, lang in enumerate(languages) if lang == "Thai"]

print("\nFor each Thai word, what is it most similar to?")
print("-"*70)

for thai_idx in thai_words_idx[:8]:  # Show first 8
    thai_word = words[thai_idx]
    
    # Get similarities to all other words
    similarities = english_sim[thai_idx].copy()
    similarities[thai_idx] = -1  # Exclude self
    
    # Find top 3 most similar
    top_indices = np.argsort(similarities)[-3:][::-1]
    
    print(f"\n🇹🇭 {thai_word}")
    print(f"   Most similar to:")
    for idx in top_indices:
        sim = similarities[idx]
        similar_word = words[idx]
        lang_icon = "🇬🇧" if languages[idx] == "English" else "🇹🇭"
        print(f"      {lang_icon} {similar_word}: {sim:.3f}")

print("\n" + "="*70)
print("\n💡 INSIGHT: English-only models often cluster Thai words based on")
print("   character patterns, not meaning. Thai words may cluster with")
print("   other Thai words or random English words!")

---
## 📊 Summary: Model Language Support Comparison

In [None]:
# Create summary visualization
summary_data = []

for model_name in models_config.keys():
    config = models_config[model_name]
    avg_cross_lang = np.mean(model_avg_scores[model_name])
    
    summary_data.append({
        'Model': model_name,
        'Thai Support': 'Yes' if config['supports_thai'] else 'No',
        'Cross-Language Similarity': avg_cross_lang
    })

summary_df = pd.DataFrame(summary_data)

# Bar chart
fig = px.bar(
    summary_df,
    x='Model',
    y='Cross-Language Similarity',
    color='Thai Support',
    title='📊 Cross-Language Understanding: English↔Thai Translation Similarity',
    color_discrete_map={'Yes': '#2ecc71', 'No': '#e74c3c'},
    height=400
)

fig.add_hline(y=0.7, line_dash="dash", line_color="green", 
              annotation_text="Good threshold (0.7)")
fig.add_hline(y=0.4, line_dash="dash", line_color="orange",
              annotation_text="Poor threshold (0.4)")

fig.update_layout(yaxis_range=[0, 1])
fig.show()

---
## 🎓 Key Takeaways

### ❌ What happens with UNSUPPORTED languages:
- The tokenizer splits the text into unknown tokens or fragments that carry little learned meaning
- Similar meanings DON'T get similar embeddings
- "apple" and "แอปเปิ้ล" are treated as unrelated
- RAG retrieval will FAIL for non-English queries!
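
The first bullet can be illustrated with a toy subword tokenizer. The sketch below implements greedy longest-match splitting in the style of WordPiece over a tiny, hypothetical English-only vocabulary. The real `all-MiniLM-L6-v2` tokenizer has a far larger vocabulary and may treat individual Thai characters differently, so this shows the mechanism, not that model's exact output.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split, in the style of WordPiece.

    If any part of the word cannot be matched, the whole word becomes unk_token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical English-only vocabulary, far smaller than a real one.
toy_vocab = {"apple", "app", "##le", "ban", "##ana", "dog", "cat"}

for word in ["apple", "banana", "แอปเปิ้ล", "กล้วย"]:
    print(f"{word!r} -> {wordpiece_tokenize(word, toy_vocab)}")
```

In this toy setup both Thai words collapse to the same unknown token, so the model receives identical input for "apple" and "banana" in Thai and cannot tell them apart.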

### ✅ What happens with SUPPORTED languages:
- Cross-language semantic understanding
- Translations have HIGH similarity scores
- RAG can match Thai queries to English documents (and vice versa)
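
Matching a Thai query to English documents comes down to ranking document vectors by their similarity to the query vector. The sketch below assumes every vector is already L2-normalised, so a plain dot product equals cosine similarity. `top_k` is a hypothetical helper, and the 2-D vectors are invented stand-ins for real embeddings.

```python
def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) for the k highest dot-product scores.

    Assumes every vector is already L2-normalised, so the dot product
    equals cosine similarity.
    """
    scores = [
        (idx, sum(q * d for q, d in zip(query_vec, doc_vec)))
        for idx, doc_vec in enumerate(doc_vecs)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Invented 2-D unit vectors standing in for real embeddings.
documents = [
    "Dogs are loyal pets.",
    "Bananas are rich in potassium.",
    "Cats sleep a lot.",
]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
query = "สุนัข"            # Thai for "dog"
query_vec = [0.96, 0.28]   # pretend output of a multilingual encoder

results = top_k(query_vec, doc_vecs, k=2)
print(f"Query: {query}")
for idx, score in results:
    print(f"{score:.2f}  {documents[idx]}")
```

In sentence-transformers, calling `model.encode(texts, normalize_embeddings=True)` is one way to obtain unit-length vectors that satisfy this assumption.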

### 🎯 For Thai RAG Applications:
| Use Case | Recommended Model |
|----------|------------------|
| Thai-only content | `paraphrase-multilingual-MiniLM-L12-v2` |
| Thai + English mixed | `paraphrase-multilingual-MiniLM-L12-v2` |
| English-only content | `all-MiniLM-L6-v2` (faster) |
| High quality multilingual | `paraphrase-multilingual-mpnet-base-v2` |

### 📚 More Multilingual Models to Try:
- `distiluse-base-multilingual-cased-v2` (50+ languages)
- `paraphrase-multilingual-mpnet-base-v2` (50+ languages, higher quality)
- `LaBSE` (109 languages, by Google)

---
## 🧪 Try Your Own Languages!

Modify the code below to test other languages you work with.

In [None]:
# 🔧 TEST YOUR OWN TRANSLATIONS
# Add your own language pairs here!

custom_translations = [
    ("hello", "สวัสดี"),      # Thai
    ("thank you", "ขอบคุณ"),  # Thai
    ("good morning", "อรุณสวัสดิ์"),  # Thai
    # Add more pairs here!
]

print("🧪 Testing custom translation pairs:")
print("="*60)

multilingual_model = loaded_models["paraphrase-multilingual-MiniLM-L12-v2"]
english_model = loaded_models["all-MiniLM-L6-v2"]

for en, th in custom_translations:
    # Get embeddings
    en_emb_multi = multilingual_model.encode([en])
    th_emb_multi = multilingual_model.encode([th])
    
    en_emb_eng = english_model.encode([en])
    th_emb_eng = english_model.encode([th])
    
    # Calculate similarity
    sim_multi = cosine_similarity(en_emb_multi, th_emb_multi)[0][0]
    sim_eng = cosine_similarity(en_emb_eng, th_emb_eng)[0][0]
    
    print(f"\n'{en}' ↔ '{th}'")
    print(f"   🌍 Multilingual model: {sim_multi:.3f}")
    print(f"   🇬🇧 English-only model: {sim_eng:.3f}")