# Session 1: Foundations of Large Language Models 🤖

<div align="center">


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/Session1_Foundations_of_Large_Language_Models.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

</div>

---

**Core Concepts:**
- **LLM Architecture** - Understand transformer models and attention mechanisms
- **Tokenization** - How models process and understand text across languages
- **Text Representation** - Embeddings, vectors, and semantic similarity
- **Model Comparison** - Analyze different LLM architectures and capabilities
- **Low-Resource Considerations** - Challenges with underrepresented languages

**Practical Skills:**
- Compare tokenization across different models
- Analyze model behavior with multilingual text
- Implement basic text processing pipelines
- Evaluate model performance on various languages
- Build foundation for advanced NLP applications

**Why This Matters:** Understanding LLM fundamentals is crucial for effective use in real-world applications, especially when working with diverse languages and limited computational resources.


## Course Context

| Session | Focus | Techniques | Prerequisites |
|---------|-------|------------|---------------|
| **Session 0** | Setup & Orientation | Environment, Basic Concepts | None |
| **→ This Session** | **LLM Foundations** | **Tokenization, Embeddings, Model Analysis** | **Session 0** |
| **Session 2** | Prompt Engineering | Advanced Prompting, Chain-of-Thought | Sessions 0-1 |
| **Session 3** | Fine-tuning | LoRA, QLoRA, Custom Training | Sessions 0-2 |
| **Session 4** | Bias & Ethics | Fairness, Evaluation, Mitigation | Sessions 0-3 |


## 🛠️ Environment Setup

### What This Section Does
This section prepares your coding environment with all necessary libraries for exploring Large Language Model foundations. We'll install packages optimized for **interactive learning** - educational, efficient, and GPU-optional!

### Why These Specific Packages?

**Core Dependencies:**
- `numpy` + `pandas`: Essential for data manipulation and analysis
- `scikit-learn`: Similarity metrics and basic ML utilities
- `matplotlib`: Visualization of model behaviors and comparisons

**LLM Ecosystem:**
- `transformers`: Access to pretrained models and tokenizers
- `sentence-transformers`: Semantic embeddings and similarity
- `torch`: PyTorch backend for model operations

In [None]:
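# Quick setup for this session
# (sentencepiece is needed to load SentencePiece-based tokenizers such as mT5's;
#  torch and numpy are pulled in as dependencies of the packages below)
!pip install -q transformers sentence-transformers sentencepiece scikit-learn matplotlib pandas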

In [None]:
# Core imports for LLM foundations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import torch

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ Environment ready for LLM foundations exploration!")

# Chapter 1: Understanding Tokenization

## What We'll Explore

Tokenization is how models convert text into numbers they can process. Let's see how this works with different languages and models.
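
To make "text into numbers" concrete, here is a minimal sketch of greedy longest-match subword splitting, the idea behind WordPiece-style tokenizers. The vocabulary `TOY_VOCAB` and the helper `toy_tokenize` are hypothetical and exist only for illustration; real tokenizers learn vocabularies with tens of thousands of pieces from data.

```python
# Toy vocabulary (hypothetical): maps subword pieces to integer IDs.
# "##" marks a piece that continues a word, following the WordPiece convention.
TOY_VOCAB = {
    "[UNK]": 0, "student": 1, "##s": 2, "learn": 3, "##ing": 4,
    "model": 5, "are": 6, "about": 7,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into subword pieces by greedy longest-match lookup."""
    pieces = []
    for word in text.lower().split():
        start = 0
        word_pieces = []
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:
                # No piece fits: the whole word becomes the unknown token.
                word_pieces = ["[UNK]"]
                break
            word_pieces.append(match)
            start = end
        pieces.extend(word_pieces)
    return pieces


tokens = toy_tokenize("Students are learning about models")
token_ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)      # subword pieces
print(token_ids)   # the integer IDs a model would actually receive
```

In this toy setup, a word the vocabulary does not cover (for example the Luxembourgish "Studenten") falls back to `[UNK]`, which hints at why vocabulary coverage matters for low-resource languages.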

### Step 1: Prepare Test Sentences

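**Model Selection:** Later in this chapter we compare tokenizers from several public models on the [Hugging Face Hub](https://huggingface.co/models), including two popular multilingual encoders:

- **BERT** (Google): Bidirectional Encoder Representations from Transformers, one of the first successful transformer models; we use its multilingual checkpoint `bert-base-multilingual-cased`
- **XLM-RoBERTa** (Facebook): Cross-lingual Language Model based on RoBERTa, specifically designed for multilingual tasks; we use `xlm-roberta-base`

These checkpoint names are the identifiers used to download the models from the Hugging Face Hub. The comparison also includes `google/mt5-small` and `gpt2` to cover more tokenization algorithms.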

In [None]:
"""
Expanded Multilingual Test Corpus for LLM Analysis

📊 CORPUS DESIGN:
This corpus contains 5 semantic clusters across 3 languages (15 sentences total):
- English: Germanic, high-resource language
- Luxembourgish: Germanic, low-resource language
- French: Romance, high-resource language

🎯 DOMAINS COVERED:
1. Medical: Doctor-patient communication & treatment planning
2. Daily Life: Weather and environmental descriptions
3. Technology: Digital communication and tools
4. Education: Learning and academic contexts

🔬 RESEARCH QUESTIONS:
1. Semantic Clustering: Do equivalent meanings cluster together in embedding space?
2. Language Separation: Do languages form distinct clusters despite shared meanings?
3. Domain Effects: Do different domains create separable clusters?
4. Resource Impact: Does low-resource Luxembourgish show different patterns?

📈 PCA VISUALIZATION EXPECTATIONS:
- **Semantic Clusters**: Sentences with same meaning should be close
- **Language Patterns**: Each language might form sub-clusters
- **Domain Separation**: Medical vs. daily life vs. tech might separate
- **Quality Assessment**: Tight cross-lingual clusters = good multilingual model

💡 PRACTICAL INSIGHTS:
This expanded corpus will reveal:
- How well the model handles cross-lingual semantic equivalence
- Whether domain affects multilingual performance
- Resource availability impact on embedding quality
- Model's ability to generalize across typologically related languages

Note: Perfect multilingual models would show tight semantic clusters with
      mixed languages, not language-separated clusters.
"""

# Multilingual test corpus - Expanded for better PCA visualization
test_sentences = {
    # Medical Domain - Set 1: Doctor-Patient Communication
    "English_medical_1": "The doctor explains the diagnosis carefully to the patient.",
    "Luxembourgish_medical_1": "Den Dokter erkläert d'Diagnos ganz roueg dem Patient.",
    "French_medical_1": "Le médecin explique le diagnostic avec soin au patient.",
    
    # Medical Domain - Set 2: Treatment Planning  
    "English_medical_2": "We need to schedule your surgery for next week.",
    "Luxembourgish_medical_2": "Mir musse är Operatioun fir nächst Woch plangen.",
    "French_medical_2": "Nous devons programmer votre chirurgie pour la semaine prochaine.",
    
    # Daily Life Domain - Set 3: Weather/Environment
    "English_daily_1": "It's raining heavily outside today.",
    "Luxembourgish_daily_1": "Et gitt haut schwéier drausser.",
    "French_daily_1": "Il pleut beaucoup dehors aujourd'hui.",
    
    # Technology Domain - Set 4: Digital Communication
    "English_tech_1": "Please send me the email with the important documents.",
    "Luxembourgish_tech_1": "Schéckt mir wéi gelift d'Email mat de wichtege Dokumenter.",
    "French_tech_1": "Veuillez m'envoyer l'email avec les documents importants.",
    
    # Education Domain - Set 5: Learning Context
    "English_edu_1": "The students are learning new concepts in mathematics.",
    "Luxembourgish_edu_1": "D'Studenten léieren nei Konzepter an der Mathematik.",
    "French_edu_1": "Les étudiants apprennent de nouveaux concepts en mathématiques."
}

# Display corpus for verification
print("=" * 80)
print("EXPANDED MULTILINGUAL TEST CORPUS (15 sentences across 5 semantic groups)")
print("=" * 80)

# Group sentences by semantic meaning for better display
semantic_groups = {
    "Medical Communication": ["medical_1"],
    "Treatment Planning": ["medical_2"], 
    "Weather Description": ["daily_1"],
    "Digital Communication": ["tech_1"],
    "Educational Context": ["edu_1"]
}

for group_name, identifiers in semantic_groups.items():
    print(f"\n🔹 {group_name.upper()}:")
    print("-" * 50)
    
    for identifier in identifiers:
        for lang in ["English", "Luxembourgish", "French"]:
            key = f"{lang}_{identifier}"
            if key in test_sentences:
                sentence = test_sentences[key]
                word_count = len(sentence.split())
                char_count = len(sentence)
                print(f"  {lang:12} ({word_count:2d}w, {char_count:3d}c): {sentence}")

print(f"\n" + "=" * 80)
print(f"📊 CORPUS STATISTICS:")
print(f"   • Total sentences: {len(test_sentences)}")
print(f"   • Languages: 3 (English, Luxembourgish, French)")
print(f"   • Semantic groups: {len(semantic_groups)}")
print(f"   • PCA plot will show {len(test_sentences)} data points")
print(f"   • Expected clusters: Semantic groups should cluster across languages")
print("=" * 80)

### Step 2: Compare Tokenization Across Models

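## 🔄 Alternative: Public Models Only

If you want to skip gated-model setup and run immediately, uncomment the alternative below. The main comparison cell that follows already defaults to the same public models, so this step is optional.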

In [None]:
# ============================================================================
# 🚀 QUICK START: PUBLIC MODELS ONLY (No authentication needed)
# ============================================================================

"""
🎯 ALTERNATIVE APPROACH: Use this if you want to start immediately without gated models

To use this instead:
1. Comment out the models_to_compare list in the next cell
2. Uncomment the code below
3. Run immediately - no authentication required!
"""

# 🔓 UNCOMMENT FOR PUBLIC-ONLY DEMO:
# from transformers import AutoTokenizer
# 
# # Public models only - no authentication required
# models_to_compare = [
#     "bert-base-multilingual-cased",        # WordPiece, multilingual
#     "xlm-roberta-base",                    # SentencePiece, multilingual  
#     "google/mt5-small",                    # SentencePiece, multilingual encoder-decoder
#     "gpt2",                                # BPE, English-focused
# ]
# 
# print("🔓 PUBLIC MODELS SELECTED (No authentication needed):")
# for i, model in enumerate(models_to_compare, 1):
#     print(f"   {i}. {model}")
# print("\n✅ Ready to run immediately!")

print("💡 Choose one approach: Gated models (more advanced) or Public models (immediate start)")

In [None]:
# ============================================================================
# COMPREHENSIVE TOKENIZATION COMPARISON ACROSS MODEL ARCHITECTURES
# ============================================================================
# These models represent different tokenization algorithms and training approaches:

from transformers import AutoTokenizer

# ============================================================================
# 🔐 OPTIONAL: COLAB AUTHENTICATION FOR GATED MODELS
# ============================================================================

"""
💡 TO USE GATED MODELS IN COLAB (completely optional):

1. Visit https://huggingface.co/google/gemma-2-2b-it, click "Request Access",
   and wait for approval
2. Create a token with "Read" permissions at https://huggingface.co/settings/tokens
3. Uncomment the authentication code below and run this cell
4. Enter your personal Hugging Face token in the prompt that appears
5. Add "google/gemma-2-2b-it" to the models_to_compare list below
"""

# from huggingface_hub import notebook_login
# notebook_login()  # Prompts for your personal Hugging Face token

# ⚠️ NOTE: This is optional! The tutorial is complete with public models only.

# Model selection - PUBLIC MODELS ONLY (safe for shared notebooks)
models_to_compare = [
    "bert-base-multilingual-cased",        # WordPiece, multilingual
    "xlm-roberta-base",                    # SentencePiece, multilingual
    "google/mt5-small",                    # SentencePiece, multilingual encoder-decoder (PUBLIC)
    "gpt2",                                # BPE, English-focused (shows language bias)
]

# 🚨 VERIFICATION: All models above are PUBLIC and require NO authentication
print("✅ USING PUBLIC MODELS ONLY:")
print("   ❌ NO gated models (like Gemma) in this list")
print("   ✅ All models work without HuggingFace tokens")
print("   ✅ Safe for shared notebooks and classroom use")

# Double-check by name: flag model families that are commonly gated
gated_models = [m for m in models_to_compare if 'gemma' in m.lower() or 'llama' in m.lower()]
if gated_models:
    print(f"⚠️  WARNING: Found gated models: {gated_models}")
    print("   → Remove these from the list above")
else:
    print("   🎯 CONFIRMED: Only public models detected")

# ⭐ FOR ADVANCED STUDENTS (Optional - requires individual setup):
# If you have gated model access and want to compare cutting-edge models:
# 1. Add: "google/gemma-2-2b-it" to the list above
# 2. Uncomment the authentication code earlier in this cell
# 3. This is optional - the tutorial works perfectly with public models only!

print("📋 MODELS SELECTED FOR COMPARISON:")
print(f"   🔓 Public models: {len(models_to_compare)} models (no authentication needed)")
print()
for i, model in enumerate(models_to_compare, 1):
    # Determine tokenization algorithm for educational value.
    # Check "xlm"/"mt5" first: "xlm-roberta-base" also contains the substring "bert".
    name = model.lower()
    if "xlm" in name or "mt5" in name:
        algorithm = "SentencePiece"
    elif "bert" in name:
        algorithm = "WordPiece"
    elif "gpt" in name:
        algorithm = "BPE"
    else:
        algorithm = "Various"

    print(f"   {i}. {model}")
    print(f"      → Algorithm: {algorithm}")

print(f"\n✅ READY TO RUN:")
print(f"   • All models are publicly available")
print(f"   • No authentication required")
print(f"   • Tutorial covers multiple tokenization algorithms")
print(f"   • Students can run immediately in any environment")

# ============================================================================
# 📚 EXPANDED TOKENIZATION TEST CORPUS
# ============================================================================

"""
🎯 MULTI-DOMAIN SENTENCE PAIRS FOR COMPREHENSIVE ANALYSIS:

This expanded corpus tests tokenization across different domains and linguistic structures:
- Academic (original): Technical terminology
- Medical: Specialized vocabulary  
- Daily Life: Common conversational language
- Technology: Modern digital terminology
- Business: Professional/commercial language

Each pair is semantically equivalent but may reveal different tokenization patterns
due to domain-specific vocabulary, morphological complexity, and training data availability.
"""

# Test sentence pairs (English ↔ Luxembourgish)
test_sentence_pairs = [
    # Academic Domain (original)
    {
        "domain": "Academic", 
        "en": "Students are learning about large language models.",
        "lb": "D'Studenten léieren iwwer grouss Sproochmodeller.",
        "concept": "Educational technology"
    },
    
    # Medical Domain
    {
        "domain": "Medical",
        "en": "The doctor carefully examines the patient's symptoms.",
        "lb": "Den Dokter ënnersicht ganz virsiichteg d'Symptomer vum Patient.",
        "concept": "Healthcare interaction"
    },
    
    # Daily Life Domain
    {
        "domain": "Daily Life", 
        "en": "Today the weather is beautiful and sunny.",
        "lb": "Haut ass d'Wieder schéin a sonneg.",
        "concept": "Weather description"
    },
    
    # Technology Domain
    {
        "domain": "Technology",
        "en": "The smartphone application works perfectly on all devices.", 
        "lb": "D'Smartphone-App funktionéiert perfekt op all Apparater.",
        "concept": "Digital technology"
    },
    
    # Business Domain
    {
        "domain": "Business",
        "en": "The company develops innovative solutions for customers.",
        "lb": "D'Firma entwéckelt innovativ Léisungen fir d'Clienten.",
        "concept": "Commercial activity"
    }
]

print("📊 EXPANDED TOKENIZATION TEST CORPUS")
print("=" * 60)
print(f"📈 Analysis Scope:")
print(f"   • {len(test_sentence_pairs)} sentence pairs")
print(f"   • {len(set(p['domain'] for p in test_sentence_pairs))} different domains")
print(f"   • English (high-resource) ↔ Luxembourgish (low-resource)")
print(f"   • Tests domain-specific vocabulary effects")

print(f"\n📝 SENTENCE PAIRS BY DOMAIN:")
for pair in test_sentence_pairs:
    print(f"\n🔹 {pair['domain'].upper()} ({pair['concept']}):")
    print(f"   EN: {pair['en']}")  
    print(f"   LB: {pair['lb']}")
    print(f"   Words: EN={len(pair['en'].split())} | LB={len(pair['lb'].split())}")

# For backward compatibility with existing code, keep original variables
text_en = test_sentence_pairs[0]["en"]  # Academic example as default
text_lr = test_sentence_pairs[0]["lb"]

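# Cache tokenizers so each model is loaded once instead of on every call
_tokenizer_cache = {}

def show_tokenization(model_name, text):
    if model_name not in _tokenizer_cache:
        _tokenizer_cache[model_name] = AutoTokenizer.from_pretrained(model_name)
    tokenizer = _tokenizer_cache[model_name]
    tokens = tokenizer.tokenize(text)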
    print(f"\nModel: {model_name}")
    print("Text :", text)
    print("Tokens:", tokens)
    print("Number of tokens:", len(tokens))
    
    # Return data for DataFrame creation
    return {
        'model': model_name.split('/')[-1],  # Short name
        'text': text,
        'num_tokens': len(tokens),
        'num_words': len(text.split()),
        'tokens_per_word': len(tokens) / len(text.split()) if text.split() else 0,
        'tokens_preview': tokens[:5]  # First 5 tokens for reference
    }

# Collect results for analysis
df_results = []

# ============================================================================
# 🔍 COMPREHENSIVE TOKENIZATION ANALYSIS ACROSS DOMAINS
# ============================================================================

print("🚀 RUNNING COMPREHENSIVE TOKENIZATION ANALYSIS")
print("=" * 70)
print(f"📊 Processing {len(test_sentence_pairs)} sentence pairs across {len(set(p['domain'] for p in test_sentence_pairs))} domains")
print(f"🤖 Testing {len(models_to_compare)} different model architectures")
print("=" * 70)

# Process each sentence pair across all models
for pair_idx, sentence_pair in enumerate(test_sentence_pairs):
    domain = sentence_pair['domain']
    concept = sentence_pair['concept']
    
    print(f"\n🔹 DOMAIN {pair_idx+1}: {domain.upper()} ({concept})")
    print("-" * 50)
    
    # Analyze English sentence
    print("🇬🇧 ENGLISH:")
    print(f"   Text: {sentence_pair['en']}")
    for model_name in models_to_compare:
        result = show_tokenization(model_name, sentence_pair['en'])
        result['language'] = 'English'
        result['domain'] = domain  
        result['concept'] = concept
        result['sentence_pair_id'] = pair_idx
        df_results.append(result)
    
    # Analyze Luxembourgish sentence
    print(f"\n🇱🇺 LUXEMBOURGISH:")
    print(f"   Text: {sentence_pair['lb']}")
    for model_name in models_to_compare:
        result = show_tokenization(model_name, sentence_pair['lb'])
        result['language'] = 'Luxembourgish'
        result['domain'] = domain
        result['concept'] = concept  
        result['sentence_pair_id'] = pair_idx
        df_results.append(result)

print(f"\n🎯 ANALYSIS COMPLETE!")
print("=" * 70)

# Convert to pandas DataFrame for comprehensive analysis
import pandas as pd
df_results = pd.DataFrame(df_results)

print(f"\n📊 COMPREHENSIVE RESULTS SUMMARY")
print("=" * 60)
print(f"✅ Total entries collected: {len(df_results)}")
print(f"📋 DataFrame columns: {list(df_results.columns)}")
print(f"🌍 Languages analyzed: {list(df_results['language'].unique())}")
print(f"🏢 Domains covered: {list(df_results['domain'].unique())}")
print(f"🤖 Models tested: {list(df_results['model'].unique())}")

# Quick domain-based analysis preview
print(f"\n🔍 DOMAIN-BASED TOKENIZATION EFFICIENCY PREVIEW:")
print("-" * 50)
for domain in df_results['domain'].unique():
    domain_data = df_results[df_results['domain'] == domain]
    avg_tokens_per_word = domain_data['tokens_per_word'].mean()
    print(f"   {domain:12}: Avg {avg_tokens_per_word:.2f} tokens/word")

print(f"\n💡 INSIGHTS AVAILABLE:")
print(f"   • Cross-domain tokenization efficiency comparison")
print(f"   • Language-specific challenges by domain")
print(f"   • Model architecture performance across contexts")
print(f"   • Resource availability impact (EN vs LB)")
print("=" * 60)


### 🔍 Inspecting Tokenizer Types Programmatically

Sometimes you need to determine which tokenization algorithm a model uses (WordPiece, BPE, Unigram via SentencePiece, etc.). There is no universal flag, but you can inspect the tokenizer programmatically:

**Why This Matters:**
- Different algorithms handle subwords differently
- Understanding the algorithm helps predict tokenization behavior
- Important for debugging and optimization
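
One way to go a step beyond the class name is to look at the backend of a "fast" tokenizer. The sketch below assumes that fast tokenizers expose a `backend_tokenizer` whose `model` attribute is an object named after the algorithm (for example `BPE`, `Unigram`, or `WordPiece`); `describe_tokenizer` is a hypothetical helper, and the stand-in classes exist only so the example runs without downloading a model.

```python
def describe_tokenizer(tok):
    """Summarize a tokenizer object without assuming it is a fast tokenizer."""
    backend = getattr(tok, "backend_tokenizer", None)
    model = getattr(backend, "model", None)
    return {
        "tokenizer_class": type(tok).__name__,
        "is_fast": backend is not None,
        "algorithm": type(model).__name__ if model is not None else "unknown",
    }


# Stand-in objects (hypothetical) that mimic the attribute layout described above.
class Unigram:
    pass


class FakeBackend:
    model = Unigram()


class FakeFastTokenizer:
    backend_tokenizer = FakeBackend()


class FakeSlowTokenizer:
    pass


print(describe_tokenizer(FakeFastTokenizer()))
print(describe_tokenizer(FakeSlowTokenizer()))
```

With a real tokenizer the call would be `describe_tokenizer(AutoTokenizer.from_pretrained("xlm-roberta-base"))`; what it reports depends on the installed library versions, so treat the result as a hint and not as ground truth.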

In [None]:
# ============================================================================
# TOKENIZER INTROSPECTION: Understanding Algorithm Types
# ============================================================================

from transformers import AutoTokenizer

model_name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(model_name)

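print("Tokenizer class:", tok.__class__.__name__)
# Fast tokenizers wrap a `tokenizers` backend; the class of its model names the algorithm
backend = getattr(tok, "backend_tokenizer", None)
print("Backend model:", type(backend.model).__name__ if backend is not None else "n/a (slow tokenizer)")
print("Special tokens:", tok.special_tokens_map)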



### 🎯 Key Takeaways: Tokenization Algorithms in Practice

**Understanding these differences helps you:**

1. **Choose the Right Model**: 
   - Need to handle many languages? → Multilingual SentencePiece models (XLM-RoBERTa, mT5)
   - Working primarily with English? → WordPiece or BPE might be sufficient
   - Need fast inference? → Compare token counts on your own text; fewer tokens means shorter sequences

2. **Predict Performance**:
   - Tokenizers trained on multilingual corpora (XLM-RoBERTa, mT5) typically split low-resource text into fewer pieces than English-centric ones
   - Vocabulary coverage often matters more than the algorithm itself: a tokenizer fragments languages it saw little of during training
   - GPT-2's byte-level BPE never produces unknown tokens, but non-English text usually costs more tokens

3. **Debug Issues**:
   - Unexpected tokenization? Check the algorithm type
   - High token counts? The tokenizer's vocabulary might not be suited for your language
   - Special token conflicts? Inspect the special_tokens_map
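
As a small illustration of why byte-level tokenizers can need more tokens for non-English text: byte-level BPE starts from UTF-8 bytes, and accented characters occupy more than one byte before any merges apply. The helper `utf8_profile` and the sample phrases below are illustrative assumptions, not part of the tutorial's pipeline.

```python
def utf8_profile(text):
    """Return (characters, UTF-8 bytes, non-ASCII characters) for a string."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    n_non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return n_chars, n_bytes, n_non_ascii


# Short sample phrases in the three corpus languages (chosen for illustration).
samples = {
    "English": "Students are learning",
    "Luxembourgish": "D'Studenten léieren",
    "French": "Les étudiants apprennent",
}

for lang, text in samples.items():
    chars, n_bytes, non_ascii = utf8_profile(text)
    print(f"{lang:14} chars={chars:2d} bytes={n_bytes:2d} non-ASCII={non_ascii}")
```

The byte count is only a starting point: learned merges shrink it considerably for text that resembles the training data, and much less for text that does not.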

**Next**: Let's see how these tokenization differences affect semantic representations...

In [None]:
# ============================================================================
# 🔧 DEPENDENCY FIX: Calculate PCA Coordinates First
# ============================================================================

print("🔧 Computing coords_2d so later visualization cells do not raise a NameError")

# Import required modules
from sklearn.decomposition import PCA
import numpy as np

# PCA needs the sentence embeddings that Chapter 2 normally creates.
# Compute them here if this cell runs before that chapter.
if "embeddings" not in globals():
    embedder = SentenceTransformer(
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device=device
    )
    embeddings = embedder.encode(list(test_sentences.values()), convert_to_numpy=True)

# Apply PCA to create the 2D coordinates needed for visualization
print("📐 Computing PCA coordinates...")
pca = PCA(n_components=2, random_state=42)
coords_2d = pca.fit_transform(embeddings)

# Verify PCA results
explained_var = pca.explained_variance_ratio_
print(f"✅ coords_2d is now defined!")
print(f"   • Shape: {coords_2d.shape}")
print(f"   • Variance retained: {sum(explained_var)*100:.1f}%")
print(f"   • Ready for visualization in next cell")

### 🤔 Reflection Questions

Look at the results above and consider:

- Which language uses more tokens per word?
- How might more tokens affect inference cost and speed?
- Do you see any unusual token splits (broken words, weird subwords)?

**Key Insight:** Languages that are underrepresented in a tokenizer's training data are often split into more subword tokens per word, which lengthens sequences and therefore raises inference cost and latency.
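
The tokens-per-word ratio behind this insight can be sketched without loading any model. In the example below, `tokens_per_word` accepts any callable that maps text to a list of tokens (such as a Hugging Face tokenizer's `tokenize` method), while `toy_chunk_tokenizer` is a hypothetical stand-in that keeps a few "known" English words whole and chops everything else into three-character chunks.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def toy_chunk_tokenizer(text, known_words=frozenset({"students", "are", "learning"}), chunk=3):
    """Hypothetical tokenizer: known words stay whole, others split into chunks."""
    tokens = []
    for word in text.lower().split():
        if word in known_words:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return tokens


text_en_demo = "Students are learning"
text_lb_demo = "D'Studenten léieren"

en_ratio = tokens_per_word(toy_chunk_tokenizer, text_en_demo)
lb_ratio = tokens_per_word(toy_chunk_tokenizer, text_lb_demo)

print(f"EN tokens/word: {en_ratio:.2f}")
print(f"LB tokens/word: {lb_ratio:.2f}")
print(f"Relative sequence length (LB / EN): {lb_ratio / en_ratio:.2f}x")
```

The toy numbers exaggerate the effect, but the same ratio computed from the `tokens_per_word` column of the results table gives the real comparison for each model.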

# 📊 Chapter 2: Text Embeddings & Semantic Similarity

## Understanding Vector Representations

**What are embeddings?** Dense numeric vectors that place text in a high-dimensional space so that texts with similar meanings lie close together.

Let's see how a multilingual sentence-embedding model creates these representations!
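
Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors (real sentence embeddings have hundreds of dimensions), and `cosine_similarity_matrix` is a hypothetical helper written with plain NumPy.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # scale every row to unit length
    return unit @ unit.T            # dot products of unit vectors = cosines


# Made-up toy "embeddings": two weather sentences and one medical sentence.
labels = ["English_weather", "French_weather", "English_medical"]
toy_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.3],
])

sims = cosine_similarity_matrix(toy_embeddings)
for label, row in zip(labels, np.round(sims, 2)):
    print(f"{label:16} {row}")
```

Applied to real data, the same function could take the array returned by `embedder.encode(...)`; a good multilingual model should give translations of the same sentence a higher similarity than unrelated sentences.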

In [None]:
# Load a multilingual sentence embedding model
embedder_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedder = SentenceTransformer(embedder_name, device=device)

print(f"📊 Loaded embedding model: {embedder_name}")

# Get embeddings for our test sentences
sentences = list(test_sentences.values())
languages = list(test_sentences.keys())

embeddings = embedder.encode(sentences, convert_to_numpy=True)
print(f"✅ Created embeddings with shape: {embeddings.shape}")
print(f"   Each sentence → {embeddings.shape[1]} dimensional vector")

In [None]:
# ============================================================================
# 🚨 QUICK FIX: Visualization for Expanded Corpus (Fixed IndexError)
# ============================================================================

print(f"🔧 FIXING VISUALIZATION FOR {len(test_sentences)} SENTENCES")
print("=" * 60)

# Create the visualization with dynamic color generation
plt.figure(figsize=(12, 8))

# Import required modules for color generation
import matplotlib.cm as cm
import numpy as np

# Generate colors dynamically based on number of sentences
n_sentences = len(test_sentences)
sentence_keys = list(test_sentences.keys())

print(f"📊 Visualizing {n_sentences} sentence embeddings")
print(f"🎨 Generating {n_sentences} distinct colors automatically")

# Create color array that matches number of sentences
# (tab20 offers 20 distinct colors; Set3 has only 12, so 15 points would share colors)
colors = cm.tab20(np.linspace(0, 1, n_sentences))

# Plot each sentence with generated colors
for i, sentence_key in enumerate(sentence_keys):
    # Use generated color
    color = colors[i]
    
    # Create short labels for better readability
    if '_' in sentence_key and len(sentence_key.split('_')) >= 2:
        parts = sentence_key.split('_')
        short_label = f"{parts[0][:2]}-{parts[1][:3]}"  # "En-med", "Fr-dai", etc.
    else:
        short_label = sentence_key[:8]  # Truncate long keys
    
    # Plot the point
    plt.scatter(coords_2d[i, 0], coords_2d[i, 1], 
               c=[color], s=100, alpha=0.8, 
               edgecolor='black', linewidth=0.5)
    
    # Add text annotation
    plt.annotate(short_label, (coords_2d[i, 0], coords_2d[i, 1]), 
                xytext=(5, 5), textcoords='offset points', 
                fontsize=8, alpha=0.9, fontweight='bold')

# Customize plot
plt.title("Sentence Embeddings: Expanded Multilingual Corpus\n15 Sentences Across 5 Semantic Groups and 3 Languages", 
         fontsize=12, pad=15)
plt.xlabel("Principal Component 1 (Largest Variation Direction)")
plt.ylabel("Principal Component 2 (Second Largest Variation Direction)")
plt.grid(True, alpha=0.3)

# Add explanation box
textstr = "🔍 Look for:\n• Semantic clusters (same meanings group)\n• Language mixing within clusters\n• Domain-based patterns"
props = dict(boxstyle='round', facecolor='lightblue', alpha=0.8)
plt.text(0.02, 0.98, textstr, transform=plt.gca().transAxes, fontsize=9,
        verticalalignment='top', bbox=props)

plt.tight_layout()
plt.show()

print("✅ VISUALIZATION COMPLETE!")
print("=" * 60)
print("💡 ANALYSIS GUIDE:")
print("   🎯 GOOD multilingual model: Mixed languages in semantic clusters")
print("   ⚠️  CONCERNING: Languages separated regardless of meaning")
print("   📊 Position = Semantic similarity (distance = meaning difference)")
print("   🌍 Cross-lingual success = Same concepts cluster across languages")

## Applying PCA for Visualization

This section applies Principal Component Analysis to reduce high-dimensional embeddings to 2D coordinates for visualization while analyzing variance retention.

## 📊 Understanding PCA (Principal Component Analysis)

**🤔 The Problem:** Our embeddings are 384-dimensional vectors - impossible to visualize directly!

**🎯 The Solution:** PCA reduces high-dimensional data to 2D while preserving the most important relationships.

### 📚 How PCA Works:

1. **Find Principal Components**: Directions in the data with maximum variance
2. **Project Data**: Transform original data onto these new axes
3. **Keep Top Components**: Use only the first 2 components for 2D visualization
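
The three steps above can be sketched from scratch with NumPy. This is a minimal illustration under stated assumptions: `pca_2d` is a hypothetical helper, and the random matrix stands in for real embeddings. The notebook itself uses scikit-learn's `PCA`, which may flip the sign of a component relative to this sketch.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                 # PCA works on centered data
    # Step 1: the rows of Vt are the principal directions, ordered by variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Steps 2 and 3: project onto the top two directions only.
    coords = X_centered @ Vt[:2].T
    # Share of total variance captured by each kept component.
    explained_ratio = (S ** 2 / np.sum(S ** 2))[:2]
    return coords, explained_ratio


# Random stand-in for 15 sentence embeddings with 384 dimensions each.
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(15, 384))

coords, ratio = pca_2d(toy_embeddings)
print("2D coordinates shape:", coords.shape)
print("Explained variance ratio:", np.round(ratio, 3))
```

On random data the two components explain only a small share of the variance; on real embeddings with semantic structure the share is typically higher, which is what the variance printout in the PCA cell reports.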

### 💡 Key Insights:

- **Component 1**: Captures the most variation in the data
- **Component 2**: Captures the second most variation  
- **Relationship Preservation**: Similar sentences should stay close even after reduction
- **Information Loss**: We lose some information, but keep the most important patterns

### 🎯 Why This Matters:

- Allows us to **visualize** high-dimensional embeddings
- Helps us **understand** if similar meanings cluster together across languages
- **Quality check** for our multilingual model performance

In [None]:
# ============================================================================
# 🔬 APPLYING PCA FOR VISUALIZATION
# ============================================================================

print("📊 PCA ANALYSIS")
print("=" * 40)
print(f"📐 Original embedding dimensions: {embeddings.shape[1]}")
print(f"🎯 Reducing to: 2 dimensions for plotting")
print(f"⚡ Method: Principal Component Analysis")

# Apply PCA reduction
pca = PCA(n_components=2, random_state=42)
coords_2d = pca.fit_transform(embeddings)

# Analyze the results
explained_var = pca.explained_variance_ratio_
print(f"\n📊 VARIANCE EXPLANATION:")
print(f"   • Component 1: {explained_var[0]*100:.1f}% of original variance")
print(f"   • Component 2: {explained_var[1]*100:.1f}% of original variance")
print(f"   • Total retained: {sum(explained_var)*100:.1f}% of original variance")

print(f"\n💡 INTERPRETATION:")
if sum(explained_var) > 0.7:
    print(f"   ✅ Great! We retained most of the important patterns")
elif sum(explained_var) > 0.5:
    print(f"   ⚠️  Decent retention - visualization should be meaningful")
else:
    print(f"   🔴 Low retention - visualization may not show all relationships")

print(f"\n🎯 COORDINATES READY FOR PLOTTING:")
print(f"   Shape: {coords_2d.shape} (each sentence → x,y coordinates)")

## 🔧 CORRECTED Visualization for Expanded Corpus

**⚠️ IMPORTANT NOTE**: This cell avoids the `IndexError` raised by visualization cells that use hardcoded color arrays.

**What was wrong?** Those cells use a fixed 5-color array (`['red', 'blue', 'green', 'orange', 'purple']`), but the expanded corpus has 15 sentences, so `colors[i]` fails from the sixth sentence onward.

**How do we fix it?** Generate colors dynamically from a matplotlib colormap so that the color array always has one entry per sentence.

**🎯 Use this cell instead** of any problematic visualization cells you encounter.
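
An alternative to one color per sentence is to color by semantic group and vary the marker by language, which makes cross-lingual clusters easier to spot. The sketch below is one possible design and not the notebook's own approach; `plot_styles`, `PALETTE`, and `MARKERS` are hypothetical names, and the keys follow the `Language_group` naming used in `test_sentences`.

```python
def plot_styles(sentence_keys, palette, markers):
    """Assign one color per semantic group and one marker per language."""
    styles = {}
    group_colors = {}
    for key in sentence_keys:
        # "English_medical_1" -> language "English", group "medical_1"
        language, group = key.split("_", 1)
        if group not in group_colors:
            group_colors[group] = palette[len(group_colors) % len(palette)]
        styles[key] = {
            "color": group_colors[group],
            "marker": markers.get(language, "o"),
        }
    return styles


PALETTE = ["tab:red", "tab:blue", "tab:green", "tab:orange", "tab:purple"]
MARKERS = {"English": "o", "Luxembourgish": "s", "French": "^"}

groups = ["medical_1", "medical_2", "daily_1", "tech_1", "edu_1"]
keys = [f"{lang}_{group}" for group in groups for lang in MARKERS]

styles = plot_styles(keys, PALETTE, MARKERS)
for key in keys[:3]:
    print(key, styles[key])
```

Each entry can then be passed to matplotlib, for example `plt.scatter(x, y, color=style["color"], marker=style["marker"])`, so five colors are enough for fifteen points.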

In [None]:
# ============================================================================
# 🎨 FIXED VISUALIZATION: Dynamic Colors for Expanded Corpus
# ============================================================================

# Create visualization that handles any number of sentences (fixes IndexError)
plt.figure(figsize=(12, 9))

# Import colormap modules for dynamic color generation
import matplotlib.cm as cm
import numpy as np

# Generate sufficient colors for expanded test corpus
n_sentences = len(test_sentences)
sentence_keys = list(test_sentences.keys())
colors_array = cm.tab20(np.linspace(0, 1, n_sentences))  # tab20: 20 distinct colors (Set3 has only 12)

print(f"🎨 Creating visualization for {n_sentences} sentences with dynamic colors")

# Plot each sentence with proper color handling
for i, sentence_key in enumerate(sentence_keys):
    # Use dynamic color - no more IndexError!
    color = colors_array[i]
    
    # Create meaningful short labels from sentence keys
    if '_' in sentence_key:
        parts = sentence_key.split('_')
        short_label = f"{parts[0][:2]}-{parts[1][:3]}"  # "En-med", "Lu-dai", etc.
    else:
        short_label = sentence_key[:6]
    
    # Plot the point
    plt.scatter(coords_2d[i, 0], coords_2d[i, 1], 
               c=[color], s=120, alpha=0.7, 
               edgecolor='black', linewidth=0.4)
    
    # Add readable annotation
    plt.annotate(short_label, (coords_2d[i, 0], coords_2d[i, 1]), 
                xytext=(7, 7), textcoords='offset points', 
                fontsize=8, fontweight='bold')

plt.title("Multilingual Sentence Embeddings in 2D Space\nExpanded Corpus: 5 Domains × 3 Languages", 
         fontsize=12, pad=15)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Add interpretation guide as text box
guide_text = ("🔍 ANALYSIS GUIDE:\n"
             "• Closer points = more similar meaning\n" 
             "• Good: Same meanings cluster\n"
             "• Concerning: Languages separate")

plt.text(0.02, 0.98, guide_text, transform=plt.gca().transAxes,
         bbox=dict(boxstyle="round,pad=0.4", facecolor="lightyellow", alpha=0.8),
         fontsize=9, verticalalignment='top')

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✅ VISUALIZATION COMPLETE - IndexError fixed with dynamic colors!")

print(f"\n💡 KEY OBSERVATIONS TO LOOK FOR:")
print(f"   📏 Distance = Semantic similarity (closer = more similar meaning)")
print(f"   🎯 Good model: Mixed languages within semantic clusters")
print(f"   ⚠️  Poor model: Languages separated regardless of meaning")
print(f"   🌍 Cross-lingual success: Same concepts group across languages")
print(f"   🏢 Domain effects: Professional vs casual language patterns")

In [None]:
# ============================================================================
# 🎨 VISUALIZE PCA RESULTS
# ============================================================================

# Create visualization (using coords_2d from previous cell)
plt.figure(figsize=(10, 8))

# Cycle through the base colors so corpora with more than 5 sentences
# do not raise an IndexError
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, (lang, sentence) in enumerate(test_sentences.items()):
    plt.scatter(coords_2d[i, 0], coords_2d[i, 1], 
               c=colors[i % len(colors)], s=200, alpha=0.7, label=lang)
    plt.annotate(lang, (coords_2d[i, 0], coords_2d[i, 1]), 
                xytext=(10, 10), textcoords='offset points', fontsize=12)

plt.title("Sentence Embeddings in 2D Space\n(All sentences have similar meaning)", fontsize=14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Key Observation: Similar-meaning sentences in different languages should cluster together!")

print(f"\n🔬 PCA VISUALIZATION ANALYSIS:")
print(f"   📏 What distance means: Closer points = more similar semantic meaning")
print(f"   🎯 What to look for: Translations of the same sentence sitting close together despite different words")
print(f"   ⚖️  What variance tells us: Higher explained variance = the 2D plot preserves more of the original structure")
print(f"   🌍 Cross-lingual success: Different languages expressing same meaning should be near each other")

### 📊 Understanding Similarity Values: What Do The Numbers Mean?

The cosine similarity values in the similarity matrix tell us how semantically similar two sentences are. Here's how to interpret them:

**📐 Cosine Similarity Scale:** In general, cosine similarity ranges from -1.0 to 1.0, but sentence embeddings of natural text mostly land between 0.0 and 1.0. The bands below are rules of thumb, not fixed thresholds:
- **0.9-1.0**: Nearly identical meaning (excellent cross-lingual alignment)
- **0.7-0.89**: High similarity (strong semantic equivalence)
- **0.5-0.69**: Moderate similarity (related concepts, some semantic overlap)
- **0.3-0.49**: Low similarity (weakly related or different topics)
- **Below 0.3**: Very low similarity (mostly unrelated concepts)

**What To Expect for Our Semantically Equivalent Sentences:**
- **Good multilingual models**: Should show 0.7-0.9+ similarity across languages
- **Diagonal values**: Should always be 1.0 (sentence compared to itself)
- **Lower than expected scores**: May indicate that the model struggles with certain languages

**Real-World Implications:**
- **High scores (>0.7)**: Model is likely suitable for multilingual applications such as translation and search
- **Medium scores (0.5-0.7)**: Proceed with caution; language-specific tuning may be needed
- **Low scores (<0.5)**: Consider a different model or additional training for that language

**Why Scores Might Be Lower Than Expected:**
- Model had limited training data in the low-resource language
- Different sentence structures or vocabulary between languages
- Domain mismatch (model trained on general text, tested on medical text)
- **Tokenization issues affecting embedding quality** ← explained in the deep dive below
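
To make the scale concrete, here is a minimal sketch of the cosine formula itself, `cos(u, v) = (u · v) / (|u| |v|)`. The three vector pairs are made-up 2-D toy values, not real sentence embeddings, and the helper name `cosine` is only illustrative. It shows where the endpoints 1.0, 0.0 and -1.0 come from.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 2-D vectors (made up for illustration, not real sentence embeddings)
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # unrelated directions
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(f"parallel:   {same_direction:+.3f}")
print(f"orthogonal: {orthogonal:+.3f}")
print(f"opposite:   {opposite:+.3f}")
```

Because the score depends only on direction, scaling a vector (as in the parallel pair) leaves the similarity unchanged.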

## Calculate Similarity Matrix

This section computes pairwise cosine similarities between sentence embeddings to prepare for detailed cross-lingual analysis.

In [None]:
# ============================================================================
# CALCULATE SIMILARITY MATRIX (Required for Later Analysis)
# ============================================================================

# Calculate pairwise cosine similarities between all sentence embeddings  
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)

print("🔗 SIMILARITY MATRIX CALCULATED")
print("=" * 50)
print(f"✅ Matrix shape: {similarity_matrix.shape}")
print(f"✅ Values typically fall between 0.0 (unrelated) and 1.0 (identical); cosine similarity can go as low as -1.0")
print(f"✅ Ready for detailed analysis in upcoming cells")

# Quick preview of the matrix
print(f"\n📊 Quick Preview (first few values):")
lang_names = list(test_sentences.keys())
for i in range(min(2, len(lang_names))):
    for j in range(min(2, len(lang_names))):
        sim = similarity_matrix[i, j]
        print(f"   {lang_names[i]} ↔ {lang_names[j]}: {sim:.3f}")

print(f"\n💡 Full analysis coming in the next sections!")

### 🔧 Deep Dive: How Tokenization Issues Affect Embedding Quality

**The Connection:** Tokenization → Embeddings → Similarity Scores

This link is easy to overlook. Here's how poor tokenization can distort your similarity analysis:

#### 🧩 **The Process Chain:**
```
Raw Text → Tokenization → Token Embeddings → Sentence Embedding → Similarity Score
```

**When tokenization goes wrong, everything downstream suffers!**

#### 📝 **Concrete Examples:**

The token splits below are illustrative rather than actual tokenizer output; the Practical Demonstration section shows what real tokenizers produce.

**Example 1: Word Breaking**
```
English: "carefully" → ["careful", "##ly"] (good: preserves meaning)
Poorly covered: "sorgfältig" → ["so", "##r", "##g", "##fä", "##lt", "##ig"] (bad: loses word structure)
```

**Impact:** The poorly covered word is broken into fragments that carry little meaning on their own. The model has a much harder time learning that "sorgfältig" = "carefully" because it never sees "sorgfältig" as a coherent unit.

**Example 2: Unknown Tokens**
```
English: "doctor" → ["doctor"] (1 token, well-known)
Low-resource: "Dokter" → ["[UNK]"] (1 unknown token, no meaning)
```

**Impact:** Every out-of-vocabulary word maps to the same shared "[UNK]" embedding, so the vector carries no trace of the medical concept.

**Example 3: Inconsistent Splitting**
```
Same concept, different tokenization:
"diagnosis" → ["diagnosis"] 
"Diagnos" → ["Dia", "##gno", "##s"]
```

**Impact:** Even though both mean "diagnosis," the tokenizer treats them as unrelated token sequences, so their embeddings can end up far apart.

#### ⚡ **The Cascade Effect:**

1. **Bad tokenization** → Fragments or unknown tokens
2. **Poor token embeddings** → Generic or meaningless vectors
3. **Bad sentence embeddings** → Average of poor-quality token vectors
4. **Low similarity scores** → Model appears to "not understand" the language
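
The cascade above can be sketched with a few made-up numbers. The example assumes the sentence embedding is a plain mean over token vectors (one common pooling choice, not necessarily what every model in this session uses) and stands in a single hypothetical `generic` vector for an `[UNK]` or fragment embedding. All values are invented for illustration.

```python
import numpy as np

def mean_pool(token_vectors):
    """Sentence embedding as the average of its token vectors (mean pooling)."""
    return np.mean(np.asarray(token_vectors, dtype=float), axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-D token vectors for two sentences with the same meaning
english_tokens = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
target_tokens = [[0.85, 0.15, 0.05], [0.8, 0.1, 0.1]]

# Hypothetical generic vector standing in for an [UNK] or fragment embedding
generic = [0.1, 0.1, 0.9]
degraded_tokens = [target_tokens[0], generic]  # second token badly tokenized

clean_sim = cosine(mean_pool(english_tokens), mean_pool(target_tokens))
degraded_sim = cosine(mean_pool(english_tokens), mean_pool(degraded_tokens))

print(f"Clean tokenization:    {clean_sim:.3f}")
print(f"Degraded tokenization: {degraded_sim:.3f}")
```

Replacing just one of two token vectors with the generic one is enough to pull the pooled vector away from its English counterpart, which is the effect step 3 describes.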

#### 🛡️ **How to Detect This:**
- Look at tokenization output: many tiny fragments = problem
- High number of [UNK] tokens = problem
- Same meaning, very different token patterns = problem
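
The first two checks can be turned into numbers. The sketch below is a hypothetical helper (`tokenization_report` is not part of this notebook) that takes an already tokenized text and reports tokens per word and the share of unknown tokens. The token lists are hand-written examples, not output from a real tokenizer.

```python
def tokenization_report(tokens, n_words, unk_token="[UNK]"):
    """Summarize fragmentation and unknown-token rate for one tokenized text."""
    n_tokens = len(tokens)
    n_unknown = sum(1 for t in tokens if t == unk_token)
    return {
        "tokens_per_word": n_tokens / n_words,
        "unknown_ratio": n_unknown / n_tokens if n_tokens else 0.0,
    }

# Hand-written token lists (hypothetical, not produced by a real tokenizer)
well_covered = tokenization_report(["the", "doctor", "explains"], n_words=3)
fragmented = tokenization_report(["so", "##r", "##g", "##fä", "##lt", "##ig"], n_words=1)
with_unknown = tokenization_report(["the", "[UNK]", "explains"], n_words=3)

for name, report in [("well covered", well_covered),
                     ("fragmented", fragmented),
                     ("with unknown", with_unknown)]:
    print(f"{name:13} tokens/word={report['tokens_per_word']:.2f} "
          f"unknown={report['unknown_ratio']:.0%}")
```

With a real tokenizer, `tokens` would come from `tokenizer.tokenize(text)` and `unk_token` from `tokenizer.unk_token`.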

#### 💡 **Solutions:**
- Choose models trained specifically on your target language
- Prefer SentencePiece-based models such as XLM-R, which tend to cope better with languages that were rare in training
- Consider domain-specific models if your text has specialized vocabulary
- Extend or retrain the tokenizer vocabulary on your target language data

In [None]:
# ============================================================================
# EXPANDED MULTILINGUAL WORD ANALYSIS
# ============================================================================

# Multilingual concepts across language families and resource levels
multilingual_concepts = {
    "medical_professional": {
        "English": "doctor",         # Germanic, high-resource
        "German": "Arzt",            # Germanic, high-resource
        "Luxembourgish": "Dokter",   # Germanic, low-resource
        "French": "médecin",         # Romance, high-resource
        "Spanish": "doctor",         # Romance, high-resource
        "Dutch": "dokter",           # Germanic, medium-resource
        "Italian": "medico",         # Romance, high-resource
    },
    "medical_assessment": {
        "English": "diagnosis",
        "German": "Diagnose",
        "Luxembourgish": "Diagnos",
        "French": "diagnostic",
        "Spanish": "diagnóstico",
        "Dutch": "diagnose",
        "Italian": "diagnosi",
    },
    "with_care": {
        "English": "carefully",
        "German": "sorgfältig",
        # NOTE: "roueg" literally means "calm"; "virsiichteg" is a closer match for "carefully"
        "Luxembourgish": "roueg",
        "French": "soigneusement",
        "Spanish": "cuidadosamente",
        "Dutch": "zorgvuldig",
        "Italian": "attentamente",
    },
    "sick_person": {
        "English": "patient",
        "German": "Patient",
        "Luxembourgish": "Patient",
        "French": "patient",
        "Spanish": "paciente",
        "Dutch": "patiënt",
        "Italian": "paziente",
    },
    "explains_meaning": {
        "English": "explains",
        "German": "erklärt",
        "Luxembourgish": "erkläert",
        "French": "explique",
        "Spanish": "explica",
        "Dutch": "legt uit",
        "Italian": "spiega",
    }
}

print("🌍 EXPANDED MULTILINGUAL TOKENIZATION ANALYSIS")
print("=" * 75)
print("📊 LANGUAGE FAMILIES & RESOURCE LEVELS:")
print("   🇬🇧 Germanic Family:")
print("      • English, German (high-resource)")
print("      • Dutch (medium-resource), Luxembourgish (low-resource)")
print("   🇫🇷 Romance Family:")
print("      • French, Spanish, Italian (all high-resource)")
print()
print("🎯 RESEARCH QUESTIONS:")
print("   • Do models favor same-family languages? (Germanic vs Romance)")
print("   • How severe is the low-resource penalty? (Luxembourgish)")
print("   • Which architectures handle cross-lingual diversity best?")
print("=" * 75)

# Generate comparison pairs (English baseline vs all others)
word_pairs = []
target_languages = ["German", "Luxembourgish", "French", "Spanish", "Dutch", "Italian"]

# Resource level and language family for each target language
resource_levels = {"German": "high", "French": "high", "Spanish": "high",
                   "Italian": "high", "Dutch": "med", "Luxembourgish": "low"}
language_families = {"German": "Germanic", "Luxembourgish": "Germanic", "Dutch": "Germanic",
                     "French": "Romance", "Spanish": "Romance", "Italian": "Romance"}

for concept, translations in multilingual_concepts.items():
    concept_display = concept.replace("_", " ").title()
    english_word = translations["English"]
    
    for lang in target_languages:
        if lang in translations:
            other_word = translations[lang]
            # Add resource level and family info for analysis
            resource_level = resource_levels[lang]
            family = language_families[lang]
            
            pair_label = f"{concept_display} (EN→{lang}/{family}/{resource_level})"
            word_pairs.append((english_word, other_word, pair_label))

print(f"\n📈 ANALYSIS SCOPE:")
print(f"   • {len(word_pairs)} cross-lingual word pairs generated")
print(f"   • {len(multilingual_concepts)} semantic concepts tested")
print(f"   • {len(target_languages)} target languages analyzed")
print(f"   • 2 language families (Germanic + Romance)")
print(f"   • 3 resource levels (high, medium, low)")
print("\n🔬 This will reveal systematic tokenization biases across:")
print("   → Language families (typological similarity)")
print("   → Resource availability (training data volume)")
print("   → Model architectures (BERT vs XLM-R approaches)")
print("=" * 75)

In [None]:
# ============================================================================
# WHAT'S BEEN ENHANCED: Before vs After
# ============================================================================

print("🔄 EXPANSION SUMMARY:")
print("=" * 60)
print("📊 BEFORE (Original):")
print("   • 5 word pairs (EN ↔ LB only)")
print("   • 1 language family comparison")
print("   • Limited resource level analysis")
print()
print("🚀 AFTER (Enhanced):")
print(f"   • {len(word_pairs)} word pairs across multiple language pairs")
print("   • 2 language families (Germanic + Romance)")
print("   • 3 resource levels (high/medium/low)")
print(f"   • {len(target_languages) + 1} languages in total (English baseline + {len(target_languages)} targets)")
print()
print("🎯 EDUCATIONAL VALUE:")
print("   ✅ Students can see systematic tokenization patterns")
print("   ✅ Compare language family effects (Germanic vs Romance)")
print("   ✅ Understand resource availability impact")
print("   ✅ Identify model architecture differences")
print("=" * 60)

# Sample the enhanced word_pairs to show the structure
print(f"\n📝 SAMPLE OF ENHANCED WORD PAIRS:")
print("   (First 6 pairs as examples)")
for i, (word1, word2, label) in enumerate(word_pairs[:6]):
    print(f"   {i+1:2}. {word1:12} → {word2:15} | {label}")
if len(word_pairs) > 6:
    print(f"   ... and {len(word_pairs)-6} more pairs")

print(f"\n💡 This systematic expansion makes the tokenization analysis more informative!")

## Practical Demonstration: Tokenization Quality Impact

This demonstration shows how tokenization quality affects embedding similarity by analyzing word fragmentation and unknown tokens across different models and languages.

In [None]:
# ============================================================================
# 🎓 STUDENT GUIDE: How to Use Gated Models in Colab (Optional Advanced Section)
# ============================================================================

"""
📚 QUESTION: How can students use Gemma (or other gated models) in Google Colab?

✅ ANSWER: Follow these steps (steps 1 and 2 are a one-time setup per student):

STEP 1: Get Model Access (Outside of Colab)
==========================================
1. Go to: https://huggingface.co/google/gemma-2-2b-it
2. Click the "Request Access" button
3. Wait for approval from Google
4. You'll get an email when approved

STEP 2: Create Hugging Face Token (Outside of Colab)  
===================================================
1. Go to: https://huggingface.co/settings/tokens
2. Click "New token"
3. Choose "Read" permissions (sufficient for downloading models)
4. Copy the token (starts with "hf_...")

STEP 3: Authenticate in Colab (Every Session)
=============================================
Run one of the options printed below at the start of your Colab session.
"""

print("🔑 TO USE GATED MODELS IN COLAB:")
print("1. Get model access approval (one-time)")
print("2. Create HF token (one-time)")
print("3. Login in Colab (every session)")
print("\nExample authentication code for Colab:")
print("-" * 40)
print("# Option A: Interactive login (recommended for beginners)")
print("from huggingface_hub import notebook_login")
print("notebook_login()  # This will show a prompt to enter your token")
print()
print("# Option B: Direct token login (quick, but never share or commit a notebook containing your token)")
print("from huggingface_hub import login")
print("login(token='hf_your_token_here')  # Replace with your actual token")
print()
print("# Option C: Environment variable (keeps the token out of the notebook)")
print("import os")
print("from huggingface_hub import login")
print("login(token=os.environ['HF_TOKEN'])  # Set HF_TOKEN outside the notebook code")

print(f"\n💡 AFTER AUTHENTICATION:")
print(f"   Just uncomment the gated model in the list above!")
print(f"   models_to_compare.append('google/gemma-2-2b-it')")

print(f"\n🎯 FOR INSTRUCTORS:")
print(f"   • You could demo this live for interested students")
print(f"   • Or provide it as bonus/homework material")
print(f"   • Main tutorial works fine with public models only")

In [None]:
# ============================================================================
# PRACTICAL DEMONSTRATION: Tokenization Quality Impact
# ============================================================================

def demonstrate_tokenization_quality(word_pairs, model_name):
    """
    Show how tokenization quality varies between equivalent words across languages.
    
    Args:
        word_pairs: List of (lang1_word, lang2_word, meaning) tuples
        model_name: HuggingFace model to test
    """
    print(f"\n🔍 TOKENIZATION QUALITY ANALYSIS: {model_name}")
    print("=" * 60)
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Use the tokenizer's own unknown token ("[UNK]" for BERT, "<unk>" for XLM-R)
        unk_token = tokenizer.unk_token
        
        for word1, word2, meaning in word_pairs:
            tokens1 = tokenizer.tokenize(word1)
            tokens2 = tokenizer.tokenize(word2)
            
            # Count fragments (tokens per word) and unknown tokens
            frag1 = len(tokens1)
            frag2 = len(tokens2)
            unk1 = sum(1 for t in tokens1 if t == unk_token)
            unk2 = sum(1 for t in tokens2 if t == unk_token)
            
            # Quality assessment: one known token is best, any unknown token is worst
            quality1 = "🟢 Good" if frag1 == 1 and unk1 == 0 else ("🟡 OK" if unk1 == 0 else "🔴 Poor")
            quality2 = "🟢 Good" if frag2 == 1 and unk2 == 0 else ("🟡 OK" if unk2 == 0 else "🔴 Poor")
            
            print(f"\n📝 Concept: '{meaning}'")
            print(f"   {word1:15} → {tokens1} | Fragments: {frag1}, UNK: {unk1} | {quality1}")
            print(f"   {word2:15} → {tokens2} | Fragments: {frag2}, UNK: {unk2} | {quality2}")
            
            # Rough heuristic guess at embedding quality (not a measurement)
            if quality1 == quality2 == "🟢 Good":
                prediction = "🎯 High similarity expected"
            elif "🔴 Poor" in [quality1, quality2]:
                prediction = "⚠️  Low similarity likely (tokenization issues)"
            else:
                prediction = "🤔 Moderate similarity possible"
            
            print(f"   💡 Similarity prediction: {prediction}")
            
    except Exception as e:
        print(f"❌ Error analyzing {model_name}: {e}")

# Test with concrete examples from our corpus.
# A separate name keeps the expanded `word_pairs` list built earlier intact.
demo_word_pairs = [
    ("doctor", "Dokter", "medical professional"),
    ("diagnosis", "Diagnos", "medical assessment"),
    ("carefully", "roueg", "with care"),
    ("patient", "Patient", "sick person"),
    ("explains", "erkläert", "makes clear")
]

# Test with our models to see quality differences
test_models = ["bert-base-multilingual-cased", "xlm-roberta-base"]

for model in test_models:
    demonstrate_tokenization_quality(demo_word_pairs, model)

print(f"\n💡 INTERPRETATION:")
print(f"   🟢 Good tokenization → Better embeddings → Higher similarity scores")
print(f"   🔴 Poor tokenization → Worse embeddings → Lower similarity scores")
print(f"   This helps explain why some language pairs might score lower than expected!")

In [None]:
# ============================================================================
# SIMILARITY ANALYSIS & INTERPRETATION
# ============================================================================

# Analyze the similarity results with automatic interpretation
print("🔍 DETAILED SIMILARITY ANALYSIS")
print("=" * 60)

# Get sentence keys from our test sentences
lang_names = list(test_sentences.keys())

# Collect pairwise similarities (excluding self-comparisons).
# Note: with the expanded corpus this also includes pairs from different
# domains, which are not expected to score high.
cross_lingual_similarities = []
print("\n📊 Cross-lingual Similarity Scores:")
print("-" * 40)

for i, lang1 in enumerate(lang_names):
    for j, lang2 in enumerate(lang_names):
        if i < j:  # Avoid duplicates and self-comparisons
            sim = similarity_matrix[i, j]
            cross_lingual_similarities.append(sim)
            
            # Provide automatic interpretation (rule-of-thumb thresholds)
            if sim >= 0.8:
                quality = "🟢 EXCELLENT"
                note = "Very strong semantic alignment"
            elif sim >= 0.7:
                quality = "🟡 GOOD"
                note = "Clear semantic similarity"
            elif sim >= 0.5:
                quality = "🟠 MODERATE"
                note = "Some semantic overlap, could be better"
            else:
                quality = "🔴 CONCERNING"
                note = "Weak alignment - investigate model/language"
                
            print(f"   {lang1:12} ↔ {lang2:12}: {sim:.3f} | {quality} - {note}")

# Calculate summary statistics
if cross_lingual_similarities:
    avg_similarity = sum(cross_lingual_similarities) / len(cross_lingual_similarities)
    max_similarity = max(cross_lingual_similarities)
    min_similarity = min(cross_lingual_similarities)
    
    print(f"\n📈 SUMMARY STATISTICS:")
    print(f"   • Average cross-lingual similarity: {avg_similarity:.3f}")
    print(f"   • Best language pair similarity: {max_similarity:.3f}")
    print(f"   • Worst language pair similarity: {min_similarity:.3f}")
    print(f"   • Number of language pairs: {len(cross_lingual_similarities)}")
    
    # Overall assessment (indicative only: based on a small test corpus)
    print(f"\n🎯 OVERALL MODEL ASSESSMENT:")
    if avg_similarity >= 0.75:
        print(f"   🎉 EXCELLENT: This model shows strong multilingual alignment!")
        print(f"      → Promising for multilingual applications (validate on your own task data)")
    elif avg_similarity >= 0.60:
        print(f"   ✅ GOOD: Model shows decent cross-lingual capabilities")
        print(f"      → Usable for multilingual tasks with some caution")
    elif avg_similarity >= 0.45:
        print(f"   ⚠️  FAIR: Model has limited multilingual alignment")
        print(f"      → Consider fine-tuning or using a different model")
    else:
        print(f"   🚨 POOR: Model struggles with multilingual alignment")
        print(f"      → Not recommended for cross-lingual applications")
        
    print(f"\n💡 ACTIONABLE INSIGHTS:")
    print(f"   • Use this analysis to choose appropriate models for your languages")
    print(f"   • Lower scores may indicate a need for more training data or a different architecture")
    print(f"   • Compare different models using this same methodology")

## Similarity Analysis & Interpretation

This section interprets cross-lingual cosine similarities and summarizes model quality with actionable insights.
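
With the expanded corpus, averaging over every sentence pair mixes translations of the same sentence with sentences from different domains. A minimal sketch of how the two groups could be separated follows. It assumes each key has the form `<language>_<domain>` (for example `English_medical`), which is an assumption about how `test_sentences` is keyed; the helper `split_pairs_by_domain`, the keys, and the matrix values are all made up for illustration.

```python
import numpy as np

def split_pairs_by_domain(keys, sim_matrix):
    """Separate same-domain (translation) pairs from cross-domain pairs.

    Assumes every key looks like "<language>_<domain>".
    """
    same_domain, cross_domain = [], []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            domain_i = keys[i].split("_", 1)[1]
            domain_j = keys[j].split("_", 1)[1]
            bucket = same_domain if domain_i == domain_j else cross_domain
            bucket.append(float(sim_matrix[i][j]))
    return same_domain, cross_domain

# Hypothetical keys and a made-up symmetric similarity matrix
toy_keys = ["English_medical", "Luxembourgish_medical",
            "English_daily", "Luxembourgish_daily"]
toy_sims = np.array([
    [1.00, 0.82, 0.31, 0.28],
    [0.82, 1.00, 0.27, 0.35],
    [0.31, 0.27, 1.00, 0.78],
    [0.28, 0.35, 0.78, 1.00],
])

same, cross = split_pairs_by_domain(toy_keys, toy_sims)
print(f"Same-domain pairs:  mean {np.mean(same):.2f} over {len(same)} pairs")
print(f"Cross-domain pairs: mean {np.mean(cross):.2f} over {len(cross)} pairs")
```

A well-aligned model should score clearly higher on the same-domain group than on the cross-domain group; a single overall average hides that gap.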

In [None]:
# Calculate semantic similarities
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

print("🔍 SEMANTIC SIMILARITY ANALYSIS")
print("\nSimilarity Matrix (1.0 = identical direction, values near 0.0 = unrelated):")
print()

# Create a formatted table, sizing the columns to fit the longest key
lang_names = list(test_sentences.keys())
col_width = max([len("Language")] + [len(name) for name in lang_names]) + 2
print(f"{'Language':<{col_width}}", end="")
for lang in lang_names:
    print(f"{lang:<{col_width}}", end="")
print()

for i, lang1 in enumerate(lang_names):
    print(f"{lang1:<{col_width}}", end="")
    for j, lang2 in enumerate(lang_names):
        sim = similarity_matrix[i, j]
        print(f"{sim:<{col_width}.3f}", end="")
    print()

print(f"\n💡 Cross-lingual similarities (excluding self-comparisons):")
for i, lang1 in enumerate(lang_names):
    for j, lang2 in enumerate(lang_names):
        if i < j:  # Avoid duplicates
            sim = similarity_matrix[i, j]
            print(f"   {lang1} ↔ {lang2}: {sim:.3f}")

In [None]:
# ============================================================================
# 🏢 ENHANCED ANALYSIS: DOMAIN-SPECIFIC TOKENIZATION PATTERNS
# ============================================================================

print("🚀 DOMAIN-SPECIFIC TOKENIZATION ANALYSIS")
print("=" * 70)

if 'domain' in df_results.columns:
    print("📊 Available domains:", list(df_results['domain'].unique()))
    
    # Domain-specific efficiency analysis
    print(f"\n🎯 CROSS-DOMAIN TOKENIZATION EFFICIENCY:")
    print("-" * 50)
    
    # Mean tokens per word for every (domain, language) combination
    domain_summary = df_results.groupby(['domain', 'language'])['tokens_per_word'].mean().round(2)
    
    for domain in df_results['domain'].unique():
        print(f"\n📍 {domain.upper()}:")
        try:
            en_efficiency = domain_summary[domain]['English']
            lb_efficiency = domain_summary[domain]['Luxembourgish']
            # Extra tokens per word in Luxembourgish, relative to English (in %)
            penalty = ((lb_efficiency - en_efficiency) / en_efficiency * 100)
            
            print(f"   English:       {en_efficiency:.2f} tokens/word")
            print(f"   Luxembourgish: {lb_efficiency:.2f} tokens/word")
            print(f"   Resource penalty: {penalty:+.1f}%")
            
            # Domain-specific interpretation
            if penalty < 25:
                assessment = "🟢 Minimal penalty"
                advice = "Good multilingual coverage for this domain"
            elif penalty < 60:
                assessment = "🟡 Moderate penalty"
                advice = "Acceptable but monitor computational costs"
            else:
                assessment = "🔴 High penalty"
                advice = "Consider domain-specific model fine-tuning"
                
            print(f"   Assessment: {assessment} - {advice}")
            
        except KeyError as e:
            print(f"   ⚠️  Missing data: {str(e)}")
    
    # Most challenging domains
    print(f"\n🎯 DOMAIN RANKING (by multilingual difficulty):")
    print("-" * 50)
    
    domain_penalties = {}
    for domain in df_results['domain'].unique():
        try:
            en_eff = domain_summary[domain]['English']
            lb_eff = domain_summary[domain]['Luxembourgish']
            penalty = ((lb_eff - en_eff) / en_eff * 100)
            domain_penalties[domain] = penalty
        except KeyError:
            continue
    
    # Sort by difficulty (highest penalty = most challenging)
    sorted_domains = sorted(domain_penalties.items(), key=lambda x: x[1], reverse=True)
    
    for rank, (domain, penalty) in enumerate(sorted_domains, 1):
        # Same bands as the assessment above: <25 low, 25 to <60 medium, >=60 high
        difficulty = "🔴 High" if penalty >= 60 else "🟡 Medium" if penalty >= 25 else "🟢 Low"
        print(f"   {rank}. {domain:12} ({penalty:+.1f}% penalty) - {difficulty} difficulty")
    
    # Educational insights
    print(f"\n💡 KEY INSIGHTS FOR STUDENTS:")
    print("-" * 50)
    print("   • Technical domains (Medical, Technology) may have specialized vocabulary")
    print("   • Daily conversation may be easier for multilingual models")
    print("   • Business language shows formal vs informal tokenization patterns")
    print("   • Academic language tests model coverage of educational terms")
    print("   • Cross-domain consistency indicates robust multilingual training")
    
else:
    print("📝 Domain-specific analysis not available")
    print("   → Use the expanded sentence pairs above to enable this analysis")

print("=" * 70)
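
The resource penalty used above is a relative difference: `(target - baseline) / baseline * 100`. As a minimal worked example, the sketch below applies that formula and the same bands to made-up tokens-per-word figures. The helper names `resource_penalty` and `penalty_band` are illustrative and not defined elsewhere in this notebook.

```python
def resource_penalty(baseline_tokens_per_word, target_tokens_per_word):
    """Percentage of extra tokens per word relative to the baseline language."""
    return (target_tokens_per_word - baseline_tokens_per_word) / baseline_tokens_per_word * 100

def penalty_band(penalty):
    """Map a penalty to the bands used in the domain analysis (<25, <60, otherwise)."""
    if penalty < 25:
        return "minimal"
    if penalty < 60:
        return "moderate"
    return "high"

# Made-up tokens-per-word figures, for illustration only
example_penalty = resource_penalty(1.0, 1.5)
print(f"Penalty: {example_penalty:+.1f}% ({penalty_band(example_penalty)})")
```

Here a language that needs 1.5 tokens per word against a baseline of 1.0 costs 50% more tokens for the same text, which translates directly into longer sequences and higher compute.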

# Chapter 3: Model Comparison Summary

Let's summarize what we've learned about different models and languages:

In [None]:
# Create a summary of our analysis
summary_df = df_results.pivot_table(
    index='language', 
    columns='model', 
    values=['tokens_per_word', 'num_tokens'], 
    aggfunc='mean'
).round(2)

print("📊 TOKENIZATION EFFICIENCY SUMMARY")
print("=" * 50)
print("\nTokens per word (lower = more efficient):")
print(summary_df['tokens_per_word'])

print("\nAverage tokens per sentence:")
print(summary_df['num_tokens'])

# Sanity check: confirm the corpus and the results table cover the same languages
print("\n🔍 DEBUG INFO:")
print(f"   Languages in test_sentences: {list(test_sentences.keys())}")
print(f"   Languages in df_results: {list(df_results['language'].unique())}")
print(f"   DataFrame shape: {df_results.shape}")

print("\n🏆 RECOMMENDATIONS:")
# Find the most efficient model for each language, iterating over the
# languages that actually exist in the DataFrame to avoid errors
for lang in df_results['language'].unique():
    lang_data = df_results[df_results['language'] == lang]
    
    if not lang_data.empty:
        try:
            best_idx = lang_data['tokens_per_word'].idxmin()
            best_model = lang_data.loc[best_idx, 'model']
            best_ratio = lang_data['tokens_per_word'].min()
            print(f"   {lang:15}: Best model is {best_model} (ratio: {best_ratio:.2f})")
        except Exception as e:
            print(f"   {lang:15}: Error processing data - {str(e)}")
    else:
        print(f"   {lang:15}: No data available")

In [None]:
# ============================================================================
# 🔧 FIX DATA STRUCTURE (Ensure df_results is a DataFrame)
# ============================================================================
# Run this cell before the summary cells if they raise an AttributeError.

import pandas as pd

print("🔍 CHECKING DATA STRUCTURE:")
print(f"   Type of df_results: {type(df_results)}")

# Ensure df_results is a DataFrame (fix for AttributeError)
if isinstance(df_results, list):
    print("   ⚠️  Converting list to DataFrame...")
    df_results = pd.DataFrame(df_results)
    print(f"   ✅ Converted! Shape: {df_results.shape}")
    print(f"   📊 Columns: {list(df_results.columns)}")
else:
    print(f"   ✅ Already a DataFrame! Shape: {df_results.shape}")

print(f"\n🎯 READY FOR ANALYSIS!")
print(f"   Data type: {type(df_results)}")
print(f"   Available for pivot_table operations")

# 🎓 Session 1 Complete

## What You've Learned

Congratulations! You've explored the core foundations of Large Language Models:

- ✅ **Tokenization**: How models convert text into processable tokens
- ✅ **Cross-lingual Analysis**: Understanding language differences in model processing
- ✅ **Text Embeddings**: Converting text to meaningful vector representations
- ✅ **Model Comparison**: Evaluating different architectures for your needs
- ✅ **Practical Skills**: Analyzing tokenization quality and embedding behavior


---

## 📚 Optional: Try It Yourself - Dialogue Summarization

*Want to apply these concepts? Try creating your own dialogue summarization system using the foundations you've learned:*

1. **Choose your own dialogue data** (conversations, meetings, chat logs)
2. **Apply tokenization analysis** to understand processing costs
3. **Use embeddings** to find similar conversation segments  
4. **Compare models** for your specific language/domain
5. **Implement TextRank** for extractive summarization (research the algorithm!)

*This makes great homework or project work to deepen your understanding!*

### Your Toolkit for Future Projects

```python
# Core functions you can reuse:
analyze_tokenization(text, model_name)    # Compare tokenization efficiency
embedder.encode(sentences)                # Create semantic embeddings
cosine_similarity(embeddings)            # Measure text similarity
```