# Session 1: Foundations of Large Language Models 🤖

<div align="center">


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/Session1_Foundations_of_Large_Language_Models.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

</div>

---

**Core Concepts:**
- **LLM Architecture** - Understand transformer models and attention mechanisms
- **Tokenization** - How models process and understand text across languages
- **Text Representation** - Embeddings, vectors, and semantic similarity
- **Model Comparison** - Analyze different LLM architectures and capabilities
- **Low-Resource Considerations** - Challenges with underrepresented languages

**Practical Skills:**
- Compare tokenization across different models
- Analyze model behavior with multilingual text
- Implement basic text processing pipelines
- Evaluate model performance on various languages
- Build foundation for advanced NLP applications

**Why This Matters:** Understanding LLM fundamentals is crucial for effective use in real-world applications, especially when working with diverse languages and limited computational resources.


## Course Context

| Session | Focus | Techniques | Prerequisites |
|---------|-------|------------|---------------|
| **Session 0** | Setup & Orientation | Environment, Basic Concepts | None |
| **→ This Session** | **LLM Foundations** | **Tokenization, Embeddings, Model Analysis** | **Session 0** |
| **Session 2** | Prompt Engineering | Advanced Prompting, Chain-of-Thought | Sessions 0-1 |
| **Session 3** | Fine-tuning | LoRA, QLoRA, Custom Training | Sessions 0-2 |
| **Session 4** | Bias & Ethics | Fairness, Evaluation, Mitigation | Sessions 0-3 |


## 🛠️ Environment Setup

### What This Section Does
This section prepares your coding environment with all necessary libraries for exploring Large Language Model foundations. We'll install packages optimized for **interactive learning** - educational, efficient, and GPU-optional!

### Why These Specific Packages?

**Core Dependencies:**
- `numpy` + `pandas`: Essential for data manipulation and analysis
- `scikit-learn`: Similarity metrics and basic ML utilities
- `matplotlib`: Visualization of model behaviors and comparisons

**LLM Ecosystem:**
- `transformers`: Access to pretrained models and tokenizers
- `sentence-transformers`: Semantic embeddings and similarity
- `torch`: PyTorch backend for model operations
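
Before running the install cell, it can help to check which of these packages are already present (Colab ships with most of them). The sketch below uses only the standard library; the helper names `installed_version` and `report_versions` are made up for this illustration and are not part of the course code.

```python
from importlib import metadata

# Distribution names taken from the package lists above.
PACKAGES = ["numpy", "pandas", "scikit-learn", "matplotlib",
            "transformers", "sentence-transformers", "torch"]

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def report_versions(packages):
    """Map each package name to its installed version (None when absent)."""
    return {name: installed_version(name) for name in packages}

for name, version in report_versions(PACKAGES).items():
    print(f"{name:24} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` is one the `pip install` cell still needs to fetch.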

In [None]:
# Quick setup for this session
!pip install -q transformers sentence-transformers scikit-learn matplotlib pandas

In [None]:
# Core imports for LLM foundations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import torch

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ Environment ready for LLM foundations exploration!")

# Chapter 1: Understanding Tokenization

## What We'll Explore

Tokenization is how models convert text into numbers they can process. Let's see how this works with different languages and models.
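
To make "text into numbers" concrete before loading real models, here is a minimal sketch of WordPiece-style greedy longest-match splitting. The vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `encode` are hypothetical and exist only for this illustration; real tokenizers use vocabularies with tens or hundreds of thousands of entries plus extra normalization steps.

```python
# Hypothetical toy vocabulary: token string -> integer id.
TOY_VOCAB = {"[UNK]": 0, "the": 1, "doctor": 2, "explain": 3, "##s": 4,
             "care": 5, "##ful": 6, "##ly": 7}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

def encode(text, vocab):
    """Split text on whitespace, tokenize each word, and look up the ids."""
    tokens = [t for word in text.lower().split() for t in wordpiece(word, vocab)]
    ids = [vocab[t] for t in tokens]
    return tokens, ids

tokens, ids = encode("The doctor explains carefully", TOY_VOCAB)
print(tokens)  # ['the', 'doctor', 'explain', '##s', 'care', '##ful', '##ly']
print(ids)     # [1, 2, 3, 4, 5, 6, 7]
```

A word with no matching pieces, such as `wordpiece("patient", TOY_VOCAB)`, collapses to `["[UNK]"]`, which is the failure mode discussed later in this session.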

### Step 1: Prepare Test Sentences

**Model Selection:** We'll compare two popular multilingual models from [Hugging Face Hub](https://huggingface.co/models):

- **BERT** (Google): Bidirectional Encoder Representations from Transformers - one of the first successful transformer models
- **XLM-RoBERTa** (Facebook): Cross-lingual Language Model based on RoBERTa - specifically designed for multilingual tasks

These model names are the official identifiers used to download them from Hugging Face's model repository.

In [None]:
"""
Multilingual Test Corpus Definition

This corpus contains semantically equivalent sentences across three languages 
representing different language families and resource levels:
- English: Germanic, high-resource language
- Luxembourgish: Germanic, low-resource language
- French: Romance, high-resource language

Domain: Medical/Healthcare (to test domain-specific tokenization)
Semantic equivalence: All sentences convey the same meaning

Research Question: 
    How do multilingual models handle typologically similar vs. different 
    languages with varying resource availability?

Expected Findings (Hypothesis):
    1. Resource Availability Effect:
       - English & French (high-resource) → Lower tokens-per-word ratio
       - Luxembourgish (low-resource) → Higher tokens-per-word ratio
       - Reason: Models trained predominantly on high-resource languages learn
                 better subword representations for those languages
    
    2. Typological Similarity:
       - English ↔ Luxembourgish (both Germanic): May show some overlap in
         tokenization patterns despite resource difference
       - French (Romance) vs. Germanic languages: Different morphological 
         patterns may lead to different tokenization strategies
    
    3. Model Architecture Differences:
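       - mBERT: Trained on Wikipedia in 104 languages with a vocabulary of
         roughly 120k WordPiece tokens; may show stronger resource bias
       - XLM-RoBERTa: Trained on CommonCrawl text in 100 languages with a
         vocabulary of roughly 250k tokens; may handle low-resource
         languages more efficiently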

Practical Implications:
    If Luxembourgish requires 2-3x more tokens than English:
    → Processing costs increase proportionally
    → Context window fills up faster (fewer words fit in same token budget)
    → Inference latency increases
    → This quantifies the "low-resource penalty" in production systems

Note: You may substitute these examples with sentences from your target language
      and domain for comparative analysis.
"""

# Multilingual test corpus
test_sentences = {
    "English": "The doctor explains the diagnosis carefully to the patient.",
    "Luxembourgish": "Den Dokter erkläert d'Diagnos ganz roueg dem Patient.",
    "French": "Le médecin explique le diagnostic avec soin au patient."
}

# Display corpus for verification
print("=" * 70)
print("MULTILINGUAL TEST CORPUS")
print("=" * 70)
for language, sentence in test_sentences.items():
    word_count = len(sentence.split())
    char_count = len(sentence)
    print(f"\n{language:15} | Words: {word_count:2d} | Characters: {char_count:3d}")
    print(f"{'':15} | {sentence}")
print("\n" + "=" * 70)

### Step 2: Compare Tokenization Across Models

In [None]:
# ============================================================================
# COMPREHENSIVE TOKENIZATION COMPARISON ACROSS MODEL ARCHITECTURES
# ============================================================================
# These models represent different tokenization algorithms and training approaches:

from transformers import AutoTokenizer

models_to_compare = [
    "bert-base-multilingual-cased",        # WordPiece, multilingual
    "xlm-roberta-base",                    # SentencePiece, multilingual
    #"google/mt5-small",                    # SentencePiece, multilingual encoder-decoder
    #"gpt2",                                # BPE, English only (no spaces before non English chars)
    #"google/gemma-2-2b-it",                # SentencePiece, decoder-only chat model; gated, so uncomment only after the Hugging Face login described in the gated-models guide
]

text_en = "Students are learning about large language models."
text_lr = "D'Studenten léieren iwwer grouss Sproochmodeller."  # replace with your low resource sentence

def show_tokenization(model_name, text):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.tokenize(text)
    print(f"\nModel: {model_name}")
    print("Text :", text)
    print("Tokens:", tokens)
    print("Number of tokens:", len(tokens))
    
    # Return data for DataFrame creation
    return {
        'model': model_name.split('/')[-1],  # Short name
        'text': text,
        'num_tokens': len(tokens),
        'num_words': len(text.split()),
        'tokens_per_word': len(tokens) / len(text.split()) if text.split() else 0,
        'tokens_preview': tokens[:5]  # First 5 tokens for reference
    }

# Collect results for analysis
df_results = []

print("🔤 ENGLISH TEXT ANALYSIS")
print("-" * 40)
for model_name in models_to_compare:
    result = show_tokenization(model_name, text_en)
    result['language'] = 'English'
    df_results.append(result)

print("\n" + "=" * 80 + "\n🌍 LOW RESOURCE LANGUAGE EXAMPLES\n")

for model_name in models_to_compare:
    result = show_tokenization(model_name, text_lr)
    result['language'] = 'Luxembourgish'  # or whatever your low-resource language is
    df_results.append(result)

# Convert to pandas DataFrame for easy analysis
import pandas as pd
df_results = pd.DataFrame(df_results)

print(f"\n📊 Results collected in DataFrame: {len(df_results)} entries")
print(f"    Columns: {list(df_results.columns)}")
print(f"    Ready for summary analysis!")


### 🔍 Inspecting Tokenizer Types Programmatically

Sometimes you need to determine what tokenization algorithm a model uses (WordPiece, BPE, SentencePiece, etc.). While there's no universal flag, you can inspect the tokenizer programmatically:

**Why This Matters:**
- Different algorithms handle subwords differently
- Understanding the algorithm helps predict tokenization behavior
- Important for debugging and optimization
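
Since there is no universal flag, one rough way to guess the subword scheme is to look at the marker characters in the tokens themselves. The function below is a heuristic sketch, not an official API: `guess_tokenizer_family` is a hypothetical helper, and the sample token lists are written by hand rather than produced by a real tokenizer.

```python
def guess_tokenizer_family(tokens):
    """Heuristically guess the subword scheme from marker characters."""
    if any(t.startswith("##") for t in tokens):
        return "wordpiece"        # '##' marks a continuation inside a word
    if any(t.startswith("\u2581") for t in tokens):
        return "sentencepiece"    # '\u2581' (lower one-eighth block) marks a preceding space
    if any(t.startswith("\u0120") for t in tokens):
        return "byte-level-bpe"   # '\u0120' (G with dot above) marks a preceding space
    return "unknown"

# Hand-written samples in the style of each tokenizer family
print(guess_tokenizer_family(["Stud", "##ents", "are"]))            # wordpiece
print(guess_tokenizer_family(["\u2581Student", "s", "\u2581are"]))  # sentencepiece
print(guess_tokenizer_family(["Students", "\u0120are"]))            # byte-level-bpe
```

In practice you would pass in the output of `tokenizer.tokenize(text)` on a sentence with several words; a single short word may contain no markers at all and return `"unknown"`.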

In [None]:
# ============================================================================
# TOKENIZER INTROSPECTION: Understanding Algorithm Types
# ============================================================================

from transformers import AutoTokenizer

model_name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(model_name)

print("Tokenizer class:", tok.__class__.__name__)
print("Backend:", getattr(tok, "backend_tokenizer", None))
print("Special tokens:", tok.special_tokens_map)



### 🎯 Key Takeaways: Tokenization Algorithms in Practice

**Understanding these differences helps you:**

1. **Choose the Right Model**: 
   - Need to handle many languages? → SentencePiece models (XLM-RoBERTa, mT5)
   - Working primarily with English? → WordPiece or BPE might be sufficient
   - Need fast inference? → Consider algorithm efficiency for your text type

2. **Predict Performance**:
   - SentencePiece typically handles low-resource languages better
   - WordPiece good for languages with complex morphology
   - BPE optimized for languages similar to training data

3. **Debug Issues**:
   - Unexpected tokenization? Check the algorithm type
   - High token counts? Algorithm might not be suited for your language
   - Special token conflicts? Inspect the special_tokens_map

**Next**: Let's see how these tokenization differences affect semantic representations...

### 🤔 Reflection Questions

Look at the results above and consider:

- Which language uses more tokens per word?
- How might more tokens affect inference cost and speed?
- Do you see any unusual token splits (broken words, weird subwords)?

**Key Insight:** Languages with fewer training examples often get split into more subword tokens, increasing computational costs.
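
The cost argument is simple arithmetic, sketched below. The token and word counts are hypothetical placeholders (substitute the numbers from your own `df_results`), and `tokens_per_word` and `words_that_fit` are helper names invented for this example.

```python
def tokens_per_word(num_tokens, num_words):
    """Average number of subword tokens needed per whitespace-separated word."""
    return num_tokens / num_words if num_words else 0.0

def words_that_fit(context_tokens, ratio):
    """Approximate number of words that fit into a fixed token budget."""
    return int(context_tokens // ratio) if ratio else 0

# Hypothetical counts for two 8-word sentences with the same meaning
ratio_en = tokens_per_word(10, 8)   # 1.25 tokens per word
ratio_lr = tokens_per_word(20, 8)   # 2.5 tokens per word

print(words_that_fit(512, ratio_en))  # 409
print(words_that_fit(512, ratio_lr))  # 204
print(ratio_lr / ratio_en)            # 2.0, the relative token cost
```

With these made-up numbers, the same 512-token window holds about half as many words of the low-resource text, and any per-token cost doubles.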

# 📊 Chapter 2: Text Embeddings & Semantic Similarity

## Understanding Vector Representations

**What are embeddings?** Numbers that capture the meaning of text in high-dimensional space.

Let's see how different models create these representations!

In [None]:
# Load a multilingual sentence embedding model
embedder_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedder = SentenceTransformer(embedder_name, device=device)

print(f"📊 Loaded embedding model: {embedder_name}")

# Get embeddings for our test sentences
sentences = list(test_sentences.values())
languages = list(test_sentences.keys())

embeddings = embedder.encode(sentences, convert_to_numpy=True)
print(f"✅ Created embeddings with shape: {embeddings.shape}")
print(f"   Each sentence → {embeddings.shape[1]} dimensional vector")

## 📊 Understanding PCA (Principal Component Analysis)

**🤔 The Problem:** Our embeddings are 384-dimensional vectors - impossible to visualize directly!

**🎯 The Solution:** PCA reduces high-dimensional data to 2D while preserving the most important relationships.

### 📚 How PCA Works:

1. **Find Principal Components**: Directions in the data with maximum variance
2. **Project Data**: Transform original data onto these new axes
3. **Keep Top Components**: Use only the first 2 components for 2D visualization
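
The three steps above can be sketched directly with NumPy's SVD. This is a minimal illustration on random stand-in vectors, not the scikit-learn implementation used in the notebook; `pca_2d` is a hypothetical helper name.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    X_centered = X - X.mean(axis=0)                  # PCA works on centered data
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:2].T                   # step 2: project onto new axes
    variance = S ** 2                                # proportional to variance per component
    explained_ratio = variance[:2] / variance.sum()  # step 3: keep the top two
    return coords, explained_ratio

rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(3, 384))  # stand-in for 3 sentence embeddings

coords, ratio = pca_2d(toy_embeddings)
print(coords.shape)  # (3, 2)
print(ratio.sum())   # approximately 1.0
```

One consequence worth noticing: three points always lie in a plane once centered, so with only three sentences the first two components capture essentially all of the variance. A "total retained" figure near 100% is therefore expected for a corpus this small and says little about embedding quality.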

### 💡 Key Insights:

- **Component 1**: Captures the most variation in the data
- **Component 2**: Captures the second most variation  
- **Relationship Preservation**: Similar sentences should stay close even after reduction
- **Information Loss**: We lose some information, but keep the most important patterns

### 🎯 Why This Matters:

- Allows us to **visualize** high-dimensional embeddings
- Helps us **understand** if similar meanings cluster together across languages
- **Quality check** for our multilingual model performance

In [None]:
# ============================================================================
# 🔬 APPLYING PCA FOR VISUALIZATION
# ============================================================================

print("📊 PCA ANALYSIS")
print("=" * 40)
print(f"📐 Original embedding dimensions: {embeddings.shape[1]}")
print(f"🎯 Reducing to: 2 dimensions for plotting")
print(f"⚡ Method: Principal Component Analysis")

# Apply PCA reduction
pca = PCA(n_components=2, random_state=42)
coords_2d = pca.fit_transform(embeddings)

# Analyze the results
explained_var = pca.explained_variance_ratio_
print(f"\n📊 VARIANCE EXPLANATION:")
print(f"   • Component 1: {explained_var[0]*100:.1f}% of original variance")
print(f"   • Component 2: {explained_var[1]*100:.1f}% of original variance")
print(f"   • Total retained: {sum(explained_var)*100:.1f}% of information")

# Note: with only 3 sentences, the centered data spans at most 2 dimensions,
# so 2 components always retain ~100% here. The thresholds below only become
# informative once you embed more sentences.
print(f"\n💡 INTERPRETATION:")
if sum(explained_var) > 0.7:
    print(f"   ✅ Great! We retained most of the important patterns")
elif sum(explained_var) > 0.5:
    print(f"   ⚠️  Decent retention - visualization should be meaningful")
else:
    print(f"   🔴 Low retention - visualization may not show all relationships")

print(f"\n🎯 COORDINATES READY FOR PLOTTING:")
print(f"   Shape: {coords_2d.shape} (each sentence → x,y coordinates)")

In [None]:
# ============================================================================
# 🎨 VISUALIZE PCA RESULTS
# ============================================================================

# Create visualization (using coords_2d from previous cell)
plt.figure(figsize=(10, 8))

colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, (lang, sentence) in enumerate(test_sentences.items()):
    plt.scatter(coords_2d[i, 0], coords_2d[i, 1], 
               c=colors[i], s=200, alpha=0.7, label=lang)
    plt.annotate(lang, (coords_2d[i, 0], coords_2d[i, 1]), 
                xytext=(10, 10), textcoords='offset points', fontsize=12)

plt.title("Sentence Embeddings in 2D Space\n(All sentences have similar meaning)", fontsize=14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Key Observation: Similar-meaning sentences in different languages should cluster together!")

print(f"\n🔬 PCA VISUALIZATION ANALYSIS:")
print(f"   📏 What distance means: Closer points = more similar semantic meaning")
print(f"   🎯 What to look for: Languages clustering together despite different words")
print(f"   ⚖️  What variance tells us: Higher variance = more distinguishable patterns")
print(f"   🌍 Cross-lingual success: Different languages expressing same meaning should be near each other")

### 📊 Understanding Similarity Values: What Do The Numbers Mean?

The cosine similarity values computed in the next cell tell us how semantically similar the sentences are. Here's how to interpret them:

**📐 Cosine Similarity Scale (typical range 0.0 to 1.0; the metric itself spans -1.0 to 1.0):**
- **0.9-1.0**: Nearly identical meaning (excellent cross-lingual alignment)
- **0.7-0.89**: High similarity (strong semantic equivalence) 
- **0.5-0.69**: Moderate similarity (related concepts, some semantic overlap)
- **0.3-0.49**: Low similarity (weakly related or different topics)
- **0.0-0.29**: Very low similarity (mostly unrelated concepts)
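
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it from scratch on tiny hand-picked vectors (not real embeddings) to show where the endpoints of the scale come from; `cosine` is a helper name chosen for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [2, 0]))    # 1.0  (same direction, length is ignored)
print(cosine([1, 0], [0, 3]))    # 0.0  (orthogonal)
print(cosine([1, 0], [-1, 0]))   # -1.0 (opposite direction)
```

Mathematically the value can drop below zero, but sentence embeddings of related text usually land in the positive range, which is why the bands above start at 0.0.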

**What To Expect for Our Semantically Equivalent Sentences:**
- **Good multilingual models**: Should show 0.7-0.9+ similarity across languages
- **Diagonal values**: Should always be 1.0 (sentence compared to itself)
- **Lower than expected scores**: May indicate model struggles with certain languages

**Real-World Implications (rough rules of thumb, not calibrated thresholds):**
- **High scores (>0.7)**: Model is a reasonable candidate for multilingual applications such as cross-lingual search or retrieval
- **Medium scores (0.5-0.7)**: Proceed with caution, may need language-specific tuning
- **Low scores (<0.5)**: Consider a different model or additional training for that language

**Why Scores Might Be Lower Than Expected:**
- Model had limited training data in the low-resource language
- Different sentence structures or vocabulary between languages  
- Domain mismatch (model trained on general text, tested on medical text)
- **Tokenization issues affecting embedding quality** ← Let's explain this!

In [None]:
# ============================================================================
# CALCULATE SIMILARITY MATRIX (Required for Later Analysis)
# ============================================================================

# Calculate pairwise cosine similarities between all sentence embeddings  
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)

print("🔗 SIMILARITY MATRIX CALCULATED")
print("=" * 50)
print(f"✅ Matrix shape: {similarity_matrix.shape}")
print(f"✅ Values usually fall between 0.0 (unrelated) and 1.0 (identical); negative values are possible")
print(f"✅ Ready for detailed analysis in upcoming cells")

# Quick preview of the matrix
print(f"\n📊 Quick Preview (first few values):")
lang_names = list(test_sentences.keys())
for i in range(min(2, len(lang_names))):
    for j in range(min(2, len(lang_names))):
        sim = similarity_matrix[i, j]
        print(f"   {lang_names[i]} ↔ {lang_names[j]}: {sim:.3f}")

print(f"\n💡 Full analysis coming in the next sections!")

### 🔧 Deep Dive: How Tokenization Issues Affect Embedding Quality

**The Connection:** Tokenization → Embeddings → Similarity Scores

This is a crucial concept that many people overlook! Here's how poor tokenization can ruin your similarity analysis:

#### 🧩 **The Process Chain:**
```
Raw Text → Tokenization → Token Embeddings → Sentence Embedding → Similarity Score
```

**When tokenization goes wrong, everything downstream suffers!**

#### 📝 **Concrete Examples (illustrative splits; the demonstration cell further down prints real tokenizer output):**

**Example 1: Word Breaking**
```
English: "carefully" → ["careful", "##ly"] (good: preserves meaning)
Low-resource: "sorgfältig" → ["so", "##r", "##g", "##fä", "##lt", "##ig"] (bad: loses word structure)
```

**Impact:** The low-resource word is broken into fragments that carry little meaning on their own, so the model has a much harder time learning that "sorgfältig" means "carefully": it never sees the word as one coherent unit.

**Example 2: Unknown Token Explosion**
```
English: "doctor" → ["doctor"] (1 token, well-known)
Low-resource: "Dokter" → ["[UNK]"] (1 unknown token, no meaning)
```

**Impact:** Every out-of-vocabulary word maps to the same generic "[UNK]" embedding, so the vector carries no trace of the medical concept.

**Example 3: Inconsistent Splitting**
```
Same concept, different tokenization:
"diagnosis" ‚Üí ["diagnosis"] 
"Diagnos" ‚Üí ["Dia", "##gno", "##s"]
```

**Impact:** Even though both mean "diagnosis," the two words start from unrelated token sequences, so the model must rely on context alone to bring their embeddings together.

#### ⚡ **The Cascade Effect:**

1. **Bad tokenization** → Fragments or unknown tokens
2. **Poor token embeddings** → Generic or meaningless vectors
3. **Bad sentence embeddings** → Average of poor-quality token vectors
4. **Low similarity scores** → Model appears to "not understand" the language
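
Step 3 refers to mean pooling, which many sentence-embedding models use to turn token vectors into one sentence vector. The sketch below shows the operation on tiny made-up 2-dimensional vectors; `mean_pool` is a hypothetical helper, and real models pool the contextual outputs of the transformer rather than hand-written numbers.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    vectors = np.asarray(token_vectors, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (vectors * mask).sum(axis=0) / max(mask.sum(), 1.0)

# Two informative tokens pointing in a similar direction
clean = mean_pool([[1.0, 0.0], [0.8, 0.2]], [1, 1])

# One informative token plus two fragment tokens; the last row is padding
noisy = mean_pool([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 1, 0])

print(clean)  # [0.9 0.1]
print(noisy)  # roughly [0.333 0.667]
```

In this toy case the two fragment vectors pull the pooled result away from the informative token, which is the mechanism behind the cascade described above.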

#### 🛡️ **How to Detect This:**
- Look at tokenization output: many tiny fragments = problem
- High number of [UNK] tokens = problem  
- Same meaning, very different token patterns = problem
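
These checks can be turned into a small report. The sketch below assumes you already have a mapping from each word to its token list (for example, built with `tokenizer.tokenize`); the function name `fragmentation_report` and the sample splits, which reuse the illustrative examples above, are assumptions made for this example.

```python
def fragmentation_report(words_to_tokens, unk_tokens=("[UNK]", "<unk>")):
    """Summarise fragmentation for a mapping of word -> list of tokens."""
    n_words = len(words_to_tokens)
    total_tokens = sum(len(toks) for toks in words_to_tokens.values())
    total_unk = sum(tok in unk_tokens
                    for toks in words_to_tokens.values() for tok in toks)
    most_fragmented = (max(words_to_tokens, key=lambda w: len(words_to_tokens[w]))
                       if n_words else None)
    return {
        "tokens_per_word": total_tokens / n_words if n_words else 0.0,
        "unk_rate": total_unk / total_tokens if total_tokens else 0.0,
        "most_fragmented": most_fragmented,
    }

# Illustrative splits, not real tokenizer output
sample = {
    "doctor": ["doctor"],
    "Diagnos": ["Dia", "##gno", "##s"],
    "Dokter": ["[UNK]"],
}
report = fragmentation_report(sample)
print(report)
```

A rising `tokens_per_word` or a non-zero `unk_rate` for one language relative to another is a signal to inspect that tokenizer more closely.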

#### 💡 **Solutions:**
- Choose models trained specifically on your target language
- Use SentencePiece-based models (better with unseen languages)
- Consider domain-specific models if your text has specialized vocabulary
- Extend or retrain the tokenizer vocabulary on your target-language data

In [None]:
# ============================================================================
# 🎓 STUDENT GUIDE: How to Use Gated Models in Colab (Optional Advanced Section)
# ============================================================================

"""
📚 QUESTION: How can students use Gemma (or other gated models) in Google Colab?

✅ ANSWER: Follow these steps (one-time setup per student):

STEP 1: Get Model Access (Outside of Colab)
==========================================
1. Go to: https://huggingface.co/google/gemma-2-2b-it
2. Click the "Request Access" button
3. Wait for approval from Google
4. You'll get an email when approved

STEP 2: Create Hugging Face Token (Outside of Colab)  
===================================================
1. Go to: https://huggingface.co/settings/tokens
2. Click "New token"
3. Choose "Read" permissions (sufficient for downloading models)
4. Copy the token (starts with "hf_...")

STEP 3: Authenticate in Colab (Every Session)
=============================================
Run this code at the start of your Colab session:
"""

print("🔑 TO USE GATED MODELS IN COLAB:")
print("1. Get model access approval (one-time)")  
print("2. Create HF token (one-time)")
print("3. Login in Colab (every session)")
print("\nExample authentication code for Colab:")
print("-" * 40)
print("# Option A: Interactive login (recommended for beginners)")
print("from huggingface_hub import notebook_login")
print("notebook_login()  # This will show a popup to enter your token")
print()
print("# Option B: Direct token login (for advanced users)")  
print("from huggingface_hub import login")
print("login(token='hf_your_token_here')  # Replace with your actual token")
print()
print("# Option C: Environment variable (only secure if the token is set outside the notebook)")
print("import os")
print("os.environ['HF_TOKEN'] = 'your_token_here'  # never commit a real token to a shared notebook")
print("from huggingface_hub import login") 
print("login()")

print(f"\n💡 AFTER AUTHENTICATION:")
print(f"   Just uncomment the gated model in the list above!")
print(f"   models_to_compare.append('google/gemma-2-2b-it')")

print(f"\n🎯 FOR INSTRUCTORS:")
print(f"   • You could demo this live for interested students")
print(f"   • Or provide it as bonus/homework material")
print(f"   • Main tutorial works fine with public models only")

In [None]:
# ============================================================================
# PRACTICAL DEMONSTRATION: Tokenization Quality Impact
# ============================================================================

def demonstrate_tokenization_quality(word_pairs, model_name):
    """
    Show how tokenization quality varies between equivalent words across languages.
    
    Args:
        word_pairs: List of (lang1_word, lang2_word, meaning) tuples
        model_name: HuggingFace model to test
    """
    print(f"\n🔍 TOKENIZATION QUALITY ANALYSIS: {model_name}")
    print("=" * 60)
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        for word1, word2, meaning in word_pairs:
            tokens1 = tokenizer.tokenize(word1)
            tokens2 = tokenizer.tokenize(word2)
            
            # Count fragmentations and unknowns
            frag1 = len(tokens1)
            frag2 = len(tokens2)
            unk1 = sum(1 for t in tokens1 if '[UNK]' in t or '<unk>' in t)
            unk2 = sum(1 for t in tokens2 if '[UNK]' in t or '<unk>' in t)
            
            # Quality assessment
            quality1 = "🟢 Good" if frag1 == 1 and unk1 == 0 else ("🟡 OK" if unk1 == 0 else "🔴 Poor")
            quality2 = "🟢 Good" if frag2 == 1 and unk2 == 0 else ("🟡 OK" if unk2 == 0 else "🔴 Poor")

            print(f"\n📝 Concept: '{meaning}'")
            print(f"   {word1:15} → {tokens1} | Fragments: {frag1}, UNK: {unk1} | {quality1}")
            print(f"   {word2:15} → {tokens2} | Fragments: {frag2}, UNK: {unk2} | {quality2}")

            # Predict embedding quality
            if quality1 == quality2 == "🟢 Good":
                prediction = "🎯 High similarity expected"
            elif "🔴 Poor" in [quality1, quality2]:
                prediction = "⚠️  Low similarity likely (tokenization issues)"
            else:
                prediction = "🤔 Moderate similarity possible"

            print(f"   💡 Similarity prediction: {prediction}")
            
    except Exception as e:
        print(f"❌ Error loading {model_name}: {e}")

# Test with concrete examples from our corpus
word_pairs = [
    ("doctor", "Dokter", "medical professional"),
    ("diagnosis", "Diagnos", "medical assessment"),  
    ("carefully", "roueg", "with care"),
    ("patient", "Patient", "sick person"),
    ("explains", "erkläert", "makes clear")
]

# Test with our models to see quality differences
test_models = ["bert-base-multilingual-cased", "xlm-roberta-base"]

for model in test_models:
    demonstrate_tokenization_quality(word_pairs, model)

print(f"\n💡 INTERPRETATION:")
print(f"   🟢 Good tokenization → Better embeddings → Higher similarity scores")
print(f"   🔴 Poor tokenization → Worse embeddings → Lower similarity scores")
print(f"   This explains why some language pairs might score lower than expected!")

In [None]:
# ============================================================================
# SIMILARITY ANALYSIS & INTERPRETATION
# ============================================================================

# Analyze the similarity results with automatic interpretation
print("🔍 DETAILED SIMILARITY ANALYSIS")
print("=" * 60)

# Get language names from our test sentences
lang_names = list(test_sentences.keys())

# Calculate cross-lingual similarities (excluding self-comparisons)
cross_lingual_similarities = []
print("\n📊 Cross-lingual Similarity Scores:")
print("-" * 40)

for i, lang1 in enumerate(lang_names):
    for j, lang2 in enumerate(lang_names):
        if i < j:  # Avoid duplicates and self-comparisons
            sim = similarity_matrix[i, j]
            cross_lingual_similarities.append(sim)
            
            # Provide automatic interpretation
            if sim >= 0.8:
                quality = "🟢 EXCELLENT"
                note = "Very strong semantic alignment"
            elif sim >= 0.7:
                quality = "🟡 GOOD"
                note = "Clear semantic similarity"
            elif sim >= 0.5:
                quality = "🟠 MODERATE"
                note = "Some semantic overlap, could be better"
            else:
                quality = "🔴 CONCERNING"
                note = "Weak alignment - investigate model/language"

            print(f"   {lang1:12} ↔ {lang2:12}: {sim:.3f} | {quality} - {note}")

# Calculate summary statistics
if cross_lingual_similarities:
    avg_similarity = sum(cross_lingual_similarities) / len(cross_lingual_similarities)
    max_similarity = max(cross_lingual_similarities)
    min_similarity = min(cross_lingual_similarities)
    
    print(f"\n📈 SUMMARY STATISTICS:")
    print(f"   • Average cross-lingual similarity: {avg_similarity:.3f}")
    print(f"   • Best language pair similarity: {max_similarity:.3f}")
    print(f"   • Worst language pair similarity: {min_similarity:.3f}")
    print(f"   • Number of language pairs: {len(cross_lingual_similarities)}")

    # Overall assessment
    print(f"\n🎯 OVERALL MODEL ASSESSMENT:")
    if avg_similarity >= 0.75:
        print(f"   🎉 EXCELLENT: This model shows strong multilingual understanding!")
        print(f"      → Suitable for production multilingual applications")
    elif avg_similarity >= 0.60:
        print(f"   ✅ GOOD: Model shows decent cross-lingual capabilities")
        print(f"      → Usable for multilingual tasks with some caution")
    elif avg_similarity >= 0.45:
        print(f"   ⚠️  FAIR: Model has limited multilingual alignment")
        print(f"      → Consider fine-tuning or using different model")
    else:
        print(f"   🚨 POOR: Model struggles with multilingual understanding")
        print(f"      → Not recommended for cross-lingual applications")

    print(f"\n💡 ACTIONABLE INSIGHTS:")
    print(f"   • Use this analysis to choose appropriate models for your languages")
    print(f"   • Lower scores indicate need for more training data or different architectures")
    print(f"   • Compare different models using this same methodology")

## Similarity Analysis & Interpretation

This section interprets cross-lingual cosine similarities and summarizes model quality with actionable insights.
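
When only the cross-lingual pairs matter, the upper triangle of the matrix can be pulled out in one step with `np.triu_indices`. The sketch below uses a hand-written 3x3 matrix with placeholder scores (not results from the embedding model), and `cross_pairs` is a helper name invented for this example.

```python
import numpy as np

def cross_pairs(matrix, labels):
    """Return (label_i, label_j, score) for every pair i < j, best first."""
    matrix = np.asarray(matrix, dtype=float)
    rows, cols = np.triu_indices(len(labels), k=1)  # k=1 skips the diagonal
    pairs = [(labels[i], labels[j], float(matrix[i, j]))
             for i, j in zip(rows, cols)]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Placeholder similarity values for illustration only
toy_matrix = [[1.0, 0.6, 0.9],
              [0.6, 1.0, 0.7],
              [0.9, 0.7, 1.0]]
pairs = cross_pairs(toy_matrix, ["English", "Luxembourgish", "French"])

for lang1, lang2, score in pairs:
    print(f"{lang1} <-> {lang2}: {score:.3f}")
print("mean:", sum(p[2] for p in pairs) / len(pairs))
```

Sorting the pairs makes the weakest language pair easy to spot, which is usually the one worth investigating first.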

In [None]:
# Calculate semantic similarities
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

print("🔍 SEMANTIC SIMILARITY ANALYSIS")
print("\nSimilarity Matrix (1.0 = identical, 0.0 = unrelated):")
print()

# Create a nice formatted table
lang_names = list(test_sentences.keys())
print(f"{'Language':<12} ", end="")
for lang in lang_names:
    print(f"{lang:<10}", end="")
print()

for i, lang1 in enumerate(lang_names):
    print(f"{lang1:<12} ", end="")
    for j, lang2 in enumerate(lang_names):
        sim = similarity_matrix[i, j]
        print(f"{sim:.3f}     ", end="")
    print()

print(f"\n💡 Cross-lingual similarities (excluding self-comparisons):")
for i, lang1 in enumerate(lang_names):
    for j, lang2 in enumerate(lang_names):
        if i < j:  # Avoid duplicates
            sim = similarity_matrix[i, j]
            print(f"   {lang1} ↔ {lang2}: {sim:.3f}")

# Chapter 3: Model Comparison Summary

Let's summarize what we've learned about different models and languages:

In [None]:
# Create a summary of our analysis
summary_df = df_results.pivot_table(
    index='language', 
    columns='model', 
    values=['tokens_per_word', 'num_tokens'], 
    aggfunc='mean'
).round(2)

print("📊 TOKENIZATION EFFICIENCY SUMMARY")
print("=" * 50)
print("\nTokens per word (lower = more efficient):")
print(summary_df['tokens_per_word'])

print("\nTotal tokens per sentence:")
print(summary_df['num_tokens'])

# Find the most efficient model for each language
print("\n🔍 DEBUG INFO:")
print(f"   Languages in test_sentences: {list(test_sentences.keys())}")
print(f"   Languages in df_results: {list(df_results['language'].unique())}")
print(f"   DataFrame shape: {df_results.shape}")

print("\n🏆 RECOMMENDATIONS:")
# Use the languages that actually exist in the DataFrame to avoid errors
for lang in df_results['language'].unique():
    lang_data = df_results[df_results['language'] == lang]
    
    if not lang_data.empty and len(lang_data) > 0:
        try:
            best_idx = lang_data['tokens_per_word'].idxmin()
            best_model = lang_data.loc[best_idx, 'model']
            best_ratio = lang_data['tokens_per_word'].min()
            print(f"   {lang:15}: Best model is {best_model} (ratio: {best_ratio:.2f})")
        except Exception as e:
            print(f"   {lang:15}: Error processing data - {str(e)}")
    else:
        print(f"   {lang:15}: No data available")

In [None]:
# ============================================================================
# 🔧 FIX DATA STRUCTURE (Ensure df_results is a DataFrame)
# ============================================================================

print("🔍 CHECKING DATA STRUCTURE:")
print(f"   Type of df_results: {type(df_results)}")

# Ensure df_results is a DataFrame (fix for AttributeError)
if isinstance(df_results, list):
    print("   ⚠️  Converting list to DataFrame...")
    df_results = pd.DataFrame(df_results)
    print(f"   ✅ Converted! Shape: {df_results.shape}")
    print(f"   📊 Columns: {list(df_results.columns)}")
else:
    print(f"   ✅ Already a DataFrame! Shape: {df_results.shape}")

print(f"\n🎯 READY FOR ANALYSIS!")
print(f"   Data type: {type(df_results)}")
print(f"   Available for pivot_table operations")

# 🎓 Session 1 Complete

## What You've Learned

Congratulations! You've explored the core foundations of Large Language Models:

- ✅ **Tokenization**: How models convert text into processable tokens
- ✅ **Cross-lingual Analysis**: Understanding language differences in model processing
- ✅ **Text Embeddings**: Converting text to meaningful vector representations
- ✅ **Model Comparison**: Evaluating different architectures for your needs
- ✅ **Practical Skills**: Analyzing tokenization quality and embedding behavior


---

## 📚 Optional: Try It Yourself - Dialogue Summarization

*Want to apply these concepts? Try creating your own dialogue summarization system using the foundations you've learned:*

1. **Choose your own dialogue data** (conversations, meetings, chat logs)
2. **Apply tokenization analysis** to understand processing costs
3. **Use embeddings** to find similar conversation segments  
4. **Compare models** for your specific language/domain
5. **Implement TextRank** for extractive summarization (research the algorithm!)

*This makes great homework or project work to deepen your understanding!*

### Your Toolkit for Future Projects

```python
# Core functions you can reuse:
show_tokenization(model_name, text)       # Compare tokenization efficiency
embedder.encode(sentences)                # Create semantic embeddings
cosine_similarity(embeddings)            # Measure text similarity
```