# Session 1: Foundations of Large Language Models 🤖

<div align="center">


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/Session1_Foundations_of_Large_Language_Models.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

</div>

---

**Core Concepts:**
- **LLM Architecture** - Understand transformer models and attention mechanisms
- **Tokenization** - How models process and understand text across languages
- **Text Representation** - Embeddings, vectors, and semantic similarity
- **Model Comparison** - Analyze different LLM architectures and capabilities
- **Low-Resource Considerations** - Challenges with underrepresented languages

**Practical Skills:**
- Compare tokenization across different models
- Analyze model behavior with multilingual text
- Implement basic text processing pipelines
- Evaluate model performance on various languages
- Build foundation for advanced NLP applications

**Why This Matters:** Understanding LLM fundamentals is crucial for effective use in real-world applications, especially when working with diverse languages and limited computational resources.


## Course Context

| Session | Focus | Techniques | Prerequisites |
|---------|-------|------------|---------------|
| **Session 0** | Setup & Orientation | Environment, Basic Concepts | None |
| **→ This Session** | **LLM Foundations** | **Tokenization, Embeddings, Model Analysis** | **Session 0** |
| **Session 2** | Prompt Engineering | Advanced Prompting, Chain-of-Thought | Sessions 0-1 |
| **Session 3** | Fine-tuning | LoRA, QLoRA, Custom Training | Sessions 0-2 |
| **Session 4** | Bias & Ethics | Fairness, Evaluation, Mitigation | Sessions 0-3 |


## 🛠️ Environment Setup

### What This Section Does
This section installs the libraries needed to explore Large Language Model foundations. The packages are chosen for **interactive learning**: they are lightweight, and everything in this session runs on CPU, with a GPU as an optional speed-up.

### Why These Specific Packages?

**Core Dependencies:**
- `numpy` + `pandas`: Essential for data manipulation and analysis
- `scikit-learn`: Similarity metrics and basic ML utilities
- `matplotlib`: Visualization of model behaviors and comparisons

**LLM Ecosystem:**
- `transformers`: Access to pretrained models and tokenizers
- `sentence-transformers`: Semantic embeddings and similarity
- `torch`: PyTorch backend for model operations

In [None]:
# Quick setup for this session
!pip install -q transformers sentence-transformers scikit-learn matplotlib pandas

In [None]:
# Core imports for LLM foundations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
import torch

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ Environment ready for LLM foundations exploration!")

# Chapter 1: Understanding Tokenization

## What We'll Explore

Tokenization is how models convert text into numbers they can process. Let's see how this works with different languages and models.
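
To make the idea concrete before loading real models, here is a minimal sketch of WordPiece-style tokenization: each word is split greedily into the longest pieces found in a vocabulary, and each piece maps to an integer ID. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and chosen only for illustration; real tokenizers learn vocabularies of 100k+ subwords from data and handle punctuation, casing, and special tokens.

```python
# Toy WordPiece-style tokenizer: greedy longest-match against a fixed vocabulary.
# TOY_VOCAB is hypothetical and tiny; real models learn their vocabulary from data.
TOY_VOCAB = ["[UNK]", "the", "doctor", "explain", "##s", "dia", "##gno", "##sis"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def split_word(word):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text):
    """Lowercase, split on whitespace, then split each word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(split_word(word))
    return tokens


tokens = toy_tokenize("The doctor explains the diagnosis")
ids = [TOKEN_TO_ID[t] for t in tokens]
print(tokens)  # subword strings
print(ids)     # the integers a model would actually receive
```

In this toy setup, five words become eight tokens because "explains" and "diagnosis" are not in the vocabulary as whole words. The same effect, at a larger scale, is what drives up token counts for languages that are poorly covered by a model's vocabulary.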

### Step 1: Prepare Test Sentences

**Model Selection:** We'll compare two popular multilingual models from [Hugging Face Hub](https://huggingface.co/models):

- **BERT** (Google, `bert-base-multilingual-cased`): Bidirectional Encoder Representations from Transformers, one of the first widely adopted transformer models; we use its multilingual variant
- **XLM-RoBERTa** (Facebook, `xlm-roberta-base`): Cross-lingual Language Model based on RoBERTa, designed specifically for multilingual tasks

The names in backticks are the identifiers used to download each model from Hugging Face's model repository.

In [None]:
"""
Multilingual Test Corpus Definition

This corpus contains semantically equivalent sentences across three languages 
representing different language families and resource levels:
- English: Germanic, high-resource language
- Luxembourgish: Germanic, low-resource language
- French: Romance, high-resource language

Domain: Medical/Healthcare (to test domain-specific tokenization)
Semantic equivalence: All sentences convey the same meaning

Research Question: 
    How do multilingual models handle typologically similar vs. different 
    languages with varying resource availability?

Expected Findings (Hypothesis):
    1. Resource Availability Effect:
       - English & French (high-resource) → Lower tokens-per-word ratio
       - Luxembourgish (low-resource) → Higher tokens-per-word ratio
       - Reason: Models trained predominantly on high-resource languages learn
                 better subword representations for those languages
    
    2. Typological Similarity:
       - English ↔ Luxembourgish (both Germanic): May show some overlap in 
         tokenization patterns despite resource difference
       - French (Romance) vs. Germanic languages: Different morphological 
         patterns may lead to different tokenization strategies
    
    3. Model Architecture Differences:
       - BERT (multilingual): Trained on Wikipedia only, so low-resource
         languages contribute little data; may show stronger resource bias
       - XLM-RoBERTa: Trained on 2.5TB of CommonCrawl text in 100 languages,
         may handle low-resource languages more efficiently

Practical Implications:
    If Luxembourgish requires 2-3x more tokens than English:
    → Processing costs increase proportionally
    → Context window fills up faster (fewer words fit in same token budget)
    → Inference latency increases
    → This quantifies the "low-resource penalty" in production systems

Note: You may substitute these examples with sentences from your target language
      and domain for comparative analysis.
"""

# Multilingual test corpus
test_sentences = {
    "English": "The doctor explains the diagnosis carefully to the patient.",
    "Luxembourgish": "Den Dokter erkläert d'Diagnos ganz roueg dem Patient.",
    "French": "Le médecin explique le diagnostic avec soin au patient."
}

# Display corpus for verification
print("=" * 70)
print("MULTILINGUAL TEST CORPUS")
print("=" * 70)
for language, sentence in test_sentences.items():
    word_count = len(sentence.split())
    char_count = len(sentence)
    print(f"\n{language:15} | Words: {word_count:2d} | Characters: {char_count:3d}")
    print(f"{'':15} | {sentence}")
print("\n" + "=" * 70)

### Step 2: Compare Tokenization Across Models

In [None]:
# ============================================================================
# MODEL SELECTION FROM HUGGING FACE HUB
# ============================================================================
# These model identifiers come from Hugging Face Hub (https://huggingface.co/models)
# - A repository hosting 500,000+ pre-trained models from research labs and community
# - Model identifiers have the form "organization/model-name"; some older models,
#   including the two used below, can also be loaded by a bare legacy name
# - Models can be loaded directly using these identifiers (no manual download needed)
#
# How to find models:
# 1. Visit: https://huggingface.co/models
# 2. Filter by task: "Fill-Mask" or "Text Classification" for tokenizers
# 3. Filter by language: Select your target language(s)
# 4. Sort by: "Most downloads" or "Trending" for popular models
#
# Example searches:
# - Multilingual models: Filter by "multilingual" tag
# - Language-specific: Search "arabic-bert" or "french-camembert"
# - Domain-specific: Search "biobert" (medical) or "finbert" (finance)
#
# Selected Models for This Analysis:
# ----------------------------------
models_to_test = [
    "bert-base-multilingual-cased",    # Google BERT: 104 languages, ~178M parameters (large multilingual vocabulary)
                                        # Link: https://huggingface.co/bert-base-multilingual-cased
                                        # Trained on: Wikipedia in 104 languages
    
    "xlm-roberta-base"                 # Facebook XLM-RoBERTa: 100 languages, 270M parameters  
                                        # Link: https://huggingface.co/xlm-roberta-base
                                        # Trained on: CommonCrawl in 100 languages (2.5TB data)
]

# Alternative models you can try (add to models_to_test):
# "distilbert-base-multilingual-cased"  # Faster, lighter version (134M params)
# "google/mt5-small"                     # Google's multilingual T5 (300M params)
# "microsoft/mdeberta-v3-base"          # Microsoft's DeBERTa (multilingual, 278M params)

print("📚 Loading models from Hugging Face Hub...")
print("   These will be downloaded automatically on first use (cached locally afterward)")

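from functools import lru_cache

@lru_cache(maxsize=None)
def load_tokenizer(model_name):
    """Load each tokenizer once and reuse it across sentences."""
    return AutoTokenizer.from_pretrained(model_name)

def analyze_tokenization(text, model_name):
    """Count the subword tokens a model produces and compare with the word count."""
    tokenizer = load_tokenizer(model_name)
    tokens = tokenizer.tokenize(text)  # subword strings, without special tokens
    words = text.split()               # whitespace-separated words

    return {
        'model': model_name.split('/')[-1],
        'num_tokens': len(tokens),
        'num_words': len(words),
        'tokens_per_word': len(tokens) / len(words) if words else 0,
        'tokens_preview': tokens[:8]  # First 8 tokens
    }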

print("🔍 TOKENIZATION ANALYSIS\n" + "="*60)

results = []
for lang, sentence in test_sentences.items():
    print(f"\n📝 {lang}: {sentence}")
    print()
    
    for model_name in models_to_test:
        result = analyze_tokenization(sentence, model_name)
        results.append({**result, 'language': lang, 'sentence': sentence})
        
        print(f"  🤖 {result['model']:25} | Tokens: {result['num_tokens']:2d} | Ratio: {result['tokens_per_word']:.2f}")
        print(f"     Sample tokens: {' '.join(result['tokens_preview'][:5])}...")

# Create summary DataFrame
df_results = pd.DataFrame(results)
print(f"\n📊 Summary saved to DataFrame with {len(df_results)} comparisons")

### 🤔 Reflection Questions

Look at the results above and consider:

- Which language uses the most tokens per word?
- How might more tokens affect inference cost and speed?
- Do you see any unusual token splits (broken words, weird subwords)?

**Key Insight:** Languages with fewer training examples often get split into more subword tokens, increasing computational costs.
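
To see what a higher tokens-per-word ratio means in practice, the arithmetic can be sketched as follows. The ratios and the 512-token budget below are hypothetical placeholders, not measured results; substitute the values from `df_results` for your own languages and models.

```python
def words_that_fit(context_tokens, tokens_per_word):
    """Approximate number of words that fit in a fixed token budget."""
    return int(context_tokens / tokens_per_word)


def relative_cost(tokens_per_word, baseline_tokens_per_word):
    """How many times more tokens a text needs compared with a baseline language."""
    return tokens_per_word / baseline_tokens_per_word


# Hypothetical ratios for illustration only; replace with values from df_results.
CONTEXT_TOKENS = 512
example_ratios = {"high-resource": 1.25, "low-resource": 2.5}

baseline = example_ratios["high-resource"]
for label, ratio in example_ratios.items():
    fit = words_that_fit(CONTEXT_TOKENS, ratio)
    cost = relative_cost(ratio, baseline)
    print(f"{label:14} | {ratio:.2f} tokens/word | ~{fit} words fit | {cost:.1f}x tokens")
```

With these placeholder numbers, doubling the ratio halves the number of words that fit in the same context window and doubles the token count for the same text. Real latency and pricing depend on the model and provider, so treat this only as a first-order estimate.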

# 📊 Chapter 2: Text Embeddings & Semantic Similarity

## Understanding Vector Representations

**What are embeddings?** Fixed-length vectors of numbers that place a text in a high-dimensional space, so that texts with similar meaning end up close together.

Let's see how a multilingual sentence-embedding model creates these representations!
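
Before using a real model, it may help to see how "closeness" between vectors is usually measured. The sketch below computes cosine similarity by hand on three made-up 3-dimensional vectors; the words and numbers are hypothetical and are not output from any model (real sentence embeddings have hundreds of dimensions).

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings" for illustration only
toy_vectors = {
    "doctor": [3.0, 4.0, 0.0],
    "physician": [6.0, 8.0, 0.0],  # same direction as "doctor", twice the length
    "banana": [0.0, 0.0, 5.0],     # orthogonal to both
}

print(cosine(toy_vectors["doctor"], toy_vectors["physician"]))  # 1.0
print(cosine(toy_vectors["doctor"], toy_vectors["banana"]))     # 0.0
```

Cosine similarity looks only at the angle between vectors, not their length, which is why "doctor" and "physician" score 1.0 here even though one vector is twice as long.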

In [None]:
# Load a multilingual sentence embedding model
embedder_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedder = SentenceTransformer(embedder_name, device=device)

print(f"📊 Loaded embedding model: {embedder_name}")

# Get embeddings for our test sentences
sentences = list(test_sentences.values())
languages = list(test_sentences.keys())

embeddings = embedder.encode(sentences, convert_to_numpy=True)
print(f"✅ Created embeddings with shape: {embeddings.shape}")
print(f"   Each sentence → {embeddings.shape[1]} dimensional vector")

In [None]:
# Reduce to 2D for visualization
pca = PCA(n_components=2, random_state=42)
coords_2d = pca.fit_transform(embeddings)

# Create visualization
plt.figure(figsize=(10, 8))

colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, (lang, sentence) in enumerate(test_sentences.items()):
    plt.scatter(coords_2d[i, 0], coords_2d[i, 1], 
               c=colors[i], s=200, alpha=0.7, label=lang)
    plt.annotate(lang, (coords_2d[i, 0], coords_2d[i, 1]), 
                xytext=(10, 10), textcoords='offset points', fontsize=12)

plt.title("Sentence Embeddings in 2D Space\n(All sentences have similar meaning)", fontsize=14)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Key Observation: Similar-meaning sentences in different languages should cluster together!")

In [None]:
# Calculate semantic similarities
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

print("🔍 SEMANTIC SIMILARITY ANALYSIS")
print("\nCosine similarity matrix (1.0 = same direction, near 0.0 = unrelated):")
print()

# Print a formatted table; the column width fits the longest language name
lang_names = list(test_sentences.keys())
col_width = max(len(name) for name in lang_names) + 2
print(f"{'Language':<{col_width}}", end="")
for lang in lang_names:
    print(f"{lang:<{col_width}}", end="")
print()

for i, lang1 in enumerate(lang_names):
    print(f"{lang1:<{col_width}}", end="")
    for j in range(len(lang_names)):
        print(f"{similarity_matrix[i, j]:<{col_width}.3f}", end="")
    print()

print("\n💡 Cross-lingual similarities (excluding self-comparisons):")
for i, lang1 in enumerate(lang_names):
    for j, lang2 in enumerate(lang_names):
        if i < j:  # Avoid duplicates
            sim = similarity_matrix[i, j]
            print(f"   {lang1} ↔ {lang2}: {sim:.3f}")

# Chapter 3: Model Comparison Summary

Let's summarize what we've learned about different models and languages:

In [None]:
# Create a summary of our analysis
summary_df = df_results.pivot_table(
    index='language', 
    columns='model', 
    values=['tokens_per_word', 'num_tokens'], 
    aggfunc='mean'
).round(2)

print("📊 TOKENIZATION EFFICIENCY SUMMARY")
print("=" * 50)
print("\nTokens per word (lower = more efficient):")
print(summary_df['tokens_per_word'])

print("\nTotal tokens per sentence:")
print(summary_df['num_tokens'])

# Find the most token-efficient model for each language.
# Note: this rests on a single sentence per language, so treat it as indicative only.
print("\n🏆 RECOMMENDATIONS:")
for lang in test_sentences.keys():
    lang_data = df_results[df_results['language'] == lang]
    best_model = lang_data.loc[lang_data['tokens_per_word'].idxmin(), 'model']
    best_ratio = lang_data['tokens_per_word'].min()
    print(f"   {lang:12}: Best model is {best_model} (ratio: {best_ratio:.2f})")

# 🎓 Session 1 Complete: LLM Foundations Mastered!

## 🎯 What You've Learned

Congratulations! You've explored the core foundations of Large Language Models:

- ✅ **Tokenization**: How models convert text into processable tokens
- ✅ **Cross-lingual Analysis**: Understanding language differences in model processing
- ✅ **Text Embeddings**: Converting text to meaningful vector representations
- ✅ **Model Comparison**: Evaluating different architectures for your needs
- ✅ **Practical Skills**: Analyzing tokenization quality and embedding behavior

## 🚀 Ready for Next Steps

With this foundation, you're prepared for:
- **Session 2**: Advanced Prompt Engineering 
- **Session 3**: Fine-tuning Techniques
- **Session 4**: Bias and Ethical Considerations

---

## 📚 Optional: Try It Yourself - Dialogue Summarization

*Want to apply these concepts? Try creating your own dialogue summarization system using the foundations you've learned:*

1. **Choose your own dialogue data** (conversations, meetings, chat logs)
2. **Apply tokenization analysis** to understand processing costs
3. **Use embeddings** to find similar conversation segments  
4. **Compare models** for your specific language/domain
5. **Implement TextRank** for extractive summarization (research the algorithm!)
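
As a starting point for step 5, here is a minimal TextRank-style sketch: sentences are nodes in a graph, edge weights are pairwise similarities, and a PageRank-style iteration scores each sentence. It is a simplified illustration, not a reference implementation. In particular, it uses Jaccard word overlap as a stand-in for the embedding similarity you would get from `embedder.encode`, and the function names and the mini-dialogue are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (a stand-in for embeddings)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank-style power iteration on a similarity graph."""
    n = len(sentences)
    # Edge weights: similarity between every pair of distinct sentences
    weights = [[jaccard(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, top_k=1):
    """Return the top_k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_k])]


# Hypothetical mini-dialogue for illustration
dialogue = [
    "the patient reports a headache",
    "the doctor asks about the headache",
    "the doctor explains the headache diagnosis to the patient",
    "see you next week",
]
print(summarize(dialogue, top_k=1))
```

The third sentence is selected because it shares the most vocabulary with the others, while the unrelated closing line receives the lowest score. Swapping `jaccard` for cosine similarity over sentence embeddings is a natural next experiment, especially for multilingual dialogues where word overlap is low.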

*This makes great homework or project work to deepen your understanding!*

### 🎯 Your Toolkit for Future Projects

```python
# Core functions you can reuse:
analyze_tokenization(text, model_name)    # Compare tokenization efficiency
embedder.encode(sentences)                # Create semantic embeddings
cosine_similarity(embeddings)            # Measure text similarity
```

**🌟 Achievement Unlocked: LLM Foundations Expert! 💎**