# 🧠 Session 1: Foundations of Large Language Models

<div align="center">

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/1_Foundations_of_Large_Language_Models.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

</div>

---

## 🎯 What You'll Learn

This notebook introduces the **fundamental concepts** behind Large Language Models through hands-on exploration. You'll see how LLMs process text, and you'll work with multilingual models to check how well your language is represented.

**🔬 Core Focus:** Understanding tokenization, embeddings, and transformers  
**💻 Requirements:** CPU is sufficient - no GPU needed!  
**🌍 Language Support:** Multilingual examples with customizable content

---

## 📋 Table of Contents

1. [🛠️ Setup & Installation](#setup)
2. [🤔 Why LLMs Matter](#why-llms)
3. [🔤 Tokenization Deep Dive](#tokenization)
4. [🎯 Sentence Embeddings](#embeddings)
5. [⚙️ Inside Transformers](#transformers)
6. [🌍 Multilingual Exploration](#multilingual)
7. [🎓 Wrap-up & Next Steps](#wrap-up)

---

## 🎯 Learning Objectives

By the end of this session, you will:

✅ **Understand** why LLMs revolutionized NLP  
✅ **Explore** how tokenization affects different languages  
✅ **Visualize** sentence embeddings in 2D space  
✅ **Examine** transformer architecture basics  
✅ **Compare** multilingual model performance  
✅ **Reflect** on low-resource language representation

---

## 📖 Prerequisites

**Recommended learning path:**
1. **Session 0:** Basic Python and NLP concepts *(optional)*
2. **This session:** LLM foundations *(you are here)*
3. **Session 2:** Advanced applications and fine-tuning

---

## 🚀 How to Use This Notebook

- 🔍 **Checkpoint cells** mark good stopping points for discussion
- 🎯 **Challenge cells** contain optional exercises
- 📝 **Reflection cells** encourage you to think about the concepts
- ⚡ **Run cells in order** - dependencies build on each other
- 🔄 **If stuck:** Restart runtime and re-run from the top


## 🛠️ Setup & Installation {#setup}

This section installs the packages required for our LLM exploration. We keep dependencies minimal so the notebook stays lightweight and fast.

### 📦 What We'll Install

- **transformers** - For working with pre-trained models
- **sentence-transformers** - For sentence embeddings
- **scikit-learn** - For PCA and basic ML utilities
- **matplotlib** - For visualizations
- **numpy & pandas** - For data manipulation
- **torch** - The deep learning backend used by the models above

### 🚀 Quick Installation

Choose your preferred installation method:


In [None]:
# 🚀 Quick Installation (Option A - Recommended)
# This installs only what we need for this notebook

!pip install -q transformers sentence-transformers scikit-learn matplotlib numpy pandas torch

In [None]:
# 🔧 Alternative Installation (Option B - More Control)
# Use this if you want to see exactly what's being installed

import sys
import subprocess

def install_packages():
    """Install required packages with error handling"""
    packages = [
        "transformers==4.49.0",
        "sentence-transformers", 
        "scikit-learn",
        "matplotlib",
        "numpy",
        "pandas",
        "torch"
    ]
    
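    print("🔄 Installing packages...")
    failed = []
    for pkg in packages:
        try:
            print(f"  📦 Installing {pkg}")
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])
        except subprocess.CalledProcessError as e:
            print(f"  ❌ Failed to install {pkg}: {e}")
            failed.append(pkg)

    # Only report success if every package installed cleanly
    if failed:
        print(f"⚠️ Finished with errors. Not installed: {', '.join(failed)}")
    else:
        print("✅ Installation complete!")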

# Uncomment the line below to run the installation
# install_packages()


## 📚 Import Libraries & Setup

Let's import all the libraries we'll need and set up our environment for reproducible results.

In [None]:
# 📚 Core imports
import os
import random
import warnings
from typing import List, Dict

# 🔢 Data & ML libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 🤗 Hugging Face libraries
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
import torch

# 🎨 Configure plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# 🎯 Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# 🖥️ Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️  Using device: {device}")
print(f"🎲 Random seed: {SEED}")

# 🔇 Suppress warnings for cleaner output
warnings.filterwarnings('ignore')




## 🤔 Why Do LLMs Matter? {#why-llms}

Large Language Models have revolutionized how we approach natural language processing. Let's understand why they're so powerful.

### 🔄 The Old Way vs. The New Way

**Traditional NLP (Pre-2017):**
- 🔧 **Task-specific models** - Different model for each task
- ✋ **Hand-crafted features** - Manual feature engineering
- 📊 **Limited data** - Smaller, task-specific datasets
- 🎯 **Single purpose** - One model = one task

**LLMs (2017+):**
- 🧠 **One large model** - Single model for multiple tasks
- 🤖 **Learned representations** - Features learned automatically
- 📚 **Massive data** - Trained on internet-scale text
- 🎭 **Multi-purpose** - Same model for many tasks

### 🎯 The Magic of Prompting

With **prompting**, the same LLM can:

- 🌐 **Translate** languages
- 📝 **Summarize** documents  
- ❓ **Answer** questions
- 💻 **Generate** code
- 🎨 **Write** creatively

### 🔬 How LLMs Work (3 Key Steps)

1. **🔤 Tokenization** - Break text into pieces (tokens)
2. **📊 Embeddings** - Convert tokens to numerical vectors
3. **⚙️ Transformers** - Process sequences with attention
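
The three steps above can be sketched end to end with a toy example. This is a minimal illustration under stated assumptions, not how a real LLM is built: the five-word vocabulary, the random embedding table, and the single attention step without learned projections are all made up for demonstration.

```python
import numpy as np

# Hypothetical toy vocabulary; real models learn tens of thousands of subword tokens.
vocab = {"<unk>": 0, "the": 1, "doctor": 2, "explains": 3, "diagnosis": 4}

def toy_tokenize(text):
    """Step 1: split on whitespace and map each word to an integer id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

rng = np.random.default_rng(0)
d_model = 8  # assumed size of each token vector
embedding_table = rng.normal(size=(len(vocab), d_model))  # random, not trained

def self_attention(x):
    """Step 3: one scaled dot-product attention step (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])          # [seq_len, seq_len]
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x, weights

token_ids = toy_tokenize("The doctor explains the diagnosis")
token_vectors = embedding_table[token_ids]           # Step 2: embedding lookup
contextual, attn = self_attention(token_vectors)

print("token ids:", token_ids)
print("embeddings shape:", token_vectors.shape)
print("contextual shape:", contextual.shape)
```

Each row of `attn` sums to 1: it says how much every token "looks at" every other token when its contextual vector is computed.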

---

### 🎯 What We'll Explore

In this notebook, we won't train a model from scratch. Instead, we'll **look inside** existing multilingual models to understand:

- How they process your language
- What they "see" when they read text
- How well they represent low-resource languages


## 🌍 Multilingual Text Examples {#multilingual}

Let's start our exploration with sample sentences in two languages. This will help us understand how LLMs handle different languages.

### 📝 Sample Languages

- **🇬🇧 English** (`en`) - High-resource language
- **🇱🇺 Luxembourgish** (`lb`) - Low-resource language *(you can change this!)*

### ✏️ Customize Your Language

Feel free to replace the Luxembourgish examples with sentences in your own language of interest. Just update the `LOW_RESOURCE_LANG` variable and the example sentences below.


In [None]:
# 🌍 Define your languages and example sentences
# 💡 Feel free to change LOW_RESOURCE_LANG and the example sentences!

LOW_RESOURCE_LANG = "lb"  # Options: "lb", "hy", "fa", "sw", etc.

examples = [
    # 🇬🇧 English examples
    {"id": "en_1", "lang": "en", "text": "The doctor explains the diagnosis carefully."},
    {"id": "en_2", "lang": "en", "text": "Students are learning about large language models."},
    {"id": "en_3", "lang": "en", "text": "The weather is cloudy today but we still go for a walk."},

    # 🇱🇺 Luxembourgish examples (replace with your language!)
    {"id": f"{LOW_RESOURCE_LANG}_1", "lang": LOW_RESOURCE_LANG, "text": "Den Dokter erkläert d'Diagnos ganz roueg."},
    {"id": f"{LOW_RESOURCE_LANG}_2", "lang": LOW_RESOURCE_LANG, "text": "D'Studenten léieren iwwer grouss Sproochmodeller."},
    {"id": f"{LOW_RESOURCE_LANG}_3", "lang": LOW_RESOURCE_LANG, "text": "D'Wieder ass haut wollekeg, mee mir ginn trotzdeem spadséieren."},
]

print("📝 Our example sentences:")
print("=" * 50)
for ex in examples:
    flag = "🇬🇧" if ex['lang'] == 'en' else "🇱🇺"
    print(f"{flag} [{ex['lang']}] {ex['id']}: {ex['text']}")


## 🔤 Tokenization Deep Dive {#tokenization}

**Tokenization** is the first step in how LLMs process text. Let's explore how different models break down our sentences into tokens.

### 🔍 What is Tokenization?

Models don't see raw text - they see **tokens**! Tokens are usually subword units that help models handle:
- **Unknown words** by breaking them into familiar pieces
- **Different languages** with shared subword patterns
- **Morphologically rich languages** by capturing stems and affixes
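
To make "breaking unknown words into familiar pieces" concrete, here is a small sketch of greedy longest-match splitting in the style of WordPiece (the scheme used by BERT-family tokenizers). The tiny vocabulary is invented for illustration, and `greedy_subword_tokenize` is a hypothetical helper, not a library function. Real tokenizers learn their vocabularies from data, and SentencePiece-based models such as XLM-RoBERTa use a different algorithm.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that continue a word carry a '##' prefix, as in WordPiece-style vocabularies.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Invented toy vocabulary (an assumption for this example only)
toy_vocab = {"token", "##ization", "##s", "sprooch", "##modell", "##er"}

for word in ["tokenization", "tokens", "sproochmodeller", "xyz"]:
    print(word, "->", greedy_subword_tokenize(word, toy_vocab))
```

A word that is frequent in the training data tends to survive as one or two pieces, while a rare word is cut into many small ones. This is one reason low-resource languages often end up with more tokens per sentence.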

### 🎯 Our Exploration Plan

1. **Load** a multilingual tokenizer
2. **Analyze** how it splits our sentences  
3. **Compare** token counts across languages
4. **Reflect** on implications for low-resource languages

### 🤖 Models to Try

Feel free to experiment with different tokenizers:
- `bert-base-multilingual-cased` - BERT's multilingual tokenizer
- `xlm-roberta-base` - XLM-RoBERTa (often better for low-resource languages)
- `google/mt5-small` - Multilingual T5


### 🔬 Tokenization Analysis

In [None]:
# 🤖 Load a multilingual tokenizer
multilingual_model_name = "xlm-roberta-base"  # Feel free to change this!

print(f"🔄 Loading tokenizer: {multilingual_model_name}")
tokenizer = AutoTokenizer.from_pretrained(multilingual_model_name)
print("✅ Tokenizer loaded successfully!")

def inspect_tokenization(examples, tokenizer, max_tokens_to_show=15):
    """Analyze how the tokenizer processes our example sentences"""
    results = []
    for ex in examples:
        tokens = tokenizer.tokenize(ex["text"])
        results.append({
            "id": ex["id"],
            "lang": ex["lang"], 
            "text": ex["text"],
            "n_tokens": len(tokens),
            "tokens_preview": tokens[:max_tokens_to_show],
            "tokens_full": tokens
        })
    return results

# 🔍 Analyze tokenization
print(f"\n🔤 Tokenization Analysis with {multilingual_model_name}")
print("=" * 70)

tokenization_results = inspect_tokenization(examples, tokenizer)

for result in tokenization_results:
    flag = "🇬🇧" if result['lang'] == 'en' else "🇱🇺"
    print(f"\n{flag} {result['id']} ({result['lang']})")
    print(f"📝 Text: {result['text']}")
    print(f"🔢 Tokens: {result['n_tokens']}")
    print(f"🔤 Preview: {result['tokens_preview']}")
    # Use the actual preview length instead of a hard-coded 15
    remaining = result['n_tokens'] - len(result['tokens_preview'])
    if remaining > 0:
        print(f"   ... and {remaining} more tokens")


### 📊 Token Count Summary

In [None]:
# 📊 Create summary statistics by language
df_tokens = pd.DataFrame(tokenization_results)
summary = df_tokens.groupby("lang")["n_tokens"].agg(["mean", "min", "max", "count"]).round(2)

print("📊 Token Count Statistics by Language:")
print("=" * 40)
print(summary)

# 📈 Quick comparison
en_avg = summary.loc['en', 'mean']
lr_avg = summary.loc[LOW_RESOURCE_LANG, 'mean']
ratio = lr_avg / en_avg

print("\n🔍 Key Insights:")
print(f"   🇬🇧 English average: {en_avg:.1f} tokens")
print(f"   🇱🇺 {LOW_RESOURCE_LANG.upper()} average: {lr_avg:.1f} tokens")
print(f"   📈 Ratio: {ratio:.2f}x {'more' if ratio > 1 else 'fewer'} tokens for {LOW_RESOURCE_LANG.upper()}")


### 🤔 Reflection: What Do These Results Mean?

Take a moment to analyze the tokenization results above. Consider these important questions:

#### 🔍 Analysis Questions

1. **Token Efficiency**: Do English sentences use more or fewer tokens than your low-resource language?

2. **Quality Issues**: Do you notice any problematic splits?
   - Broken accents or diacritics?
   - Words split into meaningless pieces?
   - Proper nouns fragmented?

3. **Practical Implications**: How might different token counts affect:
   - ⚡ **Speed** - More tokens = slower processing
   - 📏 **Context length** - Fewer words fit in model's context window  
   - 💰 **Cost** - Many APIs charge per token
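
The practical implications can be turned into back-of-the-envelope arithmetic. The sketch below assumes made-up token counts, a 512-token context window, and an invented price. None of these numbers come from a real tokenizer or API, so substitute the counts you measured above.

```python
def tokens_per_word(text, n_tokens):
    """Average tokens per whitespace-separated word (sometimes called fertility)."""
    return n_tokens / max(len(text.split()), 1)

def words_that_fit(context_window, fertility):
    """Approximate number of words that fit into a window of `context_window` tokens."""
    return int(context_window / fertility)

# Hypothetical token counts, chosen only to illustrate the arithmetic.
en_fertility = tokens_per_word("The doctor explains the diagnosis carefully.", n_tokens=9)
lr_fertility = tokens_per_word("Den Dokter erkläert d'Diagnos ganz roueg.", n_tokens=15)

CONTEXT_WINDOW = 512          # assumed context length in tokens
PRICE_PER_1K_TOKENS = 0.002   # made-up price, in arbitrary currency units

for name, fertility in [("en", en_fertility), ("lb", lr_fertility)]:
    words = words_that_fit(CONTEXT_WINDOW, fertility)
    cost = 1000 * fertility * PRICE_PER_1K_TOKENS / 1000  # cost of 1,000 words
    print(f"{name}: {fertility:.2f} tokens/word, ~{words} words per window, "
          f"cost for 1,000 words: {cost:.4f}")
```

With these illustrative numbers, the same context window holds noticeably fewer words in the language with more tokens per word, and the same text costs proportionally more.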

#### 💭 Discussion Points

- What does this tell us about model bias toward certain languages?
- How might this affect performance on low-resource languages?
- What strategies could help mitigate these issues?


### 📝 Your Observations (Interactive)

In [None]:
# 📝 Write your observations here!
# Edit the text below to record your insights

my_observations = f"""
🔍 TOKENIZATION OBSERVATIONS FOR {LOW_RESOURCE_LANG.upper()}:

Token Efficiency:
- English vs {LOW_RESOURCE_LANG}: [Your comparison here]

Quality Issues I Notice:
- [e.g., "Words split into too many pieces"]
- [e.g., "Accents/diacritics handled poorly"] 
- [e.g., "Named entities fragmented"]

Implications:
- Processing speed: [Your thoughts]
- Context limitations: [Your thoughts]
- Cost considerations: [Your thoughts]

Overall Assessment:
- [How well does this tokenizer handle your language?]
"""

print(my_observations)


---

## 🎯 Sentence Embeddings {#embeddings}

Now let's explore how LLMs convert text into numerical representations that capture meaning.

### 🧠 What are Sentence Embeddings?

A **sentence embedding** is a dense vector (typically 384-768 dimensions) that captures the semantic meaning of a sentence. Think of it as a "fingerprint" that represents what the sentence means.
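
Closeness between two embeddings is commonly measured with cosine similarity. The sketch below uses invented 4-dimensional vectors in place of real embeddings (an assumption for readability); the variable names only hint at which sentences they might stand for.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical "embeddings"; real ones have hundreds of dimensions.
doctor_en = [0.9, 0.1, 0.0, 0.2]
doctor_lb = [0.8, 0.2, 0.1, 0.2]
weather_en = [0.0, 0.9, 0.4, 0.1]

sim_same_meaning = cosine_similarity(doctor_en, doctor_lb)
sim_different_meaning = cosine_similarity(doctor_en, weather_en)

print(f"doctor (en) vs doctor (lb):  {sim_same_meaning:.3f}")
print(f"doctor (en) vs weather (en): {sim_different_meaning:.3f}")
```

A good multilingual embedding model should behave like this toy data: translations score higher than unrelated sentences in the same language.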

### 🎯 Our Exploration Plan

1. **🤖 Load** a multilingual sentence embedding model
2. **🔢 Convert** each sentence to a numerical vector  
3. **📉 Reduce** high-dimensional vectors to 2D using PCA
4. **📊 Visualize** sentences in 2D space, colored by language
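
The PCA step in this plan can be sketched with plain NumPy to show what happens under the hood. This is an illustration on random stand-in data (six fake 384-dimensional vectors), and `pca_2d` is a hypothetical helper; the notebook itself relies on scikit-learn's `PCA`.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                  # PCA requires centered data
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:2].T                   # [n_samples, 2]
    explained = (S ** 2) / np.sum(S ** 2)            # variance share per component
    return coords, explained[:2]

rng = np.random.default_rng(42)
fake_embeddings = rng.normal(size=(6, 384))          # stand-in for 6 sentence embeddings
coords, ratio = pca_2d(fake_embeddings)

print("2D coordinates shape:", coords.shape)
print("variance explained by PC1 and PC2:", ratio.round(3))
```

One caveat worth keeping in mind: six centered points span at most five dimensions, so with very few sentences two components can explain a large share of the variance almost by construction.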

### 🌍 Model We'll Use

We'll use `paraphrase-multilingual-MiniLM-L12-v2` - a model trained to create similar embeddings for sentences with similar meanings, regardless of language!


### 🔢 Computing Sentence Embeddings

In [None]:
# 🤖 Load multilingual sentence embedding model
embedder_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

print(f"🔄 Loading sentence embedder: {embedder_name}")
embedder = SentenceTransformer(embedder_name, device=device)
print("✅ Model loaded successfully!")

# 📝 Extract text data for embedding
sent_texts = [ex["text"] for ex in examples]
sent_langs = [ex["lang"] for ex in examples]
sent_ids = [ex["id"] for ex in examples]

print(f"\n🔢 Computing embeddings for {len(sent_texts)} sentences...")

# 🧠 Generate embeddings
embeddings = embedder.encode(sent_texts, convert_to_numpy=True, show_progress_bar=False)

print("✅ Embeddings computed!")
print(f"📊 Shape: {embeddings.shape} (sentences × dimensions)")
print(f"🎯 Each sentence is now represented as a {embeddings.shape[1]}-dimensional vector")


### 📉 Reducing to 2D for Visualization

In [None]:
# 📉 Use PCA to reduce from high dimensions to 2D
print("🔄 Reducing embeddings to 2D using PCA...")
pca = PCA(n_components=2, random_state=SEED)
coords_2d = pca.fit_transform(embeddings)

# 📊 Create DataFrame for easier plotting and analysis
df_plot = pd.DataFrame({
    "id": sent_ids,
    "lang": sent_langs, 
    "text": sent_texts,
    "x": coords_2d[:, 0],
    "y": coords_2d[:, 1],
})

# 📈 Show how much variance is captured by our 2D projection
variance_explained = pca.explained_variance_ratio_
total_variance = sum(variance_explained)

print("✅ 2D projection complete!")
print(f"📊 Variance explained: {variance_explained[0]:.1%} (PC1) + {variance_explained[1]:.1%} (PC2) = {total_variance:.1%} total")
print(f"💡 This means our 2D plot captures {total_variance:.1%} of the variance in the original embeddings")

print("\n📋 Data for plotting:")
df_plot


### 📊 Visualizing Sentence Embeddings

In [None]:
# 📊 Create a 2D visualization
# Note: plot labels use plain text because matplotlib's default font has no emoji glyphs
plt.figure(figsize=(12, 8))

# 🎨 Define colors and markers for each language
colors = {'en': '#2E86AB', LOW_RESOURCE_LANG: '#A23B72'}  # Blue for English, Purple for other
markers = {'en': 'o', LOW_RESOURCE_LANG: 's'}  # Circle for English, Square for other
sizes = {'en': 100, LOW_RESOURCE_LANG: 120}  # Slightly larger for low-resource language

# 📍 Plot points for each language
for lang in df_plot["lang"].unique():
    subset = df_plot[df_plot["lang"] == lang]
    
    plt.scatter(subset["x"], subset["y"], 
               c=colors[lang], marker=markers[lang], s=sizes[lang],
               label=lang.upper(), alpha=0.8, edgecolors='white', linewidth=2)

# 🏷️ Add labels for each point
for _, row in df_plot.iterrows():
    plt.annotate(row["id"], 
                (row["x"], row["y"]),
                xytext=(8, 8), textcoords='offset points',
                fontsize=10, fontweight='bold',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))

# 🎨 Styling
plt.title("Sentence Embeddings in 2D Space", fontsize=16, fontweight='bold', pad=20)
plt.xlabel(f"First Principal Component ({variance_explained[0]:.1%} variance)", fontsize=12)
plt.ylabel(f"Second Principal Component ({variance_explained[1]:.1%} variance)", fontsize=12)
plt.legend(fontsize=12, loc='best')
plt.grid(True, alpha=0.3)

# 🔍 Add some context
plt.figtext(0.02, 0.02, 
           "Points closer together = more similar meanings\n" +
           "Good multilingual models cluster by meaning, not language", 
           fontsize=10, style='italic')

plt.tight_layout()
plt.show()

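# 📏 Calculate distances between parallel sentences (en_1 vs lb_1, en_2 vs lb_2, ...)
# The examples list keeps both languages in the same order, so we pair rows by position.
print("\n📏 Distance Analysis (in the 2D PCA projection):")
print("=" * 40)
en_rows = df_plot[df_plot["lang"] == "en"].reset_index(drop=True)
lr_rows = df_plot[df_plot["lang"] == LOW_RESOURCE_LANG].reset_index(drop=True)
for (_, p1), (_, p2) in zip(en_rows.iterrows(), lr_rows.iterrows()):
    distance = np.sqrt((p1['x'] - p2['x'])**2 + (p1['y'] - p2['y'])**2)
    print(f"Distance between {p1['id']} and {p2['id']}: {distance:.3f}")
    print(f"  🇬🇧 {p1['text'][:50]}...")
    print(f"  🇱🇺 {p2['text'][:50]}...")
    print()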



### 🤔 Reflection

Look at the plot and discuss:

- Do sentences cluster more by **language** or by **meaning**?
- Are English and low-resource sentences with similar meaning close in the plot?
- Does any sentence look like an outlier?

Again, use the next cell as a scratch pad.



In [None]:
reflection_embeddings = """
Notes about the 2D embedding plot.

- Example: The two sentences about doctors are close, even across languages.
- Example: One sentence in my language is far away, maybe the model does not know this vocabulary well.
"""

print(reflection_embeddings)


## 4. Inside a transformer (very briefly)

We do not derive the full transformer math here.
Instead, we run a **forward pass** and inspect tensor shapes.

We will:

1. Use a small multilingual transformer (encoder only).
2. Tokenize one sentence.
3. Inspect the shapes of:
   - `input_ids`
   - `attention_mask`
   - `last_hidden_state`
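
Before loading a real model, it can help to see how these tensors fit together on fake data. The sketch below uses NumPy with random numbers standing in for `last_hidden_state` and a hand-written `attention_mask` whose second sentence has two padding positions. All sizes are assumptions picked for readability.

```python
import numpy as np

batch_size, seq_len, hidden_dim = 2, 5, 4   # assumed toy sizes
rng = np.random.default_rng(0)

# Stand-in for the model output: one vector per token position.
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_dim))

# 1 = real token, 0 = padding. The second sentence has only 3 real tokens.
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])

# Masked mean pooling: average only over real tokens.
mask = attention_mask[:, :, None].astype(float)       # [batch, seq_len, 1]
summed = (last_hidden_state * mask).sum(axis=1)       # [batch, hidden_dim]
counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
sentence_vectors = summed / counts                    # [batch, hidden_dim]

print("last_hidden_state:", last_hidden_state.shape)
print("attention_mask:   ", attention_mask.shape)
print("sentence_vectors: ", sentence_vectors.shape)
```

Without the mask, padding positions would be averaged in and would shift the sentence vector of the shorter sentence.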


### 🔬 Transformer Forward Pass Exploration

In [None]:
# 🤖 Load a small multilingual transformer model
small_model_name = "distilbert-base-multilingual-cased"

print(f"🔄 Loading transformer model: {small_model_name}")
tok_small = AutoTokenizer.from_pretrained(small_model_name)
model_small = AutoModel.from_pretrained(small_model_name).to(device)
model_small.eval()  # Set to evaluation mode
print("✅ Model loaded successfully!")

# 📝 Use one of our example sentences
test_sentence = examples[1]["text"]  # "Students are learning about large language models."
print(f"\n🔤 Test sentence: '{test_sentence}'")

# 🔢 Tokenize the input
inputs = tok_small(test_sentence, return_tensors="pt").to(device)

# 🧠 Run forward pass through the transformer
print("\n🔄 Running forward pass...")
with torch.no_grad():
    outputs = model_small(**inputs)

# 📊 Examine tensor shapes
print("\n📊 Tensor Shape Analysis:")
print("=" * 40)
print(f"📝 Original sentence: '{test_sentence}'")
print(f"🔢 input_ids shape: {inputs['input_ids'].shape}")
print(f"👁️  attention_mask shape: {inputs['attention_mask'].shape}")  
print(f"🧠 last_hidden_state shape: {outputs.last_hidden_state.shape}")

# 🔍 Explain what these shapes mean
batch_size, seq_len, hidden_dim = outputs.last_hidden_state.shape
print("\n🔍 What these shapes mean:")
print(f"   📦 Batch size: {batch_size} (we're processing 1 sentence)")
print(f"   📏 Sequence length: {seq_len} (number of tokens, including special tokens)")
print(f"   🎯 Hidden dimensions: {hidden_dim} (size of each token's representation)")

# 🎯 Create sentence-level embedding via mean pooling
print("\n🎯 Creating sentence embedding via mean pooling...")
last_hidden = outputs.last_hidden_state  # [batch, seq_len, hidden_dim]
attention_mask = inputs["attention_mask"]

# Apply attention mask and compute mean
mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
sum_embeddings = torch.sum(last_hidden * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
sentence_vector = (sum_embeddings / sum_mask).squeeze(0)

print(f"✅ Sentence embedding shape: {sentence_vector.shape}")
print(f"💡 We've converted '{test_sentence}' into a {len(sentence_vector)}-dimensional vector!")


---

## 🎓 Wrap-up & Next Steps {#wrap-up}

Congratulations! You've completed your journey through the foundations of Large Language Models.

### 🎯 What You've Accomplished

✅ **Understanding LLMs**: You learned why LLMs revolutionized NLP through unified, large-scale models  
✅ **Tokenization Analysis**: You explored how different languages are tokenized and the implications for low-resource languages  
✅ **Embedding Visualization**: You converted sentences to vectors and visualized semantic relationships in 2D space  
✅ **Transformer Internals**: You peeked inside a transformer to understand tensor shapes and data flow  
✅ **Multilingual Insights**: You gained hands-on experience with how models handle multiple languages

### 🔍 Key Insights

- **Tokenization matters**: Low-resource languages often require more tokens, affecting cost and performance
- **Embeddings capture meaning**: Similar sentences cluster together regardless of language (when models work well)
- **Transformers process sequences**: Text flows through attention layers as high-dimensional tensors
- **Representation quality varies**: Some languages are better represented than others in multilingual models

### 🚀 Suggested Next Steps

#### 🔬 **Experiment Further**
- Try different multilingual models (`bert-base-multilingual-cased`, `google/mt5-small`)
- Replace example sentences with text from your domain/language
- Add more languages and analyze clustering patterns
- Experiment with different embedding models

#### 📚 **Deepen Your Knowledge**
- Learn about attention mechanisms in transformers
- Explore fine-tuning for your specific language/task
- Study cross-lingual transfer learning techniques
- Investigate bias and fairness in multilingual models

#### 🛠️ **Apply Your Skills**
- Use these concepts in downstream tasks (classification, summarization, QA)
- Build applications using sentence embeddings for semantic search
- Contribute to improving low-resource language support

### 🌟 Remember

The field of multilingual NLP is rapidly evolving. The techniques you've learned here are foundational - use them as building blocks for more advanced applications and research!

---

**Happy coding! 🚀**


---

## ⚙️ Inside Transformers {#transformers}

Let's peek inside a transformer model to understand how it processes our tokenized text.


### 🔍 What Happens Inside a Transformer?

We won't dive into the complex mathematics, but we can run a **forward pass** through a transformer and examine the tensor shapes to understand the data flow.

### 🎯 Our Exploration

1. **Load** a small multilingual transformer (encoder-only)
2. **Tokenize** one of our sentences
3. **Inspect** the shapes of key tensors:
   - `input_ids` - The tokenized input
   - `attention_mask` - Which tokens to pay attention to
   - `last_hidden_state` - The final representations


### 1.1 Create dialogue windows

Many dialogue datasets are long conversations. Summarization is easier to teach with smaller windows. We will create overlapping windows of turns, then treat each window as a dialogue sample.

You can adjust the window size. Smaller windows are easier for small models. Larger windows stress test context handling.
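
The interaction between window size and stride is easiest to see on bare indices. The sketch below only computes `(start, end)` pairs; `window_bounds` is a hypothetical helper for illustration, and it assumes the last window should always reach the final turn of the conversation.

```python
def window_bounds(n_turns, window_turns=10, stride=6):
    """Return (start, end) index pairs for overlapping windows over n_turns turns."""
    bounds = []
    for start in range(0, n_turns, stride):
        end = min(n_turns, start + window_turns)
        bounds.append((start, end))
        if end == n_turns:  # this window already reaches the last turn
            break
    return bounds

# 20 turns, windows of 10 turns, moving forward 6 turns at a time
print(window_bounds(20))
# Stride equal to the window size gives non-overlapping windows
print(window_bounds(20, window_turns=4, stride=4))
# A conversation shorter than the window yields a single, shorter window
print(window_bounds(3))
```

With `window_turns=10` and `stride=6`, consecutive windows share four turns, which gives each sample some context from the previous one.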


In [None]:
def make_dialogue_windows(turns: pd.DataFrame, window_turns: int = 10, stride: int = 6) -> pd.DataFrame:
    """
    Convert a turn DataFrame into overlapping dialogue windows.

    Returns a DataFrame with: sample_id, dialogue_text, speakers_involved, n_turns.
    """
    samples = []
    n = len(turns)
    sample_id = 0
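    # Iterate over all possible starts so the final turns are not dropped;
    # the break below stops as soon as a window reaches the last turn.
    for start in range(0, n, stride):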
        end = min(n, start + window_turns)
        chunk = turns.iloc[start:end]
        dialogue_lines = [f"{r.speaker}: {r.text}" for r in chunk.itertuples()]
        dialogue_text = "\n".join(dialogue_lines)
        speakers = sorted(set(chunk["speaker"].tolist()))
        samples.append(
            {
                "sample_id": sample_id,
                "dialogue_text": dialogue_text,
                "speakers_involved": speakers,
                "n_turns": int(end - start),
            }
        )
        sample_id += 1
        if end == n:
            break
    return pd.DataFrame(samples)

samples_df = make_dialogue_windows(turns_df, window_turns=10, stride=6)
samples_df


### 1.2 Preview one sample

Read the dialogue. Then, in your own words, write a one sentence summary in the next cell. Keep it short. This will become our first human reference.


In [None]:
sample = samples_df.loc[0, "dialogue_text"]
print(sample)


In [None]:
# Your one sentence reference summary.
# You can edit this string. The notebook will still run if you do not.

REFERENCE_SUMMARY = "Jack arrives and learns Algernon is visiting, then Algernon teases Jack and reveals he plans to marry Jack's cousin Cecily."

print(REFERENCE_SUMMARY)


## 2. Baseline. Extractive TextRank summarization

Before using an LLM, build a baseline that is fast, cheap, and interpretable. TextRank selects the most central sentences using a similarity graph and PageRank.

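This baseline is largely language agnostic, as long as you can split text into sentences, which is why it is valuable for low-resource languages. Note that the implementation below passes `stop_words="english"` to the TF-IDF vectorizer; set `stop_words=None` when working with another language.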
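
To see what "most central" means, here is a small sketch of PageRank by power iteration on a hand-written similarity matrix. The matrix values are invented, and `pagerank_scores` is a hypothetical helper; the notebook itself uses `networkx.pagerank`, whose numerical details may differ.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, n_iter=100):
    """Rank nodes of a weighted graph given as a similarity matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    # Normalize each column to sum to 1; isolated nodes spread their weight uniformly.
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    transition = np.where(col_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        scores = (1 - damping) / n + damping * (transition @ scores)
    return scores

# Invented similarities between 4 sentences (symmetric, zero diagonal).
sim = np.array([
    [0.0, 0.6, 0.5, 0.0],
    [0.6, 0.0, 0.4, 0.1],
    [0.5, 0.4, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
])

scores = pagerank_scores(sim)
for i, s in enumerate(scores):
    print(f"sentence {i}: {s:.3f}")
```

Sentence 3 is only weakly connected to the rest, so it receives the lowest score and would be the last one picked for the summary.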


In [None]:
import re  # used by split_sentences and textrank_summarize below

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def split_sentences(text: str) -> List[str]:
    """
    Very simple sentence splitter.
    For robust multilingual splitting, consider spaCy or Stanza.
    """
    text = re.sub(r"\s+", " ", text).strip()
    sents = re.split(r"(?<=[\.\?\!])\s+", text)
    return [s.strip() for s in sents if s.strip()]

def textrank_summarize(dialogue_text: str, max_sentences: int = 2) -> str:
    """
    Extractive summarization using TextRank on sentence similarity.
    """
    content = re.sub(r"^[A-Z][A-Z\s'\-]+:\s*", "", dialogue_text, flags=re.MULTILINE)
    sentences = split_sentences(content)
    if not sentences:
        return ""
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(sentences)
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)

    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph, max_iter=200)

    ranked = sorted(range(len(sentences)), key=lambda i: scores.get(i, 0.0), reverse=True)
    picked = sorted(ranked[:max_sentences])
    return " ".join([sentences[i] for i in picked])

baseline_summary = textrank_summarize(sample, max_sentences=2)
print("Baseline summary:\n", baseline_summary)


### 2.1 Quick evaluation. ROUGE

ROUGE is imperfect, but it is a quick sanity check. We will compute ROUGE 1, ROUGE 2, and ROUGE L against your reference summary.
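
ROUGE-1 boils down to counting shared unigrams. The sketch below is a simplified hand-rolled version for intuition only: it lowercases and splits on whitespace, whereas the `rouge_score` package applies its own tokenization and optional stemming, so the numbers will not match exactly. The example sentences are invented.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

pred = "jack meets algernon"
ref = "jack meets algernon in town"
print(f"ROUGE-1 F1: {rouge1_f1(pred, ref):.2f}")
```

Here all three predicted words appear in the reference (precision 1.0), but only three of the five reference words are covered (recall 0.6), which gives an F1 of 0.75.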


In [None]:
from rouge_score import rouge_scorer

def rouge_scores(pred: str, ref: str) -> Dict[str, float]:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(ref, pred)
    return {k: v.fmeasure for k, v in scores.items()}

print("ROUGE (baseline vs reference):")
rouge_scores(baseline_summary, REFERENCE_SUMMARY)


## 3. Mini quiz. What makes dialogue summarization harder?

Try to answer before running the cell. Then run it for instant feedback.


In [None]:
try:
    import ipywidgets as widgets
    from IPython.display import display
except Exception:
    widgets = None

QUESTION = "Which factor is most specific to dialogue summarization, compared to single speaker summarization?"
OPTIONS = [
    "A. Dialogues contain named entities.",
    "B. Dialogues include speaker turns and pragmatic intent.",
    "C. Dialogues use punctuation.",
    "D. Dialogues are always longer than articles.",
]
CORRECT = 1
EXPLANATION = "Speaker turns and pragmatic intent are core. You often need to resolve who said what and why."

def run_quiz():
    if widgets is None:
        print(QUESTION)
        for opt in OPTIONS:
            print(opt)
        print("\nCorrect:", OPTIONS[CORRECT])
        print("Explanation:", EXPLANATION)
        return

    radio = widgets.RadioButtons(options=OPTIONS, description="Your answer:")
    out = widgets.Output()

    def on_change(change):
        if change["name"] != "value":
            return
        with out:
            out.clear_output()
            idx = OPTIONS.index(change["new"])
            if idx == CORRECT:
                print("Correct.")
            else:
                print("Not quite.")
            print("Explanation:", EXPLANATION)

    radio.observe(on_change)
    display(radio, out)

run_quiz()


🔍 **Checkpoint 1: You now have a solid baseline!**

**✅ What you've accomplished:**
- Built a dialogue dataset from raw text
- Implemented TextRank extractive summarization  
- Evaluated with ROUGE metrics
- Learned what makes dialogue summarization challenging

**🎯 Next up:** We'll simulate low-resource conditions and learn adaptation strategies.

**💡 For LLM-based summarization:** Check out Session 2 on Prompt Engineering!


In [None]:
# 📊 Quick Results Summary

# Display summary of our baseline approach
results_summary = {
    "Approach": "TextRank (Extractive)",
    "Model Size": "No model required",
    "Hardware": "CPU sufficient", 
    "Language Support": "Any language",
    "Training Data": "None required",
    "Key Advantage": "Fast, interpretable, language-agnostic"
}

print("🎯 BASELINE APPROACH SUMMARY")
print("="*50)
for key, value in results_summary.items():
    print(f"{key:18}: {value}")  # width 18 fits the longest key

print(f"\n✅ Your ROUGE score: {rouge_scores(baseline_summary, REFERENCE_SUMMARY)}")


## 4. Low Resource Mode 🌍

Now we'll simulate low-resource conditions and learn adaptation strategies.

**What makes a language "low-resource"?**
- Very little labeled data
- Limited preprocessing tools  
- Domain mismatch with training data
- Orthographic variation and noise


### 4.1 Simulate Low-Resource Conditions

We'll corrupt our clean English dialogue to simulate challenges faced by low-resource languages.


In [None]:

def low_resource_corrupt(text: str, drop_punct_prob: float = 0.5, typo_prob: float = 0.08) -> str:
    """Simulate low-resource conditions by adding noise"""
    import random
    rng = random.Random(842)
    out_chars = []
    for ch in text:
        # Randomly drop punctuation
        if ch in ".?!," and rng.random() < drop_punct_prob:
            continue
        # Add random typos
        if ch.isalpha() and rng.random() < typo_prob:
            if rng.random() < 0.5:
                out_chars.append(ch.swapcase())  # Case error
            else:
                out_chars.append(chr(((ord(ch.lower()) - 97 + 1) % 26) + 97))  # Letter shift
        else:
            out_chars.append(ch)
    return "".join(out_chars)

# Apply corruption to our sample
low_resource_sample = low_resource_corrupt(sample, drop_punct_prob=0.6, typo_prob=0.04)
print("🌍 SIMULATED LOW-RESOURCE TEXT:")
print("="*60)
print(low_resource_sample[:400] + "...")
print("\n💡 Notice: Missing punctuation, typos, inconsistent casing")


### 4.2 Prompt remix playground

You will remix a prompt by choosing a style, a focus, and a sentence limit. Working with concrete options keeps prompt engineering from feeling abstract.

Pick your settings, then run the cell. Try to make the summary both concise and faithful.


In [None]:
STYLE_OPTIONS = ["neutral", "bullet", "tweet", "meeting_minutes"]
FOCUS_OPTIONS = ["decisions", "conflict", "relationships", "actions"]

def build_prompt(style: str, focus: str, max_sentences: int) -> str:
    style = style.lower().strip()
    focus = focus.lower().strip()

    base = f"Summarize the conversation in at most {max_sentences} sentences."
    if focus == "decisions":
        base += " Focus on decisions and commitments."
    elif focus == "conflict":
        base += " Focus on disagreements and what caused them."
    elif focus == "relationships":
        base += " Focus on who relates to whom and the social situation."
    elif focus == "actions":
        base += " Focus on actions and next steps."

    if style == "bullet":
        base += " Use 2 to 4 bullet points."
    elif style == "tweet":
        base += " Write it as a single tweet style sentence, under 240 characters."
    elif style == "meeting_minutes":
        base += " Format as meeting minutes with sections: Context, Key Points, Next Steps."

    base += " Do not invent facts. Preserve names."
    return base

def run_playground(style="neutral", focus="relationships", max_sentences=2):
    prompt = build_prompt(style, focus, max_sentences)
    print("Prompt:\n", prompt, "\n")
    out = generate_summary_t5(sample, prompt, max_new_tokens=120, temperature=0.0)
    print("Model output:\n", out)
    return out

llm_play = run_playground(style="meeting_minutes", focus="relationships", max_sentences=2)


### 4.3 One-shot and few-shot prompts

When you have little data, a few worked examples in the prompt are powerful. We will create a small in-notebook prompt set.

You can replace the examples with your own dialogues later.


In [None]:
EXAMPLE_DIALOGUE = """ALICE: Are we still meeting at 3?
BOB: Yes, but I will be 10 minutes late.
ALICE: Ok. Please bring the slides.
BOB: Will do."""

EXAMPLE_SUMMARY = "Alice and Bob confirm a 3 pm meeting. Bob will arrive 10 minutes late and will bring the slides."

ONE_SHOT_PROMPT = f"""Summarize the conversation in 1 to 2 sentences. Do not invent facts.

Example.
DIALOGUE:
{EXAMPLE_DIALOGUE}

SUMMARY:
{EXAMPLE_SUMMARY}

Now summarize this dialogue.
"""

llm_one = generate_summary_t5(sample, ONE_SHOT_PROMPT, max_new_tokens=120, temperature=0.0)
print(llm_one)
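
The cell above builds a one-shot prompt. A few-shot prompt follows the same pattern with several examples. The sketch below is one possible way to assemble it; the helper `build_few_shot_prompt` and the second example dialogue are illustrative assumptions, not part of the notebook.

```python
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], instruction: str) -> str:
    """Assemble an instruction followed by numbered dialogue/summary examples."""
    parts = [instruction.strip(), ""]
    for i, (dialogue, summary) in enumerate(examples, start=1):
        parts.append(f"Example {i}.")
        parts.append("DIALOGUE:")
        parts.append(dialogue.strip())
        parts.append("")
        parts.append("SUMMARY:")
        parts.append(summary.strip())
        parts.append("")
    parts.append("Now summarize this dialogue.")
    return "\n".join(parts) + "\n"

# Illustrative examples; replace them with dialogues from your own domain.
few_shot_examples = [
    ("ALICE: Are we still meeting at 3?\nBOB: Yes, but I will be 10 minutes late.",
     "Alice and Bob confirm a 3 pm meeting. Bob will be 10 minutes late."),
    ("CARA: Did you send the invoice?\nDAN: Not yet. I will send it tonight.",
     "Dan has not sent the invoice yet and will send it tonight."),
]

FEW_SHOT_PROMPT = build_few_shot_prompt(
    few_shot_examples,
    "Summarize the conversation in 1 to 2 sentences. Do not invent facts.",
)
print(FEW_SHOT_PROMPT)
```

The resulting string can be passed to `generate_summary_t5` in place of `ONE_SHOT_PROMPT`. Each extra example uses up part of the model's input budget, so with a small model two or three short examples are usually a sensible upper limit.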


### 4.4 Generation parameters. Temperature and length

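Temperature controls how random sampling is: higher values give more varied output but can reduce factuality. The length limit (`max_new_tokens`) controls how much detail you get.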
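
To build intuition for the temperature setting, here is a toy sketch that is independent of `generate_summary_t5`. It assumes the common scheme in which logits are divided by the temperature before the softmax; the logits are made-up numbers. A temperature of 0 is not defined in this formula and is usually treated as greedy decoding.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the top token, so output is nearly deterministic. At higher temperature the less likely tokens are sampled more often, which is where invented details tend to creep in.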

Use the sliders if available. Otherwise, edit the numbers and rerun.


In [None]:
def demo_generation_controls(temperature: float = 0.0, max_new_tokens: int = 80):
    prompt = build_prompt(style="neutral", focus="actions", max_sentences=2)
    out = generate_summary_t5(sample, prompt, max_new_tokens=max_new_tokens, temperature=temperature, top_p=0.95)
    print("temperature:", temperature, "max_new_tokens:", max_new_tokens)
    print(out)

try:
    import ipywidgets as widgets
    from IPython.display import display
    if tokenizer is None or model is None:
        raise RuntimeError("Model not available, skipping widgets.")
    ui = widgets.interactive(
        demo_generation_controls,
        temperature=widgets.FloatSlider(min=0.0, max=1.0, step=0.1, value=0.0),
        max_new_tokens=widgets.IntSlider(min=30, max=200, step=10, value=80),
    )
    display(ui)
except Exception:
    demo_generation_controls(temperature=0.0, max_new_tokens=80)
    demo_generation_controls(temperature=0.7, max_new_tokens=120)


## 5. Compare Baselines vs LLM

We compare summaries and compute ROUGE against your reference.
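
As a reminder of what a ROUGE score measures, here is a simplified ROUGE-1 F1 computed by hand. It is a sketch under stated assumptions (lowercasing, whitespace tokenization, no stemming) and is not the `rouge_scores` helper used in this notebook, whose numbers may differ slightly.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

toy_reference = "bob will be late and will bring the slides"
toy_prediction = "bob will bring the slides"
score = rouge1_f1(toy_prediction, toy_reference)
print(round(score, 3))
```

In this toy pair every predicted word appears in the reference, so precision is perfect, but the prediction omits that Bob is late, so recall is lower. ROUGE rewards word overlap only; it cannot tell whether a summary is factually correct.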

In real work, you should also do human evaluation, for example factuality checks, checks for missing action items, and checks on speaker attribution.


In [None]:
results = []
results.append(("TextRank baseline", baseline_summary))
results.append(("LLM zero shot", llm_zero))
results.append(("LLM one shot", llm_one))
results.append(("LLM prompt remix", llm_play))

rows = []
for name, pred in results:
    rows.append({"system": name, "summary": pred, **rouge_scores(pred, REFERENCE_SUMMARY)})

pd.DataFrame(rows).sort_values("rougeL", ascending=False)


## 6. Low resource mode. Make English behave like a low resource language

Low resource usually means one or more of the following:
- Very little labeled data.
- Limited tools for tokenization, sentence splitting, and normalization.
- Domain mismatch. Your data looks different from what models saw during pre-training.
- Orthographic variation and borrowing, including code-switching.

We will simulate these constraints in English by:
1) Reducing the available context.
2) Corrupting the text with noise and inconsistent spelling.
3) Removing punctuation, which hurts naive sentence splitting.

Then we apply strategies that transfer to true low resource settings.
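
The corruption function in the next cell covers points 2 and 3. Point 1 is not implemented in this notebook. One minimal way to sketch it, assuming each turn sits on its own line, is to keep only the first few turns. The helper name `reduce_context` and the toy dialogue are illustrative.

```python
def reduce_context(dialogue: str, max_turns: int = 4) -> str:
    """Keep only the first `max_turns` non-empty lines (turns) of a dialogue."""
    turns = [ln.strip() for ln in dialogue.splitlines() if ln.strip()]
    return "\n".join(turns[:max_turns])

toy_dialogue = """ALICE: Are we still meeting at 3?
BOB: Yes, but I will be 10 minutes late.
ALICE: Ok. Please bring the slides.
BOB: Will do."""

short_context = reduce_context(toy_dialogue, max_turns=2)
print(short_context)
```

With only two turns left, the request to bring the slides is gone, so no summarizer can recover it. That is the effect of limited context you want to observe.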


In [None]:
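import random

def low_resource_corrupt(text: str, drop_punct_prob: float = 0.5, typo_prob: float = 0.08) -> str:
    """Simulate low-resource noise: drop punctuation, flip case, shift letters."""
    rng = random.Random(842)  # fixed seed keeps the corruption reproducible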
    out_chars = []
    for ch in text:
        if ch in ".?!," and rng.random() < drop_punct_prob:
            continue
        if ch.isalpha() and rng.random() < typo_prob:
            if rng.random() < 0.5:
                out_chars.append(ch.swapcase())
            else:
                out_chars.append(chr(((ord(ch.lower()) - 97 + 1) % 26) + 97))
        else:
            out_chars.append(ch)
    return "".join(out_chars)

low_text = low_resource_corrupt(sample, drop_punct_prob=0.8, typo_prob=0.05)
print(low_text[:600])


In [None]:
print("Baseline on clean text:")
print(textrank_summarize(sample, max_sentences=2))
print("\nBaseline on low resource corrupted text:")
print(textrank_summarize(low_text, max_sentences=2))


### 6.1 Strategy toolkit

Here are practical tactics that often help in low resource dialogue summarization.

1. Normalize input.
   - Fix common punctuation issues.
   - Normalize whitespace.
   - Normalize speaker labels.

2. Use robust segmentation.
   - If sentence splitting fails, summarize at turn level.

3. Constrain generation.
   - Use explicit length limits.
   - Instruct the model to preserve names, numbers, and decisions.

4. Add lightweight context.
   - Provide a glossary of names and places.
   - Provide a domain hint, such as "family conversation" or "customer support".

5. Evaluate with targeted checks.
   - Did we preserve who wants to marry whom?
   - Did we hallucinate actions that never happened?

We will implement 1 and 2 now.


In [None]:
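def normalize_dialogue(text: str) -> str:
    """Collapse whitespace, then put each speaker turn on its own line."""
    text = text.replace("\t", " ")
    text = re.sub(r"\s+", " ", text)  # also removes the original line breaks
    # Re-insert a line break before each upper-case speaker label, e.g. "ALICE:" or "SPEAKER1:".
    # Digits are allowed so that numbered labels are recognized as well.
    text = re.sub(r"([A-Z][A-Z0-9\s'\-]+:)\s*", r"\n\1 ", text)
    return text.strip()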

def turn_level_summarize(dialogue_text: str, max_turns: int = 3) -> str:
    """
    Extractive turn level summarization, more robust than sentence splitting.
    """
    lines = [ln.strip() for ln in dialogue_text.splitlines() if ln.strip()]
    lines = [ln for ln in lines if len(ln) > 10]
    if not lines:
        return ""
    if len(lines) <= max_turns:
        return " ".join(lines)

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(lines)
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph, max_iter=200)
    ranked = sorted(range(len(lines)), key=lambda i: scores.get(i, 0.0), reverse=True)
    picked = sorted(ranked[:max_turns])
    return " ".join([lines[i] for i in picked])

print("Before normalization:\n", low_text[:250], "\n")
norm_low = normalize_dialogue(low_text)
print("After normalization:\n", norm_low[:250])
print("\nTurn level summary on corrupted text:\n", turn_level_summarize(norm_low, max_turns=3))
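
Tactic 5 from the toolkit (targeted checks) can be partly automated. The sketch below is one possible approach, not a method used elsewhere in this notebook: it flags speakers that the summary never mentions and numbers that appear in the summary but not in the dialogue. The helper names and the toy texts are illustrative, and the check assumes upper-case speaker labels at the start of each line.

```python
import re
from typing import Dict, Set

def extract_speakers(dialogue: str) -> Set[str]:
    """Collect speaker labels such as 'ALICE' from lines of the form 'ALICE: ...'."""
    pattern = r"^([A-Z][A-Z0-9 '\-]*):"
    return {m.group(1).strip() for m in re.finditer(pattern, dialogue, flags=re.MULTILINE)}

def check_summary(dialogue: str, summary: str) -> Dict[str, Set[str]]:
    """Report speakers missing from the summary and numbers found only in the summary."""
    speakers = extract_speakers(dialogue)
    summary_lower = summary.lower()
    missing = {s for s in speakers if s.lower() not in summary_lower}
    source_numbers = set(re.findall(r"\d+", dialogue))
    summary_numbers = set(re.findall(r"\d+", summary))
    return {
        "missing_speakers": missing,
        "unsupported_numbers": summary_numbers - source_numbers,
    }

toy_dialogue = "ALICE: Are we still meeting at 3?\nBOB: Yes, but I will be 10 minutes late."
toy_summary = "Alice and Bob will meet at 4."
report = check_summary(toy_dialogue, toy_summary)
print(report)
```

Here both speakers are mentioned, but the summary states a meeting time that never occurs in the dialogue, so the number is flagged. Checks like this are cheap and complement ROUGE, although they only catch surface-level errors.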


### 6.2 Low resource prompting

If you can use an instruction-tuned model, you can push it to behave better on noisy input. The key is to add constraints.

We will:
- Ask for short output.
- Ask it to avoid inventing facts.
- Ask it to preserve names.


In [None]:
LOW_RESOURCE_PROMPT = """Summarize the conversation in 1 sentence.
Rules.
1) Do not invent facts.
2) Preserve names exactly as they appear.
3) If the text is noisy, infer only what is obvious."""

if tokenizer is None or model is None:
    print("Model not available, skipping.")
else:
    print(generate_summary_t5(norm_low, LOW_RESOURCE_PROMPT, max_new_tokens=60, temperature=0.0))
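
Tactic 4 from the strategy toolkit (lightweight context) is not implemented in this notebook. One way to sketch it is to append a domain hint and a short glossary to an existing prompt. The helper `add_context`, the glossary entries, and the domain hint below are illustrative assumptions.

```python
from typing import Dict, Optional

def add_context(
    base_prompt: str,
    glossary: Optional[Dict[str, str]] = None,
    domain_hint: Optional[str] = None,
) -> str:
    """Append a domain hint and a glossary of names or terms to an existing prompt."""
    lines = [base_prompt.strip()]
    if domain_hint:
        lines.append(f"Domain: {domain_hint}.")
    if glossary:
        lines.append("Glossary.")
        for term, note in glossary.items():
            lines.append(f"- {term}: {note}")
    return "\n".join(lines)

context_prompt = add_context(
    "Summarize the conversation in 1 sentence. Do not invent facts.",
    glossary={"Gare": "the train station", "A6": "a motorway"},
    domain_hint="workplace chat",
)
print(context_prompt)
```

The enriched prompt can be passed to `generate_summary_t5` like any other prompt. Keep the glossary short, because every entry takes up part of the model's input budget.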


## Optional mini dataset hook. Try a real non-English case in two minutes

This workshop is designed to start in English, then transfer the same workflow to a low-resource language.

Below are two quick options.

1. **MiniLux micro-set (synthetic)**. A small set of short Luxembourgish and mixed Luxembourgish-French (LU/FR) snippets created for teaching. It is intentionally tiny and imperfect so that you can iterate fast and discuss typical issues such as code-switching, named entities, and spelling variation.

2. **Hugging Face low-resource sample (real text)**. Pull a few examples from a multilingual summarization dataset (the cell below loads 5; the loader defaults to 20) and run the same prompt, plus the same evaluation, to see how performance changes outside English.


In [None]:
# Option 1. MiniLux micro-set (synthetic)
# This is only for the workshop. You can replace it with your own low-resource dialogues later.

mini_lux = [
    {
        "id": "lux_001",
        "dialogue": "A: Moien. Hues du Zäit fir e Kaffi?\nB: Jo, mä just zéng Minutten. Ech muss gläich op d'Aarbecht.\nA: Ok. Mir treffen eis beim Gare.\nB: Super, ech kommen direkt.",
        "reference_summary_en": "They agree to meet for a quick coffee at the station before B goes to work."
    },
    {
        "id": "lux_002",
        "dialogue": "A: Wéi war d'Reunioun haut?\nB: Ganz laang. Mir hu just d'Agenda diskutéiert.\nA: An hu mir eng Decisioun?\nB: Nee, mir maachen et nächste Woch nach eng Kéier.",
        "reference_summary_en": "The meeting was long, they only discussed the agenda, and no decision was made."
    },
    {
        "id": "lux_003",
        "dialogue": "A: Kanns du mer de Rapport schécken?\nB: Jo. Ech schécken en elo per Mail.\nA: Merci. Ech muss en nach haut ofginn.\nB: Kloer, ech maachen et direkt.",
        "reference_summary_en": "B will email A the report immediately because A must submit it today."
    },
    {
        "id": "lux_004",
        "dialogue": "A: Ech sinn am Stau op der A6.\nB: Ok, dann fänke mir ouni dech un.\nA: Gitt mir zéng Minutten.\nB: Passt. Mir halen dir e Sëtz fräi.",
        "reference_summary_en": "A is stuck in traffic but will arrive in about ten minutes, and the others will start and save a seat."
    },
    {
        "id": "lux_005",
        "dialogue": "A: Ech hu muer en rendez-vous chez le médecin.\nB: Bass du ok?\nA: Jo, just e Check-up.\nB: Ok, soen mer dono wéi et gaangen ass.",
        "reference_summary_en": "A has a doctor appointment tomorrow for a check-up and will update B afterward."
    },
    {
        "id": "lux_006",
        "dialogue": "A: Wou si mir mam Projet?\nB: Mir hu 80 Prozent fäerdeg.\nA: Wat feelt nach?\nB: D'Dokumentatioun an d'Tester.",
        "reference_summary_en": "The project is about 80 percent done, but documentation and testing are still missing."
    },
    {
        "id": "lux_007",
        "dialogue": "A: Ech kréien ëmmer eng Fehlermeldung.\nB: Wéi eng?\nA: 'Permission denied'.\nB: Dann hues du wahrscheinlech keng Rechter. Probéier et mat sudo oder fro den Admin.",
        "reference_summary_en": "A gets a permission error, and B suggests using sudo or asking the admin for access."
    },
    {
        "id": "lux_008",
        "dialogue": "A: Mir treffen eis um 14:00.\nB: Ech sinn um 14:15 do.\nA: Ok, ech waarden am Café.\nB: Merci. Bis gläich.",
        "reference_summary_en": "They planned to meet at 14:00, but B will arrive at 14:15 and A will wait at a café."
    },
    {
        "id": "lux_009",
        "dialogue": "A: Hues du d'Presentatioun gesinn?\nB: Jo, si ass gutt, mä d'Grafike sinn ze kleng.\nA: Ok, ech maachen se méi grouss.\nB: Super, dann ass et perfekt.",
        "reference_summary_en": "B thinks the presentation is good but the charts are too small, so A will enlarge them."
    },
    {
        "id": "lux_010",
        "dialogue": "A: Ech sinn haut am Homeoffice.\nB: Ok, kënns du trotzdem an de Call?\nA: Jo, ech sinn do um 10:00.\nB: Top, ech schécken de Link.",
        "reference_summary_en": "A works from home but will join the 10:00 call, and B will send the link."
    },
    {
        "id": "lux_011",
        "dialogue": "A: Mir brauche nach e Beispill fir d'Course.\nB: Wat fir ee Beispill?\nA: E klengt Dialog-Set fir Zesummefaassung.\nB: Ok, ech schreiwen 20 kuerz Dialogen.",
        "reference_summary_en": "They need a small dialogue dataset for a summarization course, and B will write 20 short dialogues."
    },
    {
        "id": "lux_012",
        "dialogue": "A: Kanns du den Text nach eng Kéier kontrolléieren?\nB: Jo, ech kucken no Tippfeeler.\nA: An och Punktuatioun.\nB: Maachen ech.",
        "reference_summary_en": "B will proofread the text for typos and punctuation."
    },
    {
        "id": "lux_013",
        "dialogue": "A: Ech hu meng Schlësselen vergiess.\nB: Wou bass du?\nA: Virun der Dier.\nB: Ech kommen, ginn mer fënnef Minutten.",
        "reference_summary_en": "A forgot their keys and is locked out, and B will come in five minutes."
    },
    {
        "id": "lux_014",
        "dialogue": "A: De Bus kënnt net.\nB: Hues du d'App gekuckt?\nA: Jo, et steet 'retard'.\nB: Dann huele mir en Taxi.",
        "reference_summary_en": "The bus is delayed, so they decide to take a taxi."
    },
    {
        "id": "lux_015",
        "dialogue": "A: Ech muss nach d'Fichieren eroplueden.\nB: Wou?\nA: Op Hugging Face.\nB: Ok, vergiss net d'Lizens an d'Readme.",
        "reference_summary_en": "A needs to upload files to Hugging Face, and B reminds them to include a license and README."
    },
    {
        "id": "lux_016",
        "dialogue": "A: D'GPU ass fräi.\nB: Super, dann starte mir den Training.\nA: Ech setzen batch size op 4.\nB: Ok, da maache mir gradient accumulation.",
        "reference_summary_en": "They have GPU availability and will start training with a small batch size and gradient accumulation."
    },
    {
        "id": "lux_017",
        "dialogue": "A: Kanns du mir den Deadline soen?\nB: Et ass Freideg um 18:00.\nA: Merci, ech maachen et haut nach.\nB: Gutt Iddi.",
        "reference_summary_en": "The deadline is Friday at 18:00, and A plans to finish today."
    },
    {
        "id": "lux_018",
        "dialogue": "A: Ech hunn d'Donnéeën gereinegt.\nB: Super. Hues du och d'Nummeren normaliséiert?\nA: Jo, ech hunn se an Wierder ëmgewandelt.\nB: Perfekt.",
        "reference_summary_en": "A cleaned the data and normalized numbers by converting them into words."
    },
    {
        "id": "lux_019",
        "dialogue": "A: Ech verstinn d'Resultater net.\nB: Wat ass komesch?\nA: D'Accuracy ass héich, mä d'F1 ass niddreg.\nB: Dann ass et wahrscheinlech Klassen-Imbalance.",
        "reference_summary_en": "Accuracy is high but F1 is low, suggesting class imbalance."
    },
    {
        "id": "lux_020",
        "dialogue": "A: Tu peux me rappeler le plan?\nB: Oui. D'abord on teste en anglais, après on passe au luxembourgeois.\nA: An de Prompt bleift ähnlech.\nB: Genau.",
        "reference_summary_en": "They will test in English first, then switch to Luxembourgish while keeping a similar prompt."
    },
    {
        "id": "lux_021",
        "dialogue": "A: Ech sinn net sécher ob 'Zentrum' richteg ass.\nB: Et hänkt vum Dialektgebiet of.\nA: Ok, ech kontrolléieren d'Metadata.\nB: Gutt, d'Labels mussen konsistent sinn.",
        "reference_summary_en": "They will verify the metadata because dialect labels must be consistent."
    },
    {
        "id": "lux_022",
        "dialogue": "A: D'Audio ass ze laang.\nB: Wéi laang?\nA: 25 Sekonnen.\nB: Dann schneiden mir et op 10 Sekonnen fir d'Training.",
        "reference_summary_en": "The audio is 25 seconds long, so they will trim it to 10 seconds for training."
    },
    {
        "id": "lux_023",
        "dialogue": "A: Ech hu keng Internet um Laptop.\nB: Probéier d'WLAN nei.\nA: Ok, ech maachen restart.\nB: Wann et net geet, huele mir en Hotspot.",
        "reference_summary_en": "A has no internet, B suggests restarting Wi-Fi, and they may use a hotspot if needed."
    },
    {
        "id": "lux_024",
        "dialogue": "A: D'Zesummefaassung ass ze laang.\nB: Setz eng Limit.\nA: Wéi vill?\nB: Probéier 2 Sätz an maximal 60 Wierder.",
        "reference_summary_en": "They will constrain the summary length to two sentences and at most 60 words."
    },
    {
        "id": "lux_025",
        "dialogue": "A: Ech wëll eng neutral Zesummefaassung.\nB: Da schreiwe mir am Prompt: 'neutral, factual, no opinion'.\nA: Ok, ech testen dat.\nB: Gutt, a kuck ob Bias kënnt.",
        "reference_summary_en": "They want a neutral factual summary and will encode that in the prompt and then test for bias."
    },
]

def sample_and_summarize(dialogue_set, k=1, seed=7, prompt=ZERO_SHOT_PROMPT):
    import random
    rng = random.Random(seed)  # local RNG, so the global random state stays untouched
    items = rng.sample(dialogue_set, k=k)
    for ex in items:
        print("ID:", ex["id"])
        print("\nDIALOGUE:\n", ex["dialogue"])
        pred = generate_summary_t5(ex["dialogue"], prompt=prompt, max_new_tokens=80, temperature=0.0)
        print("\nMODEL SUMMARY:\n", pred)
        print("\nREFERENCE (EN):\n", ex["reference_summary_en"])
        print("\n" + "-"*70 + "\n")

sample_and_summarize(mini_lux, k=2)

# Option 2. Pull a tiny real low-resource sample from Hugging Face
# This uses XL-Sum (multilingual news summarization). Not a dialogue dataset.
# For the workshop, we convert each article into a "pseudo-dialogue" so we can reuse the same pipeline.

from datasets import load_dataset

def article_to_pseudo_dialogue(article_text: str, max_turns: int = 6) -> str:
    # Lightweight sentence split. Good enough for teaching.
    sentences = [s.strip() for s in article_text.replace("\n", " ").split(".") if s.strip()]
    sentences = sentences[:max_turns]
    turns = []
    for i, s in enumerate(sentences):
        speaker = "ANCHOR" if i % 2 == 0 else "REPORTER"
        turns.append(f"{speaker}: {s}.")
    return "\n".join(turns)

def load_low_resource_hf_sample(language_subset: str = "yoruba", n: int = 20):
    ds = load_dataset("csebuetnlp/xlsum", language_subset, split=f"train[:{n}]")
    # XL-Sum fields are typically: "text" and "summary"
    out = []
    for i, ex in enumerate(ds):
        dialogue = article_to_pseudo_dialogue(ex["text"], max_turns=8)
        out.append(
            {
                "id": f"xlsum_{language_subset}_{i:03d}",
                "dialogue": dialogue,
                "reference_summary": ex["summary"],
            }
        )
    return out

xlsum_yoruba = load_low_resource_hf_sample(language_subset="yoruba", n=5)
print("Example pseudo-dialogue from XL-Sum (yoruba subset):")
print(xlsum_yoruba[0]["dialogue"])
print("\nReference summary (yoruba):")
print(xlsum_yoruba[0]["reference_summary"])

print("\nNow run the same English prompt on the pseudo-dialogue. It will usually struggle, and that is the point.")
pred = generate_summary_t5(xlsum_yoruba[0]["dialogue"], prompt=ZERO_SHOT_PROMPT, max_new_tokens=80, temperature=0.0)
print("\nMODEL SUMMARY:\n", pred)


## 7. Challenge. Adapt to your own low resource language

Now you have an English pipeline. The next step is to replace the English dialogue with data from your target language.

If you work on a language with limited resources, use the same structure:
1) Create turns with speaker labels.
2) Normalize and segment.
3) Start with an extractive baseline.
4) Add a multilingual model or a translation pivot only if you need it.
5) Evaluate with a small set of human references.
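
Step 1 is often the first hurdle with raw transcripts. A minimal sketch of one way to do it is shown here. It assumes that a speaker label is at most 30 characters long and is followed by a colon and a space, and that lines without a label continue the previous turn. The helper `parse_turns` and the toy transcript are illustrative.

```python
import re
from typing import List, Tuple

# A label is a short prefix without colons, followed by ": ".
# Requiring whitespace after the colon avoids treating times like 14:00 as labels.
TURN_RE = re.compile(r"^\s*([^:\n]{1,30}):\s+(.+)$")

def parse_turns(raw_text: str) -> List[Tuple[str, str]]:
    """Split raw text into (speaker, utterance) pairs."""
    turns: List[Tuple[str, str]] = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1).strip(), match.group(2).strip()))
        elif turns:
            # Unlabeled line: treat it as a continuation of the previous turn.
            speaker, utterance = turns[-1]
            turns[-1] = (speaker, utterance + " " + line)
    return turns

toy_transcript = """SPEAKER1: We meet at 14:00.
SPEAKER2: I will be there at 14:15
and I will bring the slides."""

turns = parse_turns(toy_transcript)
labeled_dialogue = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns)
print(labeled_dialogue)
```

The resulting `labeled_dialogue` string has one turn per line and can be used as the value of `MY_DIALOGUE`.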

The next cell includes a ready-to-use template. It runs as is. Replace `MY_DIALOGUE` with your own data.


In [None]:
MY_DIALOGUE = """SPEAKER1: Replace this with your own dialogue in any language.
SPEAKER2: Keep speaker labels. Keep short lines if possible.
SPEAKER1: Then rerun the cells below."""

clean = normalize_dialogue(MY_DIALOGUE)
summary_baseline = turn_level_summarize(clean, max_turns=3)
print("Baseline summary:\n", summary_baseline)

if tokenizer is not None and model is not None:
    prompt = build_prompt(style="neutral", focus="actions", max_sentences=2)
    summary_llm = generate_summary_t5(clean, prompt, max_new_tokens=80, temperature=0.0)
    print("\nLLM summary:\n", summary_llm)
else:
    print("\nLLM not available. Baseline is your default.")


## 8. Wrap up

You now have a reproducible dialogue summarization pipeline that is usable with:
- No LLM, via TextRank and turn-level extraction.
- A small instruction model, via prompt engineering.
- Low resource conditions, via normalization and constraints.

If you want to push further for true low-resource languages:
- Swap English stopwords for a custom list, or disable stopwords.
- Use character n-gram TF-IDF for languages without whitespace.
- Add a small glossary and a retrieval step, then feed only the relevant turns to the model.
- Build a tiny evaluation set, 50 to 200 dialogues with one reference summary each.
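
To see why character n-grams help with noisy or unsegmented text, here is a standard-library sketch that compares turns by character n-gram counts. It is a simplification: it uses raw counts without IDF weighting, and the example strings are made up. In the notebook's TF-IDF step, the more direct route is scikit-learn's `TfidfVectorizer` with `analyzer="char_wb"` and an `ngram_range` such as `(2, 4)`.

```python
import math
from collections import Counter

def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count character n-grams of length n_min..n_max in lowercased text."""
    text = text.lower()
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(v * b[k] for k, v in a.items())  # Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

clean_turn = "Please bring the slides"
noisy_turn = "pleAse brinh the slidfs"   # same turn with simulated typos
other_turn = "The bus is delayed"

sim_noisy = cosine(char_ngrams(clean_turn), char_ngrams(noisy_turn))
sim_other = cosine(char_ngrams(clean_turn), char_ngrams(other_turn))
print(round(sim_noisy, 3), round(sim_other, 3))
```

A word-level vectorizer would treat `brinh` and `slidfs` as unknown words, while the character n-grams still overlap heavily with the clean turn. The same property helps for languages where whitespace tokenization is unreliable.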
