# Session 2: Pretrained Models and Prompt Engineering 🤖

<div align="center">

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/2_pretrained_models_prompt_engineering.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

</div>

---

Welcome to **LLM-based approaches** for dialogue summarization! This session focuses on prompt engineering strategies that work across languages, with special attention to low-resource language challenges.

**🎯 Focus:** Prompt engineering, few-shot learning, Chain-of-Thought  
**💻 Requirements:** GPU recommended for large models (T5, mT5)

## Prerequisites

**📋 Recommended learning path:**
1. **Session 0:** Setup and tokenization basics ✅  
2. **Session 1:** Baseline summarization techniques ✅
3. **This session (Session 2):** LLM prompt engineering ← You are here!

## What You Will Learn

1. **🏗️ Pretrained model families** and their strengths/weaknesses
2. **🎨 Prompt design vs. prompt engineering** principles
3. **🎯 Zero-shot, one-shot, and few-shot** prompting strategies
4. **🧠 Chain-of-Thought prompting** for complex reasoning
5. **🌍 Cross-lingual prompt transfer** techniques
6. **📊 Evaluation and cultural appropriateness** assessment

## Learning Objectives

By the end of this session, you will:
- ✅ Compare different families of pretrained models
- ✅ Design effective prompts for classification and QA
- ✅ Apply few-shot learning with multilingual examples
- ✅ Use Chain-of-Thought prompting across languages
- ✅ Evaluate outputs for correctness, fluency, and cultural fit
- ✅ Adapt prompting strategies to your target language

## How to Use This Notebook

- **Cells marked 🔍 Checkpoint** are recommended stopping points
- **Cells marked 🎯 Challenge** are hands-on exercises  
- **Cells marked 💬 Discussion** are for group activities
- **Run cells in order** - some require model loading time
- **If models are slow:** Switch to a smaller variant such as `google/mt5-small`; the notebook also runs without a GPU, but generation on CPU takes longer


## 0. Setup and Model Loading

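We'll use multilingual T5 (mT5) models for this session. Note that the public mT5 checkpoints are pretrained with a span-corruption objective only and are not instruction-tuned, so raw prompt outputs from `google/mt5-small` are often noisy (for example, sentinel tokens such as `<extra_id_0>`). Treat the results as a baseline for comparing prompt structures across languages rather than as polished answers.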


In [None]:
# 📦 Setup for Session 2: Prompt Engineering
# Install additional packages needed for LLMs

import sys
import subprocess

def install_packages(packages):
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
            print(f"✅ {package}")
        except Exception as e:
            print(f"❌ {package}: {str(e)[:50]}...")

print("🚀 Installing LLM packages...")
packages = [
    "transformers>=4.35.0",
    "torch>=1.13.0", 
    "sentencepiece",
    "accelerate",
    "datasets"
]

install_packages(packages)

# Essential imports
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from typing import List, Dict
import warnings
warnings.filterwarnings('ignore')

print(f"🎯 PyTorch version: {torch.__version__}")
print(f"🤖 GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   Device: {torch.cuda.get_device_name(0)}")

print("✅ Setup complete!")


## 1. Pretrained Model Families Overview 🏗️

**Understanding your options for multilingual text generation:**


## 2. 🎯 Hands-on Prompt Engineering Workshop

**Your Mission:** Design and test prompts for classification and question answering in English and your target low-resource language.

### Task Overview:
1. **Classification:** Categorize dialogues into topics (meeting, social, support, transaction, other)
2. **Question Answering:** Extract information from context using Chain-of-Thought reasoning
3. **Evaluation:** Rate outputs for correctness, fluency, and cultural appropriateness
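Because generated text is free-form, the classification task needs some way to map an output back to one of the five topics before it can be scored. The sketch below shows one simple option, keyword matching. `extract_topic` and `accuracy` are hypothetical helpers that are not part of this notebook, and substring matching is a crude heuristic (for instance, "another" would match `other`).

```python
from typing import List, Optional

TOPICS = ["meeting", "social", "support", "transaction", "other"]

def extract_topic(generated: str, topics: List[str] = TOPICS) -> Optional[str]:
    """Return the topic label that appears earliest in the generated text, or None."""
    text = generated.lower()
    # Collect (position, label) pairs for every label found in the text
    hits = [(text.find(t), t) for t in topics if t in text]
    return min(hits)[1] if hits else None

def accuracy(outputs: List[str], expected: List[str]) -> float:
    """Fraction of outputs whose extracted topic equals the expected label."""
    if not outputs:
        return 0.0
    correct = sum(extract_topic(o) == e for o, e in zip(outputs, expected))
    return correct / len(outputs)

# Made-up outputs for illustration only (not real model results)
outputs = ["Topic: Meeting", "social chat", "<extra_id_0>"]
expected = ["meeting", "social", "support"]
print(f"Accuracy: {accuracy(outputs, expected):.2f}")
```

An output that contains no known label counts as wrong here, which keeps the score conservative when the model returns noise.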

### 🏗️ Model Family Comparison

**Available approaches for multilingual tasks:**

| **Approach** | **Pros** | **Cons** | **Best For** |
|--------------|----------|----------|--------------|
| **Local LLM (mT5)** | Privacy, customizable, offline | Requires compute resources | Controlled environments |
| **API (GPT-3.5/4)** | State-of-the-art quality, no setup | Cost per token, internet connection needed | Production applications |
| **Hosted (Colab)** | Free experimentation | Limited resources | Learning and prototyping |
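To make "requires compute resources" more concrete, the memory needed just to hold a local model's weights can be estimated from its parameter count. The sketch below is a back-of-the-envelope calculation under stated assumptions: the parameter counts are the approximate figures reported for the public mT5 checkpoints (about 300M for small, about 580M for base), and `weight_memory_gb` is a hypothetical helper. Activations and generation overhead are not included, so real usage will be higher.

```python
# Approximate parameter counts (assumption: rounded figures for the public checkpoints)
APPROX_PARAMS = {"google/mt5-small": 300e6, "google/mt5-base": 580e6}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(model_name: str, dtype: str = "float32") -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return APPROX_PARAMS[model_name] * BYTES_PER_PARAM[dtype] / 1e9

for name in APPROX_PARAMS:
    print(f"{name}: ~{weight_memory_gb(name):.1f} GB of weights in float32")
```

The loading cell in this notebook prints the actual parameter count, which can replace the rounded figures used here.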


In [None]:
# 🤖 Load multilingual model and create prompt engineering toolkit

print("📥 Loading multilingual T5 model...")
print("⏱️  This may take 2-3 minutes on first run")

# Use smaller model for workshop - upgrade to base for better quality
MODEL_NAME = "google/mt5-small"  
# MODEL_NAME = "google/mt5-base"  # Better quality, requires more resources

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    
    print(f"✅ Model loaded on {device}")
    print(f"📊 Parameters: ~{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
    
    def generate_text(prompt: str, max_length: int = 100, temperature: float = 0.7) -> str:
        """Generate text using our multilingual model"""
        # Tokenize with an attention mask; prompts longer than 512 tokens are truncated
        inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(device)
        
        with torch.no_grad():
            # mT5 defines its own pad token, so pass that instead of the EOS id
            outputs = model.generate(
                **inputs, max_length=max_length, temperature=temperature,
                do_sample=True, pad_token_id=tokenizer.pad_token_id
            )
        
        return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    
    # Test the model
    test_output = generate_text("Classify this dialogue: A: Let's meet at 3pm. B: Perfect!", max_length=50)
    print(f"\n🧪 Test: {test_output}")
    
except Exception as e:
    print(f"❌ Error: {str(e)}")
    print("💡 Try restarting runtime or using CPU-only mode")
    model, tokenizer = None, None

# 📊 Multilingual test data for classification and QA
test_data = {
    "English": {
        "classification": [
            {"dialogue": "A: Can we schedule the meeting for 3pm? B: Yes, I'll send the invite.", "topic": "meeting"},
            {"dialogue": "A: How was your weekend? B: Great! Went hiking with friends.", "topic": "social"},
            {"dialogue": "A: My laptop won't start. B: Try holding the power button for 10 seconds.", "topic": "support"},
        ],
        "qa": {
            "context": "Alice and Bob plan a meeting. Alice suggests 3pm but Bob is busy until 4pm. They agree to meet at 4:30pm in the conference room.",
            "questions": ["What time did they agree to meet?", "Where will they meet?"],
            "answers": ["4:30pm", "conference room"]
        }
    },
    "French": {
        "classification": [
            {"dialogue": "A: Pouvons-nous programmer la réunion pour 15h? B: Oui, j'enverrai l'invitation.", "topic": "meeting"},
            {"dialogue": "A: Comment s'est passé ton week-end? B: Super! J'ai fait de la randonnée.", "topic": "social"},
        ],
        "qa": {
            "context": "Marie et Jean planifient une réunion. Marie propose 15h mais Jean est occupé jusqu'à 16h. Ils conviennent de se rencontrer à 16h30 en salle de conférence.",
            "questions": ["À quelle heure ont-ils convenu de se rencontrer?"],
            "answers": ["16h30"]
        }
    },
    # 🌍 ADD YOUR LANGUAGE HERE:
    # "YourLanguage": {
    #     "classification": [
    #         {"dialogue": "Your dialogue", "topic": "meeting"}
    #     ],
    #     "qa": {
    #         "context": "Your context",
    #         "questions": ["Your question?"],
    #         "answers": ["Your answer"]
    #     }
    # }
}

print(f"\n📋 Test data loaded for {len(test_data)} languages")
for lang in test_data:
    print(f"  {lang}: {len(test_data[lang]['classification'])} classification + {len(test_data[lang]['qa']['questions'])} QA examples")
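When adding your own language to `test_data`, it is easy to mistype a topic or leave a question without an answer. The sketch below shows one way such an entry could be checked before running any prompts. `validate_language_entry` is a hypothetical helper that is not part of the notebook, and the sample entries are invented for illustration.

```python
from typing import Dict, List

ALLOWED_TOPICS = {"meeting", "social", "support", "transaction", "other"}

def validate_language_entry(name: str, entry: Dict) -> List[str]:
    """Return a list of problems found in one language entry; empty means it looks usable."""
    problems = []
    for i, ex in enumerate(entry.get("classification", [])):
        if not ex.get("dialogue"):
            problems.append(f"{name}: classification[{i}] has no dialogue")
        if ex.get("topic") not in ALLOWED_TOPICS:
            problems.append(f"{name}: classification[{i}] has unknown topic {ex.get('topic')!r}")
    qa = entry.get("qa", {})
    if not qa.get("context"):
        problems.append(f"{name}: qa has no context")
    # Each question needs exactly one reference answer
    if len(qa.get("questions", [])) != len(qa.get("answers", [])):
        problems.append(f"{name}: qa questions and answers differ in length")
    return problems

# Invented entries for illustration only
sample_entry = {
    "classification": [{"dialogue": "A: Hi! B: Hello!", "topic": "social"}],
    "qa": {"context": "They meet at noon.", "questions": ["When do they meet?"], "answers": ["noon"]},
}
broken_entry = {
    "classification": [{"dialogue": "A: Nice weather. B: Yes.", "topic": "weather"}],
    "qa": {"context": "They meet at noon.", "questions": ["When do they meet?"], "answers": []},
}

for label, entry in [("Sample", sample_entry), ("Broken", broken_entry)]:
    issues = validate_language_entry(label, entry)
    print(label, "OK" if not issues else issues)
```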


In [None]:
### 2.1 🎯 Zero-shot, Few-shot, and Chain-of-Thought Comparison

# 🔧 Prompt engineering toolkit
def create_zero_shot_prompt(dialogue: str, task: str = "classification", question: str = "") -> str:
    """Zero-shot prompt - no examples provided.

    For QA, `dialogue` holds the context and `question` the question to answer.
    """
    if task == "classification":
        return f"""Classify this dialogue into one topic: meeting, social, support, transaction, other.

Dialogue: {dialogue}

Topic:"""
    else:  # QA
        return f"""Answer the question based on the context.

Context: {dialogue}

Question: {question}

Answer:"""

def create_few_shot_prompt(dialogue: str, examples: list, task: str = "classification", question: str = "") -> str:
    """Few-shot prompt - includes examples (the QA branch does not use them yet)"""
    if task == "classification":
        prompt = "Classify dialogues into topics: meeting, social, support, transaction, other.\n\nExamples:\n\n"
        for ex in examples[:2]:  # Use 2 examples to avoid length issues
            prompt += f"Dialogue: {ex['dialogue']}\nTopic: {ex['topic']}\n\n"
        prompt += f"Dialogue: {dialogue}\nTopic:"
        return prompt
    else:  # QA
        return f"""Answer questions based on context.

Context: {dialogue}

Question: {question}

Answer with specific information:"""

def create_chain_of_thought_prompt(context: str, question: str) -> str:
    """Chain-of-Thought prompt for step-by-step reasoning"""
    return f"""Answer the question step by step based on the context.

Context: {context}

Question: {question}

Let me think step by step:
1. What is the question asking?
2. What relevant information is in the context?
3. What is the answer?

Answer:"""

# 🧪 Run comprehensive prompt testing
def test_all_prompting_strategies():
    """Test zero-shot, few-shot, and Chain-of-Thought across languages"""
    
    results = []
    
    print("🎯 COMPREHENSIVE PROMPT ENGINEERING TEST")
    print("="*70)
    
    for language, data in test_data.items():
        print(f"\n🌍 TESTING: {language.upper()}")
        print("-" * 50)
        
        # Test classification
        if data["classification"]:
            test_dialogue = data["classification"][0]["dialogue"]
            true_topic = data["classification"][0]["topic"]
            
            print(f"📝 Classification task: {test_dialogue[:60]}...")
            print(f"📋 Expected: {true_topic}")
            
            # Zero-shot classification
            zero_prompt = create_zero_shot_prompt(test_dialogue, "classification")
            if model:
                zero_result = generate_text(zero_prompt, max_length=50, temperature=0.1)
                print(f"🎯 Zero-shot: {zero_result}")
                
                # Few-shot classification (using English examples for transfer);
                # drop the test dialogue itself so its label does not leak into the prompt
                few_examples = [ex for ex in test_data["English"]["classification"] if ex["dialogue"] != test_dialogue]
                few_prompt = create_few_shot_prompt(test_dialogue, few_examples, "classification")
                few_result = generate_text(few_prompt, max_length=50, temperature=0.1)
                print(f"📚 Few-shot: {few_result}")
                
                results.append({
                    "language": language, "task": "classification", "method": "zero-shot",
                    "input": test_dialogue[:50] + "...", "output": zero_result, "expected": true_topic
                })
                results.append({
                    "language": language, "task": "classification", "method": "few-shot", 
                    "input": test_dialogue[:50] + "...", "output": few_result, "expected": true_topic
                })
            
        # Test QA with Chain-of-Thought
        if data["qa"]["questions"]:
            context = data["qa"]["context"]
            question = data["qa"]["questions"][0]
            expected_answer = data["qa"]["answers"][0]
            
            print(f"\n❓ QA task: {question}")
            print(f"📋 Expected: {expected_answer}")
            
            # Chain-of-Thought QA
            cot_prompt = create_chain_of_thought_prompt(context, question)
            if model:
                cot_result = generate_text(cot_prompt, max_length=120, temperature=0.2)
                print(f"🧠 Chain-of-Thought: {cot_result}")
                
                results.append({
                    "language": language, "task": "qa", "method": "chain-of-thought",
                    "input": question, "output": cot_result, "expected": expected_answer
                })
        
        print()
    
    return results

# Run the comprehensive test
if model:
    test_results = test_all_prompting_strategies()
    print(f"✅ Completed testing across {len(test_data)} languages")
else:
    print("⚠️  Model not available - showing prompt structure only")
    # Show example prompts
    example_dialogue = "A: Can we meet at 3pm? B: Perfect!"
    print("\n📝 EXAMPLE PROMPTS:")
    print("\n🎯 Zero-shot:")
    print(create_zero_shot_prompt(example_dialogue))
    print("\n📚 Few-shot structure:")
    print(create_few_shot_prompt(example_dialogue, [{"dialogue": "Example", "topic": "meeting"}])[:200] + "...")

### 2.2 📊 Evaluation Framework

def create_evaluation_rubric():
    """Evaluation framework for model outputs"""
    return {
        "correctness": {
            "1": "Completely wrong", "2": "Partially wrong", "3": "Mostly right", 
            "4": "Right answer", "5": "Perfect with reasoning"
        },
        "fluency": {
            "1": "Unnatural/errors", "2": "Awkward phrasing", "3": "Acceptable", 
            "4": "Good language", "5": "Native-like"
        },
        "cultural_appropriateness": {
            "1": "Inappropriate", "2": "Questionable", "3": "Neutral", 
            "4": "Appropriate", "5": "Culturally aware"
        }
    }

def evaluate_output(output: str, expected: str, language: str, task: str, method: str):
    """Template for manual evaluation"""
    return {
        "output": output,
        "expected": expected,
        "language": language,
        "task": task,
        "method": method,
        "correctness_score": 0,  # Fill in 1-5
        "fluency_score": 0,      # Fill in 1-5
        "cultural_score": 0,     # Fill in 1-5
        "notes": "",            # Your observations
        "improvement_suggestions": ""
    }

print("\n📋 EVALUATION FRAMEWORK")
print("="*40)
rubric = create_evaluation_rubric()
for dimension, scale in rubric.items():
    print(f"\n{dimension.upper()}:")
    for score, description in scale.items():
        print(f"  {score}: {description}")

print("\n🎯 YOUR TURN: Evaluate the outputs above using this 1-5 scale")
print("💡 Focus on how well each method works for your target language")

### 2.3 💬 Discussion Questions and Key Takeaways

discussion_guide = """
🤔 REFLECTION QUESTIONS:

1. **Cross-language Performance:**
   - Which prompting method worked best for your target language?
   - How did performance differ between English and your language?

2. **Method Comparison:** 
   - When did few-shot examples help vs. hurt?
   - How effective was Chain-of-Thought reasoning in non-English?

3. **Cultural Considerations:**
   - What cultural assumptions did you notice in outputs?
   - How would you adapt prompts for your cultural context?

4. **Practical Applications:**
   - Which approach would you use in production?
   - What are the trade-offs between methods?

📝 ACTION ITEMS:
□ Document 3 key insights about your target language
□ Identify best prompting strategies for your use case  
□ Note major challenges needing further research
□ Plan next steps for your project

🎯 KEY TAKEAWAYS:
• Prompt structure matters more than complexity
• Cultural context significantly impacts performance  
• Few-shot examples can bridge language gaps effectively
• Chain-of-Thought helps with reasoning across languages
• Evaluation must consider cultural appropriateness
"""

print("\n" + discussion_guide)

print("\n🎉 CONGRATULATIONS!")
print("You've completed hands-on prompt engineering for low-resource languages!")
print("Use these techniques responsibly and keep experimenting! 🚀")
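Once the manual evaluation templates have been filled in, the 1-5 scores can be averaged per prompting method to compare strategies. The sketch below is one possible way to do that. `summarize_scores` is a hypothetical helper, the score keys follow the template returned by `evaluate_output`, and the sample scores are invented purely for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

SCORE_KEYS = ("correctness_score", "fluency_score", "cultural_score")

def summarize_scores(evaluations: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Mean of each rubric score per prompting method, skipping unscored (0) entries."""
    by_method = defaultdict(lambda: defaultdict(list))
    for ev in evaluations:
        for key in SCORE_KEYS:
            # The template uses 0 as "not yet scored", so only keep values in 1..5
            if 1 <= ev.get(key, 0) <= 5:
                by_method[ev["method"]][key].append(ev[key])
    return {m: {k: mean(v) for k, v in scores.items()} for m, scores in by_method.items()}

# Invented scores for illustration only (not real evaluation results)
evaluations = [
    {"method": "zero-shot", "correctness_score": 2, "fluency_score": 3, "cultural_score": 3},
    {"method": "zero-shot", "correctness_score": 4, "fluency_score": 3, "cultural_score": 0},
    {"method": "few-shot", "correctness_score": 5, "fluency_score": 4, "cultural_score": 4},
]

for method, scores in summarize_scores(evaluations).items():
    print(method, scores)
```

With only one or two items per language, these averages are rough indicators at best; scoring more examples per method makes the comparison more meaningful.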
