# Session 2: Pretrained Models and Prompt Engineering 🤖

<div align="center">

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/Session2_prompt_engineering.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

</div>

---

Welcome to **systematic LLM-based prompt engineering** for dialogue summarization and cross-lingual tasks! This session combines rigorous methodology with practical applications, focusing on low-resource language challenges.

**🎯 Focus:** Systematic prompt engineering, multilingual evaluation, dialogue summarization  
**💻 Requirements:** GPU recommended for large models OR API access for best results  
**🔬 Methodology:** Research-grade systematic evaluation with pandas DataFrames

## Prerequisites

**📋 Recommended learning path:**
1. **Session 0:** Setup and tokenization basics ✅  
2. **Session 1:** Systematic baseline techniques ✅
3. **This session (Session 2):** Systematic LLM prompt engineering ← You are here!

## What You Will Master

1. **🏗️ Model family comparison** and systematic access pattern evaluation
2. **🎨 Prompt engineering vs. prompt design** with systematic methodology
3. **🎯 Multi-strategy prompting** (zero-shot, few-shot, Chain-of-Thought) with quantitative comparison
4. **🌍 Cross-lingual prompt transfer** with systematic cultural adaptation
5. **📊 Comprehensive evaluation framework** (correctness, fluency, cultural appropriateness)
6. **💼 Production-ready insights** with actionable recommendations

## Learning Objectives

By the end of this session, you will:
- ✅ **Systematically compare** pretrained model families using structured evaluation
- ✅ **Design culturally-aware prompts** for dialogue summarization and classification
- ✅ **Implement systematic prompt engineering** with quantitative tracking
- ✅ **Evaluate cross-lingual performance** using research-grade metrics
- ✅ **Generate actionable insights** for production deployment decisions
- ✅ **Export structured findings** for research and business applications

## 🔬 Research Methodology

**This session follows systematic research practices:**

- **📊 Structured Data Collection:** All experiments tracked in pandas DataFrames
- **🎯 Controlled Comparisons:** Systematic A/B testing of prompt strategies
- **📈 Quantitative Analysis:** Statistical evaluation with visualization
- **🌍 Cultural Sensitivity:** Multi-dimensional appropriateness assessment
- **💾 Reproducible Results:** Exportable data for further analysis

## How This Advanced Session Works

- **🎓 Theory + Systematic Practice:** Learn concepts → Apply systematically → Analyze quantitatively
- **🔬 Hypothesis-Driven Experiments:** Form hypotheses → Test systematically → Draw conclusions
- **📊 Data-First Analysis:** Every decision backed by quantitative evidence
- **💬 Evidence-Based Discussions:** Group analysis using concrete experimental data
- **🌍 Cross-Cultural Focus:** Systematic evaluation across language/culture pairs
- **🏆 Production Insights:** Actionable recommendations for real-world deployment


## 0. 🔬 Systematic Setup and Model Access Framework

**Strategic Decision:** Choose your model access pattern based on systematic evaluation of your requirements.

### 0.1 Access Pattern Decision Framework

| **Access Pattern** | **Pros** | **Cons** | **Best For** | **Cost** |
|-------------------|----------|----------|--------------|----------|
| **🔥 Local Models (mT5)** | Privacy, offline, customizable | GPU required, setup time | Research, sensitive data | Hardware only |
| **☁️ API Access (GPT-4)** | State-of-art, no setup, scalable | Token costs, internet needed | Production, experiments | $0.01-0.03/1K tokens |
| **🌐 Hosted (Colab/HF)** | Free tiers, easy setup | Limited resources, usage caps | Learning, prototyping | Free-$10/month |

**🎯 Recommendation Matrix:**
- **Learning/Research:** Start with Local Models (below) + backup API for comparison
- **Production Planning:** Test with APIs, validate costs, then decide on deployment
- **Resource-Constrained:** Use hosted solutions with systematic evaluation
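
To make the cost column concrete, the arithmetic behind an API budget can be sketched as follows. This is only an illustration: `estimate_api_cost` is a hypothetical helper, the workload of 100 requests at about 400 tokens each is an assumption, and the prices are the range quoted in the table rather than current vendor pricing.

```python
def estimate_api_cost(n_requests, tokens_per_request, price_per_1k_tokens):
    """Estimated cost in dollars for a batch of API requests."""
    total_tokens = n_requests * tokens_per_request
    return total_tokens / 1000 * price_per_1k_tokens

# Assumed workload: 100 requests of about 400 tokens each (prompt + completion)
low = estimate_api_cost(100, 400, 0.01)   # lower end of the table's price range
high = estimate_api_cost(100, 400, 0.03)  # upper end of the table's price range
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Running the same calculation with your own request counts is a quick way to check whether the API route fits your budget before committing to it.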


In [None]:
# 📦 Systematic Setup for Session 2: Advanced Prompt Engineering
# Install packages with systematic dependency management

import sys
import subprocess
import warnings
warnings.filterwarnings('ignore')

def install_packages_systematic(packages):
    """Install packages with better error handling and progress tracking"""
    installed = []
    failed = []
    
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
            installed.append(package.split(">=")[0].split("==")[0])
            print(f"✅ {package}")
        except Exception as e:
            failed.append((package, str(e)[:50]))
            print(f"❌ {package}: {str(e)[:50]}...")
    
    return installed, failed

print("🚀 SYSTEMATIC SETUP: Installing advanced packages...")
print("=" * 60)

# Core packages for systematic evaluation
core_packages = [
    "pandas>=1.5.0",
    "matplotlib>=3.5.0", 
    "seaborn>=0.11.0",
    "numpy>=1.21.0"
]

# LLM packages (grouped for better dependency management)
llm_packages = [
    "transformers>=4.35.0",
    "torch>=1.13.0", 
    "sentencepiece",
    "accelerate",
    "datasets"
]

print("📊 Installing data science packages...")
core_installed, core_failed = install_packages_systematic(core_packages)

print("🤖 Installing LLM packages...")
llm_installed, llm_failed = install_packages_systematic(llm_packages)

# Essential imports for systematic evaluation
try:
    import torch
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from typing import List, Dict, Optional, Tuple
    import time
    import json
    from datetime import datetime
    
    # Set plotting style for professional visualizations
    plt.style.use('default')
    sns.set_palette("husl")
    
    print(f"\n🎯 SYSTEM CONFIGURATION:")
    print(f"   Python: {sys.version.split()[0]}")
    print(f"   PyTorch: {torch.__version__}")
    print(f"   Pandas: {pd.__version__}")
    print(f"   GPU Available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"   GPU Device: {torch.cuda.get_device_name(0)}")
        print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    
    print(f"\n📊 EXPERIMENTAL FRAMEWORK READY:")
    print(f"   ✅ Structured data collection with pandas")
    print(f"   ✅ Statistical analysis and visualization")
    print(f"   ✅ Export capabilities for research")
    print(f"   ✅ Systematic model comparison framework")
    
    setup_success = True
    
except ImportError as e:
    print(f"\n❌ IMPORT ERROR: {e}")
    print("🔄 Try restarting the runtime and running this cell again")
    setup_success = False

# Verification and troubleshooting
if core_failed or llm_failed:
    print(f"\n⚠️  INSTALLATION ISSUES DETECTED:")
    for pkg, error in core_failed + llm_failed:
        print(f"   ❌ {pkg}: {error}")
    print(f"\n💡 SOLUTIONS:")
    print(f"   1. Runtime → Restart Runtime, then re-run this cell")
    print(f"   2. Check internet connection")
    print(f"   3. Try installing packages individually")

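print("\n" + "=" * 60)
# Report success only if the imports worked and every package installed
if setup_success and not (core_failed or llm_failed):
    print("✅ SYSTEMATIC SETUP COMPLETE!")
    print("🔬 Ready for research-grade prompt engineering experiments")
else:
    print("⚠️  Setup finished with problems; resolve the issues above before continuing")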
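
If the setup cell reports problems, it can help to check which packages are already present before reinstalling. The sketch below uses the standard-library `importlib.metadata` module; `installed_version` is a hypothetical helper name, and the package list simply mirrors part of the one above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ["pandas", "transformers", "torch", "sentencepiece"]:
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

Unlike the install loop, this check does not touch the network, so it is safe to re-run after a runtime restart.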


## 1. 🔬 Systematic Experimental Framework Setup

**Research-Grade Methodology:** Before testing models, we establish a systematic evaluation framework.

### 1.1 📊 Define Experimental Scope


In [None]:
# 🌍 Configure Your Systematic Experiment: Languages and Tasks
# This systematic approach ensures reproducible, comparable results

# Define your target languages (CUSTOMIZE THIS FOR YOUR RESEARCH)
target_languages = [
    {
        "code": "en", 
        "name": "English", 
        "family": "Germanic",
        "speakers": "1.5B",
        "resource_level": "high",
        "writing_system": "Latin"
    },
    {
        "code": "fr", 
        "name": "French", 
        "family": "Romance",
        "speakers": "280M", 
        "resource_level": "high",
        "writing_system": "Latin"
    },
    {
        "code": "ar", 
        "name": "Arabic", 
        "family": "Semitic",
        "speakers": "420M",
        "resource_level": "medium",
        "writing_system": "Arabic"
    },
    # 🎯 ADD YOUR LOW-RESOURCE LANGUAGE HERE:
    # {
    #     "code": "your_code", 
    #     "name": "Your Language", 
    #     "family": "Language Family",
    #     "speakers": "~XXXk",
    #     "resource_level": "low",
    #     "writing_system": "Script"
    # },
]

# Define systematic task framework for dialogue summarization and classification
experimental_tasks = [
    {
        "task_id": "dialogue_classification",
        "description": "Classify dialogue type: meeting, social, support, transaction",
        "evaluation_type": "categorical_accuracy",
        "cultural_sensitivity": "medium",
        "domain": "general"
    },
    {
        "task_id": "dialogue_summarization", 
        "description": "Generate concise summary of dialogue content",
        "evaluation_type": "generative_quality",
        "cultural_sensitivity": "high",
        "domain": "general"
    },
    {
        "task_id": "intent_extraction",
        "description": "Extract primary intent/action items from dialogue",
        "evaluation_type": "information_extraction", 
        "cultural_sensitivity": "high",
        "domain": "business"
    }
]

# Convert to DataFrames for systematic analysis
languages_df = pd.DataFrame(target_languages)
tasks_df = pd.DataFrame(experimental_tasks)

print("🌍 SYSTEMATIC EXPERIMENTAL DESIGN")
print("=" * 50)

print("\n📋 Target Languages:")
display(languages_df[["name", "family", "resource_level", "speakers", "writing_system"]])

print("\n🎯 Experimental Tasks:")
display(tasks_df[["task_id", "description", "evaluation_type", "cultural_sensitivity"]])

print(f"\n📊 EXPERIMENTAL MATRIX:")
print(f"   Languages: {len(languages_df)}")
print(f"   Tasks: {len(tasks_df)}")
print(f"   Total combinations: {len(languages_df) * len(tasks_df)}")
print(f"   Systematic evaluation ensures comprehensive coverage!")

# Create systematic evaluation tracking framework
evaluation_columns = [
    # Experiment identifiers
    "experiment_id", "timestamp", "language_code", "language_name", "task_id",
    
    # Model and prompt configuration
    "model_name", "model_type", "access_pattern", "prompt_strategy", "shots_used",
    
    # Input/output data
    "input_text", "expected_output", "actual_output", "prompt_text",
    
    # Quantitative metrics
    "correctness_score", "fluency_score", "cultural_appropriateness_score",
    "response_time_ms", "token_count_input", "token_count_output",
    
    # Qualitative assessment
    "quality_issues", "cultural_notes", "improvement_suggestions", 
    
    # Systematic metadata
    "experiment_conditions", "model_parameters", "success_flag"
]

# Initialize systematic experiment tracking
experiments_df = pd.DataFrame(columns=evaluation_columns)

print(f"\n🔬 EVALUATION FRAMEWORK INITIALIZED:")
print(f"   Tracking {len(evaluation_columns)} systematic metrics per experiment")
print(f"   Ready for systematic data collection and analysis")
print(f"   ✅ Research-grade methodology established!")
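
The tracking columns above only define a schema; the cell does not show how a row gets filled in. One possible way to build consistent rows is sketched below. `make_experiment_record` and `RECORD_FIELDS` are hypothetical names, and the field list is a shortened subset of `evaluation_columns`, repeated here so the sketch runs on its own.

```python
from datetime import datetime, timezone

# Shortened subset of the tracking columns, repeated so this sketch is standalone
RECORD_FIELDS = [
    "experiment_id", "timestamp", "language_code", "task_id",
    "model_name", "prompt_strategy", "actual_output",
    "correctness_score", "success_flag",
]

def make_experiment_record(experiment_id, **values):
    """Build one row with every field present; unknown keys are rejected."""
    unknown = set(values) - set(RECORD_FIELDS)
    if unknown:
        raise KeyError(f"Unknown fields: {sorted(unknown)}")
    record = dict.fromkeys(RECORD_FIELDS)  # fields not supplied stay None
    record["experiment_id"] = experiment_id
    record["timestamp"] = datetime.now(timezone.utc).isoformat()
    record.update(values)
    return record

row = make_experiment_record(
    "exp_0001",
    language_code="fr",
    task_id="dialogue_classification",
    model_name="google/mt5-small",
    prompt_strategy="zero_shot",
    actual_output="meeting",
    correctness_score=1.0,
    success_flag=True,
)
print(row["experiment_id"], row["language_code"], row["correctness_score"])
```

Rows built this way can be collected in a plain list and converted once with `pd.DataFrame(rows)`, which avoids growing `experiments_df` one row at a time.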


### 1.2 📝 Systematic Test Data Framework

**Structured Test Cases:** Ensure consistent, comparable evaluation across languages and tasks.


In [None]:
# 📊 Systematic Test Data Creation
# Structured test cases for systematic comparison across languages and tasks

def create_systematic_test_data():
    """
    Create structured test data for systematic evaluation across languages and tasks.
    This ensures consistent comparison and eliminates ad-hoc testing bias.
    """
    
    # Systematic dialogue test cases (parallel across languages)
    test_cases = []
    
    # Test Case 1: Business Meeting Scenario
    test_cases.append({
        "case_id": "business_meeting_01",
        "domain": "business", 
        "complexity": "medium",
        "cultural_context": "professional",
        "languages": {
            "en": {
                "dialogue": "A: We need to finalize the budget by Friday. B: I'll have the numbers ready by Thursday. A: Perfect, let's schedule a review meeting.",
                "expected_classification": "meeting",
                "expected_summary": "Team discusses budget deadline and schedules review meeting for Thursday numbers.",
                "expected_intent": "Schedule budget review meeting"
            },
            "fr": {
                "dialogue": "A: Nous devons finaliser le budget vendredi. B: J'aurai les chiffres prêts jeudi. A: Parfait, planifions une réunion de révision.",
                "expected_classification": "meeting",
                "expected_summary": "L'équipe discute de l'échéance budgétaire et planifie une réunion de révision jeudi.",
                "expected_intent": "Planifier une réunion de révision budgétaire"
            },
            "ar": {
                "dialogue": "أ: نحتاج لإنهاء الميزانية يوم الجمعة. ب: سأجهز الأرقام يوم الخميس. أ: ممتاز، لنحدد موعد اجتماع مراجعة.",
                "expected_classification": "meeting",
                "expected_summary": "يناقش الفريق موعد الميزانية النهائي ويحدد اجتماع مراجعة للأرقام يوم الخميس.",
                "expected_intent": "جدولة اجتماع مراجعة الميزانية"
            }
            # 🎯 ADD YOUR LANGUAGE HERE following the same structure
        }
    })
    
    # Test Case 2: Technical Support Scenario  
    test_cases.append({
        "case_id": "tech_support_01",
        "domain": "support",
        "complexity": "low", 
        "cultural_context": "service",
        "languages": {
            "en": {
                "dialogue": "A: My computer won't start this morning. B: Did you try unplugging it for 30 seconds? A: Yes, but still nothing. B: Let me schedule a technician visit.",
                "expected_classification": "support",
                "expected_summary": "Customer reports computer startup issue, basic troubleshooting attempted, technician visit scheduled.",
                "expected_intent": "Schedule technician visit for computer repair"
            },
            "fr": {
                "dialogue": "A: Mon ordinateur ne démarre pas ce matin. B: Avez-vous essayé de le débrancher 30 secondes? A: Oui, mais toujours rien. B: Laissez-moi programmer une visite de technicien.",
                "expected_classification": "support",
                "expected_summary": "Le client signale un problème de démarrage d'ordinateur, dépannage de base tenté, visite de technicien programmée.",
                "expected_intent": "Programmer une visite de technicien pour réparation d'ordinateur"
            },
            "ar": {
                "dialogue": "أ: جهاز الكمبيوتر لا يعمل هذا الصباح. ب: هل جربت فصله لمدة 30 ثانية؟ أ: نعم، لكن لا شيء. ب: دعني أحدد زيارة فني.",
                "expected_classification": "support",
                "expected_summary": "يبلغ العميل عن مشكلة بدء تشغيل الكمبيوتر، تم محاولة استكشاف الأخطاء الأساسية، تم جدولة زيارة فني.",
                "expected_intent": "جدولة زيارة فني لإصلاح الكمبيوتر"
            }
        }
    })
    
    # Test Case 3: Social Conversation Scenario
    test_cases.append({
        "case_id": "social_conversation_01", 
        "domain": "social",
        "complexity": "low",
        "cultural_context": "informal",
        "languages": {
            "en": {
                "dialogue": "A: How was your weekend hiking trip? B: Amazing! The weather was perfect and the views were incredible. A: I'd love to join you next time!",
                "expected_classification": "social",
                "expected_summary": "Friends discuss successful weekend hiking trip with great weather and views, plan future trip together.",
                "expected_intent": "Plan future hiking trip together"
            },
            "fr": {
                "dialogue": "A: Comment s'est passée ta randonnée du week-end? B: Fantastique! Le temps était parfait et les vues incroyables. A: J'aimerais vous accompagner la prochaine fois!",
                "expected_classification": "social",
                "expected_summary": "Les amis discutent d'une randonnée réussie du week-end avec beau temps et vues, planifient un voyage futur ensemble.",
                "expected_intent": "Planifier une future randonnée ensemble"
            },
            "ar": {
                "dialogue": "أ: كيف كانت رحلة المشي في نهاية الأسبوع؟ ب: رائعة! كان الطقس مثالياً والمناظر لا تصدق. أ: أود الانضمام إليكم المرة القادمة!",
                "expected_classification": "social",
                "expected_summary": "يناقش الأصدقاء رحلة المشي الناجحة في نهاية الأسبوع مع طقس رائع ومناظر، يخططون لرحلة مستقبلية معاً.",
                "expected_intent": "التخطيط لرحلة مشي مستقبلية معاً"
            }
        }
    })
    
    return test_cases

# Create systematic test data
systematic_test_data = create_systematic_test_data()

# Convert to structured format for systematic analysis
test_items = []
for case in systematic_test_data:
    for lang_code, lang_data in case["languages"].items():
        # Create test items for each task type
        for task in tasks_df["task_id"].values:
            expected_output = ""
            if task == "dialogue_classification":
                expected_output = lang_data["expected_classification"]
            elif task == "dialogue_summarization": 
                expected_output = lang_data["expected_summary"]
            elif task == "intent_extraction":
                expected_output = lang_data["expected_intent"]
            
            test_items.append({
                "item_id": f"{case['case_id']}_{lang_code}_{task}",
                "case_id": case["case_id"],
                "language_code": lang_code,
                "task_id": task,
                "domain": case["domain"],
                "complexity": case["complexity"], 
                "cultural_context": case["cultural_context"],
                "input_text": lang_data["dialogue"],
                "expected_output": expected_output,
                "language_name": languages_df[languages_df["code"] == lang_code]["name"].iloc[0]
            })

# Convert to DataFrame for systematic analysis
test_items_df = pd.DataFrame(test_items)

print("📊 SYSTEMATIC TEST DATA FRAMEWORK")
print("=" * 50)

print(f"\n🎯 Test Coverage:")
print(f"   Test cases: {len(systematic_test_data)}")
print(f"   Languages per case: {len(systematic_test_data[0]['languages'])}")
print(f"   Tasks per language: {len(tasks_df)}")
print(f"   Total test items: {len(test_items_df)}")

print(f"\n📋 Test Distribution:")
test_summary = test_items_df.groupby(["language_name", "task_id"]).size().unstack(fill_value=0)
display(test_summary)

print(f"\n✅ Systematic test framework ready!")
print(f"   All languages have parallel test cases for fair comparison")
print(f"   Multiple domains and complexity levels covered")
print(f"   Cultural contexts systematically varied")

# Show sample test items
print(f"\n🔍 Sample Test Items (first 6):")
display(test_items_df[["item_id", "language_name", "task_id", "domain", "complexity"]].head(6))
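
Because the comparison relies on every case being parallel across languages, a small consistency check is useful once you start adding your own language. The following is a sketch under stated assumptions: `check_parallel_coverage` is a hypothetical helper, and the toy cases exist only to show what a failure looks like.

```python
def check_parallel_coverage(test_cases, required_fields):
    """Return a list of problems; an empty list means the cases are parallel."""
    problems = []
    if not test_cases:
        return problems
    reference = set(test_cases[0]["languages"])  # languages of the first case
    for case in test_cases:
        langs = set(case["languages"])
        if langs != reference:
            problems.append(
                f"{case['case_id']}: languages {sorted(langs)} != {sorted(reference)}"
            )
        for lang, fields in case["languages"].items():
            missing = [f for f in required_fields if not fields.get(f)]
            if missing:
                problems.append(f"{case['case_id']}/{lang}: missing {missing}")
    return problems

REQUIRED = ["dialogue", "expected_classification", "expected_summary", "expected_intent"]

# Toy cases for illustration only; the second one lacks French and a summary
toy_cases = [
    {"case_id": "ok_01", "languages": {
        "en": dict.fromkeys(REQUIRED, "x"), "fr": dict.fromkeys(REQUIRED, "x")}},
    {"case_id": "broken_01", "languages": {
        "en": {"dialogue": "x", "expected_classification": "x", "expected_intent": "x"}}},
]
for problem in check_parallel_coverage(toy_cases, REQUIRED):
    print(problem)
```

In the notebook, calling `check_parallel_coverage(systematic_test_data, REQUIRED)` is expected to return an empty list for the three cases defined above.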


## 2. 🤖 Systematic Model Loading and Prompt Engineering Framework

**Strategic Mission:** Systematically evaluate multiple model access patterns and prompt strategies across languages.

### 2.1 🔥 Advanced Model Loading with Performance Tracking

**Systematic Model Evaluation:** Compare local, API, and hosted approaches with quantitative metrics.

| **Access Pattern** | **Pros** | **Cons** | **Cost Analysis** | **Performance Expectation** |
|-------------------|----------|----------|-------------------|----------------------------|
| **🔥 Local mT5-Base** | Privacy, offline, fine-tunable | 2-4GB RAM, setup time | Hardware only (~$500-2000) | Good multilingual, customizable |
| **☁️ GPT-4 API** | State-of-art, minimal setup | $0.01-0.03/1K tokens | ~$50-200/month typical use | Excellent English, good multilingual |
| **🌐 Hosted Free (Colab)** | Zero cost, easy access | 12hr limits, GPU competition | Free (with limits) | Variable quality, good for learning |
| **🏢 Claude API** | Strong reasoning, safety | Limited multilingual support | Similar to GPT-4 | Excellent reasoning, English-focused |

### 2.2 📊 Systematic Task Framework

**Our systematic evaluation covers:**
1. **📊 Dialogue Classification:** Business/social/support categorization with cultural context
2. **✍️ Dialogue Summarization:** Concise, culturally-appropriate summaries
3. **🎯 Intent Extraction:** Action items and next steps identification
4. **🧠 Chain-of-Thought Reasoning:** Step-by-step multilingual reasoning evaluation
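
To compare prompting strategies on the classification task, each strategy needs a prompt built from the same dialogue. The sketch below shows one way to do that; the instruction wording, the `build_classification_prompt` helper, and the example shot are all assumptions made for illustration, while the four labels come from the task description earlier in this session.

```python
LABELS = ["meeting", "social", "support", "transaction"]

def build_classification_prompt(dialogue, strategy="zero_shot", examples=None):
    """Return a classification prompt for one of three prompting strategies."""
    instruction = f"Classify the dialogue as one of: {', '.join(LABELS)}."
    cue = "Label:"
    if strategy == "zero_shot":
        parts = [instruction]
    elif strategy == "few_shot":
        if not examples:
            raise ValueError("few_shot needs at least one (dialogue, label) pair")
        shots = [f"Dialogue: {d}\nLabel: {label}" for d, label in examples]
        parts = [instruction, *shots]
    elif strategy == "chain_of_thought":
        parts = [instruction,
                 "First explain who is speaking and what they want, then state the label."]
        cue = "Reasoning:"  # ask for the reasoning before the final label
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    parts.append(f"Dialogue: {dialogue}\n{cue}")
    return "\n\n".join(parts)

dialogue = "A: My computer won't start. B: Let me schedule a technician visit."
shots = [("A: How was your trip? B: Amazing!", "social")]
for strategy in ["zero_shot", "few_shot", "chain_of_thought"]:
    print(f"--- {strategy} ---")
    print(build_classification_prompt(dialogue, strategy, examples=shots))
```

Keeping the instruction identical across strategies means that any difference in results can be attributed to the added examples or reasoning cue rather than to rewording.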


In [None]:
# 🚀 STEP 1: Create the Model Manager Class
# This creates a system to handle different types of AI models

class SystematicModelManager:
    """
    Advanced model management with systematic evaluation and performance tracking
    """
    
    def __init__(self):
        self.models = {}
        self.performance_metrics = []
        self.current_model = None
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def load_local_model(self, model_name: str = "google/mt5-small") -> bool:
        """Load local multilingual model with systematic tracking"""
        print(f"🔄 Loading local model: {model_name}")
        print(f"⏱️  This may take 2-3 minutes on first run...")
        
        start_time = time.time()
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
            model = model.to(self.device)
            
            load_time = time.time() - start_time
            param_count = sum(p.numel() for p in model.parameters()) / 1e6
            
            # Store model configuration
            model_config = {
                "name": model_name,
                "type": "local_multilingual", 
                "tokenizer": tokenizer,
                "model": model,
                "device": str(self.device),
                "parameters_M": param_count,
                "load_time_s": load_time,
                "memory_gb": torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0
            }
            
            self.models[model_name] = model_config
            self.current_model = model_name
            
            print(f"✅ Model loaded successfully:")
            print(f"   Device: {self.device}")
            print(f"   Parameters: {param_count:.0f}M")
            print(f"   Load time: {load_time:.1f}s")
            if torch.cuda.is_available():
                print(f"   GPU memory: {model_config['memory_gb']:.1f}GB")
            
            return True
            
        except Exception as e:
            print(f"❌ Error loading {model_name}: {str(e)}")
            print("💡 Solutions:")
            print("   1. Try smaller model: google/mt5-small")
            print("   2. Restart runtime to free memory")
            print("   3. Use CPU-only mode")
            return False
    
    def setup_api_access(self, api_type: str = "placeholder"):
        """Setup API access (placeholder for actual API integration)"""
        print(f"🌐 Setting up {api_type} API access...")
        
        # Placeholder for API setup - in real use, add API key configuration
        api_config = {
            "name": f"{api_type}_api",
            "type": "api_access",
            "cost_per_1k_tokens": 0.02 if api_type == "gpt4" else 0.01,
            "setup_time_s": 0.1,
            "requires_internet": True
        }
        
        self.models[f"{api_type}_api"] = api_config
        print(f"✅ {api_type} API configured (placeholder)")
        print(f"   Estimated cost: ${api_config['cost_per_1k_tokens']}/1K tokens")
        return True
    
    def generate_text_systematic(self, prompt: str, max_length: int = 100, 
                                temperature: float = 0.7, model_name: Optional[str] = None) -> Dict:
        """
        Generate text with systematic performance tracking
        """
        model_name = model_name or self.current_model
        if not model_name or model_name not in self.models:
            return {"error": "No model loaded", "output": "", "metrics": {}}
        
        model_config = self.models[model_name]
        
        if model_config["type"] == "local_multilingual":
            return self._generate_local(prompt, model_config, max_length, temperature)
        elif model_config["type"] == "api_access":
            return self._generate_api(prompt, model_config, max_length, temperature)
        else:
            return {"error": "Unknown model type", "output": "", "metrics": {}}
    
    def _generate_local(self, prompt: str, model_config: Dict, max_length: int, temperature: float) -> Dict:
        """Generate using local model with performance tracking"""
        start_time = time.time()
        
        try:
            tokenizer = model_config["tokenizer"]
            model = model_config["model"]
            
            # Tokenization with tracking
            inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=512, truncation=True)
            input_tokens = inputs.shape[1]
            inputs = inputs.to(self.device)
            
            # Generation with tracking
            with torch.no_grad():
                outputs = model.generate(
                    inputs, 
                    max_length=max_length, 
                    temperature=temperature,
                    do_sample=True, 
                    pad_token_id=tokenizer.eos_token_id
                )
            
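            # Decode output
            output_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
            # Encoder-decoder models such as mT5 return only the generated sequence,
            # so the prompt length must not be subtracted from the output length.
            output_tokens = outputs.shape[1]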
            
            generation_time = time.time() - start_time
            
            # Calculate performance metrics
            metrics = {
                "model_name": model_config["name"],
                "generation_time_ms": generation_time * 1000,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "tokens_per_second": output_tokens / generation_time if generation_time > 0 else 0,
                "success": True
            }
            
            return {
                "output": output_text,
                "metrics": metrics,
                "error": None
            }
            
        except Exception as e:
            return {
                "output": "", 
                "metrics": {"success": False, "error": str(e)},
                "error": str(e)
            }
    
    def _generate_api(self, prompt: str, model_config: Dict, max_length: int, temperature: float) -> Dict:
        """Placeholder for API generation - replace with actual API calls"""
        return {
            "output": f"[API placeholder - would call {model_config['name']} with prompt]",
            "metrics": {
                "model_name": model_config["name"],
                # Rough proxy only: len(prompt) counts characters, not tokens
                "estimated_cost": len(prompt) * model_config["cost_per_1k_tokens"] / 1000,
                "success": True
            },
            "error": None
        }

# Initialize systematic model manager
print("🚀 SYSTEMATIC MODEL MANAGER INITIALIZATION")
print("=" * 60)

model_manager = SystematicModelManager()

# Option 1: Load local model (recommended for learning)
print("🔥 Loading local multilingual model...")
local_success = model_manager.load_local_model("google/mt5-small")

# Option 2: Setup API access (for production comparison)
print("\n☁️ Configuring API access...")
api_success = model_manager.setup_api_access("gpt4")

# Test the systematic framework
if local_success:
    print("\n🧪 TESTING SYSTEMATIC GENERATION:")
    test_prompt = "Classify this dialogue: A: Let's schedule a meeting for 3pm. B: Perfect, I'll send the invite."
    
    result = model_manager.generate_text_systematic(test_prompt, max_length=50, temperature=0.1)
    
    if result["error"]:
        print(f"❌ Test failed: {result['error']}")
    else:
        print(f"✅ Test successful:")
        print(f"   Output: {result['output']}")
        print(f"   Generation time: {result['metrics']['generation_time_ms']:.1f}ms")
        print(f"   Tokens/second: {result['metrics']['tokens_per_second']:.1f}")

print(f"\n📊 MODEL MANAGER STATUS:")
print(f"   Available models: {len(model_manager.models)}")
print(f"   Current model: {model_manager.current_model}")
print(f"   Device: {model_manager.device}")
print(f"   ✅ Ready for systematic prompt engineering experiments!")


### 📚 What You Just Did:
**You created the foundation of your model manager!** This class will help you:
- Keep track of different AI models (local and online)
- Measure how fast they work
- Switch between models easily

**🎯 What to Expect:** Loading messages for `google/mt5-small`, a placeholder API confirmation, and a short test generation, because the cell also creates a manager and runs it  
**✅ Success Indicator:** No error messages  
**⚠️ If you see errors:** Check that you ran the setup cell above first
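
The manager creates a `performance_metrics` list, but the code shown here never fills it, so measuring speed is left to you. One possible approach, assuming you collect the `metrics` dictionaries returned by each generation call yourself, is sketched below. `summarize_latency` is a hypothetical helper and the numbers are illustrative, not measured results.

```python
from collections import defaultdict
from statistics import mean

def summarize_latency(metrics_list):
    """Average generation time per model, ignoring failed runs."""
    by_model = defaultdict(list)
    for metrics in metrics_list:
        if metrics.get("success"):
            by_model[metrics["model_name"]].append(metrics["generation_time_ms"])
    return {name: mean(times) for name, times in by_model.items()}

# Illustrative numbers only, not measured results
runs = [
    {"model_name": "google/mt5-small", "generation_time_ms": 120.0, "success": True},
    {"model_name": "google/mt5-small", "generation_time_ms": 80.0, "success": True},
    {"model_name": "google/mt5-small", "success": False, "error": "timeout"},
]
print(summarize_latency(runs))
```

The keys used here (`model_name`, `generation_time_ms`, `success`) match the metrics dictionary built in `_generate_local`, so results from the manager can be passed in unchanged.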

In [None]:
# 🚀 STEP 2: Add Model Loading Function
# This function downloads and loads an AI model onto your computer/Colab

def load_local_model(self, model_name: str = "google/mt5-small"):
    """Load local multilingual model with systematic tracking"""
    print(f"🔄 Loading local model: {model_name}")
    print(f"⏱️  This may take 2-3 minutes on first run...")
    
    start_time = time.time()
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        model = model.to(self.device)
        
        load_time = time.time() - start_time
        param_count = sum(p.numel() for p in model.parameters()) / 1e6
        
        # Store model configuration
        model_config = {
            "name": model_name,
            "type": "local_multilingual", 
            "tokenizer": tokenizer,
            "model": model,
            "device": str(self.device),
            "parameters_M": param_count,
            "load_time_s": load_time,
            "memory_gb": torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0
        }
        
        self.models[model_name] = model_config
        self.current_model = model_name
        
        print(f"✅ Model loaded successfully:")
        print(f"   Device: {self.device}")
        print(f"   Parameters: {param_count:.0f}M")
        print(f"   Load time: {load_time:.1f}s")
        if torch.cuda.is_available():
            print(f"   GPU memory: {model_config['memory_gb']:.1f}GB")
        
        return True
        
    except Exception as e:
        print(f"❌ Error loading {model_name}: {str(e)}")
        print("💡 Solutions:")
        print("   1. Try smaller model: google/mt5-small")
        print("   2. Restart runtime to free memory")
        print("   3. Use CPU-only mode")
        return False

# Add this method to our SystematicModelManager class
SystematicModelManager.load_local_model = load_local_model

### 📚 What You Just Did:
**You added a function to load AI models!** This is like installing an app on your phone.

**🎯 What to Expect:** 
- First time: Nothing visible (just adds the function)
- When you actually use it later: Downloads the model weights (roughly 1.2 GB for mT5-small in 32-bit precision)

**📊 What the Numbers Mean:**
- **Parameters**: The number of learned weights, in millions. More parameters usually mean more capacity, but also more memory and slower generation
- **Load time**: How long it took to download (first run only) and move the model onto the device
- **GPU memory**: The peak GPU memory PyTorch has allocated so far in this session

**⚠️ Common Issues:**
- "Out of memory": The model is too big for your system
- "Connection error": Internet issue during download
- Takes forever: Normal for first download, fast after that
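
To make the link between parameter count and memory concrete, here is a rough back-of-the-envelope sketch. It assumes the weights are stored as 32-bit floats (4 bytes each) and ignores activations and framework overhead, so real usage will be higher. `estimate_weight_memory_gb` is a hypothetical helper for illustration, not part of the model manager.

```python
def estimate_weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 4) -> float:
    """Approximate memory needed just to hold the weights, in GB (1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# A 300M-parameter model (the size class of mT5-small) in two precisions
print(f"float32: {estimate_weight_memory_gb(300e6):.1f} GB")
print(f"float16: {estimate_weight_memory_gb(300e6, bytes_per_parameter=2):.1f} GB")
```

Halving the bytes per parameter halves the estimate, which is why half-precision loading is a common first step when a model does not fit.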

In [None]:
# 🚀 STEP 3: Add API Setup Function
# This function sets up online AI services (like ChatGPT APIs)

def setup_api_access(self, api_type: str = "placeholder"):
    """Setup API access (placeholder for actual API integration)"""
    print(f"🌐 Setting up {api_type} API access...")
    
    # Placeholder for API setup - in real use, add API key configuration
    api_config = {
        "name": f"{api_type}_api",
        "type": "api_access",
        "cost_per_1k_tokens": 0.02 if api_type == "gpt4" else 0.01,
        "setup_time_s": 0.1,
        "requires_internet": True
    }
    
    self.models[f"{api_type}_api"] = api_config
    print(f"✅ {api_type} API configured (placeholder)")
    print(f"   Estimated cost: ${api_config['cost_per_1k_tokens']}/1K tokens")
    return True

# Add this method to our SystematicModelManager class  
SystematicModelManager.setup_api_access = setup_api_access

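### 📚 What You Just Did:
**You added a function to connect to online AI services!** This is like connecting to ChatGPT, but through code.

**🎯 What to Expect:** 
- Nothing visible when the cell runs; when you call the function later it shows an "API configured (placeholder)" message
- This is just a demo - real APIs need keys

**💰 What the Cost Numbers Mean:**
- **Cost per 1K tokens**: How much it costs to process 1,000 tokens, which is roughly 750 English words. Text in low-resource languages often splits into more tokens per word, so the same text can cost more
- Example: $0.02/1K tokens = 2 cents per ~750 English words processed
- The prices in this cell are illustrative placeholders, not current provider rates

**🔄 Next:** We'll add the actual FREE APIs that work on Colab later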
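
The cost arithmetic above can be sketched in a few lines. Both helpers are hypothetical and exist only for illustration. The 4-characters-per-token ratio is a rule of thumb for English, so treat it as an assumption and expect a lower ratio (more tokens) for many low-resource languages.

```python
def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer gives the exact count."""
    return max(1, round(len(text) / chars_per_token))


def estimate_api_cost(num_tokens: int, cost_per_1k_tokens: float) -> float:
    """Cost in dollars for processing num_tokens at a given per-1K-token price."""
    return num_tokens / 1000 * cost_per_1k_tokens


prompt = "Classify this dialogue into one topic: meeting, social, support, transaction, other."
tokens = rough_token_count(prompt)
print(f"~{tokens} tokens, ~${estimate_api_cost(tokens, 0.02):.5f} at $0.02/1K tokens")
```

When an exact figure matters, count tokens with the tokenizer of the model you are actually calling rather than relying on the character ratio.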

In [None]:
# 🚀 STEP 4: Add Text Generation Function (Main Function)
# This is the function that actually generates text from your prompts

def generate_text_systematic(self, prompt: str, max_length: int = 100, 
                            temperature: float = 0.7, model_name: str = None) -> Dict:
    """Generate text with systematic performance tracking"""
    model_name = model_name or self.current_model
    if not model_name or model_name not in self.models:
        return {"error": "No model loaded", "output": "", "metrics": {}}
    
    model_config = self.models[model_name]
    
    if model_config["type"] == "local_multilingual":
        return self._generate_local(prompt, model_config, max_length, temperature)
    elif model_config["type"] == "api_access":
        return self._generate_api(prompt, model_config, max_length, temperature)
    else:
        return {"error": "Unknown model type", "output": "", "metrics": {}}

# Add this method to our SystematicModelManager class
SystematicModelManager.generate_text_systematic = generate_text_systematic

In [None]:
# 🚀 STEP 5: Add Helper Functions for Text Generation
# These functions handle the details of generating text with local models vs APIs

def _generate_local(self, prompt: str, model_config: Dict, max_length: int, temperature: float) -> Dict:
    """Generate using local model with performance tracking"""
    start_time = time.time()
    
    try:
        tokenizer = model_config["tokenizer"]
        model = model_config["model"]
        
        # Tokenization with tracking
        inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=512, truncation=True)
        input_tokens = inputs.shape[1]
        inputs = inputs.to(self.device)
        
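        # Generation with tracking. Sample only when temperature > 0:
        # transformers rejects temperature=0 together with do_sample=True,
        # so temperature=0 falls back to greedy decoding.
        do_sample = temperature > 0
        sampling_args = {"temperature": temperature} if do_sample else {}
        with torch.no_grad():
            outputs = model.generate(
                inputs, 
                max_length=max_length, 
                do_sample=do_sample, 
                pad_token_id=tokenizer.eos_token_id,
                **sampling_args
            )
        
        # Decode output
        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        # Seq2seq models such as mT5 return only the generated (decoder) tokens,
        # so the prompt length must not be subtracted here
        output_tokens = outputs.shape[1]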
        
        generation_time = time.time() - start_time
        
        # Calculate performance metrics
        metrics = {
            "model_name": model_config["name"],
            "generation_time_ms": generation_time * 1000,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "tokens_per_second": output_tokens / generation_time if generation_time > 0 else 0,
            "success": True
        }
        
        return {
            "output": output_text,
            "metrics": metrics,
            "error": None
        }
        
    except Exception as e:
        return {
            "output": "", 
            "metrics": {"success": False, "error": str(e)},
            "error": str(e)
        }

def _generate_api(self, prompt: str, model_config: Dict, max_length: int, temperature: float) -> Dict:
    """Placeholder for API generation - replace with actual API calls"""
    # Rough token estimate: about 4 characters per token for English text
    # (an approximation; real APIs report exact token usage)
    estimated_tokens = len(prompt) / 4
    return {
        "output": f"[API placeholder - would call {model_config['name']} with prompt]",
        "metrics": {
            "model_name": model_config["name"],
            "estimated_cost": estimated_tokens * model_config["cost_per_1k_tokens"] / 1000,
            "success": True
        },
        "error": None
    }

# Add these methods to our SystematicModelManager class
SystematicModelManager._generate_local = _generate_local
SystematicModelManager._generate_api = _generate_api

### 📚 What You Just Did:
**You added the "engine" that makes AI models work!** This is like the brain of your AI system.

**🎯 What Each Function Does:**
- **`generate_text_systematic`**: The main function you'll call to get AI responses
- **`_generate_local`**: Handles local models (like mT5 on your computer)  
- **`_generate_api`**: Handles online services (like ChatGPT APIs)

**📊 What the Performance Metrics Mean:**
- **Generation time**: How long the AI took to respond
- **Input/Output tokens**: How many tokens (subword pieces, not whole words) went in and came out
- **Tokens per second**: Output tokens divided by generation time

**⚠️ `_generate_local` and `_generate_api` are helper functions - you won't call them directly!**
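
The metrics above come from timing one call and counting its output tokens. The sketch below shows that calculation in isolation. `measure_generation` and `fake_generate` are hypothetical names, and the fake generator only stands in for a real model call so the example runs without downloading anything.

```python
import time
from typing import Callable, Dict, List


def measure_generation(generate_fn: Callable[[str], List[str]], prompt: str) -> Dict:
    """Time one generation call and derive the same metrics the manager reports."""
    start = time.perf_counter()
    tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "generation_time_ms": elapsed * 1000,
        "output_tokens": len(tokens),
        "tokens_per_second": len(tokens) / elapsed if elapsed > 0 else 0.0,
    }


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a model: waits briefly, then returns whitespace 'tokens'."""
    time.sleep(0.05)
    return prompt.split()


stats = measure_generation(fake_generate, "this stands in for a model call")
print(stats)
```

Timings vary from run to run, so compare models on several prompts and average the results rather than trusting a single measurement.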

In [None]:
# 🚀 STEP 6: Initialize and Test Your Model Manager
# Now let's create your model manager and load a model!

print("🚀 SYSTEMATIC MODEL MANAGER INITIALIZATION")
print("=" * 60)

# Create your model manager
model_manager = SystematicModelManager()
print("✅ Model manager created!")
print(f"   Device available: {model_manager.device}")

# Try to load a local model
print("\n🔥 Loading local multilingual model...")
local_success = model_manager.load_local_model("google/mt5-small")

if local_success:
    print(f"\n📊 MODEL MANAGER STATUS:")
    print(f"   Available models: {len(model_manager.models)}")
    print(f"   Current model: {model_manager.current_model}")
    print(f"   Device: {model_manager.device}")
    print(f"   ✅ Ready for experiments!")
else:
    print("⚠️  Local model failed to load - we'll setup APIs next")

### 📚 What You Should See:
**This cell starts up your AI system!** Here's what to expect:

**✅ Success Indicators:**
- "Model manager created!" message
- "Device available: cuda" (on Colab GPU) or "cpu" 
- Loading progress for the mT5 model (2-3 minutes first time)
- "Model loaded successfully" with statistics
- "Ready for experiments!" at the end

**⏱️ Expected Timeline:**
- First run: 2-3 minutes (downloads roughly 1.2 GB of model weights)
- Subsequent runs: 10-30 seconds (model already cached)

**❌ If You See Errors:**
- "CUDA out of memory": Model too big, restart runtime
- "Connection error": Internet issue, try again
- "Import error": Run the setup cells above first

**📊 What the Statistics Mean:**
- **Parameters**: 300M = small model, 3B = large model
- **GPU Memory**: How much graphics card memory it's using
- **Load time**: Normal range is 10-180 seconds

In [None]:
# 🧪 STEP 7: Test Your AI Model!
# Let's see if your model can understand and respond to text

if 'model_manager' in locals() and model_manager.current_model:
    print("🧪 TESTING YOUR AI MODEL")
    print("=" * 50)
    
    # Simple test prompt
    test_prompt = "Hello! Can you help me classify this conversation: A: Let's meet at 3pm. B: Perfect, see you then."
    
    print(f"🎯 Test prompt: {test_prompt}")
    print("\n⏱️  Generating response...")
    
    # Generate response
    result = model_manager.generate_text_systematic(test_prompt, max_length=50, temperature=0.1)
    
    if result["error"]:
        print(f"❌ Test failed: {result['error']}")
        print("💡 Try restarting the runtime and running all cells again")
    else:
        print(f"✅ SUCCESS! Your AI responded:")
        print(f"📝 Output: {result['output']}")
        print(f"\n📊 Performance:")
        print(f"   Generation time: {result['metrics']['generation_time_ms']:.1f}ms")
        print(f"   Tokens per second: {result['metrics']['tokens_per_second']:.1f}")
        print(f"\n🎉 Your AI model is working perfectly!")
        
else:
    print("⚠️  Model not loaded yet. Run the cell above first!")
    print("🔄 If it failed, try restarting runtime and running all cells")

### 📚 What You Should See:
**This cell tests that your AI is actually working!** 

**✅ Success Indicators:**
- "SUCCESS! Your AI responded:" message
- Some text output (might be weird/incomplete - that's normal!)
- Performance metrics showing generation time and speed
- "Your AI model is working perfectly!" message

**🎯 What the Output Means:**
- **The AI response**: Might be random/incomplete. The public mT5 checkpoints were pretrained only on a fill-in-the-blank objective, so they need fine-tuning before they follow instructions well
- **Generation time**: How long the response took (lower = better)  
- **Tokens per second**: Processing speed (10-100 is normal range)

**🚀 What's Next:**
- Don't worry if the response seems random
- We'll improve it with proper prompting techniques
- The important thing is that it's working without errors!

**❌ If It Doesn't Work:**
- Check that all previous cells ran without errors
- Try restarting runtime (Runtime → Restart Runtime)

In [None]:
### 2.1 🎯 Zero-shot, Few-shot, and Chain-of-Thought Comparison

# 🔧 Prompt engineering toolkit
def create_zero_shot_prompt(dialogue: str, task: str = "classification") -> str:
    """Zero-shot prompt - no examples provided"""
    if task == "classification":
        return f"""Classify this dialogue into one topic: meeting, social, support, transaction, other.

Dialogue: {dialogue}

Topic:"""
    else:  # QA
        return f"""Answer the question based on the context.

Context: {dialogue}

Answer:"""

def create_few_shot_prompt(dialogue: str, examples: list, task: str = "classification") -> str:
    """Few-shot prompt - includes examples"""
    if task == "classification":
        prompt = "Classify dialogues into topics: meeting, social, support, transaction, other.\n\nExamples:\n\n"
        for ex in examples[:2]:  # Use 2 examples to avoid length issues
            prompt += f"Dialogue: {ex['dialogue']}\nTopic: {ex['topic']}\n\n"
        prompt += f"Dialogue: {dialogue}\nTopic:"
        return prompt
    else:  # QA
        return f"""Answer questions based on context.

Context: {dialogue}

Answer with specific information:"""

def create_chain_of_thought_prompt(context: str, question: str) -> str:
    """Chain-of-Thought prompt for step-by-step reasoning"""
    return f"""Answer the question step by step based on the context.

Context: {context}

Question: {question}

Let me think step by step:
1. What is the question asking?
2. What relevant information is in the context?
3. What is the answer?

Answer:"""

# 🧪 Run comprehensive prompt testing
def test_all_prompting_strategies():
    """Test zero-shot, few-shot, and Chain-of-Thought across languages"""
    
    results = []
    
    print("🎯 COMPREHENSIVE PROMPT ENGINEERING TEST")
    print("="*70)
    
    for language, data in test_data.items():
        print(f"\n🌍 TESTING: {language.upper()}")
        print("-" * 50)
        
        # Test classification
        if data["classification"]:
            test_dialogue = data["classification"][0]["dialogue"]
            true_topic = data["classification"][0]["topic"]
            
            print(f"📝 Classification task: {test_dialogue[:60]}...")
            print(f"📋 Expected: {true_topic}")
            
            # Zero-shot classification
            zero_prompt = create_zero_shot_prompt(test_dialogue, "classification")
            if model:
                zero_result = generate_text(zero_prompt, max_length=50, temperature=0.1)
                print(f"🎯 Zero-shot: {zero_result}")
                
                # Few-shot classification (using English examples for transfer).
                # Exclude the dialogue under test so the English run does not
                # see its own answer among the examples.
                few_examples = [ex for ex in test_data["English"]["classification"]
                                if ex["dialogue"] != test_dialogue]
                few_prompt = create_few_shot_prompt(test_dialogue, few_examples, "classification")
                few_result = generate_text(few_prompt, max_length=50, temperature=0.1)
                print(f"📚 Few-shot: {few_result}")
                
                results.append({
                    "language": language, "task": "classification", "method": "zero-shot",
                    "input": test_dialogue[:50] + "...", "output": zero_result, "expected": true_topic
                })
                results.append({
                    "language": language, "task": "classification", "method": "few-shot", 
                    "input": test_dialogue[:50] + "...", "output": few_result, "expected": true_topic
                })
            
        # Test QA with Chain-of-Thought
        if data["qa"]["questions"]:
            context = data["qa"]["context"]
            question = data["qa"]["questions"][0]
            expected_answer = data["qa"]["answers"][0]
            
            print(f"\n❓ QA task: {question}")
            print(f"📋 Expected: {expected_answer}")
            
            # Chain-of-Thought QA
            cot_prompt = create_chain_of_thought_prompt(context, question)
            if model:
                cot_result = generate_text(cot_prompt, max_length=120, temperature=0.2)
                print(f"🧠 Chain-of-Thought: {cot_result}")
                
                results.append({
                    "language": language, "task": "qa", "method": "chain-of-thought",
                    "input": question, "output": cot_result, "expected": expected_answer
                })
        
        print()
    
    return results

# Run the comprehensive test
if model:
    test_results = test_all_prompting_strategies()
    print(f"✅ Completed testing across {len(test_data)} languages")
else:
    print("⚠️  Model not available - showing prompt structure only")
    # Show example prompts
    example_dialogue = "A: Can we meet at 3pm? B: Perfect!"
    print("\n📝 EXAMPLE PROMPTS:")
    print("\n🎯 Zero-shot:")
    print(create_zero_shot_prompt(example_dialogue))
    print("\n📚 Few-shot structure:")
    print(create_few_shot_prompt(example_dialogue, [{"dialogue": "Example", "topic": "meeting"}])[:200] + "...")

### 2.2 📊 Evaluation Framework

def create_evaluation_rubric():
    """Evaluation framework for model outputs"""
    return {
        "correctness": {
            "1": "Completely wrong", "2": "Partially wrong", "3": "Mostly right", 
            "4": "Right answer", "5": "Perfect with reasoning"
        },
        "fluency": {
            "1": "Unnatural/errors", "2": "Awkward phrasing", "3": "Acceptable", 
            "4": "Good language", "5": "Native-like"
        },
        "cultural_appropriateness": {
            "1": "Inappropriate", "2": "Questionable", "3": "Neutral", 
            "4": "Appropriate", "5": "Culturally aware"
        }
    }

def evaluate_output(output: str, expected: str, language: str, task: str, method: str):
    """Template for manual evaluation"""
    return {
        "output": output,
        "expected": expected,
        "language": language,
        "task": task,
        "method": method,
        "correctness_score": 0,  # Fill in 1-5
        "fluency_score": 0,      # Fill in 1-5
        "cultural_score": 0,     # Fill in 1-5
        "notes": "",            # Your observations
        "improvement_suggestions": ""
    }

print("\n📋 EVALUATION FRAMEWORK")
print("="*40)
rubric = create_evaluation_rubric()
for dimension, scale in rubric.items():
    print(f"\n{dimension.upper()}:")
    for score, description in scale.items():
        print(f"  {score}: {description}")

print(f"\n🎯 YOUR TURN: Evaluate the outputs above using this 1-5 scale")
print("💡 Focus on how well each method works for your target language")

### 2.3 💬 Discussion Questions and Key Takeaways

discussion_guide = """
🤔 REFLECTION QUESTIONS:

1. **Cross-language Performance:**
   - Which prompting method worked best for your target language?
   - How did performance differ between English and your language?

2. **Method Comparison:** 
   - When did few-shot examples help vs. hurt?
   - How effective was Chain-of-Thought reasoning in non-English?

3. **Cultural Considerations:**
   - What cultural assumptions did you notice in outputs?
   - How would you adapt prompts for your cultural context?

4. **Practical Applications:**
   - Which approach would you use in production?
   - What are the trade-offs between methods?

📝 ACTION ITEMS:
□ Document 3 key insights about your target language
□ Identify best prompting strategies for your use case  
□ Note major challenges needing further research
□ Plan next steps for your project

🎯 KEY TAKEAWAYS:
• Prompt structure often matters more than prompt complexity
• Cultural context can significantly affect output quality  
• Few-shot examples can help bridge language gaps
• Chain-of-Thought can help with reasoning, but check whether it holds up in your language
• Evaluation must consider cultural appropriateness
"""

print("\n" + discussion_guide)

print("\n🎉 CONGRATULATIONS!")
print("You've completed hands-on prompt engineering for low-resource languages!")
print("Use these techniques responsibly and keep experimenting! 🚀")
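
The manual rubric can be complemented with a quick automatic check. The sketch below computes a label-match accuracy per prompting method from rows shaped like the `results` list built earlier (keys `method`, `output`, `expected`). `accuracy_by_method` is a hypothetical helper, and the toy rows are made up for illustration rather than real model outputs.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_method(results: List[Dict]) -> Dict[str, float]:
    """Share of rows per method whose output contains the expected label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in results:
        method = row["method"]
        totals[method] += 1
        # Case-insensitive containment, so "Topic: meeting" still counts
        if row["expected"].strip().lower() in row["output"].strip().lower():
            hits[method] += 1
    return {method: hits[method] / totals[method] for method in totals}


toy_results = [
    {"method": "zero-shot", "output": "social", "expected": "meeting"},
    {"method": "zero-shot", "output": "Meeting", "expected": "meeting"},
    {"method": "few-shot", "output": "Topic: meeting", "expected": "meeting"},
]
print(accuracy_by_method(toy_results))
```

Containment matching is lenient: it suits short classification labels but says little about free-form QA answers, which still need the rubric.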


### 2.2 🆓 FREE API Setup for Colab Students

**🎯 Student-Friendly Free Options:**

| **Service** | **Free Tier** | **Models Available** | **Setup Difficulty** | **Recommended For** |
|-------------|---------------|---------------------|---------------------|-------------------|
| **🤗 Hugging Face** | 1000 requests/month | Llama-2, CodeLlama, Mistral | Easy | Learning & experiments |
| **🟢 Google Gemini** | 60 requests/minute | Gemini-1.5-flash, Gemini-pro | Easy | High-quality outputs |
| **🔥 Local mT5** | GPU memory only | mT5-small/base | Medium | Privacy & customization |

**✅ All three options can be used from Google Colab.** Free-tier quotas and model availability change over time, so check each provider's current limits before planning experiments.

In [None]:
# 🆓 FREE API SETUP FOR COLAB STUDENTS
# Choose your preferred free option!

import requests
import os
from typing import Dict, List

class FreeAPIManager:
    """Manage multiple free API services for students"""
    
    def __init__(self):
        self.available_services = {}
        
    def setup_huggingface_free(self, hf_token: str = None):
        """Setup Hugging Face Inference API (1000 free requests/month)"""
        if not hf_token:
            print("🔑 Get your FREE Hugging Face token:")
            print("   1. Visit: https://huggingface.co/settings/tokens")
            print("   2. Create a token with 'Read' permission")
            print("   3. Paste it when prompted")
            hf_token = input("Enter your HF token: ").strip()
        
        self.available_services["huggingface"] = {
            "token": hf_token,
            "api_url": "https://api-inference.huggingface.co/models/",
            "models": ["microsoft/DialoGPT-medium", "google/flan-t5-base", "meta-llama/Llama-2-7b-chat-hf"],
            "cost": "Free (1000 requests/month)",
            "setup": True
        }
        print("✅ Hugging Face API configured!")
        return True
    
    def setup_gemini_free(self, api_key: str = None):
        """Setup Google Gemini API (60 free requests/minute)"""
        if not api_key:
            print("🔑 Get your FREE Google AI Studio API key:")
            print("   1. Visit: https://makersuite.google.com/app/apikey")
            print("   2. Create a new API key")
            print("   3. Paste it when prompted")
            api_key = input("Enter your Gemini API key: ").strip()
        
        self.available_services["gemini"] = {
            "api_key": api_key,
            "models": ["gemini-1.5-flash", "gemini-1.5-pro"],
            "cost": "Free (60 requests/minute)",
            "setup": True
        }
        print("✅ Google Gemini API configured!")
        return True
    
    def query_huggingface(self, model_name: str, prompt: str, max_length: int = 100) -> Dict:
        """Query Hugging Face model with free API"""
        if "huggingface" not in self.available_services:
            return {"error": "Hugging Face not setup. Run setup_huggingface_free() first"}
        
        api_url = self.available_services["huggingface"]["api_url"] + model_name
        headers = {"Authorization": f"Bearer {self.available_services['huggingface']['token']}"}
        
        payload = {
            "inputs": prompt,
            "parameters": {"max_length": max_length, "temperature": 0.7}
        }
        
        try:
            # A timeout keeps the notebook from hanging if the service is slow
            response = requests.post(api_url, headers=headers, json=payload, timeout=30)
            result = response.json()
            
            if isinstance(result, list) and len(result) > 0:
                return {
                    "output": result[0].get("generated_text", "").replace(prompt, "").strip(),
                    "model": model_name,
                    "service": "huggingface",
                    "success": True
                }
            else:
                return {"error": f"Unexpected response: {result}", "success": False}
                
        except Exception as e:
            return {"error": str(e), "success": False}
    
    def query_gemini(self, prompt: str, model: str = "gemini-1.5-flash") -> Dict:
        """Query Gemini with free API (placeholder - would need google-generativeai package)"""
        if "gemini" not in self.available_services:
            return {"error": "Gemini not setup. Run setup_gemini_free() first"}
        
        # This would require: pip install google-generativeai
        # For now, showing the structure students would use
        return {
            "output": f"[Gemini response - install google-generativeai package to use]",
            "model": model,
            "service": "gemini", 
            "success": True,
            "note": "Install: pip install google-generativeai"
        }

# Initialize free API manager
free_api = FreeAPIManager()

print("🆓 FREE API OPTIONS FOR STUDENTS")
print("=" * 50)
print("Choose one of these FREE options:")
print()
print("Option 1: 🤗 Hugging Face (Recommended for beginners)")
print("   - 1000 free requests per month")
print("   - Multiple models available")
print("   - Easy to get started")
print()
print("Option 2: 🟢 Google Gemini")  
print("   - 60 requests per minute (very generous!)")
print("   - High-quality responses")
print("   - Excellent for production testing")
print()
print("Option 3: 🔥 Local Model (Already loaded above)")
print("   - Completely free (uses Colab GPU)")
print("   - Works offline")
print("   - Perfect for learning")

print("\n💡 SETUP INSTRUCTIONS:")
print("   Run ONE of these lines in a new cell to setup your preferred API:")
print("   free_api.setup_huggingface_free()  # For Hugging Face")
print("   free_api.setup_gemini_free()       # For Google Gemini")
print("\n✅ All services work perfectly on Google Colab!")

# Test function for whichever API is setup
def test_free_api(prompt: str = "Hello! How are you today?"):
    """Test whichever API service is configured"""
    if "huggingface" in free_api.available_services:
        print("Testing Hugging Face API...")
        result = free_api.query_huggingface("microsoft/DialoGPT-medium", prompt)
        print(f"Result: {result}")
    elif "gemini" in free_api.available_services:
        print("Testing Gemini API...")
        result = free_api.query_gemini(prompt)
        print(f"Result: {result}")
    else:
        print("⚠️  No API configured yet. Run setup first!")

print("\n🧪 After setup, test with: test_free_api()")
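
Hosted free tiers can fail transiently, for example while a model is still loading or when a rate limit is hit. A retry wrapper with exponential backoff is one common way to cope. The sketch below is an assumption about how you might wrap the query methods above, not part of `FreeAPIManager`. `call_with_retries` and `flaky_request` are hypothetical names, and the flaky function only simulates a service so the example runs offline.

```python
import time
from typing import Callable, Dict


def call_with_retries(request_fn: Callable[[], Dict], max_attempts: int = 4,
                      base_delay_s: float = 1.0) -> Dict:
    """Call request_fn until it reports success or the attempts run out."""
    result: Dict = {"error": "request_fn was never called", "success": False}
    for attempt in range(max_attempts):
        result = request_fn()
        if result.get("success"):
            return result
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            time.sleep(base_delay_s * (2 ** attempt))
    return result


# Simulated service that fails twice, then succeeds
attempts = {"count": 0}

def flaky_request() -> Dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        return {"error": "model is loading", "success": False}
    return {"output": "ok", "success": True}


final = call_with_retries(flaky_request, base_delay_s=0.01)
print(final, "after", attempts["count"], "attempts")
```

With the manager above, the wrapped call would be a lambda around `free_api.query_huggingface(...)`, since that method already returns a dict with a `success` key.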

### 2.3 🎯 Self-Consistency Prompting

**Self-Consistency** is a technique where we sample multiple responses to the same prompt and choose the answer they agree on most often. It can improve accuracy, especially for reasoning tasks with a single correct answer.

**🔬 How Self-Consistency Works:**
1. Generate multiple responses (typically 3-5) to the same prompt, sampling with temperature > 0 so the responses differ
2. Extract the final answer from each response  
3. Choose the most frequent answer

The technique is particularly effective for mathematical reasoning and factual questions.

**📊 Wang et al. (2022) report gains of up to about 18 percentage points on arithmetic reasoning benchmarks when self-consistency is combined with Chain-of-Thought prompting on large models. Gains may be smaller with small models such as mT5-small.**

In [None]:
# 🧠 SELF-CONSISTENCY PROMPTING IMPLEMENTATION
# Generate multiple responses and find the most consistent answer

from collections import Counter
import time

class SelfConsistencyEngine:
    """
    Implement self-consistency prompting for improved accuracy
    """
    
    def __init__(self, model_manager):
        self.model_manager = model_manager
        
    def generate_multiple_responses(self, prompt: str, num_responses: int = 5, 
                                  temperature: float = 0.8) -> List[Dict]:
        """Generate multiple responses for self-consistency evaluation"""
        responses = []
        
        print(f"🔄 Generating {num_responses} responses for self-consistency...")
        
        for i in range(num_responses):
            print(f"   Response {i+1}/{num_responses}...", end="")
            
            result = self.model_manager.generate_text_systematic(
                prompt, 
                max_length=100, 
                temperature=temperature
            )
            
            if result["error"]:
                print(f" ❌ Error")
                continue
                
            responses.append({
                "response_id": i+1,
                "output": result["output"],
                "metrics": result["metrics"]
            })
            print(f" ✅")
            time.sleep(0.1)  # Small delay to avoid overwhelming APIs
            
        return responses
    
    def extract_answers(self, responses: List[Dict], task_type: str = "classification") -> List[str]:
        """Extract the core answers from responses for comparison"""
        answers = []
        
        for response in responses:
            output = response["output"].strip()
            
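            if task_type == "classification":
                # Extract the classification result (first word, or first word after the last ":")
                candidate = output.split(":")[-1] if ":" in output else output
                words = candidate.split()
                # Guard against empty generations, which small models produce often,
                # and strip trailing punctuation so "support." matches "support"
                answer = words[0].lower().strip(".,!?") if words else "no_answer"
                    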
                    
            elif task_type == "number":
                # Extract numerical answers
                import re
                numbers = re.findall(r'\d+', output)
                answer = numbers[0] if numbers else "no_number"
                
            else:  # general text
                # Use first significant word/phrase
                answer = output.split('.')[0].strip()[:50]
            
            answers.append(answer)
            
        return answers
    
    def find_consensus(self, answers: List[str]) -> Dict:
        """Find the most consistent answer across multiple responses"""
        if not answers:
            return {"consensus": None, "confidence": 0, "distribution": {}}
        
        # Count frequency of each answer
        answer_counts = Counter(answers)
        most_common_answer, max_count = answer_counts.most_common(1)[0]
        
        # Calculate confidence as percentage of responses
        confidence = max_count / len(answers)
        
        return {
            "consensus": most_common_answer,
            "confidence": confidence,
            "distribution": dict(answer_counts),
            "total_responses": len(answers)
        }
    
    def self_consistent_query(self, prompt: str, task_type: str = "classification", 
                            num_responses: int = 5) -> Dict:
        """
        Perform complete self-consistency evaluation
        """
        print(f"🎯 SELF-CONSISTENCY EVALUATION")
        print(f"   Task: {task_type}")
        print(f"   Responses: {num_responses}")
        print(f"   Prompt: {prompt[:100]}...")
        print()
        
        # Generate multiple responses
        responses = self.generate_multiple_responses(prompt, num_responses)
        
        if len(responses) < 2:
            # Include "success" so callers can check it without a KeyError
            return {"error": "Need at least 2 successful responses for consensus", "success": False}
        
        # Extract answers for comparison
        answers = self.extract_answers(responses, task_type)
        
        # Find consensus
        consensus_result = self.find_consensus(answers)
        
        print(f"\n📊 SELF-CONSISTENCY RESULTS:")
        print(f"   Consensus: {consensus_result['consensus']}")
        print(f"   Confidence: {consensus_result['confidence']:.1%}")
        print(f"   Distribution: {consensus_result['distribution']}")
        
        return {
            "consensus": consensus_result,
            "individual_responses": responses,
            "extracted_answers": answers,
            "success": True
        }

# Initialize self-consistency engine (if model manager is available)
if 'model_manager' in locals() and model_manager.current_model:
    sc_engine = SelfConsistencyEngine(model_manager)
    
    print("🧠 SELF-CONSISTENCY ENGINE READY!")
    print("=" * 50)
    
    # Demo with a classification task
    demo_prompt = """Classify this dialogue type: meeting, social, support, or transaction.

Dialogue: A: I need help with my password reset. B: I can help you with that. Let me send you a reset link.

Classification:"""
    
    print("🧪 DEMO: Self-Consistency vs Single Response")
    print("-" * 40)
    
    # Single response (traditional)
    single_result = model_manager.generate_text_systematic(demo_prompt, temperature=0.1)
    print(f"🔸 Single response: {single_result.get('output', 'Error')}")
    
    # Self-consistency (multiple responses)
    print("\n🔸 Self-consistency evaluation:")
    sc_result = sc_engine.self_consistent_query(demo_prompt, "classification", 3)
    
    # .get() avoids a KeyError when the query returns only an error message
    if sc_result.get("success"):
        print(f"\n✅ SINGLE RESPONSE VS. SELF-CONSISTENCY:")
        print(f"   Single: {single_result.get('output', 'Error')}")
        print(f"   Consensus: {sc_result['consensus']['consensus']}")
        print(f"   Confidence: {sc_result['consensus']['confidence']:.1%}")
        
        if sc_result['consensus']['confidence'] >= 0.6:
            print(f"   🎯 High agreement across samples (agreement is not proof of correctness)")
        else:
            print(f"   ⚠️  Low agreement - may need more responses or prompt tuning")
    
else:
    print("‚ö†Ô∏è  Model manager not available - showing self-consistency concept only")
    print("\\nüî¨ Self-Consistency Process:")
    print("1. Generate 3-5 responses with temperature > 0.5")
    print("2. Extract core answers from each response") 
    print("3. Count frequency of each answer")
    print("4. Choose most frequent answer as consensus")
    print("5. Calculate confidence as agreement percentage")

print("\n💡 WHEN TO USE SELF-CONSISTENCY:")
print("   ✅ Mathematical reasoning tasks")
print("   ✅ Factual questions with definitive answers")
print("   ✅ Classification tasks")
print("   ✅ When accuracy is more important than speed")
print("   ❌ Creative writing tasks")
print("   ❌ Open-ended discussions")
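
The self-consistency process listed in the cell above (sample several responses, extract the answers, count them, pick the most frequent, report the agreement rate) comes down to a majority vote. The following is a minimal standalone sketch of that vote, not the `SelfConsistencyEngine` implementation used in this notebook. The helper name `majority_vote` and the toy answers are assumptions made for illustration; only the result keys (`consensus`, `confidence`, `distribution`) mirror the ones printed above.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer, its agreement rate, and the full distribution."""
    # Light normalization so "Support " and "support" count as the same label
    cleaned = [a.strip().lower() for a in answers if a and a.strip()]
    if not cleaned:
        return {"consensus": None, "confidence": 0.0, "distribution": {}}
    counts = Counter(cleaned)
    consensus, top_count = counts.most_common(1)[0]
    return {
        "consensus": consensus,
        "confidence": top_count / len(cleaned),  # share of responses that agree
        "distribution": dict(counts),
    }

# Toy answers standing in for three sampled model responses (hypothetical data)
result = majority_vote(["support", "Support ", "transaction"])
print(result)
```

With three responses the confidence can only be 1/3, 2/3, or 1, so a 0.6 threshold effectively asks for at least two of three responses to agree.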

### 2.4 🎛️ LLM Hyperparameters Deep Dive (EXPANDED COVERAGE)

**Understanding hyperparameters is crucial for effective prompt engineering!**

| **Parameter** | **Typical Range** | **Effect** | **Use When (suggested value)** | **Avoid When (use instead)** |
|---------------|-------------------|------------|--------------------------------|------------------------------|
| **🌡️ Temperature** | 0.0-2.0 | Controls randomness | Creative tasks (0.7-1.2) | Factual Q&A (0.0-0.3) |
| **🎯 Top-p** | 0.1-1.0 | Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches p | Balanced control (0.8-0.95) | Extreme creativity or determinism |
| **🔢 Top-k** | 1-100 | Limits sampling to the k most likely tokens | Focused domains (10-40) | Open conversations (high k) |
| **📏 Max Length** | 1-4096+ | Output length limit | Specific formats | Open exploration |
| **🔁 Repetition Penalty** | 0.8-1.5 | Reduces repetition when above 1.0 | Avoiding loops (1.1-1.3) | Poetry/patterns (≤1.0) |
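
To make the table concrete, here is a small pure-Python sketch of how temperature, top-k, and top-p reshape a next-token distribution. It is a simplified illustration, not how any particular library implements sampling: the function name `sampling_distribution` and the four toy logits are assumptions, and real decoders may apply the filters in a different order or treat temperature 0 differently.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return token probabilities after temperature scaling and top-k / top-p filtering."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the best token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    # Temperature divides the logits before softmax; small values sharpen the distribution
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:  # smallest prefix whose mass reaches top_p
                break
        order = kept

    # Zero out filtered tokens and renormalize the survivors
    kept_mass = sum(probs[i] for i in order)
    kept_set = set(order)
    return [probs[i] / kept_mass if i in kept_set else 0.0 for i in range(len(probs))]

toy_logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for a four-token vocabulary
for t in (0.0, 0.5, 1.5):
    print(t, [round(p, 3) for p in sampling_distribution(toy_logits, temperature=t)])
print("top_k=2:", [round(p, 3) for p in sampling_distribution(toy_logits, top_k=2)])
print("top_p=0.7:", [round(p, 3) for p in sampling_distribution(toy_logits, top_p=0.7)])
```

Running it shows the pattern the table describes: lower temperature concentrates probability on the top token, while top-k and top-p remove the unlikely tail before sampling.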

**🎯 Suggested Starting Settings by Task:**
- **Factual Q&A**: temp=0.1, top_p=0.9, max_length=100
- **Creative Writing**: temp=0.8, top_p=0.95, max_length=500
- **Code Generation**: temp=0.2, top_p=0.9, max_length=200
- **Classification**: temp=0.0, top_p=0.8, max_length=10
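
One way to keep these settings in one place is a small preset table with per-call overrides. This is a sketch only: `TASK_PRESETS` and `get_generation_params` are hypothetical names, and whether a given backend (including this notebook's `generate_text_systematic`) accepts every key, such as `top_p`, is an assumption to verify before passing them through.

```python
# Presets copied from the list above; treat them as starting points, not tuned values
TASK_PRESETS = {
    "factual_qa": {"temperature": 0.1, "top_p": 0.9, "max_length": 100},
    "creative_writing": {"temperature": 0.8, "top_p": 0.95, "max_length": 500},
    "code_generation": {"temperature": 0.2, "top_p": 0.9, "max_length": 200},
    "classification": {"temperature": 0.0, "top_p": 0.8, "max_length": 10},
}

def get_generation_params(task, **overrides):
    """Look up a preset by task name and apply any per-call overrides."""
    if task not in TASK_PRESETS:
        raise KeyError(f"Unknown task '{task}'. Choose from: {sorted(TASK_PRESETS)}")
    # Build a new dict so the shared presets are never mutated
    return {**TASK_PRESETS[task], **overrides}

params = get_generation_params("classification", max_length=20)
print(params)
```

Returning a fresh dictionary keeps experiments from silently changing the defaults that later cells rely on.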

In [None]:
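# 🧪 HYPERPARAMETER EXPERIMENTATION FRAMEWORK
# Systematic exploration of LLM hyperparameters

# Imports repeated here so this cell also runs on its own
import time
from typing import Dict, List

import pandas as pd
from IPython.display import display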

class HyperparameterExplorer:
    """
    Systematically test different hyperparameter combinations
    """
    
    def __init__(self, model_manager):
        self.model_manager = model_manager
        self.results_df = pd.DataFrame()
        
    def create_hyperparameter_grid(self, task_type: str = "classification") -> List[Dict]:
        """Create systematic hyperparameter combinations for testing"""
        
        if task_type == "classification":
            return [
                {"temperature": 0.0, "max_length": 20, "label": "Deterministic"},
                {"temperature": 0.3, "max_length": 20, "label": "Low creativity"},
                {"temperature": 0.7, "max_length": 20, "label": "Balanced"},
                {"temperature": 1.0, "max_length": 20, "label": "Creative"},
            ]
        elif task_type == "generation":
            return [
                {"temperature": 0.1, "max_length": 100, "label": "Conservative"}, 
                {"temperature": 0.5, "max_length": 100, "label": "Moderate"},
                {"temperature": 0.8, "max_length": 100, "label": "Creative"},
                {"temperature": 1.2, "max_length": 100, "label": "Very creative"},
            ]
        else:
            return [
                {"temperature": 0.2, "max_length": 50, "label": "Default low"},
                {"temperature": 0.7, "max_length": 50, "label": "Default balanced"},
            ]
    
    def test_hyperparameter_grid(self, prompt: str, task_type: str = "classification", 
                                runs_per_config: int = 3) -> pd.DataFrame:
        """Test systematic combinations of hyperparameters"""
        
        param_grid = self.create_hyperparameter_grid(task_type)
        results = []
        
        print("🧪 HYPERPARAMETER GRID SEARCH")
        print(f"   Prompt: {prompt[:50]}...")
        print(f"   Configurations: {len(param_grid)}")
        print(f"   Runs per config: {runs_per_config}")
        print(f"   Total experiments: {len(param_grid) * runs_per_config}")
        print("=" * 60)
        
        for i, config in enumerate(param_grid):
            print(f"\n🎛️  Config {i+1}/{len(param_grid)}: {config['label']}")
            print(f"   Temperature: {config['temperature']}")
            print(f"   Max length: {config['max_length']}")
            
            config_results = []
            
            for run in range(runs_per_config):
                print(f"   Run {run+1}/{runs_per_config}...", end="")
                
                result = self.model_manager.generate_text_systematic(
                    prompt,
                    max_length=config["max_length"],
                    temperature=config["temperature"]
                )
                
                # Skip failed generations; .get() tolerates results without an "error" key
                if result.get("error"):
                    print(" ❌")
                    continue
                
                # Store systematic results
                results.append({
                    "config_id": i,
                    "config_label": config["label"],
                    "temperature": config["temperature"],
                    "max_length": config["max_length"],
                    "run_id": run,
                    "output": result["output"],
                    "generation_time_ms": result["metrics"].get("generation_time_ms", 0),
                    "output_tokens": result["metrics"].get("output_tokens", 0),
                    "output_length": len(result["output"]),
                    "timestamp": time.time()
                })
                
                config_results.append(result["output"])
                print(" ✅")
            
            # Show variety within this configuration
            if config_results:
                unique_outputs = len(set(config_results))
                print(f"   Unique outputs: {unique_outputs}/{len(config_results)}")
                if len(config_results) > 1:
                    print(f"   Sample outputs:")
                    for j, output in enumerate(config_results[:2]):
                        print(f"     {j+1}: {output[:60]}...")
        
        # Convert to DataFrame for analysis
        results_df = pd.DataFrame(results)
        self.results_df = pd.concat([self.results_df, results_df], ignore_index=True)
        
        return results_df
    
    def analyze_hyperparameter_effects(self, results_df: pd.DataFrame):
        """Analyze the effects of different hyperparameter settings"""
        
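        # Nothing to analyze if every run failed and no rows were recorded
        if results_df.empty:
            print("No successful runs to analyze.")
            return None

        print("\n📊 HYPERPARAMETER ANALYSIS")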
        print("=" * 50)
        
        # Group by configuration
        config_analysis = results_df.groupby(['config_label', 'temperature']).agg({
            'output_length': ['mean', 'std'],
            'generation_time_ms': ['mean', 'std'],
            'output': lambda x: len(set(x))  # Unique outputs (diversity)
        }).round(2)
        
        config_analysis.columns = ['Avg_Length', 'Std_Length', 'Avg_Time_ms', 'Std_Time_ms', 'Diversity']
        
        print("\n🎯 Configuration Performance:")
        display(config_analysis)
        
        # Temperature effect analysis
        print("\n🌡️ TEMPERATURE EFFECTS:")
        temp_effects = results_df.groupby('temperature').agg({
            'output_length': 'mean',
            'output': lambda x: len(set(x)) / len(x)  # Diversity ratio
        }).round(3)
        
        for temp, row in temp_effects.iterrows():
            diversity_pct = row['output'] * 100
            print(f"   Temperature {temp}: Avg length {row['output_length']:.0f}, Diversity {diversity_pct:.0f}%")
        
        # Visualization is optional: skip it if matplotlib is missing or plotting fails
        try:
            import matplotlib.pyplot as plt

            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
            
            # Length vs Temperature
            results_df.groupby('temperature')['output_length'].mean().plot(kind='bar', ax=ax1)
            ax1.set_title('Output Length vs Temperature')
            ax1.set_ylabel('Average Output Length')
            
            # Generation time vs Temperature
            results_df.groupby('temperature')['generation_time_ms'].mean().plot(kind='bar', ax=ax2)
            ax2.set_title('Generation Time vs Temperature')
            ax2.set_ylabel('Average Time (ms)')
            
            plt.tight_layout()
            plt.show()
            
        except Exception as exc:
            print(f"   (Visualization skipped: {exc})")
        
        return config_analysis

# Initialize hyperparameter explorer (if model available)
if 'model_manager' in locals() and model_manager.current_model:
    hp_explorer = HyperparameterExplorer(model_manager)
    
    print("🎛️ HYPERPARAMETER EXPLORER READY!")
    print("=" * 50)
    
    # Demo hyperparameter experimentation
    demo_prompt = "Classify this conversation: A: Can you help me reset my password? B: Of course, I'll send you a link."
    
    print("\n🧪 DEMO: Hyperparameter Grid Search")
    print("Testing different temperature settings...")
    
    # Run hyperparameter grid search
    hp_results = hp_explorer.test_hyperparameter_grid(
        demo_prompt, 
        task_type="classification",
        runs_per_config=2  # Use 2 for demo (normally 3-5)
    )
    
    # Analyze results
    analysis = hp_explorer.analyze_hyperparameter_effects(hp_results)
    
    print("\n💡 KEY INSIGHTS:")
    print("   🌡️  Lower temperature = more consistent outputs")
    print("   🌡️  Higher temperature = more diverse/creative outputs")
    print("   ⏱️  Temperature has minimal effect on generation speed")
    print("   📏 Max length controls output verbosity")
    
else:
    print("⚠️  Model manager not available - showing hyperparameter concepts")
    
print("\n🎯 HYPERPARAMETER BEST PRACTICES:")
print("   📋 Classification tasks: temperature = 0.0-0.3")
print("   ✍️  Creative generation: temperature = 0.7-1.0")
print("   🔍 Factual Q&A: temperature = 0.0-0.2")
print("   🗣️  Dialogue: temperature = 0.4-0.7")
print("   🧮 Code generation: temperature = 0.1-0.4")

print("\n⚡ PERFORMANCE TIPS:")
print("   ✅ Start with temperature=0.0 for reproducible results")
print("   ✅ Increase temperature gradually for more creativity")
print("   ✅ Use max_length to prevent overly long responses")
print("   ✅ Test multiple runs to understand variability")
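
The diversity ratio reported by `analyze_hyperparameter_effects` is the number of distinct outputs divided by the number of runs at each temperature. The sketch below reproduces that arithmetic without pandas on a few made-up records; the function name `diversity_by_temperature` and the toy data are assumptions for illustration, not results from any model.

```python
from collections import defaultdict

def diversity_by_temperature(records):
    """Map each temperature to unique_outputs / total_outputs (1.0 = every run differs)."""
    outputs = defaultdict(list)
    for record in records:
        outputs[record["temperature"]].append(record["output"])
    # Exact string comparison, matching the set()-based count used in the analysis above
    return {temp: len(set(outs)) / len(outs) for temp, outs in sorted(outputs.items())}

# Hypothetical records in the same shape as the rows stored by the grid search
toy_records = [
    {"temperature": 0.0, "output": "support"},
    {"temperature": 0.0, "output": "support"},
    {"temperature": 1.0, "output": "support"},
    {"temperature": 1.0, "output": "social"},
]
ratios = diversity_by_temperature(toy_records)
print(ratios)
```

Note that the ratio never drops below 1/n for n runs, so with two runs per configuration a fully consistent setting still reports 50%. More runs per configuration give a more informative scale.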

## 🎓 SESSION COMPLETE: Advanced Prompt Engineering Mastery

### ✅ What You've Mastered Today

**🏗️ Pre-trained Models:**
- ✅ Model family comparison (local vs API access)
- ✅ Systematic evaluation framework with performance tracking
- ✅ Free API integration for Colab (Hugging Face, Gemini)

**🎨 Prompt Engineering & Design:**
- ✅ Zero-shot prompting for immediate results
- ✅ Few-shot prompting with cross-lingual examples
- ✅ Chain-of-thought for step-by-step reasoning
- ✅ **Self-consistency for improved accuracy** (NEW!)

**🎛️ LLM Hyperparameters:**
- ✅ Temperature, top-p, max_length optimization
- ✅ Task-specific parameter tuning
- ✅ Systematic hyperparameter grid search (NEW!)

**🌍 Low-Resource Languages:**
- ✅ Cross-lingual prompt transfer strategies
- ✅ Cultural adaptation and sensitivity evaluation
- ✅ Few-shot learning for resource-constrained languages

**📊 Systematic Evaluation:**
- ✅ Research-grade methodology with pandas tracking
- ✅ Quantitative metrics and performance analysis
- ✅ Structured comparison across techniques

### 🚀 Next Steps for Your Projects

**📋 Immediate Actions:**
1. **Choose your API**: Set up Hugging Face or Gemini free API
2. **Test with your language**: Apply techniques to your specific use case
3. **Document insights**: Record what works best for your domain
4. **Experiment systematically**: Use the evaluation frameworks provided

**🔬 Advanced Experiments:**
- Compare self-consistency vs single-shot for your tasks
- Optimize hyperparameters for your specific language/domain
- Create custom few-shot examples for your cultural context
- Build evaluation metrics for your specific requirements

### 💡 Key Production Insights

**🎯 For Classification Tasks:**
- Use temperature=0.0 for a single deterministic answer; for self-consistency, sample 3-5 responses at temperature > 0.5, because identical greedy outputs cannot disagree
- Few-shot examples can improve cross-lingual transfer
- Cultural context can matter more than linguistic accuracy

**✍️ For Text Generation:**
- Start with temperature=0.7, adjust based on creativity needs
- Chain-of-thought can improve accuracy on tasks that need multi-step reasoning
- Self-consistency can reduce hallucination in factual tasks

**⚡ For Performance:**
- Local mT5 models: Best for privacy and customization
- Free APIs: Excellent for learning and small-scale projects
- Systematic evaluation saves time in the long run

### 🌟 Remember: Ethics and Responsible AI

**Always consider:**
- Cultural sensitivity in your prompts and evaluations
- Privacy implications of your chosen API service
- Bias evaluation across different demographic groups
- Environmental impact of your model choices

---

**🎉 Congratulations!** You now have production-ready prompt engineering skills with systematic evaluation methodology. Keep experimenting and building responsibly! 🚀