# Module 17: LLM Fundamentals

**Goal:** Learn effective prompting techniques, understand RAG vs fine-tuning, and evaluate LLM outputs.

**Prerequisites:** Module 14 (Retrieval), Module 16 (Transformers)

**Expected Runtime:** ~25 minutes

**Outputs:**
- Compared prompting techniques
- Built a simple RAG pipeline
- Evaluated response quality

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

## Part 1: Prompt Templates

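Three common prompting styles, in increasing order of guidance: zero-shot (instructions only), few-shot (instructions plus labeled examples), and chain-of-thought (the model is asked to reason step by step before answering).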

In [None]:
# Task: Classify support tickets
ticket = "I can't log into my account after resetting my password"

# Zero-shot prompt
zero_shot = f"""Classify this support ticket as: billing, technical, or general.

Ticket: "{ticket}"

Category:"""

# Few-shot prompt
few_shot = f"""Classify support tickets into categories.

Examples:
Ticket: "Why was I charged twice this month?"
Category: billing

Ticket: "The app keeps crashing on startup"
Category: technical

Ticket: "What are your business hours?"
Category: general

Now classify this ticket:
Ticket: "{ticket}"
Category:"""

# Chain-of-thought prompt
cot_prompt = f"""Classify this support ticket. Think step by step.

Ticket: "{ticket}"

Step 1: What is the customer's main issue?
Step 2: Which category (billing, technical, general) best describes this?
Step 3: Final classification.

Analysis:"""

print("=== Zero-Shot Prompt ===")
print(zero_shot)
print("\n" + "="*50)
print("\n=== Few-Shot Prompt ===")
print(few_shot[:300] + "...")
print("\n" + "="*50)
print("\n=== Chain-of-Thought Prompt ===")
print(cot_prompt)
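
The few-shot prompt above hard-codes its examples. A minimal sketch of assembling the same prompt from a list of labeled examples, which makes it easier to swap examples in and out. `build_few_shot_prompt` and `EXAMPLES` are hypothetical names introduced here for illustration, not part of the notebook:

```python
# Hypothetical helper: builds a few-shot prompt from (ticket, category) pairs.
EXAMPLES = [
    ("Why was I charged twice this month?", "billing"),
    ("The app keeps crashing on startup", "technical"),
    ("What are your business hours?", "general"),
]

def build_few_shot_prompt(examples, ticket):
    """Return a few-shot classification prompt ending in an open 'Category:' slot."""
    lines = ["Classify support tickets into categories.", "", "Examples:"]
    for text, category in examples:
        lines.append(f'Ticket: "{text}"')
        lines.append(f"Category: {category}")
        lines.append("")  # blank line between examples
    lines.append("Now classify this ticket:")
    lines.append(f'Ticket: "{ticket}"')
    lines.append("Category:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    EXAMPLES, "I can't log into my account after resetting my password"
)
print(prompt)
```

Keeping the examples in data rather than in the template string is one possible design; it lets you vary the number or order of examples when comparing prompts.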

## Part 2: Simulating LLM Responses

In production, you'd call an API. Here we'll simulate responses to understand the patterns.

In [None]:
# Simulated responses (what an LLM might return)
responses = {
    'zero_shot': 'technical',
    'few_shot': 'technical',
    'cot': '''Step 1: The customer is having trouble logging in after resetting their password. This is an account access issue.

Step 2: Login and password issues are related to the technical functionality of the system, not billing or general inquiries.

Step 3: technical'''
}

print("=== Simulated Responses ===")
for style, response in responses.items():
    print(f"\n{style.upper()}:")
    print(response[:200] + ("..." if len(response) > 200 else ""))
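
The zero-shot and few-shot responses are a bare label, but the chain-of-thought response buries the label in free text. A rough sketch of pulling out the final category follows. `extract_category` is a hypothetical helper, and treating the last mentioned category as the verdict is an assumption that holds for the simulated response above but not necessarily for every model output:

```python
import re

VALID_CATEGORIES = ("billing", "technical", "general")

def extract_category(response, valid=VALID_CATEGORIES):
    """Return the last valid category mentioned in the response, or None."""
    # Match whole words only, case-insensitively.
    pattern = r"\b(" + "|".join(valid) + r")\b"
    matches = re.findall(pattern, response.lower())
    # Assumption: the final mention is the model's verdict (e.g. the "Step 3" line).
    return matches[-1] if matches else None

cot_response = (
    "Step 2: Login and password issues are related to the technical "
    "functionality of the system, not billing or general inquiries.\n\n"
    "Step 3: technical"
)
print(extract_category("technical"))
print(extract_category(cot_response))
```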

## Part 3: Building a Simple RAG System

In [None]:
# Knowledge base (support documentation)
knowledge_base = [
    {
        "title": "Password Reset Guide",
        "content": "Password resets are processed within 5 minutes. If you can't log in after reset, clear your browser cache and cookies. Try incognito mode. If issues persist, contact support for manual account unlock."
    },
    {
        "title": "Billing FAQ",
        "content": "Charges appear within 24 hours. Double charges may occur due to payment retries. Request refunds within 30 days. View billing history in Account Settings."
    },
    {
        "title": "Shipping Information",
        "content": "Standard shipping takes 5-7 business days. Express is 2-3 days. Track orders in Order History. Free shipping on orders over $50."
    },
    {
        "title": "Account Security",
        "content": "Enable two-factor authentication in Security Settings. Use strong passwords with 12+ characters. We never ask for passwords via email."
    },
]

# Create document texts
docs = [f"{d['title']}: {d['content']}" for d in knowledge_base]

# Build retriever using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(docs)

def retrieve(query, k=2):
    """Retrieve top-k relevant documents."""
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, doc_vectors).flatten()
    top_k_idx = similarities.argsort()[-k:][::-1]
    
    results = []
    for idx in top_k_idx:
        results.append({
            'title': knowledge_base[idx]['title'],
            'content': knowledge_base[idx]['content'],
            'score': similarities[idx]
        })
    return results

# Test retrieval
query = "I can't log in after password reset"
retrieved = retrieve(query, k=2)

print(f"Query: '{query}'\n")
print("Retrieved Documents:")
for doc in retrieved:
    print(f"\n[{doc['score']:.3f}] {doc['title']}")
    print(f"   {doc['content'][:100]}...")
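
Note that `retrieve` always returns `k` documents, even when a query shares no terms with the knowledge base and every similarity is 0. One possible guard is a minimum-score cutoff applied to its output. The sketch below assumes results in the same shape `retrieve` returns; `filter_by_score`, the `0.1` default, and the example scores are illustrative choices, not values from the notebook:

```python
def filter_by_score(results, min_score=0.1):
    """Keep only retrieved documents whose similarity score meets the cutoff."""
    return [r for r in results if r["score"] >= min_score]

# Illustrative results in the shape retrieve() returns (scores are made up).
example_results = [
    {"title": "Password Reset Guide", "content": "...", "score": 0.62},
    {"title": "Shipping Information", "content": "...", "score": 0.0},
]

kept = filter_by_score(example_results)
print([r["title"] for r in kept])
```

In the notebook this would be used as `filter_by_score(retrieve(query, k=2))`. A sensible threshold depends on the retriever and corpus, so it should be tuned rather than copied.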

In [None]:
def build_rag_prompt(query, retrieved_docs):
    """Build a RAG-augmented prompt."""
    context = "\n\n".join([f"- {d['title']}: {d['content']}" for d in retrieved_docs])
    
    prompt = f"""You are a helpful support agent. Answer the customer's question using ONLY the information provided below.

RETRIEVED CONTEXT:
{context}

CUSTOMER QUESTION: {query}

If the answer is not in the context, say "I don't have that information."

RESPONSE:"""
    
    return prompt

# Build RAG prompt
rag_prompt = build_rag_prompt(query, retrieved)
print("=== RAG Prompt ===")
print(rag_prompt)

## Part 4: Evaluating LLM Outputs

In [None]:
# Simulated responses to evaluate
test_cases = [
    {
        'query': 'How long does shipping take?',
        'context': 'Standard shipping takes 5-7 business days. Express is 2-3 days.',
        'response': 'Standard shipping takes 5-7 business days, and express shipping delivers in 2-3 days.',
        'expected': 'Standard shipping: 5-7 days, Express: 2-3 days'
    },
    {
        'query': 'How long does shipping take?',
        'context': 'Standard shipping takes 5-7 business days. Express is 2-3 days.',
        'response': 'Shipping usually takes 1-2 weeks, but premium members get it in 24 hours.',  # Hallucinated!
        'expected': 'Standard shipping: 5-7 days, Express: 2-3 days'
    },
    {
        'query': 'What is the refund policy?',
        'context': 'Request refunds within 30 days.',
        'response': 'You can request a refund within 30 days of purchase.',
        'expected': 'Refunds available within 30 days'
    },
]

import re

def check_faithfulness(response, context):
    """Simple faithfulness check: flag numbers in the response that are absent from the context."""
    # In production, use more sophisticated NLI models.
    # Numbers that appear only in the response are a common hallucination signal.
    response_numbers = set(re.findall(r'\d+', response))
    context_numbers = set(re.findall(r'\d+', context))

    hallucinated_numbers = response_numbers - context_numbers

    # Returns (is_faithful, numbers found only in the response)
    return len(hallucinated_numbers) == 0, hallucinated_numbers

print("=== Faithfulness Evaluation ===")
for i, case in enumerate(test_cases):
    faithful, issues = check_faithfulness(case['response'], case['context'])
    
    print(f"\nCase {i+1}: {'✓ Faithful' if faithful else '✗ Hallucination Detected'}")
    print(f"  Query: {case['query']}")
    print(f"  Response: {case['response'][:60]}...")
    if not faithful:
        print(f"  ⚠️ Numbers not in context: {issues}")
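
The test cases also carry an `expected` field that the number check does not use. A minimal sketch of scoring a response against it with token-level F1 follows. `token_f1` and `tokenize` are hypothetical helpers, and word overlap is only a coarse proxy for correctness:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens ('5-7' becomes '5', '7')."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(response, expected):
    """Harmonic mean of token precision and recall between response and expected."""
    resp = Counter(tokenize(response))
    exp = Counter(tokenize(expected))
    overlap = sum((resp & exp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)

expected = "Standard shipping: 5-7 days, Express: 2-3 days"
faithful = "Standard shipping takes 5-7 business days, and express shipping delivers in 2-3 days."
hallucinated = "Shipping usually takes 1-2 weeks, but premium members get it in 24 hours."

print(f"faithful:     {token_f1(faithful, expected):.2f}")
print(f"hallucinated: {token_f1(hallucinated, expected):.2f}")
```

The faithful response scores noticeably higher than the hallucinated one here, but paraphrases can score low despite being correct, so this complements the faithfulness check rather than replacing it.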

## Part 5: LLM-as-Judge Pattern

In [None]:
def create_judge_prompt(query, response, context=None):
    """Create a prompt for LLM-as-judge evaluation."""
    
    context_section = f"\nCONTEXT PROVIDED:\n{context}\n" if context else ""
    
    prompt = f"""Evaluate this response on a scale of 1-5 for each criterion.

QUERY: {query}
{context_section}
RESPONSE: {response}

Rate the response:
1. RELEVANCE (1-5): Does it answer the question?
2. ACCURACY (1-5): Is the information correct?
3. HELPFULNESS (1-5): Is it actionable and useful?
4. FAITHFULNESS (1-5): Does it stick to provided context? (if context given)

Provide scores in JSON format:
{{
    "relevance": <score>,
    "accuracy": <score>,
    "helpfulness": <score>,
    "faithfulness": <score>,
    "reasoning": "<brief explanation>"
}}
"""
    return prompt

# Example judge prompt
judge_prompt = create_judge_prompt(
    query="How do I reset my password?",
    response="You can reset your password by clicking 'Forgot Password' on the login page. A reset link will be sent to your email.",
    context="Password resets are processed within 5 minutes. Click 'Forgot Password' on login page."
)

print("=== LLM-as-Judge Prompt ===")
print(judge_prompt)
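
The judge prompt asks for JSON, so the caller needs to parse and validate what comes back. A sketch of one way to do that is shown below. `parse_judge_output` is a hypothetical helper, and `raw_output` is a made-up example of what a judge model might return, not real model output:

```python
import json

REQUIRED_KEYS = ("relevance", "accuracy", "helpfulness", "faithfulness")

def parse_judge_output(raw):
    """Parse the judge's reply; return a dict of scores, or None if malformed."""
    # Models sometimes wrap JSON in extra text, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        # Require an integer score in the 1-5 range the prompt asked for.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[key] = value
    scores["reasoning"] = str(data.get("reasoning", ""))
    return scores

# Made-up judge reply for illustration.
raw_output = '''Here is my evaluation:
{"relevance": 5, "accuracy": 4, "helpfulness": 4, "faithfulness": 3,
 "reasoning": "Mentions an email link that is not in the context."}'''

print(parse_judge_output(raw_output))
```

Returning `None` on any malformed reply keeps the caller simple; a stricter pipeline might retry the judge call instead.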

## Part 6: RAG vs Fine-Tuning Decision

In [None]:
# Decision framework
scenarios = [
    {
        'scenario': 'Company FAQ that changes monthly',
        'recommendation': 'RAG',
        'reason': 'Information changes frequently; RAG allows instant updates without retraining'
    },
    {
        'scenario': 'Consistent brand voice across all responses',
        'recommendation': 'Fine-tune',
        'reason': 'Style/tone is a learned behavior, not retrievable information'
    },
    {
        'scenario': 'Customer order status lookup',
        'recommendation': 'RAG',
        'reason': 'Real-time data that must come from database, not training'
    },
    {
        'scenario': 'Technical domain expertise (medical, legal)',
        'recommendation': 'Fine-tune + RAG',
        'reason': 'Need specialized knowledge (fine-tune) AND current facts (RAG)'
    },
    {
        'scenario': 'Specific output format (JSON, structured)',
        'recommendation': 'Fine-tune or few-shot',
        'reason': 'Format consistency is behavioral; fine-tuning or examples work well'
    },
]

df = pd.DataFrame(scenarios)
print("=== RAG vs Fine-Tuning Decision Guide ===")
print(df.to_string(index=False))

## Part 7: TODO - Build a Complete RAG Flow

In [None]:
# TODO: Implement a full RAG pipeline
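def rag_answer(query, knowledge_base_docs=None, top_k=2):
    """
    Complete RAG pipeline:
    1. Retrieve relevant documents
    2. Build augmented prompt
    3. (In production) Call LLM API
    4. Return response

    Note: knowledge_base_docs is not used yet; retrieval runs against the
    TF-IDF index built in Part 3.
    """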
    # Step 1: Retrieve
    retrieved = retrieve(query, k=top_k)
    
    # Step 2: Build prompt
    prompt = build_rag_prompt(query, retrieved)
    
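    # Step 3: Call LLM (simulated)
    # In production (sketch, assuming the openai>=1.0 Python client;
    # the older openai.ChatCompletion.create call was removed in 1.0):
    # client = openai.OpenAI()
    # response = client.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}]
    # )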
    
    # Simulated response
    simulated_response = "Based on the information provided, here's my answer..."
    
    return {
        'query': query,
        'retrieved_docs': retrieved,
        'prompt': prompt,
        'response': simulated_response
    }

# Test the pipeline
result = rag_answer("How do I get a refund?")
print("=== RAG Pipeline Result ===")
print(f"Query: {result['query']}")
print(f"\nRetrieved: {[d['title'] for d in result['retrieved_docs']]}")
print(f"\nPrompt length: {len(result['prompt'])} chars")
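
To move from the simulated response to a real model without rewriting the pipeline, one option is to pass each stage in as a function. The sketch below is an assumption about how that could look. `run_rag`, `fake_retrieve`, `fake_build_prompt`, and `fake_llm` are hypothetical stand-ins so the example runs on its own:

```python
def run_rag(query, retrieve_fn, build_prompt_fn, llm_fn):
    """Run retrieve -> build prompt -> call LLM, with each stage injected."""
    docs = retrieve_fn(query)
    prompt = build_prompt_fn(query, docs)
    return {
        "query": query,
        "retrieved_docs": docs,
        "prompt": prompt,
        "response": llm_fn(prompt),
    }

# Stand-in stages for illustration only.
def fake_retrieve(query):
    return [{"title": "Billing FAQ", "content": "Request refunds within 30 days."}]

def fake_build_prompt(query, docs):
    context = "\n".join(f"- {d['title']}: {d['content']}" for d in docs)
    return f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nRESPONSE:"

def fake_llm(prompt):
    return "You can request a refund within 30 days."

result = run_rag("How do I get a refund?", fake_retrieve, fake_build_prompt, fake_llm)
print(result["response"])
```

In the notebook, `retrieve` and `build_rag_prompt` could be passed in directly, with `llm_fn` wrapping whichever API client you use. The same structure also makes each stage easy to test in isolation.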

## Self-Check

Run the asserts below to verify your RAG pipeline components work correctly.

In [None]:
# SELF-CHECK: Verify your RAG pipeline
assert callable(retrieve), "retrieve function should exist"
results = retrieve("password reset", k=2)
assert len(results) == 2, "retrieve should return k results"
assert all('title' in r and 'score' in r for r in results), "Results should have title and score"
assert callable(build_rag_prompt), "build_rag_prompt function should exist"
print(f"✅ Self-check passed! RAG pipeline retrieved {len(results)} documents")

## Part 8: Stakeholder Summary

### TODO: Write a 3-bullet summary (~100 words) for the PM

Template:
- **Three approaches:** Prompting (fast, no training), RAG (retrieves facts at query time), Fine-tuning (trains model on your data).
- **When to use each:** RAG for [current/changing facts]; Fine-tuning for [consistent style/domain expertise]; Prompting for [quick experiments].
- **Evaluation:** Check faithfulness (does it stick to context?), relevance (does it answer the question?), and [specific business metrics].

### Your Summary:

*Write your explanation here...*

---

## Key Takeaways

1. **Prompt engineering:** Zero-shot (instructions only), few-shot (adds labeled examples), and CoT (step-by-step reasoning) suit different needs
2. **RAG:** Retrieve context to ground responses in facts
3. **Fine-tuning:** For style, tone, and specialized behavior
4. **Evaluation:** Faithfulness, relevance, accuracy; use LLM-as-judge
5. **Hallucination:** Reduce it with RAG and an explicit "use only the provided context" instruction, then verify with faithfulness checks

### Next Steps
- Explore the interactive playground
- Complete the quiz
- Move to Module 18: Tool Calling