# Chapter 15: Building Applications

> "Knowledge is of no value unless you put it into practice."
> — **Anton Chekhov**, Writer

---

## What You'll Learn

- How to build a simple chat application with conversation history
- The RAG (Retrieval-Augmented Generation) pattern for giving models access to external knowledge
- How to chunk documents, create embeddings, and retrieve relevant information
- Basic tool-calling patterns that let models take actions
- Evaluation strategies for testing whether your application actually works

---

## Setup

First, let's install required packages and set up **Ollama** for free local LLM inference.

> **Why Ollama?** It's free, works offline once the model is downloaded, and runs on
> most recent laptops and desktops. No API keys or credit cards needed. Local models
> are also a common choice when privacy or cost matters.

In [None]:
# Install required packages
!pip install -q sentence-transformers numpy requests

# === OLLAMA SETUP ===
# Ollama runs locally - completely free, no API key needed!

print("Installing Ollama...")
!curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server in background
import subprocess
subprocess.Popen(["ollama", "serve"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

import time
time.sleep(3)  # Wait for server to start

# Pull a small model (~2GB download, one-time)
print("\nPulling llama3.2 model (this may take a few minutes on first run)...")
!ollama pull llama3.2

# Download our helper library
!wget -q https://raw.githubusercontent.com/FirstLLM/code/main/llm_helper.py

print("\n✓ Setup complete! You can now use local LLMs for free.")

In [None]:
# ===== IMPORTS =====
import os
import json
import re
from datetime import datetime

import numpy as np

# Import our LLM helper
from llm_helper import chat, chat_with_history
print("✓ llm_helper loaded")

# Check if sentence-transformers is available (for RAG embeddings)
try:
    from sentence_transformers import SentenceTransformer
    print("✓ sentence-transformers installed")
except ImportError:
    print("⚠ sentence-transformers not found. Run: pip install sentence-transformers")

In [None]:
# ===== TEST OLLAMA CONNECTION =====
# Verify that Ollama is running and the model is available

print("Testing Ollama connection...")
try:
    response = chat("Say 'Connection successful!' and nothing else.", temperature=0)
    print("✓ Ollama is working!")
    print(f"  Response: {response}")
except Exception as e:
    print(f"⚠ Ollama connection failed: {e}")
    print("  Make sure Ollama is running: ollama serve")

## 1. Building a Chat Loop

Let's start with a simple chat application that remembers conversation history.

In [None]:
class ChatSession:
    """Chat session with history management."""
    
    def __init__(self, max_history=20):
        self.max_history = max_history
        self.history = []
        self.system_prompt = "You are a helpful assistant."
    
    def count_messages(self):
        """Count messages in conversation history."""
        return len(self.history)
    
    def trim_history_if_needed(self):
        """Remove oldest messages if we're exceeding the limit."""
        while len(self.history) > self.max_history:
            # Remove the oldest user/assistant pair
            self.history.pop(0)
            if self.history and self.history[0]["role"] == "assistant":
                self.history.pop(0)
    
    def chat(self, user_message):
        """Process a user message and return the response."""
        # Add user message
        self.history.append({"role": "user", "content": user_message})
        
        # Trim if needed
        self.trim_history_if_needed()
        
        # Get response using our helper
        assistant_message = chat_with_history(
            self.history,
            system=self.system_prompt
        )
        
        # Add to history
        self.history.append({"role": "assistant", "content": assistant_message})
        
        return assistant_message

print("ChatSession class defined!")

In [None]:
# Test the chat session
session = ChatSession()

# First message
response1 = session.chat("Hello! What's 2+2?")
print("You: Hello! What's 2+2?")
print(f"Assistant: {response1}")
print(f"History messages: {session.count_messages()}\n")

# Follow-up (demonstrates memory)
response2 = session.chat("What did I just ask you?")
print("You: What did I just ask you?")
print(f"Assistant: {response2}")
print(f"History messages: {session.count_messages()}")
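
The trimming rule in `ChatSession` can be tried without a running model. The sketch below copies that loop into a standalone helper, `trim_history` (a hypothetical name, not part of `llm_helper`), and runs it on a made-up conversation. Treat it as an illustration of the rule rather than a replacement for the class.

```python
def trim_history(history, max_history):
    """Return a trimmed copy of history, dropping the oldest turns first."""
    trimmed = list(history)
    while len(trimmed) > max_history:
        trimmed.pop(0)  # Drop the oldest message
        # If an assistant reply is now at the front, drop it too,
        # so the history always starts with a user message.
        if trimmed and trimmed[0]["role"] == "assistant":
            trimmed.pop(0)
    return trimmed


# Simulated conversation: three full turns plus a new user message (7 messages)
history = []
for n in range(1, 4):
    history.append({"role": "user", "content": f"question {n}"})
    history.append({"role": "assistant", "content": f"answer {n}"})
history.append({"role": "user", "content": "question 4"})

trimmed = trim_history(history, max_history=4)
print([m["content"] for m in trimmed])
```

Because messages are removed in user/assistant pairs, the result can end up shorter than `max_history`: here 7 messages are cut to 3, not 4.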

## 2. RAG: Retrieval-Augmented Generation

RAG gives your model access to documents it wasn't trained on.

Think of it like an **open-book exam**: the model can look things up instead of relying only on memory.

In [None]:
class SimpleRAG:
    """A simple RAG system for document retrieval."""
    
    def __init__(self, embedding_model="all-MiniLM-L6-v2"):
        """Initialize the RAG system."""
        self.encoder = SentenceTransformer(embedding_model)
        self.documents = []  # Original text chunks
        self.embeddings = None  # NumPy array of vectors
        self.metadata = []  # Source information for citations
    
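    def chunk_text(self, text, chunk_size=200, overlap=50):
        """Split text into overlapping chunks of up to `chunk_size` words."""
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be >= 0 and smaller than chunk_size")
        
        words = text.split()
        chunks = []
        
        # Each window starts (chunk_size - overlap) words after the previous one
        for i in range(0, len(words), chunk_size - overlap):
            chunks.append(" ".join(words[i:i + chunk_size]))
            if i + chunk_size >= len(words):
                break  # This window reached the end; another would only repeat its tail
        
        return chunks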
    
    def add_document(self, text, source_name="unknown"):
        """Add a document to the knowledge base."""
        chunks = self.chunk_text(text)
        
        for i, chunk in enumerate(chunks):
            self.documents.append(chunk)
            self.metadata.append({
                "source": source_name,
                "chunk_index": i
            })
        
        # Re-embed every chunk (simple, but repeats work each time a document is added)
        self.embeddings = self.encoder.encode(
            self.documents,
            normalize_embeddings=True  # Important for cosine similarity
        )
        
        print(f"Added {len(chunks)} chunks from '{source_name}'")
    
    def retrieve(self, query, top_k=3, min_score=0.3):
        """Find the most relevant chunks for a query."""
        if self.embeddings is None or len(self.embeddings) == 0:
            return []
        
        # Embed the query
        query_embedding = self.encoder.encode(
            query,
            normalize_embeddings=True
        )
        
        # Calculate similarities (dot product of normalized vectors = cosine)
        similarities = np.dot(self.embeddings, query_embedding)
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        # Build results, filtering by minimum score
        results = []
        for idx in top_indices:
            score = float(similarities[idx])
            if score >= min_score:
                results.append({
                    "text": self.documents[idx],
                    "score": score,
                    "source": self.metadata[idx]["source"],
                    "index": int(idx)
                })
        
        return results

print("SimpleRAG class defined!")
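
To see what `chunk_size` and `overlap` do, it helps to run the sliding window on a tiny input. The sketch below is a standalone function, `chunk_words` (a hypothetical name), using small numbers instead of the class defaults. It stops as soon as a window reaches the end of the text, which is a choice made for this sketch.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a list of words into overlapping windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # How far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # This window already covers the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in chunk_words(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each window shares its first word with the last word of the previous one. That shared region is what keeps a sentence that straddles a boundary retrievable from at least one chunk.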

In [None]:
# Sample documents for testing
vacation_policy = """
Employees receive 15 days of paid vacation per year. 
Unused vacation days can be carried over to the next year, up to a maximum of 5 days. 
Vacation requests must be submitted at least 2 weeks in advance through the HR portal.
New employees are eligible for vacation after completing their 90-day probation period.
"""

expense_policy = """
Business expenses must be submitted within 30 days of the expense date.
All expenses over $50 require a receipt. Meals during business travel are reimbursed up to $75 per day.
Submit expense reports through the finance portal with appropriate documentation.
Manager approval is required for expenses over $500.
"""

remote_work_policy = """
Employees may work remotely up to 3 days per week with manager approval.
Remote workers must be available during core hours: 10am to 3pm in their local timezone.
Home office equipment can be reimbursed up to $500 with manager approval.
Remote work arrangements should be documented in writing.
"""

print("Sample documents created!")

In [None]:
# Initialize RAG and add documents
rag = SimpleRAG()

rag.add_document(vacation_policy, "vacation_policy.txt")
rag.add_document(expense_policy, "expense_policy.txt")
rag.add_document(remote_work_policy, "remote_work_policy.txt")

print(f"\nTotal chunks in knowledge base: {len(rag.documents)}")

In [None]:
# Test retrieval
query = "How many vacation days do I get?"
results = rag.retrieve(query)

print(f"Query: '{query}'\n")
print("Retrieved documents:")
for i, doc in enumerate(results):
    print(f"\n[{i+1}] Score: {doc['score']:.3f} | Source: {doc['source']}")
    print(f"    {doc['text'][:100]}...")

In [None]:
# Test with a different query
query2 = "What's the expense limit for meals?"
results2 = rag.retrieve(query2)

print(f"Query: '{query2}'\n")
print("Retrieved documents:")
for i, doc in enumerate(results2):
    print(f"\n[{i+1}] Score: {doc['score']:.3f} | Source: {doc['source']}")
    print(f"    {doc['text'][:100]}...")
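
The scores printed above are cosine similarities. `retrieve()` gets them from a plain dot product because the embeddings were normalized to unit length. The sketch below checks that equivalence on made-up 2-D vectors, using only the standard library so it runs without NumPy. Real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (illustrative values only)
query = [3.0, 4.0]
doc_same_direction = [6.0, 8.0]   # Points the same way as the query
doc_orthogonal = [-4.0, 3.0]      # At a right angle to the query

for doc in (doc_same_direction, doc_orthogonal):
    direct = cosine(query, doc)
    via_normalized_dot = dot(normalize(query), normalize(doc))
    print(f"cosine={direct:.3f}  normalized dot={via_normalized_dot:.3f}")
```

The two columns agree, which is why normalizing once at indexing time lets retrieval use a single matrix-vector product.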

## 3. Prompting with Retrieved Context

Now let's build prompts that use the retrieved documents.

In [None]:
def build_rag_prompt(query, retrieved_docs, min_score=0.3):
    """Build a prompt with retrieved context and citation instructions."""
    # Filter by score
    good_docs = [d for d in retrieved_docs if d["score"] >= min_score]
    
    # No relevant documents: instruct the model to say so instead of guessing
    if not good_docs:
        return f"""No relevant sources were found for the question below.
Tell the user that the knowledge base has no relevant information for this question,
and suggest rephrasing it. Do not answer from general knowledge.

Question: {query}"""
    
    # Build context with citation markers
    context_parts = []
    for i, doc in enumerate(good_docs):
        source = doc.get("source", "unknown")
        context_parts.append(f"[{i+1}] (Source: {source})\n{doc['text']}")
    
    context = "\n\n".join(context_parts)
    
    return f"""Use the following sources to answer the question.
Cite sources using [1], [2], etc. Only use information from the provided sources.
If the sources don't contain the answer, say so.

Sources:
{context}

Question: {query}

Answer:"""

print("build_rag_prompt() defined!")

In [None]:
# See what the RAG prompt looks like
query = "How many vacation days do I get?"
docs = rag.retrieve(query)
prompt = build_rag_prompt(query, docs)

print("RAG PROMPT:")
print("=" * 50)
print(prompt)

In [None]:
def rag_answer(query, rag_system):
    """Answer a question using RAG."""
    # Retrieve relevant documents
    docs = rag_system.retrieve(query, top_k=3)
    
    # Build the prompt
    prompt = build_rag_prompt(query, docs)
    
    # Generate response using our helper
    response = chat(
        prompt,
        system="You are a helpful assistant that answers questions based on provided sources. Always cite your sources.",
        temperature=0.3  # Lower temperature for factual accuracy
    )
    
    return response

print("rag_answer() defined!")

In [None]:
# Test RAG answering
questions = [
    "How many vacation days do employees get?",
    "What's the meal expense limit for business travel?",
    "Can I work from home?",
]

for q in questions:
    print(f"Q: {q}")
    answer = rag_answer(q, rag)
    print(f"A: {answer}\n")
    print("-" * 50 + "\n")

## 4. Tool Calling Basics

Sometimes a model needs to do more than retrieve information. It needs to take actions.

In [None]:
# Define available tools
# WARNING: eval() is used here for simplicity. Emptying __builtins__ narrows what an
# expression can reach but is not a real sandbox. In production, use a proper
# math parser library like `simpleeval` to prevent code injection attacks.
TOOLS = {
    "calculate": {
        "description": "Perform basic arithmetic. Input should be a math expression like '2 + 2' or '15 * 3'.",
        "function": lambda expr: str(eval(expr, {"__builtins__": {}}, {}))
    },
    "get_date": {
        "description": "Get the current date.",
        "function": lambda: datetime.now().strftime("%Y-%m-%d")
    }
}

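def parse_tool_call(response):
    """Extract the first tool call from model output."""
    # Non-greedy match so a second <tool> tag in the response is not swallowed
    match = re.search(r'<tool>(\w+)\((.*?)\)</tool>', response, re.DOTALL)
    if match:
        # Models often quote the argument, e.g. calculate("2 + 2"); drop the quotes
        return match.group(1), match.group(2).strip().strip("\"'")
    return None, None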

def execute_tool(tool_name, argument):
    """Safely execute a whitelisted tool."""
    if tool_name not in TOOLS:
        return f"Error: Unknown tool '{tool_name}'"
    
    try:
        if argument:
            result = TOOLS[tool_name]["function"](argument)
        else:
            result = TOOLS[tool_name]["function"]()
        return str(result)
    except Exception as e:
        return f"Error executing {tool_name}: {e}"

print("Tool functions defined!")
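
The warning in the tool definitions recommends a real math parser over `eval()`. One standard-library way to do that is to parse the expression with `ast` and evaluate only a whitelist of node types. The sketch below is an assumption about how such a replacement could look, not the chapter's implementation. `safe_calculate` is a hypothetical name, and exponentiation is deliberately left out so that a huge power cannot stall the program.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expr):
    """Evaluate basic arithmetic without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Plain numbers only (bool is excluded because type() must match exactly)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, etc. all end up here
        raise ValueError(f"Unsupported expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval"))


print(safe_calculate("15 * 3"))
print(safe_calculate("-(2 + 3) * 4"))

try:
    safe_calculate("__import__('os').system('ls')")
except ValueError as err:
    print(f"Rejected: {err}")
```

To try it in the tool registry, the `calculate` entry's function could become `lambda expr: str(safe_calculate(expr))`. `execute_tool` already turns any raised exception into an error string.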

In [None]:
def chat_with_tools(user_message):
    """Chat with tool-calling capability."""
    
    # Build tool descriptions
    tool_descriptions = "\n".join(
        f"- {name}: {info['description']}"
        for name, info in TOOLS.items()
    )
    
    # First turn: ask the model
    prompt = f"""You have access to these tools:
{tool_descriptions}

To use a tool, write: <tool>name(argument)</tool>
Only use a tool if you need it to answer the question.

User: {user_message}
Assistant:"""
    
    first_response = chat(prompt, temperature=0)  # Deterministic for reliable parsing
    print(f"Model's first response: {first_response}")
    
    # Check if model wants to use a tool
    tool_name, argument = parse_tool_call(first_response)
    
    if tool_name:
        # Execute the tool
        tool_result = execute_tool(tool_name, argument)
        print(f"Tool executed: {tool_name}({argument}) = {tool_result}")
        
        # Second turn: give result back to model
        followup = f"""{prompt}{first_response}

Tool result: {tool_result}

Now provide your final answer to the user:"""
        
        final_response = chat(followup, temperature=0.3)
        return final_response
    
    # No tool needed, return first response
    return first_response

print("chat_with_tools() defined!")

In [None]:
# Test tool calling
print("Testing tool calling...\n")

# Test calculation
print("Q: What is 15% of 847?")
answer = chat_with_tools("What is 15% of 847?")
print(f"Final answer: {answer}\n")
print("-" * 50)

# Test date
print("\nQ: What's today's date?")
answer = chat_with_tools("What's today's date?")
print(f"Final answer: {answer}")

## 5. Evaluation and Testing

How do you know if your RAG system is working well? Build an evaluation set.

In [None]:
def evaluate_response(response, expected_traits):
    """Evaluate a response against expected traits."""
    results = {}
    response_lower = response.lower()
    
    # Check for citation markers
    if "cites_source" in expected_traits:
        has_citation = bool(re.search(r'\[\d+\]', response))
        results["cites_source"] = (has_citation == expected_traits["cites_source"])
    
    # Check for required keywords
    if "contains_keywords" in expected_traits:
        keywords = expected_traits["contains_keywords"]
        all_present = all(kw.lower() in response_lower for kw in keywords)
        results["contains_keywords"] = all_present
    
    # Check it's not a refusal
    if "not_refusal" in expected_traits:
        refusal_phrases = ["i don't know", "i cannot", "not in the sources", "no relevant"]
        is_refusal = any(phrase in response_lower for phrase in refusal_phrases)
        if expected_traits["not_refusal"]:
            results["not_refusal"] = not is_refusal
        else:
            results["not_refusal"] = is_refusal
    
    # Check minimum length
    if "min_words" in expected_traits:
        word_count = len(response.split())
        results["min_words"] = (word_count >= expected_traits["min_words"])
    
    return results

print("evaluate_response() defined!")

In [None]:
# Define evaluation set
eval_set = [
    {
        "query": "How many vacation days do employees get?",
        "expected_traits": {
            "contains_keywords": ["15", "days"],
            "cites_source": True,
            "not_refusal": True
        }
    },
    {
        "query": "What is the expense limit for meals?",
        "expected_traits": {
            "contains_keywords": ["75"],
            "cites_source": True,
            "not_refusal": True
        }
    },
    {
        "query": "What is the meaning of life?",  # Out of scope!
        "expected_traits": {
            "not_refusal": False,  # Should refuse
            "cites_source": False
        }
    },
]

print(f"Evaluation set: {len(eval_set)} test cases")

In [None]:
def run_evaluation(rag_system, eval_set):
    """Run full evaluation and report results."""
    total_tests = 0
    passed_tests = 0
    
    for case in eval_set:
        print(f"\nQuery: {case['query']}")
        
        response = rag_answer(case["query"], rag_system)
        results = evaluate_response(response, case["expected_traits"])
        
        for trait, passed in results.items():
            total_tests += 1
            if passed:
                passed_tests += 1
                print(f"  [PASS] {trait}")
            else:
                print(f"  [FAIL] {trait}")
        
        print(f"  Response: {response[:100]}...")
    
    print(f"\n{'='*50}")
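    # Guard against an empty evaluation set (avoids division by zero)
    pass_rate = 100 * passed_tests / total_tests if total_tests else 0.0
    print(f"Results: {passed_tests}/{total_tests} tests passed ({pass_rate:.1f}%)")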

print("run_evaluation() defined!")

In [None]:
# Run evaluation
print("Running RAG evaluation...")
run_evaluation(rag, eval_set)

## 6. Streaming Responses

Streaming makes responses feel faster by showing output as it's generated.

In [None]:
def stream_response(query, rag_system):
    """Stream a RAG response using Ollama."""
    import requests
    
    docs = rag_system.retrieve(query, top_k=3)
    prompt = build_rag_prompt(query, docs)
    
    # Use Ollama's streaming endpoint directly
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.2",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            "stream": True
        },
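        stream=True,
        timeout=120  # Fail instead of hanging if the server is unreachable
    )
    response.raise_for_status()  # Surface HTTP errors before parsing the stream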
    
    full_response = ""
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            if "message" in data and "content" in data["message"]:
                content = data["message"]["content"]
                print(content, end="", flush=True)
                full_response += content
    
    print()  # Newline at the end
    return full_response

print("stream_response() defined!")

In [None]:
# Test streaming
print("Testing streaming response...\n")
print("Q: How many days can I work remotely?\n")
print("A: ", end="")
stream_response("How many days can I work remotely?", rag)
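
The parsing loop inside `stream_response()` can be tested without a server. Each streamed line is a standalone JSON object, so the loop decodes one line at a time and joins the `message.content` pieces. In the sketch below, `collect_stream` is a hypothetical helper and the input lines are made up. They follow the shape the code above reads, plus a `done` flag on the final line, which is an assumption to check against the Ollama API documentation.

```python
import json


def collect_stream(lines):
    """Join the content fragments from newline-delimited JSON chat chunks."""
    parts = []
    for line in lines:
        if not line:
            continue  # Skip blank keep-alive lines
        data = json.loads(line)
        parts.append(data.get("message", {}).get("content", ""))
        if data.get("done"):
            break  # Final chunk: stop reading
    return "".join(parts)


# Simulated response lines (made-up data for illustration)
simulated_lines = [
    b'{"message": {"role": "assistant", "content": "Up to "}, "done": false}',
    b"",
    b'{"message": {"role": "assistant", "content": "3 days"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(collect_stream(simulated_lines))
```

Separating the parsing from the HTTP call makes the streaming logic easy to unit test, and the same helper could be fed `response.iter_lines()` directly.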

## Exercises

### Exercise 1: Build Your Own Knowledge Base

Create a knowledge base about a topic you care about.

In [None]:
# YOUR CODE HERE
# 1. Write 10 short documents about a topic (recipes, game rules, class notes)
# 2. Initialize a SimpleRAG instance
# 3. Add your documents
# 4. Test with 5 questions
# 5. Observe: Does it retrieve the right chunks?

### Exercise 2: Temperature Experiment

Test how temperature affects RAG responses.

In [None]:
# YOUR CODE HERE
# 1. Modify rag_answer() to accept a temperature parameter
# 2. Ask the same question with temperatures 0.3, 0.7, 1.0
# 3. Compare the responses
# 4. Which is most reliable for factual questions?

### Exercise 3: Evaluation Set Creation

Build a comprehensive evaluation set for your knowledge base.

In [None]:
# YOUR CODE HERE
# 1. Create 10 evaluation questions for your knowledge base
# 2. Include: 5 easy, 3 hard, 2 adversarial (out of scope)
# 3. Define expected traits for each
# 4. Run evaluation and report accuracy

### Exercise 4: Checkpoint - Personal Knowledge Assistant

Build a complete "chat with your documents" application.

In [None]:
# YOUR CODE HERE
# 1. Collect 5-10 text files (notes, articles, documentation)
# 2. Load them into a SimpleRAG instance
# 3. Build a chat loop that uses RAG for every response
# 4. Include: history management, citations, and (as a stretch goal) token counting
# 5. Create a 10-question evaluation set
# 6. Run evaluation and report accuracy

## Summary

**What we built:**

- A chat loop that trims conversation history by message count
- A small RAG system covering chunking, embedding, retrieval, and citation prompting
- A preview of tool calling with safety considerations
- A trait-based evaluation harness that checks answers against expected behavior
- Streaming output from Ollama's chat endpoint

**What we learned:**

- Your training knowledge (embeddings, chunking, data quality, reproducibility) transfers directly to applications
- RAG solves the "model doesn't know my data" problem
- Tool calling extends what models can do, but requires careful attention to safety
- Evaluation is essential, not optional