# LLM Experimentation with LlamaIndex

This notebook demonstrates the process of indexing Paul Graham's essays, querying them, and generating responses using a fine-tuned GPT-2 model. It showcases retrieval-augmented generation (RAG) techniques to enhance model responses with contextually relevant information.

## Overview
1. Load fine-tuned GPT-2 model and Paul Graham's essays index
2. Define helper functions for context processing and response generation
3. Query the model with retrieval augmentation
4. Analyze and visualize results

This notebook works in conjunction with the following implementation files:
- `src/finetune_model.py`: Fine-tunes the GPT-2 model on Paul Graham's essays
- `src/index_data.py`: Creates the vector index of essays using LlamaIndex
- `src/generate_text.py`: Interactive interface for generating responses
- `src/utils.py`: Utility functions for analysis and environment setup

In [None]:
# Import Required Libraries
import os
import torch
import time
import re
import textwrap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Import LLM and index-related libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Add the parent directory to path to allow importing from src
import sys
sys.path.append('..')
# Import our utility functions
from src.utils import check_environment

# Check environment and see if we have GPU support
check_environment()

# Step 1: Load the Fine-Tuned Model and Index

In this step, we load the fine-tuned GPT-2 model and the indexed essays. The model has been trained on Paul Graham's essays to better capture his writing style, insights, and perspectives.

The vector index enables efficient semantic retrieval of relevant essay content based on user queries. This is essential for our retrieval-augmented generation approach.
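
To make the retrieval step concrete, here is a minimal sketch of similarity-based lookup in plain Python. It assumes cosine similarity as the ranking measure, which is a common choice but is not confirmed by this notebook. The three-dimensional vectors and the names `toy_index`, `toy_query`, and `top_k` are hypothetical, for illustration only; a real sentence-embedding model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical embeddings standing in for indexed essay chunks
toy_index = {
    "essay_startups": [0.9, 0.1, 0.0],
    "essay_lisp": [0.1, 0.9, 0.1],
    "essay_cities": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # hypothetical embedding of a startup question

ranked = top_k(toy_query, toy_index, k=2)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

The query vector points mostly along the first axis, so the startup chunk ranks first. The real index does the same kind of comparison between the embedded query and the stored chunk embeddings.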

In [None]:
# Define paths and configuration
model_path = "../models/finetuned/paul_graham_gpt2"
storage_path = "../storage"

# Check if the model path exists
if not os.path.exists(model_path):
    print(f"❌ Model path '{model_path}' not found. Please run finetune_model.py first.")
    print(f"Example: python src/finetune_model.py --data_dir data --output_dir models/finetuned")
else:
    print(f"✓ Found fine-tuned model at {model_path}")

# Check if index exists
if not os.path.exists(storage_path) or not os.listdir(storage_path):
    print(f"❌ Index not found at '{storage_path}'. Please run index_data.py first.")
    print(f"Example: python src/index_data.py")
else:
    print(f"✓ Found vector index at {storage_path}")

# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

try:
    # Load tokenizer and model
    print("Loading tokenizer and model...")
    tokenizer = GPT2Tokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token  # Set padding token
    model = GPT2LMHeadModel.from_pretrained(model_path).to(device)
    print(f"✓ Model loaded successfully")
    
    # Initialize embedding model for index
    print("Initializing embedding model...")
    embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    
    # Load the vector index
    print("Loading vector index...")
    storage_context = StorageContext.from_defaults(persist_dir=storage_path)
    index = load_index_from_storage(storage_context, embed_model=embed_model)
    query_engine = index.as_query_engine()
    print(f"✓ Index loaded successfully with {len(index.docstore.docs)} nodes")
    
except Exception as e:
    print(f"❌ Error loading model or index: {e}")

# Step 2: Define Helper Functions

We'll define several helper functions to process the context, create structured prompts for the model, and clean up the generated responses. These functions are adapted from our `generate_optimized_text.py` implementation.

Key functions include:
- `filter_context`: Extracts and organizes the most relevant portions of retrieved text
- `create_prompt`: Constructs a well-structured prompt to guide the model's generation
- `post_process_response`: Cleans and formats the model's output for readability
- `generate_response`: Handles the end-to-end retrieval and generation process
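
The paragraph score used by `filter_context` combines a raw keyword count with a density term (matches per 100 characters). The standalone sketch below reproduces that formula on toy strings so its behavior is easier to see. The helper name `score_paragraph` and the sample texts are illustrative and not part of the notebook's code.

```python
def score_paragraph(paragraph, keywords):
    """Score = 1.5 * keyword matches + 0.5 * matches per 100 characters."""
    text = paragraph.lower()
    # Substring matching, as in filter_context: "art" would also match "start"
    matches = sum(1 for kw in keywords if kw in text)
    density = matches / (len(paragraph) / 100) if paragraph else 0.0
    return matches * 1.5 + density * 0.5

keywords = {"advice", "startups"}

short_hit = "Startups need users."          # 20 characters, one match
long_hit = "Startups " + "x" * 191          # 200 characters, one match
miss = "Lisp is a programming language."    # no match

for label, text in [("short_hit", short_hit), ("long_hit", long_hit), ("miss", miss)]:
    print(label, round(score_paragraph(text, keywords), 2))
```

With the same number of matches, the shorter paragraph scores higher because of the density term. This is worth keeping in mind when very short paragraphs crowd out longer, more informative ones.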

In [None]:
def filter_context(context, query, max_length=700):
    """Filters and organizes context to make it more relevant to the query.
    
    Args:
        context (str): Raw context from vector search
        query (str): User's query
        max_length (int): Maximum context length to return
        
    Returns:
        str: Filtered, relevant context
    """
    # Clean up the context
    cleaned_context = re.sub(r'file_path:.*?\n', '', context)
    cleaned_context = re.sub(r'Context information is below\.\s*---------------------\n', '', cleaned_context)
    
    # Split into paragraphs for scoring
    paragraphs = []
    current_paragraph = ""
    
    for line in cleaned_context.split('\n'):
        line = line.strip()
        if line:
            current_paragraph += line + " "
        elif current_paragraph:  # Empty line and we have content
            paragraphs.append(current_paragraph.strip())
            current_paragraph = ""
    
    # Add the last paragraph if it exists
    if current_paragraph:
        paragraphs.append(current_paragraph.strip())
    
    # Extract keywords from query 
    stop_words = {'what', 'does', 'how', 'why', 'when', 'where', 'which', 'paul', 'graham', 
                  'think', 'about', 'according', 'make', 'give', 'describe', 'say', 'tell'}
    
    query_tokens = re.findall(r'\b\w+\b', query.lower())
    query_keywords = set([w for w in query_tokens if len(w) > 2 and w not in stop_words])
    
    # Score paragraphs by relevance to query
    scored_paragraphs = []
    for p in paragraphs:
        p_lower = p.lower()
        # Count keyword matches
        keyword_matches = sum(1 for keyword in query_keywords if keyword in p_lower)
        # Calculate keyword density
        density = keyword_matches / (len(p) / 100) if len(p) > 0 else 0
        # Combined score
        score = keyword_matches * 1.5 + density * 0.5
        scored_paragraphs.append((p, score))
    
    # Sort paragraphs by score (highest first)
    sorted_paragraphs = [p for p, _ in sorted(scored_paragraphs, key=lambda x: x[1], reverse=True)]
    
    # Take top paragraphs up to max_length
    filtered_context = ""
    current_length = 0
    
    for p in sorted_paragraphs:
        p_length = len(p)
        if current_length + p_length + 2 <= max_length:  # +2 for newlines
            filtered_context += p + "\n\n"
            current_length += p_length + 2
        else:
            # Always include at least one paragraph; truncate it only if it exceeds max_length
            if not filtered_context:
                filtered_context = p if p_length <= max_length else p[:max_length - 3] + "..."
            break
    
    return filtered_context.strip()

def create_prompt(query, context):
    """Creates a structured prompt for the model.
    
    Args:
        query (str): User's question
        context (str): Relevant context for answering
        
    Returns:
        str: Formatted prompt for the model
    """
    # Create a well-structured prompt that guides the model
    prompt = f"""Answer the following question about Paul Graham's essays using ONLY the information provided below.

CONTEXT FROM PAUL GRAHAM'S ESSAYS:
{context}

QUESTION: {query}

ANSWER:"""
    
    return prompt.strip()

def post_process_response(full_response, query):
    """Cleans and formats the model's generated response.
    
    Args:
        full_response (str): Raw text from the model
        query (str): Original query
        
    Returns:
        str: Cleaned, formatted response
    """
    # Extract just the answer portion
    match = re.search(r'ANSWER:(.*)', full_response, re.DOTALL)
    if match:
        answer = match.group(1).strip()
    else:
        # If format not followed, take text after query
        try:
            query_index = full_response.lower().index(query.lower())
            answer = full_response[query_index + len(query):].strip()
        except ValueError:
            answer = full_response
    
    # Clean up formatting
    answer = re.sub(r'\[\d+\]', '', answer)  # Remove citation markers
    answer = re.sub(r'\n{2,}', '\n\n', answer)  # Collapse runs of blank lines
    answer = re.sub(r'[ \t]{2,}', ' ', answer)  # Collapse repeated spaces/tabs without removing newlines
    
    # Fix incomplete sentences at the end
    sentences = answer.split('.')
    if len(sentences) > 1 and len(sentences[-1].strip()) < 10:
        answer = '.'.join(sentences[:-1]) + '.'
        
    return answer

def generate_response(query, query_engine, tokenizer, model, device, 
                      max_context_length=700, max_tokens=150,
                      temperature=0.7, top_p=0.9):
    """End-to-end response generation using retrieval and LLM.
    
    Args:
        query (str): User question
        query_engine: LlamaIndex query engine
        tokenizer: GPT-2 tokenizer
        model: GPT-2 model
        device: Computation device (CPU/GPU)
        max_context_length (int): Maximum context length
        max_tokens (int): Maximum tokens to generate
        temperature (float): Generation temperature
        top_p (float): Nucleus sampling parameter
        
    Returns:
        tuple: (final_response, raw_context, filtered_context, prompt, generation_time)
    """
    start_time = time.time()
    
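    # Get context from index.
    # Note: str(raw_response) is the query engine's response text; the retrieved
    # chunks themselves are also available via raw_response.source_nodes.
    raw_response = query_engine.query(query)
    raw_context = str(raw_response)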
    
    # Filter to get most relevant context
    filtered_context = filter_context(raw_context, query, max_length=max_context_length)
    
    # Create prompt with context
    prompt = create_prompt(query, filtered_context)
    
    # Generate response with model (pass the attention mask along with the input IDs)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            num_beams=3,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode and post-process
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    final_response = post_process_response(generated_text, query)
    
    generation_time = time.time() - start_time
    
    return final_response, raw_context, filtered_context, prompt, generation_time

# Step 3: Query the Model

Now let's test the model with a sample query about Paul Graham's essays. We'll demonstrate how the retrieval-augmented generation process works:

1. User submits a question about Paul Graham's essays
2. The system retrieves relevant context from the indexed essays
3. The context is filtered and formatted into a prompt
4. The fine-tuned model generates a response based on the prompt
5. The response is post-processed for clarity and readability

This process combines the strengths of retrieval systems (accurate information lookup) with generative AI (natural language generation).
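
The five steps above can be sketched as a single function that takes the retriever and the generator as plain callables. This is a simplified illustration, not the notebook's `generate_response`: the names `rag_answer`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and the prompt template is shortened.

```python
from typing import Callable, List

def rag_answer(question: str,
               retrieve: Callable[[str], List[str]],
               generate: Callable[[str], str],
               max_context_chars: int = 700) -> str:
    """Retrieve passages, build a prompt, generate, and keep the text after 'ANSWER:'."""
    passages = retrieve(question)                                # step 2: retrieval
    context = "\n\n".join(passages)[:max_context_chars]          # step 3: trim context
    prompt = f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:"
    completion = generate(prompt)                                # step 4: generation
    return completion.split("ANSWER:", 1)[-1].strip()            # step 5: post-processing

# Stand-ins for the vector index and the fine-tuned model
def fake_retrieve(question: str) -> List[str]:
    return ["Make something people want."]

def fake_generate(prompt: str) -> str:
    # A causal language model returns the prompt followed by its continuation
    return prompt + " Build what users want."

answer = rag_answer("What should startups build?", fake_retrieve, fake_generate)
print(answer)
```

Passing the retriever and generator in as arguments keeps the control flow testable without loading a model or an index.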

In [None]:
# Let's test with a sample query
query = "What is Paul Graham's advice on startups?"

try:
    # Generate response
    response, raw_context, filtered_context, prompt, gen_time = generate_response(
        query=query,
        query_engine=query_engine,
        tokenizer=tokenizer,
        model=model,
        device=device,
        max_context_length=700,
        max_tokens=150,
        temperature=0.7,
        top_p=0.9
    )
    
    # Display results
    print("=" * 80)
    print(f"QUERY: {query}")
    print("-" * 80)
    print("FILTERED CONTEXT:")
    print("-" * 80)
    print(filtered_context[:300] + "..." if len(filtered_context) > 300 else filtered_context)
    print("-" * 80)
    print("RESPONSE:")
    print("-" * 80)
    print(textwrap.fill(response, width=80))
    print("-" * 80)
    print(f"Generation time: {gen_time:.2f} seconds")
    print("=" * 80)
    
except Exception as e:
    print(f"Error generating response: {e}")

# Step 4: Experiment with Different Queries

Let's try a few different queries to see how the model handles various questions about Paul Graham's essays. We'll create a simple function to test multiple queries and display the results in a consistent format.

In [None]:
def test_queries(queries, display_context=False):
    """Test multiple queries and display results.
    
    Args:
        queries (list): List of query strings
        display_context (bool): Whether to display retrieved context

    Returns:
        list: One dict per successful query with keys 'query', 'response', 'context', 'time'
    """
    results = []
    
    for query in queries:
        print(f"Processing: '{query}'")
        try:
            # Generate response
            response, raw_context, filtered_context, prompt, gen_time = generate_response(
                query=query,
                query_engine=query_engine,
                tokenizer=tokenizer,
                model=model,
                device=device
            )
            
            results.append({
                'query': query,
                'response': response,
                'context': filtered_context,
                'time': gen_time
            })
            
            # Display results
            print(f"Response generated in {gen_time:.2f} seconds")
            print("-" * 40)
            
        except Exception as e:
            print(f"Error: {e}")
    
    # Display all results in a nice format
    for i, result in enumerate(results):
        print("=" * 80)
        print(f"QUERY {i+1}: {result['query']}")
        print("-" * 80)
        print("RESPONSE:")
        print(textwrap.fill(result['response'], width=80))
        print("-" * 80)
        print(f"Generation time: {result['time']:.2f} seconds")
        
        if display_context:
            print("\nCONTEXT EXTRACT:")
            print(result['context'][:200] + ("..." if len(result['context']) > 200 else ""))
            
        print("=" * 80)
        print()
    
    # Return all results for further analysis
    return results

# Define a set of interesting queries
test_query_set = [
    "What does Paul Graham think about programming languages?",
    "How does Paul Graham describe the ideal founder?",
    "What is Paul Graham's philosophy on innovation?",
    "What advice does Paul Graham give to young people about careers?"
]

# Run the test queries
query_results = test_queries(test_query_set)

# Step 5: Visualize Context Relevance and Model Performance

Let's create some visualizations to better understand how the context influences our model's responses. We'll analyze:

1. Context relevance scoring: how well our ranking works
2. Response length vs. temperature
3. Generation time analysis
4. Keyword overlap between query, context, and response
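
One simple way to quantify keyword overlap between query, context, and response is Jaccard similarity on keyword sets. The sketch below is an assumption about how such a measure could be computed, not code from the project. The names `STOP_WORDS`, `keywords`, `jaccard`, and `overlap_report` are illustrative, and the stop-word list is deliberately short.

```python
import re

# Illustrative stop-word list; extend it to match the one used for ranking
STOP_WORDS = {"what", "does", "how", "the", "and", "about", "paul", "graham"}

def keywords(text):
    """Lower-cased words longer than two characters, minus stop words."""
    return {w for w in re.findall(r"\b\w+\b", text.lower())
            if len(w) > 2 and w not in STOP_WORDS}

def jaccard(a, b):
    """Intersection size divided by union size (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_report(query, context, response):
    """Pairwise keyword overlap between query, context, and response."""
    q, c, r = keywords(query), keywords(context), keywords(response)
    return {
        "query_context": jaccard(q, c),
        "query_response": jaccard(q, r),
        "context_response": jaccard(c, r),
    }

report = overlap_report(
    query="What does Paul Graham think about startups?",
    context="Startups grow fast.",
    response="Startups grow.",
)
for pair, value in report.items():
    print(f"{pair}: {value:.2f}")
```

A high context-response score paired with a low query-response score would suggest the model is echoing the retrieved text rather than addressing the question.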

In [None]:
# Create a more advanced visualization of context relevance
def visualize_context_relevance(context, query):
    """Visualizes the relevance of context segments to the query."""
    # Split into paragraphs
    paragraphs = [p.strip() for p in context.split('\n\n') if p.strip()]
    
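    # Extract keywords from query (same stop-word list as filter_context, so the scores match its ranking)
    stop_words = {'what', 'does', 'how', 'why', 'when', 'where', 'which', 'paul', 'graham',
                  'think', 'about', 'according', 'make', 'give', 'describe', 'say', 'tell'}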
    query_tokens = re.findall(r'\b\w+\b', query.lower())
    query_keywords = [w for w in query_tokens if len(w) > 2 and w not in stop_words]
    
    # Calculate relevance scores based on keyword presence
    relevance_scores = []
    paragraph_texts = []
    
    for i, p in enumerate(paragraphs):
        # Limit to first 50 chars for display
        paragraph_texts.append(f"P{i+1}: {p[:50]}...")
        
        # Score based on keyword matches
        p_lower = p.lower()
        keyword_matches = sum(1 for keyword in query_keywords if keyword in p_lower)
        density = keyword_matches / (len(p) / 100) if len(p) > 0 else 0
        score = keyword_matches * 1.5 + density * 0.5
        relevance_scores.append(score)
    
    # Create visualization
    plt.figure(figsize=(12, 6))
    bars = plt.bar(paragraph_texts, relevance_scores, color="skyblue")
    
    # Highlight most relevant paragraph
    if relevance_scores:
        max_idx = relevance_scores.index(max(relevance_scores))
        bars[max_idx].set_color('orange')
    
    plt.xlabel("Context Paragraphs")
    plt.ylabel("Relevance Score")
    plt.title(f"Context Relevance to Query: '{query}'")
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()
    
    # Return the most relevant paragraph
    if paragraphs:
        max_idx = relevance_scores.index(max(relevance_scores))
        return paragraphs[max_idx]
    return ""

# Try the visualization on our first query
sample_query = "What is Paul Graham's advice on startups?"
sample_response, sample_raw_context, sample_filtered_context, sample_prompt, _ = generate_response(
    query=sample_query,
    query_engine=query_engine,
    tokenizer=tokenizer,
    model=model,
    device=device
)

most_relevant_paragraph = visualize_context_relevance(sample_filtered_context, sample_query)
print("\nMost relevant paragraph:")
print("-" * 40)
print(textwrap.fill(most_relevant_paragraph, width=80))

In [None]:
# Advanced Performance Analysis

# 1. Try different temperature settings to see impact on responses
temperature_values = [0.2, 0.5, 0.8, 1.0]
temp_query = "What makes a successful startup according to Paul Graham?"

print("Testing different temperature values...")
temp_results = []

for temp in temperature_values:
    response, _, _, _, gen_time = generate_response(
        query=temp_query,
        query_engine=query_engine,
        tokenizer=tokenizer,
        model=model,
        device=device,
        temperature=temp
    )
    temp_results.append((temp, response, gen_time))
    print(f"Temperature {temp}: Generated in {gen_time:.2f}s")

# Display the results for comparison
print("\nComparing responses at different temperatures:")
for temp, response, _ in temp_results:
    print(f"\nTemperature = {temp}:")
    print("-" * 40)
    print(textwrap.fill(response[:200] + ("..." if len(response) > 200 else ""), width=80))
    
# Create metrics for response quality
def get_response_metrics(response):
    metrics = {
        'length': len(response),
        'sentences': len([s for s in re.split(r'[.!?]+', response) if s.strip()]),
        'words': len(re.findall(r'\b\w+\b', response))
    }
    return metrics

# Get metrics for each temperature
metrics_data = {'temperature': [], 'length': [], 'sentences': [], 'words': [], 'time': []}
for temp, response, gen_time in temp_results:
    metrics = get_response_metrics(response)
    metrics_data['temperature'].append(temp)
    metrics_data['length'].append(metrics['length'])
    metrics_data['sentences'].append(metrics['sentences'])
    metrics_data['words'].append(metrics['words'])
    metrics_data['time'].append(gen_time)

# Plot metrics vs temperature
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Response length metrics
ax1.plot(metrics_data['temperature'], metrics_data['words'], 'o-', label='Word Count')
ax1.plot(metrics_data['temperature'], metrics_data['sentences'], 's-', label='Sentence Count')
ax1.set_xlabel('Temperature')
ax1.set_ylabel('Count')
ax1.set_title('Response Length Metrics vs Temperature')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Generation time
ax2.plot(metrics_data['temperature'], metrics_data['time'], 'o-', color='orange')
ax2.set_xlabel('Temperature')
ax2.set_ylabel('Generation Time (seconds)')
ax2.set_title('Generation Time vs Temperature')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
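
To interpret the temperature sweep, it helps to see what the `temperature` and `top_p` parameters do to a next-token distribution. The following is a simplified, pure-Python sketch of temperature scaling and nucleus (top-p) filtering on three hypothetical logits. It illustrates the idea only and is not the implementation used inside `model.generate`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.2)   # low temperature: peaked
flat = softmax_with_temperature(logits, 1.0)    # higher temperature: flatter
nucleus = top_p_filter(flat, 0.9)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sorted(nucleus))
```

Lower temperatures concentrate probability on the top token, which is why low-temperature responses tend to be more repetitive and predictable, while higher values spread probability across more candidates.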

# Step 6: Evaluating Response Quality

Let's analyze the quality of our responses compared to direct generation (without retrieval augmentation). This will help us understand the benefits of the retrieval-augmented generation approach.

We'll compare:
1. Response with retrieval (context-aware)
2. Response without retrieval (model knowledge only)
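
A side-by-side reading can be supplemented with a rough grounding measure: the fraction of a response's word bigrams that also appear in the retrieved context. The sketch below is one possible proxy under that assumption. The function names `bigrams` and `supported_fraction` and the sample strings are illustrative, and lexical overlap does not by itself establish that an answer is correct.

```python
import re

def bigrams(text):
    """Set of adjacent lower-cased word pairs in the text."""
    words = re.findall(r"\b\w+\b", text.lower())
    return set(zip(words, words[1:]))

def supported_fraction(response, context):
    """Fraction of distinct response bigrams that also occur in the context."""
    resp = bigrams(response)
    if not resp:
        return 0.0
    return len(resp & bigrams(context)) / len(resp)

# Toy strings standing in for retrieved context and two model responses
context = "Startups should make something people want."
grounded = "Make something people want."
ungrounded = "Raise money quickly."

print(supported_fraction(grounded, context))
print(supported_fraction(ungrounded, context))
```

Applied to `retrieval_response` and `direct_response` with the same `filtered_context`, a noticeably higher value for the retrieval-augmented response would be consistent with the model drawing on the supplied passages.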

In [None]:
def generate_direct_response(query, tokenizer, model, device, max_tokens=150):
    """Generate response without retrieval augmentation."""
    # Create a direct prompt
    prompt = f"QUESTION: {query}\n\nANSWER:"
    
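    # Generate response with the same decoding settings as generate_response,
    # so the comparison isolates the effect of retrieval
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            num_beams=3,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.eos_token_id
        )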
    
    # Decode and post-process
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    final_response = post_process_response(generated_text, query)
    
    return final_response

# Compare retrieval vs non-retrieval responses
comparison_query = "What does Paul Graham think about risk-taking in startups?"

print("Generating responses for comparison...")

# Response with retrieval
retrieval_response, raw_context, filtered_context, _, _ = generate_response(
    query=comparison_query,
    query_engine=query_engine,
    tokenizer=tokenizer,
    model=model,
    device=device
)

# Response without retrieval
direct_response = generate_direct_response(
    query=comparison_query,
    tokenizer=tokenizer,
    model=model,
    device=device
)

# Display the comparison
print("\n" + "=" * 80)
print(f"QUERY: {comparison_query}")
print("=" * 80)

print("\nRESPONSE WITH RETRIEVAL:")
print("-" * 40)
print(textwrap.fill(retrieval_response, width=80))

print("\nRESPONSE WITHOUT RETRIEVAL:")
print("-" * 40)
print(textwrap.fill(direct_response, width=80))

print("\nKEY CONTEXT USED:")
print("-" * 40)
context_preview = filtered_context[:300] + "..." if len(filtered_context) > 300 else filtered_context
print(textwrap.fill(context_preview, width=80))

# Conclusion

In this notebook, we've demonstrated how to use LlamaIndex and a fine-tuned GPT-2 model to create a system that can answer questions about Paul Graham's essays. Key components of our approach include:

1. **Vector Indexing**: Storing and retrieving essay content using semantic search
2. **Context Filtering**: Identifying the most relevant passages for a given query
3. **Fine-tuned Model**: Using a model specifically trained on Paul Graham's writing style
4. **Retrieval-Augmented Generation**: Combining retrieval and generation for more accurate responses

This approach helps address some of the limitations of traditional language models, such as hallucinations and outdated knowledge, by grounding responses in specific source material.

## Next Steps

To further improve this system, consider:

1. Experimenting with different embedding models for better retrieval
2. Fine-tuning larger models like GPT-J or LLaMA
3. Implementing a feedback mechanism to improve response quality over time
4. Adding source citations to responses for better transparency