In [None]:
print('Setup complete.')

# RAG Basics & First Retrieval - Hands-On Lab

**Hands-on**: build a small index over the provided documents, then answer questions with citations to the top-k retrieved chunks

**Deliverable**: answer text with cited chunks

## Lab Objectives
By the end of this lab, you will:
- Build a searchable index from provided documents
- Implement retrieval to find top-k relevant chunks
- Generate answers that include proper source citations
- Validate that your citations are accurate and traceable

## Setup Instructions
1. Run the installation cell below to install required packages
2. Set your OpenAI API key in the environment
3. Work through each section step by step
4. Test your implementation with the provided queries

## Provided Documents
You will work with a small knowledge base about sustainable energy topics.

In [None]:
# Install required packages for Google Colab compatibility
!pip install langchain langchain-openai langchain-community faiss-cpu tiktoken python-dotenv

# TODO: Import all necessary modules for RAG implementation
# You will need:
# - Document handling: Document from langchain_core.documents (langchain.schema is the older path)
# - Text processing: RecursiveCharacterTextSplitter from langchain.text_splitter
# - Embeddings: OpenAIEmbeddings from langchain_openai
# - Vector store: FAISS from langchain_community.vectorstores (langchain.vectorstores is deprecated)
# - Language model: ChatOpenAI from langchain_openai
# - Chain: RetrievalQA from langchain.chains
# - Prompts: PromptTemplate from langchain.prompts
# - Standard libraries: os, json, sys

print("✅ Installation complete - now add your imports!")

In [None]:
# TODO: Set your OpenAI API key
# Option 1: Set directly (not recommended for production)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Option 2: Load from environment file (recommended)
# from dotenv import load_dotenv
# load_dotenv()

# TODO: Add a check to verify the API key is set
# Print a success message if found, warning if not found
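
The key check described above can be sketched with the standard library alone. This is a minimal illustration, not the required solution; the helper name `api_key_status` and the exact message wording are assumptions made for this example.

```python
import os


def api_key_status(env=None):
    """Return a short status message for the OPENAI_API_KEY variable.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    if key.strip():
        # Report only that the key exists; never print the key itself.
        return "OK: OPENAI_API_KEY is set"
    return "WARNING: OPENAI_API_KEY is not set"


print(api_key_status())
```

Treating a blank or whitespace-only value as missing avoids a confusing authentication error later in the lab.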

## Task 1: Document Loading and Preparation

Load the provided documents about sustainable energy and convert them to LangChain Document objects.

In [None]:
# PROVIDED: Document data about sustainable energy
energy_documents = [
    {
        "content": "Solar power harnesses energy from the sun using photovoltaic cells or solar thermal collectors. Modern solar panels can convert about 15-22% of sunlight into electricity. Solar farms can generate utility-scale power, while rooftop installations serve individual buildings. The technology has become increasingly cost-effective, with prices dropping over 80% in the last decade. Solar power is intermittent, requiring energy storage or grid integration for consistent power supply.",
        "metadata": {"source": "solar_energy_overview.pdf", "topic": "Solar Power", "date": "2024"}
    },
    {
        "content": "Wind energy captures kinetic energy from moving air using wind turbines. Modern turbines can reach heights of 150+ meters with rotor diameters exceeding 100 meters. Wind farms are typically located in areas with consistent wind patterns - either onshore in plains and hills, or offshore where winds are stronger and more consistent. Wind power has low operating costs once installed but faces challenges with intermittency and visual/noise impacts on communities.",
        "metadata": {"source": "wind_power_guide.pdf", "topic": "Wind Energy", "date": "2024"}
    },
    {
        "content": "Energy storage systems are crucial for renewable energy integration. Battery technologies like lithium-ion provide short-term storage for homes and grids. Pumped hydro storage uses excess energy to pump water uphill, then generates power as water flows down. Other solutions include compressed air storage, flywheels, and emerging technologies like hydrogen fuel cells. Storage helps balance supply and demand when renewable sources are intermittent.",
        "metadata": {"source": "energy_storage.pdf", "topic": "Energy Storage", "date": "2024"}
    },
    {
        "content": "Smart grids use digital technology to manage electricity flow efficiently. They enable two-way communication between utilities and consumers, allowing for real-time monitoring and optimization. Smart grids can automatically reroute power during outages, integrate renewable sources more effectively, and enable demand response programs. Advanced metering infrastructure provides detailed energy usage data to both utilities and consumers for better energy management.",
        "metadata": {"source": "smart_grid_tech.pdf", "topic": "Smart Grids", "date": "2024"}
    },
    {
        "content": "Electric vehicles (EVs) are becoming a key part of sustainable transportation. Modern EVs can travel 200-400 miles per charge, with charging infrastructure expanding rapidly. EVs can serve as mobile energy storage, potentially feeding power back to the grid during peak demand. The transition to EVs reduces greenhouse gas emissions, especially when powered by renewable electricity. Challenges include charging time, battery costs, and the need for widespread charging infrastructure.",
        "metadata": {"source": "electric_vehicles.pdf", "topic": "Electric Vehicles", "date": "2024"}
    },
    {
        "content": "Geothermal energy taps into Earth's internal heat for power generation and heating. Geothermal power plants can provide consistent baseload power, unlike intermittent solar and wind. Enhanced geothermal systems (EGS) can expand geothermal potential to areas without natural hot springs. Geothermal has a small environmental footprint and can operate 24/7. However, it requires specific geological conditions and has high upfront costs for drilling and plant construction.",
        "metadata": {"source": "geothermal_energy.pdf", "topic": "Geothermal", "date": "2024"}
    }
]

# TODO: Convert the energy_documents list to LangChain Document objects
# Use the Document class with page_content and metadata parameters
# Store the result in a variable called 'documents'

# TODO: Print summary statistics about your loaded documents:
# - Total number of documents
# - Total character count across all documents
# - List of document sources and topics
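
The conversion and summary steps might look roughly like the sketch below. It uses two made-up records (`sample_records`, an assumption for this example) in the same shape as `energy_documents`, and falls back to a small stand-in class with the same two fields if LangChain is not importable, so the cell runs either way.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields so the sketch runs without LangChain.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


# Hypothetical sample data, shaped like the entries in energy_documents.
sample_records = [
    {"content": "Solar text.", "metadata": {"source": "a.pdf", "topic": "Solar"}},
    {"content": "Wind text!!", "metadata": {"source": "b.pdf", "topic": "Wind"}},
]

# One Document per record: the text goes in page_content, the rest in metadata.
documents = [
    Document(page_content=r["content"], metadata=r["metadata"])
    for r in sample_records
]

total_chars = sum(len(d.page_content) for d in documents)
print(f"Documents: {len(documents)}, total characters: {total_chars}")
for d in documents:
    print(f"- {d.metadata['source']} ({d.metadata['topic']})")
```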

## Task 2: Text Chunking Strategy

Split the documents into overlapping chunks that are small enough for precise retrieval but large enough to keep their context.

In [None]:
# TODO: Initialize a RecursiveCharacterTextSplitter with these parameters:
# - chunk_size: 250 (good size for detailed retrieval)
# - chunk_overlap: 40 (preserve context across boundaries)
# - length_function: len
# - separators: [". ", "\n", " ", ""] (try sentence boundaries first)

# TODO: Use the text splitter to split your documents into chunks
# Store the result in a variable called 'chunks'

# TODO: Print analysis of your chunking:
# - Number of original documents vs. number of chunks
# - Average chunk size in characters
# - Show the first 2 chunks as examples with their metadata
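
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window chunker in plain Python. It is not `RecursiveCharacterTextSplitter` (which also tries the listed separators before cutting); the function name `sliding_chunks` is an assumption for this illustration.

```python
def sliding_chunks(text, chunk_size=250, chunk_overlap=40):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "x" * 600
chunks = sliding_chunks(text)
print(len(chunks), [len(c) for c in chunks])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 40 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence split across a boundary retrievable from either side.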

## Task 3: Create Searchable Index

Build a vector index using embeddings that will enable semantic search.

In [None]:
# TODO: Initialize OpenAI embeddings
# Use the text-embedding-ada-002 model
# Set chunk_size=1000 for batch processing

# TODO: Create a FAISS vector store from your chunks and embeddings
# Use FAISS.from_documents() method
# Store the result in a variable called 'vectorstore'
# Add error handling in case the API call fails

# TODO: Print confirmation of successful index creation:
# - Number of vectors in the index
# - Embedding dimension size
# - Success message

## Task 4: Implement Retrieval with Top-K Results

Test your index by retrieving the most relevant chunks for different queries.

In [None]:
# PROVIDED: Test queries for your retrieval system
test_queries = [
    "How efficient are solar panels at converting sunlight?",
    "What are the challenges with wind energy?",
    "How do smart grids help manage electricity?",
    "What energy storage options are available?",
    "What makes electric vehicles sustainable?"
]

# TODO: For each query in test_queries:
# 1. Use vectorstore.similarity_search_with_score() to get top 2 results
# 2. Print the query
# 3. For each result, print:
#    - Similarity score (a distance with the default FAISS index, so lower is better)
#    - Source document name
#    - Topic from metadata  
#    - First 60 characters of content
# 4. Add a separator line between queries

# HINT: The similarity_search_with_score returns tuples of (Document, score)
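
The "lower is better" convention is easier to see on toy data. The sketch below ranks hand-written 2-D vectors by Euclidean distance; the vectors, the helper `top_k_by_l2`, and the mapping of labels to vectors are all assumptions for illustration, and real embeddings have far more dimensions.

```python
import math


def top_k_by_l2(query_vec, indexed, k=2):
    """Return the k (label, distance) pairs closest to query_vec."""
    scored = [(label, math.dist(query_vec, vec)) for label, vec in indexed]
    # Smaller distance means a closer match, so sort in ascending order.
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "index": each source is represented by one made-up 2-D vector.
toy_index = [
    ("solar_energy_overview.pdf", (1.0, 0.0)),
    ("wind_power_guide.pdf", (0.0, 1.0)),
    ("energy_storage.pdf", (0.7, 0.7)),
]

results = top_k_by_l2((0.9, 0.1), toy_index, k=2)
for label, distance in results:
    print(f"{distance:.3f}  {label}")
```

The result is a list of pairs ordered from best to worst match, which mirrors the `(Document, score)` tuples mentioned in the hint.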

## Task 5: Build RAG Chain with Citation Support

Create a complete RAG system that generates answers with proper source citations.

In [None]:
# TODO: Create a custom prompt template for RAG with citations
# Your template should:
# - Accept {context} and {question} as input variables
# - Instruct the model to answer based only on provided context
# - Require citations by mentioning source document names
# - Tell the model to say so when the context is insufficient to answer
# - Emphasize not making up information
# Use PromptTemplate class with template and input_variables parameters

# Example template structure (customize as needed):
# """
# You are an expert on sustainable energy helping users with questions.
# 
# Relevant context:
# {context}
# 
# Question: {question}
# 
# Instructions:
# [Add your instructions here]
# 
# Answer:
# """
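
One way to think about the template is as an ordinary format string with two slots. The sketch below uses plain `str.format` rather than `PromptTemplate`, and the instruction wording, the `[source: ...]` labelling, and the helper `build_context` are assumptions for this example, not the lab's required text.

```python
TEMPLATE = (
    "You are an expert on sustainable energy helping users with questions.\n\n"
    "Relevant context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Instructions:\n"
    "- Answer using only the context above.\n"
    "- Cite the source file name for each claim, e.g. (source: file.pdf).\n"
    "- If the context is insufficient, say so instead of guessing.\n\n"
    "Answer:"
)


def build_context(chunks):
    """Join (source_name, text) pairs, labelling each chunk with its source."""
    return "\n\n".join(f"[source: {src}]\n{text}" for src, text in chunks)


prompt_text = TEMPLATE.format(
    context=build_context(
        [("solar_energy_overview.pdf", "Solar panels convert about 15-22% of sunlight.")]
    ),
    question="How efficient are solar panels?",
)
print(prompt_text)
```

The model can only cite a file name it has actually seen, so it is worth checking that source names appear in the context your chain passes to the prompt.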

In [None]:
# TODO: Initialize the language model and create RAG chain
# 1. Create ChatOpenAI with:
#    - temperature=0 (for consistency)
#    - model="gpt-3.5-turbo"
#    - max_tokens=400
# 
# 2. Create a retriever from your vectorstore:
#    - Use vectorstore.as_retriever()
#    - Set search_type="similarity"
#    - Set search_kwargs={"k": 3} for top 3 chunks
#
# 3. Create RetrievalQA chain (store it in a variable called 'qa_chain'):
#    - Use RetrievalQA.from_chain_type()
#    - Set chain_type="stuff"
#    - Use your retriever and LLM
#    - Set return_source_documents=True
#    - Pass your custom prompt in chain_type_kwargs
# 
# Add error handling for API issues
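
For the error-handling step, a small retry wrapper is one possible approach. The sketch below is generic Python and makes no LangChain or OpenAI calls; `call_with_retry`, its parameters, and the fake `flaky_call` function are hypothetical names introduced for this example.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose; narrow this in real code
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"call failed after {attempts} attempts") from last_error


# Demo with a fake function that fails once, then succeeds.
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 2:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retry(flaky_call, base_delay=0.0))
```

In the lab you could wrap the chain construction or each query in such a helper, so that a transient API failure produces a clear message rather than a long traceback.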

## Task 6: Test Complete RAG Pipeline

Generate answers with citations and validate the results.

In [None]:
# PROVIDED: Final test questions
final_questions = [
    "What are the main advantages and challenges of solar power?",
    "How do energy storage systems support renewable energy?",
    "What role do electric vehicles play in sustainable transportation?"
]

# TODO: For each question in final_questions:
# 1. Use your qa_chain to get an answer: qa_chain.invoke({"query": question})
#    (calling qa_chain({...}) directly is deprecated in recent LangChain releases)
# 2. Display the question clearly
# 3. Display the generated answer
# 4. List all source documents that were consulted:
#    - Show source filename
#    - Show topic from metadata
#    - Show first 80 characters of each source chunk
# 5. Add clear separators between questions
# 6. Handle any errors gracefully

# DELIVERABLE: Your output should show answer text with cited chunks
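
The display step can be sketched independently of the chain. The example below assumes the result is a dict with a `"result"` string and a `"source_documents"` list, which is the shape `RetrievalQA` returns when `return_source_documents=True`; the helper `format_result` and the fake data are assumptions for illustration.

```python
from types import SimpleNamespace


def format_result(question, result, preview_chars=80):
    """Render a question, its answer, and a one-line preview of each source chunk."""
    lines = [f"Q: {question}", f"A: {result['result']}", "Sources:"]
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        topic = doc.metadata.get("topic", "unknown")
        preview = doc.page_content[:preview_chars]
        lines.append(f"- {source} | {topic} | {preview}")
    return "\n".join(lines)


# Fake result object standing in for a real chain response.
fake_result = {
    "result": "Storage balances supply and demand (source: energy_storage.pdf).",
    "source_documents": [
        SimpleNamespace(
            page_content="Energy storage systems are crucial for renewable energy integration.",
            metadata={"source": "energy_storage.pdf", "topic": "Energy Storage"},
        )
    ],
}

print(format_result("How does storage help renewables?", fake_result))
print("=" * 60)  # separator between questions
```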

## Task 7: Validate Citations

Verify that your system is providing accurate citations.

In [None]:
# TODO: Create a function to validate citations
# The function should:
# 1. Take an answer and source documents as parameters
# 2. Check if document names mentioned in the answer actually exist in sources
# 3. Look for phrases like "according to X.pdf" or "from X.pdf"
# 4. Return a report of citation accuracy

# TODO: Test your citation validation with at least one example
# Use a question and manually check if citations match sources

# BONUS: Implement additional validation:
# - Check if cited content actually appears in the mentioned source
# - Flag potential hallucinations where facts don't match sources
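
A first pass at citation validation can be done with a regular expression. The sketch below assumes citations appear as literal `.pdf` file names in the answer text; the function name `validate_citations`, the pattern, and the report fields are choices made for this example.

```python
import re

# Matches file names such as solar_energy_overview.pdf (letters, digits, _ and -).
PDF_NAME = re.compile(r"[\w\-]+\.pdf")


def validate_citations(answer, source_names):
    """Compare file names cited in the answer against the retrieved sources."""
    cited = set(PDF_NAME.findall(answer))
    known = set(source_names)
    return {
        "cited": sorted(cited),
        "valid": sorted(cited & known),      # cited and actually retrieved
        "unknown": sorted(cited - known),    # cited but not among the sources
        "has_citation": bool(cited),
    }


report = validate_citations(
    "According to solar_energy_overview.pdf, panels reach 15-22%. See also made_up.pdf.",
    ["solar_energy_overview.pdf", "energy_storage.pdf"],
)
print(report)
```

Any name in `unknown` points to a citation that was not among the retrieved chunks. This check covers names only: it does not confirm that the cited file supports the claim, which is what the bonus task addresses.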

## Lab Completion Checklist

Before submitting, verify you have completed:

### ✅ Required Deliverables
- [ ] Built a searchable index from the provided sustainable energy documents
- [ ] Implemented top-k retrieval (tested with k=2 and k=3)
- [ ] Generated answers that include source citations
- [ ] Validated that citations reference actual source documents

### ✅ Technical Implementation
- [ ] Successfully loaded and chunked documents
- [ ] Created embeddings and vector store index
- [ ] Implemented similarity search with scores
- [ ] Built complete RAG chain with custom prompt
- [ ] Generated responses to all test questions

### ✅ Quality Checks
- [ ] Answers are grounded in provided context
- [ ] Citations reference actual source document names
- [ ] Retrieved chunks are relevant to queries
- [ ] No obvious hallucinations or made-up facts
- [ ] Code runs without errors

## Reflection Questions

1. **Retrieval Quality**: Did your system retrieve relevant chunks for each query? What could improve retrieval accuracy?

2. **Citation Accuracy**: Are the citations in your answers accurate? Do they reference the correct source documents?

3. **Answer Quality**: Are the generated answers helpful and grounded in the provided context?

4. **Chunk Strategy**: How did your chunking approach affect retrieval quality? Would different chunk sizes help?

## Next Steps

After completing this lab, you're ready to:
- Experiment with different chunking strategies
- Try hybrid search combining semantic and keyword search
- Implement more sophisticated citation validation
- Add metadata filtering to improve precision
- Scale to larger document collections