# RAG Application for Domain-Specific Question Answering

This notebook demonstrates a Retrieval-Augmented Generation (RAG) system for answering questions based on custom documents.

## Features
- Load and process documents (PDF, TXT, DOCX)
- Create vector embeddings using OpenAI
- Store embeddings in ChromaDB
- Answer questions using retrieved context
- Show source documents for transparency

## 1. Setup and Installation

First, install required packages:

In [None]:
!pip install langchain langchain-community langchain-openai chromadb pypdf python-docx docx2txt python-dotenv openai tiktoken -q

## 2. Configuration

Set your OpenAI API key and define the settings used throughout the notebook:

In [None]:
import os
import getpass

# Set your OpenAI API key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

# Configuration
EMBEDDING_MODEL = "text-embedding-ada-002"
LLM_MODEL = "gpt-3.5-turbo"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TOP_K_RESULTS = 4

## 3. Import Libraries

In [None]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from typing import List
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

## 4. Load and Process Documents

You can upload your own documents or create sample documents:

In [None]:
# Create sample documents if needed
import os

# Create documents directory
os.makedirs('sample_docs', exist_ok=True)

# Sample document 1: Machine Learning
ml_content = """Machine Learning Best Practices

Machine learning is a powerful tool for solving complex problems. Here are key best practices:

1. Data Quality: Always ensure your training data is clean and representative.
2. Model Selection: Start with simple models before moving to complex ones.
3. Validation: Use cross-validation to get robust performance estimates.
4. Overfitting: Apply regularization techniques to prevent overfitting.
5. Evaluation: Choose appropriate metrics for your specific problem.
"""

# Sample document 2: NLP
nlp_content = """Natural Language Processing Guide

NLP focuses on enabling computers to understand human language. Key concepts:

1. Tokenization: Breaking text into words or sentences.
2. Embeddings: Converting words into numerical vectors.
3. Transformers: Modern architecture that revolutionized NLP.
4. BERT and GPT: Popular pre-trained models for various NLP tasks.
5. Fine-tuning: Adapting pre-trained models to specific tasks.
"""

# Sample document 3: RAG Systems
rag_content = """Retrieval-Augmented Generation Systems

RAG combines retrieval with generation for better AI responses:

1. Document Indexing: Store documents in a vector database.
2. Retrieval: Find relevant documents based on query similarity.
3. Generation: Use LLM to generate answers from retrieved context.
4. Benefits: Reduced hallucinations, up-to-date information, source attribution.
5. Use Cases: Customer support, technical documentation, research assistance.
"""

# Write sample documents
with open('sample_docs/ml_best_practices.txt', 'w') as f:
    f.write(ml_content)

with open('sample_docs/nlp_guide.txt', 'w') as f:
    f.write(nlp_content)

with open('sample_docs/rag_systems.txt', 'w') as f:
    f.write(rag_content)

print("✅ Sample documents created in 'sample_docs' directory")

In [None]:
# Load documents from directory
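def load_documents_from_directory(directory: str) -> List:
    """Load all supported documents (.txt, .pdf, .docx) from a directory"""
    # Map each supported extension to its loader class
    # (Docx2txtLoader needs the docx2txt package to be installed)
    loaders = {'.txt': TextLoader, '.pdf': PyPDFLoader, '.docx': Docx2txtLoader}
    documents = []

    for filename in sorted(os.listdir(directory)):
        file_path = os.path.join(directory, filename)
        extension = os.path.splitext(filename)[1].lower()

        if extension in loaders:
            loader = loaders[extension](file_path)
            documents.extend(loader.load())
            print(f"Loaded: {filename}")
        else:
            print(f"Skipped (unsupported type): {filename}")

    return documents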

# Load documents
docs = load_documents_from_directory('sample_docs')
print(f"\n✅ Loaded {len(docs)} documents")

## 5. Split Documents into Chunks

In [None]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
)

chunks = text_splitter.split_documents(docs)
print(f"✅ Split into {len(chunks)} chunks")

# Show a sample chunk
if chunks:
    print(f"\nSample chunk:\n{chunks[0].page_content[:200]}...")
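
To build intuition for how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of fixed-size sliding-window chunking in plain Python. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph and line boundaries first, so its chunks will generally differ. The helper name `sliding_window_chunks` and the dummy text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return windows


# Hypothetical dummy text: 2,500 characters
sample_text = "abcdefghij" * 250
demo_chunks = sliding_window_chunks(sample_text, chunk_size=1000, chunk_overlap=200)
print([len(chunk) for chunk in demo_chunks])
```

Each chunk starts with the last 200 characters of the previous one, which helps keep text that straddles a chunk boundary intact in one of the chunks.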

## 6. Create Vector Store

Create embeddings and store them in ChromaDB:

In [None]:
# Initialize embeddings
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)

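# Create vector store
# Note: re-running this cell adds the same chunks to ./chroma_db again;
# delete that directory first if you want a clean index.
print("Creating vector store... (this may take a moment)")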
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print("✅ Vector store created successfully!")

## 7. Test Retrieval

Check that a similarity search returns relevant chunks:

In [None]:
# Test similarity search
query = "What is RAG?"
relevant_docs = vectorstore.similarity_search(query, k=2)

print(f"Query: {query}\n")
print(f"Found {len(relevant_docs)} relevant documents:\n")

for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:")
    print(doc.page_content[:300])
    print("\n" + "="*80 + "\n")
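
Conceptually, similarity search ranks the stored chunk vectors by how close they are to the query vector. The sketch below illustrates the idea with cosine similarity over hand-made 3-dimensional toy vectors. Real embeddings have far more dimensions, and the distance metric the vector store actually uses depends on its configuration. The names `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k):
    """Return the k (name, score) pairs with the highest similarity to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings", one per sample document
toy_index = {
    "rag_systems.txt": [0.9, 0.1, 0.0],
    "nlp_guide.txt": [0.2, 0.9, 0.1],
    "ml_best_practices.txt": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]  # points mostly along the first axis

toy_results = top_k(toy_query, toy_index, k=2)
for name, score in toy_results:
    print(f"{name}: {score:.3f}")
```

The query vector points in nearly the same direction as the `rag_systems.txt` vector, so that entry ranks first.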

## 8. Create RAG Chain

Set up the Retrieval-Augmented Generation chain:

In [None]:
# Initialize LLM
llm = ChatOpenAI(model=LLM_MODEL, temperature=0.0)

# Create custom prompt template
prompt_template = """You are a helpful assistant that answers questions based on the provided context.
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on the context, say so - don't make up information.

Context:
{context}

Question: {question}

Answer: """

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": TOP_K_RESULTS}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

print("✅ RAG chain created successfully!")
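
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt. The sketch below imitates that step with plain string formatting so you can see roughly what the model receives. The blank-line separator between chunks, the shortened template, and the helper name `build_stuffed_prompt` are assumptions for illustration, not a copy of LangChain's internals.

```python
# Shortened, hypothetical version of the prompt template used above
DEMO_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer: "
)


def build_stuffed_prompt(retrieved_chunks, question):
    """Join all retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return DEMO_TEMPLATE.format(context=context, question=question)


demo_prompt = build_stuffed_prompt(
    [
        "Document Indexing: Store documents in a vector database.",
        "Retrieval: Find relevant documents based on query similarity.",
    ],
    "What is RAG?",
)
print(demo_prompt)
```

Because every chunk lands in one prompt, `TOP_K_RESULTS` and `CHUNK_SIZE` together determine how much context the model has to read.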

## 9. Ask Questions

Now you can ask questions based on your documents:

In [None]:
def ask_question(question: str):
    """Ask a question and display the answer with sources"""
    print(f"\n{'='*80}")
    print(f"❓ Question: {question}")
    print(f"{'='*80}\n")
    
    response = qa_chain.invoke({"query": question})
    
    print(f"💡 Answer:\n{response['result']}\n")
    
    if response.get('source_documents'):
        print(f"\n📄 Sources ({len(response['source_documents'])} documents):")
        for i, doc in enumerate(response['source_documents'], 1):
            source = doc.metadata.get('source', 'Unknown')
            print(f"\n[{i}] {source}")
            print(f"    {doc.page_content[:150]}...")
    
    print(f"\n{'='*80}")
    return response
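
When several retrieved chunks come from the same file, the source list repeats that path. Below is a minimal sketch of de-duplicating sources while preserving their order. It uses a hand-built dictionary shaped like the chain's response (`result` plus `source_documents`), so it runs without calling the API. The `FakeDoc` class and the `unique_sources` helper are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing the two attributes ask_question reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(response: dict) -> list:
    """Return source paths in first-seen order, without duplicates."""
    sources = [
        doc.metadata.get("source", "Unknown")
        for doc in response.get("source_documents", [])
    ]
    return list(dict.fromkeys(sources))  # dicts preserve insertion order


# Hand-built response for illustration only
demo_response = {
    "result": "RAG combines retrieval with generation.",
    "source_documents": [
        FakeDoc("RAG combines retrieval...", {"source": "sample_docs/rag_systems.txt"}),
        FakeDoc("Benefits: Reduced hallucinations...", {"source": "sample_docs/rag_systems.txt"}),
        FakeDoc("NLP focuses on...", {"source": "sample_docs/nlp_guide.txt"}),
        FakeDoc("Text without source metadata"),
    ],
}
print(unique_sources(demo_response))
```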

### Example Questions

Try these example questions:

In [None]:
# Example 1: About RAG
ask_question("What is Retrieval-Augmented Generation and what are its benefits?")

In [None]:
# Example 2: About Machine Learning
ask_question("What are some best practices for machine learning?")

In [None]:
# Example 3: About NLP
ask_question("What are transformers in NLP and why are they important?")

In [None]:
# Example 4: Explanatory question
ask_question("How does RAG help reduce hallucinations in AI systems?")

## 10. Interactive Question-Answering

Run this cell for interactive Q&A:

In [None]:
# Interactive mode
print("\n🤖 Interactive RAG Q&A")
print("Type 'exit', 'quit', or 'q' to stop\n")

while True:
    question = input("\n❓ Your question: ").strip()
    
    if question.lower() in ['exit', 'quit', 'q']:
        print("👋 Goodbye!")
        break
    
    if not question:
        continue
    
    try:
        ask_question(question)
    except Exception as e:
        print(f"❌ Error: {e}")

## 11. Summary

This notebook demonstrates a complete RAG system with:

✅ Document loading and processing  
✅ Vector embeddings with OpenAI  
✅ ChromaDB vector store  
✅ Semantic search and retrieval  
✅ LLM-based answer generation  
✅ Source attribution  

### Next Steps

To use this with your own documents:
1. Replace the sample documents with your own files
2. Adjust chunk size and overlap based on your documents
3. Experiment with different embedding models
4. Try different LLMs (GPT-4, etc.)
5. Tune the number of retrieved documents (TOP_K_RESULTS)
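
As a rough aid for steps 2 and 5 in the list above, the sketch below estimates how many chunks a document produces for a few chunk settings, and how many characters of context a given top-k value can pull into the prompt. It assumes fixed-size windows, so the numbers are approximations of what a separator-aware splitter produces. The helper name `estimate_chunk_count`, the 10,000-character document length, and the candidate settings are hypothetical.

```python
def estimate_chunk_count(doc_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate chunk count for fixed-size windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1

    step = chunk_size - chunk_overlap
    remaining = doc_length - chunk_size  # text left after the first window
    return 1 + -(-remaining // step)  # ceiling division without floats


doc_length = 10_000  # hypothetical document length in characters
top_k = 4            # same value as TOP_K_RESULTS above

estimates = {}
for size, overlap in [(500, 50), (1000, 200), (2000, 200)]:
    estimates[(size, overlap)] = estimate_chunk_count(doc_length, size, overlap)
    print(
        f"chunk_size={size}, overlap={overlap}: "
        f"~{estimates[(size, overlap)]} chunks, "
        f"up to {top_k * size} context characters per question"
    )
```

Smaller chunks mean more embeddings to compute and store but a tighter context per question. Larger chunks do the opposite.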

### Resources

- [LangChain Documentation](https://python.langchain.com/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [OpenAI API Documentation](https://platform.openai.com/docs/)