# AI Knowledge Helper - RAG System Implementation

This notebook provides a step-by-step implementation of a Retrieval-Augmented Generation (RAG) question-answering system.

## Overview
1. **Data Ingestion**: Extract text from PDF files
2. **Preprocessing**: Clean and chunk documents
3. **Embeddings**: Generate semantic embeddings
4. **Vector Database**: Store and search embeddings
5. **RAG Pipeline**: Retrieve + LLM reasoning
6. **Evaluation**: Assess retrieval quality and answer relevance
7. **ML Components**: Summarization and question classification


In [None]:
# Install required packages (run once)
# !pip install -r requirements.txt

# Import libraries
import os
import sys
from pathlib import Path

# Add current directory to path
sys.path.append('.')

# Import project modules
from data_ingestion import ingest_pdfs
from preprocessing import preprocess_text
from embeddings import create_embeddings, create_query_embedding
from retrieval import setup_vector_db, load_vector_db, search_similar
from qa_pipeline import RAGPipeline
from evaluation import comprehensive_evaluation
from ml_components import summarize_chunks, classify_question_type
import config

print("All modules imported successfully!")
print(f"Embedding model: {config.EMBEDDING_MODEL}")
print(f"Chunk size: {config.CHUNK_SIZE} tokens")
print(f"Top K retrieval: {config.TOP_K}")


## Part 1: Data Ingestion

First, we'll extract text from PDF files in the `data/` directory.
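
The directory scan behind this step can be sketched as follows. This is a minimal sketch and not the implementation of `ingest_pdfs`: the function name `collect_pdf_texts` and the `extract_text` callable are hypothetical, and the callable stands in for whichever PDF library the project actually uses.

```python
from pathlib import Path
from typing import Callable, List


def collect_pdf_texts(data_dir: str, extract_text: Callable[[Path], str]) -> List[str]:
    """Return the non-empty text of every *.pdf file in data_dir, in sorted order.

    extract_text is a hypothetical callable that turns one PDF path into a string;
    in practice it would wrap the project's PDF parsing library.
    """
    texts = []
    for pdf_path in sorted(Path(data_dir).glob("*.pdf")):
        text = extract_text(pdf_path).strip()
        if text:  # skip files that yield no extractable text (e.g. scanned images)
            texts.append(text)
    return texts
```

Passing the extractor as a parameter keeps the scan logic testable without a PDF dependency; files that produce only whitespace are dropped so later steps never receive empty documents.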


In [None]:
# Step 1: Ingest PDF files
data_dir = "data"

# Create data directory if it doesn't exist
Path(data_dir).mkdir(exist_ok=True)

# Ingest PDFs
texts = ingest_pdfs(data_dir)

print(f"\nIngested {len(texts)} PDF documents")

# Display sample text from first document
if texts:
    print(f"\nFirst document preview ({len(texts[0])} characters):")
    print(texts[0][:500] + "...")
else:
    print("\n⚠️ No PDF files found in the data directory.")
    print("Please add PDF files to the 'data/' directory and re-run this cell.")


## Part 2: Text Preprocessing

Clean the extracted text and split it into overlapping chunks, sized by `config.CHUNK_SIZE` and `config.CHUNK_OVERLAP`.
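
A sliding window is one common way to produce overlapping chunks. The sketch below is an illustration only, not the code inside `preprocess_text`: `chunk_tokens` is a hypothetical name, and whitespace-separated words stand in for the tokens that the project's `count_tokens` helper may count differently.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size tokens that share chunk_overlap tokens.

    Assumption: tokens are whitespace-separated words, which only approximates
    a model tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

For example, ten tokens with `chunk_size=4` and `chunk_overlap=1` give three chunks, each starting with the last token of the previous one. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.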


In [None]:
# Step 2: Preprocess and chunk text
if texts:
    chunks = preprocess_text(
        texts,
        chunk_size=config.CHUNK_SIZE,
        chunk_overlap=config.CHUNK_OVERLAP
    )
    
    print(f"Created {len(chunks)} text chunks")

    if chunks:
        print("\nSample chunk (first 200 characters):")
        print(chunks[0][:200] + "...")

        # Display chunk statistics for (at most) the first 10 chunks
        from preprocessing import count_tokens
        chunk_sizes = [count_tokens(chunk) for chunk in chunks[:10]]
        print(f"\nAverage chunk size (first {len(chunk_sizes)}): {sum(chunk_sizes) / len(chunk_sizes):.1f} tokens")
    else:
        # Guard against dividing by zero when preprocessing yields no chunks
        print("No chunks created")
else:
    print("No texts to preprocess. Please run the data ingestion step first.")
    chunks = []


## Part 3: Generate Embeddings

Create semantic embeddings with SentenceTransformers, using the model named in `config.EMBEDDING_MODEL`.
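
Embedding vectors are often scaled to unit length so that a plain dot product equals cosine similarity. Whether `create_embeddings` normalizes its output is not shown in this notebook, so treat the following as a sketch of the idea; `l2_normalize` and `dot` are hypothetical helpers written in plain Python for clarity.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (all-zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return [0.0 for _ in vector]
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

If the printed embedding statistics show very different magnitudes across chunks, normalizing before indexing is one way to make similarity scores comparable.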


In [None]:
# Step 3: Create embeddings
if chunks:
    print("Generating embeddings (this may take a few minutes)...")
    embeddings, chunks = create_embeddings(chunks)
    
    print(f"\nEmbeddings shape: {embeddings.shape}")
    print(f"Embedding dimension: {embeddings.shape[1]}")
    print(f"Number of chunks: {len(chunks)}")
    
    # Display embedding statistics
    print(f"\nEmbedding statistics:")
    print(f"  Mean: {embeddings.mean():.4f}")
    print(f"  Std: {embeddings.std():.4f}")
    print(f"  Min: {embeddings.min():.4f}")
    print(f"  Max: {embeddings.max():.4f}")
else:
    print("No chunks to embed. Please run preprocessing step first.")
    embeddings = None


## Part 4: Vector Database Setup

Store embeddings in ChromaDB for efficient similarity search.
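
Conceptually, similarity search ranks stored vectors by their similarity to the query vector and returns the best matches. The sketch below does this with an exhaustive scan in plain Python. It is not how `search_similar` or ChromaDB are implemented (a vector database typically uses an index rather than scanning every vector); the names `cosine_similarity` and `top_k_similar` are hypothetical. It returns `(document, score)` pairs, the same shape the retrieval test in this notebook iterates over.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    documents: Sequence[str],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (document, score) pairs, highest similarity first."""
    scored = (
        (doc, cosine_similarity(query, vec)) for doc, vec in zip(documents, vectors)
    )
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
```

An exhaustive scan costs one similarity computation per stored chunk, which is fine for a handful of PDFs and is the main reason to move to an indexed store as the collection grows.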


In [None]:
# Step 4: Setup vector database
if embeddings is not None and chunks:
    print("Setting up vector database...")
    collection = setup_vector_db(embeddings, chunks)
    
    print(f"✅ Vector database created successfully!")
    print(f"   Location: {config.VECTOR_DB_PATH}")
    print(f"   Collection: {config.COLLECTION_NAME}")
    print(f"   Documents stored: {len(chunks)}")
    
    # Test retrieval
    print("\nTesting retrieval with sample query...")
    test_query = "What is the main topic?"
    query_emb = create_query_embedding(test_query)
    results = search_similar(query_emb, top_k=2)
    
    print(f"\nQuery: '{test_query}'")
    print(f"Retrieved {len(results)} documents:")
    for i, (doc, score) in enumerate(results, 1):
        print(f"\n  {i}. Similarity: {score:.3f}")
        print(f"     Text: {doc[:150]}...")
else:
    print("No embeddings available. Please run previous steps first.")


## Part 5: RAG Pipeline - Question Answering

Use the complete RAG pipeline to answer questions.
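
The step that joins retrieval to generation is prompt assembly: the retrieved chunks are placed in the prompt as context for the question. The following is a minimal sketch under assumptions; `build_prompt`, its wording, and the `max_chunks` parameter are hypothetical and do not reflect the prompt used inside `RAGPipeline`.

```python
from typing import List


def build_prompt(question: str, retrieved_chunks: List[str], max_chunks: int = 3) -> str:
    """Combine numbered context passages and the question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(retrieved_chunks[:max_chunks], 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the passages makes it possible to ask the model to cite its sources, and capping the number of chunks keeps the prompt within the model's context window.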


In [None]:
# Step 5: Initialize RAG pipeline
rag = RAGPipeline()

# Check if vector DB is loaded
if not rag.vector_db_loaded:
    print("Vector database not found. Processing documents...")
    rag.process_documents(data_dir)
else:
    print("✅ Vector database loaded successfully!")


In [None]:
# Ask a question
question = "What is the main topic of the document?"

print(f"Question: {question}\n")
print("Processing...\n")

result = rag.ask(question, use_summarization=False)

print("=" * 60)
print("ANSWER:")
print("=" * 60)
print(result["answer"])
print("\n" + "=" * 60)
print("METADATA:")
print("=" * 60)
print(f"Confidence: {result['confidence']:.3f}")
print(f"Number of sources: {result['num_sources']}")
print(f"\nTop sources:")
for i, source in enumerate(result['sources'][:3], 1):
    print(f"\n  {i}. {source[:200]}...")


## Part 6: Evaluation

Evaluate retrieval quality and answer relevance.
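
One simple proxy for retrieval coverage is the fraction of query terms that appear anywhere in the retrieved text. The sketch below illustrates that idea only; it is an assumption for illustration and not the metric computed by `comprehensive_evaluation`, and `query_coverage` is a hypothetical name.

```python
import re
from typing import List, Set


def _terms(text: str) -> Set[str]:
    """Lowercased alphanumeric terms found in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_coverage(query: str, retrieved_texts: List[str]) -> float:
    """Fraction of distinct query terms that occur in at least one retrieved text."""
    query_terms = _terms(query)
    if not query_terms:
        return 0.0
    retrieved_terms: Set[str] = set()
    for text in retrieved_texts:
        retrieved_terms |= _terms(text)
    return len(query_terms & retrieved_terms) / len(query_terms)
```

Lexical overlap misses paraphrases that embedding-based scores would catch, so a metric like this is best read alongside similarity scores rather than in place of them.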


In [None]:
# Evaluate the RAG system
if result.get("sources"):
    evaluation = comprehensive_evaluation(
        query=question,
        retrieved_texts=result["sources"],
        answer=result["answer"]
    )
    
    print("=" * 60)
    print("EVALUATION RESULTS")
    print("=" * 60)
    print(f"\nQuery: {evaluation['query']}")
    print(f"\nRetrieval Evaluation:")
    print(f"  Relevance Score: {evaluation['retrieval_evaluation']['relevance_score']:.3f}")
    print(f"  Coverage: {evaluation['retrieval_evaluation']['coverage']:.3f}")
    print(f"  Is Relevant: {evaluation['retrieval_evaluation']['is_relevant']}")
    print(f"  Assessment: {evaluation['retrieval_evaluation']['assessment']}")
    
    print(f"\nAnswer Evaluation:")
    print(f"  Has Answer: {evaluation['answer_evaluation']['has_answer']}")
    print(f"  Context Usage: {evaluation['answer_evaluation']['context_usage']:.3f}")
    print(f"  Length Score: {evaluation['answer_evaluation']['length_score']:.3f}")
    print(f"  Quality: {evaluation['answer_evaluation']['overall_quality']}")
    
    print(f"\nOverall Score: {evaluation['overall_score']:.3f}")
    print(f"Recommendation: {evaluation['recommendation']}")
else:
    print("No sources available for evaluation.")


## Part 7: ML Components

### 7.1 Document Summarization
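
To make the idea of condensing retrieved chunks concrete, here is a small extractive summarizer that keeps the sentences whose words are most frequent across the chunks. It is a sketch under assumptions: `summarize_chunks` may well use a different (for example, model-based) approach, and `extractive_summary` is a hypothetical name.

```python
import re
from collections import Counter
from typing import List


def extractive_summary(chunks: List[str], max_sentences: int = 2) -> str:
    """Keep the max_sentences highest-scoring sentences, in their original order."""
    text = " ".join(chunks)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def score(sentence: str) -> float:
        # Average corpus frequency of the sentence's words
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)
```

Shortening the context this way reduces prompt length at the cost of possibly dropping the one sentence that answers the question, which is why the pipeline exposes summarization as an option rather than a default.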


In [None]:
# Test summarization
if result.get("sources"):
    print("Original retrieved chunks:")
    for i, source in enumerate(result["sources"][:2], 1):
        print(f"\n{i}. {source[:200]}...")
    
    print("\n" + "=" * 60)
    print("SUMMARIZED VERSION:")
    print("=" * 60)
    summary = summarize_chunks(result["sources"])
    print(summary)
    
    # Use summarization in RAG
    print("\n" + "=" * 60)
    print("RAG WITH SUMMARIZATION:")
    print("=" * 60)
    result_with_summary = rag.ask(question, use_summarization=True)
    print(result_with_summary["answer"])
else:
    print("No sources available for summarization.")


### 7.2 Question Classification
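
Question types can be approximated with simple keyword rules. The sketch below is illustrative only: `classify_by_keyword` and its label names are hypothetical and are not the categories returned by `classify_question_type`.

```python
def classify_by_keyword(question: str) -> str:
    """Assign a coarse question type from the question's wording."""
    q = question.strip().lower()
    if q.startswith(("compare", "contrast")) or " versus " in q or " vs " in q:
        return "comparison"
    if q.startswith(("what is", "what are", "define")):
        return "definition"
    if q.startswith(("how", "why")):
        return "explanatory"
    if q.startswith(("when", "where", "who")):
        return "factual"
    if q.startswith(("is ", "are ", "does ", "do ", "can ")):
        return "yes_no"
    return "other"
```

A classifier like this can steer the pipeline, for instance by retrieving more chunks for comparison questions than for yes/no questions.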


In [None]:
# Test question classification
test_questions = [
    "What is artificial intelligence?",
    "How does machine learning work?",
    "Compare AI and traditional programming.",
    "Is AI better than human intelligence?",
    "When was the first computer invented?"
]

print("Question Classification Results:")
print("=" * 60)
for q in test_questions:
    q_type = classify_question_type(q)
    print(f"\nQuestion: {q}")
    print(f"Type: {q_type}")


## Part 8: Interactive Question-Answering

Edit the `questions` list in the cell below to ask your own questions.


In [None]:
# Interactive Q&A
def ask_question_interactive(question: str, use_summary: bool = False):
    """Helper function for interactive questioning."""
    print(f"\n{'='*60}")
    print(f"QUESTION: {question}")
    print(f"{'='*60}\n")
    
    result = rag.ask(question, use_summarization=use_summary)
    
    print("ANSWER:")
    print("-" * 60)
    print(result["answer"])
    print("\n" + "-" * 60)
    print(f"Confidence: {result['confidence']:.3f}")
    print(f"Sources: {result['num_sources']}")
    
    return result

# Example questions - modify these or add your own!
questions = [
    "What are the key concepts discussed?",
    "Can you summarize the main points?",
    "What is the conclusion?"
]

for q in questions:
    ask_question_interactive(q)
    print("\n")


## Summary

This notebook demonstrated a complete RAG pipeline:

1. ✅ **Data Ingestion**: Extracted text from PDFs
2. ✅ **Preprocessing**: Cleaned and chunked documents
3. ✅ **Embeddings**: Generated semantic embeddings
4. ✅ **Vector Database**: Stored embeddings in ChromaDB
5. ✅ **RAG Pipeline**: Implemented retrieval + LLM reasoning
6. ✅ **Evaluation**: Assessed retrieval quality and answer relevance
7. ✅ **ML Components**: Summarization and question classification

### Key Takeaways:
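- RAG retrieves the chunks most similar to the question and passes them to the LLM as context for its answer
- Chunk size and overlap (`config.CHUNK_SIZE`, `config.CHUNK_OVERLAP`) shape what each retrieved passage contains, and therefore retrieval quality
- The embedding model (`config.EMBEDDING_MODEL`) determines which passages count as similar to a query
- The retrieval and answer scores from evaluation show where the system needs tuning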

### Next Steps:
- Experiment with different chunk sizes
- Try different embedding models
- Add more documents to improve coverage
- Tune retrieval parameters such as `config.TOP_K`
