# 🧠 Simple RAG Pipeline Demo

This notebook walks through a Retrieval-Augmented Generation (RAG) pipeline step by step, using short, readable code.

## What is RAG?
RAG combines two steps:
- **Retrieval**: finding the document chunks most relevant to a question
- **Generation**: having an AI model write an answer based on those chunks

## Prerequisites
1. Make sure your `.env` file contains your `GEMINI_API_KEY`
2. Place your documents in the `./data/documents/` folder

## 🛠️ Setup - Import Libraries

In [1]:
import os
import sys
import warnings
from pathlib import Path

# Turn off anonymized telemetry and tokenizer parallelism, and hide warnings, for cleaner output
os.environ["ANONYMIZED_TELEMETRY"] = "False"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

# Add the project root (the current working directory) to the Python path
# so that the `src` and `config` packages can be imported
sys.path.append(str(Path.cwd()))

# Import our RAG system
from src.rag_pipeline import RAGPipeline
from config.settings import settings

print("✅ Setup complete!")
print(f"📁 Working from: {Path.cwd()}")

✅ Setup complete!
📁 Working from: /Users/adrianglazer/General Projects/Personal Knowledge Assistant


## 🔑 Check Configuration

In [2]:
# Check if everything is configured properly
try:
    settings.validate_required_keys()
    print("✅ API key is configured")
except ValueError as e:
    print(f"❌ {e}")
    print("💡 Please set up your .env file with GEMINI_API_KEY")

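# Check if we have documents
docs_folder = Path("data/documents")
if docs_folder.exists():
    # Note: "*.*" also matches dotfiles such as .gitkeep, so this count
    # can be higher than the number of documents that actually get ingested
    doc_files = list(docs_folder.glob("*.*"))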
    print(f"✅ Found {len(doc_files)} documents")
    for doc in doc_files[:5]:  # Show first 5
        print(f"   📄 {doc.name}")
else:
    print("❌ No documents folder found")

print(f"\n⚙️ Current settings:")
print(f"   🤖 AI Model: {settings.GEMINI_MODEL}")
print(f"   🧮 Embeddings: {settings.EMBEDDING_MODEL}")
print(f"   📊 Top results: {settings.TOP_K_RESULTS}")

✅ API key is configured
✅ Found 8 documents
   📄 CV - Adrian Putra Pratama Badjideh.pdf
   📄 sample_note.md
   📄 Proposal Pengembangan Sistem Aplikasi Marketplace Konsultasi Profesional (Revisi).pdf
   📄 .gitkeep
   📄 sample_python_notes.txt

⚙️ Current settings:
   🤖 AI Model: gemini-1.5-flash
   🧮 Embeddings: all-MiniLM-L6-v2
   📊 Top results: 5


## 🚀 Initialize RAG System

In [3]:
# Create our RAG system
print("🔄 Starting RAG system...")

try:
    rag = RAGPipeline()
    print("✅ RAG system ready!")
    
    # Check what documents we have in our knowledge base
    stats = rag.get_knowledge_base_stats()
    
    if stats['success']:
        print(f"\n📊 Knowledge Base:")
        print(f"   📄 {stats['total_files']} documents")
        print(f"   🧩 {stats['stats']['total_chunks']} text chunks")
        
        if stats['total_files'] == 0:
            print("\n🔄 No documents found. Let's add some...")
            result = rag.ingest_documents("data/documents")
            if result['success']:
                print(f"✅ Added {result['documents_processed']} documents!")
            else:
                print(f"❌ Failed to add documents: {result['error']}")
    else:
        print(f"❌ Error getting stats: {stats['error']}")
        
except Exception as e:
    print(f"❌ Failed to start RAG system: {e}")
    rag = None

🔄 Starting RAG system...


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event CollectionGetEvent: capture() takes 1 positional argument but 3 were given


✅ RAG system ready!

📊 Knowledge Base:
   📄 7 documents
   🧩 30 text chunks


## ❓ Question 1: What topics are covered in the documents?

In [4]:
if rag:
    question = "What are the main topics covered in these documents?"
    
    print(f"❓ Question: {question}")
    print("🔍 Searching documents...")
    
    try:
        result = rag.query(question)
        
        if result['success']:
            print(f"\n💡 Answer:")
            print(result['answer'])
            
            if result.get('sources'):
                print(f"\n📚 Sources used:")
                for source in result['sources']:
                    print(f"   📄 {source}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
else:
    print("❌ RAG system not available")

❓ Question: What are the main topics covered in these documents?
🔍 Searching documents...


Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given



💡 Answer:
The provided text is from the HBS MBA Application Guide 2025-2026.  It covers topics related to applying to Harvard Business School's MBA program.  These include:  introducing yourself (including prior application status and joint degree program information),  personal information,  your experience (resume, employment history, accomplishments, awards), your academics (transcripts, GRE/GMAT scores, English language tests), essays, recommendations, and fee waivers.


📚 Sources used:
   📄 HBS MBA Application Guide 2025-2026.pdf


## ❓ Question 2: What skills and technologies are mentioned?

In [5]:
if rag:
    question = "What skills, technologies, or programming languages are mentioned in the documents?"
    
    print(f"❓ Question: {question}")
    print("🔍 Searching documents...")
    
    try:
        result = rag.query(question)
        
        if result['success']:
            print(f"\n💡 Answer:")
            print(result['answer'])
            
            if result.get('sources'):
                print(f"\n📚 Sources used:")
                for source in result['sources']:
                    print(f"   📄 {source}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
else:
    print("❌ RAG system not available")

❓ Question: What skills, technologies, or programming languages are mentioned in the documents?
🔍 Searching documents...

💡 Answer:
Adrian's CV [Source: CV - Adrian Putra Pratama Badjideh.pdf] lists proficiency in Python and SQL.  He also mentions skills in exploratory data analysis and predictive modeling, as well as expertise in machine learning, predictive analytics, and AI engineering.  His projects utilized XGBoost and MySQL.


📚 Sources used:
   📄 CV - Adrian Putra Pratama Badjideh.pdf


## ❓ Question 3: What projects or work experience are described?

In [6]:
if rag:
    question = "What projects, work experience, or professional activities are described in these documents?"
    
    print(f"❓ Question: {question}")
    print("🔍 Searching documents...")
    
    try:
        result = rag.query(question)
        
        if result['success']:
            print(f"\n💡 Answer:")
            print(result['answer'])
            
            if result.get('sources'):
                print(f"\n📚 Sources used:")
                for source in result['sources']:
                    print(f"   📄 {source}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
else:
    print("❌ RAG system not available")

❓ Question: What projects, work experience, or professional activities are described in these documents?
🔍 Searching documents...

💡 Answer:
The provided text mentions a "Personal Knowledge Assistant Project" (sample_note.md),  detailing its features and technical specifications.  The second document (HBS MBA Application Guide 2025-2026.pdf) discusses the application requirements for Harvard Business School, including sections for resume/CV, employment history (with details on roles, responsibilities, accomplishments, and challenges), activities, awards, and recognition.  However, specific details about the nature of these experiences are not provided.


📚 Sources used:
   📄 HBS MBA Application Guide 2025-2026.pdf
   📄 sample_note.md


## ❓ Question 4: What educational background or qualifications are mentioned?

In [7]:
if rag:
    question = "What educational background, qualifications, or academic achievements are mentioned?"
    
    print(f"❓ Question: {question}")
    print("🔍 Searching documents...")
    
    try:
        result = rag.query(question)
        
        if result['success']:
            print(f"\n💡 Answer:")
            print(result['answer'])
            
            if result.get('sources'):
                print(f"\n📚 Sources used:")
                for source in result['sources']:
                    print(f"   📄 {source}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
else:
    print("❌ RAG system not available")

❓ Question: What educational background, qualifications, or academic achievements are mentioned?
🔍 Searching documents...

💡 Answer:
To apply to Harvard Business School, a bachelor's degree or equivalent is required ("HBS MBA Application Guide 2025-2026.pdf").  Applicants must also upload transcripts from all undergraduate and graduate institutions attended ("HBS MBA Application Guide 2025-2026.pdf").  A GRE or GMAT score is required before application submission ("HBS MBA Application Guide 2025-2026.pdf"), and an English language proficiency test (TOEFL, IELTS, PTE, or Duolingo) is needed if the applicant's undergraduate institution didn't use English as its sole language of instruction ("HBS MBA Application Guide 2025-2026.pdf").


📚 Sources used:
   📄 HBS MBA Application Guide 2025-2026.pdf


## ❓ Question 5: What specific applications or systems are discussed?

In [8]:
if rag:
    question = "What specific applications, systems, or software solutions are discussed in the documents?"
    
    print(f"❓ Question: {question}")
    print("🔍 Searching documents...")
    
    try:
        result = rag.query(question)
        
        if result['success']:
            print(f"\n💡 Answer:")
            print(result['answer'])
            
            if result.get('sources'):
                print(f"\n📚 Sources used:")
                for source in result['sources']:
                    print(f"   📄 {source}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
else:
    print("❌ RAG system not available")

❓ Question: What specific applications, systems, or software solutions are discussed in the documents?
🔍 Searching documents...

💡 Answer:
The provided text mentions several applications, systems, and software solutions:

* **ChromaDB:** Used for efficient similarity search with sentence transformers (sample_note.md).
* **Google's Gemini AI:**  Leveraged for natural language understanding and generation (sample_note.md).
* **Streamlit:** Used for the chat interface (sample_note.md).
* **NumPy, Pandas, Requests, Flask, Django, Matplotlib, Scikit-learn:** Common Python libraries for various tasks including numerical computing, data manipulation, HTTP requests, web frameworks, data visualization, and machine learning (sample_python_notes.txt).



📚 Sources used:
   📄 sample_python_notes.txt
   📄 sample_note.md


## 🧪 Try Your Own Question

In [9]:
if rag:
    # Change this to ask your own question!
    your_question = "Tell me about data science experience"
    
    print(f"❓ Your Question: {your_question}")
    print("🔍 Searching documents...")
    
    try:
        result = rag.query(your_question)
        
        if result['success']:
            print(f"\n💡 Answer:")
            print(result['answer'])
            
            if result.get('sources'):
                print(f"\n📚 Sources used:")
                for source in result['sources']:
                    print(f"   📄 {source}")
        else:
            print(f"❌ Error: {result['error']}")
            
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
else:
    print("❌ RAG system not available")
    
print("\n💡 Tips for better questions:")
print("   • Be specific: 'What Python libraries are mentioned?'")
print("   • Ask about content: 'What projects involve machine learning?'")
print("   • Request summaries: 'Summarize the key achievements'")

❓ Your Question: Tell me about data science experience
🔍 Searching documents...

💡 Answer:
Adrian Putra Pratama Badjideh's CV (CV - Adrian Putra Pratama Badjideh.pdf) highlights his experience in data science through several projects and roles:

* **ATM Predictive Analytics for Future Maintenance Using XGBoost (PT. Datindo Infonet Prima):**  He built a predictive model using XGBoost, achieving 83% accuracy in forecasting ATM maintenance needs. This resulted in the potential for a 15% reduction in unplanned maintenance costs (CV - Adrian Putra Pratama Badjideh.pdf).

* **Garuda Indonesia Data Warehouse Project:** He improved reporting capability by an estimated 60% by developing a star schema and building a full ETL pipeline in MySQL (CV - Adrian Putra Pratama Badjideh.pdf).

* **Siemens Healthineers Database Specialist:** He built dashboards to improve task distribution efficiency by 25% and reduce average downtime by 12% through better monitoring (CV - Adrian Putra Pratama Badjideh.pd
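
The question cells above all repeat the same query-and-print logic. A minimal sketch of a helper that wraps it is shown here. The name `ask` and the `FakePipeline` stand-in are hypothetical and only exist so the example runs on its own; the result keys (`success`, `answer`, `sources`, `error`) are the ones used in the cells above.

```python
def ask(rag, question):
    """Run one question through the pipeline and print the answer and sources."""
    if not rag:
        print("❌ RAG system not available")
        return None

    print(f"❓ Question: {question}")
    print("🔍 Searching documents...")

    try:
        result = rag.query(question)
    except Exception as e:
        print(f"❌ Failed to answer: {e}")
        return None

    if result['success']:
        print("\n💡 Answer:")
        print(result['answer'])
        if result.get('sources'):
            print("\n📚 Sources used:")
            for source in result['sources']:
                print(f"   📄 {source}")
    else:
        print(f"❌ Error: {result['error']}")
    return result


class FakePipeline:
    """Hypothetical stand-in for RAGPipeline, used only to make this sketch runnable."""

    def query(self, question):
        return {
            "success": True,
            "answer": f"(stub answer to: {question})",
            "sources": ["sample_note.md"],
        }


result = ask(FakePipeline(), "What Python libraries are mentioned?")
```

With the real pipeline, each question cell would shrink to a single call such as `ask(rag, "What projects involve machine learning?")`.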

## 📋 How RAG Works - Simple Explanation

Here's what happened behind the scenes:

### 1. **📄 Document Processing**
- Your documents were split into smaller chunks
- Each chunk contains about 1000 characters, with a 200-character overlap between consecutive chunks
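
As a rough illustration of the chunking step, here is a minimal sketch of a fixed-size character splitter with overlap. The function name `chunk_text` is an assumption made for this example; the splitter actually used inside `RAGPipeline` is not shown in this notebook and may behave differently (for instance, by splitting on sentence or paragraph boundaries).

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])  # [1000, 1000, 900]
```

The overlap means the last 200 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.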

### 2. **🧮 Creating Embeddings**  
- Each text chunk was converted into a 384-dimensional vector by the embedding model (`all-MiniLM-L6-v2`)
- Texts with similar meaning get similar vectors (they lie closer together in vector "space")
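
To make "closer in space" concrete, the sketch below computes cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are made-up toy values, not real embeddings (those would have 384 dimensions), and whether this pipeline uses cosine or another distance is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
python_notes = [0.9, 0.1, 0.0]
pandas_notes = [0.8, 0.2, 0.1]
mba_guide = [0.0, 0.2, 0.9]

related = cosine_similarity(python_notes, pandas_notes)
unrelated = cosine_similarity(python_notes, mba_guide)
print(related > unrelated)  # True
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they are unrelated.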

### 3. **🗃️ Vector Storage**
- All vectors were stored in a ChromaDB database
- This allows fast similarity search

### 4. **🔍 Question Processing**
- Your question was also converted to a vector
- The system found the document chunks whose vectors are most similar to it
- The top 5 most relevant chunks were selected (the `TOP_K_RESULTS` setting)
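
The retrieval step can be sketched as a brute-force top-k search over stored vectors. This is a simplified in-memory stand-in, not ChromaDB's API: the `Chunk` type, the `retrieve` function, and all texts and vectors are hypothetical toy data, and a real vector database would typically use an index instead of scoring every chunk.

```python
import heapq
import math
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    source: str
    vector: tuple


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def retrieve(query_vector, chunks, k=5):
    """Return up to k chunks most similar to the query, best first."""
    return heapq.nlargest(k, chunks, key=lambda c: cosine(query_vector, c.vector))


# Toy store: texts and vectors are invented for illustration
store = [
    Chunk("Notes on Python libraries", "sample_python_notes.txt", (0.9, 0.1, 0.0)),
    Chunk("Notes on the assistant project", "sample_note.md", (0.6, 0.6, 0.1)),
    Chunk("Notes on application essays", "HBS MBA Application Guide 2025-2026.pdf", (0.0, 0.1, 0.9)),
]
query = (1.0, 0.2, 0.0)  # stand-in for an embedded question about Python

for chunk in retrieve(query, store, k=2):
    print(chunk.source)
```

With `k=2`, only the two closest chunks are returned, which is why an answer may cite some documents and not others.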

### 5. **🤖 Answer Generation**
- The relevant chunks were sent to Gemini AI together with your question
- The AI generated an answer based on that context
- The source files were tracked and returned with the answer
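
The generation step can be pictured as assembling one prompt from the retrieved chunks and the question. The sketch below shows one plausible way to do that; the prompt wording, the `[Source: ...]` labels, and the function names `build_prompt` and `unique_sources` are assumptions for illustration, not the prompt that `RAGPipeline` actually sends to Gemini.

```python
def build_prompt(question, chunks):
    """Combine retrieved (source, text) pairs and the question into one prompt."""
    context = "\n\n".join(f"[Source: {source}]\n{text}" for source, text in chunks)
    return (
        "Answer the question using only the context below. "
        "Mention the source file for each fact.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def unique_sources(chunks):
    """List each source file once, in the order it was first retrieved."""
    return list(dict.fromkeys(source for source, _ in chunks))


# Toy retrieved chunks, invented for illustration
retrieved = [
    ("sample_note.md", "The project stores embeddings in a vector database."),
    ("sample_python_notes.txt", "Pandas is used for data manipulation."),
    ("sample_note.md", "A chat interface is provided."),
]

print(build_prompt("Which tools are mentioned?", retrieved))
print(unique_sources(retrieved))  # ['sample_note.md', 'sample_python_notes.txt']
```

Keeping the source label next to each chunk is what lets the model cite files in its answer, and deduplicating the labels gives a "Sources used" list like the ones printed above.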

## 🚀 Next Steps

- Add more documents to the `data/documents/` folder
- Try the web interface: `python run.py`
- Experiment with settings in `config/settings.py`
- Check out the full demo: `RAG_Pipeline_Demo.ipynb`