<a href="https://colab.research.google.com/github/Markboon123/machine-learning-with-python-logistic-regression-3211129/blob/main/MLB_Project3_RAG_Chatbot_FIXED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLB Project 3 - RAG Chatbot

## Project Overview

Welcome to Project 3! In this project, you'll build a **Retrieval-Augmented Generation (RAG)** chatbot that can answer questions about a PDF document.

### What is RAG?
RAG combines two powerful concepts:
1. **Retrieval**: Finding relevant information from a document
2. **Generation**: Using an LLM to generate natural language answers

### What You'll Learn
- How to extract and process text from PDFs
- How to create text embeddings (vector representations)
- How to build a simple vector database
- How to search for relevant information using similarity
- How to use an LLM to generate answers based on context

### Project Structure
1. Setup and imports
2. Build a Vector Database
3. PDF processing utilities
4. Question answering system
5. Put it all together!

---

## Step 1: Setup and Imports

First, let's install the required libraries and import them.

In [1]:
# Install required packages (run this cell first!)
!pip install sentence-transformers pypdf transformers huggingface_hub torch tqdm -q

[?25l   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/328.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[90m‚ï∫[0m[90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m174.1/328.9 kB[0m [31m4.6 MB/s[0m eta [36m0:00:01[0m[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m328.9/328.9 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
# Import all necessary libraries
import os
import numpy as np
import warnings
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader
from transformers import pipeline
from huggingface_hub import login

# Suppress unnecessary warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)

print("‚úÖ All libraries imported successfully!")

‚úÖ All libraries imported successfully!


## Step 2: Configuration

Let's set up our model names and API keys.

In [3]:
# Configuration settings
LLM_MODEL = "google/flan-t5-base"  # The language model for generating answers
EMBEDDING_MODEL = "all-MiniLM-L12-v2"  # The model for creating embeddings
HF_API_KEY = os.getenv("HF_API_KEY", "YOUR-HF-API-KEY-HERE")  # Optional Hugging Face API key

print(f"üìã LLM Model: {LLM_MODEL}")
print(f"üìã Embedding Model: {EMBEDDING_MODEL}")

üìã LLM Model: google/flan-t5-base
üìã Embedding Model: all-MiniLM-L12-v2


## Step 3: Build the Vector Database Class

A **Vector Database** stores text as numerical vectors (embeddings) and allows us to search for similar text using mathematical operations.

### What are Embeddings?
Embeddings are numerical representations of text that capture semantic meaning. Similar texts have similar embeddings.

Example:
- "dog" and "puppy" would have similar embeddings
- "dog" and "car" would have very different embeddings

### 3.1: Initialize the Vector Database

**TODO**: Complete the `__init__` method to:
1. Load the SentenceTransformer model
2. Get the embedding dimension
3. Initialize empty storage for embeddings and text

In [4]:
class VectorDB:
    def __init__(self, model_name: str):
        """
        Initialize the Vector Database with an embedding model.

        Args:
            model_name: Name of the SentenceTransformer model to use
        """
        # TODO: Load the SentenceTransformer embedding model
        # Hint: self.embedModel = SentenceTransformer(model_name)
        self.embedModel = SentenceTransformer(model_name)

        # TODO: Get the embedding dimension from the model
        # Hint: Use get_sentence_embedding_dimension()
        self.embed_size = self.embedModel.get_sentence_embedding_dimension()

        # TODO: Initialize an empty NumPy array for embeddings
        # Hint: Start with np.empty((0, self.embed_size))
        self._embeddings = np.empty((0, self.embed_size))

        # TODO: Initialize an empty list for storing the original text strings
        self._strings = []

        print(f"‚úÖ VectorDB initialized with embedding dimension: {self.embed_size}")

### 3.2: Add Data to the Database

**TODO**: Complete the `addToDatabase` method to:
1. Convert text strings to embeddings
2. Store the embeddings in the database
3. Store the original text strings

In [5]:
    def addToDatabase(self, input: list[str]):
        """
        Add text chunks to the vector database.

        Args:
            input: List of text strings to add to the database
        """
        # TODO: Convert the input strings to embeddings
        # Hint: Use self.embedModel.encode(input) to get embeddings
        new_embeddings = self.embedModel.encode(input)

        # TODO: Stack the new embeddings with existing ones
        # Hint: Use np.vstack() to append vertically
        # Handle the case where _embeddings is empty (first addition)
        if self._embeddings.shape[0] == 0:
            self._embeddings = new_embeddings
        else:
            self._embeddings = np.vstack([self._embeddings, new_embeddings])

        # TODO: Extend the _strings list with the new input strings
        # Hint: Use list.extend()
        self._strings.extend(input)

        print(f"‚úÖ Added {len(input)} chunks. Total chunks: {len(self._strings)}")

# Add this method to the VectorDB class
VectorDB.addToDatabase = addToDatabase

### 3.3: Clear the Database

**TODO**: Complete the `clearDatabase` method to reset the database.

In [6]:
    def clearDatabase(self):
        """
        Clear all data from the vector database.
        """
        # TODO: Reset _embeddings to an empty array with the correct shape
        # Hint: Use np.empty((0, self.embed_size))
        self._embeddings = np.empty((0, self.embed_size))

        # TODO: Reset _strings to an empty list
        self._strings = []

        print("üóëÔ∏è Database cleared!")

# Add this method to the VectorDB class
VectorDB.clearDatabase = clearDatabase

### 3.4: Calculate Euclidean Similarity

**TODO**: Implement the similarity function.

**What is Euclidean Distance?**
It's the straight-line distance between two points in space. We convert it to similarity:
- Distance = 0 ‚Üí Similarity = 1 (identical)
- Distance = large ‚Üí Similarity = close to 0 (very different)

In [7]:
    def euclideanSim(self, x, y):
        """
        Calculate Euclidean similarity between two vectors.

        Args:
            x: First vector (numpy array)
            y: Second vector (numpy array)

        Returns:
            Similarity score (higher = more similar)
        """
        # TODO: Calculate Euclidean distance using np.linalg.norm()
        # Hint: distance = np.linalg.norm(x - y)
        distance = np.linalg.norm(x - y)

        # TODO: Convert distance to similarity
        # Hint: similarity = 1 / (1 + distance)
        similarity = 1 / (1 + distance)

        return similarity

# Add this method to the VectorDB class
VectorDB.euclideanSim = euclideanSim

### 3.5: Search the Database

**TODO**: Implement the search functionality to find the most similar text chunks.

In [8]:
    def search(self, query: str, n_return=3):
        """
        Search the database for the most similar chunks to the query.

        Args:
            query: The search query string
            n_return: Number of top results to return

        Returns:
            Tuple of (text_chunks, similarity_scores)
        """
        # TODO: Generate an embedding for the query
        # Hint: Use self.embedModel.encode([query])[0] to get a single embedding
        query_embedding = self.embedModel.encode([query])[0]

        # TODO: Calculate similarity between query and all stored embeddings
        # Hint: Use a list comprehension with self.euclideanSim()
        similarities = [self.euclideanSim(query_embedding, emb) for emb in self._embeddings]
        # Example: [self.euclideanSim(query_embedding, emb) for emb in self._embeddings]

        # TODO: Find indices of the top n_return most similar results
        # Hint: Use np.argsort() and reverse the order with [::-1]
        top_indices = np.argsort(similarities)[::-1][:n_return]

        # TODO: Get the corresponding text chunks and scores
        top_chunks = [self._strings[i] for i in top_indices]
        top_scores = [similarities[i] for i in top_indices]

        return top_chunks, top_scores

# Add this method to the VectorDB class
VectorDB.search = search

### Test the Vector Database

Let's test our VectorDB with some sample data!

In [9]:
# Test the VectorDB
print("Testing VectorDB...\n")

# Create a test database
test_vdb = VectorDB(EMBEDDING_MODEL)

# Add some test data
test_data = [
    "Python is a programming language.",
    "Machine learning involves training models on data.",
    "Dogs are loyal pets.",
    "Neural networks are inspired by the human brain."
]

test_vdb.addToDatabase(test_data)

# Search for something
query = "What is ML?"
results, scores = test_vdb.search(query, n_return=2)

print(f"\nüîç Query: '{query}'\n")
for i, (chunk, score) in enumerate(zip(results, scores), 1):
    print(f"Result {i} (similarity: {score:.4f}):")
    print(f"  {chunk}\n")

Testing VectorDB...



modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/352 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

‚úÖ VectorDB initialized with embedding dimension: 384
‚úÖ Added 4 chunks. Total chunks: 4

üîç Query: 'What is ML?'

Result 1 (similarity: 0.4539):
  Machine learning involves training models on data.

Result 2 (similarity: 0.4379):
  Neural networks are inspired by the human brain.



## Step 4: PDF Processing Utilities

Now we'll build functions to extract and process text from PDF files.

### 4.1: Clean Text Function

This function removes extra whitespace and newlines from text.

In [10]:
def clean_text(text: str) -> str:
    """
    Clean text by removing extra whitespace and newlines.

    Args:
        text: Raw text string

    Returns:
        Cleaned text string
    """
    # Split text into words and join with single spaces
    return " ".join(text.split())

# Test the function
test_text = "This   has    extra\n\nspaces   and\nnewlines."
print(f"Original: {repr(test_text)}")
print(f"Cleaned:  {repr(clean_text(test_text))}")

Original: 'This   has    extra\n\nspaces   and\nnewlines.'
Cleaned:  'This has extra spaces and newlines.'


### 4.2: Create Text Chunks

**TODO**: Split long text into overlapping chunks.

**Why Overlapping Chunks?**
- Ensures we don't split important information across boundaries
- Example: chunk_size=500, overlap=50 means each chunk shares 50 characters with the next

In [11]:
def chunksFromText(text: str, chunk_size=500, overlap=50):
    """
    Split text into overlapping chunks.

    Args:
        text: Input text string
        chunk_size: Size of each chunk in characters
        overlap: Number of overlapping characters between chunks

    Returns:
        List of text chunks
    """
    chunks = []

    # TODO: Calculate the step size (how much to move forward each time)
    # Hint: step = chunk_size - overlap
    step = chunk_size - overlap

    # TODO: Loop through the text, creating chunks
    # Hint: Use range(start, stop, step) where start=0, stop=len(text), step=calculated above
    # For each position i, create a chunk from text[i:i+chunk_size]
    for i in range(0, len(text), step):
        chunk = text[i:i+chunk_size]
        if chunk:
            chunks.append(chunk)

    return chunks

# Test the function
test_text = "A" * 100  # 100 characters
test_chunks = chunksFromText(test_text, chunk_size=30, overlap=5)
print(f"Text length: {len(test_text)}")
print(f"Number of chunks: {len(test_chunks)}")
print(f"First chunk length: {len(test_chunks[0]) if test_chunks else 0}")

Text length: 100
Number of chunks: 4
First chunk length: 30


### 4.3: Process PDF and Add to Database

**TODO**: Read a PDF, extract text, create chunks, and add to the database.

In [12]:
def chunksFromPDF(vDB, path: str, startPage=0, endPage=None):
    """
    Extract text from a PDF, chunk it, and add to the vector database.

    Args:
        vDB: VectorDB instance to add chunks to
        path: Path to the PDF file
        startPage: First page to process (0-indexed)
        endPage: Last page to process (None = all pages)
    """
    # TODO: Create a PdfReader object
    # Hint: reader = PdfReader(path)
    reader = PdfReader(path)

    # TODO: Get the list of pages to process
    # Hint: Use reader.pages[startPage:endPage]
    pages = reader.pages[startPage:endPage]

    print(f"üìÑ Processing {len(pages)} pages from PDF...")

    all_chunks = []

    # TODO: Loop through each page with tqdm for progress bar
    for page_num, page in enumerate(tqdm(pages, desc="Extracting text"), startPage):
        # TODO: Extract text from the page
        # Hint: Use page.extract_text()
        text = page.extract_text() or ""

        # TODO: Clean the text
        # Hint: Use the clean_text() function
        text = clean_text(text)

        # Skip empty or very short pages (likely covers or blank pages)
        if len(text) < 100:
            continue

        # TODO: Convert text to chunks
        # Hint: Use chunksFromText(text)
        page_chunks = chunksFromText(text)

        # Add page number to each chunk for reference
        page_chunks = [f"[Page {page_num+1}] {chunk}" for chunk in page_chunks]
        all_chunks.extend(page_chunks)

    # TODO: Add all chunks to the vector database
    # Hint: Use vDB.addToDatabase(all_chunks)
    vDB.addToDatabase(all_chunks)

    print(f"‚úÖ Successfully processed PDF: {len(all_chunks)} chunks added to database")

## Step 5: Question Answering System

Now we'll create the function that ties everything together!

### 5.1: Generate Answer Function

**TODO**: Implement the RAG pipeline:
1. Retrieve relevant context from the database
2. Create a prompt with context and question
3. Generate an answer using the LLM

In [13]:
def generateAnswer(question: str, vDB, llm):
    """
    Generate an answer to a question using RAG.

    Args:
        question: User's question
        vDB: VectorDB instance with loaded documents
        llm: Language model pipeline for generation

    Returns:
        Generated answer string
    """
    # TODO: Search the database for relevant chunks
    # Hint: Use vDB.search(question, n_return=3)
    relevant_chunks, scores = vDB.search(question, n_return=3)

    # TODO: Combine the chunks into a context string
    # Hint: Use "\n\n".join(relevant_chunks)
    context = "\n\n".join(relevant_chunks)

    # TODO: Create a prompt that includes the context and question
    # The prompt should ask the LLM to answer based on the context
    prompt = f"""
Based on the following context, answer the question.

Context:
{context}

Question: {question}

Answer:
"""  # You can modify this prompt template

    # TODO: Generate answer using the LLM
    # Hint: result = llm(prompt, max_length=200, num_return_sequences=1)
    result = llm(prompt, max_length=200, num_return_sequences=1)

    # TODO: Extract the generated text from the result
    # Hint: The result is a list of dictionaries with 'generated_text' key
    generated_text = result[0]["generated_text"]

    # TODO: Extract only the part after "Answer:"
    # Hint: Use split("Answer:")[-1].strip()
    answer = generated_text.split("Answer:")[-1].strip()

    return answer

## Step 6: Put It All Together! üéâ

Now let's run the complete RAG chatbot!

### 6.1: Login to Hugging Face (Optional)

If you have an API key, this step helps avoid rate limits.

In [14]:
print("Logging in to Hugging Face Hub...")
try:
    login(token=HF_API_KEY)
    print("‚úÖ Login successful!")
except Exception as e:
    print(f"‚ö†Ô∏è Skipping login (API key optional): {e}")

Logging in to Hugging Face Hub...
‚ö†Ô∏è Skipping login (API key optional): Invalid user token.


### 6.2: Initialize the Vector Database

**TODO**: Create your VectorDB instance.

In [15]:
print("Loading embedding model...")

# TODO: Create an instance of VectorDB using EMBEDDING_MODEL
# Hint: vDB = VectorDB(EMBEDDING_MODEL)
vDB = VectorDB(EMBEDDING_MODEL)

print("‚úÖ Vector Database ready!")

Loading embedding model...
‚úÖ VectorDB initialized with embedding dimension: 384
‚úÖ Vector Database ready!


### 6.3: Load and Process the PDF

**TODO**: Process the TechNova IT Handbook PDF.

**Note**: Make sure the PDF file is in the same directory as this notebook!

In [16]:
# Locate the PDF file
pdf_path = "TechNova_IT_Handbook.pdf"

# Check if file exists
if not os.path.exists(pdf_path):
    print(f"‚ùå PDF not found at {pdf_path}")
    print("Please make sure 'TechNova_IT_Handbook.pdf' is in the same folder as this notebook.")
else:
    print(f"üìÑ Found PDF: {pdf_path}")

    # TODO: Call chunksFromPDF to process the PDF
    # Hint: chunksFromPDF(vDB, pdf_path)
    chunksFromPDF(vDB, pdf_path)


‚ùå PDF not found at TechNova_IT_Handbook.pdf
Please make sure 'TechNova_IT_Handbook.pdf' is in the same folder as this notebook.


### 6.4: Load the Language Model

**TODO**: Load the LLM for generating answers.

**Note**: This may take a few minutes the first time!

In [17]:
print("Loading LLM model...")
print("‚è≥ This may take a few minutes...")

# TODO: Create a text generation pipeline
# Hint: llm = pipeline("text2text-generation", model=LLM_MODEL)
llm = pipeline("text2text-generation", model=LLM_MODEL)

print("‚úÖ LLM loaded successfully!")

Loading LLM model...
‚è≥ This may take a few minutes...


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Device set to use cpu


‚úÖ LLM loaded successfully!


### 6.5: Interactive Q&A Session

**TODO**: Create an interactive loop for asking questions.

Try asking questions like:
- "What is the password policy?"
- "How do I report a security incident?"
- "What are the remote work guidelines?"

In [None]:
print("\n" + "="*50)
print("ü§ñ RAGBot is ready!")
print("Ask questions about the TechNova IT Handbook.")
print("Type 'exit' or 'quit' to stop.")
print("="*50 + "\n")

while True:
    # Get user input
    question = input("\n‚ùì Your question: ")

    # Check if user wants to exit
    if question.lower() in ["exit", "quit"]:
        print("\nüëã Goodbye! Thanks for using RAGBot!")
        break

    # Skip empty questions
    if not question.strip():
        continue

    # TODO: Generate and print the answer
    # Hint: answer = generateAnswer(question, vDB, llm)
    print("\nü§î Thinking...\n")
    answer = generateAnswer(question, vDB, llm)

    print(f"üí° Answer: {answer}")
    print("\n" + "-"*50)


ü§ñ RAGBot is ready!
Ask questions about the TechNova IT Handbook.
Type 'exit' or 'quit' to stop.


‚ùì Your question: "What is the password policy?"

ü§î Thinking...



Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


üí° Answer: not enough information

--------------------------------------------------

‚ùì Your question: "How do I report a security incident?"


Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



ü§î Thinking...

üí° Answer: (iii).

--------------------------------------------------

‚ùì Your question: "What are the remote work guidelines?"


Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



ü§î Thinking...

üí° Answer: [iii]

--------------------------------------------------


### Alternative: Single Question Testing

If you prefer to test with individual questions, use this cell instead:

In [None]:
# Test with a single question
test_question = "What is the password policy?"

print(f"Question: {test_question}\n")
answer = generateAnswer(test_question, vDB, llm)
print(f"Answer: {answer}")

## üéì Congratulations!

You've successfully built a RAG chatbot! Here's what you accomplished:

1. ‚úÖ Created a vector database to store document embeddings
2. ‚úÖ Implemented similarity search using Euclidean distance
3. ‚úÖ Processed PDF documents and created text chunks
4. ‚úÖ Built a complete RAG pipeline for question answering
5. ‚úÖ Integrated an LLM to generate natural language responses

### üöÄ Next Steps

Want to improve your RAG bot? Try:
- Experimenting with different chunk sizes and overlap values
- Using cosine similarity instead of Euclidean distance
- Adding more sophisticated text preprocessing
- Trying different embedding models
- Implementing a better prompt engineering strategy
- Adding source citations to your answers

### üìö Additional Resources

- [SentenceTransformers Documentation](https://www.sbert.net/)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/)
- [RAG Paper (Lewis et al.)](https://arxiv.org/abs/2005.11401)

Great work! üéâ