# RAG Chatbot with PDF Knowledge Base

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) chatbot that uses the text extracted from a PDF file as its knowledge base. We use Hugging Face models for embeddings and question answering, and ChromaDB as the vector store.

### 1. Install Dependencies

First, we need to install the necessary Python libraries. We'll use:
- `transformers` and `torch` for loading Hugging Face models.
- `sentence-transformers` for creating embeddings.
- `pypdf` to read and extract text from the PDF file.
- `chromadb` for our vector database.
- `datasets` to handle our text data easily.

In [None]:
!pip install transformers torch sentence-transformers pypdf chromadb datasets -q

### 2. Import Libraries

Now, let's import all the required libraries for our project.

In [None]:
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import chromadb
from pypdf import PdfReader
from datasets import Dataset
import numpy as np
import textwrap

### 3. Load and Process the PDF

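We'll load the `Aluminium.pdf` file, extract its text content, and then split the text into smaller, overlapping chunks. Chunking matters for two reasons: the embedding and question-answering models both accept only a limited number of tokens, and smaller chunks let retrieval return just the passages relevant to a question. Note that `chunk_size` and `chunk_overlap` below are measured in characters, not tokens.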

In [None]:
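def extract_text_from_pdf(pdf_path):
    """Extracts the text of every page in a PDF file."""
    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, hence the `or ""`.
    # Join pages with a newline so words at page boundaries do not run together.
    return "\n".join(page.extract_text() or "" for page in reader.pages)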

def split_text_into_chunks(text, chunk_size=500, chunk_overlap=50):
    """Splits text into overlapping chunks of at most chunk_size characters."""
    if not 0 <= chunk_overlap < chunk_size:
        # A step of zero or less would never advance through the text
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = [text[start:start + chunk_size] for start in range(0, len(text), step)]
    return [chunk for chunk in chunks if chunk.strip()]  # Remove empty chunks

# Specify the path to your PDF file
pdf_path = 'Aluminium.pdf'

# Extract and chunk the text
pdf_text = extract_text_from_pdf(pdf_path)
text_chunks = split_text_into_chunks(pdf_text)

# Create a Hugging Face Dataset
documents = {'text': text_chunks}
dataset = Dataset.from_dict(documents)

print(f"Successfully loaded and split the PDF into {len(dataset)} chunks.")
print("\n--- Example Chunk ---")
print(dataset[0]['text'])
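
To make the effect of the overlap concrete, here is a minimal sketch that applies the same stride rule (`chunk_size - chunk_overlap`) to a tiny string. The helper `sliding_windows` and the sample alphabet are made up for illustration and are not part of the pipeline above.

```python
def sliding_windows(text, chunk_size, chunk_overlap):
    """Return (start, chunk) pairs; hypothetical helper for illustration only."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [(start, text[start:start + chunk_size])
            for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_windows(sample, chunk_size=10, chunk_overlap=3)

for start, chunk in windows:
    # Each chunk begins with the last 3 characters of the previous one
    print(start, repr(chunk))
```

With a chunk size of 10 and an overlap of 3, each new chunk starts 7 characters after the previous one, so a sentence cut at a chunk boundary still appears whole in at least one of the neighbouring chunks, provided it is shorter than the overlap.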

### 4. Create Text Embeddings

Next, we'll convert our text chunks into numerical vectors (embeddings) using a pre-trained model from Hugging Face. These embeddings capture the semantic meaning of the text, allowing us to find similar chunks based on a query.

In [None]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

# Generate embeddings for each chunk
# This can take a few moments depending on the number of chunks
embeddings = embedding_model.encode(dataset['text'], show_progress_bar=True)

# Add the embeddings to our dataset
dataset = dataset.add_column('embeddings', embeddings.tolist())

print("Embeddings created and added to the dataset.")
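
As a rough illustration of what "similar" means for embeddings, the sketch below computes cosine similarity, one common similarity measure, on toy vectors. The three-dimensional vectors and their names are invented for this example; the real vectors produced by `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: the first two point in a similar direction
query_vec = [0.9, 0.1, 0.0]
chunk_about_smelting = [0.8, 0.2, 0.1]
chunk_about_cooking = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, chunk_about_smelting))  # close to 1
print(cosine_similarity(query_vec, chunk_about_cooking))   # close to 0
```

Vectors pointing in nearly the same direction score near 1, while unrelated ones score near 0, which is what lets a query find the chunks that discuss the same topic.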

### 5. Build the ChromaDB Collection

With our embeddings ready, we can create a ChromaDB collection. This collection will store our vectors and allow for efficient similarity searches.

In [None]:
# Create a ChromaDB client (this will be an in-memory instance)
client = chromadb.Client()

# Create a new collection or get it if it already exists
collection = client.get_or_create_collection(name="aluminium_kb")

# ChromaDB requires string IDs for each document
doc_ids = [str(i) for i in range(len(dataset))]

# Add the documents and their embeddings to the collection.
# Plain Python lists are passed so the call does not depend on how
# `datasets` returns its columns.
collection.add(
    embeddings=embeddings.tolist(),
    documents=list(dataset['text']),
    ids=doc_ids
)

print(f"ChromaDB collection created with {collection.count()} documents.")
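
Conceptually, a similarity search ranks every stored vector by its distance to the query vector and returns the closest ones. The sketch below does this by brute force in plain Python. It is not how ChromaDB is implemented; it assumes squared Euclidean distance (the metric your collection actually uses depends on its configuration), and the helper `brute_force_query` and the toy data are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_query(query_vec, stored, k=3):
    """Rank stored items by distance to query_vec and keep the k nearest.

    `stored` maps an id to a (vector, document) pair. The result uses one
    inner list per query, like the nested lists a collection query returns.
    """
    ranked = sorted(stored.items(),
                    key=lambda item: squared_l2(query_vec, item[1][0]))
    top = ranked[:k]
    return {
        "ids": [[doc_id for doc_id, _ in top]],
        "documents": [[doc for _, (_, doc) in top]],
    }

# Toy two-dimensional "embeddings" with made-up documents
toy_store = {
    "0": ([0.0, 0.0], "Bauxite is mined."),
    "1": ([1.0, 1.0], "Alumina is smelted."),
    "2": ([5.0, 5.0], "Cans are recycled."),
}

toy_results = brute_force_query([0.9, 1.2], toy_store, k=2)
print(toy_results["documents"][0])
```

A vector database exists to avoid this full scan: it builds an index so that the nearest vectors can be found without comparing the query against every stored item.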

### 6. Define the RAG Chatbot Logic

This is the core of our chatbot. We'll define two main functions, plus a small `chatbot` wrapper that calls them in order:
1.  `retrieve_context`: This function takes a user's question, embeds it with the same model used for the chunks, and queries the ChromaDB collection for the most relevant text chunks from our PDF.
2.  `generate_answer`: This function passes the user's question and the retrieved context to an extractive question-answering model. The model does not write new text; it returns the span of the context that most likely answers the question.

In [None]:
# Load an extractive question-answering model (RoBERTa fine-tuned on SQuAD 2.0)
qa_model_name = 'deepset/roberta-base-squad2'
qa_pipeline = pipeline('question-answering', model=qa_model_name, tokenizer=qa_model_name)

def retrieve_context(query, k=3):
    """Retrieves the top-k most relevant text chunks for a query from ChromaDB."""
    # Embed the query; ChromaDB expects a list of query embeddings, one per query
    query_embedding = embedding_model.encode([query]).tolist()
    
    # Query the ChromaDB collection
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k
    )
    
    # The retrieved documents are in the 'documents' key
    retrieved_chunks = results['documents'][0]
    return " ".join(retrieved_chunks)

def generate_answer(query, context):
    """Extracts the answer span for the query from the retrieved context."""
    qa_input = {
        'question': query,
        'context': context
    }
    result = qa_pipeline(qa_input)
    return result['answer']

def chatbot(query):
    """The main chatbot function."""
    print(f"❓ Query: {query}")
    
    # 1. Retrieve context
    context = retrieve_context(query)
    # print(f"\n🔍 Retrieved Context:\n{context}") # Uncomment for debugging
    
    # 2. Generate answer
    answer = generate_answer(query, context)
    print(f"\n🤖 Answer: {answer}")
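
The `question-answering` pipeline returns a dictionary with `answer`, `score`, `start`, and `end` keys, while `generate_answer` above keeps only the answer text. One possible refinement, sketched here, is to use the score to avoid presenting low-confidence spans as answers. The helper `format_answer`, the fallback message, and the `min_score` threshold of 0.2 are assumptions for illustration; a sensible threshold would need tuning on your own questions.

```python
FALLBACK = "I could not find a confident answer in the document."

def format_answer(result, min_score=0.2):
    """Turn a QA pipeline result dict into a user-facing string.

    `min_score` is a hypothetical threshold, not a recommended value.
    """
    answer = result.get("answer", "").strip()
    score = result.get("score", 0.0)
    if not answer or score < min_score:
        return FALLBACK
    return f"{answer} (confidence: {score:.2f})"

# Hand-written dicts in the shape the pipeline returns
print(format_answer({"answer": "bauxite residue", "score": 0.83, "start": 10, "end": 25}))
print(format_answer({"answer": "the", "score": 0.01, "start": 0, "end": 3}))
```

To use it in the notebook, `generate_answer` could return `format_answer(result)` instead of `result['answer']`.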

### 7. Ask a Question!

Now it's time to test our RAG chatbot. Let's ask a question based on the content of the `Aluminium.pdf` file.

In [None]:
# Example usage
user_query = "What are the main production stages of Aluminium?"
chatbot(user_query)

In [None]:
user_query_2 = "How much energy does recycling aluminium save?"
chatbot(user_query_2)

In [None]:
user_query_3 = "What is red mud?"
chatbot(user_query_3)