# RAG Chatbot with PDF Knowledge Base

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) chatbot. The chatbot will use the information from an uploaded PDF file as its knowledge base. We will use Hugging Face for models and FAISS for efficient document retrieval.

### 1. Install Dependencies

First, we need to install the necessary Python libraries. We'll use:
- `transformers` and `torch` for loading Hugging Face models.
- `sentence-transformers` for creating embeddings.
- `pypdf` to read and extract text from the PDF file.
- `faiss-cpu` for creating the vector index for fast retrieval.
- `datasets` to handle our text data easily.
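
Before moving on, it can help to confirm that these packages are importable in the current kernel. The sketch below is an optional check and not part of the original notebook: the helper name `missing_packages` is hypothetical, and the import names (for example `sentence_transformers` for `sentence-transformers` and `faiss` for `faiss-cpu`) are assumed to match the installed distributions.

```python
import importlib.util

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# Import names can differ from pip names (assumed mapping, see the note above).
required = ["transformers", "torch", "sentence_transformers", "pypdf", "faiss", "datasets"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it and restart the kernel before continuing.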



### 2. Import Libraries

Now, let's import all the required libraries for our project.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from sentence_transformers import SentenceTransformer
import faiss
from pypdf import PdfReader
from datasets import Dataset
import numpy as np
import textwrap



### 3. Load and Process the PDF

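We'll load the `Aluminium.pdf` file, extract its text, and split it into overlapping chunks of 500 characters, with 200 characters shared between consecutive chunks. Chunking matters because the embedding and question-answering models accept only a limited amount of input, and the overlap reduces the chance that a relevant passage is cut in two at a chunk boundary.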

In [2]:
def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

def split_text_into_chunks(text, chunk_size=500, chunk_overlap=200):
    """Splits text into overlapping chunks of at most chunk_size characters."""
    # Guard against a zero or negative step, which would loop forever
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # Distance between consecutive chunk starts
    chunks = [text[start:start + chunk_size] for start in range(0, len(text), step)]
    return [chunk for chunk in chunks if chunk.strip()]  # Remove empty chunks

# Specify the path to your PDF file
pdf_path = 'Aluminium.pdf'

# Extract and chunk the text
pdf_text = extract_text_from_pdf(pdf_path)
text_chunks = split_text_into_chunks(pdf_text)

# Create a Hugging Face Dataset
documents = {'text': text_chunks}
dataset = Dataset.from_dict(documents)

print(f"Successfully loaded and split the PDF into {len(dataset)} chunks.")
print("\n--- Example Chunk ---")
print(dataset[0]['text'])

Successfully loaded and split the PDF into 62 chunks.

--- Example Chunk ---
Aluminium
Ore & Mining: Bauxite ore (mainly in tropical countries) is the principal source of alumina. Global
bauxite mines are often large open-pit operations producing 3–5 tonnes of ore per tonne of Al.
(India’s bauxite reserves lie mainly in Odisha and Jharkhand.). 
Production Steps: Primary Al production is a three-step process (bauxite mining, alumina
refining via Bayer , then Hall–Héroult electrolysis). In India, alumina (Al₂O₃) is refined (Bayer
process) at plants in Odisha/Chhattisgarh, 
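
To make the overlap concrete, here is a small sketch of where each chunk starts for a given text length. The helper `chunk_starts` is hypothetical and only mirrors the stepping logic of `split_text_into_chunks` with the same default sizes; it does not read the PDF.

```python
def chunk_starts(text_length, chunk_size=500, chunk_overlap=200):
    """Return the start offset of each chunk for a text of the given length."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # Each chunk starts this many characters after the previous one
    return list(range(0, text_length, step))

# A 1200-character text with the default sizes (step = 500 - 200 = 300)
starts = chunk_starts(1200)
print(starts)  # [0, 300, 600, 900]
```

With these defaults each chunk shares its last 200 characters with the start of the next one, and the final chunk may be shorter than 500 characters.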


### 4. Create Text Embeddings

Next, we'll convert our text chunks into numerical vectors (embeddings) using a pre-trained model from Hugging Face. These embeddings capture the semantic meaning of the text, allowing us to find similar chunks based on a query.

In [3]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

# Generate embeddings for each chunk
# This can take a few moments depending on the number of chunks
embeddings = embedding_model.encode(dataset['text'], show_progress_bar=True)

# Add the embeddings to our dataset
dataset = dataset.add_column('embeddings', embeddings.tolist())

print("Embeddings created and added to the dataset.")

Batches: 100%|██████████| 2/2 [00:01<00:00,  1.47it/s]

Embeddings created and added to the dataset.
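
As an illustration of what "similar" means for embeddings, the sketch below compares toy three-dimensional vectors with cosine similarity. The vectors are made up for the example and are not real model output, and this notebook's index actually ranks chunks by L2 distance rather than cosine similarity; the point is only that vectors pointing in similar directions score higher.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (assumed values, for illustration only)
query = [1.0, 0.0, 1.0]
close_chunk = [0.9, 0.1, 1.1]
far_chunk = [-1.0, 1.0, 0.0]

print(cosine_similarity(query, close_chunk) > cosine_similarity(query, far_chunk))  # True
```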





### 5. Build the FAISS Index

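With our embeddings ready, we can build a FAISS index. FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors. We use `IndexFlatL2`, which compares the query against every stored vector by Euclidean (L2) distance; this exact search is fast enough for a few dozen chunks. The index serves as our searchable knowledge base.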

In [4]:
# Convert embeddings to a numpy array
embeddings_np = np.array(dataset['embeddings'], dtype='float32')

# Get the dimension of the embeddings
d = embeddings_np.shape[1]

# Create the FAISS index
index = faiss.IndexFlatL2(d)
index.add(embeddings_np)

print(f"FAISS index created with {index.ntotal} vectors.")

FAISS index created with 62 vectors.
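
Conceptually, a flat L2 index does an exhaustive nearest-neighbour search. The pure-Python sketch below shows that idea on toy two-dimensional vectors; `flat_l2_search` is a hypothetical helper written for illustration, not the FAISS implementation, and it returns squared distances on the understanding that `IndexFlatL2` also reports squared L2 distances.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k stored vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))  # Squared L2 distance
        scored.append((dist, idx))
    scored.sort()  # Smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "index" with three stored vectors (assumed values, for illustration only)
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
distances, indices = flat_l2_search(stored, [0.9, 1.2], k=2)
print(indices)  # [1, 0]
```

The real index does the same comparison over the chunk embeddings, just far more efficiently.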


### 6. Define the RAG Chatbot Logic

This is the core of our chatbot. We'll define two main functions:
1.  `retrieve_context`: This function takes a user's question, embeds it, and uses the FAISS index to find the most relevant text chunks from our PDF.
2.  `generate_answer`: This function passes the user's question and the retrieved context to an extractive question-answering model, which returns the span of the context most likely to answer the question.

In [5]:
# Load a question-answering model
qa_model_name = 'deepset/roberta-base-squad2'
qa_pipeline = pipeline('question-answering', model=qa_model_name, tokenizer=qa_model_name)

def retrieve_context(query, k=3):
    """Retrieves the top-k most relevant text chunks for a query."""
    query_embedding = embedding_model.encode([query])
    query_embedding_np = np.array(query_embedding, dtype='float32')
    
    # Search the FAISS index
    distances, indices = index.search(query_embedding_np, k)
    
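    # Get the corresponding text chunks.
    # FAISS returns numpy.int64 indices (and -1 for unfilled slots), but Dataset
    # indexing expects a plain int, so cast each index and skip the -1 entries.
    retrieved_chunks = [dataset[int(i)]['text'] for i in indices[0] if i != -1]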
    return " ".join(retrieved_chunks)

def generate_answer(query, context):
    """Generates an answer based on the query and retrieved context."""
    qa_input = {
        'question': query,
        'context': context
    }
    result = qa_pipeline(qa_input)
    return result['answer']

def chatbot(query):
    """The main chatbot function."""
    print(f"❓ Query: {query}")
    
    # 1. Retrieve context
    context = retrieve_context(query)
    # print(f"\n🔍 Retrieved Context:\n{context}") # Uncomment for debugging
    
    # 2. Generate answer
    answer = generate_answer(query, context)
    print(f"\n🤖 Answer: {answer}")

Device set to use mps:0
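
The question-answering pipeline returns a dictionary that includes a confidence `score` alongside the `answer`. As an optional refinement, the sketch below falls back to a fixed message when the answer is empty or the score is low. The helper `pick_answer`, the `min_score` threshold of 0.1, and the sample dictionaries are all assumptions made for illustration; they are not output from this notebook's model.

```python
def pick_answer(result, min_score=0.1, fallback="I could not find that in the document."):
    """Return the extracted answer, or a fallback when it is empty or low-confidence."""
    answer = result.get("answer", "").strip()
    score = result.get("score", 0.0)
    if not answer or score < min_score:
        return fallback
    return answer

# Hypothetical pipeline results (made-up values, not real model output)
print(pick_answer({"answer": "bauxite mining", "score": 0.62}))
print(pick_answer({"answer": "", "score": 0.01}))
```

A suitable threshold depends on the model and the documents, so it is worth tuning against a few known questions.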


### 7. Ask a Question!

Now it's time to test our RAG chatbot. Let's ask a question based on the content of the `Aluminium.pdf` file.

In [6]:
# Example usage
user_query = "What are the main production stages of Aluminium?"
chatbot(user_query)


In [None]:
user_query_2 = "How much energy does recycling aluminum save?"
chatbot(user_query_2)

In [None]:
user_query_3 = "What is red mud?"
chatbot(user_query_3)