# RAG Chatbot with PDF Knowledge Base

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) chatbot. The chatbot will use the information from an uploaded PDF file as its knowledge base. We will use Hugging Face for models and ChromaDB for the vector store. 

This version uses a **generative LLM (`google/flan-t5-large`)** to produce natural-language answers instead of only extracting text spans from the PDF.

### 1. Install Dependencies

First, we need to install the necessary Python libraries. We'll use:
- `transformers` and `torch` for loading Hugging Face models.
- `sentence-transformers` for creating embeddings.
- `pypdf` to read and extract text from the PDF file.
- `chromadb` for our vector database.
- `datasets` to handle our text data easily.

In [None]:
!pip install transformers torch sentence-transformers pypdf chromadb datasets -q

### 2. Import Libraries

Now, let's import all the required libraries for our project.

In [17]:
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import chromadb
from pypdf import PdfReader
from datasets import Dataset
import numpy as np
import textwrap

### 3. Load and Process the PDF

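We'll load the `Aluminium.pdf` file, extract its text content, and then split the text into smaller, overlapping chunks. Chunking matters because both the embedding model and the generator accept only a limited number of input tokens, so retrieval has to work with passages small enough that several of them fit into one prompt.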

In [18]:
def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

def split_text_into_chunks(text, chunk_size=500, chunk_overlap=50):
    """Splits text into overlapping, fixed-size character chunks."""
    if chunk_overlap >= chunk_size:
        # A step of zero or less would never advance through the text
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = [text[pos:pos + chunk_size] for pos in range(0, len(text), step)]
    return [chunk for chunk in chunks if chunk.strip()]  # Remove empty chunks

# Specify the path to your PDF file
pdf_path = 'Aluminium.pdf'

# Extract and chunk the text
pdf_text = extract_text_from_pdf(pdf_path)
text_chunks = split_text_into_chunks(pdf_text)

# Create a Hugging Face Dataset
documents_dict = {'text': text_chunks}
dataset = Dataset.from_dict(documents_dict)

print(f"Successfully loaded and split the PDF into {len(dataset)} chunks.")
print("\n--- Example Chunk ---")
print(dataset[0]['text'])

Successfully loaded and split the PDF into 42 chunks.

--- Example Chunk ---
Aluminium
Ore & Mining: Bauxite ore (mainly in tropical countries) is the principal source of alumina. Global
bauxite mines are often large open-pit operations producing 3–5 tonnes of ore per tonne of Al.
(India’s bauxite reserves lie mainly in Odisha and Jharkhand.). 
Production Steps: Primary Al production is a three-step process (bauxite mining, alumina
refining via Bayer , then Hall–Héroult electrolysis). In India, alumina (Al₂O₃) is refined (Bayer
process) at plants in Odisha/Chhattisgarh, 
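
To see how the overlap works, the sliding-window logic used above can be traced on a short string. This is a self-contained sketch: `chunk_starts` is a hypothetical helper that is not part of the notebook, and the small sizes are chosen only to keep the output readable.

```python
def chunk_starts(text_length, chunk_size, chunk_overlap):
    """Start offsets visited by a sliding window with the given overlap."""
    step = chunk_size - chunk_overlap  # the window advances by this much
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return list(range(0, text_length, step))


text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
starts = chunk_starts(len(text), chunk_size=10, chunk_overlap=3)
chunks = [text[s:s + 10] for s in starts]

# Each chunk begins with the last 3 characters of the previous one
for start, chunk in zip(starts, chunks):
    print(start, repr(chunk))
```

With the notebook's defaults (`chunk_size=500`, `chunk_overlap=50`) the window advances 450 characters per step, so a sentence cut at one chunk boundary has a good chance of appearing whole in the neighbouring chunk.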


### 4. Create Text Embeddings

Next, we'll convert our text chunks into numerical vectors (embeddings) using a pre-trained model from Hugging Face. These embeddings capture the semantic meaning of the text, allowing us to find similar chunks based on a query.

In [19]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

# Generate embeddings for each chunk
# This can take a few moments depending on the number of chunks
embeddings = embedding_model.encode(dataset['text'], show_progress_bar=True)

# Add the embeddings to our dataset
dataset = dataset.add_column('embeddings', embeddings.tolist())

print("Embeddings created and added to the dataset.")

Batches: 100%|██████████| 2/2 [00:00<00:00, 10.18it/s]


Embeddings created and added to the dataset.
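
"Similar" embeddings are usually compared with cosine similarity, which measures the angle between two vectors and ignores their length. The sketch below shows the calculation in plain Python. The 3-dimensional vectors are invented for illustration (the `all-MiniLM-L6-v2` model produces 384-dimensional vectors), and `cosine_similarity` is a hypothetical helper rather than something the notebook defines.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
related = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated = [0.0, 3.0, 0.0]  # orthogonal to the query

print("related:  ", cosine_similarity(query, related))
print("unrelated:", cosine_similarity(query, unrelated))
```

The related vector scores close to 1 and the orthogonal one scores 0, which is the property retrieval relies on when ranking chunks against a question.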


### 5. Build the ChromaDB Collection

With our embeddings ready, we can create a ChromaDB collection. This collection will store our vectors and allow for efficient similarity searches.

In [20]:
# Create a ChromaDB client (this will be an in-memory instance)
client = chromadb.Client()

# Create a new collection or get it if it already exists
collection = client.get_or_create_collection(name="aluminium_kb")

# ChromaDB requires string IDs for each document
doc_ids = [str(i) for i in range(len(dataset))]

# Convert the documents column to a plain Python list before passing it to ChromaDB
documents_list = [doc for doc in dataset['text']]

# Add the documents and their embeddings to the collection
collection.add(
    embeddings=np.array(dataset['embeddings']),
    documents=documents_list, # Use the explicit list
    ids=doc_ids
)

print(f"ChromaDB collection created with {collection.count()} documents.")

ChromaDB collection created with 42 documents.
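
Conceptually, a similarity search returns the `k` stored vectors that lie closest to the query vector. The brute-force sketch below illustrates that idea in plain Python. It is not how ChromaDB works internally (ChromaDB relies on an index rather than a full scan), and `top_k`, the toy 2-dimensional vectors, and the choice of squared Euclidean distance are all assumptions made for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=3):
    """Return (id, distance) pairs for the k vectors closest to the query."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy store keyed by string IDs, mirroring the string IDs ChromaDB requires
store = {
    "0": [0.0, 0.0],
    "1": [1.0, 1.0],
    "2": [5.0, 5.0],
    "3": [0.9, 1.2],
}

nearest = top_k([1.0, 1.0], store, k=2)
print(nearest)
```

A full scan like this is fine for 42 chunks; an index becomes worthwhile once the collection grows to many thousands of vectors.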


### 6. Define the RAG Chatbot Logic with a Generative LLM

This is the core of our chatbot. We'll define two main functions:
1.  `retrieve_context`: This function takes a user's question, embeds it, and uses the ChromaDB collection to find the most relevant text chunks from our PDF.
2.  `generate_answer`: This function uses a **generative LLM (`google/flan-t5-large`)**. It takes the user's question and the retrieved context, and *generates* a new, human-like answer.

In [21]:
# Load a generative text-to-text model (Flan-T5 Large for better generation)
llm_model_name = 'google/flan-t5-large'
llm_pipeline = pipeline('text2text-generation', model=llm_model_name, tokenizer=llm_model_name)

def retrieve_context(query, k=3):
    """Retrieves the top-k most relevant text chunks for a query from ChromaDB."""
    # Convert the query to a list of embeddings
    query_embedding = embedding_model.encode([query]).tolist()
    
    # Query the ChromaDB collection
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k
    )
    
    # The retrieved documents are in the 'documents' key
    retrieved_chunks = results['documents'][0]
    return " ".join(retrieved_chunks)

def generate_answer(query, context):
    """Generates a natural language answer based on the query and retrieved context."""
    # Create a detailed prompt for the generative model
    prompt = f"""
You are a helpful assistant. Based on the following context, provide a comprehensive, detailed, and natural-sounding answer to the question. Explain your answer step by step if necessary.

Context:
{context}

Question:
{query}

Answer:
"""
    
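    # Generate the answer with sampling for longer, more varied responses.
    # Note: the pipeline also sets max_new_tokens=256, which takes precedence over
    # max_length (see the warning in the outputs), so answers stop after 256 new tokens.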
    result = llm_pipeline(prompt, max_length=512, min_length=50, do_sample=True, temperature=0.7, top_p=0.9, clean_up_tokenization_spaces=True)
    
    # Extract the generated text
    return result[0]['generated_text']

def chatbot(query):
    """The main chatbot function."""
    print(f"❓ Query: {query}")
    
    # 1. Retrieve context
    context = retrieve_context(query)
    # print(f"\n🔍 Retrieved Context:\n{context}") # Uncomment for debugging
    
    # 2. Generate answer
    answer = generate_answer(query, context)
    print(f"\n🤖 Answer: {answer}")

Device set to use mps:0
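
Three 500-character chunks plus the instruction text can exceed the input length that the Flan-T5 tokenizer expects: the runs in the next section report 576 and 584 tokens against a limit of 512. One possible way to stay within that limit is to add retrieved chunks only while an estimated token budget allows it. In this sketch `estimate_tokens` and `fit_context` are hypothetical helpers, the `reserved` allowance for the instruction text is a guess, and the three-characters-per-token estimate is a rough assumption. A more careful version would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text, chars_per_token=3):
    """Very rough token estimate (assumption: about 3 characters per token)."""
    return max(1, len(text) // chars_per_token)


def fit_context(chunks, question, budget=512, reserved=80):
    """Keep the best-ranked chunks that fit into the remaining token budget.

    `chunks` is assumed to be ordered from most to least relevant.
    `reserved` is a hypothetical allowance for the instruction text.
    """
    remaining = budget - reserved - estimate_tokens(question)
    kept = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > remaining:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        remaining -= cost
    return kept


chunks = ["a" * 500, "b" * 500, "c" * 500, "d" * 500]
question = "What is red mud?"
kept = fit_context(chunks, question)
print(len(kept), "of", len(chunks), "chunks fit the budget")
```

Dropping whole low-ranked chunks keeps the question and the `Answer:` cue intact, whereas truncating the finished prompt from the end would cut off exactly those parts.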


### 7. Ask a Question!

Now it's time to test the RAG chatbot. Because the answers are generated rather than extracted, they should read as fuller, more conversational sentences.

In [None]:
user_query = "What are the main production stages of Aluminium?"
chatbot(user_query)

❓ Query: What are the main production stages of Aluminium?


Token indices sequence length is longer than the specified maximum sequence length for this model (576 > 512). Running this sequence through the model will result in indexing errors
Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🤖 Answer: bauxite mining, alumina refining, and electrolysis


In [15]:
user_query_2 = "How much energy does recycling aluminum save?"
chatbot(user_query_2)

❓ Query: How much energy does recycling aluminum save?


Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🤖 Answer: 95%


In [16]:
user_query_3 = "What is red mud?"
chatbot(user_query_3)

❓ Query: What is red mud?


Both `max_new_tokens` (=256) and `max_length`(=256) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🤖 Answer: Bauxite residue


In [22]:
user_query_4 = "What are the environmental impacts of aluminum production?"
chatbot(user_query_4)

Token indices sequence length is longer than the specified maximum sequence length for this model (584 > 512). Running this sequence through the model will result in indexing errors


❓ Query: What are the environmental impacts of aluminum production?


Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🤖 Answer: Environmental impacts include GHGs (CO2, PFCs) and fluorides. Recycling rates: India 30–35%, global 65–75%. Typical waste: red mud (1–2 t per t alumina) and spent pot linings. Product lifetimes vary: packaging (years), transport/ construction (decades).","metadata":"title":"Aluminium Production Facts","author":"SynthKB","date":"2025-09-01","jurisdiction":"global/ India","metadata":"title":"Aluminium Production Facts","author":"SynthKB","date":"2025-09-01","jurisdiction":"global/ India","metadata":"title":"Aluminium Production Facts","author":"SynthKB","date":"2025-09-01","jurisdiction":"global/ India","metadata":"title":"A


In [23]:
user_query_3 = "What is red mud?"
chatbot(user_query_3)

Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


❓ Query: What is red mud?

🤖 Answer: Bauxite residue, containing caustic, Fe/Ti oxides. Smelting produces spent pot-linings and alumina-saturated slags. India is classifying red mud as hazardous waste to enforce safe disposal. Use-Phase (Lifetimes): Al products vary: e.g. foil/can packaging is short-lived (years), transportation and building components are long-lived (cars 10–20 yr; infrastru


In [24]:
user_query_2 = "How much energy does recycling aluminum save?"
chatbot(user_query_2)

Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


❓ Query: How much energy does recycling aluminum save?

🤖 Answer: In India, aluminium smelters emit 20.9 tCO2 per tonne due to coal-based power. By contrast, recycling aluminum saves 95% of that energy. Recycling uses 5% of primary energy. Recycling rates: India 30–35%, global 65–75%. Typical waste: red mud (1–2 t per t alumina) and spent pot linings. Product lifetimes vary: packaging (years), transport/ construction (decades).","metadata":"title":"Aluminium Productio Recycled Cu uses far less energy; ICA cites 85% energy savings vs primary.) Emissions: Copper smelters emit substantial pollutants: SO2 (if not fully captured), CO2 (from fuel, charcoal, and power), particulates (fugitive dust), and heavy metals (As, Pb, Cd) in slag/dust. Converting blister Cu is exothermic and produces CO2 from coke. LCA studies note that SO2 released (if not recovered)


In [None]:

user_query = "What are the main production stages of Aluminium?"
chatbot(user_query)

Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


❓ Query: What are the main production stages of Aluminium?

🤖 Answer: Bauxite mining, alumina refining, and electrolysis are the main production stages of Aluminium. Typically “1 tonne of aluminium ingot (cradle-to-gate)” is used as the functional unit. Ensure the system boundary (mining to gate, etc.) is clearly defined.


In [26]:
user_query = "PRESIDENT OF INDIA ?"
chatbot(user_query)

❓ Query: PRESIDENT OF INDIA ?


Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🤖 Answer: The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of India is called the President of India. The President of Ind
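
The last query has no answer in the PDF, yet the model still produces text. A common mitigation is to check how close the best retrieved chunk is and to decline when nothing is near enough. The sketch below assumes the distances come from the `distances` field that `collection.query` returns alongside `documents`. The `should_answer` and `answer_or_decline` helpers, the threshold value, and the toy distances are all hypothetical, and the threshold would need tuning on real queries.

```python
def should_answer(distances, max_distance=1.0):
    """Return True if the closest retrieved chunk is within max_distance.

    `distances` holds the distances for one query; smaller means more similar.
    The default threshold is a placeholder, not a tuned value.
    """
    return bool(distances) and min(distances) <= max_distance


def answer_or_decline(query, distances, generate):
    """Call `generate` only when retrieval found something close enough."""
    if not should_answer(distances):
        return "I could not find this in the document."
    return generate(query)


def fake_generate(query):
    """Stand-in for the notebook's generate step, used only in this demo."""
    return f"(generated answer for: {query})"


# Toy distances standing in for results['distances'][0]
in_scope = answer_or_decline("What is red mud?", [0.4, 0.7, 0.9], fake_generate)
out_of_scope = answer_or_decline("PRESIDENT OF INDIA ?", [1.6, 1.7, 1.8], fake_generate)
print(in_scope)
print(out_of_scope)
```

Gating on retrieval distance stops the generator from being asked to answer from unrelated context, which is the situation that produced the repetitive output above.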