# RAG Chatbot with PDF Knowledge Base (Optimized for Apple Silicon)

This notebook demonstrates how to build a high-performance Retrieval-Augmented Generation (RAG) chatbot optimized to run on Apple Silicon (M-series chips). 

We will use:
- **Generator LLM:** `meta-llama/Meta-Llama-3-8B-Instruct` (The official Llama 3 model)
- **Retriever Model:** `BAAI/bge-large-en-v1.5` (A top-tier embedding model)
- **Vector Store:** ChromaDB

### 1. Install Dependencies

First, install the required Python libraries. `accelerate` is needed for the `device_map` argument used when loading Llama 3 later in the notebook.

In [None]:
!pip install transformers torch sentence-transformers pypdf chromadb datasets accelerate -q

### 2. Login to Hugging Face

The official Llama 3 model is gated. You need to log in with a Hugging Face account that has been granted access. You can request access on the [Llama 3 model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

Run this cell and enter a User Access Token from your Hugging Face account.

In [None]:
from huggingface_hub import login

# Replace 'YOUR_HUGGINGFACE_TOKEN' with your actual token
hf_token = "YOUR_HUGGINGFACE_TOKEN" 
login(token=hf_token)
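
Hardcoding a token in a notebook makes it easy to leak when the file is shared. Below is a minimal sketch of one alternative, assuming the token has been exported in an environment variable (here `HF_TOKEN`). The helper name `resolve_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os

def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is missing."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell.")
    return token

# In the notebook, the result would replace the hardcoded string:
# login(token=resolve_hf_token())
```

Passing a dictionary as `env` is only there to make the helper easy to try out without touching the real environment.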

### 3. Import Libraries and Check for Apple Silicon GPU

We'll import all the required libraries and verify that PyTorch can see the Mac's `mps` device (the GPU).

In [29]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import chromadb
from pypdf import PdfReader
from datasets import Dataset
import numpy as np
import textwrap

# Check for Apple Silicon GPU and set the device
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS (Apple Silicon GPU) is available. Using device: mps")
else:
    device = torch.device("cpu")
    print("MPS not available. Using device: cpu")

MPS (Apple Silicon GPU) is available. Using device: mps


### 4. Load and Process the PDF

We load the `Aluminium.pdf` file, extract its text, and split the text into overlapping chunks of 500 characters, with 50 characters shared between consecutive chunks.

In [30]:
def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

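def split_text_into_chunks(text, chunk_size=500, chunk_overlap=50):
    """Splits text into overlapping chunks of at most chunk_size characters."""
    if not 0 <= chunk_overlap < chunk_size:
        # A non-positive step would make the loop below run forever
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    chunks = []
    current_pos = 0
    while current_pos < len(text):
        end_pos = current_pos + chunk_size
        chunk = text[current_pos:end_pos]
        chunks.append(chunk)
        current_pos += chunk_size - chunk_overlap  # Step forward, keeping the overlap
    return [chunk for chunk in chunks if chunk.strip()]  # Remove empty chunks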

# Specify the path to your PDF file
pdf_path = 'Aluminium.pdf'

# Extract and chunk the text
pdf_text = extract_text_from_pdf(pdf_path)
text_chunks = split_text_into_chunks(pdf_text)

# Create a Hugging Face Dataset
documents_dict = {'text': text_chunks}
dataset = Dataset.from_dict(documents_dict)

print(f"Successfully loaded and split the PDF into {len(dataset)} chunks.")

Successfully loaded and split the PDF into 42 chunks.
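
To see what the overlap does, here is a small worked example. It restates the sliding-window logic from the cell above in a standalone form and runs it on a synthetic 1,000-character string rather than the PDF. The sample string is made up for illustration.

```python
def split_text_into_chunks(text, chunk_size=500, chunk_overlap=50):
    """Same sliding-window logic as the cell above, written as a comprehension."""
    step = chunk_size - chunk_overlap  # 450 characters with the defaults
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    return [c for c in chunks if c.strip()]

# Synthetic input: the digits 0-9 repeated up to 1,000 characters
sample = "".join(str(i % 10) for i in range(1000))
chunks = split_text_into_chunks(sample)

print([len(c) for c in chunks])           # [500, 500, 100]
print(chunks[0][-50:] == chunks[1][:50])  # True
```

Chunks start every 450 characters, so the last 50 characters of one chunk reappear at the start of the next. This keeps a sentence that straddles a boundary intact in at least one chunk.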


### 5. Create Text Embeddings with BGE-Large

We'll use the BGE-Large embedding model to convert the text chunks into numerical vectors. Because the model is created with `device=device`, encoding runs on the Mac's GPU when MPS is available and on the CPU otherwise.

In [31]:
embedding_model_name = 'BAAI/bge-large-en-v1.5'
embedding_model = SentenceTransformer(embedding_model_name, device=device)

# Generate embeddings for each chunk
embeddings = embedding_model.encode(dataset['text'], show_progress_bar=True)

# Add the embeddings to our dataset
dataset = dataset.add_column('embeddings', embeddings.tolist())

print("Embeddings created with BGE-Large and added to the dataset.")

RuntimeError: MPS backend out of memory (MPS allocated: 27.15 GiB, other allocations: 2.23 MiB, max allowed: 27.20 GiB). Tried to allocate 119.23 MiB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
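
The run above stopped with an MPS out-of-memory error. If the kernel still holds other large models from earlier runs, restarting it before this cell is probably the first thing to try. Beyond that, one option is to encode fewer chunks at a time; `encode` accepts a `batch_size` argument for this. The sketch below shows the same idea as an explicit loop. `encode_in_batches` is a hypothetical helper, and the stand-in encoder exists only so the example runs without loading a model. In the notebook you would pass `embedding_model.encode` instead.

```python
def encode_in_batches(texts, encode_fn, batch_size=4):
    """Encode `texts` a few at a time to keep peak memory low."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Convert each row to plain floats so the result is a list of lists
        vectors.extend([float(x) for x in row] for row in encode_fn(batch))
    return vectors

# Stand-in encoder (an assumption for illustration): one 2-d vector per text
def fake_encode(batch):
    return [[len(t), t.count(" ")] for t in batch]

demo = encode_in_batches(["red mud", "bauxite", "circular economy"], fake_encode, batch_size=2)
print(demo)  # [[7.0, 1.0], [7.0, 0.0], [16.0, 1.0]]
```

Whether a smaller batch is enough depends on what else is occupying GPU memory, so treat this as something to experiment with rather than a guaranteed fix.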

### 6. Build the ChromaDB Collection

We load the text chunks and their embeddings into an in-memory ChromaDB collection, which serves as the vector store for retrieval.

In [None]:
# Create a ChromaDB client (this will be an in-memory instance)
client = chromadb.Client()

# Create a new collection or get it if it already exists
collection = client.get_or_create_collection(name="aluminium_kb_v2")

doc_ids = [str(i) for i in range(len(dataset))]
documents_list = [doc for doc in dataset['text']]

# Add the documents and their embeddings to the collection
collection.add(
    embeddings=np.array(dataset['embeddings']),
    documents=documents_list,
    ids=doc_ids
)

print(f"ChromaDB collection created with {collection.count()} documents.")
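
Conceptually, a vector store answers a query by returning the stored vectors closest to the query vector. ChromaDB does this with an approximate index, so the brute-force sketch below only illustrates the idea. It assumes squared Euclidean distance as the metric, and the `nearest` helper and the toy 2-d vectors are made up for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query_vec, stored, k=3):
    """stored: dict of id -> vector. Returns the k closest ids, nearest first."""
    ranked = sorted(stored, key=lambda doc_id: squared_l2(query_vec, stored[doc_id]))
    return ranked[:k]

toy_store = {"0": [0.0, 1.0], "1": [1.0, 0.0], "2": [0.9, 0.1], "3": [-1.0, 0.0]}
print(nearest([1.0, 0.0], toy_store, k=2))  # ['1', '2']
```

A real collection holds 1,024-dimensional BGE-Large vectors instead of 2-d ones, but the ranking principle is the same.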

### 7. Define the RAG Chatbot with Llama 3

This is the core of our new chatbot. We load the official Llama 3 model and create a pipeline that runs on the Mac's GPU (`mps`).

**Note:** The first time you run this cell, it downloads the Llama 3 weights (roughly 16 GB in total). This may take some time.

In [None]:
llm_model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    torch_dtype=torch.bfloat16, # bfloat16 halves memory use compared with float32
    device_map=device # Explicitly set the device
)

# Create the pipeline for text generation
llm_pipeline = pipeline('text-generation', model=model, tokenizer=tokenizer)

def retrieve_context(query, k=3):
    query_embedding = embedding_model.encode([query]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    retrieved_chunks = results['documents'][0]
    return " ".join(retrieved_chunks)

def generate_answer(query, context):
    # Llama 3 uses a specific chat template
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Answer the user's question based on the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ]
    
    prompt = llm_pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )

    terminators = [
        llm_pipeline.tokenizer.eos_token_id,
        llm_pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")  # End of turn token for Llama 3
    ]

    outputs = llm_pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    
    # Extract the response from the generated text
    generated_text = outputs[0]['generated_text']
    response = generated_text[len(prompt):].strip()
    return response

def chatbot(query):
    print(f"❓ Query: {query}")
    context = retrieve_context(query)
    answer = generate_answer(query, context)
    print(f"\n🤖 Llama 3 Answer:\n{textwrap.fill(answer, width=80)}")
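
To make the chat template less of a black box, the sketch below builds by hand roughly the string that `apply_chat_template` returns for Llama 3 Instruct. It follows the published Llama 3 prompt layout to the best of our understanding; the tokenizer's own template remains authoritative, and `llama3_prompt` is a hypothetical helper written only for this illustration.

```python
def llama3_prompt(messages):
    """Approximate the Llama 3 Instruct chat layout for a list of messages."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        header = f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        parts.append(header + m["content"].strip() + "<|eot_id|>")
    # Equivalent of add_generation_prompt=True: open the assistant turn
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo = llama3_prompt([
    {"role": "system", "content": "Answer from the context."},
    {"role": "user", "content": "Context:\n(retrieved chunks)\n\nQuestion: What is red mud?"},
])
print(demo)
```

Every turn ends with `<|eot_id|>`, which is why that token is listed in `terminators`: generation stops when the model closes its own turn.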

### 8. Ask a Question!

Now, let's test the RAG chatbot. Each call prints the query followed by Llama 3's answer, which is generated from the three chunks of the PDF that are closest to the query.

In [None]:
# Example usage
user_query = "What is red mud and how is it managed in India?"
chatbot(user_query)

In [None]:
user_query_2 = "Explain the concept of a circular economy for metals like aluminium and copper."
chatbot(user_query_2)

In [None]:
q = "what is langchain"
chatbot(q)

In [None]:
q = "What will the CO2 emissions be if I use recycled aluminium instead of bauxite?"
chatbot(q)