# **Instructions for Running This Assignment in Google Colab**

To ensure this notebook runs correctly, please follow these steps:

1. Unzip the Submission File: Extract all contents of the submitted .zip file. You should find the report (IB9LQ0_5652632_RAG_assignment.pdf), the Colab notebook (IB9LQ0_5652632_RAG_assignment.ipynb), and the three PDF corpus files (CNIL practice guide to GDPR_removed.pdf, GDPR EU_removed.pdf, ICO guide to GDPR_removed.pdf).

2. Open the Notebook in Google Colab: Upload and open IB9LQ0_5652632_RAG_assignment.ipynb in your Google Colab environment.

3. Set Up Your Google Gemini API Key (One-Time Setup):
    - This notebook uses the Google Gemini API, so you need your own Gemini API key.
    - If you don't have one, you can get one from Google AI Studio.
    - In Google Colab, click the 🔑 Secrets icon in the left sidebar.
    - Click "Add new secret".
    - Set the Name of the secret to GOOGLE_API_KEY (case-sensitive).
    - Paste your Gemini API key into the Value field.
    - Ensure "Notebook access" is toggled ON for this notebook.

4. Run the Notebook Cells Sequentially:
    - Initial Setup Cell (Installs Libraries): Run the first cell to install the required Python libraries.
    - File Upload Cell: When you run the cell that prompts for file upload, a file selection dialog will appear. Select the following three PDF corpus files from the unzipped submission (these are the corpus documents for the RAG system):
        - CNIL practice guide to GDPR_removed.pdf
        - GDPR EU_removed.pdf
        - ICO guide to GDPR_removed.pdf
    - Continue Running: Run the rest of the notebook cells in order. The notebook will load the uploaded PDFs, create document chunks, generate embeddings, set up the retriever, and perform the RAG queries.
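
The key lookup that the secrets setup enables can be sketched as follows. This is a minimal sketch, not a cell from the notebook: `load_gemini_key` is a hypothetical helper, and the fallback to an environment variable is an assumption added so the snippet also runs outside Colab.

```python
import os


def load_gemini_key(name="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets, or from the environment as a fallback."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Outside Colab, or if the secret is missing or not shared with this notebook
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Secret '{name}' not found; add it under Secrets in Colab.")
    return key
```

Calling `load_gemini_key()` before any Gemini request surfaces a missing or misnamed secret early, rather than partway through the querying section.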

# **1. Importing & Reading Files**

In [1]:
# Import Python's standard library to work with file paths and directories
import os

# Install the required Python libraries
# - langchain: for building the RAG pipeline
# - langchain-community: community-built integrations (i.e. PDF loading)
# - pypdf: for reading and parsing PDF documents
# - sentence-transformers: to later create embeddings
# - chromadb: vector database for storing and retrieving chunks semantically
!pip install langchain langchain-community pypdf sentence-transformers chromadb -q

# Import the PDF loader from the langchain_community package
# Reads PDF files and breaks them down page by page into structured text
from langchain_community.document_loaders import PyPDFLoader

# For file upload
from google.colab import files

# --- Step 1: Instructions and File Upload for the Marker ---
print("IMPORTANT: Please upload your 3 PDF corpus files (CNIL, GDPR EU, ICO Guide) now.")
print("A file selection dialog will appear. Please select the following files:")
print("  - CNIL practice guide to GDPR_removed.pdf")
print("  - GDPR EU_removed.pdf")
print("  - ICO guide to GDPR_removed.pdf")

uploaded = files.upload() # Will open a file selection dialog in your browser

# Optional: Confirm uploaded files
print("\nFiles uploaded by user:")
for fn in uploaded.keys():
    print(f'- "{fn}"')

# --- Step 2: Define the folder path within Colab's local environment ---
# Once uploaded via files.upload(), the PDFs are placed directly in /content/
pdf_folder = "/content/"

# --- Step 3: Load the PDF pages from the folder ---
# Create an empty list to hold all the pages from all PDFs
docs = []

# Loop through each file in the folder
for filename in os.listdir(pdf_folder):
    # Only load the three expected corpus files (exact filename match)
    # This prevents loading other temporary files that might be in /content/
    if filename in ["CNIL practice guide to GDPR_removed.pdf",
                    "GDPR EU_removed.pdf",
                    "ICO guide to GDPR_removed.pdf"]:
        # Create a PDF loader for the current file
        loader = PyPDFLoader(os.path.join(pdf_folder, filename))

        # Load the PDF into pages and add to the docs list
        # Each page is treated as a separate 'Document' object
        docs.extend(loader.load())

# Print out how many total pages were loaded from all PDFs
print(f"\nLoaded {len(docs)} pages from folder: {pdf_folder}")

IMPORTANT: Please upload your 3 PDF corpus files (CNIL, GDPR EU, ICO Guide) now.
A file selection dialog will appear. Please select the following files:
  - CNIL practice guide to GDPR_removed.pdf
  - GDPR EU_removed.pdf
  - ICO guide to GDPR_removed.pdf


Saving CNIL practice guide to GDPR_removed.pdf to CNIL practice guide to GDPR_removed (1).pdf
Saving GDPR EU_removed.pdf to GDPR EU_removed (1).pdf
Saving ICO guide to GDPR_removed.pdf to ICO guide to GDPR_removed (1).pdf

Files uploaded by user:
- "CNIL practice guide to GDPR_removed (1).pdf"
- "GDPR EU_removed (1).pdf"
- "ICO guide to GDPR_removed (1).pdf"

Loaded 465 pages from folder: /content/
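
The upload log above shows Colab saving re-uploaded files with a " (1)" suffix, which an exact-filename filter does not match. A minimal sketch of one way to tolerate that renaming follows; `canonical_name`, `is_corpus_file`, and `EXPECTED` are hypothetical names, and the suffix pattern is assumed from the log above.

```python
import re

EXPECTED = {
    "CNIL practice guide to GDPR_removed.pdf",
    "GDPR EU_removed.pdf",
    "ICO guide to GDPR_removed.pdf",
}

# Matches a trailing " (n)" immediately before the .pdf extension
_DUP_SUFFIX = re.compile(r" \(\d+\)(?=\.pdf$)")


def canonical_name(filename):
    """Strip a Colab-style duplicate suffix such as ' (1)' from a PDF filename."""
    return _DUP_SUFFIX.sub("", filename)


def is_corpus_file(filename):
    """True if the filename is one of the expected corpus PDFs, renamed or not."""
    return canonical_name(filename) in EXPECTED
```

If both the original and a renamed copy sit in `/content/`, a filter like this would load the same pages twice, so it is safer to load one file per canonical name.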


# **2. Creating document chunks**
with Semantic Chunking
---



In [2]:
# Import NLTK and download its sentence tokenizer data
import nltk
nltk.download("punkt")
# Download the 'punkt_tab' resource, also required by NLTKTextSplitter
nltk.download("punkt_tab")

# Use NLTK sentence-based splitter instead of character-based
from langchain.text_splitter import NLTKTextSplitter

# Create a sentence-based text splitter
text_splitter = NLTKTextSplitter(
    chunk_size=800,         # Target max chunk length, in characters (not tokens)
    chunk_overlap=100       # Overlap in characters, to preserve cross-sentence context
)

# Apply to already-loaded 'docs' list
splits = text_splitter.split_documents(docs)

# Print total number of semantic chunks created
print(f"Split into {len(splits)} semantically coherent chunks.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Split into 1856 semantically coherent chunks.
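
How sentence-based chunking with overlap behaves can be illustrated with a simplified sketch. This is not `NLTKTextSplitter`'s actual implementation: `pack_sentences` is a hypothetical helper, the input is assumed to be already split into sentences, and all lengths are counted in characters.

```python
def pack_sentences(sentences, chunk_size=800, chunk_overlap=100, sep="\n\n"):
    """Greedily merge sentences into chunks of at most chunk_size characters.

    Whole sentences from the end of a finished chunk (up to chunk_overlap
    characters) are carried over to start the next chunk. A single sentence
    longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], []

    def length(parts):
        return len(sep.join(parts))

    for sentence in sentences:
        if current and length(current + [sentence]) > chunk_size:
            chunks.append(sep.join(current))
            # Drop sentences from the front until the carried-over tail fits
            while current and length(current) > chunk_overlap:
                current = current[1:]
        current.append(sentence)
    if current:
        chunks.append(sep.join(current))
    return chunks
```

Because chunk boundaries fall only between sentences, no chunk starts or ends mid-sentence; the trade-off is that chunk lengths vary more than with a fixed-width character splitter.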


In [3]:
# Print the first 3 chunks to visually inspect the structure
for i, split in enumerate(splits[:3]):
    print(f"--- Chunk {i + 1} ---")
    print(split.page_content)
    print()

--- Chunk 1 ---
4
FOREWORD
Security is an essential part of the protection of personal data.

It is binding on any data controller and 
data processor through Article 32 of the General Data Protection Regulation1  (GDPR).

In principle, 
each processing operation must be subjected to a set of security measures decided according to the 
context, namely “useful precautions, having regard to the nature of the data and the risks presented 
by the processing” ( Article 121 of the French Data Protection Act 2 ).

The GDPR specifies that the 
protection of personal data requires taking “appropriate technical and organisational measures to 
ensure a level of security appropriate to the risk” for the rights and freedoms of natural persons, 
including their privacy.

--- Chunk 2 ---
To assess the measures to be put in place, two complementary approaches are to be deployed:
 – the establishment of a security base incorporating good practices resulting from years of      
capitalising on hygiene a

# **3. Creating embeddings & vector storage**

In [4]:
!pip install tqdm -q
import os
import shutil

from langchain_community.vectorstores import Chroma
from tqdm import tqdm

# Load BGE-base embedding model (768 dimensions)
from langchain.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en")

# Vectorstore setup

persist_directory = "db_bge_base_embeddings"

# Clear old vectorstore if exists
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)

# Embed chunks
batch_size = 300
print(f"📦 Embedding {len(splits)} chunks with BGE-base...")

vectorstore = Chroma.from_documents(
    documents=splits[:batch_size],
    embedding=embedding_model,
    persist_directory=persist_directory
)
vectorstore.persist()

for i in tqdm(range(batch_size, len(splits), batch_size), desc="Embedding batches"):
    end = min(i + batch_size, len(splits))
    vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)
    vectorstore.add_documents(splits[i:end])
    vectorstore.persist()

print(f"All chunks embedded and stored in '{persist_directory}' with BGE-base.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


📦 Embedding 1856 chunks with BGE-base...


Embedding batches: 100%|██████████| 6/6 [17:18<00:00, 173.12s/it]

All chunks embedded and stored in 'db_bge_base_embeddings' with BGE-base.
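
The batching arithmetic behind the progress bar above can be checked with a short sketch. `batch_ranges` is a hypothetical helper; the numbers 1856 and 300 come from the cell and its output.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batch_size steps."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(1856, 300))
print(len(ranges))   # 7 batches in total
print(ranges[-1])    # (1800, 1856): the final, shorter batch
```

The first batch is consumed by `Chroma.from_documents`, which leaves the six batches reported by the progress bar (`6/6`).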





Run an initial test query to check that chunks are retrieved correctly.

In [5]:
# Set up the retriever (with default settings it returns the top 4 chunks)
retriever = vectorstore.as_retriever()

# Create an interactive input box in Google Colab
query = "What rights does an individual have under the GDPR regarding their personal data?"  #@param {type:"string"}

# Perform retrieval
results = retriever.invoke(query)

# Display results
print(f"\n🔍 Retrieval for: '{query}'")
print(f"Retrieved {len(results)} documents:\n")

# Show retrieved content
for i, doc in enumerate(results):
    print(f"--- Document {i+1} ---")
    print(doc.page_content[:500])  # Show preview of the text
    print(f"Source: {doc.metadata.get('source', 'N/A')}\n")



🔍 Retrieval for: 'What rights does an individual have under the GDPR regarding their personal data?'
Retrieved 4 documents:

--- Document 1 ---
Individual rights
The UK GDPR provides the following rights for individuals:
The right to be informed1.

The right of access2.

The right to rectification3.

The right to erasure4.

The right to restrict processing5.

The right to data portability6.

The right to object7.

Rights in relation to automated decision making and profiling.8.

This part of the guide explains these rights.

14 October 2022 - 1.1.17 99
Source: /content/ICO guide to GDPR_removed.pdf

--- Document 2 ---
The right to withdraw consent ✓✓✓✓ ✓✓✓✓
The right to lodge a complaint with a
supervisory authority
✓✓✓✓ ✓✓✓✓ 
The source of the personal data  ✓✓✓✓
The details of whether individuals are under a
statutory or contractual obligation to provide
the personal data
✓✓✓✓  
The details of the existence of automated
decision-making, including profiling
✓✓✓✓ ✓✓✓✓
When should we p
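
The idea behind this retrieval step, ranking stored chunk vectors by similarity to the query vector and keeping the top k, can be sketched in plain Python. This is an illustration only: `cosine` and `top_k` are hypothetical helpers, cosine similarity is an assumed stand-in, and Chroma's own index and distance metric are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]
```

With `k=4` this mirrors the four documents returned above; a real vector store replaces the full sort with an approximate nearest-neighbour index.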

# **4. LLM setup & RAG pipeline**
with Reranking
---

In [6]:
# Install the libraries needed to run the reranker model
!pip install -q transformers accelerate

# Install rank_bm25, which BM25Retriever depends on
!pip install -q rank_bm25

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
from langchain_community.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import BM25Retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.schema import Document

embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en")
persist_directory = "db_bge_base_embeddings"
vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)



# **5. Querying**
Requires a Gemini API key stored as the Colab secret 'GOOGLE_API_KEY'
---

In [7]:
# Imports for the reranker (the model itself is loaded in the next cell)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Import for Colab secrets access
from google.colab import userdata
import google.generativeai as genai

In [8]:
# Define your query
query = "What are the lawful bases for processing personal data under GDPR?" #@param {type:"string"}

reranker_name = "BAAI/bge-reranker-large"
tokenizer = AutoTokenizer.from_pretrained(reranker_name)
reranker = AutoModelForSequenceClassification.from_pretrained(reranker_name)
reranker.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
reranker.to(device)

# Step 1: Build a keyword-based BM25Retriever (from the same docs used in embedding)
bm25_retriever = BM25Retriever.from_documents(splits)  # 'splits' = your chunked documents
bm25_retriever.k = 10  # You can adjust

# Step 2: Build a vector-based retriever from the semantic DB
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Step 3: Define hybrid retriever
def hybrid_retrieve(query):
    # Use .invoke(), as in the test query above; get_relevant_documents() is deprecated
    bm25_docs = bm25_retriever.invoke(query)
    vector_docs = vector_retriever.invoke(query)

    # Combine & deduplicate (keyed on chunk text; BM25 hits are listed first)
    all_docs = {doc.page_content: doc for doc in bm25_docs + vector_docs}
    return list(all_docs.values())

# Step 4: Apply reranker to hybrid results
def rerank_docs(query, docs, top_n=8):
    # Tokenize each (query, chunk) pair; pairs longer than 512 tokens are truncated
    inputs = [
        tokenizer(query, doc.page_content, return_tensors="pt", truncation=True, padding=True, max_length=512).to(device)
        for doc in docs
    ]
    with torch.no_grad():
        scores = []
        for i, inp in enumerate(inputs):
            output = reranker(**inp)
            # One relevance logit per pair; a higher value means more relevant
            score = output.logits[0].item()
            scores.append((score, docs[i]))
    # Keep the top_n highest-scoring chunks
    reranked = sorted(scores, key=lambda x: x[0], reverse=True)[:top_n]
    return [doc for _, doc in reranked]

initial_docs = hybrid_retrieve(query)
retrieved_docs = rerank_docs(query, initial_docs)

# Join the reranked chunks into a single context string
context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# Prompt for Gemini
prompt = f"""
You are an expert legal assistant with detailed knowledge of data protection law,
especially the General Data Protection Regulation (GDPR). Answer the following question
clearly and precisely using only the information provided in the context below.

If the answer is not found in the context, do not hallucinate. If the answer is not clearly available from
the context provided, say "The information is not available in the context provided."

Context:
{context}

Question:
{query}
"""

# Use Gemini
import google.generativeai as genai
from google.generativeai import GenerationConfig

# Read the API key from the Colab secret 'GOOGLE_API_KEY'
genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))
model = genai.GenerativeModel("gemini-2.0-flash")

generation_config = GenerationConfig(max_output_tokens=1024, temperature=0.3)
response = model.generate_content(prompt, generation_config=generation_config)

# Output
print("🔍 Query:")
print(query)
print("\n🧠 Gemini's Answer:")
print(response.text)



🔍 Query:
What are the lawful bases for processing personal data under GDPR?

🧠 Gemini's Answer:
The lawful bases for processing are set out in Article 6 of the UK GDPR. At least one of these must apply whenever you process personal data: (a) Consent: the individual has given clear consent for you to process their personal data for a specific purpose. Article 6(1)(d) provides a lawful basis for processing where processing is necessary in order to protect the vital interests of the data subject or of another natural person. There are six available lawful bases for processing.
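
The merge-then-rerank flow used in this cell can be sketched without any model downloads. This is a minimal sketch under stated assumptions: `dedupe_by_text`, `rerank`, and `overlap_score` are hypothetical helpers, and the toy word-overlap scorer merely stands in for the `BAAI/bge-reranker-large` cross-encoder.

```python
def dedupe_by_text(*ranked_lists):
    """Merge ranked lists of texts, keeping the first occurrence of each text."""
    seen, merged = set(), []
    for ranked in ranked_lists:
        for text in ranked:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


def rerank(query, texts, score_fn, top_n=8):
    """Score every (query, text) pair and return the top_n texts, best first."""
    scored = [(score_fn(query, text), text) for text in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query, text):
    """Toy scorer: number of distinct query words that also appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


bm25_hits = ["consent is a lawful basis", "cookies"]
vector_hits = ["consent is a lawful basis", "vital interests basis"]
candidates = dedupe_by_text(bm25_hits, vector_hits)
best = rerank("lawful basis", candidates, overlap_score, top_n=2)
```

Retrieving a wider candidate pool (10 + 10 here) and letting a stronger scorer pick the final 8 is the usual motivation for this two-stage design: the retrievers favour recall, and the reranker favours precision.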



In [9]:
# View retrieved chunks
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Chunk {i+1} ---")
    print(doc.page_content[:1000])  # print first 1000 characters
    print(f"Source: {doc.metadata.get('source', 'N/A')}")


--- Chunk 1 ---
In brief
What are the lawful bases for processing?

When is processing ‘necessary’?

Why is the lawful basis for processing important?

How do we decide which lawful basis applies?

Is this different for public authorities?

Can we change our lawful basis?

What happens if we have a new purpose?

How should we document our lawful basis?

What do we need to tell people?

What about special category data?

What about criminal offence data?

What are the lawful bases for processing?

The lawful bases for processing are set out in Article 6 of the UK GDPR.

At least one of these must apply
whenever you process personal data:
(a) Consent: the individual has given clear consent for you to process their personal data for a specific
purpose.
Source: /content/ICO guide to GDPR_removed.pdf

--- Chunk 2 ---
What are ‘vital interests’?

When is the vital interests basis likely to apply?

What else should we consider?

What does the UK GDPR say?

Article 6(1)(d) provides a lawful b