<a href="https://colab.research.google.com/github/CrisMcode111/DI_Bootcamp/blob/main/w7_d2_Exercises_XP_VDB_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises XP: Vector Databases and RAG
Use this guided notebook and fill each TODO before running cells.

## What you'll learn
- Vector search strategies (KNN, ANN) and evaluation.
- Vector database utility (similarity search, RAG).
- Differences between vector DBs, libraries, and plugins.
- Best practices for vector store usage and performance.
- How LMs use context; embedding generation and storage.
- Querying vector stores and applying LMs for QA with retrieved context.

## What you'll build
A functional RAG pipeline with FAISS and ChromaDB, plus QA over retrieved context using a Hugging Face model.

## 0. Setup
Run the install cell once. If your platform needs system deps (e.g., libomp for FAISS), follow instructions in comments.

In [1]:
!pip install 'numpy<2' --upgrade
!pip install pydantic==1.10.13 faiss-cpu==1.8.0 chromadb==0.3.21 sentence-transformers transformers --upgrade



In [2]:
import os
import json
from pathlib import Path
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer, InputExample
import chromadb
from chromadb.config import Settings
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from IPython.display import display
os.makedirs('cache', exist_ok=True)


## 🌟 Exercise 1 · Data loading and preparation

In [9]:
!wget -O labelled_newscatcher_dataset.zip "https://github.com/devtlv/Datasets-GEN-AI-Bootcamp/raw/refs/heads/main/Week%205/Day%204%20-%20Diving%20Deep%20into%20Vector%20Databases%20and%20RAG%20Chatbots/labelled_newscatcher_dataset.zip"
!unzip -o labelled_newscatcher_dataset.zip

--2025-11-17 11:41:24--  https://github.com/devtlv/Datasets-GEN-AI-Bootcamp/raw/refs/heads/main/Week%205/Day%204%20-%20Diving%20Deep%20into%20Vector%20Databases%20and%20RAG%20Chatbots/labelled_newscatcher_dataset.zip
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/devtlv/Datasets-GEN-AI-Bootcamp/refs/heads/main/Week%205/Day%204%20-%20Diving%20Deep%20into%20Vector%20Databases%20and%20RAG%20Chatbots/labelled_newscatcher_dataset.zip [following]
--2025-11-17 11:41:24--  https://media.githubusercontent.com/media/devtlv/Datasets-GEN-AI-Bootcamp/refs/heads/main/Week%205/Day%204%20-%20Diving%20Deep%20into%20Vector%20Databases%20and%20RAG%20Chatbots/labelled_newscatcher_dataset.zip
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to medi

In [11]:
import pandas as pd

pdf = pd.read_csv("labelled_newscatcher_dataset.csv", sep=';', encoding='latin1', engine='python')
pdf.head()

Unnamed: 0,topic,link,domain,published_date,title,lang
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en


In [12]:
from pathlib import Path
import pandas as pd
from IPython.display import display

# point to the dataset path
data_path = "labelled_newscatcher_dataset.csv"  # The file is in the current directory

# adjust path if needed
pdf = pd.read_csv(data_path, sep=';', encoding='latin1', engine='python')

# replace with your own identifier logic if provided
if 'id' not in pdf.columns:
    # start from 0 or 1, as the instructors prefer; here from 0, as before
    pdf['id'] = range(len(pdf))

display(pdf.head())

# create a manageable subset (e.g., first 1000 rows)
max_rows = 1000
pdf_subset = pdf.head(min(max_rows, len(pdf)))

display(pdf_subset[['id', 'title']].head())

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


Unnamed: 0,id,title
0,0,A closer look at water-splitting's solar fuel ...
1,1,"An irresistible scent makes locusts swarm, stu..."
2,2,Artificial intelligence warning: AI will know ...
3,3,Glaciers Could Have Sculpted Mars Valleys: Study
4,4,Perseid meteor shower 2020: What time and how ...


In [13]:
# point to the dataset path
data_path = 'labelled_newscatcher_dataset.csv'

# adjust path if needed
pdf = pd.read_csv(data_path, sep=';')

# replace with your own identifier logic if provided
if 'id' not in pdf.columns:
    pdf['id'] = range(len(pdf))  # generate simple IDs, 0 to n-1

display(pdf.head())

# create a manageable subset (e.g., first 1000 rows)
subset_size = 1000
pdf_subset = pdf.head(min(subset_size, len(pdf)))

display(pdf_subset[['id', 'title']].head())


Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


Unnamed: 0,id,title
0,0,A closer look at water-splitting's solar fuel ...
1,1,"An irresistible scent makes locusts swarm, stu..."
2,2,Artificial intelligence warning: AI will know ...
3,3,Glaciers Could Have Sculpted Mars Valleys: Study
4,4,Perseid meteor shower 2020: What time and how ...


## 🌟 Exercise 2 · Vectorization with Sentence Transformers

In [14]:
def example_create_fn(idx: int, text: str) -> InputExample:
    return InputExample(guid=str(idx), texts=[text], label=0.0)

faiss_train_examples = [example_create_fn(i, t) for i, t in enumerate(pdf_subset['title'].tolist())]
faiss_train_examples[:2]


[<sentence_transformers.readers.InputExample.InputExample at 0x7e78770c6f90>,
 <sentence_transformers.readers.InputExample.InputExample at 0x7e787703c5c0>]

In [15]:
model = SentenceTransformer('all-MiniLM-L6-v2')
titles_list = pdf_subset['title'].tolist()
faiss_title_embedding = model.encode(titles_list, convert_to_numpy=True, show_progress_bar=True)
len(faiss_title_embedding), len(faiss_title_embedding[0])


(1000, 384)

In [21]:
print("Number of titles:", len(faiss_title_embedding))
print("Embedding dimension:", len(faiss_title_embedding[0]))


Number of titles: 1000
Embedding dimension: 384
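
The setup cell creates a `cache` directory that the later cells never use. One way it could be put to work is to store the embeddings on disk so they are not recomputed on every run. The sketch below is only an illustration: `load_or_compute`, `fake_encode`, and the file name `title_embeddings.npy` are hypothetical, and random vectors stand in for `model.encode(titles_list)` so the block runs without downloading a model.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array stored at `path`, computing and saving it on first use."""
    path = Path(path)
    if path.exists():
        return np.load(path)
    arr = np.asarray(compute_fn(), dtype="float32")
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, arr)
    return arr


def fake_encode():
    # Stand-in for model.encode(titles_list): same shape as the real embeddings.
    rng = np.random.default_rng(0)
    return rng.normal(size=(1000, 384))


# A temporary directory is used here; in the notebook this could be 'cache/'.
cache_file = Path(tempfile.mkdtemp()) / "title_embeddings.npy"
first = load_or_compute(cache_file, fake_encode)   # computes and writes the file
second = load_or_compute(cache_file, fake_encode)  # reads the file back
print(first.shape, np.array_equal(first, second))
```

If the subset size or the embedding model changes, the cached file no longer matches the data and should be deleted or given a different name.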


## 🌟 Exercise 3 · FAISS indexing and search

In [16]:
pdf_to_index = pdf_subset
id_index = pdf_to_index['id'].to_numpy().astype(np.int64)
content_encoded_normalized = faiss_title_embedding.astype('float32')
faiss.normalize_L2(content_encoded_normalized)
index_content = faiss.IndexIDMap(faiss.IndexFlatIP(content_encoded_normalized.shape[1]))
index_content.add_with_ids(content_encoded_normalized, id_index)
index_content.ntotal


1000
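
`IndexFlatIP` ranks by inner product, and the vectors were L2-normalized before being added, so the scores it returns are cosine similarities. The NumPy sketch below illustrates the same exact (brute-force) top-k search on toy 2-D vectors. `cosine_top_k` is a hypothetical helper, not part of FAISS, and a check of this kind is only practical for small collections.

```python
import numpy as np


def cosine_top_k(query, vectors, k=3):
    """Exact top-k by cosine similarity; returns (row indices, scores), best first."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q                    # inner product of unit vectors = cosine
    top = np.argsort(-scores)[:k]     # indices of the k largest scores
    return top, scores[top]


vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.0]])
query = np.array([2.0, 0.0])
idx, scores = cosine_top_k(query, vectors, k=2)
print(idx, scores)
```

Vector length does not affect the ranking here: `[3, 3]` scores about 0.707 against the query regardless of its magnitude, which is the effect `faiss.normalize_L2` has on the indexed titles.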

In [17]:
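def search_content(query: str, pdf_to_index: pd.DataFrame, k: int = 3):
    # encode the query using the sentence transformer model
    query_vector = model.encode([query], convert_to_numpy=True).astype('float32')
    faiss.normalize_L2(query_vector)
    sims, ids = index_content.search(query_vector, k)
    # isin() returns rows in DataFrame order, not FAISS rank order, so map each
    # id to its score explicitly rather than assigning sims[0] by position
    score_by_id = dict(zip(ids[0].tolist(), sims[0].tolist()))
    results = pdf_to_index[pdf_to_index['id'].isin(list(score_by_id))].copy()
    results['similarities'] = results['id'].map(score_by_id)
    return results.sort_values('similarities', ascending=False)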

display(search_content('animal', pdf_to_index, k=5))


Unnamed: 0,topic,link,domain,published_date,title,lang,id,similarities
99,TECHNOLOGY,https://www.gematsu.com/2020/08/ghostwire-toky...,gematsu.com,2020-08-07 16:43:13,Ghostwire: Tokyo confirms dog petting,en,99,0.391902
176,TECHNOLOGY,https://www.pushsquare.com/news/2020/08/random...,pushsquare.com,2020-08-03 16:30:00,Random: You Can Pick Up and Pet Cats in Assass...,en,176,0.376784
762,SCIENCE,https://af.reuters.com/article/worldNews/idAFK...,af.reuters.com,2020-08-13 16:51:00,'Secret' life of sharks: Study reveals their s...,en,762,0.344058
928,SCIENCE,https://www.thecut.com/2020/08/scientists-say-...,thecut.com,2020-08-04 12:52:00,Just Let This Lizard Be a Dinosaur,en,928,0.317387
975,HEALTH,https://www.news-medical.net/news/20200813/Res...,news-medical.net,2020-08-13 05:18:00,Researchers explore social behavior of animals...,en,975,0.295497


In [22]:
display(search_content("animal", pdf_to_index, k=5))
display(search_content("politics", pdf_to_index, k=5))
display(search_content("economy", pdf_to_index, k=5))


Unnamed: 0,topic,link,domain,published_date,title,lang,id,similarities
99,TECHNOLOGY,https://www.gematsu.com/2020/08/ghostwire-toky...,gematsu.com,2020-08-07 16:43:13,Ghostwire: Tokyo confirms dog petting,en,99,0.391902
176,TECHNOLOGY,https://www.pushsquare.com/news/2020/08/random...,pushsquare.com,2020-08-03 16:30:00,Random: You Can Pick Up and Pet Cats in Assass...,en,176,0.376784
762,SCIENCE,https://af.reuters.com/article/worldNews/idAFK...,af.reuters.com,2020-08-13 16:51:00,'Secret' life of sharks: Study reveals their s...,en,762,0.344058
928,SCIENCE,https://www.thecut.com/2020/08/scientists-say-...,thecut.com,2020-08-04 12:52:00,Just Let This Lizard Be a Dinosaur,en,928,0.317387
975,HEALTH,https://www.news-medical.net/news/20200813/Res...,news-medical.net,2020-08-13 05:18:00,Researchers explore social behavior of animals...,en,975,0.295497


Unnamed: 0,topic,link,domain,published_date,title,lang,id,similarities
488,HEALTH,https://www.expressandstar.com/news/crime/2020...,expressandstar.com,2020-08-08 12:44:00,28 illegal parties broken up in one night by W...,en,488,0.264689
512,TECHNOLOGY,https://www.malaymail.com/news/tech-gadgets/20...,malaymail.com,2020-08-08 02:14:27,"Windows, Gates and a firewall: Microsoft’s del...",en,512,0.235076
611,TECHNOLOGY,https://www.rt.com/news/497937-belarus-opposit...,rt.com,2020-08-13 15:08:00,‘How can we help?’: Musk responds to Belarus o...,en,611,0.226675
710,HEALTH,https://www.bigeasymagazine.com/2020/08/10/the...,bigeasymagazine.com,2020-08-10 15:21:07,The Ethics Of AI And Death,en,710,0.213847
792,HEALTH,https://news.trust.org/item/20200806141021-qe2re,news.trust.org,2020-08-06 14:58:00,It's not for me: speed of COVID-19 vaccine rac...,en,792,0.212101


Unnamed: 0,topic,link,domain,published_date,title,lang,id,similarities
88,TECHNOLOGY,https://www.deccanchronicle.com/technology/in-...,deccanchronicle.com,2020-08-06 15:11:00,Nintendo profit soars as people play more game...,en,88,0.312264
231,TECHNOLOGY,https://www.prweb.com/releases/the_npd_group_u...,prweb.com,2020-08-10 10:02:45,The NPD Group: US Consumer Spend on Video Game...,en,231,0.309242
418,TECHNOLOGY,https://www.kotaku.com.au/2020/08/the-fascinat...,kotaku.com.au,2020-08-13 23:32:00,"The Fascinating Web Of Entropia Universe, The ...",en,418,0.302339
672,TECHNOLOGY,https://weeklywall.com/shore-power-market-anal...,weeklywall.com,2020-08-17 16:02:00,"Shore Power Market Analysis With Key Players, ...",en,672,0.295278
856,SCIENCE,https://www.thebusinessdesk.com/eastmidlands/n...,thebusinessdesk.com,2020-08-10 05:26:37,Five-figure funding boost for social enterpris...,en,856,0.271413


## 🌟 Exercise 4 · ChromaDB collection and querying

In [18]:
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
collection_name = 'my_news'
if any(c.name == collection_name for c in chroma_client.list_collections()):
    chroma_client.delete_collection(name=collection_name)
collection = chroma_client.create_collection(name=collection_name)
collection.add(
    documents=pdf_subset['title'][:100].tolist(),
    metadatas=[{'topic': t} for t in pdf_subset['topic'][:100].tolist()],
    ids=[str(i) for i in pdf_subset['id'][:100].tolist()],
)
results = collection.query(query_texts=['space'], n_results=10)
print(json.dumps(results, indent=2))


ERROR:chromadb.telemetry.posthog:Failed to send telemetry event client_start: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.posthog:Failed to send telemetry event collection_add: capture() takes 1 positional argument but 3 were given


{
  "ids": [
    [
      "72",
      "7",
      "30",
      "26",
      "23",
      "76",
      "69",
      "40",
      "47",
      "75"
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
      "Orbital space tourism set for rebirth in 2021",
      "NASA drops \"insensitive\" nicknames for cosmic objects",
      "\u2018It came alive:\u2019 NASA astronauts describe experiencing splashdown in SpaceX Dragon",
      "Hubble Uses Moon As \u201cMirror\u201d to Study Earth\u2019s Atmosphere \u2013 Proxy in Search of Potentially Habitable Planets Around Other Stars",
      "Australia's small yet crucial part in the mission to find life on Mars",
      "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
      "SpaceX's Starship spacecraft saw 150 meters high",
      "NASA\u2019s InSight lander shows what\u2019s beneath Mars\u2019 surface",
      "Alien base on Mercury: ET hunters claim to find hu
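
The query result nests one list per query under keys such as `ids` and `documents`, which is awkward to read as raw JSON. A small sketch of flattening it into ranked rows follows. `flatten_query_result` is a hypothetical helper, and `sample` copies only the first three ids and documents from the output above, not a full ChromaDB response.

```python
def flatten_query_result(result, query_index=0):
    """Pair each returned id with its document for one query, keeping rank order."""
    ids = result["ids"][query_index]
    docs = result["documents"][query_index]
    return [
        {"rank": rank, "id": doc_id, "document": doc}
        for rank, (doc_id, doc) in enumerate(zip(ids, docs), start=1)
    ]


# Hand-built sample with the same nesting as the printed result.
sample = {
    "ids": [["72", "7", "30"]],
    "documents": [[
        "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
        "Orbital space tourism set for rebirth in 2021",
        'NASA drops "insensitive" nicknames for cosmic objects',
    ]],
}

rows = flatten_query_result(sample)
for row in rows:
    print(row["rank"], row["id"], row["document"])
```

The outer list index corresponds to the position of the query in `query_texts`, so `query_index=0` selects the results for `'space'` in the cell above.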

## 🌟 Exercise 5 · Question answering with a Hugging Face model

In [23]:
model_id = 'google/flan-t5-small'  # lightweight, better than tiny GPT-2 for QA
pipe = pipeline(
    'text2text-generation',
    model=model_id,
    tokenizer=model_id,
    max_new_tokens=128,
    # decoding is greedy by default; temperature and top_p only take effect
    # when do_sample=True, so they are omitted here
    device_map='auto'
)

question = "What's the latest news on space development?"
context_docs = results['documents'][0][:3]
context = ' '.join(context_docs)
prompt = (
    f"Answer the question using only the context.\n"
    f"Context: {context}\n"
    f"Question: {question}\n"
    f"Answer:\n"
)
response = pipe(prompt)[0]['generated_text']
print(response)


Device set to use cpu


NASA drops "insensitive" nicknames for cosmic objects
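
The answer above simply repeats one of the retrieved titles. The prompt joins the titles with single spaces, so the model cannot tell where one document ends and the next begins. One possible variation, sketched below, numbers each retrieved document on its own line. `build_prompt` is a hypothetical helper, and whether it improves answers from a model as small as `flan-t5-small` has not been tested here.

```python
def build_prompt(question, docs, max_docs=3):
    """Build a QA prompt with each context document on its own numbered line."""
    lines = [f"[{i}] {doc.strip()}" for i, doc in enumerate(docs[:max_docs], start=1)]
    context = "\n".join(lines)
    return (
        "Answer the question using only the context.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Titles taken from the retrieved results shown earlier.
docs = [
    "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
    "Orbital space tourism set for rebirth in 2021",
    'NASA drops "insensitive" nicknames for cosmic objects',
    "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
]
prompt_text = build_prompt("What's the latest news on space development?", docs)
print(prompt_text)
```

In the notebook, the resulting string could be passed to `pipe(...)` in place of `prompt`.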
