# RAG

A notebook demonstrating how to use Retrieval-Augmented Generation (RAG) with an LLM.

## Chroma Demo

We will use ChromaDB as the vector store. `chromadb.Client()` creates an in-memory client, which is enough for this quick demo.

In [1]:
from dotenv import load_dotenv
import chromadb
import os

load_dotenv()  # load environment variables from a local .env file

chroma_client = chromadb.Client()

In [2]:
collection = chroma_client.get_or_create_collection(name="main")
collection

Collection(name=main)

In [3]:
chroma_client.list_collections()

[Collection(name=main)]

In [4]:
# quick demo of how adding documents works
# by default, Chroma embeds the documents with all-MiniLM-L6-v2
collection.add(
    ids=["id1", "id2", "id3", "id4"],
    documents=[
        "document sur un cfc / apprentissage",
        "document sur l'epfl",
        "document sur 42 Lausanne",
        "document sur université de fribougr",
    ]
)

print("Number of items in the collection:", collection.count())

Number of items in the collection: 4


In [5]:
out = collection.query(
    query_texts=["J'aimerais faire un apprentissage"],
    n_results=1,
)

out

{'ids': [['id1']],
 'embeddings': None,
 'documents': [['document sur un cfc / apprentissage']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None]],
 'distances': [[0.9254448413848877]]}
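The `distances` value drives the ranking: the smaller the number, the closer the document embedding is to the query embedding. The following is a minimal sketch of that idea, assuming the collection uses Chroma's default `l2` space (squared Euclidean distance). The 3-dimensional vectors are made up to stand in for real embeddings, and `squared_l2` and `rank_by_distance` are illustrative helpers, not part of the Chroma API.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_distance(
    query: list[float], docs: dict[str, list[float]], n_results: int = 1
) -> list[tuple[str, float]]:
    """Return the n_results (doc_id, distance) pairs closest to the query."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in docs.items()]
    # smallest distance first, like the order of a vector store query result
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Toy vectors standing in for real embeddings (made up for illustration).
toy_docs = {
    "id1": [1.0, 0.0, 0.0],
    "id2": [0.0, 1.0, 0.0],
    "id3": [0.0, 0.0, 1.0],
}
toy_query = [0.9, 0.1, 0.0]

closest = rank_by_distance(toy_query, toy_docs, n_results=2)
print(closest)
```

Because the absolute value depends on the embedding model and the distance space, distances are mostly useful for comparing documents within one query, not across setups.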

In [6]:
out = collection.query(
    query_texts=["J'aimerais aller au collège pour ensuite faire l'université"],
    n_results=2,
)

# the "documents" key holds the text of the matching docs
out["documents"]

[['document sur université de fribougr', "document sur l'epfl"]]
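Every field of the query result is a list of lists: one inner list per query text. A small helper can turn that into one record per hit, which is easier to read and to pass on. This is only a sketch: `flatten_query_result` is a hypothetical helper name, and the sample dictionary is copied from the single-result query shown earlier.

```python
def flatten_query_result(result: dict, query_index: int = 0) -> list[dict]:
    """Turn the nested query result into one record per hit for one query."""
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    distances = result["distances"][query_index]
    return [
        {"id": doc_id, "document": doc, "distance": dist}
        for doc_id, doc, dist in zip(ids, documents, distances)
    ]


# Result shape copied from the single-result query earlier in the notebook.
sample = {
    "ids": [["id1"]],
    "documents": [["document sur un cfc / apprentissage"]],
    "metadatas": [[None]],
    "distances": [[0.9254448413848877]],
}

hits = flatten_query_result(sample)
for hit in hits:
    print(hit["id"], round(hit["distance"], 3), hit["document"])
```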

In [7]:
chroma_client.delete_collection(name="main")

## Beautiful Soup: HTML to Text

In [8]:
import requests
from bs4 import BeautifulSoup

url = "www.orientation.ch/dyn/show/1900?id=152"

res = requests.get("https://" + url)

In [9]:
# send a browser-like User-Agent so the sites treat us as a normal visitor ;)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}


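def url_to_string(url: str) -> str:
    """Download a page and return its visible text as a single string."""
    # the URLs in this notebook omit the scheme, so add one if it is missing
    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    # requests has no default timeout, so set one to avoid hanging forever
    res = requests.get(url, headers=HEADERS, timeout=30)
    res.raise_for_status()

    soup = BeautifulSoup(res.text, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace removed
    return " ".join(soup.stripped_strings)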

site_text = url_to_string(url)

print("Number of chars:", len(site_text))
print("Number of words:", len(site_text.split()))

Number of chars: 30161
Number of words: 3883
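A single page is close to 3,900 words, which is long for a sentence-embedding model. As far as I know, all-MiniLM-L6-v2 truncates inputs beyond 256 word pieces by default, so embedding a whole page would mostly reflect its beginning. One common workaround is to split the text into overlapping chunks and embed each chunk separately. The sketch below is one simple way to do that; `chunk_words` and its default sizes are assumptions for illustration, not values taken from this notebook.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of chunk_size words that share overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Synthetic text with the same word count as the page above.
sample_text = " ".join(f"w{i}" for i in range(3883))
chunks = chunk_words(sample_text)
print("Number of chunks:", len(chunks))
print("Words in last chunk:", len(chunks[-1].split()))
```

Splitting on words is crude (it ignores sentences and headings), but it keeps the example dependency-free.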


In [10]:
import pandas as pd

df = pd.read_csv("../links.csv")
df.head()

Unnamed: 0,title,link
0,informaticen cfc,www.orientation.ch/dyn/show/1900?id=152
1,informaticen es,www.orientation.ch/dyn/show/1900?id=885
2,informatique université de fribourg,www.unifr.ch/inf/fr/informatique
3,informatique de gestion université de fribourg,www.unifr.ch/inf/fr/informatique-de-gestion
4,informatique université de genève,www.unige.ch/dinfo/formations/bachelor


In [11]:
url_list = df["link"].to_list()
url_text_list = [url_to_string(url) for url in url_list]
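The list comprehension stops at the first failing URL, because `raise_for_status()` raises on any HTTP error. For a longer link list it can help to collect failures and keep going. The sketch below assumes a hypothetical helper `fetch_all` that takes the download function as a parameter; `fake_fetch` and the `www.example.org` URLs exist only so the example runs without network access.

```python
from typing import Callable


def fetch_all(
    urls: list[str], fetch: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Fetch every URL; return (texts by URL, error messages by URL)."""
    texts: dict[str, str] = {}
    errors: dict[str, str] = {}
    for url in urls:
        try:
            texts[url] = fetch(url)
        except Exception as exc:  # broad on purpose: record the failure, keep going
            errors[url] = f"{type(exc).__name__}: {exc}"
    return texts, errors


def fake_fetch(url: str) -> str:
    """Stand-in for a real downloader so the sketch runs offline."""
    if "broken" in url:
        raise ValueError("simulated HTTP error")
    return f"text of {url}"


texts, errors = fetch_all(["www.example.org/a", "www.example.org/broken"], fake_fetch)
print("fetched:", list(texts))
print("failed:", errors)
```

In this notebook, `url_to_string` could be passed as the `fetch` argument instead of `fake_fetch`.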

In [12]:
df["url_text"] = url_text_list
df.head()

Unnamed: 0,title,link,url_text
0,informaticen cfc,www.orientation.ch/dyn/show/1900?id=152,Informaticien CFC / Informaticienne CFC - orie...
1,informaticen es,www.orientation.ch/dyn/show/1900?id=885,Informaticien ES / Informaticienne ES - orient...
2,informatique université de fribourg,www.unifr.ch/inf/fr/informatique,Informatique | Département d'informatique | U...
3,informatique de gestion université de fribourg,www.unifr.ch/inf/fr/informatique-de-gestion,Informatique de gestion | Département d'inform...
4,informatique université de genève,www.unige.ch/dinfo/formations/bachelor,Bachelor en sciences informatiques Présentatio...


## Model List

Infomaniak offers two embedding models:
- mini_lm_l12_v2
- bge_multilingual_gemma2

By default, ChromaDB uses:
- all-MiniLM-L6-v2

Two other options, both lexical (keyword-based) scoring functions rather than neural embeddings:
- TF-IDF
- BM25
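To make the difference between the embedding models and the lexical options concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not Chroma's `Bm25EmbeddingFunction`; the whitespace tokenizer and the parameters `k1=1.5` and `b=0.75` are common defaults that I assume here. The documents are the four toy documents from the Chroma demo.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(
    query: str, documents: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_documents = [
    "document sur un cfc / apprentissage",
    "document sur l'epfl",
    "document sur 42 Lausanne",
    "document sur université de fribougr",
]
scores = bm25_scores("J'aimerais faire un apprentissage", toy_documents)
best_index = max(range(len(scores)), key=scores.__getitem__)
print(toy_documents[best_index])
```

Only exact token overlap counts here: a query that uses a synonym of "apprentissage" would score zero, which is the main thing a benchmark against the embedding models should reveal.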

We should probably benchmark the three embedding models above to see how they perform on our documents.
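One simple way to run such a benchmark is a hit rate at k: for a handful of hand-labelled queries, check whether the expected document ID appears in the top k results. The sketch below is an assumption about how this could look, not an existing part of the notebook. `hit_rate_at_k`, the toy index, and the labelled queries are all hypothetical.

```python
from typing import Callable


def hit_rate_at_k(
    retrieve: Callable[[str, int], list[str]],
    labelled_queries: dict[str, str],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected ID is among the top-k retrieved IDs."""
    if not labelled_queries:
        raise ValueError("need at least one labelled query")
    hits = sum(
        1
        for query, expected_id in labelled_queries.items()
        if expected_id in retrieve(query, k)
    )
    return hits / len(labelled_queries)


# Toy keyword retriever standing in for a real vector store (illustration only).
toy_index = {
    "informaticen cfc": "apprentissage cfc informatique",
    "informatique université de fribourg": "bachelor université fribourg informatique",
}


def keyword_retrieve(query: str, k: int) -> list[str]:
    """Rank document IDs by the number of words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        toy_index,
        key=lambda doc_id: len(query_words & set(toy_index[doc_id].split())),
        reverse=True,
    )
    return ranked[:k]


labelled = {
    "apprentissage en informatique": "informaticen cfc",
    "bachelor à fribourg": "informatique université de fribourg",
}
score = hit_rate_at_k(keyword_retrieve, labelled, k=1)
print("hit rate@1:", score)
```

For a Chroma collection, the `retrieve` argument could be a small wrapper such as `lambda q, k: collection.query(query_texts=[q], n_results=k)["ids"][0]`, with one collection per embedding model.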

In [13]:
DB_PATH = "../data/"

client = chromadb.PersistentClient(path=DB_PATH)

# ensure the DB is available
client.heartbeat()

1760191073011173104

In [14]:
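# `collection` is still the "main" collection object from the demo above;
# _embedding_function is a private attribute, read here only to inspect the default
collection._embedding_function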

<chromadb.utils.embedding_functions.DefaultEmbeddingFunction at 0x763e4dad4850>

In [15]:
collection = client.get_or_create_collection(name="study-docs")
print(collection)

Collection(name=study-docs)


In [16]:
from chromadb.utils.embedding_functions import Bm25EmbeddingFunction  # not used yet

# use the page titles as IDs (alternative: df.index.astype("str").to_list())
ids: list[str] = df["title"].to_list()
documents: list[str] = df["url_text"].to_list()

collection.add(
    ids=ids,
    documents=documents,
)
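Titles make readable IDs, but Chroma expects IDs to be unique within a collection, so two pages with the same title would clash. A quick check before calling `collection.add` avoids that surprise. The helpers below are hypothetical names used for illustration, and the suffix scheme is just one possible convention.

```python
from collections import Counter


def find_duplicate_ids(ids: list[str]) -> list[str]:
    """Return every ID that occurs more than once, in first-seen order."""
    counts = Counter(ids)
    return [doc_id for doc_id, count in counts.items() if count > 1]


def make_unique_ids(ids: list[str]) -> list[str]:
    """Append -2, -3, ... to repeated IDs; the first occurrence is unchanged.

    Note: this does not guard against an input that already contains
    a suffixed form such as "a-2".
    """
    seen: Counter = Counter()
    unique = []
    for doc_id in ids:
        seen[doc_id] += 1
        unique.append(doc_id if seen[doc_id] == 1 else f"{doc_id}-{seen[doc_id]}")
    return unique


example_ids = ["informatique", "idec", "informatique"]
print(find_duplicate_ids(example_ids))
print(make_unique_ids(example_ids))
```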

In [17]:
results = collection.query(
    query_texts=["J'aimerais faire un apprentissage en informatique"],
    n_results=5,
)

In [18]:
print(results["ids"])

[['42 lausanne', 'idec', 'informatique université de fribourg', 'informatique de gestion université de fribourg', 'informatique université de genève']]
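Retrieval is only the first half of RAG: the retrieved texts still have to be handed to the LLM together with the question. The sketch below shows one plain way to assemble such a prompt. It does not call any model; `build_prompt`, the instruction wording, and the `max_chars_per_doc` limit are assumptions for illustration, and the two sample documents are placeholders.

```python
def build_prompt(
    question: str, documents: list[str], max_chars_per_doc: int = 1500
) -> str:
    """Combine retrieved documents and the user question into one prompt."""
    context_blocks = [
        # truncate each document so the prompt stays within a manageable size
        f"[Source {i}]\n{doc[:max_chars_per_doc]}"
        for i, doc in enumerate(documents, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder texts standing in for results["documents"][0].
retrieved = ["Text of the first retrieved page.", "Text of the second retrieved page."]
prompt = build_prompt("J'aimerais faire un apprentissage en informatique", retrieved)
print(prompt)
```

With the query above, `results["documents"][0]` could be passed as `documents`, provided the query includes documents in its output (the default in the earlier cells).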
