Retrieval-Augmented Generation (RAG) service (1 point)

In this part of the laboratory, we will build a RAG service. It enhances the text generation capabilities of an LLM with context and information drawn from a knowledge base. Relevant textual information is found with vector search and appended to the prompt, resulting in fewer hallucinations and more precise, relevant answers.
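
As a rough illustration of the idea above, the toy sketch below ranks a few hand-written passages by Euclidean distance to a query vector and appends the closest one to the prompt. The vectors, passages, and helper names (`l2_distance`, `retrieve`, `augment`) are made up for illustration; a real system would use embeddings produced by a model and a vector database, as in the rest of this laboratory.

```python
import math

# Toy "knowledge base": (text, embedding) pairs with made-up 3-dimensional vectors.
KNOWLEDGE_BASE = [
    ("Milvus is a vector database.", [0.9, 0.1, 0.0]),
    ("HerBERT is a Polish language model.", [0.1, 0.9, 0.0]),
    ("Pancakes are made from flour, eggs and milk.", [0.0, 0.1, 0.9]),
]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_vector, knowledge_base):
    """Return the text whose embedding is closest to the query vector."""
    best_text, _ = min(knowledge_base, key=lambda item: l2_distance(query_vector, item[1]))
    return best_text


def augment(question, context):
    """Put the retrieved context in front of the question sent to the LLM."""
    return f"Context:\n{context}\n\nQuestion: {question}"


# A query vector that is, by construction, close to the first passage.
prompt = augment("What is Milvus?", retrieve([0.8, 0.2, 0.0], KNOWLEDGE_BASE))
print(prompt)
```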

For this use case, we don't really need additional capabilities like attribute filtering, ACID transactions, JOINs, or other Postgres-related advantages. Thus, we will use Milvus, a typical example of a vector database. To generate embeddings, we will use the Silver Retriever model, loaded through the Sentence Transformers library. It is based on the HerBERT model for the Polish language and fine-tuned for retrieval.

1. Start by setting up Milvus using its Docker image. A Docker Compose file is also conveniently provided by its creators:
2. Run the database with docker compose up -d.
3. The next code sections are quite interactive and will probably be easier to run inside a Jupyter Notebook. Start it with jupyter notebook.
4. Let's connect to the database. Milvus provides its own pymilvus library.



In [1]:
from pymilvus import MilvusClient

host = "localhost"
port = "19530"

# MilvusClient takes the server address as a single URI
milvus_client = MilvusClient(
    uri=f"http://{host}:{port}"
)
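
If the connection fails, it may help to first confirm that something is listening on the Milvus port. The sketch below is a generic TCP check using only the standard library; `is_port_open` is a hypothetical helper, not part of pymilvus, and an open port does not guarantee that Milvus itself has finished starting.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name resolution errors
        return False


if __name__ == "__main__":
    # 19530 is the port used in the connection cell above.
    print(is_port_open("localhost", 19530))
```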

5. Vector databases work quite similarly to document databases such as MongoDB. Instead of a table, we define a collection with a specific schema, but conceptually the two are similar. For each element, we store an ID, the text, and its embedding.

In [2]:
from pymilvus import FieldSchema, DataType, CollectionSchema

VECTOR_LENGTH = 768  # check the dimensionality for Silver Retriever Base (v1.1) model

id_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, description="Primary id")
text = FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096, description="Page text")
embedding_text = FieldSchema("embedding", dtype=DataType.FLOAT_VECTOR, dim=VECTOR_LENGTH, description="Embedded text")

fields = [id_field, text, embedding_text]

schema = CollectionSchema(fields=fields, auto_id=True, enable_dynamic_field=True, description="RAG Texts collection")

6. Create the collection with the given schema, then build an HNSW index on the embedding field:

In [3]:
COLLECTION_NAME = "rag_texts_and_embeddings"

milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema
)

index_params = milvus_client.prepare_index_params()

index_params.add_index(
    field_name="embedding", 
    index_type="HNSW",
    metric_type="L2",
    params={"M": 4, "efConstruction": 64}  # lower values for speed
) 

milvus_client.create_index(
    collection_name=COLLECTION_NAME,
    index_params=index_params
)

# list existing collections to check that ours was created
print(milvus_client.list_collections())

# describe our collection
print(milvus_client.describe_collection(COLLECTION_NAME))

['rag_texts_and_embeddings']
{'collection_name': 'rag_texts_and_embeddings', 'auto_id': True, 'num_shards': 1, 'description': 'RAG Texts collection', 'fields': [{'field_id': 100, 'name': 'id', 'description': 'Primary id', 'type': <DataType.INT64: 5>, 'params': {}, 'auto_id': True, 'is_primary': True}, {'field_id': 101, 'name': 'text', 'description': 'Page text', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 4096}}, {'field_id': 102, 'name': 'embedding', 'description': 'Embedded text', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}], 'functions': [], 'aliases': [], 'collection_id': 462149625083593581, 'consistency_level': 2, 'properties': {}, 'num_partitions': 1, 'enable_dynamic_field': True, 'created_timestamp': 462149670625869828}


7. Now we are able to insert documents into our database. RAG is most useful when the information is very specialized, niche, or otherwise likely to be unknown to the model. Let's start with "IAB POLSKA Przewodnik po sztucznej inteligencji". This part is inspired by SpeakLeash and one of their projects, Bielik-how-to-start, specifically the Bielik_2_(4_bit)_RAG example. Bielik is the first Polish LLM, and you can also explore other tutorials for its usage. Let's define some constants to start with:

In [4]:
# define data source and destination
## URL from which the document will be downloaded
pdf_url = "https://www.iab.org.pl/wp-content/uploads/2024/04/Przewodnik-po-sztucznej-inteligencji-2024_IAB-Polska.pdf"

## local file name of the downloaded document
file_name = "Przewodnik-po-sztucznej-inteligencji-2024_IAB-Polska.pdf"

## local file name of the processed document (text of each page)
file_json = "Przewodnik-po-sztucznej-inteligencji-2024_IAB-Polska.json"

## local file name of the embedded pages of the document
embeddings_json = "Przewodnik-po-sztucznej-inteligencji-2024_IAB-Polska-Embeddings.json"

## local directory in which all of the above files are stored
data_dir = "./data"

8. Let's download the document into the data_dir directory:

In [5]:
# download data
import os
import requests

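def download_pdf_data(pdf_url: str, file_name: str) -> None:
    # make sure the target directory exists before writing to it
    os.makedirs(data_dir, exist_ok=True)
    response = requests.get(pdf_url, stream=True)
    # fail early on HTTP errors instead of saving an error page as a PDF
    response.raise_for_status()
    with open(os.path.join(data_dir, file_name), "wb") as file:
        for block in response.iter_content(chunk_size=1024):
            if block:
                file.write(block)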

download_pdf_data(pdf_url, file_name)

9. The document contains a lot of text, and in RAG we need to add only specific fragments to the prompt. To keep things simple, and the number of vectors small, we will treat each page as a separate chunk to vectorize and search for. Below, we split the document into pages and save them to a single JSON file as a list of entries in the format {"page_num": page_number, "text": text_of_the_page}.

In [6]:
# prepare data

import fitz
import json


def extract_pdf_text(file_name, file_json):
    document = fitz.open(os.path.join(data_dir, file_name))
    pages = []

    for page_num in range(len(document)):
        page = document.load_page(page_num)
        page_text = page.get_text()
        pages.append({"page_num": page_num, "text": page_text})

    with open(os.path.join(data_dir, file_json), "w", encoding='utf-8') as file:
        json.dump(pages, file, indent=4, ensure_ascii=False)


extract_pdf_text(file_name, file_json)
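
The text field in the schema above was declared with max_length=4096, so a very long page could be rejected at insert time. The sketch below truncates page text when needed. It assumes the limit is counted in UTF-8 bytes (worth verifying against the Milvus documentation for your version), and `truncate_utf8` is a hypothetical helper, not part of the original pipeline.

```python
def truncate_utf8(text: str, max_bytes: int = 4096) -> str:
    """Cut text so that its UTF-8 encoding fits in max_bytes, without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Polish diacritics take two bytes each in UTF-8, so byte length exceeds character count.
sample = "żółć" * 3
print(len(sample), len(sample.encode("utf-8")))
print(truncate_utf8(sample, max_bytes=5))
```

Truncation silently loses the end of a page, so for real data it may be preferable to raise max_length or to split long pages into smaller chunks instead.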

10. Now we have texts, but we need vectors. We will use the model to embed the text of each page and later save the results in our collection in Milvus. This is very easy if we first prepare a single JSON file with all the embeddings. Its format is {"page_num": page_num, "embedding": embedded_text}.

In [7]:
# vectorize data

import torch
import numpy as np
from sentence_transformers import SentenceTransformer


def generate_embeddings(file_json, embeddings_json, model):
    pages = []
    with open(os.path.join(data_dir, file_json), "r", encoding='utf-8') as file:
        data = json.load(file)

    for page in data:
        pages.append(page["text"])

    embeddings = model.encode(pages)

    embeddings_paginated = []
    for page_num in range(len(embeddings)):
        embeddings_paginated.append({"page_num": page_num, "embedding": embeddings[page_num].tolist()})

    with open(os.path.join(data_dir, embeddings_json), "w", encoding='utf-8') as file:
        json.dump(embeddings_paginated, file, indent=4, ensure_ascii=False)

model_name = "ipipan/silver-retriever-base-v1.1"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(model_name, device=device)
generate_embeddings(file_json, embeddings_json, model)

In [8]:
print(model.device)

cuda:0


11. Now we can easily insert the data into Milvus:

In [9]:
def insert_embeddings(file_json, embeddings_json, client=milvus_client):
    # read page texts and their embeddings; both files keep pages in the same order
    with open(os.path.join(data_dir, file_json), "r", encoding="utf-8") as t_f, \
         open(os.path.join(data_dir, embeddings_json), "r", encoding="utf-8") as e_f:
        text_data, embedding_data = json.load(t_f), json.load(e_f)

    # the "id" field is generated by Milvus (auto_id=True), so we pass only text and embedding
    rows = [
        {"text": page["text"], "embedding": page_embedding["embedding"]}
        for page, page_embedding in zip(text_data, embedding_data)
    ]

    client.insert(collection_name=COLLECTION_NAME, data=rows)


insert_embeddings(file_json, embeddings_json)

# load inserted data into memory
milvus_client.load_collection("rag_texts_and_embeddings")
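
For a single PDF one insert call is enough, but with larger corpora it is common to insert rows in batches. A minimal sketch, assuming a hypothetical helper `batched_rows` and an arbitrary batch size; the commented usage mirrors the insert call above.

```python
from typing import Iterator


def batched_rows(rows: list[dict], batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive slices of rows, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Usage with the client from this notebook (not executed here):
# for batch in batched_rows(rows, batch_size=100):
#     milvus_client.insert(collection_name="rag_texts_and_embeddings", data=batch)

demo = [{"text": f"page {i}"} for i in range(5)]
print([len(batch) for batch in batched_rows(demo, batch_size=2)])
```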

12. Now let's do some semantic search!

In [10]:
# search
def search(model, query, client=milvus_client):
    embedded_query = model.encode(query).tolist()
    result = client.search(
        collection_name="rag_texts_and_embeddings", 
        data=[embedded_query], 
        limit=1,
        search_params={"metric_type": "L2"},
        output_fields=["text"]
    )
    return result


result = search(model, query="Czym jest sztuczna inteligencja")
print(result)

data: [[{'id': 462149625083598989, 'distance': 29.125164031982422, 'entity': {'text': 'Historia powstania\nsztucznej inteligencji\n7\nW języku potocznym „sztuczny" oznacza to, co\njest \nwytworem \nmającym \nnaśladować \ncoś\nnaturalnego. W takim znaczeniu używamy\nterminu ,,sztuczny\'\', gdy mówimy o sztucznym\nlodowisku lub oku. Sztuczna inteligencja byłaby\nczymś (programem, maszyną) symulującym\ninteligencję naturalną, ludzką.\nSztuczna inteligencja (AI) to obszar informatyki,\nktóry skupia się na tworzeniu programów\nkomputerowych zdolnych do wykonywania\nzadań, które wymagają ludzkiej inteligencji. \nTe zadania obejmują rozpoznawanie wzorców,\nrozumienie języka naturalnego, podejmowanie\ndecyzji, uczenie się, planowanie i wiele innych.\nGłównym celem AI jest stworzenie systemów,\nktóre są zdolne do myślenia i podejmowania\ndecyzji na sposób przypominający ludzki.\nHistoria sztucznej inteligencji sięga lat 50. \nXX wieku, kiedy to powstały pierwsze koncepcje\ni modele tego, co mog
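
The raw result is hard to read. The sketch below formats hits as short lines, assuming the layout visible in the output above (a list of hits per query, each with 'id', 'distance' and 'entity'). `summarize_hits` is a hypothetical helper, and `fake_result` is a hand-written stand-in rather than real Milvus output.

```python
def summarize_hits(result, preview_chars: int = 80) -> list[str]:
    """Format each hit of the first query as 'id | distance | text preview'."""
    lines = []
    for hit in result[0]:  # result[0] holds the hits for the first (only) query
        preview = hit["entity"]["text"].replace("\n", " ")[:preview_chars]
        lines.append(f"{hit['id']} | {hit['distance']:.3f} | {preview}")
    return lines


# Stand-in for a search result, shaped like the printed output above.
fake_result = [[
    {"id": 1, "distance": 29.125, "entity": {"text": "Historia powstania\nsztucznej inteligencji"}},
]]
for line in summarize_hits(fake_result):
    print(line)
```

With the notebook's own objects, the call would be summarize_hits(search(model, query="Czym jest sztuczna inteligencja")).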

However, this is not yet RAG! This is just a search through our embeddings, without any LLM or generation. Many companies rely on external LLMs accessed via an API, due to easy setup, good scalability, and low cost. We will follow this trend here and use the Google Gemini API to generate answers with RAG.

# Gemini API integration
The Gemini API is free to use, with a rate limit on the free tier. We can integrate this LLM with our RAG system.
1. Get the API key from the Gemini API. Go to model info, and click "Try it in Google AI Studio".
2. You will be redirected to the Google AI Studio. Click "Create API Key", create a test project, and a test key.
3. Copy the key and save it in the GEMINI_API_KEY environment variable. This is a secret, like any other API key, and must never be shared!
4. Let's prepare the function that will call Google API and generate our response.

In [16]:
import os
from google import genai

GEMINI_KEY = os.getenv("GEMINI_API_KEY")
gemini_client = genai.Client(api_key=GEMINI_KEY)

MODEL = "gemini-2.0-flash"

def generate_response(prompt: str):
    try:
        # Send request to Gemini 2.0 Flash API and get the response
        response = gemini_client.models.generate_content(
            model=MODEL,
            contents=prompt,
        )
        return response.text 
    except Exception as e:
        print(f"Error generating response: {e}")
        return None
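
Because the free tier is rate limited, calls may occasionally fail. One common way to cope is to retry with exponential backoff. The sketch below wraps any zero-argument callable; `with_retries`, the attempt count, and the delays are illustrative assumptions, and it retries on any exception rather than inspecting Gemini-specific error types.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> Optional[T]:
    """Run call(); on failure wait base_delay * 2**n seconds and retry, up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f"Attempt {attempt + 1} failed: {error}")
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None


# Usage with the client above (not executed here):
# response = with_retries(
#     lambda: gemini_client.models.generate_content(model=MODEL, contents=prompt)
# )
```

Note that generate_response above already catches exceptions and returns None, so the wrapper has to go around the raw API call, not around generate_response, for the retries to take effect.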

5. Now we can fully integrate everything into a RAG system. Fill the function below that will augment the prompt with knowledge from Milvus, and then use the LLM to generate an answer based on that context.

In [18]:
def build_prompt(context: str, query: str) -> str:
    prompt = (
        "You are an intelligent assistant. Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer in a clear and concise way."
    )
    return prompt


def rag(model, query: str) -> str:
    # having all prepared functions, you can combine them together and try to build your own RAG!
    context = search(model, query)[0][0]["entity"]["text"]
    prompt = build_prompt(context, query)
    return generate_response(prompt)
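
The function above uses only the single best page. If search is called with a larger limit, several pages can be merged into one context. A minimal sketch, assuming the same result layout as above and that hits arrive ordered from closest to farthest; `join_contexts`, the separator, and the character budget are hypothetical.

```python
def join_contexts(result, max_chars: int = 6000, separator: str = "\n\n---\n\n") -> str:
    """Concatenate hit texts of the first query, in order, within a character budget."""
    parts = []
    used = 0
    for hit in result[0]:
        text = hit["entity"]["text"].strip()
        # the separator only counts once there is a previous part to separate from
        cost = len(text) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(text)
        used += cost
    return separator.join(parts)


# Hand-written stand-in for a search result with two hits.
fake_result = [[
    {"distance": 1.0, "entity": {"text": "First page."}},
    {"distance": 2.0, "entity": {"text": "Second page."}},
]]
print(join_contexts(fake_result))
```

The character budget is a crude way to keep the prompt from growing without bound; a token-based limit would be more precise but depends on the model's tokenizer.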

6. Test the RAG system with a few sample queries.

In [23]:
print(rag(model, "Czym jest sztuczna inteligencja?"))

Sztuczna inteligencja (AI) to obszar informatyki, który skupia się na tworzeniu programów komputerowych zdolnych do wykonywania zadań, które wymagają ludzkiej inteligencji, takich jak rozpoznawanie wzorców, rozumienie języka naturalnego, podejmowanie decyzji, uczenie się i planowanie. Głównym celem AI jest stworzenie systemów zdolnych do myślenia i podejmowania decyzji w sposób przypominający ludzki.



In [24]:
print(rag(model, "Czym jest uczenie głębokie?"))

Uczenie głębokie to zaawansowany rodzaj uczenia maszynowego, wykorzystujący sieci algorytmów inspirowanych strukturą mózgu, zwane sieciami neuronowymi.



In [25]:
print(rag(model, "Jak zrobić naleśniki?"))

Przepraszam, ta informacja nie znajduje się w tekście.
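
For an off-topic question such as the last one, vector search still returns the nearest page, and it is left to the LLM to notice that the context does not help. One possible refinement is to skip the LLM call when the best hit is too far away. The sketch below assumes the result layout used earlier and the L2 metric (smaller means closer); `pick_context` and any threshold value are hypothetical and would need tuning on real queries.

```python
from typing import Optional


def pick_context(result, max_distance: float) -> Optional[str]:
    """Return the best hit's text, or None if there is no hit within max_distance."""
    hits = result[0]
    if not hits or hits[0]["distance"] > max_distance:
        return None
    return hits[0]["entity"]["text"]


# Hand-written stand-in for a search result.
fake_result = [[{"distance": 10.0, "entity": {"text": "relevant page"}}]]
print(pick_context(fake_result, max_distance=50.0))
print(pick_context(fake_result, max_distance=5.0))
```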

