# MinSearch vs Vector Search with RAG

This notebook compares RAG without vector search (using minsearch) and with vector search (using Qdrant).

## Dataset

In [2]:
import requests

ruta_doc= 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_rpta= requests.get(ruta_doc)
documentos= docs_rpta.json()


Prepare the documents for indexing

In [3]:
docs=[]

for i in documentos:

    for j in i['documents']:
        j['course'] = i['course']
        docs.append(j)
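
To make the reshaping concrete, here is a minimal sketch on a hand-written stand-in for the parsed JSON. The course names and records below are made-up placeholders, not entries from the real file. It performs the same flattening as the loop above, but copies each record instead of mutating it in place.

```python
# Toy stand-in for the parsed JSON; contents are illustrative placeholders.
toy_documentos = [
    {"course": "course-a", "documents": [
        {"text": "t1", "section": "s1", "question": "q1"},
        {"text": "t2", "section": "s1", "question": "q2"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "t3", "section": "s2", "question": "q3"},
    ]},
]

# Flatten the nested records into one list and attach the course name
# to every record. {**doc, ...} builds a new dict, so the input is untouched.
toy_docs = [
    {**doc, "course": course["course"]}
    for course in toy_documentos
    for doc in course["documents"]
]

print(len(toy_docs))          # 3
print(toy_docs[2]["course"])  # course-b
```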

## OpenAI client

In [4]:
# pip install python-dotenv openai
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # load environment variables from the .env file
api_key = os.getenv('OPENAI_API_KEY')

client = OpenAI(api_key=api_key)
client


<openai.OpenAI at 0x7c7b0a0ffdd0>

## Minsearch - RAG

Functions:
- `busqueda`: searches the minsearch index
- `build_prompt`: arranges the retrieved documents and builds the final prompt
- `output_llm`: generates the LLM output

For more detail: [rag_intro](https://github.com/Halsey26/llm-zoomcamp/blob/main/01_Week_IntroRag/rag_intro.ipynb)

In [5]:
docs[500]

{'text': 'Q: “In lesson 2.8 why is y_pred different from y? After all, we trained X_train to get the weights that when multiplied by X_train should give exactly y, or?”\nA: linear regression is a pretty simple model, it neither can nor should fit 100% (nor any other model, as this would be the sign of overfitting). This picture might illustrate some intuition behind this, imagine X is a single feature:\nAs our model is linear, how would you draw a line to fit all the "dots"?\nYou could "fit" all the "dots" on this pic using something like scipy.optimize.curve_fit (non-linear least squares) if you wanted to, but imagine how it would perform on previously unseen data.\nAdded by Andrii Larkin',
 'section': '2. Machine Learning for Regression',
 'question': 'Why linear regression doesn’t provide a “perfect” fit?',
 'course': 'machine-learning-zoomcamp'}

In [6]:
import minsearch
# index the documents with minsearch
index= minsearch.Index(
    text_fields=['text', 'section','question'],
    keyword_fields=['course']
)
index.fit(docs)

<minsearch.minsearch.Index at 0x7c7ae9804470>

In [7]:

def busqueda(query, course):
    # default boost weights per text field
    pesos_dicc = {'text': 0.2, 'section': 0.5, 'question': 4.0}

    resultados= index.search(
        query= query,
        boost_dict= pesos_dicc, 
        filter_dict= {'course':course},
        num_results= 5
    )
    return resultados
# returns a list
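
The effect of `boost_dict` can be illustrated with a toy scorer. This is a hypothetical simplification: minsearch computes its own text-similarity scores internally, whereas the sketch below only counts matching words per field and multiplies each count by the field's boost. The function `toy_boosted_score` and both records are invented for illustration.

```python
def toy_boosted_score(query, doc, boosts):
    """Count query-word matches per field and weight each count by its boost."""
    words = query.lower().split()
    score = 0.0
    for field, weight in boosts.items():
        field_words = doc.get(field, "").lower().split()
        matches = sum(field_words.count(w) for w in words)
        score += weight * matches
    return score

# Same weights as in busqueda
boosts = {"text": 0.2, "section": 0.5, "question": 4.0}

# Hypothetical records: "certificate" appears once in each, in different fields.
doc_a = {"question": "how to get certificate", "text": "submit projects", "section": "general"}
doc_b = {"question": "deadline", "text": "the certificate is issued later", "section": "general"}

score_a = toy_boosted_score("certificate", doc_a, boosts)
score_b = toy_boosted_score("certificate", doc_b, boosts)
print(score_a, score_b)  # 4.0 0.2
```

With these weights, a match in the `question` field counts twenty times as much as the same match in `text`, so records whose question resembles the query rank first.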

In [8]:
def build_prompt(query, results):
    prompt_template= """
You are a course teaching assistant. Answer the QUESTION based on the CONTEXT. Use only the facts from the CONTEXT when answering the QUESTION from FAQ 
database.
    
QUESTION: {question}
        
CONTEXT: 
{context} 
"""

    contexto= ""
    for i in results:
        contexto = contexto + f"Section: {i['section']} \nQuestion: {i['question']} \nAnswer: {i['text']} \n\n"

    prompt_final= prompt_template.format(question= query, context= contexto).strip()

    return prompt_final


In [9]:
def output_llm(prompt):

    response= client.chat.completions.create(
        model= 'gpt-4o-mini', 
        messages= [
            {
                "role":"user", 
                "content":  prompt
            }
        ]
    )
    return response.choices[0].message.content

Question without context

In [None]:
# question without context
rpta= output_llm('how can i get the certificate?')

"To obtain a certificate, you'll typically need to follow these general steps, depending on the type of certificate you're seeking—such as educational, professional, or health-related. Here are some common approaches:\n\n### 1. **Educational Certificate**\n   - **Enroll in a Course:** Sign up for a course that offers the certificate you need (e.g., academic, online courses, workshops).\n   - **Complete Requirements:** Successfully complete the required coursework, exams, or projects.\n   - **Request Certificate:** Upon completion, request your certificate from the institution or organization.\n\n### 2. **Professional Certification**\n   - **Choose the Certification:** Research the certifications relevant to your career field (e.g., IT, healthcare, finance).\n   - **Meet Prerequisites:** Ensure you meet any prerequisites (educational background, work experience).\n   - **Prepare for and Pass Exam:** Study for and pass the required examinations or assessments.\n   - **Apply for Certifica

Wrap everything in a single RAG function

In [60]:
def RAG_minsearch(query, course):
    resultados_busqueda= busqueda(query, course)
    prompt= build_prompt(query, resultados_busqueda)
    final_output = output_llm(prompt)
    return final_output

query= 'how can i get the certificate?'
course= 'machine-learning-zoomcamp'

respuesta=  RAG_minsearch(query, course)

Comparing answers with and without context

In [63]:
print(f"Question: {query} \n\n Answer without context:\n {rpta} \n\nAnswer with context:\n{respuesta}")

Question: how can i get the certificate? 

 Answer without context:
 To obtain a certificate, you'll typically need to follow these general steps, depending on the type of certificate you're seeking—such as educational, professional, or health-related. Here are some common approaches:

### 1. **Educational Certificate**
   - **Enroll in a Course:** Sign up for a course that offers the certificate you need (e.g., academic, online courses, workshops).
   - **Complete Requirements:** Successfully complete the required coursework, exams, or projects.
   - **Request Certificate:** Upon completion, request your certificate from the institution or organization.

### 2. **Professional Certification**
   - **Choose the Certification:** Research the certifications relevant to your career field (e.g., IT, healthcare, finance).
   - **Meet Prerequisites:** Ensure you meet any prerequisites (educational background, work experience).
   - **Prepare for and Pass Exam:** Study for and pass the require

## Vector Search with RAG

Create the client

In [10]:
from qdrant_client import QdrantClient, models
# remember to start the Qdrant container beforehand
qdrant_cliente= QdrantClient('http://localhost:6333')



Create the points

In [11]:
modelo_seleccionado = "jinaai/jina-embeddings-v2-small-en"
points = []
point_id = 0  # avoids shadowing the built-in id()

for i in documentos:
    for j in i['documents']:  # each record has text, section and question
        point = models.PointStruct(
            id=point_id,
            vector=models.Document(text=j['text'], model=modelo_seleccionado),  # text to embed with the selected model
            payload={
                "text": j['text'],
                "section": j['section'],
                "question": j['question'],
                "course": i['course']
            }
        )
        points.append(point)
        point_id += 1

Create the collection

In [None]:
dimension_embedding= 512
# modelo_seleccionado= "jinaai/jina-embeddings-v2-small-en"
nombre_coleccion= 'vectorsearch_rag'

# to delete an existing collection
# qdrant_cliente.delete_collection(collection_name=nombre_coleccion)

qdrant_cliente.create_collection(
    collection_name= nombre_coleccion, 
    vectors_config= models.VectorParams(
        size= dimension_embedding, 
        distance = models.Distance.COSINE
    )
)

Filtered search
- Index the payload

In [14]:
qdrant_cliente.create_payload_index(
    collection_name=nombre_coleccion,
    field_name="course",
    field_schema="keyword"
)

UpdateResult(operation_id=3, status=<UpdateStatus.COMPLETED: 'completed'>)

- Load the points into the collection

In [15]:
qdrant_cliente.upsert(
    collection_name= nombre_coleccion, 
    points = points
)

UpdateResult(operation_id=4, status=<UpdateStatus.COMPLETED: 'completed'>)

Now define the search function that uses vector search

In [16]:
def vector_search(query, course, limit):
    results= qdrant_cliente.query_points(
        collection_name=  nombre_coleccion, 
        query= models.Document(
            text= query, 
            model = modelo_seleccionado
        ),
        query_filter = models.Filter( # filter by course
            must=[ # every condition in this list must hold
                models.FieldCondition(
                    key= "course", 
                    match= models.MatchValue(value=course)
                )
            ]
        ), 
        limit = limit ,
        with_payload=True
    )

    resultados= []
    for i in results.points:
        resultados.append(i.payload)

    return resultados
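
Conceptually, the call above keeps only the points whose payload matches the requested course and then ranks them by similarity to the query vector. The following is a brute-force sketch of that idea in plain Python. It is not how Qdrant works internally (Qdrant runs the search server-side with its own index); the function `toy_filtered_search`, the 2-dimensional vectors, and the payloads are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_filtered_search(query_vec, points, course, limit):
    """Keep points whose payload matches the course, then rank by cosine."""
    candidates = [p for p in points if p["payload"]["course"] == course]
    ranked = sorted(candidates, key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return [p["payload"] for p in ranked[:limit]]

# Hypothetical 2-dimensional vectors and payloads, for illustration only.
toy_points = [
    {"vector": [1.0, 0.0], "payload": {"course": "ml", "question": "q-ml-1"}},
    {"vector": [0.0, 1.0], "payload": {"course": "ml", "question": "q-ml-2"}},
    {"vector": [1.0, 0.1], "payload": {"course": "de", "question": "q-de-1"}},
]

hits = toy_filtered_search([0.9, 0.1], toy_points, course="ml", limit=1)
print(hits)  # [{'course': 'ml', 'question': 'q-ml-1'}]
```

The `de` point is close to the query but is excluded by the filter before ranking, which mirrors the role of `query_filter`.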

RAG with Vector Search

In [20]:
def RAG_vectorsearch(query, course, nro_results):
    resultados_busqueda= vector_search(query, course, nro_results)
    prompt= build_prompt(query, resultados_busqueda)
    final_output = output_llm(prompt)
    return final_output

In [21]:
query= "how can i get the certificate?"
course= "machine-learning-zoomcamp"
RAG_vectorsearch(query, course, 5)

"To get the certificate, you need to submit at least 2 out of 3 course projects and review 3 peers' projects by the deadline. Once you meet these requirements, you will be eligible for a certificate."

# Homework 2

Instructions for the [Homework](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2025/02-vector-search/homework.md): Vector Search
(To be moved to a separate file later.)

## Q1
Embed the query: 'I just discovered the course. Can I join now?'. Use the 'jinaai/jina-embeddings-v2-small-en' model.

You should get a numpy array of size 512.

What's the minimal value in this array?

- Documentation used: [qdrant_fastembed](https://qdrant.tech/documentation/fastembed/fastembed-quickstart/)

In [2]:
from fastembed import TextEmbedding

modelo= 'jinaai/jina-embeddings-v2-small-en'
modelo_embedding= TextEmbedding(model_name=modelo)

query= 'I just discovered the course. Can I join now?'
query_embed= modelo_embedding.embed(query)
query_embed = list(query_embed)



In [3]:
print(f'Array size: {len(query_embed[0])}')
print(f'Minimum value in the array: {min(query_embed[0]):.4f}')

Array size: 512
Minimum value in the array: -0.1173


##  Q2
**Cosine similarity with another vector**

Now let's embed this document:
- doc = 'Can I still join the course after the start date?'

What's the cosine similarity between the vector for the query and the vector for the document?

In [4]:
# The vectors are already normalized
import numpy as np
np.linalg.norm(query_embed)

np.float64(1.0)

In [5]:
# the cosine of the embedded vector with itself should be 1
print(f'Cosine similarity of the embedded query with itself: {np.dot(query_embed[0], query_embed[0]):.4f}')

Cosine similarity of the embedded query with itself: 1.0000
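
The reason a plain dot product is enough here: cosine similarity is defined as the dot product divided by the product of the two norms, and for unit-length vectors that denominator is 1. A minimal sketch with arbitrary 2-dimensional vectors (not real embeddings) shows both computations giving the same value:

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """General definition: dot product divided by the product of the norms."""
    return dot(u, v) / (norm(u) * norm(v))

# Arbitrary example vectors (not real embeddings).
u = [3.0, 4.0]
v = [4.0, 3.0]

# Scale each vector to unit length.
u_hat = [x / norm(u) for x in u]
v_hat = [x / norm(v) for x in v]

print(round(cosine_similarity(u, v), 4))  # 0.96
print(round(dot(u_hat, v_hat), 4))        # 0.96
```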


In [6]:
doc= 'Can I still join the course after the start date?'

doc_embed= list(modelo_embedding.embed(doc))
cosine= np.dot(query_embed[0], doc_embed[0])

print(f'Cosine similarity: {cosine:.3f}')


Cosine similarity: 0.901


## Q3
Ranking by cosine similarity
Questions 3 and 4 use the documents below.
Instructions:
```
Compute the embeddings for the text field, and compute the cosine between the query vector and all the documents.
What's the document index with the highest similarity? (Indexing starts from 0):
```

In [7]:
documents = [{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
  'section': 'General course-related questions',
  'question': 'How can we contribute to the course?',
  'course': 'data-engineering-zoomcamp'}]

In [8]:
# Extract only the 'text' field
text_documents = [doc['text'] for doc in documents]

# Embed the texts
text_embed = list(modelo_embedding.embed(text_documents))

# Cosine similarity with the first document
cosine_q3 = np.dot(query_embed[0], text_embed[0])

print(f'Cosine similarity for the first document: {cosine_q3:.3f}')


Cosine similarity for the first document: 0.763


In [9]:
text_array= np.array(text_embed)
print(f'Shape of text_embed: {text_array.shape}')
print(f'Shape of query_embed: {query_embed[0].shape}')


Shape of text_embed: (5, 512)
Shape of query_embed: (512,)


In [10]:
coseno_docs = np.dot(text_array, query_embed[0])

print('Cosine similarity for ALL documents:')

for i, cos_doc in enumerate(coseno_docs):
    print(f'Index: {i} - Cos: {cos_doc:.3f}')

Cosine similarity for ALL documents:
Index: 0 - Cos: 0.763
Index: 1 - Cos: 0.818
Index: 2 - Cos: 0.809
Index: 3 - Cos: 0.713
Index: 4 - Cos: 0.730


In [11]:
max(coseno_docs)

np.float64(0.8182378150042889)
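
Since the question asks for the document index rather than the score, the position of the maximum is what matters. In the notebook, `np.argmax(coseno_docs)` would return it directly. The sketch below does the same in plain Python, using the rounded scores copied from the printed output above:

```python
# Scores copied from the printed output above, rounded to three decimals.
scores = [0.763, 0.818, 0.809, 0.713, 0.730]

# Index of the largest score (what np.argmax returns for a NumPy array).
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index)  # 1
```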

## Q4

Now let's calculate a new field, which is a concatenation of question and text:
- `full_text = doc['question'] + ' ' + doc['text']`

Embed this field and compute the cosine between it and the query vector. What's the highest scoring document?

In [40]:
# Build the 'full_text' field: question + ' ' + text
full_text = [doc['question'] + ' ' + doc['text'] for doc in documents]

# Embed the concatenated field and compare it with the query
textField_embed = np.array(list(modelo_embedding.embed(full_text)))
coseno_q4 = np.dot(textField_embed, query_embed[0])

print('Cosine similarity (QUESTION field added)')

for i, cos_doc in enumerate(coseno_q4):
    print(f'Index: {i} - Cos: {cos_doc:.3f}')


Cosine similarity (QUESTION field added)
Index: 0 - Cos: 0.851
Index: 1 - Cos: 0.844
Index: 2 - Cos: 0.841
Index: 3 - Cos: 0.776
Index: 4 - Cos: 0.809


With the question added, the document with the highest similarity is the first one (index 0), whereas with the `text` field alone it was index 1.

This is because the question in `full_text[0]` is:
- Can I still join the course after the start date?

which closely matches the query:
- 'I just discovered the course. Can I join now?'


In [39]:
print(f'Original documents:\n {text_documents}\n\n Documents with the question added:\n{full_text}')

Original documents:
 ["Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.', "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'You c

## Q5
Now let's select a smaller embedding model. What's the smallest dimensionality for models in fastembed?

One of these models is `BAAI/bge-small-en`. Let's use it.

In [10]:
modelos = TextEmbedding.list_supported_models()
modelo_seleccionado = ""
for model in modelos:
    if model['model'] == 'BAAI/bge-small-en':
        modelo_seleccionado = model['model']  # redundant, but makes explicit which model is used
        dimension = model['dim']
        # double quotes outside so the nested single quotes work before Python 3.12
        print(f"Model: {model['model']} \nDimension: {model['dim']}")

Model: BAAI/bge-small-en 
Dimension: 384
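
The cell above looks up a single model by name. To answer the "smallest dimensionality" part of the question, one option is to take the minimum over the `dim` key of every entry. The sketch below assumes each entry is a dict with `model` and `dim` keys, as used in the cell above. The list `toy_models` is a hand-written stand-in containing only the two models whose dimensions appear in this notebook, so its result says nothing about the full fastembed catalogue.

```python
# Hand-written stand-in for the list of model descriptions; only the two
# entries whose dimensions appear in this notebook are included.
toy_models = [
    {"model": "jinaai/jina-embeddings-v2-small-en", "dim": 512},
    {"model": "BAAI/bge-small-en", "dim": 384},
]

# Smallest embedding size, and every model that has it.
smallest_dim = min(m["dim"] for m in toy_models)
smallest = [m["model"] for m in toy_models if m["dim"] == smallest_dim]
print(smallest_dim, smallest)  # 384 ['BAAI/bge-small-en']
```

In the notebook, replacing `toy_models` with `modelos` would apply the same logic to the real list.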


## Q6
For the last question, we will use more documents.

We will select only FAQ records from our **ml zoomcamp**.

Add them to qdrant using the model from Q5.

When adding the data, use both question and answer fields:
- `text = doc['question'] + ' ' + doc['text']`

After the data is inserted, use the question from Q1 for querying the collection.

What's the highest score in the results? (The score for the first returned record):

In [1]:
import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()


documents = []

for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [6]:
modelo_seleccionado

'BAAI/bge-small-en'

In [8]:
# connect to Qdrant
from qdrant_client import QdrantClient, models
cliente= QdrantClient('http://localhost:6333') # start the Qdrant container beforehand
modelo = modelo_seleccionado

Create the collection

In [13]:
name_collection = 'homework2_collection'
cliente.create_collection(
    collection_name= name_collection,
    vectors_config= models.VectorParams(
        size= dimension,
        distance= models.Distance.COSINE
    )
)

True

Note:
Payload indexing is not needed here because all the documents belong to a single course: 'machine-learning-zoomcamp'.

Create the points

In [None]:
points = []

for i,doc in enumerate(documents):

    text = doc['question']+ ' '+doc['text']
    point = models.PointStruct(
        id= i,
        vector= models.Document(text= text, model=modelo), 
        payload= doc
    )
    points.append(point)
    
print(f'Number of points: {len(points)}')

Number of points: 375


Upsert the points into the collection

In [21]:
cliente.upsert(
    collection_name= name_collection, 
    points= points
)

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00,  8.34it/s]


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

Search function

In [24]:
def vector_search(query, limit=1):
    results= cliente.query_points(
        collection_name= name_collection,
        query = models.Document(
            text= query,
            model= modelo
        ), 
        limit=limit,
        with_payload=True
    )
    return results

In [27]:
query= 'I just discovered the course. Can I join now?'
rpta_busqueda= vector_search(query)

In [37]:
score = rpta_busqueda.points[0].score
rpta=  rpta_busqueda.points[0].payload['text']

In [39]:
print(f'Answer: {rpta}\n')
print(f'Score: {score}')

Answer: Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.
In order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.

Score: 0.8703172
