## Homework: Vector Search

## Q1. Getting the embeddings model

First, we will get the embeddings model `multi-qa-distilbert-cos-v1` from
[the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

In [None]:
!pip install sentence_transformers==2.7.0

In [5]:
!pip show sentence_transformers

Name: sentence-transformers
Version: 2.7.0
Summary: Multilingual text embeddings
Home-page: https://www.SBERT.net
Author: Nils Reimers
Author-email: info@nils-reimers.de
License: Apache License 2.0
Location: /usr/local/python/3.10.13/lib/python3.10/site-packages
Requires: huggingface-hub, numpy, Pillow, scikit-learn, scipy, torch, tqdm, transformers
Required-by: 


In [7]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

Create the embedding for this user question:

In [14]:
user_question = "I just discovered the course. Can I still join it?"
encoded = embedding_model.encode(user_question)
print(f"Answer #1: {encoded[0]}")

Answer #1: 0.07822265475988388


In [48]:
encoded.shape

(768,)

## Prepare the documents

Now we will create the embeddings for the documents.

Load the documents with ids that we prepared in the module:

In [15]:
import requests 

base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
relative_url = '03-vector-search/eval/documents-with-ids.json'
docs_url = f'{base_url}/{relative_url}?raw=1'
docs_response = requests.get(docs_url)
documents = docs_response.json()

In [16]:
documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp',
 'id': '1f6520ca'}

In [18]:
# All documents
len(documents)

948

In [21]:
# Only 'machine-learning-zoomcamp' documents
docs_mlzc = [doc for doc in documents if doc['course'] == 'machine-learning-zoomcamp']
len(docs_mlzc)

375

## Q2. Creating the embeddings

Now for each document, we will create an embedding for both question and answer fields.

We want to put all of them into a single matrix `X`:

- Create a list `embeddings` 
- Iterate over each document 
- `qa_text = f'{question} {text}'`
- compute the embedding for `qa_text`, append to `embeddings`
- At the end, let `X = np.array(embeddings)` (`import numpy as np`) 

What's the shape of X? (`X.shape`). Include the parantheses. 

In [43]:
embeddings = []
for doc in docs_mlzc:
    # doc["qa_text"] = f'{docs_mlzc[1]["question"]} {docs_mlzc[1]["text"]}'
    #doc["text_vector"] = embedding_model.encode(doc["qa_text"])
    # doc["text_vector"] = embedding_model.encode(f'{docs_mlzc[1]["question"]} {docs_mlzc[1]["text"]}')
    embedding = embedding_model.encode(f'{docs_mlzc[1]["question"]} {docs_mlzc[1]["text"]}')
    embeddings.append(embedding)

In [None]:
embeddings[1]

In [45]:
import numpy as np
X = np.array(embeddings)

In [46]:
# Q2 ANSWER
X.shape

(375, 768)

## Q3. Search

We have the embeddings and the query vector. Now let's compute the 
cosine similarity between the vector from Q1 (let's call it `v`) and the matrix from Q2. 

The vectors returned from the embedding model are already
normalized (you can check it by computing a dot product of a vector
with itself - it should return something very close to 1.0). This means that in order
to compute the coside similarity, it's sufficient to 
multiply the matrix `X` by the vector `v`:


```python
scores = X.dot(v)
```

What's the highest score in the results?

- 65.0 
- 6.5
- 0.65
- 0.065

In [50]:
v = encoded
scores = X.dot(v)
scores.shape

(375,)

In [52]:
# ANSWER
max(scores)

0.43505073