# Vector Search Homework Solutions

This notebook contains solutions to the Vector Search homework.

## Q1. Embedding the query

**Question:**

Embed the query: `'I just discovered the course. Can I join now?'`.
Use the `'jinaai/jina-embeddings-v2-small-en'` model.

You should get a numpy array of size 512.

What's the minimal value in this array?

- -0.51
- -0.11
- 0
- 0.51


**Explanation:**

We use FastEmbed's `TextEmbedding` with the specified model to embed the query.
We then check the shape of the resulting vector and find its minimum value. The minimum is about -0.117, which is closest to the -0.11 option.

In [1]:
from fastembed import TextEmbedding
import numpy as np

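def embed_text(text, model_name='jinaai/jina-embeddings-v2-small-en'):
    # Note: the model is loaded on every call, which is fine for a couple of queries
    model = TextEmbedding(model_name=model_name)
    # embed() yields one vector per input text; we pass a single text and take the first
    embedding = list(model.embed([text]))[0]
    return np.array(embedding)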

query = 'I just discovered the course. Can I join now?'
q = embed_text(query)
print('Shape:', q.shape)
print('Min value:', q.min())

Shape: (512,)
Min value: -0.11726373885183883


---

## Q2. Cosine similarity with another vector

**Question:**

Now let's embed this document:

```python
doc = 'Can I still join the course after the start date?'
```

What's the cosine similarity between the vector for the query and the vector for the document?

- 0.3
- 0.5
- 0.7
- 0.9


**Explanation:**

We embed the document using the same model and compute the dot product with the query embedding.
Since both vectors are normalized to unit length, the dot product equals the cosine similarity. The result is about 0.90, which matches the 0.9 option.

In [2]:
doc = 'Can I still join the course after the start date?'
d = embed_text(doc)
cosine_sim = float(np.dot(q, d))
print('Cosine similarity:', cosine_sim)

Cosine similarity: 0.9008528895674548
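
The dot-product shortcut only holds for unit-length vectors. As a dependency-free sketch (the vectors `a` and `b` are made-up toy values, not real embeddings, and the helper names are illustrative), the full cosine formula can be compared with the shortcut like this:

```python
import math

def dot(u, v):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(u, u))
    return [x / length for x in u]

# Toy vectors (hypothetical values, chosen only for illustration)
a = [3.0, 4.0]
b = [4.0, 3.0]

full_formula = cosine_similarity(a, b)
shortcut = dot(normalize(a), normalize(b))  # plain dot product of unit vectors

print(full_formula)
print(math.isclose(full_formula, shortcut))
```

For vectors that are not already normalized, the plain dot product would differ from the cosine, so the explicit formula is the safer default when the model's behavior is unknown.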


---

## Q3. Ranking by cosine

**Question:**

For Q3 and Q4 we will use these documents:

```python
documents = [{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp'}, ...]
```

Compute the embeddings for the `text` field, and compute the cosine between the query vector and all the documents.

What's the document index with the highest similarity? (Indexing starts from 0):

- 0
- 1
- 2
- 3
- 4

**Explanation:**

We embed the `text` field of each document using the same model as before. We then compute the dot product between the document embeddings matrix and the query embedding (`q`). Because the embeddings are unit-length, these dot products are cosine similarities. The index of the maximum value in the resulting array corresponds to the most similar document, which here is index 1.

In [3]:
# Documents for Q3 and Q4
documents = [
    {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp'},
    {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.', 'section': 'General course-related questions', 'question': 'Course - Can I follow the course after it finishes?', 'course': 'data-engineering-zoomcamp'},
    {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'section': 'General course-related questions', 'question': 'Course - When will the course start?', 'course': 'data-engineering-zoomcamp'},
    {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.', 'section': 'General course-related questions', 'question': 'Course - What can I do before the course starts?', 'course': 'data-engineering-zoomcamp'},
    {'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.', 'section': 'General course-related questions', 'question': 'How can we contribute to the course?', 'course': 'data-engineering-zoomcamp'}
]

# Embed the 'text' field of each document
texts = [doc['text'] for doc in documents]
V = np.array(list(TextEmbedding(model_name='jinaai/jina-embeddings-v2-small-en').embed(texts)))

# Compute cosine similarities
scores = V.dot(q)

# Find the index of the highest score
highest_score_index = np.argmax(scores)
print(f"Highest similarity index: {highest_score_index}")

Highest similarity index: 1
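
The ranking step can be sketched without any model. The example below uses hand-picked unit-length toy vectors (hypothetical values, not real embeddings) and mirrors what `V.dot(q)` followed by `np.argmax` does in the cell above:

```python
def dot(u, v):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(u, v))

# Toy unit-length "document" vectors (hypothetical values)
doc_vectors = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
]
query_vector = [0.8, 0.6]  # also unit-length

# One score per document: the equivalent of the matrix-vector product
scores = [dot(vec, query_vector) for vec in doc_vectors]

# Index of the largest score: the equivalent of argmax
best_index = max(range(len(scores)), key=scores.__getitem__)

print(scores)
print(best_index)
```

The second toy document points in nearly the same direction as the query, so it receives the highest score.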


---

## Q4. Ranking by cosine, version two

**Question:**

Now let's calculate a new field, which is a concatenation of `question` and `text`:

```python
full_text = doc['question'] + ' ' + doc['text']
```

Embed this field and compute the cosine between it and the query vector. What's the highest scoring document?

- 0
- 1
- 2
- 3
- 4

Is it different from Q3? If yes, why?

**Explanation:**

We create a new field by concatenating the `question` and `text` fields for each document, embed it, and compute the cosine similarity with the query vector as in Q3. The top document changes from index 1 (Q3) to index 0, so the answer does differ. The likely reason is that document 0's `question` field ('Course - Can I still join the course after the start date?') is close to a paraphrase of the query, while its `text` alone never mentions joining the course. Including the question gives the embedding model the keywords and context it needs.

In [4]:
# Create the 'full_text' field
full_texts = [doc['question'] + ' ' + doc['text'] for doc in documents]

# Embed the full texts
V_full = np.array(list(TextEmbedding(model_name='jinaai/jina-embeddings-v2-small-en').embed(full_texts)))

# Compute cosine similarities
scores_full = V_full.dot(q)

# Find the index of the highest score
highest_score_index_full = np.argmax(scores_full)
print(f"Highest similarity index (full text): {highest_score_index_full}")

Highest similarity index (full text): 0
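
To see how the whole ranking shifts rather than just the top hit, the two score arrays can be sorted and compared. This is a minimal sketch: the score values below are invented for illustration and are not the notebook's real similarities, and `ranking` is a hypothetical helper name.

```python
def ranking(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Invented scores, NOT real outputs; they only imitate the Q3 -> Q4 change
text_only_scores = [0.2, 0.9, 0.5, 0.4, 0.1]
full_text_scores = [0.95, 0.9, 0.5, 0.4, 0.1]

text_only_ranking = ranking(text_only_scores)
full_text_ranking = ranking(full_text_scores)

print(text_only_ranking)
print(full_text_ranking)
print('Top document changed:', text_only_ranking[0] != full_text_ranking[0])
```

With the real NumPy arrays from the cells above, `np.argsort(scores)[::-1]` gives the same kind of ordering.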


---

## Q5. Selecting the embedding model

**Question:**

What's the smallest dimensionality for models in fastembed?

- 128
- 256
- 384
- 512

**Explanation:**

The `fastembed` library provides a way to list all supported models and their metadata. We collect the dimensionality (`dim`) of every model in this list and take the minimum, which is 384.

In [5]:
from fastembed import TextEmbedding

models_list = TextEmbedding.list_supported_models()
dims = sorted(set(m['dim'] for m in models_list))

print('Available embedding dimensions:', dims)
print('Smallest dimension:', min(dims))

Available embedding dimensions: [384, 512, 768, 1024]
Smallest dimension: 384
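
Knowing the smallest dimension is only half of the task, since Q6 also needs a model name. The sketch below shows one way to pick out the matching models. It runs on a hand-written stand-in list rather than the real `list_supported_models()` output, and the `'model'` key name is an assumption (only `'dim'` appears in the cell above).

```python
# Hand-written stand-in for TextEmbedding.list_supported_models();
# the 'model' key is an assumption, 'dim' is the key used in the cell above.
sample_models = [
    {'model': 'jinaai/jina-embeddings-v2-small-en', 'dim': 512},
    {'model': 'BAAI/bge-small-en', 'dim': 384},
]

# Smallest dimensionality across all entries
smallest_dim = min(m['dim'] for m in sample_models)

# Every model that has exactly that dimensionality
smallest_models = [m['model'] for m in sample_models if m['dim'] == smallest_dim]

print(smallest_dim)
print(smallest_models)
```

On the real list, several models may share the smallest dimension, which is why the result is kept as a list.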


---

## Q6. Indexing with qdrant

**Question:**

For the last question, we will use more documents. We will select only FAQ records from our ml zoomcamp. Add them to qdrant using the model from Q5 (`BAAI/bge-small-en`).

When adding the data, use both question and answer fields:
```python
text = doc['question'] + ' ' + doc['text']
```

After the data is inserted, use the question from Q1 for querying the collection. What's the highest score in the results?

- 0.97
- 0.87
- 0.77
- 0.67

**Explanation:**

First, we fetch the full FAQ dataset and keep only the 'machine-learning-zoomcamp' documents. We initialize a `TextEmbedding` model for `BAAI/bge-small-en` and an in-memory `QdrantClient`, then create a collection whose vector size matches the model output and whose distance metric is cosine. Each document is represented by its `question` and `text` fields joined with a space; we embed these strings and `upsert` them into the collection as `PointStruct` objects. Finally, we embed the Q1 query with the same model and use the `search` method to retrieve the most similar document. Its score is 0.8703, which matches the 0.87 option.

In [6]:
import requests
from qdrant_client import QdrantClient, models

# Initialize the new embedding model
new_embedding_model = TextEmbedding(model_name="BAAI/bge-small-en")

# 1. Fetch Data
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

# 2. Filter Documents
ml_zoomcamp_documents = [
    {**doc, 'course': course['course']}
    for course in documents_raw
    if course['course'] == 'machine-learning-zoomcamp'
    for doc in course['documents']
]
print(f"Number of 'machine-learning-zoomcamp' documents: {len(ml_zoomcamp_documents)}")

# 3. Prepare documents for indexing
texts_to_embed = [f"{doc.get('question', '')} {doc.get('text', '')}".strip() for doc in ml_zoomcamp_documents]

# 4. Setup Qdrant Client and Collection
qdrant_client = QdrantClient(":memory:")
collection_name = "ml-zoomcamp-faq"
vector_size = len(list(new_embedding_model.embed("test"))[0])

qdrant_client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(size=vector_size, distance=models.Distance.COSINE),
)

# 5. Index the Data
document_embeddings = list(new_embedding_model.embed(texts_to_embed))
points = [
    models.PointStruct(id=idx, vector=emb, payload=doc)
    for idx, (doc, emb) in enumerate(zip(ml_zoomcamp_documents, document_embeddings))
]
qdrant_client.upsert(collection_name=collection_name, points=points, wait=True)

# 6. Search the collection
query_q6 = 'I just discovered the course. Can I join now?'
query_embedding_q6 = list(new_embedding_model.embed(query_q6))[0]

search_results = qdrant_client.search(
    collection_name=collection_name,
    query_vector=query_embedding_q6,
    limit=1,
)

# 7. Find Highest Score
if search_results:
    highest_score = search_results[0].score
    print(f"The highest score in the search results is: {highest_score:.4f}")
else:
    print("No search results found.")

Number of 'machine-learning-zoomcamp' documents: 375
The highest score in the search results is: 0.8703
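
The filtering step depends on the nested layout of `documents.json`: a list of courses, each holding its own `documents` list. As a small sketch, the same comprehension can be run on a toy stand-in (the contents of `toy_raw` are hypothetical placeholders; only the nesting follows what the code above expects):

```python
# Toy stand-in for the parsed documents.json (hypothetical contents)
toy_raw = [
    {
        'course': 'machine-learning-zoomcamp',
        'documents': [
            {'question': 'Q-a', 'text': 'A-a', 'section': 'S1'},
            {'question': 'Q-b', 'text': 'A-b', 'section': 'S1'},
        ],
    },
    {
        'course': 'data-engineering-zoomcamp',
        'documents': [
            {'question': 'Q-c', 'text': 'A-c', 'section': 'S2'},
        ],
    },
]

# Flatten: keep documents of one course and copy the course name into each
flat = [
    {**doc, 'course': course['course']}
    for course in toy_raw
    if course['course'] == 'machine-learning-zoomcamp'
    for doc in course['documents']
]

# Build the strings that would be embedded
texts = [f"{d.get('question', '')} {d.get('text', '')}".strip() for d in flat]

print(len(flat))
print(texts)
```

Using `{**doc, ...}` builds a new dictionary per document, so the raw data is left unmodified.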



In [None]:
# `search` is deprecated, so the next cell uses `query_points` instead.
# `query_points` takes the vector via `query` (not `query_vector`) and returns
# a response object whose hits are in `.points`.

In [None]:
import requests
from qdrant_client import QdrantClient, models

# 1. Fetch Data
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

# 2. Filter Documents for 'machine-learning-zoomcamp'
ml_zoomcamp_documents = []
for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue
    for doc in course['documents']:
        doc['course'] = course_name
        ml_zoomcamp_documents.append(doc)

print(f"Number of 'machine-learning-zoomcamp' documents: {len(ml_zoomcamp_documents)}")

# 3. Prepare Documents for Indexing
texts_to_embed = []
for doc in ml_zoomcamp_documents:
    question_content = doc.get('question', '')
    text_content = doc.get('text', '')
    texts_to_embed.append(question_content + ' ' + text_content)

# 4. Setup Qdrant Client
qdrant_client = QdrantClient(":memory:")

# Define the collection name
collection_name = "ml-zoomcamp-faq"

# Get the vector size from the embedding model
dummy_embedding_q5 = list(new_embedding_model.embed("test"))[0]
vector_size_q5 = dummy_embedding_q5.shape[0]

# Recreate collection
if qdrant_client.collection_exists(collection_name=collection_name):
    print(f"Collection '{collection_name}' already exists. Deleting and recreating...")
    qdrant_client.delete_collection(collection_name=collection_name)
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(size=vector_size_q5, distance=models.Distance.COSINE),
)
print(f"Collection '{collection_name}' created.")

# 5. Index the Data
print("Generating embeddings for documents...")
document_embeddings_q5 = list(new_embedding_model.embed(texts_to_embed))
print("Embeddings generated. Adding to Qdrant...")

qdrant_client.upsert(
    collection_name=collection_name,
    wait=True,
    points=models.Batch(
        ids=list(range(len(ml_zoomcamp_documents))),
        vectors=document_embeddings_q5,
        payloads=ml_zoomcamp_documents,
    ),
)
print(f"Added {len(ml_zoomcamp_documents)} documents to Qdrant collection '{collection_name}'.")

# 6. Search the Collection
query_q6 = 'I just discovered the course. Can I join now?'
query_embedding_q6 = list(new_embedding_model.embed(query_q6))[0]

search_results = qdrant_client.query_points(
    collection_name=collection_name,
    query=query_embedding_q6.tolist(),
    limit=1,
    with_payload=True,
)

# 7. Find Highest Score
if search_results and search_results.points:
    highest_score = search_results.points[0].score
    print(f"The highest score in the search results is: {highest_score:.4f}")
else:
    print("No search results found.")

In [7]:
# Modular version: the same Q6 pipeline wrapped in functions

In [None]:
import requests
from qdrant_client import QdrantClient, models
from fastembed import TextEmbedding

def fetch_documents(url: str) -> list[dict]:
    """Fetches and parses JSON documents from a URL."""
    print(f"Fetching documents from {url}...")
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for bad status codes
    print("Documents fetched successfully.")
    return response.json()

def filter_documents_by_course(documents_raw: list[dict], course_name: str) -> list[dict]:
    """Filters documents for a specific course."""
    return [
        {**doc, 'course': course['course']}
        for course in documents_raw
        if course['course'] == course_name
        for doc in course['documents']
    ]

def main():
    # 1. Configuration
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    course_to_filter = 'machine-learning-zoomcamp'
    collection_name = "ml_zoomcamp_faq_simplified"
    embedding_model_name = "BAAI/bge-small-en"
    search_query = 'I just discovered the course. Can I join now?'

    # 2. Initialize models and clients
    embedding_model = TextEmbedding(model_name=embedding_model_name)
    qdrant_client = QdrantClient(":memory:")  # In-memory storage

    # 3. Fetch and process data
    documents_raw = fetch_documents(docs_url)
    ml_documents = filter_documents_by_course(documents_raw, course_to_filter)
    print(f"Found {len(ml_documents)} documents for '{course_to_filter}'.")

    texts_to_embed = [
        f"{doc.get('question', '')} {doc.get('text', '')}".strip()
        for doc in ml_documents
    ]

    # 4. Setup and create Qdrant collection
    print("Setting up Qdrant collection...")
    # Use the model to get the vector size
    vector_size = len(list(embedding_model.embed("A test sentence to get vector size"))[0])

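    # recreate_collection is deprecated: delete any existing collection, then create it
    if qdrant_client.collection_exists(collection_name=collection_name):
        qdrant_client.delete_collection(collection_name=collection_name)
    qdrant_client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=vector_size, distance=models.Distance.COSINE),
    )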
    print(f"Collection '{collection_name}' created with vector size {vector_size}.")

    # 5. Generate embeddings and index the documents
    print("Generating embeddings and indexing documents...")
    embeddings = list(embedding_model.embed(texts_to_embed))

    qdrant_client.upsert(
        collection_name=collection_name,
        points=[
            models.PointStruct(id=idx, vector=emb, payload=doc)
            for idx, (doc, emb) in enumerate(zip(ml_documents, embeddings))
        ],
        wait=True,
    )
    print(f"Indexed {len(ml_documents)} documents.")

    # 6. Perform the search
    print(f"Searching for: '{search_query}'")
    query_embedding = list(embedding_model.embed(search_query))[0]

    # query_points replaces the deprecated search method
    search_results = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embedding.tolist(),
        limit=1,  # Get the top result
        with_payload=True,
    )

    # 7. Display the top result (hits are in the .points attribute)
    if search_results.points:
        top_result = search_results.points[0]
        print(f"\nHighest score: {top_result.score:.4f}")
        print("Top result payload:")
        print(top_result.payload)
    else:
        print("No search results found.")

if __name__ == "__main__":
    main()