## Vector Search with Qdrant

In [1]:
pip install -q "qdrant-client[fastembed]>=1.14.2"

Note: you may need to restart the kernel to use updated packages.


### We now import the required libraries and connect to Qdrant

In [2]:
from qdrant_client import QdrantClient, models

`QdrantClient` establishes a connection with the Qdrant service, and `models` provides definitions for the various configurations and parameters.

In [3]:
client = QdrantClient("http://localhost:6333")

### Importing our FAQ dataset

In [4]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
raw_doc = docs_response.json()

In [5]:
#raw_doc

The dataset covers three courses (data-engineering-zoomcamp, machine-learning-zoomcamp, and mlops-zoomcamp), each with a list of question-and-answer pairs.
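
To make the shape of the data concrete, here is a minimal sketch that counts FAQ entries per course. The `sample_docs` list is a hypothetical stand-in with the same structure the later cells rely on (a list of courses, each with a `course` name and a `documents` list); its texts are placeholders, not real FAQ entries, and `count_documents` is an illustrative helper.

```python
from collections import Counter

# Hypothetical stand-in for raw_doc: same structure, placeholder content.
sample_docs = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"text": "placeholder answer 1", "section": "placeholder section",
             "question": "placeholder question 1"},
            {"text": "placeholder answer 2", "section": "placeholder section",
             "question": "placeholder question 2"},
        ],
    },
    {
        "course": "mlops-zoomcamp",
        "documents": [
            {"text": "placeholder answer 3", "section": "placeholder section",
             "question": "placeholder question 3"},
        ],
    },
]


def count_documents(courses):
    """Return a Counter mapping each course name to its number of FAQ entries."""
    counts = Counter()
    for course in courses:
        counts[course["course"]] += len(course["documents"])
    return counts


counts = count_documents(sample_docs)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Calling `count_documents(raw_doc)` on the downloaded data should report one count per course.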

### Choosing the Embedding Model
Since we are embedding course-related question-and-answer pairs, which are small chunks of English text, we need an embedding model suited to short English text to convert this data into vectors.

#### FastEmbed for Textual Data

In [6]:
from fastembed import TextEmbedding
TextEmbedding.list_supported_models()

[{'model': 'BAAI/bge-base-en',
  'sources': {'hf': 'Qdrant/fast-bge-base-en',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.42,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model': 'BAAI/bge-base-en-v1.5',
  'sources': {'hf': 'qdrant/bge-base-en-v1.5-onnx-q',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.21,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model':

We choose a model that produces 512-dimensional embeddings, a small-to-moderate size that keeps resource usage low.

In [7]:
import json

EMBEDDING_DIMENSIONALITY = 512

for model in TextEmbedding.list_supported_models():
    if model["dim"] == EMBEDDING_DIMENSIONALITY:
        print(json.dumps(model, indent=2))

{
  "model": "BAAI/bge-small-zh-v1.5",
  "sources": {
    "hf": "Qdrant/bge-small-zh-v1.5",
    "url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
    "_deprecated_tar_struct": true
  },
  "model_file": "model_optimized.onnx",
  "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
  "license": "mit",
  "size_in_GB": 0.09,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "Qdrant/clip-ViT-B-32-text",
  "sources": {
    "hf": "Qdrant/clip-ViT-B-32-text",
    "url": null,
    "_deprecated_tar_struct": false
  },
  "model_file": "model.onnx",
  "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
  "license": "mit",
  "size_in_GB": 0.25,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "jinaai/jina-embeddings-v2-small-e

Among the 512-dimensional models, we choose one that is suitable for English text and is unimodal, since we are dealing only with text data.

In [8]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

### Creating a collection

In [9]:
# Define the collection name
collection_name = "FAQ_Search-rag"

# Create the collection with specified vector parameters
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

True
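
For intuition about the `COSINE` setting, here is a minimal pure-Python sketch of cosine similarity, where values closer to 1 mean the vectors point in more similar directions. It is not how Qdrant computes scores internally; the `cosine_similarity` helper and the toy 3-dimensional vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors: the second is a scaled copy of the first, so the angle is zero.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors share no direction at all.
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction, orthogonal)
```

Because the measure ignores vector length, two texts of different sizes can still score as close when their embeddings point the same way.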

#### Creating, embedding & inserting points into the collection

In [10]:
points = []
point_id = 0  # running integer ID; named point_id to avoid shadowing the built-in id()

for course in raw_doc:
    for doc in course['documents']:

        point = models.PointStruct(
            id=point_id,
            # embed text locally with "jinaai/jina-embeddings-v2-small-en" from FastEmbed
            vector=models.Document(text=doc['text'], model=model_handle),
            payload={
                "text": doc['text'],
                "section": doc['section'],
                "course": course['course']
            }
        )
        points.append(point)

        point_id += 1

#### We now embed and upload points to our collection

In [11]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

### Visualizing the uploaded data

We can study the semantic similarity visually by exploring the uploaded data in the Qdrant Web UI at http://localhost:6333/dashboard 

In the Visualize tab of the FAQ_Search-rag collection, we run the following request to get a 2D representation of the course content, colored by course:
{
  "limit": 948,
  "color_by": {
    "payload": "course"
  }
}

### Running a Similarity Search
We want to find the most relevant answer to a given question

In [12]:
def search(query, limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle 
        ),
        limit=limit, 
        with_payload=True 
    )

    return results

#### We pick a random question from the dataset, since we uploaded only the answers to Qdrant and not the questions

In [13]:
import random

course = random.choice(raw_doc)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent=2))

{
  "text": "In this video, we store the data file as \u201coutput.csv\u201d. The data file won\u2019t store correctly if the file extension is csv.gz instead of csv. One alternative is to replace csv_name = \u201coutput.cs -v\u201d with the file name given at the end of the URL. Notice that the URL for the yellow taxi data is: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz where the highlighted part is the name of the file. We can parse this file name from the URL and use it as csv_name. That is, we can replace csv_name = \u201coutput.csv\u201d with\ncsv_name = url.split(\u201c/\u201d)[-1] . Then when we use csv_name to using pd.read_csv, there won\u2019t be an issue even though the file name really has the extension csv.gz instead of csv since the pandas read_csv function can read csv.gz files directly.",
  "section": "Module 1: Docker and Terraform",
  "question": "Taxi Data - How to handle taxi data files, now that the files ar

##### We now check the answer obtained from the random question

In [14]:
result = search(course_piece['question'])
result

QueryResponse(points=[ScoredPoint(id=45, version=0, score=0.88844717, payload={'text': 'In this video, we store the data file as “output.csv”. The data file won’t store correctly if the file extension is csv.gz instead of csv. One alternative is to replace csv_name = “output.cs -v” with the file name given at the end of the URL. Notice that the URL for the yellow taxi data is: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz where the highlighted part is the name of the file. We can parse this file name from the URL and use it as csv_name. That is, we can replace csv_name = “output.csv” with\ncsv_name = url.split(“/”)[-1] . Then when we use csv_name to using pd.read_csv, there won’t be an issue even though the file name really has the extension csv.gz instead of csv since the pandas read_csv function can read csv.gz files directly.', 'section': 'Module 1: Docker and Terraform', 'course': 'data-engineering-zoomcamp'}, vector=None, sha
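
The raw `QueryResponse` is hard to read at a glance. As a minimal sketch, the hypothetical `summarize` helper below pulls out the `id`, `score`, and `payload` attributes visible in the output above and shortens the text. The `fake_response` object is a stand-in built with `SimpleNamespace` so the example runs without a Qdrant server.

```python
from types import SimpleNamespace


def summarize(response, max_chars=80):
    """Return one small dict per scored point: id, rounded score, course, text preview."""
    rows = []
    for point in response.points:
        text = point.payload.get("text", "")
        preview = text if len(text) <= max_chars else text[:max_chars] + "..."
        rows.append({
            "id": point.id,
            "score": round(point.score, 4),
            "course": point.payload.get("course"),
            "preview": preview,
        })
    return rows


# Stand-in mimicking only the attributes used above (not a real QueryResponse).
fake_response = SimpleNamespace(points=[
    SimpleNamespace(
        id=45,
        score=0.88844717,
        payload={
            "text": "In this video, we store the data file as output.csv.",
            "course": "data-engineering-zoomcamp",
        },
    )
])

rows = summarize(fake_response, max_chars=20)
print(rows)
```

With a live connection, `summarize(result)` can be used on the `result` returned by `search`.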

##### Let's perform another search, this time with a question from the FAQ database

In [15]:
print(search("Can I get a course certificate without submitting assignments?").points[0].payload['text'])

Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.
In order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.


#### How do we refine our search to get precise answers?
We can use metadata filters to restrict the search to a single course.

### Similarity Search with Filters

In [16]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword" # exact matching on string metadata fields
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

##### We now update our search function for efficient filtering

In [17]:
def course_related_search(query, course="data-engineering-zoomcamp", limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit,
        with_payload=True 
    )

    return results
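
Conceptually, the `must` condition keeps only points whose `course` payload equals the given value exactly, and similarity ranking then applies to what remains. The toy sketch below mimics that behaviour in plain Python on hand-made 2-dimensional unit vectors. `filtered_top_k` and the sample points are hypothetical, and the sketch says nothing about how Qdrant implements filtering internally.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_top_k(points, query_vector, course, k=1):
    """Keep points whose payload course matches exactly, then rank by similarity."""
    candidates = [p for p in points if p["payload"].get("course") == course]
    ranked = sorted(candidates, key=lambda p: dot(query_vector, p["vector"]), reverse=True)
    return ranked[:k]


# Hypothetical toy collection with unit-length vectors.
toy_points = [
    {"id": 0, "vector": [1.0, 0.0], "payload": {"course": "mlops-zoomcamp"}},
    {"id": 1, "vector": [0.8, 0.6], "payload": {"course": "data-engineering-zoomcamp"}},
    {"id": 2, "vector": [0.0, 1.0], "payload": {"course": "data-engineering-zoomcamp"}},
]
query = [1.0, 0.0]

# Without a filter, the closest point belongs to a different course.
unfiltered_best = sorted(toy_points, key=lambda p: dot(query, p["vector"]), reverse=True)[0]
# With the filter, only data-engineering-zoomcamp points compete.
filtered_best = filtered_top_k(toy_points, query, "data-engineering-zoomcamp")[0]

print(unfiltered_best["id"], filtered_best["id"])
```

This is why the same question can return a different answer once a course filter is applied: the globally closest point may simply be excluded.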

In [18]:
print(course_related_search("Can I get a course certificate without submitting assignments?", "data-engineering-zoomcamp").points[0].payload['text'])

No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.


##### We can vary the courses and see how the same question is answered
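
One way to do this is to loop over the three course names and collect the top answer for each. The sketch below is a hedged illustration: `answers_by_course` is a hypothetical helper that accepts any search function with the same call shape as `course_related_search`, and `fake_search` is a stub so the example runs without a Qdrant server.

```python
from types import SimpleNamespace

COURSES = [
    "data-engineering-zoomcamp",
    "machine-learning-zoomcamp",
    "mlops-zoomcamp",
]


def answers_by_course(search_fn, question, courses=COURSES):
    """Run the same question against each course and return the top answer text per course."""
    answers = {}
    for course in courses:
        response = search_fn(question, course)
        # Guard against an empty result set for a course.
        answers[course] = response.points[0].payload["text"] if response.points else None
    return answers


def fake_search(query, course, limit=1):
    """Stub standing in for course_related_search; returns a canned answer."""
    point = SimpleNamespace(payload={"text": f"stub answer for {course}"})
    return SimpleNamespace(points=[point])


answers = answers_by_course(
    fake_search,
    "Can I get a course certificate without submitting assignments?",
)
for course, text in answers.items():
    print(f"{course}: {text}")
```

With a running Qdrant instance, pass `course_related_search` in place of `fake_search` to compare the real answers.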