# Bookline RAG Toy Playground

This notebook demonstrates how to use the [OpenAI API](https://platform.openai.com) to embed documents, index them in a [Qdrant vector database](https://qdrant.tech/), and run a similarity search over the indexed documents.

The train folder contains a small database of 15594 documents that you can use to experiment and build an intuition for how each step works.

# Importing Libraries

In [110]:
import os
import openai
import qdrant_client
from pathlib import Path

In [111]:
from qdrant_client.http.models import PointStruct
from qdrant_client.http.models import VectorParams, Distance

# Global Variables

We will define a few global variables here:

- `BASE_DATA_DIR`: The base directory where the data is stored.
- `EMBEDDING_MODEL`: The OpenAI embedding model to use for embedding the documents.
- `COLLECTION_NAME`: The name of the Qdrant collection to store the documents.
- `OPENAI_SECRET_KEY_FILE`: The path to a JSON file containing the OpenAI secret key.
- `MAX_DOCUMENTS_TO_LOAD`: The maximum number of documents to load from the dataset.

In [112]:
BASE_DATA_DIR = Path('../../data')
EMBEDDING_MODEL = 'text-embedding-3-small'
COLLECTION_NAME = "my_document_collection"
OPENAI_SECRET_KEY_FILE = "../../secrets/openai_api_key.json"
MAX_DOCUMENTS_TO_LOAD = 100

In [113]:
try:
    openai_client
except NameError:
    openai_client = None

# Helper Functions

We will import the helper functions to load the data, get the OpenAI secret key, and embed the documents.
The helper functions are defined in the `src/helper_functions.py` file.

In [114]:
from src.helper_functions import load_data, get_openai_api_key, embed_documents
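
The source of `src/helper_functions.py` is not shown in this notebook. As a rough idea of what a key-loading helper could look like, here is a minimal sketch. The function name `read_api_key` and the JSON field name `api_key` are assumptions made for illustration, not the actual implementation of `get_openai_api_key`.

```python
import json
from pathlib import Path


def read_api_key(path, field="api_key"):
    """Return the secret stored under `field` in a JSON file.

    Both the function name and the default field name "api_key" are
    assumptions for this sketch; the real file layout is not shown here.
    """
    with Path(path).open(encoding="utf-8") as handle:
        secrets = json.load(handle)
    return secrets[field]
```

Keeping the key in a file outside the notebook means the secret never appears in a cell or in its saved output.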

# Loading the OpenAI secret key and Creating an In-Memory Qdrant Database

We will load the OpenAI secret key from the `secrets/openai_api_key.json` file and create an in-memory Qdrant database to index the documents.

In [115]:
if openai_client is None:
    openai_client = openai.Client(
        api_key=get_openai_api_key(OPENAI_SECRET_KEY_FILE),
    )

In [116]:
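# Create an in-memory Qdrant database. This rebinds `qdrant_client` from the
# module to the client instance, so re-running the cell fails with an
# AttributeError unless the module is imported again first.
qdrant_client = qdrant_client.QdrantClient(":memory:")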

# Loading the Documents

We will use the [feedback-prize-2021](https://www.kaggle.com/competitions/feedback-prize-2021/data) dataset from Kaggle as an example. To download the dataset, you can use the following command in the root directory of the project:

`bash bin/download_kaggle_dataset.sh "competition" "feedback-prize-2021" "data/input/kaggle_competitions/fp1"`

In [117]:
data = load_data(BASE_DATA_DIR / 'db')[:MAX_DOCUMENTS_TO_LOAD]

In [118]:
data.head()

| Unnamed: 0 | id | text |
|---|---|---|
| 0 | 3321A3E87AD3 | I do agree that some students would benefit fr... |
| 1 | DFEAEC512BAB | Should students design a summer project for sc... |
| 2 | 2E4AFCD3987F | Dear State Senator\n\n,\n\nIn the ruels of vot... |
| 3 | EB6C2AF20BFE | People sometimes have a different opinion than... |
| 4 | A91A08E523D5 | Dear senator,\n\nAs you know the Electoral Col... |


# Embedding the Documents

We will embed the documents through OpenAI's API with the `text-embedding-3-small` model.

In [119]:
embeddings = embed_documents(data['text'], openai_client, EMBEDDING_MODEL)

In [120]:
embeddings.data[0].embedding[:10]

[-0.008226783946156502,
 -0.013497698120772839,
 0.01179460994899273,
 0.060780055820941925,
 0.06498292833566666,
 -0.013740171678364277,
 -0.02018304169178009,
 0.005063080228865147,
 0.009854820556938648,
 0.06174995005130768]
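
Before indexing, it can be worth checking that every embedding has the same length, since a vector collection is created with one fixed size. The helper below is a small sketch for that check; the name `common_dimension` and the toy vectors are made up for illustration.

```python
def common_dimension(vectors):
    """Return the shared length of all vectors, or raise if they differ."""
    lengths = {len(vector) for vector in vectors}
    if len(lengths) != 1:
        raise ValueError(f"expected one vector length, found {sorted(lengths)}")
    return lengths.pop()


# Toy stand-ins for real embeddings (made up for illustration).
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(common_dimension(toy_vectors))  # 3
```

In this notebook, the same check could presumably be applied as `common_dimension([item.embedding for item in embeddings.data])`.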

In [121]:
data_with_embeddings = data.copy()
data_with_embeddings['embedding'] = embeddings.data

In [122]:
data_with_embeddings.head()

| Unnamed: 0 | id | text | embedding |
|---|---|---|---|
| 0 | 3321A3E87AD3 | I do agree that some students would benefit fr... | Embedding(embedding=[-0.008226783946156502, -0... |
| 1 | DFEAEC512BAB | Should students design a summer project for sc... | Embedding(embedding=[-0.0008782672230154276, 0... |
| 2 | 2E4AFCD3987F | Dear State Senator\n\n,\n\nIn the ruels of vot... | Embedding(embedding=[0.040929023176431656, 0.0... |
| 3 | EB6C2AF20BFE | People sometimes have a different opinion than... | Embedding(embedding=[0.04807731509208679, -0.0... |
| 4 | A91A08E523D5 | Dear senator,\n\nAs you know the Electoral Col... | Embedding(embedding=[0.04964446648955345, -0.0... |


# Indexing the Documents in Qdrant Database

We will index the documents in the Qdrant database, storing each document's embedding as the vector and its text as the payload.
With the embeddings stored as vectors, we can run vector-based queries that compare a query embedding against the documents in our database.

In [123]:
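points = [
    PointStruct(
        id=idx,
        vector=item.embedding,  # raw list of floats inside the Embedding object
        payload={"text": text},
    )
    # `item` rather than `data`, to avoid confusion with the `data` DataFrame
    for idx, (item, text) in enumerate(zip(data_with_embeddings['embedding'], data_with_embeddings['text']))
]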

In [124]:
qdrant_client.create_collection(
    COLLECTION_NAME,
    vectors_config=VectorParams(
        size=1536,  # default output dimension of text-embedding-3-small
        distance=Distance.COSINE,
    ),
)

qdrant_client.upsert(COLLECTION_NAME, points)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
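
The collection is configured with `Distance.COSINE`. As a reminder of what that metric measures, here is a plain-Python sketch of cosine similarity on toy vectors. It only illustrates the formula; it is not how Qdrant computes scores internally, and the function name and vectors are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up for illustration).
toy_query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(toy_query, same_direction))  # 1.0
print(cosine_similarity(toy_query, orthogonal))      # 0.0
```

Because the score depends only on direction, two vectors of different magnitude that point the same way still score 1.0.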

# Searching for Similar Documents

Let's search for documents similar to a sample query. We will use cosine similarity and limit the search results to 20 documents with a similarity score of 0.1 or higher.

In [125]:
query = "Bookline"

results = qdrant_client.search(
    collection_name=COLLECTION_NAME,
    query_vector=openai_client.embeddings.create(
        input=[query],
        model=EMBEDDING_MODEL,
    ).data[0].embedding,
    limit=20,
    score_threshold=0.1,
)
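
To make the roles of `limit` and `score_threshold` concrete, the sketch below applies the same two constraints to a hand-written list of scores. It assumes, as the description above does, that results at or above the threshold are kept and returned best first; the function name `top_hits` and the scores are made up for illustration.

```python
def top_hits(scored, limit, score_threshold):
    """Keep pairs scoring at least `score_threshold`, best first, at most `limit`."""
    kept = [(doc_id, score) for doc_id, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Made-up (id, cosine score) pairs, for illustration only.
toy_scores = [(0, 0.05), (1, 0.42), (2, 0.18), (3, 0.31)]

print(top_hits(toy_scores, limit=2, score_threshold=0.1))
# [(1, 0.42), (3, 0.31)]
```

A low threshold such as 0.1 filters out very little, so with a short query the `limit` is usually what bounds the result list.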

In [109]:
results

[ScoredPoint(id=40, version=0, score=0.18126124123354687, payload={'text': 'Luke Bomberger became a Seagoing Cowboy he helps people whose lives had been affected by World War ll. Luke became a Seagoing\n\nCowboy because he knew it was an opportunity of a lifetime, as he says in paragraph one. Luke also wanted to help others who had been affected by World War ll.\n\nLuke crossed the Atlantic Ocean sixteen times and the Pacific Ocean twice, to help countries recover their food supplies, animals, and much more as it says in paragraph two. Even though it was a long journey across the oceans and the waters were rough Luke loved being a Seagoing Cowboy and he loved to help the people.\n\nSome people would probably refuse to go to another country that had just been through war some people would say it was to risky and unsafe well it is, some people might say that it is too dangerous to cross the ocean well it is but Luke felt that these were things that he needed to do and is something that i