# Bookline RAG Toy Playground

This notebook demonstrates how to use the [OpenAI API](https://platform.openai.com) to embed documents, index them in a [Qdrant vector database](https://qdrant.tech/), and run a similarity search over the indexed documents.

The train folder contains a small database of 15,594 documents that you can use to experiment and build intuition for how the pieces fit together.

# Importing Libraries

In [1]:
import os
import openai
import qdrant_client
from pathlib import Path

In [2]:
from qdrant_client.http.models import PointStruct
from qdrant_client.http.models import VectorParams, Distance

# Global Variables

We will define a few global variables here:

- `BASE_DATA_DIR`: The base directory where the data is stored.
- `EMBEDDING_MODEL`: The OpenAI embedding model to use for embedding the documents.
- `COLLECTION_NAME`: The name of the Qdrant collection to store the documents.
- `OPENAI_SECRET_KEY_FILE`: The path to a JSON file containing the OpenAI secret key.
- `MAX_DOCUMENTS_TO_LOAD`: The maximum number of documents to load from the dataset.

In [3]:
BASE_DATA_DIR = Path('../../data')
EMBEDDING_MODEL = 'text-embedding-3-small'
COLLECTION_NAME = "my_document_collection"
OPENAI_SECRET_KEY_FILE = "../../secrets/openai_api_key.json"
MAX_DOCUMENTS_TO_LOAD = 100

In [4]:
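# Keep an existing client when this cell is re-run; otherwise start with None
# so that the client gets created further down.
try:
    openai_client
except NameError:
    openai_client = None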

# Helper Functions

We will import the helper functions to load the data, get the OpenAI secret key, and embed the documents.
The helper functions are defined in the `helper_functions.py` file.

In [5]:
from helper_functions import load_data, get_openai_api_key, embed_documents
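
`helper_functions.py` is not reproduced in this notebook. As a rough idea of what two of the helpers might do, here is a minimal sketch. The names `read_api_key_sketch` and `embed_texts_sketch` and the JSON field name `api_key` are assumptions made for illustration, not the project's actual implementation. The only real API call is `client.embeddings.create(input=..., model=...)` from the OpenAI Python client.

```python
import json
from pathlib import Path


def read_api_key_sketch(key_file, field="api_key"):
    """Read a secret from a JSON file.

    Assumption: the file looks like {"api_key": "sk-..."}; the real
    helper may expect a different field name.
    """
    with Path(key_file).open(encoding="utf-8") as fh:
        return json.load(fh)[field]


def embed_texts_sketch(texts, client, model):
    """Embed a sequence of texts with a single embeddings request.

    `client` is expected to expose `embeddings.create`, as `openai.Client`
    does. The response's `.data` list holds one item per input text.
    """
    return client.embeddings.create(input=list(texts), model=model)
```

Passing the client in as an argument, rather than creating it inside the helper, keeps the function easy to exercise with a stand-in object that needs no network access.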

# Loading the OpenAI Secret Key and Creating an In-Memory Qdrant Database

We will load the OpenAI secret key from the `secrets/openai_api_key.json` file and create an in-memory Qdrant database to index the documents.

In [6]:
if openai_client is None:
    openai_client = openai.Client(
        api_key=get_openai_api_key(OPENAI_SECRET_KEY_FILE),
    )

In [7]:
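# In-memory database creation.
# Note: this rebinds `qdrant_client` from the imported module to the client
# instance, so re-running this cell fails unless the module is re-imported first.
qdrant_client = qdrant_client.QdrantClient(":memory:")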

# Loading the Documents

We will use the [feedback-prize-2021](https://www.kaggle.com/competitions/feedback-prize-2021/data) dataset from Kaggle as an example. To download the dataset, you can use the following command in the root directory of the project:

`bash bin/download_kaggle_dataset.sh "competition" "feedback-prize-2021" "data/input/kaggle_competitions/fp1"`

In [8]:
data = load_data(BASE_DATA_DIR / 'db')[:MAX_DOCUMENTS_TO_LOAD]

In [9]:
data.head()

| Unnamed: 0 | id | text |
|---|---|---|
| 0 | 3321A3E87AD3 | I do agree that some students would benefit fr... |
| 1 | DFEAEC512BAB | Should students design a summer project for sc... |
| 2 | 2E4AFCD3987F | Dear State Senator\n\n,\n\nIn the ruels of vot... |
| 3 | EB6C2AF20BFE | People sometimes have a different opinion than... |
| 4 | A91A08E523D5 | Dear senator,\n\nAs you know the Electoral Col... |


# Embedding the Documents

We will embed the documents with OpenAI's API, using the `text-embedding-3-small` model.

In [10]:
embeddings = embed_documents(data['text'], openai_client, EMBEDDING_MODEL)

In [11]:
embeddings.data[0].embedding[:10]

[-0.008226783946156502,
 -0.013497698120772839,
 0.01179460994899273,
 0.060780055820941925,
 0.06498292833566666,
 -0.013740171678364277,
 -0.02018304169178009,
 0.005063080228865147,
 0.009854820556938648,
 0.06174995005130768]

In [12]:
data_with_embeddings = data.copy()
data_with_embeddings['embedding'] = embeddings.data

In [13]:
data_with_embeddings.head()

| Unnamed: 0 | id | text | embedding |
|---|---|---|---|
| 0 | 3321A3E87AD3 | I do agree that some students would benefit fr... | Embedding(embedding=[-0.008226783946156502, -0... |
| 1 | DFEAEC512BAB | Should students design a summer project for sc... | Embedding(embedding=[-0.0008782672230154276, 0... |
| 2 | 2E4AFCD3987F | Dear State Senator\n\n,\n\nIn the ruels of vot... | Embedding(embedding=[0.040929023176431656, 0.0... |
| 3 | EB6C2AF20BFE | People sometimes have a different opinion than... | Embedding(embedding=[0.04807731509208679, -0.0... |
| 4 | A91A08E523D5 | Dear senator,\n\nAs you know the Electoral Col... | Embedding(embedding=[0.04964446648955345, -0.0... |


# Indexing the Documents in Qdrant Database

We will index the documents in the Qdrant database, using each document's embedding as the vector and its text as the payload.
Storing the embeddings as vectors lets us run vector-based queries that match a query against the documents in the database.

In [14]:
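points = [
    PointStruct(
        id=idx,
        # `emb` is an OpenAI Embedding object; Qdrant needs its raw list of floats.
        vector=emb.embedding,
        payload={"text": text},
    )
    # The loop variable is named `emb` so that it does not reuse the name of the `data` DataFrame.
    for idx, (emb, text) in enumerate(zip(data_with_embeddings['embedding'], data_with_embeddings['text']))
]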

In [15]:
qdrant_client.create_collection(
    COLLECTION_NAME,
    vectors_config=VectorParams(
        size=1536,  # default output dimension of text-embedding-3-small
        distance=Distance.COSINE,
    ),
)

qdrant_client.upsert(COLLECTION_NAME, points)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
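
The vector size given to `create_collection` has to match the length of the vectors being stored. Instead of hard-coding it, the size could be derived from the embeddings themselves. A minimal sketch follows; `infer_vector_size` is a hypothetical helper, not part of Qdrant or of this project.

```python
def infer_vector_size(vectors):
    """Return the common length of all vectors, or raise if they differ."""
    sizes = {len(v) for v in vectors}
    if len(sizes) != 1:
        raise ValueError(f"expected one vector size, found {sorted(sizes)}")
    return sizes.pop()


# Toy example with 3-dimensional vectors.
size = infer_vector_size([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(size)  # 3
```

In this notebook the call would presumably look like `infer_vector_size(p.vector for p in points)`, and the result could be passed as `size=` to `VectorParams`.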

# Searching for Similar Documents

Let's search for documents similar to a sample query. We will use cosine similarity and limit the results to 20 documents with a similarity score of 0.1 or higher.
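
To make the `limit` and `score_threshold` parameters concrete, here is a brute-force sketch of the same idea in plain Python. It only illustrates cosine-similarity ranking and is not how Qdrant implements search; `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, limit, score_threshold):
    """Score every vector against the query, keep scores at or above the
    threshold, and return at most `limit` (index, score) pairs, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    kept = [(i, s) for i, s in scored if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Toy 2-dimensional "embeddings".
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], docs, limit=2, score_threshold=0.1)
print([i for i, _ in hits])  # [0, 2]
```

Document 1 is orthogonal to the query, so its score of 0 falls below the threshold and it is dropped before the limit is applied.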

In [16]:
query = "technology in schools"

results = qdrant_client.search(
    collection_name=COLLECTION_NAME,
    query_vector=openai_client.embeddings.create(
        input=[query],
        model=EMBEDDING_MODEL,
    ).data[0].embedding,
    limit=20,
    score_threshold=0.1,
)
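
Each hit returned by the search carries an `id`, a `score`, and the `payload` stored at indexing time. Because the full texts are long, a compact view can be easier to scan. A minimal sketch follows; `summarize_hits` is a hypothetical helper, and the toy hit below stands in for a real search result.

```python
from types import SimpleNamespace


def summarize_hits(hits, width=60):
    """Return one line per hit: id, score, and the start of the text."""
    lines = []
    for hit in hits:
        text = hit.payload["text"].replace("\n", " ")
        snippet = text if len(text) <= width else text[:width] + "..."
        lines.append(f"{hit.id:>4}  {hit.score:.3f}  {snippet}")
    return lines


# Stand-in object with the same attributes as a search result.
toy_hits = [
    SimpleNamespace(id=26, score=0.4816, payload={"text": "Many schools use technology."}),
]
for line in summarize_hits(toy_hits):
    print(line)
```

With the real results, the call would presumably be `summarize_hits(results)`.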

In [17]:
results

[ScoredPoint(id=26, version=0, score=0.4816395354927998, payload={'text': 'Many Public and private schools have advanced in many ways throughout the past decade. Schools all around the world have weaved technology advancements into the classroom and outside of the class room to help get students more actively involved in school related activities and lessons. Technology is a blessing and curse, which means it helps teachers teach but also in certain aspects it worsens the attention span of younger children. Diving deeper into this world-wide technology advancement that is happening, there is this one main question that gets asked quite a lot by administrators all around the world, "Would students work harder and better if they could attend classes from home virtually online or through video conferencing?" My full opinion to that is, no. Studies show that almost 97% of people work better in a room full of other peers, although the isolation and focus you may have at your own home is nic