## RAG with Qdrant

This notebook documents my work building a RAG (retrieval-augmented generation) application on a dataset of book reviews. The dataset is a truncated subset containing books from the James Bond and Harry Potter series.

Tech stack:
* Qdrant as the embedding store, run in local file mode
* sentence_transformers with the distilbert-base-nli-mean-tokens model as the encoder
* Pandas to load the review texts from a SQLite database, looked up by the ids returned from Qdrant
* LangChain to create prompt templates and call the LLM
* OpenAI as the LLM
* tiktoken to check that the token count isn't too high before the call to OpenAI


In [1]:
from qdrant_client import models, QdrantClient
qclient = QdrantClient(path="C:\\Users\\dpeters\\Documents\\Data\\Movies_Books\\embeddings\\")



In [2]:
qclient.get_collection(collection_name="Books_distilbert_1_partial")

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=None, indexed_vectors_count=0, points_count=3802, segments_count=1, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=768, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), shard_number=None, sharding_method=None, replication_factor=None, write_consistency_factor=None, read_fan_out_factor=None, on_disk_payload=None, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=None, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0

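Here we load the sentence_transformers model used to encode the queries. It produces 768-dimensional vectors, which matches the vector size of the collection above.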

In [3]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
#model.to("cuda:0")
model.device

device(type='cpu')

In [4]:
query_vector = model.encode("Who is James Bonds' biggest nemesis?")
hits = qclient.query_points(
   collection_name="Books_distilbert_1_partial",
   query=query_vector,
   limit=15
)

In [5]:
hits

QueryResponse(points=[ScoredPoint(id=991071499, version=0, score=0.6845826550583209, payload={'book_id': 3762}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=9440370, version=0, score=0.6785884070676913, payload={'book_id': 177193}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=2678579974, version=0, score=0.6750211140601249, payload={'book_id': 3763}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=974944422, version=0, score=0.6664279536227815, payload={'book_id': 18780375}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=135113126, version=0, score=0.6650838003984122, payload={'book_id': 3761}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=956890387, version=0, score=0.6428104369854029, payload={'book_id': 1}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=94261343, version=0, score=0.6360902476260423, payload={'book_id': 3761}, vector=None, shard_key=None, order_value=None), ScoredPoint
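Since the collection is configured with cosine distance, the score on each hit is the cosine similarity between the query vector and the stored review vector, and higher means closer. The sketch below shows that calculation on toy data. The 3-dimensional vectors and the review names are hypothetical stand-ins for the real 768-dimensional embeddings, and this is only an illustration of the metric, not how Qdrant computes it internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical); the real embeddings have 768 dimensions.
query_vec = [1.0, 0.0, 1.0]
review_vecs = {
    "review_a": [1.0, 0.0, 1.0],
    "review_b": [0.0, 1.0, 0.0],
    "review_c": [1.0, 1.0, 0.0],
}

# Rank reviews from most to least similar, as a vector search would.
ranked = sorted(
    review_vecs,
    key=lambda name: cosine_similarity(query_vec, review_vecs[name]),
    reverse=True,
)
print(ranked)
```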

In [6]:
review_ids = [sp.id for sp in hits.points]

In [7]:
review_ids

[991071499,
 9440370,
 2678579974,
 974944422,
 135113126,
 956890387,
 94261343,
 2018638560,
 1235845077,
 1292853721,
 2324277797,
 2031306485,
 2195574181,
 970169375,
 208009817]

Now that we have the review ids, we can retrieve the review texts from the SQLite database.

In [8]:
import sqlite3
import pandas as pd

In [11]:
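con = sqlite3.connect("../../../../data/Movies_Books/reviews_clean.sqlite3")
# Use one "?" placeholder per id instead of concatenating the values into the SQL string
placeholders = ",".join("?" for _ in review_ids)
matches = pd.read_sql_query(
    f"SELECT * from book_reviews_cleaned where review_id IN ({placeholders})",
    con,
    params=review_ids,
)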

matches['book_id'] = matches['book_id'].astype(int)
matches.head()

| | book_id | review_id | rating | review_text | language | id | primary_topic_id | primary_topic_prob | rating_numeric |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 956890387 | it was amazing | Im yet to mention one of the most important ch... | en | 177959 | 16.0 | 0.358908 | 5.0 |
| 1 | 3750 | 208009817 | it was amazing | * The third Bond book.* And far and away the b... | en | 189997 | 24.0 | 0.344065 | 5.0 |
| 2 | 3758 | 970169375 | did not like it | "Say what you like about James Bond," my ex-hu... | en | 91136 | 16.0 | 0.558882 | 1.0 |
| 3 | 3759 | 1292853721 | it was ok | My first Bond novel, and very likely my last. ... | en | 202822 | 20.0 | 0.498595 | 2.0 |
| 4 | 3761 | 135113126 | it was amazing | There are no other 007 films quite like this o... | en | 166134 | 24.0 | 0.791107 | 5.0 |
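Note that the rows come back in database order, not in the order of the Qdrant hits: the first row here (review_id 956890387) was only the sixth-best match. A SQL IN clause gives no ordering guarantee, so if the ranking matters (for example when trimming reviews to fit a token budget) it has to be restored afterwards. Here is a minimal sketch of one way to do that, using only the standard sqlite3 module. The helper name fetch_reviews_in_rank_order and the in-memory demo table are hypothetical; only the table and column names come from the notebook.

```python
import sqlite3

def fetch_reviews_in_rank_order(con, review_ids):
    """Fetch reviews by id and return them in the order given by review_ids."""
    if not review_ids:
        return []
    # One "?" placeholder per id, so values are bound rather than concatenated.
    placeholders = ",".join("?" for _ in review_ids)
    sql = (
        "SELECT review_id, review_text FROM book_reviews_cleaned "
        f"WHERE review_id IN ({placeholders})"
    )
    rows = con.execute(sql, list(review_ids)).fetchall()
    # Map each id to its position in the ranked list, then sort rows by it.
    rank = {rid: position for position, rid in enumerate(review_ids)}
    return sorted(rows, key=lambda row: rank[row[0]])

# Hypothetical in-memory table standing in for reviews_clean.sqlite3.
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE book_reviews_cleaned (review_id INTEGER, review_text TEXT)"
)
demo_con.executemany(
    "INSERT INTO book_reviews_cleaned VALUES (?, ?)",
    [(10, "first stored"), (20, "second stored"), (30, "third stored")],
)

# Ask for id 30 first, as if it were the best hit.
ordered = fetch_reviews_in_rank_order(demo_con, [30, 10])
print(ordered)
```

With a DataFrame, the same idea could be applied by sorting on the position of each review_id in the ranked list.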


In [12]:
all_review = matches["review_text"].str.cat(sep="\n\nNew Review\n\n")

In [13]:
all_review

'Im yet to mention one of the most important characters in this series in a review. Im, of course, talking about Severus Snape.Severus, the unsung hero.Severus, who sacrificed his own soul.Severus, who loved another more than life itself.Severus, the half-blood prince- the truth about his character was, and will likely always be, one of the most surprising twists Ive ever read in fiction. The set up is all in this book. \n\nNew Review\n\n* The third Bond book.* And far and away the best of the three. Tense, exciting; cards and spycraft. Always hard to believe when such an excellent book is turned into such a dismal movie.* Hugo Drax is the most fully realized villain, and the most frightening. Le Chiffre was a bit pedestrian, Mr. Big little more than a criminal; Drax is highly neurotic, yet a patriot, motivated by vengeance and national pride. He comes off as Bond\'s first truly worthy foe.* Fleming devotes the first third to a card game in which Bond must cheat a cheater, and during w

## LLM section

In [14]:
# Only the imports this notebook actually uses
from langchain import OpenAI
from langchain import PromptTemplate
import tiktoken

llm_model_name = "gpt-3.5-turbo-instruct"
# Tokenizer matching the model, used later to count prompt tokens
encoding = tiktoken.encoding_for_model(llm_model_name)




In [15]:
template = """You are an expert chatbot that answers questions about books and movies.

You know the following context information.

{chunks_formatted}

Answer to the following question from a customer. Use only information from the previous context information. Do not invent stuff.

Question: {query}

Answer:"""

prompt = PromptTemplate(
    input_variables=["chunks_formatted", "query"],
    template=template,
)

Now we take the query from above and insert it into the prompt template together with the review texts, then call the OpenAI LLM with the formatted prompt.

In [16]:
query = "Who is James Bonds' biggest nemesis?"
prompt_formatted = prompt.format(chunks_formatted=all_review, query=query)

# generate answer
llm = OpenAI(model=llm_model_name, temperature=0)
answer = llm(prompt_formatted)
print(answer)



 James Bond's biggest nemesis is Ernst Stavro Blofeld, who is described as his Moriarty and arch-nemesis in the later James Bond books.


Not bad! Now we can write a function that takes a query and performs all of these steps.

In [17]:
def GetAnswerFromBook(query: str):
    num_matches = 15
    # Token budget for the reviews: the 4096-token context window, minus 256
    # saved for the completion, minus the prompt template and the query itself
    max_tokens = 4096 - 256 - len(encoding.encode(prompt.template)) - len(encoding.encode(query))
    hits = qclient.query_points(
        collection_name="Books_distilbert_1_partial",
        query=model.encode(query),
        limit=num_matches,  # return the 15 closest points
    )
    review_ids = [sp.id for sp in hits.points]
    # One "?" placeholder per id instead of concatenating values into the SQL
    placeholders = ",".join("?" for _ in review_ids)
    matches = pd.read_sql_query(
        f"SELECT * from book_reviews_cleaned where review_id IN ({placeholders})",
        con,
        params=review_ids,
    )
    separator = "\n\nNew Review\n\n"
    all_review = matches["review_text"].str.cat(sep=separator)
    num_tokens = len(encoding.encode(all_review))
    num_matches = len(matches)

    # Drop reviews from the end until the context fits, always keeping at least one
    while num_tokens > max_tokens and num_matches > 1:
        num_matches -= 1
        all_review = matches["review_text"][0:num_matches].str.cat(sep=separator)
        num_tokens = len(encoding.encode(all_review))

    prompt_formatted = prompt.format(chunks_formatted=all_review, query=query)
    answer = llm(prompt_formatted)
    return answer
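
The trimming step inside GetAnswerFromBook can also be pulled out into a small helper, which makes it testable without Qdrant or OpenAI. The following is a minimal sketch under stated assumptions: the helper name fit_reviews_to_budget is hypothetical, and a whitespace word count stands in for the tiktoken encoder so the example runs on its own. Unlike the loop above, this version returns an empty context when even the first review is over budget.

```python
def fit_reviews_to_budget(reviews, max_tokens, count_tokens,
                          separator="\n\nNew Review\n\n"):
    """Return (context, n_kept): the longest prefix of reviews that fits."""
    kept = list(reviews)
    while kept:
        text = separator.join(kept)
        if count_tokens(text) <= max_tokens:
            return text, len(kept)
        kept.pop()  # drop the last review and try again
    return "", 0

def count_words(text):
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())

reviews = ["one two three", "four five", "six seven eight nine"]
context, n_kept = fit_reviews_to_budget(reviews, max_tokens=8, count_tokens=count_words)
print(n_kept)
print(repr(context))
```

In the notebook, count_tokens could be something like lambda text: len(encoding.encode(text)), with encoding being the tiktoken encoder created earlier.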

    

In [60]:
GetAnswerFromBook("Who is Harry Potter's biggest nemesis?")

" Harry Potter's biggest nemesis is Voldemort, the dark wizard who killed his parents and is trying to kill him."

In [61]:
GetAnswerFromBook("Who is Hermione?")

" Hermione is a character in the Harry Potter series, known for her intelligence and bravery. She is a close friend of Harry and Ron, and is often involved in important plot points, such as discovering Rita Skeeter's true identity and starting the organization SPEW. She is also a member of Dumbledore's Army and plays a crucial role in the fight against Voldemort."

In [62]:
GetAnswerFromBook("Who are the main characters in James Bond books other than Bond?")

' The main characters in James Bond books other than Bond include M, Moneypenny, and various villains such as Le Chiffre and Mr. Big. In "The Spy Who Loved Me," the protagonist is a woman named Viv.'

In [43]:
prompt.template

'You are an expert chatbot that answers questions about books and movies.\n\nYou know the following context information.\n\n{chunks_formatted}\n\nAnswer to the following question from a customer. Use only information from the previous context information. Do not invent stuff.\n\nQuestion: {query}\n\nAnswer:'

Still to do: redo the review embeddings with chunking, try different embedding models, and tweak the prompt.
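
For the chunking item, one simple starting point is to split each review into overlapping word windows before embedding, so that long reviews are represented by several vectors. This is only a sketch of that idea and not something the notebook does yet. The function name chunk_words and the default sizes are hypothetical and would need tuning against the encoder's input limit.

```python
def chunk_words(text, chunk_size=100, overlap=20):
    """Split text into overlapping chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny example with 4-word windows that share 1 word with their neighbour.
demo_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(demo_chunks)
```

Each chunk would then be embedded and stored as its own point, with the review id kept in the payload so the full review can still be looked up.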