This notebook uses Retrieval-Augmented Generation (RAG) with OpenAI's gpt-3.5-turbo chat model to recommend movies.

In [12]:
import pandas as pd

Building the Training Data

In [13]:
train = pd.read_csv('../data/train.csv')

In [14]:
train.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,movie_title,genres,avg_rating
0,259,255,4,874724710,My Best Friend's Wedding (1997),Romance,4.0
1,259,286,4,874724727,"English Patient, The (1996)","Romance, War",4.0
2,259,298,4,874724754,Face/Off (1997),"Action, Sci-Fi, Thriller",4.0
3,259,185,4,874724781,Psycho (1960),"Horror, Romance, Thriller",4.0
4,259,173,4,874724843,"Princess Bride, The (1987)","Action, Adventure, Romance",4.0


Using Qdrant Vector Database

In [8]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

Load the embedding model

In [9]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

Set up Qdrant as an in-memory database

In [10]:
client = QdrantClient(":memory:")

Define the database settings

In [60]:
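# recreate_collection is deprecated in recent qdrant-client releases;
# drop any existing collection, then create it fresh.
if client.collection_exists("movie_ratings"):
    client.delete_collection("movie_ratings")
client.create_collection(
    collection_name="movie_ratings",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # vector size is set by the embedding model
        distance=models.Distance.COSINE,
    ),
)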

True
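
The collection above is configured with cosine distance. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch; the function name and the toy vectors are made up for this example and are not part of the Qdrant API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (made up for illustration; real embeddings
# from all-MiniLM-L6-v2 have 384 dimensions).
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal: 0.0
```

Because only the direction of the vectors matters, two descriptions with similar wording can score highly even when the user IDs or ratings in them differ.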

Turn the data into more descriptive strings for better semantic searching

In [61]:
train['rating_descriptions'] = train.apply(
    lambda x: f"user {x['user_id']} rated {x['movie_title']} with a rating of {x['rating']}",
    axis=1,
)
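
To make the string format easier to inspect, the same sentence can be built from a plain dictionary. This is only a sketch: `describe_rating` is a hypothetical helper name, and the sample row is copied from the `train.head()` output.

```python
def describe_rating(record):
    """Build the same sentence as the lambda used in the notebook."""
    return (
        f"user {record['user_id']} rated {record['movie_title']} "
        f"with a rating of {record['rating']}"
    )

# Row copied from the train.head() output
row = {"user_id": 259, "movie_id": 185, "rating": 4, "movie_title": "Psycho (1960)"}
print(describe_rating(row))
# user 259 rated Psycho (1960) with a rating of 4
```

A helper like this could also be applied to each record from `train.to_dict(orient='records')`, which keeps the formatting logic testable outside of pandas.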

In [64]:
training_dict = train.sample(1000).to_dict(orient='records') # Sample 1000 records to speed up the process

Upload Training Data to vector database

In [63]:
client.upload_points(
    collection_name="movie_ratings",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(doc["rating_descriptions"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(training_dict)
    ],
)
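
The upload above encodes one description per `encoder.encode` call. Since `encode` also accepts a list of sentences, one possible refinement is to group records into batches first. The sketch below shows only the batching step; `chunked` is a hypothetical helper and `docs` is made-up data.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up records standing in for training_dict
docs = [{"rating_descriptions": f"description {i}"} for i in range(5)]
for batch in chunked(docs, 2):
    texts = [doc["rating_descriptions"] for doc in batch]
    # With the notebook's encoder, this is where one could call
    # encoder.encode(texts) once per batch instead of once per record.
    print(texts)
```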

Search Vector Database for the query

In [73]:
hits = client.search(
    collection_name="movie_ratings",
    query_vector=encoder.encode("user 175 with a rating of 5").tolist(),
    limit=5,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'user_id': 175, 'movie_id': 11, 'rating': 5, 'timestamp': 877107339, 'movie_title': 'Seven (Se7en) (1995)', 'genres': 'Thriller', 'avg_rating': 5.0, 'rating_descriptions': 'user 175 rated Seven (Se7en) (1995) with a rating of 5'} score: 0.7580424035243876
{'user_id': 243, 'movie_id': 280, 'rating': 1, 'timestamp': 879987148, 'movie_title': 'Up Close and Personal (1996)', 'genres': 'Romance', 'avg_rating': 1.0, 'rating_descriptions': 'user 243 rated Up Close and Personal (1996) with a rating of 1'} score: 0.7244802590664849
{'user_id': 181, 'movie_id': 1353, 'rating': 1, 'timestamp': 878962200, 'movie_title': '1-900 (1994)', 'genres': 'Romance', 'avg_rating': 1.0, 'rating_descriptions': 'user 181 rated 1-900 (1994) with a rating of 1'} score: 0.7221969391858298
{'user_id': 286, 'movie_id': 367, 'rating': 5, 'timestamp': 877531574, 'movie_title': 'Clueless (1995)', 'genres': 'unknown', 'avg_rating': 5.0, 'rating_descriptions': 'user 286 rated Clueless (1995) with a rating of 5'} score: 

In [75]:
search_results = [hit.payload for hit in hits]
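
The search output includes ratings from other users (243, 181, 286) and ratings of 1, because vector similarity alone does not enforce the user ID or the rating value. One way to narrow the context before prompting is to filter the payloads in Python. This is a sketch under that assumption: `filter_payloads` is a hypothetical helper, and the sample payloads are abbreviated copies of the search output.

```python
def filter_payloads(payloads, user_id=None, min_rating=None):
    """Keep payloads matching the given user and minimum rating (if set)."""
    kept = []
    for payload in payloads:
        if user_id is not None and payload["user_id"] != user_id:
            continue
        if min_rating is not None and payload["rating"] < min_rating:
            continue
        kept.append(payload)
    return kept

# Abbreviated payloads copied from the search output
sample_results = [
    {"user_id": 175, "rating": 5, "movie_title": "Seven (Se7en) (1995)"},
    {"user_id": 243, "rating": 1, "movie_title": "Up Close and Personal (1996)"},
    {"user_id": 181, "rating": 1, "movie_title": "1-900 (1994)"},
    {"user_id": 286, "rating": 5, "movie_title": "Clueless (1995)"},
]
for payload in filter_payloads(sample_results, user_id=175, min_rating=5):
    print(payload["movie_title"])
# Seven (Se7en) (1995)
```

Qdrant also supports payload filters applied at query time, which would avoid fetching non-matching points in the first place; the post-filter here is just the simplest thing to reason about.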

In [74]:
user_prompt = "I am user 175.  Please recommend me a movie that I would rate 5."

Use OpenAI to recommend a movie based on vector database search results

In [78]:
from openai import OpenAI
from key import OPENAI_API_KEY  # local key.py module holding the API key

# Use a distinct name so the Qdrant `client` defined earlier is not overwritten
openai_client = OpenAI(api_key=OPENAI_API_KEY)

In [80]:
response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a movie recommender.  You help users find movies they would rate 5 stars."},
        {"role": "user", "content": user_prompt},
        # The ratings retrieved from the vector search are passed along as context
        {"role": "assistant", "content": str(search_results)},
    ],
    # temperature=0,
    # max_tokens=512,
    # top_p=1,
    # frequency_penalty=0,
    # presence_penalty=0,
)

In [88]:
response.choices[0].message.content

'Based on your previous ratings, I recommend the movie "Seven (Se7en) (1995)". Users with similar tastes rated this movie 5 stars. Enjoy watching!'
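
The notebook passes the retrieved payloads to the model as `str(search_results)`, a raw Python list dump. An alternative, shown here only as a sketch and not as the approach the notebook uses, is to format the payloads as short lines and attach them to the user message. `format_context` and `build_messages` are hypothetical helper names, and the sample payload is copied from the search output.

```python
def format_context(payloads):
    """Turn retrieved payloads into one readable line per rating."""
    lines = []
    for p in payloads:
        lines.append(
            f"- {p['movie_title']} ({p['genres']}): "
            f"user {p['user_id']} rated it {p['rating']}"
        )
    return "\n".join(lines)

def build_messages(user_prompt, payloads):
    """Assemble chat messages with the retrieved ratings as context."""
    context = format_context(payloads)
    return [
        {"role": "system", "content": "You are a movie recommender.  You help users find movies they would rate 5 stars."},
        {"role": "user", "content": f"{user_prompt}\n\nRetrieved ratings:\n{context}"},
    ]

# Payload copied (abbreviated) from the search output
sample = [{"user_id": 175, "rating": 5, "movie_title": "Seven (Se7en) (1995)", "genres": "Thriller"}]
messages = build_messages("I am user 175.  Please recommend me a movie that I would rate 5.", sample)
print(messages[1]["content"])
```

The resulting list has the same shape as the `messages` argument of the chat completion call, so it could be passed in its place; whether that improves the recommendations would need to be checked against real responses.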