#### About Recommendation systems

Different types of recommendation systems:
- Collaborative filtering: This type of recommendation system uses the ratings or feedback of other users who have similar preferences to the target user. It assumes that users who liked certain items in the past will like similar items in the future. For example, if user A and user B both liked movies X and Y, then the algorithm may recommend movie Z to user A if user B also liked it.
- Collaborative filtering can be further divided into two subtypes: user-based and item-based:
    - User-based collaborative filtering finds similar users to the target user and recommends items that they liked.
    - Item-based collaborative filtering finds similar items to the ones that the target user liked and recommends them.
- Content-based filtering: This type of recommendation system uses the features or attributes of the items themselves to recommend items that are similar to the ones that the target user has liked or interacted with before. It assumes that users who liked certain features of an item will like other items with similar features. The main difference from item-based collaborative filtering is that item-based filtering uses patterns of user behavior to make recommendations, whereas content-based filtering uses information about the items themselves. For example, if user A liked movie X, which is a comedy with actor Y, then the algorithm may recommend movie Z, which is also a comedy with actor Y.
- Hybrid filtering: This type of recommendation system combines both collaborative and content-based filtering methods to overcome some of their limitations and provide more accurate and diverse recommendations. For example, YouTube uses hybrid filtering to recommend videos based on both the ratings and views of other users who have watched similar videos, and the features and categories of the videos themselves.
- Knowledge-based filtering: This type of recommendation system uses explicit knowledge or rules about the domain and the user’s needs or preferences to recommend items that satisfy certain criteria or constraints. It does not rely on ratings or feedback from other users, but rather on the user’s input or query. For example, if user A wants to buy a laptop with certain specifications and budget, then the algorithm may recommend a laptop that satisfies those criteria. Knowledge-based recommender systems work well when there is no or little rating history available, or when the items are complex and customizable.
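
The content-based idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the catalogue, its attribute sets, and the function names `jaccard` and `recommend_content_based` are hypothetical, and Jaccard overlap is just one of many possible similarity measures.

```python
def jaccard(a: set, b: set) -> float:
    """Share of attributes two items have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical catalogue: each movie is described only by its own attributes,
# never by what other users did with it.
catalogue = {
    "Movie X": {"comedy", "actor Y"},
    "Movie Z": {"comedy", "actor Y", "sequel"},
    "Movie W": {"horror", "actor Q"},
}


def recommend_content_based(liked: str, items: dict, top_n: int = 2) -> list:
    """Rank every other item by attribute overlap with the liked item."""
    scores = [
        (jaccard(items[liked], attrs), title)
        for title, attrs in items.items()
        if title != liked
    ]
    scores.sort(reverse=True)  # highest overlap first
    return [title for _, title in scores[:top_n]]


print(recommend_content_based("Movie X", catalogue))
```

Because only item attributes are compared, this kind of recommender can work for a user with a single liked item, which is where collaborative methods tend to struggle.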

Existing recommendation systems
- Modern recommendation systems use machine learning (ML) techniques to make better predictions about users’ preferences, based on available data such as the following:
    - User behavior data: Insights about user interaction with a product. This data can be acquired from factors like user ratings, clicks, and purchase records.
    - User demographic data: This refers to personal information about users, including details like age, educational background, income level, and geographical location.
    - Product attribute data: This involves information about the characteristics of a product, such as genres of books, casts of movies, or specific cuisines in the context of food.
- As of today, some of the most popular ML techniques are K-nearest neighbors, dimensionality reduction, and neural networks.

KNN:
- KNN can be applied to recommendation systems in the context of collaborative filtering, both user-based and item-based:
- User-based KNN is a type of collaborative filtering, which uses the ratings or feedback of other users who have similar tastes or preferences to the target user.
    - For example, let’s say we have three users: Alice, Bob, and Charlie. They all buy books online and rate them. Alice and Bob both liked (rated highly) the Harry Potter series and The Hobbit. The system sees this pattern and considers Alice and Bob to be similar.
    - Now, if Bob also liked the book A Game of Thrones, which Alice hasn’t read yet, the system will recommend A Game of Thrones to Alice. This is because it assumes that since Alice and Bob have similar tastes, Alice might also like A Game of Thrones.
- Item-based KNN is another type of collaborative filtering. Instead of comparing users, it compares items by how users have rated them, and recommends items that are similar to the ones the target user liked.
    - For example, let’s consider the same users and their ratings for the books. The system notices that the Harry Potter series and The Hobbit are both liked by Alice and Bob. So, it considers these two books to be similar.
    - Now, if Charlie reads and likes Harry Potter, the system will recommend The Hobbit to Charlie. This is because it assumes that since Harry Potter and The Hobbit are similar (both liked by the same users), Charlie might also like The Hobbit.
- KNN is a popular technique in recommendation systems, but it has some pitfalls:
    - Scalability: KNN can become computationally expensive and slow when dealing with large datasets, as it requires calculating distances between all pairs of items or users.
    - Cold-start problem: KNN struggles with new items or users that have limited or no interaction history, as it relies on finding neighbors based on historical data.
    - Data sparsity: KNN performance can degrade in sparse datasets where there are many missing values, making it challenging to find meaningful neighbors.
    - Feature relevance: KNN treats all features equally and assumes that all features contribute equally to similarity calculations. This may not hold true in scenarios where some features are more relevant than others.
    - Choice of K: Selecting the appropriate value of K (number of neighbors) can be subjective and impact the quality of recommendations. A small K may result in noise, while a large K may lead to overly broad recommendations.
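
To make the two KNN variants concrete, here is a minimal sketch using cosine similarity over the Alice, Bob, and Charlie example. The numeric ratings are hypothetical (the text above only says which books were liked), an unrated book counts as 0, and the helpers `cosine` and `nearest` are illustrative names, not part of any library.

```python
import math

# Hypothetical ratings on a 1-5 scale; a missing key means "not rated yet".
ratings = {
    "Alice":   {"Harry Potter": 5, "The Hobbit": 5},
    "Bob":     {"Harry Potter": 5, "The Hobbit": 4, "A Game of Thrones": 5},
    "Charlie": {"Harry Potter": 1, "A Game of Thrones": 2},
}


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(target: str, vectors: dict, k: int = 1) -> list:
    """Return the k keys whose vectors are most similar to the target's."""
    scored = [
        (cosine(vectors[target], vec), name)
        for name, vec in vectors.items()
        if name != target
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# User-based: find Alice's nearest neighbour, then suggest books that
# neighbour rated highly and Alice has not rated.
neighbour = nearest("Alice", ratings)[0]
suggestions = [
    book for book, r in ratings[neighbour].items()
    if r >= 4 and book not in ratings["Alice"]
]

# Item-based: transpose to book -> {user: rating}, then compare books.
by_item = {}
for user, user_ratings in ratings.items():
    for book, r in user_ratings.items():
        by_item.setdefault(book, {})[user] = r
similar_book = nearest("Harry Potter", by_item)[0]

print(neighbour, suggestions, similar_book)
```

Note that `nearest` compares the target with every other vector, which is the all-pairs cost described under scalability above.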

Neural networks:
- Collaborative filtering with neural networks: Neural networks can model user-item interactions by embedding users and items into vector spaces. These embeddings capture latent features that represent user preferences and item characteristics. Neural collaborative filtering models combine these embeddings with neural network architectures to predict ratings or interactions between users and items.
- Content-based recommendations: In content-based recommendation systems, neural networks can learn representations of item content, such as text, images, or audio. These representations capture item characteristics and user preferences. Neural networks like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are used to process and learn from item content, enabling personalized content-based recommendations.
- Sequential models: In scenarios where user interactions have a temporal sequence, such as clickstreams or browsing history, RNNs or variants such as long short-term memory (LSTM) networks can capture temporal dependencies in the user behavior and make sequential recommendations.
- Autoencoders can be used to learn low-dimensional representations of users and items. Autoencoders are a type of neural network architecture used for unsupervised learning and dimensionality reduction. They consist of an encoder and a decoder. The encoder maps the input data into a lower-dimensional latent space representation, while the decoder attempts to reconstruct the original input data from the encoded representation. The idea here is to learn a compressed and meaningful representation of the input data in the latent space, which can be useful for various tasks including feature extraction, data generation, and dimensionality reduction.
- Neural networks can incorporate additional user and item attributes, such as demographic information, location, or social connections, to improve recommendations by learning from diverse data sources.
- Neural networks also bring some challenges: increased complexity due to their layered architecture, the need for specialized hardware such as GPUs, and a risk of overfitting.
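
The embedding idea in the first bullet above can be illustrated with its simplest special case: a dot product between a user vector and an item vector, trained with stochastic gradient descent. This is plain matrix factorization rather than a full neural collaborative filtering model, and the interactions, embedding size, and learning rate are all assumptions chosen for the sketch.

```python
import random

random.seed(0)

# Hypothetical observed interactions: (user index, item index, rating).
observed = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 5.0), (2, 2, 2.0),
]
n_users, n_items, dim = 3, 3, 2

# Small random embeddings; each row is one user's or one item's latent vector.
user_emb = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
item_emb = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]


def predict(u: int, i: int) -> float:
    """Predicted rating = dot product of user and item embeddings."""
    return sum(a * b for a, b in zip(user_emb[u], item_emb[i]))


def mse() -> float:
    """Mean squared error over the observed interactions."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in observed) / len(observed)


loss_before = mse()
learning_rate = 0.02
for _ in range(1000):
    for u, i, r in observed:
        err = r - predict(u, i)
        for d in range(dim):
            pu, qi = user_emb[u][d], item_emb[i][d]
            # Gradient step on the squared error for this single rating.
            user_emb[u][d] += learning_rate * err * qi
            item_emb[i][d] += learning_rate * err * pu
loss_after = mse()

print(loss_before, loss_after)
# Score a pair that was never observed (user 0, item 2).
print(predict(0, 2))
```

A neural model would replace the dot product with learned layers, but the embeddings play the same role: they are the latent features that the predictions are built from.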

#### LLM-powered recommendation system


- An example of an LLM-based recommendation system is P5. P5 is a unified text-to-text paradigm for building recommender systems using large language models (LLMs). It consists of three steps:
    - Pretrain: A foundation language model based on T5 architecture is pretrained on a large-scale web corpus and fine-tuned on recommendation tasks.
    - Personalized prompt: A personalized prompt is generated for each user based on their behavior data and contextual features.
    - Predict: The personalized prompt is fed into the pretrained language model to generate recommendations.
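
The "personalized prompt" step can be pictured as plain string templating. The sketch below is an assumption-laden illustration: the template wording and the function `build_personalized_prompt` are hypothetical and are not taken from P5 itself.

```python
def build_personalized_prompt(user_id: str, history: list, candidates: list) -> str:
    """Turn a user's behavior data into a text prompt for a language model."""
    watched = ", ".join(history)
    options = ", ".join(candidates)
    return (
        f"User {user_id} has recently watched: {watched}. "
        f"Which of the following movies should be recommended next? {options}"
    )


prompt = build_personalized_prompt(
    "user_42",                                          # hypothetical user id
    ["Toy Story", "Jumanji"],                           # behavior data
    ["Grumpier Old Men", "Father of the Bride Part II"],  # candidate items
)
print(prompt)
```

The resulting string is what the "Predict" step would pass to the pretrained model.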

- The goal is to build a movie recommendation system that is able to address various recommendation tasks through a conversational interface.

##### Data Processing, Feature Engineering and Finding Embeddings

In [1]:
import pandas as pd

md = pd.read_csv('data/movies_metadata.csv')
md.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [2]:
# First, we convert the genres column into a Python list of genre names, which is easier to handle than the original stringified list of dictionaries in the dataset
import ast
# Convert string representation of dictionaries to actual dictionaries
md['genres'] = md['genres'].apply(ast.literal_eval)

# Transforming the 'genres' column
md['genres'] = md['genres'].apply(lambda x: [genre['name'] for genre in x])
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


- Next, we merge the vote_average and vote_count columns into a single column: a rating weighted by the number of votes.
- We use the IMDb formula for the Top Rated 250 Titles, which gives a true Bayesian estimate: `weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C` where:
    - R = average rating for the movie (mean)
    - v = number of votes for the movie
    - m = minimum votes required to be listed in the Top 250
    - C = the mean vote across the whole report (7.0 in the original description; the code below uses 5.0)
- We set m to the 95th percentile of the vote counts, so that ratings backed by only a few votes are pulled toward C and do not skew the results.
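
To see how the formula behaves, here is a small worked sketch. The value `m = 400` is a hypothetical threshold chosen for illustration (the notebook derives its own from the data), and `C = 5.0` mirrors the constant used in the notebook code; the two (R, v) pairs are taken from the sample rows shown earlier.

```python
def weighted_rating(R: float, v: float, m: float, C: float) -> float:
    """Weighted rating: blend a movie's mean R with the prior mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C


m, C = 400, 5.0  # hypothetical threshold and prior mean

# Many votes: the result stays close to the movie's own average.
popular = weighted_rating(R=7.7, v=5415, m=m, C=C)

# Few votes: the result is pulled most of the way toward C.
niche = weighted_rating(R=6.1, v=34, m=m, C=C)

print(round(popular, 2), round(niche, 2))
```

The weight `v / (v + m)` is what does the work: it approaches 1 as votes accumulate and approaches 0 when a movie has far fewer votes than `m`.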

In [3]:
# Calculate weighted rate (IMDb formula)
def calculate_weighted_rate(vote_average, vote_count, min_vote_count=10):
    return (vote_count / (vote_count + min_vote_count)) * vote_average + (min_vote_count / (vote_count + min_vote_count)) * 5.0
# Minimum vote count to prevent skewed results
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
min_vote_count = vote_counts.quantile(0.95)
# Create a new column 'weighted_rate'
md['weighted_rate'] = md.apply(lambda row: calculate_weighted_rate(row['vote_average'], row['vote_count'], min_vote_count), axis=1)

In [4]:
md = md.dropna()

Next, we create a new column called combined_info that merges all the elements we will provide as context to the LLM: the movie title, overview, genres, and rating.

In [5]:
md_final = md[['genres', 'title', 'overview', 'weighted_rate']].reset_index(drop=True)
md_final.head()

Unnamed: 0,genres,title,overview,weighted_rate
0,"[Adventure, Action, Thriller]",GoldenEye,James Bond must unmask the mysterious head of ...,6.173464
1,[Comedy],Friday,Craig and Smokey are two guys in Los Angeles h...,6.083421
2,"[Horror, Action, Thriller, Crime]",From Dusk Till Dawn,Seth Gecko and his younger brother Richard are...,6.503176
3,[Comedy],Blue in the Face,"Auggie runs a small tobacco shop in Brooklyn, ...",5.109091
4,"[Action, Adventure, Science Fiction, Family, F...",Mighty Morphin Power Rangers: The Movie,Power up with six incredible teens who out-man...,5.052129


In [6]:
# Create a new column by combining 'title', 'overview', 'genres', and 'weighted_rate'
md_final['combined_info'] = md_final.apply(lambda row: f"Title: {row['title']}. Overview: {row['overview']} Genres: {', '.join(row['genres'])}. Rating: {row['weighted_rate']}", axis=1)
print(md_final['combined_info'][0])

Title: GoldenEye. Overview: James Bond must unmask the mysterious head of the Janus Syndicate and prevent the leader from utilizing the GoldenEye weapons system to inflict devastating revenge on Britain. Genres: Adventure, Action, Thriller. Rating: 6.173464373464373


Embeddings
- We count the tokens in each movie's combined_info so that we can drop any text that is too long for the embedding model.
- The cl100k_base tokenizer is based on the byte pair encoding (BPE) algorithm, which learns a vocabulary of subword units from a large corpus of text. It has a vocabulary of roughly 100,000 tokens, which are mostly common words and word pieces, but also include some special tokens for punctuation, formatting, and control. It can handle text in multiple languages and domains. The text-embedding-ada-002 model that uses it accepts up to 8,191 tokens per input.
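
As a rough intuition for byte pair encoding, the toy sketch below repeatedly merges the most frequent adjacent pair of symbols. It is not how tiktoken or cl100k_base is implemented (those work on bytes with a fixed, pre-learned merge table); the function names and the sample string are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens: list) -> tuple:
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: list, pair: tuple) -> list:
    """Replace every non-overlapping occurrence of the pair with one symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("aaabdaaabac")  # start from single characters
for _ in range(3):            # learn three merges
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    print(pair, tokens)
```

Each merge shortens the sequence, which is why frequent words end up as single tokens while rare words are split into several pieces.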

In [7]:
import pandas as pd
import tiktoken
import os
import openai
os.environ["OPEN_API_KEY"]=""
openai.api_key = os.environ["OPEN_API_KEY"]

embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191
encoding = tiktoken.get_encoding(embedding_encoding)
# omit movies whose combined_info is too long to embed
md_final["n_tokens"] = md_final.combined_info.apply(lambda x: len(encoding.encode(x)))
md_final = md_final[md_final.n_tokens <= max_tokens]

In [11]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    deployment = model_name,
    openai_api_key=os.environ["OPEN_API_KEY"]
)


In [16]:
# embed_documents expects a list of strings, so convert the column first
md_final["embedding"] = embed.embed_documents(md_final['overview'].tolist())

In [17]:
md_final

Unnamed: 0,genres,title,overview,weighted_rate,combined_info,n_tokens,embedding
0,"[Adventure, Action, Thriller]",GoldenEye,James Bond must unmask the mysterious head of ...,6.173464,Title: GoldenEye. Overview: James Bond must un...,59,"[-0.023300853315990976, -0.015978474642742865,..."
1,[Comedy],Friday,Craig and Smokey are two guys in Los Angeles h...,6.083421,Title: Friday. Overview: Craig and Smokey are ...,52,"[0.0015801145214720843, -0.010825300100552526,..."
2,"[Horror, Action, Thriller, Crime]",From Dusk Till Dawn,Seth Gecko and his younger brother Richard are...,6.503176,Title: From Dusk Till Dawn. Overview: Seth Gec...,105,"[-0.008636361718460609, -0.004765349329433997,..."
3,[Comedy],Blue in the Face,"Auggie runs a small tobacco shop in Brooklyn, ...",5.109091,Title: Blue in the Face. Overview: Auggie runs...,87,"[-0.02030525503837447, -0.012256867948449244, ..."
4,"[Action, Adventure, Science Fiction, Family, F...",Mighty Morphin Power Rangers: The Movie,Power up with six incredible teens who out-man...,5.052129,Title: Mighty Morphin Power Rangers: The Movie...,89,"[-0.0038093382028046544, -0.03924035041209671,..."
...,...,...,...,...,...,...,...
688,"[Drama, Science Fiction, War]",War for the Planet of the Apes,Caesar and his apes are forced into a deadly c...,6.350166,Title: War for the Planet of the Apes. Overvie...,124,"[0.0058863207670547075, -0.03123171154714419, ..."
689,[Comedy],Goon: Last of the Enforcers,"During a pro lockout, Doug ""The Thug"" Glatt is...",5.074627,Title: Goon: Last of the Enforcers. Overview: ...,75,"[-0.02065249828045568, -0.02817226350420119, -..."
690,"[Adventure, Fantasy, Animation, Action, Family]",Pokémon: Spell of the Unknown,When Molly Hale's sadness of her father's disa...,5.249135,Title: Pokémon: Spell of the Unknown. Overview...,112,"[0.011107639327611748, -0.016110721856645256, ..."
691,"[Action, Science Fiction, Thriller, Adventure]",Transformers: The Last Knight,"Autobots and Decepticons are at war, with huma...",5.922092,Title: Transformers: The Last Knight. Overview...,80,"[0.003442174556747603, -0.056541432163398575, ..."


In [18]:
md_final.rename(columns = {'embedding': 'vector'}, inplace = True)
md_final.rename(columns = {'combined_info': 'text'}, inplace = True)
md_final.to_pickle('movies.pkl')

##### Building a QA recommendation chatbot in a cold-start scenario

- A cold-start scenario means interacting with a user for the first time. The less information we have about a user, the harder it is to match the recommendations to their preferences. The high-level architecture of a recommendation system in a cold-start scenario is shown in img1.

In [19]:
import pandas as pd
md = pd.read_pickle('movies.pkl')
md.head(2)

Unnamed: 0,genres,title,overview,weighted_rate,text,n_tokens,vector
0,"[Adventure, Action, Thriller]",GoldenEye,James Bond must unmask the mysterious head of ...,6.173464,Title: GoldenEye. Overview: James Bond must un...,59,"[-0.023300853315990976, -0.015978474642742865,..."
1,[Comedy],Friday,Craig and Smokey are two guys in Los Angeles h...,6.083421,Title: Friday. Overview: Craig and Smokey are ...,52,"[0.0015801145214720843, -0.010825300100552526,..."


- We need to store the embeddings in a vector database. For this purpose, we are going to leverage LanceDB, an open-source database for vector search built with persistent storage, which greatly simplifies the retrieval, filtering, and management of embeddings and also offers a native integration with LangChain.
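
Conceptually, the core operation a vector database provides is "find the rows whose vectors are closest to this query vector". The brute-force sketch below shows that operation in plain Python. It is not LanceDB's API or implementation; the three-dimensional vectors are made up for illustration (real ada-002 embeddings have 1,536 dimensions) and the helper names are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical rows shaped like the notebook's table: a text and its vector.
store = [
    {"title": "GoldenEye",     "vector": [0.9, 0.1, 0.2]},
    {"title": "Friday",        "vector": [0.1, 0.9, 0.1]},
    {"title": "Kung Fu Panda", "vector": [0.2, 0.3, 0.9]},
]


def similarity_search(query_vector: list, rows: list, k: int = 2) -> list:
    """Return the titles of the k rows most similar to the query vector."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row["vector"]),
        reverse=True,
    )
    return [row["title"] for row in ranked[:k]]


print(similarity_search([0.85, 0.2, 0.1], store))
```

A real vector database adds persistence and indexing so that the search does not have to scan every row, but the ranking idea is the same.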

In [22]:
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("movies", md)


In [46]:
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import LanceDB
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
import os
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPEN_API_KEY"])
docsearch = LanceDB(connection = db, embedding = embeddings, table_name = "movies")

In [47]:
# docsearch = LanceDB.from_documents(db, embeddings)
query = "I'm looking for an animated action movie. What could you suggest to me?"
docs = docsearch.similarity_search(query)
docs

[Document(page_content='Title: Hitman: Agent 47. Overview: An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry. Genres: Action, Crime, Thriller. Rating: 5.365800865800866', metadata={'genres': ['Action', 'Crime', 'Thriller'], 'title': 'Hitman: Agent 47', 'overview': 'An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry.', 'weighted_rate': 5.365800865800866, 'n_tokens': 52, 'vector': [-0.005739843472838402, -0.016554784029722214, -0.022614216431975365, -0.025899605825543404, -0.002633424708619714, 0.019009236246347427, -0.0031223981641232967, -0.029248911887407303, -0.01803768053650856, -0.006442942190915346, 0.016043439507484436, 0.011307108215987682, 0.014726725406944752, -0.027356937527656555, 0.02251194603741169, -0.014752292074263096, 0.03533391281962395, -0.012272271327674389, 0.012726088985800743, -0.012464025057852268, 0.0041386960074305534, -0.006497272755950689, 0.01583890058

In [49]:
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ["OPEN_API_KEY"]), chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=True)
query = "I'm looking for an animated action movie. What could you suggest to me?"
result = qa({"query": query})
result['result']


' Ice Age: Dawn of the Dinosaurs.'

In [51]:
result['source_documents'][0]

Document(page_content='Title: Hitman: Agent 47. Overview: An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry. Genres: Action, Crime, Thriller. Rating: 5.365800865800866', metadata={'genres': ['Action', 'Crime', 'Thriller'], 'title': 'Hitman: Agent 47', 'overview': 'An assassin teams up with a woman to help her find her father and uncover the mysteries of her ancestry.', 'weighted_rate': 5.365800865800866, 'n_tokens': 52, 'vector': [-0.005739843472838402, -0.016554784029722214, -0.022614216431975365, -0.025899605825543404, -0.002633424708619714, 0.019009236246347427, -0.0031223981641232967, -0.029248911887407303, -0.01803768053650856, -0.006442942190915346, 0.016043439507484436, 0.011307108215987682, 0.014726725406944752, -0.027356937527656555, 0.02251194603741169, -0.014752292074263096, 0.03533391281962395, -0.012272271327674389, 0.012726088985800743, -0.012464025057852268, 0.0041386960074305534, -0.006497272755950689, 0.015838900581

- Note that the first retrieved document is not the movie the model suggested. This probably happened because of its rating, which is lower than that of "Ice Age: Dawn of the Dinosaurs". It is a good example of how the LLM can consider multiple factors, on top of similarity, when suggesting a movie to the user.
- The model was able to generate a conversational answer. However, it is still using only part of the available information: the textual overview. We want to leverage the other variables as well, and we can approach the task in two ways:
1. The “filter” way: This approach consists of passing filters as kwargs to our retriever. The application can collect the values for those filters by asking the user a few questions before responding, for example about the preferred movie genre.

In [52]:
df_filtered = md[md['genres'].apply(lambda x: 'Comedy' in x)]
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ["OPEN_API_KEY"]), chain_type="stuff", 
    retriever=docsearch.as_retriever(search_kwargs={'data': df_filtered}), return_source_documents=True)

query = "I'm looking for a movie with animals and an adventurous plot."
result = qa({"query": query})
result

{'query': "I'm looking for a movie with animals and an adventurous plot.",
 'result': ' Ice Age or The Curse of the Were-Rabbit.',
 'source_documents': [Document(page_content='Title: Cats & Dogs 2 : The Revenge of Kitty Galore. Overview: The ongoing war between the canine and feline species is put on hold when they join forces to thwart a rogue cat spy with her own sinister plans for conquest. Genres: Comedy, Family. Rating: 4.978057553956835', metadata={'genres': ['Comedy', 'Family'], 'title': 'Cats & Dogs 2 : The Revenge of Kitty Galore', 'overview': 'The ongoing war between the canine and feline species is put on hold when they join forces to thwart a rogue cat spy with her own sinister plans for conquest.', 'weighted_rate': 4.978057553956835, 'n_tokens': 66, 'vector': [-0.009394473396241665, -0.022314323112368584, -0.00801696628332138, -0.0198987890034914, -0.014114560559391975, 0.025186851620674133, 0.0136836813762784, -0.026897313073277473, -0.012926378287374973, -0.0197682194411

In [57]:
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ["OPEN_API_KEY"]), chain_type="stuff", 
    retriever=docsearch.as_retriever(search_kwargs={'filter': {'adult':'False'}}), return_source_documents=True)

query = "I'm looking for a movie with animals and an adventurous plot."
result = qa({"query": query})
result

{'query': "I'm looking for a movie with animals and an adventurous plot.",
 'result': ' Ice Age or The Curse of the Were-Rabbit could both fit this description.',
 'source_documents': [Document(page_content='Title: Cats & Dogs 2 : The Revenge of Kitty Galore. Overview: The ongoing war between the canine and feline species is put on hold when they join forces to thwart a rogue cat spy with her own sinister plans for conquest. Genres: Comedy, Family. Rating: 4.978057553956835', metadata={'genres': ['Comedy', 'Family'], 'title': 'Cats & Dogs 2 : The Revenge of Kitty Galore', 'overview': 'The ongoing war between the canine and feline species is put on hold when they join forces to thwart a rogue cat spy with her own sinister plans for conquest.', 'weighted_rate': 4.978057553956835, 'n_tokens': 66, 'vector': [-0.009394473396241665, -0.022314323112368584, -0.00801696628332138, -0.0198987890034914, -0.014114560559391975, 0.025186851620674133, 0.0136836813762784, -0.026897313073277473, -0.0129
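
Independently of any particular vector store, the "filter" idea amounts to narrowing the candidates by metadata first and ranking by similarity second. The sketch below shows that order of operations in plain Python. The genres are the ones printed in the outputs above, while the two-dimensional vectors, the query vector, and the function `filtered_search` are hypothetical; this is not the LangChain or LanceDB filtering API.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Genres as shown in the notebook outputs; vectors are made up for illustration.
rows = [
    {"title": "Hitman: Agent 47", "genres": ["Action", "Crime", "Thriller"],
     "vector": [0.9, 0.44]},
    {"title": "Kung Fu Panda", "genres": ["Adventure", "Animation", "Family", "Comedy"],
     "vector": [0.8, 0.6]},
    {"title": "Friday", "genres": ["Comedy"],
     "vector": [0.1, 0.99]},
]


def filtered_search(query_vector: list, rows: list, genre=None, k: int = 2) -> list:
    """Keep rows matching the genre (if given), then rank by similarity."""
    candidates = [r for r in rows if genre is None or genre in r["genres"]]
    candidates.sort(
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return [r["title"] for r in candidates[:k]]


query_vector = [1.0, 0.3]  # hypothetical embedding of the user's question
print(filtered_search(query_vector, rows))                  # no filter
print(filtered_search(query_vector, rows, genre="Comedy"))  # comedies only
```

With the filter applied, the closest match overall is excluded because it is not a comedy, so only documents that satisfy the constraint can ever reach the LLM.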

2. The “agentic” way: Making our chain agentic means converting the retriever to a tool that the agent can leverage if needed, including the additional variables. By doing so, it would be sufficient for the user to provide their preferences in natural language so that the agent can retrieve the most promising recommendation if needed.

In [58]:
from langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgent
from langchain.schema.messages import SystemMessage
from langchain.prompts import MessagesPlaceholder
from langchain.agents.openai_functions_agent.agent_token_buffer_memory import AgentTokenBufferMemory
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent
from langchain.agents.agent_toolkits import create_retriever_tool
from langchain.chat_models import ChatOpenAI

retriever = docsearch.as_retriever(return_source_documents = True)
llm = ChatOpenAI(openai_api_key=os.environ["OPEN_API_KEY"], temperature = 0)

tool = create_retriever_tool(
    retriever,
    "movies",
    "Searches and returns recommendations about movies."
)
tools = [tool]

system_message = SystemMessage(
        content=(
            "Do your best to answer the questions. "
            "If there is more than one argument for the single-input tool, reason step by step and treat them as a single input. "
            "relevant information, only if necessary"
        )
)

# This is needed for both the memory and the prompt
memory_key = "history"
memory = AgentTokenBufferMemory(memory_key=memory_key, llm=llm)

prompt = OpenAIFunctionsAgent.create_prompt(
        system_message=system_message,
        extra_prompt_messages=[MessagesPlaceholder(variable_name=memory_key)]
    )
agent_executor = create_conversational_retrieval_agent(llm=llm, tools=tools, prompt = prompt, verbose=True)

result = agent_executor({"input": "I liked a lot kung fu panda 1 and 2. Could you suggest me some similar movies?"})
result


> Entering new AgentExecutor chain...

Invoking: `movies` with `{'query': 'kung fu panda'}`


Title: Kung Fu Panda. Overview: When the Valley of Peace is threatened, lazy Po the panda discovers his destiny as the "chosen one" and trains to become a kung fu hero, but transforming the unsleek slacker into a brave warrior won't be easy. It's up to Master Shifu and the Furious Five -- Tigress, Crane, Mantis, Viper and Monkey -- to give it a try. Genres: Adventure, Animation, Family, Comedy. Rating: 6.675006821282402

Title: Kung Fu Panda 2. Overview: Po is now living his dream as The Dragon Warrior, protecting the Valley of Peace alongside his friends and fellow kung fu masters, The Furious Five - Tigress, Crane, Mantis, Viper and Monkey. But Po’s new life of awesomeness is threatened by the emergence of a formidable villain, who plans to use a secret, unstoppable weapon to conquer China and destroy kung fu. It is up to Po and The Furious Five to jou

{'input': 'I liked a lot kung fu panda 1 and 2. Could you suggest me some similar movies?',
 'chat_history': [HumanMessage(content='I liked a lot kung fu panda 1 and 2. Could you suggest me some similar movies?'),
  AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"query":"kung fu panda"}', 'name': 'movies'}}, response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 101, 'total_tokens': 117}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'function_call', 'logprobs': None}, id='run-c3b8f63e-ae65-4451-b4c1-231083184c21-0'),
  FunctionMessage(content='Title: Kung Fu Panda. Overview: When the Valley of Peace is threatened, lazy Po the panda discovers his destiny as the "chosen one" and trains to become a kung fu hero, but transforming the unsleek slacker into a brave warrior won\'t be easy. It\'s up to Master Shifu and the Furious Five -- Tigress, Crane, Mantis, Viper and Monkey -- to give it a try. Genres: Adventur

- To make our application more tailored toward its goal of being a recommender system, we add a prompt engineering step.

In [60]:
from langchain.prompts import PromptTemplate

template = """You are a movie recommender system that helps users find movies that match their preferences. 
Use the following pieces of context to answer the question at the end. 
For each question, suggest three movies, with a short description of the plot and the reason why the user might like it.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Your response:"""

PROMPT = PromptTemplate(template=template, input_variables=["context", "question"])

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ["OPEN_API_KEY"]), 
    chain_type="stuff", 
    retriever=docsearch.as_retriever(),
    return_source_documents=True, 
    chain_type_kwargs=chain_type_kwargs)

query = "I'm looking for a funny action movie, any suggestion?"
result = qa({'query':query})
print(result['result'])


1. Hot Fuzz - A skilled London cop is sent to a small town to handle a case, but he soon discovers a sinister plot that forces him to team up with the local bumbling cop. This comedy-action movie has plenty of laughs and exciting action scenes.
2. Deadpool - A wisecracking mercenary gets superpowers after a failed experiment and sets out to get revenge on the man who ruined his life. This movie is a perfect blend of comedy and action, with the main character constantly breaking the fourth wall to deliver hilarious jokes.
3. The Nice Guys - A private investigator and a hired enforcer team up to solve the case of a missing girl and uncover a conspiracy involving the porn industry. This buddy-action comedy has a great mix of witty dialogue and thrilling action sequences.
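
- As a rough illustration of what `chain_type="stuff"` does with the prompt above: the retrieved documents are concatenated and placed in the `{context}` slot, while the user query fills `{question}`. The sketch below mimics this in plain Python. `stuff_documents`, the sample snippets, and the shortened template are hypothetical stand-ins, and the blank-line separator is assumed to match LangChain's default.

```python
def stuff_documents(template, documents, question, separator="\n\n"):
    """Join all retrieved texts and substitute them into the {context} slot."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


# Hypothetical retrieved snippets, shortened for illustration
docs = [
    "Title: Movie A. Genres: Action, Comedy.",
    "Title: Movie B. Genres: Action.",
]
template = (
    "Use the context to answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Your response:"
)

final_prompt = stuff_documents(template, docs, "Any funny action movie?")
print(final_prompt)
```

Because every retrieved document ends up in a single prompt, this approach is simple but limited by the model's context window.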


- We might also want the prompt to include information gathered through a few preliminary questions, for example on a welcome page. Before letting the user enter a natural language question, we could ask for their age, gender, and favorite movie genre. To do so, we add a section to the prompt whose input variables are filled with the user's answers, and then combine this chunk into the final prompt that we pass to the chain.

In [61]:
from langchain.prompts import PromptTemplate

template_prefix = """You are a movie recommender system that helps users find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}"""

user_info = """This is what we know about the user, and you can use this information to better tune your research:
Age: {age}
Gender: {gender}"""

template_suffix= """Question: {question}
Your response:"""

user_info = user_info.format(age = 18, gender = 'female')

COMBINED_PROMPT = template_prefix +'\n'+ user_info +'\n'+ template_suffix
print(COMBINED_PROMPT)

You are a movie recommender system that helps users find movies that match their preferences.
Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
This is what we know about the user, and you can use this information to better tune your research:
Age: 18
Gender: female
Question: {question}
Your response:


In [63]:
PROMPT = PromptTemplate(
    template=COMBINED_PROMPT, input_variables=["context", "question"])

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ["OPEN_API_KEY"]), 
    chain_type="stuff", 
    retriever=docsearch.as_retriever(),
    return_source_documents=True, 
    chain_type_kwargs=chain_type_kwargs)

query = "Can you suggest me some action movie?"
result = qa({'query':query})
result['result']

' Based on your preferences, I would recommend the following action movies: A Good Day to Die Hard, Goldfinger, Ong Bak 2, and The Raid 2. These movies have a high rating in the action genre and are known for their intense action scenes and thrilling plot. I think you would enjoy them as a fan of action movies.'
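
- One caveat with this approach: the user's answers are formatted into the prompt text before it is handed to `PromptTemplate`, so any curly braces in those answers would later be read as template variables. A minimal sketch of one way to guard against this, assuming plain `str.format`-style templating; `escape_braces`, `build_combined_prompt`, and the sample values are hypothetical names introduced for illustration.

```python
def escape_braces(value):
    """Double curly braces so a later format pass leaves them literal."""
    return str(value).replace("{", "{{").replace("}", "}}")


def build_combined_prompt(prefix, user_info_template, suffix, **user_fields):
    """Fill the user section now; keep {context} and {question} for the chain."""
    safe = {key: escape_braces(val) for key, val in user_fields.items()}
    return prefix + "\n" + user_info_template.format(**safe) + "\n" + suffix


prefix = "Use the following context.\n\n{context}"
user_info_template = "Age: {age}\nFavorite genre: {genre}"
suffix = "Question: {question}\nYour response:"

# The genre answer deliberately contains braces to show the escaping
combined = build_combined_prompt(
    prefix, user_info_template, suffix, age=18, genre="sci-fi {classic}"
)

# Second formatting pass, standing in for what the chain does later
final = combined.format(context="<retrieved movies>", question="Any suggestion?")
print(final)
```

The combined string still exposes only `context` and `question` as input variables, which matches what the chain expects.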

##### Building a content-based system

- Recommender systems often already hold some history about their users, and it is extremely useful to embed this knowledge in our application. Let’s imagine, for example, that we have a users database where the system has stored each registered user’s information (such as age, gender, and country) as well as the movies the user has already watched, along with their ratings. The high-level architecture is shown in img2.

- As discussed earlier, we now have some information about our users’ preferences. More specifically, imagine we have a dataset containing users’ attributes (name, age, gender) along with their reviews (a score from 1 to 10) of some movies.

In [64]:
import pandas as pd
data = {
    "username": ["Alice", "Bob"],
    "age": [25, 32],
    "gender": ["F", "M"],
    "movies": [
        [("Transformers: The Last Knight", 7), ("Pokémon: Spell of the Unknown", 5)],
        [("Bon Cop Bad Cop 2", 8), ("Goon: Last of the Enforcers", 9)]
    ]
}
# Convert the "movies" column into dictionaries
for i, row_movies in enumerate(data["movies"]):
    movie_dict = {}
    for movie, rating in row_movies:
        movie_dict[movie] = rating
    data["movies"][i] = movie_dict
# Create a pandas DataFrame
df = pd.DataFrame(data)
df.head()

|   | username | age | gender | movies |
|---|---|---|---|---|
| 0 | Alice | 25 | F | {'Transformers: The Last Knight': 7, 'Pokémon:... |
| 1 | Bob | 32 | M | {'Bon Cop Bad Cop 2': 8, 'Goon: Last of the En... |


In [65]:
template_prefix = """You are a movie recommender system that helps users find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}"""

user_info = """This is what we know about the user, and you can use this information to better tune your research:
Age: {age}
Gender: {gender}
Movies already seen along with their ratings: {movies}"""

template_suffix= """Question: {question}
Your response:"""

In [66]:
# Format the user_info chunk, assuming the user interacting with the system is Alice.
# Select the user's row once and use positional access (.iloc), so the lookup
# also works for users whose row label is not 0.
user = df.loc[df['username'] == 'Alice'].iloc[0]
age = user['age']
gender = user['gender']
movies = ''
# Iterate over the user's dictionary and output movie name and rating
for movie, rating in user['movies'].items():
    movies += f"Movie: {movie}, Rating: {rating}\n"
user_info = user_info.format(age=age, gender=gender, movies=movies)
COMBINED_PROMPT = template_prefix + '\n' + user_info + '\n' + template_suffix
print(COMBINED_PROMPT)

You are a movie recommender system that helps users find movies that match their preferences.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}
This is what we know about the user, and you can use this information to better tune your research:
Age: 25
Gender: F
Movies already seen along with their ratings: Movie: Transformers: The Last Knight, Rating: 7
Movie: Pokémon: Spell of the Unknown, Rating: 5

Question: {question}
Your response:


In [68]:
PROMPT = PromptTemplate(
    template=COMBINED_PROMPT, input_variables=["context", "question"])

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ["OPEN_API_KEY"]), 
    chain_type="stuff", 
    retriever=docsearch.as_retriever(),
    return_source_documents=True, 
    chain_type_kwargs=chain_type_kwargs)

query = "Can you suggest me some action movie based on my background?"
result = qa({'query':query})
result['result']

' Based on your background, I would recommend the following action movies:\n1. Mad Max: Fury Road - Rating: 7.5\n2. Atomic Blonde - Rating: 6.7\n3. Wonder Woman - Rating: 7.4\n4. John Wick - Rating: 7.4\n5. Mission: Impossible - Fallout - Rating: 7.7'
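
- In the prompt-based approach, it is left to the LLM to weigh the user's past ratings. As a complement, the same ratings could be turned into an explicit content-based signal. The sketch below averages a user's ratings per genre and ranks unseen movies by that profile. It is only an illustration under stated assumptions: `genre_profile`, `score`, and the placeholder titles and genre tags are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def genre_profile(rated_movies, movie_genres):
    """Average the user's ratings per genre to get a simple preference profile."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for title, rating in rated_movies.items():
        for genre in movie_genres.get(title, []):
            totals[genre] += rating
            counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


def score(candidate_genres, profile):
    """Mean profile value over the candidate's genres (0 for unseen genres)."""
    if not candidate_genres:
        return 0.0
    total = sum(profile.get(genre, 0.0) for genre in candidate_genres)
    return total / len(candidate_genres)


# Hypothetical toy data: placeholder titles and genre tags
rated = {"Movie A": 8, "Movie B": 4}
genres = {"Movie A": ["Action", "Comedy"], "Movie B": ["Drama", "Comedy"]}
candidates = {"Movie C": ["Action"], "Movie D": ["Drama"]}

profile = genre_profile(rated, genres)
ranked = sorted(candidates, key=lambda t: score(candidates[t], profile), reverse=True)
print(profile)
print(ranked)
```

A ranking like this could be used to pre-filter the retrieved documents, or be added to the user section of the prompt as extra context.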