Problems with the knowledge-based recommender:

- The model and its recommendations remained very generic.
- An obvious fix is to ask the user for more metadata as input, for example a sub-genre. This runs into three problems:
  - We do not have data on sub-genres.
  - Our users are unlikely to know their favorite movies' metadata.
  - Even if they did, they would probably not have the patience to enter it into a long form.


## ch4 content-based recommender
- **Plot description-based recommender**: This model compares the descriptions and
taglines of different movies, and provides recommendations that have the most
similar plot descriptions.

- **Metadata-based recommender**: This model takes a host of features, such as
genres, keywords, cast, and crew, into consideration and provides
recommendations that are the most similar with respect to the aforementioned
features.

### 1. **Plot description-based recommender**

In [1]:
import pandas as pd
import numpy as np
#Import data from the clean file
df = pd.read_csv('../data/metadata_clean.csv')
#Print the head of the cleaned DataFrame
df.head()


Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995


We will be using the pairwise similarity between the plots of different movies, so we need a way to numerically quantify the similarity between two bodies of text.

Representing documents (plots) as vectors:
- **CountVectorizer**: consider three 'documents':
  - A: The sun is a star.
  - B: My love is like a red, red rose
  - C: Mary had a little lamb
  - The vocabulary is the set of unique words across the three documents, ignoring very common words such as *the*, *is*, *a*, *my*, and *had*:
    - V: like, little, lamb, love, mary, red, rose, sun, star (9 words)
  - Each document is then represented as a nine-dimensional vector, where each dimension counts the occurrences of the corresponding vocabulary word:
    - A: (0, 0, 0, 0, 0, 0, 0, 1, 1)
    - B: (1, 0, 0, 1, 0, 2, 1, 0, 0)
    - C: (0, 1, 1, 0, 1, 0, 0, 0, 0)
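
The count vectors above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not how scikit-learn's `CountVectorizer` is implemented: the vocabulary is hard-coded in the order listed above, and the helper name `count_vector` is made up for illustration.

```python
import re
from collections import Counter

documents = {
    "A": "The sun is a star.",
    "B": "My love is like a red, red rose",
    "C": "Mary had a little lamb",
}

# Vocabulary in the same order as listed above (common words already left out).
vocabulary = ["like", "little", "lamb", "love", "mary", "red", "rose", "sun", "star"]


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Counter returns 0 for words that never occur in the text.
    return tuple(counts[word] for word in vocabulary)


count_vectors = {name: count_vector(text, vocabulary) for name, text in documents.items()}
for name, vector in count_vectors.items():
    print(name, vector)
```

Words outside the vocabulary are simply ignored, which is what removing stop words amounts to here.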


- **TF-IDFVectorizer**:
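  - For every word i in document j, the following applies:
\begin{equation}
w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)
\end{equation}
  - $w_{i,j}$ is the weight of word i in document j
  - $tf_{i,j}$ is the term frequency of word i in document j
  - $df_i$ is the number of documents that contain the term i
  - $N$ is the total number of documents
  - The weight of a word in a document is therefore greater if it occurs more frequently in that document and is present in fewer documents.
  - The raw weight from this formula is not bounded by 1. scikit-learn's `TfidfVectorizer` L2-normalizes each document vector by default, which is why the weights $w_{i,j}$ it produces lie between 0 and 1.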
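
As a worked example of the formula, the sketch below computes a few weights by hand. The three tokenised documents are a made-up toy corpus, the helper name `tfidf_weight` is hypothetical, and the natural logarithm is an assumption (the formula above does not fix the base). scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalization by default, so its numbers will differ.

```python
import math

# Made-up toy corpus, already tokenised and with stop words removed.
docs = [
    ["sun", "star"],
    ["red", "red", "rose"],
    ["red", "sun"],
]


def tfidf_weight(term, doc, docs):
    """w_ij = tf_ij * log(N / df_i), with raw counts and the natural logarithm."""
    tf = doc.count(term)                      # occurrences of the term in this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


w_red = tfidf_weight("red", docs[1], docs)    # tf = 2, df = 2 -> 2 * ln(3/2)
w_rose = tfidf_weight("rose", docs[1], docs)  # tf = 1, df = 1 -> ln(3)
print(round(w_red, 4), round(w_rose, 4))
```

Even though "red" occurs twice in the second document, "rose" ends up with the larger weight because it appears in only one document.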




#### Cosine similarity

\begin{equation}
cosine(x, y) = \frac{x \cdot y^{T}}{\lVert x \rVert \, \lVert y \rVert}
\end{equation}

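The higher the cosine score, the more similar the documents are to each other.
For vectors with non-negative entries, such as count or TF-IDF vectors, the score lies between 0 and 1.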
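
A minimal plain-Python sketch of the cosine score, applied to the count vectors of documents A and B from the CountVectorizer example. The function name `cosine` mirrors the formula above; returning 0 for an all-zero vector is a choice made here, not something the formula defines.

```python
import math


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat an empty document as similar to nothing
    return dot / (norm_x * norm_y)


a = (0, 0, 0, 0, 0, 0, 0, 1, 1)  # "The sun is a star."
b = (1, 0, 0, 1, 0, 2, 1, 0, 0)  # "My love is like a red, red rose"

print(cosine(a, a))  # a document compared with itself
print(cosine(a, b))  # two documents with no words in common
```

A document compared with itself scores 1 (up to floating-point rounding), while A and B share no vocabulary words and therefore score 0.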


#### Plot description-based recommender
- input: movie title
- output: a list of movies that are most similar based on their plots.

**Steps**:

1. Obtain the data required to build the model
2. Create TF-IDF vectors for the plot description (or overview) of every movie
3. Compute the pairwise cosine similarity score of every movie
4. Write the recommender function that takes in a movie title as an argument and
outputs movies most similar to it based on the plot

In [15]:
# first step: obtain the data
# Import the original file
orig_df = pd.read_csv('../data/movies_metadata.csv', low_memory=False)
#Add the useful features into the cleaned dataframe (we removed them earlier in ch3)
df['overview'], df['id'] = orig_df['overview'], orig_df['id']





In [16]:
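# Drop the rows whose id holds a date string instead of a numeric ID;
# they would make the float conversion of id in the next cell fail
df = df[df.id!='1997-08-20']
df = df[df.id!='2012-09-29']
df = df[df.id!='2014-01-01']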

In [20]:
# We take a subset of the data (my machine couldn't handle the full dataset)
sub_df = pd.read_csv("../data/links_small.csv")
df['id']=df['id'].apply(float)
small_df= df.merge(sub_df,right_on="tmdbId",left_on='id')
print(small_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9099 entries, 0 to 9098
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         9099 non-null   object 
 1   genres        9099 non-null   object 
 2   runtime       9099 non-null   float64
 3   vote_average  9099 non-null   float64
 4   vote_count    9099 non-null   float64
 5   year          9099 non-null   int64  
 6   overview      9087 non-null   object 
 7   id            9099 non-null   float64
 8   movieId       9099 non-null   int64  
 9   imdbId        9099 non-null   int64  
 10  tmdbId        9099 non-null   float64
dtypes: float64(5), int64(3), object(3)
memory usage: 782.1+ KB
None


In [21]:
# second step: Creating the TF-IDF matrix
# each row represents the TF-IDF vector of the overview feature of the corresponding movie in our main DataFrame.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
small_df['overview'] = small_df['overview'].fillna('')
#Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
tfidf_matrix = tfidf.fit_transform(small_df['overview'])
#Output the shape of tfidf_matrix
tfidf_matrix.shape

(9099, 29727)

In [22]:
# Third step: computing the cosine similarity scores
# We create a 9,099 × 9,099 matrix for the subset, where the cell in the
# ith row and jth column holds the similarity score between movies i and j.
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel already equals the cosine similarity.
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# On the full dataset (45,466 movies) this step failed with:
# MemoryError: Unable to allocate 15.4 GiB for an array with shape (45466, 45466) and data type float64
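
One possible way around the memory error, instead of shrinking the dataset, is to compute only the row of scores needed for a single movie. The sketch below is not what this notebook does: the three overviews are made up, and `similarity_row` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up overviews standing in for small_df['overview'].
toy_overviews = [
    "A lion cub grows up to become king of the jungle.",
    "A young lion returns to reclaim the jungle throne.",
    "Two detectives chase a thief across the city.",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)


def similarity_row(idx, matrix):
    """Cosine scores between row idx and every row, without building the n x n matrix."""
    # Slicing keeps the query as a 1 x n_features sparse matrix.
    return linear_kernel(matrix[idx:idx + 1], matrix).ravel()


row_scores = similarity_row(0, toy_matrix)
print(row_scores)
```

This trades memory for time: each request costs one matrix-vector product, but only n scores are held in memory at once rather than n × n.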




#### Final step: building the recommendation function
We will perform the following steps in building the recommender function:
1. Declare the title of the movie as an argument.
2. Obtain the index of the movie from the indices reverse mapping.
3. Get the list of cosine similarity scores for that particular movie with all movies
using cosine_sim. Convert this into a list of tuples where the first element is the
position and the second is the similarity score.
4. Sort this list of tuples on the basis of the cosine similarity scores.
5. Get the top 10 elements of this list. Ignore the first element as it refers to the
similarity score with itself (the movie most similar to a particular movie is
obviously the movie itself).
6. Return the titles corresponding to the indices of the top 10 elements, excluding
the first


In [23]:
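# Construct a reverse mapping from titles to row positions; drop duplicate titles (keep the first).
# Series.drop_duplicates() would compare the positions, not the titles, so filter on the index instead.
indices = pd.Series(small_df.index, index=small_df['title'])
indices = indices[~indices.index.duplicated(keep='first')]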
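
To see what the reverse mapping holds and how duplicate titles behave, here is a toy version. The frame is invented for illustration: the titles come from the table at the top, and the second 'Jumanji' row is made up to create a duplicate.

```python
import pandas as pd

# Toy frame; the repeated 'Jumanji' row is made up to show a duplicate title.
toy_df = pd.DataFrame({'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men', 'Jumanji']})

# Reverse mapping: title -> row position.
toy_indices = pd.Series(toy_df.index, index=toy_df['title'])

# drop_duplicates() compares the values (row positions), which are all unique,
# so it removes nothing; duplicated() on the index compares the titles instead.
unique_indices = toy_indices[~toy_indices.index.duplicated(keep='first')]

print(len(toy_indices.drop_duplicates()))  # duplicate title is still there
print(unique_indices['Jumanji'])           # a single position for the title
```

With a duplicate title left in the mapping, a lookup such as `toy_indices['Jumanji']` returns a Series of positions instead of one integer, which a recommender function expecting a single index would not handle.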

In [24]:
# Function that takes in movie title as input and gives recommendations
def content_recommender(title, cosine_sim=cosine_sim, df=small_df,
                        indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    # and convert them into a list of (position, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [25]:
#Get recommendations for The Lion King
content_recommender('The Lion King')

7784                      African Cats
5879    The Lion King 2: Simba's Pride
4526                         Born Free
2719                          The Bear
4772     Once Upon a Time in China III
7072                        Crows Zero
739                   The Wizard of Oz
8926                   The Jungle Book
1749                 Shadow of a Doubt
7997                      October Baby
Name: title, dtype: object

### 2. **Metadata-based recommender**
The steps are the same as for the plot description-based recommender; only the data differs.

To build this model, we will be using the following metadata:
- The genre of the movie. 
- The director of the movie. This person is part of the crew.
- The movie's three major stars. They are part of the cast.
- Sub-genres or keywords.

In [26]:
# Load the keywords and credits files
cred_df = pd.read_csv('../data/credits.csv')
key_df = pd.read_csv('../data/keywords.csv')
#Print the head of the credit dataframe
cred_df.head()


Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [27]:
#Print the head of the keywords dataframe
key_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [30]:
#Convert the IDs of df into int
small_df['id'] = small_df['id'].astype('int')
key_df['id'] = key_df['id'].astype('int')
cred_df['id'] = cred_df['id'].astype('int')

In [31]:
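# Merge keywords and credits into the main metadata dataframe.
# Note: this merges the full df, not the small_df converted above; df's id is still
# float64, which is why id shows as 862.0 in the output below.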
df = df.merge(cred_df, on='id')
df = df.merge(key_df, on='id')
#Display the head of the merged df
df.head()



Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
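
The `cast`, `crew`, and `keywords` cells print like lists of dictionaries, but because they were read from CSV they are plain strings. A minimal sketch of turning a `keywords` cell into a list of names follows. The helper name `keyword_names` is hypothetical, and the sample strings are shortened to the entries that are fully visible in the truncated output above.

```python
from ast import literal_eval


def keyword_names(raw):
    """Parse a stringified list of {'id': ..., 'name': ...} dicts and return the names."""
    return [entry['name'] for entry in literal_eval(raw)]


# Shortened sample cells: only the first, fully visible entry of each list is kept.
raw_cells = [
    "[{'id': 931, 'name': 'jealousy'}]",
    "[{'id': 10090, 'name': 'board game'}]",
]

parsed = [keyword_names(cell) for cell in raw_cells]
print(parsed)
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings coming from a data file.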
