# Content-based Filtering Recommendation

A content-based recommender system is another type of recommender, distinct from the [Memory-based](https://github.com/Olliang/All-About-Movie-Data/blob/master/Memory_Based_CF.ipynb) and [Model-based](https://github.com/Olliang/All-About-Movie-Data/blob/master/Matrix_Factorization_CF.ipynb) Collaborative Filtering approaches that I have showcased. Compared with Collaborative Filtering, a content-based recommender requires extra information about the available items and some sort of user profile describing what the user likes. Its general task is to learn user preferences and then recommend items that are "similar" to those preferences. It generally works well when it is easy to determine the context/properties of each item.<br>

Simple summary:<br>
**Collaborative Filtering:** "Show me how other users have rated this item" or "Show me how other items are rated by this user"<br>
**Content-based Recommender System:** "Show me more of the same as what I've liked"<br>
<br>
<br>
**Term Frequency and Inverse Document Frequency (TF-IDF)** is a standard measure for weighting the terms that describe each item, so that items can be compared by their weighted terms. Term Frequency (TF) measures how frequently a term occurs in a document. Inverse Document Frequency (IDF) devalues terms that appear in many documents of the collection, since such terms do little to distinguish one document from another. How are they calculated?<br>
**TF(w)** = (number of times the word w appears in a document) / (total number of words in the document)<br>
**IDF(w)** = log(number of documents / number of documents that contain the word w)<br>
**TF-IDF(w)** = TF(w) × IDF(w)
<br>
![TFIDF](TF_IDF.PNG)
<br>
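
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in the notebook: the toy corpus and the helper names `tf`, `idf` and `tf_idf` are assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each "document" is the genre list of one movie.
docs = [
    ["animation", "children", "comedy"],
    ["crime", "thriller"],
    ["comedy", "romance"],
]

def tf(term, doc):
    """Share of the document's terms that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / number of documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "comedy" appears in two of the three documents, "animation" in only one,
# so "animation" gets the higher weight for the first movie.
for term in ["comedy", "animation"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
# comedy 0.1352
# animation 0.3662
```

Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its scores will not match these textbook values exactly.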
After calculating the TF-IDF score for each term, how do we determine which items are closer to each other? This notebook answers that by comparing the items' TF-IDF vectors with cosine similarity.
<br>

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.simplefilter('ignore')

In [9]:
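# We will use MovieLens data for this model.
# The files use the two-character separator '::', which pandas treats as a
# regular expression, so the Python parsing engine is requested explicitly.
ratings = pd.read_csv('ratings.dat', sep='::',
                      header = None, engine = 'python',
                      names = ['user_id', 'movie_id', 'rating', 'timestamp'])
users = pd.read_csv('users.dat', sep='::',
                    header = None, engine = 'python',
                    names = ['user_id', 'gender', 'age', 'occu_id', 'zipcode'])
movies = pd.read_csv('movies.dat', sep='::',
                    header = None, engine = 'python',
                    names = ['movie_id', 'title', 'genres'])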

This content-based recommender system computes the similarity between movies based on their genres. I will use the `TfidfVectorizer` class from `scikit-learn` to convert a collection of text documents into a matrix of TF-IDF features. Each movie's genre string is treated as one document, and the full set of movies forms the corpus.

In [10]:
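# Split the pipe-delimited genre field into a list of genres
movies['genres'] = movies['genres'].str.split('|')
# Store each list as its string form, since TfidfVectorizer expects one string per movie
movies['genres'] = movies['genres'].fillna('').astype('str')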
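
To see what the vectorizer will make of such a string, the sketch below imitates its tokenization step with the standard library only. It assumes scikit-learn's default token pattern (runs of two or more word characters) and lowercasing, and it leaves out stop-word removal; `tokenize` and `ngrams` are hypothetical helper names.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def ngrams(tokens, n):
    """Join every run of n consecutive tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Same preprocessing as above, applied to one raw genre field
genre_string = str("Animation|Children's|Comedy".split("|"))
tokens = tokenize(genre_string)

print(genre_string)       # ['Animation', "Children's", 'Comedy']
print(tokens)             # ['animation', 'children', 'comedy']
print(ngrams(tokens, 2))  # ['animation children', 'children comedy']
```

Brackets, quotes and the single-letter `s` of `Children's` are dropped, and a hyphenated genre such as `Sci-Fi` becomes the two tokens `sci` and `fi`. With `ngram_range=(1, 2)` both the single tokens and the adjacent pairs become features.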

In [61]:
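from sklearn.feature_extraction.text import TfidfVectorizer
# Use single terms and pairs of adjacent terms as features.
# min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print('The number of terms: ', len(tf.get_feature_names_out()))
print('The number of items: ', tfidf_matrix.shape[0])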

The number of terms:  127
The number of items:  3883


In [86]:
# Show the TF-IDF scores as a movie-by-term table
feature_names = tf.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index = movies['movie_id'], columns=feature_names)
tfidf_df.iloc[:5,:5]

| movie_id | action | action adventure | action animation | action children | action comedy |
|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Now that we have the TF-IDF score of each term for each movie, we can calculate the similarity between each pair of movies using cosine similarity (the metric we used in the other notebook) or any other metric for pairwise vector similarity.<br>
From some research online, I found that `linear_kernel` from sklearn computes cosine similarity faster than calling `cosine_similarity` directly. This works because `TfidfVectorizer` L2-normalizes each row by default, so the plain dot product that `linear_kernel` computes already equals the cosine similarity. This is what we use in this notebook.

In [78]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]

array([[1.        , 0.14193614, 0.09010857, 0.1056164 ],
       [0.14193614, 1.        , 0.        , 0.        ],
       [0.09010857, 0.        , 1.        , 0.1719888 ],
       [0.1056164 , 0.        , 0.1719888 , 1.        ]])
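
The reason a plain dot product can stand in for cosine similarity is easy to check numerically. The sketch below uses NumPy only, with a random matrix as a stand-in for the TF-IDF matrix; it illustrates the identity and is not a benchmark of the two sklearn functions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))  # stand-in for 4 movies x 6 terms

# L2-normalize every row, which is what TfidfVectorizer does by default
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# A linear kernel is just the matrix of pairwise dot products
dot_products = X_unit @ X_unit.T

# Cosine similarity computed from its definition on the unnormalized rows
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))       # True
print(np.allclose(np.diag(dot_products), 1.0)) # True
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and the dot product would stop matching the cosine similarity.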

In [144]:
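# Map each movie title to its row position in `movies` (and in `cosine_sim`)
title_genres = movies[['title', 'genres']]
indices = pd.Series(movies.index, index=movies['title'])

# Return the movies whose genres are most similar to those of the given title,
# ranked by cosine similarity score
def genre_recommendations(title):
    idx = indices[title]
    # Pair every row position with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first (ties keep their original order)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry and keep the next 20. This assumes the first entry is
    # the movie itself; when other movies tie at 1.0, the entry skipped is the
    # earliest tied row, which may be a different movie
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    movie_scores = [i[1] for i in sim_scores]
    result = pd.DataFrame({'movies': title_genres.iloc[movie_indices,0],
                           'genres': title_genres.iloc[movie_indices,1],
                           'sim_scores': movie_scores})
    return result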

def get_recommendations(movie_id):
    title = movies['title'][movies['movie_id'] == movie_id].values[0]
    genres = movies[movies['movie_id'] == movie_id]['genres'].values[0]
    print('Recommendations for people who has watched movie: {}'.format(title))
    print('---------------------------------------------')
    print('The genre of this movie: {}'.format(genres))
    print('---------------------------------------------')
    return genre_recommendations(title)
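
Dropping the first sorted entry only removes the query movie when that movie happens to sort first. A possible alternative, sketched here with NumPy on a tiny made-up similarity matrix, is to filter out the query index explicitly. The function name `top_similar` and the matrix `toy_sim` are hypothetical and not part of the notebook.

```python
import numpy as np

def top_similar(sim_matrix, idx, k=10):
    """Return (index, score) pairs for the k rows most similar to row idx,
    excluding idx itself."""
    scores = np.asarray(sim_matrix[idx], dtype=float)
    # Stable sort on the negated scores: highest first, ties in original order
    order = np.argsort(-scores, kind="stable")
    # Remove the query row by index rather than by position
    order = order[order != idx][:k]
    return [(int(i), float(scores[i])) for i in order]

# Rows 0 and 1 have identical genres, so they tie at 1.0
toy_sim = np.array([
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

print(top_similar(toy_sim, 1, k=2))  # [(0, 1.0), (2, 0.2)]
```

For row 1, slicing off the first sorted entry would have discarded row 0 and kept row 1 itself; filtering by index keeps row 0 and removes only the query.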

In [146]:
get_recommendations(1)

Recommendations for people who has watched movie: Toy Story (1995)
---------------------------------------------
The genre of this movie: ['Animation', "Children's", 'Comedy']
---------------------------------------------


| | movies | genres | sim_scores |
|---|---|---|---|
| 1050 | Aladdin and the King of Thieves (1996) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 2072 | American Tail, An (1986) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 2073 | American Tail: Fievel Goes West, An (1991) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 2285 | Rugrats Movie, The (1998) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 2286 | Bug's Life, A (1998) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 3045 | Toy Story 2 (1999) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 3542 | Saludos Amigos (1943) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 3682 | Chicken Run (2000) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 3685 | Adventures of Rocky and Bullwinkle, The (2000) | ['Animation', "Children's", 'Comedy'] | 1.0 |
| 236 | Goofy Movie, A (1995) | ['Animation', "Children's", 'Comedy', 'Romance'] | 0.869805 |


In [150]:
get_recommendations(47)

Recommendations for people who has watched movie: Seven (Se7en) (1995)
---------------------------------------------
The genre of this movie: ['Crime', 'Thriller']
---------------------------------------------


| | movies | genres | sim_scores |
|---|---|---|---|
| 49 | Usual Suspects, The (1995) | ['Crime', 'Thriller'] | 1.0 |
| 517 | Romeo Is Bleeding (1993) | ['Crime', 'Thriller'] | 1.0 |
| 653 | Purple Noon (1960) | ['Crime', 'Thriller'] | 1.0 |
| 1073 | Reservoir Dogs (1992) | ['Crime', 'Thriller'] | 1.0 |
| 1331 | Albino Alligator (1996) | ['Crime', 'Thriller'] | 1.0 |
| 1444 | City of Industry (1997) | ['Crime', 'Thriller'] | 1.0 |
| 1601 | Playing God (1997) | ['Crime', 'Thriller'] | 1.0 |
| 1629 | Incognito (1997) | ['Crime', 'Thriller'] | 1.0 |
| 1640 | Red Corner (1997) | ['Crime', 'Thriller'] | 1.0 |
| 2140 | Young and Innocent (1937) | ['Crime', 'Thriller'] | 1.0 |


In [145]:
get_recommendations(1704)

Recommendations for people who has watched movie: Good Will Hunting (1997)
---------------------------------------------
The genre of this movie: ['Drama']
---------------------------------------------


| | movies | genres | sim_scores |
|---|---|---|---|
| 25 | Othello (1995) | ['Drama'] | 1.0 |
| 26 | Now and Then (1995) | ['Drama'] | 1.0 |
| 29 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | ['Drama'] | 1.0 |
| 30 | Dangerous Minds (1995) | ['Drama'] | 1.0 |
| 35 | Dead Man Walking (1995) | ['Drama'] | 1.0 |
| 39 | Cry, the Beloved Country (1995) | ['Drama'] | 1.0 |
| 42 | Restoration (1995) | ['Drama'] | 1.0 |
| 52 | Lamerica (1994) | ['Drama'] | 1.0 |
| 54 | Georgia (1995) | ['Drama'] | 1.0 |
| 56 | Home for the Holidays (1995) | ['Drama'] | 1.0 |


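As we can see, the movies recommended for each of these 3 movies share very similar genres. The approach works well when movies are well described by their genres. Note, however, that many of the recommendations tie at a similarity of 1.0, so genre alone cannot rank movies within the same genre group.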