In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd "/content/drive/My Drive/Colab Notebooks/CoE202/Content-based Filtering"

# Recommender System

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=18-785SB1H_aWDukQ8fzvi6nqoH2d2ZXK" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
  <figcaption>[Infographic by GO-Globe Web Design Company]</figcaption>
</figure>


### Why Recommendation?

- To identify things that we might like
- To help people discover new content
- To discover which things go together
- To personalize user experiences in response to user feedback

"The most powerful AI space for the next couple of decades is recommendation systems.  
They are going to have the biggest impact on our society because they **affect the information we see, how we learn, what we think, how we communicate.**  
These algorithms are controlling us ..."


## Content-based vs. Collaborative Filtering

- **Collaborative Filtering (CF)** (<span style="color:red">Only Ratings</span>)
    - Memory-based approach (Neighborhood-based CF)
    - Model-based approach
    
- **Side information-based Recommendation**
    - Content-based approach (<span style="color:red">Only Contents</span>)
    - Content-based CF (<span style="color:red">Rating + Contents</span>)
        - Text, Image, Social network ... 

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=197ni5M07eTb6hliZH3niu0YyP_4mfF6F" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
  <figcaption>[Picture from https://www.themarketingtechnologist.co/]</figcaption>
</figure>

## Content-based recommendation
**Goal** : Recommend items **similar** to those the user liked  
Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

#### Example
- <span style="color:blue">Movie recommendations</span> : Recommend movies with the same actor, director, or genre
- <span style="color:blue">Websites, blogs, news</span> : Recommend other sites with "similar" content


#### When is it useful?
<span style="color:red">**Useful when ratings of other users are not available**</span>  

Example: John has rated the movie “Terminator” highly, but we do not have access to the ratings of other users.
- Therefore, we cannot use Collaborative Filtering (CF).
- The item description of “Terminator” contains genre keywords similar to those of other science fiction movies, such as “Matrix”.
- Therefore, “Matrix” can be recommended to John.

<span style="color:red">**Then, how can we find similar items?**</span>

#### Bag-of-Words (BoW)

**Bag-of-Words (BoW)** simplifies the representation of a document to the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.  
The bag-of-words model is commonly used in **document classification**, where the occurrence of each word is used as a feature for training a classifier.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=19JlenTX2aZVjRaR4LFwUBc41Vc7KWjAN" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
</figure>


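In its simplest form, each document is represented by a **binary vector** that marks which vocabulary words it contains; a count-based variant stores the number of occurrences of each word instead.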
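
To make the two representations concrete, here is a minimal sketch in plain Python. The two toy documents and the helper names `count_vector` and `binary_vector` are assumptions made for illustration; they are not part of the notebook's data or of any library.

```python
from collections import Counter

# Toy documents (hypothetical, not taken from the notebook's corpus)
docs = [
    "the matrix is a science fiction movie",
    "the terminator is a science fiction movie about a machine",
]

# Shared vocabulary: every distinct word across all documents, sorted
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocabulary):
    """Number of times each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

def binary_vector(doc, vocabulary):
    """1 if the vocabulary word occurs in the document, else 0."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocabulary]

print(vocabulary)
for doc in docs:
    print(count_vector(doc, vocabulary), binary_vector(doc, vocabulary))
```

The two vectors differ only where a word is repeated: the second document contains "a" twice, so its count vector holds 2 in that position while its binary vector holds 1.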


<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=19HEdheo7WX71eLuYYiEOKABeQ-v4o7nQ" width="50%" height="50%" title="recommender system" alt="recommender system"></img>
</figure>

Let's see how to use Bag-of-Words in Natural Language Processing.

In [None]:
!pip install nltk
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
          "Look at it at least twice and definitely watch part 2.", 
          "It will change your view of the matrix",
          "Are the human people the ones who started the war?",
          "Is AI a bad thing?"]

In [None]:
stop_words = stopwords.words("english") # to remove the meaningless words
lemmatizer = WordNetLemmatizer() # to find the root of the word

In [None]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
   
    # tokenize text
    tokens = word_tokenize(text)
   
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
   
    return tokens

In [None]:
word_tokenize(corpus[0])

In [None]:
tokenize(corpus[0])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer = tokenize)

# get counts of each token(word) in text data
X = vect.fit_transform(corpus)

X.toarray()

But this model ignores the relative importance of words. For example, a word like 'may' appears in many documents, so it can distort the similarity between them.  
To overcome this problem, TF-IDF takes the relative importance of each term into account.

#### Term Frequency - Inverse Document Frequency (TF-IDF)
**Term Frequency (TF)** : The number of occurrences of a term in a document  
**Document Frequency (DF)** : The number of documents in the collection that the term occurs in.  
We assume that rare terms are more informative than frequent terms.  
> **Inverse Document Frequency (IDF)** : A measure of the informativeness of the term.  
${idf}_{t} = \log\frac{N}{df_t}$, where $N$ is the total number of documents in the collection.


**TF-IDF** is used to evaluate how important a term is to a document within a corpus of documents.   
$\text{tf-idf}_{t,d} = (1 + \log{{tf}_{t,d}})\log{\frac{N}{df_t}}$ if ${tf}_{t,d} > 0$, and $0$ otherwise.
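
As a worked example of the formula above, the following sketch computes the weights by hand on a tiny tokenized corpus. The corpus is hypothetical, the helper names `idf` and `tf_idf` are chosen for this example, and the natural logarithm is assumed because the text does not fix a base.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical data)
docs = [
    ["science", "fiction", "robot", "robot"],
    ["science", "fiction", "space"],
    ["romance", "comedy"],
    ["science", "comedy"],
]
N = len(docs)

# df_t: number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Inverse document frequency: log(N / df_t)."""
    return math.log(N / df[term])

def tf_idf(term, doc):
    """(1 + log tf) * idf, or 0 when the term is absent from the document."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * idf(term)

for term in ["science", "robot"]:
    print(term, round(tf_idf(term, docs[0]), 3))
```

"robot" appears twice in the first document and nowhere else, so it receives a much larger weight than "science", which appears in three of the four documents. Note that scikit-learn's `TfidfVectorizer` uses a different variant by default (raw term counts, a smoothed IDF, and L2-normalized rows), so its numbers will not match this hand computation.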

## Content-based movie recommendation system
We are going to build a **content-based movie recommendation system**.  
The dataset contains metadata for all of the roughly 45,000 movies listed in the Full MovieLens Dataset, all released on or before July 2017.  
- movies_metadata.csv
    - The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies
    
- keywords.csv
    - The file containing the movie plot keywords for our MovieLens movies.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Movie metadata
movie_data = pd.read_csv('./movies_metadata.csv')

print(movie_data.shape)
print(movie_data.columns)
movie_data.head()

(45466, 24)
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')




Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
movie_data =  movie_data.loc[movie_data['original_language'] == 'en', :]
movie_data = movie_data[['id', 'title', 'original_language', 'genres']]

print(movie_data.shape)
movie_data.head()

(32269, 4)


Unnamed: 0,id,title,original_language,genres
0,862,Toy Story,en,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,en,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,en,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,en,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,en,"[{'id': 35, 'name': 'Comedy'}]"


In [4]:
# Movie keywords data
movie_keyword = pd.read_csv('./keywords.csv')

print(movie_keyword.shape)
movie_keyword.head()

(46419, 2)


Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [5]:
# Merge metadata and keywords data
movie_data.id = movie_data.id.astype(int)
movie_keyword.id = movie_keyword.id.astype(int)
movie_data = pd.merge(movie_data, movie_keyword, on='id')

print(movie_data.shape)
movie_data.head()

(32852, 5)


Unnamed: 0,id,title,original_language,genres,keywords
0,862,Toy Story,en,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,en,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,en,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,en,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,en,"[{'id': 35, 'name': 'Comedy'}]","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


Movie genres and keywords look like lists of dictionaries, but they are actually stored as strings. So we are going to convert each string into a real list of dictionaries with the **ast.literal_eval** function.
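
The conversion can be tried on a single value first. The sketch below wraps `literal_eval` in a hypothetical helper, `parse_names`, which is not part of the notebook; it also returns an empty string for values that cannot be parsed, such as the NaN float that pandas produces for an empty CSV cell.

```python
from ast import literal_eval

def parse_names(raw):
    """Turn a stringified list of {'id', 'name'} dicts into 'name name ...'.

    Hypothetical helper: anything that is not a string holding a list
    literal (for example NaN from a missing cell) yields an empty string.
    """
    if not isinstance(raw, str):
        return ""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return ""
    if not isinstance(parsed, list):
        return ""
    return " ".join(d["name"] for d in parsed if isinstance(d, dict) and "name" in d)

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
print(parse_names(raw))                 # Animation Comedy Family
print(repr(parse_names(float("nan"))))  # ''
```

Unlike `eval`, `literal_eval` only accepts Python literals (strings, numbers, lists, dicts, and so on), which makes it a safer choice for parsing data read from a file.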

In [6]:
movie_data.genres[0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [7]:
movie_data['genres'] = movie_data['genres'].apply(literal_eval)
movie_data['genres'] = movie_data['genres'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))
movie_data['genres']

0         Animation Comedy Family
1        Adventure Fantasy Family
2                  Romance Comedy
3            Comedy Drama Romance
4                          Comedy
                   ...           
32847             Science Fiction
32848        Drama Action Romance
32849       Action Drama Thriller
32850                            
32851                            
Name: genres, Length: 32852, dtype: object

In [8]:
movie_data['keywords'] = movie_data['keywords'].apply(literal_eval)
movie_data['keywords'] = movie_data['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))

movie_data.head()

Unnamed: 0,id,title,original_language,genres,keywords
0,862,Toy Story,en,Animation Comedy Family,jealousy toy boy friendship friends rivalry bo...
1,8844,Jumanji,en,Adventure Fantasy Family,board game disappearance based on children's b...
2,15602,Grumpier Old Men,en,Romance Comedy,fishing best friend duringcreditsstinger old men
3,31357,Waiting to Exhale,en,Comedy Drama Romance,based on novel interracial relationship single...
4,11862,Father of the Bride Part II,en,Comedy,baby midlife crisis confidence aging daughter ...


In [26]:
movie_data = movie_data[:20000]

#### Create TF-IDF movie matrix
Now, we are going to create the TF-IDF matrix of the movie dataset.  
We can apply TF-IDF with a single line of code thanks to the scikit-learn library!

In [15]:
tfidf_vector = TfidfVectorizer()
tfidf_matrix = tfidf_vector.fit_transform(movie_data['genres'] + " " + movie_data['keywords']).toarray()
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [10]:
tfidf_matrix.shape

(32852, 11437)

In [16]:
tfidf_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

#### Compute Cosine Similarity
Now, we are going to compute cosine similarity based on the above TF-IDF matrix.  
As you learned in the **Neighborhood-based Collaborative Filtering** lecture, cosine similarity can measure the similarity between two vectors.  
A row of the above TF-IDF matrix can be seen as **a vector indicating the features of the movie**, and we use the vector to measure the similarity between movies.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=18totRSZ2rFoZ7zRMHbjb6wW4Pv90dswO" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
</figure>

Cosine similarity is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
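
For two vectors $u$ and $v$, this is $\cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. A minimal plain-Python sketch of the definition follows; the function name `cosine` and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here for an all-zero vector,
        # e.g. a movie with no genres or keywords
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))  # same direction: close to 1
print(cosine([1, 0, 0], [0, 1, 0]))  # no shared features: 0
```

Because TF-IDF weights are never negative, the similarity between two movies always falls between 0 and 1, and the length of a vector (how many keywords a movie has) does not affect the result, only its direction.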

We can also compute cosine similarity with a single line of code, again thanks to scikit-learn.

In [23]:
cosine_sim = cosine_similarity(tfidf_matrix)

In [25]:
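import pickle

# protocol=4 (Python 3.4+) can serialize objects larger than 4 GiB. With the
# older default protocol this cell failed with
# "OverflowError: cannot serialize a bytes object larger than 4 GiB".
with open('cosine_sim.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f, protocol=4)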

In [20]:
import torch
import torch.nn.functional as F

normalized = F.normalize(torch.tensor(tfidf_matrix))
cosine_sim = torch.matmul(normalized, normalized.T).numpy()
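
A back-of-the-envelope estimate may explain why the full similarity matrix is hard to store and why the data was cut to 20,000 movies. The sketch assumes 8 bytes per value (float64, the default dtype of `TfidfVectorizer`); the function name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_items, bytes_per_value=8):
    """Size in GiB of a dense n_items x n_items matrix."""
    return n_items * n_items * bytes_per_value / 2**30

for n in (32852, 20000):
    print(n, round(dense_matrix_gib(n), 2))
# 32852 8.04
# 20000 2.98
```

With all 32,852 movies the matrix needs roughly 8 GiB, well above the 4 GiB limit that older pickle protocols impose on a single bytes object, while the 20,000-movie version stays just under 3 GiB. The cost grows with the square of the catalog size, so keeping only the top-k neighbors per movie is a common way to scale further.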

In [None]:
cosine_sim_df = pd.DataFrame(cosine_sim, index = movie_data.title, columns = movie_data.title)
print(cosine_sim_df.shape)
cosine_sim_df.head()

#### Build Recommendation Function
Now, we are going to build a simple recommendation function.

In [None]:
def get_recommendations(target_title, matrix, items, k=10):
    # Similarity of every movie to the target, sorted from most to least similar.
    # Position 0 is skipped on the assumption that it is the target itself
    # (similarity 1.0); this requires target_title to be unique in `matrix`.
    scores = matrix.loc[:, target_title].values
    recom_idx = scores.argsort()[::-1][1:k+1]

    recom_title = items.iloc[recom_idx, :].title.values
    recom_genre = items.iloc[recom_idx, :].genres.values
    n = len(recom_idx)  # may be smaller than k for a tiny catalog
    target_title_list = np.full(n, target_title)
    target_genre_list = np.full(n, items[items.title == target_title].genres.values[0])
    d = {
        'target_title': target_title_list,
        'target_genre': target_genre_list,
        'recom_title': recom_title,
        'recom_genre': recom_genre
    }

    return pd.DataFrame(d)

In [None]:
get_recommendations('Jumanji', cosine_sim_df, movie_data)
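
Skipping the first sorted position works as long as the target movie ranks first. If another movie has exactly the same features, its similarity is also 1.0 and the order between the two is not guaranteed. The sketch below shows one way to exclude the target by name instead; the nested-dict layout, the function name `top_k_similar`, and the scores are all assumptions made for illustration.

```python
import heapq

def top_k_similar(target, similarities, k=3):
    """Return the k titles most similar to `target`, never the target itself.

    `similarities` maps each title to a dict of {other_title: score}.
    """
    candidates = ((score, title)
                  for title, score in similarities[target].items()
                  if title != target)
    return [title for score, title in heapq.nlargest(k, candidates)]

# Hypothetical scores, not computed from the MovieLens data
similarities = {
    "A": {"A": 1.0, "B": 1.0, "C": 0.4, "D": 0.7},
}
print(top_k_similar("A", similarities, k=2))  # ['B', 'D']
```

Here "B" ties with the target at 1.0 but is still returned, because the filter removes the target by title rather than by rank.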

Let's check whether *Jumanji* and *The Games Maker* really have similar content!
<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=18VXzn0Ky306mbero4yIZ0AjvKGIr776p" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
</figure>

#### Pros and Cons of Content-based Filtering
- **Pros**
    - No need for other users' data : No item cold-start or sparsity problem
    - Able to recommend to users with unique tastes
    - Able to recommend new and unpopular items
    - Able to provide explanations
        - Can provide explanations of recommended items by listing content-features that caused an item to be recommended
        
- **Cons**
    - Requires content that can be encoded as meaningful features (difficult in some domains/catalogs)
    - Difficult to implement serendipity (Obvious recommendations)
    - Easy to overfit (e.g. for a user with few data points)
    - Effective for providing recommendations for new items, but not for new users
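
The explanation ability listed under the pros can be sketched in a few lines. When item vectors are L2-normalized, cosine similarity is the sum over shared terms of the product of their weights, so each product can be read as that term's contribution. The function name `explain`, the dict layout, and the weights below are hypothetical.

```python
def explain(target_weights, recommended_weights, top_n=3):
    """Rank shared terms by the product of their weights in both items.

    Each argument maps term -> TF-IDF weight for one item.
    """
    shared = set(target_weights) & set(recommended_weights)
    contributions = {t: target_weights[t] * recommended_weights[t] for t in shared}
    # Largest contribution first; ties broken alphabetically
    return sorted(contributions, key=lambda t: (-contributions[t], t))[:top_n]

# Hypothetical weights, not taken from the notebook's TF-IDF matrix
item_a = {"adventure": 0.2, "fantasy": 0.3, "board": 0.6, "game": 0.6}
item_b = {"adventure": 0.3, "family": 0.2, "board": 0.5, "game": 0.4}
print(explain(item_a, item_b))  # ['board', 'game', 'adventure']
```

A system could surface the returned terms to the user, for example "recommended because both are about: board, game, adventure".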
    
<span style="color:red">**Pure content-based systems are rarely found in commercial environments**</span>