In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd "/content/drive/My Drive/Colab Notebooks/CoE202/Content-based Filtering"

# Recommender System

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=18-785SB1H_aWDukQ8fzvi6nqoH2d2ZXK" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
  <figcaption>[Infographic by GO-Globe Web Design Company]</figcaption>
</figure>


### Why Recommendation?

- To identify things that we might like
- To help people discover new content
- To discover which things go together
- To personalize user experiences in response to user feedback

"The most powerful AI space for the next couple of decades is recommendation systems.  
They are going to have the biggest impact on our society because they **affect the information we see, how we learn, what we think, how we communicate.**  
These algorithms are controlling us ..."


## Content-based vs. Collaborative Filtering

- **Collaborative Filtering (CF)** (<span style="color:red">Only Ratings</span>)
    - Memory-based approach (Neighborhood-based CF)
    - Model-based approach
    
- **Side information-based Recommendation**
    - Content-based approach (<span style="color:red">Only Contents</span>)
    - Content-based CF (<span style="color:red">Rating + Contents</span>)
        - Text, Image, Social network ... 

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=197ni5M07eTb6hliZH3niu0YyP_4mfF6F" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
  <figcaption>[Picture from https://www.themarketingtechnologist.co/]</figcaption>
</figure>

## Content-based recommendation
**Goal** : Recommend items **similar** to those the user liked  
Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.

#### Example
- <span style="color:blue">Movie recommendations</span> : Recommend movies with same actor, director, genre
- <span style="color:blue">Websites, blogs, news</span> : Recommend other sites with "similar" content


#### When is it useful?
<span style="color:red">**Useful when ratings of other users are not available**</span>  

Example: John has rated the movie “Terminator” highly, but we do not have access to the ratings of other users.
- Therefore, we cannot use Collaborative Filtering (CF).
- The item description of “Terminator” contains genre keywords similar to those of other science fiction movies, such as “Matrix”.
- Therefore, “Matrix” can be recommended to John.

<span style="color:red">**Then, how can we find similar items?**</span>

#### Bag-of-Words (BoW)

**Bag-of-Words (BoW)** represents a document as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.  
The bag-of-words model is commonly used in methods of **document classification**, where the occurrence of each word is used as a feature for training a classifier.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=19JlenTX2aZVjRaR4LFwUBc41Vc7KWjAN" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
</figure>


In the simplest variant, each document is represented by a **binary vector** that marks which vocabulary words occur in it.


<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=19HEdheo7WX71eLuYYiEOKABeQ-v4o7nQ" width="50%" height="50%" title="recommender system" alt="recommender system"></img>
</figure>
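
To make the two representations concrete, here is a minimal sketch in plain Python. The three toy documents and the helper names `count_vector` and `binary_vector` are made up for illustration; they are not part of the lecture data.

```python
from collections import Counter

# Toy documents (hypothetical, for illustration only)
docs = [
    "the matrix is the best science fiction movie",
    "terminator is a science fiction action movie",
    "a romantic comedy movie",
]

# Vocabulary: every distinct word in the collection, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Bag-of-words with multiplicity: how often each vocabulary word occurs."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

def binary_vector(doc, vocab):
    """Binary variant: 1 if the vocabulary word occurs at all, else 0."""
    words = set(doc.split())
    return [1 if term in words else 0 for term in vocab]

print(vocab)
for doc in docs:
    print(count_vector(doc, vocab), binary_vector(doc, vocab))
```

Word order is lost in both vectors; the only difference is whether repeated words (here, "the" in the first document) are counted or merely flagged.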

Let's see how to use Bag-of-Words in Natural Language Processing.

In [1]:
!pip install nltk
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to /home/namkyeong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/namkyeong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/namkyeong/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
          "Look at it at least twice and definitely watch part 2.", 
          "It will change your view of the matrix",
          "Are the human people the ones who started the war?",
          "Is AI a bad thing?"]

In [3]:
stop_words = stopwords.words("english") # common words (e.g. 'the', 'is') that carry little meaning
lemmatizer = WordNetLemmatizer() # reduces each word to its dictionary form (lemma)

In [4]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
   
    # tokenize text
    tokens = word_tokenize(text)
   
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
   
    return tokens

In [5]:
word_tokenize(corpus[0])

['The',
 'first',
 'time',
 'you',
 'see',
 'The',
 'Second',
 'Renaissance',
 'it',
 'may',
 'look',
 'boring',
 '.']

In [6]:
tokenize(corpus[0])

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer = tokenize)

# get counts of each token(word) in text data
X = vect.fit_transform(corpus)

X.toarray()

But this model ignores the relative importance of words. For example, a word like 'may' appears in many documents, so matching on it can distort the similarity between documents.  
To overcome this problem, TF-IDF takes the relative importance of each term into account.

#### Term Frequency - Inverse Document Frequency (TF-IDF)
**Term Frequency (TF)** : The number of occurrences of a term in a document.  
**Document Frequency (DF)** : The number of documents in the collection that the term occurs in.  
We assume that rare terms are more informative than frequent terms.  
> **Inverse Document Frequency (IDF)** : A measure of the informativeness of the term, where $N$ is the total number of documents in the collection.  
${idf}_{t} = \log\frac{N}{df_t}$


**TF-IDF** is used to evaluate how important a word $t$ is to a document $d$ in a corpus of documents (the weight is 0 when the term does not occur in the document).   
${tf\text{-}idf}_{t,d} = (1 + \log{{tf}_{t,d}})\log{\frac{N}{df_t}}$
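
As a worked example of the formula above, the sketch below computes the weights by hand on three toy documents. The documents and the helper name `tf_idf` are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base of the logarithm.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical, for illustration only)
docs = [
    ["matrix", "science", "fiction", "movie"],
    ["terminator", "science", "fiction", "action", "action", "movie"],
    ["romantic", "comedy", "movie"],
]
N = len(docs)  # total number of documents

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Weight of `term` in `doc`: (1 + log tf) * log(N / df), natural log assumed."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0  # the term does not occur in this document
    return (1 + math.log(tf)) * math.log(N / df[term])

for term in ["movie", "science", "action"]:
    print(term, round(tf_idf(term, docs[1]), 3))
```

"movie" occurs in every document, so its IDF is $\log(3/3) = 0$ and it contributes nothing, while "action" is both rare in the collection and repeated in the second document, so it gets the largest weight. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this hand computation.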

## Content-based movie recommendation system
We are going to build a **content-based movie recommendation system**.  
The dataset contains metadata for all 45,000 movies listed in the Full MovieLens Dataset, all of them released on or before July 2017.  
- movies_metadata.csv
    - The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies
    
- keywords.csv
    - The file containing the movie plot keywords for our MovieLens movies.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Movie metadata
movie_data = pd.read_csv('./movies_metadata.csv')

print(movie_data.shape)
print(movie_data.columns)
movie_data.head()

In [None]:
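# Keep only English-language movies and the columns we need
# (.copy() avoids pandas' SettingWithCopyWarning when the id column is modified later)
movie_data = movie_data.loc[movie_data['original_language'] == 'en', :]
movie_data = movie_data[['id', 'title', 'original_language', 'genres']].copy()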

print(movie_data.shape)
movie_data.head()

In [None]:
# Movie keywords data
movie_keyword = pd.read_csv('./keywords.csv')

print(movie_keyword.shape)
movie_keyword.head()

In [None]:
# Merge metadata and keywords data
movie_data.id = movie_data.id.astype(int)
movie_keyword.id = movie_keyword.id.astype(int)
movie_data = pd.merge(movie_data, movie_keyword, on='id')

print(movie_data.shape)
movie_data.head()

Movie genres and keywords look like lists of dictionaries, but they are actually stored as plain strings. So we are going to convert each string into a real list of dictionaries with the **ast.literal_eval** function.
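
The conversion can be tried on a single value first. The string below is a made-up example written in the style of the `genres` column, not a row copied from the dataset.

```python
from ast import literal_eval

# Hypothetical raw value, shaped like an entry of the `genres` column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(type(raw))     # a plain string, not a list

# literal_eval safely parses Python literals (lists, dicts, numbers, strings)
parsed = literal_eval(raw)
print(type(parsed))  # now a real list of dictionaries

# Keep only the genre names and join them into one space-separated string
names = " ".join(d['name'] for d in parsed)
print(names)
```

Unlike `eval`, `literal_eval` only accepts literal structures and refuses arbitrary expressions, which is why it is the safer choice for data read from a file.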

In [None]:
movie_data.genres[0]

In [None]:
movie_data['genres'] = movie_data['genres'].apply(literal_eval)
movie_data['genres'] = movie_data['genres'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))
movie_data['genres']

In [None]:
movie_data['keywords'] = movie_data['keywords'].apply(literal_eval)
movie_data['keywords'] = movie_data['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))

movie_data.head()

#### Create TF-IDF movie matrix
Now, we are going to create the TF-IDF matrix of the movie dataset.  
We can apply TF-IDF with only a single line of code thanks to the scikit-learn library!

In [None]:
tfidf_vector = TfidfVectorizer()
tfidf_matrix = tfidf_vector.fit_transform(movie_data['genres'] + " " + movie_data['keywords']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()

In [None]:
tfidf_matrix.shape

In [None]:
tfidf_matrix = pd.DataFrame(tfidf_matrix, columns=tfidf_matrix_feature, index = movie_data.title)
tfidf_matrix.head()

#### Compute Cosine Similarity
Now, we are going to compute cosine similarity based on the above TF-IDF matrix.  
As you learned in the **Neighborhood-based Collaborative Filtering** lecture, cosine similarity can measure the similarity between two vectors.  
A row of the above TF-IDF matrix can be seen as **a vector indicating the features of the movie**, and we use the vector to measure the similarity between movies.

<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=18totRSZ2rFoZ7zRMHbjb6wW4Pv90dswO" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
</figure>

Cosine similarity is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
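
To make the definition explicit, here is a minimal hand-written sketch of the formula $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$. The function name `cosine` and the toy vectors are illustrative assumptions; returning 0 for an all-zero vector is a convention chosen here to avoid division by zero.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty feature vectors (assumption)
    return dot / (norm_u * norm_v)

print(cosine([1, 2], [2, 4]))        # same direction, different length
print(cosine([1, 0], [0, 1]))        # no shared features
print(cosine([1, 1, 0], [1, 0, 0]))  # partially overlapping features
```

Because only the direction matters, a movie with many keywords and a movie with few keywords can still be highly similar as long as their keyword weights are proportional.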

We can also compute cosine similarity with a single line of code, again thanks to scikit-learn.

In [None]:
cosine_sim = cosine_similarity(tfidf_matrix)

In [None]:
cosine_sim_df = pd.DataFrame(cosine_sim, index = movie_data.title, columns = movie_data.title)
print(cosine_sim_df.shape)
cosine_sim_df.head()

#### Build Recommendation Function
Now, we are going to build a simple recommendation function.

In [None]:
def get_recommendations(target_title, matrix, items, k=10):
    
    # Similarity between the target movie and every movie (assumes the title is unique)
    sim_scores = matrix.loc[:, target_title].values
    # Sort indices by descending similarity; skip the first one, assumed to be the target itself
    recom_idx = sim_scores.argsort()[::-1][1:k+1]
    recom_title = items.iloc[recom_idx, :].title.values
    recom_genre = items.iloc[recom_idx, :].genres.values
    # Repeat the target's title and genres so every output row is self-describing
    n = len(recom_idx)
    target_title_list = np.full(n, target_title)
    target_genre_list = np.full(n, items[items.title == target_title].genres.values[0])
    d = {
        'target_title':target_title_list,
        'target_genre':target_genre_list,
        'recom_title' : recom_title,
        'recom_genre' : recom_genre
    }
    
    return pd.DataFrame(d)
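
The ranking step inside `get_recommendations` can be illustrated on a tiny similarity matrix. The sketch below is a plain-Python variant under stated assumptions: the helper name `top_k_similar` and the 3x3 matrix are hypothetical, and the target is excluded by its index rather than by assuming it always sorts first, which also behaves sensibly when another movie ties with a similarity of 1.

```python
def top_k_similar(sim_row, target_idx, k):
    """Indices of the k most similar items, excluding the target itself."""
    ranked = sorted(range(len(sim_row)), key=lambda i: sim_row[i], reverse=True)
    return [i for i in ranked if i != target_idx][:k]

# Hypothetical similarity matrix for three movies (symmetric, 1.0 on the diagonal)
sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

print(top_k_similar(sim[0], target_idx=0, k=2))
```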

In [None]:
get_recommendations('Jumanji', cosine_sim_df, movie_data)

Let's check whether *Jumanji* and *The Games Maker* really have similar content!
<figure class="image">
  <img src="https://drive.google.com/uc?export=view&id=18VXzn0Ky306mbero4yIZ0AjvKGIr776p" width="40%" height="40%" title="recommender system" alt="recommender system"></img>
</figure>

#### Pros and Cons of Content-based Filtering
- **Pros**
    - No need for other users' data : No item cold-start or sparsity problems
    - Able to recommend to users with unique tastes
    - Able to recommend new and unpopular items
    - Able to provide explanations
        - Can provide explanations of recommended items by listing content-features that caused an item to be recommended
        
- **Cons**
    - Requires content that can be encoded as meaningful features (difficult in some domains/catalogs)
    - Hard to achieve serendipity (recommendations tend to be obvious)
    - Easy to overfit (e.g. for a user with few data points)
    - Effective for providing recommendations for new items, but not for new users
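
The "explanations" advantage listed under Pros can be sketched as follows. When two feature vectors are L2-normalized, their cosine similarity is the sum of per-feature products, so the largest products point to the features that drove the recommendation. The helper name `explain` and the feature weights below are hypothetical and chosen only for illustration.

```python
def explain(target_vec, recom_vec, top_n=3):
    """Rank the features shared by both items by the product of their weights."""
    shared = {
        feature: target_vec[feature] * recom_vec[feature]
        for feature in target_vec.keys() & recom_vec.keys()
    }
    return sorted(shared.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical TF-IDF weights for two movies (feature -> weight)
movie_a = {"adventure": 0.3, "fantasy": 0.3, "family": 0.2, "board": 0.6, "game": 0.5}
movie_b = {"adventure": 0.4, "game": 0.6, "inventor": 0.5, "family": 0.3}

for feature, contribution in explain(movie_a, movie_b):
    print(f"{feature}: {contribution:.2f}")
```

A system could then show the user a message such as "recommended because both movies are about: game, adventure, family".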
    
<span style="color:red">**Pure content-based systems are rarely found in commercial environments**</span>