# Content-based recommendation system

## Introduction

Many websites now let users rate items. Companies such as Amazon, Netflix, YouTube, IMDB and Bol.com use this information to recommend similar items to their users. The MovieLens dataset is a free collection of movie ratings.

In this document I will go through the process of creating a content-based recommendation system.


## Preparation

First we need to load everything we will work with: the movies dataset, the list of English stop words, and the English stemmer.

In [1]:
from IPython.core.display import HTML
from movie_display import movie_display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import pandas as pd
import numpy as np
import itertools
import re

# Load movies into a dataframe
movies_df = pd.read_json('./dataset/imdbdata.json', orient='columns')

# Have a quick look at the structure and contents
display(movies_df.head())
for col in movies_df.columns:
    display(movies_df[col].describe())

# Only keep the name of usable features
unwanted = ['Poster', 'imdbId', 'imdbRating', 'imdbVotes']
filtered_features = [e for e in movies_df.columns if e not in unwanted]

# Get english stopwords
stopwords = stopwords.words('english')

# Get english based stemmer
stemmer = SnowballStemmer('english')

| | Actors | Awards | Country | Director | Genre | Language | Plot | Poster | Production | Rated | Released | Runtime | Title | Writer | Year | imdbId | imdbRating | imdbVotes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tom Hanks, Tim Allen, Don Rickles, Jim Varney | Nominated for 3 Oscars. Another 23 wins & 18 n... | USA | John Lasseter | Animation, Adventure, Comedy | English | A cowboy doll is profoundly threatened and jea... | https://images-na.ssl-images-amazon.com/images... | Buena Vista | G | 22 Nov 1995 | 81 min | Toy Story | John Lasseter (original story by), Pete Docter... | 1995 | 114709 | 8.3 | 666855 |
| 1 | Robin Williams, Jonathan Hyde, Kirsten Dunst, ... | 4 wins & 9 nominations. | USA | Joe Johnston | Action, Adventure, Family | English, French | When two kids find and play a magical board ga... | https://images-na.ssl-images-amazon.com/images... | Sony Pictures Home Entertainment | PG | 15 Dec 1995 | 104 min | Jumanji | Jonathan Hensleigh (screenplay), Greg Taylor (... | 1995 | 113497 | 6.9 | 223000 |
| 2 | Walter Matthau, Jack Lemmon, Sophia Loren, Ann... | 2 wins & 2 nominations. | USA | Howard Deutch | Comedy, Romance | English | John and Max resolve to save their beloved bai... | https://images-na.ssl-images-amazon.com/images... | Warner Home Video | PG-13 | 22 Dec 1995 | 101 min | Grumpier Old Men | Mark Steven Johnson (characters), Mark Steven ... | 1995 | 113228 | 6.6 | 20100 |
| 3 | Whitney Houston, Angela Bassett, Loretta Devin... | 8 wins & 8 nominations. | USA | Forest Whitaker | Comedy, Drama, Romance | English | Based on Terry McMillan's novel, this film fol... | https://images-na.ssl-images-amazon.com/images... | Twentieth Century Fox Home Entertainment | R | 22 Dec 1995 | 124 min | Waiting to Exhale | Terry McMillan (novel), Terry McMillan (screen... | 1995 | 114885 | 5.7 | 7769 |
| 4 | Steve Martin, Diane Keaton, Martin Short, Kimb... | Nominated for 1 Golden Globe. Another 1 win & ... | USA | Charles Shyer | Comedy, Family, Romance | English | George Banks must deal not only with the pregn... | https://images-na.ssl-images-amazon.com/images... | Disney | PG | 08 Dec 1995 | 106 min | Father of the Bride Part II | Albert Hackett (screenplay), Frances Goodrich ... | 1995 | 113041 | 5.9 | 27815 |


count     9125
unique    9006
top        N/A
freq        23
Name: Actors, dtype: object

count     9125
unique    2309
top        N/A
freq      1640
Name: Awards, dtype: object

count     9125
unique     845
top        USA
freq      5411
Name: Country, dtype: object

count     9125
unique    3835
top        N/A
freq        55
Name: Director, dtype: object

count      9125
unique      642
top       Drama
freq        518
Name: Genre, dtype: object

count        9125
unique        991
top       English
freq         5697
Name: Language, dtype: object

count     9125
unique    9101
top        N/A
freq        19
Name: Plot, dtype: object

count     9125
unique    9109
top        N/A
freq        15
Name: Poster, dtype: object

count     9057
unique    1287
top        N/A
freq       482
Name: Production, dtype: object

count     9125
unique      18
top          R
freq      3519
Name: Rated, dtype: object

count     9125
unique    4826
top        N/A
freq        65
Name: Released, dtype: object

count       9125
unique       231
top       90 min
freq         268
Name: Runtime, dtype: object

count       9125
unique      8846
top       Hamlet
freq           6
Name: Title, dtype: object

count     9125
unique    7863
top        N/A
freq       275
Name: Writer, dtype: object

count     9125
unique     124
top       1998
freq       274
Name: Year, dtype: object

count    9.125000e+03
mean     4.796230e+05
std      7.426405e+05
min      4.170000e+02
25%      8.884600e+04
50%      1.197780e+05
75%      4.284410e+05
max      5.794766e+06
Name: imdbId, dtype: float64

count     9125
unique      79
top        7.2
freq       428
Name: imdbRating, dtype: object

count      9125
unique     8335
top       2,153
freq          4
Name: imdbVotes, dtype: object

## Text Learning

Our objective is now to prepare the movie dataset for the recommendation system. This is done by:

1. Removing English stop words
2. Stemming the remaining words
3. Creating Tf-Idf weighted tables for the words
4. Generating the top N matches

Since we have multiple attributes to handle (Plot, Actors, Genre, Title) we need to process them separately and then merge their individual recommendations.

*Side note: at the end of the preprocessing, the list of words is rebuilt into a single string. This is intentional: `TfidfVectorizer` takes a collection of strings (here, one per row) and splits each of them into words itself.*
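The first two steps can be tried on a single sentence without NLTK. The sketch below is only an illustration: `TOY_STOPWORDS` and `toy_stem` are hypothetical stand-ins for the NLTK stop-word list and the `SnowballStemmer`, and the suffix stripping is deliberately crude rather than the real Snowball algorithm.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"a", "and", "is", "of", "the", "by", "when"}

def toy_stem(word):
    """Crude suffix stripper standing in for SnowballStemmer (not the real algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_preprocess(text):
    # 1. Split on non-word characters and lowercase
    words = re.sub(r"[^\w]", " ", str(text)).lower().split()
    # 2. Drop the stop words
    kept = [w for w in words if w not in TOY_STOPWORDS]
    # 3. Stem and re-join into one string, ready for a vectorizer
    return " ".join(toy_stem(w) for w in kept)

print(toy_preprocess("A cowboy doll is threatened by the toys"))
# prints: cowboy doll threaten toy
```

The output is a plain string again, which is the form the vectorizer expects for each row.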

In [2]:
def preprocess(feature_row):
    
    # Make sure that the row is a string
    feature_row = str(feature_row)
    
    # Transform the text into a list of words (the raw string avoids an invalid-escape warning)
    word_list_sw = re.sub(r"[^\w]", " ", feature_row).split()
    
    # Then filter out the stop words (the NLTK list is lowercase, so compare in lowercase)
    words_filtered = [word for word in word_list_sw if word.lower() not in stopwords]
    
    # Compute the stems of the remaining words
    word_list_stemmed = [stemmer.stem(word) for word in words_filtered]
    
    return ' '.join(word_list_stemmed)

def generate_Tfidf(feature_name):
    
    vectorizer = TfidfVectorizer()

    # Apply the preprocessing to each row of the selected column
    df_preprocessed = movies_df[feature_name].apply(preprocess)
    
    return vectorizer.fit_transform(df_preprocessed)

def generate_NN(tfidf):
    # Nearest Neighbors learner based on the cosine distance
    cosineNN = NearestNeighbors(metric='cosine')
    return cosineNN.fit(tfidf)

We now have the functions we need to generate our `Tf-Idf` vectors and Nearest Neighbors learners.
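To make concrete what a `Tf-Idf` vector holds, here is a pure-Python sketch on three toy documents. It assumes the smoothed idf formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; `toy_tfidf` is a hypothetical helper and does not reproduce the vectorizer's tokenisation.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2 normalisation (an empty document keeps an empty vector)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = ["toy cowboy doll", "toy board game", "cowboy western"]
for doc, vec in zip(docs, toy_tfidf(docs)):
    print(doc, {t: round(w, 2) for t, w in vec.items()})
```

In the first document, `doll` appears in no other document and therefore gets a larger weight than `toy` or `cowboy`, which each appear in two documents.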

To speed up the movie recommendation, we precompute all the tables here, so that the `Tf-Idf` vectors and Nearest Neighbors learners do not have to be rebuilt on every request.

In [3]:
# Create the Tf-Idf vectors for all the selected features
tfIdf_dict = {}
for f in filtered_features:
    print(f, 'parsed')
    tfIdf_dict[f] = generate_Tfidf(f)

Actors parsed
Awards parsed
Country parsed
Director parsed
Genre parsed
Language parsed
Plot parsed
Production parsed
Rated parsed
Released parsed
Runtime parsed
Title parsed
Writer parsed
Year parsed


In [4]:
# Create the Nearest Neighbors for all the selected features
nn_dict = {}
for f in filtered_features:
    print(f, 'NN generated')
    nn_dict[f] = generate_NN(tfIdf_dict[f])

Actors NN generated
Awards NN generated
Country NN generated
Director NN generated
Genre NN generated
Language NN generated
Plot NN generated
Production NN generated
Rated NN generated
Released NN generated
Runtime NN generated
Title NN generated
Writer NN generated
Year NN generated


## Recommendation

Now that the `Tf-Idf` tables and Nearest Neighbors learners are ready for each feature, we can look up the nearest neighbors of a movie for each of them using the cosine metric.
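The cosine metric is one minus the cosine similarity of two vectors, so it depends on their direction and not on their length. A minimal sketch on sparse vectors stored as dictionaries (the `cosine_distance` helper and the toy vectors are illustrative, not part of the notebook):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_u * norm_v)

a = {"toy": 1.0, "cowboy": 1.0}
b = {"toy": 2.0, "cowboy": 2.0}   # same direction as a, different length
c = {"board": 1.0, "game": 1.0}   # no term in common with a

print(cosine_distance(a, b))  # (numerically) zero: identical direction
print(cosine_distance(a, c))  # 1.0: nothing in common
```

Two movies whose preprocessed text is identical therefore end up at distance 0, whatever the length of that text.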

In [5]:
def compute_recommendation(nearestNeighbors, tfidf, movie_index, nb_recommendation):
    
    # Get the K best neighbors for the given movie (+1 because the movie itself is usually among them)
    dist, indices = nearestNeighbors.kneighbors(tfidf[movie_index], n_neighbors=nb_recommendation+1)

    # Mask out the current movie from the results
    keep = indices[0] != movie_index
    
    # With ties at distance 0 the movie itself may be missing, so also cap the results at K entries
    dist = dist[0][keep][:nb_recommendation]
    indices = indices[0][keep][:nb_recommendation]
    
    return dist, indices

We now have a way to compute the indices and distances of the recommendations for a given feature of a given movie.

Next, we want to merge the recommendations obtained from different features. One solution is to take the top K for each feature together with their distances, merge them all into one big list of (index, distance) pairs, and make a final top K selection on that list.
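The merge described above can be sketched on made-up numbers with the standard library alone. The feature names are taken from the dataset, but the distances and indices are invented, and `merge_top_k` is a hypothetical helper rather than the notebook's function.

```python
import heapq

def merge_top_k(per_feature, k):
    """per_feature maps a feature name to a list of (distance, movie_index) pairs."""
    pooled = [pair for pairs in per_feature.values() for pair in pairs]
    # Tuples compare by distance first, then by index, which breaks ties deterministically
    return heapq.nsmallest(k, pooled)

# Made-up distances and indices, for illustration only
per_feature = {
    "Plot":   [(0.62, 14), (0.70, 3), (0.71, 27)],
    "Actors": [(0.35, 27), (0.80, 9), (0.85, 14)],
}
print(merge_top_k(per_feature, 3))
# prints: [(0.35, 27), (0.62, 14), (0.7, 3)]
```

Note that the same movie (here 14 and 27) can appear in several per-feature lists, so it may show up more than once in the pooled list unless duplicates are removed.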

In [6]:
def get_recommended_movies(index, features, nb_recommendation):
    
    all_dist = []
    all_indices = []
    for f in features:
        tf_idf = tfIdf_dict[f]
        nearest_neighbors = nn_dict[f]
        dist, indices = compute_recommendation(nearest_neighbors, tf_idf, index, nb_recommendation)
        #print(f, ': ')
        #print(' dist', dist)
        #print(' indices', indices)
        all_dist.extend(dist)
        all_indices.extend(indices)
    
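    df = pd.DataFrame({'Indices': all_indices, 'Distances': all_dist})
    # Sort the values by Distances and then by Indices (in case of equality)
    df = df.sort_values(by=['Distances', 'Indices'], ascending=True)
    # A movie can be returned by several features: keep only its smallest distance
    df = df.drop_duplicates(subset='Indices')[:nb_recommendation]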
    
    return df

In the second step, we need to make recommendations from a list of movies. To do so, we generate the K best recommendations for each movie, put them all in one table, sort them, and keep only the top K.

This final list thus represents the best recommendations for the whole list of movies. Recommendations for a single movie are obtained by passing only that movie to the function.

*Note: this function takes the list of movies as a single comma-separated string: `movie1, movie2, movie3`*
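Turning such a string into a clean list of titles could look like the sketch below. `parse_titles` is a hypothetical helper; it tolerates spaces and a trailing comma, and the second call shows the limit of this input format.

```python
def parse_titles(movie_list):
    """Split 'movie1, movie2,' into clean titles, ignoring blanks and a trailing comma."""
    return [title.strip() for title in movie_list.split(",") if title.strip()]

print(parse_titles("Toy Story, Jumanji,"))
# prints: ['Toy Story', 'Jumanji']

# A title that itself contains a comma is split as well
print(parse_titles("Monsters, Inc."))
# prints: ['Monsters', 'Inc.']
```

If titles containing commas need to be supported, passing a Python list of titles instead of one string would avoid the ambiguity.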

In [7]:
def recommendations(movie_list, features, nb_recommendation):
    
    # Clean the movie_list: split on commas and drop blanks (e.g. after a trailing comma)
    # Note: a title that itself contains a comma cannot be passed this way
    titles = [t.strip() for t in movie_list.split(',') if t.strip()]
    
    # Get the indices of the movies in the list, skipping unknown titles
    index_list = []
    for movie_title in titles:
        matches = movies_df.index[movies_df['Title'] == movie_title].tolist()
        if matches:
            # Several movies can share a title: keep the first match
            index_list.append(matches[0])
    
    # Get the scores of the movies
    recom_frames = []
    for movie_idx in index_list:
        # Get the recommended movies for one movie and collect them
        recom_frames.append(get_recommended_movies(movie_idx, features, nb_recommendation))

    # Nothing to recommend if no title was recognized
    if not recom_frames:
        return pd.Series([], name='Indices', dtype=int)

    # Merge the recommendations of all the movies (the original code kept only the last one)
    recom_movies = pd.concat(recom_frames, join="inner")

    # Sort them by their scores (here the Distances) and then by their Indices in case of equality
    recom_movies = recom_movies.sort_values(by=['Distances', 'Indices'], ascending=True)
    # Retrieve the best N recommendations, and keep their indices only
    recom_movies = recom_movies['Indices'][:nb_recommendation]
    
    return recom_movies

## Testing the recommendation

Now that we have been through all the hard work, it is time to put our efforts to the test!

To make the program more flexible, the features used in the score computation can be selected. Features with few distinct values (Genre, Country, etc.) have a sorting issue: many movies share exactly the same value, so their distances are all 0, and there is no telling which of them `sklearn.neighbors.NearestNeighbors` will return first.
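The size of the problem can be seen from the `describe()` output earlier, which reports 518 movies whose `Genre` is exactly `Drama`: all of them get the same vector and hence distance 0 to each other. The sketch below reproduces the situation on made-up values and shows one possible deterministic tie-break (lowest index first); both helpers are hypothetical.

```python
# Made-up Genre values; index 1 is the query movie
genres = ["Comedy, Romance", "Drama", "Drama", "Action, Adventure, Family", "Drama", "Drama"]

def zero_distance_ties(values, query):
    """Indices of the other rows whose value is identical to the query row's value."""
    return [i for i, v in enumerate(values) if i != query and v == values[query]]

def pick_top_k(tied, k):
    # Deterministic tie-break: lowest index first
    return sorted(tied)[:k]

tied = zero_distance_ties(genres, query=1)
print(tied)                 # prints: [2, 4, 5]
print(pick_top_k(tied, 2))  # prints: [2, 4]
```

With three equally good candidates and only two slots, which ones are kept depends entirely on the tie-break rule, not on the content of the movies.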

We define here a `plot` function that displays the recommended movies for the given list of movies and selection of features.

### One movie recommendation



In [8]:
def plot(movie_list, nb_recommendation, features):
    print('Selected movies:', movie_list)
    print('Selected features:', features)
    movie_ids = recommendations(movie_list, features, nb_recommendation)
    movie_plot = []
    for index in movie_ids:
        movie_plot.append(movies_df.iloc[index])
    return HTML(movie_display.show(movie_plot))

In [9]:
movies = interact(
    plot,
    movie_list=widgets.Dropdown(options=movies_df['Title'].sort_values(), value='Toy Story', description='Movies:'),
    nb_recommendation=widgets.IntSlider(min=1, max=10, step=1, value=5),
    features=widgets.SelectMultiple(options = filtered_features, description='Features', value=['Plot', 'Title', 'Writer', 'Actors'], disabled = False)
)

interactive(children=(Dropdown(description='Movies:', index=8479, options=('$9.99', "'Hellboy': The Seeds of C…

### Multiple movie recommendation

Here we can get recommendations based on multiple movies. Click on a movie and then on 'Select Movie' to add it to the list. Once some movies have been added, click on 'Display'.

In [10]:
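# Label widget holding the selected movies as one comma-separated string
text = widgets.Label()
def btn_select_clicked(b):
    # Append currently selected movie to list (if not already there)
    # Compare whole titles: a substring test would reject 'Toy Story' once 'Toy Story 2' is listed
    if mov.value not in text.value.split(','):
        text.value = text.value+mov.value+','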
    
def getRecs(b):
    rec = interact(
        plot,
        movie_list=text.value,
        nb_recommendation=widgets.IntSlider(min=1,max=20,step=1,value=5),
        features=widgets.SelectMultiple(options = filtered_features, description='Features', value=['Plot', 'Title', 'Writer', 'Actors'], disabled = False)
    )

#button for selecting movies
btn_select = widgets.Button(description="Select Movie")
#button for displaying movies
btn_display = widgets.Button(description="Display")
#when button 'Select Movie' is clicked the 'btn_select_clicked' function is called
btn_select.on_click(btn_select_clicked)
#when button 'btn_display' is clicked the 'getRecs' function is called
btn_display.on_click(getRecs)
#assigning a selected movie from a Select widget
mov = widgets.Select(options=movies_df['Title'].sort_values(), value='Toy Story')
#put all widgets in one box
one_box = widgets.VBox([mov,text, btn_select, btn_display])
#display the box
display(one_box)

VBox(children=(Select(index=8479, options=('$9.99', "'Hellboy': The Seeds of Creation", "'I Know Where I'm Goi…

## Conclusion

Through this notebook we learned how to create a content-based recommendation system, commenting on the implementation choices made along the way.