# 04 - Making a Hybrid Movie Recommender:
#### Amin Khoeini

***

The purpose of this project is to take a review from a user and, if that review is positive, recommend movies to that user. We need to build a recommendation engine that takes a user ID and a movie title and returns a list of recommended movies for that user.

To that end, we are going to build a hybrid filter. Building a hybrid recommender consists of two steps:


- The first step is to build a __content-based filter__ that picks a list of movies close to the target movie.
    * 1 - We need to build metadata for the movies available in the database.
    * 2 - We use the year, director, actors, genre and description of each movie as its metadata. To do so, we remove unwanted characters from those fields, remove the stop words from the description, and stem the description.
    * 3 - We concatenate these columns into a single "soup" metadata column. The director column is added twice to give it more weight.
    * 4 - Then we apply sklearn's CountVectorizer to that soup.
    * 5 - The final step is to compute sklearn's cosine similarity on the count vectors. This gives us a matrix of movie similarities based on their description, genre, director and actors.
    * 6 - With that matrix, our custom content_recommender function can __pick the 30 films that are closest to the target film__.

- The second step is to build a __collaborative filter__ using the Surprise library. It predicts the target user's scores for the 30 movies produced by the previous filter. Finally, we sort that list by estimated score and pick the top 10 as the final recommendation.
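
Steps 3 to 5 of the content-based filter can be illustrated with a minimal, standard-library sketch. It is not the notebook's sklearn pipeline: the `soup` and `cosine` helpers and the toy records are hypothetical, and the counts are plain unigram counts. The point is only to show why repeating the director token raises the similarity between films by the same director.

```python
import math
from collections import Counter

def soup(year, director, genre, description, actors, director_weight=2):
    """Join the metadata fields into one string; repeating the director adds weight."""
    return " ".join([year] + [director] * director_weight + [genre, description, actors])

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy records, not rows from the real dataset
target = soup("2003", "director_a", "action crime", "bride reveng", "actor_x actor_y")
same_director = soup("1994", "director_a", "crime drama", "hitman boxer", "actor_z actor_w")
same_genre = soup("1999", "director_b", "action crime", "heist crew", "actor_q actor_r")

sim_director = cosine(target, same_director)
sim_genre = cosine(target, same_genre)
print(round(sim_director, 3), round(sim_genre, 3))
```

In this toy case the film that shares only the director (and one genre token) scores 5/11, while the film that shares both genre tokens but not the director scores 2/11.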

## Content Based Filter:

In [3]:
import pandas as pd
import numpy as np

from nltk.stem import SnowballStemmer
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
movies = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/imdb_subset.csv')

In [5]:
movies.head()

,year,genre,director,actors,description,Title,imdb_id
0,1915,"Drama, History, War",D.W. Griffith,"Henry B. Walthall, Lillian Gish, Mae Marsh, Mi...",The Stoneman family finds its friendship with ...,The Birth of a Nation (1915),tt0004972
1,1920,"Fantasy, Horror, Mystery",Robert Wiene,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...","Hypnotist Dr. Caligari uses a somnambulist, Ce...",The Cabinet of Dr. Caligari (1920),tt0010323
2,1921,"Comedy, Drama, Family",Charles Chaplin,"Carl Miller, Edna Purviance, Jackie Coogan, Ch...","The Tramp cares for an abandoned child, but ev...",The Kid (1921),tt0012349
3,1922,"Fantasy, Horror",F.W. Murnau,"Max Schreck, Gustav von Wangenheim, Greta Schr...",Vampire Count Orlok expresses interest in a ne...,Nosferatu (1922),tt0013442
4,1923,"Action, Comedy, Thriller","Fred C. Newmeyer, Sam Taylor","Harold Lloyd, Mildred Davis, Bill Strother, No...",A boy leaves his small country town and heads ...,Safety Last! (1923),tt0014429


In [6]:
movies.dtypes

year            int64
genre          object
director       object
actors         object
description    object
Title          object
imdb_id        object
dtype: object

First we need to convert all the features to strings and process them: lower-case everything, remove the stop words from the description, remove the commas from the genre, director and actors columns, and concatenate all of them.
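
As a rough illustration of what this cleaning does to a single record, here is a small pure-Python sketch. The helper names, the tiny `STOP_WORDS` set and the sample sentence are assumptions made for the example; the notebook itself works on pandas columns and uses NLTK's English stop-word list, and stemming is left out here.

```python
# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "its", "with", "and", "of"}

def clean_list_field(text):
    """Lower-case a comma-separated field and turn the commas into spaces."""
    return " ".join(part.strip().lower() for part in text.split(","))

def clean_description(text):
    """Lower-case, drop commas, and remove stop words from a description."""
    words = text.lower().replace(",", " ").split()
    return " ".join(w for w in words if w not in STOP_WORDS)

genre = clean_list_field("Drama, History, War")
description = clean_description("A boy leaves the town, and heads for the city")

print(genre)        # drama history war
print(description)  # boy leaves town heads for city
```

Unlike the cells below, this sketch also collapses the double spaces left behind by the removed commas; that does not matter for the vectorizer, which splits on whitespace anyway.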

In [7]:
# Remove the empty space at the end of the movie title
movies['Title'] = movies['Title'].str.strip()

In [8]:
# Change the data type to string
movies[['year', 'genre','director','actors','description']] = movies[['year', 'genre','director','actors','description']].astype('string')

In [9]:
movies.dtypes

year           string
genre          string
director       string
actors         string
description    string
Title          object
imdb_id        object
dtype: object

In [10]:
# Some movie fields are NaN, so we fill those with an empty string
movies = movies.fillna('')

In [11]:
# Remove the commas from the genre column and lower-case it
movies['genre'] = movies['genre'].str.replace(","," ")
movies['genre'] = movies['genre'].str.lower()

In [12]:
# Remove the commas from the director column (movies with more than one director) and lower-case it
movies['director'] = movies['director'].str.replace(","," ")
movies['director'] = movies['director'].str.lower()

In [13]:
# Do the same for the actors column
movies['actors'] = movies['actors'].str.replace(","," ")
movies['actors'] = movies['actors'].str.lower()

In [14]:
# Lower-case the description and remove the commas from it
movies['description'] = movies['description'].str.lower()
movies['description'] = movies['description'].str.replace(","," ")

In [15]:
# Remove the stop words from the description; the next cells stem it to make it better suited to the vectorizer

stop = nltk.corpus.stopwords.words('english')

movies['description'] = movies['description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [16]:
stemmer = SnowballStemmer('english')
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
def stemmer_text(text):
    return ' '.join([stemmer.stem(w) for w in w_tokenizer.tokenize(text)])

In [17]:
movies['description'] = movies['description'].apply(stemmer_text)

In [18]:
# Make a soup from all the columns; give the director name more weight by adding it twice to the soup
movies['soup'] = movies['year'] +' '+ movies ['director'] + ' '+ movies ['director'] +' '+ movies['genre'] + ' '+movies ['description'] +' '+ movies ['actors']

In [19]:
movies.soup[0]

'1915 d.w. griffith d.w. griffith drama  history  war stoneman famili find friendship cameron affect civil war fight opposit armies. develop war live play lincoln assassin birth ku klux klan. henry b. walthall  lillian gish  mae marsh  miriam cooper  mary alden  ralph lewis  george siegmann  walter long  robert harron  wallace reid  joseph henabery  elmer clifton  josephine crowell  spottiswoode aitken  george beranger'

In [20]:
# Use CountVectorizer on the soup column
# (min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0)

count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(movies['soup'])

In [21]:
# Build a cosine similarity matrix from the count vectors so we can measure how similar the movies are to each other based on their content
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [22]:
# Create a dataframe of that similarity matrix
movie_sim = pd.DataFrame(cosine_sim, index=movies.Title, columns=movies.Title)

In [23]:
def content_recommender(movie_title):
    ''' This function uses the similarity matrix built from the count vectors of the movie content
     and returns a list of the 30 movies that are most similar to the target movie'''
    
    # Select the similarity scores of the target movie
    cosine_similarity_series = movie_sim.loc[movie_title]
    
    # Sort these values from highest to lowest and pick the first 30 movies,
    # skipping position 0, which is the target movie itself
    ordered_similarities = cosine_similarity_series.sort_values(ascending=False)[1:31]
    return ordered_similarities.index.tolist()

In [24]:
# Get the recommendations for Superman IV: The Quest for Peace. The director name carries a lot of weight in the soup,
# so films by the same director tend to rank high in the list.
content_recommender('Superman IV: The Quest for Peace (1987)')

['Superman III (1983)',
 'Superman II (1980)',
 'Iron Man 2 (2010)',
 'The Ipcress File (1965)',
 'The Entity (1982)',
 'Batman v Superman: Dawn of Justice (2016)',
 'The Amazing Spider-Man 2 (2014)',
 'Vice (2015)',
 '2001: A Space Odyssey (1968)',
 'Flash Gordon (1980)',
 'Iron Man Three (2013)',
 'The Core (2003)',
 'Iron Man (2008)',
 'The Avengers (1998)',
 'Terminator 3: Rise of the Machines (2003)',
 'Captain America: Civil War (2016)',
 'Outlander (2008)',
 'The 6th Day (2000)',
 'The Last Man on Earth (1964)',
 'Dead Man (1995)',
 'Raiders of the Lost Ark (1981)',
 'Mission to Mars (2000)',
 'Spider-Man (2002)',
 'War of the Worlds (2005)',
 'Rumble in the Bronx (1995)',
 'Predator (1987)',
 'Hot Shots! (1991)',
 'Star Trek VI: The Undiscovered Country (1991)',
 'Battlefield Earth (2000)',
 'Hardware (1990)']

## Collaborative Filter by Surprise:

The content-based filter can only suggest movies based on their content; it cannot capture the tastes of the viewer or how they might rate a certain film. With this filter, all users who like a certain movie get the same recommendations.

To enhance our recommendations, we will use collaborative filtering. Collaborative filtering is based on the idea that the ratings of users who are similar to each other can be used to predict how much a given user will like a particular movie.
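
To make that idea concrete, the following is a minimal user-based sketch in plain Python. The toy `ratings` dictionary and both helper functions are hypothetical, and the similarity is a simple cosine over co-rated movies. Surprise's KNN algorithms differ in the details (for example, the training log further down shows an MSD similarity), so this is an illustration of the principle rather than of the library.

```python
import math

# Hypothetical ratings on a 1-10 scale: user -> {movie: rating}
ratings = {
    "ann": {"m1": 9, "m2": 8, "m3": 2},
    "bob": {"m1": 8, "m2": 9, "m3": 3, "m4": 9},
    "cat": {"m1": 2, "m2": 3, "m3": 9, "m4": 2},
}

def similarity(u, v):
    """Cosine similarity over the movies both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    nu = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    nv = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (nu * nv)

def predict(user, movie):
    """Similarity-weighted average of the ratings other users gave the movie."""
    neighbors = [(similarity(user, v), r[movie])
                 for v, r in ratings.items() if v != user and movie in r]
    total = sum(s for s, _ in neighbors)
    return sum(s * r for s, r in neighbors) / total if total else None

estimate = predict("ann", "m4")
print(estimate)
```

Because "ann" rates movies much more like "bob" than like "cat", the estimate for "m4" is pulled toward bob's rating of 9 rather than cat's rating of 2.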

The Surprise library provides a range of rating-prediction algorithms, which we can compare by their RMSE (root mean square error).

As a first step, we try most of these algorithms on our dataset to see which one has the lowest RMSE.

In [25]:
from surprise import SVD,SVDpp,SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic,KNNWithMeans, KNNWithZScore, BaselineOnly
from surprise.model_selection import cross_validate,GridSearchCV,train_test_split
import surprise
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
import csv

In [26]:
# import the rating data 
ratings = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/process_db.csv')

In [27]:
# Drop the review and only keep the user_id,movie_id and vote number
ratings.drop(columns=['review_detail','review_clean','lable','review_date','reviewer'],inplace=True)

In [28]:
# Create the Reader and Dataset instances that Surprise needs to build a model

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings[['User_ID', 'imdb_id', 'rating']], reader)

Now that everything is ready to train a Surprise model, it is time to use cross-validation to measure the performance of each Surprise algorithm on our dataset.
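
For readers unfamiliar with the metric, here is a small standard-library sketch of RMSE and of 3-fold cross-validation. It scores a deliberately naive global-mean predictor on hypothetical ratings; the function names and data are assumptions for illustration, and Surprise's `cross_validate` handles the splitting and scoring itself.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_validated_rmse(values, n_folds=3):
    """Score a global-mean predictor with k-fold cross-validation."""
    scores = []
    for fold in range(n_folds):
        test = values[fold::n_folds]  # every n-th rating is held out
        train = [v for i, v in enumerate(values) if i % n_folds != fold]
        mean = sum(train) / len(train)  # the "model" is just the training mean
        scores.append(rmse(test, [mean] * len(test)))
    return sum(scores) / n_folds

toy_ratings = [7, 9, 4, 8, 10, 3, 6, 8, 5]  # hypothetical ratings
score = cross_validated_rmse(toy_ratings)
print(score)
```

A lower RMSE means the predicted ratings are, on average, closer to the held-out ratings, which is the criterion used to rank the algorithms below.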

In [None]:
benchmark = []
# Iterate over all algorithms
algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(),
              KNNWithMeans(), KNNWithZScore(), BaselineOnly()]

for algorithm in algorithms:
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False, n_jobs=-1)

    # Average the per-fold results and record the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so set the label directly
    tmp['Algorithm'] = type(algorithm).__name__
    benchmark.append(tmp)

In [None]:
performance = pd.DataFrame(benchmark)
performance

__KNNBaseline has the lowest RMSE__ among the algorithms. Although it also has one of the longest run times, we pick it as our best model.
Next, we check whether hyperparameter tuning gives better performance.

In [None]:
trainset, testset = train_test_split(data, test_size=0.2)

In [116]:
# Hyperparameter tuning for the SGD method

list_of_ks = [10,20,40]

sgd_bsl_options = [
    {'method':'sgd', 'reg': 0.02, 'learning_rate': 0.005},
    {'method':'sgd', 'reg': 0.05, 'learning_rate': 0.005},
    {'method':'sgd', 'reg': 0.1, 'learning_rate': 0.005},
    {'method':'sgd', 'reg': 0.02, 'learning_rate': 0.01},
    {'method':'sgd', 'reg': 0.05, 'learning_rate': 0.01},
    {'method':'sgd', 'reg': 0.1, 'learning_rate': 0.01}]


In [126]:
with open('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/kNN_baseline_sgd_scores.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['reg', 'learning_rate', 'k', 'train_rmse', 'test_rmse'])

In [None]:
for curr_bsl_option in sgd_bsl_options:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating k = ' + str(curr_k) + ' ...'
        )        
        algo = KNNBaseline(k = curr_k, bsl_options = curr_bsl_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True)
        
        with open('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/kNN_baseline_sgd_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                [curr_bsl_option['reg'], curr_bsl_option['learning_rate'],str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])


In [128]:
df = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/kNN_baseline_sgd_scores.csv')
df.sort_values(by = 'test_rmse', inplace = True)
df

,reg,learning_rate,k,train_rmse,test_rmse
17,0.1,0.01,40,1.039012,1.796238
8,0.1,0.005,40,1.010634,1.796835
11,0.02,0.01,40,1.058071,1.797147
2,0.02,0.005,40,1.026944,1.797261
14,0.05,0.01,40,1.050509,1.798137
5,0.05,0.005,40,1.020813,1.798795
4,0.05,0.005,20,0.876147,1.80724
16,0.1,0.01,20,0.897017,1.807699
1,0.02,0.005,20,0.884633,1.808392
10,0.02,0.01,20,0.923078,1.808613


In [131]:
# Hyperparameter tuning for the ALS method
# Note: each (reg_i, reg_u) pair appears twice in this list, so every setting is evaluated twice

als_bsl_options = [
    {'method':'als', 'reg_i': 20, 'reg_u': 30},
    {'method':'als', 'reg_i': 40, 'reg_u': 60},
    {'method':'als', 'reg_i': 20, 'reg_u': 30},
    {'method':'als', 'reg_i': 40, 'reg_u': 60}
]

In [132]:
with open('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/kNN_baseline_als_scores.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['reg_i', 'reg_u', 'k', 'train_rmse', 'test_rmse'])


In [None]:
for curr_bsl_option in als_bsl_options:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating k = ' + str(curr_k) + ' ...'
        )        
        algo = KNNBaseline(k = curr_k, bsl_options = curr_bsl_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True)
        
        with open('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/kNN_baseline_als_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                [curr_bsl_option['reg_i'], curr_bsl_option['reg_u'],str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

In [135]:
df1 = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/kNN_baseline_als_scores.csv')
df1.sort_values(by = 'test_rmse', inplace = True)
df1

,reg_i,reg_u,k,train_rmse,test_rmse
8,20,30,40,0.955185,1.819076
2,20,30,40,0.954788,1.819555
1,20,30,20,0.787187,1.830765
7,20,30,20,0.786768,1.832037
11,40,60,40,0.937141,1.83918
5,40,60,40,0.937572,1.84089
4,40,60,20,0.758422,1.849251
10,40,60,20,0.758217,1.850024
6,20,30,10,0.632322,1.866829
0,20,30,10,0.632503,1.867818


Hyperparameter tuning lowered the RMSE by 0.2. The best result comes from estimating the baselines with stochastic gradient descent (SGD), using a learning rate of 0.01, 40 neighbors, and a regularization parameter of 0.1.
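
For reference, here is a sketch of what these hyperparameters control, following the formulation given in the Surprise documentation for KNNBaseline. The notation is an assumption of this note, not something defined in the notebook: $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $N_i^k(u)$ is the set of at most $k$ nearest neighbors of user $u$ that rated item $i$, and $\lambda$ corresponds to `reg`.

$$
\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v)\,(r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}, \qquad b_{ui} = \mu + b_u + b_i
$$

With `method: 'sgd'`, the biases are fitted by gradient steps of size `learning_rate` on a regularized squared error of roughly this form:

$$
\min_{b_u,\, b_i} \sum_{r_{ui}} \left[ \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2\right) \right]
$$

So `k = 40` caps the neighborhood size, while `reg = 0.1` and `learning_rate = 0.01` only affect how the baseline term $b_{ui}$ is estimated.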

In [29]:
# Build the best Surprise model for our data
model = KNNBaseline(k = 40, bsl_options = {'method':'sgd', 'reg': 0.1, 'learning_rate': 0.01})

In [30]:
# Train The Model on Full Dataset 
full_trainset = data.build_full_trainset()
model.fit(full_trainset)


Estimating biases using sgd...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7fb8bc2c9550>

Now, by giving the model a user_id and a movie_id, we can predict how that user would rate that movie.

We will use this model in the next step, which combines the two filters to create our final recommendation engine.

In [32]:
# Get the prediction for user_id 9090 and movie_id tt0208988
model.predict(uid=9090, iid='tt0208988')

Prediction(uid=9090, iid='tt0208988', r_ui=None, est=3.0685858694893526, details={'actual_k': 34, 'was_impossible': False})

In [40]:
est=[]
movies_list = movies_db
ids =  indices['imdb_id']
for b in ids:
    stm = model.predict(2,b).est
    est.append(stm)
movies_list['est'] = est
    
    # Add the estimated vote from collaborative filter and add it to our suggested movie dataset, and sort the data based on that estimated vote
movies_list = movies_list.sort_values('est', ascending=False)
    # Return the top 10 as our final movie recommandation

In [41]:
movies_list[0:20]

Title,director,year,genre,est
The Other (1972),robert mulligan,1972,drama horror mystery,6.908325
La Grande Illusion (1937),jean renoir,1937,drama war,6.661664
A Christmas Carol (1951),brian desmond hurst,1951,drama fantasy,6.655398
It's a Wonderful Life (1946),frank capra,1946,drama family fantasy,6.633878
The Wizard of Oz (1939),victor fleming george cukor,1939,adventure family fantasy,6.62909
Glory (1989),edward zwick,1989,biography drama history,6.598417
Dead Man Walking (1995),tim robbins,1995,crime drama,6.567237
The Right Stuff (1983),philip kaufman,1983,adventure biography drama,6.556452
The Spy Who Came in from the Cold (1965),martin ritt,1965,drama thriller,6.55301
Alive (1993),frank marshall,1993,adventure biography drama,6.513601


## Making a Final Recommendation System with Hybrid Filter:

In the final section, we build a simple hybrid recommender that brings together the content-based and collaborative filtering engines implemented above.

This final filter will work like this:

- Input: User ID and the Title of a Movie
- Output: Similar movies sorted on the basis of expected ratings by that particular user.
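
The two-stage flow can be sketched independently of pandas and Surprise. In this minimal example, `hybrid_recommend`, the toy similarity table and the toy score table are all hypothetical stand-ins: any function that returns similar titles and any function that predicts a rating could be plugged in.

```python
def hybrid_recommend(user_id, movie_title, similar_titles, predict_rating,
                     n_candidates=30, top_n=10):
    """Re-rank content-based candidates by the rating predicted for one user."""
    # Stage 1: candidate generation from the content-based filter
    candidates = similar_titles(movie_title)[:n_candidates]
    # Stage 2: score every candidate with the collaborative model
    scored = [(title, predict_rating(user_id, title)) for title in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical stand-ins for the content filter and the collaborative model
toy_similar = {"Movie A": ["Movie B", "Movie C", "Movie D"]}
toy_scores = {(1, "Movie B"): 6.1, (1, "Movie C"): 8.4, (1, "Movie D"): 7.0}

result = hybrid_recommend(
    1, "Movie A",
    similar_titles=lambda title: toy_similar[title],
    predict_rating=lambda user, title: toy_scores[(user, title)],
    top_n=2,
)
print(result)  # [('Movie C', 8.4), ('Movie D', 7.0)]
```

The content stage decides which movies are eligible at all; the collaborative stage only changes their order, which is why two users asking about the same title can receive different top-10 lists.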



In [33]:
# Create a lookup from movie title to imdb_id, with the title as the index
indices = movies[['Title','imdb_id']].set_index('Title')

# make a subset of the movie metadata and set the index to Title
movies_db = movies[['Title','director','year','genre']].set_index('Title')

In [26]:
# Create our final recommendation engine
def recommender(user_id, movie_title):
    
    ''' This function combines the two previous filters to create the final list of recommended movies.
    First, it builds a list of 30 candidate movies with the content-based filter.
    Second, it uses the Surprise model to predict the vote the target user would give each of those movies.
    Third, it sorts the 30 movies by that predicted vote.
    Finally, it returns the top 10 movies as the final recommendation.
    
    '''
    
    est = []
    # First use the content-based filter to make a list of movies that are close to the target movie
    rec_movies = content_recommender(movie_title)
    # Subset the movie metadata to the suggested movies (copy so movies_db is not modified)
    movies_list = movies_db.loc[rec_movies].copy()
    # Get the imdb_id of each suggested movie
    ids = indices.loc[rec_movies]['imdb_id']
    # Use the Surprise model to predict the vote the target user would give each suggested movie
    for b in ids:
        stm = model.predict(user_id, b).est
        est.append(stm)
    
    # Add the estimated votes to the suggested movies and sort the data by them
    movies_list['est'] = est
    movies_list = movies_list.sort_values('est', ascending=False)
    # Return the top 10 as our final movie recommendation
    return movies_list[0:10]

In [28]:
recommender(2,'Gangs of New York (2002)')

Title,director,year,genre,est
Key Largo (1948),john huston,1948,action crime drama,5.757901
Midnight Cowboy (1969),john schlesinger,1969,drama,5.647158
Calvary (2014),john michael mcdonagh,2014,comedy drama,5.350856
Alice Doesn't Live Here Anymore (1974),martin scorsese,1974,drama romance,5.337787
The Last Temptation of Christ (1988),martin scorsese,1988,drama,5.066978
Casino (1995),martin scorsese,1995,crime drama,5.003399
Casualties of War (1989),brian de palma,1989,crime drama war,4.890082
The Aviator (2004),martin scorsese,2004,biography drama,4.810042
After Hours (1985),martin scorsese,1985,comedy crime drama,4.805793
Catch Me If You Can (2002),steven spielberg,2002,biography crime drama,4.719803


In [510]:
recommender(1,'Kill Bill: Vol. 2 (2004)')

Title,director,year,genre,est
Pulp Fiction (1994),quentin tarantino,1994,crime drama,9.776907
Sin City (2005),frank miller quentin tarantino,2005,crime thriller,9.109121
Reservoir Dogs (1992),quentin tarantino,1992,crime drama thriller,9.1076
Kill Bill: Vol. 1 (2003),quentin tarantino,2003,action crime thriller,8.893922
Django Unchained (2012),quentin tarantino,2012,drama western,8.77428
The China Syndrome (1979),james bridges,1979,drama thriller,8.663043
Jackie Brown (1997),quentin tarantino,1997,crime drama thriller,8.656235
Inglourious Basterds (2009),quentin tarantino,2009,adventure drama war,8.629003
Hamburger Hill (1987),john irvin,1987,action drama thriller,8.50958
From Dusk Till Dawn (1996),robert rodriguez,1996,action crime horror,8.375198


In [511]:
recommender(2344,'Kill Bill: Vol. 2 (2004)')

Title,director,year,genre,est
Inglourious Basterds (2009),quentin tarantino,2009,adventure drama war,7.197344
Django Unchained (2012),quentin tarantino,2012,drama western,7.187281
Jackie Brown (1997),quentin tarantino,1997,crime drama thriller,6.96103
Reservoir Dogs (1992),quentin tarantino,1992,crime drama thriller,6.894105
Batman Returns (1992),tim burton,1992,action crime fantasy,6.789048
Pulp Fiction (1994),quentin tarantino,1994,crime drama,6.77793
Westworld (1973),michael crichton,1973,action sci-fi thriller,6.739897
Sin City (2005),frank miller quentin tarantino,2005,crime thriller,6.685136
Kill Bill: Vol. 1 (2003),quentin tarantino,2003,action crime thriller,6.657046
The Hateful Eight (2015),quentin tarantino,2015,crime drama mystery,6.612125
