# Movie Recommender System

Nowadays,recommender system is everywhere around us. Amazon use it to promote similar products to custmers base on their shopping history or habits, Tiktok use it to suggest short videos, Yelp use it to find your next favorite restaurant,and so on.  

In this program, I'll build three models including simple recommender, content based recommender and collaborative filtering recommerder. The datasets used in this project MovieLens comes from Kaggle.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

## Simple Recommender

In [2]:
metadata = pd.read_csv('movies_metadata.csv',low_memory=False)

In [3]:
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [4]:
# Mean of vote average column
C = metadata['vote_average'].mean()
print(C)

5.618207215133889


In [5]:
# Minimum number of votes required
m = metadata['vote_count'].quantile(0.9)
print(m)

160.0


In [6]:
# Select qualified movies
qualified_movies = metadata.copy().loc[metadata['vote_count'] > m]
qualified_movies.shape

(4538, 24)

In [7]:
metadata.shape

(45466, 24)

In [8]:
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:
# Define new variable 'score'
qualified_movies['score'] = qualified_movies.apply(weighted_rating, axis=1)

In [10]:
# Sort movies based on 'score'
qualified_movies = qualified_movies.sort_values('score',ascending=False)

In [11]:
qualified_movies[['title','vote_count','vote_average','score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


## Content Based

Movie Description Based Recommender.
We'll build a content recommender system based on movies overviews by using sklearn's linear_kernel to calculate cosine similarity that meatures the similarity between two movies.

In [12]:
# Content Based Recommender
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [13]:
# Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the','a'
tfidf = TfidfVectorizer(stop_words='english')

In [14]:
# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

In [15]:
# Construct TF-IDF matrix 
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

In [16]:
tfidf_matrix.shape

(45466, 75827)

In [17]:
tfidf.get_feature_names()[5000:5010]

['avails',
 'avaks',
 'avalanche',
 'avalanches',
 'avallone',
 'avalon',
 'avant',
 'avanthika',
 'avanti',
 'avaracious']

In [18]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [19]:
cosine_sim.shape

(45466, 45466)

In [20]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])

In [21]:
# Build a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

In [22]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

In [23]:
def get_recommendations(title,cosine_sim=cosine_sim):
    
    # Get the index of the movie that matches the title
    idx = indices[title]
    
    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort movies based on the scores
    sim_scores = sorted(sim_scores, key=lambda x:x[1], reverse=True)
    
    # Top 10 most similar movies
    sim_scores = sim_scores[1:11]
    
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
    

In [24]:
get_recommendations('Toy Story')

15348                                     Toy Story 3
2997                                      Toy Story 2
10301                          The 40 Year Old Virgin
24523                                       Small Fry
23843                     Andy Hardy's Blonde Trouble
29202                                      Hot Splash
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
Name: title, dtype: object

Our system is now able to identify similar movies such as 'Toy Story 3' and 'Toy Story 2'. We could further improve our model by incorporating more features such as cast, director, keywords and genre.

In [25]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [26]:
metadata = metadata.drop([19730,29503,35587])

In [27]:
# Merge data
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

In [28]:
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [29]:
# Convert the stringified features into python objects
features = ['cast','crew','keywords','genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

In [30]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
        return np.nan

In [31]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
            return names
        return[]
        

In [32]:
# Define new director, cast, genres and keywords
metadata['director'] = metadata['crew'].apply(get_director)
features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [33]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]",[]
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",,"[board game, disappearance, based on children'...",[]
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]",[]


In [34]:
# Convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ","")) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(" ",""))
        else:
            return''

In [35]:
# Appply clean_data function
features = ['cast', 'keywords','director', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

In [36]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [37]:
# Create a new soup variable
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [38]:
metadata[['soup']].head(2)

Unnamed: 0,soup
0,jealousy toy boy tomhanks timallen donrickles ...
1,boardgame disappearance basedonchildren'sbook ...


In [39]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

In [40]:
count_matrix.shape

(46628, 63344)

In [41]:
# Compute the Cosine Similarity matrix based on the count_matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [42]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [43]:
get_recommendations('Toy Story', cosine_sim2)

15519                   Toy Story 3
26001    Toy Story That Time Forgot
19301                       Tin Toy
19355                   Red's Dream
22126          Toy Story of Terror!
3024                    Toy Story 2
16340          Crazy on the Outside
1034             That Thing You Do!
25999               Partysaurus Rex
2331                You've Got Mail
Name: title, dtype: object

The result looks better to me as now we have more animated films included in our list.

# Collaborative Filtering 

In this section, we'll be using user-based collaborative filtering to make recommendations. Due to limited computational power, I used ratings_small file instead of the full ratings file.

In [3]:
reader = Reader()

In [5]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [6]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [7]:
algo = SVD()

In [8]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9038  0.8919  0.8970  0.9029  0.8936  0.8978  0.0048  
MAE (testset)     0.6961  0.6853  0.6890  0.6972  0.6890  0.6913  0.0046  
Fit time          3.29    3.08    3.25    3.54    3.13    3.26    0.16    
Test time         0.08    0.08    0.14    0.08    0.08    0.09    0.03    


{'test_rmse': array([0.90379422, 0.89188036, 0.89704524, 0.902881  , 0.89357224]),
 'test_mae': array([0.69614452, 0.68534617, 0.68896203, 0.69721243, 0.68900784]),
 'fit_time': (3.2881386280059814,
  3.076244831085205,
  3.2500014305114746,
  3.5370891094207764,
  3.1291098594665527),
 'test_time': (0.07820773124694824,
  0.07809638977050781,
  0.14261865615844727,
  0.07781291007995605,
  0.07810831069946289)}

In [11]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x28feef9c580>

In [12]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [20]:
algo.predict(2, 50, 4)

Prediction(uid=2, iid=50, r_ui=4, est=4.318173069671328, details={'was_impossible': False})

For movie with ID 50, our estimated rating is 4.31, which is close to the true rating of 4.

## Conclusion

In short, we've built 3 models based on different algorithms:

1.Simple Recommender. Ranked all movies based on their weighted_rating.

2.Content Based Recommender. We created 2 differnent engines. One used movie overviews as input and the other one utlized metadata such as cast, director, genre and keywords.

3.Collaborative Filtering. The user-based collaborative filtering recommend products to a user that similar users have liked. In this case, our model takes a movie ID and makes prediction based on how other users have rated the movie.