## Movie Recommendation System

### 1. Dataset
The movie dataset comes from [MovieLens](https://grouplens.org/datasets/movielens/latest/). This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
#import data
df=pd.read_csv('movies.csv')
df_tag=pd.read_csv('tags.csv')
df_rating=pd.read_csv('ratings.csv')

In [3]:
df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
df_tag.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,14,110,epic,1443148538
1,14,110,Medieval,1443148532
2,14,260,sci-fi,1442169410
3,14,260,space action,1442169421
4,14,318,imdb top 250,1442615195


In [5]:
df_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264


In [6]:
def parse_movie_genres(genres):
    return genres.split('|')

In [7]:
df['categories']=df['genres'].apply(parse_movie_genres)

In [8]:
df.head()

Unnamed: 0,movieId,title,genres,categories
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),Comedy,[Comedy]


### 2. Feature Engineering

Since genre is a categorical variable, one-hot encoding should be used, which gives each genre its own dimension in feature space. Here I use [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html). This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or sparse matrices for use with scikit-learn estimators.

Before using DictVectorizer, the `categories` column must be converted to a dict-like format.
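
As a minimal sketch of what DictVectorizer does with dict-like input, the toy example below encodes three made-up genre dicts. The rows and the variable names `genre_dicts`, `vectorizer`, and `encoded` are illustrative assumptions, not part of the MovieLens data.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up rows: each dict maps a genre name to 1 when the movie has that genre.
genre_dicts = [
    {"Comedy": 1, "Romance": 1},
    {"Comedy": 1},
    {"Adventure": 1, "Comedy": 1},
]

vectorizer = DictVectorizer()
encoded = vectorizer.fit_transform(genre_dicts)  # SciPy sparse matrix by default

# feature_names_ gives the column order (sorted by name by default).
print(vectorizer.feature_names_)
# Dense view, for inspection only: one row per dict, one column per genre.
print(encoded.toarray())
```

Each distinct key becomes one column, so a movie with several genres simply has several ones in its row.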

In [9]:
from sklearn import base
class DictEncoder(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col):
        self.col = col
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        
        def to_dict(l):
            try:
                return {x: 1 for x in l}
            except TypeError:
                return {}
        
        return X[self.col].apply(to_dict)

In [10]:
a=DictEncoder('categories')
a.fit_transform(df)[:5]

0    {'Adventure': 1, 'Animation': 1, 'Children': 1...
1        {'Adventure': 1, 'Children': 1, 'Fantasy': 1}
2                          {'Comedy': 1, 'Romance': 1}
3              {'Comedy': 1, 'Drama': 1, 'Romance': 1}
4                                        {'Comedy': 1}
Name: categories, dtype: object

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer

pipe=Pipeline([('encoder', DictEncoder('categories')),\
              ('vectorizer', DictVectorizer())])

features=pipe.fit_transform(df)
features

<58098x20 sparse matrix of type '<class 'numpy.float64'>'
	with 106107 stored elements in Compressed Sparse Row format>

The DictVectorizer returns a sparse matrix in which each row represents a movie and each column a category: the value is 1 if the movie has that category and 0 otherwise.
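
To check how rows of such a matrix map back to category names, one option is `DictVectorizer.inverse_transform`. The sketch below uses two made-up rows; `toy_rows`, `toy_vectorizer`, `toy_matrix`, and `decoded` are hypothetical names used only for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up movies described by their genres.
toy_rows = [{"Comedy": 1, "Romance": 1}, {"Comedy": 1}]

toy_vectorizer = DictVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_rows)

# nnz counts the stored (non-zero) entries; zeros are not stored at all.
print(toy_matrix.shape, toy_matrix.nnz)

# inverse_transform turns matrix rows back into {feature name: value} dicts.
decoded = toy_vectorizer.inverse_transform(toy_matrix[:1])
print(decoded)
```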

### 3. Model 

Clearly, choosing similar movies relies on feature similarity, so this is a [NearestNeighbors](https://scikit-learn.org/stable/modules/neighbors.html) problem. `sklearn.neighbors` provides functionality for unsupervised and supervised neighbors-based learning methods. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point. The distance can, in general, be any metric measure; standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing and non-parametric machine learning methods, since they simply "remember" all of their training data, calculate the distances between the new point and the training data, and return the closest n_neighbors points, where n_neighbors is a hyperparameter.
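
The idea can be illustrated on a tiny made-up matrix. This is only a sketch: the four rows in `X`, the `query` point, and the names `nn_toy` and `manual` are assumptions for illustration, and the hand-computed distances assume the default Euclidean metric.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up one-hot rows; the columns could be read as three genres.
X = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
], dtype=float)

nn_toy = NearestNeighbors(n_neighbors=2).fit(X)

# kneighbors expects a 2-D array: one row per query point.
query = np.array([[1, 1, 0]], dtype=float)
distances, indices = nn_toy.kneighbors(query)

# The same distances computed by hand: Euclidean norm of each difference.
manual = np.sqrt(((X - query) ** 2).sum(axis=1))

print(distances, indices)
print(manual)
```

Rows 0 and 1 are identical to the query, so they come back with distance 0. With one-hot genre features many movies tie at distance 0, which is why genre-only neighbors all share exactly the same categories.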

In [12]:
from sklearn.neighbors import NearestNeighbors

nn=NearestNeighbors(n_neighbors=20).fit(features)

Let's see which movies are similar to the first movie.

In [13]:
df.iloc[0]

movieId                                                       1
title                                          Toy Story (1995)
genres              Adventure|Animation|Children|Comedy|Fantasy
categories    [Adventure, Animation, Children, Comedy, Fantasy]
Name: 0, dtype: object

In [14]:
# distance holds the distances from <Toy Story> to its 20 nearest movies;
# indice holds the row positions of those movies in df.
distance, indice=nn.kneighbors(features[0])
df.iloc[indice[0]]

Unnamed: 0,movieId,title,genres,categories
2210,2294,Antz (1998),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
54888,186159,Tangled: Before Ever After (2017),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
25073,115879,Toy Story Toons: Small Fry (2011),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
25651,117454,The Magic Crystal (2011),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
32912,136361,Scooby-Doo! Mask of the Blue Falcon (2012),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
11009,45074,"Wild, The (2006)",Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
32793,136016,The Good Dinosaur (2015),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
52769,181601,Olaf's Frozen Adventure (2017),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
50764,177037,Puss in Book: Trapped in an Epic Tale (2017),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"


The movies returned as similar to Toy Story are reasonable: all of them have exactly the same categories.

### 4. User Tag

Above I used only movie categories to find similar movies, which is not specific to any user. In this section I also use the tags created by users to inform the recommendations.

In [15]:
#get all tags for each movie
tags=df_tag.groupby('movieId')['tag'].apply(lambda x: x.tolist())
tags=pd.DataFrame(tags).reset_index()

In [16]:
tags.head()

Unnamed: 0,movieId,tag
0,1,"[animated, buddy movie, Cartoon, cgi, comedy, ..."
1,2,"[fantasy, adapted from:book, animals, bad cgi,..."
2,3,"[moldy, old, Ann Margaret, Burgess Meredith, D..."
3,4,"[characters, girl movie, characters, chick fli..."
4,5,"[steve martin, steve martin, pregnancy, remake..."


In [17]:
#merge movie dataset with tags
df=df.merge(tags, how='left', left_on='movieId', right_on='movieId')

In [18]:
df.head()

Unnamed: 0,movieId,title,genres,categories,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]","[animated, buddy movie, Cartoon, cgi, comedy, ..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]","[fantasy, adapted from:book, animals, bad cgi,..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]","[moldy, old, Ann Margaret, Burgess Meredith, D..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[Comedy, Drama, Romance]","[characters, girl movie, characters, chick fli..."
4,5,Father of the Bride Part II (1995),Comedy,[Comedy],"[steve martin, steve martin, pregnancy, remake..."


In [19]:
#one-hot encode tags
tag_pipe=Pipeline([('encoder', DictEncoder('tag')),\
               ('vectorizer', DictVectorizer())])

In [20]:
from sklearn.pipeline import FeatureUnion

union=FeatureUnion([('categories', pipe), \
                   ('tags', tag_pipe)])

In [21]:
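# Caution: after astype(str) each cell is a single string, so DictEncoder's
# dict comprehension iterates over its characters, not over the list items.
# The feature matrices built from here on are therefore character-level.
df['tag']=df['tag'].astype(str)
df['categories']=df['categories'].astype(str)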

In [22]:
df.head()

Unnamed: 0,movieId,title,genres,categories,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy...","['animated', 'buddy movie', 'Cartoon', 'cgi', ..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,"['Adventure', 'Children', 'Fantasy']","['fantasy', 'adapted from:book', 'animals', 'b..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"['Comedy', 'Romance']","['moldy', 'old', 'Ann Margaret', 'Burgess Mere..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"['Comedy', 'Drama', 'Romance']","['characters', 'girl movie', 'characters', 'ch..."
4,5,Father of the Bride Part II (1995),Comedy,['Comedy'],"['steve martin', 'steve martin', 'pregnancy', ..."


In [23]:
features=union.fit_transform(df)
features

<58098x407 sparse matrix of type '<class 'numpy.float64'>'
	with 1955658 stored elements in Compressed Sparse Row format>

In [24]:
nn=NearestNeighbors(n_neighbors=20).fit(features)

In [25]:
df.iloc[0]

movieId                                                       1
title                                          Toy Story (1995)
genres              Adventure|Animation|Children|Comedy|Fantasy
categories    ['Adventure', 'Animation', 'Children', 'Comedy...
tag           ['animated', 'buddy movie', 'Cartoon', 'cgi', ...
Name: 0, dtype: object

In [26]:
distance, indice=nn.kneighbors(features[0])
df.iloc[indice[0]]

Unnamed: 0,movieId,title,genres,categories,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy...","['animated', 'buddy movie', 'Cartoon', 'cgi', ..."
3028,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy...","['Pixar', 'sequel better than original', 'aban..."
32293,134853,Inside Out (2015),Adventure|Animation|Children|Comedy|Drama|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy...","['imaginative', 'psychology', 'creative', 'emo..."
13345,65261,Ponyo (Gake no ue no Ponyo) (2008),Adventure|Animation|Children|Fantasy,"['Adventure', 'Animation', 'Children', 'Fantasy']","['environmental', 'environmentalism', 'friends..."
1113,1136,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy,"['Adventure', 'Comedy', 'Fantasy']","['british comedy', 'excellent dialogue', 'High..."
4791,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy...","['funny', 'Pixar', 'Comedy', 'funny', 'Pixar',..."
4212,4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...,"['Adventure', 'Animation', 'Children', 'Comedy...","['fairy tale', 'Funny', 'Kids', 'animation', '..."
13812,68954,Up (2009),Adventure|Animation|Children|Drama,"['Adventure', 'Animation', 'Children', 'Drama']","['adventure', 'Pixar', 'bittersweet', 'romance..."
3309,3396,"Muppet Movie, The (1979)",Adventure|Children|Comedy|Musical,"['Adventure', 'Children', 'Comedy', 'Musical']","['funny', 'Jim Henson', 'Kermit the Frog', 'mu..."
5520,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,Adventure|Animation|Fantasy,"['Adventure', 'Animation', 'Fantasy']","['want to own', 'alternate reality', 'atmosphe..."


With users' tags, *Toy Story 2* is now the most similar movie. However, another issue comes up: *Pirates of the Caribbean* and *Eragon* are clearly not closely related to *Toy Story*.

### 5. Dimension Reduction

Dimension reduction is very important when there are many features, which can cause overfitting and increase computational cost. Here I am going to use [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html). This transformer performs linear dimensionality reduction by means of truncated singular value decomposition. Contrary to PCA, this estimator does not center the data before computing the singular value decomposition, which means it can work with sparse matrices efficiently.
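
A small sketch of TruncatedSVD applied directly to a sparse matrix follows. The random 50x30 matrix stands in for a tag matrix and is purely made up; `sparse_tags`, `svd_toy`, and `reduced` are hypothetical names, and `n_components=5` is an arbitrary choice for the toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Made-up sparse binary matrix: 50 "movies" x 30 "tags", about 10% non-zero.
rng = np.random.RandomState(0)
dense = (rng.rand(50, 30) < 0.1).astype(float)
sparse_tags = csr_matrix(dense)

# TruncatedSVD accepts the sparse matrix as-is; no centering or densifying.
svd_toy = TruncatedSVD(n_components=5, random_state=0)
reduced = svd_toy.fit_transform(sparse_tags)

# The result is a dense array with one row per movie and 5 columns.
print(reduced.shape)
# Share of the variance retained by the kept components.
print(svd_toy.explained_variance_ratio_.sum())
```

The retained-variance figure is one rough way to judge whether the chosen number of components is too small.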

In [27]:
from sklearn.decomposition import TruncatedSVD

svd=TruncatedSVD(n_components=100)
tag_pipe=Pipeline([('encoder', DictEncoder('tag')),
                     ('vectorizer', DictVectorizer()),
                     ('svd', svd)])
union=FeatureUnion([('categories', pipe), \
                   ('tags', tag_pipe)])

In [28]:
features=union.fit_transform(df)
features

<58098x138 sparse matrix of type '<class 'numpy.float64'>'
	with 6583658 stored elements in Compressed Sparse Row format>

In [29]:
nn=NearestNeighbors(n_neighbors=20).fit(features)

In [30]:
distance, indice=nn.kneighbors(features[5])
df.iloc[indice[0]]

Unnamed: 0,movieId,title,genres,categories,tag
5,6,Heat (1995),Action|Crime|Thriller,"['Action', 'Crime', 'Thriller']","['overrated', 'bank robbery', 'crime', 'heists..."
4760,4855,Dirty Harry (1971),Action|Crime|Thriller,"['Action', 'Crime', 'Thriller']","['Andrew Robinson', 'Clint Eastwood', 'Lalo Sc..."
1017,1036,Die Hard (1988),Action|Crime|Thriller,"['Action', 'Crime', 'Thriller']","['action', 'Alan Rickman', 'thriller', 'violen..."
26547,120466,Chappie (2015),Action|Thriller,"['Action', 'Thriller']","['AI', 'artificial intelligence', 'cyberpunk',..."
6765,6874,Kill Bill: Vol. 1 (2003),Action|Crime|Thriller,"['Action', 'Crime', 'Thriller']","['Kick-Butt Women', 'martial arts', 'nonlinear..."
7313,7438,Kill Bill: Vol. 2 (2004),Action|Drama|Thriller,"['Action', 'Drama', 'Thriller']","['martial arts', 'mystery writer', 'samurai', ..."
4117,4210,Manhunter (1986),Action|Crime|Drama|Horror|Thriller,"['Action', 'Crime', 'Drama', 'Horror', 'Thrill...","['Acting', 'atmospheric', 'based on a book', '..."
2874,2959,Fight Club (1999),Action|Crime|Drama|Thriller,"['Action', 'Crime', 'Drama', 'Thriller']","['ohsoso', 'Brad Pitt', 'Brad Pitt', 'dark com..."
2204,2288,"Thing, The (1982)",Action|Horror|Sci-Fi|Thriller,"['Action', 'Horror', 'Sci-Fi', 'Thriller']","['aliens', 'paranoia', 'shape shifter', 'claus..."
14333,71535,Zombieland (2009),Action|Comedy|Horror,"['Action', 'Comedy', 'Horror']","['clever', 'dark comedy', 'Jesse Eisenberg', '..."


### 6. Movie Recommendation

When a user logs into a movie website, the movies recommended to the user should be based not just on movie categories but also on the user's historical preferences. Here the approach is to average the movies previously watched and rated by the user. First, select all movies watched and rated by the user and find the corresponding features. Second, use these movies' rating scores to weight the features and obtain the "average" movie. Last, use the nearest neighbors algorithm to find the movies most similar to the "average" movie.
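
The weighting step can be sketched on made-up numbers. The three feature rows and ratings below are illustrative assumptions, as are the names `movie_features`, `ratings`, `weights`, and `profile`; the 0.2 factor rescales 5-star ratings so that 5 stars maps to a weight of 1.0.

```python
import numpy as np

# Made-up feature rows for three movies rated by one user.
movie_features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
ratings = np.array([5.0, 2.5, 0.5])

# Rescale ratings: 5 stars -> 1.0, 0.5 stars -> 0.1.
weights = ratings * 0.2

# Rating-weighted sum of the feature rows, divided by the number of movies.
profile = weights @ movie_features / len(weights)
print(profile)
```

Highly rated movies pull the "average" movie toward their features, while poorly rated ones contribute very little. The resulting vector has the same width as the feature rows, so it can be reshaped to a single row and passed to `kneighbors` as a query.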

In [31]:
#merge rating with movie data
df_rating_movie=df_rating.merge(df, how='left', left_on='movieId', right_on='movieId')
df_rating_movie=df_rating_movie[['userId', 'movieId', 'rating', 'title', 'categories', 'tag']]

We look at userId==5 and see what this user likes to watch.

In [90]:
def select_movies(userId):
    """
    select movies according to userId.
    Args: userId (int)
    Returns: dataframe
    """
    return df_rating_movie[df_rating_movie['userId']==userId].sort_values('rating', ascending=False)

In [96]:
#look at userId==5 and see what this user likes to watch.
userId=5
userId5=select_movies(userId)
userId5

Unnamed: 0,userId,movieId,rating,title,categories,tag
791,5,1222,5.0,Full Metal Jacket (1987),"['Drama', 'War']","['Stanley Kubrick', 'political', 'drill instru..."
790,5,1213,5.0,Goodfellas (1990),"['Crime', 'Drama']","['crime', 'dark comedy', 'disturbing', 'gangst..."
810,5,5995,5.0,"Pianist, The (2002)","['Drama', 'War']","['holocaust', 'Nazis', 'World War II', 'dramat..."
811,5,6016,5.0,City of God (Cidade de Deus) (2002),"['Action', 'Adventure', 'Crime', 'Drama', 'Thr...","['Brazil', 'crime', 'drugs', 'killing', 'money..."
813,5,7361,5.0,Eternal Sunshine of the Spotless Mind (2004),"['Drama', 'Romance', 'Sci-Fi']","['bittersweet', 'Charlie Kaufman', 'Elijah Woo..."
779,5,50,5.0,"Usual Suspects, The (1995)","['Crime', 'Mystery', 'Thriller']","['great acting', 'storytelling', 'twist ending..."
798,5,2858,5.0,American Beauty (1999),"['Drama', 'Romance']","['adultery', 'loneliness', 'midlife crisis', '..."
826,5,44195,5.0,Thank You for Smoking (2006),"['Comedy', 'Drama']","['Adam Brody', 'corporate greed', 'satire', 'c..."
829,5,46976,5.0,Stranger than Fiction (2006),"['Comedy', 'Drama', 'Fantasy', 'Romance']","['modern fantasy', 'metaphysics', 'modern fant..."
793,5,1732,5.0,"Big Lebowski, The (1998)","['Comedy', 'Crime']","['black comedy', 'Coen Brothers', 'Jeff Bridge..."


It looks like the favorite movies of userId==5 are in the crime and drama categories.

In [97]:
#normalize rating scale in [0, 1]
scale=select_movies(userId)['rating']*0.2

In [98]:
#map each rated movieId to its row index in df, then flatten the per-movie index lists
index=select_movies(userId)['movieId'].apply(lambda x: df.index[df['movieId']==x]).tolist()

flatten_index=[x for y in index for x in y]

In [99]:
weighted_avg_movie = scale.values.reshape(1,-1).dot(features[flatten_index,:].toarray()) / len(scale)

In [100]:
distance, indice=nn.kneighbors(weighted_avg_movie)

In [101]:
df.iloc[indice[0]]

Unnamed: 0,movieId,title,genres,categories,tag
10309,34437,Broken Flowers (2005),Comedy|Drama,"['Comedy', 'Drama']","['almost no plot', 'Bill Murray', 'mediocre pl..."
1057,1079,"Fish Called Wanda, A (1988)",Comedy|Crime,"['Comedy', 'Crime']","['absurd', 'black comedy', 'British', 'dark co..."
5368,5464,Road to Perdition (2002),Crime|Drama,"['Crime', 'Drama']","['father and son', 'organized crime', 'Oscar b..."
1178,1203,12 Angry Men (1957),Drama,['Drama'],"['thought-provoking', 'classic', 'social comme..."
3061,3147,"Green Mile, The (1999)",Crime|Drama,"['Crime', 'Drama']","['Stephen King', 'Tom Hanks', 'compassionate',..."
9587,30749,Hotel Rwanda (2004),Drama|War,"['Drama', 'War']","['Africa', 'history', 'killing', 'Rwanda', 'Un..."
3231,3317,Wonder Boys (2000),Comedy|Drama,"['Comedy', 'Drama']","['college', 'Gay', 'literature', 'Michael Doug..."
7447,7748,Pierrot le fou (1965),Crime|Drama,"['Crime', 'Drama']","['Jean-Paul Belmondo', 'Anna Karina', 'french ..."
12246,55820,No Country for Old Men (2007),Crime|Drama,"['Crime', 'Drama']","['a movie about death', 'lack of character dev..."
1149,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama,['Drama'],"['bittersweet', 'Giuseppe Tornatore', 'superb ..."


The 20 movies recommended to userId==5 above seem reasonable.