# Project Overview
-----------------------------------------
The objective of this project is to build an app with a simple UI that lets the user search for movies and get recommendations.

Common methods for building a recommendation system:
1) Collaborative Filtering.
2) Content-based Filtering.
3) Personalized Video Ranker.
4) Candidate Generation Network.
5) Knowledge-based Recommender systems.



In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Meta Data

#### Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
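
As a minimal sketch, the epoch-second timestamps can be turned into timezone-aware datetimes with `pd.to_datetime`. The two rows below are copied from the `ratings.csv` preview in this notebook; the column name `rated_at` is a hypothetical addition, not part of the dataset.

```python
import pandas as pd

# Two sample rows copied from the ratings.csv preview in this notebook.
sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 00:00 UTC.
# "rated_at" is a hypothetical column name used only for this illustration.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)
print(sample[["timestamp", "rated_at"]])
```

The same conversion should apply to the `timestamp` column of `tags.csv`, which uses the same encoding.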


#### Tags Data File Structure (tags.csv)
-----------------------------------

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


#### Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
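
A short sketch of how the pipe-separated `genres` field might be unpacked with pandas. The three rows are copied from the `movies.csv` preview in this notebook; `genre_rows` and `genre_flags` are hypothetical names used only here.

```python
import pandas as pd

# Three sample rows copied from the movies.csv preview in this notebook.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Long format: one row per (movie, genre) pair.
genre_rows = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Wide format: one 0/1 indicator column per genre.
genre_flags = movies["genres"].str.get_dummies(sep="|")

print(genre_rows["genre"].value_counts())
print(genre_flags)
```

The long format is convenient for counting genres; the indicator columns are the usual starting point for content-based filtering.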


#### Links Data File Structure (links.csv)
---------------------------------------

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

Use of the resources listed above is subject to the terms of each provider.
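
The identifiers can be turned into the links described above with a few helper functions. This is a sketch: the function names are hypothetical, and the zero-padding of `imdbId` to seven digits is an assumption inferred from the Toy Story example (`114709` in the data, `tt0114709` in the URL).

```python
def movielens_url(movie_id):
    """Build the MovieLens page URL for a movieId."""
    return f"https://movielens.org/movies/{int(movie_id)}"


def imdb_url(imdb_id):
    """Build the IMDb title URL for an imdbId."""
    # Assumption: IMDb title ids are zero-padded to at least 7 digits.
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"


def tmdb_url(tmdb_id):
    """Build the TMDB movie URL for a tmdbId."""
    # tmdbId appears as a float (e.g. 862.0) in the links.csv preview, so cast to int.
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"


# Toy Story, using the ids from the first row of links.csv.
print(movielens_url(1))
print(imdb_url(114709))
print(tmdb_url(862.0))
```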

---------------------------------------

### Importing Data

In [2]:
df_movies=pd.read_csv('movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
df_links=pd.read_csv('links.csv')
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [4]:
df_ratings=pd.read_csv('ratings.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
df_tags=pd.read_csv('tags.csv')
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [6]:
print(
    "Shape of Dataframes: \n"
    f" Rating DataFrame{df_ratings.shape}\n"
    f" Movies DataFrame{df_movies.shape}\n"
    f" Tags DataFrame{df_tags.shape}\n"
    f" Links DataFrame{df_links.shape}"
)

Shape of Dataframes: 
 Rating DataFrame(100836, 4)
 Movies DataFrame(9742, 3)
 Tags DataFrame(3683, 4)
 Links DataFrame(9742, 3)


-----------------------------------------
### Merging datasets
-----------------------------------------

In [7]:
df_rating_movies=pd.merge(df_movies,df_ratings,on='movieId')
df_rating_movies=df_rating_movies.drop('timestamp',axis=1)
df_rating_movies.head()

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5
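
`pd.merge` performs an inner join by default, so a movie with no ratings does not appear in the merged frame. That seems consistent with `movies.csv` having 9742 rows while the rating matrix built later has 9724 movie columns. The toy frames below are made up for illustration and show how a left join with `indicator=True` could surface such movies.

```python
import pandas as pd

# Made-up toy data: movie 3 has no ratings.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# Inner join (the pd.merge default): movie 3 disappears.
inner = pd.merge(movies, ratings, on="movieId")

# Left join keeps every movie; the "_merge" column flags rows without a match.
left = pd.merge(movies, ratings, on="movieId", how="left", indicator=True)
unrated = left.loc[left["_merge"] == "left_only", "title"].tolist()

print(inner.shape)
print(unrated)
```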


In [8]:
df_tags_movies=pd.merge(df_movies,df_tags,on='movieId')
df_tags_movies=df_tags_movies.drop('timestamp',axis=1)
df_tags_movies.head()

Unnamed: 0,movieId,title,genres,userId,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,fun
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,fantasy
4,2,Jumanji (1995),Adventure|Children|Fantasy,62,magic board game


In [9]:
# print("Shape of Ratings and movies merged dataset "+str(df_rating_movies.shape))
# print("Shape of Tags and movies merged dataset "+str(df_tags_movies.shape))

-----------------------------------------
## Data Visualization
-----------------------------------------

#### Average rating and number of ratings for each movie
-----------------------------------------

In [10]:
ratings = pd.DataFrame(df_rating_movies.groupby('title')['rating'].mean())
ratings['num of ratings'] = df_rating_movies.groupby('title')['rating'].count()
ratings

title,rating,num of ratings
'71 (2014),4.000000,1
'Hellboy': The Seeds of Creation (2004),4.000000,1
'Round Midnight (1986),3.500000,2
'Salem's Lot (2004),5.000000,1
'Til There Was You (1997),4.000000,2
...,...,...
eXistenZ (1999),3.863636,22
xXx (2002),2.770833,24
xXx: State of the Union (2005),2.000000,5
¡Three Amigos! (1986),3.134615,26
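
The table above includes titles whose average comes from a single rating (for example a 5.0 average with one rating). A possible way to handle that, sketched on made-up data, is to compute both statistics in one `agg` call and keep only titles with a minimum number of ratings. The threshold `min_ratings` is a hypothetical parameter, not something this notebook uses.

```python
import pandas as pd

# Made-up toy ratings.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 3.5, 2.5, 5.0],
})

# Mean and count in a single pass over the groups.
summary = toy.groupby("title")["rating"].agg(["mean", "count"])
summary.columns = ["rating", "num of ratings"]

# Hypothetical threshold: ignore titles rated fewer than min_ratings times.
min_ratings = 2
popular = summary[summary["num of ratings"] >= min_ratings].sort_values(
    "rating", ascending=False
)
print(popular)
```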


In [11]:
# print('Max number of rating for a movie: ',ratings['num of ratings'].max())
# print('Min number of rating for a movie: ',ratings['num of ratings'].min())

In [12]:
# plt.figure(figsize=(10,4))
# ratings['num of ratings'].hist(bins=50)
# plt.title('Distribution of num of ratings')
# plt.show()

In [13]:
# print('Max average rating for a movie: ',ratings['rating'].max())
# print('Min average rating for a movie: ',ratings['rating'].min())

In [14]:
# plt.figure(figsize=(10,4))
# ratings['rating'].hist(bins=11)
# plt.title('Distribution of average ratings')
# plt.show()

In [15]:
# plt.figure(figsize=(10,4))
# sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)
# plt.show()

In [16]:
# plt.figure(figsize=(20,7))
# generlist = df_rating_movies['genres'].apply(lambda generlist_movie : str(generlist_movie).split("|"))
# geners_count = {}

# for generlist_movie in generlist:
#     for i in generlist_movie:
#         if(geners_count.get(i, False)):
#             geners_count[i]=geners_count[i]+1
#         else:
#             geners_count[i] = 1       
# geners_count.pop("(no genres listed)")
# plt.bar(geners_count.keys(),geners_count.values(),color='y')
# plt.show()

In [17]:
# plt.figure(figsize=(10,4))
# sns.distplot(df_rating_movies["rating"])
# plt.title('Density plot of rating')
# plt.show()

In [18]:
# ratings_grouped_by_users = df_rating_movies.groupby('userId').agg([np.size, np.mean])
# ratings_grouped_by_users = ratings_grouped_by_users.drop('movieId', axis = 1)
# ratings_grouped_by_users['rating']['size'].sort_values(ascending=False).head(10).plot.bar(figsize = (10,5))
# plt.title('Users who gave most number of ratings')
# plt.show()

In [19]:
# plt.figure(figsize=(20,7))
# ratings_grouped_by_movies = df_rating_movies.groupby('title').agg([np.mean, np.size])
# ratings_grouped_by_movies.shape
# ratings_grouped_by_movies['rating']['size'].sort_values(ascending=False).head(10).plot.bar( figsize = (10,5))
# plt.title('Movies with most number of ratings')
# plt.show()

-----------------------------------------
## Building a movie-based recommendation system.
-----------------------------------------

In [20]:
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
def encode(series, encoder):
    # Map raw ids to contiguous integer codes (0..n-1) so they can index matrix rows/columns.
    return encoder.fit_transform(series.values.reshape((-1, 1))).astype(int).reshape(-1)

user_encoder, movie_encoder = OrdinalEncoder(), OrdinalEncoder()
df_rating_movies['user_id_encoding'] = encode(df_rating_movies['userId'], user_encoder)
df_rating_movies['movie_id_encoding'] = encode(df_rating_movies['movieId'], movie_encoder)

# Sparse user x movie rating matrix: rows are users, columns are movies.
matrix = csr_matrix((
    df_rating_movies['rating'],
    (df_rating_movies['user_id_encoding'], df_rating_movies['movie_id_encoding']),
))

In [22]:
matrix.shape

(610, 9724)
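
To make the encoding step concrete, here is a small sketch on made-up ids. It swaps in `pd.factorize` as an alternative to `OrdinalEncoder`; with `sort=True` it also assigns codes 0..n-1 in sorted id order. All frame and variable names below are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up ratings with non-contiguous user and movie ids.
toy = pd.DataFrame({
    "userId": [1, 1, 5, 7],
    "movieId": [10, 47, 10, 318],
    "rating": [4.0, 5.0, 3.0, 4.5],
})

# factorize returns (codes, uniques); codes index into uniques.
user_codes, user_ids = pd.factorize(toy["userId"], sort=True)
movie_codes, movie_ids = pd.factorize(toy["movieId"], sort=True)

# Rows are users, columns are movies; unrated cells stay 0.
toy_matrix = csr_matrix(
    (toy["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)
print(toy_matrix.toarray())
print(list(movie_ids))
```

The `uniques` arrays play the role of `inverse_transform`: `movie_ids[j]` is the raw id for column `j`.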

In [23]:
df_rating_movies.head()

Unnamed: 0,movieId,title,genres,userId,rating,user_id_encoding,movie_id_encoding
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,0,0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,4,0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,6,0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,14,0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,16,0


In [24]:
df_matrix = pd.DataFrame(matrix.toarray())

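#### Normalizing the matrix <br>
Rows represent users <br>
Columns represent movies <br>
Each user's mean rating, computed over all movies with unrated entries counted as 0, is subtracted from that user's row.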

In [25]:
df_matrix = df_matrix.sub(df_matrix.sum(axis=1)/df_matrix.shape[1],axis=0)

In [26]:
df_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,3.895825,-0.104175,3.895825,-0.104175,-0.104175,3.895825,-0.104175,-0.104175,-0.104175,-0.104175,...,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175,-0.104175
1,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,...,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775,-0.011775
2,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,...,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770,-0.009770
3,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,...,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980,-0.078980
4,3.983546,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,...,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454,-0.016454
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,2.080625,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,2.080625,-0.419375,-0.419375,-0.419375,...,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375,-0.419375
606,3.927190,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,...,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810,-0.072810
607,2.232158,1.732158,1.732158,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,3.732158,...,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842,-0.267842
608,2.987557,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,3.987557,...,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443,-0.012443
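
The normalization cell divides each row sum by the total number of movies, so unrated entries (stored as 0) pull the mean down and end up slightly negative. The sketch below reproduces that on a made-up 2 x 4 matrix and contrasts it with one possible alternative, centering only the rated entries. The alternative is an assumption for comparison, not what this notebook does; it relies on 0 never being a real rating, since the scale starts at 0.5.

```python
import numpy as np
import pandas as pd

# Made-up ratings: 0 means "not rated".
toy = pd.DataFrame([[4.0, 0.0, 2.0, 0.0],
                    [0.0, 5.0, 0.0, 0.0]])

# Same operation as the notebook cell: mean over all columns, zeros included.
centered_all = toy.sub(toy.sum(axis=1) / toy.shape[1], axis=0)

# Alternative (assumption): treat 0 as missing and center only rated entries.
rated = toy.where(toy != 0)  # zeros become NaN
centered_rated = rated.sub(rated.mean(axis=1), axis=0).fillna(0.0)

print(centered_all)
print(centered_rated)
```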


In [27]:
cosine_matrix = cosine_similarity(df_matrix.T)

In [28]:
pd.DataFrame(cosine_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,1.000000,0.372402,0.239448,-0.169952,0.248008,0.339290,0.219856,-0.037423,0.127169,0.360700,...,-0.320294,-0.350984,-0.293216,-0.293216,-0.320294,-0.293216,-0.320294,-0.320294,-0.320294,-0.293987
1,0.372402,1.000000,0.224777,-0.076430,0.229663,0.249620,0.170375,0.021634,-0.052973,0.380252,...,-0.280591,-0.307396,-0.256937,-0.256937,-0.280591,-0.256937,-0.280591,-0.280591,-0.280591,-0.257736
2,0.239448,0.224777,1.000000,-0.010217,0.382465,0.231223,0.367492,0.228901,0.254974,0.190362,...,-0.149892,-0.164039,-0.137401,-0.137401,-0.149892,-0.137401,-0.149892,-0.149892,-0.149892,-0.138090
3,-0.169952,-0.076430,-0.010217,1.000000,0.117858,-0.080012,0.193285,0.166526,0.033591,-0.063055,...,0.190762,0.210033,0.173807,0.173807,0.190762,0.173807,0.190762,0.190762,0.190762,0.172753
4,0.248008,0.229663,0.382465,0.117858,1.000000,0.245745,0.446702,0.222190,0.301724,0.163623,...,-0.092984,-0.101632,-0.085341,-0.085341,-0.092984,-0.085341,-0.092984,-0.092984,-0.092984,-0.085962
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,-0.293216,-0.256937,-0.137401,0.173807,-0.085341,-0.238092,-0.099174,0.044466,0.065114,-0.221649,...,0.997933,0.990024,1.000000,1.000000,0.997933,1.000000,0.997933,0.997933,0.997933,0.295792
9720,-0.320294,-0.280591,-0.149892,0.190762,-0.092984,-0.260014,-0.108122,0.049278,0.071596,-0.242049,...,1.000000,0.997033,0.997933,0.997933,1.000000,0.997933,1.000000,1.000000,1.000000,0.324439
9721,-0.320294,-0.280591,-0.149892,0.190762,-0.092984,-0.260014,-0.108122,0.049278,0.071596,-0.242049,...,1.000000,0.997033,0.997933,0.997933,1.000000,0.997933,1.000000,1.000000,1.000000,0.324439
9722,-0.320294,-0.280591,-0.149892,0.190762,-0.092984,-0.260014,-0.108122,0.049278,0.071596,-0.242049,...,1.000000,0.997033,0.997933,0.997933,1.000000,0.997933,1.000000,1.000000,1.000000,0.324439
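
For reference, the cosine similarity of two movie columns is their dot product divided by the product of their norms. The sketch below checks that definition against `cosine_similarity` on a made-up 3 x 3 matrix; transposing first puts movies in rows, which is why the notebook passes `df_matrix.T`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up centered ratings: rows are users, columns are movies.
X = np.array([[ 2.5, -1.5,  0.5],
              [-1.0,  3.0, -1.0],
              [ 0.5,  0.5, -2.0]])

items = X.T  # one row per movie
norms = np.linalg.norm(items, axis=1, keepdims=True)

# cos(i, j) = (v_i . v_j) / (|v_i| * |v_j|)
manual = (items @ items.T) / (norms @ norms.T)

print(np.allclose(manual, cosine_similarity(items)))
```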


In [29]:
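# Map each title to a movieId. first() always returns a real id, whereas the mean
# would not be a valid id if two movieIds happened to share the same title.
title_list = df_rating_movies.groupby('title')['movieId'].first()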

In [66]:
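# For every movie (by matrix column index), precompute all movie indices sorted from most to least similar.
offline_results = {
    movie_id: np.argsort(similarities)[::-1]
    for movie_id, similarities in enumerate(cosine_matrix)
}
class recc:
    def get_recommendations(self, movie_title, top_n):
        # Look up the raw movieId for the title, then its column index in the rating matrix.
        movie_id = title_list[movie_title]
        movie_csr_id = movie_encoder.transform([[movie_id]])[0, 0].astype(int)
        # A movie is most similar to itself, so the query title normally comes first.
        rankings = offline_results[movie_csr_id][:top_n]
        # Convert matrix indices back to raw movieIds and look up their titles.
        ranked_indices = movie_encoder.inverse_transform(rankings.reshape((-1, 1))).reshape(-1)
        temp_df2 = df_movies.set_index('movieId').loc[ranked_indices]
        return list(temp_df2['title'])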

In [67]:
a = recc()
a.get_recommendations('Toy Story (1995)',10)

['Toy Story (1995)',
 'Toy Story 2 (1999)',
 'Jurassic Park (1993)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Forrest Gump (1994)',
 'Lion King, The (1994)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Mission: Impossible (1996)',
 'Groundhog Day (1993)']
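
The output lists 'Toy Story (1995)' first because a movie's similarity to itself is 1.0. If the query title should not be recommended back, one option is to drop its own index before truncating. A minimal sketch on a made-up similarity matrix; `top_similar` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

# Made-up symmetric similarity matrix for four items.
sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.4, 0.2, 0.9, 1.0],
])


def top_similar(sim_matrix, item_index, top_n):
    """Return the top_n most similar item indices, excluding the item itself."""
    order = np.argsort(sim_matrix[item_index])[::-1]  # most similar first
    order = order[order != item_index]                # drop the query item
    return order[:top_n].tolist()


print(top_similar(sim, 0, 2))
print(top_similar(sim, 2, 2))
```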

In [41]:
import pickle
with open('recc.pkl', 'wb') as pickle_out:
    pickle.dump(a, pickle_out)
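
One caveat worth checking: the `recc` instance has no attributes of its own, and its method reads the module-level names `title_list`, `movie_encoder`, `offline_results`, and `df_movies`. Pickle saves an instance as a reference to its class plus the instance's state, so `recc.pkl` most likely does not contain the similarity data itself. A possible alternative, sketched with made-up stand-in values, is to pickle the data the recommender needs. The dictionary keys and the `recommend` function are hypothetical.

```python
import pickle

import numpy as np

# Made-up stand-ins for the notebook's precomputed objects.
artifacts = {
    "titles": ["Movie A", "Movie B", "Movie C"],
    # rankings[i] lists item indices from most to least similar to item i.
    "rankings": np.array([[0, 1, 2], [1, 0, 2], [2, 1, 0]]),
}

# Serialize to bytes; the same bytes could be written to a .pkl file instead.
payload = pickle.dumps(artifacts)
restored = pickle.loads(payload)


def recommend(store, title, top_n):
    """Look up a title and return the top_n titles from its stored ranking."""
    idx = store["titles"].index(title)
    return [store["titles"][i] for i in store["rankings"][idx][:top_n]]


print(recommend(restored, "Movie B", 2))
```

With this layout the file is self-contained, and the app that loads it only needs the small `recommend` function rather than the notebook's global state.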

-----------------------------------------
## Building a user-based movie recommendation system.
-----------------------------------------

In [33]:
cosine_matrix2 = cosine_similarity(df_matrix)

In [34]:
pd.DataFrame(cosine_matrix2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,600,601,602,603,604,605,606,607,608,609
0,1.000000,0.019400,0.053056,0.176920,0.120866,0.104418,0.143793,0.128547,0.055268,-0.000298,...,0.066256,0.149942,0.186978,0.056530,0.134412,0.121981,0.254200,0.262241,0.085434,0.098719
1,0.019400,1.000000,-0.002594,-0.003804,0.013183,0.016257,0.021567,0.023750,-0.003448,0.061880,...,0.198549,0.010888,-0.004030,-0.005345,-0.007919,0.011299,0.005813,0.032730,0.024373,0.089329
2,0.053056,-0.002594,1.000000,-0.004556,0.001887,-0.004577,-0.005634,0.001703,-0.003111,-0.005501,...,0.000150,-0.000585,0.011211,-0.004822,0.003678,-0.003246,0.012885,0.008096,-0.002963,0.015962
3,0.176920,-0.003804,-0.004556,1.000000,0.121018,0.065719,0.100602,0.054235,0.002417,0.015615,...,0.072848,0.114287,0.281866,0.039699,0.065493,0.164831,0.115118,0.116861,0.023930,0.062523
4,0.120866,0.013183,0.001887,0.121018,1.000000,0.294138,0.101725,0.426576,-0.004185,0.023471,...,0.061912,0.414931,0.095394,0.254117,0.141077,0.090158,0.145764,0.122607,0.258289,0.040372
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,0.121981,0.011299,-0.003246,0.164831,0.090158,0.047506,0.172499,0.081913,0.057989,0.054877,...,0.153892,0.084208,0.224637,0.035251,0.106752,1.000000,0.115999,0.188354,0.052385,0.093851
606,0.254200,0.005813,0.012885,0.115118,0.145764,0.142169,0.173293,0.178133,0.003257,-0.004809,...,0.080034,0.187588,0.173025,0.126267,0.101138,0.115999,1.000000,0.258245,0.142533,0.098518
607,0.262241,0.032730,0.008096,0.116861,0.122607,0.137954,0.305439,0.175912,0.086229,0.048373,...,0.136316,0.174069,0.164479,0.133734,0.144896,0.188354,0.258245,1.000000,0.109563,0.248944
608,0.085434,0.024373,-0.002963,0.023930,0.258289,0.207124,0.084494,0.421627,-0.003937,0.014983,...,0.029664,0.331053,0.046000,0.232115,0.089810,0.052385,0.142533,0.109563,1.000000,0.033713


In [64]:
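# Stored under a separate name so the movie-based `offline_results` above is not overwritten.
offline_user_results = {
    user_id: np.argsort(similarities)[::-1]
    for user_id, similarities in enumerate(cosine_matrix2)
}
class recc2:
    def get_user_recommendations(self, user_id, top_n):
        # `user_id` is used as a row index of the similarity matrix, not as a raw userId.
        # Index 0 is normally the user themselves, so [1:top_n] keeps top_n - 1 neighbours.
        rankings = offline_user_results[user_id][1:top_n]
        ranked_indices = user_encoder.inverse_transform(rankings.reshape((-1, 1))).reshape(-1)
        # Return the titles of the neighbours' 10 highest-rated movies.
        temp_df = (
            df_rating_movies.set_index('userId')
            .loc[ranked_indices]
            .sort_values('rating', ascending=False)
            .iloc[:10, :]
            .drop(['movieId', 'genres', 'rating', 'user_id_encoding', 'movie_id_encoding'], axis=1)
        )
        return list(temp_df['title'])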

In [65]:
b = recc2()
b.get_user_recommendations(15,2)

['Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Pulp Fiction (1994)',
 'Lord of the Rings: The Two Towers, The (2002)',
 'Spirited Away (Sen to Chihiro no kamikakushi) (2001)',
 'Nausicaä of the Valley of the Wind (Kaze no tani no Naushika) (1984)',
 'Big Lebowski, The (1998)',
 'Forrest Gump (1994)',
 'Godfather, The (1972)',
 'Shawshank Redemption, The (1994)',
 'Star Wars: Episode IV - A New Hope (1977)']
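
Note that `get_user_recommendations` uses `user_id` directly as an index into the precomputed rankings, so the argument behaves as a zero-based row position rather than a raw `userId`. In the encoded preview earlier in this notebook, `userId` 15 maps to encoding 14, which suggests the call above describes the user at row 15 rather than `userId` 15. A sketch of translating a raw id to its row with a fitted `OrdinalEncoder`, using made-up ids:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up raw user ids with gaps, as a column vector.
raw_user_ids = np.array([1, 5, 7, 15, 17]).reshape(-1, 1)
encoder = OrdinalEncoder().fit(raw_user_ids)

# Raw id -> row index of the similarity matrix.
row = int(encoder.transform([[15]])[0, 0])

# Row index -> raw id, the direction the notebook already uses.
back = int(encoder.inverse_transform([[row]])[0, 0])

print(row, back)
```

In the notebook this would correspond to calling `user_encoder.transform` on the raw id before indexing, if raw ids are what the app passes in.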

In [37]:
with open('recc2.pkl', 'wb') as pickle_out2:
    pickle.dump(b, pickle_out2)