# Weekend Movie Trip

__Dataset Summary__ -
The MovieLens-100K dataset describes ratings given by users to movies on MovieLens, a movie recommendation service. It contains 100,000 ratings from 943 users on 1,682 movies. The dataset was released in April 1998.
<br>
<br>
Users were selected at random for inclusion. All selected users had rated at least 20 movies. Each user is represented by an id, age, sex, occupation and zip code. The data are contained in the files `u.data` (ratings), `u.user` (user information) and `u.item` (movie information).
<br>
<br>
This dataset can be downloaded at https://grouplens.org/datasets/movielens/100k/.


In this notebook, I use a K-nearest-neighbours approach to recommend movies that match a user's interests. I didn't spend much time on it, and most of it comes from the references below; I was just curious to see and compare the results.

References:
<br>
https://beckernick.github.io/matrix-factorization-recommender/
<br>
https://www.kaggle.com/shadow1409/movie-recommender/
<br>
https://medium.com/coinmonks/recommendation-engine-python-401c080c583e

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation


In [2]:
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('../data/external/ml-100k/u.user', sep='|', names=users_cols)

users.head()

,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [3]:
movie_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('../data/external/ml-100k/u.item', sep='|', names=movie_cols, usecols=range(5), encoding='latin-1')

movies.head(3)

,movie_id,title,release_date,video_release_date,imdb_url
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...


In [4]:
rating_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('../data/external/ml-100k/u.data', sep='\t', names=rating_cols)
ratings.head()


,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [5]:
avg_user_ratings_df = ratings.groupby(['user_id'], as_index=False)['rating'].mean()
avg_user_ratings_df.head()

,user_id,rating
0,1,3.610294
1,2,3.709677
2,3,2.796296
3,4,4.333333
4,5,2.874286


In [6]:
avg_user_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 943 entries, 0 to 942
Data columns (total 2 columns):
user_id    943 non-null int64
rating     943 non-null float64
dtypes: float64(1), int64(1)
memory usage: 22.1 KB


In [7]:
avg_movie_ratings_df = ratings.groupby(['movie_id'], as_index=False)['rating'].mean()
avg_movie_ratings_df.head()

,movie_id,rating
0,1,3.878319
1,2,3.206107
2,3,3.033333
3,4,3.550239
4,5,3.302326


In [8]:
avg_movie_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1682 entries, 0 to 1681
Data columns (total 2 columns):
movie_id    1682 non-null int64
rating      1682 non-null float64
dtypes: float64(1), int64(1)
memory usage: 39.4 KB
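
A per-movie mean says nothing about how many ratings it is based on. As a minimal sketch, assuming a toy table with the same column names as `ratings` (the values and the names `toy_ratings` and `movie_stats` are made up for illustration), the mean and the rating count can be computed side by side:

```python
import pandas as pd

# Toy ratings table with the same column names as `ratings` (values are made up)
toy_ratings = pd.DataFrame({
    "user_id":  [1, 2, 3, 1, 2, 3],
    "movie_id": [10, 10, 10, 20, 20, 30],
    "rating":   [5, 4, 3, 2, 4, 5],
})

# Named aggregation: one row per movie with its mean rating and its rating count
movie_stats = (
    toy_ratings.groupby("movie_id")["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .reset_index()
)
print(movie_stats)
```

A movie with a single rating (movie 30 in the toy data) can then be told apart from one whose average rests on many ratings.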


In [9]:
# Merge movie metadata with the ratings on the shared movie_id column
movie_ratings = pd.merge(movies, ratings)
movie_ratings.head()


,movie_id,title,release_date,video_release_date,imdb_url,user_id,rating,unix_timestamp
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,308,4,887736532
1,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,287,5,875334088
2,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,148,4,877019411
3,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,280,4,891700426
4,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,66,3,883601324


In [10]:
# Merge the rated movies with user demographics on the shared user_id column
df = pd.merge(movie_ratings, users)
df.head()


,movie_id,title,release_date,video_release_date,imdb_url,user_id,rating,unix_timestamp,age,sex,occupation,zip_code
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,308,4,887736532,60,M,retired,95076
1,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,308,5,887737890,60,M,retired,95076
2,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),308,4,887739608,60,M,retired,95076
3,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,308,4,887738847,60,M,retired,95076
4,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),308,5,887736696,60,M,retired,95076
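
Both merges above leave the join key implicit: `pd.merge` joins on whatever column names the two frames share. The sketch below, on made-up toy frames (`toy_movies` and `toy_ratings` are illustrative names, not part of the notebook), spells the key out and asks pandas to check that each movie appears only once on the left side:

```python
import pandas as pd

# Toy stand-ins for `movies` and `ratings` (values are made up)
toy_movies = pd.DataFrame({"movie_id": [1, 2], "title": ["A (1995)", "B (1996)"]})
toy_ratings = pd.DataFrame({
    "user_id":  [7, 8, 7],
    "movie_id": [1, 1, 2],
    "rating":   [4, 5, 3],
})

# Same inner join as pd.merge(toy_movies, toy_ratings), with the key made explicit.
# validate="one_to_many" raises MergeError if movie_id is duplicated in toy_movies.
merged = pd.merge(toy_movies, toy_ratings, on="movie_id", how="inner",
                  validate="one_to_many")
print(merged.shape)
```

Naming the key guards against an accidental join on a second shared column if the frames gain more columns later.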


In [11]:
# Drop the columns that are not needed for the recommender
df.drop(columns=['video_release_date', 'imdb_url', 'unix_timestamp'], inplace=True)

In [12]:
ratings.drop(columns=['unix_timestamp'], inplace=True)
movies.drop(columns=['video_release_date', 'imdb_url'], inplace=True)

In [13]:
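# Pivot table: one row per movie, one column per user, each cell holding that user's rating.
# reset_index(drop=True) replaces movie_id with a 0-based row position.
ratings_matrix = ratings.pivot_table(index=['movie_id'], columns=['user_id'], values='rating').reset_index(drop=True)
# Movies a user has not rated become 0
ratings_matrix.fillna(0, inplace=True)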


In [14]:
ratings_matrix.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
0,5.0,4.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,4.0,...,2.0,3.0,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0
1,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,4.0,...,5.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
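
Most cells in the matrix above are 0 because most users have not rated most movies. A small sketch on made-up data (`toy_ratings`, `toy_matrix` and `sparsity` are illustrative names) shows the same pivot and measures the share of missing cells before they are filled:

```python
import pandas as pd

# Toy ratings with the same column names as `ratings` (values are made up)
toy_ratings = pd.DataFrame({
    "user_id":  [1, 2, 1, 3],
    "movie_id": [1, 1, 2, 3],
    "rating":   [5, 4, 3, 2],
})

# Rows are movies, columns are users; missing (movie, user) pairs become NaN
toy_matrix = toy_ratings.pivot_table(index="movie_id", columns="user_id", values="rating")

# Fraction of cells with no rating, measured before NaN is replaced by 0
sparsity = toy_matrix.isna().to_numpy().mean()

# After this step a 0 means "not rated", not "rated zero"
toy_matrix = toy_matrix.fillna(0)
print(toy_matrix.shape, sparsity)
```

Filling with 0 keeps the matrix usable for distance computations, at the cost of treating "not rated" as the lowest possible value.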


In [15]:
# Cosine similarity between every pair of movies: pairwise_distances returns
# cosine distances, so similarity = 1 - distance

movie_similarity = 1 - pairwise_distances( ratings_matrix.values, metric="cosine" )


In [16]:
np.fill_diagonal( movie_similarity, 0 ) 
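
To make the two steps above concrete, here is a minimal sketch on three made-up item vectors (`X`, `sim` and `sim_manual` are illustrative names). It checks that `1 - pairwise_distances(..., metric="cosine")` matches the textbook formula, dot product divided by the product of the norms, and then zeroes the diagonal:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Three toy item vectors (rows = items, columns = users); values are made up
X = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Similarity as in the notebook: 1 minus the cosine distance
sim = 1 - pairwise_distances(X, metric="cosine")

# Same quantity by hand: dot products divided by the product of vector norms
norms = np.linalg.norm(X, axis=1)
sim_manual = (X @ X.T) / np.outer(norms, norms)

# Zero the diagonal so an item is never its own nearest neighbour
np.fill_diagonal(sim, 0)
np.fill_diagonal(sim_manual, 0)
print(np.round(sim, 3))
```

Items 0 and 1 are rated by the same users and come out highly similar, while item 2 shares no raters with them and gets a similarity of 0.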


In [17]:
ratings_matrix = pd.DataFrame( movie_similarity )
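
With the similarities in a DataFrame, looking up the neighbours of a title can be wrapped in a small function. The helper below is a hypothetical sketch, not part of the original notebook: `recommend`, `toy_movies` and `toy_sim` are made-up names, and it assumes that row `i` of the similarity matrix corresponds to row `i` of the movies table.

```python
import numpy as np
import pandas as pd

def recommend(title, movies_df, similarity_df, n=5):
    """Return the n rows of movies_df most similar to `title` (hypothetical helper)."""
    # Positional row(s) of the requested title
    positions = np.flatnonzero((movies_df["title"] == title).to_numpy())
    if len(positions) == 0:
        raise KeyError(f"{title!r} is not in the database")
    # Assumes row i of similarity_df lines up with row i of movies_df
    scores = similarity_df.to_numpy()[positions[0]]
    # Positions of the n largest scores, best first
    order = np.argsort(scores)[::-1][:n]
    return movies_df.iloc[order].assign(similarity=scores[order])

# Toy data (values are made up); the diagonal is already zeroed
toy_movies = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_sim = pd.DataFrame(np.array([
    [0.0, 0.9, 0.2, 0.5],
    [0.9, 0.0, 0.1, 0.3],
    [0.2, 0.1, 0.0, 0.4],
    [0.5, 0.3, 0.4, 0.0],
]))

result = recommend("A", toy_movies, toy_sim, n=2)
print(result)
```

Raising an error for an unknown title keeps the failure visible to the caller, and returning a new frame avoids adding a `similarity` column to the shared movies table.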

In [18]:
user_inp = "Toy Story (1995)"
try:
    # Row position of the chosen title; raises IndexError if the title is absent
    inp = movies[movies['title'] == user_inp].index.tolist()[0]
except IndexError:
    print("Sorry, the movie is not in the database!")
else:
    # Row `inp` of the similarity matrix holds every movie's similarity to the chosen one
    movies['similarity'] = ratings_matrix.iloc[inp]
    # The slice starts at 1, so the single highest-scoring row is left out
    print("Recommended movies based on your choice of ", user_inp, ": \n",
          movies.sort_values(["similarity"], ascending=False)[1:20])


Recommended movies based on your choice of  Toy Story (1995) : 
      movie_id                                         title release_date  \
180       181                     Return of the Jedi (1983)  14-Mar-1997   
120       121                 Independence Day (ID4) (1996)  03-Jul-1996   
116       117                              Rock, The (1996)  07-Jun-1996   
404       405                    Mission: Impossible (1996)  22-May-1996   
150       151  Willy Wonka and the Chocolate Factory (1971)  01-Jan-1971   
221       222               Star Trek: First Contact (1996)  22-Nov-1996   
99        100                                  Fargo (1996)  14-Feb-1997   
236       237                          Jerry Maguire (1996)  13-Dec-1996   
173       174                Raiders of the Lost Ark (1981)  01-Jan-1981   
6           7                         Twelve Monkeys (1995)  01-Jan-1995   
117       118                                Twister (1996)  10-May-1996   
171       172          