**RECOMMENDATION SYSTEM**

(This notebook demonstrates a basic movie recommendation system built with several popular algorithms.)

The purpose of a recommendation system is to suggest relevant items to users.

Here we use the example of a movie recommendation system, which suggests the most relevant movies
to each user on the basis of that user's past interactions with other movies.

![Image](https://miro.medium.com/max/1132/1*N0-ikjPv4RUVvS-6KCgLPg.jpeg)


**EXPLORATORY DATA ANALYSIS**

The first task is to analyze the dataset used to train our models. We use the MovieLens-100K dataset,
which consists of 100,000 ratings given by 943 users to 1,682 movies.

We start by importing the data and cleaning it.

In [1]:
# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import missingno as msno


# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style = 'whitegrid' , color_codes=True)

from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");


# Make Visualizations better
params = { 
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
plt.rcParams.update(params)

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# IMPORTING THE DATA
# Data set columns
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.user', sep='|', names=u_cols,  encoding='latin-1')

# Removing duplicates
users = users.drop_duplicates(keep='first')

# Converting columns to specific data type
user_int_columns = ['user_id', 'age']
users[user_int_columns] = users[user_int_columns].astype(np.int64)

total_users = int(users.shape[0])

In [3]:
users.head()

| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


In [4]:
users.dtypes

user_id        int64
age            int64
sex           object
occupation    object
zip_code      object
dtype: object

In [5]:
i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action',
          'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

# Removing duplicates
movies = movies.drop_duplicates(keep='first')

# Dropping the unnecessary columns
movies.drop(['Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary',
                          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
                          'Thriller', 'War', 'Western', 'video release date', 'IMDb URL'], axis=1, inplace=True)

total_movies = int(movies.shape[0])

In [6]:
movies.head()

| | movie id | movie title | release date | unknown |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | 0 |


In [7]:
movies.dtypes

movie id         int64
movie title     object
release date    object
unknown          int64
dtype: object

In [8]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')

# Dropping the timestamp column
ratings.drop(['unix_timestamp'], axis=1, inplace=True)

# Removing duplicates
ratings = ratings.drop_duplicates(keep='first')

# Converting to appropriate data types
ratings_int_columns = ['user_id', 'movie_id', 'rating']
ratings[ratings_int_columns] = ratings[ratings_int_columns].astype(np.int64)

total_ratings = int(ratings.shape[0])

In [9]:
ratings.head()

| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |


In [10]:
ratings.dtypes

user_id     int64
movie_id    int64
rating      int64
dtype: object

**To get the movies with the highest total ratings**

In [11]:
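# Attach the movie title to every rating
frame = ratings.set_index('movie_id').join(movies.set_index('movie id'))
# Sum of the rating values per title; use .count() instead to get the number of ratings
ratings_per_movie = frame['rating'].groupby(frame['movie title']).sum()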

print('MOST RATED MOVIES')
print(ratings_per_movie.sort_values(ascending=False))

from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

data = frame['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / frame.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} movie-ratings'.format(frame.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

MOST RATED MOVIES
movie title
Star Wars (1977)                  2541
Fargo (1996)                      2111
Return of the Jedi (1983)         2032
Contact (1997)                    1936
Raiders of the Lost Ark (1981)    1786
                                  ... 
Leopard Son, The (1996)              1
Liebelei (1933)                      1
Bird of Prey (1996)                  1
Lotto Land (1995)                    1
Daens (1992)                         1
Name: rating, Length: 1664, dtype: int64
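
The totals above are sums of rating values, so a movie can rank high either because many users rated it or because it was rated well. A minimal sketch on a made-up toy frame (the column names mirror `frame`, but the data and the names `toy`, `summary`, `most_rated` and `best_rated` are hypothetical) that separates the three quantities:

```python
import pandas as pd

# Hypothetical toy data with the same columns as `frame` above.
toy = pd.DataFrame({
    'movie title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating':      [3, 3, 3, 5, 5, 4],
})

# Count, sum and mean of the ratings per movie in a single pass.
summary = toy.groupby('movie title')['rating'].agg(['count', 'sum', 'mean'])

# Rank by number of ratings (popularity) and by mean rating (quality).
most_rated = summary.sort_values('count', ascending=False)
best_rated = summary.sort_values('mean', ascending=False)

print(summary)
```

In this toy data, movie A has the most ratings while movie B has the highest mean, which is why the choice of aggregate matters when a list is labelled "most rated".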


**USER-MOVIE RATING MATRIX**

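The next task is to build a User-Movie rating matrix, in which each row is a user, each column is a movie, and a 0 marks a movie the user has not rated.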

In [12]:
data_matrix = np.zeros((total_users, total_movies))
for line in ratings.itertuples():
    data_matrix[line[1]-1, line[2]-1] = line[3]
    
print("USER_MOVIE RATING MATRIX")
print(data_matrix)

USER_MOVIE RATING MATRIX
[[5. 3. 4. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]]
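
Most entries of this matrix are zero: with 100,000 ratings spread over 943 x 1,682 cells, roughly 94% of the pairs have no rating. A minimal sketch, on made-up triples, of filling such a matrix without a Python loop and measuring its sparsity (the names `triples`, `toy_matrix` and `sparsity` are assumptions for illustration):

```python
import numpy as np

# Hypothetical (user_id, movie_id, rating) triples with 1-based ids.
triples = np.array([
    [1, 1, 5],
    [1, 3, 4],
    [2, 2, 3],
    [3, 1, 2],
])

n_users = triples[:, 0].max()
n_movies = triples[:, 1].max()

# Vectorised fill: same result as an itertuples loop, without Python-level iteration.
toy_matrix = np.zeros((n_users, n_movies))
toy_matrix[triples[:, 0] - 1, triples[:, 1] - 1] = triples[:, 2]

# Share of user-movie pairs that have no rating.
sparsity = 1.0 - np.count_nonzero(toy_matrix) / toy_matrix.size
print(toy_matrix)
print(f'Sparsity: {sparsity:.2%}')
```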


**MEMORY BASED APPROACH**

Memory-based approaches work directly with the values of recorded interactions, assume no model, and are essentially based on
nearest-neighbour search. This approach has low bias but high variance.

**Types of memory based approach**
* **User-user** : This method tries to identify the users with the most similar “interactions profile” (nearest neighbours) in order to suggest items that are the most popular among these neighbours (and that are “new” to our user). It is said to be “user-centred” because it represents users based on their interactions with items and evaluates distances between users.
* **Item-item** : This method finds items similar to the ones the user has already “positively” interacted with. Two items are considered similar if most of the users who have interacted with both of them did so in a similar way. It is said to be “item-centred” because it represents items based on the interactions users had with them and evaluates distances between those items.


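Implementing a memory-based approach requires computing a similarity matrix, here for both users and movies, using the cosine measure.
Note that `pairwise_distances` with `metric='cosine'` returns the cosine distance, which is 1 minus the cosine similarity, so identical rows score 0 and the diagonals of the matrices printed below are all 0.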


In [13]:
from sklearn.metrics.pairwise import pairwise_distances 

user_similarity = pairwise_distances(data_matrix, metric='cosine')
movie_similarity = pairwise_distances(data_matrix.T, metric='cosine')

print("USER-USER SIMILARITY MATRIX")
print(user_similarity)
print('\n')
print("MOVIE-MOVIE SIMILARITY MATRIX")
print(movie_similarity)

USER-USER SIMILARITY MATRIX
[[0.         0.83306902 0.95254046 ... 0.85138306 0.82049212 0.60182526]
 [0.83306902 0.         0.88940868 ... 0.83851522 0.82773219 0.89420212]
 [0.95254046 0.88940868 0.         ... 0.89875744 0.86658385 0.97344413]
 ...
 [0.85138306 0.83851522 0.89875744 ... 0.         0.8983582  0.90488042]
 [0.82049212 0.82773219 0.86658385 ... 0.8983582  0.         0.81753534]
 [0.60182526 0.89420212 0.97344413 ... 0.90488042 0.81753534 0.        ]]


MOVIE-MOVIE SIMILARITY MATRIX
[[0.         0.59761782 0.66975521 ... 1.         0.95281693 0.95281693]
 [0.59761782 0.         0.72693082 ... 1.         0.92170064 0.92170064]
 [0.66975521 0.72693082 0.         ... 1.         1.         0.90312495]
 ...
 [1.         1.         1.         ... 0.         1.         1.        ]
 [0.95281693 0.92170064 1.         ... 1.         0.         1.        ]
 [0.95281693 0.92170064 0.90312495 ... 1.         1.         0.        ]]
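
A minimal NumPy sketch of how cosine similarity and cosine distance relate, on a made-up rating matrix. The helper name `cosine_similarity_matrix` is an assumption for illustration and not part of scikit-learn's API:

```python
import numpy as np

def cosine_similarity_matrix(matrix):
    """Row-by-row cosine similarity; all-zero rows get similarity 0."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = matrix / norms
    return unit @ unit.T

# Hypothetical 3-user x 3-movie rating matrix (0 = not rated).
toy = np.array([
    [5.0, 0.0, 4.0],
    [5.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
])

sim = cosine_similarity_matrix(toy)
dist = 1.0 - sim  # cosine distance: 0 for identical directions

print(np.round(sim, 2))
print(np.round(dist, 2))
```

Users 0 and 1 have identical ratings, so their similarity is 1 and their distance is 0, while user 2 shares no rated movie with them and ends up at distance 1.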


**PREDICTION**

After getting the similarity matrices, the next step is to predict a rating for every user-movie pair, so that the most relevant movies can be suggested to each user.
For this, we write a prediction function.

In [14]:
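def predict(ratings, similarity, typ='user'):
    if typ == 'user':
        # Centre each user's ratings on that user's mean (unrated zeros included)
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # Weighted average of all users' deviations, added back to the user's mean
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif typ == 'movie':
        # Weighted average of the user's own ratings over all movies
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("typ must be 'user' or 'movie'")
    return pred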

user_user_prediction = predict(data_matrix, user_similarity, typ='user')
movie_movie_prediction = predict(data_matrix, movie_similarity, typ='movie')

print('USER-USER PREDICTION MATRIX')
print(user_user_prediction)
print()
print('MOVIE-MOVIE PREDICTION MATRIX')
print(movie_movie_prediction)

USER-USER PREDICTION MATRIX
[[ 2.06532606  0.73430275  0.62992381 ...  0.39359041  0.39304874
   0.3927712 ]
 [ 1.76308836  0.38404019  0.19617889 ... -0.08837789 -0.0869183
  -0.08671183]
 [ 1.79590398  0.32904733  0.15882885 ... -0.13699223 -0.13496852
  -0.13476488]
 ...
 [ 1.59151513  0.27526889  0.10219534 ... -0.16735162 -0.16657451
  -0.16641377]
 [ 1.81036267  0.40479877  0.27545013 ... -0.00907358 -0.00846587
  -0.00804858]
 [ 1.8384313   0.47964837  0.38496292 ...  0.14686675  0.14629808
   0.14641455]]

MOVIE-MOVIE PREDICTION MATRIX
[[0.44627765 0.475473   0.50593755 ... 0.58815455 0.5731069  0.56669645]
 [0.10854432 0.13295661 0.12558851 ... 0.13445801 0.13657587 0.13711081]
 [0.08568497 0.09169006 0.08764343 ... 0.08465892 0.08976784 0.09084451]
 ...
 [0.03230047 0.0450241  0.04292449 ... 0.05302764 0.0519099  0.05228033]
 [0.15777917 0.17409459 0.18900003 ... 0.19979296 0.19739388 0.20003117]
 [0.24767207 0.24489212 0.28263031 ... 0.34410424 0.33051406 0.33102478]]
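
A prediction matrix becomes a recommendation list once each user's row is sorted and the movies that user has already rated are hidden. A minimal sketch under those assumptions, with made-up matrices; the helper `top_n_unseen` is hypothetical and not part of the notebook:

```python
import numpy as np

def top_n_unseen(pred_matrix, rating_matrix, user_index, n=3):
    """Return column indices of the n highest-scored movies the user has not rated."""
    scores = pred_matrix[user_index].astype(float)   # astype returns a copy
    scores[rating_matrix[user_index] > 0] = -np.inf  # hide already-rated movies
    n_unseen = int(np.sum(np.isfinite(scores)))
    order = np.argsort(scores)[::-1]                 # best score first
    return order[:min(n, n_unseen)]

# Hypothetical 2-user x 4-movie data (0 = not rated).
toy_ratings = np.array([
    [5, 0, 0, 3],
    [0, 4, 0, 0],
])
toy_pred = np.array([
    [4.8, 2.1, 3.9, 3.0],
    [1.0, 4.5, 2.2, 3.7],
])

for user in range(toy_ratings.shape[0]):
    print(user, top_n_unseen(toy_pred, toy_ratings, user, n=2))
```

The returned values are zero-based column indices, so with the matrix layout used in this notebook the movie id would be the index plus 1.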


**MODEL BASED APPROACH**

This approach assumes an underlying “generative” model that explains the user-item interactions and tries to discover it in order to make new predictions. It has high bias but low variance.
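
One common way to picture such a model is a low-rank factorization, where every user and every movie is described by a few latent factors. The sketch below uses a plain truncated SVD on a made-up, fully observed matrix. It is only an illustration of the idea and is not the algorithm behind Surprise's `SVD` class, which learns biases and factors by stochastic gradient descent on the observed ratings only. The names `R` and `low_rank_approximation` are assumptions:

```python
import numpy as np

# Hypothetical, fully observed 4-user x 3-movie rating matrix.
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

def low_rank_approximation(matrix, k):
    """Best rank-k approximation in the least-squares sense (truncated SVD)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    user_factors = U[:, :k] * s[:k]   # one k-dimensional vector per user
    item_factors = Vt[:k, :]          # one k-dimensional vector per movie
    return user_factors @ item_factors

R_hat = low_rank_approximation(R, k=2)
rmse_k2 = np.sqrt(np.mean((R - R_hat) ** 2))
print(np.round(R_hat, 2))
print(f'Reconstruction RMSE with k=2: {rmse_k2:.3f}')
```

Fewer factors give a coarser reconstruction, which is the sense in which a model-based method trades variance for bias.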


In [15]:
from surprise import Reader, Dataset, KNNBasic, SVD, NMF, NormalPredictor, BaselineOnly, KNNWithMeans, KNNWithZScore, KNNBaseline, SVDpp, SlopeOne, CoClustering
from surprise.model_selection import GridSearchCV, cross_validate

reader = Reader(rating_scale=(0.0, 5.0))
data = Dataset.load_from_df( ratings[['user_id', 'movie_id', 'rating']], reader = reader )
sim_options = {'name': 'msd',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBasic(k=30,sim_options=sim_options)
cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9737  0.9797  0.9717  0.9789  0.9781  0.9764  0.0031  
Fit time          0.78    0.76    0.78    0.77    0.75    0.77    0.01    
Test time         7.50    7.36    7.47    7.47    7.44    7.45    0.05    


{'test_rmse': array([0.97369861, 0.97966303, 0.97174211, 0.97886048, 0.97805926]),
 'fit_time': (0.777292013168335,
  0.7622137069702148,
  0.7761285305023193,
  0.7652060985565186,
  0.7498533725738525),
 'test_time': (7.5039284229278564,
  7.363107204437256,
  7.474025011062622,
  7.47149658203125,
  7.441975831985474)}
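
The RMSE reported for each fold is the square root of the mean squared difference between the held-out ratings and the predictions made for them. A minimal sketch with made-up numbers (the helper `rmse` and both lists are hypothetical):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical held-out ratings and the predictions made for them.
true_ratings = [4, 3, 5, 2]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(true_ratings, predicted_ratings))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and a lower value means better predictions.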

In [16]:
n_neighbours = [10, 20, 30]
# 'k' is the number of neighbours; similarity settings must be nested under 'sim_options'
param_grid = {'k': n_neighbours, 'sim_options': {'name': ['msd'], 'user_based': [True, False]}}

gs = GridSearchCV(KNNBasic, measures=['RMSE'], param_grid=param_grid)
gs.fit(data)

print('\n\n\n')
# Best RMSE score
print('Best Score :', gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print('Best Parameters :', gs.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.

**SVD**

In [17]:
algo = SVD()
cross_validate(algo=algo, data=data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9338  0.9398  0.9285  0.9325  0.9468  0.9363  0.0064  
Fit time          9.87    9.94    9.95    10.09   9.86    9.94    0.08    
Test time         0.39    0.28    0.26    0.25    0.25    0.29    0.05    


{'test_rmse': array([0.93384748, 0.93980655, 0.92847748, 0.93246249, 0.94684817]),
 'fit_time': (9.872539281845093,
  9.940195798873901,
  9.945584297180176,
  10.090240001678467,
  9.857121706008911),
 'test_time': (0.38591575622558594,
  0.2761983871459961,
  0.26484084129333496,
  0.25478291511535645,
  0.2517826557159424)}

In [18]:
param_grid = {'n_factors' : [50, 60, 70], 'lr_all' : [0.5, 0.05, 0.01], 'reg_all' : [0.06, 0.04, 0.02]}

gs = GridSearchCV(algo_class=SVD, measures=['RMSE'], param_grid=param_grid)
gs.fit(data)

print('\n\n\n')
# Best RMSE score
print('Best Score :', gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print('Best Parameters :', gs.best_params['rmse'])





Best Score : 0.9170911081560135
Best Parameters : {'n_factors': 50, 'lr_all': 0.01, 'reg_all': 0.06}
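
Grid search evaluates every combination of the listed values, so its cost grows multiplicatively with each added parameter. A small sketch that enumerates the grid used above with the standard library; the fold count of 5 is an assumption about the default used when `cv` is not passed:

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {'n_factors': [50, 60, 70],
              'lr_all': [0.5, 0.05, 0.01],
              'reg_all': [0.06, 0.04, 0.02]}

# Cartesian product of all value lists, one dict per candidate model.
combinations = [dict(zip(param_grid.keys(), values))
                for values in product(*param_grid.values())]

n_folds = 5  # assumption: default number of folds when `cv` is not given
print(len(combinations), 'combinations,', len(combinations) * n_folds, 'model fits')
```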


**NNMF**

In [19]:
cross_validate(algo=NMF(), data=data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9664  0.9609  0.9642  0.9644  0.9736  0.9659  0.0042  
Fit time          9.14    9.35    9.21    9.35    9.47    9.30    0.12    
Test time         0.34    0.20    0.21    0.20    0.35    0.26    0.07    


{'test_rmse': array([0.96643131, 0.96088708, 0.96419593, 0.96440751, 0.97355646]),
 'fit_time': (9.14198613166809,
  9.35426950454712,
  9.210577487945557,
  9.345982074737549,
  9.469260692596436),
 'test_time': (0.34488987922668457,
  0.20435714721679688,
  0.20601582527160645,
  0.20436406135559082,
  0.34943580627441406)}

In [20]:
param_grid = {'n_factors' : [50, 60, 70],'verbose':[True]}

gs = GridSearchCV(algo_class=NMF, measures=['RMSE'], param_grid=param_grid)
gs.fit(data)

print('\n\n\n')
# Best RMSE score
print('Best Score :', gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print('Best Parameters :', gs.best_params['rmse'])

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
Processing