#**Movie Recommendation System Using Collaborative Filtering**

###**Tasks to be performed**
- Downloading the data set from Dropbox
- Installing required libraries
- Importing required libraries
- Loading the dataset into a Pandas DataFrame
- Analyzing the dataset
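- Exploratory Data Analysis (Using Plotly Express)
- Filtering out rarely rated movies and rarely rating users
- Creating training and test sets
- Training the model
- Making predictions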

###**Downloading the data set from Dropbox**



In [1]:
#Please run this code in Google Colab

!wget https://www.dropbox.com/s/llg91ednbv6gcu5/ratings.csv

--2020-09-05 09:40:55--  https://www.dropbox.com/s/llg91ednbv6gcu5/ratings.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:6016:1::a27d:101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/llg91ednbv6gcu5/ratings.csv [following]
--2020-09-05 09:40:55--  https://www.dropbox.com/s/raw/llg91ednbv6gcu5/ratings.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc1b71534d2a4a0f283efb370dd1.dl.dropboxusercontent.com/cd/0/inline/A-wiK9fMyB7axMlvaLEU36DKqHmXexzOgF94FXTlAehYakZ-qJHYZMjHMedfu7eReDaLoC7PViTq0k2bTg2GV2d1lkqog5W-S6DdD7_1oXGLxUUs4gwkb65fz2UxwsMT2VA/file# [following]
--2020-09-05 09:40:55--  https://uc1b71534d2a4a0f283efb370dd1.dl.dropboxusercontent.com/cd/0/inline/A-wiK9fMyB7axMlvaLEU36DKqHmXexzOgF94FXTlAehYakZ-qJHYZMjHMedfu7eReDaLoC7PViTq0k2bTg2GV2d1lkqog5W-S6DdD7_1oXGLxUUs4gwkb

[**Click Here!**](https://www.dropbox.com/s/llg91ednbv6gcu5/ratings.csv) to download the dataset

###**Installing required libraries**

In [2]:

!pip3 install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
     |████████████████████████████████| 11.8MB 3.4MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1670922 sha256=db5182020316b16142f13effae4940a16654a03d537b33d2916946a22632cc4c
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


###**Importing required libraries**

In [3]:
import pandas as pd
import numpy as np

from surprise import Reader, Dataset, SVD

from surprise.accuracy import rmse, mae
from surprise.model_selection import cross_validate

###**Loading the dataset into a Pandas DataFrame**

In [4]:
df = pd.read_csv('ratings.csv')
df.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


###**Analyzing the dataset**

In [5]:
print('Shape of the dataframe', df.shape)
print('Contains:',df.shape[0],'rows')
print('Contains:',df.shape[1],'columns')

Shape of the dataframe (100836, 4)
Contains: 100836 rows
Contains: 4 columns


In [6]:
df.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

The output above shows that none of the columns contains null values.

In [7]:
# Drop the timestamp column, since it is not used in the rest of the notebook

df.drop('timestamp', inplace=True, axis = 1)

In [8]:
df.head()

   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0


###**Exploratory Data Analysis**

In [9]:
print('Number of Unique Movies:', df['movieId'].nunique())
print('Number of Unique Users:', df['userId'].nunique())


Number of Unique Movies: 9724
Number of Unique Users: 610
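
These counts also show how sparse the user-movie rating matrix is, which is the situation collaborative filtering is meant to handle. The short calculation below is only an illustration built from the numbers printed above; the variable names are not part of the notebook.

```python
# Figures taken from the outputs printed above.
n_ratings = 100836
n_users = 610
n_movies = 9724

# Every (user, movie) pair is a potential rating.
possible_pairs = n_users * n_movies

# Fraction of the user-movie matrix that actually holds a rating.
density = n_ratings / possible_pairs

print(f"Possible user-movie pairs: {possible_pairs}")
print(f"Density: {density:.2%}")
```

Only about 1.7% of the possible user-movie pairs have a rating; the model's job is to estimate the rest.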


In [10]:
import plotly.express as px

In [11]:
# Plotly draws its own figure, so no matplotlib figure is needed
fig = px.histogram(df, x='rating')
fig.show()

From the histogram, we can see that more than 25K of the ratings have the value 4. These are ratings, not users: the dataset contains only 610 users.

###**Dimensionality Reduction**

Here, we filter out rarely rated movies and users who rarely rate.

####Keep only movies with more than 3 ratings

In [12]:
filter_movies = df['movieId'].value_counts() > 3
filter_movies = filter_movies[filter_movies].index.tolist()

In [14]:
filter_movies[0:5]

[356, 318, 296, 593, 2571]

####Keep only users with more than 3 ratings

In [17]:
filter_users = df['userId'].value_counts() > 3
filter_users = filter_users[filter_users].index.tolist()

In [18]:
filter_users[0:5]

[414, 599, 474, 448, 274]

####Remove rarely rated movies and rarely rating users

In [19]:
print('Original Shape:',df.shape)
df = df[(df['movieId'].isin(filter_movies)) & (df['userId'].isin(filter_users))]
print('New Shape:', df.shape)

Original Shape: (100836, 3)
New Shape: (92394, 3)
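
The comparison `> 3` in the cells above keeps movies and users that have at least 4 ratings. The sketch below repeats the same filter on a small made-up DataFrame so the effect is easy to check by hand. The name `MIN_RATINGS` and the toy data are assumptions for illustration only.

```python
import pandas as pd

MIN_RATINGS = 3  # hypothetical name for the threshold hard-coded in the notebook

# Toy data: movie 10 has 4 ratings, user 1 has 4 ratings, everything else has 1.
toy = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 1, 1],
    "movieId": [10, 10, 10, 10, 11, 12, 13],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0, 3.5, 1.0],
})

# Count ratings per movie and per user on the unfiltered data.
movie_counts = toy["movieId"].value_counts()
user_counts = toy["userId"].value_counts()

# Keep IDs whose count is strictly greater than the threshold.
keep_movies = movie_counts[movie_counts > MIN_RATINGS].index
keep_users = user_counts[user_counts > MIN_RATINGS].index

filtered = toy[toy["movieId"].isin(keep_movies) & toy["userId"].isin(keep_users)]
print(filtered)
```

Only the row for user 1 and movie 10 survives. Because the counts are taken before filtering and the filter is applied once, a user or movie can end up below the threshold afterwards, as user 1 does here.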


###**Create Training and Test Sets**

In [21]:
cols = ['userId', 'movieId', 'rating']

reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df[cols], reader)

trainset = data.build_full_trainset()
antitest = trainset.build_anti_testset()
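
`build_anti_testset()` returns every (user, movie) pair that has no rating in the trainset, and by default it fills the unknown "true" rating with the global mean rating. The pure-Python sketch below mimics that behaviour on made-up data. It is an illustration of the idea, not Surprise's implementation.

```python
# Toy ratings as (user, item, rating) triples; the values are made up.
ratings = [(1, "a", 4.0), (1, "b", 2.0), (2, "a", 3.0)]

# Pairs that already have a rating.
known = {(u, i) for u, i, _ in ratings}
users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})

# Placeholder used in place of the unknown true rating.
global_mean = sum(r for _, _, r in ratings) / len(ratings)

# Every (user, item) pair without a rating, paired with the placeholder.
anti_testset = [
    (u, i, global_mean)
    for u in users
    for i in items
    if (u, i) not in known
]
print(anti_testset)
```

The size of this list grows with the number of users times the number of movies, so predicting on it is far more expensive than predicting on a held-out test set.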


###**Training the model**

In [23]:
#Creating the Model
algo = SVD(n_epochs=25, verbose= True)
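
Surprise's `SVD` is a matrix-factorization model: it estimates a rating as the global mean plus a user bias, a movie bias, and the dot product of a user factor vector and a movie factor vector, and the estimate is clipped to the rating scale by default. The sketch below shows only that prediction rule with hand-picked numbers. The function name and all values are assumptions, and the training step that learns the biases and factors is not shown.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, low=0.5, high=5.0):
    """Biased matrix-factorization estimate, clipped to [low, high]."""
    # Interaction term: dot product of user and item factor vectors.
    interaction = sum(p * q for p, q in zip(p_u, q_i))
    raw = mu + b_u + b_i + interaction
    # Clip to the rating scale given to Reader(rating_scale=(0.5, 5)).
    return min(high, max(low, raw))

# A raw estimate above the scale (3.5 + 0.5 + 0.5 + 1.0 = 5.5) is clipped to 5.0.
print(predict_rating(3.5, 0.5, 0.5, [1.0, 0.0], [1.0, 2.0]))

# A raw estimate inside the scale is returned unchanged (about 3.7).
print(predict_rating(3.5, 0.1, -0.2, [0.5, 0.5], [0.2, 0.4]))
```

Clipping is why an estimate can come out as exactly 5 for a user-movie pair.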

In [24]:
#Evaluating the Model with 5-fold cross-validation (the model is retrained on each fold)

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv = 5, verbose=True)


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24

{'fit_time': (5.6530921459198,
  5.719628095626831,
  5.753535509109497,
  5.760352611541748,
  5.720288038253784),
 'test_mae': array([0.65636409, 0.65932794, 0.66082022, 0.6632626 , 0.66027883]),
 'test_rmse': array([0.85336559, 0.86140975, 0.8610185 , 0.86889867, 0.85765739]),
 'test_time': (0.3175170421600342,
  0.12688493728637695,
  0.12573885917663574,
  0.2844655513763428,
  0.12717843055725098)}
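
The per-fold scores are easier to read as an average. The sketch below summarizes the `test_rmse` and `test_mae` values copied from the output above using only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross_validate output above.
test_rmse = [0.85336559, 0.86140975, 0.8610185, 0.86889867, 0.85765739]
test_mae = [0.65636409, 0.65932794, 0.66082022, 0.6632626, 0.66027883]

# Report the mean and sample standard deviation across the 5 folds.
for name, scores in (("RMSE", test_rmse), ("MAE", test_mae)):
    print(f"{name}: mean={mean(scores):.4f}, std={stdev(scores):.4f}")
```

The mean RMSE is about 0.860 and the mean MAE about 0.660, so on average the estimates are off by roughly two thirds of a star on the 0.5 to 5 scale. RMSE is larger than MAE because it penalizes large errors more heavily.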

###**Make Predictions**

In [27]:
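# cross_validate only fits the model on cross-validation folds,
# so fit it on the full training set before predicting
algo.fit(trainset)
predictions = algo.test(antitest)
predictions[0]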

Prediction(uid=1, iid=318, r_ui=3.529119856267723, est=5, details={'was_impossible': False})

In [30]:
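from collections import defaultdict

def get_top_n(predictions, n):
  """Return the n highest-estimated (movie ID, estimate) pairs for each user."""

  # Group the estimated ratings by user
  top_n = defaultdict(list)
  for uid, iid, _, est, _ in predictions:
    top_n[uid].append((iid, est))

  # Sort each user's estimates in descending order and keep the first n
  for uid, user_ratings in top_n.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    top_n[uid] = user_ratings[:n]

  return top_n

top_n = get_top_n(predictions, n=3)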

In [31]:
for uid, user_ratings in top_n.items():
  print(uid, [iid for (iid, rating) in user_ratings])

1 [318, 109487, 720]
2 [56782, 1248, 2571]
3 [1250, 79132, 3451]
4 [4973, 1223, 527]
5 [2858, 1411, 1213]
6 [912, 910, 903]
7 [1197, 720, 1276]
8 [2858, 2324, 898]
9 [1208, 1732, 1215]
10 [3730, 1207, 1136]
11 [912, 527, 1148]
12 [47, 50, 260]
13 [1262, 318, 1250]
14 [3451, 1673, 2239]
15 [5992, 904, 1617]
16 [44555, 1262, 3451]
17 [3275, 4144, 56782]
18 [912, 1204, 933]
19 [58559, 1207, 527]
20 [1197, 356, 318]
21 [110, 1204, 318]
22 [1221, 1617, 1203]
23 [2571, 1673, 593]
24 [2959, 1204, 2160]
25 [296, 1197, 1206]
26 [1213, 1221, 58559]
27 [899, 1136, 527]
28 [6787, 306, 2160]
29 [110, 912, 2160]
30 [50, 296, 1213]
31 [318, 778, 7361]
32 [58559, 858, 1221]
33 [2959, 1276, 2067]
34 [296, 1213, 1193]
35 [318, 1197, 720]
36 [4226, 260, 2858]
37 [1213, 858, 106642]
38 [1221, 589, 2324]
39 [3508, 1230, 6711]
40 [2067, 1228, 1206]
41 [44195, 589, 2117]
42 [111, 1884, 96821]
43 [260, 527, 593]
44 [318, 1250, 898]
45 [3451, 3030, 4973]
46 [2571, 1193, 912]
47 [1197, 912, 2324]
48 [1204, 5747

For example, the user with ID **610** is recommended the movies with IDs **1617**, **2329**, and **1223**.
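
To read off the recommendations for a single user instead of printing the whole mapping, a small lookup helper is enough. The sketch below is an assumption-laden illustration: `recommended_ids` is a hypothetical helper, and the toy `top_n` reuses the movie IDs quoted above for user 610 with placeholder estimates.

```python
# Hypothetical excerpt of the top_n mapping: user ID -> [(movie ID, estimate), ...].
# The movie IDs come from the text above; the estimates are placeholders.
top_n = {610: [(1617, 5.0), (2329, 5.0), (1223, 5.0)]}

def recommended_ids(top_n, user_id):
    """Return the recommended movie IDs for one user, or [] if the user is unknown."""
    # .get avoids creating an empty entry when top_n is a defaultdict.
    return [iid for iid, _ in top_n.get(user_id, [])]

print(recommended_ids(top_n, 610))
print(recommended_ids(top_n, 9999))
```

The returned values are movie IDs only; mapping them to titles would need a separate movies file, which this notebook does not load.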