**Collaborative-filtering recommendation system for joke ratings**

In [4]:
import pandas as pd
import numpy as np

Data analysis

In [5]:
data=pd.read_csv('/content/jokes-data.csv')
data.head()

Unnamed: 0,id,user_id,joke_id,Rating
0,31030_110,31030,110,2.75
1,16144_109,16144,109,5.094
2,23098_6,23098,6,-6.438
3,14273_86,14273,86,4.406
4,18419_134,18419,134,9.375


In [6]:
data.shape

(1092059, 4)

In [7]:
data.isna().sum()

id         0
user_id    0
joke_id    0
Rating     0
dtype: int64

In [60]:
data.dtypes

user_id      int64
joke_id      int64
Rating     float64
dtype: object

There are no NaN values and no categorical columns.

In [8]:
!pip install surprise



In [9]:
data=data.drop('id',axis=1) # id is just user_id and joke_id joined together, so it adds no information
data.head()

Unnamed: 0,user_id,joke_id,Rating
0,31030,110,2.75
1,16144,109,5.094
2,23098,6,-6.438
3,14273,86,4.406
4,18419,134,9.375


In [10]:
data.Rating.describe()

count    1.092059e+06
mean     1.758394e+00
std      5.230860e+00
min     -1.000000e+01
25%     -1.719000e+00
50%      2.344000e+00
75%      5.781000e+00
max      1.000000e+01
Name: Rating, dtype: float64

The min and max show that ratings lie in the range [-10, 10], so they are rescaled to [0, 1] with MinMaxScaler.

In [11]:
from sklearn.preprocessing import MinMaxScaler
min_max=MinMaxScaler()
data.loc[:,['Rating']]=min_max.fit_transform(data[['Rating']])

In [59]:
data.head()

Unnamed: 0,user_id,joke_id,Rating
0,31030,110,0.6375
1,16144,109,0.7547
2,23098,6,0.1781
3,14273,86,0.7203
4,18419,134,0.96875


In [12]:
data.Rating.max()

1.0

In [13]:
data.Rating.min()

0.0
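
For reference, with a minimum of -10 and a maximum of 10 the min-max transform reduces to (x + 10) / 20. The sketch below reproduces the scaled values shown above in plain Python and then inverts them. The helper names `scale` and `unscale` are illustrative only and are not part of scikit-learn.

```python
# Min-max scaling by hand, assuming the observed min is -10 and the max is 10
# (as reported by data.Rating.describe()).
RATING_MIN, RATING_MAX = -10.0, 10.0

def scale(x):
    """Map a rating from [-10, 10] to [0, 1]."""
    return (x - RATING_MIN) / (RATING_MAX - RATING_MIN)

def unscale(y):
    """Map a scaled rating from [0, 1] back to [-10, 10]."""
    return y * (RATING_MAX - RATING_MIN) + RATING_MIN

raw = [2.75, 5.094, -6.438, 4.406, 9.375]   # first five ratings from data.head()
scaled = [scale(x) for x in raw]            # matches the scaled data.head() output
restored = [unscale(y) for y in scaled]     # round trip back to the raw ratings
```

Surprise's `Reader` accepts an arbitrary `rating_scale`, so `Reader(rating_scale=(-10, 10))` on the unscaled ratings would probably work as well; the scaling mainly changes the units in which RMSE is reported.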

USING KNNBASIC

In [14]:
data.columns

Index(['user_id', 'joke_id', 'Rating'], dtype='object')

In [15]:
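from surprise import Reader, Dataset, KNNBasic
reader = Reader(rating_scale=(0, 1))  # ratings were rescaled to [0, 1] above
# load_from_df expects exactly three columns in the order user, item, rating
data1 = Dataset.load_from_df(data[['user_id', 'joke_id', 'Rating']], reader)
algo = KNNBasic()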

In [16]:
from surprise.model_selection import cross_validate

In [None]:
cross_validate(algo,data1,measures=['RMSE'],cv=5)

output:"Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro."
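
A likely cause of the crash (an assumption, since the notebook shows no traceback) is that `KNNBasic` computes user-user similarities by default, which needs at least one dense n_users x n_users matrix in memory. The rough estimate below shows how quickly that grows. The user and joke counts are hypothetical placeholders; the real values would come from `data.user_id.nunique()` and `data.joke_id.nunique()`.

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Memory in GiB for one dense n x n matrix of float64 values."""
    return n * n * bytes_per_value / 2**30

# Hypothetical counts for illustration only; replace them with
# data.user_id.nunique() and data.joke_id.nunique().
n_users_example = 40_000
n_jokes_example = 150

user_based_gib = dense_matrix_gib(n_users_example)   # user-user similarities
item_based_gib = dense_matrix_gib(n_jokes_example)   # joke-joke similarities
print(f"user-based: {user_based_gib:.2f} GiB, item-based: {item_based_gib:.6f} GiB")
```

If there are far fewer jokes than users, passing `sim_options={'user_based': False}` to `KNNBasic` switches to joke-joke similarities, which may fit in the available RAM.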

USING SVD

In [17]:
from surprise import SVD
algo_svd=SVD()

In [18]:
cross_validate(algo_svd,data1,measures=['RMSE'],cv=5)

{'test_rmse': array([0.21182287, 0.21139194, 0.211345  , 0.21179851, 0.21165166]),
 'fit_time': (17.356159210205078,
  16.904132604599,
  17.566420078277588,
  17.085808992385864,
  16.679700136184692),
 'test_time': (2.1666688919067383,
  2.2095625400543213,
  2.2906124591827393,
  2.174330472946167,
  2.780599594116211)}

The mean RMSE for SVD across the 5 folds is about 0.2116 (on the scaled 0 to 1 ratings).
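
As a quick check, the per-fold scores can be averaged and converted back to the original rating units. This is a minimal sketch using the numbers printed above and assuming the scaler mapped [-10, 10] to [0, 1], so errors are multiplied by 20 to undo it.

```python
from statistics import mean

# Per-fold test RMSE copied from the cross_validate output above.
fold_rmse = [0.21182287, 0.21139194, 0.211345, 0.21179851, 0.21165166]

mean_rmse_scaled = mean(fold_rmse)   # RMSE in scaled [0, 1] rating units

# The scaler divided ratings by (10 - (-10)) = 20, so errors shrank by the
# same factor; multiply by 20 to express RMSE on the original [-10, 10] scale.
mean_rmse_original = mean_rmse_scaled * 20

print(round(mean_rmse_scaled, 4), round(mean_rmse_original, 2))  # prints 0.2116 4.23
```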

USING KNNWITHMEANS

In [19]:
from surprise import KNNWithMeans
sim_options={'name':'cosine'}

In [20]:
algo_knnm=KNNWithMeans(k=3,sim_options=sim_options)

In [None]:
cross_validate(algo_knnm,data1,measures=['RMSE'],cv=5)

KNNWithMeans also fails to complete.

Tuning the hyperparameters of SVD using grid search

In [21]:
from surprise.model_selection import GridSearchCV

In [45]:
parameters={"n_epochs": [5, 10,20], "lr_all": [0.002, 0.005], "n_factors":[10,20]}

In [46]:
gs = GridSearchCV(SVD, parameters, measures=["rmse", "mae"], cv=5)
gs.fit(data1)

In [50]:
print("Best score of RMSE :",gs.best_score["rmse"])
print("Best parameters of RMSE :",gs.best_params["rmse"])

Best score of RMSE : 0.21143879397456322
Best parameters of RMSE : {'n_epochs': 20, 'lr_all': 0.005, 'n_factors': 10}
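
To make the cost of the search concrete, the sketch below enumerates the same grid with `itertools.product`. It only counts combinations and does not call Surprise; the assumption is that the grid search trains one model per combination per fold.

```python
from itertools import product

# Same grid as the `parameters` dict above.
parameters = {"n_epochs": [5, 10, 20], "lr_all": [0.002, 0.005], "n_factors": [10, 20]}

names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

cv = 5
total_fits = len(combinations) * cv   # one model per combination per fold

print(len(combinations), total_fits)  # prints 12 60
```

Note that the best `n_epochs` and `lr_all` found here match the defaults of Surprise's `SVD`, and the best RMSE is only marginally lower than the untuned run, so a wider grid (for example over `reg_all`) might be worth exploring.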


In [52]:
best_algo=SVD(n_epochs=20,lr_all=0.005,n_factors=10) # SVD with best parameters

Fitting on the full dataset

In [54]:
trainingset = data1.build_full_trainset()
best_algo.fit(trainingset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x78255f7e11b0>

Predicting a rating for a user-joke pair that is already in the data

In [56]:
userid=14273
jokeid=86
prediction=best_algo.predict(userid,jokeid)
prediction.est

0.6828736649955393

The actual rating for this user_id and joke_id is 0.7203 after scaling (4.406 on the original scale).
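
The gap between prediction and actual rating can be expressed on both scales. The sketch below assumes the original range was [-10, 10]; `to_original` is an illustrative helper, not a Surprise or scikit-learn function. Because this pair was part of the training set, the error is not a held-out estimate.

```python
predicted_scaled = 0.6828736649955393   # prediction.est from the cell above
actual_scaled = 0.7203                  # scaled rating for user 14273, joke 86

def to_original(y, lo=-10.0, hi=10.0):
    """Invert min-max scaling, assuming the original range was [lo, hi]."""
    return y * (hi - lo) + lo

error_scaled = abs(actual_scaled - predicted_scaled)
error_original = abs(to_original(actual_scaled) - to_original(predicted_scaled))

print(f"{error_scaled:.4f} {error_original:.3f}")
```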