## About Practice Problem: Is this joke funny?
Many online businesses rely on customer reviews and ratings. Explicit feedback is especially important in the entertainment and e-commerce industries, where these ratings shape customer engagement. Netflix, for example, relies on rating data to power its recommendation engine and deliver movie and TV series recommendations personalized to each user.

This practice problem challenges participants to predict the ratings users give to jokes, given the ratings those same users gave to another set of jokes. The dataset is taken from the well-known Jester online joke recommender system.

In [None]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1670909 sha256=78047ed0ab00181c46d8792a4adc22da5fc396cbf7ca922a16da1be291657f18
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [None]:
import pandas as pd
from surprise import Reader,Dataset
from surprise.model_selection import cross_validate,KFold,train_test_split
from surprise import KNNBasic
from surprise import KNNWithMeans,KNNWithZScore,KNNBaseline
from surprise import SVD,SVDpp
from surprise import BaselineOnly
from surprise import NMF,SlopeOne,CoClustering
from surprise import NormalPredictor
from surprise import accuracy
from surprise.accuracy import rmse
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
%matplotlib inline

In [None]:
train_data = pd.read_csv("train.csv")
jokes_data = pd.read_csv("jokes.csv")
test_data = pd.read_csv("test.csv")

In [None]:

df = pd.read_csv('train.csv')
# Ratings in train.csv span -10 to 10 (see df.describe()), so the scale must cover that range
reader = Reader(rating_scale=(-10, 10))
data = Dataset.load_from_df(df[['user_id','joke_id','Rating']], reader)
trainingSet = data.build_full_trainset()

In [None]:
"""
Distribution of Ratings
"""
# Use a separate name so the Surprise dataset stored in `data` is not overwritten
rating_counts = df['Rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = rating_counts.index,
               text = ['{:.1f} %'.format(val) for val in (rating_counts.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = rating_counts.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} joke-ratings'.format(df.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
#iplot(fig)
fig.show(renderer="colab")

In [None]:
"""
Rating Distribution by Jokes
"""
# Number of ratings per joke (counts above 150 are clipped to 150)
ratings_per_joke=df.groupby('joke_id')['Rating'].count().clip(upper=150)
trace=go.Histogram(x=ratings_per_joke.values,
                   name='Ratings',
                   xbins=dict(start=0,
                              end=150,
                              size=2))
layout=dict(title ='Distribution of Number of Ratings per Joke (Clipped at 150)',
            xaxis=dict(title='Ratings per joke'),
            yaxis=dict(title='Count'),
            bargap=0.2)

# Assign to `fig` so this histogram is shown instead of the previous figure
fig=go.Figure(data=[trace],layout=layout)
fig.show(renderer='colab')

In [None]:
df.head(2)

Unnamed: 0,id,user_id,joke_id,Rating
0,31030_110,31030,110,2.75
1,16144_109,16144,109,5.094


In [None]:
df.groupby('joke_id')['Rating'].count().reset_index().sort_values('Rating',ascending=False)[:10]

Unnamed: 0,joke_id,Rating
7,8,8689
3,4,8636
2,3,8600
4,5,8581
6,7,8556
1,2,8532
5,6,8525
8,9,8524
78,79,5339
103,104,5290


In [None]:
"""
Rating Distribution by User
"""
# Number of ratings per user (counts above 150 are clipped to 150)

ratings_per_user=df.groupby('user_id')['Rating'].count().clip(upper=150)
trace=go.Histogram(x=ratings_per_user.values,
                   name='Ratings',
                   xbins=dict(start=0,
                              end=150,
                              size=2))
layout=dict(title ='Distribution of Number of Ratings per User (Clipped at 150)',
            xaxis=dict(title='Ratings per user'),
            yaxis=dict(title='Count'),
            bargap=0.2)

# Assign to `fig` so this histogram is shown instead of the previous figure
fig=go.Figure(data=[trace],layout=layout)
fig.show(renderer='colab')

In [None]:
df.groupby('user_id')['Rating'].count().reset_index().sort_values('Rating',ascending=False)[:10]

Unnamed: 0,user_id,Rating
33500,34002,45
21159,21492,42
3061,3100,42
361,366,41
29914,30370,40
40040,40661,40
15675,15929,39
24471,24847,39
36251,36803,39
36223,36774,39


In [None]:
df.head()

Unnamed: 0,id,user_id,joke_id,Rating
0,31030_110,31030,110,2.75
1,16144_109,16144,109,5.094
2,23098_6,23098,6,-6.438
3,14273_86,14273,86,4.406
4,18419_134,18419,134,9.375


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341121 entries, 0 to 341120
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       341121 non-null  object 
 1   user_id  341121 non-null  int64  
 2   joke_id  341121 non-null  int64  
 3   Rating   341121 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 10.4+ MB


In [None]:
df.describe()

Unnamed: 0,user_id,joke_id,Rating
count,341121.0,341121.0,341121.0
mean,20700.840344,63.976601,1.752048
std,11808.463348,44.12442,5.232872
min,1.0,1.0,-10.0
25%,10462.0,22.0,-1.75
50%,21344.0,62.0,2.344
75%,30771.0,104.0,5.781
max,40863.0,139.0,10.0
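
The summary above shows ratings running from -10 to 10. Surprise clips each estimate to the `rating_scale` passed to `Reader` (the default `clip=True` behaviour of `predict`), so the scale should cover that full range. Below is a minimal pure-Python sketch of the clipping step. `clip_to_scale` is a hypothetical helper written for illustration, not part of Surprise, and the raw estimates are invented values.

```python
def clip_to_scale(estimate, scale):
    """Clamp a raw estimate to the (low, high) rating scale."""
    low, high = scale
    return min(high, max(low, estimate))

raw_estimates = [-6.4, 2.9, 9.4]  # illustrative values, not model output

# A scale narrower than the data flattens every estimate outside it.
narrow = [clip_to_scale(e, (0, 5)) for e in raw_estimates]

# A scale matching the data leaves in-range estimates untouched.
full = [clip_to_scale(e, (-10, 10)) for e in raw_estimates]

print(narrow)
print(full)
```

With a too-narrow scale, negative ratings can never be predicted, which inflates the error on every joke a user disliked.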


In [None]:
df.duplicated().sum()

0

In [None]:
df.sort_values('Rating',ascending=False).head()

In [None]:
"""
Surprise Library
"""

"""
Building SVD Model  
"""
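svd=SVD(n_epochs=50,lr_all=0.01,reg_all=0.04,n_factors=250)
kf=KFold(n_splits=10,random_state=95)
# Fit on each fold's training part and score on its held-out part,
# so the reported RMSE is an out-of-sample estimate.
for trainset,testset in kf.split(data):
  svd.fit(trainset)
  pred=svd.test(testset)
  rmse(pred,verbose=True)
# Refit on the full training set before predicting on the test data.
svd.fit(trainingSet)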




RMSE: 3.4889
RMSE: 3.4813
RMSE: 3.5047
RMSE: 3.4886
RMSE: 3.4929
RMSE: 3.4728
RMSE: 3.4762
RMSE: 3.5059
RMSE: 3.5116
RMSE: 3.4920
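
Each `RMSE` line above is the root mean squared error between the true and estimated ratings of one fold. As a minimal sketch of that computation in plain Python, assuming made-up rating pairs (the name `root_mean_squared_error` is illustrative and is not the Surprise function):

```python
import math

def root_mean_squared_error(pairs):
    """pairs: iterable of (true_rating, estimated_rating) tuples."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up (true, estimated) ratings on the -10..10 scale.
example = [(2.0, 5.0), (-6.0, -2.0), (9.0, 9.0)]
score = root_mean_squared_error(example)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.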


In [None]:
trainsett=svd.trainset
print(svd.__class__.__name__)

SVD
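
For intuition, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. When a user or item was not seen during training, its bias and factors are taken as zero. The sketch below reproduces that rule in plain Python. `svd_estimate` and all numeric values are invented for illustration and do not come from the fitted model.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors):
    """r_hat = mu + b_u + b_i + q_i . p_u"""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + dot

mu = 1.75  # roughly the mean rating reported by df.describe()

# Known user and known joke: every term contributes.
known = svd_estimate(mu, 0.5, -0.25, [0.5, -1.0], [2.0, 0.5])

# Unseen user: bias and factors are zero, so only the
# global mean and the joke bias remain.
cold_user = svd_estimate(mu, 0.0, -0.25, [0.0, 0.0], [2.0, 0.5])

print(known, cold_user)
```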


In [None]:
"""
Prediction on Test Data
"""

# Predict a rating for every (user_id, joke_id) pair in the test set.
predictions = [
    svd.predict(user_id, joke_id).est
    for user_id, joke_id in zip(test_data['user_id'], test_data['joke_id'])
]
result = pd.DataFrame({
    'id': test_data['id'],
    'rating': predictions,
    'joke_id': test_data['joke_id'],
    'user_id': test_data['user_id'],
})

In [None]:
result.head()

Unnamed: 0,id,rating,joke_id,user_id
0,6194_11,2.909905,11,6194
1,19356_3,0.0,3,19356
2,23426_79,2.86983,79,23426
3,40030_3,0.0,3,40030
4,19806_115,5.0,115,19806


In [None]:
endResult = result.drop(['user_id','joke_id'],axis=1)
endResult.columns = ['id','Rating']


In [None]:
endResult.to_csv("brahm_jokes_submission1.csv",index=False)
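
Before uploading, it can help to sanity-check the file. The sketch below assumes the submission must contain exactly the `id` and `Rating` columns, one row per test id, with numeric ratings. `check_submission` is a hypothetical helper, and the sample rows reuse ids shown in `result.head()`.

```python
import csv
import io

def check_submission(csv_text, expected_ids):
    """Validate header, ids and numeric ratings; return the row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "Rating"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if sorted(row["id"] for row in rows) != sorted(expected_ids):
        raise ValueError("ids do not match the test set")
    for row in rows:
        float(row["Rating"])  # raises ValueError on a non-numeric rating
    return len(rows)

sample = "id,Rating\n6194_11,2.909905\n19356_3,0.0\n"
n_rows = check_submission(sample, ["19356_3", "6194_11"])
print(n_rows)
```

For the real file, the text could be read with `open("brahm_jokes_submission1.csv").read()` and the expected ids taken from `test_data['id']`.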