## What is a Recommendation System?

A recommender (or recommendation) system is a subclass of information filtering systems that seeks to predict the rating or preference a user would give to an item.

They are primarily used in applications where a person or entity interacts with a product or service. To improve the user's experience, we personalize the product to their needs, which requires looking at their past interactions with it.

*In one line* -> **Specialized content for everyone.**

*For further info, see [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system).*

## Types of Recommender Systems

1. Popularity based
2. Classification based
3. Content based
4. Collaborative filtering based
5. Hybrid (content + collaborative)
6. Association rule mining

## Popularity based recommender system
As the name suggests, it recommends items based on what is currently popular. This is particularly useful when there is no past data about a user to base recommendations on.
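
As a minimal sketch of the idea, assume a hypothetical log of `(user, item)` interactions (invented here, not part of this notebook's dataset). Popularity is taken as the number of interactions per item, which is one common definition; the notebook below uses mean ratings instead. Every user receives the same top items.

```python
from collections import Counter

# Hypothetical interaction log: (user_id, item_id) pairs, for illustration only.
interactions = [
    ("u1", "joke_a"), ("u2", "joke_a"), ("u3", "joke_a"),
    ("u1", "joke_b"), ("u2", "joke_b"),
    ("u3", "joke_c"),
]

def most_popular(events, n):
    """Return the n item ids with the most interactions."""
    counts = Counter(item for _, item in events)
    return [item for item, _ in counts.most_common(n)]

top_items = most_popular(interactions, n=2)
print(top_items)
```

Because nothing user-specific enters the computation, the same list can be served to a brand-new user.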

### Import packages and dataset

In [56]:
import pandas as pd
import numpy as np

data = pd.read_csv('../input/jester-online-joke-recommender/jesterfinal151cols.csv')
print(data.shape)
data.head()

(50691, 151)


Unnamed: 0,62,99,99.1,99.2,99.3,0.21875,99.4,-9.28125,-9.28125.1,99.5,...,99.78,99.79,99.80,99.81,99.82,99.83,99.84,99.85,99.86,99.87
0,34,99,99,99,99,-9.6875,99,9.9375,9.53125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
1,18,99,99,99,99,-9.84375,99,-9.84375,-7.21875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
2,82,99,99,99,99,6.90625,99,4.75,-5.90625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
3,27,99,99,99,99,-0.03125,99,-9.09375,-0.40625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
4,46,99,99,99,99,-2.90625,99,-2.34375,-0.5,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


In [57]:
# Count NaN values per column; they must be dropped or imputed in the preprocessing step.
data.isnull().sum()

62       0
99       0
99.1     0
99.2     0
99.3     0
        ..
99.83    1
99.84    1
99.85    1
99.86    1
99.87    1
Length: 151, dtype: int64

### Data Preprocessing

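The dataset contains no column headers. The first column is treated as the user id and the subsequent columns are the ratings for 150 jokes. Because `pd.read_csv` was called without `header=None`, the first data row was consumed as the column labels (62, 99, 99.1, ...), so it is not among the 50691 rows reported above. There are also NaN values towards the end of the data.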

**Things to do:**
* Add column headers
* Rename column 0 to `user_id`
* Rename the joke rating columns to 1-150
* Replace the NaN values found in some rows with 0
* Replace ratings of 99.0, which mark jokes the user did not rate, with 0
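
The steps above can be sketched end to end on a tiny in-memory stand-in for the real file. The three rows, the variable names `raw`, `toy` and `rating_cols`, and the use of `header=None` are assumptions made for illustration; the cells below work on the full dataset instead.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: user_id followed by three joke ratings.
# 99 marks "not rated"; the empty last field of row 3 is parsed as NaN.
raw = io.StringIO(
    "34,99,-9.6875,9.9375\n"
    "18,99,-9.84375,-7.21875\n"
    "19,99,-1.75,\n"
)

# header=None keeps the first data row instead of using it as column names.
toy = pd.read_csv(raw, header=None)

# Column 0 becomes user_id, the remaining columns are numbered from 1.
toy.columns = ["user_id"] + list(range(1, toy.shape[1]))

# Apply the replacements only to the rating columns: NaN -> 0, then 99 -> 0.
rating_cols = toy.columns[1:]
toy[rating_cols] = toy[rating_cols].fillna(0).replace(99, 0)

print(toy)
```

Restricting the replacement to `rating_cols` avoids overwriting a value of 99 in the first column by accident.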

In [58]:
# Relabel all 151 columns with the integers 0-150
data.columns = range(data.shape[1])  # data.shape[1] is the number of columns
print(data.columns)  # RangeIndex with start 0, stop 151, step 1
data.head()

RangeIndex(start=0, stop=151, step=1)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
0,34,99,99,99,99,-9.6875,99,9.9375,9.53125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
1,18,99,99,99,99,-9.84375,99,-9.84375,-7.21875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
2,82,99,99,99,99,6.90625,99,4.75,-5.90625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
3,27,99,99,99,99,-0.03125,99,-9.09375,-0.40625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
4,46,99,99,99,99,-2.90625,99,-2.34375,-0.5,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


In [59]:
# Rename column 0 to user_id
data.rename(columns = {0: 'user_id'}, inplace = True)
data.head()

Unnamed: 0,user_id,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
0,34,99,99,99,99,-9.6875,99,9.9375,9.53125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
1,18,99,99,99,99,-9.84375,99,-9.84375,-7.21875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
2,82,99,99,99,99,6.90625,99,4.75,-5.90625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
3,27,99,99,99,99,-0.03125,99,-9.09375,-0.40625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
4,46,99,99,99,99,-2.90625,99,-2.34375,-0.5,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


In [60]:
# Replace all NaN values with 0
data = data.fillna(0)
data.tail()

Unnamed: 0,user_id,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
50686,15,99,99,99,99,99.0,99,-5.9375,-3.71875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,-1.15625,99.0,-1.1875
50687,12,99,99,99,99,99.0,99,-5.71875,-8.15625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,3.0625,99.0,99.0
50688,14,99,99,99,99,99.0,99,0.09375,0.09375,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
50689,29,99,99,99,99,99.0,99,-0.125,-0.125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
50690,19,99,99,99,99,99.0,99,-1.75,-0.09375,99,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
# Replace all 99.0 ratings (not rated) with 0
data = data.replace(99.0, 0)
data.head()

Unnamed: 0,user_id,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
0,34,0,0,0,0,-9.6875,0,9.9375,9.53125,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,18,0,0,0,0,-9.84375,0,-9.84375,-7.21875,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,82,0,0,0,0,6.90625,0,4.75,-5.90625,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,27,0,0,0,0,-0.03125,0,-9.09375,-0.40625,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,46,0,0,0,0,-2.90625,0,-2.34375,-0.5,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Some of these ratings are as high as 6.9 while others are as low as -9.68. Let's standardize only the rating columns using **StandardScaler**, which rescales each column so that it has a mean of 0 and a standard deviation of 1. Standardization only shifts and rescales the values; it does not change the shape of their distribution, so it does not turn the data into a Gaussian (normal) distribution.

*For further info, see the [StandardScaler documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).*
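
To make the transformation concrete, here is a minimal NumPy sketch of the per-column formula `z = (x - mean) / std` that `StandardScaler` applies with its default settings, using the population standard deviation (`ddof=0`). The toy matrix `X` is invented for illustration.

```python
import numpy as np

# Toy ratings: 4 users x 2 jokes (made-up values).
X = np.array([
    [-9.0, 2.0],
    [ 7.0, 4.0],
    [ 1.0, 6.0],
    [ 5.0, 8.0],
])

col_mean = X.mean(axis=0)  # one mean per column (joke)
col_std = X.std(axis=0)    # population std (ddof=0), as StandardScaler uses

# Standardize each column independently.
Z = (X - col_mean) / col_std

print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

This sketch does not guard against constant columns, where the standard deviation is 0. `StandardScaler` handles that case by leaving the scale at 1, so such columns come out as all zeros.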

**Things to do:**
* Extract the rating columns
* Fit `StandardScaler` on the ratings and transform them

In [62]:
#Extract only ratings columns
ratings = data.iloc[:, 1:]
ratings

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,141,142,143,144,145,146,147,148,149,150
0,0,0,0,0,-9.68750,0,9.93750,9.53125,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000
1,0,0,0,0,-9.84375,0,-9.84375,-7.21875,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000
2,0,0,0,0,6.90625,0,4.75000,-5.90625,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000
3,0,0,0,0,-0.03125,0,-9.09375,-0.40625,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000
4,0,0,0,0,-2.90625,0,-2.34375,-0.50000,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50686,0,0,0,0,0.00000,0,-5.93750,-3.71875,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.15625,0.0,-1.1875
50687,0,0,0,0,0.00000,0,-5.71875,-8.15625,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.06250,0.0,0.0000
50688,0,0,0,0,0.00000,0,0.09375,0.09375,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000
50689,0,0,0,0,0.00000,0,-0.12500,-0.12500,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0000


In [63]:
# Fit StandardScaler on the ratings and standardize them column by column
from sklearn.preprocessing import StandardScaler
ratings_ss = StandardScaler().fit_transform(ratings)
ratings_ss

array([[ 0.        ,  0.        ,  0.        , ..., -0.30631715,
        -0.2130309 , -0.33077684],
       [ 0.        ,  0.        ,  0.        , ..., -0.30631715,
        -0.2130309 , -0.33077684],
       [ 0.        ,  0.        ,  0.        , ..., -0.30631715,
        -0.2130309 , -0.33077684],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -0.30631715,
        -0.2130309 , -0.33077684],
       [ 0.        ,  0.        ,  0.        , ..., -0.30631715,
        -0.2130309 , -0.33077684],
       [ 0.        ,  0.        ,  0.        , ..., -0.30631715,
        -0.2130309 , -0.33077684]])

### Recommend Popular Jokes
Recommend the top n most popular jokes using mean ratings.

**Things to do:**
* Find the mean rating of every joke
* Convert the resulting array into a DataFrame so that it can be sorted in descending order
* Recommend the top n jokes

In [64]:
# Find the mean rating for all the jokes
mean_ratings = ratings_ss.mean(axis = 0)  # axis=0 averages over the rows (users), giving one value per joke
print(mean_ratings.shape)  # (150,): one mean score for each of the 150 jokes
mean_ratings

(150,)


array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -2.94359895e-18,  0.00000000e+00, -1.12137103e-18, -4.70975832e-17,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -2.57915337e-17,  0.00000000e+00, -1.23350813e-17, -5.63488942e-17,
       -2.97163323e-17,  1.12137103e-17, -4.48548412e-18, -4.62565550e-18,
       -2.49505054e-17, -2.65274334e-17, -3.86873005e-17,  4.19112422e-17,
        3.13983888e-17,  1.17743958e-17,  3.08377033e-18,  2.45299913e-17,
        4.02291857e-17, -2.67727333e-17,  7.28891169e-18, -2.57915337e-17,
        4.16133781e-18,  4.15958566e-17,  4.90599825e-18, -6.92446611e-17,
        2.91556468e-17, -3.42018164e-17, -3.33607881e-17,  3.05573606e-17,
       -2.29881061e-17, -2.48103340e-17,  3.11881318e-18,  1.03726820e-17,
       -3.95283288e-17,  3.32907024e-17, -1.26154241e-17,  2.24274206e-18,
        2.57915337e-17, -2.60718764e-17,  3.43419878e-18,  2.87351326e-18,
       -3.98086716e-17, -

In [65]:
# Convert the array into a DataFrame and name the column for readability
mean_ratings = pd.DataFrame(mean_ratings)
mean_ratings.rename(columns = {0: 'mean_joke_ratings'}, inplace = True) 
mean_ratings

Unnamed: 0,mean_joke_ratings
0,0.000000e+00
1,0.000000e+00
2,0.000000e+00
3,0.000000e+00
4,-2.943599e-18
...,...
145,6.167541e-18
146,4.050953e-17
147,3.083770e-18
148,2.242742e-17


In [67]:
# Recommend the top n most popular jokes
n = 10
# mean_ratings.iloc[:, 0].argsort()[:-(n+1):-1]  # alternative that outputs only the index positions
# The index is zero-based, so index i corresponds to joke column i + 1
mean_ratings.sort_values(by = 'mean_joke_ratings', ascending = False).head(n)  # outputs index and mean rating

Unnamed: 0,mean_joke_ratings
117,5.845146e-17
78,5.676941e-17
124,5.326512e-17
128,5.046170e-17
97,4.429416e-17
105,4.387364e-17
23,4.191124e-17
33,4.159586e-17
93,4.149073e-17
62,4.142064e-17


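**The recommender system returns the top 10 jokes ranked by the mean of their standardized ratings.**

Note that `StandardScaler` centers every column, so each of these means is zero up to floating-point error (all values above are on the order of 1e-17). The resulting order therefore reflects rounding noise rather than popularity. Averaging the unscaled `ratings` gives a meaningful ranking, ideally with unrated entries excluded rather than counted as 0.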
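
A minimal sketch of ranking by the mean of the raw ratings follows. It differs from the cells above in two ways: it skips standardization, since standardized columns all average to zero by construction, and it marks unrated jokes as NaN so that `mean()` ignores them. The toy data and the helper `top_n_by_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy raw ratings: rows are users, columns are joke ids; 99 means "not rated".
toy_ratings = pd.DataFrame(
    {
        1: [4.0, 99.0, 6.0, 99.0],
        2: [-3.0, -5.0, 99.0, -1.0],
        3: [8.0, 9.0, 7.0, 99.0],
        4: [99.0, 1.0, 99.0, 3.0],
    }
)

def top_n_by_mean(ratings, n, unrated=99.0):
    """Rank jokes by the mean of the ratings that were actually given."""
    observed = ratings.replace(unrated, np.nan)  # NaN entries are skipped by mean()
    means = observed.mean(axis=0)                # one mean per joke column
    return means.sort_values(ascending=False).head(n)

top = top_n_by_mean(toy_ratings, n=2)
print(top)
```

Because the result keeps the column labels as its index, the joke ids are reported directly and no zero-based offset has to be corrected.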