## Goal: Apply Collaborative Filtering to recommend jokes to users

### Plan (2020-05-19):
1. Define user's active participation: ratings > 5 <br>
2. Define the threshold of ratings for jokes <br>
   The final dataset contains 4093 users and 137 jokes. Each user has given at least 5 ratings and each joke has received > 100 ratings.
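
The two thresholds above can be applied to a wide user-by-joke table roughly as follows. This is a minimal sketch on toy data, not the code that produced `Trimmed Jester Data.csv`; the names `raw`, `min_user_ratings`, and `min_joke_ratings` are illustrative, and the toy thresholds are much smaller than the real ones.

```python
import numpy as np
import pandas as pd

# Toy wide table: rows are users, columns are jokes, NaN means "not rated".
raw = pd.DataFrame(
    {
        "j1": [1.0, np.nan, 3.0, np.nan],
        "j2": [2.0, 5.0, np.nan, np.nan],
        "j3": [np.nan, 4.0, 1.0, 2.0],
    },
    index=["u1", "u2", "u3", "u4"],
)

min_user_ratings = 2   # assumption for the toy data; the plan uses 5
min_joke_ratings = 2   # assumption for the toy data; the plan uses 100

# Keep jokes with enough ratings, then users with enough ratings.
joke_counts = raw.notna().sum(axis=0)
kept = raw.loc[:, joke_counts >= min_joke_ratings]
user_counts = kept.notna().sum(axis=1)
kept = kept.loc[user_counts >= min_user_ratings]
print(kept.shape)
```

Dropping users can push a joke back under its threshold, so on real data the two filters may need to be repeated until the shape stops changing.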
   
Collaborative Filtering:   
- Users who rate jokes in a similar manner share one or more hidden preferences.
- Users with shared preferences are likely to give ratings in the same way to the same jokes.
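
One way to make the two assumptions above concrete is to compare users' rating vectors directly. The sketch below uses cosine similarity on invented ratings; it illustrates the idea only and is not the method used later in this notebook, which trains an autoencoder instead.

```python
import numpy as np

# Toy ratings on the Jester scale (-10 to 10); rows are users, columns are jokes.
ratings = np.array([
    [ 8.0,  7.0, -6.0, -5.0],   # user A
    [ 9.0,  6.0, -7.0, -4.0],   # user B: tastes close to A
    [-8.0, -6.0,  7.0,  5.0],   # user C: roughly the opposite of A
])

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_similarity(ratings[0], ratings[1])
sim_ac = cosine_similarity(ratings[0], ratings[2])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Users A and B score close to 1, so a joke that B liked is a reasonable candidate for A; A and C score close to -1.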

### Plan (2020-05-27):
Goal: Go through the article and try to replicate half of the collaborative filtering code
1. Prepare the final data for modeling

### Plan (2020-05-31):
Goal: Understand what Susan Li did in the article: data wrangling & neural network.

Data wrangling: 
1. Normalize all rating data <br>
2. Fill NaN with a value

Neural network:
1. Set up neural network parameters: initialize the weights & bias randomly
2. Build the activation functions (encoder & decoder of a feed-forward autoencoder)
3. Construct the model and prediction
4. Define the loss function & optimizer (to minimize the mse)
5. Initialize placeholders & variables because TensorFlow requires it
6. Train the model
7. See how the model works
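
Steps 1 to 4 above can be previewed without TensorFlow. The following NumPy sketch runs one forward pass through a small sigmoid encoder and decoder and computes the mean squared reconstruction error. The layer sizes and the random input are toy assumptions, and no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy layer sizes (assumption); the notebook uses one input per joke, then 10 and 5 hidden units.
num_input, num_hidden_1, num_hidden_2 = 8, 4, 2
sizes = [num_input, num_hidden_1, num_hidden_2, num_hidden_1, num_input]

# Step 1: random weights and biases for each of the four layers.
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=b) for b in sizes[1:]]

def forward(x):
    """Steps 2-3: encode then decode, applying a sigmoid after each layer."""
    h = x
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)
    return h

# Step 4: mean squared error between the input and its reconstruction.
x = rng.uniform(size=(3, num_input))   # 3 toy users with ratings in [0, 1]
y_pred = forward(x)
mse = float(np.mean((x - y_pred) ** 2))
print(y_pred.shape, mse)
```

Because the last layer is a sigmoid, every reconstructed rating lies between 0 and 1, which is why the ratings are scaled to that range before training.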



In [1]:
## Import the required packages
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

In [2]:
## Check the Python version because of the error: ModuleNotFoundError: No module named 'tensorflow'
## Error fixed: https://github.com/tensorflow/tensorflow/issues/27935
from platform import python_version
print(python_version())

3.7.6


In [3]:
## Import the trimmed Jester Joke dataset
## Method 1: use pd.read_csv()
df = pd.read_csv('Trimmed Jester Data.csv', delimiter = ',')
print(df.head())
df.shape

   Unnamed: 0      0     6     7    12    14    15    16   17    18  ...  \
0           3   47.0   NaN   NaN   NaN   NaN   NaN   NaN  NaN -5.41  ...   
1           4   13.0   NaN   NaN   NaN   NaN   NaN   NaN  NaN -7.72  ...   
2           5   33.0   NaN   NaN   NaN   NaN   NaN   NaN  NaN  4.39  ...   
3           6  112.0 -4.45  7.54 -9.65 -7.26  7.83 -8.19  0.0  0.00  ...   
4           7   34.0   NaN   NaN   NaN  1.71   NaN   NaN  NaN  6.63  ...   

    148   149   150   151    152   153   154   155   156   157  
0   NaN   NaN  5.61 -4.51   0.00  0.00   NaN  0.00  5.93  4.19  
1   NaN   NaN   NaN   NaN    NaN  0.00   NaN   NaN   NaN  0.00  
2   NaN  3.19   NaN  0.00   3.41   NaN -2.32   NaN  0.00  2.93  
3 -1.97  1.89  0.00  0.00   7.38  3.19 -9.33 -7.26 -9.13 -8.19  
4  3.98   NaN  2.22  6.08  10.00   NaN   NaN  6.30  4.11  8.25  

[5 rows x 138 columns]


(4094, 138)

In [4]:
## Drop the second column
df.drop(['0'], axis = 1, inplace = True)
df.shape

(4094, 137)

In [5]:
## Rename the first column
df.rename(columns={"Unnamed: 0": "User ID"}, inplace = True)

In [6]:
## Make wide data long
long_df = df.melt(id_vars = 'User ID', var_name = 'Joke ID', value_name = 'Rating')
long_df

Unnamed: 0,User ID,Joke ID,Rating
0,3,6,
1,4,6,
2,5,6,
3,6,6,-4.45
4,7,6,
...,...,...,...
556779,7690,157,0.41
556780,7693,157,
556781,7694,157,
556782,7696,157,0.65


In [47]:
## rename the user ID
long_df['User ID']

range(1, 4094)

In [8]:
## normalize the ratings
scaler = MinMaxScaler()
long_df['Rating'] = long_df['Rating'].values.astype(float)
rating_scaled = pd.DataFrame(scaler.fit_transform(long_df['Rating'].values.reshape(-1,1)))
long_df['Rating'] = rating_scaled
long_df.describe()


Unnamed: 0,User ID,Rating
count,556784.0,97366.0
mean,3711.863214,0.52653
std,2252.559941,0.246083
min,3.0,0.0
25%,1679.0,0.3945
50%,3619.5,0.5
75%,5652.0,0.686
max,7697.0,1.0
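
The scaling step maps each rating to `(x - min) / (max - min)` and leaves missing values as NaN. The sketch below redoes that arithmetic by hand with NumPy, assuming the observed ratings span the full Jester range of -10 to 10; the sample values are illustrative.

```python
import numpy as np

# Toy ratings on the Jester scale; NaN marks a joke the user has not rated.
raw = np.array([-10.0, -4.45, 0.0, np.nan, 7.54, 10.0])

# Min-max scaling using the observed range, skipping NaN.
lo, hi = np.nanmin(raw), np.nanmax(raw)
scaled = (raw - lo) / (hi - lo)

# A raw rating of 0 lands at 0.5 when the observed range is [-10, 10].
filled = np.where(np.isnan(scaled), 0.5, scaled)
print(scaled)
print(filled)
```

Under that assumption, 0.5 on the scaled axis corresponds to a neutral raw rating of 0, which is the value the notebook later uses to fill missing entries.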


In [61]:
## Make data wide again
user_joke_matrix = long_df.pivot(index = 'User ID', columns = 'Joke ID', values = 'Rating')
user_joke_matrix

User ID \ Joke ID,100,101,102,103,104,105,106,107,108,109,...,89,90,91,92,93,94,95,96,97,98
3,,,,0.7500,0.2360,0.7925,0.7560,,,,...,,0.7235,0.5,,0.5000,,0.5000,0.7380,,
4,,,,0.5000,0.5000,,,,,,...,,,,,,,,,,
5,,,,0.8090,0.3260,0.9055,0.7470,0.699,,,...,,,,,,,,,,
6,0.011,,,0.7430,0.4340,,0.1045,0.639,0.6475,0.5,...,,,,0.2105,,,0.5000,0.8080,0.0785,0.5
7,,,,0.7735,0.8525,0.8285,0.6290,,,,...,,,,,,,,0.6445,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7690,,,,0.4410,0.3365,,0.5435,,,,...,,,,,,0.238,0.3335,,,
7693,,,,,0.3355,,0.6400,,,,...,,,,,,,,,,
7694,,,,,0.5000,0.5000,0.5000,,,,...,,,,,,,,,,
7696,,,,0.9735,0.5000,,,,,,...,,,,,0.2955,,,0.8830,,
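
The columns above start at 100 and end at 98 because the joke IDs are strings (they came from CSV column headers), and strings sort lexicographically. A small sketch of the effect on toy data, with casting to `int` shown as one possible way to get numeric order:

```python
import pandas as pd

long_toy = pd.DataFrame({
    "User ID": [1, 1, 1, 2, 2, 2],
    "Joke ID": ["6", "89", "100", "6", "89", "100"],   # string labels, as read from headers
    "Rating": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# String labels sort lexicographically, so "100" comes before "6" and "89".
as_str = long_toy.pivot(index="User ID", columns="Joke ID", values="Rating")

# Casting to int first gives numeric column order.
as_int = long_toy.assign(**{"Joke ID": long_toy["Joke ID"].astype(int)}).pivot(
    index="User ID", columns="Joke ID", values="Rating"
)
print(list(as_str.columns), list(as_int.columns))
```

The order does not affect training, as long as the same column list is used to map predictions back to joke IDs.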


In [62]:
## replace NA with 0.5
user_joke_matrix.fillna(0.5, inplace = True)
user_joke_matrix

User ID \ Joke ID,100,101,102,103,104,105,106,107,108,109,...,89,90,91,92,93,94,95,96,97,98
3,0.500,0.5,0.5,0.7500,0.2360,0.7925,0.7560,0.500,0.5000,0.5,...,0.5,0.7235,0.5,0.5000,0.5000,0.500,0.5000,0.7380,0.5000,0.5
4,0.500,0.5,0.5,0.5000,0.5000,0.5000,0.5000,0.500,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.5000,0.500,0.5000,0.5000,0.5000,0.5
5,0.500,0.5,0.5,0.8090,0.3260,0.9055,0.7470,0.699,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.5000,0.500,0.5000,0.5000,0.5000,0.5
6,0.011,0.5,0.5,0.7430,0.4340,0.5000,0.1045,0.639,0.6475,0.5,...,0.5,0.5000,0.5,0.2105,0.5000,0.500,0.5000,0.8080,0.0785,0.5
7,0.500,0.5,0.5,0.7735,0.8525,0.8285,0.6290,0.500,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.5000,0.500,0.5000,0.6445,0.5000,0.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7690,0.500,0.5,0.5,0.4410,0.3365,0.5000,0.5435,0.500,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.5000,0.238,0.3335,0.5000,0.5000,0.5
7693,0.500,0.5,0.5,0.5000,0.3355,0.5000,0.6400,0.500,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.5000,0.500,0.5000,0.5000,0.5000,0.5
7694,0.500,0.5,0.5,0.5000,0.5000,0.5000,0.5000,0.500,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.5000,0.500,0.5000,0.5000,0.5000,0.5
7696,0.500,0.5,0.5,0.9735,0.5000,0.5000,0.5000,0.500,0.5000,0.5,...,0.5,0.5000,0.5,0.5000,0.2955,0.500,0.5000,0.8830,0.5000,0.5


In [63]:
users = user_joke_matrix.index.tolist()
jokes = user_joke_matrix.columns.tolist()
user_joke_matrix = user_joke_matrix.to_numpy()
user_joke_matrix

array([[0.5   , 0.5   , 0.5   , ..., 0.738 , 0.5   , 0.5   ],
       [0.5   , 0.5   , 0.5   , ..., 0.5   , 0.5   , 0.5   ],
       [0.5   , 0.5   , 0.5   , ..., 0.5   , 0.5   , 0.5   ],
       ...,
       [0.5   , 0.5   , 0.5   , ..., 0.5   , 0.5   , 0.5   ],
       [0.5   , 0.5   , 0.5   , ..., 0.883 , 0.5   , 0.5   ],
       [0.5   , 0.5   , 0.5   , ..., 0.429 , 0.4825, 0.5   ]])

In [64]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


In [65]:
## Step 1: Set up neural network parameters: initialize the weights & bias randomly

num_input = long_df['Joke ID'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

X = tf.placeholder(tf.float64, [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

In [66]:
## Step 2: Create the activation functions (sigmoid encoder & decoder)

def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2

In [67]:
## Step 3: construct the model and prediction

encoder_op = encoder(X)
decoder_op = decoder(encoder_op)
y_pred = decoder_op
y_true = X

In [68]:
## Step 4: Define the loss function & optimizer & evaluation metrics

loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
eval_x = tf.placeholder(tf.int32)
eval_y = tf.placeholder(tf.int32)
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)
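
The loss above averages over every cell of the matrix, including the cells that were filled with 0.5, so the model is also rewarded for reproducing the filler. A common variation, shown here as a NumPy sketch and not used in this notebook, is to average the squared error over observed ratings only. The helper `masked_mse` and the `observed` mask are hypothetical names.

```python
import numpy as np

def masked_mse(y_true, y_pred, observed):
    """Mean squared error over entries where `observed` is True."""
    diff = (y_true - y_pred) * observed
    return float(np.sum(diff ** 2) / np.sum(observed))

# Toy data: 0.5 in y_true marks a filled (unobserved) cell.
y_true = np.array([[0.8, 0.5, 0.2],
                   [0.5, 0.5, 0.9]])
observed = np.array([[True, False, True],
                     [False, False, True]])
y_pred = np.array([[0.7, 0.1, 0.2],
                   [0.9, 0.9, 0.8]])

plain = float(np.mean((y_true - y_pred) ** 2))
masked = masked_mse(y_true, y_pred, observed)
print(plain, masked)
```

In this toy case the plain error is dominated by the filled cells, while the masked error reflects only the three real ratings.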

In [69]:
## Step 5: Initialize placeholders & variables because TensorFlow requires it

init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

In [70]:
## Step 6: Train the model

with tf.Session() as session:
    epochs = 100
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_joke_matrix.shape[0] / batch_size)
    user_joke_matrix = np.array_split(user_joke_matrix, num_batches)
    
    for i in range(epochs):

        avg_cost = 0
        for batch in user_joke_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

        print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_joke_matrix = np.concatenate(user_joke_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_joke_matrix})

    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    pred_data = pd.concat([pred_data, pd.DataFrame(preds)])

    pred_data = pred_data.stack().reset_index(name = 'Rating')
    pred_data.columns = ['User ID', 'Joke ID', 'Rating']
    pred_data['User ID'] = pred_data['User ID'].map(lambda value: users[value])
    pred_data['Joke ID'] = pred_data['Joke ID'].map(lambda value: jokes[value])



epoch: 1 Loss: 0.12096985326758747
epoch: 2 Loss: 0.02402451067986288
epoch: 3 Loss: 0.01079311532546477
epoch: 4 Loss: 0.010690696188248694
epoch: 5 Loss: 0.010606602404330825
epoch: 6 Loss: 0.01043772982866985
epoch: 7 Loss: 0.010204923692448386
epoch: 8 Loss: 0.010075837694879236
epoch: 9 Loss: 0.009982151762132758
epoch: 10 Loss: 0.009893849748989632
epoch: 11 Loss: 0.009774253270107097
epoch: 12 Loss: 0.009710388258099556
epoch: 13 Loss: 0.009634306405446139
epoch: 14 Loss: 0.009551794332420004
epoch: 15 Loss: 0.00947149701673409
epoch: 16 Loss: 0.009476351840742704
epoch: 17 Loss: 0.009426096373976305
epoch: 18 Loss: 0.009367045155597916
epoch: 19 Loss: 0.009349414516754192
epoch: 20 Loss: 0.009302900738789347
epoch: 21 Loss: 0.0092814267404249
epoch: 22 Loss: 0.009216726512712395
epoch: 23 Loss: 0.009185482587279945
epoch: 24 Loss: 0.009126246613771495
epoch: 25 Loss: 0.009098029261904544
epoch: 26 Loss: 0.009087102034064973
epoch: 27 Loss: 0.009049064126507989
epoch: 28 Loss: 0
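
The batching in Step 6 is worth a closer look: `int(n / batch_size)` rounds down, and `np.array_split` then spreads the leftover rows across the first batches, so some batches hold one row more than `batch_size`. A quick check with the notebook's own numbers (4094 users, batch size 35), using a zero matrix as a stand-in for the real data:

```python
import numpy as np

n_users, batch_size = 4094, 35
data = np.zeros((n_users, 3))            # stand-in for the user-by-joke matrix

num_batches = int(n_users / batch_size)  # rounds down
batches = np.array_split(data, num_batches)
sizes = sorted({len(b) for b in batches})
print(num_batches, sizes)
```

No rows are dropped, and since the batches differ by at most one row, averaging the per-batch losses is a close approximation of the per-row average.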

In [77]:
## Save the predictions
pred_data.to_csv('Joke Prediction Data.csv')

In [78]:
## Step 7: See how the model works
pred_data 

Unnamed: 0,User ID,Joke ID,Rating
0,3,100,0.494387
1,3,101,0.498305
2,3,102,0.503023
3,3,103,0.577925
4,3,104,0.657741
...,...,...,...
556779,7697,94,0.489600
556780,7697,95,0.512002
556781,7697,96,0.523419
556782,7697,97,0.518409


In [116]:
## See the top jokes: average the predicted rating per joke and keep the 10 highest
avg_pred = pred_data.groupby(['Joke ID'], as_index=False).mean()
top = avg_pred.sort_values(by = ['Rating'], ascending = False)
top10 = top.head(10)
top10.to_csv('Joke Prediction Data top10.csv')
top10

Unnamed: 0,Joke ID,User ID,Rating
90,52,3711.863214,0.541692
25,125,3711.863214,0.539783
4,104,3711.863214,0.533609
72,31,3711.863214,0.532956
125,88,3711.863214,0.530291
5,105,3711.863214,0.529687
3,103,3711.863214,0.529156
100,62,3711.863214,0.529009
63,20,3711.863214,0.528253
110,71,3711.863214,0.528221
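
Averaging predictions per joke gives one global ranking that is the same for every user. To recommend per user, a common approach is to rank each user's own predicted ratings and skip jokes they already rated. The sketch below shows this on toy arrays; `recommend` and the `rated` mask are hypothetical and not part of the notebook.

```python
import numpy as np

def recommend(preds, rated, joke_ids, user_row, n=2):
    """Return the n joke IDs with the highest predicted rating among unrated jokes."""
    scores = np.where(rated[user_row], -np.inf, preds[user_row])
    order = np.argsort(scores)[::-1][:n]
    return [joke_ids[i] for i in order if np.isfinite(scores[i])]

joke_ids = ["6", "7", "12", "14"]
preds = np.array([[0.9, 0.4, 0.7, 0.6],     # predicted ratings, one row per user
                  [0.2, 0.8, 0.3, 0.5]])
rated = np.array([[True, False, False, False],   # True where the user already rated the joke
                  [False, True, False, True]])

print(recommend(preds, rated, joke_ids, user_row=0))
print(recommend(preds, rated, joke_ids, user_row=1))
```

In the notebook's setting, `rated` could be taken from the pivoted matrix before the missing values are filled.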


In [118]:
## Joke IDs may be strings (from column headers); cast to int for positional indexing
ls = top10['Joke ID'].astype(int).to_list()

In [115]:
top10_fixed = pd.read_csv('Joke Prediction Data top10.csv', delimiter = ',')
joke = pd.read_csv('Dataset4JokeSet.csv', delimiter = ',', header = None)
joke

Unnamed: 0,0
0,"A man visits the doctor. The doctor says ""I ha..."
1,This couple had an excellent relationship goin...
2,Q. What's 200 feet long and has 4 teeth? A. ...
3,Q. What's the difference between a man and a t...
4,Q.\tWhat's O. J. Simpson's Internet address? ...
...,...
153,"Poodle: ""My life is a mess. My owner is mean, ..."
154,Did you hear that NASA has launched several co...
155,"A bear walks into a bar and says,""I'd like a b..."
156,A dog goes into a bar and orders a martini. Th...


In [126]:
joke['Joke ID'] = joke.index
joke

Unnamed: 0,0,Joke ID
0,"A man visits the doctor. The doctor says ""I ha...",0
1,This couple had an excellent relationship goin...,1
2,Q. What's 200 feet long and has 4 teeth? A. ...,2
3,Q. What's the difference between a man and a t...,3
4,Q.\tWhat's O. J. Simpson's Internet address? ...,4
...,...,...
153,"Poodle: ""My life is a mess. My owner is mean, ...",153
154,Did you hear that NASA has launched several co...,154
155,"A bear walks into a bar and says,""I'd like a b...",155
156,A dog goes into a bar and orders a martini. Th...,156


In [128]:
## Take an explicit copy so the assignment below does not raise SettingWithCopyWarning
joke_content = joke.iloc[ls].copy()
joke_content['Joke ID'] = joke_content.index
joke_content
  


Unnamed: 0,0,Joke ID
52,One Sunday morning William burst into the livi...,52
125,"A Briton, a Frenchman and a Russian are viewin...",125
104,A couple of hunters are out in the woods in th...,104
31,A man arrives at the gates of heaven. St. Pete...,31
88,A radio conversation of a US naval ship with ...,88
105,An engineer dies and reports to the pearly gat...,105
103,"As a pre-med student, I had to take a difficul...",103
62,"An engineer, a physicist and a mathematician a...",62
20,What's the difference between a used tire and ...,20
71,"On the first day of college, the Dean addresse...",71


In [135]:
## Work on explicit copies so the casts below do not raise SettingWithCopyWarning
top10 = top10.copy()
top10['Joke ID'] = top10['Joke ID'].astype(int)
joke_content = joke_content.copy()
joke_content['Joke ID'] = joke_content['Joke ID'].astype(int)
top10 = pd.merge(top10, joke_content, on = 'Joke ID')
top10.to_csv('top10 jokes.csv')
  


In [50]:
## case: fill in 0
pred_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,126,127,128,129,130,131,132,133,134,135
0,0.014528,0.056670,0.024655,0.580730,0.618806,0.609599,0.605157,0.274994,0.088468,0.077892,...,0.043299,0.143165,0.127695,1.576413e-01,0.146409,0.072779,0.277289,0.233185,0.136035,0.047001
1,0.000659,0.000667,0.000139,0.294620,0.669780,0.056703,0.009578,0.028399,0.000006,0.000209,...,0.000007,0.000275,0.000014,1.169974e-03,0.000145,0.000919,0.003523,0.011084,0.000070,0.000002
2,0.000498,0.000118,0.000010,0.715496,0.622577,0.833540,0.496342,0.841696,0.001017,0.121208,...,0.000419,0.006382,0.002571,4.037950e-01,0.003917,0.000559,0.019867,0.291991,0.001195,0.000142
3,0.191897,0.548057,0.559537,0.520399,0.513671,0.591423,0.413878,0.570052,0.295729,0.564082,...,0.305879,0.308246,0.324582,6.689008e-01,0.395414,0.495958,0.397205,0.597498,0.438343,0.304526
4,0.008540,0.050752,0.020444,0.588529,0.629927,0.630358,0.619807,0.290071,0.055665,0.047737,...,0.021177,0.096165,0.083610,1.252148e-01,0.090131,0.063852,0.242172,0.172950,0.239111,0.026252
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4089,0.032275,0.070742,0.033534,0.463472,0.598116,0.352492,0.373838,0.074195,0.024372,0.075057,...,0.023074,0.088981,0.036512,9.742065e-02,0.136166,0.102652,0.141033,0.329094,0.024611,0.015359
4090,0.000043,0.000805,0.000029,0.002656,0.402708,0.022242,0.474397,0.000007,0.000335,0.000001,...,0.000505,0.007647,0.004718,1.822645e-07,0.018359,0.000840,0.027313,0.000433,0.000014,0.001699
4091,0.006010,0.019365,0.004524,0.288303,0.622905,0.214164,0.542137,0.028556,0.009987,0.004072,...,0.003920,0.043448,0.017295,1.227934e-02,0.054030,0.032259,0.133670,0.050271,0.047137,0.003414
4092,0.010047,0.010792,0.002754,0.604392,0.569506,0.561907,0.257345,0.215771,0.001560,0.213692,...,0.001329,0.013565,0.002732,3.159234e-01,0.024000,0.026615,0.023827,0.577834,0.017336,0.000554
