Collaborative filtering: we recommend hotels to a user based on the ratings of other users whose ratings are similar to that user's.
Approach: an autoencoder trained on the user-hotel rating matrix, a nonlinear relative of matrix factorization.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
import re

## Pre-processing steps<br>
-use usernames, hotel IDs and ratings<br>
-clean the hotel IDs (remove ';' from the ID)<br>
-keep hotels that have received at least 30 ratings from users<br>
-keep users that have given at least 20 ratings, because collaborative filtering algorithms require users who actively rate.
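
The two count filters listed above can be sketched compactly with `groupby(...).transform('count')`. This is only an illustration: the toy frame and the names `min_hotel_ratings` and `min_user_ratings` are assumptions, not the notebook's data or variables. The order (hotels first, then users on the remaining rows) mirrors the cells that follow.

```python
import pandas as pd

# Toy ratings (hypothetical data, for illustration only).
toy = pd.DataFrame({
    "username": ["a", "a", "a", "b", "b", "c"],
    "hotel_id": ["h1", "h2", "h3", "h1", "h2", "h1"],
    "overall_rating": [5.0, 4.0, 3.0, 4.0, 2.0, 1.0],
})

min_hotel_ratings = 2  # assumed toy threshold (the notebook uses 30)
min_user_ratings = 2   # assumed toy threshold (the notebook uses 20)

# Keep hotels with enough ratings.
hotel_counts = toy.groupby("hotel_id")["overall_rating"].transform("count")
kept = toy[hotel_counts >= min_hotel_ratings]

# Then keep users with enough ratings among the remaining rows.
user_counts = kept.groupby("username")["overall_rating"].transform("count")
kept = kept[user_counts >= min_user_ratings]

print(kept)
```

Because the user filter runs after the hotel filter, a hotel that passed the first filter can end up with fewer ratings in the final frame than its threshold.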


In [2]:
hotel_rating_df = pd.read_csv('dataset.csv', sep=";",index_col=0, low_memory=False, dtype={'hotel_id':object})[['username','hotel_id','overall_rating']]

In [3]:
# hotelDf.username=hotelDf.username.fillna("missing")

In [4]:
hotel_rating_df.hotel_id=hotel_rating_df.hotel_id.map(lambda x: str(x).replace(';',''))

In [5]:
hotel_rating_df.shape

(878533, 3)

In [6]:
# check how many hotels we have in the data
len(hotel_rating_df.hotel_id.unique())

3946

In [7]:
# count the number of ratings each hotel has received
rating_count_df = (hotel_rating_df.
     groupby(by = ['hotel_id'])['overall_rating'].
     count().
     reset_index().
     rename(columns = {'overall_rating': 'RatingCount_hotel'})
     [['hotel_id', 'RatingCount_hotel']]
    )

In [8]:
rating_count_df.head()

Unnamed: 0,hotel_id,RatingCount_hotel
0,100407,64
1,100504,739
2,100505,644
3,100506,88
4,100507,1049


In [9]:
rating_count_df.shape

(3946, 2)

In [10]:
# set the threshold to 30 to keep only the hotels that have received at least 30
# ratings from users
threshold=30
rating_count_df=rating_count_df.query('RatingCount_hotel >= @threshold')

In [11]:
# check how many hotels remain after applying the filter
rating_count_df.shape
# from 3946 -> 2737

(2737, 2)

In [12]:
user_rating_df= pd.merge(rating_count_df, hotel_rating_df, left_on='hotel_id', right_on='hotel_id', how='left')

In [13]:
user_rating_df.head()

Unnamed: 0,hotel_id,RatingCount_hotel,username,overall_rating
0,100407,64,johnb6597,5.0
1,100407,64,John K,5.0
2,100407,64,AlowishusCPMc,4.0
3,100407,64,plasmid,5.0
4,100407,64,tyramjer,4.0


In [14]:
# count the number of hotels each user rated
user_count_df = (user_rating_df.
     groupby(by = ['username'])['overall_rating'].
     count().
     reset_index().
     rename(columns = {'overall_rating': 'RatingCount_user'})
     [['username', 'RatingCount_user']]
    )
user_count_df.head()

Unnamed: 0,username,RatingCount_user
0,!!,2
1,!!!!!!?,1
2,!_1234,1
3,#1Cubsfan,2
4,#1ElvisFan,1


In [15]:
user_count_df.shape
# there are 529341 users

(529341, 2)

In [16]:
threshold = 20
user_count_df = user_count_df.query('RatingCount_user >= @threshold')
user_count_df.head()

Unnamed: 0,username,RatingCount_user
1620,1NicePerson,26
2867,2Midwest,41
6939,A B,24
6940,A C,20
6953,A K,21


In [17]:
user_count_df.shape
# from 529341 users to 525 active users!

(525, 2)

In [18]:
combined_df= user_rating_df.merge(user_count_df, left_on = 'username', right_on = 'username', how = 'inner')

In [36]:
combined_df.tail()

Unnamed: 0,hotel_id,RatingCount_hotel,username,overall_rating,RatingCount_user
15897,80806,286,Umailop,0.8,21
15898,80836,341,Umailop,1.0,21
15899,81022,525,Umailop,0.8,21
15900,81215,549,Umailop,0.2,21
15901,81246,133,Umailop,0.6,21


In [20]:
combined_df.shape

(15902, 5)

In [21]:
print('Number of unique hotels: ', combined_df['hotel_id'].nunique())
print('Number of unique users: ', combined_df['username'].nunique())

Number of unique hotels:  2340
Number of unique users:  525


So, our final dataset contains 525 users and 2340 hotels. Each user has given at least 20 ratings, and each hotel had received at least 30 ratings before the user filter was applied.
This is a manageable size in our case because we do not have a GPU.

The focus is on finding users who have given similar ratings to the same hotels. That is, we link similar users and suggest to each of them the hotels that the others reviewed positively.
Thus we look for associations between users, not between hotels.
Collaborative filtering therefore relies only on observed user behavior; no content data or profile data is necessary.

The main observations that we will focus on are: <br>
Users who rate hotels in a similar manner share one or more hidden preferences, and users with shared preferences are likely to rate the same hotels in the same way.
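
To make "users who rate in a similar manner" concrete, here is a minimal sketch that measures cosine similarity between users' rating vectors. The notebook itself never computes this explicitly (the autoencoder below is meant to pick up such structure implicitly), and the small matrix is made-up data.

```python
import numpy as np

# Rows are users, columns are hotels; 0 means "not rated" (hypothetical data).
ratings = np.array([
    [1.0, 0.8, 0.0, 0.2],   # user 0
    [0.8, 1.0, 0.0, 0.0],   # user 1: tastes close to user 0
    [0.0, 0.2, 1.0, 0.8],   # user 2: different tastes
])

# Cosine similarity between every pair of users.
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
similarity = (ratings @ ratings.T) / (norms * norms.T)

print(np.round(similarity, 3))
```

Users 0 and 1 come out far more similar to each other than either is to user 2, so a hotel that user 1 liked is a more plausible suggestion for user 0 than one that user 2 liked.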

In [22]:
# normalize the ratings 
scaler = MinMaxScaler()
combined_df['overall_rating'] = combined_df['overall_rating'].values.astype(float)
rating_scaled_df= pd.DataFrame(scaler.fit_transform(combined_df['overall_rating'].values.reshape(-1,1)))
combined_df['overall_rating'] = rating_scaled_df
combined_df.head()

Unnamed: 0,hotel_id,RatingCount_hotel,username,overall_rating,RatingCount_user
0,100407,64,John K,1.0,63
1,100597,637,John K,1.0,63
2,1007816,210,John K,0.2,63
3,109454,183,John K,0.2,63
4,1113787,397,John K,1.0,63
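
With its default settings, `MinMaxScaler` rescales each value as `(x - min) / (max - min)`. The scaled values in the output above fall on multiples of 0.2, which is consistent with raw ratings running from 0 to 5. That range is an inference from the printed output, not something stated in the data description. A small sketch of the formula on made-up ratings:

```python
import numpy as np

# Hypothetical raw ratings spanning an assumed 0-5 scale.
raw = np.array([0.0, 1.0, 2.0, 4.0, 5.0])

# Min-max scaling by hand: (x - min) / (max - min).
scaled = (raw - raw.min()) / (raw.max() - raw.min())

print(scaled)
```

Under that assumption a raw rating of 1.0 maps to 0.2 and a raw rating of 5.0 maps to 1.0.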


In [23]:
#  building the user hotel matrix 
combined_df= combined_df.drop_duplicates(['username', 'hotel_id'])
user_hotel_matrix = combined_df.pivot(index='username', columns='hotel_id', values='overall_rating')
user_hotel_matrix.fillna(0, inplace=True)
users = user_hotel_matrix.index.tolist()
hotels = user_hotel_matrix.columns.tolist()
user_hotel_matrix = user_hotel_matrix.to_numpy()  # as_matrix() was removed in pandas 1.0


In [24]:
user_hotel_matrix

array([[0. , 0. , 0.4, ..., 0. , 0. , 0.8],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 0. , 0.8, 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ]])
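
The matrix is mostly zeros: at most 15,902 of the 525 × 2,340 = 1,228,500 user-hotel cells hold a rating, roughly 1.3%. The sketch below repeats the pivot on made-up data and measures that sparsity. Note that after `fillna(0)` an unrated pair is indistinguishable from a rating that was scaled to 0.

```python
import pandas as pd

# Hypothetical scaled ratings, one row per (user, hotel) pair.
toy = pd.DataFrame({
    "username": ["a", "a", "b", "c"],
    "hotel_id": ["h1", "h2", "h2", "h3"],
    "overall_rating": [1.0, 0.4, 0.8, 0.6],
})

matrix = toy.pivot(index="username", columns="hotel_id", values="overall_rating")
observed = matrix.notna().to_numpy()   # True where a rating exists
dense = matrix.fillna(0).to_numpy()    # unrated pairs become 0

sparsity = 1.0 - observed.mean()       # share of user-hotel pairs with no rating
print(dense)
print(round(sparsity, 3))
```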

In [25]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


In [26]:
# setting up the network parameters: the dimension of each hidden layer
num_input = combined_df['hotel_id'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

# initialize the TensorFlow placeholder for a batch of user rating vectors
X = tf.placeholder(tf.float64, [None, num_input])

# randomly initialize the weights and biases

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

In [27]:
# build the encoder and decoder model
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
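
To see how the tensor shapes flow through the encoder and decoder, here is a NumPy-only sketch of the same forward pass. It is not the TensorFlow graph itself: the input width of 8 and the random batch are toy assumptions (the notebook uses the number of hotels as `num_input`), and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy input width; hidden sizes match the notebook (10 and 5).
num_input, num_hidden_1, num_hidden_2 = 8, 10, 5
shapes = [(num_input, num_hidden_1), (num_hidden_1, num_hidden_2),   # encoder
          (num_hidden_2, num_hidden_1), (num_hidden_1, num_input)]   # decoder
weights = [rng.normal(size=s) for s in shapes]
biases = [rng.normal(size=s[1]) for s in shapes]

batch = rng.uniform(size=(3, num_input))   # 3 hypothetical users

# Each layer computes sigmoid(x @ W + b), as in encoder() and decoder().
activations = [batch]
for w, b in zip(weights, biases):
    activations.append(sigmoid(activations[-1] @ w + b))

code = activations[2]             # encoder output: one 5-dimensional code per user
reconstruction = activations[-1]  # decoder output: one score per hotel, in (0, 1)

print(code.shape, reconstruction.shape)
```

The sigmoid on the last layer keeps every reconstructed score between 0 and 1, which matches the range of the min-max-scaled ratings.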

In [28]:
# construct the model and the predictions
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)
y_pred = decoder_op
y_true = X

In [29]:
# define the loss function and the optimizer:
# minimize the mean squared error between the input and its reconstruction
loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
# precision metric (defined here, but not evaluated in the training cell below)
eval_x = tf.placeholder(tf.int32)
eval_y = tf.placeholder(tf.int32)
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)


In [30]:
# create the ops that initialize the global and local variables
# create an empty data frame to store the result table, which will hold the top 10
# recommendations for every user
init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

## Train the model

In [31]:
# split the training data into batches, then feed the network with them
# train the model with vectors of user ratings (each row represents a user and
# each column represents a hotel; the entries are the ratings that the user gave to the hotel)
# we should not recommend hotels that a user has already rated.
with tf.Session() as session:
    epochs = 100
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_hotel_matrix.shape[0] / batch_size)
    user_hotel_matrix = np.array_split(user_hotel_matrix, num_batches)
    
    for i in range(epochs):

        avg_cost = 0
        for batch in user_hotel_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

#         print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_hotel_matrix = np.concatenate(user_hotel_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_hotel_matrix})

    pred_data = pd.DataFrame(preds)  # DataFrame.append was removed in pandas 2.0

    pred_data = pred_data.stack().reset_index(name='overall_rating')
    pred_data.columns = ['username', 'hotel_id', 'overall_rating']
    pred_data['username'] = pred_data['username'].map(lambda value: users[value])
    pred_data['hotel_id'] = pred_data['hotel_id'].map(lambda value: hotels[value])
    
    keys = ['username', 'hotel_id']
    index_1 = pred_data.set_index(keys).index
    index_2 = combined_df.set_index(keys).index

    top_ten_ranked = pred_data[~index_1.isin(index_2)]
    top_ten_ranked = top_ten_ranked.sort_values(['username', 'overall_rating'], ascending=[True, False])
    top_ten_ranked = top_ten_ranked.groupby('username').head(10)
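
The last step of the cell above (drop the pairs a user has already rated, then keep the highest-scoring hotels per user) can be checked in isolation on a toy example. The data is made up, and `head(1)` stands in for the notebook's `head(10)`.

```python
import pandas as pd

# Hypothetical predicted scores for every user-hotel pair.
preds = pd.DataFrame({
    "username": ["a", "a", "a", "b", "b", "b"],
    "hotel_id": ["h1", "h2", "h3", "h1", "h2", "h3"],
    "overall_rating": [0.9, 0.5, 0.7, 0.2, 0.8, 0.6],
})
# Pairs the users have already rated.
rated = pd.DataFrame({"username": ["a", "b"], "hotel_id": ["h1", "h2"]})

keys = ["username", "hotel_id"]
already_rated = preds.set_index(keys).index.isin(rated.set_index(keys).index)

top_n = (preds[~already_rated]
         .sort_values(["username", "overall_rating"], ascending=[True, False])
         .groupby("username")
         .head(1))

print(top_n)
```

User `a`'s best-scoring hotel overall is `h1`, but it has already been rated, so the recommendation falls to `h3`.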

In [33]:
import pickle
pickle.dump(top_ten_ranked,open("top_ten_ranked.p","wb"))

In [34]:
# select a user to see which hotels we should recommend, sorted by the normalized predicted ratings
top_ten_ranked.loc[top_ten_ranked['username'] == 'John K']

Unnamed: 0,username,hotel_id,overall_rating
480807,John K,258705,0.11319
481750,John K,93454,0.088896
481605,John K,87638,0.085385
481753,John K,93464,0.078889
481761,John K,93507,0.076928
480124,John K,119728,0.066861
481296,John K,80602,0.064206
481359,John K,81192,0.062267
480459,John K,2079052,0.0618
481672,John K,89617,0.060563


In [35]:
# let's see the hotels that the selected user has rated, sorted by rating
hotel_rating_df.loc[hotel_rating_df['username'] == 'John K'].sort_values(by=['overall_rating'], ascending=False)

Unnamed: 0,username,hotel_id,overall_rating
1190,John K,282699,5.0
498129,John K,781627,5.0
328653,John K,114591,5.0
362946,John K,76442,5.0
378107,John K,274223,5.0
...,...,...,...
530073,John K,109454,1.0
679808,John K,121917,1.0
82565,John K,93625,1.0
242573,John K,113300,1.0
