# KUYLAH by Stackoverthink

This notebook implements collaborative filtering for the Kuylah recommendation system, which suggests destinations to Kuylah users.

Collaborative filtering is a technique widely used by recommendation systems that have a reasonably large amount of user and interaction data. It makes recommendations based on the preferences of similar users.

The collaborative filtering approach finds users who have given similar ratings to the same destinations, which creates a link between those users. Each user is then recommended destinations that the linked users reviewed positively. In other words, we look for associations between users, not between destinations. Collaborative filtering therefore relies only on observed user behavior to make recommendations; no profile data or content data is necessary.
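To make the idea of "similar users" concrete, here is a minimal sketch that compares rating vectors with cosine similarity. The toy ratings and the `cosine_similarity` helper are assumptions for illustration only; the model trained later in this notebook is an autoencoder and does not compute similarities explicitly.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are destinations, 0 = not rated.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],  # user A
    [4.0, 5.0, 0.0, 2.0],  # user B: tastes close to user A
    [1.0, 0.0, 5.0, 4.0],  # user C: tastes differ from user A
])

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0

sim_ab = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_ac = cosine_similarity(toy_ratings[0], toy_ratings[2])
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")  # A vs B: 0.966, A vs C: 0.214
```

Under this measure, user B is the better source of suggestions for user A, because the two rated the same destinations in a similar way.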



---


For the same reason, collaborative filtering is poorly suited to the cold start problem: it cannot draw any inference about users or items for which it has not yet gathered sufficient information.

But once a relatively large amount of user-item interaction data is available, collaborative filtering is the most widely used recommendation approach. That is why we generated ratings.csv ourselves: it contains 1000 users, each of whom gave feedback on 10 destinations.
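The script that produced ratings.csv is not included in this notebook. As a rough sketch only, synthetic feedback of the same shape (1000 users, 10 distinct destinations each) could be drawn uniformly at random as below. The function name `make_synthetic_ratings`, the uniform 1-to-5 rating distribution, and the seed are assumptions; the 142 destinations and the field order (`user_id`, `index`, `ratings`) follow the tables shown further down.

```python
import random

def make_synthetic_ratings(n_users=1000, n_dests=142, per_user=10, seed=0):
    """Return (user_id, destination_index, rating) tuples with random 1-5 ratings.

    Hypothetical generator: each user rates `per_user` distinct destinations.
    """
    rng = random.Random(seed)
    rows = []
    for user_id in range(1, n_users + 1):
        # sample() draws without replacement, so a user never rates a destination twice
        for dest_index in rng.sample(range(n_dests), per_user):
            rows.append((user_id, dest_index, rng.randint(1, 5)))  # randint is inclusive
    return rows

synthetic = make_synthetic_ratings()
print(len(synthetic))  # 10000
```

Uniform random ratings carry no real preference structure, so a model trained on data like this can only demonstrate the pipeline, not recommendation quality.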





In [None]:
# Upload dataset.csv and ratings.csv file for training
from google.colab import files
files.upload()


In [None]:
# Import necessary modules
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

In [None]:
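# Read the CSV files as pandas DataFrames.
# `on_bad_lines='skip'` (pandas >= 1.3) replaces `error_bad_lines=False`, which was removed in pandas 2.0.
rating = pd.read_csv('ratings.csv', on_bad_lines='skip', encoding="latin-1")
destination = pd.read_csv('dataset.csv', on_bad_lines='skip', encoding="latin-1")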

In [None]:
rating.head()

Unnamed: 0,user_id,index,ratings
0,1,24,5
1,1,123,3
2,1,51,1
3,1,76,4
4,1,139,4


In [None]:
destination.head()

Unnamed: 0,index,nama,vote_average,vote_count,type,htm_weekday,htm_weekend,latitude,longitude,description
0,0,Candi Borobudur,4.7,81922,Budaya dan Sejarah,50000.0,50000.0,-7.607087,110.203623,Candi yang pernah masuk sebagai salah satu dar...
1,1,Candi Prambanan,4.7,71751,Budaya dan Sejarah,50000.0,50000.0,-7.751835,110.491532,Candi Prambanan adalah kompleks candi Hindu te...
2,2,Tebing Breksi,4.4,51431,Alam,10000.0,10000.0,-7.781477,110.504576,Tebing Breksi merupakan tempat wisata yang ber...
3,3,Gembira Loka Zoo,4.5,36337,Buatan,20000.0,25000.0,-7.806234,110.396798,Gambira Loka adalah kebun binatang yang berada...
4,4,The Palace of Yogyakarta (Keraton Yogyakarta),4.6,30091,Budaya dan Sejarah,8000.0,8000.0,-7.805284,110.364203,Kompleks keraton merupakan museum yang menyimp...


In [None]:
# Merge rating and destination based on index (destination primary key)
dest_rating = pd.merge(rating, destination, on='index')
cols = ['vote_average', 'vote_count', 'type', 'htm_weekday', 'htm_weekend', 'latitude', 'longitude', 'description']
dest_rating.drop(cols, axis=1, inplace=True)
dest_rating.head()

Unnamed: 0,user_id,index,ratings,nama
0,1,24,5,Pantai Drini
1,18,24,4,Pantai Drini
2,45,24,3,Pantai Drini
3,78,24,3,Pantai Drini
4,111,24,3,Pantai Drini


In [None]:
# Count how many ratings each destination has received
rating_count = (dest_rating.
     groupby(by = ['nama'])['ratings'].
     count().
     reset_index().
     rename(columns = {'ratings': 'RatingCount_dest'})
     [['nama', 'RatingCount_dest']]
    )
rating_count

Unnamed: 0,nama,RatingCount_dest
0,Affandi Museum,75
1,Agro Tourism Bhumi Merapi,64
2,Air Terjun Kedung Pedut,75
3,Balong Waterpark,63
4,Bendungan Kamijoro,76
...,...,...
137,Wisata Air Wanatirta Kencana,63
138,Wisata Alam Watu Amben,62
139,Wisata Kalibiru,81
140,Wisata Telaga Potorono,75


In [None]:
rating_count['RatingCount_dest'].describe()

count    142.000000
mean      70.422535
std        8.595169
min       51.000000
25%       64.000000
50%       70.000000
75%       75.750000
max       99.000000
Name: RatingCount_dest, dtype: float64

In [None]:
threshold = 65
rating_count = rating_count.query('RatingCount_dest >= @threshold')
rating_count.head()

Unnamed: 0,nama,RatingCount_dest
0,Affandi Museum,75
2,Air Terjun Kedung Pedut,75
4,Bendungan Kamijoro,76
5,Blue Lagoon Jogja,75
6,Bukit Klangon,75


In [None]:
rating_count.shape

(142, 2)

In [None]:
# Merge rating_count and dest_rating on the destination name
user_rating = pd.merge(rating_count, dest_rating, left_on='nama', right_on='nama', how='left')
user_rating.head()

Unnamed: 0,nama,RatingCount_dest,user_id,index,ratings
0,Affandi Museum,75,17,84,2
1,Affandi Museum,75,28,84,5
2,Affandi Museum,75,42,84,2
3,Affandi Museum,75,56,84,4
4,Affandi Museum,75,60,84,2


In [None]:
user_count = (user_rating.
     groupby(by = ['user_id'])['ratings'].
     count().
     reset_index().
     rename(columns = {'ratings': 'RatingCount_user'})
     [['user_id', 'RatingCount_user']]
    )
user_count

Unnamed: 0,user_id,RatingCount_user
0,1,10
1,2,10
2,3,10
3,4,10
4,5,10
...,...,...
995,996,10
996,997,10
997,998,10
998,999,10


In [None]:
user_count['RatingCount_user'].describe()

count    1000.0
mean       10.0
std         0.0
min        10.0
25%        10.0
50%        10.0
75%        10.0
max        10.0
Name: RatingCount_user, dtype: float64

In [None]:
combined = user_rating.merge(user_count, left_on = 'user_id', right_on = 'user_id', how = 'inner')
combined

Unnamed: 0,nama,RatingCount_dest,user_id,index,ratings,RatingCount_user
0,Affandi Museum,75,17,84,2,10
1,Air Terjun Kedung Pedut,75,17,57,5,10
2,Bukit Paralayang Watugupit,59,17,23,1,10
3,Central Museum of the Air Force Dirgantara Man...,59,17,21,3,10
4,Galaxy Waterpark,64,17,62,3,10
...,...,...,...,...,...,...
9995,Pantai Parangkusumo,58,299,34,2,10
9996,Pantai Slili,58,299,98,3,10
9997,Taman Sari,70,299,5,1,10
9998,Taman Wisata Kaliurang,56,299,40,5,10


In [None]:
combined.shape

(10000, 6)

In [None]:
print('Number of unique destination: ', combined['nama'].nunique())
print('Number of unique users: ', combined['user_id'].nunique())

Number of unique destination:  142
Number of unique users:  1000


In [None]:
# Normalize the ratings to the [0, 1] range with MinMaxScaler
scaler = MinMaxScaler()
combined['ratings'] = scaler.fit_transform(combined[['ratings']].astype(float)).ravel()

In [None]:
combined

Unnamed: 0,nama,RatingCount_dest,user_id,index,ratings,RatingCount_user
0,Affandi Museum,75,17,84,0.25,10
1,Air Terjun Kedung Pedut,75,17,57,1.00,10
2,Bukit Paralayang Watugupit,59,17,23,0.00,10
3,Central Museum of the Air Force Dirgantara Man...,59,17,21,0.50,10
4,Galaxy Waterpark,64,17,62,0.50,10
...,...,...,...,...,...,...
9995,Pantai Parangkusumo,58,299,34,0.25,10
9996,Pantai Slili,58,299,98,0.50,10
9997,Taman Sari,70,299,5,0.00,10
9998,Taman Wisata Kaliurang,56,299,40,1.00,10
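The scaled values above follow from the min-max formula `(x - min) / (max - min)`. As a small check, assuming the observed ratings span 1 to 5, the first five ratings listed for user 17 map as follows (the array `raw` is copied by hand from the unscaled table):

```python
import numpy as np

# Ratings of user 17 before scaling, as shown in the earlier `combined` output.
raw = np.array([2.0, 5.0, 1.0, 3.0, 3.0])

# Min-max scaling: subtract the minimum, divide by the range.
scaled = (raw - raw.min()) / (raw.max() - raw.min())
print(scaled.tolist())  # [0.25, 1.0, 0.0, 0.5, 0.5]
```

Note that a rating of 1 maps to 0.0, the same value that `fillna(0)` later uses for destinations a user has not rated.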


In [None]:
# Transform the DataFrame into a user-by-destination matrix for training
combined = combined.drop_duplicates(['user_id', 'nama'])
user_dest_matrix = combined.pivot(index='user_id', columns='nama', values='ratings')
user_dest_matrix.fillna(0, inplace=True)

users = user_dest_matrix.index.tolist()
dests = user_dest_matrix.columns.tolist()

user_dest_matrix = user_dest_matrix.values

In [None]:
user_dest_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

In [None]:
# Define the TensorFlow input placeholder,
# then randomly initialize the weights and biases.
# The following code is taken from the book Python Machine Learning Cookbook, Second Edition.
num_input = combined['nama'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

X = tf.placeholder(tf.float64, [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

In [None]:
# Build the encoder and decoder models
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
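
The shapes flowing through the encoder and decoder can be checked with a plain NumPy sketch of the same forward pass. This is not the TensorFlow graph itself: the `forward` helper, the random weights, and the random input batch are illustrative assumptions; only the layer sizes (142, 10, 5) and the sigmoid activations come from the cells above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# num_input -> num_hidden_1 -> num_hidden_2 -> num_hidden_1 -> num_input
sizes = [142, 10, 5, 10, 142]
layers = [
    (rng.standard_normal((n_in, n_out)), rng.standard_normal(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def forward(x):
    """Apply each dense layer followed by a sigmoid; return every activation."""
    activations = [x]
    for weight, bias in layers:
        activations.append(sigmoid(activations[-1] @ weight + bias))
    return activations

batch = rng.random((35, 142))  # hypothetical batch of 35 users
acts = forward(batch)
print([a.shape for a in acts])
# [(35, 142), (35, 10), (35, 5), (35, 10), (35, 142)]
```

The third activation is the 5-dimensional code produced by the encoder; the last one has the same shape as the input and plays the role of `y_pred`.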

In [None]:
# Construct the model and the predictions
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

y_pred = decoder_op

y_true = X

In [None]:
# Define the loss function (mean squared error), the optimizer that minimizes it, and an evaluation metric
loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
eval_x = tf.placeholder(tf.int32)
eval_y = tf.placeholder(tf.int32)
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)

In [None]:
# Initialize the variables.
# Because TensorFlow 1.x builds a computational graph, variables must be explicitly initialized before the graph is run.
init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

We now train the model.
- We split the training data into batches and feed them to the network.
- We train the model on vectors of user ratings: each vector represents a user, each column a destination, and each entry the rating that the user gave to that destination.
- After a few trials, we found that training the model for 500 epochs with a batch size of 35 was manageable in terms of memory use. This means that the entire training set is fed to the neural network 500 times, in batches of about 35 users.
- At the end, we must exclude the ratings that are already in the training set. That is, we must not recommend a destination to a user who has already rated it.
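
The batching arithmetic in the list above can be checked in isolation. The sketch below mirrors the `int(n / batch_size)` and `np.array_split` calls used in the training cell, with a zero-filled placeholder matrix standing in for the real ratings (an assumption made only to keep the example self-contained).

```python
import numpy as np

n_users, n_dests = 1000, 142
batch_size, epochs = 35, 500

num_batches = n_users // batch_size  # same result as int(n_users / batch_size)
batches = np.array_split(np.zeros((n_users, n_dests)), num_batches)

# array_split spreads the remainder over the first batches instead of dropping it
sizes = sorted({len(b) for b in batches})
print(num_batches, sizes, num_batches * epochs)  # 28 [35, 36] 14000
```

So each epoch performs 28 optimizer steps on batches of 35 or 36 users, for 14,000 steps over the full run.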

In [None]:
with tf.Session() as session:
    epochs = 500
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_dest_matrix.shape[0] / batch_size)
    user_dest_matrix = np.array_split(user_dest_matrix, num_batches)

    for i in range(epochs):

        avg_cost = 0
        for batch in user_dest_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

        print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_dest_matrix = np.concatenate(user_dest_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_dest_matrix})

    # DataFrame.append was removed in pandas 2.0; pred_data is still empty here, so build it directly
    pred_data = pd.DataFrame(preds)

    pred_data = pred_data.stack().reset_index(name='nama')
    pred_data.columns = ['user_id', 'nama', 'ratings']
    pred_data['user_id'] = pred_data['user_id'].map(lambda value: users[value])
    pred_data['nama'] = pred_data['nama'].map(lambda value: dests[value])

    keys = ['user_id', 'nama']
    index_1 = pred_data.set_index(keys).index
    index_2 = combined.set_index(keys).index

    top_ten_ranked = pred_data[~index_1.isin(index_2)]
    top_ten_ranked = top_ten_ranked.sort_values(['user_id', 'ratings'], ascending=[True, False])
    top_ten_ranked = top_ten_ranked.groupby('user_id').head(10)

epoch: 1 Loss: 0.34649403606142315
epoch: 2 Loss: 0.3423175790480205
epoch: 3 Loss: 0.3250672934310777
epoch: 4 Loss: 0.26927051906074795
epoch: 5 Loss: 0.14658397622406483
epoch: 6 Loss: 0.05821994545736483
epoch: 7 Loss: 0.02981460014624255
epoch: 8 Loss: 0.024449147417076995
epoch: 9 Loss: 0.024420475786817924
epoch: 10 Loss: 0.024417686808322157
epoch: 11 Loss: 0.024406798183918
epoch: 12 Loss: 0.024388139402227744
epoch: 13 Loss: 0.02435923188126513
epoch: 14 Loss: 0.02431849004434688
epoch: 15 Loss: 0.024266997086150304
epoch: 16 Loss: 0.024208963715604374
epoch: 17 Loss: 0.024148712001208748
epoch: 18 Loss: 0.024088075051882436
epoch: 19 Loss: 0.024027116064514433
epoch: 20 Loss: 0.023962109349668026
epoch: 21 Loss: 0.023888977044927224
epoch: 22 Loss: 0.023804512739713703
epoch: 23 Loss: 0.02371131010087473
epoch: 24 Loss: 0.023612176267696277
epoch: 25 Loss: 0.023506233069513525
epoch: 26 Loss: 0.02339327129136239
epoch: 27 Loss: 0.023271529669208185
epoch: 28 Loss: 0.02314246

After 500 epochs of training, we store the recommendations in the `top_ten_ranked` pandas DataFrame, which holds the ten highest-ranked destination recommendations for each user.
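
The filtering and ranking step can be seen on a toy example. The sketch below uses the same pattern as the training cell (a multi-key index with `isin`, then `sort_values` and `groupby(...).head(n)`), but the data, the frame names `pred` and `seen`, and the cut-off of 2 instead of 10 are hypothetical.

```python
import pandas as pd

# Hypothetical predicted scores for two users over four destinations.
pred = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "nama": ["A", "B", "C", "D"] * 2,
    "ratings": [0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.7],
})
# Destinations each user has already rated and must not be recommended.
seen = pd.DataFrame({"user_id": [1, 2], "nama": ["A", "B"]})

keys = ["user_id", "nama"]
already_rated = pred.set_index(keys).index.isin(seen.set_index(keys).index)
unseen = pred[~already_rated]

# Sort by score within each user, then keep the first n rows per user.
top_n = (
    unseen.sort_values(["user_id", "ratings"], ascending=[True, False])
    .groupby("user_id")
    .head(2)
)
print(top_n)
```

User 1 loses destination A and user 2 loses destination B, even though those have the highest predicted scores, because both were already rated.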

Below are the top ten ranked destinations for user 3 and user 123.

In [None]:
top_ten_ranked.loc[top_ten_ranked['user_id'] == 3]

Unnamed: 0,user_id,nama,ratings
349,3,Mangrove Jembatan Api-Api (MJAA),0.34548
284,3,Affandi Museum,0.321784
315,3,Desa Wisata Gamplong,0.265625
297,3,Bundaran UGM,0.211005
290,3,Bukit Klangon,0.175006
360,3,Museum Wayang Kekayon,0.14975
340,3,Kawasan Ekowisata Gunung Api Purba Nglanggeran,0.122032
336,3,Jogja National Museum,0.115224
423,3,Wisata Kalibiru,0.113527
413,3,The Lost World Castle,0.107662


In [None]:
top_ten_ranked.loc[top_ten_ranked['user_id'] == 123]

Unnamed: 0,user_id,nama,ratings
17359,123,Embung Tambakboyo,0.33743
17326,123,Air Terjun Kedung Pedut,0.259476
17408,123,Pantai Kesirat,0.174604
17391,123,Monumen Yogya Kembali,0.163451
17407,123,Pantai Indrayanti,0.118556
17424,123,Pantai Timang,0.09745
17327,123,Balong Waterpark,0.095495
17450,123,Tebing Breksi,0.088619
17409,123,Pantai Kuwaru,0.084833
17346,123,Candi Prambanan,0.077256


In [None]:
top_ten_ranked

Unnamed: 0,user_id,nama,ratings
42,1,Goa Selarong,0.255784
138,1,Wisata Alam Watu Amben,0.187140
70,1,Museum Factory Dan Kedai Chocolate Monggo,0.150734
46,1,Gunung Api Purba Nglanggeran,0.136025
53,1,Jurang Tembelan Kanigoro,0.061779
...,...,...,...
141938,1000,Pantai Drini,0.098939
141961,1000,Pasar Kembang,0.082962
141903,1000,Grojogan Watu Purbo Bangunrejo,0.080489
141941,1000,Pantai Indrayanti,0.067598


In [None]:
top_ten_ranked.to_csv(r'top_ten_ranked.csv', index = False, header=True)