<a href="https://colab.research.google.com/github/IrfanChairurrachman/kuylah-backend/blob/main/inventory/tfrs.ipynb" target="_parent"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KUYLAH by Stackoverthink

This notebook implements collaborative filtering for the KUYLAH recommendation system, which suggests destinations to KUYLAH users.

We build two models: content-based filtering with scikit-learn and collaborative filtering with TensorFlow.

![Recommendation System](https://github.com/IrfanChairurrachman/kuylah-backend/blob/main/inventory/rs.png?raw=1)

Collaborative filtering is a technique widely used by recommendation systems that have a decent amount of user data. It makes recommendations based on the content preferences of similar users.

The collaborative filtering approach focuses on finding users who have given similar ratings to the same destinations. This creates a link between users, and each user is then offered destinations that linked users reviewed positively. In this way, we look for associations between users, not between destinations. Therefore, collaborative filtering relies only on observed user behavior to make recommendations; no profile data or content data is necessary.
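
To make the idea of "similar users" concrete, here is a minimal sketch that compares users by the cosine similarity of their rating vectors. The toy ratings and the `cosine_similarity` helper are hypothetical and only illustrate the principle; the model trained in this notebook is an autoencoder and does not compute similarities explicitly.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are destinations, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],   # user A
    [4, 5, 0, 2],   # user B: tastes close to A
    [1, 0, 5, 4],   # user C: tastes differ from A
], dtype=float)

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_similarity(ratings[0], ratings[1])
sim_ac = cosine_similarity(ratings[0], ratings[2])
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

User A is far closer to user B than to user C, so destinations that B liked would be better candidates for A.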



---


Therefore, collaborative filtering is not well suited to the cold start problem: it cannot draw any inference about users or items for which it has not yet gathered sufficient information.

But once we have a relatively large amount of user-item interaction data, collaborative filtering is the most widely used recommendation approach. That is why we generated ratings.csv ourselves; it contains 1000 users, each of whom gave 10 destination ratings.




In [None]:
# Upload dataset.csv and ratings.csv file for training
from google.colab import files
files.upload()

In [3]:
# Import necessary modules
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

In [4]:
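# Read the CSV files as pandas DataFrames.
# `error_bad_lines=False` was removed in pandas 2.0; `on_bad_lines='skip'` (pandas >= 1.3) is the equivalent.
rating = pd.read_csv('ratings.csv', on_bad_lines='skip', encoding="latin-1")
destination = pd.read_csv('dataset.csv', on_bad_lines='skip', encoding="latin-1")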

In [5]:
rating.head()

Unnamed: 0,user_id,index,ratings
0,1,31,1
1,1,29,5
2,1,91,4
3,1,3,4
4,1,37,1


In [6]:
destination.head()

Unnamed: 0,index,nama,vote_average,vote_count,type,htm_weekday,htm_weekend,latitude,longitude,description
0,0,Candi Borobudur,4.7,81922,Budaya dan Sejarah,50000.0,50000.0,-7.607087,110.203623,Candi yang pernah masuk sebagai salah satu dar...
1,1,Candi Prambanan,4.7,71751,Budaya dan Sejarah,50000.0,50000.0,-7.751835,110.491532,Candi Prambanan adalah kompleks candi Hindu te...
2,2,Tebing Breksi,4.4,51431,Alam,10000.0,10000.0,-7.781477,110.504576,Tebing Breksi merupakan tempat wisata yang ber...
3,3,Gembira Loka Zoo,4.5,36337,Buatan,20000.0,25000.0,-7.806234,110.396798,Gambira Loka adalah kebun binatang yang berada...
4,4,The Palace of Yogyakarta (Keraton Yogyakarta),4.6,30091,Budaya dan Sejarah,8000.0,8000.0,-7.805284,110.364203,Kompleks keraton merupakan museum yang menyimp...


In [7]:
# Merge rating and destination based on index (destination primary key)
dest_rating = pd.merge(rating, destination, on='index')
cols = ['vote_average', 'vote_count', 'type', 'htm_weekday', 'htm_weekend', 'latitude', 'longitude', 'description']
dest_rating.drop(cols, axis=1, inplace=True)
dest_rating.head()

Unnamed: 0,user_id,index,ratings,nama
0,1,31,1,De Mata Trick Eye Museum
1,27,31,4,De Mata Trick Eye Museum
2,30,31,4,De Mata Trick Eye Museum
3,30,31,1,De Mata Trick Eye Museum
4,30,31,2,De Mata Trick Eye Museum


In [8]:
# Count how many ratings each destination has received
rating_count = (dest_rating.
     groupby(by = ['nama'])['ratings'].
     count().
     reset_index().
     rename(columns = {'ratings': 'RatingCount_dest'})
     [['nama', 'RatingCount_dest']]
    )
rating_count

Unnamed: 0,nama,RatingCount_dest
0,Affandi Museum,85
1,Agro Tourism Bhumi Merapi,109
2,Air Terjun Kedung Pedut,100
3,Balong Waterpark,93
4,Bendungan Kamijoro,90
...,...,...
137,Wisata Air Wanatirta Kencana,115
138,Wisata Alam Watu Amben,107
139,Wisata Kalibiru,107
140,Wisata Telaga Potorono,97


In [9]:
rating_count['RatingCount_dest'].describe()

count    142.000000
mean     105.838028
std        9.992648
min       84.000000
25%       98.000000
50%      107.000000
75%      113.000000
max      132.000000
Name: RatingCount_dest, dtype: float64

In [13]:
threshold = 95
rating_count = rating_count.query('RatingCount_dest >= @threshold')
rating_count.head()

Unnamed: 0,nama,RatingCount_dest
1,Agro Tourism Bhumi Merapi,109
2,Air Terjun Kedung Pedut,100
5,Blue Lagoon Jogja,114
6,Bukit Klangon,116
9,Bukit Paralayang Watugupit,116


In [14]:
rating_count.shape

(123, 2)

In [15]:
# Merge rating_count and dest_rating
user_rating = pd.merge(rating_count, dest_rating, left_on='nama', right_on='nama', how='left')
user_rating.head()

Unnamed: 0,nama,RatingCount_dest,user_id,index,ratings
0,Agro Tourism Bhumi Merapi,109,1,29,5
1,Agro Tourism Bhumi Merapi,109,1,29,2
2,Agro Tourism Bhumi Merapi,109,16,29,3
3,Agro Tourism Bhumi Merapi,109,17,29,5
4,Agro Tourism Bhumi Merapi,109,32,29,1


In [16]:
user_count = (user_rating.
     groupby(by = ['user_id'])['ratings'].
     count().
     reset_index().
     rename(columns = {'ratings': 'RatingCount_user'})
     [['user_id', 'RatingCount_user']]
    )
user_count

Unnamed: 0,user_id,RatingCount_user
0,1,14
1,2,19
2,3,11
3,4,12
4,5,11
...,...,...
995,996,18
996,997,14
997,998,10
998,999,18


In [17]:
user_count['RatingCount_user'].describe()

count    1000.000000
mean       13.335000
std         3.172198
min         6.000000
25%        11.000000
50%        13.000000
75%        16.000000
max        20.000000
Name: RatingCount_user, dtype: float64

In [18]:
combined = user_rating.merge(user_count, left_on = 'user_id', right_on = 'user_id', how = 'inner')
combined

Unnamed: 0,nama,RatingCount_dest,user_id,index,ratings,RatingCount_user
0,Agro Tourism Bhumi Merapi,109,1,29,5,14
1,Agro Tourism Bhumi Merapi,109,1,29,2,14
2,Air Terjun Kedung Pedut,100,1,57,4,14
3,Candi Ijo,104,1,138,5,14
4,Candi Sari,104,1,119,5,14
...,...,...,...,...,...,...
13330,Stonehenge Merapi,108,965,71,3,15
13331,Tebing Breksi,112,965,2,4,15
13332,The World Landmarks - Merapi Park Yogyakarta,132,965,8,4,15
13333,The World Landmarks - Merapi Park Yogyakarta,132,965,8,1,15


In [19]:
combined.shape

(13335, 6)

In [20]:
print('Number of unique destination: ', combined['nama'].nunique())
print('Number of unique users: ', combined['user_id'].nunique())

Number of unique destination:  123
Number of unique users:  1000


In [21]:
# Normalize the ratings with MinMaxScaler
scaler = MinMaxScaler()
combined['ratings'] = combined['ratings'].values.astype(float)
rating_scaled = pd.DataFrame(scaler.fit_transform(combined['ratings'].values.reshape(-1,1)))
combined['ratings'] = rating_scaled

In [22]:
combined

Unnamed: 0,nama,RatingCount_dest,user_id,index,ratings,RatingCount_user
0,Agro Tourism Bhumi Merapi,109,1,29,1.00,14
1,Agro Tourism Bhumi Merapi,109,1,29,0.25,14
2,Air Terjun Kedung Pedut,100,1,57,0.75,14
3,Candi Ijo,104,1,138,1.00,14
4,Candi Sari,104,1,119,1.00,14
...,...,...,...,...,...,...
13330,Stonehenge Merapi,108,965,71,0.50,15
13331,Tebing Breksi,112,965,2,0.75,15
13332,The World Landmarks - Merapi Park Yogyakarta,132,965,8,0.75,15
13333,The World Landmarks - Merapi Park Yogyakarta,132,965,8,0.00,15
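
The scaled values above follow the min-max formula `(x - min) / (max - min)`. A minimal sketch of the same arithmetic in plain NumPy, assuming the raw ratings span 1 to 5 (which matches the values shown in the tables above):

```python
import numpy as np

# Assumed rating scale of 1 to 5
raw = np.array([1, 2, 3, 4, 5], dtype=float)

# Min-max scaling: (x - min) / (max - min)
scaled = (raw - raw.min()) / (raw.max() - raw.min())
print(scaled)
```

Under this assumption a rating of 5 maps to 1.0, a rating of 2 maps to 0.25, and a rating of 1 maps to 0.0.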


In [23]:
# Transform the dataframe into a user x destination matrix for training
combined = combined.drop_duplicates(['user_id', 'nama'])
user_dest_matrix = combined.pivot(index='user_id', columns='nama', values='ratings')
user_dest_matrix.fillna(0, inplace=True)

users = user_dest_matrix.index.tolist()
dests = user_dest_matrix.columns.tolist()

user_dest_matrix = user_dest_matrix.values

In [24]:
user_dest_matrix

array([[1.  , 0.75, 0.  , ..., 0.  , 1.  , 0.  ],
       [0.  , 0.25, 0.  , ..., 0.  , 1.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       ...,
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.5 ]])
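
The pivot step can be easier to follow on a small example. The sketch below uses hypothetical toy rows in the same long format as `combined`. It also shows one side effect worth keeping in mind: after min-max scaling, the lowest rating becomes 0.0, which is the same value `fillna(0)` assigns to destinations a user never rated.

```python
import pandas as pd

# Hypothetical toy data in the same long format as `combined`
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "nama":    ["Candi Ijo", "Tebing Breksi", "Candi Ijo", "Candi Ijo", "Ratu Boko"],
    "ratings": [1.0, 0.0, 0.5, 0.25, 0.75],
})

# pivot() raises on duplicate (user_id, nama) pairs, so keep only the first rating
toy = toy.drop_duplicates(["user_id", "nama"])

# One row per user, one column per destination, unrated cells filled with 0
matrix = toy.pivot(index="user_id", columns="nama", values="ratings").fillna(0)
print(matrix)
```

In this toy matrix, user 1 has 0.0 for both Tebing Breksi (rated lowest) and Ratu Boko (never rated), so the two cases cannot be told apart from the matrix alone.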

In [25]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

Instructions for updating:
non-resource variables are not supported in the long term


In [26]:
# Initialize the TensorFlow placeholder,
# then randomly initialize the weights and biases.
# The following code is taken from the book: Python Machine Learning Cookbook - Second Edition
num_input = combined['nama'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

X = tf.placeholder(tf.float64, [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

In [27]:
# Build the encoder and decoder model
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
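
To check how data flows through the encoder and decoder, the following NumPy sketch reproduces the same layer shapes (123 -> 10 -> 5 -> 10 -> 123) with sigmoid activations. It is only an illustration of the forward pass: the random input batch, the seed, and the `sigmoid` helper are assumptions, and no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_input, num_hidden_1, num_hidden_2 = 123, 10, 5
layer_sizes = [num_input, num_hidden_1, num_hidden_2, num_hidden_1, num_input]

# One (weights, bias) pair per layer, drawn from a standard normal distribution
params = [(rng.standard_normal((n_in, n_out)), rng.standard_normal(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.random((35, num_input))  # hypothetical batch of 35 users
h = x
shapes = [h.shape]
for w, b in params:
    h = sigmoid(h @ w + b)  # same operation as tf.nn.sigmoid(tf.add(tf.matmul(...), bias))
    shapes.append(h.shape)

print(shapes)
```

The 5-unit middle layer is the bottleneck: each user's 123 ratings are compressed to 5 numbers before being reconstructed, and the sigmoid on the last layer keeps predictions in the same 0 to 1 range as the scaled ratings.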

In [28]:
# Construct the model and the predictions
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

y_pred = decoder_op

y_true = X

In [29]:
# Define loss function, optimizer, minimize the squared error, and evaluation metrics
loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
eval_x = tf.placeholder(tf.int32)
eval_y = tf.placeholder(tf.int32)
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


In [30]:
# Initialize the variables.
# Because TensorFlow 1.x uses computational graphs for its operations, variables must be initialized explicitly
# before they are used (placeholders are fed at run time instead).
init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

We start training our model.
- We split the training data into batches and feed them to the network.
- We train our model with vectors of user ratings: each vector represents a user, each column a destination, and each entry the rating that the user gave to that destination.
- After a few trials, we found that training the model for 500 epochs with a batch size of 35 uses a reasonable amount of memory. This means that the entire training set is fed to our neural network 500 times, about 35 users at a time.
- At the end, we must remove each user's training-set ratings from the predictions. That is, we must not recommend destinations that the user has already rated.
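
The batching arithmetic described above can be checked in isolation. This sketch uses the same `np.array_split` call as the training cell, applied to a plain range of 1000 user positions instead of the real rating matrix:

```python
import numpy as np

num_users, batch_size = 1000, 35

# Same computation as int(user_dest_matrix.shape[0] / batch_size)
num_batches = num_users // batch_size

# array_split tolerates uneven division, so some batches get one extra user
batches = np.array_split(np.arange(num_users), num_batches)
sizes = [len(b) for b in batches]
print(num_batches, min(sizes), max(sizes), sum(sizes))
```

Because 1000 is not a multiple of 35, the split yields 28 batches of 35 or 36 users rather than batches of exactly 35, and every user is still covered once per epoch.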

In [31]:
with tf.Session() as session:
    epochs = 500
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_dest_matrix.shape[0] / batch_size)
    user_dest_matrix = np.array_split(user_dest_matrix, num_batches)
    
    for i in range(epochs):

        avg_cost = 0
        for batch in user_dest_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

        print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_dest_matrix = np.concatenate(user_dest_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_dest_matrix})

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    pred_data = pd.concat([pred_data, pd.DataFrame(preds)])

    pred_data = pred_data.stack().reset_index(name='nama')
    pred_data.columns = ['user_id', 'nama', 'ratings']
    pred_data['user_id'] = pred_data['user_id'].map(lambda value: users[value])
    pred_data['nama'] = pred_data['nama'].map(lambda value: dests[value])
    
    keys = ['user_id', 'nama']
    index_1 = pred_data.set_index(keys).index
    index_2 = combined.set_index(keys).index

    top_ten_ranked = pred_data[~index_1.isin(index_2)]
    top_ten_ranked = top_ten_ranked.sort_values(['user_id', 'ratings'], ascending=[True, False])
    top_ten_ranked = top_ten_ranked.groupby('user_id').head(10)

epoch: 1 Loss: 0.3145644792488643
epoch: 2 Loss: 0.3101328622017588
epoch: 3 Loss: 0.2918482092874391
epoch: 4 Loss: 0.22979426862938063
epoch: 5 Loss: 0.11383699439466
epoch: 6 Loss: 0.04719820831503187
epoch: 7 Loss: 0.03655328841081688
epoch: 8 Loss: 0.036469047356929095
epoch: 9 Loss: 0.03648648703736918
epoch: 10 Loss: 0.03645131071763379
epoch: 11 Loss: 0.036411386515413015
epoch: 12 Loss: 0.03636415169707367
epoch: 13 Loss: 0.036289721461279054
epoch: 14 Loss: 0.036184429457145076
epoch: 15 Loss: 0.03605520525681121
epoch: 16 Loss: 0.035912826524249146
epoch: 17 Loss: 0.03576250161443438
epoch: 18 Loss: 0.035611899103969336
epoch: 19 Loss: 0.03546498230259333
epoch: 20 Loss: 0.03531557973474264
epoch: 21 Loss: 0.03517045599541494
epoch: 22 Loss: 0.035022719097988944
epoch: 23 Loss: 0.03487002284133008
epoch: 24 Loss: 0.03471263870596886
epoch: 25 Loss: 0.03453885757231286
epoch: 26 Loss: 0.034382090371634276
epoch: 27 Loss: 0.034222226456872056
epoch: 28 Loss: 0.0340707997259284

After 500 epochs of training, we save the recommendations in the `top_ten_ranked` pandas DataFrame, which holds the ten highest-ranked destination recommendations for each user.
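
The filtering and ranking at the end of the training cell can be sketched on a small example. The data below is hypothetical and keeps only the top 1 per user instead of the top 10, but the steps (exclude already-rated pairs with a MultiIndex `isin`, sort by predicted rating, take the head of each user group) mirror the code above.

```python
import pandas as pd

# Hypothetical predicted ratings for every (user, destination) pair
pred = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "nama":    ["A", "B", "C", "A", "B", "C"],
    "ratings": [0.9, 0.4, 0.7, 0.2, 0.8, 0.6],
})

# Hypothetical pairs that users have already rated
seen = pd.DataFrame({"user_id": [1, 2], "nama": ["A", "B"]})

keys = ["user_id", "nama"]
already_rated = pred.set_index(keys).index.isin(seen.set_index(keys).index)

top_n = (pred[~already_rated]
         .sort_values(["user_id", "ratings"], ascending=[True, False])
         .groupby("user_id")
         .head(1))
print(top_n)
```

User 1's highest prediction is for destination A, but A is excluded because it was already rated, so the recommendation falls to C.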

Below are the top ten ranked destinations for user 154, followed by the ratings that user actually gave.

In [40]:
top_ten_ranked.loc[top_ten_ranked['user_id'] == 154]

Unnamed: 0,user_id,nama,ratings
18827,154,Camera House Borobudur,0.226843
18895,154,Pantai Ngrenehan,0.153823
18882,154,Omah Petroek,0.12599
18936,154,Watu Goyang,0.112464
18822,154,Bukit Klangon,0.095267
18828,154,Candi ASU Klaten,0.091231
18848,154,Gembira Loka Zoo,0.089614
18821,154,Blue Lagoon Jogja,0.079157
18938,154,Wisata Alam Watu Amben,0.078585
18825,154,Bukit Teletubbies,0.078355


In [39]:
dest_rating.loc[dest_rating['user_id'] == 154]

Unnamed: 0,user_id,index,ratings,nama
111,154,29,2,Agro Tourism Bhumi Merapi
2432,154,38,1,Pantai Cemara Sewu Bantul Yogyakarta
2982,154,23,2,Bukit Paralayang Watugupit
5832,154,141,5,Wisata Air Wanatirta Kencana
7341,154,65,2,Museum HM Soeharto
8548,154,19,2,Ratu Boko
8861,154,6,3,Hutan Pinus Mangunan Dlingo
10539,154,81,1,Kebun Buah Mangunan
11085,154,63,3,Pemecah Ombak Pantai Glagah
11187,154,79,3,Pantai Glagah


In [35]:
top_ten_ranked

Unnamed: 0,user_id,nama,ratings
93,1,Ramadanu flower garden,0.597280
46,1,Kebun Buah Mangunan,0.333036
24,1,Desa Wisata Pentingsari,0.169195
39,1,Jogja Bay,0.148575
32,1,Goa Kiskendo,0.129310
...,...,...,...
122980,1000,Taman Lampion (Taman Pelangi),0.194220
122925,1000,Kids Fun Galleria Mall,0.133518
122939,1000,Ngobaran Beach,0.131618
122935,1000,Museum Monumen Pangeran Diponegoro,0.123365


In [36]:
top_ten_ranked.to_csv(r'top_ten_ranked.csv', index = False, header=True)