<a href="https://colab.research.google.com/github/AkramAzouzi/masters_pfe/blob/main/Neural_Collaborative_Filtering_Model_with_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The collaborative filtering (CF) approach focuses on finding users who have given similar ratings to the same books. This creates a link between users, who are then recommended books that similar users rated positively. In other words, we look for associations between users, not between books.

# Project description
In this project we build a neural collaborative filtering recommender system with the TensorFlow library. Book recommendations for a given user are derived from the existing ratings of other users whose ratings resemble that user's. The approach focuses on finding users who have given similar ratings to the same books, linking those users, and recommending to each of them books that the others rated positively. We therefore look for associations between users, not between books. Collaborative filtering relies only on observed user behavior to make recommendations: no profile data or content data is necessary.
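
To make the idea of "associations between users" concrete, here is a minimal sketch that compares toy rating vectors with cosine similarity. The notebook itself does not compute explicit similarities (it uses an autoencoder further down), so the similarity measure, the `cosine_similarity` helper and the toy ratings are all assumptions made for illustration.

```python
import numpy as np

# Toy ratings invented for illustration: rows are users, columns are books,
# and 0 means "not rated".
toy_ratings = np.array([
    [9.0, 8.0, 0.0, 1.0],   # user 0
    [8.0, 9.0, 0.0, 2.0],   # user 1, rates much like user 0
    [1.0, 0.0, 9.0, 8.0],   # user 2, rates very differently
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_02 = cosine_similarity(toy_ratings[0], toy_ratings[2])

# Users 0 and 1 come out far more similar than users 0 and 2.
print(sim_01, sim_02)
```

Under this view, books that user 1 liked but user 0 has not read yet would be natural candidates to recommend to user 0.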

# 1 Dataset

In [None]:
!mkdir data && wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip && unzip BX-CSV-Dump.zip -d data/ && clear

First we import some libraries. Since we are using Google Colab, most Python libraries come preinstalled, so we only need to import them.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

Read the dataset tables from local storage.

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0 (requires pandas >= 1.3)
rating = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
user = pd.read_csv('data/BX-Users.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
book = pd.read_csv('data/BX-Books.csv', sep=';', on_bad_lines='skip', encoding="latin-1")

In [None]:
rating

Next we merge the rating and book data and drop the columns we do not use.

In [None]:
book_rating = pd.merge(rating, book, on='ISBN')
cols = ['Year-Of-Publication', 'Publisher', 'Book-Author', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L']
book_rating.drop(cols, axis=1, inplace=True)
book_rating.head()

Then we keep only the books that have received at least 25 ratings and the users who have given at least 20 ratings.

In [None]:
#count all ratings
rating_count = (book_rating.
     groupby(by = ['Book-Title'])['Book-Rating'].
     count().
     reset_index().
     rename(columns = {'Book-Rating': 'RatingCount_book'})
     [['Book-Title', 'RatingCount_book']]
    )
# rating_count.head()
#books that have had at least 25 ratings
threshold = 25
rating_count = rating_count.query('RatingCount_book >= @threshold')
user_rating = pd.merge(rating_count, book_rating, left_on='Book-Title', right_on='Book-Title', how='left')


In [None]:
#counting users
user_count = (user_rating.
     groupby(by = ['User-ID'])['Book-Rating'].
     count().
     reset_index().
     rename(columns = {'Book-Rating': 'RatingCount_user'})
     [['User-ID', 'RatingCount_user']]
    )
# user_count.head()
threshold = 20
user_count = user_count.query('RatingCount_user >= @threshold')
combined = user_rating.merge(user_count, left_on = 'User-ID', right_on = 'User-ID', how = 'inner')

In [None]:
combined.shape

In [None]:
print('Number of unique books: ', combined['Book-Title'].nunique())
print('Number of unique users: ', combined['User-ID'].nunique())

Our technique will be based on the following observations:


*   Users who rate books in a similar manner share one or more hidden preferences.
*   Users with shared preferences are likely to give ratings in the same way to the same books.

Next we prepare the data for TensorFlow: we normalize the rating column, then build the user-book matrix from three columns (user ID, book title and rating):


In [None]:
scaler = MinMaxScaler()
# Scale ratings to [0, 1]; ravel() returns a 1-D array, so the assignment
# does not depend on the DataFrame index lining up
combined['Book-Rating'] = scaler.fit_transform(
    combined[['Book-Rating']].astype(float)
).ravel()
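
As a quick check on what the normalization does, the sketch below applies the min-max formula by hand with NumPy. With its default settings `MinMaxScaler` maps each value to `(x - min) / (max - min)`. The toy ratings are invented for illustration.

```python
import numpy as np

# Toy ratings on a 0-10 scale, invented for illustration.
toy = np.array([0.0, 5.0, 8.0, 10.0])

# Min-max scaling: the smallest value maps to 0, the largest to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

After scaling, every rating lies in [0, 1], which matches the output range of the sigmoid layers used later in the network.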

Build the user-book matrix.

In [None]:
combined = combined.drop_duplicates(['User-ID', 'Book-Title'])
user_book_matrix = combined.pivot(index='User-ID', columns='Book-Title', values='Book-Rating')
user_book_matrix.fillna(0, inplace=True)

users = user_book_matrix.index.tolist()
books = user_book_matrix.columns.tolist()

user_book_matrix = user_book_matrix.values
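
To see what the pivot produces, here is a small sketch on a toy table with the same three column names. The data and the `toy_` variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy long-format ratings, invented for illustration.
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'Book-Title': ['A', 'B', 'A', 'C'],
    'Book-Rating': [0.8, 0.5, 1.0, 0.3],
})

# One row per user, one column per book; unrated pairs become 0.
toy_matrix = toy.pivot(index='User-ID', columns='Book-Title',
                       values='Book-Rating').fillna(0)

toy_users = toy_matrix.index.tolist()    # row position -> user ID
toy_books = toy_matrix.columns.tolist()  # column position -> book title
toy_values = toy_matrix.values           # plain NumPy array fed to the network

print(toy_matrix)
```

The two lists are kept because the NumPy array loses the labels. They are what allows row and column positions to be mapped back to user IDs and titles after prediction.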

`tf.placeholder` is only available in the TensorFlow 1.x API, so as a workaround we import the `compat.v1` module and disable v2 behavior.

In [None]:
import os
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
os.system('clear')

0

In the following code we:
* Set the network parameters, such as the dimension of each hidden layer.
* Initialize the TF placeholder.
* Randomly initialize the weights and biases.


The following code is taken from the book Python Machine Learning Cookbook - Second Edition.

In [None]:
num_input = combined['Book-Title'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

X = tf.placeholder(tf.float64, [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

Now, we can build the encoder and decoder model, as follows:

In [None]:
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
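
The shapes flowing through the encoder and decoder can be traced with a plain NumPy sketch of the same forward pass. This is not the TensorFlow graph itself: the toy sizes, the seed and the `dense` helper are assumptions for illustration, and only the hidden sizes (10 and 5) come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
n_users, n_books = 4, 6                 # toy sizes, not the real dataset
hidden_1, hidden_2 = 10, 5              # same hidden sizes as the notebook

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, n_out):
    """One sigmoid layer with freshly drawn normal weights and biases."""
    w = rng.normal(size=(x.shape[1], n_out))
    b = rng.normal(size=n_out)
    return sigmoid(x @ w + b)

x = rng.random((n_users, n_books))      # stand-in for scaled ratings in [0, 1]
code = dense(dense(x, hidden_1), hidden_2)              # encoder: 6 -> 10 -> 5
reconstruction = dense(dense(code, hidden_1), n_books)  # decoder: 5 -> 10 -> 6

print(code.shape, reconstruction.shape)
```

The encoder compresses each user's rating vector to 5 numbers, and the decoder expands it back to one predicted score per book, so the output has the same shape as the input.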

We construct the model and its predictions.

In [None]:
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

y_pred = decoder_op

y_true = X

Define the loss function (mean squared error) and the optimizer that minimizes it, and define the evaluation metric.

In [None]:
loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
eval_x = tf.placeholder(tf.int32, )
eval_y = tf.placeholder(tf.int32, )
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)
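
As a worked example of the loss, the sketch below computes a mean squared error by hand with NumPy on invented values. It assumes the loss is the plain average of squared differences over all entries. Because the input matrix is filled with 0 for unrated books, those entries contribute to the error too.

```python
import numpy as np

# Toy "true" ratings (0 = unrated) and toy reconstructions, invented for illustration.
y_true_toy = np.array([[1.0, 0.0],
                       [0.5, 0.0]])
y_pred_toy = np.array([[0.8, 0.1],
                       [0.5, 0.2]])

# Squared differences: 0.04, 0.01, 0.00, 0.04 -> their mean is 0.0225.
mse = np.mean((y_true_toy - y_pred_toy) ** 2)

print(mse)
```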

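Initialize the variables. Because TensorFlow 1.x builds a computational graph before running it, variables must be explicitly initialized inside a session before they are used; placeholders receive their values through `feed_dict` at run time.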

In [None]:
init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

We can finally start to train our model.

We split training data into batches, and we feed the network with them.

We train our model with vectors of user ratings, each vector represents a user and each column a book, and entries are ratings that the user gave to books. 

After a few trials, we settled on training the model for 100 epochs with a batch size of 35, which kept memory consumption manageable. This means that the entire training set is fed to the neural network 100 times, in batches of roughly 35 users each.
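
The batching arithmetic is easy to check in isolation. The sketch below uses a toy matrix of 100 users (an invented size) with `batch_size = 35` and the same `int(...)` plus `np.array_split` pattern as the training cell.

```python
import numpy as np

toy_matrix = np.zeros((100, 8))   # 100 toy users, 8 toy books (invented sizes)
batch_size = 35

# Integer division rounds down, so 100 / 35 gives 2 batches.
num_batches = int(toy_matrix.shape[0] / batch_size)

# array_split distributes the rows as evenly as possible over the batches.
batches = np.array_split(toy_matrix, num_batches)
batch_rows = [b.shape[0] for b in batches]

print(num_batches, batch_rows)
```

Because the number of batches is rounded down, each batch can hold more rows than `batch_size`; here each one has 50 users rather than 35.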

In [None]:
with tf.Session() as session:
    epochs = 100
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_book_matrix.shape[0] / batch_size)
    user_book_matrix = np.array_split(user_book_matrix, num_batches)
    
    for i in range(epochs):

        avg_cost = 0
        for batch in user_book_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

        print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_book_matrix = np.concatenate(user_book_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_book_matrix})

    # DataFrame.append was removed in pandas 2.0; pred_data starts empty, so build it directly
    pred_data = pd.DataFrame(preds)

    pred_data = pred_data.stack().reset_index(name='Book-Rating')
    pred_data.columns = ['User-ID', 'Book-Title', 'Book-Rating']
    pred_data['User-ID'] = pred_data['User-ID'].map(lambda value: users[value])
    pred_data['Book-Title'] = pred_data['Book-Title'].map(lambda value: books[value])
    
    keys = ['User-ID', 'Book-Title']
    index_1 = pred_data.set_index(keys).index
    index_2 = combined.set_index(keys).index

    top_ten_ranked = pred_data[~index_1.isin(index_2)]
    top_ten_ranked = top_ten_ranked.sort_values(['User-ID', 'Book-Rating'], ascending=[True, False])
    top_ten_ranked = top_ten_ranked.groupby('User-ID').head(10)

epoch: 1 Loss: 0.3661415465585478
epoch: 2 Loss: 0.3051663774710435
epoch: 3 Loss: 0.062118540257010815
epoch: 4 Loss: 0.004407464247708629
epoch: 5 Loss: 0.003829436005696982
epoch: 6 Loss: 0.003625660814897536
epoch: 7 Loss: 0.003230570761773449
epoch: 8 Loss: 0.0030120116412885242
epoch: 9 Loss: 0.002924666096517755
epoch: 10 Loss: 0.0027865328145428346
epoch: 11 Loss: 0.0027303414094353934
epoch: 12 Loss: 0.00272240114377832
epoch: 13 Loss: 0.002716303084066117
epoch: 14 Loss: 0.0027114765099403295
epoch: 15 Loss: 0.0027075726948269123
epoch: 16 Loss: 0.0027043569850950288
epoch: 17 Loss: 0.0027016660103901893
epoch: 18 Loss: 0.0026993829038014614
epoch: 19 Loss: 0.00269742224078912
epoch: 20 Loss: 0.002695720412533034
epoch: 21 Loss: 0.0026942291693598194
epoch: 22 Loss: 0.00269291143107054
epoch: 23 Loss: 0.0026917384584321754
epoch: 24 Loss: 0.0026906882348767184
epoch: 25 Loss: 0.002689743662703332
epoch: 26 Loss: 0.0026888912901855432
epoch: 27 Loss: 0.00268811971373897
epoch:
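
The post-processing at the end of the training cell (drop books the user has already rated, then keep the highest predicted scores per user) can be illustrated on a toy table. The data and the `toy_` names are assumptions for illustration, and the sketch keeps the top 1 instead of the top 10 to stay small.

```python
import pandas as pd

# Toy predicted scores for every (user, book) pair, invented for illustration.
toy_preds = pd.DataFrame({
    'User-ID':     [1, 1, 1, 2, 2, 2],
    'Book-Title':  ['A', 'B', 'C', 'A', 'B', 'C'],
    'Book-Rating': [0.9, 0.4, 0.7, 0.2, 0.8, 0.6],
})
# Pairs that already exist in the ratings data.
toy_seen = pd.DataFrame({'User-ID': [1, 2], 'Book-Title': ['A', 'B']})

keys = ['User-ID', 'Book-Title']
pred_index = toy_preds.set_index(keys).index
seen_index = toy_seen.set_index(keys).index

# Keep only pairs the user has not rated yet.
unseen = toy_preds[~pred_index.isin(seen_index)]

# Sort by score within each user and keep the best remaining book.
top_n = (unseen.sort_values(['User-ID', 'Book-Rating'], ascending=[True, False])
               .groupby('User-ID').head(1))

print(top_n)
```

User 1's highest score overall is for book A, but A is already rated, so the recommendation falls to book C. The same happens for user 2 with book B.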

In [None]:
top_ten_ranked['User-ID'].head(100)

In [None]:
top_ten_ranked.loc[top_ten_ranked['User-ID'] == 6543]

In [None]:
# book_rating

In [None]:
# book_rating.loc[book_rating['User-ID'] == 10314].sort_values(by=['Book-Rating'], ascending=False)