
# Collaborative Filtering 


In [1]:
import numpy as np
import pandas as pd 
import tensorflow as tf 
from sklearn.preprocessing import MinMaxScaler

import os 
os.chdir("/content/drive/My Drive/recommender_systems ")

## Data Setup

We use the Book-Crossing dataset. The preprocessing does the following:
1. Merge the rating and book data (the user table is loaded but not used below)
2. Remove unused columns
3. Keep only books that have at least 25 ratings
4. Keep only users who have given at least 20 ratings. Collaborative filtering generally needs actively participating users, since it has little to go on for users with few ratings
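
The two count-based filters can be tried out on a small table first. The following is a minimal sketch on made-up data; the `toy` frame and the lowered thresholds are assumptions for illustration, and `groupby(...).transform("count")` is an alternative to the count-then-merge approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy ratings; the notebook itself reads the Book-Crossing CSV files.
toy = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 2, 3, 3, 3],
    "Book-Title":  ["A", "B", "C", "A", "B", "A", "C", "D"],
    "Book-Rating": [8, 0, 5, 7, 9, 10, 6, 3],
})

min_book_ratings = 2   # the notebook uses 25
min_user_ratings = 3   # the notebook uses 20

# Keep books with enough ratings: each row gets its book's rating count.
book_counts = toy.groupby("Book-Title")["Book-Rating"].transform("count")
popular = toy[book_counts >= min_book_ratings]

# Keep users with enough ratings among the remaining books.
user_counts = popular.groupby("User-ID")["Book-Rating"].transform("count")
filtered = popular[user_counts >= min_user_ratings]

print(filtered)
```

Book "D" is dropped first because it has a single rating, and only user 1 still has three ratings afterwards. As in the notebook, users are counted after the book filter has been applied.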

In [2]:
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on_bad_lines='warn' skips malformed rows and reports each one
rating = pd.read_csv('Data/BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
user = pd.read_csv('Data/BX-Users.csv', sep=';', on_bad_lines='warn', encoding="latin-1")
book = pd.read_csv('Data/BX-Books.csv', sep=';', on_bad_lines='warn', encoding="latin-1")

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [3]:
book_rating = pd.merge(rating, book, on='ISBN')
cols = ['Year-Of-Publication', 'Publisher', 'Book-Author', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L']
book_rating.drop(cols, axis=1, inplace=True)

rating_count = (book_rating.
     groupby(by = ['Book-Title'])['Book-Rating'].
     count().
     reset_index().
     rename(columns = {'Book-Rating': 'RatingCount_book'})
     [['Book-Title', 'RatingCount_book']]
    )
    
threshold = 25
rating_count = rating_count.query('RatingCount_book >= @threshold')

user_rating = pd.merge(rating_count, book_rating, left_on='Book-Title', right_on='Book-Title', how='left')

user_count = (user_rating.
     groupby(by = ['User-ID'])['Book-Rating'].
     count().
     reset_index().
     rename(columns = {'Book-Rating': 'RatingCount_user'})
     [['User-ID', 'RatingCount_user']]
    )
    
threshold = 20
user_count = user_count.query('RatingCount_user >= @threshold')

combined = user_rating.merge(user_count, left_on = 'User-ID', right_on = 'User-ID', how = 'inner')

print('Number of unique books: ', combined['Book-Title'].nunique())
print('Number of unique users: ', combined['User-ID'].nunique())

Number of unique books:  5850
Number of unique users:  3192


## Technique 

Collaborative filtering focuses on finding users who have given similar ratings to the same books, thus creating a link between users; each user is then recommended books that linked users reviewed positively.
In this way we look for associations between users, not between books.
Therefore, collaborative filtering relies only on observed user behavior to make recommendations.
Our technique is based on the following observations:
1. Users who rate books in a similar manner share one or more hidden preferences.
2. Users with shared preferences are likely to rate the same books in the same way.
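
To make the idea of "similar raters" concrete, here is a small sketch that compares users by the cosine similarity of their rating vectors. The ratings and the `cosine_similarity` helper are hypothetical; this notebook does not compute similarities explicitly but lets an autoencoder learn the hidden preferences instead.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are books, 0 means "not rated".
ratings = np.array([
    [9.0, 8.0, 0.0, 1.0],   # user 0
    [8.0, 9.0, 0.0, 2.0],   # user 1: rates much like user 0
    [1.0, 0.0, 9.0, 8.0],   # user 2: opposite taste
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine_similarity(ratings[0], ratings[1])
sim_02 = cosine_similarity(ratings[0], ratings[2])

print(f"user 0 vs user 1: {sim_01:.3f}")
print(f"user 0 vs user 2: {sim_02:.3f}")
```

Users 0 and 1 score close to 1, while users 0 and 2 score close to 0, which is the kind of link between users that the observations above describe.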

In [4]:
# Normalize the rating feature to [0, 1] using scikit-learn's MinMaxScaler
scaler = MinMaxScaler()
# fit_transform expects a 2-D input; ravel() flattens the result back into a single column
combined['Book-Rating'] = scaler.fit_transform(combined[['Book-Rating']].astype(float)).ravel()
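
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below applies that formula by hand to a few made-up ratings; `min_max_scale` is a hypothetical helper, not part of scikit-learn. Since Book-Crossing ratings run from 0 to 10, the scaling amounts to dividing by 10 as long as both extremes occur in the filtered data.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([0, 5, 8, 10])
print(scaled)  # [0.0, 0.5, 0.8, 1.0]
```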

In [5]:
# Then build the user-book matrix from three columns: User-ID, Book-Title and Book-Rating

combined = combined.drop_duplicates(['User-ID', 'Book-Title'])
user_book_matrix = combined.pivot(index='User-ID', columns='Book-Title', values='Book-Rating')
user_book_matrix.fillna(0, inplace=True)
users = user_book_matrix.index.tolist()
books = user_book_matrix.columns.tolist()
user_book_matrix = user_book_matrix.to_numpy()
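
The reshaping step is easier to follow on a tiny example. This sketch uses a made-up `toy` frame (an assumption for illustration) and shows why duplicates are dropped first: `pivot` raises an error when a user/title pair occurs more than once.

```python
import pandas as pd

# Hypothetical scaled ratings; user 2 has rated book "A" twice.
toy = pd.DataFrame({
    "User-ID":     [1, 1, 2, 2, 2],
    "Book-Title":  ["A", "B", "A", "A", "C"],
    "Book-Rating": [0.8, 0.5, 1.0, 0.6, 0.3],
})

# Keep the first rating for each (user, title) pair so pivot() has unique keys.
deduped = toy.drop_duplicates(["User-ID", "Book-Title"])

# One row per user, one column per book; unrated books become 0.
matrix = deduped.pivot(index="User-ID", columns="Book-Title", values="Book-Rating").fillna(0)
print(matrix)
```

Note that filling with 0 makes an unrated book indistinguishable from a book whose scaled rating is 0, which is a simplification the model inherits.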

In [6]:
# tf.placeholder is only available in the TensorFlow 1.x API

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

Instructions for updating:
non-resource variables are not supported in the long term


We define the TensorFlow placeholder for the input, then randomly initialize the weights and biases.

In [7]:
num_input = combined['Book-Title'].nunique()
num_hidden_1 = 10
num_hidden_2 = 5

X = tf.placeholder(tf.float64, [None, num_input])

weights = {
    'encoder_h1': tf.Variable(tf.random_normal([num_input, num_hidden_1], dtype=tf.float64)),
    'encoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2], dtype=tf.float64)),
    'decoder_h1': tf.Variable(tf.random_normal([num_hidden_2, num_hidden_1], dtype=tf.float64)),
    'decoder_h2': tf.Variable(tf.random_normal([num_hidden_1, num_input], dtype=tf.float64)),
}

biases = {
    'encoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'encoder_b2': tf.Variable(tf.random_normal([num_hidden_2], dtype=tf.float64)),
    'decoder_b1': tf.Variable(tf.random_normal([num_hidden_1], dtype=tf.float64)),
    'decoder_b2': tf.Variable(tf.random_normal([num_input], dtype=tf.float64)),
}

Now, we build the encoder and decoder model

In [8]:
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2

def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
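
To see how the shapes flow through the encoder and decoder, the same forward pass can be sketched in plain NumPy. The layer sizes, the `params` list and the `forward` helper are assumptions chosen for illustration; the notebook itself uses the number of books as input size, followed by 10 and 5 hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes (assumed); the layout mirrors input -> h1 -> h2 -> h1 -> input.
num_input, num_hidden_1, num_hidden_2 = 6, 4, 2
layer_sizes = [num_input, num_hidden_1, num_hidden_2, num_hidden_1, num_input]

# One (weights, bias) pair per layer, drawn from a standard normal.
params = [
    (rng.normal(size=(n_in, n_out)), rng.normal(size=n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x):
    """Return the activations after every layer, input included."""
    activations = [x]
    for w, b in params:
        activations.append(sigmoid(activations[-1] @ w + b))
    return activations

batch = rng.uniform(size=(3, num_input))   # three hypothetical users
activations = forward(batch)
shapes = [a.shape for a in activations]
print(shapes)  # [(3, 6), (3, 4), (3, 2), (3, 4), (3, 6)]
```

The final sigmoid keeps every reconstructed value between 0 and 1, which matches the range of the min-max scaled ratings.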

Building the final model by connecting the encoder and decoder 

In [9]:
encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

y_pred = decoder_op 
y_true = X

Compile the model by defining the loss function and the optimizer

In [10]:
loss = tf.losses.mean_pairwise_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)

eval_x = tf.placeholder(tf.int32, )
eval_y = tf.placeholder(tf.int32, )
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
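
The loss used above, `tf.losses.mean_pairwise_squared_error`, compares differences between pairs of elements rather than the elements themselves. The sketch below is a plain-Python reading of the per-sample formula given in the TensorFlow documentation; the helper names are ours, and batching and weights are left out, so treat it as an approximation of the idea rather than the library's implementation.

```python
from itertools import combinations

def pairwise_squared_error(labels, predictions):
    """Mean over element pairs (i, j) of ((l_i - l_j) - (p_i - p_j))**2 for one sample."""
    pairs = list(combinations(range(len(labels)), 2))
    total = sum(
        ((labels[i] - labels[j]) - (predictions[i] - predictions[j])) ** 2
        for i, j in pairs
    )
    return total / len(pairs)

def mean_squared_error(labels, predictions):
    """Ordinary element-wise mean squared error, for comparison."""
    return sum((l - p) ** 2 for l, p in zip(labels, predictions)) / len(labels)

labels = [0.0, 0.5, 1.0]
shifted = [0.25, 0.75, 1.25]   # every prediction is too high by 0.25

pairwise = pairwise_squared_error(labels, shifted)
mse = mean_squared_error(labels, shifted)
print(pairwise, mse)
```

Under this reading, a prediction that is off by the same constant for every book incurs no pairwise loss even though its mean squared error is positive, which is worth keeping in mind when interpreting the reported loss values.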


Because TensorFlow 1.x builds a computational graph before running it, variables have no values until an initializer op is run in a session, while placeholders receive their values through `feed_dict`. <br>
The following code creates the initializer ops and an empty DataFrame to hold the result table, which will contain the top 10 recommendations for every user

In [11]:
init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

## Training 
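
The training cell splits the user-book matrix into mini-batches with `np.array_split`, which tolerates a row count that is not a multiple of the batch size. A small sketch, assuming the 3192 users reported by the preprocessing step and using row indices as a stand-in for the matrix rows:

```python
import numpy as np

num_users = 3192     # unique users reported after preprocessing
batch_size = 35

# Same computation as in the training cell: the fractional part is discarded.
num_batches = int(num_users / batch_size)

# Split row indices instead of the real matrix to inspect the batch sizes.
batches = np.array_split(np.arange(num_users), num_batches)
sizes = [len(b) for b in batches]

print(num_batches, min(sizes), max(sizes))
```

Because 3192 is not divisible by 35, the leftover rows are spread over the first few batches, so some batches hold 36 users instead of 35 and no user is dropped.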

In [12]:
with tf.Session() as session:
    epochs = 100
    batch_size = 35

    session.run(init)
    session.run(local_init)

    num_batches = int(user_book_matrix.shape[0] / batch_size)
    user_book_matrix = np.array_split(user_book_matrix, num_batches)
    
    for i in range(epochs):

        avg_cost = 0
        for batch in user_book_matrix:
            _, l = session.run([optimizer, loss], feed_dict={X: batch})
            avg_cost += l

        avg_cost /= num_batches

        print("epoch: {} Loss: {}".format(i + 1, avg_cost))

    user_book_matrix = np.concatenate(user_book_matrix, axis=0)

    preds = session.run(decoder_op, feed_dict={X: user_book_matrix})

    # DataFrame.append was removed in pandas 2.0; build the frame from the predictions directly
    pred_data = pd.DataFrame(preds)

    pred_data = pred_data.stack().reset_index(name='Book-Rating')
    pred_data.columns = ['User-ID', 'Book-Title', 'Book-Rating']
    pred_data['User-ID'] = pred_data['User-ID'].map(lambda value: users[value])
    pred_data['Book-Title'] = pred_data['Book-Title'].map(lambda value: books[value])
    
    keys = ['User-ID', 'Book-Title']
    index_1 = pred_data.set_index(keys).index
    index_2 = combined.set_index(keys).index

    top_ten_ranked = pred_data[~index_1.isin(index_2)]
    top_ten_ranked = top_ten_ranked.sort_values(['User-ID', 'Book-Rating'], ascending=[True, False])
    top_ten_ranked = top_ten_ranked.groupby('User-ID').head(10)

epoch: 1 Loss: 4.556532320085463
epoch: 2 Loss: 1.4511856171456012
epoch: 3 Loss: 0.1952440702653193
epoch: 4 Loss: 0.18817446945787786
epoch: 5 Loss: 0.18784083213124955
epoch: 6 Loss: 0.18767856601830368
epoch: 7 Loss: 0.18759499866883833
epoch: 8 Loss: 0.18755042454698584
epoch: 9 Loss: 0.18752664524120288
epoch: 10 Loss: 0.18750555829687432
epoch: 11 Loss: 0.18748681656606905
epoch: 12 Loss: 0.1874687537387177
epoch: 13 Loss: 0.18744990131357214
epoch: 14 Loss: 0.18742992589761923
epoch: 15 Loss: 0.18740914680145598
epoch: 16 Loss: 0.1873869028065231
epoch: 17 Loss: 0.18736390360109098
epoch: 18 Loss: 0.18733953774630369
epoch: 19 Loss: 0.1873143691938002
epoch: 20 Loss: 0.18728865535704645
epoch: 21 Loss: 0.187262495467951
epoch: 22 Loss: 0.18723561180816903
epoch: 23 Loss: 0.18720858830672044
epoch: 24 Loss: 0.1871812369797256
epoch: 25 Loss: 0.18715391873003362
epoch: 26 Loss: 0.1871266666349474
epoch: 27 Loss: 0.18709947185201958
epoch: 28 Loss: 0.18707264419440384
epoch: 29 Lo
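
The post-processing at the end of the training cell removes books a user has already rated and keeps the highest predictions per user. The sketch below reproduces that logic on made-up data; the `pred` and `seen` frames are hypothetical, and it keeps one book per user instead of ten.

```python
import pandas as pd

# Hypothetical predicted ratings for every (user, book) pair.
pred = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 2, 2],
    "Book-Title":  ["A", "B", "C", "A", "B", "C"],
    "Book-Rating": [0.9, 0.4, 0.7, 0.2, 0.8, 0.6],
})

# Hypothetical pairs that the users have already rated.
seen = pd.DataFrame({"User-ID": [1, 2], "Book-Title": ["A", "B"]})

keys = ["User-ID", "Book-Title"]
already_rated = pred.set_index(keys).index.isin(seen.set_index(keys).index)

top_n = (
    pred[~already_rated]                                      # drop known books
    .sort_values(["User-ID", "Book-Rating"], ascending=[True, False])
    .groupby("User-ID")
    .head(1)                                                  # the notebook uses 10
)
print(top_n)
```

Each user's best-scoring book is excluded here because it was already rated, so both users end up with book "C" as their recommendation.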

Top 10 results for user 278582, sorted by the normalized predicted rating

In [13]:
top_ten_ranked.loc[top_ten_ranked['User-ID'] == 278582]

Unnamed: 0,User-ID,Book-Title,Book-Rating
18660405,278582,The Lovely Bones: A Novel,0.881973
18659952,278582,The Da Vinci Code,0.872277
18658056,278582,Life of Pi,0.865832
18660492,278582,The Nanny Diaries: A Novel,0.851216
18656352,278582,Bridget Jones's Diary,0.850631
18661131,278582,"Tuesdays with Morrie: An Old Man, a Young Man,...",0.850195
18659596,278582,Suzanne's Diary for Nicholas,0.849884
18657783,278582,Interview with the Vampire,0.848153
18661340,278582,Where the Heart Is (Oprah's Book Club (Paperba...,0.848142
18658167,278582,Lucky : A Memoir,0.847287


Actual books rated by this user. From the book titles, we can see that the recommendation system is not doing a bad job of identifying the genres the user likes

In [14]:
book_rating.loc[book_rating['User-ID'] == 278582].sort_values(by=['Book-Rating'], ascending=False)

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title
174885,278582,0226848620,10,Chinese Bell Murders (Judge Dee Mysteries)
176582,278582,157566254X,10,"Skin Deep, Blood Red"
40008,278582,0441478123,10,The Left Hand of Darkness (Remembering Tomorrow)
174861,278582,0061044725,10,Search the Shadows
58156,278582,0451202503,10,The Songcatcher: A Ballad Novel
64570,278582,1400034779,10,The No. 1 Ladies' Detective Agency (Today Show...
175958,278582,0345350499,10,The Mists of Avalon
176314,278582,0449223558,9,Murdering Mr. Monti: A Merry Little Tale of Se...
174877,278582,0140277471,9,Blanche Cleans Up
176438,278582,0515136557,8,The Cat Who Brought Down the House
