<a href="https://colab.research.google.com/github/Ayushman0Singh/BookRecommendationSystem/blob/main/BookRecommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BOOK RECOMMENDATION**

In [None]:
# Testing: coming from DSlab

# Business Problem

Online book reading and selling websites like Kindle and Goodreads compete against each other on many factors. One of those important factors is their book recommendation system. A book recommendation system is designed to recommend books of interest to the buyer.


The purpose of a book recommendation system is to predict a buyer's interests and recommend books to them accordingly. A book recommendation system can take into account many parameters, like book content and book quality, by filtering user reviews. I will try to build a recommendation system for our given data set.

In [None]:
#importing necessary libraries 
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import random
# we will import libraries further as per need

In [None]:
from google.colab import drive # mounting drive
drive.mount('/content/drive')

We have been given 3 data sets. Let's have a look at all the data provided to us and its properties.

In [None]:
books = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone Projects/Unsupervised ML/Copy of Books.csv")
users = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone Projects/Unsupervised ML/Copy of Users.csv")
ratings = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone Projects/Unsupervised ML/Copy of Ratings.csv")

In [None]:
books.head(4) # checking the head and columns 

In [None]:
users.head(5) #first look at the given data

In [None]:
ratings.head(5) # checking the given data set

In [None]:
#dimensions of book dataframe
books.shape

In [None]:
#checking shape
users.shape

In [None]:
ratings.shape

In [None]:
# counts of each rating value in the ratings data set
ratings['Book-Rating'].value_counts()

Many of the ratings given by users are 0.

# Data Cleaning

Before moving on to the data visualisation and EDA, let's first make sure our data is ready to use.


**Checking for null values**

In [None]:
# Books data-frame null values
books.isnull().sum()

In [None]:
# Users data-frame null values
users.isnull().sum()

In [None]:
#checking the ratings df for null values
ratings.isnull().sum()

Whatever null values we have will be dealt with when we do feature engineering and apply constraints.

# Exploratory Data Analysis

**Rating Distribution**

In [None]:
# show the distribution of rating
plt.figure(figsize=(12, 6))
sns.countplot(x='Book-Rating', data=ratings)
plt.title('Rating Distribution')
plt.xticks(rotation=90)
plt.ylabel('Number of Ratings')
plt.show()
print('Average rating received across all books is {}.'.format(ratings['Book-Rating'].mean()))

Most of the ratings are zero



**HYPOTHESIS**: Most of the users are between the ages of 20 and 30.

In [None]:
# plotting the age of the users
plt.figure(figsize=(12, 6))
sns.histplot(data=users['Age'], bins=np.arange(0,100,10))
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')

This verifies our hypothesis: most of the users are in the 20-30 age group, followed by 30-40.

# Data Cleaning and Feature Engineering

We will use two methods for our book recommendation system: first a memory-based collaborative filtering model, then a model-based collaborative filtering model. We will not use content-based algorithms for recommendation, since we do not have enough individual features for the users and the books; the users data has only one extra feature. Moreover, we might run into the cold-start problem.

For this exercise we will be using two models:

**1) Memory based Collaborative filtering (using KNN)**

**2) Model based Collaborative filtering (using SVD)**


I tried to apply the above mentioned models to the full data set but I ran into memory problems. To solve these issues I applied a general threshold for our models. This also takes care of the cold start problem.
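
To get a feel for why the full data set can run out of memory, a rough back-of-the-envelope estimate helps. The sketch below computes the size of a dense user-item matrix. The helper name `dense_matrix_gib` and the user and book counts passed to it are hypothetical, chosen only for illustration; they are not the actual sizes of this data set.

```python
def dense_matrix_gib(n_users, n_items, bytes_per_value=8):
    """Approximate size in GiB of a dense n_users x n_items matrix.

    bytes_per_value=8 assumes float64 entries, the NumPy default.
    """
    return n_users * n_items * bytes_per_value / 2**30


# Hypothetical counts, for illustration only (not this data set's real sizes)
example_gib = dense_matrix_gib(n_users=100_000, n_items=200_000)
print('Dense matrix size: about {:.0f} GiB'.format(example_gib))
```

In the notebook, the real counts could be passed instead, for example `ratings['User-ID'].nunique()` and `ratings['ISBN'].nunique()`.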

To reduce the data size we will try to apply certain constraints on the data frame. We can apply many different types of constraints to the dataset. These constraints include:

1. **Popularity Threshold**: Minimum number of user-ratings for a book.
2. **Active User Threshold**: Minimum number of books rated by a user for that user to be included in the recommendation system.
3. **Regional Recommendation**: We will also recommend books regionally, so a user gets recommendations based on users from the same location. We will do this for one region only, to avoid memory problems.
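
The first two constraints can be sketched on a tiny, made-up ratings table. This is only an illustration under stated assumptions: the toy data, the names `toy_ratings`, `MIN_RATINGS_PER_USER` and `MIN_RATINGS_PER_BOOK`, and the thresholds of 2 are hypothetical, and the sketch uses `groupby(...).transform('count')`, which is a compact alternative and not the exact steps used in the cells that follow.

```python
import pandas as pd

# Hypothetical toy ratings table (values are made up for illustration)
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 1, 2, 2, 3],
    'Book-Title': ['A', 'B', 'C', 'A', 'B', 'A'],
    'Book-Rating': [5, 0, 8, 7, 0, 9],
})

MIN_RATINGS_PER_USER = 2  # hypothetical active user threshold
MIN_RATINGS_PER_BOOK = 2  # hypothetical popularity threshold

# Active user threshold: attach each row's per-user rating count, then filter
ratings_per_user = toy_ratings.groupby('User-ID')['Book-Rating'].transform('count')
toy_active = toy_ratings[ratings_per_user >= MIN_RATINGS_PER_USER]

# Popularity threshold: same idea, counted per book on the remaining rows
ratings_per_book = toy_active.groupby('Book-Title')['Book-Rating'].transform('count')
toy_popular = toy_active[ratings_per_book >= MIN_RATINGS_PER_BOOK]

print(toy_popular)
```

Note that the order matters: the book counts are taken after inactive users are removed, which mirrors the order used in this notebook.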

**Active User Threshold**


Remove users with fewer than 50 ratings (**inactive users**).

To do that, we apply value_counts on User-ID; each repetition of a user ID is one more book rating by the same user. Then we pick the users with at least 50 ratings and keep only their rows in our ratings data frame.
This also makes sure that all our users are consistent readers.

In [None]:
# checking the users and their number of ratings
counts1 = ratings['User-ID'].value_counts()
print(counts1)

In [None]:
#keeping users with at least 50 ratings
counts1_50 = counts1[counts1 >= 50].index # index of user-ids with at least 50 ratings
ratings = ratings[ratings['User-ID'].isin(counts1_50)]  # keeping only the ratings made by these active users

In [None]:
ratings

In [None]:
ratings['User-ID'].value_counts()


Starting from the original data set, we will be only looking at the popular books. In order to find out which books are popular, we combine books data with ratings data.

In [None]:
# merging ratings with books
combine_book_rating = pd.merge(ratings, books, how='inner', on='ISBN') # inner join, since we only want ratings whose ISBN has book info (ratings already holds only active users)
columns = ['Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'] # list of unnecessary columns
combine_book_rating = combine_book_rating.drop(columns, axis=1) # dropping those columns
combine_book_rating.head()

**Book Popularity Threshold**

Next we will apply the constraint that a book needs a minimum number of ratings to be considered in our recommendation system.

In [None]:
#collecting total rating counts
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['Book-Title']) # clearing null/nan values from Book-Title
#counting the number of ratings for each book and renaming the columns appropriately
book_rating_Count = combine_book_rating.groupby(by = ['Book-Title'])['Book-Rating'].count().reset_index().rename(columns = {'Book-Rating': 'totalRatingCount'})
book_rating_Count.head()

In [None]:
#combine with main data frame
#use left join since we want the rating count for all the books in the combine_book_rating data frame
rating_with_totalRatingCount = combine_book_rating.merge(book_rating_Count, left_on = 'Book-Title', right_on = 'Book-Title', how = 'left')
rating_with_totalRatingCount.head()

In [None]:
# looking at distribution of totalratingsCount
rating_with_totalRatingCount.describe()

In [None]:
#checking the quantiles of the total rating count closely to decide the threshold
print(rating_with_totalRatingCount['totalRatingCount'].quantile(np.arange(0.1, 1, .05)))

The quantiles above show the distribution of the total rating count, and many ratings belong to books with a rating count above ~30. Let's set the required count threshold at 50. This keeps the popular books, those rated by at least 50 users. It also makes the users' opinion of a book more reliable, since each included book has at least 50 ratings, good or bad.

In [None]:
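popularity_threshold = 50
# .copy() so that later edits do not act on a view of rating_with_totalRatingCount
rating_popular_book = rating_with_totalRatingCount[rating_with_totalRatingCount['totalRatingCount'] >= popularity_threshold].copy()
rating_popular_book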

# Collaborative Filtering Using k-Nearest Neighbors (kNN) / Memory based Model

**Applying a Country/regional Threshold**


In order to improve computing speed and not run into a “MemoryError”, I will limit our user data to users in India and the US, and then combine the user data with the rating data and the total rating count data.

Let's have a look at the data for users from these countries.

In [None]:
#merging the books+ratings data frame with the user info data frame
combined = rating_popular_book.merge(users, on='User-ID', how='left')
# users with location in India or the US (na=False skips rows with a missing location)
india_us_user_rating = combined[combined['Location'].str.contains("india|usa", na=False)]
india_us_user_rating = india_us_user_rating.drop('Age', axis=1)
india_us_user_rating.head()

**Implementing KNN**

We use the unsupervised nearest-neighbors algorithm from sklearn.neighbors. The algorithm we use to compute the nearest neighbors is “brute”, and we specify “metric=cosine” so that it measures the cosine distance (1 minus the cosine similarity) between rating vectors. Finally, we fit the model.
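
As a minimal sketch of what the cosine metric does, the toy example below fits the same kind of model on three made-up books and checks one distance by hand. The matrix `toy_ratings` and all names starting with `toy_` are hypothetical and exist only for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings: rows are books, columns are users, 0 = no rating
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],  # book A
    [5.0, 3.0, 0.0, 0.0],  # book B: rated by the same users as A
    [0.0, 0.0, 3.0, 7.0],  # book C: rated by a disjoint set of users
])

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute')
toy_knn.fit(csr_matrix(toy_ratings))

# Neighbours of book A (row 0), returned from nearest to farthest
toy_distances, toy_indices = toy_knn.kneighbors(toy_ratings[[0]], n_neighbors=3)

# Cosine distance between A and B computed by hand: 1 - (A.B) / (|A| |B|)
book_a, book_b = toy_ratings[0], toy_ratings[1]
manual_distance = 1 - book_a @ book_b / (np.linalg.norm(book_a) * np.linalg.norm(book_b))

print(toy_indices[0], toy_distances[0])
print(manual_distance)
```

Book C shares no raters with book A, so their rating vectors are orthogonal and the cosine distance between them is 1, the largest value possible for non-negative ratings.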

In [None]:
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

india_us_user_rating = india_us_user_rating.drop_duplicates(['User-ID', 'Book-Title'])  #dropping duplicate (user, book) pairs, since pivot needs each pair to be unique
india_us_user_rating_pivot = india_us_user_rating.pivot(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating').fillna(0) #filling nan values with zeros
plt.spy(india_us_user_rating_pivot) #checking the non-zero values in the matrix
india_us_user_rating_matrix = csr_matrix(india_us_user_rating_pivot.values)
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(india_us_user_rating_matrix)

In [None]:
plt.spy(india_us_user_rating_matrix)

In [None]:
india_us_user_rating_pivot.shape[1]

In [None]:
#generating recommendations
random_userbook_rating_index = np.random.choice(india_us_user_rating_pivot.shape[0]) #pick a random row from the pivot table
distances, indices = model_knn.kneighbors(india_us_user_rating_pivot.iloc[random_userbook_rating_index,].values.reshape(1, -1), n_neighbors = 6) # provide the row as features to the kNN
for i in range(0, len(distances.flatten())):   #loop through all the recommendations
    if i == 0:  #selected book 
        print('Recommendations for {0}:\n'.format(india_us_user_rating_pivot.index[random_userbook_rating_index])) 
    else:  
        #recommendations based on cosine distance of the books from selected book
        print('{0}: {1}, with distance of {2}:'.format(i,india_us_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))
        

In [None]:
random_userbook_rating_index = np.random.choice(india_us_user_rating_pivot.shape[0]) #pick a random row from the pivot table
distances, indices = model_knn.kneighbors(india_us_user_rating_pivot.iloc[random_userbook_rating_index,].values.reshape(1, -1), n_neighbors = 6) # provide the row as features to the kNN
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(india_us_user_rating_pivot.index[random_userbook_rating_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i,india_us_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

We have successfully implemented collaborative filtering using KNN. The user-item interaction matrix that we ended up with was quite small after all the constraints. Let's try to implement matrix factorisation **without any regional constraints**.

This time we will apply the **matrix factorisation (SVD)** method for recommendations. We assume that some latent factors drive the interactions between the users and the items. We then use SVD to build a low-rank approximation of the user-item interaction matrix; in the process, the items a user has not interacted with receive a predicted rating. We can rank these predictions to get the best recommendations.
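
The idea can be sketched on a small, made-up matrix before applying it to the real data. In the example below, the matrix `toy_ratings`, the constant `K_TOY` and the other `toy` names are hypothetical; only `svds` and the NumPy calls are real library functions.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical toy ratings: 4 users (rows) x 5 books (columns), 0 = no rating
toy_ratings = np.array([
    [9.0, 8.0, 0.0, 0.0, 1.0],
    [8.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 9.0, 8.0, 0.0],
    [1.0, 0.0, 8.0, 0.0, 9.0],
])

# Number of latent factors; with the default solver it must be
# smaller than min(toy_ratings.shape)
K_TOY = 2
U_toy, sigma_toy, Vt_toy = svds(toy_ratings, k=K_TOY)

# sigma_toy is 1-d, so it is turned into a diagonal matrix before multiplying back
toy_predictions = U_toy @ np.diag(sigma_toy) @ Vt_toy

print(U_toy.shape, sigma_toy.shape, Vt_toy.shape)
print(np.round(toy_predictions, 1))
```

The product has the same shape as the input, but entries that were 0 now generally hold non-zero values. Those values are what we later treat as predicted ratings for books the user has not rated.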

# **Matrix Factorisation** / Model based approach

In [None]:
#we will be using this data frame for the matrix factorisation method.
rating_popular_book = rating_popular_book.drop(columns=['ISBN','totalRatingCount']) #removing columns we no longer need

In [None]:
rating_popular_book['User-ID'].value_counts() #checking user-id counts
#since we are going to stratify on 'User-ID' for our train-test split, let's make sure all the user-ids have multiple instances

In [None]:
#counting # of user-ids with 1 rating
k = rating_popular_book['User-ID'].value_counts() #Series: index = user-id, value = number of ratings
(k == 1).sum()   #number of user-ids that appear only once

In [None]:
#removing the rows with a unique user-id, since we need more than 1 row per user to stratify.
rating_popular_book = rating_popular_book[rating_popular_book.duplicated(subset=["User-ID"], keep=False)] 

In [None]:
rating_popular_book = rating_popular_book.drop_duplicates()
rating_popular_book.head(10)

Since we will be developing an evaluation system for the recommender, we need to do a train-test split.
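
To show what stratifying on the user ID does, here is a minimal sketch on made-up data. The data frame `toy_interactions` and its values are hypothetical; the call to `train_test_split` uses the same arguments as the cell below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy interactions: 3 users with 5 ratings each
toy_interactions = pd.DataFrame({
    'User-ID': [1] * 5 + [2] * 5 + [3] * 5,
    'Book-Title': ['Book {}'.format(i) for i in range(15)],
    'Book-Rating': [5, 0, 8, 7, 0, 9, 3, 0, 6, 10, 4, 0, 7, 8, 2],
})

toy_train, toy_test = train_test_split(
    toy_interactions,
    stratify=toy_interactions['User-ID'],  # keep each user's share the same in both sets
    test_size=0.20,
    random_state=42,
)

# Every user appears in both sets: 4 ratings in train and 1 in test
print(toy_train['User-ID'].value_counts().sort_index())
print(toy_test['User-ID'].value_counts().sort_index())
```

Stratifying needs at least two rows per user, which is why users with a single rating were removed above.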

In [None]:
interactions_train_df, interactions_test_df = train_test_split(rating_popular_book, #spliting the user-item rating dataframe                                        
                                   stratify=rating_popular_book.loc[:,'User-ID'],          #stratify using user-id column
                                   test_size=0.20,                                   #using 20 percent data as test set
                                   random_state=42)

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

In [None]:
#Creating a user-item pivot table with users in rows and items in columns
pivot_matrix = interactions_train_df.pivot_table(index='User-ID', columns='Book-Title', values='Book-Rating', aggfunc='sum').fillna(0) #aggregating any duplicate entries #filling nan values with 0
plt.spy(pivot_matrix)  # visualising the sparse matrix
pivot_matrix

In [None]:
from scipy.sparse.linalg import svds
# The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15

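#Performs matrix factorization of the original user item matrix
#svds expects an array, sparse matrix or LinearOperator, so we pass the values as a float NumPy array
U, sigma, Vt = svds(pivot_matrix.to_numpy(dtype=float), k = NUMBER_OF_FACTORS_MF)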

In [None]:
#checking shapes
print(U.shape)
print(sigma.shape)
print(Vt.shape)

In [None]:
sigma

Sigma here is a 1-d array with 15 elements. We need to convert it to a diagonal matrix so that the dimensions match for the matrix multiplication.

In [None]:
#making sigma a diagonal matrix
sigma = np.diag(sigma)

In [None]:
#reconstructing a low-rank approximation of the original matrix, which gives a predicted rating for every user-book pair
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_user_predicted_ratings 

In [None]:
#checking if the matrix shape is the same as the original matrix
all_user_predicted_ratings.shape

In [None]:
# converting back to a dataframe
# defining index as book-title
# defining the columns as all the filtered user-ids
prediction_df = pd.DataFrame(all_user_predicted_ratings.transpose(), index= pivot_matrix.columns, columns=pivot_matrix.index) 

In [None]:
prediction_df

Now we have predicted ratings for all the items and users. All that's left is to pick the predicted ratings of previously uninteracted items for each individual user and sort them in descending order to get a recommendation list for that user.

We also have to ignore the books which the user has already read and rated. Let's try it for a random user, 638.

In [None]:
a = pivot_matrix.loc[638,:].reset_index() #making a data frame of ratings for a user
unread_books = a[a[638] < 0.1]['Book-Title']      # filtering the books user has not read  #uninteracted books hold 0 in the pivot, so a small threshold catches them all
unread_books  #books user hasn't interacted with yet

In [None]:
sorted_user_predictions = prediction_df[638].sort_values(ascending=False).reset_index().rename(columns = {638:'RecommendationStrength'})   #best recommendations for the user 
#making sure the recommendations are uninteracted books
recommendations = sorted_user_predictions[sorted_user_predictions['Book-Title'].isin(unread_books)].sort_values('RecommendationStrength', ascending = False)
recommendations.set_index('Book-Title', inplace=True) #setting books as index

In [None]:
#extracting the top 5 recommendations. 
list(recommendations.index[0:5])

Now that we have successfully gotten recommendations for a user, let's write a function to do the same for any chosen user.

In [None]:
#function for top 5 recommendations for a user
def recommend_book(user_id):
  a = pivot_matrix.loc[user_id,:].reset_index() #making a data frame of ratings for a user
  unread_books = a[a[user_id] < 0.1]['Book-Title']      # filtering the books user has not read, uninteracted items are rated zero. 
  sorted_user_predictions = prediction_df[user_id].sort_values(ascending=False).reset_index().rename(columns = {user_id:'RecommendationStrength'})# getting best recommendations from the reconstructed matrix
  recommendations = sorted_user_predictions[sorted_user_predictions['Book-Title'].isin(unread_books)].sort_values('RecommendationStrength', ascending = False) #making sure we are not recommending already interacted items. 
  recommendations.set_index('Book-Title', inplace=True) #setting index as book title
  return list(recommendations.index[0:5]) #extracting the enquired information

In [None]:
recommend_book(638)

In [None]:
prediction_df.columns

# Evaluation 

In [None]:
#setting User-ID as index in all our interactions data frames (full, train, test)
full_df_indexed = rating_popular_book.set_index('User-ID')
interactions_train_indexed_df = interactions_train_df.set_index('User-ID')
interactions_test_indexed_df = interactions_test_df.set_index('User-ID')

In [None]:
# Function for getting the set of items which a user has interacted with
def get_items_interacted(user,interactions_df):
  interacted_items = interactions_df.loc[[user], 'Book-Title']  #list-style .loc always returns a Series, even when the user has a single rating
  return set(interacted_items)  #converting to set

# Function for getting the set of items which a user has not interacted with in training set
def non_inter_items_train(user, seed = 42): 
  interacted_items = get_items_interacted(user, interactions_train_indexed_df)                            # taking all the interacted items from train set
  all_items = set(interactions_train_df['Book-Title'])                                                    # all the items in train set
  non_interacted_items = all_items - interacted_items                                                     # non-interacted items
  random.seed(seed)                                                                                       # defining a random seed for consistency across users
  non_interacted_items_sample = random.sample(sorted(non_interacted_items), 100)                          # taking 100 non interacted items (random.sample needs a sequence, not a set, on Python 3.11+)
  return set(non_interacted_items_sample)                                                                 # set of the 100 non-interacted items from the train

# Function to recommend the highest predicted rating content that the user hasn't seen yet
def recommend_items(user_id, items_to_ignore=(), topn=10):
  sorted_user_predictions = prediction_df[user_id].sort_values(ascending=False).reset_index().rename(columns = {user_id:'RecommendationStrength'})
  recommendations_df = sorted_user_predictions[~sorted_user_predictions['Book-Title'].isin(items_to_ignore)].sort_values('RecommendationStrength', ascending = False)
  return recommendations_df.head(topn)  #keeping only the topn highest-ranked items

# Function to verify whether a particular item_id was present in the set of top N recommended items
def top_n(book_name, recommended_items, topn):
  try:
      index = next(i for i, c in enumerate(recommended_items) if c == book_name)  #getting the item rank according to recommendation strength
  except StopIteration:   #book_name is not in recommended_items
      index = -1   #default value for index                                                            
  hit = int(index in range(0, topn))            # hit is integer of true or false/ true when rank is in topn
  return hit, index

We will be using the above defined functions to make our evaluator for the recommendation system. Now let's get to writing the evaluator.

### This evaluation method works as follows:

* ### For each user
    *  For each item the user has interacted with in the test set
        *  Sample 100 other items the user has never interacted with.
        *  Ask the recommender model to produce a ranked list of recommended items, from a set composed of the one interacted item and the 100 non-interacted items
        *  Compute the Top-N accuracy metrics for this user and interacted item from the ranked list of recommendations
* ### Aggregate the global Top-N accuracy metrics
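
The steps above can be sketched for a single user on made-up scores. This is a simplified illustration, not the evaluator defined below: the function `recall_at_n`, the item names and the scores are all hypothetical, and it uses 3 sampled items instead of 100 to keep the example readable.

```python
def recall_at_n(scores, held_out_items, negative_items, n):
    """Share of held-out items that rank in the top n among the sampled negatives."""
    hits = 0
    for item in held_out_items:
        # Candidate list: the sampled non-interacted items plus this one held-out item
        candidates = list(negative_items) + [item]
        ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
        hits += int(item in ranked[:n])
    return hits / len(held_out_items)


# Hypothetical predicted scores for one user (names and values are made up)
toy_scores = {'held_out_1': 0.9, 'held_out_2': 0.1,
              'neg_1': 0.5, 'neg_2': 0.4, 'neg_3': 0.3}
toy_held_out = ['held_out_1', 'held_out_2']
toy_negatives = ['neg_1', 'neg_2', 'neg_3']

toy_recall_at_2 = recall_at_n(toy_scores, toy_held_out, toy_negatives, n=2)
toy_recall_at_4 = recall_at_n(toy_scores, toy_held_out, toy_negatives, n=4)
print(toy_recall_at_2, toy_recall_at_4)
```

Here `held_out_1` outranks every sampled item and counts as a hit at N=2, while `held_out_2` ranks last and only counts once N covers the whole candidate list.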

In [None]:
# Function to evaluate the performance of model for each user
def evaluate_model_for_user(user_id):
  #getting items in test set
  interacted_values_testset = interactions_test_indexed_df.loc[[user_id]]  #list-style .loc keeps a DataFrame even for a single test rating
  person_interacted_items_testset = set(interacted_values_testset['Book-Title'])
  
  interacted_items_count_testset = len(person_interacted_items_testset) 
  
  # Getting a ranked recommendation list from the model for a given user
  person_recs_df = recommend_items(user_id, items_to_ignore=get_items_interacted(user_id, interactions_train_indexed_df),topn=10000000000)
  
  hits_at_5_count = 0
  hits_at_10_count = 0

  # For each item the user has interacted in test set
  for book in person_interacted_items_testset:

    #getting a random sample of 100 non-interacted items from train set and combining with our test set item
    non_interacted_items_sample = non_inter_items_train(user_id)
    items_to_filter_recs = non_interacted_items_sample.union(set([book]))

    # Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
    valid_recs_df = person_recs_df[person_recs_df['Book-Title'].isin(items_to_filter_recs)]                    
    valid_recs = valid_recs_df['Book-Title'].values
            
    # Verifying if the current interacted item is among the Top-N recommended items
    hit_at_5, index_at_5 = top_n(book, valid_recs, 5)
    hits_at_5_count += hit_at_5
    hit_at_10, index_at_10 = top_n(book, valid_recs, 10)
    hits_at_10_count += hit_at_10

  # Recall is the rate of the interacted items that are ranked among the Top-N recommended items
  recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
  recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

  user_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
  return user_metrics


# The overall performance (all users) is evaluated in the cells below


In [None]:
evaluate_model_for_user(28204)

In [None]:
# storing metrics of all the users in the test-set       
people_metrics = [] 
for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):    
  person_metrics = evaluate_model_for_user(person_id)  
  person_metrics['_person_id'] = person_id
  people_metrics.append(person_metrics)
            
print('%d users processed' % len(people_metrics))


In [None]:
# Evaluating global metrics 
detailed_results_df = pd.DataFrame(people_metrics).sort_values('interacted_count', ascending=False)
global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())

global_metrics = {'recall@5': global_recall_at_5,'recall@10': global_recall_at_10} 

print(global_metrics)                    

# Conclusion



This brings us to the end of our exercise.
We tried running a recommendation system on the whole data set, but we kept running into memory problems.

So, I put some constraints on the data set and tried collaborative filtering for our recommendations. We used both memory-based and model-based approaches.
I also developed a top-N evaluation system for my model-based collaborative filtering approach.

Thanks for reading!