# Recommender System - Let's get started with the basics

![Certificate.png](attachment:Certificate.png)

### __What are recommender systems ?__

Even though people's tastes may vary, they generally follow patterns. That is, there are similarities in the things that people tend to like. For example,  if you’ve recently purchased a book on Machine Learning in Python and you’ve enjoyed reading it, it’s very likely that you’ll also enjoy reading a book on Data Visualization. People also tend to have similar tastes to those of the people they’re close to in their lives.

And here comes the role of a recommender system. 

Recommender systems try to capture these patterns and similar behaviors to help predict what else you might like. They have many applications that you are probably already familiar with, and they are at play on many websites: for example, suggesting books on Amazon and movies on Netflix. In fact, much of what Netflix’s website shows is driven by customer selection. If a certain movie gets viewed frequently enough, Netflix’s recommender system ensures that the movie gets an increasing number of recommendations. Another example can be found in daily-use mobile apps, where a recommender engine is used to recommend anything from where to eat to what job to apply to. Social media sites like Facebook or LinkedIn regularly recommend friendships.

There are many examples of this type and they are growing in number every day. So, let’s take a closer look at the main benefits of using a recommendation system. One of the main advantages is that users get broader exposure to many different products they might be interested in. This exposure encourages users towards continual usage or purchase of those products.

#### __Types of recommender systems :__

There are generally 2 main types of recommendation systems: __Content-based__ and __Collaborative filtering.__ 


![Certificate.png](attachment:Certificate.png)

The main difference between the two can be summed up by the type of statement that a consumer might make. 

For instance, the main paradigm of a Content-based recommendation system is driven by the statement: “Show me more of the same of what I've liked before." Content-based systems try to figure out what a user's favorite aspects of an item are, and then make recommendations on items that share those aspects. 

Collaborative filtering is based on a user saying, “Tell me what's popular among my neighbors because I might like it too.” Collaborative filtering techniques find similar groups of users, and provide recommendations based on similar tastes within that group. In short, it assumes that a user might be interested in what similar users are interested in.

Also, there are Hybrid recommender systems, which combine various mechanisms.

In terms of implementing recommender systems, there are 2 types: __Memory-based__ and __Model-based__.

In memory-based approaches, we use the entire user-item dataset to generate recommendations. These approaches use statistical techniques to find users or items that are similar to each other. Examples of these techniques include Pearson Correlation, Cosine Similarity and Euclidean Distance, among others.

In model-based approaches, a model of users is developed in an attempt to learn their preferences. Models can be created using Machine Learning techniques like regression, clustering, classification, and so on.
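To make the three memory-based measures named above concrete, here is a minimal sketch with NumPy. The two rating vectors are made-up values for illustration, and the function names are hypothetical helpers, not part of any library.

```python
import numpy as np

# Hypothetical ratings that two users gave to the same five items.
user_a = np.array([5.0, 3.0, 4.0, 4.0, 1.0])
user_b = np.array([4.0, 2.0, 5.0, 3.0, 2.0])


def pearson_correlation(x, y):
    """Correlation of the mean-centered rating vectors, in [-1, 1]."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))


def cosine_similarity(x, y):
    """Cosine of the angle between the raw rating vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def euclidean_distance(x, y):
    """Straight-line distance; a smaller value means more similar."""
    return float(np.linalg.norm(x - y))


print("Pearson  :", pearson_correlation(user_a, user_b))
print("Cosine   :", cosine_similarity(user_a, user_b))
print("Euclidean:", euclidean_distance(user_a, user_b))
```

Pearson and cosine are similarities (higher is closer), while Euclidean is a distance (lower is closer), so the sort order has to be flipped when switching between them.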

### __Content-Based Recommender Systems__

Here we will dive into Content-Based Recommender Systems. So, let's get started.

A content-based recommendation system tries to recommend items to users, based on their profile. The user’s profile revolves around that user’s preferences and tastes. It is shaped based on user ratings, including the number of times that user has clicked on different items or perhaps, even liked those items. The recommendation process is based on the similarity between those items. Similarity, or closeness of items, is measured based on the similarity in the content of those items. When we say content, we’re talking about things like the item’s category, tag, genre, and so on. 

![Certificate.png](attachment:Certificate.png)

For example, if we have 4 movies, and if the user likes or rates the first 2 items, and if item 3 is similar to item 1, in terms of their genre, the engine will also recommend item 3 to the user.

In essence, this is what content-based recommender system engines do.

__Working of content-based recommender system :__

![Certificate.png](attachment:Certificate.png)

Let’s assume we have a dataset of only 6 movies. This dataset shows movies that our user has watched, and also the genre of each of the movies. For example, “Batman versus Superman” is in the Adventure, Super Hero genre. And “Guardians of the Galaxy” is in Comedy, Adventure, Super Hero, and Science Fiction genres. Let’s say the user has watched and rated 3 movies so far and she has given a rating of 2 out of 10 to the first movie, 10 out of 10 to the second movie, and an 8 out of 10 to the third. 

The task of the recommender engine is to recommend one of the 3 candidate movies to this user. Or, in other words, we want to predict what the user’s possible rating would be, of the 3 candidate movies if she were to watch them.

To achieve this, we have to build the user profile. First, we create a vector to show the user’s ratings for the movies that she’s already watched. We call it “input user ratings.”

![Certificate.png](attachment:Certificate.png)

Then, we encode the movies through the “One Hot Encoding” approach. The genres of the movies are used here as the feature set. We use the first 3 movies to make this matrix.

![Certificate.png](attachment:Certificate.png)

If we multiply these 2 matrices, we can get the “weighted feature set” for the movies. Let’s take a look at the result. This matrix is also called the “Weighted Genre Matrix,” and represents the interests of the user for each genre based on the movies that she’s watched.

![Certificate.png](attachment:Certificate.png)

Now, given the weighted genre matrix, we can shape the profile of our active user. Essentially, we can aggregate the weighted genres.

![Certificate-2.png](attachment:Certificate-2.png)

And then normalize them to find the user profile. 

![Certificate-3.png](attachment:Certificate-3.png)

It clearly indicates that she likes “super hero” movies more than other genres. We use this profile to figure out which movie is the best one to recommend to this user.
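The profile-building steps above can be sketched in a few lines of NumPy. The genres of the first two movies come from the text; the genres of the third movie, and the assumption that the ratings 2, 10 and 8 apply to the movies in this order, are illustrative guesses because the original figures are not reproduced here.

```python
import numpy as np

genres = ["Comedy", "Adventure", "Super Hero", "Sci-Fi"]

# One-hot genre matrix, one row per watched movie.
watched_genres = np.array([
    [0, 1, 1, 0],  # "Batman versus Superman": Adventure, Super Hero
    [1, 1, 1, 1],  # "Guardians of the Galaxy": Comedy, Adventure, Super Hero, Sci-Fi
    [1, 0, 1, 0],  # hypothetical third movie (assumed genres): Comedy, Super Hero
])

# "Input user ratings" vector (assumed to follow the row order above).
input_user_ratings = np.array([2, 10, 8])

# Weighted genre matrix: each movie's genre row scaled by its rating.
weighted_genre_matrix = watched_genres * input_user_ratings[:, None]

# Aggregate per genre, then normalize so the profile sums to 1.
aggregated = weighted_genre_matrix.sum(axis=0)   # [18, 12, 20, 10]
user_profile = aggregated / aggregated.sum()

for genre, weight in zip(genres, user_profile):
    print(f"{genre:<11}{weight:.2f}")
```

Under these assumptions the largest weight lands on "Super Hero", which matches the conclusion drawn in the text.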


In the same way, we can encode the other three candidate movies, which haven’t been watched by the user.

![Certificate.png](attachment:Certificate.png)

1. We simply multiply the “user-profile” matrix by the “candidate movie matrix”, which results in the “weighted movies” matrix. It shows the weight of each genre with respect to the user profile. 
2. Now, if we aggregate these weighted ratings, we get the active user’s possible interest level in these 3 movies. In essence, it’s our “recommendation” list, which we can sort to rank the movies and recommend them to the user. For example, we can say that “The Hitchhiker's Guide to the Galaxy” has the highest score in our list and is the best one to recommend to the user.

![Certificate.png](attachment:Certificate.png)

So, now we can come back, fill in the predicted ratings for the user, and recommend the movie with the highest rating score.
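The scoring step can be sketched as follows. The profile values, candidate titles and candidate genres are all hypothetical placeholders (the original figures are not available), so only the mechanics carry over: multiply, aggregate, sort.

```python
import numpy as np

genres = ["Comedy", "Adventure", "Super Hero", "Sci-Fi"]

# Hypothetical normalized user profile (one weight per genre).
user_profile = np.array([0.30, 0.20, 0.33, 0.17])

# Hypothetical candidate movies with one-hot encoded genres.
candidate_titles = ["Candidate A", "Candidate B", "Candidate C"]
candidate_genres = np.array([
    [1, 1, 0, 1],  # Comedy, Adventure, Sci-Fi
    [0, 0, 1, 0],  # Super Hero
    [1, 0, 0, 0],  # Comedy
])

# "Weighted movies" matrix: genre flags scaled by the profile weights.
weighted_movies = candidate_genres * user_profile

# Aggregating each row gives the predicted interest level per movie.
scores = weighted_movies.sum(axis=1)

# Sort from highest to lowest score to get the recommendation list.
ranking = sorted(zip(candidate_titles, scores), key=lambda pair: pair[1], reverse=True)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

The row sums are the same as the matrix-vector product `candidate_genres @ user_profile`; the two-step form is kept here only to mirror the steps in the text.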

__Recap :__

The recommendation in a content-based system is based on the user’s tastes and on the content, or feature set, of the items. Such a model is very efficient. However, in some cases it doesn’t work.

![Certificate.png](attachment:Certificate.png)

For example, assume that we have a movie in the “drama” genre, which the user has never watched. So, this genre would not be in her profile. Therefore, she’ll only get recommendations related to genres that are already in her profile, and the recommender engine may never recommend any movie within other genres. This problem can be solved by other types of recommender systems such as “Collaborative Filtering.”

### __Collaborative Filtering__

Here we’ll be covering another recommender system technique called Collaborative filtering. So let’s get started.

Collaborative filtering is based on the fact that relationships exist between products and people’s interests. Many recommendation systems use Collaborative filtering to find these relationships and to give an accurate recommendation of a product that the user might like or be interested in.

Collaborative filtering has basically two approaches: __User-based__ and __Item-based__.
1. User-based collaborative filtering is based on the user’s similarity or neighborhood.
2. Item-based collaborative filtering is based on similarity among items.

__Intuition behind the “user-based” approach :__

In user-based collaborative filtering, we have an active user for whom the recommendation is aimed. The collaborative filtering engine first looks for users who are similar, that is, users who share the active user’s rating patterns. Collaborative filtering bases this similarity on things like history, preference, and choices that users make when buying, watching, or enjoying something, for example, the movies that similar users have rated highly. Then, it uses the ratings from these similar users to predict the possible ratings by the active user for a movie that she has not previously watched. 

![Certificate.png](attachment:Certificate.png)

For instance, if 2 users are similar or are neighbors, in terms of their interest in movies, we can recommend a movie to the active user that her neighbor has already seen.
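As a rough illustration of the user-based idea, the sketch below predicts a missing rating as a similarity-weighted average of the neighbours' ratings. The small rating matrix is invented, and cosine similarity on raw ratings is just one simple choice among the similarity measures mentioned earlier.

```python
import numpy as np

# Rows are users, columns are movies; np.nan marks "not rated".
ratings = np.array([
    [9.0, 6.0, 8.0, np.nan],  # active user, movie 4 not yet watched
    [8.0, 5.0, 9.0, 9.0],     # neighbour 1 (similar taste)
    [2.0, 9.0, 3.0, 4.0],     # neighbour 2 (different taste)
])
active, target = 0, 3


def cosine(x, y):
    """Cosine similarity between two rating vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Compare users only on the movies the active user has rated.
seen = ~np.isnan(ratings[active])
neighbours = [u for u in range(len(ratings)) if u != active]
sims = np.array([cosine(ratings[active, seen], ratings[u, seen]) for u in neighbours])

# Similarity-weighted average of the neighbours' ratings for the target movie.
predicted = float(sims @ ratings[neighbours, target] / sims.sum())
print("similarities:", sims)
print("predicted rating:", round(predicted, 2))
```

The more similar neighbour pulls the prediction towards its own rating, which is the behaviour described above.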

__Intuition behind the “item-based” approach :__

In the item-based approach, similar items build neighborhoods on the behavior of users.

![Certificate.png](attachment:Certificate.png)

For example, Item 1 and Item 3 are considered neighbors, as they were positively rated by both User 1 and User 2. So, Item 1 can be recommended to User 3, as he has already shown interest in Item 3. Therefore, the recommendations here are based on the items in the neighborhood that a user might prefer.
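A small sketch of the item-based idea, assuming a made-up binary interaction matrix in which Item 1 and Item 3 were liked by both User 1 and User 2, and User 3 has liked only Item 3 so far. Item-to-item cosine similarity is computed over the user columns.

```python
import numpy as np

# Rows: User 1..3, columns: Item 1..3; 1 = positive rating, 0 = no rating.
interactions = np.array([
    [1, 0, 1],  # User 1 liked Item 1 and Item 3
    [1, 1, 1],  # User 2 liked all three items
    [0, 0, 1],  # User 3 liked only Item 3 so far
])

# Cosine similarity between item columns.
norms = np.linalg.norm(interactions, axis=0)
item_similarity = (interactions.T @ interactions) / np.outer(norms, norms)

# Score every item for User 3 by its similarity to the items he already liked.
user = 2
liked = interactions[user] == 1
scores = item_similarity[:, liked].sum(axis=1)
scores[liked] = -np.inf  # never re-recommend an item the user already has

recommended_item = int(scores.argmax())
print("Recommend Item", recommended_item + 1)
```

With this toy data Item 1 ends up closer to Item 3 than Item 2 does, so Item 1 is the one recommended to User 3.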

__Challenges :__

1. Data Sparsity : It happens when you have a large dataset of users who generally rate only a limited number of items. Collaborative filtering recommenders can only predict the score of an item if other users have rated it. Due to sparsity, we might not have enough ratings in the user-item dataset, which can make it impossible to provide proper recommendations.
    
2. Cold start : It refers to the difficulty the recommendation system has when there is a new user and, as such, a profile doesn’t exist for them yet. Cold start can also happen when we have a new item, which has not received a rating.
    
3. Scalability : Scalability can become an issue, as well. As the number of users or items increases and the amount of data expands, Collaborative filtering algorithms will begin to suffer drops in performance, simply due to growth in the similarity computation.

There are some solutions for each of these challenges, such as using hybrid recommenders or model-based approaches.

### __Hybrid Recommendations Approaches__

Most recommender systems now use a hybrid approach, combining collaborative filtering, content-based filtering, and other approaches. It can be implemented in several ways: 
1. By making content-based and collaborative-based predictions separately and then combining them. 
2. By adding content-based capabilities to a collaborative-based approach (and vice versa). 
3. By unifying the approaches into one model. 
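The first option in the list above (predict separately, then combine) can be as simple as a weighted average of the two predicted ratings. This is only a sketch: `blend_predictions` and its `weight` parameter are hypothetical names, and real systems may learn the weight or combine the scores in other ways.

```python
def blend_predictions(content_score, collaborative_score, weight=0.5):
    """Weighted average of two predicted ratings.

    `weight` is the share given to the content-based score; the rest
    goes to the collaborative score.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * content_score + (1.0 - weight) * collaborative_score


# Example: trust the collaborative prediction three times as much.
print(blend_predictions(8.0, 6.0, weight=0.25))
```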

Several studies that empirically compare the performance of hybrid methods with pure collaborative and content-based methods have demonstrated that hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommender systems, such as cold start and the sparsity problem.

Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the watching and searching habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

Other organizations use this method too. Facebook, for example, shows news that is important for you and for others in your network, and LinkedIn does the same.

# Book Recommender System Project :

![Book2.jpg](attachment:Book2.jpg)

How about we start with a simple project to get more insight into recommendation systems, using Machine Learning with the Python programming language? Sounds good, right?

A book recommendation system is designed to recommend books of interest to the buyer. The purpose of a book recommendation system is to predict buyer’s interest and recommend books to them accordingly. A book recommendation system can take into account many parameters like book content and book quality by filtering user reviews.

The dataset used in this project consists of 3 dataframes:
1. The first one is about books: a CSV file with a large collection of books and their details like book title, author, year of publication, publisher, etc.
2. The second dataset is about the users: the registered users with their id, location and age.
3. The third dataset is about ratings, provided by various users for different books.

In [4]:
import numpy as np
import pandas as pd

In [5]:
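# Importing the first dataset about the books

# The values in the CSV file are separated by semicolons, not commas,
# and the file is encoded in Latin-1.
# Malformed rows are skipped with a warning. `on_bad_lines` needs pandas 1.3 or newer;
# older versions used `error_bad_lines=False` instead.
books = pd.read_csv('BX-Books.csv', sep=';', on_bad_lines='warn', encoding='latin-1')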

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [3]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

_The features 'Image-URL-S', 'Image-URL-M' and 'Image-URL-L' hold image URLs for small, medium and large images, which are of no use in our analysis._

In [5]:
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]

In [6]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


_The features can be renamed for easy access._

In [7]:
books.rename(columns={'Book-Title':'title',
                      'Book-Author':'author',
                      'Year-Of-Publication':'year',
                      'Publisher':'publisher'}, inplace=True)

In [8]:
books.head(1)

Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press


_So this is the final books dataset we got._

In [9]:
# Importing the second dataset about the users

# `on_bad_lines` needs pandas 1.3 or newer; older versions used `error_bad_lines=False`
users = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='warn', encoding='latin-1')

In [10]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [11]:
# Let's rename the features for easy access

users.rename(columns={'User-ID':'user-id',
                      'Location':'location',
                      'Age':'age'}, inplace=True)

In [12]:
users.head(1)

Unnamed: 0,user-id,location,age
0,1,"nyc, new york, usa",


_This is the final dataset we have for the registered users._

In [13]:
# Importing the third dataset about the ratings provided by various users for different books

# `on_bad_lines` needs pandas 1.3 or newer; older versions used `error_bad_lines=False`
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding='latin-1')

In [14]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [15]:
ratings.rename(columns={'User-ID':'user-id',
                        'Book-Rating':'book-rating'}, inplace=True)

In [16]:
ratings.head(1)

Unnamed: 0,user-id,ISBN,book-rating
0,276725,034545104X,0


_This is the final dataset which includes the ratings for different types of books._

In [17]:
books.shape

(271360, 5)

_There are about 2.7 lakh (271,360) books in this dataset._

In [18]:
users.shape

(278858, 3)

_And there are about 2.8 lakh (278,858) users registered on the website._

In [19]:
ratings.shape

(1149780, 3)

_So altogether the dataset holds about 11.5 lakh (1,149,780) ratings for the different books._

### Approach Strategy 

Firstly, in this book recommendation system we are going to use the collaborative filtering strategy. Here we don't need any content features of the users or the books; rather, we rely on rating patterns to give an accurate recommendation of a product that a user might like or be interested in.

Suppose there are 2 users and both read a book-1. But user-1 had also read another book-2, so we can recommend this book-2 to user-2, as he might like it as well.

So we are going to start by building a user-item rating matrix: a matrix whose columns are users, whose index is books, and whose values are the ratings provided by each user for different books.

We will then search for neighbours in this matrix: books whose rating patterns across users are similar end up close together, and as a result, books will be recommended from within that neighbourhood.

### Flaw in the dataset

There is a flaw in this dataset. Till now we have been considering all the users and books available in the dataset for modelling, and this will create a problem in recommendation. 

__Constraint-1 :__ There are about 2.8 lakh users in this dataset, and there will definitely be some users who have only registered on the website or have read only one or two books. We cannot rely on such users for recommendations, because they will not have knowledge of a variety of books. We require users who have read a variety of books by different good writers, so that more knowledge can be extracted from their data.

A solution to this problem is to set a criterion for the users to be included in our modelling. So we will include only those users who have rated more than 200 books, each of whom can be considered a knowledgeable user.

__Constraint-2 :__ And the same logic goes for books too. There are about 2.7 lakh books, and obviously there will be some books which have zero ratings or very few ratings, or which were not even sold. These books can disrupt the recommendations and cannot be relied upon. Hence we can apply a constraint that includes only those books which have received at least 50 ratings from these users.

### Exploratory Data Analysis

So let's start with the EDA process, starting with users and total number of ratings provided by them.

In [20]:
ratings['user-id'].value_counts()

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
271728        1
245123        1
234886        1
259466        1
187812        1
Name: user-id, Length: 105283, dtype: int64

_The highest number of ratings, 13602 in total, was given by user 11676, who is a knowledgeable user for us._ 

In [21]:
# Let's find the number of total unique users

ratings['user-id'].nunique()

105283

_So out of about 2.8 lakh registered users, only about 1.05 lakh (105,283) have given a rating for any book, which is roughly 38% of the users. The remaining 62% or so haven't provided any rating._

In [22]:
# Now flag those users who have rated more than 200 books
# (note: `> 200` is strict; use `>= 200` if "at least 200" is wanted)

x = ratings['user-id'].value_counts()>200

In [23]:
x[x].shape

(899,)

_So these are our knowledgeable users, who have rated more than 200 books and will help in our model building._

In [24]:
# Now let's extract these 899 users

y = x[x].index
y

Int64Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352,
            110973, 235105,
            ...
            116122,  28634, 188951,  59727, 155916, 274808,  73681,   9856,
            268622,  44296],
           dtype='int64', length=899)

_These are the user-ids of those 899 users who have rated more than 200 books._

In [25]:
# Now we have to find the ratings provided by these 899 users only. 
# This we can extract from the 'ratings' dataframe using the following syntax

ratings = ratings[ratings['user-id'].isin(y)]
ratings

Unnamed: 0,user-id,ISBN,book-rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
1147615,275970,9626340762,8


_Initially there was a total of about 11.5 lakh ratings provided by about 1.05 lakh users. But those 899 users alone provided 5.2 lakh ratings._

### Merge datasets

Our data is distributed across 3 dataframes. So in this section we will merge the 'ratings' dataframe with the 'books' dataframe on the basis of ISBN, so that each user's rating and the details of the rated book sit together in one place.

In [26]:
ratings_with_books = ratings.merge(books, on='ISBN')
ratings_with_books

Unnamed: 0,user-id,ISBN,book-rating,title,author,year,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
...,...,...,...,...,...,...,...
487666,275970,1892145022,0,Here Is New York,E. B. White,1999,Little Bookroom
487667,275970,1931868123,0,There's a Porcupine in My Outhouse: Misadventu...,Mike Tougias,2002,Capital Books (VA)
487668,275970,3411086211,10,Die Biene.,Sybil GrÃ?Â¤fin SchÃ?Â¶nfeldt,1993,"Bibliographisches Institut, Mannheim"
487669,275970,3829021860,0,The Penis Book,Joseph Cohen,1999,Konemann


_Now the dataframe size has decreased and we have 4.8 lakh ratings with their corresponding books. For some ISBNs there is no information in the 'books' dataset, and hence around 40,000 ratings got removed._
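To see which ratings a merge like this drops, one option is a left merge with `indicator=True`, which adds a `_merge` column marking rows that found no partner. The sketch below uses tiny made-up frames (`ratings_demo`, `books_demo`) rather than the real data.

```python
import pandas as pd

# Toy stand-ins for the 'ratings' and 'books' dataframes.
ratings_demo = pd.DataFrame({
    "user-id": [1, 1, 2],
    "ISBN": ["A", "B", "C"],
    "book-rating": [10, 0, 7],
})
books_demo = pd.DataFrame({
    "ISBN": ["A", "B"],
    "title": ["Book A", "Book B"],
})

# A left merge keeps every rating; '_merge' tells us whether a book was found.
checked = ratings_demo.merge(books_demo, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["user-id", "ISBN", "book-rating"]])

# The default (inner) merge silently drops those unmatched ratings.
inner = ratings_demo.merge(books_demo, on="ISBN")
print(len(ratings_demo) - len(inner), "rating(s) dropped by the inner merge")
```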

### Constraint-2

So far we have solved the first constraint, which was to limit the users and include only those users who have rated more than 200 books.

Now let's work on the second constraint, which was to include only those books which have received at least 50 ratings from these users.

In [41]:
number_of_rating=ratings_with_books.groupby('title')['book-rating'].count().reset_index()

In [42]:
number_of_rating.rename(columns={'book-rating':'number of ratings'}, inplace=True)
number_of_rating

Unnamed: 0,title,number of ratings
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1
...,...,...
160264,Ã?Â?ber die Pflicht zum Ungehorsam gegen den S...,3
160265,Ã?Â?lpiraten.,1
160266,Ã?Â?rger mit Produkt X. Roman.,1
160267,Ã?Â?stlich der Berge.,1


### Merge datasets

In [43]:
# Let's merge 'number_of_rating' dataframe with 'ratings_with_books' dataframe on the basis of title of the book

final_rating = ratings_with_books.merge(number_of_rating, on='title')
final_rating

Unnamed: 0,user-id,ISBN,book-rating,title,author,year,publisher,number of ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
...,...,...,...,...,...,...,...,...
487666,275970,1892145022,0,Here Is New York,E. B. White,1999,Little Bookroom,1
487667,275970,1931868123,0,There's a Porcupine in My Outhouse: Misadventu...,Mike Tougias,2002,Capital Books (VA),1
487668,275970,3411086211,10,Die Biene.,Sybil GrÃ?Â¤fin SchÃ?Â¶nfeldt,1993,"Bibliographisches Institut, Mannheim",1
487669,275970,3829021860,0,The Penis Book,Joseph Cohen,1999,Konemann,1


_This is the final dataframe consisting of all the required book and rating details._

In [45]:
final_rating = final_rating[final_rating['number of ratings']>=50]

In [46]:
final_rating

Unnamed: 0,user-id,ISBN,book-rating,title,author,year,publisher,number of ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
...,...,...,...,...,...,...,...,...
236701,255489,0553579983,7,And Then You Die,Iris Johansen,1998,Bantam,50
236702,256407,0553579983,0,And Then You Die,Iris Johansen,1998,Bantam,50
236703,257204,0553579983,0,And Then You Die,Iris Johansen,1998,Bantam,50
236704,261829,0553579983,0,And Then You Die,Iris Johansen,1998,Bantam,50


__So this is our final dataframe satisfying both constraints: users who have rated more than 200 books, and books which have received at least 50 ratings from these users.__
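The count, merge and filter steps used above can also be collapsed with `groupby(...).transform('count')`, which writes the per-title count straight onto each row. This is an alternative sketch on a toy frame, not what the notebook ran; `min_ratings` stands in for the threshold of 50.

```python
import pandas as pd

# Toy stand-in for 'ratings_with_books'.
demo = pd.DataFrame({
    "user-id": [1, 2, 3, 1, 2],
    "title": ["Book A", "Book A", "Book A", "Book B", "Book C"],
    "book-rating": [8, 0, 10, 5, 7],
})
min_ratings = 3  # small threshold for the toy data; the notebook uses 50

# Per-title rating count, aligned with the original rows (no merge needed).
demo["number of ratings"] = demo.groupby("title")["book-rating"].transform("count")

# Keep only the rows of sufficiently rated titles.
popular = demo[demo["number of ratings"] >= min_ratings]
print(popular)
```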

In [55]:
# Let's check for duplicates 

final_rating[final_rating.duplicated(['user-id', 'title'])]

Unnamed: 0,user-id,ISBN,book-rating,title,author,year,publisher,number of ratings
702,11676,0440977096,10,The Secret Garden,Frances Hodgson Burnett,1989,Laure Leaf,79
710,11676,0879236493,9,The Secret Garden,Frances Hodgson Burnett,1987,David R. Godine Publisher,79
717,35050,0439099390,0,The Secret Garden,Frances Hodgson Burnett,1999,Scholastic,79
721,174791,0439099390,0,The Secret Garden,Frances Hodgson Burnett,1999,Scholastic,79
722,230522,0439099390,0,The Secret Garden,Frances Hodgson Burnett,1999,Scholastic,79
...,...,...,...,...,...,...,...,...
223478,172030,0821744941,9,Dark Angel,Anna Grant,1994,Zebra Books,54
223483,153662,0505524147,0,Dark Angel,Cassandra Collins,2000,Love Spell,54
228875,113270,0743439651,0,Still Waters,Jennifer Lauck,2001,Atria,83
228876,162639,0743439651,0,Still Waters,Jennifer Lauck,2001,Atria,83


_These are the 2000 rows where the same user has rated the same book title more than once (for example, under different ISBNs of one title), which are of no use to our model and hence can be removed._

In [58]:
# Reassign instead of using inplace=True: final_rating is a filtered slice,
# so an in-place drop raises pandas' SettingWithCopyWarning.
final_rating = final_rating.drop_duplicates(['user-id', 'title'])


In [57]:
final_rating.shape

(59850, 8)

_Successfully dropped and after all the data-cleaning we have this dataframe._

### Pivot table

This is the matrix whose columns are users, whose index is books, and whose values are the ratings provided by each user for different books.

In [59]:
book_pivot = final_rating.pivot_table(columns='user-id', index='title', values='book-rating')

In [60]:
book_pivot

title \ user-id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,,...,,,,0.0,,,,,0.0,
4 Blondes,,,,,,,,,,0.0,...,,,,,,,,,,
84 Charing Cross Road,,,,,,,,,,,...,,,,,,10.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,,,,7.0,,,,,7.0,,...,,,,,,0.0,,,,
You Belong To Me,,,,,,,,,,,...,,,,,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,0.0,,,,,0.0,...,,,,,,0.0,,,,
Zoya,,,,,,,,,,,...,,,,,,,,,,


_There are 742 books which received at least 50 ratings and 888 users who have rated more than 200 books. Earlier there were 899 users; 11 of them have dropped out because none of the books they rated reached 50 ratings._

In [61]:
# Replace all the NaN values with zero

book_pivot.fillna(0, inplace=True)
book_pivot

title / user-id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Modelling

We have finished preparing our dataset for modeling. Next we will use the nearest neighbors algorithm. It relies on the same neighbor search as the KNN classifier, but it is unsupervised: instead of predicting a label or forming clusters, it simply returns the points closest to a query point. With scikit-learn's default settings, closeness is measured by Euclidean distance.

However, the pivot table contains a lot of zero values, and storing and processing all of them in a dense matrix wastes memory and computing power. So we will convert the pivot table into a sparse matrix and then feed it to the model.

In [62]:
from scipy.sparse import csr_matrix

In [63]:
book_sparse = csr_matrix(book_pivot)

_Internally, this matrix stores only the non-zero values, so the zeros take up no space and do not add work when the distances are calculated._
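As a small illustration of what the sparse format saves, the sketch below converts a made-up, mostly-zero matrix and checks how many entries are actually stored. The `dense` array is hypothetical and far smaller than the real pivot table.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 ratings matrix that is mostly zeros.
dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 7.0],
])

sparse = csr_matrix(dense)

stored = sparse.nnz              # number of explicitly stored entries
density = stored / dense.size    # fraction of cells that are non-zero

print(stored, density)
# Converting back gives the same matrix, so no information is lost.
print(np.array_equal(sparse.toarray(), dense))
```

The same check (`book_sparse.nnz` divided by the number of cells in `book_pivot`) could be run on the real data to see how sparse the ratings are.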

In [65]:
from sklearn.neighbors import NearestNeighbors

In [66]:
model = NearestNeighbors(algorithm='brute')

_Here, brute means a brute-force search: the distance from the query point to every other point is computed, and the nearest points are returned._
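The idea behind a brute-force search can be sketched in a few lines of NumPy. This is only an illustration of the concept, not scikit-learn's actual implementation; the helper `brute_force_neighbors` and the sample points are hypothetical.

```python
import numpy as np

def brute_force_neighbors(points, query, k):
    """Return (distances, indices) of the k rows of `points` closest to `query`."""
    # Euclidean distance from the query to every row.
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))
    # Sort all distances and keep the k smallest.
    order = np.argsort(dists, kind="stable")[:k]
    return dists[order], order

# Hypothetical 2-D points; row 0 is used as the query.
points = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [1.0, 0.0],
    [10.0, 10.0],
])

dists, idx = brute_force_neighbors(points, points[0], k=3)
print(idx, dists)
```

Because the query is itself one of the rows, it comes back first with a distance of 0, which is the same behaviour we will see from the fitted model.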

In [67]:
# Now we will train the model and pass the sparse matrix
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')

_The model is trained successfully, and now it's time to test it and check whether it provides some good recommendations._

In [73]:
distances, suggestions = model.kneighbors(book_pivot.iloc[240, :].values.reshape(1, -1), n_neighbors=6)

In [74]:
distances

array([[ 0.        , 61.20457499, 68.78953409, 71.36525765, 73.17786551,
        74.57211275]])

_These are the distances to the 6 books closest to the book at the row index we provided, 240. The distance of 0 is the 240th book itself._

In [75]:
suggestions

array([[240, 238, 237, 241, 239, 184]], dtype=int64)

_These are the row indices of the 6 closest books to 240 returned by our model; the first one is book 240 itself._

In [76]:
book_pivot.index[240]

'Harry Potter and the Prisoner of Azkaban (Book 3)'

In [77]:
for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])

Index(['Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Sorcerer's Stone (Book 1)',
       'Harry Potter and the Order of the Phoenix (Book 5)', 'Exclusive'],
      dtype='object', name='title')


_The book at index 240, which we provided to our model, is Harry Potter and the Prisoner of Azkaban, and the model recommended mostly other books from the same series._

__Now let's create a function, where we provide the book name and get recommendations easily.__

In [79]:
np.where(book_pivot.index=='Animal Farm')[0][0]

54

_This is how we get the row index of a book, and we use the same logic in the following function._

In [87]:
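def recommend_book(book_name):
    # Look up the row position of the requested title in the pivot table
    book_id = np.where(book_pivot.index==book_name)[0][0]
    # Find the 6 nearest rows; the first one is the requested book itself
    distances, suggestions = model.kneighbors(book_pivot.iloc[book_id, :].values.reshape(1, -1), n_neighbors=6)

    # suggestions has shape (1, 6), so this loop prints a single Index of 6 titles
    for i in range(len(suggestions)):
        print(book_pivot.index[suggestions[i]])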

In [91]:
recommend_book('Animal Farm')

Index(['Animal Farm', 'Exclusive', 'Jacob Have I Loved', 'Second Nature',
       'Pleading Guilty', 'No Safe Place'],
      dtype='object', name='title')
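The function above raises an `IndexError` for a title that is not in the pivot table, and it lists the queried book among its own recommendations. A possible variant that handles both cases is sketched below. The name `recommend_titles`, its parameters, and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend_titles(pivot, fitted_model, book_name, n=5):
    """Return up to n titles nearest to book_name, excluding the book itself."""
    if book_name not in pivot.index:
        return []  # unknown title: nothing to recommend
    row = pivot.loc[[book_name]].to_numpy()
    # Ask for one extra neighbour because the query book is its own nearest match.
    k = min(n + 1, len(pivot))
    _, idx = fitted_model.kneighbors(row, n_neighbors=k)
    titles = [pivot.index[i] for i in idx[0]]
    return [t for t in titles if t != book_name][:n]

# Hypothetical toy pivot table: 3 books rated by 3 users.
toy_pivot = pd.DataFrame(
    [[9.0, 0.0, 8.0], [8.0, 0.0, 9.0], [0.0, 10.0, 0.0]],
    index=pd.Index(["Book A", "Book B", "Book C"], name="title"),
    columns=[1, 2, 3],
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.to_numpy())

print(recommend_titles(toy_pivot, toy_model, "Book A", n=1))
print(recommend_titles(toy_pivot, toy_model, "Missing Title"))
```

Returning a plain list instead of printing makes the result easier to reuse, for example in a web app or a test.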


With this, we come to the end of this machine learning project on a book recommendation system. As we can see, our model shows a pretty decent result. This is a wonderful unsupervised learning project in which we have done a lot of preprocessing. I hope you liked this article on building a book recommendation system with machine learning in Python.