<a href="https://colab.research.google.com/github/AJAkil/Book_recommendation/blob/master/book_recommendation_with_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Book Recommendation Engine with KNN(5)
### A simple book recommendation engine that recommends 5 similar books based on user ratings.

In [169]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

This project was part of freeCodeCamp, so I collected the data from their CDN.

In [170]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip --no-check-certificate

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2020-08-15 17:11:52--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.248.60.43, 159.65.216.232, 2604:a880:400:d1::89c:7001, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.248.60.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip.1’


2020-08-15 17:12:00 (3.28 MB/s) - ‘book-crossings.zip.1’ saved [26085508/26085508]

Archive:  book-crossings.zip
replace BX-Book-Ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace BX-Books.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace BX-Users.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


# Loading and Merging the Dataframes

In [171]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [172]:
# Exploring the data-frames
df_books.head(5)

Unnamed: 0,isbn,title,author
0,195153448,Classical Mythology,Mark P. O. Morford
1,2005018,Clara Callan,Richard Bruce Wright
2,60973129,Decision in Normandy,Carlo D'Este
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
4,393045218,The Mummies of Urumchi,E. J. W. Barber


In [173]:
df_ratings.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0


In [174]:
# Merging the data frames on the isbn column
combined_df = pd.merge(df_books, df_ratings, on='isbn')

combined_df.drop(['author'], axis=1, inplace=True)

In [175]:
combined_df.head()

Unnamed: 0,isbn,title,user,rating
0,195153448,Classical Mythology,2,0.0
1,2005018,Clara Callan,8,5.0
2,2005018,Clara Callan,11400,0.0
3,2005018,Clara Callan,11676,8.0
4,2005018,Clara Callan,41385,0.0


In [176]:
len(combined_df)
# dropping any row that has a null value
combined_df.dropna(inplace=True)

# Filtering the dataset based on number of ratings of book

Since the data set is so large, we filter out books based on their total number of ratings. To do so, we group the dataframe by title and count the ratings for each unique book.

In [177]:
# we group by the book title and count the ratings for each title
df_total_rating = (combined_df.groupby(by=['title'])['rating'].count().
                   reset_index().
                   rename(columns={'rating':'total_rating_count'})
                   [['title','total_rating_count']]
                   )

df_total_rating.head()

Unnamed: 0,title,total_rating_count
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1


We then merge the original dataframe with the per-title rating counts we just computed.

In [178]:
# We now combine this dataframe with our original one
combined_new_df = combined_df.merge(df_total_rating, left_on='title', right_on='title', how='inner')
combined_new_df.head()

Unnamed: 0,isbn,title,user,rating,total_rating_count
0,195153448,Classical Mythology,2,0.0,2
1,801319536,Classical Mythology,269782,7.0,2
2,2005018,Clara Callan,8,5.0,14
3,2005018,Clara Callan,11400,0.0,14
4,2005018,Clara Callan,11676,8.0,14
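
As an aside, the same per-title count column can also be attached without a separate merge by using `groupby(...).transform('count')`. The sketch below compares both routes on a small made-up frame (`toy` and its values are illustrative assumptions, not the Book-Crossing data):

```python
import pandas as pd

# Toy stand-in for combined_df (values are made up for illustration)
toy = pd.DataFrame({
    'title':  ['A', 'A', 'B', 'A', 'C', 'B'],
    'user':   [1, 2, 1, 3, 2, 3],
    'rating': [5.0, 0.0, 7.0, 8.0, 0.0, 9.0],
})

# Route 1: count ratings per title, then merge the counts back onto the rows
counts = (toy.groupby('title')['rating'].count()
             .reset_index()
             .rename(columns={'rating': 'total_rating_count'}))
merged = toy.merge(counts, on='title', how='inner')

# Route 2: broadcast each group's count onto its rows with transform
toy_t = toy.copy()
toy_t['total_rating_count'] = toy_t.groupby('title')['rating'].transform('count')

print(merged)
print(toy_t)
```

Both routes give every row the number of ratings its title received; the merge route is the one used in this notebook.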


In [194]:
# We look at the descriptive statistics of the per-title rating counts
df_total_rating['total_rating_count'].describe()

count    241090.000000
mean          4.277137
std          16.738045
min           1.000000
25%           1.000000
50%           1.000000
75%           3.000000
max        2502.000000
Name: total_rating_count, dtype: float64

So the median book has been rated only once. We need to explore the quantiles to get an idea of how to choose the threshold for filtering rows by total rating count.

In [180]:
df_total_rating['total_rating_count'].quantile([.25,.5,.75])

0.25    1.0
0.50    1.0
0.75    3.0
Name: total_rating_count, dtype: float64

Let's explore the upper quantiles.

In [181]:
df_total_rating['total_rating_count'].quantile(np.arange(0.99,1,0.001))

0.990      50.000
0.991      54.000
0.992      60.000
0.993      66.000
0.994      73.000
0.995      83.000
0.996      96.000
0.997     117.000
0.998     150.000
0.999     220.911
1.000    2502.000
Name: total_rating_count, dtype: float64

So only a small portion of the books has a high number of ratings. In particular, only about 1% of books have 50 or more ratings, and only about 0.3 to 0.4% have 100 or more. We therefore set the threshold for the total rating count to 100, which greatly reduces the size of our data set.
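
To get a feel for how a cutoff trades off against how much data survives, one can compute the share of items at or above several candidate thresholds. This is a minimal sketch on made-up counts; `toy_counts` and the helper `share_kept` are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Hypothetical per-book rating counts (made-up numbers, not the real data)
toy_counts = pd.Series([1, 1, 1, 2, 3, 5, 40, 120, 250, 2502])

def share_kept(counts, threshold):
    """Fraction of items whose count is at least `threshold`."""
    return (counts >= threshold).mean()

# Compare a few candidate thresholds side by side
for t in (50, 100, 200):
    print(t, share_kept(toy_counts, t))
```

On the real data the same helper could be applied to `df_total_rating['total_rating_count']`.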

In [195]:
threshold = 100
combined_new_df = combined_new_df[combined_new_df['total_rating_count'] >= threshold]
len(combined_new_df)

183799

In [196]:
combined_new_df.head()

Unnamed: 0,isbn,title,user,rating,total_rating_count
31,399135782,The Kitchen God's Wife,8,0.0,311
32,399135782,The Kitchen God's Wife,11676,9.0,311
33,399135782,The Kitchen God's Wife,29526,9.0,311
34,399135782,The Kitchen God's Wife,36836,0.0,311
35,399135782,The Kitchen God's Wife,46398,9.0,311


# Filtering the dataset based on number of user rating
Since there are lots of users who have rated very few books, we filter the dataset by the number of ratings each user has given, just as we did for the books.

In [184]:
# We count the ratings given by each user and store the counts in a data frame
user_rating_count_df = combined_new_df['user'].value_counts().rename_axis('user').reset_index(name='counts')
user_rating_count_df.head(5)

Unnamed: 0,user,counts
0,11676,1173
1,35859,521
2,76352,433
3,16795,426
4,153662,396


We explore the quantile ranges for the counts again

In [197]:
user_rating_count_df['counts'].quantile([.25,.5,.75])

0.25    1.0
0.50    1.0
0.75    3.0
Name: counts, dtype: float64

In [213]:
user_rating_count_df['counts'].quantile(np.arange(.99,1.0,.001))

0.990      70.000
0.991      74.000
0.992      80.000
0.993      88.000
0.994      97.000
0.995     105.000
0.996     117.000
0.997     137.000
0.998     170.214
0.999     212.869
1.000    1173.000
Name: counts, dtype: float64

We again see an interesting trend in the distribution. Only the top 1% of users have given 70 or more ratings. A good threshold may therefore be 200, which sits between the 99.8th and 99.9th percentiles.

In [214]:
# Then we merge the per-user counts into the filtered data frame
final_df = combined_new_df.merge(user_rating_count_df, on='user', how='left')

So we set the threshold to 200, keeping only users who have rated enough books to give a reliable signal.

In [215]:
# We filter the final dataframe, keeping users whose total number of ratings
# is greater than or equal to 200
threshold = 200
final_df = final_df[final_df['counts'] >= threshold]
len(final_df)

14890

In [216]:
final_pivot = final_df.pivot_table(index='title',columns='user',values='rating').fillna(0)
final_pivot_matrix = csr_matrix(final_pivot.values)
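
The pivot step turns the long table of (title, user, rating) rows into a title-by-user matrix, and `csr_matrix` then stores only its non-zero entries. A small sketch on made-up data (the `toy` frame is an assumption for illustration) shows what the two lines above produce:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up long-format ratings, one row per (title, user) pair
toy = pd.DataFrame({
    'title':  ['A', 'A', 'B', 'C', 'C'],
    'user':   [1, 2, 1, 2, 3],
    'rating': [8.0, 6.0, 9.0, 5.0, 7.0],
})

# Rows become titles, columns become users; missing pairs are filled with 0
toy_pivot = toy.pivot_table(index='title', columns='user', values='rating').fillna(0)

# The sparse matrix keeps only the non-zero cells
toy_sparse = csr_matrix(toy_pivot.values)

print(toy_pivot)
print(toy_sparse.shape, toy_sparse.nnz)
```

Note that `pivot_table` averages by default when the same title and user appear more than once, and that after `fillna(0)` an unrated book is indistinguishable from a book rated 0.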

In [217]:
# Building the model
model = NearestNeighbors(n_neighbors=5,metric='cosine')
model.fit(final_pivot_matrix)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
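
With `metric='cosine'`, two books are close when the same users rated them in similar proportions, regardless of the overall magnitude of the ratings. The sketch below, on a made-up 3x3 matrix, computes the cosine distance by hand and compares it with what `NearestNeighbors` returns (`X`, `cosine_distance`, and `toy_model` are illustrative names):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up rating rows: one row per book, one column per user
X = np.array([
    [8.0, 6.0, 0.0],
    [4.0, 3.0, 0.0],   # same direction as row 0, half the magnitude
    [0.0, 0.0, 7.0],   # no raters in common with row 0
])

def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

toy_model = NearestNeighbors(n_neighbors=3, metric='cosine').fit(X)
dist, idx = toy_model.kneighbors(X[[0]])

print(cosine_distance(X[0], X[1]))  # parallel rows
print(cosine_distance(X[0], X[2]))  # rows with no overlap
print(dist, idx)
```

Rows 0 and 1 are at distance 0 even though they are not identical, which is why several different books can tie at distance 0 in the results.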

In [232]:
book_names = final_pivot.index.to_list()
len(book_names)

914

In [192]:
# function to return recommended books
def get_recommends(book = ""):
  # Ask for 6 neighbours: the nearest one is normally the query book itself
  distances, indices = model.kneighbors(final_pivot.loc[book,:].values.reshape(1,-1), n_neighbors=6)
  distances, indices = distances.flatten(), indices.flatten()

  # Skip position 0 (assumed to be the query book) and keep [title, cosine distance] pairs
  books = [[final_pivot.index[indices[i]], distances[i]] for i in range(1, len(distances))]

  return [book, books]

Now you can try out some recommendations.

In [230]:
import pprint as pp
pp.pprint(get_recommends("Harry Potter and the Goblet of Fire (Book 4)"))

['Harry Potter and the Goblet of Fire (Book 4)',
 [["Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))",
   0.19848251],
  ['Secrets', 0.2397002],
  ['Harry Potter and the Chamber of Secrets (Book 2)', 0.24839234],
  ['Harry Potter and the Prisoner of Azkaban (Book 3)', 0.28263128],
  ['Night Whispers', 0.3156348]]]


In [231]:
pp.pprint(get_recommends("The Shining"))

['The Shining',
 [['Pet Sematary', 0.36982954],
  ['It', 0.38520074],
  ['Different Seasons', 0.39143884],
  ['The Girl Who Loved Tom Gordon : A Novel', 0.42693406],
  ['Skeleton Crew', 0.4291672]]]


In [242]:
import random as rd

def show_recommendations(no_of_reco=5):
  for i in range(no_of_reco):
    # rd.choice avoids the off-by-one of rd.randint, whose upper bound is inclusive
    pp.pprint(get_recommends(rd.choice(book_names)))


In [243]:
show_recommendations(3)

['Cold Fire',
 [['Dragon Tears', 0.13622105],
  ['Still Waters', 0.13622105],
  ['Long After Midnight', 0.13622105],
  ['The House of Thunder', 0.13622105],
  ['The Crush', 0.18654317]]]
['Mr. Murder',
 [['Mr. Murder', 0.0],
  ['The Key to Midnight', 0.0],
  ['The Right Hand of Evil', 0.0],
  ['Coming Home', 0.0],
  ['Guilty Pleasures (Anita Blake Vampire Hunter (Paperback))', 0.07152331]]]
['The Bad Place',
 [['The Apprentice', 0.1268717],
  ['The Laws of Our Fathers', 0.2407434],
  ['Move to Strike', 0.24462712],
  ['Shattered', 0.34417015],
  ['Long After Midnight', 0.34920865]]]
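
In the 'Mr. Murder' result above, the query book appears in its own recommendation list. This can happen when several books tie at distance 0, because the query is then not guaranteed to come back in position 0. One possible way around this, sketched here on a made-up matrix, is to drop the query by title rather than by position (`toy_pivot`, `toy_model`, and `recommend_excluding_query` are hypothetical names, not part of the notebook):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy title-by-user matrix; 'A' and 'B' point in the same direction,
# so they tie at cosine distance 0
toy_pivot = pd.DataFrame(
    [[8.0, 6.0, 0.0, 0.0],
     [4.0, 3.0, 0.0, 0.0],
     [0.0, 5.0, 7.0, 0.0],
     [0.0, 0.0, 6.0, 9.0]],
    index=['A', 'B', 'C', 'D'],
)
toy_model = NearestNeighbors(n_neighbors=3, metric='cosine').fit(toy_pivot.values)

def recommend_excluding_query(book, k=2):
    # Ask for one extra neighbour, then remove the query by title
    query = toy_pivot.loc[[book]].values
    distances, indices = toy_model.kneighbors(query, n_neighbors=k + 1)
    pairs = [(toy_pivot.index[i], float(d))
             for i, d in zip(indices.flatten(), distances.flatten())
             if toy_pivot.index[i] != book]
    # If the query was not among the k + 1 neighbours, trim back to k
    return [book, pairs[:k]]

print(recommend_excluding_query('A'))
```

The same filter could be applied inside `get_recommends` by comparing each returned title with `book`.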
