<h1 align=center><font size = 5>BOOK RECOMMENDATION SYSTEM</font></h1>

### Table of contents

<a href="#ref1">1. Preprocessing data</a>

<a href="#ref2">2. Content-based Recommendation System</a>

<a href="#ref3">3. The final recommendation table</a>

In [1]:
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import re
import matplotlib.style as style
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/goodbooks-10k/to_read.csv
/kaggle/input/goodbooks-10k/tags.csv
/kaggle/input/goodbooks-10k/book_tags.csv
/kaggle/input/goodbooks-10k/sample_book.xml
/kaggle/input/goodbooks-10k/ratings.csv
/kaggle/input/goodbooks-10k/books.csv


In [2]:
books = pd.read_csv("/kaggle/input/goodbooks-10k/books.csv")
book_tags = pd.read_csv("/kaggle/input/goodbooks-10k/book_tags.csv")
tags = pd.read_csv("/kaggle/input/goodbooks-10k/tags.csv")
ratings = pd.read_csv("/kaggle/input/goodbooks-10k/ratings.csv")

<a id="ref1"></a>
# Preprocessing Data

Reviewing the data in ***tags*** and ***book_tags***

In [3]:
tags.head()

Unnamed: 0,tag_id,tag_name
0,0,-
1,1,--1-
2,2,--10-
3,3,--12-
4,4,--122-


In [4]:
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


Both dataframes share the ***'tag_id'*** column, so they can be merged into one.

In [5]:
#Left join between book_tags and tags dataframe
book_tags = pd.merge(book_tags,tags,on='tag_id',how='left')
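
A left join keeps every row of the left dataframe even when no matching `tag_id` exists on the right. The sketch below is a minimal illustration on made-up data (the `toy_*` names and values are assumptions, not part of the dataset); it uses the `indicator` option of `pd.merge` to show which rows found a match.

```python
import pandas as pd

# Toy stand-ins for book_tags and tags (values are made up for illustration)
toy_book_tags = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2],
    "tag_id": [10, 11, 99],  # tag 99 has no entry in toy_tags
    "count": [500, 120, 7],
})
toy_tags = pd.DataFrame({
    "tag_id": [10, 11],
    "tag_name": ["fantasy", "favorites"],
})

# indicator=True adds a `_merge` column recording where each row came from
merged = pd.merge(toy_book_tags, toy_tags, on="tag_id", how="left", indicator=True)

# Rows marked "left_only" found no matching tag_id, so their tag_name is NaN
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print("rows without a tag name:", len(unmatched))
```

Running the same check on the real merge would show whether any tag ids in *book_tags* are missing from *tags*.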

Removing duplicated rows, if any.

In [6]:
book_tags.drop_duplicates(inplace=True)

**FINAL *book_tags*:**

In [7]:
book_tags

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name
0,1,30574,167697,to-read
1,1,11305,37174,fantasy
2,1,11557,34173,favorites
3,1,8717,12986,currently-reading
4,1,33114,12716,young-adult
...,...,...,...,...
999907,33288638,21303,7,neighbors
999908,33288638,17271,7,kindleunlimited
999909,33288638,1126,7,5-star-reads
999910,33288638,11478,7,fave-author


Reviewing the data in ***books***

In [8]:
books.head()

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


Removing columns that aren't needed for a content-based recommendation system and renaming some of the remaining ones for readability.

In [9]:
#Drop unnecessary columns
books.drop(columns=['id', 'best_book_id', 'work_id', 'isbn', 'isbn13', 'title','work_ratings_count','ratings_count','work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3','ratings_4', 'ratings_5', 'image_url','small_image_url'], inplace= True)

#Rename columns
books.rename(columns={'original_publication_year':'pub_year', 'original_title':'title', 'language_code':'language', 'average_rating':'rating'}, inplace=True)

Checking for nulls, if any.

In [10]:
books.isnull().sum()

book_id           0
books_count       0
authors           0
pub_year         21
title           585
language       1084
rating            0
dtype: int64

In [11]:
#Dropping the null values
books.dropna(inplace= True)

Splitting the values in the ***authors*** column into a ***list of authors*** to simplify future use.

In [12]:
#Using python's split string function to create a list of authors
books['authors'] = books.authors.str.split(',')

**FINAL *books*:**

In [13]:
books

Unnamed: 0,book_id,books_count,authors,pub_year,title,language,rating
0,2767052,272,[Suzanne Collins],2008.0,The Hunger Games,eng,4.34
1,3,491,"[J.K. Rowling, Mary GrandPré]",1997.0,Harry Potter and the Philosopher's Stone,eng,4.44
2,41865,226,[Stephenie Meyer],2005.0,Twilight,en-US,3.57
3,2657,487,[Harper Lee],1960.0,To Kill a Mockingbird,eng,4.25
4,4671,1356,[F. Scott Fitzgerald],1925.0,The Great Gatsby,eng,3.89
...,...,...,...,...,...,...,...
9994,15613,199,[Herman Melville],1924.0,"Billy Budd, Sailor",eng,3.09
9995,7130616,19,[Ilona Andrews],2010.0,Bayou Moon,eng,4.09
9996,208324,19,[Robert A. Caro],1990.0,Means of Ascent,eng,4.25
9997,77431,60,[Patrick O'Brian],1977.0,The Mauritius Command,eng,4.35


1. Keeping the authors as a list isn't suitable for the content-based recommendation technique, so we use ***one-hot encoding*** to convert the list into a vector in which each column corresponds to one possible author. This gives the categorical data a numeric form that the later calculations can work with.

2. Each author gets a column that contains either 1 or 0: 1 shows that the book is written by that author and 0 shows that it isn't.

In [14]:
book_authors = books.copy()

#For every row in the dataframe, iterate through the list of authors and place a 1 into the corresponding column
for index, row in books.iterrows():
    for author in row['authors']:
        book_authors.at[index, author] = 1
        
#Filling in the NaN values with 0 to show that a book isn't written by that author
book_authors = book_authors.fillna(0)
book_authors.head()

Unnamed: 0,book_id,books_count,authors,pub_year,title,language,rating,Suzanne Collins,J.K. Rowling,Mary GrandPré,...,Deeanne Gist,Peter Matthiessen,Tom Clancy,Steve Pieczenik,John Rawls,Oscar Hijuelos,Ben Okri,Miles Cameron,Ian Mortimer,Peggy Orenstein
0,2767052,272,[Suzanne Collins],2008.0,The Hunger Games,eng,4.34,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,491,"[J.K. Rowling, Mary GrandPré]",1997.0,Harry Potter and the Philosopher's Stone,eng,4.44,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,41865,226,[Stephenie Meyer],2005.0,Twilight,en-US,3.57,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2657,487,[Harper Lee],1960.0,To Kill a Mockingbird,eng,4.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4671,1356,[F. Scott Fitzgerald],1925.0,The Great Gatsby,eng,3.89,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
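
The same one-hot table can also be sketched without an explicit loop. The example below is an alternative under stated assumptions, not the approach used in this notebook: it starts from the raw comma-separated author strings (before they are split into lists), uses made-up `toy_*` names, and removes the whitespace around commas so that `"Mary GrandPré"` and `" Mary GrandPré"` do not become two separate columns.

```python
import pandas as pd

# Toy books with raw author strings (three rows borrowed from the table above)
toy_books = pd.DataFrame({
    "book_id": [3, 2657, 5],
    "authors": [
        "J.K. Rowling, Mary GrandPré",
        "Harper Lee",
        "J.K. Rowling, Mary GrandPré, Rufus Beck",
    ],
})

# Normalise ", " to "," so author names carry no leading or trailing spaces
cleaned = (
    toy_books.set_index("book_id")["authors"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.strip()
)

# Series.str.get_dummies builds one 0/1 column per distinct author
toy_book_authors = cleaned.str.get_dummies(sep=",")
print(toy_book_authors)
```

The result is already indexed by `book_id` and holds integer 0/1 values, so no `fillna` step is needed in this variant.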


In [15]:
#Generalising the format of author names for simplicity in future
book_authors.columns = [c.lower().strip().replace(' ', '_') for c in book_authors.columns]

#Setting book_id as index of the dataframe 
book_authors = book_authors.set_index(book_authors['book_id'])

#Dropping unnecessary columns
book_authors.drop(columns=['book_id', 'pub_year', 'title', 'rating', 'books_count', 'authors', 'language'], inplace=True)

**FINAL *book_authors*:**

In [16]:
book_authors.head()

book_id,suzanne_collins,j.k._rowling,mary_grandpré,stephenie_meyer,harper_lee,f._scott_fitzgerald,john_green,j.r.r._tolkien,j.d._salinger,dan_brown,...,deeanne_gist,peter_matthiessen,tom_clancy,steve_pieczenik,john_rawls,oscar_hijuelos,ben_okri,miles__cameron,ian_mortimer,peggy_orenstein
2767052,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41865,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2657,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4671,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<a id="ref2"></a>
# Content-based Recommendation System

A **Content-Based** or **Item-Item recommendation system** attempts to figure out which aspects of an item a user likes most, and then recommends items that share those aspects.

In this case, I'm going to build recommendations for a user based on the authors of the books they've read and the ratings they gave.

Creating an input user to recommend books to:

In [17]:
user_1 = pd.DataFrame([{'book_id':2767052, 'rating':5.0},{'book_id':3, 'rating':4.0}, {'book_id':41865, 'rating':4.5},{'book_id':15613, 'rating':3.0},{'book_id':2657, 'rating':2.5}])
user_1

Unnamed: 0,book_id,rating
0,2767052,5.0
1,3,4.0
2,41865,4.5
3,15613,3.0
4,2657,2.5


To learn the user's preferences, we take from *book_authors* (the dataframe of binary author columns) the rows for the books the user has already read.


In [18]:
user_authors = book_authors[book_authors.index.isin(user_1['book_id'].tolist())].reset_index(drop=True)
user_authors

Unnamed: 0,suzanne_collins,j.k._rowling,mary_grandpré,stephenie_meyer,harper_lee,f._scott_fitzgerald,john_green,j.r.r._tolkien,j.d._salinger,dan_brown,...,deeanne_gist,peter_matthiessen,tom_clancy,steve_pieczenik,john_rawls,oscar_hijuelos,ben_okri,miles__cameron,ian_mortimer,peggy_orenstein
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We turn the authors into weights by multiplying each row of the user's author table (*user_authors*) by the user's rating for that book and then summing the resulting table by column.
This is the dot product of a matrix and a vector, which pandas provides through its `dot` method.

In [19]:
user_1.rating

0    5.0
1    4.0
2    4.5
3    3.0
4    2.5
Name: rating, dtype: float64

In [20]:
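#Align the ratings with the row order of user_authors, which follows book_authors rather than user_1
read_ids = book_authors.index[book_authors.index.isin(user_1['book_id'])]
ratings_aligned = user_1.set_index('book_id')['rating'].reindex(read_ids).to_numpy()
#Dot product to get weights
userProfile = user_authors.transpose().dot(ratings_aligned)
#The user profile
userProfile

suzanne_collins    5.0
j.k._rowling       4.0
mary_grandpré      4.0
stephenie_meyer    4.5
harper_lee         2.5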
                  ... 
oscar_hijuelos     0.0
ben_okri           0.0
miles__cameron     0.0
ian_mortimer       0.0
peggy_orenstein    0.0
Length: 5271, dtype: float64
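
To make the dot product concrete, here is a small worked sketch on made-up data (the `toy_*` names, book ids and ratings are assumptions for illustration). It keeps `book_id` as the index on both sides, so `DataFrame.dot` matches each rating to its book by label even when the two are listed in a different order.

```python
import pandas as pd

# Toy one-hot table: rows are books (indexed by book_id), columns are authors
toy_user_authors = pd.DataFrame(
    {"author_a": [1, 0, 0], "author_b": [0, 1, 1], "author_c": [0, 1, 0]},
    index=pd.Index([10, 20, 30], name="book_id"),
)

# Ratings keyed by book_id, deliberately listed in a different order
toy_ratings = pd.Series({30: 2.0, 10: 5.0, 20: 4.0}, name="rating")

# After transposing, the frame's columns are book ids; dot() aligns them with
# the index of the ratings Series before multiplying and summing
toy_profile = toy_user_authors.T.dot(toy_ratings)
print(toy_profile)
# author_a: 5.0        (book 10)
# author_b: 4.0 + 2.0  (books 20 and 30)
# author_c: 4.0        (book 20)
```

Each author's weight is the sum of the ratings of the books they wrote, which is exactly the "multiply, then sum by column" step described above.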

*userProfile* contains the weights of the user's preferences. 
Using this, we can recommend books that satisfy the user's preferences.

With *userProfile* and *book_authors*, we compute a **weighted average** score for every book based on the user's profile and recommend the twenty highest-scoring books, which are the ones written by the authors the user rated. Books the user has already rated are scored as well, so they can appear in the list.

In [21]:
recommendation = (((book_authors*userProfile).sum(axis=1))/(userProfile.sum())).sort_values(ascending=False)
#Top 20 recommendations
recommendation.head(20)

book_id
99298      0.347826
6          0.347826
136251     0.347826
2          0.347826
1          0.347826
5          0.347826
15881      0.347826
3          0.347826
262430     0.217391
319644     0.217391
385742     0.217391
7260188    0.217391
385706     0.217391
6148028    0.217391
2767052    0.217391
7938275    0.217391
428263     0.195652
3609763    0.195652
1162543    0.195652
1656001    0.195652
dtype: float64
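
The scores above can be checked by hand. The sketch below rebuilds the calculation on a tiny table: the four named weights are taken from the user profile, while `other_rated_authors` is a hypothetical column that lumps the remaining weights (5.5 in total) so the profile still sums to 23. The book labels are placeholders.

```python
import pandas as pd

# User profile weights; the last entry is a hypothetical lump of the rest
toy_profile = pd.Series({
    "suzanne_collins": 5.0,
    "j.k._rowling": 4.0,
    "mary_grandpré": 4.0,
    "stephenie_meyer": 4.5,
    "other_rated_authors": 5.5,
})

# One-hot author rows for four placeholder books
toy_book_authors = pd.DataFrame(
    [
        [0, 1, 1, 0, 0],  # written by Rowling and GrandPré
        [1, 0, 0, 0, 0],  # written by Collins
        [0, 0, 0, 1, 0],  # written by Meyer
        [0, 0, 0, 0, 0],  # no author from the profile
    ],
    index=pd.Index(
        ["rowling_grandpre_book", "collins_book", "meyer_book", "unrelated_book"],
        name="book",
    ),
    columns=toy_profile.index,
)

# Weighted average: sum of matching author weights divided by the total weight
toy_scores = (toy_book_authors * toy_profile).sum(axis=1) / toy_profile.sum()
print(toy_scores.round(6))
```

The three non-zero scores are 8/23, 5/23 and 4.5/23, which round to the 0.347826, 0.217391 and 0.195652 seen in the output above.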

<a id="ref3"></a>
# The final recommendation table:

In [22]:
#The final recommendation table
books.loc[books['book_id'].isin(recommendation.head(20).index)].reset_index()

Unnamed: 0,index,book_id,books_count,authors,pub_year,title,language,rating
0,0,2767052,272,[Suzanne Collins],2008.0,The Hunger Games,eng,4.34
1,1,3,491,"[J.K. Rowling, Mary GrandPré]",1997.0,Harry Potter and the Philosopher's Stone,eng,4.44
2,16,6148028,201,[Suzanne Collins],2009.0,Catching Fire,eng,4.3
3,17,5,376,"[J.K. Rowling, Mary GrandPré, Rufus Beck]",1999.0,Harry Potter and the Prisoner of Azkaban,eng,4.53
4,19,7260188,239,[Suzanne Collins],2010.0,Mockingjay,eng,4.03
5,20,2,307,"[J.K. Rowling, Mary GrandPré]",2003.0,Harry Potter and the Order of the Phoenix,eng,4.46
6,22,15881,398,"[J.K. Rowling, Mary GrandPré]",1998.0,Harry Potter and the Chamber of Secrets,eng,4.37
7,23,6,332,"[J.K. Rowling, Mary GrandPré]",2000.0,Harry Potter and the Goblet of Fire,eng,4.53
8,24,136251,263,"[J.K. Rowling, Mary GrandPré]",2007.0,Harry Potter and the Deathly Hallows,eng,4.61
9,26,1,275,"[J.K. Rowling, Mary GrandPré]",2005.0,Harry Potter and the Half-Blood Prince,eng,4.54


### Advantages of a Content-Based Recommendation System

* Learns the user's preferences from the items they have rated
* Highly personalized, because each user's recommendations are built from their own profile