<a href="https://colab.research.google.com/github/Subhajit53/Book-Recommendation-System/blob/main/Book_Recommendation_System_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Book Recommendation System </u></b>

# **Problem Statement :**

##### Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other web services, recommender systems have become an increasingly large part of our lives. From e-commerce (suggesting articles that might interest buyers) to online advertising (showing users content that matches their preferences), recommender systems are now an unavoidable part of our daily online journeys.
##### In general terms, recommender systems are algorithms that suggest relevant items to users (movies to watch, texts to read, products to buy, or anything else, depending on the industry).
##### Recommender systems are critical in some industries: an effective one can generate substantial income or help a company stand out from its competitors. The main objective of this project is to create a book recommendation system for users.
### **Content  :**
The Book-Crossing dataset comprises 3 files. 
* **Users :**
##### Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values. 
* **Books :**
##### Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website. 
* **Ratings :**
##### Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
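
The explicit/implicit distinction matters in practice, because a 0 is not a low score but the absence of a score. A minimal sketch, assuming a toy ratings frame that reuses the column names described above (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy ratings frame; values are illustrative only
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['A', 'B', 'A', 'C'],
    'Book-Rating': [0, 8, 10, 0],
})

# Explicit ratings lie on the 1-10 scale; implicit interactions are stored as 0
explicit = toy_ratings[toy_ratings['Book-Rating'] > 0]
implicit = toy_ratings[toy_ratings['Book-Rating'] == 0]

print(len(explicit), len(implicit))
```

Averaging only the explicit subset avoids pulling mean ratings toward zero.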

# **Introduction :**

A recommendation system is a subclass of information filtering systems that seeks to predict the rating or preference a user might give to an item. In simple words, it is an algorithm that suggests relevant items to users, e.g., which movie to watch on Netflix, which product to buy on an e-commerce site, or which book to read on Kindle.

# **Building the Recommendation System :**

### **Reading the datasets :**

In [1]:
# Importing essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
# Reading the datasets
books = pd.read_csv('/content/drive/MyDrive/Unsupervised project/Books.csv')
users = pd.read_csv('/content/drive/MyDrive/Unsupervised project/Users.csv')
ratings = pd.read_csv('/content/drive/MyDrive/Unsupervised project/Ratings.csv')
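
The `Year-Of-Publication` values shown further down mix integers and strings, which typically happens when pandas infers the type of a messy column on its own. A hedged sketch, using an in-memory toy CSV instead of the real files, of reading such columns as strings up front so that nothing is silently converted:

```python
import io
import pandas as pd

# Toy CSV standing in for Books.csv; the rows are invented for illustration
toy_csv = io.StringIO(
    "ISBN,Book-Title,Year-Of-Publication\n"
    "0000000001,First Toy Book,2002\n"
    "0000000002,Second Toy Book,Toy Publisher\n"
)

# Reading both columns as strings keeps leading zeros in the ISBN
# and gives the year column one consistent type to clean later
toy_books = pd.read_csv(toy_csv, dtype={'ISBN': str, 'Year-Of-Publication': str})
print(toy_books.dtypes)
```

Whether this is preferable to converting after loading depends on how much cleaning the column needs.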



In [16]:
# Getting shapes of the datasets
print("Books Data:    ", books.shape)
print("Users Data:    ", users.shape)
print("Books-ratings: ", ratings.shape)

Books Data:     (271360, 8)
Users Data:     (278858, 3)
Books-ratings:  (1149780, 3)


### **Pre-processing the datasets :**

#### **Pre-processing Books Dataset :**

In [17]:
# Viewing the head of the dataset
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


We can see that there are 3 URL columns which are unnecessary for our purpose. So let's drop them.

In [18]:
# Dropping URL columns
books.drop(['Image-URL-S','Image-URL-M','Image-URL-L'], axis = 1, inplace = True)

In [19]:
# Viewing the head again
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


Cool! Now let's check for null values.

In [20]:
# Getting counts of null values
books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
dtype: int64

It seems we have a few null values in the Book-Author and Publisher columns. I'm going to set those null values to 'Other' so that we don't lose any data.

In [21]:
# Imputing the missing values
books.loc[books['Book-Author'].isnull(),'Book-Author'] = 'Other'
books.loc[books['Publisher'].isnull(),'Publisher'] = 'Other'
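
The same imputation can be written with `fillna`. This is only an equivalent sketch on a toy frame (the names and values are made up), not a change to the approach above:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing author and one missing publisher
toy = pd.DataFrame({
    'Book-Author': ['A. Writer', np.nan],
    'Publisher': [np.nan, 'Toy Press'],
})

# Fill missing values in both columns with the placeholder 'Other'
cols = ['Book-Author', 'Publisher']
toy[cols] = toy[cols].fillna('Other')

print(toy)
```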

Now let's check the Year-Of-Publication column, since in my experience year columns tend to be messy in dataframes.

In [24]:
# Getting unique publication years
books['Year-Of-Publication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

This looks really bad! Some publication years lie in the future, the same year appears both as an integer and as a string, and some entries even hold publisher names instead of years. We need to investigate that.
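
One way to locate the entries that are not years at all, without knowing the offending strings in advance, is to coerce the column to numbers and look at what fails. A minimal sketch on a toy frame (the rows are invented; `Toy Publisher` stands in for a misplaced publisher name):

```python
import pandas as pd

# Toy frame mixing ints, numeric strings and a misplaced publisher name
toy = pd.DataFrame({
    'Book-Title': ['T1', 'T2', 'T3', 'T4'],
    'Year-Of-Publication': [2002, '1999', 'Toy Publisher', 0],
})

# Values that cannot be parsed as numbers become NaN
years = pd.to_numeric(toy['Year-Of-Publication'], errors='coerce')

# Rows whose year could not be parsed are the ones to inspect by hand
non_numeric = toy[years.isna()]
print(non_numeric)
```

Numeric strings such as `'1999'` parse cleanly, so only the genuinely broken rows are flagged.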

In [28]:
# Setting column width so that we can see the column contents in full
# (None removes the limit; the older value -1 is deprecated in newer pandas)
pd.set_option('display.max_colwidth', None)


In [29]:
# Getting rows where the year of publication was mistakenly entered as DK Publishing Inc
books.loc[books['Year-Of-Publication'] == 'DK Publishing Inc',:]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)\"";Michael Teitelbaum""",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.01.THUMBZZZ.jpg
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)\"";James Buckley""",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.01.THUMBZZZ.jpg


In [30]:
# Getting rows where the year of publication was mistakenly entered as Gallimard
books.loc[books['Year-Of-Publication'] == 'Gallimard',:]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-Marie Gustave Le ClÃ?Â©zio""",2003,Gallimard,http://images.amazon.com/images/P/2070426769.01.THUMBZZZ.jpg


It seems the data in these rows was entered incorrectly: the author is fused into the title field, and the remaining values are shifted one column to the left. Below, I correct those rows using the values already present in them.

In [31]:
# Correcting the shifted rows, one row index at a time
books.loc[209538, 'Publisher'] = 'DK Publishing Inc'
books.loc[209538, 'Year-Of-Publication'] = 2000
books.loc[209538, 'Book-Title'] = 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)'
books.loc[209538, 'Book-Author'] = 'Michael Teitelbaum'

books.loc[221678, 'Publisher'] = 'DK Publishing Inc'
books.loc[221678, 'Year-Of-Publication'] = 2000
books.loc[221678, 'Book-Title'] = 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)'
books.loc[221678, 'Book-Author'] = 'James Buckley'

books.loc[220731, 'Publisher'] = 'Gallimard'
books.loc[220731, 'Year-Of-Publication'] = 2003
books.loc[220731, 'Book-Title'] = 'Peuple du ciel - Suivi de Les bergers'
# The author name is mis-encoded in the raw data; the intended spelling is used here
books.loc[220731, 'Book-Author'] = 'Jean-Marie Gustave Le Clézio'
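
The manual corrections above follow one pattern: split the fused title field and move the other values one column to the right. A hedged sketch of that pattern on a toy row; `split_fused_title` and `repair_shifted_row` are hypothetical helpers, and the toy string only imitates the shape of the broken rows shown earlier:

```python
import pandas as pd

def split_fused_title(fused):
    """Split 'title\\";author"' into (title, author), dropping stray quotes and backslashes."""
    title, author = fused.rsplit(';', 1)
    return title.strip('\\" '), author.strip('\\" ')

def repair_shifted_row(df, idx):
    """Move each value one column to the right and un-fuse the title (hypothetical helper)."""
    title, author = split_fused_title(df.loc[idx, 'Book-Title'])
    df.loc[idx, 'Publisher'] = df.loc[idx, 'Year-Of-Publication']
    df.loc[idx, 'Year-Of-Publication'] = df.loc[idx, 'Book-Author']  # still a string here
    df.loc[idx, 'Book-Author'] = author
    df.loc[idx, 'Book-Title'] = title

# Toy row imitating the broken layout: author fused into title, values shifted left
toy = pd.DataFrame({
    'Book-Title': ['Some Title (Level 4)\\";Jane Doe"'],
    'Book-Author': ['2000'],
    'Year-Of-Publication': ['Toy Publisher'],
    'Publisher': ['http://example.com/cover.jpg'],
})

repair_shifted_row(toy, 0)
print(toy.loc[0].tolist())
```

The year is left as a string in this sketch; converting the whole column to integers is a separate step.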

Now that the publisher names in the year column are fixed, let's work on the future years.

In [32]:
# Converting year of publication to integers
books['Year-Of-Publication'] = books['Year-Of-Publication'].astype(int)

In [33]:
# Printing year of publications
print(sorted(list(books['Year-Of-Publication'].unique())))

[0, 1376, 1378, 1806, 1897, 1900, 1901, 1902, 1904, 1906, 1908, 1909, 1910, 1911, 1914, 1917, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2010, 2011, 2012, 2020, 2021, 2024, 2026, 2030, 2037, 2038, 2050]


In [38]:
# Viewing rows with year of publication as 0
books[books['Year-Of-Publication']==0]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
176,3150000335,Kabale Und Liebe,Schiller,0,"Philipp Reclam, Jun Verlag GmbH"
188,342311360X,Die Liebe in Den Zelten,Gabriel Garcia Marquez,0,Deutscher Taschenbuch Verlag (DTV)
288,0571197639,Poisonwood Bible Edition Uk,Barbara Kingsolver,0,Faber Faber Inc
351,3596214629,"Herr Der Fliegen (Fiction, Poetry and Drama)",Golding,0,Fischer Taschenbuch Verlag GmbH
542,8845229041,Biblioteca Universale Rizzoli: Sulla Sponda Del Fiume Piedra,P Coelho,0,Fabbri - RCS Libri
...,...,...,...,...,...
270794,014029953X,Foe (Essential.penguin S.),J.M. Coetzee,0,Penguin Books Ltd
270913,0340571187,Postmens House,Maggie Hemingway,0,Trafalgar Square
271094,8427201079,El Misterio De Sittaford,Agatha Christie,0,Editorial Molino
271182,0887781721,Tom Penny,Tony German,0,P. Martin Associates


These appear to be erroneous entries, since the books were not published in the year 0. So, I'm going to impute these years with the mode.

In [43]:
# Getting value counts of different publication years
books['Year-Of-Publication'].value_counts()

2002    17627
1999    17431
2001    17359
2000    17234
1998    15766
        ...  
1910    1    
1934    1    
1914    1    
1904    1    
2037    1    
Name: Year-Of-Publication, Length: 116, dtype: int64

In [44]:
# Imputing publication year '0' with '2002'
books.loc[books['Year-Of-Publication'] == 0, 'Year-Of-Publication'] = 2002
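
The value 2002 is hard-coded above after reading it off the value counts. A small sketch, on a toy series with invented values, of deriving the mode programmatically while excluding the placeholder 0 (in the toy data 0 is the most frequent value, which is why it has to be excluded first):

```python
import pandas as pd

# Toy years; 0 marks a missing publication year
toy_years = pd.Series([2002, 2002, 1999, 0, 0, 0, 2001], name='Year-Of-Publication')

# Compute the mode over valid years only, so the placeholder cannot win
valid = toy_years[toy_years != 0]
mode_year = valid.mode().iloc[0]

# Replace every placeholder 0 with the mode
imputed = toy_years.replace(0, mode_year)
print(mode_year, imputed.tolist())
```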

Now, let's fix the rows whose year of publication is greater than 2021. I simply looked these books up online and corrected the years by hand.

In [54]:
# Changing publication years of books having years greater than 2021
books.loc[books['Year-Of-Publication']==2024, 'Year-Of-Publication'] = 2013
books.loc[books['Year-Of-Publication']==2026, 'Year-Of-Publication'] = 1996
books.loc[37487, 'Year-Of-Publication'] = 1991
books.loc[55676, 'Year-Of-Publication'] = 2016
books.loc[78168, 'Year-Of-Publication'] = 2001
books.loc[192993, 'Year-Of-Publication'] = 1999
books.loc[240169, 'Year-Of-Publication'] = 1987
books.loc[228173, 'Year-Of-Publication'] = 1925
books.loc[260974, 'Year-Of-Publication'] = 1991
books.loc[books['Year-Of-Publication']==2037, 'Year-Of-Publication'] = 1937
books.loc[books['Year-Of-Publication']==2038, 'Year-Of-Publication'] = 1952
books.loc[80264, 'Year-Of-Publication'] = 1871
books.loc[97826, 'Year-Of-Publication'] = 1942
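
Before correcting years by hand, it helps to list the rows that exceed the cutoff and, afterwards, to confirm that none remain. A minimal sketch on a toy frame; the titles, the `corrections` mapping and its years are hypothetical and only illustrate the workflow:

```python
import pandas as pd

CUTOFF_YEAR = 2021  # assumption: same threshold as in the text above

toy = pd.DataFrame({
    'Book-Title': ['T1', 'T2', 'T3'],
    'Year-Of-Publication': [1999, 2030, 2050],
})

# Rows with an implausible future year, to be looked up manually
suspicious = toy[toy['Year-Of-Publication'] > CUTOFF_YEAR]
print(suspicious)

# Hypothetical manual corrections, keyed by row index
corrections = {1: 1930, 2: 1950}
for idx, year in corrections.items():
    toy.loc[idx, 'Year-Of-Publication'] = year

# Sanity check: no future years should be left
remaining = (toy['Year-Of-Publication'] > CUTOFF_YEAR).sum()
print(remaining)
```

Keeping the corrections in one mapping makes it easier to review them later than a sequence of separate assignments.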

Now, let's drop the duplicate rows.

In [56]:
# Drop duplicate rows
books.drop_duplicates(keep='last', inplace=True) 
books.reset_index(drop = True, inplace = True)
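
It can be worth counting duplicates before dropping them, and checking whether rows are duplicated in full or only by key. A sketch on a toy frame with invented values, assuming ISBN is the key of interest:

```python
import pandas as pd

toy = pd.DataFrame({
    'ISBN': ['A', 'A', 'B', 'B'],
    'Book-Title': ['T1', 'T1', 'T2', 'T2 (reissue)'],
})

# Rows identical in every column
n_full_dupes = toy.duplicated().sum()

# Rows that merely repeat an ISBN, even if other columns differ
n_isbn_dupes = toy.duplicated(subset='ISBN').sum()

# Same call as above: drops only fully identical rows, keeping the last copy
deduped = toy.drop_duplicates(keep='last').reset_index(drop=True)
print(n_full_dupes, n_isbn_dupes, len(deduped))
```

`drop_duplicates` without `subset` leaves rows that share an ISBN but differ elsewhere, so the two counts can disagree.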

In [57]:
# Getting a final info of the dataset
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271360 non-null  object
 3   Year-Of-Publication  271360 non-null  int64 
 4   Publisher            271360 non-null  object
dtypes: int64(1), object(4)
memory usage: 10.4+ MB


Great! We are done with the Books dataset. Now, let's begin with the Users dataset.