<a href="https://colab.research.google.com/github/PuroshotamSingh/Book-Recommendation-System/blob/main/Book_Recommendation_System_Puroshotam_Kumar_Singh_Capstone_Project_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**

Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other
web services, recommender systems have become an increasingly large part of our lives. From
e-commerce (suggesting articles that might interest buyers) to online advertising
(suggesting content that matches users' preferences), recommender systems are
now hard to avoid in our daily online journeys.
<br>
In general terms, recommender systems are algorithms that suggest relevant
items to users (movies to watch, text to read, products to buy, or anything else,
depending on the industry).<br>
Recommender systems are critical in some industries: when they work well, they can
generate substantial revenue or help a company stand out from its
competitors. **The main objective of this project is to create a book recommendation system for users.**

# **Dataset information**

The Book-Crossing dataset comprises 3 files.<br>
● **Users** :<br>
Contains the users. Note that user IDs (User-ID) have been anonymized and map to
integers. Demographic data is provided (Location, Age) if available. Otherwise, these
fields contain NULL values.<br>
● **Books** : <br>
Books are identified by their respective ISBN. Invalid ISBNs have already been removed
from the dataset. Moreover, some content-based information is given (Book-Title,
Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web
Services. Note that in the case of several authors, only the first is provided. URLs linking
to cover images are also given, appearing in three different flavors (Image-URL-S,
Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the
Amazon website.<br>
● **Ratings** :<br> 
Contains the book rating information. Ratings (Book-Rating) are either explicit,
expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit,
expressed by 0.
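
Because a rating of 0 marks an implicit interaction rather than a low score, it usually helps to separate the two kinds of ratings before computing any averages. The snippet below is a minimal sketch on a made-up `toy_ratings` frame (the rows and the variable names are illustrative assumptions, not values from the dataset); only the column names come from the description above.

```python
import pandas as pd

# Toy stand-in for the Ratings file; these rows are illustrative only.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 5, 0],
})

# Book-Rating == 0 is an implicit interaction; 1-10 is an explicit score.
is_explicit = toy_ratings["Book-Rating"] > 0
explicit_ratings = toy_ratings[is_explicit]
implicit_ratings = toy_ratings[~is_explicit]

print(len(explicit_ratings), "explicit,", len(implicit_ratings), "implicit")

# Averages are only meaningful on the explicit subset.
print(explicit_ratings["Book-Rating"].mean())
```

Keeping the zeros in a mean would pull every book's average down, which is why the split is done first in this sketch.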

# **Importing libraries**

In [1]:
# Importing libraries

import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

# To suppress the warning messages

import warnings
warnings.filterwarnings('ignore')


# **Dataset initialization**

### **Loading Users data.**

In [2]:
# Loading Users data and creating dataframe

users = pd.read_csv('/content/drive/MyDrive/Copy of Users.csv')

In [3]:
# Printing first 10 rows of users dataframe

users.head(10)

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |
| 5 | 6 | santa monica, california, usa | 61.0 |
| 6 | 7 | washington, dc, usa | NaN |
| 7 | 8 | timmins, ontario, canada | NaN |
| 8 | 9 | germantown, tennessee, usa | NaN |
| 9 | 10 | albacete, wisconsin, spain | 26.0 |


In [10]:
# Shape of dataset

users.shape

(278858, 3)

* **Number of instances: 278858**
* **Number of attributes: 3**

In [8]:
# Columns of users data

users.columns

Index(['User-ID', 'Location', 'Age'], dtype='object')

* **The users dataset has three features: User-ID, Location and Age.**

### **Loading Books data.**

In [4]:
# Loading Books data

books = pd.read_csv('/content/drive/MyDrive/Copy of Books.csv')

In [6]:
# Let's see the first 3 rows of the books dataset

books.head(3)

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |


In [11]:
# Shape of dataset

books.shape

(271360, 8)

* **Number of instances: 271360**
* **Number of attributes: 8**

In [12]:
# Columns of books data

books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

* **The books dataset has 8 features: 'ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'.**
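
ISBNs are identifiers, not numbers: they can start with zeros (as the cover-image URLs above show) and can end in `X`. The sketch below illustrates, on a two-row in-memory CSV built from titles shown above, how `pd.read_csv` drops leading zeros when a column is all digits and how passing `dtype` avoids that. The `csv_text`, `inferred`, and `as_text` names are assumptions for illustration. In the full file pandas may already infer strings because some ISBNs contain `X`, so treat this as a defensive habit rather than a required fix.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the Books file.
csv_text = (
    "ISBN,Book-Title\n"
    "0195153448,Classical Mythology\n"
    "0002005018,Clara Callan\n"
)

# Default type inference parses an all-digit column as integers,
# which silently drops the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps every ISBN exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ISBN": str})

print(inferred["ISBN"].tolist())
print(as_text["ISBN"].tolist())
```

Reading the key as text also keeps it consistent with the ISBN column in the ratings data, which matters if the two frames are joined on ISBN later.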

### **Loading Ratings data.**

In [13]:
# Ratings data

ratings = pd.read_csv('/content/drive/MyDrive/Copy of Ratings.csv')

In [14]:
# Let's see the first 5 rows of the ratings dataset

ratings.head()

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [15]:
# Shape of dataset

ratings.shape

(1149780, 3)

* **Number of instances: 1149780**
* **Number of attributes: 3**

In [16]:
# Columns of ratings data

ratings.columns

Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

* **The ratings dataset has three features: User-ID, ISBN and Book-Rating.**

# **EDA on Users dataset**

### **Let's check for null values.**

In [17]:
# Checking for total null values in each column

users.isnull().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

In [18]:
# Percentage of null values in each column

print(100*(users.isnull().sum()/len(users.index)).sort_values(ascending=False))

Age         39.719857
Location     0.000000
User-ID      0.000000
dtype: float64


* **The Age column has around 40% null values (110762 of 278858 rows).**
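
The percentage above divides the per-column null count by the number of rows. An equivalent and slightly shorter form takes the mean of the boolean null mask. The sketch below shows this on a small made-up `toy_users` frame (an assumption for illustration, not the real file) and then re-derives the reported Age figure from the counts printed earlier.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Users file; the rows are illustrative only.
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Location": ["a", "b", "c", "d", "e"],
    "Age": [np.nan, 18.0, np.nan, 17.0, 61.0],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of nulls per column as a percentage.
null_pct = toy_users.isnull().mean().mul(100).sort_values(ascending=False)
print(null_pct)

# Cross-check of the real Age figure from the counts reported above.
age_null_pct = 110762 / 278858 * 100
print(round(age_null_pct, 2))
```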