<a href="https://colab.research.google.com/github/Faraz-Khan02/Book-Recommendation-System/blob/main/Book_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Book Recommendation System**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**   -     Faraz Faisal Khan


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other web services, recommender systems have taken an ever larger place in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now unavoidable in our daily online journeys.
In general terms, recommender systems are algorithms that suggest relevant items to users (items being movies to watch, texts to read, products to buy, or anything else, depending on the industry).
Recommender systems are critical in some industries: when they are efficient they can generate a huge amount of income, and they can also be a way to stand out significantly from competitors. The main objective of this project is to create a book recommendation system for users.**

# ***Let's Begin !***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
books = pd.read_csv("/content/drive/MyDrive/Capstone Project - 4/Books.csv")
users = pd.read_csv("/content/drive/MyDrive/Capstone Project - 4/Users.csv")
ratings = pd.read_csv("/content/drive/MyDrive/Capstone Project - 4/Ratings.csv")

## **Books Data**

In [None]:
# Books data first look
books.head()

The Books data contains the following features:

*   **ISBN** : The International Standard Book Number (ISBN) of the book.
*   **Book-Title** : The title of the book.
*   **Book-Author** : The name of the book's author.
*   **Year-Of-Publication** : The year in which the book was published.
*   **Publisher** : The name of the publisher.
*   **Image-URL-S** : The URL of the small cover image.
*   **Image-URL-M** : The URL of the medium cover image.
*   **Image-URL-L** : The URL of the large cover image.

In [None]:
# Shape of Books data
books.shape

Our Books dataset contains 271360 rows and 8 columns.

In [None]:
# Books dataset info
books.info()

In [None]:
# Books Dataset Duplicate Value Count
duplicate = books.duplicated()
print(duplicate.value_counts())

This means there are no duplicate rows in the Books dataset.
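
Note that `duplicated()` without arguments only flags rows that match in every column. As a minimal sketch, assuming `ISBN` is meant to act as a unique key, the same check can also be run on that column alone. The small `demo_books` frame below is made up for illustration and is not taken from Books.csv.

```python
import pandas as pd

# Hypothetical miniature frame; not taken from Books.csv
demo_books = pd.DataFrame({
    "ISBN": ["0001", "0002", "0002", "0003"],
    "Book-Title": ["A", "B", "B (reprint)", "C"],
})

# Rows that are identical in every column
full_row_duplicates = int(demo_books.duplicated().sum())

# Rows that repeat an ISBN already seen, even if other columns differ
isbn_duplicates = int(demo_books.duplicated(subset=["ISBN"]).sum())

print(full_row_duplicates, isbn_duplicates)
```

If the second count were non-zero on the real data, those rows would need a closer look before building recommendations on ISBN.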

In [None]:
# Missing Values/Null Values Count
books.isnull().sum()

## ***Data Cleaning of Books Dataset***

The Image-URL-S, Image-URL-M, and Image-URL-L columns are not useful for our recommendations, so we will drop them.

In [None]:
# dropping last three columns 
books.drop(['Image-URL-S', 'Image-URL-M', 'Image-URL-L'],axis=1,inplace=True)

In [None]:
# After dropping 3 columns
books.head()

In [None]:
# Checking data types of columns
print(books.dtypes)

Here, Year-Of-Publication should have an integer data type, but it is stored as object, so we will check its unique values.

In [None]:
#Get the unique values of Year-Of-Publication	
books['Year-Of-Publication'].unique()

Here we can see that 'DK Publishing Inc' and 'Gallimard' are wrong entries. Some values of Year-Of-Publication are also later than 2004, which is wrong because our data was published in 2004.

In [None]:
#Checking the rows having 'DK Publishing Inc' and 'Gallimard' as Year-Of-Publication
books.loc[(books['Year-Of-Publication'] == 'DK Publishing Inc') |( books['Year-Of-Publication'] == 'Gallimard'),:]

Here, we can clearly see that the values of Book-Author and Year-Of-Publication are misplaced in these rows (Year-Of-Publication holds the publisher name), so we will correct them.
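
Rows like these can also be found without knowing the bad strings in advance. The sketch below assumes that any value which cannot be parsed as a number is a misplaced entry; on real data, values that were already missing would be flagged as well. The `demo_books` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame mimicking the misplaced rows; values are illustrative
demo_books = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Author": ["Some Author", "2000", "2003"],
    "Year-Of-Publication": ["1999", "DK Publishing Inc", "Gallimard"],
})

# Values that cannot be parsed as numbers become NaN
year_as_number = pd.to_numeric(demo_books["Year-Of-Publication"], errors="coerce")

# Rows whose year field holds text instead of a year
shifted_rows = demo_books[year_as_number.isna()]
print(shifted_rows["ISBN"].tolist())
```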

In [None]:
# Correcting 1st row 
books.loc[books.ISBN == '0789466953','Year-Of-Publication'] = 2000
books.loc[books.ISBN == '0789466953','Book-Author'] = "James Buckley"
books.loc[books.ISBN == '0789466953','Publisher'] = "DK Publishing Inc"
books.loc[books.ISBN == '0789466953','Book-Title'] = "DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)"

In [None]:
# Correcting 2nd row
books.loc[books.ISBN == '2070426769','Year-Of-Publication'] = 2003
books.loc[books.ISBN == '2070426769','Book-Author'] = "Jean-Marie Gustave Le Clézio"
books.loc[books.ISBN == '2070426769','Publisher'] = "Gallimard"
books.loc[books.ISBN == '2070426769','Book-Title'] = "Peuple du ciel, suivi de 'Les Bergers"

In [None]:
# Correcting 3rd row
books.loc[books.ISBN == '078946697X','Year-Of-Publication'] = 2000
books.loc[books.ISBN == '078946697X','Book-Author'] = "Michael Teitelbaum"
books.loc[books.ISBN == '078946697X','Publisher'] = "DK Publishing Inc"
books.loc[books.ISBN == '078946697X','Book-Title'] = "DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)"

In [None]:
#Rechecking after correcting
books.loc[(books.ISBN == '0789466953') | (books.ISBN == '078946697X') | (books.ISBN == '2070426769'),:]

In [None]:
# Converting year of publication to type int
books['Year-Of-Publication'] = books['Year-Of-Publication'].astype(int)

In [None]:
# Checking year in sorted manner
sorted(books['Year-Of-Publication'].unique())

Here, 0 is not a valid year, and since our data is from 2004, any Year-Of-Publication greater than 2004 must be an incorrect entry. So, we will replace those values with NaN.

In [None]:
# Replacing year 0 and years above 2004 with NaN
books.loc[(books['Year-Of-Publication'] > 2004) | (books['Year-Of-Publication']==0),'Year-Of-Publication'] = np.nan

## ***Handling Missing Values***

In [None]:
#Checking for missing values
books.isnull().sum()

In [None]:
# box plot for Year-Of-Publication
plt.figure(figsize=(8,4))
sns.boxplot(x=books['Year-Of-Publication'])

Here, we can see that Book-Author contains 1 missing value, Year-Of-Publication contains 4690 missing values, and Publisher contains 2 missing values. So, we will handle those missing values.

In [None]:
# Imputing the NaN values with the median value of Year-Of-Publication
# (assigning the result back avoids chained in-place modification)
books['Year-Of-Publication'] = books['Year-Of-Publication'].fillna(round(books['Year-Of-Publication'].median()))

In [None]:
# Publisher column has 2 NaN so exploring it
books.loc[books.Publisher.isnull(),:]

Here, we can fill the NaN values with 'other'.

In [None]:
# Replacing missing values with 'other'
books.loc[(books.ISBN == '193169656X'),'Publisher'] = 'other'
books.loc[(books.ISBN == '1931696993'),'Publisher'] = 'other'

Since Book-Author has only 1 missing value, we can simply drop that row.

In [None]:
# Dropping the row with the missing value in the 'Book-Author' column
books.dropna(subset=['Book-Author'],inplace=True)

In [None]:
#Checking for missing values
books.isnull().sum()

So, here we can clearly see that we don't have any missing values in our Books dataset.

## **Users Data**

In [None]:
# Users data first look
users.head()

Our Users dataset contains the following features:

*   **User-ID** : The ID of the user.
*   **Location** : The location of the user.
*   **Age** : The age of the user.

In [None]:
# Shape of the Users dataset
users.shape

Our data contains 278858 rows and 3 columns.

In [None]:
# Users dataset info
users.info()

In [None]:
# Users Dataset Duplicate Value Count
duplicate = users.duplicated()
print(duplicate.value_counts())

So, there are no duplicate rows in our dataset.

In [None]:
# Missing Values/Null Values Count
users.isnull().sum()

Age has 110762 missing values, which is a large number.

## ***Data Cleaning of Users Data***

In [None]:
# box plot for Age
plt.figure(figsize=(8,4))
sns.boxplot(x=users['Age'])

Here, we can clearly see that the Age column contains a lot of outliers.

In [None]:
# Getting unique age values in sorted manner
print(sorted(users.Age.unique()))

Here, Age contains NaN values as well as ages from 0 to 5 and ages above 90. It is unlikely that users of those ages read and rate books, so we will replace ages below 5 and above 90 with NaN.

In [None]:
# Replacing age below 5 and above 90 by NaN
users.loc[(users.Age > 90) | (users.Age < 5), 'Age'] = np.nan

In [None]:
#Checking for missing values
users.isnull().sum()

So, here we will impute the missing values with the mean.

In [None]:
# Replacing NaN with the mean (assigning the result back avoids chained in-place modification)
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [None]:
users.head()

So, here we will change the data type of Age to int.

In [None]:
# setting the data type as int
users.Age = users.Age.astype(np.int64)

In [None]:
#Checking for missing values
users.isnull().sum()

So, we have cleaned our Users Dataset.

## **Ratings Data**

In [None]:
#Ratings Data first look
ratings.head()

Our Ratings dataset contains the following features:

*   **User-ID** : The ID of the user who gave the rating.
*   **ISBN** : The International Standard Book Number (ISBN) of the rated book.
*   **Book-Rating** : The rating given to the book by the user.

In [None]:
#Shape of the Ratings Dataset
ratings.shape

In [None]:
# Ratings dataset info
ratings.info()

Our dataset contains 1149780 rows and 3 columns.

In [None]:
# Ratings Dataset Duplicate Value Count
duplicate = ratings.duplicated()
print(duplicate.value_counts())

So, there are no duplicate rows in our dataset.

In [None]:
# Missing Values/Null Values Count
ratings.isnull().sum()

So, here we can clearly see that there are no missing values in the ratings data. However, the ratings data contains many ISBNs, so we will check them against the books dataset.

In [None]:
# Keeping only the ratings whose ISBN is present in the books dataset
unique_ratings = ratings[ratings.ISBN.isin(books.ISBN)]

The ratings dataset should only contain ratings from users who exist in the users dataset, unless new users or books have been added to the datasets.
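
This consistency check can be sketched in the same way as the ISBN filter. The example below is only an illustration on made-up frames (`demo_users`, `demo_books`, `demo_ratings` are hypothetical names and values); it keeps a rating only when both its user and its book are known.

```python
import pandas as pd

# Hypothetical miniature frames; IDs and ISBNs are made up
demo_users = pd.DataFrame({"User-ID": [1, 2, 3]})
demo_books = pd.DataFrame({"ISBN": ["A", "B"]})
demo_ratings = pd.DataFrame({
    "User-ID": [1, 2, 4, 3],
    "ISBN": ["A", "Z", "B", "B"],
    "Book-Rating": [5, 0, 8, 7],
})

# Boolean masks: does each rating point at a known user / a known book?
known_user = demo_ratings["User-ID"].isin(demo_users["User-ID"])
known_book = demo_ratings["ISBN"].isin(demo_books["ISBN"])

# Keep only ratings that reference both a known user and a known book
valid_ratings = demo_ratings[known_user & known_book]

print(f"{len(valid_ratings)} of {len(demo_ratings)} ratings kept")
```

Comparing the number of rows before and after the filter shows how many ratings point at unknown users or books.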

# **Exploratory Data Analysis**

## ***EDA on Books Data***

## Top 10 Authors

In [None]:
# Visualizing top 10 authors
plt.figure(figsize=(12,8))
sns.countplot(y='Book-Author',data=books,order=books['Book-Author'].value_counts().iloc[:10].index)
plt.title('Top 10 Authors')

From our countplot we can infer that **Agatha Christie**, **William Shakespeare**, **Stephen King**, **Ann M Martin**, **Carolyn Keene**, **Francine Pascal**, **Isaac Asimov**, **Nora Roberts**, **Barbara Cartland** and **Charles Dickens** are the top 10 authors.

## Top 10 Publishers

In [None]:
# Visualizing top 10 Publisher
plt.figure(figsize=(12,8))
sns.countplot(y='Publisher',data=books,order=books['Publisher'].value_counts().iloc[:10].index)
plt.title('Top 10 Publishers')

From our countplot we infer that **Harlequin**, **Silhouette**, **Pocket**, **Ballantine Books**, **Bantam Books**, **Scholastic**, **Simon & Schuster**, **Penguin Books**, **Berkley Publishing Group** and **Warner Books** are the top 10 publishers.

## Books Published per Year

In [None]:
# Visualizing the no. of books published each year through histogram
sns.set_style('darkgrid')
fig, ax =plt.subplots()
fig.set_size_inches(12,8)
sns.histplot(books['Year-Of-Publication'],bins=np.arange(1900,2004,3),color='r')
plt.ylabel('No. of Books Published')
plt.xlabel('Year')
plt.title('Visualizing the total no. of books published each year')
plt.show()

From our visualization, we can say that the largest number of books was published around 2000.
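
Because the histogram groups years into 3-year bins, it shows the busiest period rather than a single year. As a minimal sketch on made-up values (`demo_years` is hypothetical), an exact per-year count can confirm the peak year. On the real data, the year used for median imputation also receives all imputed rows, which may inflate its count.

```python
import pandas as pd

# Hypothetical years; not taken from Books.csv
demo_years = pd.Series(
    [1998, 1999, 1999, 2000, 2000, 2000, 2001],
    name="Year-Of-Publication",
)

# Number of books per year, sorted chronologically
books_per_year = demo_years.value_counts().sort_index()

# Year with the largest count
peak_year = int(books_per_year.idxmax())
print(peak_year, int(books_per_year.max()))
```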

## ***EDA on Users Data***

In [None]:
#Plotting histogram of age distribution
fig = plt.figure(figsize = (12,8))
users.Age.hist(bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('count')
plt.show()

From our histogram, we can see that most of the users are between 30 and 40 years old.

In [None]:
#Plotting pie chart for above graph
fig = plt.figure(figsize = (12,12))
users.Age.value_counts().plot.pie(autopct='%1.1f%%',shadow=True)
plt.title('Age of Users')



From this, we can clearly see that 41.9% of users have age 34; this is because we imputed the mean value for the missing ages.
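
The spike at a single age is a direct effect of mean imputation: every missing row receives the same value. The sketch below illustrates this on made-up ages (`demo_age` is hypothetical) and keeps a mask of the originally missing rows, which is one possible way, not used in this notebook, to tell imputed ages apart later.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN marks a missing value
demo_age = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan, 50.0, np.nan, np.nan])

# Remember which rows were missing before imputing
age_was_missing = demo_age.isna()

# Mean imputation: every missing row receives the same value
mean_age = demo_age.mean()
imputed_age = demo_age.fillna(mean_age)

# Share of rows that now sit exactly at the mean
share_at_mean = (imputed_age == mean_age).mean()
print(mean_age, share_at_mean)
```

Here half of the rows were missing, so half of the imputed values sit at the mean, which mirrors the large slice seen in the pie chart.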

In [None]:
# Plotting count of rating to see how it's distributed
fig = plt.figure(figsize = (12,8))
sns.countplot(x='Book-Rating',data=ratings)
plt.title("Rating countplot")
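
The countplot above can be backed by exact counts. The sketch below uses made-up ratings (`demo_ratings` is hypothetical) and additionally assumes that a rating of 0 means no explicit score was given; the notebook does not state this, so treat that split as an assumption to verify against the dataset's documentation.

```python
import pandas as pd

# Hypothetical ratings; not taken from Ratings.csv
demo_ratings = pd.DataFrame({"Book-Rating": [0, 0, 0, 5, 8, 10, 0, 7]})

# Count of each rating value, ordered by rating
rating_counts = demo_ratings["Book-Rating"].value_counts().sort_index()

# Assumption: 0 means "no explicit score given", so it is excluded here
explicit_ratings = demo_ratings[demo_ratings["Book-Rating"] > 0]

print(rating_counts.to_dict())
print(explicit_ratings["Book-Rating"].mean())
```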

**Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***