<a href="https://colab.research.google.com/github/Gauravgade3/Book-Recommendation-System/blob/main/Book_Recommendation_System_Capstone_Project_Gaurav_Gade.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Book Recommendation System**

## Problem Statement:


Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many other web services, recommender systems have become an increasingly large part of our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now hard to avoid in our daily online journeys.
In general terms, recommender systems are algorithms that suggest relevant items to users (movies to watch, text to read, products to buy, or anything else, depending on the industry).
Recommender systems are critical in some industries: when they work well they can generate a large amount of income, and they can help a company stand out from its competitors. The main objective of this project is to build a book recommendation system for users.

## About Dataset:

The Book-Crossing dataset comprises 3 files.

**Users** :

Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.


**Books** :


Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.



**Ratings** :


Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
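
Because a rating of 0 marks an implicit interaction rather than a low score, the two kinds are usually separated before any averaging. The following is a minimal sketch of that split, assuming the `User-ID`, `ISBN`, and `Book-Rating` column names described above; `toy_ratings` is made-up sample data, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows that mimic the layout of the Ratings file
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3, 3],
    'ISBN': ['A', 'B', 'A', 'C', 'B'],
    'Book-Rating': [0, 8, 10, 0, 5],
})

# Explicit ratings use the 1-10 scale; implicit interactions are stored as 0
explicit = toy_ratings[toy_ratings['Book-Rating'] > 0]
implicit = toy_ratings[toy_ratings['Book-Rating'] == 0]

# Averaging only the explicit ratings keeps the zeros from dragging the mean down
mean_explicit = explicit['Book-Rating'].mean()
mean_all = toy_ratings['Book-Rating'].mean()
print(len(explicit), len(implicit), mean_explicit, mean_all)
```

In this toy frame the mean over all rows is much lower than the mean over explicit ratings only, which is why the zeros should not be read as "disliked".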

In [None]:
# mounting the data from google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#importing necessary modules

import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Users dataset

users=pd.read_csv('/content/drive/MyDrive/Book Recommendation System/Users.csv')
users.head()

In [None]:
#Books dataset
books=pd.read_csv('/content/drive/MyDrive/Book Recommendation System/Books.csv')
books.head()

In [None]:
#Ratings Dataset
ratings=pd.read_csv('/content/drive/MyDrive/Book Recommendation System/Ratings.csv')
ratings.head()

In [None]:
print('Shape Of Dataframe\n')
print(f'books:    {(books.shape)}')
print(f'users:    {(users.shape)}')
print(f'ratings: {(ratings.shape)}')

### **Users dataset**:

In [None]:
users.isnull().sum()

In [None]:
round(users.isnull().mean().mul(100),2)

## The **Age** column of the users dataset has **110762** missing values, about **39%** of all rows.

In [None]:
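# Extracting country from location feature (the last comma-separated field)
# A single vectorised call is enough; no loop over the columns is needed
users['Country'] = users.Location.str.extract(r'\,+\s?(\w*\s?\w*)\"*$', expand=False)

print(users.Country.nunique())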

In [None]:
#drop location column
users.drop('Location',axis=1,inplace=True)

In [None]:
users.head()

In [None]:
users.isnull().sum()

In [None]:
# Convert Country to string so the unique values can be sorted
# (in many pandas versions missing values become the literal string 'nan')
users['Country']=users['Country'].astype('str')

In [None]:
a=list(users.Country.unique())
a.sort()
print(a)

In [None]:
# spelling corrections: map ZIP codes, typos, and aliases to one canonical country name
# ('austria' is a valid country, so it is no longer remapped to 'australia')
users['Country'] = users['Country'].replace(
    ['','01776','02458','19104','23232','30064','85021','87510','alachua','america','autralia','cananda','geermany','italia','united kindgonm','united sates','united staes','united state','united states','us'],
    ['other','usa','usa','usa','usa','usa','usa','usa','usa','usa','australia','canada','germany','italy','united kingdom','usa','usa','usa','usa','usa'])

## Corrected some of the misspelled country names.
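
To make the extract-then-correct step easier to follow, here is a small sketch that applies the same regular expression to a few made-up `Location` strings and then normalises the result. `toy_users` and the `fixes` mapping are illustrative assumptions and cover only a handful of variants.

```python
import pandas as pd

# Hypothetical Location strings in the "city, state, country" layout
toy_users = pd.DataFrame({'Location': [
    'nyc, new york, usa',
    'stockton, california, united states',
    'milan, lombardia, italia',
    'toronto, ontario, cananda',
]})

# Capture the last comma-separated field (one or two words) as the country
pattern = r'\,+\s?(\w*\s?\w*)\"*$'
toy_users['Country'] = toy_users['Location'].str.extract(pattern, expand=False)

# Map a few known variants to one canonical name; unlisted values pass through
fixes = {'united states': 'usa', 'italia': 'italy', 'cananda': 'canada'}
toy_users['Country'] = toy_users['Country'].replace(fixes)
print(toy_users['Country'].tolist())
```

A dictionary keeps each variant next to its replacement, which makes a wrong pairing easier to spot than in two parallel lists.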

In [None]:
# plotting the 10 countries with the most users
plt.figure(figsize=(15,7))
plt.style.use('fivethirtyeight')
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
sns.countplot(y='Country',data=users,order=users['Country'].value_counts().iloc[:10].index)
plt.title('Country wise User Count')

## Most users are from the USA.
## Among the ten countries plotted, Portugal has the fewest users.

In [None]:
# histogram plot for users age distribution
plt.rcParams['figure.figsize'] = (13,7)
users.Age.hist(bins=[0,10,20,30,40,50,60,70,80,90,100], color='purple', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Users Age Distribution')
plt.show()

## The largest group of users is in the 20-30 age range.

In [None]:
# checking outliers through boxplot
sns.boxplot(data=users, y='Age')

## From the above boxplot we can observe that outliers are present in the Age column in the "Users" dataset.
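
With default settings, a boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand to a hypothetical `toy_age` series; it illustrates the boxplot rule only and is not the threshold this notebook uses to clean the Age column.

```python
import pandas as pd

# Hypothetical ages, including two implausible values
toy_age = pd.Series([18, 22, 25, 27, 30, 34, 41, 52, 120, 244])

# Quartiles and interquartile range
q1 = toy_age.quantile(0.25)
q3 = toy_age.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the default boxplot rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits are drawn as outliers
outliers = toy_age[(toy_age < lower) | (toy_age > upper)]
print(lower, upper, outliers.tolist())
```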

In [None]:
# Checking the unique value of users age
print(sorted(users.Age.unique()))

In [None]:
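# Distribution plot (histogram with KDE); histplot replaces the deprecated sns.distplot
sns.histplot(users.Age, kde=True, stat='density')
plt.title('Age Distribution Plot')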

## User ages above 100 or below 5 are not plausible for book ratings, so we will replace these values with NaN.

## The plot above shows that Age is positively skewed, so the median is a better fill value than the mean. Rather than filling every NaN with a single global median, we will fill each NaN with the median age of the user's country.
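
As a quick illustration of why skewness matters here, the sketch below computes the skewness, mean, and median of a hypothetical right-skewed `toy_age` sample (made-up numbers, not the real data). A few large values pull the mean upward while the median stays near the bulk of the data.

```python
import pandas as pd

# Hypothetical right-skewed sample: most ages are low, a few are high
toy_age = pd.Series([20, 22, 24, 25, 27, 30, 33, 38, 60, 85])

skewness = toy_age.skew()      # positive for a right-skewed sample
mean_age = toy_age.mean()      # pulled up by the large values
median_age = toy_age.median()  # stays close to the typical value
print(skewness, mean_age, median_age)
```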

In [None]:
# replace out-of-range ages (above 100 or below 5) with NaN
users.loc[(users.Age > 100) | (users.Age < 5), 'Age'] = np.nan

In [None]:
users['Age'] = users['Age'].fillna(users.groupby('Country')['Age'].transform('median'))

A few values are still NaN (users whose country has no recorded ages), so we will fill them with the overall mean age.
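
The two-step fill can be seen on a small example. In the sketch below, `toy_users` is made-up data with one country (`peru`) that has no known ages, so the per-country median leaves it empty and the overall mean serves as the fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical users; 'peru' has no recorded age at all
toy_users = pd.DataFrame({
    'Country': ['usa', 'usa', 'usa', 'canada', 'canada', 'peru'],
    'Age': [20.0, 40.0, np.nan, 30.0, np.nan, np.nan],
})

# Step 1: fill each NaN with the median age of the same country
country_median = toy_users.groupby('Country')['Age'].transform('median')
toy_users['Age'] = toy_users['Age'].fillna(country_median)
remaining = int(toy_users['Age'].isna().sum())  # rows the group median could not fill

# Step 2: fall back to the overall mean for whatever is still missing
toy_users['Age'] = toy_users['Age'].fillna(toy_users['Age'].mean())
print(remaining, toy_users['Age'].tolist())
```

Assigning the result back to the column, instead of calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment pitfalls in recent pandas versions.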

In [None]:
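# assign back instead of inplace=True on a column selection, which may not update the frame in recent pandas
users['Age'] = users['Age'].fillna(users['Age'].mean())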

In [None]:
users.isna().sum()

## **EDA on Books Dataset**

In [None]:
books.head()

In [None]:
# plotting the 10 publishers with the most books in the dataset
plt.figure(figsize=(17,8))
sns.countplot(y='Publisher',data=books,order=books['Publisher'].value_counts().iloc[:10].index)
plt.title('Top 10 Publishers')

## 'Harlequin' has the most books in the dataset, and 'Warner Books' has the fewest among the top ten publishers.