<a href="https://colab.research.google.com/github/Krishanu-Saha/Recommendation_system/blob/main/BOOK_RECOMMENDER_SYSTEM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  BOOK_RECOMMENDER_SYSTEM



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual, by Krishanu Saha


# **Project Summary -**

The Book Recommender System is a data science project that aims to recommend books to users based on their reading history. The project is based on a collaborative filtering algorithm that uses the ratings of books by multiple users to recommend books to a specific user.

The project starts with importing the necessary libraries and loading the dataset. The dataset contains information about the users, the books, and their ratings. The data is then preprocessed by removing duplicate ratings, removing books with very few ratings, and removing users with very few ratings. The resulting dataset is then split into training and testing sets.

The collaborative filtering algorithm used in the project is implemented using the Surprise library in Python. The algorithm is based on matrix factorization, which decomposes the user-item rating matrix into two lower-dimensional matrices: one representing the users and the other representing the items. The algorithm then estimates the missing ratings by calculating the dot product of the user and item matrices.
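The dot-product step described above can be illustrated with a minimal NumPy sketch. It is not the Surprise implementation; the factor values, the number of latent features, and the names `user_factors` and `item_factors` are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical toy factors: 3 users and 4 books, each described by 2 latent features.
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.5],
                         [0.9, 0.1]])
item_factors = np.array([[4.0, 1.0],
                         [1.0, 3.0],
                         [3.0, 0.0],
                         [0.5, 2.0]])

# Every predicted rating is the dot product of one user row and one item row,
# so the full prediction matrix is user_factors times item_factors transposed.
predicted = user_factors @ item_factors.T  # shape: (3 users, 4 books)

# Predicted rating of user 0 for book 1, computed on its own.
single = float(np.dot(user_factors[0], item_factors[1]))
print(predicted.shape, single)
```

In a real model the two factor matrices are learned from the known ratings; here they are fixed by hand only to show how a missing rating would be estimated.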

The accuracy of the algorithm is measured using root mean squared error (RMSE). The RMSE measures the typical size of the difference between the predicted ratings and the actual ratings, so lower values are better.
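As a small sketch of what the metric computes, the helper below evaluates RMSE by hand. The function name `rmse` and the example ratings are hypothetical and are not taken from the project's results.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical ratings, for illustration only.
example_rmse = rmse([8, 5, 10, 7], [7, 5, 8, 9])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.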

The project then uses the algorithm to recommend books to a specific user. The user's reading history is used to generate the recommendations, and the top 10 books are recommended to the user based on the predicted ratings.
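One way to turn predicted ratings into a top-N list is sketched below. It assumes the predictions for a single user are already available as a pandas Series indexed by book; the helper `top_n_unread` and the toy scores are hypothetical.

```python
import pandas as pd

def top_n_unread(predicted_ratings, already_read, n=10):
    """Return the n highest-scoring books the user has not read yet."""
    # Drop books from the user's reading history; ignore labels that are absent.
    candidates = predicted_ratings.drop(labels=list(already_read), errors="ignore")
    return candidates.sort_values(ascending=False).head(n)

# Hypothetical predicted ratings for one user.
predicted_ratings = pd.Series({"A": 9.1, "B": 7.4, "C": 8.8, "D": 6.0})
recommendations = top_n_unread(predicted_ratings, already_read={"A"}, n=2)
print(list(recommendations.index))
```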

The project concludes with a discussion on the limitations of the algorithm and possible future work. One of the limitations of the algorithm is that it only recommends books based on the user's reading history and does not take into account the user's preferences or interests. Future work could include incorporating user preferences or using a hybrid algorithm that combines collaborative filtering with content-based filtering.

In conclusion, the Book Recommender System is an effective tool for recommending books to users based on their reading history. The collaborative filtering algorithm used in the project is accurate and provides relevant recommendations to users. The project has demonstrated the potential of collaborative filtering in the field of book recommendations and provides a solid foundation for future work in this area.

# **GitHub Link -**

https://github.com/Krishanu-Saha/data-science/blob/main/BOOK_RECOMMENDER_SYSTEM.ipynb

# **Problem Statement**


During the last few decades, with the rise of YouTube, Amazon, Netflix, and many other web services, recommender systems have taken an ever larger place in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now an unavoidable part of our daily online journeys.

In a very general way, recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy, or anything else depending on industries).

Recommender systems are really critical in some industries as they can generate a huge amount of income when they are efficient or also be a way to stand out significantly from competitors. The main objective is to create a book recommendation system for users.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import re
import pickle
import operator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from scipy.sparse import csr_matrix
from pandas.api.types import is_numeric_dtype
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
!pip install surprise
from surprise import Dataset,Reader,KNNBasic ,accuracy
from surprise.model_selection import cross_validate
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
#User Dataset
users_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/UNSUPERVISE ML/Users.csv')

#Books Dataset
books_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/UNSUPERVISE ML/Books.csv')

#Ratings Dataset
ratings_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/UNSUPERVISE ML/Ratings.csv')


### Dataset First View

In [None]:
# Dataset First Look
#User Dataset
users_df.head()

In [None]:
#Books Dataset
books_df.head()

In [None]:
#Ratings Dataset
ratings_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
#Users_df
users_df.shape

The shape of Users dataset is (278858, 3)

In [None]:
#Books Dataset
books_df.shape

The shape of Books dataset is (271360, 8)

In [None]:
#Ratings Dataset
ratings_df.shape

The shape of Ratings Dataset is (1149780, 3)

### Dataset Information

In [None]:
# Dataset Info

In [None]:
#Users Dataset
users_df.info()

In [None]:
#Books Dataset
books_df.info()

In [None]:
#Ratings Dataset
ratings_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
#Users Dataset
len(users_df[users_df.duplicated()])

The duplicate row count in the Users dataset is 0.

In [None]:
#Books Dataset
len(books_df[books_df.duplicated()])

The duplicate row count in the Books dataset is 0.

In [None]:
#Ratings Dataset
len(ratings_df[ratings_df.duplicated()])

The duplicate row count in the Ratings dataset is 0.

#### Missing Values/Null Values

In [None]:
pip install missingno

In [None]:
import missingno as msno

In [None]:
# Missing Values/Null Values Count

In [None]:
#Users Dataset
users_df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.matrix(users_df)

In [None]:
#Books Dataset
books_df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.matrix(books_df)

In [None]:
#Ratings Dataset
ratings_df.isnull().sum()


User-ID          0
Location         0
Age         110762
dtype: int64

ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

In [None]:
# Visualizing the missing values
msno.matrix(ratings_df)

### What did you know about your dataset?

The Users dataset contains 278,858 rows and 3 columns, the Books dataset contains 271,360 rows and 8 columns, and the Ratings dataset contains 1,149,780 rows and 3 columns. The duplicate data counts in all three datasets are zero, indicating that there are no duplicate values present in the datasets. However, the Users dataset has 110,762 missing values in the Age column. The Books dataset has missing values in Book-Author, Publisher, and Image-URL-L columns, while the Ratings dataset has no missing values. The datasets are ready for further analysis, and the missing values in the Age, Book-Author, Publisher, and Image-URL-L columns need to be addressed to make the datasets more informative.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
users_df.columns

In [None]:
books_df.columns

In [None]:
ratings_df.columns

In [None]:
# Dataset Describe


In [None]:
users_df.describe()

In [None]:
books_df.describe()

In [None]:
ratings_df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
users_df.nunique()

In [None]:
books_df.nunique()

In [None]:
ratings_df.nunique()

## 3. ***Data Cleaning and Pre-Processing***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

**BOOKS DATASET**

In [None]:
books_df.columns

In [None]:
# Work on an explicit copy so later in-place fixes do not touch (or warn about) books_df
new_books_df = books_df[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher','Image-URL-S']].copy()

In [None]:
# Drop the unwanted columns Image-URL-S, Image-URL-M, and Image-URL-L
#books_df.drop(["Image-URL-S", "Image-URL-M", "Image-URL-L"], axis=1, inplace=True)

In [None]:
#pointing out observation where Book_Author is null
new_books_df.loc[books_df['Book-Author'].isnull(),:]

In [None]:
# Calculate the mode (most common value) of the 'Book-Author' column
mode_author = new_books_df['Book-Author'].mode()[0]

# Fill the missing values in the 'Book-Author' column with the mode value
new_books_df['Book-Author'] = new_books_df['Book-Author'].fillna(mode_author)

In [None]:
#This code is used to filter the rows of the books_df DataFrame where the Publisher column has a missing value
new_books_df.loc[books_df['Publisher'].isnull(),:]

In [None]:
## Checking for null values
new_books_df.isnull().sum()

In [None]:
# Calculate the mode (most common value) of the 'Publisher' column
mode_publisher = new_books_df['Publisher'].mode()[0]

# Fill the missing values in the 'Publisher' column with the mode value
new_books_df['Publisher'] = new_books_df['Publisher'].fillna(mode_publisher)

In [None]:
## Checking for column Year-of-publication
new_books_df['Year-Of-Publication'].unique()

In [None]:
#Checking where Year-Of-Publication has DK Publishing Inc value
new_books_df.loc[new_books_df['Year-Of-Publication'] == 'DK Publishing Inc',:]

In [None]:
# Update the value at row 209538 in the 'Publisher' column to 'DK Publishing Inc'
new_books_df.at[209538 ,'Publisher'] = 'DK Publishing Inc'

# Update the value at row 209538 in the 'Year-Of-Publication' column to 2000
new_books_df.at[209538 ,'Year-Of-Publication'] = 2000

# Update the value at row 209538 in the 'Book-Title' column to 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)'
new_books_df.at[209538 ,'Book-Title'] = 'DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)'

# Update the value at row 209538 in the 'Book-Author' column to 'Michael Teitelbaum'
new_books_df.at[209538 ,'Book-Author'] = 'Michael Teitelbaum'


In [None]:
# Update the value at row 221678 in the 'Publisher' column to 'DK Publishing Inc'
new_books_df.at[221678 ,'Publisher'] = 'DK Publishing Inc'

# Update the value at row 221678 in the 'Year-Of-Publication' column to 2000
new_books_df.at[221678 ,'Year-Of-Publication'] = 2000

# Update the value at row 221678 in the 'Book-Title' column to 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)'
new_books_df.at[221678 ,'Book-Title'] = 'DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)'

# Update the value at row 221678 in the 'Book-Author' column to 'James Buckley'
new_books_df.at[221678 ,'Book-Author'] = 'James Buckley'


In [None]:
# Use boolean indexing to select all rows where the 'Year-Of-Publication' column has the value 'Gallimard'
new_books_df.loc[new_books_df['Year-Of-Publication'] == 'Gallimard',:]


In [None]:
# Update the values in specific cells of the DataFrame using .at[] and index location
# (all four fixes apply to row 220731, the row whose year field contained 'Gallimard')
new_books_df.at[220731 ,'Publisher'] = 'Gallimard'
new_books_df.at[220731 ,'Year-Of-Publication'] = 2003
new_books_df.at[220731 ,'Book-Title'] = 'Peuple du ciel - Suivi de Les bergers'
new_books_df.at[220731 ,'Book-Author'] = 'Jean-Marie Gustave Le Clézio'

In [None]:
# Converting year of publication to int type from object
new_books_df['Year-Of-Publication'] = new_books_df['Year-Of-Publication'].astype(int)

In [None]:
# Update the values in 'Year-Of-Publication' column to NaN where the value is less than 1900 or greater than 2023
new_books_df.loc[(new_books_df['Year-Of-Publication'] < 1900) | (new_books_df['Year-Of-Publication'] > 2023), 'Year-Of-Publication'] = np.nan


In [None]:
mode_year = new_books_df['Year-Of-Publication'].mode()[0]  # Get the mode year
new_books_df['Year-Of-Publication'] = new_books_df['Year-Of-Publication'].fillna(mode_year)  # Impute missing values with the mode

In [None]:
new_books_df.info()

**USERS DATASET**

In [None]:
# Check for missing values in the Age column
print(users_df["Age"].isnull().sum())


The Age column has 110,762 null values.

In [None]:
# Fill missing values in the Age column with the median value
median_age = users_df["Age"].median()
users_df["Age"] = users_df["Age"].fillna(median_age)


In [None]:
# Check again for missing values in the Age column
print(users_df["Age"].isnull().sum())

In [None]:
age_lower, age_upper = np.percentile(users_df['Age'], [1, 99])  # Get the 1st and 99th percentiles of Age
users_df['Age'] = users_df['Age'].clip(lower=age_lower, upper=age_upper)  # Winsorize the Age column

In [None]:
users_df.head()

In [None]:
# Replace missing values with empty string
users_df['Location'] = users_df['Location'].fillna('')

# Split 'Location' column into separate columns
users_df[['City', 'State', 'Country']] = users_df['Location'].str.split(', ', n=2, expand=True)

In [None]:
users_df.drop('Location', axis=1, inplace=True)


In [None]:
duplicate_rows = users_df.duplicated()
print('Number of duplicate rows = %d' % duplicate_rows.sum())

In [None]:
users_df.head()

**RATINGS DATASET**

In [None]:
ratings_df.info()

In [None]:
import re


In [None]:
# extract ISBNs from books dataset (raw strings keep the \d escapes intact)
book_ISBN = new_books_df['ISBN'].str.extract(r'(\d{10}|\d{13})', expand=False)

# identify ratings with ISBNs not matching the pattern in books dataset
non_matching_ISBN = ~ratings_df['ISBN'].str.match(r'^(\d{10}|\d{13})$')

# get the unique non-matching ISBNs
unique_non_matching_ISBN = ratings_df.loc[non_matching_ISBN, 'ISBN'].unique()

# create a mapping of non-matching ISBNs to their corrected versions
corrections = dict(zip(unique_non_matching_ISBN, book_ISBN[book_ISBN.isin(unique_non_matching_ISBN)]))

# apply the corrections to the ratings dataset
ratings_df['ISBN'] = ratings_df['ISBN'].replace(corrections)
## Uppercasing all alphabets in ISBN
ratings_df['ISBN'] = ratings_df['ISBN'].str.upper()



In [None]:
ratings_df.head()

**MERGING THE DATASETS : BOOKS_DF,USERS_DF AND RATINGS_DF**

In [None]:
# Merge the 'books_df' and 'ratings_df' DataFrames on 'ISBN' column
df = pd.merge(new_books_df, ratings_df, on='ISBN', how='inner')

# Merge the resulting DataFrame with 'users_df' DataFrame on 'User-ID' column
df = pd.merge(df, users_df, on='User-ID', how='inner')


In [None]:
df.columns

### What all manipulations have you done and insights you found?

Here are the manipulations and insights found:

Drop unwanted columns: The columns "Image-URL-M" and "Image-URL-L" were left out of the working copy new_books_df; "Image-URL-S" was kept.

Fill missing values: The missing values in the "Book-Author" and "Publisher" columns in the books_df dataset have been filled with the mode value.

Fix data errors: There were some errors in the "Year-Of-Publication" column of the books_df dataset. Some values were publisher names rather than years, and some years were outside the valid range. The misplaced rows were corrected by hand, the column was converted to an integer type, and years before 1900 or after 2023 were set to missing and then imputed with the mode.

Handle missing values: The missing values in the "Age" column of the users_df dataset were filled with the median value, and the column was then clipped to its 1st and 99th percentiles. Missing values in the "Location" column were replaced with an empty string.

Split column: The "Location" column in the users_df dataset was split into three columns: "City", "State", and "Country".

Check duplicates: The users_df dataset was checked for duplicate rows after the "Location" column was split.

Data correction: The ISBN values in the ratings_df dataset were checked for validity against the ISBN values in the books_df dataset. Non-matching ISBNs were corrected using a mapping of non-matching ISBNs to their corrected versions.

Merge datasets: The cleaned and corrected books_df, ratings_df, and users_df datasets were merged into a single dataframe "df" using inner joins.

Overall, these manipulations aim to clean and preprocess the data, handle missing values and outliers, and correct errors and inconsistencies in the data to prepare it for further analysis.
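The users-side steps listed above (median imputation, percentile clipping, and splitting the location) can be tried on a small frame. This is a sketch on made-up rows; the frame `users` and its values are hypothetical and only mirror the column names used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a missing age, an implausible age, and a missing location.
users = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 244.0, 0.0, 40.0],
    "Location": ["nyc, new york, usa", "porto, v.n.gaia, portugal", None,
                 "ottawa, ontario, canada", "n/a, n/a, n/a", "madrid, madrid, spain"],
})

# Median imputation: the median is less sensitive to extreme ages than the mean.
median_age = users["Age"].median()
users["Age"] = users["Age"].fillna(median_age)

# Clip to the 1st and 99th percentiles to limit the effect of implausible ages.
lower, upper = np.percentile(users["Age"], [1, 99])
users["Age"] = users["Age"].clip(lower=lower, upper=upper)

# Split "city, state, country" into three columns; missing locations become "".
parts = users["Location"].fillna("").str.split(", ", n=2, expand=True)
parts.columns = ["City", "State", "Country"]
users = pd.concat([users.drop(columns="Location"), parts], axis=1)
print(users)
```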

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

**The distribution of book ratings in the dataset**

In [None]:
# Calculate the average rating for books in the dataset
avg_rating = df['Book-Rating'].mean()

# Plot the distribution of book ratings
plt.hist(df['Book-Rating'], bins=10)
plt.title('Distribution of Book Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Ratings')
plt.show()

# Print the average rating for books in the dataset
print('Average rating: ', avg_rating)

**Reason to choose the specific graph**

A histogram was chosen to represent the distribution of book ratings. A histogram shows how often values fall into each of a set of bins. Here the data are the book ratings, which are integers from 0 to 10, so each bar shows how many ratings fall into the corresponding range.

The histogram is a good choice for this type of data because it allows us to see the distribution of the ratings and how they are distributed across different bins. It helps us to visualize the range of ratings that are common, and also shows any outliers or extreme values. In addition, it allows us to see the shape of the distribution, which can be useful for making inferences about the underlying population.
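Because the ratings are integers, it can help to check the exact count per rating value next to the histogram. The sketch below uses a made-up `ratings` series and shows that ten equal-width bins over the range 0 to 10 put ratings 9 and 10 into the same final bin, while eleven bins centred on the integers keep them apart.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, for illustration only.
ratings = pd.Series([0, 0, 0, 5, 8, 8, 9, 9, 10])

# Exact count for every rating value from 0 to 10.
per_rating = ratings.value_counts().reindex(range(11), fill_value=0)

# Ten equal-width bins over [0, 10]: the last bin is [9, 10] and holds both 9 and 10.
ten_bins, _ = np.histogram(ratings, bins=10)

# Eleven bins with edges at -0.5, 0.5, ..., 10.5 give one bin per rating value.
eleven_bins, _ = np.histogram(ratings, bins=np.arange(12) - 0.5)

print(per_rating.to_dict())
print(ten_bins.tolist())
print(eleven_bins.tolist())
```

The same bin edges can be passed to `plt.hist` through its `bins` argument if one bar per rating value is wanted.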

**INSIGHTS**

We can observe from the graph that a large portion of the books have not been rated (the bar at 0).

Among the rated entries, the last bar is the tallest; with bins=10 that bar covers ratings 9 and 10 together.

**BUSINESS IMPACT**

The insight that a large portion of books has not been rated may indicate that there is an opportunity for businesses in the book industry to encourage users to rate books. This could be done through incentivizing ratings or creating a more user-friendly rating system. On the other hand, the insight that books with a rating of 10 are the most frequent may indicate that users have a strong preference for highly-rated books. Businesses in the book industry could use this information to focus on promoting and marketing highly-rated books to attract more readers and increase sales.

**The most popular books in the dataset based on the number of ratings**

In [None]:
# Create a DataFrame with the number of ratings for each book
book_ratings_count = pd.DataFrame(df.groupby(['ISBN', 'Book-Title'])['Book-Rating'].count())

# Sort the books by the number of ratings in descending order
book_ratings_count.sort_values('Book-Rating', ascending=False, inplace=True)

# Print the top 10 most popular books
print(book_ratings_count.head(10))


In [None]:
# Create a bar chart of the top 10 most popular books
plt.bar(book_ratings_count.head(10).index.get_level_values('Book-Title'), book_ratings_count.head(10)['Book-Rating'])
plt.xticks(rotation=90)
plt.ylabel('Number of Ratings')
plt.title('Top 10 Most Popular Books')
plt.show()

**Reason to choose the specific graph**

The code is creating a bar chart of the top 10 most popular books, based on the number of ratings each book has received. A bar chart is a good choice for this type of data because it shows the magnitude of each book's popularity relative to the others, and it makes it easy to compare the number of ratings for each book.

In this case, the x-axis represents the titles of the books, while the y-axis represents the number of ratings. The bars are used to visually represent the number of ratings for each book, with the height of each bar indicating the number of ratings received by the corresponding book.

The code also rotates the x-axis labels by 90 degrees to prevent overlapping of the labels when they are long or numerous.

Overall, a bar chart is a good choice for displaying the top 10 most popular books based on the number of ratings, as it provides a clear and easy-to-understand representation of the data.

**INSIGHTS**

Top 10 most popular books:

Wild Animus
The Lovely Bones: A Novel
The Da Vinci Code
Divine Secrets of the Ya-Ya Sisterhood: A Novel
The Red Tent (Bestselling Backlist)
A Painted House
Snow Falling on Cedars
The Secret Life of Bees
Angels & Demons
Where the Heart Is (Oprah's Book Club (Paperback))

**BUSINESS IMPACT**

If a book retailer's goal is to stock and sell popular books, then knowing the top 10 most popular books can be beneficial to the business.

Regarding negative growth, it's not clear from the statement if any insights have led to negative growth. The fact that a huge portion of the books in the dataset has not been rated may limit the usefulness of the data in making business decisions, but it's not necessarily a negative impact.

**Top 10 most popular publishers**

In [None]:
# Create a DataFrame with the number of ratings for each publisher
publisher_ratings_count = pd.DataFrame(df.groupby('Publisher')['Book-Rating'].count())

# Sort the publishers by the number of ratings in descending order
publisher_ratings_count.sort_values('Book-Rating', ascending=False, inplace=True)

# Print the top 10 most popular publishers
print(publisher_ratings_count.head(10))



In [None]:
# Create a bar chart of the top 10 most popular publishers
plt.bar(publisher_ratings_count.head(10).index, publisher_ratings_count.head(10)['Book-Rating'])
plt.xticks(rotation=90)
plt.ylabel('Number of Ratings')
plt.title('Top 10 Most Popular Publishers based on count of Book-Rating')
plt.show()

**Reason to choose the specific graph**

The code is creating a bar chart of the top 10 most popular publishers, based on the number of ratings received by their books. A bar chart is a good choice for this type of data because it allows for easy comparison between the different publishers and the number of ratings their books have received.

In this case, the x-axis represents the names of the publishers, while the y-axis represents the number of ratings. The bars are used to visually represent the number of ratings for each publisher, with the height of each bar indicating the number of ratings received by the corresponding publisher.

The code also rotates the x-axis labels by 90 degrees to prevent overlapping of the labels when they are long or numerous.

Overall, a bar chart is a good choice for displaying the top 10 most popular publishers based on the number of ratings, as it provides a clear and easy-to-understand representation of the data. It helps to identify which publishers are more popular among the readers based on their books' ratings.





**INSIGHTS**

**Top 10 publisher based on counts**

Ballantine Books
Pocket
Berkley Publishing Group
Warner Books
Harlequin
Bantam Books
Bantam
Signet Book
Avon
Penguin Books

**BUSINESS IMPACT**

Based on the top 10 publishers in the dataset, a business impact could be to focus on building relationships with these publishers in order to potentially gain exclusive access to their upcoming releases, negotiate better pricing or discounts, and establish a strong reputation within the industry by promoting and distributing their books to a wider audience. It could also help to identify which genres these publishers specialize in and use this information to tailor marketing strategies to specific target audiences. Additionally, these publishers could potentially be used as a benchmark for measuring success and setting performance goals.

**Top 10 Most Popular Authors**

In [None]:
# Create a DataFrame with the number of ratings for each author
author_ratings_count = pd.DataFrame(df.groupby('Book-Author')['Book-Rating'].count())

# Sort the authors by the number of ratings in descending order
author_ratings_count.sort_values('Book-Rating', ascending=False, inplace=True)

# Print the top 10 most popular authors
print(author_ratings_count.head(10))



In [None]:
# Create a bar chart of the top 10 most popular authors
plt.bar(author_ratings_count.head(10).index, author_ratings_count.head(10)['Book-Rating'])
plt.xticks(rotation=90)
plt.ylabel('Number of Ratings')
plt.title('Top 10 Authors based on counts of Book-Rating')
plt.show()

**Reason to choose the specific graph**

The code is creating a bar chart of the top 10 most popular authors, based on the number of ratings received by their books. A bar chart is a good choice for this type of data because it allows for easy comparison between the different authors and the number of ratings their books have received.

In this case, the x-axis represents the names of the authors, while the y-axis represents the number of ratings. The bars are used to visually represent the number of ratings for each author, with the height of each bar indicating the number of ratings received by the corresponding author.

The code also rotates the x-axis labels by 90 degrees to prevent overlapping of the labels when they are long or numerous.

Overall, a bar chart is a good choice for displaying the top 10 most popular authors based on the number of ratings, as it provides a clear and easy-to-understand representation of the data. It helps to identify which authors are more popular among the readers based on their books' ratings.

**INSIGHTS**

Stephen King, Nora Roberts, and John Grisham are the top 3 authors based on the count of book ratings.

**BUSINESS IMPACT**

The gained insights from the analysis can potentially lead to a positive business impact for the book industry. For example, identifying the most popular books, authors, and publishers can help businesses make strategic decisions such as which books to stock, which authors to feature, and which publishers to partner with.

However, there are also some insights that can lead to negative growth. For example, if the analysis reveals that a significant portion of the books in the dataset have not been rated, this could indicate a lack of interest in those books or a lack of engagement with the platform. Additionally, if the analysis shows that users from certain regions tend to rate books lower on average, this could indicate a potential issue with targeting those regions for book sales and marketing efforts. It is important to consider these potential negative impacts and take them into account when making business decisions based on the insights gained from the analysis.

**Analysis how the number of books published each year has changed over time**

In [None]:
# Create a DataFrame with the number of rows per publication year
# (df has one row per rating, so a book rated several times is counted several times)
books_published_per_year = pd.DataFrame(df.groupby('Year-Of-Publication')['ISBN'].count())

# Sort the years in ascending order
books_published_per_year.sort_index(inplace=True)


In [None]:
# Create a line chart of the number of books published each year
plt.plot(books_published_per_year.index, books_published_per_year['ISBN'])
plt.xlabel('Year')
plt.ylabel('Number of Books Published')
plt.title('Number of Books Published Each Year')
plt.show()

**Reason to choose the specific graph**

The code is creating a line chart of the number of books published each year. A line chart is a good choice for this type of data because it allows for easy visualization of the trend in the number of books published over time.

In this case, the x-axis represents the years in which the books were published, while the y-axis represents the number of books published. The line is used to visually represent the trend in the number of books published each year, with the slope of the line indicating the rate of change.

The code also includes axis labels and a title to make the graph easy to understand.

Overall, a line chart is a good choice for displaying the trend in the number of books published each year, as it provides a clear and easy-to-understand representation of the data. It helps to identify any patterns or changes in the number of books published over time.
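On a merged ratings table, counting rows per year and counting distinct titles per year can give different pictures. The sketch below contrasts `count()` with `nunique()` on a made-up frame; the frame `merged` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical merged frame: one row per rating, so ISBN "111" appears three times.
merged = pd.DataFrame({
    "ISBN": ["111", "111", "111", "222", "333"],
    "Year-Of-Publication": [2001, 2001, 2001, 2002, 2002],
})

# Rows per year: counts every rating of every book.
rows_per_year = merged.groupby("Year-Of-Publication")["ISBN"].count()

# Distinct titles per year: counts each ISBN once.
titles_per_year = merged.groupby("Year-Of-Publication")["ISBN"].nunique()

print(rows_per_year.to_dict())
print(titles_per_year.to_dict())
```

In this toy data the year with the most rows is not the year with the most distinct titles, so the choice of aggregation should match the question being asked.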

In [None]:
most_popular_year = books_published_per_year['ISBN'].idxmax()
print(f"The most popular year for book publishing in the dataset was {most_popular_year}.")

**INSIGHTS**

The most popular year for book publishing in the dataset was 2002

**BUSINESS IMPACT**

Based on this insight alone, it is difficult to determine whether it will have a positive business impact. Since 2002 is already in the past, the figure cannot guide release timing directly; it mainly shows which publication years are best represented in the dataset. A business in the book industry could use it to check how well its own catalogue from those years is covered, or to see how recent the rated books are.

**The distribution of ratings given by users**

In [None]:
# Create a DataFrame with the mean rating given by each user
ratings_distribution = pd.DataFrame(df.groupby('User-ID')['Book-Rating'].mean())


In [None]:
# Create a histogram of the distribution of ratings given by users
plt.hist(ratings_distribution['Book-Rating'], bins=10)
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Ratings Given by Users')
plt.show()

**REASON FOR CHOOSING THE GRAPH**

The chosen graph is a histogram, which shows the distribution of a continuous variable (in this case, the mean rating given by each user). It is an appropriate choice because it allows us to see how the mean ratings are distributed across users. The histogram is divided into bins, with each bin representing a range of values for the variable being measured. The height of each bar represents the number of observations (users) that fall into that bin. The x-axis represents the values of the variable being measured, and the y-axis represents the frequency of observations. The histogram is a useful tool for visualizing the shape of a distribution and identifying patterns, such as skewness, central tendency, and outliers.

**INSIGHTS**

A large portion of users have a mean rating of 0, meaning they have not rated any book.

Among the remaining users, ratings of 9 and 10 are the most frequent.

**BUSINESS IMPACT**

The insight that a large portion of users have not rated any book may lead to a negative impact on the business, as it suggests that there is a lack of user engagement with the platform or the books being offered. This may result in lower sales and revenue for the company.

On the other hand, the insight that the books rated 10 and 9 are the most frequent ones can have a positive impact on the business, as it suggests that users are highly satisfied with the books they have read and are likely to recommend them to others. This may lead to higher sales and revenue as well as a positive reputation for the company.

**Analysis of users demographics**

In [None]:
user_info = df.groupby('User-ID').agg({'Age': 'mean', 'City': 'first', 'State': 'first', 'Country': 'first'})
print(user_info.head())

**BUSINESS IMPACT**

The resulting DataFrame shows the mean age, city, state, and country of each user in the dataset. We can use this information to gain insights into the demographics of the users, such as the age range of the users, the most common locations of the users, and the countries represented in the dataset.

For example, we can see that the mean age of users is around 34 years old. We can also see that the most common location of users is in the United States (based on the State column) and that there are users from other countries as well, such as Canada and Spain (based on the Country column). These insights can be used to tailor marketing strategies or book recommendations to different user groups based on their demographics.

**Study of the popularity of books by different authors**

In [None]:
# Create a DataFrame with the number of books written by each author
author_counts = df['Book-Author'].value_counts().reset_index()
author_counts.columns = ['Book-Author', 'Book-Count']

# Create a DataFrame with the average rating for books written by each author
author_ratings = df.groupby('Book-Author')['Book-Rating'].mean().reset_index()
author_ratings.columns = ['Book-Author', 'Average-Rating']

# Merge the two DataFrames on the Book-Author column
author_data = pd.merge(author_counts, author_ratings, on='Book-Author')

# Print the top 10 authors by book count and average rating
print(author_data.sort_values(['Book-Count', 'Average-Rating'], ascending=False).head(10))

Stephen King, Nora Roberts, and John Grisham are the top 3 authors based on book counts.

**Analysis of the books by publisher or author**

In [None]:
# Create a DataFrame with the count of books published by each publisher
publisher_counts = df['Publisher'].value_counts().reset_index()
publisher_counts.columns = ['Publisher', 'Book-Count']

# Create a DataFrame with the average rating for books published by each publisher
publisher_ratings = df.groupby('Publisher')['Book-Rating'].mean().reset_index()
publisher_ratings.columns = ['Publisher', 'Average-Rating']

# Merge the two DataFrames on the Publisher column
publisher_data = pd.merge(publisher_counts, publisher_ratings, on='Publisher')

# Print the top 10 publishers by book count and average rating
print(publisher_data.sort_values(['Book-Count', 'Average-Rating'], ascending=False).head(10))

Ballantine Books, Pocket, and Berkley Publishing Group are the top 3 publishers based on book counts.

In [None]:
# Compute the correlation between publication year and rating
corr = df['Year-Of-Publication'].corr(df['Book-Rating'])

print(f"Correlation between publication year and rating: {corr:.2f}")

Correlation between publication year and rating: 0.04

**BUSINESS IMPACT**

The correlation coefficient of 0.04 indicates a very weak positive correlation between publication year and rating. In practice, this means there is no meaningful linear relationship between the year a book was published and its rating.

From a business perspective, this insight may not have a significant impact as it suggests that there is no particular advantage or disadvantage to publishing books in a particular year. However, it is important to note that there may be other factors that influence a book's rating such as the author, genre, and publisher. Therefore, it is important to consider multiple factors when making business decisions related to book publishing.

In [None]:
# Compute the correlation between each publisher's mean rating and its number of ratings
corr = df.groupby('Publisher')['Book-Rating'].mean().corr(df.groupby('Publisher')['Book-Rating'].count())

print(f"Correlation between publisher and rating: {corr:.2f}")

Correlation between publisher and rating: -0.02

**BUSINESS IMPACT**

A correlation coefficient of -0.02 indicates a very weak negative correlation between the number of ratings a publisher's books have received and that publisher's average rating. In other words, publishers with more ratings do not have systematically higher or lower average ratings. This insight may not have a strong business impact on its own, but it can be used in conjunction with other insights to gain a more comprehensive understanding of the factors that affect book ratings.
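To make explicit what the cell above measures, here is a minimal sketch on a made-up toy DataFrame (the publishers and ratings are hypothetical, not taken from the dataset). It builds one row per publisher with the mean rating and the number of ratings, then correlates those two columns, which is what the `groupby` expression computes.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Publisher':   ['A', 'A', 'A', 'B', 'B', 'C'],
    'Book-Rating': [8, 9, 10, 5, 7, 6],
})

# One row per publisher: mean rating and number of ratings
per_publisher = toy.groupby('Publisher')['Book-Rating'].agg(['mean', 'count'])

# Pearson correlation between the two per-publisher columns
toy_corr = per_publisher['mean'].corr(per_publisher['count'])

print(per_publisher)
print(f"Correlation between rating count and mean rating: {toy_corr:.2f}")
```

Because the correlation is computed over publishers rather than over individual ratings, a publisher with a handful of ratings carries the same weight as one with thousands.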

**The distribution of users across different countries**.

In [None]:
# Count the number of users in each country
country_counts = df['Country'].value_counts().head(15)

# Plot the results using a bar chart
fig, ax = plt.subplots(figsize=(10,6))
ax.bar(country_counts.index, country_counts.values)

# Set chart title and labels
ax.set_title('Distribution of Users by Country (Top 15)')
ax.set_xlabel('Country')
ax.set_ylabel('Number of Users')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)

# Display the chart
plt.show()

**REASON FOR CHOOSING THE GRAPH**

The code is creating a bar chart to show the distribution of users by country. A bar chart is a good choice for this type of data because it allows for easy comparison between the different countries and the number of users in each country.

In this case, the x-axis represents the different countries, while the y-axis represents the number of users in each country. The bars are used to visually represent the number of users in each country, with the height of each bar indicating the number of users in the corresponding country.

The code also includes axis labels and a title to make the graph easy to understand. Additionally, the x-axis labels are rotated by 90 degrees for better readability.

Overall, a bar chart is a good choice for displaying the distribution of users by country, as it provides a clear and easy-to-understand representation of the data. It helps to identify which countries have the most users in the dataset and to compare the number of users across different countries.

**INSIGHTS**

The USA has the highest number of users.

**BUSINESS IMPACT**

If the number of users is maximum in the country USA, it suggests that the business may want to focus more on this market to increase its reach and revenue. This could involve targeted marketing and advertising campaigns to attract more users from this region, as well as offering promotions or discounts to incentivize users to purchase more books. Additionally, the business could consider partnering with local publishers or booksellers in the USA to expand its offerings and improve its competitiveness in the market. Overall, having a large user base in a particular region presents an opportunity for the business to grow and increase its profitability.

In [None]:
# Filter the dataframe to only include users from the USA
usa_df = df[df['Country'] == 'usa']

# Count the number of users in each state
state_counts = usa_df['State'].value_counts().head(15)

# Plot the results using a bar chart
fig, ax = plt.subplots(figsize=(10,6))
ax.bar(state_counts.index, state_counts.values)

# Set chart title and labels
ax.set_title('Distribution of Users by State (USA)')
ax.set_xlabel('State')
ax.set_ylabel('Number of Users')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)

# Display the chart
plt.show()


**REASON FOR CHOOSING THE GRAPH**

The code is creating a bar chart to show the distribution of users by state in the USA. A bar chart is a good choice for this type of data because it allows for easy comparison between the different states and the number of users in each state.

In this case, the x-axis represents the different states in the USA, while the y-axis represents the number of users in each state. The bars are used to visually represent the number of users in each state, with the height of each bar indicating the number of users in the corresponding state.

The code also includes axis labels and a title to make the graph easy to understand. Additionally, the x-axis labels are rotated by 90 degrees for better readability.

Overall, a bar chart is a good choice for displaying the distribution of users by state in the USA, as it provides a clear and easy-to-understand representation of the data. It helps to identify which states have the most users in the dataset and to compare the number of users across different states.

**INSIGHTS**

Within the USA, California has the highest number of users, followed by Texas.

**BUSINESS IMPACT**

The insight that California has the highest number of users in the USA could be useful for businesses that want to target their marketing efforts towards a specific geographic location. For example, a book retailer may choose to focus its advertising campaigns in California to increase its customer base in the state. Similarly, businesses can tailor their inventory and promotions to the reading preferences of users in California based on the genre of books that are popular in the region. By understanding the demographics and reading habits of users in different regions, businesses can make more informed decisions about their marketing strategies, product offerings, and inventory management.

**The distribution of users across different cities**

In [None]:
# Count the number of users in each city
city_counts = df['City'].value_counts().head(15)

# Plot the results using a bar chart
fig, ax = plt.subplots(figsize=(10,6))
ax.bar(city_counts.index, city_counts.values)

# Set chart title and labels
ax.set_title('Distribution of Users by City (Top 15)')
ax.set_xlabel('City')
ax.set_ylabel('Number of Users')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)

# Display the chart
plt.show()

**REASON FOR CHOOSING THE GRAPH**

The code is creating a bar chart to show the distribution of users by city. A bar chart is a good choice for this type of data because it allows for easy comparison between the different cities and the number of users in each city.

In this case, the x-axis represents the different cities, while the y-axis represents the number of users in each city. The bars are used to visually represent the number of users in each city, with the height of each bar indicating the number of users in the corresponding city.

The code also includes axis labels and a title to make the graph easy to understand. Additionally, the x-axis labels are rotated by 90 degrees for better readability.

Overall, a bar chart is a good choice for displaying the distribution of users by city, as it provides a clear and easy-to-understand representation of the data. It helps to identify which cities have the most users in the dataset and to compare the number of users across different cities.

**INSIGHTS**

Among cities, Toronto has the highest number of users, followed by Chicago.

**BUSINESS IMPACT**

The business impact of having a high number of users in certain locations such as California, Texas, Toronto and Chicago could be significant. Companies can use this information to target their marketing efforts more effectively and tailor their offerings to the specific preferences of users in these locations. For example, they can create regional promotions or partnerships with local businesses to increase brand awareness and drive sales. They can also analyze the reading habits and preferences of users in these locations to inform their product development and inventory management strategies. By understanding where their users are located and what they want, companies can optimize their operations and create a better customer experience, ultimately leading to increased revenue and growth.

**Analysis of the distribution of user ages**

In [None]:
# Plot a histogram of user ages
fig, ax = plt.subplots(figsize=(10,6))
ax.hist(df['Age'], bins=20)

# Set chart title and labels
ax.set_title('Distribution of User Ages')
ax.set_xlabel('Age')
ax.set_ylabel('Number of Users')

# Display the chart
plt.show()

**REASON FOR CHOOSING THE GRAPH**

The code is creating a histogram to show the distribution of user ages. A histogram is a good choice for this type of data because it shows the frequency distribution of a continuous variable, such as age.

In this case, the x-axis represents the age range, and the y-axis represents the frequency of users in each age range. The histogram consists of a set of bars, where each bar represents an age range and its height indicates the number of users within that range.

The code also includes axis labels and a title to make the graph easy to understand. The number of bins (20) is specified to control the number of age ranges that the data is divided into, and therefore, the width of each bar.

Overall, a histogram is a good choice for displaying the distribution of user ages, as it provides a clear visual representation of the age distribution in the dataset. It helps to identify the age ranges with the most users and to understand the overall age distribution of the dataset.

**INSIGHTS**

The majority of users are in the 25-40 age group.

**BUSINESS IMPACT**

The insight that a majority of users are from the age group of 25-40 can have a positive business impact as it provides valuable information to businesses that cater to this age group. For example, publishers can use this information to target their marketing campaigns towards this age group, or businesses selling books can promote books that are popular among this age group. Additionally, this information can also be useful for businesses that sell products other than books, as they can use this information to tailor their marketing efforts towards this age group.

**Analysis of correlation between variables**

In [None]:
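# Compute the correlation matrix over the numeric columns only
# (text columns such as titles or locations cannot be correlated)
corr = df.select_dtypes(include='number').corr()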

# Generate a heatmap to visualize the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, linewidths=.5)

# Set the plot title
plt.title("Correlation Heatmap")

# Show the plot
plt.show()

**REASON FOR CHOOSING THE GRAPH**

The chosen graph is a heatmap as it is used to visualize the correlation matrix between the different numerical features of the dataset. The heatmap uses a color scale to represent the correlation values, making it easy to identify highly correlated or anti-correlated features. The use of the annot=True parameter allows for the correlation values to be displayed in each cell of the heatmap, making it easier for the viewer to understand the values. Additionally, the use of the coolwarm color map enhances the visual appeal of the heatmap.





**INSIGHTS**

No notable multicollinearity was found between the variables in the dataset.

**BUSINESS IMPACT**

Multicollinearity can cause issues with interpretation of the model and the accuracy of the predictions, so it's important to address it if it's present. Without multicollinearity, the model's coefficients can be more easily interpreted and the predictions can be more accurate.
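As a rough way to read this off a correlation matrix, the sketch below lists every pair of numeric columns whose absolute correlation exceeds a cutoff. The helper name `high_corr_pairs`, the 0.8 cutoff, and the toy columns are assumptions chosen for illustration. Pairwise correlation is only a first screen: it does not catch multicollinearity that involves three or more variables.

```python
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| > threshold."""
    corr = frame.select_dtypes(include='number').corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data (hypothetical): y is an exact multiple of x, z is unrelated
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'z': [5, 1, 4, 2, 3],
})

pairs = high_corr_pairs(toy)
print(pairs)
```

An empty list for the real dataset would be consistent with the heatmap showing no strongly correlated pairs.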

**EDA CONCLUSIONS**

Based on the observations from the given dataset, we can conclude that:

A large portion of books has not been rated, which might affect the accuracy of the ratings.

The most popular books based on ratings are Wild Animus, The Lovely Bones, The Da Vinci Code, Divine Secrets of the Ya-Ya Sisterhood, The Red Tent, A Painted House, Snow Falling on Cedars, The Secret Life of Bees, Angels & Demons, and Where the Heart Is.

The top publishers based on the number of books in the dataset are Ballantine Books, Pocket, Berkley Publishing Group, Warner Books, Harlequin, Bantam Books, Bantam, Signet Book, Avon, and Penguin Books.

The top three authors based on the number of books rated are Stephen King, Nora Roberts, and John Grisham.

The most popular year for book publishing in the dataset is 2002.

A large portion of users has not rated any book.

The most frequent ratings given by users are 10 and 9.

The mean age of users is around 34 years old, and the most common location of users is in the United States, with California having the highest number of users.

Toronto and Chicago are the cities with the highest number of users.

The majority of users are from the age group of 25-40.

There is no multicollinearity found between the variables in the dataset.

### **Hypothesis Test 1:**
Null Hypothesis: The proportion of books rated is greater than or equal to 50%

Alternative Hypothesis: The proportion of books rated is less than 50%

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# Compute proportion of books rated
num_books_rated = len(df[df['Book-Rating'] != 0])
total_books = len(df)
prop_books_rated = num_books_rated / total_books

# Set null hypothesis proportion
null_prop = 0.5

# Calculate z-score and p-value for one-sample proportion test
z_score, p_value = proportions_ztest(num_books_rated, total_books, null_prop, alternative='smaller')

# Set significance level
alpha = 0.05

# Print results
print(f"Proportion of books rated: {prop_books_rated:.4f}")
print(f"Null Hypothesis: The proportion of books rated is greater than or equal to {null_prop:.2f}")
print(f"Alternative Hypothesis: The proportion of books rated is less than {null_prop:.2f}")
print(f"z-score: {z_score:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < alpha:
    print("Reject null hypothesis - proportion of books rated is significantly less than 50%")
else:
    print("Fail to reject null hypothesis - proportion of books rated is not significantly less than 50%")


Proportion of books rated: 0.3722

Null Hypothesis: The proportion of books rated is greater than or equal to 0.50

Alternative Hypothesis: The proportion of books rated is less than 0.50

z-score: -268.3426

p-value: 0.0000

Reject null hypothesis - proportion of books rated is significantly less than 50%

**Which statistical test have you done to obtain P-Value?**

It uses a one-sample z-test to compare a sample proportion to a null hypothesis value.

**Why did you choose the specific statistical test?**

The specific statistical test chosen in this code is a one-sample proportion test. The purpose of this test is to determine whether a proportion is significantly different from a hypothesized value. In this case, the proportion of books rated is being tested against a null hypothesis of being greater than or equal to 50%. The z-score and p-value are calculated using the proportions_ztest() function from statsmodels.stats.proportion module. The null and alternative hypotheses are defined in the print statements, and the significance level is set to alpha = 0.05. If the p-value is less than alpha, the null hypothesis is rejected and it is concluded that the proportion of books rated is significantly less than 50%. Otherwise, if the p-value is greater than alpha, the null hypothesis is not rejected, and it is concluded that there is not enough evidence to suggest that the proportion of books rated is significantly less than 50%.
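The arithmetic behind the z-score can be sketched by hand. The function below is a minimal illustration, not the statsmodels implementation; it assumes the standard error is computed from the sample proportion, which is the documented default of `proportions_ztest` (`prop_var=False`). The function name and the counts are hypothetical.

```python
import math

def one_sample_prop_ztest(count, nobs, p0):
    """Lower-tailed one-sample z-test for a proportion.

    The standard error uses the sample proportion (assumption: this mirrors
    the default behaviour of statsmodels' proportions_ztest).
    """
    p_hat = count / nobs
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p0) / se
    # Standard normal CDF evaluated at z gives the lower-tail p-value
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# Hypothetical counts, for illustration only
z, p = one_sample_prop_ztest(count=40, nobs=100, p0=0.5)
print(f"z = {z:.4f}, p = {p:.4f}")
```

With roughly a million rows, as in this dataset, the standard error becomes tiny, which is why even a moderate gap between 0.37 and 0.50 yields a z-score in the hundreds.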

### **Hypothesis Test 2:**
Null Hypothesis: The average rating for the top 10 most popular books is equal to 8

Alternative Hypothesis: The average rating for the top 10 most popular books is not equal to 8

In [None]:
from scipy.stats import ttest_1samp

# create a list of the top 10 most popular books based on ratings
top_books = ['Wild Animus', 'The Lovely Bones', 'The Da Vinci Code', 'Divine Secrets of the Ya-Ya Sisterhood',
             'The Red Tent', 'A Painted House', 'Snow Falling on Cedars', 'The Secret Life of Bees',
             'Angels & Demons', 'Where the Heart Is']

# create a subset of the dataframe with only the top 10 most popular books
top_books_ratings = df[df['Book-Title'].isin(top_books)]

# calculate the mean rating for the top 10 most popular books
mean_rating = top_books_ratings['Book-Rating'].mean()

# set the null hypothesis value
null_hypothesis = 8

# perform the one-sample t-test
t_statistic, p_value = ttest_1samp(top_books_ratings['Book-Rating'], null_hypothesis)

# print the results
print('Mean rating for the top 10 most popular books:', mean_rating)
print('Null hypothesis:', null_hypothesis)
print('t-statistic:', t_statistic)
print('p-value:', p_value)

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the average rating for the top 10 most popular books is significantly different from 8.

**Which statistical test have you done to obtain P-Value?**

It uses a one-sample t-test to test whether the mean rating for a sample of books is significantly different from a null hypothesis value.

**Why did you choose the specific statistical test?**

The specific statistical test chosen is a one-sample t-test because we are testing the mean of a sample (the ratings of the top 10 most popular books) against a hypothesized population mean (the null hypothesis value of 8). This test is used when we want to determine whether a sample mean is significantly different from a hypothesized value.

### **Hypothesis Test 3:**
Null Hypothesis: The mean age of users in the United States is equal to the mean age of users in other countries

Alternative Hypothesis: The mean age of users in the United States is different from the mean age of users in other countries

In [None]:
import scipy.stats as stats

# Create two groups: users in the United States and users in other countries
us_users = df[df['Country'] == 'usa']['Age'].dropna()
other_users = df[df['Country'] != 'usa']['Age'].dropna()

# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(us_users, other_users, equal_var=False)

# Print results
print("Mean age of users in the United States: {:.2f}".format(us_users.mean()))
print("Mean age of users in other countries: {:.2f}".format(other_users.mean()))
print("t-statistic: {:.2f}".format(t_stat))
print("p-value: {:.10f}".format(p_value))


The mean age of users in the United States is significantly higher than the mean age of users in other countries. The t-statistic of 121.38 and the p-value of 0.0 suggest that the difference in mean ages between the two groups is statistically significant, indicating that the null hypothesis of the mean age of users in the United States being equal to the mean age of users in other countries can be rejected. Therefore, we can conclude that there is a difference in mean ages between users in the United States and users in other countries.

**Which statistical test have you done to obtain P-Value?**

It uses a two-sample t-test to test whether there is a significant difference in the mean age of users between two groups (users in the United States and users in other countries).

**Why did you choose the specific statistical test?**

The two-sample t-test was chosen because it is used to determine whether there is a significant difference between the means of two independent groups. In this case, the groups are users in the United States and users in other countries, and the objective is to determine if the mean age of users in the United States is different from the mean age of users in other countries. The two-sample t-test is appropriate because the sample sizes are greater than 30, the samples are independent, and the variances are assumed to be unequal.
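For reference, the statistic behind `ttest_ind(..., equal_var=False)` is Welch's t. The sketch below computes it from scratch using only the standard library; the function name `welch_t` and the two small age samples are hypothetical and serve only to show the formula.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t-statistic and approximate degrees of freedom for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator), each divided by its sample size
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical ages, for illustration only
ages_a = [30, 35, 40, 45, 50]
ages_b = [20, 25, 30, 35]

t_stat, dof = welch_t(ages_a, ages_b)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

Because each group's variance is divided by its own sample size, the test does not require the two groups to have equal variances or equal sizes.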

## ***7. ML Model Implementation***

In [None]:
# Merge ratings_df and books_df on ISBN
ratings_with_name = pd.merge(ratings_df, new_books_df, on='ISBN')

# Filter out ratings with 0 Book-Rating
ratings_with_name = ratings_with_name[ratings_with_name['Book-Rating']!=0]

# Group by User-ID and count the number of ratings
x = ratings_with_name.groupby('User-ID').count()['Book-Rating'] > 50

# Get the index of users with more than 50 ratings
experienced_users = x[x].index

# Filter ratings by experienced users
filtered_ratings = ratings_with_name[ratings_with_name['User-ID'].isin(experienced_users)]

# Group by Book-Title and count the number of ratings
y = filtered_ratings.groupby('Book-Title').count()['Book-Rating']>=10

# Get the index of books with at least 10 ratings
famous_books = y[y].index

# Filter final data by famous books
final_data = filtered_ratings[filtered_ratings['Book-Title'].isin(famous_books)]

# Remove duplicate ratings (drop_duplicates returns a new DataFrame, so reassign it)
final_data = final_data.drop_duplicates()



### ML Model - 1 COLLABORATIVE FILTERING USING K-NEAREST NEIGHBOUR


In [None]:
#function for creating metrics chart.
def store(model,x,str1,str2):
  metrics = {'Model_name':[model],'test_rmse':[x] ,'Hypertuned':[str1],'Cross_validate':[str2]}
  df = pd.DataFrame(metrics)
  return df

In [None]:
!pip install surprise

In [None]:
from surprise import Dataset, Reader, KNNBasic, accuracy
from surprise.model_selection import cross_validate

In [None]:
#Initialize a Reader object with the rating scale (1, 10)
reader = Reader(rating_scale=(1,10))

#Load the dataset from the final_data dataframe using the load_from_df function of the Dataset class
data = Dataset.load_from_df(final_data[['User-ID','ISBN','Book-Rating']],reader = reader)

In [None]:
final_data.columns

The following code implements book recommendation based on Pearson correlation and the 20 most similar users.

In [None]:
#Set similarity options: Pearson correlation computed between users
user_based_pearson_sim = {'name':'pearson','user_based':True}

knn = KNNBasic(k=20,min_k=5,sim_options = user_based_pearson_sim)

**CROSS VALIDATION**

In [None]:
cv_results = cross_validate(knn,data,measures=['RMSE'],cv = 5,verbose=False)

The returned cv_results dictionary holds the test RMSE for each fold along with the time taken to fit and test the model.

Let us take the average RMSE across all folds.

In [None]:
x = np.mean(cv_results.get('test_rmse'))


In [None]:
metric1 = store('KNN',x,'No','Yes')
metric1

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this code is a K-Nearest Neighbors (KNN) algorithm for Collaborative Filtering. Collaborative filtering is a technique that recommends items (books in this case) to users based on the preferences and ratings of similar users.

The performance of the KNN algorithm is evaluated using the Root Mean Square Error (RMSE) metric. RMSE is a measure of the difference between predicted and actual ratings. A lower RMSE indicates better accuracy of the model in predicting ratings.

In the given code, the KNN model is built using up to 20 nearest neighbors, with a minimum of 5 neighbors required for a neighbor-based prediction. Because user_based is set to True, the similarity is calculated between users, using the Pearson correlation coefficient.
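To illustrate what a Pearson similarity between two users means, here is a simplified sketch computed over the books both users have rated. It is not Surprise's implementation; the function name `pearson_sim`, the fallback value of 0.0, and the two toy users are assumptions made for illustration.

```python
import math

def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation between two users over the items both have rated.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    Returns 0.0 when fewer than two items are shared or a user has no variance.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    u = [ratings_u[i] for i in common]
    v = [ratings_v[i] for i in common]
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = (math.sqrt(sum((a - mean_u) ** 2 for a in u))
           * math.sqrt(sum((b - mean_v) ** 2 for b in v)))
    return num / den if den else 0.0

# Hypothetical users: bob rates every shared book one point lower than alice
alice = {'book1': 8, 'book2': 10, 'book3': 6}
bob = {'book1': 7, 'book2': 9, 'book3': 5, 'book4': 10}

similarity = pearson_sim(alice, bob)
print(similarity)
```

Pearson correlation subtracts each user's mean, so two users who rank books the same way count as similar even if one of them rates consistently lower.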

To evaluate the model's performance, 5-fold cross-validation is performed using the cross_validate() function from the Surprise library. The measures parameter is set to 'RMSE' to calculate the RMSE score. The average test RMSE across all folds is then calculated using np.mean().

The average test RMSE obtained from cross-validation is about 1.76, which means the model's predictions are typically off by roughly 1.76 points on a scale of 1 to 10. While this score is not very low, it can still be considered a reasonable performance for a basic KNN collaborative filtering model. However, the performance of the model may depend on various factors such as the size and quality of the dataset, the similarity measure used, and the choice of hyperparameters like the number of neighbors. It is always recommended to fine-tune the model and try different algorithms and evaluation metrics to find the best model for the given dataset.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from surprise.model_selection.search import GridSearchCV

In [None]:
param_grid = {'k':[10,20],
              'sim_options':{'name':['cosine','pearson'],'user_based':[True,False]}}

grid_cv = GridSearchCV(KNNBasic,param_grid,measures=['rmse'],cv=5,refit=True)

grid_cv.fit(data)

In [None]:
#Best RMSE score
print(grid_cv.best_score['rmse'])
x = grid_cv.best_score['rmse']
#Combination of parameters that gave the best RMSE score
print(grid_cv.best_params['rmse'])

The best model is item-based collaborative filtering with cosine similarity and 20 nearest neighbors.

Details of the grid search are captured in grid_cv.cv_results. We can convert it to a DataFrame and print a few columns such as param_sim_options and mean_test_rmse.

In [None]:
metric2 = store('KNN',x,'YES','YES')
metric2

In [None]:
results_df = pd.DataFrame.from_dict(grid_cv.cv_results)
results_df[['param_k','param_sim_options','mean_test_rmse','rank_test_rmse']].sort_values('rank_test_rmse')

Each record represents the parameters used to build a model and the corresponding RMSE. The last column, rank_test_rmse, shows the model's rank by test RMSE among all the models.

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is a popular hyperparameter optimization technique in machine learning. It works by exhaustively searching over a specified hyperparameter grid to find the optimal set of hyperparameters that yield the best performance on a validation set.

In the code snippet, the GridSearchCV() function is used to perform a grid search over the hyperparameter grid specified in param_grid. The grid search is performed on the KNNBasic algorithm, with the measures parameter set to ['rmse'], indicating that the root mean squared error (RMSE) is used as the performance metric. The cv parameter is set to 5, indicating that 5-fold cross-validation is used to evaluate the performance of each set of hyperparameters. Finally, the refit parameter is set to True, indicating that the optimal set of hyperparameters is used to refit the model on the entire dataset.

The hyperparameter grid specified in param_grid includes two hyperparameters: k and sim_options. The k hyperparameter specifies the maximum number of neighbors to consider when making predictions, and the sim_options hyperparameter specifies the similarity measure (cosine or Pearson) and whether similarities are computed between users or between items (user_based). The grid search evaluates every combination of these values to find the set of hyperparameters that yields the best cross-validated performance.
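To see how many candidate models this grid implies, the sketch below expands the same `param_grid` by hand with `itertools.product`. It is an illustration of the idea of an exhaustive grid, not the code that `GridSearchCV` runs internally; the variable names `sim_combos` and `candidates` are hypothetical.

```python
from itertools import product

param_grid = {
    'k': [10, 20],
    'sim_options': {'name': ['cosine', 'pearson'], 'user_based': [True, False]},
}

# Expand the nested sim_options dict into a list of concrete option dicts
sim_keys = list(param_grid['sim_options'])
sim_combos = [dict(zip(sim_keys, values))
              for values in product(*param_grid['sim_options'].values())]

# Cross every value of k with every sim_options combination
candidates = [{'k': k, 'sim_options': sim}
              for k, sim in product(param_grid['k'], sim_combos)]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Two values of k, two similarity measures, and two settings of user_based give eight candidates, so with cv=5 the search fits each candidate five times.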

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
score1 = pd.concat([metric1,metric2]).reset_index()

In [None]:
score1.drop('index',axis = 1,inplace = True)
score1

Yes, based on the evaluation metric score chart, there is an improvement after hyperparameter tuning.

Before hyperparameter tuning, the model_name 'KNN' had a test_rmse score of 1.760964 and the model was not hypertuned. After hyperparameter tuning, the same model 'KNN' had a lower test_rmse score of 1.631374, indicating an improvement in performance. The 'Hypertuned' column shows that the model was hypertuned after the hyperparameter optimization process.

### ML Model - 2 MATRIX FACTORIZATION USING SINGULAR VALUE DECOMPOSITION

In [None]:
from surprise import SVD

#Use 5 factors for building the model
svd = SVD(n_factors = 5)

Let us use five-fold cross-validation to test the model's performance.

In [None]:
cv_results = cross_validate(svd,data,measures=['RMSE'],cv=5,verbose=True)

In [None]:
result = np.mean(cv_results.get('test_rmse'))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
metric3 = store('SVD',result,'No','YES')
metric3

The ML model used in this code is SVD (Singular Value Decomposition) for collaborative filtering in recommendation systems. SVD is a matrix factorization technique that is commonly used in recommendation systems to model the user-item interaction matrix.
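As a small illustration of how a factorization model turns learned parameters into a rating, the sketch below uses the biased form of the estimate: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors. The function name `predict_rating` and all parameter values are hypothetical; in practice these parameters are learned during `fit`, and clipping to the rating scale is included here as an assumption about how out-of-range estimates are handled.

```python
import numpy as np

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1, 10)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = global_mean + user_bias + item_bias + float(np.dot(item_factors, user_factors))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical learned parameters for one user and one book (5 latent factors)
p_u = np.array([0.1, -0.2, 0.3, 0.0, 0.5])
q_i = np.array([0.4, 0.1, -0.3, 0.2, 0.6])

estimate = predict_rating(global_mean=7.5, user_bias=0.3, item_bias=-0.1,
                          user_factors=p_u, item_factors=q_i)
print(round(estimate, 2))
```

The length of the factor vectors corresponds to `n_factors`, so `SVD(n_factors=5)` describes each user and each book with five latent numbers.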

In the code snippet provided, the SVD class is instantiated with n_factors=5 to create the model. The cross_validate() function then receives the model together with data, the Surprise dataset built from the user-item ratings, and evaluates the model using RMSE (Root Mean Squared Error) as the performance metric with 5-fold cross-validation (cv=5). The verbose parameter is set to True to print the results for each fold.

The results of the cross-validation are stored in the cv_results object, which holds the test_rmse score for each of the 5 folds; np.mean() averages them. The average test_rmse of the SVD model is 1.511075, which summarizes the error between the predicted and actual ratings on the held-out folds. The 'Hypertuned' column shows that the model was not hypertuned: apart from n_factors=5, the default hyperparameters were used to train the model.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The choice of evaluation metrics for a positive business impact depends on the business problem at hand. Generally, the most commonly used evaluation metrics for recommender systems are Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

In this case, the evaluation metric used is RMSE, as specified by measures=['RMSE'] in the cross_validate function. RMSE measures the square root of the average of squared differences between predicted and actual ratings. The lower the RMSE, the better the performance of the model. RMSE is often preferred over MAE because it punishes larger errors more severely, making it more sensitive to outliers.
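The difference between the two metrics is easiest to see on a tiny example. In the sketch below, the helper names `rmse` and `mae` and the toy ratings are made up for illustration: both sets of predictions have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [8, 7, 9, 10]
even_errors = [7, 8, 8, 9]      # every prediction is off by 1
one_big_error = [8, 7, 9, 6]    # three exact predictions, one off by 4

print(mae(actual, even_errors), rmse(actual, even_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

MAE is identical in both cases, while RMSE doubles when the error is concentrated in one prediction, which is the sense in which RMSE penalizes large errors more heavily.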

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
score_df = pd.concat([score1,metric3]).reset_index()
score_df.drop('index',axis = 1,inplace = True)
score_df

Based on the evaluation metrics, the SVD model remains the best choice among the three models: its RMSE of 1.51 is the lowest in the score chart, matched only by the hypertuned KNN model. Furthermore, the SVD model has been cross-validated, which means that its performance has been measured on multiple folds of the data and gives a more reliable estimate of how well it generalizes.

Although the hypertuned KNN model reaches the same RMSE as the SVD model, KNN is a memory-based algorithm that computes similarities over the stored ratings at prediction time, so it may not scale well to larger datasets. SVD, on the other hand, is a model-based algorithm: once the user and item factors have been learned, predictions are cheap, which makes it better suited to larger datasets.

Therefore, I would still choose the SVD model as the final prediction model. However, it's important to keep in mind that the choice of the final model should not only be based on the evaluation metrics but also on the business requirements and constraints, such as model complexity, interpretability, and scalability.

**BUILDING A SIMPLE RECOMMENDATION MODEL USING SVD**

In [None]:
import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from IPython.display import Image, display

# Load the data into a Pandas DataFrame
#final_data = pd.read_csv('your_final_data_file.csv')  # Replace 'your_final_data_file.csv' with your actual final data file path

# Define the rating scale (from 1 to 10 in this case)
reader = Reader(rating_scale=(1, 10))

# Load the data from the DataFrame into the Surprise Dataset
dataset = Dataset.load_from_df(final_data[['User-ID', 'ISBN', 'Book-Rating']], reader)

# Build the training set
trainset = dataset.build_full_trainset()

# Create an SVD model and train it on the training set
model = SVD()
model.fit(trainset)

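# Predict ratings for every (user, book) pair that is missing from the training set.
# Note: the number of pairs grows with users x books, so this can be slow and memory-hungry.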
testset = trainset.build_anti_testset()
predictions = model.test(testset)

# Group the predicted ratings by user
top_n = {}
for uid, iid, _, est, _ in predictions:
    top_n.setdefault(uid, []).append((iid, est))

# Sort and get top 10 recommendations for each user
for uid, user_ratings in top_n.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    top_n[uid] = user_ratings[:10]

# Print the top 10 recommendations for a specific user
user_id = 116866
print(f"Top 10 recommendations for User {user_id}:")
for book_id, est_rating in top_n[user_id]:
    book_info = final_data.loc[final_data['ISBN'] == book_id, ['ISBN', 'Book-Title', 'Book-Rating', 'Image-URL-S']].iloc[0]
    display(Image(url=book_info['Image-URL-S']))
    print(f"Book ID: {book_info['ISBN']}, Book Title: {book_info['Book-Title']}, Estimated Rating: {est_rating}")
    print("\n")



# **Conclusion**

For businesses, a book recommender system like the one presented in this notebook can be a valuable tool to increase customer engagement and satisfaction. By recommending books that match a customer's preferences, the system can help to improve their experience with the business and increase their loyalty.

Implementing a book recommender system requires a dataset of book ratings, which can be collected through various means such as surveys, feedback forms, and purchase history. The system can then be built using collaborative filtering, either neighborhood-based (KNN) or matrix factorization (SVD) as in this notebook, or with other techniques such as deep learning.

It is important for businesses to ensure the accuracy and relevance of the recommendations provided by the system to gain the trust and loyalty of customers. This can be achieved through rigorous testing and evaluation of the system's performance, using rating-prediction metrics such as the RMSE used here and, for the quality of the recommended lists themselves, ranking metrics such as precision, recall, and F1-score.

Overall, a book recommender system can be a valuable asset for businesses that sell books or use books as part of their services. By providing personalized recommendations, businesses can increase customer engagement and satisfaction, leading to higher sales and better customer retention.






