# TOP 5 BOOKS RECOMMENDATION SYSTEM


<img src="Images/books.jpg" alt="Books" width="700" height = "300"/>

| | GROUP MEMBERS | GITHUB | 
| --- | --- | --- |
| 1. | MAUREEN WAMBUGU | https://github.com/Mau-Wambugu |
| 2. | STEPHEN WAWERU | https://github.com/stendewa |
| 3. | LYDIA NJERI | https://github.com/lydiahsherry23 |
| 4. | LILIAN MULI | https://github.com/mwikali24 |
| 5. | PETER MAINA | https://github.com/Mr-PeterMaina |

## 1. BUSINESS UNDERSTANDING

> PROJECT OVERVIEW
* **Recommendation systems** are powerful tools that use machine learning algorithms to provide relevant suggestions to users based on behaviour patterns or user data.  
* **A Book Recommendation System** recommends books to readers based on their interests.
* Recommendation systems help drive engagement and increase sales and revenue. The improved customer experience in turn promotes customer satisfaction and loyalty.
* There are 2 main types of recommendation system models:  
    1. Collaborative filtering  
    2. Content-based filtering
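
To make the first approach concrete, here is a minimal sketch of item-based collaborative filtering on a toy rating matrix. The matrix values and the helper names `item_cosine_similarity` and `most_similar_items` are illustrative assumptions, not part of this project's code. The idea is that two books are similar when the same users rate them in a similar way.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: books); 0 means "not rated".
# The values are made up for illustration only.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
], dtype=float)


def item_cosine_similarity(matrix):
    """Cosine similarity between the columns (books) of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for books nobody rated
    normalized = matrix / norms
    return normalized.T @ normalized


def most_similar_items(matrix, item_index, top_n=2):
    """Indices of the top_n books most similar to item_index, excluding itself."""
    similarity = item_cosine_similarity(matrix)[item_index].copy()
    similarity[item_index] = -np.inf  # never recommend the book itself
    return np.argsort(similarity)[::-1][:top_n].tolist()


print(most_similar_items(ratings, item_index=0, top_n=1))
```

Content-based filtering would instead compare books on their own attributes, such as author or publisher, rather than on user ratings.
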
> BUSINESS PROBLEM  
* Over the past years, e-commerce and online services have grown rapidly, making it harder for clients to find the right products.
* Clients looking to purchase books face the same struggle when trying to match books to their tastes and preferences.
* **The Business Problem** is to develop a recommendation system that recommends books tailored to our users' preferences in order to improve customer experience and increase revenue.
> PROJECT OBJECTIVE
1. To build a book recommendation system that provides personalized suggestions to our users.
2. Improve sales by showcasing books a user is most likely to buy.
3. Offer relevant books to users in order to improve customer retention.
4. Increase customer engagement.
> DATA SOURCE   
* We used data obtained from [Kaggle](https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset/data), mined by Cai-Nicolas Ziegler.
* It contains 3 CSV files:
    1. Books.csv - contains information about books {ISBN;Title;Author;Year;Publisher}
    2. Ratings.csv - contains book ratings provided by users, ranging from 0 to 10 {User-ID;ISBN;Rating}
    3. Users.csv - contains information about the users {User-ID;Age}
> STAKEHOLDERS
1. Customers
    * As the end users, they expect accurate book suggestions based on their personal interests.
2. Marketing team
    * They want to run targeted advertising on specific books and promote personalized offers.
3. Data scientists
    * They are interested in ensuring the recommendation system models are accurate and scalable.
4. Book authors
    * They are interested in knowing how their books are recommended in order to understand their readers' tastes and preferences.
5. Executive (CEO)
    * They want to understand how the recommendation system impacts revenue and customer retention relative to the budget allocated to the project.
> METHODOLOGY
* Our project follows the CRISP-DM methodology:  
    1. Business Understanding
    2. Data Understanding
    3. Data Preparation
    4. Modeling
    5. Evaluation
    6. Deployment

## 2. DATA UNDERSTANDING

In [26]:
import numpy as np
import pandas as pd

In [27]:
# Read the books dataset (forward slashes keep the path portable across operating systems)
Books = pd.read_csv('DATA/Books.csv', low_memory=False)
Books.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |


In [28]:
# Read the ratings dataset
Ratings = pd.read_csv('DATA/Ratings.csv', low_memory=False)
Ratings.head()

| | User-ID | ISBN | Book-Rating |
| --- | --- | --- | --- |
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [29]:
# Read the users dataset
Users = pd.read_csv('DATA/Users.csv', low_memory=False)
Users.head()

| | User-ID | Location | Age |
| --- | --- | --- | --- |
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |


##### The Ratings and Users tables share a common column, `User-ID`. We use the merge function to join both dataframes on `User-ID` and retain the columns `User-ID`, `Book-Rating`, `ISBN` and `Age`.

In [30]:
# Perform the merge based on 'User-ID'
merged_df = pd.merge(Ratings, Users[['User-ID', 'Age']], on='User-ID', how='inner')
# Keep only the selected columns
merged_df = merged_df[['User-ID', 'Book-Rating', 'Age', 'ISBN']]
merged_df


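| | User-ID | Book-Rating | Age | ISBN |
| --- | --- | --- | --- | --- |
| 0 | 276725 | 0 | NaN | 034545104X |
| 1 | 276726 | 5 | NaN | 0155061224 |
| 2 | 276727 | 0 | 16.0 | 0446520802 |
| 3 | 276729 | 3 | 16.0 | 052165615X |
| 4 | 276729 | 6 | 16.0 | 0521795028 |
| ... | ... | ... | ... | ... |
| 1149775 | 276704 | 9 | NaN | 1563526298 |
| 1149776 | 276706 | 0 | 18.0 | 0679447156 |
| 1149777 | 276709 | 10 | 38.0 | 0515107662 |
| 1149778 | 276721 | 10 | 14.0 | 0590442449 |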


In [31]:
# Attach book details to each rating by joining on 'ISBN'
merged_df1 = pd.merge(merged_df, Books, on='ISBN', how='inner')
merged_df1.head()

| | User-ID | Book-Rating | Age | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 276725 | 0 | NaN | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 1 | 2313 | 5 | 23.0 | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 2 | 6543 | 0 | 34.0 | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 3 | 8680 | 5 | 2.0 | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 4 | 10314 | 9 | NaN | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |


In [32]:
merged_df1.shape

(1031136, 11)
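
An inner join keeps only the ratings whose ISBN also appears in the books table, so any unmatched ratings are dropped without warning. The following is a minimal sketch, on made-up toy data, of how one could count such ratings using the `indicator=True` option of `pd.merge`. The names `toy_ratings`, `toy_books`, `checked` and `unmatched` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the ratings and books tables (values are made up).
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "C", "Z"],  # "Z" has no entry in toy_books
    "Book-Rating": [5, 0, 8, 10],
})
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Book-Title": ["Book A", "Book B", "Book C"],
})

# A left join with indicator=True adds a `_merge` column recording
# whether each rating found a matching book ("both") or not ("left_only").
checked = pd.merge(toy_ratings, toy_books, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(f"{len(unmatched)} of {len(toy_ratings)} ratings have no matching book")
print(unmatched[["User-ID", "ISBN", "Book-Rating"]])
```

Running the same check on the real dataframes would show how many ratings the inner join on `ISBN` discards.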

In [33]:
merged_df1.isnull().sum()

User-ID                     0
Book-Rating                 0
Age                    277835
ISBN                        0
Book-Title                  0
Book-Author                 2
Year-Of-Publication         0
Publisher                   2
Image-URL-S                 0
Image-URL-M                 0
Image-URL-L                 4
dtype: int64

In [34]:
# Percentage of missing values per column, keeping only columns that have gaps
missing_percentage = merged_df1.isnull().mean() * 100
missing_percentage = missing_percentage[missing_percentage > 0]
missing_percentage

Age            26.944554
Book-Author     0.000194
Publisher       0.000194
Image-URL-L     0.000388
dtype: float64

In [35]:
# Drop the image URL columns, which are not needed for modelling
merged_df1 = merged_df1.drop(columns=['Image-URL-L', 'Image-URL-M', 'Image-URL-S'])
merged_df1

| | User-ID | Book-Rating | Age | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 276725 | 0 | NaN | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books |
| 1 | 2313 | 5 | 23.0 | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books |
| 2 | 6543 | 0 | 34.0 | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books |
| 3 | 8680 | 5 | 2.0 | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books |
| 4 | 10314 | 9 | NaN | 034545104X | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1031131 | 276688 | 0 | NaN | 0517145553 | Mostly Harmless | Douglas Adams | 1995 | Random House Value Pub |
| 1031132 | 276688 | 7 | NaN | 1575660792 | Gray Matter | Shirley Kennett | 1996 | Kensington Publishing Corporation |
| 1031133 | 276690 | 0 | 43.0 | 0590907301 | Triplet Trouble and the Class Trip (Triplet Tr... | Debbie Dadey | 1997 | Apple |
| 1031134 | 276704 | 0 | NaN | 0679752714 | A Desert of Pure Feeling (Vintage Contemporaries) | Judith Freeman | 1997 | Vintage Books USA |


In [40]:
merged_df1.dtypes

User-ID                  int64
Book-Rating              int64
Age                    float64
ISBN                    object
Book-Title              object
Book-Author             object
Year-Of-Publication     object
Publisher               object
dtype: object
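
The `Year-Of-Publication` column is stored as `object` rather than as a number, which suggests it may contain values that are not valid years. A minimal sketch of one way to convert such a column, assuming invalid entries should become missing values, is shown below. The toy series and the choice to treat a year of 0 as missing are assumptions for illustration.

```python
import pandas as pd

# Toy column mimicking an object-typed year column; the bad value is hypothetical.
years = pd.Series(["2002", "1999", "not a year", "0"], name="Year-Of-Publication")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
years_numeric = pd.to_numeric(years, errors="coerce")

# Treat a year of 0 as missing too (an assumption about how unknown years are encoded).
years_numeric = years_numeric.where(years_numeric > 0)

print(years_numeric)
```

After this conversion the column is numeric, so it would also be picked up by `select_dtypes(include=['float64', 'int64'])`.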

In [42]:
from sklearn.impute import KNNImputer

# Select only the numeric columns
numeric_columns = merged_df1.select_dtypes(include=['float64', 'int64']).columns

# Impute missing values from the 3 nearest neighbours.
# Note: on roughly a million rows the neighbour search is very slow.
knn_imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(knn_imputer.fit_transform(merged_df1[numeric_columns]), columns=numeric_columns)

print(df_imputed)

KeyboardInterrupt: 
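
The `KeyboardInterrupt` shows that the KNN imputation cell was stopped before it finished. KNN imputation has to search for neighbours of every row with a missing value, which can be slow on a table of this size. As a cheaper alternative, and not the method used in this notebook, the sketch below applies median imputation to a toy version of the Users table. Because `Age` is a per-user attribute, filling it there before merging means working with one row per user instead of one row per rating. The names `toy_users` and `Age-Missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Users table (values are made up).
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Age": [np.nan, 18.0, np.nan, 17.0, 40.0],
})

# Record which ages were missing before filling them in.
toy_users["Age-Missing"] = toy_users["Age"].isna()

# Median imputation: a single pass over the column, no neighbour search.
median_age = toy_users["Age"].median()
toy_users["Age"] = toy_users["Age"].fillna(median_age)

print(toy_users)
```

Median imputation ignores relationships between columns, so it trades some accuracy for speed. Keeping the `Age-Missing` flag lets later steps tell real ages from imputed ones.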