# <span style= "color:cyan"> AMAZON BOOK RECOMMENDATION SYSTEM </SPAN>

# 1 BUSINESS UNDERSTANDING

## 1.1 Overview

In the era of exponential data growth, the emergence of more sophisticated systems leveraging big data has become increasingly prevalent. Among these systems, recommendation systems have proven to be valuable information filtering tools, enhancing search results by providing users with more relevant items based on their search queries or browsing history. Major technology companies have embraced recommendation systems across various applications: YouTube utilizes them to determine the next autoplay video, while Spotify employs them to curate personalized "Made for You" daily mixes.

In line with this project's objectives, we aim to harness the power of data analysis to recommend the best books to users. By examining user behaviors, both individual and collective, we can derive insights that enable us to deliver tailored book recommendations that align with their interests and preferences.

The underlying principle of this project is to leverage data-driven techniques to understand user preferences and behaviors. By analyzing user interactions, historical data, and patterns, we can uncover valuable insights that inform our recommendation system. This allows us to present users with a curated list of book suggestions that are highly likely to resonate with their tastes.

## 1.2 Problem Statement

Amazon wants to optimize its recommendation system so that it suggests new and more varied books while increasing its profit margin. The company has seen a slight decrease in sales as a result of repeatedly suggesting old and frequently recommended books on the platform.

While Amazon remains a dominant player in the market, other platforms such as Barnes & Noble, Alibris, and Smashwords have gained popularity and can be considered potential competitors that threaten the platform.

We have therefore been appointed as Junior Data Scientists by Amazon to optimize its book recommendation system. This will enhance customer engagement, improve sales and revenue, increase book discovery, and personalize the user experience, strengthening Amazon's competitive advantage.

## 1.3 The Data

 The Book-Crossing dataset comprises 3 files:

- **Users**: Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.
- **Books**: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.
- **Ratings**: Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
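
Because a rating of 0 marks an implicit interaction rather than a low score, it is usually worth separating the two kinds of ratings before computing statistics. The following is a minimal sketch, assuming a tiny hand-made frame that only borrows the `Book-Rating` column name; the values and the `toy_ratings` name are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy ratings frame; values are made up for illustration only
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "A", "C", "D"],
    "Book-Rating": [0, 8, 0, 10, 6],
})

# Explicit ratings use the 1-10 scale; 0 encodes an implicit interaction
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

mean_all = toy_ratings["Book-Rating"].mean()      # diluted by the zeros
mean_explicit = explicit["Book-Rating"].mean()    # explicit ratings only
print(len(explicit), len(implicit), mean_all, mean_explicit)
```

In this toy example the mean over all rows is much lower than the mean over explicit ratings alone, which is why the two groups are better analyzed separately.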

## 1.4 Business Objectives

1. To develop a book recommendation system that provides personalized suggestions to users based on their individual preferences and reading history.
2. To analyze the most popular books and recommend them to increase revenue.
3. To use the recommendation system's insights to identify the hours of the day when users are most active, allowing online ad campaigns to target book enthusiasts during their most engaged periods.
4. To investigate the relationship between book unit prices and quantity demanded to determine if there are any significant correlations.
5. To monitor market trends, new book releases, and emerging genres to update and refine the recommendation system, ensuring it remains relevant and up to date in the ever-changing book landscape.

## 1.5 Determining The Project Goals


- Develop a prediction model within the book recommendation system that can accurately forecast the likelihood of a specific user showing interest in a particular book, based on their historical data, preferences, and interactions.
- Implement a mechanism to handle new users joining the book recommendation system, providing them with initial recommendations that align with their interests and preferences. This will involve utilizing demographic information, user profiling, and collaborative filtering techniques to generate relevant book suggestions.
- Establish evaluation metrics to assess the performance of the recommendation system, such as precision, recall, and mean average precision.
- Create a function that will return top N recommendations for a user.
- Deploy a real-time recommendation feature that can adapt to users' changing preferences and provide up-to-date book suggestions. This involves continuously updating the recommendation model, incorporating new user interactions, and leveraging real-time data to deliver timely and relevant recommendations.
- Optimize the recommender system so that it can recommend books to new users, thus addressing the cold start problem.
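
The "top N recommendations" goal can be sketched independently of any particular model. The example below is a minimal illustration, assuming the model's output is available as a plain dictionary of predicted scores; the function name `top_n_recommendations`, its parameters, and the sample scores are hypothetical and not part of the project code.

```python
def top_n_recommendations(predicted_scores, already_rated, n=5):
    """Return the n highest-scoring items the user has not rated yet.

    predicted_scores: dict mapping item id -> predicted rating (hypothetical input)
    already_rated: set of item ids the user has already interacted with
    """
    candidates = [
        (item, score)
        for item, score in predicted_scores.items()
        if item not in already_rated
    ]
    # Sort by score descending; ties are broken by item id for determinism
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in candidates[:n]]


scores = {"book_a": 8.1, "book_b": 9.4, "book_c": 6.0, "book_d": 9.4}
print(top_n_recommendations(scores, already_rated={"book_b"}, n=2))
```

Filtering out already-rated items before ranking keeps the list focused on books the user has not seen, which matches the stated aim of improving book discovery.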

## 1.6 Determining the project success criteria

1.    RMSE of less than 0.8
2.    Accuracy of at least 70%
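
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so a target of 0.8 means predictions are off by less than one rating point on average. The sketch below shows the calculation in plain Python; the `rmse` helper and the sample values are illustrative assumptions, not results from this project.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only
actual = [8, 6, 10, 7]
predicted = [7.5, 6.5, 9.0, 7.0]
print(round(rmse(actual, predicted), 4))
```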

## 1.7 Methods Used

- Descriptive Statistics
- Data Visualization
- Machine Learning

# 2 DATA UNDERSTANDING 

## import the necessary libraries and modules for the project:

The code below imports the necessary libraries and modules for the project. Here's a breakdown of the imported libraries and their purposes:

- **pandas** and **numpy**: Data manipulation and analysis.
- **seaborn** and **matplotlib**: Data visualization and plotting.
- **surprise**: Library for collaborative filtering-based recommendation systems.
- **scikit-learn**: Preprocessing and evaluation of the recommendation system.
- **warnings**: Ignoring warning messages.
- **scipy**, **math**, and **nltk**: Libraries for text-based recommendation systems.
- **LightFM**: Library for hybrid recommendation systems.

These libraries cover the data analysis, preprocessing, modeling, and evaluation steps of the book recommendation system, including the collaborative filtering, content-based filtering, and hybrid approaches implemented in later cells. Warnings are suppressed to keep the notebook output clean; note that this can also hide deprecation notices.



In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Surprise library for collaborative filtering
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, SVDpp
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

# Scikit-learn for preprocessing and evaluation
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Ignoring warnings
import warnings
warnings.filterwarnings("ignore")

# Scipy, math, and nltk libraries for text-based recommendation
# (train_test_split and cosine_similarity are already imported above)
import scipy
import math
import sklearn
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse.linalg import svds

# LightFM library for hybrid recommendation
from scipy.sparse import csr_matrix
from lightfm import LightFM


## Reading Data from Files:

The code defines a function called `read_data` that reads data from a given file path using the `pd.read_csv()` function from the pandas library. It has the following parameters:

- `path`: The file path of the data to be read.
- `encoding`: The encoding format of the file. The default is set to `'latin-1'`.
- `sep`: The delimiter used in the file. The default is set to `';'`.
- `on_bad_lines`: Action to be taken when encountering malformed lines. By default, it is set to `'skip'`, which drops such lines.

The older `error_bad_lines` flag served the same purpose but was deprecated in pandas 1.3 and removed in pandas 2.0, so only `on_bad_lines` is passed to `pd.read_csv()`.

The function reads the data from the specified file path using the given parameters and returns the data as a pandas DataFrame.

The code then calls the `read_data` function three times to read three different files: `BX-Book-Ratings.csv`, `BX-Books.csv`, and `BX-Users.csv`. The file paths are provided as arguments to the function.

The resulting DataFrames are assigned to the variables `book_ratings`, `books`, and `users`, respectively.

These DataFrames will be used for further analysis, preprocessing, and modeling in the recommendation system.

In [None]:
def read_data(path, encoding = 'latin-1', sep = ';', on_bad_lines = 'skip'):

    "A simple function that reads the data"

    # on_bad_lines replaces the removed error_bad_lines flag (requires pandas >= 1.3)
    data = pd.read_csv(path, encoding = encoding, sep = sep, on_bad_lines = on_bad_lines)
    return data

book_ratings = read_data(r'C:\Users\user\Documents\Recommendation Systems\recommendation_system_project\BX-Book-Ratings.csv')
books = read_data(r'C:\Users\user\Documents\Recommendation Systems\recommendation_system_project\BX-Books.csv')
users = read_data(r'C:\Users\user\Documents\Recommendation Systems\recommendation_system_project\BX-Users.csv')
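
To make the effect of these `pd.read_csv()` options concrete, here is a minimal sketch that reads a small in-memory stand-in for a semicolon-separated, latin-1 encoded file. The sample rows and the `sample` variable are made up for illustration, and the `dtype` argument is an optional addition (not used in the cell above) that keeps ISBNs as strings so leading zeros survive.

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-separated, latin-1 encoded file.
# The second data line has an extra field, so it counts as a "bad line".
raw = (
    "User-ID;ISBN;Book-Rating\n"
    "1;0195153448;0\n"
    "2;0002005018;5;unexpected\n"
    "3;0060973129;8\n"
).encode("latin-1")

sample = pd.read_csv(
    io.BytesIO(raw),
    sep=";",
    encoding="latin-1",
    on_bad_lines="skip",      # requires pandas >= 1.3
    dtype={"ISBN": str},      # keep leading zeros in ISBNs
)
print(sample)
```

Skipping malformed lines keeps the load from failing, but it also drops data silently, so it can be worth comparing the row count against the raw file.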

### Viewing the First 5 Rows of the DataFrames
We have three datasets:
* `books`
* `users`
* `book_ratings`

Let us explore them by viewing the first five rows of each.

In [3]:
""" calling on variable book_ratings to view the first 5 rows"""

book_ratings.head()

In [4]:
""" calling on variable books to view the first five rows"""

books.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |


In [5]:
""" calling on variable users to view the first 5 rows"""

users.head()

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |


#### <span style= "color:orange"> Preliminary Data understanding </SPAN>

### A simple function to check the shape, info and descriptive statistics of the dataset

In [6]:

def get_info_shape_stats(dataset, dataset_name):

    """A simple function to check the shape, info and descriptive statistics of the dataset"""
    
    print('The Dataset:', dataset_name )
    print(f"has {dataset.shape[0]} rows and {dataset.shape[1]} columns")
    print('---------------------------')
    print('---------------------------')
    print(dataset.info())
    print('---------------------------')
    print('----------------------------')
    print(dataset.describe())

### calling on the function get_info_shape_stats on the Book Ratings

In [7]:
"""calling on the function get_info_shape_stats"""

get_info_shape_stats(book_ratings, 'Book Ratings')

The Dataset: Book Ratings
has 1149780 rows and 3 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
None
---------------------------
----------------------------
            User-ID   Book-Rating
count  1.149780e+06  1.149780e+06
mean   1.403864e+05  2.866950e+00
std    8.056228e+04  3.854184e+00
min    2.000000e+00  0.000000e+00
25%    7.034500e+04  0.000000e+00
50%    1.410100e+05  0.000000e+00
75%    2.110280e+05  7.000000e+00
max    2.788540e+05  1.000000e+01


The 'book_ratings' dataset contains a total of 1,149,780 rows and 3 columns. Here are some key observations about the dataset:

- The dataset consists of the following columns:
    - User-ID: An anonymized identifier for the users.
    - ISBN: The unique identifier for the books.
    - Book-Rating: The rating given by the users for the books. Explicit ratings range from 1 to 10, with higher values indicating higher appreciation; a value of 0 denotes an implicit rating.

- The dataset has no missing values as indicated by the 'Non-Null Count' column.

- Data types:
    - User-ID and Book-Rating columns are of integer type (int64).
    - ISBN column is of object type (string).

- Descriptive Statistics:
    - The mean book rating is approximately 2.87. This figure is pulled down by the implicit ratings recorded as 0, so it should not be read as a low average opinion of the books.
    - The standard deviation of book ratings is around 3.85, reflecting the gap between the many 0 entries and the explicit ratings.
    - The minimum book rating is 0, while the maximum rating is 10.
    - The median rating is 0, so at least half of all entries are implicit ratings; the 75th percentile is 7.

These observations provide an initial understanding of the 'book_ratings' dataset and its characteristics. Further analysis and processing can be performed based on this information to build the recommendation system.


### calling on the function get_info_shape_stats on the Books

In [8]:
"""calling on the function get_info_shape_stats"""

get_info_shape_stats(books, 'Books')

The Dataset: Books
has 271360 rows and 8 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB
None
---------------------------
----------------------------
              ISBN      Book-Title      Book-Author  Year-Of-Publication  \
count       271360          271360           271359               271360   
unique      271360          24

The 'books' dataset contains a total of 271,360 rows and 8 columns. Here are some key observations about the dataset:

- The dataset consists of the following columns:
    - ISBN: The unique identifier for the books.
    - Book-Title: The title of the books.
    - Book-Author: The author of the books.
    - Year-Of-Publication: The year when the books were published.
    - Publisher: The publisher of the books.
    - Image-URL-S: The URL of the small-sized cover image of the books.
    - Image-URL-M: The URL of the medium-sized cover image of the books.
    - Image-URL-L: The URL of the large-sized cover image of the books.

- The dataset has some missing values in the 'Book-Author', 'Publisher', and 'Image-URL-L' columns, as indicated by the 'Non-Null Count' column.

- Data types:
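    - All columns in the dataset are of object type (string). This includes Year-Of-Publication, which would normally be numeric and may therefore need conversion before use.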

- Descriptive Statistics:
    - The dataset has 271,360 unique ISBN values, indicating no duplicate ISBNs.
    - The most frequent book in the dataset is "Selected Poems" with 27 occurrences.
    - The most frequent book author is "Agatha Christie" with 632 occurrences.
    - The most common year of publication is 2002, with 13,903 books published in that year.
    - The dataset includes books from various publishers, with the most frequent publisher being "Harlequin" with 7,535 occurrences.

These observations provide an initial understanding of the 'books' dataset and its characteristics. Further analysis and processing can be performed based on this information to enhance the book recommendation system.
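
Since Year-Of-Publication is stored as an object column, it would need converting before any numeric analysis. The sketch below shows one possible approach with `pd.to_numeric`; the toy values are invented for illustration, and treating 0 as "unknown" is an assumption rather than a documented rule of the dataset.

```python
import pandas as pd

# Toy column mixing valid years with a stray non-numeric entry (illustrative)
years = pd.Series(["2002", "1999", "not a year", "0"], name="Year-Of-Publication")

# errors="coerce" turns anything unparseable into NaN instead of raising
years_numeric = pd.to_numeric(years, errors="coerce")

# Treat 0 as "unknown" as well; this cut-off is an assumption, not a dataset rule
years_clean = years_numeric.where(years_numeric > 0)
print(years_clean.tolist())
```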


### calling on the function get_info_shape_stats on the Users

In [9]:
"""calling on the function get_info_shape_stats"""

get_info_shape_stats(users, 'Users')

The Dataset: Users
has 278858 rows and 3 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
None
---------------------------
----------------------------
            User-ID            Age
count  278858.00000  168096.000000
mean   139429.50000      34.751434
std     80499.51502      14.428097
min         1.00000       0.000000
25%     69715.25000      24.000000
50%    139429.50000      32.000000
75%    209143.75000      44.000000
max    278858.00000     244.000000


The 'users' dataset contains a total of 278,858 rows and 3 columns. Here are some key observations about the dataset:

- The dataset consists of the following columns:
    - User-ID: An anonymized unique identifier for the users.
    - Location: The location of the users.
    - Age: The age of the users.

- The dataset has some missing values in the 'Age' column, as indicated by the difference between the 'Non-Null Count' and the total number of rows.

- Data types:
    - The 'User-ID' column is of integer type.
    - The 'Location' column is of object type (string).
    - The 'Age' column is of float type.

- Descriptive Statistics:
    - The dataset has 278,858 unique User-ID values, indicating no duplicate User-IDs.
    - The minimum age in the dataset is 0, which points to invalid age entries.
    - The maximum age in the dataset is 244, which is implausible and almost certainly an erroneous value.
    - The average of the 168,096 recorded ages is approximately 34.75, with a standard deviation of 14.43.
    - The age values range from 0 to 244, with the middle 50% of recorded ages falling between 24 and 44.

These observations provide an initial understanding of the 'users' dataset and its characteristics. Further analysis and processing can be performed based on this information to enhance the book recommendation system.
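
If the Age column were kept, implausible values such as 0 and 244 would need handling. The following is a minimal sketch of one option, replacing out-of-range ages with NaN; the `MIN_AGE` and `MAX_AGE` bounds are assumptions chosen for illustration, and the toy values are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages, including the extreme values reported above (0 and 244)
ages = pd.Series([0.0, 17.0, 32.0, np.nan, 244.0], name="Age")

# Plausible range is an assumption chosen for illustration
MIN_AGE, MAX_AGE = 5, 100

# Keep ages inside the range, replace everything else with NaN
ages_clean = ages.where(ages.between(MIN_AGE, MAX_AGE))
print(ages_clean.tolist())
```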


## creating a function to check the data types on the datasets

In [10]:
def data_types(data, dataset_name):

    """A simple function to check the data types on the datasets """

    print("Dataset:",dataset_name, "has",len( data.select_dtypes(include='number').columns),
                "Numeric columns")
    
    print("and", len(data.select_dtypes(include='object').columns),
          "Categorical columns")

    print('*****************************************************')
    print('*****************************************************')

    print('Numerical Columns:', data.select_dtypes(include='number').columns)
    print('Categorical Coulumns:', data.select_dtypes(include='object').columns)

### Calling the data_types function to get an overview of the data types in the 'users' dataset

In [11]:
""" calling on the data_types function """

data_types(users, 'Users') 

Dataset: Users has 2 Numeric columns
and 1 Categorical columns
*****************************************************
*****************************************************
Numerical Columns: Index(['User-ID', 'Age'], dtype='object')
Categorical Coulumns: Index(['Location'], dtype='object')


The 'users' dataset has 2 numerical columns and 1 categorical column. Here's a breakdown of the columns:

- Numerical Columns: 'User-ID' and 'Age'
    - The 'User-ID' column is of type 'int64'.
    - The 'Age' column is of type 'float64'.

- Categorical Column: 'Location'
    - The 'Location' column is of type 'object'.

Note that the `dtype='object'` shown in the output above describes the Index that holds the column names, not the columns themselves.
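
The difference between a column's dtype and the dtype of the Index that lists column names can be checked directly. This is a small sketch on a hand-made frame that mirrors the column names of `users`; the `toy_users` frame and its values are illustrative.

```python
import pandas as pd

# Small frame with the same column names as `users` (values illustrative)
toy_users = pd.DataFrame({
    "User-ID": [1, 2],
    "Location": ["nyc, new york, usa", "stockton, california, usa"],
    "Age": [None, 18.0],
})

# Per-column dtypes of the data itself
print(toy_users.dtypes)

# Names of the numeric columns; the Index holding these names has its own dtype
numeric_cols = toy_users.select_dtypes(include="number").columns
print(list(numeric_cols))
```

`DataFrame.dtypes` is therefore the more direct way to read off each column's type.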


### Calling the data_types function to get an overview of the data types in the 'books' dataset

In [12]:
""" calling on the data_types function """

data_types(books, 'Books')

Dataset: Books has 0 Numeric columns
and 8 Categorical columns
*****************************************************
*****************************************************
Numerical Columns: Index([], dtype='object')
Categorical Coulumns: Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')


The 'books' dataset has no numerical columns and 8 categorical columns. Here's a breakdown of the columns:

- Numerical Columns: none

- Categorical Columns: 'ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', and 'Image-URL-L'
    - All eight columns are of type 'object'.

### Calling the data_types function to get an overview of the data types in the 'book_ratings' dataset

In [13]:
""" calling on the data_types function """

data_types(book_ratings, 'Book Ratings')

Dataset: Book Ratings has 2 Numeric columns
and 1 Categorical columns
*****************************************************
*****************************************************
Numerical Columns: Index(['User-ID', 'Book-Rating'], dtype='object')
Categorical Coulumns: Index(['ISBN'], dtype='object')


The 'book_ratings' dataset has 2 numerical columns and 1 categorical column. Here's a breakdown of the columns:

- Numerical Columns: 'User-ID', 'Book-Rating'
    - The 'User-ID' and 'Book-Rating' columns are of type 'int64'.

- Categorical Column: 'ISBN'
    - The 'ISBN' column is of type 'object'.

# 3 DATA PREPARATION

## creating a function to check the dataset for duplicated rows

By checking for duplicates in the dataset, we can identify and handle any redundant or repeated data, ensuring the integrity and accuracy of our analysis.

In [14]:
def check_duplicates(data):

    """Function that checks whether any rows of our dataset are duplicated"""

    # Count duplicates locally: a module-level list would carry results over between calls
    n_duplicates = data.duplicated().sum()
    if n_duplicates == 0:
        print('The Dataset has No Duplicates')

    else:
        duplicates_percentage = np.round((n_duplicates / len(data)) * 100, 2)
        print(f'Duplicated rows constitute {duplicates_percentage} % of our dataset')
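
`DataFrame.duplicated()` only flags rows that repeat an earlier row in every column. For ratings data it can also be useful to check whether the same user rated the same book more than once, which the `subset` argument allows. The sketch below is illustrative only; the toy frame is made up and this check is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy ratings: user 1 rated book "A" twice with different scores (illustrative)
toy = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "A", "B"],
    "Book-Rating": [7, 9, 5],
})

# No row is an exact copy of another...
full_row_dupes = toy.duplicated().sum()

# ...but the same (user, book) pair appears more than once
pair_dupes = toy.duplicated(subset=["User-ID", "ISBN"]).sum()
print(full_row_dupes, pair_dupes)
```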

### Checking for Duplicates in book_ratings Dataset

In [None]:
check_duplicates(book_ratings) # checking for duplicates in book_ratings

The output of the function indicates that the dataset has no duplicates. This means that each row in the book_ratings dataset is unique and there are no repeated entries.

### Checking for Duplicates in books Dataset

In [16]:
check_duplicates(books) # checking for duplicates in books

The Dataset has No Duplicates


The output of the function indicates that the dataset has no duplicates. This means that each row in the books dataset is unique and there are no repeated entries.

### Checking for Duplicates in users Dataset

In [5]:
check_duplicates(users) # checking for duplicates in users

The output of the function indicates that the dataset has no duplicates. This means that each row in the users dataset is unique and there are no repeated entries.

## Checking for Missing Values

Identifying and addressing missing values lets us verify the completeness and reliability of the dataset and avoid biased or misleading results. Depending on how much is missing, the affected rows or columns can be imputed or dropped. This safeguards the integrity and accuracy of subsequent analysis and modeling.

### creating a function for checking null values in percentage in relation to length of the dataset

In [18]:
def missing_values(data):

    """ Function for checking null values as a share of the length of the dataset """

    if not data.isnull().any().any():

        print("There Are No Missing Values")

    else:

        # Number of nulls per column, largest first
        missing_count = data.isnull().sum().sort_values(ascending=False)

        # Share of rows that are null, as a fraction between 0 and 1 (not multiplied by 100)
        missing_val_percent = ((data.isnull().sum()/len(data)).sort_values(ascending=False))

        missing_df = pd.DataFrame({'Missing Values': missing_count, 'Percentage %': missing_val_percent})

        return missing_df[missing_df['Percentage %'] > 0]

### Checking Missing Values in Book Ratings dataset

In [19]:
missing_values(book_ratings) # checking for missing values in book ratings

There Are No Missing Values


By confirming that there are no missing values in the Book Ratings dataset, we can proceed with confidence knowing that all the required data is available for analysis

### Checking Missing Values in the Books dataset

In [20]:
missing_values(books) # checking for missing values in books

| | Missing Values | Percentage % |
|---|---|---|
| Image-URL-L | 3 | 1.1e-05 |
| Publisher | 2 | 7e-06 |
| Book-Author | 1 | 4e-06 |


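The function has identified three columns with missing values: Image-URL-L, Publisher, and Book-Author. Despite its name, the 'Percentage %' column holds fractions of the total row count, so each of these amounts to well below 0.01% of the dataset.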

- Image-URL-L: There are 3 missing values in the Image-URL-L column, which represents the URL of the large-sized book image.
- Publisher: There are 2 missing values in the Publisher column, which represents the publisher of the book.
- Book-Author: There is 1 missing value in the Book-Author column, which represents the author of the book.

### Checking Missing Values in Users dataset

In [21]:
missing_values(users) # checking for missing values in users

| | Missing Values | Percentage % |
|---|---|---|
| Age | 110762 | 0.397199 |


The function has identified one column with missing values: Age.

- Age: There are 110,762 missing values in the Age column, which represents the age of the users. The reported fraction of 0.397 corresponds to roughly 39.7% of all users.
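
The share of missing values can also be expressed as a true percentage in one step. The snippet below is a minimal sketch on invented data, using `isnull().mean()` (the fraction of nulls per column) multiplied by 100; the `toy` frame and `missing_pct` name are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: 2 of 5 ages are missing (illustrative values)
toy = pd.DataFrame({
    "User-ID": [1, 2, 3, 4, 5],
    "Age": [18.0, np.nan, 17.0, np.nan, 61.0],
})

# isnull().mean() gives the fraction missing per column; x100 gives a percentage
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct[missing_pct > 0])
```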

# 4 DATA CLEANING

## 4.1 Dropping Columns with Missing Values

Dropping a column is useful when it has a significant number of missing values or when it is not relevant to the analysis or modeling task at hand. Here, roughly 39.7% of the Age values are missing, so the column is dropped.

### creating a function to drop columns with missing values

In [22]:
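def dropping_columns(data, columns):

    """A simple function to drop the given columns in place and return the DataFrame"""

    # drop(..., inplace=True) returns None, so return the modified frame explicitly
    data.drop(columns=columns, inplace = True)

    return data

# Pass the column names, not a DataFrame slice
columns_to_drop = ['Age']

users = dropping_columns(users, columns_to_drop)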

### creating a function to remove the rows of columns that have missing values

In [23]:
def drop_rows(data, columns):

    """A simple function to remove rows that have missing values in the given columns """

    # dropna(..., inplace=True) returns None, so return the modified frame explicitly
    data.dropna(subset=columns, inplace=True)
    return data

col = ['Image-URL-L', 'Publisher', 'Book-Author']
books = drop_rows(books, col)
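
One pandas detail worth keeping in mind for helpers like these: methods called with `inplace=True` modify the frame and return `None`, while the default form returns a new frame. The sketch below demonstrates both behaviours on a made-up `toy_books` frame; the names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing author and one missing publisher (illustrative)
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Book-Author": ["Author 1", np.nan, "Author 3"],
    "Publisher": ["Pub 1", "Pub 2", np.nan],
})

# inplace=True mutates the frame it is called on and returns None
result = toy_books.copy().dropna(subset=["Book-Author"], inplace=True)

# Without inplace, dropna returns a new frame and leaves the original untouched
cleaned = toy_books.dropna(subset=["Book-Author", "Publisher"])
print(result, len(toy_books), len(cleaned))
```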

## creating a function to merge the datasets based on a given column

In [24]:
def merge_dataframe(data_0, data_1, merge_column):
    """A function to merge the datasets based on a given column"""
    new_df = data_0.merge(data_1, on=merge_column)
    return new_df

df_rating = merge_dataframe(users, book_ratings, "User-ID")
df_rating

| | User-ID | Location | ISBN | Book-Rating |
|---|---|---|---|---|
| 0 | 2 | stockton, california, usa | 0195153448 | 0 |
| 1 | 7 | washington, dc, usa | 034542252 | 0 |
| 2 | 8 | timmins, ontario, canada | 0002005018 | 5 |
| 3 | 8 | timmins, ontario, canada | 0060973129 | 0 |
| 4 | 8 | timmins, ontario, canada | 0374157065 | 0 |
| ... | ... | ... | ... | ... |
| 1149775 | 278854 | portland, oregon, usa | 0425163393 | 7 |
| 1149776 | 278854 | portland, oregon, usa | 0515087122 | 0 |
| 1149777 | 278854 | portland, oregon, usa | 0553275739 | 6 |
| 1149778 | 278854 | portland, oregon, usa | 0553578596 | 0 |


The function is called to merge the users and book_ratings datasets on the common column "User-ID", and the result is stored in the variable df_rating. Because `merge` defaults to an inner join, only User-IDs present in both datasets are kept.
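
To see which rows an inner join would discard, the same merge can be run as an outer join with `indicator=True`, which adds a `_merge` column recording where each row came from. This is a minimal sketch on invented frames; `toy_users`, `toy_ratings`, and `audit` are illustrative names and not part of the notebook.

```python
import pandas as pd

toy_users = pd.DataFrame({"User-ID": [1, 2, 3], "Location": ["a", "b", "c"]})
toy_ratings = pd.DataFrame({
    "User-ID": [2, 2, 4],
    "ISBN": ["X", "Y", "Z"],
    "Book-Rating": [0, 7, 5],
})

# Default merge is an inner join: only User-IDs present in both frames survive
inner = toy_users.merge(toy_ratings, on="User-ID")

# An outer join with indicator=True shows where each row came from
audit = toy_users.merge(toy_ratings, on="User-ID", how="outer", indicator=True)
print(len(inner))
print(audit["_merge"].value_counts())
```

Rows marked `left_only` are users without ratings and rows marked `right_only` are ratings without a matching user; both groups disappear in an inner join.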

### confirming that there are no missing values

In [25]:
missing_values(df_rating) # checking for missing values

There Are No Missing Values


### confirming that there are no duplicates

In [26]:
check_duplicates(df_rating) # checking for duplicates

The Dataset has No Duplicates


### getting information and stats for our 'Merged' dataset

In [27]:
get_info_shape_stats(df_rating, 'Merged DataFrame') # checking the dataset info

The Dataset: Merged DataFrame
has 1149780 rows and 4 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149780 entries, 0 to 1149779
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   Location     1149780 non-null  object
 2   ISBN         1149780 non-null  object
 3   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 43.9+ MB
None
---------------------------
----------------------------
            User-ID   Book-Rating
count  1.149780e+06  1.149780e+06
mean   1.403864e+05  2.866950e+00
std    8.056228e+04  3.854184e+00
min    2.000000e+00  0.000000e+00
25%    7.034500e+04  0.000000e+00
50%    1.410100e+05  0.000000e+00
75%    2.110280e+05  7.000000e+00
max    2.788540e+05  1.000000e+01


The merged dataset, named "Merged DataFrame," contains a total of 1,149,780 rows and 4 columns. It combines the information from the "users" and "book_ratings" datasets based on the "User-ID" column.

The dataset comprises two numerical columns, "User-ID" and "Book-Rating," and two categorical columns, "Location" and "ISBN." The "User-ID" represents the unique identifier of the user, "Location" denotes the location of the user, "ISBN" refers to the unique identifier of the book, and "Book-Rating" represents the rating given by the user for a particular book.

The dataset does not have any missing values, as indicated by the non-null counts for all columns. The descriptive statistics reveal that the average book rating is approximately 2.87, with a standard deviation of 3.85. The ratings range from 0 to 10, with a significant portion of the ratings being 0.

This merged dataset provides valuable information for further analysis and insights into the book ratings given by users and their corresponding locations.

## merging the new dataset with the book dataset

In [28]:
""" merging the new dataset with the book dataset """
df_books = merge_dataframe(books, df_rating, 'ISBN')
df_books.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L | User-ID | Location | Book-Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | 2 | stockton, california, usa | 0 |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 8 | timmins, ontario, canada | 5 |
| 2 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 11400 | ottawa, ontario, canada | 0 |
| 3 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 11676 | n/a, n/a, n/a | 8 |
| 4 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | 41385 | sudbury, ontario, canada | 0 |


### getting information and stats for our 'Combined' dataset

In [29]:
get_info_shape_stats(df_books, "Combined Dataset") # check merged dataset info

The Dataset: Combined Dataset
has 1031129 rows and 11 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031129 entries, 0 to 1031128
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   ISBN                 1031129 non-null  object
 1   Book-Title           1031129 non-null  object
 2   Book-Author          1031129 non-null  object
 3   Year-Of-Publication  1031129 non-null  object
 4   Publisher            1031129 non-null  object
 5   Image-URL-S          1031129 non-null  object
 6   Image-URL-M          1031129 non-null  object
 7   Image-URL-L          1031129 non-null  object
 8   User-ID              1031129 non-null  int64 
 9   Location             1031129 non-null  object
 10  Book-Rating          1031129 non-null  int64 
dtypes: int64(2), object(9)
memory usage: 94.4+ MB
None
---------------------------
----------------------

The combined dataset, named "Combined Dataset," contains a total of 1,031,129 rows and 11 columns. It merges the "books" dataset with the previously merged "df_rating" dataset on the "ISBN" column. The row count is lower than the 1,149,780 rows of "df_rating," which suggests that ratings whose ISBN has no match in "books" were dropped by the merge.

The dataset comprises two numerical columns, "User-ID" and "Book-Rating," and nine columns stored as objects (strings): "ISBN," "Book-Title," "Book-Author," "Year-Of-Publication," "Publisher," "Image-URL-S," "Image-URL-M," "Image-URL-L," and "Location." Note that "Year-Of-Publication" is stored as an object rather than a number, so it would need converting before any numerical analysis. These columns provide information about the book's ISBN, title, author, publication year, publisher, and image URLs, as well as the user's ID and location.

Similar to the previous merged dataset, the combined dataset does not have any missing values, as indicated by the non-null counts for all columns. The descriptive statistics show that the average book rating is approximately 2.84, with a standard deviation of 3.85. The ratings range from 0 to 10, with a significant portion of the ratings being 0.

This combined dataset provides a comprehensive collection of information about books, including their details, user ratings, and user locations. It can be used for various analyses of user behavior and as the basis for book recommendations.
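The drop in row count between the two merges can be audited on a small scale. The following is a minimal sketch on toy data, assuming that `merge_dataframe` performs an inner join (its definition is not shown in this section); the names `toy_books`, `toy_ratings`, `inner` and `audit` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the books and ratings tables (values are illustrative only)
toy_books = pd.DataFrame({
    "ISBN": ["A1", "B2"],
    "Book-Title": ["Book A", "Book B"],
})
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A1", "B2", "ZZ"],   # "ZZ" has no entry in toy_books
    "Book-Rating": [5, 0, 8],
})

# An inner join keeps only ratings whose ISBN exists in both tables
inner = toy_books.merge(toy_ratings, on="ISBN", how="inner")

# A left join from the ratings side with indicator=True flags unmatched rows
audit = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(toy_ratings), "ratings ->", len(inner), "rows after the inner join")
print("Unmatched ISBNs:", unmatched["ISBN"].tolist())
```

Running the same audit on the real tables would show which ISBNs in the ratings data are missing from the books data.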

### confirming that there are no missing values

In [30]:

missing_values(df_books) # check for missing values

There Are No Missing Values


### confirming that there are no duplicates

In [31]:
check_duplicates(df_books) # check for duplicates

The Dataset has No Duplicates


# 5. EDA: EXPLORATORY DATA ANALYSIS

# 6. MODELLING

## MODEL 1: Popularity Based Recommendation System

A Popularity Based Recommendation System is a simple and straightforward approach to recommend items based on their overall popularity. It suggests popular items to all users, regardless of their individual preferences. This system relies on the assumption that popular items are more likely to be of interest to a larger audience.

However, one limitation of a Popularity Based Recommendation System is that it does not consider individual user preferences or specific interests. It treats all users equally and recommends the same popular items to everyone. Therefore, it may not provide personalized recommendations that align with the unique tastes and preferences of each user.

Despite its simplicity, a Popularity Based Recommendation System can be effective in situations where personalized data is limited or when the goal is to promote popular and trending items to a broad user audience.

## creating a function that calculates the popularity of values in a specific column of a dataframe

In [32]:
def calculate_popularity(df, column_name):

    """Calculates the popularity of values in a specific column of a dataframe"""

    popularity_df = pd.DataFrame(df[column_name].value_counts())
    return popularity_df

popularity_df = calculate_popularity(df_books, 'Book-Title')
popularity_df.head(20)

Unnamed: 0,Book-Title
Wild Animus,2502
The Lovely Bones: A Novel,1295
The Da Vinci Code,898
A Painted House,838
The Nanny Diaries: A Novel,828
Bridget Jones's Diary,815
The Secret Life of Bees,774
Divine Secrets of the Ya-Ya Sisterhood: A Novel,740
The Red Tent (Bestselling Backlist),723
Angels &amp; Demons,670


The resulting dataframe, popularity_df, displays the top 20 book titles and their corresponding popularity counts. The popularity is based on the frequency of each book title occurrence in the dataset. This information can be valuable for building recommendation systems or understanding the distribution of popularity among different items.

From the list of top books, it can be observed that certain titles occur far more often than others: "Wild Animus" appears 2,502 times, almost twice as often as "The Lovely Bones: A Novel" (1,295) and nearly three times as often as "The Da Vinci Code" (898). Note that these counts measure how often a title was rated, not how highly it was rated; since a large share of the ratings are 0, a high count does not by itself imply a positive reception. These frequently rated books can still serve as a starting point for further analysis and recommendation strategies in the context of a book recommendation system.
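Because `value_counts` counts every row, including the many 0 ratings, it can help to look at rating counts and mean ratings together. The sketch below is one possible way to do that on toy data; it assumes that a rating of 0 marks an implicit interaction rather than a score (an assumption about the data, not something established above), and the helper `popularity_with_mean` and its `min_count` parameter are hypothetical.

```python
import pandas as pd

def popularity_with_mean(df, min_count=2):
    """Rank titles by mean explicit rating, keeping titles with enough ratings.

    Assumption: a Book-Rating of 0 is an implicit interaction, not a score.
    """
    # Keep only explicit ratings
    explicit = df[df["Book-Rating"] > 0]

    # Number of explicit ratings and their mean, per title
    stats = (
        explicit.groupby("Book-Title")["Book-Rating"]
        .agg(num_ratings="count", mean_rating="mean")
        .reset_index()
    )

    # Drop titles with too few ratings, then sort by mean rating
    stats = stats[stats["num_ratings"] >= min_count]
    return stats.sort_values("mean_rating", ascending=False).reset_index(drop=True)

toy = pd.DataFrame({
    "Book-Title": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [0, 0, 4, 9, 7, 10],
})
ranked = popularity_with_mean(toy, min_count=2)
print(ranked)
```

In the toy data, title "A" appears most often but has only one explicit rating, so it is excluded once a minimum count is required.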

## creating a function that filters a dataframe to include only users who have actively rated more than a specified threshold.

In [33]:
def filter_active_users(dataframe, threshold):

    """Filter the dataframe to include only users who have rated more than `threshold` books"""
    
    # Count the number of ratings per unique User-ID
    user_counts = dataframe['User-ID'].value_counts()

    # Keep the User-IDs whose count exceeds the threshold
    # (named active_mask to avoid shadowing the built-in `filter`)
    active_mask = user_counts > threshold
    active_users = active_mask[active_mask].index

    # Select only the rows whose User-ID belongs to an active user
    filtered_df = dataframe[dataframe['User-ID'].isin(active_users)]

    return filtered_df

df_filtered = filter_active_users(df_books, 300)
df_filtered.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,"n/a, n/a, n/a",8
6,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,85526,"victoria, british columbia, canada",0
10,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,177458,"ottawa, ontario, canada",0
21,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,110912,"milpitas, california, usa",10
26,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,197659,"indiana, pennsylvania, usa",9


The dataset df_filtered is a subset of the original dataframe df_books, filtered to include only active users who have rated more than 300 times. This filtering process helps focus the analysis or recommendation strategies on users who have actively engaged with the dataset, providing a more targeted dataset for further exploration. The filtered dataset can now be used to gain insights into the preferences and behaviors of highly active users or to develop personalized recommendation systems based on their extensive ratings.

## Creating a function that calculates the number of times each book has been rated in a given dataframe

In [34]:
def calculate_rating_count(dataframe):

    """A Simple Function to Calculate the Number of Times each book has been rated"""

    # Group the dataframe by 'Book-Title' and count the occurrences of 'Book-Rating' for each title
    rating_count = dataframe.groupby('Book-Title')['Book-Rating'].count().reset_index()

    # Rename the 'Book-Rating' column to 'rating_count'
    rating_count.rename(columns={'Book-Rating': 'rating_count'}, inplace=True)

    # Merge the original dataframe with the 'rating_count' dataframe based on 'Book-Title'
    new_df = dataframe.merge(rating_count, on='Book-Title')

    # Return the merged dataframe
    return new_df

new_book_df = calculate_rating_count(df_filtered)
new_book_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating,rating_count
0,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,"n/a, n/a, n/a",8,3
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,85526,"victoria, british columbia, canada",0,3
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,177458,"ottawa, ontario, canada",0,3
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,110912,"milpitas, california, usa",10,2
4,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,197659,"indiana, pennsylvania, usa",9,2


This function counts how many times each book title has been rated and adds the result as a new column, 'rating_count', to every row of that title. Because it is applied to df_filtered, the counts reflect ratings by active users only. The merged dataframe, new_book_df, can be used to analyze the rating frequency of books, which is valuable for building recommendation systems or understanding user preferences.

## Creating a function to filter a dataframe based on a minimum rating count.

In [35]:
def filter_rating_count(dataframe, threshold):
    
    """A Simple Function to Filter the dataframe based on a minimum rating count"""

    # Keep only the rows whose 'rating_count' is at least the threshold
    filtered_df = dataframe.loc[dataframe['rating_count'] >= threshold, :]

    # Return the filtered dataframe
    return filtered_df

rating_more_50 = filter_rating_count(new_book_df, 50)
rating_more_50.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating,rating_count
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,11676,"n/a, n/a, n/a",9,88
6,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,36836,"raleigh, north carolina, usa",0,88
7,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,46398,"san antonio, texas, usa",9,88
8,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113270,"evanston, illinois, usa",0,88
9,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113519,"pleasanton, california, usa",0,88


The resulting rating_more_50 dataframe contains the filtered rows that have a rating count of 50 or more. It retains the same columns as the original dataframe, including 'ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'User-ID', 'Location', 'Book-Rating', and 'rating_count'.

## Previewing the dataset containing User-ID and Book-Title for books with multiple ratings.

The book_user_id_df dataframe is created by selecting the 'User-ID' and 'Book-Title' columns from the rating_more_50 dataframe. This subset allows us to check whether the same user appears more than once for the same title.

Each row pairs a User-ID with the Book-Title that user rated. Many different users rating the same title is expected; a repeated (User-ID, Book-Title) pair, however, indicates that a user rated the same title more than once, for example under different ISBNs (editions) of the book.

In [36]:
book_user_id_df = rating_more_50[['User-ID', 'Book-Title']]
book_user_id_df

Unnamed: 0,User-ID,Book-Title
5,11676,The Kitchen God's Wife
6,36836,The Kitchen God's Wife
7,46398,The Kitchen God's Wife
8,113270,The Kitchen God's Wife
9,113519,The Kitchen God's Wife
...,...,...
171955,235105,M Is for Malice
171956,242824,M Is for Malice
171957,254899,M Is for Malice
171958,258534,M Is for Malice


The preview shows "The Kitchen God's Wife" and "M Is for Malice" each rated by several different users. Repeated (User-ID, Book-Title) pairs cannot be spotted in a preview of this size, so they are checked for explicitly in the next step.

## Checking for Duplicate Rows in the Dataset

In [37]:
check_duplicates(book_user_id_df)

Duplicated rows constitute of 4.2 % of our dataset


The analysis reveals that duplicated rows make up approximately 4.2% of the dataset. Duplicate rows can occur when a user has rated the same title more than once, for instance across different editions (ISBNs) of a book, or because of data entry errors. Identifying and handling these duplicates is important so that each user contributes at most one rating per title.
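How repeated (User-ID, Book-Title) pairs can arise, and what `drop_duplicates` does with them, can be illustrated on toy data. This is a sketch only: the way `check_duplicates` computes its percentage is not shown in this section, so the calculation below is an assumption, and the ISBNs and ratings are invented.

```python
import pandas as pd

# Toy data: user 1 rated two editions (two ISBNs) of the same title
toy = pd.DataFrame({
    "ISBN": ["111", "222", "111"],
    "Book-Title": ["Same Title", "Same Title", "Same Title"],
    "User-ID": [1, 1, 2],
    "Book-Rating": [8, 0, 5],
})

# Share of rows that repeat an earlier (user, title) pair, in percent
pairs = toy[["User-ID", "Book-Title"]]
dup_share = pairs.duplicated().mean() * 100
print(f"Duplicated (user, title) pairs: {dup_share:.1f} %")

# drop_duplicates keeps the first occurrence of each (user, title) pair
deduped = toy.drop_duplicates(subset=["User-ID", "Book-Title"])
print(deduped)
```

Note that keeping the first occurrence decides which rating survives; in the toy data, user 1's rating of 8 is kept and the 0 for the other edition is discarded.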

## creating the final dataframe by removing duplicate (User-ID, Book-Title) pairs

In [38]:
final_df = rating_more_50.drop_duplicates(subset=['User-ID', 'Book-Title'])
final_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating,rating_count
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,11676,"n/a, n/a, n/a",9,88
6,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,36836,"raleigh, north carolina, usa",0,88
7,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,46398,"san antonio, texas, usa",9,88
8,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113270,"evanston, illinois, usa",0,88
9,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113519,"pleasanton, california, usa",0,88


### getting information and stats for our 'Final DataFrame' dataset

In [39]:
get_info_shape_stats(final_df, 'Final DataFrame')

The Dataset: Final DataFrame
has 34365 rows and 12 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34365 entries, 5 to 171959
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ISBN                 34365 non-null  object
 1   Book-Title           34365 non-null  object
 2   Book-Author          34365 non-null  object
 3   Year-Of-Publication  34365 non-null  object
 4   Publisher            34365 non-null  object
 5   Image-URL-S          34365 non-null  object
 6   Image-URL-M          34365 non-null  object
 7   Image-URL-L          34365 non-null  object
 8   User-ID              34365 non-null  int64 
 9   Location             34365 non-null  object
 10  Book-Rating          34365 non-null  int64 
 11  rating_count         34365 non-null  int64 
dtypes: int64(3), object(9)
memory usage: 3.4+ MB
None
---------------------------
------

The final DataFrame consists of 34,365 rows and 12 columns. Here is a summary of the dataset:

- The DataFrame contains both numerical and categorical columns.
- The columns include 'ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'User-ID', 'Location', 'Book-Rating', and 'rating_count'.
- The 'User-ID' column represents the ID of the user who rated the book.
- The 'Book-Rating' column contains the ratings given by users for the books.
- The 'rating_count' column indicates the number of times each book has been rated.

Here are some statistical insights about the dataset:

- The average book rating is approximately 1.82.
- The average rating count for books is around 84.93.
- The minimum rating count is 50, indicating that only books with at least 50 ratings are included in the dataset.
- The maximum rating count is 223, suggesting that some books have received a high number of ratings.


## MODEL 2: Collaborative Filtering Recommendation System

Collaborative filtering is a method of making automatic predictions (i.e. filtering) about the interests of a user by collecting preference or taste information from many users in aggregate (i.e. collaborating). There are two main approaches to collaborative filtering:

- Item-Item CF: "Users who liked this item also liked..."
- User-Item CF: "Users who are similar to you also liked..."

The model-based collaborative filtering approach builds machine learning models to predict users' ratings. These models typically rely on dimensionality reduction: the high-dimensional user-item matrix, most of whose entries are missing, is approximated by much smaller matrices in a lower-dimensional space. The goal of this section is to compare the SVD and SVDpp algorithms, tune their parameters and explore the results. Let's start by preparing our dataset for modelling.
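To make the item-item idea concrete, the sketch below computes cosine similarities between the columns of a tiny user-item matrix. It is an illustration of the memory-based approach only, not the method used in the rest of this section; the matrix values and the helper `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np

# Rows are users, columns are books; 0 means "not rated" in this toy matrix
ratings = np.array([
    [8, 7, 0],
    [9, 8, 1],
    [0, 2, 9],
], dtype=float)

def cosine_similarity_matrix(matrix):
    """Cosine similarity between the columns (items) of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unrated items
    normalized = matrix / norms
    return normalized.T @ normalized

item_sim = cosine_similarity_matrix(ratings)

# Most similar other book to book 0: mask the diagonal, then take the argmax
sim_to_first = item_sim[0].copy()
sim_to_first[0] = -np.inf
print("Book most similar to book 0:", int(np.argmax(sim_to_first)))
```

Books 0 and 1 are rated highly by the same two users, so they come out as the most similar pair; this is the "users who liked this item also liked..." pattern.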

## renaming the columns used for modelling

In [40]:
# renaming the columns used for modelling; rename returns a new dataframe,
# which avoids modifying a slice of rating_more_50 in place

final_df = final_df.rename(columns = {'User-ID':'user_id' ,'ISBN':'isbn' ,'Book-Rating':'book_rating'})

## Filtering out least active users

In [41]:
""" Filtering out least active users """

user_ratings_threshold = 3

filter_users = final_df['user_id'].value_counts()
filter_users_list = filter_users[filter_users >= user_ratings_threshold].index.to_list()

df_ratings_top = final_df[final_df['user_id'].isin(filter_users_list)]

print('Filter: users with at least %d ratings\nNumber of records: %d' % (user_ratings_threshold, len(df_ratings_top))) 

Filter: users with at least 3 ratings
Number of records: 34361


## getting the Top 10% Most Frequently Rated Books

In [42]:
book_ratings_threshold_perc = 0.1
book_ratings_threshold = len(df_ratings_top['isbn'].unique()) * book_ratings_threshold_perc

filter_books_list = df_ratings_top['isbn'].value_counts().head(int(book_ratings_threshold)).index.to_list()
df_ratings_top = df_ratings_top[df_ratings_top['isbn'].isin(filter_books_list)]

print('Filter: Top %d%% Most Frequently Rated Books\nNumber of records: %d' % (book_ratings_threshold_perc*100, len(df_ratings_top)))

Filter: Top 10% Most Frequently Rated Books
Number of records: 12626


## MODEL 3: SVD (Singular Value Decomposition) Recommendation System

SVD (Singular Value Decomposition) is a popular recommendation system technique used to provide personalized recommendations to users. It is based on the mathematical concept of matrix factorization.

In SVD, the user-item interaction matrix is decomposed into three separate matrices: a user matrix, an item matrix, and a diagonal matrix. This decomposition allows us to represent users and items in a lower-dimensional space, capturing the underlying relationships and patterns in the data.

By reducing the dimensionality of the user-item matrix, SVD can effectively identify latent factors that contribute to user preferences and item characteristics. These latent factors can be used to generate recommendations by predicting missing ratings or estimating the likelihood of a user's preference for a particular item. SVD-based recommendation systems have been successful in various domains, including e-commerce, movie recommendations, and content streaming platforms.

SVD recommendation systems offer several advantages. They can handle sparse data efficiently, making them suitable for large-scale datasets. The learned factors can sometimes be inspected to understand the reasons behind recommendations, although they are not guaranteed to correspond to human-readable concepts. However, SVD-based systems may face challenges when dealing with new or cold-start users or items, as the model requires sufficient historical data for accurate predictions. Overall, SVD is a powerful technique for building personalized recommendation systems that can enhance user experiences and drive engagement.
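The decomposition into three matrices and the effect of keeping only a few latent factors can be sketched with NumPy on a small dense matrix. This illustrates the classical SVD only; the `SVD` class from the Surprise library used below learns its factors by stochastic gradient descent on the observed ratings rather than by this exact decomposition. The matrix `R` and the choice `k = 2` are illustrative assumptions.

```python
import numpy as np

# Small dense toy matrix (users x books); real rating matrices are sparse
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Full decomposition: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k largest singular values (latent factors)
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("Singular values:", np.round(s, 3))
print("Rank-2 reconstruction error:", round(float(np.linalg.norm(R - R_k)), 3))
```

The reconstruction error of the rank-2 approximation equals the discarded third singular value, which is why dropping small singular values loses little information.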

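## Creating a function to read our data into a Surprise Dataset format, instantiate model and perform cross validation

In [43]:
def evaluate_model(df, model_class, rating_scale=(1, 10), cv=3):

    """ A function to read our data into a Surprise Dataset format, instantiate model and perform cross validation"""

    # NOTE: the default scale starts at 1 although the data contains ratings of 0;
    # pass rating_scale=(0, 10) to cover the full observed range
    reader = Reader(rating_scale=rating_scale)
    data = Dataset.load_from_df(df[['user_id', 'isbn', 'book_rating']], reader)
    
    # Instantiate the model with default hyperparameters and cross-validate it
    model = model_class()
    cv_results = cross_validate(model, data, cv=cv)

    # Average each metric (RMSE, MAE, fit time, test time) over the folds
    cv_results_df = pd.DataFrame(cv_results).mean()
    
    return cv_results_df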

## SVD results

In [44]:
df = df_ratings_top.copy()
svd_results = evaluate_model(df, SVD)
print("SVD Results:")
print(svd_results)

SVD Results:
test_rmse    3.403562
test_mae     2.717575
fit_time     1.300712
test_time    0.133902
dtype: float64


The SVD (Singular Value Decomposition) recommendation system model was evaluated using cross-validation. The results indicate the following average metrics across the cross-validation folds:

- The average Root Mean Squared Error (RMSE) is 3.403562. RMSE measures the difference between the predicted ratings and the actual ratings. Lower RMSE values indicate better accuracy of the model in predicting ratings.

- The average Mean Absolute Error (MAE) is 2.717575. MAE represents the absolute difference between the predicted and actual ratings. Lower MAE values indicate better accuracy in predicting ratings.

- The average fit time is 1.300712 seconds. Fit time refers to the time taken to train the model on the dataset.

- The average test time is 0.133902 seconds. Test time represents the time taken to predict ratings for the test data.

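These metrics describe the accuracy and efficiency of the SVD recommendation system model. On a rating scale that runs from 0 to 10, an RMSE of about 3.40 and an MAE of about 2.72 are large errors, so the default model leaves considerable room for improvement. One likely contributor is the large share of 0 ratings in the data, which fall outside the (1, 10) scale passed to the reader. The short fit and test times show that the model is cheap to train and to query.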
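For reference, both error metrics can be computed by hand on a handful of predictions. The values in `actual` and `predicted` below are invented for illustration and are not taken from the model above.

```python
import math

# Illustrative ratings and predictions (not model output)
actual = [8, 0, 10, 5]
predicted = [6.5, 2.0, 9.0, 5.0]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE: square the errors, average them, take the square root
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE and is never smaller than MAE on the same predictions.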

## incorporating SVDpp (Singular Value Decomposition with Implicit Feedback)

SVD++ (Singular Value Decomposition with Implicit Feedback) is an advanced algorithm used in collaborative filtering-based recommendation systems. Collaborative filtering is a popular approach to recommend items to users based on their similarity to other users or items. SVD++ takes this approach further by incorporating both explicit and implicit feedback from users.

Explicit feedback refers to the direct ratings or reviews given by users for items, indicating their preferences. Implicit feedback, on the other hand, includes indirect indicators of user preferences, such as purchase history, browsing behavior, or click-through rates. In the Surprise implementation used here, the implicit signal is simply the set of items a user has rated, regardless of the rating value. By considering implicit feedback, SVD++ can capture additional information about user preferences that might not be explicitly stated.

The algorithm utilizes the concept of singular value decomposition, which breaks down the user-item rating matrix into lower-dimensional representations. SVD++ extends this decomposition by incorporating a model that captures the influence of implicit feedback. It takes into account factors such as the observed ratings, user biases, and item biases to improve the accuracy of recommendations.

By combining both explicit and implicit feedback, SVD++ addresses the limitations of traditional collaborative filtering methods that solely rely on explicit ratings. It can provide more personalized and accurate recommendations by considering the underlying patterns and interactions between users and items. SVD++ has demonstrated its effectiveness in various recommendation scenarios and has become a popular choice for building recommendation systems in both research and industry domains.
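The difference between the two prediction rules can be sketched numerically. The code below follows the formulas documented for Surprise's `SVD` and `SVDpp` classes: the SVD prediction is the global mean plus user and item biases plus the dot product of the factor vectors, and SVD++ adds a normalized sum of implicit item factors to the user vector. All numeric values are random or invented for illustration; nothing here is a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 3

mu = 2.9                                # global mean rating (illustrative value)
b_u, b_i = 0.4, -0.2                    # user and item biases (illustrative)
p_u = rng.normal(size=n_factors)        # user factor vector
q_i = rng.normal(size=n_factors)        # item factor vector
y = rng.normal(size=(4, n_factors))     # implicit factors y_j of 4 books the user rated

# SVD: mean + biases + interaction between user and item factors
svd_pred = mu + b_u + b_i + q_i @ p_u

# SVD++: the user vector is shifted by the normalized sum of implicit factors
implicit_term = y.sum(axis=0) / np.sqrt(len(y))
svdpp_pred = mu + b_u + b_i + q_i @ (p_u + implicit_term)

print(f"SVD prediction:   {svd_pred:.3f}")
print(f"SVD++ prediction: {svdpp_pred:.3f}")
```

The extra factor matrix `y` is what makes SVD++ slower to train: every rated item contributes additional parameters that are updated for each rating.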

In [45]:
svdpp_results = evaluate_model(df, SVDpp)
print("SVDpp Results:")
print(svdpp_results)

SVDpp Results:
test_rmse     3.615515
test_mae      2.842385
fit_time     10.743599
test_time     0.467458
dtype: float64


The results obtained from SVD++ show a slightly higher level of prediction errors compared to the previous SVD model. The test_rmse value of 3.615515 indicates that SVD++ has slightly larger deviations from the actual ratings. Similarly, the test_mae value of 2.842385 suggests that the average prediction errors of SVD++ are slightly higher than those of SVD. These findings imply that SVD++ may not perform as well in accurately predicting user ratings compared to SVD.

In terms of computational performance, SVD++ has a much higher fit_time of 10.743599 seconds, roughly eight times the 1.300712 seconds needed by SVD. Its test_time of 0.467458 seconds is about 3.5 times SVD's 0.133902 seconds, but remains well under a second per fold. On this dataset, SVD++ is therefore both slower and less accurate than SVD.

## Optimizing SVD Model

### Hyperparameter tuning the SVD model to minimise the RMSE

In [46]:
df = df_ratings_top.copy()
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'isbn', 'book_rating']], reader)

param_grid = {
    'n_factors': [10, 100, 500],
    'n_epochs': [5, 20, 50], 
    'lr_all': [0.001, 0.005, 0.02],
    'reg_all': [0.005, 0.02, 0.1]}

gs_model = GridSearchCV(
    algo_class = SVD,
    param_grid = param_grid,
    n_jobs = -1,
    joblib_verbose = 5)

gs_model.fit(data)

# Retrieve the SVD instance configured with the parameters that minimise the root mean squared error
# (it still has to be fitted on a trainset before it can make predictions)

best_SVD = gs_model.best_estimator['rmse']
print("Tuned SVD Model RMSE", gs_model.best_score['rmse'])
print("Best Parameters", gs_model.best_params['rmse'])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    9.8s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   32.7s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  1.7min


Tuned SVD Model RMSE 3.301179728274886
Best Parameters {'n_factors': 10, 'n_epochs': 50, 'lr_all': 0.001, 'reg_all': 0.1}


[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed: 11.9min finished


After performing the grid search, the best SVD model was obtained with the following parameters: n_factors = 10, n_epochs = 50, lr_all = 0.001, and reg_all = 0.1. The tuned SVD model achieved an RMSE of about 3.30, compared with 3.40 for the default SVD and 3.62 for the default SVD++. The improvement over the default SVD is modest (roughly 0.10), and the figures are not strictly like-for-like: the grid search log shows 405 fits for 81 parameter combinations, i.e. 5 folds per combination, whereas evaluate_model used 3 folds.

The comparison suggests that hyperparameter tuning found a parameter set with a lower RMSE than the defaults. The best configuration combines few factors, many epochs, a low learning rate and the strongest regularization in the grid. Since n_epochs = 50 and reg_all = 0.1 sit at the upper edge of the searched ranges, and n_factors = 10 and lr_all = 0.001 at the lower edge, extending the grid in those directions may reduce the error further. Among the models evaluated here, the tuned SVD is the preferred choice for the recommendation system.
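The size of the search explains the runtime reported in the log. The sketch below enumerates the parameter grid with the standard library and reproduces the count of 405 fits; the value `n_folds = 5` is an assumption based on the default used when no `cv` argument is passed to the grid search.

```python
from itertools import product

# Same grid as in the search above
param_grid = {
    "n_factors": [10, 100, 500],
    "n_epochs": [5, 20, 50],
    "lr_all": [0.001, 0.005, 0.02],
    "reg_all": [0.005, 0.02, 0.1],
}

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))

n_folds = 5  # assumption: default number of folds when cv is not passed
total_fits = len(combinations) * n_folds

print(len(combinations), "combinations x", n_folds, "folds =", total_fits, "fits")
```

Each extra value added to any of the four lists multiplies the number of fits, so widening the grid should be weighed against the roughly 12 minutes the current search already takes.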

# 7. MODEL DEPLOYMENT

# 8. RECOMMENDATIONS

# 9. CONCLUSIONS