# <span style= "color:cyan"> BUILDING A RECOMMENDATION SYSTEM </SPAN>

![bookreading.jpg](attachment:bookreading.jpg)

## BUSINESS OVERVIEW

In the era of exponential data growth, the emergence of more sophisticated systems leveraging big data has become increasingly prevalent. Among these systems, recommendation systems have proven to be valuable information filtering tools, enhancing search results by providing users with more relevant items based on their search queries or browsing history. Major technology companies have embraced recommendation systems across various applications: YouTube utilizes them to determine the next autoplay video, while Spotify employs them to curate personalized "Made for You" daily mixes.

In line with this project's objectives, we aim to harness the power of data analysis to recommend the best books to users. By examining user behaviors, both individual and collective, we can derive insights that enable us to deliver tailored book recommendations that align with their interests and preferences.

The underlying principle of this project is to leverage data-driven techniques to understand user preferences and behaviors. By analyzing user interactions, historical data, and patterns, we can uncover valuable insights that inform our recommendation system. This allows us to present users with a curated list of book suggestions that are highly likely to resonate with their tastes.

## PROBLEM STATEMENT

During the last few decades, with the rise of YouTube, Amazon, Netflix, and many other such web services, recommender systems have taken more and more place in our lives. From e-commerce (suggest to buyers articles that could interest them) to online advertisement (suggest to users the right contents, matching their preferences), recommender systems are today unavoidable in our daily online journeys. In a very general way, recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy, or anything else depending on industries). Recommender systems are really critical in some industries as they can generate a huge amount of income when they are efficient or also be a way to stand out significantly from competitors.

We have been therefore appointed Junior Data Scientists by Amazon so as to provide a book recommendation system that will help users. This will enhance customer engagement, improve sales and revenue, increase book discovery, personalize user experience and for their competitive advantage.

## BUSINESS OBJECTIVES

1. To develop a book recommendation system that provides personalized suggestions to users based on their individual preferences and reading history.
2. To analyze top book sales
3. To utilize the recommendation systems insights to identify peal hours of the day when users are mostly active, allowing for better optimization of online ad campaigns to target book enthusiast during most engaged periods.
4. To investigate the relationship between book unit prices and quantity demanded to determine if there are any significant correlations.
5. To monitor market trends, new book releases and emerging genres to update and refine the recommendation system, ensuring it remains relevant and up to date in the ever changing book landscape.

## DETERMINING PROJECT GOALS

* Develop a prediction model within the book recommendation system that can accurately forecast the likelihood of a specific user showing interest in a particular book, based on their historical data, preferences, and interactions.
* Implement a mechanism to handle new users joining the book recommendation system, providing them with initial recommendations that align with their interests and preferences. This will involve utilizing demographic information, user profiling, and collaborative filtering techniques to generate relevant book suggestions.
* Establish evaluation metrics to assess the performance of the recommendation system, such as precision, recall, and mean average precision.
* Create a function that will return top N recommendations for a user.
* Deploy and Implement a real-time recommendation feature that can adapt to users' changing preferences and provide up-to-date book suggestions. This involves continuously updating the recommendation model, incorporating new user interactions, and leveraging real-time data to deliver timely and relevant recommendations.

## PROJECT SUCCESS CRITERIA

1. Root Mean Squared Error of close to 0 to evaluate model efficiency.

## METHODS USED

* Descriptive Statistics
* Data Visualization
* Machine Learning

## DATA UNDERSTANDING

The Book-Crossing dataset comprises 3 files.

* Users: Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.
* Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.
* Ratings: Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.

Load Libraries

In [40]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, SVDpp
from surprise.prediction_algorithms import KNNWithMeans
from surprise.model_selection import GridSearchCV


from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import scipy
import math

from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt

from scipy.sparse import csr_matrix


from surprise.prediction_algorithms import knns

from surprise.similarities import cosine, msd, pearson
from surprise import accuracy
import joblib

#### <span style= "color:orange"> Loading the dataset </SPAN>

In [2]:
def read_data(path, error_bad_lines = False, encoding = 'latin-1', sep=';', on_bad_lines = 'skip'):

    "A simple function that reads the data"
    
    data = pd.read_csv(path, error_bad_lines = error_bad_lines, encoding = encoding, sep = sep)
    return data

book_ratings = read_data(r'C:\Users\user\Documents\Recommendation Systems\recommendation_system_project\BX-Book-Ratings.csv')
books = read_data(r'C:\Users\user\Documents\Recommendation Systems\recommendation_system_project\BX-Books.csv')
users = read_data(r'C:\Users\user\Documents\Recommendation Systems\recommendation_system_project\BX-Users.csv')

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


we have three datasets:
* `books`
* `users`
* `rating`

Let us explore them by viewing first five rows of each

In [3]:
""" calling on variable book_ratings to view the first 5 rows"""

book_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [4]:
""" calling on variable books to view the first five rows"""

books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
""" calling on variable users to view the first 5 rows"""

users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


#### <span style= "color:orange"> Preliminary Data understanding </SPAN>

In [6]:

def get_info_shape_stats(dataset, dataset_name):

    """A simple function to check the shape, info and descriptive statistics of the dataset"""
    
    print('The Dataset:', dataset_name )
    print(f"has {dataset.shape[0]} rows and {dataset.shape[1]} columns")
    print('---------------------------')
    print('---------------------------')
    print(dataset.info())
    print('---------------------------')
    print('----------------------------')
    print(dataset.describe())

In [7]:
"""calling on the function get_info_shape_stats"""

get_info_shape_stats(book_ratings, 'Book Ratings')

The Dataset: Book Ratings
has 1149780 rows and 3 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
None
---------------------------
----------------------------
            User-ID   Book-Rating
count  1.149780e+06  1.149780e+06
mean   1.403864e+05  2.866950e+00
std    8.056228e+04  3.854184e+00
min    2.000000e+00  0.000000e+00
25%    7.034500e+04  0.000000e+00
50%    1.410100e+05  0.000000e+00
75%    2.110280e+05  7.000000e+00
max    2.788540e+05  1.000000e+01


In [8]:
"""calling on the function get_info_shape_stats"""

get_info_shape_stats(books, 'Books')

The Dataset: Books
has 271360 rows and 8 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB
None
---------------------------
----------------------------
              ISBN      Book-Title      Book-Author  Year-Of-Publication  \
count       271360          271360           271359               271360   
unique      271360          24

* There are columns labelled None, with numerous null values, these will be analyzed during the data cleaning stage

In [9]:
"""calling on the function get_info_shape_stats"""

get_info_shape_stats(users, 'Users')

The Dataset: Users
has 278858 rows and 3 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
None
---------------------------
----------------------------
            User-ID            Age
count  278858.00000  168096.000000
mean   139429.50000      34.751434
std     80499.51502      14.428097
min         1.00000       0.000000
25%     69715.25000      24.000000
50%    139429.50000      32.000000
75%    209143.75000      44.000000
max    278858.00000     244.000000


In [10]:
def data_types(data, dataset_name):

    """A simple function to check the data types on th datasets """

    print("Dataset:",dataset_name, "has",len( data.select_dtypes(include='number').columns),
                "Numeric columns")
    
    print("and", len(data.select_dtypes(include='object').columns),
          "Categorical columns")

    print('*****************************************************')
    print('*****************************************************')

    print('Numerical Columns:', data.select_dtypes(include='number').columns)
    print('Categorical Coulumns:', data.select_dtypes(include='object').columns)

In [11]:
""" calling on the data_types function """

data_types(users, 'Users') 

Dataset: Users has 2 Numeric columns
and 1 Categorical columns
*****************************************************
*****************************************************
Numerical Columns: Index(['User-ID', 'Age'], dtype='object')
Categorical Coulumns: Index(['Location'], dtype='object')


In [12]:
""" calling on the data_types function """

data_types(books, 'Books')

Dataset: Books has 0 Numeric columns
and 8 Categorical columns
*****************************************************
*****************************************************
Numerical Columns: Index([], dtype='object')
Categorical Coulumns: Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')


In [13]:
""" calling on the data_types function """

data_types(book_ratings, 'Book Ratings')

Dataset: Book Ratings has 2 Numeric columns
and 1 Categorical columns
*****************************************************
*****************************************************
Numerical Columns: Index(['User-ID', 'Book-Rating'], dtype='object')
Categorical Coulumns: Index(['ISBN'], dtype='object')


#### <span style= "color:orange"> Data Cleaning </SPAN>

Duplicates

In [14]:
duplicates = []

def check_duplicates(data):

    """Function that iterates through the rows of our dataset to check whether they are duplicated or not"""
    
    for i in data.duplicated():
        duplicates.append(i)
    duplicates_set = set(duplicates)
    if(len(duplicates_set) == 1):
        print('The Dataset has No Duplicates')

    else:
        duplicates_percentage = np.round(((sum(duplicates)/len(data)) * 100 ), 2)
        print(f'Duplicated rows constitute of {duplicates_percentage} % of our dataset')

In [15]:
check_duplicates(book_ratings) # checking for duplicates in book_ratings

The Dataset has No Duplicates


In [16]:
check_duplicates(books) # checking for duplicates in books

The Dataset has No Duplicates


In [17]:
check_duplicates(users) # checking for duplicates in users

The Dataset has No Duplicates


Missing Values

In [18]:
def missing_values(data):

    """ Function for checking null values in percentage in relation to length of the dataset """

    if data.isnull().any().any() == False :

        print("There Are No Missing Values")

    else:

        missing_values = data.isnull().sum().sort_values(ascending=False)

        missing_val_percent = ((data.isnull().sum()/len(data)).sort_values(ascending=False))

        missing_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage %': missing_val_percent})

        return missing_df[missing_df['Percentage %'] > 0]

In [19]:
missing_values(book_ratings) # checking for missing values in book ratings

There Are No Missing Values


In [20]:
missing_values(books) # checking for missing values in books

Unnamed: 0,Missing Values,Percentage %
Image-URL-L,3,1.1e-05
Publisher,2,7e-06
Book-Author,1,4e-06


In [21]:
missing_values(users) # checking for missing values in users

Unnamed: 0,Missing Values,Percentage %
Age,110762,0.397199


In [22]:
def dropping_columns(data, columns):

    """A simple function to drop columns with missing values"""

    drop_column = data.drop(columns=columns, inplace = True)
    
    return drop_column

columns_to_drop = users[['Age']]

dropping_columns(users, columns_to_drop)

In [23]:
def drop_rows(data, columns):
    
    """A simple function to remove the rows of columns that have missing values """
    
    new_data = data.dropna(subset=columns, inplace=True)
    return new_data

col = ['Image-URL-L', 'Publisher', 'Book-Author']
drop_rows(books, col)

#### <span style= "color:orange"> Feature Selection and EDA </SPAN>

In [24]:
def merge_dataframe(data_0, data_1, merge_column):
    """A function to merge the datasets based on a given column"""
    new_df = data_0.merge(data_1, on=merge_column)
    return new_df

df_rating = merge_dataframe(users, book_ratings, "User-ID")
df_rating

Unnamed: 0,User-ID,Location,ISBN,Book-Rating
0,2,"stockton, california, usa",0195153448,0
1,7,"washington, dc, usa",034542252,0
2,8,"timmins, ontario, canada",0002005018,5
3,8,"timmins, ontario, canada",0060973129,0
4,8,"timmins, ontario, canada",0374157065,0
...,...,...,...,...
1149775,278854,"portland, oregon, usa",0425163393,7
1149776,278854,"portland, oregon, usa",0515087122,0
1149777,278854,"portland, oregon, usa",0553275739,6
1149778,278854,"portland, oregon, usa",0553578596,0


In [25]:
missing_values(df_rating) # checking for missing values

There Are No Missing Values


In [26]:
check_duplicates(df_rating) # checking for duplicates

The Dataset has No Duplicates


In [27]:
get_info_shape_stats(df_rating, 'Merged DataFrame') # checking the dataset info

The Dataset: Merged DataFrame
has 1149780 rows and 4 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1149780 entries, 0 to 1149779
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   Location     1149780 non-null  object
 2   ISBN         1149780 non-null  object
 3   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 43.9+ MB
None
---------------------------
----------------------------
            User-ID   Book-Rating
count  1.149780e+06  1.149780e+06
mean   1.403864e+05  2.866950e+00
std    8.056228e+04  3.854184e+00
min    2.000000e+00  0.000000e+00
25%    7.034500e+04  0.000000e+00
50%    1.410100e+05  0.000000e+00
75%    2.110280e+05  7.000000e+00
max    2.788540e+05  1.000000e+01


In [28]:
""" merging the new dataset with the book dataset """
df_books = merge_dataframe(books, df_rating, 'ISBN')
df_books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,2,"stockton, california, usa",0
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,8,"timmins, ontario, canada",5
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11400,"ottawa, ontario, canada",0
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,"n/a, n/a, n/a",8
4,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,41385,"sudbury, ontario, canada",0


In [29]:
get_info_shape_stats(df_books, "Combined Dataset") # check merged dataset info

The Dataset: Combined Dataset
has 1031129 rows and 11 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031129 entries, 0 to 1031128
Data columns (total 11 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   ISBN                 1031129 non-null  object
 1   Book-Title           1031129 non-null  object
 2   Book-Author          1031129 non-null  object
 3   Year-Of-Publication  1031129 non-null  object
 4   Publisher            1031129 non-null  object
 5   Image-URL-S          1031129 non-null  object
 6   Image-URL-M          1031129 non-null  object
 7   Image-URL-L          1031129 non-null  object
 8   User-ID              1031129 non-null  int64 
 9   Location             1031129 non-null  object
 10  Book-Rating          1031129 non-null  int64 
dtypes: int64(2), object(9)
memory usage: 94.4+ MB
None
---------------------------
----------------------

In [30]:

missing_values(df_books) # check for missing values

There Are No Missing Values


In [31]:
check_duplicates(df_books) # check for duplicates

The Dataset has No Duplicates


## Popularity Based Recommendation System

In [32]:
def calculate_popularity(df, column_name):

    """Calculates the popularity of values in a specific column of a dataframe"""

    popularity_df = pd.DataFrame(df[column_name].value_counts())
    return popularity_df

popularity_df = calculate_popularity(df_books, 'Book-Title')
popularity_df.head(20)

Unnamed: 0,Book-Title
Wild Animus,2502
The Lovely Bones: A Novel,1295
The Da Vinci Code,898
A Painted House,838
The Nanny Diaries: A Novel,828
Bridget Jones's Diary,815
The Secret Life of Bees,774
Divine Secrets of the Ya-Ya Sisterhood: A Novel,740
The Red Tent (Bestselling Backlist),723
Angels &amp; Demons,670


In [33]:

def filter_active_users(dataframe, threshold):

    """Filter the dataframe to include only users who have actively rated more than the threshold"""
    
    # Filter the DataFrame based on the count of each unique User-ID
    user_counts = dataframe['User-ID'].value_counts()
    filter = user_counts > threshold

    # Get the index values of the filtered rows
    filtered_index = filter[filter].index

    # Create a new DataFrame by selecting only the rows where User-ID is in the filtered index
    filtered_df = dataframe[dataframe['User-ID'].isin(filtered_index)]

    return filtered_df

df_filtered = filter_active_users(df_books, 300)
df_filtered.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating
3,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,"n/a, n/a, n/a",8
6,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,85526,"victoria, british columbia, canada",0
10,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,177458,"ottawa, ontario, canada",0
21,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,110912,"milpitas, california, usa",10
26,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,197659,"indiana, pennsylvania, usa",9


In [34]:
def calculate_rating_count(dataframe):

    """A Simple Function to Calculate the Number of Times each book has been rated"""

    # Group the dataframe by 'Book-Title' and count the occurrences of 'Book-Rating' for each title
    rating_count = dataframe.groupby('Book-Title')['Book-Rating'].count().reset_index()

    # Rename the 'Book-Rating' column to 'rating_count'
    rating_count.rename(columns={'Book-Rating': 'rating_count'}, inplace=True)

    # Merge the original dataframe with the 'rating_count' dataframe based on 'Book-Title'
    new_df = dataframe.merge(rating_count, on='Book-Title')

    # Display the first few rows of the merged dataframe
    return new_df

new_book_df = calculate_rating_count(df_filtered)
new_book_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating,rating_count
0,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,11676,"n/a, n/a, n/a",8,3
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,85526,"victoria, british columbia, canada",0,3
2,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,177458,"ottawa, ontario, canada",0,3
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,110912,"milpitas, california, usa",10,2
4,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,197659,"indiana, pennsylvania, usa",9,2


In [35]:
def filter_rating_count(dataframe, threshold):
    
    """A Simple Funtion to Filter the dataframe based on a minimum rating count"""

    # Apply the filter to the 'dataframe' using the 'loc' function
    filtered_df = dataframe.loc[dataframe['rating_count'] >= threshold, :]

    # Display the first few rows of the filtered dataframe
    return filtered_df

rating_more_50 = filter_rating_count(new_book_df, 50)
rating_more_50.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating,rating_count
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,11676,"n/a, n/a, n/a",9,88
6,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,36836,"raleigh, north carolina, usa",0,88
7,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,46398,"san antonio, texas, usa",9,88
8,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113270,"evanston, illinois, usa",0,88
9,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113519,"pleasanton, california, usa",0,88


If you preview the user ID and Book-Tittle you will get that a user has rated a book more than once. This can be brought about reading the book multiple times and having different different opinions on it. Let's preview the dataset that coontains the two columns 

In [36]:
book_user_id_df = rating_more_50[['User-ID', 'Book-Title']]
book_user_id_df

Unnamed: 0,User-ID,Book-Title
5,11676,The Kitchen God's Wife
6,36836,The Kitchen God's Wife
7,46398,The Kitchen God's Wife
8,113270,The Kitchen God's Wife
9,113519,The Kitchen God's Wife
...,...,...
171955,235105,M Is for Malice
171956,242824,M Is for Malice
171957,254899,M Is for Malice
171958,258534,M Is for Malice


In [37]:
check_duplicates(book_user_id_df)

Duplicated rows constitute of 4.2 % of our dataset


Let's go ahead and create the final dataframe and remove the duplicates in the two columns 

In [38]:
final_df = rating_more_50.drop_duplicates(subset=['User-ID', 'Book-Title'])
final_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Location,Book-Rating,rating_count
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,11676,"n/a, n/a, n/a",9,88
6,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,36836,"raleigh, north carolina, usa",0,88
7,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,46398,"san antonio, texas, usa",9,88
8,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113270,"evanston, illinois, usa",0,88
9,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,113519,"pleasanton, california, usa",0,88


In [39]:
get_info_shape_stats(final_df, 'Final DataFrame')

The Dataset: Final DataFrame
has 34365 rows and 12 columns
---------------------------
---------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34365 entries, 5 to 171959
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ISBN                 34365 non-null  object
 1   Book-Title           34365 non-null  object
 2   Book-Author          34365 non-null  object
 3   Year-Of-Publication  34365 non-null  object
 4   Publisher            34365 non-null  object
 5   Image-URL-S          34365 non-null  object
 6   Image-URL-M          34365 non-null  object
 7   Image-URL-L          34365 non-null  object
 8   User-ID              34365 non-null  int64 
 9   Location             34365 non-null  object
 10  Book-Rating          34365 non-null  int64 
 11  rating_count         34365 non-null  int64 
dtypes: int64(3), object(9)
memory usage: 3.4+ MB
None
---------------------------
------

# Modelling

## Memory Based Collaborative Filtering 

>> With memory/neighborhood-based collaborative filtering methods, we are attempting to quantify just how similar users and items are to one another and getting the top N recommendations based on that similarity metric. 

>> We will be using UserID, Book Title and Book Reating columns to make our predictions

## Nearest Neighbors using Cosine Similarity 
>> We will implement the brute force algorithm in the Nearest Neighbors class.
>> We will be using cosine similariry which determines how vectors are related to each other.

In [427]:
# creating a pivot table

book_pivot = final_df.pivot_table(columns='User-ID', index='Book-Title', values= 'Book-Rating')
book_pivot.fillna(0, inplace=True)
book_pivot.head()

User-ID,2276,3363,4385,6251,6543,6575,7158,7346,8681,8936,...,270713,271284,273979,274004,274061,274301,274308,275970,277427,278418
Book-Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1st to Die: A Novel,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Case of Need,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0


In [428]:
#creating a sparse matrix to fit into our model
book_sparse = csr_matrix(book_pivot)
book_sparse

<477x497 sparse matrix of type '<class 'numpy.float64'>'
	with 7912 stored elements in Compressed Sparse Row format>

In [429]:
# instantiating Nearest Neighbors 
nearest_neighbor_model = NearestNeighbors(metric='cosine', algorithm='brute')

#fitting the model to the sparse matrix
nearest_neighbor_model.fit(book_sparse)

In [430]:

def recommend_book(book_name):

    """  Definining a function to recommend books based on a given book name """

    # Find the index of the input book name in the book_pivot index array

    book_id = np.where(book_pivot.index == book_name)[0][0]
    
    # Compute the distances and suggestions using the nearest_neighbor_model

    distance, suggestion = nearest_neighbor_model.kneighbors(book_pivot.iloc[237,:].values.reshape(1, -1), n_neighbors=6)

    # Iterate over each suggestion

    for i in range(len(suggestion)):

        # Retrieve the book names from the book_pivot index based on the suggestions

        books = book_pivot.index[suggestion[i]]

        # Print the recommended book names

        for j in books:

            print(j)

# Specify the book name for which recommendations will be made

book_name = 'Wuthering Heights'

# Call the recommend_book function with the specified book name

recommend_book(book_name)


Pigs in Heaven
The Poisonwood Bible
The Shipping News : A Novel
A Thousand Acres (Ballantine Reader's Circle)
Disclosure
Big Stone Gap: A Novel (Ballantine Reader's Circle)


## KNNWithMeans

>> This algorithm leverages neighborhood information, handles user or item biases through rating adjustment, and aims to provide accurate predictions for rating prediction tasks in collaborative filtering.

>> We wil be using pearson similarity matrix

In [None]:
# Extract user IDs, ISBNs, and ratings from the final_df DataFrame
user_ids = final_df['User-ID'].values
isbns = final_df['ISBN'].values
ratings = final_df['Book-Rating'].values

# Create a new DataFrame to store the extracted data
data = pd.DataFrame({
    'User-ID': user_ids,
    'ISBN': isbns,
    'Book-Rating': ratings
})

# Create a Reader object and specify the rating scale
reader = Reader(rating_scale=(1, 10))

# Load the dataset from the DataFrame using the Reader object
data = Dataset.load_from_df(data, reader)

# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)


In [432]:
# Specify similarity metrics and user-based approach for KNNWithMeans
sim_pearson = {"name": "pearson", "user_based": True}

# Instantiate KNNWithMeans with the specified similarity options
knn_means = knns.KNNWithMeans(sim_options=sim_pearson)

# Train the KNNWithMeans model on the training set
knn_means.fit(trainset)

# Make predictions on the test set
predictions = knn_means.test(testset)

# Calculate and print the Root Mean Squared Error (RMSE) using the predictions
print(accuracy.rmse(predictions))


Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 3.3810
3.3810217496991717


* Based on the RMSE, KNNWithMeans has a 3.38 chance of making an error while making making predictions.

* Let's attempt to lower the RMSE

>> Although memory based collaborative filtering setup is relatively simple to write, it doesn't scale very well at all, as it is all stored in memory!
Instead, we should try using a model-based (based on matrix factorization) recommendation algorithm. These are inherently more scalable and can deal  withhigher sparsity level than memory-based models, and are considered more powerful due to their ability to pick up on "latent factors" in the relationships between what sets of items users like. 

## Model Based Collaborative Filtering Recommender

>> Model based collaborative filtering approach involves building machine learning algorithms to predict user's ratings. They involve dimensionality reduction methods that reduce high dimensional matrix containing abundant number of missing values with a much smaller matrix in a lower-dimensional space.
The goal of this section is to compare SVD and SVDpp algorithms, try optimizing parameters and explore obtained results.Let's start by preparing our dataset for modelling

In [433]:
# creating a new dataframe that contains only the relevant columns 

final_df.rename(columns = {'User-ID':'user_id' ,'ISBN':'isbn' ,'Book-Rating':'book_rating'},inplace=True)

In [436]:
# Set the threshold for the minimum number of ratings per user
user_ratings_threshold = 3

# Count the number of ratings for each user in final_df
filter_users = final_df['user_id'].value_counts()

# Filter out users who have at least user_ratings_threshold ratings
filter_users_list = filter_users[filter_users >= user_ratings_threshold].index.to_list()

# Create a new DataFrame, df_ratings_top, by keeping only the records from final_df
# where the user_id is present in the filter_users_list
df_ratings_top = final_df[final_df['user_id'].isin(filter_users_list)]

# Print the information about the filtering
print('Filter: users with at least %d ratings\nNumber of records: %d' % (user_ratings_threshold, len(df_ratings_top)))


Filter: users with at least 3 ratings
Number of records: 34361


In [438]:
# Set the threshold percentage for the most frequently rated books
book_ratings_threshold_perc = 0.1

# Calculate the threshold value based on the unique number of books in df_ratings_top
book_ratings_threshold = len(df_ratings_top['isbn'].unique()) * book_ratings_threshold_perc

# Filter out books that are among the most frequently rated
filter_books_list = df_ratings_top['isbn'].value_counts().head(int(book_ratings_threshold)).index.to_list()

# Create a new DataFrame, df_ratings_top, by keeping only the records from df_ratings_top
# where the isbn is present in the filter_books_list
df_ratings_top = df_ratings_top[df_ratings_top['isbn'].isin(filter_books_list)]

# Print the information about the filtering
print('Filter: Top %d%% Most Frequently Rated Books\nNumber of records: %d' % (book_ratings_threshold_perc*100, len(df_ratings_top)))


Filter: Top 10% Most Frequently Rated Books
Number of records: 12626


# SVD (Singular Value Decomposition)

>> SVD is a widely used matrix decomposition method that works by reducing dimensionality of the user item matrix by extracting its latent factors and capturing underlying patterns. We will be using UserID, ISBN and Book Rating features to extract latent factors to make our recommendations

In [441]:
def evaluate_model(df, model_class, rating_scale=(1, 10), cv=3):
    """
    A function to read our data into a Surprise Dataset format, instantiate a model, and perform cross-validation.
    
    Args:
        df (DataFrame): Input DataFrame containing user_id, isbn, and book_rating columns.
        model_class (class): Class of the model to be instantiated.
        rating_scale (tuple, optional): Tuple specifying the rating scale range. Default is (1, 10).
        cv (int, optional): Number of cross-validation folds. Default is 3.
    
    Returns:
        cv_results_df (DataFrame): DataFrame containing the mean cross-validation results.
    """

    # Create a Reader object with the specified rating scale
    reader = Reader(rating_scale=rating_scale)

    # Load the dataset from the DataFrame using the Reader object
    data = Dataset.load_from_df(df[['user_id', 'isbn', 'book_rating']], reader)

    # Instantiate the model
    model = model_class()

    # Perform cross-validation using the model and the dataset
    cv_results = cross_validate(model, data, cv=cv)

    # Calculate the mean of the cross-validation results
    cv_results_df = pd.DataFrame(cv_results).mean()

    return cv_results_df

In [442]:
df = df_ratings_top.copy()
svd_results = evaluate_model(df, SVD)
print("SVD Results:")
print(svd_results)

SVD Results:
test_rmse    3.393899
test_mae     2.717881
fit_time     1.718030
test_time    0.100629
dtype: float64


# SVDpp

>> The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

In [444]:
svdpp_results = evaluate_model(df, SVDpp)
print("SVDpp Results:")
print(svdpp_results)

SVDpp Results:
test_rmse     3.647254
test_mae      2.859234
fit_time     11.269644
test_time     0.485947
dtype: float64


The test_RMSE for SVD is much more better. We will go ahead and do some hyperparameter tuning on the SVD model

## Optimizing SVD Model

In [445]:
df = df_ratings_top.copy()
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['user_id', 'isbn', 'book_rating']], reader)

param_grid = {
    'n_factors': [10, 100, 500],
    'n_epochs': [5, 20, 50], 
    'lr_all': [0.001, 0.005, 0.02],
    'reg_all': [0.005, 0.02, 0.1]}

gs_model = GridSearchCV(
    algo_class = SVD,
    param_grid = param_grid,
    n_jobs = -1,
    joblib_verbose = 5)

gs_model.fit(data)

 #Train the SVD model with the parameters that minimise the root mean squared error
 
best_SVD = gs_model.best_estimator['rmse']
print("Tuned SVD Model RMSE", gs_model.best_score['rmse'])
print("Best Paramers", gs_model.best_params['rmse'])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  2.2min


Tuned SVD Model RMSE 3.3072360455440504
Best Paramers {'n_factors': 10, 'n_epochs': 50, 'lr_all': 0.001, 'reg_all': 0.005}


[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed:  9.7min finished


Great ! We see a reduced RMSE, this is an indication of improved performance, therefore we will go ahead and settle on the tuned SVD model