# Book Crossing Recommendation System 
  
Author: Eleni Zarogianni. 
October 2019. 

Objective: to implement a Book Recommender system that utilizes some sort of collaborative filtering using the online-available Book-Crossing Data set (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

In [None]:
# import libraries
# for data manipulation
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import pylab
import seaborn as sns

plt.style.use('classic')
# note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

1. Load Data.

The readily available Book-Crossing data set is used here. It was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings.
BX-Users : Contains the users. User IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available.
BX-Books : Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large.
BX-Book-Ratings : Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
Let's jump straight into reading the CSV files as pandas DataFrames (Dfs).

In [None]:
# change working directory
from os import chdir
chdir('/Users/elenizarogianni/Desktop/EXUS_ML_Task')

# Load Data 
# I've loaded them from my workspace!
# Note: error_bad_lines=False skips malformed rows; pandas >= 1.3 replaces it with on_bad_lines='skip'.

users = pd.read_csv('BX-Users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
books = pd.read_csv('BX-Books.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")

2. Inspect and clean the data.

In general, data inspection and cleaning procedures include visually inspecting the data, through graphs and plots, and figuring out any inconsistencies or peculiarities in the data sets. At a first level these include duplicate entries, missing values and wrongly assigned data types, and at a second level outliers. We will explore and handle each of these aspects below.

Let's have a first glance at the data and check the Df's shape.

In [None]:
# print the shape of the data
print(users.shape)
print(books.shape)
print(ratings.shape)

That looks fine. 
On initial inspection, all 3 Dfs contain column names with a '-'. Hyphenated names cannot be used with attribute-style access (e.g. users.Age), so let's change them.

In [None]:
# remove the hyphens from the column names
users.columns = ['userID', 'Location', 'Age']
books.columns = ['ISBN', 'Title', 'Author', 'YearOfPublication', 'Publisher', 'ImageUrlS', 'ImageUrlM', 'imageUrlL']
# the rating column is named 'Rating', which is how it is accessed in the rest of the notebook
ratings.columns = ['userID', 'ISBN', 'Rating']
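
As an aside, the hyphens could also be removed programmatically rather than by retyping every name. The following is only a minimal sketch on an empty toy frame; `strip_hyphens` is a hypothetical helper, and note that it would produce `UserID` and `BookRating` rather than the hand-picked `userID` and `Rating` used in this notebook.

```python
import pandas as pd

def strip_hyphens(df):
    """Return a copy of df whose column names have their hyphens removed."""
    return df.rename(columns=lambda name: name.replace('-', ''))

# empty toy frame that only mimics the raw Book-Crossing header style
raw = pd.DataFrame(columns=['User-ID', 'ISBN', 'Book-Rating'])
renamed = strip_hyphens(raw)
print(list(renamed.columns))
```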

Also at first glance, we can already spot missing values (e.g. in the users.Age variable), but let's have a closer look and address each dataframe's idiosyncrasies separately.

i. USERS DataFrame

In [None]:
# Check 5 first entries
print(users.head(5))
# Get basic info first (info() prints its summary itself).
users.info()
# Describe numerical variables
print(users.describe())
# Describe categorical variables
print(users.describe(include=['O']))

The first 5 entries confirm that we have missing values in Age. The data types of the Users Df seem reasonable, so that's fine. 

In the description of the Df, we can spot a 'weird' min-max pair for Age. We'll keep that in mind. The userID variable seems fine.

The description of the non-numerical Location variable seems OK, but we can easily deduce that it would be more useful to split it into 3 separate variables (Town, State and Country), which are more informative for a recommendation system.

So let's get our hands on users.Location and users.Age variables.


In [None]:
# Location
# Data cleaning (e.g. missing values, duplicates, data type problems and other inconsistencies)

In [None]:
# check for missing values
print(users.Location.isnull().any())
# count distinct locations (fewer than the number of rows means repeated entries)
print(users.Location.nunique())

In [None]:
# split users.Location into 3 subparts (at most 2 splits, so extra commas stay in the last part)
location_expanded = users.Location.str.split(',', n=2, expand=True)
location_expanded.columns = ['Town', 'State', 'Country']
users = users.join(location_expanded)
# Drop the initial Location variable.
users.drop('Location', axis=1, inplace=True)


So, Location has no missing values and there are repeated entries (several users share a location), which is certainly OK. We've split it into 3 sub-parts as described and dropped the original variable.
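
To make the effect of the split concrete, here is a small sketch on made-up location strings. The extra whitespace stripping is an assumption added for illustration (the notebook keeps the leading spaces), and the sample values are not taken from the data set.

```python
import pandas as pd

sample = pd.DataFrame({'Location': ['nyc, new york, usa',
                                    'stockton, california, usa',
                                    'porto, v.n.gaia, portugal',
                                    'london, england, uk, europe']})
# n=2 caps the number of splits, so anything after the second comma stays in the last part
parts = sample.Location.str.split(',', n=2, expand=True)
parts.columns = ['Town', 'State', 'Country']
# strip the whitespace that follows each comma (not done in the notebook itself)
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

The last row shows why inconsistent Country values can appear later on: everything after the second comma ends up in Country.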

Now, let's go an extra mile here, by having a look at some descriptives for location and some plots.

In [None]:
# Again, check on missing values and duplicates.
print(users.Town.isnull().any())
print(users.State.isnull().any())
print(users.Country.isnull().any())

# How many unique towns, states and countries do I have?
nTowns = users.Town.nunique()
nStates = users.State.nunique()
nCountries = users.Country.nunique()

In [None]:
print("There are {} unique towns, {} unique states and {} unique countries.".format(nTowns, nStates, nCountries))
print("In comparison to unique countries, total number of user entries is {}".format(users.shape[0]))
# users.userID.nunique()

There are missing values for State and Country, which is problematic and needs to be dealt with. 
There are 32770 unique town entries and 1276 unique countries, in contrast to 278858 unique user entries (unique userIDs).

In [None]:
# How many missing states and how many missing countries?
print(users.State.isnull().sum())
print(users.Country.isnull().sum())

There is 1 missing value for State and there are 2 for the Country variable. 
Let's do barplots for each.

In [None]:
# State
# Count number of users per state
states = users.State.value_counts()
# Show the top 10 states based on their corresponding book users:
states.head(10).plot(kind='bar', stacked=True, title='Top 10 States/Provinces per Book Users', alpha=.70)

A closer visual inspection of the states Series revealed other inconsistencies too. For example, there are 'n/a' or '\n/a\"' instances. I've also spotted a '.' instance, so there may well be other placeholder values of this kind. I think the best solution is to throw all these instances into an 'other' bin.

In [None]:
# Replace any instance of n/a with 'other'
print(sum(users.State == ' n/a'))  # 12527
users['State'] = users['State'].replace(r'[\s]n/a', 'other', regex=True)

In [None]:
# Replace any instances of '.' with 'other'.
print(sum(users.State == ' .'))  # 15
users['State'] = users['State'].replace(r'[\s]\.', 'other', regex=True)

In [None]:
# replace empty-string instances
users['State'] = users['State'].replace('', 'other')
users['State'] = users['State'].replace(' ', 'other')
# OR, in one step: users['State'].replace(r'^\s*$', 'other', regex=True)
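
The three placeholder clean-ups above could be folded into a single pass. The sketch below is one possible way, assuming that only whole cells made of 'n/a', lone punctuation or blanks count as placeholders; `bin_placeholders` and the `PLACEHOLDER` pattern are hypothetical names, and the sample values are made up.

```python
import pandas as pd

# which values count as "unknown" is an illustrative assumption
PLACEHOLDER = r'^\s*(n/a|[.,?]*)\s*$'

def bin_placeholders(series, label='other'):
    """Replace whole-cell placeholders (n/a, lone punctuation, blanks, missing) with label."""
    cleaned = series.fillna('')
    is_placeholder = cleaned.str.match(PLACEHOLDER)
    # keep the original value wherever the cell is not a placeholder
    return series.where(~is_placeholder, label)

states_demo = pd.Series([' california', ' n/a', ' .', '', ' ', None, ' st. gallen'])
cleaned_demo = bin_placeholders(states_demo)
print(cleaned_demo.tolist())
```

Because the pattern is anchored at both ends, a value such as ' st. gallen' is left alone even though it contains a full stop.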

Finally, there are some interesting two-letter or three-letter acronyms, which I guessed might be US (or other) state or province abbreviations. A visit to https://www.fs.fed.us/database/feis/format.html confirmed my hunch for some of these, like 'ca', 'nh', 'mi', 'df' etc. Others, like 'zh', 'sp' or 'rm', did not correspond to any of these states/provinces.

I've downloaded the US/Canada province dictionary from here: http://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/ and saved it in a Python file called provinces_mapping.py.

In [None]:
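# Replace state/province acronyms with their full names.
from provinces_mapping import provinces
# lower-case dictionary key-value pairs to match ours
provinces = dict((k.lower(), v.lower()) for k, v in provinces.items())

# strip the leading space left over from the split, so that ' ca' matches the key 'ca'
state_clean = users['State'].str.strip()
# map the dictionary to the State column and assign the result back,
# keeping the original value wherever there is no mapping
users['State'] = state_clean.map(provinces).fillna(state_clean)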

In [None]:
# Missing Values
print(users.State.isnull().sum()) # 0
# Replace Null values with 'other'
users['State'] = users['State'].fillna('other')

# plot again to observe differences.
users.State.value_counts().head(10).plot(kind='bar', stacked=True, title='Top 10 States/Provinces per Book Users', alpha=.70)

users.Country

In [None]:
# Country
# Count number of users per country
countries = users.Country.value_counts()
# Show the top 10 countries according to their corresponding book users:
countries.head(10).plot(kind='bar', stacked=True, title='Top 10 Countries per Book Users', alpha=.70)

The USA ranks first, with over 130,000 users, followed by Canada with roughly a sixth of that number. Interestingly, we observe an 'empty-string country' in 9th position. 


In [None]:
# How many countries are empty strings?
print(users[users.Country == ''].Country.value_counts())
# Replace empty string with 'other'.
users['Country'] = users['Country'].replace('', 'other')

Upon closer inspection of the countries Df (series actually), we observe a bunch of inconsistencies with misplaced strings, such as: ',' ,'n/a', 'scotland/uk','uk,united kingdom', 'illinois, usa'

In [None]:
# Checking for other inconsistencies
# users.Country contains the following inconsistencies: ',', 'n/a', 'scotland/uk', 'uk,united kingdom', 'illinois, usa'

print(sum(users.Country == ' n/a'))  # 16
users['Country'] = users['Country'].replace(r'[\s]n/a', 'other', regex=True)
# Replace stray punctuation with 'other'
users['Country'] = users['Country'].replace(r'[\s][,.]', 'other', regex=True)

In [None]:
# Replace missing values
print(users.Country.isnull().sum())
users['Country'] = users['Country'].fillna('other')

Let's now visit the Town variable and check it in the same way.

In [None]:
# users.Town
# Count number of users per town
towns = users.Town.value_counts()
# Show the top 10 towns based on their corresponding book users:
towns.head(10).plot(kind='bar', stacked=True, title='Top 10 Towns per Book Users', alpha=.70)


In [None]:
print(users.Town.nunique())   # 32770
print(users.Town.isnull().sum()) # 0, which we remember from description above.

# any n/a values?
print(sum(users.Town == 'n/a'))
# Replace n/a with other
users['Town'] = users['Town'].replace(r'n/a', 'other', regex=True)
 
# Replace entries that consist only of punctuation marks with 'other'
# (the pattern is anchored so that names such as 'st. louis' are left untouched)
users['Town'] = users['Town'].replace(r'^\s*[,.?]+\s*$', 'other', regex=True)

users.userID

Let's now check the userID variable. At first glance there wasn't anything wrong with it, but let's double-check.

In [None]:
# users.userID
# Duplicate entries, missing entries, wrong data types of users.userID
print(users.userID.nunique())   # 278858
print(users.userID.isnull().sum()) # 0, which we remember from description above.
print(users.userID.dtype)

So, everything is OK with it. Let's now move on to our numerical variable, Age.

In [None]:
# users.Age
# Describe variable
print(users.Age.describe())
# how many unique entries?
print(users.Age.nunique())   # 165

# are there any null/NaN values?
print(users.Age.isnull().any())  # True
# how many of them?
print(users.Age.isnull().sum())  # 110762


There are some null/NaN values that we will deal with later. Let's first do a distribution plot or histogram to further examine the age ranges.

In [None]:
# Distribution plot (distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the newer equivalent)
sns.distplot(users.Age.dropna())

In [None]:
# impute NaN values with the median
medianAge = users.Age.median() 
users["Age"] = users["Age"].fillna(medianAge)

In [None]:
# outliers detection
ages = sorted(users.Age.unique())

# histogram using 10-length bins, from 0-250 years.
users.Age.hist(bins=range(0,250,10))
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

As we can see, the majority of users are between 20-40 years old, there are some over the age of 100, and some outliers falling on the edge of the graph, above 120-130 years (and up to 244), which is considered unreasonable.

In my opinion, the best way to deal with this is to first do a scatterplot to see how many Age values are misplaced, so that we can perhaps tell whether this is a systematic entry error or gain some other useful insight. Then we can replace these extreme values with the data's median (since the data are skewed, the median is better suited than the mean), or even with the median of a meaningful subset.




In [None]:
## scatterplot of users' ages.
plt.scatter(users.Age, users.index)
plt.show()

This is not very useful in the end. Let's narrow our investigation to people above the age of 110, which is a rather reasonable age for someone to stop reading(?).

In [None]:
# slice users.Age variable and keep only values>110
old_users = users.loc[(users.Age > 110), 'Age']
# how many are above 110?
print(old_users.shape)
# Create the index (1..n) from the actual number of rows instead of hard-coding it
index_ = list(range(1, len(old_users) + 1))
# reset the index 
old_users.index = index_ 

In [None]:
# scatterplot of users above the age of 110
plt.scatter(old_users, index_)
plt.show()
# histogram
plt.hist(old_users)
plt.show()

We can see that there are only 96 cases of people above 110 yrs. According to the histogram, just over 30 are between 110 and 120, and the rest are spread almost equally across the remaining bins. 

So, we had better draw the line at the age of 120 and replace the values above it with the data's median (which is the age of 32). This won't have a significant impact on the median, whereas if I chose to impute another subset's median (maybe an older slot), that would probably skew my results a wee bit.
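
A compact way to express this kind of rule is sketched below. It is a variation rather than the exact procedure of this notebook: here the median is computed only from plausible ages (at most `upper`), and missing values are filled in the same step. `impute_age` and the toy ages are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_age(age, upper=120):
    """Replace missing ages and ages above `upper` with the median of the plausible ones."""
    # where() keeps values that satisfy the condition; NaN and values > upper become NaN
    plausible = age.where(age <= upper)
    return plausible.fillna(plausible.median())

ages_demo = pd.Series([25, 32, np.nan, 41, 244, 18, 199])
imputed_demo = impute_age(ages_demo)
print(imputed_demo.tolist())
```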

In [None]:
### impute median for extreme values, above 120 yrs
# boolean mask of users above 120 yrs
over120 = users.Age > 120
# how many values will be replaced?
print(over120.sum())
# replace those values with the median.
users.loc[over120, 'Age'] = medianAge

# sns.distplot(users.Age)
# plt.hist(users.Age)

ii. BOOKS Dataframe.

Let's move to the Books dataframe now, and run initial descriptive analyses, as before.

In [None]:
# Books
books.info()
print(books.dtypes)

# check for missing values
print(books.isnull().any().any())

# drop ImageUrls
books.drop(['ImageUrlS', 'ImageUrlM', 'imageUrlL'], axis=1, inplace=True)


There are a total of 271360 book entries. All columns of the books Df are of 'object' type. That should be changed for numerical variables, such as YearOfPublication.

On first-level inspection, there are missing values. We'll check each variable separately later and resolve this. Also, after a bit of searching on the 'ImageUrl*' variables, there seems to be no additional value in keeping them at the moment, so we dropped them.


In [None]:
# Describe variables
print(books.describe())
# Check 5 first entries
print(books.head(5))

Upon dataframe's description, the variable Publisher seems to have missing entries. Also, first five entries are OK. Let's dive deeper into each variable.

In [None]:
# Book.ISBN
# check for missing values
print(books.ISBN.isnull().any())  # no missing value
# Unique entries? 
print(books.ISBN.nunique())  # check!


So, everything OK with our key-identifier. Let's move on.

In [None]:
# Books.Title
print(books.Title.isnull().any()) # False
# check for unique entries
print(books.Title.nunique()) # 242135

In [None]:
# plot top 10 book titles
books.Title.value_counts().head(10).plot(kind='bar', stacked=True, title='Top 10 Book titles', alpha=.70)

Beware that this plot ranks book titles by the number of ISBNs associated with them, not by their usage. So, the figure above tells us that a single book title may have more than one ISBN associated with it (which makes sense if we consider different editions or formats of the book etc.).

Let's take a closer look at the 'Little Women' book.

In [None]:
print(books[books.Title == 'Little Women'])

There are 24 different ISBNs for Little Women. Since no information is available about the formats or editions that account for the different ISBNs, it would be most reasonable to assign a single ISBN to each book title, because in the end we're interested in recommending a book, not a specific format or edition of it. I will, however, leave this unchanged for the time being and come back to it later if there's time.
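
For illustration only, assigning one representative ISBN per title could look like the sketch below. It assumes that identical titles always denote the same book, which is not guaranteed in the real data; the frame, the column `CanonicalISBN` and the ISBN strings are made up.

```python
import pandas as pd

# toy frame; the ISBN values are placeholders, not real identifiers
books_demo = pd.DataFrame({
    'ISBN':  ['isbn-1', 'isbn-2', 'isbn-3', 'isbn-4'],
    'Title': ['Little Women', 'Little Women', 'Little Women', 'Another Title'],
})
# use the first ISBN seen for each title as that title's representative
books_demo['CanonicalISBN'] = books_demo.groupby('Title')['ISBN'].transform('first')
print(books_demo)
```

Ratings could then be joined on the canonical ISBN so that all editions of a title contribute to the same item.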

Another solution that comes to mind would be to apply some sort of ML clustering of ISBNs, based on the book title. I will come back to it later.

(Apart from the different ISBNs, we can right away observe problematic areas in the 'Author' variable, where there's 'Louisa May Alcott', 'Louisa M. Alcott' and 'Alcott', and also '0' values in YearOfPublication. We will come to those later.)


In [None]:
# Looking for other inconsistencies
# not much (other than some special characters like '&'),
# I will just convert everything to lower case,
# in case there are multiple entries that differ only in capitalisation.
books.Title = books.Title.str.lower()

In [None]:
# Books.Author
# null entries
print(books.Author.isnull().any()) # False
# check for unique entries
print(books.Author.nunique()) # 102024

# plot top 10 book authors
books.Author.value_counts().head(10).plot(kind='bar', stacked=True, title='Top 10 Book Author', alpha=.70)

# make lower-case
books.Author = books.Author.str.lower()

Again, there are no issues with missing values here. We have 102024 unique authors (lower-casing them resulted in the same number of unique entries). The top ten authors by name are shown in the figure. No issue can be spotted in the figure, but as pointed out above, some entries appear with the full name, others with an abbreviated first name plus surname, and others with just the surname. We will try to fix this.

In [None]:
# addressing Author name inconsistencies. 

def unify_authors(books):
    """For every book title, assign the lengthiest author string found for that title."""
    books = books.copy()
    # loop over each unique book title
    for title in books.Title.unique():
        # rows that share this title
        same_title = books.Title == title
        authors = books.loc[same_title, 'Author'].dropna().unique()
        if len(authors) > 1:
            # keep the lengthiest author name and assign it to every row of this title
            books.loc[same_title, 'Author'] = max(authors, key=len)
    return books

# books = unify_authors(books)

#  !!! BEWARE 
# due to time restrictions, I did not apply this to my dataset. So the books.Author variable
# still contains these name/surname, surname etc. inconsistencies.
# I've left this part here however; it just takes a lot of time to finish,
# and I couldn't afford it.

Beware that this operation is quite time-consuming. Another solution would probably be a groupby operation, but when I tried it, it was too computationally intensive for my machine (old MacBook problems!).   
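
One possible cheaper alternative is to sort once by author-name length and keep the first row per title, which avoids a Python-level loop altogether. This is an untested-at-scale sketch, not something that was run on the full data set; `longest_author_per_title` and the toy frame are hypothetical.

```python
import pandas as pd

def longest_author_per_title(books):
    """Return a copy of books where every title carries its longest author string."""
    ranked = (books.dropna(subset=['Author'])
                   .assign(_length=lambda d: d['Author'].str.len())
                   .sort_values('_length', ascending=False, kind='mergesort'))
    # after sorting, the first row of each title holds its longest author name
    best = ranked.drop_duplicates('Title').set_index('Title')['Author']
    out = books.copy()
    # titles whose authors are all missing keep their original (missing) value
    out['Author'] = out['Title'].map(best).fillna(out['Author'])
    return out

books_demo = pd.DataFrame({
    'Title':  ['little women', 'little women', 'little women', 'emma'],
    'Author': ['alcott', 'louisa may alcott', 'louisa m. alcott', 'jane austen'],
})
unified_demo = longest_author_per_title(books_demo)
print(unified_demo)
```

Like the loop above, this treats "longest string" as a proxy for "most complete name", which is a heuristic and will be wrong when two different authors share a title.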


In [None]:
# Books.YearOfPublication
print(books.YearOfPublication.describe())

# check for unique entries
print(books.YearOfPublication.nunique()) # 202
# the column mixes numbers and strings, so sort by the string representation
un_YofPub = pd.Series(sorted(books.YearOfPublication.unique(), key=str)) 


We observe two non-numeric string values for the variable (publisher names). We will remove them and convert the pd.Series to 'int64'.

In [None]:
#  remove the string values (publisher names that slipped into this column)
books['YearOfPublication'] = books.YearOfPublication.replace(['DK Publishing Inc', 'Gallimard'], np.nan)

# transform to int64 dtype (missing values are temporarily encoded as 0)
books['YearOfPublication'] = books.YearOfPublication.fillna(0).astype('int64')

# check for nan/null values
print(books.YearOfPublication.isnull().sum()) # 0

# how many 0 entries are there?
zero_yr = (books.YearOfPublication == 0).sum()
# a year of 0 means "unknown", so replace it with NaN (this turns the column into float64)
books['YearOfPublication'] = books.YearOfPublication.replace(0, np.nan)

In [None]:
# plots for years of publication
minYear = int(books.YearOfPublication.min())
maxYear = 2037

# 10-year bins covering the whole range up to maxYear
books.YearOfPublication.hist(bins=range(minYear, maxYear + 10, 10))
plt.title('Year Distribution\n')
plt.xlabel('Year')
plt.ylabel('Number of books published')
plt.show()

There are some outliers before the year 1900 and after the year 2000. Let's also not forget that, according to the introduction, the Book-Crossing data set was compiled in 2004. There's a chance that some of the entries dated after 2004 are entry mistakes. Let's explore this a bit.

In [None]:
# Book entries dated after the data set was compiled (note that the filter keeps years above 2005).
books_over2004 = books[books.YearOfPublication > 2005]
print(books_over2004)

As we can see among the entries, there are some clear mistakes, such as Edgar Allen Poe Collected Poems and Alice's Adventures in Wonderland and Through the Looking Glass (Puffin Books). In total there are 26 entries.

I could cross-reference some of them and match them to the correct year of publication, but I will leave that for later, if there's time.

Let's move on for now.


In [None]:
# Books.Publisher 

print(books.Publisher.describe()) #  271358

# nan/null values
print(books.Publisher.isnull().sum()) #  2
# check for unique entries
print(books.Publisher.nunique()) # 16807
# lower-case the publisher names.
books.Publisher = books.Publisher.str.lower()
# check again for unique entries
print(books.Publisher.nunique()) # 16575


# no blank strings in Publisher
print(books[books.Publisher == ' '].Publisher.value_counts())

# plot top 10 publishers
books.Publisher.value_counts().head(10).plot(kind='bar', stacked=True, title='Top 10 Publishers', alpha=.70)

The procedure should be familiar to the reader by now. There are 2 null entries and 16807 unique Publisher names. After lower-casing the names, this number drops to 16575.
A barplot showing the top-10 publishers is also given.

In [None]:
#  Any other inconsistencies?
un_publishers = pd.Series(books.Publisher.unique())

Visually inspecting this variable doesn't reveal any eye-catching inconsistency. There is a chance, however, that the same publisher appears under slightly different names due to entry mistakes. This would be time-consuming to explore, given that there might be no available corpus or dictionary (as there was for the state/province abbreviations) to map the names onto.

Let's go to our final Dataframe.

iii. RATINGS Dataframe

In [None]:
# Ratings Dataframe
# Describe  variables
print(ratings.describe())
# Describe categorical variables
print(ratings.describe(include=['O']))

# Check datatypes
print(ratings.dtypes)

# Check 5 first entries
print(ratings.head(5))

# check for missing values
print(ratings.isnull().any().any())


In [None]:
# ratings.userID
# Duplicate entries, missing entries, wrong data types of ratings.userID
print(ratings.userID.nunique())   # 105283
print(ratings.userID.isnull().sum()) # 0
print(ratings.userID.dtype) # int64

Everything looks OK and we move on.

In [None]:
# ratings.ISBN
# check for missing values
print(ratings.ISBN.isnull().any()) # False
# Unique entries? 
print(ratings.ISBN.nunique()) # 340556

There are 340,556 unique ISBNs in a total of 1,149,780 ratings. 

In [None]:
# ratings.Rating
# Describe variable
print(ratings.Rating.describe())
# how many unique entries?
print(ratings.Rating.nunique())   # 11
print(sorted(ratings.Rating.unique()))  # 0 to 10

# null/NaN values?
print(ratings.Rating.isnull().any()) # False
print(ratings.Rating.isnull().sum()) # 0

3. Join Tables and Finalize DataSet

After cleaning each dataframe separately, we will merge the data together and get them ready for analysis.

In [None]:
# TABLE JOINS

# join users with ratings on userID
users_and_ratings = ratings.join(users.set_index('userID'), on='userID')
users_and_ratings.info()


All variables are joined and the sizes are reasonable.

In [None]:
# Double-Checks!

# users_and_ratings.userID
print(users_and_ratings.userID.isnull().sum())  # 0
print(users_and_ratings.userID.nunique())   # 105283
print(users_and_ratings.userID.dtype)  # int64

# users_and_ratings.ISBN
# check for missing values
print(users_and_ratings.ISBN.isnull().any()) # False
# Unique entries? 
print(users_and_ratings.ISBN.nunique()) # 340556

un_users_and_ratings_ISBN = pd.Series(sorted(users_and_ratings.ISBN.unique()))

# users_and_ratings.Rating
print(users_and_ratings.Rating.describe())
# how many unique entries?
print(users_and_ratings.Rating.nunique())   # 11
print(sorted(users_and_ratings.Rating.unique()))  # 0-10

# null/NaN values?
print(users_and_ratings.Rating.isnull().any()) # False
print(users_and_ratings.Rating.isnull().sum()) # 0


Everything seems OK at this point. Let's do some standard double-checks for Town, State and Country.

In [None]:
# users_and_ratings.Town
# check for missing values
print(users_and_ratings.Town.isnull().any()) # False 
# count unique entries
print(users_and_ratings.Town.nunique())  # 16720

# Count number of ratings per town
towns = users_and_ratings.Town.value_counts()

# How many towns are empty strings?
print(users_and_ratings[users_and_ratings.Town == ''].Town.value_counts())

# Replace empty strings with 'other'.
users_and_ratings['Town'] = users_and_ratings['Town'].replace('', 'other')

# any n/a values?
print(sum(users_and_ratings.Town == 'n/a'))

In [None]:
# users_and_ratings.State
# same checks
print(users_and_ratings.State.isnull().any()) # False
# count unique entries
print(users_and_ratings.State.nunique()) # 2508

# Count number of ratings per state
states = users_and_ratings.State.value_counts()

# How many states are blank strings?
print(users_and_ratings[users_and_ratings.State == ' '].State.value_counts())

# any n/a values?
print(sum(users_and_ratings.State == 'n/a'))


In [None]:
# users_and_ratings.Country
print(users_and_ratings.Country.isnull().any()) # False
# count unique entries
print(users_and_ratings.Country.nunique()) # 525

# Count number of ratings per country
countries = users_and_ratings.Country.value_counts()

# How many countries are blank strings?
print(users_and_ratings[users_and_ratings.Country == ' '].Country.value_counts())

# any n/a values?
print(sum(users_and_ratings.Country == ' n/a'))


Again, no problems here. We move on.

In [None]:
# users_and_ratings.Age
# null/NaN values?
print(users_and_ratings.Age.isnull().any()) # False
# how many of them?
print(users_and_ratings.Age.isnull().sum()) ## 0

All clear. Let's now join this table to the books one.

In [None]:
# join users_and_ratings with books on ISBN.
users_and_ratings_and_books = users_and_ratings.join(books.set_index('ISBN'), on='ISBN')
users_and_ratings_and_books.info()
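
Since the ratings table contains more distinct ISBNs than the books table, it can be worth checking how many ratings find no matching book after a join like this. The sketch below shows one way to do that with `merge` and its `indicator` option on toy frames; the frames and their values are made up.

```python
import pandas as pd

ratings_demo = pd.DataFrame({'userID': [1, 1, 2],
                             'ISBN': ['a', 'b', 'z'],
                             'Rating': [0, 8, 5]})
books_demo = pd.DataFrame({'ISBN': ['a', 'b', 'c'],
                           'Title': ['t1', 't2', 't3']})

# indicator=True adds a '_merge' column telling where each row came from
merged = ratings_demo.merge(books_demo, on='ISBN', how='left', indicator=True)
unmatched = int((merged['_merge'] == 'left_only').sum())
print("ratings without a matching book: {}".format(unmatched))
```

Rows flagged as `left_only` have missing Title, Author and Publisher values and would need to be dropped or handled before building a recommender on titles.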


In [None]:
# users_and_ratings_and_books.ISBN
# check for missing values
print(users_and_ratings_and_books.ISBN.isnull().any()) # False
# Unique entries? 
print(users_and_ratings_and_books.ISBN.nunique()) # 340556

# un_users_and_ratings_and_books_ISBN = pd.Series(sorted(users_and_ratings_and_books.ISBN.unique()))

In [None]:
# users_and_ratings_and_books.Title
# null entries
print(users_and_ratings_and_books.Title.isnull().any()) # False
# check for unique entries
print(users_and_ratings_and_books.Title.nunique()) # 237912

# plot top 10 book titles (by number of ratings in the joined table)
users_and_ratings_and_books.Title.value_counts().head(10).plot(kind='bar', stacked=True, title='Top 10 Book titles', alpha=.70)


In [None]:
# users_and_ratings_and_books.Author
print(users_and_ratings_and_books.Author.isnull().any()) # False
# check for unique entries
print(users_and_ratings_and_books.Author.nunique()) # 98909

# plot top 10 book authors
users_and_ratings_and_books.Author.value_counts().head(10).plot(kind='bar', stacked=True, title='Top 10 Book Author', alpha=.70)

In [None]:
#### GROUP BY OPERATIONS AND PLOTS HERE

4. Recommendation system based on Collaborative Filtering

In correlation-based systems, recommendations are made based on the similarity of the ratings/reviews given by users. 

For these systems, we use the Pearson correlation to suggest the items that are most similar to an item the user has already reviewed. In this sense, the technique takes user preference into account. If you want a refresher on Pearson correlation, read here: https://datasciencebeginners.com/2018/09/30/05-statistics-and-branches-of-statistics-part-2/. Correlation-based recommender systems of this kind are also called item-based systems.
Now let us see how to create a correlation-based recommendation system in Python.
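
As a minimal sketch of the idea (not the final model of this notebook), the snippet below builds a tiny user-by-title table and ranks the other titles by their Pearson correlation with title 'A'. The ratings are made up, and `ratings_demo`, `user_item` and `similar_to_a` are hypothetical names.

```python
import pandas as pd

ratings_demo = pd.DataFrame({
    'userID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    'Title':  ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'Rating': [9, 8, 2, 7, 6, 4, 3, 2, 9, 10, 9],
})
# rows are users, columns are titles, cells are ratings (NaN where a user has not rated a title)
user_item = ratings_demo.pivot_table(index='userID', columns='Title', values='Rating')
# Pearson correlation of every title's rating column with the ratings of title 'A'
similar_to_a = user_item.corrwith(user_item['A']).drop('A').sort_values(ascending=False)
print(similar_to_a)
```

Here title 'B' is rated almost identically to 'A' and comes out on top, while 'C' is rated in the opposite way and gets a negative correlation. On the real data, titles with very few common raters would need to be filtered out, since correlations computed on a handful of users are unreliable.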

In [None]:
# RECOMMENDER SYSTEM
# For an introduction to recommender systems, check README.md and the EXUS_ML_REPORT.
# Collaborative filtering: user-based CF and item-based CF
# User-based CF is based on the notion that "Users who are similar to you also liked..."    
# Item-based CF is based on: "Users who liked this item also liked..."





In [None]:

# create user-item matrix.
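
This cell was left empty. One way the user-item matrix could be built is sketched below, under the assumptions that implicit ratings (0) are dropped and that rarely rated books are filtered out first; `build_user_item`, `min_ratings` and the toy ratings are hypothetical and not taken from the original notebook.

```python
import pandas as pd

def build_user_item(ratings, min_ratings=1):
    """Pivot explicit ratings into a users x ISBN matrix (NaN where unrated)."""
    explicit = ratings[ratings['Rating'] > 0]           # drop implicit (0) ratings
    counts = explicit['ISBN'].value_counts()
    kept = counts[counts >= min_ratings].index          # books with enough ratings
    explicit = explicit[explicit['ISBN'].isin(kept)]
    return explicit.pivot_table(index='userID', columns='ISBN', values='Rating')

ratings_demo = pd.DataFrame({'userID': [1, 1, 2, 2, 3],
                             'ISBN': ['a', 'b', 'a', 'c', 'a'],
                             'Rating': [8, 0, 6, 7, 0]})
matrix_demo = build_user_item(ratings_demo, min_ratings=2)
print(matrix_demo)
```

On the full data set a dense pivot of all users and ISBNs would be very large, so some filtering of this kind (or a sparse representation) is likely to be necessary.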


5. Discussion 


In this notebook the 'Book-Crossing' dataset was used to create a recommendation system. 

A couple of CF-recommendation approaches were investigated, namely item-based and user-based CF methods.

Of these, 

gave the best performance as assessed by comparing the predicted book ratings for a given user with the actual rating in a test set that the model was not trained on.
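
The comparison of predicted and actual ratings on a held-out test set is commonly summarised with the root-mean-square error. The text above does not state which metric was used, so the following is only an illustrative sketch with made-up numbers; `rmse` is a hypothetical helper.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences of ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# three held-out ratings and the corresponding (made-up) predictions
error_demo = rmse([8, 5, 10], [7, 5, 8])
print(error_demo)
```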


The fields that were used for the model were the "user ID", "book ID", and "rating". There were others available in the dataset, such as "age", "location", "publisher", "year published", etc., but these were not used in the model.


Finally, we were able to build a recommender that could predict the 10 most likely book titles to be rated highly by a given user.


It should be noted that this approach still suffers from the "cold start problem"[3] - that is, for users with no ratings or history the model will not make accurate predictions. One way we could tackle this problem may be to initially start with popularity-based recommendations, before building up enough user history to implement the model.
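
A popularity-based fallback of the kind mentioned above could look like the following sketch. The ranking rule (mean explicit rating, with a minimum number of ratings) is an assumption chosen for illustration; `most_popular`, `min_count` and the toy ratings are hypothetical.

```python
import pandas as pd

def most_popular(ratings, n=10, min_count=2):
    """Return the ISBNs with the highest mean explicit rating, given enough ratings."""
    explicit = ratings[ratings['Rating'] > 0]            # ignore implicit (0) ratings
    stats = explicit.groupby('ISBN')['Rating'].agg(['count', 'mean'])
    stats = stats[stats['count'] >= min_count]           # require a minimum of evidence
    ranked = stats.sort_values(['mean', 'count'], ascending=False)
    return ranked.head(n).index.tolist()

ratings_demo = pd.DataFrame({'ISBN': ['a', 'a', 'b', 'c', 'c', 'c'],
                             'Rating': [9, 8, 10, 6, 7, 0]})
top_demo = most_popular(ratings_demo, n=10, min_count=2)
print(top_demo)
```

Book 'b' has the highest rating but only a single vote, so it is excluded; a new user would be shown 'a' and then 'c' until enough of their own ratings are available.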

Another piece of data that was not utilised in the current investigation was the "implicit" ratings, denoted by a rating of "0" in the dataset. Although more information about these implicit ratings would be needed (for example, whether a 0 represents a positive or a negative interaction), they might be useful for supplementing the "explicit" ratings recommender.

6. Enhancements - Future directions 

Below are some suggestions for steps that could be taken in the future to further improve our recommendation model, or to enhance the book recommendation system through other approaches.

There were other inconsistencies in the users data. Even after cleaning, the users.Town variable contained some name inconsistencies, such as values of 'c', 'b', 'ny', 'nis'. We could deduce a mapping for some of those, e.g. 'ny' probably refers to New York, but we couldn't do anything with single-letter ones. I decided to keep those entries rather than remove them, so that I don't lose data points.

Some state mappings also remained unresolved, such as 'rm' or 'sp'. In future, I might be able to map geographical data better, as there are APIs for states/provinces.

The Country variable contained some other inconsistencies that I didn't have the time to resolve. For example, it had entries like 'scotland/uk' or 'uk/united kingdom'. This results in misleading counts for each unique value (string) and should therefore be addressed.
 
The afore-mentioned problem with non-unique ISBNs for the same book title (possibly corresponding to different editions or formats of the book) could be addressed at a later point.
 
Also, some books appear to have years of publication later than the year the dataset was released. Some of these, as mentioned in the corresponding section, are just entry mistakes that should be corrected in the future.
 
In the Books.Author variable, although I did implement a code snippet to account for name inconsistencies (full name versus surname only, etc.), it wasn't applied to the final data set. 
 
Book title - description: another approach would be to search for book descriptions for every title and then apply some sort of Natural Language Processing (NLP), such as a bag-of-words approach, to assign a category to every book title (for example fiction, history, science etc.). That would be useful for recommending books of a similar category based on ratings. 
 
Implement a model-based CF approach, such as Singular Value Decomposition (SVD), on the user-item matrix (utility matrix). This matrix contains the ratings given by each user for each item. Since users rate only a small fraction of the items, it is mostly sparse.
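
To give a flavour of the SVD idea, here is a small NumPy sketch on a made-up utility matrix. Filling unrated cells with each user's mean before factorising and keeping k = 2 factors are simplifying assumptions for illustration only; dedicated recommender libraries handle missing entries more carefully.

```python
import numpy as np

# toy utility matrix: rows are users, columns are books, 0 marks "not rated"
R = np.array([[8, 0, 9, 0],
              [7, 6, 0, 2],
              [0, 5, 8, 1],
              [9, 0, 0, 3]], dtype=float)
rated = R > 0
# centre each user's observed ratings so that unrated cells sit at that user's mean
user_mean = R.sum(axis=1) / rated.sum(axis=1)
centred = np.where(rated, R - user_mean[:, None], 0.0)

U, s, Vt = np.linalg.svd(centred, full_matrices=False)
k = 2  # number of latent factors kept (arbitrary choice for this example)
# rank-k reconstruction, shifted back by the user means, gives predicted ratings
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] + user_mean[:, None]
print(np.round(approx, 2))
```

The entries of `approx` in positions where `R` is 0 can be read as predicted ratings for books the user has not rated yet.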
 

7. References 

I am a firm believer in transparency when building/implementing ML pipelines and an avid supporter of open-source coding and scientific research. Having said this, I must clarify that cold-blooded copy-pasting and plagiarism are things I frown upon, an attitude very much instilled in me through my research years. 

To this end, I feel obliged to cite here a few of the websites/blogs/papers/articles/books that helped with my understanding of the task and pushed my thinking even further. I hope you find them interesting and enlightening too! 
 
https://towardsdatascience.com/my-journey-to-building-book-recommendation-system-5ec959c41847 
 
https://towardsdatascience.com/building-a-recommendation-system-for-fragrance-5b00de3829da 
https://github.com/kellypeng/scentmate_rec

https://datascienceplus.com/building-a-book-recommender-system-the-basics-knn-and-matrix-factorization/ 
 
https://github.com/tttgm/fellowshipai/blob/master/Book-Crossing-Recommender.ipynb 
 


8. Suggested Additional Reading

Here are a couple of research papers, comparing and discussing CF-based and other recommendation frameworks on the Book-Crossing data set. 

https://pdfs.semanticscholar.org/ba2f/0b10f80ac3e569aef8a64320c54a4ca31e2b.pdf 

http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F-11-88/paper_web.pdf 

9. Closure
 
Thank you for taking the time to read through my Notebook. I would be happy to discuss my approach with you whenever it suits you. Looking forward to hearing back from you! 
 
 