# Book Crossing Recommendation System 
  
Author: Eleni Zarogianni 
October 2019 

Objective: to implement a book recommender system based on collaborative filtering, using the publicly available Book-Crossing data set (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

In [None]:
# import libraries
# for data manipulation
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import pylab
import seaborn as sns

plt.style.use('classic')
plt.style.use('seaborn-whitegrid')

1. Load Data
The readily available Book-Crossing data set is used here. It was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings.
 
BX-Users : Contains the users. User IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. 

BX-Books : Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. 
 
BX-Book-Ratings : Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0. 
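
Because a rating of 0 marks an implicit interaction rather than a low score, it can help to keep the two kinds of ratings apart before computing any averages. The following is only a minimal sketch on a made-up toy frame that reuses the original `Book-Rating` column name; `toy_ratings`, `explicit` and `implicit` are illustrative names, not part of the data set.

```python
import pandas as pd

# Toy stand-in for BX-Book-Ratings (values are made up for illustration)
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['A', 'B', 'A', 'C'],
    'Book-Rating': [0, 8, 10, 0],
})

# Explicit ratings lie on the 1-10 scale; 0 denotes an implicit rating
explicit = toy_ratings[toy_ratings['Book-Rating'] > 0]
implicit = toy_ratings[toy_ratings['Book-Rating'] == 0]

print(len(explicit), len(implicit))
```

Averaging over all rows, zeros included, would pull every book's mean rating down, which is why the split is worth considering early.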

Let's jump straight into reading the CSV files as pandas DataFrames (Dfs).

In [None]:
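# The files are ';'-separated and latin-1 encoded; malformed lines are skipped.
# Note: error_bad_lines was deprecated in pandas 1.3; newer versions use on_bad_lines='skip'.
users = pd.read_csv('BX-Users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
books = pd.read_csv('BX-Books.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")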

2. Inspect and clean the data
In general, data inspection and cleaning procedures involve visually inspecting the data, through graphs and plots, and identifying any inconsistencies or peculiarities in the data sets. At a first level, these include duplicate entries, missing values and wrongly assigned data types; at a second level, outliers. We will explore and handle each of these aspects below.
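
The first-level checks listed above can be bundled into a small helper. This is a sketch only: `quick_checks` is a hypothetical function introduced here for illustration, and the toy frame stands in for the real tables.

```python
import pandas as pd
import numpy as np

def quick_checks(df):
    """Return first-level data quality indicators for a DataFrame."""
    return {
        'n_rows': len(df),
        'n_duplicate_rows': int(df.duplicated().sum()),
        'missing_per_column': df.isnull().sum().to_dict(),
        'dtypes': df.dtypes.astype(str).to_dict(),
    }

# Toy data: one fully duplicated row and one missing age
toy = pd.DataFrame({'id': [1, 2, 2, 3], 'age': [25.0, 31.0, 31.0, np.nan]})
report = quick_checks(toy)
print(report)
```

Outliers are deliberately left out of the helper, since what counts as an outlier depends on the variable.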

Let's have a first glance at the data and check the Df's shape.

In [None]:
# print the shape of the data
print(users.shape)
print(books.shape)
print(ratings.shape)

That looks fine.
On initial inspection, all 3 Dfs contain column names with a '-'. That rules out attribute-style access (e.g. `users.User-ID` would be parsed as a subtraction), so let's change them.

In [None]:
# remove the hyphens from the column names
users.columns = ['userID', 'Location', 'Age']
books.columns = ['ISBN', 'BookTitle', 'BookAuthor', 'YearOfPublication', 'Publisher', 'ImageUrlS', 'ImageUrlM', 'imageUrlL']
ratings.columns = ['userID', 'ISBN', 'BookRating']
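
As an alternative to typing the names by hand, the hyphens could be stripped programmatically. The sketch below runs on an empty toy frame carrying the original column names; note that it yields `UserID` rather than the `userID` spelling chosen above, so it is not a drop-in replacement.

```python
import pandas as pd

# Empty toy frame with the original BX-Users column names
toy_users = pd.DataFrame(columns=['User-ID', 'Location', 'Age'])

# Remove every literal hyphen from the column labels
toy_users.columns = toy_users.columns.str.replace('-', '', regex=False)

print(list(toy_users.columns))
```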

At first glance, we can already spot missing values (e.g. in the users.Age variable), but let's have a closer look and address each dataframe's idiosyncrasies separately.

USERS DataFrame

In [None]:
# Check 5 first entries
users.head(5)
# Get basic info first.
users.info()
# Describe numerical variables
users.describe()
# Describe categorical variables
users.describe(include=['O'])

The first 5 entries confirm that we have missing values in Age. The data types of the users Df also look reasonable.
From the description of the Df, we can spot a 'weird' min-max pair for Age. We'll keep that in mind. The userID variable seems fine.
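
One possible way to deal with an implausible Age range is to treat values outside a sensible interval as missing. The sketch below is only an illustration on toy data: the bounds `MIN_AGE` and `MAX_AGE` are assumptions chosen for the example, not values taken from the data set.

```python
import pandas as pd
import numpy as np

# Toy users with one missing and two implausible ages
toy_users = pd.DataFrame({'userID': [1, 2, 3, 4],
                          'Age': [0.0, 34.0, np.nan, 244.0]})

# Assumed plausibility bounds (hypothetical; adjust to taste)
MIN_AGE, MAX_AGE = 5, 100

# between() is False for NaN and for out-of-range values;
# where() keeps plausible ages and sets everything else to NaN
plausible = toy_users['Age'].between(MIN_AGE, MAX_AGE)
toy_users['Age'] = toy_users['Age'].where(plausible)

print(toy_users)
```

Setting such values to NaN, rather than dropping the rows, keeps the users available for the ratings data.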

The description of the non-numerical Location variable seems OK, but we can easily deduce that it would be more useful to split it into 3 separate variables (Town, State and Country), which are more informative for a recommendation system.

So let's take a closer look at the users.Location and users.Age variables.

In [None]:
# check for missing values
users.Location.isnull().any()
# check for duplicate entries
users.Location.nunique()

# split users.Location into 3 subparts
location_expanded = users.Location.str.split(',', n=2, expand=True)
location_expanded.columns = ['Town', 'State', 'Country']
users = users.join(location_expanded)
# Drop the initial Location variable.
users.drop('Location', axis=1, inplace = True)

So, Location has no missing values, and it contains non-unique entries, which is certainly OK. We've split it into 3 sub-parts as described and dropped the original variable.
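
To see how a split on at most two commas behaves on irregular input, here is a small sketch on made-up location strings. The extra whitespace stripping is an addition for illustration and is not part of the cell above.

```python
import pandas as pd

# Made-up location strings, including irregular ones
toy_location = pd.Series([
    'nyc, new york, usa',
    'farnborough, hants, united kingdom',
    'somewhere, usa',   # only two parts
    'a, b, c, d',       # more than three parts
])

# Split on at most 2 commas, giving at most 3 columns
parts = toy_location.str.split(',', n=2, expand=True)
parts.columns = ['Town', 'State', 'Country']

# Strip the whitespace left over after each comma
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Entries with fewer than three parts end up with a missing Country, while anything beyond the second comma stays together in the last column. Both cases are worth checking in the real data.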

Now, let's go the extra mile by having a look at some descriptives and plots for location.

In [None]:
# How many unique towns, states and countries do I have?
users.Town.nunique()
users.State.nunique()
users.Country.nunique()
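
The number of unique values says nothing about how users are distributed across them. A frequency table is one way to look at that; the sketch below uses a made-up `toy_country` series rather than the real users.Country column.

```python
import pandas as pd

# Made-up country labels for illustration
toy_country = pd.Series(['usa', 'usa', 'canada', 'germany', 'usa', 'canada'],
                        name='Country')

# value_counts() sorts by frequency, most common first
top_countries = toy_country.value_counts().head(2)

print(top_countries)
```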

In [None]:
# Note: per the data set description, the users' demographic fields (Location, Age) contain NULL values when unavailable.
 