# Book Recommender System Using Collaborative Filtering

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



# Ingest Dataset & Preliminary Massaging

## Ingestion

In [22]:
#Note that this 
books = pd.read_csv('../books_data/books.csv', delimiter=';', on_bad_lines='skip', encoding='latin-1')
ratings = pd.read_csv('../books_data/ratings.csv', delimiter=';', on_bad_lines='skip', encoding='latin-1')
users = pd.read_csv('../books_data/users.csv', delimiter=';', on_bad_lines='skip', encoding='latin-1')



## Preliminary Dataframe Massaging

In [25]:
# Lowercase all column names and replace dashes with underscores
books.columns = [x.lower().replace('-', '_') for x in books.columns]
ratings.columns = [x.lower().replace('-', '_') for x in ratings.columns]
users.columns = [x.lower().replace('-', '_') for x in users.columns]

# Initial Exploratory Data Analysis

## Preview Books

In [9]:
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [31]:
books.tail(2)

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_l
271358,192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...
271359,767409752,A Guided Tour of Rene Descartes' Meditations o...,Christopher Biffle,2000,McGraw-Hill Humanities/Social Sciences/Languages,http://images.amazon.com/images/P/0767409752.0...


### Information Available
From the schema, we can see the following columns:
- isbn: Primary key for this table
- book_title: Candidate key alongside isbn -> We should check that the number of unique isbn values equals the number of unique book_title values. If the counts differ, the mapping is not one-to-one (for example, one book_title may appear under several isbn values)
- book_author
- year_of_publication
- publisher
- image_url_s
- image_url_m
- image_url_l

The dataframe has three columns that hold the same image URL at different sizes. We only need one, so we'll drop the other two.
TODO Drop two of the three image_url columns (the drop cell below keeps image_url_l)
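The key check suggested for isbn and book_title could look roughly like the sketch below. It runs on a small toy dataframe (`books_demo`, with made-up values) rather than the real `books` table, so the names and data here are illustrative assumptions only.

```python
import pandas as pd

# Toy stand-in for the books dataframe; values are illustrative, not from the dataset
books_demo = pd.DataFrame({
    "isbn": ["A1", "A2", "A3", "A4"],
    "book_title": ["Dune", "Dune", "Emma", "Ulysses"],
})

# isbn can only act as a primary key if no value repeats
isbn_is_unique = books_demo["isbn"].is_unique

# Count distinct isbn values per title; a count above 1 means the
# title-to-isbn mapping is not one-to-one
isbn_per_title = books_demo.groupby("book_title")["isbn"].nunique()
titles_with_many_isbn = isbn_per_title[isbn_per_title > 1]

print(isbn_is_unique)
print(titles_with_many_isbn)
```

Swapping `books_demo` for `books` should give the same check on the real data, assuming the column names shown in the preview.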

In [11]:
books.shape

(271360, 8)

In [13]:
len(books['book_title'].unique())

242135

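Comparing the row count with the number of unique titles:
The dataframe has 271360 rows but only 242135 unique titles, so 29225 rows (271360 - 242135) repeat a title that already appears. Some of these may be different editions listed under separate isbn values, so we need to decide what counts as a duplicate before dropping.
TODO Drop duplicate books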
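One possible way to approach the duplicate-books TODO is sketched here on toy data. The rule used, treating rows with the same book_title and book_author as one book, is an assumption made for illustration and not something the notebook has settled on.

```python
import pandas as pd

# Toy stand-in for the books dataframe; values are illustrative
books_demo = pd.DataFrame({
    "isbn": ["A1", "A2", "A3"],
    "book_title": ["Dune", "Dune", "Emma"],
    "book_author": ["Frank Herbert", "Frank Herbert", "Jane Austen"],
})

# Rows minus unique titles gives the number of rows that repeat a title
n_repeated = len(books_demo) - books_demo["book_title"].nunique()

# Assumption: same title and same author means the same book; keep the first row
deduped = books_demo.drop_duplicates(
    subset=["book_title", "book_author"], keep="first"
)

print(n_repeated)
print(deduped)
```

Note that dropping rows this way also removes their isbn values, so any ratings that point at a dropped isbn would need to be remapped or they will no longer join.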

In [26]:
books = books.drop(columns=['image_url_s', 'image_url_m'])

In [27]:
books.head(2)

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_l
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...


## Preview Ratings

In [28]:
ratings.head()

Unnamed: 0,user_id,isbn,book_rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [29]:
ratings.tail()

Unnamed: 0,user_id,isbn,book_rating
1149775,276704,1563526298,9
1149776,276706,679447156,0
1149777,276709,515107662,10
1149778,276721,590442449,10
1149779,276723,5162443314,8


### Information Available
From the schema, we can see the following columns:
- user_id: foreign key for users table
- isbn: foreign key for books table
- book_rating: The ratings given

This is clearly the fact table so it'll be important to check the missingness in both foreign key columns. This is because we'll be joining these tables later on. 

It'll also be important to understand the rating scale. The previews above show values from 0 to 10.
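A minimal sketch of both checks follows, using toy frames (`ratings_demo`, `books_demo`) with made-up values in place of the real tables. It counts missing foreign keys, counts isbn values that have no match in the books table, and summarizes the rating scale.

```python
import pandas as pd

# Toy stand-ins for the ratings and books dataframes; values are illustrative
ratings_demo = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "isbn": ["A1", "A2", None, "A9"],
    "book_rating": [0, 5, 10, 8],
})
books_demo = pd.DataFrame({"isbn": ["A1", "A2", "A3"]})

# Missing values in each foreign key column
missing_per_column = ratings_demo[["user_id", "isbn"]].isna().sum()

# Ratings whose isbn is present but has no matching row in books
orphan_mask = ratings_demo["isbn"].notna() & ~ratings_demo["isbn"].isin(books_demo["isbn"])
n_orphans = int(orphan_mask.sum())

# Range and distribution of the rating values
rating_min = ratings_demo["book_rating"].min()
rating_max = ratings_demo["book_rating"].max()
rating_counts = ratings_demo["book_rating"].value_counts().sort_index()

print(missing_per_column)
print(n_orphans, rating_min, rating_max)
print(rating_counts)
```

Checking for unmatched keys as well as missing ones matters here because an inner join would silently drop both kinds of rows.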



### TODO Check the average number of ratings given per user
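A rough sketch of this check, again on a toy `ratings_demo` frame with made-up values: group by user_id, count the rows per user, then summarize.

```python
import pandas as pd

# Toy stand-in for the ratings dataframe; values are illustrative
ratings_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "isbn": ["A1", "A2", "A3", "A1", "A2", "A3"],
    "book_rating": [0, 5, 7, 10, 8, 0],
})

# Number of ratings each user has given
ratings_per_user = ratings_demo.groupby("user_id").size()

# The mean can be skewed by a few very active users, so report the median too
mean_ratings = ratings_per_user.mean()
median_ratings = ratings_per_user.median()

print(ratings_per_user)
print(mean_ratings, median_ratings)
```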

## Preview Users

In [16]:
users.head()

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [17]:
users.tail()

Unnamed: 0,user_id,location,age
278853,278854,"portland, oregon, usa",
278854,278855,"tacoma, washington, united kingdom",50.0
278855,278856,"brampton, ontario, canada",
278856,278857,"knoxville, tennessee, usa",
278857,278858,"dublin, n/a, ireland",


### Information Available
From the schema, we can see the following columns:
- user_id: primary key for table
- location
- age

Since we are taking a collaborative filtering approach, both secondary features (location and age) may be useful for clustering users into demographic groups. In further EDA, it will be useful to explore which types of books different age groups and locations gravitate towards.
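As a starting point for that grouping, the sketch below derives an age group and a country for each user. The rows mirror a few lines of the users preview, while the bin edges, the labels, and the `age_group` and `country` column names are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Small sample shaped like the users preview
users_demo = pd.DataFrame({
    "user_id": [1, 2, 4, 278855],
    "location": [
        "nyc, new york, usa",
        "stockton, california, usa",
        "porto, v.n.gaia, portugal",
        "tacoma, washington, united kingdom",
    ],
    "age": [np.nan, 18.0, 17.0, 50.0],
})

# Hypothetical bins: (0, 17], (17, 29], (29, 49], (49, 120]
age_bins = [0, 17, 29, 49, 120]
age_labels = ["under_18", "18_29", "30_49", "50_plus"]
users_demo["age_group"] = pd.cut(users_demo["age"], bins=age_bins, labels=age_labels)

# Assumption: the country is the last comma-separated part of location
users_demo["country"] = users_demo["location"].str.split(",").str[-1].str.strip()

print(users_demo[["user_id", "age_group", "country"]])
```

Users with a missing age get a missing age_group, so the missingness in age carries straight through to any demographic grouping built on it.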




In [15]:
users.shape

(278858, 3)

In [18]:
len(users['user_id'].unique())

278858

No duplicate users

## Investigate Missingness
From the respective 