# **Capstone Step 6: Survey Existing Research and Reproduce Existing Solutions**

The papers and articles whose results I will attempt to reproduce:

1. "College Library Personalized Recommendation System Based on Hybrid Recommendation Algorithm" by Yonghong Tian et al., Procedia CIRP 83 (2019) 490–494, available at https://www.sciencedirect.com/science/article/pii/S2212827119307401

2. "How to Build a Book Recommendation System" by Raghav Agrawal, available at https://www.analyticsvidhya.com/blog/2021/06/build-book-recommendation-system-unsupervised-learning-project/

3. https://towardsdatascience.com/how-did-we-build-book-recommender-systems-in-an-hour-the-fundamentals-dfee054f978e

In [41]:
# Import necessary dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from scipy.sparse import csr_matrix

# 1. Read raw data

In [2]:
raw_ratings = pd.read_csv('./data/Ratings.zip')

raw_books = pd.read_csv('./data/Books.zip')

raw_users = pd.read_csv('./data/Users.zip')



In [3]:
raw_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [4]:
raw_books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
raw_users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


# 2. Clean and Wrangle Dataframes
An important step Agrawal describes in "How to Build a Book Recommendation System" is filtering the datasets to exclude users who have rated only a handful of books and books that have received only a handful of ratings. The basics still apply: dropping null entries and cleaning pre-filled values, as I did in Step 5 of the capstone process. This filtering, however, will matter a great deal for the quality of the model I ultimately produce.
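
The filtering idea described above can be sketched on a toy table. This is a minimal illustration, not Agrawal's code: the helper name `filter_by_min_count`, the toy values, and the thresholds are all assumptions made for the example. The helper also uses an inclusive `>=` threshold, whereas the user filter in this notebook uses a strict `> 200`.

```python
import pandas as pd

def filter_by_min_count(df: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep only rows whose value in `column` appears at least `min_count` times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)].copy()

# Toy ratings table (made-up values, not taken from the real dataset)
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "C", "A", "B", "A"],
    "Book-Rating": [5, 7, 9, 6, 8, 10],
})

# First drop users with too few ratings, then books with too few ratings
active = filter_by_min_count(toy, "User-ID", min_count=2)    # user 3 has only one rating
popular = filter_by_min_count(active, "ISBN", min_count=2)   # book C has only one rating
print(popular)
```

The order of the two filters matters: counting book ratings after removing inactive users can drop books that would otherwise have passed the threshold.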

In [29]:
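# Drop 0 ratings, based on the findings in Step 5
dfRatings = raw_ratings[raw_ratings['Book-Rating'] != 0]
dfBooks = raw_books.dropna()
dfUsers = raw_users.drop(columns=['Age'])

# info() prints its summary and returns None, so call it directly rather than wrapping it in print()
for df in (dfRatings, dfBooks, dfUsers):
    df.info()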

<class 'pandas.core.frame.DataFrame'>
Int64Index: 433671 entries, 1 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   User-ID      433671 non-null  int64 
 1   ISBN         433671 non-null  object
 2   Book-Rating  433671 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 13.2+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 271354 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271354 non-null  object
 1   Book-Title           271354 non-null  object
 2   Book-Author          271354 non-null  object
 3   Year-Of-Publication  271354 non-null  object
 4   Publisher            271354 non-null  object
 5   Image-URL-S          271354 non-null  object
 6   Image-URL-M          271354 non-null  object
 7   Image-URL-L          271354 non-null  object
dtypes: object(8)
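
The summary above lists `Year-Of-Publication` as `object`, which suggests the column mixes numeric and non-numeric values. One possible way to handle that, sketched on invented toy data (the `toy_books` frame and its values are assumptions, not rows from the real file), is to coerce the column with `pd.to_numeric` and inspect whatever fails to parse:

```python
import pandas as pd

# Toy stand-in for the books table; the values are invented for illustration
toy_books = pd.DataFrame({
    "ISBN": ["0000000001", "0000000002", "0000000003"],
    "Year-Of-Publication": ["2002", 1999, "not a year"],
})

# Coerce to numbers; anything unparseable becomes NaN instead of raising an error
years = pd.to_numeric(toy_books["Year-Of-Publication"], errors="coerce")

# Store as a nullable integer column so missing years stay missing
toy_books["Year-Of-Publication"] = years.astype("Int64")

print(toy_books.dtypes)
print(toy_books[toy_books["Year-Of-Publication"].isna()])  # rows that need manual review
```

Whether this step is needed depends on how the year column is used later; the nearest-neighbour model in this notebook only uses ratings.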


In [16]:
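# Keep only ratings from users who have rated more than 200 books
# Note: the output below was produced before the zero-rating filter in cell In [29] was run
# (526,356 rows here versus 433,671 after that filter), so 0 ratings are still present downstream.

ratingsPerUser = dfRatings['User-ID'].value_counts()
activeUsers = ratingsPerUser[ratingsPerUser > 200].index  # User-IDs
print(activeUsers.shape)
dfRatings = dfRatings[dfRatings['User-ID'].isin(activeUsers)]
dfRatings.info()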

(899,)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 526356 entries, 1456 to 1147616
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   User-ID      526356 non-null  int64 
 1   ISBN         526356 non-null  object
 2   Book-Rating  526356 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 16.1+ MB


In [17]:
# Merge books with filtered ratings

dfBookRatings = dfRatings.merge(dfBooks, on='ISBN')
dfBookRatings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...


In [35]:
# Drop books with fewer than 50 ratings

# 'ratingsSum' holds the number of ratings per title (a count, not a sum of rating values)
ratingsSum = dfBookRatings.groupby('Book-Title')['Book-Rating'].count().reset_index()
ratingsSum = ratingsSum.rename(columns={'Book-Rating': 'ratingsSum'})
dfRatingsFinal = dfBookRatings.merge(ratingsSum, on='Book-Title')
dfRatingsFinal = dfRatingsFinal[dfRatingsFinal['ratingsSum'] >= 50]
# Reassign instead of using inplace=True on a filtered slice, which can trigger SettingWithCopyWarning
dfRatingsFinal = dfRatingsFinal.drop_duplicates(['User-ID', 'Book-Title'])
dfRatingsFinal.shape

# Extra columns compared to Agrawal's article are due to this version of the dataset including URLs for book cover images

(59850, 11)

In [40]:
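# Create pivot table: one row per book title, one column per user, ratings as values

dfBookPivot = dfRatingsFinal.pivot_table(columns='User-ID', index='Book-Title', values='Book-Rating')
dfBookPivot = dfBookPivot.fillna(0)  # a user who has not rated a book gets 0
dfBookPivot.shape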

(742, 888)
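
To make the shape of this table concrete, here is a small sketch of the same pivot on invented data (the `toy` frame is an assumption for illustration only). It also measures how many cells actually hold a rating, which is why a sparse representation is a natural fit for the next step.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings (invented): three books, three users, four ratings in total
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "Book-Title": ["Book A", "Book B", "Book A", "Book C"],
    "Book-Rating": [8, 6, 9, 7],
})

# Rows are titles, columns are users; unrated cells are filled with 0
pivot = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating").fillna(0)

# The sparse matrix stores only the non-zero cells
sparse = csr_matrix(pivot.values)
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

print(pivot)
print(f"stored non-zeros: {sparse.nnz}, density: {density:.2f}")
```

Note that `pivot_table` averages duplicate (title, user) pairs by default, so dropping duplicates beforehand keeps each cell equal to a single rating.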

# 3. Modelling

In [45]:
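# Convert the pivot table to a sparse matrix so the many filled-in 0 values are not stored explicitly
from sklearn.neighbors import NearestNeighbors

sparseBookMatrix = csr_matrix(dfBookPivot)
model = NearestNeighbors(algorithm='brute')  # defaults: Euclidean distance, n_neighbors=5
model.fit(sparseBookMatrix)

# Test model: find the 5 nearest neighbours of the book at row 237
# (the query book is part of the fitted data, so it normally appears in its own results)

distances, suggestions = model.kneighbors(dfBookPivot.iloc[237, :].values.reshape(1, -1))
for i in range(len(suggestions)):
    print(dfBookPivot.index[suggestions[i]])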

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive'],
      dtype='object', name='Book-Title')
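
The test above looks a book up by row position. A lookup by title may be easier to reuse; the sketch below shows one way to do it on invented data. The function name `recommend`, the toy titles, and the toy ratings are assumptions, and the sketch swaps in cosine distance, whereas the model above uses scikit-learn's default Euclidean metric.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend(title, pivot, model, n=2):
    """Return the n titles nearest to `title`, excluding the title itself."""
    row = pivot.index.get_loc(title)
    query = pivot.iloc[[row]].values          # shape (1, n_users)
    # Ask for one extra neighbour because the query book is in the fitted data
    _, idx = model.kneighbors(query, n_neighbors=n + 1)
    neighbours = [pivot.index[i] for i in idx[0] if i != row]
    return neighbours[:n]

# Toy title-by-user rating matrix (invented values)
toy_pivot = pd.DataFrame(
    [[9, 8, 0, 0],
     [8, 9, 0, 0],
     [0, 0, 7, 9],
     [0, 1, 8, 9]],
    index=pd.Index(["Fantasy 1", "Fantasy 2", "Mystery 1", "Mystery 2"], name="Book-Title"),
    columns=[101, 102, 103, 104],
    dtype=float,
)

# Cosine distance compares rating patterns rather than rating magnitudes
toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(csr_matrix(toy_pivot.values))

print(recommend("Fantasy 1", toy_pivot, toy_model, n=1))
```

Cosine distance is a common choice for rating vectors of very different lengths, but whether it improves the recommendations here would need to be checked against the real pivot table.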


This simple nearest-neighbours approach already produces plausible recommendations: for the test title, four of the five results are Harry Potter books. Ultimately, I plan to use TensorFlow to build a neural network that produces recommendations. In Step 7 of the capstone, I'll see what I can replicate from the other reference material I've found!