Implementing kNN in a book recommendation system.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import sklearn
from sklearn.decomposition import TruncatedSVD
import sys
#sys.path.append("../BX-CSV-Dump/")


# Load the Book-Crossing CSV dump. Malformed rows are skipped with a warning:
# on_bad_lines='warn' (pandas >= 1.3) replaces the removed error_bad_lines=False.
book = pd.read_csv('BX-CSV-Dump/BX-Books.csv', sep=';', on_bad_lines='warn', encoding='latin-1')
book.columns = ["ISBN","bookTitle","bookAuthor","yearOfPublication","publisher","imageUrlS","imageUrlM","imageUrlL"]
user = pd.read_csv('BX-CSV-Dump/BX-Users.csv', sep=';', on_bad_lines='warn', encoding='latin-1')
user.columns = ['userId','Location','Age']
rating = pd.read_csv('BX-CSV-Dump/BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding='latin-1')
rating.columns = ['userId','ISBN','bookRating']


b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Rating Info

In [3]:
rating.head()

Unnamed: 0,userId,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### User Info

In [4]:
user.head()

Unnamed: 0,userId,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Book Info

In [5]:
book.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


To find the popular books, we combine the rating data with the book data.

In [6]:
combine_book_rating = pd.merge(rating,book,on='ISBN')
columns=['yearOfPublication','publisher','bookAuthor','imageUrlS','imageUrlM','imageUrlL']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
combine_book_rating.head()

Unnamed: 0,userId,ISBN,bookRating,bookTitle
0,276725,034545104X,0,Flesh Tones: A Novel
1,2313,034545104X,5,Flesh Tones: A Novel
2,6543,034545104X,0,Flesh Tones: A Novel
3,8680,034545104X,5,Flesh Tones: A Novel
4,10314,034545104X,9,Flesh Tones: A Novel


Group by book title and create a new column holding the total rating count.

In [7]:
combine_book_rating = combine_book_rating.dropna(axis=0,subset = ['bookTitle'])

In [8]:
book_ratingCount = (combine_book_rating.groupby(by = ['bookTitle'])['bookRating'].count().reset_index().rename(columns = {'bookRating':'totalRatingCount'})[['bookTitle','totalRatingCount']])
book_ratingCount.head()

Unnamed: 0,bookTitle,totalRatingCount
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1


Now combine the rating data with the total rating count data, which gives us what we need to filter out the lesser-known books.

In [9]:
rating_with_totalRating_Count = combine_book_rating.merge(book_ratingCount, left_on = 'bookTitle',right_on='bookTitle',how='left')
rating_with_totalRating_Count.head()

Unnamed: 0,userId,ISBN,bookRating,bookTitle,totalRatingCount
0,276725,034545104X,0,Flesh Tones: A Novel,60
1,2313,034545104X,5,Flesh Tones: A Novel,60
2,6543,034545104X,0,Flesh Tones: A Novel,60
3,8680,034545104X,5,Flesh Tones: A Novel,60
4,10314,034545104X,9,Flesh Tones: A Novel,60
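
The count-then-merge steps above can also be sketched in a single step with `groupby(...).transform('count')`, which writes each title's count onto every one of its rows. This is only an alternative sketch on a made-up toy table (the `toy` frame and its values are assumptions, not the notebook's data):

```python
import pandas as pd

# Toy ratings table; titles and ratings are invented for illustration.
toy = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2],
    "bookTitle": ["A", "A", "A", "B", "C"],
    "bookRating": [5, 0, 7, 3, 9],
})

# Count ratings per title and broadcast the count back onto every row,
# which avoids building a separate count table and merging it in.
toy["totalRatingCount"] = toy.groupby("bookTitle")["bookRating"].transform("count")
print(toy)
```

Both approaches should give the same `totalRatingCount` column; the merge version keeps the per-title count table around, which the notebook uses for the summary statistics.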


Summary statistics of the total rating count

In [10]:
pd.set_option('display.float_format',lambda x: '%.3f' % x)
print(book_ratingCount['totalRatingCount'].describe())

count   241071.000
mean         4.277
std         16.739
min          1.000
25%          1.000
50%          1.000
75%          3.000
max       2502.000
Name: totalRatingCount, dtype: float64


The median book has been rated only once. The top of the distribution looks as follows:

In [11]:
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9,1,0.01)))

0.900    7.000
0.910    8.000
0.920    9.000
0.930   10.000
0.940   11.000
0.950   13.000
0.960   16.000
0.970   20.000
0.980   29.000
0.990   50.000
Name: totalRatingCount, dtype: float64


The data above show that about 1% of books received 50 or more ratings. Because we have so many books in our data, we will limit it to the top 1%, which gives us 2713 unique books.

In [12]:
popularity_threshold = 50
rating_popular_book = rating_with_totalRating_Count.query('totalRatingCount >= @popularity_threshold')
rating_popular_book.head()

Unnamed: 0,userId,ISBN,bookRating,bookTitle,totalRatingCount
0,276725,034545104X,0,Flesh Tones: A Novel,60
1,2313,034545104X,5,Flesh Tones: A Novel,60
2,6543,034545104X,0,Flesh Tones: A Novel,60
3,8680,034545104X,5,Flesh Tones: A Novel,60
4,10314,034545104X,9,Flesh Tones: A Novel,60
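
Instead of hard-coding the cutoff, the threshold could be read off a quantile of the count distribution. The snippet below is a small sketch of that idea on made-up counts (`toy_counts` and the 0.8 quantile are assumptions chosen so the toy example is easy to follow):

```python
import pandas as pd

# Hypothetical per-title rating counts (made-up numbers).
toy_counts = pd.Series([1, 1, 1, 2, 3, 5, 8, 13, 50, 120], name="totalRatingCount")

# Derive the popularity threshold from a quantile rather than a fixed number.
toy_threshold = toy_counts.quantile(0.8)

# Keep only the titles at or above the threshold.
popular = toy_counts[toy_counts >= toy_threshold]
print(toy_threshold, popular.tolist())
```

With the real data, the same pattern with `quantile(0.99)` would correspond to the top 1% cutoff discussed above.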


Merge the popular-book ratings with the user data, then keep only the ratings from users located in the US or Canada.

In [13]:
combined = rating_popular_book.merge(user, left_on='userId',right_on='userId',how='left')
# na=False treats a missing Location as "no match" instead of producing NaN in the mask.
us_canada_user_rating = combined[combined['Location'].str.contains("usa|canada", na=False)]
us_canada_user_rating=us_canada_user_rating.drop('Age',axis=1)
us_canada_user_rating.head()

Unnamed: 0,userId,ISBN,bookRating,bookTitle,totalRatingCount,Location
0,276725,034545104X,0,Flesh Tones: A Novel,60,"tyler, texas, usa"
1,2313,034545104X,5,Flesh Tones: A Novel,60,"cincinnati, ohio, usa"
2,6543,034545104X,0,Flesh Tones: A Novel,60,"strafford, missouri, usa"
3,8680,034545104X,5,Flesh Tones: A Novel,60,"st. charles county, missouri, usa"
4,10314,034545104X,9,Flesh Tones: A Novel,60,"beaverton, oregon, usa"


Before applying kNN to the data set, we need to make sure each user has only one rating per book. Hence we find the duplicate rows and remove them from the data set.

In [14]:
if not us_canada_user_rating[us_canada_user_rating.duplicated(['userId','bookTitle'])].empty:
    initial_rows=us_canada_user_rating.shape[0]
    print('Initial data frame shape'+str(us_canada_user_rating.shape))
    us_canada_user_rating=us_canada_user_rating.drop_duplicates(['userId','bookTitle'])
    current_rows = us_canada_user_rating.shape[0]
    print('New data frame shape'+str(us_canada_user_rating.shape))
    print('Removed {0} rows'.format(initial_rows-current_rows))


Initial data frame shape(251615, 6)
New data frame shape(248949, 6)
Removed 2666 rows


To implement kNN, the data first needs reshaping. Convert the table to a 2-D matrix (books as rows, users as columns) and fill the missing values with zeros, since we will be calculating distances between rating vectors. Then transform the values (ratings) of the matrix dataframe into a SciPy sparse matrix for more efficient calculation.

In [15]:
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index='bookTitle', columns='userId', values = 'bookRating').fillna(0)
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)
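
As a side note, the sparse matrix can also be built directly from the long-format table, without going through a dense pivot first. The sketch below shows one way to do this on a made-up table (the `toy` frame is an assumption); it relies on each (book, user) pair appearing only once, because `csr_matrix` sums duplicate entries.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings; values are invented for illustration.
toy = pd.DataFrame({
    "userId": [10, 20, 10, 30],
    "bookTitle": ["A", "A", "B", "C"],
    "bookRating": [5, 8, 3, 9],
})

# Map titles and users to integer row/column positions.
title_cat = toy["bookTitle"].astype("category")
user_cat = toy["userId"].astype("category")

# Build the book-by-user matrix straight from (value, (row, col)) triplets.
toy_sparse = csr_matrix(
    (toy["bookRating"].to_numpy(),
     (title_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(title_cat.cat.categories), len(user_cat.cat.categories)),
)

# For comparison: the pivot-then-fillna route used in the notebook.
toy_dense = toy.pivot(index="bookTitle", columns="userId", values="bookRating").fillna(0)
print(toy_sparse.toarray())
```

On a table the size of the real one, this route avoids materialising tens of millions of zeros, at the cost of having to keep the category lookups to map rows back to titles.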

We use the unsupervised nearest-neighbors implementation in sklearn.neighbors. The algorithm used to compute nearest neighbors is brute and the metric is cosine, which means the algorithm will compute the cosine similarity between rating vectors.

In [16]:
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric = 'cosine', algorithm ='brute')
model_knn.fit(us_canada_user_rating_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)
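
To make the cosine metric concrete: the distance reported by the model is 1 minus the cosine similarity of two rating vectors. The following is a minimal sketch with made-up vectors (`a`, `b`, `c` and the helper `cosine_distance` are illustrative names, not part of the notebook), checked against scikit-learn's `cosine_distances`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

a = np.array([[5.0, 0.0, 3.0]])
b = np.array([[10.0, 0.0, 6.0]])  # same direction as a, twice the magnitude
c = np.array([[0.0, 4.0, 0.0]])   # rated by a different user only

def cosine_distance(u, v):
    """1 - cos(angle between u and v), for non-zero vectors."""
    u, v = u.ravel(), v.ravel()
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_ab = cosine_distance(a, b)  # parallel vectors: distance close to 0
d_ac = cosine_distance(a, c)  # no raters in common: distance 1
print(d_ab, d_ac, cosine_distances(a, b)[0, 0])
```

Because only the direction of the vectors matters, a book rated uniformly higher by the same users still counts as close, while books with no raters in common sit at distance 1.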

## Recommendations
In this step the model makes recommendations. kNN uses distance to measure how close instances are. In classification, it labels an instance with the most common class among its nearest neighbors; here we use the unsupervised variant and simply return a book's nearest neighbors as the recommendations.

In [17]:
# Pick a random book (row) from the book-by-user pivot table.
query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])
# .values is needed because a pandas Series has no reshape(); ask for 6 neighbors
# because the closest one is the query book itself.
distances, indices = model_knn.kneighbors(us_canada_user_rating_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)
book = us_canada_user_rating_pivot.index[query_index]
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(book))
    else:
        print('{0}: {1}, with distance {2}:'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Blood Lure:

1: Ill Wind (Anna Pigeon Mysteries (Paperback)), with distance 0.7843579732571693:
2: Track of the Cat (Anna Pigeon Mysteries (Paperback)), with distance 0.8669298979478484:
3: Endangered Species (Anna Pigeon Mysteries (Paperback)), with distance 0.884917636910298:
4: My Antonia, with distance 0.8949918312611727:
5: Firestorm (Anna Pigeon Mysteries (Paperback)), with distance 0.9028468362850127:
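
The lookup above can be wrapped in a small helper so that a specific title can be queried instead of a random one. This is only a sketch on a made-up 4x4 pivot table; `toy_pivot`, `toy_knn`, and `recommend` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy book-by-user rating matrix (rows: books, columns: user ids).
toy_pivot = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["Book A", "Book B", "Book C", "Book D"],
    columns=[101, 102, 103, 104],
    dtype=float,
)

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(csr_matrix(toy_pivot.values))

def recommend(title, pivot, model, n=2):
    """Return the n nearest titles to `title` with their cosine distances."""
    row = pivot.loc[[title]].values  # 2-D array of shape (1, n_users)
    # Ask for one extra neighbor, since the query book is its own nearest neighbor.
    distances, indices = model.kneighbors(row, n_neighbors=n + 1)
    pairs = zip(pivot.index[indices.flatten()], distances.flatten())
    return [(t, float(d)) for t, d in pairs if t != title][:n]

toy_recs = recommend("Book A", toy_pivot, toy_knn, n=2)
print(toy_recs)
```

With the notebook's objects, a call along the lines of `recommend('Blood Lure', us_canada_user_rating_pivot, model_knn, n=5)` should reproduce the kind of list shown above.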


### Matrix Factorization
Matrix factorization techniques are often more effective, because they uncover the latent features underlying the interactions between users and items (books).

I will use singular value decomposition (SVD), one of the matrix factorization models, to identify latent factors.

In [18]:
us_canada_user_rating_pivot2 = us_canada_user_rating.pivot(index='userId', columns='bookTitle', values = 'bookRating').fillna(0)
us_canada_user_rating_pivot2.head()

userId,10 Lb. Penalty,16 Lighthouse Road,1984,1st to Die: A Novel,2010: Odyssey Two,204 Rosewood Lane,2061: Odyssey Three,24 Hours,2nd Chance,3rd Degree,...,YOU BELONG TO ME,Year of Wonders,You Belong To Me,You Shall Know Our Velocity,Young Wives,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zoya,"\O\"" Is for Outlaw""","\Surely You're Joking, Mr. Feynman!\"": Adventures of a Curious Character""",stardust
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, transpose this utility matrix so that books are rows and users are columns, then fit TruncatedSVD to it for dimensionality reduction. The compression happens on the columns (users), since we must preserve the book titles (rows). Choosing n_components=12 gives 12 latent variables, which reduces each book's representation from 40017 dimensions to 12.

In [19]:
us_canada_user_rating_pivot2.shape

(40017, 2442)

In [20]:
X = us_canada_user_rating_pivot2.values.T
X.shape

(2442, 40017)

In [21]:
from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=12, random_state=17)
matrix = SVD.fit_transform(X)
matrix.shape

(2442, 12)
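
One way to judge whether 12 components is a reasonable choice is to look at how much variance the fitted decomposition retains. The sketch below does this on random stand-in data (`toy_X` is an assumption, not the real rating matrix), using the `explained_variance_ratio_` attribute of `TruncatedSVD`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(17)
# Stand-in for the book-by-user matrix: 50 "books", 200 "users", about 90% zeros.
ratings = rng.integers(0, 11, size=(50, 200))
mask = rng.random((50, 200)) < 0.1
toy_X = (ratings * mask).astype(float)

toy_svd = TruncatedSVD(n_components=12, random_state=17)
toy_matrix = toy_svd.fit_transform(toy_X)

# Fraction of the total variance captured by the 12 latent components.
retained = toy_svd.explained_variance_ratio_.sum()
print(toy_matrix.shape, retained)
```

On the real matrix, checking `SVD.explained_variance_ratio_.sum()` after the cell above would show how much information the 12 latent variables keep; a low value would suggest trying more components.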

Calculate Pearson's R correlation coefficient for every pair of books in the final matrix. To compare with the kNN result, pick the same book and find the books whose correlation coefficient with it is between 0.9 and 1.0.

In [22]:
corr = np.corrcoef(matrix)
corr.shape

(2442, 2442)
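
`np.corrcoef` treats each row as one variable, which is why a (books x components) input yields a (books x books) matrix. A tiny sketch with made-up latent vectors (`toy_latent` is an assumption) illustrates this:

```python
import numpy as np

# Three "books" described by three latent components each (invented values).
toy_latent = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # scaled copy of row 0: perfectly correlated
    [3.0, 2.0, 1.0],   # reversed trend: perfectly anti-correlated with row 0
])

# Rows are variables, so the result is a 3x3 book-by-book correlation matrix.
toy_corr = np.corrcoef(toy_latent)
print(toy_corr.round(3))
```

The diagonal is always 1, since every book correlates perfectly with itself, which is why the query book has to be excluded when reading off recommendations.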

In [23]:
us_canada_book_title = us_canada_user_rating_pivot2.columns
us_canada_book_list = list(us_canada_book_title)
coffey_hands = us_canada_book_list.index(book)
print(coffey_hands)

280


In [24]:
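# Row of correlations between the query book and every book in the matrix.
corr_coffey_hands = corr[coffey_hands]
# Keep strongly correlated titles; the upper bound of 1.0 is meant to drop the query book itself.
list(us_canada_book_title[(corr_coffey_hands<1.0) & (corr_coffey_hands>0.9)])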

['Blood Shot (V.I. Warshawski Novels (Paperback))',
 'Cold Case',
 'Come to Grief',
 'Hard Eight : A Stephanie Plum Novel (A Stephanie Plum Novel)',
 'Orchid Blues (Holly Barker Novels (Paperback))',
 'P Is for Peril',
 "Pandora's Clock",
 'Prime Cut',
 'Second Wind',
 'The Empress File',
 'The Grilling Season',
 'The Main Corpse (Culinary Mysteries (Paperback))',
 'Windy City Blues (V.I. Warshawski Novels (Paperback))']
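
A fixed 0.9 threshold returns a variable number of titles in alphabetical order. If a ranked list of a fixed length is preferred, for example to line up with the five kNN neighbors, the correlations can be sorted instead. The sketch below uses made-up values (`toy_titles`, `toy_corr_row`, and `query_pos` are assumptions):

```python
import numpy as np

toy_titles = np.array(["Book A", "Book B", "Book C", "Book D"])
# Hypothetical correlations of every book with "Book A" (position 0).
toy_corr_row = np.array([1.0, 0.95, 0.20, 0.97])
query_pos = 0

# Sort positions by correlation, highest first.
order = np.argsort(toy_corr_row)[::-1]
# Drop the query book by position rather than relying on corr < 1.0.
order = order[order != query_pos]

top_n = toy_titles[order[:2]].tolist()
print(top_n)
```

Excluding the query book by its index also sidesteps floating-point cases where its self-correlation comes out marginally below 1.0.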