## Collaborative Filtering

In [71]:
# import pandas
import pandas as pd
from sklearn.neighbors import NearestNeighbors

We will create a recommender engine based on item-based collaborative filtering (IBCF), which finds the books most similar to a given book based on user ratings. The data can be downloaded from [here](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing).
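To make the idea concrete, here is a minimal sketch of item-based similarity on a made-up rating matrix. The data and variable names are hypothetical, and cosine similarity is used only for illustration; the notebook itself relies on `NearestNeighbors` with its default settings.

```python
import numpy as np

# Hypothetical toy data: rows are items (books), columns are users,
# 0 means "no rating". The values are made up for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # book A
    [4.0, 5.0, 0.0, 1.0],   # book B
    [0.0, 0.0, 5.0, 4.0],   # book C
])

# Cosine similarity between every pair of item vectors
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
similarity = (ratings @ ratings.T) / (norms @ norms.T)

# For book A, rank the other books by similarity (highest first)
order = np.argsort(-similarity[0])
most_similar_to_a = [int(i) for i in order if i != 0]
print(most_similar_to_a)
```

Books A and B are rated by the same users, so B ranks ahead of C, which shares no raters with A.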

In [72]:
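# Both files are semicolon-separated and Latin-1 encoded.
book_ratings = pd.read_csv('./res/data/bx-books/BX-Book-Ratings.csv', sep=";", encoding="latin")
# on_bad_lines="warn" (pandas >= 1.3) skips malformed rows and reports them; it replaces the deprecated error_bad_lines=False.
books = pd.read_csv('./res/data/bx-books/BX-Books.csv', sep=";", encoding="latin", on_bad_lines="warn")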



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 2
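The "Skipping line" messages above come from rows that have more fields than the header. As a small sketch of how pandas handles such rows, the example below parses an invented in-memory file; the `on_bad_lines` parameter is available in pandas 1.3 and newer, while `error_bad_lines` was deprecated in 1.3 and removed in 2.0.

```python
import io
import pandas as pd

# Hypothetical three-column file; the second data row has an extra field.
raw = "a;b;c\n1;2;3\n4;5;6;7\n8;9;10\n"

# on_bad_lines="skip" drops rows with too many fields;
# use "warn" instead to also report each skipped row.
df = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(df)
```

Only the two well-formed rows survive, so it is worth checking how many rows were dropped before relying on the result.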

* Explore both datasets

In [73]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [74]:
books.head()

,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [75]:
book_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [76]:
book_ratings.head()

,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


* Create a DataFrame named `df_book_features` from `book_ratings` with `ISBN` as the index, `User-ID` as the columns, and `Book-Rating` as the values.
    - The data are quite large, so it is fine to use only a sample if your PC has limited RAM.
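The reshaping described above can be sketched on a tiny, invented ratings table. `toy_ratings` and `toy_features` are hypothetical names; the column names match `book_ratings`.

```python
import pandas as pd

# Hypothetical miniature of book_ratings; values are invented.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [5, 3, 4, 0],
})

# One row per ISBN, one column per user; unrated pairs become NaN.
toy_features = toy_ratings.pivot_table(
    index="ISBN", columns="User-ID", values="Book-Rating"
)
print(toy_features)
```

When sampling the real data with `sample(frac=...)`, passing a `random_state` makes the resulting matrix reproducible between runs.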


In [77]:
df_book_features = book_ratings.sample(frac=0.1).pivot_table(index='ISBN', columns='User-ID', values='Book-Rating')

In [78]:
df_book_features

User-ID,2,8,17,22,23,44,53,56,67,69,...,278723,278766,278771,278774,278813,278815,278838,278843,278851,278854
ISBN,,,,,,,,,,,...,,,,,,,,,,
0375404120,,,,,,,,,,,...,,,,,,,,,,
9032803328,,,,,,,,,,,...,,,,,,,,,,
904492401X,,,,,,,,,,,...,,,,,,,,,,
9061002273,,,,,,,,,,,...,,,,,,,,,,
(THEWINDMILLP,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
O82172060O,,,,,,,,,,,...,,,,,,,,,,
O889950962,,,,,,,,,,,...,,,,,,,,,,
X000000000,,,,,,,,,,,...,,,,,,,,,,
"\0432534220\""""",,,,,,,,,,,...,,,,,,,,,,


In [79]:
df_book_features.index.values

array([' 0375404120', ' 9032803328', ' 904492401X', ..., 'X000000000',
       '\\0432534220\\""', '`3502103682'], dtype=object)

In [80]:
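import re
df_book_features.index = df_book_features.index.astype('str')

# Collect ISBNs that contain at least one non-digit character (whitespace,
# punctuation or letters), keyed by row position, and strip every non-word
# character from them. All-digit ISBNs are not collected by this loop.
isbn = {}
for i, index in enumerate(df_book_features.index.values):
    if re.search(r'\D', index):
        isbn[i] = index.strip()
        isbn[i] = re.sub(r'\W', '', isbn[i])

# Keep only the cleaned values that have the 10 characters of an ISBN-10
clean_isbn = {}
for key, value in isbn.items():
    if len(value) == 10:
        clean_isbn[key] = value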
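A length check alone still accepts values such as `O82172060O`, which uses the letter O instead of zero. As an alternative sketch (not the approach used in this notebook), a stricter validator could check the ISBN-10 format and its check digit. The helper name `is_valid_isbn10` is hypothetical.

```python
import re

# ISBN-10: nine digits followed by a digit or the check character X
ISBN10_PATTERN = re.compile(r"\d{9}[\dX]")

def is_valid_isbn10(raw):
    """Return True if `raw` is a well-formed ISBN-10 with a correct check digit."""
    candidate = raw.strip().upper()
    if not ISBN10_PATTERN.fullmatch(candidate):
        return False
    # Weighted sum with weights 10..1 must be divisible by 11 (X counts as 10)
    total = sum(
        weight * (10 if char == "X" else int(char))
        for weight, char in zip(range(10, 0, -1), candidate)
    )
    return total % 11 == 0

for value in [" 0375404120", "034545104X", "O82172060O", "(THEWINDMILLP"]:
    print(repr(value), is_valid_isbn10(value))
```

Applied to the index, such a check would keep all-digit ISBNs as well, which the loop above leaves out.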

In [81]:
df_book_features.isnull().sum()

User-ID
2         70661
8         70661
17        70658
22        70661
23        70661
          ...  
278815    70661
278838    70661
278843    70658
278851    70660
278854    70661
Length: 27628, dtype: int64

In [82]:
df_index_reset = df_book_features.reset_index(drop=False)

In [83]:
df_index_reset

User-ID,ISBN,2,8,17,22,23,44,53,56,67,...,278723,278766,278771,278774,278813,278815,278838,278843,278851,278854
0,0375404120,,,,,,,,,,...,,,,,,,,,,
1,9032803328,,,,,,,,,,...,,,,,,,,,,
2,904492401X,,,,,,,,,,...,,,,,,,,,,
3,9061002273,,,,,,,,,,...,,,,,,,,,,
4,(THEWINDMILLP,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70657,O82172060O,,,,,,,,,,...,,,,,,,,,,
70658,O889950962,,,,,,,,,,...,,,,,,,,,,
70659,X000000000,,,,,,,,,,...,,,,,,,,,,
70660,"\0432534220\""""",,,,,,,,,,...,,,,,,,,,,


In [84]:
df_books = df_index_reset[df_index_reset['ISBN'].isin(list(clean_isbn.values()))]

In [85]:
df_books

User-ID,ISBN,2,8,17,22,23,44,53,56,67,...,278723,278766,278771,278774,278813,278815,278838,278843,278851,278854
69,000225414X,,,,,,,,,,...,,,,,,,,,,
94,000232539X,,,,,,,,,,...,,,,,,,,,,
95,000255397X,,,,,,,,,,...,,,,,,,,,,
111,000274080X,,,,,,,,,,...,,,,,,,,,,
118,000458726X,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70655,O708944639,,,,,,,,,,...,,,,,,,,,,
70656,O747514674,,,,,,,,,,...,,,,,,,,,,
70657,O82172060O,,,,,,,,,,...,,,,,,,,,,
70658,O889950962,,,,,,,,,,...,,,,,,,,,,


* Create an instance of the NearestNeighbors class

In [86]:
nn = NearestNeighbors()

In [93]:
df_books_reindexed = df_books.set_index('ISBN').fillna(0)

* Fit the NearestNeighbors model using `df_books_reindexed` (the cleaned, zero-filled version of `df_book_features`)

In [95]:
nn_isbn = nn.fit(df_books_reindexed)

In [99]:
distances, indices = nn_isbn.kneighbors(df_books_reindexed)

In [104]:
distances

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 8., 8., 8., 8.],
       ...,
       [0., 8., 8., 8., 8.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [105]:
indices

array([[3139, 3142, 3135, 3141, 3133],
       [3139, 3142, 3135, 3141, 3133],
       [   2, 3139, 3142, 3135, 3141],
       ...,
       [5649, 3141, 3139, 3135, 3142],
       [3139, 3142, 3135, 3141, 3133],
       [1235, 5635, 5651, 5215,  637]])
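The `distances` and `indices` arrays above have five columns because `NearestNeighbors` defaults to `n_neighbors=5`, and the default metric is Euclidean distance. Row `i` lists the rows nearest to item `i`; a row of zeros in `distances` means those neighbors have exactly the same rating vector as the query. The sketch below shows the same calls on a small invented matrix; `X`, `toy_nn`, `toy_distances` and `toy_indices` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical item vectors (rows are items); values are invented.
X = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [3.0, 4.0],
])

# Defaults are n_neighbors=5 and Euclidean distance; n_neighbors=2 here
# because the toy matrix has only three rows.
toy_nn = NearestNeighbors(n_neighbors=2).fit(X)
toy_distances, toy_indices = toy_nn.kneighbors(X)

print(toy_distances)
print(toy_indices)
```

Because the query rows are the same rows the model was fitted on, each item appears as its own nearest neighbor at distance 0 when there are no ties.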

* Create a function that returns the top 5 most similar books (according to the KNN model) for a selected book
    * the input will be a Book-Title from the DataFrame books
    * the output will be the Book-Titles of the top 5 most similar books
    * for every book in the top 5, also print its distance from the selected book (the ISBN looked up from the input title)

In [109]:
book_title = 'Classical Mythology'
isbn = books[books['Book-Title'] == book_title]['ISBN']
isbn
# idx = df_books_reindexed[isbn].index()

0        0195153448
95231    0801319536
Name: ISBN, dtype: object
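The lookup above returns two ISBNs for a single title, and not every ISBN is guaranteed to be present in the filtered feature matrix. One way to find the row positions is `Index.get_indexer`, sketched here on invented stand-ins for `books` and `df_books_reindexed` (`toy_books` and `toy_matrix` are hypothetical names).

```python
import pandas as pd

# Hypothetical stand-ins for `books` and `df_books_reindexed`
toy_books = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Title": ["Same Title", "Same Title", "Other Title"],
})
toy_matrix = pd.DataFrame(
    [[0.0, 5.0], [3.0, 0.0]],
    index=pd.Index(["222", "333"], name="ISBN"),
)

# All ISBNs that carry the requested title
isbns = toy_books.loc[toy_books["Book-Title"] == "Same Title", "ISBN"]

# Row position of each ISBN in the matrix; -1 marks an ISBN that is absent
positions = toy_matrix.index.get_indexer(isbns)
print(positions)
```

Positions equal to -1 should be filtered out before they are used to index the arrays returned by `kneighbors`.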

In [None]:
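def top_similar_books(book_title, n=5):
    isbns = books.loc[books['Book-Title'] == book_title, 'ISBN']
    isbns = isbns[isbns.isin(df_books_reindexed.index)]  # keep ISBNs present in the fitted matrix
    if isbns.empty:
        return []  # the title has no ratings in the filtered sample
    query = isbns.iloc[0]
    # ask for n + 1 neighbors because the query book is usually among them
    dist, idx = nn_isbn.kneighbors(df_books_reindexed.loc[[query]], n_neighbors=n + 1)
    titles = books.drop_duplicates('ISBN').set_index('ISBN')['Book-Title']
    neighbors = zip(df_books_reindexed.index[idx[0]], dist[0])
    similar = [(titles.get(i, i), d) for i, d in neighbors if i != query][:n]
    for title, d in similar:
        print(f'{title}: distance {d:.2f}')
    return similar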

* Apply the function to a book of your choice
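As a rough end-to-end illustration of this step, the sketch below builds a miniature version of the whole pipeline on invented data and applies a lookup function to one title. All names (`toy_books`, `toy_ratings`, `toy_matrix`, `toy_model`, `toy_similar_books`) and all values are hypothetical; it is not the notebook's own result.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical miniature data; titles, ISBNs and ratings are invented.
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C", "D"],
    "Book-Title": ["Alpha", "Beta", "Gamma", "Delta"],
})
toy_ratings = pd.DataFrame({
    "User-ID":     [1, 2, 1, 2, 3, 3, 4],
    "ISBN":        ["A", "A", "B", "B", "C", "D", "D"],
    "Book-Rating": [9, 8, 9, 7, 10, 9, 2],
})

# Item-by-user matrix, with missing ratings filled with 0 as in the notebook
toy_matrix = toy_ratings.pivot_table(
    index="ISBN", columns="User-ID", values="Book-Rating"
).fillna(0)
toy_model = NearestNeighbors(n_neighbors=3).fit(toy_matrix.to_numpy())
toy_titles = toy_books.set_index("ISBN")["Book-Title"]

def toy_similar_books(title, n=2):
    """Return up to n (title, distance) pairs for the nearest other books."""
    isbn = toy_titles[toy_titles == title].index[0]
    dist, idx = toy_model.kneighbors(
        toy_matrix.loc[[isbn]].to_numpy(), n_neighbors=n + 1
    )
    neighbors = zip(toy_matrix.index[idx[0]], dist[0])
    return [(toy_titles[i], float(d)) for i, d in neighbors if i != isbn][:n]

result = toy_similar_books("Alpha")
for name, distance in result:
    print(f"{name}: distance {distance:.2f}")
```

"Beta" comes first because it was rated almost identically by the same two users. Filling missing ratings with 0 makes unrated books look like low-rated ones, so the distances should be read with that in mind.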