<h1>Item-based collaborative filtering</h1>



__Description__

Goal: develop a recommender system model to recommend books based on reader ratings using singular-value decomposition (SVD) and correlation coefficients.

Technique: item-based collaborative filtering

Dataset: the Goodbooks-10k dataset, which consists of two files:
1. 'ratings.csv': contains book ratings from individual users
2. 'books.csv': contains information about each book, such as average rating, ISBN, author, ...

Dataset source: [link](https://github.com/zygmuntz/goodbooks-10k)


# Packages and data





## Load packages and dataset

In [17]:
import pandas as pd
import numpy as np

import os, gc, warnings
warnings.filterwarnings('ignore')


from sklearn.decomposition import TruncatedSVD

In [16]:
# list the files in the data directory
root = '/content/drive/MyDrive/Projects/RecommendationSystems/Book1/data/p01'
print(os.listdir(root))
books = pd.read_csv(root+'/books.csv')
ratings = pd.read_csv(root+'/ratings.csv')
books.shape, ratings.shape

['books.csv', 'ratings.csv']


((10000, 23), (5976479, 3))

In [29]:
df = pd.merge(ratings, books, on='book_id')
df.shape

(5976479, 25)

## Explore dataset

In [18]:
books.columns, ratings.columns

(Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
        'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
        'original_title', 'title', 'language_code', 'average_rating',
        'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
        'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
        'image_url', 'small_image_url'],
       dtype='object'),
 Index(['user_id', 'book_id', 'rating'], dtype='object'))

In [28]:
display(books.head())
display(ratings.head())

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


In [25]:
# check each dataset for missing data

# books
print("books dataset - missing data")

# print those with missing values only
display(books.isnull().sum()[books.isnull().sum()>0])


# ratings
print("ratings dataset - missing data")

# print those with missing values only
display(ratings.isnull().sum()[ratings.isnull().sum()>0])

books dataset - missing data


Unnamed: 0,0
isbn,700
isbn13,585
original_publication_year,21
original_title,585
language_code,1084


ratings dataset - missing data


Unnamed: 0,0
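
The missing-value check above can also be written so that the null counts are computed once and shown next to their share of rows. This is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not the real `books` data):

```python
import pandas as pd

# Toy stand-in for the books table (values are made up).
toy = pd.DataFrame({
    "isbn": ["111", None, "333", None],
    "title": ["A", "B", "C", "D"],
    "language_code": ["eng", "eng", None, "eng"],
})

# Count nulls per column once, then keep only columns that have any.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]

# Report the counts together with the percentage of rows affected.
summary = pd.DataFrame({
    "missing": null_counts,
    "percent": (100 * null_counts / len(toy)).round(1),
})
print(summary)
```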


## Summary

__EDA__


__books__
- shape: (10000, 23)
- columns:
        `['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
        'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
        'original_title', 'title', 'language_code', 'average_rating',
        'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
        'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
        'image_url', 'small_image_url']`

__ratings__
- shape: (5976479, 3)
- columns: `['user_id', 'book_id', 'rating']`


---
__Preprocessing__: the two dataframes are merged on `book_id`.
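
To illustrate the merge step, here is a minimal sketch on toy data (the frames `ratings_toy` and `books_toy` and their values are made up). The `validate` argument is an optional safeguard that is not used in the notebook; it raises an error if `book_id` is not unique in the books table, which would otherwise silently duplicate ratings.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and books.csv (values are made up).
ratings_toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "book_id": [10, 11, 10, 12],
    "rating":  [5, 3, 4, 2],
})
books_toy = pd.DataFrame({
    "book_id": [10, 11, 12],
    "title":   ["Book A", "Book B", "Book C"],
})

# Inner join on book_id (the pd.merge default).
# validate="many_to_one" checks that book_id is unique in books_toy.
merged = pd.merge(ratings_toy, books_toy, on="book_id", validate="many_to_one")

print(merged.shape)  # (4, 4): one row per rating, columns from both frames
```

The same column arithmetic explains the real result: 3 rating columns plus 23 book columns, minus the shared `book_id`, gives 25.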



# Develop item-based recommender system

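Approach:
1. Create the utility matrix: use a pivot table to store the rating each reader gave each book. Rows: users, columns: book titles, values: ratings (0 where a user has not rated a book). Shape = (53424, 9964).
2. Apply truncated singular-value decomposition (SVD) to compress the user dimension of the transposed utility matrix. Unlike PCA, `TruncatedSVD` does not center the data, which makes it suitable for this sparse, mostly zero matrix. Note that unrated books are treated as zeros, not as missing values.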
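
As a sketch of what step 2 computes (the symbols are notation introduced here for illustration, not taken from the notebook): let $X$ be the transposed utility matrix, with one row per book and one column per user. A rank-$k$ truncated SVD approximates it as

$$
X \approx U_k \Sigma_k V_k^\top,
\qquad
X \in \mathbb{R}^{9964 \times 53424},\quad
U_k \in \mathbb{R}^{9964 \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{53424 \times k}
$$

The reduced representation is $U_k \Sigma_k$ (equal to $X V_k$ for an exact decomposition), so each book is described by a $k$-dimensional vector instead of 53424 user ratings.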


In [31]:
# step 1: create utility matrix
utilit_matrix = df.pivot_table(index='user_id', columns='title', values='rating', fill_value= 0)
display(utilit_matrix.shape, utilit_matrix.head())

(53424, 9964)

title,"Angels (Walsh Family, #3)","""حكايات فرغلي المستكاوي ""حكايتى مع كفر السحلاوية",#GIRLBOSS,'Salem's Lot,"'Tis (Frank McCourt, #2)","1,000 Places to See Before You Die",1/4 جرام,"10% Happier: How I Tamed the Voice in My Head, Reduced Stress Without Losing My Edge, and Found Self-Help That Actually Works","100 Bullets, Vol. 1: First Shot, Last Call",100 Love Sonnets,...,محال,مخطوطة بن إسحاق: مدينة الموتى,نادي السيارات,هشت کتاب,هيبتا,واحة الغروب,يوتوبيا,ڤيرتيجو,キスよりも早く1 [Kisu Yorimo Hayaku 1] (Faster than a Kiss #1),美少女戦士セーラームーン新装版 1 [Bishōjo Senshi Sailor Moon Shinsōban 1]
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
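
The dense pivot table above has 53424 × 9964 cells, about 4.3 GB if stored as 64-bit floats, while at most 5,976,479 of them (about 1.1%) hold a rating. One possible memory-saving alternative, which the notebook does not use, is to build the same matrix in sparse form. The sketch below uses made-up toy data (`df_toy`) and assumes SciPy is available; `TruncatedSVD` accepts SciPy sparse matrices directly.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the merged dataframe (values are made up).
df_toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title":   ["Book A", "Book B", "Book A", "Book C", "Book C"],
    "rating":  [5, 3, 4, 2, 4],
})

# Average duplicate (user, title) pairs, as pivot_table's default aggfunc="mean" does.
agg = df_toy.groupby(["user_id", "title"], as_index=False)["rating"].mean()

# Map users and titles to integer positions; sort=True follows sorted label order,
# which matches the row/column order of the pivot table.
user_codes, user_index = pd.factorize(agg["user_id"], sort=True)
title_codes, title_index = pd.factorize(agg["title"], sort=True)

sparse_utility = csr_matrix(
    (agg["rating"].to_numpy(dtype=float), (user_codes, title_codes)),
    shape=(len(user_index), len(title_index)),
)

# Check against the dense pivot table on the same toy data.
dense = df_toy.pivot_table(index="user_id", columns="title", values="rating", fill_value=0)
print(np.allclose(sparse_utility.toarray(), dense.to_numpy()))
```

Only the stored ratings take up memory in `sparse_utility`; `title_index` plays the role of the pivot table's column labels.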


In [37]:
books_name = utilit_matrix.columns
books_name

Index([' Angels (Walsh Family, #3)',
       '"حكايات فرغلي المستكاوي "حكايتى مع كفر السحلاوية', '#GIRLBOSS',
       ''Salem's Lot', ''Tis (Frank McCourt, #2)',
       '1,000 Places to See Before You Die', '1/4 جرام',
       '10% Happier: How I Tamed the Voice in My Head, Reduced Stress Without Losing My Edge, and Found Self-Help That Actually Works',
       '100 Bullets, Vol. 1: First Shot, Last Call', '100 Love Sonnets',
       ...
       'محال', 'مخطوطة بن إسحاق: مدينة الموتى', 'نادي السيارات', 'هشت کتاب',
       'هيبتا', 'واحة الغروب', 'يوتوبيا', 'ڤيرتيجو',
       'キスよりも早く1 [Kisu Yorimo Hayaku 1] (Faster than a Kiss #1)',
       '美少女戦士セーラームーン新装版 1 [Bishōjo Senshi Sailor Moon Shinsōban 1]'],
      dtype='object', name='title', length=9964)

In [32]:
# step 2. employ SVD to compress users
# transpose so that rows are books and columns are users
X = utilit_matrix.T

SVD = TruncatedSVD(n_components=40, random_state=17)  # reduce the user dimension to 40 components
matrix_transposed = SVD.fit_transform(X)
matrix_transposed.shape

(9964, 40)
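
The notebook does not say how the number of components was chosen. One common check, sketched here on random toy data (`X_toy` is made up and much smaller than the real matrix), is to look at how much variance the retained components explain through the fitted estimator's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(17)

# Toy item-by-user matrix: 30 "books" x 200 "users", about 10% of cells rated (made-up data).
ratings = rng.integers(1, 6, size=(30, 200))
rated = rng.random((30, 200)) < 0.1
X_toy = (ratings * rated).astype(float)

svd = TruncatedSVD(n_components=5, random_state=17)
embedding = svd.fit_transform(X_toy)

print(embedding.shape)                          # (30, 5): one 5-dimensional vector per book
print(svd.explained_variance_ratio_.cumsum())   # cumulative share of variance retained
```

A low cumulative value would suggest that more components are needed to preserve the rating structure.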

# Perform inference

Compute the correlation between a given book and every other book, based on the books' SVD-compressed rating profiles.

The target book title is '1984'.

Two sets of recommendations are generated, using two correlation thresholds:
1. above 0.9
2. above 0.7

In [33]:
# Inference: compute the Pearson correlation between every pair of books
# (np.corrcoef treats each row of matrix_transposed, i.e. each book, as a variable)
# target book:  1984 written by George Orwell with rating 5

corr_matrix = np.corrcoef(matrix_transposed)
corr_matrix.shape

(9964, 9964)
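
To make the correlation step concrete, the following sketch checks on a tiny made-up array (`emb`, standing in for the book embeddings) that `np.corrcoef` correlates rows with each other and agrees with a hand-written Pearson coefficient. The helper `pearson` is hypothetical and exists only for this comparison.

```python
import numpy as np

# Toy "book embeddings": 3 books x 4 components (values are made up).
emb = np.array([
    [1.0, 2.0, 3.0, 4.0],   # book 0
    [2.0, 4.0, 6.0, 8.5],   # book 1: nearly proportional to book 0
    [4.0, 3.0, 2.0, 1.0],   # book 2: exactly reversed trend of book 0
])

corr = np.corrcoef(emb)  # each row is treated as one variable
print(corr.shape)        # (3, 3): one entry per pair of books

def pearson(a, b):
    """Pearson correlation of two 1-D arrays, written out explicitly."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c))

print(np.isclose(corr[0, 1], pearson(emb[0], emb[1])))
print(np.isclose(corr[0, 2], -1.0))
```

This is why the real correlation matrix is 9964 × 9964: one row and one column per book title.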

In [34]:
books_list = list(utilit_matrix.columns)
books_1984 = books_list.index('1984')
books_1984

28

In [39]:
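# Find books whose correlation with '1984' is between 0.9 and 1
# ('1984' itself appears in the result: floating-point rounding leaves its self-correlation just below 1.0)
coor_1984 = corr_matrix[books_1984]
list(books_name[(coor_1984<1.0) & (coor_1984>0.9)])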

['1984',
 'Animal Farm',
 'Brave New World',
 'Fahrenheit 451',
 'Lord of the Flies']
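
The result above includes '1984' itself even though the filter requires a correlation below 1.0, so the `< 1.0` condition is not a reliable way to drop the target book. A minimal sketch of an alternative is to exclude the target by its index and sort the matches by correlation. The function `top_correlated` and the toy inputs are hypothetical and not part of the notebook.

```python
import numpy as np

def top_correlated(corr_matrix, titles, target_title, threshold=0.9):
    """Return (title, correlation) pairs above threshold, best first, excluding the target."""
    titles = list(titles)
    target_idx = titles.index(target_title)
    scores = corr_matrix[target_idx]
    order = np.argsort(scores)[::-1]  # indices sorted by descending correlation
    return [
        (titles[i], float(scores[i]))
        for i in order
        if i != target_idx and scores[i] > threshold
    ]

# Toy correlation matrix for three titles (values are made up).
toy_titles = ["1984", "Animal Farm", "Twilight"]
toy_corr = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.05],
    [0.10, 0.05, 1.00],
])

print(top_correlated(toy_corr, toy_titles, "1984"))
# [('Animal Farm', 0.95)]
```

With the notebook's variables, the equivalent call would be `top_correlated(corr_matrix, books_name, '1984', threshold=0.9)`.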

In [40]:
# Find books whose correlation with '1984' is above 0.7
coor_1984 = corr_matrix[books_1984]
list(books_name[(coor_1984<1.0) & (coor_1984>0.7)])

['1984',
 'A Clockwork Orange',
 'All Quiet on the Western Front',
 'Animal Farm',
 'Animal Farm / 1984',
 'Brave New World',
 'Brave New World / Brave New World Revisited',
 'Brave New World Revisited ',
 'Burmese Days',
 "Cat's Cradle",
 'Catch-22',
 'Crime and Punishment',
 'Down and Out in Paris and London',
 'Fahrenheit 451',
 'Fahrenheit 451: The Authorized Adaptation',
 'Fight Club',
 'Frankenstein',
 'Island',
 'Keep the Aspidistra Flying',
 "Kurt Vonnegut's Cat's Cradle (Modern Critical Interpretations)",
 'Lolita',
 'Lord of the Flies',
 'Moby-Dick or, The Whale',
 'Of Mice and Men',
 "One Flew Over the Cuckoo's Nest",
 'One Hundred Years of Solitude',
 'Slaughterhouse-Five',
 'The Adventures of Huckleberry Finn',
 'The Adventures of Tom Sawyer & Adventures of Huckleberry Finn',
 'The Catcher in the Rye',
 'The Chrysalids',
 'The Grapes of Wrath',
 'The Great Gatsby',
 "The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy, #1)",
 'The Metamorphosis',
 'The M

## Summary

This model uses the Pearson correlation coefficient to measure how similar two books are, based on reader ratings compressed with SVD.

For the book title '1984', the model looks up that book's row in the correlation matrix (the Pearson correlation between every pair of books) and recommends the books whose correlation with it exceeds a chosen threshold.