# Book Recommendation System

- Content-based filtering with Title, Author, Publisher, and Category as candidate features (the TF-IDF step in this notebook uses Category only)

## About Dataset
The dataset contains 1,149,780 ratings (explicit and implicit) given by 278,858 users to 271,379 books.
- user_id - user ID
- location - user's location/address
- age - user's age
- isbn - the book's ISBN (International Standard Book Number)
- rating - rating given to the book
- book_title - book title
- book_author - book author
- year_of_publication - year the book was published
- publisher - book publisher
- img_s - book cover image (small)
- img_m - book cover image (medium)
- img_l - book cover image (large)
- Summary - book summary/synopsis
- Language - language the book is written in
- Category - book category
- city - user's city
- state - user's state
- country - user's country

## Libraries

In [None]:
%pip install opendatasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import os
import re
import nltk
import requests
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import opendatasets as od

from nltk.corpus import stopwords
nltk.download("stopwords")

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from PIL import Image

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load and Check Dataset

In [None]:
od.download('https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset')

Skipping, found downloaded files in "./bookcrossing-dataset" (use force=True to force download)


In [None]:
books = pd.read_csv('/content/bookcrossing-dataset/Books Data with Category Language and Summary/Preprocessed_data.csv')
books.head(2)

Unnamed: 0.1,Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science'],stockton,california,usa
1,1,8,"timmins, ontario, canada",34.7439,2005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],timmins,ontario,canada


In [None]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1031175 entries, 0 to 1031174
Data columns (total 19 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   Unnamed: 0           1031175 non-null  int64  
 1   user_id              1031175 non-null  int64  
 2   location             1031175 non-null  object 
 3   age                  1031175 non-null  float64
 4   isbn                 1031175 non-null  object 
 5   rating               1031175 non-null  int64  
 6   book_title           1031175 non-null  object 
 7   book_author          1031175 non-null  object 
 8   year_of_publication  1031175 non-null  float64
 9   publisher            1031175 non-null  object 
 10  img_s                1031175 non-null  object 
 11  img_m                1031175 non-null  object 
 12  img_l                1031175 non-null  object 
 13  Summary              1031175 non-null  object 
 14  Language             1031175 non-null  object 
 15

In [None]:
print(sorted(books.rating.unique()))
print()
print(books.rating.value_counts())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

0     647323
8      91806
10     71227
7      66404
9      60780
5      45355
6      31689
4       7617
3       5118
2       2375
1       1481
Name: rating, dtype: int64
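
The counts above show that rating 0 accounts for 647,323 of the 1,031,175 rows (about 63%). In the Book-Crossing data a 0 marks an implicit interaction rather than a score on the 1-10 scale, which is presumably why the preprocessing step later drops those rows. A minimal sketch of measuring that split, using a made-up `toy` frame rather than the real `books` data:

```python
import pandas as pd

# Toy ratings invented for illustration; the notebook itself uses `books.rating`.
toy = pd.DataFrame({"rating": [0, 0, 0, 8, 10, 7, 0, 5]})

# Share of implicit (0) ratings, and the explicit (1-10) rows that remain.
implicit_share = (toy["rating"] == 0).mean()
explicit = toy[toy["rating"] > 0]

print(f"implicit share: {implicit_share:.2f}")
print("explicit rows:", len(explicit))
```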


In [None]:
books.isnull().sum()

Unnamed: 0                 0
user_id                    0
location                   0
age                        0
isbn                       0
rating                     0
book_title                 0
book_author                0
year_of_publication        0
publisher                  0
img_s                      0
img_m                      0
img_l                      0
Summary                    0
Language                   0
Category                   0
city                   14103
state                  22798
country                35374
dtype: int64

In [None]:
print(books.Category.unique())
print()
print(books.Category.value_counts().index)

["['Social Science']" "['Actresses']" "['1940-1949']" ...
 "['Microsoft Windows NT.']" "['Merchants']" "['Alternative histories']"]

Index(['9', '['Fiction']', '['Juvenile Fiction']',
       '['Biography & Autobiography']', '['Humor']', '['History']',
       '['Religion']', '['Juvenile Nonfiction']', '['Social Science']',
       '['Body, Mind & Spirit']',
       ...
       '['Human-alien encounters.']', '['Adel']',
       '['Brobdingnag (Imaginary place)']', '['Devotional literature.']',
       '['Tourism']', '['Angel (Fictitious character : Whedon)']', '['Face']',
       '['Church renewal']', '['Supermarkets']', '['Alternative histories']'],
      dtype='object', length=6448)


In [None]:
# Title, Author, Publisher, Category as features
books.publisher.value_counts()

Ballantine Books                                 34724
Pocket                                           31989
Berkley Publishing Group                         28614
Warner Books                                     25506
Harlequin                                        25029
                                                 ...  
Langley Press, Incorporated                          1
Division of Archives and Hist Tural Resources        1
Terra Nova Press                                     1
Editorial Mileto                                     1
Lone Star Books                                      1
Name: publisher, Length: 16729, dtype: int64

## Preprocessing

In [None]:
df = books.copy()
df.dropna(inplace=True, how='any', axis=0)
df.reset_index(drop=True, inplace=True)
# The columns dropped here are not used in the rest of the notebook
df.drop(columns=['Unnamed: 0', 'location', 'isbn',
                 'img_s', 'img_m', 'img_l', 'city', 'age',
                 'state', 'Language', 'country',
                 'year_of_publication', 'Summary'], inplace=True)
# '9' is not a real category label (see the value_counts output above)
df.drop(index=df[df.Category == '9'].index, inplace=True)
# Keep only rows with a non-zero rating
df.drop(index=df[df.rating == 0].index, inplace=True)
# Strip brackets, quotes, and punctuation; raw string avoids an invalid-escape warning
df.Category = df.Category.apply(lambda x: re.sub(r'[\W_]+', ' ', x).strip())
df.head()

Unnamed: 0,user_id,rating,book_title,book_author,publisher,Category
1,8,5,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,Actresses
4,67544,8,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,Actresses
7,123629,9,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,Actresses
9,200273,8,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,Actresses
10,210926,9,Clara Callan,Richard Bruce Wright,HarperFlamingo Canada,Actresses
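
The `Category` column stores each label as a stringified list such as `['Social Science']`, and the regex in the cell above collapses every run of non-word characters into a single space. A small sketch of that one step, applied to three strings taken from the outputs shown earlier:

```python
import re

# Raw category strings as they appear in the dataset output above.
raw = ["['Social Science']", "['Body, Mind & Spirit']", "['Biography & Autobiography']"]

# Same pattern as the notebook: each run of non-word characters or underscores
# becomes one space, then leading and trailing spaces are stripped.
cleaned = [re.sub(r'[\W_]+', ' ', x).strip() for x in raw]
print(cleaned)
```

Note that commas and ampersands disappear too, which is why the later listing shows `Body Mind Spirit` and `Biography Autobiography`.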


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217314 entries, 1 to 982277
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   user_id      217314 non-null  int64 
 1   rating       217314 non-null  int64 
 2   book_title   217314 non-null  object
 3   book_author  217314 non-null  object
 4   publisher    217314 non-null  object
 5   Category     217314 non-null  object
dtypes: int64(2), object(4)
memory usage: 11.6+ MB


In [None]:
df.isnull().sum()

user_id        0
rating         0
book_title     0
book_author    0
publisher      0
Category       0
dtype: int64

In [None]:
# df.Category.value_counts()
# Print the 24 most frequent categories with their counts
top_categories = df['Category'].value_counts().head(24)
for i, (name, count) in enumerate(top_categories.items(), start=1):
    print(i)
    print('Name :', name)
    print('Counts :', count)
    print('---'*8)

1
Name : Fiction
Counts : 127055
------------------------
2
Name : Juvenile Fiction
Counts : 14181
------------------------
3
Name : Biography Autobiography
Counts : 8876
------------------------
4
Name : Humor
Counts : 3721
------------------------
5
Name : History
Counts : 3121
------------------------
6
Name : Religion
Counts : 2843
------------------------
7
Name : Body Mind Spirit
Counts : 1999
------------------------
8
Name : Juvenile Nonfiction
Counts : 1955
------------------------
9
Name : Social Science
Counts : 1937
------------------------
10
Name : Business Economics
Counts : 1801
------------------------
11
Name : Family Relationships
Counts : 1671
------------------------
12
Name : Self Help
Counts : 1644
------------------------
13
Name : Health Fitness
Counts : 1514
------------------------
14
Name : Cooking
Counts : 1325
------------------------
15
Name : Travel
Counts : 1161
------------------------
16
Name : Poetry
Counts : 985
------------------------
17
Name : Tr

In [None]:
cat_list = df.Category.value_counts().index.tolist()
print(cat_list[5:20])

['Religion', 'Body Mind Spirit', 'Juvenile Nonfiction', 'Social Science', 'Business Economics', 'Family Relationships', 'Self Help', 'Health Fitness', 'Cooking', 'Travel', 'Poetry', 'True Crime', 'Psychology', 'Science', 'Computers']


In [None]:
df_fil = df[df.Category.isin(cat_list[5:20])]
df_fil.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22576 entries, 694 to 982241
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      22576 non-null  int64 
 1   rating       22576 non-null  int64 
 2   book_title   22576 non-null  object
 3   book_author  22576 non-null  object
 4   publisher    22576 non-null  object
 5   Category     22576 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.2+ MB


In [None]:
df_fil.Category.nunique()

15
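
The slice `cat_list[5:20]` keeps the categories ranked 6th to 20th by frequency and therefore skips the five largest ones (Fiction, Juvenile Fiction, Biography Autobiography, Humor, History). A minimal sketch of the same rank-based selection on an invented `toy` frame, with a shorter slice so the effect is easy to see:

```python
import pandas as pd

# Toy data invented for illustration; category counts are 5, 3, 2, 1.
toy = pd.DataFrame(
    {"Category": ["Fiction"] * 5 + ["Humor"] * 3 + ["Travel"] * 2 + ["Poetry"]}
)

# value_counts() sorts by frequency, so list position equals popularity rank.
ranked = toy["Category"].value_counts().index.tolist()

# Skip the most frequent category and keep the next two (the notebook uses [5:20]).
keep = ranked[1:3]
subset = toy[toy["Category"].isin(keep)]
print(keep, len(subset))
```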

In [None]:
prep = df_fil.copy()
prep.sort_values('book_title')

Unnamed: 0,user_id,rating,book_title,book_author,publisher,Category
958963,237883,9,Microsoft Application Architecture For Micros...,Microsoft Corporation Staff,Microsoft Press,Computers
862828,131193,8,$30 Film School,Michael W. Dean,Muska & Lipman Publishing,Computers
614402,31826,10,"1,000 Makers of the Millennium: The Men and Wo...",Dorling Kindersley Publishing,Dorling Kindersley,Juvenile Nonfiction
518889,115161,10,"1,000 Places to See Before You Die",Patricia Schultz,Workman Publishing,Travel
518893,149153,7,"1,000 Places to See Before You Die",Patricia Schultz,Workman Publishing,Travel
...,...,...,...,...,...,...
944737,216795,8,how to stop time : heroin from A to Z,Ann Marlowe,Basic Books,Psychology
788608,87141,8,sed & awk (2nd Edition),Dale Dougherty,O'Reilly,Computers
390374,240054,9,"street bible, the",Robert Lacey,Zondervan Publishing Company,Religion
672661,217211,6,teach yourself...C++,Al Stevens,John Wiley & Sons Inc,Computers


In [None]:
prep = prep.drop_duplicates('book_title')
prep.info()
print()
prep.head(4)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13048 entries, 694 to 982241
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      13048 non-null  int64 
 1   rating       13048 non-null  int64 
 2   book_title   13048 non-null  object
 3   book_author  13048 non-null  object
 4   publisher    13048 non-null  object
 5   Category     13048 non-null  object
dtypes: int64(2), object(4)
memory usage: 713.6+ KB



Unnamed: 0,user_id,rating,book_title,book_author,publisher,Category
694,6366,7,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,Ryland Peters & Small Ltd,Cooking
4680,157475,8,The Therapeutic Touch: How to Use Your Hands t...,Dolores Krieger,Fireside,Health Fitness
6002,64010,7,The Dragons of Eden: Speculations on the Evolu...,Carl Sagan,Ballantine Books,Science
8561,99,9,McDonald's: Behind the Arches,John F. Love,Bantam,Business Economics


In [None]:
prep['Category'] = prep['Category'].str.replace(' ', '_')
prep.head(10)

Unnamed: 0,user_id,rating,book_title,book_author,publisher,Category
694,6366,7,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,Ryland Peters & Small Ltd,Cooking
4680,157475,8,The Therapeutic Touch: How to Use Your Hands t...,Dolores Krieger,Fireside,Health_Fitness
6002,64010,7,The Dragons of Eden: Speculations on the Evolu...,Carl Sagan,Ballantine Books,Science
8561,99,9,McDonald's: Behind the Arches,John F. Love,Bantam,Business_Economics
8565,99,10,Creating Wealth : Retire in Ten Years Using Al...,Robert G. Allen,Fireside,Business_Economics
11962,190,7,Keep It Simple: And Get More Out of Life,Nick Page,Trafalgar Square,Self_Help
12584,114629,5,"If Singleness Is a Gift, What's the Return Pol...",Holly Virden,Nelson Books,Religion
23537,243,5,Chicken Soup for the Soul (Chicken Soup for th...,Jack Canfield,Health Communications,Self_Help
25362,254,7,Amazing Grace : Lives of Children and the Cons...,Jonathan Kozol,Perennial,Social_Science
27120,33517,9,Dictionary of Superstitions,David Pickering,Sterling Pub Co Inc,Social_Science


In [None]:
book_title = prep['book_title'].tolist()
book_cat = prep['Category'].tolist()
book_pub = prep['publisher'].tolist()
book_author = prep['book_author'].tolist()

print(len(book_title))
print(len(book_cat))
print(len(book_pub))
print(len(book_author))

13048
13048
13048
13048


In [None]:
book_new = pd.DataFrame({
    'title': book_title,
    'author': book_author,
    'category': book_cat,
    'publisher': book_pub
})
book_new

Unnamed: 0,title,author,category,publisher
0,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,Cooking,Ryland Peters & Small Ltd
1,The Therapeutic Touch: How to Use Your Hands t...,Dolores Krieger,Health_Fitness,Fireside
2,The Dragons of Eden: Speculations on the Evolu...,Carl Sagan,Science,Ballantine Books
3,McDonald's: Behind the Arches,John F. Love,Business_Economics,Bantam
4,Creating Wealth : Retire in Ten Years Using Al...,Robert G. Allen,Business_Economics,Fireside
...,...,...,...,...
13043,"Remote Perceptions: Out-Of-Body Experiences, R...",Angela Thompson Smith,Body_Mind_Spirit,Hampton Roads Publishing Co.
13044,Who Speaks for Wolf: A Native American Learnin...,Paula Underwood,Social_Science,Tribe of Two Pr
13045,On Becoming Childwise,Gary Ezzo,Family_Relationships,Multnomah
13046,"Frommer's 2000 Bahamas (Frommer's Bahamas, 2000)",Arthur Frommer,Travel,"Hungry Minds, Inc"


### TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the TF-IDF vectorizer
tf = TfidfVectorizer()

In [None]:
# Compute the IDF weights on the category data
tf.fit(book_new['category'])

# Array mapping integer feature indices to feature names
tf.get_feature_names_out()


array(['body_mind_spirit', 'business_economics', 'computers', 'cooking',
       'family_relationships', 'health_fitness', 'juvenile_nonfiction',
       'poetry', 'psychology', 'religion', 'science', 'self_help',
       'social_science', 'travel', 'true_crime'], dtype=object)

In [None]:
# Fit, then transform the categories into a matrix
tfidf_matrix = tf.fit_transform(book_new['category'])

# Check the shape of the TF-IDF matrix
tfidf_matrix.shape

(13048, 15)

In [None]:
# Convert the TF-IDF vectors to a dense matrix with todense()
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 1., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# Build a dataframe to inspect the TF-IDF matrix
# Columns hold the book categories
# Rows hold the book titles
 
pd.DataFrame(
    tfidf_matrix.todense(), 
    columns=tf.get_feature_names_out(),
    index=book_new.title
).sample(5, axis=1).sample(10, axis=0)

title,health_fitness,religion,family_relationships,cooking,social_science
The Pain Tree,0.0,0.0,0.0,0.0,0.0
The Blue Bear : A True Story of Friendship and Discovery in the Alaskan Wild,0.0,0.0,0.0,0.0,0.0
Brave New Families: Stories of Domestic Upheaval in Late Twentieth Century America,0.0,0.0,1.0,0.0,0.0
Poetry for the Common Man,0.0,0.0,0.0,0.0,0.0
CHOCOLATE FOR A LOVER'S HEART : SOUL-SOOTHING STORIES THAT CELEBRATE THE POWER OF LOVE (Chocolate),0.0,0.0,0.0,0.0,0.0
House That Jack Built,0.0,0.0,0.0,0.0,0.0
"Turtles, Termites, and Traffic Jams: Explorations in Massively Parallel Microworlds (Complex Adaptive Systems)",0.0,0.0,0.0,0.0,0.0
Positive Solitude : A Practical Program for Mastering Loneliness and Achieving Self-Fulfillment,0.0,0.0,0.0,0.0,0.0
Bryson City Tales,0.0,1.0,0.0,0.0,0.0
Ottawa With the Kids,0.0,0.0,0.0,0.0,0.0
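
Every value in the sample above is either 0.0 or 1.0. That follows from the input: each book contributes exactly one category token, so after the vectorizer's default L2 normalisation each row has a single non-zero entry equal to 1. A minimal sketch with invented category values (`toy_categories` is not part of the notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy single-token documents, one category per book.
toy_categories = ["Cooking", "Travel", "Cooking", "Self_Help"]
matrix = TfidfVectorizer().fit_transform(toy_categories).toarray()

# Each row has exactly one non-zero entry, and normalisation scales it to 1.0,
# so the TF-IDF matrix is effectively a one-hot encoding of the category.
print(matrix)
```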


### Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
 
# Compute cosine similarity on the TF-IDF matrix
cosine_sim = cosine_similarity(tfidf_matrix) 
cosine_sim

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
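
Cosine similarity is the dot product of two rows divided by the product of their norms. Because the rows here are effectively one-hot, the score is 1 when two books share a category and 0 otherwise. A small sketch that computes it by hand on toy rows and compares the result with scikit-learn (the array `X` is made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy one-hot rows standing in for the TF-IDF matrix: Cooking, Travel, Cooking.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# Cosine similarity by hand: pairwise dot products divided by the product of row norms.
norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = (X @ X.T) / (norms * norms.T)

print(manual)
print(np.allclose(manual, cosine_similarity(X)))
```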

In [None]:
# Build a dataframe from cosine_sim with book titles as both rows and columns
cosine_sim_df = pd.DataFrame(cosine_sim, index=book_new['title'], columns=book_new['title'])
print('Shape:', cosine_sim_df.shape)

# Inspect the similarity matrix for a random sample of books
cosine_sim_df.sample(5, axis=1).sample(3, axis=0)

Shape: (13048, 13048)


title,A Book of Middle Eastern Food,Windows XP in a Nutshell,The Nature of Animal Healing : The Definitive Holistic Medicine Guide to Caring for Your Dog and Cat,Neanderthals at Work: How People and Politics Can Drive You Crazy...and What You Can Do About Them,Lonely Planet Provence & the Cote D'Azur (Lonely Planet Provence and the Cote D'azur)
Macromedia Flash MX for Dummies,0.0,1.0,0.0,0.0,0.0
The Sacred Yew (Arkana S.),0.0,0.0,0.0,0.0,0.0
"Julia Morgan, Architect of Dreams (Lerner Biographies)",0.0,0.0,0.0,0.0,0.0


### Getting Recommendations

In [None]:
def book_recommendation(nama_buku, similarity_data=cosine_sim_df, items=book_new[['title', 'category']], k=5):
    # Use argpartition to indirectly partition the similarity column so that the
    # k+1 largest scores sit, in sorted order, at the end of the index array
    # (k+1 because the queried book itself is normally among them)
    # The column is converted to a NumPy array first; range(start, stop, step)
    index = similarity_data.loc[:, nama_buku].to_numpy().argpartition(
        range(-1, -(k + 2), -1))

    # Take the titles with the highest similarity, from largest to smallest
    closest = similarity_data.columns[index[-1:-(k+2):-1]]

    # Drop nama_buku so the queried book does not appear in its own recommendations
    closest = closest.drop(nama_buku, errors='ignore')
    df = pd.DataFrame(closest).merge(items)
    df.drop_duplicates(keep='first', subset="title", inplace=True)
    return df.head(k)
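
The retrieval step relies on `argpartition`, which places the requested positions in their final sorted order without sorting the whole array. A standalone sketch of that idea on a made-up `scores` vector (the names and values here are illustrative, not taken from the notebook):

```python
import numpy as np

# Toy similarity scores for five items.
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
k = 2

# Partition so that the last k positions hold the k largest scores in sorted order.
order = scores.argpartition(range(-1, -(k + 1), -1))

# Read those positions from the end to get indices from largest to smallest.
top_k = order[-1:-(k + 1):-1]
print(top_k)
```

Only the positions named in the partition are guaranteed to be in place, so the slice should never read further back than the partition covers.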

In [None]:
book_new.head()

Unnamed: 0,title,author,category,publisher
0,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,Cooking,Ryland Peters & Small Ltd
1,The Therapeutic Touch: How to Use Your Hands t...,Dolores Krieger,Health_Fitness,Fireside
2,The Dragons of Eden: Speculations on the Evolu...,Carl Sagan,Science,Ballantine Books
3,McDonald's: Behind the Arches,John F. Love,Business_Economics,Bantam
4,Creating Wealth : Retire in Ten Years Using Al...,Robert G. Allen,Business_Economics,Fireside


In [None]:
book_new[book_new['title'].eq("Macromedia Flash MX for Dummies")]

Unnamed: 0,title,author,category,publisher
8160,Macromedia Flash MX for Dummies,Gurdy Leete,Computers,For Dummies


In [None]:
book_recommendation("Macromedia Flash MX for Dummies", k=10)

Unnamed: 0,title,category
0,Learning the vi Editor (6th Edition),Computers
1,Transact-SQL Programming,Computers
2,"Developing JavaBeans Using VisualAge for Java,...",Computers
3,Running Microsoft Frontpage 2000,Computers
4,"Flash 5.0: Graphics, Animation & Interactivity",Computers
5,Linux System Administration: A User's Guide,Computers
6,XML Complete,Computers
7,Introduction to MFC Programming with Visual C++,Computers
8,Visual Basic 3 for Dummies (For Dummies),Computers
9,Running Microsoft Excel 2000 (Running),Computers
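
All ten recommendations share the `Computers` category because the similarity is computed from the category alone, so every book in the same category ties at 1.0. One possible way to bring in the other features named at the top of the notebook is to concatenate them into a single text field before vectorizing. The sketch below is an assumption, not part of the original notebook: the `toy` frame, its values, and the `features` column are all hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue invented for illustration; all three books share one category.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "author": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "category": ["Computers", "Computers", "Computers"],
    "publisher": ["Acme Press", "Other House", "Other House"],
})

# Hypothetical helper column: join the text fields into one string per book.
toy["features"] = toy["author"] + " " + toy["category"] + " " + toy["publisher"]

# Similarity now reflects shared authors and publishers, not only the category.
sim = cosine_similarity(TfidfVectorizer().fit_transform(toy["features"]))
sim_df = pd.DataFrame(sim, index=toy["title"], columns=toy["title"])
print(sim_df.round(2))
```

With this variant, books in the same category no longer tie: Book A scores higher against Book B (same author) than against Book C (same category only).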
