# **Recommender System with Content-Based Filtering**

The dataset is a collection of book records with title, authors, description, rating, and other metadata.

# **Data Understanding**

**Download the dataset from Kaggle and unzip the dataset file**

In [1]:
!pip install -q kaggle

In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json


In [4]:
!kaggle datasets download -d dylanjcastillo/7k-books-with-metadata

7k-books-with-metadata.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
!mkdir -p book_dataset


In [6]:
!unzip -o 7k-books-with-metadata.zip -d book_dataset

Archive:  7k-books-with-metadata.zip

**Read the dataset with the pandas.read_csv function**

In [33]:
# Import libraries (only pandas and numpy are needed in this notebook)
import pandas as pd
import numpy as np

In [8]:
books = pd.read_csv('book_dataset/books.csv')


In [9]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6810 entries, 0 to 6809
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   isbn13          6810 non-null   int64  
 1   isbn10          6810 non-null   object 
 2   title           6810 non-null   object 
 3   subtitle        2381 non-null   object 
 4   authors         6738 non-null   object 
 5   categories      6711 non-null   object 
 6   thumbnail       6481 non-null   object 
 7   description     6548 non-null   object 
 8   published_year  6804 non-null   float64
 9   average_rating  6767 non-null   float64
 10  num_pages       6767 non-null   float64
 11  ratings_count   6767 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 638.6+ KB


The dataset contains 6,810 books, each described by its ISBN identifiers, title, subtitle, authors, categories, thumbnail URL, description, publication year, average rating, number of pages, and ratings count. The columns use three data types: int64, object, and float64.

In [10]:
books.head()

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,2005883,Gilead,,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0
1,9780002261982,2261987,Spider's Web,A Novel,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0
2,9780006163831,6163831,The One Tree,,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQaw...,Volume Two of Stephen Donaldson's acclaimed se...,1982.0,3.97,479.0,172.0
3,9780006178736,6178731,Rage of angels,,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0
4,9780006280897,6280897,The Four Loves,,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0


# **Data Preparation**

In [11]:
books.isnull().sum()

isbn13               0
isbn10               0
title                0
subtitle          4429
authors             72
categories          99
thumbnail          329
description        262
published_year       6
average_rating      43
num_pages           43
ratings_count       43
dtype: int64

Because many features have missing values, only the features that will be used for the recommender system are kept.
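One way to support this selection is to look at the share of missing values per column instead of the raw counts. The sketch below uses a small hypothetical frame (`toy`) and an assumed cutoff of 50% (`threshold`); neither comes from the notebook, which chooses its columns by hand.

```python
import pandas as pd

# Hypothetical toy frame standing in for the books table
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "subtitle": [None, None, "x", None],
    "authors": ["a", None, "c", "d"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Assumed cutoff: flag columns where more than half the values are missing
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(to_drop)
```

On the real data, `subtitle` (4,429 missing out of 6,810 rows) would be the only column above such a cutoff; the other dropped columns are removed because they are not used as features.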

In [12]:
books.drop(['isbn13', 'isbn10', 'subtitle', 'thumbnail', 'description', 'published_year', 'num_pages'], axis = 1, inplace = True)
books

Unnamed: 0,title,authors,categories,average_rating,ratings_count
0,Gilead,Marilynne Robinson,Fiction,3.85,361.0
1,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,3.83,5164.0
2,The One Tree,Stephen R. Donaldson,American fiction,3.97,172.0
3,Rage of angels,Sidney Sheldon,Fiction,3.93,29532.0
4,The Four Loves,Clive Staples Lewis,Christian life,4.15,33684.0
...,...,...,...,...,...
6805,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,4.51,104.0
6806,Secrets Of The Heart,Khalil Gibran,Mysticism,4.08,324.0
6807,Fahrenheit 451,Ray Bradbury,Book burning,3.98,5733.0
6808,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,0.00,0.0


In [13]:
books.isnull().sum()

title              0
authors           72
categories        99
average_rating    43
ratings_count     43
dtype: int64

In [14]:
len(books)

6810

- **Remove rows with missing values**

In [15]:
books_clean = books.dropna()
books_clean

Unnamed: 0,title,authors,categories,average_rating,ratings_count
0,Gilead,Marilynne Robinson,Fiction,3.85,361.0
1,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,3.83,5164.0
2,The One Tree,Stephen R. Donaldson,American fiction,3.97,172.0
3,Rage of angels,Sidney Sheldon,Fiction,3.93,29532.0
4,The Four Loves,Clive Staples Lewis,Christian life,4.15,33684.0
...,...,...,...,...,...
6805,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,4.51,104.0
6806,Secrets Of The Heart,Khalil Gibran,Mysticism,4.08,324.0
6807,Fahrenheit 451,Ray Bradbury,Book burning,3.98,5733.0
6808,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,0.00,0.0


In [16]:
books_clean.isnull().sum()

title             0
authors           0
categories        0
average_rating    0
ratings_count     0
dtype: int64

In [17]:
# Create the preparation variable from books_clean, then view it sorted by title
# (sort_values returns a sorted copy; preparation itself keeps its original order)
preparation = books_clean
preparation.sort_values('title')

Unnamed: 0,title,authors,categories,average_rating,ratings_count
1437,"""A"" is for Abductive",Leonard I. Sweet;Brian D. McLaren;Jerry Haselm...,Religion,3.17,48.0
4948,"""Gorgias"" and ""Phaedrus""",Plato,Philosophy,4.41,15.0
2398,"""Surely You're Joking, Mr. Feynman!""",Richard Phillips Feynman;Ralph Leighton,Biography & Autobiography,4.28,105685.0
539,"""Surely You're Joking, Mr. Feynman!""",Richard Phillips Feynman;Ralph Leighton,Physicists,4.28,1700.0
2404,"""What Do You Care what Other People Think?""",Richard Phillips Feynman;Ralph Leighton,BIOGRAPHY & AUTOBIOGRAPHY,4.27,15646.0
...,...,...,...,...,...
5607,זן ואומנות האושר,Chris Prentiss,Philosophy,4.04,2746.0
6794,ゲンジモノガタリ,紫式部,Japan,3.76,46.0
6795,五輪書 : 英文版,武蔵·宮本,Philosophy,4.08,430.0
6796,日本語の感情表現集,村上真美子,Foreign Language Study,3.54,27.0


In [18]:
# Drop rows with duplicate titles from preparation
preparation = preparation.drop_duplicates('title')
preparation

Unnamed: 0,title,authors,categories,average_rating,ratings_count
0,Gilead,Marilynne Robinson,Fiction,3.85,361.0
1,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,3.83,5164.0
2,The One Tree,Stephen R. Donaldson,American fiction,3.97,172.0
3,Rage of angels,Sidney Sheldon,Fiction,3.93,29532.0
4,The Four Loves,Clive Staples Lewis,Christian life,4.15,33684.0
...,...,...,...,...,...
6804,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,3.82,1568.0
6805,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,4.51,104.0
6806,Secrets Of The Heart,Khalil Gibran,Mysticism,4.08,324.0
6808,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,0.00,0.0


In [19]:
# Convert the 'title' series to a list
books_title = preparation['title'].tolist()

# Convert the 'authors' series to a list
books_authors = preparation['authors'].tolist()

# Convert the 'categories' series to a list
books_categories = preparation['categories'].tolist()

# Convert the 'average_rating' series to a list
books_average_rating = preparation['average_rating'].tolist()

# Convert the 'ratings_count' series to a list
books_ratings_count = preparation['ratings_count'].tolist()

print(len(books_title))
print(len(books_authors))
print(len(books_categories))
print(len(books_average_rating))
print(len(books_ratings_count))

6217
6217
6217
6217
6217


In [20]:
# Build a DataFrame from the 'books_title', 'books_authors', 'books_categories', 'books_average_rating' and 'books_ratings_count' lists
books_clear = pd.DataFrame({
    'title': books_title,
    'authors': books_authors,
    'categories': books_categories,
    'average_rating' : books_average_rating,
    'ratings_count' : books_ratings_count
})
books_clear

Unnamed: 0,title,authors,categories,average_rating,ratings_count
0,Gilead,Marilynne Robinson,Fiction,3.85,361.0
1,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,3.83,5164.0
2,The One Tree,Stephen R. Donaldson,American fiction,3.97,172.0
3,Rage of angels,Sidney Sheldon,Fiction,3.93,29532.0
4,The Four Loves,Clive Staples Lewis,Christian life,4.15,33684.0
...,...,...,...,...,...
6212,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,3.82,1568.0
6213,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,4.51,104.0
6214,Secrets Of The Heart,Khalil Gibran,Mysticism,4.08,324.0
6215,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,0.00,0.0


# **Model Development with Content-Based Filtering**

In [21]:
data = books_clear
data.sample(5)

Unnamed: 0,title,authors,categories,average_rating,ratings_count
2466,Hideaway,Dean Ray Koontz,Fiction,3.83,29475.0
5303,The Plot Against America,Philip Roth,Fiction,3.75,31270.0
1520,When the Lion Feeds,Wilbur Smith,Fiction,4.18,9802.0
5395,Tunnel in the Sky,Robert A. Heinlein,Fiction,3.93,9996.0
1808,Trump,Donald Trump,Biography & Autobiography,3.65,11727.0


**TF-IDF Vectorizer**

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tf = TfidfVectorizer()

# Compute the IDF weights on the categories column
tf.fit(data['categories'])

# Map integer feature indices to feature names
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
tf.get_feature_names()



['1800',
 '18e',
 '1914',
 '1918',
 '1941',
 '1942',
 '1945',
 '1960',
 '1961',
 '1975',
 '1989',
 '1991',
 '87th',
 'abandoned',
 'aboriginal',
 'abused',
 'accidents',
 'acting',
 'action',
 'activities',
 'actors',
 'actresses',
 'adirondack',
 'adjustment',
 'adolescence',
 'adult',
 'adultery',
 'adventure',
 'advertising',
 'aeneas',
 'aeronautics',
 'africa',
 'african',
 'age',
 'aged',
 'agents',
 'aging',
 'agriculture',
 'aids',
 'air',
 'aircraft',
 'alanna',
 'albigenses',
 'alcestis',
 'alcoholics',
 'alien',
 'alienation',
 'allegories',
 'alternative',
 'amazon',
 'american',
 'americans',
 'amis',
 'amish',
 'amnesia',
 'amnesty',
 'amour',
 'amyotrophic',
 'analysis',
 'ancient',
 'and',
 'angeles',
 'anger',
 'animal',
 'animals',
 'anita',
 'antiheroes',
 'antiques',
 'antisemitism',
 'antislavery',
 'apartheid',
 'apartment',
 'apartments',
 'applications',
 'apprentices',
 'arab',
 'arabic',
 'archaeological',
 'architecture',
 'arctic',
 'arithmetic',
 'art',
 'a

In [23]:
# Fit, then transform the categories into a matrix
tfidf_matrix = tf.fit_transform(data['categories'])

# Check the shape of the TF-IDF matrix
tfidf_matrix.shape

(6217, 655)

In [24]:
# Convert the sparse TF-IDF matrix to a dense matrix with todense()
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [25]:
# Build a DataFrame to inspect the TF-IDF matrix
# Columns hold the category terms
# Rows hold the book titles
 
pd.DataFrame(
    tfidf_matrix.todense(), 
    columns=tf.get_feature_names(),
    index=data.title
).sample(22, axis=1).sample(10, axis=0)

title,encounters,calif,dominican,pictorial,construction,sclerosis,male,allegories,alienation,relaciones,...,aids,aeneas,radio,israel,butlers,espionage,market,mystery,dysfunctional,horror
These Happy Golden Years,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
An Autobiography,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Blessings,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Servant of the Empire,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I Wonder Why I Blink,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Archimedes Effect,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Americana,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
What Was She Thinking?,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Collapse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Voice of the Gods,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
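To see what the vectorizer computes, the following from-scratch sketch applies scikit-learn's default weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) to three category strings from the dataset. The whitespace tokenizer and the helper name `tfidf_row` are simplifying assumptions for illustration, not part of the notebook.

```python
import math
from collections import Counter

docs = ["Fiction", "American fiction", "Detective and mystery stories"]

# Lowercase and split on whitespace (a simplification of the default tokenizer)
tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

Because every row is scaled to unit length, a book whose category is the single word "Fiction" gets all of its weight on that one term, and a rarer term such as "american" receives a larger weight than "fiction" within the same row.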


**Cosine Similarity**

In [26]:
from sklearn.metrics.pairwise import cosine_similarity
 
# Compute cosine similarity on the TF-IDF matrix
cosine_sim = cosine_similarity(tfidf_matrix) 
cosine_sim

array([[1.        , 0.        , 0.29571014, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.29571014, 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [27]:
# Build a DataFrame from cosine_sim with book titles as both rows and columns
cosine_sim_df = pd.DataFrame(cosine_sim, index=data['title'], columns=data['title'])
print('Shape:', cosine_sim_df.shape)

# Inspect the similarity matrix for a sample of books
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (6217, 6217)


title,Charles Dickens as a Legal Historian,The Heights of Courage,In Search of Captain Zero,Sparrow Hawk Red,Patriot Games
Candide and Other Stories,0.0,0.0,0.0,0.0,0.0
Early Candlelight,0.0,0.0,0.0,0.0,0.0
Thriller,0.0,0.0,0.0,0.0,0.0
Matters of Choice,0.0,0.0,0.0,0.0,0.0
Kingdom Come,0.0,0.0,0.0,0.0,0.0
Shattered Dance,0.0,0.0,0.0,0.0,0.0
The Power of Now,0.0,0.0,0.0,0.0,0.0
A Doll's House,0.0,0.0,0.0,0.0,0.0
Black Beauty,0.0,0.0,0.0,0.0,0.0
Microserfs,0.0,0.0,0.0,0.0,0.0
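Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it is 1.0 for vectors pointing the same way and 0.0 for vectors that share no terms. The sketch below is a minimal pure-Python version; the three vectors are hand-picked illustrative values, not rows from the TF-IDF matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative vectors over three hypothetical terms
fiction = [0.0, 1.0, 0.0]
american_fiction = [0.8, 0.6, 0.0]
mystery = [0.0, 0.0, 1.0]

print(cosine(fiction, fiction))
print(cosine(fiction, american_fiction))
print(cosine(fiction, mystery))
```

This explains why most entries in the sampled table are 0.0: two books get a non-zero score only when their category strings share at least one term.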


**Getting Recommendations**

In [28]:
def books_recommendations(title, similarity_data=cosine_sim_df, items=data[['title', 'categories']], k=5):
    """
    Recommend books based on a similarity DataFrame.

    Parameters:
    ---
    title : str
        Book title (a label in the similarity DataFrame)
    similarity_data : pd.DataFrame
        Symmetric similarity DataFrame with titles as both
        index and columns
    items : pd.DataFrame
        Contains the titles and the other features used to define similarity
    k : int
        Number of recommendations to return
    ---

    Selects the k entries with the largest similarity values
    in the similarity-matrix column of the given title.
    """

    # Use argpartition to partition indirectly along the given axis
    # The DataFrame column is converted to a numpy array
    # range(start, stop, step) covers the last k+1 positions, so the k+1 largest
    # values (the title itself plus k neighbours) all land in sorted position
    index = similarity_data.loc[:, title].to_numpy().argpartition(
        range(-1, -(k + 2), -1))

    # Take the titles with the largest similarity from the partitioned index
    closest = similarity_data.columns[index[-1:-(k+2):-1]]

    # Drop the queried title so it does not appear in the recommendations
    closest = closest.drop(title, errors='ignore')

    return pd.DataFrame(closest).merge(items).head(k)
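The top-k selection in the function relies on `numpy`'s `argpartition`, which places only the requested positions in their final sorted order and leaves the rest unordered. The sketch below shows the idea on a small made-up score array (`scores` and `k` are illustrative values, not taken from the similarity matrix).

```python
import numpy as np

# Made-up similarity scores for six items
scores = np.array([0.1, 0.9, 0.3, 1.0, 0.5, 0.7])
k = 3

# Ask for the last k positions (-1, -2, ..., -k) to be in sorted position
order = scores.argpartition(np.arange(-1, -(k + 1), -1))

# Read those positions from the end to get the k largest, best first
top_k = order[-1:-(k + 1):-1]

print(top_k)          # indices of the k highest scores
print(scores[top_k])  # the scores themselves, in descending order
```

Only the positions passed to `argpartition` are guaranteed to be sorted, so the slice should not read further back than the positions that were requested. When many items share the same score, which of the tied items ends up in the top k is arbitrary.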

In [29]:
data[data.title.eq('The Shawshank redemption')]

Unnamed: 0,title,authors,categories,average_rating,ratings_count
4987,The Shawshank redemption,Mark Kermode,Performing Arts,4.42,640.0


In [30]:
# Get book recommendations similar to 'The Shawshank redemption'
books_recommendations('The Shawshank redemption')

Unnamed: 0,title,categories
0,Gilliam on Gilliam,Performing Arts
1,Hype and Glory,Performing Arts
2,The Simpsons and Philosophy,Performing Arts
3,Taxi Driver,Performing Arts
4,Bite Me!,Performing Arts


In [31]:
data[data.title.eq('Rage of angels')]

Unnamed: 0,title,authors,categories,average_rating,ratings_count
3,Rage of angels,Sidney Sheldon,Fiction,3.93,29532.0


In [32]:
# Get book recommendations similar to 'Rage of angels'
books_recommendations('Rage of angels')

Unnamed: 0,title,categories
0,Fitzgerald: All The Sad Young Men,Fiction
1,The Bride Finder,Fiction
2,Murder in Foggy Bottom,Fiction
3,Wartime Lies,Fiction
4,Your Oasis on Flame Lake,Fiction


In the recommendations above, all five recommended books are in the Fiction category, so the precision is 5/5, or 100%. This means the system recommends books that share the query book's category well.
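The precision figure above can be computed with a few lines of code. The sketch below assumes that a recommendation counts as relevant when its category equals the query book's category, as in the evaluation above; the function name `precision_at_k` is a hypothetical helper, not part of the notebook.

```python
def precision_at_k(recommended_categories, query_category, k=5):
    """Share of the top-k recommendations whose category matches the query."""
    top = recommended_categories[:k]
    if not top:
        return 0.0
    hits = sum(1 for category in top if category == query_category)
    return hits / len(top)

# Categories of the five books recommended for 'Rage of angels'
recommended = ["Fiction", "Fiction", "Fiction", "Fiction", "Fiction"]
precision = precision_at_k(recommended, "Fiction")

print(precision)
```

With the notebook's own function, the input list could be obtained as `books_recommendations('Rage of angels')['categories'].tolist()`.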