## **Recommender Systems Course Project**
## **Collaborative Filtering Using Cosine Similarity**

<table align="left">
    <tr>
        <td style="text-align:left">Course Code</td>
        <td style="text-align:left">:</td>
        <td style="text-align:left">12S4054</td>
    </tr>
    <tr>
        <td style="text-align:left">Course Name</td>
        <td style="text-align:left">:</td>
        <td style="text-align:left">Recommender Systems (Sistem Rekomendasi)</td>
    </tr>
    <tr>
        <td style="text-align:left">Group 3 Members</td>
        <td style="text-align:left">:</td>
        <td style="text-align:left">
            1. 12S21046 Ruth Marelisa Hutagalung <br>
            2. 12S21052 Griselda<br>
            3. 12S21054 Diah Anastasya
        </td>
    </tr>
</table>


# Data Understanding

In [None]:
import pandas as pd

# Load Dataset
toba_tourism_data = pd.read_csv("Tempat-Wisata-Toba-Preprocessing.csv")

# Display the first 5 rows of the data (the default for head())
toba_tourism_data.head()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address,PlaceID,Nama_tempat_wisata,Category,ReviewerId,Rating,Reviews
0,0,0,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.12603e+20,5.0,
1,1,1,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.11909e+20,5.0,bagus
2,2,2,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.07886e+20,5.0,
3,3,3,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.13072e+20,5.0,sangat menyenagkan
4,4,4,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.06173e+20,5.0,bebas foto dimana aja cuma 2k


# Data Preprocessing

The preview shows that the dataset contains the following columns: `Unnamed: 0.1`, `Unnamed: 0`, `address`, `PlaceID`, `Nama_tempat_wisata`, `Category`, `ReviewerId`, `Rating`, and `Reviews`.

Several of these columns are not needed. `Unnamed: 0.1` and `Unnamed: 0` are leftover index columns, and `address` and `Reviews` are not used by the model. The next step therefore drops all four.

In [None]:
# Drop the columns that are not needed
toba_tourism_data_cleaned = toba_tourism_data.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'address', 'Reviews'])

# Count the missing values in each remaining column
missing_data_summary = toba_tourism_data_cleaned.isnull().sum()

# Show a preview of the cleaned data and the missing-value summary
toba_tourism_data_cleaned.head(), missing_data_summary

(   PlaceID         Nama_tempat_wisata       Category    ReviewerId  Rating
 0        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.126030e+20     5.0
 1        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.119090e+20     5.0
 2        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.078860e+20     5.0
 3        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.130720e+20     5.0
 4        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.061730e+20     5.0,
 PlaceID               0
 Nama_tempat_wisata    0
 Category              0
 ReviewerId            1
 Rating                1
 dtype: int64)

The summary shows one missing value in each of the `ReviewerId` and `Rating` columns. Because so few values are missing, dropping the affected rows should have a negligible effect on the accuracy and performance of the recommender system.
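As a small illustration of the check-then-drop pattern used here, the sketch below builds a hypothetical three-row frame (the values are made up, not taken from the dataset), counts missing values per column, and drops the incomplete row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row lacks both ReviewerId and Rating.
toy = pd.DataFrame({
    "PlaceID": [0, 0, 1],
    "ReviewerId": [1.1e20, 1.2e20, np.nan],
    "Rating": [5.0, 4.0, np.nan],
})

# Count missing values per column before cleaning.
missing_before = toy.isnull().sum()

# Drop rows where either key column is missing; this returns a new frame.
toy_clean = toy.dropna(subset=["ReviewerId", "Rating"])

# Re-check after cleaning.
missing_after = toy_clean.isnull().sum()
```

Re-running `isnull().sum()` after `dropna` is a cheap way to confirm that no missing values remain in the key columns.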

In [None]:
# Drop rows with missing values in the ReviewerId and Rating columns
toba_tourism_data_cleaned.dropna(subset=['ReviewerId', 'Rating'], inplace=True)

# Verify whether any values are still missing in the key columns
missing_data_check = toba_tourism_data_cleaned[['ReviewerId', 'Rating']].isnull().sum()


In [None]:

# Split the dataset (70% training, 15% validation, 15% test)
from sklearn.model_selection import train_test_split

# Hold out 15% as the test set
train_val, test = train_test_split(toba_tourism_data_cleaned, test_size=0.15, random_state=42)

# Take 17.6% of the remaining 85% as the validation set, leaving about 70% for training
train, val = train_test_split(train_val, test_size=0.176, random_state=42)  # 0.176 * 0.85 ≈ 0.15 (15%)

# Show the size of each split and confirm that no values are missing in the key columns
train_size, val_size, test_size = len(train), len(val), len(test)
missing_data_check, (train_size, val_size, test_size)

(ReviewerId    0
 Rating        0
 dtype: int64,
 (30274, 6467, 6484))

The output above shows that the data has been cleaned and split into:
- Training set: 30,274 rows
- Validation set: 6,467 rows
- Test set: 6,484 rows

No missing values remain in the `PlaceID`, `ReviewerId`, and `Rating` columns.
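The split sizes can be checked by hand. The sketch below is plain arithmetic rather than a call into scikit-learn. It assumes that the held-out size is rounded up, which matches the counts reported above, and it takes 43,225 (the sum of the three reported sizes) as the number of cleaned rows.

```python
import math

n_rows = 43225  # 30,274 + 6,467 + 6,484 rows after cleaning

# First split: 15% held out for testing (assumed to be rounded up).
n_test = math.ceil(0.15 * n_rows)
n_train_val = n_rows - n_test

# Second split: 0.176 of the remaining rows go to validation.
n_val = math.ceil(0.176 * n_train_val)
n_train = n_train_val - n_val

sizes = (n_train, n_val, n_test)
fractions = tuple(s / n_rows for s in sizes)

# The fraction of the remainder that gives exactly 15% of the full data.
exact_val_fraction = 0.15 / 0.85
```

The value 0.176 is a rounded form of `0.15 / 0.85`, which is why the validation set comes out slightly below 15% of the full data. Passing `test_size=0.15 / 0.85` to the second split would remove that rounding.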

# Modeling

## TF-IDF Vectorizer

The TF-IDF vectorizer builds a weighted feature representation of each destination's category. `TfidfVectorizer`, from the scikit-learn library, learns the vocabulary and IDF weights with `fit` and converts text into vectors with `fit_transform` or `transform`. It also tokenizes the input itself, lowercasing each category string and splitting it into individual words.
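To make the weighting concrete, here is a minimal pure-Python sketch of the computation, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The four-document corpus and the whitespace tokenizer are simplifications for illustration, not the notebook's data.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of category strings (one per review row).
docs = ["Wisata Bahari", "Wisata Alam", "Wisata Alam", "Wisata Rohani"]

# Tokenize: lowercase and split on whitespace (a simplified tokenizer).
tokenized = [d.lower().split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(tokenized)
df = Counter(tok for doc in tokenized for tok in set(doc))
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
```

Because `wisata` occurs in every document, its IDF is the minimum value of 1, so the rarer word in each category (`bahari`, `alam`, `rohani`) carries most of the weight. The same pattern shows up in the TF-IDF table later in this section.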

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()

tf.fit(toba_tourism_data_cleaned['Category'])

tf.get_feature_names_out()

array(['alam', 'bahari', 'bisnis', 'budaya', 'pertanian', 'rohani',
       'sejarah', 'wisata', 'ziarah'], dtype=object)

Transform the `Category` column into a TF-IDF matrix, with one row per review and one column per term.

In [None]:
tfidf_matrix = tf.fit_transform(toba_tourism_data_cleaned['Category'])
tfidf_matrix.shape

(43225, 9)

Convert the sparse TF-IDF matrix into a dense matrix for inspection.

In [None]:
tfidf_matrix.todense()

matrix([[0.        , 0.90873904, 0.        , ..., 0.        , 0.41736478,
         0.        ],
        [0.        , 0.90873904, 0.        , ..., 0.        , 0.41736478,
         0.        ],
        [0.        , 0.90873904, 0.        , ..., 0.        , 0.41736478,
         0.        ],
        ...,
        [0.        , 0.90873904, 0.        , ..., 0.        , 0.41736478,
         0.        ],
        [0.        , 0.90873904, 0.        , ..., 0.        , 0.41736478,
         0.        ],
        [0.        , 0.90873904, 0.        , ..., 0.        , 0.41736478,
         0.        ]])

In [None]:
pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=toba_tourism_data_cleaned.Nama_tempat_wisata
).sample(10, axis=0)

Nama_tempat_wisata,alam,bahari,bisnis,budaya,pertanian,rohani,sejarah,wisata,ziarah
Pertunjukan Tari Sigale Gale,0.0,0.0,0.0,0.947114,0.0,0.0,0.0,0.320897,0.0
Sapo Juma,0.888914,0.0,0.0,0.0,0.0,0.0,0.0,0.458073,0.0
Bukit Indah Simarjarunjung,0.888914,0.0,0.0,0.0,0.0,0.0,0.0,0.458073,0.0
Pantai Lumban Bul-bul Balige,0.0,0.908739,0.0,0.0,0.0,0.0,0.0,0.417365,0.0
Air Terjun Efrata,0.888914,0.0,0.0,0.0,0.0,0.0,0.0,0.458073,0.0
Dolok Pesona Panoguan Solu,0.888914,0.0,0.0,0.0,0.0,0.0,0.0,0.458073,0.0
Patung Sigale gale,0.0,0.0,0.0,0.947114,0.0,0.0,0.0,0.320897,0.0
Bukit Doa,0.0,0.0,0.0,0.0,0.0,0.974296,0.0,0.225272,0.0
Pantai Lumban Bul-bul Balige,0.0,0.908739,0.0,0.0,0.0,0.0,0.0,0.417365,0.0
Pertunjukan Tari Sigale Gale,0.0,0.0,0.0,0.947114,0.0,0.0,0.0,0.320897,0.0


## Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

tfidf_matrix = tfidf_matrix.tocsr()

tfidf_matrix_subset = tfidf_matrix[:1000, :]

cosine_sim = cosine_similarity(tfidf_matrix_subset, dense_output=False)

cosine_sim


<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 1000000 stored elements in Compressed Sparse Row format>
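The output reports 1,000,000 stored elements for a 1000 x 1000 matrix, so every entry is nonzero and the sparse format saves no memory here. The notebook does not say why only the first 1000 rows are used. One plausible reason is memory, which the rough estimate below illustrates. The helper `dense_similarity_bytes` is a hypothetical name introduced for this sketch, and it counts only the float64 values, not the sparse index arrays.

```python
def dense_similarity_bytes(n_rows, bytes_per_value=8):
    """Bytes needed to hold a fully populated n_rows x n_rows float64 matrix."""
    return n_rows * n_rows * bytes_per_value

# The 1000-row subset used in this notebook.
subset_bytes = dense_similarity_bytes(1000)

# Every review row in the cleaned dataset (43,225 rows).
full_bytes = dense_similarity_bytes(43225)

subset_mb = subset_bytes / 1e6
full_gb = full_bytes / 1e9
```

The subset needs about 8 MB, while a row-by-row similarity matrix over all 43,225 reviews would need roughly 15 GB. Computing similarity over unique places instead of review rows would shrink the matrix without discarding any places.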

In [None]:
# Ensure the index and columns match the shape of cosine_sim
cosine_sim_df = pd.DataFrame(
    cosine_sim.toarray(),  # Convert sparse matrix to dense (if you need it as a dense matrix)
    index=toba_tourism_data_cleaned.Nama_tempat_wisata[:1000],  # Corresponding to the first 1000 rows
    columns=toba_tourism_data_cleaned.Nama_tempat_wisata[:1000]  # Same for columns
)

# Print the shape of the resulting similarity matrix
print('Shape:', cosine_sim_df.shape)

cosine_sim_df.sample(10, axis=0)


Shape: (1000, 1000)


Nama_tempat_wisata,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,PASIR PUTIH LUMBAN BULBUL,...,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale,Patung Sigale gale
BUKIT GIBEON (YAPIM TARUNA),0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,...,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289
BUKIT GIBEON (YAPIM TARUNA),0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,...,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289
BUKIT GIBEON (YAPIM TARUNA),0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,...,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289
Monumen Raja SONAKMALELA,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,...,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554
BUKIT GIBEON (YAPIM TARUNA),0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,0.094021,...,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289,0.072289
Monumen Raja SONAKMALELA,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,...,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554
PASIR PUTIH LUMBAN BULBUL,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931
PASIR PUTIH LUMBAN BULBUL,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931
Monumen Raja SONAKMALELA,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,0.08266,...,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554,0.063554
PASIR PUTIH LUMBAN BULBUL,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931,0.133931
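The entries in this table can be reproduced by hand from the TF-IDF vectors shown earlier. The sketch below computes cosine similarity, `dot(a, b) / (|a| * |b|)`, in plain Python. The vectors are copied from the TF-IDF sample table, so they carry the six-decimal rounding of that output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Feature order: alam, bahari, bisnis, budaya, pertanian, rohani, sejarah, wisata, ziarah
bahari = [0, 0.908739, 0, 0, 0, 0, 0, 0.417365, 0]  # a "Wisata Bahari" row
rohani = [0, 0, 0, 0, 0, 0.974296, 0, 0.225272, 0]  # the "Bukit Doa" row
budaya = [0, 0, 0, 0.947114, 0, 0, 0, 0.320897, 0]  # the "Patung Sigale gale" row

sim_bahari_budaya = cosine(bahari, budaya)
sim_bahari_rohani = cosine(bahari, rohani)
sim_self = cosine(bahari, bahari)
```

`sim_bahari_budaya` comes out at about 0.1339, matching the PASIR PUTIH LUMBAN BULBUL versus Patung Sigale gale entries above, and `sim_bahari_rohani` is about 0.0940. Two places in different categories overlap only on the shared word `wisata`, so their similarity is small but never zero, while two places in the same category always score 1.0.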


## Recommendation Testing

In [None]:
import pandas as pd

def place_recommendations(place_name, similarity_data=cosine_sim_df, items=toba_tourism_data_cleaned[['Nama_tempat_wisata', 'Category']], k=5):
    # Check if the place_name is in the similarity_data columns
    if place_name not in similarity_data.columns:
        raise ValueError(f"Place '{place_name}' not found in similarity data.")

    # Partition the similarity scores so that the largest values come last.
    # Note: `index` is not used below, so the result is not ranked by similarity.
    index = similarity_data.loc[:, place_name].to_numpy().argpartition(range(-1, -k, -1))
    closest = similarity_data.columns

    # Exclude the input place itself
    closest = closest.drop(place_name, errors='ignore')

    # Merge the remaining place names with the items DataFrame and return the first k rows
    return pd.DataFrame(closest).merge(items, left_on='Nama_tempat_wisata', right_on='Nama_tempat_wisata').head(k)



In [None]:
# Sample usage
place_name = 'PASIR PUTIH LUMBAN BULBUL'
recommendations = place_recommendations(place_name, cosine_sim_df, toba_tourism_data_cleaned[['Nama_tempat_wisata', 'Category']], k=10)
recommendations

Unnamed: 0,Nama_tempat_wisata,Category
0,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
1,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
2,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
3,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
4,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
5,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
6,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
7,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
8,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
9,"Patung Yesus, Bukit Sibea-bea, Huta Harian Boho",Wisata Rohani
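The result above lists one place ten times, and its category (Wisata Rohani) differs from the query's (Wisata Bahari). Two things contribute: the similarity matrix has one row per review, so place names are duplicated, and the function returns the first `k` merged rows without sorting by similarity. The following is a minimal sketch of one way to address both, not the notebook's method. The place names, the `recommend` function, and the simple word-count vectors (a stand-in for TF-IDF) are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical review-level data: several reviews can refer to the same place.
reviews = pd.DataFrame({
    "Nama_tempat_wisata": ["Pantai A", "Pantai A", "Pantai B", "Bukit C", "Museum D"],
    "Category": ["Wisata Bahari", "Wisata Bahari", "Wisata Bahari",
                 "Wisata Alam", "Wisata Budaya"],
})

# Keep one row per place so each name appears once in the similarity matrix.
places = reviews.drop_duplicates("Nama_tempat_wisata").reset_index(drop=True)

# Word-count vectors over category tokens, L2-normalized (stand-in for TF-IDF).
vocab = sorted({tok for cat in places["Category"] for tok in cat.lower().split()})
vectors = np.array(
    [[cat.lower().split().count(tok) for tok in vocab] for cat in places["Category"]],
    dtype=float,
)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Rows are unit length, so the dot product equals cosine similarity.
sim = pd.DataFrame(
    vectors @ vectors.T,
    index=places["Nama_tempat_wisata"],
    columns=places["Nama_tempat_wisata"],
)

def recommend(place_name, similarity=sim, items=places, k=2):
    """Return the k places most similar to place_name, ranked by similarity."""
    if place_name not in similarity.columns:
        raise ValueError(f"Place '{place_name}' not found in similarity data.")
    # Drop the query itself, then keep the k highest scores.
    scores = similarity[place_name].drop(place_name)
    top = scores.sort_values(ascending=False).head(k)
    ranked = items.set_index("Nama_tempat_wisata").loc[top.index]
    return ranked.assign(similarity=top.values).reset_index()

result = recommend("Pantai A", k=2)
```

With this toy data, `recommend("Pantai A")` ranks `Pantai B` first because it shares the same category, and the query place never appears in its own recommendations.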


# Evaluation Metrics

In [None]:
import numpy as np

# Recall over categories for the top-k recommendations
def recall_at_k(relevant_categories, recommended_categories, k=None):
    if k:
        recommended_categories = recommended_categories[:k]
    relevant_recommendations = len(set(relevant_categories) & set(recommended_categories))
    return relevant_recommendations / len(relevant_categories) if len(relevant_categories) > 0 else 0

# Precision over categories for the top-k recommendations
def precision_at_k(relevant_categories, recommended_categories, k=None):
    if k:
        recommended_categories = recommended_categories[:k]
    relevant_recommendations = len(set(relevant_categories) & set(recommended_categories))
    return relevant_recommendations / len(recommended_categories) if len(recommended_categories) > 0 else 0

# Average Precision for a single user
def average_precision(relevant_categories, recommended_categories, k=None):
    if k:
        recommended_categories = recommended_categories[:k]
    relevant_recommendations = 0
    precision_sum = 0
    for i in range(len(recommended_categories)):
        if recommended_categories[i] in relevant_categories:
            # Add the precision at each rank where a relevant category appears
            relevant_recommendations += 1
            precision_sum += relevant_recommendations / (i + 1)
    return precision_sum / len(relevant_categories) if len(relevant_categories) > 0 else 0

# MAP (Mean Average Precision) across all users
def mean_average_precision(relevant_categories_list, recommended_categories_list, k=None):
    ap_scores = [average_precision(rel, rec, k) for rel, rec in zip(relevant_categories_list, recommended_categories_list)]
    return np.mean(ap_scores)


In [None]:
# Hand-specified relevance and recommendation lists for three users
relevant_categories_list = [
    ['Wisata Bahari', 'Alam'],  # Relevant categories for user 1
    ['Sejarah', 'Rohani'],      # Relevant categories for user 2
    ['Wisata Bahari', 'Budaya'], # Relevant categories for user 3
]

recommended_categories_list = [
    ['Wisata Bahari', 'Sejarah', 'Budaya', 'Rohani', 'Alam'],  # Categories recommended to user 1
    ['Alam', 'Wisata Bahari', 'Rohani', 'Budaya', 'Sejarah'],  # Categories recommended to user 2
    ['Alam', 'Sejarah', 'Rohani', 'Wisata Bahari', 'Budaya'],  # Categories recommended to user 3
]

categories = ['Wisata Bahari', 'Alam', 'Sejarah', 'Rohani', 'Budaya']


In [None]:
# Precision, Recall, and MAP for the top 5
k = 5  # Number of recommendations (top-5)
precision_top_5 = np.mean([precision_at_k(rel, rec, k) for rel, rec in zip(relevant_categories_list, recommended_categories_list)])
recall_top_5 = np.mean([recall_at_k(rel, rec, k) for rel, rec in zip(relevant_categories_list, recommended_categories_list)])
map_top_5 = mean_average_precision(relevant_categories_list, recommended_categories_list, k)

print(f"Precision@{k}: {precision_top_5:.4f}")
print(f"Recall@{k}: {recall_top_5:.4f}")
print(f"MAP@{k}: {map_top_5:.4f}")


Precision@5: 0.4000
Recall@5: 1.0000
MAP@5: 0.4639
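The MAP value can be traced by hand. The sketch below recomputes Average Precision per user in plain Python, using the same relevant and recommended lists as above. It is a compact restatement of `average_precision` for illustration, without the `k` cutoff.

```python
def average_precision(relevant, recommended):
    """Mean of precision@rank over the ranks where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

relevant_lists = [
    ['Wisata Bahari', 'Alam'],
    ['Sejarah', 'Rohani'],
    ['Wisata Bahari', 'Budaya'],
]
recommended_lists = [
    ['Wisata Bahari', 'Sejarah', 'Budaya', 'Rohani', 'Alam'],
    ['Alam', 'Wisata Bahari', 'Rohani', 'Budaya', 'Sejarah'],
    ['Alam', 'Sejarah', 'Rohani', 'Wisata Bahari', 'Budaya'],
]

# User 1: hits at ranks 1 and 5 -> (1/1 + 2/5) / 2
# User 2: hits at ranks 3 and 5 -> (1/3 + 2/5) / 2
# User 3: hits at ranks 4 and 5 -> (1/4 + 2/5) / 2
ap_scores = [average_precision(rel, rec) for rel, rec in zip(relevant_lists, recommended_lists)]
map_score = sum(ap_scores) / len(ap_scores)
```

Recall is 1.0 and precision is 0.4 for every user because each list contains all five categories, two of which are relevant. MAP is the only metric here that reflects how early the relevant categories appear.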


In [None]:
# Precision, Recall, and MAP for the top 10
k = 10  # Number of recommendations (top-10)
precision_top_10 = np.mean([precision_at_k(rel, rec, k) for rel, rec in zip(relevant_categories_list, recommended_categories_list)])
recall_top_10 = np.mean([recall_at_k(rel, rec, k) for rel, rec in zip(relevant_categories_list, recommended_categories_list)])
map_top_10 = mean_average_precision(relevant_categories_list, recommended_categories_list, k)

print(f"Precision@{k}: {precision_top_10:.4f}")
print(f"Recall@{k}: {recall_top_10:.4f}")
print(f"MAP@{k}: {map_top_10:.4f}")


Precision@10: 0.4000
Recall@10: 1.0000
MAP@10: 0.4639
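The top-10 scores are identical to the top-5 scores because each recommendation list holds only five categories, so a cutoff of 10 changes nothing. A short sketch of the slicing behavior behind this:

```python
recommended = ['Alam', 'Sejarah', 'Rohani', 'Wisata Bahari', 'Budaya']

top_5 = recommended[:5]
top_10 = recommended[:10]  # slicing past the end returns only the items that exist

same_lists = top_5 == top_10
```

Comparing cutoffs becomes informative only when the recommendation lists contain more than `k` items.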


In [None]:
# Convert relevance and predictions to binary vectors over all categories
true_relevances_binary = [
    [1 if category in rel else 0 for category in categories] for rel in relevant_categories_list
]
predicted_relevances_binary = [
    [1 if category in rec else 0 for category in categories] for rec in recommended_categories_list
]

# Flatten the binary data
true_flat = [item for sublist in true_relevances_binary for item in sublist]
predicted_flat = [item for sublist in predicted_relevances_binary for item in sublist]

# RMSE and MAE
rmse = np.sqrt(np.mean((np.array(true_flat) - np.array(predicted_flat)) ** 2))
mae = np.mean(np.abs(np.array(true_flat) - np.array(predicted_flat)))
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")


RMSE: 0.7746
MAE: 0.6000
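These two numbers follow directly from the shape of the test data. Every recommendation list contains all five categories, so each predicted vector is all ones, and the only errors are the three non-relevant categories per user. The sketch below reproduces the calculation for user 1 in plain Python; the other two users give the same values.

```python
import math

categories = ['Wisata Bahari', 'Alam', 'Sejarah', 'Rohani', 'Budaya']
relevant = ['Wisata Bahari', 'Alam']
recommended = ['Wisata Bahari', 'Sejarah', 'Budaya', 'Rohani', 'Alam']

# Binary indicator vectors over the five categories.
true_vec = [1 if c in relevant else 0 for c in categories]
pred_vec = [1 if c in recommended else 0 for c in categories]

# Errors are 0 for relevant categories and -1 for the three others.
errors = [t - p for t, p in zip(true_vec, pred_vec)]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

With binary vectors, MAE is simply the share of mismatched categories (3 of 5), and RMSE is its square root. Both therefore measure how many non-relevant categories were recommended rather than how far off a predicted rating is.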
