## **Sistem Rekomendasi (Recommender Systems) Course Project**
## **Collaborative Filtering Using Jaccard Similarity**

<table align="left">
    <tr>
        <td style="text-align:left">Course Code</td>
        <td style="text-align:left">:</td>
        <td style="text-align:left">12S4054</td>
    </tr>
    <tr>
        <td style="text-align:left">Course Name</td>
        <td style="text-align:left">:</td>
        <td style="text-align:left">Sistem Rekomendasi</td>
    </tr>
    <tr>
        <td style="text-align:left">Group 3 Members</td>
        <td style="text-align:left">:</td>
        <td style="text-align:left">
            1. 12S21046 Ruth Marelisa Hutagalung <br>
            2. 12S21052 Griselda<br>
            3. 12S21054 Diah Anastasya
        </td>
    </tr>
</table>


# Data Understanding

In [None]:
import pandas as pd

# Load the dataset
toba_tourism_data = pd.read_csv("Tempat-Wisata-Toba-Preprocessing.csv")

# Display the first 5 rows (the default for head())
toba_tourism_data.head()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,address,PlaceID,Nama_tempat_wisata,Category,ReviewerId,Rating,Reviews
0,0,0,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.12603e+20,5.0,
1,1,1,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.11909e+20,5.0,bagus
2,2,2,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.07886e+20,5.0,
3,3,3,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.13072e+20,5.0,sangat menyenagkan
4,4,4,"Jl. Sibola Hotang, Sibola Hotangsas, Kec. Bali...",0,PASIR PUTIH LUMBAN BULBUL,Wisata Bahari,1.06173e+20,5.0,bebas foto dimana aja cuma 2k


# Data Preprocessing

The preview shows that the dataset contains the following columns: `Unnamed: 0.1`, `Unnamed: 0`, `address`, `PlaceID`, `Nama_tempat_wisata`, `Category`, `ReviewerId`, `Rating`, and `Reviews`.

Some of these columns are not needed. `Unnamed: 0.1` and `Unnamed: 0` appear to be leftover index columns, and `address` and `Reviews` are not used in the similarity computation. The next step therefore drops these four columns.

In [None]:
# Drop the columns that are not needed
toba_tourism_data_cleaned = toba_tourism_data.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'address', 'Reviews'])

# Count the missing values in the remaining columns
missing_data_summary = toba_tourism_data_cleaned.isnull().sum()

# Preview the cleaned data together with the missing-value summary
toba_tourism_data_cleaned.head(), missing_data_summary

(   PlaceID         Nama_tempat_wisata       Category    ReviewerId  Rating
 0        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.126030e+20     5.0
 1        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.119090e+20     5.0
 2        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.078860e+20     5.0
 3        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.130720e+20     5.0
 4        0  PASIR PUTIH LUMBAN BULBUL  Wisata Bahari  1.061730e+20     5.0,
 PlaceID               0
 Nama_tempat_wisata    0
 Category              0
 ReviewerId            1
 Rating                1
 dtype: int64)

The summary shows one missing value in the `ReviewerId` column and one in the `Rating` column. Because only a single value is missing in each, removing the affected rows should have very little impact on the accuracy and performance of the recommender system.

In [None]:
# Drop rows with missing values in the ReviewerId and Rating columns
toba_tourism_data_cleaned.dropna(subset=['ReviewerId', 'Rating'], inplace=True)

# Verify that no missing values remain in the key columns
missing_data_check = toba_tourism_data_cleaned[['ReviewerId', 'Rating']].isnull().sum()


In [None]:

# Split the dataset (70% training, 15% validation, 15% test)
from sklearn.model_selection import train_test_split

# Hold out 15% as the test set
train_val, test = train_test_split(toba_tourism_data_cleaned, test_size=0.15, random_state=42)

# From the remaining 85%, hold out the share that equals 15% of the full dataset
# as the validation set, leaving 70% for training
train, val = train_test_split(train_val, test_size=0.176, random_state=42)  # 0.176 * 0.85 ≈ 0.15 (15%)

# Show the size of each split and confirm that the key columns have no missing values
train_size, val_size, test_size = len(train), len(val), len(test)
missing_data_check, (train_size, val_size, test_size)

(ReviewerId    0
 Rating        0
 dtype: int64,
 (30274, 6467, 6484))

The output above shows that the data has been cleaned and split into:
- Training set: 30,274
- Validation set: 6,467
- Test set: 6,484

No missing values remain in the `PlaceID`, `ReviewerId`, and `Rating` columns.
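As a quick sanity check on the split arithmetic, the reported sizes can be converted back into fractions of the whole dataset. This is only an illustrative sketch that reuses the three sizes printed above; the variable names are chosen for this example.

```python
# Split sizes as reported in the output above
train_size, val_size, test_size = 30274, 6467, 6484
total = train_size + val_size + test_size

# Fraction of the full dataset that ends up in each split
fractions = {
    "train": train_size / total,
    "val": val_size / total,
    "test": test_size / total,
}
for name, frac in fractions.items():
    print(f"{name}: {frac:.4f}")

# Exact share of the remaining 85% that corresponds to 15% of the whole dataset
exact_val_fraction = 0.15 / 0.85
print(f"0.15 / 0.85 = {exact_val_fraction:.4f}")
```

The value 0.176 used in the second split is this ratio rounded to three decimals, which is why the validation set comes out slightly smaller than the test set.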

# Calculating Similarities

## Jaccard Similarity (Collaborative Filtering)
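The Jaccard similarity of two sets A and B is the size of their intersection divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|. For places, each set can be read as the reviewers who "liked" that place. The sketch below illustrates the definition on hypothetical toy vectors; the helper `jaccard_binary` and the data are assumptions made for this example, not part of the notebook.

```python
import numpy as np

def jaccard_binary(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| for two 0/1 vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here when both vectors are all zeros
    return np.logical_and(a, b).sum() / union

# Hypothetical toy data: one entry per user, 1 = the user "liked" the place
place_a = [1, 1, 0, 1, 0]
place_b = [1, 0, 0, 1, 1]

sim = jaccard_binary(place_a, place_b)
print(sim)  # 2 shared users / 4 users in the union = 0.5
```

Because most reviewers rate only a few places, the union is usually much larger than the intersection, so small similarity values are expected for this kind of data.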

In [None]:
from sklearn.metrics import jaccard_score
import numpy as np

# Item-user matrix: rows are places, columns are reviewers, missing ratings become 0.
# Note: this uses the full cleaned dataset, so it also includes the rows held out in val and test.
item_user_matrix = toba_tourism_data_cleaned.pivot_table(index='PlaceID', columns='ReviewerId', values='Rating').fillna(0)

# Binarize: 1 if the reviewer rated the place 3 or higher, otherwise 0
binary_item_user_matrix = (item_user_matrix >= 3).astype(int)

# Compute the Jaccard similarity
# Initialize an all-zero DataFrame for the Jaccard similarities
jaccard_similarity_matrix = pd.DataFrame(np.zeros((binary_item_user_matrix.shape[0],
                                                   binary_item_user_matrix.shape[0])),
                                         index=binary_item_user_matrix.index,
                                         columns=binary_item_user_matrix.index)

# Fill the Jaccard similarity matrix (symmetric, so only the upper triangle is computed)
for i in range(len(jaccard_similarity_matrix)):
    for j in range(i, len(jaccard_similarity_matrix)):
        if i != j:
            # Jaccard similarity between each pair of items
            jaccard_similarity = jaccard_score(binary_item_user_matrix.iloc[i],
                                               binary_item_user_matrix.iloc[j])
            jaccard_similarity_matrix.iloc[i, j] = jaccard_similarity
            jaccard_similarity_matrix.iloc[j, i] = jaccard_similarity
        else:
            jaccard_similarity_matrix.iloc[i, j] = 1.0  # Similarity with itself is 1

# Display the Jaccard similarity matrix
jaccard_similarity_matrix.head()




PlaceID,0,1,2,3,4,5,6,8,9,10,...,92,93,94,95,96,97,98,99,100,101
0,1.0,0.005063,0.0,0.007463,0.0,0.005329,0.02773,0.0,0.006466,0.009674,...,0.032203,0.01679,0.024931,0.020013,0.037748,0.010886,0.029664,0.022431,0.020507,0.022175
1,0.005063,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002053,...,0.0,0.000858,0.001768,0.003125,0.0,0.003311,0.002062,0.0,0.002597,0.002524
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.001182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.007463,0.0,0.0,1.0,0.0,0.0,0.011952,0.02,0.007874,0.006085,...,0.009185,0.001705,0.004401,0.004666,0.002999,0.006472,0.003071,0.001235,0.005181,0.004403
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.001163,0.0,0.000891,0.001572,0.002278,0.006873,0.0,0.001875,0.000652,0.00254
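The pairwise loop calls `jaccard_score` once per pair of places, which can be slow when each row has many reviewer columns. As a hedged alternative, the same quantities can be obtained from one matrix product: the product of the binary matrix with its transpose gives the intersection counts, and the union follows from the row sums. The function name `jaccard_matrix` and the toy data are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

def jaccard_matrix(binary_df):
    """Pairwise Jaccard similarity between the rows of a 0/1 DataFrame."""
    m = binary_df.to_numpy(dtype=np.int64)
    intersection = m @ m.T                      # shared 1s for every pair of rows
    row_sums = m.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Pairs with an empty union get similarity 0 instead of dividing by zero
    sim = np.divide(intersection, union,
                    out=np.zeros(intersection.shape, dtype=float),
                    where=union > 0)
    np.fill_diagonal(sim, 1.0)                  # similarity of a place with itself
    return pd.DataFrame(sim, index=binary_df.index, columns=binary_df.index)

# Hypothetical toy matrix: 3 places (rows) by 4 users (columns)
toy = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 1, 0]],
    index=pd.Index([0, 1, 2], name="PlaceID"),
    columns=["u1", "u2", "u3", "u4"],
)
toy_sim = jaccard_matrix(toy)
print(toy_sim)
```

In the toy matrix, places 0 and 1 share two users out of three in their union, so their similarity is 2/3, while place 2 shares no users with either and gets 0.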


In [None]:
target_user_id = 1.126030e+20

# Extract the ratings given by this user
user_ratings = toba_tourism_data_cleaned[toba_tourism_data_cleaned['ReviewerId'] == target_user_id]

# Select the places the user rated. The threshold Rating >= 1 keeps every rated place;
# raise it to 4 to keep only highly rated places.
highly_rated_items = user_ratings[user_ratings['Rating'] >= 1]['PlaceID']

# Build Jaccard-based recommendations: mean similarity of every place to the selected places
jaccard_similar_items = jaccard_similarity_matrix[highly_rated_items].mean(axis=1).sort_values(ascending=False)

# Exclude places the user has already rated
jaccard_recommendations = jaccard_similar_items.drop(index=highly_rated_items)

# Look-up table from PlaceID to place name
place_names = toba_tourism_data[['PlaceID', 'Nama_tempat_wisata']].drop_duplicates()

# Merge the recommendations with the place names on PlaceID
jaccard_recommendations_with_names = jaccard_recommendations.reset_index().merge(place_names, on='PlaceID', how='left')

# Show the top 10 Jaccard-based recommendations for the user, with place names
jaccard_recommendations_with_names[['PlaceID', 'Nama_tempat_wisata', 0]].rename(columns={0: 'Similarity_Score'}).head(10)

Unnamed: 0,PlaceID,Nama_tempat_wisata,Similarity_Score
0,96,Museum Batak Tb Silalahi Center,0.058361
1,66,The Kaldera,0.055384
2,52,Puncak Huta Ginjang,0.052575
3,101,Danau Toba Parapat,0.052529
4,99,Pertunjukan Tari Sigale Gale,0.050362
5,91,Bukit Holbung Samosir,0.047111
6,92,Geosite Sipinsur,0.044919
7,18,Pantai Pasir Putih Parparean,0.044441
8,94,Menara Pandang Tele,0.04336
9,95,Pantai Pasir Putih Parbaba,0.042416


Based on the `Jaccard Similarity` scores shown above, the places recommended for ReviewerId 1.126030e+20 are:
- Museum Batak Tb Silalahi Center
- The Kaldera
- Puncak Huta Ginjang
- Danau Toba Parapat
- Pertunjukan Tari Sigale Gale
- Bukit Holbung Samosir
- Geosite Sipinsur
- Pantai Pasir Putih Parparean
- Menara Pandang Tele
- Pantai Pasir Putih Parbaba

##### Conclusion
The system produces personalized recommendations using the Jaccard Similarity method.
Although the similarity scores are relatively small, the recommendations remain relevant because they are derived from similarity to the places the user has already rated.
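The recommendation step can be sketched in isolation on a small similarity matrix: average the similarity columns of the places a user liked, drop those places, and sort. The function name `recommend_from_similarity`, the place IDs, and the similarity values below are hypothetical and only serve to make the mechanics visible.

```python
import pandas as pd

def recommend_from_similarity(similarity, liked_items, top_n=3):
    """Rank unseen items by their mean similarity to the items a user liked."""
    scores = similarity[liked_items].mean(axis=1)   # mean over the liked columns
    scores = scores.drop(index=liked_items)         # do not recommend seen items
    return scores.sort_values(ascending=False).head(top_n)

# Hypothetical 4x4 item-item similarity matrix
ids = [10, 11, 12, 13]
toy_similarity = pd.DataFrame(
    [[1.0, 0.2, 0.5, 0.0],
     [0.2, 1.0, 0.1, 0.4],
     [0.5, 0.1, 1.0, 0.3],
     [0.0, 0.4, 0.3, 1.0]],
    index=ids, columns=ids,
)

top = recommend_from_similarity(toy_similarity, liked_items=[10, 11])
print(top)  # item 12 scores (0.5 + 0.1) / 2, item 13 scores (0.0 + 0.4) / 2
```

Averaging over several liked places pulls the scores toward the typical pairwise similarity, which is one reason the scores stay small when most pairs of places share few reviewers.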

### Evaluation Metrics

In [None]:
test['Predicted_Rating'] = test['PlaceID'].map(lambda x: jaccard_recommendations.get(x, 0))

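This adds a `Predicted_Rating` column to the `test` DataFrame by mapping each `PlaceID` to its value in the `jaccard_recommendations` Series, or 0 if the place is not found. Note that `jaccard_recommendations` holds the similarity scores computed for the single target user above, so the column contains similarity scores rather than values on the rating scale.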
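The lookup relies on `Series.get`, which returns a default when the label is missing instead of raising a `KeyError`. A minimal sketch with hypothetical scores and rows (the names `scores` and `rows` are assumptions for this example):

```python
import pandas as pd

# Hypothetical similarity scores for one user, indexed by PlaceID
scores = pd.Series({96: 0.058, 66: 0.055})

# Hypothetical test rows; PlaceID 0 has no entry in scores
rows = pd.DataFrame({"PlaceID": [96, 0, 66]})

# Missing PlaceIDs fall back to the default value 0
rows["Predicted_Rating"] = rows["PlaceID"].map(lambda pid: scores.get(pid, 0))
print(rows)
```

Places that the target user has already rated were dropped from the recommendations, so they also fall back to 0 under this scheme.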

In [None]:
# Create Predicted_Rating and Relevant if they do not exist yet
if 'Predicted_Rating' not in test.columns:
    test['Predicted_Rating'] = test['PlaceID'].map(lambda x: jaccard_recommendations.get(x, 0))  # Example prediction

if 'Relevant' not in test.columns:
    test['Relevant'] = (test['Rating'] >= 4).astype(int)  # Example relevance: rating of 4 or higher

# Precision at K
def precision_at_k(relevance_list, k):
    relevance_list = relevance_list[:k]  # Keep the top K
    return sum(relevance_list) / k  # Precision at K = number of relevant items / K

# Average Precision at K
def average_precision_at_k(relevance_list, k):
    precision_sum = 0
    relevant = 0
    for i, rel in enumerate(relevance_list[:k], start=1):  # Only consider the top K
        if rel == 1:
            relevant += 1
            precision_sum += relevant / i
    return precision_sum / min(k, relevant) if relevant > 0 else 0

# Compute AP@10 per user
k = 10
map_at_k_scores = test.groupby('ReviewerId').apply(
    lambda x: average_precision_at_k(x.sort_values(by='Predicted_Rating', ascending=False)['Relevant'].tolist(), k)
)

# MAP@10 is the mean of the per-user AP@10 scores
mean_map_at_k = map_at_k_scores.mean()
print(f"Mean Average Precision at {k}: {mean_map_at_k}")


Mean Average Precision at 10: 0.8838429487179488
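To make the metric concrete, here is a worked example of Average Precision at K on a hypothetical ranked relevance list. The function repeats the logic used above in a self-contained form; since the number of hits within the top K can never exceed K, dividing by the hit count is equivalent to dividing by `min(k, relevant)`.

```python
def average_precision_at_k(relevance_list, k):
    """Mean of the precision values measured at each relevant position in the top k."""
    precision_sum = 0.0
    hits = 0
    for rank, rel in enumerate(relevance_list[:k], start=1):
        if rel == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits > 0 else 0.0

# Hypothetical ranking with relevant items at ranks 1 and 3
ranked = [1, 0, 1, 0, 0]
ap = average_precision_at_k(ranked, k=5)
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333

# A user whose only test row is relevant gets a perfect score regardless of ranking
single = average_precision_at_k([1], k=10)
print(single)  # 1.0
```

The second case matters when interpreting the mean: users with very few test rows contribute scores of exactly 0 or 1, so the average depends heavily on how many of their test ratings are 4 or higher.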




In [None]:
# Compute Precision and Recall for one user
def calculate_precision_recall_jaccard(user_id, top_n=10):
    # Places the user rated highly in the test set
    user_rated_items = test[test['ReviewerId'] == user_id]
    relevant_items = set(user_rated_items[user_rated_items['Relevant'] == 1]['PlaceID'])

    # Recommendations based on Jaccard similarity.
    # Note: the relevant places themselves are not excluded, and each has similarity 1.0 with itself.
    if relevant_items:
        user_recommendations = jaccard_similarity_matrix.loc[list(relevant_items)].mean(axis=0)
        recommended_items = user_recommendations.sort_values(ascending=False).head(top_n).index
    else:
        recommended_items = []

    # Precision and Recall
    relevant_recommended = len(set(recommended_items) & relevant_items)
    precision = relevant_recommended / len(recommended_items) if len(recommended_items) > 0 else 0
    recall = relevant_recommended / len(relevant_items) if len(relevant_items) > 0 else 0

    return precision, recall

# Compute Precision and Recall for every user
results = test['ReviewerId'].unique()
precision_recall_scores = [calculate_precision_recall_jaccard(user_id) for user_id in results]

# Mean Precision and Recall across users
average_precision = sum([score[0] for score in precision_recall_scores]) / len(precision_recall_scores)
average_recall = sum([score[1] for score in precision_recall_scores]) / len(precision_recall_scores)

print(f"Average Precision (Jaccard): {average_precision:.4f}")
print(f"Average Recall (Jaccard): {average_recall:.4f}")

Average Precision (Jaccard): 0.1089
Average Recall (Jaccard): 0.8942
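The gap between the two numbers is easier to read with a small example. With a top-10 list, a user who has only one relevant place in the test set can reach a precision of at most 0.1, even when that place is recommended and recall is therefore 1.0. The helper `precision_recall` and the place IDs below are hypothetical.

```python
def precision_recall(recommended, relevant):
    """Set-based precision and recall for one user."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical user with one relevant place (ID 7) and a top-10 list that contains it
top10 = [7, 3, 12, 45, 8, 19, 23, 51, 60, 2]
p, r = precision_recall(top10, relevant=[7])
print(p, r)  # 0.1 1.0
```

An average precision close to 0.1 combined with a high recall is consistent with many users having a single relevant test item that appears in their top-10 list.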


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Assumes the Jaccard similarity matrix has already been computed
# Assumes the test set has the columns 'PlaceID', 'ReviewerId', and 'Rating'

def predict_jaccard(user_id, place_id):
    """
    Estimate the rating for user_id and place_id based on Jaccard similarity.
    """
    # Similarity of every other place to place_id (drop the place itself)
    similar_items = jaccard_similarity_matrix[place_id].drop(place_id)
    relevant_ratings = train[train['PlaceID'].isin(similar_items.index)]

    # Keep only the ratings given by other reviewers
    relevant_ratings = relevant_ratings[relevant_ratings['ReviewerId'] != user_id]

    # Compute the prediction as a similarity-weighted average.
    # Note: weights is indexed by PlaceID while the ratings keep their row labels,
    # and pandas aligns Series on index labels (not position) when multiplying.
    weights = similar_items[relevant_ratings['PlaceID']]
    if weights.sum() > 0:
        predicted_rating = np.sum(weights * relevant_ratings['Rating']) / weights.sum()
    else:
        predicted_rating = train[train['PlaceID'] == place_id]['Rating'].mean()  # Fall back to the place's mean rating

    return predicted_rating

# Add the 'Predicted_Rating' column to the test set
test['Predicted_Rating'] = test.apply(lambda x: predict_jaccard(x['ReviewerId'], x['PlaceID']), axis=1)

# Compute RMSE and MAE
rmse = np.sqrt(mean_squared_error(test['Rating'], test['Predicted_Rating']))
mae = mean_absolute_error(test['Rating'], test['Predicted_Rating'])

print(f"RMSE (Jaccard): {rmse:.4f}")
print(f"MAE (Jaccard): {mae:.4f}")

RMSE (Jaccard): 2.1344
MAE (Jaccard): 2.0137
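A similarity-weighted average pairs each rating with the weight of the place it belongs to. One way to keep that pairing strictly positional is to work on NumPy arrays rather than on pandas Series with different index labels. The sketch below is an illustration under that assumption; the function name `weighted_average_rating` and the numbers are hypothetical.

```python
import numpy as np

def weighted_average_rating(weights, ratings):
    """Similarity-weighted mean rating; returns None when all weights are zero."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(ratings, dtype=float)
    total = w.sum()
    if total <= 0:
        return None  # caller can fall back to a default such as the place's mean rating
    return float((w * r).sum() / total)

# Hypothetical weights and ratings, paired by position
pred = weighted_average_rating([0.5, 0.25, 0.25], [5, 3, 4])
print(pred)  # (0.5*5 + 0.25*3 + 0.25*4) / 1.0 = 4.25
```

A weighted average of ratings always lies between the smallest and largest rating in the input, which gives a simple check on any prediction produced this way.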


### Conclusions from the Evaluation Results

1. **Mean Average Precision (MAP)**:
   - **Value: 0.8838**
   - MAP measures how well, on average, the model ranks relevant items ahead of non-relevant ones. The value is very high, which indicates that relevant places tend to appear near the top of the rankings overall.

2. **Average Precision (Jaccard)**:
   - **Value: 0.1089**
   - Precision based on Jaccard similarity is fairly low: on average only about one in ten of the top-10 recommended places is relevant to the user in the test set. So although the ranking quality is high overall (high MAP), most places in an individual recommendation list are not relevant.

3. **Average Recall (Jaccard)**:
   - **Value: 0.8942**
   - The high recall shows that the recommendations capture almost all relevant items, which means the model's coverage of relevant items is very good.

4. **Root Mean Square Error (RMSE)**:
   - **Value: 2.1344**
   - RMSE is the square root of the mean squared difference between the predictions and the actual values. This value indicates a fairly large deviation in the model's predicted ratings.

5. **Mean Absolute Error (MAE)**:
   - **Value: 2.0137**
   - MAE measures the mean absolute error between the predictions and the actual values. It is only slightly smaller than the RMSE, which suggests that the errors are relatively uniform, without very extreme outliers.

- **The model is very good at recommendation coverage (high Recall) but less precise for individual results (low Precision).**
- **The prediction errors (RMSE and MAE)** are still fairly large.
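The relationship between RMSE and MAE can be illustrated with a few hypothetical predictions. RMSE squares each error before averaging, so it is never smaller than MAE, and the gap widens when a few errors are much larger than the rest. The arrays below are made-up values for this sketch.

```python
import numpy as np

# Hypothetical actual and predicted ratings
actual = np.array([5.0, 4.0, 3.0])
predicted = np.array([4.0, 4.0, 5.0])

errors = predicted - actual
rmse = float(np.sqrt(np.mean(errors ** 2)))  # sqrt((1 + 0 + 4) / 3)
mae = float(np.mean(np.abs(errors)))         # (1 + 0 + 2) / 3

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
```

When every error has the same magnitude the two metrics coincide, so an RMSE only slightly above the MAE points to errors of similar size across the test rows.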
