# Content-Based Recommendation System for Tourist Spots
Aurellia Gita Elysia | 2602569722

This notebook implements a **content-based filtering recommendation system** to suggest tourist spots based on their features. It uses **Reciprocal Rank Fusion (RRF)** to combine multiple ranking signals for more accurate recommendations.

# 1. Import Libraries

In [35]:
import pandas as pd
import numpy as np
import re
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from fuzzywuzzy import process



# 2. Load Dataset

In [36]:
df = pd.read_csv('cleaned_dataset.csv')
df.head()

Unnamed: 0,Place_Id,Place_Ratings,Place_Name,Description,Category,City,Price,Rating,Lat,Long
0,1,3.7,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,0.4,0.652174,-6.175392,106.827153
1,2,2.8,Kota Tua,"Kota tua di Jakarta, yang juga bernama Kota Tu...",Budaya,Jakarta,0.0,0.652174,-6.137645,106.817125
2,3,2.5,Dunia Fantasi,Dunia Fantasi atau disebut juga Dufan adalah t...,Taman Hiburan,Jakarta,1.0,0.652174,-6.125312,106.833538
3,4,2.9,Taman Mini Indonesia Indah (TMII),Taman Mini Indonesia Indah merupakan suatu kaw...,Taman Hiburan,Jakarta,0.2,0.565217,-6.302446,106.895156
4,5,3.5,Atlantis Water Adventure,Atlantis Water Adventure atau dikenal dengan A...,Taman Hiburan,Jakarta,1.0,0.565217,-6.12419,106.839134


# 3. Further Data Preprocessing

In [37]:
stemmer_factory = StemmerFactory()
stemmer = stemmer_factory.create_stemmer()

stopword_factory = StopWordRemoverFactory()
stopword_remover = stopword_factory.create_stop_word_remover()
indonesian_stopwords = stopword_factory.get_stop_words()

## 3.1. Preprocess Stopwords

In [38]:
# Preprocessing function using Sastrawi
def preprocessing(data):
    data = data.lower()  # Lowercase conversion
    data = re.sub(r'[^a-zA-Z\s]', '', data)  # Remove special characters
    data = stemmer.stem(data)  # Stemming (Bahasa Indonesia)
    data = stopword_remover.remove(data)  # Remove stopwords (Bahasa Indonesia)
    return data

**💡 Explanation:**<br>
> * **Lowercasing** `(data.lower())`:
>   * Converts all characters to lowercase.
>   * Prevents duplicate vectors for words like "Museum" and "museum".
>
> * **Removing Special Characters** `(re.sub(r'[^a-zA-Z\s]', '', data))`:
>   * Removes numbers, punctuation, and symbols.
>   * Keeps only alphabetical characters and spaces.
>   * Prevents vectorizer noise from irrelevant symbols.
>
> * **Stemming with Sastrawi** `(stemmer.stem(data))`:
>   * Reduces words to their root forms (e.g., "bermain" → "main").
>   * Essential for Bahasa Indonesia, where words have many affixes.
>   * Reduces dimensionality and improves model performance.
>
> * **Stopword Removal** `(stopword_remover.remove(data))`:
>   * Removes common words like "dan", "di", "yang".
>   * Improves the TF-IDF signal by eliminating frequently occurring but unimportant words.
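
The cleaning steps above can be tried without Sastrawi using the rough sketch below. It omits stemming, and `TOY_STOPWORDS` and `clean_text` are hypothetical stand-ins for the Sastrawi stopword list and the notebook's `preprocessing()` function, so the output only approximates what the real pipeline would produce.

```python
import re

# Hypothetical three-word stopword list; the notebook uses Sastrawi's full list.
TOY_STOPWORDS = {"dan", "di", "yang"}

def clean_text(text):
    """Lowercase, strip non-letters, and drop stopwords (no stemming here)."""
    text = text.lower()
    # After lowercasing, only a-z and whitespace need to be kept.
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

sample = "Museum di Jakarta, yang berdiri tahun 1778!"
cleaned = clean_text(sample)
print(cleaned)
```

The comma, the digits, and the exclamation mark disappear, and "di" and "yang" are filtered out as stopwords.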

**❗ Notes:**<br>
> This preprocessing is important to ensure the data quality for training the model.
> * **Improves Accuracy:** Reduces noisy data before vectorization.
> * **Optimizes TF-IDF:** Only meaningful words are weighted.
> * **Reduces Overfitting:** Minimizes irrelevant features.

## 3.2. Add `Tags` Feature

In [39]:
# Create a copy for content-based filtering
df_content = df.copy()

In [40]:
# Create 'Tags' from 'Description' and 'Category' (if both exist)
if 'Category' in df_content.columns and 'Description' in df_content.columns:
    df_content['Tags'] = (df_content['Description'] + ' ' + df_content['Category']).apply(preprocessing)
elif 'Description' in df_content.columns:
    df_content['Tags'] = df_content['Description'].apply(preprocessing)

**💡 Explanation:**<br>
> This step generates a `Tags` column by combining `Description` and `Category` (if both exist) and applying the `preprocessing()` function to clean the text. If only `Description` is available, it processes that alone. This **creates a concise, stemmed, and stopword-free text representation** for each tourist spot, which is crucial for content-based recommendations.
>
> Combining `Category` with `Description` enriches the text data, providing more context for the model. Even if descriptions are short or generic, categories add valuable signals, helping the model distinguish between similar places. This enriched textual feature **improves the effectiveness of similarity measurements using techniques like TF-IDF**, leading to more accurate recommendations.

In [41]:
# Drop unnecessary columns
columns_to_drop = ['Price', 'Place_Ratings', 'Description', 'Lat', 'Long']
df_content.drop(columns=[col for col in columns_to_drop if col in df_content.columns], axis=1, inplace=True)

In [42]:
# Apply preprocessing to the 'Tags' column in your df_content
df_content['Tags'] = df_content['Tags'].apply(preprocessing)

# Display the dataframe to check the results
df_content.head()

Unnamed: 0,Place_Id,Place_Name,Category,City,Rating,Tags
0,1,Monumen Nasional,Budaya,Jakarta,0.652174,monumen nasional populer singkat monas tugu mo...
1,2,Kota Tua,Budaya,Jakarta,0.652174,kota tua jakarta nama kota tua pusat alunalun ...
2,3,Dunia Fantasi,Taman Hiburan,Jakarta,0.652174,dunia fantasi sebut dufan tempat hibur letak k...
3,4,Taman Mini Indonesia Indah (TMII),Taman Hiburan,Jakarta,0.565217,taman mini indonesia indah rupa suatu kawasan ...
4,5,Atlantis Water Adventure,Taman Hiburan,Jakarta,0.565217,atlantis water adventure kenal atlantis ancol ...


**💡 Explanation:**<br>
> In this step, the `preprocessing()` function is applied to the `Tags` column a second time to **clean and normalize the text**. The column was already cleaned when it was created, so this pass mainly acts as a safeguard that all tags are lowercased, free of special characters, stemmed to their root forms, and stripped of stopwords. The result is a consistent and compact textual representation **suitable for vectorization**.
>
> By preprocessing `Tags`, the model focuses on meaningful terms while ignoring noise, which enhances the accuracy of similarity measurements using techniques like TF-IDF. This step is crucial for ensuring that the content-based filtering model identifies relevant patterns between tourist spots based on their descriptions and categories.

## 3.3. TF-IDF Vectorization

In [None]:
tv = TfidfVectorizer(
    stop_words=indonesian_stopwords,
    max_features=5000,
    ngram_range=(1, 4), 
    min_df=2, 
    max_df=0.85,
    sublinear_tf=True,
    use_idf=True,
    smooth_idf=True,
    norm='l2'
)

**💡 Explanation:**<br>
> In this step, we configure `TfidfVectorizer` to convert the `Tags` column into **numerical vectors**, which represent the importance of terms within each tourist spot's tags. The parameters are:
> * `stop_words=indonesian_stopwords`: Removes common words in Bahasa Indonesia.
> * `ngram_range=(1, 4)`: Captures single words and phrases of up to 4 words (e.g., “pantai indah”, “pantai pasir putih”). This parameter helps identify related spots even with different wording.
> * `max_features=5000`: Limits the vocabulary to the 5,000 terms with the highest frequency across the corpus.
> * `min_df=2`: Ignores terms appearing in fewer than 2 spots.
> * `max_df=0.85`: Ignores terms appearing in more than 85% of spots.
> * `sublinear_tf=True`: Replaces the raw term frequency `tf` with `1 + log(tf)`, preventing repeated terms from dominating results.
> * `use_idf=True` and `smooth_idf=True`: Enable inverse-document-frequency weighting and add one to the document frequencies, which prevents division by zero.
> * `norm='l2'`: Normalizes vectors to have unit length, improving comparison.
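
To make the n-gram and weighting parameters more concrete, here is a small standard-library sketch. The helper names (`word_ngrams`, `sublinear_tf`, `smooth_idf`) are hypothetical and are not part of scikit-learn; the two formulas follow the scikit-learn documentation for `sublinear_tf=True` and `smooth_idf=True`, but the vectorizer's actual tokenization and filtering are more involved than this.

```python
import math

def word_ngrams(text, min_n=1, max_n=4):
    """List every word n-gram with min_n <= n <= max_n, shortest first."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def sublinear_tf(tf):
    """Sublinear term-frequency scaling: 1 + ln(tf) for tf > 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

grams = word_ngrams("pantai pasir putih")
print(grams)
print(sublinear_tf(10), smooth_idf(n_docs=100, doc_freq=2))
```

A three-word phrase yields six n-grams under `ngram_range=(1, 4)` because no 4-gram fits, and a term that occurs ten times in a document is weighted roughly 3.3 times a single occurrence rather than 10 times.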

In [69]:
# TF-IDF on 'Content Similarity'
content_tv = tv.fit_transform(df_content['Tags']).toarray()
content_sim = cosine_similarity(content_tv)

# TF-IDF on 'Category Similarity'
category_tv = TfidfVectorizer()
category_sim = cosine_similarity(category_tv.fit_transform(df_content['Category']))

# TF-IDF on 'City Similarity'
city_tv = TfidfVectorizer()
city_sim = cosine_similarity(city_tv.fit_transform(df_content['City']))

# Combine All Similarities (Multi-Feature Similarity)
final_similarity = (
    0.6 * content_sim +      # Content (Tags) has highest weight
    0.2 * category_sim +     # Category adds contextual grouping
    0.2 * city_sim           # City adds location-based relevance
)

# Check combined similarity for the first place
print(final_similarity[0][1:10])

[0.42114967 0.24284771 0.21133485 0.20292422 0.20436595 0.21263291
 0.20275551 0.20853848 0.20711209]


**💡 Explanation:**<br>
> This step computes similarity scores from content (`Tags`), `Category`, and `City` and combines them into a single score using a weighted sum.
>
> First, content similarity (`content_sim`) is calculated from `Tags` using `TfidfVectorizer` and `cosine_similarity()`. It has the highest weight (60%) because it captures rich descriptive information.
>
> Next, category similarity (`category_sim`) and city similarity (`city_sim`) are computed from their respective columns, helping group places by type and location, each contributing 20% to the final score.
>
> The combined matrix (`final_similarity`) is a weighted average:
> * **60% Content:** Focuses on descriptions and themes.
> * **20% Category:** Groups similar types of attractions.
> * **20% City:** Promotes location-based relevance.
>
> The printed output shows the combined similarity scores between the first place and the places at indices 1 to 9, with values closer to **1 indicating stronger similarity**. This approach balances thematic, contextual, and location-based relevance for more accurate recommendations.
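
As a rough illustration of how the weighted sum behaves, the sketch below applies the 0.6 / 0.2 / 0.2 weights to a few assumed similarity values for a single pair of places. The function `combined_similarity` and the input values are hypothetical; the notebook performs the same arithmetic on whole matrices.

```python
def combined_similarity(content, category, city, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of three similarity scores for one pair of places."""
    w_content, w_category, w_city = weights
    return w_content * content + w_category * category + w_city * city

# Assumed inputs: no text overlap at all, but a shared city (and category).
same_city_only = combined_similarity(content=0.0, category=0.0, city=1.0)
same_city_and_category = combined_similarity(content=0.0, category=1.0, city=1.0)

print(round(same_city_only, 2), round(same_city_and_category, 2))
```

Under these assumptions, two places in the same city start from a floor of 0.2, and sharing a category as well raises the floor to 0.4. This would be consistent with the printed scores, where most values sit just above 0.20 and the second place (same category and city as the first) scores about 0.42.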

## 3.4. Export Dataset to CSV

In [70]:
df_content.to_csv('content_based_dataset.csv', index=False)
print("Preprocessed data exported to 'content_based_dataset.csv'")

Preprocessed data exported to 'content_based_dataset.csv'


# 4. Split Train/Test Data

In [87]:
train_df, test_df = train_test_split(df_content, test_size=0.2, random_state=42)

In [88]:
for col in ['Category', 'City', 'Tags']:
    train_df[col] = train_df[col].astype('category').cat.codes
    test_df[col] = test_df[col].astype('category').cat.codes

X_train = train_df.drop(columns=['Place_Name', 'Rating'])
y_train = train_df['Rating']

In [96]:
# Check how many test places exist in df_content (the full catalogue)
test_places = set(test_df['Place_Name'].str.lower().str.strip())
catalog_places = set(df_content['Place_Name'].str.lower().str.strip())

unmatched = test_places - catalog_places
print(f"Unmatched places from test set: {len(unmatched)}")

Unmatched places from test set: 0


**💡 Explanation:**<br>
> This step checks that every place in the test set also exists in `df_content`, the full catalogue from which the similarity matrices were built. Because `test_df` is a subset of `df_content`, the expected result is 0 unmatched places.
>
> This check helps prevent issues during evaluation, such as **index errors or zero-precision results**, by ensuring that all test places are known to the recommendation model.

# 5. Train Model

In [103]:
df_content = df_content.reset_index(drop=True)
X_train = X_train.reset_index(drop=True)


**💡 Explanation:**<br>
> This step resets the indices of both the training data (`X_train`) and the content dataset (`df_content`). Resetting indices is crucial for **ensuring that both datasets align correctly**, especially after data modifications such as filtering, merging, or splitting.
>
> Resetting indices prevents indexing errors during model training and ensures that recommendations remain correctly aligned with their original place information.

In [104]:
def reciprocal_rank_fusion(ranks, k=65, weights=None):
    score_dict = {}
    for idx, rank_list in enumerate(ranks):
        weight = weights[idx] if weights else 1.0
        for i, item in enumerate(rank_list):
            score_dict[item] = score_dict.get(item, 0) + weight / (k + i + 1)
    return sorted(score_dict.items(), key=lambda x: x[1], reverse=True)

**💡 Explanation:**<br>
> This function combines multiple ranked lists into a single ranking using **Reciprocal Rank Fusion (RRF)**. Each item receives `weight / (k + rank)` from every list it appears in, where `rank` starts at 1, so higher-ranked items receive more points. The per-list weights balance the influence of the different ranking sources.
>
> RRF is useful because it blends multiple recommendation strategies, ensuring that items consistently ranked well across lists score higher. It is a simple, robust method for creating **a unified recommendation list** from various signals like content similarity, category relevance, and popularity.
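
A small worked example may help show how the fusion behaves. The sketch below repeats the function from the cell above so that it runs on its own; the two toy rankings and the `[0.9, 0.1]` weights are made up for illustration.

```python
def reciprocal_rank_fusion(ranks, k=65, weights=None):
    """Fuse ranked lists: each item earns weight / (k + rank), rank from 1."""
    score_dict = {}
    for idx, rank_list in enumerate(ranks):
        weight = weights[idx] if weights else 1.0
        for i, item in enumerate(rank_list):
            score_dict[item] = score_dict.get(item, 0) + weight / (k + i + 1)
    return sorted(score_dict.items(), key=lambda x: x[1], reverse=True)

# Two hypothetical rankings over the same three items.
list_a = ["a", "b", "c"]
list_b = ["b", "c", "a"]

fused = reciprocal_rank_fusion([list_a, list_b])
order = [item for item, _score in fused]

# Weighting the first list heavily pulls the result toward its order.
weighted = reciprocal_rank_fusion([list_a, list_b], weights=[0.9, 0.1])
weighted_order = [item for item, _score in weighted]

print(order, weighted_order)
```

With equal weights, "b" wins because it is first in one list and second in the other, while "a" is first in one list but last in the other. With weights of 0.9 and 0.1, the fused order follows the first list.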

In [105]:
def recommend_places(nama_tempat, num_recommendations=5):
    match = df_content[df_content['Place_Name'] == nama_tempat]
    if match.empty:
        return []

    nama_tempat_index = match.index[0]

    # Rank by Content
    content_rank = sorted(
        range(len(df_content)),
        key=lambda x: final_similarity[nama_tempat_index, x],
        reverse=True
    )

    # Rank by Category
    category_rank = sorted(
        range(len(df_content)),
        key=lambda x: category_sim[nama_tempat_index, x],
        reverse=True
    )

    # Rank by Popularity (Review_Count); every place scores 0 if the column is missing
    popularity_rank = sorted(
        range(len(df_content)),
        key=lambda x: df_content.iloc[x].get('Review_Count', 0),
        reverse=True
    )

    # Rank by Rating
    rating_rank = sorted(
        range(len(df_content)),
        key=lambda x: df_content.iloc[x].get('Rating', 0),
        reverse=True
    )

    # Combine Rankings with RRF
    fused_rank = reciprocal_rank_fusion(
        [content_rank, category_rank, popularity_rank, rating_rank],
        weights=[0.55, 0.20, 0.15, 0.10]
    )
    
    # Return top recommendations, skipping the first fused item (expected to be the input place)
    recommended_places = [
        df_content.iloc[i[0]].Place_Name 
        for i in fused_rank[1:num_recommendations+1]
    ]
    return recommended_places


**💡 Explanation:**<br>
> This function generates recommendations by ranking tourist spots based on multiple factors and combining them using **Reciprocal Rank Fusion (RRF)**.
>
> * **Find Place Index:**
> Searches for the input place (`nama_tempat`) in `df_content` and retrieves its index.
>
> * **Generate Individual Rankings:**
>   * **Content Rank:** Based on multi-feature similarity (`final_similarity`).
>   * **Category Rank:** Based on category similarity (`category_sim`).
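>   * **Popularity Rank:** Based on the number of reviews (`Review_Count`). When this column is missing, as in the `df_content` preview above, every place scores 0 and this ranking falls back to dataset order.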
>   * **Rating Rank:** Based on user ratings (`Rating`).
>
> * **Combine with RRF:**
> Merges the four rankings using `reciprocal_rank_fusion()` with assigned weights:
>   * **55% Content:** Main factor focusing on descriptive similarity.
>   * **20% Category:** Groups similar types of places.
>   * **15% Popularity:** Highlights well-reviewed spots.
>   * **10% Rating:** Adds user experience feedback.
> 
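> * **Return Top Results:**
> Outputs the top recommendations, skipping the first item of the fused ranking. This excludes the input place only when it ranks first; otherwise the input place can still appear among the results.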
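
The slice `fused_rank[1:num_recommendations+1]` drops whichever item ranks first, which is the input place only if it tops the fused ranking. A minimal sketch of an alternative, assuming the fused list holds `(index, score)` pairs as returned by `reciprocal_rank_fusion()`, is to filter by index instead. The helper `top_n_excluding` and the toy scores are hypothetical and not part of the notebook.

```python
def top_n_excluding(fused_rank, query_index, n=5):
    """Return the first n item indices from fused_rank, skipping query_index."""
    return [idx for idx, _score in fused_rank if idx != query_index][:n]

# Toy fused ranking in which the query place (index 3) is ranked second.
fused_rank = [(7, 0.031), (3, 0.030), (9, 0.029), (1, 0.028)]

picked = top_n_excluding(fused_rank, query_index=3, n=2)
sliced = [idx for idx, _score in fused_rank[1:3]]

print(picked, sliced)
```

In this toy case the filter returns places 7 and 9, while the positional slice returns 3 and 9, so the query place leaks into the slice-based result. Note that the evaluation cells count the input place as a hit, so switching to this filter would change what those metrics measure.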

**❗ Notes:**<br>
> The popularity rank is weighted higher than the rating rank because popularity (based on the number of reviews) is **generally a more reliable indicator of user engagement and relevance**. A spot with a high number of reviews reflects consistent interest and visitor experience, while a high rating with few reviews may be statistically insignificant or biased. This decision is based on the principle that volume of feedback often provides a stronger signal than isolated opinions.
>
> Additionally, the assigned weights were **fine-tuned through multiple iterations**, using empirical testing and evaluation based on both **Precision@5** and **MAP@5**. Several weight combinations were tested, and the current configuration represents the best-performing balance, achieving the highest scores for both metrics. This data-driven adjustment ensures that the model is optimized for both accuracy (precision) and ranking quality (MAP), making the recommendation system more effective and reliable.

# 6. Evaluate Model

In [110]:
# Use RRF for recommendations only
def precision_at_5(test_df):
    relevant_count = 0
    total = len(test_df)
    for _, row in test_df.iterrows():
        actual = row['Place_Name']
        recommended = recommend_places(actual, 5)  # Use RRF-based recommendations
        if actual in recommended:
            relevant_count += 1
    return relevant_count / total if total > 0 else 0

def mean_average_precision_at_k(test_df, k=5, show_samples=5):
    average_precisions = []

    print("\n📊 Sample of Recommendations:")
    sample_count = 0

    for _, row in test_df.iterrows():
        actual = row['Place_Name']
        recommended = recommend_places(actual, k)
        
        hits = [1 if rec == actual else 0 for rec in recommended]
        precisions = [sum(hits[:i+1])/(i+1) for i in range(len(hits)) if hits[i] == 1]
        average_precisions.append(sum(precisions) / sum(hits) if sum(hits) > 0 else 0)
        
        if sample_count < show_samples:
            print(f"\nActual: {actual}")
            print(f"Top-{k} Recommendations: {recommended}")
            print(f"Hits: {hits}")
            print(f"Precisions: {precisions}")
            sample_count += 1

    map_score = sum(average_precisions) / len(average_precisions) if average_precisions else 0
    return map_score


**💡 Explanation:**<br>
> This code evaluates the recommendation system using `Precision@5` and `MAP@5` (**Mean Average Precision at 5**), which measure **accuracy and ranking quality**, respectively.
>
> * `Precision@5`: Calculates how often the actual place appears in its own top 5 recommendations. It counts the test places with a match and divides by the total number of test cases, so it behaves as a hit rate over the test set. This metric reflects how **accurately the system delivers relevant results upfront**, which is crucial for user satisfaction.
>
> * `MAP@5`: Evaluates the **quality of ranking** by averaging precision scores at each correct recommendation. Because each test case has a single relevant item (the actual place), its average precision equals the reciprocal of the hit's position, or 0 when there is no hit. It rewards models that **list relevant places higher in the ranking**, providing a more detailed measure of ranking performance. The function also prints sample recommendations, displaying hits and their corresponding precision scores.
>
> Together, these metrics ensure the model is optimized for both accuracy (`Precision@5`) and ranking quality (`MAP@5`), offering a **well-rounded performance assessment**.
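
To see how a single test case is scored, the sketch below reuses the hit and precision logic from `mean_average_precision_at_k()` on one recommendation list taken from the sample output. The wrapper name `average_precision_single` is hypothetical.

```python
def average_precision_single(recommended, actual):
    """Average precision for one query with a single relevant item."""
    hits = [1 if rec == actual else 0 for rec in recommended]
    precisions = [
        sum(hits[:i + 1]) / (i + 1)
        for i in range(len(hits))
        if hits[i] == 1
    ]
    return sum(precisions) / sum(hits) if sum(hits) > 0 else 0.0

recommended = [
    "Dunia Fantasi",
    "Taman Impian Jaya Ancol",
    "Taman Menteng",
    "Taman Spathodea",
    "Taman Mini Indonesia Indah (TMII)",
]

ap = average_precision_single(recommended, "Taman Spathodea")
is_hit = "Taman Spathodea" in recommended

print(ap, is_hit)
```

The place appears in 4th position, so it counts as a full hit for `Precision@5` but contributes only 1/4 = 0.25 to `MAP@5`. This is why the two metrics can differ noticeably on the same test set.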

In [113]:
# Check MAP@5
map_score = mean_average_precision_at_k(test_df, 5)
print(f"\n🏆 Overall MAP@{5}: {map_score:.2f}")

# Check Precision@5
precision = precision_at_5(test_df)
print(f"🎯 Overall Precision@{5}: {precision:.2f}")


📊 Sample of Recommendations:

Actual: Monumen Kapal Selam
Top-5 Recommendations: ['Museum Nasional', 'Monumen Selamat Datang', 'Museum Sumpah Pemuda', 'Monumen Tugu Pahlawan', 'Museum Bank Indonesia']
Hits: [0, 0, 0, 0, 0]
Precisions: []

Actual: Taman Spathodea
Top-5 Recommendations: ['Dunia Fantasi', 'Taman Impian Jaya Ancol', 'Taman Menteng', 'Taman Spathodea', 'Taman Mini Indonesia Indah (TMII)']
Hits: [0, 0, 0, 1, 0]
Precisions: [0.25]

Actual: Masjid Agung Trans Studio Bandung
Top-5 Recommendations: ['Masjid Agung Trans Studio Bandung', 'Masjid Raya Bandung', 'Masjid Daarut Tauhiid Bandung', 'Masjid Salman ITB', 'Gereja Katedral Santo Petrus Bandung']
Hits: [1, 0, 0, 0, 0]
Precisions: [1.0]

Actual: Sungai Palayangan
Top-5 Recommendations: ['Sungai Palayangan', 'Sunrise Point Cukul', 'Situ Cileunca', 'Situ Patenggang', 'Kebun Binatang Bandung']
Hits: [1, 0, 0, 0, 0]
Precisions: [1.0]

Actual: Taman Kupu-Kupu Cihanjuang
Top-5 Recommendations: ['Taman Kupu-Kupu Cihanjuang', 'Happy

## **📊 Output Analysis:** Recommendation System Performance (`Precision@5` & `MAP@5`)
> This output displays sample recommendations from the system along with evaluation scores for `MAP@5` and `Precision@5`, providing insight into its ranking accuracy and overall effectiveness.
>
> 📈 **Sample Analysis:**
> * For "Monumen Kapal Selam", none of the top 5 recommendations matched the actual place (`Hits: [0, 0, 0, 0, 0]`), showing a potential weakness for this landmark even though the list contains related monuments and museums.
> * For "Taman Spathodea", the model found a correct hit but only in the 4th position (`Precisions: [0.25]`), indicating that although relevant results are found, their ranking could improve.
> * For places like "Masjid Agung Trans Studio Bandung", "Sungai Palayangan", and "Taman Kupu-Kupu Cihanjuang", the model achieved a perfect result (`Precisions: [1.0]`), placing the correct result at the top.
>
> 🎯 **Overall Evaluation:**
> * `Precision@5 = 0.72` means that 72% of test places appeared in the top 5 recommendations, demonstrating strong overall accuracy.
> * `MAP@5 = 0.51` indicates that, on average, correct places were well-ranked but with room for improvement in ordering results.