<h1 style='text-align:center; color:orange; font-weight:bold'>A Hybrid Recommendation System for Indonesian Travel Destination</h1>

In [1]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

## **1 Introduction**
In the rapid gowing tourism industry in Indonesia, a recommendation system becomes a key instrument in travel platforms. A recommendation system does not only offers personalization to customers based on their individual preferences or past behaviors but also streamlines travel planning process. With the system, customers can save more time and effort as the system displays relevant travel destination to them according to their platform usage data (e.g., previous search/booking or preference). In the business perspective, a recommendation system helps company generate more bookings and purchases by promoting relevant contents to customers.

In a highly competitive market, travel businesses face challenges in maximizing operational efficiency and revenue. In Indonesia, there have been many giant travel companies such as Agoda, Traveloka, and Booking\.com. Countless travel options are available, complicating the process of targeting relevant contents to the customers, and thus, leading to potential missed revenue opportunities. To address this challenge, I will build a recommendation system to increase the conversion rates by showing relevant travel destination to the customers. While recommendation system which will be built here is a hybrid one, I will show how content-based and collaborative filtering systems work. Hybrid recommendation system incorporates both content-based and collaborative filtering to deal with weaknesses of each recommendation system. 
- **Content-based filtering system** sometimes can be too similar to the user's past activities, leading to more limited exposure to new and different items. Hybrid approach combines two recommendation systems to find balance between relevance and recency.
- **Collaborative filtering system** may favor popular items. As a result, the recommendations can lack diversity and potentially neglect less popular items. Hybrid approach reduces this bias to the popular items.
It is important to highlight here that the hybrid recommendation system here is not a machine learning solution, but a distance-based approach for an initial development. This distance-based hybrid recommendation system is more straightforward because it only uses similarity measures, making it easier and faster to develop. 

The datasets listed below were taken from [Prabowo](https://www.kaggle.com/datasets/aprabowo/indonesia-tourism-destination?select=tourism_with_id.csv) on Kaggle. In addition to these four datasets, I will also use Indonesian stop word list collected from [Hartono](https://www.kaggle.com/datasets/oswinrh/indonesian-stoplist) on the same platform. This list is crucial since making content-based filtering requires text preprocessing, particularly for removing non-content words which are not relevant for the goal.
- `package_tourism`: file containing places, time, cost, and rating
- `tourism_rating`: consists of three columns, namely user, place, and rating
- `tourism_with_id`: tourist attractions in 5 cities in Indonesia
- `user`: user data

## **2 Data Preparation**
In this step, I will check the cleanliness of the data such as missing values and duplicates. However, before doing this, merging datasets must be done since there are four datasets. Afterwards, text cleaning will be performed to avoid noise in the text preprocessing technique, especially for content-based filtering. 

In [2]:
# load dataset
tourism_package = pd.read_csv('../data/package_tourism.csv')
tourism_rating = pd.read_csv('../data/tourism_rating.csv')
tourism_user = pd.read_csv('../data/user.csv')
tourism_wid = pd.read_csv('../data/tourism_with_id.csv').drop(columns=['Unnamed: 11', 'Unnamed: 12'])
stopwords = list(pd.read_csv('../data/stopwordbahasa.csv'))

**Note**: After successfully importing the datasets, including the list of stopwords, I need to check primary keys in each dataset so merging different datasets can be done. And to make the identification process easier, I will put the result in the form of a dataframe. 

In [3]:
# dict of df and names
dataframes = {
    'tourism_package': tourism_package,
    'tourism_rating': tourism_rating,
    'tourism_user': tourism_user,
    'tourism_wid': tourism_wid
}

# list to store dataset names and columns
data = []
for name, df in dataframes.items():
    columns = list(df.columns)
    data.append({'Dataset': name, 'Columns': columns})

result_df = pd.DataFrame(data)
pd.options.display.max_colwidth = None
result_df

Unnamed: 0,Dataset,Columns
0,tourism_package,"[Package, City, Place_Tourism1, Place_Tourism2, Place_Tourism3, Place_Tourism4, Place_Tourism5]"
1,tourism_rating,"[User_Id, Place_Id, Place_Ratings]"
2,tourism_user,"[User_Id, Location, Age]"
3,tourism_wid,"[Place_Id, Place_Name, Description, Category, City, Price, Rating, Time_Minutes, Coordinate, Lat, Long]"


**Note**: Above is a dataframe containing information of what columns each dataset has. This strategy makes the identification of how to merge datasets easier because what different datasets have in common can be easily found. For instance, `tourism_wid` and `tourism_user` cannot be directly merged because they do not share the same identifiers. To merge these two dataset, `tourism_wid` needs to be merged with tourism_rating first on `Place_Id`.

In [4]:
# merge datasets
temp = pd.merge(left=tourism_wid, right=tourism_rating, how='left', on='Place_Id')
merged_df = pd.merge(left=temp, right=tourism_user, how='left', on='User_Id')

In [5]:
# create function to inspect df
def inspect_dataframe(df):
    summary = {
        'ColumnName': df.columns.values.tolist(),
        'Nrow': df.shape[0],
        'DataType': df.dtypes.values.tolist(),
        'NAPct': (df.isna().mean() * 100).round(2).tolist(),
        'DuplicatePct': (df.duplicated().sum()/df.shape[0]).round(2),
        'UniqueValue': df.nunique().tolist(),
        'Sample': [df[col].unique() for col in df.columns]
    }
    return pd.DataFrame(summary)

**Note**: Above is a function to inspect dataframe. This function already covers simple identification of data types, number of entries, missing value rate, duplicate rate, and number of unique values. This function serves as an initial screening on the dataframe. If found any issues during this screening, a further identification will be done to confirm the insights.

In [6]:
pd.options.display.max_colwidth = 50
print(f'The dataframe contains {merged_df.shape[0]} rows and {merged_df.shape[1]} cols.')
print(f"- {len(merged_df.select_dtypes(include='number').columns)} are numeric cols")
print(f"- {len(merged_df.select_dtypes(include='O').columns)} are object cols")
inspect_dataframe(merged_df)

The dataframe contains 10000 rows and 15 cols.
- 9 are numeric cols
- 6 are object cols


Unnamed: 0,ColumnName,Nrow,DataType,NAPct,DuplicatePct,UniqueValue,Sample
0,Place_Id,10000,int64,0.0,0.01,437,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
1,Place_Name,10000,object,0.0,0.01,437,"[Monumen Nasional, Kota Tua, Dunia Fantasi, Ta..."
2,Description,10000,object,0.0,0.01,437,[Monumen Nasional atau yang populer disingkat ...
3,Category,10000,object,0.0,0.01,6,"[Budaya, Taman Hiburan, Cagar Alam, Bahari, Pu..."
4,City,10000,object,0.0,0.01,5,"[Jakarta, Yogyakarta, Bandung, Semarang, Surab..."
5,Price,10000,int64,0.0,0.01,50,"[20000, 0, 270000, 10000, 94000, 25000, 4000, ..."
6,Rating,10000,float64,0.0,0.01,14,"[4.6, 4.5, 4.0, 4.4, 4.2, 4.8, 4.3, 4.7, 5.0, ..."
7,Time_Minutes,10000,float64,53.72,0.01,15,"[15.0, 90.0, 360.0, nan, 60.0, 10.0, 300.0, 15..."
8,Coordinate,10000,object,0.0,0.01,437,"[{'lat': -6.1753924, 'lng': 106.8271528}, {'la..."
9,Lat,10000,float64,0.0,0.01,437,"[-6.1753924, -6.1376448, -6.1253124, -6.302445..."


**Note**
- The merged dataframe consists of 10,000 records and 15 columns in which 9 columns are numeric and the other 6 columns are categorical. Based on the data types only, there is only one numerical feature, i.e., Coordinate, mistakenly identifed as object. It is quite understandable since the column contains strings such as `lat` and `lng` (latitude and longitude, respectively). Even if coordinates are crucial, this column will still not be used because it has been represented by two separate columns, namely `Lat` and `Long`. And for this project, I will not use coordinates so the misidentification can be left untreated.
- One column, namely `Time_Minutes` (originating from `tourism_wid`), contains a significant rate of missing values, constituting roughly a half of the total dataset (53.72%). The missing value treatment in the context of this project is not necessary since this column is not relevant for the goal of this project. 
- In terms of duplicates, the duplicate rate is only 1% of the total data but it remains unknown how many duplicates are in the dataset since the initial screening only provides the percentage of duplicates, not the raw count. 
- Some categorical columns contain high numbr of unique values (high cardinality) such as `Place_Name` and `Description` but since this project currently does not involve any predictive modeling yet, no need to be concerned with the high cardinality issues. 
- What needs to do next is to investigate what number the duplicates is, what data cleaning techniques will be necessary, and whether some categorical columns contain inconsistent values referring to the same entities. 

In [7]:
# remove unnecessary columns
merged_df.drop(columns=['Time_Minutes', 'Coordinate', 'Lat', 'Long'], inplace=True)

In [8]:
# check duplicates
print(f'Total duplicates: {merged_df.duplicated().sum()} entries')
merged_df[merged_df.duplicated(keep=False)].head()

Total duplicates: 79 entries


Unnamed: 0,Place_Id,Place_Name,Description,Category,City,Price,Rating,User_Id,Place_Ratings,Location,Age
151,7,Kebun Binatang Ragunan,Kebun Binatang Ragunan adalah sebuah kebun bin...,Cagar Alam,Jakarta,4000,4.5,278,1,"Jakarta Selatan, DKI Jakarta",40
152,7,Kebun Binatang Ragunan,Kebun Binatang Ragunan adalah sebuah kebun bin...,Cagar Alam,Jakarta,4000,4.5,278,1,"Jakarta Selatan, DKI Jakarta",40
302,15,Pasar Seni,Pasar Seni merupakan Pusat kerajinan dan kesen...,Pusat Perbelanjaan,Jakarta,0,4.4,26,2,"Palembang, Sumatera Selatan",38
303,15,Pasar Seni,Pasar Seni merupakan Pusat kerajinan dan kesen...,Pusat Perbelanjaan,Jakarta,0,4.4,26,2,"Palembang, Sumatera Selatan",38
326,16,Jembatan Kota Intan,Jembatan Kota Intan adalah jembatan tertua di ...,Budaya,Jakarta,0,4.3,34,5,"Sragen, Jawa Tengah",31


In [9]:
merged_df.drop_duplicates(keep='first', inplace=True)
print(f'Total duplicates: {merged_df.duplicated().sum()} entries')

Total duplicates: 0 entries


**Note**: The dataset consists of 79 duplicates, and these redundant values can be handled by keeping the first entry of each duplicates and remove the rest. The treatment is therefore quite straightforward.

In [10]:
url_df = merged_df[merged_df['Description'].str.contains('http')]
number_df = merged_df[merged_df['Description'].str.contains(r'\d', regex=True)]
nonascii_df = merged_df[merged_df['Description'].str.contains(r'[^\x00-\x7F]', regex=True)]
print(f'There are {len(number_df)} rows containing numbers')
print(f'There are {len(url_df)} rows containing URLs')
print(f'There are {len(nonascii_df)} rows containing nonascii characters')

There are 6327 rows containing numbers
There are 0 rows containing URLs
There are 2041 rows containing nonascii characters


**Note**
- In the code here, I checked whether column `Description` which is useful for content-based filtering contains irrelevat items such as URLs, numbers, and non-ASCII characters (e.g., Javanese or Sundanese characters). These items will affect the quality of the data if not removed so treatment to these items are necessary.
- As can be seen, there are a lot of rows containing numbers and non-ASCII characters.
- The removal of these items can be done at once by creating a function to clean text. In the function below, I will preprocess text by firstly changing the letter case into lower. Afterward, I can start removing numbers, non-alphabetic characters, including the non-ASCII ones, and extra white spaces.

In [11]:
# define text cleaning function
def clean_text(text):
    # lower text
    text = text.lower()
    # remove all digits
    text = re.sub(r'\d+', '', text)
    # remove standalone digits
    text = re.sub(r'\b\d+\b', '', text)
    # remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # remove extra white space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# apply to two columns
merged_df['DescriptionClean'] = merged_df['Description'].apply(clean_text)
merged_df['CategoryClean'] = merged_df['Category'].apply(clean_text)

# check results
pd.options.display.max_colwidth = 50
merged_df.head()

Unnamed: 0,Place_Id,Place_Name,Description,Category,City,Price,Rating,User_Id,Place_Ratings,Location,Age,DescriptionClean,CategoryClean
0,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,36,4,"Solo, Jawa Tengah",20,monumen nasional atau yang populer disingkat d...,budaya
1,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,38,2,"Serang, Banten",26,monumen nasional atau yang populer disingkat d...,budaya
2,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,64,2,"Bandung, Jawa Barat",38,monumen nasional atau yang populer disingkat d...,budaya
3,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,74,2,"Semarang, Jawa Tengah",30,monumen nasional atau yang populer disingkat d...,budaya
4,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,86,4,"Depok, Jawa Barat",32,monumen nasional atau yang populer disingkat d...,budaya


**Note**:
-  Two columns have been cleaned. And instead of removing the original ones, I keep them this way just in case I need to check the effects further. The cleaned versions are in column `DescriptionClean` and `CategoryClean`.
- To note that the place names here appear to be redundant but here actually they are not. Each row represents user's activity in the platform. Some users may visit the same places. Furthermore, `Location` here represents the origin of the platform user, not the geographical location of the travel destination.

In [12]:
# check formating: top-10 entries
(merged_df['Place_Name']
 .value_counts()
 .reset_index(name='count')
 .rename(columns={'index':'place_name'})
 .sort_values(by='place_name')
 .head(10))

Unnamed: 0,place_name,count
431,Air Mancur Menari,13
158,Air Terjun Kali Pancur,24
241,Air Terjun Kedung Pedut,22
195,Air Terjun Semirang,23
244,Air Terjun Sri Gethuk,22
231,Alive Museum Ancol,22
116,Alun Alun Selatan Yogyakarta,25
106,Alun-Alun Kota Bandung,26
396,Alun-alun Utara Keraton Yogyakarta,17
187,Amazing Art World,23


In [13]:
# check unque values in CategoryClean
merged_df['CategoryClean'].unique()

array(['budaya', 'taman hiburan', 'cagar alam', 'bahari',
       'pusat perbelanjaan', 'tempat ibadah'], dtype=object)

In [14]:
# check unique values in Location
merged_df['Location'].unique()

array(['Solo, Jawa Tengah', 'Serang, Banten', 'Bandung, Jawa Barat',
       'Semarang, Jawa Tengah', 'Depok, Jawa Barat', 'Bogor, Jawa Barat',
       'Bekasi, Jawa Barat', 'Karawang, Jawa Barat',
       'Jakarta Barat, DKI Jakarta', 'Yogyakarta, DIY',
       'Cirebon, Jawa Barat', 'Jakarta Timur, DKI Jakarta',
       'Purwakarat, Jawa Barat', 'Kota Gede, DIY', 'Ponorogo, Jawa Timur',
       'Lampung, Sumatera Selatan', 'Tanggerang, Banten',
       'Nganjuk, Jawa Timur', 'Jakarta Pusat, DKI Jakarta',
       'Jakarta Utara, DKI Jakarta', 'Sragen, Jawa Tengah',
       'Cilacap, Jawa Tengah', 'Subang, Jawa Barat',
       'Surabaya, Jawa Timur', 'Palembang, Sumatera Selatan',
       'Klaten, Jawa Tengah', 'Jakarta Selatan, DKI Jakarta',
       'Madura, Jawa Timur'], dtype=object)

**Note**: Column `Place_Name`, `CategoryClean`, and `Location` contains high number of unique valus but no inconsistent values were found during the search. Therefore, there is no need for inconsistent value treatment. And if this condition occurs, the inconsistent values can be uniformed by changing the values to the desired ones. 

## **3 Content-Based Filtering Approach**
### **3.1 Recommendation System Building**

**Step-1: Extract features from column `DescriptionClean`**

In [48]:
# feature extraction using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['DescriptionClean'])

# compute cosine similarity between places
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [54]:
tfidf_dense = tfidf_matrix.todense()

# convert to df for readability
tfidf_df = pd.DataFrame(tfidf_matrix.todense(), 
                        columns=tfidf_vectorizer.get_feature_names_out())

# display 5 records and 5 features
print(f'Total features: {tfidf_df.shape[1]} columns & {tfidf_df.shape[0]} rows')
tfidf_df.head().iloc[:, :10]

Total features: 6383 columns & 9921 rows


Unnamed: 0,abad,abah,abang,abdul,abdullah,abdurrahman,abraham,abrasi,abu,acara
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Note**
- To extract term frequency-inverse document frequency (TF-IDF) from column `DescriptionClean`, `TfidfVectorizer` class can be used and pass the list of Indonesian stopwords imported earlier to the class.
- The vectorizer class converts every word in the column into vectors and filter out the stopwords from the output.
- TF-IDF incorporates frequency of term appearing in each row (term frequency, TF) and the rarity of the term across all rows (inverse document frequency, IDF)
$$
w_{x,y}= TF_{x,y} \times \log \frac{N}{df_x}
$$
- In the formula above, $\text{TF}_{x,y}$ represents the frequency of $x$ in $y$. The $df_x$ denotes the number of documents containing term $x$, and $N$ is the total number of documents. For more details on how TF-IDF works, kindly see [my presentation deck](https://drive.google.com/file/d/1vaFM0GCO8WRMHjvvHyPOyQY7Agk03ek3/view) where I explain TF-IDF formula along with the application in machine learning based classification tasks.
- The result of the vectorization can be seen in the dataframe above.

**Step-2: Compute cosine similarity between place names based on their descriptions**  

In [17]:
# create a Series mapping cleaned Place_Name to index
place_name_to_index = pd.Series(merged_df.index, 
                                index=merged_df['Place_Name'].apply(clean_text)).to_dict()

# feature extraction using TF-IDF for the description and category
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['Description'])

# compute cosine similarity between places
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

**Note**
- Here, both cosine similarity and linear kernel actually can be used to compute similarity between two items but linear kernel assumes that the TF-IDF has been normalized (e.g., using L2 normalization `norm=l2` in `Tfidfvectorizer`). This normalization aims to make all vectors directly comparable across different rows regardless of their original lengths. It ensures that the similarity scores reflect true item similarity rather than being influenced by the varying lengths of text descriptions.
$$\text{Normalized TF-IDF} = \frac{\text{TF-IDF vector}}{\parallel \text{TF-IDF vector} \parallel}$$
- The $\parallel \cdot \parallel$ represents the Euclinean norm
- Cosine similarity can is more flexible it handles normalization internally (without adding normalization to `Tfidfvectorizer`). Let us see the comparison between cosine similarity and linear kernel function below.

$$\text{Cosine}(x,y)= \frac{x \cdot y}{\parallel x \parallel \parallel y \parallel}$$

$$\text{linear kernel}(x,y)=x \cdot y$$
- As can be seen the linear kernel function does not involve the Euclidean norm of the vector ($\parallel \cdot \parallel$) whereas cosine similarity does.

**Step-3: Generate recommendations**

In [18]:
# function to get recommendations based on a cleaned place name
def get_recommendations(place_name, cosine_sim=cosine_sim, df=merged_df):
    from IPython.display import display  

    # clean and normalize input using the clean_text function
    if not isinstance(place_name, str):
        print("Invalid input: Place name should be a string.")
        return
    
    cleaned_name = clean_text(place_name)
    
    # check if the cleaned name exists in the place_name_to_index dictionary
    if cleaned_name not in place_name_to_index:
        print(f"Place '{place_name}' not found in the dataset.")
        return
    
    # get the index of the place corresponding to the given place name
    place_idx = place_name_to_index[cleaned_name]
    
    # calculate similarity scores
    sim_scores = list(enumerate(cosine_sim[place_idx]))
    
    # sort scores by similarity, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # exclude the place itself (first item)
    sim_scores = sim_scores[1:]
    
    # check if there are any recommendations
    if not sim_scores:
        print(f"No relevant places found for '{place_name}'.")
        return
    
    # get the indices of the places, ensuring they're within a valid range
    place_indices = [i[0] for i in sim_scores if i[0] < len(df)]
    
    # check if the indices list is empty
    if not place_indices:
        print(f"No relevant places found for '{place_name}'.")
        return
    
    # get the df rows that correspond to these indices
    recommendations = df.iloc[place_indices].drop_duplicates(subset=['Place_Name', 'Description', 'Category', 'Rating']).head(5)

    # check if any recommendations were found
    if recommendations.empty:
        print(f"No relevant places found for '{place_name}'.")
        return
    
    # if recommendations are found, display these
    display(recommendations[['Place_Name', 'Description', 'Category', 'Rating']])

In [19]:
# get recommendations
get_recommendations("monumen nasional")

Unnamed: 0,Place_Name,Description,Category,Rating
1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,4.6
5845,Monumen Bandung Lautan Api,"Monumen Bandung Lautan Api, merupakan monumen ...",Budaya,4.3
5893,Monumen Perjuangan Rakyat Jawa Barat,Monumen Perjuangan Rakyat Jawa Barat (Monju) a...,Budaya,4.5
995,Monumen Selamat Datang,Monumen Selamat Datang adalah sebuah monumen y...,Budaya,4.7
9759,Monumen Bambu Runcing Surabaya,Monumen Bambu Runcing adalah ikon pariwisata S...,Budaya,4.6


In [20]:
# get recommendations
get_recommendations("tugu proklamasi")

Unnamed: 0,Place_Name,Description,Category,Rating
1408,Taman Legenda Keong Emas,Taman Legenda Keong Emas merupakan salah satu ...,Taman Hiburan,4.5
8132,Saloka Theme Park,SALOKA hadir sebagai taman rekreasi terbesar d...,Taman Hiburan,4.4
5449,Taman Lalu Lintas Ade Irma Suryani Nasution,Taman Lalu-lintas Ade Irma Suryani adalah sebu...,Taman Hiburan,4.4
7656,Grand Maerakaca,Masyarakat Jawa Tengah mungkin sudah tidak asi...,Taman Hiburan,4.4
1920,Taman Pintar Yogyakarta,Taman Pintar Yogyakarta (bahasa Jawa: Hanacara...,Taman Hiburan,4.5


### **3.2 Recommendation System Evaluation**

In [21]:
# get recommendations 
get_recommendations("monas")

Place 'monas' not found in the dataset.


**Note**
- The content-based recommendation system here can be implemented as follows: (1) The customer's activity, e.g., last search or last booking, is stored in the platform. (2) The information is then passed to the recommendation system. (2) If there are similar travel destinations based on the place description, the system will offer top-5 recommendation based on the cosine similarity scores. Otherwise, the system should not recommend anything. The output "Place X not found in the dataset" is only for a demonstration purpose only. In real implementation, there should be no such message. 
- While the content-based recommendation is useful, it does not seem to recognize synonyms. In Indonesian context, *Monas* is the short term for *Monumen Nasional*. When the rule-based system handles the same entity using different terms, it fails to recognize and resulting in no recommendations. If this model is implemented, a synonym detection should be placed before feeding the data into the recommendation system.

## **4 Colaborative Filtering Approach**

- As discussed earlier, content-based filtering approach exploits cosine similarity to compute similarity between two travel destination based on place description but actually Jaccard similarity or Pearson's correlation coefficient can also be options.
- Here I will remain using cosine similarity to compute similarity between users for practicality (same as above).
$$\text{Cosine}(x,y)= \frac{x \cdot y}{\parallel x \parallel \parallel y \parallel} = \frac{\sum^n_{i=1}x_iy_i}{\sqrt{\sum_{i=1}^n x^2_i}\sqrt{\sum_{i=1}^ny_i^2}}$$
- Formula here is actually the same as above. I just added some details in how to compute the similarity.
    - The numerator $\sum^n_{i=1}x_iy_i$ is dot product, representing the similarity between two vectors.
    - The denominator $\sqrt{\sum_{i=1}^n x^2_i}$ and $\sqrt{\sum_{i=1}^ny_i^2}$ are the normalization for the first and the second vector.

### **4.1 Recommendation System Building**

**Step-1: Get user-place matrix containing place ratings**

In [55]:
# pivot the data to create a user-item matrix
user_item_matrix = merged_df.pivot_table(index='User_Id', columns='Place_Id',
                                         values='Place_Ratings', fill_value=0)

# show snippet of result
display(user_item_matrix.head().iloc[:, :10])

Place_Id,1,2,3,4,5,6,7,8,9,10
User_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,0,0.0,0.0,0,5,0.0,0,0.0,0,0
2,0,5.0,0.0,0,0,0.0,0,0.0,0,0
3,0,0.0,0.0,0,0,0.0,0,0.0,0,0
4,0,0.0,0.0,4,5,0.0,0,0.0,0,0
5,0,0.0,0.0,3,0,0.0,0,0.0,5,0


**Step-2: Compute cosine similarity between users based on preferences (ratings)**

In [56]:
# Calculate user similarity using cosine similarity
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, 
                                  columns=user_item_matrix.index)

display(user_similarity_df.head().iloc[:, :10])

User_Id,1,2,3,4,5,6,7,8,9,10
User_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,1.0,0.058921,0.010902,0.120602,0.04152,0.027104,0.0,0.090879,0.097536,0.106227
2,0.058921,1.0,0.048176,0.0,0.086006,0.029943,0.011765,0.059059,0.010775,0.088733
3,0.010902,0.048176,1.0,0.028665,0.063653,0.011081,0.065302,0.101078,0.06978,0.03972
4,0.120602,0.0,0.028665,1.0,0.032752,0.116877,0.131601,0.073093,0.06155,0.0327
5,0.04152,0.086006,0.063653,0.032752,1.0,0.166165,0.0,0.150862,0.08305,0.060511


**Step-3: Show recommendations based on places rated by similar users**

In [24]:
# function to get top N similar users
def get_similar_users(user_id, n=3):
    # Sort users by similarity to the input user_id
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)
    return similar_users.iloc[1:n+1]  # Exclude the user themselves

# function to recommend places for a user based on similar users' preferences
def recommend_places(user_id, n_recommendations=5):
    similar_users = get_similar_users(user_id)
    similar_user_ids = similar_users.index
    
    # get the places these similar users have rated
    similar_users_ratings = user_item_matrix.loc[similar_user_ids].mean(axis=0)
    
    # filter out places the target user has already rated
    user_rated_places = user_item_matrix.loc[user_id]
    unrated_places = similar_users_ratings[user_rated_places == 0]
    
    # recommend the top N unrated places
    recommendations = unrated_places.sort_values(ascending=False).head(n_recommendations)
    return recommendations.index

# function to show the final recommendations
def get_detailed_recommendations(user_id, n_recommendations=5):
    recommended_place_ids = recommend_places(user_id, n_recommendations)
    detailed_recommendations = df[df['Place_Id'].isin(recommended_place_ids)].drop_duplicates('Place_Id')
    return detailed_recommendations[['Place_Name', 'Description', 'Category', 'City', 'Price']]

In [25]:
# get recommendations for user with id 36
get_detailed_recommendations(user_id=36)

Unnamed: 0,Place_Name,Description,Category,City,Price
36,Bumi Perkemahan Cibubur,Bumi Perkemahan dan Graha Wisata Pramuka Cibub...,Taman Hiburan,Jakarta,10000
148,Goa Cerme,"Gua Cerme (bahasa Jawa: ꦒꦸꦮ​ꦕꦺꦂꦩꦺ, translit. G...",Cagar Alam,Yogyakarta,3000
168,Puncak Segoro,Puncak Segoro menjadi destinasi wisata terbaru...,Cagar Alam,Yogyakarta,5000
231,Bukit Moko,Bandung sebagai destinasi wisata tak pernah ad...,Cagar Alam,Bandung,25000
365,Tirto Argo Siwarak,"Kolam Renang Tirto Argo Siwarak, merupakan kol...",Taman Hiburan,Semarang,20000


### **4.2 Recommendation System Evaluation**

To evaluate why *Bumi Perkemahan Cibubur* is recommended to the user ID 36, we first need to check to which user ID this individual is similar based on the cosine similarity. This can be achieved by exploiting the function previously defined `get_similar_user`. After getting the users, I will check the travel destination each individual has gone and rated.

In [26]:
# get similar users
get_similar_users(user_id=36).reset_index(name='sim_score').round(3)

Unnamed: 0,User_Id,sim_score
0,162,0.223
1,253,0.192
2,22,0.191


**Note**: User ID 36 is closer to User 162, user 253, and user 22. Before moving on to examining the travel destinations of each individuals, I need to make sure if User 36 has not visited *Bumi Perkemahan Cibubur* because the algorithm set up above already filter out destinations which the user has rated (based on ratings). 

In [27]:
user_36 = merged_df[merged_df['User_Id'] == 36]
print(f'This user has {len(user_36)} records')
if user_36['Place_Name'].str.contains('bumi perkemahan', case=False).any(): 
    print('This user has visited this place')
else:
    print('This user has not visited this place')
display(user_36[['Place_Name', 'City', 'Place_Ratings']].head())

This user has 33 records
This user has not visited this place


Unnamed: 0,Place_Name,City,Place_Ratings
0,Monumen Nasional,Jakarta,4
66,Taman Mini Indonesia Indah (TMII),Jakarta,2
484,Gereja Katedral,Jakarta,5
507,Museum Nasional,Jakarta,3
693,Setu Babakan,Jakarta,5


**Note**: It has been confirmed that this person has not visited the place. Now, the task is to find out which of the three user has visited the location and gave high ratings. 

In [45]:
(merged_df[(merged_df['User_Id'].isin([162, 253, 22])) & 
           merged_df['Place_Name'].str.contains('bumi', case=False)]
 .drop(columns=['Description', 'Category']))

Unnamed: 0,Place_Id,Place_Name,City,Price,Rating,User_Id,Place_Ratings,Location,Age,DescriptionClean,CategoryClean
813,37,Bumi Perkemahan Cibubur,Jakarta,10000,4.5,22,5,"Subang, Jawa Barat",25,bumi perkemahan dan graha wisata pramuka cibub...,taman hiburan
830,37,Bumi Perkemahan Cibubur,Jakarta,10000,4.5,253,1,"Cirebon, Jawa Barat",34,bumi perkemahan dan graha wisata pramuka cibub...,taman hiburan


**Note**
- User 22 which shares similarity in terms of preference as to User 36 has visited *Bumi Perkemahan* and gave perfect rating score of 5. Now, it is clear why this travel destination is recommended to User 36. 
- One drawback of this recommendation system is that if the place has not been rated previously, recommending the place to an individual will not happen because the algorithm firstly creates a matrix of users and travel desinations based on the place ratings. When the place is not rated, the place will not be listed in the matrix as the basis for cosine similarity and the following steps.

## **5 Hybrid Approach**

### **5.1 Recommendation System Building**

In [44]:
# display dataframe
pd.options.display.max_colwidth = 50
merged_df.drop(columns=['Description', 'Category']).head()

Unnamed: 0,Place_Id,Place_Name,City,Price,Rating,User_Id,Place_Ratings,Location,Age,DescriptionClean,CategoryClean
0,1,Monumen Nasional,Jakarta,20000,4.6,36,4,"Solo, Jawa Tengah",20,monumen nasional atau yang populer disingkat d...,budaya
1,1,Monumen Nasional,Jakarta,20000,4.6,38,2,"Serang, Banten",26,monumen nasional atau yang populer disingkat d...,budaya
2,1,Monumen Nasional,Jakarta,20000,4.6,64,2,"Bandung, Jawa Barat",38,monumen nasional atau yang populer disingkat d...,budaya
3,1,Monumen Nasional,Jakarta,20000,4.6,74,2,"Semarang, Jawa Tengah",30,monumen nasional atau yang populer disingkat d...,budaya
4,1,Monumen Nasional,Jakarta,20000,4.6,86,4,"Depok, Jawa Barat",32,monumen nasional atau yang populer disingkat d...,budaya


**Note**: Above is the combined dataframe I used for building recommendation systems based on cosine similarity above. This is just to remind that some rows may seem to contain similar entries despite not duplicates. The similar entries because each row represent activities of an individual user (user-based rows).

**Step-1: Compute cosine similarity between place names based on their descriptions**

In [30]:
# content-based filtering
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['DescriptionClean'])
content_based_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

**Step-2: Compute cosine similarity based on users**

In [31]:
# compute mean based on user id and place id
merged_df2 = merged_df.groupby(['User_Id', 'Place_Id'], as_index=False).agg({'Rating': 'mean'})
merged_df2.head()

Unnamed: 0,User_Id,Place_Id,Rating
0,1,5,4.5
1,1,15,4.4
2,1,20,4.5
3,1,21,4.5
4,1,41,4.4


In [42]:
# collaborative filtering
user_item_matrix = merged_df2.pivot(index='User_Id', columns='Place_Id', values='Rating').fillna(0)
display(user_item_matrix.head(10).iloc[:, 0:10])
collab_sim = cosine_similarity(user_item_matrix)

Place_Id,1,2,3,4,5,6,7,8,9,10
User_Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,0.0,0.0,0.0,0.0,4.5,0.0,0.0,0.0,0.0,0.0
2,0.0,4.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,4.5,4.5,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,4.5,0.0,0.0,0.0,0.0,4.4,0.0
6,0.0,0.0,0.0,4.5,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Step-3: Display recommendations combining both similarity measures**

In [33]:
# create function for hybrid recommendation 
def get_hybrid_recommendations(place_idx, content_based_sim, collab_sim, alpha=0.5, df=merged_df):
    # get content-based scores
    content_scores = list(enumerate(content_based_sim[place_idx]))
    
    # get collaborative filtering scores
    collab_scores = list(enumerate(collab_sim[place_idx]))
    
    # combine scores
    content_scores = [score[1] for score in content_scores]
    collab_scores = [score[1] for score in collab_scores]
    
    combined_scores = [alpha * content + (1 - alpha) * collab for content, collab in zip(content_scores, collab_scores)]
    
    # sort and exclude the place itself
    sim_scores = list(enumerate(combined_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:]
    
    # get the indices of the recommended places
    place_indices = [i[0] for i in sim_scores if i[0] < len(df)]
    
    # filter duplicates based on Place_Id
    unique_place_indices = []
    seen_place_ids = set()
    for idx in place_indices:
        place_id = df.iloc[idx]['Place_Id']
        if place_id not in seen_place_ids:
            unique_place_indices.append(idx)
            seen_place_ids.add(place_id)
    
    # return df rows that correspond to these indices
    return df.loc[unique_place_indices].head()[['Place_Name', 'Description', 'Category', 'City', 'Price']]

**Note**
- The parameter alpha ($\alpha$) here refers to weighting between the content-based and user-based filtering scores. Suppose that the $\alpha$ is set to 0.5. It means that both content-based and user-based filtering scores are equal. This can be adjusted based on the preference of the stakeholder after its implementation and evaluation.
- The calculation is done as follows:
$$
\text{combined score} = \alpha \times \text{content score} + (1 - \alpha) \times \text{collaboration score}
$$

In [34]:
# check oputput
get_hybrid_recommendations(place_idx=66, content_based_sim=content_based_sim, 
                           collab_sim=collab_sim, alpha=0.5)

Unnamed: 0,Place_Name,Description,Category,City,Price
65,Taman Mini Indonesia Indah (TMII),Taman Mini Indonesia Indah merupakan suatu kaw...,Taman Hiburan,Jakarta,10000
44,Dunia Fantasi,Dunia Fantasi atau disebut juga Dufan adalah t...,Taman Hiburan,Jakarta,270000
213,Pulau Tidung,Pulau Tidung adalah salah satu kelurahan di ke...,Bahari,Jakarta,150000
8,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000
180,Ocean Ecopark,Ocean Ecopark Salah satu zona rekreasi Ancol y...,Taman Hiburan,Jakarta,180000


### **5.2 Recommendation System Evaluation**
To evaluate the hybrid recommendation system, I need to see whether Place 66 is frequently visited by different users. Afterwards, I will check the extent to which the place being recommended is simiar to Place 66.

In [35]:
# check place name of Place 66
merged_df[merged_df['Place_Id'] == 66].drop_duplicates(subset='Place_Id').iloc[:, 0:3]

Unnamed: 0,Place_Id,Place_Name,Description
1491,66,Museum Layang-layang,Museum Layang-Layang adalah sebuah museum yang...


In [36]:
# how many people visited this place
len(merged_df[merged_df['Place_Id'] == 66])

21

**Note**: Place 66 has been visited for 66 times. This result suggests that this place is quite popular among the platform users.

In [37]:
display(merged_df.groupby('Place_Name', as_index=False)
        .agg(func={'Place_Ratings':'mean'}).round(2)
        .sort_values(by='Place_Ratings', ascending=False)
        .head())
display(merged_df.groupby('Place_Name', as_index=False)
        .agg(func={'Place_Ratings':'mean'}).round(2)
        .sort_values(by='Place_Ratings', ascending=False)
        .tail())

Unnamed: 0,Place_Name,Place_Ratings
158,Keraton Surabaya,3.93
312,Puncak Gunung Api Purba - Nglanggeran,3.88
135,Kampung Cina,3.84
25,Bukit Jamur,3.79
402,Teras Cikapundung BBWS,3.79


Unnamed: 0,Place_Name,Place_Ratings
295,Patung Sura dan Buaya,2.25
9,Amazing Art World,2.22
279,Pantai Sanglen,2.21
399,Tebing Breksi,2.17
354,Taman Barunawati,2.06


In [38]:
# recommended places based on place ID 66
(merged_df[merged_df['Place_Name'].isin(['Taman Mini Indonesia Indah (TMII)', 'Dunia Fantasi', 
                                         'Pulau Tidung', 'Monumen Nasional', 'Ocean Ecopark'])]
 .groupby('Place_Name', as_index=False)
 .agg(func={'Place_Ratings':'mean'})
 ).sort_values(by='Place_Ratings', ascending=False)

Unnamed: 0,Place_Name,Place_Ratings
1,Monumen Nasional,3.722222
2,Ocean Ecopark,3.133333
4,Taman Mini Indonesia Indah (TMII),2.857143
3,Pulau Tidung,2.714286
0,Dunia Fantasi,2.526316


**Note**
- When using user-based approach to place recommendation, the system will prioritize high rated places (e.g., *Keraton Surabaya*, *Puncak Gunung Api Purba - Nglanggeran*, or *Kampung Cina*) since user preference is operated in the form of ratings (see the code below). The low rated places (e.g., *Taman Barunawati*, *Patung Sura dan Buaya*, or *Amazing Art World*) or even those with no ratings at all (no records) will not be recommended to the platform users. As a result, the possibility of the places to get booked small.
- However, using a hybrid approach, a lower rated places can also be recommended. As can be seen, even though *Taman Mini Indonesian Inda (TMII)*, *Pulau Tidung*, and *Dunia Fantasi* have rating lower than 3.0, the hybrid recommendation system remains recommending the places to users as this recommender does not only takes user preference into account but also content relevance based on the place description (see the cosine similarity below).

```python
# pivot the data to create a user-item matrix
user_item_matrix = merged_df.pivot_table(index='User_Id', columns='Place_Id',
                                         values='Place_Ratings', fill_value=0)

display(user_item_matrix.head())
```

In [39]:
# compare Museum Layang-layang (ID 66) and TMII
result = (merged_df[(merged_df['Place_Name'] == 'Museum Layang-layang') |
                   (merged_df['Place_Name'] == 'Taman Mini Indonesia Indah (TMII)')]
          .drop_duplicates(subset='Place_Name'))

# display
pd.options.display.max_colwidth = None
result[['Place_Name', 'DescriptionClean']]

Unnamed: 0,Place_Name,DescriptionClean
62,Taman Mini Indonesia Indah (TMII),taman mini indonesia indah merupakan suatu kawasan taman wisata bertema budaya indonesia di jakarta timur area seluas kurang lebih hektare atau kilometer persegi ini terletak pada koordinat lsbt
1491,Museum Layang-layang,museum layanglayang adalah sebuah museum yang terletak di jl h kamang no pondok labu jakarta selatan museum ini merupakan museum layanglayang pertama di indonesia jumlah koleksi layanglayang di museum ini berjumlah tetapi jumlah tersebut terus bertambah seiring datangnya koleksikoleksi baru dari para pelayang daerah dan luar negeri maupun layanglayang yang dibuat sendiri oleh karyawan museum museum layanglayang buka setiap hari mulai pukul wib hari libur nasional museum layanglayang tutup


In [40]:
# display two instances
tfidf_matrix1 = tfidf_vectorizer.fit_transform(result['DescriptionClean'])
feature_names1 = tfidf_vectorizer.get_feature_names_out()

# display in dataframe
pd.DataFrame(tfidf_matrix1.toarray(), columns=feature_names1).round(2)

Unnamed: 0,adalah,area,atau,baru,berjumlah,bertambah,bertema,budaya,buka,daerah,...,taman,terletak,tersebut,terus,tetapi,timur,tutup,wib,wisata,yang
0,0.0,0.19,0.19,0.0,0.0,0.0,0.19,0.19,0.0,0.0,...,0.38,0.14,0.0,0.0,0.0,0.19,0.0,0.0,0.19,0.0
1,0.08,0.0,0.0,0.08,0.08,0.08,0.0,0.0,0.08,0.08,...,0.0,0.06,0.08,0.08,0.08,0.0,0.08,0.08,0.0,0.16


In [41]:
# compute cosine similarity between two places
round(cosine_similarity(tfidf_matrix1, tfidf_matrix1)[0, 1], 2)

0.08

## **6 Conclusions**
This project aims to build a recommendation system to personalize recommendations of Indonesian travel destinations to the platform users. This project involves building content-based and user-based recommendation systems and highlighting the drawbacks of each systems. To overcome the limitations of each recommendation system, this project combines both approaches into a hybrid recommendation system, taking both content and user-preferences into account. This hybrid recommender system, therefore, offer better advantages in comparison to the two previous recommenders by addressing limitations of each recommender.

<h1 style='text-align:center; font-weight:bold; color:orange'>---END---</h1>