<h1 style='text-align:center; color:orange; font-weight:bold'>A Hybrid Recommendation System for Indonesian Travel Destination</h1>

In [202]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

## **1 Introduction**
In rapid gowing tourism industry in Indonesia, a recommendation system becomes a key instrument in travel platforms. A recommendation system does not only offers personalization to customers based on their individual preferences or past behaviors but also streamlines travel planning process. With the system, customers save more time and effort as the system displays contents relevant travel destination to them according to their platform data (e.g., previous search or booking). In the business perspective, a recommendation system helps company generate more bookings and purchases by promoting relevant contents to customers.

In a highly competitive market, travel businesses face challenges in maximizing operational efficiency and revenue. In Indonesia, there have been many giant travel companies such as Agoda, Traveloka, and Booking\.com. Countless travel options are available, complicating the process of targeting relevant contents to the customers, and thus, leading to potential missed revenue opportunities.

To address this challenge, I will build a recommendation system to increase the conversion rates by showing relevant travel destination to the customers. While recommendation system which will be built here is a hybrid one, I will show how content-based and collaborative filtering systems work. Hybrid recommendation system uses both content-based and collaborative filtering to deal with weaknesses of each recommendation system. 
- Content-based filtering sometimes can be too similar to the user's past activities, leading to more limited exposure to new and different items. Hybrid approach combines two recommendation systems to find balance between relevance and recency.
- Collaborative filtering system may favor popular items. As a result, the recommendations can lack diversity and potentially neglect less popular items. Hybrid approach reduces this bias to the popular items.
It is important to highlight here that the hybrid recommendation system here is not a machine learning solution, but a distance-based approach for an initial development. This distance-based hybrid recommendation system is more straightforward because it only uses similarity measures, making it easier to implement. 

The datasets listed below were taken from [Prabowo](https://www.kaggle.com/datasets/aprabowo/indonesia-tourism-destination?select=tourism_with_id.csv) on Kaggle. In addition to these four datasets, I will also use Indonesian stop word list collected from [Hartono](https://www.kaggle.com/datasets/oswinrh/indonesian-stoplist) on the same platform. This list is crucial since making content-based filtering requires text preprocessing, particularly for removing non-content words which are not relevant for the goal.
- `package_tourism`: file containing places, time, cost, and rating
- `tourism_rating`: consists of three columns, namely user, place, and rating
- `tourism_with_id`: tourist attractions in 5 cities in Indonesia
- `user`: user data

## **2 Data Preparation**
In this step, I will check the cleanliness of the data such as missing values and duplicates. However, before doing this, merging datasets must be done since there are four datasets. Afterwards, text cleaning will be performed to avoid noise in the text preprocessing technique, especially for content-based filtering. 

In [121]:
# load dataset
tourism_package = pd.read_csv('../data/package_tourism.csv')
tourism_rating = pd.read_csv('../data/tourism_rating.csv')
tourism_user = pd.read_csv('../data/user.csv')
tourism_wid = pd.read_csv('../data/tourism_with_id.csv').drop(columns=['Unnamed: 11', 'Unnamed: 12'])
stopwords = list(pd.read_csv('../data/stopwordbahasa.csv'))

**Note**: After successfully importing the datasets, including the list of stopwords, I need to check primary keys in each dataset so merging different datasets can be done. And to make the identification process easier, I will put the result in the form of a dataframe. 

In [122]:
# List of DataFrames with their names
dataframes = {
    'tourism_package': tourism_package,
    'tourism_rating': tourism_rating,
    'tourism_user': tourism_user,
    'tourism_wid': tourism_wid
}

# Create a list to store dataset names and their columns
data = []
for name, df in dataframes.items():
    columns = list(df.columns)
    data.append({'Dataset': name, 'Columns': columns})

# Create a DataFrame from the collected data
result_df = pd.DataFrame(data)
pd.options.display.max_colwidth = None
result_df

Unnamed: 0,Dataset,Columns
0,tourism_package,"[Package, City, Place_Tourism1, Place_Tourism2, Place_Tourism3, Place_Tourism4, Place_Tourism5]"
1,tourism_rating,"[User_Id, Place_Id, Place_Ratings]"
2,tourism_user,"[User_Id, Location, Age]"
3,tourism_wid,"[Place_Id, Place_Name, Description, Category, City, Price, Rating, Time_Minutes, Coordinate, Lat, Long]"


**Note**: Above is a dataframe containing information of what columns each dataset has. This strategy makes the identification of how to merge datasets easier because what different datasets have in common can be easily found. For instance, `tourism_wid` and `tourism_user` cannot be directly merged because they do not share the same identifiers. To merge these two dataset, `tourism_wid` needs to be merged with tourism_rating first on `Place_Id`.

In [159]:
# merge dataset
temp = pd.merge(left=tourism_wid, right=tourism_rating, how='left', on='Place_Id')
merged_df = pd.merge(left=temp, right=tourism_user, how='left', on='User_Id')

In [124]:
# create function to inspect df
def inspect_dataframe(df):
    summary = {
        'ColumnName': df.columns.values.tolist(),
        'Nrow': df.shape[0],
        'DataType': df.dtypes.values.tolist(),
        'NAPct': (df.isna().mean() * 100).round(2).tolist(),
        'DuplicatePct': (df.duplicated().sum()/df.shape[0]).round(2),
        'UniqueValue': df.nunique().tolist(),
        'Sample': [df[col].unique() for col in df.columns]
    }
    return pd.DataFrame(summary)

**Note**: Above is a function to inspect dataframe. This function already covers simple identification of data types, number of entries, missing value rate, duplicate rate, and number of unique values. This function serves as an initial screening on the dataframe. If found any issues during this screening, a further identification will be done to confirm the insights.

In [160]:
pd.options.display.max_colwidth = 50
print(f'The dataframe contains {merged_df.shape[0]} rows and {merged_df.shape[1]} cols.')
print(f"- {len(merged_df.select_dtypes(include='number').columns)} are numeric cols")
print(f"- {len(merged_df.select_dtypes(include='O').columns)} are object cols")
inspect_dataframe(merged_df)

The dataframe contains 10000 rows and 15 cols.
- 9 are numeric cols
- 6 are object cols


Unnamed: 0,ColumnName,Nrow,DataType,NAPct,DuplicatePct,UniqueValue,Sample
0,Place_Id,10000,int64,0.0,0.01,437,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
1,Place_Name,10000,object,0.0,0.01,437,"[Monumen Nasional, Kota Tua, Dunia Fantasi, Ta..."
2,Description,10000,object,0.0,0.01,437,[Monumen Nasional atau yang populer disingkat ...
3,Category,10000,object,0.0,0.01,6,"[Budaya, Taman Hiburan, Cagar Alam, Bahari, Pu..."
4,City,10000,object,0.0,0.01,5,"[Jakarta, Yogyakarta, Bandung, Semarang, Surab..."
5,Price,10000,int64,0.0,0.01,50,"[20000, 0, 270000, 10000, 94000, 25000, 4000, ..."
6,Rating,10000,float64,0.0,0.01,14,"[4.6, 4.5, 4.0, 4.4, 4.2, 4.8, 4.3, 4.7, 5.0, ..."
7,Time_Minutes,10000,float64,53.72,0.01,15,"[15.0, 90.0, 360.0, nan, 60.0, 10.0, 300.0, 15..."
8,Coordinate,10000,object,0.0,0.01,437,"[{'lat': -6.1753924, 'lng': 106.8271528}, {'la..."
9,Lat,10000,float64,0.0,0.01,437,"[-6.1753924, -6.1376448, -6.1253124, -6.302445..."


**Note**
- The merged dataframe consists of 10,000 records and 15 columns in which 9 columns are numeric and the other 6 columns are categorical. Based on the data types only, there is only one numerical feature, i.e., Coordinate, mistakenly identifed as object. It is quite understandable since the column contains strings such as `lat` and `lng` (latitude and longitude, respectively). Even if coordinates are crucial, this column will still not be used because it has been represented by two separate columns, namely `Lat` and `Long`. And for this project, I will not use coordinates so the misidentification can be left untreated.
- One column, namely `Time_Minutes` (originating from `tourism_wid`), contains a significant rate of missing values, constituting roughly a half of the total dataset (53.72%). The missing value treatment in the context of this project is not necessary since this column is not relevant for the goal of this project. 
- In terms of duplicates, the duplicate rate is only 1% of the total data but it remains unknown how many duplicates are in the dataset since the initial screening only provides the percentage of duplicates, not the raw count. 
- Some categorical columns contain high numbr of unique values (high cardinality) such as `Place_Name` and `Description` but since this project currently does not involve any predictive modeling yet, no need to be concerned with the high cardinality issues. 
- What needs to do next is to investigate what number the duplicates is, what data cleaning techniques will be necessary, and whether some categorical columns contain inconsistent values referring to the same entities. 

In [161]:
# remove unnecessary columns
merged_df.drop(columns=['Time_Minutes', 'Coordinate', 'Lat', 'Long'], inplace=True)

In [162]:
# check duplicates
print(f'Total duplicates: {merged_df.duplicated().sum()} entries')
merged_df[merged_df.duplicated(keep=False)].head()

Total duplicates: 79 entries


Unnamed: 0,Place_Id,Place_Name,Description,Category,City,Price,Rating,User_Id,Place_Ratings,Location,Age
151,7,Kebun Binatang Ragunan,Kebun Binatang Ragunan adalah sebuah kebun bin...,Cagar Alam,Jakarta,4000,4.5,278,1,"Jakarta Selatan, DKI Jakarta",40
152,7,Kebun Binatang Ragunan,Kebun Binatang Ragunan adalah sebuah kebun bin...,Cagar Alam,Jakarta,4000,4.5,278,1,"Jakarta Selatan, DKI Jakarta",40
302,15,Pasar Seni,Pasar Seni merupakan Pusat kerajinan dan kesen...,Pusat Perbelanjaan,Jakarta,0,4.4,26,2,"Palembang, Sumatera Selatan",38
303,15,Pasar Seni,Pasar Seni merupakan Pusat kerajinan dan kesen...,Pusat Perbelanjaan,Jakarta,0,4.4,26,2,"Palembang, Sumatera Selatan",38
326,16,Jembatan Kota Intan,Jembatan Kota Intan adalah jembatan tertua di ...,Budaya,Jakarta,0,4.3,34,5,"Sragen, Jawa Tengah",31


In [163]:
merged_df.drop_duplicates(keep='first', inplace=True)
print(f'Total duplicates: {merged_df.duplicated().sum()} entries')

Total duplicates: 0 entries


**Note**: The dataset consists of 79 duplicates, and these redundant values can be handled by keeping the first entry of each duplicates and remove the rest. The treatment is therefore quite straightforward.

In [165]:
url_df = merged_df[merged_df['Description'].str.contains('http')]
number_df = merged_df[merged_df['Description'].str.contains(r'\d', regex=True)]
nonascii_df = merged_df[merged_df['Description'].str.contains(r'[^\x00-\x7F]', regex=True)]
print(f'There are {len(number_df)} rows containing numbers')
print(f'There are {len(url_df)} rows containing URLs')
print(f'There are {len(nonascii_df)} rows containing nonascii characters')

There are 0 rows containing URLs
There are 6327 rows containing numbers
There are 2041 rows containing nonascii characters


**Note**
- In the code here, I checked whether column `Description` which is useful for content-based filtering contains irrelevat items such as URLs, numbers, and non-ASCII characters (e.g., Javanese or Sundanese characters). These items will affect the quality of the data if not removed so treatment to these items are necessary.
- As can be seen, there are a lot of rows containing numbers and non-ASCII characters.
- The removal of these items can be done at once by creating a function to clean text. In the function below, I will preprocess text by firstly changing the letter case into lower. Afterward, I can start removing numbers, non-alphabetic characters, including the non-ASCII ones, and extra white spaces.

In [178]:
# define text cleaning function
def clean_text(text):
    # lower text
    text = text.lower()
    # remove all digits
    text = re.sub(r'\d+', '', text)
    # remove standalone digits
    text = re.sub(r'\b\d+\b', '', text)
    # remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # remove extra white space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# apply to two columns
merged_df['DescriptionClean'] = merged_df['Description'].apply(clean_text)
merged_df['CategoryClean'] = merged_df['Category'].apply(clean_text)

# check results
pd.options.display.max_colwidth = 50
merged_df.head()

Unnamed: 0,Place_Id,Place_Name,Description,Category,City,Price,Rating,User_Id,Place_Ratings,Location,Age,DescriptionClean,CategoryClean
0,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,36,4,"Solo, Jawa Tengah",20,monumen nasional atau yang populer disingkat d...,budaya
1,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,38,2,"Serang, Banten",26,monumen nasional atau yang populer disingkat d...,budaya
2,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,64,2,"Bandung, Jawa Barat",38,monumen nasional atau yang populer disingkat d...,budaya
3,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,74,2,"Semarang, Jawa Tengah",30,monumen nasional atau yang populer disingkat d...,budaya
4,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,86,4,"Depok, Jawa Barat",32,monumen nasional atau yang populer disingkat d...,budaya


**Note**: Two columns have been cleaned. And instead of removing the original ones, I keep them this way just in case I need to check the effects further. The cleaned versions are in column `DescriptionClean` and `CategoryClean`.

In [171]:
# check formating: top-10 entries
merged_df['Place_Name'].value_counts().reset_index(name='count').rename(columns={'index':'place_name'}).sort_values(by='place_name').head(10)

Unnamed: 0,place_name,count
431,Air Mancur Menari,13
158,Air Terjun Kali Pancur,24
241,Air Terjun Kedung Pedut,22
195,Air Terjun Semirang,23
244,Air Terjun Sri Gethuk,22
231,Alive Museum Ancol,22
116,Alun Alun Selatan Yogyakarta,25
106,Alun-Alun Kota Bandung,26
396,Alun-alun Utara Keraton Yogyakarta,17
187,Amazing Art World,23


In [182]:
# check unque values in CategoryClean
merged_df['CategoryClean'].unique()

array(['budaya', 'taman hiburan', 'cagar alam', 'bahari',
       'pusat perbelanjaan', 'tempat ibadah'], dtype=object)

In [183]:
# check unique values in Location
merged_df['Location'].unique()

array(['Solo, Jawa Tengah', 'Serang, Banten', 'Bandung, Jawa Barat',
       'Semarang, Jawa Tengah', 'Depok, Jawa Barat', 'Bogor, Jawa Barat',
       'Bekasi, Jawa Barat', 'Karawang, Jawa Barat',
       'Jakarta Barat, DKI Jakarta', 'Yogyakarta, DIY',
       'Cirebon, Jawa Barat', 'Jakarta Timur, DKI Jakarta',
       'Purwakarat, Jawa Barat', 'Kota Gede, DIY', 'Ponorogo, Jawa Timur',
       'Lampung, Sumatera Selatan', 'Tanggerang, Banten',
       'Nganjuk, Jawa Timur', 'Jakarta Pusat, DKI Jakarta',
       'Jakarta Utara, DKI Jakarta', 'Sragen, Jawa Tengah',
       'Cilacap, Jawa Tengah', 'Subang, Jawa Barat',
       'Surabaya, Jawa Timur', 'Palembang, Sumatera Selatan',
       'Klaten, Jawa Tengah', 'Jakarta Selatan, DKI Jakarta',
       'Madura, Jawa Timur'], dtype=object)

## **3 Content-Based Filtering Approach**
### **3.1 Recommendation System Building**

**Step-1: Extract features from column `Description`**

In [192]:
merged_df[['DescriptionClean', 'CategoryClean']].head()

Unnamed: 0,DescriptionClean,CategoryClean
0,monumen nasional atau yang populer disingkat d...,budaya
1,monumen nasional atau yang populer disingkat d...,budaya
2,monumen nasional atau yang populer disingkat d...,budaya
3,monumen nasional atau yang populer disingkat d...,budaya
4,monumen nasional atau yang populer disingkat d...,budaya


In [147]:
# # Feature extraction using TF-IDF for the description and category
# tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
# tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['DescriptionClean'] + ' ' + merged_df['CategoryClean'])

# # Compute cosine similarity between places
# cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [193]:
# Feature extraction using TF-IDF for the description and category
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['DescriptionClean'])

# Compute cosine similarity between places
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [194]:
tfidf_dense = tfidf_matrix.todense()

# Convert to DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.todense(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head()

Unnamed: 0,abad,abah,abang,abdul,abdullah,abdurrahman,abraham,abrasi,abu,acara,...,yustinus,zaman,zamannya,zeeland,zeven,zheng,ziarah,zona,zoo,zuider
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Step-2: Compute cosine similarity of place names based on their descriptions**  

In [225]:
# Create a Series mapping cleaned Place_Name to index
place_name_to_index = pd.Series(merged_df.index, index=merged_df['Place_Name'].apply(clean_text)).to_dict()

# Feature extraction using TF-IDF for the description and category
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['Description'])

# Compute cosine similarity between places
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

**Step-3: Generate recommendations**

In [226]:
# Function to get recommendations based on a cleaned place name
def get_recommendations(place_name, cosine_sim=cosine_sim, df=merged_df):
    # Clean and normalize input using the clean_text function
    if not isinstance(place_name, str):
        return "Invalid input: Place name should be a string."
    
    cleaned_name = clean_text(place_name)  # Normalize input using the cleaning function
    
    if cleaned_name not in place_name_to_index:
        return f"Place '{place_name}' not found in the dataset."
    # Get the index of the place corresponding to the given place name
    place_idx = place_name_to_index[cleaned_name]
    # Calculate similarity scores
    sim_scores = list(enumerate(cosine_sim[place_idx]))
    # Sort scores by similarity, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Exclude the place itself (first item)
    sim_scores = sim_scores[1:]
    
    # Check if there are any recommendations
    if not sim_scores:
        return f"No relevant places found for '{place_name}'."
    # Get the indices of the places, ensuring they're within valid range
    place_indices = [i[0] for i in sim_scores if i[0] < len(df)]
    
    # Check if the indices list is empty
    if not place_indices:
        return f"No relevant places found for '{place_name}'."
    # Return the DataFrame rows that correspond to these indices
    recommendations = df.iloc[place_indices].drop_duplicates(subset=['Place_Name', 'Description', 'Category', 'Rating']).head(5)

    # Check if any recommendations were found
    if recommendations.empty:
        return f"No relevant places found for '{place_name}'."
    
    return recommendations

In [227]:
# Example usage: Get recommendations for "Monumen Nasional"
recommended_places = get_recommendations("monumen nasional")
if isinstance(recommended_places, str):
    print(recommended_places)  # Print the message if it's a string
else:
    display(recommended_places[['Place_Name', 'Description', 'Category', 'Rating']])

Unnamed: 0,Place_Name,Description,Category,Rating
1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,4.6
5845,Monumen Bandung Lautan Api,"Monumen Bandung Lautan Api, merupakan monumen ...",Budaya,4.3
5893,Monumen Perjuangan Rakyat Jawa Barat,Monumen Perjuangan Rakyat Jawa Barat (Monju) a...,Budaya,4.5
995,Monumen Selamat Datang,Monumen Selamat Datang adalah sebuah monumen y...,Budaya,4.7
9759,Monumen Bambu Runcing Surabaya,Monumen Bambu Runcing adalah ikon pariwisata S...,Budaya,4.6


In [228]:
# Example usage: Get recommendations for "monas"
recommended_places = get_recommendations("tugu proklamasi")
if isinstance(recommended_places, str):
    print(recommended_places)  
else:
    display(recommended_places[['Place_Name', 'Description', 'Category', 'Rating']])

Unnamed: 0,Place_Name,Description,Category,Rating
1408,Taman Legenda Keong Emas,Taman Legenda Keong Emas merupakan salah satu ...,Taman Hiburan,4.5
8132,Saloka Theme Park,SALOKA hadir sebagai taman rekreasi terbesar d...,Taman Hiburan,4.4
5449,Taman Lalu Lintas Ade Irma Suryani Nasution,Taman Lalu-lintas Ade Irma Suryani adalah sebu...,Taman Hiburan,4.4
7656,Grand Maerakaca,Masyarakat Jawa Tengah mungkin sudah tidak asi...,Taman Hiburan,4.4
1920,Taman Pintar Yogyakarta,Taman Pintar Yogyakarta (bahasa Jawa: Hanacara...,Taman Hiburan,4.5


### **3.2 Recommendation System Evaluation**

In [229]:
# Example usage: Get recommendations for "monas"
recommended_places = get_recommendations("monas")
if isinstance(recommended_places, str):
    print(recommended_places)  
else:
    display(recommended_places[['Place_Name', 'Description', 'Category', 'Rating']])

Place 'monas' not found in the dataset.


**Note**: In Indonesian context, *Monas* is the short term for *Monumen Nasional*. When the rule-based system handles the same entity using different terms, it fails to recognize and resulting in no recommendations. If this model is implemented, a synonym detection should be placed before feeding the data into the recommendation system.

## **4 Colaborative Filtering Approach**

In [230]:
# Pivot the data to create a user-item matrix
user_item_matrix = merged_df.pivot_table(index='User_Id', columns='Place_Id', values='Place_Ratings', fill_value=0)

# Calculate user similarity using cosine similarity
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)

# Function to get top N similar users
def get_similar_users(user_id, n=3):
    # Sort users by similarity to the input user_id
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)
    return similar_users.iloc[1:n+1]  # Exclude the user themselves

# Function to recommend places for a user based on similar users' preferences
def recommend_places(user_id, n_recommendations=5):
    similar_users = get_similar_users(user_id)
    similar_user_ids = similar_users.index
    
    # Get the places these similar users have rated
    similar_users_ratings = user_item_matrix.loc[similar_user_ids].mean(axis=0)
    
    # Filter out places the target user has already rated
    user_rated_places = user_item_matrix.loc[user_id]
    unrated_places = similar_users_ratings[user_rated_places == 0]
    
    # Recommend the top N unrated places
    recommendations = unrated_places.sort_values(ascending=False).head(n_recommendations)
    return recommendations.index

# Function to get detailed recommendations
def get_detailed_recommendations(user_id, n_recommendations=5):
    recommended_place_ids = recommend_places(user_id, n_recommendations)
    detailed_recommendations = df[df['Place_Id'].isin(recommended_place_ids)].drop_duplicates('Place_Id')
    return detailed_recommendations[['Place_Name', 'Description', 'Category', 'City', 'Price']]

In [231]:
# Example usage: Get detailed recommendations for user with User_Id = 36
detailed_recommendations = get_detailed_recommendations(user_id=36)
display(detailed_recommendations)

Unnamed: 0,Place_Name,Description,Category,City,Price
36,Bumi Perkemahan Cibubur,Bumi Perkemahan dan Graha Wisata Pramuka Cibub...,Taman Hiburan,Jakarta,10000
148,Goa Cerme,"Gua Cerme (bahasa Jawa: ꦒꦸꦮ​ꦕꦺꦂꦩꦺ, translit. G...",Cagar Alam,Yogyakarta,3000
168,Puncak Segoro,Puncak Segoro menjadi destinasi wisata terbaru...,Cagar Alam,Yogyakarta,5000
231,Bukit Moko,Bandung sebagai destinasi wisata tak pernah ad...,Cagar Alam,Bandung,25000
365,Tirto Argo Siwarak,"Kolam Renang Tirto Argo Siwarak, merupakan kol...",Taman Hiburan,Semarang,20000


## **Hybrid Approach**

In [232]:
# Clean text
merged_df['DescriptionClean'] = merged_df['Description'].apply(clean_text)
merged_df['CategoryClean'] = merged_df['Category'].apply(clean_text)

In [233]:
# Content-based filtering
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['DescriptionClean'])
content_based_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [234]:
# # Content-based filtering
# tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
# tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['DescriptionClean'] + ' ' + merged_df['CategoryClean'])
# content_based_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [235]:
# Handle duplicates
merged_df2 = merged_df.groupby(['User_Id', 'Place_Id'], as_index=False).agg({'Rating': 'mean'})
merged_df2.head()

Unnamed: 0,User_Id,Place_Id,Rating
0,1,5,4.5
1,1,15,4.4
2,1,20,4.5
3,1,21,4.5
4,1,41,4.4


In [236]:
# Collaborative filtering
user_item_matrix = merged_df2.pivot(index='User_Id', columns='Place_Id', values='Rating').fillna(0)
collab_sim = cosine_similarity(user_item_matrix)

In [237]:
# Hybrid recommendation function
def get_hybrid_recommendations(place_idx, content_based_sim, collab_sim, alpha=0.5, df=merged_df):
    # Get content-based scores
    content_scores = list(enumerate(content_based_sim[place_idx]))
    
    # Get collaborative filtering scores
    collab_scores = list(enumerate(collab_sim[place_idx]))
    
    # Combine scores
    content_scores = [score[1] for score in content_scores]
    collab_scores = [score[1] for score in collab_scores]
    
    combined_scores = [alpha * content + (1 - alpha) * collab for content, collab in zip(content_scores, collab_scores)]
    
    # Sort and exclude the place itself
    sim_scores = list(enumerate(combined_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:]
    
    # Get the indices of the recommended places
    place_indices = [i[0] for i in sim_scores if i[0] < len(df)]
    
    # Filter duplicates based on Place_Id
    unique_place_indices = []
    seen_place_ids = set()
    for idx in place_indices:
        place_id = df.iloc[idx]['Place_Id']
        if place_id not in seen_place_ids:
            unique_place_indices.append(idx)
            seen_place_ids.add(place_id)
    
    # Return the DataFrame rows that correspond to these indices, including extra columns
    return df.loc[unique_place_indices].head(5)

In [238]:
# Example usage
recommended_places = get_hybrid_recommendations(0, content_based_sim, collab_sim, alpha=0.5)
display(recommended_places[['Place_Name', 'Description', 'Category', 'City', 'Price']])

Unnamed: 0,Place_Name,Description,Category,City,Price
11,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000
68,Taman Mini Indonesia Indah (TMII),Taman Mini Indonesia Indah merupakan suatu kaw...,Taman Hiburan,Jakarta,10000
31,Kota Tua,"Kota tua di Jakarta, yang juga bernama Kota Tu...",Budaya,Jakarta,0
193,Pelabuhan Marina,Pelabuhan Marina Ancol berada di kawasan Taman...,Bahari,Jakarta,175000
58,Dunia Fantasi,Dunia Fantasi atau disebut juga Dufan adalah t...,Taman Hiburan,Jakarta,270000
