# Setup and Preparations

## Import Libraries

The necessary libraries for the project are downloaded and imported.

In [1]:
!pip install -q tensorflow-recommenders --quiet
!pip install -q --upgrade tensorflow-datasets --quiet

In [2]:
import os
import requests
import pandas as pd
import tensorflow as tf
from IPython.display import Markdown, display
from concurrent.futures import ThreadPoolExecutor
import tensorflow_recommenders as tfrs

## Define Datasets URLs

The dataset URLs used in the project are listed.

In [3]:
urls = [
    "https://raw.githubusercontent.com/AldiraPutri19/Locoveer/refs/heads/machine-learning/datasets/user_ratings.csv",
    "https://raw.githubusercontent.com/AldiraPutri19/Locoveer/refs/heads/machine-learning/datasets/users.csv",
    "https://raw.githubusercontent.com/AldiraPutri19/Locoveer/refs/heads/machine-learning/datasets/travel_destinations.csv"
]

## Setup Directory

A directory is created to store the downloaded datasets.

In [4]:
file_path = "/content/"
os.makedirs(file_path, exist_ok=True)

# Download and Load Dataset

## Download Function

A function is defined to download files from the provided URLs and save them to the local directory.

In [5]:
def download_data(url):
    file_name = url.split("/")[-1]
    full_file_path = os.path.join(file_path, file_name)

    try:
        response = requests.get(url)
        response.raise_for_status()
        with open(full_file_path, "wb") as file:
            file.write(response.content)
            print(f"Successfully downloaded: {file_name}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to download {file_name} - Error: {e}")

## Download All Datasets

The datasets are downloaded in parallel using `ThreadPoolExecutor` to accelerate the download process.



In [6]:
with ThreadPoolExecutor() as executor:
    executor.map(download_data, urls)

Successfully downloaded: users.csv
Successfully downloaded: travel_destinations.csv
Successfully downloaded: user_ratings.csv


## Read Datasets

The downloaded CSV files are read into pandas DataFrames for further processing.

In [7]:
user_rating = pd.read_csv('user_ratings.csv')
user = pd.read_csv('users.csv')
travel_destination = pd.read_csv('travel_destinations.csv')

# Data Exploration and Cleaning

## Visualizing the Dataset

A helper function is defined to display information about the dataset in markdown format for better readability.

In [8]:
def printmd(string):
    display(Markdown(string))

printmd("Dataset user:")
print(user.head())

Dataset user:

   User_ID             Name                          Email  Age  Gender  \
0        1     Tono Pratama     tonopratama858@example.com   34    Male   
1        2       Eka Kusuma       ekakusuma629@example.com   59    Male   
2        3   Lina Sari B.A.         linasari11@example.com   61  Female   
3        4  Bambang Hidayat  bambanghidayat565@example.com   26  Female   
4        5     Tono Santoso     tonosantoso978@example.com   49    Male   

                       Address  
0           Tegal, Jawa Tengah  
1  Kupang, Nusa Tenggara Timur  
2            Lhokseumawe, Aceh  
3        Semarang, Jawa Tengah  
4        Batam, Kepulauan Riau  


## Dataset Statistics

The script calculates and displays statistics, such as the number of unique users and destinations, and checks for missing values in the datasets.

In [9]:
printmd("Number of Users: {:,}".format(len(user.User_ID.unique())))
printmd("Number of Travel Destinations: {:,}".format(len(travel_destination.Destination_ID.unique())))

printmd("**Missing Values:**")
print(user.isnull().sum(), '\n')
print(user_rating.isnull().sum(), '\n')
print(travel_destination.isnull().sum())

Number of Users: 1,000

Number of Travel Destinations: 437

**Missing Values:**

User_ID    0
Name       0
Email      0
Age        0
Gender     0
Address    0
dtype: int64 

User_ID           0
Destination_ID    0
Rating            0
dtype: int64 

Destination_ID        0
Destination_Name      0
Description           0
Category              0
City                  0
Price                 0
Coordinate            0
Lat                   3
Long                  0
Unnamed: 11         437
Unnamed: 12           0
dtype: int64


## Data Cleaning

Unnecessary columns in the travel destinations dataset are removed to simplify the data structure.

In [10]:
travel_destination.drop(columns=['Coordinate', 'Lat', 'Long', 'Unnamed: 11', 'Unnamed: 12'], inplace=True)
travel_destination.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437 entries, 0 to 436
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Destination_ID    437 non-null    float64
 1   Destination_Name  437 non-null    object 
 2   Description       437 non-null    object 
 3   Category          437 non-null    object 
 4   City              437 non-null    object 
 5   Price             437 non-null    float64
dtypes: float64(2), object(4)
memory usage: 20.6+ KB


## Rating Analysis

The script computes the number of ratings and the average rating for each destination. Destinations with a number of ratings above a specified cutoff are selected.

In [11]:
rating_by_destination = user_rating.groupby("Destination_ID").agg({"User_ID": "count", "Rating": "mean"}).reset_index()
rating_by_destination.columns = ["Destination_ID", "Number of Ratings", "Average Rating"]

cutoff = 50
top_rated_destinations = rating_by_destination.loc[rating_by_destination["Number of Ratings"] > cutoff].sort_values(by="Average Rating", ascending=False)

# Data Preprocessing

## Filter Data

The ratings dataset is filtered to include only destinations that meet the criteria for the number of ratings.

In [12]:
recent_ratings = user_rating.loc[user_rating["Destination_ID"].isin(top_rated_destinations["Destination_ID"])]

## Map IDs

User and destination IDs are mapped to numeric values to facilitate embedding during modeling.

In [13]:
userIds = recent_ratings.User_ID.unique()
destinationIds = recent_ratings.Destination_ID.unique()

user_mapping = {id_: idx for idx, id_ in enumerate(userIds)}
destination_mapping = {id_: idx for idx, id_ in enumerate(destinationIds)}

recent_ratings['User_ID'] = recent_ratings['User_ID'].map(user_mapping)
recent_ratings['Destination_ID'] = recent_ratings['Destination_ID'].map(destination_mapping)

## TensorFlow Dataset

A TensorFlow dataset is created from the filtered ratings data, preparing it for the training process.

In [14]:
ratings = tf.data.Dataset.from_tensor_slices({
    "userId": tf.convert_to_tensor(recent_ratings.User_ID.astype(str).values, dtype=tf.string),
    "destinationId": tf.convert_to_tensor(recent_ratings.Destination_ID.astype(str).values, dtype=tf.string),
    "rating": tf.cast(recent_ratings.Rating.values, tf.float32),
})

## Splitting the Dataset

To evaluate the performance of the recommendation model, the data is split into three subsets:
1. Training Set (70%): Used to train the model.
2. Validation Set (15%): Used to tune hyperparameters and prevent overfitting.
3. Test Set (15%): Used to evaluate the model's performance.

The dataset is shuffled and then divided using TensorFlow’s `Dataset` API.

In [15]:
total_ratings = len(recent_ratings)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train_size = int(total_ratings * 0.7)
val_size = int(total_ratings * 0.15)

train = shuffled.take(train_size)
validation = shuffled.skip(train_size).take(val_size)
test = shuffled.skip(train_size + val_size).take(total_ratings - train_size - val_size)

print("Training set size:", train_size)
print("Validation set size:", val_size)
print("Testing set size:", total_ratings - train_size - val_size)

Training set size: 70000
Validation set size: 15000
Testing set size: 15000


# Modeling

## Ranking Model Definition

A function is defined to build a ranking model using embeddings for users and destinations. The embeddings are concatenated and passed through dense layers to predict ratings.

In [16]:
def create_ranking_model(user_ids, destination_ids, embedding_dimension=128):
    user_input = tf.keras.layers.Input(shape=(1,), name="userId", dtype=tf.string)
    user_lookup = tf.keras.layers.StringLookup(vocabulary=[str(x) for x in user_ids], mask_token=None)(user_input)
    user_embedding = tf.keras.layers.Embedding(len(user_ids) + 1, embedding_dimension)(user_lookup)

    destination_input = tf.keras.layers.Input(shape=(1,), name="destinationId", dtype=tf.string)
    destination_lookup = tf.keras.layers.StringLookup(vocabulary=[str(x) for x in destination_ids], mask_token=None)(destination_input)
    destination_embedding = tf.keras.layers.Embedding(len(destination_ids) + 1, embedding_dimension)(destination_lookup)

    concatenated = tf.keras.layers.concatenate([user_embedding, destination_embedding], axis=-1)

    x = tf.keras.layers.Dense(512, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.01))(concatenated)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.4)(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    output = tf.keras.layers.Dense(1, name="rating")(x)

    ranking_model = tf.keras.Model(inputs=[user_input, destination_input], outputs=output, name="RankingModel")
    return ranking_model

## Recommendation Model

A custom recommendation model is implemented using `tensorflow_recommenders (tfrs)`. The model combines a ranking model with a task for predicting ratings.

In [17]:
class TravelRecommendationModel(tfrs.models.Model):
    def __init__(self, ranking_model, **kwargs):
        super(TravelRecommendationModel, self).__init__(**kwargs)
        self.ranking_model = ranking_model
        self.task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()]
        )

    def call(self, inputs):
        return self.ranking_model(inputs)

    def compute_loss(self, features, training=False):
        if isinstance(features, tuple):
            features, labels = features
        else:
            labels = features.pop("rating")
        rating_predictions = self.ranking_model(features)
        return self.task(labels=labels, predictions=rating_predictions)

## Compiling The Model

The model is compiled using the Adam optimizer and a learning rate of 0.001.

In [18]:
embedding_dimension = 128
ranking_model = create_ranking_model(userIds, destinationIds, embedding_dimension)
travel_model = TravelRecommendationModel(ranking_model)

travel_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))

## Model Training

The model is trained on the prepared TensorFlow dataset, with early stopping implemented to prevent overfitting.

In [19]:
history = travel_model.fit(
    train.map(lambda x: ({"userId": x["userId"], "destinationId": x["destinationId"]}, x["rating"])).shuffle(100_000).batch(128).cache(),
    validation_data=validation.map(lambda x: ({"userId": x["userId"], "destinationId": x["destinationId"]}, x["rating"])).batch(64).cache(),
    epochs=50,
    # callbacks=[early_stopping]
)

Epoch 1/50
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 12ms/step - loss: 2.1533 - regularization_loss: 0.3520 - root_mean_squared_error: 1.6137 - total_loss: 2.5053 - val_loss: 2.2805 - val_regularization_loss: 0.0076 - val_root_mean_squared_error: 1.4088 - val_total_loss: 2.2881
Epoch 2/50
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 7ms/step - loss: 2.0146 - regularization_loss: 0.0025 - root_mean_squared_error: 1.4217 - total_loss: 2.0170 - val_loss: 2.2646 - val_regularization_loss: 8.6460e-04 - val_root_mean_squared_error: 1.4114 - val_total_loss: 2.2655
Epoch 3/50
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - loss: 2.0128 - regularization_loss: 0.0011 - root_mean_squared_error: 1.4212 - total_loss: 2.0139 - val_loss: 2.2623 - val_regularization_loss: 0.0011 - val_root_mean_squared_error: 1.4125 - val_total_loss: 2.2634
Epoch 4/50
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step -

# Evaluation and Model Saving

## Evaluate the Model

The model's performance is evaluated on the test dataset, and the results are displayed.

In [20]:
def evaluate_model(model, test_data):
    cached_test = test_data.batch(1024).cache()
    return model.evaluate(cached_test, return_dict=True)

# Example evaluation:
test_results = evaluate_model(travel_model, test)
print("Test Results:", test_results)

[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 2.0150 - regularization_loss: 1.0969e-08 - root_mean_squared_error: 1.4245 - total_loss: 2.0150
Test Results: {'loss': 1.9261870384216309, 'regularization_loss': 1.0968746799733253e-08, 'root_mean_squared_error': 1.4223798513412476, 'total_loss': 1.9261870384216309}


## Wrapping the Model

The trained model is wrapped in a functional model structure with explicit inputs and outputs for easier integration and deployment.

In [21]:
inputs = {
    "userId": tf.keras.layers.Input(name="userId", shape=(), dtype=tf.string),
    "destinationId": tf.keras.layers.Input(name="destinationId", shape=(), dtype=tf.string),
}
outputs = travel_model(inputs)
wrapped_model = tf.keras.Model(inputs=inputs, outputs=outputs)

## Save and Download the Model

The wrapped model is saved and downloaded to disk for future use.

In [22]:
wrapped_model.save('collaborative_filtering_recommendation_system.keras')

from google.colab import files
files.download('collaborative_filtering_recommendation_system.keras')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Generating and Displaying Recommendations

Once the model is trained and evaluated, it can be used to generate personalized recommendations for users. The `generate_recommendations` function takes a user ID and a list of destination IDs, then ranks the destinations based on predicted ratings.

In [23]:
def generate_recommendations(model, user_id, destination_ids, user_mapping, travel_destination, top_n=10):
    test_rating = {}

    for m in destination_ids:
        test_rating[m] = model.ranking_model(
            {"userId": tf.convert_to_tensor([user_id], dtype=tf.string)},
            {"destinationId": tf.convert_to_tensor([m], dtype=tf.string)}
        )

    top_destinations = sorted(test_rating, key=test_rating.get, reverse=True)[:top_n]

    recommendations = []
    for dest_id in top_destinations:
        dest_name = travel_destination[travel_destination["Destination_ID"] == dest_id]["Destination_Name"].iloc[0]
        recommendations.append(dest_name)

    return recommendations

user_rand = userIds[123]
test_rating = {}
for m in test.take(10):
    test_rating[m["destinationId"].numpy()] = ranking_model(
        {"userId": tf.convert_to_tensor([str(user_rand)], dtype=tf.string),
         "destinationId": tf.convert_to_tensor([str(m["destinationId"].numpy())], dtype=tf.string)}
    )

print("Top 10 recommended travel destinations for User {}: ".format(user_rand))
for m in sorted(test_rating, key=test_rating.get, reverse=True):
    dest_id = int(m)

    dest_name = travel_destination[travel_destination["Destination_ID"] == dest_id]["Destination_Name"].iloc[0]
    user_name = user[user["User_ID"] == user_rand]["Name"].iloc[0]

    print(f"User: {user_name}, Destination: {dest_name} (ID: {dest_id})")

Top 10 recommended travel destinations for User 181: 
User: Sukmono Hidayat, Destination: Taman Tabanas (ID: 383)
User: Sukmono Hidayat, Destination: Desa Wisata Lembah Kalipancur (ID: 340)
User: Sukmono Hidayat, Destination: Curug Cipanas (ID: 275)
User: Sukmono Hidayat, Destination: Kebun Bibit Wonorejo (ID: 406)
User: Sukmono Hidayat, Destination: Geoforest Watu Payung Turunan (ID: 167)
User: Sukmono Hidayat, Destination: Curug Bugbrug (ID: 273)
User: Sukmono Hidayat, Destination: Taman Kupu-Kupu Cihanjuang (ID: 326)
User: Sukmono Hidayat, Destination: Masjid Istiqlal (ID: 22)
User: Sukmono Hidayat, Destination: La Kana Chapel (ID: 377)
User: Sukmono Hidayat, Destination: Puncak Kebun Buah Mangunan (ID: 133)
