# Clustering of Google Reviews of Traveling Sites
## Evaluating whether a recommender is feasible
---
<b>MADS-MMS Portfolio-Exam Part 2<br>
Janosch Höfer, 938969</b>

## Table of contents

- [Introduction](#intro) <br>
- [1. Data Exploration](#data-prep) <br>
    - [1.1. Data Engineering](#dataeng) <br>
    - [1.2. Data Visualization](#datavis) <br>
    - [1.3. Data Reduction](#datared) <br>
- [2. Parameters](#parameters) <br>
- [3. Model setup](#model-setup) <br>
   - [3.1. K-Means](#kmean) <br>
   - [3.2. HAC](#hac)<br>
   - [3.3. OPTICS](#optics) <br>
- [4. Model Evaluation](#model-eval) <br>
    - [4.1. K-Means](#evalkmean) <br>
    - [4.2. HAC](#evalhac) <br>
    - [4.3. OPTICS](#evaloptics) <br>
- [5. Results](#results)<br>
- [References](#ref)<br>

<a id='intro'></a>

## Introduction

In [None]:
# Standard library
import os
import itertools

# Third-party libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
from sklearn.cluster import OPTICS, AgglomerativeClustering, cluster_optics_dbscan
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from tqdm.notebook import tqdm

# Own classes and functions
from helper_functions.data_manipulation import setup_raw_data
from helper_functions.plot_clusters import draw_plot, OPTICSResults, CMAP_PLT

In [None]:
pd.set_option("display.max_columns", 25)

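This notebook clusters user ratings from the Travel Review Ratings dataset [[1]](https://archive.ics.uci.edu/ml/datasets/Tarvel+Review+Ratings) to evaluate whether a recommender is feasible.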

---
<a id='data-prep'></a>

## 1. Data Exploration
<a id='dataeng'></a>
### 1.1. Data Engineering

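The raw data is checked for locally, fetched from the URL below if needed, and then loaded into a DataFrame with descriptive column names.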

In [None]:
path_to_data = "data"
data_url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00485/google_review_ratings.csv"
)
filename = "google_review_ratings.csv"

In [None]:
# Check for data
setup_raw_data(data_url, path_to_data, filename)
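`setup_raw_data` is a helper from this project whose implementation is not shown here. A minimal sketch of a download-if-missing helper, assuming it only has to create the target directory and fetch the file once, might look like the following. The name `ensure_raw_data` and its `fetch` parameter are hypothetical and not part of the project.

```python
import os
import urllib.request


def ensure_raw_data(url, directory, filename, fetch=urllib.request.urlretrieve):
    """Download `url` to `directory/filename` unless the file already exists.

    Hypothetical stand-in for the project's `setup_raw_data`; `fetch` is
    injectable so the function can be exercised without network access.
    """
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, filename)
    if not os.path.exists(target):
        fetch(url, target)
    return target
```

Calling `ensure_raw_data(data_url, path_to_data, filename)` a second time would skip the download because the file is already present.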

In [None]:
features = [
    "churches",
    "resorts",
    "beaches",
    "parks",
    "theatres",
    "museums",
    "malls",
    "zoo",
    "restaurants",
    "pubs_bars",
    "local_services",
    "burger_pizza_shops",
    "hotels_other_lodgings",
    "juice_bars",
    "art_galleries",
    "dance_clubs",
    "swimming_pools",
    "gyms",
    "bakeries",
    "beauty_spas",
    "cafes",
    "view_points",
    "monuments",
    "gardens",
    "c25",
]
data_full = pd.read_csv(
    os.path.join(path_to_data, filename), sep=",", index_col=0, names=features, header=0
)

In [None]:
data_full.head()

Remove the empty last column.

In [None]:
data_full = data_full.iloc[:, :-1]

In [None]:
data_full.describe()

Ratings range from 1 to 5. A value of 0 means that the user has not rated the category.

In [None]:
data_full.isna().sum()

In [None]:
data_full[data_full.isna().any(axis=1)]

In [None]:
df_full = data_full.dropna().copy()

In [None]:
df_full.dtypes

Because of a malformed value for User 2713 in the `local_services` column, pandas does not read that column as `float`.

In [None]:
df_full["local_services"] = pd.to_numeric(df_full["local_services"])
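To illustrate why a single malformed entry changes the dtype of a whole column, and one way to handle such entries, the sketch below uses invented values (not the actual dataset rows) together with `errors="coerce"`, which turns unparseable strings into `NaN` instead of raising.

```python
import pandas as pd

# Toy column with one malformed entry; the values are invented for illustration.
toy = pd.Series(["1.5", "2.0", "2\t2.", "4.0"], name="local_services", dtype=object)

# errors="coerce" replaces values that cannot be parsed with NaN instead of raising.
converted = pd.to_numeric(toy, errors="coerce")
print(converted.dtype)
print(converted.isna().sum())
```

Whether coercing or dropping the affected row is preferable depends on how many rows are affected; with a single row, either choice has little effect on the clustering.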

In [None]:
df_full.dtypes

<a id='datavis'></a>
### 1.2. Data Visualization 

In [None]:
start, end = 0, 8
draw_plot(
    df_full[features[start:end]],
    plot_type="histplot",
    figsize=(16, 8),
    grid_size=(2, 4),
    title=f"Feature distribution for the features: {', '.join(features[start:end])}.",
)

In [None]:
start, end = 8, 16
draw_plot(
    df_full[features[start:end]],
    plot_type="histplot",
    figsize=(16, 8),
    grid_size=(2, 4),
    title=f"Feature distribution for the features: {', '.join(features[start:end])}.",
)

In [None]:
start, end = 16, 24
draw_plot(
    df_full[features[start:end]],
    plot_type="histplot",
    figsize=(16, 8),
    grid_size=(2, 4),
    title=f"Feature distribution for the features: {', '.join(features[start:end])}.",
)

In [None]:
norating = df_full[df_full == 0].count(axis=0) / df_full.shape[0] * 100

In [None]:
df_norating = (
    pd.DataFrame(norating, columns=["perc_norating"])
    .reset_index()
    .sort_values(by="perc_norating", ascending=False)
)

In [None]:
ax = sns.barplot(df_norating, x="perc_norating", y="index")
plt.xlabel("Users that left no rating [%]")
ax.xaxis.set_major_formatter(mtick.PercentFormatter())
plt.ylabel("Feature")
plt.title("Percentage of users that have not rated the feature.")
plt.show()

In [None]:
df_average = (
    pd.DataFrame(df_full.replace(0, np.nan).mean(), columns=["Average"])
    .reset_index()
    .rename(columns={"index": "Feature"})
    .sort_values(by="Average", ascending=False)
)

In [None]:
palette = [
    "red" if 0 < val <= 1 else "orange" if 1 < val <= 2 else "blue" if 2 < val <= 3 else "green"
    for val in df_average["Average"].tolist()
]

sns.barplot(df_average, y="Feature", x="Average", palette=palette)
plt.xlabel("Rating")
plt.title("Average Ratings excluding zero values.")
plt.show()
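Both statistics plotted above can also be derived from a boolean mask. The sketch below, on invented toy data, computes the share of unrated entries as the column-wise mean of `df == 0` and the average rating with zeros masked out.

```python
import pandas as pd

# Invented toy ratings; 0 stands for "no rating".
toy = pd.DataFrame({"parks": [0.0, 3.5, 4.0, 0.0], "zoo": [2.0, 0.0, 5.0, 1.0]})

# True counts as 1, so the mean of the mask is the fraction of zero ratings.
share_unrated = (toy == 0).mean() * 100
print(share_unrated)

# Averages that ignore unrated entries: mask zeros as missing before taking the mean.
mean_rated = toy.mask(toy == 0).mean()
print(mean_rated)
```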

<a id='datared'></a>
### 1.3. Data Reduction

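To reduce the number of features, the 24 rating categories are grouped into eight thematic groups, and each group is represented by the mean of its member ratings. The grouping is credited to [[2]](https://www.kaggle.com/code/johnmantios/travel-review-ratings-dataset#Clustering).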

In [None]:
culture_features = ["theatres", "museums", "art_galleries"]
city_features = ["churches", "malls", "zoo", "local_services"]
nature_features = ["beaches", "parks"]
scenic_features = ["monuments", "view_points", "gardens"]
wellness_features = ["beauty_spas", "gyms", "swimming_pools"]
food_features = ["burger_pizza_shops", "juice_bars", "cafes", "bakeries", "restaurants"]
nightlife_features = ["pubs_bars", "dance_clubs"]
accommodation_features = ["resorts", "hotels_other_lodgings"]

In [None]:
test_features = ["monuments", "view_points", "gardens"]

In [None]:
# check_feature_reduction(df_full, test_features)

In [None]:
def check_feature_reduction(df, features):
    """Plot the per-user mean and standard deviation across the given features."""
    # String aggregator names avoid the deprecation warning pandas raises for np.mean/np.std.
    df_reduced = df[features].agg(["mean", "std"], axis=1)
    _, axes = plt.subplots(figsize=(12, 6), ncols=2)
    sns.histplot(df_reduced, x="mean", ax=axes[0])
    sns.histplot(df_reduced, x="std", ax=axes[1])
    axes[0].set_xlabel("Average rating")
    axes[1].set_xlabel("Standard deviation of the ratings")
    plt.suptitle(f"Distribution for the combined features {', '.join(features)}.")
    plt.show()

In [None]:
check_feature_reduction(df_full, culture_features)

In [None]:
check_feature_reduction(df_full, city_features)

In [None]:
check_feature_reduction(df_full, nature_features)

In [None]:
check_feature_reduction(df_full, scenic_features)

In [None]:
check_feature_reduction(df_full, wellness_features)

In [None]:
check_feature_reduction(df_full, food_features)

In [None]:
check_feature_reduction(df_full, nightlife_features)

In [None]:
check_feature_reduction(df_full, accommodation_features)

In [None]:
df_reduced = pd.DataFrame(
    {
        "culture": df_full[culture_features].mean(axis=1),
        "city": df_full[city_features].mean(axis=1),
        "nature": df_full[nature_features].mean(axis=1),
        "scenic": df_full[scenic_features].mean(axis=1),
        "wellness": df_full[wellness_features].mean(axis=1),
        "food": df_full[food_features].mean(axis=1),
        "nightlife": df_full[nightlife_features].mean(axis=1),
        "accommodation": df_full[accommodation_features].mean(axis=1),
    }
)
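The same reduction can be expressed with a mapping from group name to member columns. The following sketch uses invented toy data; the helper name `reduce_features` is hypothetical.

```python
import pandas as pd


def reduce_features(df, groups):
    """Return one column per group holding the row-wise mean of its member columns."""
    return pd.DataFrame({name: df[cols].mean(axis=1) for name, cols in groups.items()})


# Invented toy ratings for two users.
toy = pd.DataFrame(
    {
        "beaches": [1.0, 4.0],
        "parks": [3.0, 2.0],
        "pubs_bars": [5.0, 0.0],
        "dance_clubs": [3.0, 2.0],
    }
)
groups = {"nature": ["beaches", "parks"], "nightlife": ["pubs_bars", "dance_clubs"]}
reduced = reduce_features(toy, groups)
print(reduced)
```

Like the cell above, this treats a 0 (no rating) as a regular value, so unrated categories pull the group mean down.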

In [None]:
df_reduced

In [None]:
sns.pairplot(df_reduced)
plt.show()

In [None]:
df_reduced.describe()

In [None]:
sns.boxenplot(df_reduced)
plt.show()

In [None]:
df = df_reduced

---
<a id='parameters'></a>

## 2. Parameters

Parameters shared by all models.

In [None]:
random_state = 42

---
<a id='model-setup'></a>

## 3. Model setup
<a id='kmean'></a>
### 3.1. K-Means

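K-Means is fitted for several values of k, and the silhouette coefficient is used to choose the number of clusters.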

In [None]:
max_ks = 10
ks = range(2, max_ks)

In [None]:
sscores = draw_plot(
    df,
    ks=ks,
    plot_type="ksscore",
    random_state=random_state,
    labels=["K", "Silhouette Coefficient"],
    title="Silhouette Score for different Ks",
)

In [None]:
df_kscores = pd.DataFrame({"k": ks, "score": sscores}).sort_values(by="score", ascending=False)
df_kscores
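The `"ksscore"` mode of `draw_plot` is a project helper whose implementation is not shown. A minimal sketch of such a silhouette sweep, assuming it fits one `KMeans` model per k and records the mean silhouette coefficient, is given below on synthetic stand-in data (not the review data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups.
X, _ = make_blobs(
    n_samples=300, centers=[(-10, -10), (0, 0), (10, 10)], cluster_std=0.5, random_state=42
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette over all samples; values close to 1 indicate compact, separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the maximum is usually much less pronounced, which is why the notebook also inspects the second-best k.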

In [None]:
best_k_silhouette = df_kscores.iloc[0, 0]
kmean_best_labels = draw_plot(
    df,
    plot_type="silhouette",
    ks=best_k_silhouette,
    random_state=random_state,
    labels=["The silhouette coefficient values", "Cluster label"],
    title=f"Silhouette analysis for KMeans clustering with n_clusters = {best_k_silhouette}",
)

In [None]:
df_kmean_best = df.copy()
df_kmean_best["kmeans_labels"] = kmean_best_labels
sns.pairplot(df_kmean_best, hue="kmeans_labels", palette=CMAP_PLT)
plt.show()

In [None]:
second_k_silhouette = df_kscores.iloc[1, 0]
kmean_second_labels = draw_plot(
    df,
    plot_type="silhouette",
    ks=second_k_silhouette,
    random_state=random_state,
    labels=["The silhouette coefficient values", "Cluster label"],
    title=f"Silhouette analysis for KMeans clustering with n_clusters = {second_k_silhouette}",
)

In [None]:
df_kmean_second = df.copy()
df_kmean_second["kmeans_labels"] = kmean_second_labels
sns.pairplot(df_kmean_second, hue="kmeans_labels", palette=CMAP_PLT)
plt.show()

<a id='hac'></a>

### 3.2. HAC (Hierarchical Agglomerative Clustering)

In [None]:
dendo_distance = "single"
dendo_cut = 1
dendo_model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1,
    metric="euclidean",  # `affinity` was renamed to `metric` in scikit-learn 1.2
    linkage=dendo_distance,
    compute_distances=True,
)
dendo_model.fit_predict(df)
_ = draw_plot(
    dendo_model,
    plot_type="dendo",
    dendo_cut=dendo_cut,
    dendo_distance=dendo_distance,
    labels=["Samples", "Distance"],
    title=f"Dendrogram using {dendo_distance}-link.",
)
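The `"dendo"` mode of `draw_plot` is not shown in this notebook. One common way to draw a dendrogram from a fitted `AgglomerativeClustering` model is to convert its `children_` and `distances_` attributes into a SciPy linkage matrix. The sketch below assumes that approach and uses random stand-in data; the helper name `linkage_matrix` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering


def linkage_matrix(model):
    """Build a SciPy-style linkage matrix from a fitted AgglomerativeClustering model."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        # Indices below n_samples are original points; larger ones refer to earlier merges.
        counts[i] = sum(1 if child < n_samples else counts[child - n_samples] for child in merge)
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # random stand-in data

# distance_threshold=0 with n_clusters=None forces the full tree to be computed.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0, linkage="complete").fit(X)
Z = linkage_matrix(model)
tree = dendrogram(Z, no_plot=True)  # set no_plot=False to draw with matplotlib
```

A horizontal line at the chosen cut height (here `dendo_cut`) can then be added with `plt.axhline` to show where the tree is split.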

In [None]:
dendo_distance = "complete"
dendo_cut = 6
dendo_model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1,
    metric="euclidean",
    linkage=dendo_distance,
    compute_distances=True,
)
dendo_model.fit_predict(df)
_ = draw_plot(
    dendo_model,
    plot_type="dendo",
    dendo_cut=dendo_cut,
    dendo_distance=dendo_distance,
    labels=["Samples", "Distance"],
    title=f"Dendrogram using {dendo_distance}-link.",
)

In [None]:
dendo_distance = "average"
dendo_cut = 3.2
dendo_model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1,
    metric="euclidean",
    linkage=dendo_distance,
    compute_distances=True,
)
labels_dendo = dendo_model.fit_predict(df)
_ = draw_plot(
    dendo_model,
    plot_type="dendo",
    dendo_cut=dendo_cut,
    dendo_distance=dendo_distance,
    labels=["Samples", "Distance"],
    title=f"Dendrogram using {dendo_distance}-link.",
)

In [None]:
dendo_model = AgglomerativeClustering(
    n_clusters=4,
    distance_threshold=None,
    metric="euclidean",
    linkage="complete",
    compute_distances=True,
)
labels_dendo = dendo_model.fit_predict(df)

In [None]:
dendo_unique, dendo_count = np.unique(labels_dendo, return_counts=True)

In [None]:
dendo_unique

In [None]:
dendo_count

In [None]:
df_dendo = df.copy()
df_dendo["dendo_labels"] = labels_dendo
sns.pairplot(df_dendo, hue="dendo_labels", palette=CMAP_PLT)
plt.show()

<a id='optics'></a>

### 3.3. OPTICS (Ordering Points To Identify the Clustering Structure)

In [None]:
def optics_experiment(df, parameters: dict[str, list]):
    """Fit one OPTICS model per parameter combination and collect the results."""
    results = []
    space = np.arange(len(df))

    # itertools.product returns an iterator without a length, so compute the total for tqdm.
    max_len = int(np.prod([len(values) for values in parameters.values()]))
    for combination in tqdm(itertools.product(*parameters.values()), total=max_len):
        # Map each value back to its parameter name instead of relying on positional order.
        params = dict(zip(parameters.keys(), combination))
        optics_clustering = OPTICS(**params).fit(df)
        results.append(
            OPTICSResults(
                optics=optics_clustering,
                space=space,
                reachability=optics_clustering.reachability_[optics_clustering.ordering_],
                targets=optics_clustering.labels_[optics_clustering.ordering_],
                params=optics_clustering.get_params(),
            )
        )
    return results

In [None]:
parameters = {
    "min_samples": [10, 20, 40],
    "metric": ["euclidean"],
    "xi": [0.001, 0.005],
    "min_cluster_size": [0.05, 0.1, 0.2],
}

In [None]:
optics_res = optics_experiment(df, parameters)
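The grid above yields 3 × 1 × 2 × 3 = 18 combinations. Assuming the results are stored in the order produced by `itertools.product`, as in `optics_experiment`, an index into `optics_res` can be mapped back to its parameter combination as sketched here.

```python
import itertools

parameters = {
    "min_samples": [10, 20, 40],
    "metric": ["euclidean"],
    "xi": [0.001, 0.005],
    "min_cluster_size": [0.05, 0.1, 0.2],
}

# itertools.product varies the last parameter fastest.
grid = [dict(zip(parameters, combo)) for combo in itertools.product(*parameters.values())]
print(len(grid))  # 18
print(grid[-4])
```

With this ordering, index `-4` corresponds to `min_samples=40`, `xi=0.001` and `min_cluster_size=0.2`.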

In [None]:
draw_plot(
    optics_res,
    figsize=(16, 18),
    grid_size=(round(len(optics_res) / 2), 2),
    plot_type="reachability",
    labels=["", "Reachability distance"],
    # top_cut_off=3,
    title="Reachability Diagram",
)

In [None]:
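# Use the fitted estimator stored on the result object; cluster_optics_dbscan
# needs its unordered reachability_, core_distances_ and ordering_ attributes.
best_optics = optics_res[-4].optics
labels_optics = cluster_optics_dbscan(
    reachability=best_optics.reachability_,
    core_distances=best_optics.core_distances_,
    ordering=best_optics.ordering_,
    eps=2,
)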
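`cluster_optics_dbscan` takes the `reachability_`, `core_distances_` and `ordering_` attributes of a fitted `OPTICS` estimator and returns DBSCAN-style labels for a fixed `eps`. A minimal, self-contained sketch on synthetic stand-in data (two well-separated blobs, not the review data) follows.

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Two well-separated synthetic blobs as stand-in data.
X, _ = make_blobs(n_samples=100, centers=[(0, 0), (10, 10)], cluster_std=0.3, random_state=42)

optics = OPTICS(min_samples=5).fit(X)

# Labels are returned in original sample order; -1 marks noise points.
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=2,
)
n_clusters = len(set(labels) - {-1})
print(n_clusters)
```

Because the extraction reuses the fitted ordering, different `eps` values can be tried without refitting OPTICS.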

In [None]:
df_optics = df.copy()
df_optics["optics_labels"] = labels_optics
sns.pairplot(df_optics, hue="optics_labels", palette=CMAP_PLT)
plt.show()

---
<a id='model-eval'></a>

## 4. Model Evaluation
<a id='evalkmean'></a>
### 4.1. K-Means

In [None]:
df_kmean_best.head()

In [None]:
df_kmean_second.head()

In [None]:
conf_matrix = confusion_matrix(
    y_true=df_kmean_best["kmeans_labels"],
    y_pred=df_kmean_second["kmeans_labels"],
    normalize="true",
)
ConfusionMatrixDisplay(conf_matrix).plot()
plt.show()
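Cluster labels are arbitrary identifiers, so the rows and columns of a confusion matrix between two clusterings do not line up by label, and the diagonal carries no special meaning. As a complement, a contingency table and the adjusted Rand index compare partitions independently of label names. The sketch below uses invented label vectors rather than the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])  # same partition, different label names
labels_c = np.array([0, 0, 0, 1, 1, 1])  # coarser partition with two clusters

# Contingency table: rows are clusters of the first labeling, columns of the second.
print(pd.crosstab(labels_a, labels_c, rownames=["a"], colnames=["c"]))

# The adjusted Rand index ignores label names; 1.0 means identical partitions.
ari_same = adjusted_rand_score(labels_a, labels_b)
ari_coarse = adjusted_rand_score(labels_a, labels_c)
print(ari_same, ari_coarse)
```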

<a id='evalhac'></a>
### 4.2. HAC

In [None]:
df_dendo.head()

In [None]:
conf_matrix = confusion_matrix(
    y_true=df_dendo["dendo_labels"],
    y_pred=df_kmean_best["kmeans_labels"],
    normalize="true",
)
ConfusionMatrixDisplay(conf_matrix).plot()
plt.show()

<a id='evaloptics'></a>
### 4.3. OPTICS

In [None]:
df_optics.head()

In [None]:
conf_matrix = confusion_matrix(
    y_true=df_optics["optics_labels"],
    y_pred=df_kmean_best["kmeans_labels"],
    normalize="true",
)
ConfusionMatrixDisplay(conf_matrix).plot()
plt.show()

In [None]:
conf_matrix = confusion_matrix(
    y_true=df_optics["optics_labels"],
    y_pred=df_dendo["dendo_labels"],
)
ConfusionMatrixDisplay(conf_matrix).plot()
plt.show()

---
<a id='results'></a>

## 5. Results

bla

---
<a id='ref'></a>

## References

<p> [1] https://archive.ics.uci.edu/ml/datasets/Tarvel+Review+Ratings
<p> [2] https://www.kaggle.com/code/johnmantios/travel-review-ratings-dataset#Clustering