<div style="display: flex; align-items: center; justify-content: center; flex-wrap: wrap;">
    <div style="flex: 1; min-width: 250px; display: flex; justify-content: center;">
        <img src="https://adnova.novaims.unl.pt/media/22ui3ptm/logo.svg" style="max-width: 80%; height: auto; margin-top: 50px; margin-bottom: 50px;margin-left: 3rem;">
    </div>
    <div style="flex: 2; text-align: center; margin-top: 20px;margin-left: 8rem;">
        <div style="font-size: 28px; font-weight: bold; line-height: 1.2;color:#6f800f;">
            Data Mining Project | ABCDEats Inc.
        </div>
        <div style="font-size: 17px; font-weight: bold; margin-top: 10px;">
            Fall Semester | 2024 - 2025
        </div>
        <div style="font-size: 17px; font-weight: bold;">
            Master in Data Science and Advanced Analytics
        </div>
        <div style="margin-top: 20px;">
            <div>André Silvestre, 20240502</div>
            <div>Filipa Pereira, 20240509</div>
            <div>Umeima Mahomed, 20240543</div>
        </div>
        <div style="margin-top: 20px; font-weight: bold;">
            Group 37
        </div>
    </div>
</div>

<div style="background: linear-gradient(to right,#6f800f, #6f800f); 
            padding: .7px; color: white; border-radius: 300px; text-align: center;">
</div>

## 📚 Libraries Import

In [None]:
# For data
import pandas as pd
import numpy as np
import os

# For plotting and EDA
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick
import matplotlib.lines as mlines
from matplotlib.colors import LinearSegmentedColormap

# For preprocessing

# For statistical tests

# For clustering

# Set pandas display options
pd.set_option('display.max_columns', None)                  # display all columns
pd.set_option('display.float_format', lambda x: '%.2f' % x) # display floats with 2 decimal places

# For better resolution plots
%config InlineBackend.figure_format = 'retina' # optionally, you can change 'retina' to 'svg'

# Setting seaborn style
plt.style.use('ggplot')
sns.set_theme(style='white')

# <a class='anchor' id='2'></a>
<br>
<style>
@import url('https://fonts.cdnfonts.com/css/avenir-next-lt-pro?styles=29974');
</style>

<div style="background: linear-gradient(to right,#bEd62f, #6f800f); 
            padding: 10px; color: white; border-radius: 300px; text-align: center;">
    <center><h1 style="margin-left: 140px;margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Avenir Next LT Pro', sans-serif;">
        <b>Part 2 | Data Preprocessing & Clustering </b></h1></center>
</div>

In [None]:
# Importing the dataset
ABCDEats = pd.read_parquet('data/DM2425_ABCDEats_1stPart.parquet')

In [None]:
# Display the first 5 rows just to confirm the import was successful
ABCDEats.head() 

In [None]:
# Number of rows and columns
print('Number of\033[1m rows \033[0m:', ABCDEats.shape[0])
print('Number of\033[1m columns \033[0m:', ABCDEats.shape[1])

In [None]:
# Check the data types
ABCDEats.dtypes

---

## 🛠️ Data Preprocessing/Feature Engineering

In [None]:
# Create a continuous and discrete colormap
colors = ["#3E460F", "#4E5813", "#626E18", "#7A891E", "#98AB26", "#BED62F"]
NOVAIMS_palette_colors = sns.color_palette(colors[::-1], as_cmap=True)

colors = ["#3E460F", "#4E5813", "#626E18", "#7A891E", "#98AB26", "#BED62F", "#FFFFFF"]
NOVAIMS_palette_colors_continuous = LinearSegmentedColormap.from_list("NOVAIMS_palette", colors[::-1])

In [None]:
# Define metric and non-metric features
metric_cols = ABCDEats.select_dtypes(include=['int64', 'float64']).columns
non_metric_cols = ABCDEats.select_dtypes(include=['object']).columns

# Exclude the column 'customer_id' from the Non-Metric columns
non_metric_cols = non_metric_cols.drop('customer_id')

print(f'Metric columns: {len(metric_cols)}, {metric_cols} \n')
print(f'Non-Metric columns: {len(non_metric_cols)}, {non_metric_cols}')

In [None]:
# Cuisine columns: 'CUI_American', 'CUI_Asian', 'CUI_Chinese', 'CUI_Italian', etc.
cuisines_cols = [col for col in ABCDEats.columns if col.startswith('CUI_')]
cuisines_cols

In [None]:
# Weekday columns
weekdays_cols = ABCDEats.loc[:, 'DOW_0':'DOW_6'].columns

# Hours columns
hours_cols = ABCDEats.loc[:, 'H_0':'H_23'].columns

In [None]:
# List of weekdays (0 = Sunday, 6 = Saturday)
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
weekdays_dict = dict(enumerate(weekdays))
weekdays_dict
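
One way such a mapping might be used is to give the `DOW_*` columns readable labels before plotting or summarising. The following is a minimal sketch on a toy frame: `demo`, its values, `rename_map` and `orders_per_weekday` are made up for illustration, and only the `DOW_0` to `DOW_6` naming comes from the dataset.

```python
import pandas as pd

# Same mapping as in the cell above: 0 = Sunday, ..., 6 = Saturday
weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
weekdays_dict = dict(enumerate(weekdays))

# Hypothetical toy data using the same DOW_* column naming as the dataset
demo = pd.DataFrame([[1, 0, 2, 0, 0, 3, 1],
                     [0, 1, 0, 0, 4, 0, 2]],
                    columns=[f'DOW_{i}' for i in range(7)])

# Map 'DOW_<n>' to the weekday name by parsing the numeric suffix
rename_map = {col: weekdays_dict[int(col.split('_')[1])] for col in demo.columns}

# Total orders per weekday, indexed by readable weekday names
orders_per_weekday = demo.rename(columns=rename_map).sum()
print(orders_per_weekday)
```

Parsing the suffix keeps the mapping correct even if the column order changes.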

---

### **PCA (Principal Component Analysis)**
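
As an illustration of this step, here is a minimal sketch of PCA with scikit-learn on synthetic data. The array `X_demo`, the 90% variance threshold and the random seed are assumptions chosen for the example, not settings taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in for the metric features: 200 rows, 5 columns,
# where the last two columns are noisy copies of the first two
base = rng.normal(size=(200, 3))
X_demo = np.column_stack([base,
                          base[:, 0] + 0.1 * rng.normal(size=200),
                          base[:, 1] + 0.1 * rng.normal(size=200)])

# PCA is sensitive to scale, so standardize the features first
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep as many components as needed to explain at least 90% of the variance
# (a float n_components requires the 'full' SVD solver)
pca = PCA(n_components=0.90, svd_solver='full')
X_pca = pca.fit_transform(X_scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print('Components kept:', pca.n_components_)
print('Cumulative explained variance:', cumulative_variance)
```

Passing a fraction as `n_components` lets the data decide how many components to keep. A fixed integer is an equally valid choice when the target dimensionality is known.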

---

## ⚫🟢⚪ Clustering

#### **Hierarchical Clustering Algorithm[<sup>[1]</sup>](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)**
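
A minimal sketch of agglomerative clustering with scikit-learn, assuming Ward linkage and three clusters on synthetic blobs. The data `X_demo`, the linkage and the cluster count are illustrative assumptions rather than choices made in this project.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical data: 150 points around 3 well-separated centers
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Ward linkage merges, at each step, the pair of clusters that
# least increases the within-cluster variance
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
hc_labels = hc.fit_predict(X_demo)

# Cluster sizes
print(np.bincount(hc_labels))
```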

---

#### **K-Means Clustering Algorithm[<sup>[2]</sup>](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**
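
A minimal sketch of K-Means with scikit-learn, recording the inertia (within-cluster sum of squares) over a range of `k` as input to an elbow plot. The synthetic data `X_demo`, the range of `k` and the seeds are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: 300 points around 4 centers
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

# Fit K-Means for several values of k and store the inertia of each solution
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_demo)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f'k={k}: inertia={inertia:.1f}')
```

Inertia generally shrinks as `k` grows, so the value of interest is the point where the decrease flattens, not the minimum.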

---

#### **Self-Organizing Maps (SOM)[<sup>[5]</sup>](https://github.com/sevamoo/sompy)**

---

#### **Self-Organizing Maps (MiniSOM)[<sup>[6]</sup>](https://github.com/JustGlowing/minisom)**

---

#### **Density-Based Clustering [Mean Shift[<sup>[7]</sup>](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html), DBSCAN[<sup>[8]</sup>](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), GMM[<sup>[9]</sup>](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)]**
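
A minimal sketch of the three algorithms grouped under this heading, run on the same synthetic data with scikit-learn. The data `X_demo` and every hyperparameter (`quantile`, `eps`, `min_samples`, `n_components`) are assumptions picked for the toy example and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 300 points around 3 centers, standardized
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X_scaled = StandardScaler().fit_transform(X_demo)

# Mean Shift: the bandwidth is estimated from the data
bandwidth = estimate_bandwidth(X_scaled, quantile=0.2, random_state=42)
ms_labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(X_scaled)

# DBSCAN: points labelled -1 are treated as noise, not as a cluster
db_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_scaled)
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)

# Gaussian Mixture Model: soft clustering, hard labels taken from the
# most probable component
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)

print('Mean Shift clusters:', len(set(ms_labels)))
print('DBSCAN clusters (excluding noise):', n_db_clusters)
print('GMM components used:', len(set(gmm_labels)))
```

Mean Shift and DBSCAN infer the number of clusters from the data, while the mixture model needs `n_components` up front.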

---

## 📏 Clustering Evaluation/Analysis
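
Two measures that could be used to compare clustering solutions are the silhouette score and the share of variance explained by the partition (R²). The sketch below computes both for a K-Means solution on synthetic data. The helper `r2_clustering`, the data `X_demo` and the choice of three clusters are hypothetical and serve only as an illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def r2_clustering(X, labels):
    """Share of total variance explained by the partition: 1 - SSW / SST."""
    # Total sum of squares around the global mean
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    # Within-cluster sum of squares around each cluster mean
    ssw = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in np.unique(labels))
    return 1 - ssw / sst


# Hypothetical data and a K-Means solution to evaluate
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

sil = silhouette_score(X_demo, labels)
r2 = r2_clustering(X_demo, labels)

print(f'Silhouette score: {sil:.3f}')
print(f'R²: {r2:.3f}')
```

The silhouette score ranges from -1 to 1 and R² from 0 to 1. Higher is better in both cases, though R² always increases with the number of clusters.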

---

## 💾 Save Data

<br>

- To finish this notebook and proceed to the Streamlit app, we save the preprocessed dataset and the clustering results.

In [None]:
# Save the preprocessed dataset
#ABCDEats.to_parquet('data/DM2425_ABCDEats_2ndPart.parquet')

# Save the clustering results
#cluster_results.to_parquet('data/DM2425_ABCDEats_ClusteringResults.parquet')