# Hands-On Project

In [1]:
import pandas as pd

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans, SpectralClustering, AgglomerativeClustering 
from sklearn.cluster import AffinityPropagation, MeanShift


from pydataset import data
import plotly.express as px
import ClusterVisualizer as cv

## Tips Dataset

The Tips dataset comprises information about tips given by customers at a restaurant and various attributes of the meals they purchased.

In [2]:
tips = data('tips')
print(tips.shape)
tips.head()

(244, 7)


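| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |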


1. **total_bill** (numeric): The total bill amount for the meal, including tax, in dollars. 
2. **tip** (numeric): The tip amount left by the customer, in dollars.
3. **sex** (nominal): The gender of the person paying for the meal. 
4. **smoker** (nominal): Indicates whether the party included smokers, Yes (if there was at least one smoker) or No (if there were no smokers).
5. **day** (nominal): The day of the week on which the meal was purchased. 
6. **time** (nominal): Represents the time of day during which the meal was served.
7. **size** (ordinal): The size of the party, indicating the number of people dining.

In [3]:
# Are there missing values?
print(tips.isnull().sum())

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64


## Data preprocessing

We will use the two numerical features for the clustering algorithms: `total_bill` and `tip`.

In [4]:
# Plot the original data
features = tips[['total_bill', 'tip']]
features_melt = features.melt(var_name='feature', value_name='value')

fig_o = px.box(features_melt, x='feature', y='value')
fig_o.update_layout(title='Boxplot of Total Bill and Tip (Original Data)',
                    width=600, height=400)
fig_o.show()

In [5]:
# Standardizing the features
scaler = StandardScaler()

features_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
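
As a quick sanity check on what `StandardScaler` does, the sketch below reproduces it by hand on a small made-up table (the values are illustrative, not the Tips data). It assumes the scaler's default settings, under which each column is centered on its mean and divided by its population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative table (made-up values, not the Tips data)
demo = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0, 40.0],
                     'tip': [1.0, 3.0, 2.0, 6.0]})

# Standardization with scikit-learn
demo_scaled = pd.DataFrame(StandardScaler().fit_transform(demo),
                           columns=demo.columns)

# Same computation by hand: z = (x - mean) / std, with the population std (ddof=0)
demo_manual = (demo - demo.mean()) / demo.std(ddof=0)

# Compare both versions, then show each column's mean and standard deviation
print(np.allclose(demo_scaled, demo_manual))
print(demo_scaled.mean().round(6).tolist(), demo_scaled.std(ddof=0).round(6).tolist())
```

Note that pandas' own `std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` has to be passed explicitly to match the scaler.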

In [6]:
# Plot the standardized data
features_scaled_melt = features_scaled.melt(var_name='feature', value_name='value')

fig_s = px.box(features_scaled_melt, x='feature', y='value')
fig_s.update_layout(title='Boxplot of Total Bill and Tip (Standardized Data)',
                  width=600, height=400)
fig_s.show()

## Clustering algorithms

In [7]:
# Plotting the standardized data
cv_1 = cv.ClusterVisualizer(features_scaled)

cv_1.plot_data()

Let's apply several clustering algorithms to the standardized data in `features_scaled`.

### k-Means Clustering

In [8]:
# k-means clustering (k = 2)
kMeans2 = KMeans(n_clusters=2, n_init='auto', random_state=123).fit(features_scaled)

cv_1.plot_clusters(kMeans2.labels_, title='k-means Clustering (k = 2)')

In [9]:
# k-means clustering (k = 3)
kMeans3 = KMeans(n_clusters=3, n_init='auto', random_state=123).fit(features_scaled)

cv_1.plot_clusters(kMeans3.labels_, title='k-means Clustering (k = 3)')

### Spectral Clustering

In [10]:
# Spectral Clustering (k = 2)
spectral2 = SpectralClustering(n_clusters=2, random_state=123).fit(features_scaled)

cv_1.plot_clusters(spectral2.labels_, title='Spectral Clustering (k = 2)')

In [11]:
# Spectral Clustering (k = 3)
spectral3 = SpectralClustering(n_clusters=3, random_state=123).fit(features_scaled)

cv_1.plot_clusters(spectral3.labels_, title='Spectral Clustering (k = 3)')

### Agglomerative Clustering

In [12]:
# Agglomerative Clustering (k = 2)
agg2 = AgglomerativeClustering(n_clusters=2).fit(features_scaled) 

cv_1.plot_clusters(agg2.labels_, title='Agglomerative Clustering (k = 2)')

In [13]:
# Agglomerative Clustering (k = 3)
agg3 = AgglomerativeClustering(n_clusters=3).fit(features_scaled) 

cv_1.plot_clusters(agg3.labels_, title='Agglomerative Clustering (k = 3)')

### Affinity Propagation Clustering

In [14]:
# Applying Affinity Propagation Clustering
ap = AffinityPropagation(random_state=42, preference=-45).fit(features_scaled)

cv_1.plot_clusters(ap.labels_, title='Affinity Propagation Clustering')

### Mean-Shift Clustering

In [15]:
# Applying Mean-Shift Clustering
msh = MeanShift().fit(features_scaled)

cv_1.plot_clusters(msh.labels_, title='Mean-Shift Clustering')
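
Unlike the previous algorithms, Mean-Shift is not given a number of clusters. When `MeanShift()` is called without a `bandwidth`, scikit-learn estimates one with `estimate_bandwidth`, and the number of clusters follows from it. The sketch below illustrates this on synthetic blobs (not the Tips data); the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data with two well-separated groups (not the Tips data)
X_demo, _ = make_blobs(n_samples=200, centers=[[-4, 0], [4, 0]],
                       cluster_std=0.7, random_state=0)

# Bandwidth that scikit-learn estimates from the data (default quantile=0.3)
bw = estimate_bandwidth(X_demo)
print(f'estimated bandwidth: {bw:.3f}')

# Mean-Shift derives the number of clusters from the data and the bandwidth
ms_demo = MeanShift(bandwidth=bw).fit(X_demo)
n_found = len(np.unique(ms_demo.labels_))
print(f'clusters found: {n_found}')

# A much smaller bandwidth tends to split the data into more clusters
ms_small = MeanShift(bandwidth=bw / 4).fit(X_demo)
n_small = len(np.unique(ms_small.labels_))
print(f'clusters found with bandwidth / 4: {n_small}')
```

If the default result looks too coarse or too fragmented, passing an explicit `bandwidth` is the main knob to try.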

## Comparing clustering results

Let's create a DataFrame with the clustering results of all algorithms.

In [16]:
clusters = pd.DataFrame()
clusters['kM-2']  = kMeans2.labels_
clusters['kM-3']  = kMeans3.labels_
clusters['Sp-2']  = spectral2.labels_
clusters['Sp-3']  = spectral3.labels_
clusters['Agg-2'] = agg2.labels_
clusters['Agg-3'] = agg3.labels_
clusters['AP']    = ap.labels_
clusters['MSh']   = msh.labels_
clusters.head()

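| | kM-2 | kM-3 | Sp-2 | Sp-3 | Agg-2 | Agg-3 | AP | MSh |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
| 2 | 0 | 2 | 1 | 0 | 0 | 0 | 2 | 0 |
| 3 | 0 | 2 | 1 | 0 | 0 | 0 | 2 | 0 |
| 4 | 1 | 2 | 1 | 0 | 0 | 0 | 2 | 0 |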


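Three internal validation measures are used to compare the results. The Silhouette Score measures how similar each point is to its own cluster compared with the nearest other cluster; it ranges from -1 to 1, and higher is better. The Davies-Bouldin Index averages, over all clusters, the ratio of within-cluster scatter to between-cluster separation for each cluster's most similar neighbor; its minimum is 0, and lower is better. The Calinski-Harabasz Index is the ratio of between-cluster dispersion to within-cluster dispersion; higher is better.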
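
`ClusterVisualizer` is a local helper module whose internals are not shown here, so the following is only an assumption about how such scores can be obtained. The sketch calls the three functions from `sklearn.metrics` directly for each column of a labels DataFrame. The helper name `score_table` and the synthetic blobs are illustrative and do not come from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)


def score_table(X, labels_df):
    """Return one row of internal validation scores per labeling (hypothetical helper)."""
    rows = {}
    for name in labels_df.columns:
        labels = labels_df[name].to_numpy()
        rows[name] = {
            # in [-1, 1], higher is better
            'silhouette': silhouette_score(X, labels),
            # >= 0, lower is better
            'davies_bouldin': davies_bouldin_score(X, labels),
            # higher is better
            'calinski_harabasz': calinski_harabasz_score(X, labels),
        }
    return pd.DataFrame.from_dict(rows, orient='index')


# Synthetic data with three well-separated groups (not the Tips data)
X_demo, _ = make_blobs(n_samples=150, centers=[[-5, -5], [0, 5], [5, -5]],
                       cluster_std=0.8, random_state=0)

# One column of labels per configuration, as in the `clusters` DataFrame
labels_demo = pd.DataFrame({
    'kM-2': KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo),
    'kM-3': KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo),
})

scores_demo = score_table(X_demo, labels_demo)
print(scores_demo)
```

Because the three measures point in different directions (two are "higher is better", one is "lower is better"), keeping them in separate columns avoids mixing them up when comparing configurations.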

In [17]:
# Computing the Silhouette Scores
cv_1.plot_silhouette(clusters)

In [18]:
# Computing the Davies-Bouldin Scores
cv_1.plot_davies_bouldin(clusters)

In [19]:
# Compute the Calinski-Harabasz Scores
cv_1.plot_calinski_harabasz(clusters)

Two of the three internal validation measures agree that `Sp-2` (Spectral Clustering with k = 2) is the best cluster configuration.

## Analyzing clustering results

In [20]:
cv_1.plot_clusters(spectral2.labels_, title='Spectral Clustering (k = 2)')

According to Spectral Clustering with k = 2, there is one big cluster and one small cluster with only two cases. Those two cases have the highest values of the `total_bill` and `tip` features.
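
A very small cluster like this one can also be spotted programmatically by counting the members of each cluster. The sketch below uses made-up labels with the same 242 / 2 split and a hypothetical threshold `min_size`; neither is part of the analysis itself.

```python
import numpy as np
import pandas as pd

# Made-up cluster labels: 242 points in cluster 1 and 2 points in cluster 0
labels = np.array([1] * 242 + [0] * 2)

# Number of points per cluster
sizes = pd.Series(labels).value_counts()
print(sizes)

# Clusters smaller than a chosen threshold are candidates for a closer look
min_size = 5  # hypothetical threshold, to be tuned for the problem at hand
small_clusters = sizes[sizes < min_size].index.tolist()
print(small_clusters)

# Boolean mask selecting the points that belong to those small clusters
is_candidate = np.isin(labels, small_clusters)
print(is_candidate.sum())
```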

In [21]:
features_scaled['Sp-2'] = spectral2.labels_
features_scaled.head()

# Plotting boxplots of the clusters
features_scaled_melt = features_scaled.melt(id_vars='Sp-2', var_name='feature', value_name='value')
fig_sp2 = px.box(features_scaled_melt, x='feature', y='value', color='Sp-2')
fig_sp2.update_layout(title='Spectral Clustering (k = 2)',
                      width=600, height=400)
fig_sp2.show()

In [22]:
tips['Sp-2'] = spectral2.labels_
tips[tips['Sp-2'] == 0]

| | total_bill | tip | sex | smoker | day | time | size | Sp-2 |
|---|---|---|---|---|---|---|---|---|
| 171 | 50.81 | 10.0 | Male | Yes | Sat | Dinner | 3 | 0 |
| 213 | 48.33 | 9.0 | Male | No | Sat | Dinner | 4 | 0 |


The two cases could be outliers. In both, the payer was male, the meal was a Saturday dinner, and the party was a group of 3 or 4 people.

Let's remove them from the data and repeat the analysis.

## Removing outliers

In [23]:
tips_2 = tips[tips['Sp-2'] == 1].copy()
tips_2 = tips_2.drop(columns='Sp-2')

features_scaled2 = features_scaled[features_scaled['Sp-2'] == 1].copy()
features_scaled2 = features_scaled2.drop(columns='Sp-2')
print(features_scaled2.shape)
features_scaled2.head()

(242, 2)


| | total_bill | tip |
|---|---|---|
| 0 | -0.314711 | -1.439947 |
| 1 | -1.063235 | -0.969205 |
| 2 | 0.13778 | 0.363356 |
| 3 | 0.438315 | 0.225754 |
| 4 | 0.540745 | 0.44302 |
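
Note that `features_scaled2` keeps the values that were standardized with the mean and standard deviation of all 244 rows, and its index keeps the original row positions. That is a valid choice. As a hedged alternative, the sketch below shows how the remaining rows could be renumbered and re-standardized after the filter. The data and variable names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with one extreme row (not the Tips data)
raw = pd.DataFrame({'total_bill': [10.0, 12.0, 14.0, 16.0, 90.0],
                    'tip': [1.0, 2.0, 2.5, 3.0, 20.0]})
labels = pd.Series([1, 1, 1, 1, 0], index=raw.index)  # 0 marks the extreme row

# Keep only the rows of the large cluster and renumber them from 0
kept = raw[labels == 1].reset_index(drop=True)

# Refit the scaler on the remaining rows so each column is centered and scaled again
kept_scaled = pd.DataFrame(StandardScaler().fit_transform(kept),
                           columns=kept.columns)
print(kept_scaled.mean().round(6).tolist())
print(kept_scaled.std(ddof=0).round(6).tolist())
```

Whether to rescale is a judgment call: refitting removes the influence of the dropped rows on the scale, while keeping the original scaling makes the two runs directly comparable.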


## Clustering algorithms again

In [24]:
# Plotting the standardized data
cv_2 = cv.ClusterVisualizer(features_scaled2)

cv_2.plot_data('Data Points without Outliers')

In [25]:
# k-means clustering (k = 2)
kMeans2_2 = KMeans(n_clusters=2, n_init='auto', random_state=123).fit(features_scaled2)

cv_2.plot_clusters(kMeans2_2.labels_, title='k-means Clustering (k = 2)')

In [26]:
# k-means clustering (k = 3)
kMeans3_2 = KMeans(n_clusters=3, n_init='auto', random_state=123).fit(features_scaled2)

cv_2.plot_clusters(kMeans3_2.labels_, title='k-means Clustering (k = 3)')

In [27]:
# Spectral Clustering (k = 2)
spectral2_2 = SpectralClustering(n_clusters=2, random_state=123).fit(features_scaled2)

cv_2.plot_clusters(spectral2_2.labels_, title='Spectral Clustering (k = 2)')

In [28]:
# Spectral Clustering (k = 3)
spectral3_2 = SpectralClustering(n_clusters=3, random_state=123).fit(features_scaled2)

cv_2.plot_clusters(spectral3_2.labels_, title='Spectral Clustering (k = 3)')

In [29]:
# Agglomerative Clustering (k = 2)
agg2_2 = AgglomerativeClustering(n_clusters=2).fit(features_scaled2) 

cv_2.plot_clusters(agg2_2.labels_, title='Agglomerative Clustering (k = 2)')

In [30]:
# Agglomerative Clustering (k = 3)
agg3_2 = AgglomerativeClustering(n_clusters=3).fit(features_scaled2) 

cv_2.plot_clusters(agg3_2.labels_, title='Agglomerative Clustering (k = 3)')

In [31]:
# Applying Affinity Propagation Clustering
ap_2 = AffinityPropagation(random_state=42, preference=-45).fit(features_scaled2)

cv_2.plot_clusters(ap_2.labels_, title='Affinity Propagation Clustering')

In [32]:
# Applying Mean-Shift Clustering
msh_2 = MeanShift().fit(features_scaled2)

cv_2.plot_clusters(msh_2.labels_, title='Mean-Shift Clustering')

## Comparing clustering results again

In [33]:
clusters_2 = pd.DataFrame()
clusters_2['kM-2']  = kMeans2_2.labels_
clusters_2['kM-3']  = kMeans3_2.labels_
clusters_2['Sp-2']  = spectral2_2.labels_
clusters_2['Sp-3']  = spectral3_2.labels_
clusters_2['Agg-2'] = agg2_2.labels_
clusters_2['Agg-3'] = agg3_2.labels_
clusters_2['AP']    = ap_2.labels_
clusters_2['MSh']   = msh_2.labels_
clusters_2.head()

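| | kM-2 | kM-3 | Sp-2 | Sp-3 | Agg-2 | Agg-3 | AP | MSh |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 2 | 3 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 2 | 3 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 |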


In [34]:
# Computing the Silhouette Scores
cv_2.plot_silhouette(clusters_2)

In [35]:
# Computing the Davies-Bouldin Scores
cv_2.plot_davies_bouldin(clusters_2)

In [36]:
# Compute the Calinski-Harabasz Scores
cv_2.plot_calinski_harabasz(clusters_2)

When the three validation metrics (Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index) each favor a different clustering algorithm, it suggests that each algorithm excels in a different aspect of clustering quality:
- The Silhouette Score highlights how cohesive and well-separated the clusters are (higher is better).
- The Davies-Bouldin Index favors compact, well-separated clusters: low intra-cluster distances and high inter-cluster distances give a low index (lower is better).
- The Calinski-Harabasz Index rewards dense, well-separated clusters through the ratio of between-cluster to within-cluster variance (higher is better).


The Silhouette scores of `Sp-2` and `MSh` are quite similar. Since `MSh` is also the best option according to the Davies-Bouldin Index, we will choose it as the winning clustering algorithm.
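
When the metrics disagree, one possible tie-breaking scheme (an assumption here, not the procedure used in this notebook) is to rank the candidates on each metric and average the ranks. The sketch below uses hypothetical scores for three labelings called `A`, `B`, and `C`; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical scores for three labelings (illustrative numbers only)
scores = pd.DataFrame(
    {'silhouette': [0.40, 0.55, 0.54],
     'davies_bouldin': [0.90, 0.70, 0.50],
     'calinski_harabasz': [210.0, 150.0, 140.0]},
    index=['A', 'B', 'C'])

# Rank each metric so that rank 1 is always the best:
# silhouette and Calinski-Harabasz are "higher is better",
# Davies-Bouldin is "lower is better"
ranks = pd.DataFrame({
    'silhouette': scores['silhouette'].rank(ascending=False),
    'davies_bouldin': scores['davies_bouldin'].rank(ascending=True),
    'calinski_harabasz': scores['calinski_harabasz'].rank(ascending=False),
})

# Average rank across the three metrics; the smallest value wins
ranks['mean_rank'] = ranks.mean(axis=1)
print(ranks.sort_values('mean_rank'))
```

In this made-up example each metric prefers a different labeling, and `B` ends up with the lowest mean rank. Averaging ranks weights the three metrics equally, which is itself a modeling choice.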

## Analyzing clustering results again

In [37]:
features_scaled2['MSh'] = msh_2.labels_
features_scaled2.head()

| | total_bill | tip | MSh |
|---|---|---|---|
| 0 | -0.314711 | -1.439947 | 0 |
| 1 | -1.063235 | -0.969205 | 0 |
| 2 | 0.13778 | 0.363356 | 0 |
| 3 | 0.438315 | 0.225754 | 0 |
| 4 | 0.540745 | 0.44302 | 0 |


In [38]:
# Plotting boxplots of the clusters
features_scaled_melt2 = features_scaled2.melt(id_vars='MSh', var_name='feature', value_name='value')
fig_msh = px.box(features_scaled_melt2, x='feature', y='value', color='MSh')
fig_msh.update_layout(title='Mean-Shift Clustering',
                      width=600, height=400)
fig_msh.show()

In [39]:
tips_2['MSh'] = msh_2.labels_
tips_2.head()

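| | total_bill | tip | sex | smoker | day | time | size | MSh |
|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 |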


In [40]:
# Sex Distribution by clusters
fig_x = px.bar(pd.crosstab(tips_2.MSh, tips_2.sex), barmode='group', 
               width=600, height=400, title='Sex Distribution by Clusters')
fig_x.show()

In [41]:
# Smoker Distribution by clusters
fig_k = px.bar(pd.crosstab(tips_2.MSh, tips_2.smoker), barmode='group', 
               width=600, height=400, title='Smoker Distribution by Clusters')
fig_k.show()

In [42]:
# Day Distribution by clusters
fig_k = px.bar(pd.crosstab(tips_2.MSh, tips_2.day), barmode='group', 
               width=600, height=400, title='Day Distribution by Clusters')
fig_k.show()

In [43]:
# Time Distribution by clusters
fig_k = px.bar(pd.crosstab(tips_2.MSh, tips_2.time), barmode='group', 
               width=600, height=400, title='Time Distribution by Clusters')
fig_k.show()

Visually, the clusters show no marked differences in `sex`, `smoker`, `day`, or `time`. Confirming this would require statistical tests, which are beyond the scope of this course. The main differences between clusters seem to occur in the `total_bill` and `tip` features.
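
For readers who want to see what such a test looks like, the sketch below runs a chi-square test of independence on a contingency table built with `pd.crosstab`, using `scipy.stats.chi2_contingency`. The data are made up and far too small for a reliable result; the example only shows the mechanics and is not a finding about the Tips data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up example: cluster label and a categorical attribute for 12 meals
demo = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'time': ['Dinner', 'Dinner', 'Dinner', 'Lunch', 'Lunch', 'Lunch',
             'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Lunch', 'Lunch'],
})

# Contingency table: rows are clusters, columns are categories
table = pd.crosstab(demo['cluster'], demo['time'])

# Null hypothesis: cluster membership and the attribute are independent
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f'chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}')
```

A small p-value would suggest that the attribute is distributed differently across clusters. When the expected counts are small, as they are here, the chi-square approximation is unreliable and an exact test is usually preferred.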

## References

- https://scikit-learn.org/stable/modules/clustering.html
- Müller, A.C. & Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. USA: O'Reilly Media, Inc., chapter 3.
- VanderPlas, J. (2017). Python Data Science Handbook: Essential Tools for Working with Data. USA: O'Reilly Media, Inc., chapter 5.