<center><img src="https://p1.pxfuel.com/preview/700/259/602/mushrooms-fungi-forest-nature.jpg" alt="mushrooms" width="400"/>
    <h1>Clustering Categorical Data using Gower distance</h1>
    <h3>🍄 Mushrooms clustered Hierarchically!</h3>
</center>

As you may know, K-Means, DBSCAN, OPTICS and hierarchical clustering all have one thing in common: they are distance-based clustering algorithms. They are most often used with Euclidean distance (K-Means is built on it), which is not meaningful for categorical data. So, in order to cluster non-numerical data with these methods, we have to use a different distance function.

One of the best-known distance functions that can be applied to categorical data is the **Gower distance**, which we are going to use in this notebook to cluster some data on different mushrooms.

# Getting Started

## Load Libraries

Let's begin by loading some libraries that we are going to use later on:

In [None]:
!pip install --upgrade scikit-learn

In [None]:
# Essentials:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Rand Index
from sklearn.metrics.cluster import rand_score

# Encode labels
from sklearn import preprocessing

# Confusion Matrix
from sklearn.metrics import confusion_matrix

In [None]:
# To make the code reproducible
np.random.seed(42)

## Load Dataset

In [None]:
data_full = pd.read_csv('../input/mushroom-classification/mushrooms.csv')
data_full.head()

# Preprocessing

## Removing the target value
Since we want to perform clustering on this dataset, we must remove target values.

In [None]:
target = data_full[['class']]
data_no_target = data_full.drop(['class'],axis=1)
data_no_target.head()

## Examine Data type

Let's see which columns are numerical and which ones are not:

In [None]:
data_no_target.info()

Since all of the columns have `Dtype = object`, we conclude that this dataset consists only of categorical data. Also, note that there are no _missing values_ in this dataset.

## Investigate categories

Here is how we can see the number of categories in each column:

In [None]:
data_no_target.nunique()

One of the columns (`veil-type`) has only 1 unique value. Since there are no missing values, this means that every row has the same value in this column. In other words, the column carries no information and we can remove it without affecting the clustering. Let's do it:

In [None]:
data_categorical = data_no_target.drop(['veil-type'], axis=1)
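The same check can be automated instead of reading the `nunique()` output by eye. Here is a minimal sketch on a tiny made-up frame (the `toy` data below is hypothetical and only stands in for the mushroom columns):

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "f"],
    "veil-type": ["p", "p", "p", "p"],  # constant column
    "odor": ["a", "n", "n", "a"],
})

# Columns with a single unique value cannot help separate the rows
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
toy_reduced = toy.drop(columns=constant_columns)

print(constant_columns)
print(toy_reduced.columns.tolist())
```

On the real data, the same list comprehension applied to `data_no_target` should pick out `veil-type`.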

**That's it! We need no more 'preprocessing' on our data.**

# Gower Distance

Gower distance is a measure of the distance between two entities whose attributes are a mix of categorical and numerical values.

It is not included in the scikit-learn package, but fortunately there is a nice implementation of it available on [GitHub](https://github.com/wwwjk366/gower).

So, let's install it:

In [None]:
!pip install gower

Here is how it works: you simply pass your dataset to `gower.gower_matrix`, and it returns a distance matrix, which can then be fed to several scikit-learn models such as `DBSCAN` and `AgglomerativeClustering`. Let's calculate the distance matrix:

In [None]:
import gower

distance_matrix = gower.gower_matrix(data_categorical)

distance_matrix
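To get a feel for the numbers in such a matrix: when every column is categorical and all columns are weighted equally, the Gower distance between two rows comes down to the fraction of columns on which they disagree. The sketch below reproduces that idea with plain NumPy on a made-up three-row frame. It is an illustration under those assumptions (the `toy` data is hypothetical), not the `gower` package's actual implementation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with categorical columns only
toy = pd.DataFrame({
    "cap-shape": ["x", "x", "b"],
    "odor":      ["a", "n", "n"],
    "habitat":   ["u", "u", "g"],
})

values = toy.to_numpy(dtype=object)

# mismatches[i, j, k] is True when rows i and j differ on column k
mismatches = values[:, None, :] != values[None, :, :]

# Average the per-column mismatches to get one distance per pair of rows
toy_distances = mismatches.mean(axis=2)

print(toy_distances)
```

Rows 0 and 1 differ only in `odor`, so their distance is 1/3; rows 0 and 2 differ in every column, so their distance is 1.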

# Using Gower Distance for Agglomerative Clustering

Now that we have a nice distance matrix, we can cluster our data. Some of scikit-learn's clustering models can process a distance matrix instead of raw data, for example `DBSCAN`, `OPTICS` and `AgglomerativeClustering`.

In my experience, parameter tuning for `DBSCAN` and `OPTICS` can be a real pain, so let's use `AgglomerativeClustering` instead.

<div class="alert alert-warning" role="alert">
  ⚠ if you want to learn more about Agglomerative clustering, read <a href="https://github.com/HalflingWizard/MachineLearning/blob/main/3-%20Clustering/Hierarchical%20clustering.md" class="alert-link">my notes on this method</a>.
</div>

In order to cluster our data with `AgglomerativeClustering` using a distance matrix, we set `metric='precomputed'` (this parameter was called `affinity` before scikit-learn 1.2, and the old name was removed in 1.4) and then feed the model the `distance_matrix`. We also want the model to produce 2 clusters, because the original targets (which we are going to use to evaluate the model's performance) have two classes: `p` (poisonous) and `e` (edible).

In [None]:
model = AgglomerativeClustering(n_clusters=2, 
                                metric='precomputed')

In [None]:
clusters = model.fit_predict(distance_matrix)

## Oops! we made a mistake! 😬

scikit-learn's `AgglomerativeClustering` uses `ward` as its **linkage** by default. Linkage is the criterion used to measure the distance between clusters. scikit-learn provides 4 linkage criteria: `ward`, `average`, `complete` and `single`. You can find a comparison between them [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html). 

> - single linkage is fast, and can perform well on non-globular data, but it performs poorly in the presence of noise.
> - average and complete linkage perform well on cleanly separated globular clusters, but have mixed results otherwise.
> - Ward is the most effective method for noisy data.

As the error message confirms, **Ward can only work with Euclidean distances.** So, we have to choose between `single`, `complete` and `average`. Let's just train three models with different linkages and see which one is better.

In order to do so, we need an evaluation metric. Since we have our target values, I'm going to use the _Rand Index_.
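The linkage restriction described above can be reproduced on a tiny hand-made distance matrix. This is only a sketch: the 4×4 matrix is hypothetical, and `distance_kw` is a helper name I introduce so that the snippet runs whether your scikit-learn version calls the parameter `metric` or `affinity`.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Pick whichever parameter name this scikit-learn installation accepts
params = inspect.signature(AgglomerativeClustering).parameters
distance_kw = "metric" if "metric" in params else "affinity"

# Hypothetical distances: points 0 and 1 are close, points 2 and 3 are close
toy_distances = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])

# The three linkages that accept a precomputed distance matrix
results = {}
for linkage in ("single", "average", "complete"):
    toy_model = AgglomerativeClustering(
        n_clusters=2, linkage=linkage, **{distance_kw: "precomputed"}
    )
    results[linkage] = toy_model.fit_predict(toy_distances)
    print(linkage, results[linkage])

# The default linkage (ward) rejects a precomputed matrix at fit time
try:
    AgglomerativeClustering(
        n_clusters=2, **{distance_kw: "precomputed"}
    ).fit(toy_distances)
    ward_error = None
except ValueError as err:
    ward_error = str(err)

print(ward_error)
```

On such a cleanly separated toy example all three linkages should agree. The real data is where they start to differ.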

## Prepare target values for Rand Index

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. Perfect labeling is scored 1.0.
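The pair-counting definition above can be checked by hand on a small example. The sketch below uses made-up labels and a helper named `rand_index_by_pairs` (my own name, not a scikit-learn function), and compares the result with `rand_score`:

```python
from itertools import combinations

from sklearn.metrics.cluster import rand_score

# Hypothetical ground truth and cluster assignments
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 0, 0]


def rand_index_by_pairs(a, b):
    """Fraction of sample pairs on which the two labelings agree."""
    agreements = 0
    total = 0
    for i, j in combinations(range(len(a)), 2):
        same_in_a = a[i] == a[j]
        same_in_b = b[i] == b[j]
        # A pair agrees if it is together in both or apart in both
        agreements += int(same_in_a == same_in_b)
        total += 1
    return agreements / total


manual = rand_index_by_pairs(true_labels, pred_labels)
library = rand_score(true_labels, pred_labels)
print(manual, library)

# Swapping the label names does not change the score
flipped = [1 - label for label in true_labels]
print(rand_score(true_labels, flipped))
```

Because only pair memberships are compared, the score does not care whether a cluster is called `0` or `1`.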

To compute it, I have to encode the target labels:

In [None]:
encoder = preprocessing.LabelEncoder()

encoded_target = target.apply(encoder.fit_transform)

print(f'in this encoding, {encoded_target.iloc[0].values} represents {target.iloc[0].values}')

labels = pd.DataFrame()
labels['target'] = encoded_target.values.reshape(1, -1).tolist()[0]

Now that we have encoded target values, we can begin training models!

## Agglomerative Clustering with Single Linkage

In [None]:
model_single = AgglomerativeClustering(n_clusters=2, linkage='single', metric='precomputed')
clusters_single = model_single.fit_predict(distance_matrix)

In [None]:
labels['single-predictions'] = clusters_single

Now let's evaluate the results with the Rand Index and a pie chart showing the share of samples in each cluster. (We expect it to be rather balanced.)

In [None]:
sri = rand_score(encoded_target.values.reshape(1, -1)[0], clusters_single)
print(f'Rand Index: {sri}')

In [None]:
labels[['single-predictions']].value_counts().plot.pie(autopct='%1.0f%%', pctdistance=0.7, labeldistance=1.1)

This is terrible! Obviously `single` shouldn't be our choice.

## Agglomerative Clustering with Average Linkage

In [None]:
model_average = AgglomerativeClustering(n_clusters=2, linkage='average', metric='precomputed')
clusters_average = model_average.fit_predict(distance_matrix)

In [None]:
labels['average-predictions'] = clusters_average

In [None]:
ari = rand_score(encoded_target.values.reshape(1, -1)[0], clusters_average)
print(f'Rand Index: {ari}')

In [None]:
labels[['average-predictions']].value_counts().plot.pie(autopct='%1.0f%%', pctdistance=0.7, labeldistance=1.1)

It's even worse than `single` linkage. Just one more option is left. Let's see if it works! (fingers crossed!!! 🤞)

## Agglomerative Clustering with Complete Linkage

In [None]:
model_complete = AgglomerativeClustering(n_clusters=2, linkage='complete', metric='precomputed')
clusters_complete = model_complete.fit_predict(distance_matrix)

In [None]:
labels['complete-predictions'] = clusters_complete

In [None]:
cri = rand_score(encoded_target.values.reshape(1, -1)[0], clusters_complete)
print(f'Rand Index: {cri}')

In [None]:
labels[['complete-predictions']].value_counts().plot.pie(autopct='%1.0f%%', pctdistance=0.7, labeldistance=1.1)

Wow! this is much better! 👏 

Let's compare our clusters with original target classes:

In [None]:
labels.value_counts(["target", "complete-predictions"])

Hmmm... it seems that our clusters have the opposite labels compared with the encoded targets. Cluster IDs are arbitrary, so this is not an error, but let's align the labels first:

In [None]:
labels['aligned-clusters'] = labels['complete-predictions'].apply(lambda x: int(not x))

In [None]:
labels.value_counts(["target", "aligned-clusters"])
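Flipping by hand works once we have looked at the counts. For the two-cluster case, the decision can also be made automatically: flip the cluster IDs only when they agree with the targets less than half of the time. A minimal sketch, assuming both arrays contain only 0 and 1 (`align_binary_clusters` is a helper name introduced here for illustration):

```python
import numpy as np


def align_binary_clusters(target, clusters):
    """Return cluster IDs flipped if that matches the targets better."""
    target = np.asarray(target)
    clusters = np.asarray(clusters)
    agreement = np.mean(target == clusters)
    # Less than 50% agreement means the opposite naming fits better
    return 1 - clusters if agreement < 0.5 else clusters


# Hypothetical labels: the clusters are mostly the mirror image of the target
toy_target = [0, 0, 1, 1]
toy_clusters = [1, 1, 0, 1]

aligned = align_binary_clusters(toy_target, toy_clusters)
print(aligned.tolist())
```

Applied to `labels['target']` and `labels['complete-predictions']`, this should give the same result as the `int(not x)` flip above whenever a flip is needed.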

Now we can use a confusion matrix to better understand our performance.

## Confusion Matrix

Let's create a confusion matrix to compare our predicted labels with the actual target:

In [None]:
cf_matrix = confusion_matrix(encoded_target.values.reshape(1, -1)[0], labels[["aligned-clusters"]].values.reshape(1, -1)[0])
# Cell names follow scikit-learn's layout: rows are true labels, columns are predictions
cf_labels = ['True Neg','False Pos','False Neg','True Pos']
cf_labels = np.asarray(cf_labels).reshape(2,2)
fig, ax = plt.subplots(1, 1)
# Colour each cell by its share of all samples and draw on the axes created above
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=cf_labels, fmt='', cmap='Blues', ax=ax)
ax.set_ylabel('Target Labels')
ax.set_xlabel('Predicted Labels')

Using the confusion matrix, we can calculate other evaluation metrics such as **accuracy**, **precision**, **recall** and **F1 score**.

<div class="alert alert-danger" role="alert">
  ⚠ If you are not familiar with Confusion matrix and Classification Evaluation and Metrics, I recommend you watch <a href="https://www.youtube.com/watch?v=-ORE0pp9QNk" class="alert-link">my video on this subject</a>.
</div>

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('-ORE0pp9QNk')

In [None]:
True_neg = cf_matrix[0,0]
False_pos = cf_matrix[0,1]
True_pos = cf_matrix[1,1]
False_neg = cf_matrix[1,0]

accuracy = (True_neg + True_pos)/(True_neg + False_neg + True_pos + False_pos)
recall = (True_pos)/(False_neg+True_pos)
precision = (True_pos)/(False_pos + True_pos)
F1_score = 2 * ((precision*recall)/(precision+recall))

In [None]:
print(f'Accuracy: {accuracy}')
print(f'Recall: {recall}')
print(f'Precision: {precision}')
print(f'F1_score: {F1_score}')
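As a sanity check, the hand-written formulas can be compared with scikit-learn's own metric functions. The sketch below does this on a small made-up pair of label arrays (`y_true` and `y_pred` are hypothetical, not the notebook's results), treating label `1` as the positive class:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels, with 1 as the positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_accuracy = (tn + tp) / (tn + fp + fn + tp)
manual_recall = tp / (fn + tp)
manual_precision = tp / (fp + tp)
manual_f1 = 2 * (manual_precision * manual_recall) / (manual_precision + manual_recall)

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
print(manual_precision, precision_score(y_true, y_pred))
print(manual_f1, f1_score(y_true, y_pred))
```

Each printed pair should match. On the notebook's data, the same functions can be called with `labels['target']` and `labels['aligned-clusters']`.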

# Everything Looks good!

Congrats! we did it! 🎉

We successfully used Gower Distance to cluster categorical data using Agglomerative clustering and the results were totally acceptable. 

<div class="alert alert-danger" role="alert" style="text-align:center;">
    I hope you enjoyed this tutorial. If you did, please consider subscribing to <b><a href="https://www.youtube.com/channel/UC34Gj0-vHuBiTNEYlP7wczg">my YouTube Channel ▶</a></b>
</div>

<center><h2><span style="font-family:cursive;"> Also, please Upvote! 😜 </span></h2></center>