<center><img src="https://www.pngix.com/pngfile/big/323-3231328_mario-mushroom-png.png" alt="drawing" width="200"/>
    <h1>🍄 Shrooming!</h1>
    <h3>K-modes Clustering on Mushrooms Dataset</h3>
</center>

In this notebook, I'll try to cluster the mushroom data into two groups so that we can figure out which mushrooms are poisonous and which are edible.

We'll go through the data, pre-process it, and then use our powerful clustering method to help us tell the delicious, friendly mushrooms from the killer ones.

First, let's start by importing libraries:

# Libraries

First, we import every library we'll use later. I've added comments so you'll know what each one is for.

In [None]:
!pip install --upgrade scikit-learn

In [None]:
# Essentials 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Rand Index
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics.cluster import rand_score

# Encode labels
from sklearn import preprocessing

# Confusion Matrix
from sklearn.metrics import confusion_matrix

In [None]:
# To make the code reproducible
np.random.seed(42)

# Load the Data

In [None]:
mushrooms = pd.read_csv("../input/mushroom-classification/mushrooms.csv")
mushrooms.head()

## See no evil! 🙈
We want to _cluster_ our data. Clustering is an _unsupervised_ task, so before we do anything, we must remove the target (`class`) column. We won't look at the targets until we have finished clustering, when we'll use them to evaluate our model.

In [None]:
target = mushrooms[['class']]
see_no_evil = mushrooms.drop(['class'], axis=1)
see_no_evil.head()

# Take a look at our data
With the target values gone, it is now safe to investigate the data. First, let's take a look at the description of each column:
* cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
* cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
* cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
* bruises: bruises=t,no=f
* odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
* gill-attachment: attached=a,descending=d,free=f,notched=n
* gill-spacing: close=c,crowded=w,distant=d
* gill-size: broad=b,narrow=n
* gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
* stalk-shape: enlarging=e,tapering=t
* stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
* stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
* stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
* stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* veil-type: partial=p,universal=u
* veil-color: brown=n,orange=o,white=w,yellow=y
* ring-number: none=n,one=o,two=t
* ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
* spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
* population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
* habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

**Uh-oh, it looks like every column in this dataset contains _categorical features_! This kind of data can be a pain in the A$$! 🤦‍♂️**

Let's check again:

In [None]:
see_no_evil.info()

Yup, the `Dtype` of every column is `object`, so every one of them contains categorical data. The good news is that `info()` reports no null values, so that's a relief (keep in mind, though, that `stalk-root` marks missing entries with `?`, as listed above, and `info()` does not count those). Let's check the number of unique values in each column. This helps us understand these categorical features better.

In [None]:
see_no_evil.nunique()
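
Since `info()` only counts null values, a placeholder like the `?` listed for `stalk-root` in the column descriptions would slip through. Here is a minimal sketch of how such placeholders could be counted. The `toy` frame and its values are made up for illustration; on the real data you would run the same comparison on `see_no_evil`.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are made up).
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?"],
    "odor": ["n", "a", "n", "f"],
})

# Compare every cell with the "?" placeholder and count the matches per column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Whether to keep `?` as its own category or treat it as missing is a judgment call; this notebook keeps it as a category.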

Notice that `veil-type` has only one unique value, which means every entry in this column is the same. Since it can't help us separate the mushrooms into clusters, let's just get rid of it.

In [None]:
data = see_no_evil.drop(['veil-type'],axis=1)
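
If you'd rather not hard-code the column name, constant columns can also be found programmatically. This is just a sketch on a made-up `toy` frame; the names `constant_columns` and `reduced` are my own and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame: "veil-type" is constant, "odor" is not (values are made up).
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p"],
    "odor": ["n", "a", "n"],
})

# A column with a single distinct value carries no information for clustering.
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_columns)
print(constant_columns)
```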

# Preprocessing

Now we should pre-process the data so that it is ready for an ML model. Since all our data is categorical, there is no need to scale it, and there are no outliers to handle. So the only thing we need to do is encode the category letters as integers.

In [None]:
encoder = preprocessing.LabelEncoder()
encoded_data = data.apply(encoder.fit_transform)
encoded_data.head()
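
To see what this encoding does to a single column, here is a plain-Python sketch that mimics the idea behind `LabelEncoder`: each distinct value gets the index of its position in the sorted list of distinct values. The helper `label_encode` and the sample letters are my own illustration, not part of sklearn.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


cap_shape = ["x", "b", "x", "s", "f"]  # made-up sample of cap-shape letters
codes, classes = label_encode(cap_shape)
print(classes)
print(codes)
```

The integers are only names for the categories. Their order and spacing come from alphabetical sorting, not from anything about the mushrooms.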

# Train the Model

Now it's time to train our clustering model. We know that our model should find 2 classes, so the number of clusters is known.

It would be really convenient if we could just use the k-means clustering method, because it could simply give us two clusters. But it's not suitable here, because k-means works with Euclidean distance, which is not meaningful on discrete data like ours.

<div class="alert alert-warning" role="alert">
  ⚠ If you want to learn more about k-means clustering, read <a href="https://github.com/HalflingWizard/MachineLearning/blob/main/3-%20Clustering/K-Means.md">my notes on this method</a>.
</div>

Fortunately, there is an extension of this algorithm called **K-modes** that works on categorical data. There is a Python implementation of it [here](https://github.com/nicodv/kmodes).

> k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
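
To make the quote above a bit more concrete, here is a small plain-Python sketch of the two ideas behind k-modes: counting mismatching categories instead of measuring Euclidean distance, and using the per-column mode as a cluster center. This is not the `kmodes` library's code; the function names and the encoded rows are made up for illustration.

```python
import math
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of features on which two rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Ordinary Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def column_modes(rows):
    """Most frequent value of each column, used as the cluster center."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


# Made-up encoded rows with two features each.
row_a = (0, 0)
row_b = (1, 0)
row_c = (9, 0)

# row_b and row_c both differ from row_a in exactly one feature...
print(matching_dissimilarity(row_a, row_b), matching_dissimilarity(row_a, row_c))
# ...but Euclidean distance treats the arbitrary codes as magnitudes.
print(euclidean(row_a, row_b), euclidean(row_a, row_c))

center = column_modes([(0, 0), (1, 0), (1, 2)])
print(center)
```

With matching dissimilarity, it doesn't matter which integer a category happened to receive during encoding, only whether two rows share it.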

Let's first install this library:

In [None]:
!pip install kmodes

Now we can use it to cluster our data into two classes:

In [None]:
from kmodes.kmodes import KModes

km = KModes(n_clusters=2, init='Cao', verbose=1)
clusters = km.fit_predict(encoded_data)
predicted_labels = pd.DataFrame(clusters, columns=['predicted-label'])

In [None]:
predicted_labels.value_counts().plot.pie(autopct='%1.0f%%', pctdistance=0.7, labeldistance=1.1)

After 5 different initializations, k-modes keeps the best result. Let's add these labels to our dataframe (`data`):

In [None]:
data['predicted-labels'] = clusters

In [None]:
encoded_target = target.apply(encoder.fit_transform)
print(f'in this encoding, {encoded_target.iloc[0].values} represents {target.iloc[0].values}')

# Evaluate our predictions

Now we can finally look at our targets. I want to compare our predicted labels with the target classes and figure out whether we've done a good job or not.

## Rand Index

First, I'm going to calculate the **Rand Index**. The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. Perfect labeling is scored 1.0.

To do so, I need the encoded target labels side by side with the predictions:

In [None]:
labels = pd.DataFrame()
labels['target'] = encoded_target.values.reshape(1, -1).tolist()[0]
labels['prediction'] = clusters
labels.value_counts(["target", "prediction"])

Hmmm... it seems that our predicted labels are aligned with the target labels, which means cluster 0 = `e` and cluster 1 = `p`.

This is how we calculate rand index using sklearn:

In [None]:
ri = rand_score(encoded_target.values.reshape(1, -1)[0], clusters)
ari = adjusted_rand_score(encoded_target.values.reshape(1, -1)[0], clusters)

print(f'Rand Index: {ri}')
print(f'Adjusted Rand Index: {ari}')

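The Rand score is **0.78**, meaning that about 78% of all sample pairs are grouped the same way in our clusters as in the true classes, which is good. 👍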
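
The pair-counting definition of the Rand Index given earlier can be sketched in a few lines of plain Python. This is only an illustration on made-up labels (the function name `rand_index` is my own); for real data, sklearn's `rand_score` is the better choice, since checking every pair this way is slow on thousands of rows.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair agrees if both labelings group it together or both split it.
        agree += same_true == same_pred
        total += 1
    return agree / total


truth = [0, 0, 1, 1]      # made-up target labels
swapped = [1, 1, 0, 0]    # same grouping, cluster ids swapped
partial = [0, 0, 0, 1]    # one sample in the wrong group

print(rand_index(truth, swapped))
print(rand_index(truth, partial))
```

Notice that swapping the cluster ids does not change the score, which is why the Rand Index is handy for clustering, where ids are arbitrary.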

## Confusion Matrix

Now, let's create a confusion matrix to compare our predicted labels with the actual target:

In [None]:
cf_matrix = confusion_matrix(encoded_target.values.reshape(1, -1)[0], clusters)
# Cell names follow sklearn's layout: rows are targets, columns are predictions
cell_names = np.asarray(['True Neg', 'False Pos', 'False Neg', 'True Pos']).reshape(2, 2)
fig, ax = plt.subplots(1, 1)
# Color each cell by its share of all samples and draw on the axes created above
sns.heatmap(cf_matrix / np.sum(cf_matrix), annot=cell_names, fmt='', cmap='Blues', ax=ax)
ax.set_ylabel('Target Labels')
ax.set_xlabel('Predicted Labels')

Using the confusion matrix, we can calculate other evaluation metrics such as **accuracy**, **precision**, **recall** and **F1 score**.

<div class="alert alert-danger" role="alert">
  If you are not familiar with the confusion matrix and classification evaluation metrics, I recommend you watch my video on this subject 👇
</div>

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('-ORE0pp9QNk')

In [None]:
True_neg = cf_matrix[0,0]
False_pos = cf_matrix[0,1]
True_pos = cf_matrix[1,1]
False_neg = cf_matrix[1,0]

accuracy = (True_neg + True_pos)/(True_neg + False_neg + True_pos + False_pos)
recall = (True_pos)/(False_neg+True_pos)
precision = (True_pos)/(False_pos + True_pos)
F1_score = 2 * ((precision*recall)/(precision+recall))

In [None]:
print(f'Accuracy: {accuracy}')
print(f'Recall: {recall}')
print(f'Precision: {precision}')
print(f'F1_score: {F1_score}')
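
One caveat: unlike the Rand Index, these metrics depend on which id the clustering happens to give each cluster. In this run cluster 0 lined up with `e` and cluster 1 with `p`, but that is not guaranteed. Below is a minimal sketch for the two-cluster case that flips the cluster ids when they disagree with the targets more often than they agree. The helper name `align_binary_clusters` and the sample labels are made up for illustration.

```python
def align_binary_clusters(targets, clusters):
    """Flip 0/1 cluster ids if they match the targets less than half the time."""
    agreement = sum(t == c for t, c in zip(targets, clusters)) / len(targets)
    if agreement < 0.5:
        return [1 - c for c in clusters]
    return list(clusters)


targets = [0, 0, 1, 1]       # made-up encoded targets
raw_clusters = [1, 1, 0, 1]  # made-up cluster ids that came out swapped
aligned = align_binary_clusters(targets, raw_clusters)
print(aligned)
```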

Finally, I add the clusters to the dataset and save it as output.

In [None]:
mushrooms['clusters'] = clusters
mushrooms.to_csv('./results.csv')

# Everything Looks good!

Congrats! We did it! 🎉

We successfully used the K-modes algorithm to cluster categorical data, and the results were totally acceptable.

<div class="alert alert-danger" role="alert" style="text-align:center;">
    I hope you enjoyed this tutorial. If you did, please consider subscribing to <b><a href="https://www.youtube.com/channel/UC34Gj0-vHuBiTNEYlP7wczg">my YouTube Channel ▶</a></b>
</div>

<center><h2><span style="font-family:cursive;"> Also, please Upvote! 😜 </span></h2></center>