## From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually

### Author : 
> ***Mouad Riali***

#### First, we import the basic libraries:

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Then we read the data from the CSV file:

In [5]:
iris = pd.read_csv("Iris.csv")
iris.head(5)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


Here we display some basic information about the dataset to understand it better:

In [6]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [7]:
set(iris.Species.values)

{'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

To better understand the data, we can plot it as a scatter matrix, which shows every column plotted against every other column, including itself:

In [9]:
import plotly.express as px
fig = px.scatter_matrix(iris.drop("Id",axis=1),
width=1000, height=1000)
fig.show()

Here is another way to present the data in a 2D plot, with the sepal measurements on the axes and the petal measurements encoded as color and marker size:

In [11]:
fig1 = px.scatter(iris, x="SepalLengthCm", y="SepalWidthCm", color="PetalLengthCm",
                 size="PetalWidthCm")
fig1.update_layout(title="Iris Representation")
fig1.show()

Below, we plot the data as a function of all four measurement columns using ***scatter_3d***, with three columns on the axes and the fourth as color:

In [12]:
"""
Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')
"""

fig2 = px.scatter_3d(iris, z="SepalLengthCm", y="SepalWidthCm",x="PetalLengthCm",
                     color="PetalWidthCm")
fig2.update_layout(title="Iris 3D Representation")
fig2.show()

## Analysing Data :

Because our goal is to find the optimal number of clusters, the code below chooses the cluster count based only on the inertia and the “elbow rule”.

In [13]:
import plotly.graph_objects as go

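# Finding the optimum number of clusters for k-means clustering
from sklearn.cluster import KMeans

# Columns 1-4 are the four measurements (column 0 is Id, column 5 is Species)
x = iris.iloc[:, [1, 2, 3, 4]].values

# Within-cluster sum of squares (inertia) for each candidate cluster count
wcss = []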

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
    
# Plot the inertia for each cluster count as a line graph
fig = go.Figure(data=go.Scatter(x=np.arange(1,11),y=wcss))
fig.update_layout(title="Inertia vs Cluster Number",xaxis=dict(range=[0,11],title="Cluster Number"),
                  yaxis={'title':'Inertia'},
                 annotations=[
        dict(
            x=3,
            y=wcss[2],
            xref="x",
            yref="y",
            text="Elbow!",
            showarrow=True,
            arrowhead=7,
            ax=20,
            ay=-40
        )
    ])
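
For reference, the inertia plotted by the cell above is the within-cluster sum of squares (WCSS): the sum of squared distances from each point to the centroid of its cluster. The sketch below computes it by hand with NumPy on a tiny made-up dataset. The function name `within_cluster_ss` and the toy points are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]        # rows assigned to cluster k
        centroid = members.mean(axis=0)      # mean position of that cluster
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data (hypothetical): two tight groups that are far apart
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
one_cluster = within_cluster_ss(toy_points, [0, 0, 0, 0])
two_clusters = within_cluster_ss(toy_points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

On this toy data the inertia falls from 104 to 4 once each group gets its own centroid. A sharp drop of this kind, followed by much smaller gains, is what the elbow rule looks for.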

> So we can split our data into 3 clusters, or maybe 4; for simplicity, I will choose 3 clusters.
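
One possible way to make the choice between 3 and 4 less of an eyeball judgment is to look for the cluster count where the inertia curve bends most sharply, measured by the largest second difference. This is only a rough heuristic, not the method used in this notebook. The helper `elbow_by_second_difference` and the sample inertia values are illustrative assumptions, not values computed from the Iris data.

```python
import numpy as np

def elbow_by_second_difference(inertias, first_k=1):
    """Return the cluster count where the inertia curve bends most sharply.

    `inertias[i]` is assumed to be the inertia for k = first_k + i.
    """
    inertias = np.asarray(inertias, dtype=float)
    # Discrete curvature at each interior point: w[k-1] - 2*w[k] + w[k+1]
    curvature = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # Index 0 of `curvature` corresponds to k = first_k + 1
    return int(np.argmax(curvature)) + first_k + 1

# Hypothetical inertia values for k = 1..6
sample_inertias = [600.0, 350.0, 100.0, 80.0, 70.0, 65.0]
best_k = elbow_by_second_difference(sample_inertias)
print(best_k)
```

The heuristic is sensitive to the shape of the curve. When the drop from 1 to 2 clusters is very large it can favor 2, so it is best read alongside the plot rather than instead of it.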

## Fit the model :

In [14]:
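kmeans = KMeans(
    n_clusters=3, init="k-means++",
    n_init=10,
    tol=1e-04, random_state=42
)
# Note: only "Species" is dropped here, so "Id" is still one of the features.
# Because the rows are ordered by species, "Id" strongly shapes the clusters;
# drop it as well (as in the elbow analysis above) to cluster on the
# four measurements only.
kmeans.fit(iris.drop(["Species"], axis=1))
# Keep every column except "Id" and attach the cluster label to each row
clusters = iris.drop("Id", axis=1)
clusters["label"] = kmeans.labels_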

In [15]:
kmeans.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
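
Cluster ids are arbitrary integers, so a cross-tabulation against the known species is a quick way to read an array like the one above. The sketch below uses a tiny hand-made frame (hypothetical data). In this notebook the same call could be applied to the `Species` and `label` columns of `clusters`.

```python
import pandas as pd

# Hypothetical species names and cluster labels
toy = pd.DataFrame({
    "Species": ["a", "a", "b", "b", "c", "c"],
    "label":   [1, 1, 2, 0, 0, 0],
})

# Rows are species, columns are cluster ids, cells are row counts
table = pd.crosstab(toy["Species"], toy["label"])
print(table)
```

A species whose rows all fall in one column was captured by a single cluster; a species spread over several columns was split between clusters.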

In [16]:
clusters

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,label
0,5.1,3.5,1.4,0.2,Iris-setosa,1
1,4.9,3.0,1.4,0.2,Iris-setosa,1
2,4.7,3.2,1.3,0.2,Iris-setosa,1
3,4.6,3.1,1.5,0.2,Iris-setosa,1
4,5.0,3.6,1.4,0.2,Iris-setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,0
146,6.3,2.5,5.0,1.9,Iris-virginica,0
147,6.5,3.0,5.2,2.0,Iris-virginica,0
148,6.2,3.4,5.4,2.3,Iris-virginica,0


As shown above, we're mostly done. To understand the clustering more clearly, especially if you're presenting this process to non-technical staff at your company, we will plot the data with ***line_polar***. On a circle we can draw as many axes as we like, so we can present as many variables as we want:

In [18]:
polar=clusters.drop("label",axis=1).groupby("Species").mean().reset_index()
polar=pd.melt(polar,id_vars=["Species"])
fig4 = px.line_polar(polar, r="value", theta="variable", color="Species", line_close=True,height=800,width=900)
fig4.show()
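
The `pd.melt` step in the cell above is needed because `line_polar` reads long-format data: one row per (group, variable) pair, with the variable name in one column and its value in another. The sketch below shows that reshaping on a small made-up table of group means. The numbers are illustrative, not the Iris means.

```python
import pandas as pd

# Hypothetical per-group means in wide format: one column per variable
means = pd.DataFrame({
    "Species": ["a", "b"],
    "SepalLengthCm": [5.0, 6.0],
    "PetalLengthCm": [1.5, 4.5],
})

# Long format: columns "Species", "variable", "value"
long_form = pd.melt(means, id_vars=["Species"])
print(long_form)
```

Each original measurement column becomes a set of rows, which is why the plotting call uses `theta="variable"` and `r="value"`.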

Based on the line_polar chart above, we can easily see which variables matter most for our clustering process. In our case, in decreasing order of importance: **PetalLengthCm** > **PetalWidthCm** > **SepalLengthCm** and finally **SepalWidthCm**.
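
One rough way to put a number on this reading of the chart is to compare how far apart the per-group means are for each variable. The sketch below does this on a small table of made-up means; the values and the name `group_means` are assumptions for illustration, and the raw range ignores differences in scale and within-group variance.

```python
import pandas as pd

# Hypothetical per-group means for two variables
group_means = pd.DataFrame(
    {"SepalWidthCm": [3.4, 2.8, 3.0], "PetalLengthCm": [1.5, 4.3, 5.6]},
    index=["a", "b", "c"],
)

# Range of the group means per variable, largest first
spread = (group_means.max() - group_means.min()).sort_values(ascending=False)
print(spread)
```

Variables at the top of `spread` separate the groups most strongly on the polar chart; variables at the bottom barely distinguish them.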

In [19]:
clusters.head(5)

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,label
0,5.1,3.5,1.4,0.2,Iris-setosa,1
1,4.9,3.0,1.4,0.2,Iris-setosa,1
2,4.7,3.2,1.3,0.2,Iris-setosa,1
3,4.6,3.1,1.5,0.2,Iris-setosa,1
4,5.0,3.6,1.4,0.2,Iris-setosa,1


In [20]:
clusters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
 5   label          150 non-null    int32  
dtypes: float64(4), int32(1), object(1)
memory usage: 6.6+ KB


In [21]:
#clusters["id"]=iris.index.values
fig = px.scatter_matrix(clusters,
width=1000, height=900)
fig.show()

We represent the clusters as a function of the three most important columns discussed above. The clusters are well separated here, and the data is much clearer and easier to understand.

In [22]:
fig2 = px.scatter_3d(clusters, z="SepalLengthCm", y="PetalLengthCm",x="PetalWidthCm",
                     color="label")
fig2.update_layout(title="Iris 3D Representation")
fig2.show()

To support the last point, we plot the same data using different columns. As a result, we can see that the colors are less well separated than in the previous plot.

In [23]:
fig2 = px.scatter_3d(clusters, z="PetalWidthCm", y="SepalLengthCm",x="SepalWidthCm",
                     color="label")
fig2.update_layout(title="Iris 3D Representation")
fig2.show()

# Thank you