<a href="https://www.kaggle.com/code/fareselmenshawii/kmeans-from-scratch?scriptVersionId=117154439" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div class="table-of-contents" style="background-color:#AF4BCE; color:black; padding: 20px; margin: 10px; font-size: 110%; border-radius: 25px; box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);">
  <h1>TOC</h1>
  <ol>
    <li><a href="#1" style="color: black;">Overview</a></li>
    <li><a href="#2" style="color: black;">Imports</a></li>
    <li><a href="#3" style="color: black;">Load Data</a></li>
    <li><a href="#4" style="color: black;">EDA</a></li>
    <li><a href="#5" style="color: black;">Model Implementation</a></li>
    <li><a href="#6" style="color: black;">Evaluation</a></li>
    <li><a href="#7" style="color: black;">Thank You</a></li>
  </ol>
</div>


<a id="1"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>Overview</center></h1>
    
# Overview
  
**Previously, we implemented our first classification algorithm: [logistic regression](https://www.kaggle.com/code/fareselmenshawii/logistic-regression-from-scratch)**

**Now we'll implement our first clustering algorithm: KMeans**

**The K-means algorithm is a method that automatically groups similar data points into clusters. We'll cover how it works in the implementation section**

**We'll be using KMeans to cluster the famous Iris dataset**

**Let's get started!**

<a id="2"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>Imports</center></h1>

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler

<a id="3"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>Loading The Data</center></h1>

# Loading The Data

In [2]:
iris = pd.read_csv("../input/iris/Iris.csv") #Load Data
iris.drop('Id',inplace=True,axis=1) #Drop Id column

In [3]:
X = iris.iloc[:,:-1] #Set our training data

y = iris.iloc[:,-1] #We'll use this just for visualization as clustering doesn't require labels

In [4]:
iris.head().style.background_gradient(cmap =sns.cubehelix_palette(as_cmap=True))

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


<a id="4"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>EDA</center></h1>

# EDA

## Data Distribution

In [5]:
fig = px.pie(iris, 'Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],title='Data Distribution',template='plotly')

fig.show()

## From this plot we conclude that:

**The data is perfectly balanced: each of the three species makes up one third of the samples**

****


## Sepal-Length

In [6]:
fig = px.box(data_frame=iris, x='Species',y='SepalLengthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [7]:
fig = px.histogram(data_frame=iris, x='SepalLengthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=50)
fig.show()

### From these plots we conclude that: 

* **Setosa has a much smaller SepalLength than the other two classes**

* **Virginica has the largest SepalLength, but it seems hard to distinguish between Virginica and Versicolor using SepalLength because the difference between them is less clear**

* **Virginica contains an outlier**

****

## Sepal-Width

In [8]:
fig = px.box(data_frame=iris, x='Species',y='SepalWidthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [9]:
fig = px.histogram(data_frame=iris, x='SepalWidthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=30)
fig.show()

### From these plots we conclude that: 

* **Setosa has a larger SepalWidth than the other two classes**

* **Versicolor has a smaller SepalWidth than the other two classes**

* **Overall, all classes have relatively close SepalWidth values, which indicates that it might not be a very useful feature**

****

## Petal-Length

In [10]:
fig = px.box(data_frame=iris, x='Species',y='PetalLengthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [11]:
fig = px.histogram(data_frame=iris, x='PetalLengthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=30)
fig.show()

### From these plots we conclude that: 

* **Setosa has a much smaller PetalLength than the other two classes**

* **The difference between Virginica and Versicolor is less clear**

* **Overall, PetalLength seems like an interesting feature**

****

## Petal-Width

In [12]:
fig = px.box(data_frame=iris, x='Species',y='PetalWidthCm',color='Species',color_discrete_sequence=['#29066B','#7D3AC1','#EB548C'],orientation='v')
fig.show()

In [13]:
fig = px.histogram(data_frame=iris, x='PetalWidthCm',color='Species',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],nbins=30)
fig.show()

### From these plots we conclude that: 

* **Setosa has a much smaller PetalWidth than the other two classes**

* **The difference between Virginica and Versicolor is less clear**

* **Overall, PetalWidth seems like an interesting feature**

****

In [14]:
fig = px.scatter(data_frame=iris, x='SepalLengthCm',y='SepalWidthCm'
           ,color='Species',size='PetalLengthCm',template='seaborn',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],)

fig.update_layout(width=800, height=600,
                  xaxis=dict(color="#BF40BF"),
                 yaxis=dict(color="#BF40BF"))
fig.show()

In [15]:
fig = px.scatter(data_frame=iris, x='PetalLengthCm',y='PetalWidthCm'
           ,color='Species',size='SepalLengthCm',template='seaborn',color_discrete_sequence=['#491D8B','#7D3AC1','#EB548C'],)

fig.update_layout(width=800, height=600,
                  xaxis=dict(color="#BF40BF"),
                 yaxis=dict(color="#BF40BF"))
fig.show()

<a id="5"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>Implement KMeans</center></h1>

# Implement KMeans

**We are given a training set that we want to group into clusters.**


**K-means is an iterative procedure that:**

**Starts with a guess by randomly initializing the centroids**

**Then refines this guess by repeatedly assigning training examples to their closest centroids and recomputing each centroid as the mean of the points assigned to it.**

**As always, we'll follow the rule of avoiding for loops and using [vectorized code](https://www.kaggle.com/code/fareselmenshawii/vectorization)**
 

## Initialize Centroids

**Let's start by randomly initializing centroids**

**We'll initialize centroids to random examples in the training set**

In [16]:
def initialize_centroids(X,K):
    
    #Shuffle the indices of the training examples
    shuffled_indices = np.random.permutation(X.shape[0])
    
    #Take the first K shuffled indices, one per cluster
    centroid_indx = shuffled_indices[:K]

    centroids = X[centroid_indx]
    
    return centroids
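
**Because `np.random.permutation` draws from NumPy's global random state, each run can pick different starting centroids and so end up with different clusters. As a minimal sketch, assuming we want reproducible runs, the variant below accepts a seed. The name `initialize_centroids_seeded` and the `seed` parameter are illustrative and not part of the notebook above.**

```python
import numpy as np

def initialize_centroids_seeded(X, K, seed=None):
    """Pick K rows of X as starting centroids (hypothetical seeded variant)."""
    # A local Generator keeps the randomness independent of NumPy's global state
    rng = np.random.default_rng(seed)
    # Sample K row indices without replacement so no example is picked twice
    centroid_idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[centroid_idx]

# Toy data: 6 points with 2 features
X_demo = np.arange(12, dtype=float).reshape(6, 2)
first = initialize_centroids_seeded(X_demo, K=3, seed=0)
second = initialize_centroids_seeded(X_demo, K=3, seed=0)
print(first.shape)                    # (3, 2)
print(np.array_equal(first, second))  # True: same seed, same centroids
```

**Passing `seed=None` falls back to fresh randomness on every call, which matches the behaviour of the original function.**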
    

## Assign Points To Centroids

**Our job is to find the closest centroid for each point**

**Let $c^{(i)}$ be the index of the centroid that is closest to $x^{(i)}$, and let $\mu_j$ be the position of centroid $j$**

**We choose $c^{(i)}$ as the index $j$ that minimizes our cost, the squared distance: $$c^{(i)} := \arg\min_j ||x^{(i)} - \mu_j||^2$$**



In [17]:
def assign_points_centroids(X,centroids):
    #Add an axis so X has shape (m, 1, n) and broadcasts against centroids of shape (K, n)
    X = np.expand_dims(X,axis=1)
    #Euclidean distance from every point to every centroid, shape (m, K)
    distance = np.linalg.norm((X - centroids),axis=-1)
    #Assign each point to the index of its closest centroid
    points = np.argmin(distance, axis=1)
    return points
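
**To make the broadcasting in `assign_points_centroids` concrete, here is a small shape walk-through on made-up data (the arrays `X_demo` and `centroids_demo` are illustrative). Expanding `X` from `(m, n)` to `(m, 1, n)` lets NumPy subtract all `K` centroids from every point at once, giving an `(m, K, n)` array of differences and, after the norm, an `(m, K)` distance matrix.**

```python
import numpy as np

# Four 2-D points forming two obvious groups, and one centroid per group
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
centroids_demo = np.array([[0.0, 0.0], [10.0, 10.0]])

# (4, 2) -> (4, 1, 2); subtracting the (2, 2) centroids broadcasts to (4, 2, 2)
diff = np.expand_dims(X_demo, axis=1) - centroids_demo
print(diff.shape)       # (4, 2, 2)

# Euclidean norm over the feature axis leaves one distance per (point, centroid) pair
distance = np.linalg.norm(diff, axis=-1)
print(distance.shape)   # (4, 2)

# Index of the nearest centroid for each point
assignments = np.argmin(distance, axis=1)
print(assignments)      # [0 0 1 1]
```

**The code minimizes the norm rather than the squared norm from the formula. Both give the same assignment, because squaring does not change the ordering of non-negative distances.**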

## Compute Mean

**Here we'll compute the mean of the points assigned to each cluster and move that cluster's centroid to it**

<div class="alert alert-block alert-info">
<b>Note:</b> I wasn't able to find a way around a for loop here. If you do, please let me know.
</div>

In [18]:
def compute_mean(X,points,K):
    #Initialize an empty array to store our centroids
    centroids = np.zeros((K, X.shape[1]))
    
    #Iterate over each cluster and set its centroid to the mean of its points
    for i in range(K):
        centroid_mean = X[points ==i].mean(axis=0)
        centroids[i] = centroid_mean
    return centroids
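
**One possible way around the loop, offered as a sketch rather than as the notebook's method: build a one-hot membership matrix and let a matrix product do the per-cluster sums. The function name `compute_mean_vectorized` is illustrative. Like the loop version, it yields `NaN` rows for a cluster that has no points.**

```python
import numpy as np

def compute_mean_vectorized(X, points, K):
    """Loop-free centroid update (illustrative alternative to compute_mean)."""
    # one_hot[i, k] is 1.0 when point i belongs to cluster k, else 0.0; shape (m, K)
    one_hot = (points[:, None] == np.arange(K)).astype(float)
    # (K, m) @ (m, n) sums the feature vectors of each cluster; shape (K, n)
    sums = one_hot.T @ X
    # Number of points per cluster; shape (K,)
    counts = one_hot.sum(axis=0)
    # Divide each row of sums by its cluster size to get the means
    return sums / counts[:, None]

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
points_demo = np.array([0, 0, 1, 1])
# Cluster 0 averages to (1, 0) and cluster 1 averages to (11, 12)
print(compute_mean_vectorized(X_demo, points_demo, K=2))
```

**For a small `K` such as 3, the loop is unlikely to be a bottleneck, so this is mostly a matter of style. The one-hot matrix also costs `m * K` extra memory.**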

## Run KMeans

**Now let's create a function to run KMeans**

In [19]:
def KMeans(X, K, iterations=10):
    
    #Initialize centroids
    centroids= initialize_centroids(X, K) 
    
    #Iterate for specified iterations
    for i in range(iterations):
        
        points = assign_points_centroids(X, centroids) #Get the cluster index assigned to each point

        centroids = compute_mean(X, points, K) #Update each centroid to the mean of its points

    return centroids,points
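
**K-means often stops changing long before a fixed iteration budget is used up: once no point switches cluster, further iterations repeat the same assignments and centroids. Below is a minimal sketch of early stopping, assuming the starting centroids are passed in. The names `kmeans_early_stop` and `max_iterations` are illustrative and not from the notebook, and unlike `compute_mean` this sketch keeps the old centroid when a cluster ends up empty.**

```python
import numpy as np

def kmeans_early_stop(X, centroids, max_iterations=100):
    """Run K-means from the given centroids until assignments stop changing."""
    centroids = centroids.astype(float)  # astype returns a copy; the caller's array is untouched
    K = centroids.shape[0]
    points = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: nearest centroid for every point
        distance = np.linalg.norm(X[:, None, :] - centroids, axis=-1)
        new_points = np.argmin(distance, axis=1)
        # Converged: the assignments are identical to the previous iteration
        if points is not None and np.array_equal(new_points, points):
            return centroids, points, iteration
        points = new_points
        # Update step: move each centroid to the mean of its points
        for k in range(K):
            members = X[points == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, points, max_iterations

X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
start = X_demo[[0, 1]]  # deliberately poor start: both centroids in the same group
centroids, points, n_iter = kmeans_early_stop(X_demo, start)
print(points)   # [0 0 1 1]
print(n_iter)   # 3
```

**Here `n_iter` counts assignment steps, including the final one that confirmed nothing changed.**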

**Now let's run KMeans for 1000 iterations**

In [20]:
X = X.values
K = 3
centroids, points = KMeans(X, K, 1000)

<a id="6"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>Evaluation</center></h1>

# Evaluation

**Now let's visualize our results**

In [21]:
fig = go.Figure()
#Cluster indices are arbitrary, so label traces by cluster rather than by species
fig.add_trace(go.Scatter(
    x=X[points == 0, 0], y=X[points == 0, 1],
    mode='markers',marker_color='#DB4CB2',name='Cluster 0'
))

fig.add_trace(go.Scatter(
    x=X[points == 1, 0], y=X[points == 1, 1],
    mode='markers',marker_color='#c9e9f6',name='Cluster 1'
))

fig.add_trace(go.Scatter(
    x=X[points == 2, 0], y=X[points == 2, 1],
    mode='markers',marker_color='#7D3AC1',name='Cluster 2'
))

fig.add_trace(go.Scatter(
    x=centroids[:, 0], y=centroids[:,1],
    mode='markers',marker_color='#CAC9CD',marker_symbol=4,marker_size=13,name='Centroids'
))
fig.update_layout(template='plotly_dark',width=1000, height=500,)
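
**The scatter plot shows the clusters, but since the species labels `y` were set aside earlier, a contingency table of species against cluster index gives a more direct check. Below is a small sketch on made-up labels (the helper name `contingency_table` and the toy arrays are illustrative). Cluster indices are arbitrary, so a good clustering shows up as one dominant column per row rather than as a diagonal.**

```python
import numpy as np

def contingency_table(labels, clusters, K):
    """Count how many samples of each true label fall in each cluster."""
    # names holds the sorted unique labels; label_idx maps each sample to its row
    names, label_idx = np.unique(labels, return_inverse=True)
    table = np.zeros((len(names), K), dtype=int)
    # Add 1 at (label row, cluster column) for every sample, counting repeats
    np.add.at(table, (label_idx, clusters), 1)
    return names, table

labels_demo = np.array(["a", "a", "a", "b", "b", "b"])
clusters_demo = np.array([1, 1, 1, 0, 0, 1])  # cluster ids need not match label order
names, table = contingency_table(labels_demo, clusters_demo, K=2)
print(names)            # ['a' 'b']
print(table.tolist())   # [[0, 3], [2, 1]]
```

**On the notebook's data, `pd.crosstab(y, points)` should produce the same kind of table with the species names as row labels.**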

<a id="7"></a>
<h1 style='background:#AF4BCE;border:0; color:black;
    box-shadow: 10px 10px 5px 0px rgba(0,0,0,0.75);
    transform: rotateX(10deg);
    '><center>Thank You</center></h1>

# Thank You


**Thank you for going through this notebook**

**If you have any feedback, please let me know**

**For the scikit-learn implementation of KMeans, please refer to this [notebook](https://www.kaggle.com/code/fareselmenshawii/kmeans-iris-clustering)**