<h1 style = "text-align:center"> K-means Clustering Algorithm</h1>


## About the data
The data describes <strong>Mall Customers</strong>, and we are trying to divide the customers into k groups based on feature similarity.
We might subdivide the customers into 5 distinct groups:
1. Medium Annual Income, Medium Spending Score
2. High Annual Income, Low Spending Score
3. Low Annual Income, Low Spending Score
4. Low Annual Income, High Spending Score
5. High Annual Income, High Spending Score

## Attribute Details

<table style="width:100%">
  <tr>
    <th style="text-align:center" >Name</th>
     <th style="text-align:center">Type</th> 
    <th style="text-align:center">Description</th>
   </tr>
  <tr>
    <td style="text-align:center">CustomerID</td>
     <td style="text-align:center">Integer</td> 
    <td style="text-align:center" >A user identification number for a customer</td>
   </tr>
  <tr>
    <td style="text-align:center">Genre</td>
     <td style="text-align:center">String</td> 
    <td style="text-align:center" >The customer's gender (male or female)</td>
   </tr>
  <tr>
    <td style="text-align:center">Age</td>
     <td style="text-align:center">Integer</td> 
    <td style="text-align:center">The age of the customer</td>
   </tr>
   <tr>
    <td style="text-align:center">Annual Income</td>
     <td style="text-align:center">Integer</td> 
    <td style="text-align:center">An income calculated over a period of one year in dollars</td>
   </tr>
   <tr>
    <td style="text-align:center">Spending Score (1-100)</td>
     <td style="text-align:center">Integer</td> 
    <td style="text-align:center">A score assigned by the mall based on customer behavior and spending nature. The higher the Spending Score (out of 100), the more the customer spends at the mall.</td>
   </tr> 
</table>

 




In [None]:
#import the important libraries 

import numpy as np 

import math

import pandas as pd 

import matplotlib.pyplot as plt 

In [None]:
#import the data set
df = pd.read_csv("Mall_Customers.csv")

In [None]:
#Explore the first five rows of the data 
df.head()

In [None]:
#Explore the last five rows of the data 
df.tail()

In [None]:
#Explore the data types in the data set
df.dtypes

In [None]:
#Explore the shape of the data 
df.shape

In [None]:
#define the number of training examples 
m = df.shape[0]

print("The number of training examples:",m)

# Preprocessing the dataset 

In [None]:
#Convert the data set to numpy-array so we can feed it to our model
X = df.values

In [None]:
#Explore the converted data set in numpy-form
X

In [None]:
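#Visualize the data and highlight one candidate group (high income, high spending)
income = X[:,-2].astype(float)
score = X[:,-1].astype(float)
#boolean mask selecting the points to highlight, instead of plotting the booleans themselves
mask = (income >= 40) & (score >= 80)
plt.scatter(income, score, color = 'r')
plt.scatter(income[mask], score[mask], color = 'b')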

plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("The relation between Spending Score and Annual Income")
plt.show()

<br>

<p style = "font-size:18px">We can see that we can divide the data into 5 groups based on its feature similarity.</p>

#  Define Some Helper Functions


<br>
<dl>
   <dt>featureNormalize()</dt>
  <dd>For feature scaling</dd>
    <dt>findClosestCentroids()</dt>
  <dd>For computing the centroid memberships for every example</dd>
  <dt>computeCentroids()</dt>
  <dd>For computing centroid means</dd>
   <dt>kMeansInitCentroids()</dt>
    <dd>For initializing centroids</dd>
  <dt>oneHot()</dt>
  <dd>For one-hot encoding the categorical data</dd>
  <dt>kMeans()</dt>
  <dd>For building the k-means clustering algorithm</dd>
    <dt>lowestCost()</dt>
    <dd>For computing the lowest cost</dd>
</dl>

## 1.featureNormalize()
Standardization (or Z-score normalization) rescales each feature so that it has zero mean and unit standard deviation:


\begin{align}
\large \mu=0 \hspace{2cm}  \sigma=1
\end{align}

<strong>where,</strong>

$\normalsize \mu:$ is the mean (average) of the feature

$\normalsize\sigma:$ is the standard deviation of the feature from its mean

The standard scores (also called z-scores) of the samples are calculated as follows:

\begin{align}
\large z=\frac{x - \mu} {\sigma}
\end{align}

In [None]:
def featureNormalize(X):
    '''
    Usage:
      #featureNormalize--> used for normalizing features using Z-score normalization
  
    Arguments:
      #X --> The Design Matrix
    
    Returns:
      #X_norm --> The Normalized Matrix
      #mu --> the mean of each feature, with size (1,#features)
      #sigma --> the standard deviation of each feature, with size (1,#features)
      
    Notes:
      #X is a matrix where each column is a feature and each row is an example
      #So, you need to perform the normalization separately for each feature
    '''
    
    #Work on a float copy so that integer inputs are not truncated when normalized in place
    X_norm = np.array(X, dtype = float)
    mu = np.zeros((1, X.shape[1]))
    sigma = np.zeros((1,X.shape[1]))

    #Compute the mean of each feature and store it in mu
    #Next, compute the standard deviation of each feature and store it in sigma
    for i in range(X.shape[1]):
        mu[0, i] = np.mean(X_norm[:, i])
        sigma[0, i] = np.std(X_norm[:, i])
        
    #Finally, subtract the mean from each feature and divide by its standard deviation, storing the result in X_norm
    for i in range(X.shape[1]):
        X_norm[:, i] = np.divide(np.subtract(X_norm[:, i], mu[0, i]), sigma[0, i])
        
    return X_norm, mu, sigma
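
The same z-score computation can also be written without explicit loops. The sketch below is one possible vectorized variant, not part of the original notebook: the name `feature_normalize_vectorized` and the handling of constant features (a standard deviation of zero is replaced by 1 to avoid division by zero) are assumptions made for illustration.

```python
import numpy as np

def feature_normalize_vectorized(X):
    """Z-score every column of X (hypothetical vectorized variant of featureNormalize)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0, keepdims=True)      # shape (1, n_features)
    sigma = X.std(axis=0, keepdims=True)    # population standard deviation, as np.std computes by default
    # Assumption: a constant feature (sigma == 0) is only centered, not scaled
    safe_sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / safe_sigma, mu, sigma

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_demo_norm, mu_demo, sigma_demo = feature_normalize_vectorized(X_demo)
print(X_demo_norm.mean(axis=0))  # approximately 0 for every feature
print(X_demo_norm.std(axis=0))   # approximately 1 for every feature
```

Broadcasting subtracts the `(1, n_features)` mean row from every example, which is what the per-column loop does one feature at a time.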

## 2.findClosestCentroids()

In the <strong>"cluster assignment"</strong> phase of the K-means algorithm, the algorithm
assigns every training example $x^{(i)}$ to its closest centroid, given the current
positions of centroids.

Given the training examples (data points) and the current centroids, we find, for every training example $x^{(i)}$, the value $c^{(i)}$, which is the index of the centroid that is closest to $x^{(i)}$.

In [None]:
def findClosestCentroids(X,centroids): 
    '''
    Usage:
      #findClosestCentroids--> computes the centroid memberships for every example
  
    Arguments:
      #X --> The Design(data) Matrix
      #centroids --> the locations of all centroids
    
    Returns:
      #idx --> an array of size (#training examples,1) that holds the index (a value in {0,....,K-1} 
               where K is total number of centroids) of the closest centroid 
               to every training example.
    '''
    
    #Define K which is the total number of centroids
    K = centroids.shape[0]
    
    #Pre-allocating idx which has size of (#training examples,1)
    idx = np.zeros((X.shape[0],1), dtype = np.int8)
    
    #iterating over every training example 
    for i in range(X.shape[0]):
        
        #Pre-allocating distance_array with size (K,)
        #distance_array: an array holding the squared distance between the training example, X[i], 
        #and each of the centroids in the centroids array
        distance_array = np.zeros((K,))
        
        #iterating over all the centroids in the centroids array 
        for j in range(K):
            
            #Compute the distance between the training example X[i] and the centroid with index j
            #then, store the result in the distance_array
            distance_array[j] = np.power(np.linalg.norm(X[i] - centroids[j]),2)
            
            
        #find the index of the minimum distance in the distance_array 
        #corresponding to the training example, X[i]
        #then, store the result in idx 
        idx[i] = np.argmin(distance_array, axis = 0)
        
    return idx
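
For larger data sets the double loop can become slow. As a rough sketch (the function name `find_closest_centroids_vectorized` is hypothetical and not used elsewhere in this notebook), the same assignment step can be expressed with NumPy broadcasting, assuming `X` and `centroids` are purely numeric:

```python
import numpy as np

def find_closest_centroids_vectorized(X, centroids):
    """Return, for every row of X, the index of the nearest centroid as an (m, 1) array."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # diffs[i, j, :] = X[i] - centroids[j], shape (m, K, n_features)
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # squared Euclidean distance from every example to every centroid, shape (m, K)
    sq_dist = np.sum(diffs ** 2, axis=2)
    # index of the closest centroid per example, reshaped to match findClosestCentroids
    return np.argmin(sq_dist, axis=1).reshape(-1, 1)

X_demo = np.array([[0.0, 0.0], [0.5, 0.2], [9.0, 9.5], [10.0, 10.0]])
centroids_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
idx_demo = find_closest_centroids_vectorized(X_demo, centroids_demo)
print(idx_demo.ravel())  # [0 0 1 1]
```

The intermediate array has size m × K × n, so this trades memory for speed.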

## 3.computeCentroids()

Given assignments of every point to a centroid, the second phase of the
algorithm recomputes, for each centroid, the mean of the points that were
assigned to it. Specifically, for every centroid k we set

\begin{equation}
\large \mu_k := \frac{1}{|C_{k}|} \sum_{i \in C_{k}} {x^{(i)}}
\end{equation}

<strong>where,</strong>
<br>
$\normalsize\mu_{k}:$ is the mean of the data points assigned to centroid k

$\normalsize C_{k}:$ is the set of examples that are assigned to centroid k

In [None]:
def computeCentroids(X,idx,K):
    '''
    Usage:
      #computeCentroids--> computes the new centroids by computing 
                           the means of the data points assigned to 
                           each centroid.
    Arguments:
      #X --> The Design(data) Matrix
      #idx --> an array of size (#training examples,1) that holds the index (a value in {0,....,K-1} 
               where K is total number of centroids) of the closest centroid 
               to every training example.
      #K --> the total number of centroids
    
    Returns:
      #centroids --> the new centroids by computing the means of
                     the data points assigned to each centroid.
                     
    Notes: 1.centroids is an array with size (K,#features of X)
           2.idx[i] is a one-element array, so comparing it with k gives a single
             boolean that can be used directly in an if statement
           
           
    '''
    #pre-allocating centroids with size (k,#features of X)
    centroids = np.zeros((K,X.shape[1]))
    
    #iterating over every centroid and compute mean of all points that belong to it
    for k in range(K):
        
        #define an empty list which will hold the data points corresponding to centroid k 
        dataPoints = []
        
        #iterating over all the training examples
        for i in range(X.shape[0]):
            
            #if the index in idx is equal to index k
            if (idx[i] == k):
                #append the example with index i, which belongs to centroid k
                #Note: a Python list can hold numpy arrays, so we can append each row to the empty list
                dataPoints.append(X[i])
            
        #after getting the data points that belong to centroid k 
        #first, we compute the mean of them 
        #then, store it in centroids
        #Note: numpy functions accept lists and return a numpy array
        centroids[k,:] = np.mean(dataPoints, axis = 0)

    return centroids
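
If no point is assigned to a centroid, the mean of an empty list is `nan`. The sketch below shows one way to guard against that using boolean masks. The function name `compute_centroids_masked` and the optional `old_centroids` argument are assumptions introduced for illustration; keeping the previous position for an empty cluster is only one of several common choices.

```python
import numpy as np

def compute_centroids_masked(X, idx, K, old_centroids=None):
    """Mean of the points assigned to each centroid, with a guard for empty clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(idx).ravel()          # flatten (m, 1) to (m,)
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[labels == k]              # all rows assigned to centroid k
        if len(members) > 0:
            centroids[k] = members.mean(axis=0)
        elif old_centroids is not None:
            # Assumption: an empty cluster keeps its previous position
            centroids[k] = old_centroids[k]
        # otherwise the row stays at zero instead of becoming nan
    return centroids

X_demo = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
idx_demo = np.array([[0], [0], [1]])
old_demo = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
print(compute_centroids_masked(X_demo, idx_demo, 3, old_demo))
```

In this example centroid 2 has no members, so it stays at its previous position `[5, 5]`.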

## 4.kMeansInitCentroids()
A good strategy for initializing the centroids is to select random examples from
the training set.

In [None]:
def kMeansInitCentroids(X,K):
    '''
    Usage:
      #kMeansInitCentroids --> used for the initial assignments of centroids
  
    Arguments:
      #X --> The Design(data) Matrix
      #K --> the total number of centroids
    
    Returns:
      #centroids -->  K initial centroids to be used with the K-Means on the dataset X
    '''
    
    #pre-allocating centroids with size (k,#features of X)
    centroids = np.zeros((K,X.shape[1]))
    
    #initialize the centroids to be random examples
    #shuffle the indices of the examples 
    randidx = np.random.permutation(X.shape[0])
    
    #take the first K-examples as centroids
    centroids = X[randidx[:K],:]
    
    return centroids
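
Because the initialization is random, two runs generally give different results. When reproducibility matters, a seeded generator can be used. This is a small sketch, assuming NumPy's `default_rng` API; the function name `init_centroids_seeded` and the `seed` parameter are hypothetical additions, not part of the notebook above.

```python
import numpy as np

def init_centroids_seeded(X, K, seed=None):
    """Pick K distinct rows of X as initial centroids, reproducibly when seed is given."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # sample K row indices without replacement so no example is chosen twice
    chosen = rng.choice(X.shape[0], size=K, replace=False)
    return X[chosen]

X_demo = np.arange(20, dtype=float).reshape(10, 2)
first = init_centroids_seeded(X_demo, 3, seed=42)
second = init_centroids_seeded(X_demo, 3, seed=42)
print(np.array_equal(first, second))  # True: the same seed gives the same centroids
```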

## 5.oneHot()

In digital circuits, one-hot refers to a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0). In this case, one-hot encoding means that if the feature is "Male", the output is encoded as a vector of 2 elements in which the first element is 1 and the second is 0. For "Female", the output is a vector of 2 elements in which the first element is 0 and the second is 1.

<strong>The image below illustrates this for the Gender feature</strong>

<img src = "https://i.imgur.com/BwguFW6.jpg" >

In [None]:
def oneHot(left,unencoded_input,right,width):
    
    '''
    Usage:
      #oneHot --> used for encoding the unencoded data
  
    Arguments:
      #left --> the features which will be concatenated with encoded data from the left side
      #unencoded_input --> the feature that will be encoded
      #right --> the features which will be concatenated with encoded data from the right side
      #width --> the width of the encoded data 
    
    Returns:
      #one_hot --> the one-hot-encoded data
    '''
    #reshape the data to be (#training examples,1) if it has size (#training examples,)
    #Note: if the data has the size that we reshape this line will do nothing 
    #because we reshape the data with the same size 
    unencoded_input = unencoded_input.reshape(unencoded_input.shape[0],1)
    
    
    #label-encoding the unencoded data
    #replacing Male with 1 and the remaining value, Female in our case, with 2
    label_encoding = np.where(unencoded_input == 'Male',1,2)
    
    #Pre-allocating a container to assign the encoded data to it 
    container = np.zeros((unencoded_input.shape[0],width))
    
    #Define an index, j, to assign the encoded data to the container 
    j = 0
    
    #loop over the label-encoded data
    for i in label_encoding:
        #if i == [1] --> Male
        if (i == [1]):
            container[j,:] = [1,0] #represents Male in our example

        #if i == [2] --> Female
        elif (i == [2]):
            container[j,:] = [0,1] #represents Female in our example

        #increment the index  for new assigning
        j += 1
        
    
    #left-concatenation 
    left_concatenation = np.concatenate((left,container), axis = 1)
    
    #right-concatenation --> (left + right)
    right_concatenation = np.concatenate((left_concatenation,right), axis = 1)
    
    #assign the concatenated parts to one_hot
    one_hot = right_concatenation
    
    return one_hot
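
`oneHot()` is tailored to a two-valued Male/Female column. A more general sketch, assuming only that the column holds hashable, sortable labels, can derive the categories with `np.unique`. The name `one_hot_column` is hypothetical, and note that the columns come out in sorted label order, so "Female" is column 0 here, unlike in `oneHot()`.

```python
import numpy as np

def one_hot_column(values):
    """One-hot encode a 1-D column of labels; returns (encoded matrix, category labels)."""
    values = np.asarray(values).ravel()
    # categories: sorted unique labels; codes: position of each value in categories
    categories, codes = np.unique(values, return_inverse=True)
    encoded = np.zeros((values.shape[0], categories.shape[0]))
    # set a single 1 per row, in the column of that row's category
    encoded[np.arange(values.shape[0]), codes] = 1.0
    return encoded, categories

genre_demo = np.array(["Male", "Female", "Female", "Male"])
encoded_demo, categories_demo = one_hot_column(genre_demo)
print(categories_demo)  # ['Female' 'Male']
print(encoded_demo)
```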

## 6.kMeans()

<p style = "font-size: 18px" >The Algorithm:
In the clustering problem, we are given a training set $\{x^{(1)}, ... , x^{(m)}\}$ and want to group the data into a few cohesive "clusters." Here, we are given feature vectors for each data point $x^{(i)} \in \mathbb{R}^n$ as usual, but no labels $y^{(i)}$ (making this an unsupervised learning problem). Our goal is to find $k$ centroids and a label $c^{(i)}$ for each data point. The k-means clustering algorithm is as follows: </p>


  


<img src = "https://i.imgur.com/BWipiBC.png" >


### Note:
The K-means
algorithm will always converge to some final set of means for the centroids.
Note that the converged solution may not always be ideal and depends on the
initial setting of the centroids. Therefore, in practice the K-means algorithm
is usually run a few times with different random initializations. One way to
choose between these different solutions from different random initializations
is to choose the one with the lowest cost function value (distortion).

In the cluster assignment step, each $\large c^{(i)}$ is set to the index $\large j$ that minimizes $\large ||x^{(i)} - \mu_{j}||^2$

<strong>where,</strong>
<br>
$\normalsize c^{(i)}:$ is the index of the centroid that is closest to $\normalsize x^{(i)}$

$\normalsize\mu_{j}:$ is the position (value) of the j-th centroid.

<strong>Note that</strong> $\normalsize c^{(i)}$ corresponds to $\normalsize idx(i)$

<strong>The distortion function is defined as :</strong>

<img src = "https://www.saedsayad.com/images/Clustering_kmeans_c.png" >
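
In case the image does not load: assuming it shows the standard K-means distortion, the function can be written in the notation used above as the sum of squared distances between each example and the centroid it is assigned to. This is the same un-averaged sum that `lowestCost()` accumulates.

\begin{equation}
\large J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2
\end{equation}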

In [None]:
def kMeans(X,K,iterations):
    '''
    Usage:
      #kMeans --> used for implementing the K-means clustering algorithm
  
    Arguments:
      #X --> The Design(data) Matrix
      #K --> the total number of centroids
      #iterations --> the number of iterations to run (assignment step followed by centroid update)
    
    Returns:
      #idx --> an array of size (#training examples,1) that holds the index (a value in {0,....,K-1} 
               where K is total number of centroids) of the closest centroid 
               to every training example.
      #centroids --> the final centroids, computed as the means of
                     the data points assigned to each centroid  
    '''
    
    #Initialize centroids
    centroids = kMeansInitCentroids(X,K)
    
    #Repeat for a fixed number of iterations (there is no early stopping on convergence)
    for iteration in range(iterations):
        
        #helper to observe in which iteration we are
        print("K-Means iteration {}/{}...".format(iteration, iterations - 1))
        
        #Cluster assignment step: Assign each data point to the
        #closest centroid. idx[i] corresponds to c^(i), the index
        #of the centroid assigned to example i
        idx = findClosestCentroids(X,centroids)
        
        #Move centroid step: Compute means based on centroid assignments
        centroids = computeCentroids(X,idx,K)
    
    print("\n\nThe Model has been trained\n\n")
        
        
    return idx, centroids
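
`kMeans()` always runs the requested number of iterations, even after the assignments have stopped changing. The sketch below is a self-contained variant that stops early once the assignments are stable. The name `kmeans_until_converged`, the `max_iterations` and `seed` parameters, and the rule of leaving an empty cluster's centroid unchanged are assumptions for illustration, not the notebook's implementation.

```python
import numpy as np

def kmeans_until_converged(X, K, max_iterations=100, seed=0):
    """K-means that stops as soon as the cluster assignments no longer change."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize with K distinct random examples (fancy indexing returns a copy)
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)]
    labels = None
    for iteration in range(max_iterations):
        # cluster assignment step: squared distance to every centroid, shape (m, K)
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        # converged: the assignments are identical to the previous iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # move centroid step: mean of the members of each cluster
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:   # assumption: an empty cluster keeps its centroid
                centroids[k] = members.mean(axis=0)
    return labels, centroids, iteration + 1

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo, n_iter_demo = kmeans_until_converged(X_demo, K=2)
print(labels_demo, n_iter_demo)
```

On this toy data the two well-separated groups end up in different clusters after a handful of iterations, far fewer than a fixed budget of 1000.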

## 7.lowestCost()

<br>

## Reminder

In practice, the K-means algorithm is usually run a few times with different random initializations. One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion).

In [None]:
def lowestCost(X,K,iterations,init_num):
    '''
    Usage:
      #lowestCost --> For computing the lowest cost over several random initializations of the centroids
  
    Arguments:
      #X --> The Design(data) Matrix
      #K --> the total number of centroids
      #iterations -->  the number of iterations of K-means per initialization
      #init_num --> the number of random initializations of the centroids to try 
    
    Returns:
      #cost --> an array that holds the cost values at the different random initializations
      #dic --> a dictionary whose key is the index of the cost at a given random initialization
              and whose value is the (idx,centroids) pair
      #lowest_cost --> the lowest cost found
      
      #idx_lowest_cost --> index of the lowest cost
      
      
    #Notes:
     1.from idx_lowest_cost we can get the (idx,centroids) corresponding to the lowest cost
       using --> dic[idx_lowest_cost]
    '''
    
    #pre-allocating empty-dic and its key will be the index of the cost 
    #and the value will be (idx,centroids) pair correponding to this cost
    dic = {}
    
    #Define a container to hold all the values of the cost for different initialization
    cost = np.zeros((init_num,))
    
    for r in range(init_num):
        #Track the initialization number 
        print("At initialization number: {}\n\n".format(r))
        
        #compute idx,centroids for a random initialization 
        idx,centroids= kMeans(X,K,iterations)
        
        #iterating over the training examples to compute the cost 
        for i in range(X.shape[0]):
            #increment cost[r] so that at the end of the loop it holds the total cost of run r
            cost[r] += np.power(np.linalg.norm(X[i] - centroids[idx[i]]),2)
        
        #Store the (idx,centroids) pair corresponding to the cost r
        dic[r] = (idx,centroids)
            
    
    #get the lowest cost 
    lowest_cost = np.min(cost, axis = 0)
    
    #get the index of the lowest cost
    idx_lowest_cost = np.argmin(cost, axis = 0)
    
    
    return cost,dic,lowest_cost,idx_lowest_cost
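
The inner loop over the training examples can be replaced by a single indexed subtraction. The following is a minimal sketch of the cost (distortion) of one run, assuming `idx` holds zero-based centroid indices as returned by `findClosestCentroids()`; the function name `distortion` is hypothetical.

```python
import numpy as np

def distortion(X, idx, centroids):
    """Sum of squared distances between each example and its assigned centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(idx).ravel().astype(int)           # flatten (m, 1) to (m,)
    # row i of assigned is the centroid that example i belongs to, shape (m, n_features)
    assigned = np.asarray(centroids, dtype=float)[labels]
    return float(np.sum((X - assigned) ** 2))

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
idx_demo = np.array([[0], [0], [1]])
centroids_demo = np.array([[1.0, 0.0], [10.0, 10.0]])
print(distortion(X_demo, idx_demo, centroids_demo))  # 2.0
```

The first two points are each at squared distance 1 from their centroid and the third lies exactly on its centroid, which gives the total of 2.0.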

# Training the model

In [None]:
#########################
#Define the training set#
#########################

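#Define the left part (CustomerID), kept to the left of the encoded column
left =X[:,0].reshape(-1,1)

#Define the unencoded part (Genre)
unencoded = X[:,1].reshape(-1,1)


#Define the right part (Age, Annual Income, Spending Score)
right = X[:,2:]

#Concatenate the parts --> Training set
#Cast to float because df.values returns an object array when the data contains strings
X_train = oneHot(left,unencoded,right,width = 2).astype(float)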

In [None]:
#Explore the training examples after encoding 
X_train

In [None]:
#Explore the shape of the training examples after encoding 
X_train.shape

In [None]:
idx, centroids = kMeans(X = X_train,K = 5,iterations = 1000)

In [None]:
#Visualize the data points and the centroids found by K-means
plt.scatter(X_train[:,-2], X_train[:,-1],color = 'r')
plt.scatter(centroids[:,-2], centroids[:,-1] ,color = 'b')

plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("The relation between Spending Score and Annual Income")
plt.show()

<br>


<p style = "font-size:18px"><strong>In practice,</strong> we need to run the algorithm with several random initializations and keep the centroids with the lowest cost.</p>

In [None]:
#Get the optimal centroids
cost,dic,lowest_cost,idx_lowest_cost = lowestCost(X_train,K = 5,iterations = 1000,init_num = 20)

In [None]:
#Explore the cost values at different initializations
cost

In [None]:
#Explore the lowest cost 
lowest_cost

In [None]:
#Explore the index of the lowest cost 
idx_lowest_cost

In [None]:
#Get the index of the closest centroid to every training example, and the optimal centroids
idx,centroids = dic[idx_lowest_cost]

In [None]:
#Explore the index
idx

In [None]:
#Explore the optimal centroids
centroids

In [None]:
#Visualize the data points and the centroids with the lowest cost
plt.scatter(X_train[:,-2], X_train[:,-1],color = 'r',label = "Data Points")
plt.scatter(centroids[:,-2], centroids[:,-1] ,color = 'b',label = "Centroids")
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("The relation between Spending Score and Annual Income")
plt.legend()

plt.show()

# Congratulations!