# K-Means Clustering

## 1. Introduction

The k-means type clustering algorithms are widely used in real-world applications such as marketing research and data mining to cluster very large data sets, owing to their efficiency and their ability to handle the numeric and categorical variables that are ubiquitous in real databases.

## 2. The Algorithm



  

Let $X= \{ X_{1}, \ X_2,...,\ X_n \}$ be a set of $n$ objects. Object $X_i= \{ x_{i,1}, \ x_{i,2},...,\ x_{i,m}\}$ is characterized by a set of $m$ variables. We want to partition the $n$ objects into $k$ clusters so as to minimize the objective function $P$ with unknown variables $U$ and $Z$:

$$  P(U,Z)= \sum_{l=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{m} u_{i,l}  d(x_{i,j}, z_{l,j})  $$

where

- $U$ is an $n \times k$ partition matrix, $u_{i,l}$ is a binary variable, and $u_{i,l}=1$ indicates that object $i$ is allocated to cluster $l$. Since each object belongs to exactly one cluster, every row of $U$ contains a single 1;  
- $Z= \{ Z_{1}, \ Z_2,...,\ Z_k \}$ is a set of $k$ vectors representing the centroids of the $k$ clusters;   
- $d(x_{i,j}, z_{l,j})$ is a distance measure between object $i$ and the centroid of cluster $l$ on the $j$th variable. Here we define it as the squared difference: $$d(x_{i,j}, z_{l,j})= (x_{i,j}-z_{l,j})^{2}.$$   
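
To make the objective concrete, here is a minimal sketch that evaluates $P(U,Z)$ with NumPy. The function name `objective` and the toy arrays are assumptions made for illustration only; they do not appear elsewhere in this notebook.

```python
import numpy as np

def objective(X, U, Z):
    """P(U, Z) = sum over l, i, j of u_il * (x_ij - z_lj)^2."""
    # sq[i, l] = squared Euclidean distance between object i and centroid l
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return float((U * sq).sum())

# Toy data (hypothetical): n = 4 objects, m = 2 variables, k = 2 clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
Z = np.array([[0.0, 1.0], [10.0, 1.0]])          # one centroid per row
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])   # one 1 per row

print(objective(X, U, Z))            # 4.0: each object is at squared distance 1 from its centroid
print(objective(X, U[:, ::-1], Z))   # 404.0: swapping the assignments is far worse
```

The broadcasting step builds the full $n \times k$ table of squared distances once, and the partition matrix $U$ then simply selects one entry per row.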

To minimize the cost function $P$, the K-means algorithm proceeds as follows:  

**K-means Algorithm**    
1. Randomly choose an initial set of centroids $Z^{0}=\{Z_{1},Z_{2},...,Z_{k}\}$. Determine $U^{0}$ such that $P(U^{0},Z^{0})$ is minimized. Set $t=0$;   
2. Let $\hat{Z}=Z^{t}$, and solve the reduced problem $P(U,\hat{Z})$ to obtain $U^{t+1}$. If $P(U^{t+1},\hat{Z})=P(U^{t},\hat{Z})$, output $(U^{t},\hat{Z})$ and stop; otherwise go to step 3;   
3. Let $\hat{U}=U^{t+1}$, and solve the reduced problem $P(\hat{U},Z)$ to obtain $Z^{t+1}$. If $P(\hat{U},Z^{t+1})=P(\hat{U},Z^{t})$, output $(\hat{U},Z^{t})$ and stop; otherwise set $t=t+1$ and go to step 2.   
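
The alternating structure of the steps above can be sketched compactly with NumPy broadcasting. This is an illustrative sketch, not the implementation used later in this notebook: the name `kmeans_sketch`, the `seed` parameter, the `1e-12` stopping threshold, and the choice to keep the old centroid when a cluster becomes empty are all assumptions. It records the value of $P$ after every pass, which shows that $P$ never increases.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Alternate the assignment and centroid updates until P stops decreasing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: use k distinct objects as the initial centroids
    Z = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(max_iter):
        # Step 2: fix Z and assign each object to its nearest centroid
        sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        # Step 3: fix the assignment and move each centroid to the mean of its objects
        for l in range(k):
            if np.any(labels == l):      # assumption: an empty cluster keeps its old centroid
                Z[l] = X[labels == l].mean(axis=0)
        P = float(((X - Z[labels]) ** 2).sum())
        if history and history[-1] - P <= 1e-12:
            break                        # P no longer decreases
        history.append(P)
    return labels, Z, history

# Toy data (hypothetical): two well-separated groups of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
labels, Z, history = kmeans_sketch(X, k=2)
print(labels)
print(history)   # a decreasing sequence of P values
```

Because each of the two updates minimizes $P$ with the other variable held fixed, the recorded values can only go down. The result may still be a local rather than a global minimum, depending on the initial centroids.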






In [2]:
pip install colorama

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import numpy as np
import math
import random
from colorama import Fore, Back, Style
from IPython.display import Markdown, display
from matplotlib import pyplot as plt
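def printmd(string):
    """Render a Markdown string in the notebook output."""
    display(Markdown(string))


def np_ed(point_1, point_2):
    """Squared Euclidean distance, i.e. d(x_j, z_j) summed over all variables j."""
    distance = np.sum(np.square(np.array(point_1) - np.array(point_2)))
    return distance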


## 3. Defining the K-Means Function

### 3.1 Initializing the first centroids and determining $T$ by minimizing $P$


As the first step of K-means clustering, we initialize the centroids by picking $k$ distinct data points at random. We then determine the cluster of each data point in the first iteration.

**Replacing $U$ by $T$**

In the original paper, the $n \times k$ matrix $U$ records the cluster of each data point in every iteration. Each element $u_{i,l}$ of $U$ is a binary variable, and $u_{i,l}=1$ indicates that object $i$ is allocated to cluster $l$. For brevity, I instead use an $(n \times 1)$ column vector $T$ whose elements take values in $\{1,2,...,k\}$; $t_{i,1}=l$, with $l \in \{1,2,...,k\}$, means that the $i$th data point belongs to the $l$th cluster. Rewriting the cost function $P$ in terms of $T$ gives

$$  P(T,Z)= \sum_{l=1}^{k}\sum_{i=1}^{n}\sum_{j=1}^{m} \mathbb{1}(t_{i,1}=l) d(x_{i,j}, z_{l,j})  $$

where $\mathbb{1}$ is the indicator function defined as
$$\begin{equation}
\mathbb{1}(t_{i,1}=l) =
    \begin{cases}
        1 & \text{if the $i$th data point belongs to the $l$th cluster} \\
        0 & \text{otherwise}
    \end{cases}
\end{equation}$$
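
The correspondence between $U$ and $T$ can be illustrated in a few lines. The matrix below is a made-up example; note that NumPy indices start at 0 while the text numbers clusters from 1, hence the `+ 1` and `- 1`.

```python
import numpy as np

# Hypothetical partition matrix U for n = 4 objects and k = 3 clusters
U = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])

# T stores, for each object, the position of the single 1 in its row of U
T = U.argmax(axis=1) + 1
print(T)   # [1 3 2 3]

# Going back: row i of U_back is the one-hot encoding of T[i]
U_back = np.eye(U.shape[1], dtype=int)[T - 1]
print(np.array_equal(U, U_back))   # True
```

Storing $T$ needs $n$ integers instead of $n \times k$ binary entries, and no information is lost because each row of $U$ contains exactly one 1.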


### 3.2 Iteration step for the centroids 

Now we implement K-means clustering. We repeat steps 2 and 3 until the cost function $P$ stops changing. In step 2, using the $Z$ from the previous iteration, we solve for the $T$ that minimizes $P$. After this step, the `kmeans` function defined below prints the vector $T$, the number of data points in each cluster, and the value of $P$.

In step 3, using the $T$ just obtained, we solve for the $Z$ that minimizes $P$. Setting the derivative of $P$ with respect to $z_{l,j}$ to zero shows that $P$ is minimized when

$$z_{l,j}= \frac{\sum_{i=1}^{n} \mathbb{1}(t_{i,1}=l)\ x_{i,j}}{\sum_{i=1}^{n} \mathbb{1}(t_{i,1}=l)}$$

where $1 \le l \le k$ and $1 \le j \le m$; that is, each centroid is the mean of the data points assigned to it. The function then prints the new centroids and the value of $P$.   
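
Before the full implementation, the centroid update can be checked numerically on a small example. The data, the 0-based label vector `T`, and the helper `cost` are assumptions chosen for illustration; the check only confirms that moving a centroid away from the cluster mean increases $P$.

```python
import numpy as np

# Toy data (hypothetical): three points in cluster 0 and one point in cluster 1
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [10.0, 10.0]])
T = np.array([0, 0, 0, 1])   # 0-based cluster labels
k = 2

# z_{l,j} = (sum of x_{i,j} over the points in cluster l) / (number of points in cluster l)
Z = np.array([X[T == l].sum(axis=0) / np.count_nonzero(T == l) for l in range(k)])
print(Z)   # rows are the cluster means (3, 4) and (10, 10)

def cost(Z):
    """P(T, Z) for the fixed assignment T above."""
    return float(((X - Z[T]) ** 2).sum())

# Shifting the first centroid away from the mean raises the cost
Z_shifted = Z.copy()
Z_shifted[0] += 0.5
print(cost(Z), cost(Z_shifted))   # 16.0 17.5
```

The increase of 1.5 equals the number of points in the cluster (3) times the squared length of the shift (0.25 + 0.25), which is what the quadratic form of $P$ predicts.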


In [5]:
from numpy import genfromtxt
# Read the CSV file into a float array, then drop its first row
D1= genfromtxt("C:/Users/Deepam Saha/OneDrive/Desktop/Deepam.csv",delimiter=",")
D= np.delete(D1,0,0)
print(D)

[[  9.30029201   5.57762667]
 [  7.29691436   3.21136253]
 [-10.59698143  -2.68536417]
 ...
 [  4.95514709  12.59525492]
 [  2.11705948  13.96820765]
 [ -2.24138468  15.58603699]]


In [None]:
def kmeans(X, k, max_iter=100, tol=1e-6):
    """K-means on the rows of X with k clusters; returns (T, Z, P)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    colors = ['purple', 'orange', 'green', 'blue']
    # Step 1: choose k distinct data points as the initial centroids
    A = np.random.choice(n, k, replace=False)
    print(A)
    Z = X[A].copy()
    P = np.inf
    q = 1                                   # iteration counter

    for o in range(max_iter):
        # Step 2: fix Z and assign each data point to its nearest centroid
        P1 = P
        ed = np.zeros((n, k))
        for i in range(n):
            for j in range(k):
                ed[i][j] = np_ed(Z[j], X[i])
        T = np.argmin(ed, axis=1)           # T[i] is the 0-based cluster of point i
        P = sum(np_ed(X[i], Z[T[i]]) for i in range(n))

        print(Fore.LIGHTBLUE_EX, "The new clusters for the", q, "th iteration", Style.RESET_ALL)
        printmd("**The Matrix T :**")
        print(T)
        for l in range(k):
            print("No. of data points in", l+1, "cluster:")
            print(np.count_nonzero(T == l))
        printmd("**Value of P:**")
        print(P, "\n")
        if abs(P1 - P) <= tol:              # P has stopped changing
            break

        # Step 3: fix T and move each centroid to the mean of its cluster
        P1 = P
        for j in range(k):
            if np.any(T == j):              # an empty cluster keeps its old centroid
                Z[j] = np.sum(X[T == j], axis=0) / np.count_nonzero(T == j)
        P = sum(np_ed(X[i], Z[T[i]]) for i in range(n))

        print(Fore.LIGHTBLUE_EX, "The new centroids for the", q, "th iteration", Style.RESET_ALL)
        printmd("**Centroids:**")
        print(Z)
        # Plot the clusters on the first two variables, with the centroids in red
        for l in range(k):
            plt.scatter(X[T == l, 0], X[T == l, 1], s=100,
                        c=colors[l % len(colors)], label='N' + str(l+1))
        plt.scatter(Z[:, 0], Z[:, 1], s=100, c='red', label='Centroids')
        plt.legend()
        plt.show()
        printmd("**Value of P:**")
        print(P, "\n")
        if abs(P1 - P) <= tol:
            break
        q = q + 1
    print(Fore.LIGHTMAGENTA_EX, "The number of iterations:", q, Style.RESET_ALL)
    return T, Z, P

In [None]:
kmeans(D,3)

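# Read the Iris data set; the file has no header row, so the column names are supplied here
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   names=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"])
X1 = iris.iloc[:,0:4]   # the four numeric variables
y = iris.iloc[:,-1]     # the class labels (not used by the clustering)
X=np.array(X1)
kmeans(X,3)
print(X)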

