# K-Means:

## k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

<img src="images/kmeans.png">

## As the figure above shows, k-means starts from unlabelled data (no class or output values are given) and groups similar observations into clusters.

# We will be using a clustering technique known as k-means in our analysis. k-means is a partitioning algorithm that divides the data space into k clusters using the following steps:

-  Randomly choose k centres, one for each of the k clusters to be formed. We can call these points pseudo-centres.
-  Assign each data point to the nearest pseudo-centre. By doing so, we have just formed clusters, with each cluster comprising all data points associated with its pseudo-centre.
-  Recalculate the centre of each cluster as the mean of its data points, and move the cluster's pseudo-centre to that location.
-  Repeat the assignment and update steps until the pseudo-centres stop moving. At that point they are the actual cluster centres.
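
The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements it: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example, and the initial centres are simply k randomly chosen data points.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial pseudo-centres.
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest pseudo-centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each pseudo-centre to the mean of its points
        # (an empty cluster keeps its previous centre).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Step 4: stop once the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

# Hypothetical toy data: two well-separated groups of three points each.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centres, labels = kmeans_sketch(toy, k=2)
print(centres)
print(labels)
```

On this toy data the first three points should end up in one cluster and the last three in the other. Which cluster gets id 0 and which gets id 1 depends on the random initial choice.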

# Steps:

-  **1. Randomly choose k centres for the k clusters to be formed. These centres are called centroids.**
-  **2. For k = 2, join the two centroids with a line and draw the perpendicular bisector of that line.**
-  **3. Every point on the bisector is equidistant from both centroids, so the points on one side are closer to centroid 1 and the points on the other side are closer to centroid 2.**
-  **4. Group the points on each side of the bisector together to form the 2 clusters.**
-  **5. Move each centroid to the mean of its cluster, then repeat steps 2 to 4 with the relocated centroids until they stop moving.**

<img src="images/kmeans2.png">

# About the Dataset : 

<img src="images/iris1.png">

### The Iris dataset, as the name suggests, is a dataset about the iris flower and its classes. Below is what the dataset looks like:

<img src="images/iris.png">

# The Iris dataset uses the above 4 features (columns) to predict the class of an iris flower. There are 3 classes of iris flowers: Setosa, Versicolor and Virginica.

# Code:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from sklearn.datasets import load_iris

In [3]:
iris=load_iris()

In [4]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [5]:
X=iris.data

In [6]:
X.shape

(150L, 4L)

In [7]:
X

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4

# Code:

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [16]:
from sklearn.cluster import KMeans
k=KMeans(n_clusters=3)

In [17]:
k.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

# What are the cluster center vectors?

In [19]:
k.cluster_centers_

array([[ 5.9016129 ,  2.7483871 ,  4.39354839,  1.43387097],
       [ 5.006     ,  3.418     ,  1.464     ,  0.244     ],
       [ 6.85      ,  3.07368421,  5.74210526,  2.07105263]])
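
Each row above is the centre of one cluster, in the same feature order as `iris.feature_names`. As a small sketch of how these centres are used, the snippet below assigns one measurement to the nearest centre by Euclidean distance. The centre values are copied from the output above and the sample is the first row of `X`; the variable names are assumptions for this example. Calling `k.predict([sample])` on the fitted model should give the same cluster id.

```python
import numpy as np

# Cluster centres copied from the k.cluster_centers_ output above.
centres = np.array([
    [5.9016129, 2.7483871, 4.39354839, 1.43387097],
    [5.006, 3.418, 1.464, 0.244],
    [6.85, 3.07368421, 5.74210526, 2.07105263],
])

# First row of X: sepal length, sepal width, petal length, petal width (cm).
sample = np.array([5.1, 3.5, 1.4, 0.2])

# Euclidean distance from the sample to each of the three centres.
distances = np.linalg.norm(centres - sample, axis=1)
nearest = int(distances.argmin())
print(nearest)  # 1
```

The set of all points that are closest to a given centre is that centre's Voronoi cell, which is why k-means is described as partitioning the data space into Voronoi cells.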

## Evaluation
There is no perfect way to evaluate clustering if you don't have the labels. Since this is just an exercise, we do have the labels, so we take advantage of them to evaluate our clusters. Keep in mind that you usually won't have this luxury in the real world.

# Create a confusion matrix and classification report to see how well the k-means clustering worked without being given any labels.

In [20]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(iris.target,k.labels_))
print('\n')
print(classification_report(iris.target,k.labels_))


[[ 0 50  0]
 [48  0  2]
 [14  0 36]]


             precision    recall  f1-score   support

          0       0.00      0.00      0.00        50
          1       0.00      0.00      0.00        50
          2       0.95      0.72      0.82        50

avg / total       0.32      0.24      0.27       150
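
The cluster ids that k-means assigns are arbitrary, so they do not have to match the class ids in `iris.target`. That is why the report above shows zeros for classes 0 and 1 even though the confusion matrix has large counts off the diagonal. One simple way to read the matrix, sketched below, is to map each cluster to the species it mostly contains and count the points that agree. The majority-vote mapping is an assumption made for this example; it works here because the mapping turns out to be one-to-one. The matrix values are copied from the output above.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true species (0 setosa, 1 versicolor, 2 virginica). Columns: cluster ids.
cm = np.array([[0, 50, 0],
               [48, 0, 2],
               [14, 0, 36]])

# For each cluster (column), the species that appears most often in it.
cluster_to_species = cm.argmax(axis=0)

# Number of flowers that belong to the majority species of their cluster.
matched = cm.max(axis=0).sum()
purity = matched / cm.sum()

print(cluster_to_species)          # [1 0 2]
print(matched, round(purity, 3))   # 134 0.893
```

With this mapping, 134 of the 150 flowers (about 89%) sit in the cluster dominated by their own species. A score that does not depend on how the clusters are numbered, such as `sklearn.metrics.adjusted_rand_score(iris.target, k.labels_)`, avoids the mapping step altogether.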



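# Not so bad, considering the algorithm is purely using the features to cluster the flowers into 3 distinct groups! Note that the cluster ids are arbitrary, so the classification report compares mismatched ids. Reading the confusion matrix directly, cluster 1 holds all 50 Setosa, cluster 0 holds 48 Versicolor plus 14 Virginica, and cluster 2 holds 36 Virginica plus 2 Versicolor.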

# This is how KMeans is done