# **KMeans**

## What is K-means Clustering

- K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
- Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.
- The objective of K-means is simple: **group similar data points together and discover underlying patterns**. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
- A cluster refers to a collection of data points aggregated together because of certain similarities.
- You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
- Every data point is allocated to exactly one cluster in a way that reduces the within-cluster sum of squares, that is, the sum of squared distances from each point to the centroid of its cluster.
- In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest one, while keeping the clusters as compact as possible.
- The 'means' in K-means refers to averaging the data; that is, finding the centroid.
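The within-cluster sum of squares mentioned above can be computed directly. The following is a minimal sketch, not part of the original notebook: the function name `within_cluster_ss` and the toy data are assumptions chosen for illustration. It shows that assigning points to their nearest centroid gives a smaller value than a poor assignment.

```python
import numpy as np

def within_cluster_ss(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = X - centers[labels]   # each row minus the center it is assigned to
    return float(np.sum(diffs ** 2))

# Toy data (hypothetical): four 2-D points forming two obvious groups
X = np.array([[1.0, 1.0], [1.0, 3.0], [9.0, 1.0], [9.0, 3.0]])
centers = np.array([[1.0, 2.0], [9.0, 2.0]])

good = np.array([0, 0, 1, 1])   # every point assigned to its nearest center
bad = np.array([0, 1, 0, 1])    # two points assigned to the far center

wcss_good = within_cluster_ss(X, good, centers)
wcss_bad = within_cluster_ss(X, bad, centers)
print(wcss_good)  # 4.0
print(wcss_bad)   # 132.0
```

K-means alternates between lowering this quantity by reassigning points and lowering it by moving the centroids.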

## How the K-means algorithm works

- To process the training data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the starting points for the clusters, and then performs iterative (repeated) calculations to optimize the positions of the centroids.
- It stops creating and optimizing clusters when either:
  - The centroids have stabilized: their values no longer change because the clustering has converged.
  - The defined number of iterations has been reached.

## Algorithm

```
 Input:  Dataset X and the number of clusters to find (K)
 Output: Centers and the label vector for each data point
```

1. Choose any K points as the initial centers.
2. Assign each data point to the cluster whose center is closest to it.
3. If the assignment of data points to clusters in step 2 has not changed since the previous iteration, stop the algorithm.
4. Update the center for each cluster by averaging all the data points assigned to that cluster after step 2.
5. Go back to step 2. 

## Code

#### Import libraries

In [None]:
import numpy as np 
import json
import matplotlib.pyplot as plt

#### Load data from a JSON file into the dataset matrix X

In [None]:
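def process_data(path):
    # Read a list of records, each with 'Math' and 'Literature' scores
    with open(path, "r") as f:
        data = json.load(f)
    # One row per record: column 0 is Math, column 1 is Literature
    X=np.zeros((len(data),2))
    X[:,0]=[d['Math'] for d in data]
    X[:,1]=[d['Literature'] for d in data]
    return X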

#### Initialize the centers

In [None]:
def init_center(X,k):
    return X[np.random.choice(X.shape[0],k,replace=False)]
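Because the initial centers are drawn at random, two runs can end in different clusterings. The sketch below is an optional variant, not the notebook's own function: `init_center_seeded` and its `seed` parameter are assumptions. It uses NumPy's `default_rng` so that the same seed always selects the same starting points.

```python
import numpy as np

def init_center_seeded(X, k, seed=0):
    """Pick k distinct rows of X as initial centers, reproducibly."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=k, replace=False)  # k distinct row indices
    return X[idx]

# Hypothetical data: 10 points in 2-D
X = np.arange(20, dtype=float).reshape(10, 2)
a = init_center_seeded(X, 3, seed=42)
b = init_center_seeded(X, 3, seed=42)
print(np.array_equal(a, b))  # True: same seed, same centers
```

Fixing the seed is mainly useful when comparing runs or debugging; it does not make the chosen starting points any better.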


#### Distances from dataset to centers 

In [None]:
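def distance(X,center):
    # Euclidean distance from every point to every center:
    # row i holds the distances from X[i] to each center.
    distances=[]
    for x in X:
        row=[]
        for c in center:
            row.append(np.sqrt(np.sum((x-c)**2)))
        distances.append(row)
    return np.array(distances)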
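The nested loops can become slow on larger datasets. The same distance matrix can be computed with NumPy broadcasting, as in the sketch below. The name `distance_vectorized` is an assumption; it is meant to return the same values as `distance`, just without Python-level loops.

```python
import numpy as np

def distance_vectorized(X, center):
    # X[:, None, :] has shape (n, 1, d); center[None, :, :] has shape (1, k, d).
    # Broadcasting the subtraction gives all pairwise differences, shape (n, k, d).
    diff = X[:, None, :] - center[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))  # shape (n, k)

# Hypothetical points and centers chosen so the distances are easy to check
X = np.array([[0.0, 0.0], [3.0, 4.0]])
center = np.array([[0.0, 0.0], [6.0, 8.0]])
D = distance_vectorized(X, center)
print(D)  # row 0 is [0, 10], row 1 is [5, 5]
```

The intermediate array holds n × k × d values, so for very large inputs the loop version or a chunked computation may use less memory.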

#### Assign new labels to points when the centers are known

In [None]:
def assign_label(X,center):
    D= distance(X,center)
    return np.argmin(D,axis=1)

#### Update new centers based on newly labeled data 

In [None]:
def update_center(X,label,K):
    center=np.zeros((K,X.shape[1]))
    for k in range(K):
        Xk=X[label ==k ,:]
        center[k,:]=np.mean(Xk,axis=0)
    return center
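If a cluster ends up with no points, `np.mean` over the empty selection returns NaN and the center is lost. One common workaround, shown here only as a sketch under that assumption, is to keep the previous center for an empty cluster. The function `update_center_safe` and its extra `old_center` argument are hypothetical additions, not part of the original notebook.

```python
import numpy as np

def update_center_safe(X, label, K, old_center):
    # Start from the previous centers so empty clusters keep their old position
    center = old_center.astype(float).copy()
    for k in range(K):
        Xk = X[label == k]
        if len(Xk) > 0:                 # only average when the cluster has points
            center[k] = Xk.mean(axis=0)
    return center

# Hypothetical case: all three points fall into cluster 0, cluster 1 is empty
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
label = np.array([0, 0, 0])
old = np.array([[1.0, 1.0], [50.0, 50.0]])
new = update_center_safe(X, label, 2, old)
print(new)  # cluster 0 moves to the mean of the points, cluster 1 stays at [50, 50]
```

Other strategies exist, such as re-seeding the empty cluster at a random data point; which one is appropriate depends on the data.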

#### Check the stopping condition of the algorithm

In [None]:
def has_converged(center, new_center):
    return (set([tuple(a) for a in center]) == 
    set([tuple(b) for b in new_center]))
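The check above requires the centers to match exactly, which is reasonable here because unchanged labels produce identical means. A tolerance-based test is a common alternative when centers keep shifting by tiny floating-point amounts. The sketch below assumes a hypothetical `has_converged_tol` with an absolute tolerance `tol`; unlike the set comparison, it compares centers row by row, so it assumes the centers stay in the same order between iterations.

```python
import numpy as np

def has_converged_tol(center, new_center, tol=1e-6):
    # True when every coordinate of every center moved by at most tol
    return np.allclose(center, new_center, rtol=0.0, atol=tol)

old = np.array([[1.0, 2.0], [3.0, 4.0]])
tiny_move = has_converged_tol(old, old + 1e-9)
large_move = has_converged_tol(old, old + 1e-3)
print(tiny_move)   # True
print(large_move)  # False
```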

#### Display

In [None]:
def display(X, label,center):
    K=np.amax(label)+1
    colors=['y','g','r','b']
    # Plot each cluster in its own color (colors repeat if K > 4)
    for k in range(K):
        Xk=X[label==k,:]
        plt.plot(Xk[:, 0], Xk[:, 1], colors[k % len(colors)]+'o', markersize = 5, alpha = .8)

    # Plot the centers in black
    plt.plot(center[:, 0], center[:, 1], 'ko', markersize = 5, alpha = .8)

    plt.axis('equal')
    plt.show()
    

#### K Means

In [None]:

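def kMeans(X,K):
    center =[ init_center(X,K)]
    label=[]
    while True:
        # Assign every point to its nearest center
        label.append(assign_label(X,center[-1]))
        # Recompute each center as the mean of its assigned points
        new_center= update_center(X,label[-1],K)
        display(X,label[-1],new_center)
        # Stop once the centers no longer change
        if has_converged(center[-1], new_center):
            break
        center.append(new_center)
    # Return the final centers and the label of each data point
    return center[-1], label[-1]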
    



#### Run

In [None]:
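K=4
# process_data opens a local file, so it cannot read a URL directly.
# Download score_cluster.json from
# https://github.com/ThuyLinh110/ThuyLinh110.github.io/blob/master/AI/K-Mean/score_cluster.json
# and pass its local path here.
X= process_data("score_cluster.json")
kMeans(X, K)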
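If the JSON file is not at hand, the algorithm can still be tried on synthetic data. The sketch below is self-contained and makes several assumptions: the group means, the spread, the seeds, and the helper `kmeans_compact` are all invented for illustration, and the helper is a condensed, plot-free variant of the steps in this notebook rather than the notebook's own `kMeans`.

```python
import numpy as np

def kmeans_compact(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Start from K distinct data points
    center = X[rng.choice(X.shape[0], size=K, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest center
        D = np.linalg.norm(X[:, None, :] - center[None, :, :], axis=2)
        label = np.argmin(D, axis=1)
        # Recompute centers; an empty cluster keeps its previous center
        new_center = center.copy()
        for k in range(K):
            if np.any(label == k):
                new_center[k] = X[label == k].mean(axis=0)
        # Stop when the centers no longer move
        if np.allclose(new_center, center):
            break
        center = new_center
    return center, label

# Synthetic "scores": four groups of 50 points around made-up (Math, Literature) means
rng = np.random.default_rng(1)
true_means = np.array([[3.0, 3.0], [3.0, 8.0], [8.0, 3.0], [8.0, 8.0]])
X = np.vstack([m + 0.5 * rng.standard_normal((50, 2)) for m in true_means])

center, label = kmeans_compact(X, K=4)
print(center.round(1))                   # one row per cluster center
print(np.bincount(label, minlength=4))   # number of points in each cluster
```

Depending on the random starting points, K-means can settle in a local optimum, so the recovered centers are not guaranteed to match the four means used to generate the data.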