# K-Means Clustering with scikit-learn

We are going to use the k-means implementation from scikit-learn; see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit).

In [1]:
from sklearn.cluster import KMeans

When using k-means from scikit-learn, we recommend storing your data as a NumPy array. Create one directly or convert your data into a NumPy array as follows.

In [2]:
import numpy as np
import pandas as pd

#create a numpy array
X = np.array([[1, 2], [1, 4], [1, 0],[4, 2], [4, 4], [4, 0]])

#convert a list to a numpy array
a=[]
for i in range(0,10):
    p=[i,2*i]
    a.append(p)

Y=np.array(a, dtype='float32')
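
When the data already lives in a pandas DataFrame, the conversion can also be done without a loop. The following is a minimal sketch on a hypothetical toy table (the names `toy`, `name`, `x`, `y`, and `W` are illustrative assumptions, not part of the lab data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one label column followed by numeric columns.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0],
})

# Drop the label column and convert the remaining columns to a float32 array.
W = toy.drop(columns="name").to_numpy(dtype="float32")
print(W.shape)  # (3, 2)
```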

The following code runs the k-means algorithm on the points in X. Make sure you understand the parameters; see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit).

In [3]:
kmeans = KMeans(init='random', n_clusters=2, max_iter=10000, n_init=100).fit(X)

The following code gives the labels:

In [4]:
kmeans.labels_

array([0, 0, 0, 1, 1, 1])

The following code predicts the clusters for the points [4,4] and [0,0]. In this case, [4,4] is placed in the cluster labeled 1 and [0,0] in the cluster labeled 0.

In [5]:
kmeans.predict([[4,4],[0,0]])

array([1, 0])
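
As an illustration of what `predict` does, the sketch below assigns each new point to its nearest centroid by hand and compares the result with `kmeans.predict`. The `random_state=0` and `n_init=10` settings are assumptions added here for reproducibility; the label numbering (which cluster is 0 and which is 1) can differ between runs.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(init='random', n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[4, 4], [0, 0]])

# Squared distance from every new point to every centroid, shape (2, 2).
diffs = new_points[:, None, :] - kmeans.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# The predicted label is the index of the closest centroid.
manual = sq_dists.argmin(axis=1)
print(manual, kmeans.predict(new_points))
```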

The following code shows the centroids (called centers in scikit-learn) of the two clusters.

In [6]:
kmeans.cluster_centers_

array([[1., 2.],
       [4., 2.]])
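
Each centroid is the mean of the points assigned to its cluster. A minimal check on this small example, assuming the same toy array X (the `random_state=0` and `n_init=10` settings are added here only for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(init='random', n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute each centroid as the mean of the points carrying that label.
manual_centers = np.array([
    X[kmeans.labels_ == k].mean(axis=0) for k in range(2)
])
print(manual_centers)
print(kmeans.cluster_centers_)
```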

___

# My Work:

## 1. Run Algo with Default Parameters

In [7]:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

#read the input csv file
df=pd.read_csv('SD201-TP2-Clustering-data.csv')

Z=[]
for val in df.values:
    Z.append(val[1:])
Z=np.array(Z, dtype='float32')

# init='random' is set explicitly (scikit-learn's default init is 'k-means++');
# the other parameters keep their defaults in the version used here:
# n_clusters=8, max_iter=300, n_init=10, tol=1e-4
kmeans = KMeans(init='random').fit(Z)
labels = kmeans.labels_
stocks = df.T.values[0]

# group the stock names by cluster label
clusters = [[] for _ in range(8)]
for stock, label in zip(stocks, labels):
    clusters[label].append(stock)


SSE = kmeans.inertia_

# calculate the SSE manually: sum of squared distances to each point's assigned centroid
centers = kmeans.cluster_centers_
mySSE = 0
for value, label in zip(Z, labels):
    mySSE += sum(np.power(value - centers[label], 2))

In [8]:
clusters

[['Hewlett-Packard'],
 ['American Express',
  'Boeing',
  'IBM',
  'The Home Depot',
  'Walt Disney',
  'Intel',
  'Wal-Mart',
  'General Electric',
  'United Technologies',
  'Travelers',
  'JPMorgan Chase',
  '3M',
  'Johnson & Johnson'],
 ['Cisco Systems', 'Microsoft', 'Alcoa'],
 ['Verizon', 'Merck'],
 ['Chevron', 'Pfizer', 'ExxonMobil'],
 ['Bank of America'],
 ['DuPont', 'Caterpillar'],
 ['Kraft', 'Procter & Gamble', 'AT&T', 'McDonalds', 'Coca-Cola']]

In [9]:
mySSE, SSE

(1648.9618647838797, 1648.9619140625)
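
The same check can be written without an explicit loop. The sketch below uses synthetic data as a stand-in for Z (an assumption, since the CSV file is not reproduced here) and the hypothetical names `data`, `km`, and `manual_sse`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for Z (assumption: 30 points in 25 dimensions).
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 25))

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(data)

# Select each point's assigned centroid by indexing with the labels,
# then sum the squared differences over all points and dimensions.
assigned = km.cluster_centers_[km.labels_]
manual_sse = np.sum((data - assigned) ** 2)

print(manual_sse, km.inertia_)
```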

#### Conclusion

I ran the K-Means algorithm with random centroid initialization (init='random') and the default values for the remaining parameters: at most 300 iterations per run (max_iter), 10 runs from different initial centroids with the best result kept (n_init), a tolerance of 1e-4 on the centroid movement used to declare convergence (tol), and of course 8 clusters (n_clusters).
I also wrote my own SSE calculation snippet, which agrees with kmeans.inertia_ up to floating-point rounding.

We get an SSE in the neighbourhood of 1700 (1648.96 in the run shown above).
Looking more closely at this number, the mean squared distance between a point and its centroid in the 25-dimensional space (the study covers 25 different dates) is about 1700/30 = 56.67 (unit^2).

We divide by 30 since we are clustering 30 different points (companies) in the 25D space. Taking the square root gives a root-mean-square distance of about 7.5 (unit) between each company's stock and the centroid of its cluster. The unit is the Euclidean distance between two points in the 25D space.
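
The arithmetic above can be reproduced in a few lines. Note that the square root of the mean squared distance is a root-mean-square distance, which is an upper bound on the plain average distance. The variable names are illustrative:

```python
import math

sse = 1700.0     # round figure quoted above
n_points = 30    # number of companies

mean_sq = sse / n_points   # mean squared distance to the assigned centroid
rms = math.sqrt(mean_sq)   # root-mean-square distance

print(round(mean_sq, 4), round(rms, 2))  # 56.6667 7.53
```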

We got these results using the K-Means algorithm with random initialization, which explains why the results change with each run.
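
If reproducible results are needed, the seed of the random initialization can be fixed with the `random_state` parameter of `KMeans`. A minimal sketch on synthetic data (the data and the seed value 42 are assumptions for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data, not the stock data.
rng = np.random.default_rng(0)
data = rng.normal(size=(60, 5))

# Two fits with the same seed start from the same initial centroids.
a = KMeans(init='random', n_clusters=4, n_init=10, random_state=42).fit(data)
b = KMeans(init='random', n_clusters=4, n_init=10, random_state=42).fit(data)

print(a.inertia_, b.inertia_)
```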

## 2. Improving the K-Means Algo

In [38]:
kmeans = KMeans(init='k-means++', n_clusters=8, max_iter=100000, n_init=100, tol=1e-10).fit(Z)
labels = kmeans.labels_
stocks = df.T.values[0]

# group the stock names by cluster label
clusters = [[] for _ in range(8)]
for stock, label in zip(stocks, labels):
    clusters[label].append(stock)
SSE = kmeans.inertia_
print("SSE", SSE)
clusters

SSE 1540.729736328125


[['American Express', 'Bank of America', 'JPMorgan Chase'],
 ['Kraft',
  'Verizon',
  'IBM',
  'The Home Depot',
  'Procter & Gamble',
  'Wal-Mart',
  'AT&T',
  'Merck',
  'Travelers',
  'McDonalds',
  'Coca-Cola'],
 ['Chevron', 'Pfizer', 'ExxonMobil'],
 ['Hewlett-Packard'],
 ['Cisco Systems'],
 ['DuPont', 'Caterpillar'],
 ['Boeing',
  'Microsoft',
  'Walt Disney',
  'Intel',
  'General Electric',
  'United Technologies',
  '3M',
  'Johnson & Johnson'],
 ['Alcoa']]

#### Conclusion

In order to decrease the SSE, we tune the parameters of the K-Means algorithm: the initialization, the maximum number of iterations, the number of repeated runs, and the tolerance (epsilon).

As a first step, we use K-Means++ as the initialization method. It spreads the initial centroids apart, which tends to speed up convergence and to give a lower SSE at convergence. This is not guaranteed on every run, since the algorithm can still end in a local minimum.
Just by changing this parameter, we get an average SSE of about 1600 (a decrease of about 100), which is an improvement over the initial result.
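
One way to compare the two initialization methods is to run each of them once per seed and look at the spread of the resulting SSE values. The sketch below uses synthetic blobs rather than the stock data, and the helper `inertias` is a hypothetical name; no particular outcome is asserted, since either method can win on a given seed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 8 blobs (assumption: stand-in for the real data set).
data, _ = make_blobs(n_samples=300, centers=8, n_features=5, random_state=0)

def inertias(init, seeds=range(20)):
    """SSE of one single run (n_init=1) per seed for the given init method."""
    return np.array([
        KMeans(init=init, n_clusters=8, n_init=1, random_state=s).fit(data).inertia_
        for s in seeds
    ])

rand = inertias('random')
pp = inertias('k-means++')
print("random    mean/min:", rand.mean(), rand.min())
print("k-means++ mean/min:", pp.mean(), pp.min())
```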

As a second step, we can adjust the max_iter and tol (tolerance) parameters together.
These two parameters are the two stopping conditions of the algorithm (OR relation): a run stops when the centroids move less than the tolerance, or when max_iter iterations have been performed.
Increasing the maximum number of iterations cannot increase the SSE, since each iteration is guaranteed to give an SSE smaller than or equal to the previous one. It lowers the SSE only when runs were being cut off by the iteration limit. If the centroids converge quickly, we never reach the max_iter limit and stop on the tolerance instead.
With the settings above (larger max_iter, smaller tolerance), we get an improved SSE in the region of 1500, which is also a clear improvement.
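
To see which stopping condition ended a run, the fitted model's `n_iter_` attribute can be compared with `max_iter`. A minimal sketch on synthetic data, with a deliberately tiny iteration limit for contrast (the data, the seed, and the `results` dictionary are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=8, n_features=5, random_state=0)

results = {}
for max_iter in (2, 300):
    # Same seed, so both runs start from the same initial centroids.
    km = KMeans(init='random', n_clusters=8, n_init=1, max_iter=max_iter,
                tol=1e-4, random_state=0).fit(data)
    # Heuristic: a run that used all its iterations most likely hit the limit.
    stopped_on = "max_iter" if km.n_iter_ >= max_iter else "tolerance"
    results[max_iter] = (km.n_iter_, km.inertia_)
    print(max_iter, km.n_iter_, stopped_on, km.inertia_)
```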


Note:
n_init is an important parameter: it repeats the algorithm n_init times from different initial centroids and returns the best realisation. Increasing it does not guarantee a strictly lower SSE, since each run depends on the random choice of the initial centroids. However, the best of more runs can never be worse than the best of a subset of them.
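
Conceptually, n_init keeps the best of several independent runs. The sketch below mimics this by hand on synthetic data, with one single run per seed and a running minimum. It is an illustration under these assumptions, not scikit-learn's internal seeding scheme:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=8, n_features=5, random_state=0)

# One single run (n_init=1) per seed.
single_runs = np.array([
    KMeans(init='random', n_clusters=8, n_init=1, random_state=s).fit(data).inertia_
    for s in range(15)
])

# Best SSE seen after 1, 2, ..., 15 runs: it can stay flat, but never goes up.
best_so_far = np.minimum.accumulate(single_runs)
print(best_so_far)
```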

## 3. Labeling Clusters with Appropriate Labels

We copy the clustering obtained with an SSE of 1540.73 in order to label its clusters:

In [47]:
resultCluster= [['American Express', 'Bank of America', 'JPMorgan Chase'],
 ['Kraft',
  'Verizon',
  'IBM',
  'The Home Depot',
  'Procter & Gamble',
  'Wal-Mart',
  'AT&T',
  'Merck',
  'Travelers',
  'McDonalds',
  'Coca-Cola'],
 ['Chevron', 'Pfizer', 'ExxonMobil'],
 ['Hewlett-Packard'],
 ['Cisco Systems'],
 ['DuPont', 'Caterpillar'],
 ['Boeing',
  'Microsoft',
  'Walt Disney',
  'Intel',
  'General Electric',
  'United Technologies',
  '3M',
  'Johnson & Johnson'],
 ['Alcoa']]

labels = ['Banking', 'Consumer goods', 'Oil', 'Computers', 'Networking', 'NO CLASS', 'Manufacturers', 'Aluminum(Industry)']

result = dict(zip(labels, resultCluster))
for i, name in enumerate(result, start=1):
    print(i, ":", end=" ")
    print(name)
    print(result[name])
    print()

1 : Banking
['American Express', 'Bank of America', 'JPMorgan Chase']

2 : Consumer goods
['Kraft', 'Verizon', 'IBM', 'The Home Depot', 'Procter & Gamble', 'Wal-Mart', 'AT&T', 'Merck', 'Travelers', 'McDonalds', 'Coca-Cola']

3 : Oil
['Chevron', 'Pfizer', 'ExxonMobil']

4 : Computers
['Hewlett-Packard']

5 : Networking
['Cisco Systems']

6 : NO CLASS
['DuPont', 'Caterpillar']

7 : Manufacturers
['Boeing', 'Microsoft', 'Walt Disney', 'Intel', 'General Electric', 'United Technologies', '3M', 'Johnson & Johnson']

8 : Aluminum(Industry)
['Alcoa']



Motivation:

    Clusters 1, 4, 5, and 8 are perfect matches.
    Cluster 3 is almost perfect: it contains two oil companies (Chevron, ExxonMobil), joined by one pharmaceutical (health care) company, Pfizer.
    Cluster 2 is mostly made of companies that sell goods directly to consumers, such as McDonald's, Coca-Cola, Kraft, Procter & Gamble, and Wal-Mart. However, The Home Depot is a home-improvement retailer, IBM is an information technology company, Verizon and AT&T are telecommunications companies, Merck is in health care, and Travelers is an insurance company. Overall, this cluster still fits under Consumer goods.
    Cluster 7 has big industrial manufacturers such as Boeing, General Electric, United Technologies, 3M, Johnson & Johnson, Intel, and Microsoft. Walt Disney is in entertainment.
    So the Manufacturers label fits this cluster well.