# Clustering Tutorial
## Q4
In scikit-learn, apply k-Means clustering with Euclidean distance to the unlabeled Penguins dataset with the `n_init` parameter set to 1. Report the within-cluster sum of squared errors (SSE) for clusterings with different numbers of clusters: `k=2`, `k=3` and `k=4`.  
Repeat the process, but change the random seed parameter for k-Means. Are the SSE scores identical?  
Repeat again with `n_init` set to 50. Does this make a difference?
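
To make the quantity being reported concrete, here is a minimal plain-Python sketch of the within-cluster SSE: the sum, over all points, of the squared Euclidean distance to the centroid of the point's cluster. The helper name `within_cluster_sse` and the toy data are made up for illustration and are not part of scikit-learn. For a converged k-Means run this should correspond to the value scikit-learn exposes as `inertia_`.

```python
def within_cluster_sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    sse = 0.0
    for members in clusters.values():
        # The centroid is the per-dimension mean of the cluster's members.
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        for point in members:
            sse += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return sse


# Toy data (hypothetical): two tight pairs of points, one pair per cluster.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
toy_sse = within_cluster_sse(toy_points, toy_labels)
print(toy_sse)  # 4.0
```

Each toy point lies exactly one unit from its cluster centroid, so the four squared distances of 1 add up to 4.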

In [15]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [16]:
penguins_all = pd.read_csv('penguins_af.csv')
penguins = penguins_all[['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g']]
X = penguins.values
X_scal = StandardScaler().fit_transform(X)

In [17]:
rs = 1
tries = 1
for k in range(2,5):
    km = KMeans(n_clusters = k, n_init = tries, init = 'random', random_state=rs)
    km.fit(X_scal)
    print("k = {} SSE: {:.3f}".format(k, km.inertia_))

k = 2 SSE: 552.671
k = 3 SSE: 370.766
k = 4 SSE: 305.369


In [18]:
rs = 2
tries = 1
for k in range(2,5):
    km = KMeans(n_clusters = k, n_init = tries, init = 'random', random_state=rs)
    km.fit(X_scal)
    print("k = {} SSE: {:.3f}".format(k, km.inertia_))

k = 2 SSE: 552.671
k = 3 SSE: 370.766
k = 4 SSE: 305.368


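With only one initialisation per *k*, the result depends on the random seed: here the SSE for *k*=4 differs slightly between the two runs (305.369 vs 305.368), while *k*=2 and *k*=3 agree to three decimal places.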

In [19]:
rs = 1
tries = 50
for k in range(2,5):
    km = KMeans(n_clusters = k, n_init = tries, init = 'random', random_state=rs)
    km.fit(X_scal)
    print("k = {} SSE: {:.3f}".format(k, km.inertia_))

k = 2 SSE: 552.671
k = 3 SSE: 370.766
k = 4 SSE: 293.905


In [20]:
rs = 2
tries = 50
for k in range(2,5):
    km = KMeans(n_clusters = k, n_init = tries, init = 'random', random_state=rs)
    km.fit(X_scal)
    print("k = {} SSE: {:.3f}".format(k, km.inertia_))

k = 2 SSE: 552.671
k = 3 SSE: 370.766
k = 4 SSE: 293.905


With 50 initialisations per *k* and the lowest-SSE run kept, the two seeds give identical results, and the SSE for *k*=4 drops from about 305.37 to 293.905.
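
The idea of running several random initialisations and keeping the run with the lowest SSE can be sketched in plain Python. This is an illustrative toy under stated assumptions, not scikit-learn's implementation: the functions `lloyd_once` and `best_of_restarts`, the synthetic `toy_points`, and the seeds are all hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (sse, centroids)."""
    # Random initialisation: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return sse, centroids


def best_of_restarts(points, k, n_init, seed):
    """Run k-means n_init times from one seeded generator; keep the lowest-SSE run."""
    rng = random.Random(seed)
    runs = [lloyd_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])


# Synthetic data (hypothetical): four noisy blobs of 25 points each.
data_rng = random.Random(0)
blob_centres = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
toy_points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in blob_centres
    for _ in range(25)
]

for seed in (1, 2):
    single_sse, _ = best_of_restarts(toy_points, k=4, n_init=1, seed=seed)
    best_sse, _ = best_of_restarts(toy_points, k=4, n_init=50, seed=seed)
    print(f"seed={seed}  n_init=1: {single_sse:.3f}  n_init=50: {best_sse:.3f}")
```

Because both calls start from the same seeded generator, the first of the 50 restarts is the same run as the single-initialisation case, so the best-of-50 SSE can never be worse than the single-run SSE for that seed. More restarts also make it more likely that different seeds end up at the same best solution, which is the pattern seen in the results above.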

In [21]:
rs = 1
tries = 10
for k in range(2,5):
    km = KMeans(n_clusters = k, n_init = tries, init = 'random', random_state=rs)
    km.fit(X_scal)
    print("k = {} SSE: {:.3f}".format(k, km.inertia_))

k = 2 SSE: 552.671
k = 3 SSE: 370.766
k = 4 SSE: 293.905
