# K-Means Homemade

Below is an application of our homemade algorithm, benchmarked against the `sklearn` KMeans.

In [None]:
import numpy as np
import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Import our library
import kmeans_homemade.kmeans as khm

We're using the `iris` dataset for this benchmark. See more [here](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

X[:5, :]

In [None]:
x_train = pd.DataFrame(X, columns = iris.feature_names)
x_train.head()

Quickly apply PCA to project the data onto two dimensions so we can visualize it.
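
For intuition, the projection can be sketched with plain NumPy: center the data, take the SVD, and keep the first two directions. This is only an illustrative sketch; `pca_project` and the random `demo` data are made up for this example, and the signs of the components may differ from what `sklearn`'s `PCA` returns.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal directions (sketch)."""
    X = np.asarray(X, dtype=float)
    # PCA works on centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical demo data: 50 observations with 4 features, like one iris class
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 4))
projected = pca_project(demo)
print(projected.shape)  # (50, 2)
```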

In [None]:
# Reduce Dimensionality
pca_2d = PCA(n_components = 2).fit_transform(x_train)
pca_2d = pd.DataFrame(pca_2d)

# Plot
plt.figure(figsize=(15, 15))
sns.scatterplot(data = pca_2d, x = 0, y = 1)

## Fitting KMeans Homemade

We can load the data into our object and `fit` KMeans
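
As a reminder of what a fit of this kind typically does, here is a minimal sketch of the classic Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the code inside `kmeans_homemade`; the function name `lloyd_kmeans`, its parameters, and the choice of starting from randomly picked observations are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, n_clusters, max_iter=100, seed=0):
    """Minimal Lloyd-style k-means sketch (illustrative only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from n_clusters distinct observations picked at random
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if it is empty)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels, n_iter

# Hypothetical demo data: two well-separated blobs
rng = np.random.default_rng(1)
demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 2)),
])
centroids, labels, n_iter = lloyd_kmeans(demo, n_clusters=2)
print(centroids)
print(n_iter)
```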

In [None]:
# Load object
k_model = khm.KMeans(n_clusters = 3 , X = x_train)

In [None]:
k_model.__doc__.split('\n')

In [None]:
k_model.fit()

### Attributes

In [None]:
# Number of iterations
print(f'\nNumber of iterations \n {k_model.n_iter}')

# Cluster centroids
print(f'\nCluster centroids \n {k_model._centroids}')

# Assigned clusters/labels
print(f'\nAssigned Clusters \n {k_model._clusters}')

# Feature names (if available)
print(f'\nFeature Names \n {k_model.features}')

# Total sum of squared errors (SSE)
print(f'\nTotal SSE \n {k_model.total_sse}')
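
Judging by its name, `total_sse` is presumably the within-cluster sum of squared distances, which is the same quantity `sklearn` reports as `inertia_`. Under that assumption, it can be recomputed from the centroids and labels as sketched below; the helper name `sse_from_fit` and the toy arrays are made up for this example.

```python
import numpy as np

def sse_from_fit(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    # centroids[labels] lines up the assigned centroid with every row of X
    residuals = X - centroids[labels]
    return float(np.sum(residuals ** 2))

# Toy data: every point sits exactly 1 unit away from its centroid
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])
assignment = np.array([0, 0, 1, 1])
print(sse_from_fit(points, centers, assignment))  # 4.0
```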

### Visualising Clusters

In [None]:
print(k_model.n_iter)
pca_2d['labels'] = k_model._clusters

plt.figure(figsize=(15, 15))
sns.scatterplot(data = pca_2d, x= 0, y=1, hue = "labels")

## Benchmarking

We can benchmark this model against the `sklearn` implementation.

In this case, we choose `init='random'`, which picks observations at random as the starting centroids. This mirrors our homemade algorithm, which also starts from a random initialization.

By default, `KMeans` uses `init='k-means++'`, a smarter seeding scheme that spreads the initial centroids apart instead of picking them uniformly at random. This usually speeds up convergence.

We're too lazy to do that here, so we compare against the random method.
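
To make the difference concrete, here is a rough NumPy sketch of the two seeding strategies. The function names are hypothetical, the k-means++ version is the basic textbook variant rather than the exact routine `sklearn` runs, and the random version may not match our library's initialization in detail.

```python
import numpy as np

def init_random(X, n_clusters, rng):
    """Pick n_clusters distinct observations uniformly at random."""
    return X[rng.choice(len(X), size=n_clusters, replace=False)]

def init_kmeans_pp(X, n_clusters, rng):
    """Basic k-means++ seeding: favour points far from the centroids chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        chosen = np.array(centroids)
        # Squared distance from every point to its nearest chosen centroid
        sq_dists = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)
        nearest_sq = sq_dists.min(axis=1)
        # Sample the next centroid with probability proportional to that distance
        next_idx = rng.choice(len(X), p=nearest_sq / nearest_sq.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

# Hypothetical demo data: three blobs along the diagonal
rng = np.random.default_rng(24)
demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0.0, 4.0, 8.0)])
random_start = init_random(demo, 3, rng)
pp_start = init_kmeans_pp(demo, 3, rng)
print(random_start)
print(pp_start)
```

With random seeding, two starting centroids can easily land in the same blob. The k-means++ weighting makes that less likely, which is why it tends to need fewer iterations.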

In [None]:
fit_k_sk = KMeans(n_clusters = 3, init = 'random', random_state = 24).fit(x_train)

# Add Labels to PCA
pca_2d['labels_sk'] = fit_k_sk.predict(x_train)
plt.figure(figsize=(15, 15))
sns.scatterplot(data = pca_2d, x= 0, y=1, hue = "labels_sk")

In [None]:
# score() returns the negative SSE on the given data, so we flip the sign
print(f'Comparing Score: \n sklearn KMeans {-fit_k_sk.score(x_train)} \n Homemade KMeans {k_model.total_sse}')

In [None]:
print(f'Comparing centroids: \n sklearn KMeans \n {fit_k_sk.cluster_centers_} \n\n Homemade KMeans \n{k_model._centroids}')

In [None]:
print(f'Comparing iterations: \n sklearn KMeans \n {fit_k_sk.n_iter_} \n Homemade KMeans \n{k_model.n_iter}')

Only one point was assigned to a different cluster by our algorithm than by sklearn.
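
Cluster ids are arbitrary, so two label vectors have to be matched up before counting differences. One way to check a claim like this is sketched below: try every relabelling of one result and keep the smallest number of mismatches. The helper `count_disagreements` is hypothetical, and it assumes both label vectors hold integers from `0` to `n_clusters - 1`.

```python
from itertools import permutations

import numpy as np

def count_disagreements(labels_a, labels_b, n_clusters):
    """Smallest number of differing labels over all relabellings of labels_b."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    best = len(labels_a)
    for perm in permutations(range(n_clusters)):
        # perm[j] is the new id given to cluster j of labels_b
        relabelled = np.array(perm)[labels_b]
        best = min(best, int(np.sum(relabelled != labels_a)))
    return best

# Toy example: same grouping under different ids, except for the last point
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 0])
print(count_disagreements(a, b, n_clusters=3))  # 1
```

In this notebook the call would look something like `count_disagreements(k_model._clusters, fit_k_sk.labels_, 3)`, assuming `_clusters` is an integer array of that form. Trying all permutations is fine for three clusters but grows factorially with `n_clusters`.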

## Conclusion

Above is a comparison of our homemade KMeans model with the `sklearn` KMeans. Our model is clearly not as efficient, and no one is ever going to use it, but it goes to show how easily we can demystify an algorithm with a few lines of code and little help from external libraries.