# K-Means Demo

KMeans is a basic but powerful clustering method that is optimized with an Expectation-Maximization-style procedure. It randomly selects K data points in X as initial centroids and assigns every sample to its nearest centroid. For every resulting cluster, the mean of its points is computed and becomes the new centroid. The assignment and update steps repeat until the centroids stop moving.
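The loop described above can be sketched in plain NumPy. This is only an illustrative CPU version, not cuML's implementation; the function name `lloyd_kmeans` and its parameters are made up for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not the cuML code)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every sample to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs, 50 samples each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
```

The all-pairs distance matrix keeps the sketch short but uses O(n_samples × k) memory per feature, so it is meant for small toy inputs only.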

cuML’s KMeans supports the scalable KMeans++ initialization method, which is more stable than randomly selecting K points.
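To give a feel for why this kind of seeding is more stable, here is a minimal sketch of the classic sequential k-means++ rule, in which each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. This is not the scalable variant that cuML uses; the helper name `kmeans_plus_plus_init` is hypothetical.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    """Classic k-means++ seeding (illustrative; assumes X has distinct rows)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # The first centroid is drawn uniformly at random.
    centroids = [X[rng.integers(n)]]
    # Squared distance from each sample to its nearest chosen centroid.
    closest_sq = ((X - centroids[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Far-away samples are proportionally more likely to be picked.
        probs = closest_sq / closest_sq.sum()
        idx = rng.choice(n, p=probs)
        centroids.append(X[idx])
        new_sq = ((X - X[idx]) ** 2).sum(axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centroids)

# Three toy blobs; the seeds tend to land in different blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0.0, 4.0, 8.0)])
seeds = kmeans_plus_plus_init(X, k=3)
```

Because an already chosen sample has zero distance to itself, it cannot be drawn again, so the returned seeds are always distinct samples of `X`.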
    
The model accepts array-like input either on the host (NumPy arrays) or on the device (Numba device arrays or any object that is `__cuda_array_interface__`-compliant), as well as cuDF DataFrames.

For additional information on cuML's KMeans implementation, refer to the [cuML documentation](https://rapidsai.github.io/projects/cuml/en/latest/index.html).
    

In [None]:
import numpy as np

import pandas as pd
import cudf as gd

from sklearn import datasets

from sklearn.metrics import adjusted_rand_score

%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans as skKMeans
from cuml.cluster import KMeans as cumlKMeans

## Generate Data

In [None]:
n_samples = 100000
n_features = 2

# random seed shared by the scikit-learn and cuML models below
rs = 7

In [5]:
data, labels = datasets.make_blobs(
   n_samples=n_samples, n_features=n_features, centers=5, random_state=7)

## Fit Scikit-learn model

In [None]:
%%time
# n_jobs is omitted: recent scikit-learn releases no longer accept it for KMeans
kmeans_sk = skKMeans(n_clusters=5, random_state=rs)
kmeans_sk.fit(data)

## Fit cuML Model

In [None]:
%%time
# make_blobs returns a NumPy array, so wrap it in a pandas DataFrame first
device_data = gd.from_pandas(pd.DataFrame(data))

In [None]:
%%time
# n_gpu is omitted: a single GPU is the default, and newer cuML releases dropped the argument
kmeans_cuml = cumlKMeans(n_clusters=5, random_state=rs)
kmeans_cuml.fit(device_data)

## Visualize Centroids

In [None]:
fig = plt.figure(figsize=(16, 10))
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='viridis')

#plot the sklearn kmeans centers with blue filled circles
centers_sk = kmeans_sk.cluster_centers_
plt.scatter(centers_sk[:,0], centers_sk[:,1], c='blue', s=100, alpha=.5)

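# plot the cuml kmeans centers with red circle outlines
# (assumes cluster_centers_ is a cuDF DataFrame; copy it to the host for matplotlib)
centers_cuml = kmeans_cuml.cluster_centers_.to_pandas().values
plt.scatter(centers_cuml[:, 0], centers_cuml[:, 1], facecolors='none', edgecolors='red', s=100)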

plt.title('cuml and sklearn kmeans clustering')

plt.show()

## Compare Results

In [None]:
%%time
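# assumes labels_ is a cuDF Series; copy it to the host before handing it to scikit-learn
cuml_score = adjusted_rand_score(labels, kmeans_cuml.labels_.to_pandas())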
sk_score = adjusted_rand_score(labels, kmeans_sk.labels_)

In [None]:
threshold = 1e-5

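# compare the absolute difference so that a lower cuML score cannot pass by accident
passed = abs(cuml_score - sk_score) < threshold
message = 'compare kmeans: cuml vs sklearn adjusted Rand scores are ' + ('equal' if passed else 'NOT equal')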
print(message)
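The comparison above goes through the adjusted Rand score rather than checking `labels_` element by element, because cluster IDs are arbitrary: two runs can find the same partition while numbering the clusters differently. The sketch below computes a pair-counting adjusted Rand index with NumPy only to illustrate that property. The helper `adjusted_rand_index` is written for this example and is not the scikit-learn function used in the notebook.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI from the contingency table of two labelings."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # Map arbitrary label values to 0..k-1 so they can index a table.
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)

    def pairs(counts):
        # Number of unordered pairs that can be drawn from each count.
        return counts * (counts - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    total = pairs(np.int64(labels_a.size))
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

truth = np.array([0, 0, 1, 1, 2, 2])
renamed = np.array([2, 2, 0, 0, 1, 1])  # same partition, different cluster IDs
elementwise_match = (truth == renamed).mean()
ari = adjusted_rand_index(truth, renamed)
```

Here `elementwise_match` is 0 even though the two labelings describe identical clusters, while `ari` is 1, which is why the notebook compares scores instead of raw labels.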