# K-Means Clustering

KMeans is a basic but powerful clustering method that is optimized with an Expectation Maximization-style procedure.

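It randomly selects K data points in X as the initial centroids and assigns each sample to the closest one. For every resulting cluster of points, a mean is computed, and this becomes the new centroid. These two steps repeat until the assignments stop changing or an iteration limit is reached.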
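The assign-and-update loop described above can be sketched in plain Python. This is only a minimal illustration under simple assumptions (Euclidean distance, tuples as points), not cuML's GPU implementation; the names `squared_distance`, `assign`, `update`, `kmeans_sketch`, and the toy points are hypothetical and exist only for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members.

    An empty cluster keeps its previous centroid (an assumption of this sketch).
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Random initialization followed by repeated assign/update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K data points as starting centroids
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Two well-separated toy groups in 2D.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centroids, toy_labels = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

The cluster IDs themselves are arbitrary: which group ends up as `0` and which as `1` depends on the random starting points.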

cuML’s KMeans supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.
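To give a feel for why distance-aware seeding tends to be more stable, here is a rough sketch of the classic sequential k-means++ seeding rule, where each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. cuML's scalable variant is related to this idea but is not reproduced here; `kmeans_plus_plus_init` and the toy data are hypothetical names used only for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus_init(points, k, seed=0):
    """Pick k starting centroids using the sequential k-means++ rule."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        # Points that are already centroids get weight 0 and cannot be picked again.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
init_centroids = kmeans_plus_plus_init(toy_points, k=2)
print(init_centroids)
```

Because far-away points carry much larger weights, the second centroid is very likely (though not guaranteed) to land in a different group from the first.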

[Docs](https://docs.rapids.ai/api/cuml/stable/api.html#k-means-clustering)

## Data Prep

In [None]:
import cudf

In [None]:
df = cudf.read_csv('https://github.com/gumdropsteve/datasets/raw/master/iris.csv')

In [None]:
df.tail(3)

#### Visualize 

Let's see what the model is working with. We can plot the samples, colored by their `target` class, with Matplotlib (through the pandas plotting API).

In [None]:
df.to_pandas().plot(kind='scatter', x='petal_width', y='sepal_length', c='target', cmap='spring', sharex=False)

Let's take 80% of our data for training, and leave the other 20% for testing.

Clustering will be based on 4 features (`sepal_length`, `sepal_width`, `petal_length`, `petal_width`), and the accuracy of the clusters can be determined by comparing them to the `target` (y).

In [None]:
# older cuML releases exposed this under cuml.preprocessing.model_selection
from cuml.model_selection import train_test_split

In [None]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

y = df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

As the purpose of the model is to cluster, we won't be using the `y_train` set and can delete it. We will still use `y_test` to determine how accurate the model is at predicting those clusters (groups) by comparing the model's predictions (`preds`) on the `X_test` set with those values (more on this later).

In [None]:
del y_train

## cuML KMeans

Fit the model with our training data. Fitting repeatedly assigns samples to centroids and updates those centroids, but on a dataset this small it takes almost no time at all.

The model accepts array-like input: host NumPy arrays, device arrays (Numba or other cuda_array_interface-compliant objects), and cuDF DataFrames.


In [None]:
from cuml.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=3)

In [None]:
%%time

kmeans.fit(X_train)

### Make Predictions
Our model was fit on 120 different flowers. Let's give it 30 more and see which cluster it assigns each one to.
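Conceptually, predicting with a fitted K-Means model means finding the nearest learned centroid for each new sample. The sketch below shows that step in plain Python with made-up centroids; `predict_sketch` and the example values are hypothetical and are not taken from the model fit in this notebook.

```python
def predict_sketch(points, centroids):
    """Return the index of the nearest centroid for each point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


example_centroids = [(0.0, 0.0), (10.0, 10.0)]      # pretend these were learned
new_points = [(1.0, 1.0), (9.0, 8.0), (4.0, 4.0)]   # unseen samples
example_preds = predict_sketch(new_points, example_centroids)
print(example_preds)
```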

In [None]:
preds = kmeans.predict(X_test)

In [None]:
preds.tail()

## Score Results
We can score our model with cuML's `adjusted_rand_score`, which is a [Rand index](https://en.wikipedia.org/wiki/Rand_index) adjusted for chance.

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:
```python
ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
```
The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).
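For intuition, the adjusted score can be computed by hand from pair counts. The sketch below uses the standard contingency-table formulation of the ARI in pure Python; it is an illustration only and not cuML's implementation, and `adjusted_rand_index` and the example labels are hypothetical names for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (needs at least two samples)."""
    n = len(labels_true)
    if n < 2 or n != len(labels_pred):
        raise ValueError("need two label sequences of equal length >= 2")

    # Pairs of samples that share a cluster in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # value expected by chance
    max_index = (sum_true + sum_pred) / 2        # largest attainable value
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (index - expected) / (max_index - expected)


# Same grouping with permuted cluster IDs scores a perfect 1.0.
same_up_to_permutation = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Groupings that disagree on every pair score below zero.
fully_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_permutation, fully_crossed)
```

The first call shows that the score ignores how cluster IDs are numbered, which is why it suits comparing K-Means output against the `target` column.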

In [None]:
from cuml.metrics import adjusted_rand_score

In [None]:
score = adjusted_rand_score(labels_true=y_test, 
                            labels_pred=preds)

In [None]:
score

Because `train_test_split` returns our `X_test` set as a `cudf.DataFrame`, we can easily add columns for the `predicted` and `actual` values.

Note: The model was never made aware of the actual clusters and came up with its own cluster IDs, so comparing the `predicted` and `actual` patterns will give you a more accurate understanding than comparing the `predicted` and `actual` values directly. _For high scores on this dataset, this usually means the 2s match up and the 1s and 0s are flipped._

In [None]:
results_df = X_test.copy()

results_df['actual'] = y_test.values
results_df['predicted'] = preds.values

results_df
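If you would rather compare values directly, one option is to relabel each predicted cluster with the actual class that occurs most often inside it. The sketch below does this with a simple majority vote on plain Python lists; `map_clusters_to_labels` and the example labels are hypothetical, and the approach assumes the clusters line up reasonably well with the classes (otherwise two clusters can map to the same class).

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(actual, predicted):
    """Map each predicted cluster ID to its most common actual label."""
    votes = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        votes[p][a] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Made-up example in which cluster IDs 0 and 1 are flipped relative to the classes.
actual = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [1, 1, 1, 0, 0, 2, 2, 2, 2]

mapping = map_clusters_to_labels(actual, predicted)
aligned = [mapping[p] for p in predicted]
matches = sum(a == b for a, b in zip(actual, aligned))
print(mapping, aligned, matches)
```

With real results, the two lists could be taken from the `actual` and `predicted` columns after moving them to the host, for example via `to_pandas()`.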