K-Means Clustering
K-means is an unsupervised learning algorithm that attempts to create labels. You input unlabeled data, and the algorithm returns possible clusters of that data. It's up to you to decide whether those clusters are meaningful.

Steps:
* Choose a number of clusters 'K'
* Randomly assign each point to a cluster
* Until clusters stop changing, repeat the following:
    * For each cluster, compute the cluster centroid by taking the mean vector of the points in the cluster.
    * Assign each data point to the cluster whose centroid is closest.
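The steps above can be sketched in plain NumPy. This is only an illustration of the loop, not what Spark's KMeans does internally. The function name `kmeans_sketch`, its parameters, the re-seeding of empty clusters, and the toy points (chosen to resemble the sample data used later) are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, seed=0, max_iter=100):
    """Hypothetical helper: naive K-means following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Randomly assign each point to one of the k clusters
    labels = rng.integers(0, k, size=len(points))
    for _ in range(max_iter):
        # Centroid = mean vector of the points currently in each cluster.
        # Assumption: an empty cluster is re-seeded with a random data point.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else points[rng.integers(len(points))]
            for j in range(k)
        ])
        # Assign each point to the cluster whose centroid is closest
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy data (assumed values): two tight groups of 3-dimensional points
points = np.array([[0.0] * 3, [0.1] * 3, [0.2] * 3,
                   [9.0] * 3, [9.1] * 3, [9.2] * 3])
labels, centroids = kmeans_sketch(points, k=2, seed=1)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initial assignment, so only the grouping itself is meaningful.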

In [1]:
# Boilerplate: locate the local Spark install and start a session
import findspark
import numpy as np
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [2]:
from pyspark.ml.clustering import KMeans

In [3]:
dataset = spark.read.format('libsvm').load('sample_kmeans_data.txt')

In [4]:
dataset.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+



In [5]:
# We don't want labels.
final_data = dataset.select('features')

In [6]:
kmeans = KMeans().setK(2).setSeed(1)

In [7]:
model = kmeans.fit(final_data)

In [8]:
# Within Set Sum of Squared Errors (WSSSE)
wsse = model.computeCost(final_data)

In [9]:
print(wsse)

0.11999999999994547
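To make the meaning of this number concrete, here is a minimal NumPy sketch of the same quantity: the sum, over all points, of the squared distance to the nearest cluster center. The point values are assumptions (the `show()` output truncates the real ones), and the centers are assumed to be the two vectors at 0.1 and 9.1 in every coordinate.

```python
import numpy as np

# Assumed data: the truncated `features` column is taken to hold these values
points = np.array([[0.0] * 3, [0.1] * 3, [0.2] * 3,
                   [9.0] * 3, [9.1] * 3, [9.2] * 3])
# Assumed cluster centers for K = 2
centers = np.array([[0.1] * 3, [9.1] * 3])

# Squared Euclidean distance from every point to every center, shape (6, 2)
sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Each point contributes the squared distance to its nearest center
wssse = sq_dist.min(axis=1).sum()
print(wssse)
```

Under these assumptions the result is approximately 0.12, in line with the value printed by `computeCost`. A lower value means the points sit closer to their assigned centers.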


In [11]:
# Get the centers
centers = model.clusterCenters()

In [12]:
# Since K = 2, these are our two centers
centers

[array([ 0.1,  0.1,  0.1]), array([ 9.1,  9.1,  9.1])]

What we really want to know is which group each data point belongs to.

In [14]:
# No train test split because this is unsupervised.
results = model.transform(final_data)

In [15]:
results.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         0|
|(3,[0,1,2],[0.1,0...|         0|
|(3,[0,1,2],[0.2,0...|         0|
|(3,[0,1,2],[9.0,9...|         1|
|(3,[0,1,2],[9.1,9...|         1|
|(3,[0,1,2],[9.2,9...|         1|
+--------------------+----------+
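The prediction column can be read as "index of the nearest center". A small NumPy sketch of that rule follows. The point values are assumed (the table truncates them), and this is not a description of how Spark's `transform` is implemented.

```python
import numpy as np

# Assumed data and centers, mirroring the truncated values shown in the tables
points = np.array([[0.0] * 3, [0.1] * 3, [0.2] * 3,
                   [9.0] * 3, [9.1] * 3, [9.2] * 3])
centers = np.array([[0.1] * 3, [9.1] * 3])

# Euclidean distance from every point to every center, shape (6, 2)
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# The predicted cluster is the index of the closest center
prediction = distances.argmin(axis=1)
print(prediction)
```

With these assumed values, the first three points map to center 0 and the last three to center 1, the same grouping as in the table above.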

