# KMeans Clustering Documentation Example

Let's work through the documentation example for clustering.

Note that we don't need the label column, which makes sense because clustering is an unsupervised ML problem.

The documentation's example is a bit peculiar in its choice of data set, but we'll explain it along the way.

Hopefully our own code and comments along the way will clarify things further.

Let's get started:

In [1]:
# Start a spark session
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("kmeans_doc").getOrCreate()

In [3]:
from pyspark.ml.clustering import KMeans

In [4]:
# Load the dataset (stored in LIBSVM format).
data = spark.read.format("libsvm").load("sample_kmeans_data.txt")

In [5]:
# EDA of the (weird) dataset.
data.show()

# 6 rows of data, with labels running from 0 to 5.
# Are they trying to say that there are 6 different clusters?
# Not necessarily: KMeans never reads the label column, and the documentation example sets the number of clusters to 2.

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+
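For readers unfamiliar with the LIBSVM text format read above, here is a minimal sketch in plain Python of how one line maps to a label plus a sparse vector. The helper name `parse_libsvm_line` and the sample line are illustrative assumptions, not Spark's API or a quote from the data file. LIBSVM indices are 1-based, while the vectors shown by Spark are 0-based.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, indices, values)."""
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        index, value = token.split(":")
        indices.append(int(index) - 1)  # convert 1-based to 0-based
        values.append(float(value))
    return label, indices, values

# Hypothetical line in the same style as the sample file.
parsed = parse_libsvm_line("1 1:0.1 2:0.1 3:0.1")
print(parsed)
```

The label is parsed but never used by KMeans, which only reads the features.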



In [6]:
unlabeled_data = data.select(["features"])
unlabeled_data.show()
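# The display looks odd, but nothing is wrong with the data: show() truncates long cells by default.
# Use unlabeled_data.show(truncate=False) to see the full vectors, or check the raw txt file in a text editor.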

+--------------------+
|            features|
+--------------------+
|           (3,[],[])|
|(3,[0,1,2],[0.1,0...|
|(3,[0,1,2],[0.2,0...|
|(3,[0,1,2],[9.0,9...|
|(3,[0,1,2],[9.1,9...|
|(3,[0,1,2],[9.2,9...|
+--------------------+
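The features above are printed in sparse-vector notation, `(size, [indices], [values])`. As a rough illustration, the following plain-Python sketch expands that notation into a dense list. The helper `sparse_to_dense` is hypothetical and not part of Spark.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = float(value)
    return dense

# The first row, (3,[],[]), stores no non-zero entries, so it is the origin.
origin = sparse_to_dense(3, [], [])
print(origin)

# A made-up vector with non-zero values only at positions 0 and 2.
example = sparse_to_dense(3, [0, 2], [1.5, 2.5])
print(example)
```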



In [7]:
# Create our model
kmeans = KMeans().setK(2).setSeed(1)
# KMeans only expects a features column:
# --> featuresCol="features" (the default)

# setK sets the number of clusters, k.
# setSeed sets the seed for the pseudo-random number generator.

In [8]:
kmeans_fitted = kmeans.fit(unlabeled_data)

### Evaluate our clustering algorithm.

In [9]:
# Within Set Sum of Squared Errors (WSSSE)
wssse = kmeans_fitted.computeCost(unlabeled_data)
# Returns the K-means cost (sum of squared distances of points to their nearest center) for this model on the
# given data. computeCost was deprecated in 2.4.0 and removed in 3.0.0; on newer versions, use
# kmeans_fitted.summary.trainingCost or ClusteringEvaluator instead.

In [10]:
print(wssse)

0.11999999999994547
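To make the cost concrete, here is a minimal plain-Python sketch that recomputes the sum of squared distances by hand. It assumes that every row repeats a single value across its three dimensions (the printed vectors are truncated, so this is a guess that happens to agree with the reported result), and it uses the two centers that the fitted model reports in this notebook.

```python
# Assumed dense points: each row repeats one value in all three dimensions.
points = [[v] * 3 for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
# Cluster centers as reported by the fitted model in this notebook.
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]

def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

# For each point, take the squared distance to its nearest center, then sum.
wssse = sum(min(squared_distance(p, c) for c in centers) for p in points)
print(wssse)
```

Under that assumption, each cluster contributes two points at squared distance 0.03 from the center, giving roughly 0.12 in total.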


In [11]:
centers = kmeans_fitted.clusterCenters()
# Get the cluster centers, represented as a list of NumPy arrays.

In [12]:
centers
# Returns 2 centers.
# Our clusters are centered at two points in 3-dimensional space.

[array([0.1, 0.1, 0.1]), array([9.1, 9.1, 9.1])]
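Once K-means has converged, each center is simply the mean of the points assigned to it. The sketch below checks this in plain Python, again under the assumption that each row repeats one value across its three dimensions. The function `centroid` is an illustrative helper, not a Spark call.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

# Assumed members of each cluster (values repeated in all three dimensions).
low_group = [[v] * 3 for v in (0.0, 0.1, 0.2)]
high_group = [[v] * 3 for v in (9.0, 9.1, 9.2)]

low_center = centroid(low_group)
high_center = centroid(high_group)
print(low_center, high_center)
```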

In [14]:
results = kmeans_fitted.transform(unlabeled_data)
# Transforms the input dataset by adding a "prediction" column: the cluster index assigned to each sample point.

In [15]:
results.show()

# The first 3 rows belong to the first cluster (prediction 0).
# The last 3 rows belong to the second cluster (prediction 1).

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         0|
|(3,[0,1,2],[0.1,0...|         0|
|(3,[0,1,2],[0.2,0...|         0|
|(3,[0,1,2],[9.0,9...|         1|
|(3,[0,1,2],[9.1,9...|         1|
|(3,[0,1,2],[9.2,9...|         1|
+--------------------+----------+
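The prediction column can be reproduced by hand: each point is assigned the index of its nearest center. Below is a minimal plain-Python sketch, assuming once more that each row repeats one value across its three dimensions. `nearest_center` is a hypothetical helper, not how Spark implements `transform`.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to the point."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, center))
        for center in centers
    ]
    return distances.index(min(distances))

# Assumed dense points and the centers reported by the fitted model.
points = [[v] * 3 for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]

predictions = [nearest_center(p, centers) for p in points]
print(predictions)
```

The resulting list matches the prediction column above: three points in cluster 0 followed by three in cluster 1.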

