Clustering Documentation Example
K-means
k-means is one of the most commonly used clustering algorithms; it partitions the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||.
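
As a rough illustration of what "clustering into a predefined number of clusters" means, here is a minimal pure-Python sketch of the basic k-means (Lloyd) iteration. It is not MLlib's implementation: it skips the k-means++ / k-means|| seeding and simply starts from two hand-picked points. The six 3-D points and the helper name `kmeans` are made up for this example.

```python
import math

def kmeans(points, centers, max_iter=20):
    """Plain Lloyd iteration; returns (centers, assignments)."""
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:  # nothing moved, so the iteration has converged
            break
        centers = new_centers
    return centers, assignments

# Illustrative points (an assumption for this sketch): two tight groups in 3-D.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
# Naive initialization: first and last point (k = 2).
centers, assignments = kmeans(points, [points[0], points[-1]])
print(centers)
print(assignments)
```

With this starting choice the first three points end up in one cluster and the last three in the other. A poor starting choice can leave plain k-means in a worse local optimum, which is the problem that k-means++ style seeding is meant to reduce.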

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

Input Columns
Param name	Type(s)	Default	Description
featuresCol	Vector	"features"	Feature vector
Output Columns
Param name	Type(s)	Default	Description
predictionCol	Int	"prediction"	Index of the predicted cluster center

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("K-means Clustering").getOrCreate()

In [3]:
from pyspark.ml.clustering import KMeans

In [4]:
data = spark.read.format("libsvm").load('sample_kmeans_data.txt')

In [5]:
data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+



In [6]:
# For k-means clustering only the features column is needed, since it is an unsupervised learning algorithm.
final_data = data.select('features')

In [7]:
# Train a k-means model with k=2 clusters and a fixed seed for reproducibility.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(final_data)

In [8]:
from pyspark.ml.evaluation import ClusteringEvaluator

In [10]:
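# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
# Note: KMeansModel.computeCost was deprecated in Spark 2.4 and removed in Spark 3.0,
# so the original call is kept commented out here for reference.
# wssse = model.computeCost(data)
# print(wssse)
# On Spark 2.4+ the training cost is available from the model summary instead:
# wssse = model.summary.trainingCost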
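
For readers who want to see what the WSSSE measures without the removed API, the following is a small pure-Python sketch: it sums, over all points, the squared Euclidean distance to the nearest center. The function name `wssse` and the six points are assumptions made for this example; the two centers are the ones this notebook reports further down.

```python
def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total

# Illustrative points (an assumption for this sketch).
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
# Cluster centers as printed by model.clusterCenters() in this notebook.
centers = [(9.1, 9.1, 9.1), (0.1, 0.1, 0.1)]

cost = wssse(points, centers)
print(cost)
```

A lower value means the points sit closer to their assigned centers. The value always shrinks as k grows, so it is more useful for comparing runs with the same k than for choosing k on its own.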

In [11]:
# Make predictions
predictions = model.transform(final_data)

# Evaluate clustering by computing the Silhouette score (the default metric of ClusteringEvaluator)
evaluator = ClusteringEvaluator()

score = evaluator.evaluate(predictions)
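
To give a feel for what the Silhouette score expresses, here is a hedged pure-Python sketch of the textbook definition using squared Euclidean distance. It is meant as an illustration only and is not guaranteed to reproduce Spark's number digit for digit. The helper names, points, and labels are assumptions for this example, and the sketch assumes at least two clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other points of the same cluster
        # (the point's distance to itself is 0, so only the divisor changes).
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(sq_dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Illustrative points and cluster labels (assumptions for this sketch).
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
labels = [1, 1, 1, 0, 0, 0]

sil = silhouette(points, labels)
print(sil)
```

Values close to 1 indicate tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.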

In [12]:
centers = model.clusterCenters()

In [13]:
centers

[array([9.1, 9.1, 9.1]), array([0.1, 0.1, 0.1])]

In [14]:
results = model.transform(final_data)

In [15]:
results.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         1|
|(3,[0,1,2],[0.1,0...|         1|
|(3,[0,1,2],[0.2,0...|         1|
|(3,[0,1,2],[9.0,9...|         0|
|(3,[0,1,2],[9.1,9...|         0|
|(3,[0,1,2],[9.2,9...|         0|
+--------------------+----------+
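
The `prediction` column holds the index of the nearest cluster center. The sketch below shows that rule in plain Python under the assumption of Euclidean distance; the function name `predict` and the second test point are hypothetical, while the centers are the ones printed by `model.clusterCenters()` above.

```python
def predict(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centers[i])),
    )

# Centers in the order reported by the notebook: index 0 and index 1.
centers = [(9.1, 9.1, 9.1), (0.1, 0.1, 0.1)]

# The all-zero vector, shown as (3,[],[]) in sparse notation.
origin_cluster = predict((0.0, 0.0, 0.0), centers)
# A hypothetical new point near the second group.
far_cluster = predict((8.5, 9.0, 9.5), centers)

print(origin_cluster, far_cluster)
```

The all-zero vector maps to index 1, which matches the first row of the table above. Cluster indices carry no meaning beyond identifying a center, so a different seed may swap them.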

