# Introduction

### K-means
K-means is one of the most commonly used clustering algorithms. It partitions the data points into a predefined number of clusters, k. The MLlib implementation includes a parallelized variant of the K-means++ initialization method called kmeans||.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
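
To make the idea concrete, here is a minimal single-machine sketch of k-means with k-means++ style seeding in plain Python. It is not MLlib's implementation (in particular it is not the parallel kmeans|| variant), and the function names `squared_distance`, `kmeans_pp_init`, and `kmeans`, as well as the six toy points, are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Pick the first seed uniformly, then each further seed with
    probability proportional to its squared distance to the nearest seed.
    Assumes the points are not all identical."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, seed=1, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical toy data: two well-separated groups of 3-D points.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
centers = sorted(kmeans(points, k=2))
print(centers)
```

On data this well separated, the two centers should settle near the means of the two groups regardless of which seeds are drawn.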

### Input Columns

| Param name | Type(s) | Default | Description |
|-------------|---------|------------|----------------|
| featuresCol | Vector | "features" | Feature vector |
       
### Output Columns
        
| Param name | Type(s) | Default | Description |
|---------------|---------|--------------|--------------------------|
| predictionCol | Int | "prediction" | Predicted cluster center |

In [1]:
# Cluster method example: start a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [2]:
from pyspark.ml.clustering import KMeans

In [3]:
# load data
dataset = spark.read.format('libsvm').load("file:///home/erin/Downloads/spark-3.0.1-bin-hadoop2.7/SparkFolder/spark/Data/sample_kmeans_data.txt")

In [13]:
# Trains a k-means model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

In [15]:
# Make predictions
predictions = model.transform(dataset)

In [18]:
# Import the evaluator used to compute the Silhouette score
from pyspark.ml.evaluation import ClusteringEvaluator


In [20]:
evaluator = ClusteringEvaluator()

In [21]:
# Evaluate clustering by computing the Silhouette score.
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.9997530305375207
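
For intuition about what this number measures, the following plain-Python sketch computes a mean Silhouette coefficient using squared Euclidean distance. It is an illustration rather than Spark's `ClusteringEvaluator` code. The six points and their labels are an assumption: they are chosen to resemble the two tight groups implied by this notebook's results, since the contents of the data file are not shown here. `squared_distance` and `silhouette_score` are hypothetical helper names.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters
    and that no two points from different clusters coincide."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        totals, counts = {}, {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i == j:
                continue  # a point is not compared with itself
            totals[other] = totals.get(other, 0.0) + squared_distance(p, q)
            counts[other] = counts.get(other, 0) + 1
        if label not in counts:
            scores.append(0.0)  # singleton cluster scores 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = totals[label] / counts[label]
        # b: mean distance to the members of the nearest other cluster
        b = min(totals[c] / counts[c] for c in counts if c != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Assumed toy data: two tight, well-separated groups of 3-D points.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
labels = [0, 0, 0, 1, 1, 1]
score = silhouette_score(points, labels)
print("Silhouette with squared euclidean distance = " + str(score))
```

A score close to 1 means each point is much nearer to its own cluster than to the neighbouring one; scores near 0 indicate overlapping clusters, and negative scores suggest points assigned to the wrong cluster.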


In [22]:
# Show the cluster centers.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[9.1 9.1 9.1]
[0.1 0.1 0.1]
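
The `prediction` column holds the index of the nearest cluster center for each row. The sketch below reproduces that lookup in plain Python using the two centers printed above; the helper name `nearest_center` and the two query points are hypothetical, and the code illustrates the idea rather than Spark's own implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Return the index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


# Centers copied from the output above, in the order they were printed.
centers = [(9.1, 9.1, 9.1), (0.1, 0.1, 0.1)]

# Hypothetical new points, not part of the original dataset.
low_point = (0.3, 0.2, 0.1)
high_point = (8.0, 9.0, 10.0)

low_cluster = nearest_center(low_point, centers)
high_cluster = nearest_center(high_point, centers)
print(low_cluster, high_cluster)
```

Cluster indices carry no meaning beyond identifying a center, so a different seed could swap which group is labelled 0 and which is labelled 1.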
