<h2 id="k-means">K-means</h2>


<p><code>KMeans</code> is implemented as an <code>Estimator</code> and generates a <code>KMeansModel</code> as the base model.</p>

<h3 id="input-columns">Input Columns</h3>

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

<h3 id="output-columns">Output Columns</h3>

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster (the nearest center)</td>
    </tr>
  </tbody>
</table>

In [1]:
# Clustering example: start a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cluster').getOrCreate()

In [2]:
# Loads data.
dataset = spark.read.format("libsvm").load("sample_kmeans_data.txt")

dataset.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+
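The `features` column above is printed in Spark's sparse-vector notation, `(size,[indices],[values])`, so `(3,[],[])` is the all-zero vector of length 3. As a minimal plain-Python sketch of how that notation expands (the helper `to_dense` is hypothetical and not part of PySpark, and the second row uses made-up values):

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The first row of the table: no stored entries, so every component is 0.0.
zero_row = to_dense(3, [], [])

# A made-up row for illustration: only positions 0 and 2 are stored.
example_row = to_dense(3, [0, 2], [1.5, 2.5])

print(zero_row)
print(example_row)
```

Positions that are not listed in `indices` are implicitly zero, which is why the first row shows empty index and value lists.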



In [3]:
# Trains a k-means model.
from pyspark.ml.clustering import KMeans

# Use k = 2 clusters; the fixed seed makes the initialization reproducible.
kmeans = KMeans().setK(2).setSeed(1)

In [4]:
# Fit the k-means model to the dataset.
model = kmeans.fit(dataset)

In [5]:
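# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost was deprecated in Spark 2.4 and removed in 3.0;
# on newer versions, model.summary.trainingCost reports the training cost.
wssse = model.computeCost(dataset)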

print("Within Set Sum of Squared Errors = " + str(wssse))

Within Set Sum of Squared Errors = 0.11999999999994547
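The WSSSE is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster center. The plain-Python sketch below recomputes it by hand. It assumes the six points are those of Spark's bundled `sample_kmeans_data.txt` (the table output truncates the values, so the exact coordinates are an assumption) and that the two centers are `[0.1, 0.1, 0.1]` and `[9.1, 9.1, 9.1]`.

```python
# Assumed data: the six points of Spark's bundled sample_kmeans_data.txt.
points = [
    [0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [9.0, 9.0, 9.0],
    [9.1, 9.1, 9.1],
    [9.2, 9.2, 9.2],
]

# Assumed cluster centers for k = 2.
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]

def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

# Each point contributes its squared distance to the closest center.
wssse = sum(min(squared_distance(p, c) for c in centers) for p in points)

print("Within Set Sum of Squared Errors = " + str(wssse))
```

Under these assumptions the result is approximately 0.12, in line with the value reported above; the last digits can differ because of floating-point rounding.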


In [6]:
# Retrieve the learned cluster centers.
centers = model.clusterCenters()


In [7]:
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[0.1 0.1 0.1]
[9.1 9.1 9.1]


In [8]:
results = model.transform(dataset)

In [9]:
results.show()

+-----+--------------------+----------+
|label|            features|prediction|
+-----+--------------------+----------+
|  0.0|           (3,[],[])|         0|
|  1.0|(3,[0,1,2],[0.1,0...|         0|
|  2.0|(3,[0,1,2],[0.2,0...|         0|
|  3.0|(3,[0,1,2],[9.0,9...|         1|
|  4.0|(3,[0,1,2],[9.1,9...|         1|
|  5.0|(3,[0,1,2],[9.2,9...|         1|
+-----+--------------------+----------+
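The `prediction` column holds the index of the cluster center closest to each row's feature vector. The rule can be sketched in plain Python as follows. The helper `nearest_center` is hypothetical (it is not a PySpark function), the centers are copied from the printed output, and the second query point is an assumed value because the table truncates the feature vectors.

```python
# Cluster centers as printed by model.clusterCenters().
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]

def nearest_center(point, centers):
    """Return the index of the center with the smallest squared distance."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, center))
        for center in centers
    ]
    return distances.index(min(distances))

# The all-zero row, shown as (3,[],[]) in the table.
low_cluster = nearest_center([0.0, 0.0, 0.0], centers)

# An assumed point near the second center.
high_cluster = nearest_center([9.0, 9.0, 9.0], centers)

print(low_cluster, high_cluster)
```

Cluster indices follow the order of the centers list, so which group is labelled 0 or 1 can change with a different seed.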

