# Clustering with KMeans

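General pipeline:

- Importing the data
- Assembling the feature vector (VectorAssembler) and indexing the class label (StringIndexer)
- Building and fitting the model
- Assigning a cluster to every row
- Evaluation

K-means is unsupervised, so this notebook fits the model and predicts on the full dataset instead of splitting it into train and test sets.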

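Hyperparameters:

- distanceMeasure: the distance metric to use, "euclidean" (default) or "cosine"
- k: the number of clusters to create (default 2)
- maxIter: the maximum number of iterations (default 20)
- predictionCol: the name of the output column that holds the assigned cluster index (default "prediction")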
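
To make these hyperparameters concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the iterative procedure that k-means is based on. It is not Spark's implementation (Spark initializes centers with k-means|| by default and also stops on a convergence tolerance). The function `lloyd_kmeans`, its `distance` and `seed` arguments, and the toy points are hypothetical and only illustrate where k, maxIter, and the distance measure enter.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lloyd_kmeans(points, k, max_iter=20, distance=euclidean, seed=0):
    """Toy Lloyd's algorithm; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Initialise the centers by sampling k distinct points (an assumption of
    # this sketch; Spark's default initialisation is k-means||).
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignments


# Hypothetical toy data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, assignments = lloyd_kmeans(points, k=2, max_iter=10)
print(assignments)
```

The returned assignments play the role of the predictionCol column: one cluster index per input row. The result depends on the initial centers, which is why a seed is fixed here.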

## Importing

In [1]:
import findspark

# findspark.init() must run before any pyspark import so Spark is on sys.path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans").getOrCreate()

In [7]:
# Only the classes used below are imported
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql.types import IntegerType

## Loading Data

In [3]:
data = spark.read.load(
    "../../data/iris.csv",
    format="csv",
    sep=",",
    header=True,
    inferSchema=True,
)

data.show(2)

+-----------+----------+-----------+----------+-----------+
|sepallength|sepalwidth|petallength|petalwidth|      class|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
+-----------+----------+-----------+----------+-----------+
only showing top 2 rows



## Data Preparation

In [4]:
# Combine the four numeric columns into the single vector column Spark ML expects
asb = VectorAssembler(
    inputCols=["sepallength","sepalwidth","petallength","petalwidth"],
    outputCol="features"
)

data = asb.transform(data)
data.show(2)

+-----------+----------+-----------+----------+-----------+-----------------+
|sepallength|sepalwidth|petallength|petalwidth|      class|         features|
+-----------+----------+-----------+----------+-----------+-----------------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
+-----------+----------+-----------+----------+-----------+-----------------+
only showing top 2 rows



In [5]:
# Encode the string class label as a numeric index; KMeans itself never reads it
ind = StringIndexer(
    inputCol="class",
    outputCol="target"
)

data = ind.fit(data).transform(data)
data.select("features", "target").show(2)

+-----------------+------+
|         features|target|
+-----------------+------+
|[5.1,3.5,1.4,0.2]|   0.0|
|[4.9,3.0,1.4,0.2]|   0.0|
+-----------------+------+
only showing top 2 rows



In [10]:
# StringIndexer outputs doubles; cast the index to an integer
data = data.withColumn("target", data["target"].cast(IntegerType()))
data.select("features", "target").show(2)

+-----------------+------+
|         features|target|
+-----------------+------+
|[5.1,3.5,1.4,0.2]|     0|
|[4.9,3.0,1.4,0.2]|     0|
+-----------------+------+
only showing top 2 rows



## Model Development and Training

In [11]:
# k=3 matches the three iris species; only the features column is used for fitting
km = KMeans(
    predictionCol="cluster",
    featuresCol="features",
    maxIter=100,
    k=3
)

model = km.fit(data)

## Predicting Clusters

In [14]:
# Append the assigned cluster index to every row of the data the model was fitted on
clusters = model.transform(data)
clusters.show(5)

+-----------+----------+-----------+----------+-----------+-----------------+------+-------+
|sepallength|sepalwidth|petallength|petalwidth|      class|         features|target|cluster|
+-----------+----------+-----------+----------+-----------+-----------------+------+-------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|     0|      1|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|     0|      1|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|     0|      1|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|[4.6,3.1,1.5,0.2]|     0|      1|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|[5.0,3.6,1.4,0.2]|     0|      1|
+-----------+----------+-----------+----------+-----------+-----------------+------+-------+
only showing top 5 rows



In [18]:
clusters.select('cluster').distinct().collect()

[Row(cluster=1), Row(cluster=2), Row(cluster=0)]
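
## Evaluation

The pipeline lists an evaluation step, but the notebook stops at the cluster assignments. Cluster IDs are arbitrary (the setosa rows above have target 0 but cluster 1), so comparing cluster to target row by row is not meaningful. One possible check is purity: the fraction of rows whose label equals the majority label of their cluster. The sketch below computes it in plain Python on hypothetical (target, cluster) pairs, not on results from this notebook; in the notebook such pairs could be obtained with `clusters.select("target", "cluster").collect()`. The helper name `purity` is illustrative and not a Spark API.

```python
from collections import Counter, defaultdict


def purity(pairs):
    """Fraction of rows whose label matches the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for target, cluster in pairs:
        by_cluster[cluster][target] += 1
    # For each cluster, count the rows that carry its most common label.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(pairs)


# Hypothetical (target, cluster) pairs, not output from the notebook.
pairs = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0), (1, 2), (2, 2), (2, 2)]
print(purity(pairs))
```

Purity needs the true labels, which a real clustering task usually lacks. When no labels are available, Spark's `pyspark.ml.evaluation.ClusteringEvaluator` computes a silhouette score from the features and the predicted clusters alone.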