# Clustering Example Code Along

We'll work through a real dataset containing measurements of three distinct seed types.

The instructor's notebook is named ```Clustering Code Along.ipynb``` (notice the horrible whitespace in the filename).

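For certain ML algorithms, it is a good idea to scale your data.

K-means groups points by distance, so features measured on larger scales can dominate the result. That is why we'll practice scaling features using PySpark! Drops in model performance can also occur with high-dimensional data (the Curse of Dimensionality).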
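To make the scaling point concrete, here is a small plain-Python sketch (not part of the notebook). The three points and the helper names `euclidean` and `scale_columns` are made up for illustration; the first coordinate stands in for a large-range feature such as area and the second for a small-range feature such as compactness.

```python
import math
from statistics import stdev


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scale_columns(points):
    """Divide every column by its sample standard deviation (no centering)."""
    stds = [stdev(col) for col in zip(*points)]
    return [tuple(x / s for x, s in zip(p, stds)) for p in points]


# Hypothetical (large-range, small-range) feature pairs.
points = [(15.0, 0.80), (15.2, 0.95), (19.0, 0.81)]

# Raw distances: the gap in the first feature swamps the second.
raw_ratio = euclidean(points[0], points[2]) / euclidean(points[0], points[1])

# After scaling, both features contribute on a comparable footing.
scaled = scale_columns(points)
scaled_ratio = euclidean(scaled[0], scaled[2]) / euclidean(scaled[0], scaled[1])

print(f"raw ratio: {raw_ratio:.2f}, scaled ratio: {scaled_ratio:.2f}")
```

On the raw values, the third point looks many times farther from the first than the second does, purely because of units. After scaling, the two distances are of similar size.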

Remember, there won't be any confusion matrix or classification test results. This is unsupervised learning!

That means we don't have the original labels to test against!

This is a common point of confusion for beginners: you can't easily check how well your clustering algorithm performed. This is a difficulty shared by all unsupervised tasks!

Let's get started!

In [1]:
# Start a Spark Session
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("cluster_code_along").getOrCreate()

In [3]:
# Read in the dataset
data = spark.read.csv("seeds_dataset.csv", header=True, inferSchema=True)

In [4]:
data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [5]:
data.head(1)[0]

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)

In [6]:
from pyspark.ml.clustering import KMeans

In [7]:
from pyspark.ml.feature import VectorAssembler

In [8]:
data.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

In [9]:
assembler = VectorAssembler(inputCols=data.columns, outputCol="features")

In [10]:
data_assembled = assembler.transform(dataset=data)

In [11]:
data_assembled.printSchema()

print(data_assembled.head(1)[0])

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22, features=DenseVector([15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]))
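Conceptually, the assembler just packs the listed columns, in order, into one vector per row. Here is a plain-Python sketch of that idea, using the first row shown above. The function name `assemble` is hypothetical, and this is not how Spark implements it.

```python
def assemble(row, input_cols):
    """Collect the values of input_cols, in order, into one feature list."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset, as printed by data.head(1)[0].
first_row = {
    "area": 15.26,
    "perimeter": 14.84,
    "compactness": 0.871,
    "length_of_kernel": 5.763,
    "width_of_kernel": 3.312,
    "asymmetry_coefficient": 2.221,
    "length_of_groove": 5.22,
}

input_cols = list(first_row.keys())
features = assemble(first_row, input_cols)
print(features)
```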


### Now work through the scaling of the data

In [12]:
from pyspark.ml.feature import StandardScaler

In [13]:
std_scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
# With the default settings (withMean=False, withStd=True), this scales each feature to unit standard deviation
# using column summary statistics from the fitted data. The mean is NOT removed unless withMean=True is set.

In [14]:
scaler_model = std_scaler.fit(data_assembled)

In [15]:
# Transform the data
data_scaled = scaler_model.transform(dataset=data_assembled)

In [16]:
data_scaled.head(1)[0]

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22, features=DenseVector([15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]), scaled_features=DenseVector([5.2445, 11.3633, 36.8608, 13.0072, 8.7685, 1.4772, 10.621]))
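Notice that the scaled values are not centered around zero: `area` went from 15.26 to 5.2445. That is consistent with dividing by a column standard deviation of roughly 2.91 and not subtracting a mean, which matches `StandardScaler`'s defaults (`withStd=True`, `withMean=False`). Below is a minimal plain-Python sketch of the same arithmetic on a made-up column. The helper name `scale_to_unit_std` and the toy values are assumptions for illustration.

```python
from statistics import mean, stdev


def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation (n - 1)."""
    s = stdev(column)
    return [x / s for x in column]


toy_column = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up values
scaled_column = scale_to_unit_std(toy_column)

# The spread becomes 1, but the mean stays well away from 0.
print("std after scaling: ", round(stdev(scaled_column), 6))
print("mean after scaling:", round(mean(scaled_column), 6))
```

Passing `withMean=True` to `StandardScaler` would additionally subtract each column's mean.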

### Train the model

In [17]:
kmeans = KMeans(featuresCol="scaled_features", k=3)
# k=3 because there are three different varieties of wheat seeds.

In [18]:
kmeans_fitted = kmeans.fit(data_scaled)

In [19]:
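# WSSSE: within-set sum of squared errors. computeCost() was removed in Spark 3.0; there, use kmeans_fitted.summary.trainingCost.
print("WSSSE: {:.4f}.".format(kmeans_fitted.computeCost(data_scaled)))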

WSSSE: 429.0756.
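WSSSE is the sum, over all points, of the squared distance from each point to its nearest cluster center; lower means tighter clusters. Here is a plain-Python sketch on made-up 2-D points. The function names `squared_distance` and `wssse` and the toy data are illustrative, and this is not Spark's implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Made-up data: two tight pairs of points and one center per pair.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
toy_centers = [(0.0, 1.0), (10.0, 11.0)]

cost = wssse(toy_points, toy_centers)
print(cost)  # 4.0 (each point is exactly 1 unit from its center)
```

Because WSSSE tends to shrink as `k` grows, it is more useful for comparing runs than as an absolute score.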


In [20]:
# Get the cluster centers:
centers = kmeans_fitted.clusterCenters()

In [21]:
centers

[array([ 4.06105916, 10.13979506, 35.80536984, 11.82133095,  7.50395937,
         3.27184732, 10.42126018]),
 array([ 6.31670546, 12.37109759, 37.39491396, 13.91155062,  9.748067  ,
         2.39849968, 12.2661748 ]),
 array([ 4.87257659, 10.88120146, 37.27692543, 12.3410157 ,  8.55443412,
         1.81649011, 10.32998598])]

### Now let's look at the groupings that were made

In [22]:
kmeans_fitted.transform(data_scaled).select(["prediction"]).show()

+----------+
|prediction|
+----------+
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         1|
|         1|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         2|
|         0|
+----------+
only showing top 20 rows
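Each prediction is the index of the cluster center closest to that row's scaled features. As a quick sanity check in plain Python, the sketch below takes the first row's `scaled_features` and the three centers printed earlier (copied by hand and rounded to four decimals) and finds the nearest one. The helper name `nearest_center` is hypothetical.

```python
def nearest_center(point, centers):
    """Return the index of the center with the smallest squared distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))


# scaled_features of the first row, from the data_scaled.head(1)[0] output.
first_scaled = [5.2445, 11.3633, 36.8608, 13.0072, 8.7685, 1.4772, 10.621]

# Cluster centers from kmeans_fitted.clusterCenters(), rounded to 4 decimals.
centers = [
    [4.0611, 10.1398, 35.8054, 11.8213, 7.5040, 3.2718, 10.4213],
    [6.3167, 12.3711, 37.3949, 13.9116, 9.7481, 2.3985, 12.2662],
    [4.8726, 10.8812, 37.2769, 12.3410, 8.5544, 1.8165, 10.3300],
]

print(nearest_center(first_scaled, centers))  # 2, matching the first prediction
```

The cluster indices themselves are arbitrary labels; they do not correspond to any particular seed type.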

