<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

<img style="float:left" src="https://storage.googleapis.com/kaggle-datasets-images/903978/1533070/57da797ac0a3334dfa9e0eda0f5559cc/dataset-cover.jpg?t=2020-10-14-15-50-13" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Lab Files](#2.1)
* [3. Clustering](#3)
* [4. Tear Down](#4)
  * [4.1 Stop Hadoop](#4.1)

<a id='0'></a>
## Description
<p>
In this notebook, we are going to use K-Means to cluster our data. 

We will be using the Iris dataset, which has labels.
    
</p>
<p>In this lab we will use Apache Spark to do unsupervised learning.</p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Practice the Spark ML API</li>
    <li>Build a K-Means model</li>
</ul>    



<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service.

<a id='1.1'></a>
### 1.1 Start Hadoop

To start Hadoop, open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required only because we are working in the course environment: `findspark` locates the Spark installation and makes `pyspark` importable.

In [None]:
import findspark
findspark.init()

Set the pandas maximum column width to `None` so that wide columns are displayed in full.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

Let's create the SparkSession first:<br/>

In [None]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName("Iris - Clustering - MLlib")
    .getOrCreate())

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check Lab Files

To complete this lab, you need to have previously uploaded the dataset into HDFS.<br/>

Check that the data is ready in HDFS:

http://localhost:50070/explorer.html#/datalake/raw/iris/
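
The same check can be scripted. The sketch below is only an illustration and not part of the course environment: it assumes the `hdfs` command-line client is on the `PATH`, and the function names are hypothetical. It relies on `hdfs dfs -test -e`, which exits with status 0 when the path exists.

```python
import shutil
import subprocess


def hdfs_test_command(path):
    """Build the `hdfs dfs -test -e` command that checks whether a path exists."""
    return ["hdfs", "dfs", "-test", "-e", path]


def hdfs_path_exists(path):
    """Return True or False if the check could run, or None if `hdfs` is not installed."""
    if shutil.which("hdfs") is None:
        return None  # the client is not available on this machine
    result = subprocess.run(hdfs_test_command(path), check=False)
    return result.returncode == 0


# Path taken from the lab; adjust it if your data lives elsewhere.
status = hdfs_path_exists("/datalake/raw/iris/")
messages = {True: "data found", False: "data missing", None: "hdfs client not found"}
print(messages[status])
```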

<a id='3'></a>
## 3. Clustering

Let's create the DataFrame

In [None]:
irisDF = (spark.read.option("header","true")
                 .option("inferSchema","true")
                 .csv("hdfs://localhost:9000/datalake/raw/iris/")
                 .cache())

print(f"There are {irisDF.count()} rows in the dataset")

In [None]:
irisDF.printSchema()

In [None]:
irisDF.limit(5).toPandas()

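Notice that we have four numeric variables that we will treat as "features". `VectorAssembler` combines them into a single vector column, which is the input format Spark ML estimators expect.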


In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["SepalLengthCm", "SepalWidthCm","PetalLengthCm","PetalWidthCm"], outputCol="features")
irisFeaturesDF = assembler.transform(irisDF)
irisFeaturesDF.limit(5).toPandas()

We'll reduce those four features to two components (for visualization purposes) using [PCA](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.PCA.html).

In [None]:
from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
irisTwoFeaturesDF = pca.fit(irisFeaturesDF).transform(irisFeaturesDF)
irisTwoFeaturesDF.limit(5).toPandas()
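
To make the reduction less of a black box, here is a minimal NumPy sketch of textbook PCA: center the data, take the eigenvectors of the covariance matrix, and project onto the two directions of largest variance. The hard-coded rows and the function name `pca_project` are for illustration only, and this is not Spark's implementation (details such as centering and component signs may differ).

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X onto the k principal components of largest variance."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    components = eigvecs[:, order]               # shape (n_features, k)
    explained = eigvals[order] / eigvals.sum()   # fraction of variance per component
    return X_centered @ components, explained


# Six Iris-like rows with four features each, hard-coded for illustration
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.4, 3.2, 4.5, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
])

projected, explained = pca_project(X, k=2)
print(projected.shape)
print(explained)
```

In Spark, the fitted `PCAModel` exposes a comparable `explainedVariance` attribute, which is worth checking before trusting a two-dimensional view of the data.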

In [None]:
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=3, seed=221, maxIter=20, featuresCol="pca_features")

model = kmeans.fit(irisTwoFeaturesDF)

In [None]:
type(model)

The model has a summary of type [KMeansSummary](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeansSummary.html).

In [None]:
model.summary.clusterSizes

In [None]:
# Obtain the clusterCenters from the KMeansModel
for center in model.clusterCenters():    
    print(center)
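
For intuition about what `fit` computes, the following is a minimal pure-Python sketch of Lloyd's algorithm, the classic K-means iteration: assign each point to its nearest center, then move each center to the mean of its points. It is illustrative only. Spark's `KMeans` uses a different initialization (`k-means||` by default) and runs distributed. The toy points and the name `lloyd_kmeans` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def lloyd_kmeans(points, k, max_iter=20, seed=221):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive initialization: k random points
    for _ in range(max_iter):
        assignments = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, assign(points, centers)


# Two well-separated toy blobs (made-up values)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments = lloyd_kmeans(points, k=2)
sizes = [assignments.count(j) for j in range(2)]
print(centers)
print(sizes)
```

Note that the loop only ever looks at the coordinates of the points, never at any label.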

Remember: K-means doesn't use the true labels when training, but we can use them to evaluate. 

In [None]:
# Use the model to transform the DataFrame by adding cluster predictions
predictions = model.transform(irisTwoFeaturesDF)
predictions.limit(5).toPandas()

In [None]:
predictions.printSchema()
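
Because Iris comes with labels, one way to evaluate the clusters is a contingency table of true label versus predicted cluster. The pure-Python sketch below uses made-up label and cluster pairs and hypothetical helper names. On the Spark side, `predictions.crosstab("Species", "prediction")` should give a similar table, assuming the label column is called `Species` (check the schema printed above).

```python
from collections import Counter


def contingency(labels, clusters):
    """Count how many rows fall into each (true label, cluster) pair."""
    return Counter(zip(labels, clusters))


def purity(labels, clusters):
    """Fraction of rows that carry the majority label of their cluster."""
    table = contingency(labels, clusters)
    best_per_cluster = {}
    for (label, cluster), count in table.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
    return sum(best_per_cluster.values()) / len(labels)


# Made-up example: cluster 2 mixes one versicolor row with two virginica rows
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
clusters = [0, 0, 1, 2, 2, 2]

table = contingency(labels, clusters)
score = purity(labels, clusters)
for (label, cluster), count in sorted(table.items()):
    print(label, cluster, count)
print(score)
```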

In the K-means clustering algorithm, the number of clusters (k) is the hyperparameter to be tuned.

There are two metrics to measure how good the clustering is:

1. **Silhouette score**: Ranges from -1 to 1. The higher the silhouette score, the better the clustering. 

2. **Within Set Sum of Squared Errors (WSSSE)**: The sum of squared distances from each point to its assigned centroid. For a fixed k, the higher the WSSSE, the worse the clustering.
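
Both metrics can be written out by hand for a tiny dataset. The sketch below is a pure-Python illustration with made-up points and hypothetical function names. It uses the squared Euclidean distance for the silhouette, which is the default distance measure of Spark's `ClusteringEvaluator`, but Spark's distributed implementation may differ in detail.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, assignments, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(squared_distance(p, centers[a]) for p, a in zip(points, assignments))


def silhouette(points, assignments):
    """Mean silhouette coefficient; requires at least two clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, assignments)):
        # Mean distance from p to the other points of every cluster
        mean_dist = {}
        for c in set(assignments):
            others = [q for j, (q, a) in enumerate(zip(points, assignments))
                      if a == c and j != i]
            if others:
                mean_dist[c] = sum(squared_distance(p, q) for q in others) / len(others)
        if own not in mean_dist:
            scores.append(0.0)  # a point alone in its cluster scores 0
            continue
        a_i = mean_dist[own]                                    # cohesion
        b_i = min(d for c, d in mean_dist.items() if c != own)  # separation
        scores.append((b_i - a_i) / max(a_i, b_i))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]   # matches the two blobs
bad = [0, 1, 0, 1, 0, 1]    # mixes the two blobs
centers = [(1 / 3, 1 / 3), (31 / 3, 31 / 3)]  # means of the two blobs

cost = wssse(points, good, centers)
print(cost)
print(silhouette(points, good), silhouette(points, bad))
```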

In [None]:
from pyspark.ml.evaluation import ClusteringEvaluator

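# Score the clusters in the same space the model was trained on
# (the default featuresCol is "features", i.e. the original four columns)
evaluator = ClusteringEvaluator(featuresCol="pca_features")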

# Evaluate clustering by computing Silhouette score
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

# Evaluate clustering by computing Within Set Sum of Squared Errors.
WSSSE = model.summary.trainingCost
print(f"Within Set Sum of Squared Errors = {WSSSE}")
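
Since k is the hyperparameter to tune, a natural next step is to repeat the fit for several values of k and compare the scores. The sketch below is one possible way to do that, not the lab's prescribed method: `sweep_k` and `best_k_by_silhouette` are hypothetical helpers, the column name `pca_features` and the seed are taken from this notebook, and the scores in the example dictionary are made-up placeholders rather than real results.

```python
def sweep_k(df, ks, features_col="pca_features", seed=221):
    """Fit one KMeans model per k and collect (silhouette, WSSSE) for each."""
    # Imported here so the helper below stays usable without a Spark installation.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator

    evaluator = ClusteringEvaluator(featuresCol=features_col)
    results = {}
    for k in ks:
        model = KMeans(k=k, seed=seed, featuresCol=features_col).fit(df)
        silhouette = evaluator.evaluate(model.transform(df))
        results[k] = (silhouette, model.summary.trainingCost)
    return results


def best_k_by_silhouette(results):
    """Return the k with the highest silhouette score."""
    return max(results, key=lambda k: results[k][0])


# Placeholder scores to show the expected shape: k -> (silhouette, WSSSE)
fake_results = {2: (0.60, 150.0), 3: (0.70, 80.0), 4: (0.55, 60.0)}
print(best_k_by_silhouette(fake_results))
```

In the notebook this could be called as `sweep_k(irisTwoFeaturesDF, range(2, 7))`. Keep in mind that WSSSE generally keeps decreasing as k grows, so it is usually read by looking for the "elbow" where the improvement flattens rather than by picking its minimum.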


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array

df = predictions.select((vector_to_array(col('pca_features'))[0]).alias('x'),
                        (vector_to_array(col('pca_features'))[1]).alias('y'),
                         col('prediction').alias('label')).toPandas()


In [None]:
clusters = df['label'].unique()

centroids = model.clusterCenters()
  
fig = plt.figure()
ax = fig.add_subplot(111)

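# Plot the points of each cluster in its own color
for i in sorted(clusters):
    t = df.loc[df['label']==i]
    ax.scatter(x=t['x'],y=t['y'],label=f"cluster {i}")

# Overlay the cluster centers in black
for c in centroids:    
    ax.scatter(x=c[0],y=c[1],c='black')

ax.set_xlabel('PCA component 1')
ax.set_ylabel('PCA component 2')
ax.legend()
plt.show()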

<a id='4'></a>
## 4. Tear Down

Once we complete the lab, we can stop all the services.

<a id='4.1'></a>
### 4.1 Stop Hadoop

To stop Hadoop, open a terminal and execute:
```sh
hadoop-stop.sh
```