# Part I: K-means Clustering

In this first part of the lab session, we load a dataset, run the k-means clustering algorithm, and use the `display` command to visualize the results.

### Alban de Crevoisier

## 1. Load a dataset

In [3]:
# Import the bundled example datasets from scikit-learn
from sklearn import datasets
from pyspark.mllib.linalg import Vectors

def _convert_vec(vec):
  return Vectors.dense([float(x) for x in vec])

def convert_bunch(bunch):
  n = len(bunch.data)
  df = sqlContext.createDataFrame([(_convert_vec(bunch.data[i]), float(bunch.target[i])) for i in range(n)])
  return df.withColumnRenamed("_1", "features").withColumnRenamed("_2", "label")

diabetes = datasets.load_diabetes()
df = convert_bunch(diabetes)
df.registerTempTable("diabetes")

df = convert_bunch(datasets.load_iris())
df.registerTempTable("iris")

## 2. Run K-Means Clustering Algorithm

In [5]:
from pyspark.mllib.clustering import KMeans

# Load and parse the data
data = sql("select * from iris")

# MLlib's RDD-based API expects an RDD of arrays of doubles, so we unpack the `features` column of the DataFrame.
features = data.rdd.map(lambda r: r.features.array)

# Build the model (cluster the data).
# Valid initialization modes are "random" and "k-means||".
model = KMeans.train(features, k=3, seed=1, maxIterations=10,
                     initializationMode="random")

## 3. Evaluation

In [7]:
# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(features)
print("Within Set Sum of Squared Errors = " + str(wssse))
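As a sanity check on what `computeCost` reports, the metric can be sketched in plain Python: each point contributes its squared Euclidean distance to the nearest cluster center. The helper names and the toy points below are made up for illustration and are not part of MLlib.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_set_sse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical toy data: two tight groups and one center per group.
toy_points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
toy_centers = [(1.0, 1.5), (8.5, 8.0)]

toy_cost = within_set_sse(toy_points, toy_centers)
print("Toy WSSSE = " + str(toy_cost))
```

Each of the four toy points lies 0.5 away from its nearest center, so every point contributes 0.25 to the total.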


## 4. Visualize Results

The command for visualizing clusters from a K-Means model is:

  ```
    display(
      model: KMeansModel,
      data: DataFrame
    )
  ```
  
This visualization creates a grid of `numFeatures` × `numFeatures` plots using a sample of the data. Each plot in the grid corresponds to a pair of features, with data points colored by their cluster labels. If the feature vector has more than 10 dimensions, only the first 10 features are displayed.

Parameters:
 - `model`: the fitted cluster model (a `KMeansModel`; this notebook passes the one returned by `pyspark.mllib.clustering.KMeans.train`).
 - `data`: the points that will be matched against the clusters. This DataFrame is expected to have a `features` column that contains vectors of doubles (the feature representation of each point).

In [9]:
display(model, data)

feature0,feature1,feature2,feature3,cluster
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
4.6,3.4,1.4,0.3,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3.0,1.4,0.1,0
4.3,3.0,1.1,0.1,0


## 5. Experimental Evaluation

Now, we are going to use different parameter values to build the KMeans model. We are going to check how different values can change the results of the evaluation.

### 1. Change seeds

Let's change the parameter "seed=1" to "seed=2", "seed=3". Let's compute the Sum of Squared Errors for each one of the seeds. Are the results different?

### 2. Change initialization mode

Let's change the parameter *initializationMode="random"* to *initializationMode="k-means||"*. This mode is a parallelized variant of the k-means++ initialization. Let's compute the Sum of Squared Errors for each of the two initializations. Which initialization method gives the better result?

### 3. Change number of Iterations

Let's change the max number of iterations from 10 to 20. Is there any change in the Sum of Squared Errors?

### 4. Change number of Clusters

What happens if we change the number of clusters? 

### 5. Repeat the experimental evaluation using the diabetes dataset

### 1. Change seeds

In [12]:
# Seed = 2
model = KMeans.train(features, k=3, seed=2, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("Seed=2, wssse = " + str(wssse))

# Seed = 3
model = KMeans.train(features, k=3, seed=3, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("Seed=3, wssse = " + str(wssse))

The results are exactly the same.
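Identical costs across seeds are plausible when the groups are well separated, because different random starts can still converge to the same partition. The following is a minimal pure-Python sketch of Lloyd's algorithm with seeded random initialization, run on made-up toy points. It is a stand-in for illustration, not MLlib's implementation, and `lloyd_kmeans` and `toy_points` are hypothetical names.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, seed, max_iterations=10):
    """Plain Lloyd's algorithm; initial centers are k points drawn with `seed`."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: nothing moved
        centers = new_centers
    cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, cost


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

costs_by_seed = {seed: lloyd_kmeans(toy_points, k=2, seed=seed)[1]
                 for seed in (1, 2, 3)}
for seed, cost in costs_by_seed.items():
    print("Seed=" + str(seed) + ", toy wssse = " + str(cost))
```

On data this cleanly separated, every choice of two starting points ends in the same two groups, so the seed has no visible effect on the final cost.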

### 2. Change initialization method

In [15]:
# Initialization = random
model = KMeans.train(features, k=3, seed=2, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("Random initialization, wssse = " + str(wssse))

# Initialization = k-means||
model = KMeans.train(features, k=3, seed=2, maxIterations=10,
                     initializationMode="k-means||")
wssse = model.computeCost(features)
print("k-means|| initialization, wssse = " + str(wssse))

Again, the results are the same, which is a bit surprising, to say the least.
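For reference, the idea behind k-means++-style seeding is to pick each new center with probability proportional to its squared distance from the centers chosen so far, which tends to spread the starting centers out. The sketch below is the sequential version in plain Python, with hypothetical names and toy data. MLlib's `k-means||` mode is a parallel variant of this idea, so this is not its actual implementation.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed):
    """Choose k starting centers by sequential k-means++ (D^2) sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center;
        # already-chosen points get weight 0 and cannot be picked again.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical toy data: six distinct points in two groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

initial_centers = kmeans_pp_init(toy_points, k=3, seed=2)
print(initial_centers)
```

A plain random start gives every point the same chance of being chosen, whereas this weighting makes far-away points much more likely to become the next center.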

### 3. Change the number of iterations

In [18]:
# 20 iterations
model = KMeans.train(features, k=3, seed=2, maxIterations=20,
                     initializationMode="random")
wssse = model.computeCost(features)
print("20 iterations, wssse = " + str(wssse))

Again, exactly the same results. It looks like the data is very easily partitioned, which I find doubtful given how clusters 1 and 2 look on the plots.
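One way to check whether extra iterations can matter is to record the cost after every iteration. Lloyd's updates never increase WSSSE, and once the assignments stop changing the cost stays flat, so raising `maxIterations` past that point has no effect. Below is a minimal pure-Python sketch with made-up points and a deliberately poor starting guess (all names are hypothetical, and this is not MLlib code).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost_trace(points, initial_centers, max_iterations):
    """Run Lloyd iterations and return the WSSSE measured after each one."""
    centers = list(initial_centers)
    k = len(centers)
    trace = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: cluster means (an empty cluster keeps its old center).
        centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        trace.append(sum(min(squared_distance(p, c) for c in centers)
                         for p in points))
    return trace


# Hypothetical toy data and a poor start: both centers in the same group.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
poor_start = [(0.0, 1.0), (1.0, 0.0)]

trace_10 = cost_trace(toy_points, poor_start, max_iterations=10)
trace_20 = cost_trace(toy_points, poor_start, max_iterations=20)
print(trace_10[:4])
print("10 iterations:", trace_10[-1], "| 20 iterations:", trace_20[-1])
```

If the trace is already flat well before iteration 10, an unchanged cost at 20 iterations says that the run converged early, not that the clusters are necessarily well separated.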

### 4. Change the number of clusters

In [21]:
# 5 clusters
model = KMeans.train(features, k=5, seed=2, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("5 clusters, wssse = " + str(wssse))

The error is severely degraded, which is expected: if the error did not change in the previous experiments, it can only be because the data clusters easily and reliably into 3 clusters.
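A note for comparing WSSSE across different values of k: adding centers to a fixed set of centers can never raise WSSSE, because each point's distance to its nearest center can only shrink or stay the same, and with one center on every point the value reaches 0. The sketch below checks this on made-up points and hand-picked centers (hypothetical names, not the output of `KMeans.train`).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_set_sse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical toy data with three hand-picked centers.
toy_points = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0), (5.0, 7.0), (9.0, 0.0)]
three_centers = [(0.0, 1.0), (5.0, 6.0), (9.0, 0.0)]

# Keep the same three centers and add two more.
five_centers = three_centers + [(0.0, 0.0), (5.0, 7.0)]

cost_3 = within_set_sse(toy_points, three_centers)
cost_5 = within_set_sse(toy_points, five_centers)
cost_n = within_set_sse(toy_points, toy_points)  # one center per point

print("3 centers:", cost_3)
print("5 centers:", cost_5)
print("one center per point:", cost_n)
```

Because of this, a raw WSSSE comparison between k=3 and k=5 mostly reflects the extra centers, so it is usually read as a curve over several values of k instead of a single pair of numbers.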

### 5. Repeat with the diabetes dataset

In [24]:
diabetes = datasets.load_diabetes()
df = convert_bunch(diabetes)
df.registerTempTable("diabetes")


# Load and parse the data
data = sql("select * from diabetes")

# MLlib's RDD-based API expects an RDD of arrays of doubles, so we unpack the `features` column of the DataFrame.
features = data.rdd.map(lambda r: r.features.array)

# Build the model (cluster the data)
model = KMeans.train(features, k=3, seed=1, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Seed = 2
model = KMeans.train(features, k=3, seed=2, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("Seed=2, wssse = " + str(wssse))

# Seed = 3
model = KMeans.train(features, k=3, seed=3, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("Seed=3, wssse = " + str(wssse))

# Initialization = k-means||
model = KMeans.train(features, k=3, seed=2, maxIterations=10,
                     initializationMode="k-means||")
wssse = model.computeCost(features)
print("k-means|| initialization, wssse = " + str(wssse))

# 20 iterations
model = KMeans.train(features, k=3, seed=2, maxIterations=20,
                     initializationMode="random")
wssse = model.computeCost(features)
print("20 iterations, wssse = " + str(wssse))

# 5 clusters
model = KMeans.train(features, k=5, seed=2, maxIterations=10,
                     initializationMode="random")
wssse = model.computeCost(features)
print("5 clusters, wssse = " + str(wssse))

This time we observe differences, so the data must not be as clearly separable.