# Part I: K-means Clustering

In this first part of the lab, we load a dataset, run the K-means clustering algorithm, and use the `display` command to visualize the results.

## 1. Load a dataset

In [3]:
# Imports datasets from scikit-learn
from sklearn import datasets, linear_model
from pyspark.mllib.linalg import Vectors

def _convert_vec(vec):
  return Vectors.dense([float(x) for x in vec])

def convert_bunch(bunch):
  n = len(bunch.data)
  df = sqlContext.createDataFrame([(_convert_vec(bunch.data[i]), float(bunch.target[i])) for i in range(n)])
  return df.withColumnRenamed("_1", "features").withColumnRenamed("_2", "label")

diabetes = datasets.load_diabetes()
df = convert_bunch(diabetes)
df.registerTempTable("diabetes")

df = convert_bunch(datasets.load_iris())
df.registerTempTable("iris")

## 2. Run K-Means Clustering Algorithm

In [5]:
from pyspark.mllib.clustering import *

# Load and parse the data
data = sql("select * from iris")

# Because the MLlib package requires RDDs of arrays of doubles, we need to unpack the content of the DataFrame.
features = data.rdd.map(lambda r: r.features.array)

# Build the model (cluster the data)
# Valid initialization modes are "random" and "k-means||" (the default).
model = KMeans.train(features, k=3, seed=1, maxIterations=10,
                     initializationMode="random")

## 3. Evaluation

In [7]:
# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(features)
print("Within Set Sum of Squared Errors = " + str(wssse))
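
For reference, the quantity reported by `computeCost` is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster center. A minimal plain-Python sketch of that computation follows; the helper name `wssse` and the toy points and centers are illustrative assumptions, not part of the MLlib API.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (illustrative only): two tight groups, one center per group.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_centers = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at squared distance 1.0 from its nearest center.
print("Within Set Sum of Squared Errors = " + str(wssse(toy_points, toy_centers)))
```

A lower value means the points sit closer to their assigned centers, which is why the experiments below compare this number across parameter settings.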


## 4. Visualize Results

The command for visualizing clusters from a K-Means model is:

  ```
    display(
      model: KMeansModel,
      data: DataFrame
    )
  ```
  
This visualization creates a grid plot of numFeatures x numFeatures using a sample of the data. Each plot in the grid corresponds to two features, with data points colored by their cluster labels. If the feature vector has more than 10 dimensions, only the first 10 features are displayed.

Parameters:
 - `model`: the trained clustering model (here, the `KMeansModel` returned by `pyspark.mllib.clustering.KMeans.train`)
 - `data`: the points that will be matched against the clusters. This DataFrame is expected to have a `features` column that contains vectors of doubles (the feature representation of each point).
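
`display` is specific to the notebook environment, so the following is only a rough plain-Python sketch of the data behind such a grid, not its implementation: each point is labeled with the index of its nearest center, and each panel corresponds to one pair of feature indices. The function names, the center values, and the choice to enumerate only distinct feature pairs are assumptions made for illustration.

```python
from itertools import combinations

MAX_FEATURES = 10  # the text above says only the first 10 features are shown


def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))


def grid_panels(num_features):
    """Pairs of distinct feature indices, one pair per scatter panel."""
    shown = min(num_features, MAX_FEATURES)
    return list(combinations(range(shown), 2))


# Two iris-like rows and two hypothetical centers (values are illustrative).
rows = [(5.1, 3.5, 1.4, 0.2), (6.3, 3.3, 6.0, 2.5)]
centers = [(5.0, 3.4, 1.5, 0.2), (6.6, 3.0, 5.6, 2.0)]

labels = [nearest_center(r, centers) for r in rows]
print(labels)          # cluster index used to color each point
print(grid_panels(4))  # feature pairs plotted for a 4-feature dataset
```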

In [9]:
display(model, data)

feature0,feature1,feature2,feature3,cluster
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
4.6,3.4,1.4,0.3,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3.0,1.4,0.1,0
4.3,3.0,1.1,0.1,0


## 5. Experimental Evaluation

Now we build the KMeans model with different parameter values and check how they change the results of the evaluation.

### 1. Change seeds

Let's change the parameter `seed=1` to `seed=2` and then `seed=3`, and compute the Sum of Squared Errors for each seed. Are the results different?

### 2. Change initialization mode

Let's change the parameter *initializationMode="random"* to *initializationMode="k-means||"*. The `k-means||` mode is a parallelized variant of the k-means++ initialization. Let's compute the Sum of Squared Errors for each of the two initializations. Which initialization method is better?

### 3. Change number of Iterations

Let's change the max number of iterations from 10 to 20. Is there any change in the Sum of Squared Errors?

### 4. Change number of Clusters

What happens if we change the number of clusters? 

### 5. Repeat the experimental evaluation using the diabetes dataset

### 1. Change seeds

Let's change the parameter `seed=1` to `seed=2` and then `seed=3`, and compute the Sum of Squared Errors for each seed. Are the results different?

In [12]:
def change_the_seed(seed, features=features):
  # Train with the given seed and report the resulting WSSSE.
  model = KMeans.train(features, k=3, seed=seed, maxIterations=10,
                       initializationMode="random")
  wssse = model.computeCost(features)
  print("Within Set Sum of Squared Errors = " + str(wssse))
  return str(wssse)

In [13]:
for i in range(1,4):
  change_the_seed(i)

The results are very similar. The seed is the random seed used for cluster initialization (as shown [here](https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/mllib/clustering/KMeans.html)).
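
To see why the seed can matter in principle, here is a small plain-Python sketch of seeded random initialization (not MLlib code): the same seed always picks the same starting centers, while a different seed may pick others. The function `random_init` and the toy points are hypothetical.

```python
import random


def random_init(points, k, seed):
    """Pick k distinct points as starting centers, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(points, k)


# Twenty distinct toy points (illustrative only).
toy_points = [(float(i), float(i % 3)) for i in range(20)]

first = random_init(toy_points, 3, seed=1)
again = random_init(toy_points, 3, seed=1)
other = random_init(toy_points, 3, seed=2)

print(first == again)  # the same seed reproduces the same starting centers
print(first)
print(other)
```

Different starting centers can lead the iterations to different local optima, which is why the error may vary with the seed. On a small, well-separated dataset the runs often end at the same or nearly the same solution.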

### 2. Change initialization mode

Let's change the parameter initializationMode="random" to initializationMode="k-means||". The `k-means||` mode is a parallelized variant of the k-means++ initialization. Let's compute the Sum of Squared Errors for each of the two initializations. Which initialization method is better?

In [16]:
def change_the_init(init,features=features):
  model = KMeans.train(features, k=3, seed=1, maxIterations=10,
                       initializationMode=init)
  wssse = model.computeCost(features)
  print("Within Set Sum of Squared Errors = " + str(wssse))
  return str(wssse)

In [17]:
# Valid modes are "random" and "k-means||"; "k-means||" is the default.
change_the_init("random")
change_the_init("k-means||")

We obtained the same error for the two initialization methods.
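
The two modes differ only in how the starting centers are chosen. The sketch below shows sequential k-means++ seeding in plain Python, the scheme that `k-means||` parallelizes; it is not Spark's implementation, and `kmeans_pp_init` is a hypothetical helper.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed):
    """k-means++ seeding: the first center is uniform at random; each later
    center is drawn with probability proportional to its squared distance
    from the nearest center chosen so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Three well-separated toy groups (illustrative data).
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.1, 0.0)]
print(kmeans_pp_init(toy_points, 3, seed=1))
```

Because far-away points are favored, the starting centers tend to be spread across the data, whereas uniform random selection can place several of them in the same group.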

### 3. Change number of Iterations

Let's change the max number of iterations from 10 to 20. Is there any change in the Sum of Squared Errors?

In [20]:
def change_the_max_iter(max_iter, features=features):
  # Train with the given iteration cap and report the resulting WSSSE.
  model = KMeans.train(features, k=3, seed=1, maxIterations=max_iter,
                       initializationMode="random")
  wssse = model.computeCost(features)
  print("Within Set Sum of Squared Errors = " + str(wssse))
  return str(wssse)

In [21]:
change_the_max_iter(10)
change_the_max_iter(20)

We obtained the same error for both `maxIterations` values, which suggests that the algorithm had already converged within the first 10 iterations.
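
One way to see why a larger iteration cap may change nothing is that the assign/update iterations stop improving once the centers stabilize. The following plain-Python sketch (an illustration, not MLlib's implementation; all names and data are hypothetical) runs that loop on toy data and records the cost after each iteration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points, centers, max_iterations):
    """Run assign/update iterations; return final centers and per-iteration costs."""
    costs = []
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            dists = [squared_distance(p, c) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        costs.append(sum(min(squared_distance(p, c) for c in new_centers)
                         for p in points))
        if new_centers == centers:  # converged: more iterations change nothing
            break
        centers = new_centers
    return centers, costs


toy_points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
start = [(0.0,), (1.0,)]  # deliberately poor starting centers

_, costs_10 = lloyd(toy_points, start, max_iterations=10)
_, costs_20 = lloyd(toy_points, start, max_iterations=20)
print(costs_10)
print(costs_10[-1] == costs_20[-1])
```

On this toy data the loop stops well before either cap is reached, so both settings end with the same cost.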

### 4. Change number of Clusters

What happens if we change the number of clusters?

In [24]:
def change_the_k(k, features=features):
  # Train with k clusters and return the resulting WSSSE as a float.
  model = KMeans.train(features, k=k, seed=1, maxIterations=10,
                       initializationMode="random")
  wssse = model.computeCost(features)
  print("Within Set Sum of Squared Errors = " + str(wssse))
  return wssse

In [25]:
def find_optimal_k(features=features):
  # Increase k until the WSSSE stops decreasing; return the last k before
  # the first increase. Each k is trained only once.
  k = 1
  old = change_the_k(k, features=features)
  while True:
    new = change_the_k(k + 1, features=features)
    if new > old:
      break
    old = new
    k += 1
  print(k)
  return k

find_optimal_k()

The computation shows that the error decreases as k goes from 1 to 15 and then increases.
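
Because the error tends to keep shrinking as k grows, a common alternative to stopping at the first increase is the elbow heuristic: pick the k after which the improvement drops off sharply. A minimal sketch follows; the cost values are made up for illustration (they are not results from this notebook), and `elbow_k` is a hypothetical helper.

```python
def elbow_k(costs_by_k):
    """Return the k with the largest drop-off in improvement.

    costs_by_k maps consecutive k values to their WSSSE. For each interior k,
    the gain from k-1 to k is compared with the gain from k to k+1, and the k
    where that difference is largest is returned (a discrete second
    difference). Returns None when fewer than three k values are given.
    """
    ks = sorted(costs_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = costs_by_k[prev_k] - costs_by_k[k]
        gain_after = costs_by_k[k] - costs_by_k[next_k]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up costs with a clear bend at k=3 (illustrative only).
example_costs = {1: 600.0, 2: 400.0, 3: 80.0, 4: 60.0, 5: 50.0}
print(elbow_k(example_costs))
```

With `change_the_k` returning a float, a dictionary such as `{k: change_the_k(k) for k in range(1, 11)}` could be passed to this helper; the range is an arbitrary choice.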

### 5. Repeat the experimental evaluation using the diabetes dataset

In [27]:
# Load and parse the data
data_diabetes = sql("select * from diabetes")

# Because the MLlib package requires RDDs of arrays of doubles, we need to unpack the content of the DataFrame.
features_diabetes = data_diabetes.rdd.map(lambda r: r.features.array)

In [28]:
for i in range(1,4):
  change_the_seed(i,features=features_diabetes)

In [29]:
# Valid modes are "random" and "k-means||"; "k-means||" is the default.
change_the_init("random", features=features_diabetes)
change_the_init("k-means||", features=features_diabetes)

In [30]:
change_the_max_iter(10,features=features_diabetes)
change_the_max_iter(20,features=features_diabetes)

In [31]:
find_optimal_k(features=features_diabetes)

The values we obtained show that it depends on the dataset whether changing the seed, the initialization mode, the maximum number of iterations, or the number of clusters improves the approximation error.
For the iris dataset, only the number of clusters seemed to improve the results. For the diabetes dataset, however, all the parameters have an impact.