# <center>Introduction to Spark MLlib with Python</center>
## <center>Clustering</center>
### <center>July 20, 2016</center>

<img src="https://ibm.box.com/shared/static/wfbduwkbx22nx3i2psbp9g27s2p9s86v.png" width="500" align="center">

## <b>Welcome to the fourth lab in the course, Machine Learning with Spark MLlib.</b>
### <b>Spark ships with many libraries, including MLlib (Machine Learning Library), which makes practical machine learning quick and easy to scale!</b>

In this lab exercise, you will learn how to build two different clustering models with Spark MLlib: K-Means clustering and Gaussian Mixture clustering. We will also look at some of the information that the trained models can provide about themselves.

### Some Notebook Commands
#### In case you haven't dealt with a Jupyter Notebook before, here are some quick, useful commands to get started. The single-letter shortcuts work in command mode (press ESC first).
<ul>
    <li>Run a cell: CTRL + ENTER</li>
    <li>Create a cell above a cell: a</li>
    <li>Create a cell below a cell: b</li>
    <li>Change a cell to Markdown: m</li>
    <li>Change a cell to code: y</li>
</ul>

<b> If you are interested in more keyboard shortcuts, go to Help -> Keyboard Shortcuts </b>

## K-Means Clustering

Import the following libraries:
<ul>
    <li>KMeans, KMeansModel from pyspark.mllib.clustering</li>
    <li>make_blobs from sklearn.datasets</li>
    <li>array from numpy</li>
    <li>sqrt from math</li>
    <li>numpy as np</li>
    <li>SparkContext from pyspark</li>
</ul>

In [6]:
from pyspark.mllib.clustering import KMeans, KMeansModel
# make_blobs is imported from sklearn.datasets; newer scikit-learn releases
# no longer provide the sklearn.datasets.samples_generator module
from sklearn.datasets import make_blobs
from numpy import array
from math import sqrt
import numpy as np
from pyspark import SparkContext

We will be generating our own data using the make_blobs function, so we will set a random seed to initialize the random number generator and make the results repeatable. Call numpy's random.seed() function and pass it the value <b>0</b>.

In [2]:
np.random.seed(0)
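To see why seeding matters, here is a minimal sketch (the names `first_draw` and `second_draw` are illustrative and not part of the lab). Resetting the seed to the same value makes numpy's global random number generator repeat the same sequence.

```python
import numpy as np

# Seed numpy's global random number generator, then draw three values.
np.random.seed(0)
first_draw = np.random.rand(3)

# Re-seeding with the same value restarts the sequence from the same point.
np.random.seed(0)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```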

Next, we will create the dataset. The make_blobs function returns two <b>numpy.ndarray</b> values, which we will store as <b>X</b> and <b>y</b>. We will pass three arguments to <b>make_blobs</b>:
<ul>
    <li>1st: <b>n_samples</b>: The number of samples you want to generate. (We will use 6500 samples)</li>
    <li>2nd: <b>centers</b>: A list of lists containing the centers that we want the clusters to center around. Each center looks like: [x_value, y_value]. (We will be creating three centers at the following coordinates: [1,2], [3,-1], and [-4,2])</li>
    <li>3rd: <b>cluster_std</b>: The standard deviation of the clusters, which determines how spread out the clusters are. (We will use 1.1)</li>
</ul>

In [3]:
X, y = make_blobs(n_samples=6500, centers=[[1,2],[3,-1],[-4,2]], cluster_std=1.1)

The first ndarray, <b>X</b>, contains the samples themselves and the second ndarray, <b>y</b>, contains the labels for the samples. We will be using <b>X</b> as the data to train the KMeans model. This means we will need to convert the data from an <b>ndarray</b> to an <b>RDD</b>. Do this using <b>sc.parallelize</b> and passing in <b>X</b>, calling the output <b>kmeans_data</b>.

In [9]:
# If `sc` is not defined yet (e.g. on the first run), create it first:
# sc = SparkContext()
kmeans_data = sc.parallelize(X)

Now we will create the KMeans model called <b>kmeans_model</b> using <b>KMeans.train</b> and passing in the following parameters:
<ul>
    <li>1st: The data (use kmeans_data)</li>
    <li>2nd: The number of desired clusters (use 3)</li>
    <li>3rd: The max number of iterations (use maxIterations=10)</li>
    <li>4th: The number of runs (use runs=10)</li>
    <li>5th: The initialization mode (use initializationMode="k-means||")</li>
    <li>6th: The initialization steps (use initializationSteps=5)</li>
    <li>7th: The value for epsilon (use epsilon=0.003)</li>
    <li>8th: The initial model (use initialModel=None)</li>
</ul> <br>
Note: You may get a deprecation warning (most likely about the <b>runs</b> parameter, which later Spark releases deprecate). You can safely ignore it.

In [10]:
kmeans_model = KMeans.train(kmeans_data, 3, maxIterations=10, runs=10, initializationMode="k-means||", initializationSteps=5, epsilon=0.003, initialModel=None)
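To give a feel for what the maximum number of iterations and epsilon control, here is a minimal sketch of plain K-Means (Lloyd's algorithm) in numpy. It is not Spark's implementation, which runs distributed and uses the k-means|| initialization. The function name `lloyd_kmeans`, the toy points, and the starting centers are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(points, initial_centers, max_iterations=10, epsilon=0.003):
    """Plain K-Means: alternate between assigning points and moving centers."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(max_iterations):
        # Distance from every point to every center, shape (n_points, n_centers).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its points (leave it alone if it has none).
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # Stop early once no center moved farther than epsilon.
        largest_shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if largest_shift <= epsilon:
            break
    return centers

# Hypothetical toy data: two obvious groups of two points each.
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
toy_centers = lloyd_kmeans(toy_points, initial_centers=[[1.0, 1.0], [9.0, 1.0]])
print(toy_centers)
```

On this toy data the centers settle on the mean of each group after the first update, and the second pass stops the loop because nothing moves.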



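Now we will define a function that evaluates the clustering by measuring how far the points are from their cluster centers. This lab calls the result the Within Set Sum of Squared Error (WSSSE). <br><br>

For each point, the function below looks up the center of the cluster that the point is assigned to, takes the difference between the point and that center, squares each coordinate difference, sums the squares, and takes the square root. The result is the Euclidean distance from the point to its cluster center. Because of the square root, adding these values up over all points gives a sum of distances rather than a sum of squared distances.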

In [14]:
def error(model, point):
    center = model.centers[model.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

Now we need to take the data (<b>kmeans_data</b>) and apply the error function to every point (<b>.map(lambda p: error(kmeans_model, p))</b>). Afterwards, we add up the per-point results using a .reduce function (<b>.reduce(lambda x, y: x + y)</b>). The result will be stored in a variable called <b>WSSSE</b>.<br> <br>

Then you can print out the <b>WSSSE</b>.

In [16]:
WSSSE = kmeans_data.map(lambda p: error(kmeans_model, p)).reduce(lambda x, y: x+y)
print(WSSSE)

8706.90278674
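The following sketch reproduces the same calculation in plain numpy on a tiny, made-up dataset, so the arithmetic can be checked by hand. The points, the centers, and the helper names `nearest_center` and `point_error` are assumptions for illustration. It also shows how the total changes when the square root is left out.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and two fixed cluster centers.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])

def nearest_center(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(centers - point, axis=1)))

def point_error(point, centers):
    """Euclidean distance from `point` to its nearest center."""
    return float(np.linalg.norm(point - centers[nearest_center(point, centers)]))

# With the square root (as in the error function above): distances 1 + 1 + 2 + 2.
sum_of_distances = sum(point_error(p, centers) for p in points)
# Without the square root: squared distances 1 + 1 + 4 + 4.
sum_of_squared_distances = sum(point_error(p, centers) ** 2 for p in points)

print(sum_of_distances)          # 6.0
print(sum_of_squared_distances)  # 10.0
```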


Now that we have built the K-Means model, let's look at some of the attributes and methods that it comes with.

Starting off, let's take a look at the number of clusters in the model. This can be done by calling <b>.k</b> on <b>kmeans_model</b>.

In [17]:
kmeans_model.k

3

We get 3 clusters, which is expected since that's how many clusters we requested when training the model.

Now, we can take a look at the coordinates of where the cluster centers actually are. Call <b>.clusterCenters</b> on <b>kmeans_model</b>.

In [18]:
kmeans_model.clusterCenters

[array([ 3.02182138, -1.04459141]),
 array([-4.02437674,  1.9578032 ]),
 array([0.91902958, 2.04128787])]

So here, we get a list of 3 arrays, each containing the x and y coordinates of each cluster center!

Now we will try predicting one of the points from the dataset we created, <b>X</b>. First, let's check the <b>shape</b> of <b>X</b> with .shape. The shape should be (6500, 2), or 6500 data points, each with an x and a y value.

In [19]:
X.shape

(6500, 2)

Now we will use <b>.predict</b> on one of the data points from <b>X</b>, so we can see which cluster that point belongs to. We can index <b>X</b> to get a single data point (ex. X[7]). So try something like <b>kmeans_model.predict(X[7])</b>.

In [20]:
kmeans_model.predict(X[7])

2

Look back at the make_blobs function we used. It returned two outputs: <b>X</b>, which has the coordinates of each data point, and <b>y</b>, which has the label for each data point. The label is the index of the center that the data point was generated from! Knowing this, we can compare the model's prediction for a point with its label. Keep in mind that K-Means numbers its clusters arbitrarily, so the predicted index can differ from the label even when the point was grouped with the right cluster. <br>
We can run a comparison such as <b>kmeans_model.predict(X[18]) == y[18]</b>.

In [27]:
kmeans_model.predict(X[18]) == y[18]

False

Let's print out the coordinates of the data point we predicted earlier (ex. <b>print(X[7])</b>).

In [22]:
print(X[7])

[2.11109986 2.0942723 ]


Now we have the <b>x</b> and <b>y</b> coordinates of the data point. Since K-Means uses Euclidean distance, you can compare the distance from this point to each cluster center to confirm that the predicted cluster is the nearest one, if you are curious! <br> <br>
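As a rough check, the sketch below copies the cluster centers and the point X[7] from the outputs printed in this lab and finds the nearest center with numpy. The values come from this particular run, so your own run may print different numbers.

```python
import numpy as np

# Cluster centers and data point copied from the outputs printed above.
centers = np.array([
    [3.02182138, -1.04459141],
    [-4.02437674, 1.9578032],
    [0.91902958, 2.04128787],
])
point = np.array([2.11109986, 2.0942723])

# Euclidean distance from the point to every center.
distances = np.linalg.norm(centers - point, axis=1)
nearest = int(np.argmin(distances))

print(distances)
print(nearest)  # 2
```

The nearest center has index 2, which agrees with the value returned by <b>kmeans_model.predict(X[7])</b>.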

Also note that you can pass in a list containing an x and a y coordinate and the model will determine which cluster that point belongs to. <br>(ex. kmeans_model.predict([5, 2]))

In [29]:
kmeans_model.predict([10,10])

2

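Finally, let's compute the K-Means cost of the entire kmeans_data dataset, which is the sum of squared distances from each point to its nearest cluster center. Call <b>.computeCost</b> on <b>kmeans_model</b>, passing in <b>kmeans_data</b>. Unlike the <b>error</b> function above, it does not take a square root, so the value differs from <b>WSSSE</b>.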

In [24]:
kmeans_model.computeCost(kmeans_data)

14677.975272676682

---

## Gaussian Mixture Clustering

We will now move into creating a Gaussian Mixture Clustering Model and looking at its functions as well.

Import the following libraries:
<ul>
    <li>GaussianMixture from pyspark.mllib.clustering</li>
    <li>array from numpy</li>
</ul>

In [30]:
from numpy import array
from pyspark.mllib.clustering import GaussianMixture

For the dataset, we will create a small one, as it will be easier to inspect the prediction for each point. <br><br>

First, we will create a numpy array called <b>data</b>, which will contain the data. We can create a <b>numpy array</b> by passing in a <b>list</b>. The list will contain the following values: <b>4,5, -3,1, 9,5, 1,-2, 7,2, -10,-3, 4,7</b>. We then call <b>.reshape(7, 2)</b> on the result of <b>array()</b>, which gives us 7 data points, where each consecutive pair of list values is the x and y coordinate of one point. <br>(ex. for 1 data point: array([1,2]).reshape(1, 2))

In [33]:
data = array([4,5, -3,1, 9,5, 1,-2, 7,2, -10,-3, 4,7]).reshape(7,2)
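A quick way to convince yourself that the reshape pairs the values up as intended is to look at the shape and a couple of rows. This is only a sanity-check sketch on the same array.

```python
from numpy import array

# Same flat list as above: 14 values become 7 rows of (x, y).
data = array([4,5, -3,1, 9,5, 1,-2, 7,2, -10,-3, 4,7]).reshape(7, 2)

print(data.shape)        # (7, 2)
print(data[0].tolist())  # [4, 5]
print(data[5].tolist())  # [-10, -3]
```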

In order to train our model, we need to turn the data into an RDD. We can run SparkContext's <b>.parallelize</b>, passing in the <b>numpy array</b> <b>data</b> as input. Call the output <b>gm_data</b>.

In [34]:
gm_data = sc.parallelize(data)

Now we can train the Gaussian Mixture Model called <b>gm_model</b>, using <b>GaussianMixture.train</b>, passing in the following parameters:
<ul>
    <li>1st: The training data (use gm_data)</li>
    <li>2nd: The number of desired clusters (use 3)</li>
    <li>3rd: convergenceTol (use convergenceTol=0.0006)</li>
    <li>4th: maxIterations (use maxIterations=50)</li>
    <li>5th: initialModel (use initialModel=None)</li>
</ul>

In [35]:
gm_model = GaussianMixture.train(gm_data, 3, convergenceTol=0.0006, maxIterations=50, initialModel=None)

Now that we have built the Gaussian Mixture Model, let's try out some functions on it!

Let's start off by seeing how many clusters it has, with <b>.k</b> on <b>gm_model</b>.

In [36]:
gm_model.k

3

There should be 3 clusters as the output, which is the value we used to train the model.

Now let's take a look at what each Gaussian looks like with <b>gm_model.gaussians</b>

In [37]:
gm_model.gaussians

[MultivariateGaussian(mu=DenseVector([1.0, -2.0]), sigma=DenseMatrix(2, 2, [0.0, 0.0, 0.0, 0.0], 0)),
 MultivariateGaussian(mu=DenseVector([6.6667, 4.6667]), sigma=DenseMatrix(2, 2, [4.2222, -2.1111, -2.1111, 4.2222], 0)),
 MultivariateGaussian(mu=DenseVector([-3.0, 1.0]), sigma=DenseMatrix(2, 2, [32.6667, 18.6667, 18.6667, 10.6667], 0))]

With this output, you can get a feel for the location and spread of each of these Gaussians by looking at the <b>mean (mu)</b> and the <b>covariance matrix (sigma)</b>.
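Each sigma is printed as a flat list of four numbers. The sketch below rebuilds the 2x2 covariance matrix of the third Gaussian and derives the per-axis standard deviations from its diagonal. It assumes the flat values are stored in column-major order, which is how pyspark's DenseMatrix lays out its values; for a symmetric covariance matrix the ordering does not change the result. The names `sigma_values`, `covariance`, and `std_devs` are illustrative.

```python
import numpy as np

# Flat values copied from the third Gaussian's sigma in the output above.
sigma_values = [32.6667, 18.6667, 18.6667, 10.6667]

# Rebuild the 2x2 matrix, assuming column-major (Fortran) order.
covariance = np.array(sigma_values).reshape(2, 2, order="F")

# Standard deviations along x and y are the square roots of the diagonal.
std_devs = np.sqrt(np.diag(covariance))

print(covariance)
print(std_devs)  # roughly 5.72 along x and 3.27 along y
```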

We can also look at the <b>weight</b> of each Gaussian in the mixture in order to see which Gaussians make up most of the mixture. Run <b>gm_model.weights</b>.

In [38]:
gm_model.weights

array([0.14285714, 0.42857153, 0.42857132])

Now it's time to use <b>.predict</b> to see which cluster each of the training data points is assigned to. Use <b>gm_data</b> as input into <b>.predict</b>. After .predict, add on <b>.collect()</b> so we can see the output in list form.

In [43]:
gm_model.predict(gm_data).collect()

[2, 2, 1, 0, 1, 2, 1]
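In this run the mixture weights line up closely with the share of points assigned to each cluster. The sketch below checks that using the values copied from the outputs above. This only works out so neatly here because every point belongs almost entirely to a single Gaussian; it is an observation about this dataset, not a general rule.

```python
from collections import Counter

# Hard assignments and mixture weights copied from the outputs above.
predictions = [2, 2, 1, 0, 1, 2, 1]
weights = [0.14285714, 0.42857153, 0.42857132]

# Fraction of the 7 points assigned to each cluster index.
counts = Counter(predictions)
fractions = [counts[k] / len(predictions) for k in range(3)]

for k in range(3):
    print(k, round(fractions[k], 4), round(weights[k], 4))
# 0 0.1429 0.1429
# 1 0.4286 0.4286
# 2 0.4286 0.4286
```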

Now let's try using <b>.predictSoft</b> on <b>gm_data</b> to see what that output looks like. Remember that predictSoft returns, for each data point, its membership in each Gaussian, showing which Gaussian the point most likely belongs to. Don't forget a <b>.collect()</b> afterwards to display the output.

In [42]:
gm_model.predictSoft(gm_data).collect()

[array('d', [2.3881200540782028e-20, 7.380631194279263e-07, 0.9999992619368806]),
 array('d', [1.1280683480311063e-20, 1.6462747447581932e-16, 0.9999999999999998]),
 array('d', [3.235661529552778e-14, 0.9999999999999353, 3.235661529552778e-14]),
 array('d', [1.0, 2.5658492254156683e-20, 2.1137637929561462e-26]),
 array('d', [3.235660892452372e-14, 0.9999999999999353, 3.235660892452372e-14]),
 array('d', [2.38812004843812e-20, 2.38812004843812e-20, 1.0]),
 array('d', [3.235659648115988e-14, 0.9999999999999353, 3.235659648115988e-14])]

Find the highest value for each data point and check that its position matches the output from <b>.predict</b>, which it should. .predictSoft shows the membership values behind each hard assignment that .predict returns.
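The comparison can also be done programmatically. This sketch copies the predictSoft output above into a plain list of lists (the name `soft` is illustrative) and picks the position of the largest value in each row.

```python
# Soft assignments copied from the predictSoft output above.
soft = [
    [2.3881200540782028e-20, 7.380631194279263e-07, 0.9999992619368806],
    [1.1280683480311063e-20, 1.6462747447581932e-16, 0.9999999999999998],
    [3.235661529552778e-14, 0.9999999999999353, 3.235661529552778e-14],
    [1.0, 2.5658492254156683e-20, 2.1137637929561462e-26],
    [3.235660892452372e-14, 0.9999999999999353, 3.235660892452372e-14],
    [2.38812004843812e-20, 2.38812004843812e-20, 1.0],
    [3.235659648115988e-14, 0.9999999999999353, 3.235659648115988e-14],
]

# For each point, the index of the Gaussian with the largest membership value.
hard = [row.index(max(row)) for row in soft]
print(hard)  # [2, 2, 1, 0, 1, 2, 1]
```

The resulting list is the same as the output of <b>.predict</b>, and each row sums to approximately 1.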