## Project 6 : Clustering
- Name: Carson Stevens
- Date: 11/11/2018

## Instructions

### Description

Practice clustering using the well-known and very popular `Iris` dataset! The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning. 
https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
<br><br>Yes, there are many examples out there, but see if you can do it yourself :). We can easily hypothesize on how many clusters would yield the best result, so let us prove it through a simple experiment that you could repeat with additional data sets.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear all the cells (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller).  Then use the File->Download As->Notebook to obtain the notebook file.  Finally, submit the notebook file on Canvas.

### Setup

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn import datasets
import sklearn as sk
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

### Problem 1: Data Generation (5 points)
Reference for more information: Chapter 5.11 K-Means in the online course book.

1. Load the `iris` dataset and separate into `X` and `y` variables (our ground truth labels will just be used for visualization).
2. Write a hypothesis on how many clusters will yield the best labeling.

In [None]:
iris = datasets.load_iris()
x = iris.data  
y = iris.target

**Hypothesis**(Edit this cell)
>
> Since there are 3 different species of iris in the dataset, the best k should be 3.

### Problem 2: Data exploration (10 points)

This is the step where you would normally conduct any needed preprocessing, data wrangling, and investigation of the data.
<br>**Note:** `print(iris.DESCR)` prints the iris dataset description, provided you loaded it into a variable named `iris`

a. Using your skills from previous projects, provide code below to produce answers to the following questions (edit this cell with your answers): 

    1. How many features are provided?
        
        There are 4 features: 
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        
    2. How many total observations?

        150 (50 in each of three classes)

    3. How many different labels are included, what are they called, and is it a balanced dataset with the same number of observations for each class?
    
        There are 3 labels (50 in each of three classes: Balanced):
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica
        
b. Create a 2D or 3D scatter plot of two or three of the features and use the y labels for color coding. Do not reduce the data or number of features in any way (you will do this by applying PCA in problem 5).

c. Since clusters can be influenced by the magnitudes of the variables, normalize the feature data and plot a histogram of the normalized features data.

In [None]:
# a
#print(iris.DESCR)
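
The answers to part a can also be produced in code rather than read off the dataset description. The following is a minimal sketch, assuming the dataset is loaded with `datasets.load_iris()` as in Problem 1; the names `n_observations`, `n_features`, `class_counts`, and `is_balanced` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Rows are observations, columns are features
n_observations, n_features = iris.data.shape

# Count how many observations carry each integer class label
class_counts = np.bincount(iris.target)

print("Features:", n_features, list(iris.feature_names))
print("Observations:", n_observations)
for name, count in zip(iris.target_names, class_counts):
    print("Label %s: %i observations" % (name, count))

# The dataset is balanced when every class has the same count
is_balanced = len(set(class_counts.tolist())) == 1
print("Balanced:", is_balanced)
```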

In [None]:
# b
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib.colors import ListedColormap
cmap = ListedColormap(['r', 'y','b'])
plt.scatter(x[:,0], x[:, 1], c=y, cmap=cmap)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Sepal Length vs. Sepal Width, Colored by True Species")
plt.show()

In [None]:
#c. Normalization
# Note: Normalizer rescales each observation (row) to unit L2 norm;
# it does not scale each feature (column) separately.
normalized = Normalizer()
normalized.fit(x)
xNorm = normalized.transform(x)
plt.hist(xNorm)
plt.title("Normalized Iris Histogram")
plt.ylabel("Number of flowers")
plt.xlabel("Normalized feature value (unitless)")
plt.legend(["sepal length", "sepal width", "petal length", "petal width"])
plt.show()
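
Because "normalize" can mean different things, it may help to see what `Normalizer` does compared with a per-feature scaler. The sketch below is only an illustration: `StandardScaler` is an alternative that this notebook does not use, and the names `features`, `row_scaled`, and `col_scaled` are hypothetical. Switching scalers would change the silhouette scores reported later.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import Normalizer, StandardScaler

features = datasets.load_iris().data

# Normalizer works row by row: each flower's feature vector gets unit L2 norm
row_scaled = Normalizer().fit_transform(features)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works column by column: each feature gets mean 0 and std 1
col_scaled = StandardScaler().fit_transform(features)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print("Row norms after Normalizer (first 3):", row_norms[:3])
print("Column means after StandardScaler:", col_means)
print("Column stds after StandardScaler:", col_stds)
```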

### Problem 3: Unsupervised Learning - Clustering (15 points)
Conduct clustering experiments with one of the algorithms discussed in class (e.g., k-means) for number of clusters k = 2-10. Create another 2D or 3D scatter plot utilizing the <b>cluster assignments</b> for color coding (this output can be a plot for each of the values of k or just one final plot using the value of k from your best Silhouette result obtained in Problem 4 below).  

#### Steps:
Repeat for each value of k (maybe a loop here would be appropriate):
1. Create model object
2. Train or fit the model
3. Predict cluster assignments
4. Calculate Silhouette width (see Problem 4)
5. Plot points color coded by class labels predicted by the model.

In [None]:
from sklearn.metrics import silhouette_score

bestSilhouette = -1  # silhouette scores range from -1 to 1
bestCluster = 0
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++')
    # Use a separate name so the ground-truth labels in y are not overwritten
    labels = kmeans.fit_predict(xNorm)
    silhouette = silhouette_score(xNorm, labels)
    if silhouette > bestSilhouette:
        bestSilhouette = silhouette
        bestCluster = i
    # Color each point by its predicted cluster assignment
    plt.scatter(xNorm[:,0], xNorm[:,1], c=labels, cmap='Accent')
    plt.title("Clustering with %i Clusters" % i)
    plt.show()
print("The best cluster value is:\t", bestCluster)

### Problem 4: Evaluate results (20 points)

As we have discussed, validating an unsupervised problem is difficult. There is a metric that can be used to determine the density or separation of cluster assignments, called Silhouette width. In this step, perform analysis of results using the above `k = 2-10` and compute the Silhouette width (Hint: possibly you can just add code to your loop in problem 3 and store the results in a list of values). 

Scikit Learn has a great example for Silhouette analysis [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
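
To make the metric concrete, here is a small sketch that computes the mean Silhouette width directly from its usual definition, s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function `silhouette_width` and the toy data are hypothetical illustrations, not part of the assignment; in the notebook itself, `silhouette_score` from scikit-learn does this work.

```python
import numpy as np

def silhouette_width(X, labels):
    """Mean silhouette width computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a point alone in its cluster scores 0
        # a: mean distance to the other members of the same cluster
        # (the self-distance is 0, so divide by n_same - 1)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two tight, well-separated clusters on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])
toy_score = silhouette_width(X_toy, labels_toy)
print("Toy silhouette width:", toy_score)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.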

1. For each k (k = 2-10), what are the Silhouette width values?
 

2. Discuss if your best number of clusters (highest Silhouette width value) matches your hypothesis from Problem 1.


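The best result was obtained with k=2, which did not match my hypothesis of 3. I expected that, since there are 3 species of flowers, the best number of clusters would be 3, yet the Silhouette width for k=2 (0.8188570772941627) is much higher than for k=3 (0.5761482778685276). A likely explanation is that two of the species (versicolor and virginica) overlap in feature space, so they look like a single cluster to k-means.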

In [None]:
bestSilhouette = -1  # silhouette scores range from -1 to 1
bestCluster = 0
silhouettes = []  # Silhouette width for each k, in order k = 2..10
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++')
    # Use a separate name so the ground-truth labels in y are not overwritten
    labels = kmeans.fit_predict(xNorm)
    silhouette = silhouette_score(xNorm, labels)
    silhouettes.append(silhouette)
    print("Silhouette Score with %i Clusters:\t" % i, silhouette)
    if silhouette > bestSilhouette:
        bestSilhouette = silhouette
        bestCluster = i

print("\nThe best cluster value is:\t", bestCluster)
print("The silhouette score was:\t", bestSilhouette)
#discussion above

### Problem 5 (15 points): Principal Component Analysis (PCA)
PCA is one of the most popular forms of dimensionality reduction. It rotates and transforms the data into a new subspace such that the resulting matrix has:
- Most of the variation (relevance) associated with the first feature
- The next most variation associated with the second feature, etc.
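
The ordering described above can be checked with the `explained_variance_ratio_` attribute of a fitted `PCA` object. This is a minimal sketch that, as an assumption for illustration, uses the raw (un-normalized) iris features; the names `features`, `pca`, and `ratios` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

features = datasets.load_iris().data

# Keep all 4 components so the ratios cover the full variance
pca = PCA(n_components=4).fit(features)
ratios = pca.explained_variance_ratio_

# Each entry is the share of total variance captured by that component
for idx, ratio in enumerate(ratios, start=1):
    print("Component %i explains %.4f of the variance" % (idx, ratio))
print("Cumulative:", np.cumsum(ratios))
```

The ratios come out in decreasing order, so keeping only the first one or two components retains most of the variation in the data.
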
#### Steps:
1. Reduce the feature data (X) using PCA
2. Repeat the same experiment from problem 3 above (remember your plots are now the 1st, 2nd, and possibly 3rd principal component vs. the raw feature data like before).
3. Compare and contrast results to those from previous/non-PCA problems; does it perform better/worse/same? Provide discussion below (this could vary, depending on setup).

In [None]:
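# Clustering with PCA
bestSilhouette = -2  # below the minimum possible silhouette score of -1
bestCluster = 0
bestComponent = 0
for i in range(2, 11):
    for j in range(1, 5):
        # Reduce the normalized data to j principal components
        Xpca = PCA(n_components=j).fit_transform(xNorm)
        kmeans = KMeans(n_clusters=i, init='k-means++')
        # Use a separate name so the ground-truth labels in y are not overwritten
        labels = kmeans.fit_predict(Xpca)
        silhouette = silhouette_score(Xpca, labels)
        if silhouette > bestSilhouette:
            bestComponent = j
            bestSilhouette = silhouette
            bestCluster = i
        # Plot in principal-component space instead of raw feature space;
        # with a single component, place all points on a horizontal line
        pc2 = Xpca[:, 1] if j > 1 else np.zeros(len(Xpca))
        plt.scatter(Xpca[:, 0], pc2, c=labels, cmap='Accent')
        plt.xlabel("1st principal component")
        plt.ylabel("2nd principal component")
        plt.title("PCA with %i Components & %i Clusters" % (j, i))
        plt.show()
print("The best PCA has", bestComponent, "component(s)")
print("With a cluster value of:\t", bestCluster)
print("The silhouette score was:\t", bestSilhouette)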
    

    

**Discuss new results**(Edit this cell)
>
> K-means with PCA produced a better silhouette score in my runs (0.8793430588300643 with PCA vs. 0.818857077294163 without). The best result was obtained with k = 2 and 1 principal component. This did not match my hypothesis (as stated in Problem 4). The best-performing test with PCA used the same number of clusters (k=2) as in Problems 3 and 4.

## You Finished! Treat yourself by taking this questionnaire
### Questionnaire
1) How long did you spend on this assignment?
<br>
    1.5 hours
<br>
2) What did you like about it? What did you not like about it?
<br>
    It was interesting to see how the different number of clusters/dimensions produced a wide range of results
<br>
3) Did you find any errors or is there anything you would like changed?
<br>
    No changes needed
<br>