## Project 6 : Clustering
- Name:
- Date: 

## Instructions

### Description

Practice clustering using the well-known and very popular `Iris` dataset! The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning. 
https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
<br><br>Yes, there are many examples out there, but see if you can do it yourself :). We can easily hypothesize on how many clusters would yield the best result, so let us prove it through a simple experiment that you could repeat with additional data sets.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear the outputs of all the cells (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller).  Then use File->Download As->Notebook to obtain the notebook file.  Finally, submit the notebook file on Canvas.

### Setup

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn import datasets
import sklearn as sk
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

### Problem 1: Data Generation (5 points)
Reference for more information: Chapter 5.11 K-Means in the online course book.

1. Load the `iris` dataset and separate into `X` and `y` variables (our ground truth labels will just be used for visualization).
2. Write a hypothesis on how many clusters will yield the best labeling.

## **Hypothesis**(Edit this cell)
> I think that 3 clusters will yield the best labeling accuracy because there are three classes of data.
>

In [None]:
import pandas as pd
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X.head()

### Problem 2: Data exploration (10 points)

This is the step where you would normally conduct any needed preprocessing, data wrangling, and investigation of the data.
<br>**Note:** `print(iris.DESCR)` prints the iris dataset description, provided you loaded it into a variable named `iris`

a. Using your skills from previous projects, provide code below to produce answers to the following questions (edit this cell with your answers): 

    1. How many features are provided?
        ->4
    2. How many total observations?
        ->150
    3. How many different labels are included, what are they called, and is it a balanced dataset with the same number of observations for each class?
        ->3 labels are included, which are called Setosa, Versicolour, and Virginica. The dataset is balanced with 50 observations for each class.
b. Create a 2D or 3D scatter plot of two or three of the features and use the y labels for color coding. Do not reduce the data or number of features in any way (you will do this by applying PCA in problem 5).

c. Since clusters can be influenced by the magnitudes of the variables, scale the feature data and plot a histogram of the transformed feature data (think about if you should use the min-max, standard scaler, or normalizer).
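
As a rough aid for choosing among the three options in part c, the sketch below applies each scaler to the raw iris features and prints the resulting column ranges. The dictionary names `scalers` and `scaled` are illustrative and not part of the assignment. The main thing it is meant to show is that `MinMaxScaler` and `StandardScaler` work per feature (column), while `Normalizer` rescales each observation (row) to unit length.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

X = datasets.load_iris().data

# Illustrative names, not part of the assignment
scalers = {
    "min-max": MinMaxScaler(),      # each column mapped to [0, 1]
    "standard": StandardScaler(),   # each column to mean 0, std 1
    "normalizer": Normalizer(),     # each ROW to unit L2 norm
}
scaled = {name: scaler.fit_transform(X) for name, scaler in scalers.items()}

for name, Xs in scaled.items():
    print(name)
    print("  column min:", Xs.min(axis=0).round(2))
    print("  column max:", Xs.max(axis=0).round(2))

# Normalizer leaves every observation with length 1
row_norms = np.linalg.norm(scaled["normalizer"], axis=1)
print("row norms after Normalizer:", row_norms[:5])
```

Which scaler is appropriate depends on whether the goal is to equalize the features or the observations, so treat this only as a way to inspect the difference.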

In [None]:
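# a
print("Number of Features: {}".format(len(iris.feature_names)))
print("Number of Observations: {}".format(len(X)))
# Label names and per-class counts (to check whether the dataset is balanced)
print("Label names: {}".format(list(iris.target_names)))
print("Observations per label: {}".format(np.bincount(y)))
print(iris.DESCR)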

In [None]:
# b
from matplotlib.colors import ListedColormap
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

colormp = ListedColormap(['r','b','g'])

plt.scatter(X["petal length (cm)"], X["petal width (cm)"], c=y, cmap=colormp)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()

In [None]:
#c. Scale the data (think about if you should use the min-max, standard scaler, or normalizer)

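# Normalizer rescales each observation (row) to unit L2 norm
normalized = Normalizer()
normalized.fit(X)

X_norm = normalized.transform(X)
plt.hist(X_norm)  # plot the transformed data, not the raw X
plt.title("Histogram of transformed data")
plt.xlabel("Transformed feature value")
plt.ylabel("Count")
plt.legend(X.columns)
plt.show()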

### Problem 3: Unsupervised Learning - Clustering (15 points)
Conduct clustering experiments with one of the algorithms discussed in class (e.g., k-means) for number of clusters k = 2-10. Create another 2D or 3D scatter plot utilizing the <b>cluster assignments</b> for color coding (this output can be a plot for each of the values of k or just one final plot using the value of k from your best Silhouette result obtained in Problem 4 below).  

#### Steps:
Repeat for each value of k (maybe a loop here would be appropriate):
1. Create model object
2. Train or fit the model
3. Predict cluster assignments
4. Calculate Silhouette width (see Problem 4)
5. Plot points color coded by class labels predicted by the model.

In [None]:
'''
Placing this here because I keep getting spammed with warnings that are specific to my computer
'''
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.metrics import silhouette_score
krange = np.arange(2,11,1)
silhouette_widths = [0.0]*len(krange)
bestSilhouette = 0.0
bestY_hat = y
for k in krange:
    model = KMeans(k)
    model.fit(X)
    y_hat = model.predict(X)
    silhouette_widths[k-2] = silhouette_score(X, y_hat)
    if(silhouette_widths[k-2] > bestSilhouette): 
        bestY_hat = y_hat
        bestSilhouette = silhouette_widths[k-2]

plt.scatter(X["petal length (cm)"], X["petal width (cm)"], c=bestY_hat, cmap=colormp)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
print()
print("k(n_cluster) range = \n{}\n".format(krange))
print("silhouette width scores = \n{}\n".format(silhouette_widths))
print("Best silhouette width score = {}".format(bestSilhouette))

### Problem 4: Evaluate results (20 points)

As we have discussed, validating an unsupervised problem is difficult. There is a metric that can be used to determine the density or separation of cluster assignments, called Silhouette width. In this step, perform analysis of results using the above `k = 2-10` and compute the Silhouette width (Hint: you can probably just add code to your loop in Problem 3 and store the results in a list of values). 

Scikit Learn has a great example for Silhouette analysis [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)

1. For each k (k = 2-10), what are the Silhouette width values?
 

2. Discuss if your best number of clusters (highest Silhouette width value) matches your hypothesis from Problem 1.
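
To make the metric concrete, here is a minimal sketch of how a Silhouette width can be computed by hand and checked against scikit-learn. For one point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the points of the nearest other cluster, and the width is `(b - a) / max(a, b)`. The toy data and the helper `silhouette_by_hand` are made up for illustration and assume every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data, made up for illustration: two tight, well-separated groups.
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_by_hand(points, labels, i):
    """Silhouette width of point i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # leave the point itself out of its own cluster average
    a = dists[own].mean()  # mean distance to the rest of its own cluster
    # mean distance to the nearest other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


by_hand = np.array([silhouette_by_hand(points, labels, i) for i in range(len(points))])
print("per-point widths:", by_hand)
print("mean width:", by_hand.mean())
print("sklearn mean width:", silhouette_score(points, labels))
```

Values close to 1 mean a point sits much nearer to its own cluster than to any other; values near 0 mean it lies between clusters.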

In [None]:
for i in range(0,len(krange)):
    print("k value = {},  Silhouette width = {}\n".format(krange[i], silhouette_widths[i]))

In [None]:
#4B
"""
The best number of clusters in terms of silhouette width does not match my hypothesis from Problem 1, which was that
3 clusters would be the best number of clusters. I think that is because we are currently taking in too many features
for processing.
"""

### Problem 5 (15 points): Principal Component Analysis (PCA)
PCA is the most popular form of dimensionality reduction. It rotates and transforms the data into a new subspace, such that the resultant matrix has:
- Most relevance (variation) now associated with first feature
- Second feature gets the next most, etc.
#### Steps:
1. Reduce the feature data (X) using PCA
2. Repeat the same experiment from Problem 3 above (remember your plots now use the 1st, 2nd, and possibly 3rd principal components rather than the raw feature data used before).
3. Compare and contrast results to those from previous/non-PCA problems; does it perform better/worse/same? Provide discussion below (this could vary, depending on setup).
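
As a small sketch of step 1, the block below reduces the raw iris features to two principal components and reports how much of the total variance each component keeps. The choice of two components is an assumption made here so the result can be drawn in a 2D scatter plot; it is not prescribed by the assignment.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# Two components is an assumption, chosen for easy 2D plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("reduced shape:", X_pca.shape)
print("variance ratio per component:", pca.explained_variance_ratio_)
print("total variance kept:", pca.explained_variance_ratio_.sum())

# The variance of each projected column matches what PCA reports
variances = X_pca.var(axis=0, ddof=1)
print("column variances:", variances)
```

The first ratio should be the largest, which is the "most variation in the first feature" property described above.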

In [None]:
# Clustering with PCA
nComponents = 2
pca = PCA(n_components = nComponents)
pca.fit(X)
X_pca = pca.transform(X)

silhouette_widths_pca = [0.0]*len(krange)
bestSilhouette_pca = 0.0
bestY_hat_pca = y
for k in krange:
    model = KMeans(k)
    model.fit(X_pca)
    y_hat = model.predict(X_pca)
    silhouette_widths_pca[k-2] = silhouette_score(X_pca, y_hat)
    if(silhouette_widths_pca[k-2] > bestSilhouette_pca): 
        bestY_hat_pca = y_hat
        bestSilhouette_pca = silhouette_widths_pca[k-2]

# Plot the principal components (not the raw features), as the steps above require
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=bestY_hat_pca, cmap=colormp)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.show()
print()
print("k(n_cluster) range = \n{}\n".format(krange))
print("silhouette width scores = \n{}\n".format(silhouette_widths_pca))
print("Best silhouette width score = {}".format(bestSilhouette_pca))

**Discuss new results**(Edit this cell)
> Reducing the number of features via PCA appears to have barely improved the silhouette width score, with PCA getting 0.705 and non-PCA getting 0.681. Looking back, I now realize 2 clusters scores best because the data falls into 2 main groups, with one of those groups containing two classes of data. The issue with using only 2 clusters is that you lose one class completely, so clearly something is missing in our method.
>
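
One way to check the observation that two clusters merge two of the classes is to cross-tabulate the cluster assignments against the true species. This is only a sketch: `n_init=10` and `random_state=0` are assumptions added for repeatability, and the cluster numbers themselves are arbitrary.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()

# n_init and random_state are assumptions, set only for repeatable output
model = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = model.fit_predict(iris.data)

# Rows are the true species, columns are the k-means cluster ids
table = pd.crosstab(
    pd.Series(iris.target_names[iris.target], name="species"),
    pd.Series(clusters, name="cluster"),
)
print(table)
```

A row whose counts sit almost entirely in one column is a species that k-means keeps together; two rows sharing the same column are species that were merged.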

## You Finished! Treat yourself by taking this questionnaire
### Questionnaire
1) How long did you spend on this assignment?
<br><br>
    About 2-3 hours. I wasn't really paying attention to when I started, and I got up frequently to do things around the house.
<br><br>
2) What did you like about it? What did you not like about it?
<br><br>
     I found the dataset interesting, and it was interesting to see that clustering does not fully work on this dataset, at least with the clustering algorithm we are currently using.
<br><br>
3) Did you find any errors or is there anything you would like changed?
<br><br>
    None that I can think of.
<br><br>