# Iris Dataset Clustering

## Employ multiple clustering techniques on the Iris dataset, with and without PCA.  Evaluate results.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
from pandas_profiling import ProfileReport
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, MeanShift
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from sklearn.cluster import DBSCAN
import plotly.express as px

## Define Function for Clustering

In [None]:
def iris_clustering(func, data, target, **args):
        
        """Fit a scikit-learn clustering model, display a confusion matrix and a 3D plot of
        the data (using the first 3 columns of the data), and return the model.
            
            Arguments:
            func - A scikit-learn clustering class (e.g. KMeans)
            data - A pandas dataframe containing the features to be used in clustering
            target - The target values for comparison with the model predictions
            
            Any keyword arguments required by the specific clustering class should also be 
            included at the end of the call.
            
            Note: the features are row-normalized with sklearn's normalize() before fitting.
        """
        
        normalized = normalize(data)
        
        model = func(**args)
        preds = model.fit_predict(normalized)
        
        preds_df = data.copy()
        preds_df['target'] = target
        preds_df['preds'] = preds
        
        conf = pd.crosstab(preds_df['preds'], preds_df['target'], margins=True)
        print("Confusion Matrix:")
        display(conf)
        
        homogeneity = metrics.homogeneity_score(preds_df["target"], preds_df["preds"])
        completeness = metrics.completeness_score(preds_df["target"], preds_df["preds"])
        v_measure = metrics.v_measure_score(preds_df["target"], preds_df["preds"], beta=1.0)

        print(f"Homogeneity_score: {homogeneity:.2f}, measures that each cluster contains only members of a single class")
        print(f"Completeness_score: {completeness:.2f}, measures that all members of a given class are assigned to the same cluster")
        print(f"V-measure: {v_measure:.2f}, harmonic mean of homogeneity and completeness")

        scatter = px.scatter_3d(preds_df, x=preds_df.columns[0], y=preds_df.columns[1], z=preds_df.columns[2],
                            color = 'target', symbol = 'preds', size_max=8,)
        scatter.show()
        
        return model
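
A note on the preprocessing step inside the function: `normalize` rescales each observation (row) to unit length, which is different from standardizing each feature (column) with `StandardScaler`. The short sketch below illustrates this on two made-up rows; the `samples` array is a hypothetical example and is not taken from the analysis.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two hypothetical flower measurements (rows) with four features each
samples = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.3, 3.3, 6.0, 2.5]])

# normalize() defaults to norm='l2' and axis=1,
# so every ROW is divided by its own Euclidean length
unit_rows = normalize(samples)

# After normalization each row has length 1
row_lengths = np.linalg.norm(unit_rows, axis=1)
print(unit_rows.round(3))
print(row_lengths)
```

Because every row ends up on the unit sphere, the clustering inside `iris_clustering` compares the proportions between measurements rather than their absolute sizes.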
    

## Load Dataset, Explore and Display Features

In [None]:
iris = load_iris()
iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
iris_df['target'] = iris_df['target'].replace([0,1,2],['setosa', 'versicolor', 'virginica'])

* The target variable has 3 unique classes: setosa, versicolor and virginica, each a species of Iris
* The numerical target codes were replaced with descriptive species names for ease of analysis

In [None]:
iris_df

In [None]:
iris_df.shape

In [None]:
iris_df.info()

* The dataset contains 150 observations, has 4 predictive attributes and 1 target variable
* The 4 predictive attributes are numerical, the target variable is categorical

In [None]:
iris_df['target'].describe()

In [None]:
iris_df.describe()

* petal length has the largest range and greatest variation of the 4 attributes, and also has the greatest difference of the 4 between its mean and median
* for the other three attributes, their mean approximates their median which suggests the mean is not affected by outliers

In [None]:
profile = ProfileReport(iris_df)
profile

In [None]:
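# numeric_only=True skips the string 'target' column, which recent pandas versions refuse to correlate
iris_df.corr(numeric_only=True)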

### Observations:

* The dataset has zero missing observations
* This is a balanced dataset in that each of the three target labels has the same number of observations
* The distributions of sepal length and sepal width are fairly normal
* The distributions of petal length and petal width both have two distinct groupings
* Correlation - because the 4 predictive attributes are all numerical, refer to the Pearson's r chart, above:
    * Sepal width and sepal length appear to be uncorrelated
    * Petal width and petal length appear to be highly correlated 
    * Petal length and sepal length appear to be fairly correlated
    * Petal width and sepal length also appear to be correlated, though less so than petal length and sepal length    
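
The correlation observations above can be checked directly. The following is a minimal sketch, assuming the feature names that `load_iris(as_frame=True)` provides (for example `'petal length (cm)'`); the variable names `petal_r` and `sepal_r` are introduced here for illustration only.

```python
from sklearn.datasets import load_iris

# Load the four numeric features as a DataFrame
features = load_iris(as_frame=True).data

# DataFrame.corr() computes Pearson's r by default
corr = features.corr()

petal_r = corr.loc['petal length (cm)', 'petal width (cm)']
sepal_r = corr.loc['sepal length (cm)', 'sepal width (cm)']

print(f"petal length vs petal width: r = {petal_r:.2f}")
print(f"sepal length vs sepal width: r = {sepal_r:.2f}")
```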

## Build Elbow Plot to determine optimal number of clusters for KMeans

In [None]:
# Create Iris data frame without the target column

iris_features = iris_df.drop(columns='target')
iris_target = iris_df['target']
iris_features.head()

In [None]:
# Create Elbow plot of inertia values to determine optimal number of clusters to use in a K-Means clustering method
# inertia is the sum of the squared distances of observations to their closest cluster center

inertia_values = []
cluster_centers = []
K = range(1,11) #Try number of clusters from 1 to 10

for k in K:
    k_mean_model = KMeans(n_clusters = k)
    k_mean_model.fit(iris_features)
    inertia_values.append(k_mean_model.inertia_) #track the inertia values for each number of clusters
    cluster_centers.append(k_mean_model.cluster_centers_) #track the cluster centers for each number of clusters

# Create data frame of values for elbow plot
elbow_data = {'Number of Clusters': K, 'Inertia': inertia_values}
elbow_df = pd.DataFrame(elbow_data) 

In [None]:
# Graph the Elbow plot

fig_dims = (8, 5)
fig, ax = plt.subplots(figsize=fig_dims)

sns.set_theme(style = "whitegrid")
sns.pointplot(data = elbow_df, x = 'Number of Clusters' ,y = 'Inertia', markers=["o"])\
.set(title='Elbow Plot using Inertia for K-Means Clustering of Iris Data');

The optimal number of clusters is at the "elbow" of the graph, where the Inertia begins to decline in a linear fashion.  In this case the optimal number is 3.
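
To make the inertia used in the elbow plot concrete, here is a small sketch that recomputes it by hand for k = 3 and compares it with the value scikit-learn reports. The `random_state` and `n_init` values are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# random_state and n_init are illustrative choices, not values from the analysis
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to every observation
assigned_centers = model.cluster_centers_[model.labels_]

# Inertia = sum of squared distances from each observation to its center
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```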

In [None]:
# Display cluster centers for k = 3 clusters in k-means model
cluster_centers[2:3]

## Build Dendrogram to determine optimal number of clusters for Agglomerative method

In [None]:
iris_features.head()

In [None]:
iris_normalized = pd.DataFrame(normalize(iris_features),columns=iris_features.columns[:])
iris_normalized.head()

In [None]:
# Like an elbow plot for k-means, a dendrogram helps determine the ideal number of clusters for hierarchical clustering
# Note: based on this chart, we'll go down to the 3rd level and use 3 clusters

plt.figure(figsize=(18, 7))  
plt.title("Dendrograms")  
x = shc.linkage(iris_normalized, method = 'ward')
dend = shc.dendrogram(x,above_threshold_color="blue", color_threshold=.25, orientation='right');

Based on the above dendrogram, we will use 3 clusters for our Agglomerative clustering analysis.
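
Reading a cluster count off the tree can also be done programmatically. The sketch below, which is one possible approach rather than a step in the original analysis, cuts the same Ward linkage into at most 3 flat clusters with SciPy's `fcluster`.

```python
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.datasets import load_iris
from sklearn.preprocessing import normalize

# Same preprocessing as the dendrogram cell: row-normalized features
X = normalize(load_iris().data)

# Build the Ward linkage matrix
Z = shc.linkage(X, method='ward')

# Cut the tree so that no more than 3 flat clusters remain
labels = shc.fcluster(Z, t=3, criterion='maxclust')

# fcluster labels start at 1, so drop the unused 0 bin
cluster_sizes = np.bincount(labels)[1:]
print(cluster_sizes)
```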

## Reduce data frame using Principal Components Analysis

In [None]:
iris_features

In [None]:
# PCA is affected by scale, so we need to scale our features before applying PCA
# StandardScaler will standardize the dataset’s features onto unit scale (mean = 0 and variance = 1) 

x = iris_features.values
scaled_array = StandardScaler().fit_transform(x) #This is an array of the standardized values of the four feature columns
iris_standardized = pd.DataFrame(data= np.c_[scaled_array], \
                             columns = ('sepal length', 'sepal width', 'petal length', 'petal width'))

# View standardized data frame
iris_standardized.head()

In [None]:
# The first decision in PCA is to select the number of components to reduce to.
# The goal is to reduce dimensions while still retaining most of the variance of the features.

# Start with components=4 (use all) to assess what % of the variance each principal component contains
pca = PCA(n_components=4)
principalComponents = pca.fit_transform(scaled_array)

print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())

#### Scree Plot

The scree plot helps to determine the optimal number of components. 

In [None]:
# Plotting the Explained variance as a function of Principal Components considered

figure=plt.figure(figsize=(9,5))
plt.plot(range(1,5),pca.explained_variance_ratio_.cumsum(),marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Scree Plot')
plt.locator_params(axis='x', nbins=4)

plt.show()

Based on the cumulative % of variance explained by the principal components, we can see that principal components 1, 2, and 3 together contain 99.4% of the variation (information). We can also see on the scree plot above that the explained variance gain from 2 to 3 components is significant. As such, 3 principal components will be used for the PCA data frame in this analysis.
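
The same choice can be expressed as a rule: keep the smallest number of components whose cumulative explained variance reaches a cutoff. The sketch below uses a hypothetical cutoff of 0.99, which is an assumption for illustration and not a value stated in the analysis.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit PCA with all 4 components
scaled = StandardScaler().fit_transform(load_iris().data)
cumulative = PCA(n_components=4).fit(scaled).explained_variance_ratio_.cumsum()

# Hypothetical cutoff: retain at least 99% of the variance
threshold = 0.99

# argmax returns the first index where the condition is True
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative.round(3))
print(n_components)
```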

In [None]:
# Apply PCA to the scaled version of the Iris features, using 3 principal components (based on above analysis)
# Create and display PCA Features data frame. 
# Note: there isn't a particular meaning assigned to each principal component; the new components are just the three main 
# dimensions of variation, which are linear combinations of the original variables.

iris_model = PCA(n_components=3)
principalComponents = iris_model.fit_transform(scaled_array)
iris_PCA = pd.DataFrame(data = principalComponents
             , columns = ['principal comp 1', 'principal comp 2', 'principal comp 3'])
iris_PCA.head()
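
To illustrate what "linear combinations of the original variables" means, the following sketch rebuilds the PCA scores by multiplying the standardized features by the component weights. It refits the model locally, so the names `scores` and `manual_scores` are specific to this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=3)
scores = pca.fit_transform(scaled)

# Each row of components_ holds the weights of one principal component
weights = pca.components_  # shape (3, 4): 3 components x 4 original features

# A score is the centered data projected onto those weights
manual_scores = (scaled - pca.mean_) @ weights.T

print(weights.round(2))
print(np.allclose(scores, manual_scores))
```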

In [None]:
# Scatterplot of the iris data reduced to 3 Principal Components
iris_PCA_target=iris_PCA.assign(target=iris_df['target'])
fig = px.scatter_3d(iris_PCA_target, x='principal comp 1', y='principal comp 2', z='principal comp 3',
              color='target', size_max=8, opacity = .7)
fig.show()

## Employ clustering techniques using defined clustering function: on original data frame and on PCA data frame
* KMeans
* Agglomerative
* DBScan
* Mean Shift

## KMeans

Use optimal number of clusters derived from Elbow Plot = 3

In [None]:
iris_clustering(KMeans, iris_features, iris_target, n_clusters=3)

Visualization notes: Setosa is nicely grouped by itself, and versicolor and virginica are fairly well separated, but some versicolor are grouped with virginica.

In [None]:
iris_clustering(KMeans, iris_PCA, iris_target, n_clusters=3)

Visualization notes: Setosa is again nicely grouped by itself but there is much more overlap between versicolor and virginica than with the original data frame.

Results:

KMeans on the original Iris data frame performed quite well, with a V-measure of 0.90.  

In this case, PCA appears to have diminished the results, with a V-measure of only 0.66.

## Agglomerative

Use optimal number of clusters derived from Dendrogram = 3

The linkage criterion determines which distance to use between sets of observations.

The option ‘ward’ minimizes the variance of the clusters being merged; Euclidean is the distance used for this criterion.

In [None]:
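# 'ward' linkage always uses Euclidean distance, so the distance argument (affinity, renamed metric in newer scikit-learn) is omitted
iris_clustering(AgglomerativeClustering, iris_features, iris_target, n_clusters=3, linkage='ward')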

Visualization notes: Setosa is nicely grouped by itself, and there is pretty good separation of versicolor and virginica, with just a little overlap.

In [None]:
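# 'ward' linkage always uses Euclidean distance, so the distance argument (affinity, renamed metric in newer scikit-learn) is omitted
iris_clustering(AgglomerativeClustering, iris_PCA, iris_target, n_clusters=3, linkage='ward')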

Visualization notes: Setosa is clustered by itself with the exception of one point. Virginica is also clustered fairly well with one exception.  Versicolor though, has what looks like as many diamonds as squares which indicates this method/data frame combination did not do a good job of differentiating versicolor.

Results:

Agglomerative clustering on the original Iris data frame performed fairly well, with a V-measure of 0.86.  

Again in this case, PCA appears to have diminished the results, with a V-measure of only 0.68.

## DBScan

Notes on DBScan Inputs and Output

* Input - eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other
* Input min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point
* Output - noise: points that are neither core points nor within eps of a core point; scikit-learn labels them -1
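
The roles of `eps`, `min_samples` and noise are easier to see on a tiny example. The sketch below uses hypothetical toy points (two tight groups and one isolated point), not the Iris data, and the parameter values are chosen only for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one isolated point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                   [10.0, 0.0]])

model = DBSCAN(eps=0.3, min_samples=3).fit(points)
labels = model.labels_

# Mark which points DBSCAN treated as core points
core_mask = np.zeros(len(points), dtype=bool)
core_mask[model.core_sample_indices_] = True

# The isolated point has no neighbors within eps, so it is labeled -1 (noise)
print(labels)
print(core_mask)
```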

In [None]:
iris_clustering(DBSCAN, iris_features, iris_target, eps = 0.2, min_samples = 5)

Visualization notes: Setosa is nicely separated as its own cluster, but versicolor and virginica are combined as one cluster.

In [None]:
iris_clustering(DBSCAN, iris_PCA, iris_target, eps = 0.48, min_samples = 5)

Visualization notes: Setosa is nicely separated as its own cluster, but versicolor and virginica are combined as one cluster.

Results:

DBScan on the original Iris data frame did not perform very well: it only detected 2 clusters and had a V-measure of 0.73.  

In this case, PCA appears to have neither diminished nor improved the results, as the results for the PCA data frame were the same as the results for the original Iris data frame.

## Mean Shift

Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

n_jobs is the number of jobs to use for the computation; -1 means using all processors. For MeanShift, the parallelized work includes the hill-climbing runs started from each seed (MeanShift has no n_init parameter).
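
As a rough illustration of how mean shift finds "blobs" without being told how many to look for, the sketch below runs it on two synthetic, well-separated groups. The data, the `bandwidth` value and the random seed are all assumptions made for this example; the Iris runs in this notebook rely on the default bandwidth estimate instead.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical data: two well-separated 2D blobs of 50 points each
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# bandwidth sets the radius of the region averaged at each update step
model = MeanShift(bandwidth=2.0).fit(points)

# Near-duplicate candidate centroids are merged, leaving one center per blob
centers = model.cluster_centers_
print(centers.round(2))
```

Note that the number of clusters is an outcome of the bandwidth rather than an input, which is why mean shift can return fewer clusters than there are species.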

In [None]:
iris_clustering(MeanShift, iris_features, iris_target, n_jobs=-1)

Visualization notes: Setosa is nicely separated as its own cluster, but versicolor and virginica are combined as one cluster.

In [None]:
iris_clustering(MeanShift, iris_PCA, iris_target, n_jobs=-1)

Visualization notes: Setosa is all clustered together but this cluster also contains some versicolor.  The remaining versicolor are combined in one cluster with virginica.

Results:

Mean Shift on the original Iris data frame did not perform very well: it only detected 2 clusters and had a V-measure of 0.73. Note that these results are the same as the DBScan on the original Iris data frame. 

In this case, PCA appears to have diminished the results.  It too detected only 2 clusters, but its V-measure of 0.61 reflects the fact that six of the Versicolor are not in the same cluster as the other Versicolor records.

## Conclusions: 

To evaluate the various techniques, on both the original Iris data frame and on the corresponding PCA data frame, we will use the V-measure.  This measure combines the concept of completeness (all members of a target class are in the same cluster) and homogeneity (each cluster contains only members of one target class).
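
A small worked example may help in reading these scores. The sketch below uses six hypothetical labels in which one class is isolated and the other two are merged into a single cluster, the same pattern seen when only 2 clusters were detected. The label values are illustrative, not taken from the model output.

```python
import numpy as np
from sklearn import metrics

# Hypothetical ground truth: three balanced classes (0, 1, 2)
true_labels = [0, 0, 1, 1, 2, 2]

# Hypothetical clustering: class 0 isolated, classes 1 and 2 merged
cluster_ids = [0, 0, 1, 1, 1, 1]

h = metrics.homogeneity_score(true_labels, cluster_ids)
c = metrics.completeness_score(true_labels, cluster_ids)
v = metrics.v_measure_score(true_labels, cluster_ids)

# With the default beta=1.0, V-measure is the harmonic mean of h and c
harmonic_mean = 2 * h * c / (h + c)

print(f"homogeneity={h:.2f}, completeness={c:.2f}, v_measure={v:.2f}")
```

Completeness is perfect here because every class sits entirely inside one cluster, while homogeneity is penalized because one cluster mixes two classes. With balanced classes this pattern gives a V-measure of about 0.73.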

The results of our analysis are:

Method         | Data Frame | V-Measure
:-----         | :----      | :-----
KMeans         | Iris       | 0.90
KMeans         | PCA        | 0.66
Agglomerative  | Iris       | 0.86
Agglomerative  | PCA        | 0.68
DBScan         | Iris       | 0.73
DBScan         | PCA        | 0.73
MeanShift      | Iris       | 0.73
MeanShift      | PCA        | 0.61

For the clustering techniques and data set used in this analysis:

* The technique that performed the best was KMeans on the original Iris data frame.
* The technique that performed the worst was MeanShift on the PCA data frame.

Also, for this data set and using 3 principal components in our PCA data frame, PCA appears to be at best neutral (in DBScan) and at worst to have a detrimental effect on our clustering results.