### Cluster Analysis

The goal of this notebook is to explore the data for hidden clusters.
Good clustering produces clusters with high intra-cluster similarity and
low inter-cluster similarity: points within a cluster are similar to one
another and dissimilar from points in other clusters. If the data
contain good clusters, we could exploit them by fitting classification
models on individual clusters instead of on the entire dataset. We
expect models fitted on individual clusters to be more accurate than a
single model fitted on the entire dataset.

In [None]:
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, OPTICS
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score, silhouette_samples
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import warnings
%matplotlib inline

In [None]:
# Format output of data frame. 
pd.set_option("display.precision", 4)
pd.set_option('display.max_columns', None)

In [None]:
# Load the data with all columns
linearReg = pd.read_csv('Data/LinearRegressionWith8.csv')
randomForest = pd.read_csv('Data/RandomForestRegressor_feature_importancesWith8.csv')

In [None]:
# Format tables and drop rows of all zeros
linearReg.index = linearReg.iloc[:,0]
linearReg = linearReg.iloc[:,1:]
linearReg.loc[:,'Total'] = linearReg.sum(axis = 1)
linearReg = linearReg.loc[(linearReg['Total']!=0),:]
linearReg = linearReg.iloc[:,:-1]

randomForest.index = randomForest.iloc[:,0]
randomForest = randomForest.iloc[:,1:]
randomForest.loc[:, 'Total'] = randomForest.sum(axis = 1)
randomForest = randomForest.loc[(randomForest['Total']!=0),:]
randomForest = randomForest.iloc[:,:-1]

In [None]:
# Drop rows with NA
linearReg = linearReg.dropna()
randomForest = randomForest.dropna()

In [None]:
# Scale data using min-max scaling.
scaler = MinMaxScaler()
scaler.fit(randomForest)
randomForestScale = scaler.transform(randomForest)

In [None]:
# Scale data using min-max scaling.
scaler = MinMaxScaler()
scaler.fit(linearReg)
linearRegScale = scaler.transform(linearReg)

## Functions

The following functions will be used throughout the cluster analysis
that follows.

The runClusteringModel function fits the cluster model and returns three
different clustering scores, along with the cluster labels. The three
clustering scores are the Variance Ratio Criterion, Davies-Bouldin
Criterion, and Silhouette Coefficient using the specified distance
measure.

In [None]:
def runClusteringModel(data, silhouetteScoreMeasure, clusteringModel): 
    """Returns Variance Ratio Criterion, Davies-Bouldin score, Silhouette Coefficient 
    based on given distance measure, and the class labels."""

    # Initialize and fit clustering model
    mdl = clusteringModel
    mdl.fit(data)    

    mdlLabels = mdl.labels_  # Get cluster labels

    VRC = calinski_harabasz_score(data, mdlLabels)
    dBScore = davies_bouldin_score(data, mdlLabels)
    silhouetteScore = silhouette_score(data, mdlLabels, metric=silhouetteScoreMeasure, random_state=210)

    return VRC, dBScore, silhouetteScore, mdlLabels
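
To make the direction of each score concrete, the sketch below computes the same three scikit-learn metrics for a sensible clustering and for a deliberately shuffled labeling of the same points. The synthetic blobs from `make_blobs`, the variable names, and the choice of three clusters are assumptions for illustration only; none of it comes from the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for the scaled data: three well-separated blobs (assumption).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

good_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo).labels_
# Shuffling the labels keeps the cluster sizes but destroys the structure.
bad_labels = np.random.default_rng(0).permutation(good_labels)

for name, labels in [('k-means labels', good_labels), ('shuffled labels', bad_labels)]:
    vrc = calinski_harabasz_score(X_demo, labels)                 # higher is better
    db = davies_bouldin_score(X_demo, labels)                     # lower is better
    sil = silhouette_score(X_demo, labels, metric='euclidean')    # in [-1, 1], higher is better
    print(name, round(vrc, 2), round(db, 2), round(sil, 2))
```

The Variance Ratio Criterion and Silhouette Coefficient should be clearly higher for the k-means labels, and the Davies-Bouldin score clearly lower.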

The runPCAClusteringModel function runs PCA, keeping the number of
components needed to capture the amount of variance specified by the
user as an input parameter. The function then calls runClusteringModel
to run the cluster analysis on the PCA-transformed data. It returns the
same information as runClusteringModel plus the number of components
needed to capture the specified amount of variance.

In [None]:
def runPCAClusteringModel(data, silhouetteScoreMeasure, varCaptured, clusteringModel): 
    """Returns the results of the clustering algorithm using PCA-transformed data and the number of components
    required to capture the requested amount of variance.
    varCaptured is the fraction (in decimal form) of the data's variance the components must capture. 
    """

    # Initialize PCA keeping all components and fit to data. 
    pca = PCA(n_components=None)
    pca.fit(data)

    # Find the number of components that capture at least varCaptured of the variance
    cumVar = pca.explained_variance_ratio_.cumsum().round(decimals=2)
    i = 0
    while True:
        if cumVar[i] >= varCaptured:  # True if at least varCaptured of the variance is captured
            numComp = i + 1  # Component number is index plus 1. 
            break
        i += 1

    # Rerun PCA and transform the data keeping only numComp components. 
    pca = PCA(n_components=numComp)
    pca.fit(data)
    data_pca = pca.transform(data)

    # Call runClusteringModel function with PCA transformed data.
    VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(data_pca, silhouetteScoreMeasure, clusteringModel)
    return VRC, dBScore, silhouetteScore, mdlLabels, numComp
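
The component-selection step can be sketched on its own as follows. The helper name `components_for_variance` and the synthetic data are hypothetical, and unlike the function above this sketch does not round the cumulative variance to two decimals, so results can differ slightly near the threshold. It also shows that scikit-learn's `PCA` accepts a float in (0, 1) for `n_components` and then selects the number of components itself.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(data, var_captured):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained-variance ratio is at least var_captured."""
    cum_var = PCA(n_components=None).fit(data).explained_variance_ratio_.cumsum()
    # Cap the threshold at the final cumulative value to guard against float error near 1.0.
    threshold = min(var_captured, cum_var[-1])
    return int(np.argmax(cum_var >= threshold) + 1), cum_var

rng = np.random.default_rng(0)
# Synthetic data (assumption): six features with very different variances.
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

num_comp, cum_var = components_for_variance(X_demo, 0.95)
print(num_comp, cum_var.round(3))

# Alternative: let PCA choose the count directly from a variance fraction.
pca_auto = PCA(n_components=0.95, svd_solver='full').fit(X_demo)
print(pca_auto.n_components_)
```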

The printClusteringStats function prints the results of the cluster
analysis. The results include the three clustering scores obtained from
runClusteringModel. Additionally, the function prints the number of
clusters produced and some information regarding the top 5 clusters
based on the number of instances associated with them.

In [None]:
def printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels):
    """Prints clustering results, different measures, number of clusters, and stats of top 5 clusters."""
    print('The Variance Ratio Criterion is equal to ' + str(round(VRC,2)) + '.')
    print('The Davies-Bouldin score is equal to ' + str(round(dBScore,2)) + '.')
    print('The mean Silhouette Coefficient using the ' + silhouetteScoreMeasure + ' distance is equal to ' + str(round(silhouetteScore,2)) + '.')
    print('The model produced ' + str(mdlLabels.max() + 1) + ' clusters.\n')

    #show top five clusters and percentage of data in each
    unique, counts = np.unique(mdlLabels, return_counts=True)
    percent = counts/mdlLabels.size
    clusters = pd.DataFrame({'Clusters': unique, 'Number of Instances': counts, 'Percent of Instances': percent})
    print(clusters.sort_values(by=['Number of Instances'], ascending=False).head(5))
    return

The silhouette_scorer function is the scoring method passed to
GridSearchCV. It is needed because GridSearchCV has no built-in option
for scoring clusters without the true class labels, which we do not
know in this situation. After all, we're trying to find unknown
clusters in the data.

In [None]:
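def silhouette_scorer(clusteringModel, data, y=None, silhouetteScoreMeasure='euclidean'):
    """Returns Silhouette Coefficient to be used in GridSearchCV.
    GridSearchCV calls a scoring callable as scorer(estimator, X, y), so the estimator comes first."""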

    # Initialize and fit clustering model
    mdl = clusteringModel
    mdl.fit(data)    

    mdlLabels = mdl.labels_  # Get cluster labels

    numLabels = np.unique(mdlLabels).size 
    numInstances = data.shape[0]

    if numLabels == 1 or numLabels == numInstances:  # True if only one cluster or cluster for each individual instance
        return -1
    else:
        return silhouette_score(data, mdlLabels, metric=silhouetteScoreMeasure, random_state=210)
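
A minimal, self-contained sketch of this pattern is shown below. It assumes synthetic blobs from `make_blobs` and a hypothetical scorer named `demo_silhouette_scorer`; the key points are that GridSearchCV passes the fitted estimator first and the data second, and that a single "fold" containing every row for both fitting and scoring avoids holding any data out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

def demo_silhouette_scorer(estimator, X, y=None):
    """Hypothetical scorer: receives the estimator GridSearchCV already fitted, then the data."""
    labels = estimator.labels_  # valid here because the fit rows and scoring rows are identical
    n_labels = np.unique(labels).size
    if n_labels == 1 or n_labels == X.shape[0]:
        return -1  # silhouette is undefined for one cluster or one cluster per instance
    return silhouette_score(X, labels, metric='euclidean')

# Synthetic stand-in for the scaled data: three well-separated blobs (assumption).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# One split that uses all rows for both training and scoring.
all_rows = np.arange(X_demo.shape[0])
grid_demo = GridSearchCV(estimator=KMeans(n_init=10, random_state=0),
                         param_grid={'n_clusters': [2, 3, 4, 5]},
                         scoring=demo_silhouette_scorer,
                         cv=[(all_rows, all_rows)])
grid_demo.fit(X_demo)
print(grid_demo.best_params_)
```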

The graphResults function graphs the results from using different
hyperparameters. It displays the Variance Ratio Criterion,
Davies-Bouldin Index, and Silhouette Coefficient on the y-axes over the
range of hyperparameter values on the x-axis.

In [None]:
def graphResults(x, VRCLst, dBScoreLst, silhouetteScoreLst):
    """Produce graph for tuning clustering parameters"""

    # Create graph with hyperparameter values on x-axis and measures on y-axes.
    fig,ax = plt.subplots()

    # Formatting Y-axis on left side of plot 
    vrc = ax.plot(x, VRCLst, color='red', marker='o', label='Variance Ratio Criterion')
    ax.set_xlabel("Hyperparameter",fontsize=14)
    ax.set_ylabel("Variance Ratio Criterion",color="red",fontsize=14)

    # Formatting Y-axis on right side of plot
    ax2=ax.twinx()
    dbScore = ax2.plot(x, dBScoreLst,color='blue',marker='o', label='Davies–Bouldin index')
    silScore = ax2.plot(x, silhouetteScoreLst,color='green',marker='o', label='Silhouette Score')
    ax2.set_ylabel("DB & Silhouette Score",color='blue',fontsize=14)

    # Formatting the legend and displaying plot
    lns = vrc+dbScore+silScore
    labs = [l.get_label() for l in lns]
    ax.legend(lns, labs, loc='center left', bbox_to_anchor=(1.25, 0.5))
    plt.show()

In [None]:
def plot_silhouettes(data, clusters, metric='euclidean'):
    """Plots cluster silhouettes."""

    cluster_labels = np.unique(clusters)  # Labels as produced by the model (noise, if any, is -1).
    n_clusters = cluster_labels.shape[0]
    silhouette_vals = silhouette_samples(data, clusters, metric=metric)
    c_ax_lower, c_ax_upper = 0, 0
    cticks = []
    for i, k in enumerate(cluster_labels):
        c_silhouette_vals = silhouette_vals[clusters == k]
        c_silhouette_vals.sort()
        c_ax_upper += len(c_silhouette_vals)
        color = cm.jet(float(i) / n_clusters)
        plt.barh(range(c_ax_lower, c_ax_upper), c_silhouette_vals, height=1.0, 
                      edgecolor='none', color=color)

        cticks.append((c_ax_lower + c_ax_upper) / 2)
        c_ax_lower += len(c_silhouette_vals)

    silhouette_avg = np.mean(silhouette_vals)
    plt.axvline(silhouette_avg, color="red", linestyle="--") 

    plt.yticks(cticks, cluster_labels + 1)  # Display clusters numbered from 1.
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette coefficient')

    plt.tight_layout()
    plt.show()

    return

In [None]:
def tuner(data, model, simMeasure='euclidean', varCaptured = .95):
    # X-axis values and those to be tested for k
    x = np.arange(2, 18)

    # placeholders for evaluation measure scores.
    VRCLst = []
    dBScoreLst = []
    silhouetteScoreLst = []

    # Test different values for K
    for k in x:
        if model == 'runClusteringModel':
            VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(data, simMeasure, KMeans(n_clusters=k, random_state=210))
        else:
            VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(data, simMeasure, varCaptured, KMeans(n_clusters=k, random_state=210))
        VRCLst.append(VRC)
        dBScoreLst.append(dBScore)
        silhouetteScoreLst.append(silhouetteScore)

    graphResults(x, VRCLst, dBScoreLst, silhouetteScoreLst)

## KMeans Clustering

#### Random Forest

In [None]:
tuner(randomForestScale, 'runClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=5, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=6, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
tuner(randomForestScale, 'runClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=5, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, KMeans(n_clusters=6, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
tuner(randomForestScale, 'runPCAClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=5, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=6, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
tuner(randomForestScale, 'runPCAClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=5, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=6, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
randomForest.loc[:,'Labels'] = mdlLabels
randomForest.to_csv('randomforestlabelsv2.csv')

In [None]:
warnings.filterwarnings('ignore')
cv = [(slice(None), slice(None))]
paramDict = {'n_clusters': np.arange(2, 18)}
grid = GridSearchCV(estimator=KMeans(random_state=210), param_grid=paramDict, 
                  scoring=silhouette_scorer, cv=cv, n_jobs=-1)
grid.fit(randomForestScale)
print(grid.best_params_) 

All three measures agree that the best number of clusters is 2. The
Variance Ratio Criterion indicates the optimal number of clusters at its
first local maximum, the Davies-Bouldin Index indicates it when
minimized, and the Silhouette Coefficient indicates it when maximized.
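
The selection rules described above can be sketched as a small loop over candidate values of k. The synthetic blobs and variable names are assumptions for illustration, and for simplicity the sketch takes the global maximum of the Variance Ratio Criterion rather than searching for its first local maximum.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for the scaled data: three well-separated blobs (assumption).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

k_values = np.arange(2, 8)
vrc_scores, db_scores, sil_scores = [], [], []
for k in k_values:
    labels = KMeans(n_clusters=int(k), n_init=10, random_state=0).fit_predict(X_demo)
    vrc_scores.append(calinski_harabasz_score(X_demo, labels))
    db_scores.append(davies_bouldin_score(X_demo, labels))
    sil_scores.append(silhouette_score(X_demo, labels))

best_k_vrc = int(k_values[np.argmax(vrc_scores)])  # simplification: global maximum
best_k_db = int(k_values[np.argmin(db_scores)])    # Davies-Bouldin: lower is better
best_k_sil = int(k_values[np.argmax(sil_scores)])  # Silhouette: higher is better
print(best_k_vrc, best_k_db, best_k_sil)
```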

#### Linear Regression

In [None]:
tuner(linearRegScale, 'runClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, KMeans(n_clusters=6, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
tuner(linearRegScale, 'runClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
tuner(linearRegScale, 'runPCAClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
tuner(linearRegScale, 'runPCAClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=2, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, KMeans(n_clusters=3, random_state=210))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
warnings.filterwarnings('ignore')
cv = [(slice(None), slice(None))]
paramDict = {'n_clusters': np.arange(2, 18)}
grid = GridSearchCV(estimator=KMeans(random_state=210), param_grid=paramDict, 
                  scoring=silhouette_scorer, cv=cv, n_jobs=-1)
grid.fit(linearRegScale)
print(grid.best_params_) 

GridSearchCV confirms that two clusters produce the best Silhouette
Coefficient.

K-Means clustering, with and without PCA-transformed data, did not
perform well. The best Silhouette Coefficient produced was 0.19. The two
clusters may represent functional and non-functional water pumps.

## Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

In [None]:
def dbTuner(data, model, simMeasure='euclidean', varCaptured = .95):
    # X-axis values and those to be tested for eps
    x = np.linspace(.1, .4, 20)

    # placeholders for evaluation measure scores.
    VRCLst = []
    dBScoreLst = []
    silhouetteScoreLst = []

    # Test different values for eps
    for eps in x:
        if model == 'runClusteringModel':
            VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(data, simMeasure, DBSCAN(eps=eps))
        else:
            VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(data, simMeasure, varCaptured, DBSCAN(eps=eps))
        VRCLst.append(VRC)
        dBScoreLst.append(dBScore)
        silhouetteScoreLst.append(silhouetteScore)

    graphResults(x, VRCLst, dBScoreLst, silhouetteScoreLst)

#### Random Forest

In [None]:
dbTuner(randomForestScale, 'runClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, DBSCAN(eps=.15))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, DBSCAN(eps=.2))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
dbTuner(randomForestScale, 'runClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, DBSCAN(eps=.13))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, DBSCAN(eps=.15))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
dbTuner(randomForestScale, 'runPCAClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.13))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.15))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
dbTuner(randomForestScale, 'runPCAClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.13))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.15))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
warnings.filterwarnings('ignore')
cv = [(slice(None), slice(None))]
paramDict = {'eps': np.linspace(.1, .4, 20)}
grid = GridSearchCV(estimator=DBSCAN(), param_grid=paramDict, 
                  scoring=silhouette_scorer, cv=cv, n_jobs=-1)
grid.fit(randomForestScale)
print(grid.best_params_) 

#### Linear Regression

In [None]:
dbTuner(linearRegScale, 'runClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, DBSCAN(eps=.12))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, DBSCAN(eps=.13))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
dbTuner(linearRegScale, 'runClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, DBSCAN(eps=.12))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, DBSCAN(eps=.13))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
dbTuner(linearRegScale, 'runPCAClusteringModel', 'euclidean')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.115))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.125))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
dbTuner(linearRegScale, 'runPCAClusteringModel', 'cosine')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.1))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.115))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, DBSCAN(eps=.125))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels, metric=silhouetteScoreMeasure)

In [None]:
warnings.filterwarnings('ignore')
cv = [(slice(None), slice(None))]
paramDict = {'eps': np.linspace(.1, .4, 20)}
grid = GridSearchCV(estimator=DBSCAN(), param_grid=paramDict, 
                  scoring=silhouette_scorer, cv=cv, n_jobs=-1)
grid.fit(linearRegScale)
print(grid.best_params_) 

GridSearchCV indicates the best eps is 0.1, lower than the value
indicated by the three measures graphed above.

DBSCAN produces too many small clusters to be useful. Noise accounts for
the largest share of instances, and the remainder is split across
hundreds of small clusters that each make up around 1.4% or less of the
instances in the data set.
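
One way to quantify this is to summarize the DBSCAN labels directly, since scikit-learn marks noise points with the label -1. The helper name `summarize_dbscan_labels` and the toy label array below are hypothetical and only illustrate the calculation; they are not results from this notebook.

```python
import numpy as np

def summarize_dbscan_labels(labels):
    """Hypothetical helper: share of noise points, number of clusters,
    and share of instances in the largest non-noise cluster."""
    labels = np.asarray(labels)
    noise_share = float(np.mean(labels == -1))  # DBSCAN labels noise as -1
    cluster_ids, counts = np.unique(labels[labels != -1], return_counts=True)
    largest_share = float(counts.max() / labels.size) if counts.size else 0.0
    return noise_share, int(cluster_ids.size), largest_share

# Toy labels (assumption): four noise points and four small clusters.
demo_labels = np.array([-1, -1, -1, -1, 0, 0, 1, 1, 2, 3])
noise_share, n_clusters, largest_share = summarize_dbscan_labels(demo_labels)
print(noise_share, n_clusters, largest_share)
```

Applied to `mdlLabels` from any of the DBSCAN cells, a high noise share together with a small largest-cluster share would support the conclusion above.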

## Hierarchical Clustering

In [None]:
def hierTuning(data, model, simMeasure='euclidean', varCaptured = .95, affinity='euclidean', linkage='ward'):

    # X-axis values and those to be tested for k
    x = np.arange(2, 11)

    # placeholders for evaluation measure scores.
    VRCLst = []
    dBScoreLst = []
    silhouetteScoreLst = []

    # Test different values for K
    for numClusters in x:        
        if model == 'runClusteringModel':
            VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(data, simMeasure, AgglomerativeClustering(n_clusters=numClusters, affinity=affinity, linkage=linkage))
        else:
            VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(data, simMeasure, varCaptured, AgglomerativeClustering(n_clusters=numClusters, affinity=affinity, linkage=linkage))

        VRCLst.append(VRC)
        dBScoreLst.append(dBScore)
        silhouetteScoreLst.append(silhouetteScore)

    graphResults(x, VRCLst, dBScoreLst, silhouetteScoreLst)

#### Random Forest

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='average')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='single')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='cosine', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='cosine', affinity='cosine', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='cosine', affinity='cosine', linkage='average')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=7, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=8, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runClusteringModel', simMeasure='cosine', affinity='cosine', linkage='single')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='euclidean', varCaptured = .95, affinity='euclidean', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='euclidean', varCaptured = .95, affinity='euclidean', linkage='average')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='euclidean', varCaptured = .95, affinity='euclidean', linkage='single')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='cosine', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='cosine', varCaptured = .95, affinity='cosine', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='cosine', varCaptured = .95, affinity='cosine', linkage='average')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
hierTuning(randomForestScale, 'runPCAClusteringModel', simMeasure='cosine', varCaptured = .95, affinity='cosine', linkage='single')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

#### Linear Regression

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='average')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='single')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='cosine', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='cosine', affinity='cosine', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=5, affinity='cosine', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=4, affinity='cosine', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='cosine', affinity='cosine', linkage='average')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=5, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runClusteringModel', simMeasure='cosine', affinity='cosine', linkage='single')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, AgglomerativeClustering(n_clusters=9, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='average')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='euclidean', affinity='euclidean', linkage='single')

In [None]:
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='cosine', affinity='euclidean', linkage='ward')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='ward'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='cosine', affinity='cosine', linkage='complete')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='complete'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='cosine', affinity='cosine', linkage='average')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='average'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
hierTuning(linearRegScale, 'runPCAClusteringModel', simMeasure='cosine', affinity='cosine', linkage='single')

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, AgglomerativeClustering(n_clusters=3, affinity='cosine', linkage='single'))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

## OPTICS

In [None]:
def opticsTuning(data, model, simMeasure='euclidean', varCaptured = .95):
    # X-axis values and those to be tested for min_samples
    x = np.arange(5, 51, 5)

    # placeholders for evaluation measure scores.
    VRCLst = []
    dBScoreLst = []
    silhouetteScoreLst = []

    # Test different values for min_samples
    for numSamples in x:
        if model == 'runClusteringModel':
            VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(data, simMeasure, OPTICS(min_samples=numSamples))
        else:
            VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(data, simMeasure, varCaptured, OPTICS(min_samples=numSamples))
        VRCLst.append(VRC)
        dBScoreLst.append(dBScore)
        silhouetteScoreLst.append(silhouetteScore)

    graphResults(x, VRCLst, dBScoreLst, silhouetteScoreLst)

#### Random Forest

The Davies-Bouldin Index and Silhouette Score agree that the default
min_samples of 5 is the best value for the hyperparameter. The Variance
Ratio Criterion continues to rise throughout the entire range tested but
appears to be plateauing. Due to the high computational cost of running
the OPTICS method, most of the runs below use only the default value for
min_samples, which two of the three measures agreed to be the best
hyperparameter value.
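
The "two of three measures agree" reasoning can be sketched as a simple vote. A higher Variance Ratio Criterion and Silhouette Score are better, while a lower Davies-Bouldin Index is better. `pick_by_majority` is a hypothetical helper, and the scores below are placeholder numbers chosen for illustration, not results from this notebook.

```python
from collections import Counter


def pick_by_majority(scores):
    """Pick the setting favoured by most of the three validity measures.

    scores maps a candidate setting to (VRC, Davies-Bouldin, silhouette).
    """
    best_vrc = max(scores, key=lambda k: scores[k][0])         # higher is better
    best_db = min(scores, key=lambda k: scores[k][1])          # lower is better
    best_silhouette = max(scores, key=lambda k: scores[k][2])  # higher is better
    votes = Counter([best_vrc, best_db, best_silhouette])
    winner, n_votes = votes.most_common(1)[0]
    return winner, n_votes


# Placeholder numbers, NOT results from this notebook.
placeholder_scores = {
    5: (10.0, 1.2, 0.30),
    10: (12.0, 1.5, 0.25),
    20: (15.0, 1.8, 0.20),
}
winner, n_votes = pick_by_majority(placeholder_scores)
print(winner, n_votes)
```

If all three measures disagree, every candidate gets one vote and this sketch simply returns the first one counted, so a tie would need a judgment call.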

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, OPTICS(min_samples=10))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, OPTICS(min_samples=20))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(randomForestScale, silhouetteScoreMeasure, OPTICS(min_samples=20))
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(randomForestScale, silhouetteScoreMeasure, .95, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(randomForestScale, mdlLabels)

#### Linear Regression

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels = runClusteringModel(linearRegScale, silhouetteScoreMeasure, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'euclidean'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

In [None]:
warnings.filterwarnings('ignore')
silhouetteScoreMeasure = 'cosine'
VRC, dBScore, silhouetteScore, mdlLabels, numComp = runPCAClusteringModel(linearRegScale, silhouetteScoreMeasure, .95, OPTICS())
printClusteringStats(VRC, dBScore, silhouetteScore, silhouetteScoreMeasure, mdlLabels)

In [None]:
plot_silhouettes(linearRegScale, mdlLabels)

OPTICS did not produce useful clusters. Almost a third of
the instances are classified as noise, and none of the almost 3,300
clusters contains more than one percent of the instances.

None of the clustering methods produced a small number of useful
clusters with good clustering scores. The methods produced either too
many small clusters with a significant number of instances classified as
noise, or a small number of clusters with poor scores.
Therefore, fitting classification models on the resulting clusters would
not be useful.