Run K-means on simulated data

Plot PCA of features

Plot K-means for all features (pairplot)

Plot K-means on PCA-reduced data

Determine the number of clusters from the elbow plot (inertia, automatic)

Choose k visually

Test minimal features

Plot feature importance



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from sklearn import metrics


import os, glob, inspect, sys


currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 
import epri_mc_lib as mc
from importlib import reload
reload(mc)

# K-means clustering with simulated data

Instead of handling the measurement uncertainty mathematically during clustering, we can train the model on data simulated to reflect that uncertainty. This should lead to a similar result by a different route, and it lets us use more classical approaches to clustering and cluster evaluation.

### Import data

This data was simulated with 1000 replicates per condition based on the observed data. Details can be found in the notebook `NB/NB_modeling/sample_generation.ipynb`.
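
As a rough illustration of what "simulated to reflect the uncertainty" can mean, the sketch below draws normally distributed replicates around a per-condition mean using a per-condition error. The helper `simulate_replicates`, the feature names and the toy values are all hypothetical, and the normal distribution is an assumption; the actual procedure is the one in the sample generation notebook.

```python
import numpy as np
import pandas as pd

def simulate_replicates(means, errors, n_replicates=1000, seed=0):
    """Draw normal replicates per condition (hypothetical helper).

    means, errors: DataFrames indexed by condition, one column per feature.
    Returns a DataFrame with n_replicates rows per condition.
    """
    rng = np.random.default_rng(seed)
    frames = []
    for condition in means.index:
        draws = rng.normal(
            loc=means.loc[condition].to_numpy(dtype=float),
            scale=errors.loc[condition].to_numpy(dtype=float),
            size=(n_replicates, means.shape[1]),
        )
        frames.append(
            pd.DataFrame(draws, columns=means.columns,
                         index=[condition] * n_replicates)
        )
    return pd.concat(frames)

# Toy values for illustration only, not measured data
means = pd.DataFrame({"feat_a": [1.0, 2.0], "feat_b": [10.0, 12.0]},
                     index=["T_AR", "T_N"])
errors = pd.DataFrame({"feat_a": [0.1, 0.2], "feat_b": [0.5, 0.5]},
                      index=["T_AR", "T_N"])
simulated = simulate_replicates(means, errors, n_replicates=1000)
print(simulated.groupby(level=0).mean())
```

Each condition label is repeated in the index, which matches the way the later cells use the index to color points by condition.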

In [None]:
data_path = "../../Data/Merged_data"
df = pd.read_csv(os.path.join(data_path, 'ALL_TUBE_PIPE_simulated.csv'), 
                 index_col=0)


### Calculating new values

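The AUC is calculated from the parameters `A`, `B` and `p`, which are then dropped together with the `Absorption_avg_500` and `Absorption_avg_200` columns.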

In [None]:
df["AUC_avg"] = mc.findAUC(df, df['A'], df['B'], df['p'])
df.drop(columns=["A","B","p",'Absorption_avg_500','Absorption_avg_200'],inplace=True)


Optionally, calculate the CF/perm ratio.

In [None]:
#df['CF_perm'] = df['mean_CF']/df['mean_perm'].astype('float64')
# df.drop(columns=["mean_MBN","mean_perm","mean_CF"],inplace=True)

### Scaling values and selecting subsamples
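
The scaling cells below rely on `mc.scale_general`, whose implementation is not shown in this notebook. From the way it is called, it appears to return a scaled dataframe together with the fitted scaler. A minimal sketch of a helper with that shape, under that assumption (the name `scale_frame` is hypothetical), could look like this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_frame(df, scaler):
    """Fit the scaler on df and return (scaled DataFrame, fitted scaler).

    The index and column names are preserved so that later cells can still
    select rows by condition and columns by feature name.
    """
    scaled = scaler.fit_transform(df)
    return pd.DataFrame(scaled, index=df.index, columns=df.columns), scaler

# Toy frame for illustration only
toy = pd.DataFrame({"x": [0.0, 5.0, 10.0], "y": [2.0, 4.0, 6.0]},
                   index=["a", "b", "c"])
scaled_toy, fitted = scale_frame(toy, MinMaxScaler())
print(scaled_toy)
```

Returning the fitted scaler makes it possible to apply the same transformation to other data with `fitted.transform(...)`.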

In [None]:
df_known = df.iloc[8000:,]
scaled_known, scaler_known = mc.scale_general(df_known, MinMaxScaler())

In [None]:
scaled_df = mc.scale_general(df, MinMaxScaler())[0]

tube, pipe, scaled_known, scaled_unknown = mc.get_subsample_df(scaled_df)

In [None]:
scaled_df['CF_perm'] = scaled_df['mean_CF']/scaled_df['mean_perm'].astype('float64')

corr_scaled_df = scaled_df.copy().loc[:,mc.correlation_list]
tube_scaled_corr, pipe_scaled_corr, \
tube_wo_blind_scaled_corr, tube_blind_scaled_corr = mc.get_subsample_df(corr_scaled_df)

In [None]:
mini_scaled_df = scaled_df.copy().loc[:,mc.minimal_informative_features]
tube_scaled_mini, pipe_scaled_mini, \
tube_wo_blind_scaled_mini, tube_blind_scaled_mini = mc.get_subsample_df(mini_scaled_df)

In [None]:
mini_df = df.copy().loc[:,mc.minimal_informative_features]
tube_mini, pipe_mini, \
tube_wo_blind_mini, tube_blind_mini = mc.get_subsample_df(mini_df)

## Visualization of PCA

To see how the measurement uncertainty shapes the distribution of the data, principal component analysis was run on the simulated data. The first two components are plotted, followed by the third and fourth; the last two explain only a small amount of the variation. This is done first for the known tubes.
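
To make the statement about the later components concrete, the sketch below fits a four-component PCA to synthetic data whose variance lies mostly along two directions and reports how the explained variance splits between the first two and the last two components. The data and the mixing matrix are invented for illustration and are unrelated to the tube measurements.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 4 observed features driven by 2 latent directions plus small noise
latent = rng.normal(size=(500, 2)) * np.array([3.0, 1.5])
mixing = np.array([[1.0, 0.5, 0.0, 0.2],
                   [0.0, 1.0, 0.5, 0.1]])
X = latent @ mixing + rng.normal(scale=0.1, size=(500, 4))

pca_toy = PCA(n_components=4, svd_solver="full")
scores = pca_toy.fit_transform(X)
ratio = pca_toy.explained_variance_ratio_

print("PC1 + PC2 share:", ratio[:2].sum())
print("PC3 + PC4 share:", ratio[2:].sum())
```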

In [None]:
pca = PCA(n_components=4, svd_solver='full')
pca.fit(scaled_known)

color_dict = { 'T_AR':'red', 'T_N':'blue', 'T_N_T':'black', 'T_T':'green','T_OT':'purple',
             'T_FF':'grey', 'T_HAZ':'orange', 'T_HAZ_T':'yellow' }

mc.biplot(pca, scaled_known, 0, 1, "PCA biplot, Tubes (Known)", color_dict)

In [None]:
mc.biplot(pca, scaled_known, 2, 3, "PCA biplot, Tubes (Known)", color_dict)

Next we repeat this for the unknown tubes, which are projected using the same PCA fit as the known tubes. There seem to be three samples that cannot be told apart, but the others have possible identifications.
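
Reusing the fit matters because it places the unknown samples on the same axes as the known ones. A small sketch with random toy arrays (not the tube data) shows that `transform` on new data simply applies the centering and components learned from the data used in `fit`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
known = rng.normal(size=(200, 5))     # toy stand-in for the known samples
unknown = rng.normal(size=(50, 5))    # toy stand-in for the unknown samples

pca_shared = PCA(n_components=2, svd_solver="full").fit(known)
known_scores = pca_shared.transform(known)
unknown_scores = pca_shared.transform(unknown)  # same mean and axes as known

# Equivalent manual projection using the attributes learned from `known`
manual = (unknown - pca_shared.mean_) @ pca_shared.components_.T
print(np.allclose(unknown_scores, manual))
```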

In [None]:
# The returned projection is not stored; mc.biplot is given the fitted pca and the data directly
pca.transform(scaled_unknown)

color_dict = { 'T_B1':'red', 'T_B2':'blue', 'T_B3':'black', 'T_B4':'green','T_B5':'purple',
             'T_B6':'grey', 'T_B7':'orange', 'T_B8':'yellow' }

mc.biplot(pca, scaled_unknown, 0, 1, "PCA biplot, Tubes (Unknown)", color_dict)

In [None]:
mc.biplot(pca, scaled_unknown, 2, 3, "PCA biplot, Tubes (Unknown)", color_dict)

Of the blind microstructure samples that were identified from a single measurement in the previous reports, 4 are identified using this method, 1 additional sample is potentially identified, and 1 previously identified sample could not be identified. It should be noted that this is based on the first two principal components alone, so a full model would presumably have more power.

In agreement with previous reports:
* FF=B7
* OT=B8
* N=B4
* HAZ=B6

Identified in previous report but not here:
* AR=B5

Identified here but not in previous reports:
* N_T=B2

## K-means Clustering

### Elbow method

First, we try to find a reasonable k automatically with the classic elbow method. This does not work well here.

In [None]:
min_range = 2
max_range = 8

def plot_elbow_kmeans(feat_norm, title):
    '''
    Elbow plot of K-means inertia against the number of clusters
    Args:
    - feat_norm : scaled pandas dataframe
    - title : title of the figure, ideally corresponding to the samples
    Shows the plot; returns nothing
    '''
    k_list = list(range(min_range, max_range + 1))
    inertia = []

    for k in k_list:
        km = KMeans(n_clusters=k, random_state=0)
        km.fit(feat_norm)
        inertia.append(km.inertia_)

    plt.figure(1, figsize=(10, 6))
    plt.plot(k_list, inertia, 'o')
    plt.plot(k_list, inertia, '-', alpha=0.5)

    plt.xlabel('Number of Clusters', fontsize=20)
    plt.ylabel('Inertia', fontsize=20)
    plt.title(title, fontsize=20)
    plt.show()

In [None]:
plot_elbow_kmeans(tube_wo_blind_scaled_mini, title='Identified tubes minimal features')

In [None]:
plot_elbow_kmeans(tube_wo_blind_scaled_corr, title='Identified tubes selected features')

### Auto find K
Source: https://jtemporal.com/kmeans-and-elbow-method/
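
The automatic choice treats the elbow as the point of the inertia curve that lies farthest from the straight line joining its two ends. Writing $(x_1, y_1)$ and $(x_2, y_2)$ for the first and last $(k, \mathrm{WCSS})$ points and $(x_0, y_0)$ for a candidate point, the distance used in the code is the standard point-to-line formula:

$$
d = \frac{\left| (y_2 - y_1)\,x_0 - (x_2 - x_1)\,y_0 + x_2 y_1 - y_2 x_1 \right|}{\sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}}
$$

The selected k is the $x_0$ with the largest $d$. Because $x$ is a cluster count and $y$ an inertia, the result depends on the relative scale of the two axes.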

In [None]:
from math import sqrt

def calculate_wcss(data):
    '''
    Calculate the within-cluster sum of squares (inertia), which is the
    K-means loss, for each k from min_range to max_range inclusive
    '''
    wcss = []
    for n in range(min_range, max_range + 1):
        kmeans = KMeans(n_clusters=n, random_state=0)
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)

    return wcss

def optimal_number_of_clusters(wcss):
    '''
    Return the k whose point (k, wcss) lies farthest from the straight line
    joining the first and last points of the elbow curve
    '''
    x1, y1 = min_range, wcss[0]
    x2, y2 = max_range, wcss[-1]

    distances = []
    for i in range(len(wcss)):
        x0 = i + min_range
        y0 = wcss[i]
        numerator = abs((y2-y1)*x0 - (x2-x1)*y0 + x2*y1 - y2*x1)
        denominator = sqrt((y2 - y1)**2 + (x2 - x1)**2)
        distances.append(numerator/denominator)

    return distances.index(max(distances)) + min_range
    

In [None]:
# calculating the within clusters sum-of-squares for n cluster amounts
sum_of_squares = calculate_wcss(tube_wo_blind_scaled_corr)
    
# calculating the optimal number of clusters
n = optimal_number_of_clusters(sum_of_squares)
print('Number of clusters (selected features) =', n)

In [None]:
# calculating the within clusters sum-of-squares for n cluster amounts
sum_of_squares = calculate_wcss(tube_wo_blind_scaled_mini)
    
# calculating the optimal number of clusters
n = optimal_number_of_clusters(sum_of_squares)
print('Number of clusters (minimal features) =', n)

### Plot K-Means on sample distribution scatterplot

In [None]:
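def plot_kmeans(df_scaled, df_ori, k):
    '''
    Pairplot of the unscaled features colored by K-means cluster
    Args:
    - df_scaled : scaled pandas dataframe used to fit the model
    - df_ori : unscaled pandas dataframe with the same rows, used for plotting
    - k : number of clusters
    Prints the labels and the silhouette score, then draws the plot
    '''
    model = KMeans(n_clusters=k, random_state=42)
    labels = model.fit_predict(df_scaled)
    print(labels)
    silhouette = metrics.silhouette_score(df_scaled, labels, metric='euclidean')
    print(silhouette)
    # Plot a copy so the caller's dataframe is not modified
    df_plot = df_ori.copy()
    df_plot['labels'] = labels
    sns.pairplot(df_plot, hue='labels')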

In [None]:
plot_kmeans(tube_wo_blind_scaled_mini, tube_wo_blind_mini, 4)
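
`plot_kmeans` prints the silhouette score for a single k. The same score can be compared across several values of k as a complement to the elbow plot. The sketch below does this on synthetic, well-separated blobs from `make_blobs` rather than on the tube data, so the outcome says nothing about which k suits the real measurements.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustration only)
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```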

## Choosing k visually

The elbow method gives relatively low values of k even though more clusters are clearly separable in the PCA visualization. Instead, we plot the PCA and color the points by their cluster assignment for different values of k, to see whether the clustering recovers the actual conditions. With k=6 the model roughly identifies the known clusters.
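
Besides the visual check, the agreement between clusters and conditions can be tabulated. The sketch below is one way to do that, assuming the condition name is stored in the dataframe index as it is for the simulated data; the conditions `cond_a` to `cond_c` and the features `f1`, `f2` are invented toy values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy replicates for three hypothetical conditions, labeled in the index
centers = {"cond_a": (0.0, 0.0), "cond_b": (5.0, 0.0), "cond_c": (0.0, 5.0)}
frames = [
    pd.DataFrame(rng.normal(loc=center, scale=0.3, size=(100, 2)),
                 index=[name] * 100, columns=["f1", "f2"])
    for name, center in centers.items()
]
data = pd.concat(frames)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(data)

# Rows are true conditions, columns are cluster labels, cells are counts
table = pd.crosstab(data.index.to_numpy(), labels,
                    rownames=["condition"], colnames=["cluster"])
print(table)
```

A condition that is cleanly recovered has all of its replicates in a single column, while conditions that share a column are not separated by the clustering.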

In [None]:

pca = PCA(n_components=0.9, svd_solver='full')
pca.fit(scaled_known)

color_dict = { 0:'cyan', 1:'burlywood', 2:'pink', 3:'silver', 4:'khaki', 5:'palegreen', 6:'steelblue', 7:'plum'}

plot_scaled = scaled_known.copy()


In [None]:
model = KMeans(n_clusters = 6, random_state= 42)
model.fit(scaled_known) 
labels = model.labels_
plot_scaled.index = labels
mc.biplot(pca, plot_scaled, 0, 1, "K-means clustering, Tubes (Known)", color=color_dict, plot_vectors=False)

In [None]:
model = KMeans(n_clusters = 7, random_state= 42)
model.fit(scaled_known) 
labels = model.labels_
plot_scaled.index = labels
mc.biplot(pca, plot_scaled, 0, 1, "K-means clustering, Tubes (Known)", color=color_dict, plot_vectors=False)

In [None]:
model = KMeans(n_clusters = 8, random_state= 42)
model.fit(scaled_known) 
labels = model.labels_
plot_scaled.index = labels
mc.biplot(pca, plot_scaled, 0, 1, "K-means clustering, Tubes (Known)", color=color_dict, plot_vectors=False)

## Classify original blind samples with k-means

For this, the original 8 known and 8 unknown tube samples are classified with the model built using the simulated data. This does not take into account the uncertainty of the blind data.

In [None]:
data_path = "../../Data/Merged_data"
df_original = pd.read_csv(os.path.join(data_path, 'ALL_TUBE_PIPE_merge_1.csv'), 
                 index_col=0)
df_original["AUC_avg"] = mc.findAUC(df_original, df_original['A'], df_original['B'], df_original['p'])
df_original.drop(columns=["median_CF","median_perm","median_MBN","A","B","p",'Absorption_avg_500','Absorption_avg_200']+mc.errors_list,inplace=True)
df_original = df_original.iloc[:16,]
df_original.dropna(axis=1, inplace=True)
scaled_original = mc.scale_general(df_original,MinMaxScaler())[0]

In [None]:
model = KMeans(n_clusters = 6, random_state= 42)
model.fit(scaled_known) 
model.predict(scaled_original)

The clusters created are:

* 0: N_T, B2
* 1: N
* 2: AR, HAZ_T, T, B1, B3, B5
* 3: FF, B7
* 4: HAZ, B6, B4
* 5: OT, B8

In agreement with previous reports:
* FF=B7
* OT=B8
* HAZ=B6

Identified in previous report but not here:
* AR=B5

Identified here but not in previous reports:
* N_T=B2

Problems:
* B4 corresponds to N but its uncertainty overlaps with HAZ and is incorrectly grouped there

This method handles the uncertainty of the training data well but does not handle the uncertainty of the prediction data for the blind tubes, which causes a misclassification. We may instead need a way to compare two sample distributions. Reducing the uncertainty would also make the clusters easier to separate.
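
One possible way to bring the uncertainty of a blind sample into the prediction, sketched here under the assumption that simulated replicates of the blind sample are available, is to predict a cluster for every replicate and look at the fraction assigned to each cluster instead of a single label. This was not done in the analysis above, and the data below are toy values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy training replicates for two known conditions
train = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(500, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(500, 2)),
])
model_toy = KMeans(n_clusters=2, n_init=10, random_state=42).fit(train)

# Toy replicates of one blind sample lying close to the second condition
blind = rng.normal([2.6, 0.0], 0.5, size=(1000, 2))

# Fraction of blind replicates assigned to each cluster
votes = pd.Series(model_toy.predict(blind)).value_counts(normalize=True)
majority_cluster = votes.idxmax()
print(votes)
```

A blind sample whose replicates split between two clusters would show up as two comparable fractions, which a single prediction hides.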

## A minimal feature set

Since many of the features are correlated and contribute similarly to the PCA, this section tries to find a minimal set that recreates the same result. Features were dropped one at a time and the effect on the principal components was observed. The features with minimal effect on the components were removed.
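
The selection described above was made by inspection. As an illustration only, the sketch below shows one way such a drop-one-feature comparison could be quantified: the PCA scores of the reduced data are correlated with those of the full data, and a value near 1 means the component barely changed. The toy features `f1`, `f1_copy` and `f2` are invented, and this metric is an assumption, not the criterion used for the tube data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
toy = pd.DataFrame({
    "f1": base[:, 0],
    "f1_copy": base[:, 0] + rng.normal(scale=0.05, size=300),  # nearly redundant
    "f2": 0.5 * base[:, 1],                                    # independent
})

def pc_scores(frame, n_components=2):
    """Project the rows of frame onto their first principal components."""
    return PCA(n_components=n_components, svd_solver="full").fit_transform(frame)

reference = pc_scores(toy)

effect = {}
for col in toy.columns:
    reduced = pc_scores(toy.drop(columns=col))
    # Absolute correlation per component; the sign of a component is arbitrary
    corr = [abs(np.corrcoef(reference[:, i], reduced[:, i])[0, 1])
            for i in range(2)]
    effect[col] = min(corr)

print(effect)
```

In this toy case, dropping the redundant copy leaves both components nearly unchanged, whereas dropping the independent feature removes the second component.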

In [None]:
minimal_features = ["TEP_mean_uV_C","Absorption_avg_50","mean_perm","AUC_avg","backscatter_avg"]
minimal_set_known = scaled_known[minimal_features].copy()
minimal_original = scaled_original[minimal_features]


In [None]:
pca = PCA(n_components=4, svd_solver='full')
pca.fit(minimal_set_known)

color_dict = { 0:'cyan', 1:'burlywood', 2:'pink', 3:'silver', 4:'khaki', 5:'palegreen', 6:'steelblue', 7:'plum'}

plot_scaled = minimal_set_known.copy()

model = KMeans(n_clusters = 6, random_state= 42)
model.fit(minimal_set_known) 
labels = model.labels_
plot_scaled.index = labels
mc.biplot(pca, plot_scaled, 0, 1, "Minimal K-means clustering, Tubes (Known)", color_dict)

In [None]:
model.predict(minimal_original)

The clusters are:

* 0: B1, B3, B5, AR, HAZ_T, T
* 1: B4, B6, HAZ
* 2: B7, FF
* 3: B2, N_T
* 4: N
* 5: B8, OT

These groupings are identical to those of the full model; only the arbitrary cluster numbering differs. This may represent a minimal feature set that contains the majority of the information.


## Feature importance

The first two principal components (and especially the first) explain the majority of the variance. The first component is largely made up of TEP and permeability. All the features except backscatter contribute to the second component and backscatter contributes to the third component.
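
The importance table computed in the cells below weights the absolute loadings by the explained variance. With $r_i$ the explained variance ratio of component $i$ and $w_{ij}$ the loading of feature $j$ on that component, each entry is:

$$
I_{ij} = r_i \, \lvert w_{ij} \rvert
$$

This is a simple heuristic for ranking features, so the values are best read as relative contributions rather than as an absolute measure.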

In [None]:
pca.explained_variance_ratio_

In [None]:
plt.bar(["PC1","PC2","PC3","PC4"],pca.explained_variance_ratio_, align='center', alpha=0.5, color="gray")
plt.ylim(0,1)
plt.ylabel("Explained variance")

In [None]:
pca.components_

In [None]:
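# Weight the absolute loadings of each component by its explained variance ratio
feature_importance = pd.DataFrame(
    [ratio * abs(loadings) for ratio, loadings in zip(pca.explained_variance_ratio_, pca.components_)],
    columns=minimal_set_known.columns,
    index=["PC1","PC2","PC3","PC4"],
)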

In [None]:
feature_importance

In [None]:
plot_feat_imp = feature_importance.transpose().sort_values('PC1', ascending=False).transpose()

In [None]:
plot_feat_imp.plot(kind='barh', color=sns.color_palette('PuBu_r', 5, desat=0.9), width=0.6, figsize=(6,6))
plt.xlabel('Feature importance (explained variance ratio)', fontsize = 15)

In [None]:
N = 4

fig, ax = plt.subplots()

ind = np.arange(N)    # the x locations for the groups
width = 0.15         # the width of the bars
n_features = feature_importance.shape[1]

pca_components = ["PC1","PC2","PC3","PC4"]

for i in range(n_features):
    ax.bar(ind + width*i, feature_importance.iloc[:,i], width, label=feature_importance.columns[i])

# Center each tick under its group of bars
ax.set_xticks(ind + width * (n_features - 1) / 2)
ax.set_xticklabels(pca_components)
plt.ylim(0,1)
plt.ylabel('PCA components scaled by explained variance')
ax.legend()

In [None]:
feature_importance.iloc[:,i]