### Import packages

In [1]:
import imlreliability
import pandas as pd
import numpy as np



In [2]:
dir(imlreliability)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_version',
 'clustering',
 'dimension_reduction',
 'feature_importance']

### Clustering 

In [3]:
dir(imlreliability.clustering)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_clustering',
 'clustering',
 'util_clustering']

Reliability tests of clustering techniques can be performed with the ``imlreliability.clustering`` module.

#### Load data
We use the WDBC Breast Cancer Wisconsin data as an example for the following sections. The data has 569 observations, 30 features, and 2 oracle clusters. We scale and normalize the data as pre-processing steps.

In [4]:
from sklearn.model_selection import train_test_split
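# Columns: 0 = sample ID, 1 = diagnosis (M/B), remaining 30 columns = features
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
data = data.dropna()
y = data[1]  # diagnosis labels, used as the oracle clusters
x = data.drop(columns=[0, 1]).to_numpy()  # feature matrix without ID and label
K = len(set(y))  # number of oracle clusters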
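
The loading cell does not itself scale or normalize the features, and this notebook does not show where that happens (the ``norm=True`` argument used in later cells may handle it, but that is an assumption). As a rough sketch of what such pre-processing could look like with scikit-learn, the block below standardizes each feature and then rescales each observation to unit length. It uses a synthetic matrix, ``x_demo``, as a stand-in for the real data; the choice of ``StandardScaler`` followed by L2 row normalization is illustrative, not the package's documented behavior.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Synthetic stand-in for the 569 x 30 WDBC feature matrix (not the real data).
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=5.0, scale=2.0, size=(569, 30))

# Step 1: standardize each feature (column) to zero mean and unit variance.
x_scaled = StandardScaler().fit_transform(x_demo)

# Step 2: rescale each observation (row) to unit L2 norm.
x_norm = normalize(x_scaled, norm="l2")

print(x_scaled.mean(axis=0)[:3])            # column means after scaling
print(np.linalg.norm(x_norm, axis=1)[:3])   # row norms after normalization
```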

The clustering estimator is assumed to implement the scikit-learn estimator interface. We propose two data perturbation approaches:

1. noise addition, and
2. data splitting.

We show examples of both scenarios in the following sections.
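
To illustrate what "implementing the scikit-learn estimator interface" means in practice, the sketch below runs two scikit-learn clustering estimators through the same ``fit_predict`` call. Which estimator methods ``imlreliability`` actually invokes is not stated in this notebook, so treating ``fit_predict`` as the relevant entry point is an assumption; the toy data ``x_demo`` is synthetic.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# Both estimators expose the same fit_predict(X) method,
# so they can be swapped without changing the calling code.
estimators = {
    "K-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
}

labels_by_method = {}
for name, est in estimators.items():
    labels_by_method[name] = est.fit_predict(x_demo)
    print(name, np.bincount(labels_by_method[name]))
```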

### 1. Noise Addition

We wish to measure the interpretation reliability of K-Means by perturbing the data with noise addition (``perturbation = 'noise'``). Here we add normal noise with mean 0 and standard deviation 1 by setting ``noise_type='normal'`` and ``sigma=1``. For illustration purposes, we run 3 repeats.

The ``.get_consistency`` function produces a summary pandas dataframe, ``results``, which includes model details, clustering accuracy (if the true labels are provided), and clustering consistency measured by different criteria.

The ``results`` pandas dataframe can be downloaded and uploaded to the dashboard.

In [6]:
from sklearn.cluster import KMeans
esti_km = KMeans(n_clusters=K,init='k-means++')
# from sklearn.cluster import AgglomerativeClustering


model_km = imlreliability.clustering.clustering(data=x,estimator=esti_km,K=len(set(y)),
                 label=y,
                 perturbation = 'noise',
                 sigma=1,
                 noise_type='normal',
                 n_repeat=3,
                 norm=True,
                 rand_index=1,
                 verbose=True)

model_km.fit()
model_km.get_consistency('WDBC',method_name='K-means')
print(model_km.results)
####################### 
# model_km.results.to_csv('clus_new_km_noise.csv')

Iter:  0
<built-in method normal of numpy.random.mtrand.RandomState object at 0x7fa59d5f7258>
Iter:  1
<built-in method normal of numpy.random.mtrand.RandomState object at 0x7fa59d5f7258>
Iter:  2
<built-in method normal of numpy.random.mtrand.RandomState object at 0x7fa59d5f7258>
noise
noise
noise
   data   method perturbation   noise  sigma               criteria  Accuracy  \
0  WDBC  K-means        noise  normal      1                    ARI  0.642333   
1  WDBC  K-means        noise  normal      1  Fowlkes Mallows Score  0.829667   
2  WDBC  K-means        noise  normal      1     Mutual Information  0.529333   
3  WDBC  K-means        noise  normal      1        V Measure Score  0.529667   

   Consistency  
0     0.762667  
1     0.885000  
2     0.655333  
3     0.655333  
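
For intuition about what a noise-based consistency score can capture, the following is a minimal sketch, not the package's implementation. It assumes consistency is the average pairwise agreement (here ARI) between clusterings fitted on independently perturbed copies of the data; the synthetic ``x_demo`` and the pairwise-averaging scheme are both assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic two-cluster data (stand-in for the real features).
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),
    rng.normal(4.0, 1.0, size=(100, 5)),
])

sigma = 1.0    # standard deviation of the added normal noise
n_repeat = 3   # number of perturbed copies

labelings = []
for i in range(n_repeat):
    # Perturb every entry with independent N(0, sigma^2) noise.
    noisy = x_demo + rng.normal(0.0, sigma, size=x_demo.shape)
    km = KMeans(n_clusters=2, n_init=10, random_state=i)
    labelings.append(km.fit_predict(noisy))

# Assumed definition: mean ARI over all pairs of repeats.
pair_scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
consistency = float(np.mean(pair_scores))
print(round(consistency, 3))
```

A value near 1 would indicate that the cluster assignments barely change under the added noise; lower values indicate that the clustering is sensitive to it.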


#### Without labels

In [7]:
model_km2 = imlreliability.clustering.clustering(data=x,estimator=esti_km,K=len(set(y)),
                 label=None,
                 perturbation = 'noise',
                 sigma=1,
                 noise_type='normal',
                 n_repeat=2,
                 norm=True,
                 stratify=True,
                 rand_index=1,
                 verbose=True)

model_km2.fit()
model_km2.get_consistency('WDBC',method_name='K-means')
print(model_km2.results)
####################### 

Iter:  0
<built-in method normal of numpy.random.mtrand.RandomState object at 0x7fa59d5f7258>
Iter:  1
<built-in method normal of numpy.random.mtrand.RandomState object at 0x7fa59d5f7258>
noise
   data   method perturbation   noise  sigma               criteria  \
0  WDBC  K-means        noise  normal      1                    ARI   
1  WDBC  K-means        noise  normal      1  Fowlkes Mallows Score   
2  WDBC  K-means        noise  normal      1     Mutual Information   
3  WDBC  K-means        noise  normal      1        V Measure Score   

   Consistency  Accuracy  
0        0.720       NaN  
1        0.864       NaN  
2        0.609       NaN  
3        0.609       NaN  
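
The four criteria named in the ``results`` table have counterparts in ``sklearn.metrics``. The sketch below shows how such scores can be computed between two labelings, and that they depend only on the partition, not on the cluster IDs. Which mutual information variant the package reports is not stated here, so ``normalized_mutual_info_score`` is an assumption; ``criteria`` is a hypothetical helper introduced for this example.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
    v_measure_score,
)


def criteria(labels_1, labels_2):
    """Hypothetical helper: agreement between two labelings under four criteria."""
    return {
        "ARI": adjusted_rand_score(labels_1, labels_2),
        "Fowlkes Mallows Score": fowlkes_mallows_score(labels_1, labels_2),
        "Mutual Information": normalized_mutual_info_score(labels_1, labels_2),
        "V Measure Score": v_measure_score(labels_1, labels_2),
    }


labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [1, 1, 1, 0, 0, 0]  # same partition as labels_a, IDs swapped
labels_c = [0, 0, 1, 1, 1, 1]  # a genuinely different partition

scores_same = criteria(labels_a, labels_b)
scores_diff = criteria(labels_a, labels_c)
print(scores_same)
print(scores_diff)
```

Because the scores compare partitions rather than label names, they can be used both for accuracy (clustering versus true labels) and for consistency (clustering versus clustering).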


### 2. Data splitting

We again measure the interpretation reliability of K-Means, but change the perturbation approach to stratified data splitting by setting ``perturbation = 'split'`` and ``stratify=True``.

The ``.get_consistency`` function produces a summary pandas dataframe, ``results``, which includes model details, clustering accuracy (if the true labels are provided), and clustering consistency measured by different criteria.

The ``results`` pandas dataframe can be downloaded and uploaded to the dashboard.

In [8]:
from sklearn.cluster import KMeans
esti_km = KMeans(n_clusters=K,init='k-means++')

model_km3 = imlreliability.clustering.clustering(data=x,estimator=esti_km,K=len(set(y)),
                 label=y,
                 perturbation = 'split',
                 sigma='NA',
                 noise_type='NA',
                 n_repeat=3,
                 norm=True,
                 stratify=True,
                 rand_index=1,
                 verbose=True)

model_km3.fit()
model_km3.get_consistency('WDBC',method_name='K-means')
print(model_km3.results)
####################### 
# model_km3.results.to_csv('clus_new_km_split.csv')

Iter:  0
Iter:  1
Iter:  2
split
split
split
   data   method perturbation noise sigma               criteria  Accuracy  \
0  WDBC  K-means        split    NA    NA                    ARI  0.714667   
1  WDBC  K-means        split    NA    NA  Fowlkes Mallows Score  0.865333   
2  WDBC  K-means        split    NA    NA     Mutual Information  0.597333   
3  WDBC  K-means        split    NA    NA        V Measure Score  0.598333   

   Consistency  
0     0.971000  
1     0.986333  
2     0.944333  
3     0.944333  
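
As with noise addition, a small sketch may help convey the idea behind split-based consistency. The version below is an assumption about one reasonable scheme, not the package's documented algorithm: each repeat draws a stratified 70% subsample, K-Means is fitted on it, and pairs of repeats are compared by ARI on the observations they share. The data ``x_demo``, labels ``y_demo``, and the 70% fraction are all illustrative choices.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data with unequal class sizes (illustrative only).
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(120, 5)),
    rng.normal(4.0, 1.0, size=(80, 5)),
])
y_demo = np.array([0] * 120 + [1] * 80)
idx_all = np.arange(len(x_demo))

runs = []
for i in range(3):
    # Stratified subsample: class proportions of y_demo are preserved.
    idx_sub, _ = train_test_split(
        idx_all, train_size=0.7, stratify=y_demo, random_state=i
    )
    labels = KMeans(n_clusters=2, n_init=10, random_state=i).fit_predict(x_demo[idx_sub])
    # Remember which cluster each sampled observation was assigned to.
    runs.append(dict(zip(idx_sub.tolist(), labels.tolist())))

pair_scores = []
for run_a, run_b in combinations(runs, 2):
    # Compare only observations present in both subsamples.
    shared = sorted(set(run_a) & set(run_b))
    pair_scores.append(
        adjusted_rand_score([run_a[j] for j in shared], [run_b[j] for j in shared])
    )

consistency = float(np.mean(pair_scores))
print(round(consistency, 3))
```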
