### Import packages

In [1]:
import imlreliability
import pandas as pd
import numpy as np



### Dimension Reduction 

#### Load data
We use the WDBC (Wisconsin Diagnostic Breast Cancer) data as an example for the following sections. The data has 569 observations, 30 features, and 2 oracle clusters. We scale and normalize the data as pre-processing steps.

In [2]:
from sklearn.model_selection import train_test_split
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
data=data.dropna()
y=(data[1])
x = data.drop(columns=[0,1]).to_numpy()
K=len(set(y))
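
The loading cell above does not show the scaling step itself. As a minimal sketch, assuming the scaling is done by the user before the data is passed on, one could standardize each feature with scikit-learn's ``StandardScaler``. The array ``x_demo`` is a synthetic stand-in for ``x`` and is not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x: three features on very different scales.
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

# Center each column to mean 0 and rescale it to unit variance.
x_scaled = StandardScaler().fit_transform(x_demo)
```

Standardizing matters for PCA because features with large variance would otherwise dominate the leading components.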

The dimension reduction estimator is assumed to implement the scikit-learn estimator interface.  We propose two data perturbation approaches: 

    1. noise addition and 
    2. data splitting. 


In addition, we measure the reliability of dimension reduction techniques from two aspects: 

    1. Consistency of clustering results on the reduced dimensions, 
    2. Consistency of local neighborhood. 
    
We will show examples under each scenario in the following sections.
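
To make the two perturbation approaches concrete, here is a minimal sketch of what each one does to a data matrix. The helper names ``perturb_with_noise`` and ``perturb_with_split`` are hypothetical and are not part of ``imlreliability``; the package may implement these steps differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def perturb_with_noise(x, sigma=1.0, seed=None):
    """Return a copy of x with i.i.d. N(0, sigma^2) noise added to every entry."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)


def perturb_with_split(x, test_size=0.5, seed=None):
    """Randomly split the rows of x into two disjoint subsets."""
    idx = np.arange(x.shape[0])
    idx_a, idx_b = train_test_split(idx, test_size=test_size, random_state=seed)
    return x[idx_a], x[idx_b], idx_a, idx_b


# Small synthetic matrix used only for illustration.
x_demo = np.arange(40, dtype=float).reshape(20, 2)
x_noisy = perturb_with_noise(x_demo, sigma=1.0, seed=0)
x_a, x_b, idx_a, idx_b = perturb_with_split(x_demo, test_size=0.5, seed=0)
```

Noise addition keeps every observation and changes its values, while data splitting keeps the values and changes which observations the estimator sees.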


### 1. Noise Addition


###  1.1 Clustering consistency on the reduced dimensions

We wish to evaluate the interpretation reliability of PCA using the ``PCA()`` estimator of ``sklearn``. We measure the consistency of K-Means clustering on the first two reduced dimensions (``rank=2``). Here we perturb the data with noise addition by setting ``perturbation = 'noise'``. We add normal noise with mean 0 and standard deviation 1 by setting ``noise_type='normal'`` and ``sigma=1``. For illustration purposes, we run 3 repeats.



In [3]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
esti=PCA()

model = imlreliability.dimension_reduction.dimension_reduction(
    x,
    estimator=esti,
    K=len(set(y)),
    label=y,
    perturbation='noise',
    rank=2,
    sigma=1,
    noise_type='normal',
    n_repeat=3,
    rand_index=1,
    verbose=True,
)

model.fit()


Iter:  0
Iter:  1
Iter:  2


The ``.get_consistency_clustering`` method performs clustering on the reduced dimensions. Here we use the ``sklearn`` estimator ``KMeans()`` as an example. The summary pandas dataframe ``results`` includes model details, clustering accuracy (if the true labels are provided), and clustering consistency measured by different criteria. 

The ``results`` pandas dataframe can be downloaded and uploaded to the dashboard. 
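
The idea behind clustering consistency can be sketched as follows: cluster each perturbed embedding, then average a label-agreement score over all pairs of repeats. This is an illustration under assumptions, not the package's implementation. The data is synthetic (``make_blobs``), and the use of ``adjusted_mutual_info_score`` for the "Mutual Information" criterion is a guess.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    fowlkes_mallows_score,
    v_measure_score,
)

# Synthetic stand-in for the real data.
x_demo, _ = make_blobs(n_samples=200, n_features=10, centers=2, random_state=0)
rng = np.random.default_rng(0)

# One cluster labeling per noise-perturbed repeat.
labelings = []
for _ in range(3):
    x_noisy = x_demo + rng.normal(0.0, 1.0, size=x_demo.shape)
    embedding = PCA(n_components=2).fit_transform(x_noisy)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
    labelings.append(labels)

criteria = {
    "ARI": adjusted_rand_score,
    "Fowlkes Mallows Score": fowlkes_mallows_score,
    "Mutual Information": adjusted_mutual_info_score,  # assumed variant
    "V Measure Score": v_measure_score,
}

# Consistency = mean pairwise agreement between repeats, per criterion.
consistency = {
    name: float(np.mean([score(a, b) for a, b in combinations(labelings, 2)]))
    for name, score in criteria.items()
}
```

All four scores are invariant to permutations of the cluster labels, so the repeats can be compared directly without matching cluster IDs.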

###### 1.1.1 With true labels 

In [4]:
model.get_consistency_clustering('WDBC','PCA',KMeans(n_clusters=4),'KM')
print(model.results)

# model.results.to_csv('dr_clus_new_km_noise.csv')

   data method perturbation clustering   noise  sigma  rank  \
0  WDBC    PCA        noise         KM  normal      1     2   
1  WDBC    PCA        noise         KM  normal      1     2   
2  WDBC    PCA        noise         KM  normal      1     2   
3  WDBC    PCA        noise         KM  normal      1     2   

                criteria  Accuracy  Consistency  
0                    ARI  0.321000     0.431333  
1  Fowlkes Mallows Score  0.593667     0.578000  
2     Mutual Information  0.381667     0.465000  
3        V Measure Score  0.383667     0.468000  


###### 1.1.2 Without true labels 

In [7]:
model2 = imlreliability.dimension_reduction.dimension_reduction(
    x,
    estimator=esti,
    K=len(set(y)),
    label=None,
    perturbation='noise',
    rank=2,
    sigma=1,
    noise_type='normal',
    n_repeat=3,
    rand_index=1,
    verbose=True,
)

model2.fit()
model2.get_consistency_clustering('WDBC','PCA',KMeans(n_clusters=4),'KM')
print(model2.results)

Iter:  0
Iter:  1
Iter:  2
   data method perturbation clustering   noise  sigma  rank  \
0  WDBC    PCA        noise         KM  normal      1     2   
1  WDBC    PCA        noise         KM  normal      1     2   
2  WDBC    PCA        noise         KM  normal      1     2   
3  WDBC    PCA        noise         KM  normal      1     2   

                criteria  Consistency  Accuracy  
0                    ARI     0.431333       NaN  
1  Fowlkes Mallows Score     0.578000       NaN  
2     Mutual Information     0.465000       NaN  
3        V Measure Score     0.468000       NaN  


###  1.2 Local neighborhood consistency of the reduced dimensions


The ``.get_consistency_knn`` method measures local neighborhood consistency of the reduced dimensions. It constructs a ``NN-Jaccard-AUC`` pandas dataframe (stored in ``model.AUC``) that includes model details and ``NN-Jaccard-AUC`` consistency scores. The resulting dataframe can be downloaded and uploaded to the dashboard. 
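
One plausible reading of a nearest-neighbor Jaccard AUC score is sketched below: for each observation, compare its k-nearest-neighbor sets in two embeddings with the Jaccard index, average over observations, and then integrate the resulting curve over a range of k. The helper names, the grid of k values, and the normalization are assumptions for illustration; the package may compute the score differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_sets(embedding, k):
    """k-nearest-neighbor index set of every point (the point itself is excluded)."""
    nn = NearestNeighbors(n_neighbors=k).fit(embedding)
    idx = nn.kneighbors(return_distance=False)
    return [set(row.tolist()) for row in idx]


def mean_jaccard(emb_a, emb_b, k):
    """Average Jaccard index between the k-NN sets of matching points."""
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]))


def jaccard_auc(emb_a, emb_b, ks):
    """Trapezoidal area under the Jaccard-vs-k curve, normalized to [0, 1]."""
    ks = np.asarray(ks, dtype=float)
    scores = np.array([mean_jaccard(emb_a, emb_b, int(k)) for k in ks])
    area = np.sum((scores[1:] + scores[:-1]) / 2.0 * np.diff(ks))
    return float(area / (ks[-1] - ks[0]))


# Synthetic 2-D embeddings: an original and a noise-perturbed copy.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 2))
emb_noisy = emb + rng.normal(scale=0.3, size=emb.shape)

ks = [5, 10, 20, 40]
auc_same = jaccard_auc(emb, emb.copy(), ks)
auc_noisy = jaccard_auc(emb, emb_noisy, ks)
```

Identical embeddings give a score of 1, and the score drops as the perturbation reshuffles each point's local neighborhood.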

In [8]:
## Nearest neighbor 
model.get_consistency_knn('WDBC','PCA')
# print(model.consistency_knn_mean)
print(model.AUC)

   data method   noise  sigma  rank criteria  Consistency
0  WDBC    PCA  normal      1     2  Jaccard     0.565961


### 2. Data Splitting

We conduct the reliability test with data-splitting perturbation by simply setting ``perturbation = 'split'``; all other code stays the same.

In [10]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
esti=PCA()

model = imlreliability.dimension_reduction.dimension_reduction(
    x,
    estimator=esti,
    K=len(set(y)),
    label=y,
    perturbation='split',
    rank=2,
    n_repeat=3,
    rand_index=1,
    verbose=True,
)

model.fit()
model.get_consistency_clustering('WDBC2','PCA',KMeans(n_clusters=4),'KM')
print(model.results)

# model.results.to_csv('dr_clus_new_km_split.csv')

Iter:  0
Iter:  1
Iter:  2
    data method perturbation clustering   noise  sigma  rank  \
0  WDBC2    PCA        split         KM  normal      1     2   
1  WDBC2    PCA        split         KM  normal      1     2   
2  WDBC2    PCA        split         KM  normal      1     2   
3  WDBC2    PCA        split         KM  normal      1     2   

                criteria  Accuracy  Consistency  
0                    ARI  0.357667     0.741333  
1  Fowlkes Mallows Score  0.620333     0.809333  
2     Mutual Information  0.429000     0.760000  
3        V Measure Score  0.431333     0.763333  
