<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Structure-cheat-sheet" data-toc-modified-id="Structure-cheat-sheet-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Structure cheat sheet</a></span></li><li><span><a href="#Data-structure" data-toc-modified-id="Data-structure-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data structure</a></span></li><li><span><a href="#get-features" data-toc-modified-id="get-features-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>get features</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Scaling" data-toc-modified-id="Scaling-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Scaling</a></span></li><li><span><a href="#PCA" data-toc-modified-id="PCA-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>PCA</a></span></li></ul></li><li><span><a href="#Unsupervised-Clustering" data-toc-modified-id="Unsupervised-Clustering-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Unsupervised Clustering</a></span></li></ul></div>

# Basics

## Structure cheat sheet

1. func: train data load (in the following order)
    1. read the descriptive dataframe from the feature pipeline
    2. extract the features from the feature objects that the dataframe labels as train set
    3. create a numpy feature array for the processing pipeline
2. preprocessing
    1. Transformation (any combination of the following)
        + log-transform
        + PCA
        + others
    2. Scaling (one of the following)
        + StandardScaler
        + MinMaxScaler
3. Unsupervised Clustering
    1. Estimate initial hyperparameter
    2. Create grid over various hyperparameters
    3. Train all and choose the best according to metric
    
    
In all steps, the cluster-recorder object (possibly a dataframe row) records the meta-information, such as the hyperparameters.
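
The steps above can be sketched end to end as follows. This is a minimal sketch on synthetic data, not the project's pipeline: the arrays, the grid values, the ROC AUC metric and the `records` list (standing in for the cluster-recorder) are all assumptions made for illustration. Scaling is placed before PCA here, as in the preprocessing cells further down.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the feature arrays (hypothetical, not the real features)
rng = np.random.RandomState(0)
X_train = rng.lognormal(size=(500, 20))                   # normal samples only
X_test = np.r_[rng.lognormal(size=(100, 20)),             # normal
               rng.lognormal(mean=1.5, size=(100, 20))]   # abnormal
y_test = np.r_[np.zeros(100), np.ones(100)]               # 1 = abnormal

records = []  # plays the role of the cluster-recorder: one dict per run
grid = ParameterGrid({"pca__n_components": [5, 10],
                      "model__contamination": [0.05, 0.1]})
for params in grid:
    pipe = Pipeline([
        ("log", FunctionTransformer(np.log1p)),   # log-transform
        ("scale", StandardScaler()),              # scaling
        ("pca", PCA()),                           # dimensionality reduction
        ("model", EllipticEnvelope(random_state=0)),
    ]).set_params(**params)
    pipe.fit(X_train)
    # decision_function is high for inliers, so negate it to score abnormality
    auc = roc_auc_score(y_test, -pipe.decision_function(X_test))
    records.append({**params, "roc_auc": auc})

# choose the best run according to the metric
best = max(records, key=lambda r: r["roc_auc"])
```

Each entry of `records` keeps the hyperparameters next to the score, so the list can be turned into a dataframe with `pd.DataFrame(records)` if a row-per-run recorder is preferred.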

## Data structure

There are multiple degrees of freedom in the data:

1. Signal to noise ratio (SNR)
2. Machine type
    1. pump
    2. fan
    3. valve (solenoid)
    4. slider
3. Machine ID
    1. four different machine IDs
    
The pipeline is applied to one fixed combination of SNR, machine type and machine ID at a time.
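
Fixing the three degrees of freedom amounts to filtering the descriptive dataframe. A minimal sketch, assuming the column names `SNR`, `machine`, `ID` and `abnormal` seen in the dataframe output further down; the toy rows and the helper `select_slice` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the descriptive dataframe
df = pd.DataFrame({
    "SNR": ["6dB", "6dB", "6dB", "6dB"],
    "machine": ["pump", "pump", "fan", "pump"],
    "ID": ["00", "02", "00", "00"],
    "abnormal": [0, 0, 0, 1],
})

def select_slice(df, SNR, machine, ID):
    """Return the rows belonging to one (SNR, machine, ID) combination."""
    mask = (df["SNR"] == SNR) & (df["machine"] == machine) & (df["ID"] == ID)
    return df[mask]

subset = select_slice(df, SNR="6dB", machine="pump", ID="00")
```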

In [1]:
import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from tqdm import tqdm
sns.set()

BASE_FOLDER = '../../'
%run -i ..\..\utility\feature_extractor\JupyterLoad_feature_extractor.py
%run -i ..\..\utility\modeling\JupyterLoad_modeling.py

load feature_extractor_mother
load feature_extractor_mel_spectra
load feature_extractor_psd
load feature_extractore_pre_nnFilterDenoise


## get features

Get the descriptive dataframe for the features.

The descriptive dataframe contains all IDs of the pump. We focus on ID '00' for now, since the modeling phase is separated per SNR, per machine and per ID anyway.

In [15]:
# raw string so the backslashes in the Windows path are not read as escape sequences
path_descr = r'.\..\..\dataset\MEL_to_Pandas\data_6dB_pump\FEpandas_MELv1_nm80_ch0.pkl'
ID = '00'
# loading time feature extractor: 4:37
# skip the slow load if the data is already in the kernel
if ('df_descr' in dir()) and ('data_train' in dir()):
    pass
else:
    df_descr, data_train = load_data(path_descr, feat={'function':'frame', 'frames':5}, feat_col='MELv1_nm80_ch0', SNR='6dB', machine='pump', ID=ID, train_set=True, BASE_FOLDER=BASE_FOLDER)

100%|██████████| 863/863 [00:01<00:00, 636.04it/s]


In [20]:
data_train.shape

(266667, 400)
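
The 400 columns are consistent with `frames=5` windows over 80 mel bands (5 × 80 = 400, with the band count read from the `nm80` in the column name), and 266667 rows over 863 files give 309 rows per file. The sketch below shows one way such a feature could be built. It is an assumption about what `load_data` does internally; `stack_frames` and the 313-frame spectrogram length are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def stack_frames(mel, frames=5):
    """Flatten every run of `frames` consecutive spectrogram columns into one row."""
    # mel: (n_mels, n_time) -> windows: (n_mels, n_time - frames + 1, frames)
    windows = sliding_window_view(mel, window_shape=frames, axis=1)
    # reorder to (n_windows, frames, n_mels), then flatten the last two axes
    return windows.transpose(1, 2, 0).reshape(windows.shape[1], -1)

mel = np.random.rand(80, 313)   # assumed: 80 mel bands, 313 time frames per file
features = stack_frames(mel, frames=5)
print(features.shape)           # (309, 400)
```

With a stride of one frame, a 313-frame file yields 313 - 5 + 1 = 309 rows, which matches the per-file row count derived above.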

In [17]:
df_descr.shape

(266667, 10)

In [21]:
df_test

Unnamed: 0,path,abnormal,ID,file,machine,SNR,MELv1_nm80_ch0,train_set,file_idx,frame
0,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,0,00,00000011,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,11,0
1,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,0,00,00000011,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,11,1
2,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,0,00,00000011,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,11,2
3,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,0,00,00000011,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,11,3
4,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,0,00,00000011,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,11,4
...,...,...,...,...,...,...,...,...,...,...
88369,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,1,00,00000142,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,1148,304
88370,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,1,00,00000142,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,1148,305
88371,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,1,00,00000142,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,1148,306
88372,d:\Capstone\NF_Prj_MIMII_Dataset\dataset\6dB\p...,1,00,00000142,pump,6dB,\dataset\MEL_to_Pandas\data_6dB_pump\MELv1_nm8...,0,1148,307


In [22]:
# raw string so the backslashes in the Windows path are not read as escape sequences
path_descr = r'.\..\..\dataset\MEL_to_Pandas\data_6dB_pump\FEpandas_MELv1_nm80_ch0.pkl'
ID = '00'
# skip the slow load if the test data is already in the kernel
if ('df_test' in dir()) and ('data_test' in dir()):
    pass
else:
    df_test, data_test = load_data(path_descr, feat={'function':'frame', 'frames':5}, feat_col='MELv1_nm80_ch0', SNR='6dB', machine='pump', ID=ID, train_set=False, BASE_FOLDER=BASE_FOLDER)

## Preprocessing

### Scaling

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
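
The cell above fits the scaler on the train data only and reuses its statistics for the test data. A minimal sketch of that pattern on toy data, with `MinMaxScaler` shown as the alternative from the cheat sheet; the arrays are synthetic and only illustrate the behaviour.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))   # toy data
X_test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

std = StandardScaler().fit(X_train)     # learns mean and std from train only
X_train_std = std.transform(X_train)    # exactly zero mean, unit variance
X_test_std = std.transform(X_test)      # reuses the train statistics

mm = MinMaxScaler().fit(X_train)        # learns per-feature min and max from train only
X_train_mm = mm.transform(X_train)      # exactly within [0, 1]
X_test_mm = mm.transform(X_test)        # can fall slightly outside [0, 1]
```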

### PCA/ICA

In [25]:
from sklearn.decomposition import PCA, FastICA

# instantiate the decomposition (ICA is active; the PCA alternative is kept commented out)
n_comp = 40
#xca = PCA(n_components=n_comp, svd_solver='full')
xca = FastICA(n_components=n_comp, algorithm='parallel')

data_train = xca.fit_transform(data_train)
data_test = xca.transform(data_test)
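
The value `n_comp = 40` is fixed by hand above. One common way to pick the number of PCA components is the cumulative explained variance. The sketch below uses synthetic data and a 95 % target, both of which are assumptions and not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# toy data: 5 strong latent directions embedded in 50 dimensions plus small noise
latent = rng.normal(size=(2000, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.05 * rng.normal(size=(2000, 50))

pca = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components that explains at least 95 % of the variance
n_comp = int(np.searchsorted(cumulative, 0.95) + 1)
```

scikit-learn can also do this selection itself when `n_components` is a float between 0 and 1 and `svd_solver='full'`.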

## Unsupervised Clustering

In [27]:
from sklearn.covariance import EllipticEnvelope

cov = EllipticEnvelope(random_state = 0).fit(data_train)

In [34]:
# predict returns 1 for an inlier and -1 for an outlier
y_pred = cov.predict(data_test)
np.unique(y_pred)

array([-1,  1])

In [36]:
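from sklearn.metrics import confusion_matrix, roc_auc_score

# map the labels to the predict convention: 1 = normal (inlier), -1 = abnormal (outlier)
y_true = [1 if i==0 else -1 for i in df_test.abnormal.astype(np.int8)]
# rows are true labels, columns are predictions, both in the order [-1, 1]
confusion_matrix(y_true, y_pred)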

array([[31528, 12659],
       [ 4272, 39915]], dtype=int64)
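
The confusion matrix can be read as follows. scikit-learn sorts the labels, so row and column 0 belong to -1 (abnormal) and row and column 1 to 1 (normal). The sketch below only re-uses the four numbers printed above and treats "abnormal" as the positive class; the variable names are illustrative.

```python
import numpy as np

# confusion matrix from the output above: rows = true, columns = predicted, order [-1, 1]
cm = np.array([[31528, 12659],
               [ 4272, 39915]])

tp = cm[0, 0]   # abnormal frames flagged as outliers
fn = cm[0, 1]   # abnormal frames missed
fp = cm[1, 0]   # normal frames flagged as outliers
tn = cm[1, 1]   # normal frames accepted

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / cm.sum()
print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
```

Both rows sum to 44187, so the test set is balanced at the frame level and accuracy is not distorted by class imbalance.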

In [None]:
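from sklearn.metrics import roc_curve

# use the continuous decision_function scores; hard predictions give only one ROC point
roc_curve(y_true, cov.decision_function(data_test))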

In [None]:
y_pred == y_true

In [None]:
a = cov.get_precision()
fig, ax = plt.subplots()
im = ax.imshow(a)

In [None]:
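from sklearn.metrics import roc_curve

# ground truth for the ROC curve: 1 = abnormal, 0 = normal
y_true = df_test.abnormal.astype(np.int8).to_numpy()

# decision_function is high for inliers, so negate it to score the abnormal class
roc_curve(y_true, -cov.decision_function(data_test))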

In [None]:
cov.decision_function(data_test)
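
`decision_function` returns a continuous score that is higher for inliers, which is what a ROC analysis needs. A minimal, self-contained sketch on synthetic data (the arrays and the shift of the abnormal samples are assumptions for illustration):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 4))                  # toy "normal" data
X_test = np.r_[rng.normal(size=(100, 4)),            # normal
               rng.normal(loc=4.0, size=(100, 4))]   # abnormal, shifted
y_abnormal = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]

model = EllipticEnvelope(random_state=0).fit(X_train)
scores = -model.decision_function(X_test)   # negated: higher = more abnormal
fpr, tpr, thresholds = roc_curve(y_abnormal, scores)
auc = roc_auc_score(y_abnormal, scores)
```

Plotting `fpr` against `tpr` gives the ROC curve, and `auc` summarises it in one number that does not depend on a particular threshold.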

In [None]:
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

In [None]:
# Toy comparison of three outlier detectors on synthetic 2-D data.
# Assumption: rng, X, ground_truth, n_outliers and the (xx, yy) grid are not
# defined anywhere in this notebook; the definitions below are illustrative.
from scipy import stats
from matplotlib.lines import Line2D

# Example settings
n_samples = 200
outliers_fraction = 0.25
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers
rng = np.random.RandomState(42)

# Synthetic data: Gaussian inliers first, uniform outliers last
X = np.r_[0.3 * rng.randn(n_inliers, 2),
          rng.uniform(low=-6, high=6, size=(n_outliers, 2))]
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1
# grid on which the decision functions are evaluated for plotting
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))

# define three outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng)}

plt.figure(figsize=(10.8, 3.6))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    clf.fit(X)
    scores_pred = clf.decision_function(X)
    threshold = stats.scoreatpercentile(scores_pred,
                                        100 * outliers_fraction)
    y_pred = clf.predict(X)
    n_errors = (y_pred != ground_truth).sum()
    # plot the levels lines and the points
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 3, i + 1)
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                        cmap=plt.cm.Blues_r)
    subplot.contour(xx, yy, Z, levels=[threshold],
                        linewidths=2, colors='red')
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                        colors='orange')
    b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
    c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
    subplot.axis('tight')
    # proxy line for the legend entry of the decision boundary
    a = Line2D([], [], color='red', linewidth=2)
    subplot.legend(
        [a, b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop={'size': 11},
        loc='lower right')
    subplot.set_title("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
    subplot.set_xlim((-7, 7))
    subplot.set_ylim((-7, 7))
plt.subplots_adjust(0.04, 0.1, 0.96, 0.92, 0.1, 0.26)

plt.show()