
# Basics

## Structure cheat sheet

1. func: train data load (in the following order)
    1. read the descriptive dataframe from the feature pipeline
    2. extract the features from the feature objects that the dataframe labels as train dataset
    3. create a numpy feature array for the processing pipeline
2. preprocessing
    1. Transformation (any combination of the following)
        + log-transform
        + PCA
        + others
    2. Scaling (one of the following)
        + StandardScaler
        + MinMaxScaler
3. Unsupervised Clustering
    1. Estimate initial hyperparameter
    2. Create grid over various hyperparameters
    3. Train all and choose the best according to metric
    
    
In all steps, the cluster-recorder object (possibly a dataframe row) records all meta-information, such as the hyperparameters.
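Steps 2 and 3 of the list above can be sketched in a few lines of scikit-learn. This is only an illustration under several assumptions: the data is synthetic, `OneClassSVM` stands in for whichever model is actually tuned, ROC AUC on a labeled evaluation set is assumed as the selection metric (the notebook reports `roc_auc` later on), and the "recorder" is a plain list of dicts rather than the real cluster-recorder object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Toy stand-in for the feature array: strictly positive values, 20 features
X_train = rng.lognormal(size=(200, 20))                       # normal data only
X_eval = np.vstack([rng.lognormal(size=(50, 20)),             # normal
                    rng.lognormal(mean=1.0, size=(50, 20))])  # shifted = anomalous
y_eval = np.r_[np.zeros(50), np.ones(50)]                     # 1 marks an anomaly

# 2. preprocessing: transformation (log, PCA) followed by one scaler
pca = PCA(n_components=5).fit(np.log(X_train))
scaler = StandardScaler().fit(pca.transform(np.log(X_train)))

def preprocess(X):
    return scaler.transform(pca.transform(np.log(X)))

# 3. grid over hyperparameters; every run is recorded together with its settings
records = []
for params in ParameterGrid({'nu': [0.05, 0.1, 0.2], 'gamma': ['scale', 0.01]}):
    model = OneClassSVM(**params).fit(preprocess(X_train))
    # decision_function is high for inliers, so negate it to get an anomaly score
    scores = -model.decision_function(preprocess(X_eval))
    records.append({**params, 'roc_auc': roc_auc_score(y_eval, scores)})

best = max(records, key=lambda r: r['roc_auc'])
print(best)
```

Keeping each record as a flat dict means the whole list can later be turned into a dataframe with one row per run, which matches the "possibly a dataframe row" idea above.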

## Data structure

There are multiple degrees of freedom in the data:

1. Signal to noise ratio (SNR)
2. Machine type
    1. pump
    2. fan
    3. valve (solenoid)
    4. slider
3. Machine ID
    1. four different machine IDs
    
The pipeline is applied to one fixed SNR, one fixed machine type and one fixed machine ID at a time.
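The combinations implied by these degrees of freedom can be enumerated explicitly. The sketch below assumes the two SNR labels (`'6dB'`, `'min6dB'`) and the four ID strings (`'00'`, `'02'`, `'04'`, `'06'`) that appear in the task setup of this notebook; the variable names are illustrative.

```python
from itertools import product

# Values as they appear in the notebook's task configuration
snrs = ['6dB', 'min6dB']
machines = ['pump', 'fan', 'valve', 'slider']
ids = ['00', '02', '04', '06']

# One pipeline run per (SNR, machine type, machine ID) combination
combinations = list(product(snrs, machines, ids))
print(len(combinations))   # 2 * 4 * 4 = 32
print(combinations[0])     # ('6dB', 'pump', '00')
```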

## get features

Get the descriptive dataframe for the features.

The descriptive dataframe contains all IDs of the pump. We focus on ID '00' for now, since the modeling phase is separated per SNR, per machine type and per ID anyway.

class:

+ uni\_\<model\>

attributes:

+ default threshold
+ roc_auc

methods:

+ fit
+ predict
+ predict_score
+ eval_roc_auc
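The real `uni_<model>` classes live in the loaded modeling utilities and are not shown here. As a rough sketch of what such an interface could look like, the hypothetical class below wraps scikit-learn's `OneClassSVM`. The class name, the sign convention (higher score means more anomalous), the 0/1 labels and the default threshold of `0.0` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM


class uni_OneClassSVM_sketch:
    """Hypothetical stand-in for the notebook's uni_<model> classes."""

    def __init__(self, default_threshold=0.0, **kwargs):
        self.default_threshold = default_threshold
        self.roc_auc = None                      # filled in by eval_roc_auc
        self.model = OneClassSVM(**kwargs)

    def fit(self, X):
        self.model.fit(X)
        return self

    def predict_score(self, X):
        # Higher score = more anomalous (sign flipped relative to scikit-learn)
        return -self.model.decision_function(X)

    def predict(self, X):
        # 1 = anomaly, 0 = normal, decided by the default threshold
        return (self.predict_score(X) > self.default_threshold).astype(int)

    def eval_roc_auc(self, X, y_true):
        self.roc_auc = roc_auc_score(y_true, self.predict_score(X))
        return self.roc_auc


# Toy data: train on normal samples only, evaluate on normal + shifted samples
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 3))
X_test = np.vstack([rng.normal(size=(50, 3)), rng.normal(loc=4.0, size=(50, 3))])
y_test = np.r_[np.zeros(50), np.ones(50)]

clf = uni_OneClassSVM_sketch(nu=0.1).fit(X_train)
print(clf.predict(X_test)[:5], clf.eval_roc_auc(X_test, y_test))
```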

In [2]:
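#===============================================
# Basic Imports
import pickle
from queue import Queue  # harmless if the loader scripts below already import it

BASE_FOLDER = '../../'
# forward slashes work on Windows as well as on Linux/macOS
%run -i ../../utility/feature_extractor/JupyterLoad_feature_extractor.py
%run -i ../../utility/modeling/JupyterLoad_modeling.py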

load feature_extractor_mother
load feature_extractor_mel_spectra
load feature_extractor_psd
load feature_extractore_pre_nnFilterDenoise
load extractor_diagram_mother
load load_data
Load split_data
Load anomaly_detection_models
Load detection_pipe


In [3]:
diagrams = ['extdia_v0']
machines = ['pump'#, 'fan', 'slider', 'valve'
            ]
SNRs = ['6dB'#, 'min6dB'
        ]
IDs = ['00'#, '02', '04', '06'
        ]
features = ['MEL_den'#, 'PSD_den'
            ]

tasks = [{
        'path_descr':BASE_FOLDER + 'dataset/{}/{}{}{}_EDiaV0_pandaDisc.pkl'.format(diagram, machine, SNR, ID), 
        'feat':{'function':'frame', 'frames':5}, 
        'feat_col':feature, 
        'SNR':SNR, 
        'machine':machine, 
        'ID':ID,
        'BASE_FOLDER':BASE_FOLDER} 
        for diagram in diagrams
        for machine in machines
        for SNR in SNRs
        for ID in IDs
        for feature in features
        ]
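To make the `path_descr` template concrete, the snippet below fills it in for the single combination that is active in the lists above (all other entries are commented out). It only repeats the string formatting from the task definition and does not touch the file system.

```python
BASE_FOLDER = '../../'

# Same template as in the task definition, filled in for the active combination
path_descr = BASE_FOLDER + 'dataset/{}/{}{}{}_EDiaV0_pandaDisc.pkl'.format(
    'extdia_v0', 'pump', '6dB', '00')
print(path_descr)  # ../../dataset/extdia_v0/pump6dB00_EDiaV0_pandaDisc.pkl
```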



In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FastICA

preprocessing = [
    (FastICA, {'n_components':40, 'algorithm':'parallel'}),
    (StandardScaler, {})
]

modeling = (uni_OneClassSVM, {})

# one Pipe instance per task
pipes = [Pipe(preprocessing, modeling) for _ in tasks]
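How `Pipe` consumes the `(class, kwargs)` tuples is not shown in this notebook. One plausible reading is that each pair is instantiated and applied in list order. The helper `fit_transform_steps` below is hypothetical, the data is random, and `PCA` replaces `FastICA` only to keep the toy run fast and deterministic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same (class, kwargs) layout as the notebook's preprocessing list
preprocessing = [
    (PCA, {'n_components': 4}),
    (StandardScaler, {}),
]

def fit_transform_steps(steps, X):
    """Instantiate each (class, kwargs) pair and apply it in list order."""
    fitted = []
    for cls, kwargs in steps:
        step = cls(**kwargs)
        X = step.fit_transform(X)
        fitted.append(step)      # keep the fitted object for later transform() calls
    return fitted, X

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
fitted, X_out = fit_transform_steps(preprocessing, X)
print([type(s).__name__ for s in fitted], X_out.shape)  # ['PCA', 'StandardScaler'] (100, 4)
```

Storing classes and keyword arguments instead of ready-made instances lets every task build its own fresh, unfitted objects from the same description.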

In [5]:
#lw = LoggerWrap()

# create the worker threads
n_jobs = 4
worker_list = []
queue = Queue()
for _ in range(n_jobs):
    worker = PipeThread(queue)
    worker.daemon = True  # do not keep the interpreter alive for idle workers
    worker.start()
    worker_list.append(worker)

In [6]:
# fill the queue; each worker pulls (pipe, task) pairs from it
#lw.log('multithread mode filling the queue' )
for pipe, task in zip(pipes, tasks):
    queue.put((pipe, task))

...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.9856618655632816
pipe saved to pickle


In [11]:
with open('./pipes/MEL_den_frame5_6dB_pump_ID00_20200502_222404.pkl', 'rb') as f:
    a = pickle.load(f)

In [13]:
a.roc_auc

0.9856618655632816
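The `roc_auc` value can be read as the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one, with ties counting as half. A small pure-Python sketch of that definition, using made-up labels and scores rather than values from the pump data:

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (anomaly, normal) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]   # anomalous samples
    neg = [s for l, s in zip(labels, scores) if l == 0]   # normal samples
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
print(roc_auc(labels, scores))  # 4 of 6 pairs ranked correctly, about 0.667
```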

In [138]:
for worker in worker_list:
    worker.stop = True
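`PipeThread` is defined in the loaded modeling utilities and is not shown here. The sketch below illustrates the general pattern the cells above rely on: daemon workers pull jobs from a shared queue and exit once a `stop` flag is set. `WorkerSketch` is a hypothetical stand-in, and the polling timeout is an assumption about how the flag gets noticed.

```python
import queue
import threading


class WorkerSketch(threading.Thread):
    """Hypothetical stand-in for PipeThread: pulls jobs until `stop` is set."""

    def __init__(self, jobs, results):
        super().__init__(daemon=True)
        self.jobs = jobs
        self.results = results
        self.stop = False

    def run(self):
        while not self.stop:
            try:
                # Time out regularly so the stop flag is re-checked
                func, arg = self.jobs.get(timeout=0.1)
            except queue.Empty:
                continue
            self.results.append(func(arg))
            self.jobs.task_done()


jobs = queue.Queue()
results = []
workers = [WorkerSketch(jobs, results) for _ in range(4)]
for w in workers:
    w.start()

for n in range(8):
    jobs.put((lambda x: x * x, n))
jobs.join()            # block until every queued job has been processed

for w in workers:
    w.stop = True      # same shutdown signal as in the cell above
for w in workers:
    w.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Without the timeout on `get`, an idle worker would block forever on an empty queue and never see the flag, which is why a plain attribute is enough here.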

In [132]:
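# Imports this cell needs; the original run stopped with
# "NameError: name 'svm' is not defined"
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from scipy import stats
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

# Example settings
rng = np.random.RandomState(42)
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]

# Assumption: X, ground_truth, xx, yy and n_outliers are not defined anywhere
# in this notebook, so synthetic 2-D data stands in for them here
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers
offset = clusters_separation[-1]
X = np.r_[0.3 * rng.randn(n_inliers // 2, 2) - offset,
          0.3 * rng.randn(n_inliers // 2, 2) + offset,
          rng.uniform(low=-6, high=6, size=(n_outliers, 2))]
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1
xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))

# define the three outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng)}

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    clf.fit(X)
    scores_pred = clf.decision_function(X)
    threshold = stats.scoreatpercentile(scores_pred,
                                        100 * outliers_fraction)
    y_pred = clf.predict(X)
    n_errors = (y_pred != ground_truth).sum()
    # plot the level lines and the points
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 3, i + 1)
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                     cmap=plt.cm.Blues_r)
    boundary = subplot.contour(xx, yy, Z, levels=[threshold],
                               linewidths=2, colors='red')
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                     colors='orange')
    inliers = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
    outliers = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
    subplot.axis('tight')
    # legend_elements() replaces .collections, which newer matplotlib releases removed
    subplot.legend(
        [boundary.legend_elements()[0][0], inliers, outliers],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=11),
        loc='lower right')
    subplot.set_title("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
    subplot.set_xlim((-7, 7))
    subplot.set_ylim((-7, 7))
plt.subplots_adjust(left=0.04, bottom=0.1, right=0.96, top=0.92,
                    wspace=0.1, hspace=0.26)

plt.show()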