<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Structure-cheat-sheet" data-toc-modified-id="Structure-cheat-sheet-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Structure cheat sheet</a></span></li><li><span><a href="#Data-structure" data-toc-modified-id="Data-structure-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data structure</a></span></li><li><span><a href="#get-features" data-toc-modified-id="get-features-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>get features</a></span></li><li><span><a href="#Exploration-of-non-spectral-Features" data-toc-modified-id="Exploration-of-non-spectral-Features-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Exploration of non-spectral Features</a></span></li><li><span><a href="#Exploring-spectral-features" data-toc-modified-id="Exploring-spectral-features-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Exploring spectral features</a></span></li><li><span><a href="#Running-the-Auto-Encoder" data-toc-modified-id="Running-the-Auto-Encoder-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Running the Auto-Encoder</a></span></li></ul></div>

# Basics

## Structure cheat sheet

1. func: train data load (in the following order)
    1. read the descriptive dataframe from the feature pipeline
    2. extract the features from the feature objects that the dataframe labels as train dataset
    3. create the numpy feature array for the processing pipeline
2. preprocessing
    1. Transformation (any combination of the following)
        + log-transform
        + PCA
        + others
    2. Scaling (one of the following)
        + StandardScaler
        + MinMaxScaler
3. Unsupervised Clustering
    1. Estimate initial hyperparameters
    2. Create a grid over various hyperparameters
    3. Train all models and choose the best according to the metric

In all steps the cluster-recorder object (possibly a dataframe row) records all the meta-information, such as the hyperparameters.
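
The preprocessing and grid steps above can be sketched with plain scikit-learn. This is a minimal illustration, not the project's pipeline: the synthetic data, the `records` list standing in for the cluster-recorder, and the use of BIC as a label-free selection metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic, non-negative stand-in for a feature array (n_samples, n_features).
rng = np.random.default_rng(0)
X_train = rng.gamma(shape=2.0, scale=1.0, size=(300, 16))

# Step 2: transformation (log + PCA) followed by exactly one scaler.
preprocessing = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("pca", PCA(n_components=8)),
    ("scale", StandardScaler()),
])
X_prep = preprocessing.fit_transform(X_train)

# Step 3: grid over one hyperparameter, recording meta-information per run.
records = []  # assumed stand-in for the cluster-recorder
for n_components in [1, 2, 4]:
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X_prep)
    records.append({"n_components": n_components, "bic": gmm.bic(X_prep)})

# Lower BIC is better; BIC is an assumed metric for this sketch.
best = min(records, key=lambda r: r["bic"])
print(best)
```

With labeled evaluation data, the recorded metric could instead be the ROC AUC that the pipes in this notebook report.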

## Data structure

There are multiple degrees of freedom in the data:

1. Signal to noise ratio (SNR)
2. Machine type
    1. pump
    2. fan
    3. valve (solenoid)
    4. slider
3. Machine ID
    1. four different machine IDs
    
The pipeline is applied to a fixed SNR, a fixed machine type and a fixed ID.
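
As a small illustration of these degrees of freedom, the sketch below enumerates every (machine, SNR, ID) combination. The SNR and ID labels are the ones used in the cells further down; the variable names are chosen for the example only.

```python
from itertools import product

machines = ["pump", "fan", "valve", "slider"]
snrs = ["6dB", "min6dB"]
ids = ["00", "02", "04", "06"]

# Each tuple is one fixed configuration that a single pipeline run would cover.
configs = list(product(machines, snrs, ids))
print(len(configs), configs[0])
```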

## get features

Get the descriptive dataframe for the features.

The descriptive dataframe contains all IDs of the pump. We focus on ID '00' for now, since the modeling phase is separated per SNR, per machine and per ID anyway.

class:

+ uni\_\<model\>

attributes:

+ default threshold
+ roc_auc

methods:

+ fit
+ predict
+ predict_score
+ eval_roc_auc
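
To make this interface concrete, here is a hypothetical wrapper with the same attributes and methods. It is not the project's `uni_GaussianMixture`; the class name, the constructor arguments and the choice of negative log-likelihood as anomaly score are assumptions for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture


class UniGaussianMixtureSketch:
    """Hypothetical stand-in for a uni_<model> wrapper, not the project's class."""

    def __init__(self, def_threshold=0.0, **kwargs):
        self.def_threshold = def_threshold  # default decision threshold on the score
        self.roc_auc = None                 # filled in by eval_roc_auc
        self.model = GaussianMixture(**kwargs)

    def fit(self, X):
        self.model.fit(X)
        return self

    def predict_score(self, X):
        # Negative log-likelihood per sample: higher means more anomalous.
        return -self.model.score_samples(X)

    def predict(self, X):
        # 1 = anomaly, 0 = normal
        return (self.predict_score(X) > self.def_threshold).astype(int)

    def eval_roc_auc(self, X, y_true):
        self.roc_auc = roc_auc_score(y_true, self.predict_score(X))
        return self.roc_auc


# Usage on synthetic data: train on normal samples only, evaluate on a mix.
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))
X_test = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(5.0, 1.0, size=(100, 2))])
y_test = np.r_[np.zeros(100), np.ones(100)]

model = UniGaussianMixtureSketch(n_components=1, random_state=0).fit(X_normal)
auc = model.eval_roc_auc(X_test, y_test)
```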

In [3]:
#===============================================
# Basic Imports
BASE_FOLDER = '../../'
%run -i ..\..\utility\feature_extractor\JupyterLoad_feature_extractor.py
%run -i ..\..\utility\modeling\JupyterLoad_modeling.py

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FastICA
from tqdm import tqdm
import glob

load feature_extractor_mother
load feature_extractor_mel_spectra
load feature_extractor_psd
load feature_extractor_ICA2
load feature_extractore_pre_nnFilterDenoise
load extractor_diagram_mother
load Simple_FIR_HP
load TimeSliceAppendActivation
load load_data
Load split_data
Load anomaly_detection_models
Load pseudo_supervised_models
Load tensorflow models
Load detection_pipe


Dimensionality reduction was already explored in another notebook. We derived the following rules of thumb:

1. PCA and ICA deliver almost the same results in terms of relative absolute error
2. PCA is usually much faster
3. for PSD and the ICA demix matrix, no dimensionality reduction is needed
4. for a framed Mel spectrum, 32 to 64 components is a good choice; the resulting error is about 2-4%
5. for a whole Mel spectrum, 64 to 128 components are advised
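
A reconstruction error of this kind can be measured as sketched below. The exact definition used in the other notebook is not shown here, so the formula (summed absolute reconstruction error divided by summed absolute signal) and the synthetic low-rank data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA


def relative_abs_error(X, n_components):
    """Assumed definition: sum |X - X_hat| / sum |X| after a PCA round trip."""
    pca = PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.abs(X - X_hat).sum() / np.abs(X).sum()


# Synthetic data: 8 latent factors mixed into 64 features, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 64))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 64))

errors = {n: relative_abs_error(X, n) for n in (4, 8, 16)}
for n, err in errors.items():
    print(f"n_components={n}: relative absolute error = {err:.4f}")
```

Sweeping `n_components` this way on the real feature arrays is one way to check rules 4 and 5 for a given feature type.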

## Exploration of non-spectral Features

In [64]:
diagram = 'extdia_v1'
machines = ['pump'
            , 'fan'
            , 'slider'
            , 'valve'
            ]
SNRs = ['6dB', 'min6dB'
        ]
IDs = ['00'
       , '02'
       ,# '04'
       #, '06'
        ]
features = [#('MEL_denbssm', {'function':'frame', 'frames':3})
            #, ('MEL_denbssm', {'function':'frame', 'frames':5})
            #, ('MEL_denbssm', {'function':'frame', 'frames':7})
            #, ('MEL_denbssm', {'function':'flat'})
            ('PSD_denbssm', {'function':'channel'})
            #, ('MEL_bssm', {'function':'frame', 'frames':3})
            #, ('MEL_bssm', {'function':'frame', 'frames':5})
            #, ('MEL_bssm', {'function':'frame', 'frames':7})
            #, ('MEL_bssm', {'function':'flat'})
            , ('PSD_bssm', {'function':'channel'})
            #, ('MEL_raw', {'function':'frame', 'frames':3})
            #, ('MEL_raw', {'function':'frame', 'frames':5})
            #, ('MEL_raw', {'function':'frame', 'frames':7})
            #, ('MEL_raw', {'function':'flat'})
            , ('PSD_raw', {'function':'channel'})
            #, ('MEL_den', {'function':'frame', 'frames':3})
            #, ('MEL_den', {'function':'frame', 'frames':5})
            #, ('MEL_den', {'function':'frame', 'frames':7})
            #, ('MEL_den', {'function':'flat'})
            , ('PSD_den', {'function':'channel'})
            #, ('ICA_demix', {'function':'flat'})
            #, ('ICA_demix', {'function':'maxrange'})
            ]

tasks = [{
        'path_descr':glob.glob(BASE_FOLDER 
                               + '/dataset/{}/{}{}{}_EDiaV1'.format(diagram, machine, SNR, ID) 
                               + "*pandaDisc*.pkl", recursive=True)[0],
        'feat':feature[1], 
        'feat_col':feature[0], 
        'SNR':SNR, 
        'machine':machine, 
        'ID':ID,
        'BASE_FOLDER':BASE_FOLDER}
        for machine in machines
        for SNR in SNRs
        for ID in IDs
        for feature in features
        ]

preprocessing = [
    #(PCA, {'n_components':64}),
    (StandardScaler, {})
]

#modeling = (uni_IsolationForest, {
#    'n_estimators':100
#    ,'max_features':1
#    ,'random_state':42})

# Gaussian Mixture: n_comp: 1, 2, 4, 8, 16, 32
modeling = (uni_GaussianMixture, {'n_components':32})

# one fresh Pipe per task
pipes = [Pipe(preprocessing, modeling) for _ in range(len(tasks))]

tasks_failed = []
for pipe, task in tqdm(zip(pipes, tasks), total=len(tasks)):
    try:
        pipe.run_pipe(task)
    except Exception as err:  # keep going on errors, but do not swallow KeyboardInterrupt
        tasks_failed.append(task)
        print('Task failed:', err)

0%|          | 0/64 [00:00<?, ?it/s]../..//dataset/extdia_v1\pump6dB00_EDiaV1HPaug0_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.5097070761406426
  2%|▏         | 1/64 [00:17<18:22, 17.50s/it]pipe saved to pickle
../..//dataset/extdia_v1\pump6dB00_EDiaV1HPaug0_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.8925130813242701
  3%|▎         | 2/64 [00:34<17:58, 17.39s/it]pipe saved to pickle
../..//dataset/extdia_v1\pump6dB00_EDiaV1HPaug0_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.8

## Exploring spectral features

In [4]:
def run_GMM(i):
    diagram = 'extdia_v1'
    machines = [#'pump'
                #, 'fan'
                #, 'slider'
                'valve'
                ]
    SNRs = ['6dB', 'min6dB'
            ]
    IDs = ['00'
        , '02'
        ,# '04'
        #, '06'
            ]
    features = [('MEL_denbssm', {'function':'frame', 'frames':3})
                , ('MEL_denbssm', {'function':'frame', 'frames':5})
                , ('MEL_denbssm', {'function':'frame', 'frames':7})
                #, ('MEL_denbssm', {'function':'flat'})
                #, ('PSD_denbssm', {'function':'channel'})
                , ('MEL_bssm', {'function':'frame', 'frames':3})
                , ('MEL_bssm', {'function':'frame', 'frames':5})
                , ('MEL_bssm', {'function':'frame', 'frames':7})
                #, ('MEL_bssm', {'function':'flat'})
                #, ('PSD_bssm', {'function':'channel'})
                , ('MEL_raw', {'function':'frame', 'frames':3})
                , ('MEL_raw', {'function':'frame', 'frames':5})
                , ('MEL_raw', {'function':'frame', 'frames':7})
                #, ('MEL_raw', {'function':'flat'})
                #, ('PSD_raw', {'function':'channel'})
                , ('MEL_den', {'function':'frame', 'frames':3})
                , ('MEL_den', {'function':'frame', 'frames':5})
                , ('MEL_den', {'function':'frame', 'frames':7})
                #, ('MEL_den', {'function':'flat'})
                #, ('PSD_den', {'function':'channel'})
                #, ('ICA_demix', {'function':'flat'})
                #, ('ICA_demix', {'function':'maxrange'})
                ]

    tasks = [{
            'path_descr':glob.glob(BASE_FOLDER 
                               + '/dataset/extdia_v1*/{}{}{}_EDiaV1'.format(machine, SNR, ID) 
                               + "*pandaDisc*.pkl", recursive=True)[0],
            'feat':feature[1], 
            'feat_col':feature[0], 
            'SNR':SNR, 
            'machine':machine, 
            'ID':ID,
            'BASE_FOLDER':BASE_FOLDER}
            for machine in machines
            for SNR in SNRs
            for ID in IDs
            for feature in features
            ]

    preprocessing = [
        (PCA, {'n_components':64}),
        (StandardScaler, {})
    ]

    # Gaussian Mixture: n_comp: 1, 2, 4, 8, 16, 32
    modeling = (uni_GaussianMixture, {'n_components':i})

    # one fresh Pipe per task; `_` avoids reusing the name of the parameter `i`
    pipes = [Pipe(preprocessing, modeling) for _ in range(len(tasks))]

    # # create the threads
    # n_jobs = 2
    # worker_list = []
    # queue = Queue()
    # for worker in range(n_jobs):
    #     worker = PipeThread(queue)
    #     worker.daemon = True
    #     worker.start()
    #     worker_list.append(worker)

    tasks_failed = []
    for pipe, task in tqdm(zip(pipes, tasks), total=len(tasks)):
        try:
            pipe.run_pipe(task)
        except Exception as err:  # keep going on errors, but do not swallow KeyboardInterrupt
            tasks_failed.append(task)
            print('Task failed:', err)

In [5]:
for i in [1, 2, 4, 8, 16, 32]:
    run_GMM(i)

_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.6910883069462535
 71%|███████   | 34/48 [1:04:12<32:13, 138.08s/it]pipe saved to pickle
../..//dataset\extdia_v1_sporafic\valvemin6dB00_EDiaV1HPaug0TsSl_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.6283972934642651
 73%|███████▎  | 35/48 [1:06:03<28:06, 129.72s/it]pipe saved to pickle
../..//dataset\extdia_v1_sporafic\valvemin6dB00_EDiaV1HPaug0TsSl_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
model fitted successfully

...evaluating model
evaluation successfull, roc_auc: 0.7893140445889216
 81%|████████▏ | 39/48 [1

## Running the Auto-Encoder

In [21]:
diagram = 'extdia_v1'
machines = ['pump'
            , 'fan'
            , 'slider'
            , 'valve'
            ]
SNRs = ['6dB', 'min6dB'
        ]
IDs = ['00'
       , '02'
       ,# '04'
       #, '06'
        ]
features = [('MEL_denbssm', {'function':'frame', 'frames':3})
            , ('MEL_denbssm', {'function':'frame', 'frames':5})
            , ('MEL_denbssm', {'function':'frame', 'frames':7})
            , ('MEL_denbssm', {'function':'flat'})
            #('PSD_denbssm', {'function':'channel'})
            , ('MEL_bssm', {'function':'frame', 'frames':3})
            , ('MEL_bssm', {'function':'frame', 'frames':5})
            , ('MEL_bssm', {'function':'frame', 'frames':7})
            , ('MEL_bssm', {'function':'flat'})
            #, ('PSD_bssm', {'function':'channel'})
            , ('MEL_raw', {'function':'frame', 'frames':3})
            , ('MEL_raw', {'function':'frame', 'frames':5})
            , ('MEL_raw', {'function':'frame', 'frames':7})
            , ('MEL_raw', {'function':'flat'})
            #, ('PSD_raw', {'function':'channel'})
            , ('MEL_den', {'function':'frame', 'frames':3})
            , ('MEL_den', {'function':'frame', 'frames':5})
            , ('MEL_den', {'function':'frame', 'frames':7})
            , ('MEL_den', {'function':'flat'})
            #, ('PSD_den', {'function':'channel'})
            #, ('ICA_demix', {'function':'flat'})
            #, ('ICA_demix', {'function':'maxrange'})
            ]

tasks = [{
        'path_descr':glob.glob(BASE_FOLDER 
                               + '/dataset/{}/{}{}{}_EDiaV1'.format(diagram, machine, SNR, ID) 
                               + "*pandaDisc*.pkl", recursive=True)[0],
        'feat':feature[1], 
        'feat_col':feature[0], 
        'SNR':SNR, 
        'machine':machine, 
        'ID':ID,
        'BASE_FOLDER':BASE_FOLDER}
        for machine in machines
        for SNR in SNRs
        for ID in IDs
        for feature in features
        ]

preprocessing = [
    (PCA, {'n_components':64}),
    (StandardScaler, {})
]

modeling = (uni_AutoEncoder, {'epochs':50})

# modeling = (uni_EllipticEnvelope, {})

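# one fresh Pipe per task
pipes = [Pipe(preprocessing, modeling) for _ in range(len(tasks))]

# create the worker threads
from queue import Queue  # stdlib; harmless if the %run scripts already imported it

n_jobs = 1
worker_list = []
queue = Queue()
for _ in range(n_jobs):
    worker = PipeThread(queue)  # PipeThread is expected to come from the %run scripts above
    worker.daemon = True        # daemon threads do not block interpreter shutdown
    worker.start()
    worker_list.append(worker)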

In [23]:
# fill the queue; the slice [5:] skips the first five pipe/task pairs
task_status = []
for pipe, task in zip(pipes[5:], tasks[5:]):
    queue.put((pipe, task))

../..//dataset/extdia_v1\pump6dB00_EDiaV1HPaug0_pandaDisc.pkl --> Done
...loading data
data loading completed

...preprocessing data
data preprocessing finished

...fitting the model
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50