
# Introduction

This notebook updates the descriptive dataframe for a subsequent modeling or evaluation task. The descriptive dataframe is the main input to the modeling pipeline and contains all information needed to create the training and evaluation datasets.

The steps within this workflow are as follows:

1. the function loads the descriptive dataframe from the specified path
2. the function splits the instances per SNR, per machine and per ID into training and testing sets and creates an additional column that marks the training instances
3. the column is added to the descriptive dataframe, which is saved back to the same location

To use this notebook, do the following:

1. define the path to the descriptive dataframe (path='....')
2. run all subsequent cells

In [91]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [92]:
def split_index(indices, labels):
    '''
    Builds the test set from all abnormal instances plus an equally
    sized random sample of normal instances; the remaining normal
    instances form the training set.

    indices: index of the descriptive table or dataframe
    labels: labels aligned with indices (1 = abnormal, 0 = normal)

    Returns (idx_train, idx_test).
    '''

    idx_abnormal = indices[labels == 1]
    num_abnormal = len(idx_abnormal)

    idx_normal = indices[labels == 0]
    # hold out as many normal instances as there are abnormal ones
    idx_train, idx_test_normal = train_test_split(idx_normal, test_size=num_abnormal)

    # the test set contains all abnormal instances
    idx_test = idx_test_normal.union(idx_abnormal)

    return idx_train, idx_test
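
The split above is random, so each run of the notebook can assign different normal instances to the test set. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the task. The name `split_index_seeded`, its `random_state` argument and the toy data are illustrative and not part of the notebook; `random_state` is passed on to `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_index_seeded(indices, labels, random_state=0):
    """Hypothetical seeded variant of split_index (illustration only)."""
    labels = pd.Series(labels).to_numpy()
    idx_abnormal = indices[labels == 1]
    idx_normal = indices[labels == 0]

    # hold out as many normal instances as there are abnormal ones;
    # the fixed random_state makes the selection repeatable
    train, test_normal = train_test_split(
        idx_normal.to_numpy(),
        test_size=len(idx_abnormal),
        random_state=random_state,
    )

    idx_train = pd.Index(train)
    idx_test = pd.Index(test_normal).union(idx_abnormal)
    return idx_train, idx_test


# toy data: 7 normal instances followed by 3 abnormal ones
toy_index = pd.Index(range(10))
toy_labels = pd.Series([0] * 7 + [1] * 3, index=toy_index)

idx_train, idx_test = split_index_seeded(toy_index, toy_labels)
print(len(idx_train), len(idx_test))
```

With 7 normal and 3 abnormal instances, 3 normal instances are held out, so the training set has 4 entries and the test set has 6.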

In [93]:
def tt_split(table_path):
    '''
    Reads the descriptive table from a pickle file and splits it into
    training and testing instances per SNR, machine and ID.
    Adds the column 'train_set' (1 = training, 0 = testing), writes the
    table back to table_path and returns a status message.
    '''

    table = pd.read_pickle(table_path)

    # do not overwrite an existing split
    if 'train_set' in table.columns:
        return 'Train test split already done'

    # initialize the new column (0 = testing)
    tt_series = pd.Series(0, index=table.index,
                          name='train_set', dtype=np.int8)

    # split for every individual SNR, machine and ID
    for SNR in table.SNR.unique():
        for machine in table.machine.unique():
            for ID in table.ID.unique():

                # create the individual mask
                # and read the indices and labels accordingly
                mask = (table.SNR == SNR) & (
                    table.machine == machine) & (table.ID == ID)

                # skip combinations that do not occur in the table
                if not mask.any():
                    continue

                idx = table[mask].index
                labels = table[mask].abnormal

                # get the indices that belong to the training dataset
                # and update the new column
                idx_train, _ = split_index(idx, labels)
                tt_series.loc[idx_train] = 1

    table = table.join(tt_series)
    table.to_pickle(table_path)

    return 'Done'
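
The per-group split can also be tried on a small in-memory table before it is run against the pickle file. The sketch below is an alternative formulation, not the notebook's method: it iterates with `groupby`, which only visits combinations that actually occur, and it uses `DataFrame.sample` in place of `train_test_split`. The helper name `mark_train_rows`, the `seed` argument and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mark_train_rows(table, seed=0):
    """Hypothetical groupby-based variant; returns an int8 Series named 'train_set'."""
    train_set = pd.Series(0, index=table.index, name='train_set', dtype=np.int8)

    for _, group in table.groupby(['SNR', 'machine', 'ID']):
        normal = group[group.abnormal == 0]
        n_test = int((group.abnormal == 1).sum())

        # hold out as many normal rows as there are abnormal rows;
        # sample() raises if a group has fewer normal than abnormal rows
        held_out = normal.sample(n=n_test, random_state=seed).index
        train_set.loc[normal.index.difference(held_out)] = 1

    return train_set


# toy descriptive table with two (SNR, machine, ID) groups
toy = pd.DataFrame({
    'SNR': ['6dB'] * 8 + ['0dB'] * 6,
    'machine': ['pump'] * 14,
    'ID': ['00'] * 8 + ['02'] * 6,
    'abnormal': [0, 0, 0, 0, 0, 0, 1, 1] + [0, 0, 0, 0, 0, 1],
})
toy = toy.join(mark_train_rows(toy))
print(toy.groupby(['SNR', 'machine', 'ID']).train_set.sum())
```

The first group has 6 normal and 2 abnormal rows, the second 5 normal and 1 abnormal row, so each group keeps 4 normal rows for training.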

In [94]:
# raw string, so the backslashes in the Windows path are not read as escape sequences
path = r'.\..\..\dataset\MEL_to_Pandas\data\pandas_pump_6dB_00020406_MEL_v1_64.pkl'
tt_split(path)

'Train test split already done'

In [95]:
table = pd.read_pickle(path)

In [96]:
table[table.abnormal==1].train_set.unique()

array([0], dtype=int8)