# Preprocessing of Experimental Data

## Summary
This notebook preprocesses the raw .mat files and saves them as .csv files in a second folder, so that they are ready to use in any analysis approach. In our experiments, the gas turbine has a motor phase at the start of each experiment and a cool-down phase at the end. We save three files per experiment: the first contains the entire downsampled experiment, the second excludes the motor and cool-down phases to focus on the actual operating data, and the third contains the raw data as a .csv for easier accessibility.

In [1]:
# import all necessary libraries
import Global_Functions as gf
import Preprocessing as pp
import os
from scipy.io import loadmat
import pandas as pd
from math import ceil
import numpy as np
import matplotlib.pyplot as plt

In [2]:
RAW_FOLDER = "../Data/Raw_Data/" # where are the raw matlab files?
SAVE_FOLDER = "../Data/Preped_Data/"  # where do you want to save the .csv files


folder_1 = RAW_FOLDER + "Messdaten_Test_ID_1/"
folder_4 = RAW_FOLDER + "Messdaten_Test_ID_4b/"
folder_9 = RAW_FOLDER + "Messdaten_Test_ID_9/"
folder_20 = RAW_FOLDER + "Messdaten_Test_ID_20/"
folder_21 = RAW_FOLDER + "Messdaten_Test_ID_21/"
folder_22 = RAW_FOLDER + "Messdaten_Test_ID_22/"
folder_23 = RAW_FOLDER + "Messdaten_Test_ID_23/"
folder_24 = RAW_FOLDER + "Messdaten_Test_ID_24/"

gf.check_folder(SAVE_FOLDER)

Folder already exists.


Because we downsample the data, we need to specify the target sampling rate. In our case we downsample to one data point per second, which corresponds to a SAMPLE_RATIO of 1.

In [3]:
SAMPLE_RATIO = 1 # target rate of the downsampled data, in data points per second

The input voltage is not included in this data set, so we need to specify which voltage was applied at which time. We therefore use time splits, either prespecified or derived from analysis, at which the input voltage changes. In our case we only use a small array of values (see the *prep_raw_exp_1* method). To obtain a value for every time step in the data frame, we need to fill the data between the splits. The following method does this.

In [4]:
def fill_data(time_splits, value_splits, df_time):
    # start from zeros so that time steps outside all splits are not left uninitialized
    values = np.zeros(len(df_time))
    times = np.asarray(df_time)

    for i in range(len(time_splits)-1):
        lower = ceil(time_splits[i]) # find beginning of time frame
        upper = ceil(time_splits[i+1]) # find end of time frame

        reached = times >= lower
        if not reached.any():
            continue # this split starts after the last time stamp
        ix_start = np.argmax(reached) # first index where the lower time bound is reached

        above = times > upper
        # first index where the upper time bound is exceeded; if it never is, fill to the end
        ix_end = np.argmax(above) if above.any() else len(times)

        values[ix_start:ix_end] = value_splits[i]

    return values
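
The same piecewise-constant fill can also be written without an explicit loop. The following is a minimal sketch, not the notebook's `fill_data`: it assumes the split times are sorted in ascending order, it does not round the split times up with `ceil`, and the name `fill_steps` is hypothetical.

```python
import numpy as np

def fill_steps(time_splits, value_splits, times):
    """Hold value_splits[i] from time_splits[i] until the next split (hypothetical helper)."""
    time_splits = np.asarray(time_splits, dtype=float)
    value_splits = np.asarray(value_splits, dtype=float)
    times = np.asarray(times, dtype=float)
    # Index of the last split that is <= each time stamp
    idx = np.searchsorted(time_splits, times, side="right") - 1
    # Time stamps before the first split fall back to the first value
    idx = np.clip(idx, 0, len(value_splits) - 1)
    return value_splits[idx]

# Made-up splits: 0 V until t=2, 10 V until t=5, 3 V until t=8, then 0 V
times = np.arange(0, 10)
voltage = fill_steps([0, 2, 5, 8], [0, 10, 3, 0], times)
print(voltage)
```

Each time stamp simply receives the value of the most recent split, which is the behavior the loop above is aiming for.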

We want to downsample the data frame to the specified sample ratio. The following function does this in the general case; any necessary adaptations are described in the relevant experiment.

In [5]:
def downsample_dataframe(data_full):
    df = pd.DataFrame()
    #determine sample size depending on duration of experiment and sample ratio (specified above)
    # use the last row by position, so this also works when the index has gaps after filtering
    sample_size = ceil(data_full['time'].iloc[-1] * SAMPLE_RATIO)

    for col in data_full.columns:
        df[col] = pp.downsample_data(data_full['time'], data_full[col], sample_size)

    return df
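
`pp.downsample_data` lives in the separate `Preprocessing` module and is not shown here. As a rough illustration of what resampling to a fixed number of points can look like, the sketch below interpolates a signal onto an evenly spaced time grid with `np.interp`. This is an assumption for illustration only: the actual helper may aggregate or interpolate differently, and `resample_uniform` is a hypothetical name.

```python
from math import ceil

import numpy as np
import pandas as pd

SAMPLE_RATIO = 1  # target data points per second

def resample_uniform(time, values, sample_size):
    """Linearly interpolate values onto sample_size evenly spaced time stamps."""
    time = np.asarray(time, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.linspace(time[0], time[-1], sample_size)
    return np.interp(grid, time, values)

# Synthetic signal: 51 points over 5 seconds, power rising linearly with time
raw = pd.DataFrame({"time": np.linspace(0, 5, 51)})
raw["el_power"] = 2.0 * raw["time"]

sample_size = ceil(raw["time"].iloc[-1] * SAMPLE_RATIO)
down = pd.DataFrame({
    col: resample_uniform(raw["time"], raw[col], sample_size)
    for col in raw.columns
})
print(len(raw), "->", len(down))
```

As in `downsample_dataframe`, the time column is resampled together with the measurement columns, so all columns stay aligned.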

The data often include the motor and cool-down phases, which can distort the results. They are therefore cut off and the data frame is shortened.

In [6]:
def shorten_data_frame(df, cut_start, cut_end):
    return df[cut_start:cut_end]

One function combines both steps: it downsamples the data and saves the result, then shortens it and saves the shortened version.

In [7]:
def downsample_and_shorten_files(exp_full, name, cut_start = 0, cut_end = 0):
    exp = downsample_dataframe(exp_full)
    exp.to_csv(SAVE_FOLDER + name + ".csv",
                    index = False, sep = "|", encoding='utf-8')
    
    if cut_end == 0:
        cut_end = len(exp)
    
    exp_short = shorten_data_frame(exp, cut_start, cut_end)
    exp_short.to_csv(SAVE_FOLDER + name + "_short.csv",
                    index = False, sep = "|", encoding='utf-8')
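
The cut positions are row positions in the downsampled frame, and a negative `cut_end` counts from the end, as in ordinary Python slicing. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"time": range(10)})

# Drop the first 2 rows (motor phase) and the last 3 rows (cool-down phase)
short = df[2:-3]
print(short["time"].tolist())

# cut_end = 0 is treated as "keep everything up to the end"
cut_end = 0
if cut_end == 0:
    cut_end = len(df)
print(len(df[2:cut_end]))
```

With one data point per second, a row position corresponds roughly to elapsed seconds, so a call such as `downsample_and_shorten_files(..., 800, -1250)` drops about the first 800 and the last 1250 seconds.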

The following method selects which experiments are preprocessed. This can reduce the runtime, and the selection can differ between an initial run and later runs: when new data becomes available, only the new preparations need to be executed.

In [8]:
def prepare_experiment(exp1 = False, exp4 = False, exp9 = False,
                       exp20 = False, exp21 = False, exp22 = False,
                      exp23 = False, exp24 = False):
    if exp1:
        finish_exp_1()
    if exp4:
        finish_exp_4()
    if exp9:
        finish_exp_9()
    if exp20:
        finish_exp_20()
    if exp21:
        finish_exp_21()
    if exp22:
        finish_exp_22()
    if exp23:
        finish_exp_23()
    if exp24:
        finish_exp_24()
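
The same selection can also be expressed with a dictionary that maps experiment numbers to functions, which avoids adding one keyword argument per experiment. This is only a sketch of an alternative design: `run_selected` and `registry` are hypothetical names, and the lambdas stand in for the `finish_exp_*` functions.

```python
def run_selected(selected, registry):
    """Call registry[key]() for every key in selected, in the given order."""
    done = []
    for key in selected:
        registry[key]()
        done.append(key)
    return done

# Stand-ins for the finish_exp_* functions
log = []
registry = {
    1: lambda: log.append("Experiment 1 is preprocessed."),
    4: lambda: log.append("Experiment 4 is preprocessed."),
}

done = run_selected([4], registry)
print(log)
```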

## Preparation of Experiment 1 (100-30-100-30)

First we prepare the data of experiment 1, which has two steps (100-30-100-30). We load the respective files from the folder and then apply the necessary modifications.

In [9]:
def load_exp_1(foldername):
    data = {}
    
    # These are the names of our .mat files for this experiment; adapt them to your application
    mat_raw_files = ['n_soll.mat', 'P_el_rms.mat', 'P_th.mat', 't_el_rms.mat', 't_nsoll.mat', 't_th.mat']
    
    for file in mat_raw_files:
        path = os.path.join(foldername, file)
        mat_file = loadmat(path)
        search = file[:-4]
        switcher = { # switcher is necessary since file names and MATLAB variable names are not identical
            'n_soll': 'n_soll',
            'P_el_rms': 'P_elrms',
            'P_th': 'P_th',
            't_el_rms': 't_el_mean',
            't_nsoll': 't_2A_ela',
            't_th': 't_1B_th'
        }
        data[search] = mat_file[switcher.get(search)]
    return data
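
To see why the mapping between file names and variable names is needed, the sketch below writes a small .mat file and reads it back. The file content is made up for illustration. `loadmat` returns a dictionary keyed by the MATLAB variable names (plus a few `__...__` metadata entries), and a 1-D array comes back as a 2-D row vector, which is presumably why the preparation code below indexes with `[0]`.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "P_el_rms.mat")
    # File name and variable name differ, as in the switcher above
    savemat(path, {"P_elrms": np.array([1.0, 2.0, 3.0])})
    mat_file = loadmat(path)

# Ignore metadata entries such as __header__ and __version__
keys = [k for k in mat_file if not k.startswith("__")]
print(keys)
print(mat_file["P_elrms"].shape)
```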

In [10]:
raw_exp_1 = load_exp_1(folder_1)

According to our analysis, the experiment consists of four load phases followed by a cool-down, with the following durations:<br/>
0% -> 100%: 60 minutes (3600 secs -> change after 3605)<br/>
100% -> 30%: 36.75 minutes (2200 secs -> change after 5811)<br/>
30% -> 100%: 42 minutes (2500 secs -> change after 8429)<br/>
100% -> 30%: 50 minutes (3000 secs -> change after 11294)<br/>
30% -> 0%: 11.5 minutes (690 secs -> ends at 11981)

In [11]:
def prep_raw_exp_1(data):
    df = pd.DataFrame()
    
    df['time'] = data['t_nsoll'][0][::1000][:-58]
    df['spinning_soll'] = data['n_soll'][:,0][::1000][:-58]
    df['el_power'] = data['P_el_rms'][0][:-51]
    df['th_power'] = data['P_th'][0][::10]
    
    time_splits = [0,1,3605,5811,8429,11981]
    voltage_splits = [0, 10, 3,10,3,0]
    
    #include length for matching lengths
    time_splits.append(len(df['time']))
    voltage_splits.append(0)
    
    df['input_voltage'] = fill_data(time_splits, voltage_splits, df['time'])
    
    return df

In [12]:
def finish_exp_1():
    exp_1_full = prep_raw_exp_1(raw_exp_1)
    exp_1_full.to_csv(SAVE_FOLDER + "experiment_1_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_1_full, "experiment_1", 800, -1250)
    print('Experiment 1 is preprocessed.')

## Preparation of Experiment 4 (30-50-30-75-30-100-30-100-30)

The next experiment we prepare is experiment 4, in our case called the *Stufenexperiment* (step experiment, 30-50-30-75-30-100-30-100). We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

In [13]:
def prep_raw_exp_4():
    df_raw = pd.DataFrame()
    
    #load matlab-files with data from the relevant folder
    mat_raw_files_4 = (file for file in os.listdir(folder_4) if file[-4:] == '.mat' and "daten_Test_" in file)
    raw_exp_4 = pp.open_raw_mat_files(mat_raw_files_4, folder_4)

    
    #split the returned data in the two starting files
    df_spin = raw_exp_4['Drehzahldaten_Test_ID_4b']
    df_power = raw_exp_4['Leistungdaten_Test_ID_4b']
    
    df_raw['time'] = df_power['t_1B_el_neu']
    df_raw['el_power'] = df_power['P_el_rms']
    df_raw['th_power'] = df_power['P_th'][::2]
    
    df_raw['spinning_soll'] = fill_data(df_spin['t_nsoll_stil'], df_spin['n_4b_soll'], df_raw['time'])
    df_raw['input_voltage'] = fill_data(df_spin['t_nsoll_stil'], df_spin['sw_nsoll_stil'], df_raw['time'])
    
    return df_raw

In [14]:
def finish_exp_4():
    exp_4_full = prep_raw_exp_4()
    exp_4_full.to_csv(SAVE_FOLDER + "experiment_4_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_4_full, "experiment_4", 750, 10545)
    print('Experiment 4 is preprocessed.')

## Preparation of Experiment 9 (Household Experiment)

The next experiment we prepare is experiment 9, in our case called the *Haushaltsexperiment* (household experiment). We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

In [15]:
def prep_raw_exp_9():
    df_raw = pd.DataFrame()
    
    #load matlab-files with data from the relevant folder
    mat_raw_files_9 = (file for file in os.listdir(folder_9) if file[-4:] == '.mat' and "daten_Test_" in file)
    raw_exp_9 = pp.open_raw_mat_files(mat_raw_files_9, folder_9)
    
    #split the returned data in the two starting files
    df_spin = raw_exp_9['Drehzahldaten_Test_ID_9']
    df_power = raw_exp_9['Leistungdaten_Test_ID_9']
    
    df_raw['time'] = df_spin['t_n']
    df_raw['el_power'] = df_power['P_el_rms']
    
    df_raw['th_power'] = fill_data(df_power['t_n_th'], df_power['P_th_mean'], df_raw['time'])    
    df_raw['spinning_soll'] = fill_data(df_spin['t_n'], df_spin['n_soll'], df_raw['time'])
    df_raw['input_voltage'] = fill_data(df_spin['t_n'], df_spin['u_ary'], df_raw['time'])
    
    return df_raw

In [16]:
def finish_exp_9():
    exp_9_full = prep_raw_exp_9()
    exp_9_full.to_csv(SAVE_FOLDER + "experiment_9_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_9_full, "experiment_9", 818, -400)
    print('Experiment 9 is preprocessed.')

## Preparation of Experiment 20 (30-50-75-100)

The next experiment we prepare is experiment 20, in our case also called the *Stufenexperiment* (step experiment). Unlike in experiment 4, the input is now fixed and clean. We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

The preparation step is the same for experiments 20 through 24, so it is summarized in one method.

In [17]:
def read_lastprofil(folder, time):
    lastprofil = pd.read_csv(folder + 'Lastprofil.csv', delimiter= ',', header = None)
    time_splits = np.array(lastprofil[0])
    value_splits = np.array(lastprofil[1])
    
    return fill_data(time_splits, value_splits, time)
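
`Lastprofil.csv` is read without a header row, so column 0 is taken as the split times and column 1 as the values to hold from each split onward. The file content below is made up for illustration (the real load profiles are not shown in this notebook), and the units in the comment are an assumption.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical load profile: time in seconds, input voltage
csv_text = "0,0\n60,3\n120,5\n180,0\n"

lastprofil = pd.read_csv(io.StringIO(csv_text), delimiter=",", header=None)
time_splits = np.array(lastprofil[0])
value_splits = np.array(lastprofil[1])
print(time_splits, value_splits)
```

With `header=None`, pandas labels the columns with the integers 0 and 1, which is why `read_lastprofil` can address them as `lastprofil[0]` and `lastprofil[1]`.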

In [18]:
def prep_raw_exp_20_24(raw_exp, folder):
    df_raw = pd.DataFrame()
    
    min_length = min(len(raw_exp['t_nreg']['t_nreg']), len(raw_exp['t_Nrms']['t_Nrms']))
    
    df_raw['time'] = raw_exp['t_Nrms']['t_Nrms'][:min_length]
    df_raw['el_power'] = raw_exp['P_Nel']['P_Nel'][:min_length]
    df_raw['th_power'] = raw_exp['P_th']['P_th'][:min_length]
    df_raw['spinning_ist'] = raw_exp['n_reg']['n_reg'][:min_length]
    
    df_raw['input_voltage'] = read_lastprofil(folder, df_raw['time'])
    
    #exclude areas where measurements are false
    window = 100
    delta = 50
    
    #delete cells, where value is far from rolling mean value
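    # build the rolling mean on a separate Series to avoid chained assignment on the data frame
    roll = df_raw['el_power'].rolling(window).mean()
    roll.iloc[:window] = 0 # as before: the first rows are compared against 0
    df_raw['roll'] = roll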
    df_raw['diff_to_roll'] = abs(df_raw['el_power'] - df_raw['roll'])
    df_raw = df_raw[df_raw['diff_to_roll'] < delta]
    
    df_raw = df_raw.drop(['roll', 'diff_to_roll'], axis = 1)

    return df_raw
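
The effect of the rolling-mean filter can be checked on a small synthetic signal. The sketch below uses a much shorter window and made-up power values. Because the reference is 0 for the first `window` rows, those rows are kept only if `el_power` itself is below `delta`.

```python
import numpy as np
import pandas as pd

window, delta = 5, 50

# Constant power with one injected spike at row 12
df = pd.DataFrame({"el_power": [100.0] * 20})
df.loc[12, "el_power"] = 200.0

roll = df["el_power"].rolling(window).mean()
roll.iloc[:window] = 0  # as in the notebook: reference is 0 for the first rows
keep = (df["el_power"] - roll).abs() < delta
clean = df[keep]
print(len(df), "->", len(clean))
```

Here the spike and the first `window` rows are removed, while the neighbors of the spike stay because the spike shifts their rolling mean by less than `delta`. A larger spike would also remove the rows whose window still contains it.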

In [19]:
def finish_exp_20():
    #load matlab-files with data from the relevant folder
    heads = ['P_th', 'P_Nel', 't_Nrms', 'n_reg', 't_nreg']
    mat_raw_files_20 = (file + '.mat' for file in heads)
    raw_exp_20 = pp.open_raw_mat_files(mat_raw_files_20, folder_20)
    
    exp_20_full = prep_raw_exp_20_24(raw_exp_20, folder_20)
    exp_20_full.to_csv(SAVE_FOLDER + "experiment_20_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_20_full, "experiment_20", 785, 7280)
    print('Experiment 20 is preprocessed.')

## Preparation of Experiment 21 (30-50-75-100)

The next experiment we prepare is experiment 21, in our case also called the *Stufenexperiment* (step experiment). Unlike in experiment 4, the input is now fixed and clean. This is a repetition of experiment 20 to obtain more reliable data. We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

In [20]:
def finish_exp_21():
    #load matlab-files with data from the relevant folder
    heads = ['P_th', 'P_Nel', 't_Nrms', 'n_reg', 't_nreg']
    mat_raw_files_21 = (file + '.mat' for file in heads)
    raw_exp_21 = pp.open_raw_mat_files(mat_raw_files_21, folder_21)
    
    exp_21_full = prep_raw_exp_20_24(raw_exp_21, folder_21)
    exp_21_full.to_csv(SAVE_FOLDER + "experiment_21_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_21_full, "experiment_21", 785, 7280)
    print('Experiment 21 is preprocessed.')

## Preparation of Experiment 22 (Sine Wave)

The next experiment we prepare is experiment 22, in our case called the *Sinusexperiment* (sine experiment). We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

In [21]:
def finish_exp_22():
    #load matlab-files with data from the relevant folder
    heads = ['P_th', 'P_Nel', 't_Nrms', 'n_reg', 't_nreg']
    mat_raw_files_22 = (file + '.mat' for file in heads)
    raw_exp_22 = pp.open_raw_mat_files(mat_raw_files_22, folder_22)
    
    exp_22_full = prep_raw_exp_20_24(raw_exp_22, folder_22)
    exp_22_full.to_csv(SAVE_FOLDER + "experiment_22_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_22_full, "experiment_22", 795, 9285)
    print('Experiment 22 is preprocessed.')

## Preparation of Experiment 23 (Sawtooth Experiment)

The next experiment we prepare is experiment 23, in our case called the *Sägezahnexperiment* (sawtooth experiment). We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

In [22]:
def finish_exp_23():
    #load matlab-files with data from the relevant folder
    heads = ['P_th', 'P_Nel', 't_Nrms', 'n_reg', 't_nreg']
    mat_raw_files_23 = (file + '.mat' for file in heads)
    raw_exp_23 = pp.open_raw_mat_files(mat_raw_files_23, folder_23)
    
    exp_23_full = prep_raw_exp_20_24(raw_exp_23, folder_23)
    exp_23_full.to_csv(SAVE_FOLDER + "experiment_23_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_23_full, "experiment_23", 750, 9938)
    print('Experiment 23 is preprocessed.')

## Preparation of Experiment 24 (Triangle Experiment)

The next experiment we prepare is experiment 24, in our case called the *Dreieckexperiment* (triangle experiment). We conduct the same steps as for experiment 1: preparation, downsampling and shortening.

In [23]:
def finish_exp_24():
    #load matlab-files with data from the relevant folder
    heads = ['P_th', 'P_Nel', 't_Nrms', 'n_reg', 't_nreg']
    mat_raw_files_24 = (file + '.mat' for file in heads)
    raw_exp_24 = pp.open_raw_mat_files(mat_raw_files_24, folder_24)
    
    exp_24_full = prep_raw_exp_20_24(raw_exp_24, folder_24)
    exp_24_full.to_csv(SAVE_FOLDER + "experiment_24_raw.csv",
                    index = False, sep = "|", encoding='utf-8')
    downsample_and_shorten_files(exp_24_full, "experiment_24", 760, 9783)
    print('Experiment 24 is preprocessed.')

In [24]:
prepare_experiment(exp1 = True,
                   exp4 = True,
                   exp9 = True,
                   exp20 = True,
                   exp21 = True,
                   exp22 = True,
                   exp23 = True,
                   exp24 = True)

Experiment 1 is preprocessed.
Experiment 4 is preprocessed.
Experiment 9 is preprocessed.




Experiment 20 is preprocessed.
Experiment 21 is preprocessed.
Experiment 22 is preprocessed.
Experiment 23 is preprocessed.
Experiment 24 is preprocessed.
