# Data Processing<a class="tocSkip">

In this notebook we will process data and prepare it for modelling. 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc">
    <ul class="toc-item">
        <li>
            <span>
                <a href="#Import-packages-and-libraries" data-toc-modified-id="Import-packages-and-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import packages and libraries</a>
            </span>
        </li>
        <li>
            <span>
                <a href="#Load-data" data-toc-modified-id="Load-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data</a>
            </span>
        </li>
        <li>
            <span>
                <a href="#Data-processing" data-toc-modified-id="Data-processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data processing</a>
            </span>
            <ul class="toc-item">
                <li>
                    <span>
                        <a href="#Feature-extraction" data-toc-modified-id="Feature-extraction-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Feature extraction</a>
                    </span>
                </li>
                <li>
                    <span>
                        <a href="#Feature-scaling" data-toc-modified-id="Feature-scaling-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Feature scaling</a>
                    </span>
                </li>
            </ul>
        </li>
        <li>
            <span>
                <a href="#Save-data" data-toc-modified-id="Save-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Save data</a>
            </span>
        </li>
    </ul>
</div>


## Import packages and libraries

In [1]:
import sys

# insert the local source folder near the front of the module search path (index 1) so its modules can be imported
sys.path.insert(1, '../src')

# data manipulation package
import pandas as pd

# configure pandas display settings
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 50)

# import robust scaler
from sklearn.preprocessing import RobustScaler 

# library containing utility functions
import utils
# library containing data processing functions
import processing

## Load data

In [2]:
df_train1 = pd.read_csv('../data/02_interim/train_FD001.csv')
df_train2 = pd.read_csv('../data/02_interim/train_FD002.csv')
df_train3 = pd.read_csv('../data/02_interim/train_FD003.csv')
df_train4 = pd.read_csv('../data/02_interim/train_FD004.csv')
df_test1 = pd.read_csv('../data/02_interim/test_FD001.csv')
df_test2 = pd.read_csv('../data/02_interim/test_FD002.csv')
df_test3 = pd.read_csv('../data/02_interim/test_FD003.csv')
df_test4 = pd.read_csv('../data/02_interim/test_FD004.csv')

In [3]:
df_train1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,phi,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187


In [4]:
df_test1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,phi,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,142
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,141
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166,140
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737,139
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413,138


## Data processing

### Feature extraction

First, we compute the following new features:
* Outlet pressure at the core nozzle: P50 = P2 * epr
* Difference between demanded and actual fan speed: DNf = Nf_dmd - Nf
* Difference between demanded and actual corrected fan speed: DNRf = PCNfR_dmd - NRf
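
The derived features listed above can be sketched directly in pandas. This is a minimal illustration, not the actual `processing.feature_extraction` implementation, which is not shown in this notebook. The helper name `add_derived_features` is hypothetical, and the column names (`PCNfR_dmd`, `NRf`) are taken from the headers of the loaded data.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with P50, DNf and DNRf added."""
    out = df.copy()
    # outlet pressure at the core nozzle
    out["P50"] = out["P2"] * out["epr"]
    # demanded minus actual fan speed
    out["DNf"] = out["Nf_dmd"] - out["Nf"]
    # demanded minus actual corrected fan speed
    out["DNRf"] = out["PCNfR_dmd"] - out["NRf"]
    return out


# one row using the values of the first training row shown earlier
sample = pd.DataFrame({
    "P2": [14.62], "epr": [1.3], "Nf": [2388.06], "NRf": [2388.02],
    "Nf_dmd": [2388], "PCNfR_dmd": [100.0],
})
features = add_derived_features(sample)
print(features[["P50", "DNf", "DNRf"]].round(3))
```

Working on a copy keeps the input dataframe unchanged, which makes the helper safe to re-run in a notebook.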


High temperatures reduce the strength of composite materials, which are used in turbofan manufacturing. We therefore generate expanding window features that record the maximum temperature reached by each engine up to the current row.



Pressure is a form of mechanical stress that can cause distortion, so we treat it the same way: we generate expanding window features that record the maximum pressure reached by each engine up to the current row.
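
An expanding maximum per engine can be sketched with a grouped cumulative maximum. This is only an illustration under assumptions: the helper name `add_expanding_max` is hypothetical, the `_expandmax` suffix mirrors the column names in the outputs of this notebook, and the rows are assumed to be already sorted by `Cycle` within each engine.

```python
import pandas as pd


def add_expanding_max(df, columns, group_col="Engine_no"):
    """Hypothetical helper: add a running per-engine maximum for each column."""
    out = df.copy()
    for col in columns:
        # cummax restarts for every engine, so no value leaks across engines
        out[f"{col}_expandmax"] = out.groupby(group_col)[col].cummax()
    return out


# toy data: two engines, rows ordered by cycle
sample = pd.DataFrame({
    "Engine_no": [1, 1, 1, 1, 2, 2],
    "Cycle": [1, 2, 3, 4, 1, 2],
    "T30": [1589.70, 1591.82, 1587.99, 1582.79, 1585.29, 1588.45],
})
result = add_expanding_max(sample, ["T30"])
print(result)
```

Because the maximum at a given row only uses that row and earlier ones, the feature does not use information from future cycles.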

In [5]:
df_train1 = processing.feature_extraction(df_train1)
df_train2 = processing.feature_extraction(df_train2)
df_train3 = processing.feature_extraction(df_train3)
df_train4 = processing.feature_extraction(df_train4)
df_test1 = processing.feature_extraction(df_test1)
df_test2 = processing.feature_extraction(df_test2)
df_test3 = processing.feature_extraction(df_test3)
df_test4 = processing.feature_extraction(df_test4)

In [6]:
df_train1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,...,39.06,23.419,191,-0.06,-2288.02,19.006,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,47.47,19.006
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,...,39.0,23.4236,190,-0.04,-2288.07,19.006,518.67,642.15,1591.82,1403.14,14.62,21.61,554.36,47.49,19.006
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,...,38.95,23.3442,189,-0.08,-2288.03,19.006,518.67,642.35,1591.82,1404.2,14.62,21.61,554.36,47.49,19.006
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,...,38.88,23.3739,188,-0.11,-2288.08,19.006,518.67,642.35,1591.82,1404.2,14.62,21.61,554.45,47.49,19.006
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,...,38.9,23.4044,187,-0.06,-2288.04,19.006,518.67,642.37,1591.82,1406.22,14.62,21.61,554.45,47.49,19.006


In [7]:
df_test1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,...,38.86,23.3735,142,-0.04,-2288.03,19.006,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,47.2,19.006
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,...,39.02,23.3916,141,-0.01,-2288.06,19.006,518.67,643.02,1588.45,1398.21,14.62,21.61,554.85,47.5,19.006
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,...,39.08,23.4166,140,-0.05,-2288.03,19.006,518.67,643.02,1588.45,1401.34,14.62,21.61,554.85,47.5,19.006
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,...,39.0,23.3737,139,-0.03,-2288.05,19.006,518.67,643.02,1588.45,1406.42,14.62,21.61,554.85,47.5,19.006
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,...,38.99,23.413,138,-0.01,-2288.03,19.006,518.67,643.02,1588.45,1406.42,14.62,21.61,554.85,47.5,19.006


### Feature scaling

Since the four datasets differ from one another, we fit a single scaler on the concatenation of the four training sets. Our data contain outliers, so we use a robust scaler, which preserves the information carried by the outliers.
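
To make the choice of scaler concrete, the sketch below applies scikit-learn's `RobustScaler` to a toy feature. With its default settings the scaler centers each feature on its median and divides by the interquartile range, so a single extreme value does not compress the remaining points. The numbers are illustrative and do not come from the datasets.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy feature with one extreme value (illustrative numbers only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler().fit(X)
X_scaled = scaler.transform(X)

print(scaler.center_)    # per-feature median
print(scaler.scale_)     # per-feature interquartile range
print(X_scaled.ravel())  # the outlier stays far from the other points
```

The typical points end up within about one unit of zero, while the outlier keeps a large scaled value and so remains distinguishable.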

In [8]:
# concatenate train dataframes
df = pd.concat([df_train1, df_train2, df_train3, df_train4])
# get input variables from dataframe
X = df.loc[:, ~df.columns.isin(['Engine_no', 'RUL'])].values
# get target variable from dataframe
y = df.loc[:, ['RUL']].values
# create the scaler and fit it on the inputs (RobustScaler ignores y)
scaler = RobustScaler().fit(X=X, y=y)
# save scaler
utils.save_scaler(scaler=scaler)

In [9]:
df_train1 = processing.feature_scaling(df_train1)
df_train2 = processing.feature_scaling(df_train2)
df_train3 = processing.feature_scaling(df_train3)
df_train4 = processing.feature_scaling(df_train4)
df_test1 = processing.feature_scaling(df_test1)
df_test2 = processing.feature_scaling(df_test2)
df_test3 = processing.feature_scaling(df_test3)
df_test4 = processing.feature_scaling(df_test4)

In [10]:
df_train1.head()

Unnamed: 0,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,-0.974138,-0.571391,-0.738747,0.0,0.427849,0.388504,0.422676,0.467952,0.576586,0.584129,0.563245,0.390439,0.391066,0.75,0.476548,...,0.577379,0.576147,-0.047619,0.263158,0.656994,0.0,-1.9,-0.71231,-0.915346,0.0,0.0,-0.333333,-0.257143,0.0,191.0
1,-0.965517,-0.571317,-0.738628,0.0,0.427849,0.392076,0.431924,0.477176,0.576586,0.584129,0.561629,0.390326,0.388126,0.75,0.4803,...,0.574929,0.57646,0.047619,0.0,0.656994,0.0,-1.35,-0.419087,-0.685378,0.0,0.0,-0.333333,-0.2,0.0,190.0
2,-0.956897,-0.571494,-0.737914,0.0,0.427849,0.394241,0.415216,0.481026,0.576586,0.584129,0.56298,0.390553,0.400427,0.75,0.439024,...,0.572887,0.571056,-0.142857,0.210526,0.656994,0.0,-1.016667,-0.419087,-0.589407,0.0,0.0,-0.333333,-0.2,0.0,189.0
3,-0.948276,-0.571351,-0.738271,0.0,0.427849,0.394241,0.392532,0.472564,0.576586,0.584129,0.563483,0.390724,0.395629,0.75,0.412758,...,0.570029,0.573078,-0.285714,-0.052632,0.656994,0.0,-1.016667,-0.419087,-0.589407,0.0,0.0,-0.252252,-0.2,0.0,188.0
4,-0.939655,-0.571425,-0.738509,0.0,0.427849,0.394458,0.392793,0.488361,0.576586,0.584129,0.562291,0.390439,0.403492,0.75,0.440901,...,0.570845,0.575153,-0.047619,0.157895,0.656994,0.0,-0.983333,-0.419087,-0.406519,0.0,0.0,-0.252252,-0.2,0.0,187.0


In [11]:
df_test1.head()

Unnamed: 0,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,-0.974138,-0.571305,-0.737914,0.0,0.427849,0.401494,0.403438,0.459273,0.576586,0.584129,0.562027,0.390326,0.396586,0.75,0.425891,...,0.569212,0.57305,0.047619,0.210526,0.656994,0.0,0.1,-1.322268,-1.131734,0.0,0.0,-0.747748,-1.028571,0.0,142.0
1,-0.965517,-0.571448,-0.738628,0.0,0.427849,0.387313,0.417223,0.449141,0.576586,0.584129,0.564543,0.390155,0.40248,0.75,0.482176,...,0.575745,0.574282,0.190476,0.052632,0.656994,0.0,0.1,-0.885201,-1.131734,0.0,0.0,0.108108,-0.171429,0.0,141.0
2,-0.956897,-0.571362,-0.738152,0.0,0.427849,0.395432,0.410636,0.47064,0.576586,0.584129,0.562583,0.390383,0.406002,0.75,0.482176,...,0.578195,0.575984,0.0,0.210526,0.656994,0.0,0.1,-0.885201,-0.848348,0.0,0.0,0.108108,-0.171429,0.0,140.0
3,-0.948276,-0.571251,-0.738271,0.0,0.427849,0.395215,0.398334,0.489087,0.576586,0.584129,0.562477,0.390269,0.389818,0.75,0.440901,...,0.574929,0.573064,0.095238,0.105263,0.656994,0.0,0.1,-0.885201,-0.388411,0.0,0.0,0.108108,-0.171429,0.0,139.0
4,-0.939655,-0.571331,-0.738271,0.0,0.427849,0.395973,0.411726,0.472746,0.576586,0.584129,0.562715,0.390155,0.388792,0.75,0.446529,...,0.57452,0.575739,0.190476,0.210526,0.656994,0.0,0.1,-0.885201,-0.388411,0.0,0.0,0.108108,-0.171429,0.0,138.0


## Save data

In [12]:
dataframes = {'df_train1': df_train1, 'df_train2': df_train2, 'df_train3': df_train3, 'df_train4': df_train4,
              'df_test1': df_test1, 'df_test2': df_test2, 'df_test3': df_test3, 'df_test4': df_test4}
utils.save_data(dataframes, data_type = '03_processed')

Saving successful
