# Data Processing<a class="tocSkip">

In this notebook we will process data and prepare it for modelling. 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc">
    <ul class="toc-item">
        <li>
            <span>
                <a href="#Import-packages-and-libraries" data-toc-modified-id="Import-packages-and-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import packages and libraries<a name="1" rel="nofollow"></a></a>
            </span>
        </li>
        <li>
            <span>
                <a href="#Load-data" data-toc-modified-id="Load-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data<a name="2" rel="nofollow"></a></a>
            </span>
        </li>
        <li>
            <span>
                <a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature Engineering<a name="3" rel="nofollow"></a></a>
            </span>
        </li>
        <li>
            <span>
                <a href="#Save-data" data-toc-modified-id="Save-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Save data<a name="4" rel="nofollow"></a></a>
            </span>
        </li>
    </ul>
</div>

## Import packages and libraries

In [3]:
import sys

# insert libraries folder at the beginning of system path to enable fast access
sys.path.insert(1, '../src')

# data manipulation package
import pandas as pd

# configure pandas display settings
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 50)

# library containing utility functions
import utils
# library containing data processing functions
import processing

## Load data

In [18]:
df_train1 = pd.read_csv('../data/02_interim/train_FD001.csv')
df_train2 = pd.read_csv('../data/02_interim/train_FD002.csv')
df_train3 = pd.read_csv('../data/02_interim/train_FD003.csv')
df_train4 = pd.read_csv('../data/02_interim/train_FD004.csv')
df_test1 = pd.read_csv('../data/02_interim/test_FD001.csv')
df_test2 = pd.read_csv('../data/02_interim/test_FD002.csv')
df_test3 = pd.read_csv('../data/02_interim/test_FD003.csv')
df_test4 = pd.read_csv('../data/02_interim/test_FD004.csv')

In [19]:
df_train1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,phi,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187


In [20]:
df_test1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,phi,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,142
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,141
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166,140
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737,139
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413,138


## Feature engineering

### Feature extraction

First, we will compute new features like:
* Outlet pressure at core nozzle P50 = P2 * epr
* Fan speed difference between demand and supply DNf = Nf_dmd - Nf
* Corrected fan speed difference between demand and supply DNRf = PCNRf_dmd - Nf 


Since high temperatures reduce the strength of composite materials (materials used in turbofan manufacturing). We will generate expanding window features to register the maximum temperatures reached for each engine until the specific row. 



Since pressure is synonym of constraint, attention will be given to the former as it can cause distortion. We will generate expanding window features to register the maximum pressures reached for each engine until the specific row.

In [21]:
df_train1 = processing.feature_extraction(df_train1)
df_train2 = processing.feature_extraction(df_train2)
df_train3 = processing.feature_extraction(df_train3)
df_train4 = processing.feature_extraction(df_train4)
df_test1 = processing.feature_extraction(df_test1)
df_test2 = processing.feature_extraction(df_test2)
df_test3 = processing.feature_extraction(df_test3)
df_test4 = processing.feature_extraction(df_test4)

In [22]:
df_train1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,...,39.06,23.419,191,-0.06,-2288.02,19.006,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,47.47,19.006
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,...,39.0,23.4236,190,-0.04,-2288.07,19.006,518.67,642.15,1591.82,1403.14,14.62,21.61,554.36,47.49,19.006
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,...,38.95,23.3442,189,-0.08,-2288.03,19.006,518.67,642.35,1591.82,1404.2,14.62,21.61,554.36,47.49,19.006
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,...,38.88,23.3739,188,-0.11,-2288.08,19.006,518.67,642.35,1591.82,1404.2,14.62,21.61,554.45,47.49,19.006
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,...,38.9,23.4044,187,-0.06,-2288.04,19.006,518.67,642.37,1591.82,1406.22,14.62,21.61,554.45,47.49,19.006


In [23]:
df_test1.head()

Unnamed: 0,Engine_no,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,...,38.86,23.3735,142,-0.04,-2288.03,19.006,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,47.2,19.006
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,...,39.02,23.3916,141,-0.01,-2288.06,19.006,518.67,643.02,1588.45,1398.21,14.62,21.61,554.85,47.5,19.006
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,...,39.08,23.4166,140,-0.05,-2288.03,19.006,518.67,643.02,1588.45,1401.34,14.62,21.61,554.85,47.5,19.006
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,...,39.0,23.3737,139,-0.03,-2288.05,19.006,518.67,643.02,1588.45,1406.42,14.62,21.61,554.85,47.5,19.006
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,...,38.99,23.413,138,-0.01,-2288.03,19.006,518.67,643.02,1588.45,1406.42,14.62,21.61,554.85,47.5,19.006


### Feature scaling

First, we will compute new features like:
* Outlet pressure at core nozzle P50 = P2 * epr
* Fan speed difference between demand and supply DNf = Nf_dmd - Nf
* Corrected fan speed difference between demand and supply DNRf = PCNRf_dmd - Nf 


Since high temperatures reduce the strength of composite materials (materials used in turbofan manufacturing). We will generate expanding window features to register the maximum temperatures reached for each engine until the specific row. 



Since pressure is synonym of constraint, attention will be given to the former as it can cause distortion. We will generate expanding window features to register the maximum pressures reached for each engine until the specific row.

In [24]:
from sklearn.preprocessing import RobustScaler
import numpy as np

In [40]:
def feature_scaling(dataframe_train, dataframe_test, dataset_no):
    columns = dataframe_train.columns[1:]
    X_train = dataframe_train.drop(['Engine_no', 'RUL'], axis=1).values
    y_train = dataframe_train.loc[:, 'RUL'].values
    X_test = dataframe_test.drop(['Engine_no', 'RUL'], axis=1).values
    y_test = dataframe_test.loc[:, 'RUL'].values
    Scaler = RobustScaler()
    X_train = Scaler.fit_transform(X_train)
    X_test = Scaler.transform(X_test)
    dataframe_train_scaled = pd.DataFrame(np.hstack((X_train, y_train[:,None])), columns=columns)
    dataframe_test_scaled = pd.DataFrame(np.hstack((X_test, y_test[:,None])), columns=columns)
    return dataframe_train_scaled, dataframe_test_scaled

In [41]:
df_train1, df_test1 = feature_scaling(df_train1, df_train2, '1')

In [5]:
df_train1 = processing.feature_scaling(df_train1)
df_train2 = processing.feature_scaling(df_train2)
df_train3 = processing.feature_scaling(df_train3)
df_train4 = processing.feature_scaling(df_train4)
df_test1 = processing.feature_scaling(df_test1)
df_test2 = processing.feature_scaling(df_test2)
df_test3 = processing.feature_scaling(df_test3)
df_test4 = processing.feature_scaling(df_test4)

In [42]:
df_train1.head()

Unnamed: 0,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,-0.990385,-0.233333,-0.8,0.0,0.0,-1.214815,-0.049261,-0.610086,0.0,0.0,0.766667,-0.333333,-0.886642,0.0,-0.114286,...,0.92,0.835172,0.333333,0.7,0.0,0.0,-2.553571,-1.251534,-1.403027,0.0,0.0,-0.578313,-0.71875,0.0,191.0
1,-0.980769,0.633333,-0.6,0.0,0.0,-0.725926,0.211823,-0.401804,0.0,0.0,0.258333,-0.555556,-1.016544,0.0,-0.057143,...,0.68,0.866897,0.555556,0.2,0.0,0.0,-1.964286,-0.92638,-1.162725,0.0,0.0,-0.578313,-0.65625,0.0,190.0
2,-0.971154,-1.433333,0.6,0.0,0.0,-0.42963,-0.259852,-0.314883,0.0,0.0,0.683333,-0.111111,-0.473039,0.0,-0.685714,...,0.48,0.31931,0.111111,0.6,0.0,0.0,-1.607143,-0.92638,-1.062441,0.0,0.0,-0.578313,-0.65625,0.0,189.0
3,-0.961538,0.233333,0.0,0.0,0.0,-0.42963,-0.900246,-0.505945,0.0,0.0,0.841667,0.222222,-0.685049,0.0,-1.085714,...,0.2,0.524138,-0.222222,0.1,0.0,0.0,-1.607143,-0.92638,-1.062441,0.0,0.0,-0.46988,-0.65625,0.0,188.0
4,-0.951923,-0.633333,-0.4,0.0,0.0,-0.4,-0.892857,-0.149241,0.0,0.0,0.466667,-0.333333,-0.337623,0.0,-0.657143,...,0.28,0.734483,0.333333,0.5,0.0,0.0,-1.571429,-0.92638,-0.871334,0.0,0.0,-0.46988,-0.65625,0.0,187.0


In [43]:
df_test1.head()

Unnamed: 0,Cycle,Altitude,Mach,TRA,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,...,W31,W32,RUL,DNf,DNRf,P50,T2_expandmax,T24_expandmax,T30_expandmax,T50_expandmax,P2_expandmax,P15_expandmax,P30_expandmax,Ps30_expandmax,P50_expandmax
0,-0.990385,11666.1,1680.0,0.0,-69.23,-129.362963,-28.508621,-22.206642,-9.14,-13.61,-299.0,-1838.222222,-44.041054,-0.28,-15.685714,...,-96.4,-99.936552,4.888889,3.7,-13.4164,-69.23,-157.017857,-36.694785,-26.319773,-9.14,-13.61,-433.975904,-17.75,-13.4164,148.0
1,-0.980769,13999.4,1681.6,0.0,-73.67,-137.392593,-29.172414,-23.145551,-10.71,-15.9,-345.775,-1961.333333,-46.366422,-0.28,-15.171429,...,-113.68,-117.457931,5.777778,4.3,-15.0178,-69.23,-157.017857,-36.694785,-26.319773,-9.14,-13.61,-433.975904,-17.1875,-13.4164,147.0
2,-0.971154,8332.933333,1243.6,-40.0,-56.13,-156.044444,-41.051724,-29.568676,-7.57,-12.59,-314.775,-5255.333333,-64.904412,-0.36,-30.914286,...,-99.0,-100.866207,-0.222222,3449.9,-12.379,-56.13,-157.017857,-36.694785,-26.319773,-7.57,-12.59,-433.975904,-17.1875,-12.379,146.0
3,-0.961538,14002.566667,1683.2,0.0,-73.67,-137.97037,-29.07266,-23.096351,-10.71,-15.9,-345.816667,-1961.222222,-46.366422,-0.28,-15.857143,...,-112.96,-116.053793,5.666667,4.8,-15.0178,-56.13,-157.017857,-36.694785,-26.319773,-7.57,-12.59,-433.975904,-17.1875,-12.379,145.0
4,-0.951923,8333.5,1240.6,-40.0,-56.13,-156.4,-40.934729,-29.529315,-7.57,-12.58,-315.325,-5255.444444,-65.40625,-0.36,-30.342857,...,-98.8,-101.857241,-0.111111,3450.2,-12.379,-56.13,-157.017857,-36.694785,-26.319773,-7.57,-12.58,-433.975904,-17.1875,-12.379,144.0


## Save data

In [8]:
dataframes = {'df_train1': df_train1, 'df_train2': df_train2, 'df_train3': df_train3, 'df_train4': df_train4,
              'df_test1': df_test1, 'df_test2': df_test2, 'df_test3': df_test3, 'df_test4': df_test4}
utils.save_data(dataframes, data_type = '03_processed')

  0%|          | 0/8 [00:00<?, ?it/s]

Saving successful
