# 02_2 Feature engineering
Due to a non-disclosure agreement (NDA), no data can be displayed.

In this notebook the DataFrame used by the Machine Learning models is assembled.  
First, the high frequency data is sorted and the number of features is reduced. Then the DataFrame is enriched with daily data from the noon report and with predicted data from the Engine Model.  
In addition, the draft sensor data is smoothed with a rolling average and also included in the DataFrame.

---

## Feature selection of high frequency data

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

For this notebook the data from the notebook "Preprocessing" is used. The data contains 5 months of sensor data (110 features plus timestamp) from the end of May 2021 to the end of October 2021. Measurements are logged in one-minute intervals. For details of the preprocessing, e.g. handling of duplicates and identification of missing values, see notebook "Preprocessing".

In [None]:
# load the data
df = pd.read_csv('../data/FINAL_Bluetracker_mean_1m_ES0NaN.csv')
df.head()

In [None]:
print('We have timestamps from {} to {}. Therefore a series ranging over 5 months.'.format(df.EntryDate.min(), df.EntryDate.max()))

We have 111 features over a period of 5 months to look at. One of them is the timestamp. A detailed look at the feature names and explanations showed that some features might be more important than others. Less important are, for example, information about the controller or duplicate information given in different units or granularity. The following features are considered not important:
* Information about the Programmable Logic Controller, i.e. feature names starting with ```LM1.plc_``` or ```LM2.plc_```
    * This block of features gives information about the computation load but not about the vessel performance. Thus, they are considered not relevant for this study.
* Detailed information about the auxiliary engines (```AE{1-X}.LOD.act.PRC``` and ```AE{1-X}.POW.act.kW```)
    * This information is given for each of the auxiliary engines separately and as totals. For this study, only the totals are considered.
* Fuel oil temperature (```{engine}.FTS.act.dgC```)
    * Heavy fuel oil needs to be heated before it is used for combustion. This process is mandatory and cannot be adjusted. Therefore these features are dropped.
* Rate of Turn (```V.ROT.act.degPmin```)
    * This feature might be important for daily business, but has less effect on the general performance of the vessel.
* Vessel distance (```V.DOG.cnt_tot.nm``` and ```V.DTW.cnt_tot.NM```)
    * For this study the actual location, course and speed are considered more important. The vessel distance is probably of more interest to the shipping company.
* Total running information
    * We are investigating only one vessel, thus there is neither the need nor the possibility to compare with other vessels and different running hours.
* Heavy fuel oil consumption totals ```{engine}.{fuel type}.cnt_tot.t```
    * We will work with the current fuel oil consumption, thus the totals are not necessary. The same applies to the average fuel oil consumption (```{engine}.SFC.avg_tot.gPkWh```).
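
The name patterns in the list above could also serve as a first-pass filter. The following is a minimal sketch under stated assumptions: most column names are invented examples that merely follow the naming scheme, and the regular expressions are one possible reading of the patterns. The notebook itself relies on a hand-curated importance list instead.

```python
import re

# Hypothetical column names that follow the naming scheme described above
columns = [
    "EntryDate",
    "LM1.plc_cpuload",     # invented PLC feature name
    "LM2.plc_memory",      # invented PLC feature name
    "AE1.POW.act.kW",
    "V.ROT.act.degPmin",
    "V.SOG.act.kn",
    "ME.FTS.act.dgC",      # invented engine prefix "ME"
]

# Regexes derived from the bullet list; their exact form is an assumption
drop_patterns = [
    r"^LM[12]\.plc_",                           # PLC information
    r"^AE\d+\.(LOD\.act\.PRC|POW\.act\.kW)$",   # per-engine auxiliary details
    r"\.FTS\.act\.dgC$",                        # fuel oil temperature
    r"^V\.ROT\.act\.degPmin$",                  # rate of turn
]

# Collect every column that matches at least one pattern
candidates = [c for c in columns if any(re.search(p, c) for p in drop_patterns)]
print(candidates)
```

Such a filter only proposes candidates. The final decision per feature still needs domain knowledge.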


All features are stored in a list to which a feature importance has been added. This feature importance will be used to exclude features.
The importance of each feature has been scored from 1 to 3, where
* 1: keep the feature
* 2: might be important
* 3: drop the feature

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')
data_log.head()

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['F_Imp_new'] < 3]['VarName'])
len(list_imp_feat)

In [None]:
# create list of all features in the data frame
columns = df.columns
len(columns)

The list of important features includes some features that have no values in our data frame, i.e. in the high frequency sensor data. Therefore we have to exclude these from our list of important features.

In [None]:
# list of important features that are not included in our data
list_notinfeat = list(set(list_imp_feat) - set(columns))
len(list_notinfeat)

There are some features that have the same VarName but different DataLogID and LogName. We figured out that these features are marked with ```trendlog```. The trendlog DataLogIDs do not have measurements in the delivered high frequency data. Therefore, we remove the duplicate feature names from the list.

In [None]:
# check for duplicates in list of important features
list_double = []
for i in list_imp_feat:
    if list_imp_feat.count(i) > 1:
        list_double.append(i)
set(list_double)
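
The duplicate check above calls `list.count` once per element, which is fine for about a hundred names. A single-pass alternative with `collections.Counter` might look like the sketch below. The feature names other than `V.SOG.act.kn` are invented for illustration.

```python
from collections import Counter

# Hypothetical excerpt of the importance list with one repeated name
list_imp_feat = ["V.SOG.act.kn", "ME.POW.act.kW", "V.SOG.act.kn", "V.COG.act.deg"]

# Count every name once, then keep those that occur more than once
counts = Counter(list_imp_feat)
duplicates = {name for name, n in counts.items() if n > 1}
print(duplicates)
```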

In [None]:
# remove duplicates
list_imp_feat = list(set(list_imp_feat))
len(list_imp_feat)

With this information and the preparatory steps, a list of the features that can be dropped is created. Afterwards, these features are dropped from the data frame with the high frequency sensor data.

In [None]:
list_imp_feat = list(set(list_imp_feat) - set(list_notinfeat))
len(list_imp_feat)

In [None]:
list_todrop = list(set(columns) - set(list_imp_feat))
len(list_todrop)

In [None]:
# drop the features from the DataFrame
df = df.drop(list_todrop, axis = 1)
df.head()

After dropping 82 features that are considered to be not important, there are now 28 features plus the time stamp left.

In [None]:
df.info(verbose = True)

All features are floats, which makes sense, since we obtained sensor data. The following table gives an overview of the feature statistics.

In [None]:
df.describe().T

In [None]:
print(df.isnull().sum().sum(), 'out of', df.shape[0]*df.shape[1], 'entries are NaN.')

In [None]:
df.isnull().sum().sum()/(df.shape[0]*df.shape[1])*100

7.88% of the sensor measurements are missing values. It should be further investigated how to deal with these missing values. Do they make sense for at least some features? Are there single sensors that are prone to dropout? Is there a possibility to fill them reasonably? These questions will be addressed in the EDA. At this stage, the focus is still on the feature selection. In order to check whether there are features that are related to each other and thus add no value, a correlation matrix for all features is evaluated.
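
To see whether single sensors are prone to dropout, the share of missing values can be computed per feature instead of over the whole frame. This is a minimal sketch on an invented toy frame. The column names `sensor_a` to `sensor_c` are placeholders, not features of the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sensor data (values are invented)
toy = pd.DataFrame({
    "sensor_a": [1.0, np.nan, 3.0, 4.0],
    "sensor_b": [np.nan, np.nan, 1.0, 2.0],
    "sensor_c": [0.1, 0.2, 0.3, 0.4],
})

# Share of missing values per column in percent, largest first
nan_share = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(nan_share)
```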

---

## Defining passage types

In [None]:
# Geoplot with location, speed and date to visualize the positions and speed of the vessel
fig = px.scatter_mapbox(df,
                        lat='V.GPSLAT.act.deg',lon='V.GPSLON.act.deg',color='V.SOG.act.kn',text='EntryDate',
                        width=1000, height=600, 
                        title='observation period', 
                        labels={'V.GPSLAT.act.deg':'Latitude','V.GPSLON.act.deg':'Longitude','V.SOG.act.kn':'True Speed [kn]','EntryDate':'Date'},
                        color_discrete_sequence=px.colors.qualitative.Safe, range_color=(0,df['V.SOG.act.kn'].max()))
fig.update_layout(mapbox_style="open-street-map",
                  title_font_family="Arial",
                  title_font_color="grey",
                  title_font_size=24,
                  title_x=0.5,
                  legend=dict(title_font_family="Arial",
                                title_font_size=20,
                                title_font_color="grey",
                                font=dict(family="Arial",
                                            size=18,
                                            color="grey"))
)

fig.show()

The vessel is operating between Europe and South America. Over the Atlantic the speed is high and rather constant. While entering ports or anchoring, the speed is low. Therefore different passage types can be defined. The categorization is as follows: long constant passages over the Atlantic, port entries with low vessel speed, and all other passages. With this information a more precise EDA can be conducted.

In [None]:
# plot histogram of all speed values
px.histogram(df,x='V.SOG.act.kn')

In [None]:
# plot histogram of speed on Atlantic passages
px.histogram(df[(df['V.GPSLAT.act.deg']>=19) & (df['V.GPSLON.act.deg']<=-3)], x='V.SOG.act.kn')

Based on the lowest speed on Atlantic passages, 13.5 kn is selected as the threshold for low speed, i.e. port entries.
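
The threshold could also be read from the data rather than from the histogram. The sketch below uses invented toy values and the same Atlantic mask as the histogram cell. The low quantile is an added suggestion, not something the notebook does.

```python
import pandas as pd

# Toy positions and speeds (invented values)
toy = pd.DataFrame({
    "V.GPSLAT.act.deg": [30.0, 35.0, 40.0, 10.0],
    "V.GPSLON.act.deg": [-20.0, -30.0, -10.0, -40.0],
    "V.SOG.act.kn": [15.2, 13.6, 17.0, 5.0],
})

# Same geographic mask as used for the Atlantic histogram
atlantic = (toy["V.GPSLAT.act.deg"] >= 19) & (toy["V.GPSLON.act.deg"] <= -3)
speed_atlantic = toy.loc[atlantic, "V.SOG.act.kn"]

# The minimum is sensitive to outliers; a low quantile is a more robust alternative
print(speed_atlantic.min(), speed_atlantic.quantile(0.01))
```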

In [None]:
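# np.select assigns the first matching condition, so the two Europe conditions
# only apply to rows not already classified as Atlantic or South America
conditions = [
    (df['V.GPSLAT.act.deg']>=19) & (df['V.GPSLON.act.deg']<=-3),
    (df['V.GPSLAT.act.deg']<19) & (df['V.SOG.act.kn']>13.5),
    (df['V.GPSLAT.act.deg']<19) & (df['V.SOG.act.kn']<=13.5),
    (df['V.GPSLAT.act.deg']>-3) & (df['V.SOG.act.kn']>13.5),
    (df['V.GPSLAT.act.deg']>-3) & (df['V.SOG.act.kn']<=13.5)]
choices = ['Atlantic', 'SouthAmerica>13.5kn', 'SouthAmerica<13.5kn', 'Europe>13.5kn', 'Europe<13.5kn']
# a string default keeps the dtype consistent; the implicit default 0 can raise a TypeError in recent NumPy versions
df['passage_type'] = np.select(conditions, choices, default='unknown')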
df['passage_type'].value_counts()

The passage types are split by position (latitude, and longitude for the Atlantic) and speed. Fixed values are used for the position limits. These values were chosen by visual inspection of the map.
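
Because `np.select` takes the first matching condition, the order of the conditions matters. The sketch below replays the same conditions on five invented rows with shortened, hypothetical column names (`lat`, `lon`, `sog`). The first two rows would also satisfy a later Europe condition, but keep the label of the earlier match.

```python
import numpy as np
import pandas as pd

# Five invented positions and speeds
toy = pd.DataFrame({
    "lat": [30.0, 5.0, 5.0, 50.0, 50.0],
    "lon": [-30.0, -35.0, -35.0, 5.0, 5.0],
    "sog": [16.0, 15.0, 8.0, 15.0, 8.0],
})

conditions = [
    (toy["lat"] >= 19) & (toy["lon"] <= -3),
    (toy["lat"] < 19) & (toy["sog"] > 13.5),
    (toy["lat"] < 19) & (toy["sog"] <= 13.5),
    (toy["lat"] > -3) & (toy["sog"] > 13.5),
    (toy["lat"] > -3) & (toy["sog"] <= 13.5),
]
choices = ["Atlantic", "SouthAmerica>13.5kn", "SouthAmerica<13.5kn",
           "Europe>13.5kn", "Europe<13.5kn"]

# Rows matching no condition (e.g. NaN speed) get the explicit default label
toy["passage_type"] = np.select(conditions, choices, default="unknown")
print(toy)
```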

---

## Combination of high frequency data and daily performance trends report data

The performance trends report contains daily data observed by the crew. Especially trim data from this report seems to be more reliable than trim data from the sensor in the high frequency data. Both data sets should be joined.

In [None]:
# read data from performance trend report
df_daily = pd.read_csv('../data/PerformanceTrendsReport.csv',header=[0,1])
df_daily.columns = df_daily.columns.map(lambda h: '  '.join(h).replace(' ', '_'))

The data has 45 columns and 1470 rows. The first entry is from 2017-02-18, so the time series is longer than that of the high frequency data. Not all of the features are important, and some are related or just given in different units. Thus, only some of them are joined to the high frequency data (see list below):

In [None]:
# keep only important features in the data frame of the daily data
df_daily = df_daily[['Report_Date__Date',
                            'Report_Type__Noon/Autolog/Perf_test',
                            'Speed_Observed__[kn]',
                            'ME_Fuel_Cons__[t/24_h]',
                            'ME_Power_(Propulsion)__[kW]',
                            'ME_RPM__[rpm]',
                            'Mean_Draft__[m]',
                            'Trim__[m]',
                            'Heading_Dir__[deg]',
                            'True_Wind_Speed__[m/s]',
                            'True_Wind_Dir__[deg]',
                            'Wave_Height_[m]__[m]',
                            'True_Wave_Dir__[deg]']]

In order to keep the data frame clear, the column names are cleaned and the columns are marked with the suffix ```_daily```.

In [None]:
# rename column names
df_daily = df_daily.rename(columns={'Report_Date__Date':'Date_daily',
                            'Report_Type__Noon/Autolog/Perf_test':'Type_daily',
                            'Speed_Observed__[kn]':'Speed_Obs_kn_daily',
                            'ME_Fuel_Cons__[t/24_h]':'ME_Fuel_Cons_tP24h_daily',
                            'ME_Power_(Propulsion)__[kW]':'ME_Power_Prop_kW_daily',
                            'ME_RPM__[rpm]':'ME_RPM_rpm_daily',
                            'Mean_Draft__[m]':'Mean_Draft_m_daily',
                            'Trim__[m]':'Trim_m_daily',
                            'Heading_Dir__[deg]':'Heading_Dir_deg_daily',
                            'True_Wind_Speed__[m/s]':'True_Wind_Speed_mPs_daily',
                            'True_Wind_Dir__[deg]':'True_Wind_Dir_deg_daily',
                            'Wave_Height_[m]__[m]':'Wave_Height_m_daily',
                            'True_Wave_Dir__[deg]':'True_Wave_Dir_deg_daily'})

In order to join both data frames the date columns of both data frames need to be converted to datetime.

In [None]:
# convert date columns to datetime
df['EntryDate'] = pd.to_datetime(df['EntryDate'])
df_daily['Date_daily'] = pd.to_datetime(df_daily['Date_daily'])

In [None]:
print(df['EntryDate'].min())
print(df['EntryDate'].max())

The daily data will be reduced to the time frame of the high frequency data which starts on 2021-05-31 and ends on 2021-10-31.

In [None]:
# reduce time frame of daily data
df_daily = df_daily[df_daily['Date_daily']>='2021-04-30']
df_daily = df_daily[df_daily['Date_daily']<='2021-11-01'].reset_index(drop=True)

In [None]:
print(df_daily['Date_daily'].min())
print(df_daily['Date_daily'].max())

Now the daily data starts on 2021-04-30 and ends on 2021-10-31. The next step is to add the daily data to the high frequency data as new features. Due to the different temporal resolutions, the same values of the daily data will be used for several consecutive timestamps of the high frequency data.

In [None]:
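# define function to add a new column from the daily data to the high frequency data
def add_daily_cols(name):
    """Copy the daily value of `name` to all high frequency rows of the preceding report interval."""
    df[name] = np.nan
    for index, row in df_daily.iterrows():
        if index == 0:
            # the first report only marks the start of the first interval
            end = row['Date_daily']
            continue
        start = end
        end = row['Date_daily']
        # use .loc instead of chained indexing, which may assign to a copy
        mask = (df['EntryDate']>=start) & (df['EntryDate']<end)
        df.loc[mask, name] = row[name]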

In [None]:
# get column names from daily data
col_names = df_daily.columns

In [None]:
# add column to high frequency data
for i in col_names:
    add_daily_cols(i)
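
The interval assignment above can also be expressed as a single as-of join. The following is a minimal sketch on invented toy data and not the notebook's method. It uses `pd.merge_asof` with `direction="forward"` and `allow_exact_matches=False`, so each timestamp receives the values of the next later report. Unlike the loop, it also fills rows that lie before the first report.

```python
import pandas as pd

# Toy high frequency timestamps (invented)
hf = pd.DataFrame({"EntryDate": pd.to_datetime([
    "2021-06-01 06:00", "2021-06-01 12:00",
    "2021-06-01 18:00", "2021-06-02 12:00",
])})

# Toy daily reports (invented values)
daily = pd.DataFrame({
    "Date_daily": pd.to_datetime([
        "2021-06-01 12:00", "2021-06-02 12:00", "2021-06-03 12:00",
    ]),
    "Trim_m_daily": [0.5, 0.7, 0.9],
})

# Both frames must be sorted by their key for merge_asof
merged = pd.merge_asof(
    hf.sort_values("EntryDate"),
    daily.sort_values("Date_daily"),
    left_on="EntryDate",
    right_on="Date_daily",
    direction="forward",         # take the next report after the timestamp
    allow_exact_matches=False,   # a report does not apply to its own timestamp
)
print(merged)
```

A single join avoids looping over every daily row for every column, which can matter with minute-resolution data.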

With the features from the daily data, the total data frame now contains 57 features plus the time stamps of the high frequency data and the daily data. The correlation matrix can now be checked again.

Only the wind features have NaN values, due to empty rows in the daily noon report.

---

## Creating a new feature to separate single passages

So far, we have only differentiated between passage types. In addition, it would be helpful to identify single trips within these passage types. Therefore a new feature ```trip_id``` is added. A new trip starts when the passage type changes.

In [None]:
# loop through the rows and create the trip id
trip_id = []
for index, row in df.iterrows():
    if index == 0:
        t_id = 1
        p_type = row['passage_type']
    else:
        if row['passage_type'] != p_type:
            t_id +=1
            p_type = row['passage_type']
    trip_id.append(t_id)

In [None]:
# join trip_id to dataframe
df = pd.concat([df,pd.DataFrame({'trip_id':trip_id})],axis=1)

In [None]:
df.value_counts('trip_id').head(10)
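
The same trip counter can be computed without a Python loop by comparing each row with its predecessor and accumulating the changes. This is a minimal sketch on an invented toy series. It assumes the rows are already in chronological order.

```python
import pandas as pd

# Toy sequence of passage types (invented)
passage = pd.Series(["Atlantic", "Atlantic", "Europe<13.5kn",
                     "Europe<13.5kn", "Atlantic"])

# True wherever the passage type differs from the previous row;
# the first row compares against a missing value and therefore starts trip 1
changed = passage != passage.shift()
trip_id = changed.cumsum()
print(trip_id.tolist())
```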

---

## Adding feature of Power prediction
Feature generated from a theoretical Engine Model (Matrix)


In [None]:
import pickle

# load model of theoretical required Engine Power to move the vessel. Used as reference for the power used by the vessel while sailing.
RandForestReg_EngineModel = '../models/RFReg_Engine_Model.sav'
# open the file in a context manager so that it is closed after loading
with open(RandForestReg_EngineModel, 'rb') as model_file:
    Engine_Model = pickle.load(model_file)

# generate input DataFrame for Engine Model. Input order: Draft [m], Trim [m], Speed [kn]
df_inputEM = df[['Mean_Draft_m_daily', 'Trim_m_daily', 'Speed_Obs_kn_daily']]
# Rename columns to column names used during fit of Engine Model
df_inputEM = df_inputEM.rename(columns = {'Mean_Draft_m_daily': 'Mean_Draft_[m]', 'Trim_m_daily': 'Trim_[m]', 'Speed_Obs_kn_daily': 'Speed_[kn]'})

# predict the required Power of the vessel with the help of the Engine Model
Value = Engine_Model.predict(df_inputEM)

# write a DataFrame for further use (column names as a list, index aligned with df)
Power_predict = pd.DataFrame(data = Value, columns = ['Power_EM_predict'], index = df.index)

# add Power prediction from Engine Model to DataFrame
df = pd.concat([df, Power_predict], axis = 1)
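
The daily input columns can contain gaps, for example for timestamps outside any report interval. If the pickled model cannot handle missing inputs (an assumption, the notebook does not say), one option is to predict only on complete rows and leave the rest as NaN. The sketch below uses a hypothetical `StandInModel` with an arbitrary formula in place of the real engine model.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical stand-in for the pickled engine model (not the real one)."""

    def predict(self, X):
        # Arbitrary formula, used only so that the example produces numbers
        return (1000 * X["Speed_[kn]"] + 100 * X["Mean_Draft_[m]"]).to_numpy()


# Toy data with one incomplete row (invented values)
toy = pd.DataFrame({
    "Mean_Draft_m_daily": [10.0, np.nan, 11.0],
    "Trim_m_daily": [0.5, 0.4, 0.3],
    "Speed_Obs_kn_daily": [15.0, 14.0, 16.0],
})

inputs = toy.rename(columns={
    "Mean_Draft_m_daily": "Mean_Draft_[m]",
    "Trim_m_daily": "Trim_[m]",
    "Speed_Obs_kn_daily": "Speed_[kn]",
})

# Predict only where all three inputs are present
complete = inputs.notna().all(axis=1)
toy["Power_EM_predict"] = np.nan
toy.loc[complete, "Power_EM_predict"] = StandInModel().predict(inputs[complete])
print(toy)
```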

---

## Draft Marks
1) Create a rolling average with window = 10.
2) Calculate trim, draft and heel from the smoothed values > DDM.TRIM.act.m / DDM.DRAFT.act.m / DDM.HEEL.act.m

In [None]:
df_temp = df.copy()

# Create Rolling Average
df_temp['DDM.FWDCL.10ava.m'] = df_temp['DDM.FWDCL.act.m'].rolling(window=10).mean()
df_temp['DDM.AFTCL.10ava.m'] = df_temp['DDM.AFTCL.act.m'].rolling(window=10).mean()

df_temp['DDM.MIDPS.10ava.m'] = df_temp['DDM.MIDPS.act.m'].rolling(window=10).mean()
df_temp['DDM.MIDSB.10ava.m'] = df_temp['DDM.MIDSB.act.m'].rolling(window=10).mean()

# Calculate TRIM, DRAFT, HEEL
df_temp['DDM.TRIM.act.m'] = df_temp['DDM.FWDCL.10ava.m'] - df_temp['DDM.AFTCL.10ava.m']
# mean draft is the average of the port and starboard midship marks
df_temp['DDM.DRAFT.act.m'] = (df_temp['DDM.MIDPS.10ava.m'] + df_temp['DDM.MIDSB.10ava.m'])/2
df_temp['DDM.HEEL.act.m'] = df_temp['DDM.MIDPS.10ava.m'] - df_temp['DDM.MIDSB.10ava.m']

# Backward-fill NaN values, e.g. the first 9 rows where the rolling window is incomplete
df_temp['DDM.TRIM.act.m'] = df_temp['DDM.TRIM.act.m'].bfill()
df_temp['DDM.DRAFT.act.m'] = df_temp['DDM.DRAFT.act.m'].bfill()
df_temp['DDM.HEEL.act.m'] = df_temp['DDM.HEEL.act.m'].bfill()

df_temp.drop(columns=['DDM.FWDCL.10ava.m', 'DDM.AFTCL.10ava.m', 'DDM.MIDPS.10ava.m', 'DDM.MIDSB.10ava.m'], inplace=True)

In [None]:
# Reassign to Dataframe df
df = df_temp.copy()
del df_temp
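
The effect of the smoothing and the derived quantities can be checked on a small example. The sketch below uses invented draft mark readings and shortened, hypothetical column names. It assumes that trim is forward minus aft (as in the cell above), that draft is the mean of the two midship marks, and that heel is their difference. With `min_periods=1` the rolling mean has no leading NaN values, so no backward fill is needed. This is an alternative to the fill step, not the notebook's approach.

```python
import pandas as pd

# Toy draft mark readings in metres (invented)
toy = pd.DataFrame({
    "fwd": [10.0, 10.2, 9.8, 10.0],
    "aft": [11.0, 11.2, 10.8, 11.0],
    "mid_ps": [10.6, 10.6, 10.6, 10.6],
    "mid_sb": [10.4, 10.4, 10.4, 10.4],
})

# Rolling mean over 3 rows; min_periods=1 averages whatever is available at the start
smooth = toy.rolling(window=3, min_periods=1).mean()

trim = smooth["fwd"] - smooth["aft"]                # negative: deeper at the stern
draft = (smooth["mid_ps"] + smooth["mid_sb"]) / 2   # mean of port and starboard
heel = smooth["mid_ps"] - smooth["mid_sb"]          # positive: deeper on the port side
print(pd.DataFrame({"trim": trim, "draft": draft, "heel": heel}))
```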

---

## Write .csv file for further use

In [None]:
df.columns.tolist()

In [None]:
df.to_csv('../data/Featureselection03.csv', index = False)

* 01: Features dropped
* 02: Daily data from Noon Report included as feature
* 03: added Power prediction feature and rolling average