# 04_2 Model_Stacking

This notebook covers the data preparation and the development of a stacking model.

Due to a non-disclosure agreement (NDA), no data can be displayed.

Data preparation, data cleaning, and preparation for modelling are the same for all algorithms. To go directly to modelling, click [here](#modelling).

---

## Data preparation

### Import libraries and read data

In [None]:
import pandas as pd 
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

import sys
sys.path.append("..")
import mlflow
from modeling.config import EXPERIMENT_NAME
TRACKING_URI = open("../.mlflow_uri").read().strip()


from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from sklearn.preprocessing import power_transform
from sklearn.preprocessing import PowerTransformer
from scipy import stats
#>>> print(power_transform(data, method='box-cox'))
import statsmodels.api as sm

In [None]:
# read data
df = pd.read_csv('../data/Featureselection03.csv')
df.head()

### Create data frame with important features

To keep everyone aligned on the feature selection, we created a separate CSV file that rates the importance of each feature. Only features rated as important are used for training our models and for further analysis.

In this case, 17 features are used besides the target.

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')
data_log.head()

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['ModelImportance'] < 3]['VarName'])
len(list_imp_feat)

In [None]:
df_model = df[list_imp_feat].copy()

In [None]:
df_model.info()

### Fill and drop NaN

The columns V.SLPOG.act.PRC and ME.SFCI.act.gPkWh contain missing values. The EDA showed that these occur mainly during harbour times, when the main engine was not running. It therefore makes sense to fill the missing values with 0.

In [None]:
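# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df_model
df_model['V.SLPOG.act.PRC'] = df_model['V.SLPOG.act.PRC'].fillna(0)
df_model['ME.SFCI.act.gPkWh'] = df_model['ME.SFCI.act.gPkWh'].fillna(0)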

In [None]:
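# change in speed over ground from the current row to the next row
df_model['A.SOG.next.kn'] = df_model['V.SOG.act.kn'].shift(-1) - df_model['V.SOG.act.kn']
# rows without a valid difference (NaN) are filled with the current speed value
df_model['A.SOG.next.kn'] = df_model['A.SOG.next.kn'].fillna(df_model['V.SOG.act.kn'])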
df_model['A.SOG.next.kn'].describe()
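
As a minimal sketch of what the `shift(-1)` difference above computes, the following cell uses a hypothetical five-point speed series (the values are invented for illustration and are not project data). It shows that the last row has no successor and therefore yields NaN before any filling is applied.

```python
import pandas as pd

# Hypothetical speed-over-ground readings in knots (not project data).
sog = pd.Series([10.0, 12.0, 11.0, 11.0, 14.0], name="V.SOG.act.kn")

# Difference between the next reading and the current one.
accel = sog.shift(-1) - sog

# The last row has no successor, so its difference is NaN.
print(accel.tolist())
```

In the notebook, that NaN is subsequently filled with the current speed value.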

The remaining rows with missing values are dropped.

In [None]:
df_model.dropna(inplace=True)

In [None]:
df_model.info()

In [None]:
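plt.figure(figsize = (30,28))
# numeric_only=True skips the non-numeric passage_type column (required by recent pandas versions)
sns.heatmap(df_model.corr(numeric_only = True), annot = True, cmap = 'RdYlGn')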

### Define target

In [None]:
X = df_model.drop(['ME.FMS.act.tPh'], axis = 1)
y = df_model['ME.FMS.act.tPh']

In [None]:
# replace '<' and '>' in column names; keys that are not column names of X are ignored by rename
X.rename(columns={
    'passage_type_Europe<13.5kn': 'passage_type_Europe_smaller_13.5kn',
    'passage_type_Europe>13.5kn': 'passage_type_Europe_greater_13.5kn',
    'passage_type_SouthAmerica<13.5kn': 'passage_type_SouthAmerica_smaller_13.5kn',
    'passage_type_SouthAmerica>13.5kn': 'passage_type_SouthAmerica_greater_13.5kn'}, inplace=True)

### Train Test Split

Because of the large amount of data, a split into 10% test data and 90% train data is chosen. The random state is set to 42 to get comparable results for different models. To account for the imbalance in the distribution of passage types, the stratify parameter is set to this feature, so that each subset contains approximately the same share of each passage type.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = X['passage_type'], test_size = 0.1, random_state = 42)
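
The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical data frame with an 80/20 class imbalance (the category names `A` and `B` and all values are invented for illustration) and compares the class shares in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced categorical feature (not project data).
demo = pd.DataFrame({
    "passage_type": ["A"] * 80 + ["B"] * 20,
    "value": range(100),
})

train, test = train_test_split(
    demo, stratify=demo["passage_type"], test_size=0.1, random_state=42
)

# Class shares in each subset should mirror the 80/20 split of the full data.
train_share = train["passage_type"].value_counts(normalize=True)
test_share = test["passage_type"].value_counts(normalize=True)
print(train_share.to_dict(), test_share.to_dict())
```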

### Create dummy values for passage type

As passage_type is the only column of object type, get_dummies only creates dummies for passage_type.

In [None]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
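
Because the dummies are created separately for each subset, both subsets only end up with identical columns if every category occurs in both; the stratified split makes this likely but does not enforce it. The following sketch, on hypothetical data in which the test subset lacks one category, shows one possible safeguard: reindexing the test columns to the training layout. This is an illustration, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical subsets: the test subset lacks category "C" (not project data).
train = pd.DataFrame({"speed": [1.0, 2.0, 3.0], "passage_type": ["A", "B", "C"]})
test = pd.DataFrame({"speed": [4.0, 5.0], "passage_type": ["A", "B"]})

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)

# Reindex the test columns to the training layout; missing dummies become 0.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(test_aligned.columns))
```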

### Set MLFlow connection

MLFlow is used to track and compare different models and model settings.

In [None]:
runmlflow = False

# setting the MLFlow connection and experiment
if runmlflow:
    mlflow.set_tracking_uri(TRACKING_URI)
    mlflow.set_experiment(EXPERIMENT_NAME)
    mlflow.start_run(run_name='Stacking (Poly, RF_Hyper)') # CHANGE!
    run = mlflow.active_run()

---

## Modelling <a id='modelling'></a>

In [None]:
RSEED = 42

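A MinMaxScaler is applied for all models in this project. This stacking model combines two base estimators, a random forest and a polynomial regression of degree 2, with a second random forest as the final estimator. The hyperparameters of the base random forest were selected by grid search and offer a reasonable balance between predictive performance and overfitting. Each base estimator is wrapped in a pipeline.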

### Pipeline

In [None]:
estimators = [
    # random forest with hyperparameters from grid search
    ('rfh', make_pipeline(MinMaxScaler(),
                          RandomForestRegressor(criterion='squared_error',
                                                max_depth=40,
                                                max_features=1.0,  # same behaviour as the removed 'auto' option for regressors
                                                max_leaf_nodes=7000,
                                                min_samples_split=20,
                                                n_estimators=100,
                                                random_state=RSEED))),
    # ('xgb', make_pipeline(MinMaxScaler(), XGBRegressor(seed=RSEED))),
    # polynomial regression of degree 2
    ('plr', make_pipeline(PolynomialFeatures(degree=2), MinMaxScaler(), LinearRegression())),
]
reg = StackingRegressor(estimators=estimators, final_estimator=RandomForestRegressor(random_state=RSEED))
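
In scikit-learn, a `StackingRegressor` trains its final estimator on the cross-validated predictions of the base estimators, so the final estimator receives one input column per base estimator. The sketch below illustrates this on synthetic data with a reduced setup; the data, the estimator names, and the small `n_estimators` value are assumptions chosen for a quick run and do not reproduce the project configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic regression data (not project data).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 2 * X_demo[:, 1] + rng.normal(0, 0.01, size=200)

base = [
    ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
    ("plr", make_pipeline(PolynomialFeatures(degree=2), MinMaxScaler(), LinearRegression())),
]
stack = StackingRegressor(
    estimators=base,
    final_estimator=RandomForestRegressor(n_estimators=20, random_state=42),
)
stack.fit(X_demo, y_demo)

# transform() returns the base-estimator predictions that feed the final estimator:
# one column per base estimator.
meta_features = stack.transform(X_demo)
print(meta_features.shape)
```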


### Fit and predict

In [None]:
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
y_pred_train = reg.predict(X_train)

---

## Analysis

### Errors and residuals

The root mean squared error (RMSE) is used to evaluate the model. 

In [None]:
# copy the predictions so the raw model output stays untouched
y_pred2 = y_pred.copy()
y_pred2_train = y_pred_train.copy()

# predictions below the threshold are set to 0
y_pred2[y_pred2 < 0.013509] = 0
y_pred2_train[y_pred2_train < 0.013509] = 0

# RMSE as the square root of the MSE (the `squared` argument was removed in recent scikit-learn versions)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred2_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred2))
print('RMSE train: ', rmse_train)
print('RMSE test: ', rmse_test)
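
For reference, the RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it with NumPy on hypothetical values (invented for illustration) and applies the same zero threshold as the cell above; the helper name `rmse` is an assumption and not part of the project code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted values (not project data).
y_true = np.array([0.0, 1.0, 2.0])
y_hat = np.array([0.01, 1.0, 4.0])

# Set predictions below the threshold to 0, as in the cell above.
y_hat_clipped = np.where(y_hat < 0.013509, 0.0, y_hat)

print(rmse(y_true, y_hat), rmse(y_true, y_hat_clipped))
```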

Plotting actual against predicted values shows that the points lie close to the optimal diagonal. However, this plot and the residual plot show some difficulties the model has when predicting low target values.

In [None]:
fig=plt.figure(figsize=(6, 6))
plt.axline([1, 1], [2, 2],color='lightgrey')
plt.scatter(y_train, y_pred2_train, color ='#33424F')
plt.scatter(y_test, y_pred2, color = '#FF6600')
#plt.xticks(np.arange(0,501,100));
#plt.yticks(np.arange(0,501,100));
plt.xlabel("ME.FMS.act.tPh actual");
plt.ylabel("ME.FMS.act.tPh predicted");
#plt.xlim(0, 450);
#plt.ylim(0, 450);

In [None]:
residuals_train = y_pred2_train - y_train
residuals_test = y_pred2 - y_test

In [None]:
# label each scatter directly so the legend entries match the plotted data
sns.scatterplot(x = y_pred2_train, y = residuals_train, label = 'train')
sns.scatterplot(x = y_pred2, y = residuals_test, label = 'test')
plt.axhline(y = 0, color = 'black')
plt.xlabel("ME.FMS.act.tPh predicted");
plt.ylabel("Residuals");
plt.legend()

---

## Write to MLFlow

In [None]:
# setting parameters that should be logged on MLFlow
# these parameters were used in feature engineering (imputing missing values)
# or are parameters of the model
params = {
      "features drop": 'EntryDate,Date_daily, Type_daily, TI.LOC.act.ts, WEA.WDR.act.deg, WEA.WSR.act.mPs, WEA.WDTV.act.deg, trip_id, LS.GME.act.nodim, V.WD.act.m',
      "explanation": 'correlated features with <0.95 were dropped',
      "csv used": 'Featureselection03.csv',
      "NaN handling": 'V.SLPOG.act.PRC and ME.SFCI.act.gPkWh filled with 0, rest dropped by row',
      'Shape' : df.shape,
      'Scaler' : 'MinMaxScaler'
  }

In [None]:
if runmlflow:
    #logging params to mlflow
    mlflow.log_params(params)
    #setting tags
    mlflow.set_tag("running_from_jupyter", "True")
    #logging metrics
    mlflow.log_metric("train-" + "RMSE", rmse_train)
    mlflow.log_metric("test-" + "RMSE", rmse_test)
    # logging the model to mlflow will not work without an AWS connection set up (too complex for now),
    # but it is possible when running mlflow locally
    # mlflow.log_artifact("../models")
    # mlflow.sklearn.log_model(reg, "model")
    mlflow.end_run()