# 04_1 Random Forest

This notebook covers the data preparation and the development of a random forest model.

Due to a non-disclosure agreement (NDA), no data can be displayed.

---

## Data preparation

### Import libraries and read data

In [None]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.inspection import permutation_importance

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from yellowbrick.regressor import ResidualsPlot

import sys
sys.path.append("..")
import mlflow
from modeling.config import EXPERIMENT_NAME
# read the tracking URI and close the file afterwards
with open("../.mlflow_uri") as uri_file:
    TRACKING_URI = uri_file.read().strip()

In [None]:
# read data
df = pd.read_csv('../data/Featureselection03.csv')

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')

### Create data frame with important features

Only important features are used to train the model. In this case, 17 features are used besides the target.

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['ModelImportance'] < 3]['VarName'])
len(list_imp_feat)
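Since the feature log cannot be shown, the following toy example sketches how the filter works. The table contents and variable names are hypothetical; it only assumes, as the `< 3` filter suggests, that lower `ModelImportance` values mark the more important variables.

```python
import pandas as pd

# Hypothetical feature log; the real file is confidential.
feature_log = pd.DataFrame({
    "VarName": ["target", "speed", "draft", "wind", "comment"],
    "ModelImportance": [0, 1, 2, 3, 4],
})

# Keep every variable whose importance class is below 3.
selected = feature_log.loc[feature_log["ModelImportance"] < 3, "VarName"].tolist()
print(selected)
```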

In [None]:
# create a dataframe consisting of target and 17 features
df_model = df[list_imp_feat].copy()

### Fill and drop NaN

The columns V.SLPOG.act.PRC and ME.SFCI.act.gPkWh contain missing values. The EDA showed that these mainly occur during time in harbour, when the main engine was not running. It therefore makes sense to fill them with 0.

In [None]:
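# assign the result back: fillna(inplace=True) on a selected column
# may not update df_model in newer pandas versions
df_model['V.SLPOG.act.PRC'] = df_model['V.SLPOG.act.PRC'].fillna(0)
df_model['ME.SFCI.act.gPkWh'] = df_model['ME.SFCI.act.gPkWh'].fillna(0)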

The remaining rows with missing values are dropped.

In [None]:
df_model.dropna(inplace=True)
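The two steps above (fill selected columns, then drop the remaining incomplete rows) can be illustrated on a small toy frame. The column names and values are made up for illustration and are not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: gaps in two engine-related columns and in one other column.
toy = pd.DataFrame({
    "slip": [1.5, np.nan, 2.0, np.nan],
    "sfci": [180.0, np.nan, 175.0, 170.0],
    "speed": [12.0, 0.0, np.nan, 11.0],
})

# Fill only the two engine-related columns with 0 ...
filled = toy.fillna({"slip": 0, "sfci": 0})
# ... then drop rows that still have a gap in any other column.
cleaned = filled.dropna()

print(cleaned)
```

Passing a dictionary to `fillna` restricts the fill to the named columns and returns a new frame, so the original data stays untouched.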

### Define target

This project focuses on optimising fuel consumption. Therefore the fuel supply mass rate (ME.FMS.act.tPh) is used as the target. Target values of 8 t/h or more are treated as outliers and removed.

In [None]:
# remove outliers
df_model = df_model[df_model['ME.FMS.act.tPh']<8]

In [None]:
# separate features (X) from target (y)
X = df_model.drop(['ME.FMS.act.tPh'], axis = 1)
y = df_model['ME.FMS.act.tPh']

### Train Test Split

Because of the large amount of data, a split into 10% test data and 90% training data is chosen. The random state is set to 42 so that results are comparable across different models. To account for the imbalanced distribution of passage types, the stratify parameter is set to this feature. This yields approximately the same share of each passage type in both subsets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = X['passage_type'], test_size = 0.1, random_state = 42)
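The effect of `stratify` can be checked by comparing the class shares in both subsets. The sketch below uses a hypothetical, imbalanced toy column instead of the real passage types.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced passage types (80 % / 20 %).
toy = pd.DataFrame({
    "passage_type": ["sea"] * 800 + ["harbour"] * 200,
    "value": range(1000),
})

train, test = train_test_split(
    toy, test_size=0.1, stratify=toy["passage_type"], random_state=42
)

# Relative frequency of each passage type per subset.
train_share = train["passage_type"].value_counts(normalize=True)
test_share = test["passage_type"].value_counts(normalize=True)
print(pd.DataFrame({"train": train_share, "test": test_share}))
```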

### Create dummy values for passage type

Columns of type object need to be converted to dummy variables. For this model, this concerns the passage type.

In [None]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
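Encoding the two subsets separately only yields identical columns if every passage type occurs in both of them. The stratified split makes this likely here, but a more defensive variant is to encode before splitting. The following is a sketch on hypothetical toy data, not the approach used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with one rare passage type.
toy = pd.DataFrame({
    "speed": range(40),
    "passage_type": ["sea"] * 30 + ["coast"] * 8 + ["harbour"] * 2,
})

# Encode first, so both subsets share the same dummy columns ...
encoded = pd.get_dummies(toy, drop_first=True)

# ... and stratify on the original, still categorical column.
enc_train, enc_test = train_test_split(
    encoded, test_size=0.5, stratify=toy["passage_type"], random_state=42
)

print(list(enc_train.columns) == list(enc_test.columns))
```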

### Set MLflow connection

MLflow is used to track and compare different models and model settings.

In [None]:
# setting the MLflow connection and experiment
mlflow.set_tracking_uri(TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.start_run(run_name='RandomForrest') 
run = mlflow.active_run()

---

## Modelling

A MinMaxScaler is applied for all models in this project. This model is a random forest. The hyperparameters were selected by grid search and offer a reasonable balance between predictive accuracy and overfitting. These settings are used in a pipeline.

### Pipeline

In [None]:
forest = make_pipeline(MinMaxScaler(), 
                        RandomForestRegressor(criterion= 'squared_error',
                                            max_depth= 40, 
                                            max_features= 1.0,  # all features; same as the removed 'auto' option for regressors
                                            max_leaf_nodes= 7000, 
                                            min_samples_split= 20,
                                            n_estimators= 100, 
                                            random_state= 42))
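The grid search itself is not part of this notebook. As a rough sketch of how such a search could look for a pipeline like the one above, the example below uses synthetic data and a deliberately tiny, hypothetical grid. Neither corresponds to the grid or the data used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the confidential data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

pipe = make_pipeline(
    MinMaxScaler(), RandomForestRegressor(n_estimators=20, random_state=42)
)

# make_pipeline names each step after its lowercased class name,
# hence the "randomforestregressor__" prefix.
param_grid = {
    "randomforestregressor__max_depth": [3, 10],
    "randomforestregressor__min_samples_split": [2, 20],
}

search = GridSearchCV(
    pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Comparing the cross-validated score with the training score of the best candidate is one way to judge the balance between accuracy and overfitting.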

### Fit and predict

In [None]:
# fit the model using train data
forest.fit(X_train, y_train)

In [None]:
# make predictions based on test and train data
y_pred_test = forest.predict(X_test)
y_pred_train = forest.predict(X_train)

---

## Analysis

### Errors and residuals

The root mean squared error (RMSE) is used to evaluate the model. 

In [None]:
# np.sqrt of the MSE avoids the `squared` argument,
# which newer scikit-learn versions no longer accept
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
print('RMSE train: ', rmse_train)
print('RMSE test: ', rmse_test)
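For reference, the RMSE is the square root of the mean squared difference between actual and predicted values, so it has the same unit as the target (here t/h). A minimal check with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values: three perfect predictions and one that is off by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 6.0])

# RMSE = sqrt(mean((y_true - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```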

Plotting the actual values against the predicted ones shows that the points lie close to the ideal diagonal. However, both this plot and the Yellowbrick residuals plot reveal that the model has some difficulty predicting low target values.

In [None]:
fig=plt.figure(figsize=(6, 6))
plt.axline([1, 1], [2, 2],color='lightgrey')
plt.scatter(y_train, y_pred_train, color ='#33424F')
plt.scatter(y_test, y_pred_test, color = '#FF6600')
plt.xticks(np.arange(0,7,1));
plt.yticks(np.arange(0,7,1));
plt.xlabel("ME.FMS.act.tPh actual");
plt.ylabel("ME.FMS.act.tPh predicted");

In [None]:
visualizer = ResidualsPlot(forest)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show()  

### False prediction of target = 0

In some cases the actual target value is zero, but the prediction is above zero. These cases are identified and investigated further.

In [None]:
# combine actual and predicted target in one dataframe
df_result_train = X_train.copy()
df_result_train['predicted'] = y_pred_train
df_result_train['actual'] = y_train

# identify cases where the actual target = 0 and the prediction > 0.1
df_target_zero_false = df_result_train[df_result_train['actual']==0]
df_target_zero_false = df_target_zero_false[df_target_zero_false['predicted']>0.1]

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df_target_zero_false['V.SOG.act.kn'], 
    y=df_target_zero_false['predicted'],
    mode='markers', marker=dict(color='#ff6600',size=5)))
fig.show()

Most of these predictions occur at low speeds. One explanation might be that the ship was being pulled by a tugboat. The model could therefore be improved further with information about tugboat operations.

### High residuals

The following investigation focuses on residuals above 0.5 or below -0.5.

In [None]:
# calculate residuals
df_result_train['residual'] = df_result_train['predicted'] - df_result_train['actual']
# define high positive residuals
df_residual_high_pos = df_result_train[df_result_train['residual']>0.5]
# define high negative residuals
df_residual_high_neg = df_result_train[df_result_train['residual']<-0.5]
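With the sign convention used above (predicted minus actual), a positive residual means the model overestimates the fuel rate and a negative one means it underestimates it. A toy example with made-up values:

```python
import pandas as pd

# Made-up actual and predicted values.
results = pd.DataFrame({
    "actual":    [2.0, 3.0, 0.0, 4.0],
    "predicted": [2.1, 2.2, 0.7, 4.0],
})
results["residual"] = results["predicted"] - results["actual"]

# residual > 0: the model overestimates; residual < 0: it underestimates.
overestimated = results[results["residual"] > 0.5]
underestimated = results[results["residual"] < -0.5]

print(overestimated)
print(underestimated)
```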

Comparisons with different features showed a clear pattern for longitude and latitude. There are almost no data points with high residuals on the Atlantic passages. This means the model predicts very well under the rather stable conditions on the Atlantic, but would need additional data to better capture the routes to and from the ports.

In [None]:
fig=plt.figure(figsize=(12, 4), dpi=80)
sns.scatterplot(data = df_residual_high_neg, 
                x = 'V.GPSLON.act.deg', 
                y = 'predicted',
                linewidth=0
                )
sns.scatterplot(data = df_residual_high_pos, 
                x = 'V.GPSLON.act.deg', 
                y = 'predicted',
                linewidth=0
                );

In [None]:
fig=plt.figure(figsize=(12, 4), dpi=80)
sns.scatterplot(data = df_residual_high_neg, 
                x = 'V.GPSLAT.act.deg', 
                y = 'predicted',
                linewidth=0
                )
sns.scatterplot(data = df_residual_high_pos, 
                x = 'V.GPSLAT.act.deg', 
                y = 'predicted',
                linewidth=0
                );

### Important features

There are different ways to identify important features. One is the impurity-based `feature_importances_` attribute of scikit-learn's RandomForestRegressor. An alternative is scikit-learn's permutation importance. Both identify the same top 6 features.

In [None]:
df_importance = pd.DataFrame({'features' : X_train.columns, 'importance' : forest['randomforestregressor'].feature_importances_})
df_importance.sort_values('importance',ascending=False)

In [None]:
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(
    result.importances[sorted_idx].T, vert=False, labels=X_test.columns[sorted_idx]
)
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()
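Whether both methods agree on the top features can also be checked programmatically. The sketch below does this on synthetic data with a hypothetical `k = 3`. It illustrates the comparison only and does not reproduce the result reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in: 8 features, 3 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, random_state=42
)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

k = 3
# Indices of the k features with the highest impurity-based importance.
top_impurity = {int(i) for i in np.argsort(model.feature_importances_)[-k:]}

# Indices of the k features with the highest mean permutation importance.
perm = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=42)
top_permutation = {int(i) for i in np.argsort(perm.importances_mean)[-k:]}

print("shared top features:", sorted(top_impurity & top_permutation))
```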

---

## Write to MLflow

In [None]:
# setting the parameters that should be logged to MLflow
# these are either choices made in feature engineering (e.g. imputing missing values)
# or parameters of the model (e.g. max_depth of the random forest)
params = {
      "features drop": 'According to model importance list',
      "criterion": 'squared_error',
      'max_features': 1.0,  # all features; same as the removed 'auto' option for regressors
      "random_state": 42,
      "max_depth": 40,
      'max_leaf_nodes': 7000,
      'min_samples_split': 20,
      'n_estimators': 100,
      "csv used": 'Featureselection03.csv',
      "NaN handling": 'V.SLPOG.act.PRC and ME.SFCI.act.gPkWh filled with 0, rest dropped by row',
      'Shape' : df.shape,
      'Scaler' : 'MinMaxScaler'
  }

In [None]:
# logging params to mlflow
mlflow.log_params(params)
# setting tags
mlflow.set_tag("running_from_jupyter", "True")
# logging metrics
mlflow.log_metric("train-" + "RMSE", rmse_train)
mlflow.log_metric("test-" + "RMSE", rmse_test)

mlflow.end_run()