# 04_8 SVM Model
Due to a non-disclosure agreement (NDA), no data can be displayed.

In the following, the preparation of the data and the use of the SVM model are described and shown.  
To this end, the DataFrame from "Featureengineering" is loaded, together with the feature importance list for this particular model.

Data preparation, data cleaning, and preparation for modelling are the same for all algorithms. To go directly to modelling, click [here](#modelling).

---

## Data preparation

### Import libraries and read data

In [None]:
import pandas as pd 
import numpy as np

from sklearn import svm

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

import sys
sys.path.append("..")
import mlflow
from modeling.config import EXPERIMENT_NAME
# read the tracking URI and close the file afterwards
with open("../.mlflow_uri") as uri_file:
    TRACKING_URI = uri_file.read().strip()

In [None]:
# read data
df = pd.read_csv('../data/Featureselection03.csv')

### Create data frame with important features

To keep everyone aligned on the feature selection, we created a separate CSV file that rates the importance of each feature. Only features rated as important are used for training our models and for further analysis.

For this model, 17 features are used besides the target.

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['ModelImportance'] < 3]['VarName'])
len(list_imp_feat)

In [None]:
df_model = df[list_imp_feat].copy()
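The selection step above can be illustrated on toy data. This is a minimal sketch: the column names `VarName` and `ModelImportance` come from the notebook, while the feature names, the values, and the check for missing columns are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical importance list with the two columns used in the notebook
importance = pd.DataFrame({
    "VarName": ["speed", "draft", "wind", "target"],
    "ModelImportance": [1, 2, 3, 1],
})

# Hypothetical feature table
data = pd.DataFrame({
    "speed": [10.0, 12.0],
    "draft": [8.0, 8.5],
    "wind": [3.0, 5.0],
    "target": [2.1, 2.6],
})

# Keep only variables with an importance rating below 3
selected = importance.loc[importance["ModelImportance"] < 3, "VarName"].tolist()

# Fail early if the importance list names a column the data does not have
missing = [name for name in selected if name not in data.columns]
if missing:
    raise KeyError(f"Columns missing from the data: {missing}")

subset = data[selected].copy()
print(subset.columns.tolist())
```

Checking for missing names before indexing gives a clearer error message when the importance list and the feature file drift apart.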

### Fill and drop NaN

The columns V.SLPOG.act.PRC and ME.SFCI.act.gPkWh contain missing values. The EDA showed that these occur mainly during harbour times, when the main engine was not running. Therefore it makes sense to fill the missing values with 0.

In [None]:
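# assign the result back; inplace=True on a single column is unreliable in recent pandas versions
df_model['V.SLPOG.act.PRC'] = df_model['V.SLPOG.act.PRC'].fillna(0)
df_model['ME.SFCI.act.gPkWh'] = df_model['ME.SFCI.act.gPkWh'].fillna(0)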

The remaining rows with missing values are dropped.

In [None]:
df_model.dropna(inplace=True)
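The two-step NaN handling (fill selected columns with 0, then drop the remaining incomplete rows) can be sketched on a small example. The column names and values below are hypothetical and only stand in for the real sensor signals.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one engine signal that is NaN while the engine is off,
# and one other signal with an unrelated gap
toy = pd.DataFrame({
    "engine_signal": [1.5, np.nan, 2.0, np.nan],
    "other_signal": [0.3, 0.4, np.nan, 0.6],
})

# Step 1: fill the engine signal with 0 and assign the result back
toy["engine_signal"] = toy["engine_signal"].fillna(0)

# Step 2: drop the rows that still contain missing values
toy_clean = toy.dropna()

print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Only the row with the gap in `other_signal` is removed; the rows where the engine signal was filled are kept.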

### Correlation Matrix

In [None]:
plt.figure(figsize = (30,28))
# numeric_only=True skips object columns such as passage_type
sns.heatmap(df_model.corr(numeric_only=True), annot = True, cmap = 'RdYlGn')

The correlation matrix is checked again to ensure that there are no high correlations between features. The feature correlation was already checked in "Featureengineering".

### Define target

For this project the focus is on optimising the fuel consumption. Therefore the supply mass rate is used as the target. Target values greater than 8 t/h are defined as outliers.

In [None]:
X = df_model.drop(['ME.FMS.act.tPh'], axis = 1)
y = df_model['ME.FMS.act.tPh']

The supply mass fuel rate as target is separated from the feature dataset.

### Test train split

Due to the large amount of data, a split into 10% test data and 90% train data is chosen. The random state is set to 42 to get comparable results for different models. To account for the imbalance in the distribution of passage types, the stratify parameter is used for this feature. This results in approximately the same percentage of the different passage types in each subset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = X['passage_type'], test_size = 0.1, random_state = 42)

The test-train split uses random_state = 42 to obtain comparable datasets for the different models. Due to the size of the dataset, the test size is set rather small at 10%.
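The effect of the stratify parameter can be checked by comparing the class shares in both subsets. The sketch below uses a made-up, imbalanced `passage_type` column; the class counts and the second column are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced passage types (60 / 30 / 10)
features = pd.DataFrame({
    "passage_type": ["Atlantic"] * 60 + ["Europe"] * 30 + ["South America"] * 10,
    "speed": range(100),
})
target = pd.Series(range(100), name="fuel_rate")

X_tr, X_te, y_tr, y_te = train_test_split(
    features,
    target,
    stratify=features["passage_type"],
    test_size=0.1,
    random_state=42,
)

# Share of each passage type per subset; both should be close to 0.6 / 0.3 / 0.1
train_share = X_tr["passage_type"].value_counts(normalize=True)
test_share = X_te["passage_type"].value_counts(normalize=True)
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Without `stratify`, a small test set could by chance contain very few or no rows of the rarest passage type.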

### Create dummy values for passage type

Object types need to be transformed to dummy values. For this model this concerns the passage types.

In [None]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

For the passage type (Atlantic, Europe, South America), dummy features are created so that the data can be fed to the model as numerical values only.
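Encoding train and test data separately relies on both subsets containing the same passage types. With a stratified split this usually holds, but if a category were missing from one subset, `drop_first=True` would drop a different baseline and the columns would no longer match. One possible safeguard, sketched here with hypothetical data and a hypothetical helper `encode`, is to fix the category list before encoding.

```python
import pandas as pd

# Hypothetical subsets; the test set happens to contain no "Atlantic" rows
train = pd.DataFrame({
    "speed": [10.0, 11.0, 12.0],
    "passage_type": ["Atlantic", "Europe", "South America"],
})
test = pd.DataFrame({
    "speed": [13.0, 14.0],
    "passage_type": ["Europe", "South America"],
})

# Category list taken from the training data only
categories = sorted(train["passage_type"].unique())


def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Dummy-encode passage_type using the fixed category list."""
    frame = frame.copy()
    frame["passage_type"] = pd.Categorical(frame["passage_type"], categories=categories)
    # With a categorical dtype, get_dummies creates a column for every
    # category, so drop_first always drops the same baseline ("Atlantic")
    return pd.get_dummies(frame, drop_first=True)


train_enc = encode(train)
test_enc = encode(test)
print(test_enc.columns.tolist())
```

Both frames end up with identical columns, so the fitted pipeline sees the same feature layout at prediction time.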

### Set MLFlow connection

MLFlow is used to track and compare different models and model settings.

In [None]:
# setting the MLFlow connection and experiment
#mlflow.set_tracking_uri(TRACKING_URI)
#mlflow.set_experiment(EXPERIMENT_NAME)
#mlflow.start_run(run_name='SVM')
#run = mlflow.active_run()

The upload to MLFlow is disabled (commented out) so that no model results are uploaded, and because the MLFlow setup is most probably not available on the machine running this notebook.

---

## Modelling

For all models in this project a MinMaxScaler is applied. For this model a support vector regressor (SVR) is used. The hyperparameters are selected based on a grid search and offer a reasonable balance between optimal results and overfitting. These settings are used in a pipeline.

### Pipeline with SVM Regressor and parameter definition

In [None]:
clf = make_pipeline(MinMaxScaler(), svm.SVR(kernel='poly',
                                            gamma='scale',
                                            C=1.0,
                                            epsilon=0.1,
                                            ))

As a model the SVM Regressor (SVR) is used, together with a MinMaxScaler.


The pipeline is set up with the following parameters:  
* kernel: "poly"
* gamma: "scale"
* C: 1.0
* epsilon: 0.1  
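The grid search itself is not part of this notebook. As a rough sketch of how such a search over the pipeline could look, the example below uses synthetic data and an assumed parameter grid; neither the grid values nor the number of folds are taken from the project.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data as a stand-in for the real features
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(120, 3))
y_demo = 0.5 * X_demo[:, 0] + 0.1 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=120)

pipe = make_pipeline(MinMaxScaler(), svm.SVR(gamma="scale"))

# make_pipeline names each step after its lowercased class name,
# so SVR parameters are addressed with the "svr__" prefix
param_grid = {
    "svr__kernel": ["poly", "rbf"],
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

Because the scaler is part of the pipeline, it is refit on the training folds of each split, which avoids leaking information from the validation fold into the scaling.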



### Fit and predict

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

### Predict y on train data

In [None]:
y_pred_train = clf.predict(X_train)

To get residuals for the error analysis of regression models, the train datapoints have to be predicted as well.
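A residual is the difference between the actual and the predicted value. The sketch below computes residuals for made-up numbers and compares the mean absolute residual for low and high target values; the values and the threshold of 1.0 are assumptions for illustration, not results from the model.

```python
import numpy as np

# Hypothetical actual and predicted target values
y_true = np.array([0.2, 0.5, 3.0, 4.5, 6.0])
y_hat = np.array([0.9, 1.0, 3.1, 4.4, 6.2])

# Residual = actual - predicted
residuals = y_true - y_hat

# Compare the error size for low and high targets (hypothetical threshold)
low = y_true < 1.0
mae_low = np.abs(residuals[low]).mean()
mae_high = np.abs(residuals[~low]).mean()

print("Mean absolute residual, low targets: ", mae_low)
print("Mean absolute residual, high targets:", mae_high)
```

Applying the same calculation to the train and test predictions separately shows whether errors concentrate in a particular target range.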

---

## Analysis

### Errors and residuals

The root mean squared error (RMSE) is used to evaluate the model. 

In [None]:
# squared=False is deprecated in recent scikit-learn versions, so the root is taken explicitly
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE train: ', rmse_train)
print('RMSE test: ', rmse_test)
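For reference, the RMSE is the square root of the mean of the squared residuals. The following sketch computes it by hand on made-up values and compares the result with scikit-learn's `mean_squared_error`; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([2.0, 3.5, 5.0, 0.5])
y_hat = np.array([2.2, 3.4, 4.6, 1.1])

# RMSE by hand: square the residuals, average them, take the root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same value via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Since the RMSE has the same unit as the target, it can be read directly as a typical error in t/h.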

Plotting actual values against predicted values shows that the points are close to the optimal diagonal. However, this plot and the yellowbrick residual plot show some difficulties the model has when predicting low target values.

In [None]:
fig=plt.figure(figsize=(6, 6))
plt.axline([1, 1], [2, 2],color='lightgrey')
plt.scatter(y_train, y_pred_train, color ='#33424F')
plt.scatter(y_test, y_pred, color = '#FF6600')
plt.xlabel("ME.FMS.act.tPh actual");
plt.ylabel("ME.FMS.act.tPh predicted");

---

### Define parameters for MLFlow

In [None]:
# setting parameters that should be logged on MLFlow
params = {
       'csv used': 'Featureselection03.csv',
       'features drop' : 'According to model importance list',
       'NaN handling': 'dropped',
       'Shape' : df.shape,
       'Scaler' : 'MinMaxScaler',
       'kernel' : 'poly',
       'gamma' : 'scale',
       'C' : 1.0,
       'epsilon' : 0.1
  }

All parameters that shall be uploaded to MLFlow are collected here. The set of parameters can differ between models.

### Writing to MLFlow

In [None]:
#logging params to mlflow
#mlflow.log_params(params)

#setting tags
#mlflow.set_tag("running_from_jupyter", "True")

#logging metrics
#mlflow.log_metric("train-" + "RMSE", rmse_train)
#mlflow.log_metric("test-" + "RMSE", rmse_test)

# logging the model to mlflow will not work without an AWS connection setup, which is too complex for now
#mlflow.end_run()

Again, the upload to MLFlow is disabled (commented out) to avoid accidentally uploading new model runs.

## Summary of SVM Model

The RMSE values from the first runs with the SVM model were not good compared to the other models. Some changes were therefore made to the model, but the values did not improve significantly. As a result, the effort put into this model was reduced, and it was soon decided not to use it any further.