# SVM Model

This notebook describes and shows the preparation of the data and the use of the SVM model.  
To this end, the DataFrame from "Featureengineering" is loaded together with the feature importance list for this particular model.

---

### Import libraries

In [None]:
import pandas as pd 
import numpy as np

from sklearn import svm

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

import sys
sys.path.append("..")
import mlflow
from modeling.config import EXPERIMENT_NAME
TRACKING_URI = open("../.mlflow_uri").read().strip()

---

### Read data file "Featureengineering"

In [None]:
# read data
df = pd.read_csv('../data/Featureselection03.csv')

---

### Read file "Feature importance list"

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['ModelImportance'] < 3]['VarName'])
len(list_imp_feat)

In [None]:
df_model = df[list_imp_feat].copy()

---

### NaN value treatment

In [None]:
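# assign the result instead of using inplace=True on a column selection,
# which newer pandas versions no longer apply reliably to the parent DataFrame
df_model['V.SLPOG.act.PRC'] = df_model['V.SLPOG.act.PRC'].fillna(0)
df_model['ME.SFCI.act.gPkWh'] = df_model['ME.SFCI.act.gPkWh'].fillna(0)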

The NaN values in these two features are filled with 0. Investigating the data led to the conclusion that in these cases the vessel is in the harbour and the Main Engine is not running.

In [None]:
df_model.dropna(inplace=True)

The remaining NaN values are dropped, because filling them in a meaningful way is either not possible or too uncertain.

---

### Correlation Matrix

In [None]:
plt.figure(figsize = (30,28))
# numeric_only=True skips non-numeric columns such as passage_type
sns.heatmap(df_model.corr(numeric_only=True), annot = True, cmap = 'RdYlGn')

The correlation matrix is checked again to ensure that no highly correlated features remain. The feature correlations were already checked in "Featureengineering".

---

## SVM Model Details

### Target and Feature definition

In [None]:
X = df_model.drop(['ME.FMS.act.tPh'], axis = 1)
y = df_model['ME.FMS.act.tPh']

The supply mass fuel rate (`ME.FMS.act.tPh`) is the target and is separated from the feature dataset.

### Test train split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = X['passage_type'], test_size = 0.1, random_state = 42)

The train/test split uses random_state = 42 so that the different models are evaluated on comparable datasets. Due to the size of the dataset, the test size is set rather small at 10%. The split is stratified by `passage_type`.
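The effect of stratifying the split can be checked on a small synthetic frame. The following is only a sketch: the toy data, the `speed` column and the 50/30/20 class ratio are assumptions made for illustration; only `passage_type` and the split settings come from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, three passage types in a 50/30/20 ratio
toy = pd.DataFrame({
    "speed": range(100),
    "passage_type": ["Atlantic"] * 50 + ["Europe"] * 30 + ["South America"] * 20,
})

toy_train, toy_test = train_test_split(
    toy, stratify=toy["passage_type"], test_size=0.1, random_state=42
)

# Count how often each passage type appears in the 10-row test set
test_counts = toy_test["passage_type"].value_counts().to_dict()
print(test_counts)
```

With stratification, the class proportions of the test set follow those of the full dataset, so every passage type is represented even in a small test split.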

### Dummy creation

In [None]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

Dummy features are created for the passage type (Atlantic, Europe, South America), so that the data is fed to the model as numerical values only.
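Because the dummies are created separately for the train and test sets, both sets only end up with the same columns if every passage type occurs in both. The stratified split makes that very likely here. As a safeguard, the test columns could be aligned to the training layout. A minimal sketch with hypothetical toy frames (not the project data):

```python
import pandas as pd

# Hypothetical toy frames: the test split happens to lack "South America"
toy_train = pd.DataFrame({"passage_type": ["Atlantic", "Europe", "South America"]})
toy_test = pd.DataFrame({"passage_type": ["Atlantic", "Europe"]})

train_dummies = pd.get_dummies(toy_train, drop_first=True)
test_dummies = pd.get_dummies(toy_test, drop_first=True)

# Reindex the test columns to the training layout; missing dummies become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```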

### Open MLFlow and definition of run name

In [None]:
# setting the MLFlow connection and experiment
#mlflow.set_tracking_uri(TRACKING_URI)
#mlflow.set_experiment(EXPERIMENT_NAME)
#mlflow.start_run(run_name='SVM')
#run = mlflow.active_run()

The upload to MLFlow is disabled (commented out) so that no model results are uploaded, and because the MLFlow setup has most probably not been done on the executing machine.

### Pipeline with SVM Regressor and parameter definition

In [None]:
clf = make_pipeline(MinMaxScaler(), svm.SVR(kernel='poly',
                                            gamma='scale',
                                            C=1.0,
                                            epsilon=0.1,
                                            ))

The SVM Regressor (SVR) is used as the model, together with a MinMaxScaler.


The pipeline is set up with:  
* kernel: "poly"
* gamma: "scale"
* C: 1.0
* epsilon: 0.1  
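The settings listed above can be read back from the pipeline object, which is handy when they are later logged. The sketch below rebuilds the same pipeline under the name `pipe` (a name chosen here for illustration) and also prints `degree`, which the polynomial kernel uses but which is left at its scikit-learn default in this notebook.

```python
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(
    MinMaxScaler(),
    svm.SVR(kernel="poly", gamma="scale", C=1.0, epsilon=0.1),
)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Step parameters are addressed as "<step name>__<parameter>"
wanted = ("svr__kernel", "svr__gamma", "svr__C", "svr__epsilon", "svr__degree")
svr_params = {key: pipe.get_params()[key] for key in wanted}
print(svr_params)
```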



### Fit Model

In [None]:
clf.fit(X_train, y_train)

### Predict y

In [None]:
y_pred = clf.predict(X_test)


### Predict y on train data

In [None]:
y_pred_train = clf.predict(X_train)

To obtain residuals for the error analysis of the regression model, the training data points have to be predicted as well.
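A residual is the difference between the actual and the predicted value. The following sketch uses small hypothetical arrays in place of `y_train` and `y_pred_train`; the names `y_actual` and `y_fitted` are illustrative only.

```python
import numpy as np

# Hypothetical actual and predicted values standing in for y_train / y_pred_train
y_actual = np.array([1.0, 2.0, 3.0, 4.0])
y_fitted = np.array([1.1, 1.8, 3.0, 4.4])

# Residual = actual minus predicted; a positive value means under-prediction
residuals = y_actual - y_fitted
print(residuals)
print("mean residual:", residuals.mean())
```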

### RMSE

The root mean square error (RMSE) is used as the model metric.
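As a reminder of what the metric measures, the RMSE is the square root of the mean of the squared prediction errors. A small worked sketch with hypothetical values (not project data):

```python
import numpy as np

# Hypothetical values: the errors are 1, 0 and -2
y_true_toy = np.array([2.0, 4.0, 6.0])
y_pred_toy = np.array([1.0, 4.0, 8.0])

squared_errors = (y_true_toy - y_pred_toy) ** 2   # 1, 0, 4
rmse_toy = np.sqrt(squared_errors.mean())         # sqrt(5 / 3)
print(rmse_toy)
```

The RMSE is expressed in the unit of the target, which makes it directly comparable to the measured values.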

In [None]:
# the squared=False argument is deprecated in newer scikit-learn releases,
# so the root of the MSE is taken explicitly
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE train: ', rmse_train)
print('RMSE test: ', rmse_test)

### Plot "Actual" vs. "Predicted"

In [None]:
fig=plt.figure(figsize=(6, 6))
plt.axline([1, 1], [2, 2],color='lightgrey')
plt.scatter(y_train, y_pred_train, color ='#33424F')
plt.scatter(y_test, y_pred, color = '#FF6600')
plt.xlabel("ME.FMS.act.tPh actual");
plt.ylabel("ME.FMS.act.tPh predicted");

### MLFlow parameters

In [None]:
# setting parameters that should be logged on MLFlow
params = {
       'csv used': 'Featureselection03.csv',
       'features drop' : 'According to model importance list',
       'NaN handling': 'dropped',
       'Shape' : df.shape,
       'Scaler' : 'MinMaxScaler',
       'kernel' : 'poly',
       'gamma' : 'scale',
       'C' : 1.0,
       'epsilon' : 0.1
  }

List all parameters that shall be uploaded to MLFlow. These can differ between models.

### Writing to MLFlow

In [None]:
#logging params to mlflow
#mlflow.log_params(params)

#setting tags
#mlflow.set_tag("running_from_jupyter", "True")

#logging metrics
#mlflow.log_metric("train-" + "RMSE", rmse_train)
#mlflow.log_metric("test-" + "RMSE", rmse_test)

# logging the model to mlflow will not work without an AWS connection set up; too complex for now
#mlflow.end_run()

Again, the upload to MLFlow is disabled (commented out) so that new model runs are not uploaded accidentally.

---

## Summary of SVM Model

The RMSE values from the first runs with the SVM model were not good compared to the other models. Some changes were therefore made to the model, but because the values did not improve significantly, the effort put into this model was reduced and it was soon decided not to use it further.