# KNN Model


In the following, the preparation of the data and the use of the KNN model are described and shown.  
For this, the DataFrame from "Featureengineering" is loaded, together with the feature importance list for this particular model.

---


### Import libraries

In [None]:
import pandas as pd 
import numpy as np

# KNN Regression Model
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

# Used in the Error Analysis
from yellowbrick.regressor import PredictionError, ResidualsPlot

# required to save the model parameters to MLFlow
import sys
sys.path.append("..")
import mlflow
from modeling.config import EXPERIMENT_NAME
TRACKING_URI = open("../.mlflow_uri").read().strip()

---

### Read data file "Featureengineering"

In [None]:
# read data from .csv file
df = pd.read_csv('../data/Featureselection03.csv')

---

### Read file "Feature importance list"

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')

The features are selected by assigning each one a ModelImportance from 1 to 3, where 1 is important, 2 is possibly important and 3 is neglected.  
In contrast to the feature importance in the "Featureengineering", this list keeps the features that are important for this particular Machine Learning model.

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['ModelImportance'] < 3]['VarName'])
len(list_imp_feat)

In [None]:
df_model = df[list_imp_feat].copy()

The feature list for the model also includes the target: the supply mass fuel rate of the Main Engine [t/h].

---

### NaN value treatment

In [None]:
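# assign the result back; fillna(inplace=True) on a column selection is unreliable in recent pandas versions
df_model['V.SLPOG.act.PRC'] = df_model['V.SLPOG.act.PRC'].fillna(0)
df_model['ME.SFCI.act.gPkWh'] = df_model['ME.SFCI.act.gPkWh'].fillna(0)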

The NaN values in these two features are filled with 0. Investigating the data led to the conclusion that they occur when the vessel is in the harbour and the Main Engine is not running.
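
A check of this kind could look roughly like the following. It is a minimal sketch on invented toy data, and it assumes that a fuel rate of zero in `ME.FMS.act.tPh` indicates a stopped Main Engine; the original investigation may have relied on other signals.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; all values are invented.
toy = pd.DataFrame({
    "ME.FMS.act.tPh": [0.0, 0.0, 1.2, 1.5, 0.0],
    "V.SLPOG.act.PRC": [np.nan, np.nan, 4.1, 3.8, np.nan],
    "ME.SFCI.act.gPkWh": [np.nan, np.nan, 175.0, 172.0, np.nan],
})

# Rows where at least one of the two features is missing
nan_rows = toy[["V.SLPOG.act.PRC", "ME.SFCI.act.gPkWh"]].isna().any(axis=1)

# Assumption: a fuel rate of 0 means the Main Engine is not running
engine_off = toy["ME.FMS.act.tPh"] == 0

# Share of NaN rows that coincide with a stopped engine
share_engine_off = (nan_rows & engine_off).sum() / nan_rows.sum()
print(f"NaN rows with engine off: {share_engine_off:.0%}")
```

If the share is close to 100% on the real data, filling with 0 is a defensible choice; a noticeably lower share would suggest looking at the remaining rows separately.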

In [None]:
df_model.dropna(inplace=True)

The remaining NaN values are dropped, because filling them in a meaningful way is either not possible or too uncertain.

---

### Correlation Matrix

In [None]:
plt.figure(figsize = (30,28))
# numeric_only=True skips non-numeric columns such as passage_type
sns.heatmap(df_model.corr(numeric_only=True), annot = True, cmap = 'RdYlGn')

The correlation matrix is checked again to confirm that no highly correlated features remain. The feature correlation has already been checked in the "Featureengineering".
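
With this many features, a large heatmap is hard to read by eye. As a complement, the highly correlated pairs could be listed programmatically. The following is a minimal sketch on synthetic data; the helper name `high_corr_pairs` and the threshold of 0.9 are assumptions for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so that each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: "b" is almost a copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
print(high_corr_pairs(toy))
```

Applied to the real data, the call would be `high_corr_pairs(df_model)`; an empty result means no pair exceeds the chosen threshold.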

---

## KNN Model Details

### Target and Feature definition

In [None]:
X = df_model.drop(['ME.FMS.act.tPh'], axis = 1)
y = df_model['ME.FMS.act.tPh']

The supply mass fuel rate as target is separated from the feature dataset.

### Test train split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = X['passage_type'], test_size = 0.1, random_state = 42)

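The test-train split uses random_state = 42 to obtain comparable datasets for the different models, and it is stratified by passage type so that all passage types are represented proportionally in both subsets. Due to the size of the dataset, the test size is set rather small at 10%.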
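
The effect of the `stratify` argument can be verified by comparing the class shares in both subsets. The sketch below uses invented toy data; the passage type frequencies are assumptions and do not reflect the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; passage types and their frequencies are invented
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "passage_type": rng.choice(
        ["Atlantic", "Europe", "South America"], size=1000, p=[0.6, 0.3, 0.1]
    ),
    "x": rng.normal(size=1000),
})

train, test = train_test_split(
    toy, stratify=toy["passage_type"], test_size=0.1, random_state=42
)

# With stratification the class shares should be nearly identical in both subsets
shares = pd.DataFrame({
    "train": train["passage_type"].value_counts(normalize=True),
    "test": test["passage_type"].value_counts(normalize=True),
})
print(shares.round(3))
```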

### Dummy creation

In [None]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

For the passage type (Atlantic, Europe, South America), dummy features are created so that the data can be fed to the model as numerical values only.
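
Because the dummies are created separately for the train and the test set, both sets only end up with the same columns if every passage type occurs in both. The stratified split makes that very likely here, but a defensive alignment could be added. The following is a minimal sketch on invented toy data in which one category is deliberately missing from the test set; the column `speed` is hypothetical.

```python
import pandas as pd

# Toy data; the test set deliberately lacks the "South America" category
train = pd.DataFrame({
    "speed": [10.0, 12.0, 11.0],
    "passage_type": ["Atlantic", "Europe", "South America"],
})
test = pd.DataFrame({
    "speed": [9.0, 13.0],
    "passage_type": ["Atlantic", "Europe"],
})

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)

# Give the test set exactly the training columns; missing dummies are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

Without the `reindex` step, the pipeline would be given a test set with fewer columns than it was fitted on.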

### Open MLFlow and definition of run name

In [None]:
# setting the MLFlow connection and experiment
#mlflow.set_tracking_uri(TRACKING_URI)
#mlflow.set_experiment(EXPERIMENT_NAME)
#mlflow.start_run(run_name='KNN')
#run = mlflow.active_run()

The upload to MLFlow is disabled (commented out) to avoid uploading any model results, and because MLFlow is most probably not set up on the machine running this notebook.

### Pipeline with KNN Regressor and parameter definition

In [None]:
knn = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=4,
                                                        metric='minkowski',
                                                        p=2,
                                                        n_jobs=-1)) 

As the model, the k-nearest neighbors regressor (KNN Regressor) is used together with a MinMaxScaler.
Using a StandardScaler instead of the MinMaxScaler does not give better RMSE values, hence the MinMaxScaler is used for the KNN model.  
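
A scaler comparison of this kind can be sketched as follows. The example runs on synthetic data from `make_regression` as a stand-in for the vessel data, so the resulting numbers say nothing about the real dataset; only the procedure is illustrated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the vessel data
X_toy, y_toy = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=42)

# Fit the same KNN regressor once per scaler and collect the test RMSE
scaler_rmse = {}
for scaler in (MinMaxScaler(), StandardScaler()):
    model = make_pipeline(scaler, KNeighborsRegressor(n_neighbors=4))
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    scaler_rmse[type(scaler).__name__] = rmse

print(scaler_rmse)
```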

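The pipeline is set up with the following model parameters:  
* n_neighbors: the number of neighbors used for each prediction (4)
* metric and p: Minkowski distance with p = 2, which is the Euclidean distance
* n_jobs: the number of parallel jobs  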

The number of jobs is set to -1 to use all available CPU cores for the neighbor search. This is important because the KNN model is resource intensive: in this case the prediction took up to 30 minutes, whereas the "RandomForest" model ran for only 2-3 minutes.


### Fit Model

In [None]:
knn.fit(X_train, np.ravel(y_train))

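Train the model. For KNN this is very quick compared to other models, because fitting essentially only stores the training data (and, depending on the algorithm, builds a search index). The expensive distance calculations happen at prediction time.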

### Predict y

In [None]:
y_pred = knn.predict(X_test)


Predicting on the test dataset takes some time, because for each test datapoint the distances to the training datapoints have to be calculated in order to find its nearest neighbors.
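
To make the cost of this step concrete, the following sketch implements KNN regression by brute force with NumPy and compares it to scikit-learn on random toy data. The helper `knn_predict_brute` is hypothetical and written for illustration only; scikit-learn may use a tree-based index instead of brute force, depending on the `algorithm` setting and the data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict_brute(X_fit, y_fit, X_query, k=4):
    """Predict by averaging the targets of the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_fit)
    diff = X_query[:, None, :] - X_fit[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Indices of the k smallest distances per query row
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_fit[nearest].mean(axis=1)

# Random toy data
rng = np.random.default_rng(0)
X_fit = rng.random((200, 3))
y_fit = rng.random(200)
X_query = rng.random((5, 3))

manual = knn_predict_brute(X_fit, y_fit, X_query, k=4)
reference = (
    KNeighborsRegressor(n_neighbors=4, metric="minkowski", p=2)
    .fit(X_fit, y_fit)
    .predict(X_query)
)
print(np.allclose(manual, reference))
```

The distance matrix has one entry per pair of query and training points, which is why the prediction time grows with the size of both datasets.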

### Predict y on train data

In [None]:
y_pred_train = knn.predict(X_train)

To get residuals for the error analysis of regression models, the train datapoints have to be predicted as well.
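
When reading the train residuals of a KNN model, it helps to keep in mind that each training point is its own nearest neighbor (distance 0), so the train error tends to be smaller than the error on unseen data. The sketch below illustrates this on synthetic data: with k = 1 the train RMSE is zero as long as no two rows share identical features, and with k = 4 it is still pulled down by the point itself.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a noisy linear target on two features
rng = np.random.default_rng(1)
X_toy = rng.random((300, 2))
y_toy = X_toy.sum(axis=1) + rng.normal(scale=0.1, size=300)

# Train RMSE when predicting the same points the model was fitted on
train_rmse = {}
for k in (1, 4):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)
    train_rmse[k] = np.sqrt(mean_squared_error(y_toy, model.predict(X_toy)))

print(train_rmse)
```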

---

### RMSE

The root mean square error (RMSE) is used as the model metric.
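
As a reminder of what the metric measures, the RMSE is the square root of the mean of the squared differences between actual and predicted values. A small worked example on invented numbers, computed by hand and via scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented values for illustration
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.0, 4.5])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))

# Same value as the square root of scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print(rmse_manual, rmse_sklearn)
```

Because of the square root, the RMSE has the same unit as the target, here t/h.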

In [None]:
# RMSE as the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn versions
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE train: ', rmse_train)
print('RMSE test: ', rmse_test)

### Plot "Actual" vs. "Predicted"

In [None]:
fig=plt.figure(figsize=(6, 6))
plt.axline([1, 1], [2, 2],color='lightgrey')
plt.scatter(y_train, y_pred_train, color ='#33424F')
plt.scatter(y_test, y_pred, color = '#FF6600')
plt.xlabel("ME.FMS.act.tPh actual");
plt.ylabel("ME.FMS.act.tPh predicted");

### Residual Plot

In [None]:
# calculate residuals
residuals_train = y_pred_train - y_train
residuals_test = y_pred - y_test

In [None]:
# label each scatter directly instead of relying on the legend order
sns.scatterplot(x = y_pred_train, y = residuals_train, label = 'train')
sns.scatterplot(x = y_pred, y = residuals_test, label = 'test')
plt.axhline(y = 0, color = 'black')
plt.xlabel("ME.FMS.act.tPh predicted");
plt.ylabel("Residuals");
plt.legend()

---

## MLFlow parameters

In [None]:
# setting parameters that should be logged on MLFlow
params = {
    'csv used': 'Featureselection03.csv',
    'features drop' : 'According to model importance list',
    'random_state' : 42,
    'NaN handling' : 'dropped',  
    'Shape' : df.shape,
    'Scaler' : 'MinMaxScaler',
    'K-Neighbors' : 4,
    'metric' : 'minkowski',
    'p' : 2
  }

List all parameters that shall be uploaded to MLFlow. The set of parameters can differ between models.

## Writing to MLFlow

In [None]:
#logging params to mlflow
#mlflow.log_params(params)

#setting tags
#mlflow.set_tag("running_from_jupyter", "True")

#logging metrics
#mlflow.log_metric("train-" + "RMSE", rmse_train)
#mlflow.log_metric("test-" + "RMSE", rmse_test)

# logging the model to mlflow will not work without an AWS connection setup, which is too complex for now
#mlflow.end_run()

Again, the upload to MLFlow is disabled (commented out) to avoid accidentally uploading new model runs.
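
Instead of commenting the MLFlow calls in and out, the upload could be controlled by a single switch. The following is a minimal sketch; the helper `log_run` and its `enabled` flag are hypothetical and not part of this project, and the metric values in the call are placeholders rather than results.

```python
def log_run(params, metrics, enabled=False, tracking_uri=None,
            experiment_name=None, run_name="KNN"):
    """Log params and metrics to MLflow only when enabled; return whether logging happened."""
    if not enabled:
        return False

    # Imported lazily so the notebook also runs without an MLflow setup
    import mlflow

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment_name)
    # The context manager ends the run automatically, even if logging fails
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)
        mlflow.set_tag("running_from_jupyter", "True")
        mlflow.log_metrics(metrics)
    return True

# Placeholder values; with enabled=False nothing is uploaded
logged = log_run({"K-Neighbors": 4}, {"train-RMSE": 0.0, "test-RMSE": 0.0}, enabled=False)
print(logged)
```

In the notebook, the call would receive `params`, the computed `rmse_train` and `rmse_test`, and `TRACKING_URI` and `EXPERIMENT_NAME` once the MLFlow setup is available.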

---

## Summary of KNN Model

The KNN model takes too much time to predict compared to the other ML models. With the high-frequency data used here, the amount of data is most probably too large for an efficient KNN model.  
In addition, its RMSE is not as low as that of the other models. Hence, with a long runtime and only moderate RMSE values, KNN is not the preferred model to continue with.