### Demo: http://ec2-13-48-176-197.eu-north-1.compute.amazonaws.com:5000/

The dataset contains three types of features: FMS, NASM and time. Several combinations of these groups were evaluated to find which model achieves the highest accuracy score. LDA, QDA and KNN were each fitted on the same feature groups, using two datasets that differ in the target column: `WeakLink_label` is the last column of the AimoScore_WeakLink_big_scores_labels dataset, and `One_hot_encoding` is a column of the AimoScore_WeakLink_big_scores_one_hot dataset.  
One feature of each symmetrical pair was removed, either by side (left or right) or by correlation (higher or lower), in the same manner as in Sprint2 and Sprint3.  


The feature groups were accessed, and the left or right elements of the symmetrical pairs (or the elements with the higher or lower correlation) were removed, with the help of the following code:


In [None]:
# Accessing different feature groups
FMS_arr = data.iloc[:, 1:14] # only FMS  
NASM_arr = data.iloc[:, 14:39] # only NASM  
time_arr = data.iloc[:, 39:41] # only time features  
all_f_arr = data.iloc[:, 1:39] # FMS and NASM features, but no time  
all_f_t_arr = data.iloc[:, 1:41] # FMS + NASM + time  
FMS_time_arr = data.iloc[:, np.r_[1:14, 39:41]] # FMS + time  
NASM_time_arr = data.iloc[:, 14:41] # NASM + time  
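
The positional slices above can also be collected in one lookup table, which keeps the group names used in the result tables next to their column ranges. The following is a minimal sketch on a synthetic 41-column frame; `GROUPS` and `select_group` are hypothetical names that are not part of the project code, and the column layout is assumed to match the slices above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: 41 columns laid out as the
# slices above assume (column 0, then 1-13 FMS, 14-38 NASM, 39-40 time).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, 41)), columns=[f"c{i}" for i in range(41)])

# Hypothetical mapping: group name -> positional column indices.
# np.r_ concatenates several ranges, which is needed for FMS + time.
GROUPS = {
    "FMS": np.r_[1:14],
    "NASM": np.r_[14:39],
    "Time features": np.r_[39:41],
    "FMS + NASM": np.r_[1:39],
    "FMS + NASM + time": np.r_[1:41],
    "FMS + time": np.r_[1:14, 39:41],
    "NASM + time": np.r_[14:41],
}


def select_group(frame, name):
    """Return the columns of `frame` that belong to the named feature group."""
    return frame.iloc[:, GROUPS[name]]


group_shapes = {name: select_group(demo, name).shape for name in GROUPS}
print(group_shapes)
```

Iterating over `GROUPS` then yields one feature matrix per table row without repeating the slice boundaries.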

In [None]:
# Column indices of the symmetrical (left, right) feature pairs, used by the removal loops below
fms_sym = [[4, 6], [5, 7], [8, 11], [9, 12], [10, 13]]
nasm_sym = [[14, 15], [17, 18], [21, 22], [24, 25], [26, 27], [28, 29], [31, 32]]

In [None]:
# Remove the right element of each symmetrical pair (use `left` to remove the left one).
# The same loop works for any selected_features group, with nasm_sym in place of fms_sym.
for left, right in fms_sym:
    to_remove = data.columns[right]
    del selected_features[to_remove]

In [None]:
# Keep the element of each pair that has the higher absolute correlation with
# WeakLink_score and remove the other one. To keep the lower-correlated element
# instead, change the ">" to "<".
for left, right in fms_sym:
    l = data.columns[left]
    r = data.columns[right]
    if abs(data[l].corr(labels)) > abs(data[r].corr(labels)): 
        del selected_features[r]
    else:
        del selected_features[l]
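
The two correlation-based variants can be expressed as one function that returns a new frame instead of deleting columns in place. This is a sketch under assumptions, not the project's implementation: `drop_from_pairs` and its `drop` parameter are hypothetical, pairs are given as column names rather than positions, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def drop_from_pairs(features, target, pairs, drop="lower"):
    """Drop one column from each (left, right) pair of column names.

    drop="lower" removes the column with the weaker absolute correlation
    with `target`; drop="higher" removes the stronger one.
    """
    to_drop = []
    for left, right in pairs:
        left_is_stronger = abs(features[left].corr(target)) > abs(features[right].corr(target))
        if drop == "lower":
            to_drop.append(right if left_is_stronger else left)
        else:
            to_drop.append(left if left_is_stronger else right)
    # DataFrame.drop returns a copy, so the input frame is left untouched.
    return features.drop(columns=to_drop)


# Synthetic example: "a_left" follows the target closely, "a_right" is noise.
rng = np.random.default_rng(0)
target = pd.Series(rng.normal(size=200))
demo = pd.DataFrame({
    "a_left": target + 0.1 * rng.normal(size=200),
    "a_right": rng.normal(size=200),
})

kept_strong = drop_from_pairs(demo, target, [("a_left", "a_right")], drop="lower")
kept_weak = drop_from_pairs(demo, target, [("a_left", "a_right")], drop="higher")
print(list(kept_strong.columns), list(kept_weak.columns))
```

Returning a copy makes it possible to run several iterations on the same feature group without reloading the data.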

# Iterations
## 1. Removing right value of symmetrical pairs 

* WeakLink_label
    
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.440 | 0.481 | 0.631 |
|  FMS + time        | 0.443 | 0.506 | 0.634 |
|  NASM              | 0.558 | 0.681 | 0.720 |
|  NASM + time       | 0.562 | 0.706 | **0.722** |  

* One_hot_encoding

| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.455 | 0.462 | 0.646 |
|  FMS + time        | 0.457 | 0.485 | 0.648 |
|  NASM              | 0.549 | 0.698 | **0.730** |
|  NASM + time       | 0.553 | 0.709 | 0.713 |

## 2. Removing left value of symmetrical pairs  

* WeakLink_label
  
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.445 | 0.488 | 0.643 |
|  FMS + time        | 0.452 | 0.508 | 0.665 |
|  NASM              | 0.535 | 0.684 | 0.711 |
|  NASM + time       | 0.550 | 0.709 | **0.719** |  

* One_hot_encoding
    
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.436 | 0.474 | 0.644 |
|  FMS + time        | 0.458 | 0.500 | 0.651 |
|  NASM              | 0.534 | 0.677 | **0.709** |
|  NASM + time       | 0.544 | 0.705 | 0.702 |  

## 3. Removing the higher-correlation feature (with WeakLink_score) from symmetrical pairs

* WeakLink_label

| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.460 | 0.504 | 0.644 |
|  FMS + time        | 0.463 | 0.539 | 0.647 |
|  NASM              | 0.537 | 0.681 | 0.726 |
|  NASM + time       | 0.544 | 0.709 | **0.728** |  

* One_hot_encoding
    
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.462 | 0.492 | 0.654 |
|  FMS + time        | 0.466 | 0.509 | 0.661 |
|  NASM              | 0.537 | 0.681 | **0.713** |
|  NASM + time       | 0.538 | 0.702 | 0.710 |  

## 4. Removing the lower-correlation feature (with WeakLink_score) from symmetrical pairs

* WeakLink_label

| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.429 | 0.476 | 0.629 |
|  FMS + time        | 0.442 | 0.496 | 0.631 |
|  NASM              | 0.558 | 0.690 | 0.696 |
|  NASM + time       | 0.573 | 0.704 | **0.721** |  

* One_hot_encoding
    
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.433 | 0.451 | 0.634 |
|  FMS + time        | 0.435 | 0.480 | 0.625 |
|  NASM              | 0.558 | 0.691 | **0.715** |
|  NASM + time       | 0.560 | 0.706 | 0.706 | 

## 5. Individual group selection without modifications

* WeakLink_label
    
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.495 | 0.582 | 0.665 |
|  NASM              | 0.603 | 0.781 | 0.737 |
|  Time features     | 0.301 | 0.001 | 0.515 |
|  FMS + NASM        | 0.628 | 0.482 | **0.757** |
|  FMS + NASM + time | 0.631 | 0.504 | 0.749 |
|  FMS + time        | 0.508 | 0.606 | 0.663 |
|  NASM + time       | 0.606 | **0.797** | 0.730 |

* One_hot_encoding
    
| Groups             | LDA   | QDA   | KNN   |
|--------------------|-------|-------|-------|
|  FMS               | 0.497 | 0.567 | 0.674 |
|  NASM              | 0.600 | 0.762 | 0.723 |
|  Time features     | 0.306 | 0.001 | 0.487 |
|  FMS + NASM        | 0.625 | 0.481 | **0.750** |
|  FMS + NASM + time | 0.627 | 0.450 | 0.745 |
|  FMS + time        | 0.506 | 0.590 | 0.682 |
|  NASM + time       | 0.601 | **0.784** | 0.728 |

QDA achieves the highest accuracy overall, 0.797, on the NASM + time features with WeakLink_label as the target. However, KNN was chosen for the implementation because QDA is limited on WeakLink_label by ill-defined covariance estimates. KNN performed best on FMS + NASM, with an accuracy score of 0.757.
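
The comparison of the three classifiers can be sketched as a single loop over a held-out split. QDA estimates one covariance matrix per class, which is where ill-defined estimates can arise; scikit-learn's `reg_param` option regularizes these estimates. Using it here is an assumption for illustration only, since the text above does not say whether regularization was tried, and the data below is synthetic rather than the AimoScore dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for one feature group.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    # reg_param is an assumption: it regularizes the per-class covariance estimates.
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.1),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score on rows the model has not seen during fitting.
    scores[name] = accuracy_score(y_val, model.predict(X_val))

print(scores)
```

The scores printed by this sketch describe the synthetic data only and are not comparable to the tables above.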

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn import neighbors
import warnings  # np.warnings is not available in recent NumPy releases

warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('datasets/AimoScore_WeakLink_big_scores_labels.csv',index_col=0, parse_dates=True)
data_score = pd.read_csv('datasets/AimoScore_WeakLink_big_scores_Labels_and_Scores.csv',index_col=0, parse_dates=True)

In [None]:
#data['WeakLink_label'].value_counts()

In [None]:
# Split training and validation set 70%-30%
train_set, validation_set = train_test_split(data, test_size=0.3)

In [None]:
# FMS + NASM features (columns 1-38)
selected_features = train_set.iloc[:, 1:39]

X_train = selected_features
y_train = train_set['WeakLink_label']

# Note: the model is scored on the same rows it was fitted on, so the value
# printed below is a training-set accuracy; validation_set is not used here.
X_test = selected_features
y_test = train_set['WeakLink_label']

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
model = knn.fit(X_train, y_train)

pred = model.predict(X_test)
print('\n FMS and NASM features'
      ' \n Accuracy Score: ', accuracy_score(y_test, pred))


 FMS and NASM features 
 Accuracy Score:  0.7569965870307167
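
The cell above builds both `X_train` and `X_test` from `train_set`, so the printed score is measured on the rows used for fitting. A minimal sketch of scoring the same KNN configuration on the held-out `validation_set` follows; it uses synthetic data with hypothetical feature names (`f0` to `f7`), so its numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data: 8 feature columns plus a label column.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
demo = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
demo["WeakLink_label"] = y

# Same 70%-30% split as above, with a fixed seed for repeatability.
train_set, validation_set = train_test_split(demo, test_size=0.3, random_state=0)

feature_cols = demo.columns[:8]
X_train, y_train = train_set[feature_cols], train_set["WeakLink_label"]
X_val, y_val = validation_set[feature_cols], validation_set["WeakLink_label"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Training accuracy is usually optimistic for KNN, because every training
# row is its own nearest neighbour; validation accuracy uses unseen rows.
train_acc = accuracy_score(y_train, knn.predict(X_train))
val_acc = accuracy_score(y_val, knn.predict(X_val))
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

Reporting both values side by side shows how much of the training score carries over to unseen rows.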


# Architecture updates

1. Extended ml_core to contain pipelines for both models
2. Updated config for new datasets and models
3. Added preprocessing script to merge weak link labels with the feature dataset. Also added the creation of one-hot-encoding labels
4. Added classification feature to Front-end
5. Updated the UI design
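
The preprocessing step from item 3 could look roughly like the following. This is a sketch under assumptions, not the project's script: the frames, the index values, the label values and the `label` prefix are all hypothetical, and only `WeakLink_label` is a column name taken from this document.

```python
import pandas as pd

# Hypothetical feature table and label table sharing the same index.
features = pd.DataFrame(
    {"f1": [0.1, 0.5, 0.9], "f2": [1.0, 0.0, 0.5]},
    index=["id_a", "id_b", "id_c"],
)
labels = pd.DataFrame(
    {"WeakLink_label": ["LabelX", "LabelY", "LabelX"]},
    index=["id_a", "id_b", "id_c"],
)

# Merge the weak link labels into the feature dataset on the shared index.
merged = features.join(labels, how="inner")

# Create one indicator column per label value (one-hot encoding).
one_hot = pd.get_dummies(merged["WeakLink_label"], prefix="label", dtype=int)
merged_one_hot = features.join(one_hot, how="inner")

print(merged)
print(merged_one_hot)
```

An inner join drops rows that lack either features or a label, which avoids training on incomplete rows.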

# Dependency management strategy updates

The datasets are now available in the artifactory and are no longer committed to the repository branch. The `update.sh` bash script was created to download them and place them in the project structure.