## Iterations over ML process steps




1. The provided data was loaded and shuffled.
2. The data set was analyzed using a correlation heatmap and scatter plots of AimoScore against the features.
3. The data set was then split into training and evaluation subsets (70% / 30%).
4. A model was chosen, trained on the training subset and evaluated on the evaluation subset. The tables below show which models were tested and how they performed.
5. Step 4 was repeated several times, each time with a different model.
6. The model that produced the best results was implemented in the system.
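
The steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's notebook code: the helper names (`fit_linear`, `predict`, `mse`) and the two candidate feature sets are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the provided data set (assumption: 200 rows, 3 features).
X = rng.normal(size=(200, 3))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Steps 1 and 3: shuffle the rows, then split 70% / 30%.
order = rng.permutation(len(y))
X, y = X[order], y[order]
cut = int(0.7 * len(y))
X_train, X_eval = X[:cut], X[cut:]
y_train, y_eval = y[:cut], y[cut:]


def fit_linear(features, target):
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict(coef, features):
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def mse(target, predicted):
    return float(np.mean((target - predicted) ** 2))


# Steps 4 and 5: train and evaluate each candidate (here: feature subsets).
candidates = {"first feature only": [0], "all features": [0, 1, 2]}
scores = {}
for name, cols in candidates.items():
    coef = fit_linear(X_train[:, cols], y_train)
    scores[name] = mse(y_eval, predict(coef, X_eval[:, cols]))

# Step 6: keep the candidate with the lowest evaluation error.
best = min(scores, key=scores.get)
print(best, scores)
```

Selecting by evaluation error rather than training error keeps the comparison between candidates fair, since every candidate is scored on rows it has not seen.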


## Design & coding rules

1. Follow the Python code conventions. Run the PEP8 check before each commit.
2. Follow the SOLID principles.
3. Make a merge request (MR) from the feature branch to the development branch and hold a review for the MR.
4. Design a foundation architecture that is flexible enough to accommodate changes in each new sprint.
5. Use a microservice architecture.
6. Write unit tests.
7. List the packages, with their versions, required to run the project.
8. Provide a Dockerfile to make the project independent of the hardware.

## DevOps & REST

1. Implemented exception handling for the score evaluation endpoint.
2. Added a static IP to the server (http://ec2-13-48-176-197.eu-north-1.compute.amazonaws.com:5000/).
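
As an illustration of the exception handling mentioned above, the sketch below shows framework-agnostic handler logic that a Flask view could wrap. All names (`InvalidInputError`, `predict_score`, `handle_score_request`) and the request shape (a JSON object with a `features` list) are assumptions for the example, not the project's actual API.

```python
import json


class InvalidInputError(ValueError):
    """Raised when the request body cannot be scored (hypothetical name)."""


def predict_score(features):
    # Placeholder for the real model call; returns a value in [0, 1].
    if not features:
        raise InvalidInputError("no features supplied")
    return max(0.0, min(1.0, sum(features) / len(features)))


def handle_score_request(body):
    """Return (response_dict, http_status) for a raw JSON request body."""
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["features"]]
        return {"score": predict_score(features)}, 200
    except (KeyError, TypeError, ValueError) as exc:
        # Client-side problem: malformed JSON, missing key, non-numeric value.
        # json.JSONDecodeError and InvalidInputError are ValueError subclasses.
        return {"error": str(exc) or type(exc).__name__}, 400
    except Exception:
        # Anything unexpected is reported without leaking internals.
        return {"error": "internal error"}, 500


print(handle_score_request('{"features": [0.2, 0.4]}'))
print(handle_score_request("not json"))
```

Separating client errors (status 400) from unexpected failures (status 500) lets the front end show a useful message for bad uploads while hiding server details.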

# Tested models, reasoning behind choosing these models and test results

### Sprint 3

Excluding some columns within feature groups (FMS, NASM, time, or combinations of them) did not produce an accurate enough model in Sprint 2, so other approaches were used in this sprint. In addition, two more metrics were used to assess model quality: mean absolute error (MAE) and residual sum of squares (RSS). Both should be as small as possible.
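
For reference, the four metrics can be computed from their standard definitions as in the sketch below. The function name `regression_metrics` and the sample values are made up for the illustration.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RSS and R2 for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    rss = sum(r ** 2 for r in residuals)             # residual sum of squares
    mse = rss / n                                    # mean squared error
    mae = sum(abs(r) for r in residuals) / n         # mean absolute error
    mean_true = sum(y_true) / n
    tss = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - rss / tss                             # coefficient of determination
    return {"MSE": mse, "MAE": mae, "RSS": rss, "R2": r2}


metrics = regression_metrics([0.2, 0.5, 0.9, 0.4], [0.3, 0.5, 0.7, 0.4])
print(metrics)
```

Note that RSS is the MSE multiplied by the number of samples, so unlike MSE it grows with the size of the evaluation set.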

1. Backward elimination: a feature group is selected, then the worst-performing feature is removed, that is, the feature with the highest p-value among those with p > 0.05. This is repeated until every remaining feature has p < 0.05; the results below are for those final models.

|             | FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|
| R2  | 0.42 | 0.43 | 0.54 | 0.54 |
| MSE | 0.03 | 0.03 | 0.03 | 0.03 |
| MAE | 0.14 | 0.14 | 0.13 | 0.12 |
| RSS | 0.03 | 0.03 | 0.03 | 0.03 |
| Features selected | 7 | 7 | 10 | 11 |
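
The elimination loop described in test 1 can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: p-values are computed with NumPy using a normal approximation to the t distribution (reasonable only with many more rows than features), the columns are assumed not to be perfectly collinear, and the function names and synthetic data are made up.

```python
import math

import numpy as np


def ols_p_values(X, y):
    """Two-sided p-values per column of X (an intercept is added internally).

    Assumption: normal approximation to the t distribution.
    """
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    dof = len(y) - design.shape[1]
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = coef / np.sqrt(np.diag(cov))
    p = [math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats]
    return p[1:]  # drop the intercept


def backward_elimination(X, y, names, threshold=0.05):
    """Repeatedly drop the feature with the highest p-value above threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        p_values = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p_values))
        if p_values[worst] <= threshold:
            break  # every remaining feature is significant
        del keep[worst]
    return [names[i] for i in keep]


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Only the first two columns influence the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)
selected = backward_elimination(X, y, ["f0", "f1", "f2", "f3"])
print(selected)
```

Removing one feature per iteration and refitting matters because p-values change once a correlated feature has been dropped.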

The correlation map was checked again, because backward elimination indicated that some features within a group might depend on each other, and indeed some did. Any correlation above 0.5 between two features of the same group was considered high.
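
A minimal sketch of this check is shown below. The function name, the synthetic columns and the use of the absolute correlation (so that strong negative correlations also count as high) are assumptions for the illustration.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.5):
    """Return (name_a, name_b, r) for column pairs with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # independent of a and b
pairs = highly_correlated_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print(pairs)
```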

2. Using interaction terms for highly correlated features: the features are multiplied together and the original features are kept.

|             | FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|
| R2  | 0.42 | 0.43 | 0.559 | 0.61 |
| MSE | 0.03 | 0.03 | 0.02 | 0.02 |
| MAE | 0.14 | 0.14 | 0.12 | 0.12 |
| RSS | 0.03 | 0.03 | 0.02 | 0.02 |

3. Using interaction terms for highly correlated features: the features are multiplied together and the original features are then removed.

|             | FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|
| R2  | 0.4 | 0.41 | 0.550 | 0.561 |
| MSE | 0.03 | 0.03 | 0.02 | 0.02 |
| MAE | 0.14 | 0.14 | 0.12 | 0.12 |
| RSS | 0.03 | 0.03 | 0.02 | 0.02 |
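
The two interaction-term variants (tests 2 and 3) differ only in whether the original columns are kept. A minimal sketch follows; the function `add_interactions`, its `drop_originals` flag and the `a:b` naming of the product column are assumptions for the illustration.

```python
import numpy as np


def add_interactions(X, names, pairs, drop_originals=False):
    """Append the product of each named pair as a new column."""
    index = {name: i for i, name in enumerate(names)}
    columns = [X[:, i] for i in range(X.shape[1])]
    out_names = list(names)
    for a, b in pairs:
        columns.append(X[:, index[a]] * X[:, index[b]])
        out_names.append(f"{a}:{b}")
    if drop_originals:
        # Remove every original column that took part in an interaction.
        used = {name for pair in pairs for name in pair}
        kept = [k for k, name in enumerate(out_names) if name not in used]
        columns = [columns[k] for k in kept]
        out_names = [out_names[k] for k in kept]
    return np.column_stack(columns), out_names


X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
names = ["a", "b", "c"]
kept_X, kept_names = add_interactions(X, names, [("a", "b")])
dropped_X, dropped_names = add_interactions(
    X, names, [("a", "b")], drop_originals=True
)
print(kept_names, dropped_names)
```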

4. There are many features, and applying various interaction terms to them becomes difficult. On top of that, the data is scattered and noisy. It is hard to tell which kind of model to use, as the data seems random and no single feature correlates strongly enough with the label. It is clear, though, that the relationship between the features and the label is not linear. To see whether this brings any improvement, non-linear transformations were applied to each feature in a feature group (for this test, each feature was included both to the power of 1 and to the power of 2).

|             | FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|
| R2  | 0.454 | 0.46 | 0.65 | 0.656 |
| MSE | 0.03 | 0.03 | 0.02 | 0.02 |
| MAE | 0.14 | 0.14 | 0.11 | 0.11 |
| RSS | 0.03 | 0.03 | 0.02 | 0.02 |
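
The effect of adding squared terms can be illustrated on synthetic data where the target depends on the square of one feature. This is a sketch with made-up data and helper names (`power_features`, `r2_of_linear_fit`); it reports R2 on the data the model was fitted on, so it says nothing about overfitting.

```python
import numpy as np


def power_features(X, max_power):
    """Stack X**1, X**2, ..., X**max_power column-wise."""
    return np.column_stack([X ** p for p in range(1, max_power + 1)])


def r2_of_linear_fit(features, y):
    """Fit least squares with an intercept and return R2 on the same data."""
    design = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2))


rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
# The target is quadratic in the first feature and linear in the second.
y = X[:, 0] ** 2 - 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)

r2_linear = r2_of_linear_fit(power_features(X, 1), y)
r2_squared = r2_of_linear_fit(power_features(X, 2), y)
print(r2_linear, r2_squared)
```

The model stays linear in its coefficients; only the inputs are transformed, which is why ordinary least squares can still be used.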

5. For this experiment, each feature was raised to every power from 1 to 5. This noticeably slowed down model training and prediction.

|             | FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|
| R2  | 0.51 | 0.521 | 0.7 | 0.71 |
| MSE | 0.03 | 0.03 | 0.02 | 0.02 |
| MAE | 0.14 | 0.14 | 0.11 | 0.11 |
| RSS | 0.03 | 0.03 | 0.02 | 0.02 |

6. Finally, all feature groups were selected together and each feature was again included to the powers of 1 and 2.

|             | FMS + NASM            | FMS + NASM + time         |
|:------------------:|:------------------:|:------------------:|
| R2  | 0.68 | 0.68 |
| MSE | 0.02 | 0.02 |
| MAE | 0.11 | 0.11 |
| RSS | 0.02 | 0.02 |

While the results might look promising at first glance, since R-squared improved the most in the last test, the other metrics barely improved when features were raised to higher powers.

The model that includes powers 1 and 2 of the features from the FMS, NASM and time groups (test 6) was chosen, as it gave the best results for this sprint, even if it is overfitting.

Transformations with powers up to 5 were not chosen because prediction is visibly slower than with the other models and no metric other than R-squared improves.


### Sprint 2

As the given dataset is split into three types of data - FMS, NASM and time - it was decided to first simply select some combinations of these three groups and see which model would perform the best. Model accuracy was defined by low MSE and high R2 value.

1. Removing the lower-correlation feature (with respect to AimoScore) from each symmetrical pair (for example, if a pair is at positions [4 6] and the column at position 4 has a lower correlation to AimoScore than the column at position 6, column 4 is removed).

| FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|
| MSE: 0.03, R2: 0.49| MSE: 0.03, R2: 0.41 | MSE: 0.03, R2: 0.53 | MSE: 0.03, R2:0.56 |
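
The pair-wise selection in test 1 can be sketched as follows. The function name and the synthetic data are made up, and comparing absolute correlations is an assumption; the original experiment may have compared signed values.

```python
import numpy as np


def drop_weaker_of_pairs(X, y, pairs):
    """For each (i, j) column pair, drop the column less correlated with y."""

    def strength(col):
        return abs(np.corrcoef(X[:, col], y)[0, 1])

    dropped = {i if strength(i) < strength(j) else j for i, j in pairs}
    kept = [c for c in range(X.shape[1]) if c not in dropped]
    return X[:, kept], kept


rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([
    y + rng.normal(scale=0.1, size=200),  # strongly related to y
    y + rng.normal(scale=2.0, size=200),  # weakly related to y
    rng.normal(size=200),                 # not part of any pair
])
X_reduced, kept = drop_weaker_of_pairs(X, y, pairs=[(0, 1)])
print(kept)
```

Swapping the `<` for `>` in the comparison gives the opposite variant, in which the higher-correlation feature of each pair is removed.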

2. For this test, FMS and NASM group features were combined (and eventually time as well). Then, for the symmetrical pairs of a chosen feature group (italicized in the results), each feature's correlation to AimoScore was checked, and the feature with the higher correlation in each pair was selected. This is similar to test 1, but several feature groups are used now.

| >*FMS* + NASM         | FMS + >*NASM*       | >*FMS* + >*NASM*     | >*FMS* + NASM + time | FMS + >*NASM* + time | >*FMS* + >*NASM* + time |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|:-------------:|
| MSE: 0.02, R2: 0.58| MSE: 0.02, R2: 0.57 | MSE: 0.02, R2: 0.55 | **MSE: 0.02, R2:0.58** | **MSE: 0.02, R2:0.58** | MSE: 0.02, R2:0.56 |

3. Removing the higher-correlation feature (with respect to AimoScore) from each symmetrical pair (same idea as in test 1, but this time the higher-correlation feature is removed).

| FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|
| MSE: 0.04, R2: 0.31| MSE: 0.04, R2: 0.32| MSE: 0.03, R2: 0.49 | MSE: 0.03, R2:0.48 |

4. Just like in test 2, FMS and NASM group features were combined, with time added later on. Then, for the symmetrical pairs of a chosen feature group, each feature's correlation to AimoScore was checked individually, and the feature with the lower correlation was selected. This is similar to test 3, but now several feature groups are combined.

| >*FMS* + NASM         | FMS + >*NASM*       | >*FMS* + >*NASM*     | >*FMS* + NASM + time | FMS + >*NASM* + time | >*FMS* + >*NASM* + time |
|:------------------:|:------------------:|:-------------------:|:-------------:|:-------------:|:-------------:|
| MSE: 0.02, R2: 0.56| MSE: 0.02, R2: 0.56 | MSE: 0.03, R2: 0.49 | MSE: 0.02, R2:0.57 | MSE: 0.03, R2:0.57 | MSE: 0.03, R2:0.52 |

5. Removing left value of symmetrical pairs

| FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|
| MSE: 0.04, R2: 0.34| MSE: 0.04, R2: 0.39| MSE: 0.03, R2: 0.49 | MSE: 0.03, R2: 0.53 |

6. Removing right value of symmetrical pairs

| FMS                | FMS + time         | NASM                | NASM + time   |
|:------------------:|:------------------:|:-------------------:|:-------------:|
| MSE: 0.04, R2: 0.35| MSE: 0.04, R2: 0.32| MSE: 0.03, R2: 0.47 | MSE: 0.03, R2:0.49 |

7. Selecting whole separate feature groups (for example, only FMS or only NASM)


| FMS                | FMS + time         | NASM                | NASM + time   | time               |
|:------------------:|:------------------:|:-------------------:|:-------------:|:------------------:|
| MSE: 0.03, R2: 0.42| MSE: 0.03, R2: 0.44| MSE: 0.03, R2: 0.56 | MSE: 0.02, R2:0.55 | MSE: 0.05, R2: 0.09 |

8. Selecting both FMS and NASM

| With time          | Without time       |
|:------------------:|:------------------:|
|**MSE: 0.02, R2: 0.60**| **MSE: 0.02, R2: 0.57**|

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import sklearn.metrics as metrics
from sklearn.preprocessing import PolynomialFeatures

In [None]:
data = pd.read_csv('data.csv', decimal=',')
data = data.sample(frac=1).reset_index(drop=True)

In [None]:
corr = data.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,AimoScore,No_1_Angle_Deviation,No_2_Angle_Deviation,No_3_Angle_Deviation,No_4_Angle_Deviation,No_5_Angle_Deviation,No_6_Angle_Deviation,No_7_Angle_Deviation,No_8_Angle_Deviation,No_9_Angle_Deviation,No_10_Angle_Deviation,No_11_Angle_Deviation,No_12_Angle_Deviation,No_13_Angle_Deviation,No_1_NASM_Deviation,No_2_NASM_Deviation,No_3_NASM_Deviation,No_4_NASM_Deviation,No_5_NASM_Deviation,No_6_NASM_Deviation,No_7_NASM_Deviation,No_8_NASM_Deviation,No_9_NASM_Deviation,No_10_NASM_Deviation,No_11_NASM_Deviation,No_12_NASM_Deviation,No_13_NASM_Deviation,No_14_NASM_Deviation,No_15_NASM_Deviation,No_16_NASM_Deviation,No_17_NASM_Deviation,No_18_NASM_Deviation,No_19_NASM_Deviation,No_20_NASM_Deviation,No_21_NASM_Deviation,No_22_NASM_Deviation,No_23_NASM_Deviation,No_24_NASM_Deviation,No_25_NASM_Deviation,No_1_Time_Deviation,No_2_Time_Deviation,EstimatedScore
AimoScore,1.0,-0.137229,-0.0845674,-0.148295,-0.134915,-0.286813,-0.307983,-0.22323,-0.309268,-0.218926,-0.374444,-0.304764,-0.328021,-0.420565,-0.286813,-0.22323,-0.137229,-0.374444,-0.420565,-0.174858,-0.123511,-0.163174,-0.282951,-0.418292,-0.465621,-0.487423,-0.0953627,-0.221256,0.00696233,-0.239067,-0.261748,-0.215797,-0.109127,-0.202482,-0.172457,-0.258476,-0.118722,-0.210732,-0.231237,-0.249069,-0.278101,-0.567916
No_1_Angle_Deviation,-0.137229,1.0,0.526053,0.156826,0.278644,-0.0943196,0.13192,-0.141623,0.341792,0.0194613,-0.119849,0.239443,-0.0582924,-0.108243,-0.0943196,-0.141623,1.0,-0.119849,-0.108243,0.172849,0.13748,0.285697,0.193416,0.230719,-0.174393,-0.197049,0.16634,-0.0262522,0.0694927,-0.129628,0.291908,0.0840014,0.143534,-0.0886336,0.0817164,-0.0470823,0.0188622,-0.111729,-0.0235144,0.0925693,0.0886082,0.138526
No_2_Angle_Deviation,-0.0845674,0.526053,1.0,0.0819099,0.295949,-0.157224,0.16852,-0.200167,0.379083,-0.0258647,-0.217131,0.263826,-0.0894711,-0.175232,-0.157224,-0.200167,0.526053,-0.217131,-0.175232,0.106545,0.193233,0.422109,0.340132,0.296792,-0.239229,-0.244733,0.133128,-0.0664893,0.0215264,-0.193771,0.275406,0.0909319,0.203849,-0.148294,0.0137677,-0.103064,-0.0494267,-0.136207,-0.0954128,1.04613e-05,-0.00841252,0.0728015
No_3_Angle_Deviation,-0.148295,0.156826,0.0819099,1.0,0.232017,0.0712337,-0.0104419,0.0808021,0.138328,0.0970242,-0.142343,0.1809,-0.121492,-0.132266,0.0712337,0.0808021,0.156826,-0.142343,-0.132266,0.0636899,0.318274,0.00753057,-0.0485685,0.10671,-0.0299264,-0.066493,0.0715381,-0.0130403,0.222346,-0.13366,0.209448,-0.0241487,0.112024,-0.0809122,0.0120533,-0.0895673,0.154096,0.0835875,-0.0453737,0.107645,0.120851,0.216311
No_4_Angle_Deviation,-0.134915,0.278644,0.295949,0.232017,1.0,0.158208,0.46394,0.0291257,0.297846,-0.0263655,-0.178719,0.18073,-0.0684913,-0.175336,0.158208,0.0291257,0.278644,-0.178719,-0.175336,0.14546,0.196992,0.390692,0.299792,0.200553,-0.133999,-0.141408,0.121261,-0.000667361,0.0288661,-0.145113,0.208907,0.347994,0.413387,-0.0728374,0.0734192,-0.0426761,-0.00613674,-0.0858466,-0.0839272,0.175187,0.18976,0.369099
No_5_Angle_Deviation,-0.286813,-0.0943196,-0.157224,0.0712337,0.158208,1.0,0.268018,0.848998,-0.143766,-0.0097709,0.239884,-0.138713,0.149705,0.248364,1.0,0.848998,-0.0943196,0.239884,0.248364,-0.149747,0.272322,-0.151285,-0.0276782,-0.139964,0.334076,0.347633,-0.131978,0.094942,0.0297004,0.222828,-0.127112,0.0839247,0.0398729,0.205016,-0.00109813,0.108782,0.0455978,0.10995,0.218649,0.288958,0.315246,0.689959
No_6_Angle_Deviation,-0.307983,0.13192,0.16852,-0.0104419,0.46394,0.268018,1.0,0.228434,0.216347,0.0438095,0.168692,0.277655,0.143035,0.172476,0.268018,0.228434,0.13192,0.168692,0.172476,0.205107,0.0567786,0.222688,0.201577,0.216239,0.151106,0.138564,0.107768,0.0960304,-0.0498602,0.118664,0.16744,0.34282,0.440983,0.100343,0.159851,0.0955287,0.0376601,-0.000328095,0.0827242,0.1939,0.213439,0.511013
No_7_Angle_Deviation,-0.22323,-0.141623,-0.200167,0.0808021,0.0291257,0.848998,0.228434,1.0,-0.21562,-0.0311628,0.2284,-0.167438,0.113435,0.24197,0.848998,1.0,-0.141623,0.2284,0.24197,-0.17433,0.261849,-0.215463,-0.0795849,-0.175615,0.285227,0.295649,-0.18536,0.044841,0.0442029,0.204285,-0.150662,0.0130505,-0.0435924,0.182664,0.00631514,0.123755,0.0421181,0.108528,0.239889,0.219362,0.2387,0.613191
No_8_Angle_Deviation,-0.309268,0.341792,0.379083,0.138328,0.297846,-0.143766,0.216347,-0.21562,1.0,0.137868,-0.068544,0.46174,0.0382392,-0.0243199,-0.143766,-0.21562,0.341792,-0.068544,-0.0243199,0.221944,0.0795513,0.315918,0.265424,0.4157,-0.0305363,-0.0408819,0.251418,0.0208244,-0.0326504,-0.102559,0.406525,0.163858,0.150484,-0.0378346,0.119555,0.0120193,0.0204073,-0.0639324,-0.0155189,0.0777021,0.0908213,0.195739
No_9_Angle_Deviation,-0.218926,0.0194613,-0.0258647,0.0970242,-0.0263655,-0.0097709,0.0438095,-0.0311628,0.137868,1.0,0.252674,0.17987,0.291181,0.228994,-0.0097709,-0.0311628,0.0194613,0.252674,0.228994,0.137804,-0.0656642,-0.00611663,0.00341364,0.0659822,0.2214,0.217952,0.404833,0.200127,0.248033,0.143982,0.101198,-0.00283088,0.0140751,0.0690067,0.00327573,0.055537,0.123237,0.220221,0.0187314,0.0529871,0.0535534,0.113559


In [None]:
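def poly(arr, n):
    # Build the right-hand side of an smf.ols() formula: the plain
    # feature names followed by every feature raised to each power
    # from 1 to n. `arr` is an iterable of column names (iterating
    # over a DataFrame yields its column names).
    # Note: np.power(f, 1) repeats the plain term f, so the linear
    # columns appear twice in the design matrix.
    features = '+'.join(arr)
    for x in range(1, n + 1):
        for f in arr:
            features += '+np.power({0}, {1})'.format(f, x)
    return features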


In [None]:
# Split training and validation set 70%-30%
train_set, validation_set = train_test_split(data, test_size=0.3)

In [None]:
# Select all features groups (FMS, NASM and time)
selected_features = poly(data.iloc[:, 1:41], 2)
reg = smf.ols('AimoScore ~ ' + selected_features, data=train_set).fit()
reg.summary()

0,1,2,3
Dep. Variable:,AimoScore,R-squared:,0.686
Model:,OLS,Adj. R-squared:,0.67
Method:,Least Squares,F-statistic:,43.51
Date:,"Sat, 22 Feb 2020",Prob (F-statistic):,2.3499999999999998e-298
Time:,22:29:46,Log-Likelihood:,877.17
No. Observations:,1465,AIC:,-1612.0
Df Residuals:,1394,BIC:,-1237.0
Df Model:,70,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.1251,0.622,3.419,0.001,0.906,3.344
No_1_Angle_Deviation,-0.0091,0.053,-0.172,0.864,-0.113,0.095
No_2_Angle_Deviation,-0.0142,0.062,-0.230,0.818,-0.135,0.107
No_3_Angle_Deviation,-0.1699,0.139,-1.225,0.221,-0.442,0.102
No_4_Angle_Deviation,0.1222,0.080,1.525,0.127,-0.035,0.279
No_5_Angle_Deviation,-0.0371,0.046,-0.799,0.425,-0.128,0.054
No_6_Angle_Deviation,0.0725,0.070,1.042,0.297,-0.064,0.209
No_7_Angle_Deviation,0.0461,0.051,0.907,0.365,-0.054,0.146
No_8_Angle_Deviation,0.3856,0.336,1.149,0.251,-0.273,1.044

0,1,2,3
Omnibus:,68.843,Durbin-Watson:,2.013
Prob(Omnibus):,0.0,Jarque-Bera (JB):,104.916
Skew:,-0.405,Prob(JB):,1.6499999999999998e-23
Kurtosis:,4.031,Cond. No.,1.04e+16


In [None]:
smf_pred = reg.predict(validation_set)
mse = mean_squared_error(validation_set.iloc[:, 0], smf_pred)
mae = metrics.mean_absolute_error(validation_set.iloc[:, 0], smf_pred)
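# Note: np.mean() makes this the mean of the squared residuals, so it equals
# the MSE above; a true residual sum of squares would use np.sum() instead.
rss = np.mean((smf_pred - validation_set.iloc[:, 0]) ** 2)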
print('Mean squared error: %.2f' % mse)
print('Mean absolute error: %.2f' % mae)
print('Residual sum of squares: %.2f' % rss)

Mean squared error: 0.02
Mean absolute error: 0.11
Residual sum of squares: 0.02


In [None]:
# Split training and validation set 80%-20%
train_set, validation_set = train_test_split(data, test_size=0.2)

In [None]:
# Select all features groups (FMS, NASM and time)
selected_features = poly(data.iloc[:, 1:41], 2)
reg = smf.ols('AimoScore ~ ' + selected_features, data=train_set).fit()
reg.summary()

0,1,2,3
Dep. Variable:,AimoScore,R-squared:,0.691
Model:,OLS,Adj. R-squared:,0.677
Method:,Least Squares,F-statistic:,51.17
Date:,"Sat, 22 Feb 2020",Prob (F-statistic):,0.0
Time:,22:32:07,Log-Likelihood:,1012.0
No. Observations:,1675,AIC:,-1882.0
Df Residuals:,1604,BIC:,-1497.0
Df Model:,70,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.0581,0.577,3.568,0.000,0.927,3.189
No_1_Angle_Deviation,0.0597,0.050,1.196,0.232,-0.038,0.158
No_2_Angle_Deviation,-0.0564,0.057,-0.987,0.324,-0.169,0.056
No_3_Angle_Deviation,-0.2913,0.130,-2.234,0.026,-0.547,-0.036
No_4_Angle_Deviation,0.1531,0.074,2.058,0.040,0.007,0.299
No_5_Angle_Deviation,0.0163,0.042,0.388,0.698,-0.066,0.099
No_6_Angle_Deviation,0.0782,0.065,1.209,0.227,-0.049,0.205
No_7_Angle_Deviation,-0.0105,0.047,-0.227,0.821,-0.102,0.081
No_8_Angle_Deviation,-0.0080,0.298,-0.027,0.979,-0.592,0.576

0,1,2,3
Omnibus:,59.073,Durbin-Watson:,2.014
Prob(Omnibus):,0.0,Jarque-Bera (JB):,80.853
Skew:,-0.359,Prob(JB):,2.77e-18
Kurtosis:,3.803,Cond. No.,1.04e+16


In [None]:
smf_pred = reg.predict(validation_set)
mse = mean_squared_error(validation_set.iloc[:, 0], smf_pred)
mae = metrics.mean_absolute_error(validation_set.iloc[:, 0], smf_pred)
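# Note: np.mean() makes this the mean of the squared residuals, so it equals
# the MSE above; a true residual sum of squares would use np.sum() instead.
rss = np.mean((smf_pred - validation_set.iloc[:, 0]) ** 2)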
print('Mean squared error: %.2f' % mse)
print('Mean absolute error: %.2f' % mae)
print('Residual sum of squares: %.2f' % rss)

Mean squared error: 0.02
Mean absolute error: 0.11
Residual sum of squares: 0.02


## **Architecture and Design**

![Architecture](https://i.imgur.com/Nvf4xcK.jpg)

**ML core**  
This part includes the steps of a robust machine learning pipeline:  
•	Data preprocessing – a module that processes the raw data before it is fed to the model.  
•	Model selection – the architecture is designed so that the implemented model can easily be replaced with a new one. This speeds up experiments and makes it more likely that the final system is accurate.  
•	Model training – defined behind a common interface so that parameters can easily be changed for new experiments.  
•	Model estimator – the module that loads the model, keeps it in RAM in the background and makes the predictions. Because the model is loaded only once on the server, inference is faster.  
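
The load-once behaviour of the model estimator can be sketched as below. The class name, the pickled list of weights standing in for a trained model, and the `load_count` attribute are assumptions for the illustration, not the project's implementation.

```python
import pickle
import tempfile
from pathlib import Path


class ModelEstimator:
    """Loads a pickled model once and reuses it for every prediction."""

    def __init__(self, model_path):
        self._model_path = Path(model_path)
        self._model = None
        self.load_count = 0  # only here to make the single load observable

    def _get_model(self):
        # Read from disk on first use; afterwards serve the copy held in RAM.
        if self._model is None:
            with self._model_path.open("rb") as handle:
                self._model = pickle.load(handle)
            self.load_count += 1
        return self._model

    def predict(self, features):
        # Stand-in model: a weighted sum of the features.
        weights = self._get_model()
        return sum(w * x for w, x in zip(weights, features))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    path.write_bytes(pickle.dumps([0.5, 0.25]))
    estimator = ModelEstimator(path)
    first = estimator.predict([1.0, 2.0])   # triggers the single load
    second = estimator.predict([2.0, 4.0])  # served from memory

print(first, second, estimator.load_count)
```

In a web service, one such estimator object would be created at start-up and shared by all requests, so the cost of reading the model file is paid once.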

**REST API**  
It serves as a wrapper around the score-predicting model and as the connector to the front end. A Flask-based REST API was chosen because it is fast to develop, can be updated iteratively and leaves more time for the experimental part of the project; it also scales well enough for small to mid-sized projects.

**Front End**  
Includes the interface and functions that let the user interact with the system. It consists of the following pages:   
1. Home page – serves as the intro to the product site.  
2. Evaluation – takes the user’s document, which contains the features describing how the person performed the exercise. As output, the user receives an expert score between 0 and 1 that indicates how accurately the technique was executed: 0 is the worst score, 1 the best.  

**Config**  
Config is a separate file that contains all the parameters the system needs to operate properly. Editing it changes the behavior of the system.  

**Deployment**  
The project was wrapped in a Docker container to make the system independent of the hardware on which it was developed and to make it scalable. An AWS server with minimum hardware specifications was chosen for Sprint #1, as it is optimal for this version of the product.


## **Devops process**

<h3>Libraries</h3>

**easydict==1.9**  
EasyDict allows dict values to be accessed as attributes (works recursively), giving Python dicts a JavaScript-like dot notation for properties.  
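
To show what recursive attribute access means, the sketch below uses a tiny stand-in class rather than EasyDict itself, so that it runs without extra packages. The class `AttrDict` and the configuration keys are hypothetical.

```python
class AttrDict(dict):
    """Tiny stand-in for EasyDict: nested dict values readable as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .items() keep working.
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        return AttrDict(value) if isinstance(value, dict) else value


config = AttrDict({"model": {"path": "models/model.pkl"}, "port": 5000})
print(config.model.path, config.port)
```

Dot notation keeps configuration lookups short, while the object still behaves like a regular dict (for example `config["port"]`).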

**pandas==0.25.1**  
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

**numpy==1.18.1**  
NumPy is the fundamental package for scientific computing with Python. It contains, among other things:  
•	a powerful N-dimensional array object  
•	sophisticated (broadcasting) functions  
•	tools for integrating C/C++ and Fortran code  
•	useful linear algebra, Fourier transform, and random number capabilities  

**scikit-learn==0.22.1**  
scikit-learn provides the train/test split and the evaluation metrics used in the experiments.

**pathlib==1.0.1**  
pathlib offers a set of classes to handle filesystem paths. It offers the following advantages over using string objects:  
•	No more cumbersome use of os and os.path functions. Everything can be done easily through operators, attribute accesses, and method calls.  
•	Embodies the semantics of different path types. For example, comparing Windows paths ignores casing.  
•	Well-defined semantics, eliminating any warts or ambiguities (forward vs. backward slashes, etc.).
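
A few of these points can be seen in a short example. The file names are made up; pure path classes are used so that the output does not depend on the operating system. In Python 3.4 and later, `pathlib` is part of the standard library.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Paths are built with the / operator instead of os.path.join().
model_path = PurePosixPath("models") / "sprint3" / "model.pkl"

# Components are available as attributes.
print(model_path.name)    # file name with extension
print(model_path.suffix)  # extension only
print(model_path.parent)  # containing directory

# Path flavours carry their own semantics: Windows paths compare
# case-insensitively and treat both slash directions alike.
same_on_windows = PureWindowsPath("Data.CSV") == PureWindowsPath("data.csv")
same_on_posix = PurePosixPath("Data.CSV") == PurePosixPath("data.csv")
same_slashes = PureWindowsPath("a/b") == PureWindowsPath("a\\b")
print(same_on_windows, same_on_posix, same_slashes)
```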

**flask==1.1.1**  
Flask is a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications.


## **User's roadmap**

![Flowchart](https://i.imgur.com/Oo5xhso.jpg)


## **Design rules**
The following design rules have been set up and will be followed throughout the entire project.

**1. Small classes:** 
- For readability reasons small classes are implemented.

**2. Single responsibility:**
- For classes and methods single responsibility is implemented. The reason for this is to keep the classes and methods focused on a single concern, which results in making the class or method more robust.

**3. High cohesion:**
- Achieved by keeping parts of a code base that are related to each other in a single place.

**4. Low coupling:**
- Achieved by separating unrelated parts of the code base as much as possible.

## **Coding rules** 
The following style guide for python code has been set up in order to set certain coding rules which will be followed throughout the project.

<h3>pep8</h3>

**1. Indentation:**  
- 4 spaces are used per indentation level.
- Continuation lines align wrapped elements vertically (using Python's implicit line joining inside brackets or parentheses) or use hanging indents.

**2. Spaces vs Tabs:**
- The preferred indentation method is Spaces.

**3. Maximum Line Length:** 
- All lines limited to a maximum of 79 characters.
- Wrap long lines using Python's implied line continuation inside parentheses, brackets and braces. Wrapping an expression in parentheses allows a long line to be broken over multiple lines.

**4. Blank Lines:**
- Top-level function and class definitions are surrounded with two blank lines
- Method definitions are surrounded by a single blank line (method definitions inside a class).

**5. Whitespaces:**
- Surround binary operators with a single whitespace on both sides.
- Never: trailing whitespace; more than one space around assignments or operators; whitespace immediately before the opening parenthesis / comma and after the closing parenthesis; whitespace immediately inside parentheses / brackets / braces.

**6. Imports:**
- All imports are on separate lines.
- Put at the top of the file, before module globals and constants.

**7. Comments:**
- Inline comments are separated by two spaces from the code and start with a # together with a single space.
- Never: obvious or contradicting comments; commented out code.

**8. Naming Convention:**
- Intention-revealing and pronounceable names
- Nouns for class names
- Verbs for method names  
</br>
</br>

<h3>Setup of automated formatting in PyCharm for simple rules:</h3>

**1. General autoformat (like spaces):**
- Windows: ctrl + alt + l
- macOS: Alt + ⌘ + L

**2. Configuration edit to break long lines:**  
- file -> settings -> editor -> python:  
- wrapping and braces: Hard Wrap at 79  
- wrapping and braces: ensure right margin is not exceeded [x]  

Now, with ctrl + alt + l, long lines will be formatted automatically, and running `pep8 ml-project-first` or `pep8 app.py` will give no warnings.

Server URL: http://ec2-13-48-176-197.eu-north-1.compute.amazonaws.com:5000/