In [1]:
# Import useful libraries
import ipywidgets
import numpy as np
import pandas as pd
import utils  # local helper module (utils.py)
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler

# **Machine Learning Models**

The aim of the project is to create a Machine Learning model capable of detecting the dates on which a crop field has been manured, using satellite data. <br>
It is now time to create and compare different models, using the previously created feature-extracted datasets together with the results discussed in the [analysis notebook](../2-analysis/notebook.ipynb).

In [2]:
s2_df = pd.read_csv('../../Datasets/main/main-fields-s2-features-extracted.gz', compression='gzip')
s1_df = pd.read_csv('../../Datasets/main/main-fields-s1-features-extracted.gz', compression='gzip')

## **Modify original DataFrames**

The purpose of building these `DataFrames` is to detect whether manure has been applied to a particular crop field, by analyzing time series of Sentinel satellite acquisitions (for the training samples, the dates on which manure was applied are known).

The resulting `DataFrames` (one for *Sentinel-1* and one for *Sentinel-2*) consist of rows, each corresponding to a pair of consecutive satellite acquisitions (`consequent_xx_acquisitions`) over a specific crop field. <br>
Within a row, each feature column holds the difference between the two consecutive acquisitions for the corresponding band or index. The `y` column is a binary indicator of whether any manure application date (`manure_dates`) falls within the two consecutive acquisition dates being considered.
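The row construction described above can be illustrated with a small sketch. It is not the code of `utils.get_modified_df`: the function `label_interval`, the column name `index_value` and the toy values are hypothetical. The treatment of the interval boundaries (start excluded, end included) is also an assumption, chosen only because the tables in this notebook label an interval as positive when the manure date equals its end date.

```python
import pandas as pd

def label_interval(start, end, manure_dates):
    """Return 1.0 if any manure date falls in (start, end], else 0.0 (boundary rule is an assumption)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    return float(any(start < pd.Timestamp(d) <= end for d in manure_dates))

# Toy time series for one field: three acquisitions of a single (hypothetical) index
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-06", "2022-05-26", "2022-06-05"]),
    "index_value": [0.60, 0.25, 0.30],
})

# Feature: difference between consecutive acquisitions
diffs = toy["index_value"].diff().iloc[1:].tolist()

# Target: does a manure date fall inside each pair of consecutive acquisitions?
intervals = list(zip(toy["date"].iloc[:-1], toy["date"].iloc[1:]))
labels = [label_interval(s, e, ["2022-05-26"]) for s, e in intervals]

print(list(zip(diffs, labels)))
```

Each printed pair is one row of the modified dataset: a difference value together with its binary target.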

In [3]:
# Sentinel-2
s2_df_mod = utils.get_modified_df(s2_df, satellite='s2')
s2_df_mod

Unnamed: 0,crop_field_name,consequent_s2_acquisitions,B1,B2,B3,B4,B5,B6,B7,B8,...,CARI2,MCARI,MCARI1,MCARI2,BSI,GLI,ALTERATION,SDI,manure_dates,y
0,P-BLD,"[2022-01-06, 2022-01-16]",11.525510,-8.440476,-21.562925,-9.930272,6.205782,124.039116,220.292517,193.943878,...,-14.821130,214.161369,275.431837,-0.001536,262.245588,0.004881,-0.003641,151.874150,['2022-05-26'],0.0
1,P-BLD,"[2022-01-16, 2022-01-26]",1078.435374,1062.938776,1057.346939,1060.000000,1126.557823,1147.452381,1167.767007,1123.741497,...,3564.075800,-2129.186864,87.648980,-0.175677,4166.625746,-0.391301,-0.629555,163.807823,['2022-05-26'],0.0
2,P-BLD,"[2022-01-26, 2022-01-31]",2484.477891,2260.610544,1760.287415,1861.486395,1280.738095,-483.047619,-834.066327,-886.345238,...,2216.904547,-1006.267915,-4114.747959,-0.406936,3935.251609,-0.118245,-0.277985,-1866.045918,['2022-05-26'],0.0
3,P-BLD,"[2022-01-31, 2022-02-05]",-2304.959184,-2102.022109,-1548.894558,-1711.838435,-1013.522109,1331.586735,1878.299320,1974.547619,...,-1508.152782,1195.170656,5562.588367,0.420944,-1925.189004,0.121698,0.329274,2617.175170,['2022-05-26'],0.0
4,P-BLD,"[2022-02-05, 2022-02-10]",1566.938776,1633.841837,1222.882653,1338.595238,744.566327,-1291.680272,-1764.062925,-1971.335034,...,1402.703367,-1072.556874,-4946.811224,-0.339269,371.829866,-0.104554,-0.321017,-2013.052721,['2022-05-26'],0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
821,P-VNS,"[2022-09-08, 2022-10-08]",5606.143498,5374.345291,4764.394619,4737.530942,4444.085202,2348.694170,1702.821525,1609.651121,...,5029.973432,-454.213280,-4462.239605,-0.604684,14214.068737,-0.097813,-0.469385,-1611.464574,['2022-04-23'],0.0
822,P-VNS,"[2022-10-08, 2022-10-23]",-2735.709417,-2652.242152,-2654.146188,-2698.858296,-2656.408072,-2181.641256,-2108.233184,-1965.759641,...,-3804.884046,63.383846,1125.412951,0.200635,-9480.670199,0.002424,0.019581,111.947085,['2022-04-23'],0.0
823,P-VNS,"[2022-10-23, 2022-11-12]",-2616.098655,-2423.871749,-1739.669955,-1393.901345,-1126.258296,-380.855605,-177.975785,-130.049327,...,3243.365795,262.395659,1280.547874,0.190679,-3448.417006,0.046757,0.294226,185.921973,['2022-04-23'],0.0
824,P-VNS,"[2022-11-12, 2022-11-17]",-310.511211,-237.513004,-243.252018,-341.180269,-277.982960,26.653812,128.503139,97.081614,...,-1977.622835,150.359500,783.865184,0.098095,-763.198817,0.022308,0.055699,403.480717,['2022-04-23'],0.0


In [4]:
# Sentinel-1
s1_df_mod = utils.get_modified_df(s1_df, satellite='s1')
s1_df_mod

Unnamed: 0,crop_field_name,consequent_s1_acquisitions,VV,VH,AVE,DIF,RAT1,RAT2,NDI,RVI,manure_dates,y
0,P-BLD,"[2022-01-08, 2022-01-20]",-1.158187,0.891477,-0.133355,-2.049664,0.089755,-0.183735,0.062131,-0.124261,['2022-05-26'],0.0
1,P-BLD,"[2022-01-20, 2022-02-01]",0.630822,-0.955744,-0.162461,1.586566,-0.065934,0.130239,-0.044994,0.089987,['2022-05-26'],0.0
2,P-BLD,"[2022-02-01, 2022-02-13]",-0.235273,-0.045906,-0.140590,-0.189366,0.009996,-0.021348,0.007047,-0.014095,['2022-05-26'],0.0
3,P-BLD,"[2022-02-13, 2022-02-25]",-0.879335,-0.561385,-0.720360,-0.317949,0.023471,-0.047771,0.016224,-0.032448,['2022-05-26'],0.0
4,P-BLD,"[2022-02-25, 2022-03-09]",0.305355,1.617982,0.961669,-1.312628,0.043798,-0.081217,0.029115,-0.058230,['2022-05-26'],0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
929,P-VNS,"[2022-10-23, 2022-11-04]",1.302458,0.523352,0.912905,0.779106,-0.056724,0.134730,-0.041709,0.083419,['2022-04-23'],0.0
930,P-VNS,"[2022-11-04, 2022-11-16]",-0.525650,-1.262530,-0.894090,0.736880,-0.014387,0.038177,-0.011047,0.022094,['2022-04-23'],0.0
931,P-VNS,"[2022-11-16, 2022-11-28]",1.401747,1.776336,1.589041,-0.374589,-0.020007,0.056202,-0.015695,0.031391,['2022-04-23'],0.0
932,P-VNS,"[2022-11-28, 2022-12-10]",-1.446950,-2.677388,-2.062169,1.230438,-0.006568,0.019296,-0.005239,0.010479,['2022-04-23'],0.0


## **Balancing the DataFrames**
Making a dataset balanced is important when dealing with binary classification problems because it ensures that the model is not biased towards either of the classes.

In a binary classification problem (like ours), the goal is to predict the correct label for each sample, which can be either $0$ or $1$. However, if the dataset is imbalanced, meaning that one class has significantly more samples than the other, the model may learn to always predict the majority class, even if the minority class is actually the correct label.

This is especially problematic if the minority class is the one that we are more interested in identifying, such as in cases of fraud detection or rare disease diagnosis. In such cases, a model that is biased towards the majority class would be of little use.

Therefore, balancing the dataset by increasing the number of samples in the minority class or decreasing the number of samples in the majority class can help improve the accuracy of the model and reduce the chances of bias. This can be done through techniques such as `undersampling` or `oversampling`.
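As a minimal sketch of random undersampling, the function below drops majority-class rows until both classes have the same size. The name `undersample` and the toy data are hypothetical, and the project's own `utils.get_balanced_df` may choose the rows to keep differently (for instance, field by field).

```python
import pandas as pd

def undersample(df, target="y", random_state=10):
    """Keep, for every class, as many randomly chosen rows as the smallest class has."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    return pd.concat(parts).sort_index().reset_index(drop=True)

# Toy imbalanced dataset: 8 negatives, 2 positives
toy = pd.DataFrame({"x": range(10), "y": [0.0] * 8 + [1.0] * 2})
balanced = undersample(toy)
counts = balanced["y"].value_counts().to_dict()
print(counts)
```

Undersampling discards information from the majority class, which is acceptable when, as here, the minority class is very small and duplicating its samples would add no new information.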

In [5]:
# Sentinel-2
s2_df_mod_bal = utils.get_balanced_df(s2_df_mod, method='under', random_state=10)
s2_df_mod_bal

Unnamed: 0,crop_field_name,consequent_s2_acquisitions,B1,B2,B3,B4,B5,B6,B7,B8,...,CARI2,MCARI,MCARI1,MCARI2,BSI,GLI,ALTERATION,SDI,manure_dates,y
0,P-BLD,"[2022-03-27, 2022-04-06]",-253.076531,-234.619048,-175.443878,-93.348639,60.715986,-101.731293,-172.363946,-110.688776,...,821.288132,228.081261,-153.038367,-0.005311,-167.12624,0.004451,0.031185,-253.591837,['2022-05-26'],0.0
1,P-BLD,"[2022-05-06, 2022-05-26]",289.886054,335.82483,201.591837,735.891156,398.739796,-1703.545918,-2127.702381,-2062.307823,...,5738.416867,-771.599601,-4862.913469,-0.393025,563.443952,-0.112486,-0.159643,-3466.957483,['2022-05-26'],1.0
2,P-BLLT1,"[2022-05-06, 2022-05-31]",79.674051,162.807753,104.986551,196.462816,118.330696,-719.818038,-1032.433544,-1109.239715,...,1056.706152,-180.465971,-2022.91462,-0.093796,-98.555306,-0.027062,0.016326,-1542.100475,['2022-05-16'],1.0
3,P-BLLT1,"[2022-07-15, 2022-07-20]",160.143196,165.589399,40.116297,20.627373,-51.699367,-158.123418,-149.814082,-188.129747,...,-415.94411,-95.232165,-270.207532,-0.013968,122.433477,-0.017761,-0.012081,-296.157437,['2022-05-16'],0.0
4,P-BLLT2,"[2022-05-06, 2022-05-31]",110.491453,231.299145,207.799145,369.918803,315.145299,-470.863248,-767.512821,-827.431624,...,2546.149145,-170.030773,-1977.091282,-0.139965,623.28048,-0.034264,0.015565,-1411.042735,['2022-05-26'],1.0
5,P-BLLT2,"[2022-07-15, 2022-07-20]",144.337607,162.867521,42.529915,4.995726,-39.230769,-89.576923,-63.504274,-87.068376,...,-572.922809,-45.645903,-74.018974,-0.001734,214.956271,-0.012311,-0.021819,-228.722222,['2022-05-26'],0.0
6,P-CBRCS1,"[2022-03-27, 2022-04-06]",-380.070981,-412.482255,-307.4238,-403.820459,-116.997912,572.19833,700.480167,756.098121,...,-932.994954,593.991382,1820.661545,0.120347,66.18538,0.052827,0.102936,811.411273,['2022-05-26'],0.0
7,P-CBRCS1,"[2022-05-06, 2022-05-26]",99.931106,106.438413,29.59499,171.755741,62.016701,-767.453027,-961.979123,-953.956159,...,1024.921474,-283.062827,-1842.795908,-0.087669,-252.311687,-0.042444,-0.017196,-1263.488518,['2022-05-26'],1.0
8,P-CBRCS2,"[2022-03-27, 2022-04-06]",-401.345982,-385.428571,-324.080357,-289.220982,-181.819196,-58.142857,-34.004464,24.892857,...,-369.197961,161.944643,397.943304,0.057863,-421.515077,0.014136,0.068701,-33.450893,['2022-05-26'],0.0
9,P-CBRCS2,"[2022-05-06, 2022-05-26]",131.178571,99.785714,52.341518,163.837054,63.770089,-449.424107,-560.984375,-548.002232,...,941.666179,-207.597343,-1198.981607,-0.088839,217.153036,-0.02958,-0.035077,-956.986607,['2022-05-26'],1.0


In [6]:
# Sentinel-1
s1_df_mod_bal = utils.get_balanced_df(s1_df_mod, method='under', random_state=10)
s1_df_mod_bal

Unnamed: 0,crop_field_name,consequent_s1_acquisitions,VV,VH,AVE,DIF,RAT1,RAT2,NDI,RVI,manure_dates,y
0,P-BLD,"[2022-04-02, 2022-04-14]",-1.115553,-1.778969,-1.447261,0.663417,-0.011032,0.019916,-0.007252,0.014504,['2022-05-26'],0.0
1,P-BLD,"[2022-05-20, 2022-06-01]",0.251519,-0.011654,0.119933,0.263173,-0.012674,0.023108,-0.008366,0.016733,['2022-05-26'],1.0
2,P-BLLT1,"[2022-04-02, 2022-04-14]",-0.870915,-0.948468,-0.909692,0.077553,0.015179,-0.037681,0.01136,-0.022721,['2022-05-16'],0.0
3,P-BLLT1,"[2022-05-08, 2022-05-20]",1.620705,1.94239,1.781547,-0.321686,-0.012321,0.02363,-0.008309,0.016618,['2022-05-16'],1.0
4,P-BLLT2,"[2022-04-02, 2022-04-14]",-0.943051,-0.657796,-0.800423,-0.285255,0.023262,-0.049862,0.016424,-0.032847,['2022-05-26'],0.0
5,P-BLLT2,"[2022-05-20, 2022-06-01]",-0.428368,-0.350908,-0.389638,-0.07746,0.009377,-0.019689,0.006565,-0.01313,['2022-05-26'],1.0
6,P-CBRCS1,"[2022-04-02, 2022-04-14]",-1.558735,-1.164879,-1.361807,-0.393857,0.040925,-0.087761,0.028895,-0.05779,['2022-05-26'],0.0
7,P-CBRCS1,"[2022-05-20, 2022-06-01]",1.57965,1.175115,1.377383,0.404535,-0.038201,0.073015,-0.025721,0.051443,['2022-05-26'],1.0
8,P-CBRCS2,"[2022-04-02, 2022-04-14]",-1.333557,-2.568303,-1.95093,1.234745,-0.015689,0.039544,-0.011811,0.023623,['2022-05-26'],0.0
9,P-CBRCS2,"[2022-05-20, 2022-06-01]",-0.648566,-0.571513,-0.610039,-0.077053,0.014684,-0.032982,0.010565,-0.02113,['2022-05-26'],1.0


## **Models (for now just using Sentinel-2 data)**

Now you must select which set of `optical indexes` will be used in the models, the number of `KFolds` and finally the `Scaler` to use.

### Optical features subset selection
**[Wrapper methods](https://towardsdatascience.com/feature-selection-for-machine-learning-in-python-wrapper-methods-2b5e27d2db31)**. <br>
> *"Evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance. These procedures are normally built after the concept of Greedy Search technique (or algorithm). A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage."* - Cit. Jack Yee Tan

In the analysis notebook we already ranked the optical features by how strongly they are affected by manure application.<br>
The objective now is to consider a subset of them (and to modify this subset) in order to see which ones most improve the overall performance, without running into [overfitting or underfitting](https://www.v7labs.com/blog/overfitting-vs-underfitting) issues.
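In this notebook the subset is chosen by hand through the widget below. For comparison, a greedy wrapper method can be sketched with scikit-learn's `SequentialFeatureSelector`. The synthetic data, the estimator and the number of features to keep are assumptions made for illustration only, not the procedure used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix (6 candidate features, 2 informative)
X_toy, y_toy = make_classification(
    n_samples=60, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# Forward selection: start from no features and greedily add the one that
# most improves the cross-validated score, until 2 features are selected
selector = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X_toy, y_toy)

selected = selector.get_support(indices=True)
print("Selected feature indices:", selected)
```

Setting `direction="backward"` would instead start from all features and greedily remove them.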

In [7]:
s2_features_widget = ipywidgets.SelectMultiple(options=s2_df.select_dtypes(include=np.number).columns, value=['EOMI3', 'SCI', 'EOMI1', 'SDI'], description='Features')
s2_features_widget

SelectMultiple(description='Features', index=(40, 48, 38, 60), options=('B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B…

In [8]:
print('Selected features list: ' + str(list(s2_features_widget.value)))

Selected features list: ['EOMI3', 'SCI', 'EOMI1', 'SDI']


### K selection (for KFold Cross-Validation)
KFold cross validation is a method for evaluating the performance of a machine learning model by dividing the data into `K` subsets or folds. The algorithm is trained on `K-1` folds and tested on the remaining fold, and this process is repeated `K` times, with each fold used as the test set exactly once. The performance metric is then averaged across all `K` folds to provide an estimate of the model's performance on unseen data.

KFold cross validation is used instead of a single train/test split because it provides a more reliable estimate of the model's performance. With a single split, the estimate depends heavily on which samples happen to end up in the test set, so it can be overly optimistic or pessimistic, especially on a small dataset. KFold cross validation helps to **mitigate this bias** by using every sample for both training and testing (in different folds).

Choosing the right `K` value is important because it affects the reliability of the performance estimate. A value that is too low leaves a smaller share of the data for training in each split, so the estimate tends to be pessimistically biased. A value that is too high, on the other hand, reduces this bias but makes each test fold very small and the training sets nearly identical, which tends to increase the variance of the estimate and the computational cost.

In general, a `K` value of $5$ or $10$ is often used for KFold cross validation, although the optimal value may depend on the size and complexity of the dataset. It is important to experiment with different values and evaluate the performance of the model on different folds to choose the optimal one.

**Why opt for Stratified KFold?** <br>
Standard KFold randomly splits the data into `K` folds without considering the class distribution. This can produce a fold that contains a disproportionate number of instances from one class, which makes the metrics measured on that fold unreliable. In contrast, Stratified KFold ensures that each fold contains a proportional number of instances from each class. This is achieved by dividing the data into folds while preserving the percentage of samples for each class.

Choosing the largest selectable value of `K` brings the procedure close to [*Leave One Out Cross-Validation*](https://www.baeldung.com/cs/cross-validation-k-fold-loo), since each test fold then holds as few samples as possible.
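The difference between the two splitters can be checked on a toy balanced target. This is only an illustrative sketch: the arrays and the `random_state` value are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy balanced target: 10 negatives followed by 10 positives
y_toy = np.array([0] * 10 + [1] * 10)
X_toy = np.zeros((20, 1))  # the features are irrelevant for the split itself

kf = KFold(n_splits=5, shuffle=True, random_state=3)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

# Count how many positives land in each test fold
plain_counts = [int(y_toy[test].sum()) for _, test in kf.split(X_toy, y_toy)]
strat_counts = [int(y_toy[test].sum()) for _, test in skf.split(X_toy, y_toy)]

print("Positives per test fold (KFold):          ", plain_counts)
print("Positives per test fold (StratifiedKFold):", strat_counts)
```

With `StratifiedKFold` every test fold of four samples contains exactly two positives, whereas plain `KFold` gives no such guarantee.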

In [9]:
kfolds_widget = ipywidgets.IntSlider(value=5, min=1, max=len(s2_df.crop_field_name.unique()), step=1, description='KFolds')
kfolds_widget

IntSlider(value=5, description='KFolds', max=29, min=1)

### Normalization method selection
Normalization is typically needed before building machine learning models because many machine learning algorithms are sensitive to the scale of the input features. Normalization can help to improve the performance of machine learning models by ensuring that the input features have a similar scale.
**It is very important to compute the normalization statistics on the training data only and to reuse them to normalize the validation and test data.**

Here are some reasons why normalization is important in machine learning:
* **Better performances:** Normalization can help to improve the performance of machine learning algorithms. Some machine learning algorithms are based on distance metrics that are affected by the scale of the input features. If the input features have different scales, the algorithm may be biased towards features with larger scales, leading to suboptimal performance.
* **Faster convergence:** Normalization can help machine learning algorithms to converge more quickly. Some optimization algorithms, such as gradient descent, converge faster when the input features have a similar scale.
* **Improved interpretability:** Normalization can improve the interpretability of machine learning models. When the input features have vastly different scales, it can be difficult to interpret the coefficients of the model or the importance of the features.

There are several normalization techniques that can be used, each one with its own advantages and drawbacks.

#### Min-Max scaling
One of the simplest normalization techniques consists of scaling the data so that all features take values in the same range, typically between $0$ and $1$ (using a different range is trivial). <br>
This technique is called min-max scaling and is based on computing the minimum $m_j$ and the maximum $M_j$ of each feature ($j=0,1,...,n-1$). Each feature of an input feature vector $x$ is then normalized by applying the following linear scaling:
 * $$\bar{x}_j = \frac{x_j - m_j}{M_j - m_j}$$

This ensures that all the training features assume values between $0$ and $1$. Note that for data outside the training set normalized values may still fall outside the $[0, 1]$ range.
Sometimes the transformation performed by min-max scaling degenerates due to the presence of a few unrepresentative outliers in the training data. A single very large (or very small) value causes the compression of all the others. <br>
*This scaling algorithm works very well when the standard deviation is very small, or when the data do not follow a Gaussian distribution.*

#### Mean-Var scaling
Mean-var scaling is another normalization technique that is less exposed to this drawback, since it does not depend on the extreme values alone. In mean-var scaling each feature is linearly scaled to have zero mean and unit variance. <br>
Training data is used to compute the mean $\mu_j$ and the standard deviation $\sigma_j$ of each feature. Then the components of a given feature vector $x$ are normalized accordingly:
 * $$\bar{x}_j = \frac{x_j - \mu_j}{\sigma_j}$$

*It works best when the data within each feature are approximately normally distributed.*

#### Max-Abs scaling
In some applications sparseness of features is a very important property. A feature is sparse when its value is exactly zero most of the time. Neither min-max scaling nor mean-var scaling preserves sparsity. <br>
A normalization scheme that is suitable for sparse data is max-abs scaling. It consists of dividing each feature by the largest absolute value found in the training set:
 * $$\bar{x}_j = \frac{x_j}{V_j}$$

Where $V_j$ is the maximum absolute value element of the $j$-th feature considered.

#### Robust scaling
The Robust Scaler scales features in a way that is robust to outliers. The method is similar to the Min-Max Scaler, but it uses the interquartile range rather than the min-max range: the median is removed and the data are scaled according to the quantile range. <br>
It thus follows the formula:
 * $$\bar{x}_j = \frac{x_j - \mathrm{median}(x_j)}{Q_3(x_j) - Q_1(x_j)}$$
 
Where $Q_1$ is the first quartile and $Q_3$ is the third quartile.

Please note that it is also possible to skip normalization altogether, by selecting the **`No`** option in the following dropdown (this is the `default` choice).
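The four scalers can be compared on a toy feature that contains one outlier. The sketch below uses arbitrary illustrative numbers. Each scaler is fitted on the training values only and then applied to an unseen test value, as recommended above.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

# One feature, five training samples, the last one being an outlier
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[5.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), MaxAbsScaler(), RobustScaler()):
    scaler.fit(X_train)  # statistics come from the training data only
    results[type(scaler).__name__] = float(scaler.transform(X_test)[0, 0])

for name, value in results.items():
    print(f"{name:>15}: {value:.4f}")
```

The outlier compresses the min-max and max-abs outputs towards zero, while the robust scaler, which relies on the median and the interquartile range, is barely affected by it.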

In [10]:
scaler_widet = ipywidgets.Dropdown(options=[('No', None), ('Min-Max', MinMaxScaler()), ('Mean-Var', StandardScaler()), ('Max-Abs', MaxAbsScaler()), ('Robust', RobustScaler())], description='Scaler')
scaler_widet

Dropdown(description='Scaler', options=(('No', None), ('Min-Max', MinMaxScaler()), ('Mean-Var', StandardScaler…

### Create a new DataFrame (`X`) that contains the selected optical features, from the balanced one

In [11]:
# Define the features and target variables
X = s2_df_mod_bal[list(s2_features_widget.value)]
y = s2_df_mod_bal.iloc[:, -1]
pd.concat([X, y], axis=1).head(8)

| | EOMI3 | SCI | EOMI1 | SDI | y |
|---|---|---|---|---|---|
| 0 | 0.050282 | 0.052243 | 0.047235 | -253.591837 | 0.0 |
| 1 | 0.319567 | 0.414346 | 0.391562 | -3466.957483 | 1.0 |
| 2 | 0.134599 | 0.178288 | 0.160978 | -1542.100475 | 1.0 |
| 3 | 0.030946 | 0.038442 | 0.034784 | -296.157437 | 0.0 |
| 4 | 0.136649 | 0.189385 | 0.174058 | -1411.042735 | 1.0 |
| 5 | 0.025078 | 0.027322 | 0.024694 | -228.722222 | 0.0 |
| 6 | -0.020467 | -0.047984 | -0.049732 | 811.411273 | 0.0 |
| 7 | 0.117342 | 0.145191 | 0.13361 | -1263.488518 | 1.0 |


### Model performances evaluation (ASIDE)
To evaluate the performance of each model, the following procedure has been adopted ([code](utils.py)).

Stratified Cross-Validation has been used in order to obtain a more reliable estimate of the different performance metrics.
For each split of the dataset into train and test, the **mean** (over the two classes, whose sample counts have been balanced a priori) [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) and [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) have been calculated for both the train and test sets, normalizing the data beforehand (if requested).

Then, each list of per-split metrics has been collapsed with a *mean reduce function* into a single value. The output is therefore the mean accuracy, precision, recall and F1 over the different "experiments" (that is, trainings).
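The evaluation procedure can be sketched as follows. This is not the code of `utils.measure_scv_performances`: the function name `stratified_cv_scores`, the use of macro averaging to express the "mean over the two classes", and the synthetic data are all assumptions. For brevity the sketch reports test-fold metrics only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def stratified_cv_scores(X, y, model, scaler=None, n_folds=5, random_state=3):
    """Mean test accuracy, precision, recall and F1 over stratified folds."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    rows = []
    for train_idx, test_idx in skf.split(X, y):
        X_train, X_test = X[train_idx], X[test_idx]
        if scaler is not None:
            # Fit the scaler on the training fold only, then reuse it on the test fold
            fold_scaler = clone(scaler).fit(X_train)
            X_train, X_test = fold_scaler.transform(X_train), fold_scaler.transform(X_test)
        y_pred = clone(model).fit(X_train, y[train_idx]).predict(X_test)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y[test_idx], y_pred, average="macro", zero_division=0
        )
        rows.append((accuracy_score(y[test_idx], y_pred), precision, recall, f1))
    # Collapse the per-fold metrics with a mean
    return dict(zip(("accuracy", "precision", "recall", "f1"), np.mean(rows, axis=0)))

# Synthetic stand-in for the balanced dataset
X_toy, y_toy = make_classification(n_samples=60, n_features=4, random_state=0)
scores = stratified_cv_scores(X_toy, y_toy, SVC(), scaler=StandardScaler())
print({name: round(float(value), 2) for name, value in scores.items()})
```

Cloning the scaler and the model inside the loop keeps every fold independent, so no statistics leak from a test fold into training.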

#### Formulae (source: [confusion-matrix-guide](https://www.v7labs.com/blog/confusion-matrix-guide))
* *Accuracy*: the number of samples correctly classified out of all the samples present in the test set. $$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
* *Precision* (for the positive class): The number of samples actually belonging to the positive class out of all the samples that were predicted to be of the positive class by the model. $$Precision = \frac{TP}{TP + FP}$$
* *Recall* (for the positive class): The number of samples predicted correctly to be belonging to the positive class out of all the samples that actually belong to the positive class. $$Recall=\frac{TP}{TP+FN}$$
* *F1_score*: The harmonic mean of the precision and recall scores obtained for the positive, or negative, class. $$F1\_score=\frac{2*precision*recall}{precision+recall}$$

Where:
* *True Positive (TP)* refers to a sample belonging to the positive class being classified correctly.
* *True Negative (TN)* refers to a sample belonging to the negative class being classified correctly.
* *False Positive (FP)* refers to a sample belonging to the negative class but being classified wrongly as belonging to the positive class.
* *False Negative (FN)* refers to a sample belonging to the positive class but being classified wrongly as belonging to the negative class.

For this particular classification task, the positive class is formed by the manured observations ($y=1$), whereas the negative class is formed by the non-manured ones ($y=0$).
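The formulae above can be verified on a small hand-made example by counting the four outcomes directly and comparing the results with scikit-learn's metric functions. The labels and predictions below are arbitrary illustrative values.

```python
import math
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count the four outcomes with respect to the positive class (y = 1)
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values agree with scikit-learn
assert math.isclose(accuracy, accuracy_score(y_true, y_pred))
assert math.isclose(precision, precision_score(y_true, y_pred))
assert math.isclose(recall, recall_score(y_true, y_pred))
assert math.isclose(f1, f1_score(y_true, y_pred))
print(tp, tn, fp, fn, accuracy, precision, recall, f1)
```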

### Logistic Regression

[Logistic Regression](https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148) is a simple and powerful linear classification algorithm used for binary classification. It estimates the probability of an event occurring based on a given dataset of independent variables.

In [12]:
# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, LogisticRegression(), scaler=scaler_widet.value, n_folds=kfolds_widget.value, random_state=3)

Summary: LogisticRegression(), MaxAbsScaler(), 5 KFolds, 0.097s



| Dataset | Mean Accuracy | Mean Precision | Mean Recall | Mean F1 |
|---|---|---|---|---|
| Train | 0.81 | 0.82 | 0.81 | 0.81 |
| Test | 0.79 | 0.8 | 0.79 | 0.79 |


### Linear Discriminant Analysis

[Linear Discriminant Analysis](https://towardsdatascience.com/linear-discriminant-analysis-explained-f88be6c1e00b) is another linear classification algorithm that is used to find a linear combination of features that characterizes or separates two or more classes.

In [13]:
# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, LinearDiscriminantAnalysis(), scaler=scaler_widet.value, n_folds=kfolds_widget.value, random_state=3)

Summary: LinearDiscriminantAnalysis(), MaxAbsScaler(), 5 KFolds, 0.121s



| Dataset | Mean Accuracy | Mean Precision | Mean Recall | Mean F1 |
|---|---|---|---|---|
| Train | 0.88 | 0.9 | 0.88 | 0.87 |
| Test | 0.85 | 0.89 | 0.85 | 0.84 |


### Support Vector Classifier (**saved**)
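[Support Vector Classifier](https://towardsdatascience.com/everything-about-svm-classification-above-and-beyond-cc665bfd993e) is a classification algorithm that looks for the maximum-margin boundary between classes. With a non-linear kernel (scikit-learn's `SVC` uses the RBF kernel by default) this boundary can be non-linear.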

In [14]:
# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, SVC(), scaler=scaler_widet.value, n_folds=kfolds_widget.value, random_state=3, save=True)

Summary: SVC(), MaxAbsScaler(), 5 KFolds, 0.095s



| Dataset | Mean Accuracy | Mean Precision | Mean Recall | Mean F1 |
|---|---|---|---|---|
| Train | 0.89 | 0.91 | 0.89 | 0.89 |
| Test | 0.88 | 0.9 | 0.88 | 0.88 |


### K-Nearest Neighbors Classifier

[K-Nearest Neighbors (KNN) Classifier](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761) is a non-parametric classification algorithm that classifies a new data point according to the majority class among its `k` nearest data points in the training set.

In [15]:
# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, KNeighborsClassifier(n_neighbors=9), scaler=scaler_widet.value, n_folds=kfolds_widget.value, random_state=3)

Summary: KNeighborsClassifier(n_neighbors=9), MaxAbsScaler(), 5 KFolds, 0.101s



| Dataset | Mean Accuracy | Mean Precision | Mean Recall | Mean F1 |
|---|---|---|---|---|
| Train | 0.84 | 0.88 | 0.84 | 0.84 |
| Test | 0.81 | 0.86 | 0.81 | 0.8 |


### Random Forest Classifier

[Random Forest Classifier](https://towardsdatascience.com/random-forest-classification-678e551462f5) is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees.

In [16]:
# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, RandomForestClassifier(max_depth=1, random_state=30), scaler=scaler_widet.value, n_folds=kfolds_widget.value, random_state=3)

Summary: RandomForestClassifier(max_depth=1, random_state=30), MaxAbsScaler(), 5 KFolds, 0.82s



| Dataset | Mean Accuracy | Mean Precision | Mean Recall | Mean F1 |
|---|---|---|---|---|
| Train | 0.86 | 0.89 | 0.86 | 0.86 |
| Test | 0.83 | 0.86 | 0.83 | 0.82 |


## **Models (using Sentinel-1 data instead)**

Please note that **radar indexes** will now be used. <br>
The value of `K` (for KFold Cross-Validation) and the `normalization method` will be the same as those previously selected.

### Radar features subset selection

In the analysis notebook we already ranked the radar features by how strongly they are affected by manure application.<br>
The objective now is to consider a subset of them (and to modify this subset) in order to see which ones most improve the overall performance, without running into overfitting or underfitting issues.

In [17]:
# Sentinel-1 features
s1_features_widget = ipywidgets.SelectMultiple(options=s1_df.select_dtypes(include=np.number).columns, value=['DIF'], description='Features')
s1_features_widget

SelectMultiple(description='Features', index=(3,), options=('VV', 'VH', 'AVE', 'DIF', 'RAT1', 'RAT2', 'NDI', '…

### Create a new DataFrame (`X`) that contains the selected radar features, from the balanced one

In [18]:
# Define the features and target variables
X = s1_df_mod_bal[list(s1_features_widget.value)]
y = s1_df_mod_bal.iloc[:, -1]
pd.concat([X, y], axis=1).head(8)

| | DIF | y |
|---|---|---|
| 0 | 0.663417 | 0.0 |
| 1 | 0.263173 | 1.0 |
| 2 | 0.077553 | 0.0 |
| 3 | -0.321686 | 1.0 |
| 4 | -0.285255 | 0.0 |
| 5 | -0.07746 | 1.0 |
| 6 | -0.393857 | 0.0 |
| 7 | 0.404535 | 1.0 |


### Support Vector Classifier

Different models have been tested; the one that works best in this case is SVC.

In [19]:
# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, SVC(), scaler=scaler_widet.value, n_folds=kfolds_widget.value, random_state=3)

Summary: SVC(), MaxAbsScaler(), 5 KFolds, 0.099s



| Dataset | Mean Accuracy | Mean Precision | Mean Recall | Mean F1 |
|---|---|---|---|---|
| Train | 0.66 | 0.68 | 0.66 | 0.66 |
| Test | 0.6 | 0.65 | 0.6 | 0.57 |
