In [1]:
# Libraries
import sys
import ipywidgets
import pandas as pd

# Make the helper module in the analysis folder importable
sys.path.append("../2-analysis/")
import utils

# Machine Learning Models

The aim of the project is to create a Machine Learning model capable of detecting, from satellite data, the dates on which a crop field was manured. <br>
This notebook builds and compares different models, using the previously created feature-extracted datasets together with the results discussed in the analysis notebook.

In [2]:
s2_df = pd.read_csv('../../Datasets/main/main-fields-s2-features-extracted.gz', compression='gzip')
s1_df = pd.read_csv('../../Datasets/main/main-fields-s1-features-extracted.gz', compression='gzip')

## Normalization 
Normalization is typically needed before building machine learning models because many machine learning algorithms are sensitive to the scale of the input features. Normalization can help to improve the performance of machine learning models by ensuring that the input features have a similar scale.

Here are some reasons why normalization is important in machine learning:
* **Better performance:** Some machine learning algorithms are based on distance metrics that are affected by the scale of the input features. If the input features have different scales, the algorithm may be biased towards the features with larger scales, leading to suboptimal performance.
* **Faster convergence:** Normalization can help machine learning algorithms to converge more quickly. Some optimization algorithms, such as gradient descent, converge faster when the input features have a similar scale.
* **Improved interpretability:** Normalization can improve the interpretability of machine learning models. When the input features have vastly different scales, it can be difficult to interpret the coefficients of the model or the importance of the features.

There are several normalization techniques that can be used, each one with its own advantages and drawbacks.
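
Before looking at them, the scale effect described in the list above can be seen with a few lines of plain Python. This is only a sketch: the three points are made up (they are not taken from the project datasets), and `euclidean` and `min_max_columns` are illustrative helper names. The second feature has a much larger scale than the first.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_columns(rows):
    """Rescale every column of a list of tuples to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Made-up points: (small-scale feature, large-scale feature)
points = [(0.10, 1000.0), (0.90, 1005.0), (0.12, 1010.0)]
p, q, r = points

# On raw values the large-scale feature dominates the distance
raw_q_closer = euclidean(p, q) < euclidean(p, r)

# After rescaling, both features contribute comparably
ps, qs, rs = min_max_columns(points)
scaled_q_closer = euclidean(ps, qs) < euclidean(ps, rs)

print(raw_q_closer, scaled_q_closer)
```

On the raw values `q` is the nearest neighbour of `p` because only the second feature matters; once both features are rescaled, the first feature counts as well and `r` becomes the nearest neighbour.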

### Min-Max scaling
One of the simplest normalization techniques consists in scaling the data so that all features have values in the same range, typically between 0 and 1 (using a different range is trivial). <br>
This technique is called min-max scaling and it is based on the computation of the minimum `mj` and the maximum `Mj` values of each feature (`j=0,1,...,n−1`).
Then each component of an input feature vector `x` is normalized by applying the following linear scaling:
 * `xj_norm = (xj − mj) / (Mj − mj)`

This ensures that all the training features assume values between 0 and 1. Note that for data outside the training set normalized values may still fall outside the `[0, 1]` range.
Sometimes the transformation performed by min-max scaling degenerates due to the presence of a few unrepresentative outliers in the training data. A single very large (or very small) value causes the compression of all the others. <br>
*This scaling algorithm works very well when the standard deviation is very small, or when the data do not follow a Gaussian distribution.*
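
The behaviour described in this section can be checked with a minimal sketch in plain Python. The values are made up, and `fit_min_max` and `min_max_scale` are hypothetical names, not functions from the project's `utils` module. The minimum and maximum are computed on the training values only and then reused for new data.

```python
def fit_min_max(train):
    """Return (mj, Mj) computed on the training values of one feature."""
    return min(train), max(train)

def min_max_scale(values, lo, hi):
    """Apply xj_norm = (xj - mj) / (Mj - mj) to every value."""
    return [(v - lo) / (hi - lo) for v in values]

train = [2.0, 4.0, 6.0, 10.0]
lo, hi = fit_min_max(train)

# Training values end up inside [0, 1]
train_scaled = min_max_scale(train, lo, hi)

# A value outside the training range falls outside [0, 1]
new_scaled = min_max_scale([12.0], lo, hi)

# A single outlier in the training data compresses all the other values
with_outlier = train + [1000.0]
lo_out, hi_out = fit_min_max(with_outlier)
compressed = min_max_scale(train, lo_out, hi_out)

print(train_scaled, new_scaled, max(compressed))
```

With the outlier included, the four original values are all mapped below 0.01, which is the degeneration mentioned above.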

### Mean-Var scaling
Mean-var scaling is another normalization technique, and it is less affected by this drawback. In mean-var scaling each feature is linearly scaled to have zero mean and unit variance. <br>
Training data is used to compute the mean and the standard deviation of each feature. Then the components of a given feature vector `x` are normalized accordingly:
 * `xj_norm = (xj − μj) / σj`

*It works best when the data within each feature are approximately normally distributed.*
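
A minimal sketch of mean-var scaling using only the standard library is shown below. The values are made up and the function names are hypothetical. The population standard deviation is used here, which is an assumption: how the notebook's own helper computes `σj` is not shown.

```python
from statistics import fmean, pstdev

def fit_mean_var(train):
    """Return (mean, population standard deviation) of the training values."""
    return fmean(train), pstdev(train)

def mean_var_scale(values, mu, sigma):
    """Apply xj_norm = (xj - mu_j) / sigma_j to every value."""
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_mean_var(train)

# The scaled training values have zero mean and unit variance
train_scaled = mean_var_scale(train, mu, sigma)

# New data is scaled with the statistics learned on the training set
new_scaled = mean_var_scale([11.0], mu, sigma)

print(mu, sigma, train_scaled, new_scaled)
```

Unlike min-max scaling, the result is not bounded: the new value `11.0` lies three standard deviations above the training mean.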

### Max-Abs scaling
In some applications the sparseness of features is a very important property. A feature is sparse when its value is exactly zero most of the time.
Neither min-max scaling nor mean-var scaling preserves sparsity, because both shift the data and so move zeros to non-zero values. <br>
A normalization scheme that is suitable for sparse data is max-abs scaling. It consists in dividing each feature by the largest absolute value found in the training set:
 * `xj_norm = xj / Vj`

Where `Vj` is the maximum absolute value of the j-th feature in the training set.
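
The sparsity argument can be illustrated with a short sketch. The feature values are made up and the helper names are hypothetical. Max-abs scaling only divides, so zeros stay zeros, while min-max scaling shifts the data first.

```python
def max_abs_scale(values, v_max):
    """Apply xj_norm = xj / Vj to every value."""
    return [v / v_max for v in values]

def min_max_scale(values, lo, hi):
    """Apply xj_norm = (xj - mj) / (Mj - mj) to every value."""
    return [(v - lo) / (hi - lo) for v in values]

# A sparse feature: most entries are exactly zero
feature = [0.0, 0.0, -4.0, 0.0, 2.0, 0.0]

v_max = max(abs(v) for v in feature)
max_abs = max_abs_scale(feature, v_max)
min_max = min_max_scale(feature, min(feature), max(feature))

# Count the exact zeros before and after each transformation
zeros_before = sum(v == 0 for v in feature)
zeros_max_abs = sum(v == 0 for v in max_abs)
zeros_min_max = sum(v == 0 for v in min_max)

print(max_abs, zeros_before, zeros_max_abs, zeros_min_max)
```

The scaled values lie in `[-1, 1]` and every original zero is still zero, whereas after min-max scaling only the former minimum is mapped to zero.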

### Robust scaling
Robust scaling rescales features in a way that is robust to outliers. It is similar to min-max scaling, but it uses the interquartile range instead of the full min-max range: the median is removed and the data are scaled according to the quantile range. <br>
It therefore follows this formula:
 * `xj_norm = [xj - median(xj)] / [Q3(xj) - Q1(xj)]`

Where `Q1` is the first quartile, `Q3` is the third quartile and `median` is the second quartile.
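
The following sketch shows why quartile-based statistics are robust to outliers. It centres on the median and divides by the interquartile range, which is the commonly used variant and an assumption about what is intended here. The values are made up, the function names are hypothetical, and quartile conventions differ slightly between libraries (the `"inclusive"` method of `statistics.quantiles` is used).

```python
from statistics import quantiles

def fit_robust(train):
    """Return (median, interquartile range) of the training values."""
    q1, median, q3 = quantiles(train, n=4, method="inclusive")
    return median, q3 - q1

def robust_scale(values, center, iqr):
    """Subtract the centre and divide by the interquartile range."""
    return [(v - center) / iqr for v in values]

clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
# Same data, but the largest value is replaced by an extreme outlier
with_outlier = clean[:-1] + [1000.0]

clean_scaled = robust_scale(clean, *fit_robust(clean))
outlier_scaled = robust_scale(with_outlier, *fit_robust(with_outlier))

print(fit_robust(clean), fit_robust(with_outlier))
print(clean_scaled[:-1] == outlier_scaled[:-1])
```

The median and the interquartile range are identical for the two datasets, so the outlier does not change how the remaining values are scaled. With min-max scaling the same outlier would compress them towards zero.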

In [3]:
# Select the method to use
norm_method_dropdown = ipywidgets.widgets.Dropdown(
    options=['min-max', 'mean-var', 'max-abs', 'robust'],
    value='mean-var',
    description='Method:',
    disabled=False,
)
norm_method_dropdown

Dropdown(description='Method:', index=1, options=('min-max', 'mean-var', 'max-abs', 'robust'), value='mean-var…

In [4]:
# Apply normalization method to the Sentinel-2 DataFrame
s2_df_norm = utils.get_normalized_df(s2_df, method=norm_method_dropdown.value)
s2_df_norm

Unnamed: 0,crop_field_name,s2_acquisition_date,B1,B2,B3,B4,B5,B6,B7,B8,...,CARI1,CARI2,MCARI,MCARI1,MCARI2,BSI,GLI,ALTERATION,SDI,manure_dates
0,P-BLD,2022-01-06,-1.036447,-1.063315,-1.061649,-1.201649,-1.096994,-0.558130,-0.540504,-0.427050,...,-0.273658,-1.461947,4.908170,1.209066,1.904128,-1.179463,4.894048,3.637315,1.127474,['2022-05-26']
1,P-BLD,2022-01-16,-1.030234,-1.068097,-1.075320,-1.208058,-1.092802,-0.456408,-0.357879,-0.269414,...,-0.195099,-1.465764,5.331742,1.359278,1.896499,-1.128323,4.948528,3.619137,1.264134,['2022-05-26']
2,P-BLD,2022-01-26,-0.448858,-0.465868,-0.404927,-0.523943,-0.331648,0.484596,0.610213,0.643956,...,-0.154519,-0.547874,1.120601,1.407079,1.023850,-0.315789,0.580617,0.476291,1.411531,['2022-05-26']
3,P-BLD,2022-01-31,0.890505,0.814924,0.711155,0.677444,0.533677,0.088458,-0.081237,-0.076459,...,0.418623,0.023066,-0.869612,-0.836978,-0.997547,0.451625,-0.739293,-0.911456,-0.267574,['2022-05-26']
4,P-BLD,2022-02-05,-0.352081,-0.376017,-0.270897,-0.427362,-0.151105,1.180466,1.475893,1.528440,...,0.096807,-0.365343,1.494216,2.196687,1.093432,0.076193,0.619163,0.732334,2.087412,['2022-05-26']
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,P-VNS,2022-10-08,2.625669,2.652610,2.624483,2.561937,2.493714,1.877812,1.554318,1.391594,...,2.485953,0.566679,-0.891626,-1.628445,-2.157991,2.448572,-0.840330,-1.653026,-0.893668,['2022-04-23']
851,P-VNS,2022-10-23,1.150869,1.149931,0.941665,0.820118,0.698924,0.088690,-0.193431,-0.206163,...,0.618101,-0.413228,-0.766264,-1.014679,-1.161364,0.599746,-0.813274,-1.555275,-0.792936,['2022-04-23']
852,P-VNS,2022-11-12,-0.259449,-0.223361,-0.161344,-0.079493,-0.062027,-0.223642,-0.340975,-0.311866,...,-0.117774,0.422066,-0.247294,-0.316308,-0.214196,-0.072730,-0.291347,-0.086449,-0.625640,['2022-04-23']
853,P-VNS,2022-11-17,-0.426843,-0.357928,-0.315574,-0.299688,-0.249845,-0.201784,-0.234444,-0.232959,...,-0.258757,-0.087250,0.050090,0.111188,0.273076,-0.221561,-0.042338,0.191610,-0.262580,['2022-04-23']


In [5]:
# Apply normalization method to the Sentinel-1 DataFrame
s1_df_norm = utils.get_normalized_df(s1_df, method=norm_method_dropdown.value)
s1_df_norm

Unnamed: 0,crop_field_name,s1_acquisition_date,VV,VH,AVE,DIF,RAT1,RAT2,NDI,RVI,manure_dates
0,P-BLD,2022-01-08,-0.774187,-1.090734,-1.000650,0.377857,0.137118,-0.170164,0.169757,-0.169757,['2022-05-26']
1,P-BLD,2022-01-20,-1.405256,-0.585813,-1.080397,-1.155991,1.402047,-0.807199,1.274152,-1.274152,['2022-05-26']
2,P-BLD,2022-02-01,-1.061536,-1.127134,-1.177549,0.031302,0.472830,-0.355643,0.474372,-0.474372,['2022-05-26']
3,P-BLD,2022-02-13,-1.189730,-1.153135,-1.261623,-0.110408,0.613711,-0.429661,0.599644,-0.599644,['2022-05-26']
4,P-BLD,2022-02-25,-1.668860,-1.471097,-1.692402,-0.348343,0.944494,-0.595290,0.888030,-0.888030,['2022-05-26']
...,...,...,...,...,...,...,...,...,...,...,...
958,P-VNS,2022-11-04,0.824007,0.947509,0.952378,-0.120197,-0.347264,0.122526,-0.285533,0.285533,['2022-04-23']
959,P-VNS,2022-11-16,0.537592,0.232428,0.417707,0.431240,-0.550024,0.254890,-0.481898,0.481898,['2022-04-23']
960,P-VNS,2022-11-28,1.301372,1.238522,1.367963,0.150920,-0.831987,0.449749,-0.760888,0.760888,['2022-04-23']
961,P-VNS,2022-12-10,0.512962,-0.277916,0.134774,1.071708,-0.924553,0.516650,-0.854018,0.854018,['2022-04-23']
