In [1]:
# Libraries
import pandas as pd
import utils  # project-local helper module
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler

# Machine Learning Models

The aim of the project is to build a Machine Learning model capable of detecting the dates on which a crop field has been manured, using satellite data. <br>
This notebook creates and compares different models, building on the previously created extracted-features dataset and on the results discussed in the analysis notebook.

In [2]:
s2_df = pd.read_csv('../../Datasets/main/main-fields-s2-features-extracted.gz', compression='gzip')
s1_df = pd.read_csv('../../Datasets/main/main-fields-s1-features-extracted.gz', compression='gzip')

## Modifying the original DataFrames for use with ML models

The purpose of building this `DataFrame` is to detect whether manure has been applied to a particular crop field, by analyzing time series of Sentinel satellite acquisitions (for the training samples, the dates on which manure was applied are known). <br>
The resulting `DataFrame` records the **absolute difference between consecutive acquisition dates for each extracted index** (in the dataset), **as well as a binary indicator variable stating whether any manure application date falls within the time frame between two consecutive Sentinel acquisitions, for each crop field**.<br>
This information can be used to train a machine learning model that predicts when manure has been applied to a given crop field, based on features extracted from Sentinel acquisitions over the period of interest.

Please note that each `DataFrame` column (except `crop_field_name` and `y`) measures the absolute difference between two consecutive Sentinel-`X` acquisitions (for a single crop field).
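The internals of `utils.get_modified_df` are not shown in this notebook, so the following is only a minimal sketch of the idea described above, on toy data. The column names `acquisition_date` and `NDVI`, the `manure_dates` dictionary, the helper name `build_diff_frame` and the interval convention (a manure date counts if it lies after the first acquisition and up to the second) are all assumptions made for illustration.

```python
import pandas as pd

# Toy acquisitions for two fields; names and values are illustrative only.
acquisitions = pd.DataFrame({
    "crop_field_name": ["F1", "F1", "F1", "F2", "F2"],
    "acquisition_date": pd.to_datetime(
        ["2022-01-01", "2022-01-06", "2022-01-11", "2022-01-01", "2022-01-06"]),
    "NDVI": [0.20, 0.50, 0.45, 0.30, 0.10],
})
# Hypothetical manure application dates per field.
manure_dates = {"F1": [pd.Timestamp("2022-01-08")], "F2": []}


def build_diff_frame(df, manure_dates, feature_cols):
    """One row per pair of consecutive acquisitions of the same field."""
    rows = []
    ordered = df.sort_values("acquisition_date")
    for field, group in ordered.groupby("crop_field_name"):
        dates = group["acquisition_date"].to_list()
        # Absolute difference with the previous acquisition; the first row has none.
        diffs = group[feature_cols].diff().abs().iloc[1:]
        intervals = zip(dates[:-1], dates[1:])
        for (start, end), (_, diff_row) in zip(intervals, diffs.iterrows()):
            # Assumed convention: label 1 if a manure date falls in (start, end].
            manured = any(start < d <= end for d in manure_dates.get(field, []))
            rows.append({"crop_field_name": field, **diff_row.to_dict(), "y": float(manured)})
    return pd.DataFrame(rows)


diff_df = build_diff_frame(acquisitions, manure_dates, ["NDVI"])
print(diff_df)
```

In this toy case only the second interval of field `F1` contains a manure date, so it is the only row labelled `1.0`.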

In [3]:
# Sentinel-2
s2_df_mod = utils.get_modified_df(s2_df, sentinel=2)
s2_df_mod

Unnamed: 0,crop_field_name,B1,B2,B3,B4,B5,B6,B7,B8,B8A,...,CARI1,CARI2,MCARI,MCARI1,MCARI2,BSI,GLI,ALTERATION,SDI,y
1,P-BLD,11.525510,8.440476,21.562925,9.930272,6.205782,124.039116,220.292517,193.943878,193.318027,...,220.735194,14.821130,214.161369,275.431837,0.001536,262.245588,0.004881,0.003641,151.874150,0.0
2,P-BLD,1078.435374,1062.938776,1057.346939,1060.000000,1126.557823,1147.452381,1167.767007,1123.741497,1144.389456,...,114.020134,3564.075800,2129.186864,87.648980,0.175677,4166.625746,0.391301,0.629555,163.807823,0.0
3,P-BLD,2484.477891,2260.610544,1760.287415,1861.486395,1280.738095,483.047619,834.066327,886.345238,932.683673,...,1610.411742,2216.904547,1006.267915,4114.747959,0.406936,3935.251609,0.118245,0.277985,1866.045918,0.0
4,P-BLD,2304.959184,2102.022109,1548.894558,1711.838435,1013.522109,1331.586735,1878.299320,1974.547619,2068.379252,...,904.236804,1508.152782,1195.170656,5562.588367,0.420944,1925.189004,0.121698,0.329274,2617.175170,0.0
5,P-BLD,1566.938776,1633.841837,1222.882653,1338.595238,744.566327,1291.680272,1764.062925,1971.335034,2009.787415,...,481.469639,1402.703367,1072.556874,4946.811224,0.339269,371.829866,0.104554,0.321017,2013.052721,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,P-VNS,5606.143498,5374.345291,4764.394619,4737.530942,4444.085202,2348.694170,1702.821525,1609.651121,1355.912108,...,8465.294926,5029.973432,454.213280,4462.239605,0.604684,14214.068737,0.097813,0.469385,1611.464574,0.0
851,P-VNS,2735.709417,2652.242152,2654.146188,2698.858296,2656.408072,2181.641256,2108.233184,1965.759641,1953.879821,...,5248.277237,3804.884046,63.383846,1125.412951,0.200635,9480.670199,0.002424,0.019581,111.947085,0.0
852,P-VNS,2616.098655,2423.871749,1739.669955,1393.901345,1126.258296,380.855605,177.975785,130.049327,101.484305,...,2067.657718,3243.365795,262.395659,1280.547874,0.190679,3448.417006,0.046757,0.294226,185.921973,0.0
853,P-VNS,310.511211,237.513004,243.252018,341.180269,277.982960,26.653812,128.503139,97.081614,54.261883,...,396.134014,1977.622835,150.359500,783.865184,0.098095,763.198817,0.022308,0.055699,403.480717,0.0


In [4]:
# Sentinel-1
s1_df_mod = utils.get_modified_df(s1_df, sentinel=1)
s1_df_mod

Unnamed: 0,crop_field_name,VV,VH,AVE,DIF,RAT1,RAT2,NDI,RVI,y
1,P-BLD,1.158187,0.891477,0.133355,2.049664,0.089755,0.183735,0.062131,0.124261,0.0
2,P-BLD,0.630822,0.955744,0.162461,1.586566,0.065934,0.130239,0.044994,0.089987,0.0
3,P-BLD,0.235273,0.045906,0.140590,0.189366,0.009996,0.021348,0.007047,0.014095,0.0
4,P-BLD,0.879335,0.561385,0.720360,0.317949,0.023471,0.047771,0.016224,0.032448,0.0
5,P-BLD,0.305355,1.617982,0.961669,1.312628,0.043798,0.081217,0.029115,0.058230,0.0
...,...,...,...,...,...,...,...,...,...,...
958,P-VNS,1.302458,0.523352,0.912905,0.779106,0.056724,0.134730,0.041709,0.083419,0.0
959,P-VNS,0.525650,1.262530,0.894090,0.736880,0.014387,0.038177,0.011047,0.022094,0.0
960,P-VNS,1.401747,1.776336,1.589041,0.374589,0.020007,0.056202,0.015695,0.031391,0.0
961,P-VNS,1.446950,2.677388,2.062169,1.230438,0.006568,0.019296,0.005239,0.010479,0.0


## Balancing the DataFrames
Balancing a dataset is important in binary classification problems because it helps prevent the model from being biased towards one of the two classes.

In a binary classification problem (like ours), the goal is to predict the correct label for each sample, which can be either 0 or 1. However, if the dataset is imbalanced, meaning that one class has significantly more samples than the other, the model may learn to always predict the majority class, even if the minority class is actually the correct label.

This is especially problematic if the minority class is the one that we are more interested in identifying, such as in cases of fraud detection or rare disease diagnosis. In such cases, a model that is biased towards the majority class would be of little use.

Therefore, balancing the dataset by increasing the number of samples in the minority class or decreasing the number of samples in the majority class can help improve the performance of the model on the minority class and reduce the chances of bias. This can be done through techniques such as `undersampling`, `oversampling`, or a `combination` of both. The cells below use undersampling (`method='under'`).
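As an illustration of undersampling, here is a minimal pandas sketch on toy data. It is not the implementation of `utils.get_balanced_df`, whose internals are not shown here; the helper name `undersample` and the toy column `feature` are hypothetical.

```python
import pandas as pd


def undersample(df, target="y", random_state=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Restore the original row order after sampling.
    return pd.concat(parts).sort_index()


# Toy imbalanced dataset: eight negatives, two positives.
toy = pd.DataFrame({"feature": range(10), "y": [0.0] * 8 + [1.0] * 2})
balanced = undersample(toy)
print(balanced["y"].value_counts())
```

The price of undersampling is that majority-class rows are discarded, which matters when the dataset is already small.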

In [5]:
# Sentinel-2
s2_df_mod_balanced = utils.get_balanced_df(s2_df_mod, method='under', random_state=0)
s2_df_mod_balanced

Unnamed: 0,crop_field_name,B1,B2,B3,B4,B5,B6,B7,B8,B8A,...,CARI1,CARI2,MCARI,MCARI1,MCARI2,BSI,GLI,ALTERATION,SDI,y
3,P-BLD,2484.477891,2260.610544,1760.287415,1861.486395,1280.738095,483.047619,834.066327,886.345238,932.683673,...,1610.411742,2216.904547,1006.267915,4114.747959,0.406936,3935.251609,0.118245,0.277985,1866.045918,0.0
13,P-BLD,289.886054,335.82483,201.591837,735.891156,398.739796,1703.545918,2127.702381,2062.307823,1900.292517,...,0.906878,5738.416867,771.599601,4862.913469,0.393025,563.443952,0.112486,0.159643,3466.957483,1.0
41,P-BLLT1,79.674051,162.807753,104.986551,196.462816,118.330696,719.818038,1032.433544,1109.239715,933.285601,...,53.819947,1056.706152,180.465971,2022.91462,0.093796,98.555306,0.027062,0.016326,1542.100475,1.0
52,P-BLLT1,1709.352848,1621.079905,1687.09731,1389.584652,1999.560127,3638.739715,4013.984177,4167.443038,4166.082278,...,4684.894704,5693.210322,753.377077,4464.235823,0.156116,9501.814111,0.000745,0.220242,2684.980222,0.0
65,P-BLLT2,110.491453,231.299145,207.799145,369.918803,315.145299,470.863248,767.512821,827.431624,696.807692,...,441.045507,2546.149145,170.030773,1977.091282,0.139965,623.28048,0.034264,0.015565,1411.042735,1.0
76,P-BLLT2,1816.752137,1738.517094,1788.504274,1474.440171,2071.807692,3422.128205,3713.623932,3874.324786,3902.115385,...,4803.890879,5729.186508,723.10045,3945.773846,0.184898,9492.496735,0.004803,0.214843,2310.824786,0.0
81,P-CBRCS1,1532.701461,1353.94572,1029.676409,935.05428,599.146138,228.695198,183.181628,31.693111,27.125261,...,796.930755,51.93996,447.681117,1153.229562,0.177233,2245.661163,0.047474,0.188485,242.400835,0.0
91,P-CBRCS1,99.931106,106.438413,29.59499,171.755741,62.016701,767.453027,961.979123,953.956159,841.868476,...,160.937447,1024.921474,283.062827,1842.795908,0.087669,252.311687,0.042444,0.017196,1263.488518,1.0
111,P-CBRCS2,1449.752232,1402.600446,1133.216518,1075.78125,808.752232,111.964286,68.100446,231.1875,288.361607,...,1252.002156,1382.21859,393.424681,1792.435982,0.22222,2556.538999,0.050098,0.219083,824.975446,0.0
121,P-CBRCS2,131.178571,99.785714,52.341518,163.837054,63.770089,449.424107,560.984375,548.002232,496.935268,...,80.404202,941.666179,207.597343,1198.981607,0.088839,217.153036,0.02958,0.035077,956.986607,1.0


In [6]:
# Sentinel-1
s1_df_mod_balanced = utils.get_balanced_df(s1_df_mod, method='under', random_state=0)
s1_df_mod_balanced

Unnamed: 0,crop_field_name,VV,VH,AVE,DIF,RAT1,RAT2,NDI,RVI,y
3,P-BLD,0.235273,0.045906,0.14059,0.189366,0.009996,0.021348,0.007047,0.014095,0.0
12,P-BLD,0.251519,0.011654,0.119933,0.263173,0.012674,0.023108,0.008366,0.016733,1.0
33,P-BLLT1,0.654296,1.219815,0.282759,1.874111,0.084119,0.192109,0.060868,0.121736,0.0
41,P-BLLT1,1.620705,1.94239,1.781547,0.321686,0.012321,0.02363,0.008309,0.016618,1.0
63,P-BLLT2,0.521398,1.262022,0.89171,0.740625,0.017749,0.03726,0.012425,0.024851,0.0
72,P-BLLT2,0.428368,0.350908,0.389638,0.07746,0.009377,0.019689,0.006565,0.01313,1.0
93,P-CBRCS1,0.975707,0.046567,0.511137,0.92914,0.052606,0.16552,0.043004,0.086008,0.0
102,P-CBRCS1,1.57965,1.175115,1.377383,0.404535,0.038201,0.073015,0.025721,0.051443,1.0
123,P-CBRCS2,0.254051,0.741128,0.49759,0.487077,0.012518,0.030438,0.009294,0.018587,0.0
132,P-CBRCS2,0.648566,0.571513,0.610039,0.077053,0.014684,0.032982,0.010565,0.02113,1.0


## Models

### Normalization 
Normalization is typically needed before building machine learning models because many machine learning algorithms are sensitive to the scale of the input features. Normalization can help to improve the performance of machine learning models by ensuring that the input features have a similar scale.

Here are some reasons why normalization is important in machine learning:
* **Better performance:** Some machine learning algorithms are based on distance metrics that are affected by the scale of the input features. If the input features have different scales, the algorithm may be biased towards features with larger scales, leading to suboptimal performance.
* **Faster convergence:** Normalization can help machine learning algorithms to converge more quickly. Some optimization algorithms, such as gradient descent, converge faster when the input features have a similar scale.
* **Improved interpretability:** Normalization can improve the interpretability of machine learning models. When the input features have vastly different scales, it can be difficult to interpret the coefficients of the model or the importance of the features.

There are several normalization techniques that can be used, each one with its own advantages and drawbacks.

#### Min-Max scaling
One of the simplest normalization techniques consists of scaling the data so that all features have values in the same range, typically between 0 and 1 (using a different range is trivial). <br>
This technique is called min-max scaling and it is based on the computation of the minimum `mj` and the maximum `Mj` values of each feature (`j=0,1,...,n−1`) on the training set.
Then each component of an input feature vector `x` is normalized by applying the following linear scaling:
 * `xj_norm = (xj − mj) / (Mj − mj)`

This ensures that all the training features assume values between 0 and 1. Note that for data outside the training set normalized values may still fall outside the `[0, 1]` range.
Sometimes the transformation performed by min-max scaling degenerates due to the presence of a few unrepresentative outliers in the training data. A single very large (or very small) value causes the compression of all the others. <br>
*This scaling algorithm works very well when the standard deviation is very small, or when the data do not follow a Gaussian distribution.*
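A minimal NumPy sketch of the min-max formula on toy data (the arrays are made up for illustration). The minimum and maximum are computed on the training data only, and a new sample is then scaled with the same values, which shows how it can land outside `[0, 1]`.

```python
import numpy as np

# Toy training set: 3 samples, 2 features with very different scales.
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 1000.0]])
# A sample that was not part of the training set.
X_new = np.array([[4.0, 100.0]])

# Per-feature minimum mj and maximum Mj, computed on the training set only.
m = X_train.min(axis=0)
M = X_train.max(axis=0)


def min_max_scale(X):
    """Apply xj_norm = (xj - mj) / (Mj - mj) feature by feature."""
    return (X - m) / (M - m)


X_train_scaled = min_max_scale(X_train)
X_new_scaled = min_max_scale(X_new)
print(X_train_scaled)  # every training value lies in [0, 1]
print(X_new_scaled)    # values outside the training range leave [0, 1]
```

In scikit-learn the same transformation is provided by `MinMaxScaler`, which is imported at the top of this notebook.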

#### Mean-Var scaling
Mean-var scaling is another normalization technique, which suffers less from this drawback. In mean-var scaling each feature is linearly scaled to have zero mean and unit variance. <br>
Training data is used to compute the mean `μj` and the standard deviation `σj` of each feature. Then the components of a given feature vector `x` are normalized accordingly:
 * `xj_norm = (xj − μj) / σj`

*It works best when the data within each feature are approximately normally distributed.*
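The same toy data can be used to sketch mean-var scaling with NumPy. The arrays are illustrative; the population standard deviation (`ddof=0`) is used here, which is also what scikit-learn's `StandardScaler` computes.

```python
import numpy as np

# Toy training set: 3 samples, 2 features with very different scales.
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 1000.0]])

# Per-feature mean and standard deviation, computed on the training set.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0)

# xj_norm = (xj - mu_j) / sigma_j
X_std = (X_train - mu) / sigma

print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature
```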

#### Max-Abs scaling
In some applications the sparseness of features is a very important property. A feature is sparse when its value is exactly zero most of the time.
Neither min-max scaling nor mean-var scaling preserves sparsity, because both shift the data. <br>
A normalization scheme that is suitable for sparse data is max-abs scaling. It consists of dividing each feature by the largest absolute value found in the training set:
 * `xj_norm = xj / Vj`

Where `Vj` is the largest absolute value of the j-th feature in the training set.
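A short NumPy sketch of max-abs scaling on a toy sparse matrix (values made up for illustration). Since the data is only divided and never shifted, every zero entry stays zero.

```python
import numpy as np

# Toy training set in which most entries are exactly zero.
X_train = np.array([[0.0, -4.0],
                    [2.0, 0.0],
                    [0.0, 8.0],
                    [-5.0, 0.0]])

# Largest absolute value of each feature on the training set.
V = np.abs(X_train).max(axis=0)

# xj_norm = xj / Vj
X_scaled = X_train / V

print(X_scaled)
# Zero entries are untouched, so sparsity is preserved.
print(np.array_equal(X_scaled == 0, X_train == 0))
```

The scaled training values lie in `[-1, 1]`; `MaxAbsScaler` is the scikit-learn counterpart.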

#### Robust scaling
Robust scaling produces features that are robust to outliers. It is similar to min-max scaling, but it uses the interquartile range instead of the min-max range, and it centers each feature on its median instead of its minimum. <br>
It thus follows the formula:
 * `xj_norm = [xj - median(xj)] / [Q3(xj) - Q1(xj)]`
 
Where `Q1` is the first quartile and `Q3` is the third quartile of the j-th feature, computed on the training set.
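To see why quartile-based scaling is less sensitive to outliers, the sketch below compares it with min-max scaling on a toy feature containing one extreme value. It centers on the median and divides by the interquartile range, which is the default behaviour of scikit-learn's `RobustScaler`; the data is made up for illustration.

```python
import numpy as np

# One feature with a single extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Median and quartiles are barely affected by the outlier.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])

# Robust scaling: center on the median, divide by the interquartile range.
x_robust = (x - median) / (q3 - q1)

# Min-max scaling of the same feature, for comparison.
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_robust)  # the four inliers keep a spread comparable to the IQR
print(x_minmax)  # the four inliers are compressed close to 0
```

With min-max scaling the outlier defines the range, so the remaining values are squeezed into a tiny interval; with robust scaling they remain well separated while the outlier simply maps to a large value.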

### Logistic Regression

In [7]:
# define the features and target variables
X = s2_df_mod_balanced[['EOMI3', 'SCI', 'EOMI1', 'SDI', 'NBR', 'MCARI1']]
y = s2_df_mod_balanced.iloc[:, -1]

# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, LogisticRegression(), StandardScaler(), n_folds=4, random_state=3)

----------------------------------------------------------------------------------------------------
Summary: LogisticRegression(), StandardScaler(), 4 KFolds, 0.036s (elapsed time)

Class 0 (~Manured) - Precision: 0.56  - Recall: 0.52  - F1: 0.54
Class 1 (Manured)  - Precision: 0.55  - Recall: 0.59  - F1: 0.57
----------------------------------------------------------------------------------------------------
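
The helper `utils.measure_scv_performances` is not shown in this notebook. The sketch below is one plausible way to obtain a comparable per-class summary with scikit-learn, assuming that "scv" stands for stratified cross-validation and that the scaler is fitted inside each fold. The synthetic data from `make_classification` only stands in for the real features, so the numbers it prints are not comparable to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six selected features and the binary target.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# Putting the scaler in a pipeline means it is re-fitted on the training part
# of every fold, so no information leaks from the held-out part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=3)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
y_pred = cross_val_predict(pipeline, X_demo, y_demo, cv=folds)

precision, recall, f1, _ = precision_recall_fscore_support(y_demo, y_pred, labels=[0, 1])
for label in (0, 1):
    print(f"Class {label} - Precision: {precision[label]:.2f} "
          f"- Recall: {recall[label]:.2f} - F1: {f1[label]:.2f}")
```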


### Discriminant Analysis

In [8]:
# define the features and target variables
X = s2_df_mod_balanced[['EOMI3', 'SCI', 'EOMI1', 'SDI', 'NBR', 'MCARI1']]
y = s2_df_mod_balanced.iloc[:, -1]

# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, LinearDiscriminantAnalysis(), StandardScaler(), n_folds=5, random_state=3)

----------------------------------------------------------------------------------------------------
Summary: LinearDiscriminantAnalysis(), StandardScaler(), 5 KFolds, 0.038s (elapsed time)

Class 0 (~Manured) - Precision: 0.62  - Recall: 0.55  - F1: 0.58
Class 1 (Manured)  - Precision: 0.59  - Recall: 0.66  - F1: 0.62
----------------------------------------------------------------------------------------------------


### Support Vector Classifier

In [9]:
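# define the features and target variables
# NOTE: X1 (two features) is defined here, but the call below passes X,
# the six-feature set defined in the previous cells; the summary shown refers to X.
X1 = s2_df_mod_balanced[['EOMI3', 'SCI']]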
y = s2_df_mod_balanced.iloc[:, -1]

# Measure the performances using k-fold cross validation
utils.measure_scv_performances(X, y, SVC(), None, n_folds=5, random_state=3)

----------------------------------------------------------------------------------------------------
Summary: SVC(), None, 5 KFolds, 0.021s (elapsed time)

Class 0 (~Manured) - Precision: 0.53  - Recall: 0.59  - F1: 0.56
Class 1 (Manured)  - Precision: 0.54  - Recall: 0.48  - F1: 0.51
----------------------------------------------------------------------------------------------------
