### CLASSIFICATION MODEL IN "DESCRIPTIVE" MODE

"Descriptive" mode means that we try to explain one IFN target variable using the other IFN variables, at the same time t (as described in this [diagram](schematic_diagram_of_descriptive_models.png)).

This notebook is an example of a Machine Learning classification model: a simple scikit-learn Ridge classifier (`RidgeClassifier`), with a simple grid search over its hyperparameters.

We have a categorical target, "TAUX_COUV_RAJ" (regeneration cover rate of the forest plot). Because it is an ordered categorical variable, we could treat the problem either as a linear regression or as a classification. Here we present the second option, but we also tried the first.

Ridge regularization helps us limit overfitting and also makes the importance of the features clearer (if we could use Lasso in classification, it would be even clearer).
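
One option not used in this notebook: scikit-learn's `LogisticRegression` accepts an L1 penalty, which behaves like Lasso for classification and can set some coefficients exactly to zero. The sketch below runs on synthetic data (an assumption, not the IFN dataset), and the value of `C` is arbitrary. Recent scikit-learn releases may emit a deprecation warning for the `penalty` argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the IFN dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_classes=3, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# L1 penalty (Lasso-like); a smaller C means stronger regularization.
l1_clf = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=5000)
l1_clf.fit(X_demo, y_demo)

# Count the coefficients that the L1 penalty pushed exactly to zero.
n_zero = int((l1_clf.coef_ == 0).sum())
print(f"{n_zero} of {l1_clf.coef_.size} coefficients are exactly zero")
```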



In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV

from sklearn.linear_model import RidgeClassifier, LogisticRegression

Import data :

In [2]:
data_merge = pd.read_excel('../1_DATA_global_processing/data_processed_and_merged/big_merge_V2_meteo_SAT.xlsx', sheet_name='Sheet1')

In [3]:
data_merge.drop('Unnamed: 0', axis=1, inplace=True)

Target definition :

In [4]:
TARGET =  'TAUX_COUV_RAJ'

We can display the distribution of the different classes of the target :

In [5]:
fig = px.histogram(data_merge[TARGET], nbins=100)
fig.show()

We need to remove the rows that have no target value :

In [6]:
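# Keep only rows with a usable target: not missing and not -1 ("not determined").
# Note: `!= np.nan` is always True, so missing values are tested with notna().
data_red = data_merge.loc[data_merge[TARGET].notna(), :]
data_red = data_red.loc[data_red[TARGET] != -1, :].copy()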
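
A note on filtering missing values: a comparison such as `series != np.nan` is True for every row, because NaN is never equal to anything, including itself. The minimal sketch below uses a toy series (hypothetical values) to show the difference with `notna()`.

```python
import numpy as np
import pandas as pd

# Toy target with one missing value and one "-1" code (hypothetical values).
s = pd.Series([2, np.nan, -1, 4], name="TAUX_COUV_RAJ")

# This mask is True everywhere, so it filters nothing.
mask_nan_compare = s != np.nan

# This mask drops both the missing value and the -1 code.
mask_valid = s.notna() & (s != -1)

print(mask_nan_compare.tolist())
print(s[mask_valid].tolist())
```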

In [7]:
len(data_red)

7174

Define Y as our target series :

In [8]:
Y = data_red[TARGET]

In [9]:
Y

1       2
2       2
3       2
4       6
6       4
       ..
9605    2
9606    4
9608    2
9610    2
9611    2
Name: TAUX_COUV_RAJ, Length: 7174, dtype: int64

Feature Engineering (collaborator idea) :

In [10]:
# adding aridity index
data_red["AI"] = data_red['PRCP_GROWTH'] / data_red['TAVE_GROWTH']
# adding H/D index
data_red["H_D"] = data_red['HAUTEUR_ARBRE'] / data_red['DBH']

Selection of features :

In [11]:
targets_cat__ord_feat = [] #'TAUX_COUV_RAJ'

In [12]:
targets_numeric_feat = ['PERF_CROI','SURF_TER_HA', '25_GRID_PER', 'UNIT_VOL_BOIS_MANQUANT', 'UNIT_ACCR']

Above, we split the features into targets and non-targets, because some possible targets are used as features for other targets ("descriptive" mode allows us to do this).

In [13]:
cat_strict = ['PRODREG','UNIT_VEG_GROS','MODE_REGEN','INTENSITE_EXPLOIT','NIV_DEV','RELIEF','DEG_FERMETURE','STR_PPL', 'ESPECE_DOM', 'TYPE_FORET'] #'TYP_RAJ_PPL','TAUX_COUV_RAJ_ASS'

In [14]:
cat_ord_miss = ['TAILLE_PPL', 'HT_VEG'] # to remove : 'LFI'

In [15]:
numerical = ['ALT','SLOPE25','QUAL_STATION','AGE_PPL','DIV_STR_PPL','TIGES_VIV_H', 'SDI', 'FEUILL_PER', 'CONIF_PER' , 'DBH', 'HAUTEUR_ARBRE', 'AGE_ARBRE', 'PRCP', 'TAVE_AVG', 'TAVE', 'TAVE_GROWTH', 'PRCP_S_S', 'PRCP_G_S', 'NDVI', 'EVI', 'NDMI', 'NDWI', 'DSWI', 'AI', 'H_D'] # to remove :

Preprocessing for cat_ord_miss :

According to the documentation, class "-1" means "not determined". So, for our ordered categorical features, we can turn this class into a missing value and then use the feature as a numerical one. The imputer will fill in the missing values. This preprocessing allows us to reduce the number of features.

In [16]:
for cat in (cat_ord_miss + targets_cat__ord_feat + [TARGET]):
  data_red[cat] = data_red[cat].apply(lambda v : v if v!=-1 else np.nan)
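
The same recoding can be written without a Python-level lambda. This is only a sketch on a toy dataframe (hypothetical values), using the pandas `replace` method on the columns concerned.

```python
import numpy as np
import pandas as pd

# Toy ordered categorical columns with the "-1" code (hypothetical values).
toy_ord = pd.DataFrame({"TAILLE_PPL": [1, -1, 3], "HT_VEG": [2, 2, -1]})

# Vectorized equivalent of the per-value lambda: -1 becomes a missing value.
toy_clean = toy_ord.replace(-1, np.nan)

print(toy_clean)
print(toy_clean.isna().sum())
```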

In [17]:
for cat in cat_ord_miss:
  print(data_red[cat].dtypes)

float64
int64


In [18]:
numerics_feats = numerical + targets_numeric_feat + cat_ord_miss
categorical_feats = cat_strict + targets_cat__ord_feat

In [19]:
len(numerics_feats)

32

In [20]:
len(categorical_feats)

10

Reduction of the dataframe with all the selected features :

In [21]:
data_red = data_red[numerics_feats + categorical_feats]

Splitting :

In [23]:
X_train, X_test, y_train, y_test = train_test_split(data_red, Y, test_size=0.2, random_state=2, stratify=Y)
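
The `stratify` argument keeps the class proportions identical in the train and test sets, which matters when classes are imbalanced. A minimal sketch on toy labels (hypothetical, 80 % / 20 % split between two classes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 samples of class 0, 20 of class 1 (hypothetical).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2, stratify=y_toy
)

# Class counts in each split keep the original 80/20 ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```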

PREPROCESSING PIPELINES :

For the numerical features, we use a KNN imputer: it replaces each missing value with the average of that feature over the nearest neighbors of the sample (5 neighbors by default in scikit-learn).
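
As a minimal illustration of KNN imputation, the sketch below uses a small toy array (not the IFN data) and `n_neighbors=2`, so each missing entry becomes the mean of that column over the two closest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy array with two missing entries (hypothetical values).
X_missing = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distances are computed on the coordinates that both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_missing)

print(X_filled)
```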

In [25]:
numerics_transforms = Pipeline(
    [('imputer',KNNImputer()),
    ('encoder',StandardScaler())
])
categorials_transforms = Pipeline([
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('encoder',OneHotEncoder(drop="first"))
])

preprocessor = ColumnTransformer(
    [("num", numerics_transforms, numerics_feats),
     ("cat", categorials_transforms, categorical_feats)])

In [26]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)





In [27]:
np.shape(X_train)

(5739, 99)

In [28]:
y_train

4854    2
8044    2
6661    2
3066    1
1183    2
       ..
6704    3
8640    3
7449    3
7168    2
4207    3
Name: TAUX_COUV_RAJ, Length: 5739, dtype: int64

Model definition and GridSearch definition :

In [79]:
model =  RidgeClassifier(max_iter=10000)

In [80]:
params = {
    'alpha':[0.00000001, 0.00000005, 0.00000008]
}

The values above are the final hyperparameter grid, selected after going back and forth between the parameters and the training results.

In [82]:
grid = GridSearchCV(model, param_grid=params, scoring='accuracy', verbose=1)
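
A possible variant, not the approach used in this notebook: the preprocessing can be placed inside the estimator passed to `GridSearchCV`, so that the imputers and the scaler are refitted on each training fold. The sketch below uses toy data, and the column names, step names (`prep`, `clf`) and `alpha` values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical columns; not the IFN dataset).
rng = np.random.default_rng(0)
n = 120
toy_X = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_a": rng.choice(["x", "y", "z"], size=n),
})
toy_X.loc[::10, "num_a"] = np.nan  # inject some missing values
toy_y = (toy_X["num_b"] > 0).astype(int)

toy_prep = ColumnTransformer([
    ("num", Pipeline([("imputer", KNNImputer()),
                      ("scaler", StandardScaler())]), ["num_a", "num_b"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(drop="first"))]), ["cat_a"]),
])

# Preprocessing and model in one estimator: each CV fold refits both.
full_model = Pipeline([("prep", toy_prep), ("clf", RidgeClassifier())])

# Parameters of a pipeline step are addressed as "<step>__<param>".
toy_grid = GridSearchCV(full_model, param_grid={"clf__alpha": [0.1, 1.0, 10.0]},
                        scoring="accuracy")
toy_grid.fit(toy_X, toy_y)
print(toy_grid.best_params_)
```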

Training :

In [83]:
y_train.unique()

array([2, 1, 3, 4, 6, 5], dtype=int64)

In [84]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


GridSearchCV(estimator=RidgeClassifier(max_iter=10000),
             param_grid={'alpha': [1e-08, 5e-08, 8e-08]}, scoring='accuracy',
             verbose=1)

In [85]:
grid.best_estimator_

RidgeClassifier(alpha=1e-08, max_iter=10000)

Predictions and accuracy scores :

In [93]:
y_pred = grid.best_estimator_.predict(X_train)
y_pred_test = grid.best_estimator_.predict(X_test)

In [94]:
print(f'Score on train set {accuracy_score(y_train, y_pred)}')

Score on train set 0.46454086077713885


In [95]:
print(f'Score on test set {accuracy_score(y_test, y_pred_test)}')

Score on test set 0.4564459930313589


In [96]:
train_scores = cross_val_score(grid.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
test_scores = cross_val_score(grid.best_estimator_, X_test, y_test, cv=5, scoring='accuracy')
print(f'Train score mean : {np.mean(train_scores)}')
print(f'Train score std : {np.std(train_scores)}')
print(f'Test score mean : {np.mean(test_scores)}')
print(f'Test score std : {np.std(test_scores)}')

Train score mean : 0.43456935073772207
Train score std : 0.01366385011227958
Test score mean : 0.42160278745644597
Test score std : 0.021478794435431978


FEATURES EXTRACTION :

First, we need to build a list of the features involved in the preprocessing, in the order of the preprocessing. Categorical features are expanded into '_0', '_1', '_2', etc...

In [98]:
list_features_in = []
for feat in numerics_feats:
  list_features_in.append(feat)
for cat in categorical_feats:
  nb_lab = len(data_red[cat].unique())-1
  for i in range(nb_lab):
    list_features_in.append(f'{cat}_{i}')

Then, we store the coefficients of the model in a dataframe :

In [99]:
df_coef_inter = pd.DataFrame(grid.best_estimator_.coef_)

Because it is a classifier with 6 classes, there are 6 coefficients per feature (one row per class)...

In [101]:
df_coef_inter

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,89,90,91,92,93,94,95,96,97,98
0,0.010504,-0.007878,0.006442,-0.023787,-0.029942,-0.01403,-0.049006,0.061961,0.046955,0.022295,...,0.019985,0.045683,-0.01306,0.010795,0.005324,-0.045422,0.014585,0.065774,0.071548,-0.077132
1,0.024258,0.048327,-0.018401,0.075485,-0.063203,-0.103761,0.435978,0.123409,0.158691,-0.01947,...,-0.016911,0.033146,-0.147026,0.075314,-0.047081,0.056089,0.111657,-0.089196,-0.016516,0.018601
2,0.047552,0.02904,0.042612,0.036794,-0.010448,-0.037256,0.398939,-0.02777,0.016179,-0.007069,...,-0.051836,-0.006302,-0.153867,-0.065676,-0.05879,-0.011545,-0.043917,0.022137,-0.057662,0.061602
3,-0.038798,-0.044151,0.014024,-0.029351,0.062511,0.114255,-0.3275,-0.179932,-0.18321,0.025712,...,0.016875,-0.09604,0.105037,-0.062523,0.006661,2.9e-05,-0.034249,0.011,-0.033777,0.07909
4,-0.01634,-0.019701,-0.031471,-0.05133,0.030807,0.01928,-0.240589,0.118185,0.068218,-0.004889,...,0.088513,0.06715,0.208717,0.085493,0.085494,-0.005011,-0.040258,-0.03225,0.034489,-0.04895
5,-0.027177,-0.005638,-0.013207,-0.00781,0.010276,0.021511,-0.217822,-0.095853,-0.106833,-0.01658,...,-0.056625,-0.043636,0.000199,-0.043403,0.008392,0.005856,-0.007823,0.022532,0.001915,-0.033215


So, we now build a dataframe with the sum of the absolute values of the coefficients for each feature :

In [100]:
df_coef = pd.DataFrame(abs(df_coef_inter).sum(), columns=['Coeff'])
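
To make the aggregation concrete, here is a minimal sketch on a toy coefficient matrix (hypothetical values, 2 classes and 3 features). Absolute values are taken before summing, so that positive and negative coefficients of different classes do not cancel out.

```python
import numpy as np
import pandas as pd

# Toy coefficients: one row per class, one column per feature (hypothetical).
toy_coef = pd.DataFrame(
    np.array([[0.5, -0.125, 0.0],
              [-0.25, 0.25, 0.25]]),
    columns=["feat_a", "feat_b", "feat_c"],
)

# Sum of absolute coefficients over the classes, per feature.
toy_importance = toy_coef.abs().sum()

# A plain sum would let opposite signs cancel and understate feat_a and feat_b.
toy_plain_sum = toy_coef.sum()

print(toy_importance)
print(toy_plain_sum)
```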

In [102]:
df_coef['Features'] = list_features_in

In [103]:
df_coef = df_coef.set_index('Features')

And we can display it :

In [104]:
fig = px.bar(df_coef['Coeff'], title=f"Features importance for target : {TARGET} with Ridge Classifier")
fig.show()

We can see the importance of SDI (density of vegetation), SURF_TER_HA (another main target), UNIT_VEG_GROS (vegetation type), NIV_DEV (development level, a formula given by the forest observers), STR_PPL (structure of the plot), ESPECE_DOM (dominant species)...
Unfortunately, the spectral band indices built from the satellite images have very little importance, and neither do the weather data.