# Baseline model notebook
*by Max*

In this notebook I'll attempt to create a simple baseline model for our data.

Import the modules, set the working directories and load the data.

In [9]:
# Import the needed modules
import numpy as np
import pandas as pd

# import own modules from the scr folder
import sys
sys.path.append('../src/')
from train_test_function import train_test_split_fields

# Set a random seed
RSEED = 42
np.random.seed(RSEED)

In [10]:
# Set the directory of the data 
OUTPUT_DIR = '../data'
# Load the base data from the CSV files
df = pd.read_csv(f'{OUTPUT_DIR}/mean_band_perField_perDate.csv')

In [11]:
df.head()

Unnamed: 0,field_id,date,label,B02,B03,B04,B08,B11,B12,CLM
0,1,2017-04-01,4,21.934084,29.180065,35.55466,62.490353,68.3971,46.04019,255.0
1,1,2017-04-11,4,14.844051,23.114147,30.607718,58.736336,73.43569,48.863342,0.0
2,1,2017-04-21,4,13.385852,21.596462,29.223473,57.065918,73.66881,49.313503,0.0
3,1,2017-05-01,4,15.408361,22.471062,29.371382,56.434082,71.05788,46.557877,0.0
4,1,2017-05-11,4,54.829582,65.73955,72.90675,95.67203,66.14791,58.643085,255.0


Convert the absolute date to relative date in form of days since april.

In [13]:
# Convert the date column to datetime object
df['date'] = pd.to_datetime(df['date'])#
# Calculate the days from april as column to get a relative time
df['days_from_april_days'] =  df['date'] - pd.to_datetime('2017-04-01')
df['days_from_april_days'] = df['days_from_april_days'].dt.days

In [14]:
df.head()

Unnamed: 0,field_id,date,label,B02,B03,B04,B08,B11,B12,CLM,days_from_april_days
0,1,2017-04-01,4,21.934084,29.180065,35.55466,62.490353,68.3971,46.04019,255.0,0
1,1,2017-04-11,4,14.844051,23.114147,30.607718,58.736336,73.43569,48.863342,0.0,10
2,1,2017-04-21,4,13.385852,21.596462,29.223473,57.065918,73.66881,49.313503,0.0,20
3,1,2017-05-01,4,15.408361,22.471062,29.371382,56.434082,71.05788,46.557877,0.0,30
4,1,2017-05-11,4,54.829582,65.73955,72.90675,95.67203,66.14791,58.643085,255.0,40


## Baseline Model

For the first base model, we only worked on the mean bands for each field and chose a RandomForest classifier, as this is a commonly used model for raster data. 

We chose the F1 score and Accuracy as metrics, since the main goal is to correctly identify as many plants as possible. Neither FP nor FN are particularly bad or good, hence the harmonic mean F1. In addition, we also have an eye on the cross-entropy, because later we will deal with the probabilities with which a class is assigned to a field. 

Here we do the train-test-split of the data.

In [15]:
# Do the train-test-split
df_train, df_test = train_test_split_fields(df, train_size=0.7, random_state=RSEED)
# Do the validation split
df_train_val, df_test_val = train_test_split_fields(df_train, train_size=0.7, random_state=RSEED)

In [16]:
# Get X for the train and validation data
X_train = df_train_val.drop(columns=['label', 'field_id', 'date'])
X_val = df_test_val.drop(columns=['label', 'field_id', 'date'])

# Get y for the train and validation data
y_train = df_train_val['label']
y_train = y_train.astype(int)
y_val = df_test_val['label']
y_val = y_val.astype(int)

In [17]:
labels = y_train.unique()

Here the modelling is done.

In [10]:
from sklearn.ensemble import RandomForestClassifier
# Fitting the RF model
rf = RandomForestClassifier(n_estimators = 20, random_state = RSEED, n_jobs = -1, verbose=1)
rf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   50.2s finished


In [11]:
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  20 out of  20 | elapsed:   15.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  20 out of  20 | elapsed:    6.4s finished


In [12]:
y_proba_train = rf.predict_proba(X_train)
y_proba_test = rf.predict_proba(X_test)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  20 out of  20 | elapsed:   15.0s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  20 out of  20 | elapsed:    6.6s finished


And the results of our first model. 

In [23]:
from sklearn.metrics import accuracy_score, f1_score, log_loss

print('---'*12)
print(f'Accuracy on train data: {round(accuracy_score(y_train, y_pred_train), 3)}')
print(f'Accuracy on test data: {round(accuracy_score(y_test, y_pred_test), 3)}')
print('---'*12)
print(f'F1-score on train data: {round(f1_score(y_train, y_pred_train, average="macro"), 3)}')
print(f'F1-score on test data: {round(f1_score(y_test, y_pred_test, average="macro"), 3)}')
print('---'*12)
print(f'Cross-entropy on train data: {round(log_loss(y_train, y_proba_train, labels=labels), 3)}')
print(f'Cross-entropy on test data: {round(log_loss(y_test, y_proba_test, labels=labels), 3)}')
print('---'*12)

------------------------------------
Accuracy on train data: 0.991
Accuracy on test data: 0.417
------------------------------------
F1-score on train data: 0.99
F1-score on test data: 0.324
------------------------------------
Cross-entropy on train data: 0.328
Cross-entropy on test data: 4.048
------------------------------------
