# Baseline model notebook
*by Max*

In this notebook I create a simple baseline model with the scikit-learn DummyClassifier on our already feature-engineered data.

Import the modules, set the working directories and load the data.

In [4]:
# Import the needed modules
import numpy as np
import pandas as pd

# Import our own modules from the src folder
import sys
sys.path.append('../')
from src.find_repo_root import get_repo_root

# Set a random seed
RSEED = 42
np.random.seed(RSEED)

In [8]:
# Set the directory of the data 
ROOT_DIR = get_repo_root()
DATA_DIR = f"{ROOT_DIR}/data"
# Load the base data from the CSV files
df_train = pd.read_csv(f'{DATA_DIR}/Train_Dataset4.csv')
df_test = pd.read_csv(f'{DATA_DIR}/Test_Dataset.csv')

In [6]:
df_train.head()

Unnamed: 0,B02_10,B02_11,B02_5,B02_6,B02_7,B02_8,B02_9,SIPI2_10,SIPI2_11,SIPI2_5,...,B12_11,B12_5,B12_6,B12_7,B12_8,B12_9,field_id,field_size,tile_id,label
0,17.57947,26.36203,25.397352,18.781458,13.030905,12.876821,13.313466,-0.245756,-0.912282,-1.186115,...,83.79691,92.8234,48.43046,39.593819,35.895364,32.980132,4,151,2526,8
1,15.625155,30.736414,20.636646,14.451087,12.849896,12.036879,10.022516,-0.368229,-1.585534,-2.862939,...,86.229038,80.363353,57.829969,48.347308,48.054347,41.5,14,644,979,8
2,39.258299,40.167746,34.846287,34.478986,32.261831,25.978018,32.592746,-1.813169,-1.902605,-2.430946,...,128.347796,112.93653,106.362693,104.270466,78.17648,111.094649,20,579,632,8
3,30.529762,31.458333,21.476191,23.166666,21.714286,36.603175,28.380952,-1.469586,-1.584824,-1.77497,...,91.32738,59.317463,71.35714,60.195237,71.809521,82.873017,25,42,1779,3
4,24.042105,31.447369,20.434211,11.144737,15.122807,14.789474,23.780702,-0.628406,-0.767568,-0.581649,...,92.26316,78.157895,36.76316,110.824563,49.315791,95.947369,40,38,229,3
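
The first column in the output above is named `Unnamed: 0`, which usually appears when a DataFrame was written to CSV together with its index. The cells below do not drop it, so it would end up in the feature matrix. A minimal sketch of removing it, using a tiny made-up CSV (the values and the reduced column set are illustrative, not the real data):

```python
import io

import pandas as pd

# Hypothetical miniature of the training CSV, including the stray index column
csv_text = "Unnamed: 0,B02_10,label\n0,17.5,8\n1,15.6,8\n"
df = pd.read_csv(io.StringIO(csv_text))

# Drop the stray index column if it is present; errors="ignore" keeps this
# safe for files that do not contain it
df_clean = df.drop(columns=["Unnamed: 0"], errors="ignore")
print(list(df_clean.columns))
```

For the DummyClassifier this makes no difference, since it ignores the features, but it may matter for the later models.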


## Baseline Model

For this baseline model, we use the per-field mean of each band as well as the mean of a few selected spectral indices. As the model we use the very simple DummyClassifier, which gives us a reference point against which the other models can be compared.

We chose the F1-score (macro-averaged over the classes) and accuracy as metrics, since the main goal is to correctly identify as many plants as possible. False positives and false negatives are equally costly for us, so we use the F1-score, which is the harmonic mean of precision and recall. In addition, we keep an eye on the cross-entropy, because later we will work with the probabilities with which a class is assigned to a field.
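
To make the three metrics concrete, here is a small sketch on made-up labels and probabilities for three hypothetical classes (none of this is the notebook's data). Accuracy and macro F1 are computed from hard predictions, while the cross-entropy is computed from the predicted probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Made-up example: six samples, three classes
labels = np.array([0, 1, 2])
y_true = np.array([0, 0, 1, 1, 2, 2])

# One row per sample, one column per class; each row sums to 1
y_proba = np.array([
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.20, 0.60, 0.20],
    [0.10, 0.40, 0.50],  # true class is 1, but class 2 gets the highest probability
    [0.10, 0.20, 0.70],
    [0.05, 0.05, 0.90],
])

# Hard predictions: the class with the highest probability
y_pred = labels[y_proba.argmax(axis=1)]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
ce = log_loss(y_true, y_proba, labels=labels)

print(f"accuracy: {acc:.3f}, macro F1: {macro_f1:.3f}, cross-entropy: {ce:.3f}")
```

The macro average weights every class equally, so a class that is never predicted pulls the score down regardless of how many samples it has.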

Here we split the features and the target for the train and test data.

In [15]:
# Get X for the train and test data
X_train = df_train.drop(columns=['label', 'field_id', 'tile_id'])
X_test = df_test.drop(columns=['label', 'field_id', 'tile_id'])

# Get y for the train and test data (cast the labels to integers)
y_train = df_train['label'].astype(int)
y_test = df_test['label'].astype(int)

Next we fit the DummyClassifier on the training data.

In [16]:
from sklearn.dummy import DummyClassifier
# Fit the dummy classifier (the variable is called rf, but this is not a random forest)
rf = DummyClassifier(random_state = RSEED)
rf.fit(X_train, y_train)
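
The cell above does not set a `strategy`, so the classifier falls back to the library default, which is `"prior"` in recent scikit-learn releases (older releases used a different default). As a rough sketch of what a few strategies return from `predict_proba`, here is a toy example with made-up labels (the feature values are irrelevant because the DummyClassifier ignores them):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: 8 samples, class 0 appears 4 times, classes 1 and 2 twice each
X = np.zeros((8, 1))
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])

results = {}
for strategy in ("prior", "most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=42).fit(X, y)
    # Probabilities predicted for one sample (identical for every sample)
    results[strategy] = clf.predict_proba(X[:1])[0]
    print(strategy, results[strategy])
```

With `"prior"` the probabilities are the class frequencies of the training labels, with `"most_frequent"` all probability mass goes to the most common class, and with `"uniform"` every class gets the same probability.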

In [17]:
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)

In [18]:
y_proba_train = rf.predict_proba(X_train)
y_proba_test = rf.predict_proba(X_test)

Here are the results of our DummyClassifier model.

In [19]:
from sklearn.metrics import accuracy_score, f1_score, log_loss

print('---'*12)
print(f'Accuracy on train data: {round(accuracy_score(y_train, y_pred_train), 3)}')
print(f'Accuracy on test data: {round(accuracy_score(y_test, y_pred_test), 3)}')
print('---'*12)
print(f'F1-score on train data: {round(f1_score(y_train, y_pred_train, average="macro"), 3)}')
print(f'F1-score on test data: {round(f1_score(y_test, y_pred_test, average="macro"), 3)}')
print('---'*12)
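# Pass the classes seen during fitting so the probability columns are matched to the right labels
print(f'Cross-entropy on train data: {round(log_loss(y_train, y_proba_train, labels=rf.classes_), 3)}')
print(f'Cross-entropy on test data: {round(log_loss(y_test, y_proba_test, labels=rf.classes_), 3)}')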
print('---'*12)

------------------------------------
Accuracy on train data: 0.125
Accuracy on test data: 0.094
------------------------------------
F1-score on train data: 0.028
F1-score on test data: 0.019
------------------------------------
Cross-entropy on train data: 2.079
Cross-entropy on test data: 1.97
------------------------------------


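We see that the baseline is very low: on the test data the DummyClassifier reaches an accuracy of 0.094 and a macro F1-score of 0.019, so every real model should clearly beat these values.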
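
As a plausibility check, the training scores reported above (accuracy 0.125, macro F1 0.028, cross-entropy 2.079) are what one would expect from a classifier that always predicts the same class on eight equally frequent classes. The notebook does not state the number of classes or their balance, so `K = 8` and the balanced labels below are assumptions made only for this sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

K = 8  # assumed number of classes (not confirmed by the notebook)

# Synthetic, perfectly balanced labels: 100 samples per class
y_true = np.repeat(np.arange(K), 100)

# A constant prediction of one class and uniform class probabilities
y_pred = np.zeros_like(y_true)
y_proba = np.full((len(y_true), K), 1.0 / K)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning for classes that are never predicted
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
ce = log_loss(y_true, y_proba, labels=np.arange(K))

print(round(acc, 3), round(macro_f1, 3), round(ce, 3))
```

Under these assumptions the accuracy is 1/K, the macro F1 is 2/(K(K+1)) and the cross-entropy is ln(K). The test scores in the notebook differ from these values, so the real class distribution is probably not exactly uniform.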