# Baseline model notebook
*by Max*

In this notebook I'll attempt to create a simple baseline model for our data.
The first step is to connect the Google Drive, import the modules and load the data.

Import the modules, set the working directories and load the data.

In [1]:
# Import the needed modules
import numpy as np
import pandas as pd

In [47]:
# Set the directory of the data 
OUTPUT_DIR = './data'
# Load the base data from the CSV files
df_meta = pd.read_csv(f'{OUTPUT_DIR}/meta_data_fields_bands.csv')
df = pd.read_csv(f'{OUTPUT_DIR}/mean_band_perField_perDate.csv')

In [49]:
df_meta.head()

Unnamed: 0,field_id,tile_id,label,dates
0,1,2171,4,[numpy.datetime64('2017-04-01T00:00:00.0000000...
1,2,1703,7,[numpy.datetime64('2017-04-01T00:00:00.0000000...
2,3,2214,6,[numpy.datetime64('2017-04-01T00:00:00.0000000...
3,4,2526,8,[numpy.datetime64('2017-04-01T00:00:00.0000000...
4,6,544,4,[numpy.datetime64('2017-04-01T00:00:00.0000000...


In [50]:
df.head()

Unnamed: 0,field_id,date,label,B02,B03,B04,B08,B11,B12,CLM
0,1,2017-04-01,4,21.934084,29.180065,35.55466,62.490353,68.3971,46.04019,253.7701
1,1,2017-04-11,4,14.844051,23.114147,30.607718,58.736336,73.43569,48.863342,0.0
2,1,2017-04-21,4,13.385852,21.596462,29.223473,57.065918,73.66881,49.313503,0.0
3,1,2017-05-01,4,15.408361,22.471062,29.371382,56.434082,71.05788,46.557877,36.897106
4,1,2017-05-11,4,54.829582,65.73955,72.90675,95.67203,66.14791,58.643085,255.0


Convert the absolute date to a relative date, expressed as the number of days since 1 April 2017.

In [29]:
# Convert the date column to datetime objects
df['date'] = pd.to_datetime(df['date'])
# Calculate the number of days since 1 April 2017 to get a relative time
df['days_from_april_days'] = df['date'] - pd.to_datetime('2017-04-01')
df['days_from_april_days'] = df['days_from_april_days'].dt.days

## Baseline Model

For the first baseline model, we work only with the mean band values of each field and use a RandomForest classifier, as this is a commonly used model for raster data.

We chose the F1 score and accuracy as metrics, since the main goal is to correctly identify as many plants as possible. Neither false positives (FP) nor false negatives (FN) are particularly costly, so we use the F1 score, which is the harmonic mean of precision and recall and weights both kinds of error equally. In addition, we keep an eye on the cross-entropy, because later we will work with the probabilities with which a class is assigned to a field.
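
To make the three metrics concrete, here is a minimal sketch on toy data. The labels and probabilities are made up for illustration and are not taken from our dataset; only the scikit-learn metric functions are the same ones used later in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Toy example with three hypothetical classes (not the real crop labels)
y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 2, 2, 2, 3, 1])

# Hypothetical class probabilities, one column per class in sorted order (1, 2, 3)
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.5, 0.2, 0.3],
])

# Share of rows that were classified correctly
acc = accuracy_score(y_true, y_pred)
# Macro F1: F1 is computed per class and then averaged with equal class weights
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Cross-entropy: mean negative log of the probability given to the true class
ce = log_loss(y_true, y_proba, labels=[1, 2, 3])
# The same value computed by hand (classes 1..3 map to columns 0..2)
ce_manual = -np.mean(np.log(y_proba[np.arange(len(y_true)), y_true - 1]))

print(acc, macro_f1, ce, ce_manual)
```

Accuracy and F1 only look at the predicted class, while the cross-entropy also penalises a correct prediction that was made with low confidence.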

Here we split the data into train and test sets by field, so that all dates of a field end up on the same side of the split.

In [30]:
# Set a random seed
RSEED = 42
np.random.seed(RSEED)

In [31]:
# Split train and test
# Use the field_ids to split the data to train and test
train_size = 0.7

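field_ids = df['field_id'].unique()
n_fields = df['field_id'].nunique()
# Draw the training fields at random without replacement
train_fields = np.random.choice(field_ids, int(n_fields * train_size), replace=False)
# All remaining fields form the test set (np.isin replaces the deprecated np.in1d)
test_fields = field_ids[~np.isin(field_ids, train_fields)]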

In [32]:
X_train = df[df['field_id'].isin(train_fields)]
X_train = X_train.drop(columns=['label', 'field_id', 'date'])

X_test = df[df['field_id'].isin(test_fields)]
X_test = X_test.drop(columns=['label', 'field_id', 'date'])

y_train = df[df['field_id'].isin(train_fields)]['label']
y_train = y_train.astype(int)
y_test = df[df['field_id'].isin(test_fields)]['label']
y_test = y_test.astype(int)
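
As an alternative to the manual split by field_id, scikit-learn's GroupShuffleSplit does the same kind of grouped split. The sketch below runs on a small made-up frame (the name toy and its values are hypothetical), so it is an illustration under that assumption and not the split used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame standing in for df: 10 hypothetical fields with 3 dates each
toy = pd.DataFrame({
    "field_id": np.repeat(np.arange(1, 11), 3),
    "B02": np.random.default_rng(42).random(30),
    "label": np.repeat([4, 7, 6, 8, 4, 7, 6, 8, 4, 7], 3),
})

# One split in which every row of a field lands on the same side
splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["field_id"]))

# The split returns positional row indices, hence iloc
train_ids = set(toy.iloc[train_idx]["field_id"])
test_ids = set(toy.iloc[test_idx]["field_id"])

# No field should appear in both sets
print(len(train_ids), len(test_ids), train_ids & test_ids)
```

Splitting by field instead of by row matters because the rows of one field are strongly correlated across dates; a row-wise split would let the model see a test field during training.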

In [None]:
# Class labels present in the training data, passed to log_loss below
labels = np.sort(y_train.unique())

Here we fit the model.

In [35]:
from sklearn.ensemble import RandomForestClassifier

# Fit the random forest model with default hyperparameters
rf = RandomForestClassifier(random_state=RSEED, n_jobs=-1, verbose=1)
rf.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.9min


In [None]:
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)

In [None]:
y_proba_train = rf.predict_proba(X_train)
y_proba_test = rf.predict_proba(X_test)
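
Each row is one field on one date, so predict_proba returns one probability vector per row, not per field. One possible way to get a probability per field is to average the row probabilities over all dates of a field. The sketch below uses made-up numbers; row_field_ids, classes and row_proba are hypothetical stand-ins for the field ids of the test rows, rf.classes_ and y_proba_test.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: the field id of each test row and the class order
# that a fitted classifier would expose as classes_
row_field_ids = np.array([1, 1, 1, 2, 2])
classes = np.array([4, 7, 8])

# Hypothetical per-row output of predict_proba (one column per class)
row_proba = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.1, 0.6],
])

# Average the class probabilities over all dates of a field
field_proba = (
    pd.DataFrame(row_proba, columns=classes)
    .groupby(row_field_ids)
    .mean()
)
# The most probable class per field
field_pred = field_proba.idxmax(axis=1)

print(field_proba)
print(field_pred)
```

Averaging keeps each field's probabilities summing to one, so the result can be passed to log_loss in the same way as the per-row probabilities.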

Finally, the results of our first model.

In [None]:
from sklearn.metrics import accuracy_score, f1_score, log_loss

print(f'Accuracy on train data: {accuracy_score(y_train, y_pred_train)}')
print(f'Accuracy on test data: {accuracy_score(y_test, y_pred_test)}')
print('---'*10)
print(f'F1-score on train data: {f1_score(y_train, y_pred_train, average="macro")}')
print(f'F1-score on test data: {f1_score(y_test, y_pred_test, average="macro")}')
print('---'*10)
print(f'Cross-entropy on train data: {log_loss(y_train, y_proba_train, labels=labels)}')
print(f'Cross-entropy on test data: {log_loss(y_test, y_proba_test, labels=labels)}')
print('---'*10)