# GEO-AI Challenge for Cropland Mapping by ITU
_Antoine Saget_

This notebook presents the solution to the Zindi GEO-AI Challenge for Cropland Mapping by ITU, which achieved an accuracy of 0.943 on the private leaderboard.

We also provide a second notebook / Python script (simple_reproduction.ipynb / simple_reproduction.py) with much simpler code that reproduces the same results. 
That notebook will probably be easier to integrate into your own workflow, as it does not rely on any additional files or classes.

All parts are independent: you can skip to 5. to reproduce the private leaderboard solution, or start from 1. to get a better understanding of the data download and preprocessing steps.

Notebook table of contents:
1. Downloading the data from GEE
2. Data preprocessing
2. Study on the impact of timerange
3. Study on the impact of Sentinel-2 radiometric bands
4. Study on the impact of model choices
5. Best model

In [1]:
# Imports and seeds initializations
import ee
import folium
import random
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tabulate import tabulate
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score

from tqdm import tqdm

# Set seed for reproducibility
SEED = 2023
random.seed(SEED)
np.random.seed(SEED)



In [2]:
from constants import *

In [3]:
# Country bounds and timeranges
country_settings = {
    SUDAN: {
        COUNTRY_NAME: SUDAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[14.1, 33.1], [14.6, 33.6]]
    },
    AFGHANISTAN: {
        COUNTRY_NAME: AFGHANISTAN,
        START_DATE: '2022-04-01',
        END_DATE: '2022-04-30',
        BOUNDS: [[34.0, 70.2], [34.4, 70.8]]
    },
    IRAN: {
        COUNTRY_NAME: IRAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[32.0, 48.1], [32.5, 48.6]]
    }
}
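
The settings above can be sanity-checked against sample coordinates. The sketch below assumes that `BOUNDS` is ordered as `[[lat_min, lon_min], [lat_max, lon_max]]` (an inference from the values, not something the notebook states) and uses a hypothetical helper `country_of`. The test points are the first Afghanistan and Sudan rows of the training data printed in the dataset summary later in this notebook.

```python
# Assumption: BOUNDS is [[lat_min, lon_min], [lat_max, lon_max]].
bounds = {
    "Sudan": [[14.1, 33.1], [14.6, 33.6]],
    "Afghanistan": [[34.0, 70.2], [34.4, 70.8]],
    "Iran": [[32.0, 48.1], [32.5, 48.6]],
}


def country_of(lat, lon):
    """Return the first country whose bounding box contains the point (hypothetical helper)."""
    for country, ((lat_min, lon_min), (lat_max, lon_max)) in bounds.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return country
    return None


print(country_of(34.162491, 70.763668))  # Afghanistan
print(country_of(14.542826, 33.313483))  # Sudan
```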

# 1. Downloading the data from GEE

In this part, Sentinel-2 time series data is obtained in one of two ways:

1. Download the data from GEE, which might take up to 1h  
OR
2. Load the pre-downloaded data, which is much faster

Please note that both options output exactly the same data as of 06/10/2023, since the pre-downloaded data is just a collection of .csv files saved from the data obtained with GEE. However, with possible future changes to the GEE Sentinel-2 collection, the pre-downloaded data might become outdated.
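
The choice between the two options can be expressed as a small "load if cached, otherwise download" helper. This is only a sketch of the pattern, not the code used by the `Dataset` class: the function name `load_or_download`, the `download` callable and the cache path are all hypothetical.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_download(cache_path: Path, download: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached table if it exists; otherwise download it and cache it."""
    if cache_path.exists():
        # Option 2: reuse the pre-downloaded .csv file (fast).
        return pd.read_csv(cache_path, index_col=0)
    # Option 1: query the remote source (slow), then save the result for next time.
    df = download()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path)
    return df
```

Because the cached file is a snapshot, deleting it forces a fresh download if the remote collection has changed.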

In [4]:
# Run this cell if you want to download the data from GEE. This can take up to 1h and requires ~100MB of data available on your GEE account.
# Also, please update PROJECT_NAME to match your personal GEE project name.

# Authenticate and initialize Earth Engine
# ee.Authenticate()
# ee.Initialize()

In [5]:
# Replace with your own project name in GEE
PROJECT_NAME = "ee-antoinesaget"

In [12]:
from dataset import Dataset

ds = Dataset.from_files('Train.csv', 'Test.csv', 'Full dataset', country_settings, debug_level=1)
ds.load_all_optical_data()

########################################
Dataset info:
    Name : Full dataset
    Train shape: (1500, 6)
    Test shape: (1500, 6)
    TrainTest shape: (3000, 6)
    Center : [24.21, 51.94]
    Bounds :
        min : [14.10, 33.10]
        max : [34.31, 70.78]
    Train head:


Unnamed: 0,ID,Lat,Lon,Target,IsTrain,Country
0,ID_SJ098E7S2SY9,34.162491,70.763668,0.0,True,Afghanistan
1,ID_CWCD60FGJJYY,32.075695,48.492047,0.0,True,Iran
2,ID_R1XF70RMVGL3,14.542826,33.313483,1.0,True,Sudan
3,ID_0ZBIDY0PEBVO,14.35948,33.284108,1.0,True,Sudan
4,ID_C20R2C0AYIT0,14.419128,33.52845,0.0,True,Sudan


    Test head:


Unnamed: 0,ID,Lat,Lon,Target,IsTrain,Country
0,ID_9ZLHTVF6NSU7,34.254835,70.348699,,False,Afghanistan
1,ID_LNN7BFCVEZKA,32.009669,48.535526,,False,Iran
2,ID_SOYSG7W04UH3,14.431884,33.399991,,False,Sudan
3,ID_EAP7EXXV8ZDE,14.281866,33.441224,,False,Sudan
4,ID_QPRX1TUQVGHU,14.399365,33.109566,,False,Sudan


########################################
Country Sudan info:
    Start date: 2019-07-01
    End date: 2020-06-30
    Train shape: (500, 6)
    Test shape: (500, 6)
Country Afghanistan info:
    Start date: 2022-04-01
    End date: 2022-04-30
    Train shape: (500, 6)
    Test shape: (500, 6)
Country Iran info:
    Start date: 2019-07-01
    End date: 2020-06-30
    Train shape: (500, 6)
    Test shape: (500, 6)


# 5. Best model 

In [1]:
# Imports and seeds initializations
import random
import ee

import numpy as np
import pandas as pd

from dataset import Dataset, Dataset_training_ready
from model import Model, rf_builder, rf_builder_big, rf_builder_shallow
from constants import *
from utils import save_submission

# Set seeds for reproducibility
SEED = 2023
random.seed(SEED)
np.random.seed(SEED)



In [2]:
PROJECT_NAME = "ee-antoinesaget"

In [3]:
# Authenticate and initialize Earth Engine
# ee.Authenticate()
ee.Initialize()

In [4]:
# Country bounds and timeranges
country_settings = {
    SUDAN: {
        COUNTRY_NAME: SUDAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[14.1, 33.1], [14.6, 33.6]]
    },
    AFGHANISTAN: {
        COUNTRY_NAME: AFGHANISTAN,
        START_DATE: '2022-04-01',
        END_DATE: '2022-04-30',
        BOUNDS: [[34.0, 70.2], [34.4, 70.8]]
    },
    IRAN: {
        COUNTRY_NAME: IRAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[32.0, 48.1], [32.5, 48.6]]
    }
}

In [5]:
ds = Dataset.from_files('Train.csv', 'Test.csv', 'Full dataset', country_settings, debug_level=0)
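
The best-model cell that follows derives each country's start date by subtracting 96 weeks from a fixed end date. As a worked example, the snippet below resolves those windows to concrete dates. The end dates are copied from that cell; the plain country-name keys are used here only to keep the snippet self-contained.

```python
import pandas as pd

# End dates copied from the best-model cell; the window length is 96 weeks.
end_dates = {
    "Sudan": "2021-03-31",
    "Afghanistan": "2023-04-30",
    "Iran": "2020-06-30",
}
window = pd.Timedelta(weeks=96)  # 672 days

start_dates = {
    country: (pd.to_datetime(end) - window).strftime("%Y-%m-%d")
    for country, end in end_dates.items()
}
print(start_dates)
# {'Sudan': '2019-05-29', 'Afghanistan': '2021-06-27', 'Iran': '2018-08-28'}
```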

In [29]:
bands = [B2, B3, B4, B8, LON, LAT, NDVI, SCL]
country_settings_optimal = {}
for key, values in country_settings.items():
    country_settings_optimal[key] = values.copy()

# For each country, use a 96-week window ending on a fixed date
optimal_end_dates = {SUDAN: '2021-03-31', AFGHANISTAN: '2023-04-30', IRAN: '2020-06-30'}
for country, end_date in optimal_end_dates.items():
    start_date = pd.to_datetime(end_date) - pd.Timedelta('96W')
    country_settings_optimal[country][END_DATE] = end_date
    country_settings_optimal[country][START_DATE] = start_date.strftime('%Y-%m-%d')

# country_settings_optimal[SUDAN][END_DATE] = '2021-06-30'
# country_settings_optimal[SUDAN][START_DATE] = '2019-01-01'
# country_settings_optimal[AFGHANISTAN][END_DATE] = '2023-04-30'
# country_settings_optimal[AFGHANISTAN][START_DATE] = '2021-04-30'
# country_settings_optimal[IRAN][END_DATE] = '2021-06-30'
# country_settings_optimal[IRAN][START_DATE] = '2019-01-01'

ds_ts = Dataset_training_ready.get_ts_data_from(
    ds, bands, PROJECT_NAME, country_settings=country_settings_optimal)
# display(ds_ts._df_optical.head())
# display(ds_ts._df_optical.info())

preds = pd.Series(index=ds_ts.ids, dtype='uint8', name='Pred', data=0)
for model in [rf_builder, rf_builder_big, rf_builder_shallow]:
    # NOTE: the loop variable is not used below; rf_builder_shallow is trained on every pass
    # model = Model(model, ds_ts, SEED)
    model = Model(rf_builder_shallow, ds_ts, SEED)
    # acc, acc_class_0, acc_class_1, _, _ = model.train_with_cv_one_rf_per_country(debug_level=0)
    # print(f'Accuracy Score : {acc:.3f}')
    model.train_on_full_dataset_one_per_country()
    pred, ids = model.predict_on_test()
    preds.loc[ids] += pred

ids = preds.index.values
preds = (preds > 1).astype('uint8').values

# Check diff between original submission and current predictions
original = pd.read_csv('../submissions/10_05_19h_56m_49s_ts_rf_big_one_per_country_optimal_ranges.csv', index_col='ID', usecols=['ID', TARGET])
original[TARGET] = original[TARGET].astype('uint8')

# NOTE: both sides are uint8, so a 0 -> 1 change would wrap around and show up as 255
diff = original.loc[ids, TARGET] - preds
diff = diff[diff != 0]
display(diff)
print(f'Number of changes : {len(diff)}')

# save_submission(preds, ids, 'ts_rf_big_one_per_country_optimal_ranges') 
# 10_03_09h_39m_46s_ts_rf_one_per_country_optimal_ranges.csv


ID
ID_5HFMP7OVPTDY    1
ID_2CG94T368XYJ    1
ID_TR58TAFP4VMY    1
ID_QPGSNXD2EAEI    1
Name: Target, dtype: uint8

Number of changes : 4
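
The accumulation and `preds > 1` threshold in the cell above amount to a majority vote over three binary predictions. Here is a minimal sketch of that step on its own, with made-up sample IDs and made-up predictions standing in for the outputs of three different classifiers.

```python
import numpy as np
import pandas as pd

ids = ["a", "b", "c", "d"]
# Hypothetical 0/1 predictions from three different classifiers.
model_preds = [
    np.array([1, 0, 1, 0], dtype="uint8"),
    np.array([1, 0, 0, 0], dtype="uint8"),
    np.array([1, 1, 0, 0], dtype="uint8"),
]

votes = pd.Series(0, index=ids, dtype="uint8", name="Pred")
for pred in model_preds:
    votes += pred  # accumulate one vote per model and per sample

majority = (votes > 1).astype("uint8")  # 1 when at least 2 of the 3 models predict 1
print(majority.tolist())  # [1, 0, 0, 0]
```

Voting only changes the outcome when the models disagree; three identical models trained with the same seed would give the same result as a single one.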


In [28]:
ids = ['ID_2CG94T368XYJ', 'ID_QPGSNXD2EAEI', 'ID_TR58TAFP4VMY',
       'ID_5HFMP7OVPTDY']
# Look up the changed IDs in the training features
ids = pd.Series(index=ids, data=ids)
ds_ts.X_train.loc[ds_ts.X_train.index.isin(ids), :]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5430,5431,5432,5433,5434,5435,5436,5437,5438,5439


In [26]:
ds_ts.X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5430,5431,5432,5433,5434,5435,5436,5437,5438,5439
ID_00MJRGGCU6WE,1152.0,1650.0,2124.0,2666.0,70.753853,34.141720,0.113152,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ID_01MH7OHC15AN,1376.0,1780.0,1970.0,3084.0,70.336754,34.266560,0.220419,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ID_01RYTK0DUR6S,529.0,966.0,968.0,3378.0,70.705070,34.107334,0.554533,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ID_03QDWT3O0285,9312.0,8776.0,8224.0,8344.0,48.338078,32.369534,0.007243,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
ID_048WTPX84MXY,3868.0,4520.0,4444.0,4488.0,48.235218,32.456581,0.004926,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ID_ZU8UD16AKUZM,1346.0,1934.0,2270.0,2942.0,70.744026,34.143208,0.128933,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ID_ZUBD7JBTTH59,2486.0,3264.0,3962.0,4140.0,70.298714,34.237972,0.021970,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ID_ZXCH3LDBSOT8,1130.0,1342.0,1592.0,1948.0,33.364868,14.165354,0.100565,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
ID_ZYX3ZCDDZKVD,859.0,881.0,932.0,1072.0,33.569592,14.305401,0.069860,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
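
The first columns of this table can be cross-checked against the standard NDVI formula, $(\mathrm{NIR} - \mathrm{Red}) / (\mathrm{NIR} + \mathrm{Red})$, with B8 as near-infrared and B4 as red. The sketch below assumes that columns 0 to 3 hold B2, B3, B4, B8 and column 6 holds NDVI, following the order of the `bands` list; the values are copied from the first three rows above.

```python
import numpy as np

# Values copied from the first three rows of the feature table above.
# Assumption: column 2 is B4 (red), column 3 is B8 (near-infrared), column 6 is NDVI.
b4 = np.array([2124.0, 1970.0, 968.0])   # red
b8 = np.array([2666.0, 3084.0, 3378.0])  # near-infrared
ndvi_in_table = np.array([0.113152, 0.220419, 0.554533])

ndvi = (b8 - b4) / (b8 + b4)
print(np.allclose(ndvi, ndvi_in_table, atol=1e-5))  # True
```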
