# GEO-AI Challenge for Cropland Mapping by ITU
_Antoine Saget_

This notebook presents the solution to the Zindi GEO-AI Challenge for Cropland Mapping by ITU, which achieved an accuracy of 0.943 on the private leaderboard.

We also provide a second notebook / Python script (simple_reproduction.ipynb / simple_reproduction.py) with much simpler code that reproduces the same results.
This other notebook will probably be easier to integrate into your own workflow, as it doesn't rely on any additional files or classes.

All parts are independent: you can skip to 5. to reproduce the private leaderboard solution, or start from 1. to get a better understanding of the data download and preprocessing steps.

Notebook table of contents:
1. Downloading the data from GEE
2. Data preprocessing
2. Study on the impact of timerange
3. Study on the impact of Sentinel-2 radiometric bands
4. Study on the impact of model choices
5. Best model

In [1]:
# Imports and seed initialization
import ee
import folium
import random
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tabulate import tabulate
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score

from tqdm import tqdm

# Set seed for reproducibility
SEED = 2023
random.seed(SEED)
np.random.seed(SEED)



In [2]:
from constants import *

In [3]:
# Country bounds and timeranges
country_settings = {
    SUDAN: {
        COUNTRY_NAME: SUDAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[14.1, 33.1], [14.6, 33.6]]
    },
    AFGHANISTAN: {
        COUNTRY_NAME: AFGHANISTAN,
        START_DATE: '2022-04-01',
        END_DATE: '2022-04-30',
        BOUNDS: [[34.0, 70.2], [34.4, 70.8]]
    },
    IRAN: {
        COUNTRY_NAME: IRAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[32.0, 48.1], [32.5, 48.6]]
    }
}
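To make the time ranges above easier to compare, here is a minimal sketch that computes the length of each window in days using only the standard library. The helper `window_days` is hypothetical and not part of this repository; the dates are copied from `country_settings` above and from `country_settings_optimal` in section 5 of this notebook.

```python
from datetime import date

def window_days(start, end):
    """Number of days between two ISO dates (hypothetical helper)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

# Windows copied from country_settings above
default_windows = {
    "Sudan": ("2019-07-01", "2020-06-30"),
    "Afghanistan": ("2022-04-01", "2022-04-30"),
    "Iran": ("2019-07-01", "2020-06-30"),
}

# Windows copied from country_settings_optimal (section 5)
tuned_windows = {
    "Sudan": ("2019-05-29", "2021-03-31"),
    "Afghanistan": ("2021-06-27", "2023-04-30"),
    "Iran": ("2018-08-28", "2020-06-30"),
}

default_lengths = {c: window_days(*w) for c, w in default_windows.items()}
tuned_lengths = {c: window_days(*w) for c, w in tuned_windows.items()}

for country in default_windows:
    print(f"{country}: {default_lengths[country]} days -> {tuned_lengths[country]} days")
```

With these dates, the Afghanistan window above covers a single month, while the three windows used in section 5 all have the same length.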

# 1. Downloading the data from GEE

In this part, Sentinel-2 time series data is downloaded from GEE. There are two options:

1. Download the data from GEE, which might take up to 1 h, OR
2. Load the pre-downloaded data, which is much faster.

Please note that both options output the exact same data as of 06/10/2023, as the pre-downloaded data is just a collection of .csv files saved from the data obtained with GEE. However, with possible future changes to the GEE Sentinel-2 collection, the pre-downloaded data might become outdated.
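The choice between downloading and loading saved .csv files amounts to a simple cache. The sketch below illustrates that pattern with the standard library only; it is not the code used by the `Dataset` class. The function `load_or_download`, the fake downloader, and the file name are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def load_or_download(csv_path, download_fn):
    """Return rows from csv_path if it exists; otherwise download and cache them."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        # Fast path: reuse the previously saved data
        with csv_path.open(newline="") as f:
            return list(csv.DictReader(f))
    # Slow path: fetch the data, then save it for next time
    rows = download_fn()
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []

def fake_download():
    """Stand-in for a slow GEE request (hypothetical data)."""
    calls.append(1)
    return [{"ID": "ID_A", "B4": "0.12"}, {"ID": "ID_B", "B4": "0.34"}]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache" / "sudan.csv"
    first = load_or_download(cache_file, fake_download)   # downloads and saves
    second = load_or_download(cache_file, fake_download)  # reads the saved file
```

A cache like this never refreshes itself, which is why saved files can drift from the live collection over time.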

In [4]:
# Run this cell if you want to download the data from GEE. This can take up to 1 h and requires about 100 MB of available storage on your GEE account.
# Also, please update PROJECT_NAME to match your personal GEE project name.

# Authenticate and initialize Earth Engine
# ee.Authenticate()
# ee.Initialize()

In [5]:
# Replace the value below with your own project name in GEE
PROJECT_NAME = "ee-antoinesaget"

In [12]:
from dataset import Dataset

ds = Dataset.from_files('Train.csv', 'Test.csv', 'Full dataset', country_settings, debug_level=1)
ds.load_all_optical_data()

########################################
Dataset info:
    Name : Full dataset
    Train shape: (1500, 6)
    Test shape: (1500, 6)
    TrainTest shape: (3000, 6)
    Center : [24.21, 51.94]
    Bounds :
        min : [14.10, 33.10]
        max : [34.31, 70.78]
    Train head:


Unnamed: 0,ID,Lat,Lon,Target,IsTrain,Country
0,ID_SJ098E7S2SY9,34.162491,70.763668,0.0,True,Afghanistan
1,ID_CWCD60FGJJYY,32.075695,48.492047,0.0,True,Iran
2,ID_R1XF70RMVGL3,14.542826,33.313483,1.0,True,Sudan
3,ID_0ZBIDY0PEBVO,14.35948,33.284108,1.0,True,Sudan
4,ID_C20R2C0AYIT0,14.419128,33.52845,0.0,True,Sudan


    Test head:


Unnamed: 0,ID,Lat,Lon,Target,IsTrain,Country
0,ID_9ZLHTVF6NSU7,34.254835,70.348699,,False,Afghanistan
1,ID_LNN7BFCVEZKA,32.009669,48.535526,,False,Iran
2,ID_SOYSG7W04UH3,14.431884,33.399991,,False,Sudan
3,ID_EAP7EXXV8ZDE,14.281866,33.441224,,False,Sudan
4,ID_QPRX1TUQVGHU,14.399365,33.109566,,False,Sudan


########################################
Country Sudan info:
    Start date: 2019-07-01
    End date: 2020-06-30
    Train shape: (500, 6)
    Test shape: (500, 6)
Country Afghanistan info:
    Start date: 2022-04-01
    End date: 2022-04-30
    Train shape: (500, 6)
    Test shape: (500, 6)
Country Iran info:
    Start date: 2019-07-01
    End date: 2020-06-30
    Train shape: (500, 6)
    Test shape: (500, 6)


# 5. Best model 

In [1]:
##### PLEASE SET THIS VARIABLE TO YOUR PROJECT NAME IN GEE #####
PROJECT_NAME = "ee-antoinesaget"

In [2]:
# Imports and seed initialization
import random
import ee

import numpy as np
import pandas as pd

from dataset import Dataset, Dataset_training_ready
from model import Model, rf_builder_shallow
from constants import SUDAN, AFGHANISTAN, IRAN, COUNTRY_NAME, START_DATE, END_DATE, BOUNDS, TARGET, B2, B3, B4, B8, LON, LAT, NDVI, SCL
from utils import save_submission

# Set seeds for reproducibility
SEED = 2023
random.seed(SEED)
np.random.seed(SEED)
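The constants imported above include an `NDVI` feature next to the `B4` (red) and `B8` (near-infrared) bands. How the repository computes it is not shown here; the sketch below assumes the usual Sentinel-2 definition, NDVI = (B8 - B4) / (B8 + B4). The function name `ndvi` and the zero fallback for an empty denominator are choices made for this illustration only.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR (B8) and red (B4) reflectance.

    Returns 0.0 when both bands are zero (an arbitrary choice for this sketch).
    """
    denom = nir + red
    if denom == 0:
        return 0.0
    return (nir - red) / denom

# Dense vegetation reflects much more NIR than red, so NDVI is high
vegetated = ndvi(nir=0.5, red=0.1)
# Bare soil reflects both bands similarly, so NDVI is close to zero
bare = ndvi(nir=0.2, red=0.2)
```

Values lie in [-1, 1] for non-negative reflectances, which makes the index comparable across dates and locations.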

In [3]:
# Authenticate and initialize Earth Engine
# ee.Authenticate()
ee.Initialize()

In [4]:
# Country bounds and timeranges
country_settings = {
    SUDAN: {
        COUNTRY_NAME: SUDAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[14.1, 33.1], [14.6, 33.6]]
    },
    AFGHANISTAN: {
        COUNTRY_NAME: AFGHANISTAN,
        START_DATE: '2022-04-01',
        END_DATE: '2022-04-30',
        BOUNDS: [[34.0, 70.2], [34.4, 70.8]]
    },
    IRAN: {
        COUNTRY_NAME: IRAN,
        START_DATE: '2019-07-01',
        END_DATE: '2020-06-30',
        BOUNDS: [[32.0, 48.1], [32.5, 48.6]]
    }
}
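The `BOUNDS` entries above can be read as `[[lat_min, lon_min], [lat_max, lon_max]]`, which matches the Lat/Lon values printed in the dataset summary earlier in this notebook. Under that assumption, a point can be assigned to a country with a simple box test. The helpers `in_bounds` and `country_of` below are hypothetical and only illustrate the idea; the `Dataset` class may do this differently.

```python
# Bounds copied from country_settings; assumed layout: [[lat_min, lon_min], [lat_max, lon_max]]
BOUNDS_BY_COUNTRY = {
    "Sudan": [[14.1, 33.1], [14.6, 33.6]],
    "Afghanistan": [[34.0, 70.2], [34.4, 70.8]],
    "Iran": [[32.0, 48.1], [32.5, 48.6]],
}

def in_bounds(lat, lon, bounds):
    """True if (lat, lon) falls inside the bounding box."""
    (lat_min, lon_min), (lat_max, lon_max) = bounds
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def country_of(lat, lon):
    """Name of the first country whose box contains the point, or None."""
    for name, bounds in BOUNDS_BY_COUNTRY.items():
        if in_bounds(lat, lon, bounds):
            return name
    return None

# Points taken from the "Train head" output shown earlier
print(country_of(34.162491, 70.763668))
print(country_of(32.075695, 48.492047))
print(country_of(14.542826, 33.313483))
```

Since the three boxes do not overlap, the order in which they are tested does not matter.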

In [5]:
ds = Dataset.from_files('Train.csv', 'Test.csv', 'Full dataset', country_settings, debug_level=0)
# With pandas==2.1.1 there might be a warning about the column type. It can be ignored,
# as it is fixed in later versions of pandas: https://github.com/pandas-dev/pandas/issues/55025

  self._df.loc[mask, COUNTRY] = country[COUNTRY_NAME]


In [8]:
# Please see 3. for more information on the choice of bands
bands = [B2, B3, B4, B8, LON, LAT, NDVI, SCL]

# Please see 2. for more information on the choice of timespans
country_settings_optimal = {
    SUDAN: {
        COUNTRY_NAME: SUDAN,
        START_DATE: '2019-05-29',
        END_DATE: '2021-03-31',
        BOUNDS: [[14.1, 33.1], [14.6, 33.6]]
    },
    AFGHANISTAN: {
        COUNTRY_NAME: AFGHANISTAN,
        START_DATE: '2021-06-27',
        END_DATE: '2023-04-30',
        BOUNDS: [[34.0, 70.2], [34.4, 70.8]]
    },
    IRAN: {
        COUNTRY_NAME: IRAN,
        START_DATE: '2018-08-28',
        END_DATE: '2020-06-30',
        BOUNDS: [[32.0, 48.1], [32.5, 48.6]]
    }
}

# Loading optical data
ds_ts = Dataset_training_ready.get_ts_data_from(
    ds, bands, PROJECT_NAME, country_settings=country_settings_optimal)

# Training and predicting
# Our final model is a single shallow random forest (per country) of 100 trees and a max depth of 10.
model = Model(rf_builder_shallow, ds_ts, SEED)
model.train_on_full_dataset_one_per_country()
preds, ids = model.predict_on_test()

# Check the diff between the original submission and the current predictions
original = pd.read_csv('submissions/original_challenge_submission.csv', index_col='ID', usecols=['ID', TARGET])
original[TARGET] = original[TARGET].astype('uint8')

diff = original.loc[ids, TARGET] - preds
diff = diff[diff != 0] # Keep only the rows that differ from the original submission
print(f'Number of predictions different from original submission : {len(diff)}')

Number of predictions different from original submission : 0
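The comment in the cell above describes the final model as one shallow random forest per country, with 100 trees and a maximum depth of 10. The internals of `Model` and `rf_builder_shallow` are not shown here, so the following is only a minimal sketch of that idea on synthetic data, assuming scikit-learn's `RandomForestClassifier`. The helper `make_fake_country` and the feature shifts are hypothetical stand-ins for the real Sentinel-2 features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SEED = 2023
rng = np.random.default_rng(SEED)

def make_fake_country(n_samples, n_features, shift):
    """Synthetic stand-in for one country's per-point features (hypothetical)."""
    X = rng.normal(loc=shift, scale=1.0, size=(n_samples, n_features))
    # The label depends on the first feature relative to the country-specific shift
    y = (X[:, 0] > shift).astype("uint8")
    return X, y

# Each country gets a different feature distribution, hence its own model
countries = {"Sudan": 0.0, "Afghanistan": 2.0, "Iran": -1.0}

models = {}
for name, shift in countries.items():
    X, y = make_fake_country(200, 8, shift)
    clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=SEED)
    models[name] = clf.fit(X, y)

# Predict new points with the model of the country they belong to
X_new, y_new = make_fake_country(50, 8, countries["Iran"])
iran_preds = models["Iran"].predict(X_new)
iran_acc = float((iran_preds == y_new).mean())
print(f"Accuracy on synthetic Iran points: {iran_acc:.2f}")
```

Training one model per country keeps each forest focused on a single region, at the cost of having fewer training points per model.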


In [9]:
save_submission(preds, ids, 'reproduction_of_original_submission')
# Please note that a diff between the original submission file and this one will not be empty.
# By mistake, the original submission also included predictions for the training set.
# This means that the original submission has 3000 rows while this one has 1500 rows (test set only).
# However, the predictions on the test set are the same, as shown in the cell above.