<a href="https://colab.research.google.com/github/NaiaraSPinto/VegMapper/blob/devel-calval/calval/prepare_train_val_ref_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a single training/validation/test set from multiple Collect Earth projects

 

### Table of Contents

* [Overview](#overview)
* [Set-up](#setup)
* [Sample preparation](#sample-prep)
    * [Read-in, reshape, and recode](#reshape-recode) 
    * [Simplify the classes](#simplify)
    * [Calculate sample agreement](#agreement)
* [Split the dataset](#split)
    * [Combine and export to csv](#combine)

## Overview <a class="anchor" id="overview"></a>
This notebook demonstrates how the results of several Collect Earth Online projects can be:

1. Checked for structure and validity before modeling (users remain responsible for providing well-formatted data);
2. Recoded to new class values, with the columns renamed;
3. Merged into a single dataset that provides a single label for each sample point and an estimate of label uncertainty;
4. Split into training, validation, and test (or map reference) samples.

The data used in this demonstration are the results of three Collect Earth Online projects that were captured over the Department of Ucayali, Peru. Each project represents the efforts of an individual (or group of individuals working in the same project) to label 1350 points, classifying each into 1 of 4 classes: not oil palm, young oil palm, mature oil palm, or unsure. The datasets preserve all the information from these projects, although user email addresses were anonymized.

## Sample preparation <a class="anchor" id="sample-prep"></a>
Load packages, set up configurations, and define a helper function.

In [1]:
## Use this if you want to run on your local machine
#from label_utils import *

# Use this if you want to run on Google Colab
#%cd VegMapper/
#import vegmapper
#from vegmapper.calval.sample_utils import *


import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
## Google Drive related setup
## mount your Google Drive to access files
from google.colab import drive  # only available when running on Google Colab
drive.mount('/content/drive')
googleDriveFolder = ''

Mounted at /content/drive


In [11]:
# import N survey file(s) into a list (N>=1)
fs = ["ceo-survey-user1.csv",
      "ceo-survey-user2.csv",
      "ceo-survey-user3.csv"]

# Set rename dict for renaming column names
# {"old":"new",...}
rename_dict = {"plot_id":"Point_ID", 
               "pl_cluster":"Clust",
               "center_lat":"Lat", 
               "center_lon":"Lon",
               "Oil Palm?:Young Oil Palm":"Young",
               "Oil Palm?:Mature Oil Palm":"Mature",
               "Oil Palm?:Not Oil Palm":"Not",
               "Oil Palm?:Not Sure":"NotSure"}

# Set re-code dict for land cover classes
recode_dict = {"Young":1,"Mature":1,"Not":0,"NotSure":3}

# Set columns to keep:
# key_col and label_name are used for joining the users' datasets;
# columns in useful_col do not participate in the join
# and are taken from the first user only, to avoid repetition.
key_col = ["Point_ID", "Clust"]
label_name = "labeler"
useful_col = ["Lat", "Lon"]



#Set random seed for train/validation/reference split
seed = 999

In [12]:
# A helper function to process a csv file
def process_csv(csv_path):
    """
    A csv processing pipeline. This function takes a single csv file
    and passes it through a sequence of pre-defined helper functions.
    return: a pandas dataframe of the processed csv.
    """
    print("processing: {}".format(csv_path))
    df = load_csv(csv_path)
    df = rename_cols(df, rename_dict)
    check_exclusive(df, csv_path)
    
    # if you want to combine Young and Mature, just recode both to be 1.
    df = recode(df, recode_dict, label_name)
    
    df = subset_cols(df, [*key_col,  *useful_col, label_name])
    
    return df
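
The helpers called in `process_csv` are imported from the project's utility module and are not shown in this notebook. As a rough illustration only, the sketch below shows one way the exclusivity check and the recoding step could work, assuming each class column holds 100 for the selected answer and 0 otherwise (as the check message printed further down suggests). The function name `recode_one_hot` and the toy data are hypothetical, not the actual VegMapper implementation.

```python
import pandas as pd

def recode_one_hot(df, recode_dict, label_name):
    """Hypothetical stand-in for the exclusivity check plus recode step."""
    class_cols = list(recode_dict)
    # Assumption: a selected class is stored as 100, an unselected one as 0.
    n_selected = (df[class_cols] == 100).sum(axis=1)
    if not (n_selected == 1).all():
        raise ValueError("each point needs exactly one class column equal to 100")
    # Name of the selected class column for each row, mapped to its code.
    chosen = df[class_cols].idxmax(axis=1)
    out = df.drop(columns=class_cols)
    out[label_name] = chosen.map(recode_dict)
    return out

# Made-up survey rows, one per class.
toy = pd.DataFrame({
    "Point_ID": [1, 2, 3, 4],
    "Young":   [100, 0, 0, 0],
    "Mature":  [0, 100, 0, 0],
    "Not":     [0, 0, 100, 0],
    "NotSure": [0, 0, 0, 100],
})
toy_recode = {"Young": 1, "Mature": 1, "Not": 0, "NotSure": 3}
recoded = recode_one_hot(toy, toy_recode, "labeler")
print(recoded)
```

Raising an error on a failed check is a design choice of this sketch; the notebook's own `check_exclusive` prints a warning instead.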

### Read-in, reshape, and recode classes <a class="anchor" id="reshape-recode"></a>
The first step is to process each of the three CEO projects and combine them into a single dataset. Inside `process_csv`, the columns are renamed and the four classes are recoded into a single column; with the `recode_dict` defined above, the values are 0 (not oil palm), 1 (oil palm, young or mature), and 3 (unsure). After combining, we end up with three label columns, one per completed CEO project: `labeler_1` = samples from project 1, `labeler_2` = samples from project 2, `labeler_3` = samples from project 3. Each column contains the recoded classes.
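
`combine_labelers` also comes from the utility module and is not shown here. A minimal sketch of the same idea follows, assuming an inner join on the key columns and a numbered suffix for each labeler's column. The function name `combine_sketch` and the toy frames are hypothetical.

```python
import pandas as pd

def combine_sketch(dfs, by, label_name):
    """Join per-labeler frames on the key columns, one label column each."""
    # The first frame keeps its extra columns (e.g. Lat, Lon).
    combined = dfs[0].rename(columns={label_name: f"{label_name}_1"})
    for i, df in enumerate(dfs[1:], start=2):
        # Later frames contribute only the keys and their label column.
        right = df[[*by, label_name]].rename(
            columns={label_name: f"{label_name}_{i}"})
        # Assumption: keep only points labeled in every project (inner join).
        combined = combined.merge(right, on=by, how="inner")
    return combined

keys = {"Point_ID": [1, 2, 3], "Clust": [1, 0, 1]}
user1 = pd.DataFrame({**keys, "Lat": [-8.3, -8.8, -8.6],
                      "Lon": [-75.0, -74.3, -74.9], "labeler": [1, 0, 0]})
user2 = pd.DataFrame({**keys, "labeler": [1, 0, 0]})
user3 = pd.DataFrame({**keys, "labeler": [1, 0, 1]})

toy_combined = combine_sketch([user1, user2, user3],
                              by=["Point_ID", "Clust"], label_name="labeler")
print(toy_combined)
```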

In [26]:
# process ceo-survey-users one by one
dats = list(map(process_csv, fs))

# combine three datasets into one
combined = combine_labelers(dats,by=["Point_ID","Clust"], label_name = label_name)

processing: ceo-survey-user1.csv
The labeled classes are mutually exclusive.
processing: ceo-survey-user2.csv
The labeled classes are mutually exclusive.
processing: ceo-survey-user3.csv


        >>>file: ceo-survey-user3.csv<<<
        Check your columns "Young","Mature","Not", and "NotSure".
        (1)Make sure no empty entry in those columns.
        (2)Make sure there is one and only one column is labeled as 100.


In [27]:
combined

Unnamed: 0,Point_ID,Clust,Lat,Lon,labeler_1,labeler_2,labeler_3
0,140884433,1,-8.3219,-75.045545,1,1,1
1,140884434,1,-8.344409,-74.884792,0,0,0
2,140884435,0,-8.836094,-74.342566,0,0,0
3,140884436,1,-8.32163,-75.031377,1,1,1
4,140884437,0,-10.948943,-71.736808,0,0,0
5,140884438,0,-8.950685,-74.393391,0,0,0
6,140884439,0,-9.795282,-74.019089,0,0,0
7,140884440,1,-8.645355,-74.91606,0,0,1
8,140884441,11,-8.631092,-74.716776,1,1,1
9,140884442,0,-10.298456,-73.232542,0,0,0


### Simplify the classes <a class="anchor" id="simplify"></a>

In this step, a single classification is created by finding the modal class for each sample point across the three groups' results. This creates a new `mode` column, which provides the class from the majority opinion.

The same step can be run under two class schemes. In the full scheme, young and mature oil palm keep separate codes. In the simplified scheme, the two oil palm classes are collapsed, within each of the `labeler_1` to `labeler_3` columns, into a single *oil palm* class with value 1, while *not oil palm* remains 0 and *unsure* remains 3. The `recode_dict` defined above already assigns both oil palm classes the value 1, so the modal class computed in this notebook follows the simplified scheme. We recommend the simplified scheme for modelling, while the full scheme may be useful for understanding error patterns.
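
To make the collapsing step concrete, here is a small sketch on made-up labels that use a full scheme (0 = not oil palm, 1 = young, 2 = mature, 3 = unsure). It computes the modal class before and after merging the two oil palm codes. The column names `mode_full` and `mode_simple` are illustrative only.

```python
import pandas as pd

label_cols = ["labeler_1", "labeler_2", "labeler_3"]
toy = pd.DataFrame({
    "labeler_1": [1, 2, 0, 3],
    "labeler_2": [2, 2, 0, 1],
    "labeler_3": [1, 1, 0, 3],
})

# Modal class under the full scheme; column 0 of the result holds the mode
# (the smallest one if several values tie).
toy["mode_full"] = toy[label_cols].mode(axis=1)[0]

# Collapse mature (2) into the single oil palm class (1), then repeat.
simplified = toy[label_cols].replace({2: 1})
toy["mode_simple"] = simplified.mode(axis=1)[0]

print(toy)
```

The second row shows why the scheme matters: two labelers chose mature and one chose young, so the point is a 2/3 majority under the full scheme but unanimous once the oil palm classes are merged.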

### Calculate sample agreement <a class="anchor" id="agreement"></a>
The next step is to calculate agreement metrics across the three groups' samples. The primary approach is to calculate the proportion of labelling teams that selected the modal class. Since there are just three teams in this example, the possible values are 0.333, 0.667, and 1. In the code below, the modal class is stored in the `mode` column and the proportion in the `mode_agreement` column; points with no majority (agreement of 1/3) are then dropped.
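
`get_mode_and_occurence`, used in the next cell, comes from the project's utility module. The sketch below shows the underlying calculation on made-up rows; `mode_and_agreement` is a hypothetical stand-in, and breaking ties by taking the smallest class code is an assumption.

```python
from collections import Counter

import pandas as pd

def mode_and_agreement(labels):
    """Return (modal class, share of labelers who chose it)."""
    counts = Counter(labels)
    top = max(counts.values())
    # Assumption: on a tie, report the smallest class code.
    modal = min(k for k, c in counts.items() if c == top)
    return modal, top / len(labels)

toy = pd.DataFrame({
    "labeler_1": [1, 0, 0],
    "labeler_2": [1, 0, 1],
    "labeler_3": [1, 1, 3],
})
# One (mode, agreement) pair per row, expanded into two columns.
result = toy.apply(lambda row: mode_and_agreement(row.tolist()),
                   axis=1, result_type="expand")
toy["mode"] = result[0]
toy["mode_agreement"] = result[1]
print(toy)
```

The last row has three different labels, so its agreement is 1/3 and it would be dropped by the filter in the next cell.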

In [28]:
combined[['mode', 'mode_agreement']] = combined[["labeler_1","labeler_2","labeler_3"]].apply(get_mode_and_occurence, axis=1, result_type='expand')
pd.set_option('display.max_rows', None)
#print(combined)

# flag points with no agreement (mode_agreement = 1/num_labelers) by setting the mode to -9999
combined.loc[combined['mode_agreement'] <= 1/3, 'mode'] = -9999

# then drop the flagged points
combined = combined.drop(combined[combined['mode'] == -9999].index)
print(combined.shape)
#print(combined.loc[combined['mode_agreement'] <=1/3, 'mode'])
# with open('/content/drive/My Drive/' +\
#           googleDriveFolder + '/test_samples_redf_' +\
#           timestamp + '.csv', 'w') as f:
#   samples_redf.to_csv(f, float_format='{:f}'.format, encoding='utf-8', 
#                       index = False)

# print("file exported")


(1342, 9)


In [29]:
combined

Unnamed: 0,Point_ID,Clust,Lat,Lon,labeler_1,labeler_2,labeler_3,mode,mode_agreement
0,140884433,1,-8.3219,-75.045545,1,1,1,1.0,1.0
1,140884434,1,-8.344409,-74.884792,0,0,0,0.0,1.0
2,140884435,0,-8.836094,-74.342566,0,0,0,0.0,1.0
3,140884436,1,-8.32163,-75.031377,1,1,1,1.0,1.0
4,140884437,0,-10.948943,-71.736808,0,0,0,0.0,1.0
5,140884438,0,-8.950685,-74.393391,0,0,0,0.0,1.0
6,140884439,0,-9.795282,-74.019089,0,0,0,0.0,1.0
7,140884440,1,-8.645355,-74.91606,0,0,1,0.0,0.666667
8,140884441,11,-8.631092,-74.716776,1,1,1,1.0,1.0
9,140884442,0,-10.298456,-73.232542,0,0,0,0.0,1.0


We can then calculate the mean agreement for each modal class to get a sense of the label uncertainty per class.

In [35]:
agreement = combined.groupby("mode").mean() 
agreement = agreement.rename(columns={"mode_agreement": "mean agreement"})
print(agreement[['mean agreement']])

      mean agreement
mode                
0.0         0.988607
1.0         0.894253
3.0         0.690476


In [36]:
combined

Unnamed: 0,Point_ID,Clust,Lat,Lon,labeler_1,labeler_2,labeler_3,mode,mode_agreement
0,140884433,1,-8.3219,-75.045545,1,1,1,1.0,1.0
1,140884434,1,-8.344409,-74.884792,0,0,0,0.0,1.0
2,140884435,0,-8.836094,-74.342566,0,0,0,0.0,1.0
3,140884436,1,-8.32163,-75.031377,1,1,1,1.0,1.0
4,140884437,0,-10.948943,-71.736808,0,0,0,0.0,1.0
5,140884438,0,-8.950685,-74.393391,0,0,0,0.0,1.0
6,140884439,0,-9.795282,-74.019089,0,0,0,0.0,1.0
7,140884440,1,-8.645355,-74.91606,0,0,1,0.0,0.666667
8,140884441,11,-8.631092,-74.716776,1,1,1,1.0,1.0
9,140884442,0,-10.298456,-73.232542,0,0,0,0.0,1.0


Because `recode_dict` already merges young and mature oil palm, the table above is also the result for the reduced set of classes.

## Split the dataset <a class="anchor" id="split"></a>

Here we split the dataset into three parts for model training (60% of the sample), validation (20%), and final assessment (the 20% set aside as the test or map reference dataset).

The splits are confined to the usable sample, defined as samples that do not fall into class 3 (unsure) and that have at least 2/3 of observers agreeing on the class. This decision is based on the simplified class scheme rather than the full scheme. The resulting splits are denoted in a column called `usage`.

Values of "unusable" in the `usage` column indicate observations that were not usable because of their low agreement or uncertain class. They are included here for completeness, and in case they help with evaluation.
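
As a sketch of the usable-sample definition described here, the example below filters a made-up table, splits only the usable rows 60/20/20, and adds the remaining rows back with `usage` set to "unusable". The name `full_sample` and the toy data are hypothetical, and the small tolerance on the 2/3 threshold is an added precaution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 999
toy = pd.DataFrame({
    "Point_ID": range(1, 21),
    "mode": [0] * 10 + [1] * 6 + [3] * 4,
    "mode_agreement": [1.0] * 16 + [2 / 3] * 4,
})

# Usable: not in the unsure class (3) and at least 2/3 agreement.
# The tolerance guards against floating-point rounding of 2/3.
is_usable = (toy["mode"] != 3) & (toy["mode_agreement"] >= 2 / 3 - 1e-9)
usable, unusable = toy[is_usable], toy[~is_usable]

# 60% train, then the remaining 40% split evenly into validation and test.
train, rest = train_test_split(usable, train_size=0.6, random_state=seed)
val, ref = train_test_split(rest, train_size=0.5, random_state=seed)

full_sample = pd.concat([
    train.assign(usage="train"),
    val.assign(usage="validate"),
    ref.assign(usage="map reference/test"),
    unusable.assign(usage="unusable"),
])
print(full_sample["usage"].value_counts())
```

Filtering before the split keeps unsure or low-agreement points out of all three model-facing subsets while still carrying them in the exported table.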


In [37]:
train, rest = train_test_split(combined,test_size=0.4,train_size=0.6, random_state= seed)

val, ref = train_test_split(rest, test_size = 0.5,train_size =0.5, random_state = seed)

out = pd.concat([train.assign(usage = "train"),
        val.assign(usage = "validate"),
        ref.assign(usage = "map reference/test")])
print(out)



       Point_ID  Clust        Lat        Lon  labeler_1  labeler_2  labeler_3  \
26    140884459      0 -10.661994 -71.784007          0          0          0   
67    140884500      0  -9.320213 -73.921025          0          0          0   
55    140884488      1  -8.493207 -74.608635          0          0          0   
1273  140885706      1  -8.263247 -74.816985          0          0          0   
164   140884597      1  -8.517548 -74.931080          3          3          1   
1071  140885504     11  -8.500605 -74.936112          3          3          1   
748   140885181      1  -8.418955 -74.726437          0          0          0   
416   140884849      1  -8.543917 -74.577780          0          0          0   
770   140885203      1  -9.003900 -75.586319          1          1          1   
647   140885080      0  -7.781269 -73.798576          0          0          0   
1007  140885440      0  -8.917909 -73.469366          0          0          0   
53    140884486      1  -8.0

### Combine and export to csv <a class="anchor" id="combine"></a>

The ineligible portion of the sample is also added back for completeness.

In [None]:
with open('/content/drive/My Drive/' +\
    googleDriveFolder + '/full_samplef_' + timestamp + '.csv', 'w') as f:
    full_samplef.to_csv(f, float_format='{:f}'.format, encoding='utf-8', index = False)

print('file exported')

Finally, we plot the sample locations on a map, colored by usage.

In [None]:
from matplotlib import rcParams  # needed for the figure size setting below

plot_sample = full_samplef.copy()

usage_dict = {'train': 1, 'validate': 2, "map reference/test": 3, "unusable": 4}
plot_sample = plot_sample.replace({'usage':usage_dict})

print(plot_sample['usage'].unique())

rcParams['figure.figsize'] = 10, 10
plot_sample.plot.scatter(x='x', y='y', c='usage', s=12, cmap='viridis')
None