<a href="https://colab.research.google.com/github/agroimpacts/VegMapper/blob/dev-calval-simplify/calval/process_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a single training/validation/test set from multiple Collect Earth projects



### Table of Contents

* [Overview](#overview)
* [Set-up](#setup)
* [Sample preparation](#sample-prep)
    * [Read-in, reshape, and recode](#reshape-recode)
    * [Simplify the classes](#simplify)
    * [Calculate sample agreement](#agreement)
* [Split the dataset](#split)
    * [Combine and export to csv](#combine)

## Overview <a class="anchor" id="overview"></a>
This notebook demonstrates how several Collect Earth Online (CEO) projects can be:

1. Checked for the structure and validity of the input required for modeling (users remain responsible for providing well-formatted data);
2. Recoded to new class values, with columns renamed;
3. Merged into a single dataset that provides a single label for each sample point and an estimate of label uncertainty;
4. Split into training, validation, and test (or map reference) samples.

If you don't bring your own data, you can use the example data: the results of three Collect Earth Online projects that were captured over the Department of Ucayali, Peru. They are in the VegMapper repo under the calval/data folder. Each project CSV represents the efforts of an individual (or a group of individuals working in the same project) to label 1350 points, classifying each into 1 of 4 classes: not oil palm, young oil palm, mature oil palm, or unsure. The datasets preserve all the information from these projects, although user email addresses were anonymized.


## Sample preparation <a class="anchor" id="sample-prep"></a>
Load packages, set up configurations, define a helper function...

In [1]:
# @title (RUN) Install packages
%%capture
!pip install folium

In [2]:
# from label_utils import load_csv, subset_cols, rename_cols,\
#     check_exclusive, recode, combine_labelers, get_mode_and_occurence

#@title (RUN) Setup code
## Mount Drive
from google.colab import drive
root = '/content/gdrive'
drive.mount(root)

## Clone and/or update VegMapper
import os
# from datetime import datetime as dt
import pandas as pd
from sklearn.model_selection import train_test_split

repo_path = f"{root}/MyDrive/repos"
clone_path = 'https://github.com/agroimpacts/VegMapper.git'
if not os.path.exists(repo_path):
    print(f"Making {repo_path}")
    os.makedirs(repo_path, exist_ok=True)

if not os.path.exists(f"{repo_path}/VegMapper"):
    !git -C "{repo_path}" clone "{clone_path}"
else:
    !git -C "{repo_path}/VegMapper" pull

os.chdir(f"{repo_path}/VegMapper")

# Import label_utils functions
from vegmapper.calval.label_utils import *
from functools import partial
import folium

Mounted at /content/gdrive
Updating c434ee4..c10a56b
error: Your local changes to the following files would be overwritten by merge:
	vegmapper/calval/label_utils.py
	vegmapper/calval/sample_utils.py
Please commit your changes or stash them before you merge.
Aborting


To load the CSV files, open the file browser in the left panel of your Colab notebook and navigate to the directory that contains the files. Click the three-dot menu to the right of a file name and select 'Copy path'. Then paste the path into the input box below.

In [3]:
# @title (RUN) Load CEO Project CSVs

while True:
    try:
        num_users = int(
            input("Enter the number of CEO projects (the number of CEO "\
                  "projects must be at least 2): ")
        )
        if num_users < 2:
            print("The number of CEO projects must be at least 2.")
        else:
            break
    except ValueError:
        print("Invalid input. Please enter a valid number.")

fs = []

for i in range(num_users):
    while True:
        file_path = input(f"Enter the CSV file path & name for project {i+1}: ")
        if not file_path:
            print("File path/name cannot be empty.")
        else:
            fs.append(file_path)
            break

Enter the number of CEO projects (the number of CEO projects must be at least 2): 3
Enter the CSV file path & name for project 1: /content/gdrive/MyDrive/VegMapper/calval/data/ceo-survey-user1.csv
Enter the CSV file path & name for project 2: /content/gdrive/MyDrive/VegMapper/calval/data/ceo-survey-user2.csv
Enter the CSV file path & name for project 3: /content/gdrive/MyDrive/VegMapper/calval/data/ceo-survey-user3.csv
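
Because the paths are typed or pasted by hand, it can help to check them before any file is read. The following is a minimal sketch of such a check; `validate_csv_paths` is a hypothetical helper written for illustration and is not part of `label_utils`.

```python
import os

def validate_csv_paths(paths):
    """Return a list of problems found in `paths` (hypothetical helper).

    An empty list means every path points to an existing .csv file,
    there are at least two paths, and no path is repeated.
    """
    problems = []
    if len(paths) < 2:
        problems.append("At least 2 CEO project files are required.")
    for path in paths:
        if not path.lower().endswith(".csv"):
            problems.append(f"{path}: not a .csv file")
        elif not os.path.isfile(path):
            problems.append(f"{path}: file not found")
    if len(set(paths)) != len(paths):
        problems.append("The same file was entered more than once.")
    return problems
```

Calling `validate_csv_paths(fs)` after the cell above would return an empty list when all entered paths are usable.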


**Note**: it is important to make sure that the CEO project files come from the same project, such that each plot_id represents the same location in every file. It is theoretically possible for 2 or more projects to have the same number of plots and the same plot_id numbers while each plot_id represents a different location, in which case the results here will not be valid.
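
The kind of check this note calls for can be sketched in plain Python. This is an illustration only and not the implementation of `match_CEO_projects`; the helper names `plot_locations` and `mismatched_plots` are hypothetical, the coordinate values are made up, and coordinates are compared as the strings found in the CSV.

```python
import csv
import io

def plot_locations(csv_file):
    """Map each plot_id to its (center_lon, center_lat) pair, read as strings."""
    reader = csv.DictReader(csv_file)
    return {row["plot_id"]: (row["center_lon"], row["center_lat"]) for row in reader}

def mismatched_plots(reference, other):
    """Return plot_ids missing from one project or located at different coordinates."""
    ids = set(reference) | set(other)
    return sorted(pid for pid in ids if reference.get(pid) != other.get(pid))

# Two tiny in-memory "projects" with made-up values; plot 2 differs in longitude.
project_a = io.StringIO("plot_id,center_lon,center_lat\n1,-75.0,-8.3\n2,-74.9,-8.4\n")
project_b = io.StringIO("plot_id,center_lon,center_lat\n1,-75.0,-8.3\n2,-74.1,-8.4\n")

bad = mismatched_plots(plot_locations(project_a), plot_locations(project_b))
print(bad)  # ['2']
```

An empty result means every plot_id appears in both files with identical coordinates.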

In [4]:
# @title (RUN) Check that projects have matching observations
match_CEO_projects(fs)

Samples are identical in all CSV files.


In [5]:
# @title (RUN) Check that column names match in all CEO projects

first_file = pd.read_csv(fs[0])
expected_column_names = set(first_file.columns)

for file_path in fs[1:]:
    if not os.path.isfile(file_path):
        print(f"File {file_path} does not exist.")
        continue

    df = pd.read_csv(file_path)
    if set(df.columns) != expected_column_names:
        print(f"Column names in {file_path} are not the same.")
        break
else:
    print("All files have the same column names.")

All files have the same column names.


In [6]:
# @title (RUN) Select the columns containing the class variable

new_col_names, rename_dict = select_columns(fs)


Column Names and Indices:
0: plot_id
1: center_lon
2: center_lat
3: size_m
4: shape
5: sample_points
6: email
7: flagged
8: flagged_reason
9: collection_time
10: analysis_duration
11: common_securewatch_date
12: total_securewatch_dates
13: pl_plotid
14: pl_cluster
15: Oil Palm?:Young Oil Palm
16: Oil Palm?:Mature Oil Palm
17: Oil Palm?:Not Oil Palm
18: Oil Palm?:Not Sure
Enter the names you want to represent presence and absence
 , separated by a comma (e.g. Presence, Absence):  Presence, Absence
Do you want to include an 'Unsure' category? (y/n): y
Enter column indices to change to 'Presence' 
(separate with commas): 15,16
Enter column indices to change to 'Absence' 
(separate with commas): 17
Enter column indices to change to 'Unsure' (separate with commas): 18
{'plot_id': 'Point_ID', 'pl_cluster': 'Clust', 'center_lat': 'Lat', 'center_lon': 'Lon', 'Oil Palm?:Young Oil Palm': 'Presence', 'Oil Palm?:Mature Oil Palm': 'Presence', 'Oil Palm?:Not Oil Palm': 'Absence', 'Oil Palm?:Not Sur

In [9]:
# @title Provide a numerical code for presence (e.g. 1), absence (e.g. 0), and unsure (e.g. 2, if present)
recode_dict = {}

# Iterate through new_col_names and get user input for values
for column_name in new_col_names:
    arbitrary_number = int(
        input(f"Enter a number for the '{column_name}' category: ")
    )
    recode_dict[column_name] = arbitrary_number

print("Updated recode_dict:")
print(recode_dict)

Enter a number for the 'Presence' category: 1
Enter a number for the 'Absence' category: 0
Enter a number for the 'Unsure' category: 2
Updated recode_dict:
{'Presence': 1, 'Absence': 0, 'Unsure': 2}


### Read-in, reshape, and recode classes <a class="anchor" id="reshape-recode"></a>
The first step is to combine the three datasets into a single dataset and to recode the four classes into a single column with values 0 (absence), 1 (presence), and 2 (unsure, if this category exists). After this step, there is one column per CEO project. Each column contains that project's recoded classes and is named for the project's CSV file.
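
The recoding of the four answer columns into one code per sample can be sketched as follows. This is not the implementation of `process_csv`. It assumes that each answer column holds a value greater than 0 when that answer was selected and 0 otherwise; `example_rename`, `example_codes`, and `recode_row` are hypothetical names that mirror the inputs entered earlier.

```python
# Column-to-class assignment and numeric codes, mirroring the inputs shown earlier.
example_rename = {
    "Oil Palm?:Young Oil Palm": "Presence",
    "Oil Palm?:Mature Oil Palm": "Presence",
    "Oil Palm?:Not Oil Palm": "Absence",
    "Oil Palm?:Not Sure": "Unsure",
}
example_codes = {"Presence": 1, "Absence": 0, "Unsure": 2}

def recode_row(row):
    """Return the numeric code of the one simplified class selected in `row`.

    Assumption: a column value > 0 means that answer was selected.
    """
    selected = {
        example_rename[col]
        for col in example_rename
        if float(row.get(col) or 0) > 0
    }
    if len(selected) != 1:
        raise ValueError(f"Expected exactly one selected class, found {sorted(selected)}")
    return example_codes[selected.pop()]

# A made-up row in which only "Mature Oil Palm" was selected.
example_row = {
    "Oil Palm?:Young Oil Palm": 0,
    "Oil Palm?:Mature Oil Palm": 100,
    "Oil Palm?:Not Oil Palm": 0,
    "Oil Palm?:Not Sure": 0,
}
print(recode_row(example_row))  # 1
```

Raising an error when zero or several simplified classes are selected plays the role of an exclusivity check: each sample must end up with exactly one code.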

In [11]:
# @title (RUN) Process and combine csvs
label_name = "labeler"

# Define a partial function with fixed arguments
process_csv_partial = partial(process_csv, rename_dict=rename_dict,
                              recode_dict=recode_dict,
                              new_col_names=new_col_names)

# Process ceo-survey-users one by one
dats = list(map(process_csv_partial, fs))

# Combine three datasets into one
combined = combine_labelers(dats, by=["Point_ID", "Clust"],
                            label_name=label_name, fs=fs)
combined_pl = combined.drop(columns=['Clust'])
combined_pl.head()

processing: /content/gdrive/MyDrive/VegMapper/calval/data/ceo-survey-user1.csv
processing: /content/gdrive/MyDrive/VegMapper/calval/data/ceo-survey-user2.csv
processing: /content/gdrive/MyDrive/VegMapper/calval/data/ceo-survey-user3.csv


| | Point_ID | Lat | Lon | ceo-survey-user1 | ceo-survey-user2 | ceo-survey-user3 |
|---|---|---|---|---|---|---|
| 0 | 140884433 | -8.3219 | -75.045545 | 1 | 1 | 1 |
| 1 | 140884434 | -8.344409 | -74.884792 | 0 | 0 | 0 |
| 2 | 140884435 | -8.836094 | -74.342566 | 0 | 0 | 0 |
| 3 | 140884436 | -8.32163 | -75.031377 | 1 | 1 | 1 |
| 4 | 140884437 | -10.948943 | -71.736808 | 0 | 0 | 0 |


### Calculate sample agreement <a class="anchor" id="agreement"></a>
The next step is to calculate agreement metrics across the projects. The primary metric is the proportion of projects that selected the most common (modal) class. For example, with three projects, if the observers in two projects labelled a sample as class 1 and the third labelled it as class 0, the sample is assigned class 1, the modal class, with an agreement of 0.667. With three projects, the values can be 0.333, 0.667, or 1. With four, they can be 0.25, 0.5, 0.75, or 1.

Although there may be multiple classes representing presence and absence, agreement is calculated only for the simplified classes.
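
The modal class and its agreement for a single sample can be sketched as below. This is an illustration and not the implementation of `get_mode_and_occurence`; `mode_and_confidence` is a hypothetical name, and when labels tie, this sketch returns the label that appears first.

```python
from collections import Counter

def mode_and_confidence(labels):
    """Return (modal label, share of labelers who chose it) for one sample."""
    counts = Counter(labels)
    # most_common(1) returns the label with the highest count;
    # ties are resolved in favor of the label encountered first.
    mode, n_votes = counts.most_common(1)[0]
    return mode, round(n_votes / len(labels), 3)

print(mode_and_confidence([1, 1, 1]))  # (1, 1.0)
print(mode_and_confidence([1, 1, 0]))  # (1, 0.667)
print(mode_and_confidence([0, 1, 2]))  # (0, 0.333)
```

The last case shows why samples at the lowest possible agreement need separate handling: a modal class is still returned even though no two labelers agreed.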

In [12]:
# @title (RUN) Agreement mode

num_labelers = len(fs)
labels = []

for i in range(1, num_labelers + 1):
    file_name = os.path.splitext(os.path.basename(fs[i-1]))[0]
    labels.append(file_name)

combined[['mode', 'confidence']] = combined[labels].apply(get_mode_and_occurence, axis=1, result_type='expand')
pd.set_option('display.max_rows', None)
#print(combined)
combined['mode'] = combined['mode'].astype(int)

# Set the mode to -9999 where there is no agreement (confidence = 1/num_labelers)
combined.loc[combined['confidence'] <= 1/num_labelers, 'mode'] = -9999

combined = combined.drop(combined[combined['mode'] == -9999].index)
print(f"Combined project has {combined.shape[0]} rows, {combined.shape[1]} "\
      "columns")

combined_pl2 = combined.drop(columns=['Clust'])

combined_pl2.head()

Combined project has 1342 rows, 9 columns


| | Point_ID | Lat | Lon | ceo-survey-user1 | ceo-survey-user2 | ceo-survey-user3 | mode | confidence |
|---|---|---|---|---|---|---|---|---|
| 0 | 140884433 | -8.3219 | -75.045545 | 1 | 1 | 1 | 1 | 1.0 |
| 1 | 140884434 | -8.344409 | -74.884792 | 0 | 0 | 0 | 0 | 1.0 |
| 2 | 140884435 | -8.836094 | -74.342566 | 0 | 0 | 0 | 0 | 1.0 |
| 3 | 140884436 | -8.32163 | -75.031377 | 1 | 1 | 1 | 1 | 1.0 |
| 4 | 140884437 | -10.948943 | -71.736808 | 0 | 0 | 0 | 0 | 1.0 |


We can then calculate the mean agreement of the samples in each modal class to get a sense of label uncertainty for each of the recoded classes.

In [13]:
# @title (RUN) Calculate average agreement per sample

agreement = combined.groupby("mode").mean()
agreement = agreement.rename(columns={"confidence": "mean confidence"})
print(agreement[['mean confidence']])

      mean confidence
mode                 
0            0.988607
1            0.894253
2            0.690476
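
The table above is a per-class mean of `confidence`. As a minimal sketch of that calculation without pandas, using made-up values rather than the project data (`mean_confidence_by_mode` is a hypothetical name):

```python
from collections import defaultdict

def mean_confidence_by_mode(rows):
    """Average the confidence of (mode, confidence) pairs, grouped by mode."""
    grouped = defaultdict(list)
    for mode, confidence in rows:
        grouped[mode].append(confidence)
    return {mode: sum(vals) / len(vals) for mode, vals in sorted(grouped.items())}

# Made-up (mode, confidence) pairs for illustration only.
rows = [(0, 1.0), (0, 1.0), (1, 1.0), (1, 0.5), (2, 0.5)]
print(mean_confidence_by_mode(rows))  # {0: 1.0, 1: 0.75, 2: 0.5}
```

A lower mean for a class indicates that labelers disagreed more often on samples assigned to that class.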


## Split the dataset <a class="anchor" id="split"></a>

Here we split the dataset into three parts for model training, validation, and final assessment (the portion set aside as the test or map reference dataset), according to the proportions you specify.

The splits are confined to the usable sample, which is defined as samples not falling into the unsure class and those where there was modal agreement on the class. The resulting splits are denoted in a column called `usage`.

Values of "unusable" in the `usage` column indicate observations that were not usable because of low agreement or because they were classified as unsure. They are included here for completeness, and in case they help with evaluation.
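
The rule described above, splitting only the usable samples and marking the rest as "unusable", can be sketched in plain Python. This is a sketch under stated assumptions and not the notebook's own splitting cell: `assign_usage` is a hypothetical name, the unsure class is assumed to be coded 2 as in the example recode, and the sample data are made up.

```python
import random
from collections import Counter

def assign_usage(samples, train_frac, val_frac, unsure_code=2, seed=999):
    """Map each sample id to 'train', 'validate', 'map_reference/test', or 'unusable'.

    `samples` maps a sample id to its modal class code. Samples whose modal
    class is `unsure_code` are left out of the split and marked 'unusable'.
    """
    usable = [sid for sid, mode in samples.items() if mode != unsure_code]
    random.Random(seed).shuffle(usable)
    # int() truncates, so any remainder ends up in the test portion.
    n_train = int(len(usable) * train_frac)
    n_val = int(len(usable) * val_frac)
    usage = {sid: "unusable" for sid in samples}
    for i, sid in enumerate(usable):
        if i < n_train:
            usage[sid] = "train"
        elif i < n_train + n_val:
            usage[sid] = "validate"
        else:
            usage[sid] = "map_reference/test"
    return usage

# 100 made-up samples: every tenth one is "unsure" (2), the rest alternate 0 and 1.
samples = {i: (2 if i % 10 == 0 else i % 2) for i in range(100)}
usage = assign_usage(samples, train_frac=0.5, val_frac=0.25)
counts = Counter(usage.values())
print(counts["train"], counts["validate"], counts["map_reference/test"], counts["unusable"])
# 45 22 23 10
```

With 90 usable samples, the validation share of 22.5 is truncated to 22, so the test portion receives 23. A fixed seed keeps the assignment reproducible between runs.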


In [None]:
# @title (RUN) the splitting function

while True:
    train_split = float(
        input("What proportion of the sample should be assigned to training?: ")
    )
    validation_split = float(
        input("What proportion should be assigned to validation?: ")
    )
    test_split = float(
        input("What proportion of the sample should be assigned to "\
              "test/reference?: ")
    )

    # Ensure that the splits sum to 1
    split_sum = train_split + validation_split + test_split

    if abs(split_sum - 1) < 1e-9:
        break
    else:
        print(f"The splits sum to {split_sum}, but must equal 1. \n"\
              "Please try again.")

seed = 999

n_samples = len(combined)
n_train = int(n_samples * train_split)
n_val = int(n_samples * validation_split)
n_test = n_samples - n_train - n_val

train = combined.sample(n_train, random_state=seed)
remaining = combined.drop(train.index)
val = remaining.sample(n_val, random_state=seed)
ref = remaining.drop(val.index)

out = pd.concat(
    [train.assign(usage="train"), val.assign(usage="validate"),
     ref.assign(usage="map_reference/test")]
).reset_index(drop=True)
out_pl = out.drop(columns=['Clust'])

out_pl.head()

### Combine and export to csv <a class="anchor" id="combine"></a>

The ineligible portion of the sample is also added back for completeness.

In [16]:
#@title (RUN) Export sample

gdrive_folder = input(f"Enter the name of the output folder: \n\n")
csv_name = input(f"Enter the name of the output csv file: \n\n")

output_dir = f"{root}/MyDrive/{gdrive_folder}"
os.makedirs(output_dir, exist_ok=True)

outpath = os.path.join(output_dir, csv_name)

with open(outpath, 'w') as f:
    out.to_csv(f, float_format='{:f}'.format, encoding='utf-8', index=False)

print('file exported')

Enter the name of the output folder: 

final_test
Enter the name of the output csv file: 

test
file exported


The sample locations can then be displayed on a map, colored by usage and by modal class.

In [17]:
# @title (RUN) Display results

# For the usage category
color_mapping = {
    'train': 'blue',
    'validate': 'green',
    'map_reference/test': 'red',
    'unusable': 'gray'
}

m = folium.Map(location=[out['Lat'].mean(), out['Lon'].mean()], zoom_start=7)
scatter_group_usage = folium.FeatureGroup(name='Usage')

scatter_group_mode = folium.FeatureGroup(name='Mode')

# Create legend for 'usage'
legend_html_usage = '''
<div style="position: fixed; bottom: 50px; left: 50px; background-color: white; border: 2px solid grey; z-index: 9999; padding: 10px;">
    <h4>Usage</h4>
    <i style="background: blue; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Train<br>
    <i style="background: green; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Validate<br>
    <i style="background: red; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Map Reference/Test<br>
    <i style="background: gray; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Unusable<br>
</div>
'''
legend_usage = folium.Element(legend_html_usage)
m.get_root().html.add_child(legend_usage)

# Create legend for 'mode'
legend_html_mode = '''
<div style="position: fixed; bottom: 50px; left: 230px; background-color: white; border: 2px solid grey; z-index: 9999; padding: 10px;">
    <h4>Mode</h4>
    <i style="background: purple; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Absence<br>
    <i style="background: yellow; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Presence<br>
    <i style="background: black; border-radius: 50%; width: 18px; height: 18px; display: inline-block;"></i> Not Sure<br>
</div>
'''
legend_mode = folium.Element(legend_html_mode)
m.get_root().html.add_child(legend_mode)

for usage, color in color_mapping.items():
    subset = out[out['usage'] == usage]
    for _, row in subset.iterrows():
        folium.CircleMarker(location=[row['Lat'], row['Lon']], radius=1, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(scatter_group_usage)

for mode in range(3):  # 0, 1, 2
    subset = out[out['mode'] == mode]
    color = ['purple', 'yellow', 'black'][mode]
    for _, row in subset.iterrows():
        folium.CircleMarker(location=[row['Lat'], row['Lon']], radius=1, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(scatter_group_mode)

# Add the feature groups to the map
scatter_group_usage.add_to(m)
scatter_group_mode.add_to(m)

# Add OpenStreetMap layer to the map
folium.TileLayer('openstreetmap').add_to(m)

# Add a layer control to toggle between 'usage' and 'mode' scatter plots and OpenStreetMap layer
folium.LayerControl(collapsed=False).add_to(m)

m