# Comparing CEO labels 🏷️
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/compare_ceo_labels.ipynb)

**Description:** This notebook provides code to compare labels of the same data points from two Collect Earth Online (CEO) projects.

In [1]:
import pandas as pd

In [2]:
# TODO: create or cd to an openmapflow project directory and dvc pull to get raw label files?
# Using local paths in v1 to discuss in PR thread.

In [3]:
ceo_set1_path = 'ceo-Hawaii-Jan-Dec-2020-(Set-1)-sample-data-2022-08-16.csv'
ceo_set2_path = 'ceo-Hawaii-Jan-Dec-2020-(Set-2)-sample-data-2022-08-16.csv'

In [4]:
ceo_set1 = pd.read_csv(ceo_set1_path)
ceo_set2 = pd.read_csv(ceo_set2_path)

In [5]:
if ceo_set1.shape != ceo_set2.shape:
    print('ERROR: The size of the two dataframes does not match. Most likely, there is a duplicate in the plotid column resulting from an error in CEO. You need to delete the duplicate manually before continuing.')
    print(ceo_set1[ceo_set1.duplicated(subset=['plotid'])])
    print(ceo_set2[ceo_set2.duplicated(subset=['plotid'])])
else:
    print('Loaded two dataframes with equal size: {}'.format(ceo_set1.shape))

Loaded two dataframes with equal size: (1200, 13)
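
Equal shapes do not guarantee that both files list the same plots in the same order, and the comparisons in this notebook are made row by row. The following is a minimal sketch of a stricter check, assuming both exports carry the `plotid` column used above. The helper name `check_alignment` and the toy dataframes are hypothetical and are not part of openmapflow.

```python
import pandas as pd

def check_alignment(df1: pd.DataFrame, df2: pd.DataFrame, key: str = "plotid") -> list:
    """Return a list of human-readable problems; an empty list means the frames line up."""
    problems = []
    # Duplicated keys within either frame
    for name, df in (("set 1", df1), ("set 2", df2)):
        dupes = df.loc[df.duplicated(subset=[key]), key].tolist()
        if dupes:
            problems.append(f"{name} has duplicated {key} values: {dupes}")
    # Keys present in only one of the frames
    only_1 = sorted(set(df1[key]) - set(df2[key]))
    only_2 = sorted(set(df2[key]) - set(df1[key]))
    if only_1:
        problems.append(f"{key} values only in set 1: {only_1}")
    if only_2:
        problems.append(f"{key} values only in set 2: {only_2}")
    # Same keys, but possibly in a different row order
    same_order = df1[key].reset_index(drop=True).equals(df2[key].reset_index(drop=True))
    if not problems and not same_order:
        problems.append("same plots, but in a different row order")
    return problems

# Toy data (hypothetical), not the Hawaii files
toy1 = pd.DataFrame({"plotid": [1, 2, 3]})
toy2 = pd.DataFrame({"plotid": [1, 3, 3]})
issues = check_alignment(toy1, toy2)
for issue in issues:
    print(issue)
```

An empty result means a positional comparison is safe. Otherwise the messages point at what to fix in the exports first.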


In [6]:
# The wording of the labeling question varies slightly between projects, so read it from the name of the last column
label_question = ceo_set1.columns[-1]

In [7]:
# Row-by-row comparison: assumes both dataframes list the same points in the same order
ceo_agree = ceo_set1[ceo_set1[label_question] == ceo_set2[label_question]]

print('Number of samples that are in agreement: %d out of %d (%.2f%%)' %
      (ceo_agree.shape[0], ceo_set1.shape[0], ceo_agree.shape[0]/ceo_set1.shape[0]*100))

Number of samples that are in agreement: 1109 out of 1200 (92.42%)
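
The comparison above matches rows by position. The sketch below shows a key-based alternative that does not depend on row order, assuming `plotid` uniquely identifies a point in both exports. The function name `agreement_by_key` and the toy `label` column are hypothetical.

```python
import pandas as pd

def agreement_by_key(df1, df2, label_col, key="plotid"):
    """Join the two label sets on `key` and mark the rows whose labels match."""
    merged = df1[[key, label_col]].merge(
        df2[[key, label_col]],
        on=key,
        suffixes=("_set1", "_set2"),
        validate="one_to_one",  # raises MergeError if a key is duplicated
    )
    merged["agree"] = merged[f"{label_col}_set1"] == merged[f"{label_col}_set2"]
    return merged

# Toy data (hypothetical); note that the second frame is in a different row order
toy1 = pd.DataFrame({"plotid": [1, 2, 3], "label": ["Crop", "Non-crop", "Crop"]})
toy2 = pd.DataFrame({"plotid": [3, 1, 2], "label": ["Crop", "Crop", "Crop"]})
merged = agreement_by_key(toy1, toy2, "label")
print(f"{int(merged['agree'].sum())} of {len(merged)} labels agree")
```

Using `validate="one_to_one"` turns a duplicated `plotid` into an explicit error instead of a silently inflated row count.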


In [8]:
# Compute the mask once. A missing label (NaN) never compares equal, so unlabeled points count as disagreements.
disagree = ceo_set1[label_question] != ceo_set2[label_question]
ceo_disagree_set1 = ceo_set1[disagree]
ceo_disagree_set2 = ceo_set2[disagree]

print('Number of samples that are NOT in agreement: %d out of %d (%.2f%%)' %
      (ceo_disagree_set1.shape[0], ceo_set1.shape[0], ceo_disagree_set1.shape[0]/ceo_set1.shape[0]*100))

Number of samples that are NOT in agreement: 91 out of 1200 (7.58%)
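
In pandas a missing value never compares equal to anything, so a point without a label in either set is counted among the disagreements. The sketch below separates genuinely conflicting labels from missing ones. The function name `split_disagreements` and the toy Series are hypothetical.

```python
import pandas as pd

def split_disagreements(labels1: pd.Series, labels2: pd.Series):
    """Return boolean masks (conflict, missing) for two index-aligned label Series."""
    missing = labels1.isna() | labels2.isna()      # at least one label is absent
    conflict = ~missing & (labels1 != labels2)     # both present, but different
    return conflict, missing

# Toy data (hypothetical): one missing label and two real conflicts
s1 = pd.Series(["Crop", "Non-crop", "Non-crop", "Crop"])
s2 = pd.Series(["Crop", None, "Crop", "Non-crop"])
conflict, missing = split_disagreements(s1, s2)
print("conflicting labels:", int(conflict.sum()))
print("missing labels:", int(missing.sum()))
```

On the real data this would be called with `ceo_set1[label_question]` and `ceo_set2[label_question]`.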


In [9]:
pd.set_option('display.max_rows', None)

In [10]:
ceo_disagree_set1[['sampleid', 'email', 'flagged', 'collection_time', 'analysis_duration', 'imagery_title', label_question]]

| | sampleid | email | flagged | collection_time | analysis_duration | imagery_title | Does this pixel contain active cropland? |
|---|---|---|---|---|---|---|---|
| 6 | 6 | hkerner@umd.edu | False | 2022-02-02 22:18 | 14.4 secs | Planet Monthly Mosaics | Non-crop |
| 16 | 16 | hkerner@umd.edu | False | 2022-02-02 22:35 | 125.8 secs | Planet Monthly Mosaics | Non-crop |
| 38 | 38 | hkerner@umd.edu | False | 2022-02-02 23:26 | 100.7 secs | Planet Monthly Mosaics | Non-crop |
| 109 | 109 | cnakalem@umd.edu | False | 2022-04-13 18:25 | 41.4 secs | Mapbox Satellite | Non-crop |
| 110 | 110 | izvonkov@umd.edu | False | 2022-04-13 18:28 | 240.0 secs | Planet Monthly Mosaics | Non-crop |
| 113 | 113 | cnakalem@umd.edu | False | 2022-04-13 18:27 | 28.7 secs | Mapbox Satellite | Non-crop |
| 125 | 125 | izvonkov@umd.edu | False | 2022-04-13 18:51 | 185.0 secs | Planet Monthly Mosaics | Non-crop |
| 135 | 135 | endu@terpmail.umd.edu | False | 2022-06-15 12:29 | 268.4 secs | Planet Monthly Mosaics | Non-crop |
| 147 | 147 | endu@terpmail.umd.edu | False | 2022-06-15 12:40 | 12.0 secs | Planet Monthly Mosaics | Crop |
| 192 | 192 | endu@terpmail.umd.edu | False | 2022-06-25 22:13 | 16.8 secs | Planet Monthly Mosaics | Crop |


In [11]:
ceo_disagree_set2[['sampleid', 'email', 'flagged', 'collection_time', 'analysis_duration', 'imagery_title', label_question]]

| | sampleid | email | flagged | collection_time | analysis_duration | imagery_title | Does this pixel contain active cropland? |
|---|---|---|---|---|---|---|---|
| 6 | 6 | taryndev@umd.edu | True | 2022-07-06 18:19 | -2.7 secs | | |
| 16 | 16 | taryndev@umd.edu | True | 2022-07-06 18:32 | 150.0 secs | | |
| 38 | 38 | taryndev@umd.edu | True | 2022-07-06 18:51 | 143.7 secs | | |
| 109 | 109 | ayang115@umd.edu | False | 2022-07-11 23:04 | 47.4 secs | Planet Monthly Mosaics | Crop |
| 110 | 110 | ayang115@umd.edu | False | 2022-07-11 23:06 | 147.5 secs | Planet Monthly Mosaics | Crop |
| 113 | 113 | ayang115@umd.edu | False | 2022-07-11 23:14 | 169.3 secs | Planet Monthly Mosaics | Crop |
| 125 | 125 | ayang115@umd.edu | False | 2022-07-11 23:25 | 83.0 secs | Planet Monthly Mosaics | Crop |
| 135 | 135 | ayang115@umd.edu | False | 2022-07-13 04:28 | 65.2 secs | Mapbox Satellite | Crop |
| 147 | 147 | ayang115@umd.edu | False | 2022-07-13 04:40 | 32.5 secs | Planet Monthly Mosaics | Non-crop |
| 192 | 192 | ayang115@umd.edu | False | 2022-07-22 18:37 | 29.1 secs | Planet Monthly Mosaics | Non-crop |


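The tables above list, for each of the two sets, the points where the labelers disagreed on the assigned label. Flagged points (such as samples 6, 16, and 38 in Set 2) carry no label, so they also appear as disagreements. Review these points as a group and decide by consensus which label each one should receive.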
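
For the group review it can help to see both labelers' answers for a point in a single row. The sketch below builds such a table, assuming `sampleid` identifies the same point in both sets. The function name `review_table`, the `consensus_label` column, and the toy data (including the example.com addresses) are hypothetical.

```python
import pandas as pd

def review_table(dis1, dis2, label_col, key="sampleid"):
    """Place the set 1 and set 2 answers for each disputed point side by side."""
    cols = [key, "email", label_col]
    table = dis1[cols].merge(dis2[cols], on=key, suffixes=("_set1", "_set2"))
    table["consensus_label"] = pd.NA  # left empty, to be filled in during the review
    return table

# Toy data (hypothetical), standing in for the two disagreement dataframes
d1 = pd.DataFrame({
    "sampleid": [109, 147],
    "email": ["labeler_a@example.com", "labeler_b@example.com"],
    "label": ["Non-crop", "Crop"],
})
d2 = pd.DataFrame({
    "sampleid": [109, 147],
    "email": ["labeler_c@example.com", "labeler_c@example.com"],
    "label": ["Crop", "Non-crop"],
})
table = review_table(d1, d2, "label")
print(table.to_string(index=False))
```

On the real data this would be called with `ceo_disagree_set1`, `ceo_disagree_set2`, and `label_question`.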