<a href="https://colab.research.google.com/github/agroimpacts/VegMapper/blob/dev-calval-simplify/calval/create_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Calculating Sample Sizes for Land Cover Assessment

This notebook walks through the steps to calculate sample sizes and download sample points for assessing the accuracy of a land cover strata layer.

In our example, we will be calculating sample sizes for presence and absence of oil palm fields in Ucayali, Peru. The notebook can calculate reference sample sizes, or reference + training/validation sample sizes.

This notebook requires the following files/information:
- A land cover strata image including classes of **presence** and **absence** for the category of interest. The strata image should be a Google Earth Engine (GEE) image asset of integer type.
- A pair of lists for user defined **presence** and **absence** categories.
- A sampling method with its sampling parameters (see examples).
- A Google Drive account for exporting the final shapefile and CSV files.

The final sample points are exported in Collect Earth Online (CEO) format.

# Overview of Steps

0. Install packages and set project options
1. Load and analyze the stratification layer for collecting the sample
2. Determine the **presence** and **absence** sample sizes required for the reference sample
3. Calculate the binary sample size
4. Include the size requirements for the training and validation samples
5. Distribute sample points among sub-classes
6. Perform sampling on GEE
7. Export the sample to Google Drive for Collect Earth Online (CEO)



# Step 0. Earth Engine Python API Colab Setup

This notebook requires several libraries.

You also need to authenticate your Google account to load data from Google Earth Engine (GEE).

Press the run button next to "Setup code" below to mount your Google Drive folder, import `VegMapper`, and authenticate your Earth Engine account. By default, `VegMapper` is cloned into a "repos" directory in your Google Drive; change `repo_path` in the code block to use a different location. Please respond to the prompts as they arise. You can unfold the code block to inspect the code and change the default paths.

In [None]:
#@title (RUN) Setup code
## Mount Drive
from google.colab import drive
root = '/content/gdrive'
drive.mount(root)

## Clone and/or update VegMapper
import os
from datetime import datetime as dt
repo_path = f"{root}/MyDrive/repos"
clone_path = 'https://github.com/agroimpacts/VegMapper.git'
if not os.path.exists(repo_path):
    print(f"Making {repo_path}")
    os.makedirs(repo_path, exist_ok=True)

if not os.path.exists(f"{repo_path}/VegMapper"):
    !git -C "{repo_path}" clone "{clone_path}"
else:
    !git -C "{repo_path}/VegMapper" pull

os.chdir(f"{repo_path}/VegMapper")

# Import the sample_utils functions
from vegmapper.calval.sample_utils import *
# Import ee explicitly rather than relying on the star import above
import ee

# Authenticate with Earth Engine
ee.Authenticate()
# Initialize the library
ee.Initialize()

# Step 1. Import, analyze, and view the stratification layers

The code below imports a strata image from a Google Earth Engine (GEE) asset. You can use the default asset, which contains land cover information for Ucayali, Peru, or provide a custom asset URL. The code then retrieves the metadata of this image and calculates the category statistics.

For this code to work, the image must reside in the same GEE project and account that you authenticated in the Setup code cell. The GEE project must also be registered for cloud access to enable asset retrieval. Registration is free for non-profit, education, government research, training, and media purposes.

The function `analyze_strata_image()` returns three variables:

- `strata`: The strata image (an `ee.Image` object).
- `strata_df`: The category statistics (a pandas DataFrame).
- `misc`: A Python dictionary containing miscellaneous information such as minimum and maximum category values, bounding box (bbox), and scale.
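
To give a feel for what "category statistics" means here, the sketch below tallies pixel counts and percentages per category for a toy list of pixel values. It is only an illustration: the function `category_stats` and the keys `pixel_ct` and `pct` are hypothetical names, and the actual columns returned by `analyze_strata_image()` may differ.

```python
from collections import Counter


def category_stats(pixel_values):
    """Return per-category pixel counts and percentages, sorted by category.

    Hypothetical helper for illustration; not part of VegMapper.
    """
    counts = Counter(pixel_values)
    total = sum(counts.values())
    return [
        {"Cat": cat, "pixel_ct": n, "pct": 100 * n / total}
        for cat, n in sorted(counts.items())
    ]


# Toy "image": eight pixels falling into three categories
stats = category_stats([1, 1, 2, 5, 5, 5, 5, 2])
for row in stats:
    print(row)
```

On a real strata image these counts are computed on Earth Engine rather than in local memory, but the resulting table plays the same role: it shows how much of the map each category covers.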

In [None]:
# @title (RUN) Import, analyze GEE land cover
default_asset = "users/michaeljcecil/Updated_Strata_v2"
user_choice = input("Do you want to use the default asset (y/n)? ").strip().lower()

if user_choice == 'y':
    asset = default_asset
else:
    asset = input("Enter your asset URL: ")

strata, strata_df, misc = analyze_strata_image(asset)
strata_df

Next, run the code below to map the strata. The colors can be changed by using a different palette (open the code dialog to see how).

In [None]:
#@title (RUN) Map the strata

# display the strata map with default palette
display(strata, misc)
# Add a custom palette
# palette = ['000000', '111111', '222222','333333', 'e32a1c', 'fdcf6f','1f28b4']
# display(strata, misc, palette=palette)

# Step 2. Determine the size of the **presence** and **absence** samples

Step 3 offers three methods for setting the sample sizes. Depending on the method, the sample sizes for presence and absence are either specified arbitrarily or derived from prior knowledge of the strata image. In the latter case, if multiple categories represent presence or absence, we need to specify which categories represent presence and which represent absence.
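
The idea of collapsing several categories into a binary presence/absence table can be sketched with plain dictionaries. This is a minimal illustration under assumed inputs: `consolidate_counts` and the pixel counts are made up, and the `consolidate()` function used in the next cell may compute or label things differently.

```python
def consolidate_counts(pixel_counts, absence_cats, presence_cats):
    """Sum per-category pixel counts into binary absence/presence totals.

    Categories listed in neither group are ignored. Hypothetical helper
    for illustration; not part of VegMapper.
    """
    absence = sum(pixel_counts.get(c, 0) for c in absence_cats)
    presence = sum(pixel_counts.get(c, 0) for c in presence_cats)
    total = absence + presence
    return {
        "absence": {"pixel_ct": absence, "pct": 100 * absence / total},
        "presence": {"pixel_ct": presence, "pct": 100 * presence / total},
    }


# Made-up pixel counts for four categories; category 3 is not of interest
pixel_counts = {1: 200, 2: 300, 3: 4000, 5: 1500}
binary = consolidate_counts(pixel_counts, absence_cats=[5], presence_cats=[1, 2])
print(binary)
```

Here categories 1 and 2 together form the presence class and category 5 forms the absence class, mirroring the example inputs suggested in the prompts below.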

In [None]:
# @title (RUN) Enter presence and absence categories
x = input("Which categories represent the presence class?\n"
          "If there is more than one, separate them with commas.\n"
          "If you are using the layer provided with this example, enter 1,2:\n")
presence_cats = [int(c) for c in x.split(",")]

x = input("Which categories represent the absence class?\n"
          "If there is more than one, separate them with commas.\n"
          "If you are using the layer provided with this example, enter 5:\n")
absence_cats = [int(c) for c in x.split(",")]

num_strata_presence = len(presence_cats)
num_strata_absence = len(absence_cats)

# Save a copy of the original multi-category strata
strata_df_mltcat = strata_df.copy()

# Example absence and presence categories for the default layer
# absence_cats, presence_cats = [5], [1, 2]

# Consolidate multi-categories to binary presence absence
strata_df_bincat = consolidate(strata_df, absence_cats, presence_cats)

print("\n Count and percent of the consolidated binary classes")
print(strata_df_bincat)

# Step 3. Calculate binary reference sample size

There are three possible methods:

1. **Manual**: You choose the sample size for each of the two classes.
2. **Automatic: adjust required number**: Adjusts a required sample size for each class according to the estimated areas of the presence and absence classes.
3. **Automatic: margin of error**: Calculates the sample size statistically. Two selection algorithms are available (see the methods appendix at the end for more details on these approaches). *This method is the default option used here.*
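
To illustrate how a margin of error, a confidence level, and an anticipated accuracy translate into a sample size, the sketch below uses the textbook normal-approximation formula for estimating a proportion under simple random sampling, n = z² p (1 − p) / d². This is an assumption made for illustration only; it is not necessarily what `automatic_moe()` computes for either the "StehmanFoody" or the "Olofsson" option, and `sample_size_for_proportion` is a hypothetical name.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(anticipated_acc, margin_of_error, confidence_level):
    """Sample size needed to estimate a proportion within +/- margin_of_error.

    Uses n = z^2 * p * (1 - p) / d^2, rounded up. Illustrative only.
    """
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n = z**2 * anticipated_acc * (1 - anticipated_acc) / margin_of_error**2
    return math.ceil(n)


# Example inputs matching the prompts below: MOE 0.07, confidence 0.95,
# anticipated accuracy 0.9 (absence) and 0.7 (presence)
n_absence = sample_size_for_proportion(0.9, 0.07, 0.95)
n_presence = sample_size_for_proportion(0.7, 0.07, 0.95)
print(n_absence, n_presence)
```

The class with the lower anticipated accuracy needs the larger sample, because p (1 − p) grows as p moves toward 0.5.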



In [None]:
#@title ## (RUN) Draw sample using default MOE approach
cfg_inputs = []
prompt_list = [
    "1. Enter the desired margin of error (e.g. 0.07): ",
    "2. Enter the desired confidence level (e.g. 0.95): ",
    "3. Enter the minimum sample size per class (e.g. 30): ",
    "4. Enter the anticipated accuracy for each class (absence, presence),\n"
    "   separated by a comma (e.g. 0.9, 0.7): "
]
for prompt in prompt_list:
    cfg_inputs.append(input(prompt))

cfg_stehman_foody = {
    "MarginOfError": float(cfg_inputs[0]),
    "ConfidenceLevel": float(cfg_inputs[1]),
    "MinimumClassSample": int(cfg_inputs[2]),
    "anticipatedAcc": [float(i) for i in cfg_inputs[3].split(",")]
}

strata_df_bincat = automatic_moe(strata_df_bincat, MOE_Algorithm="StehmanFoody",
                                 **cfg_stehman_foody)
print(strata_df_bincat)

## Alternative methods (don't run unless needed)

In [None]:
#@title #### Alternate MOE-approach
# Using the Olofsson algorithm
cfg_Olofsson = {"MarginOfError":0.07,
                "ConfidenceLevel":0.95,
                "MinimumClassSample":30,
                "CategoryOfInterest":1,
                "mappingAcc":[0.000000001, 0.7]}

strata_df_bincat = automatic_moe(strata_df_bincat, MOE_Algorithm="Olofsson",
                                 **cfg_Olofsson)
print(strata_df_bincat)

In [None]:
#@title #### Manual method
# Arbitrary sample size for presence and absence from user input.
strata_df_bincat = manual(strata_df_bincat, absenceSamples=1000,
                          presenceSamples=1000)
print(strata_df_bincat)

In [None]:
#@title ### Automatic required number
strata_df_bincat = automatic_requiredNumber(
    strata_df_bincat, 1000, 0.6, 1000, 1
)
print(strata_df_bincat)

# Step 4. Include Training/Validation requirements

Adjust the overall sample size to account for training/validation requirements. Enter the proportion of the total sample that the reference sample should represent, anywhere from 1 (the reference sample is the only sample) to 0.01 (the reference sample is only 1% of the total required).
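
As a worked example of this adjustment, the sketch below divides illustrative reference sample sizes by the chosen proportion. The numbers and the helper `scale_to_total` are hypothetical, and the sketch rounds up, whereas the cell below truncates with `.astype(int)`.

```python
import math


def scale_to_total(reference_n, reference_proportion):
    """Total sample size when the reference sample is a given share of it."""
    if not 0.01 <= reference_proportion <= 1:
        raise ValueError("reference_proportion must be between 0.01 and 1")
    # Rounding up is a choice made in this sketch, not the notebook's behavior
    return math.ceil(reference_n / reference_proportion)


# Illustrative reference sample sizes for the two binary classes
reference = {"absence": 71, "presence": 165}

# The reference sample should make up 25% of the total sample
total = {cls: scale_to_total(n, 0.25) for cls, n in reference.items()}
print(total)
```

With a proportion of 0.25 each class needs four times its reference sample; with a proportion of 1 the sizes are unchanged.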

In [None]:
#@title (RUN) Enter training/validation requirements
x = input("Enter the proportion of the total sample \n"
          "that the reference sample represents, e.g. 0.2: \n\n")
prop = float(x)
# The proportion must lie between 0.01 (1%) and 1 (100%)
if not 0.01 <= prop <= 1:
    print("The reference sample has to be between 1% and 100% of the overall "
          "sample. Please enter another value\n")
else:
    strata_df_bincat['nh_adjusted'] = (strata_df_bincat['nh_adjusted'] / prop)\
        .astype(int).tolist()

    print("\nStrata distribution with updated sample requirements")
    print(strata_df_bincat)

# Step 5. Distribute the number of sample points among sub-classes

This step distributes the absence and presence samples among their sub-categories. A set of weights for each class determines how much of that class's total sample size is allocated to each of the strata representing it. In the example we are running here, the presence class is represented by two strata, so two proportions are given, one for each stratum, specifying what share of the adjusted presence sample is collected from that stratum.
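
The allocation can be sketched as splitting a class total by weights while keeping the parts summing to that total. The helper `distribute` below is hypothetical and uses largest-remainder rounding, which is an assumption of this sketch; `distribute_sample()` may round differently.

```python
import math


def distribute(total, weights):
    """Split an integer total among strata in proportion to weights.

    Uses largest-remainder rounding so the parts always sum to total.
    """
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    exact = [total * w for w in weights]
    alloc = [math.floor(x) for x in exact]
    leftover = total - sum(alloc)
    # Give the remaining points to the strata with the largest fractional parts
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# 660 presence points split 75% / 25% between two presence strata
two_strata = distribute(660, [0.75, 0.25])
# Equal thirds of 100 cannot be exact; one stratum receives the extra point
three_strata = distribute(100, [1 / 3, 1 / 3, 1 / 3])
print(two_strata, three_strata)
```

A class with a single stratum simply receives a weight of 1, so its whole sample goes to that stratum.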


In [None]:
#@title (RUN) Enter proportions for each class's sub-strata

awts = None
pwts = None

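import math

def validate_proportions(prompt, num_strata):
    while True:
        prop_input = input(prompt)
        try:
            prop_list = [float(p) for p in prop_input.split(",")]
        except ValueError:
            # Non-numeric input: fall through to the error message below
            prop_list = []

        # Compare with a tolerance, because floating-point sums of decimal
        # fractions may not equal 1 exactly
        if len(prop_list) == num_strata and math.isclose(sum(prop_list), 1):
            return prop_list
        else:
            print(f"Please enter {num_strata} proportions that sum to 1.")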

if num_strata_absence > 1:
    pstr = f"Enter a proportion for each of the {num_strata_absence} \n"\
        "**absence** strata, separated by commas (they must sum to 1): "
else:
    pstr = "There is just 1 absence stratum, so enter 1: "

awts = validate_proportions(pstr, num_strata_absence)

if num_strata_presence > 1:
    pstr = f"Enter a proportion for each of the {num_strata_presence} \n"\
        "**presence** strata, separated by commas (they must sum to 1): "
else:
    pstr = "There is just 1 presence stratum, so enter 1: "
pwts = validate_proportions(pstr, num_strata_presence)

strata_df_mltcat = distribute_sample(
    strata_df_bincat, strata_df_mltcat, absence_cats, presence_cats,
    absenceSampleWeights=awts, presenceSampleWeights=pwts
)

# Add the discarded classes back to the sample table, because the GEE
# stratifiedSample() function requires an explicit sample size of 0 for
# categories that are not of interest.
strata_df_mltcat = unwant_cat_samples_zero(strata_df_mltcat)

strata_df_mltcat = strata_df_mltcat.dropna()

print(strata_df_mltcat)

# Step 6. Perform Sampling in GEE

This code uses the GEE API to calculate the locations of all sample points and then displays them on a map. Presence samples are shown in <font color='red'>**red**</font> and absence samples in <font color='blue'>**blue**</font>.
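
The sketch below is a small local analogue of stratified random sampling, meant only to show how a per-class point count behaves, including the case where a class has fewer pixels than requested. It does not call Earth Engine; the function `stratified_sample` and the toy pixel list are hypothetical.

```python
import random
from collections import Counter


def stratified_sample(pixels, class_points, seed):
    """Draw up to class_points[cat] random pixels from each category.

    pixels is a list of (pixel_id, category) pairs. A category with fewer
    pixels than requested contributes all of its pixels.
    """
    rng = random.Random(seed)
    by_class = {}
    for pixel_id, cat in pixels:
        by_class.setdefault(cat, []).append(pixel_id)
    sample = []
    for cat, n in class_points.items():
        candidates = by_class.get(cat, [])
        k = min(n, len(candidates))
        sample.extend((pid, cat) for pid in rng.sample(candidates, k))
    return sample


# Toy strata: 3 pixels of category 1, 10 of category 2, 50 of category 5
pixels = list(enumerate([1] * 3 + [2] * 10 + [5] * 50))
# Request 5, 4, 0, and 20 points; category 3 is not of interest
points = stratified_sample(pixels, {1: 5, 2: 4, 3: 0, 5: 20}, seed=9999)
counts = Counter(cat for _, cat in points)
print(dict(counts))
```

Category 1 yields only three points because only three pixels exist, which is the same shortfall that can occur on a real strata image when a category is rare.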

In [None]:
#@title (RUN) Draw and display sample
seed = 9999

# numPoints=10 is a placeholder specification. It will be overwritten by
# classPoints, which contains the sample size for sampling.
samples = strata.stratifiedSample(
    numPoints=10,
    classBand=misc['classBand'],
    projection='EPSG:3857',
    classValues=ee.List(strata_df_mltcat['Cat'].tolist()),
    classPoints=ee.List(strata_df_mltcat['nh_final'].tolist()),
    geometries=True,
    scale=30,
    seed=seed,
    tileScale=1
)


# Split the sample into presence and absence points for display. Note that if
# a category does not have enough pixels to meet the sample size requirement,
# there may be fewer sample points for that category.
samples_presence = samples.filter(
    ee.Filter.inList(misc['classBand'], presence_cats)
)
samples_absence = samples.filter(
    ee.Filter.inList(misc['classBand'], absence_cats)
)

# Get a list of feature dictionaries with lat and lon properties, initial write
# to csv
samples_df = sampleFC_to_csv(samples)
timestamp = dt.now().strftime("%Y_%m_%d_%H%M%S")

display_samples(strata, misc, samples_presence, samples_absence)

# Step 7. Export samples for a Collect Earth Online (CEO) project

This code exports the sample points to your Google Drive. The points are written as a CSV file, using the folder and file name you specify.



In [None]:
#@title (RUN) Export your points
gdrive_folder = input(f"Enter the name of the output folder: \n\n")
csv_name = input(f"Enter the name of the output csv file: \n\n")

outpath = f"{root}/MyDrive/{gdrive_folder}"
if not os.path.exists(outpath):
    print(f"Creating output folder {gdrive_folder}")
    os.makedirs(outpath, exist_ok=True)

# Shuffle the rows, then assign sequential plot IDs
samples_df = samples_df.sample(frac=1, random_state=7, ignore_index=True)
samples_df["PLOTID"] = range(len(samples_df))
samples_df[["PLOTID", "LAT", "LON", "STRATA_CAT"]]\
    .to_csv(f"{outpath}/{csv_name}", index=False)

print(f"{outpath}/{csv_name} is written and complete!")
