This notebook follows the recommendations in P. Olofsson et al. (2014) to create a stratified sampling design per map class (page 52 sec. 5). 
The goal is to create a sampling design by exploring how accuracy metrics change according to different sample sizes and sample allocations per class. 
The metrics used to investigate sampling design are overall and class-specific accuracies, and confidence intervals for each accuracy and area parameter.

In [1]:
import os
import numpy as np
import pandas as pd

### Determining the sample size (Olofsson et al. 2014 - sec. 5.1.1)

For stratified random sampling, Cochran (1977, Eq. 5.25) suggests the following sample size formula (Olofsson et al. 2014, Eq. 13):

$$n \approx \left( \frac{\sum W_iS_i}{S(\hat{O})}\right)^2,$$

where

- $S(\hat{O})$ is the standard error of the estimated overall accuracy $\hat{O}$ that we would like to achieve,

- $W_i$ is the fraction of the map labeled as class $i$, and

- $S_i = \sqrt{U_i(1-U_i)}$ is the standard deviation of class $i$.

In the standard deviation formula, $U_i$ is the user's accuracy for class $i$, i.e. the precision for each class: TP/(TP+FP).

To determine the sample size with the previous equation, we need to specify the user's accuracy for each class, $U_i$, and the target standard error of the estimated overall accuracy, $S(\hat{O})$. The class fractions $W_i$ do not need to be estimated, since they come from counting the number of pixels of each class in the map. 

To estimate $U_i$ (and thus obtain the standard deviations $S_i$) we can use the user's accuracies obtained by applying the model to the test set. Keep in mind that user's accuracies estimated from the test set may still be optimistic. 
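
As a minimal sketch of this step, the snippet below computes per-class user's accuracies and the corresponding standard deviations from a confusion matrix. The matrix values and its layout (rows are reference classes, columns are map classes) are assumptions made up for illustration; they are not results from this model.

```python
import numpy as np

# Hypothetical test-set confusion matrix (illustrative values only).
# Rows: reference (true) class, columns: map (predicted) class.
conf_mat = np.array([
    [80,  5, 10,  5],
    [10, 40,  0,  0],
    [ 8,  5, 90,  2],
    [ 2,  0,  0, 95],
])

# User's accuracy U_i = TP / (TP + FP): diagonal divided by column totals
users_acc = np.diag(conf_mat) / conf_mat.sum(axis=0)

# Standard deviation per class, S_i = sqrt(U_i * (1 - U_i))
S = np.sqrt(users_acc * (1 - users_acc))

print(np.round(users_acc, 3))
print(np.round(S, 3))
```

If the confusion matrix is stored with the opposite orientation, the column sum would have to become a row sum (`axis=1`).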

Recall that the square root of the estimated variance results in the standard error of the estimator. For example, in the case of the estimated overall accuracy of the map $\hat{O}$ we have that $\hat{S}(\hat{O}) = \sqrt{\hat{V}(\hat{O})}$ (Olofsson et al. 2014 Eq. 5). The standard error is used to get confidence intervals for the estimated statistic: the 95% confidence interval is estimated as $\hat{O} \pm 1.96 \hat{S}(\hat{O}) = \hat{O} \pm 1.96 \sqrt{\hat{V}(\hat{O})}$.

To choose a value of $S(\hat{O})$ we are comfortable with, consider the following example. Suppose we expect an overall accuracy ((TP + TN)/(P + N)) of 0.9, meaning 90% of the pixels are correctly classified. A standard error of 0.01 means that the 95% confidence interval around the estimated overall accuracy is $0.9 \pm 1.96 \times 0.01 = 0.9 \pm 0.0196$, so OA is between 88% and 92%. If we increase the standard error to 0.02, the 95% confidence interval for the estimated OA would be between 86% and 94%. The half-width of the interval scales linearly with $S(\hat{O})$, so doubling the standard error doubles the width of the confidence interval.
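
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the helper name `ci_95` is hypothetical, and the list of standard errors is an arbitrary choice for illustration.

```python
# Example overall accuracy from the text and the 95% normal quantile
overall_acc = 0.9
z = 1.96

def ci_95(estimate, std_error):
    """Return (lower, upper) bounds of the 95% confidence interval."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Compare interval widths for a few candidate standard errors
for se in [0.01, 0.017, 0.02]:
    low, high = ci_95(overall_acc, se)
    print(f"SE = {se:.3f}: 95% CI = [{low:.4f}, {high:.4f}]")
```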

### Import data: Pixels per class in map of SB coast

In [3]:
year = 2020
pixel_count = pd.read_csv('model_AE5FP_map_pixel_counts.csv')
pixel_count

Unnamed: 0,n_nonice_2020,n_ice_2020,n_ground_2020,n_water_2020,raster
0,36271293,5382187,111150412,62968690,modelAE5_FP_2020_merged_crs26910_S_2020
1,1122203,30004,1891593,2893071,modelAE5_FP_2020_merged_crs26910_W_2020
2,89669636,1123921,62587031,69125241,modelAE5_FP_2020_merged_crs26911_2020


In [6]:
df = pd.DataFrame([pixel_count.sum(numeric_only=True)])
df

Unnamed: 0,n_nonice_2020,n_ice_2020,n_ground_2020,n_water_2020
0,127063132,6536112,175629036,134987002


In [7]:
df = df.rename(columns={'n_nonice_2020':'nonice', 
                        'n_ice_2020': 'ice',
                        'n_ground_2020': 'ground',
                        'n_water_2020': 'water'})
df

Unnamed: 0,nonice,ice,ground,water
0,127063132,6536112,175629036,134987002


In [3]:
# ---------------------------------------------
# --------------- PARAMETER -------------------
prefix = 'salt13_p30'

In [6]:
df = pd.read_csv(os.path.join(os.getcwd(), prefix+'_total_pixel_counts.csv'))

# total number of pixels in each class
# (hard-coded here; the commented lines would read the same values from df)
#classes = list(df.columns)
#n_pix = list(df.iloc[0,])
n_pix = [119608120, 6546769, 168627446, 19444041]

#print(classes)
print(n_pix)

[119608120, 6546769, 168627446, 19444041]


In [9]:
# ---------------------------------------------
# --------------- PARAMETERS ------------------
# standard error for all the points
std_error = 0.017

# estimates of user's accuracies TP/(TP+FP)
# classes are: [other vegetation, iceplant, low ndvi, water]
U = [0.8, 0.8, 0.8, 0.9]
# ---------------------------------------------

# fraction of pixels with a given class in total pixels
total_pix = sum(n_pix)
pix_prop = [n/total_pix for n in n_pix]

# standard deviation of user's accuracies
# Cochran, 1977, Eq (5.55)
stdv = [ np.sqrt(u*(1-u)) for u in U]

numerator = sum([ x*y for x,y in zip(pix_prop, stdv)])

sample_size = (numerator/std_error)**2
sample_size

536.6365512761187
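
Since the goal is to explore how the design responds to different choices, it can help to wrap Eq. 13 in a function and evaluate it for several target standard errors. The sketch below assumes the pixel counts and user's accuracies from the cells above; the function name `cochran_sample_size` and the list of candidate standard errors are illustrative choices.

```python
import numpy as np

def cochran_sample_size(n_pix, U, std_error):
    """Sample size from Olofsson et al. (2014) Eq. 13 for a target standard error."""
    W = np.asarray(n_pix, dtype=float) / np.sum(n_pix)  # class fractions W_i
    U = np.asarray(U, dtype=float)
    S = np.sqrt(U * (1 - U))                            # per-class std. deviations S_i
    return (np.sum(W * S) / std_error) ** 2

# Values copied from the cells above
n_pix = [119608120, 6546769, 168627446, 19444041]
U = [0.8, 0.8, 0.8, 0.9]

for se in [0.01, 0.015, 0.017, 0.02, 0.025]:
    n = cochran_sample_size(n_pix, U, se)
    print(f"target SE = {se:.3f} -> n ~ {int(np.ceil(n))}")
```

Because the standard error enters the formula squared, halving the target standard error multiplies the required sample size by four.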

### Determine sample allocation per class (Olofsson et al., 2014 - sec. 5.1.2)

Once the overall sample size is chosen, we can determine how many sample points to allocate to each class. 
Roughly speaking, there are two poles for the sample allocation: 

(1) proportional: allocate sample sizes proportionally to the area covered by each class, and

(2) equal: divide the sample size equally among classes. 

The tradeoff is that proportional allocation places few samples in rare classes and therefore gives imprecise estimates of the user's accuracy for those classes. 
On the other hand, according to Olofsson et al., equal allocation "is not optimized for estimating area and overall accuracy". 
A suggested middle ground is to first allocate a fixed sample size of 50-100 to each rare class, calibrating that size against the standard error we want to achieve given the assumed user's accuracy for the class. 
Once the rare classes have their allocation, the remaining sample points are distributed proportionally among the remaining classes. The estimated standard errors and accuracies can then be computed from the sample allocation per class.
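
To calibrate the fixed allocation for a rare class, the standard error formula for user's accuracy (Olofsson et al. 2014, Eq. 6), $S(\hat{U}_i) = \sqrt{U_i(1-U_i)/(n_i-1)}$, can be solved for $n_i$. The sketch below does this for an assumed user's accuracy of 0.8; the helper name `rare_class_sample_size` and the target standard errors are hypothetical choices for illustration.

```python
import math

def rare_class_sample_size(u, target_se):
    """Sample size n such that sqrt(u*(1-u)/(n-1)) <= target_se."""
    return math.ceil(u * (1 - u) / target_se**2 + 1)

# Assumed user's accuracy of 0.8 for the rare class
for target_se in [0.03, 0.04, 0.05]:
    n = rare_class_sample_size(0.8, target_se)
    print(f"target SE = {target_se:.2f} -> n >= {n}")
```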

In [16]:
# --------------------------------------------------------------------------------------
def sample_allocation(fixed_n, fix_veg, sample_size, n_pix):
    """ sample allocation per class combining fixed + proportional allocation
    
    Allocates a fixed number of sample points either to both the other_veg and 
    iceplant classes or only to the iceplant class, and distributes the remaining 
    points proportionally (by pixel count) among the remaining classes.
    
    Parameters
    ----------
    fixed_n : int
        the number of samples to allocate to indicated classes
    
    fix_veg : bool
        if True: allocate fixed_n points to both iceplant and other_veg classes
        if False: only allocate fixed_n points to iceplant class
    
    sample_size : int or float
        number of points to be distributed among classes
        see Olofsson et al., 2014, Eq. 13
        
    n_pix : list
        list of integers with the total number of pixels per class, in the order
        [other vegetation, iceplant, low ndvi, water]
        
    -->> TO DO: SHOULD CHANGE THIS TO USE CLASS NAMES AND NOT LOCATIONS  <<--
    """
    
    if fix_veg:
        # total pixels of water + low ndvi
        d = n_pix[2]+n_pix[3] 
        
        prop2 = n_pix[2]/d
        prop3 = n_pix[3]/d
        remain = sample_size-(fixed_n*2)
        return [fixed_n, fixed_n, int(remain*prop2), int(remain*prop3)] 
    
    # total pixels of other vegetation + water + low ndvi
    d = n_pix[0] + n_pix[2] + n_pix[3] 
    
    prop0 = n_pix[0]/d
    prop2 = n_pix[2]/d
    prop3 = n_pix[3]/d
    remain = sample_size-fixed_n
    return [int(remain*prop0), fixed_n, int(remain*prop2), int(remain*prop3)] 

# --------------------------------------------------------------------------------------
def strat_stderror(U, strat_sample):
    """ estimated standard error of estimated user's accuracies (U) per class 
    
    See Olofsson et al., 2014, Eq. 6.
    
    Parameters:
    ----------
    U : list
        a list with the estimated user's accuracies per class, each is a number in [0,1]
        
    strat_sample: list
        list of integers indicating how many points to sample from each class
    
    Return:
    -------
    A list with the std. errors for each class based on U and the stratified sample.
    
    """
    
    return [ np.sqrt(u*(1-u)/(n-1)) for u,n in zip(U, strat_sample) ]

# --------------------------------------------------------------------------------------
def confidence_intervals(U, strat_sample):
    """ radius of 95% conf. interval around estimated user's accuracy for each class
    
    Parameters
    ----------
    U : list
        a list with the estimated user's accuracies per class, each is a number in [0,1]
        
    strat_sample: list
        list of integers indicating how many points to sample from each class
    
    Return:
    -------
        a list with the radius of the 95% confidence interval (as a percentage and 
        rounded to two decimal places) for the estimated user's accuracies

    """
    
    se = strat_stderror(U, strat_sample)
    # 196 = 1.96 (z-score for a 95% interval) * 100 (convert to percentage)
    return [np.round(196*x,2) for x in se]
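
The TO DO in `sample_allocation` suggests addressing classes by name rather than by list position. One possible way to do that is sketched below. `allocate_by_name` is a hypothetical helper, not part of the notebook, and the dictionary keys (`other_veg`, `iceplant`, `low_ndvi`, `water`) are assumed labels for the four classes.

```python
def allocate_by_name(fixed, sample_size, pix_counts):
    """Hybrid allocation keyed by class name (hypothetical helper).

    fixed       : dict mapping class name -> fixed number of sample points
    sample_size : total number of points to distribute
    pix_counts  : dict mapping class name -> number of map pixels
    """
    # points left after the fixed allocations are taken out
    remain = sample_size - sum(fixed.values())
    # pixels in the classes that share the remaining points proportionally
    free_total = sum(n for name, n in pix_counts.items() if name not in fixed)
    return {
        name: fixed[name] if name in fixed else int(remain * n / free_total)
        for name, n in pix_counts.items()
    }

# Pixel counts from the cells above, with assumed class labels as keys
pix_counts = {'other_veg': 119608120, 'iceplant': 6546769,
              'low_ndvi': 168627446, 'water': 19444041}

# Only iceplant gets a fixed allocation of 100 points
alloc = allocate_by_name({'iceplant': 100}, 536.6365512761187, pix_counts)
print(alloc)
```

Like `sample_allocation`, this truncates the proportional shares with `int`, so the allocations can add up to slightly fewer points than `sample_size`.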
    

In [7]:
# Distributing sample among classes

conf_intrs =[]
strat_samples = []
strat_title = []

# ---------------------------------------------
strat_title.append('equal')
sample_equal = [sample_size/4 for i in range(0,4)]
strat_samples.append(sample_equal)
conf_intrs.append(confidence_intervals(U, sample_equal))

# ---------------------------------------------
# vegetation and iceplant get equal allocations
for n in [200,150, 140, 130, 120, 100, 90,75]:
    strat_title.append(str(n))
    sample = sample_allocation(n, True, sample_size, n_pix)
    strat_samples.append(sample)
    conf_intrs.append(confidence_intervals(U, sample))
    
# ---------------------------------------------
# only iceplant gets fixed allocation
for n in [200,150, 140, 130, 120, 100, 90,75]:
    strat_title.append(str(n))
    sample = sample_allocation(n, False, sample_size, n_pix)
    strat_samples.append(sample)
    conf_intrs.append(confidence_intervals(U, sample))    
    
# ---------------------------------------------

strat_title.append('prop')
sample_prop = [sample_size*x for x in pix_prop]
strat_samples.append(sample_prop)
conf_intrs.append(confidence_intervals(U, sample_prop))


In [8]:
strat_df = pd.DataFrame(strat_samples).T
strat_df.columns  = strat_title
strat_df

Unnamed: 0,equal,200,150,140,130,120,100,90,75,200.1,150.1,140.1,130.1,120.1,100.1,90.1,75.1,prop
0,148.977314,200.0,150.0,140.0,130.0,120.0,100.0,90.0,75.0,114.0,129.0,132.0,135.0,138.0,143.0,146.0,151.0,170.453605
1,148.977314,200.0,150.0,140.0,130.0,120.0,100.0,90.0,75.0,200.0,150.0,140.0,130.0,120.0,100.0,90.0,75.0,8.768113
2,148.977314,110.0,167.0,178.0,189.0,201.0,223.0,235.0,252.0,158.0,178.0,182.0,186.0,190.0,198.0,203.0,209.0,235.604159
3,148.977314,85.0,128.0,137.0,145.0,154.0,172.0,180.0,193.0,122.0,137.0,140.0,143.0,146.0,152.0,156.0,160.0,181.083378


In [9]:
conf_intrs_df = pd.DataFrame(conf_intrs).T
conf_intrs_df.columns  = strat_title
conf_intrs_df

Unnamed: 0,equal,200,150,140,130,120,100,90,75,200.1,150.1,140.1,130.1,120.1,100.1,90.1,75.1,prop
0,6.44,5.56,6.42,6.65,6.9,7.19,7.88,8.31,9.11,7.38,6.93,6.85,6.77,6.7,6.58,6.51,6.4,6.02
1,6.44,5.56,6.42,6.65,6.9,7.19,7.88,8.31,9.11,5.56,6.42,6.65,6.9,7.19,7.88,8.31,9.11,28.13
2,4.83,5.63,4.56,4.42,4.29,4.16,3.95,3.84,3.71,4.69,4.42,4.37,4.32,4.28,4.19,4.14,4.08,3.84
3,3.51,4.66,3.79,3.66,3.56,3.45,3.27,3.19,3.08,3.88,3.66,3.62,3.58,3.55,3.48,3.43,3.39,3.18
