Following P. Olofsson et al.:

(page 52 sect. 5)

This accuracy assessment is designed to estimate:
- overall and class-specific accuracies,
- the areas of the individual classes (as determined by the reference classification), and
- confidence intervals for each accuracy and area parameter.


The sampling design is stratified random sampling, with the map classes (iceplant, other vegetation, low ndvi, water) as strata.

In [1]:
import os
import numpy as np
import pandas as pd

### Pixels per class in full map of SB coast

In [2]:
year = 2020
prefix = 'modelAE5_FP_2020'
pixel_count = pd.read_csv(os.path.join(os.getcwd(), prefix+'_rasters_'+str(year)+'_pixel_counts.csv'))
pixel_count

Unnamed: 0,n_nonice_2020,n_ice_2020,n_ground_2020,n_water_2020,raster
0,36271293,5382187,111150412,62968690,modelAE5_FP_2020_merged_crs26910_S_2020
1,1122203,30004,1891593,2893071,modelAE5_FP_2020_merged_crs26910_W_2020
2,89669636,1123921,62587031,69125241,modelAE5_FP_2020_merged_crs26911_2020


In [3]:
# total number of pixels in each category
n_pix = [sum(pixel_count.iloc[:,c]) for c in range(0,4)]
n_pix

[127063132, 6536112, 175629036, 134987002]

### 5.1.1  Determining the sample size

For stratified random sampling, Cochran (1977, eq. 5.25) suggests the following sample size formula:

$n \approx \left( \frac{\sum W_iS_i}{S(\hat{O})}\right)^2$, where

$S(\hat{O})$ is the standard error of the estimated overall accuracy that we would like to achieve

$W_i$ is the fraction of the map labeled as class $i$

$S_i = \sqrt{U_i(1-U_i)}$ is the standard deviation of stratum $i$.


In this last formula, $U_i$ is the user's accuracy for class $i$, i.e. the precision of that class: TP/(TP+FP).

Recall that the square root of the estimated variance is the standard error of the estimator. For example, for the estimated overall accuracy of the map $\hat{O}$ we have $\hat{S}(\hat{O}) = \sqrt{\hat{V}(\hat{O})}$ (see eq. 5). The standard error is also used to build confidence intervals for the estimated statistic: the 95% confidence interval is estimated as $\hat{O} \pm 1.96 \hat{S}(\hat{O}) = \hat{O} \pm 1.96 \sqrt{\hat{V}(\hat{O})}$.

To determine the sample size we therefore need values for the user's accuracies $U_i$ and for the standard error of the estimated overall accuracy $S(\hat{O})$. The class fractions $W_i$ need no estimation, since they come from counting the pixels of each class in the map.

To estimate $U_i$ (and thus obtain the standard deviations) we can use the precisions obtained from applying the model to the test set. Keep in mind that the $U_i$ from model testing may still be optimistic. For low ndvi and water we use 0.9 and 0.95 respectively.

To choose a value of $S(\hat{O})$ we are comfortable with, consider the following: suppose the estimated overall accuracy ((TP + TN)/(P+N)) is 0.9, meaning 90% of the pixels were correctly classified. A standard error of 0.01 would mean that the 95% confidence interval around this estimate is $0.9 \pm 1.96 \times 0.01 = 0.9 \pm 0.0196$, so OA is between 88% and 92%. If we increase the standard error to 0.015, OA would be (approximately) between 87% and 93%.
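The interval arithmetic above can be checked with a short sketch. The helper name `ci_95` is hypothetical (it is not part of this notebook), and the overall accuracy of 0.9 is only the example value used in the paragraph.

```python
def ci_95(estimate, std_error):
    """Return (lower, upper) bounds of the 95% confidence interval (normal approximation)."""
    half_width = 1.96 * std_error
    return estimate - half_width, estimate + half_width

# Compare the two standard errors discussed in the text for an overall accuracy of 0.9
for se in [0.01, 0.015]:
    low, high = ci_95(0.9, se)
    print(f"std error {se}: overall accuracy between {low:.4f} and {high:.4f}")
```

A narrower interval requires a smaller standard error, which in turn increases the sample size through the formula in this section.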



In [4]:
# Recreation of sampling design by SEPAL

# ---------------------------------------------
# --------------- PARAMETERS ------------------
# target standard error of the estimated overall accuracy
# (0.01 reproduces the sample size of ~931 shown in the output below)
std_error = 0.01

# user's accuracies TP/(TP+FP) (estimates)
# classes are: [other vegetation, iceplant, low ndvi, water]
U = [0.8, 0.8, 0.9, 0.95]
# ---------------------------------------------

# fraction of pixels with a given class in total pixels
total_pix = sum(n_pix)
pix_prop = [n/total_pix for n in n_pix]

# standard deviation of user's accuracies
stdv = [ np.sqrt(u*(1-u)) for u in U]

X = [ x*y for x,y in zip(pix_prop, stdv)]

sample_size = (sum(X)/std_error)**2
sample_size

# distributing sample size among classes
#[...]

931.1082113354254

In [5]:
pix_prop

[0.2860395334170426,
 0.014713838683289604,
 0.3953691894823195,
 0.30387743841734827]
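Because the sample size scales with $1/S(\hat{O})^2$, the choice of target standard error has a large effect on $n$. The sketch below wraps the formula in a function and evaluates it for a few candidate standard errors. The function name `cochran_sample_size` is hypothetical; the pixel counts and user's accuracies are copied from the cells above.

```python
import numpy as np

# Pixel counts and assumed user's accuracies copied from the cells above;
# class order: [other vegetation, iceplant, low ndvi, water]
n_pix = [127063132, 6536112, 175629036, 134987002]
U = [0.8, 0.8, 0.9, 0.95]

def cochran_sample_size(n_pix, U, std_error):
    """Approximate total sample size for a target standard error of overall accuracy."""
    W = np.array(n_pix) / np.sum(n_pix)     # fraction of the map in each class
    u = np.array(U)
    S = np.sqrt(u * (1 - u))                # per-stratum standard deviation
    return float((np.sum(W * S) / std_error) ** 2)

# Candidate target standard errors (illustrative values)
for se in [0.01, 0.0125, 0.015]:
    print(se, round(cochran_sample_size(n_pix, U, se), 1))
```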

### 5.1.2 Determine sample allocation to strata

"Once the overall sample size is chosen, we determine the allocation of the sample to strata."

Roughly speaking, there are two poles for the sample allocation:

(1) proportional: allocate sample sizes proportionally to the area covered by each class

(2) equal: divide the sample size equally among strata.

The tradeoff is that proportional allocation puts few samples in rare classes (such as iceplant) and thus gives imprecise estimates of the user's accuracy of these classes. On the other hand, equal allocation "is not optimized for estimating area and overall accuracy". A middle ground suggested in P. Olofsson et al. is to first allocate a fixed sample size of 50-100 to each rare class, calibrating it to the standard error we want to achieve given the assumed user's accuracy of that class. Once the rare classes have their sample sizes, the remaining sample points are distributed proportionally among the remaining classes. Then the estimated standard errors and accuracies can be computed.
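To calibrate the fixed sample size of a rare class, one option is to invert the per-class standard error $\sqrt{U_i(1-U_i)/(n_i-1)}$ (eq. 6 in P. Olofsson et al.) and solve for $n_i$. The sketch below does this; the function name `rare_class_sample_size` and the target standard errors are hypothetical choices for illustration, and 0.8 is the user's accuracy assumed for iceplant in this notebook.

```python
import math

def rare_class_sample_size(u, target_se):
    """Sample size n for which sqrt(u*(1-u)/(n-1)) is at most target_se (up to rounding)."""
    return math.ceil(u * (1 - u) / target_se**2 + 1)

# Assumed user's accuracy of 0.8, a few illustrative target standard errors
for target_se in [0.03, 0.04, 0.05]:
    n = rare_class_sample_size(0.8, target_se)
    print(f"target std error {target_se}: about {n} sample points")
```

Tighter targets quickly push the rare class beyond the 50-100 range, which is the cost to weigh against the precision of the other classes.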



In [6]:

def strat_stderror(U, strat_sample):
    " estimated standard error of estimated user's accuracies (U) per class (see eq. 6 P. Olofsson et al.) "
    return [ np.sqrt(u*(1-u)/(n-1)) for u,n in zip(U, strat_sample) ]


def sample_allocation(fixed_n, fix_veg, sample_size, n_pix):
    " sample allocation per class giving either (other veg and iceplant) or only iceplant a fixed number of sample points and distributing the rest proportionally"
    if fix_veg:
        d = n_pix[2]+n_pix[3] # total pixels of water + low ndvi
        prop2 = n_pix[2]/d
        prop3 = n_pix[3]/d
        remain = sample_size-(fixed_n*2)
        return [fixed_n, fixed_n, int(remain*prop2), int(remain*prop3)] 
    
    d = n_pix[0] + n_pix[2] + n_pix[3] # total pixels of other vegetation + water + low ndvi
    prop0 = n_pix[0]/d
    prop2 = n_pix[2]/d
    prop3 = n_pix[3]/d
    remain = sample_size-fixed_n
    return [int(remain*prop0), fixed_n, int(remain*prop2), int(remain*prop3)] 


def confidence_intervals(U, strat_sample):
    " half-width of the 95% confidence interval (in percentage points) around estimated user's accuracies (U) with given stratified sample allocation"
    se = strat_stderror(U, strat_sample)
    # 196 = 1.96 (95% normal quantile) * 100 (conversion to percentage points)
    return [np.round(196*x,2) for x in se]
    

In [7]:
# Distributing sample size among classes

conf_intrs =[]
strat_samples = []
strat_title = []

# ---------------------------------------------
strat_title.append('equal')
sample_equal = [sample_size/4 for i in range(0,4)]
strat_samples.append(sample_equal)
conf_intrs.append(confidence_intervals(U, sample_equal))

# ---------------------------------------------
# vegetation and iceplant get equal allocations
for n in [150, 140, 130, 120, 100, 90,75]:
    strat_title.append(str(n))
    sample = sample_allocation(n, True, sample_size, n_pix)
    strat_samples.append(sample)
    conf_intrs.append(confidence_intervals(U, sample))
    
# ---------------------------------------------
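# only iceplant gets fixed allocation
# (labels repeat those of the previous loop; these columns carry a '.1' suffix in the tables below)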
for n in [150, 140, 130, 120, 100, 90,75]:
    strat_title.append(str(n))
    sample = sample_allocation(n, False, sample_size, n_pix)
    strat_samples.append(sample)
    conf_intrs.append(confidence_intervals(U, sample))    
    
# ---------------------------------------------

strat_title.append('prop')
sample_prop = [sample_size*x for x in pix_prop]
strat_samples.append(sample_prop)
conf_intrs.append(confidence_intervals(U, sample_prop))


In [8]:
strat_df = pd.DataFrame(strat_samples).T
strat_df.columns  = strat_title
strat_df

Unnamed: 0,equal,150,140,130,120,100,90,75,150.1,140.1,130.1,120.1,100.1,90.1,75.1,prop
0,232.777053,150.0,140.0,130.0,120.0,100.0,90.0,75.0,226.0,229.0,232.0,235.0,241.0,244.0,248.0,266.333758
1,232.777053,150.0,140.0,130.0,120.0,100.0,90.0,75.0,150.0,140.0,130.0,120.0,100.0,90.0,75.0,13.700176
2,232.777053,356.0,368.0,379.0,390.0,413.0,424.0,441.0,313.0,317.0,321.0,325.0,333.0,337.0,343.0,368.131499
3,232.777053,274.0,282.0,291.0,300.0,317.0,326.0,339.0,240.0,243.0,247.0,250.0,256.0,259.0,264.0,282.942778


In [9]:
conf_intrs_df = pd.DataFrame(conf_intrs).T
conf_intrs_df.columns  = strat_title
conf_intrs_df

Unnamed: 0,equal,150,140,130,120,100,90,75,150.1,140.1,130.1,120.1,100.1,90.1,75.1,prop
0,5.15,6.42,6.65,6.9,7.19,7.88,8.31,9.11,5.23,5.19,5.16,5.13,5.06,5.03,4.99,4.81
1,5.15,6.42,6.65,6.9,7.19,7.88,8.31,9.11,6.42,6.65,6.9,7.19,7.88,8.31,9.11,22.0
2,3.86,3.12,3.07,3.02,2.98,2.9,2.86,2.8,3.33,3.31,3.29,3.27,3.23,3.21,3.18,3.07
3,2.81,2.59,2.55,2.51,2.47,2.4,2.37,2.32,2.76,2.75,2.72,2.71,2.68,2.66,2.63,2.54
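The tables above report precision per class. To see what each allocation implies for overall accuracy, the sketch below computes the anticipated standard error of $\hat{O}$, assuming the stratified estimator $\sqrt{\sum_i W_i^2 U_i(1-U_i)/(n_i-1)}$ is what eq. 5 of P. Olofsson et al. reduces to when the assumed user's accuracies are plugged in. The function name `overall_accuracy_stderror` is hypothetical, and the allocations are copied from three columns of the allocation table (`equal`, `100.1`, `prop`).

```python
import numpy as np

# Pixel counts and assumed user's accuracies copied from earlier cells;
# class order: [other vegetation, iceplant, low ndvi, water]
n_pix = [127063132, 6536112, 175629036, 134987002]
U = [0.8, 0.8, 0.9, 0.95]

def overall_accuracy_stderror(n_pix, U, strat_sample):
    """Anticipated standard error of overall accuracy for a given allocation (assumed estimator)."""
    W = np.array(n_pix) / np.sum(n_pix)
    u = np.array(U)
    n = np.array(strat_sample, dtype=float)
    return float(np.sqrt(np.sum(W**2 * u * (1 - u) / (n - 1))))

# Allocations copied from the table above
allocations = {
    'equal': [232.777053] * 4,
    'iceplant fixed at 100': [241, 100, 333, 256],
    'proportional': [266.333758, 13.700176, 368.131499, 282.942778],
}

for name, alloc in allocations.items():
    print(name, round(overall_accuracy_stderror(n_pix, U, alloc), 4))
```

Comparing these values with the target standard error shows how much precision in overall accuracy is given up in exchange for a tighter interval on the iceplant class.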
