# 2 - Sampling Intuition & Cochran's formula

Sampling is a way of gathering information about a large group by looking at a smaller part of it. This smaller part, called a sample, is chosen to represent the whole group as closely as possible. It's like tasting a spoonful of soup to get an idea of the flavor of the whole pot. By studying the sample, we can make educated guesses about the larger group, saving time and resources while still getting useful information.

Sampling has a long history, dating back to the early 20th century when statisticians like William G. Cochran pioneered its use to make data collection more efficient and cost-effective. In his influential book "Sampling Techniques," Cochran laid the foundation for modern sampling methods, demonstrating how to select representative samples and analyze the results to draw meaningful conclusions about larger populations.

Cochran's formula for calculating the appropriate sample size is a cornerstone of modern sampling techniques. While the theory is solid, it's often helpful to build intuition around the formula through practical examples.

#### Import functionality
When running Python code in a Jupyter notebook, the first step is to import the libraries that provide the functionality the code needs. Here we use standard libraries from the [scientific python stack](https://fabienmaussion.info/scipro_ws2019/html/15-Scientific-Python.html).

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from matplotlib import pyplot as plt

## 1 - Sample size calculation based on Cochran's formula 

The relationship between sample size, sampling error, and probability/proportion is at the heart of Cochran's formula.

- **Sample size**: The number of observations or individuals in your sample.
- **Sampling error**: The difference between your sample's results and the true population value. Larger samples tend to have smaller sampling errors because they better represent the whole population.
- **Probability/Proportion**: This is the estimated percentage of the population that has a certain characteristic or opinion. It plays a fundamental role in determining the needed sample size.

Imagine you're trying to guess the percentage of people in your city who prefer coffee over tea. The more people you ask (larger sample size), the more confident you can be that your guess is close to the true percentage (smaller sampling error). But even with a small sample, if the true proportion is very extreme (like 99% prefer coffee), it's easier to get an estimate that is accurate in absolute terms than if the proportion is closer to 50/50, because the variance $p(1-p)$ is largest at 50%.

Cochran's formula ties these factors together: it gives the minimum sample size needed to reach a chosen margin of error at a chosen confidence level.
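The coffee-versus-tea intuition can be checked with a small simulation. The sketch below is illustrative only: the true shares (`0.6`, `0.5`, `0.99`), the helper name `mean_abs_error` and the number of repeats are assumptions chosen for the example, not values from a real survey.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable


def mean_abs_error(n, p, repeats=2000):
    """Average absolute sampling error of the estimated proportion.

    Simulates `repeats` surveys of `n` respondents each, where every
    respondent prefers coffee with probability `p`.
    """
    estimates = rng.binomial(n, p, size=repeats) / n
    return np.abs(estimates - p).mean()


# larger samples give smaller sampling errors (hypothetical true share of 60%)
for n in (25, 100, 400, 1600):
    print(f'n={n:5d}  mean absolute error={mean_abs_error(n, 0.6):.4f}')

# at a fixed sample size, an extreme proportion is easier to estimate in absolute terms
print(f'p=0.50: {mean_abs_error(100, 0.5):.4f}')
print(f'p=0.99: {mean_abs_error(100, 0.99):.4f}')
```

The error shrinks roughly with the square root of the sample size, which is why halving the error needs about four times as many respondents.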

### 1.1 - Python implementation

The formula is as follows:

$n_0 = \frac{Z^2 p (1-p)}{e^2}$,

where
- $n_0$ is the required sample size
- $Z$ is the Z-score (e.g., 1.96 for a 95% confidence level)
- $p$ is the estimated proportion of the population for the category that will be estimated 
- $e$ is the desired margin of error (expressed as a decimal)
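
As a quick arithmetic check of the formula as written, with $e$ read as an absolute margin of error, the sketch below plugs in illustrative values (95% confidence, $p = 0.5$, $e = 0.05$). The helper name `n0_absolute` is an assumption made for this example, and the z-score comes from `statistics.NormalDist` in the standard library.

```python
import math
from statistics import NormalDist


def n0_absolute(z, p, e):
    """Cochran's n0 with e given as an absolute margin of error."""
    return z**2 * p * (1 - p) / e**2


# two-sided z-score for a 95% confidence level (about 1.96)
z = NormalDist().inv_cdf(0.975)

n0 = n0_absolute(z, p=0.5, e=0.05)
print(math.ceil(n0))  # 385
```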

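The code below implements the formula in Python. Note that it treats `precision` as a relative margin of error, so the absolute margin of error used in the formula is $e = \text{precision} \cdot p$.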

In [None]:
def cochran_sample_size(precision, confidence_level, population_proportion):
    """Calculates the sample size required to estimate a population proportion using the Cochran formula.

    Parameters:
    precision (float): the desired relative margin of error, as a fraction of the proportion
    confidence_level (float): the desired level of confidence as a fraction
    population_proportion (float): the proportion of the population being estimated as a fraction

    Returns:
    The required sample size (int)
    """
    
    # two-sided z-score from the standard normal distribution for the given confidence level
    z_score = abs(stats.norm.ppf((1 - confidence_level) / 2))
    
    # the proportion of the phenomenon and its complement
    p_hat = population_proportion
    q_hat = 1 - p_hat
    
    # absolute margin of error: the relative precision scaled by the proportion
    e = precision * population_proportion 
    
    n = ((z_score**2) * p_hat * q_hat) / (e**2)
    return int(np.ceil(n))

### 1.2 - A simple example to play with. 

In [None]:
# ----------------------------------------------------------------------------------
# INPUTS
precision = 0.1             # 10% relative margin of error
confidence_level = 0.95     # 95% CI
population_proportion = 0.5 # 50/50 case
# ----------------------------------------------------------------------------------


print(
    f'Sample size for '
    f'an expected {population_proportion*100}% proportion at a '
    f'{confidence_level*100}% confidence level with a '
    f'{precision*100}% relative margin of error is: ' 
)
      
print(
    int(cochran_sample_size(precision, confidence_level, population_proportion))
)

### 1.3 - Rare events

When it comes to rare events, sampling becomes even more challenging. Imagine trying to estimate the prevalence of a rare disease or the likelihood of a natural disaster. These events are, by definition, infrequent, making it difficult to gather a representative sample. Cochran's formula can still be applied in such cases, but it's important to be aware of the limitations. The smaller the expected proportion, the larger the sample size required to reach a given relative margin of error. In some situations, it may not be feasible to collect a large enough sample to accurately estimate the prevalence of a rare event.

This is where other sampling techniques, like stratified sampling or adaptive sampling, can be helpful. These methods can help to increase the representation of rare events in the sample, but they also introduce additional complexity and require careful consideration.

The following code creates a figure that plots the sample size as a function of proportion. 

In [None]:
# ----------------------------------------------------------------------------------
# INPUTS
proportions = [0.1, 1, 2, 5, 10]  # list of proportions in percent
moe = 0.1                         # relative margin of error (10%)
ci = 0.9                          # confidence level (90%)
# ----------------------------------------------------------------------------------


d ={}
for proportion in proportions:
    fraction = proportion/100 
    d[proportion] = [proportion, fraction*(1-fraction), cochran_sample_size(moe, ci, fraction)]

# plotting
df = pd.DataFrame.from_dict(d, orient='index', columns=['Proportion (in %)', 'Variance', 'Sample Size'])
ax = sns.scatterplot(data=df, x='Sample Size', y='Proportion (in %)', hue='Proportion (in %)', legend='full', palette='turbo', size='Proportion (in %)')
_ = ax.set_title(f'Sample size as a function of proportion for a {ci:.0%} CI and a {moe:.0%} MoE')
sns.despine(ax=ax, trim=True)

## 2 - Expected relative margin of error for certain sample sizes and proportions

The following function is an inverted version of Cochran's formula. Given the proportion of the category, the sample size and the confidence level, it calculates the relative margin of error. This is usually the uncertainty reported in sample-based assessments, as it tells us the range in which the true value is likely to lie. Since our estimate almost never equals the true value, the uncertainty around the estimate is as important as the estimate itself.
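
To make the inversion explicit, here is a sketch of the algebra. It uses the same symbols as the sample size formula, writes $n$ for the sample size, and introduces $e_{rel}$ (a symbol assumed here for illustration) for the relative margin of error. Solving $n = Z^2 p (1-p) / e^2$ for $e$ and dividing by $p$ gives:

$$e = Z \sqrt{\frac{p(1-p)}{n}}, \qquad e_{rel} = \frac{e}{p} = Z \sqrt{\frac{1-p}{n\,p}}$$

The second expression matches what the function below computes.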

In [None]:
def margin_of_error(sample_size, confidence_level, population_proportion):
    """Calculates the relative margin of error for estimating a population proportion based on the Cochran formula.

    Parameters:
    sample_size (int): the number of observations in the sample
    confidence_level (float): the desired level of confidence as a fraction
    population_proportion (float): the estimated proportion of the population as a fraction

    Returns:
    The relative margin of error as a fraction of the proportion (float)
    """

    # Z-score from the standard normal distribution for the given confidence level
    z_score = abs(stats.norm.ppf((1 - confidence_level) / 2))

    # Proportion of the phenomenon and its complement
    p_hat = population_proportion
    q_hat = 1 - p_hat

    # Invert Cochran's formula; dividing by p_hat**2 inside the root makes the result relative
    margin_error = z_score * np.sqrt((p_hat * q_hat) / sample_size / p_hat**2) 

    return margin_error 
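
A simple way to check that the two formulas are consistent is a round trip: compute a sample size for a target relative margin of error, then feed that sample size back in. The sketch below is self-contained and makes a few assumptions: it re-implements both formulas under the hypothetical names `sample_size_relative` and `relative_moe`, uses `statistics.NormalDist` from the standard library instead of `scipy.stats`, and picks illustrative inputs (20% relative margin of error, 90% confidence, 1% proportion).

```python
import math
from statistics import NormalDist


def z_for(confidence_level):
    """Two-sided z-score, e.g. 0.95 gives about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)


def sample_size_relative(precision, confidence_level, p):
    """Unrounded Cochran sample size for a relative margin of error."""
    e = precision * p  # convert the relative margin of error to an absolute one
    return z_for(confidence_level) ** 2 * p * (1 - p) / e**2


def relative_moe(n, confidence_level, p):
    """Relative margin of error for a given sample size (the inverted formula)."""
    return z_for(confidence_level) * math.sqrt((1 - p) / (n * p))


target = 0.20
n = sample_size_relative(target, 0.90, 0.01)

# the unrounded sample size reproduces the target exactly (up to floating point)
print(round(relative_moe(n, 0.90, 0.01), 6))

# rounding the sample size up can only make the margin of error smaller
print(relative_moe(math.ceil(n), 0.90, 0.01) <= target)
```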

### 2.1 - Example calculation

In [None]:
# ----------------------------------------------------------------------------------
# INPUTS
sample_size = 5000           # 5000 samples
confidence_level = 0.9       # 90% CI
population_proportion = 0.1  # fraction, i.e. 0.1 = 10%
# ----------------------------------------------------------------------------------

error = margin_of_error(sample_size, confidence_level, population_proportion) * 100
error

### 2.2 - Plotting various combinations

Here we calculate the relative margin of error for various combinations of sample size and proportions and plot it. 

In [None]:
# ----------------------------------------------------------------------------------
# INPUTS
confidence_interval = 0.95                # 95 % confidence interval
proportions = [0.1, 0.2, 0.5, 0.75, 1]   # list of proportions (in %) we want to show 
sample_sizes = range(1000, 22000, 2000)  # range of sample size (start, end, step)
# ----------------------------------------------------------------------------------


d, i = {}, 0
for proportion in proportions:       # proportions (in %) to loop over
    for sample_size in sample_sizes: # number of samples to loop over
        d[i] = [margin_of_error(sample_size, confidence_interval, proportion/100) * 100, sample_size, proportion]
        i += 1

# create dataframe
final = pd.DataFrame.from_dict(d, orient='index')
final.columns = ['MoE (in %)', 'Sample Size', 'Proportion (in %)']

# do the plotting
palette = sns.color_palette("magma", n_colors=len(proportions))
ax = sns.barplot(data=final, y='MoE (in %)', x='Sample Size', hue='Proportion (in %)', palette=palette)
ax.yaxis.grid(True)
# despine modifies the axes in place and returns None, so the result is not assigned
sns.despine(ax=ax, trim=True, offset=10)
_ = plt.xticks(rotation=45)

### 2.3 - Relative versus absolute margin of error

The examples above only consider the relative margin of error. This is useful when "hunting" for a certain level of precision. However, it is also important to consider the absolute margin of error, which is what the plot in this section shows. In practice, a small relative error on a large proportion can still be larger in absolute terms than a huge relative error on a small proportion.
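
A minimal numeric sketch of that last point, with made-up numbers: the total area, the two class proportions and the helper name `absolute_moe` are all assumptions chosen for illustration.

```python
def absolute_moe(relative_moe, proportion, total_area):
    """Convert a relative margin of error into the units of the area."""
    return relative_moe * proportion * total_area


total_area = 1_000_000  # hypothetical total area, e.g. in hectares

# 50% relative error on a class that covers 0.1% of the area
small_class = absolute_moe(0.50, 0.001, total_area)

# 5% relative error on a class that covers 40% of the area
large_class = absolute_moe(0.05, 0.40, total_area)

print(f'small class: +/- {small_class:,.0f}')
print(f'large class: +/- {large_class:,.0f}')
```

Although the relative error of the large class is ten times smaller, its absolute error is far larger because it applies to a much bigger area.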

In [None]:
# ----------------------------------------------------------------------------------
# INPUTS
confidence_interval = 0.95                # 95 % confidence interval
proportions = [0.1, 0.2, 0.3, 0.4, 0.5]   # list of proportions (in %) we want to show 
sample_sizes = range(1000, 22000, 4000)   # sample size (start, end, step)
# ----------------------------------------------------------------------------------


d, i = {}, 0
for proportion in proportions:
    for sample_size in sample_sizes:
        d[i] = [margin_of_error(sample_size, confidence_interval, proportion/100) * 100, sample_size, proportion]
        i += 1

# create dataframe
final = pd.DataFrame.from_dict(d, orient='index')
final.columns = ['MoE (in %)', 'Sample Size', 'Proportion (in %)']

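# illustrative area per class: scales with the proportion (units are arbitrary)
final['Area'] = 100000 * final['Proportion (in %)']
# convert the relative margin of error (in %) into the same area units
final['MoE absolute'] = final['MoE (in %)']/100 * final['Area']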
#display(final)

from matplotlib import pyplot as plt
fig, ax = plt.subplots(2, 3, figsize=(15, 10), sharex=True, sharey=True)
ax = ax.flatten()

for p in range(len(sample_sizes)):

    sub_final = final[final['Sample Size'] == sample_sizes[p]]
    ax[p] = sns.barplot(
        data=sub_final, x='Proportion (in %)', y='Area', hue='Proportion (in %)', palette=palette, legend=False, ax=ax[p]
    )
    
    for idx, container in enumerate(ax[p].containers):
        ax[p].bar_label(container, labels=[f'({round(sub_final["MoE (in %)"], 2).values[idx]} %)'], padding=3)
        ax[p].bar_label(container, labels=[f'{round(sub_final["MoE absolute"], 2).values[idx]}'], padding=15)
    
    # Get the horizontal centre of each bar, one per row of the current subset
    bar_positions = [patch.get_x() + patch.get_width() / 2 for patch in ax[p].patches][:len(sub_final)]
    ax[p].errorbar(bar_positions, sub_final['Area'], yerr=sub_final['MoE absolute'], fmt='none', ecolor='darkgrey', capsize=5)
    ax[p].set_title(f'Sample Size of {sample_sizes[p]}')

plt.suptitle('Confidence Intervals as a function of Sample size')
plt.tight_layout()

## 3 - Practical examples of deforestation based on country statistics

Here we apply the relationship between sample size, proportion and margin of error described above to tree cover loss statistics based on Hansen's Global Forest Change product. Note that this dataset relates to tree cover and not necessarily to forest. 

The proportion is calculated based on the area of tree cover loss divided by the total area of the country. This code might take a while to run.

### 3.1 - Helper functions

The first function extracts the tree cover loss area and the total country area from Hansen's dataset.
The second function plots the extracted values as a function of the proportion of tree cover loss to total land area.

In [None]:
import ee
ee.Initialize()

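def get_area_statistics(aoi, start, end):
    """Sums the tree cover loss area and the total area (both in m2) within the aoi.

    start and end are loss years counted from 2000 (e.g. 20 for 2020), both inclusive.
    """

    # load hansen image
    hansen = ee.Image('UMD/hansen/global_forest_change_2023_v1_11')
    
    # constant image of ones; multiplied by the pixel area below, it sums to the full aoi area
    aoi_area = (
        ee.Image(1).reproject(hansen.projection().atScale(30)).rename('aoi_area')
    )

    # binary mask of pixels with tree cover loss between start and end (inclusive)
    loss = hansen.select("lossyear").gte(start).And(hansen.select("lossyear").lte(end)).rename('loss')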
    layer = loss.addBands(aoi_area)
    return layer.multiply(ee.Image.pixelArea()).reduceRegion(**{
        "reducer": ee.Reducer.sum(),
        "geometry": aoi,
        "scale": 100,
        "maxPixels": 1e14,
    }).getInfo()

def plot_country_stats(d):
    
    df = pd.DataFrame.from_dict(d, orient='index', columns=['Country', 'Proportion of Change (in %)', 'Sample Size', 'Area of Change (km2)', 'Total area'])
    ax = sns.scatterplot(data=df, x='Sample Size', y='Proportion of Change (in %)', hue='Country', legend=True, size='Area of Change (km2)')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper right', ncol=2)
    _ = ax.set_title('Sample size as a function of proportion for a 90% CI and a 20% MoE')
    sns.despine(ax=ax, trim=True)

In [None]:
countries = ['Kenya', 'Liberia', 'Uganda', 'Ethiopia', 'Paraguay', 'Bolivia', 'Burundi']
gaul = ee.FeatureCollection("FAO/GAUL/2015/level1")

# both years are inclusive; use the same year twice for a one-year period
startyear = 2020
endyear = 2020

# loop over countries, and extract the areas
d ={}
for country in countries:

    aoi = gaul.filter(ee.Filter.eq("ADM0_NAME", country)).union()
    areas = get_area_statistics(aoi, startyear-2000, endyear-2000)
    proportion = areas['loss']/areas['aoi_area']
    d[country] = [country, proportion*100, cochran_sample_size(0.20, 0.90, proportion), areas['loss']/1000000, areas['aoi_area']/1000000]

# plot the results
plot_country_stats(d)

## 4 - Concluding remarks

This notebook is intended to build intuition around sampling theory, starting from Cochran's formula, which forms the basis of the sample-based estimation process. 