In [None]:
"""
Author: Reggie Karssiens (2667014)
MSc Earth and Climate Sciences, Vrije Universiteit Amsterdam
Supervisor: Liam Heffernan
Date: 29-01-2025

Aims and expected outcomes
This script automatically reduces the set of geological cores to 10%, 25%, 50%, 75%, and 90% of its original size. The outcome is five new files, each containing
a random subsample. The geostatistics of these reduced sample sizes can then be compared to decide which sample size is most suitable for this methodology.
The source data comprise 190 geological cores from the VU and DINOloket. The five sensitivity files are imported into the existing script that calculates the C_mass prediction, to
see how the interpolation results differ, in terms of geostatistics, with the number of cores used.

Data:
Available input data: (file format, raster/vector, variable, unit, dimensions, resolution, projection, other relevant info needed to understand the data and import the right modules)
- 190 geological cores containing peat depth, geometry, core number and source (VU or DINOloket). (shapefile)

Outcomes:
Five shapefiles, each containing a different percentage of the geological cores. These five files are then imported into ArcGIS Pro to run
the EBK tool with cross-validation and compare how the statistics are affected by sample size.
"""

In [None]:
"""
## Methodology: Sample Size Sensitivity Analysis

### Objective
Determine the most suitable sample size for geostatistical interpolation of soil organic carbon mass (C_mass) in peatlands using Empirical Bayesian Kriging (EBK).

### Data and Procedure

**Step 1: Original Dataset**
The original dataset comprises 190 geological cores from two sources:
- 20 cores from Vrije Universiteit (VU)
- 170 cores from DINOloket national database

**Step 2: Sample Size Reduction**
Five random subsamples were drawn from the original 190 cores at the following percentages (sizes are truncated to whole cores):
- 10% (n = 19 cores)
- 25% (n = 47 cores)
- 50% (n = 95 cores)
- 75% (n = 142 cores)
- 90% (n = 171 cores)

Each subsample was generated with its own fixed random seed (`random_state=42+percentage`), so every run draws exactly the same cores. This approach guarantees that:
- Results are reproducible and verifiable
- Differences between repeated runs cannot come from a changed random draw, so comparisons across sample sizes stay stable
- The methodology can be replicated by other researchers
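
As a minimal sketch of the sizing and seeding logic described above: the snippet below recomputes the subsample sizes with the same `int()` truncation and shows that a seed of `42 + percentage` gives a repeatable draw. It uses the standard-library `random.Random` as a stand-in for `GeoDataFrame.sample`, so the selected core indices will not match those of the real shapefiles; the helper names `subsample_size` and `draw_core_ids` are hypothetical.

```python
import random

TOTAL_CORES = 190
PERCENTAGES = [10, 25, 50, 75, 90]


def subsample_size(total, pct):
    # int() truncates toward zero, so 190 * 25 / 100 = 47.5 becomes 47
    return int(total * pct / 100)


def draw_core_ids(total, pct, base_seed=42):
    # Stand-in for GeoDataFrame.sample(n=..., random_state=base_seed + pct):
    # a generator seeded per percentage always returns the same indices.
    rng = random.Random(base_seed + pct)
    return sorted(rng.sample(range(total), subsample_size(total, pct)))


sizes = {pct: subsample_size(TOTAL_CORES, pct) for pct in PERCENTAGES}
print(sizes)  # {10: 19, 25: 47, 50: 95, 75: 142, 90: 171}

# Drawing twice with the same percentage gives the same cores
print(draw_core_ids(TOTAL_CORES, 10) == draw_core_ids(TOTAL_CORES, 10))  # True
```

Because each percentage has its own seed, the five subsamples are drawn independently of one another; a smaller subsample is not necessarily contained in a larger one.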

**Step 3: Processing**
Each subsample was:
1. Saved as an individual shapefile
2. Imported into ArcGIS Pro
3. Used as input for Empirical Bayesian Kriging (EBK) interpolation with cross-validation

**Step 4: Comparison and Analysis**
The interpolation quality for each subsample was evaluated based on:
- Cross-validation statistics (RMSE, Mean Error)
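
For reference, the two statistics can be sketched as follows. This is an illustration only: it assumes the error is defined as predicted minus measured, and the `measured` and `predicted` values are hypothetical numbers, not results from this study. In practice the values come from the EBK cross-validation output in ArcGIS Pro.

```python
import math


def mean_error(measured, predicted):
    # Average signed error; values near zero indicate little systematic bias
    errors = [p - m for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors)


def rmse(measured, predicted):
    # Root mean square error; penalises large deviations more strongly
    squared = [(p - m) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical cross-validation pairs, for illustration only
measured = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 11.0, 9.0, 13.0]

print(f"Mean Error: {mean_error(measured, predicted):.3f}")
print(f"RMSE: {rmse(measured, predicted):.3f}")
```

A Mean Error close to zero together with a low RMSE suggests predictions that are both unbiased and accurate; comparing the pair across the five subsamples shows how both change with sample size.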

### Rationale
Comparing interpolation results across different sample sizes identifies the best balance between data density and prediction accuracy, so that the geostatistical method produces reliable C_mass distribution maps without requiring excessive field data.
"""

In [2]:
import arcpy
import geopandas as gpd

# Set overwriteOutput to True so that existing files are replaced when the code is run again.
arcpy.env.overwriteOutput = True

In [4]:
# Path to the original shapefile (raw string, so single backslashes are enough)
Fieldcores190cores = r"C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_result\gdf_SOC_PD_C\gdf_SOC_PD_C.shp"

# Load it as a GeoDataFrame so the sample size can be changed while keeping the geometry
Fieldcores190_copy = gpd.read_file(Fieldcores190cores)

In [6]:
# Total number of cores
total_cores = len(Fieldcores190_copy)
print('In total there are {} cores in the original gdf'.format(total_cores))

In total there are 190 cores in the original gdf


In [7]:
percentage = [10, 25, 50, 75, 90]

# Output folder; the loop below writes one shapefile per percentage (five in total)
output_dir = r"C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_intermediate\sensitivity_analysis\shapefiles_reduced_sample_sizes"

for pct in percentage:
    # Number of cores at this percentage; int() truncates, so 25% of 190 gives 47
    n_cores = int(total_cores * pct / 100)

    # Draw the random sample; the seed depends on pct, so each size is reproducible
    gdf_sample = Fieldcores190_copy.sample(n=n_cores, random_state=42 + pct)

    # Build the output path for this percentage
    output_path = f"{output_dir}\\Fieldcores_randsamp{pct}pct.shp"

    gdf_sample.to_file(output_path)
    print(f"Saved: {output_path}")
    print(f"Verified: {len(gdf_sample)}cores saved")


print("\n" + "="*60)
print("All sample files created successfully!")
print("="*60)



Saved: C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_intermediate\sensitivity_analysis\shapefiles_reduced_sample_sizes\Fieldcores_randsamp10pct.shp
Verified: 19cores saved
Saved: C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_intermediate\sensitivity_analysis\shapefiles_reduced_sample_sizes\Fieldcores_randsamp25pct.shp
Verified: 47cores saved
Saved: C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_intermediate\sensitivity_analysis\shapefiles_reduced_sample_sizes\Fieldcores_randsamp50pct.shp
Verified: 95cores saved
Saved: C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_intermediate\sensitivity_analysis\shapefiles_reduced_sample_sizes\Fieldcores_randsamp75pct.shp
Verified: 142cores saved
Saved: C:\Coding projects Msc Earth & Climate\Research_project\data\spaarnwoude\Python_intermediate\sensitivity_analysis\shapefiles_reduced_sample_sizes\Fieldcores_randsamp90p