<table style="font-size: 1em; padding: 0; margin: 0;">

<tr style="vertical-align: top; padding: 0; margin: 0;background-color: #ffffff">
        <td style="vertical-align: top; padding: 0; margin: 0; padding-right: 15px;">
    <p style="background: #182AEB; color:#ffffff; text-align:justify; padding: 10px 25px;">
        <strong style="font-size: 1.0em;"><span style="font-size: 1.2em;"><span style="color: #ffffff;">The Coastal Grain Size Portal (C-GRASP) dataset <br/><em>Will Speiser, Daniel Buscombe, Evan Goldstein</em></strong><br/><br/>
        <strong>> Interpolate Percentiles from Other Dataset Percentiles </strong><br/>
    </p>                       
        
<p style="border: 1px solid #ff5733; border-left: 15px solid #ff5733; padding: 10px; text-align:justify;">
    <strong style="color: #ff5733">The purpose of this notebook</strong>  
    <br/><font color=grey> This notebook outputs a dataframe containing all of the data from a chosen C-GRASP dataset, with new fields containing an estimated percent error for the interpolation of distribution percentiles. The error is only calculated for samples whose source dataset includes distribution percentile values, as that is the only way to establish a "known" value. Because C-GRASP file sizes vary, the time needed to complete this task will vary with internet connectivity and computer processing power.</font><br/>
    <br/><font color=grey> This notebook provides simple code that estimates the percent error for various interpolated distribution values in the C-GRASP dataset.</font><br/>    
    <br/><font color=grey> To do so, the user first chooses a C-GRASP dataset.</font><br/>
    <br/><font color=grey> The notebook then loops through each sample with known distribution percentile values, re-interpolates each value from the others, and estimates the percent error of the scipy interpolation function (see the "sample_compute_percentile" notebook).</font><br/>    
    </p>

In [None]:
import pandas as pd
import scipy
from scipy.interpolate import interp1d
import requests
import ipywidgets
import math
import numpy as np
import matplotlib.pyplot as plt

#### Select a dataset

In [None]:
#Dataset collection widget
zen=ipywidgets.Select(
    options=['Entire Dataset', 'Estimated Onshore Data', 'Verified Onshore Data', 'Verified Onshore Post 2012 Data'],
    value='Entire Dataset',
    # rows=10,
    description='Dataset:',
    disabled=False
)

display(zen)

#### Download the dataset

In [None]:
url = 'https://zenodo.org/record/5874231/files/' 
if zen.value=='Entire Dataset':
    filename='dataset_10kmcoast.csv'
if zen.value=='Estimated Onshore Data':
    filename='Data_EstimatedOnshore.csv'
if zen.value=='Verified Onshore Data':
    filename='Data_VerifiedOnshore.csv'
if zen.value=='Verified Onshore Post 2012 Data':
    filename='Data_Post2012_VerifiedOnshore.csv'
print("Downloading {}".format(url+filename))   

The next cell downloads the C-GRASP dataset and reads it into a pandas dataframe named `df`.

In [None]:
url=(url+filename)
print('Retrieving Data, Please Wait')
#retrieve data
df=pd.read_csv(url)
print('Sediment Data Retrieved!') 

Let's take a quick look at the file

In [None]:
df

Let's take a look at which distributions are provided in the source data:

In [None]:
# Collect every distribution name (e.g. d10, d50) provided in the source data for at least one sample
given_values = df['Measured_Distributions'].dropna().str.split(',').explode().str.strip()  # one row per distribution name, samples without distributions dropped
given_values = given_values[given_values != ''].unique()  # remove empty entries and duplicates (i.e. when multiple source datasets provide the same distribution)
given_values = np.array(given_values, dtype=str)  # turn it into an array for use later
print('Given distribution values from source data in dataset: ', given_values)
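The extraction step can be checked on a small example. This is a minimal sketch that assumes `Measured_Distributions` holds comma-separated names such as `d10,d50,d90`; the `measured` series below is made up for illustration and is not taken from C-GRASP.

```python
import pandas as pd

# Hypothetical stand-in for df['Measured_Distributions'] (values are invented)
measured = pd.Series(['d10,d50,d90', 'd16,d50,d84', None, 'd50'])

unique_names = sorted(
    measured.dropna()      # ignore samples with no reported distributions
    .str.split(',')        # 'd10,d50,d90' -> ['d10', 'd50', 'd90']
    .explode()             # one row per distribution name
    .str.strip()           # remove stray whitespace around names
    .unique()              # keep each name once
)
print(unique_names)
```

Each name appears once in the result, no matter how many samples report it.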

Create new, blank columns for the re-interpolated value and the percent error of each of those distributions

In [None]:
for name in given_values:
    df[name + '_calc'] = ''   # blank column for the re-interpolated value
    df[name + '_error'] = ''  # blank column for the percent error

## This next cell is where the calculations will occur:
* The outermost for loop iterates over each sample and reads the number of distributions provided in its source data.

* The first inner loop collects the name and percentile value of each distribution provided in the source data.

* The second inner loop "hides" one distribution at a time and re-interpolates it from the remaining distributions in the source data. The hidden distribution is restored in the next iteration, and another distribution is hidden and re-interpolated. These re-interpolated values go in the "_calc" columns.

* Finally, the percent error of each re-interpolated value is calculated against the distribution value from the source data.

## <font color=grey> *The output adds two new columns for each distribution provided in a sample's source data: the re-interpolated value (in '_calc') and the calculated percent error (in '_error')*</font>
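The hide-and-re-interpolate procedure described above can be sketched on a single toy sample. The grain sizes below are invented for illustration, and `leave_one_out_errors` is a hypothetical helper, not part of the notebook. Percent error is taken as `abs((calc - known) / known) * 100`.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical sample: percentile fractions (d10, d50, d90) and made-up grain sizes
percentiles = np.array([0.10, 0.50, 0.90])
grain_sizes = np.array([0.20, 0.30, 0.50])

def leave_one_out_errors(percentiles, grain_sizes):
    """Re-interpolate each value from the others; return (estimates, percent errors)."""
    estimates, errors = [], []
    for n in range(len(percentiles)):
        known_p = np.delete(percentiles, n)  # hide the n-th percentile
        known_g = np.delete(grain_sizes, n)  # and its grain size
        f = interp1d(known_p, known_g, bounds_error=False, fill_value='extrapolate')
        est = float(f(percentiles[n]))       # re-interpolated value
        estimates.append(est)
        errors.append(abs((est - grain_sizes[n]) / grain_sizes[n]) * 100)
    return estimates, errors

estimates, errors = leave_one_out_errors(percentiles, grain_sizes)
for p, est, err in zip(percentiles, estimates, errors):
    print(f"d{int(round(p * 100))}: re-interpolated {est:.3f}, percent error {err:.1f}%")
```

With only three known values, hiding an end member (d10 or d90) forces `interp1d` to extrapolate rather than interpolate. Errors for the outermost distributions may therefore be larger than for interior ones.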

In [None]:
for z in range(len(df)):  # loop over each sample
    if df['num_orig_dists'].iloc[z] < 3:  # skip samples with fewer than 3 given distributions
        continue
    try:
        num_orig_dists = int(df['num_orig_dists'].iloc[z])  # number of known distributions for this sample
        given_dist_names = []
        given_dist_vals = []
        for i in range(num_orig_dists):  # find the distributions provided in the source data for this sample
            a = df['Measured_Distributions'].iloc[z].split(',')[i].strip()  # distribution name for this iteration, e.g. 'd90'
            b = a.split('d')[1]  # pull the number from the name
            given_dist_names.append(a)  # collect the given distribution names
            given_dist_vals.append(int(b) / 100)  # collect the percentile as a decimal (e.g. 90 to 0.9)

        for n in range(num_orig_dists):  # repeat for each distribution
            # "hide" the distribution name and value to be recalculated
            new_dist_names = np.delete(given_dist_names, n)
            new_dist_vals = np.delete(given_dist_vals, n)
            # keep the name of the hidden distribution and its output columns
            focus_column = given_dist_names[n]
            calc_column = focus_column + '_calc'
            error_column = focus_column + '_error'

            # grain sizes of the remaining (known) distributions for this sample
            grain_size_bins = [df[name].iloc[z] for name in new_dist_names]
            grain_size_frequencies = new_dist_vals
            # interpolate the hidden value from the remaining "original" distributions
            distribution = scipy.interpolate.interp1d(grain_size_frequencies, grain_size_bins, bounds_error=False, fill_value='extrapolate')
            # store the re-interpolated value and its percent error
            df.loc[z, calc_column] = float(distribution(given_dist_vals[n]))
            df.loc[z, error_column] = abs((df[calc_column].iloc[z] - df[focus_column].iloc[z]) / df[focus_column].iloc[z] * 100)
    except Exception:
        # skip samples whose distributions cannot be parsed or interpolated
        continue
print('Error Calculation Successful!')

Let's see if that worked

In [None]:
start=len(df.columns)-(len(given_values)+9)
df.iloc[:, start:len(df.columns)]
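One possible way to get an overview of the results is to summarize the `_error` columns. This is a minimal sketch on a made-up frame named `toy`. It assumes, as in this notebook, that error columns end in `_error` and that samples without a calculated error hold a blank string.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df after the error calculation
toy = pd.DataFrame({
    'd10_error': [50.0, '', 10.0],
    'd50_error': [16.0, '', 4.0],
})

error_columns = [c for c in toy.columns if c.endswith('_error')]
# Blank strings become NaN so they are ignored by the statistics
errors = toy[error_columns].apply(pd.to_numeric, errors='coerce')
summary = errors.agg(['count', 'median', 'max']).T  # one row per error column
print(summary)
```

The same three lines after the toy frame should work on `df` itself, provided its error columns follow the naming used here.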


### Write to file

Finally, define a CSV file name for the output dataframe

In [None]:
output_csvfile='../data_interp_error.csv'

Write the data to that CSV file

In [None]:
df.to_csv(output_csvfile) #convert data to CSV