# **Compute Interpolation Error for a Given Sample Dataset**


## This notebook gives a sense of the interpolation error of SciPy's interpolation function for a dataset of interest

* To do so, you (the user) enter a dataset of interest from the dataset composite (e.g. SandSnap).

    
    
    
* This notebook then compares each percentile value originally provided in the dataset of focus against the value that SciPy's interpolation function would estimate for it, and calculates the percent error for each.

    
    
* For an accurate sense of error, it is suggested to use this notebook on datasets that originally had more than 3 distributions. This number can be found in the **num_orig_dists** column reported below.
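
A minimal sketch of the comparison described above, on a single sample. The percentile values below are hypothetical and chosen only for illustration: the `d50` value is hidden, re-estimated from the remaining two points with `interp1d`, and compared with the measured value.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical sample: cumulative fractions and the grain sizes measured at them
fractions = np.array([0.10, 0.50, 0.90])
sizes = np.array([0.20, 0.40, 0.80])  # illustrative values only

hidden = 1  # index of the percentile to hide (here d50)
kept_fractions = np.delete(fractions, hidden)
kept_sizes = np.delete(sizes, hidden)

# Linear interpolation through the remaining points
estimate_fn = interp1d(kept_fractions, kept_sizes, bounds_error=False, fill_value='extrapolate')
estimated = float(estimate_fn(fractions[hidden]))

# Signed percent error relative to the measured value
percent_error = (estimated - sizes[hidden]) / sizes[hidden] * 100
print(estimated, percent_error)
```

With only two points left, the estimate is a straight line between them, which is one reason datasets with more measured distributions give a more meaningful sense of error.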
    
    

## <font color=grey> *This notebook's output is a dataframe of the user-specified input with new columns showing the percent interpolation error for each distribution for each sample.*</font>



## Run these two cells to get everything set up:

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import math
import scipy

from scipy.interpolate import interp1d
import requests



In [None]:


# from https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    """
    response = streamed requests response to read the file content from
    destination = filename for output
    """
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)



##  **And then Import the Overall Sample Dataset:**
### Printed below the code are the datasets to choose from, their respective number of distributions, and which distributions are provided
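
As an illustration of how such a summary table can be built, here is a minimal sketch on a toy stand-in for the composite dataset. The rows are made up; only the column names (`dataset`, `num_orig_dists`, `Measured_Distributions`) come from this notebook. It uses `groupby` rather than the explicit loop in the cell below, so treat it as an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for the composite dataset (values are illustrative only)
toy = pd.DataFrame({
    'dataset': ['A', 'A', 'B'],
    'num_orig_dists': [3, 3, 5],
    'Measured_Distributions': ['d10,d50,d90', 'd10,d50,d90', 'd5,d16,d50,d84,d95'],
})

# One row per dataset, taking the first reported value of each column
summary = (
    toy.groupby('dataset', sort=False)[['num_orig_dists', 'Measured_Distributions']]
    .first()
    .reset_index()
)
print(summary)
```

Note that `first()` skips missing values within each group, which differs slightly from taking `unique()[0]` when a dataset has empty entries.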

In [None]:
DATASET_ID = '1G9fuC_TjtwTr3JWA85gW7228_Ffxsw0G'


destination = '../data.csv'
download_file_from_google_drive(DATASET_ID, destination)
df= pd.read_csv(destination)


ds=df['dataset'].unique()
dd=[]
dn=[]

#This part aggregates the unique dataset names, the number of sample percentile distributions provided in each dataset, and which ones
for d in ds:
    v=df.loc[df['dataset'] == d, 'num_orig_dists'].unique()[0]
    vn=df.loc[df['dataset'] == d, 'Measured_Distributions'].unique()[0]
    dd.append(v)
    dn.append(vn)

avail= pd.DataFrame({'dataset': ds, 'num_orig_dists': dd, 'Measured_Distributions':dn}, columns=['dataset', 'num_orig_dists', 'Measured_Distributions'])

print(avail)

## Use this cell to enter your dataset of interest (e.g. sandsnap):
#### <font color=red> This is the only cell where you need to enter anything. </font>

In [None]:
#Here is where you type in the dataset of interest (found in the table above)
interest='sandsnap'

#This subsets the larger dataset composite to just your dataset of interest
df=df[df['dataset']==interest]
df=df.reset_index()

## This cell will extract the names and values of given percentile distributions
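
A minimal sketch of the parsing step on a single string, assuming the `Measured_Distributions` entry is a comma-separated list such as `'d10,d50,d90'` (the example string is hypothetical). It uses plain Python string methods instead of the pandas string accessors in the cell below.

```python
# Hypothetical entry of the 'Measured_Distributions' column
measured = 'd10,d50,d90'

# Split into names, tolerating stray spaces around the commas
names = [name.strip() for name in measured.split(',')]

# Drop the leading 'd' and convert the percentile to a fraction
fractions = [int(name[1:]) / 100 for name in names]

print(names)
print(fractions)
```

Stripping whitespace matters if the names are later used as column labels, since `' d50'` and `'d50'` would not match.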

In [None]:
#this counts the number of distributions for your dataset of interest
num_given_dists= int(df['num_orig_dists'].iloc[0])


#extract distribution names and distribution percentiles that were provided with the source dataset (e.g., 'd50' and 0.5)
given_dist_names=[]
given_dist_vals=[]
for i in range(0,num_given_dists):
    a=(df['Measured_Distributions'][:1]).astype(str).str.split(',', expand=True)[i]
    b=a.astype(str).str.split('d', expand=True)[1]
    a=a.unique()[0]
    val=b.astype(int)/100
    val=val.unique()[0]
    given_dist_names.append(a)
    given_dist_vals.append(val)


## This next cell is where the calculations will occur:
* In the outermost for loop, the code iterates over the provided sample distributions, hiding one distribution from the dataset at a time. The hidden distribution is re-introduced in the next iteration, and another distribution is hidden (and so on).

* In the loop after that, which runs over each sample (row), a value for the temporarily removed, known distribution is interpolated from the remaining distributions.

* In the innermost loop, nested within the one above, the percentile values of the remaining distributions are gathered for each sample row.
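
The steps above can also be sketched without the two inner loops, since `interp1d` accepts a 2-D array and interpolates along a chosen axis. This is an alternative formulation on a toy dataframe with made-up values, not the code used in the cell below; the column names `d10`, `d50`, `d90` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Toy samples with three measured percentiles (illustrative values only)
toy = pd.DataFrame({'d10': [0.2, 0.1], 'd50': [0.4, 0.3], 'd90': [0.8, 0.5]})
names = ['d10', 'd50', 'd90']
fractions = np.array([0.10, 0.50, 0.90])

for n, name in enumerate(names):
    # Hide one distribution and keep the rest
    kept_names = [c for c in names if c != name]
    kept_fractions = np.delete(fractions, n)
    # y has one row per sample; interpolate along axis 1 for all rows at once
    estimate_fn = interp1d(kept_fractions, toy[kept_names].to_numpy(), axis=1,
                           bounds_error=False, fill_value='extrapolate')
    toy[name + '_calc'] = estimate_fn(fractions[n])
    toy[name + '_error'] = (toy[name + '_calc'] - toy[name]) / toy[name] * 100

print(toy)
```

When the hidden distribution is the smallest or largest percentile, the value is extrapolated rather than interpolated, so errors for the end distributions may be noticeably larger.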

## <font color=grey> *The output will be the addition of two new columns per known input distribution: the re-calculated value (in '_calc') and the calculated percent error (in '_error')*</font>

In [None]:



len_df=len(df) #number of samples (rows) in the dataset of interest

for n in range(0,num_given_dists): #Repeats for each distribution
    # "deleting" the distribution name and value to be recalculated
    new_dist_names=np.delete(given_dist_names, n)
    new_dist_vals=np.delete(given_dist_vals, n)
    # "preserving" the distribution name and percentile to be recalculated as other variables
    focus_column=given_dist_names[n]
    focus_column_value=given_dist_vals[n]
    #new columns for recalculated value and error
    calc_column=focus_column+'_calc'
    error_column=focus_column+'_error'
    #the remaining percentiles are the same for every sample
    grain_size_frequencies=new_dist_vals
    for i in range(0,len_df): #repeats for each row, aka sample
        grain_size_bins=[]
        #This collects the values from the leftover "original" distributions
        for ia in range(0,(num_given_dists-1)):
            bin_size=df[new_dist_names[ia]].iloc[i]
            grain_size_bins.append(bin_size)

        #This interpolates the value using the gathered "original" distributions from above
        distribution = scipy.interpolate.interp1d(grain_size_frequencies, grain_size_bins, bounds_error=False, fill_value='extrapolate')

        #This evaluates the interpolation at the hidden percentile and adds it to the new calculated column
        df.loc[i,calc_column] = float(distribution(focus_column_value))


    #This calculates the percent error:
    df[error_column]=((df[calc_column]-df[focus_column])/df[focus_column])*100


print('Distributions Interpolation Error Calculated')

df