<img src='https://github.com/LinkedEarth/Logos/raw/master/PYLEOCLIM_logo_HORZ-01.png' width="800">

This notebook investigates using pandas' native time representation (NumPy datetime64) for Pyleoclim.

In [1]:
import pandas as pd
import numpy as np
from collections import OrderedDict

Import the data including its non-standard time axis into a dataframe for exploration.

In [3]:
df = pd.read_csv('../data/ODP846.csv', header=0)
df.head()

Unnamed: 0,Age,d18O
0,3.645,3.38
1,7.99,3.46
2,11.18,3.765
3,13.803,4.14
4,15.886,4.47


The ages in this record are in kyr BP (thousands of years before present), so we convert them to years.

In [4]:
df['Age']*=1000
df.head()

Unnamed: 0,Age,d18O
0,3645.0,3.38
1,7990.0,3.46
2,11180.0,3.765
3,13803.0,4.14
4,15886.0,4.47


It is not uncommon to have duplicated values in time (this usually arises when duplicate measurements are made on the same sample and the authors report both measurements), so let's clean that up before going any further.

In [5]:
def reduce_duplicated_timestamps(ys, ts, verbose=False):
    ''' Reduce duplicated timestamps in a timeseries by averaging the values
    Parameters
    ----------
    ys : array
        Dependent variable
    ts : array
        Independent variable
    verbose : bool
        If True, will print a warning message
    Returns
    -------
    ys : array
        Dependent variable
    ts : array
        Independent variable, with duplicated timestamps reduced by averaging the values
    '''
    # np.float is deprecated since NumPy 1.20; the builtin float is equivalent
    ys = np.asarray(ys, dtype=float)
    ts = np.asarray(ts, dtype=float)
    assert ys.size == ts.size, 'The size of time axis and data value should be equal!'

    if len(ts) != len(set(ts)):
        value = OrderedDict()
        for t, y in zip(ts, ys):
            if t not in value:
                value[t] = [y]
            else:
                value[t].append(y)

        ts = []
        ys = []
        for k, v in value.items():
            ts.append(k)
            ys.append(np.mean(v))

        ts = np.array(ts)
        ys = np.array(ys)

        if verbose:
            print('Duplicate timestamps have been combined by averaging values.')
    return ys, ts

def sort_ts(ys, ts, verbose=False):
    ''' Sort ts values in ascending order
    Parameters
    ----------
    ys : array
        Dependent variable
    ts : array
        Independent variable
    verbose : bool
        If True, will print a warning message
    Returns
    -------
    ys : array
        Dependent variable
    ts : array
        Independent variable, sorted in ascending order
    '''
    # np.float is deprecated since NumPy 1.20; the builtin float is equivalent
    ys = np.asarray(ys, dtype=float)
    ts = np.asarray(ts, dtype=float)
    assert ys.size == ts.size, 'time and value arrays must be of equal length'

    # sort the time series so that the time axis will be ascending
    dt = np.median(np.diff(ts))
    if dt < 0:
        sort_ind = np.argsort(ts)
        ys = ys[sort_ind]
        ts = ts[sort_ind]
        if verbose:
            print('The time axis has been adjusted to be prograde')

    return ys, ts

# Sort the data, then average duplicated timestamps
d18O_sort, age_sort = sort_ts(df['d18O'], df['Age'])
d18O_clean, age_clean = reduce_duplicated_timestamps(d18O_sort, age_sort)
# Put it back into a dataframe
df2 = pd.DataFrame({'Age': age_clean, 'd18O': d18O_clean})
df2.head()


Unnamed: 0,Age,d18O
0,3645.0,3.38
1,7990.0,3.46
2,11180.0,3.765
3,13803.0,4.14
4,15886.0,4.47
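
The same clean-up can also be sketched with pandas alone. This is a minimal illustration on toy data (the values below are hypothetical, not taken from ODP846), not the approach used above. Note that groupby always sorts the keys, whereas sort_ts above only reorders when the median time step is negative.

```python
import pandas as pd

# Toy data (hypothetical values) with an unsorted axis and one duplicated age
toy = pd.DataFrame({'Age': [3000.0, 1000.0, 1000.0, 2000.0],
                    'd18O': [3.0, 1.0, 2.0, 4.0]})

# groupby sorts the group keys in ascending order by default, so sorting
# and averaging of duplicated timestamps happen in a single step
toy_clean = toy.groupby('Age', as_index=False)['d18O'].mean()
print(toy_clean)
```

Whether this is preferable depends on whether the time axis should always be forced into ascending order.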


Explore the time axis:

In [6]:
dt = df2['Age'].diff()  # time step between consecutive samples
print(f"min value: {df2['Age'].min()}")
print(f"max value: {df2['Age'].max()}")
print(f"mean dt value: {dt.mean()}")
print(f"median dt value: {dt.median()}")
print(f"min dt value: {dt.min()}")
print(f"max dt value: {dt.max()}")

min value: 3645.0
max value: 5029727.0
mean dt value: 2520.6028084252757
median dt value: 2326.0
min dt value: 180.0
max dt value: 26729.0
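
As a possible shortcut, the time-step statistics printed above can be gathered in one call with Series.describe(). The sketch below uses a small hypothetical age axis (the name ages and its values are illustrative only) so that it runs on its own.

```python
import pandas as pd

# Hypothetical, irregularly spaced age axis in years
ages = pd.Series([0.0, 100.0, 250.0, 300.0], name='Age')

# diff() yields NaN for the first entry; describe() ignores it
dt_stats = ages.diff().describe()

# '50%' is the median of the time steps
print(dt_stats[['min', '50%', 'mean', 'max']])
```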


Let's use the minimum dt as the step size to create a regular time axis.

In [7]:
dti = np.arange(df2['Age'].min(),df2['Age'].max()+df2['Age'].diff().min(),df2['Age'].diff().min(), dtype='str')
ts = np.array(dti,dtype='datetime64[Y]')

ValueError: no fill-function for data-type.
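
The ValueError arises because np.arange cannot fill an array of string dtype. One possible workaround, sketched here under stated assumptions rather than as a definitive solution, is to build the axis numerically and cast it to datetime64[Y] afterwards. The sketch assumes that "BP" means before 1950 CE and reuses the minimum age, maximum age, and minimum dt printed earlier; the variable names are illustrative.

```python
import numpy as np

# Values printed in the exploration cell earlier in this notebook
age_min, age_max, dt_min = 3645.0, 5029727.0, 180.0

# Build the regular axis numerically, in years BP
ages_bp = np.arange(age_min, age_max + dt_min, dt_min)

# Assumption: "BP" means before 1950 CE, so calendar year = 1950 - age
years_ce = 1950 - ages_bp.astype('int64')

# datetime64[Y] stores years as an integer offset from 1970,
# so shift by 1970 before casting
time_axis = (years_ce - 1970).astype('datetime64[Y]')

print(time_axis.dtype, time_axis.size)
```

Because ages increase into the past, the resulting calendar axis runs backwards in time. Note also that pandas' default nanosecond-resolution timestamps only span roughly 1677 to 2262 CE, so a coarser unit such as years is needed for ages of this magnitude.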