# **Tutorial 5: Paleoclimate Data Analysis Tools**
**Week 1, Day 4, Paleoclimate**

**Content creators:** Sloane Garelick

**Content reviewers:** Brodie Pearson

**Content editors:** Yosmely Bermúdez, Agustina Pesce, Zahra Khodakaramimaghsoud

**Production editors:** TBD

**Our 2023 Sponsors:** TBD

### **Code and Data Sources**

Code for this tutorial is based on existing notebooks from LinkedEarth for [analyzing LiPD datasets](https://github.com/LinkedEarth/paleoHackathon/blob/main/notebooks/PaleoHack-nb03_EDA.ipynb) and [resampling data with `Pyleoclim`](https://github.com/LinkedEarth/PyleoTutorials/blob/main/notebooks/L1_uniform_time_sampling.ipynb).

Data from the following sources are used in this tutorial:

*   Tierney, J.E., et al. 2008. Northern Hemisphere Controls on Tropical Southeast African Climate During the Past 60,000 Years. Science, Vol. 322, No. 5899, pp. 252-255, 10 October 2008. https://doi.org/10.1126/science.1160485
*   Tierney, J.E., and deMenocal, P. 2013. Abrupt Shifts in Horn of Africa Hydroclimate Since the Last Glacial Maximum. Science, 342(6160), 843-846. https://doi.org/10.1126/science.1240411
*   Tierney, J.E., Pausata, F., deMenocal, P. 2017. Rainfall Regimes of the Green Sahara. Science Advances, 3(1), e1601503. https://doi.org/10.1126/sciadv.1601503
*   Shanahan, T.M., et al. 2015. The time-transgressive termination of the African Humid Period. Nature Geoscience, 8(2), 140-144. https://doi.org/10.1038/ngeo2329

















# **Tutorial 5 Objectives**

In this tutorial, you will explore various computational analyses for interpreting paleoclimate data and understand why these methods are useful. A common issue in the paleosciences is the presence of uneven time spacing between consecutive observations. While `pyleoclim` includes several methods that can deal with this effectively, there are certain applications for which it is necessary to place the records on a uniform time axis. In this tutorial you'll learn a few ways to do this with `pyleoclim`. Additionally, we will explore another useful paleoclimate data analysis tool, Principal Component Analysis (PCA), which allows us to identify a common signal between various paleoclimate reconstructions.


By the end of this tutorial you will be able to perform the following data analysis techniques on proxy-based climate reconstructions:

*   Interpolation
*   Binning 
*   Principal component analysis




In [None]:
# # Install libraries
# !pip install cartopy
# !pip install pyleoclim
# !pip install pandas
# !pip install matplotlib
# !pip install pooch

In [None]:
# Import libraries
import pandas as pd
import cartopy
import pyleoclim as pyleo
import matplotlib.pyplot as plt

## Load the sample dataset for analysis

For this tutorial, we are going to use an example dataset to practice the various data analysis techniques. The dataset we'll be using is a record of hydrogen isotopes of leaf waxes (dDwax) from Lake Tanganyika in East Africa [(Tierney et al., 2008)](https://www.science.org/doi/10.1126/science.1160485?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed). Recall from the introductory video that dDwax is a proxy that records changes in the amount of precipitation in the tropics. In the previous tutorial, we looked at dD data from high-latitude ice cores, where dD was a proxy for temperature; in the tropics, dD instead reflects rainfall amount.

Let's first read the data from a .csv file.

In [None]:
import pooch

# fname = 'tanganyika_dD.csv'
url = "https://osf.io/sujvp/download/"
data_path = pooch.retrieve(
    url,
    known_hash=None
)

tang_dD = pd.read_csv(data_path)

In [None]:
tang_dD.head()

We can now create a `Series` in Pyleoclim and assign names to different variables so that we can easily plot the data.

In [None]:
ts_tang = pyleo.Series(
    time=tang_dD['Age'],
    value= tang_dD['dD_IVonly'],
    time_name='Age',
    time_unit='yr BP',
    value_name='dDwax',
    value_unit='per mille',
    label='Lake Tanganyika dDprecip'
)

ts_tang.plot(color='C1',invert_yaxis=True)

You may notice that we inverted the y-axis. When we're plotting dD data, we typically invert the y-axis because more negative ("depleted") values suggest increased rainfall, whereas more positive ("enriched") values suggest decreased rainfall.

## Uniform Time-Sampling of the Data
There are a number of different reasons we might want to assign new values to our data. For example, if the data is not evenly spaced, we might need to resample it onto an evenly spaced time axis in order to use some other data analysis technique or to compare it more easily to other data.

First, let's check whether our data is already evenly spaced using the `.is_evenly_spaced()` method:

In [None]:
ts_tang.is_evenly_spaced()
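
As a rough illustration of what such a check involves, the sketch below tests whether all consecutive time steps in a list of ages are (nearly) identical. It is not Pyleoclim's implementation: the function name `steps_are_uniform`, the `rel_tol` tolerance, and the ages are all made up for this example.

```python
import math


def steps_are_uniform(times, rel_tol=1e-4):
    """Return True if all consecutive time steps are (nearly) identical.

    `rel_tol` is a hypothetical tolerance chosen for this sketch.
    """
    steps = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    if not steps:
        return True  # zero or one point: nothing to compare
    mean_step = sum(steps) / len(steps)
    return all(math.isclose(s, mean_step, rel_tol=rel_tol) for s in steps)


# Made-up ages (yr BP), not values from the Tanganyika record
uneven_ages = [0, 120, 310, 400, 760]
even_ages = [0, 200, 400, 600, 800]

print(steps_are_uniform(uneven_ages))  # False
print(steps_are_uniform(even_ages))    # True
```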

Our data is not evenly spaced. There are a few different methods available in `pyleoclim` to place the data on a uniform time axis: interpolating, binning, and coarse graining via a Gaussian kernel as in Rehfeld et al. (2011). In general, all of these methods use the available data near a chosen time to estimate what the value was at that time, but each method differs in which nearby data points it uses and how it uses them.
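
Since the Gaussian kernel approach is not demonstrated later in this tutorial, here is a minimal sketch of the general idea: each value on the new time axis is a weighted mean of all observations, with weights that decay smoothly with distance in time. This is only an illustration under simple assumptions, not necessarily the exact scheme of Rehfeld et al. (2011) or of `pyleoclim`. The function name, the `bandwidth` parameter, and the data are hypothetical.

```python
import math


def gaussian_kernel_resample(times, values, new_times, bandwidth):
    """Estimate a value at each new time as a Gaussian-weighted mean.

    `bandwidth` (same units as `times`) controls how quickly the weight
    of an observation decays with its distance from the target time.
    """
    resampled = []
    for t_new in new_times:
        weights = [math.exp(-0.5 * ((t - t_new) / bandwidth) ** 2) for t in times]
        total = sum(weights)
        resampled.append(sum(w * v for w, v in zip(weights, values)) / total)
    return resampled


# Made-up, unevenly spaced record (ages in yr BP, dD in per mille)
ages = [0, 120, 310, 400, 760]
dD = [-100.0, -104.0, -110.0, -108.0, -120.0]

grid = [0, 200, 400, 600]  # evenly spaced target axis
smoothed = gaussian_kernel_resample(ages, dD, grid, bandwidth=150)
print([round(v, 1) for v in smoothed])
```

Because every estimate is a weighted mean, the resampled values always stay within the range of the original observations; a larger `bandwidth` gives a smoother result.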


### Interpolation
To start out, let's try using interpolation to evenly space our data. Interpolation projects the data onto an evenly spaced time axis with a distance between points (step size) of our choosing. Several interpolation methods are available: `linear`, `nearest`, `zero`, `slinear`, `quadratic`, `cubic`, `previous`, and `next`. More on these and their associated keyword arguments can be found in the [documentation](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.series.Series.interp). By default, the function `.interp()` implements linear interpolation:

In [None]:
tang_linear = ts_tang.interp() #default method = 'linear'

In [None]:
#Checking whether or not the series is now evenly spaced
tang_linear.is_evenly_spaced()

Now that we've interpolated our data, let's compare the original dataset to the linearly interpolated dataset we just created.

In [None]:
fig, ax = ts_tang.plot(label='Original')
tang_linear.plot(ax=ax, label='Linear', invert_yaxis=True)

Notice that although there are some minor differences between the original and linearly interpolated data, the records are essentially the same.
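
To see what linear interpolation does under the hood, the following sketch projects a small, made-up record onto an evenly spaced axis by drawing straight lines between neighbouring observations. The function `linear_interp` and the data are hypothetical and only illustrate the idea; this is not the code that `.interp()` runs.

```python
def linear_interp(times, values, new_times):
    """Linearly interpolate (times, values) onto new_times.

    Assumes `times` and `new_times` are sorted in increasing order and
    that every new time lies within [times[0], times[-1]].
    """
    result = []
    j = 0
    for t_new in new_times:
        # Advance to the segment [times[j], times[j + 1]] that contains t_new
        while j < len(times) - 2 and times[j + 1] < t_new:
            j += 1
        t0, t1 = times[j], times[j + 1]
        v0, v1 = values[j], values[j + 1]
        frac = (t_new - t0) / (t1 - t0)
        result.append(v0 + frac * (v1 - v0))
    return result


# Made-up, unevenly spaced record (ages in yr BP, dD in per mille)
ages = [0, 100, 300, 400]
dD = [-100.0, -110.0, -90.0, -95.0]

step = 100  # chosen step size
n_steps = int((ages[-1] - ages[0]) / step)
even_ages = [ages[0] + i * step for i in range(n_steps + 1)]

even_dD = linear_interp(ages, dD, even_ages)
print(even_ages)  # [0, 100, 200, 300, 400]
print(even_dD)    # [-100.0, -110.0, -100.0, -90.0, -95.0]
```

Only the value at 200 yr BP is new; it falls halfway between the observations at 100 and 300 yr BP.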

Let's compare a few of the different interpolation methods (e.g., quadratic, next, zero) with one another just to see how they are similar and different:

In [None]:
tang_quadratic = ts_tang.interp(method='quadratic')
tang_next = ts_tang.interp(method='next')
tang_zero = ts_tang.interp(method='zero')

In [None]:
fig, ax = tang_linear.plot(label='Linear',invert_yaxis=True)
tang_quadratic.plot(ax=ax,label='Quadratic')
tang_next.plot(ax=ax,label='Next')
tang_zero.plot(ax=ax,label='Zero')

You can see how the methods can produce slightly different results, but reproduce the same overall trend.
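
The differences between methods come from which neighbouring observations are used. As a small illustration, the sketch below contrasts two step-wise rules on made-up data: taking the value of the closest earlier observation ("previous") versus the closest later one ("next"). The function `step_interp` is hypothetical and reflects our understanding of how these two options behave, not the library's code.

```python
import bisect


def step_interp(times, values, new_times, kind="previous"):
    """Assign each new time the value of the previous or next observation.

    Assumes `times` is sorted and every new time lies within its range.
    """
    result = []
    for t_new in new_times:
        if kind == "previous":
            idx = bisect.bisect_right(times, t_new) - 1
        elif kind == "next":
            idx = bisect.bisect_left(times, t_new)
        else:
            raise ValueError(f"unknown kind: {kind}")
        result.append(values[idx])
    return result


# Made-up, unevenly spaced record (ages in yr BP, dD in per mille)
ages = [0, 100, 300, 400]
dD = [-100.0, -110.0, -90.0, -95.0]
even_ages = [0, 100, 200, 300, 400]

print(step_interp(ages, dD, even_ages, kind="previous"))
# [-100.0, -110.0, -110.0, -90.0, -95.0]
print(step_interp(ages, dD, even_ages, kind="next"))
# [-100.0, -110.0, -90.0, -90.0, -95.0]
```

The two rules agree wherever an observation already exists and differ only at 200 yr BP, where no observation is available.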

### Binning
Another option for resampling our data onto a uniform time axis is binning. In binning, we define a set of time intervals ("bins"), group the data points that fall within each interval, and average them to obtain a single value for that bin. By default, the bin size is set to the coarsest time spacing present in the dataset, and the data are averaged over a uniform sequence of such intervals.

In [None]:
tang_bin = ts_tang.bin()

In [None]:
fig, ax = ts_tang.plot(label='Original',invert_yaxis=True)
tang_bin.plot(ax=ax,label='Binned')

Again, notice that although there are some minor differences between the original and binned data, the records still capture the same overall trend.
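
The binning procedure described above can be sketched in a few lines. The example below uses made-up data and a hypothetical helper, `bin_series`, with the bin size set to the coarsest spacing in the record, as described for the defaults. Details such as where bin edges start are assumptions of this sketch and may differ from what `.bin()` does.

```python
def bin_series(times, values, bin_size):
    """Average values within consecutive bins of width `bin_size`.

    Bins start at the earliest time. Returns (bin centres, bin means).
    Empty bins are skipped in this sketch.
    """
    start = min(times)
    sums, counts = {}, {}
    for t, v in zip(times, values):
        k = int((t - start) // bin_size)  # index of the bin containing t
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    keys = sorted(sums)
    centres = [start + (k + 0.5) * bin_size for k in keys]
    means = [sums[k] / counts[k] for k in keys]
    return centres, means


# Made-up, unevenly spaced record (ages in yr BP, dD in per mille)
ages = [0, 50, 120, 310, 400, 760]
dD = [-100.0, -102.0, -104.0, -110.0, -108.0, -120.0]

# Coarsest spacing between consecutive observations
bin_size = max(t1 - t0 for t0, t1 in zip(ages, ages[1:]))
centres, means = bin_series(ages, dD, bin_size)

print(bin_size)  # 360
print(centres)   # [180.0, 540.0, 900.0]
print(means)     # [-104.0, -108.0, -120.0]
```

Note how the four closely spaced observations in the first bin collapse into a single average, which is why binning tends to smooth out high-resolution parts of a record.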

## Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a tool that allows us to identify a common signal between various paleoclimate reconstructions. Doing so involves resampling all of the records onto a common time-step, so we will practice applying the skills we've learned so far in this tutorial.



So far, we've been looking at dD data from Lake Tanganyika in tropical East Africa. Let's compare this dD record to other existing dD records from lake and marine sediment cores in tropical Africa: the Gulf of Aden [(Tierney and deMenocal, 2013)](https://doi.org/10.1126/science.1240411), Lake Bosumtwi [(Shanahan et al., 2015)](https://doi.org/10.1038/ngeo2329), and the West African Margin [(Tierney et al., 2017)](https://doi.org/10.1126/sciadv.1601503).

First, let's load these datasets:

In [None]:
# Gulf of Aden
# fname = 'aden_dD.csv'
url = "https://osf.io/gm2v9/download/"

data_path = pooch.retrieve(
    url,
    known_hash=None
)

aden_dD = pd.read_csv(data_path)
aden_dD.head()

In [None]:
#Lake Bosumtwi
# fname = "bosumtwi_dD.csv"
url = "https://osf.io/mr7d9/download/"

data_path = pooch.retrieve(
    url,
    known_hash=None
)

bosumtwi_dD = pd.read_csv(data_path)

bosumtwi_dD.head()

In [None]:
# GC27 (West African Margin)
# fname = "gc27_dD.csv"
url = "https://osf.io/k6e3a/download/"

data_path = pooch.retrieve(
    url,
    known_hash=None
)

gc27_dD = pd.read_csv(data_path)
gc27_dD.head()

Next, let's convert each dataset into a `Series` in Pyleoclim.

In [None]:
ts_tanganyika = pyleo.Series(
    time=tang_dD['Age'],
    value= tang_dD['dD_IVonly'],
    time_name='Age',
    time_unit='yr BP',
    value_name='dDwax',
    label='Lake Tanganyika'
)
ts_aden = pyleo.Series(
    time=aden_dD['age_calBP'],
    value= aden_dD['dDwaxIVcorr'],
    time_name='Age',
    time_unit='yr BP',
    value_name='dDwax',
    label='Gulf of Aden'
)
ts_bosumtwi = pyleo.Series(
    time=bosumtwi_dD['age_calBP'],
    value=bosumtwi_dD['d2HleafwaxC31ivc'],
    time_name='Age',
    time_unit='yr BP',
    value_name = 'dDwax',
    label='Lake Bosumtwi'
)
ts_gc27 = pyleo.Series(
    time=gc27_dD['age_BP'],
    value=gc27_dD['dDwax_iv'],
    time_name='Age',
    time_unit='yr BP',
    value_name='dDwax',
    label='GC27'
)

Now let's set up a `MultipleSeries` using Pyleoclim with all four dD datasets. 

In [None]:
ts_list = [ts_tanganyika, ts_aden, ts_bosumtwi, ts_gc27]
ms_africa = pyleo.MultipleSeries(ts_list, name='African dDwax')

We can now create a stackplot with all four dD records:

In [None]:
fig, ax = ms_africa.stackplot()

By creating a stackplot, we can easily compare the datasets. However, the four dD records don't have the same resolution and don't span the same time interval.

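To better compare the records and assess a common trend, we can use PCA. First, we can use `.common_time()` to place the records on a shared time axis with a common sampling frequency. Let's set the time step to 500 years and standardize the data. Note that `step` is expressed in the units of the time axis, so `step=0.5` corresponds to 500 years only if the ages are stored in thousands of years: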

In [None]:
africa_ct = ms_africa.common_time(step=0.5).standardize()
fig, ax = africa_ct.stackplot()
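
To make the two operations in the cell above more concrete, here is a minimal sketch using made-up values. It assumes that the shared time axis covers only the interval in which all records overlap, and that standardizing means subtracting the mean and dividing by the population standard deviation. Both are assumptions of this illustration, and the helper names `overlap_grid` and `standardize` are hypothetical.

```python
import statistics


def overlap_grid(time_axes, step):
    """Evenly spaced grid covering only the interval shared by all records."""
    start = max(min(t) for t in time_axes)
    stop = min(max(t) for t in time_axes)
    n_steps = int((stop - start) // step)
    return [start + i * step for i in range(n_steps + 1)]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Two made-up age axes (yr BP) with different coverage and spacing
ages_a = [0, 120, 310, 760]
ages_b = [200, 450, 900]
grid = overlap_grid([ages_a, ages_b], step=100)
print(grid)  # [200, 300, 400, 500, 600, 700]

# Made-up dD values (per mille)
dD = [-100.0, -110.0, -90.0, -120.0, -80.0]
z = standardize(dD)
print([round(v, 2) for v in z])
```

After standardization each record has zero mean and unit variance, so records with very different absolute dD values or amplitudes contribute equally to the PCA.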

We now have standardized dD records that have the same sampling resolution and span the same time interval. Now let's apply PCA:

In [None]:
PCA = africa_ct.pca()

The result is an object containing multiple outputs, with two plotting methods attached to it. For example, we can print the percentage of variance accounted for by each mode, which is saved as `pctvar`:

In [None]:
print(PCA.pctvar.round())

This means that 97% of the variance in the four paleoclimate records is explained by the first principal component. The number of datasets in the PCA constrains the number of principal components that can be defined, which is why we only have four components in this example.
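
For intuition about where the percentage of variance comes from, the sketch below computes the first principal component of three made-up records by hand. It builds the covariance matrix between records, finds its leading eigenvector with power iteration, and divides the leading eigenvalue by the total variance. This is an illustration only, not the routine used by `.pca()`; all names and data are hypothetical, and power iteration is just one simple way to obtain the leading eigenvector.

```python
import math


def centre(records):
    """Subtract each record's mean from its values."""
    centred = []
    for rec in records:
        mean = sum(rec) / len(rec)
        centred.append([v - mean for v in rec])
    return centred


def covariance(centred):
    """Covariance matrix between records (one row/column per record)."""
    n = len(centred[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in centred]
            for ri in centred]


def leading_eigenpair(matrix, n_iter=200):
    """Leading eigenvalue and unit eigenvector via power iteration.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    size = len(matrix)
    vec = [float(i + 1) for i in range(size)]
    for _ in range(n_iter):
        product = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    product = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
    eigenvalue = sum(v * p for v, p in zip(vec, product))
    return eigenvalue, vec


# Three made-up records on a common time axis that share a trend
records = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.2, 1.9, 3.1, 3.8, 5.2, 5.9],
    [0.8, 2.3, 2.9, 4.1, 4.7, 6.2],
]

centred = centre(records)
cov = covariance(centred)
eigenvalue, loadings = leading_eigenpair(cov)

total_variance = sum(cov[i][i] for i in range(len(cov)))
pct_var_pc1 = 100 * eigenvalue / total_variance

# PC1 time series: weighted sum of the records at each time step
pc1 = [sum(w * rec[t] for w, rec in zip(loadings, centred))
       for t in range(len(centred[0]))]

print(round(pct_var_pc1, 1))
```

The covariance matrix has one row and column per record, which is why the number of principal components cannot exceed the number of datasets.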

Now let's create a new series for the first mode of variance and plot it against the original datasets:

In [None]:
pc1 = PCA.pcs[:, 0]  # first column of the PC matrix = first principal component

In [None]:
mode1 = pyleo.Series(
    time=africa_ct.series_list[0].time,
    value=PCA.pcs[:,0],
    label=r'$PC_1$',
    value_name='PC1',
    time_name ='age',
    time_unit = 'yr BP'
)

In [None]:
fig, ax1 = plt.subplots()

ax1.set_ylabel('dDwax')
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.set_ylabel('PC1')  # we already handled the x-label with ax1

# Plot PC1 in black on the right-hand axis and the standardized records on the left
mode1.plot(color='black', ax=ax2, invert_yaxis=True)
africa_ct.plot(ax=ax1, linewidth=0.5)

The original dD records are shown in the colored lines, and the PC1 time series is shown in black. 
 

*   How do the original time series compare to the PC1 time series? Do they record similar trends?
*   Which original dD record most closely resembles the PC1 time series?
*   What changes in climate does the PC1 time series record over the past 20,000 years? *Hint: remember that more depleted dD suggests increased rainfall.*



 