<a href="https://colab.research.google.com/github/PaleoLipidRR/marine-AOA-GDGT-distribution/blob/main/PNAS_pythonCodeS2_BeyondTEX86_Analytics_Visualizations_RR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Python Code S1**

## Supplementary Information for:
## Beyond TEX86: GDGTs inform marine archaea ecology and evolution
Ronnakrit Rattanasriampaipong, Yi Ge Zhang, Ann Pearson, Brian Hedlund, and Shuang Zhang

Corresponding Author: Ronnakrit Rattanasriampaipong
E-mail: rrattan@tamu.edu
***

Notebook Description:

This is a Jupyter notebook containing the Python scripts that we use to pre-process the GDGT database and generate the processed GDGT datasets (output as Dataset S1) used for data analysis. The input file (Dataset S1) is the composite GDGT database used for this study, from Python Code S1 (see SI Appendix).

***


# **1. Import python packages of interest**

### 1.1 Mounting your Google Drive in Google Colab so that you can read files directly from Drive

In [1]:
# Mounting your google drive
from os.path import join
from google.colab import drive

ROOT = "/content/drive"
drive.mount(ROOT,force_remount=True)

Mounted at /content/drive
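
Outside Colab the `google.colab` package is not importable, so the mount cell above fails. A minimal sketch of a guard follows, assuming a hypothetical helper `resolve_data_dir` and a hypothetical local fallback folder named `data` (neither is part of the original notebook):

```python
from pathlib import Path


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


def resolve_data_dir(local_dir: str = "data") -> Path:
    """Hypothetical helper: mount Drive on Colab, otherwise use a local folder."""
    if in_colab():
        from google.colab import drive

        drive.mount("/content/drive", force_remount=True)
        return Path("/content/drive/MyDrive")
    # Assumption: when running locally, the input files sit in `local_dir`
    return Path(local_dir)


data_dir = resolve_data_dir()
print(data_dir)
```

With this guard the same notebook can run locally or on Colab, and later file paths can be built from `data_dir` rather than from a hard-coded Drive path.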



### 1.2 Computation and Data Analytics

In [2]:
import pandas as pd
import numpy as np
import xarray as xr
import seaborn as sns

import scipy
from scipy import stats
from sklearn import linear_model, datasets
from sklearn import mixture
from sklearn.metrics import silhouette_samples, silhouette_score

### 1.2 Data plotting and visualizations

**Run the !apt-get commands below only if you run this notebook from Colab.**

In [3]:
!apt-get install libproj-dev proj-data proj-bin
!apt-get install libgeos-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
libproj-dev is already the newest version (4.9.3-2).
libproj-dev set to manually installed.
proj-data is already the newest version (4.9.3-2).
proj-data set to manually installed.
The following NEW packages will be installed:
  proj-bin
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 32.3 kB of archives.
After this operation, 110 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 proj-bin amd64 4.9.3-2 [32.3 kB]
Fetched 32.3 kB in 0s (112 kB/s)
Selecting previously unselected package proj-bin.
(Reading database ... 155047 files and directories currently installed.)
Preparing to unpack .../proj-bin_4.9.3-2_amd64.deb ...
Unpacking proj-bin (4.9.3-2) ...
Setting up proj-bin (4.9.3-2) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Reading package lists... Done
Building dependency tree       
Reading state 

In [4]:
%pip install cartopy

Collecting cartopy
  Downloading Cartopy-0.20.1.tar.gz (10.8 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  Downloading Cartopy-0.20.0.tar.gz (10.8 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  Downloading Cartopy-0.19.0.post1.tar.gz (12.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pyshp>=2
  Downloading pyshp-2.1.3.tar.gz (219 kB)
Building wheels for collected packages: cartopy, pyshp
  Building wheel for cartopy (PEP 517) ... done
  Cr

**Run the !apt-get and %pip commands below only if you run this notebook from Colab.**

The prebuilt shapely wheel and cartopy often conflict, especially on Google Colab, so shapely is uninstalled and rebuilt from source here.

In [5]:
!apt-get -qq install python-cartopy python3-cartopy
%pip uninstall -y shapely    # cartopy and shapely aren't friends (early 2020)
%pip install shapely --no-binary shapely

Selecting previously unselected package python-pkg-resources.
(Reading database ... 155063 files and directories currently installed.)
Preparing to unpack .../00-python-pkg-resources_39.0.1-2_all.deb ...
Unpacking python-pkg-resources (39.0.1-2) ...
Selecting previously unselected package python-pyshp.
Preparing to unpack .../01-python-pyshp_1.2.12+ds-1_all.deb ...
Unpacking python-pyshp (1.2.12+ds-1) ...
Selecting previously unselected package python-s

In [6]:
%pip install proplot 
%pip install pyrolite  ### This is to install libraries that are not available in Google Colab

Collecting proplot
  Downloading proplot-0.9.4-py3-none-any.whl (8.0 MB)
Installing collected packages: proplot
Successfully installed proplot-0.9.4
Collecting pyrolite
  Downloading pyrolite-0.3.0-py3-none-any.whl (409 kB)
Collecting mpltern>=0.3.1
  Downloading mpltern-0.3.3-py3-none-any.whl (25 kB)
Collecting numpydoc
  Downloading numpydoc-1.1.0-py3-none-any.whl (47 kB)
Collecting periodictable
  Downloading periodictable-1.6.0.tar.gz (686 kB)
Collecting tinydb
  Downloading tinydb-4.5.2-py3-none-any.whl (23 kB)
Collecting typing-extensions<4.0.0,>=3.10.0
  Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Building wheels for collected packages: periodictable
  Building wheel for periodictable (setup.py) ...
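
The install cells above pull whatever versions the package index resolves at run time, so the plotting stack can differ between sessions. A minimal sketch for recording what was actually installed, using the standard-library `importlib.metadata` (the helper name `installed_versions` is hypothetical):

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages installed by the cells above
report = installed_versions(["cartopy", "shapely", "proplot", "pyrolite"])
for name, version in report.items():
    print(f"{name}: {version if version else 'not installed'}")
```

Saving this report next to the processed datasets makes it easier to rebuild a matching environment later.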

In [7]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.transforms as transforms
from matplotlib.patches import Rectangle

import proplot as plot
import cartopy.crs as ccrs

from pyrolite.util.time import Timescale
gts = Timescale()




###  1.3 Miscellaneous

In [8]:
import string
import os
import requests
import io

###  1.4 Useful functions

In [9]:
def sigmaT_cal_Miller_and_Poisson_1981(temp_array,sal_array):
    '''
    Calculate the seawater density anomaly (sigma-t, in kg/m^3) at one atmosphere
    from seawater temperature (deg C) and practical salinity.
    The calculations follow Millero and Poisson (1981).
    Inputs can be scalars, NumPy arrays, or pandas Series of matching shape.
    
    Reference:
    Millero, F. J., & Poisson, A. (1981). International one-atmosphere equation of state of seawater. Deep Sea Research Part A. Oceanographic Research Papers, 28(6), 625-629.
    '''
    # Temperature-dependent coefficients of the salinity terms (Millero and Poisson, 1981)
    A = 8.24493e-1 - 4.0899e-3*temp_array + 7.6438e-5*(temp_array**2) - 8.2467e-7*(temp_array**3) + 5.3875e-9*(temp_array**4)
    B = -5.72466e-3 + 1.0227e-4*temp_array - 1.6546e-6*(temp_array**2)
    C = 4.8314e-4
    # Density of pure water (kg/m^3) as a function of temperature
    rho_0=999.842594 + 6.793952e-2*temp_array - 9.095290e-3*(temp_array**2) + 1.001685e-4*(temp_array**3) - 1.120083e-6*(temp_array**4) + 6.536336e-9*(temp_array**5)
    # Density of seawater (kg/m^3)
    rho=rho_0 + (A*sal_array) + (B*(sal_array**1.5)) + (C*(sal_array**2))
    # sigma-t is the density minus 1000 kg/m^3
    return rho-1000
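
As a quick sanity check of the density function, the sketch below restates the same polynomial compactly (the name `sigma_t` and the use of `np.polyval` are illustrative choices, not the notebook's own code) so that it runs on its own. Evaluating the coefficients at 5 °C and 25 °C with a salinity of 35 should give sigma-t close to 27.675 and 23.343 kg/m³.

```python
import numpy as np


def sigma_t(temp, sal):
    """Compact restatement of the one-atmosphere equation (illustrative only)."""
    temp = np.asarray(temp, dtype=float)
    sal = np.asarray(sal, dtype=float)
    # np.polyval expects the highest power first, so coefficients are listed in reverse
    A = np.polyval([5.3875e-9, -8.2467e-7, 7.6438e-5, -4.0899e-3, 8.24493e-1], temp)
    B = np.polyval([-1.6546e-6, 1.0227e-4, -5.72466e-3], temp)
    C = 4.8314e-4
    rho_0 = np.polyval(
        [6.536336e-9, -1.120083e-6, 1.001685e-4, -9.095290e-3, 6.793952e-2, 999.842594],
        temp,
    )
    return rho_0 + A * sal + B * sal**1.5 + C * sal**2 - 1000.0


temps = np.array([5.0, 25.0])   # deg C
sals = np.array([35.0, 35.0])   # practical salinity
result = sigma_t(temps, sals)
print(result)
```

Because the arithmetic is element-wise, the same call pattern works on whole DataFrame columns of temperature and salinity.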


# **2. Load and clean datasets**

In [None]:
# username = 'PaleoLipidRR'
# token = '<YOUR_GITHUB_PERSONAL_ACCESS_TOKEN>'  # never commit a real token; load it from an environment variable instead

# github_session = requests.Session()
# github_session.auth = (username,token)

In [None]:
# url = 'https://github.com/PaleoLipidRR/marine-AOA-GDGT-distribution/blob/f4f509c3c5f914a64d384529b6884ca2eaa5b01f/spreadsheets/MarineGDGT_GlobalCompilation_for_supp_07_093021_QCed_RR.csv'
# download = github_session.get(url).content
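
Note that a `github.com/.../blob/...` address returns the HTML page for the file, not the CSV itself; the file content is served from `raw.githubusercontent.com`. A minimal sketch of the conversion, assuming a hypothetical helper `github_blob_to_raw` (whether the file is still available at that commit is not verified here):

```python
def github_blob_to_raw(url: str) -> str:
    """Convert a github.com '/blob/' file URL to its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Split "<owner>/<repo>" from "<ref>/<path to file>"
    repo, ref_and_path = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{repo}/{ref_and_path}"


blob_url = (
    "https://github.com/PaleoLipidRR/marine-AOA-GDGT-distribution/blob/"
    "f4f509c3c5f914a64d384529b6884ca2eaa5b01f/spreadsheets/"
    "MarineGDGT_GlobalCompilation_for_supp_07_093021_QCed_RR.csv"
)
raw_url = github_blob_to_raw(blob_url)
print(raw_url)

# Reading the file would then look like this (needs network access, so left commented):
# import io, requests, pandas as pd
# df = pd.read_csv(io.StringIO(requests.get(raw_url).text))
```

For a public repository no token is needed; a private repository would still require authenticated requests.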

In [11]:
pd.set_option('display.max_rows',4,'display.max_columns',10)
filepath = "/content/drive/MyDrive/Colab Notebooks/Excel/MarineAOA_project/"  ### Replace with the location of your file
filename = "PNAS_datasetS2_BeyondTEX86_RR.xlsx"
df = pd.read_excel(filepath+filename)
df = df.iloc[:,1:]  ## Drop the first column ("Unnamed: 0"), the saved row index left over from the pythonCodeS1 pre-processing
df

Unnamed: 0,sampleName,drilling_program,Site,Site_edited,Latitude,...,match_depth,match_lat,match_lon,oceanLayer_class,paleoWaterDepth
0,Bijl2021_014_1172D_2R-5W_140.5_,IODP-offshore,1172,"Tasman Sea, Southern Ocean",-43.9598,...,,,,,2720.0
1,Bijl2021_015_1172D_2R-6W_44545_,IODP-offshore,1172,"Tasman Sea, Southern Ocean",-43.9598,...,,,,,2720.0
...,...,...,...,...,...,...,...,...,...,...,...
5109,Zhu2016_327_IPL_ETNP_ST8_50,N/A-SPM,ETNP,ETNP,13,...,50.0,13.125,-104.875,Surface ocean,50.0
5110,Zhu2016_328_IPL_ETNP_ST8_125,N/A-SPM,ETNP,ETNP,13,...,125.0,13.125,-104.875,Surface ocean,125.0
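
The extra `Unnamed: 0` column appears when a DataFrame is saved with its row index and then read back without declaring that index. The sketch below reproduces this with a made-up two-row table and an in-memory CSV (an assumption chosen to avoid an Excel dependency; the notebook itself reads an `.xlsx` file), and shows two ways to drop the column without relying on its position.

```python
import io

import pandas as pd

# Toy stand-in for the processed dataset (values are made up for illustration)
toy = pd.DataFrame({"sampleName": ["a", "b"], "Latitude": [-43.96, 13.0]})

# Saving with the default index=True writes the row index as an unlabeled first column
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back naively yields an extra "Unnamed: 0" column
buffer.seek(0)
naive = pd.read_csv(buffer)
print(list(naive.columns))

# Option 1: drop by name pattern instead of by position
cleaned = naive.loc[:, ~naive.columns.str.startswith("Unnamed")]

# Option 2: tell the reader that the first column is the index
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)
print(list(indexed.columns))
```

`pd.read_excel` accepts the same `index_col` argument, so the second option carries over to the Excel file used here.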


**Pivot Table of Imported Datasets**

In [12]:
pd.set_option("display.max_rows", None, "display.max_columns", None)
table = pd.pivot_table(df, values=['gdgt23ratio'], index=['dataType_level1','short_remark','Source','lipidClass'],
                    aggfunc=lambda x: len(x.unique()))
table

dataType_level1,short_remark,Source,lipidClass,gdgt23ratio
Core top,Data from original source,Kim et al. (2015) GCA,sediment-totalGDGTs,104.0
Core top,Data from original source,Kim et al. (2016) GCA,IPL-GDGTs,7.0
Core top,Data from original source,Kim et al. (2016) GCA,Total GDGTs,10.0
Core top,Data from original source,"Pan et al., 2016 Organic Geochemistry",sediment-totalGDGTs,9.0
Core top,Data from original source,Wei et al. (2011) AEM,IPL-GDGTs,9.0
Core top,Data from original source,Wei et al. (2011) AEM,Total GDGTs,11.0
Core top,Data from original source,Zell et al. (2014) GCA,IPL-GDGTs,11.0
Core top,Data from original source,Zell et al. (2014) GCA,Total GDGTs,16.0
Core top,Data retrieved from Kim et al. (2015),Kim et al. (2010) GCA,sediment-totalGDGTs,2.0
Core top,Data retrieved from Tierney and Tingley (2015),Hernández-Sánchez et al. (2014) GCA,sediment-totalGDGTs,7.0
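
The pivot table above counts distinct `gdgt23ratio` values per group, not samples. A small sketch on made-up data illustrates what the aggregation does, and one subtlety: `len(x.unique())` counts a missing value as its own distinct entry, whereas pandas' `nunique()` ignores missing values by default. The toy column values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows reusing two of the notebook's column names as grouping keys
toy = pd.DataFrame({
    "dataType_level1": ["Core top", "Core top", "Core top", "SPM", "SPM"],
    "lipidClass": ["IPL-GDGTs", "IPL-GDGTs", "IPL-GDGTs", "Total GDGTs", "Total GDGTs"],
    "gdgt23ratio": [4.0, 4.0, 5.5, 7.2, np.nan],
})

# Same aggregation as in the notebook: number of distinct values per group
counts = pd.pivot_table(
    toy,
    values=["gdgt23ratio"],
    index=["dataType_level1", "lipidClass"],
    aggfunc=lambda x: len(x.unique()),
)
print(counts)

# Alternative that leaves missing values out of the count
distinct_non_missing = toy.groupby(["dataType_level1", "lipidClass"])["gdgt23ratio"].nunique()
print(distinct_non_missing)
```

In the toy data the "SPM" group holds one real value and one missing value, so the two approaches disagree for that group; which count is wanted depends on whether rows without a ratio should register in the summary.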
