# Retrieving information from LiPD files

## Authors

[Deborah Khider](https://orcid.org/0000-0001-7501-8430)

## Preamble

`PyLiPD` is a Python package that allows you to read, manipulate, and write [LiPD](https://cp.copernicus.org/articles/12/1093/2016/cp-12-1093-2016-discussion.html#discussion)-formatted datasets. In this tutorial, we will demonstrate how to use predefined APIs to get specific information from a LiPD file.

### Goals

* Use existing APIs to get information about the datasets loaded in the workspace, their locations, the variables available, and the types of geologic archives.
* Obtain a BibTeX file

Reading Time: 5 minutes

### Keywords

LiPD

### Pre-requisites

None. This tutorial assumes basic knowledge of Python and Pandas. If you are not familiar with this language or this library, check out this tutorial: http://linked.earth/ec_workshops_py/.

### Relevant Packages

pylipd

## Data Description

This notebook uses the following datasets, in LiPD format:

- McCabe-Glynn, S., Johnson, K., Strong, C. et al. Variable North Pacific influence on drought in southwestern North America since AD 854. Nature Geosci 6, 617–621 (2013). https://doi.org/10.1038/ngeo1862

- Lawrence, K. T., Liu, Z. H., & Herbert, T. D. (2006). Evolution of the eastern tropical Pacific through Plio-Pleistocene glaciation. Science, 312(5770), 79-83.

- Euro2k database: PAGES2k Consortium., Emile-Geay, J., McKay, N. et al. A global multiproxy database for temperature reconstructions of the Common Era. Sci Data 4, 170088 (2017). doi:10.1038/sdata.2017.88

## Demonstration

### Extracting information about the content of a LiPD object

Let's start by importing our favorite package and loading our datasets.

In [1]:
from pylipd.lipd import LiPD

Let's load some diverse datasets to highlight the capabilities:

In [2]:
path = '../data/Euro2k/'

D = LiPD()
D.load_from_dir(path)

Loading 31 LiPD files


100%|█████████████████████████████████████████████████████████████| 31/31 [00:00<00:00, 44.56it/s]

Loaded..





In [3]:
data_path = ['../data/Crystal.McCabe-Glynn.2013.lpd', '../data/ODP846.Lawrence.2006.lpd', 'https://lipdverse.org/data/iso2k100_CO06MOPE/1_0_2//CO06MOPE.lpd']

D.load(data_path)

Loading 3 LiPD files


100%|███████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.34it/s]

Loaded..





### Getting information about Datasets

From the introductory notebooks on [loading LiPD datasets](L0_loading_lipd_datasets.ipynb) and [working with `LiPD` objects](L0_lipd_object.ipynb), you are already familiar with the function that returns the names of all the datasets.

In [4]:
D.get_all_dataset_names()

['Ocn-RedSea.Felis.2000',
 'Arc-Forfjorddalen.McCarroll.2013',
 'Eur-Tallinn.Tarand.2001',
 'Eur-CentralEurope.Dobrovoln_.2009',
 'Eur-EuropeanAlps.B_ntgen.2011',
 'Eur-CentralandEasternPyrenees.Pla.2004',
 'Arc-Tjeggelvas.Bjorklund.2012',
 'Arc-Indigirka.Hughes.1999',
 'Eur-SpannagelCave.Mangini.2005',
 'Ocn-AqabaJordanAQ19.Heiss.1999',
 'Arc-Jamtland.Wilson.2016',
 'Eur-RAPiD-17-5P.Moffa-Sanchez.2014',
 'Eur-LakeSilvaplana.Trachsel.2010',
 'Eur-NorthernSpain.Mart_n-Chivelet.2011',
 'Eur-MaritimeFrenchAlps.B_ntgen.2012',
 'Ocn-AqabaJordanAQ18.Heiss.1999',
 'Arc-Tornetrask.Melvin.2012',
 'Eur-EasternCarpathianMountains.Popa.2008',
 'Arc-PolarUrals.Wilson.2015',
 'Eur-LakeSilvaplana.Larocque-Tobler.2010',
 'Eur-CoastofPortugal.Abrantes.2011',
 'Eur-TatraMountains.B_ntgen.2013',
 'Eur-SpanishPyrenees.Dorado-Linan.2012',
 'Eur-FinnishLakelands.Helama.2014',
 'Eur-Seebergsee.Larocque-Tobler.2012',
 'Eur-NorthernScandinavia.Esper.2012',
 'Arc-GulfofAlaska.Wilson.2014',
 'Arc-Kittelfjall.Bjo

In fact, this function has been used throughout these notebooks to extract other types of information. A similar function returns all the dataset IDs.

In [5]:
D.get_all_dataset_ids()

['iso2k100_CO06MOPE']

Notice that the function returned only one item. The reason is that most of these files were created before `datasetIDs` were prevalent on the LiPDverse. Therefore, only the dataset loaded from that website contains an ID.

<div class="alert alert-success">
<b>Note:</b> `datasetIDs` will be used in future versions of `PyLiPD` to directly query the LiPDverse, without the need to pass the direct URL.
</div>

Another function that looks up information stored at the dataset level is `get_all_archiveTypes`. It works a little differently from the previous functions in that it returns only the unique names present in these datasets:

In [6]:
D.get_all_archiveTypes()

['coral',
 'tree',
 'documents',
 'lake sediment',
 'speleothem',
 'marine sediment',
 'glacier ice',
 'Coral']

This function is particularly useful to know what terms can be used to [filter with specific queries](L1_filtering.ipynb).
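
Note that the output above lists both `'coral'` and `'Coral'`: the terms are reported exactly as they are stored in the files. The following is a minimal sketch, in plain Python, of one way to spot such case variants before building a filter. The helper `group_case_variants` is hypothetical (it is not part of `PyLiPD`), and the list is copied by hand from the output above.

```python
from collections import defaultdict

# Values copied from the output of D.get_all_archiveTypes() shown above.
archive_types = ['coral', 'tree', 'documents', 'lake sediment',
                 'speleothem', 'marine sediment', 'glacier ice', 'Coral']


def group_case_variants(names):
    """Group strings that differ only by capitalization or outer whitespace.

    Hypothetical helper for illustration; not a PyLiPD function.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.strip().lower()].append(name)
    return dict(groups)


variants = group_case_variants(archive_types)

# Keep only the terms that appear under more than one spelling.
duplicates = {key: spellings
              for key, spellings in variants.items()
              if len(spellings) > 1}
print(duplicates)
```

In the notebook, you would pass `D.get_all_archiveTypes()` to the helper instead of the hand-copied list.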

You can get information about the location of each dataset as follows:

In [7]:
df_loc = D.get_all_locations()

df_loc

| | dataSetName | geo_meanLat | geo_meanLon | geo_meanElev |
|---|---|---|---|---|
| 0 | Ocn-RedSea.Felis.2000 | 27.85 | 34.32 | -6.0 |
| 1 | Arc-Forfjorddalen.McCarroll.2013 | 68.73 | 15.73 | 200.0 |
| 2 | Eur-Tallinn.Tarand.2001 | 59.4 | 24.75 | 10.0 |
| 3 | Eur-CentralEurope.Dobrovolný.2009 | 49.0 | 13.0 | |
| 4 | Eur-EuropeanAlps.Büntgen.2011 | 47.0 | 10.7 | 2050.0 |
| 5 | Eur-CentralandEasternPyrenees.Pla.2004 | 42.5 | 0.75 | 2280.0 |
| 6 | Arc-Tjeggelvas.Bjorklund.2012 | 66.6 | 17.6 | 520.0 |
| 7 | Arc-Indigirka.Hughes.1999 | 69.5 | 147.0 | 80.0 |
| 8 | Eur-SpannagelCave.Mangini.2005 | 47.1 | 11.6 | 2347.0 |
| 9 | Ocn-AqabaJordanAQ19.Heiss.1999 | 29.42 | 34.97 | -1.0 |
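
Since the locations come back as a Pandas `DataFrame`, standard Pandas indexing applies. The sketch below rebuilds a few rows of the table above by hand so that it runs on its own; the 50°N threshold is an arbitrary choice for illustration. In the notebook, you would use `df_loc` directly.

```python
import pandas as pd

# A few rows copied by hand from the table above.
# In the notebook, use df_loc = D.get_all_locations() instead.
df_loc = pd.DataFrame({
    'dataSetName': ['Ocn-RedSea.Felis.2000',
                    'Arc-Forfjorddalen.McCarroll.2013',
                    'Eur-Tallinn.Tarand.2001',
                    'Eur-CentralEurope.Dobrovolný.2009'],
    'geo_meanLat': [27.85, 68.73, 59.4, 49.0],
    'geo_meanLon': [34.32, 15.73, 24.75, 13.0],
    'geo_meanElev': [-6.0, 200.0, 10.0, None],  # None becomes NaN
})

# Datasets located north of 50°N (arbitrary threshold).
north = df_loc[df_loc['geo_meanLat'] > 50]

# Datasets for which no elevation is recorded.
missing_elev = df_loc[df_loc['geo_meanElev'].isna()]

print(north['dataSetName'].tolist())
print(missing_elev['dataSetName'].tolist())
```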


### Getting information about variables

To get information about available variable names, you can do the following:

In [7]:
D.get_all_variable_names()

['year',
 'd18O',
 'MXD',
 'temperature',
 'JulianDay',
 'trsgi',
 'age',
 'sampleID',
 'uncertainty_temperature',
 'density',
 'd13C',
 'sampleDensity',
 'Na',
 'thickness',
 '230th/232th_uncertainty',
 '230th age',
 '238u_uncertainty',
 '232th',
 'sample',
 'depth',
 'd234u',
 'd234uinitial',
 '232th_uncertainty',
 'depth_dating',
 'Year',
 '230th/238u',
 'd18o',
 '238u',
 '230th age_uncertainty',
 'corr_age',
 '230th age_uncertaity',
 'd234uinitial_uncertainty',
 '230th/232th',
 'd234u_undertainty',
 'corr_age_uncert',
 '230th/238u_uncertainty',
 'ukprime37.Uk37Prime',
 'sst',
 'median',
 'upper95',
 'd180',
 'sample label',
 'c. wuellerstorfi d18o.d18O',
 'section',
 'c. wuellerstorfi d13c.d13C',
 'interval',
 'temp prahl',
 'depth comp',
 'event',
 'depth cr',
 'c37 total.Alkenone',
 'u. peregrina d18o.d18O',
 'site/hole',
 'temp muller',
 'u. peregrina d13c.d13C',
 'lower95']

Note that, like `get_all_archiveTypes`, this function returns only the unique names. As we have explored [previously](L0_lipd_object.ipynb), the Euro2k database contains more than one record corresponding to `temperature`. Again, this function can be used to figure out what to filter by.
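
The names are case-sensitive and reflect what is stored in each file: the list above contains both `year` and `Year`, `d18O` and `d18o`, and even `d180` (written with a zero). The sketch below shows one way to search the list with a case-insensitive regular expression before deciding which exact names to filter by. The helper `find_names` is hypothetical (it is not part of `PyLiPD`), and only a subset of the names above is copied here.

```python
import re

# A subset of the names returned by D.get_all_variable_names() above.
variable_names = ['year', 'd18O', 'MXD', 'temperature', 'Year', 'd18o',
                  'd180', 'c. wuellerstorfi d18o.d18O',
                  'u. peregrina d18o.d18O', 'sst']


def find_names(names, pattern):
    """Return the names matching a regular expression, ignoring case.

    Hypothetical helper for illustration; not a PyLiPD function.
    """
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return [name for name in names if regex.search(name)]


# Any name containing "d18o", whatever the capitalization.
d18o_like = find_names(variable_names, r'd18o')

# Names that are exactly "year", whatever the capitalization.
year_like = find_names(variable_names, r'^year$')

print(d18o_like)
print(year_like)
```

Note that `d180` is not matched by the first pattern, since it ends with the digit zero rather than the letter "o". A search like this can help catch such spellings when you inspect the full list.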

If you want to have more granularity about which variable is available in which datasets and their associated unique IDs, you can use the following function:

In [8]:
D.get_all_variables()

| | uri | TSID | variableName |
|---|---|---|---|
| 0 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000... | PYTXPC7HUA2 | year |
| 1 | http://linked.earth/lipd/Ocn-RedSea.Felis.2000... | Ocean2kHR_019 | d18O |
| 2 | http://linked.earth/lipd/Arc-Forfjorddalen.McC... | Arc_150 | MXD |
| 3 | http://linked.earth/lipd/Arc-Forfjorddalen.McC... | PYTRQDXUHJN | year |
| 4 | http://linked.earth/lipd/Eur-Tallinn.Tarand.20... | Eur_026 | temperature |
| ... | ... | ... | ... |
| 125 | http://linked.earth/lipd/paleo0measurement0.PY... | PYTGO6NV72Y | temp muller |
| 126 | http://linked.earth/lipd/paleo0measurement1.PY... | PYTTUPVG4K3 | u. peregrina d13c.d13C |
| 127 | http://linked.earth/lipd/chron0model0summary0.... | PYTI487BQDZ | lower95 |
| 128 | http://linked.earth/lipd/CO06MOPE.paleo1measur... | LPDca7af9c3 | year |
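
Because the result is a Pandas `DataFrame`, you can look up the unique IDs attached to a given variable name, or count how often each name occurs. The sketch below rebuilds a few rows of the table above by hand (with the URIs truncated as displayed) so that it runs on its own. In the notebook, you would work on the `DataFrame` returned by `D.get_all_variables()`.

```python
import pandas as pd

# A few rows copied by hand from the table above; the URIs are kept
# truncated, exactly as they are displayed.
df_vars = pd.DataFrame({
    'uri': ['http://linked.earth/lipd/Ocn-RedSea.Felis.2000...',
            'http://linked.earth/lipd/Ocn-RedSea.Felis.2000...',
            'http://linked.earth/lipd/Arc-Forfjorddalen.McC...',
            'http://linked.earth/lipd/Arc-Forfjorddalen.McC...',
            'http://linked.earth/lipd/Eur-Tallinn.Tarand.20...',
            'http://linked.earth/lipd/CO06MOPE.paleo1measur...'],
    'TSID': ['PYTXPC7HUA2', 'Ocean2kHR_019', 'Arc_150',
             'PYTRQDXUHJN', 'Eur_026', 'LPDca7af9c3'],
    'variableName': ['year', 'd18O', 'MXD', 'year', 'temperature', 'year'],
})

# All the unique IDs (TSID) attached to variables named "year".
year_tsids = df_vars.loc[df_vars['variableName'] == 'year', 'TSID'].tolist()

# How many times each variable name occurs in these rows.
counts = df_vars['variableName'].value_counts()

print(year_tsids)
print(counts)
```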


### Get a bibliography

You can also retrieve the publication information and save it as a BibTeX (`.bib`) file:

In [11]:
bibs, df = D.get_bibtex(remote=True, save=True, path='../data/mybiblio.bib', verbose=False)

Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the provided DOI, creating the entry manually
Cannot find a matching record for the pr

Let's decompose the parameters for this [function](https://pylipd.readthedocs.io/en/latest/source/pylipd.html#pylipd.lipd.LiPD.get_bibtex):

* `remote`: If set to True, `PyLiPD` will use the `crossref` function in the [`doi2bib`](https://pypi.org/project/doi2bib/) package to retrieve the bibliography. This option requires an internet connection. If the retrieval fails, the entry will be created from the information in the LiPD file. If set to False, only the information in the file will be used.
* `save`, `path`: If `save` is set to True, `PyLiPD` will save the entries in a `.bib` file at the location given by `path`. In this example, we saved the file to the data folder contained in this repository.
* `verbose`: If set to True, the bibliography will be printed to the screen.

In addition to saving the file, the function returns `bibs`, a list of bibliography entries as text, and `df`, which presents the same information in a Pandas `DataFrame`.

In [13]:
df.head()

| | dsname | title | authors | doi | pubyear | year | journal | volume | issue | pages | type | publisher | report | citeKey | edition | institution | url | url2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ocn-RedSea.Felis.2000 | World Data Center for Paleoclimatology | T. Felis | | | | | | | | dataCitation | | | felis2000httpswwwncdcnoaagovpaleostudy1861Data... | | World Data Center for Paleoclimatology | | https://www.ncdc.noaa.gov/paleo/study/1861 |
| 1 | Ocn-RedSea.Felis.2000 | Tropical sea surface temperatures for the past... | Cyril Giry and Jens Zinke and Casey P. Saenger... | 10.1002/2014PA002717 | | 2015.0 | Paleoceanography | 30.0 | 3.0 | 226-252 | journal-article | Wiley-Blackwell | | tierney2015tropicalseasurfacetempera | | | http://dx.doi.org/10.1002/2014PA002717 | |
| 2 | Ocn-RedSea.Felis.2000 | A coral oxygen isotope record from the norther... | Thomas Felis and Gerold Wefer and Maoz Fine an... | 10.1029/1999PA000477 | | 2000.0 | Paleoceanography | 15.0 | 6.0 | 679-694 | article | Wiley-Blackwell | | felis2000acoraloxygenisotoperecord | | | http://dx.doi.org/10.1029/1999PA000477 | |
| 3 | Arc-Forfjorddalen.McCarroll.2013 | A 1200-year multiproxy record of tree growth a... | M. Lindholm and H. Grudd and D. McCarroll and ... | 10.1177/0959683612467483 | | 2013.0 | The Holocene | 23.0 | | 471-484 | article | SAGE Publications | | mccarroll2013a1200yearmultiproxyrecord | | | http://dx.doi.org/10.1177/0959683612467483 | |
| 4 | Arc-Forfjorddalen.McCarroll.2013 | This study | D. McCarroll | | | | | | | | dataCitation | | | mccarroll0thisstudyDataCitation | | | | this study |
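
The `bibs` list can be inspected with plain Python as well. The sketch below assumes that each element of `bibs` is a string holding one or more BibTeX entries. The sample entry is written by hand from row 2 of the table above, so the exact text produced by `PyLiPD` may differ. The helper `list_entries` is hypothetical and not part of `PyLiPD`.

```python
import re

# Hypothetical entry, written by hand from row 2 of the table above.
# The exact text produced by PyLiPD may be formatted differently.
bibs = [
    """@article{felis2000acoraloxygenisotoperecord,
  doi = {10.1029/1999PA000477},
  year = {2000},
  journal = {Paleoceanography},
  volume = {15},
  pages = {679-694}
}""",
]

# Matches the opening of an entry, e.g. "@article{citekey,".
ENTRY_HEADER = re.compile(r'@(\w+)\s*\{\s*([^,\s]+)\s*,')


def list_entries(bib_strings):
    """Return (entry type, cite key) pairs found in BibTeX text.

    Hypothetical helper for illustration; not a PyLiPD function.
    """
    found = []
    for text in bib_strings:
        found.extend(ENTRY_HEADER.findall(text))
    return found


print(list_entries(bibs))
```

A quick listing like this can be used to cross-check the cite keys against the `citeKey` column of `df`.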
