# Harvesting IODP dataset metadata using OAI-PMH

The following is a brief tutorial on how to access IODP dataset metadata from Zenodo.org using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). More information is available at: https://developers.zenodo.org/#oai-pmh

Below, we use the Python package Sickle to interface with Zenodo via OAI-PMH, retrieve records from the IODP community, filter the records by several criteria, and export the result to a CSV file.

In [5]:
## importing packages

from sickle import Sickle
import pprint
import pandas as pd
import json

In [6]:
## base url for zenodo oai-pmh

sickle = Sickle('https://zenodo.org/oai2d')

In [7]:
# list available metadata formats for retrieval

metadataFormats = sickle.ListMetadataFormats()
list(metadataFormats)

[<MetadataFormat marcxml>,
 <MetadataFormat oai_dc>,
 <MetadataFormat dcat>,
 <MetadataFormat marc21>,
 <MetadataFormat datacite>,
 <MetadataFormat oai_datacite>,
 <MetadataFormat datacite4>,
 <MetadataFormat oai_datacite4>]

In [8]:
# selecting "oai_dc" (dublin core) as metadata format, and "user-iodp" as the Zenodo record set. 
# "iodp" is the name of the community in Zenodo.
# this cell takes ~1 minute to run

records = sickle.ListRecords(metadataPrefix='oai_dc', set='user-iodp')

# capture all records from iterable
records = list(records)
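
Each OAI-PMH request is an ordinary HTTP request whose arguments travel as query parameters. The sketch below builds the URL that corresponds to the `ListRecords` call above, plus the follow-up request for the next page of results. It makes no network call; `build_oai_url` is a hypothetical helper, and the token value is a placeholder rather than a real Zenodo token. Per the OAI-PMH specification, a request that carries a `resumptionToken` must not repeat the other arguments. Sickle follows these tokens automatically while iterating, which is probably a large part of why the cell takes a while to run.

```python
from urllib.parse import urlencode

BASE_URL = 'https://zenodo.org/oai2d'  # same endpoint passed to Sickle above


def build_oai_url(verb, resumption_token=None, **params):
    """Return the URL for one OAI-PMH request (hypothetical helper, no network access)."""
    if resumption_token is not None:
        # a resumptionToken is an exclusive argument: only the verb accompanies it
        query = {'verb': verb, 'resumptionToken': resumption_token}
    else:
        query = {'verb': verb, **params}
    return f'{BASE_URL}?{urlencode(query)}'


first_page = build_oai_url('ListRecords', metadataPrefix='oai_dc', set='user-iodp')
# placeholder token: a real one is returned by the server at the end of each page
next_page = build_oai_url('ListRecords', resumption_token='PLACEHOLDER_TOKEN')

print(first_page)
print(next_page)
```

Opening the first URL in a browser is a quick way to inspect the raw XML that Sickle parses.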

In [9]:
print(f'Number of records: {len(records)}')

Number of records: 842


# API response in XML format

Below is an example of one record in XML format with Dublin Core layout:

```xml
<record xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	<header>
		<identifier>oai:zenodo.org:10211019</identifier>
		<datestamp>2023-11-27T17:34:14Z</datestamp>
		<setSpec>openaire_data</setSpec>
		<setSpec>user-iodp</setSpec>
	</header>
	<metadata>
		<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
			<dc:contributor>International Ocean Discovery Program</dc:contributor>
			<dc:creator>Sager, William</dc:creator>
			<dc:creator>Blum, Peter</dc:creator>
			<dc:creator>Carvallo, Claire A.</dc:creator>
			<dc:creator>Heaton, Daniel</dc:creator>
			<dc:creator>Nelson, Wendy R.</dc:creator>
			<dc:creator>Tshiningayamwe, Mbili</dc:creator>
			<dc:creator>Widdowson, Mike</dc:creator>
			<dc:creator>Avery, Aaron J.</dc:creator>
			<dc:creator>Bhutani, Rajneesh</dc:creator>
			<dc:creator>Buchs, David M.</dc:creator>
			<dc:creator>Class, Cornelia</dc:creator>
			<dc:creator>Dai, Yuhao</dc:creator>
			<dc:creator>Dalla Valle, Giacomo</dc:creator>
			<dc:creator>Del Gaudio, Arianna V.</dc:creator>
			<dc:creator>Fielding, Sharmonay</dc:creator>
			<dc:creator>Gaastra, Kevin M.</dc:creator>
			<dc:creator>Han, Seunghee</dc:creator>
			<dc:creator>Homrighausen, Stephan</dc:creator>
			<dc:creator>Kubota, Yusuke</dc:creator>
			<dc:creator>Li, Chun-Feng</dc:creator>
			<dc:creator>Petrou, Ethan</dc:creator>
			<dc:creator>Potter, Katherine E.</dc:creator>
			<dc:creator>Pujatti, Simone</dc:creator>
			<dc:creator>Scholpp, Jesse</dc:creator>
			<dc:creator>Shervais, John W.</dc:creator>
			<dc:creator>Thoram, Sriharsha</dc:creator>
			<dc:creator>Tikoo-Schantz, Sonia M.</dc:creator>
			<dc:creator>Wang, Xiao-Jun</dc:creator>
			<dc:date>2023-10-11</dc:date>
			<dc:description>Images of the outside of hard rock whole-round sections were acquired using a linescan imager (Section Half Imaging Logger [SHIL]) and a special holder that allows each 90 degree segment of the outer surface to be positioned properly. The images were taken at a resolution of 20 lines/mm (50 micropixels). JRSO staff take these quadrant images and compile them into a side-by-side rollout photograph of the section. Composite images are available as both JPG and TIF image formats. Individual quadrant images are available as JPG images only through this report; contact the &amp;lt;a href="mailto:database@iodp.tamu.edu"&amp;gt;IODP-JRSO Data Librarian&amp;lt;/a&amp;gt; if quadrant TIF files (~160 MB) are needed.</dc:description>
			<dc:identifier>https://doi.org/10.5281/zenodo.10211019</dc:identifier>
			<dc:identifier>oai:zenodo.org:10211019</dc:identifier>
			<dc:publisher>International Ocean Discovery Program</dc:publisher>
			<dc:relation>https://doi.org/10.14379/iodp.proc.391.2023</dc:relation>
			<dc:relation>https://zenodo.org/communities/iodp</dc:relation>
			<dc:relation>https://doi.org/10.5281/zenodo.10211018</dc:relation>
			<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
			<dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
			<dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
			<dc:subject>International Ocean Discovery Program</dc:subject>
			<dc:subject>IODP</dc:subject>
			<dc:subject>JOIDES Resolution</dc:subject>
			<dc:subject>Expedition 391</dc:subject>
			<dc:subject>Expedition 397T</dc:subject>
			<dc:subject>Site U1575</dc:subject>
			<dc:subject>Site U1576</dc:subject>
			<dc:subject>Site U1577</dc:subject>
			<dc:subject>Site U1578</dc:subject>
			<dc:subject>Site U1584</dc:subject>
			<dc:subject>Site U1585</dc:subject>
			<dc:subject>Walvis Ridge Hotspot</dc:subject>
			<dc:subject>Earth connections</dc:subject>
			<dc:subject>Tristan-Gough-Walvis Hotspot</dc:subject>
			<dc:subject>true polar wander</dc:subject>
			<dc:subject>isotopic zonation</dc:subject>
			<dc:subject>large low shear-wave velocity province</dc:subject>
			<dc:subject>LLSVP</dc:subject>
			<dc:title>IODP Expedition 397T Whole-round core section composite 360 degree images</dc:title>
			<dc:type>info:eu-repo/semantics/other</dc:type>
		</oai_dc:dc>
	</metadata>
</record>
```
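
Sickle exposes the `<metadata>` part of each record as a Python dictionary (`record.metadata`) in which every Dublin Core element name maps to a list of values. The following is a rough sketch of that mapping, using only the standard library on an abbreviated copy of the record above. `dc_to_dict` is a hypothetical helper written for illustration, not Sickle's actual implementation.

```python
import xml.etree.ElementTree as ET

DC_NS = '{http://purl.org/dc/elements/1.1/}'  # Dublin Core namespace in ElementTree notation

# abbreviated copy of the record shown above (most elements omitted)
xml_record = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:zenodo.org:10211019</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:contributor>International Ocean Discovery Program</dc:contributor>
      <dc:creator>Sager, William</dc:creator>
      <dc:creator>Blum, Peter</dc:creator>
      <dc:identifier>https://doi.org/10.5281/zenodo.10211019</dc:identifier>
      <dc:identifier>oai:zenodo.org:10211019</dc:identifier>
      <dc:relation>https://doi.org/10.14379/iodp.proc.391.2023</dc:relation>
      <dc:relation>https://zenodo.org/communities/iodp</dc:relation>
      <dc:relation>https://doi.org/10.5281/zenodo.10211018</dc:relation>
    </oai_dc:dc>
  </metadata>
</record>"""


def dc_to_dict(xml_text):
    """Collect every Dublin Core element into {name: [value, ...]}."""
    metadata = {}
    for element in ET.fromstring(xml_text).iter():
        # keep only elements in the dc: namespace and strip the namespace from the key
        if element.tag.startswith(DC_NS):
            key = element.tag[len(DC_NS):]
            metadata.setdefault(key, []).append(element.text)
    return metadata


metadata = dc_to_dict(xml_record)
print(metadata['creator'])
print(metadata['relation'][2])
```

Repeated elements such as `dc:creator` accumulate in document order, which is why later cells index into the lists with `[0]` or `[2]`.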

In [10]:
# keep only the metadata of records that list IODP as a contributor

iodp = 'International Ocean Discovery Program'
records = [
    record.metadata
    for record in records
    if iodp in record.metadata.get('contributor', [])
]

# Example of metadata in JSON format

Below is an example of the metadata content of a single record in JSON format.
```json
{
  "contributor": [
    "International Ocean Discovery Program"
  ],
  "creator": [
    "Sager, William",
    "Hoernle, Kaj",
    "H\u00f6fig, Tobias W.",
    "Avery, Aaron J.",
    "Bhutani, Rajneesh",
    "Buchs, David M.",
    "Carvallo, Claire A.",
    "Class, Cornelia",
    "Dai, Yuhao",
    "Dalla Valle, Giacomo",
    "Del Gaudio, Arianna V.",
    "Fielding, Sharmonay",
    "Gaastra, Kevin M.",
    "Han, Seunghee",
    "Homrighausen, Stephan",
    "Kubota, Yusuke",
    "Li, Chun-Feng",
    "Nelson, Wendy R.",
    "Petrou, Ethan",
    "Potter, Katherine E.",
    "Pujatti, Simone",
    "Scholpp, Jesse",
    "Shervais, John W.",
    "Thoram, Sriharsha",
    "Tikoo-Schantz, Sonia M.",
    "Tshiningayamwe, Mbili",
    "Wang, Xiao-Jun",
    "Widdowson, Mike"
  ],
  "date": [
    "2023-10-11"
  ],
  "description": [
    "Natural gamma radiation (NGR) data in the ~0.1 to 3.0 MeV range were measured using eight custom-designed sodium iodide (thallium) [NaI(Tl)] detectors arranged along the core measurement axis at 20 cm intervals. The NGR system uses layers of passive shielding (lead) and active shielding (plastic scintillators and coincidence electronics) to reduce the cosmic-ray signal for low-count analysis of sediment core sections and to obtain the maximum signal-to-noise ratio. Data are reported on a total counts per second basis and the raw spectral files are available as compressed files for later analysis."
  ],
  "identifier": [
    "https://doi.org/10.5281/zenodo.10206419",
    "oai:zenodo.org:10206419"
  ],
  "publisher": [
    "International Ocean Discovery Program"
  ],
  "relation": [
    "https://doi.org/10.14379/iodp.proc.391.2023",
    "https://zenodo.org/communities/iodp",
    "https://doi.org/10.5281/zenodo.10206418"
  ],
  "rights": [
    "info:eu-repo/semantics/openAccess",
    "Creative Commons Attribution 4.0 International",
    "https://creativecommons.org/licenses/by/4.0/legalcode"
  ],
  "subject": [
    "International Ocean Discovery Program",
    "IODP",
    "JOIDES Resolution",
    "Expedition 391",
    "Expedition 397T",
    "Site U1575",
    "Site U1576",
    "Site U1577",
    "Site U1578",
    "Site U1584",
    "Site U1585",
    "Walvis Ridge Hotspot",
    "Earth connections",
    "Tristan-Gough-Walvis Hotspot",
    "true polar wander",
    "isotopic zonation",
    "large low shear-wave velocity province",
    "LLSVP"
  ],
  "title": [
    "IODP Expedition 391 Natural gamma radiation"
  ],
  "type": [
    "info:eu-repo/semantics/other"
  ]
}
```
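
Because every field is a list of strings, additional filters reduce to membership tests. As an illustration only, the sketch below selects records tagged with a given expedition. `has_subject` is a hypothetical helper, and the sample dictionaries are abbreviated stand-ins, not harvested data: the subject list of the second one is assumed, and the third one is an invented placeholder.

```python
def has_subject(metadata, subject):
    """True if `subject` is in the record's 'subject' list; a missing field counts as no match."""
    return subject in metadata.get('subject', [])


sample_records = [
    {'title': ['IODP Expedition 391 Natural gamma radiation'],
     'subject': ['IODP', 'Expedition 391', 'Site U1575']},
    # subject list assumed for illustration
    {'title': ['IODP Expedition 385 Vane shear strength (Torvane)'],
     'subject': ['IODP', 'Expedition 385']},
    # invented placeholder: a record without any 'subject' field
    {'title': ['Untitled placeholder record']},
]

exp391 = [r for r in sample_records if has_subject(r, 'Expedition 391')]
print([r['title'][0] for r in exp391])
```

Using `.get` with an empty-list default keeps the filter from raising a `KeyError` on records that lack the field.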

# Converting to tabular format and exporting

In [11]:
# iterating through records and extracting relevant metadata

js = []
for r in records:
    
    # records uploaded by the JRSO have 3 relations, in this order: the expedition proceedings doi,
    # the IODP community link, and the dataset parent doi. other records are skipped.
    # .get() avoids a KeyError for records that have no 'relation' field
    if len(r.get('relation', [])) == 3 and r['contributor'][0] == 'International Ocean Discovery Program':
    
        j = {
            'title': r['title'][0],
            'date' : r['date'][0],
            'doi' : r['identifier'][0],
            'parent_doi' : r['relation'][2],
            'proceedings_doi'  : r['relation'][0],
            'type' : r['type'][0]
            
        }
        js.append(j)
    
df = pd.DataFrame(js)
df.head()

| | title | date | doi | parent_doi | proceedings_doi | type |
|---|---|---|---|---|---|---|
| 0 | IODP Expedition 385 Vane shear strength (Torvane) | 2021-09-27 | https://doi.org/10.5281/zenodo.7708697 | https://doi.org/10.5281/zenodo.7708696 | https://doi.org/10.14379/iodp.proc.385.2021 | info:eu-repo/semantics/other |
| 1 | IODP Expedition 354 Titration | 2016-09-07 | https://doi.org/10.5281/zenodo.7869337 | https://doi.org/10.5281/zenodo.7869336 | https://doi.org/10.14379/iodp.proc.354.2016 | info:eu-repo/semantics/other |
| 2 | IODP Expedition 396 X-ray fluorescence (XRF) | 2023-04-06 | https://doi.org/10.5281/zenodo.7850796 | https://doi.org/10.5281/zenodo.7850795 | https://doi.org/10.14379/iodp.proc.396.2023 | info:eu-repo/semantics/other |
| 3 | IODP Expedition 385 Compressional strength (pe... | 2021-09-27 | https://doi.org/10.5281/zenodo.7708606 | https://doi.org/10.5281/zenodo.7708605 | https://doi.org/10.14379/iodp.proc.385.2021 | info:eu-repo/semantics/other |
| 4 | IODP Expedition 354 Section-half images | 2016-09-07 | https://doi.org/10.5281/zenodo.7868403 | https://doi.org/10.5281/zenodo.7868402 | https://doi.org/10.14379/iodp.proc.354.2016 | info:eu-repo/semantics/other |
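
The extraction loop relies on the position of each entry in `relation`. A somewhat more defensive variant, sketched here, identifies the entries by their DOI prefix instead. The prefixes `10.14379/iodp.proc` (proceedings) and `10.5281/zenodo.` (Zenodo parent record) are taken from the example records in this tutorial and are only assumed to hold for all JRSO uploads; `classify_relations` is a hypothetical helper.

```python
def classify_relations(relations):
    """Map relation URLs to roles by DOI prefix rather than by list position."""
    roles = {'proceedings_doi': None, 'parent_doi': None}
    for url in relations:
        if 'doi.org/10.14379/iodp.proc' in url:
            roles['proceedings_doi'] = url
        elif 'doi.org/10.5281/zenodo.' in url:
            roles['parent_doi'] = url
        # anything else (e.g. the community link) is ignored
    return roles


# relation list copied from the JSON example earlier, deliberately shuffled
relations = [
    'https://zenodo.org/communities/iodp',
    'https://doi.org/10.5281/zenodo.10206418',
    'https://doi.org/10.14379/iodp.proc.391.2023',
]
roles = classify_relations(relations)
print(roles)
```

A role left as `None` signals a record that does not follow the expected pattern, which can then be skipped or inspected by hand.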


In [12]:
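# sorting records.
# note that records include all dataset versions.
# DOIs are strings, so they sort lexicographically (e.g. "...zenodo.10206215" comes before "...zenodo.7708696").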

df = (
    df.sort_values(by=['parent_doi','doi', 'title'], ascending=[True, True, True])
      .reset_index(drop=True)
)
df.head()

| | title | date | doi | parent_doi | proceedings_doi | type |
|---|---|---|---|---|---|---|
| 0 | IODP Expedition 391 Hole summary | 2023-10-11 | https://doi.org/10.5281/zenodo.10206216 | https://doi.org/10.5281/zenodo.10206215 | https://doi.org/10.14379/iodp.proc.391.2023 | info:eu-repo/semantics/other |
| 1 | IODP Expedition 391 Scanning electron microsco... | 2023-10-11 | https://doi.org/10.5281/zenodo.10206224 | https://doi.org/10.5281/zenodo.10206223 | https://doi.org/10.14379/iodp.proc.391.2023 | info:eu-repo/semantics/other |
| 2 | IODP Expedition 391 Alkalinity and pH | 2023-10-11 | https://doi.org/10.5281/zenodo.10206227 | https://doi.org/10.5281/zenodo.10206226 | https://doi.org/10.14379/iodp.proc.391.2023 | info:eu-repo/semantics/other |
| 3 | IODP Expedition 391 Photomicrographs | 2023-10-11 | https://doi.org/10.5281/zenodo.10206279 | https://doi.org/10.5281/zenodo.10206278 | https://doi.org/10.14379/iodp.proc.391.2023 | info:eu-repo/semantics/other |
| 4 | IODP Expedition 391 Closeup images | 2023-10-11 | https://doi.org/10.5281/zenodo.10206287 | https://doi.org/10.5281/zenodo.10206286 | https://doi.org/10.14379/iodp.proc.391.2023 | info:eu-repo/semantics/other |
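
Since all versions of a dataset share one `parent_doi`, the table can be reduced to one row per dataset. The sketch below keeps the row with the highest numeric Zenodo record ID in each group, on the assumption (not verified here) that a higher ID means a later version. It works on plain dictionaries shaped like the rows of `df`; `record_id` and `latest_versions` are hypothetical helpers, and the second sample row is invented to stand in for a later version.

```python
def record_id(doi):
    """Numeric Zenodo record ID at the end of a DOI URL, e.g. 7708697."""
    return int(doi.rsplit('.', 1)[-1])


def latest_versions(rows):
    """Keep, for each parent_doi, the row whose DOI has the highest record ID."""
    latest = {}
    for row in rows:
        current = latest.get(row['parent_doi'])
        if current is None or record_id(row['doi']) > record_id(current['doi']):
            latest[row['parent_doi']] = row
    # sort by integer ID: as strings, "10206215" would sort before "7708696"
    return sorted(latest.values(), key=lambda row: record_id(row['parent_doi']))


rows = [
    {'doi': 'https://doi.org/10.5281/zenodo.7708697',
     'parent_doi': 'https://doi.org/10.5281/zenodo.7708696'},
    # invented later version of the same dataset, for illustration only
    {'doi': 'https://doi.org/10.5281/zenodo.7708999',
     'parent_doi': 'https://doi.org/10.5281/zenodo.7708696'},
    {'doi': 'https://doi.org/10.5281/zenodo.10206216',
     'parent_doi': 'https://doi.org/10.5281/zenodo.10206215'},
]
result = latest_versions(rows)
print([row['doi'] for row in result])
```

Rows in this shape can be obtained from the DataFrame with `df.to_dict('records')`.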


In [13]:
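# export to csv (create the output directory first if it does not exist)
import os

os.makedirs("./output", exist_ok=True)
df.to_csv("./output/iodp_community_records_oaidc_format.csv", index=False)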