<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Downloading Data and Metadata from PANGAEA with `pangaeapy`
This Jupyter Notebook demonstrates how to retrieve multiple datasets and their metadata from the [PANGAEA](https://www.pangaea.de/) data repository using the [`pangaeapy`](https://pypi.org/project/PANGAEApy/) Python package. It was developed with reference to the [PANGAEA community workshop materials on GitHub](https://github.com/pangaea-data-publisher/community-workshop-material/), which provide additional information on PANGAEA data retrieval.

# 1. Preparation

## 1.1 Import Libraries

You might need to install `pangaeapy` first:

In [1]:
!pip install pangaeapy



Import (load) `pangaeapy`:

In [2]:
import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet

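Import the other packages used in this notebook. `pandas`, `numpy`, and `requests` must be installed before they can be imported; `os` and `urllib` are part of the Python standard library: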

In [3]:
import os
import pandas as pd
import numpy as np
import requests 
from urllib.request import urlopen, urlretrieve

To ignore warnings in this script:

In [4]:
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))
warnings.simplefilter(action='ignore', category=FutureWarning)

## 1.2 `pangaeapy` Documentation

To call the `pangaeapy` documentation (uncomment):

In [5]:
#help(pan) # help on package pangaeapy
#help(pan.panquery) # help on module pangaeapy.panquery in pangaeapy
help(pan.pandataset) # help on module pangaeapy.pandataset in pangaeapy

Help on module pangaeapy.pandataset in pangaeapy:

NAME
    pangaeapy.pandataset

CLASSES
    builtins.object
        PanAuthor
        PanBasis
        PanCampaign
        PanDataHarvester
        PanDataSet
        PanEvent
        PanLicence
        PanMethod
        PanParam
        PanProject
    
    class PanAuthor(builtins.object)
     |  PanAuthor(lastname, firstname=None, orcid=None, id=None, affiliations=None)
     |  
     |  PANGAEA Author Class.
     |  A simple helper class to declare 'author' objects which are associated as part of the metadata of a given PANGAEA dataset object
     |  
     |  Parameters
     |  ----------
     |  lastname : str
     |      The author's first name
     |  firstname : str
     |      The authors's last name
     |  ORCID : str
     |      The unique ORCID identifier assigned by orcid.org
     |  id : int
     |      The PANGAEA internal id
     |  
     |  Attributes
     |  ----------
     |  lastname : str
     |      The author's fir

Searching in PANGAEA is also documented [here](https://wiki.pangaea.de/wiki/PANGAEA_search).

## 1.3 Create Data Folders to Organize and Store Downloaded Datasets

Define directories for storing data:

In [6]:
data_directory = "../Data/PANGAEA_orca_data"
dataset_directory = "../Data/PANGAEA_orca_data/Datasets"

Create main data directory if it doesn't exist:

In [7]:
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

Create subdirectory for individual datasets if it doesn't exist:

In [8]:
if not os.path.isdir(dataset_directory):
    os.mkdir(dataset_directory)
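
As an alternative to the two separate checks above, the folder creation can be sketched with `os.makedirs`, which creates missing parent folders and, with `exist_ok=True`, does nothing when the folder already exists. The helper name `ensure_directories` is hypothetical, and the example writes to a temporary location rather than to `../Data`:

```python
import os
import tempfile


def ensure_directories(*paths):
    """Create each directory (and any missing parents) if it does not exist yet."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demonstration in a temporary location so nothing is written into ../Data
base = tempfile.mkdtemp()
data_directory = os.path.join(base, "PANGAEA_orca_data")
dataset_directory = os.path.join(data_directory, "Datasets")

ensure_directories(dataset_directory)  # also creates data_directory
ensure_directories(dataset_directory)  # calling it again is harmless
print(os.path.isdir(data_directory), os.path.isdir(dataset_directory))
```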

# 2. Query Data

PANGAEA offers various query options, including uncertain spelling, optional query terms (`OR`), author search, and geographical bounding boxes. Here, we want to **query Orcinus orca sightings from Polarstern cruises in the Arctic**. This query stays well below the limit of 500 datasets per request. To keep the code applicable to larger result sets, however, we retrieve the results in chunks of 500.

> Find more examples in the original [PANGAEA community workshop materials on GitHub](https://github.com/pangaea-data-publisher/community-workshop-material/) and look in the `pangaeapy` documentation (Sect. 1.2) for callable query options.

## 2.1 Define Query

A simple text-based search provides expected results:

In [9]:
query = pan.PanQuery('Polarstern orcinus orca', bbox=(-180, 66.565, 180, 90), limit=500)

print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

There are 33 query results.
Currently query consists of 33 entries.


Below is an alternative query with advanced search options where we narrow down broad orcinus orca results using a specific metadata field (here: basis).

In [10]:
query = pan.PanQuery('basis:Polarstern AND "orcinus orca"', bbox=(-180, 66.565, 180, 90), limit=500)

print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

There are 33 query results.
Currently query consists of 33 entries.


## 2.2 Get Query Results

In this step, we loop through the query results and combine the metadata into a single dataframe.

>Note: At this stage, no data files are downloaded yet. Only the search results (metadata) returned by PANGAEA are collected.

In [11]:
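# create empty dataframe
df_query_results_all = pd.DataFrame()

# query string and bounding box from Sect. 2.1
query_string = 'basis:Polarstern AND "orcinus orca"'
bbox = (-180, 66.565, 180, 90)

# loop over all results in steps of 500
for offset in np.arange(0, query.totalcount, 500):

    # request the next chunk of up to 500 results, starting at the current offset
    qs = pan.PanQuery(query_string, bbox=bbox, limit=500, offset=int(offset))

    # convert the chunk to dataframe df_qs
    df_qs = pd.DataFrame(qs.result)

    # append the chunk to the combined dataframe
    df_query_results_all = pd.concat([df_query_results_all, df_qs], ignore_index=True)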
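
The chunking logic can be looked at in isolation. The following minimal sketch only computes the start offsets needed to page through a result set; the function name `chunk_offsets` is hypothetical, and no request is sent to PANGAEA:

```python
def chunk_offsets(total_count, chunk_size=500):
    """Return the start offsets needed to page through total_count results."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return list(range(0, total_count, chunk_size))


# 33 results fit into a single chunk; 1200 results need three chunks
print(chunk_offsets(33))
print(chunk_offsets(1200))
```

With 33 results, as in this notebook, a single request is enough; the loop only matters for result sets larger than 500.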

In [12]:
# show first 3 lines of query results
df_query_results_all.head(3)

| | URI | score | html | type | position |
|---|---|---|---|---|---|
| 0 | doi:10.1594/PANGAEA.924703 | 19.889349 | `<li><div class="citation"><a href="https://doi...` | member | 0 |
| 1 | doi:10.1594/PANGAEA.982407 | 19.55639 | `<li><div class="citation"><a href="https://doi...` | member | 1 |
| 2 | doi:10.1594/PANGAEA.924707 | 19.045376 | `<li><div class="citation"><a href="https://doi...` | member | 2 |


In [13]:
# show last 3 lines of query results
df_query_results_all.tail(3)

| | URI | score | html | type | position |
|---|---|---|---|---|---|
| 30 | doi:10.1594/PANGAEA.924699 | 17.033512 | `<li><div class="citation"><a href="https://doi...` | member | 30 |
| 31 | doi:10.1594/PANGAEA.924716 | 16.899841 | `<li><div class="citation"><a href="https://doi...` | member | 31 |
| 32 | doi:10.1594/PANGAEA.924712 | 15.422614 | `<li><div class="citation"><a href="https://doi...` | member | 32 |


## 2.3 Save Query Results

In [14]:
# Save as tab-delimited text
df_query_results_all.to_csv(os.path.join(data_directory, "PANGAEA_query.txt"), 
                            encoding="utf-8", 
                            sep="\t", 
                            index=False)

# 3. Get Metadata for Multiple Datasets

## 3.1 Download Metadata

We now iterate over the query results and fetch only the dataset metadata (no data files). This creates a consolidated table that is useful for an overview (title, authors, parameters, geography) and for reuse essentials, including the recommended citation and DOI you’ll need to cite the datasets properly.


> Rate limits: As a safety precaution, the number of metadata requests is limited for a specific time period. If you have larger requests, prepare to wait or use a different tool. Find more information in the [PANGAEA community workshop materials on github](https://github.com/pangaea-data-publisher/community-workshop-material/).

> Fields reference: See Sect. 1.2 for callable metadata attributes.

In [15]:
for ind, value in df_query_results_all['URI'].items():
    
    # get metadata 
    ds = PanDataSet(id=value, include_data=False) # just metadata

    # store metadata in df in new column
    df_query_results_all.loc[ind,'dataset title'] = ds.title
    df_query_results_all.loc[ind,'abstract'] = ds.abstract
    df_query_results_all.loc[ind,'publication date'] = ds.date
    df_query_results_all.loc[ind,'collection members'] = ', '.join(ds.collection_members)
    df_query_results_all.loc[ind,'isCollection'] = "Yes" if ds.isCollection else "No"
    df_query_results_all.loc[ind,'first author fullname'] = ds.authors[0].fullname
    df_query_results_all.loc[ind,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])
    df_query_results_all.loc[ind,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])
    df_query_results_all.loc[ind,'citation'] = ds.citation
    df_query_results_all.loc[ind,'dataset DOI'] = ds.doi
    df_query_results_all.loc[ind,'mean latitude'] = ds.geometryextent["meanLatitude"]
    df_query_results_all.loc[ind,'mean longitude'] = ds.geometryextent["meanLongitude"]
    campaign_names = {event.campaign.name for event in ds.events if event.campaign and event.campaign.name}
    df_query_results_all.loc[ind, 'campaign'] = "; ".join(campaign_names)
    df_query_results_all.loc[ind,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    df_query_results_all.loc[ind,'device'] = "; ".join(set([device if device else "no device" for device in ds.getEventsAsFrame()["device"]]))
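
Because metadata requests are rate limited, a loop like the one above may need to pause between attempts. The following is a generic sketch of retrying a call with a growing delay, not a feature of `pangaeapy`: the names `call_with_retry` and `flaky_request` are hypothetical, and whether `PanDataSet` raises an exception when the limit is reached is an assumption (the output in Sect. 5 suggests that `pangaeapy` can also wait on its own).

```python
import time


def call_with_retry(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait and retry, doubling the delay each time."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= 2


# Demonstration with a stand-in function that fails twice before succeeding
calls = {"n": 0}
waits = []  # records the requested delays instead of actually sleeping


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 'too many requests' response")
    return "metadata"


result = call_with_retry(flaky_request, sleep=waits.append)
print(result, calls["n"], waits)
```

In the notebook, the wrapped call could be something like `lambda: PanDataSet(id=value, include_data=False)`, provided the assumption about raised exceptions holds.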

Show first two lines of metadata:

In [16]:
df_query_results_all.head(2)

| | URI | score | html | type | position | dataset title | abstract | publication date | collection members | isCollection | first author fullname | all authors fullnames | all authors orcids | citation | dataset DOI | mean latitude | mean longitude | campaign | parameters | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doi:10.1594/PANGAEA.924703 | 19.889349 | `<li><div class="citation"><a href="https://doi...` | member | 0 | Whale sightings during POLARSTERN cruise PS99.... | Data on whale distribution and abundance in th... | 2020-11-12T17:15:50 | | No | Burkhardt, Elke | Burkhardt, Elke | 0000-0002-5128-4176 | Burkhardt, Elke (2020): Whale sightings during... | https://doi.org/10.1594/PANGAEA.924703 | 76.3767512 | 12.7475864 | PS99.1 | DATE/TIME; LATITUDE; LONGITUDE; Whale species;... | Underway cruise track measurements |
| 1 | doi:10.1594/PANGAEA.982407 | 19.55639 | `<li><div class="citation"><a href="https://doi...` | member | 1 | Whale sightings during POLARSTERN cruise PS143/1 | Data on whale distribution and abundance in th... | 2025-05-14T10:41:38 | | No | Burkhardt, Elke | Burkhardt, Elke | 0000-0002-5128-4176 | Burkhardt, Elke (2025): Whale sightings during... | https://doi.org/10.1594/PANGAEA.982407 | 76.57997749999998 | 7.631909772727273 | PS143/1 | DATE/TIME; LATITUDE; LONGITUDE; Whale species;... | Underway cruise track measurements |


Print unique first author names in metadata:

In [17]:
df_query_results_all['first author fullname'].unique()

array(['Burkhardt, Elke'], dtype=object)

## 3.2 Save Metadata

In [18]:
df_query_results_all.to_csv(os.path.join(data_directory, "PANGAEA_metadata.txt"),
                            encoding="utf-8",
                            sep="\t",
                            index=False)

print(f'PANGAEA metadata saved')

PANGAEA metadata saved


# 4. Download Multiple Datasets

By default, the column names of `ds.data` are abbreviated parameter names without units. The following function replaces them with the long parameter names, including units where available:

In [19]:
def get_long_parameters(ds):
    """Replace short column names in ds.data with long parameter names including units.

    Args:
        ds (PanDataSet): PANGAEA dataset whose data columns are renamed in place
    """
    ds.data.columns = [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]
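
The naming rule used in `get_long_parameters` can be tried without downloading anything. In this sketch, `Param` is a hypothetical stand-in for the parameter objects of a dataset (only the `name` and `unit` attributes used above are modelled), and the short keys in `params` are illustrative rather than taken from a real dataset:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Param:
    """Hypothetical stand-in for a dataset parameter with a name and an optional unit."""
    name: str
    unit: Optional[str] = None


def long_name(param):
    """Return 'name [unit]' when a unit is present, otherwise just the name."""
    return f"{param.name} [{param.unit}]" if param.unit else param.name


# Illustrative short names mapped to parameter objects
params = {
    "Date/Time": Param("DATE/TIME"),
    "Whale sp": Param("Whale species"),
    "Indiv": Param("Individuals", "#"),
}

columns = [long_name(p) for p in params.values()]
print(columns)
```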

## 4.1 Download Datasets to Dictionary

In this step we download the actual dataset contents from PANGAEA. Each dataset is loaded into a pandas dataframe and stored in a Python dictionary.

> Why a dictionary? A dictionary in Python is like a labeled container ({key: value, ...}), where you can store multiple objects and access them later by their key. Here, we use the PANGAEA dataset ID as the key and the corresponding data table (a dataframe) as the value.

In [20]:
# Create an empty dictionary to store downloaded datasets
data_dict = {}

# Loop over all DOIs (or restrict to a subset, e.g. [:20] for the first 20 results)
for pangaea_doi in df_query_results_all['URI']:
 
    # Download the dataset from PANGAEA (enable_cache=True saves it locally for reuse)
    ds = PanDataSet(pangaea_doi, enable_cache=True)
    
    # Replace short parameter names in ds.data with full descriptive names + units
    get_long_parameters(ds)

    # Extract the numeric dataset ID from the DOI string (part after "A.")
    # Example: "10.1594/PANGAEA.900123" → "900123"
    pangaea_id = pangaea_doi.split('A.')[1]

    # Store the dataset's dataframe in the dictionary under its ID
    data_dict[pangaea_id] = ds.data

    # Print a simple progress message
    print("-" * 40)
    print(f'PANGAEA ID: {pangaea_doi}')
    print(f'Dataset title: {ds.title}')

----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.924703
Dataset title: Whale sightings during POLARSTERN cruise PS99.1 (ARK-XXX/1.1)
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.982407
Dataset title: Whale sightings during POLARSTERN cruise PS143/1
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.924707
Dataset title: Whale sightings during POLARSTERN cruise PS92 (ARK-XXIX/1)
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.924698
Dataset title: Whale sightings during POLARSTERN cruise PS106/1 (ARK-XXXI/1.1)
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.924709
Dataset title: Whale sightings during POLARSTERN cruise PS86 (ARK-XXVIII/3)
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.924710
Dataset title: Whale sightings during POLARSTERN cruise PS85 (ARK-XXVIII/2)
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.9
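
The two ideas used in the download loop, extracting the numeric ID from a DOI and storing one table per ID in a dictionary, can be sketched on their own. The function `pangaea_id_from_doi` is a hypothetical, slightly more defensive variant of the `split('A.')` step (it fails loudly on strings that do not end in a PANGAEA ID), and placeholder strings stand in for the dataframes:

```python
import re


def pangaea_id_from_doi(doi):
    """Return the numeric ID at the end of a PANGAEA DOI string."""
    match = re.search(r"PANGAEA\.(\d+)$", doi.strip())
    if match is None:
        raise ValueError(f"not a PANGAEA DOI: {doi!r}")
    return match.group(1)


dois = ["doi:10.1594/PANGAEA.924703", "doi:10.1594/PANGAEA.982407"]

# Placeholder strings stand in for the dataframes stored in the notebook
tables = {pangaea_id_from_doi(doi): f"table for {doi}" for doi in dois}

print(list(tables))       # keys are the dataset IDs
print(tables["924703"])   # look up one entry by its ID
```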

Inspect any dataset stored in the dictionary by looking it up by its PANGAEA ID:

In [21]:
data_dict['924703'].head()

| | DATE/TIME | LATITUDE | LONGITUDE | Whale species | Certainty of identification | Individuals [#] | Event |
|---|---|---|---|---|---|---|---|
| 0 | 2016-06-17 19:57:00 | 69.68787 | 9.94335 | Orcinus orca | definite | 10 | PS99.1-track |
| 1 | 2016-06-19 02:58:00 | 74.51515 | 15.88732 | Balaenoptera acutorostrata | definite | 3 | PS99.1-track |
| 2 | 2016-06-19 16:16:00 | 74.84158 | 17.64587 | Large whale, unidentified | definite | 1 | PS99.1-track |
| 3 | 2016-06-19 16:29:00 | 74.8415 | 17.65185 | Megaptera novaeangliae | probable | 1 | PS99.1-track |
| 4 | 2016-06-20 08:47:00 | 74.86393 | 17.64258 | Megaptera novaeangliae | possible | 1 | PS99.1-track |


## 4.2 Save Individual Datasets

Loop over each dataset in the dictionary and save:

In [22]:
for key, df in data_dict.items():

    # Save as tab-delimited text
    df.to_csv(
        os.path.join(dataset_directory, f'PANGAEA_orca_dataset_{key}.txt'),
        index=False,
        sep="\t",
        encoding="utf-8"
    )

    # Report success only after the file has been written
    print(f'PANGAEA dataset {key} saved')

PANGAEA dataset 924703 saved
PANGAEA dataset 982407 saved
PANGAEA dataset 924707 saved
PANGAEA dataset 924698 saved
PANGAEA dataset 924709 saved
PANGAEA dataset 924710 saved
PANGAEA dataset 924582 saved
PANGAEA dataset 972704 saved
PANGAEA dataset 929094 saved
PANGAEA dataset 924708 saved
PANGAEA dataset 924717 saved
PANGAEA dataset 924706 saved
PANGAEA dataset 982409 saved
PANGAEA dataset 924589 saved
PANGAEA dataset 924715 saved
PANGAEA dataset 924705 saved
PANGAEA dataset 924701 saved
PANGAEA dataset 982410 saved
PANGAEA dataset 924569 saved
PANGAEA dataset 924702 saved
PANGAEA dataset 924587 saved
PANGAEA dataset 924713 saved
PANGAEA dataset 929096 saved
PANGAEA dataset 896939 saved
PANGAEA dataset 972703 saved
PANGAEA dataset 929095 saved
PANGAEA dataset 924714 saved
PANGAEA dataset 924585 saved
PANGAEA dataset 924704 saved
PANGAEA dataset 972623 saved
PANGAEA dataset 924699 saved
PANGAEA dataset 924716 saved
PANGAEA dataset 924712 saved


---

# 5. Exercise: Polarstern Cruise Tracks

We downloaded metadata for datasets containing orca sightings, which includes a column named "campaign". Now, your task is to find and download metadata and data for **three master track datasets** that are part of the same campaigns as the orca datasets.

1. Extract the unique campaign names from the orca datasets and store them in a variable called "orca_campaigns".
> Hint: Use the `.unique()` function from Sect. 3.1.

In [23]:
orca_campaigns = df_query_results_all['campaign'].unique()
orca_campaigns

array(['PS99.1', 'PS143/1', 'PS92', 'PS106/1', 'PS86', 'PS85', 'PS115/1',
       'PS136', 'ARK-XXVII/1', 'PS87', 'ARK-XXII/1a', 'PS93.1', 'PS143/2',
       'PS106/2', 'ARK-XXII/1c', 'PS93.2', 'PS100', 'PS144', 'PS115/2',
       'PS99.2', 'PS107', 'ARK-XXIII/2', 'ARK-XXVII/3', 'PS114', 'PS138',
       'ARK-XXVII/2', 'ARK-XXIII/1', 'PS108', 'PS94', 'PS137', 'PS101',
       'ARK-XXII/1b', 'ARK-XXIII/3'], dtype=object)

2. Create a new folder and "Datasets" subfolder for your data inside the "Data" folder (see Sect. 1.3).

In [24]:
data_directory = "../Data/PANGAEA_mastertrack_data"
dataset_directory = "../Data/PANGAEA_mastertrack_data/Datasets"

if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

if not os.path.isdir(dataset_directory):
    os.mkdir(dataset_directory)

3. Build a query that returns the cruise track datasets for three campaigns (see Sect. 2.1):
> - Join multiple campaign names with OR and wrap them in parentheses using text-based search or using the metadata query field "campaign".
> - Add "master track" text-based search with AND to focus only on cruise track datasets.
> - Add the metadata query field device:"Underway cruise track measurements" with AND to exclude other results (e.g., seismic profiles).

In [25]:
query = pan.PanQuery('("ARK-XXIII/2" OR "PS85" OR "PS92") AND "master track" AND device:"Underway cruise track measurements"', bbox=(-180, 66.565, 180, 90), limit = 500)

print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

There are 4 query results.
Currently query consists of 4 entries.


In [26]:
query = pan.PanQuery('(campaign:"ARK-XXIII/2" OR campaign:"PS85" OR campaign:"PS92") AND "master track" AND device:"Underway cruise track measurements"', bbox=(-180, 66.565, 180, 90), limit = 500)

print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

There are 4 query results.
Currently query consists of 4 entries.


4. Get the query results of the newly defined query and save the query to your new data folder (see Sects. 2.2 & 2.3). 

In [27]:
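df_query_results_all = pd.DataFrame()

query_string = '(campaign:"ARK-XXIII/2" OR campaign:"PS85" OR campaign:"PS92") AND "master track" AND device:"Underway cruise track measurements"'

for offset in np.arange(0, query.totalcount, 500):

    # request each chunk of up to 500 results with its own offset
    qs = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500, offset=int(offset))

    df_qs = pd.DataFrame(qs.result)

    df_query_results_all = pd.concat([df_query_results_all, df_qs], ignore_index=True)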

In [28]:
df_query_results_all.to_csv(
    os.path.join(data_directory, "PANGAEA_query.txt"),
    encoding="utf-8",
    sep="\t",
    index=False
)

5. Next, download the metadata and datasets and save them to your new folders. You can use the same attributes as before for the metadata download (see Sects. 3 & 4).

In [29]:
for ind,value in df_query_results_all['URI'].items():
    
    ds = PanDataSet(id=value, include_data=False)

    df_query_results_all.loc[ind,'dataset title'] = ds.title
    df_query_results_all.loc[ind,'abstract'] = ds.abstract
    df_query_results_all.loc[ind,'publication date'] = ds.date
    df_query_results_all.loc[ind,'collection members'] = ', '.join(ds.collection_members)
    df_query_results_all.loc[ind,'isCollection'] = "Yes" if ds.isCollection else "No"
    df_query_results_all.loc[ind,'first author fullname'] = ds.authors[0].fullname
    df_query_results_all.loc[ind,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])
    df_query_results_all.loc[ind,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])
    df_query_results_all.loc[ind,'citation'] = ds.citation
    df_query_results_all.loc[ind,'dataset DOI'] = ds.doi
    df_query_results_all.loc[ind,'mean latitude'] = ds.geometryextent["meanLatitude"]
    df_query_results_all.loc[ind,'mean longitude'] = ds.geometryextent["meanLongitude"]
    campaign_names = {event.campaign.name for event in ds.events if event.campaign and event.campaign.name}
    df_query_results_all.loc[ind, 'campaign'] = "; ".join(campaign_names)
    df_query_results_all.loc[ind,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    df_query_results_all.loc[ind,'device'] = "; ".join(set([device if device else "no device" for device in ds.getEventsAsFrame()["device"]]))

In [30]:
df_query_results_all.to_csv(
    os.path.join(data_directory, "PANGAEA_metadata.txt"),
    encoding="utf-8",
    sep="\t",
    index=False
)

In [31]:
data_dict = {}

for pangaea_doi in df_query_results_all['URI']:
    
    ds = PanDataSet(pangaea_doi, enable_cache=True)
    
    get_long_parameters(ds)

    pangaea_id = pangaea_doi.split('A.')[1]

    data_dict[pangaea_id] = ds.data

    print("-" * 40)
    print(f'PANGAEA ID: {pangaea_doi}')
    print(f'Dataset title: {ds.title}')

----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.847933
Dataset title: Station list and links to master tracks in different resolutions of POLARSTERN cruise ARK-XXIII/2, Longyearbyen - Reykjavik, 2008-07-04 - 2008-08-10
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.834205
Dataset title: Station list and links to master tracks in different resolutions of POLARSTERN cruise PS85 (ARK-XXVIII/2), Bremerhaven - Tromsø, 2014-06-06 - 2014-07-03


Received too many requests (for data) error (429)...waiting 30s - https://doi.org/10.1594/PANGAEA.848841


----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.848841
Dataset title: Station list and links to master tracks in different resolutions of POLARSTERN cruise PS92 (ARK-XXIX/1 TRANSSIZ), Bremerhaven - Longyearbyen, 2015-05-19 - 2015-06-28
----------------------------------------
PANGAEA ID: doi:10.1594/PANGAEA.905170
Dataset title: Isoprene concentrations in the surface Arctic waters during POLARSTERN cruise PS92 (ARK-XXIX/1 TRANSSIZ) in May 2015


In [32]:
for key, df in data_dict.items():

    df.to_csv(
        os.path.join(dataset_directory, f'PANGAEA_master_dataset_{key}.txt'),
        index=False,
        sep="\t",
        encoding="utf-8"
    )

    # Report success only after the file has been written
    print(f'PANGAEA dataset {key} saved')

PANGAEA dataset 847933 saved
PANGAEA dataset 834205 saved
PANGAEA dataset 848841 saved
PANGAEA dataset 905170 saved


# 6. Download All Matching Master Tracks

A fixed query for a single campaign looks like this:

In [33]:
query = pan.PanQuery('campaign:"ARK-XXIII/2" AND "master track" AND device:"Underway cruise track measurements"', bbox=(-180, 66.565, 180, 90), limit=500)

To query multiple campaigns in a cleaner and more flexible way, the query can also be built programmatically.

List of campaign names in the orca datasets from above:

In [34]:
orca_campaigns

array(['PS99.1', 'PS143/1', 'PS92', 'PS106/1', 'PS86', 'PS85', 'PS115/1',
       'PS136', 'ARK-XXVII/1', 'PS87', 'ARK-XXII/1a', 'PS93.1', 'PS143/2',
       'PS106/2', 'ARK-XXII/1c', 'PS93.2', 'PS100', 'PS144', 'PS115/2',
       'PS99.2', 'PS107', 'ARK-XXIII/2', 'ARK-XXVII/3', 'PS114', 'PS138',
       'ARK-XXVII/2', 'ARK-XXIII/1', 'PS108', 'PS94', 'PS137', 'PS101',
       'ARK-XXII/1b', 'ARK-XXIII/3'], dtype=object)

Option 1: Loop over campaign names that are in the `orca_campaigns` list:

In [35]:
# Empty DataFrame to store all results
df_query_results_all = pd.DataFrame()

# Loop through each campaign
for campaign in orca_campaigns:
    query_string = f'campaign:"{campaign}" AND "master track" AND device:"Underway cruise track measurements"'
    q = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500)
    df_q = pd.DataFrame(q.result)
    df_query_results_all = pd.concat([df_query_results_all, df_q], ignore_index=True)

# Make copy for comparison
df_loop = df_query_results_all.copy()

# Show the results (only the last expression in a cell is displayed)
df_query_results_all

Option 2: Run a single query in which the `campaign:"..."` terms for all names in the `orca_campaigns` list are joined with `OR` using Python's `str.join` method:

In [36]:
# Join campaign names correctly with full fielded expressions
campaign_or_string = " OR ".join([f'campaign:"{c}"' for c in orca_campaigns])

# Build final query string
query_string = f'({campaign_or_string}) AND "master track" AND device:"Underway cruise track measurements"'

# Execute query
query = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500)

# Store in DataFrame
df_query_results_all = pd.DataFrame(query.result)

# Make copy for comparison
df_or = df_query_results_all.copy()

# Show the results (only the last expression in a cell is displayed)
df_query_results_all
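
The query construction of Option 2 and the comparison of both options can be sketched without contacting PANGAEA. The helper names `build_campaign_query` and `compare_uris` are hypothetical; the first removes duplicate and empty campaign names before joining, and the second reports which URIs appear in only one of two result lists (for example `df_loop['URI']` and `df_or['URI']`):

```python
def build_campaign_query(campaigns, extra_terms=()):
    """Join campaign names into one fielded OR expression and AND the extra terms."""
    # dict.fromkeys keeps the first occurrence of each name and preserves order
    unique = list(dict.fromkeys(c.strip() for c in campaigns if c and c.strip()))
    if not unique:
        raise ValueError("at least one campaign name is required")
    or_part = " OR ".join(f'campaign:"{name}"' for name in unique)
    return " AND ".join([f"({or_part})", *extra_terms])


def compare_uris(uris_a, uris_b):
    """Return the URIs found only in the first and only in the second result list."""
    a, b = set(uris_a), set(uris_b)
    return sorted(a - b), sorted(b - a)


example_query = build_campaign_query(
    ["PS85", "PS92", "PS85", ""],
    extra_terms=['"master track"', 'device:"Underway cruise track measurements"'],
)
print(example_query)

# Two result lists with the same URIs in a different order count as equal
only_first, only_second = compare_uris(
    ["doi:10.1594/PANGAEA.834205", "doi:10.1594/PANGAEA.848841"],
    ["doi:10.1594/PANGAEA.848841", "doi:10.1594/PANGAEA.834205"],
)
print(only_first, only_second)
```

Two empty lists from `compare_uris` would indicate that the loop and the single `OR` query returned the same datasets.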

We retrieve and save the query, metadata and datasets as before:

1. Query

In [37]:
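df_query_results_all = pd.DataFrame()

for offset in np.arange(0, query.totalcount, 500):

    # request each chunk of up to 500 results, reusing query_string from Option 2
    qs = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500, offset=int(offset))

    df_qs = pd.DataFrame(qs.result)

    df_query_results_all = pd.concat([df_query_results_all, df_qs], ignore_index=True)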

In [38]:
df_query_results_all.to_csv(
    os.path.join(data_directory, "PANGAEA_query.txt"),
    encoding="utf-8",
    sep="\t",
    index=False
)

2. Metadata

In [39]:
for ind,value in df_query_results_all['URI'].items():
    
    ds = PanDataSet(id=value, include_data=False) # just metadata

    df_query_results_all.loc[ind,'dataset title'] = ds.title
    df_query_results_all.loc[ind,'abstract'] = ds.abstract
    df_query_results_all.loc[ind,'publication date'] = ds.date
    df_query_results_all.loc[ind,'collection members'] = ', '.join(ds.collection_members)
    df_query_results_all.loc[ind,'isCollection'] = "Yes" if ds.isCollection else "No"
    df_query_results_all.loc[ind,'first author fullname'] = ds.authors[0].fullname
    df_query_results_all.loc[ind,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])
    df_query_results_all.loc[ind,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])
    df_query_results_all.loc[ind,'citation'] = ds.citation
    df_query_results_all.loc[ind,'dataset DOI'] = ds.doi
    df_query_results_all.loc[ind,'mean latitude'] = ds.geometryextent["meanLatitude"]
    df_query_results_all.loc[ind,'mean longitude'] = ds.geometryextent["meanLongitude"]
    campaign_names = {event.campaign.name for event in ds.events if event.campaign and event.campaign.name}
    df_query_results_all.loc[ind, 'campaign'] = "; ".join(campaign_names)
    df_query_results_all.loc[ind,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    df_query_results_all.loc[ind,'device'] = "; ".join(set([device if device else "no device" for device in ds.getEventsAsFrame()["device"]]))

In [40]:
df_query_results_all.to_csv(
    os.path.join(data_directory, "PANGAEA_metadata.txt"),
    encoding="utf-8",
    sep="\t",
    index=False
)

3. Datasets

In [41]:
dataset_directory = "../Data/PANGAEA_mastertrack_data/Datasets"

columns = ["DATE/TIME", "LATITUDE", "LONGITUDE", "Event"]

frames = []
for filename in os.listdir(dataset_directory):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(dataset_directory, filename)

        if os.path.getsize(file_path) == 0:
            print(f"Skipping empty file: {filename}")
            continue

        df = pd.read_csv(file_path, sep="\t", usecols=columns)

        if df.empty:
            print(f"Skipping empty DataFrame: {filename}")
            continue

        frames.append(df)

df_mastertrack_all = pd.concat(frames, ignore_index=True)

print(df_mastertrack_all.head())

             DATE/TIME  LATITUDE  LONGITUDE       Event
0  2014-06-06 00:10:00  53.56549    8.55573  PS85-track
1  2014-06-06 00:20:00  53.56550    8.55573  PS85-track
2  2014-06-06 00:30:00  53.56549    8.55573  PS85-track
3  2014-06-06 00:40:00  53.56549    8.55573  PS85-track
4  2014-06-06 00:50:00  53.56549    8.55572  PS85-track


In [42]:
df

| | DATE/TIME | LATITUDE | LONGITUDE | Event |
|---|---|---|---|---|
| 0 | 2015-05-20 12:00:00 | 57.225 | 5.584 | PS92-track |
| 1 | 2015-05-20 12:03:00 | 57.230 | 5.581 | PS92-track |
| 2 | 2015-05-20 12:05:00 | 57.236 | 5.578 | PS92-track |
| 3 | 2015-05-20 12:07:00 | 57.237 | 5.577 | PS92-track |
| 4 | 2015-05-20 12:09:00 | 57.240 | 5.576 | PS92-track |
| ... | ... | ... | ... | ... |
| 3375 | 2015-05-27 12:21:00 | 80.888 | 18.664 | PS92-track |
| 3376 | 2015-05-27 12:24:00 | 80.890 | 18.686 | PS92-track |
| 3377 | 2015-05-27 12:26:00 | 80.894 | 18.707 | PS92-track |
| 3378 | 2015-05-27 12:29:00 | 80.898 | 18.727 | PS92-track |


In [43]:
data_dict = {}
for doi in df_query_results_all['URI']:
    ds = PanDataSet(id=doi, include_data=True)
    pangaea_id = doi.split('A.')[1]
    data_dict[pangaea_id] = ds.data

Data access failed, no tabular data available - https://doi.org/10.1594/PANGAEA.972617
Data access failed, no tabular data available - https://doi.org/10.1594/PANGAEA.963304
Data access failed, no tabular data available - https://doi.org/10.1594/PANGAEA.972614
Data access failed, no tabular data available - https://doi.org/10.1594/PANGAEA.963840
Data access failed, no tabular data available - https://doi.org/10.1594/PANGAEA.962550
Data access failed, no tabular data available - https://doi.org/10.1594/PANGAEA.974028


Some DOIs in our master-track results are single datasets (`isCollection = No`) but still non-tabular (they are zipped). `PanDataSet(..., include_data=True)` can’t extract a table from those, so it prints "no tabular data available". They also duplicate cruise-level tabular data we already have. For cruise PS92, the "Isoprene concentrations during cruise" entry is one remaining duplicate that we will manually remove during preprocessing.

In [44]:
for key, df in data_dict.items():

    # Save as tab-delimited text
    df.to_csv(
        os.path.join(dataset_directory, f'PANGAEA_master_dataset_{key}.txt'),
        index=False,
        sep="\t",
        encoding="utf-8"
    )

    # Report success only after the file has been written
    print(f'PANGAEA dataset {key} saved')

PANGAEA dataset 972617 saved
PANGAEA dataset 963304 saved
PANGAEA dataset 972614 saved
PANGAEA dataset 972619 saved
PANGAEA dataset 963840 saved
PANGAEA dataset 962550 saved
PANGAEA dataset 972615 saved
PANGAEA dataset 974028 saved
PANGAEA dataset 963315 saved
PANGAEA dataset 963843 saved
PANGAEA dataset 974029 saved
PANGAEA dataset 962551 saved
PANGAEA dataset 847927 saved
PANGAEA dataset 881582 saved
PANGAEA dataset 841006 saved
PANGAEA dataset 847925 saved
PANGAEA dataset 847933 saved
PANGAEA dataset 893556 saved
PANGAEA dataset 841007 saved
PANGAEA dataset 881581 saved
PANGAEA dataset 847926 saved
PANGAEA dataset 848843 saved
PANGAEA dataset 869478 saved
PANGAEA dataset 847932 saved
PANGAEA dataset 835512 saved
PANGAEA dataset 895065 saved
PANGAEA dataset 834205 saved
PANGAEA dataset 864237 saved
PANGAEA dataset 848842 saved
PANGAEA dataset 881579 saved
PANGAEA dataset 855530 saved
PANGAEA dataset 841008 saved
PANGAEA dataset 848841 saved
PANGAEA dataset 881580 saved
PANGAEA datase
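
Datasets for which no tabular data is available are still written to disk by the loop above. A minimal sketch of filtering them out before saving follows, under the assumption that such datasets yield an empty table. The function name `non_empty_tables` is hypothetical, and lists of placeholder rows stand in for dataframes (`len()` also gives the number of rows of a pandas dataframe):

```python
def non_empty_tables(tables):
    """Keep only the entries whose table has at least one row."""
    return {
        key: table
        for key, table in tables.items()
        if table is not None and len(table) > 0
    }


# Placeholder tables: an empty list mimics a dataset without tabular data
tables = {
    "972617": [],
    "847933": [("placeholder row",)],
}

kept = non_empty_tables(tables)
print(list(kept))
```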

# 7. Download Individual Datasets

To download a single dataset that we already know, we can query directly using the Digital Object Identifier (DOI) assigned by PANGAEA. Each dataset on PANGAEA has a unique DOI, which acts as a permanent link to its data and metadata. In the example below, we use the DOI 10.1594/PANGAEA.868991 to retrieve the dataset by Jungblut et al. 2017 that contains counts of seabirds, marine mammals, and other megafauna during Polarstern cruise PS83 on its Atlantic transect from Cape Town to Bremerhaven.

Download and save dataset:

In [45]:
ds = PanDataSet("https://doi.org/10.1594/PANGAEA.868991")
df = ds.data
df.to_csv(os.path.join("../Data", "868991_dataset.txt"),
          sep="\t", encoding="utf-8", index=False)

We also want to download the master track for this cruise:

In [46]:
ds = PanDataSet("https://doi.org/10.1594/PANGAEA.832511")
df = ds.data
df.to_csv(os.path.join("../Data", "868991_dataset_mastertrack.txt"),
          sep="\t", encoding="utf-8", index=False)
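
DOIs appear in several spellings in this notebook: with a `doi:` prefix in the query results and as `https://doi.org/` URLs in this section. The following sketch brings them into one URL form; the helper `doi_url` is hypothetical and not part of `pangaeapy`:

```python
def doi_url(identifier):
    """Turn a 'doi:' string, a bare DOI, or a doi.org URL into an https://doi.org/ URL."""
    text = identifier.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if text.lower().startswith(prefix):
            text = text[len(prefix):]
            break
    if not text.startswith("10."):
        raise ValueError(f"not a DOI: {identifier!r}")
    return f"https://doi.org/{text}"


print(doi_url("doi:10.1594/PANGAEA.868991"))
print(doi_url("10.1594/PANGAEA.832511"))
```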