<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Downloading Data and Metadata from PANGAEA with `pangaeapy`

This Jupyter Notebook demonstrates how to retrieve multiple datasets and their metadata from the [PANGAEA](https://www.pangaea.de/) data repository using the [`pangaeapy`](https://pypi.org/project/PANGAEApy/) Python package. It was developed with reference to the [PANGAEA community workshop materials on GitHub](https://github.com/pangaea-data-publisher/community-workshop-material/), which provide additional information on PANGAEA data retrieval.

# 1. Preparation

## 1.1 Import Libraries

You might need to install `pangaeapy` first:

In [None]:
!pip install pangaeapy

Import (load) `pangaeapy`:

In [None]:
import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet

Other Python packages used in this notebook. `pandas`, `numpy`, and `requests` must be installed before they can be imported; `os` and `urllib` ship with Python:

In [None]:
import os
import pandas as pd
import numpy as np
import requests 
from urllib.request import urlopen, urlretrieve

To ignore warnings in this script:

In [None]:
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))
warnings.simplefilter(action='ignore', category=FutureWarning)
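
The two `simplefilter` calls above silence these warning categories for the rest of the session. If you would rather suppress warnings only around one step, the standard library's `warnings.catch_warnings()` context manager is one option. The sketch below is a minimal illustration; `noisy_step` is a hypothetical stand-in for any call that emits a `FutureWarning`, not a function from `pangaeapy`.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a call that emits a FutureWarning
    warnings.warn("this behaviour will change", FutureWarning)
    return 42


# Record warnings so we can compare what happens inside and outside the block
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from a state where nothing is hidden

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        result_quiet = noisy_step()  # the warning is suppressed here only
    n_inside = len(caught)

    result_loud = noisy_step()  # filters are restored, so the warning is recorded
    n_after = len(caught)
```

Inside the inner `with` block no warning is recorded; once the block ends, the previous filter settings apply again.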

## 1.2 `pangaeapy` Documentation

To call the `pangaeapy` documentation (uncomment):

In [None]:
#help(pan) # help on package pangaeapy
#help(pan.panquery) # help on module pangaeapy.panquery in pangaeapy
help(pan.pandataset) # help on module pangaeapy.pandataset in pangaeapy

Searching in PANGAEA is also documented [here](https://wiki.pangaea.de/wiki/PANGAEA_search).

## 1.3 Create Data Folders to Organize and Store Downloaded Datasets

Define directories for storing data:

In [None]:
data_directory = "../Data/PANGAEA_orca_data"
dataset_directory = "../Data/PANGAEA_orca_data/Datasets"

Create main data directory if it doesn't exist:

In [None]:
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

Create subdirectory for individual datasets if it doesn't exist:

In [None]:
if not os.path.isdir(dataset_directory):
    os.mkdir(dataset_directory)
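
As an alternative to the two `isdir`/`mkdir` steps above, `os.makedirs` can create a nested folder path in a single call. The sketch below uses a temporary base folder (an assumption made so the example does not touch the real `../Data` tree); the subfolder names mirror the ones defined above.

```python
import os
import tempfile

# Temporary base folder so the sketch leaves the real "../Data" tree untouched
base = tempfile.mkdtemp()
nested = os.path.join(base, "PANGAEA_orca_data", "Datasets")

# Creates "PANGAEA_orca_data" and "Datasets" in one call;
# exist_ok=True turns a repeated call into a no-op instead of raising FileExistsError
os.makedirs(nested, exist_ok=True)
os.makedirs(nested, exist_ok=True)

created = os.path.isdir(nested)
```

Because missing parent folders are created as well, this also works when the parent folder does not exist yet, a case in which `os.mkdir` raises an error.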

# 2. Query Data

PANGAEA offers various query options, including uncertain spelling, optional query terms ("OR"), author search, and geographical bounding boxes. Here, we want to **query *Orcinus orca* sightings from Polarstern cruises in the Arctic**. This query stays below the limit of 500 datasets per request. To keep the code applicable to larger result sets, however, we retrieve the results in chunks of 500.

> Find more examples in the original [PANGAEA community workshop materials on GitHub](https://github.com/pangaea-data-publisher/community-workshop-material/) and look in the `pangaeapy` documentation (Sect. 1.2) for callable query options.

## 2.1 Define Query

A simple text-based search provides expected results:

In [None]:
query = pan.PanQuery('Polarstern orcinus orca', bbox=(-180, 66.565, 180, 90), limit=500)

print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

Below is an alternative query with advanced search options, where we narrow down the broad *Orcinus orca* results using a specific metadata field (here: `basis`).

In [None]:
query = pan.PanQuery('basis:Polarstern AND "orcinus orca"', bbox=(-180, 66.565, 180, 90), limit=500)

print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')
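
Both queries restrict the search with `bbox=(-180, 66.565, 180, 90)`. The sketch below illustrates how such a tuple can be read, assuming the order (minimum longitude, minimum latitude, maximum longitude, maximum latitude); 66.565° N is approximately the Arctic Circle. The helper `in_bbox` and the two test points are illustrative only. How the PANGAEA search matches datasets against the box (for example, fully inside versus overlapping) is not covered here.

```python
# Assumed order: (min longitude, min latitude, max longitude, max latitude)
ARCTIC_BBOX = (-180, 66.565, 180, 90)


def in_bbox(lon, lat, bbox):
    """Return True if the point (lon, lat) lies inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Approximate coordinates, used only to illustrate the check
tromso = in_bbox(18.96, 69.65, ARCTIC_BBOX)       # north of the Arctic Circle
bremerhaven = in_bbox(8.58, 53.55, ARCTIC_BBOX)   # south of the Arctic Circle
```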

## 2.2 Get Query Results

In this step, we loop through the query results and combine the metadata into a single dataframe.

>Note: At this stage, no data files are downloaded yet. Only the search results (metadata) returned by PANGAEA are collected.

In [None]:
# create empty dataframe
df_query_results_all = pd.DataFrame()

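# loop over all results in steps of 500
for i in range(0, query.totalcount, 500):
    
    # re-run the query with an offset so each pass returns the next chunk of up to 500 results
    qs = pan.PanQuery('basis:Polarstern AND "orcinus orca"', bbox=(-180, 66.565, 180, 90), limit=500, offset=i)
    
    # convert qs result with up to 500 entries to dataframe df_qs
    df_qs = pd.DataFrame(qs.result)
    
    # concatenate all individual df_qs into one dataframe
    df_query_results_all = pd.concat([df_query_results_all, df_qs], ignore_index=True)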

In [None]:
# show first 3 lines of query results
df_query_results_all.head(3)

In [None]:
# show last 3 lines of query results
df_query_results_all.tail(3)
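
Chunked retrieval only works if each request starts where the previous one stopped. The following sketch shows that offset logic in isolation. `fetch_page` is a hypothetical stand-in for a single search request and simply returns integer placeholders; it does not contact PANGAEA.

```python
def fetch_page(offset, limit, total):
    """Hypothetical stand-in for one search request: return one page of result IDs."""
    return list(range(offset, min(offset + limit, total)))


def fetch_all(total, limit=500):
    """Collect all results by requesting consecutive pages of size `limit`."""
    results = []
    for offset in range(0, total, limit):  # 0, 500, 1000, ...
        results.extend(fetch_page(offset, limit, total))
    return results


all_results = fetch_all(total=1234)
```

With 1234 results and a page size of 500, three requests are made (offsets 0, 500, and 1000), and every result appears exactly once.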

## 2.3 Save Query Results

In [None]:
# Save as tab-delimited text
df_query_results_all.to_csv(os.path.join(data_directory, "PANGAEA_query.txt"), 
                            encoding="utf-8", 
                            sep="\t", 
                            index=False)

# 3. Get Metadata for Multiple Datasets

## 3.1 Download Metadata

We now iterate over the query results and fetch only the dataset metadata (no data files). This creates a consolidated table that is useful for an overview (title, authors, parameters, geography) and for reuse essentials, including the recommended citation and DOI you’ll need to cite the datasets properly.


> Rate limits: As a safety precaution, the number of metadata requests is limited for a specific time period. If you have larger requests, expect to wait or use a different tool. Find more information in the [PANGAEA community workshop materials on GitHub](https://github.com/pangaea-data-publisher/community-workshop-material/).

> Fields reference: See Sect. 1.2 for callable metadata attributes.
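
One generic way to cope with temporary request failures is to retry with increasing waiting times. The sketch below is not part of `pangaeapy`. The helper `with_retries` and the failing `flaky_request` are hypothetical, and catching the broad `Exception` class is a placeholder, because which error a rate-limited request raises is not specified here.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay, 2*base_delay, ... and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a hypothetical request that fails twice, then succeeds
calls = {"n": 0}
waits = []


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporarily unavailable")
    return "metadata"


# Record the waiting times instead of actually sleeping
outcome = with_retries(flaky_request, sleep=waits.append)
```

In the metadata loop, a call could be wrapped as `with_retries(lambda: PanDataSet(id=value, include_data=False))`.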

In [None]:
for ind, value in df_query_results_all['URI'].items():
    
    # get metadata 
    ds = PanDataSet(id=value, include_data=False) # just metadata

    # store metadata in df in new column
    df_query_results_all.loc[ind,'dataset title'] = ds.title
    df_query_results_all.loc[ind,'abstract'] = ds.abstract
    df_query_results_all.loc[ind,'publication date'] = ds.date
    df_query_results_all.loc[ind,'collection members'] = ', '.join(ds.collection_members)
    df_query_results_all.loc[ind,'isCollection'] = "Yes" if ds.isCollection else "No"
    df_query_results_all.loc[ind,'first author fullname'] = ds.authors[0].fullname
    df_query_results_all.loc[ind,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])
    df_query_results_all.loc[ind,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])
    df_query_results_all.loc[ind,'citation'] = ds.citation
    df_query_results_all.loc[ind,'dataset DOI'] = ds.doi
    df_query_results_all.loc[ind,'mean latitude'] = ds.geometryextent["meanLatitude"]
    df_query_results_all.loc[ind,'mean longitude'] = ds.geometryextent["meanLongitude"]
    campaign_names = {event.campaign.name for event in ds.events if event.campaign and event.campaign.name}
    df_query_results_all.loc[ind, 'campaign'] = "; ".join(campaign_names)
    df_query_results_all.loc[ind,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    df_query_results_all.loc[ind,'device'] = "; ".join(set([device if device else "no device" for device in ds.getEventsAsFrame()["device"]]))
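
The loop above assumes that every dataset lists at least one author; with an empty author list, `ds.authors[0]` would raise an `IndexError`. The sketch below shows one defensive variant of the author handling. The helper `summarize_authors` is hypothetical, and the author records are made-up stand-ins that only mimic the `fullname` and `ORCID` attributes used above.

```python
from types import SimpleNamespace


def summarize_authors(authors):
    """Return first author, all names, and all ORCIDs without failing on empty lists."""
    first = authors[0].fullname if authors else "no author"
    names = "; ".join(a.fullname for a in authors)
    orcids = "; ".join(a.ORCID if a.ORCID else "no ORCID" for a in authors)
    return first, names, orcids


# Made-up author records for illustration
authors = [
    SimpleNamespace(fullname="Doe, Jane", ORCID="0000-0000-0000-0001"),
    SimpleNamespace(fullname="Roe, Richard", ORCID=None),
]
summary = summarize_authors(authors)
empty_summary = summarize_authors([])
```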

Show first two lines of metadata:

In [None]:
df_query_results_all.head(2)

Print unique first author names in metadata:

In [None]:
df_query_results_all['first author fullname'].unique()

## 3.2 Save Metadata

In [None]:
df_query_results_all.to_csv(os.path.join(data_directory, "PANGAEA_metadata.txt"),
                            encoding="utf-8",
                            sep="\t",
                            index=False)

print('PANGAEA metadata saved')

# 4. Download Multiple Datasets

By default, the column names of a downloaded dataset are abbreviated and carry no units. The following function replaces them with the long parameter names, including units:

In [None]:
def get_long_parameters(ds):
    """Translate short parameter names to long names including units.

    Args:
        ds (PANGAEA dataset): PANGAEA dataset
    """
    ds.data.columns =  [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]
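
To see what the list comprehension in `get_long_parameters` produces, the sketch below applies the same naming rule to a small, made-up parameter dictionary. The objects are stand-ins that only mimic the `name` and `unit` attributes; they are not real `pangaeapy` parameter objects.

```python
from types import SimpleNamespace


def long_names(params):
    """Build 'name [unit]' labels, falling back to the bare name when no unit is set."""
    return [f"{p.name} [{p.unit}]" if p.unit else p.name for p in params.values()]


# Made-up parameter dictionary keyed by short name
params = {
    "Date/Time": SimpleNamespace(name="DATE/TIME", unit=None),
    "Temp": SimpleNamespace(name="Temperature, water", unit="°C"),
}
labels = long_names(params)
columns = ["Date/Time", "Temp"]

# Assigning new column labels only works if both lists have the same length
lengths_match = len(labels) == len(columns)
```

Checking the lengths before assigning the labels can help to spot datasets where the parameter list and the table columns do not line up.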

## 4.1 Download Datasets to Dictionary

In this step we download the actual dataset contents from PANGAEA. Each dataset is loaded into a pandas dataframe and stored in a Python dictionary.

> Why a dictionary? A dictionary in Python is like a labeled container ({key: value, ...}), where you can store multiple objects and access them later by their key. Here, we use the PANGAEA dataset ID as the key and the corresponding data table (a dataframe) as the value.
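
A minimal illustration of the dictionary pattern, using short lists of rows in place of dataframes. The IDs and coordinates are made up.

```python
# Keys are dataset IDs (as strings); the values stand in for the data tables
tables = {}
tables["900123"] = [{"Latitude": 70.1, "Longitude": 15.2}]
tables["900124"] = [{"Latitude": 71.4, "Longitude": 18.9}]

first_row = tables["900123"][0]   # look up one table by its key
known_ids = sorted(tables)        # iterating over a dictionary yields its keys
missing = tables.get("000000")    # .get() returns None instead of raising KeyError
```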

In [None]:
# Create an empty dictionary to store downloaded datasets
data_dict = {}

# Loop over all DOIs (or restrict to a subset, e.g. [:20] for the first 20 results)
for pangaea_doi in df_query_results_all['URI']:
 
    # Download the dataset from PANGAEA (enable_cache=True saves it locally for reuse)
    ds = PanDataSet(pangaea_doi, enable_cache=True)
    
    # Replace short parameter names in ds.data with full descriptive names + units
    get_long_parameters(ds)

    # Extract the numeric dataset ID from the DOI string (part after "A.")
    # Example: "10.1594/PANGAEA.900123" → "900123"
    pangaea_id = pangaea_doi.split('A.')[1]

    # Store the dataset's dataframe in the dictionary under its ID
    data_dict[pangaea_id] = ds.data

    # Print a simple progress message
    print("-" * 40)
    print(f'PANGAEA DOI: {pangaea_doi}')
    print(f'Dataset title: {ds.title}')
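
Splitting on `'A.'` works for the DOI format shown in the example comment. A slightly more defensive variant uses a regular expression that looks for the digits after `PANGAEA.` and returns `None` if the pattern is missing. The helper `pangaea_id_from_doi` below is a hypothetical name for illustration, not a `pangaeapy` function.

```python
import re


def pangaea_id_from_doi(doi):
    """Return the digits after 'PANGAEA.' or None if the pattern is absent."""
    match = re.search(r"PANGAEA\.(\d+)", doi)
    return match.group(1) if match else None


id_plain = pangaea_id_from_doi("10.1594/PANGAEA.900123")
id_url = pangaea_id_from_doi("https://doi.org/10.1594/PANGAEA.868991")
id_bad = pangaea_id_from_doi("not a DOI")
```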

Inspect any dataset stored in the dictionary by looking it up by its PANGAEA ID:

In [None]:
data_dict['924703'].head()

## 4.2 Save Individual Datasets

Loop over each dataset in the dictionary and save:

In [None]:
for key, df in data_dict.items():
    # Save as tab-delimited text
    df.to_csv(
        os.path.join(dataset_directory, f'PANGAEA_orca_dataset_{key}.txt'),
        index=False,
        sep="\t",
        encoding="utf-8"
    )

    # Report success only after the file has been written
    print(f'PANGAEA dataset {key} saved')

---

# 5. Exercise: Polarstern Cruise Tracks

We downloaded metadata for datasets containing orca sightings, which includes a column named "campaign". Now, your task is to find and download metadata and data for **three master track datasets** that are part of the same campaigns as the orca datasets.

1. Extract the unique campaign names from the orca datasets and store them in a variable called "orca_campaigns".
> Hint: Use the `.unique()` function from Sect. 3.1.

2. Create a new folder and "Datasets" subfolder for your data inside the "Data" folder (see Sect. 1.3).

3. Build a query that returns the cruise track datasets for three campaigns (see Sect. 2.1):
> - Join multiple campaign names with OR and wrap them in parentheses using text-based search or using the metadata query field "campaign".
> - Add "master track" text-based search with AND to focus only on cruise track datasets.
> - Add the metadata query field device:"Underway cruise track measurements" with AND to exclude other results (e.g., seismic profiles).

4. Get the query results of the newly defined query and save the query to your new data folder (see Sects. 2.2 & 2.3). 

5. Next, download the metadata and datasets and save them to your new folders. You can use the same attributes as before for the metadata download (see Sects. 3 & 4).

# 6. Download All Matching Master Tracks

A fixed query for a single campaign looks like this:

In [None]:
query = pan.PanQuery('campaign:"ARK-XXIII/2" AND "master track" AND device:"Underway cruise track measurements"', bbox=(-180, 66.565, 180, 90), limit=500)

There are programmatic alternatives for querying multiple campaigns in a cleaner and more flexible way.

List of campaign names in the orca datasets from above:

In [None]:
orca_campaigns

Option 1: Loop over campaign names that are in the `orca_campaigns` list:

In [None]:
# Empty DataFrame to store all results
df_query_results_all = pd.DataFrame()

# Loop through each campaign
for campaign in orca_campaigns:
    query_string = f'campaign:"{campaign}" AND "master track" AND device:"Underway cruise track measurements"'
    q = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500)
    df_q = pd.DataFrame(q.result)
    df_query_results_all = pd.concat([df_query_results_all, df_q], ignore_index=True)

# Show the results
df_query_results_all

# Make copy for comparison
df_loop = df_query_results_all.copy()

Option 2: Run a single query in which the `campaign:"..."` terms for all names in the `orca_campaigns` list are joined by `OR` using Python's `str.join()` method:

In [None]:
# Join campaign names correctly with full fielded expressions
campaign_or_string = " OR ".join([f'campaign:"{c}"' for c in orca_campaigns])

# Build final query string
query_string = f'({campaign_or_string}) AND "master track" AND device:"Underway cruise track measurements"'

# Execute query
query = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500)

# Store in DataFrame
df_query_results_all = pd.DataFrame(query.result)

# Show the results
df_query_results_all

# Make copy for comparison
df_or = df_query_results_all.copy()
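
The copies `df_loop` and `df_or` make it possible to check whether both options return the same datasets. One simple approach is to compare the sets of URIs. The sketch below uses made-up URIs, and the helper `compare_results` is hypothetical.

```python
def compare_results(uris_a, uris_b):
    """Compare two collections of URIs, ignoring duplicates and order."""
    a, b = set(uris_a), set(uris_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Made-up URIs; the first list contains a duplicate, as a per-campaign loop might
loop_uris = ["doi:10.1594/PANGAEA.1", "doi:10.1594/PANGAEA.2", "doi:10.1594/PANGAEA.2"]
or_uris = ["doi:10.1594/PANGAEA.2", "doi:10.1594/PANGAEA.3"]
diff = compare_results(loop_uris, or_uris)
```

With the dataframes from this section, the call would be `compare_results(df_loop["URI"], df_or["URI"])`. Empty `only_a` and `only_b` lists indicate that both options found the same datasets.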

We retrieve and save the query results, metadata, and datasets as before:

1. Query

In [None]:
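df_query_results_all = pd.DataFrame()

for i in range(0, query.totalcount, 500):
    
    # re-run the Option 2 query with an offset to fetch the next chunk of up to 500 results
    qs = pan.PanQuery(query_string, bbox=(-180, 66.565, 180, 90), limit=500, offset=i)

    df_qs = pd.DataFrame(qs.result)
    
    df_query_results_all = pd.concat([df_query_results_all, df_qs], ignore_index=True)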

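In [None]:
# Note: data_directory should point to your master track folder from the exercise (Sect. 5);
# if it still points to the orca folder, this overwrites the orca query file
df_query_results_all.to_csv(
    os.path.join(data_directory, "PANGAEA_query.txt"),
    encoding="utf-8",
    sep="\t",
    index=False
)

2. Metadata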

In [None]:
for ind,value in df_query_results_all['URI'].items():
    
    ds = PanDataSet(id=value, include_data=False) # just metadata

    df_query_results_all.loc[ind,'dataset title'] = ds.title
    df_query_results_all.loc[ind,'abstract'] = ds.abstract
    df_query_results_all.loc[ind,'publication date'] = ds.date
    df_query_results_all.loc[ind,'collection members'] = ', '.join(ds.collection_members)
    df_query_results_all.loc[ind,'isCollection'] = "Yes" if ds.isCollection else "No"
    df_query_results_all.loc[ind,'first author fullname'] = ds.authors[0].fullname
    df_query_results_all.loc[ind,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])
    df_query_results_all.loc[ind,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])
    df_query_results_all.loc[ind,'citation'] = ds.citation
    df_query_results_all.loc[ind,'dataset DOI'] = ds.doi
    df_query_results_all.loc[ind,'mean latitude'] = ds.geometryextent["meanLatitude"]
    df_query_results_all.loc[ind,'mean longitude'] = ds.geometryextent["meanLongitude"]
    campaign_names = {event.campaign.name for event in ds.events if event.campaign and event.campaign.name}
    df_query_results_all.loc[ind, 'campaign'] = "; ".join(campaign_names)
    df_query_results_all.loc[ind,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    df_query_results_all.loc[ind,'device'] = "; ".join(set([device if device else "no device" for device in ds.getEventsAsFrame()["device"]]))

In [None]:
df_query_results_all.to_csv(
    os.path.join(data_directory, "PANGAEA_metadata.txt"),
    encoding="utf-8",
    sep="\t",
    index=False
)

3. Datasets

In [None]:
dataset_directory = "../Data/PANGAEA_mastertrack_data/Datasets"

columns = ["DATE/TIME", "LATITUDE", "LONGITUDE", "Event"]

frames = []
for filename in os.listdir(dataset_directory):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(dataset_directory, filename)

        if os.path.getsize(file_path) == 0:
            print(f"Skipping empty file: {filename}")
            continue

        df = pd.read_csv(file_path, sep="\t", usecols=columns)

        if df.empty:
            print(f"Skipping empty DataFrame: {filename}")
            continue

        frames.append(df)

df_mastertrack_all = pd.concat(frames, ignore_index=True)

print(df_mastertrack_all.head())

In [None]:
df

In [None]:
data_dict = {}
for doi in df_query_results_all['URI']:
    ds = PanDataSet(id=doi, include_data=True)
    pangaea_id = doi.split('A.')[1]
    data_dict[pangaea_id] = ds.data

Some DOIs in our master-track results are single datasets (`isCollection = No`) but still non-tabular (they are zipped). `PanDataSet(..., include_data=True)` can’t extract a table from those, so it prints "no tabular data available". They also duplicate cruise-level tabular data we already have. For cruise PS92, the "Isoprene concentrations during cruise" entry is one remaining duplicate that we will manually remove during preprocessing.

In [None]:
for key, df in data_dict.items():
    print(f'PANGAEA dataset {key} saved')
    
    # Save as tab-delimited text
    df.to_csv(
        os.path.join(dataset_directory, f'PANGAEA_master_dataset_{key}.txt'),
        index=False,
        sep="\t",
        encoding="utf-8"
    )
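
Entries without tabular data can also be filtered out of the dictionary before saving, so that no empty files are written. The sketch below uses plain lists as stand-ins for the data tables; `len()` also returns the number of rows for a pandas dataframe, so the same check should carry over. The helper `drop_empty`, the IDs, and the values are made up for illustration.

```python
def drop_empty(tables):
    """Keep only the entries whose table has at least one row."""
    return {key: table for key, table in tables.items() if len(table) > 0}


# Made-up example: one table with a single row, one without any rows
tables = {
    "111": [("2015-05-19", 78.2, 15.6)],
    "222": [],
}
non_empty = drop_empty(tables)
skipped = sorted(set(tables) - set(non_empty))
```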

# 7. Download Individual Datasets

To download a single dataset that we already know, we can query directly using the Digital Object Identifier (DOI) assigned by PANGAEA. Each dataset on PANGAEA has a unique DOI, which acts as a permanent link to its data and metadata. In the example below, we use the DOI 10.1594/PANGAEA.868991 to retrieve the dataset by Jungblut et al. 2017 that contains counts of seabirds, marine mammals, and other megafauna during Polarstern cruise PS83 on its Atlantic transect from Cape Town to Bremerhaven.

Download and save dataset:

In [None]:
ds = PanDataSet("https://doi.org/10.1594/PANGAEA.868991")
df = ds.data
df.to_csv(os.path.join("../Data", "868991_dataset.txt"),
          sep="\t", encoding="utf-8", index=False)

We also want to download the master track for this cruise:

In [None]:
ds = PanDataSet("https://doi.org/10.1594/PANGAEA.832511")
df = ds.data
df.to_csv(os.path.join("../Data", "868991_dataset_mastertrack.txt"),
          sep="\t", encoding="utf-8", index=False)