![PANGAEA_Banner.png](https://gitlab.awi.de/kriemann/nfdi4earth_academy_data/raw/main/logo/PANGAEA_Banner.png)

# How to download data from PANGAEA and manipulate it 

Based on the [PANGAEA Community Workshop script](Python/PANGAEApy_practical/pangaeapy_practical_solutions.ipynb)  
Last updated: 2025-08-20  

This notebook shows how to search for and retrieve diverse earth and environmental data and their metadata from the [PANGAEA data repository](https://www.pangaea.de) using Python. It uses the [PANGAEApy package](https://pypi.org/project/pangaeapy/), version 1.0.22, to download the data.

Check out our [Wiki](https://wiki.pangaea.de/wiki/PANGAEA_search) for further details on searching data in PANGAEA.

## 1. Import libraries

In [None]:
import os
import pandas as pd
import numpy as np
from collections import Counter
import requests 

In [None]:
### PANGAEApy
## if you need to install PANGAEApy use pip
#!pip install pangaeapy # Uncomment to install pangaeapy

## if you need to upgrade PANGAEApy use 
#!pip install pangaeapy --upgrade # Uncomment to upgrade pangaeapy

## check version of PANGAEApy
# !pip show pangaeapy

## for details see https://pypi.org/project/pangaeapy/ 

import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet

### PANGAEApy documentation
To view the PANGAEApy documentation, uncomment one of the following lines.

In [None]:
# help(pan)
### or 
# help(pan.panquery)
### or
# help(pan.pandataset)

In [None]:
### ignore warnings in this script
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))
warnings.simplefilter(action='ignore', category=FutureWarning)

## 2. Query for data in PANGAEA

AIM: How to search for datasets of a particular topic such as a species, location, project or author?

This mirrors the query via the [PANGAEA website](https://pangaea.de/)  

**Note:** The search term is enclosed in single quotes '. If a search term contains a blank, enclose it in additional double quotes " inside the single quotes.  
Example: _'sea ice'_ vs. _'"sea ice"'_  
Example: _'parameter:Temperature, water method:CTD/Rosette'_ vs. _'parameter:"Temperature, water" method:CTD/Rosette'_
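
The quoting rule can be wrapped in a small helper. The following is only a sketch: `quote_term` and `build_query` are hypothetical convenience functions and not part of PANGAEApy. They merely assemble the string that would then be passed to `pan.PanQuery`.

```python
def quote_term(term):
    """Wrap a search term in double quotes if it contains a blank."""
    return f'"{term}"' if " " in term else term


def build_query(*terms, **fields):
    """Assemble a PANGAEA query string (hypothetical helper).

    terms  : free-text search terms, e.g. 'sediment core'
    fields : field filters given as field=value, e.g. parameter='Temperature, water'
    """
    parts = [quote_term(term) for term in terms]
    parts += [f"{field}:{quote_term(value)}" for field, value in fields.items()]
    return " ".join(parts)


print(build_query("Geochemistry", "sediment core"))
print(build_query(parameter="Temperature, water", method="CTD/Rosette"))
```

Field names that contain a colon, such as `citation:title`, are not valid keyword names, but they can be passed as an unpacked dictionary, e.g. `build_query(**{"citation:title": "geological investigations"})`.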

#### General info on query
* limit = maximum number of datasets returned by a query
    * default limit = 10
    * maximum limit = 500
    * To retrieve more than 500 results, use the offset argument, e.g. pan.PanQuery("Triticum", limit = 500, offset=500)
* type: 
    * collection = dataset collection
    * member = individual dataset which can be part of a dataset collection 
* score: Indicates how well the dataset matched the query search term
* help(pan.panquery)
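
How limit and offset work together can be sketched without contacting PANGAEA. The helper `page_offsets` below is hypothetical, and the total of 1234 hits is a made-up number used only for illustration.

```python
def page_offsets(totalcount, page_size=500):
    """Return the offsets needed to page through `totalcount` results."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return list(range(0, totalcount, page_size))


# Example: a query reporting 1234 hits (invented number) needs three requests
total = 1234
for offset in page_offsets(total):
    last = min(offset + 500, total)
    print(f"offset={offset}: results {offset + 1} to {last}")
```

Each offset would be passed to a separate query with `limit = 500`, as shown in section 2.2.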

### 2.1 Basic example queries
* search via [keywords](https://wiki.pangaea.de/wiki/PANGAEA_search)
* search via geographical coordinates a.k.a. bounding box

#### Query PANGAEA with 1 keyword

In [None]:
query = pan.PanQuery('Geochemistry')
### compare with https://pangaea.de/?q=Geochemistry

In [None]:
### query is a PanQuery object with built-in attributes
print(query)

In [None]:
### the following attributes are available
## totalcount, error, query, result
print(query.query)

In [None]:
print(f'There are {query.totalcount} query results.')

In [None]:
### put query results into dataframe
query_results = pd.DataFrame(query.result)
print(f'Total length of data frame query_results is {len(query_results)}.')

In [None]:
query_results

#### Query PANGAEA with combinations of keywords

In [None]:
### find datasets that contain both "Geochemistry" and "sediment core"
## remember how to use the different quotes:
## The search term is enclosed with single quotes '. If your search term includes a blank, use additional double quotes " inside the single quotes.
query = pan.PanQuery('Geochemistry "sediment core"')
print(f'There are {query.totalcount} query results.')

#### Optional query terms

In [None]:
### find datasets that contain "Geochemistry" and either "Spitzbergen" or "Svalbard" 
query = pan.PanQuery('Geochemistry AND (Spitzbergen OR Svalbard)')
print(f'There are {query.totalcount} query results.')

#### Uncertain spelling

In [None]:
### use ? as a placeholder for a single letter with uncertain spelling
query = pan.PanQuery('Pal?nologic')
print(f'There are {query.totalcount} query results.')

#### Specific author

In [None]:
### find datasets of author "Boetius"
query = pan.PanQuery('citation:author:Boetius')
print(f'There are {query.totalcount} query results.') 

#### Within geographical coordinates a.k.a. bounding box

In [None]:
### query database for "Geochemistry" and "sediment core" within a certain geolocation a.k.a. bounding box
## bounding box: bbox=(minlon, minlat,  maxlon, maxlat)
query = pan.PanQuery('Geochemistry "sediment core"', limit = 500, bbox=(-60, 50, -10, 70))
print(f'There are {query.totalcount} query results.')
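
The order of the bounding box values is easy to mix up. A minimal sketch of a sanity check follows, assuming the order (minlon, minlat, maxlon, maxlat) given above. `make_bbox` is a hypothetical helper, not a PANGAEApy function. The two longitudes are deliberately not compared with each other, because how boxes that cross the date line are treated is not covered here.

```python
def make_bbox(minlon, minlat, maxlon, maxlat):
    """Return a (minlon, minlat, maxlon, maxlat) tuple after basic sanity checks."""
    if not (-180 <= minlon <= 180 and -180 <= maxlon <= 180):
        raise ValueError("longitudes must lie between -180 and 180")
    if not (-90 <= minlat <= 90 and -90 <= maxlat <= 90):
        raise ValueError("latitudes must lie between -90 and 90")
    if minlat > maxlat:
        raise ValueError("minlat must not exceed maxlat")
    return (minlon, minlat, maxlon, maxlat)


bbox = make_bbox(-60, 50, -10, 70)
print(bbox)
```

The resulting tuple can be passed as the `bbox` argument of the query above.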

### 2.2 How to query PANGAEA without result limitations
* A single query returns at most 500 datasets.  
* Retrieve further datasets in chunks of 500 using the offset option.  
* Combine all chunks in one data frame.

In [None]:
### Get all results and combine them in data frame.

### define search pattern
search_pattern = 'project:label:PAGES_C-PEAT'

### basic query to get number of search results
query = pan.PanQuery(search_pattern, limit = 500)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

### create empty data frame
df_query_results_all = pd.DataFrame()

### loop over all results in steps of 500
for i in np.arange(0,query.totalcount,500):
    
    ### store result of individual step in qs
    qs = pan.PanQuery(search_pattern, limit = 500, offset=i)
    
    ### convert qs result with 500 entries to data frame df_qs
    df_qs = pd.DataFrame(qs.result)
    
    ### concatenate all individual df_qs into one data frame named df_query_results_all
    df_query_results_all = pd.concat([df_query_results_all,df_qs],ignore_index=True)
    
print(f'df_query_results_all consists of {len(df_query_results_all)} results.')

In [None]:
### show first and last 3 lines
pd.concat( [ df_query_results_all.head(3), df_query_results_all.tail(3) ] )

### 2.3 Quiz

[More information](https://wiki.pangaea.de/wiki/PANGAEA_search) on how to query with keywords

#### 2.3.1 How many datasets contain "geological investigations"?
Hint: "geological investigations" **not** "geological" and "investigations"

In [None]:
# Your solution

In [None]:
### solution
query = pan.PanQuery('"geological investigations"')
print(query.totalcount)

#### 2.3.2 How many datasets contain "geological investigations" in the title only?

In [None]:
# Your solution

In [None]:
### solution
query = pan.PanQuery('citation:title:"geological investigations"')
print(query.totalcount)

#### 2.3.3 How many datasets measured "Temperature, water" using a CTD/Rosette?

In [None]:
# Your solution

In [None]:
### solution
query = pan.PanQuery('parameter:"Temperature, water" method:CTD/Rosette')
print(query.totalcount)

## 3. Get metadata of datasets

A long list of metadata is available via PanDataSet. 
A comprehensive list can be found in the built-in documentation  
_help(PanDataSet)_    
or in this notebook full of examples: [pangaeapy_detailed_metadata_search.ipynb](https://github.com/pangaea-data-publisher/community-workshop-material/tree/master/Python/PANGAEApy_practical/pangaeapy_detailed_metadata_search.ipynb)

### 3.1 Get metadata of individual dataset

##### Example dataset from PANGAEA https://doi.pangaea.de/10.1594/PANGAEA.918423 

In [None]:
### 2 ways to ask for dataset metadata

## via URI
# ds = PanDataSet('doi:10.1594/PANGAEA.918423', include_data=False)

## via PANGAEA id number of dataset
## id number can be either int or str
# ds = PanDataSet('918423', include_data=False) 
ds = PanDataSet(918423, include_data=False) 

#### Basic metadata retrieval

In [None]:
### Title
ds.title

In [None]:
### Abstract
ds.abstract

In [None]:
### Authors
print(f'Authors: {"; ".join([x.fullname for x in ds.authors])}')

In [None]:
### Full Reference
ds.citation

In [None]:
### Geolocation
print(f'Latitude: {ds.geometryextent["meanLatitude"]}')
print(f'Longitude: {ds.geometryextent["meanLongitude"]}')

In [None]:
### Parameters
params = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
print(f'Parameters: {params}')

In [None]:
### Event as dataframe
ds.getEventsAsFrame()

In [None]:
### Events as PanEvent objects
print(ds.events)

In [None]:
### the PanEvent objects are stored in a list
print(type(ds.events))

In [None]:
### therefore a loop is an easy way to get the information
for event in ds.events:
    print(event.label)
    print(event.method.name)
    print(event.basis.name)

#### Store metadata in data frame

In [None]:
### create empty data frame
df = pd.DataFrame()

### store metadata in df
df.loc[0,'dataset title'] = ds.title
df.loc[0,'abstract'] = ds.abstract

### ds.authors is a list
df.loc[0,'first author fullname'] = ds.authors[0].fullname
df.loc[0,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])

### authors orcids is a list
df.loc[0,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])

df.loc[0,'citation'] = ds.citation
df.loc[0,'dataset DOI'] = ds.doi
df.loc[0,'west bound longitude'] = ds.geometryextent["westBoundLongitude"]
df.loc[0,'east bound longitude'] = ds.geometryextent["eastBoundLongitude"]
df.loc[0,'south bound latitude'] = ds.geometryextent["southBoundLatitude"]
df.loc[0,'north bound latitude'] = ds.geometryextent["northBoundLatitude"]
### parameters is a list
df.loc[0,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])

### event labels
df.loc[0,'label'] = "; ".join(set(ds.getEventsAsFrame()["label"]))

In [None]:
df

#### Save dataframe as file

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
# Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as csv (comma-separated values) and as tab-delimited text file
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.csv'), encoding='utf-8', index=False)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.txt'), sep='\t', encoding='utf-8', index=False)
print(f'PANGAEA metadata of "{ds.title}" saved')

### 3.2 Getting metadata for multiple datasets

In [None]:
### define search pattern
search_pattern = 'project:label:PAGES_C-PEAT and citation:title:geochemistry'

In [None]:
### do query, pay attention to limit
query = pan.PanQuery(search_pattern, limit = 5)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

In [None]:
### store query results in dataframe
df = pd.DataFrame(query.result)

In [None]:
df

#### Loop over all entries in df and get metadata for each entry
NOTE: As a safety precaution, PANGAEA limits the number of metadata requests within a given time period. If you exceed the limit, you will see the following message:

_Received too many (metadata) requests error (429)...waiting 30s -_

For larger requests, be prepared to wait or use a different tool, e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).
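
One way to lower the chance of running into the request limit is to pause between requests. The sketch below is an assumption-laden illustration: `fetch_with_pause` and `fake_fetch` are hypothetical names, and the pause length is arbitrary because the actual limit is not stated here. A stand-in function replaces the real metadata request so that the example runs without network access.

```python
import time


def fetch_with_pause(items, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Call fetch(item) for every item, pausing between consecutive calls."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            # wait before every request except the first one
            sleep(pause_seconds)
        results.append(fetch(item))
    return results


def fake_fetch(uri):
    """Stand-in for a metadata request (no network access needed)."""
    return f"metadata for {uri}"


uris = ["doi:10.1594/PANGAEA.918423", "doi:10.1594/PANGAEA.804588"]
print(fetch_with_pause(uris, fake_fetch, pause_seconds=0.1))
```

In this notebook, `fetch` could be `lambda uri: PanDataSet(uri, include_data=False)`.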

In [None]:
### Create one data frame for all datasets
data_all = pd.DataFrame()

### loop over all datasets in df
for ind,value in df['URI'].items():
    
    ## use PanDataSet to get the metadata only (include_data=False skips the data download)
    ds = PanDataSet(value, include_data=False)

    print(ind, ds.doi)

    ## put metadata into df in new columns
    df.loc[ind,'Title'] = ds.title
    df.loc[ind,'Publication date'] = ds.date
    df.loc[ind,'Authors'] = "; ".join([x.fullname for x in ds.authors])
    df.loc[ind,'Citation'] = ds.citation
    df.loc[ind,'DOI'] = ds.doi
    df.loc[ind,'Parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    if ds.events:
        df.loc[ind,'Event'] = "; ".join([x.label for x in ds.events])


In [None]:
df

In [None]:
### drop columns no longer needed
df = df.drop(['URI','score','html','type','position'],axis=1)
df

#### Save dataframe as file

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as csv (comma-separated values) and as tab-delimited text file
df.to_csv(os.path.join(data_directory, 'PANGAEA_metadata_df_all.csv'), encoding='utf-8', index=False)
df.to_csv(os.path.join(data_directory, 'PANGAEA_metadata_df_all.txt'), sep='\t', encoding='utf-8', index=False)
print('PANGAEA metadata saved')

### 3.3 Quiz

#### 3.3.1 What is the title of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.804588

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet(804588, include_data=False)
ds.title

#### 3.3.2 What is the name of the second author of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.804588

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet(804588, include_data=False)
ds.authors[1].fullname

#### 3.3.3 Did they measure pH in this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.743969

In [None]:
# Your solution

In [None]:
### solution
ds = PanDataSet("743969")
list_params = list(ds.params)
# print(list_params)
if 'pH' in list_params:
    print('yes')
else:
    print('no')

## 4. Download datasets

### 4.1 Download single dataset

##### Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.972802

In [None]:
ds = PanDataSet(972802)
### ds contains data and metadata
### see metadata section on how to get metadata
type(ds)

In [None]:
### ds.data is data frame
type(ds.data)

In [None]:
### dataset header consists of parameter short names without units
ds.data.head(3)

#### Translate to long parameter names
By default, the column headers are abbreviated parameter names without units.

In [None]:
### Translate short parameters names to long names including unit
def get_long_parameters(ds):
    """Translate short parameter names to long names including unit.

    The column names of ds.data are replaced in place.

    Args:
        ds (PanDataSet): PANGAEA dataset including data
    """
    ds.data.columns =  [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]

In [None]:
get_long_parameters(ds)

In [None]:
ds.data.head(3)

#### Save data

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
### Save as csv (comma-separated values)
ds.data.to_csv(os.path.join(data_directory, f'PANGAEA_dataset_{ds.id}.csv'),index=False)
print(f'PANGAEA dataset "{ds.title}" saved')

### 4.2 Download dataset including binary files e.g. images

##### Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.932826

In [None]:
### Download dataset from PANGAEA
ds = PanDataSet(932826, enable_cache=True)
### Spell out abbreviated parameters
get_long_parameters(ds)
ds.data.head(2)

In [None]:
df

In [None]:
### select only 1 image
## .copy() creates an independent data frame, so that a new column can be added safely
df = ds.data[ds.data['DATE/TIME']=='2021-02-16 03:45:21'].copy()

### Create file urls
df["image_url"] = [f'https://download.pangaea.de/dataset/{ds.id}/files/{img}' for img in df['Image']]

In [None]:
df

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

# ### download images
# for index, file_url in df['image_url'].items():
#     response = requests.get(file_url)
#     ### stop if the download failed
#     response.raise_for_status()
#     ### save image
#     with open(os.path.join(data_directory, df.loc[index,'Image']), 'wb') as f:
#         f.write(response.content)
#     print(df.loc[index,'Image'] + ' downloaded')
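
The file URL and the local file name can also be built by two small helper functions. This is a sketch under assumptions: `file_url` and `local_path` are hypothetical helpers, `example_image.jpg` is an invented file name, and percent-encoding the file name with `quote` is an addition that the code above does not use.

```python
import os
from urllib.parse import quote

# URL pattern as used in the cell above
BASE_URL = "https://download.pangaea.de/dataset"


def file_url(dataset_id, filename):
    """Build the download URL of a file that belongs to a dataset."""
    # quote() percent-encodes blanks and other special characters (assumption)
    return f"{BASE_URL}/{dataset_id}/files/{quote(filename)}"


def local_path(directory, filename):
    """Build the local target path; basename drops any directory components."""
    return os.path.join(directory, os.path.basename(filename))


print(file_url(932826, "example_image.jpg"))
print(local_path("PANGAEA_data", "example_image.jpg"))
```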

### 4.3 Download multiple datasets
* download multiple datasets: data and metadata
* combine data into one dataframe
* combine metadata into one dataframe  

#### Define search pattern and do query

In [None]:
### define search pattern
search_pattern = 'project:label:PAGES_C-PEAT and citation:title:geochemistry' 

### do query
query = pan.PanQuery(search_pattern, limit = 500)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

#### Get all query results and combine them in data frame

In [None]:
### create empty data frame
df_all = pd.DataFrame()

### apply loop if query consists of more than 500 datasets
if query.totalcount > 500:
    ### loop over all results in steps of 500
    for i in np.arange(0,query.totalcount,500):
        
        ### store result of individual step in qs
        qs = pan.PanQuery(search_pattern, limit = 500, offset=i)
        
        ### convert qs result with 500 entries to data frame df_qs
        df_qs = pd.DataFrame(qs.result)
        
        ### concatenate all individual df_qs into one data frame named df_all
        df_all = pd.concat([df_all,df_qs],ignore_index=True)
else:
    df_all = pd.DataFrame(query.result)

print(f'There are {query.totalcount} query results.')
print(f'df_all consists of {len(df_all)} results.')

In [None]:
### show first 3 lines
df_all.head(3)

#### Filter out collections

In [None]:
### keep individual datasets only; .copy() allows adding columns later without warnings
df_all = df_all[df_all['type']=='member'].copy()
print(f'df_all consists of {len(df_all)} results.')

#### Practical functions for complicated datasets
* duplicate parameter names within one dataset
* method as comment  

Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.890478  

In [None]:
### function to find duplicate column names
def find_duplicates(col_names):
    name_counts = Counter(col_names)
    duplicates = [name for name, count in name_counts.items() if count > 1]
    return duplicates, bool(duplicates)

In [None]:
### function to create general parameter long names with unit and method
def generate_general_column_name(param):
    base_name = f'{param.name} [{param.unit}]' if param.unit else param.name 

    if param.method:
        base_name += f', method:{param.method.name}'

    return base_name

In [None]:
### functions to rename duplicate column names so that each one is unique within the dataset
def generate_unique_column_name(param):
    base_name = f'{param.name} [{param.unit}]' if param.unit else param.name
    
    if param.method:
        base_name += f', method:{param.method.name}'
    
    if param.comment:
        return f'{base_name}, comment:{param.comment}'
    else:
        return f'{base_name}, col nr:{param.colno}'

def make_unique_column_names(ds, same_param_name):
    col_names = []
    for param in ds.params.values():
        name = generate_general_column_name(param)
        if name in same_param_name:
            name = generate_unique_column_name(param)
        col_names.append(name)
    return col_names
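
To see what these helpers produce, the following sketch applies the same naming logic to stand-in parameter objects. It is a condensed re-implementation for illustration only: the `SimpleNamespace` objects mimic just the attributes used above (name, unit, method, comment, colno), and the parameter list is invented.

```python
from collections import Counter
from types import SimpleNamespace


def general_name(param):
    """Long name with unit and, if available, method."""
    name = f"{param.name} [{param.unit}]" if param.unit else param.name
    if param.method:
        name += f", method:{param.method.name}"
    return name


def unique_name(param):
    """Long name extended by comment or, if there is none, by column number."""
    base = general_name(param)
    if param.comment:
        return f"{base}, comment:{param.comment}"
    return f"{base}, col nr:{param.colno}"


def unique_column_names(params):
    """Extend only those names that occur more than once."""
    names = [general_name(p) for p in params]
    duplicated = {n for n, count in Counter(names).items() if count > 1}
    return [unique_name(p) if n in duplicated else n for p, n in zip(params, names)]


def make_param(name, colno, unit=None, method=None, comment=None):
    """Create a stand-in for a PANGAEApy parameter object (illustration only)."""
    return SimpleNamespace(name=name, unit=unit, method=method, comment=comment, colno=colno)


params = [
    make_param("DEPTH, sediment/rock", 1, unit="m"),
    make_param("Peat type", 2),
    make_param("Peat type", 3, comment="Loisel et al. 2014"),
    make_param("Carbon, total", 4, unit="%"),
]

for name in unique_column_names(params):
    print(name)
```

Note that two duplicate parameters carrying the same comment would still end up with identical names under this scheme.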

#### Download and combine data and metadata of query results

NOTE: As a safety precaution, PANGAEA limits the number of metadata requests within a given time period. If you exceed the limit, you will see the following message:

_Received too many (metadata) requests error (429)...waiting 30s -_

For larger requests, be prepared to wait or use a different tool, e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).

In [None]:
### Create one data frame for all datasets
data_all = pd.DataFrame()

### loop over all datasets in df_all
# for ind,value in df_all['URI'].items():
### only download first 3 results during workshop
for ind,value in df_all['URI'][0:3].items(): 
    
    ## use PanDataSet to get metadata and data and put them into 2 different dataframes
    ds = PanDataSet(value)

    print(ind, ds.doi)

    ## put metadata into df_all in new columns
    df_all.loc[ind,'Title'] = ds.title
    df_all.loc[ind,'Publication date'] = ds.date
    df_all.loc[ind,'Authors'] = "; ".join([x.fullname for x in ds.authors])
    df_all.loc[ind,'Citation'] = ds.citation
    df_all.loc[ind,'DOI'] = ds.doi
    df_all.loc[ind,'Parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    if ds.events:
        df_all.loc[ind,'Event'] = "; ".join([x.label for x in ds.events])
    
    ### Translate default short parameter names to long parameter names, add unit and method if available
    col_names = []
    for param in ds.params.values():
        col_name = generate_general_column_name(param)
        col_names.append(col_name)

    ### find duplicate column names and make them unique
    same_param_name, double_name = find_duplicates(col_names)

    if double_name:
        col_names = make_unique_column_names(ds, set(same_param_name))
    
    ### rename columns because duplicate column names cause problems when combining dataframes
    ds.data.columns =  col_names
    
    ### take the data of this query result and add the DOI as a column
    df_data = ds.data
    df_data['DOI'] = ds.doi

    ### combine all datasets into one dataframe
    data_all = pd.concat([data_all,df_data], ignore_index=True)


In [None]:
### metadata table
df_all.head(3)

In [None]:
df_all.columns

In [None]:
### rearrange and drop columns
df_all = df_all[['Title','Event', 'Parameters', 'Citation', 'DOI']]

In [None]:
df_all.head(3)

In [None]:
### data table
pd.concat([data_all.head(2),data_all.tail(2)])

#### Check header and merge columns 

In [None]:
### show all header names
data_all.columns

In [None]:
### define parameter/header/column to be kept a.k.a. keep_param
keep_param = 'Peat type, col nr:6'

### define column whose values are copied into keep_param wherever keep_param is NaN
merge_param = 'Peat type, col nr:8'

### the if condition is needed because this example only uses the first 3 datasets, which may not contain both columns
if merge_param in data_all.columns and keep_param in data_all.columns:
    ### merge merge_param into keep_param
    mask = data_all[keep_param].isna() & data_all[merge_param].notna()
    data_all.loc[mask, keep_param] = data_all.loc[mask, merge_param]
    
    ### remove merge_param
    data_all = data_all.drop(columns=[merge_param])

In [None]:
data_all.rename(columns={'Peat type, col nr:6':'Peat type'}, inplace=True)

In [None]:
pd.concat( [data_all.head(2),data_all.tail(2)] )

In [None]:
data_all.columns

In [None]:
### keep only the columns of interest in this order (columns missing from the downloaded datasets are skipped)
keep_columns = ['DOI','Event', 'Latitude [deg]', 'Longitude [deg]', 'Elevation [m]',
                'DEPTH, sediment/rock [m]', 'AGE [ka BP]', 'Density, dry bulk [g/cm**3]',
                'Peat type, col nr:4', 'Peat type, col nr:8', 'Peat type, comment:Loisel et al. 2014',
                'Organic matter [%]','Density, organic matter [g/cm**3]','Density, organic carbon [g/cm**3]',
                'Carbon, total [%]','Nitrogen, total [%]']
data_all = data_all[[col for col in keep_columns if col in data_all.columns]]


In [None]:
pd.concat([data_all.head(2),data_all.tail(2)])

#### Save dataframe as file

In [None]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as tab delimited text file

## set filename
filename1 = 'PAGES_C-PEAT_Geochemistry_metadata.txt'
filename2 = 'PAGES_C-PEAT_Geochemistry_data.txt'

df_all.to_csv(os.path.join(data_directory, filename1), sep='\t', encoding='utf-8', index=False)
data_all.to_csv(os.path.join(data_directory, filename2), sep='\t', encoding='utf-8', index=False)