![PANGAEA_Banner.png](https://gitlab.awi.de/kriemann/nfdi4earth_academy_data/raw/main/logo/PANGAEA_Banner.png)

# How to download data from PANGAEA and manipulate it 

based on [PANGAEA Community Workshop script](Python/PANGAEApy_practical/pangaeapy_practical_solutions.ipynb)  
Last updated: 2025-08-18  

This notebook will guide you how to search and retrieve diverse earth- and environmental data and its metadata from the [PANGAEA data repository](https://www.pangaea.de) using Python. It uses the [PANGAEApy package](https://pypi.org/project/pangaeapy/), version 1.0.22 to facilitate the data download. 

Check out our [Wiki](https://wiki.pangaea.de/wiki/PANGAEA_search) for further details on searching data in PANGAEA.

## 1. Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
from collections import Counter
import requests 

In [2]:
### PANGAEApy
## if you need to install PANGAEApy use pip
#!pip install pangaeapy # Uncomment to upgrade pangaeapy

## if you need to upgrade PANGAEApy use 
#!pip install pangaeapy --upgrade # Uncomment to upgrade pangaeapy

## check version of PANGAEApy
# !pip show pangaeapy

## for details see https://pypi.org/project/pangaeapy/ 

import pangaeapy as pan
from pangaeapy.pandataset import PanDataSet

### PANGAEApy documentation
To call the PANGAEApy documentation uncomment one of the following lines

In [3]:
# help(pan)
### or 
# help(pan.panquery)
### or
# help(pan.pandataset)

In [4]:
### ignore warnings in this script
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))
warnings.simplefilter(action='ignore', category=FutureWarning)

## 2. Query for data in PANGAEA

AIM: How to search for datasets of a particular topic such as a species, location, project or author?

This mirrors the query via the [PANGAEA website](https://pangaea.de/)  

**Note:** The search term is enclosed with single quotes '. If your search term includes a blank, use additional double quotes " inside the single quotes.  
Example: _'sea ice'_ vs. _'"sea ice"'_  
Example: _'parameter:Temperature, water method:CTD/Rosette'_ vs. _'parameter:"Temperature, water" method:CTD/Rosette'_

#### General info on query
* limit = the maximum number of datasets to be returned from query is 500.
    * default limit = 10
    * To download > 500 use the offset attribute e.g. pan.PanQuery("Triticum", limit = 500, offset=500)
* type: 
    * collection = dataset collection
    * member = individual dataset which can be part of a dataset collection 
* score: Indicates how well the dataset matched the query search term
* help(pan.panquery)

### 2.1 Basic example queries
* search via [keywords](https://wiki.pangaea.de/wiki/PANGAEA_search)
* search via geographical coordinates a.k.a. bounding box

#### Query PANGAEA with 1 keyword

In [5]:
query = pan.PanQuery('Geochemistry')
### compare with https://pangaea.de/?q=Geochemistry

In [6]:
### query is a PANGAEApy object with built in objects
print(query)

<pangaeapy.panquery.PanQuery object at 0x7f39c8084d70>


In [7]:
### you can ask the following attributes
## totalcount, error, query, result
print(query.query)

Geochemistry


In [8]:
print(f'There are {query.totalcount} query results.')

There are 23272 query results.


In [9]:
### put query results into dataframe
query_results = pd.DataFrame(query.result)
print(f'Total length of data frame query_results is {len(query_results)}.')

Total length of data frame query_results is 10.


In [10]:
query_results

Unnamed: 0,URI,score,html,type,position
0,doi:10.1594/PANGAEA.763295,26.125828,"<li><div class=""citation""><a href=""https://doi...",collection,0
1,doi:10.1594/PANGAEA.757640,26.000298,"<li><div class=""citation""><a href=""https://doi...",collection,1
2,doi:10.1594/PANGAEA.792717,25.770397,"<li><div class=""citation""><a href=""https://doi...",collection,2
3,doi:10.1594/PANGAEA.763350,25.511566,"<li><div class=""citation""><a href=""https://doi...",collection,3
4,doi:10.1594/PANGAEA.725824,25.318384,"<li><div class=""citation""><a href=""https://doi...",collection,4
5,doi:10.1594/PANGAEA.788108,25.310232,"<li><div class=""citation""><a href=""https://doi...",collection,5
6,doi:10.1594/PANGAEA.738188,25.310232,"<li><div class=""citation""><a href=""https://doi...",collection,6
7,doi:10.1594/PANGAEA.757260,25.244822,"<li><div class=""citation""><a href=""https://doi...",collection,7
8,doi:10.1594/PANGAEA.790780,24.969322,"<li><div class=""citation""><a href=""https://doi...",collection,8
9,doi:10.1594/PANGAEA.746058,24.82097,"<li><div class=""citation""><a href=""https://doi...",collection,9


#### Query PANGAEA with combinations of keywords

In [11]:
### find datasets that contain both "Geochemistry" and "sediment core"
## remember how to use the different quotes:
## The search term is enclosed with single quotes '. If your search term includes a blank, use additional double quotes " inside the single quotes.
query = pan.PanQuery('Geochemistry "sediment core"')
print(f'There are {query.totalcount} query results.')

There are 5742 query results.


#### Optional query terms

In [12]:
### find datasets that contain "Geochemistry" and either "Spitzbergen" or "Svalbard" 
query = pan.PanQuery('Geochemistry AND (Spitzbergen OR Svalbard)')
print(f'There are {query.totalcount} query results.')

There are 1421 query results.


#### Uncertain spelling

In [13]:
### find datasets with uncertain spelling of single letter
query = pan.PanQuery('Pal?nologic')
print(f'There are {query.totalcount} query results.')

There are 857 query results.


#### Specific author

In [14]:
### find datasets of author "Boetius"
query = pan.PanQuery('citation:author:Boetius')
print(f'There are {query.totalcount} query results.') 

There are 1854 query results.


#### Within geographical coordinates a.k.a bounding box

In [15]:
### query database for "Geochemistry" and "sediment core" within a certain geolocation a.k.a. bounding box
## bounding box: bbox=(minlon, minlat,  maxlon, maxlat)
query = pan.PanQuery('Geochemistry "sediment core"', limit = 500, bbox=(-60, 50, -10, 70))
print(f'There are {query.totalcount} query results.')

There are 186 query results.


### 2.2 How to query PANGAEA without result limitations
* The maximum of retrieving search results is 500 datasets.  
* Retrieve datasets in chunks of 500 via offset option.  
* Put all datasets in one data frame.

In [16]:
### Get all results and combine them in data frame.

### define search pattern
search_pattern = 'project:label:PAGES_C-PEAT'

### basic query to get number of search results
query = pan.PanQuery(search_pattern, limit = 500)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

### create empty data frame
df_query_results_all = pd.DataFrame()

### loop over all results in steps of 500
for i in np.arange(0,query.totalcount,500):
    
    ### store result of individual step in qs
    qs = pan.PanQuery(search_pattern, limit = 500, offset=i)
    
    ### convert qs result with 500 entries to data frame df_qs
    df_qs = pd.DataFrame(qs.result)
    
    ### concatenate all individual df_qs into one data frame named query_results_all
    df_query_results_all = pd.concat([df_query_results_all,df_qs],ignore_index=True)
    
print(f'df_query_results_all consists of {len(df_query_results_all)} results.')

There are 758 query results.
Currently query consists of 500 entries.
df_query_results_all consists of 758 results.


In [17]:
### show first and last 3 lines
pd.concat( [ df_query_results_all.head(3), df_query_results_all.tail(3) ] )

Unnamed: 0,URI,score,html,type,position
0,doi:10.1594/PANGAEA.890490,6.520642,"<li><div class=""citation""><a href=""https://doi...",member,0
1,doi:10.1594/PANGAEA.890383,6.520642,"<li><div class=""citation""><a href=""https://doi...",member,1
2,doi:10.1594/PANGAEA.890497,6.520642,"<li><div class=""citation""><a href=""https://doi...",member,2
755,doi:10.1594/PANGAEA.936501,6.244241,"<li><div class=""citation""><a href=""https://doi...",member,755
756,doi:10.1594/PANGAEA.934839,6.244241,"<li><div class=""citation""><a href=""https://doi...",member,756
757,doi:10.1594/PANGAEA.929546,6.244241,"<li><div class=""citation""><a href=""https://doi...",member,757


### 2.3 Quiz

[More information](https://wiki.pangaea.de/wiki/PANGAEA_search) how to query with keywords

#### 2.3.1 How many datasets contain "geological investigations"?
Hint: "geological investigations" **not** "geological" and "investigations"

In [18]:
# Your solution

In [19]:
### solution
query = pan.PanQuery('"geological investigations"')
print(query.totalcount)

57


#### 2.3.2 How many datasets contain "geological investigations" in the title only?

In [20]:
# Your solution

In [21]:
### solution
query = pan.PanQuery('citation:title:"geological investigations"')
print(query.totalcount)

6


#### 2.3.3 How many datasets measured "Temperature, water" using a CTD/Rosette?

In [22]:
# Your solution

In [23]:
### solution
query = pan.PanQuery('parameter:"Temperature, water" method:CTD/Rosette')
print(query.totalcount)

17980


## 3. Get metadata of datasets

A long list of metadata is callable with PanDataSet. 
Find a comprehensive list in internal documentation  
_help(pan.PanQuery)_    
or in this notebook full of examples: [pangaeapy_detailed_metadata_search.ipynb](https://github.com/pangaea-data-publisher/community-workshop-material/tree/master/Python/PANGAEApy_practical/pangaeapy_detailed_metadata_search.ipynb)

### 3.1 Get metadata of individual dataset

##### Example dataset from PANGAEA https://doi.pangaea.de/10.1594/PANGAEA.918423 

In [24]:
### 3 ways to ask for dataset metadata
## complete URL
# ds = PanDataSet('https://doi.pangaea.de/10.1594/PANGAEA.918423', include_data=False) 
## just URI
# ds = PanDataSet('doi:10.1594/PANGAEA.918423', include_data=False)
## just PANGAEA id number of dataset
ds = PanDataSet(918423, include_data=False) 

#### Basic metadata retrieval

In [25]:
### Title
ds.title

'Geochemistry of sediment core Core_PGI01 from Penguin Island (South Shetland Archipelago, Antarctica)'

In [26]:
### Abstract
ds.abstract

In [27]:
### Authors
print(f'Authors: {"; ".join([x.fullname for x in ds.authors])}')

Authors: Guerra, Roberta


In [28]:
### Full Reference
ds.citation

'Guerra, Roberta (2020): Geochemistry of sediment core Core_PGI01 from Penguin Island (South Shetland Archipelago, Antarctica) [dataset]. PANGAEA, https://doi.org/10.1594/PANGAEA.918423, In: Guerra, R (2020): Geochemistry of sediment cores from Deception and Penguin Islands (South Shetland Archipelago, Antarctica) [dataset bundled publication]. PANGAEA, https://doi.org/10.1594/PANGAEA.918427'

In [29]:
### Geolocation
print(f'Latitude: {ds.geometryextent["meanLatitude"]}')
print(f'Longitude: {ds.geometryextent["meanLongitude"]}')

Latitude: -62.11
Longitude: -57.941


In [30]:
### Parameters
params = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
print(f'Parameters: {params}')

Parameters: Depth, top/min [m]; Depth, bottom/max [m]; DEPTH, sediment/rock [m]; Silicon dioxide [%]; Aluminium oxide [%]; Iron oxide, Fe2O3 [%]; Manganese oxide [%]; Magnesium oxide [%]; Calcium oxide [%]; Vanadium [mg/kg]; Chromium [mg/kg]; Nickel [mg/kg]; Copper [mg/kg]; Zinc [mg/kg]; Strontium [mg/kg]; Lead [mg/kg]


In [31]:
### Event as dataframe
ds.getEventsAsFrame()

Unnamed: 0,label,id,latitude,longitude,latitude2,longitude2,elevation,device,deviceid,method,basis,datetime,datetime2,location,campaign
0,Core_PGI01,2941949,-62.11,-57.941,,,-90.0,Box corer,10832,<pangaeapy.pandataset.PanMethod object at 0x7f...,<pangaeapy.pandataset.PanBasis object at 0x7f3...,2007-12-01T00:00:00,2008-01-30T00:00:00,"Penguin Island, South Shetland Archipelago",OPERANTAR_XXVI


In [32]:
### Event as PanEvent object
print(ds.events)

for event in ds.events:
    print(event.label)
    print(event.method.name)
    print(event.basis.name)

[<pangaeapy.pandataset.PanEvent object at 0x7f3915a32f90>]
Core_PGI01
Box corer
Ary Rongel


#### Store metadata in data frame

In [33]:
### create empty data frame
df = pd.DataFrame()

### store metadata in df
df.loc[0,'dataset title'] = ds.title
df.loc[0,'abstract'] = ds.abstract

### ds.authors is a list
df.loc[0,'first author fullname'] = ds.authors[0].fullname
df.loc[0,'all authors fullnames'] = "; ".join([x.fullname for x in ds.authors])

### authors orcids is a list
df.loc[0,'all authors orcids'] = "; ".join([x.ORCID if x.ORCID else "no ORCID" for x in ds.authors])

df.loc[0,'citation'] = ds.citation
df.loc[0,'dataset DOI'] = ds.doi
df.loc[0,'west bound longitude'] = ds.geometryextent["westBoundLongitude"]
df.loc[0,'east bound longitude'] = ds.geometryextent["eastBoundLongitude"]
df.loc[0,'south bound latitude'] = ds.geometryextent["southBoundLatitude"]
df.loc[0,'north bound latitude'] = ds.geometryextent["northBoundLatitude"]
### parameters is a list
df.loc[0,'parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])

### event devices
df.loc[0,'label'] = "; ".join(set([device for device in ds.getEventsAsFrame()["label"]]))

In [34]:
df

Unnamed: 0,dataset title,abstract,first author fullname,all authors fullnames,all authors orcids,citation,dataset DOI,west bound longitude,east bound longitude,south bound latitude,north bound latitude,parameters,label
0,Geochemistry of sediment core Core_PGI01 from ...,,"Guerra, Roberta","Guerra, Roberta",0000-0002-7151-6616,"Guerra, Roberta (2020): Geochemistry of sedime...",https://doi.org/10.1594/PANGAEA.918423,-57.941,-57.941,-62.11,-62.11,"Depth, top/min [m]; Depth, bottom/max [m]; DEP...",Core_PGI01


#### Save dataframe as file

In [35]:
### Create data directory
data_directory = "PANGAEA_data"
# Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as csv (comma seperated value)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.csv'), encoding='utf-8', index=False)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_{ds.id}.txt'), sep='\t', encoding='utf-8', index=False)
print(f'PANGAEA metadata of "{ds.title}" saved')

PANGAEA metadata of "Geochemistry of sediment core Core_PGI01 from Penguin Island (South Shetland Archipelago, Antarctica)" saved


### 3.2 Getting metadata for multiple datasets

In [36]:
### define search pattern
search_pattern = 'project:label:PAGES_C-PEAT and citation:title:geochemistry'

In [37]:
### do query, pay attention to limit
query = pan.PanQuery(search_pattern, limit = 5)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

There are 267 query results.
Currently query consists of 5 entries.


In [38]:
### store query results in dataframe
df = pd.DataFrame(query.result)

In [39]:
df

Unnamed: 0,URI,score,html,type,position
0,doi:10.1594/PANGAEA.890490,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,0
1,doi:10.1594/PANGAEA.890537,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,1
2,doi:10.1594/PANGAEA.890538,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,2
3,doi:10.1594/PANGAEA.890415,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,3
4,doi:10.1594/PANGAEA.890186,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,4


#### Loop over all entries in df and get metadata for each entry
NOTE: As a safety precaution, the number of metadata requests is limited for a specific time period. 

_Received too many (metadata) requests error (429)...waiting 30s -_

If you have larger requests, prepare to wait or use a different tool e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).

In [40]:
### Create one data frame for all datasets
data_all = pd.DataFrame()

### loop over all datasets ins df
for ind,value in df['URI'].items():
    
    ## use PanDataSet to get metadata and data and put them into 2 diferent dataframes
    ds = PanDataSet(value, include_data=False)

    print(ind, ds.doi)

    ## put metadata into df in new columns
    df.loc[ind,'Title'] = ds.title
    df.loc[ind,'Publication date'] = ds.date
    df.loc[ind,'Authors'] = {"; ".join([x.fullname for x in ds.authors])}
    df.loc[ind,'Citation'] = ds.citation
    df.loc[ind,'DOI'] = ds.doi
    df.loc[ind,'Parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    if ds.events:
        df.loc[ind,'Event'] = "; ".join([x.label for x in ds.events])


0 https://doi.org/10.1594/PANGAEA.890490
1 https://doi.org/10.1594/PANGAEA.890537
2 https://doi.org/10.1594/PANGAEA.890538
3 https://doi.org/10.1594/PANGAEA.890415
4 https://doi.org/10.1594/PANGAEA.890186


In [41]:
df

Unnamed: 0,URI,score,html,type,position,Title,Publication date,Authors,Citation,DOI,Parameters,Event
0,doi:10.1594/PANGAEA.890490,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,0,Geochemistry of Stordalen peat core,2018-05-29T09:39:56,"{Kokfelt, Ulla}","Kokfelt, Ulla (2018): Geochemistry of Stordale...",https://doi.org/10.1594/PANGAEA.890490,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Stordalen_1
1,doi:10.1594/PANGAEA.890537,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,1,Geochemistry of V34_core1 peat core,2018-05-29T12:24:30,"{MacDonald, Glen M}","MacDonald, Glen M (2018): Geochemistry of V34_...",https://doi.org/10.1594/PANGAEA.890537,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",V34_core1
2,doi:10.1594/PANGAEA.890538,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,2,Geochemistry of Vasyugan_coreV21 peat core,2018-05-29T12:24:30,"{Borren, Wiebe}","Borren, Wiebe (2018): Geochemistry of Vasyugan...",https://doi.org/10.1594/PANGAEA.890538,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Vasyugan_coreV21
3,doi:10.1594/PANGAEA.890415,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,3,Geochemistry of Lake396_core3 peat core,2018-05-28T14:01:35,"{Camill, Philip}","Camill, Philip (2018): Geochemistry of Lake396...",https://doi.org/10.1594/PANGAEA.890415,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Lake396_core3
4,doi:10.1594/PANGAEA.890186,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,4,Geochemistry of Nuikluk_coreNL10-1 peat core,2018-05-24T10:11:32,"{Yu, Zicheng}","Yu, Zicheng (2018): Geochemistry of Nuikluk_co...",https://doi.org/10.1594/PANGAEA.890186,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Nuikluk_coreNL10-1


In [42]:
### drop columns no longer needed
df = df.drop(['URI','score','html','type','position'],axis=1)
df

Unnamed: 0,Title,Publication date,Authors,Citation,DOI,Parameters,Event
0,Geochemistry of Stordalen peat core,2018-05-29T09:39:56,"{Kokfelt, Ulla}","Kokfelt, Ulla (2018): Geochemistry of Stordale...",https://doi.org/10.1594/PANGAEA.890490,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Stordalen_1
1,Geochemistry of V34_core1 peat core,2018-05-29T12:24:30,"{MacDonald, Glen M}","MacDonald, Glen M (2018): Geochemistry of V34_...",https://doi.org/10.1594/PANGAEA.890537,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",V34_core1
2,Geochemistry of Vasyugan_coreV21 peat core,2018-05-29T12:24:30,"{Borren, Wiebe}","Borren, Wiebe (2018): Geochemistry of Vasyugan...",https://doi.org/10.1594/PANGAEA.890538,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Vasyugan_coreV21
3,Geochemistry of Lake396_core3 peat core,2018-05-28T14:01:35,"{Camill, Philip}","Camill, Philip (2018): Geochemistry of Lake396...",https://doi.org/10.1594/PANGAEA.890415,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Lake396_core3
4,Geochemistry of Nuikluk_coreNL10-1 peat core,2018-05-24T10:11:32,"{Yu, Zicheng}","Yu, Zicheng (2018): Geochemistry of Nuikluk_co...",https://doi.org/10.1594/PANGAEA.890186,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Nuikluk_coreNL10-1


#### Save dataframe as file

In [43]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as csv (comma seperated value)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_df_all.csv'), encoding='utf-8', index=False)
df.to_csv(os.path.join(data_directory, f'PANGAEA_metadata_df_all.txt'), sep='\t', encoding='utf-8', index=False)
print(f'PANGAEA metadata saved')

PANGAEA metadata saved


### 3.3 Quiz

#### 3.3.1 What is the title of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.804588

In [44]:
# Your solution

In [45]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.804588", include_data=False)
ds.title

'Classified Adélie penguin colonies from Landsat data'

#### 3.3.2 What is the name of the second author of this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.804588

In [46]:
# Your solution

In [47]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.804588", include_data=False)
ds.authors[1].fullname

'Southwell, C J'

#### 3.3.3 Did they measure pH in this dataset?
https://doi.pangaea.de/10.1594/PANGAEA.743969

In [48]:
# Your solution

In [49]:
### solution
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.743969")
list_params = list(ds.params)
# print(list_params)
if 'pH' in list_params:
    print('yes')

yes


## 4. Download datasets

### 4.1 Download single dataset

##### Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.972802

In [50]:
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.972802")
### ds contains data and metadata
### see metadata section on how to get metadata
type(ds)

pangaeapy.pandataset.PanDataSet

In [51]:
### ds.data is data frame
type(ds.data)

pandas.core.frame.DataFrame

In [52]:
### dataset header contains of parameter short names without unit
ds.data.head(3)

Unnamed: 0,Event,Leg,Activity,Cast,Date/Time,Device,Device_2,Latitude,Longitude,Elevation,Collector,Feature,Depth water,Sect,Samp type,Sample label,Chl a,Comment
0,PS122/1_2-95,PS122/1,2,95,2019-10-09 01:30:00,Bucket Water Sampling,BUCKET_PS,84.9112,136.058,-4228,SI_ice_ sampler,Lead,,bottom 00-05 cm,Ice,PS122_Chl_004,2.579,new ice 3-5 cm thick pieces
1,PS122/1_2-95,PS122/1,2,95,2019-10-09 01:30:00,Bucket Water Sampling,BUCKET_PS,84.9112,136.058,-4228,SI_ice_ sampler,Lead,,bottom 00-05 cm,Ice,PS122_Chl_005,0.402,new ice 3-5 cm thick pieces
2,PS122/1_2-95,PS122/1,2,95,2019-10-09 01:30:00,Bucket Water Sampling,BUCKET_PS,84.9112,136.058,-4228,SI_ice_ sampler,Lead,,bottom 00-05 cm,Ice,PS122_Chl_006,2.398,new ice 3-5 cm thick pieces


#### Translate to long parameter names
Because by default parameters are abbreviated without units

In [53]:
### Translate short parameters names to long names including unit
def get_long_parameters(ds):
    """Translate short parameters names to long names including unit

    Args:
        ds (PANGAEA dataset): PANGAEA dataset
    """
    ds.data.columns =  [f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()]

In [54]:
get_long_parameters(ds)

In [55]:
ds.data.head(3)

Unnamed: 0,Event label,Leg Number,Activity description,Cast number,DATE/TIME,Device type,Device type.1,LATITUDE,LONGITUDE,ELEVATION [m a.s.l.],Collector,Feature,"DEPTH, water [m]",Section,Sample type,Sample code/label,Chlorophyll a [µg/l],Comment
0,PS122/1_2-95,PS122/1,2,95,2019-10-09 01:30:00,Bucket Water Sampling,BUCKET_PS,84.9112,136.058,-4228,SI_ice_ sampler,Lead,,bottom 00-05 cm,Ice,PS122_Chl_004,2.579,new ice 3-5 cm thick pieces
1,PS122/1_2-95,PS122/1,2,95,2019-10-09 01:30:00,Bucket Water Sampling,BUCKET_PS,84.9112,136.058,-4228,SI_ice_ sampler,Lead,,bottom 00-05 cm,Ice,PS122_Chl_005,0.402,new ice 3-5 cm thick pieces
2,PS122/1_2-95,PS122/1,2,95,2019-10-09 01:30:00,Bucket Water Sampling,BUCKET_PS,84.9112,136.058,-4228,SI_ice_ sampler,Lead,,bottom 00-05 cm,Ice,PS122_Chl_006,2.398,new ice 3-5 cm thick pieces


#### Save data

In [56]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
### Save as csv (comma seperated value)
print(f'PANGAEA dataset "{ds.title}" saved')
ds.data.to_csv(os.path.join(data_directory, f'PANGAEA_dataset_{ds.id}.csv'),index=False)

PANGAEA dataset "Chlorophyll a concentrations from leads, melt ponds and under ice sampling during the MOSAiC expedition (PS122) in the Central Arctic Ocean 2019-2020" saved


### 4.2 Download dataset including binary files e.g. images

##### Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.932826

In [57]:
### Download dataset from PANGAEA
ds = PanDataSet("https://doi.pangaea.de/10.1594/PANGAEA.932826", enable_cache=True)
### Spell out abbreviated parameters
get_long_parameters(ds)
ds.data.head(2)

Unnamed: 0,DATE/TIME,LATITUDE,LONGITUDE,"DEPTH, water [m]",Image,Binary Object,Event,Elevation [m]
0,2021-02-16 03:33:58,-75.142628,-30.275583,364,TIMER_2021_02_16_at_03_33_58_IMG_0001.JPG,TIMER_2021_02_16_at_03_33_58_IMG_0001.txt,PS124_21-7,-495.1
1,2021-02-16 03:34:18,-75.142628,-30.275583,364,TIMER_2021_02_16_at_03_34_18_IMG_0002.JPG,TIMER_2021_02_16_at_03_34_18_IMG_0002.txt,PS124_21-7,-495.1


In [58]:
df

Unnamed: 0,Title,Publication date,Authors,Citation,DOI,Parameters,Event
0,Geochemistry of Stordalen peat core,2018-05-29T09:39:56,"{Kokfelt, Ulla}","Kokfelt, Ulla (2018): Geochemistry of Stordale...",https://doi.org/10.1594/PANGAEA.890490,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Stordalen_1
1,Geochemistry of V34_core1 peat core,2018-05-29T12:24:30,"{MacDonald, Glen M}","MacDonald, Glen M (2018): Geochemistry of V34_...",https://doi.org/10.1594/PANGAEA.890537,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",V34_core1
2,Geochemistry of Vasyugan_coreV21 peat core,2018-05-29T12:24:30,"{Borren, Wiebe}","Borren, Wiebe (2018): Geochemistry of Vasyugan...",https://doi.org/10.1594/PANGAEA.890538,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Vasyugan_coreV21
3,Geochemistry of Lake396_core3 peat core,2018-05-28T14:01:35,"{Camill, Philip}","Camill, Philip (2018): Geochemistry of Lake396...",https://doi.org/10.1594/PANGAEA.890415,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Lake396_core3
4,Geochemistry of Nuikluk_coreNL10-1 peat core,2018-05-24T10:11:32,"{Yu, Zicheng}","Yu, Zicheng (2018): Geochemistry of Nuikluk_co...",https://doi.org/10.1594/PANGAEA.890186,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Nuikluk_coreNL10-1


In [59]:
### download only 1 image
df = ds.data[ds.data['DATE/TIME']=='2021-02-16 03:45:21']

### Create file urls
df["image_url"] = [f'https://download.pangaea.de/dataset/{ds.id}/files/{img}' for img in df['Image']]

In [60]:
df

Unnamed: 0,DATE/TIME,LATITUDE,LONGITUDE,"DEPTH, water [m]",Image,Binary Object,Event,Elevation [m],image_url
35,2021-02-16 03:45:21,-74.868626,-30.665064,507,TIMER_2021_02_16_at_03_45_21_IMG_0036.JPG,TIMER_2021_02_16_at_03_45_21_IMG_0036.txt,PS124_21-7,-495.1,https://download.pangaea.de/dataset/932826/fil...


In [61]:
### Create data directory
data_directory = "PANGAEA_data"
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

# ### download images
# for i, file_url in enumerate(df['image_url']):
#     response = requests.get(file_url,data_directory)    
#     index = df.loc[(df == file_url).any(axis=1)].index[0]
#     ### save image
#     open(data_directory+'/'+df.loc[index,'Image'], 'wb').write(response.content)
#     print(df.loc[index,'Image'] +' downloaded')

### 4.3 Download multiple datasets
* download multiple datasets: data and metdata
* combine data into one dataframe
* combine metadata into one dataframe  

#### Define search pattern and do query

In [62]:
### define search pattern
search_pattern = 'project:label:PAGES_C-PEAT and citation:title:geochemistry' 

### do query
query = pan.PanQuery(search_pattern, limit = 500)
print(f'There are {query.totalcount} query results.')
print(f'Currently query consists of {len(query.result)} entries.')

There are 267 query results.
Currently query consists of 267 entries.


#### Get all query results and combine them in data frame

In [63]:
### create empty data frame
df_all = pd.DataFrame()

### apply loop if query consists of more than 500 datasets
if query.totalcount >= 500:
    ### loop over all results in steps of 500
    for i in np.arange(0,query.totalcount,500):
        
        ### store result of individual step in qs
        qs = pan.PanQuery(search_pattern, limit = 500, offset=i)
        
        ### convert qs result with 500 entries to data frame df_qs
        df_qs = pd.DataFrame(qs.result)
        
        ### concatenate all individual df_qs into one data frame named query_results_all
        df_all = pd.concat([df_all,df_qs],ignore_index=True)
else:
    df_all = pd.DataFrame(query.result)

print(f'There are {query.totalcount} query results.')
print(f'df_all consists of {len(df_all)} results.')

There are 267 query results.
df_all consists of 267 results.


In [64]:
### show first 3 lines
df_all.head(3)

Unnamed: 0,URI,score,html,type,position
0,doi:10.1594/PANGAEA.890490,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,0
1,doi:10.1594/PANGAEA.890537,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,1
2,doi:10.1594/PANGAEA.890538,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,2


#### Filter out collections

In [65]:
df_all = df_all[df_all['type']=='member']
print(f'df_all consists of {len(df_all)} results.')

df_all consists of 267 results.


#### Practical functions for complicated datasets
* double parameter
* method as comment  

Example dataset: https://doi.pangaea.de/10.1594/PANGAEA.890478  

In [66]:
### function to find duplicate column names
def find_duplicates(col_names):
    name_counts = Counter(col_names)
    duplicates = [name for name, count in name_counts.items() if count > 1]
    return duplicates, bool(duplicates)

In [67]:
### function to create general parameter long names with unit and method
def generate_general_column_name(param):
    base_name = f'{param.name} [{param.unit}]' if param.unit else param.name 

    if param.method:
        base_name += f', method:{param.method.name}'

    return base_name

In [68]:
### functions to rename duplicate column names so they are all individual within the dataset
def generate_unique_column_name(param):
    base_name = f'{param.name} [{param.unit}]' if param.unit else param.name
    
    if param.method:
        base_name += f', method:{param.method.name}'
    
    if param.comment:
        return f'{base_name}, comment:{param.comment}'
    else:
        return f'{base_name}, col nr:{param.colno}'

def make_unique_column_names(ds, same_param_name):
    col_names = []
    for param in ds.params.values():
        name = generate_general_column_name(param)
        if name in same_param_name:
            name = generate_unique_column_name(param)
        col_names.append(name)
    return col_names

#### Download and combine data and metadata of query results

NOTE: As a safety precaution, the number of metadata requests is limited for a specific time period. 

_Received too many (metadata) requests error (429)...waiting 30s -_

If you have larger requests, prepare to wait or use a different tool e.g. OAI-PMH (https://wiki.pangaea.de/wiki/OAI-PMH).

In [69]:
### Create one data frame for all datasets
data_all = pd.DataFrame()

### loop over all datasets in df_all
# for ind,value in df_all['URI'].items():
### only download first 3 results during workshop
for ind,value in df_all['URI'][0:3].items(): 
    
    ## use PanDataSet to get metadata and data and put them into 2 diferent dataframes
    ds = PanDataSet(value)

    print(ind, ds.doi)

    ## put metadata into df_all in new columns
    df_all.loc[ind,'Title'] = ds.title
    df_all.loc[ind,'Publication date'] = ds.date
    df_all.loc[ind,'Authors'] = {"; ".join([x.fullname for x in ds.authors])}
    df_all.loc[ind,'Citation'] = ds.citation
    df_all.loc[ind,'DOI'] = ds.doi
    df_all.loc[ind,'Parameters'] = "; ".join([f'{param.name} [{param.unit}]' if param.unit else param.name for param in ds.params.values()])
    if ds.events:
        df_all.loc[ind,'Event'] = "; ".join([x.label for x in ds.events])
    
    ### Translate default short parameter names to long parameter names, add unit and method if available, check if all column names are individuals
    col_names = []
    for param in ds.params.values():
        col_name = generate_general_column_name(param)
        col_names.append(col_name)

    ### find duplicate column names make them individual column names
    same_param_name, double_name = find_duplicates(col_names)

    if double_name:
        col_names = make_unique_column_names(ds, set(same_param_name))
    
    ### rename columns because python cannot handle duplicate column names within dataframe
    ds.data.columns =  col_names
    
    ### create new data dataframe for each query result 
    df_data = pd.DataFrame()
    df_data = ds.data
    df_data['DOI'] = ds.doi

    ### combine all datasats into one dataframe
    data_all = pd.concat([data_all,df_data], ignore_index=True)


0 https://doi.org/10.1594/PANGAEA.890490
1 https://doi.org/10.1594/PANGAEA.890537
2 https://doi.org/10.1594/PANGAEA.890538


In [70]:
### metadata table
df_all.head(3)

Unnamed: 0,URI,score,html,type,position,Title,Publication date,Authors,Citation,DOI,Parameters,Event
0,doi:10.1594/PANGAEA.890490,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,0,Geochemistry of Stordalen peat core,2018-05-29T09:39:56,"{Kokfelt, Ulla}","Kokfelt, Ulla (2018): Geochemistry of Stordale...",https://doi.org/10.1594/PANGAEA.890490,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Stordalen_1
1,doi:10.1594/PANGAEA.890537,16.503645,"<li><div class=""citation""><a href=""https://doi...",member,1,Geochemistry of V34_core1 peat core,2018-05-29T12:24:30,"{MacDonald, Glen M}","MacDonald, Glen M (2018): Geochemistry of V34_...",https://doi.org/10.1594/PANGAEA.890537,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",V34_core1
2,doi:10.1594/PANGAEA.890538,16.406353,"<li><div class=""citation""><a href=""https://doi...",member,2,Geochemistry of Vasyugan_coreV21 peat core,2018-05-29T12:24:30,"{Borren, Wiebe}","Borren, Wiebe (2018): Geochemistry of Vasyugan...",https://doi.org/10.1594/PANGAEA.890538,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...",Vasyugan_coreV21


In [71]:
df_all.columns

Index(['URI', 'score', 'html', 'type', 'position', 'Title', 'Publication date',
       'Authors', 'Citation', 'DOI', 'Parameters', 'Event'],
      dtype='object')

In [72]:
### rearrange and drop columns
df_all = df_all[['Title','Event', 'Parameters', 'Citation', 'DOI']]

In [73]:
df_all.head(3)

Unnamed: 0,Title,Event,Parameters,Citation,DOI
0,Geochemistry of Stordalen peat core,Stordalen_1,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...","Kokfelt, Ulla (2018): Geochemistry of Stordale...",https://doi.org/10.1594/PANGAEA.890490
1,Geochemistry of V34_core1 peat core,V34_core1,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...","MacDonald, Glen M (2018): Geochemistry of V34_...",https://doi.org/10.1594/PANGAEA.890537
2,Geochemistry of Vasyugan_coreV21 peat core,Vasyugan_coreV21,"DEPTH, sediment/rock [m]; AGE [ka BP]; Density...","Borren, Wiebe (2018): Geochemistry of Vasyugan...",https://doi.org/10.1594/PANGAEA.890538


In [74]:
### data table
pd.concat([data_all.head(2),data_all.tail(2)])

Unnamed: 0,"DEPTH, sediment/rock [m]",AGE [ka BP],"Density, dry bulk [g/cm**3]",Organic matter [%],"Density, organic matter [g/cm**3]","Carbon, total [%]","Nitrogen, total [%]","Density, organic carbon [g/cm**3]","Peat type, col nr:8","Peat type, comment:Loisel et al. 2014",Event,Latitude [deg],Longitude [deg],Elevation [m],Date/Time,DOI,"Peat type, col nr:4"
0,0.003,0.0,0.097,98.2,0.095,45.8,0.8,0.044,Dicranum elongatum peat,5.0,Stordalen_1,68.35,19.05,350.0,NaT,https://doi.org/10.1594/PANGAEA.890490,
1,0.007,0.007,0.074,98.3,0.073,45.7,0.7,0.034,Dicranum elongatum peat,5.0,Stordalen_1,68.35,19.05,350.0,NaT,https://doi.org/10.1594/PANGAEA.890490,
345,10.906,9.672,,,0.105,,,0.051,,1.0,Vasyugan_coreV21,56.83,78.42,,NaT,https://doi.org/10.1594/PANGAEA.890538,Sphagnum
346,10.993,,,,0.173,,,0.085,,1.0,Vasyugan_coreV21,56.83,78.42,,NaT,https://doi.org/10.1594/PANGAEA.890538,Sphagnum


#### Check header and merge columns 

In [75]:
### show all header names
data_all.columns

Index(['DEPTH, sediment/rock [m]', 'AGE [ka BP]',
       'Density, dry bulk [g/cm**3]', 'Organic matter [%]',
       'Density, organic matter [g/cm**3]', 'Carbon, total [%]',
       'Nitrogen, total [%]', 'Density, organic carbon [g/cm**3]',
       'Peat type, col nr:8', 'Peat type, comment:Loisel et al. 2014', 'Event',
       'Latitude [deg]', 'Longitude [deg]', 'Elevation [m]', 'Date/Time',
       'DOI', 'Peat type, col nr:4'],
      dtype='object')

In [76]:
### define parameter/header/column to be kept a.k.a. keep_param
keep_param = 'Peat type, col nr:6'

### copy value in keep column, if keep value is nan
merge_param = 'Peat type, col nr:8'

### if condition is needed because example consists of first 3 datasets
if merge_param in data_all.columns and keep_param in data_all.columns:
    ### merge merge_param into keep_param
    mask = data_all[keep_param].isna() & data_all[merge_param].notna()
    data_all.loc[mask, keep_param] = data_all.loc[mask, merge_param]
    
    ### remove merge_param
    data_all = data_all.drop(columns=[merge_param])

In [77]:
data_all.rename(columns={'Peat type, col nr:6':'Peat type'}, inplace=True)

In [78]:
pd.concat( [data_all.head(2),data_all.tail(2)] )

Unnamed: 0,"DEPTH, sediment/rock [m]",AGE [ka BP],"Density, dry bulk [g/cm**3]",Organic matter [%],"Density, organic matter [g/cm**3]","Carbon, total [%]","Nitrogen, total [%]","Density, organic carbon [g/cm**3]","Peat type, col nr:8","Peat type, comment:Loisel et al. 2014",Event,Latitude [deg],Longitude [deg],Elevation [m],Date/Time,DOI,"Peat type, col nr:4"
0,0.003,0.0,0.097,98.2,0.095,45.8,0.8,0.044,Dicranum elongatum peat,5.0,Stordalen_1,68.35,19.05,350.0,NaT,https://doi.org/10.1594/PANGAEA.890490,
1,0.007,0.007,0.074,98.3,0.073,45.7,0.7,0.034,Dicranum elongatum peat,5.0,Stordalen_1,68.35,19.05,350.0,NaT,https://doi.org/10.1594/PANGAEA.890490,
345,10.906,9.672,,,0.105,,,0.051,,1.0,Vasyugan_coreV21,56.83,78.42,,NaT,https://doi.org/10.1594/PANGAEA.890538,Sphagnum
346,10.993,,,,0.173,,,0.085,,1.0,Vasyugan_coreV21,56.83,78.42,,NaT,https://doi.org/10.1594/PANGAEA.890538,Sphagnum


In [79]:
data_all.columns

Index(['DEPTH, sediment/rock [m]', 'AGE [ka BP]',
       'Density, dry bulk [g/cm**3]', 'Organic matter [%]',
       'Density, organic matter [g/cm**3]', 'Carbon, total [%]',
       'Nitrogen, total [%]', 'Density, organic carbon [g/cm**3]',
       'Peat type, col nr:8', 'Peat type, comment:Loisel et al. 2014', 'Event',
       'Latitude [deg]', 'Longitude [deg]', 'Elevation [m]', 'Date/Time',
       'DOI', 'Peat type, col nr:4'],
      dtype='object')

In [81]:
data_all = data_all[['DOI','Event', 'Latitude [deg]', 'Longitude [deg]', 'Elevation [m]',
                     'DEPTH, sediment/rock [m]', 'AGE [ka BP]', 'Density, dry bulk [g/cm**3]',
                     'Peat type, col nr:4', 'Peat type, col nr:8', 'Peat type, comment:Loisel et al. 2014',
                     'Organic matter [%]','Density, organic matter [g/cm**3]','Density, organic carbon [g/cm**3]',
                     'Carbon, total [%]','Nitrogen, total [%]']]


In [82]:
pd.concat([data_all.head(2),data_all.tail(2)])

Unnamed: 0,DOI,Event,Latitude [deg],Longitude [deg],Elevation [m],"DEPTH, sediment/rock [m]",AGE [ka BP],"Density, dry bulk [g/cm**3]","Peat type, col nr:4","Peat type, col nr:8","Peat type, comment:Loisel et al. 2014",Organic matter [%],"Density, organic matter [g/cm**3]","Density, organic carbon [g/cm**3]","Carbon, total [%]","Nitrogen, total [%]"
0,https://doi.org/10.1594/PANGAEA.890490,Stordalen_1,68.35,19.05,350.0,0.003,0.0,0.097,,Dicranum elongatum peat,5.0,98.2,0.095,0.044,45.8,0.8
1,https://doi.org/10.1594/PANGAEA.890490,Stordalen_1,68.35,19.05,350.0,0.007,0.007,0.074,,Dicranum elongatum peat,5.0,98.3,0.073,0.034,45.7,0.7
345,https://doi.org/10.1594/PANGAEA.890538,Vasyugan_coreV21,56.83,78.42,,10.906,9.672,,Sphagnum,,1.0,,0.105,0.051,,
346,https://doi.org/10.1594/PANGAEA.890538,Vasyugan_coreV21,56.83,78.42,,10.993,,,Sphagnum,,1.0,,0.173,0.085,,


#### Save dataframe as file

In [83]:
### Create data directory
data_directory = "PANGAEA_data"
### Check if it already exists before creating it
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)
    
### Save as tab delimited text file

## set filename
filename1 = 'PAGES_C-PEAT_Geochemistry_metadata.txt'
filename2 = 'PAGES_C-PEAT_Geochemistry_data.txt'

df_all.to_csv(os.path.join(data_directory, filename1), sep='\t', encoding='utf-8', index=False)
data_all.to_csv(os.path.join(data_directory, filename2), sep='\t', encoding='utf-8', index=False)