Patrick BROCKMANN - LSCE (Climate and Environment Sciences Laboratory)

 * 2023/09/19 - initial revision

<img align="left" width="50%" src="http://www.lsce.ipsl.fr/Css/img/banniere_LSCE_75.png" >

### Harvesting metadata

Change the harvesting technique from Elasticsearch requests to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

https://dataportals.pangaea.de/oa-icc/index2.php?

http://ws.pangaea.de/es/dataportal-oa-icc/pansimple/_search?q=_id:PANGAEA.903033&pretty=true

http://ws.pangaea.de/es/dataportal-oa-icc/pansimple/_search?size=10&_source_include=*Longitude,*Latitude,citation_title,citation_date,description,keyword,format,metadatalink,citation_authors,datalink

to OAI-PMH requests:

https://www.openarchives.org/pmh/

http://ws.pangaea.de/oai/provider?verb=ListMetadataFormats

http://ws.pangaea.de/oai/provider?verb=GetRecord&metadataPrefix=pan_md&identifier=oai:pangaea.de:doi:10.1594/PANGAEA.903033

Choose the pan_md metadata format.
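As a minimal sketch, the `GetRecord` request shown above can be assembled from its three query parameters with the standard library. The function name `get_record_url` and its default argument are illustrative assumptions, not part of the original notebook; the endpoint and identifier pattern are the ones listed above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs listed above.
OAI_ENDPOINT = "http://ws.pangaea.de/oai/provider"


def get_record_url(pangaea_id, metadata_prefix="pan_md"):
    """Build an OAI-PMH GetRecord URL for one dataset (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:pangaea.de:doi:10.1594/{pangaea_id}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL form used above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


print(get_record_url("PANGAEA.903033"))
```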

In [1]:
# https://github.com/bloomonkey/oai-harvest
# oai-harvest -p pan_md -d oaiharvest --set project4173 http://ws.pangaea.de/oai/provider

### Example with one pan_md XML file

In [2]:
#import requests
from bs4 import BeautifulSoup

#url = "http://ws.pangaea.de/oai/provider?verb=GetRecord&metadataPrefix=pan_md&identifier=oai:pangaea.de:doi:10.1594/PANGAEA.959648"
#url = "http://ws.pangaea.de/oai/provider?verb=GetRecord&metadataPrefix=dif&identifier=oai:pangaea.de:doi:10.1594/PANGAEA.959648"
#page = requests.get(url)
#soup = BeautifulSoup(page.content, "xml")

# oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.959648.pan_md.xml

#xmlPath = "oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.726955.pan_md.xml"
#xmlPath = "oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.833328.pan_md.xml"
xmlPath = "oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.819639.pan_md.xml"

# The with block closes the file even if parsing fails
with open(xmlPath, "r") as xmlFile:
    pageContent = xmlFile.read()
soup = BeautifulSoup(pageContent, "xml")
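The harvested file names used in this notebook look like the OAI identifier with `/` percent-encoded, followed by the metadata prefix. The sketch below maps between a PANGAEA Id and such a file name. Whether oai-harvest always names files this way is an assumption based only on the names seen here, and both helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def harvested_filename(pangaea_id, metadata_prefix="pan_md"):
    """File name matching the pattern of the harvested files seen in this notebook."""
    identifier = f"oai:pangaea.de:doi:10.1594/{pangaea_id}"
    # '/' becomes %2F, ':' is kept as-is.
    return f"{quote(identifier, safe=':')}.{metadata_prefix}.xml"


def doi_from_filename(filename, metadata_prefix="pan_md"):
    """Recover the bare DOI (e.g. 10.1594/PANGAEA.819639) from a file name."""
    stem = filename[: -len(f".{metadata_prefix}.xml")]
    identifier = unquote(stem)
    return identifier.split("doi:", 1)[1]


name = harvested_filename("PANGAEA.819639")
print(name)
print(doi_from_filename(name))
```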

In [3]:
DOI = soup.find('md:citation').find('md:URI', recursive=False).text
DOI = DOI.replace('doi.org', 'doi.pangaea.de')     # need for creating a dataLink 
DOI

'https://doi.pangaea.de/10.1594/PANGAEA.819639'

In [4]:
title = soup.find('md:citation').find('md:title').text
title

'Seawater carbonate chemistry in Hog reef and calcification rate in the Bermuda reef community, 2010'

In [5]:
yearCitation = soup.find('md:citation').find('md:year').text
yearCitation

'2010'

In [7]:
try:
    yearSupplementTo = soup.find('md:reference', {'relationType': 'Supplement to'}).find('md:year').text
except AttributeError:    # no 'Supplement to' reference in this record
    yearSupplementTo = None
yearSupplementTo
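The cell above has to guard against a record that has no "Supplement to" reference. As an alternative sketch, the standard library's `xml.etree.ElementTree` offers `findtext`, which returns a default value instead of raising when an element is missing. This swaps BeautifulSoup for ElementTree purely for illustration. The XML fragment, the root element name and the namespace URI below are hypothetical placeholders; the real namespace should be read from a harvested file.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and hand-written fragment, for illustration only.
NS = {"md": "http://example.org/hypothetical-md"}

SAMPLE = """<md:MetaData xmlns:md="http://example.org/hypothetical-md">
  <md:citation>
    <md:title>Sample title</md:title>
    <md:year>2010</md:year>
    <md:URI>https://doi.org/10.1594/PANGAEA.819639</md:URI>
  </md:citation>
</md:MetaData>"""

root = ET.fromstring(SAMPLE)

# Present element: findtext returns its text.
year_citation = root.findtext("md:citation/md:year", namespaces=NS)

# Absent element: findtext returns the default (None) without raising.
year_supplement = root.findtext(
    ".//md:reference[@relationType='Supplement to']/md:year",
    default=None,
    namespaces=NS,
)

print(year_citation, year_supplement)
```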

In [8]:
contents = soup.find('md:citation').findAll('md:author', recursive=False)
lastNames = [content.find('md:lastName').text for content in contents]

try:
    firstNames = [content.find('md:firstName').text for content in contents]
    authors = list(map(lambda x,y: x + ', ' + y, lastNames, firstNames))
except AttributeError:    # at least one author has no md:firstName
    authors = lastNames

authors

['Bates, Nicolas R', 'Amat, A', 'Andersson, Andreas J']
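In the cell above, a single author without a first name makes the whole list fall back to last names only. A possible refinement, sketched here under the assumption that names have already been extracted as `(lastName, firstName)` pairs with `None` for a missing first name, formats each author independently. The function name `format_authors` and the third sample entry are hypothetical.

```python
def format_authors(name_pairs):
    """Format (last_name, first_name) pairs as 'Last, First', tolerating missing first names."""
    authors = []
    for last_name, first_name in name_pairs:
        if first_name:
            authors.append(f"{last_name}, {first_name}")
        else:
            # Keep the last name alone when no first name is available.
            authors.append(last_name)
    return authors


print(format_authors([("Bates", "Nicolas R"), ("Amat", "A"), ("Consortium", None)]))
```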

In [9]:
contents = soup.find('md:keywords').findAll('md:keyword', {'type':'fromDatabase'})
keywords = [content.contents[0] for content in contents]
keywords

['Benthos',
 'Calcification/Dissolution',
 'Coast and continental shelf',
 'Entire community',
 'Field observation',
 'North Atlantic',
 'Rocky-shore community',
 'Temperate']

In [10]:
abstract = soup.find('md:abstract').text
abstract

'Despite the potential impact of ocean acidification on ecosystems such as coral reefs, surprisingly, there is very limited field data on the relationships between calcification and seawater carbonate chemistry. In this study, contemporaneous in situ datasets of seawater carbonate chemistry and calcification rates from the high-latitude coral reef of Bermuda over annual timescales provide a framework for investigating the present and future potential impact of rising carbon dioxide (CO2) levels and ocean acidification on coral reef ecosystems in their natural environment. A strong correlation was found between the in situ rates of calcification for the major framework building coral species Diploria labyrinthiformis and the seasonal variability of [CO32-] and aragonite saturation state omega aragonite, rather than other environmental factors such as light and temperature. These field observations provide sufficient data to hypothesize that there is a seasonal "Carbonate Chemistry Coral

In [11]:
contents = soup.find('technicalInfo').find('md:entry', {'key':'mimeType'})
mimeTypeData = contents['value']

mimeTypeData
# 2 values detected: 'application/zip' and 'text/tab-separated-values'

'application/zip'

### Loop over all harvested files '*.pan_md.xml'

In [12]:
from pathlib import Path
import csv

In [14]:
with open('out.csv', 'w', newline='') as csvFile:
    csvWriter = csv.writer(csvFile, delimiter=';')

    paths = Path("oaiharvest").glob('*.pan_md.xml')
    for n,path in enumerate(paths):
        
        if (n%50 == 0):
            print(n,path)
    
        #----------------
        # The with block closes the file even if parsing fails
        with open(path, "r") as xmlFile:
            pageContent = xmlFile.read()
        soup = BeautifulSoup(pageContent, "xml")
    
        #----------------
        DOI = soup.find('md:citation').find('md:URI', recursive=False).text
        DOI = DOI.replace('doi.org', 'doi.pangaea.de')

        #----------------
        title = soup.find('md:citation').find('md:title').text
    
        #----------------
        yearCitation = soup.find('md:citation').find('md:year').text
    
        #----------------
        try:
            yearSupplementTo = soup.find('md:reference', {'relationType': 'Supplement to'}).find('md:year').text
        except AttributeError:    # no 'Supplement to' reference in this record
            yearSupplementTo = None
    
        #----------------
        contents = soup.find('md:keywords').findAll('md:keyword', {'type':'fromDatabase'})
        keywords = [content.contents[0] for content in contents]
        
        #----------------
        try:
            abstract = soup.find('md:abstract').text
        except AttributeError:    # record without abstract
            abstract = None
            
        #----------------
        contents = soup.find('md:citation').findAll('md:author', recursive=False)
        lastNames = [content.find('md:lastName').text for content in contents]
        try:
            firstNames = [content.find('md:firstName').text for content in contents]
            authors = list(map(lambda x,y: x + ', ' + y, lastNames, firstNames))
        except AttributeError:
            authors = lastNames    # when at least one first name is not present
        
        #----------------
        contents = soup.find('technicalInfo').find('md:entry', {'key':'mimeType'})
        mimeTypeData = contents['value']
        
        if (mimeTypeData == 'text/tab-separated-values') :
            dataLink = DOI + "?format=textfile"
        elif (mimeTypeData == 'application/zip') :
            dataLink = DOI + "?format=zip"
        else:
            print("ERROR: unknown mimeType for dataLink: ", mimeTypeData)
            break
    
        #----------------
        csvWriter.writerow([DOI, title, yearCitation, yearSupplementTo, keywords, abstract, authors, dataLink])

# csvFile is closed automatically when the with block exits

0 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.835969.pan_md.xml
50 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.926447.pan_md.xml
100 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.839180.pan_md.xml
150 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.958450.pan_md.xml
200 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.874781.pan_md.xml
250 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.908297.pan_md.xml
300 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.867479.pan_md.xml
350 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.952281.pan_md.xml
400 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.859285.pan_md.xml
450 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.727746.pan_md.xml
500 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.941907.pan_md.xml
550 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.875042.pan_md.xml
600 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.848775.pan_md.xml
650 oaiharvest/oai:pangaea.de:doi:10.1594%2FPANGAEA.924889.pan_md.xml
700 oaiharvest/oai:pang
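The MIME type branch in the loop above can also be written as a lookup table. The sketch below uses the same two MIME types and `?format=` suffixes as the loop; the names `FORMAT_BY_MIME_TYPE` and `build_data_link` are hypothetical. Returning `None` for an unknown type is a design choice of this sketch: it lets the caller log the record and continue, whereas the loop above stops at the first unknown type.

```python
# MIME types and format suffixes as used in the loop above.
FORMAT_BY_MIME_TYPE = {
    "text/tab-separated-values": "textfile",
    "application/zip": "zip",
}


def build_data_link(doi_url, mime_type):
    """Return the download link for a dataset, or None for an unknown MIME type."""
    fmt = FORMAT_BY_MIME_TYPE.get(mime_type)
    if fmt is None:
        return None
    return f"{doi_url}?format={fmt}"


print(build_data_link("https://doi.pangaea.de/10.1594/PANGAEA.819639", "application/zip"))
```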

### Data processing (dates consolidation)

In [15]:
import pandas as pd

In [16]:
import ast

df = pd.read_csv('out.csv', sep=';', 
                 names=['DOI', 'title', 'yearCitation', 'yearSupplementTo', 'keywords', 'abstract', 'authors', 'dataLink'], 
                 header=None, na_values=[''])
df['yearCitation'] = pd.to_datetime(df['yearCitation'], format='%Y')
df['yearSupplementTo'] = pd.to_datetime(df['yearSupplementTo'], format='%Y')
# The list columns were written as Python list literals; literal_eval parses them without executing code
df['keywords'] = df['keywords'].apply(ast.literal_eval)
df['authors'] = df['authors'].apply(ast.literal_eval)
df

Unnamed: 0,DOI,title,yearCitation,yearSupplementTo,keywords,abstract,authors,dataLink
0,https://doi.pangaea.de/10.1594/PANGAEA.835969,Acid-base physiology response to ocean acidifi...,2014-01-01,NaT,"[Acid-base regulation, Animalia, Benthic anima...",Sea cucumbers are dominant invertebrates in se...,"[Collard, Marie, Eeckhaut, Igor, Dehairs, Fran...",https://doi.pangaea.de/10.1594/PANGAEA.835969?...
1,https://doi.pangaea.de/10.1594/PANGAEA.778475,"Seawater carbonate chemistry, nutrients, and C...",2012-01-01,NaT,"[Biomass/Abundance/Elemental composition, Bott...",The coccolithophore Calcidiscus leptoporus was...,"[Langer, Gerald, Oetjen, Kerstin, Brenneis, Tina]",https://doi.pangaea.de/10.1594/PANGAEA.778475?...
2,https://doi.pangaea.de/10.1594/PANGAEA.861380,Control of ambient pH on growth and stable iso...,2015-01-01,NaT,"[Biomass/Abundance/Elemental composition, Bott...",The present work examines the relationship bet...,"[Hermoso, Michael]",https://doi.pangaea.de/10.1594/PANGAEA.861380?...
3,https://doi.pangaea.de/10.1594/PANGAEA.824406,"Seawater carbonate chemistry and growth, carbo...",2013-01-01,NaT,"[Antarctic, Bottles or small containers/Aquari...",Despite the fact that ocean acidification is c...,"[Trimborn, Scarlett, Brenneis, Tina, Sweet, El...",https://doi.pangaea.de/10.1594/PANGAEA.824406?...
4,https://doi.pangaea.de/10.1594/PANGAEA.836912,Strong shift from HCO3- to CO2 uptake in Emili...,2014-01-01,NaT,"[Biomass/Abundance/Elemental composition, Bott...",Effects of ocean acidification on Emiliania hu...,"[Kottmeier, Dorothee, Rokitta, Sebastian D, To...",https://doi.pangaea.de/10.1594/PANGAEA.836912?...
...,...,...,...,...,...,...,...,...
1496,https://doi.pangaea.de/10.1594/PANGAEA.860079,Sea urchins in a high CO2 world: partitioned e...,2016-01-01,NaT,"[Animalia, Behaviour, Benthic animals, Benthos...",Body-size and temperature are the major factor...,"[Carey, Nicholas, Harianto, Januar, Byrne, Maria]",https://doi.pangaea.de/10.1594/PANGAEA.860079?...
1497,https://doi.pangaea.de/10.1594/PANGAEA.890698,Seawater carbonate chemistry and immune respon...,2017-01-01,NaT,"[Animalia, Benthic animals, Benthos, Coast and...",Ocean acidification (OA) is one of the main co...,"[Castillo, Nicole, Saavedra, Luisa M, Vargas, ...",https://doi.pangaea.de/10.1594/PANGAEA.890698?...
1498,https://doi.pangaea.de/10.1594/PANGAEA.846875,Calcareous green alga Halimeda tolerates ocean...,2015-01-01,NaT,"[Benthos, Biomass/Abundance/Elemental composit...","We investigated ecological, physiological, and...","[Vogel, Nikolas, Fabricius, Katharina Elisabet...",https://doi.pangaea.de/10.1594/PANGAEA.846875?...
1499,https://doi.pangaea.de/10.1594/PANGAEA.949645,Seawater carbonate chemistry and physiological...,2011-01-01,2011-01-01,"[Animalia, Behaviour, Benthic animals, Benthos...","Effects of coastal ocean acidification, other ...","[Fernández-Reiriz, Ma José, Range, P, Alvarez-...",https://doi.pangaea.de/10.1594/PANGAEA.949645?...
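The `keywords` and `authors` columns are lists, and the CSV writer stores them as their Python text representation, so they have to be parsed back into lists after reading. A minimal sketch of that round trip, using an in-memory buffer instead of `out.csv` and made-up sample values, shows how `ast.literal_eval` recovers the list while refusing anything that is not a plain literal.

```python
import ast
import csv
import io

# Write one row with a list-valued field, as the harvesting loop does.
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerow(["PANGAEA.835969", ["Benthos", "Tropical"]])

# Read it back: the list arrives as a string.
buffer.seek(0)
row = next(csv.reader(buffer, delimiter=";"))
keywords = ast.literal_eval(row[1])
print(type(row[1]).__name__, keywords)

# literal_eval only accepts literals; a function call is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```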


In [17]:
df['keywords'][0]

['Acid-base regulation',
 'Animalia',
 'Benthic animals',
 'Benthos',
 'Coast and continental shelf',
 'Containers and aquaria (20-1000 L or < 1 m**2)',
 'Echinodermata',
 'Holothuria parva',
 'Holothuria scabra',
 'Indian Ocean',
 'Laboratory experiment',
 'Other metabolic rates',
 'Respiration',
 'Single species',
 'Tropical']

In [18]:
df['authors'][0]

['Collard, Marie', 'Eeckhaut, Igor', 'Dehairs, Frank', 'Dubois, Philippe']

In [19]:
df[df['yearCitation'] < df['yearSupplementTo']]

Unnamed: 0,DOI,title,yearCitation,yearSupplementTo,keywords,abstract,authors,dataLink
149,https://doi.pangaea.de/10.1594/PANGAEA.924887,"Seawater carbonate chemistry and growth,cellul...",2019-01-01,2020-01-01,"[Arctic, Biomass/Abundance/Elemental compositi...","Compared to the rest of the globe, the Arctic ...","[White, Emily, Hoppe, Clara Jule Marie, Rost, ...",https://doi.pangaea.de/10.1594/PANGAEA.924887?...
549,https://doi.pangaea.de/10.1594/PANGAEA.943567,Seawater carbonate chemistry and carbonate che...,2021-01-01,2022-01-01,"[Acid-base regulation, Bacteria, Baltic Sea, B...",Photosynthesis and respiration cause distinct ...,"[Eichner, Meri, Wolf-Gladrow, Dieter A, Ploug,...",https://doi.pangaea.de/10.1594/PANGAEA.943567?...
910,https://doi.pangaea.de/10.1594/PANGAEA.917470,Seawater carbonate chemistry and phytoplankton...,2019-01-01,2020-01-01,"[CO2 vent, Coast and continental shelf, Commun...",In situ effects of ocean acidification are inc...,"[Zitoun, Rebecca, Connell, Sean D, Cornwall, C...",https://doi.pangaea.de/10.1594/PANGAEA.917470?...
923,https://doi.pangaea.de/10.1594/PANGAEA.942405,Seawater carbonate chemistry and bacterial div...,2021-01-01,2022-01-01,"[Animalia, Benthic animals, Benthos, Bottles o...",The sponge-associated microbial community cont...,"[Posadas, Niño, Baquiran, Jake Ivan P, Nada, M...",https://doi.pangaea.de/10.1594/PANGAEA.942405?...
975,https://doi.pangaea.de/10.1594/PANGAEA.923155,Seawater carbonate chemistry and population st...,2020-01-01,2021-01-01,"[Animalia, Benthic animals, Benthos, Biomass/A...",Sites with naturally high CO2 conditions provi...,"[Martins, M, Carreiro-Silva, Marina, Martins, ...",https://doi.pangaea.de/10.1594/PANGAEA.923155?...
1041,https://doi.pangaea.de/10.1594/PANGAEA.950764,Seawater carbonate chemistry and survival in C...,2012-01-01,2013-01-01,"[Animalia, Arthropoda, Benthic animals, Bentho...",Ocean acidification (OA) may alter the behavio...,"[Roberts, David A, Birchenough, Silvana N R, L...",https://doi.pangaea.de/10.1594/PANGAEA.950764?...
1266,https://doi.pangaea.de/10.1594/PANGAEA.929722,Seawater carbonate chemistry and acid-base phy...,2020-01-01,2021-01-01,"[Acid-base regulation, Animalia, Antarctic, Be...",Antarctic surface waters are expected to be th...,"[Di Giglio, Sarah, Agüera, Antonio, Pernet, Ph...",https://doi.pangaea.de/10.1594/PANGAEA.929722?...
1309,https://doi.pangaea.de/10.1594/PANGAEA.930934,Seawater carbonate chemistry and estuarine dis...,2020-01-01,2021-01-01,"[Benthos, Bottles or small containers/Aquaria ...","Relative to their surface area, estuaries make...","[Simone, Michelle, Schulz, Kai, Oakes, Joanne,...",https://doi.pangaea.de/10.1594/PANGAEA.930934?...
1337,https://doi.pangaea.de/10.1594/PANGAEA.955376,Seawater carbonate chemistry and the acute str...,2022-01-01,2023-01-01,"[Animalia, Behaviour, Chordata, Containers and...",The absorption of anthropogenic carbon dioxide...,"[Servili, Arianna, Lévêque, Etienne, Mouchel, ...",https://doi.pangaea.de/10.1594/PANGAEA.955376?...


In [20]:
# Consider yearSupplementTo as correct if present, otherwise fall back to yearCitation
df['year'] = df['yearSupplementTo'].fillna(df['yearCitation'])
df['year'] = df['year'].dt.year
df

Unnamed: 0,DOI,title,yearCitation,yearSupplementTo,keywords,abstract,authors,dataLink,year
0,https://doi.pangaea.de/10.1594/PANGAEA.835969,Acid-base physiology response to ocean acidifi...,2014-01-01,NaT,"[Acid-base regulation, Animalia, Benthic anima...",Sea cucumbers are dominant invertebrates in se...,"[Collard, Marie, Eeckhaut, Igor, Dehairs, Fran...",https://doi.pangaea.de/10.1594/PANGAEA.835969?...,2014
1,https://doi.pangaea.de/10.1594/PANGAEA.778475,"Seawater carbonate chemistry, nutrients, and C...",2012-01-01,NaT,"[Biomass/Abundance/Elemental composition, Bott...",The coccolithophore Calcidiscus leptoporus was...,"[Langer, Gerald, Oetjen, Kerstin, Brenneis, Tina]",https://doi.pangaea.de/10.1594/PANGAEA.778475?...,2012
2,https://doi.pangaea.de/10.1594/PANGAEA.861380,Control of ambient pH on growth and stable iso...,2015-01-01,NaT,"[Biomass/Abundance/Elemental composition, Bott...",The present work examines the relationship bet...,"[Hermoso, Michael]",https://doi.pangaea.de/10.1594/PANGAEA.861380?...,2015
3,https://doi.pangaea.de/10.1594/PANGAEA.824406,"Seawater carbonate chemistry and growth, carbo...",2013-01-01,NaT,"[Antarctic, Bottles or small containers/Aquari...",Despite the fact that ocean acidification is c...,"[Trimborn, Scarlett, Brenneis, Tina, Sweet, El...",https://doi.pangaea.de/10.1594/PANGAEA.824406?...,2013
4,https://doi.pangaea.de/10.1594/PANGAEA.836912,Strong shift from HCO3- to CO2 uptake in Emili...,2014-01-01,NaT,"[Biomass/Abundance/Elemental composition, Bott...",Effects of ocean acidification on Emiliania hu...,"[Kottmeier, Dorothee, Rokitta, Sebastian D, To...",https://doi.pangaea.de/10.1594/PANGAEA.836912?...,2014
...,...,...,...,...,...,...,...,...,...
1496,https://doi.pangaea.de/10.1594/PANGAEA.860079,Sea urchins in a high CO2 world: partitioned e...,2016-01-01,NaT,"[Animalia, Behaviour, Benthic animals, Benthos...",Body-size and temperature are the major factor...,"[Carey, Nicholas, Harianto, Januar, Byrne, Maria]",https://doi.pangaea.de/10.1594/PANGAEA.860079?...,2016
1497,https://doi.pangaea.de/10.1594/PANGAEA.890698,Seawater carbonate chemistry and immune respon...,2017-01-01,NaT,"[Animalia, Benthic animals, Benthos, Coast and...",Ocean acidification (OA) is one of the main co...,"[Castillo, Nicole, Saavedra, Luisa M, Vargas, ...",https://doi.pangaea.de/10.1594/PANGAEA.890698?...,2017
1498,https://doi.pangaea.de/10.1594/PANGAEA.846875,Calcareous green alga Halimeda tolerates ocean...,2015-01-01,NaT,"[Benthos, Biomass/Abundance/Elemental composit...","We investigated ecological, physiological, and...","[Vogel, Nikolas, Fabricius, Katharina Elisabet...",https://doi.pangaea.de/10.1594/PANGAEA.846875?...,2015
1499,https://doi.pangaea.de/10.1594/PANGAEA.949645,Seawater carbonate chemistry and physiological...,2011-01-01,2011-01-01,"[Animalia, Behaviour, Benthic animals, Benthos...","Effects of coastal ocean acidification, other ...","[Fernández-Reiriz, Ma José, Range, P, Alvarez-...",https://doi.pangaea.de/10.1594/PANGAEA.949645?...,2011
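The consolidation rule of cell In [20] (take `yearSupplementTo` when present, otherwise `yearCitation`) can be sketched without going through datetimes, by keeping the years as nullable integers. The three-row frame below is made-up sample data shaped like the columns read from `out.csv`; it is not taken from the harvested records.

```python
import pandas as pd

# Made-up sample: a missing supplement year is read from CSV as NaN.
sample = pd.DataFrame({
    "yearCitation": [2014, 2019, 2011],
    "yearSupplementTo": [float("nan"), 2020.0, 2011.0],
})

# Nullable integer dtype keeps years as integers even when some are missing.
citation = sample["yearCitation"].astype("Int64")
supplement = sample["yearSupplementTo"].astype("Int64")

# Prefer the supplement year; fall back to the citation year where it is missing.
sample["year"] = supplement.fillna(citation)
print(sample["year"].tolist())
```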


In [21]:
df.drop(columns=['yearCitation','yearSupplementTo'], inplace=True)
df

Unnamed: 0,DOI,title,keywords,abstract,authors,dataLink,year
0,https://doi.pangaea.de/10.1594/PANGAEA.835969,Acid-base physiology response to ocean acidifi...,"[Acid-base regulation, Animalia, Benthic anima...",Sea cucumbers are dominant invertebrates in se...,"[Collard, Marie, Eeckhaut, Igor, Dehairs, Fran...",https://doi.pangaea.de/10.1594/PANGAEA.835969?...,2014
1,https://doi.pangaea.de/10.1594/PANGAEA.778475,"Seawater carbonate chemistry, nutrients, and C...","[Biomass/Abundance/Elemental composition, Bott...",The coccolithophore Calcidiscus leptoporus was...,"[Langer, Gerald, Oetjen, Kerstin, Brenneis, Tina]",https://doi.pangaea.de/10.1594/PANGAEA.778475?...,2012
2,https://doi.pangaea.de/10.1594/PANGAEA.861380,Control of ambient pH on growth and stable iso...,"[Biomass/Abundance/Elemental composition, Bott...",The present work examines the relationship bet...,"[Hermoso, Michael]",https://doi.pangaea.de/10.1594/PANGAEA.861380?...,2015
3,https://doi.pangaea.de/10.1594/PANGAEA.824406,"Seawater carbonate chemistry and growth, carbo...","[Antarctic, Bottles or small containers/Aquari...",Despite the fact that ocean acidification is c...,"[Trimborn, Scarlett, Brenneis, Tina, Sweet, El...",https://doi.pangaea.de/10.1594/PANGAEA.824406?...,2013
4,https://doi.pangaea.de/10.1594/PANGAEA.836912,Strong shift from HCO3- to CO2 uptake in Emili...,"[Biomass/Abundance/Elemental composition, Bott...",Effects of ocean acidification on Emiliania hu...,"[Kottmeier, Dorothee, Rokitta, Sebastian D, To...",https://doi.pangaea.de/10.1594/PANGAEA.836912?...,2014
...,...,...,...,...,...,...,...
1496,https://doi.pangaea.de/10.1594/PANGAEA.860079,Sea urchins in a high CO2 world: partitioned e...,"[Animalia, Behaviour, Benthic animals, Benthos...",Body-size and temperature are the major factor...,"[Carey, Nicholas, Harianto, Januar, Byrne, Maria]",https://doi.pangaea.de/10.1594/PANGAEA.860079?...,2016
1497,https://doi.pangaea.de/10.1594/PANGAEA.890698,Seawater carbonate chemistry and immune respon...,"[Animalia, Benthic animals, Benthos, Coast and...",Ocean acidification (OA) is one of the main co...,"[Castillo, Nicole, Saavedra, Luisa M, Vargas, ...",https://doi.pangaea.de/10.1594/PANGAEA.890698?...,2017
1498,https://doi.pangaea.de/10.1594/PANGAEA.846875,Calcareous green alga Halimeda tolerates ocean...,"[Benthos, Biomass/Abundance/Elemental composit...","We investigated ecological, physiological, and...","[Vogel, Nikolas, Fabricius, Katharina Elisabet...",https://doi.pangaea.de/10.1594/PANGAEA.846875?...,2015
1499,https://doi.pangaea.de/10.1594/PANGAEA.949645,Seawater carbonate chemistry and physiological...,"[Animalia, Behaviour, Benthic animals, Benthos...","Effects of coastal ocean acidification, other ...","[Fernández-Reiriz, Ma José, Range, P, Alvarez-...",https://doi.pangaea.de/10.1594/PANGAEA.949645?...,2011


In [22]:
df['DOI']

0       https://doi.pangaea.de/10.1594/PANGAEA.835969
1       https://doi.pangaea.de/10.1594/PANGAEA.778475
2       https://doi.pangaea.de/10.1594/PANGAEA.861380
3       https://doi.pangaea.de/10.1594/PANGAEA.824406
4       https://doi.pangaea.de/10.1594/PANGAEA.836912
                            ...                      
1496    https://doi.pangaea.de/10.1594/PANGAEA.860079
1497    https://doi.pangaea.de/10.1594/PANGAEA.890698
1498    https://doi.pangaea.de/10.1594/PANGAEA.846875
1499    https://doi.pangaea.de/10.1594/PANGAEA.949645
1500    https://doi.pangaea.de/10.1594/PANGAEA.835643
Name: DOI, Length: 1501, dtype: object

In [23]:
import os
df['Id'] = df.apply(lambda x: os.path.basename(x['DOI']), axis=1)
df['Id']

0       PANGAEA.835969
1       PANGAEA.778475
2       PANGAEA.861380
3       PANGAEA.824406
4       PANGAEA.836912
             ...      
1496    PANGAEA.860079
1497    PANGAEA.890698
1498    PANGAEA.846875
1499    PANGAEA.949645
1500    PANGAEA.835643
Name: Id, Length: 1501, dtype: object

In [24]:
import unidecode
df['dataDirectory'] = df.apply(lambda x:
        unidecode.unidecode(x['authors'][0].split(',')[0].replace(' ','').replace("'",'')) + \
             '_' + str(x['year']) + '_' + os.path.basename(x['DOI']).split('.')[1], axis=1)
df['dataDirectory']

0                Collard_2014_835969
1                 Langer_2012_778475
2                Hermoso_2015_861380
3               Trimborn_2013_824406
4              Kottmeier_2014_836912
                    ...             
1496               Carey_2016_860079
1497            Castillo_2017_890698
1498               Vogel_2015_846875
1499    Fernandez-Reiriz_2011_949645
1500           Challener_2014_835643
Name: dataDirectory, Length: 1501, dtype: object
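The directory name built above is `<first author's last name>_<year>_<numeric Id>`, with spaces, apostrophes and accents removed. The sketch below reproduces that rule as a plain function. It swaps `unidecode` for the standard library's `unicodedata`, which strips accents but simply drops characters that have no ASCII decomposition (for example `ø` or `ß`), so results can differ from `unidecode` for such names. The helper names are hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Strip accents via NFKD decomposition; non-decomposable characters are dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def build_data_directory(authors, year, doi_url):
    """Build '<LastName>_<year>_<numeric Id>' from the first author, year and DOI URL."""
    last_name = authors[0].split(",")[0].replace(" ", "").replace("'", "")
    # 'https://doi.pangaea.de/10.1594/PANGAEA.949645' -> 'PANGAEA.949645' -> '949645'
    numeric_id = doi_url.rstrip("/").rsplit("/", 1)[-1].split(".")[1]
    return f"{to_ascii(last_name)}_{year}_{numeric_id}"


print(build_data_directory(
    ["Fernández-Reiriz, Ma José", "Range, P"],
    2011,
    "https://doi.pangaea.de/10.1594/PANGAEA.949645",
))
```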

In [25]:
df.to_csv("metadata.csv", index=False)

In [26]:
import ast

df = pd.read_csv("metadata.csv")
# Parse the stored list literals back into Python lists without executing code
df['keywords'] = df['keywords'].apply(ast.literal_eval)
df['authors'] = df['authors'].apply(ast.literal_eval)
df

Unnamed: 0,DOI,title,keywords,abstract,authors,dataLink,year,Id,dataDirectory
0,https://doi.pangaea.de/10.1594/PANGAEA.835969,Acid-base physiology response to ocean acidifi...,"[Acid-base regulation, Animalia, Benthic anima...",Sea cucumbers are dominant invertebrates in se...,"[Collard, Marie, Eeckhaut, Igor, Dehairs, Fran...",https://doi.pangaea.de/10.1594/PANGAEA.835969?...,2014,PANGAEA.835969,Collard_2014_835969
1,https://doi.pangaea.de/10.1594/PANGAEA.778475,"Seawater carbonate chemistry, nutrients, and C...","[Biomass/Abundance/Elemental composition, Bott...",The coccolithophore Calcidiscus leptoporus was...,"[Langer, Gerald, Oetjen, Kerstin, Brenneis, Tina]",https://doi.pangaea.de/10.1594/PANGAEA.778475?...,2012,PANGAEA.778475,Langer_2012_778475
2,https://doi.pangaea.de/10.1594/PANGAEA.861380,Control of ambient pH on growth and stable iso...,"[Biomass/Abundance/Elemental composition, Bott...",The present work examines the relationship bet...,"[Hermoso, Michael]",https://doi.pangaea.de/10.1594/PANGAEA.861380?...,2015,PANGAEA.861380,Hermoso_2015_861380
3,https://doi.pangaea.de/10.1594/PANGAEA.824406,"Seawater carbonate chemistry and growth, carbo...","[Antarctic, Bottles or small containers/Aquari...",Despite the fact that ocean acidification is c...,"[Trimborn, Scarlett, Brenneis, Tina, Sweet, El...",https://doi.pangaea.de/10.1594/PANGAEA.824406?...,2013,PANGAEA.824406,Trimborn_2013_824406
4,https://doi.pangaea.de/10.1594/PANGAEA.836912,Strong shift from HCO3- to CO2 uptake in Emili...,"[Biomass/Abundance/Elemental composition, Bott...",Effects of ocean acidification on Emiliania hu...,"[Kottmeier, Dorothee, Rokitta, Sebastian D, To...",https://doi.pangaea.de/10.1594/PANGAEA.836912?...,2014,PANGAEA.836912,Kottmeier_2014_836912
...,...,...,...,...,...,...,...,...,...
1496,https://doi.pangaea.de/10.1594/PANGAEA.860079,Sea urchins in a high CO2 world: partitioned e...,"[Animalia, Behaviour, Benthic animals, Benthos...",Body-size and temperature are the major factor...,"[Carey, Nicholas, Harianto, Januar, Byrne, Maria]",https://doi.pangaea.de/10.1594/PANGAEA.860079?...,2016,PANGAEA.860079,Carey_2016_860079
1497,https://doi.pangaea.de/10.1594/PANGAEA.890698,Seawater carbonate chemistry and immune respon...,"[Animalia, Benthic animals, Benthos, Coast and...",Ocean acidification (OA) is one of the main co...,"[Castillo, Nicole, Saavedra, Luisa M, Vargas, ...",https://doi.pangaea.de/10.1594/PANGAEA.890698?...,2017,PANGAEA.890698,Castillo_2017_890698
1498,https://doi.pangaea.de/10.1594/PANGAEA.846875,Calcareous green alga Halimeda tolerates ocean...,"[Benthos, Biomass/Abundance/Elemental composit...","We investigated ecological, physiological, and...","[Vogel, Nikolas, Fabricius, Katharina Elisabet...",https://doi.pangaea.de/10.1594/PANGAEA.846875?...,2015,PANGAEA.846875,Vogel_2015_846875
1499,https://doi.pangaea.de/10.1594/PANGAEA.949645,Seawater carbonate chemistry and physiological...,"[Animalia, Behaviour, Benthic animals, Benthos...","Effects of coastal ocean acidification, other ...","[Fernández-Reiriz, Ma José, Range, P, Alvarez-...",https://doi.pangaea.de/10.1594/PANGAEA.949645?...,2011,PANGAEA.949645,Fernandez-Reiriz_2011_949645


In [27]:
df['keywords'][0]

['Acid-base regulation',
 'Animalia',
 'Benthic animals',
 'Benthos',
 'Coast and continental shelf',
 'Containers and aquaria (20-1000 L or < 1 m**2)',
 'Echinodermata',
 'Holothuria parva',
 'Holothuria scabra',
 'Indian Ocean',
 'Laboratory experiment',
 'Other metabolic rates',
 'Respiration',
 'Single species',
 'Tropical']

In [28]:
df

Unnamed: 0,DOI,title,keywords,abstract,authors,dataLink,year,Id,dataDirectory
0,https://doi.pangaea.de/10.1594/PANGAEA.835969,Acid-base physiology response to ocean acidifi...,"[Acid-base regulation, Animalia, Benthic anima...",Sea cucumbers are dominant invertebrates in se...,"[Collard, Marie, Eeckhaut, Igor, Dehairs, Fran...",https://doi.pangaea.de/10.1594/PANGAEA.835969?...,2014,PANGAEA.835969,Collard_2014_835969
1,https://doi.pangaea.de/10.1594/PANGAEA.778475,"Seawater carbonate chemistry, nutrients, and C...","[Biomass/Abundance/Elemental composition, Bott...",The coccolithophore Calcidiscus leptoporus was...,"[Langer, Gerald, Oetjen, Kerstin, Brenneis, Tina]",https://doi.pangaea.de/10.1594/PANGAEA.778475?...,2012,PANGAEA.778475,Langer_2012_778475
2,https://doi.pangaea.de/10.1594/PANGAEA.861380,Control of ambient pH on growth and stable iso...,"[Biomass/Abundance/Elemental composition, Bott...",The present work examines the relationship bet...,"[Hermoso, Michael]",https://doi.pangaea.de/10.1594/PANGAEA.861380?...,2015,PANGAEA.861380,Hermoso_2015_861380
3,https://doi.pangaea.de/10.1594/PANGAEA.824406,"Seawater carbonate chemistry and growth, carbo...","[Antarctic, Bottles or small containers/Aquari...",Despite the fact that ocean acidification is c...,"[Trimborn, Scarlett, Brenneis, Tina, Sweet, El...",https://doi.pangaea.de/10.1594/PANGAEA.824406?...,2013,PANGAEA.824406,Trimborn_2013_824406
4,https://doi.pangaea.de/10.1594/PANGAEA.836912,Strong shift from HCO3- to CO2 uptake in Emili...,"[Biomass/Abundance/Elemental composition, Bott...",Effects of ocean acidification on Emiliania hu...,"[Kottmeier, Dorothee, Rokitta, Sebastian D, To...",https://doi.pangaea.de/10.1594/PANGAEA.836912?...,2014,PANGAEA.836912,Kottmeier_2014_836912
...,...,...,...,...,...,...,...,...,...
1496,https://doi.pangaea.de/10.1594/PANGAEA.860079,Sea urchins in a high CO2 world: partitioned e...,"[Animalia, Behaviour, Benthic animals, Benthos...",Body-size and temperature are the major factor...,"[Carey, Nicholas, Harianto, Januar, Byrne, Maria]",https://doi.pangaea.de/10.1594/PANGAEA.860079?...,2016,PANGAEA.860079,Carey_2016_860079
1497,https://doi.pangaea.de/10.1594/PANGAEA.890698,Seawater carbonate chemistry and immune respon...,"[Animalia, Benthic animals, Benthos, Coast and...",Ocean acidification (OA) is one of the main co...,"[Castillo, Nicole, Saavedra, Luisa M, Vargas, ...",https://doi.pangaea.de/10.1594/PANGAEA.890698?...,2017,PANGAEA.890698,Castillo_2017_890698
1498,https://doi.pangaea.de/10.1594/PANGAEA.846875,Calcareous green alga Halimeda tolerates ocean...,"[Benthos, Biomass/Abundance/Elemental composit...","We investigated ecological, physiological, and...","[Vogel, Nikolas, Fabricius, Katharina Elisabet...",https://doi.pangaea.de/10.1594/PANGAEA.846875?...,2015,PANGAEA.846875,Vogel_2015_846875
1499,https://doi.pangaea.de/10.1594/PANGAEA.949645,Seawater carbonate chemistry and physiological...,"[Animalia, Behaviour, Benthic animals, Benthos...","Effects of coastal ocean acidification, other ...","[Fernández-Reiriz, Ma José, Range, P, Alvarez-...",https://doi.pangaea.de/10.1594/PANGAEA.949645?...,2011,PANGAEA.949645,Fernandez-Reiriz_2011_949645
