## Functions for CrossRef (ORCID) data

In [1]:
import sys
sys.path.append("..")
from rcn_py import orcid
from rcn_py import topic_modeling

`name_to_orcid_id(name)`

This function queries the ORCID API using a full name and returns the corresponding ORCID ID if found. It performs the following steps:

- Constructs the necessary headers and parameters for the API request.
- Sends a GET request to the ORCID API to search for the name.
- Extracts the ORCID ID from the response if any results are found.
- Returns the ORCID ID or None if not found.
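
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the `rcn_py` implementation: the endpoint in `SEARCH_URL`, the response layout (`result` -> `orcid-identifier` -> `path`), and the helper names `build_search_request`, `first_orcid_id`, and `name_to_orcid_id_sketch` are all assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public ORCID search endpoint; rcn_py may use another URL or API version.
SEARCH_URL = "https://pub.orcid.org/v3.0/search/"


def build_search_request(name):
    """Build the URL and headers for a name search (no network access here)."""
    params = urllib.parse.urlencode({"q": name})
    headers = {"Accept": "application/json"}
    return f"{SEARCH_URL}?{params}", headers


def first_orcid_id(response_json):
    """Return the first ORCID ID in a search response, or None if there are no hits."""
    results = response_json.get("result") or []
    if not results:
        return None
    return results[0]["orcid-identifier"]["path"]


def name_to_orcid_id_sketch(name):
    """Hypothetical stand-in for name_to_orcid_id; sends a real HTTP request when called."""
    url, headers = build_search_request(name)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return first_orcid_id(json.load(response))
```

Keeping the request construction and the response parsing in separate functions means both can be checked without contacting the API.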


In [3]:
# Query ORCID by full name
orcid_id = orcid.name_to_orcid_id('Peter Kalverla')
orcid_id

'0000-0002-5025-7862'

`from_orcid_to_name(orcid_id)`

This function queries ORCID for a specific ORCID ID and retrieves the corresponding author's name. It performs the following steps:

- Queries the ORCID API for the record associated with the provided ORCID ID using the `query_orcid_for_record` function.
- Extracts the author's name from the ORCID record.
- Returns the author's name.
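
As a rough sketch of the name-extraction step, the function below reads the `person` -> `name` fields that appear in the record printed later in this notebook. The helper name `name_from_record` and the handling of missing fields are assumptions for illustration; the library's own function may behave differently.

```python
def name_from_record(orcid_record):
    """Join given and family names from an ORCID record dictionary."""
    name = (orcid_record.get("person") or {}).get("name") or {}
    # Either field can be None in a record, so fall back to an empty dict.
    given = (name.get("given-names") or {}).get("value", "")
    family = (name.get("family-name") or {}).get("value", "")
    return f"{given} {family}".strip()


record = {
    "person": {
        "name": {
            "given-names": {"value": "Peter"},
            "family-name": {"value": "Kalverla"},
        }
    }
}
print(name_from_record(record))  # Peter Kalverla
```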

In [4]:
fullname = orcid.from_orcid_to_name(orcid_id)
fullname

'Peter Kalverla'

`query_orcid_for_record(orcid_id)`

This function queries the ORCID API for a specific ORCID ID and retrieves the associated record. It performs the following steps:

- Sends a GET request to the ORCID API with the provided ORCID ID.
- Retrieves the JSON data from the response if the request is successful.
- Returns the JSON data as a Python dictionary or False if the request is not successful.
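
One possible shape for this request logic is shown below. It is a sketch under assumptions: the record URL, the `Accept` header, and the name `query_record_sketch` are not taken from `rcn_py`, and the `opener` parameter is a hypothetical addition that lets the function be exercised without network access.

```python
import json
import urllib.error
import urllib.request

# Assumption: the public ORCID API base URL and version.
API_BASE = "https://pub.orcid.org/v3.0"


def query_record_sketch(orcid_id, opener=urllib.request.urlopen):
    """Return the record for orcid_id as a dict, or False if the request fails."""
    request = urllib.request.Request(
        f"{API_BASE}/{orcid_id}/record",
        headers={"Accept": "application/json"},
    )
    try:
        with opener(request, timeout=10) as response:
            return json.load(response)
    except urllib.error.URLError:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too.
        return False
```

Returning `False` on failure mirrors the behaviour described above; callers should check the result before indexing into it.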

In [5]:
# query ORCID for an ORCID record
orcid_record = orcid.query_orcid_for_record(orcid_id)
orcid_record

{'orcid-identifier': {'uri': 'https://orcid.org/0000-0002-5025-7862',
  'path': '0000-0002-5025-7862',
  'host': 'orcid.org'},
 'preferences': {'locale': 'en'},
 'history': {'creation-method': 'DIRECT',
  'completion-date': None,
  'submission-date': {'value': 1440847595525},
  'last-modified-date': {'value': 1677858209460},
  'claimed': True,
  'source': None,
  'deactivation-date': None,
  'verified-email': True,
  'verified-primary-email': True},
 'person': {'last-modified-date': {'value': 1614867500639},
  'name': {'created-date': {'value': 1460759791637},
   'last-modified-date': {'value': 1460759791637},
   'given-names': {'value': 'Peter'},
   'family-name': {'value': 'Kalverla'},
   'credit-name': None,
   'source': None,
   'visibility': 'public',
   'path': '0000-0002-5025-7862'},
  'other-names': {'last-modified-date': None,
   'other-name': [],
   'path': '/0000-0002-5025-7862/other-names'},
  'biography': None,
  'researcher-urls': {'last-modified-date': None,
   'research

`extract_works_section(orcid_record)`

This function extracts the "works" section from an ORCID record. It performs the following steps:

- Retrieves the "works" section from the provided ORCID record.
- Returns the "works" section.
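
A defensive version of this lookup might look like the sketch below. The key path `activities-summary` -> `works` -> `group` is an assumption based on the list of work groups printed in this notebook, and `works_groups` is a hypothetical name.

```python
def works_groups(orcid_record):
    """Return the list of work groups in a record, or an empty list if absent."""
    activities = orcid_record.get("activities-summary") or {}
    works = activities.get("works") or {}
    return works.get("group") or []


record = {"activities-summary": {"works": {"group": [{"work-summary": []}]}}}
print(len(works_groups(record)))  # 1
```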

In [6]:
docs = orcid.extract_works_section(orcid_record)
docs

[{'last-modified-date': {'value': 1654040356185},
  'external-ids': {'external-id': [{'external-id-type': 'doi',
     'external-id-value': '10.5194/egusphere-egu21-9387',
     'external-id-normalized': {'value': '10.5194/egusphere-egu21-9387',
      'transient': True},
     'external-id-normalized-error': None,
     'external-id-url': {'value': 'https://doi.org/10.5194/egusphere-egu21-9387'},
     'external-id-relationship': 'self'}]},
  'work-summary': [{'put-code': 89903286,
    'created-date': {'value': 1614838570891},
    'last-modified-date': {'value': 1654040356185},
    'source': {'source-orcid': None,
     'source-client-id': {'uri': 'https://orcid.org/client/0000-0001-9884-1913',
      'path': '0000-0001-9884-1913',
      'host': 'orcid.org'},
     'source-name': {'value': 'Crossref'},
     'assertion-origin-orcid': None,
     'assertion-origin-client-id': None,
     'assertion-origin-name': None},
    'title': {'title': {'value': 'A multi-model ensemble weighting method (Clim

`extract_doi(work)`

This function extracts the title and DOI(s) from a work object. It performs the following steps:

- Extracts the title from the work object.
- Extracts the DOI(s) from the work object.
- Returns the DOI(s) and the title.
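
The extraction can be illustrated with a small function that walks the same keys visible in the work objects printed above (`external-ids`, `external-id-type`, `external-id-value`, `work-summary`, `title`). The name `doi_and_title` and the choice to return only the first DOI are assumptions for this sketch.

```python
def doi_and_title(work):
    """Return (first DOI or None, title or None) for one ORCID work group."""
    external_ids = (work.get("external-ids") or {}).get("external-id") or []
    dois = [
        ext["external-id-value"]
        for ext in external_ids
        if ext.get("external-id-type") == "doi"
    ]
    summaries = work.get("work-summary") or []
    title = None
    if summaries:
        # The title is nested two levels deep: title -> title -> value.
        title = ((summaries[0].get("title") or {}).get("title") or {}).get("value")
    return (dois[0] if dois else None), title


work = {
    "external-ids": {
        "external-id": [
            {"external-id-type": "doi",
             "external-id-value": "10.5194/egusphere-egu21-9387"}
        ]
    },
    "work-summary": [
        {"title": {"title": {"value": "A multi-model ensemble weighting method (ClimWIP) in ESMValTool"}}}
    ],
}
print(doi_and_title(work))
```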

In [15]:
# Extract the DOI and title of each work
doi_list = []
title_list = []
for work in docs:
    doi, title = orcid.extract_doi(work)
    doi_list.append(doi)
    title_list.append(title)
print(doi_list)
print(title_list)

['10.5194/egusphere-egu21-9387', '10.5194/egusphere-egu21-6051', '10.5194/egusphere-egu21-7797', '10.5194/egusphere-egu21-4805', '10.5194/egusphere-egu21-3476', '10.5194/egusphere-egu21-4895', '10.5194/wes-5-1097-2020', '10.5194/egusphere-egu2020-21619', '10.18174/498797', '10.5194/wes-4-193-2019', '10.5194/wes-2018-79-supplement', '10.1002/we.2267']
['A multi-model ensemble weighting method (ClimWIP) in ESMValTool', 'Preprocessing of hydrological models&#8217; input in eWaterCycle with ESMValTool', 'Towards Open and FAIR Hydrological Modelling with eWaterCycle', 'Bringing ESMValTool to the Jupyter Lab', 'Recent developments on the Earth System Model Evaluation Tool', 'era5cli: the command line interface to ERA5 data', 'Clustering wind profile shapes to estimate airborne wind energy production', 'era5cli: The command line tool to download ERA5 data', 'Characterisation of offshore winds for energy applications', 'Low-level jets over the North Sea based on ERA5 and observations: together

`orcid_lda_cluster(dois)`

This function applies LDA topic modeling and clustering to a list of DOIs. It performs the following steps:

- Preprocesses the abstracts associated with the DOIs.
- Creates a dictionary and corpus for the LDA model.
- Applies LDA topic modeling using the gensim library.
- Assigns clusters based on topic scores.
- Returns a dictionary mapping DOIs to cluster assignments and a dictionary mapping cluster indices to topics.
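
The cluster-assignment step can be shown in isolation. The sketch below assumes each document's topic scores arrive as a list of `(topic_index, probability)` pairs, which is the form gensim LDA models return for a bag-of-words document. The function name `assign_clusters` and the scores in the example are made up for illustration.

```python
def assign_clusters(dois, doc_topic_scores):
    """Map each DOI to the index of its highest-scoring topic."""
    clusters = {}
    for doi, scores in zip(dois, doc_topic_scores):
        if not scores:
            continue  # no topic information for this document
        best_topic, _best_score = max(scores, key=lambda pair: pair[1])
        clusters[doi] = best_topic
    return clusters


dois = ["10.18174/498797", "10.1002/we.2267"]
scores = [[(0, 0.10), (5, 0.85)], [(0, 0.70), (2, 0.25)]]  # invented numbers
print(assign_clusters(dois, scores))
# {'10.18174/498797': 5, '10.1002/we.2267': 0}
```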


In [16]:
clusters = orcid.orcid_lda_cluster(doi_list)
clusters

{'10.5194/egusphere-egu21-9387': 0,
 '10.5194/egusphere-egu21-6051': 0,
 '10.5194/egusphere-egu21-7797': 0,
 '10.5194/egusphere-egu21-4805': 0,
 '10.5194/egusphere-egu21-3476': 0,
 '10.5194/egusphere-egu21-4895': 0,
 '10.5194/wes-5-1097-2020': 0,
 '10.5194/egusphere-egu2020-21619': 0,
 '10.18174/498797': 5,
 '10.5194/wes-4-193-2019': 0,
 '10.5194/wes-2018-79-supplement': 2,
 '10.1002/we.2267': 0}

`orcid_get_coauthors(full_name)`

This function retrieves co-author information for a given full name using ORCID data. It performs the following steps:

- Retrieves the ORCID ID for the given full name using the `name_to_orcid_id` function.
- Queries the ORCID API for the author's record using the ORCID ID.
- Extracts the "works" section from the ORCID record.
- Performs LDA topic modeling on the DOIs extracted from the "works" section.
- Filters authors based on certain criteria such as country.
- Generates co-author links and stores co-author information.
- Assigns new group numbers based on unique groups.
- Returns a DataFrame containing co-author information and a list of co-author links.
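
Two of these steps lend themselves to a short illustration: building co-author links and renumbering groups. The sketch assumes links are all unordered pairs of authors on the same paper and that group labels are remapped to consecutive integers in sorted order. Both `coauthor_links` and `renumber_groups` are hypothetical helpers, not functions from `rcn_py`.

```python
from itertools import combinations


def coauthor_links(papers):
    """Return every pair of co-authors, paper by paper."""
    links = []
    for authors in papers:
        links.extend(combinations(authors, 2))
    return links


def renumber_groups(groups):
    """Replace arbitrary group labels with consecutive integers starting at 0."""
    mapping = {label: index for index, label in enumerate(sorted(set(groups)))}
    return [mapping[label] for label in groups]


papers = [["a", "b", "c"], ["d", "e"]]  # placeholder author IDs
print(coauthor_links(papers))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('d', 'e')]
print(renumber_groups([5, 0, 5, 2]))
# [2, 0, 2, 1]
```

A paper with n authors yields n(n-1)/2 links, so six authors on one paper give 15 pairs.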

In [17]:
df, links = orcid.orcid_get_coauthors('Peter Kalverla')

In [18]:
df

| | orcid | name | group |
|---|---|---|---|
| 0 | 0000-0002-3986-1268 | Ruth Lorenz | 0 |
| 1 | 0000-0001-5760-4524 | Lukas Brunner | 0 |
| 2 | 0000-0002-5025-7862 | Peter Kalverla | 0 |
| 3 | 0000-0002-5413-9038 | Stef Smeets | 0 |
| 4 | 0000-0002-8928-7831 | Jaro Camphuijsen | 0 |
| 5 | 0000-0001-9005-8940 | Bouwe Andela | 0 |
| 6 | 0000-0001-8407-6472 | Fakhereh Alidoost | 0 |
| 7 | 0000-0003-0157-4818 | Jerom Aerts | 0 |
| 8 | 0000-0002-7200-3353 | Nick van De Giesen | 0 |
| 9 | 0000-0001-8367-1333 | Gijs van Den Oord | 0 |


In [19]:
links

[('0000-0002-3986-1268', '0000-0001-5760-4524'),
 ('0000-0002-3986-1268', '0000-0002-5025-7862'),
 ('0000-0002-3986-1268', '0000-0002-5413-9038'),
 ('0000-0002-3986-1268', '0000-0002-8928-7831'),
 ('0000-0002-3986-1268', '0000-0001-9005-8940'),
 ('0000-0001-5760-4524', '0000-0002-5025-7862'),
 ('0000-0001-5760-4524', '0000-0002-5413-9038'),
 ('0000-0001-5760-4524', '0000-0002-8928-7831'),
 ('0000-0001-5760-4524', '0000-0001-9005-8940'),
 ('0000-0002-5025-7862', '0000-0002-5413-9038'),
 ('0000-0002-5025-7862', '0000-0002-8928-7831'),
 ('0000-0002-5025-7862', '0000-0001-9005-8940'),
 ('0000-0002-5413-9038', '0000-0002-8928-7831'),
 ('0000-0002-5413-9038', '0000-0001-9005-8940'),
 ('0000-0002-8928-7831', '0000-0001-9005-8940'),
 ('0000-0001-8407-6472', '0000-0003-0157-4818'),
 ('0000-0001-8407-6472', '0000-0001-9005-8940'),
 ('0000-0001-8407-6472', '0000-0002-8928-7831'),
 ('0000-0001-8407-6472', '0000-0002-7200-3353'),
 ('0000-0001-8407-6472', '0000-0001-8367-1333'),
 ('0000-0001-8407-64

`preprocess(text)`

This function tokenizes and lemmatizes a given text. It performs the following steps:

- Tokenizes the text using the `gensim.utils.simple_preprocess` function.
- Keeps only tokens that are not stopwords and are longer than 3 characters.
- Lemmatizes and stems each remaining token using the `lemmatize_stemming` function.
- Returns a list of preprocessed tokens.
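
The filtering logic can be approximated without gensim or a stemmer. The sketch below is a simplified stand-in: it tokenizes with a regular expression instead of `simple_preprocess`, uses a tiny invented stopword set, and skips lemmatization and stemming entirely. `STOPWORDS` and `preprocess_sketch` are hypothetical names.

```python
import re

# Assumption: a tiny stopword set for illustration; real stopword lists are much larger.
STOPWORDS = {"this", "that", "with", "from", "have"}


def preprocess_sketch(text):
    """Lowercase, split into alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS and len(tok) > 3]


print(preprocess_sketch("a hh dfhgf iekshj &*O (djahj daj wad d."))
# ['dfhgf', 'iekshj', 'djahj']
print(preprocess_sketch("This model works with data"))
# ['model', 'works', 'data']
```

Because stemming is omitted here, results will differ from `topic_modeling.preprocess` for real words such as "models" or "scientific".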

In [35]:
string = "a hh dfhgf iekshj &*O (djahj daj wad d."
topic_modeling.preprocess(string)

['dfhgf', 'iekshj', 'djahj']

`orcid_lda_cluster(dois)`

This function performs LDA topic modeling on a collection of DOIs (Digital Object Identifiers) and assigns a topic to each DOI. It takes a list of DOIs as input and performs the following steps:

- Preprocesses the abstracts or titles associated with the DOIs using the `preprocess(text)` function.
- Creates a dictionary and corpus for the preprocessed data using the `gensim.corpora.Dictionary` class.
- Applies LDA topic modeling using the `gensim.models.LdaMulticore` class with 8 topics, the dictionary, and 10 passes.
- Retrieves the topic keywords for each topic using the `print_topics()` method of the LDA model.
- Stores the topic keywords in the `idx2topics` dictionary, where the index represents the topic index.
- Assigns each DOI to the most relevant topic based on the topic with the highest score.
- Returns the clusters dictionary, which maps each DOI to its assigned topic index, and the idx2topics dictionary, which contains the topic keywords for each topic index.
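
To make the dictionary-and-corpus step concrete, here is a pure-Python stand-in for what a token dictionary and a bag-of-words corpus contain. It is not gensim code: `build_dictionary` and `doc_to_bow` are hypothetical helpers, and the integer IDs they assign may differ from those `gensim.corpora.Dictionary` would produce.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer ID to each distinct token, in first-seen order."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


docs = [["wind", "model", "wind"], ["model", "data"]]  # toy documents
token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)  # {'wind': 0, 'model': 1, 'data': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The LDA model is trained on a corpus of this shape, with the dictionary used to map topic keyword IDs back to tokens.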

In [42]:
topic_modeling.orcid_lda_cluster(doi_list)

({'10.5194/egusphere-egu21-9387': 0,
  '10.5194/egusphere-egu21-6051': 0,
  '10.5194/egusphere-egu21-7797': 0,
  '10.5194/egusphere-egu21-4805': 0,
  '10.5194/egusphere-egu21-3476': 1,
  '10.5194/egusphere-egu21-4895': 0,
  '10.5194/wes-5-1097-2020': 0,
  '10.5194/egusphere-egu2020-21619': 0,
  '10.18174/498797': 5,
  '10.5194/wes-4-193-2019': 0,
  '10.5194/wes-2018-79-supplement': 4,
  '10.1002/we.2267': 6},
 {0: '0.002*"tool" + 0.002*"model" + 0.002*"data" + 0.002*"recip" + 0.002*"esmvaltool" + 0.002*"scientif" + 0.002*"process" + 0.002*"reproduc" + 0.002*"project" + 0.002*"jat"',
  1: '0.036*"avail" + 0.024*"download" + 0.024*"data" + 0.022*"model" + 0.020*"option" + 0.018*"command" + 0.016*"jat" + 0.016*"includ" + 0.014*"variabl" + 0.014*"autom"',
  2: '0.040*"tool" + 0.023*"recip" + 0.019*"diagnost" + 0.018*"user" + 0.017*"test" + 0.014*"commun" + 0.014*"improv" + 0.014*"scientif" + 0.014*"cmip" + 0.013*"evalu"',
  3: '0.002*"wind" + 0.002*"tool" + 0.002*"data" + 0.002*"model" + 0