## Use case 1: Identify and collect information about repositories catering to the medical research community (Python)

> This notebook is based on the examples written in `R` from Dorothea Strecker's [examples-r/01_re3data_API_medical_research_community.ipynb](https://github.com/re3data/using_the_re3data_API/blob/main/examples-r/01_re3data_API_medical_research_community.ipynb).  
> Adapted in `Python` by Heinz-Alexander Fütterer.

Medical researchers are looking for a suitable repository to deposit their data. They require a repository catering to medical research that offers data upload and assigns DOIs to datasets.

Repositories meeting these specifications can be identified via the re3data API. The API also provides the option to retrieve further information about these repositories, such as the name of the repository or a description.

### Step 1: load packages

The package `httpx` provides functions for HTTP requests, including GET, which will be used to request data from the re3data API. Responses from the re3data API are returned in XML. `lxml` includes functions for working with XML, for example parsing or extracting the content of specific elements. The `pandas` library is used for storing the responses in a tabular data structure (i.e. a `DataFrame`).

If necessary, install the packages before loading them.

In [1]:
# !pip install httpx==0.23.0 lxml==4.8.0 pandas==1.4.2

In [2]:
import typing

import httpx
import pandas
from lxml import html

### Step 2: define query parameters

Information on individual repositories can be extracted using the re3data ID. Therefore, re3data IDs of repositories with the desired characteristics need to be identified first.

The re3data API allows querying via the endpoint **/api/beta/repositories**. Parameters that can be queried are listed in the [re3data API documentation](https://www.re3data.org/api/doc). For more information on re3data metadata, including descriptions of available elements and controlled vocabularies, please refer to the documentation of the [re3data Metadata Schema](https://doi.org/10.2312/re3.006) (the API uses version 2.2 of the re3data Metadata Schema).  
The query below returns re3data IDs of repositories meeting the following conditions:

* **"subjects[]" = "205 Medicine"** The repository caters to the subject *Medicine*, notation 205 in the DFG Subject Classification, the subject classification used by re3data.
* **"dataUploads[]" = "open"** The repository allows data upload.
* **"pidSystems[]" = "DOI"** The repository assigns DOIs.

In [3]:
re3data_query = {
    "subjects[]": "205 Medicine",
    "dataUploads[]": "open",
    "pidSystems[]": "DOI",
}
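
To see what these parameters look like on the wire, the following minimal sketch encodes them into a query string using only the Python standard library. It is an illustration, not part of the original workflow: `httpx` builds the query string itself when given `params`, and its exact escaping (for example of spaces) may differ from `urlencode`.

```python
from urllib.parse import urlencode

re3data_query = {
    "subjects[]": "205 Medicine",
    "dataUploads[]": "open",
    "pidSystems[]": "DOI",
}

# Percent-encode keys and values; "[" and "]" become %5B and %5D,
# and the space in "205 Medicine" becomes "+".
query_string = urlencode(re3data_query)

# Illustrative only: the request in this notebook is sent via httpx instead.
full_url = f"https://www.re3data.org/api/beta/repositories?{query_string}"
print(full_url)
```

Each key ends in `[]`, and all three conditions travel in the same request.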

### Step 3: obtain URLs for further API queries

The query parameters defined in the previous step can then be passed to the re3data API using `httpx.get()`.

The XML response is parsed using `html.fromstring()`. XML elements or attributes can be identified using XPath syntax. The response from the re3data API includes URLs for further queries to the **/api/beta/repository** endpoint. These URLs can be identified with a simple XPath expression. All attributes matching the XPath expression are returned by `.xpath()`.

In the example below, the request is sent first; parsing and XPath extraction are then chained in a single expression.

In [4]:
URL = "https://www.re3data.org/api/beta/repositories"

re3data_response = httpx.get(URL, params=re3data_query)
urls = html.fromstring(re3data_response.content).xpath("//@href")

urls[:5]

['https://www.re3data.org/api/beta/repository/r3d100012823',
 'https://www.re3data.org/api/beta/repository/r3d100010953',
 'https://www.re3data.org/api/beta/repository/r3d100012815',
 'https://www.re3data.org/api/beta/repository/r3d100010261',
 'https://www.re3data.org/api/beta/repository/r3d100012074']
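
In the output above, each URL ends with the re3data ID of the repository. If only the IDs are needed at this point, they can be derived from the URLs without further requests. The helper `re3data_id_from_url()` below is a hypothetical name introduced for this sketch, and it assumes the ID is always the last path segment, as in the URLs shown.

```python
# Two of the URLs returned above, hard-coded so the sketch runs offline.
urls = [
    "https://www.re3data.org/api/beta/repository/r3d100012823",
    "https://www.re3data.org/api/beta/repository/r3d100010953",
]


def re3data_id_from_url(url: str) -> str:
    """Return the last path segment of a repository URL (hypothetical helper)."""
    # Assumption: the re3data ID is the final path segment of the URL.
    return url.rstrip("/").rsplit("/", 1)[-1]


ids = [re3data_id_from_url(url) for url in urls]
print(ids)
```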

### Step 4: define what information about the repositories should be requested

The function `extract_repository_info()` defined in the following code block extracts the content of specific XML elements and attributes. This function will be used to extract the specified information from responses of the re3data API. Its basic structure is similar to the process of extracting the URLs outlined in step 3 above.
The XPath expressions defined here will extract the re3data IDs, names, URLs, and descriptions of the repositories. Results are stored in a dictionary that can be processed later.

Depending on specific use cases, this function can be adapted to extract a different set of elements and attributes. For an overview of the metadata re3data offers, please refer to the documentation of the [re3data Metadata Schema](https://doi.org/10.2312/re3.006) (the API uses version 2.2 of the re3data Metadata Schema).
    
Please note that in version 2.2 of the re3data Metadata Schema, the elements mentioned here have occurrences of 1 or 0-1, meaning that for each repository, they occur once at most. For information on how to deal with elements that can occur multiple times, please refer to other examples for using the re3data API.

In [5]:
def extract_repository_info(
    repository_metadata_xml: html.HtmlElement,
) -> typing.Dict[str, str]:
    """Extracts wanted metadata elements from a given repository metadata xml representation.

    Args:
        repository_metadata_xml: XML representation of repository metadata.

    Returns:
        Dictionary representation of repository metadata.

    """

    namespaces = {"r3d": "http://www.re3data.org/schema/2-2"}
    return {
        "re3data_id": repository_metadata_xml.xpath("//re3data.orgidentifier/text()", namespaces=namespaces)[0],
        "name": repository_metadata_xml.xpath("//repositoryname/text()", namespaces=namespaces)[0],
        "url": repository_metadata_xml.xpath("//repositoryurl/text()", namespaces=namespaces)[0],
        "description": repository_metadata_xml.xpath("//description/text()", namespaces=namespaces)[0],
    }
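
Because some of these elements have an occurrence of 0-1, an XPath query may return an empty list, in which case indexing with `[0]` raises an `IndexError`. The following is a minimal sketch of a more defensive lookup. The helper `first_or_default()` is a hypothetical name introduced here, not part of `lxml` or of the original notebook.

```python
import typing


def first_or_default(
    matches: typing.Sequence[str],
    default: typing.Optional[str] = None,
) -> typing.Optional[str]:
    """Return the first XPath match, or `default` if nothing matched."""
    return matches[0] if matches else default


# Plain lists stand in for the return value of `.xpath(...)`.
print(first_or_default(["Vivli"]))
print(first_or_default([]))
print(first_or_default([], default=""))
```

Inside `extract_repository_info()`, a call such as `first_or_default(repository_metadata_xml.xpath("//description/text()"))` could then replace the `[0]` index for optional elements.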

### Step 5: gather detailed information about repositories

After preparing the list of URLs and the extracting function, these components can be put together. The code block below iterates through the list of URLs using a for-loop. For each repository, data is requested from the re3data API using `.get()` from an `httpx.Client`. The XML response is parsed with `html.fromstring()` before `extract_repository_info()` is called. The results are then appended to the list `results`.

`repository_info` is a `pandas.DataFrame` that stores the results of the API queries. It has four columns corresponding to the keys of the dictionary returned by `extract_repository_info()`.

In [6]:
results = []

with httpx.Client() as client:
    for url in urls:
        repository_metadata_response = client.get(url)
        repository_metadata_xml = html.fromstring(repository_metadata_response.content)
        results.append(extract_repository_info(repository_metadata_xml))

repository_info = pandas.DataFrame(results)
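
The loop above stops at the first request or parsing error. As a hedged sketch of one way to make it more tolerant, the function below records failing URLs and continues. `collect_repository_info()`, `fetch`, and `parse` are hypothetical names introduced here; the fetch and parse steps are passed in as callables so that the example runs offline with stand-in functions.

```python
import typing


def collect_repository_info(
    urls: typing.Iterable[str],
    fetch: typing.Callable[[str], typing.Any],
    parse: typing.Callable[[typing.Any], typing.Dict[str, str]],
) -> typing.Tuple[typing.List[typing.Dict[str, str]], typing.List[typing.Tuple[str, str]]]:
    """Apply fetch and parse to each URL, collecting failures instead of stopping."""
    results, failed = [], []
    for url in urls:
        try:
            results.append(parse(fetch(url)))
        except Exception as error:  # broad on purpose in this sketch
            failed.append((url, repr(error)))
    return results, failed


# Stand-ins for the real request and XML extraction (assumptions for the demo).
def fake_fetch(url: str) -> str:
    if url.endswith("broken"):
        raise ValueError("simulated request failure")
    return url


def fake_parse(payload: str) -> typing.Dict[str, str]:
    return {"re3data_id": payload.rsplit("/", 1)[-1]}


results, failed = collect_repository_info(
    ["https://example.org/r3d1", "https://example.org/broken"],
    fake_fetch,
    fake_parse,
)
print(results)
print(failed)
```

In the notebook, `fetch` could wrap `client.get(url)` and `parse` could combine `html.fromstring()` with `extract_repository_info()`; narrowing the caught exception types would be a sensible refinement.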

### Step 6: look at the results

Results are now stored in `repository_info`. They can be inspected using `.head()`, visualized or stored locally with `.to_csv()`.

In [7]:
repository_info.head()

| | re3data_id | name | url | description |
|---|---|---|---|---|
| 0 | r3d100012823 | Vivli | https://vivli.org/ | Vivli is a non-profit organization working to ... |
| 1 | r3d100010953 | Polar Data Catalogue | https://www.polardata.ca/ | The Polar Data Catalogue is an online database... |
| 2 | r3d100012815 | UNB Libraries Dataverse | https://dataverse.lib.unb.ca/ | UNB Dataverse is repository for research data ... |
| 3 | r3d100010261 | National Addiction & HIV Data Archive Program | https://www.icpsr.umich.edu/web/pages/NAHDAP/i... | NAHDAP acquires, preserves and disseminates da... |
| 4 | r3d100012074 | BindingDB | http://bindingdb.org/bind/index.jsp | BindingDB is a public, web-accessible knowledg... |
