# CIP Dataverse REST API Session

Welcome to the CIP REST API course. This Jupyter notebook contains all you need to learn how to retrieve data from Dataverse, including sample code, space to write your own code, and sample answers, all of which can be run within the notebook in Python.

In [1]:
# Don't run this cell locally (e.g. in VSCode); google.colab is only available in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Search API using GET queries via a browser

Our first REST call will search the CIP Dataverse, using the Search API of the **Dataverse Project**.

We can query a word such as potato using URL patterns like these:
* https://data.cipotato.org/api/search?q=**The word to search***&key="**Paste your API KEY**"
* https://data.cipotato.org/api/search?q=**The word to search***&type=dataverse&key="**Paste your API KEY**"
* https://data.cipotato.org/api/search?q=**The word to search***&subtree="**Dataverse Identifier**"&key="**Paste your API KEY**"
* https://data.cipotato.org/api/search?q=*&type=dataset&metadata_fields=citation:dsDescription&metadata_fields=citation:author&key="**Paste your API KEY**"
* https://data.cipotato.org/api/search?q=potato*&show_relevance=true&show_facets=true&fq=publicationDate:2021&fq=authorName_ss:**"Gastelo, Manuel"**&fq=keywordValue_ss:**"Genomic Selection"**&key="**Paste your API KEY**"
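
The URL patterns above can also be assembled programmatically, which helps with the quoting of spaces, commas and quotation marks in `fq` filters. The following is a minimal sketch using only the standard library; `build_search_url` is a hypothetical helper written for this illustration and `YOUR-API-KEY` is a placeholder, not a real key.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cipotato.org/api/search"

def build_search_url(query, api_key, **filters):
    """Return a Search API URL; list values become repeated parameters."""
    pairs = [("q", query)]
    for name, value in filters.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        pairs.extend((name, v) for v in values)
    pairs.append(("key", api_key))
    # urlencode percent-encodes reserved characters for us
    return f"{BASE_URL}?{urlencode(pairs)}"

url = build_search_url(
    "potato*",
    "YOUR-API-KEY",  # placeholder, paste your own key here
    type="dataset",
    fq=["publicationDate:2021", 'authorName_ss:"Gastelo, Manuel"'],
)
print(url)
```

The printed URL can be pasted into the browser. Each entry in the `fq` list becomes its own `fq=` parameter, as in the last pattern above.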

### Exercises 1

The first set of exercises only uses URLs in the browser. You cannot do these within the Python notebook; however, we have provided a box where you can note them down. *Don't forget to get your API key before searching*.

1. Find a dataverse which you can use to look up information.
2. Create a URL to find information about lateblight.
3. Expand your results to include information about a specific author or a specific theme.

#### Enter your answers here:

1.
2.
3.

### Exercises 1 - answers

1. https://data.cipotato.org/api/search?q=potato&type=dataverse&key="**Paste your API KEY**"
2. https://data.cipotato.org/api/search?q=lateblight*&subtree=potato&key="**Paste your API KEY**"
3. https://data.cipotato.org/api/search?q=lateblight*&show_relevance=true&show_facets=true&fq=authorName_ss:%22Hijmans,%20Robert%22&fq=keywordValue_ss:%22Phytophthora%20infestans%22&key="**Paste your API KEY**"

## Search API with Python code

To make a request, you'll need to specify the server and extension, using the `requests` module.

In [None]:
import requests, sys

base_url = "https://data.cipotato.org/api/search"

params = {
    "q": "*",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}

response = requests.get(base_url, params=params, verify=False)
response

### Error Handling

In [None]:
import requests, sys

base_url = "https://data.cipotato.org/api/search"

params = {
    "q": "*",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}

response = requests.get(base_url, params=params, verify=False)

if response.status_code != 200:
    print(f"Error: HTTP {response.status_code} {response.reason}")
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
else:
    print(f"{response.ok}")
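
When a request fails, the numeric status code alone is not very informative. As a small sketch (standard library only), the hypothetical helper `describe_status` below turns a code into a readable message; in the cell above it could be called as `describe_status(response.status_code)`.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return a short, readable summary of an HTTP status code."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        return f"{status_code}: unknown status code"
    if 200 <= status_code < 300:
        kind = "success"
    elif status_code >= 400:
        kind = "error"
    else:
        kind = "other"
    return f"{status_code} {status.phrase} ({kind})"

for code in (200, 401, 404, 500):
    print(describe_status(code))
```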

### Decoding JSON

In [None]:
import requests, sys, json
from pprint import pprint

base_url = "https://data.cipotato.org/api/search"

params = {
    "q": "*",
    "type":"dataverse",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}

response = requests.get(base_url, params=params, verify=False)

print(response.headers)

if response.status_code != 200:
    print(f"Error: HTTP {response.status_code} {response.reason}")
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
else:
    print(f"{response.ok}")

decoded = response.json()

pprint(decoded)
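
To see what decoding does without making a request, here is a minimal offline sketch. The JSON text is hand-written and only loosely shaped like a search reply (the names and counts are invented); `json.loads` performs the same text-to-dictionary conversion that `response.json()` does on the reply body.

```python
import json

# Hand-written sample, loosely shaped like a Search API reply (values are invented)
raw = (
    '{"status": "OK", "data": {"total_count": 2, "items": ['
    '{"name": "Example A", "type": "dataverse"}, '
    '{"name": "Example B", "type": "dataverse"}]}}'
)

decoded = json.loads(raw)  # JSON text -> Python dict
names = [item["name"] for item in decoded["data"]["items"]]
print(decoded["status"], names)
```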


### Helper Function

The helper function lets you send the request, check the status and decode the JSON in a single line of code. If you're using lots of REST calls in your script, creating a function will save you a lot of time.

In [None]:
import requests, sys, json
from pprint import pprint

def fetch_endpoint(base_url, content_type="application/json",**params):
    url = f"{base_url}"
    headers = {
        "Content-Type": content_type
    }
    response = requests.get(url, params=params,headers=headers, verify=False)
    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code} {response.reason}")
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    else:
        print(f"{response.ok}")

    if content_type == "application/json":
        return response.json()
    else:
        return response.text


base_url = "https://data.cipotato.org/api/search"
params = {
    "q": "*",
    "type":"dataset",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}
con = "application/json"

get_data = fetch_endpoint(base_url, con, **params)

pprint(get_data)
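
A single search call returns only one page of results. The sketch below shows one way a helper could be looped to collect every item. It assumes the Search API accepts `start` and `per_page` parameters and reports `total_count` inside `data`; check these names against the Dataverse documentation before relying on them. `iter_items` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request so the example runs offline.

```python
def iter_items(fetch, per_page=10, **params):
    """Yield every item by requesting successive pages from `fetch`."""
    start = 0
    while True:
        page = fetch(start=start, per_page=per_page, **params)
        data = page.get("data", {})
        items = data.get("items", [])
        if not items:
            break
        yield from items
        start += len(items)
        if start >= data.get("total_count", 0):
            break

# Stand-in for a real request so the sketch runs without a network
def fake_fetch(start=0, per_page=10, **params):
    everything = [{"name": f"dataset {n}"} for n in range(5)]
    return {
        "data": {
            "total_count": len(everything),
            "items": everything[start:start + per_page],
        }
    }

all_names = [item["name"] for item in iter_items(fake_fetch, per_page=2, q="*")]
print(all_names)
```

With the helper above, `fetch` could be something like `lambda **p: fetch_endpoint(base_url, "application/json", **p)`.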

### Exercises 2

1. Write a script to **look up** datasets related to native potatoes.

In [None]:
#TODO Exercise 2.1

import requests, sys, json

### Exercises 2 - answers

1. Write a script to **look up** datasets related to native potatoes.

In [None]:
import requests, sys, json
from pprint import pprint

def fetch_endpoint(base_url, content_type="application/json",**params):
    url = f"{base_url}"
    headers = {
        "Content-Type": content_type
    }
    response = requests.get(url, params=params,headers=headers, verify=False)
    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code} {response.reason}")
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    else:
        print(f"{response.ok}")

    if content_type == "application/json":
        return response.json()
    else:
        return response.text


base_url = "https://data.cipotato.org/api/search"
params = {
    "q": "potatoes*",
    "type":"dataset",
    "subtree":"potato",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}
con = "application/json"

get_data = fetch_endpoint(base_url, con, **params)

pprint(get_data)

### Using results: fetching specific data

Since the decoded JSON is a Python dictionary, you can pull out a single data point by its key.

In [None]:
print(get_data.keys())
print(get_data.get("data",{}).keys())
pprint(get_data.get("data",{}).get("items",[]))
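
Chained `.get()` calls work well for dictionaries, but indexing a list (for example taking the first item) can still fail when a search returns nothing. A minimal sketch of a more forgiving lookup follows; `dig` is a hypothetical helper and the sample values, including the DOI, are invented.

```python
def dig(data, *keys, default=None):
    """Follow keys and indices into nested dicts and lists.

    Returns `default` as soon as one step is missing.
    """
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Invented sample with the same nesting as the search results
sample = {"data": {"items": [{"global_id": "doi:10.0000/EXAMPLE", "authors": ["Doe, Jane"]}]}}

first_id = dig(sample, "data", "items", 0, "global_id")
missing = dig(sample, "data", "items", 5, "global_id", default="not found")
print(first_id, missing)
```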

We can add this to our previous script:

In [None]:
import requests, sys, json
from pprint import pprint

def fetch_endpoint(base_url, content_type="application/json",**params):
    url = f"{base_url}"
    headers = {
        "Content-Type": content_type
    }
    response = requests.get(url, params=params,headers=headers, verify=False)
    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code} {response.reason}")
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    else:
        print(f"{response.ok}")

    if content_type == "application/json":
        return response.json()
    else:
        return response.text


base_url = "https://data.cipotato.org/api/search"
params = {
    "q": "potatoes*",
    "type":"dataset",
    "subtree":"potato",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}
con = "application/json"

get_data = fetch_endpoint(base_url, con, **params)

items = get_data.get("data",{}).get("items",[])[0]

print(items.keys())
print(items.get("url"))
for i, author in enumerate(items.get("authors", [])):  # empty list if the key is missing
    print(f"Author {i+1}: {author}")
pprint(items.get("description"))

### Exercises 3

1. Write a script to look up a dataset related to roots and tubers and print the geographic coverage and the global ID.

In [None]:
##TODO Exercise 3.1

### Exercises 3 - answers

1. Write a script to look up a dataset related to roots and tubers and print the geographic coverage and the global ID.

In [None]:
base_url = "https://data.cipotato.org/api/search"
params = {
    "q": "*",
    "type":"dataverse",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}
con = "application/json"

get_data = fetch_endpoint(base_url, con, **params)
pprint(get_data)

In [None]:
base_url = "https://data.cipotato.org/api/search"
params = {
    "q": "virus*",
    "type":"dataset",
    "subtree":"rtas",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96'
}
con = "application/json"

get_data = fetch_endpoint(base_url, con, **params)

items = get_data.get("data",{}).get("items",[])[0]
pprint(items)
print(items.keys())
print(items.get("geographicCoverage"))
pprint(items.get("global_id"))

# Retrieving Data using REST API with GET Requests in Browser

Our second type of REST call retrieves data from the CIP Dataverse through the Data Access API of the **Dataverse Project**.

We can request a dataset or a single file by its DOI:

* https://data.cipotato.org/api/access/dataset/:persistentId/?persistentId=**"Paste DOI identifier"**
* https://data.cipotato.org/api/access/dataset/:persistentId/versions/:latest?persistentId=**"Paste DOI identifier"**
* https://data.cipotato.org/api/access/datafile/:persistentId?persistentId=**"doi:10.21223/1RALF3/3HRBMX"**
* For multiple files, we will use code!
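
As a small sketch, the first URL pattern above can be built from a DOI with the standard library. `dataset_download_url` is a hypothetical helper; the `format=original` parameter is the one used in the Python cells of this notebook.

```python
from urllib.parse import urlencode

ACCESS_BASE = "https://data.cipotato.org/api/access"

def dataset_download_url(doi, original=True):
    """Build the URL that downloads the files of a dataset."""
    params = {"persistentId": doi}
    if original:
        params["format"] = "original"
    # safe=':/' keeps the DOI readable instead of percent-encoding it
    return f"{ACCESS_BASE}/dataset/:persistentId/?{urlencode(params, safe=':/')}"

download_url = dataset_download_url("doi:10.21223/JTYQA6")
print(download_url)
```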

## Retrieve a single dataset with REST API using Python code

To make a request, you'll need to specify the server and extension, using the `requests` module.

In [None]:
import requests, sys

base_url = "https://data.cipotato.org/api/access/dataset/:persistentId"

params = {
    "persistentId": "doi:10.21223/JTYQA6",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96',
    "format":"original",
}

response = requests.get(base_url, params=params, verify=False)
response

Now, let's save the content to a variable:

In [None]:
content = response.content

It's important to note that the response is binary data: the files of the dataset (Excel files, for example) arrive bundled in a zip archive. To work with it, we wrap the bytes in an in-memory binary object that supports operations like reading and seeking, as if the data were stored in a file. We can then open that object as a zip archive and handle the files inside it using the `with` keyword.

In [None]:
import zipfile, io

zip_file = zipfile.ZipFile(io.BytesIO(content))
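
To try these steps without downloading anything, the sketch below first builds a tiny zip archive in memory (standing in for `response.content`; the file names and contents are invented) and then reads it back the same way.

```python
import io
import zipfile

# Build a small zip archive in memory to stand in for response.content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("01_example.csv", "plot,yield\n1,12.5\n2,14.0\n")
    archive.writestr("readme.txt", "not a data file")
content = buffer.getvalue()

# Same steps as in the notebook: wrap the bytes, open as zip, filter the names
zip_file = zipfile.ZipFile(io.BytesIO(content))
zip_names = [n for n in zip_file.namelist() if n.endswith((".xls", ".xlsx", ".csv", ".tab"))]

# Files inside the archive are opened in binary mode
with zip_file.open(zip_names[0]) as file:
    first_line = file.readline().decode("utf-8").strip()

print(zip_names, first_line)
```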

As described above, once the binary data is wrapped in a file-like object and opened as a zip archive, we can list the names of the files it contains.

In [None]:
zip_names = [i for i in zip_file.namelist() if i.endswith(('.xls', '.xlsx', '.csv', '.tab'))]
zip_names

['01_daily_data.xlsx', 'data_dictionary.xlsx']

Finally, using the `with` keyword, we can perform operations like reading the binary data into a pandas **dataframe**.

In [None]:
import pandas as pd

with zip_file.open(zip_names[0]) as file:
    df = pd.read_excel(file, sheet_name=None)

print(list(df.keys()))

df["Hoja1"]

All the code above can be translated to a function as we did before! Let's do it:

In [None]:
def get_data(base_url, content_type="application/json",**params):

    url = f"{base_url}"
    headers = {
        "Content-Type": content_type
    }
    response = requests.get(url, params=params,headers=headers, verify=False)

    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code} {response.reason}")
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    else:
        print(f"{response.ok}")

    content = response.content
    zip_file = zipfile.ZipFile(io.BytesIO(content))
    zip_names = [i for i in zip_file.namelist() if i.endswith(('.xls', '.xlsx', '.csv', '.tab'))]
    print(zip_names)

    return zip_file, zip_names

base_url = "https://data.cipotato.org/api/access/dataset/:persistentId"

params = {
    "persistentId": "doi:10.21223/M97ZQS",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96',
    "format":"original",
}


zip_file, zip_names = get_data(base_url, **params)

with zip_file.open(zip_names[0]) as file:
    df = pd.read_excel(file, sheet_name=None)

print(list(df.keys()))

df['F1_Selec_Crit']

### Exercises 4

1. Write a script to download and load a file from a dataset.

In [None]:
##TODO Exercise 4.1


### Exercises 4 - answers

1. Write a script to download and load a file from a dataset.

In [None]:
import requests, sys
import zipfile, io
import pandas as pd

def get_data(base_url, content_type="application/json",**params):

    url = f"{base_url}"
    headers = {
        "Content-Type": content_type
    }
    response = requests.get(url, params=params,headers=headers, verify=False)

    if response.status_code != 200:
        print(f"Error: HTTP {response.status_code} {response.reason}")
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    else:
        print(f"{response.ok}")

    content = response.content
    zip_file = zipfile.ZipFile(io.BytesIO(content))
    zip_names = [i for i in zip_file.namelist() if i.endswith(('.xls', '.xlsx', '.csv', '.tab'))]
    print(zip_names)

    return zip_file, zip_names

base_url = "https://data.cipotato.org/api/access/dataset/:persistentId"

params = {
    "persistentId": "doi:10.21223/57YIIP",
    'key': '36c4e5ea-7cc3-4e15-b55d-548db5af1b96',
    "format":"original",
}


zip_file, zip_names = get_data(base_url, **params)

with zip_file.open("01_data_yield_Mugurero_2005.xlsx") as file:
    df = pd.read_excel(file, sheet_name=None)

print(list(df.keys()))

df["Sheet1"]


## Retrieve multiple data files with REST API using Python code

The Native API has an endpoint that lists all the files in a dataset. This is useful for getting the **FILE ID** of each file.

Here is an example you can open in the browser: https://data.cipotato.org/api/datasets/:persistentId/versions/:latest/files?persistentId=doi:10.21223/M97ZQS

```
{
  "status": "OK",
  "data": [
    {
      "label": "01_data_F1_Select_crit.tab",
      "restricted": false,
      "version": 4,
      "datasetVersionId": 2569,
      "categories": [
        "Data"
      ],
      "dataFile": {
        "id": 10487,
        "persistentId": "doi:10.21223/M97ZQS/6OXAOK",
        "pidURL": "https://doi.org/10.21223/M97ZQS/6OXAOK",
        "filename": "01_data_F1_Select_crit.tab",
        "contentType": "text/tab-separated-values",
        "filesize": 1090,
        "storageIdentifier": "file://1939d6fb517-a096d1186bf8",
        "originalFileFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "originalFormatLabel": "MS Excel Spreadsheet",
        "originalFileSize": 29370,
        "originalFileName": "01_data_F1_Select_crit.xlsx",
        "UNF": "UNF:6:Pz1aSr645YjmE0Bp8UdepQ==",
        "rootDataFileId": -1,
        "md5": "c1e610f01e13d2e9f424c9b903223c32",
        "checksum": {
          "type": "MD5",
          "value": "c1e610f01e13d2e9f424c9b903223c32"
        },
        "creationDate": "2024-12-06"
      }
    }]
}
```
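
As a minimal offline sketch, the listing above (trimmed here to the fields we need) can be reduced to a lookup table from file ID to file name. The fallback to `label` when `originalFileName` is absent is an assumption about entries that do not carry that key.

```python
import json

# Trimmed copy of the listing shown above
listing = json.loads("""
{"status": "OK",
 "data": [{"label": "01_data_F1_Select_crit.tab",
           "dataFile": {"id": 10487,
                        "originalFileName": "01_data_F1_Select_crit.xlsx"}}]}
""")

# Map each file ID (as a string) to its original name,
# falling back to the label if originalFileName is missing
files = {
    str(entry["dataFile"]["id"]): entry["dataFile"].get("originalFileName", entry["label"])
    for entry in listing["data"]
}
print(files)
```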

When retrieving multiple data files, we can collect the unique file ID of every file in a dataset and then download them all.

In [None]:
import requests, sys
import zipfile, io
import pandas as pd
from pprint import pprint

base_url = "https://data.cipotato.org/api/datasets/:persistentId/versions/:latest/files?persistentId="
doi = "doi:10.21223/M97ZQS"
dataset_url = base_url + doi
response = requests.get(dataset_url, verify=False).json()
files_ids = [str(file["dataFile"]["id"]) for file in response["data"]]
pprint(files_ids)

Note that, as in the search section, we used specific keys of the dictionary (JSON) to retrieve the `id` inside each `dataFile` entry. Now, with the file IDs, we can request specific files.

In [None]:
base_url = "https://data.cipotato.org/api/access/datafiles/"
file_ids = ",".join(files_ids)  # Create a comma-separated list of file IDs
dataurl = base_url + file_ids + "?format=original"
response = requests.get(dataurl, verify=False)
if response.status_code == 200:
    content = response.content


As before, once we have the binary data we can see the file names and load specific files as dataframes.

In [None]:
zip_file = zipfile.ZipFile(io.BytesIO(content))
zip_names = [i for i in zip_file.namelist() if i.endswith(('.xls', '.xlsx', '.csv', '.tab'))]
pprint(zip_names)

In [None]:
with zip_file.open(zip_names[4]) as file:
    df = pd.read_excel(file, sheet_name=None)

print(list(df.keys()))

df = df['F4_ Harvest_Mother'].copy()

['F4_ Harvest_Mother']
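
Picking a file by position, such as `zip_names[4]`, breaks as soon as the order or number of files changes. A sketch of selecting by name pattern instead is shown below; `pick_name` is a hypothetical helper and the file list is invented for illustration (real names come from `zip_file.namelist()`).

```python
from fnmatch import fnmatch

def pick_name(names, pattern):
    """Return the first name matching a shell-style pattern, or None."""
    for name in names:
        # Lower-case both sides so the match ignores case on every platform
        if fnmatch(name.lower(), pattern.lower()):
            return name
    return None

# Hypothetical file list
names = [
    "01_data_F1_Select_crit.xlsx",
    "04_data_F4_Harvest_Mother.xlsx",
    "data_dictionary.xlsx",
]

chosen = pick_name(names, "*harvest_mother*")
print(chosen)
```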


Again, we can make a function if we want to use it multiple times.

In [13]:
# First the helper function

def get_file_ids(base_url, doi):
    dataset_url = base_url + doi
    response = requests.get(dataset_url, verify=False).json()
    files_ids = [str(file["dataFile"]["id"]) for file in response["data"]]
    return files_ids

# Now the main function

def get_data(base_url, file_ids):
    # Build the URL from a comma-separated list of file IDs
    dataurl = base_url + ",".join(file_ids) + "?format=original"
    response = requests.get(dataurl, verify=False)
    # Stop on an HTTP error instead of returning undefined variables
    response.raise_for_status()
    zip_file = zipfile.ZipFile(io.BytesIO(response.content))
    zip_names = [i for i in zip_file.namelist() if i.endswith(('.xls', '.xlsx', '.csv', '.tab'))]
    return zip_file, zip_names



In this case, we can loop the function `get_file_ids` across multiple datasets and select a specific Excel file to load as a dataframe with pandas.

In [None]:
base_url = "https://data.cipotato.org/api/datasets/:persistentId/versions/:latest/files?persistentId="
dois = ["doi:10.21223/M97ZQS", "doi:10.21223/57YIIP", "doi:10.21223/3UT8HI"]

files_ids = []

for doi in dois:
    ids = get_file_ids(base_url, doi)
    files_ids.extend(ids)

pprint(files_ids)


base_url = "https://data.cipotato.org/api/access/datafiles/"
zip_file, zip_names = get_data(base_url, files_ids)
pprint(zip_names)

with zip_file.open(zip_names[11]) as file:
    df = pd.read_excel(file, sheet_name=None)

print(list(df.keys()))

df = df['F4_ Harvest_Mother'].copy()
df


Finally, we can make a plot!

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the standard error
df['SE'] = df.groupby('INSTN')['TTYNA'].transform(lambda x: x.std() / (len(x) ** 0.5))

# Plotting
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='PLOT', y='TTYNA', hue='INSTN', style='INSTN', s=100, palette='deep', legend='full')
plt.errorbar(df['PLOT'], df['TTYNA'], yerr=df['SE'], fmt='none', c='gray', capsize=3)

plt.title('TTYNA by PLOT with Standard Error')
plt.xlabel('PLOT')
plt.ylabel('TTYNA')
plt.legend(title='INSTN')
plt.show()
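
The standard error used for the error bars is the sample standard deviation divided by the square root of the number of observations. A small check of that formula with the standard library follows; the values are invented and `standard_error` is a hypothetical helper.

```python
import math
import statistics

def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))

yields = [10.0, 12.0, 14.0]  # invented values
se = standard_error(yields)
print(round(se, 4))
```

`statistics.stdev` uses the n - 1 denominator, which is also the default of the pandas `std()` call in the cell above.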

### Exercises 5

1. Write a script to load multiple files from at least 3 different experiments.


In [None]:
##TODO Exercise 5.1

### Exercises 5 - answers

1. Write a script to load multiple files from at least 3 different experiments.

In [None]:
base_url = "https://data.cipotato.org/api/datasets/:persistentId/versions/:latest/files?persistentId="
dois = ["doi:10.21223/RHSVIY", "doi:10.21223/7WC7FM", "doi:10.21223/GCJRH6"] # Gastelo experiments

files_ids = []

for doi in dois:
    ids = get_file_ids(base_url, doi)
    files_ids.extend(ids)

pprint(files_ids)


base_url = "https://data.cipotato.org/api/access/datafiles/"
zip_file, zip_names = get_data(base_url, files_ids)
pprint(zip_names)

with zip_file.open(zip_names[3]) as file:
    df = pd.read_excel(file, sheet_name=None)

print(list(df.keys()))

df = df['Fieldbook'].copy()
df
