# Upload pdf files to DataverseNL

- Date of creation: 2025-03-25 by Dorien Huijser
- Date of last edit: 2025-04-11


## Background

In February 2025, a researcher from the Faculty of Social and Behavioural Sciences needed help uploading over a hundred pdf files to DataverseNL.
Many articles published in the Dutch journal "Gedrag en Organisatie" between its inception and 2004 were never published online.
The PsychInfo database does, however, hold metadata for some of these articles, and the researcher had a local archive of the pdf files (the papers) from the journal. Although some papers will inevitably have to be published manually (with manually entered metadata), the papers for which PsychInfo metadata is available *can* be uploaded via a script.

This file contains the Python code that uses the [Dataverse Native API](https://guides.dataverse.org/en/latest/api/native-api.html) to upload to DataverseNL all papers for which:

- There is a pdf file in the researcher's local `archief` folder
- There is metadata from the PsychInfo database
- There is no DOI (a DOI would mean the paper is already available online) and the publication year is before 2004

## Prerequisites

- a folder called 'Archief' that contains all the pdf files to be uploaded. The Archief folder is not included in this repository, but there is a txt file that gives the folder tree (`sourcefiles/GenO_dataverse_foldertree.txt`).
- a `citation.xlsx` file, which is a PsychInfo export file. This file is not included in this repository (due to possible copyright issues), but the column labels are included in the file `sourcefiles/citation_labels.csv`.
- a json file with dataverse metadata fields to be filled. Here this is `dataverse_metadata/metadata_GO_v1.json`
- admin (or at least write) access to the specified Dataverse collection, and thus the API token and Dataverse ID of that collection.
- a Python installation

## Structure of this document

- Import libraries
- Read in psychinfo export file (`citation.xlsx`) in a pandas dataframe
- For every pdf in the Archief folder, get the authors, year, volume and issue and put that in a pandas dataframe (`extract_info` function) > result: `results/01_extracted_path_information_archief_folder.csv`
- Merge the psychinfo export and the Archief results to see which papers overlap, and select only those that are present in both, i.e. papers with a pdf file (from the Archief folder) and sufficient metadata (from the Psychinfo export), so that they can be uploaded to DataverseNL. > result: `results/02_full_merge_archief_and_psychinfo.csv` and subsets (`2a`, `2b`, `2c`).
- For each row in the merged dataframe for which there is both a pdf and psychinfo metadata:
  - Fill the Dataverse metadata JSON (function `create_dv_metadata`)
  - Check whether a dataset with the same title has already been uploaded and, if so, skip it (functions `get_dataverse_dois`, `retrieve_titles`)
  - Create a draft Dataverse dataset with the metadata and retrieve its PID (function `create_dataset`) 
  - Change the publication date to the journal publication date (otherwise every entry gets 2025 as citation date; function `update_citation_date`)
  - Upload the accompanying pdf file to the dataset using the PID (function `upload_data`) > result: `results/03_Files_uploaded_to_Dataverse.csv`

## Import libraries

In [None]:
from pathlib import Path         # For working with paths
import glob                      # For search/matching values with regular expressions
import pandas as pd              # For working with dataframes
import copy                      # To create copies of objects
from datetime import datetime    # To work with dates
import requests                  # For connecting with the Dataverse API
import json                      # To work with JSON files and dictionaries
import os                        # For working with paths
import re                        # For using regular expressions

## Read in the PsychInfo export

In [None]:
# Read in excel file - the first row is empty
psychinfo = pd.read_excel("sourcefiles/citation.xlsx", skiprows = [0])

# Keep only the papers without a DOI and published before 2004 - papers with a DOI are already online and do not have to be uploaded
psychinfo_sel = psychinfo.loc[(psychinfo['DO'].isnull()) & (psychinfo['YR'] < 2004)]

## Retrieve information from the files and paths in the archief directory

Information to retrieve from the path names includes:

- YR (year)
- Author last names
- IP (issue)
- VO (jaargang/volume)
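
As an illustration of how these four pieces of information can be read from a path, here is a minimal, standalone sketch. The example path and the helper name `parse_archive_path` are hypothetical; the folder layout (a year folder such as `1997 (jaargang 10)`, an issue folder such as `GO_1997_3`, and a filename ending in the author last names) is an assumption based on the patterns used in the code in this section.

```python
import re
from pathlib import PurePosixPath

def parse_archive_path(relative_path):
    """Return (year, volume, issue, authors) parsed from a relative pdf path."""
    parts = PurePosixPath(relative_path).parts

    # Year and volume come from the first folder, e.g. "1997 (jaargang 10)"
    year_match = re.search(r"(\d{4})", parts[0])
    volume_match = re.search(r"jaargang (\d+)", parts[0])

    # The issue comes from the second path part, e.g. "GO_1997_3"
    issue_match = re.search(r"GO_(\d+)_(\d+)", parts[1])

    # The author last names come from the filename, after "GO_<year>_<issue>"
    name_match = re.search(
        r"GO_\d{4}_\d+(?:\+\d+)?[ _](.+)\.pdf$", parts[-1], re.IGNORECASE
    )
    authors = (
        [name.strip() for name in name_match.group(1).split(",")]
        if name_match else []
    )

    return (
        year_match.group(1) if year_match else None,
        volume_match.group(1) if volume_match else None,
        issue_match.group(2) if issue_match else None,
        authors,
    )

# Hypothetical example path, not taken from the real archive
example = "1997 (jaargang 10)/GO_1997_3/GO_1997_3 Jansen, De Vries.pdf"
print(parse_archive_path(example))
```

Note that all extracted values are strings, which matters later when they are compared with the PsychInfo columns.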

In [None]:
# Get file paths for all pdf files in the archief directory
archive_directory = Path.cwd().joinpath("archief")
archive_pdfs = list(archive_directory.rglob("*.pdf"))
relative_paths = [pdf.relative_to(archive_directory) for pdf in archive_pdfs]

# Turn paths into a dataframe with 1 column (more columns to come)
path_infos = pd.DataFrame(data = {"path": relative_paths})

# Function to extract information from the path
def extract_info(path):
    # Split the path to get the relevant parts
    path_obj = Path(path)
    
    # Extract the relevant parts
    year_volume = path_obj.parts[0]  # e.g., "1997 (jaargang 10)"
    year_match = re.search(r'(\d{4})', year_volume)
    volume_match = re.search(r'jaargang (\d+)', year_volume)
    
    year = year_match.group(1) if year_match else None
    volume = volume_match.group(1) if volume_match else None
    
    # Extract the issue from the second part
    issue_match = re.search(r'GO_(\d+)_(\d+)', path_obj.parts[1])
    issue = issue_match.group(2) if issue_match else None
    
    # Extract authors from the filename with a RegEx
    filename = path_obj.name  # Get the filename from the path

    # Take into account that in some cases there is a "_" before the author name and sometimes it is a " ".
    # Also, sometimes there is 4+5 in the issue number (instead of a single number) and 1 file starts with 'gO' instead of "GO"
    match = re.search(r"GO_\d{4}_\d+(?:\+\d+)?[ _](.+)", filename, re.IGNORECASE)
    authors = []  # Default when the filename does not follow the expected pattern
    if match:
        # Split authors by comma and strip whitespace
        author_string = re.sub(r"\.pdf$", "", match.group(1)) # Get rid of the .pdf extension
        authors = [author.strip() for author in author_string.split(",")]
    
    return pd.Series([path, volume, issue, year, authors])

# Apply the function to the path_infos DataFrame
for i, row in enumerate(path_infos['path']):
    result = extract_info(row)
    path_infos.loc[i, ['path', 'volume', 'issue', 'year', 'authors']] = result.values

# Write the resulting DataFrame to a csv file
Path("results").mkdir(parents=True, exist_ok=True)
path_infos.to_csv("results/01_extracted_path_information_archief_folder.csv")

## Retrieve papers which have both metadata (psychinfo) and the accompanying file (archief) 

Some papers are not in the psychinfo export, but are in the archief directory. And others are in the psychinfo export, but are not in the archief directory. We only want the entries that are both in the psychinfo export as well as having an accompanying pdf file in the archief directory, because otherwise there is no file to be uploaded or not enough metadata to upload the file.
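
The classification described above can be illustrated without pandas. The sketch below uses plain Python sets of hypothetical keys `(authors, issue, volume, year)` and sorts them into the same three groups that an outer merge with an indicator column produces (`both`, `left_only`, `right_only`). The key values and the helper name `classify_keys` are made up for illustration.

```python
# Hypothetical keys: (authors, issue, volume, year), all components as strings/tuples
psychinfo_keys = {
    (("Jansen", "De Vries"), "3", "10", "1997"),
    (("Bakker",), "1", "11", "1998"),
}
archive_keys = {
    (("Jansen", "De Vries"), "3", "10", "1997"),
    (("Smit",), "2", "9", "1996"),
}

def classify_keys(left, right):
    """Split keys into those present in both sets, only left, or only right."""
    return {
        "both": sorted(left & right),         # metadata and pdf: can be uploaded
        "left_only": sorted(left - right),    # metadata, but no pdf
        "right_only": sorted(right - left),   # pdf, but no metadata
    }

result = classify_keys(psychinfo_keys, archive_keys)
for group, keys in result.items():
    print(group, keys)
```

A key only matches when every component is equal and of the same type: the issue `"3"` (string) does not match `3` (integer). This is why the integer columns of the PsychInfo export are converted to strings, and the author lists to tuples, before merging.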

In [None]:
# Convert columns to the same data type
def int_to_string(df):
    int_columns = df.select_dtypes(include=['int64']).columns.tolist()
    
    if len(int_columns) > 0:
        print(f"Int columns found, converting to string: {int_columns}")
        # Convert integer columns to string type
        df[int_columns] = df[int_columns].astype('string')

int_to_string(psychinfo_sel)

In [None]:
# Prepare the authors columns to be merged
# The paths in the archief directory only contain last names, while the authors in the psychinfo export also contain first names or initials
def extract_last_names(author_str):
    '''get only the last names from the psychinfo export ('AU' column).'''
    authors = author_str.split("\n\n")  # Split into list
    authors = [a.lstrip("* ").split(",")[0] for a in authors]  # Remove '*' and extract last name
    return tuple(authors)  # Convert to tuple (hashable)

psychinfo_sel = psychinfo_sel.copy()  # Avoid SettingWithCopyWarning
psychinfo_sel.loc[:, "authors"] = psychinfo_sel["AU"].apply(extract_last_names)

# Convert path_infos['authors'] to tuples
path_infos = path_infos.copy()  # Avoid SettingWithCopyWarning
path_infos.loc[:, "authors"] = path_infos["authors"].apply(tuple)
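
To make the author preparation concrete, here is a small standalone sketch with a hypothetical `AU` cell and hypothetical filename authors. It assumes, as the code above does, that authors in the export are separated by a blank line, may be prefixed with `*`, and are written as "Last name, initials".

```python
def last_names_from_au(author_cell):
    """Return the last names in a PsychInfo AU cell as a tuple."""
    authors = author_cell.split("\n\n")
    return tuple(a.lstrip("* ").split(",")[0] for a in authors)

au_cell = "*Jansen, P.\n\nDe Vries, A. B."   # hypothetical AU value
filename_authors = ["Jansen", "De Vries"]     # as parsed from a hypothetical filename

psychinfo_key = last_names_from_au(au_cell)
archive_key = tuple(filename_authors)

print(psychinfo_key)                 # ('Jansen', 'De Vries')
print(psychinfo_key == archive_key)  # True
```

Tuples are used instead of lists because merge keys need to be hashable.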

In [None]:
# Rename columns for merging
psychinfo_sel = psychinfo_sel.rename(columns={"IP": "issue", "VO": "volume", "YR": "year"})

# Merge DataFrames on authors, issue, volume, and year
fullmerge = psychinfo_sel.merge(path_infos, on=["authors", "issue", "volume", "year"], how="outer", indicator = "merge_result")

# Recode the merge_result column
rename_merge_result = {"left_only" : "Present in Psychinfo, not in archief", 
                       "right_only" : "Present in archief, not in Psychinfo",
                       "both" : "Can be uploaded to DataverseNL"}

fullmerge['merge_result'] = fullmerge['merge_result'].map(rename_merge_result)

# Sort fullmerge on merge_result
fullmerge = fullmerge.sort_values(by=['merge_result'], ascending = False)

In [None]:
# Create subsets for easier handling/processing and write them to CSV
fullmerge.to_csv("results/02_full_merge_archief_and_psychinfo.csv")

# Write datasets that can be uploaded to dataverse
dataverseready = fullmerge[fullmerge['merge_result'] == "Can be uploaded to DataverseNL"]
dataverseready.to_csv("results/02a_matches_can_be_uploaded_to_dataverse.csv")

# Write datasets that are only in Psychinfo
psychinfo_only = fullmerge[fullmerge['merge_result'] == "Present in Psychinfo, not in archief"]
psychinfo_only.to_csv("results/02b_papers_in_psychinfo_not_in_archief.csv")

# Write datasets that are only in the Archief folder
archief_only = fullmerge[fullmerge['merge_result'] == "Present in archief, not in Psychinfo"]
archief_only.to_csv("results/02c_papers_in_archief_not_in_psychinfo.csv")

## Create Dataverse metadata JSON

- license name: "CC-BY-NC-ND-4.0" 
- license uri: "http://creativecommons.org/licenses/by-nc-nd/4.0"
- title: OT
- alternativeTitle: TI
- journalVolumeIssue > journalPubDate = YR
- author: AU (!! NB every cell contains multiple authors separated by \n)
- notesText = AB
- Journal = JN (field not available in Dataverse)
- journalVolumeIssue > journalVolume = VO
- journalVolumeIssue > journalIssue = IP
- datasetContact > datasetContactEmail: "G&O@uu.nl"
- datasetContact > datasetContactName: "Gedrag & Organisatie"
- datasetContact > datasetContactAffiliation: "Utrecht University"
- dsDescription > dsDescriptionValue: "[TYPE OF ARTICLE] + Gedrag & Organisatie. Gedrag & Organisatie, Tijdschrift voor Sociale, Arbeids- & Organisatiepsychologie, is een wetenschappelijk tijdschrift voor de Nederlandse en Vlaamse markt. Naast wetenschappelijk onderzoek (gebaseerd op kwantitatief en kwalitatief onderzoek) publiceert Gedrag & Organisatie o.a. theoretische uiteenzettingen en overzichtsartikelen."
- dsDescription > dsDescriptionDate: [same date as date of publication]
- Subject: Social Sciences
- depositor: Brenninkmeijer, Veerle
- dateOfDeposit: (current date)
- language: LG
- otherIdAgency: "Tijdschrift voor Sociale, Arbeids- & Organisatiepsychologie"
- otherIdValue: "Gedrag & Organisatie"
- keywords: ID
- topic classification: MH
- other references: RF
- universe: PO
- country: LO and if empty: "Netherlands"
- Article type: MD. Unique values that have to be translated to Dataverse controlled vocabulary
   - Empirical Study; Quantitative Study
   - Empirical Study
   - Empirical Study; Experimental Replication; Longitudinal Study; Quantitative Study
   - nan
   - Literature Review
   - Empirical Study; Longitudinal Study
   - Empirical Study; Qualitative Study
   - Empirical Study; Nonclinical Case Study
   - Empirical Study; Followup Study; Longitudinal Study; Prospective Study
   - Empirical Study; Followup Study
   - Empirical Study; Experimental Replication
   - Empirical Study; Interview
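
Dataverse metadata JSON stores each field as an object with `typeName`, `multiple`, `typeClass` and `value`, and compound fields such as `author` hold a list of sub-field objects. The sketch below shows, on a heavily reduced and hypothetical template (the real `metadata_GO_v1.json` contains many more fields and blocks), how a primitive field and a compound field can be filled. The helper name `fill_citation` and the example values are assumptions for illustration only.

```python
import copy

# Reduced, hypothetical template with one primitive and one compound field
template = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [
                    {"typeName": "title", "multiple": False,
                     "typeClass": "primitive", "value": ""},
                    {"typeName": "author", "multiple": True,
                     "typeClass": "compound",
                     "value": [{"authorName": {"typeName": "authorName",
                                               "multiple": False,
                                               "typeClass": "primitive",
                                               "value": ""}}]},
                ]
            }
        }
    }
}

def fill_citation(template, title, authors):
    """Return a filled copy of the template; the template itself is not changed."""
    data = copy.deepcopy(template)
    fields = data["datasetVersion"]["metadataBlocks"]["citation"]["fields"]
    for field in fields:
        if field["typeName"] == "title":
            field["value"] = title
        elif field["typeName"] == "author":
            # Rebuild the list so that no template placeholders are left over
            field["value"] = [
                {"authorName": {"typeName": "authorName", "multiple": False,
                                "typeClass": "primitive", "value": name}}
                for name in authors
            ]
    return data

filled = fill_citation(template, "Een voorbeeldtitel", ["Jansen, P.", "De Vries, A. B."])
names = [a["authorName"]["value"]
         for a in filled["datasetVersion"]["metadataBlocks"]["citation"]["fields"][1]["value"]]
print(names)
```

Rebuilding the list of authors, rather than updating the template entries in place as the function below does, is one possible design choice: it avoids leftover placeholder entries when a paper has fewer values than the template has slots.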

In [None]:
# Function to recode the MD column into Dataverse vocabulary for the field articleType
def recode_md(md_value):
    '''Recode the MD (methods) column to the Dataverse controlled vocabulary for articleType'''
    #Possibilities in Dataverse: abstract, addendum, announcement, article-commentary, book review, books received, brief report, calendar, case report, collection, correction, data paper, discussion, dissertation, editorial, in brief, introduction, letter, meeting report, news, obituary, oration, partial retraction, product review, rapid communication, reply, reprint, research article, retraction, review article, translation, other
    if not pd.isna(md_value):
        if "Empirical Study" in md_value: return("research article")
        elif "Literature Review" in md_value: return("review article")
        else: return("research article")
    else: return("research article")

# Function to recode the LO column into Dataverse vocabulary for the field country
def recode_lo(lo_value):
    '''Recode the LO (location) column to the Dataverse controlled vocabulary for country'''
    # If the input is a list of values: create a new list as output
    if isinstance(lo_value, list):
        new_lo_value = []
        for listitem in lo_value:
            listitem = listitem.strip() # Remove leading and trailing spaces (e.g., "Netherlands, US" would turn into "Netherlands" and " US")
            if listitem == "US": new_lo_value.append("United States")
            elif listitem == 'nan' or listitem == '': new_lo_value.append('Netherlands')
            else: new_lo_value.append(listitem)
        return(new_lo_value)
    
    # If the input is nan
    elif pd.isna(lo_value): return("Netherlands")
    
    # Recode empty values to Netherlands
    elif lo_value == '': return('Netherlands')
    
    # Other cases
    else: return(lo_value.strip())

In [None]:
# Function to create metadata file
def create_dv_metadata(row):
    '''Create a metadata JSON file that complies with the Dataverse metadata fields.
    Input: a row of the dataframe `dataverseready`
    Output: a json_data object'''
    
    # Open and read the Gedrag & Organisatie metadata file (a file with fewer fields than the complete Dataverse metadata, because not all Dataverse fields will be used)
    with open('dataverse_metadata/metadata_GO_v1.json', 'r') as file:
        jsonfile = json.load(file)
    
    json_data = copy.deepcopy(jsonfile)
    
    ## Get metadata to fill it in the json_data
    # Metadata constant across datasets
    description = "Gedrag & Organisatie. \nGedrag & Organisatie, Tijdschrift voor Sociale, Arbeids- & Organisatiepsychologie, is een wetenschappelijk tijdschrift voor de Nederlandse en Vlaamse markt. Naast wetenschappelijk onderzoek (gebaseerd op kwantitatief en kwalitatief onderzoek) publiceert Gedrag & Organisatie o.a. theoretische uiteenzettingen en overzichtsartikelen."
    descriptiondate = datetime.today().strftime('%Y-%m-%d')
    license_name = "CC-BY-NC-ND-4.0" # NB controlled vocabulary
    license_uri = "http://creativecommons.org/licenses/by-nc-nd/4.0" # NB controlled vocabulary
    subject = ["Social Sciences"] # NB controlled vocabulary
    contactname = ["Gedrag & Organisatie"]
    contactemail = ["G&O@uu.nl"]
    contactaffiliation = ["Utrecht University"]
    depositor = "Brenninkmeijer, Veerle"
    depositdate = datetime.today().strftime('%Y-%m-%d')
    otheridagency = 'Tijdschrift voor Sociale, Arbeids- & Organisatiepsychologie'
    otheridvalue = 'Gedrag & Organisatie'
    kindofdata = ['written text']
    
    # Metadata that change per paper/dataset
    title = row['OT'] if pd.notna(row['OT']) else row['TI'] # sometimes OT is empty and then we take the TI column instead
    author = row['AU'].split("\n\n")
    alttitle = [row['TI']]
    pubdate = str(row['year'])
    notes = row['AB']
    volume = str(row['volume'])
    issue = str(row['issue'])
    country = recode_lo(str(row['LO']).split(",")) # NB controlled vocabulary
    universe = [row['PO']]
    otherrefs = row['RF'].split("\n\n") if pd.notna(row['RF']) else []
    topicclassification = [term.replace('*', '').strip() for term in row['MH'].split("\n\n")]
    language = str(row['LG']).split(",") # NB controlled vocabulary
    articletype = recode_md(row['MD']) # NB  controlled vocabulary
    keywords = [term.strip() for term in row['ID'].split(',')] # Split keywords and strip whitespace

    ## Store the metadata into the json metadata format
    json_data['datasetVersion']['license']['name'] = license_name
    json_data['datasetVersion']['license']['uri'] = license_uri

    # Citation metadata
    fields_citation = json_data["datasetVersion"]["metadataBlocks"]["citation"]["fields"]
    
    for field in fields_citation:
        # Title
        if field["typeName"] == "title":
            field["value"] = title

        # Alternative title (English title)
        if field["typeName"] == "alternativeTitle":
            field["value"] = alttitle

        # Author: has to deal with multiple authors
        if field["typeName"] == "author":
        
            # Access the value list within author
            existing_authors = field["value"]
            num_existing_authors = len(existing_authors)
            
            # Update existing authors
            for i in range(min(num_existing_authors, len(author))):
                existing_authors[i]["authorName"]["value"] = author[i]
                
            # Add new authors if there are more in the authors list
            for i in range(num_existing_authors, len(author)):
                new_author = {
                    "authorName": {
                        "typeName": "authorName",
                        "multiple": False,
                        "typeClass": "primitive",
                        "value": author[i]
                    }
                }
                existing_authors.append(new_author)  # Append new author

        # Description
        if field['typeName'] == "dsDescription":
            for descfield in field["value"]:
                if "dsDescriptionValue" in descfield:
                    descfield["dsDescriptionValue"]["value"] = description
                if "dsDescriptionDate" in descfield:
                    descfield['dsDescriptionDate']['value'] = descriptiondate

        # Subject
        if field["typeName"] == "subject":
            field["value"] = subject

        # Dataset contact
        if field['typeName'] == 'datasetContact':
            # Access the value list within datasetContact
            for i, cnt in enumerate(field["value"]):
                if i < len(contactname):  # Ensure we don't go out of bounds
                    cnt["datasetContactName"]["value"] = contactname[i]
                    cnt["datasetContactEmail"]["value"] = contactemail[i]
                    cnt["datasetContactAffiliation"]["value"] = contactaffiliation[i]

         # Other Id Agency + Value
        if field['typeName'] == 'otherId':
            for otherid in field["value"]:
                otherid["otherIdAgency"]["value"] = otheridagency
                otherid["otherIdValue"]["value"] = otheridvalue

        # Notes (abstract)
        if field["typeName"] == "notesText":
            field["value"] = notes

        # Language
        if field['typeName'] == 'language':
            field['value'] = language

        # Depositor + deposit date
        if field['typeName'] == 'depositor':
            field['value'] = depositor

        if field['typeName'] == 'dateOfDeposit':
            field['value'] = depositdate

        # other references
        if field['typeName'] == 'otherReferences':
            field['value'] = otherrefs

        # Topic classification
        if field['typeName'] == 'topicClassification':
            existing_topics = field['value']
            num_existing_topics = len(existing_topics)

            # Update existing topic field
            for i in range(min(num_existing_topics, len(topicclassification))):
                existing_topics[i]['topicClassValue']['value'] = topicclassification[i]

            # Add new topic classifications if there are more in the topicclassification object
            for i in range(num_existing_topics, len(topicclassification)):
                new_topic = {
                    "topicClassValue": {
                      "typeName": "topicClassValue",
                      "multiple": False,
                      "typeClass": "primitive",
                      "value": topicclassification[i]
                    }
                }
                existing_topics.append(new_topic) # append new topic

        # Keywords
        if field['typeName'] == 'keyword':
            existing_keywords = field['value']
            num_existing_keywords = len(existing_keywords)

            # Update existing keyword field
            for i in range(min(num_existing_keywords, len(keywords))):
                existing_keywords[i]['keywordValue']['value'] = keywords[i]

            # Add new keyword if there are more in the keywords list
            for i in range(num_existing_keywords, len(keywords)):
                new_keyword = {
                    "keywordValue": {
                      "typeName": "keywordValue",
                      "multiple": False,
                      "typeClass": "primitive",
                      "value": keywords[i]
                    }
                }
                existing_keywords.append(new_keyword) # append new keyword

        # Kind of data
        if field['typeName'] == 'kindOfData':
            field['value'] = kindofdata

    # Geospatial metadata
    fields_geo = json_data["datasetVersion"]["metadataBlocks"]["geospatial"]["fields"]

    for field in fields_geo:
        # Country
        if field['typeName'] == 'geographicCoverage':
        #for geo in field['value']:
        #    geo['country']['value'] = country
            existing_covfields = field["value"]
            num_countries = len(existing_covfields)

            # Update existing countries
            for i in range(min(num_countries, len(country))):
                existing_covfields[i]['country']['value'] = country[i]

            # Add new countries if there are more in the country list
            for i in range(num_countries, len(country)):
                new_country = {
                    "country": {
                        "typeName": "country",
                        "multiple": False,
                        "typeClass": "controlledVocabulary",
                        "value": country[i]
                    }
                }
                existing_covfields.append(new_country) # Append new country

    # Social sciences and humanities metadata
    fields_social = json_data["datasetVersion"]["metadataBlocks"]["socialscience"]["fields"]
    for field in fields_social:
        if field['typeName'] == 'universe':
            field['value'] = universe

    # Journal metadata
    fields_journal = json_data["datasetVersion"]["metadataBlocks"]["journal"]["fields"]

    for field in fields_journal:
    
        # Issue, volume, publication date
        if field['typeName'] == 'journalVolumeIssue':
            for vol_is in field['value']:
                vol_is['journalVolume']['value'] = volume
                vol_is['journalIssue']['value'] = issue
                vol_is['journalPubDate']['value'] = pubdate

        # Article type
        if field['typeName'] == 'journalArticleType':
            field['value'] = articletype

    # Return the modified JSON
    #print("Here is the resulting Json file:\n\n", json.dumps(json_data, indent=2))
    return(json_data)

## Check if a dataset has already been created with the same title

In [None]:
# Check if the paper has already been uploaded: we don't want duplicates!
def get_dataverse_dois(base_url, dv_parent_alias, api_token):
    '''Get a list of DOIs from the datasets in the specified Dataverse collection'''
    headers = {
        'X-Dataverse-key': api_token
    }
    
    # List dataverse contents
    request = requests.get('%s/api/dataverses/%s/contents' % (base_url, dv_parent_alias), headers = headers)
    response_data = request.json()
    
    # Extract the list of persistent identifiers
    # (items without a persistentUrl, such as sub-collections, are skipped)
    persistent_urls = [
        item['persistentUrl'].replace('https://doi.org/', 'doi:') 
        for item in response_data['data']
        if 'persistentUrl' in item
    ]

    return(persistent_urls)

# Retrieve the titles for each DOI in the dataverse
def retrieve_titles(persistent_urls, base_url, api_token):
    '''Get the corresponding titles from a list of Dataverse persistent identifiers'''
    headers = {
        'X-Dataverse-key': api_token
    }
    
    titles = []
    for doi in persistent_urls:
        request = requests.get('%s/api/datasets/:persistentId/?persistentId=%s' % (base_url, doi), headers = headers)
        response_data = request.json()
        
        the_title = None  # Reset per dataset, so a missing title is not replaced by the previous one
        for field in response_data['data']['latestVersion']['metadataBlocks']['citation']['fields']:
            if field["typeName"] == "title":
                the_title = field["value"]
        titles.append(the_title)
        
    return(titles)

## Create a dataset with the metadata in Dataverse and retrieve its PID

In [None]:
# Create a draft dataset using the Dataverse Native API
def create_dataset(json_data, base_url, api_token, dv_parent_alias):    
    # Prepare the headers
    headers = {
        'X-Dataverse-key': api_token,
        'Content-Type': 'application/json'  # Set the content type
    }
    
    # Post request
    response = requests.post(f'{base_url}/api/dataverses/{dv_parent_alias}/datasets', 
                             headers = headers, 
                             data = json.dumps(json_data))
    
    # Check the response
    if response.status_code == 201:
        # Retrieve DOI (persistent identifier / pid)
        pid = response.json()['data']['persistentId']
        return(pid)
    else:
        print(f"Failed to create dataset: {response.status_code}, {response.text}")
        #print("\nDataset:", json_data)
        return None

## Change the publication date to the journal publication date

By default Dataverse uses the date of uploading/publishing as the citation date. We want this to be the date that the journal article was published.
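
The request that changes the citation date is a `PUT` whose body is simply the name of the metadata field to use. As a sketch of what is sent, the example below only builds the request with the standard library (`urllib`) and does not send it, so it can be run without network access or a real token. The helper name `build_citation_date_request`, the example DOI and the token are hypothetical; the function further below uses `requests` instead.

```python
from urllib.parse import quote
from urllib.request import Request

def build_citation_date_request(base_url, pid, api_token, field_name="journalPubDate"):
    """Build (but do not send) the PUT request that sets the citation date field."""
    url = (
        f"{base_url}/api/datasets/:persistentId/citationdate"
        f"?persistentId={quote(pid, safe=':/')}"
    )
    return Request(
        url,
        data=field_name.encode("utf-8"),          # request body: the field name
        headers={"X-Dataverse-key": api_token},   # API token as header
        method="PUT",
    )

# Hypothetical PID and token, for illustration only
req = build_citation_date_request("https://dataverse.nl", "doi:10.1234/EXAMPLE", "xxxx")
print(req.get_method(), req.full_url)
print(req.data)
```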

In [None]:
def update_citation_date(pid, base_url, api_token):
    headers = {
        'X-Dataverse-key': api_token
    }

    payload = 'journalPubDate'
    
    # Making the PUT request with the correct URL and payload
    response = requests.put(
        f'{base_url}/api/datasets/:persistentId/citationdate?persistentId={pid}',
        headers = headers, 
        data = payload
    )
    
    # Checking if the request was successful
    if response.status_code == 200:
        return response.json()
    else:
        return f"Error: {response.status_code} - {response.text}"

## Upload the corresponding file to Dataverse

Using the PID, upload the corresponding file to Dataverse as well.
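
Besides the file itself, the upload request carries a `jsonData` form field with options for the file, encoded as a JSON string. The following sketch shows how the file path and that payload could be assembled before the request is made. The helper name `build_upload_payload`, the example path and the description text are hypothetical; `description` and `restrict` are options of the add-file API.

```python
import json
from pathlib import Path

def build_upload_payload(archive_dir, relative_path, description=""):
    """Return the full file path and the form payload for the add-file request."""
    path = Path(archive_dir) / relative_path
    params = {
        "description": description,  # optional file description
        "restrict": "false",         # do not restrict access to the file
    }
    return path, {"jsonData": json.dumps(params)}

# Hypothetical relative path, for illustration only
path, payload = build_upload_payload(
    "archief",
    "1997 (jaargang 10)/GO_1997_3/GO_1997_3 Jansen, De Vries.pdf",
    description="Article pdf",
)
print(path.name)
print(payload["jsonData"])
```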

In [None]:
def upload_data(pid, file_path, base_url, api_token):
    # Get the full file path to be uploaded to DataverseNL
    path = Path(os.path.join("archief", file_path))

    # Do not restrict access to the uploaded file
    params = {"restrict":"false"} #dict(description = title)
    params_as_json_string = json.dumps(params)
    payload = dict(jsonData = params_as_json_string)
    
    # Code based on: https://guides.dataverse.org/en/latest/api/native-api.html#add-file-api
    # Open the file in binary mode
    with open(path, 'rb') as file:
        files = {'file': (path.name, file)}  # pass the filename and the opened file

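        # Add file using the Dataset's persistentId; the API token is sent as a header
        # (as in the other functions) so that it does not end up in the request URL
        url_persistent_id = '%s/api/datasets/:persistentId/add?persistentId=%s' % (base_url, pid)
        r = requests.post(url_persistent_id, 
                          headers = {'X-Dataverse-key': api_token},
                          data = payload, 
                          files = files)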

        # Check the response
        if r.status_code == 200:
            print("Successfully uploaded file: ", file_path)
        else:
            print("Failed to upload file: ", r.status_code, r.text)

## Perform all the steps using 1 master function

In [None]:
# Umbrella function to apply the Dataverse-related functions to the dataverseready dataframe
def process_dataframe(df, base_url, dv_parent_alias, api_token):
    '''Master function that performs all substeps in a row.
    First, a new column called pid is created to store the newly created DOIs.
    Then, for every row in the provided dataframe:
    1. Create Dataverse-compliant metadata
    2. Check if a dataset with the same title is already present in the provided Dataverse collection
    3. If not: create a Dataverse dataset using the metadata
    4. Update the citation date to match the journal publication date (otherwise the citation date will be the date of upload)
    5. Upload the corresponding data file
    '''
    
    # Create a new column for pids in the dataframe
    df['pid'] = None

    # Prep for checking if a dataset has already been uploaded
    dois = get_dataverse_dois(base_url, dv_parent_alias, api_token)
    titles = retrieve_titles(dois, base_url, api_token)

    # Perform the steps per row in the dataframe
    for index, row in df.iterrows():
        #print("row index: ", index)
        
        # Step 1: Create metadata JSON
        json_data = create_dv_metadata(row)

        # Step 2: Check if a dataset with the same title has already been created
        title = row['OT'] if pd.notna(row['OT']) else row['TI']
        
        if title not in titles: # Only if there is no dataset with the same title yet
        
            # Step 3: Create dataset and get the PID
            pid = create_dataset(json_data, base_url, api_token, dv_parent_alias)
            
            if pid:  # if the dataset was successfully created
                # Step 4: Update the citation date to the journal publication date
                update_citation_date(pid, base_url, api_token)
                
                # Step 5: Upload the data file
                upload_data(pid, row['path'], base_url, api_token)
        else:
            print("There is already a dataset with the same name in this Dataverse collection, skipping this row")
            print("Index: ", index, "  Title: ", title)
            pid = "Already present in Dataverse"
    
        # Save the pid at the correct index in the dataframe
        df.at[index, 'pid'] = pid
    print("Done uploading to Dataverse")
    return df

In [None]:
# Get Dataverse API token from a dv_api_token.txt file in my local parent folder
parent_folder = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Read the API token from the file
with open(os.path.join(parent_folder, 'dv_api_token.txt'), 'r') as file:
    api_token = file.read().strip()

In [None]:
# Apply all the Dataverse-related functions
base_url =  "https://dataverse.nl"                    # Base URL of your Dataverse instance, without trailing slash (e.g. https://data.aussda.at)
#api_token = "********-****-****-****-************"   # API token of a Dataverse user with proper rights to create a Dataset (DO NOT SHARE)
dv_parent_alias = "GenO_Archive"                      # Alias of the Dataverse collection to which the datasets should be added

# Pass a copy: process_dataframe adds a column, and dataverseready is a subset of fullmerge
completed_df = process_dataframe(dataverseready.copy(), base_url, dv_parent_alias, api_token)

# Write to CSV
completed_df.to_csv("results/03_Files_uploaded_to_Dataverse.csv")