# Restrict all publications in a Dataverse and Deaccession V1

- Date of creation: 2025-03-31 by Dorien Huijser
- Date of last edit: 2025-04-11

## Background

Because of potential copyright issues, all publications in the GenO Archive Dataverse (<https://dataverse.nl/dataverse/GenO_Archive>) need to be set to restricted access in a new version, after which the previous version needs to be deaccessioned.

## Prerequisites

- Admin access to the Dataverse collection, and an API token for that account
- A Python installation with the `pandas` and `requests` packages

## ⚠️ Current issues with this script ⚠️

⚠️ At the moment, it is not yet possible to restrict files via the API in the DataverseNL installation of Dataverse, or at least I have not found a way to do so. The function `restrict_file` in this notebook therefore does not work: before files can be restricted, you have to either add terms of access or enable access requests, and that is not possible via the API (but will be, I think: <https://github.com/IQSS/dataverse/pull/11349>).

## Structure of this document

- Import libraries
- Find all DOIs and dataset IDs of the GenO Archive dataverse, and put them together in a pandas dataframe (function `get_dois_and_ids`)
- Look for the file ID(s) per dataset and add them to the pandas dataframe (function `retrieve_file_ids`)
- ⚠️ Set all files to restricted access > Does not work (yet?) (function `restrict_file`)
- Change the publication date to the Journal publication date (function `update_citation_date`)
- Publish as new version of the dataset (function `publish_dataset`)
- Deaccession version 1 of each dataset (function `deaccession_dataset`)
 
## Import libraries

In [None]:
import pandas as pd              # For working with dataframes
import requests                  # For connecting with the Dataverse API
import json                      # For deaccessioning the dataset
import os                        # For reading in the api token txt file

## Get all DOIs and dataset IDs from the dataverse collection

In [None]:
def get_dois_and_ids(base_url, dv_parent_alias, api_token):
    '''Get a dataframe of DOIs and dataset IDs from the datasets in the specified Dataverse collection'''
    headers = {
        'X-Dataverse-key': api_token
    }
    
    # List dataverse contents
    request = requests.get(f'{base_url}/api/dataverses/{dv_parent_alias}/contents', headers = headers)
    response_data = request.json()
    
    # Extract the list of persistent identifiers
    persistent_urls = [
        item['persistentUrl'].replace('https://doi.org/', 'doi:') 
        for item in response_data['data']
    ]

    # Extract the dataset ids
    ds_ids = [item['id'] for item in response_data['data']]

    # Put both in a pandas dataframe
    result = pd.DataFrame({'persistent_urls': persistent_urls, 
                           'ds_ids': ds_ids})
    return result
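
The contents endpoint may also list nested collections next to datasets. Such items have no `persistentUrl`, so the list comprehension above would raise a `KeyError` on them. The following is a minimal sketch of how the parsed response could be filtered first. It assumes that each item carries a `type` field with the value `dataset` for datasets; the helper name `extract_datasets` and the sample payload are made up for illustration.

```python
import pandas as pd

def extract_datasets(contents):
    '''Build a dataframe of DOIs and dataset IDs from a parsed contents response.

    `contents` is the dictionary returned by request.json(). Items that are
    not datasets (for example nested collections) are skipped.
    '''
    rows = [
        {
            'persistent_urls': item['persistentUrl'].replace('https://doi.org/', 'doi:'),
            'ds_ids': item['id'],
        }
        for item in contents.get('data', [])
        # Assumption: datasets are marked with type == 'dataset'
        if item.get('type') == 'dataset' and 'persistentUrl' in item
    ]
    return pd.DataFrame(rows, columns=['persistent_urls', 'ds_ids'])

# Made-up payload for illustration only (not real GenO Archive data)
sample = {
    'status': 'OK',
    'data': [
        {'type': 'dataset', 'id': 101, 'persistentUrl': 'https://doi.org/10.1234/ABC'},
        {'type': 'dataverse', 'id': 7, 'title': 'A nested collection'},
    ],
}
datasets = extract_datasets(sample)
print(datasets)
```

Keeping the parsing separate from the HTTP request also makes it possible to check the logic on a saved response without calling the API.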

## Get the file IDs per dataset

In [None]:
# Retrieve the file ids for each DOI in the dataverse
def retrieve_file_ids(persistent_urls, base_url, api_token):
    '''Get the corresponding file ids from a list of Dataverse persistent identifiers'''
    headers = {
        'X-Dataverse-key': api_token
    }

    # Create an empty pandas dataframe that will be filled
    doi_files = pd.DataFrame({"persistent_urls": [],
                             "file_ids": []})
    
    for doi in persistent_urls:
        file_ids = []
        request = requests.get(f'{base_url}/api/datasets/:persistentId/?persistentId={doi}', headers = headers)
        response_data = request.json()

        # Retrieve all the file IDs for this DOI
        for field in response_data['data']['latestVersion']['files']:
            file_id = field["dataFile"].get("id")
            # Only keep entries that actually have a file ID
            if file_id:
                file_ids.append(str(file_id))

        # Add the file IDs to the corresponding DOI in the doi_files dataframe
        tempdf = pd.DataFrame({'persistent_urls': doi, 'file_ids': [file_ids]})
        doi_files = pd.concat([doi_files, tempdf], ignore_index=True)
        
    return doi_files
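
`retrieve_file_ids` stores one list of file IDs per DOI. For inspecting or counting files, a table with one row per file can be easier to work with. The sketch below shows one way to get there with `DataFrame.explode`; the DOIs and file IDs are made up, and the names `per_file` and `with_files` are only illustrative.

```python
import pandas as pd

# Made-up example in the same shape as the output of retrieve_file_ids
doi_files = pd.DataFrame({
    'persistent_urls': ['doi:10.1234/ABC', 'doi:10.1234/DEF'],
    'file_ids': [['11', '12'], []],
})

# One row per file; a dataset without files becomes a single row with NaN
per_file = doi_files.explode('file_ids', ignore_index=True)

# Keep only the rows that refer to an actual file
with_files = per_file.dropna(subset=['file_ids'])
print(with_files)
```

Datasets that end up with `NaN` in `per_file` are the ones without any files, which is a quick way to see which datasets the restriction step would skip.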

## Restrict the files

In [None]:
# This function does not work yet. The API responds with:
# {"status":"ERROR","message":"Terms of Use and Access are invalid. You must enable request access or add terms of access in datasets with restricted files."}

def restrict_file(fileid, base_url, api_token):
    '''Restrict specified files in a Dataverse upload. This creates a new draft of the dataset.'''
    headers = {
        'X-Dataverse-key': api_token
    }

    data = 'true'
    
    response = requests.put(f'{base_url}/api/files/{fileid}/restrict', headers = headers, data = data)
    
    # Checking if the request was successful
    if response.status_code == 200:
        return response.json()
    else:
        return f"Error: {response.status_code} - {response.text}"

## Change the publication date to the journal publication date

By default Dataverse uses the date of uploading/publishing as the citation date. We want this to be the date that the journal article was published.

In [None]:
def update_citation_date(pid, base_url, api_token):
    '''Use the journalPubDate field of a dataset as its citation date'''
    headers = {
        'X-Dataverse-key': api_token
    }

    payload = 'journalPubDate' # Take the field journalPubDate as citation date
    
    # Making the PUT request with the correct URL and payload
    response = requests.put(
        f'{base_url}/api/datasets/:persistentId/citationdate?persistentId={pid}',
        headers = headers, 
        data = payload
    )
    
    # Checking if the request was successful
    if response.status_code == 200:
        return response.json()
    else:
        return f"Error: {response.status_code} - {response.text}"

## Publish the dataset in a new version

In [None]:
def publish_dataset(pid, base_url, api_token, major_or_minor):
    '''Publish the draft of a dataset as a new 'major' or 'minor' version'''
    headers = {
        'X-Dataverse-key': api_token
    } 

    response = requests.post(
        f'{base_url}/api/datasets/:persistentId/actions/:publish?persistentId={pid}&type={major_or_minor}',
        headers = headers)
    
    # Checking if the request was successful
    if response.status_code == 200:
        return response.json()
    else:
        return f"Error: {response.status_code} - {response.text}"

## Deaccession the old version of the dataset

In [None]:
def deaccession_dataset(ds_id, version, api_token, base_url):
    '''Deaccession the given version (for example "1.0") of a dataset'''
    # Define the deaccession reason and other details
    deaccession_data = {
        "deaccessionReason": "File restricted due to copyright, see v2.0 of this dataset"
        # "deaccessionForwardURL": "https://demo.dataverse.nl"
    }
    
    # Convert the deaccession data to JSON format
    json_data = json.dumps(deaccession_data)

    # Set the headers
    headers = {
        'X-Dataverse-key': api_token,
        'Content-Type': 'application/json'  # The request body is sent as JSON
    }

    response = requests.post(f'{base_url}/api/datasets/{ds_id}/versions/{version}/deaccession', 
                             headers = headers, 
                             data = json_data)

    # Checking if the request was successful
    if response.status_code == 200:
        return response.json()
    else:
        return f"Error: {response.status_code} - {response.text}"

## Execute all created functions

In [None]:
# Get Dataverse API token from a dv_api_token.txt file in my local parent folder
parent_folder = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Read the API token from the file
with open(os.path.join(parent_folder, 'dv_api_token.txt'), 'r') as file:
    api_token = file.read().strip()

base_url = "https://dataverse.nl"
dv_parent_alias = "GenO_Archive"
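
As an alternative to keeping the token only in a text file, it could also be read from an environment variable, with the file as a fallback. The following is a minimal sketch; the helper `load_api_token` and the variable name `DATAVERSE_API_TOKEN` are assumptions chosen for this example, not something Dataverse itself prescribes.

```python
import os

def load_api_token(token_file, env_var='DATAVERSE_API_TOKEN'):
    '''Return the API token from an environment variable, or else from a text file.'''
    # Prefer the environment variable when it is set and not empty
    token = os.environ.get(env_var, '').strip()
    if token:
        return token

    # Fall back to the token file
    with open(token_file, 'r') as file:
        token = file.read().strip()
    if not token:
        raise ValueError(f'No API token found in {env_var} or {token_file}')
    return token

# Example usage with the same file location as above:
# api_token = load_api_token(os.path.join(parent_folder, 'dv_api_token.txt'))
```

Either way, the token should stay out of the notebook itself so that it is not shared along with the code.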

In [None]:
# Step 1: Retrieve all DOIs, dataset IDs and file IDs
all_datasets = get_dois_and_ids(base_url, dv_parent_alias, api_token)
fileidsdf = retrieve_file_ids(all_datasets['persistent_urls'], base_url, api_token)

all_datasets_inclfiles = all_datasets.merge(fileidsdf, on = 'persistent_urls', how = 'outer')
print(all_datasets_inclfiles)

In [None]:
# Steps to perform for each dataset.
# NB: This used to be one giant for loop, but because the restrict_file function does not work,
# the steps are split into separate loops so that each one can be run on its own.

# Step 2: Restrict all files of this dataset (if it has any file IDs)
# NOTE: This cell is currently not executed, because restrict_file does not work yet
for i, doi in enumerate(all_datasets_inclfiles['persistent_urls']):
    print(f'Attempting to restrict dataset {i}: {doi}')

    # file_ids is a list of IDs; it is empty (or NaN after the merge) if there are no files
    file_ids = all_datasets_inclfiles['file_ids'].iloc[i]
    if isinstance(file_ids, list) and len(file_ids) > 0:
        print("   Dataset does have file IDs")
        for file_id in file_ids:
            print(f'   Restricting file: {file_id}')
            print(f'   {restrict_file(file_id, base_url, api_token)}')
    else:
        print(f"No files present in dataset {doi}, skipping to the next dataset")

In [None]:
# Step 3: Update citation date to journal publication date
for i, doi in enumerate(all_datasets_inclfiles['persistent_urls']):
    print(f'Updating citation of dataset {i}: {doi}')
    # Print the response so that errors do not go unnoticed
    print(f'   {update_citation_date(doi, base_url, api_token)}')

In [None]:
# Step 4: Publish a new version of this dataset
for i, doi in enumerate(all_datasets_inclfiles['persistent_urls']):
    print(f'Publishing a new version of dataset {i}: {doi}')
    # Print the response so that errors do not go unnoticed
    print(f"   {publish_dataset(doi, base_url, api_token, 'major')}")

In [None]:
# Step 5: Deaccession v1.0
for i, doi in enumerate(all_datasets_inclfiles['persistent_urls']):
    print(f'Deaccessioning dataset {i}: {doi}')
    # Print the response so that errors do not go unnoticed
    ds_id = all_datasets_inclfiles['ds_ids'].iloc[i]
    print(f'   {deaccession_dataset(ds_id, "1.0", api_token, base_url)}')
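
The functions in this notebook return either the parsed JSON response or a string that starts with `Error:`. When many datasets are processed, it can help to collect those return values per step and list the failures afterwards. The sketch below shows one possible way to do that; the helper `summarise_step` and the example results are made up for illustration and are not real API responses.

```python
import pandas as pd

def summarise_step(step_name, results):
    '''Summarise the outcome of one step.

    `results` maps each DOI to the value returned by one of the functions
    above: a dictionary on success, or a string starting with 'Error:'.
    '''
    rows = []
    for doi, result in results.items():
        failed = isinstance(result, str) and result.startswith('Error:')
        rows.append({
            'step': step_name,
            'doi': doi,
            'ok': not failed,
            'detail': result if failed else '',
        })
    return pd.DataFrame(rows, columns=['step', 'doi', 'ok', 'detail'])

# Made-up results for illustration only
example_results = {
    'doi:10.1234/ABC': {'status': 'OK'},
    'doi:10.1234/DEF': 'Error: 403 - not permitted',
}
summary = summarise_step('publish', example_results)

# DOIs for which the step failed and may need to be repeated
failed_dois = summary.loc[~summary['ok'], 'doi'].tolist()
print(failed_dois)
```

In the loops above, the dictionary could be filled with, for example, `results[doi] = publish_dataset(doi, base_url, api_token, 'major')`, so that a later step such as deaccessioning is only run for datasets where the earlier steps succeeded.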