# DataONE Replication Demo

This notebook demonstrates how to use the DataONE replication service to replicate data from a DataONE Member repository (MN) to another member repository. It uses the DataONE Python libraries to interact with the DataONE APIs.
The DataONE Python libraries are hosted on GitHub at [github.com/DataONEorg/d1_python](https://github.com/DataONEorg/d1_python) with documentation at [dataone-python.readthedocs.io](https://dataone-python.readthedocs.io/en/latest/). We also provide the [DataONE R client packages](https://dataoneorg.r-universe.dev/dataone) for R users (also downloadable from [CRAN](https://cran.r-project.org/web/packages/dataone/index.html)).

## Overview of replication

![DataONE Replication](images/dataone-replication.png)

## Installation

Installation under a virtual environment is recommended. [`uv`](https://docs.astral.sh/uv/) is a Python package and project manager that can also create and manage virtual environments, much as `virtualenv` does. You can use any virtual environment tool that lets you install the following packages, including the dependencies for a Jupyter notebook. We demonstrate with Visual Studio Code and its Jupyter extension, but any Jupyter notebook environment will work.

uv:

```bash
mkdir d1-replication-demo
cd d1-replication-demo
uv init
uv add dataone.cli 
uv add ipykernel
```

While you could create your own virtual environment and notebook, we have created a template to work from that includes functions that will be helpful during the demo. Clone or download the repository from https://github.com/DataONEorg/d1-replication-demo. Once you have a copy of the repository, change your working directory and install the dependencies:

```bash
cd d1-replication-demo
uv sync
```
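
Before opening the notebook, it can help to confirm that the environment actually provides the modules the demo imports. The following is a minimal sketch, assuming that installing `dataone.cli` makes the `d1_client` and `d1_common` modules importable; the helper name `missing_modules` is hypothetical and not part of any DataONE library.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# d1_client and d1_common are imported later in this notebook;
# ipykernel lets Jupyter run Python cells in this environment.
missing = missing_modules(["d1_client", "d1_common", "ipykernel"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Run it with the interpreter from the virtual environment (for example through `uv run`), since the system Python may see a different set of packages.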

## Create a dataset for testing

Let's create a dataset for testing replication in a DataONE testing environment, particularly using the [KNB Test Repository](https://dev.nceas.ucsb.edu/). 

![KNB](images/knb-banner.png)

The dataset will consist of a single CSV file and the metadata describing it, but could in theory contain as many objects as needed.

### Login to the repository using your ORCID identity

DataONE uses ORCID as a common user identifier for sharing across the network of repositories. If you do not have an ORCID account, you can create one at [orcid.org](https://orcid.org/).

Click the 'Sign In' button in the top right corner of the KNB Test Repository page, and then click the 'Sign in with ORCID' button. You will be redirected to the ORCID login page. After logging in, you will be redirected back to the KNB Test Repository.

![KNB Sign In](images/knb-login.png)

### Save your DataONE API key

Later we will use the DataONE API to interact with the repository, which requires an API key associated with your ORCID account. To find it, click your name in the top right corner of the KNB Test Repository, then select 'My profile'. In the 'Session' tab, your API key is listed under 'Authentication Token'. Copy this key and save it to a file named `jwtkey` in the current directory, with the token on the first line and nothing else. The key authenticates your requests to the DataONE API when we work in Python.

![KNB API Key](images/knb-auth-token.png)
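
Because the token is a JWT, you can inspect its payload locally to see who it identifies and when it expires. The sketch below uses only the standard library and does not verify the signature, so treat it as a convenience check rather than validation. It assumes the token carries the standard `sub` and `exp` claims; the function names and the demo token are hypothetical.

```python
import base64
import datetime
import json


def decode_jwt_payload(token):
    """Decode the payload (second segment) of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with the '=' padding stripped.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_expiry(token):
    """Return the token's expiry as an aware UTC datetime, or None if absent."""
    exp = decode_jwt_payload(token).get("exp")
    if exp is None:
        return None
    return datetime.datetime.fromtimestamp(exp, tz=datetime.timezone.utc)


# Build a throwaway, unsigned token purely to demonstrate the helpers.
claims = {"sub": "http://orcid.org/0000-0000-0000-0000", "exp": 1753400000}
segment = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
demo_token = f"header.{segment}.signature"
print(decode_jwt_payload(demo_token)["sub"])
print(token_expiry(demo_token))
```

To check your own token, read the first line of `jwtkey` and pass it to `token_expiry()`. An expired token is a common reason for authentication failures later in the demo.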

### Create a new dataset 

Now let's create a simple single file dataset. Click 'Submit' to start a new dataset, and then pick a small CSV file to upload (we provide one in the example repository `data` directory if you don't have one handy). You can also add a PNG image to the dataset, but this is not required. 

![KNB Submit](images/knb-submit.png)

Once you have added your files, you'll need to provide the minimal metadata required to create the dataset. This includes a title, abstract, and name and email for the creator and contact. You can also add additional metadata if you wish, but this is not required for the demo. You can find the required metadata fields as they are marked with a red asterisk (*).

![KNB Dataset](images/knb-dataset.png)

# Replication across DataONE

Now that we have a dataset in the KNB Test Repository, we can replicate it to another DataONE Member Node (MN). The replication service allows us to create copies of the dataset in other repositories, which can be useful for backup and redundancy purposes. Each object in DataONE is referenced via its Persistent Identifier (PID), which is a unique identifier for the object and visible in the citation on the dataset landing page. The PID is used to retrieve the object from the repository, and it is also used to set the replication policy for the object.
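
To make the role of the PID concrete, here is a small sketch of how a PID maps onto DataONE REST URLs. It assumes the version 2 REST layout in which an object is served at `<base>/v2/object/<pid>` and its system metadata at `<base>/v2/meta/<pid>`; the helper `rest_url` is hypothetical, and the Python client used in the rest of this notebook builds these URLs for you.

```python
from urllib.parse import quote


def rest_url(base_url, resource, pid, version="v2"):
    """Build a DataONE REST URL such as <base>/v2/meta/<encoded pid>."""
    # safe="" also encodes ':' and '/' so the PID stays a single path segment.
    return f"{base_url.rstrip('/')}/{version}/{resource}/{quote(pid, safe='')}"


base_url = "https://dev.nceas.ucsb.edu/knb/d1/mn"
pid = "urn:uuid:c530b4fe-3a14-43b0-8894-dd239b21fc56"
print(rest_url(base_url, "object", pid))  # the object's bytes
print(rest_url(base_url, "meta", pid))    # its system metadata
```

Encoding the PID matters because identifiers often contain characters such as `:` or `/` that would otherwise be misread as URL structure.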

## Getting a repository client

Let's start by importing the necessary libraries and setting up the DataONE client to interact with the KNB Test Repository. We will use the `d1_client` library to create a client for the KNB Test Repository, which is a member repository in the test network.

The DataONE Python clients provide an abstraction of the DataONE APIs, enabling interaction with DataONE Member Nodes (MNs) and Coordinating Nodes (CNs). Since the APIs of MNs and CNs differ, there are two basic types of client, one for Member repositories (MNs) and one for Coordinating Nodes (CNs), both derived from a common base.

In [27]:
import datetime
import d1_client.mnclient_2_0
import d1_client.cnclient_2_0
import d1_common.types.dataoneTypes  # provides NodeReference, used when setting preferred nodes
import d1_common.types.exceptions


In [2]:
# Authentication token (JWT) from the DataONE Test Environment
# Create a file named jwtkey with the token on the first line and no other lines
with open("jwtkey", "r") as f:
    # Read only the first line so a trailing blank line cannot overwrite the token
    auth_token = f.readline().strip()

# Set the token in the request header
options: dict = {"headers": {"Authorization": "Bearer " + auth_token}}

# Base URL of the KNB Test Repository
base_url = "https://dev.nceas.ucsb.edu/knb/d1/mn"
repo = d1_client.mnclient_2_0.MemberNodeClient_2_0(base_url, **options)

# Call the getCapabilities API method
# https://dataone-architecture-documentation.readthedocs.io/en/latest/apis/MN_APIs.html#MNCore.getCapabilities
node = repo.getCapabilities()

# Response is an instance of a Node document that can be accessed through its properties
# https://dataone-architecture-documentation.readthedocs.io/en/latest/apis/Types.html#Types.Node
print(f"{node.name:30} {node.baseURL}")


KNB Test Node                  https://dev.nceas.ucsb.edu/knb/d1/mn


Note that the variable `repo` represents the KNB Test Repository, and we can use that variable to manipulate our dataset. 

## Getting a CN client

In addition to the repository, we can also interact with the DataONE Coordinating Nodes (CNs), which provide network-wide services such as replication, synchronization, and search. We can create a client for the CNs using the `d1_client.cnclient_2_0` module, and then use it to verify that our saved API key is recognized and valid. You should see your ORCID credentials printed in the output of the cell below.

In [3]:
# The Staging CN base URL:
base_url = "https://cn-stage.test.dataone.org/cn"
cn = d1_client.cnclient_2_0.CoordinatingNodeClient_2_0(base_url, **options)

creds = cn.echoCredentials()
print(creds.toxml())


<?xml version="1.0" ?><ns1:subjectInfo xmlns:ns1="http://ns.dataone.org/service/types/v1"><person><subject>http://orcid.org/0000-0003-0077-4738</subject><givenName>Matthew B.</givenName><familyName>Jones</familyName><isMemberOf>CN=DBO,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=SASAP,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=cerp-sfwmd-admins,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=delta-swg-23,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=SASAP Data Team,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=Likely Suspects Framework Users,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=drp-portal-admins,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=knb-data-admins,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=DataONE-Support,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=drp-data-admins,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=sctld-admins,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=cib-admins,DC=dataone,DC=org</isMemberOf><isMemberOf>CN=cib-curators,DC=dataone,DC=org</isMemberOf><isMemb
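
The raw XML is hard to read, so a short summary of the subject and group memberships can be easier to check. The sketch below parses a `subjectInfo` document with the standard library. It assumes the structure shown in the output above, where only the root element is namespace-qualified; the function name and the sample values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def summarize_subject_info(xml_text):
    """Return (subject, full name, groups) for the first person listed, or None."""
    root = ET.fromstring(xml_text)
    # In the output above the child elements carry no namespace prefix,
    # so plain tag names are enough to find them.
    person = root.find("person")
    if person is None:
        return None
    subject = person.findtext("subject")
    name_parts = (person.findtext("givenName"), person.findtext("familyName"))
    name = " ".join(part for part in name_parts if part)
    groups = [group.text for group in person.findall("isMemberOf")]
    return subject, name, groups


# Placeholder document modeled on the echoCredentials output.
sample = (
    '<?xml version="1.0" ?>'
    '<ns1:subjectInfo xmlns:ns1="http://ns.dataone.org/service/types/v1">'
    "<person><subject>http://orcid.org/0000-0000-0000-0000</subject>"
    "<givenName>Ada</givenName><familyName>Example</familyName>"
    "<isMemberOf>CN=example-group,DC=dataone,DC=org</isMemberOf></person>"
    "</ns1:subjectInfo>"
)
print(summarize_subject_info(sample))
```

In the notebook you could pass `creds.toxml()` to `summarize_subject_info()` in place of the sample string.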

## Functions for viewing system metadata

Each object in DataONE is identified by its PID and is associated with system metadata that describes the object, including its size, format, and replication policy. The system metadata is stored in the repository and can be retrieved using the DataONE API. Let's create a `view` function to print system metadata for a given PID. This function will be used to view the system metadata for our dataset and any replicated objects.
 

In [4]:
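# Timestamp format used by toStr(); an ISO 8601 layout is assumed here
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

def toStr(v):
    if isinstance(v, datetime.datetime):
        return v.strftime(DATE_FORMAT)
    return str(v)

def propertyStr(p):
    '''String representation of a pyxb property
    '''
    if p is None:
        return ""
    try:
        return toStr(p.value())
    except (AttributeError, TypeError):
        # Plain values have no value() accessor
        return toStr(p)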

def view(sysmeta, title="SystemMetadata"):
    '''Print SystemMetadata properties
    '''
    print(f"\n{title}:")
    print(f"  Identifier: {sysmeta.identifier.value()}")
    print(f"  Series Identifier: {propertyStr(sysmeta.seriesId)}")
    print(f"  Modified: {sysmeta.dateSysMetadataModified}")
    print(f"  Uploaded: {sysmeta.dateUploaded}")
    print(f"  Format ID: {sysmeta.formatId}")
    print(f"  Size: {sysmeta.size}")
    print(f"  Checksum: hash://{sysmeta.checksum.algorithm.lower()}/{sysmeta.checksum.value()}")
    print(f"  Origin Member Node: {propertyStr(sysmeta.originMemberNode)}")
    print(f"  Authoritative Member Node: {propertyStr(sysmeta.authoritativeMemberNode)}")
    print(f"  Obsoletes: {propertyStr(sysmeta.obsoletes)}")
    print(f"  Obsoleted By: {propertyStr(sysmeta.obsoletedBy)}")
    print("  Access policy rules:")
    for rule in sysmeta.accessPolicy.allow:
        print(f"    {', '.join(map(lambda S: S.value(), rule.subject))}  can  {', '.join(rule.permission)}")
    print("  Replication policy:")
    print(f"    Replication allowed: {sysmeta.replicationPolicy.replicationAllowed}")
    print(f"    Replicas requested: {sysmeta.replicationPolicy.numberReplicas}")
    for node in sysmeta.replicationPolicy.preferredMemberNode:
        print(f"    Preferred node: {node.value()}")
    for node in sysmeta.replicationPolicy.blockedMemberNode:
        print(f"    Blocked node: {node.value()}")
    print("  Replicas of this object:")
    for replica in sysmeta.replica:
        print(f"    {replica.replicaMemberNode.value():15} {replica.replicationStatus:10} {replica.replicaVerified}")

## Function to change a replication policy

Changing the replication policy allows setting how many replicas of an object are requested, which nodes are preferred for replication, and which nodes are blocked from replication. The replication policy is part of the SystemMetadata for an object. We'll create a function to change key aspects of a replication policy, including:

- the number of replicas requested,
- the preferred nodes for replication

The function will also display the metadata before and after the change was made, so you can see the effect of the change.


In [5]:
def change_replication_policy(repo, pid, number_replicas=0, preferred_node="urn:node:mnStageUCSB2"):
    '''Set the replication policy for pid and show the system metadata before and after.

    Replication is enabled when number_replicas > 0 and disabled otherwise.
    '''

    try:
        sysmeta = repo.getSystemMetadata(pid)
        view(sysmeta, "Original SystemMetadata")
        if number_replicas > 0:
            sysmeta.replicationPolicy.replicationAllowed = True
            sysmeta.replicationPolicy.numberReplicas = number_replicas
            if preferred_node is not None:
                # Record the node that should be preferred as a replica target
                sysmeta.replicationPolicy.preferredMemberNode = [
                    d1_common.types.dataoneTypes.NodeReference(preferred_node)]
        else:
            sysmeta.replicationPolicy.replicationAllowed = False
            sysmeta.replicationPolicy.numberReplicas = 0

        updated_flag = repo.updateSystemMetadata(pid, sysmeta)
        if updated_flag:
            sysmeta_new = repo.getSystemMetadata(pid)
            view(sysmeta_new, "Updated SystemMetadata")

    except d1_common.types.exceptions.DataONEException as e:
        print(e)

## Update the repository with the new replication policy

Now we can call the function with the PID of the metadata object in our dataset and request a single replica. Replace the PID below with the identifier shown in the citation on your own dataset's landing page.



In [None]:
# Request one replica of the metadata object; the function prints the system metadata before and after
pid_meta = "urn:uuid:c530b4fe-3a14-43b0-8894-dd239b21fc56"
#pid_rmap = "resource_map_urn:uuid:3a06877a-3941-412f-a495-da8608fd4f94"
#pid_png = "urn:uuid:0f93e94d-6509-4c93-8ea1-a6f7129488ef"
change_replication_policy(repo, pid_meta, number_replicas=1)


Original SystemMetadata:
  Identifier: urn:uuid:c530b4fe-3a14-43b0-8894-dd239b21fc56
  Series Identifier: 
  Modified: 2025-07-24 05:27:45.384000+00:00
  Uploaded: 2025-07-24 05:27:44.776000+00:00
  Format ID: https://eml.ecoinformatics.org/eml-2.2.0
  Size: 2476
  Checksum: hash://md5/57860f0890f9dd92a29dd0a7d779cfdb
  Origin Member Node: urn:node:mnTestKNB
  Authoritative Member Node: urn:node:mnTestKNB
  Obsoletes: 
  Obsoleted By: 
  Access policy rules:
    public  can  read
    CN=knb-data-admins,DC=dataone,DC=org  can  read, write, changePermission
  Replication policy:
    Replication allowed: false
    Replicas requested: None
  Replicas of this object:

Updated SystemMetadata:
  Identifier: urn:uuid:c530b4fe-3a14-43b0-8894-dd239b21fc56
  Series Identifier: 
  Modified: 2025-07-24 05:29:52.713000+00:00
  Uploaded: 2025-07-24 05:27:44.776000+00:00
  Format ID: https://eml.ecoinformatics.org/eml-2.2.0
  Size: 2476
  Checksum: hash://md5/57860f0890f9dd92a29dd0a7d779cfdb
  Origin

In [26]:
try:
    sysmeta2 = cn.getSystemMetadata(pid_meta)
    view(sysmeta2, "Current SystemMetadata")
except d1_common.types.exceptions.DataONEException as e:
    print(e)


Current SystemMetadata:
  Identifier: urn:uuid:c530b4fe-3a14-43b0-8894-dd239b21fc56
  Series Identifier: 
  Modified: 2025-07-24 05:29:52.713000+00:00
  Uploaded: 2025-07-24 05:27:44.776000+00:00
  Format ID: https://eml.ecoinformatics.org/eml-2.2.0
  Size: 2476
  Checksum: hash://md5/57860f0890f9dd92a29dd0a7d779cfdb
  Origin Member Node: urn:node:mnTestKNB
  Authoritative Member Node: urn:node:mnTestKNB
  Obsoletes: 
  Obsoleted By: 
  Access policy rules:
    public  can  read
    CN=knb-data-admins,DC=dataone,DC=org  can  read, write, changePermission
  Replication policy:
    Replication allowed: true
    Replicas requested: 1
    Preferred node: urn:node:mnStageUCSB2
  Replicas of this object:
    urn:node:mnTestKNB completed  2025-07-24 05:29:55.075000+00:00
    urn:node:cnStage completed  2025-07-24 05:29:55.151000+00:00
    urn:node:mnStageUCSB2 completed  2025-07-24 05:30:49.365000+00:00
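
Replication is asynchronous, so the replica on the target node may take a while to appear in the CN's copy of the system metadata. A minimal polling sketch follows. It relies only on the `replica`, `replicaMemberNode`, and `replicationStatus` properties already used in `view()`; the function names, the retry count, and the delay are hypothetical choices.

```python
import time


def completed_replica_nodes(sysmeta):
    """Return node IDs whose replica entry has status 'completed'."""
    return [
        replica.replicaMemberNode.value()
        for replica in sysmeta.replica
        if str(replica.replicationStatus) == "completed"
    ]


def wait_for_replica(get_sysmeta, pid, target_node, attempts=10, delay=30, sleep=time.sleep):
    """Poll get_sysmeta(pid) until target_node reports a completed replica.

    Returns True as soon as the replica is complete, or False once all
    attempts are used up.
    """
    for attempt in range(attempts):
        if target_node in completed_replica_nodes(get_sysmeta(pid)):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before asking again
    return False
```

In this notebook the call would look like `wait_for_replica(cn.getSystemMetadata, pid_meta, "urn:node:mnStageUCSB2")`. Passing the fetch function as an argument keeps the helper independent of any particular client.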


# Data Packages

Data packages in DataONE are essentially graphs describing the relationships between objects in a dataset. They are represented as [ORE Resource Maps](https://dataoneorg.github.io/api-documentation/design/DataPackage.html), which are special files that contain metadata about the dataset and its constituent objects.

![Data Package](images/data-package.png)

To change the replication policy for an entire dataset, we need to find all of the members of the data package and then change the replication policy for each member. For example, here is a DataONE query executed with `curl` that uses the Solr index in DataONE to find all of the members of a data package:

```bash
curl -s "https://cn-stage.test.dataone.org/cn/v2/query/solr/?q=identifier:urn\:uuid\:c530b4fe-3a14-43b0-8894-dd239b21fc56&fl=resourceMap,documents&wt=json" | jq .response.docs
[
  {
    "documents": [
      "urn:uuid:c530b4fe-3a14-43b0-8894-dd239b21fc56",
      "urn:uuid:c9330b48-ba9a-4a41-8f8a-e8b8cd88650d",
      "urn:uuid:23f752b5-7950-447e-901b-70464bdeee54"
    ],
    "resourceMap": [
      "resource_map_urn:uuid:7a0b9271-4525-4334-845f-6a16602d857b"
    ]
  }
]
```

By iterating over the list of identifiers and calling the `change_replication_policy()` function, we can change the replication policy for each member of the data package. The `change_replication_policy()` function will also display the system metadata before and after the change, so you can see the effect of the change.
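
The query and the iteration described above can be sketched in Python as follows. This is a minimal illustration using only the standard library: it mirrors the `curl` query, escapes only backslashes and colons in the PID, and sends no authentication header, so it assumes the objects are publicly readable. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CN_BASE_URL = "https://cn-stage.test.dataone.org/cn"


def solr_query_url(pid, base_url=CN_BASE_URL):
    """Build the Solr query URL used in the curl command above."""
    # Minimal escaping: only backslashes and colons, as in the curl example.
    escaped = pid.replace("\\", "\\\\").replace(":", "\\:")
    params = urllib.parse.urlencode(
        {"q": f"identifier:{escaped}", "fl": "resourceMap,documents", "wt": "json"}
    )
    return f"{base_url}/v2/query/solr/?{params}"


def package_members(docs):
    """Return resource maps followed by documented objects, without duplicates."""
    members = []
    for doc in docs:
        for pid in doc.get("resourceMap", []) + doc.get("documents", []):
            if pid not in members:
                members.append(pid)
    return members


def fetch_package_members(pid, base_url=CN_BASE_URL):
    """Query the CN Solr index and return the identifiers in the package."""
    with urllib.request.urlopen(solr_query_url(pid, base_url)) as response:
        docs = json.load(response)["response"]["docs"]
    return package_members(docs)
```

With these helpers, updating a whole package amounts to looping over `fetch_package_members(pid_meta)` and calling `change_replication_policy(repo, member, number_replicas=1)` for each `member`. The resource map is included in the list so that the package description is replicated along with its contents.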


# DataONE Replication Summary

![DataONE Replication](./images/dataone-replication.png)
