# DataONE Replication Demo

This notebook demonstrates how to use the DataONE replication service to replicate data from one DataONE Member Node (MN) repository to another. It uses the DataONE Python libraries to interact with the DataONE APIs.

The DataONE Python libraries are on GitHub at [github.com/DataONEorg/d1_python](https://github.com/DataONEorg/d1_python) with documentation at [dataone-python.readthedocs.io](https://dataone-python.readthedocs.io/en/latest/). For R users, we also provide the [DataONE R client packages](https://dataoneorg.r-universe.dev/dataone), which are also available from [CRAN](https://cran.r-project.org/web/packages/dataone/index.html).

## Installation

Installation under a virtual environment is recommended. [`uv`](https://docs.astral.sh/uv/) is a tool for creating and managing Python virtual environments. Like `virtualenv`, it creates isolated environments, and it also manages project dependencies and Python versions. You can use any virtual environment tool that lets you install the following packages, including the dependencies for a Jupyter notebook. We'll be demonstrating with Visual Studio Code and the Jupyter extension, but any Jupyter notebook environment will work.

To set up a fresh project with `uv`:

```bash
mkdir d1-replication-demo
cd d1-replication-demo
uv init
uv add dataone.cli 
uv add ipykernel
```

While you could create your own virtual environment and notebook, we have created a template to work from that includes functions that will be helpful during the demo. Clone or download the repository from https://github.com/DataONEorg/d1-replication-demo. Once you have a copy of the repository, change your working directory and install the dependencies:

```bash
cd d1-replication-demo
uv sync
```

## Create a dataset for testing

Let's create a dataset for testing replication in a DataONE testing environment, specifically the [KNB Test Repository](https://dev.nceas.ucsb.edu/).

![KNB](images/knb-banner.png)

The dataset will consist of a CSV file, a PNG image, and the metadata describing them, but could in principle contain as many objects as needed.

### Login to the repository using your ORCID identity

DataONE uses ORCID as a common user identifier for sharing across the network of repositories. If you do not have an ORCID account, you can create one at [orcid.org](https://orcid.org/).

Click the 'Sign In' button in the top right corner of the KNB Test Repository page, and then click the 'Sign in with ORCID' button. You will be redirected to the ORCID login page. After logging in, you will be redirected back to the KNB Test Repository.

![KNB Sign In](images/knb-login.png)

### Save your DataONE API key

Later we will use the DataONE API to interact with the repository from Python. Requests to the API are authenticated with an API key associated with your ORCID account. To find it, click your name in the top right corner of the KNB Test Repository, then select 'My Profile'. In the 'Session' tab, the API key is listed under 'Authentication Token'. Copy this key and save it to a file named `jwtkey` in the current directory.

![KNB API Key](images/knb-auth-token.png)
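
Tokens expire, so it can be useful to check one before starting the demo. The following is a minimal sketch, assuming the saved token is a standard three-part JWT that carries an `exp` claim. The helpers `jwt_claims` and `jwt_expiry` are hypothetical names introduced here, not part of the DataONE libraries, and the payload is decoded without verifying the signature.

```python
import base64
import datetime
import json
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying it."""
    payload = token.split(".")[1]
    # base64url segments omit padding; restore it before decoding
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def jwt_expiry(token: str) -> Optional[datetime.datetime]:
    """Return the expiry time from the `exp` claim, or None if it is absent."""
    exp = jwt_claims(token).get("exp")
    if exp is None:
        return None
    return datetime.datetime.fromtimestamp(exp, tz=datetime.timezone.utc)


# Usage, assuming the token was saved to `jwtkey` as described above:
# with open("jwtkey") as f:
#     print(jwt_expiry(f.readline().strip()))
```

If the reported time is in the past, copy a fresh token from the repository before continuing.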

### Create a new dataset 

### Add a CSV file and PNG image

### Save the dataset



## Getting a client

The DataONE Python clients provide an abstraction of the DataONE APIs, enabling interaction with DataONE Member Nodes (MNs) and Coordinating Nodes (CNs). Since the APIs of MNs and CNs differ, there are two basic types of client, one for Member Nodes and one for Coordinating Nodes, both deriving from a common base. Different versions of the DataONE API are in use, though most MNs have upgraded to version 2, and all CNs operate with the version 2 API.
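
Underneath the client classes, each API method maps to a REST URL built from the node's base URL, the API version, and a resource path. The sketch below illustrates that mapping, assuming the layout described in the DataONE architecture documentation (`/v2/node` for `getCapabilities`, `/v2/meta/{pid}` for `getSystemMetadata`). `rest_url` is a hypothetical helper for illustration only; the client libraries build these URLs for you.

```python
from urllib.parse import quote


def rest_url(base_url: str, *path: str, version: str = "v2") -> str:
    """Join a base URL, an API version, and percent-encoded path segments."""
    # Identifiers may contain ':' or '/', so encode every reserved character
    segments = [quote(p, safe="") for p in path]
    return "/".join([base_url.rstrip("/"), version, *segments])


mn_base = "https://dev.nceas.ucsb.edu/knb/d1/mn"

print(rest_url(mn_base, "node"))
# https://dev.nceas.ucsb.edu/knb/d1/mn/v2/node

print(rest_url(mn_base, "meta", "urn:uuid:0f93e94d-6509-4c93-8ea1-a6f7129488ef"))
# https://dev.nceas.ucsb.edu/knb/d1/mn/v2/meta/urn%3Auuid%3A0f93e94d-6509-4c93-8ea1-a6f7129488ef
```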

In [137]:
import datetime
import d1_client.mnclient_2_0
import d1_client.cnclient_2_0
import d1_common.types.exceptions

In [138]:
# Authentication token (JWT) from the DataONE Test Environment
# Create a file named jwtkey with the token on the first line and no other lines
with open("jwtkey", "r") as f:
    # Read only the first line so a trailing blank line cannot overwrite the token
    auth_token = f.readline().strip()

# Set the token in the request header
options: dict = {"headers": {"Authorization": "Bearer " + auth_token}}

# Base URL of the KNB Test Repository
base_url = "https://dev.nceas.ucsb.edu/knb/d1/mn"
repo = d1_client.mnclient_2_0.MemberNodeClient_2_0(base_url, **options)

# Call the getCapabilities API method
# https://dataone-architecture-documentation.readthedocs.io/en/latest/apis/MN_APIs.html#MNCore.getCapabilities
node = repo.getCapabilities()

# Response is an instance of a Node document that can be accessed through its properties
# https://dataone-architecture-documentation.readthedocs.io/en/latest/apis/Types.html#Types.Node
print(f"{node.name:30} {node.baseURL}")


KNB Test Node                  https://dev.nceas.ucsb.edu/knb/d1/mn


### Getting a CN client

In [139]:
# The Staging CN base URL:
base_url = "https://cn-stage.test.dataone.org/cn"
cn = d1_client.cnclient_2_0.CoordinatingNodeClient_2_0(base_url, **options)

# Call the listNodes API method
# https://dataone-architecture-documentation.readthedocs.io/en/latest/apis/CN_APIs.html#CNCore.listNodes
nodes = cn.listNodes()

# Response is a structure mirroring the API response message structure, 
# in this case a list of Node objects
# https://dataone-architecture-documentation.readthedocs.io/en/latest/apis/Types.html#Types.NodeList
for node in nodes.node:
    print(f"{node.name:30}\n  {node.identifier.value()}\t {node.baseURL}")

creds = cn.echoCredentials()
print(creds.toxml())

cn-stage                      
  urn:node:cnStage	 https://cn-stage.test.dataone.org/cn
cn-stage-orc-1                
  urn:node:cnStageORC1	 https://cn-stage-orc-1.test.dataone.org/cn
UCSB Stage Dedicated Replica Server 3
  urn:node:mnStageUCSB3	 https://mn-stage-ucsb-3.test.dataone.org/metacat/d1/mn
mn-stage-unm-1                
  urn:node:mnStageUNM1	 https://mn-stage-unm-1.test.dataone.org/mn
UCSB Stage Dedicated Replica Server 2
  urn:node:mnStageUCSB2	 https://mn-stage-ucsb-2.test.dataone.org/metacat/d1/mn
cn-stage-ucsb-1               
  urn:node:cnStageUCSB1	 https://cn-stage-ucsb-1.test.dataone.org/cn
cn-stage-unm-1                
  urn:node:cnStageUNM1	 https://cn-stage-unm-1.test.dataone.org/cn
PISCO Test MN                 
  urn:node:mnStagePISCO	 http://test.piscoweb.org/catalog/d1/mn
SEAD Virtual Archive          
  urn:node:mnTestSEAD	 http://d2i-dev.d2i.indiana.edu:8081/sead/rest/mn
mnTestDRYAD JSON-LD node      
  urn:node:mnTestDRYAD	 https://so.test.dataone.org/m

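## Get SystemMetadata for a given persistent identifier (PID)

SystemMetadata records an object's identifiers, format, size, checksum, access policy, and replication state. The helper functions below print those properties for a SystemMetadata document returned by `getSystemMetadata`.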


In [141]:
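# Assumed format for printing datetime values; DATE_FORMAT is not defined
# elsewhere in this notebook, so an ISO 8601 layout is used here
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

def toStr(v):
    '''String representation of a value, formatting datetimes with DATE_FORMAT
    '''
    if isinstance(v, datetime.datetime):
        return v.strftime(DATE_FORMAT)
    return str(v)

def propertyStr(p):
    '''String representation of pyxb property
    '''
    if p is None:
        return ""
    try:
        # Complex pyxb types expose their content through value()
        return toStr(p.value())
    except AttributeError:
        # Simple values (strings, dates) are converted directly
        return toStr(p)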

def view(sysmeta, title="SystemMetadata"):
    '''Print SystemMetadata properties
    '''
    print(f"\n{title}:")
    print(f"  Identifier: {sysmeta.identifier.value()}")
    print(f"  Series Identifier: {propertyStr(sysmeta.seriesId)}")
    print(f"  Modified: {sysmeta.dateSysMetadataModified}")
    print(f"  Uploaded: {sysmeta.dateUploaded}")
    print(f"  Format ID: {sysmeta.formatId}")
    print(f"  Size: {sysmeta.size}")
    print(f"  Checksum: hash://{sysmeta.checksum.algorithm.lower()}/{sysmeta.checksum.value()}")
    print(f"  Origin Member Node: {propertyStr(sysmeta.originMemberNode)}")
    print(f"  Authoritative Member Node: {propertyStr(sysmeta.authoritativeMemberNode)}")
    print(f"  Obsoletes: {propertyStr(sysmeta.obsoletes)}")
    print(f"  Obsoleted By: {propertyStr(sysmeta.obsoletedBy)}")
    print("  Access policy rules:")
    # accessPolicy and replicationPolicy are optional in SystemMetadata
    if sysmeta.accessPolicy is not None:
        for rule in sysmeta.accessPolicy.allow:
            subjects = ', '.join(s.value() for s in rule.subject)
            print(f"    {subjects}  can  {', '.join(rule.permission)}")
    print("  Replication policy:")
    policy = sysmeta.replicationPolicy
    if policy is not None:
        print(f"    Replication allowed: {policy.replicationAllowed}")
        print(f"    Replicas requested: {policy.numberReplicas}")
        for node in policy.preferredMemberNode:
            print(f"    Preferred node: {node.value()}")
        for node in policy.blockedMemberNode:
            print(f"    Blocked node: {node.value()}")
    print("  Replicas of this object:")
    for replica in sysmeta.replica:
        print(f"    {replica.replicaMemberNode.value():15} {replica.replicationStatus:10} {replica.replicaVerified}")
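
The `hash://` value printed by `view` can be recomputed locally to confirm that a downloaded copy of an object matches its SystemMetadata. This is a minimal sketch using only the standard library; `hash_uri` is a hypothetical helper, and it assumes the checksum algorithm name maps to a `hashlib` algorithm once lowercased and stripped of hyphens (for example `MD5` or `SHA-256`).

```python
import hashlib


def hash_uri(data: bytes, algorithm: str = "MD5") -> str:
    """Return a hash:// URI in the same form that view() prints."""
    label = algorithm.lower()
    # hashlib names carry no hyphen, e.g. "sha-256" -> "sha256"
    digest = hashlib.new(label.replace("-", ""), data).hexdigest()
    return f"hash://{label}/{digest}"


print(hash_uri(b""))
# hash://md5/d41d8cd98f00b204e9800998ecf8427e
```

For bytes `data` that you have downloaded, comparing `hash_uri(data, sysmeta.checksum.algorithm)` with the checksum line printed by `view(sysmeta)` shows whether the content is intact.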

## Function to change a replication policy

Changing the replication policy allows setting how many replicas of an object are requested, which nodes are preferred for replication, and which nodes are blocked from replication. The replication policy is part of the SystemMetadata for an object.

In this block, we 1) retrieve the SystemMetadata for a given PID, 2) change the replication policy to request a specific number of replicas and nodes, and 3) update the SystemMetadata with the new replication policy.
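
To make the three policy settings concrete, here is a small plain-Python model of how the number of replicas, preferred nodes, and blocked nodes could interact when choosing targets. It is an illustration under stated assumptions, not the Coordinating Node's actual selection algorithm; `ReplicationPolicy` and `candidate_targets` are hypothetical names that do not exist in the DataONE libraries.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationPolicy:
    """Plain-Python stand-in for the replication policy (illustrative only)."""
    replication_allowed: bool = False
    number_replicas: int = 0
    preferred: list = field(default_factory=list)
    blocked: list = field(default_factory=list)


def candidate_targets(policy: ReplicationPolicy, available: list) -> list:
    """Preferred nodes first, blocked nodes excluded, capped at number_replicas."""
    if not policy.replication_allowed or policy.number_replicas <= 0:
        return []
    allowed = [n for n in available if n not in policy.blocked]
    # sorted() is stable, so nodes keep their input order within each group
    ranked = sorted(allowed, key=lambda n: n not in policy.preferred)
    return ranked[: policy.number_replicas]


policy = ReplicationPolicy(
    replication_allowed=True,
    number_replicas=1,
    preferred=["urn:node:mnStageUCSB2"],
)
nodes = ["urn:node:mnStageUNM1", "urn:node:mnStageUCSB2", "urn:node:mnStageUCSB3"]
print(candidate_targets(policy, nodes))
# ['urn:node:mnStageUCSB2']
```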


In [142]:
import d1_common.types.dataoneTypes

def change_replication_policy(repo, pid, number_replicas=0,
                              preferred_node="urn:node:mnStageUCSB2"):
    '''Enable replication with the requested number of replicas and preferred
    node, or disable replication when number_replicas is 0
    '''

    try:
        sysmeta = repo.getSystemMetadata(pid)
        view(sysmeta, "Original SystemMetadata")
        if number_replicas > 0:
            sysmeta.replicationPolicy.replicationAllowed = True
            sysmeta.replicationPolicy.numberReplicas = number_replicas
            if preferred_node is not None:
                sysmeta.replicationPolicy.preferredMemberNode = [
                    d1_common.types.dataoneTypes.NodeReference(preferred_node)]
        else:
            sysmeta.replicationPolicy.replicationAllowed = False
            sysmeta.replicationPolicy.numberReplicas = 0

        # Send the modified SystemMetadata back, then re-read it to confirm
        updated_flag = repo.updateSystemMetadata(pid, sysmeta)
        if updated_flag:
            sysmeta_new = repo.getSystemMetadata(pid)
            view(sysmeta_new, "Updated SystemMetadata")

    except d1_common.types.exceptions.DataONEException as e:
        print(e)

## Update the repository with the new replication policy


In [148]:
# Identifiers of the objects in the test dataset; request one replica of the PNG image
pid_meta = "urn:uuid:ca5f81cc-7b02-41ed-a316-a03e2b4aea57"
pid_csv = "urn:uuid:82079214-7e3e-4c52-a117-90f497a430ea"
pid_rmap = "resource_map_urn:uuid:3a06877a-3941-412f-a495-da8608fd4f94"
pid_png = "urn:uuid:0f93e94d-6509-4c93-8ea1-a6f7129488ef"
change_replication_policy(repo, pid_png, number_replicas=1)


Original SystemMetadata:
  Identifier: urn:uuid:0f93e94d-6509-4c93-8ea1-a6f7129488ef
  Series Identifier: 
  Modified: 2025-07-11 20:07:29.700000+00:00
  Uploaded: 2025-07-11 20:07:29.640000+00:00
  Format ID: image/png
  Size: 114406
  Checksum: hash://md5/dc610593c6ffdc78d7c085232ef90a8c
  Origin Member Node: urn:node:mnTestKNB
  Authoritative Member Node: urn:node:mnTestKNB
  Obsoletes: 
  Obsoleted By: 
  Access policy rules:
    CN=knb-data-admins,DC=dataone,DC=org  can  read, write, changePermission
    public  can  read
  Replication policy:
    Replication allowed: false
    Replicas requested: None
  Replicas of this object:

Updated SystemMetadata:
  Identifier: urn:uuid:0f93e94d-6509-4c93-8ea1-a6f7129488ef
  Series Identifier: 
  Modified: 2025-07-11 20:13:24.551000+00:00
  Uploaded: 2025-07-11 20:07:29.640000+00:00
  Format ID: image/png
  Size: 114406
  Checksum: hash://md5/dc610593c6ffdc78d7c085232ef90a8c
  Origin Member Node: urn:node:mnTestKNB
  Authoritative Member No

In [150]:
try:
    sysmeta = repo.getSystemMetadata(pid_png)
    view(sysmeta, "Current SystemMetadata")
except d1_common.types.exceptions.DataONEException as e:
    print(e)


Current SystemMetadata:
  Identifier: urn:uuid:0f93e94d-6509-4c93-8ea1-a6f7129488ef
  Series Identifier: 
  Modified: 2025-07-11 20:13:24.551000+00:00
  Uploaded: 2025-07-11 20:07:29.640000+00:00
  Format ID: image/png
  Size: 114406
  Checksum: hash://md5/dc610593c6ffdc78d7c085232ef90a8c
  Origin Member Node: urn:node:mnTestKNB
  Authoritative Member Node: urn:node:mnTestKNB
  Obsoletes: 
  Obsoleted By: 
  Access policy rules:
    CN=knb-data-admins,DC=dataone,DC=org  can  read, write, changePermission
    public  can  read
  Replication policy:
    Replication allowed: true
    Replicas requested: 1
    Preferred node: urn:node:mnStageUCSB2
  Replicas of this object:
    urn:node:mnTestKNB completed  2025-07-11 20:09:12.612000+00:00
    urn:node:mnStageUCSB2 completed  2025-07-11 20:14:48.702000+00:00
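
In the output above, the replica on `urn:node:mnStageUCSB2` was verified a little over a minute after the policy changed, so a check made immediately after the update may not show it yet. The following sketch polls until the replica appears. `wait_for_replica` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions; it relies only on the `replica`, `replicaMemberNode`, and `replicationStatus` properties already used by `view`.

```python
import time


def wait_for_replica(get_sysmeta, pid, node_id, attempts=10, delay=30.0,
                     sleep=time.sleep):
    """Poll system metadata until node_id reports a completed replica of pid."""
    for attempt in range(attempts):
        sysmeta = get_sysmeta(pid)
        for replica in sysmeta.replica:
            if (replica.replicaMemberNode.value() == node_id
                    and str(replica.replicationStatus) == "completed"):
                return True
        # Wait between attempts, but not after the final one
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage with the client and identifier defined earlier in this notebook:
# wait_for_replica(repo.getSystemMetadata, pid_png, "urn:node:mnStageUCSB2")
```

Passing the lookup function and `sleep` as arguments keeps the helper independent of any particular client and makes it easy to exercise without a network connection.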
