In [1]:
from hdmf.common import DynamicTable, VectorData
from hdmf.term_set import TermSet

from pynwb.resources import ExternalResources
from pynwb import NWBFile, NWBHDF5IO
from pynwb import get_type_map as tm
from pynwb.file import Subject

from datetime import datetime
from dateutil import tz
import numpy as np

# An Overview of NERD

### Goals and Use Cases

To support a FAIR data ecosystem and data reuse, the `ExternalResources` class provides standardized methods to create and manage links between data terms and external resources, such as online ontologies or digital identifiers. Common use cases:

* Linking terms from user data to ontologies standardizes the wording and semantics of neuroscience metadata by tying each term to a precise definition in an existing curated resource, e.g., brain atlases; species taxonomies; and anatomical, cell, and gene function ontologies.
* Linking data to persistent digital identifiers (e.g., ORCID, RRID, or DOI) enables unique identification of experimenters, publications, subjects, software, and other resources and assets identified in the experimental metadata.
* Linking data to related data assets is essential for integration and interoperability of data across different data archives for experiments involving multiple data modalities.

# Using NERD with a single NWB File from the DANDI Archive

Loading the file reveals several places where contextual metadata matters for creating and sharing FAIR data. We can map the experimenter to a digital identifier, i.e., an ORCID. The electrode group has a location that can be mapped to a brain atlas. Lastly, we can map the `Subject` species attribute to an ontology resource, in this case the NCBI Taxonomy.

In [4]:
with NWBHDF5IO("sub-Haydn_desc-train_ecephys.nwb", "r") as io:
    read_nwbfile = io.read()
read_nwbfile

When using NERD directly with a single source, most commonly an `NWBFile`, it is recommended to link the instance of the `ExternalResources` class to the file. This link makes NERD easier to use, as shown later in the tutorial.

In [5]:
er = ExternalResources() 
read_nwbfile.link_resources(er)

  warn(_exp_warn_msg(cls))


We can see the linkage as follows:

In [6]:
read_nwbfile.get_linked_resources()

#### Important Note

By linking an `ExternalResources` instance to an `NWBFile`, the user is establishing a link. However, since `ExternalResources` is written separately from the `NWBFile`, this link is not saved on write. This lets users annotate existing files without having to modify files that contain large datasets.

### ORCID

In [7]:
er.add_ref(
    container=read_nwbfile,
    attribute="experimenter",
    key="Hansem Sohn",
    entity_id='ORCID:0000-0001-8593-7473', 
    entity_uri='https://orcid.org/0000-0001-8593-7473')

(<hdmf.common.resources.Key at 0x121561d20>,
 <hdmf.common.resources.Entity at 0x1218d3370>)

### Electrode Group Location

In [8]:
er.add_ref(
    container=read_nwbfile.electrode_groups['electrode_group_1'],
    attribute="location",
    key="Dorsomedial frontal cortex",
    entity_id="Frontal Cortex", 
    entity_uri="https://www.ebrains.eu/tools/rat-brain",  
)

(<hdmf.common.resources.Key at 0x121b691b0>,
 <hdmf.common.resources.Entity at 0x121b69030>)

### Subject Species

In [9]:
er.add_ref(
    container=read_nwbfile.subject,
    attribute='species',
    key='Macaca mulatta',
    entity_id='NCBI_TAXON:9544',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=9544'
)

(<hdmf.common.resources.Key at 0x121b69450>,
 <hdmf.common.resources.Entity at 0x121b68c70>)

### What about the connection to the NWBFile?

Although we have been calling methods directly on the `ExternalResources` instance, i.e., `er.add_ref(...)`, we are still updating the `ExternalResources` linked to the file. Alternatively, a user could call `read_nwbfile.external_resources.add_ref(...)`. The instance now holds a populated, normalized set of tables for efficient data storage and querying. Although the data structure consists of multiple tables, the user can view NERD as a single flattened table.

In [10]:
df=er.to_dataframe()
df

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,9c3a5c45-316c-493d-a712-03a01b662ee9,0,9c3a5c45-316c-493d-a712-03a01b662ee9,0,NWBFile,general/experimenter,,0,Hansem Sohn,0,ORCID:0000-0001-8593-7473,https://orcid.org/0000-0001-8593-7473
1,9c3a5c45-316c-493d-a712-03a01b662ee9,1,f8641805-f93c-446f-8194-5fce08d22dbb,0,ElectrodeGroup,location,,1,Dorsomedial frontal cortex,1,ID,URI
2,9c3a5c45-316c-493d-a712-03a01b662ee9,2,5ee39486-8625-4ac3-9691-ce9d724812a4,0,Subject,species,,2,Macaca mulatta,2,NCBI_TAXON:9544,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...


### Useful query methods

NERD hosts multiple methods to retrieve the stored data. More methods are in active development and are open for community requests and feedback.

#### Get Object Type

This method retrieves all instances of a specified `object_type`. In this case, a user can retrieve all instances involving `Subject`.


In [11]:
er.get_object_type(object_type='Subject', all_instances=True)

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
2,9c3a5c45-316c-493d-a712-03a01b662ee9,2,5ee39486-8625-4ac3-9691-ce9d724812a4,0,Subject,species,,2,Macaca mulatta,2,NCBI_TAXON:9544,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...


#### Get Key

`get_key` returns a `Key` object based on the name given. If the name is used more than once, the user provides the `container`, `relative_path`, and `field` to retrieve the specific `Key` they want. Users need this method if they want to reuse a key for a new reference, since `ExternalResources` requires the keys associated with an `Object` to be unique.

In [12]:
er.get_key('Hansem Sohn')

<hdmf.common.resources.Key at 0x121b6b3a0>

In [13]:
er.get_key(key_name='Macaca mulatta', container=read_nwbfile.subject, relative_path='species')

<hdmf.common.resources.Key at 0x121b6b040>

#### Get all entities for an Object

`get_object_entities` allows the user to retrieve all entities and key information associated with an `Object`.

In [14]:
er.get_object_entities(container=read_nwbfile.subject,
                       relative_path='species')

Unnamed: 0,entity_id,entity_uri
0,NCBI_TAXON:9544,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...


### Write NERD as a zipped collection of tsv files

As mentioned earlier, NERD is written separately from the NWB file. `to_norm_tsv` writes each table as a TSV file and stores the files in a zip archive.

In [15]:
er.to_norm_tsv(path='./')
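
To make the storage layout concrete, here is a minimal standard-library sketch of the idea of one TSV file per table, bundled in a zip archive, and of reading the archive back. It is not the `to_norm_tsv` or `from_norm_tsv` implementation: the member names (`keys.tsv`, `entities.tsv`), the helper names `tables_to_zip` and `zip_to_tables`, and the in-memory buffer are assumptions made for illustration.

```python
import csv
import io
import zipfile

# Hypothetical table contents: a header row followed by data rows.
tables = {
    "keys": [["key"], ["Hansem Sohn"], ["Macaca mulatta"]],
    "entities": [
        ["entity_id", "entity_uri"],
        ["ORCID:0000-0001-8593-7473", "https://orcid.org/0000-0001-8593-7473"],
    ],
}


def tables_to_zip(tables):
    """Write every table as a tab-separated member of one zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, rows in tables.items():
            text = io.StringIO()
            csv.writer(text, delimiter="\t", lineterminator="\n").writerows(rows)
            archive.writestr(name + ".tsv", text.getvalue())
    return buffer.getvalue()


def zip_to_tables(data):
    """Read every .tsv member of the archive back into a list of rows."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            text = archive.read(member).decode("utf-8")
            # Strip the ".tsv" suffix to recover the table name.
            result[member[:-4]] = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return result


round_trip = zip_to_tables(tables_to_zip(tables))
```

Keeping each table in its own file means the archive mirrors the normalized structure, so it can be inspected with any spreadsheet or text tool without loading the NWB file.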

### Read ER from tsv

In [16]:
er_read=ExternalResources.from_norm_tsv(path='./')


In [17]:
er_read.to_dataframe()

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,9c3a5c45-316c-493d-a712-03a01b662ee9,0,9c3a5c45-316c-493d-a712-03a01b662ee9,0,NWBFile,general/experimenter,,0,Hansem Sohn,0,ORCID:0000-0001-8593-7473,https://orcid.org/0000-0001-8593-7473
1,9c3a5c45-316c-493d-a712-03a01b662ee9,1,f8641805-f93c-446f-8194-5fce08d22dbb,0,ElectrodeGroup,location,,1,Dorsomedial frontal cortex,1,ID,URI
2,9c3a5c45-316c-493d-a712-03a01b662ee9,2,5ee39486-8625-4ac3-9691-ce9d724812a4,0,Subject,species,,2,Macaca mulatta,2,NCBI_TAXON:9544,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...


# Using NERD with multiple NWBFiles

A single instance of the `ExternalResources` class can store contextual metadata for multiple files. When working with several files at once, a single link between the `ExternalResources` instance and a file is not possible, because more than one file is involved. There are two ways around this. Users can link the instance to one file, populate the NERD data structure, and then relink the instance to the next file.

Another method (as seen below) is to explicitly pass the `file` parameter when populating with `add_ref`.

In this example, we have three files that currently exist on the DANDI Archive. All three describe experiments on a rat. The species field is free-form text, which allows a wide range of names for the same animal. Contextual metadata for the `Subject` species lets users connect and query across files whose datasets and attributes share the same external reference.

In [18]:
# File with Subject species as rat
e1='sub-Rat203_ecephys.nwb'
io=NWBHDF5IO(e1, "r")
read_nwbfile_e1 = io.read()

# File with Subject species as Rattus norvegicus domestica
e2='sub-EE_ses-EE-042_ecephys.nwb'
io=NWBHDF5IO(e2, "r")
read_nwbfile_e2 = io.read()

# File with Subject species as rattus norvegicus
e3 = 'sub-BH243.nwb'
io=NWBHDF5IO(e3, "r")
read_nwbfile_e3 = io.read()

er = ExternalResources()

er.add_ref(
    file=read_nwbfile_e1,
    container=read_nwbfile_e1.subject,
    attribute='species',
    key='rat',
    entity_id='NCBI_TAXON:10116',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=10116'
)

er.add_ref(
    file=read_nwbfile_e2,
    container=read_nwbfile_e2.subject,
    attribute='species',
    key='Rattus norvegicus domestica',
    entity_id='NCBI_TAXON:10116',
)

er.add_ref(
    file=read_nwbfile_e3,
    container=read_nwbfile_e3.subject,
    attribute='species',
    key='rattus norvegicus',
    entity_id='NCBI_TAXON:10116',
)

er.to_dataframe()


Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,8e4f1f81-85b8-469e-9d1b-b7b188edfd6f,0,ed65b7ec-a46e-48fc-b685-e37634e6a4fc,0,Subject,species,,0,rat,0,NCBI_TAXON:10116,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
1,510e730a-4c83-4bdb-a8b9-e68994adec0a,1,088479f0-5966-45a1-9394-21bedf7b9cf2,1,Subject,species,,1,Rattus norvegicus domestica,0,NCBI_TAXON:10116,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
2,4c579581-596e-4145-a82a-ca7be747016c,2,d0299e3c-f007-4465-98a9-92f2590699a4,2,Subject,species,,2,rattus norvegicus,0,NCBI_TAXON:10116,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
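
The benefit of the shared entity can be sketched with plain Python: once three differently spelled keys point at one entity ID, grouping by that ID recovers every file about the same species. This is an illustrative sketch over rows copied from the table above, not a NERD query method; the variable names are assumptions.

```python
from collections import defaultdict

# (file_object_id, key, entity_id) triples copied from the flattened table above.
rows = [
    ("8e4f1f81-85b8-469e-9d1b-b7b188edfd6f", "rat", "NCBI_TAXON:10116"),
    ("510e730a-4c83-4bdb-a8b9-e68994adec0a", "Rattus norvegicus domestica", "NCBI_TAXON:10116"),
    ("4c579581-596e-4145-a82a-ca7be747016c", "rattus norvegicus", "NCBI_TAXON:10116"),
]

# Group the free-form species names by the entity they were mapped to.
files_by_entity = defaultdict(list)
for file_object_id, key, entity_id in rows:
    files_by_entity[entity_id].append((file_object_id, key))

for entity_id, matches in files_by_entity.items():
    print(entity_id, "is referenced by", len(matches), "files")
```

A text search for "rat" would have missed nothing here by luck, but it would not match a file that spelled the species only as a Latin name with different casing; the entity ID removes that dependence on spelling.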


# NERD Structure

From a user's perspective, one can think of the `ExternalResources` as a simple table. 

In [19]:
er_read.to_dataframe()

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,9c3a5c45-316c-493d-a712-03a01b662ee9,0,9c3a5c45-316c-493d-a712-03a01b662ee9,0,NWBFile,general/experimenter,,0,Hansem Sohn,0,ORCID:0000-0001-8593-7473,https://orcid.org/0000-0001-8593-7473
1,9c3a5c45-316c-493d-a712-03a01b662ee9,1,f8641805-f93c-446f-8194-5fce08d22dbb,0,ElectrodeGroup,location,,1,Dorsomedial frontal cortex,1,ID,URI
2,9c3a5c45-316c-493d-a712-03a01b662ee9,2,5ee39486-8625-4ac3-9691-ce9d724812a4,0,Subject,species,,2,Macaca mulatta,2,NCBI_TAXON:9544,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...


However, to reduce data redundancy and improve data integrity, `ExternalResources` stores this data internally in a collection of interlinked tables.

* `KeyTable` where each row describes a `Key`. A `Key` is a term defined by the user's data.
* `FileTable` where each row describes a `File`. A `File` is an `NWBFile` in our use case.
* `EntityTable` where each row describes an `Entity`. An `Entity` is a term from an ontology or resource.
* `EntityKeyTable` where each row describes an `EntityKey` pair identifying which `Key`
  represents which `Entity`.
* `ObjectTable` where each row describes an `Object`. An `Object` is an NWB data-type, meaning it has an object_id, e.g., `AbstractContainer`.
* `ObjectKeyTable` where each row describes an `ObjectKey` pair identifying which `Key`
  is used with which `Object`.
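
The relationship between the normalized tables and the flat view can be sketched with plain lists of dictionaries. This is only an illustration of the join, assuming a row's position in its list acts as the row index; it does not use the hdmf table classes, and `flatten` is a hypothetical helper. The values are copied from the single-file example earlier in this guide.

```python
# Hypothetical in-memory stand-ins for the interlinked tables (not the hdmf classes).
files = [{"file_object_id": "9c3a5c45-316c-493d-a712-03a01b662ee9"}]
objects = [
    {"files_idx": 0, "object_id": "9c3a5c45-316c-493d-a712-03a01b662ee9",
     "object_type": "NWBFile", "relative_path": "general/experimenter", "field": ""},
    {"files_idx": 0, "object_id": "5ee39486-8625-4ac3-9691-ce9d724812a4",
     "object_type": "Subject", "relative_path": "species", "field": ""},
]
keys = [{"key": "Hansem Sohn"}, {"key": "Macaca mulatta"}]
entities = [
    {"entity_id": "ORCID:0000-0001-8593-7473",
     "entity_uri": "https://orcid.org/0000-0001-8593-7473"},
    {"entity_id": "NCBI_TAXON:9544",
     "entity_uri": "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=9544"},
]
# Link tables: each row pairs two row indices.
object_keys = [{"objects_idx": 0, "keys_idx": 0}, {"objects_idx": 1, "keys_idx": 1}]
entity_keys = [{"entities_idx": 0, "keys_idx": 0}, {"entities_idx": 1, "keys_idx": 1}]


def flatten():
    """Join the link tables into one flat row per (object, key, entity) triple."""
    rows = []
    for object_key in object_keys:
        obj = objects[object_key["objects_idx"]]
        for entity_key in entity_keys:
            if entity_key["keys_idx"] != object_key["keys_idx"]:
                continue
            row = dict(files[obj["files_idx"]])
            row.update(obj)
            row.update(keys[object_key["keys_idx"]])
            row.update(entities[entity_key["entities_idx"]])
            rows.append(row)
    return rows


flat = flatten()
for row in flat:
    print(row["object_type"], row["key"], row["entity_id"])
```

Each file ID, key, and entity is stored once and referenced by index, which is what removes the repetition visible in the flat view.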

### KeyTable

Multiple `Key` objects can have the same name. They are disambiguated by the `Object` associated with each: keys with the same name may exist in different objects, but within `ExternalResources` all keys for a particular object must be unique.

In [20]:
er_read.keys.to_dataframe()

Unnamed: 0,key
0,Hansem Sohn
1,Dorsomedial frontal cortex
2,Macaca mulatta


### EntityTable

This stores the ID and URI information for the external references.

In [21]:
er_read.entities.to_dataframe()

Unnamed: 0,entity_id,entity_uri
0,ORCID:0000-0001-8593-7473,https://orcid.org/0000-0001-8593-7473
1,ID,URI
2,NCBI_TAXON:9544,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...


### EntityKeyTable

The `EntityKeyTable` stores the relationships that record which user- or data-defined `Key` represents which `Entity` from an external resource.

In [22]:
er_read.entity_keys.to_dataframe()

Unnamed: 0,entities_idx,keys_idx
0,0,0
1,1,1
2,2,2


### FileTable

The `FileTable` stores the `id` for the `NWBFile`, allowing users to keep track of which files contain the objects that have external references. The `ObjectTable` has a column `files_idx`, i.e., the row index in the `FileTable`, that links each object to the file that stores it.

As we saw prior, `add_ref` is one of the main methods to populate `ExternalResources`.

<code>er.add_ref(
    container=read_nwbfile.subject,
    attribute='species',
    key='Macaca mulatta',
    entity_id='NCBI_TAXON:9544',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=9544'
)</code>

The `FileTable` is not optional, meaning every new reference needs an associated file. `add_ref` will search for a file if none is provided, as in this example. Users can also provide the file manually if the container has not been added to the file (see the example with a new `NWBFile` later in this guide).

In [23]:
er_read.files.to_dataframe()

Unnamed: 0,file_object_id
0,9c3a5c45-316c-493d-a712-03a01b662ee9


### ObjectTable

`files_idx` is the row index for the corresponding `NWBFile` that houses the `Object`. If there is no file, the user does *not* have to have one to use `ExternalResources`; it will be an empty string. The `object_type` column stores the explicit type of the object to allow for easy lookups.

`relative_path` and `field` come in when dealing with different scenarios of adding references to `ExternalResources`.
* `relative_path` is the path from the closest parent that is a NWB data-type. This is used when the attribute is not a NWB data-type and so has no object id.
* `field` is used to differentiate the fields of a dataset with compound data. For example, if a dataset has a compound data-type with fields 'x', 'y', and 'z', and each field is associated with a different ontology, then use `field='x'` to denote that 'x' is using the external reference.

In [24]:
er_read.objects.to_dataframe()

Unnamed: 0,files_idx,object_id,object_type,relative_path,field
0,0,9c3a5c45-316c-493d-a712-03a01b662ee9,NWBFile,general/experimenter,
1,0,f8641805-f93c-446f-8194-5fce08d22dbb,ElectrodeGroup,location,
2,0,5ee39486-8625-4ac3-9691-ce9d724812a4,Subject,species,


### ObjectKeyTable

Stores the relationship between which keys are used with each `Object`.

In [25]:
er_read.object_keys.to_dataframe()

Unnamed: 0,objects_idx,keys_idx
0,0,0
1,1,1
2,2,2


# ExternalResources Rules

1. Multiple `Key` objects can have the same name.
   They are disambiguated by the `Object` associated
   with each, meaning we may have keys with the same name in different objects, but for a particular object
   all keys must be unique.
2. In order to query specific records, the `ExternalResources` class
   uses '(file, object_id, relative_path, field, key)' as the unique identifier.
3. `Object` can have multiple `Key`
   objects.
4. Multiple `Object` objects can use the same `Key`.
5. Do not use the private methods to add to the `KeyTable`, `EntityKeyTable`, `EntityTable`,
   `ObjectTable`, `ObjectKeyTable`, or `FileTable` individually.
6. URIs are optional, but highly recommended. If not known, an empty string may be used.
7. An entity ID should be the unique string identifying the entity in the given resource.
   This may or may not include a string representing the resource and a colon.
   Use the format provided by the resource. For example, Identifiers.org uses the ID ``ncbigene:22353``
   but the NCBI Gene uses the ID ``22353`` for the same term.
8. In a majority of cases, `Object` objects will have an empty string
   for 'field'. The `ExternalResources` class supports compound data_types.
   In that case, 'field' would be the field of the compound data_type that has an external reference.
9. In some cases, the attribute that needs an external reference is not an object with a 'data_type'.
   The user must then use the nearest object that has a data type to be used as the parent object. When
   adding an external resource for an object with a data type, users should not provide an attribute.
   When adding an external resource for an attribute of an object, users need to provide
   the name of the attribute.
10. The user must provide a `File` or an `Object` that
    has `File` along the parent hierarchy.
11. When reusing an `Entity`, the user provides only the entity ID to `add_ref`.
    This prevents duplicates: `add_ref` raises an error explaining how to reuse an `Entity`
    if it is given an entity ID and URI for an entity that already exists.
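
Rules 1, 2, and 4 can be illustrated with a dictionary keyed by the identifying tuple. This is a sketch of the constraint only: `records` and `add_record` are hypothetical names, the IDs are placeholders, and this guide does not state whether the library raises or reuses the key when the same tuple is seen again; the sketch raises so the constraint is visible.

```python
# Hypothetical registry keyed by the identifying tuple from rule 2.
records = {}


def add_record(file_id, object_id, relative_path, field, key, entity_id):
    """Store one reference; reject a key that is already used on the same object."""
    record_id = (file_id, object_id, relative_path, field, key)
    if record_id in records:
        raise ValueError("key %r is already used for this object" % key)
    records[record_id] = entity_id
    return record_id


# The same key name on two different objects is allowed (rules 1 and 4).
add_record("file-1", "object-a", "species", "", "rat", "NCBI_TAXON:10116")
add_record("file-1", "object-b", "species", "", "rat", "NCBI_TAXON:10116")

# Repeating the key on the same object is rejected in this sketch (rule 1).
try:
    add_record("file-1", "object-a", "species", "", "rat", "NCBI_TAXON:10116")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Because the key name is only one component of the tuple, a lookup by name alone can be ambiguous, which is why `get_key` accepts the container, `relative_path`, and `field` as well.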

# An example with a new NWBFile

In [2]:
session_start_time = datetime(2018, 4, 25, 2, 30, 3, tzinfo=tz.gettz("US/Pacific"))

nwbfile = NWBFile(
    session_description="Mouse exploring an open field",
    identifier="Mouse5_Day3", 
    session_start_time=session_start_time, 
    session_id="session_1234",
    experimenter=["Dichter, Benjamin K.", "Smith, Alex"], 
    lab="My Lab Name",  
    institution="University of My Institution",  
    related_publications="DOI:10.1016/j.neuron.2016.12.011", 
)

In [3]:
nwbfile.subject = Subject(
    subject_id="001",
    age="P90D",
    description="mouse 5",
    species="Mus musculus",
    sex="M",
)

In [4]:
er = ExternalResources() 
nwbfile.link_resources(er) 


## Using add_ref

### add_ref without a file

As mentioned earlier, either the file must be set explicitly in `add_ref` or the object must already exist within the file, in which case the link between the file and the instance of `ExternalResources` automatically resolves the `file` parameter. The example below does neither, so it raises an error, as expected.

In [29]:
col1 = VectorData(
    name='Species_Data',
    description='species from NCBI and Ensemble',
    data=['Homo sapiens', 'Mus musculus']
)
species = DynamicTable(name='species', description='My species', columns=[col1],)
er.add_ref(
    container=species,
    attribute='Species_Data',
    key='Homo sapiens',
    entity_id='NCBI_TAXON:9606',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606'
)

ValueError: Could not find file. Add container to the file.

### add_ref with attributes

Let's look at a very simple example. The `attribute` is the structure or feature that holds a term the user wants to add a reference for. The `attribute` may or may not be an NWB data-type; for example, it can be a variable that contains a string value.

*Note: we manually provide `file=nwbfile`.*


In [5]:
col1 = VectorData(
    name='Species_Data',
    description='species from NCBI and Ensemble',
    data=['Homo sapiens', 'Mus musculus']
)
species = DynamicTable(name='species', description='My species', columns=[col1],)
er.add_ref(
    file=nwbfile,
    container=species,
    attribute='Species_Data',
    key='Homo sapiens',
    entity_id='NCBI_TAXON:9606',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606'
)

(<hdmf.common.resources.Key at 0x10d2e60e0>,
 <hdmf.common.resources.Entity at 0x10d2e5840>)

Recall that `relative_path` is the path from the closest parent that is a NWB data-type and is used when the attribute is not a NWB data-type and so has no `object_id`. 

In [6]:
# Subject species attribute
er.add_ref(
    container=nwbfile.subject,
    attribute='species',
    key='Mus musculus',
    entity_id='NCBI_TAXON:10090',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090'
)

# NWBFile experimenter
er.add_ref(
    container=nwbfile,
    attribute="experimenter",
    key="Dichter, Benjamin K.",
    entity_id="ORCID:0000-0001-5725-6910",
    entity_uri="https://orcid.org/0000-0001-5725-6910",
)

(<hdmf.common.resources.Key at 0x10d30c610>,
 <hdmf.common.resources.Entity at 0x10d30c4c0>)

In [7]:
er.to_dataframe()

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,4cad868d-b382-43e1-bac5-94878c25cba0,0,14e9e98f-fe73-4441-b2d4-7789a6a0cbd6,0,VectorData,,,0,Homo sapiens,0,NCBI_TAXON:9606,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
1,4cad868d-b382-43e1-bac5-94878c25cba0,1,f0001c28-1f64-4010-ba96-bf51b5ec8728,0,Subject,species,,1,Mus musculus,1,NCBI_TAXON:10090,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
2,4cad868d-b382-43e1-bac5-94878c25cba0,2,4cad868d-b382-43e1-bac5-94878c25cba0,0,NWBFile,general/experimenter,,2,"Dichter, Benjamin K.",2,ORCID:0000-0001-5725-6910,https://orcid.org/0000-0001-5725-6910


### add_ref with compound data

In [33]:
col1 = VectorData(
    name='Species_column',
    description='description',
    data=np.array(
        [('Mus musculus', 9, 81.0), ('Homo sapiens', 3, 27.0)],
        dtype=[('species', 'U14'), ('age', 'i4'), ('weight', 'f4')]
    )
)

species = DynamicTable(name='SpeciesTable', description='My species', columns=[col1],)

In [34]:
er.add_ref(
    file=nwbfile,
    container=species,
    attribute='Species_column',
    field='species',
    key='Mus musculus',
    entity_id='NCBI_TAXON:10090',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090'
)

ValueError: If you plan on reusing an entity, then entity_uri parameter must be None.

The call fails because the entity `NCBI_TAXON:10090` was already added with the `Subject` species reference. Per rule 11, a reused entity is passed by ID only, so `entity_uri` must be omitted. No new row is added, as the next cell shows.

In [35]:
er.to_dataframe()

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,de741c59-0a24-4ac0-b053-c9313a4517f9,0,463c4780-d2a9-4019-8e0d-63f6c8000524,0,VectorData,,,0,Homo sapiens,0,NCBI_TAXON:9606,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
1,de741c59-0a24-4ac0-b053-c9313a4517f9,1,c56950f4-acaf-470c-a955-7c0c6562629f,0,Subject,species,,1,Mus musculus,1,NCBI_TAXON:10090,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
2,de741c59-0a24-4ac0-b053-c9313a4517f9,2,de741c59-0a24-4ac0-b053-c9313a4517f9,0,NWBFile,general/experimenter,,2,"Dichter, Benjamin K.",2,ORCID:0000-0001-5725-6910,https://orcid.org/0000-0001-5725-6910


In [9]:
er.parent=nwbfile

In [11]:
nwbfile.children

(subject pynwb.file.Subject at 0x4516273728
 Fields:
   age: P90D
   age__reference: birth
   description: mouse 5
   sex: M
   species: Mus musculus
   subject_id: 001,
 external_resources pynwb.resources.ExternalResources at 0x4912954960
 Fields:
   entities: entities <class 'hdmf.common.resources.EntityTable'>
   entity_keys: entity_keys <class 'hdmf.common.resources.EntityKeyTable'>
   files: files <class 'hdmf.common.resources.FileTable'>
   keys: keys <class 'hdmf.common.resources.KeyTable'>
   object_keys: object_keys <class 'hdmf.common.resources.ObjectKeyTable'>
   objects: objects <class 'hdmf.common.resources.ObjectTable'>)

## Write NWBFile and NERD separately

In [12]:
with NWBHDF5IO("NWBfile_ER_Example_child.nwb", "w") as io:
    io.write(nwbfile)

In [None]:
er.to_norm_tsv(path='./')

## Read the NWBFile with NERD with NWBHDF5IO

As we saw at the beginning of this guide, users can link a file to an instance of the `ExternalResources` class. Users can also link an `ExternalResources` that was previously written to a zip file by passing its location to `NWBHDF5IO` through the `external_resources_path` parameter.

In [None]:
with NWBHDF5IO("sub-Haydn_desc-train_ecephys.nwb", "r", external_resources_path='./') as io:
    read_nwbfile = io.read()
    read_nwbfile.get_linked_resources()

# TermSet

`TermSet` allows users to create their own subset of ontological references and is built upon the resources from LinkML.

Use Cases:
1. Validation of data. Currently, validation with a `TermSet` is only supported for `Data`, but there are discussions about extending it to other fields, e.g., experimenters.
2. `TermSet` streamlines the user experience for adding new references to `ExternalResources` using `add_ref_term_set`.

The first step is to create a `.yaml` file:

![title](taxon.png)

LinkML Enumerations are collections of controlled string values. 

In [1]:
terms = TermSet(term_schema_path='./species_term_set.yaml')

The `TermSet` class has methods to help you view and retrieve terms.

In [8]:
terms.view_set

{'Homo sapiens': Term_Info(id='NCBI_TAXON:9606', description='tbd', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606'),
 'Mus musculus': Term_Info(id='Ensemble:10090', description='tbd', meaning='https://rest.ensembl.org/taxonomy/id/10090'),
 'Ursus arctos horribilis': Term_Info(id='NCBI_TAXON:116960', description='tbd', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=116960'),
 'Myrmecophaga tridactyla': Term_Info(id='NCBI_TAXON:71006', description='tbd', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=71006')}

In [9]:
terms['Homo sapiens']

Term_Info(id='NCBI_TAXON:9606', description='tbd', meaning='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606')

## Validate Data with a TermSet

Data is validated when a `TermSet` is provided to `Data` or `VectorData`.

#### Validate Data

In [10]:
col1 = VectorData(
    name='species',
    description='...',
    data=['Homo sapiens', 'Mus musculus'],
    term_set=terms)

#### Validate Bad Data

In [11]:
col1 = VectorData(
    name='species',
    description='...',
    data=['Homo sapiens', 'Mus muscuklus', 'Rattus norvegicus'],
    term_set=terms,
)

ValueError: "Mus muscuklus, Rattus norvegicus, Mus muscuklus, Rattus norvegicus" is not in the term set.
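
The check itself amounts to a membership test against the permitted terms. The sketch below is a simplified stand-in, not the hdmf implementation: `validate` is a hypothetical helper, the permitted names are taken from the `view_set` output above, and the error message lists each bad value once rather than reproducing the library's exact wording.

```python
# Term names copied from the view_set output above.
permitted = {
    "Homo sapiens",
    "Mus musculus",
    "Ursus arctos horribilis",
    "Myrmecophaga tridactyla",
}


def validate(values, permitted):
    """Return the values unchanged, or raise if any value is not a permitted term."""
    bad = [value for value in values if value not in permitted]
    if bad:
        raise ValueError('"%s" is not in the term set.' % ", ".join(bad))
    return list(values)


validated = validate(["Homo sapiens", "Mus musculus"], permitted)

try:
    validate(["Homo sapiens", "Mus muscuklus", "Rattus norvegicus"], permitted)
    message = None
except ValueError as error:
    message = str(error)
```

Note that `Rattus norvegicus` is a real species name but is still rejected, because validation is against the terms in the set, not against the full ontology.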

#### Validate Data on append

In [7]:
# append 
col1 = VectorData(
    name='species',
    description='...',
    data=['Homo sapiens', 'Ursus arctos horribilis'],
    term_set=terms,
)
col1.append('Mus musculus')

#### Validate Bad Data on append

In [8]:
# append bad data
col1 = VectorData(
    name='species',
    description='...',
    data=['Homo sapiens', 'Ursus arctos horribilis'],
    term_set=terms,
)
col1.append('Macaca mulatta')

ValueError: "Macaca mulatta" is not in the term set.

#### Validate Data on extend

In [9]:
# extend
col1 = VectorData(
    name='species',
    description='...',
    data=['Homo sapiens'],
    term_set=terms,
)
col1.extend(['Mus musculus', 'Ursus arctos horribilis'])

#### Validate Bad Data on extend

In [10]:
# extend bad data
col1 = VectorData(
    name='species',
    description='...',
    data=['Homo sapiens'],
    term_set=terms,
)
col1.extend(['Macaca mulatta', 'Oryctolagus cuniculus'])

ValueError: "Macaca mulatta, Oryctolagus cuniculus" is not in the term set.

#### Validate with add_row example 1

Whether new data is validated depends on whether the `VectorData` column was initialized with validation enabled (here, by passing `term_set`). `DynamicTable` automatically checks for columns that have validation set. If any of the new data is *bad* data, then `add_row` does not add any of the new data.

In [14]:
col1 = VectorData(
    name='Species_1',
    description='...',
    data=['Homo sapiens'],
    term_set=terms,
)
col2 = VectorData(
    name='Species_2',
    description='...',
    data=['Mus musculus'],
    term_set=terms,
)
species = DynamicTable(name='species', description='My species', columns=[col1,col2])

In [15]:
# add bad data
species.add_row(Species_1='Mus musculus', Species_2='bad')

ValueError: "bad" is not in the term set.

In [16]:
species.to_dataframe()

id,Species_1,Species_2
0,Homo sapiens,Mus musculus
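
The all-or-nothing behavior can be sketched as "validate every value first, append only afterwards". This is an illustration under assumptions, not the `DynamicTable.add_row` implementation: the table is a plain dictionary of lists, and `add_row` and `permitted_terms` are hypothetical names.

```python
def add_row(table, permitted_terms, **row):
    """Append one value per column, but only if every validated column accepts its value."""
    for column, value in row.items():
        allowed = permitted_terms.get(column)  # None means the column is not validated
        if allowed is not None and value not in allowed:
            raise ValueError('"%s" is not in the term set.' % value)
    # Reached only when all validated values passed, so the row is added as a whole.
    for column, value in row.items():
        table[column].append(value)


permitted = {"Homo sapiens", "Mus musculus"}
table = {"Species_1": ["Homo sapiens"], "Species_2": ["Mus musculus"]}

# Both columns validated: one bad value blocks the whole row.
try:
    add_row(table, {"Species_1": permitted, "Species_2": permitted},
            Species_1="Mus musculus", Species_2="bad")
except ValueError as error:
    print(error)

# Only Species_1 validated: Species_2 accepts any value.
add_row(table, {"Species_1": permitted}, Species_1="Mus musculus", Species_2="rat")
```

Checking before appending keeps the columns the same length; appending column by column and failing halfway would leave the table ragged.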


#### Validate with add_row example 2

`add_row` does not validate every column. It only validates the data for `VectorData` columns that have validation set.

In [17]:
col1 = VectorData(
    name='Species_1',
    description='...',
    data=['Homo sapiens'],
    term_set=terms,
)
col2 = VectorData(
    name='Species_2',
    description='...',
    data=['Mus musculus'],
)
species = DynamicTable(name='species', description='My species', columns=[col1,col2])

In [18]:
species.add_row(Species_1='Mus mrusculus', Species_2='rat')

ValueError: "Mus mrusculus" is not in the term set.

In [19]:
species.to_dataframe()

id,Species_1,Species_2
0,Homo sapiens,Mus musculus


#### Validate with add_row example 3

`add_row` is able to distinguish which columns have valid data.

In [20]:
col1 = VectorData(
    name='Species_1',
    description='...',
    data=['Homo sapiens'],
    term_set=terms,
)
col2 = VectorData(
    name='Species_2',
    description='...',
    data=['Mus musculus'],
    term_set=terms,
)
species = DynamicTable(name='species', description='My species', columns=[col1,col2])

In [21]:
species.add_row(Species_1='Ursus arctos horribilis', Species_2='rat')

ValueError: "rat" is not in the term set.

#### Validate with add_column

`add_column` also supports validation.

In [22]:
col1 = VectorData(
    name='col1',
    description='column #1',
    data=[1, 2],
)
species = DynamicTable(name='species', description='My species', columns=[col1],)

In [23]:
species.add_column(name='species',
                   description='Species data',
                   data=['Homo sapiens', 'Mus muscuflus'],
                   term_set=terms)

ValueError: 'Mus muscuflus' is not in the term set.

## Add ExternalResources using a TermSet

`TermSet` provides an easier way to add references to `ExternalResources`. The user creates a `.yaml` file that contains enumerations, and these enumerations serve as the `entities`. Using a `TermSet` imposes a stricter naming convention on `Key` values in `ExternalResources`: each `Key` value has to match the name of a term in the `TermSet`. For example, with species data, the species values need to be the proper ontological terms in order to be validated and pulled from the `TermSet`.

Rules:
The `TermSet` must exist on the object that will use it. It cannot be used on a non-NWB data-type.

In [24]:
session_start_time = datetime(2018, 4, 25, 2, 30, 3, tzinfo=tz.gettz("US/Pacific"))

nwbfile = NWBFile(
    session_description="Mouse exploring an open field",  # required
    identifier="Mouse5_Day3",  # required
    session_start_time=session_start_time,  # required
    session_id="session_1234",  # optional
    experimenter=["Dichter, Benjamin K.", "Smith, Alex"],  # optional
    lab="My Lab Name",  # optional
    institution="University of My Institution",  # optional
    related_publications="DOI:10.1016/j.neuron.2016.12.011",  # optional
)

In [25]:
er = ExternalResources() 
nwbfile.external_resources=er


In [26]:
col1 = VectorData(
    name='Species_Data',
    description='species from NCBI and Ensemble',
    data=['Homo sapiens', 'Ursus arctos horribilis'],
    term_set=terms,
)

species = DynamicTable(name='species', description='My species', columns=[col1],)

In [27]:
er.add_ref_term_set(file=nwbfile,
                    container=species,
                    attribute='Species_Data',
                   ) 

True

In [28]:
er.to_dataframe()

Unnamed: 0,file_object_id,objects_idx,object_id,files_idx,object_type,relative_path,field,keys_idx,key,entities_idx,entity_id,entity_uri
0,04d18aa3-c145-455f-9ae4-d60e0751cb57,0,6eb1c78a-161b-475e-a9f9-3682763b5386,0,VectorData,,,0,Homo sapiens,0,NCBI_TAXON:9606,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
1,04d18aa3-c145-455f-9ae4-d60e0751cb57,0,6eb1c78a-161b-475e-a9f9-3682763b5386,0,VectorData,,,1,Ursus arctos horribilis,1,NCBI_TAXON:116960,https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/...
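
The table above follows directly from the term set: every value in the column becomes a key, and the matching term supplies the entity ID and URI. The sketch below illustrates that mapping only. `references_from_terms` is a hypothetical helper, and the `Term_Info` tuple is rebuilt here with the field names shown in the `view_set` output rather than imported from hdmf.

```python
from collections import namedtuple

# Rebuilt locally for illustration, with the field names shown in the view_set output.
Term_Info = namedtuple("Term_Info", ["id", "description", "meaning"])

term_set = {
    "Homo sapiens": Term_Info(
        "NCBI_TAXON:9606", "tbd",
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606"),
    "Ursus arctos horribilis": Term_Info(
        "NCBI_TAXON:116960", "tbd",
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=116960"),
}


def references_from_terms(values, term_set):
    """Build one (key, entity_id, entity_uri) triple per data value."""
    return [(value, term_set[value].id, term_set[value].meaning) for value in values]


references = references_from_terms(["Homo sapiens", "Ursus arctos horribilis"], term_set)
for key, entity_id, entity_uri in references:
    print(key, "->", entity_id)
```

This is why the key names must match the term names exactly: the term name is the only lookup handle between the data value and its entity.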


## Auto-add to ER with Termset (In Development)

To take advantage of auto-adding references to `ExternalResources`, the data first needs to be validated and then added to the `NWBFile`. This functionality currently has limited use cases: only `DynamicTable` is supported. It will be expanded in the future to support auto-adding to `ExternalResources` for other NWB data-types.

This requires validation and the use of a `TermSet`.

In [29]:
terms = TermSet(name='Species_TermSet', term_schema_path='./species_term_set.yaml')

In [30]:
session_start_time = datetime(2018, 4, 25, 2, 30, 3, tzinfo=tz.gettz("US/Pacific"))

nwbfile = NWBFile(
    session_description="Mouse exploring an open field",  # required
    identifier="Mouse5_Day3",  # required
    session_start_time=session_start_time,  # required
    session_id="session_1234",  # optional
    experimenter=["Dichter, Benjamin K.", "Smith, Alex"],  # optional
    lab="My Lab Name",  # optional
    institution="University of My Institution",  # optional
    related_publications="DOI:10.1016/j.neuron.2016.12.011",  # optional
)

In [31]:
er = ExternalResources() 
nwbfile.external_resources=er


In [32]:
col1 = VectorData(
    name='Species_1',
    description='...',
    data=['Homo sapiens'],
    term_set=terms,
)
col2 = VectorData(
    name='Species_2',
    description='...',
    data=['Mus musculus'],
    term_set=terms,
)

species = DynamicTable(name='species', description='My species', columns=[col1,col2],)

In [33]:
nwbfile.add_acquisition(species)