# 2. Data Records
In this notebook, we will go over how to create ``Data Records``, add metadata and other contextual information to them, edit them, and establish relationships between them. ``Data Records`` are the fundamental unit in DataFed.

## Setting up:
Just as in the previous notebook, we start by instantiating the ``datafed.CommandLib.API`` class to communicate with DataFed:

In [None]:
from datafed.CommandLib import API

In [None]:
import json

df_api = API()

### Specifying context:
Since we want to work within the ``context`` of the Training Project:

In [None]:
df_api.setContext("p/trn001")

To begin with, you will be working within your own private collection whose ``alias`` is the same as your DataFed username.

### <span style="color:green"> Exercise: </span>
<span style="color:green"> Enter your username into the ``parent_collection`` variable </span>

In [None]:
parent_collection = ""

## Creating Data Records:
Data Records can hold a whole lot of contextual information about the raw data in them. One key component is the scientific metadata. Ideally, we would get this metadata from the headers of the raw data file or some other log file that was generated along with the raw data. 

### <span style="color:blue"> Note </span>
> <span style="color:blue"> DataFed expects scientific metadata to be specified **like** a python dictionary. </span>

For now, let's set up some dummy metadata:

In [None]:
parameters = {
    "a": 4,
    "b": [1, 2, -4, 7.123],
    "c": "Something important",
    "d": {"x": 14, "y": -19},  # Can use nested dictionaries
}

### <span style="color:blue"> Note </span>
> <span style="color:blue"> DataFed **currently** encodes metadata in JSON strings. The next version will accept Python dictionaries as is.</span>

For now, we need to convert the metadata to a JSON string using the ``dumps()`` function in the ``json`` package:

In [None]:
json.dumps(parameters)
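
A quick way to check that a metadata dictionary survives JSON encoding is to round-trip it through ``json.dumps()`` and ``json.loads()``. The sketch below reuses the dummy ``parameters`` from above and relies only on the standard library; the names ``encoded`` and ``decoded`` are illustrative and not part of DataFed.

```python
import json

parameters = {
    "a": 4,
    "b": [1, 2, -4, 7.123],
    "c": "Something important",
    "d": {"x": 14, "y": -19},
}

# Encode the dictionary as a JSON string, then decode it again
encoded = json.dumps(parameters)
decoded = json.loads(encoded)

# The encoded form is a plain string; the decoded form matches the original
print(type(encoded).__name__)
print(decoded == parameters)
```

Values that JSON cannot represent, such as ``datetime`` objects or NumPy arrays, raise a ``TypeError`` in ``json.dumps()`` and need to be converted to plain Python types first.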

We use the `dataCreate()` function to make our new record, and the `json.dumps()` function to format the python dictionary to JSON:

In [None]:
# parent_collection: the parent collection, whose alias is your username
dc_resp = df_api.dataCreate(
    "my important data",
    metadata=json.dumps(parameters),
    parent_id=parent_collection,
)
dc_resp

#### <span style="color:blue"> **Note:** </span> 
> <span style="color:blue">  In the future, the ``dataCreate()`` function will by default return only the ``ID`` of the record, rather than such a verbose response, when it successfully creates the Data Record. We expect the verbose response to remain available through an optional argument. </span>

### <span style="color:green">  Exercise: </span>
<span style="color:green"> Extract the ``ID`` of the data record from the message returned from ``dataCreate()`` for future use: </span>

In [None]:
record_id = dc_resp[0].data[0].id
print(record_id)
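
Because the same indexing pattern recurs throughout this notebook, it can be wrapped in a small helper. This is a minimal sketch: ``extract_record_id`` is a hypothetical name, not a DataFed function, and it assumes only the reply shape used in the cell above (``response[0].data[0].id``). A stand-in reply with a made-up ID is used so that the example runs without a DataFed server.

```python
from types import SimpleNamespace


def extract_record_id(response):
    """Return the ID of the first record in a DataFed-style reply.

    Assumes the reply can be indexed as response[0].data[0].id,
    as in the cell above.
    """
    return response[0].data[0].id


# Stand-in reply with a made-up ID, so the sketch runs without a server.
# The second element is an unused placeholder.
fake_record = SimpleNamespace(id="d/00000000", title="my important data")
fake_reply = (SimpleNamespace(data=[fake_record]), "placeholder")

print(extract_record_id(fake_reply))
```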

Data Records and the information in them are not static and can be modified at any time.

## Updating Data Records
Let's add some additional metadata and change the title of our record:

In [None]:
metadata = {"appended_metadata": True}
du_resp = df_api.dataUpdate(
    record_id,
    title="Some new title for the data",
    metadata=json.dumps(metadata),
)
print(du_resp)

#### <span style="color:blue"> **Note:** </span> 
> <span style="color:blue"> In the future, the ``dataUpdate()`` function will return only an acknowledgement that the data update executed successfully. </span>

## Viewing Data Records
We can get full information about a data record, including the complete metadata, via the ``dataView()`` function. Let us use this function to verify that the changes have been incorporated:

In [None]:
dv_resp = df_api.dataView(record_id)
print(dv_resp)

#### <span style="color:blue"> **Note:** </span> 
> <span style="color:blue"> Record metadata is always stored as a JSON string. </span>

### <span style="color:green"> Exercise: </span>
<span style="color:green"> Try isolating the updated metadata and converting it to a python dictionary. </span>

Hint - ``json.loads()`` is the opposite of ``json.dumps()``

In [None]:
metadata = json.loads(dv_resp[0].data[0].metadata)
print(metadata)

In the first update, we **merged** new metadata with the existing metadata within the record. However, ``dataUpdate()`` can also replace the metadata entirely.
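
The difference between merging and replacing can be pictured with plain Python dictionaries. The sketch below does not call DataFed. It models top-level keys only and assumes that, in a merge, a key present in both dictionaries takes the new value; how DataFed treats nested keys during a merge is not covered here.

```python
existing = {"a": 4, "c": "Something important"}
update = {"appended_metadata": True, "a": 5}

# Merge: keep existing keys, add new ones, and (by assumption)
# overwrite keys that appear in both with the new value
merged = {**existing, **update}

# Replace: discard the existing metadata entirely
replaced = dict(update)

print(merged)
print(replaced)
```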

### <span style="color:green"> Exercise: </span>
<span style="color:green"> Now try to **replace** the metadata. <br><br>Hint: look at the `metadata_set` keyword argument in the docstrings. </span>

### <span style="color:blue"> Tip: </span>
> <span style="color:blue"> With the cursor just past the starting parenthesis of ``dataUpdate(``, simultaneously press the ``Shift`` and ``Tab`` keys once, twice, or four times to view more of the documentation about the function. </span>

In [None]:
new_metadata = {"key": "value", "E": "mc^2"}
du_resp = df_api.dataUpdate(record_id, metadata=json.dumps(new_metadata), metadata_set=True)
dv_resp = df_api.dataView(record_id)
print(json.loads(dv_resp[0].data[0].metadata))

## Provenance
Along with in-depth, detailed scientific metadata describing each data record, DataFed also provides a very handy tool for tracking data provenance, i.e. recording the relationships between Data Records which can be used to track the history, lineage, and origins of a data object. 

### <span style="color:green"> Exercise: </span>
<span style="color:green"> Create a new record meant to hold some processed version of the first data record. <br> **Caution**: Make sure to create it in the correct Collection.</span>

In [None]:
new_params = {"hello": "world", "counting": [1, 2, 3, 4]}

dc2_resp = df_api.dataCreate(
    "Subsequent Record",
    metadata=json.dumps(new_params),
    parent_id=parent_collection,
)

clean_rec_id = dc2_resp[0].data[0].id
print(clean_rec_id)

### Specifying Relationships
Now that we have two records, we can specify the second record's relationship to the first by adding a **dependency** via the ``deps_add`` keyword argument of the ``dataUpdate()`` function.

### <span style="color:blue"> Note: </span>
> <span style="color:blue"> As the documentation for ``dataUpdate()`` will reveal, dependencies must be specified as a ``list`` of relationships. Each relationship is expressed as a ``list`` where the first item is the dependency type (a string) and the second is the ID of the related data record (also a string). </span>

DataFed currently supports three relationship types:
* `der` - Is derived from
* `comp` - Is comprised of
* `ver` - Is a new version of
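
Since a misspelled dependency type is an easy mistake to make, the list-of-lists format described above can be built with a small validating helper. This is only a sketch: ``make_dependency`` and ``VALID_DEP_TYPES`` are hypothetical names rather than DataFed functions, and the record IDs are made up for illustration.

```python
VALID_DEP_TYPES = {"der", "comp", "ver"}


def make_dependency(dep_type, record_id):
    """Return one [type, record ID] relationship after basic checks."""
    if dep_type not in VALID_DEP_TYPES:
        raise ValueError(f"Unknown dependency type: {dep_type!r}")
    if not isinstance(record_id, str) or not record_id:
        raise ValueError("record_id must be a non-empty string")
    return [dep_type, record_id]


# Made-up record IDs, used only for illustration
deps = [
    make_dependency("der", "d/00000001"),
    make_dependency("ver", "d/00000002"),
]
print(deps)
```

A list built this way has the shape expected by the ``deps_add`` keyword argument.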

In [None]:
dep_resp = df_api.dataUpdate(clean_rec_id, deps_add=[["der", record_id]])
print(dep_resp)
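
A dependency like the one just added can be thought of as an edge in a directed graph, and following those edges recovers the lineage of a record. The sketch below illustrates the idea with a plain dictionary and made-up record IDs. It is not how DataFed stores or queries provenance, and ``trace_lineage`` is a hypothetical helper.

```python
# Each entry maps a record to the [type, parent ID] links it declares.
# All IDs are made up for illustration.
provenance = {
    "d/raw": [],
    "d/clean": [["der", "d/raw"]],
    "d/report": [["comp", "d/clean"]],
}


def trace_lineage(record_id, graph):
    """Return every record reachable from record_id, nearest first."""
    lineage = []
    to_visit = [record_id]
    seen = {record_id}
    while to_visit:
        current = to_visit.pop(0)
        for _dep_type, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                lineage.append(parent)
                to_visit.append(parent)
    return lineage


print(trace_lineage("d/report", provenance))
```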

### <span style="color:green"> Exercise: </span>
<span style="color:green"> Take a look at the records on the DataFed Web Portal in order to see a graphical representation of the data provenance. </span>

### <span style="color:green"> Exercise: </span>
<span style="color:green">1. Create a new data record to hold a figure in your journal article. <br>2. Extract the record ID. <br>3. Now establish a provenance link between this figure record and the processed data record we just created. You may try out a different dependency type if you like. <br>4. Take a look at the DataFed web portal to see the update to the Provenance of the records</span>

In [None]:
# 1
reply = df_api.dataCreate("Figure 1", parent_id=parent_collection)
# 2
fig_id = reply[0].data[0].id
# 3
provenance_link = df_api.dataUpdate(fig_id, deps_add=[["comp", clean_rec_id]])
print(provenance_link[0].data[0])