## Creating a JSON database to store metadata information
*Author:* Corrado Motta - corradomotta92@gmail.com

In this notebook, we use [pysondb](https://github.com/pysonDB/pysonDB-v2) to generate lightweight JSON-based databases with unique IDs. The two databases contain, respectively, the minimum set of global attributes (each either mandatory or optional) and all the robotic variables together with their required metadata.

### Create JSON database for global metadata

First of all we import all the needed packages.

In [16]:
# To generate JSON DB
from pysondb import PysonDB
# To read CSV tables
import pandas as pd

Then we read the CSV table that stores the global metadata we selected as the minimum set.

In [21]:
table_path = "tables/Minimal global metadata set.csv"
global_metadata = pd.read_csv(table_path,            # path to the table
                              usecols=range(2, 7),   # columns to include (indices 2 to 6)
                              nrows=35,              # number of rows to read
                              skiprows=[0, 1, 2])    # indices of the initial rows to skip

Check that the table looks as expected.

In [18]:
global_metadata

| | Name | Description | ISO19115 Name | M-O-NI | ACDD name |
|---|---|---|---|---|---|
| 0 | Title | A brief title for the dataset. | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | title |
| 1 | Abstract | A short summary of the process of developing t... | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | summary |
| 2 | keywords | A comma separated list of key words and phrases. | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | keywords |
| 3 | Conventions | A comma-separated list of the conventions that... | NOT FOUND | M | Conventions |
| 4 | keywords vocabulary | If you are following a guideline for the words... | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | O | keywords_vocabulary |
| 5 | Principal investigator (PI) | Name of the PI | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | creator_name |
| 6 | PI email | Email to the PI | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | creator_email |
| 7 | PI institution \t | Affiliation of the PI | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | institution |
| 8 | Dataset start time | ISO8601 reference for the dataset | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | time_coverage_start |
| 9 | Dataset end time | ISO8601 reference for the dataset | /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD... | M | time_coverage_end |


We want to create a JSON database that stores, for each global metadata element:
* Unique ID
* Informal name
* ACDD name
* Whether it is Mandatory or Optional

In [19]:
# Create the database
global_db = PysonDB('database/global_metadata.json')

# Iterate over the rows of the table
for index, row in global_metadata.iterrows():
    # True if the element is mandatory ("M"), False otherwise
    # (str() guards against empty cells, which pandas reads as NaN)
    mandatory = str(row['M-O-NI']).lower() == "m"

    # Add an entry only if an ACDD name is present
    # (NaN is truthy, so an explicit pd.notna check is needed)
    acdd_name = row['ACDD name']
    if pd.notna(acdd_name) and acdd_name != "-":
        global_db.add({
            'name':     row['Name'],
            'ACDD':     acdd_name,
            'required': mandatory,
        })

The JSON database is now created. It can be used dynamically while generating NetCDF or ISO files, to check that all mandatory elements are filled in as expected.
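A minimal sketch of such a check is shown here. It assumes that the records are available as a dictionary mapping each ID to its entry, which is the shape we expect `global_db.get_all()` to return. The function name `missing_mandatory`, the IDs, and the sample attribute values are hypothetical and only serve the illustration.

```python
def missing_mandatory(db_records, attributes):
    """Return the ACDD names of mandatory elements that are absent or empty.

    db_records: dict mapping ID -> {'name', 'ACDD', 'required'}
    attributes: dict mapping ACDD name -> value provided by the user
    """
    missing = []
    for record in db_records.values():
        # An element is missing if it is required and has no non-empty value
        if record["required"] and not attributes.get(record["ACDD"]):
            missing.append(record["ACDD"])
    return missing


# Stand-in for global_db.get_all(); the IDs here are made up
records = {
    "1": {"name": "Title", "ACDD": "title", "required": True},
    "2": {"name": "Abstract", "ACDD": "summary", "required": True},
    "3": {"name": "keywords vocabulary", "ACDD": "keywords_vocabulary", "required": False},
}

# Global attributes collected for a dataset (made-up values)
attributes = {"title": "Example survey", "summary": ""}

print(missing_mandatory(records, attributes))  # ['summary']
```

Optional elements are never reported, so only `summary` is flagged in this example.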

### Create JSON database for robotic variable attributes

We are currently working on identifying the correct names and attributes for our robotic and scientific variables, following the existing standards when possible. We decided on a minimum set of attributes for each variable, taken from the ACDD and CF conventions. The attributes are:

1. __long_name__: if no standard_name is found in the CF conventions, the long_name takes on that role. Otherwise, standard_name and long_name coincide.
2. __standard_name__: Standard name following CF [table](https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html).
3. __units__: A character string that specifies the units used for the variable's data (empty if dimensionless).
4. __coverage_content_type__: An ISO 19115-1 code indicating the source of the data (e.g., physicalMeasurement, auxiliaryInformation, or modelResult).
5. __comments__: Miscellaneous information about the data or the methods used to produce it. This is especially important for variables that do not have a standard_name; it can be omitted otherwise.
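As a minimal sketch of the attribute set listed above, the helper below builds the record for one variable as a plain dictionary. The function `make_variable_record` and its parameters are hypothetical (they are not part of pysondb or of this repository); it only encodes the rule that standard_name and long_name coincide when a CF standard name exists, and that long_name stands alone otherwise.

```python
def make_variable_record(name, units, coverage_content_type,
                         is_cf_standard_name, comments=None):
    """Build the attribute dictionary for one variable (hypothetical helper)."""
    record = {"long_name": name}
    if is_cf_standard_name:
        # The name exists in the CF table: standard_name and long_name coincide
        record["standard_name"] = name
    record["units"] = units  # empty string if dimensionless
    record["coverage_content_type"] = coverage_content_type
    if comments:
        # Optional, but recommended when there is no standard_name
        record["comments"] = comments
    return record


latitude = make_variable_record(
    "latitude", "degree_north", "physicalMeasurement",
    is_cf_standard_name=True, comments="Latitude measured by GPS")

timestamp = make_variable_record(
    "timestamp", "", "auxiliaryInformation",
    is_cf_standard_name=False, comments="Counter increased by one every 10 ms")

print(latitude)
print(timestamp)
```

A dictionary built this way could then be passed to the `add` method of a `PysonDB` instance, which assigns the unique ID.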

Since in the telemetry of a robotic vehicle the same variable can be measured by two or more sensors (e.g., GPS position reported by three or four different GPS units on board), we need to be able to give each variable in the log a __custom__ name, as well as a __pointer__ to the standard variable. In that way, the log can be used automatically to generate a FAIR-compliant NetCDF4 file in which, besides the global metadata, each variable carries all the attributes needed to be found and understood by both machines and humans.

Therefore, the idea is to create another JSON database that contains all of our robotic and scientific variables, together with the attributes mentioned above. Each variable in the database is saved with a unique ID.

The log file should then contain __two header lines__, as in the following example:
```
NGC_latitude, NGC_longitude, MBES_latitude, MBES_longitude
271595412737, 32523223453, 271595412737, 32523223453
45.438759, 12.327145, 45.515624, 12.419372
45.438760, 12.327148, 45.515635, 12.419332
45.438750, 12.327103, 45.515690, 12.419345
```

As we can see, the first header line contains the custom names, and the second contains the IDs of the standard variables they refer to. The two latitudes refer to the same ID, as do the two longitudes. When the other notebook of this repository is run, it reads the header of the log files, gets the IDs, opens the JSON database, and finds the corresponding attributes for each variable. In this case, the database looks something like the following snippet:

```json
"271595412737": {
    "long_name": "latitude",
    "standard_name": "latitude",
    "units": "degree_north",
    "coverage_content_type": "physicalMeasurement",
    "comments": "Latitude measured by GPS"
},
"32523223453": {
    "long_name": "longitude",
    "standard_name": "longitude",
    "units": "degree_east",
    "coverage_content_type": "physicalMeasurement",
    "comments": "Longitude measured by GPS"
}
```

By retrieving the attributes through the ID, the notebook can then fill in the NetCDF file and make it FAIR-compliant.
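The lookup described above can be sketched as follows, using only the standard library. The function `read_log_header` is a hypothetical name, the log content is the example shown earlier, and the `variables` dictionary stands in for the JSON database (in practice it would be loaded through pysondb); this is an illustration under those assumptions, not the code of the other notebook.

```python
import csv
import io


def read_log_header(log_file):
    """Read the two header lines and map each custom name to a variable ID."""
    reader = csv.reader(log_file, skipinitialspace=True)
    custom_names = [cell.strip() for cell in next(reader)]
    variable_ids = [cell.strip() for cell in next(reader)]
    if len(custom_names) != len(variable_ids):
        raise ValueError("The two header lines have different lengths")
    return dict(zip(custom_names, variable_ids))


# Example log with a two-line header, followed by data rows
log_text = """NGC_latitude, NGC_longitude, MBES_latitude, MBES_longitude
271595412737, 32523223453, 271595412737, 32523223453
45.438759, 12.327145, 45.515624, 12.419372
"""

# Stand-in for the variable database, keyed by ID
variables = {
    "271595412737": {"standard_name": "latitude", "units": "degree_north"},
    "32523223453": {"standard_name": "longitude", "units": "degree_east"},
}

name_to_id = read_log_header(io.StringIO(log_text))
for custom_name, variable_id in name_to_id.items():
    attrs = variables[variable_id]
    print(custom_name, "->", attrs["standard_name"], f"[{attrs['units']}]")
```

Both `NGC_latitude` and `MBES_latitude` resolve to the same entry, so the attributes are defined once in the database and reused for every sensor.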

In [22]:
#db.add({
#    'long_name': 'timestamp',
#    'comments': 'Counter increased by one every 10 ms',
#    'units': '',
#    'coverage_content_type': 'auxiliaryInformation'
#})

In [None]:
#prova = db.get_all()

In [None]:
#for key, value in prova.items():
    #if(not value['units']):
    #    print("values without units:")
#    print(value['long_name'] + " --> " + key)

In [None]:
#variable_id = "271595412737339974"
#print("standard name:", db.get_by_id(variable_id)['long_name'])
#print("description:", db.get_by_id(variable_id)['comments'])
#print("type:", db.get_by_id(variable_id)['coverage_content_type'])