# Demonstration notebook of converting variety of input files to HEAL variable level metadata (ie data dictionary)
This takes a specified input file and uses the healdatautils to export HEAL-formatted data dictionaries.
The data dictionary titles are inferred from the file names. 

> Note, as reminder, according to the required fields, if no description is present, this will return a validation failure (see the resulting validation report: output/errors/heal-csv-errors-summary.txt).

Will demonstrate two ways to create a data dictionary via the healdata-util `vlmd` tool.

1. Via python
2. Via command line

In [11]:
# run if in google colab or platform (and not binder).
#  If in binder, this is pre-installed so commented out.
#%pip install git+https://github.com/norc-heal/healdata-utils.git

## Via python

In [12]:
from pathlib import Path 
from healdata_utils.cli import convert_to_vlmd
import os 
import pandas as pd
import json
import shutil

from healdata_utils.cli import input_descriptions
from IPython.display import Markdown,display

In [13]:
def printdir(dirname):
    for d in Path(dirname).iterdir():
        print(d)
        if Path(d).is_dir():
            for _d in Path(d).iterdir():
                print(f"   {_d}")

In [14]:
# current file paths for proof of concept
demo_filepaths = {
    "data.csv":"data/example_nmhss_puf_2020_data.csv",
    # first 20 records of NMHSS SAMHDA Public Use File 
    "sas7bdat":"data/example_nmhss_2019_first_20recs.sas7bdat", 
    # SPSS/Stata examples created from pyreadstat via notebooks/demos/scripts/example.py
    "dta":"data/example_pyreadstat_output.dta", 
    "sav":"data/example_pyreadstat_output.sav",
    # The demostration CSV data dictionary exported from UChicago Redcap instance
    "redcap.csv":"data/example_redcap_demo.redcap.csv",
    # Valid csv version of pyreadstat 
    "template.csv":"data/example_sav_pyreadstat_output.csv"
}


In [15]:
# available inputs
display(Markdown("Available inputs (except por and sas7bdat example does not have encodings/missing vals):"))
display(Markdown("".join(["- "+ext+"\n" for ext in list(input_descriptions.keys())])))
display(Markdown("Change the variable `input_type` to one of the extensions "))

Available inputs (except por and sas7bdat example does not have encodings/missing vals):

- data.csv
- template.csv
- csv
- sav
- dta
- sas7bdat
- template.json
- json
- redcap.csv


Change the variable `input_type` to one of the extensions 

In [16]:
#input_type = "sas7bdat" 
#input_type = "sav"
input_type = "redcap.csv" # see list above for others
#csv #heal formatted csv to be read in (if invalid will say in report-summary.txt)
#input_type = "sav"
input_type = "template.csv"

Description for selected input type below:

In [17]:
display(Markdown(f"Input type: **`{input_type}`**"))
display(Markdown((input_descriptions[input_type])))

Input type: **`template.csv`**

Converts a CSV conforming to HEAL specifications (but see 2 additional notes below) 
    into a HEAL-specified data dictionary in both csv format and json format.

    Converts an in-memory data dictionary or a path to a data dictionary file into a HEAL-specified tabular template by:
        1. Adding missing fields, and
        2. Converting fields from a specified mapping.
            NOTE: currently this mapping is only float/num to number or text/char to string (case insensitive)
                In future versions, there will be a specified module for csv input mappings.
    
    

In [18]:
description = "This is a proof of concept to demonstrate the healdata-utils functionality"
title = "Healdata-utils Demonstration Data Dictionary"
healdir = "output"
inputpath = demo_filepaths[input_type]

In [19]:
# make python demo output
Path(healdir).mkdir(exist_ok=True)

In [20]:
healdir = r"C:\Users\kranz-michael\projects\heal-metadata-schemas\variable-level-metadata-schema\examples\valid\template_submission_output.json"
inputpath = r"C:\Users\kranz-michael\projects\heal-metadata-schemas\variable-level-metadata-schema\examples\valid\template_submission.csv"

In [22]:
data_dictionaries = convert_to_vlmd(
    filepath=inputpath,
    outputdir=healdir, #if not specified, will not write to file
    inputtype=input_type, #if not specified, looks for suffix
    data_dictionary_props={
        "name":Path(inputpath).stem,
        "title":title,
        "description":description}
)

In [11]:
Markdown("Here is the resulting contents of the file directory:")
print(printdir("output"))

output\errors
   output\errors\heal-csv-errors.json
   output\errors\heal-json-errors.json
output\heal-csvtemplate-data-dictionary.csv
output\heal-jsontemplate-data-dictionary.json
None


Resulting CSV fields

Examine human-readable csv validation report. Say a data dictionary is not valid. The csv report summary will give these errors. If this is the case, you can edit the csv data dictionary and re-run `convert_vlmd` with the csv input type. For an example of this, see the csv validation demo notebook. In this notebook, all files are valid, so the summary will return a 
report indicating it is valid.

In [12]:
print(Path("output/errors/heal-csv-errors.json").read_text())

{
    "valid": true,
    "errors": []
}


You can view the data dictionary by looking via a pandas dataframe from the written file or directly from the returned
data dictionary object. 

In [13]:
pd.DataFrame(data_dictionaries['csvtemplate']).head()

Unnamed: 0,name,type,description,title,module,format,constraints.pattern,encodings,constraints.enum
0,study_id,string,Study ID,Study ID,demographics,,,,
1,date_enrolled,date,Demographic Characteristics: Date subject sign...,Date subject signed consent,demographics,any,,,
2,first_name,string,Demographic Characteristics: First Name,First Name,demographics,,,,
3,last_name,string,Demographic Characteristics: Last Name,Last Name,demographics,,,,
4,address,string,"Contact Information: Street, City, State, ZIP","Street, City, State, ZIP",demographics,,,,


Resulting JSON object 

> Note how currently the fields are nested within the data_dictionary property) as opposed to the csv tempalte which just has fields.

In [14]:
print(json.dumps(data_dictionaries['jsontemplate'],indent=4)[0:1000])

{
    "name": "example_redcap_demo.redcap",
    "title": "Healdata-utils Demonstration Data Dictionary",
    "description": "This is a proof of concept to demonstrate the healdata-utils functionality",
    "data_dictionary": [
        {
            "name": "study_id",
            "type": "string",
            "description": "Study ID",
            "title": "Study ID",
            "module": "demographics"
        },
        {
            "name": "date_enrolled",
            "type": "date",
            "format": "any",
            "description": "Demographic Characteristics: Date subject signed consent",
            "title": "Date subject signed consent",
            "module": "demographics"
        },
        {
            "name": "first_name",
            "type": "string",
            "description": "Demographic Characteristics: First Name",
            "title": "First Name",
            "module": "demographics"
        },
        {
            "name": "last_name",
            "type": 

## Via command line

We will demonstrate the `vlmd` command line utility using one of the data dictionaries. 

In [15]:
# make a separate output-cli folder for cli demo

Path("output-cli").mkdir(exist_ok=True)

In [16]:
!vlmd --help

Usage: vlmd [OPTIONS]

Options:
  --filepath TEXT                 Path to the file you want to convert to a
                                  HEAL data dictionary  [required]
  --title TEXT                    The title of your data dictionary. If not
                                  specified, then the file name will be used
  --description TEXT              Description of data dictionary
  --inputtype [data.csv|template.csv|csv|sav|dta|por|sas7bdat|template.json|json|redcap.csv]
                                  The type of your input file.
  --outputdir TEXT                The folder where you want to output your
                                  HEAL data dictionary
  --help                          Show this message and exit.


To create the above data dictionary via the command line, run directly in this notebook with the cell below:

In [17]:
!vlmd --filepath "data/example_pyreadstat_output.sav" \
--outputdir "output-cli" \
--title "Healdata-utils Demonstration Data Dictionary" \
--description "This is a proof of concept to demonstrate the healdata-utils functionality" 

In [18]:
printdir("output-cli")

output-cli\errors
   output-cli\errors\heal-csv-errors.json
   output-cli\errors\heal-json-errors.json
output-cli\heal-csvtemplate-data-dictionary.csv
output-cli\heal-jsontemplate-data-dictionary.json
