# Demonstration notebook: converting a variety of input files to HEAL variable-level metadata (i.e., a data dictionary)
This notebook takes a specified input file and uses healdata-utils to export HEAL-formatted data dictionaries.
When no title is given, the data dictionary title is inferred from the file name.
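As a rough illustration of inferring a title from a file name, the sketch below takes the file stem with `pathlib`. The helper `title_from_filename` is hypothetical and is not part of healdata-utils; the package's own rule may differ.

```python
from pathlib import Path


def title_from_filename(filepath):
    """Hypothetical helper: use the file name without its last suffix as a title."""
    return Path(filepath).stem


print(title_from_filename("data/example_pyreadstat_output.sav"))
```

Note that `Path.stem` only drops the last suffix, so a compound extension such as `.redcap.csv` would leave `.redcap` in the result.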

> Note: as a reminder, `description` is a required field, so if no description is present, validation will fail (see the resulting validation report: output/errors/heal-csv-errors-summary.txt).
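Before running the conversion, it can help to spot fields that lack a description. The following is a minimal sketch, assuming fields are available as a list of dicts with `name` and `description` keys; the helper `fields_missing_description` and the example fields are made up for illustration and are not part of healdata-utils.

```python
def fields_missing_description(fields):
    """Return the names of fields whose description is absent or blank."""
    return [
        field.get("name", "<unnamed>")
        for field in fields
        if not str(field.get("description") or "").strip()
    ]


# Made-up fields, for illustration only
example_fields = [
    {"name": "CASEID", "type": "any", "description": "Case identification number"},
    {"name": "SITE", "type": "integer", "description": ""},
    {"name": "VISIT", "type": "integer"},
]
print(fields_missing_description(example_fields))
```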

This notebook demonstrates two ways to create a data dictionary with the healdata-utils `vlmd` tool:

1. Via python
2. Via command line

In [None]:
# run if in Google Colab or another platform (but not in Binder)
%pip install git+https://github.com/norc-heal/healdata-utils.git

## Via python

In [1]:
from pathlib import Path 
from healdata_utils.cli import convert_to_vlmd
import os 
import pandas as pd
import json
import shutil

from healdata_utils.cli import input_descriptions
from IPython.display import Markdown,display

In [2]:
def printdir(dirname):
    """Print the entries of dirname and, for each subdirectory, its direct children."""
    for d in Path(dirname).iterdir():
        print(d)
        if d.is_dir():
            for _d in d.iterdir():
                print(f"   {_d}")
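`printdir` goes two levels deep, which is enough for the output folders in this notebook. For deeper folders, a recursive variant could look like the sketch below. The helper `list_tree` is hypothetical; it returns POSIX-style relative paths so the result looks the same on Windows and Linux.

```python
from pathlib import Path


def list_tree(dirname):
    """Return every path under dirname as a sorted list of POSIX-style relative paths."""
    root = Path(dirname)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

Calling `list_tree("output")` after a conversion would return the same entries that `printdir` prints, as a list.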

In [3]:
# current file paths for proof of concept
demo_filepaths = {
    # first 20 records of NMHSS SAMHDA Public Use File 
    "sas7bdat":"data/example_nmhss_2019_first_20recs.sas7bdat", 
    # SPSS/Stata examples created from pyreadstat via notebooks/demos/scripts/example.py
    "dta":"data/example_pyreadstat_output.dta", 
    "sav":"data/example_pyreadstat_output.sav",
    # The demonstration CSV data dictionary exported from the UChicago REDCap instance
    "redcap.csv":"data/example_redcap_demo.redcap.csv",
    # Valid CSV version of the pyreadstat example
    "csv":"data/example_pyreadstat_output.csv"
}
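These paths are relative to the notebook's working directory, so a quick existence check can save a confusing error later. This is a minimal sketch; `missing_inputs` is a hypothetical helper, not part of healdata-utils.

```python
from pathlib import Path


def missing_inputs(filepaths):
    """Return the {input_type: path} entries whose file does not exist."""
    return {key: path for key, path in filepaths.items() if not Path(path).is_file()}
```

Passing `demo_filepaths` to it returns an empty dict when every demo file is in place.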


In [6]:
# available inputs
display(Markdown("Available inputs (no por example is included, and the sas7bdat example does not have encodings/missing values):"))
display(Markdown("".join(["- "+ext+"\n" for ext in list(input_descriptions.keys())])))
display(Markdown("Change the variable `input_type` to one of the extensions "))

Available inputs (no por example is included, and the sas7bdat example does not have encodings/missing values):

- csv
- sav
- dta
- por
- sas7bdat
- json
- redcap.csv


Change the variable `input_type` to one of the extensions 

In [7]:
input_type = "sas7bdat" 
#input_type = "sav"
#input_type = "redcap.csv" # see list above for others
#input_type = "csv" # HEAL-formatted csv to be read in (if invalid, heal-csv-errors-summary.txt will say so)

Description for selected input type below:

In [8]:
display(Markdown(f"Input type: **`{input_type}`**"))
display(Markdown((input_descriptions[input_type])))

Input type: **`sas7bdat`**

Converts a "metadata-rich" (ie statistical software file) 
    into a HEAL-specified data dictionary in both csv format and json format.

    This function relies on [readstat](https://github.com/Roche/pyreadstat) which supports SPSS (sav and por), 
    SAS (sas7bdat), and Stata (dta). 

    > Currently, this function uses both data and metadata to generate 
    a HEAL specified data dictionary. That is, types are inferred from the 
    data (so at least test or synthetic data needed) while everything else is taken 
    from the metadata (eg missing values, variable labels, variable value labels etc)
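To make "types are inferred from the data" concrete, here is a simplified sketch of value-based type inference. It does not reproduce the rules healdata-utils applies; the function `infer_type` is hypothetical, and the type names other than `integer` and `any` (which appear in this notebook's output) are assumptions.

```python
def infer_type(values):
    """Guess a field type from its non-missing values (simplified, hypothetical rules)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return "any"  # nothing to infer from
    if all(isinstance(v, bool) for v in observed):
        return "boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in observed):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed):
        return "number"
    if all(isinstance(v, str) for v in observed):
        return "string"
    return "any"  # mixed values


print(infer_type([1, 0, None, 1]))
```

This is why at least test or synthetic data is needed: with no observed values, the sketch can only fall back to `any`.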

    

In [12]:
description = "This is a proof of concept to demonstrate the healdata-utils functionality"
title = "Healdata-utils Demonstration Data Dictionary"
healdir = "output"
inputpath = demo_filepaths[input_type]

In [13]:
# make python demo output
Path(healdir).mkdir(exist_ok=True)

In [14]:
data_dictionaries = convert_to_vlmd(
    filepath=inputpath,
    outputdir=healdir, #if not specified, will not write to file
    inputtype=input_type, #if not specified, looks for suffix
    data_dictionary_props={
        "name":Path(inputpath).stem,
        "title":title,
        "description":description}
)

Validating csv data dictionary...
Csv is VALID
Validating heal-specified json fields.....
JSON array of data dictionary fields is VALID
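The comment on `inputtype` in the call above says the suffix is used when the argument is not given. One way such a lookup could work is sketched below; `guess_input_type` is a hypothetical helper and is not the package's implementation. The list of types is taken from the available inputs shown earlier, and longer extensions are tried first so that `redcap.csv` is not mistaken for plain `csv`.

```python
from pathlib import Path

KNOWN_TYPES = ["csv", "sav", "dta", "por", "sas7bdat", "json", "redcap.csv"]


def guess_input_type(filepath):
    """Return the known input type matching the end of the file name, or None."""
    name = Path(filepath).name.lower()
    # Try longer extensions first so "redcap.csv" wins over "csv"
    for input_type in sorted(KNOWN_TYPES, key=len, reverse=True):
        if name.endswith("." + input_type):
            return input_type
    return None


print(guess_input_type("data/example_redcap_demo.redcap.csv"))
```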


In [15]:
display(Markdown("Here are the resulting contents of the output directory:"))
printdir("output")

Here are the resulting contents of the output directory:

output/errors
   output/errors/heal-csv-errors-summary.txt
   output/errors/heal-csv-errors.json
   output/errors/heal-json-errors.json
output/heal-csvtemplate-data-dictionary.csv
output/heal-jsontemplate-data-dictionary.json


Resulting CSV fields

Examine the human-readable csv validation report. If a data dictionary is not valid, the csv report summary lists the errors. In that case, you can edit the csv data dictionary and re-run `convert_to_vlmd` with the csv input type. For an example of this, see the csv validation demo notebook. In this notebook, all files are valid, so the summary reports that the data dictionary is valid.

In [16]:
print(Path("output/errors/heal-csv-errors-summary.txt").read_text())


# -----
# valid: memory 
# -----

## Summary 

+------------------------+-------------------+
| Description            | Size/Name/Count   |
| File name (Not Found)  | memory            |
+------------------------+-------------------+
| File size              | N/A               |
+------------------------+-------------------+
| Total Time Taken (sec) | 0.031             |
+------------------------+-------------------+
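If you want to check the summary programmatically rather than by eye, a small text check may be enough. The sketch below relies on the `# valid:` heading seen in the report above; the `# invalid` heading for failing reports is an assumption, since no invalid report is shown in this notebook. `report_is_valid` is a hypothetical helper.

```python
def report_is_valid(summary_text):
    """Return True if the first status heading in the summary starts with '# valid'."""
    for line in summary_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# valid"):
            return True
        if stripped.startswith("# invalid"):  # assumed heading for failing reports
            return False
    return False  # no status heading found
```

For example, `report_is_valid(Path("output/errors/heal-csv-errors-summary.txt").read_text())` could gate a later step.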




You can view the data dictionary either by loading the written file into a pandas DataFrame or directly from the returned
data dictionary object. 

In [17]:
pd.DataFrame(data_dictionaries['csvtemplate']).head()

Unnamed: 0,module,name,title,description,type,format,constraints.maxLength,constraints.enum,constraints.pattern,constraints.maximum,...,univar_stats.median,univar_stats.mean,univar_stats.std,univar_stats.min,univar_stats.max,univar_stats.mode,univar_stats.count,univar_stats.twenty_five_percentile,univar_stats.seventy_five_percentile,univar_stats.cat_marginals
0,,CASEID,,Case identification number,any,,,,,,...,,,,,,,,,,
1,,LST,,State postal code,any,,,,,,...,,,,,,,,,,
2,,MHINTAKE,,Facility offers mental health intake,integer,,,,,,...,,,,,,,,,,
3,,MHDIAGEVAL,,Facility offers mental health diagnostic evalu...,integer,,,,,,...,,,,,,,,,,
4,,MHREFERRAL,,Facility offers mental health information and/...,integer,,,,,,...,,,,,,,,,,
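The cell above reads from the returned object. To look at the written file instead, the csv can be read back from disk. The sketch below uses only the standard library; `read_field_rows` is a hypothetical helper, and `pd.read_csv` on the same path would serve the same purpose with pandas.

```python
import csv


def read_field_rows(csv_path, limit=5):
    """Return the first `limit` rows of a csv data dictionary as dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [row for _, row in zip(range(limit), reader)]
```

After the conversion above, `read_field_rows("output/heal-csvtemplate-data-dictionary.csv")` would return the first five fields.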


Resulting JSON object 

> Note that the fields are currently nested within the `data_dictionary` property, whereas the csv template contains only the fields.
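Because of this nesting, getting at the list of fields takes an extra step. In this notebook's printed output, the `data_dictionary` key appears at more than one level, so the sketch below keeps descending until it reaches a list. `extract_fields` is a hypothetical helper, and the sample object only mirrors the shape printed in this notebook.

```python
def extract_fields(json_template):
    """Follow nested 'data_dictionary' keys until a list of fields is reached."""
    node = json_template
    while isinstance(node, dict):
        node = node.get("data_dictionary")
    return node if isinstance(node, list) else []


# Sample object mirroring the nesting printed in this notebook
sample = {
    "name": "example",
    "data_dictionary": {
        "name": "example",
        "data_dictionary": [
            {"name": "CASEID", "type": "any", "description": "Case identification number"},
        ],
    },
}
print([field["name"] for field in extract_fields(sample)])
```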

In [18]:
print(json.dumps(data_dictionaries['jsontemplate'],indent=4)[0:1000])

{
    "name": "example_nmhss_2019_first_20recs",
    "title": "Healdata-utils Demonstration Data Dictionary",
    "description": "This is a proof of concept to demonstrate the healdata-utils functionality",
    "data_dictionary": {
        "name": "example_nmhss_2019_first_20recs",
        "title": "Healdata-utils Demonstration Data Dictionary",
        "description": "This is a proof of concept to demonstrate the healdata-utils functionality",
        "data_dictionary": [
            {
                "name": "CASEID",
                "type": "any",
                "description": "Case identification number"
            },
            {
                "name": "LST",
                "type": "any",
                "description": "State postal code"
            },
            {
                "name": "MHINTAKE",
                "type": "integer",
                "description": "Facility offers mental health intake"
            },
            {
                "name": "MHDIAGEVAL",
    

## Via command line

We will demonstrate the `vlmd` command line utility using one of the example input files. 

In [19]:
# make a separate output-cli folder for cli demo

Path("output-cli").mkdir(exist_ok=True)

In [20]:
!vlmd --help

Usage: vlmd [OPTIONS]

Options:
  --filepath TEXT                 Path to the file you want to convert to a
                                  HEAL data dictionary  [required]
  --title TEXT                    The title of your data dictionary. If not
                                  specified, then the file name will be used
  --description TEXT              Description of data dictionary
  --inputtype [csv|sav|dta|por|sas7bdat|json|redcap.csv]
                                  The type of your input file.
  --outputdir TEXT                The folder where you want to output your
                                  HEAL data dictionary
  --help                          Show this message and exit.


To create a data dictionary like the one above via the command line, run the cell below directly in this notebook:

In [21]:
!vlmd --filepath "data/example_pyreadstat_output.sav" \
--outputdir "output-cli" \
--title "Healdata-utils Demonstration Data Dictionary" \
--description "This is a proof of concept to demonstrate the healdata-utils functionality" 

Validating csv data dictionary...
Csv is VALID
Validating heal-specified json fields.....
JSON array of data dictionary fields is VALID
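Outside a notebook, the same command can be assembled and run from a Python script. The sketch below uses only the options listed by `vlmd --help` above; `build_vlmd_command` is a hypothetical helper, and the `subprocess.run` call is left commented out because it requires `vlmd` to be installed and on the PATH.

```python
import subprocess


def build_vlmd_command(filepath, outputdir=None, title=None, description=None, inputtype=None):
    """Build the argument list for the vlmd CLI, skipping options that are not set."""
    cmd = ["vlmd", "--filepath", filepath]
    options = [
        ("--outputdir", outputdir),
        ("--title", title),
        ("--description", description),
        ("--inputtype", inputtype),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, value]
    return cmd


cmd = build_vlmd_command(
    "data/example_pyreadstat_output.sav",
    outputdir="output-cli",
    title="Healdata-utils Demonstration Data Dictionary",
)
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment to run; requires vlmd on the PATH
```

Passing the arguments as a list avoids shell quoting problems with titles and descriptions that contain spaces.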


In [29]:
printdir("output-cli")

output-cli\errors
   output-cli\errors\heal-csv-errors-summary.txt
   output-cli\errors\heal-csv-errors.json
   output-cli\errors\heal-json-errors.json
output-cli\heal-csvtemplate-data-dictionary.csv
output-cli\heal-jsontemplate-data-dictionary.json
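Both the Python call and the CLI run produce the same set of files, so a simple check can confirm that a run completed. The file names in the sketch below are taken from the listings in this notebook; `missing_outputs` is a hypothetical helper. Joining paths with `pathlib` keeps the check independent of the path separator, which differs between the two listings above.

```python
from pathlib import Path

# File names as they appear in the directory listings in this notebook
EXPECTED_OUTPUTS = [
    "errors/heal-csv-errors-summary.txt",
    "errors/heal-csv-errors.json",
    "errors/heal-json-errors.json",
    "heal-csvtemplate-data-dictionary.csv",
    "heal-jsontemplate-data-dictionary.json",
]


def missing_outputs(outputdir):
    """Return the expected output files that are not present under outputdir."""
    root = Path(outputdir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list from `missing_outputs("output-cli")` means every expected file was written.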
