## Generate JSON Files of Expected Field Values for Datasets
We need to identify and standardize all possible values for a set of fields in the MODEL-AD immunohisto data (biomarkers and pathology, and possibly other datasets later). Building a validation set directly from the data is not ideal, since any bad values already in the data end up in the expected lists. We decided it was still the best place to start, with the lists manually updated as needed.

#### General steps:
1. Define your datasets
2. Download the data
3. Get the set of unique values for each field of interest
4. Write the values for each dataset to a JSON file to be read during gx validation

In [None]:
import json
from agoradatatools.etl import extract, utils

#### User specified values
You must create a `datasets` dictionary with the following structure, with one entry per dataset.
```
datasets = {
    "dataset_name": {
        "synapse_id": "ID",
        "fields": {
            "field_name": [],
            "field_name": [],
            ...
        }
    },
    ...
}
```
To extract more fields, add `"field_name": []` to the "fields" dictionary, using a different field name for each entry. After the notebook runs, each list holds the unique values found for that field.
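Because the rest of the notebook indexes straight into this dictionary, a quick structural check before downloading anything can catch typos early. The sketch below is not part of `agoradatatools`; `check_datasets_structure` is a hypothetical helper, and the `"broken"` entry is made up to show what a failure looks like.

```python
def check_datasets_structure(datasets):
    """Return a list of problems found in a `datasets` dictionary (hypothetical helper)."""
    problems = []
    for name, spec in datasets.items():
        # Every dataset needs a Synapse ID given as a string
        if not isinstance(spec.get("synapse_id"), str):
            problems.append(f"{name}: 'synapse_id' must be a string")
        # Every dataset needs at least one field to extract
        fields = spec.get("fields")
        if not isinstance(fields, dict) or not fields:
            problems.append(f"{name}: 'fields' must be a non-empty dictionary")
            continue
        # Each field should start out as a list (normally empty)
        for field, values in fields.items():
            if not isinstance(values, list):
                problems.append(f"{name}.{field}: expected a list")
    return problems


# Illustrative input: one well-formed entry and one deliberately broken entry
example = {
    "biomarkers": {"synapse_id": "syn61250724.1", "fields": {"model": [], "sex": []}},
    "broken": {"fields": {"model": None}},
}
problems = check_datasets_structure(example)
for problem in problems:
    print(problem)
```

Note that a Python dictionary literal silently keeps only the last value when the same key appears twice, so a field name repeated within one "fields" dictionary would not be reported by a check like this.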

In [None]:
# User specified values

datasets = {
    "biomarkers": {
        "synapse_id": "syn61250724.1",
        "fields": {
            "model": [],
            "type": [],
            "tissue": [],
            "sex": []
        }
    },
    "pathology": {
        "synapse_id": "syn61357279",
        "fields": {
            "model": [],
            "type": [],
            "tissue": [],
            "sex": []
        }
    }
}

In [None]:
# Log into Synapse
syn = utils._login_to_synapse()

In [None]:
# Download data as dataframes
for dataset in datasets:
    df = extract.get_entity_as_df(syn_id=datasets[dataset]["synapse_id"], source="csv", syn=syn)
    df = utils.standardize_column_names(df=df)
    df = utils.standardize_values(df=df)
    datasets[dataset]["df"] = df

In [None]:
# Get unique values for each field
for dataset in datasets:
    for field in datasets[dataset]["fields"]:
        datasets[dataset]["fields"][field] = datasets[dataset]["df"][field].unique().tolist()
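If a column still contains missing entries after standardization, `unique()` reports the missing value as one of the unique values, and `json.dump` will by default write a float NaN as a bare `NaN` token, which strict JSON parsers reject. One possible way to tidy the lists before writing them is sketched below. `clean_unique_values` is a hypothetical helper, not something the notebook or `agoradatatools` provides, and whether missing values should be dropped at all is a decision for whoever maintains the expected-value lists.

```python
import math


def clean_unique_values(values):
    """Drop missing values and duplicates, then sort (hypothetical helper)."""
    kept = set()
    for value in values:
        # Skip None and float NaN, which represent missing data
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        kept.add(value)
    # Sort by string form so that mixed types do not raise a TypeError
    return sorted(kept, key=str)


# Illustrative input, standing in for the output of `.unique().tolist()`
raw = ["female", float("nan"), "male", None, "female"]
cleaned = clean_unique_values(raw)
print(cleaned)
```

Sorting also makes the output files stable between runs, which keeps diffs small when the lists are later edited by hand.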


In [None]:
# Write to json
for dataset in datasets:
    with open(f"{dataset}_unique_field_values.json", "w") as f:
        json.dump(datasets[dataset]["fields"], f, indent=4)
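Each output file maps a field name to its list of expected values. The sketch below shows one way such a file could be read back and compared against newly observed values in plain Python. It is only an illustration of the file's shape, not how gx consumes it; `find_unexpected_values` is a hypothetical helper and the field values are toy data.

```python
import json
import os
import tempfile


def find_unexpected_values(expected_path, observed):
    """Return, per field, the observed values missing from the expected lists."""
    with open(expected_path) as f:
        expected = json.load(f)
    unexpected = {}
    for field, allowed in expected.items():
        # Anything observed but not listed in the JSON file is unexpected
        extra = sorted(set(observed.get(field, [])) - set(allowed))
        if extra:
            unexpected[field] = extra
    return unexpected


# Toy data standing in for a real output file and a newer data release
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "biomarkers_unique_field_values.json")
    with open(path, "w") as f:
        json.dump({"sex": ["female", "male"], "tissue": ["cortex"]}, f, indent=4)
    result = find_unexpected_values(
        path, {"sex": ["male", "unknown"], "tissue": ["cortex"]}
    )

print(result)
```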