# Tutorial on dataset selection

Below we explain how to use the dataset information stored in `dataset_overview.json`.

We use QM9 as an example to show how each dataset is described in our JSON file. If a value is not stated in the paper and requires a closer look at the dataset itself, `"checking required"` is used as a placeholder. The `#` comments below are explanatory annotations for each field:

In [None]:
{
"name": # Abbreviation of the dataset name
    "QM9", 

"full_name": # Full name of the dataset
    "QM9", 

"description": # Short description of the dataset
    " A collection of molecular structures and properties for 134,000 small organic molecules ",

"methods": { # Calculation methods and software used to construct the dataset.
             # Currently only methods for geometry optimization and single-point
             # calculations are considered.
    "geometry_optimization":{
        "B3LYP/6-31G(2df,p)": "Gaussian 09" # method : software
        },
    "energy":{
        "B3LYP/6-31G(2df,p)": "Gaussian 09",
        "G4MP2": "Gaussian 09"
        }
    },

"data_size": { # Size of the dataset.
               # Currently only the number of conformations
               # is considered.
    "number_of_structures": 133885
    }, 

"data_access": { # Link to the repository where the dataset is stored
    "Figshare": "https://doi.org/10.6084/m9.figshare.978904"
    },

"chemical_elements": # The chemical elements the dataset covers 
    ["H","C","N","O","F"],

"number_of_heavy_atoms": [ # The number of non-hydrogen atoms 
    "checking required", # minimum
    "checking required", # mean
    9                    # maximum
    ],                  

"initial_source": [ # The original dataset(s) that the current dataset builds on
    "GDB-17"
    ], 

"non-equilibrium structures": # whether the dataset contains non-equilibrium structures
    "False",

"charges": [ # charges of the molecules in the dataset
    0
    ],

"multiplicities": [ # multiplicities of the molecules in the dataset
    1
    ],

"excited_states": # whether the dataset contains molecules in excited states
    "False",

"solvent": [ # the solvents used for calculation
    "gas_phase"
    ],

"temperature": # the temperature used for thermochemical calculation or dynamics
    298.15,

"properties": [ # atomic and molecular properties stored in the dataset
    "total_energy",
    "enthalpy",
    "..."
],
"doi": "10.1038/sdata.2014.22",
"reference": [ # reference in BibTeX format
    "@article{ramakrishnan2014quantum,\n  title={Quantum chemistry structures and properties of 134 kilo molecules},\n  author={Ramakrishnan, Raghunathan and Dral, Pavlo O and Rupp, Matthias and Von Lilienfeld, O Anatole},\n  journal={Scientific data},\n  volume={1},\n  number={1},\n  pages={1--7},\n  year={2014},\n  publisher={Nature Publishing Group}\n}\n\n"
]
},

More entries will be added later based on user requests. Entries for new datasets can be created easily from the template provided in `template.json`.
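
Because unresolved values are marked with the `"checking required"` placeholder, it can be useful to list which fields of an entry still need a manual check. The following is a minimal sketch, not part of the provided tools: the helper name `find_unchecked_fields` is hypothetical, and the entry is an abridged copy of the QM9 example above.

```python
PLACEHOLDER = "checking required"

def find_unchecked_fields(dataset):
    """Return the keys whose value is, or directly contains, the placeholder."""
    unchecked = []
    for key, value in dataset.items():
        if isinstance(value, dict):
            items = list(value.values())   # look at the values of a dict field
        elif isinstance(value, list):
            items = value                  # look at the items of a list field
        else:
            items = [value]                # plain str or number
        if PLACEHOLDER in items:
            unchecked.append(key)
    return unchecked

# Abridged QM9 entry, copied from the example above
qm9 = {
    "name": "QM9",
    "chemical_elements": ["H", "C", "N", "O", "F"],
    "number_of_heavy_atoms": ["checking required", "checking required", 9],
    "temperature": 298.15,
}

print(find_unchecked_fields(qm9))  # ['number_of_heavy_atoms']
```

The sketch only looks one level deep, so a placeholder nested inside `methods` would need an additional loop.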

## Filter datasets

Users can write their own scripts to filter the datasets by the properties presented above. We also provide a `filter_dataset` function to help users get started. Three possible usages are shown below.

In [None]:
import json

with open('dataset_overview.json','r') as d:
    datasets = json.load(d)
datasets = datasets['dataset_overview'] # all the datasets are stored under 'dataset_overview' as a list

def lower_list(ll):
    # lowercase every item; if any item is not a str, return the list unchanged
    if all(isinstance(l, str) for l in ll):
        return [l.lower() for l in ll]
    return ll

def filter_dataset( # only list and str type values are supported
    datasets, # the datasets to filter
    entry, # the property to filter on
    value # the value (str or list of str) requested by the user
    ): 
    datasets_to_select = []
    for dataset in datasets:
        if isinstance(dataset[entry], list):
            if isinstance(value, list):
                # keep the dataset if it contains every requested item
                if set(lower_list(value)) <= set(lower_list(dataset[entry])):
                    datasets_to_select.append(dataset)
            elif isinstance(value, str):
                if value.lower() in lower_list(dataset[entry]):
                    datasets_to_select.append(dataset)
            else:
                print('Unsupported type for value')
        elif isinstance(dataset[entry], str):
            # a str entry can only match a str value (case-insensitive)
            if isinstance(value, str) and dataset[entry].lower() == value.lower():
                datasets_to_select.append(dataset)
        else:
            print('filter_dataset only supports list and str entries. For other properties, users can write their own scripts.')
    return datasets_to_select
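
Since `filter_dataset` only handles `list` and `str` entries, nested or numeric fields such as `methods` and `data_size` need a custom script. The following is a minimal sketch of one way to do this with a predicate function. The helper names (`filter_by_predicate`, `uses_energy_method`, `has_at_least_structures`) are hypothetical, and the second sample entry is made up purely for illustration.

```python
def filter_by_predicate(datasets, predicate):
    """Return the datasets for which predicate(dataset) is truthy."""
    return [dataset for dataset in datasets if predicate(dataset)]

def uses_energy_method(dataset, method):
    """True if `method` is one of the keys under methods -> energy."""
    return method in dataset.get("methods", {}).get("energy", {})

def has_at_least_structures(dataset, minimum):
    """True if number_of_structures is an integer >= minimum."""
    n = dataset.get("data_size", {}).get("number_of_structures")
    # The field may hold the "checking required" placeholder instead of a number
    return isinstance(n, int) and n >= minimum

# Two abridged entries: the QM9 values are copied from the example above;
# "toy_set" is a made-up entry used only for this illustration.
sample = [
    {"name": "QM9",
     "methods": {"energy": {"B3LYP/6-31G(2df,p)": "Gaussian 09",
                            "G4MP2": "Gaussian 09"}},
     "data_size": {"number_of_structures": 133885}},
    {"name": "toy_set",
     "methods": {"energy": {"HF/STO-3G": "checking required"}},
     "data_size": {"number_of_structures": "checking required"}},
]

g4mp2 = filter_by_predicate(sample, lambda d: uses_energy_method(d, "G4MP2"))
large = filter_by_predicate(sample, lambda d: has_at_least_structures(d, 100000))
print([d["name"] for d in g4mp2])  # ['QM9']
print([d["name"] for d in large])  # ['QM9']
```

Passing a predicate keeps the loop in one place, so several conditions can be combined in a single `lambda` without writing a new filter function for each field.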


#### Case 1: Select datasets with PubChem as the initial source

In [None]:
datasets_to_select = filter_dataset(
    datasets=datasets,
    entry='initial_source',
    value='pubchem')

# print information
for d in datasets_to_select:
    print(f"Name of the dataset: {d['name']}")
    print(f"Description of the dataset: {d['description']}")

#### Case 2: Select datasets with excited states available

In [None]:
datasets_to_select = filter_dataset(
    datasets=datasets,
    entry='excited_states',
    value='True')

# print information
for d in datasets_to_select:
    print(f"Name of the dataset: {d['name']}")
    print(f"Description of the dataset: {d['description']}")

#### Case 3: Select datasets containing H, C, N, O, and S


In [None]:
datasets_to_select = filter_dataset(
    datasets=datasets,
    entry='chemical_elements',
    value=['H','C','N','O','S'])
    
# print information
for d in datasets_to_select:
    print(f"Name of the dataset: {d['name']}")
    print(f"Description of the dataset: {d['description']}")
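
Note that with a list as `value`, `filter_dataset` performs a subset test: a dataset matches when it covers every requested element, even if it also covers others. The sketch below contrasts this with the stricter question of whether a dataset is limited to a given set of elements. Both helper names are hypothetical, and unlike `filter_dataset` these helpers are case-sensitive.

```python
def covers_elements(dataset, required):
    """True if the dataset covers every element in `required` (it may cover more)."""
    return set(required) <= set(dataset["chemical_elements"])

def restricted_to_elements(dataset, allowed):
    """True if the dataset covers no element outside `allowed`."""
    return set(dataset["chemical_elements"]) <= set(allowed)

# Abridged QM9 entry, copied from the example at the top of this tutorial
qm9 = {"name": "QM9", "chemical_elements": ["H", "C", "N", "O", "F"]}

print(covers_elements(qm9, ["H", "C", "N", "O"]))         # True: F is an extra element
print(restricted_to_elements(qm9, ["H", "C", "N", "O"]))  # False: F is not allowed
print(restricted_to_elements(qm9, ["H", "C", "N", "O", "F", "S"]))  # True
```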

## Generate a .csv table from the latest .json file

To make the statistics easier to inspect, we provide a `json2csv` function that converts the JSON file into CSV format, which can be opened with Excel or other common table tools. We also provide an example in which the table is generated with `pandas`, a Python library for data analysis.

In [None]:
import json
import pandas as pd

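def get_value_for_entry(entry, datasets):
    # convert the values of `entry` into one table cell per dataset
    vals = [dataset[entry] for dataset in datasets]
    def dict2str(d):
        ss_list = [str(vv) for vv in d.values()]
        return ','.join(ss_list)
    if not vals:
        return vals # no datasets given: return an empty column
    if isinstance(vals[0], str):
        return vals 
    elif isinstance(vals[0], list):
        return [','.join([str(vv) for vv in val]) for val in vals]
    elif isinstance(vals[0], dict):
        if entry == 'methods':
            print('Only the methods used for energy calculations are shown')
            vals = [vv['energy'] for vv in vals]
            return [','.join([*d]) for d in vals] # join the method names (dict keys)
        else:
            return [dict2str(d) for d in vals]
    else:
        # other scalar values such as numbers (e.g. 'temperature') are kept as-is
        return vals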

def json2csv(
    datasets, # list of datasets in dict
    entries, # (list) the properties user would like to show in the table
    output_file='dataset_overview.csv' # the name of the output csv file
):

    entries_updated = ['name','description','doi']
    entries_updated += [i for i in entries if i not in entries_updated]
    
    df = pd.DataFrame(columns=entries_updated)
    for entry in entries_updated:
        values = get_value_for_entry(entry, datasets)
        df[entry] = values

    df.to_csv(output_file, index=False)
    return df 


In [None]:
with open('dataset_overview.json','r') as d:
    datasets = json.load(d)
datasets = datasets['dataset_overview'] # all the datasets are stored under 'dataset_overview' as a list

json2csv(
    datasets=datasets[:5],
    entries=['chemical_elements','initial_source'],
    output_file='dataset_overview.csv'
)
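
List-valued fields are written to the CSV file as comma-joined strings, so they need to be split again if the table is read back into Python. The following is a minimal sketch of such a round trip. To stay self-contained it writes a one-row stand-in table to an in-memory buffer instead of reading `dataset_overview.csv`; the row uses the QM9 values shown earlier.

```python
import io
import pandas as pd

# Stand-in for dataset_overview.csv: one row in the column layout that
# json2csv writes, with values taken from the QM9 example in this tutorial.
table = pd.DataFrame({
    "name": ["QM9"],
    "description": ["A collection of molecular structures and properties "
                    "for 134,000 small organic molecules"],
    "doi": ["10.1038/sdata.2014.22"],
    "chemical_elements": ["H,C,N,O,F"],
    "initial_source": ["GDB-17"],
})
buffer = io.StringIO()
table.to_csv(buffer, index=False)
buffer.seek(0)

# With a real file this would be pd.read_csv('dataset_overview.csv')
loaded = pd.read_csv(buffer)

# Turn the comma-joined strings back into Python lists
loaded["chemical_elements"] = loaded["chemical_elements"].str.split(",")
print(loaded.loc[0, "chemical_elements"])  # ['H', 'C', 'N', 'O', 'F']
```

Splitting on commas is only reliable for fields whose items contain no commas themselves. A method name such as `B3LYP/6-31G(2df,p)` in the `methods` column would be split in the wrong place, so that column is better read from the JSON file directly.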

## Putting it together: Generate a table for selected datasets

Below we provide an example that selects the datasets containing the elements F and S and generates a table with the requested information.

In [None]:
import json
import pandas as pd

# load the whole datasets
with open('dataset_overview.json','r') as d:
    datasets = json.load(d)
datasets = datasets['dataset_overview'] # all the datasets are stored under 'dataset_overview' as a list

datasets_to_select = filter_dataset(
    datasets=datasets,
    entry='chemical_elements',
    value=['F', 'S'])

# generate table:
json2csv(
    datasets=datasets_to_select,
    entries=['properties'],
    output_file='dataset_overview.csv')