This notebook creates a CSV file from JSON files of toy data. The toy data contains the input parameters and the results of simulations.

In [33]:
import pandas as pd
import json
import os
import requests

## Parsing of the json file

The name of the JSON file encodes the composition of the simulated structure: each `Element-count` pair is separated by an underscore.

In [34]:
filename = "Ge-1_Se-1"
element_list = {elt.split("-")[0]: int(elt.split("-")[1]) for elt in filename.split("_")}
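The filename convention used above (element symbol, hyphen, atom count, pairs joined by underscores) can also be wrapped in a small helper. This is only a minimal sketch: the function name `parse_composition` is hypothetical and not part of the notebook, and summing repeated symbols is an assumption about how duplicates should be handled.

```python
def parse_composition(name):
    """Parse a name such as 'Ge-1_Se-1' into {'Ge': 1, 'Se': 1}."""
    composition = {}
    for token in name.split("_"):
        symbol, count = token.split("-")
        # Assumption: if a symbol appears more than once, its counts are summed.
        composition[symbol] = composition.get(symbol, 0) + int(count)
    return composition


composition = parse_composition("Ge-1_Se-1")
print(composition)
```

For a single occurrence of each element this gives the same dictionary as the comprehension in the cell above.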

In [35]:
JSON_PATH = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), "data/" + filename + ".json")

In [36]:
with open(JSON_PATH) as file:
    data = json.load(file)

In [37]:
raw_df = pd.DataFrame(data)
raw_df.columns

Index(['ecutrho', 'k_density', 'ecutwfc', 'n_iterations', 'time', 'converged',
       'accuracy', 'fermi', 'total_energy'],
      dtype='object')

## Data Transformation: Computation of energy difference

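We take as reference the total energy of the converged run with the most demanding simulation parameters: the highest `ecutwfc` and `ecutrho` and the smallest `k_density`.
We then compute, for every run, the energy difference `delta_E` with respect to this reference energy.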

In [38]:
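# Boolean mask of the converged runs; only these are candidates for the reference.
converged_rows = raw_df.loc[:,'converged'] == True
idx_ref = 0
# NOTE: the three tests below are combined with `or`, so the result depends on row order.
for idx, row in raw_df.loc[converged_rows].iterrows():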
    if (
        row["ecutwfc"] > raw_df.loc[idx_ref, "ecutwfc"]
        or row["ecutrho"] > raw_df.loc[idx_ref, "ecutrho"]
        or row["k_density"] < raw_df.loc[idx_ref, "k_density"]
    ):
        idx_ref = idx

ref_energy = raw_df.loc[idx_ref, "total_energy"]
print(f"Ref energy: {ref_energy} (found at index {idx_ref})")

raw_df = raw_df.assign(delta_E = raw_df.loc[:,'total_energy'] - ref_energy)

Ref energy: -256.52875705 (found at index 660)
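The reference selection above depends on the order of the rows, because a row replaces the current reference as soon as any one of its parameters is stricter. A possible order-independent alternative is to sort the converged rows lexicographically. The sketch below is not the notebook's method: the helper name `pick_reference`, the priority order (`ecutwfc`, then `ecutrho`, then `k_density`) and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, not taken from the data set.
toy = pd.DataFrame({
    "ecutwfc": [20, 100, 100, 100],
    "ecutrho": [100, 380, 380, 380],
    "k_density": [0.5, 0.25, 0.1, 0.1],
    "converged": [True, True, True, False],
    "total_energy": [-255.0, -256.4, -256.5, -256.6],
})


def pick_reference(df):
    """Return the index label of the converged row with the strictest parameters."""
    converged = df[df["converged"]]
    # Highest cutoffs first, then smallest k_density (assumed priority order).
    ranked = converged.sort_values(
        by=["ecutwfc", "ecutrho", "k_density"],
        ascending=[False, False, True],
    )
    return ranked.index[0]


idx_ref = pick_reference(toy)
delta_E = toy["total_energy"] - toy.loc[idx_ref, "total_energy"]
print(idx_ref)
```

Unconverged rows are excluded before sorting, so the last toy row is never chosen even though its parameters tie with the selected one.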


We only keep the cut-off energies (`ecutwfc`, `ecutrho`), the k-point spacing, whether the algorithm converged, the accuracy, and the energy difference `delta_E` w.r.t. the reference calculation.

In [39]:
rel_cols = ['ecutrho', 'k_density', 'ecutwfc', 'converged' , 'accuracy', 'delta_E']
df = raw_df[rel_cols].copy()  # copy, so that later column assignments do not act on a view of raw_df

In [40]:
# LOADING ALL ELEMENT KEYS
url_table = requests.get("https://archive.materialscloud.org/record/file?record_id=862&filename=SSSP_1.1.2_PBE_efficiency.json&file_id=a5642f40-74af-4073-8dfd-706d2c7fccc2", timeout=30)
url_table.raise_for_status()  # stop here if the download failed
text_table = url_table.text
sssp_table = json.loads(text_table)
periodic_table_keys = list(sssp_table.keys())

For each element in the periodic table, we want to know the relative contribution to the total number of atoms in the structure, i.e. 0.0 -> not in the structure, 1.0 -> all the atoms in the structure are from this element.

In our toy example, we have data from GeSe simulations (file `Ge-1_Se-1`). Hence, the two elements get a value of 0.5 each.

In [41]:
# ADDING A ZERO COLUMN FOR EACH ELEMENT
#for element in periodic_table_keys:
#    df = df.assign(**{element: 0.0})

#total_elt = sum(list(element_list.values()))
#for elt, nb_elt in element_list.items():
#    df = df.assign(**{elt: nb_elt / total_elt})

In [42]:
filepath = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), "data/" + filename + ".csv")
#df.to_csv(filepath)

## Data Transformation: Encoding of the atomic structure

For each column in the periodic table, we want to know the relative contribution to the total number of atoms in the structure, i.e. 0.0 -> no contribution, 1.0 -> all the atoms in the structure are from that column.

For the time being, we leave out the lanthanides and actinides, which leaves us with 18 columns. We denote the nth column of the periodic table by 'PTCn'.

In the notebook 'encoding_periodic_table.ipynb', the periodic table has been encoded into a dict whose keys are the element symbols and whose values are the corresponding column strings 'PTCn'.
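The column encoding described above can be sketched as follows. The mapping `pt_excerpt` is a hypothetical three-element excerpt (Ge and Se map to the columns printed further down in this notebook; Te is added only for illustration), and the helper name `encode_columns` is not part of the notebook.

```python
# Illustrative excerpt of an element -> column mapping (an assumption, not the full file).
pt_excerpt = {"Ge": "PTC14", "Se": "PTC16", "Te": "PTC16"}


def encode_columns(composition, mapping):
    """Return the fraction of atoms per periodic-table column, for all 18 columns."""
    fractions = {f"PTC{n}": 0.0 for n in range(1, 19)}
    total = sum(composition.values())
    for symbol, count in composition.items():
        # Accumulate, so that elements sharing a column add up.
        fractions[mapping[symbol]] += count / total
    return fractions


encoded = encode_columns({"Ge": 1, "Se": 1}, pt_excerpt)
print(encoded["PTC14"], encoded["PTC16"])
```

Accumulating rather than assigning matters when two elements fall in the same column: a hypothetical `{"Se": 1, "Te": 1}` structure would then give `PTC16` a value of 1.0 rather than 0.5.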

In [43]:
# LOADING ALL ELEMENTS
PT_PATH = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), "data/periodic_table_info.json")
with open(PT_PATH, 'r') as rf:
    pt_info = json.load(rf)

In [49]:
ptc_colnames = ['PTC' + str(n) for n in range(1, 19)]
for colname in ptc_colnames:
    df = df.assign(**{colname : 0.0})

In [52]:
total_elt = sum(element_list.values())
for elt, nb_elt in element_list.items():
    ptc = pt_info[elt]
    print(elt, ' -> ', ptc)
    # Add to the column (initialised to 0.0) so that elements sharing a column accumulate
    df[ptc] += nb_elt / total_elt

Ge  ->  PTC14
Se  ->  PTC16


Unnamed: 0,ecutrho,k_density,ecutwfc,converged,accuracy,delta_E,PTC1,PTC2,PTC3,PTC4,...,PTC9,PTC10,PTC11,PTC12,PTC13,PTC14,PTC15,PTC16,PTC17,PTC18
0,100,0.500000,20,True,1.000000e-13,0.872583,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
1,100,0.250000,20,True,4.300000e-13,0.736391,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
2,100,0.166667,20,True,2.800000e-13,0.736493,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
3,100,0.125000,20,True,1.900000e-14,0.730394,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
4,100,0.100000,20,True,7.800000e-13,0.731293,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
656,380,0.500000,100,True,5.500000e-14,0.151176,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
657,380,0.250000,100,True,1.700000e-13,-0.000548,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
658,380,0.166667,100,True,6.900000e-13,-0.000436,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
659,380,0.125000,100,True,8.600000e-13,-0.000028,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0
