# Prepare Open Catalyst project data into graph format

---

### Overview
1. Inspect raw data
2. Process raw data
---

In [1]:
import hyper
import prep_open_catalyst

HYPER = hyper.HyperParameter()

### 1. Inspect raw data

We print all the information that the raw data files contain, together with the properties we can deduce from their atomic models.

In [4]:
sample_dataset, sample_datapoint = prep_open_catalyst.import_raw_data_samples(HYPER)
prep_open_catalyst.print_features_of_datapoint(sample_datapoint)

Importing sample file 105.extxyz.xz 

The sampled dataset contains 95 data points, each consisting of a constellation of atoms.

 A single datapoint contains the following properties.

 Global number of atoms:
 100

 Chemical formula:
 C2HAl48ORe48

 Symbols concise:
 Al48Re48C2HO

 Energy:
 -729.60778439

 Free energy:
 -729.61016215

 Volume:
 5585.237204315779

 Center of mass:
 [ 8.76943345  5.9739095  17.60965115]

 Periodic boundary condition (pbc):
 [ True  True  True]

 Cell:
 Cell([[13.19698417, 0.0, 4.39899472], [4.39899472, 12.02606366, -4.39899472], [0.0, 0.0, 35.19195779]])

 Symbols extensive:
 ['Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re',
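
Two of the printed properties can be reproduced from the others, which illustrates what "deduced from the atomic model" means here. The sketch below uses plain Python and makes no claim about how `prep_open_catalyst` computes them: the volume is taken as the absolute scalar triple product of the three cell vectors, and the concise symbol string is built by collapsing runs of identical consecutive symbols. The helper names `cell_volume` and `concise_symbols` are hypothetical, and the tail of the symbol list (truncated in the output above) is assumed to be C, C, H, O.

```python
from itertools import groupby


def cell_volume(cell):
    """Return |a . (b x c)| for the three lattice vectors a, b, c."""
    a, b, c = cell
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return abs(sum(a_i * x_i for a_i, x_i in zip(a, cross)))


def concise_symbols(symbols):
    """Collapse runs of identical consecutive symbols, e.g. C, C, H -> C2H."""
    parts = []
    for symbol, run in groupby(symbols):
        count = len(list(run))
        parts.append(symbol if count == 1 else f"{symbol}{count}")
    return "".join(parts)


# Cell vectors copied from the printed output above.
CELL = [
    [13.19698417, 0.0, 4.39899472],
    [4.39899472, 12.02606366, -4.39899472],
    [0.0, 0.0, 35.19195779],
]
volume = cell_volume(CELL)

# Assumption: the truncated symbol list ends with C, C, H, O.
symbols = ["Al"] * 48 + ["Re"] * 48 + ["C", "C", "H", "O"]
concise = concise_symbols(symbols)
```

Under these assumptions, `volume` agrees with the printed volume up to the rounding of the cell entries, and `concise` equals the printed `Al48Re48C2HO`.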

### 2. Process raw data

We save the following information, from which we derive the essential features and labels for our supervised learning task:
* number of atoms (n_atoms)
* structure atom symbols (symbols_atoms)
* energy (energy)
* positions (pos_x1, pos_y1, pos_z1, ..., pos_x(n_atoms), pos_y(n_atoms), pos_z(n_atoms))
* forces (force_x1, force_y1, force_z1, ..., force_x(n_atoms), force_y(n_atoms), force_z(n_atoms))
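
As a rough illustration of this flat layout, the sketch below turns one datapoint into a single row with one column per coordinate. It is not the code inside `prep_open_catalyst.process_raw_data`; the function name `flatten_datapoint` and the toy numbers are made up for the example, and the atom index starts at 0 on the assumption that it should match the column names printed for `df_training` later in this notebook (`pos_x0`, `pos_y0`, ...).

```python
def flatten_datapoint(symbols_atoms, energy, positions, forces):
    """Flatten one structure into a dict with one entry per saved value."""
    if len(positions) != len(forces):
        raise ValueError("positions and forces must cover the same atoms")

    row = {
        "n_atoms": len(positions),
        "symbols_atoms": symbols_atoms,
        "energy": energy,
    }
    # All position columns first, then all force columns.
    for i, (x, y, z) in enumerate(positions):
        row[f"pos_x{i}"], row[f"pos_y{i}"], row[f"pos_z{i}"] = x, y, z
    for i, (fx, fy, fz) in enumerate(forces):
        row[f"force_x{i}"], row[f"force_y{i}"], row[f"force_z{i}"] = fx, fy, fz
    return row


# Toy two-atom datapoint with illustrative (not real) values.
row = flatten_datapoint(
    symbols_atoms="HO",
    energy=-1.5,
    positions=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.97)],
    forces=[(0.0, 0.0, 0.1), (0.0, 0.0, -0.1)],
)
```

Each row then carries `3 + 6 * n_atoms` values: three metadata fields plus three position and three force components per atom.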


The following will be used as features:
* positions (pos_x1, pos_y1, pos_z1, ..., pos_x(n_atoms), pos_y(n_atoms), pos_z(n_atoms))


The following will be used as labels:
* energy (energy)
* forces (force_x1, force_y1, force_z1, ..., force_x(n_atoms), force_y(n_atoms), force_z(n_atoms))
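
The feature/label split described above can be sketched on a single flattened row. This is a minimal example under assumptions: the row is a plain dict, the helper `split_features_labels` is hypothetical, the values are made up, and columns are indexed from 0 as in the `df_training` output shown later.

```python
AXES = ("x", "y", "z")


def split_features_labels(row):
    """Return (features, labels) for one flattened datapoint."""
    n_atoms = row["n_atoms"]
    # Features: atom positions, ordered atom by atom.
    features = [row[f"pos_{axis}{i}"] for i in range(n_atoms) for axis in AXES]
    # Labels: the total energy plus the per-atom force components.
    forces = [row[f"force_{axis}{i}"] for i in range(n_atoms) for axis in AXES]
    labels = {"energy": row["energy"], "forces": forces}
    return features, labels


# Toy two-atom row with illustrative (not real) values.
example_row = {
    "n_atoms": 2,
    "symbols_atoms": "HO",
    "energy": -1.5,
    "pos_x0": 0.0, "pos_y0": 0.0, "pos_z0": 0.0,
    "pos_x1": 0.0, "pos_y1": 0.0, "pos_z1": 0.97,
    "force_x0": 0.0, "force_y0": 0.0, "force_z0": 0.1,
    "force_x1": 0.0, "force_y1": 0.0, "force_z1": -0.1,
}
features, labels = split_features_labels(example_row)
```

Reading `n_atoms` from the row keeps the split independent of how many columns the widest structure in the dataset needs.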


The following will be used to simplify data processing and to dynamically augment the features with constant per-atom properties such as mass and electronegativity:
* number of atoms (n_atoms)
* structure atom symbols (symbols_atoms)
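
One way such an augmentation could work is sketched below: the concise symbol string is expanded back into one symbol per atom, and each symbol is mapped to constant properties. The helpers `expand_symbols` and `atom_properties` are hypothetical, and `ATOM_PROPERTIES` is a small illustrative lookup table (rounded standard atomic masses and Pauling electronegativities) covering only the elements of the sample datapoint, not the table the project actually uses.

```python
import re

# Illustrative subset: symbol -> (atomic mass in u, Pauling electronegativity).
ATOM_PROPERTIES = {
    "H": (1.008, 2.20),
    "C": (12.011, 2.55),
    "O": (15.999, 3.44),
    "Al": (26.982, 1.61),
    "Re": (186.207, 1.90),
}


def expand_symbols(symbols_atoms):
    """Expand a concise string such as 'C2HO' into ['C', 'C', 'H', 'O']."""
    symbols = []
    # An element symbol is one capital letter plus an optional lowercase one,
    # followed by an optional repeat count (missing count means 1).
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", symbols_atoms):
        symbols.extend([symbol] * (int(count) if count else 1))
    return symbols


def atom_properties(symbols):
    """Look up the constant properties for every atom, in atom order."""
    return [ATOM_PROPERTIES[symbol] for symbol in symbols]


symbols = expand_symbols("Al48Re48C2HO")
properties = atom_properties(symbols)
```

The length of the expanded list can be checked against `n_atoms` (100 for this datapoint), which is a cheap consistency test before the properties are appended to the position features.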



In [2]:
df_training, df_validation, df_testing = prep_open_catalyst.process_raw_data(HYPER)

In [3]:
df_training

| Unnamed: 0 | n_atoms | symbols_atoms | energy | pos_x0 | pos_y0 | pos_z0 | pos_x1 | pos_y1 | pos_z1 | pos_x2 | ... | force_z217 | force_x218 | force_y218 | force_z218 | force_x219 | force_y219 | force_z219 | force_x220 | force_y220 | force_z220 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | Al48Re48C2HO | -729.607784 | 5.498743 | 2.236723 | 16.129647 | 1.099749 | 2.236723 | 16.129647 | 3.299246 | ... | | | | | | | | | | |
| 0 | 59 | Zr36Co6Te12NH3O | -396.740199 | 6.778714 | 4.399073 | 19.658950 | 13.540526 | 0.448924 | 23.093010 | 5.161429 | ... | | | | | | | | | | |
| 0 | 52 | Au12Nb36NH3 | -376.166771 | 5.375560 | 4.150542 | 18.611123 | 3.033585 | 3.597201 | 20.164822 | 5.615150 | ... | | | | | | | | | | |
| 0 | 161 | Tl80Sn20S60N | -506.784264 | 5.868771 | 1.146471 | 29.786739 | 14.882965 | 12.980509 | 41.343782 | 17.200255 | ... | | | | | | | | | | |
| 0 | 68 | Sb16Pt48N2H2 | -330.498068 | 2.010649 | 6.305745 | 13.069215 | 0.000000 | 2.347826 | 11.058567 | 0.000000 | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | 45 | Ga18P18C2H5O2 | -184.109465 | 0.160982 | 3.376109 | 16.264987 | 1.684315 | 0.848926 | 10.578389 | -1.292035 | ... | | | | | | | | | | |
| 0 | 126 | Ta72Ga48C2H2O2 | -937.256679 | 14.035666 | 2.094278 | 13.679034 | 7.833401 | 5.195411 | 17.209107 | 16.512517 | ... | | | | | | | | | | |
| 0 | 28 | K8Na8Se8C2HO | -87.313297 | 4.261174 | 6.126177 | 9.480781 | 0.113061 | 7.293140 | 9.480781 | 4.035052 | ... | | | | | | | | | | |
| 0 | 94 | Nb36Zn6S48N2HO | -605.946977 | 2.558971 | 12.139895 | 15.057044 | 2.558971 | 3.705012 | 19.926926 | 2.558971 | ... | | | | | | | | | | |
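
The structures in this output have different atom counts (28 to 161 in the rows shown), while every row shares the same columns, and the trailing force columns are empty for all rows shown. That pattern is consistent with shorter structures being padded up to the width of the largest one. The sketch below shows one way to pad and later undo the padding, assuming missing values are represented as NaN; `pad_values` and `trim_values` are hypothetical helpers, not functions from `prep_open_catalyst`.

```python
import math


def pad_values(values, max_atoms, fill=math.nan):
    """Pad a flat x, y, z list to 3 * max_atoms entries."""
    width = 3 * max_atoms
    if len(values) > width:
        raise ValueError("structure has more atoms than max_atoms")
    return list(values) + [fill] * (width - len(values))


def trim_values(padded, n_atoms):
    """Drop the padding again, using the stored atom count."""
    return padded[: 3 * n_atoms]


# One atom padded to a two-atom layout, then restored.
positions = [1.0, 2.0, 3.0]
padded = pad_values(positions, max_atoms=2)
restored = trim_values(padded, n_atoms=1)
```

Storing `n_atoms` alongside each row is what makes the trimming step possible without scanning for missing values.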
