# Prepare Open Catalyst project data into graph format

---

### Overview
1. Inspect raw data
2. Augment data
3. Training, validation, testing data
---

In [1]:
import hyper
import prep_open_catalyst

HYPER = hyper.HyperParameter()

### 1. Inspect raw data

We print all the information that the raw data files contain, along with the properties that can be derived from their atomic models.
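To make the raw file layout concrete, here is a minimal sketch of reading an xz-compressed extended XYZ file with the standard library only. It is not the implementation of `prep_open_catalyst.import_raw_data_samples`, whose internals are not shown here. The function name `read_extxyz_frames` is hypothetical, and the sketch assumes that the first four columns of each atom line are the element symbol followed by the x, y, z position (the actual column layout is declared by the `Properties` key in each frame's comment line).

```python
import lzma


def read_extxyz_frames(path):
    """Return a list of frames: (comment_line, [(symbol, x, y, z), ...])."""
    frames = []
    with lzma.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                break  # end of file, or a trailing blank line
            n_atoms = int(header)  # first line of a frame: atom count
            comment = handle.readline().rstrip("\n")  # key=value metadata
            atoms = []
            for _ in range(n_atoms):
                fields = handle.readline().split()
                # Assumption: columns 0-3 are symbol, x, y, z.
                atoms.append(
                    (fields[0], float(fields[1]), float(fields[2]), float(fields[3]))
                )
            frames.append((comment, atoms))
    return frames
```

Calling `len(read_extxyz_frames(path))` on a sample file gives the number of data points it contains. Remaining per-atom columns such as forces are ignored in this sketch.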

In [2]:
sample_dataset, sample_datapoint = prep_open_catalyst.import_raw_data_samples(HYPER)
prep_open_catalyst.print_features_of_datapoint(sample_datapoint)

Importing sample file 105.extxyz.xz 

The sampled dataset contains 95 data points, each consisting of a constellation of atoms.

 A single datapoint contains the following properties.

 Global number of atoms:
 100

 Chemical formula:
 C2HAl48ORe48

 Symbols concise:
 Al48Re48C2HO

 Energy:
 -729.60778439

 Free energy:
 -729.61016215

 Volume:
 5585.237204315779

 Center of mass:
 [ 8.76943345  5.9739095  17.60965115]

 Periodic boundary condition (pbc):
 [ True  True  True]

 Cell:
 Cell([[13.19698417, 0.0, 4.39899472], [4.39899472, 12.02606366, -4.39899472], [0.0, 0.0, 35.19195779]])

 Symbols extensive:
 ['Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re',
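Some of the printed properties follow directly from others. As a small worked check, the volume of a periodic cell is the absolute value of the determinant of its 3x3 cell matrix. The sketch below recomputes it from the cell printed above; the helper name `cell_volume` is introduced here for illustration only.

```python
def cell_volume(cell):
    """Volume of a periodic cell: absolute value of the 3x3 determinant."""
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = cell
    det = (
        a1 * (b2 * c3 - b3 * c2)
        - a2 * (b1 * c3 - b3 * c1)
        + a3 * (b1 * c2 - b2 * c1)
    )
    return abs(det)


# Cell vectors copied from the printed datapoint above.
cell = [
    [13.19698417, 0.0, 4.39899472],
    [4.39899472, 12.02606366, -4.39899472],
    [0.0, 0.0, 35.19195779],
]
volume = cell_volume(cell)
print(volume)
```

The result agrees with the printed volume of 5585.237204315779 up to the rounding of the cell entries.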

### 2. Augment data

Augment the data with element properties from the periodic table of elements, downloaded from https://pubchem.ncbi.nlm.nih.gov/periodic-table/

In [3]:
df_periodic_table = prep_open_catalyst.create_augment_data(HYPER)
display(df_periodic_table)

Unnamed: 0,AtomicNumber,Symbol,Name,AtomicMass,CPKHexColor,ElectronConfiguration,Electronegativity,AtomicRadius,IonizationEnergy,ElectronAffinity,OxidationStates,StandardState,MeltingPoint,BoilingPoint,Density,GroupBlock,YearDiscovered
0,1,H,Hydrogen,1.008000,FFFFFF,1s1,2.20,120.0,13.598,0.754,"+1, -1",Gas,13.81,20.28,0.000090,Nonmetal,1766
1,2,He,Helium,4.002600,D9FFFF,1s2,,140.0,24.587,,0,Gas,0.95,4.22,0.000179,Noble gas,1868
2,3,Li,Lithium,7.000000,CC80FF,[He]2s1,0.98,182.0,5.392,0.618,+1,Solid,453.65,1615.00,0.534000,Alkali metal,1817
3,4,Be,Beryllium,9.012183,C2FF00,[He]2s2,1.57,153.0,9.323,,+2,Solid,1560.00,2744.00,1.850000,Alkaline earth metal,1798
4,5,B,Boron,10.810000,FFB5B5,[He]2s2 2p1,2.04,192.0,8.298,0.277,+3,Solid,2348.00,4273.00,2.370000,Metalloid,1808
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113,114,Fl,Flerovium,290.192000,,[Rn]7s2 7p2 5f14 6d10 (predicted),,,,,"6, 4,2, 1, 0",Expected to be a Solid,,,,Post-transition metal,1998
114,115,Mc,Moscovium,290.196000,,[Rn]7s2 7p3 5f14 6d10 (predicted),,,,,"3, 1",Expected to be a Solid,,,,Post-transition metal,2003
115,116,Lv,Livermorium,293.205000,,[Rn]7s2 7p4 5f14 6d10 (predicted),,,,,"+4, +2, -2",Expected to be a Solid,,,,Post-transition metal,2000
116,117,Ts,Tennessine,294.211000,,[Rn]7s2 7p5 5f14 6d10 (predicted),,,,,"+5, +3, +1, -1",Expected to be a Solid,,,,Halogen,2010


### 3. Training, validation, testing data

We save the following information, from which we derive the essential features and labels for our supervised learning task:
* number of atoms (n_atoms)
* structure atom symbols (symbols_atoms)
* energy (energy)
* positions (pos_x0, pos_y0, pos_z0, ..., pos_x(n_atoms-1), pos_y(n_atoms-1), pos_z(n_atoms-1))
* forces (force_x0, force_y0, force_z0, ..., force_x(n_atoms-1), force_y(n_atoms-1), force_z(n_atoms-1))
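The saved layout can be sketched as one wide row per structure, using the zero-based column names that appear in the output tables of this notebook. Because structures differ in their number of atoms, the sketch pads unused position and force columns with NaN, which is consistent with the empty trailing columns in those tables. This is an assumption about how `prep_open_catalyst.process_raw_data` behaves, not its actual code; `flatten_datapoint` and `max_atoms` are hypothetical names.

```python
import math


def flatten_datapoint(symbols_atoms, energy, positions, forces, max_atoms):
    """Build one wide row; unused pos_/force_ columns are padded with NaN."""
    n_atoms = len(positions)
    if len(forces) != n_atoms or n_atoms > max_atoms:
        raise ValueError("positions/forces mismatch or too many atoms")
    row = {"n_atoms": n_atoms, "symbols_atoms": symbols_atoms, "energy": energy}
    # All position columns come first, then all force columns.
    for name, vectors in (("pos", positions), ("force", forces)):
        for i in range(max_atoms):
            vec = vectors[i] if i < n_atoms else (math.nan,) * 3
            for axis, value in zip("xyz", vec):
                row[f"{name}_{axis}{i}"] = value
    return row


# Toy two-atom structure, padded to three atoms.
row = flatten_datapoint(
    "HO",
    -1.5,
    positions=[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    forces=[(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)],
    max_atoms=3,
)
```

A list of such rows can be passed to a table constructor to obtain a frame with one row per structure and a fixed set of columns.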


The following will be used as features:
* positions (pos_x0, pos_y0, pos_z0, ..., pos_x(n_atoms-1), pos_y(n_atoms-1), pos_z(n_atoms-1))


The following will be used as labels:
* energy (energy)
* forces (force_x0, force_y0, force_z0, ..., force_x(n_atoms-1), force_y(n_atoms-1), force_z(n_atoms-1))


The following are used to simplify processing and to dynamically augment the features with per-atom properties from the periodic table, such as mass and electronegativity:
* number of atoms (n_atoms)
* structure atom symbols (symbols_atoms)
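As an illustration of this dynamic augmentation, the sketch below expands a concise formula string such as `Al48Re48C2HO` into one symbol per atom and looks up mass and electronegativity for each. The function names and the small `PERIODIC_TABLE` dictionary are hypothetical; the dictionary holds only a few values copied from the periodic table preview in section 2. Note that the concise formula groups atoms by element, so the expanded order is only assumed to match the order of the position columns.

```python
import re

# A few rows copied from the periodic table preview (section 2).
PERIODIC_TABLE = {
    "H": {"AtomicMass": 1.008, "Electronegativity": 2.20},
    "Li": {"AtomicMass": 7.0, "Electronegativity": 0.98},
    "B": {"AtomicMass": 10.81, "Electronegativity": 2.04},
}


def expand_formula(formula):
    """Expand e.g. 'C2HO' into ['C', 'C', 'H', 'O']."""
    symbols = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        symbols.extend([symbol] * int(count or 1))  # a missing count means 1
    return symbols


def atom_features(formula, table):
    """Per-atom (mass, electronegativity) tuples, in formula order."""
    return [
        (table[s]["AtomicMass"], table[s]["Electronegativity"])
        for s in expand_formula(formula)
    ]


symbols = expand_formula("Al48Re48C2HO")
print(len(symbols))  # 100, the atom count printed in section 1

# Toy formula restricted to elements present in the small table above.
features = atom_features("LiBH4", PERIODIC_TABLE)
```

Storing only `n_atoms` and `symbols_atoms` keeps the saved files small, since element properties can be joined in at load time rather than duplicated for every atom.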

In [2]:
df_training, df_validation, df_testing = prep_open_catalyst.process_raw_data(HYPER)

display(df_training)
display(df_validation)
display(df_testing)

Processing training data


100%|█████████████████████████████████████████| 100/100 [07:02<00:00,  4.23s/it]


Processing validation data


100%|█████████████████████████████████████████| 100/100 [06:59<00:00,  4.20s/it]


Processing testing_ood_both data


100%|█████████████████████████████████████████| 100/100 [07:17<00:00,  4.38s/it]


Processing testing_ood_cat data


100%|█████████████████████████████████████████| 100/100 [06:51<00:00,  4.12s/it]


Processing testing_ood_ads data


100%|█████████████████████████████████████████| 100/100 [07:36<00:00,  4.57s/it]


Unnamed: 0,n_atoms,symbols_atoms,energy,pos_x0,pos_y0,pos_z0,pos_x1,pos_y1,pos_z1,pos_x2,...,force_z220,force_x221,force_y221,force_z221,force_x222,force_y222,force_z222,force_x223,force_y223,force_z223
0,67,Co48W16C2H,-505.688484,3.002557e+00,1.026604,15.382117,3.721496,3.079812,11.499983,3.721496,...,,,,,,,,,,
0,52,Al24Cu24NH3,-182.353104,0.000000e+00,2.135570,11.745634,0.000000,0.000000,13.881203,0.000000,...,,,,,,,,,,
0,86,Os20Pt20Sc40C2H3O,-627.255796,2.315190e+00,10.430328,13.680880,2.315190,3.094713,16.922795,2.315190,...,,,,,,,,,,
0,131,Sc32Sn96ONH,-558.033499,6.345227e+00,10.197099,12.492166,0.000000,0.000000,18.837393,3.172614,...,,,,,,,,,,
0,112,Hf36Si36Ni36C2H2,-775.828981,2.860949e+00,3.392390,17.704656,2.860949,13.057783,18.835877,2.860949,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,139,Al104Os32N2O,-758.587628,0.000000e+00,7.654267,27.413607,0.000000,2.303521,26.125611,0.000000,...,,,,,,,,,,
0,53,Ag12Al12Y24CH3O,-244.700249,-4.000000e-08,0.171119,20.249111,0.002357,7.098446,22.621136,3.686021,...,,,,,,,,,,
0,96,Hf50Sn40C2H3O,-648.309721,1.745060e+01,3.739011,17.934424,2.267183,5.371822,17.934424,9.499522,...,,,,,,,,,,
0,70,Zr40Cr8Sb16C2H2O2,-500.119098,8.125752e-01,2.798334,12.075078,10.392471,2.798334,18.738798,3.331860,...,,,,,,,,,,


Unnamed: 0,n_atoms,symbols_atoms,energy,pos_x0,pos_y0,pos_z0,pos_x1,pos_y1,pos_z1,pos_x2,...,force_z216,force_x217,force_y217,force_z217,force_x218,force_y218,force_z218,force_x219,force_y219,force_z219
0,52,Ir16Te32CH2O,-241.566736,-0.009435,7.534863,19.631822,4.651943e+00,11.157494,16.786254,0.000000,...,,,,,,,,,,
0,55,As12Co12Mn24C2H4O,-351.842303,0.002137,2.327156,24.318908,0.000000e+00,0.000000,21.022291,2.019368,...,,,,,,,,,,
0,183,Y108H74C,-936.029973,2.152734,7.229120,14.914581,5.381836e+00,24.785763,16.778903,4.305469,...,,,,,,,,,,
0,56,Au12Na36C2H4O2,-124.779492,0.000000,0.000000,24.346177,5.309774e+00,3.065599,15.675352,2.654887,...,,,,,,,,,,
0,85,V60Ge20C2H3,-606.016375,-0.845234,6.250061,18.661400,5.916639e+00,2.346092,15.900877,2.475443,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,111,Se48V60N2O,-720.201954,1.672647,5.205978,18.092989,3.000000e-08,5.205978,24.725246,0.474571,...,,,,,,,,,,
0,56,Ca24Tl24C2H4O2,-139.946055,0.064036,0.130446,21.669780,6.000000e-08,3.244348,17.008337,2.809688,...,,,,,,,,,,
0,68,Cr16Ni48CH3,-377.374484,0.000000,0.000000,17.122125,2.490491e+00,0.000000,14.631634,4.980982,...,,,,,,,,,,
0,87,Sr32In16N32C2H4O,-385.420113,2.717927,2.047568,22.571022,0.000000e+00,11.234991,15.398455,2.734782,...,,,,,,,,,,


Unnamed: 0,n_atoms,symbols_atoms,energy,pos_x0,pos_y0,pos_z0,pos_x1,pos_y1,pos_z1,pos_x2,...,force_z219,force_x220,force_y220,force_z220,force_x221,force_y221,force_z221,force_x222,force_y222,force_z222
0,57,Te36Mo18CHO,-292.078200,3.572862,1.031396,18.406606,1.786431,2.062793,26.110461,1.786431,...,,,,,,,,,,
0,100,Fe80Y16NO2H,-685.908707,4.631673,5.219386,13.958494,0.726388,2.685117,15.669541,3.084574,...,,,,,,,,,,
0,99,Fe32Ge64CHO,-506.359434,0.000000,5.103869,19.557570,-0.037307,5.158870,22.147874,2.838971,...,,,,,,,,,,
0,51,Rh16Sb16Se16CHO,-249.050448,2.181003,8.547154,13.520531,6.633046,4.912076,16.090920,2.219177,...,,,,,,,,,,
0,61,Nb18Se36C2H3O2,-348.789928,2.024206,3.486077,13.439769,1.012103,10.458232,8.180729,2.024206,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,92,W40C42N2H8,-867.921716,4.756594,2.701329,13.988177,7.406975,2.025997,15.037288,0.000000,...,,,,,,,,,,
0,60,Ti32Sn16N2C2H8,-341.716076,1.379694,1.418215,16.130472,1.379694,7.091075,11.351073,2.759387,...,,,,,,,,,,
0,36,Rb8Sb16N2C2H8,-128.905585,5.015242,9.286547,19.355810,9.117517,11.669718,21.506456,3.255418,...,,,,,,,,,,
0,35,Ca16Au16CHO,-106.919528,-0.000012,1.157834,21.056821,2.034689,3.542000,12.454411,2.034689,...,,,,,,,,,,
