# Prepare Open Catalyst Project data in graph format

---

### Overview
1. Inspect raw data
2. Augment data
3. Training, validation, testing data
---

In [1]:
import hyper_opencatalyst
import prep_open_catalyst

HYPER = hyper_opencatalyst.HyperOpenCatalyst()

### 1. Inspect raw data

We print every piece of information that the raw data files contain or that can be deduced from their atomic models.

In [2]:
sample_dataset, sample_datapoint = prep_open_catalyst.import_raw_data_samples(HYPER)
prep_open_catalyst.print_features_of_datapoint(sample_datapoint)

Importing sample file 105.extxyz.xz 

The sampled dataset contains 95 data points, each consisting of a constellation of atoms.

 A single datapoint contains the following properties.

 Global number of atoms:
 100

 Chemical formula:
 C2HAl48ORe48

 Symbols concise:
 Al48Re48C2HO

 Energy:
 -729.60778439

 Free energy:
 -729.61016215

 Volume:
 5585.237204315779

 Center of mass:
 [ 8.76943345  5.9739095  17.60965115]

 Periodic boundary condition (pbc):
 [ True  True  True]

 Cell:
 Cell([[13.19698417, 0.0, 4.39899472], [4.39899472, 12.02606366, -4.39899472], [0.0, 0.0, 35.19195779]])

 Symbols extensive:
 ['Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Al', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re', 'Re',
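
Some of the deduced quantities follow directly from others in the printout. As a minimal sketch in plain Python (not the code used inside `prep_open_catalyst`; the helper names are hypothetical), the volume can be recomputed from the three cell vectors as a scalar triple product. The result should agree with the printed volume up to the rounding of the cell entries.

```python
# Cell vectors copied from the "Cell" entry of the sample output above.
cell = [
    [13.19698417, 0.0, 4.39899472],
    [4.39899472, 12.02606366, -4.39899472],
    [0.0, 0.0, 35.19195779],
]


def cross(u, v):
    """Cross product of two 3-vectors."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))


def cell_volume(cell):
    """Volume of the parallelepiped spanned by the three cell vectors."""
    a, b, c = cell
    return abs(dot(a, cross(b, c)))


volume = cell_volume(cell)
print(volume)
```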

### 2. Augment data

We augment the data with additional element properties from the periodic table of elements, downloaded from https://pubchem.ncbi.nlm.nih.gov/periodic-table/

In [3]:
df_periodic_table = prep_open_catalyst.create_augment_data(HYPER)
display(df_periodic_table)

Unnamed: 0,AtomicNumber,Symbol,Name,AtomicMass,CPKHexColor,ElectronConfiguration,Electronegativity,AtomicRadius,IonizationEnergy,ElectronAffinity,OxidationStates,StandardState,MeltingPoint,BoilingPoint,Density,GroupBlock,YearDiscovered
0,1,H,Hydrogen,1.008000,FFFFFF,1s1,2.20,120.0,13.598,0.754,"+1, -1",Gas,13.81,20.28,0.000090,Nonmetal,1766
1,2,He,Helium,4.002600,D9FFFF,1s2,,140.0,24.587,,0,Gas,0.95,4.22,0.000179,Noble gas,1868
2,3,Li,Lithium,7.000000,CC80FF,[He]2s1,0.98,182.0,5.392,0.618,+1,Solid,453.65,1615.00,0.534000,Alkali metal,1817
3,4,Be,Beryllium,9.012183,C2FF00,[He]2s2,1.57,153.0,9.323,,+2,Solid,1560.00,2744.00,1.850000,Alkaline earth metal,1798
4,5,B,Boron,10.810000,FFB5B5,[He]2s2 2p1,2.04,192.0,8.298,0.277,+3,Solid,2348.00,4273.00,2.370000,Metalloid,1808
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113,114,Fl,Flerovium,290.192000,,[Rn]7s2 7p2 5f14 6d10 (predicted),,,,,"6, 4,2, 1, 0",Expected to be a Solid,,,,Post-transition metal,1998
114,115,Mc,Moscovium,290.196000,,[Rn]7s2 7p3 5f14 6d10 (predicted),,,,,"3, 1",Expected to be a Solid,,,,Post-transition metal,2003
115,116,Lv,Livermorium,293.205000,,[Rn]7s2 7p4 5f14 6d10 (predicted),,,,,"+4, +2, -2",Expected to be a Solid,,,,Post-transition metal,2000
116,117,Ts,Tennessine,294.211000,,[Rn]7s2 7p5 5f14 6d10 (predicted),,,,,"+5, +3, +1, -1",Expected to be a Solid,,,,Halogen,2010
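
The table contains empty fields (helium, for instance, has no electronegativity value), so any lookup built on it needs an explicit policy for missing values. The following is a minimal sketch, assuming the table is available as CSV text; it keeps only a few columns from the rows shown above and maps empty fields to `None`. The function names are hypothetical and not part of `prep_open_catalyst`.

```python
import csv
import io

# Excerpt copied from the table above; only a few columns are kept for brevity.
CSV_EXCERPT = """AtomicNumber,Symbol,AtomicMass,Electronegativity,AtomicRadius
1,H,1.008000,2.20,120.0
2,He,4.002600,,140.0
3,Li,7.000000,0.98,182.0
4,Be,9.012183,1.57,153.0
5,B,10.810000,2.04,192.0
"""

NUMERIC_COLUMNS = ("AtomicMass", "Electronegativity", "AtomicRadius")


def to_float_or_none(text):
    """Convert a CSV field to float; empty fields become None."""
    return float(text) if text.strip() else None


def load_element_properties(csv_text):
    """Return {symbol: {column: value}} for the numeric columns."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row["Symbol"]] = {
            col: to_float_or_none(row[col]) for col in NUMERIC_COLUMNS
        }
    return table


elements = load_element_properties(CSV_EXCERPT)
print(elements["He"])
```

Whether a missing value should later become `None`, `NaN`, or an imputed number depends on the model that consumes the features; the sketch only makes the gap explicit.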


### 3. Training, validation, testing data

We save the following information, from which we derive the features and labels for our supervised learning task:
* number of atoms (n_atoms)
* structure atom symbols (symbols_atoms)
* energy (energy)
* positions (pos_x0, pos_y0, pos_z0, ..., pos_x(n_atoms-1), pos_y(n_atoms-1), pos_z(n_atoms-1))
* forces (force_x0, force_y0, force_z0, ..., force_x(n_atoms-1), force_y(n_atoms-1), force_z(n_atoms-1))


The following will be used as features:
* positions (pos_x0, pos_y0, pos_z0, ..., pos_x(n_atoms-1), pos_y(n_atoms-1), pos_z(n_atoms-1))


The following will be used as labels:
* energy (energy)
* forces (force_x0, force_y0, force_z0, ..., force_x(n_atoms-1), force_y(n_atoms-1), force_z(n_atoms-1))


The following will be used to simplify data processing and to dynamically augment the features with per-atom information from the periodic table, for example mass and electronegativity:
* number of atoms (n_atoms)
* structure atom symbols (symbols_atoms)
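
To illustrate how a stored formula string could drive this kind of augmentation, here is a minimal sketch, not the implementation in `prep_open_catalyst`. It expands the sample formula from section 1 into one symbol per atom and looks up mass and electronegativity for each. The names `expand_formula`, `augment`, and `ELEMENT_PROPERTIES` are hypothetical, and the property values are typed in by hand for this example (standard atomic masses rounded to three decimals, Pauling electronegativities). The expanded list follows the order of the formula string, which need not match the atom order of the position columns.

```python
import re

# One element token: an uppercase letter, an optional lowercase letter,
# and an optional count (a missing count means 1).
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

# Hand-typed illustration values: (atomic mass, Pauling electronegativity).
ELEMENT_PROPERTIES = {
    "H": (1.008, 2.20),
    "C": (12.011, 2.55),
    "O": (15.999, 3.44),
    "Al": (26.982, 1.61),
    "Re": (186.207, 1.9),
}


def expand_formula(formula):
    """Expand a formula such as 'C2HAl48ORe48' into a per-atom symbol list."""
    symbols = []
    for symbol, count in TOKEN.findall(formula):
        symbols.extend([symbol] * (int(count) if count else 1))
    return symbols


def augment(symbols, properties):
    """Return one (mass, electronegativity) tuple per atom symbol."""
    return [properties[symbol] for symbol in symbols]


symbols = expand_formula("C2HAl48ORe48")
per_atom_features = augment(symbols, ELEMENT_PROPERTIES)

print(len(symbols))          # number of atoms recovered from the formula
print(per_atom_features[0])  # properties of the first expanded atom
```

Storing only `n_atoms` and the formula keeps the saved files small, since the element properties can be joined in at load time instead of being repeated for every atom.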

In [4]:
df_training, df_validation, df_testing = prep_open_catalyst.train_val_test_create(HYPER)

display(df_training)
display(df_validation)
display(df_testing)

Processing training data


100%|███████████████████████████████████████████| 10/10 [00:09<00:00,  1.06it/s]


Processing validation data


100%|███████████████████████████████████████████| 10/10 [00:11<00:00,  1.13s/it]


Processing testing_ood_both data


100%|███████████████████████████████████████████| 10/10 [00:13<00:00,  1.33s/it]


Processing testing_ood_cat data


100%|███████████████████████████████████████████| 10/10 [00:08<00:00,  1.12it/s]


Processing testing_ood_ads data


100%|███████████████████████████████████████████| 10/10 [00:11<00:00,  1.18s/it]


Unnamed: 0,n_atoms,symbols_atoms,volume,center_of_mass_x,center_of_mass_y,center_of_mass_z,energy,pos_x0,pos_y0,pos_z0,...,force_z195,force_x196,force_y196,force_z196,force_x197,force_y197,force_z197,force_x198,force_y198,force_z198
0,51,Sr24Ge24C2O,4404.730270,4.266923,6.775241,22.463928,-175.347678,1.069484,3.909930,21.348195,...,,,,,,,,,,
0,28,Pb12S12CH2O,3039.520650,5.400035,5.153482,14.311428,-110.242136,4.481414,3.403351,12.401905,...,,,,,,,,,,
0,35,K16Cu8Sb8N2H,3389.626780,3.578149,5.984496,16.897993,-92.043374,2.700358,4.095182,20.229901,...,,,,,,,,,,
0,151,Ca48H99C2O2,7119.644819,6.057214,6.874023,22.884935,-493.403238,6.326551,1.766319,16.709857,...,,,,,,,,,,
0,54,Y16Sn16Ir16C2H2O2,3157.180533,6.411038,5.625692,19.733429,-342.470650,8.357756,9.354020,19.899857,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,76,Zr24Co24Ge24NO3,4595.231692,5.912899,4.887310,20.333587,-502.215587,2.973570,8.781949,16.945027,...,,,,,,,,,,
0,104,Ca64Cu32C2H5O,10747.622166,6.111329,14.678291,12.692258,-241.715621,0.805660,12.326422,8.982795,...,,,,,,,,,,
0,74,Co24S48NO,4066.497407,5.201170,5.963069,17.798059,-362.215810,0.000000,7.877710,15.140294,...,,,,,,,,,,
0,79,Na18S36Ti18C2H4O,4535.342707,7.301243,6.774508,16.441868,-388.639942,2.165346,6.324243,17.999390,...,,,,,,,,,,


Unnamed: 0,n_atoms,symbols_atoms,volume,center_of_mass_x,center_of_mass_y,center_of_mass_z,energy,pos_x0,pos_y0,pos_z0,...,force_z196,force_x197,force_y197,force_z197,force_x198,force_y198,force_z198,force_x199,force_y199,force_z199
0,39,Ga32C2H3O2,2235.845003,3.484408,4.325116,14.559650,-115.904386,3.375167,0.573847,15.245494,...,,,,,,,,,,
0,109,Tl48Sn12S48N,10541.972330,9.697518,8.998227,19.340726,-353.768467,7.886924,11.310323,20.079947,...,,,,,,,,,,
0,51,Y24Au16N2C2H6O,4199.625506,3.821544,8.046813,17.284376,-256.370446,4.139141,4.139141,14.841638,...,,,,,,,,,,
0,59,In24S32CH2,4362.843435,2.801447,5.268271,23.865704,-214.366967,0.000000,2.832410,26.322730,...,,,,,,,,,,
0,101,Te64Pd32CH3O,7447.271908,7.162152,6.465087,18.491219,-370.313202,1.197092,5.719655,15.841278,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,70,Na48Pd16C2H2O2,4839.778824,3.890954,7.646286,18.228997,-165.296648,2.612180,5.065486,17.343642,...,,,,,,,,,,
0,99,Na16P80ONH,6293.907527,9.302958,7.335640,23.806773,-445.185759,13.029880,12.825736,23.637755,...,,,,,,,,,,
0,42,Sr24Sn12C2H2O2,4360.352880,4.302511,5.149350,26.194104,-124.002961,1.987743,5.491000,22.951366,...,,,,,,,,,,
0,36,Ag24Au8CH2O,2077.924053,2.965315,4.705322,14.921377,-91.484136,0.016091,3.820516,18.207741,...,,,,,,,,,,


Unnamed: 0,n_atoms,symbols_atoms,volume,center_of_mass_x,center_of_mass_y,center_of_mass_z,energy,pos_x0,pos_y0,pos_z0,...,force_z216,force_x217,force_y217,force_z217,force_x218,force_y218,force_z218,force_x219,force_y219,force_z219
0,64,Ge48Rh12NO2H,4416.205704,4.896886,4.847274,19.295173,-300.020225,5.470174,8.709283,17.494634,...,,,,,,,,,,
0,87,Cr30Na6Se48CHO,5562.581097,3.586963,9.757159,13.117377,-451.505217,7.381845,8.565098,11.153581,...,,,,,,,,,,
0,78,Sc12Ga48Co6N2C2H8,3414.645915,5.387901,5.010509,17.907954,-310.262632,8.879102,3.334421,14.080999,...,,,,,,,,,,
0,51,Zr16Si16Te16CHO,3854.835684,2.471578,5.219912,15.523587,-294.977631,1.309787,7.111955,17.475673,...,,,,,,,,,,
0,67,Zr16Al32Zn16CHO,2781.009449,2.898014,3.121429,16.948734,-282.977579,2.898855,0.402245,23.819178,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,43,Sn12Se24C2H3O2,4408.177863,6.975215,5.467960,15.125107,-163.830469,0.000000,0.000000,15.397800,...,,,,,,,,,,
0,39,Ta6Ni12Te18CHO,3550.478021,5.606426,4.457414,14.145278,-195.072747,0.936831,6.966022,11.835410,...,,,,,,,,,,
0,75,Ca24Cd48NH2,6054.829340,4.084789,8.551904,15.907742,-85.332262,2.555578,9.822935,14.529328,...,,,,,,,,,,
0,74,K8Pd32Si32CH,4741.634442,7.198025,6.363979,15.714758,-345.778327,5.719835,4.986884,18.842985,...,,,,,,,,,,
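
The empty trailing cells in the tables above suggest that every row is padded to a common width, so that structures with different numbers of atoms fit into one table. A minimal sketch of such a layout follows, assuming padding with NaN up to a chosen maximum number of atoms. `flatten_structure` and `max_atoms` are hypothetical names, not taken from `prep_open_catalyst`, and the two-atom structure is made-up illustration data.

```python
import math


def flatten_structure(energy, positions, forces, max_atoms):
    """Flatten one structure into a single wide row, padded with NaN."""
    n_atoms = len(positions)
    if len(forces) != n_atoms:
        raise ValueError("positions and forces must have the same length")
    if n_atoms > max_atoms:
        raise ValueError("structure has more atoms than max_atoms")

    row = {"n_atoms": n_atoms, "energy": energy}
    # All position columns come first, followed by all force columns.
    for prefix, vectors in (("pos", positions), ("force", forces)):
        for i in range(max_atoms):
            # Atom slots beyond n_atoms are filled with NaN.
            x, y, z = vectors[i] if i < n_atoms else (math.nan,) * 3
            row[f"{prefix}_x{i}"] = x
            row[f"{prefix}_y{i}"] = y
            row[f"{prefix}_z{i}"] = z
    return row


# Made-up two-atom structure, padded to three atom slots.
row = flatten_structure(
    energy=-1.5,
    positions=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    forces=[(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)],
    max_atoms=3,
)
print(list(row)[:5])
```

With this layout, `n_atoms` tells a consumer how many of the position and force columns in a row carry real values.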
