What are units and normalization factors in QM9 dataset? #202

Open
Nokimann opened this issue Oct 11, 2021 · 3 comments

@Nokimann

I used the following code:

from jarvis.db.figshare import data
d = data('qm9_std_jctc')

The first entry in the QM9 dataset obtained from JARVIS:

{'mu': -1.77790756800166,
 'alpha': -7.59467417670514,
 'HOMO': -6.71425764235072,
 'LUMO': 2.24686567442436,
 'gap': 5.35591684810335,
 'R2': -4.11464477806684,
 'ZPVE': -3.14893653207103,
 'U0': 5.70989371834825,
 'U': 5.69336539320842,
 'H': 5.68508295617329,
 'G': 5.75764468354196,
 'Cv': -6.18353212813309,
 'omega1': -1.3203823354756,
 'SMILES': 'C',
 'SMILES_relaxed': 'C',
 'id': '000001',
 'atoms': {'lattice_mat': [[60, 0, 0], [0, 60, 0], [0, 0, 60]],
  'coords': [[0.4999998496686667, 0.5000001250963333, 0.4999999923633333],
   [0.5002473255336667, 0.481802867173, 0.4998995777733333],
   [0.5170736659886667, 0.5062992418296667, 0.4998712520133333],
   [0.49119790078366665, 0.5060288326963334, 0.48525591384666666],
   [0.4914812580253333, 0.5058689332046666, 0.5149732640033333]],
  'elements': ['C', 'H', 'H', 'H', 'H'],
  'abc': [60.0, 60.0, 60.0],
  'angles': [90.0, 90.0, 90.0],
  'cartesian': False,
  'props': ['', '', '', '', '']}}
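A minimal sketch of how the geometry in this entry can be read, assuming the lattice vectors are in Angstrom and that `'cartesian': False` means the coordinates are fractional. Only the first two atoms (C and one H) are copied from the entry above; the helper name `frac_to_cart` is illustrative and not part of JARVIS.

```python
import math

# Values copied from the JARVIS entry above (methane, id 000001).
lattice = [[60, 0, 0], [0, 60, 0], [0, 0, 60]]
frac_coords = [
    [0.4999998496686667, 0.5000001250963333, 0.4999999923633333],  # C
    [0.5002473255336667, 0.481802867173, 0.4998995777733333],      # H
]


def frac_to_cart(frac, lattice):
    """Convert fractional coordinates to Cartesian: r = f_a*a + f_b*b + f_c*c."""
    return [sum(frac[i] * lattice[i][k] for i in range(3)) for k in range(3)]


c_xyz = frac_to_cart(frac_coords[0], lattice)
h_xyz = frac_to_cart(frac_coords[1], lattice)

# Distance between the carbon and the first hydrogen (Angstrom under the
# assumption above).
ch_distance = math.dist(c_xyz, h_xyz)
print(round(ch_distance, 3))  # about 1.092
```

Under these assumptions the molecule sits at the center of a 60 x 60 x 60 box, and the C-H distance agrees with the one obtained from the first two atoms of the original xyz entry, so the geometry appears to be shifted but not rescaled.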

And the original first entry in the QM9 dataset, with its description:

5
gdb 1	157.7118	157.70997	157.70699	0.	13.21	-0.3877	0.1171	0.5048	35.3641	0.044749	-40.47893	-40.476062	-40.475117	-40.498597	6.469	
C	-0.0126981359	 1.0858041578	 0.0080009958	-0.535689
H	 0.002150416	-0.0060313176	 0.0019761204	 0.133921
H	 1.0117308433	 1.4637511618	 0.0002765748	 0.133922
H	-0.540815069	 1.4475266138	-0.8766437152	 0.133923
H	-0.5238136345	 1.4379326443	 0.9063972942	 0.133923
1341.307	1341.3284	1341.365	1562.6731	1562.7453	3038.3205	3151.6034	3151.6788	3151.7078
C	C	
InChI=1S/CH4/h1H4	InChI=1S/CH4/h1H4
Line       Content
----       -------
1          Number of atoms na
2          Properties 1-17 (see below)
3,...,na+2 Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom
na+3       Frequencies (3na-5 or 3na-6)
na+4       SMILES from GDB9 and for relaxed geometry
na+5       InChI for GDB9 and for relaxed geometry

The properties stored in the second line of each file:

I.  Property  Unit         Description
--  --------  -----------  --------------
 1  tag       -            "gdb9"; string constant to ease extraction via grep
 2  index     -            Consecutive, 1-based integer identifier of molecule
 3  A         GHz          Rotational constant A
 4  B         GHz          Rotational constant B
 5  C         GHz          Rotational constant C
 6  mu        Debye        Dipole moment
 7  alpha     Bohr^3       Isotropic polarizability
 8  homo      Hartree      Energy of Highest occupied molecular orbital (HOMO)
 9  lumo      Hartree      Energy of Lowest unoccupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K

I. = Property index (properties are given in this order)
For the 6095 isomers, properties 12-16 were calculated at the G4MP2 level of theory.
All other calculations were done at the DFT/B3LYP/6-31G(2df,p) level of theory.

It looks like the units have been converted and the values standardized.
For example, homo, lumo, ... appear to have been converted from Hartree to eV
and then standardized with the mean and standard deviation over the entire dataset.

How can I get the unit and the mean/std factors for each property?
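The transformation suspected above can be sketched as follows. This is only an illustration of the assumed pipeline (Hartree to eV, then z-score standardization). The mean and standard deviation below are placeholders, NOT the real QM9 statistics, and the function names are illustrative.

```python
HARTREE_TO_EV = 27.211386245988  # CODATA 2018 Hartree energy in eV


def hartree_to_ev(value_hartree):
    """Convert an energy from Hartree to eV."""
    return value_hartree * HARTREE_TO_EV


def standardize(value, mean, std):
    """Z-score standardization: z = (x - mean) / std."""
    return (value - mean) / std


# HOMO of methane from the original QM9 entry above, in Hartree.
homo_ev = hartree_to_ev(-0.3877)
print(round(homo_ev, 2))  # about -10.55 (eV)

# Placeholder statistics for illustration only (not the real dataset values).
PLACEHOLDER_MEAN_EV = -7.0
PLACEHOLDER_STD_EV = 0.5
homo_z = standardize(homo_ev, PLACEHOLDER_MEAN_EV, PLACEHOLDER_STD_EV)
print(homo_z)
```

With the true per-property mean and std, `homo_z` would be expected to reproduce the `'HOMO'` value in the JARVIS entry if this is indeed the pipeline that was used.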

@knc6
Collaborator

knc6 commented Oct 11, 2021

Hi,

The QM9 dataset is adapted from the GDrive link from Faber et al. They provide the mean/std in the qm9-prop-stats-v1 file and the normalized dataset in the qm9-mol-info-standardized-v1 file.
The units can be found in Faber et al. (Table 3 and 4), or Choudhary et al. (Table 5).
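Once the per-property mean/std are known, undoing the standardization is a one-line formula per property. The sketch below assumes the statistics have already been read into a dictionary. The layout of the qm9-prop-stats-v1 file is not shown in this thread, so `example_stats` is a hypothetical stand-in with placeholder numbers, and `destandardize_record` is an illustrative helper, not a JARVIS function.

```python
def destandardize_record(record, stats):
    """Return {property: x} with x = z * std + mean for every property in stats."""
    restored = {}
    for name, (mean, std) in stats.items():
        if name in record:
            restored[name] = record[name] * std + mean
    return restored


# Two standardized values copied from the JARVIS entry in the question.
record = {"HOMO": -6.71425764235072, "LUMO": 2.24686567442436, "SMILES": "C"}

# Hypothetical statistics as (mean, std); replace with the real values
# from the stats file before drawing any conclusions.
example_stats = {"HOMO": (-7.0, 0.5), "LUMO": (0.0, 1.0)}

restored = destandardize_record(record, example_stats)
print(restored)
```

Non-numeric fields such as `SMILES` are left out because they have no entry in the statistics dictionary.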

@Nokimann
Author

Nokimann commented Oct 12, 2021

Thank you @knc6
So we can't currently load the mean/std directly from JARVIS?

@gasteigerjo

gasteigerjo commented Feb 21, 2022

I don't think it's a good idea to provide only standardized data, as it invites the same evaluation error as in ALIGNN. I've observed this confusion between scaled and original data (and internal energy vs. atomization energy) on QM9 in multiple previous papers as well.

It would be great if you would instead provide the data in real units, as done e.g. by PyG: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.QM9
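To illustrate why scaled and original data must not be mixed when reporting errors: for z-score standardized targets, the mean absolute error in real units equals the standardized MAE multiplied by the property's std. The numbers below are toy values chosen for illustration, not QM9 results.

```python
def mean_absolute_error(pred, target):
    """Plain MAE over two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Toy statistics and toy standardized predictions (illustration only).
mean, std = -7.0, 0.5
target_z = [0.0, 1.0, -1.0]
pred_z = [0.1, 0.8, -1.3]

# MAE measured on the standardized (unitless) values.
mae_z = mean_absolute_error(pred_z, target_z)

# Map both back to real units with x = z * std + mean, then measure again.
target_real = [z * std + mean for z in target_z]
pred_real = [z * std + mean for z in pred_z]
mae_real = mean_absolute_error(pred_real, target_real)

print(mae_z, mae_real)  # mae_real is std * mae_z (up to rounding)
```

A standardized MAE reported as if it were in eV would therefore be off by a factor of 1/std, which is the kind of mismatch described above.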
