# Access to data

NablaDFT includes three types of databases:

1. **Energy database.** Contains molecular structures, energies, and forces. The data are accessed through the Atomic Simulation Environment (ASE) interface.
2. **Hamiltonian database.** Contains molecular structures, energies, forces, Hamiltonians, and overlap matrices. The data are accessed through the custom nabla2DFT access interface.
3. **Raw psi4 wave function.** Contains serialized Psi4 wavefunctions. The data are accessed through the Psi4 or NumPy interfaces.

Each database has its own physical units for atomic data, its own record order, and its own ordering of atomic orbitals in Hamiltonians. This tutorial shows how to load and visualize some of the data. Advanced processing of metadata and Hamiltonians is described in the following lessons.

The smallest split (`train-tiny`, 51 MB) of the energy database can be downloaded with `wget`:

In [3]:
!wget https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/nablaDFTv2/energy_databases/train_2k_v2_formation_energy_w_forces.db

--2024-06-11 14:35:34--  https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/nablaDFTv2/energy_databases/train_2k_v2_formation_energy_w_forces.db
Resolving a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru (a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru)... 46.243.206.34, 46.243.206.35
Connecting to a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru (a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru)|46.243.206.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53288960 (51M) [binary/octet-stream]
Saving to: ‘train_2k_v2_formation_energy_w_forces.db’


2024-06-11 14:35:59 (2,09 MB/s) - ‘train_2k_v2_formation_energy_w_forces.db’ saved [53288960/53288960]



The Atomic Simulation Environment (ASE) package is required for processing energy databases. It also helps to visualize molecules.

In [None]:
import ase
from ase.db import connect
from ase.units import Bohr
from ase.visualize import view

In [7]:
with connect("train_2k_v2_formation_energy_w_forces.db") as train_db:
    atom_row = train_db.get(1)
    row = atom_row.toatoms()

Energy databases store atom positions in Angstrom.

In [8]:
row.numbers, row.positions

(array([6, 6, 8, 6, 8, 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 8, 6, 8, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([[ 5.649981,  0.292707, -0.359445],
        [ 4.488466,  1.21383 , -0.624704],
        [ 3.379362,  0.938695,  0.203172],
        [ 2.766558, -0.290821,  0.174661],
        [ 3.231289, -1.149445, -0.61145 ],
        [ 1.597579, -0.6515  ,  1.015378],
        [ 0.412593,  0.249381,  0.74796 ],
        [-0.750674, -0.561331,  1.088722],
        [-2.024895, -0.121393,  1.564805],
        [-2.992659,  0.158262,  0.484208],
        [-4.251773,  0.59135 ,  0.861804],
        [-5.209216,  0.871821, -0.108371],
        [-4.914042,  0.721817, -1.44446 ],
        [-3.657549,  0.289714, -1.819159],
        [-2.695867,  0.007762, -0.850864],
        [-0.394728, -1.919742,  0.860161],
        [-1.157842, -2.929562,  0.864303],
        [ 1.060777, -1.975332,  0.606771],
        [ 1.654344, -2.964555,  0.142461],
        [ 5.445909, -0.429589,  0.454969],
        [ 6.584967, 

You can check the data by visualizing it. ASE takes the physical units from the database metadata.

In [9]:
view(row, viewer='x3d')
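
A quick numeric check can complement the visual one. The sketch below copies the positions of the first two carbon atoms from the `row.positions` output above and computes their distance. A value near 1.5, a typical C-C single bond length in Angstrom, is consistent with positions stored in Angstrom. The variable names are illustrative and only the standard library is used.

```python
import math

# First two atoms (both carbon, Z = 6), copied from the `row.positions` output above.
c0 = (5.649981, 0.292707, -0.359445)
c1 = (4.488466, 1.21383, -0.624704)

# Euclidean distance in the units the database stores (expected: Angstrom).
distance = math.dist(c0, c1)
print(f"C0-C1 distance: {distance:.3f}")
```

A result far from the 1.2 to 1.6 range for bonded carbon atoms would hint at a unit mismatch.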

# Hamiltonian database

The smallest split of the Hamiltonian database (`train-tiny`) requires downloading 14 GB.

In [18]:
!wget https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/nablaDFTv2/hamiltonian_databases/train_2k.db

--2024-06-11 14:54:38--  https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/nablaDFTv2/hamiltonian_databases/train_2k.db
Resolving a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru (a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru)... 46.243.206.35, 46.243.206.34
Connecting to a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru (a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru)|46.243.206.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15118360576 (14G) [binary/octet-stream]
Saving to: ‘train_2k.db.1’

train_2k.db.1         0%[                    ]  38,41M  6,91MB/s    eta 77m 2s ^C
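
The transfer above was interrupted (`^C`), and `wget` wrote to `train_2k.db.1` because a file named `train_2k.db` already existed. Before opening a database, it can help to compare the file size on disk with the `Length` reported by the server. The following is a minimal sketch: `download_is_complete` is a hypothetical helper, not part of nablaDFT, and the demo uses a small temporary file in place of the real 14 GB database.

```python
import os
import tempfile


def download_is_complete(path, expected_bytes):
    """Return True if `path` exists and has exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Demo with a stand-in file (hypothetical name). For the real database, pass
# "train_2k.db" and the Length from the wget log (15118360576).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    with open(demo_path, "wb") as fh:
        fh.write(b"\0" * 1024)
    complete = download_is_complete(demo_path, 1024)
    truncated = download_is_complete(demo_path, 2048)
    missing = download_is_complete(os.path.join(tmp, "absent.db"), 1024)

print(complete, truncated, missing)
```

An interrupted transfer can usually be resumed with `wget -c` followed by the same URL instead of starting over.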


We provide a custom class to access the Hamiltonian databases.

In [30]:
from nablaDFT.dataset import HamiltonianDatabase

In [31]:
help(HamiltonianDatabase)

Help on class HamiltonianDatabase in module nablaDFT.dataset.hamiltonian_dataset:

class HamiltonianDatabase(builtins.object)
 |  HamiltonianDatabase(filename, flags=1)
 |  
 |  This is a class to store large amounts of ab initio reference data
 |  for training a neural network in a SQLite database
 |  
 |  Data structure:
 |  Z (N)    (int)        nuclear charges
 |  R (N, 3) (float)      Cartesian coordinates in bohr
 |  E ()     (float)      energy in Eh
 |  F (N, 3) (float)      forces in Eh/bohr
 |  H (Norb, Norb)        full hamiltonian in atomic units
 |  S (Norb, Norb)        overlap matrix in atomic units
 |  C (Norb, Norb)        core hamiltonian in atomic units
 |  moses_id () (int)     molecule id in MOSES dataset
 |  conformer_id () (int) conformation id
 |  
 |  Args:
 |      filename (str): path to database.
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, idx)
 |  
 |  __init__(self, filename, flags=1)
 |      Initialize self.  See help(type(self)) for accurate

In [32]:
train = HamiltonianDatabase("train_2k.db")
# atomic numbers, atom positions, energy, forces, full Hamiltonian, overlap matrix, core Hamiltonian, MOSES id, conformation id
Z, R, E, F, H, S, C, moses_id, conformation_id = train[0]  
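
According to the class help above, `E` is stored in Hartree (Eh) and `F` in Eh/bohr, whereas ASE works in eV and Angstrom. The sketch below shows the conversion with explicitly written, rounded constants (`ase.units.Hartree` and `ase.units.Bohr` provide the same factors at full precision). The helper names are hypothetical and the sample numbers are placeholders, not values from the database.

```python
HARTREE_TO_EV = 27.211386    # 1 Eh in eV (rounded)
BOHR_TO_ANGSTROM = 0.529177  # 1 bohr in Angstrom (rounded)


def energy_to_ev(energy_hartree):
    """Convert an energy from Hartree to eV."""
    return energy_hartree * HARTREE_TO_EV


def forces_to_ev_per_angstrom(forces):
    """Convert an (N, 3) nested list of forces from Eh/bohr to eV/Angstrom."""
    factor = HARTREE_TO_EV / BOHR_TO_ANGSTROM
    return [[component * factor for component in atom_force] for atom_force in forces]


# Placeholder inputs, for illustration only.
example_energy = -1.0
example_forces = [[0.01, 0.0, -0.02]]

energy_ev = energy_to_ev(example_energy)
forces_ev_ang = forces_to_ev_per_angstrom(example_forces)
print(energy_ev, forces_ev_ang)
```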

In [33]:
Z, R

(array([6, 6, 7, 6, 6, 7, 6, 8, 6, 7, 6, 6, 6, 8, 7, 6, 6, 6, 6, 6, 6, 7,
        7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32),
 array([[12.381227  , -1.6390955 ,  0.99412256],
        [10.60808   , -1.4759347 , -1.264499  ],
        [ 8.206612  , -0.30607894, -0.51739   ],
        [ 6.1327586 , -1.4889795 ,  0.35170072],
        [ 4.2733946 ,  0.2647034 ,  0.83938235],
        [ 1.7510334 , -0.13743979,  1.7892759 ],
        [ 0.1485835 ,  1.9786887 ,  2.1191106 ],
        [ 0.9965735 ,  4.1259356 ,  1.5764493 ],
        [-2.4638078 ,  1.7212306 ,  3.0714474 ],
        [-4.2715445 ,  1.13479   ,  1.1013815 ],
        [-5.605585  ,  2.681132  , -0.38528493],
        [-5.587442  ,  5.418015  , -0.36191657],
        [-7.1772933 ,  6.826564  , -2.175137  ],
        [-4.2307696 ,  6.5596795 ,  1.1861527 ],
        [-7.0589304 ,  1.2952561 , -1.9698147 ],
        [-6.67727   , -1.1683024 , -1.519217  ],
        [-7.718418  , -3.2837431 , -2.631555  ],
        [-6.963811 
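
The `Z` array alone identifies the composition of the molecule. The sketch below copies the nuclear charges printed above and derives the element counts and the electron count of the neutral molecule. The `SYMBOLS` lookup is a small hand-written table covering only the elements present here (`ase.data.chemical_symbols` offers a complete one).

```python
from collections import Counter

# Nuclear charges copied from the `Z` output above: 23 heavy atoms, then 16 hydrogens.
Z = [6, 6, 7, 6, 6, 7, 6, 8, 6, 7, 6, 6, 6, 8, 7, 6, 6, 6, 6, 6, 6, 7, 7] + [1] * 16

# Hand-written lookup for the elements that occur in this molecule only.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}

counts = Counter(SYMBOLS[z] for z in Z)
formula = "".join(f"{el}{counts[el]}" for el in ("C", "H", "N", "O") if counts[el])
n_electrons = sum(Z)  # valid for a neutral molecule

print(len(Z), formula, n_electrons)
```

For this record the sketch reports 39 atoms, the formula C15H16N6O2, and 164 electrons.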

Here ASE cannot take the physical units from metadata, so the positions must be converted from Bohr to Angstrom explicitly. You can check the correctness of the units using the ASE atomic visualization. If the units are correct, each atom is drawn as a sphere, and the spheres touch each other without intersecting or staying apart.

In [15]:
atom = ase.Atoms(Z, R*Bohr)
view(atom, viewer='x3d')
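
The effect of the conversion can also be seen numerically. The sketch below takes the first two carbon positions from the `R` output above and compares their distance before and after scaling. The constant is a rounded stand-in for `ase.units.Bohr`, and the variable names are illustrative.

```python
import math

BOHR_TO_ANGSTROM = 0.529177  # rounded; ase.units.Bohr holds the full-precision value

# First two atoms (both carbon), copied from the `R` output above, in bohr.
r0_bohr = (12.381227, -1.6390955, 0.99412256)
r1_bohr = (10.60808, -1.4759347, -1.264499)

distance_bohr = math.dist(r0_bohr, r1_bohr)
distance_angstrom = distance_bohr * BOHR_TO_ANGSTROM

print(f"{distance_bohr:.3f} bohr = {distance_angstrom:.3f} Angstrom")
```

Read as Angstrom without conversion, the raw value would be almost twice a typical C-C bond length, which is what makes the spheres drift apart in the viewer.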

# Raw psi4 wave function

You can load Psi4 wavefunctions either into Psi4 or into NumPy. The NumPy route is simpler because it does not require installing Psi4.

In [None]:
!wget https://a002dlils-kadurin-nabladft.obs.ru-moscow-1.hc.sbercloud.ru/data/moses_wfns_big/wfns_moses_conformers_archive_0.tar
!tar -xf wfns_moses_conformers_archive_0.tar
# `!cd` only affects its own subshell, so the paths below stay relative to the notebook directory
!ls mnt/sdd/data/moses_wfns_big/

In [34]:
import numpy as np

In [35]:
data = np.load('mnt/sdd/data/moses_wfns_big/wfn_conf_50000_0.npy', allow_pickle=True).tolist()

In [36]:
Z = data['molecule']['elez']
R = data['molecule']['geom'].reshape((-1,3))
Z, R

(array([ 6,  6,  6,  6,  6,  6,  6,  8,  6,  6,  8,  7,  6,  7,  6,  6,  6,
         6,  8,  7,  6,  6, 16,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1]),
 array([[ 3.816079, -1.722304, -2.555971],
        [ 4.944881, -1.034021, -1.837613],
        [ 6.265923, -1.24688 , -2.199719],
        [ 7.294317, -0.601195, -1.523246],
        [ 6.995003,  0.245753, -0.497806],
        [ 5.66433 ,  0.461569, -0.131403],
        [ 4.634962, -0.178541, -0.801212],
        [ 3.300079,  0.008511, -0.468425],
        [ 3.005922,  0.882068,  0.587993],
        [ 1.516336,  0.926033,  0.764885],
        [ 0.987537,  1.645412,  1.658853],
        [ 0.646883,  0.172134, -0.052548],
        [-0.785166,  0.263066,  0.175039],
        [-1.75456 , -0.332926, -0.447484],
        [-3.001104, -0.142552, -0.118493],
        [-4.08153 , -0.818908, -0.956447],
        [-3.28879 ,  0.699046,  0.903997],
        [-4.435918,  1.131839,  1.635794],
        [-4.1453  ,  2.142819,  

In [37]:
atom = ase.Atoms(Z, R)
view(atom, viewer='x3d')

In [38]:
wfn = np.load('mnt/sdd/data/moses_wfns_big/wfn_conf_50000_0.npy', allow_pickle=True).tolist()
orbital_matrix_a = wfn["matrix"]["Ca"]        # alpha orbital coefficients
orbital_matrix_b = wfn["matrix"]["Cb"]        # beta orbital coefficients
density_matrix_a = wfn["matrix"]["Da"]        # alpha electronic density
density_matrix_b = wfn["matrix"]["Db"]        # beta electronic density
aotoso_matrix = wfn["matrix"]["aotoso"]       # atomic orbital to symmetry orbital transformation matrix
core_hamiltonian_matrix = wfn["matrix"]["H"]  # core Hamiltonian matrix
fock_matrix_a = wfn["matrix"]["Fa"]           # DFT alpha Fock matrix
fock_matrix_b = wfn["matrix"]["Fb"]           # DFT beta Fock matrix
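
For orientation, the alpha density matrix relates to the occupied alpha orbital coefficients as D = C_occ C_occ^T, and Tr(D S) with the overlap matrix S gives the number of alpha electrons. The sketch below illustrates this on a hypothetical two-orbital toy system with plain Python lists. It does not read the wavefunction file, and all numbers are made up for illustration; with the real arrays, NumPy's `@` operator would replace the `matmul` helper.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


overlap = 0.6  # hypothetical overlap between the two basis functions
S = [[1.0, overlap], [overlap, 1.0]]

# One occupied alpha orbital: the normalized bonding combination.
norm = 1.0 / math.sqrt(2.0 * (1.0 + overlap))
C_occ = [[norm], [norm]]   # shape (Norb, Nocc) = (2, 1)
C_occ_T = [[norm, norm]]   # transpose, shape (1, 2)

D_alpha = matmul(C_occ, C_occ_T)                 # D = C_occ C_occ^T
DS = matmul(D_alpha, S)
n_alpha = sum(DS[i][i] for i in range(len(DS)))  # Tr(D S)

print(n_alpha)
```

The trace evaluates to one alpha electron for this toy system, up to floating-point rounding.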

Advanced processing of wavefunctions is also available through Psi4. Note that Psi4 must be compiled from source or installed with conda. More information about obtaining Psi4 is available at https://psicode.org/psi4manual/master/build_obtaining.html

In [None]:
import psi4

In [None]:
wfn = psi4.core.Wavefunction.from_file('mnt/sdd/data/moses_wfns_big/wfn_conf_50000_0.npy')
psi4.oeprop(wfn, "MAYER_INDICES")
psi4.oeprop(wfn, "WIBERG_LOWDIN_INDICES")
psi4.oeprop(wfn, "MULLIKEN_CHARGES")
psi4.oeprop(wfn, "LOWDIN_CHARGES")
mayer_bos = wfn.array_variables()["MAYER INDICES"]  # Mayer bond indices
wiberg_lowdin_bos = wfn.array_variables()["WIBERG LOWDIN INDICES"]  # Wiberg bond indices (Löwdin basis)
mulliken_charges = wfn.array_variables()["MULLIKEN CHARGES"]  # Mulliken atomic charges
lowdin_charges = wfn.array_variables()["LOWDIN CHARGES"]  # Löwdin atomic charges
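
Bond-index matrices such as the Mayer indices are symmetric N×N arrays in which entries near 1, 2, or 3 suggest single, double, or triple bonds. The sketch below lists the atom pairs whose index exceeds a threshold. The helper name `bonded_pairs`, the matrix values, and the 0.5 cutoff are all made up for illustration. With Psi4 installed, a matrix obtained above could first be converted to a NumPy array (for example through its `.np` attribute).

```python
def bonded_pairs(bond_orders, threshold=0.5):
    """Return (i, j, bond_order) for every pair i < j above `threshold`."""
    n = len(bond_orders)
    return [
        (i, j, bond_orders[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if bond_orders[i][j] > threshold
    ]


# Illustrative 3-atom matrix, not computed by Psi4.
toy_bond_orders = [
    [0.00, 0.95, 0.02],
    [0.95, 0.00, 1.90],
    [0.02, 1.90, 0.00],
]

pairs = bonded_pairs(toy_bond_orders)
print(pairs)
```

Only the upper triangle is scanned, so each bond is reported once.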