# Rupp2015

# Data preprocessing

## How to browse the local filesystem in Python

In [None]:
# print working directory
!pwd

/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/johannes


In [2]:
# use os.path or pathlib.Path to look around local filesystem for input data file
from pathlib import Path

In [3]:
# current working directory
Path.cwd()

PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/johannes')

In [4]:
working_dir = Path.cwd()

In [65]:
# list all entries (files and directories) in the parent dir
list(working_dir.parent.iterdir())

[PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/deringer2021-tutorial-material'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/ilinca'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/lixia'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/po-yen'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/rupp2015-tutorial-material'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/johannes'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/.ipynb_checkpoints')]

In [66]:
# list all entries (files and directories) below the parent dir, recursively
list(working_dir.parent.glob('**/*'))

[PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/deringer2021-tutorial-material'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/ilinca'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/lixia'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/po-yen'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/rupp2015-tutorial-material'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/johannes'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/.ipynb_checkpoints'),
 PosixPath('/Users/wasmer/JupyterHub/sisclab2022-project6-git/work-package-1/meeting02-ml-tutorial/deringer2021-tutorial-material/README.md'),
 PosixPath('/Users/wasmer/JupyterHub/

In [28]:
# okay, found data file
datafile_relpath = "../rupp2015-tutorial-material/dsgdb7ae2.xyz"
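Scanning a long recursive listing by eye gets tedious, so here is a minimal sketch that filters the search by file suffix. The helper name `find_files` is made up for this note, and the sketch assumes the data file can be recognized by its `.xyz` extension.

```python
from pathlib import Path
from typing import List


def find_files(root: Path, suffix: str = ".xyz") -> List[Path]:
    """Return all regular files below `root` whose name ends in `suffix`, sorted."""
    # Path.rglob(pattern) is the recursive variant of Path.glob, i.e. glob('**/' + pattern)
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())
```

Calling `find_files(Path.cwd().parent)` would search the same parent directory as the cells above. Whether it finds only the one data file depends on what else is in the repository.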

## How to read in a `.xyz` file

References:

- [Wiki > XYZ file format](https://en.wikipedia.org/wiki/XYZ_file_format)
- [ase docs > ase.io](https://wiki.fysik.dtu.dk/ase/ase/io/io.html) > `ase.io.read()` can convert `xyz` files into `ase.Atoms` objects (molecules, crystals)
- [ase docs > ase.io > formats > extxyz](https://wiki.fysik.dtu.dk/ase/ase/io/formatoptions.html#extxyz). Extended XYZ format.

In [22]:
import ase.io

In [42]:
# read in the first two molecules from the xyz file
data_test = ase.io.read(filename=datafile_relpath, index=':2')

In [43]:
data_test[0].arrays

{'numbers': array([6, 1, 1, 1, 1]),
 'positions': array([[ 1.04168, -0.0562 , -0.07148],
        [ 2.15109, -0.0562 , -0.0715 ],
        [ 0.67187,  0.17923, -1.09059],
        [ 0.67188,  0.70866,  0.64196],
        [ 0.67188, -1.05649,  0.23421]])}

In [67]:
# compare with xyz file content, by reading as textfile and printing
with open(datafile_relpath) as f:
    datafile_lines = f.readlines()

In [41]:
datafile_lines[0:20]

['5\n',
 '0001 -417.031\n',
 'C      1.04168000 -0.05620000 -0.07148000    1.04168200 -0.05620000 -0.07148100\n',
 'H      2.15109000 -0.05620000 -0.07150000    2.13089400 -0.05620200 -0.07149600\n',
 'H      0.67187000  0.17923000 -1.09059000    0.67859800  0.17494100 -1.07204400\n',
 'H      0.67188000  0.70866000  0.64196000    0.67861300  0.69474600  0.62898000\n',
 'H      0.67188000 -1.05649000  0.23421000    0.67861400 -1.03828500  0.22864100\n',
 '8\n',
 '0002 -711.117\n',
 'C      0.99571000  0.01149000 -0.09922000    0.99591400  0.01151100 -0.09922100\n',
 'C      2.51489000  0.01148000 -0.09922000    2.51468600  0.01146600 -0.09922600\n',
 'H      0.61911000  0.74910000 -0.83887000    0.59725900  0.72987700 -0.81959600\n',
 'H      0.61911000  0.28325000  0.90938000    0.59725900  0.27617000  0.88310600\n',
 'H      0.61909000 -0.99785000 -0.36818000    0.59727800 -0.97153100 -0.36116700\n',
 'H      2.89151000  1.02083000  0.16973000    2.91332200  0.99450900  0.16271900\n'

We see that only the force field coordinates (first three numeric columns) were read, not the DFT coordinates (last three columns). Technically the file is in an `extxyz`-like format, but its comment line does not fit that schema (compare the reference above). So for now we ignore the DFT coordinates and work with the force field ones. If the tutorial later requires the DFT coordinates, we will either have to figure out how to make the `ase.io.extxyz` reader read this file correctly, or write our own file parser. That should not be too hard.
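A hand-written parser could look like the sketch below. It assumes every block has exactly the layout printed above: one line with the atom count, one comment line `ID ENERGY`, then one line per atom with the element symbol followed by six numbers (force field x, y, z, then DFT x, y, z). The function name and the dictionary keys are made up for this note, and the sketch has not been checked against the whole file.

```python
from typing import Dict, List


def parse_two_geometry_xyz(lines: List[str]) -> List[Dict]:
    """Parse xyz-like text lines that carry two coordinate sets per atom."""
    molecules = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # skip blank lines between blocks
            i += 1
            continue
        n_atoms = int(lines[i])
        mol_id, energy = lines[i + 1].split()
        atom_lines = lines[i + 2 : i + 2 + n_atoms]
        if len(atom_lines) != n_atoms:
            raise ValueError(f"block starting at line {i} is truncated")
        symbols, ff, dft = [], [], []
        for line in atom_lines:
            fields = line.split()
            coords = [float(x) for x in fields[1:7]]
            symbols.append(fields[0])
            ff.append(coords[:3])   # first three columns: force field
            dft.append(coords[3:])  # last three columns: DFT
        molecules.append({
            "id": mol_id,
            "energy": float(energy),
            "symbols": symbols,
            "positions_ff": ff,
            "positions_dft": dft,
        })
        i += 2 + n_atoms
    return molecules


# first block of the data file, copied from the output above
sample_lines = [
    "5\n",
    "0001 -417.031\n",
    "C      1.04168000 -0.05620000 -0.07148000    1.04168200 -0.05620000 -0.07148100\n",
    "H      2.15109000 -0.05620000 -0.07150000    2.13089400 -0.05620200 -0.07149600\n",
    "H      0.67187000  0.17923000 -1.09059000    0.67859800  0.17494100 -1.07204400\n",
    "H      0.67188000  0.70866000  0.64196000    0.67861300  0.69474600  0.62898000\n",
    "H      0.67188000 -1.05649000  0.23421000    0.67861400 -1.03828500  0.22864100\n",
]
molecules = parse_two_geometry_xyz(sample_lines)
print(molecules[0]["symbols"], molecules[0]["energy"])
```

The same function could be applied to `datafile_lines` from the cell above. Plain lists keep the sketch dependency-free; converting them to `numpy` arrays or `ase.Atoms` objects would be a separate step.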

In [46]:
# ase parses the comment line (molecule ID and total energy) into the .info dict:
# each token becomes a string key with the value True
atoms = data_test[0]

In [47]:
atoms.info

{'0001': True, '-417.031': True}

In [59]:
# okay, read in all ~7000 molecules
data_force_field = ase.io.read(filename=datafile_relpath, index=':')

In [60]:
len(data_force_field)

7102

In [63]:
# extract the total energies (second key of each .info dict) and convert them to float
data_total_energies = [float(list(molecule.info)[1]) for molecule in data_force_field]

In [64]:
data_total_energies[0]

-417.031
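Indexing the keys by position works, but it is easy to misread later. The sketch below wraps the same idea in a small helper that also checks its assumption. It operates on a plain dict shaped like the `.info` output above. The helper name is made up for this note, and it assumes the two keys appear in the same order as the tokens on the comment line, as the output above suggests.

```python
from typing import Dict, Tuple


def id_and_energy_from_info(info: Dict[str, bool]) -> Tuple[str, float]:
    """Recover (molecule ID, total energy) from the comment-line tokens stored as dict keys."""
    tokens = list(info)  # dicts preserve insertion order
    if len(tokens) != 2:
        # e.g. if ID and energy were the same string, the two keys would collapse into one
        raise ValueError(f"expected two tokens, got {tokens!r}")
    mol_id, energy = tokens
    return mol_id, float(energy)


# same content as the atoms.info output shown above
example_info = {"0001": True, "-417.031": True}
print(id_and_energy_from_info(example_info))  # prints ('0001', -417.031)
```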

In [69]:
import ase.visualize

In [75]:
# visualize the CH4 molecule.
#
# the 'ngl' viewer does not yet work with ipywidgets>=8.0, which is installed here.
# reference https://github.com/nglviewer/nglview/issues/1032
ase.visualize.view(atoms, viewer='x3d')