Dr Oliviero Andreussi, olivieroandreuss@boisestate.edu
Notebook designed with the assistance of Christopher Orizaba

Boise State University, Department of Chemistry and Biochemistry

# Molecular Modeling Tools for the Computational Thermochemistry Lab {-}

Most chemistry applications of quantum mechanics (a.k.a. quantum chemistry) rely on a powerful commercial software package called Gaussian. This code was first developed by John Pople, a forefather of quantum chemistry and Nobel prize winner. However, Gaussian is a Fortran 77 code that requires an expensive license to run. For our applications we can achieve the same results using Python-based codes, at the expense of some computing time. In the following we will be using [PySCF](https://pyscf.org/index.html) for our quantum chemistry calculations, so we will need to install it on our Colab instance.

In [None]:
# @title PySCF Setup { display-mode: "form" }
# Import the main components of PySCF used in this worksheet
!pip install pyscf
!pip install pyberny
!pip install pyscf\[geomopt\]
from pyscf import gto, scf, dft, mp, cc
from pyscf.geomopt.berny_solver import optimize
#
from scipy.constants import physical_constants # we will need these for units conversion

Before we start, let us import the main modules that we will need for this lecture. 

In [None]:
# @title Notebook Setup { display-mode: "form" }
# Import the main modules used in this worksheet
import numpy as np
import matplotlib.pyplot as plt
import time
# Load the google drive with your files
#from google.colab import drive
#drive.mount('/content/drive')
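# The following needs to be the path of the folder with all your data files in .csv format
# The empty default keeps the next cell runnable; switch to the Drive path after mounting Google Drive
base_path = ''
#base_path = '/content/drive/MyDrive/'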

Set the local path, in case you want to save some of the results and plots from this notebook

In [None]:
# @title Set Local Path { display-mode: "form" }
# The following needs to be the path of the folder with all your collected data in .csv format
local_path="Colab Notebooks/CompThermo_Data/" # @param {type:"string"}
path = base_path+local_path

In [None]:
# @title Utilities { display-mode: "form" }
def mol_to_xyz(mol, comment=""):
    bohr_to_ang = 0.529177210903
    coords = mol.atom_coords() * bohr_to_ang
    symbols = [mol.atom_symbol(i) for i in range(mol.natm)]

    lines = []
    lines.append(str(mol.natm))
    lines.append(comment)
    for sym, (x, y, z) in zip(symbols, coords):
        lines.append(f"{sym:2s} {x: .8f} {y: .8f} {z: .8f}")

    return "\n".join(lines)

## Visualize the Systems

The following modules need to be installed on Colab to visualize and generate the molecular systems that we will simulate.

In [None]:
# @title Install and load RDKit, CirPy, and Py3DMol { display-mode: "form" }
!pip install rdkit
from rdkit import Chem
from rdkit.Chem import Draw
!pip install cirpy
import cirpy
! pip install py3Dmol
import py3Dmol

In particular, we can use them to draw the molecules in our experiments. RDKit draws a molecule from its SMILES string, so you will need to provide either the SMILES directly or an identifier that can be converted to it, such as a common name or a CAS number. Luckily, CirPy can usually find the SMILES for you, if you type the common name correctly or if you know the CAS number.

These are the CAS numbers for the molecules in the first part of the computational thermochemistry experiments:
* cas_list = ["106-98-9", "590-18-1", "624-64-6", "115-11-7"]


In [None]:
# @title Choose the molecule to draw { display-mode: "form" }
input = '590-18-1' # @param {type:"string"}
input_type = 'cas' # @param ["smiles", "name", "cas"] {allow-input: true}
if input_type != 'smiles' :
    smiles=cirpy.resolve( input, 'smiles')
else:
    smiles=input
img = Draw.MolToImage( Chem.MolFromSmiles(smiles), size=(300, 300) )
display(img)

Let's first go through the main steps of a QC calculation on a molecule. Before we run any simulation, we need an initial guess for the positions of the atoms of our molecule. Luckily, we can use `cirpy` to convert our molecule into the `xyz` format, which contains the number of atoms, a comment line, and then the element and Cartesian coordinates of each atom in the molecule.

In [None]:
print(cirpy.resolve(input,'xyz'))

For our calculation we only need the atoms information, so we will use some `str`+`list` methods to remove the first two lines from the `xyz` format. 

In [None]:
atom = ''.join(string+'\n' for string in cirpy.resolve(input,'xyz').split('\n')[2:])
print(atom)
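
As an offline illustration of the same slicing, the sketch below applies it to a hard-coded `xyz` string. The water coordinates and the helper name `strip_xyz_header` are illustrative assumptions, not CirPy output.

```python
# Hypothetical xyz text for water (approximate geometry, for illustration only)
example_xyz = """3
water
O  0.000000  0.000000  0.117300
H  0.000000  0.757200 -0.469200
H  0.000000 -0.757200 -0.469200
"""

def strip_xyz_header(xyz_text):
    """Return only the atom lines of an xyz string (drop atom count and comment)."""
    lines = xyz_text.strip().split("\n")
    n_atoms = int(lines[0])             # first line: number of atoms
    atom_lines = lines[2:2 + n_atoms]   # skip the comment line, keep n_atoms lines
    return "\n".join(atom_lines)

atom_block = strip_xyz_header(example_xyz)
print(atom_block)
```

Reading the atom count from the first line, rather than only dropping two lines, also discards any trailing blank lines that a web service may append.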

In [None]:
xyz = cirpy.resolve(input,'xyz')  # for a pyscf.gto.Mole object
view = py3Dmol.view(width=400, height=300)
view.addModel(xyz, 'xyz')
view.setStyle({'stick': {}})
view.zoomTo()
view.show()

### Using SMILES

For simple organic molecules the syntax of SMILES is not too hard to understand and use. We will be creating carbocations using SMILES (and AI can probably help), but you can use the following cell to familiarize yourself with this task.

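Try looking at `C[C+](C)C`, `C[CH+]CC`, or `CC([CH2+])C`. Note that atoms written in square brackets carry no implicit hydrogens, so the hydrogens on a charged carbon have to be written explicitly.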

In [None]:
input = 'CC([CH2+])C'  # hydrogens on a bracketed atom must be written explicitly
xyz = cirpy.resolve(input,'xyz')  # for a pyscf.gto.Mole object
view = py3Dmol.view(width=400, height=300)
view.addModel(xyz, 'xyz')
view.setStyle({'stick': {}})
view.zoomTo()
view.show()

## Total Energy Calculations

We can now create a molecule (aka `Mole`) object in PySCF, which holds static molecular data: atoms, basis, charges, integrals, symmetry. We need to manually assign the atoms (elements and coordinates) to it, and set up our choice of basis set. We will then be able to use this molecule object to compute its energy with different types of methods. The combination of method and basis set is usually referred to as a *model chemistry*. Total energy calculations at fixed atomic coordinates are also called single-point energy calculations.

In [None]:
cis2butene_cas = '590-18-1' # make sure to select the molecule you want here
cis2butene = gto.Mole()
cis2butene.atom = ''.join(string+'\n' for string in cirpy.resolve(cis2butene_cas,'xyz').split('\n')[2:])
cis2butene.unit = 'Angstrom'
cis2butene.verbose = 0 # this specifies the verbosity of the calculation

### Basis Set 

Part of the accuracy of your calculation will depend on the basis set adopted. The larger the basis set is, the more expensive and (hopefully) more accurate the calculation will be. The available basis sets are listed [here](https://pyscf.org/_modules/pyscf/gto/basis.html). One of the smallest basis sets, often used just for quick results and to check that things work, is the minimal `sto3g` basis set. More reasonable and common choices for HF and DFT calculations on small organic molecules include the well-known Pople basis sets: `631g`, `631+g*`, `6311g`, and `6311+g*`. For correlated methods (MP2, CC, etc.) the correlation-consistent basis sets from Dunning are more commonly recommended: `ccPVDZ`, `augccPVDZ`, `ccPVTZ`, `augccPVTZ`.

In [None]:
cis2butene.basis='sto3g'
cis2butene.build()

Once we have built the molecule, we can check how many electrons and orbitals we have introduced:

In [None]:
print(f"Number of Electrons: {cis2butene.nelectron}")
print(f"Number of Atomic Orbitals: {cis2butene.nao}")
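
As a hand check of these two numbers, the sketch below counts electrons and contracted basis functions for C4H8 from per-atom values. The per-atom counts are the standard textbook values for STO-3G and 6-31G; the dictionary names are illustrative and nothing here calls PySCF.

```python
# Contracted basis functions per atom (textbook values for these two basis sets)
functions_per_atom = {
    "STO-3G": {"H": 1, "C": 5},   # H: 1s | C: 1s, 2s, three 2p
    "6-31G":  {"H": 2, "C": 9},   # valence functions are split in two
}
electrons_per_atom = {"H": 1, "C": 6}
butene = {"C": 4, "H": 8}  # C4H8, the same for every butene isomer

# Total electrons of the neutral molecule
n_electrons = sum(electrons_per_atom[el] * n for el, n in butene.items())
print(f"Number of Electrons: {n_electrons}")

# Total atomic orbitals for each basis set
ao_counts = {}
for basis_name, table in functions_per_atom.items():
    ao_counts[basis_name] = sum(table[el] * n for el, n in butene.items())
    print(f"{basis_name:7s}: {ao_counts[basis_name]} atomic orbitals")
```

If the values printed by PySCF differ from this count, the most likely cause is a different basis set or a molecule with a different formula.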

### Methods: Hartree-Fock

Using the `Mole` object above we can set up and run Hartree-Fock ([HF](https://pyscf.org/user/scf.html)) self-consistent field (SCF) calculations by creating a 'mean field' (aka `mf`) object. This Python object will contain the numerical details of the calculation, how to solve it, and its results. In the following we will create a standard HF calculation (aka Restricted HF or RHF) and we will run it.

In [None]:
cis2butene_hf = scf.RHF(cis2butene)
cis2butene_hf.run()

Inside the object we will find most of the results and, in particular, the total energy of the molecule in atomic units (Hartree).

In [None]:
print(f"HF total energy: {cis2butene_hf.e_tot} Ha")
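
Since thermochemistry is usually discussed in kJ/mol or kcal/mol, a small conversion helper may be useful. This is a minimal sketch: the conversion factors are rounded CODATA-based values, and the two energies are made-up placeholders, not results of any calculation in this notebook.

```python
# Rounded conversion factors from Hartree
HARTREE_TO_KJ_PER_MOL = 2625.4996
HARTREE_TO_KCAL_PER_MOL = 627.5095

def relative_energy_kj_mol(e_hartree, e_reference_hartree):
    """Energy difference (in kJ/mol) of a structure relative to a reference."""
    return (e_hartree - e_reference_hartree) * HARTREE_TO_KJ_PER_MOL

# Hypothetical total energies in Hartree (placeholders, NOT computed values)
e_isomer_a = -155.0000
e_isomer_b = -154.9980

delta_kj = relative_energy_kj_mol(e_isomer_b, e_isomer_a)
delta_kcal = (e_isomer_b - e_isomer_a) * HARTREE_TO_KCAL_PER_MOL
print(f"Relative energy: {delta_kj:.2f} kJ/mol ({delta_kcal:.2f} kcal/mol)")
```

Only energy differences computed with the same method and basis set are meaningful; a difference of a few millihartree already corresponds to several kJ/mol.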

### Methods: Density Functional Theory

From a practical point of view, Kohn-Sham Density Functional Theory (DFT, or [KS](https://pyscf.org/user/dft.html)) is very similar to HF. In this case too we rely on a mean-field theory and solve a self-consistent field calculation. However, in principle DFT would capture electron correlation effects exactly, if we knew the exact functional of the density. There is no known analytical expression for the exact functional, but there have been many attempts to get close to it. Each functional has an associated acronym; a list of options is available [here](https://github.com/pyscf/pyscf/blob/master/pyscf/dft/libxc.py), searching for XC_ALIAS. One of the most successful density functionals for describing common organic molecules is the hybrid B3LYP functional. Apart from specifying the name of the functional, running DFT in PySCF is very similar to running HF calculations.

In [None]:
cis2butene_dft = dft.RKS(cis2butene, xc='B3LYP')
cis2butene_dft.run()

In [None]:
print(f"B3LYP total energy: {cis2butene_dft.e_tot} Ha")

### Methods: Perturbation Theory

One of the most popular post-HF methods is Møller-Plesset second-order perturbation theory (aka MP2). This method improves on Hartree–Fock by accounting for electron correlation effects, i.e. how electrons avoid each other as they move. It starts from the Hartree–Fock solution and adds a correction that captures some of this missing electron–electron interaction, giving more accurate energies at a moderate additional computational cost. However, it is important to stress that, contrary to HF and DFT, MP2 is not a variational method. That means the MP2 energy is not guaranteed to be higher than (or equal to) the exact ground-state energy. In practice, MP2 often gives energies that are too low, especially for systems with stretched bonds, near-degenerate orbitals, or strong electron correlation.

In order to run an MP2 calculation we need to start from the solution of a mean-field (HF or DFT) calculation.

In [None]:
cis2butene_mp2 = mp.MP2(cis2butene_hf)
# cis2butene_mp2.max_memory = 2000 # you may need to uncomment this for larger molecules on Colab
cis2butene_mp2.run()

In [None]:
print(f"HF total energy: {cis2butene_hf.e_tot} Ha")
print(f"MP2 correction: {cis2butene_mp2.e_corr} Ha")
print(f"MP2 total energy: {cis2butene_mp2.e_tot} Ha")

### Methods: Coupled Cluster

Coupled-cluster with single and double excitations (CCSD) is a method that improves on Hartree–Fock by describing electron correlation more accurately. It starts from the Hartree–Fock picture and systematically includes the effects of electrons moving together through single and double excitations, leading to much more reliable energies for many molecules. CCSD is more computationally demanding than simpler methods like MP2, but it is generally more reliable for small to medium-sized systems. Like MP2, CC is not a variational method, which means that the total energy may be lower than the exact value.

As for MP2 calculations, also for CC we need to start from a previous mean-field calculation.

In [None]:
cis2butene_ccsd = cc.CCSD(cis2butene_hf)
cis2butene_ccsd.run()

In [None]:
print(f"HF total energy: {cis2butene_hf.e_tot} Ha")
print(f"CCSD correction: {cis2butene_ccsd.e_corr} Ha")
print(f"CCSD total energy: {cis2butene_ccsd.e_tot} Ha")

We can go one step higher than CCSD by introducing an approximate correction for more complex, three-electron correlation effects, in what is known as coupled cluster with singles, doubles, and perturbative triples (aka CCSD(T)). The triples correction is evaluated on top of a converged CCSD calculation, which in turn is run on top of a Hartree–Fock solution. CCSD(T) often gives very accurate energies for small molecules, which is why it is sometimes called the “gold standard” of quantum chemistry. The added accuracy comes at a higher computational cost, and the method works best when the Hartree–Fock description is already a good starting point.

In [None]:
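# The perturbative triples (T) correction is computed from the converged CCSD object
cis2butene_e_t = cis2butene_ccsd.ccsd_t()

In [None]:
# CCSD(T) correlation energy = CCSD correlation energy + (T) correction
cis2butene_ccsd_t_corr = cis2butene_ccsd.e_corr + cis2butene_e_t
print(f"HF total energy: {cis2butene_hf.e_tot} Ha")
print(f"CCSD(T) correction: {cis2butene_ccsd_t_corr} Ha")
print(f"CCSD(T) total energy: {cis2butene_ccsd.e_tot + cis2butene_e_t} Ha")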

# Screening Calculations

While most AI tools will suggest clear `for` loops to automate running multiple calculations, you can follow the syntax below for most of the tasks in this laboratory.

For a fixed level of theory (say HF), you can run multiple calculations for different molecules with multiple basis sets as follows:

In [None]:
# RHF calculations for butene isomers for different basis sets
cas_list = ["106-98-9", "590-18-1", "624-64-6", "115-11-7"]
basis_list = ["STO-3G", "6-31G", "6-31+G*"]

for cas in cas_list:
  print(cas)
  # Download the geometry once per molecule, outside the timed region
  xyz = ''.join(string+'\n' for string in cirpy.resolve(cas,'xyz').split('\n')[2:])
  for basis_set in basis_list:
    t0 = time.perf_counter()

    molecule = gto.Mole()
    molecule.atom = xyz
    molecule.unit = 'Angstrom'
    molecule.basis = basis_set
    molecule.verbose = 0
    molecule.build()

    molecule_mf = scf.RHF(molecule)
    energy = molecule_mf.run().e_tot

    t1 = time.perf_counter()
    elapsed = t1-t0
    print(f"  {basis_set:10s} | Energy = {energy: .8f} Ha | Time = {elapsed:6.2f} s")

Alternatively, for a fixed basis set you can run multiple calculations for different molecules with different levels of theory as follows:

In [None]:
# Different calculations for butene isomers for a given basis set
cas_list = ["106-98-9"] # ADD AS MANY OF THE MOLECULES AS YOU WANT
basis_set = "6-31G"

for cas in cas_list:
  print(cas)
  xyz = ''.join(string+'\n' for string in cirpy.resolve(cas,'xyz').split('\n')[2:])
  molecule = gto.Mole()
  molecule.atom = xyz
  molecule.unit = 'Angstrom'
  molecule.basis = basis_set
  molecule.verbose = 0
  molecule.build()

  t0 = time.perf_counter()
  molecule_hf = scf.RHF(molecule)
  energy_hf = molecule_hf.run().e_tot
  t1 = time.perf_counter()
  elapsed = t1-t0
  print(f"HF    | Energy = {energy_hf: .8f} Ha | Time = {elapsed:6.2f} s")

  t0 = time.perf_counter()
  molecule_pbe = dft.KS(molecule,xc='PBE')
  energy_pbe = molecule_pbe.run().e_tot
  t1 = time.perf_counter()
  elapsed = t1-t0
  print(f"PBE   | Energy = {energy_pbe: .8f} Ha | Time = {elapsed:6.2f} s")

  t0 = time.perf_counter()
  molecule_b3lyp = dft.KS(molecule,xc='B3LYP')
  energy_b3lyp = molecule_b3lyp.run().e_tot
  t1 = time.perf_counter()
  elapsed = t1-t0
  print(f"B3LYP | Energy = {energy_b3lyp: .8f} Ha | Time = {elapsed:6.2f} s")

  t0 = time.perf_counter()
  molecule_mp2 = mp.MP2(molecule_hf)
  energy_mp2 = molecule_mp2.run().e_tot
  t1 = time.perf_counter()
  elapsed = t1-t0
  print(f"MP2   | Energy = {energy_mp2: .8f} Ha | Time = {elapsed:6.2f} s")

  t0 = time.perf_counter()
  molecule_ccsd = cc.CCSD(molecule_hf)
  energy_ccsd = molecule_ccsd.run().e_tot
  t1 = time.perf_counter()
  elapsed = t1-t0
  print(f"CCSD  | Energy = {energy_ccsd: .8f} Ha | Time = {elapsed:6.2f} s")
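
The timing pattern repeated above could be factored into a small helper. The sketch below is one possible way to do it; the function name `timed` is an assumption, and a stand-in workload replaces the PySCF call so that the cell runs on its own.

```python
import time

def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed wall time, and return both."""
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"{label:6s}| Time = {elapsed:6.2f} s")
    return result, elapsed

# Stand-in workload so that the sketch runs without PySCF
total, seconds = timed("sum", sum, range(1_000_000))
print(total)
```

In the loop above, the same helper could wrap each method, e.g. `timed("HF", lambda: scf.RHF(molecule).run().e_tot)`, so that each level of theory takes a single line.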

# Geometry Optimization

While total energies are very important, so far we have only relied on atomic coordinates that have been built for us by CirPy and very likely do not represent the correct equilibrium geometries of the molecules we are studying. Given an initial structure we can use quantum chemistry to relax the positions of the atoms towards the equilibrium geometries, e.g. by following the direction of the forces on the atoms, going downhill in the potential energy surface. This type of calculation is known as 'geometry optimization', 'geoopt', or 'relax' calculation. 

NOTE that optimizing the positions of the atoms is a complex non-linear problem in a multidimensional (3 × number of atoms) space: we are not guaranteed to arrive at the global minimum and we will usually only converge to the local minimum closest to the starting point.
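
The dependence on the starting point can be seen on a one-dimensional toy problem. The sketch below follows the negative gradient on an asymmetric double-well potential; the potential, step size, and function names are illustrative assumptions and have nothing to do with how PySCF or PyBerny optimize geometries internally.

```python
def potential(x):
    """Asymmetric double-well model potential (arbitrary units)."""
    return (x**2 - 1.0)**2 + 0.3 * x

def gradient(x):
    """Analytical derivative of the model potential."""
    return 4.0 * x * (x**2 - 1.0) + 0.3

def steepest_descent(x_start, step=0.01, max_steps=10000, tol=1e-8):
    """Move downhill along the negative gradient until it (nearly) vanishes."""
    x = x_start
    for _ in range(max_steps):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

x_left = steepest_descent(-1.5)   # starts on the left of the barrier
x_right = steepest_descent(1.5)   # starts on the right of the barrier
print(f"Start -1.5 -> x = {x_left: .4f}, V = {potential(x_left): .4f}")
print(f"Start +1.5 -> x = {x_right: .4f}, V = {potential(x_right): .4f}")
```

Both runs converge, but to different minima with different energies: only the start on the left reaches the lower well. The same happens with molecules, where different starting conformers can relax to different structures.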

In practice, using PySCF we can optimize a molecule using HF or KS energies by passing the corresponding mean field object to the `optimize()` function. The output of the calculation is a new molecule object, with the new relaxed coordinates. 

In [None]:
print(f"HF Energy of the initial structure:   {cis2butene_hf.e_tot} Ha")
# Optimize the geometry
cis2butene_opt = optimize(cis2butene_hf)
# Recompute the energy of the optimized molecule
cis2butene_opt_hf = scf.RHF(cis2butene_opt)
cis2butene_opt_hf.run()
print(f"HF Energy of the optimized structure: {cis2butene_opt_hf.e_tot} Ha")


We can also compare the atomic coordinates of the initial and optimized structures to see how bond lengths and angles have changed during the optimization

In [None]:
initial_coordinates = cis2butene.atom_coords().copy()   # Bohr
print(initial_coordinates)
optimized_coordinates = cis2butene_opt.atom_coords().copy()
print(optimized_coordinates)
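
Raw coordinates are hard to compare by eye, so it can help to convert them into bond lengths. The sketch below computes the distance between two atoms from a list of coordinates in Bohr; the helper name and the example coordinates are hypothetical, not taken from the calculation above.

```python
import math

BOHR_TO_ANGSTROM = 0.529177210903

def bond_length_angstrom(coords_bohr, i, j):
    """Distance between atoms i and j (0-based indices), converted from Bohr to Angstrom."""
    squared = sum((a - b) ** 2 for a, b in zip(coords_bohr[i], coords_bohr[j]))
    return math.sqrt(squared) * BOHR_TO_ANGSTROM

# Hypothetical coordinates in Bohr (placeholders for illustration)
example_coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.5), (1.9, 0.0, -0.6)]
print(f"d(0,1) = {bond_length_angstrom(example_coords, 0, 1):.4f} Angstrom")
print(f"d(0,2) = {bond_length_angstrom(example_coords, 0, 2):.4f} Angstrom")
```

The same function should also accept the arrays returned by `atom_coords()`, so calling it with `initial_coordinates` and `optimized_coordinates` for the same pair of atoms shows how a given bond changed.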

However, it may be easier to see the changes using a 3D visualization

In [None]:
view = py3Dmol.view(width=400, height=400)
view.addModel(mol_to_xyz(cis2butene, "mol1"), "xyz")
view.setStyle({"model": 0}, {"stick": {"radius": 0.05, "color": "red"}})

view.addModel(mol_to_xyz(cis2butene_opt, "mol2"), "xyz")
view.setStyle({"model": 1}, {"stick": {"radius": 0.05, "color": "green"}})
# Center the camera on both models and render the viewer
view.zoomTo()
view.show()

# Calculation Setup for Larger Systems

While in ideal circumstances, with infinite resources and time, we would run geometry optimization and total energy calculations with the most accurate level of theory and basis set, in practice we often need to compromise. Since geometry optimization calculations involve running multiple single-point energy steps, we usually adopt cheaper methods and smaller basis sets for that step. We can then use the optimized geometry to run a more accurate single-point energy calculation. In this section we will review the whole process for a more challenging system, with a larger number of atoms and electrons.
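
To get a feeling for why this compromise is needed, the sketch below estimates how the cost of each method grows with system size N (e.g. the number of basis functions). The exponents are the commonly quoted formal scaling values; actual timings depend strongly on the implementation and hardware, so the numbers are only indicative. The names `FORMAL_SCALING` and `cost_ratio` are illustrative.

```python
# Commonly quoted formal scaling exponents with system size N (indicative only)
FORMAL_SCALING = {"HF": 4, "MP2": 5, "CCSD": 6, "CCSD(T)": 7}

def cost_ratio(method, n_small, n_large):
    """Estimated cost increase when going from n_small to n_large basis functions."""
    return (n_large / n_small) ** FORMAL_SCALING[method]

# Doubling the number of basis functions
for method in FORMAL_SCALING:
    print(f"{method:8s}: about {cost_ratio(method, 10, 20):.0f}x more expensive")
```

Under these assumptions, doubling the basis set size makes a CCSD calculation far more expensive than the corresponding HF one, which is why the expensive method is reserved for a single final energy evaluation.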

We start by setting up our molecule and visualizing it, to make sure it looks right.

In [None]:
hexene_cas = '592-41-6' # make sure to select the molecule you want here
xyz = cirpy.resolve(hexene_cas,'xyz')  # for a pyscf.gto.Mole object
view = py3Dmol.view(width=400, height=300)
view.addModel(xyz, 'xyz')
view.setStyle({'stick': {}})
view.zoomTo()
view.show()

We can then build a `Mole` object with a basis set appropriate for geometry optimization

In [None]:
hexene = gto.Mole()
hexene.atom = ''.join(string+'\n' for string in cirpy.resolve(hexene_cas,'xyz').split('\n')[2:])
hexene.unit = 'Angstrom'
hexene.verbose = 0 # this specifies the verbosity of the calculation
hexene.basis='def2tzvp'
hexene.build()

We first run a single DFT calculation to check the timing and make sure that a single step is not taking too long. As a rough guide, you can assume that a geometry optimization will take some multiple (perhaps ten times) of the time of a single-step calculation.

In [None]:
hexene_dft = dft.RKS(hexene, xc='B3LYP')
hexene_dft.run()
print(f"DFT Energy of the initial structure:   {hexene_dft.e_tot}")

We run the optimization using a very common DFT method

In [None]:
hexene_opt = optimize(hexene_dft)

Now we can copy the geometry from the optimized molecule and setup a second molecule with a more accurate basis set. 

In [None]:
hexene_final = hexene_opt.copy()
hexene_final.basis = 'augccpvtz'
hexene_final.build() # rebuild the molecule so that the new basis set is actually used

Most post-HF methods require running an HF calculation first.

In [None]:
hexene_final_hf = scf.RHF(hexene_final)
hexene_final_hf.run()
print(f"HF Energy of the optimized structure: {hexene_final_hf.e_tot}")

Finally, we can run our correlated method of choice.

In [None]:
hexene_final_ccsd = cc.CCSD(hexene_final_hf)
hexene_final_ccsd.run()
print(f"CCSD Energy of the optimized structure: {hexene_final_ccsd.e_tot}")