# Pipeline for Gaussian results #

The purpose of this script is to read all Gaussian output (.log) files in a specified folder, extract the vibrational information (frequencies, IR intensities, Raman activities and normal mode displacements), scale the frequencies, produce simulated "spectra" from the theoretical frequencies and then export all of that information in an organised manner.

Importing libraries.

In [1]:
import pandas as pd
import numpy as np
import math
import os
import openpyxl
from openpyxl import load_workbook

Defining a function that will be used later on to draw the peaks of the spectra:

Gaussian:
    
Defines the equation for a Gaussian curve. 

$$
      f(\chi) = \frac{1}{\sqrt{2 \pi \sigma^{2}}}e^{-\frac{({\chi-\mu})^{2}}{2 {\sigma^{2}}}}
$$
* $\chi$ refers to a specific point on the x-axis. 
* $f(\chi)$ refers to the y-value for the specific point. 
* $\mu$ refers to the distribution's mean value.
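* $\sigma$ refers to the distribution's standard deviation.

The function below uses the unnormalised form of this curve: the prefactor is replaced by an amplitude "a" (the peak height), while "b" and "c" play the roles of $\mu$ and $\sigma$.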

In [2]:
def gauss(x, a, b, c):
    return a * (math.e ** (-(((x - b) ** 2) / (2 * (c ** 2)))))
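
As a quick sanity check of this function, the sketch below evaluates a single peak at its centre and one standard deviation away from it. The numbers (height 10, centre 1600, standard deviation 6) are illustrative only and are not taken from any calculation; the function is repeated so that the example runs on its own.

```python
import math


def gauss(x, a, b, c):
    # Same definition as in the notebook: a = height, b = centre, c = standard deviation
    return a * (math.e ** (-(((x - b) ** 2) / (2 * (c ** 2)))))


peak_height = gauss(1600, 10.0, 1600, 6)   # value at the centre of the peak
one_sigma = gauss(1606, 10.0, 1600, 6)     # value one standard deviation away

# Full width at half maximum of a Gaussian: 2 * sqrt(2 * ln 2) * sigma
fwhm = 2 * math.sqrt(2 * math.log(2)) * 6

print(peak_height, round(one_sigma, 3), round(fwhm, 2))
```

At the centre the function returns the amplitude itself, and one standard deviation away it has dropped to about 61% of that value.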

In this section, the user can set the "directory" where the Gaussian output (.log) files are kept and the "path" to the Excel spreadsheet in which the results should be saved.

In [3]:
directory = ''
path = ''

This section creates some lists that collect entries that show a negative frequency, i.e. a structure that is not optimised ("negative_freq_ls"), are missing Raman activities ("no_Raman_ls") or are missing data altogether ("no_data_ls"). They can be viewed after the batch processing of molecules and are useful for understanding whether any calculations need to be repeated.

Additionally, it creates a "record" list that contains all processed data, used for transferring them into the output "Variables_Pipeline_Gaussian.py" file.

In [4]:
negative_freq_ls = []
no_Raman_ls = []
no_data_ls = []

record = []

This section reads each file available in the "directory" and extracts the vibrational information (Frequency, IR intensity, Raman activity) as well as the normal mode displacements (R: bond length, A: bond angle and D: bond dihedral angle) for each vibrational mode. It then structures that information into a Pandas dataframe, while also printing a message if it detects a negative frequency or missing data. 

It then determines which normal mode displacements make a significant contribution (> 5%) and collects those in the "important_contributions" list.
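
As a minimal sketch of the extraction step, the example below pulls the numeric values out of three lines laid out like the "Frequencies --", "IR Inten --" and "Raman Activ --" lines that the notebook searches for. The sample lines and the helper name "values_after_marker" are made up for illustration; real output files may differ in spacing and number of columns.

```python
# Made-up lines that mimic the layout searched for by the notebook
sample_lines = [
    " Frequencies --     95.1234               150.5678               210.9012",
    " IR Inten    --      1.2345                 0.0678                12.3456",
    " Raman Activ --      3.2100                 0.5400                 7.8900",
]


def values_after_marker(line):
    # Hypothetical helper: everything after the first "--" is a list of numbers
    return [float(value) for value in line.split("--", 1)[1].split()]


freqs, ir_intensities, raman_activities = [], [], []
for line in sample_lines:
    if "Frequencies --" in line:
        freqs.extend(values_after_marker(line))
    elif "IR Inten" in line:
        ir_intensities.extend(values_after_marker(line))
    elif "Raman Activ" in line:
        raman_activities.extend(values_after_marker(line))

print(freqs, ir_intensities, raman_activities)
```

Splitting on the "--" marker avoids having to remove the label words one by one, which is what makes the token-by-token approach harder to follow.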

In [5]:
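for filename in os.listdir(directory):
    file = os.path.join(directory, filename)
    # Skip sub-folders and anything that is not a Gaussian .log file
    if not os.path.isfile(file) or not filename.lower().endswith(".log"):
        continue
    print(file)

    # Read the whole file; the context manager closes it afterwards
    with open(file) as f:
        content = f.readlines()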
    Freq = []
    Infra = []
    Raman = []
    Normal_modes = []
    
    normal_toggle = 0
    checkpoints = []
    counter = 0
    
    Modes = []
    temp_modes = []
    for i in content:
        if "Frequencies --" in i:
            # Everything after the "--" marker is a whitespace-separated list of values
            Freq.extend(float(value) for value in i.split("--", 1)[1].split())
        elif "IR Inten" in i:
            Infra.extend(float(value) for value in i.split("--", 1)[1].split())
        elif "Raman Activ" in i:
            Raman.extend(float(value) for value in i.split("--", 1)[1].split())
        elif "Normal Mode" in i and normal_toggle == 0:
            normal_toggle = 1
        elif normal_toggle == 1 and "Axes" in i:
            normal_toggle = 0
        elif normal_toggle == 1:
            Normal_modes.append(i)
    
    # Group the collected lines into one list of rows per normal mode.
    # The list is only read here, never modified while it is being iterated.
    for i in Normal_modes:
        if "Normal Mode" in i:
            # A new header closes the previous mode, if any rows were collected
            if temp_modes != []:
                Modes.append(temp_modes)
                temp_modes = []
        elif any(char.isdigit() for char in i) and "Max" not in i:
            temp_modes.append(i)
    # Store the rows of the last mode, which has no following header
    if temp_modes != []:
        Modes.append(temp_modes)
        temp_modes = []

    # Split every row into its fields and drop the "!" table borders
    Modes = [[[k for k in str.split(j) if "!" not in k] for j in i] for i in Modes]
            
            
    df = pd.DataFrame({"Frequency": Freq, "IR Intensity": Infra, "Raman activity": Raman})
    df.dropna(inplace=True)
    df = df.astype(float)
    df.insert(3, "Modes", Modes)
    
    if df.empty:
        print("No vibrational information for " + str.split(str(filename), ".")[0])
        no_data_ls.append(str.split(str(filename), ".")[0])
        continue
    elif (df["Frequency"] < 0).any():
        print("Negative frequency found for " + str.split(str(filename), ".")[0])
        negative_freq_ls.append(str.split(str(filename), ".")[0])
        continue
    else: 
        
        df_mode_lst = []
        important_contributions = []        
        counter = 1
        for i in df.iloc[:,3]:
            temp_type = []
            temp_stretch = []
            temp_stretch_sign = []
            temp_stretch_perc = []
            
            temp_bend = []
            temp_bend_sign = []
            temp_bend_perc = []
            
            temp_wag = []
            temp_wag_sign = []
            temp_wag_perc = []
            
            for j in i:
                if "R" in j[0]:
                    temp_stretch.append(j[1].replace("R",""))
                    temp_stretch_sign.append(float(j[2]))
                    temp_stretch_perc.append(float(j[3]))
                elif "A" in j[0]:
                    temp_bend.append(j[1].replace("A",""))
                    temp_bend_sign.append(float(j[2]))
                    temp_bend_perc.append(float(j[3]))     
                elif "D" in j[0]:
                    temp_wag.append(j[1].replace("D",""))
                    temp_wag_sign.append(float(j[2]))
                    temp_wag_perc.append(float(j[3]))
                else:
                    continue
                # Keep every displacement (stretch, bend or wag) that contributes more than 5%
                if float(j[-1]) > 5:
                    important_contributions.append(j)
            temp_df_stretch = pd.DataFrame({"Stretching mode": temp_stretch, "Stretch value": temp_stretch_sign, "Stretch percentage": temp_stretch_perc})
            temp_df_bend = pd.DataFrame({"Bending mode": temp_bend, "Bend value": temp_bend_sign, "Bend percentage": temp_bend_perc})
            temp_df_wag = pd.DataFrame({"Wagging mode": temp_wag, "Wag value": temp_wag_sign, "Wag percentage": temp_wag_perc})
            exec("df_mode_"+str(counter)+" = pd.concat([temp_df_stretch, temp_df_bend, temp_df_wag])")
            df_mode_lst.append("df_mode_"+str(counter))
            counter = counter + 1

Results\vanillin.log
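
To illustrate how the displacement rows are sorted and filtered, the sketch below parses three made-up rows in the field order that the loop above expects (name, definition, value, percentage, enclosed by "!" borders). The rows, the helper "parse_row" and the "KIND" lookup are hypothetical and only serve as an example.

```python
# Made-up rows in the assumed layout: ! name definition value percentage !
rows = [
    "! R1    R(1,2)      -0.0312     62.4 !",
    "! A1    A(2,1,3)     0.0105      3.1 !",
    "! D1    D(3,1,2,4)   0.0207     34.5 !",
]

# R = bond length, A = bond angle, D = dihedral angle, as in the notebook
KIND = {"R": "stretch", "A": "bend", "D": "wag"}


def parse_row(line):
    # Drop the "!" borders, then read the four fields
    name, definition, value, weight = [t for t in line.split() if "!" not in t][:4]
    return {
        "kind": KIND[name[0]],
        "atoms": definition[1:],      # "R(1,2)" -> "(1,2)"
        "value": float(value),
        "weight": float(weight),
    }


parsed = [parse_row(row) for row in rows]

# Keep only the displacements that contribute more than 5%
significant = [p for p in parsed if p["weight"] > 5]
print([(p["kind"], p["atoms"], p["weight"]) for p in significant])
```

Here the bend is dropped because its weight of 3.1% is below the 5% threshold, while the stretch and the wag are kept.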


This part is where the user can input the scaling factor that they want the frequencies to be scaled by.

In [6]:
        Scaling_factor = 0.95  

This section then scales all frequencies by the scaling factor and creates empty arrays (of size 4000) for the "spectra" to be drawn into. Each "spectrum" consists of two 4000-long arrays whose elements correspond to the frequency points of the spectrum: the x-axis array simply counts from 1 to 4000, whereas the y-axis array holds the intensity at each of those points.
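
The mapping between a scaled frequency and its position in the arrays can be sketched as follows. The frequencies and intensities are made up, and the NumPy-based approach is an illustration of the idea rather than the code used in this notebook.

```python
import numpy as np

scaling_factor = 0.95                        # same value as in the notebook
frequencies = np.array([1700.0, 3100.4])     # made-up unscaled frequencies
intensities = np.array([12.0, 3.5])          # made-up IR intensities

scaled = frequencies * scaling_factor
rounded = np.round(scaled).astype(int)       # nearest whole point on the x-axis

x_axis = np.arange(1, 4001)                  # 1, 2, ..., 4000
sticks = np.zeros(4000)                      # one intensity slot per x-axis point

# The axis starts at 1, so the value w is stored at index w - 1
sticks[rounded - 1] = intensities

print(rounded, x_axis[rounded - 1], sticks[rounded - 1])
```

The off-by-one between a frequency value and its array index is worth keeping in mind, because every later step that writes into the y-axis arrays relies on it.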

In [7]:
        scaled_freq = []
        for i in list(df["Frequency"]):
            scaled_freq.append(float(i) * Scaling_factor)
        
        df.insert(1, "Scaled Frequency", scaled_freq)
                        
        #Get peak x and y values
        dfx = list(df.iloc[:,1].round())
        df_IR_y = list(df.iloc[:,2])
        
        if Raman != []:
            df_Ram_y = list(df.iloc[:,3])

            
        #Create new arrays for all spectrum coordinates
        x = list(range(1,4001,1))
        x = [round(elem, 1) for elem in x]
        IR_y = [0] * 4000
        Ram_y = [0] * 4000
        
        #Update peak y-values
        counter = -1
        for i in x:
            counter = counter + 1
            if i in dfx:
                position = dfx.index(i)
                IR_y[counter] = df_IR_y[position]
                #Raman values only exist when Raman activities were found
                if Raman != []:
                    Ram_y[counter] = df_Ram_y[position]

This is a good point to mention how the "graphs" work. As the DFT calculation only provides a frequency and an intensity value for each mode, and not a peak shape, each peak is instead visualised as a Gaussian curve with a fixed standard deviation (6 in the code below), centred on the calculated (scaled) frequency.

This section produces the first "graph", the "Convolved graph", in which the y-axis array contains a sum of all the intensities corresponding to that frequency from all "peaks".
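
The idea of summing one Gaussian per peak can also be sketched in vectorised form. This is an alternative illustration, not the implementation used in this notebook: the function name "broaden" and the peak values are assumptions, and each Gaussian is evaluated over the whole axis instead of a window around its centre.

```python
import numpy as np


def broaden(centres, heights, sigma=6.0, n_points=4000):
    # Hypothetical helper: returns the x-axis and the sum of all Gaussian peaks
    x = np.arange(1, n_points + 1)
    centres = np.asarray(centres, dtype=float)[:, None]   # one row per peak
    heights = np.asarray(heights, dtype=float)[:, None]
    curves = heights * np.exp(-((x - centres) ** 2) / (2 * sigma ** 2))
    return x, curves.sum(axis=0)


# Two made-up, well separated peaks
x_axis, spectrum = broaden([1000, 1600], [5.0, 2.0])
print(x_axis[np.argmax(spectrum)], spectrum[999], spectrum[1599])

# Two peaks at the same position simply add up
_, degenerate = broaden([1000, 1000], [1.0, 2.0])
print(degenerate[999])
```

Because every peak contributes its own curve to the sum, modes that fall on the same point of the x-axis add up instead of overwriting each other.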

In [8]:
        IR_y_conv = np.zeros(4000)
        
        for i in dfx:
            templist = [0] * 4000
            z = x.index(i)
            # Evaluate the peak within +/- 50 points of its centre, clipped to the axis limits
            xdata = np.array(x[max(z - 50, 0):min(z + 51, 4000)])
            for j in xdata:
                # x[z] is the peak centre; the x-axis value j is stored at index j - 1
                templist[j - 1] = gauss(j, IR_y[z], x[z], 6)
            IR_y_conv = IR_y_conv + np.array(templist)

        if Raman != []:            
            
            Ram_y_conv = np.zeros(4000)
            
            for i in dfx:
                templist = [0] * 4000
                z = x.index(i)
                xdata = np.array(x[max(z - 50, 0):min(z + 51, 4000)])
                for j in xdata:
                    templist[j - 1] = gauss(j, Ram_y[z], x[z], 6)
                Ram_y_conv = Ram_y_conv + np.array(templist)
        else:
            print("Raman unavailable for " + str.split(str(filename), ".")[0])
            no_Raman_ls.append(str.split(str(filename), ".")[0])
        
        print("Convolved spectra completed")


Convolved spectra completed


The last section exports the results. They are exported in two ways:

1. An Excel spreadsheet, containing a sheet with all frequencies, scaled frequencies, IR intensities, Raman activities and vibrational mode displacements, as well as a sheet for each vibrational mode with organised displacements as "stretches", "bends" and "wags", their contributing atoms and their contributing percentages.


2. A Variables_Pipeline_Gaussian.py Python file, which contains all produced variables (x-axis and y-axis arrays for convolved IR and Raman spectra for each molecule). This is then fed into further scripts for graph visualisation.
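
The second export can be sketched as follows: variable names are derived from the file name, the arrays are written as plain Python lists, and the resulting text can be executed again to recover the data. The helpers "sanitise" and "variable_lines" and the sample file name are hypothetical; the sketch builds the text in memory instead of appending it to a file.

```python
import numpy as np


def sanitise(filename):
    # Same replacements and 30-character limit as in the export cell
    name = filename.replace("-", "_").replace("+", "plus").replace(",", "_")
    return name[:30].split(".")[0]


def variable_lines(filename, x, ir_y, raman_y=None):
    # Hypothetical helper: one assignment per array, written as plain lists
    base = sanitise(filename)
    lines = [
        "x_" + base + " = " + str(np.asarray(x).tolist()),
        "IR_y_conv_" + base + " = " + str(np.asarray(ir_y).tolist()),
    ]
    if raman_y is not None:
        lines.append("Ram_y_conv_" + base + " = " + str(np.asarray(raman_y).tolist()))
    return "\n".join(lines) + "\n"


# Made-up three-point "spectrum" without Raman data
text = variable_lines("4-hydroxy,3-methoxy.log", [1, 2, 3], np.array([0.0, 1.5, 0.2]))
print(text)

# A downstream script would import the file; here the text is executed directly
namespace = {}
exec(text, namespace)
print(namespace["IR_y_conv_4_hydroxy_3_methoxy"])
```

Converting the arrays with tolist() before turning them into text keeps the written numbers as plain Python floats, so the generated file can be read without NumPy.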

In [11]:
        new_filename = filename.replace("-", "_")    
        new_filename = new_filename.replace("+", "plus")
        new_filename = new_filename.replace(",", "_")
        if len(new_filename) >= 31:
            new_filename = new_filename[:30]
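        base_name = str.split(str(new_filename), ".")[0]
        with open('Variables_Pipeline_Gaussian.py', 'a') as f:
            f.write('x_' + base_name + ' = ' + str(x) + '\n')
            # tolist() converts the NumPy values to plain Python floats before they are written
            f.write('IR_y_conv_' + base_name + ' = ' + str(IR_y_conv.tolist()) + '\n')
            # The Raman spectrum only exists when Raman activities were found for this molecule
            if Raman != []:
                f.write('Ram_y_conv_' + base_name + ' = ' + str(Ram_y_conv.tolist()) + '\n')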

        with pd.ExcelWriter(str.split(str(new_filename), ".")[0]+"_normal_modes.xlsx") as writer:
            df.to_excel(writer, sheet_name = "Frequencies", index = False)
            # Loop over every normal mode, including the last one
            for i in range(1, len(df_mode_lst) + 1):
                name = "Normal_mode_" + str(i)
                # df_mode_lst stores variable names, so each dataframe is looked up by its name
                eval(df_mode_lst[i-1]).to_excel(writer, sheet_name = name, index = False)
        print(str.split(str(new_filename), ".")[0] + " completed")

        record.append(str.split(str(new_filename), ".")[0])

with open('Variables_Pipeline_Gaussian.py', 'a') as f:
    # End with a newline so that a later append starts on its own line
    f.write("record = " + str(record) + "\n")

vanillin completed
