<a href="https://colab.research.google.com/github/190ibrahim/MLFFs_Transition_Metal_Complexes_BSc_Thesis/blob/main/Data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tools and Libraries

I will use the Python programming language, which provides extensive libraries and tools for data manipulation, analysis, and machine learning.


1.	Pandas:
	Pandas is a Python library for data manipulation and analysis. It is well suited to structured data and is used here to preprocess the dataset.
2.	NumPy:
	NumPy is the fundamental package for scientific computing with Python. It provides large, multi-dimensional arrays and matrices, along with mathematical functions that operate on them.
3.	Scikit-learn:
	Scikit-learn is a machine learning library offering simple and efficient tools for data mining and data analysis, including classification, regression, and clustering.
4.	Matplotlib and Seaborn:
	These data visualization libraries will be used to plot the data and the performance of the machine learning model.
5.	DScribe:
	DScribe is a Python library for materials science and cheminformatics. It generates structural descriptors from atomic structures, which helps extract meaningful features from the molecular structures of transition metal complexes.
6.	TensorFlow:
	TensorFlow is an open-source machine learning framework developed by Google, and Keras is an open-source high-level neural networks API. Together, they provide a powerful environment for building and training machine learning models.
7.	XGBoost:
	XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
8.	ASE:
	ASE (Atomic Simulation Environment) is a Python library for setting up, manipulating, and analyzing atomistic structures. It is used here to build Atoms objects and to look up atomic numbers and covalent radii.

So let's import the necessary libraries:





In [None]:
!pip install ase
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
# from dscribe.descriptors import CoulombMatrix, SOAP
from decimal import Decimal
from scipy.spatial import distance
from scipy.spatial.distance import cdist
import ase


Collecting ase
  Downloading ase-3.22.1-py3-none-any.whl (2.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 3.2 MB/s eta 0:00:00
Installing collected packages: ase
Successfully installed ase-3.22.1


# Data Availability


tmQM is an open data set freely available at GitHub (https://github.com/bbskjelstad/tmqm) and from Quantum-Machine (http://quantum-machine.org/datasets/). Quantum features, geometries and properties computed at the GFN2-xTB and TPSSh-D3BJ/def2-SVP levels of theory are provided in the xyz and csv file formats.

# Data Collection

I will use the transition metal quantum mechanics (tmQM) data set, which contains the geometries and properties of a large transition metal–organic compound space. tmQM comprises 86,665 mononuclear complexes extracted from the Cambridge Structural Database, including Werner, bioinorganic, and organometallic complexes based on a large variety of organic ligands and 30 transition metals (the 3d, 4d, and 5d series from groups 3 to 12).

  tmQM is an open data set that can be downloaded free of charge from https://github.com/bbskjelstad/tmqm.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

quantum_properties_path = '/content/drive/MyDrive/Thesis/Thesis/data/tmQM_y.csv'
geometry_X1_path = '/content/drive/MyDrive/Thesis/Thesis/data/tmQM_X1.xyz'
geometry_X2_path = '/content/drive/MyDrive/Thesis/Thesis/data/tmQM_X2.xyz'
benchmark_data_path = '/content/drive/MyDrive/Thesis/Thesis/data/Benchmark2_TPSSh_Opt.xyz'

Mounted at /content/drive


In [None]:
# Dataframe initialization
df_quantum_properties = None
df_molecular_geometry_X1 = None
df_molecular_geometry_X2 = None
df_benchmark = None

In [None]:
# Load CSV Data
try:
  df_quantum_properties = pd.read_csv(quantum_properties_path)
  print("Quantum properties data loaded successfully!")
except (FileNotFoundError, pd.errors.ParserError) as e:
  print(f"Error loading quantum properties data: {e}")

Quantum properties data loaded successfully!


In [None]:
def extract_xyz_data(file_path):
    """Parse a multi-molecule .xyz file into a DataFrame with one row per atom."""
    df_data = []
    stoichiometry_flag = False  # Becomes True once a header with "Stoichiometry" is seen
    with open(file_path, 'r') as file:
        # Iterate over the lines
        for line in file:
            # A line consisting only of digits gives the number of atoms
            if line.strip().isdigit():
                n_atoms = int(line.strip())

            # A line starting with "CSD_code" is the molecule header; extract the code
            elif line.startswith("CSD_code"):
                csd_code = line.split("=")[1].split("|")[0].strip()

                # The Stoichiometry field is optional, so the same parser can handle
                # all three files: the benchmark and the geometry files X1 and X2
                if "Stoichiometry" in line:
                    stoichiometry = line.split("Stoichiometry = ")[1].split("|")[0].strip()
                    stoichiometry_flag = True

            # A line starting with a letter is an atom record: symbol, then x, y, z
            elif line[0].isalpha():
                parts = line.split()
                atom, x, y, z = parts[0], Decimal(parts[1]), Decimal(parts[2]), Decimal(parts[3])
                if stoichiometry_flag:
                    df_data.append([csd_code, n_atoms, stoichiometry, atom, x, y, z])  # include stoichiometry
                else:
                    df_data.append([csd_code, n_atoms, atom, x, y, z])  # no stoichiometry available

    # Column names (adjust based on whether 'Stoichiometry' is included)
    columns = ["CSD_code", "N_atoms", "Atom", "X", "Y", "Z"]
    if stoichiometry_flag:
        columns.insert(2, "Stoichiometry")

    # Convert the list into a DataFrame; X, Y, and Z are already Decimal objects,
    # so no further conversion is needed
    df = pd.DataFrame(df_data, columns=columns)

    return df

# df_molecular_geometry_X1 = extract_xyz_data(geometry_X1_path)
df_molecular_geometry_X2 = extract_xyz_data(geometry_X2_path)
# df_benchmark = extract_xyz_data(benchmark_data_path)
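
For orientation, the parser above expects each molecule as a block: an atom-count line, a header line starting with CSD_code whose fields are separated by |, and then one line per atom. The sketch below splits one such header into its fields. The helper name parse_header and the example record are illustrative assumptions, not taken from the tmQM files.

```python
def parse_header(line):
    """Split a 'key = value | key = value' header line into a dict (hypothetical helper)."""
    fields = {}
    for part in line.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

# Made-up header line in the layout the parser assumes
example_header = "CSD_code = ABCDEF | Stoichiometry = C4H8NiO2"
header = parse_header(example_header)
print(header["CSD_code"], header["Stoichiometry"])
```

Keeping the header fields in a dict would make it straightforward to pick up additional fields later without changing the splitting logic.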

In [None]:
df_quantum_properties.head(1)

In [None]:
df_molecular_geometry_X2.head(1)

# Data Preprocessing

Objective: Clean and preprocess the dataset.

Tools: Pandas, NumPy, cheminformatics tools if needed.

In [None]:
df_quantum_properties.shape

(86665, 9)

In [None]:
df_quantum_properties.info()
df_quantum_properties.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86665 entries, 0 to 86664
Data columns (total 1 columns):
 #   Column                                                                                             Non-Null Count  Dtype 
---  ------                                                                                             --------------  ----- 
 0   CSD_code;Electronic_E;Dispersion_E;Dipole_M;Metal_q;HL_Gap;HOMO_Energy;LUMO_Energy;Polarizability  86665 non-null  object
dtypes: object(1)
memory usage: 677.2+ KB


(86665, 1)



The CSV file is semicolon-delimited, so reading it with the default comma separator produced a single column. The following steps turn it into properly typed columns:

1.   Splitting the Column:
Method Used: str.split()
Description: The str.split() method splits a string into substrings at a specified delimiter, here a semicolon (;). With expand=True, each substring becomes its own column.
2.   Renaming Columns:
Method Used: columns
Description: The columns attribute assigns new column names to a DataFrame. Here a list of descriptive names replaces the default integer labels.
3.   Converting Data Types:
Method Used: pd.to_numeric()
Description: The split values are strings, so the property columns are converted to numeric types. With errors='coerce', values that cannot be parsed become NaN.
4.   Concatenating DataFrames:
Method Used: pd.concat()
Description: The pd.concat() function concatenates DataFrames along a given axis. Here the original DataFrame (df_quantum_properties) is joined column-wise with the new DataFrame (df_quantum_properties_split) created in the previous steps.
5.   Dropping a Column:
Method Used: drop()
Description: The drop() method removes specified columns or rows from a DataFrame. Here it removes the original column that held all values separated by semicolons.
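
The steps above can be sketched on a toy table. The column names and values below are invented for illustration. The last lines show that passing sep=';' to pd.read_csv yields the separate columns directly, which may be a simpler route when the file is known to be semicolon-delimited.

```python
import io
import pandas as pd

# Toy semicolon-delimited content (made-up values, not tmQM data)
raw = "code;energy;gap\nAAA;-1.5;0.5\nBBB;-2.5;0.25\n"

# Reading with the default comma separator yields one wide column
df = pd.read_csv(io.StringIO(raw))
single_col = df.columns[0]

# Split on ';', rename the columns, then convert the numeric ones
df_split = df[single_col].str.split(';', expand=True)
df_split.columns = single_col.split(';')
df_split[['energy', 'gap']] = df_split[['energy', 'gap']].apply(pd.to_numeric, errors='coerce')

# Concatenate with the original frame and drop the original wide column
df = pd.concat([df, df_split], axis=1).drop(columns=[single_col])

# Alternative: let read_csv split on semicolons directly
df_direct = pd.read_csv(io.StringIO(raw), sep=';')
print(df.dtypes)
print(df_direct.dtypes)
```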






In [None]:
# Split the single column into multiple columns
df_quantum_properties_split = df_quantum_properties['CSD_code;Electronic_E;Dispersion_E;Dipole_M;Metal_q;HL_Gap;HOMO_Energy;LUMO_Energy;Polarizability'].str.split(';', expand=True)

# Rename the columns for clarity
new_column_names = ['CSD_code', 'Electronic_E', 'Dispersion_E', 'Dipole_M', 'Metal_q', 'HL_Gap', 'HOMO_Energy', 'LUMO_Energy', 'Polarizability']
df_quantum_properties_split.columns = new_column_names

# Convert relevant columns to appropriate data types
numeric_columns = ['Electronic_E', 'Dispersion_E', 'Dipole_M', 'Metal_q', 'HL_Gap', 'HOMO_Energy', 'LUMO_Energy', 'Polarizability']
df_quantum_properties_split[numeric_columns] = df_quantum_properties_split[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Concatenate the split columns back to the original dataframe
df_quantum_properties = pd.concat([df_quantum_properties, df_quantum_properties_split], axis=1)

# Drop the original single column
df_quantum_properties.drop(columns=['CSD_code;Electronic_E;Dispersion_E;Dipole_M;Metal_q;HL_Gap;HOMO_Energy;LUMO_Energy;Polarizability'], inplace=True)

# Display the updated information and shape of the dataframe
df_quantum_properties.info()
print(df_quantum_properties.shape)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86665 entries, 0 to 86664
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CSD_code        86665 non-null  object 
 1   Electronic_E    86665 non-null  float64
 2   Dispersion_E    86665 non-null  float64
 3   Dipole_M        86665 non-null  float64
 4   Metal_q         86665 non-null  float64
 5   HL_Gap          86665 non-null  float64
 6   HOMO_Energy     86665 non-null  float64
 7   LUMO_Energy     86665 non-null  float64
 8   Polarizability  86665 non-null  float64
dtypes: float64(8), object(1)
memory usage: 6.0+ MB
(86665, 9)


In [None]:
df_molecular_geometry_X2.head(2)


Unnamed: 0,CSD_code,N_atoms,Stoichiometry,Atom,X,Y,Z
0,GIQVAG,77,C41H31IrN2O2,Ir,5.83029976772319,3.02909946046576,16.71726529330449
1,GIQVAG,77,C41H31IrN2O2,O,4.59159571147428,4.1130169163554,15.21393930222697


In [None]:
df_molecular_geometry_X2.tail()

Unnamed: 0,CSD_code,N_atoms,Stoichiometry,Atom,X,Y,Z
2856631,UNEQUB,41,C12H16N2O10Zn,C,4.40129766443463,4.05721118426694,3.60153498463601
2856632,UNEQUB,41,C12H16N2O10Zn,H,3.52730443662692,8.57570592136998,4.73729682139021
2856633,UNEQUB,41,C12H16N2O10Zn,H,4.434827922239,7.48156210366946,5.3663173189255
2856634,UNEQUB,41,C12H16N2O10Zn,H,2.08073279652386,4.58401800232308,3.58463985626821
2856635,UNEQUB,41,C12H16N2O10Zn,H,1.48575370434045,5.50548903538786,4.7091226445093


In [None]:
df_molecular_geometry_X2.info()
df_molecular_geometry_X2.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2856636 entries, 0 to 2856635
Data columns (total 7 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   CSD_code       object
 1   N_atoms        int64 
 2   Stoichiometry  object
 3   Atom           object
 4   X              object
 5   Y              object
 6   Z              object
dtypes: int64(1), object(6)
memory usage: 152.6+ MB


(2856636, 7)

# Data Exploration

## Explore quantum_properties Data:


In [None]:
df_quantum_properties.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86665 entries, 0 to 86664
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CSD_code        86665 non-null  object 
 1   Electronic_E    86665 non-null  float64
 2   Dispersion_E    86665 non-null  float64
 3   Dipole_M        86665 non-null  float64
 4   Metal_q         86665 non-null  float64
 5   HL_Gap          86665 non-null  float64
 6   HOMO_Energy     86665 non-null  float64
 7   LUMO_Energy     86665 non-null  float64
 8   Polarizability  86665 non-null  float64
dtypes: float64(8), object(1)
memory usage: 6.0+ MB


In [None]:
# Display the first few rows of the DataFrame to get an overview
print("First few rows of the DataFrame:")
df_quantum_properties.head(2)

First few rows of the DataFrame:


Unnamed: 0,CSD_code,Electronic_E,Dispersion_E,Dipole_M,Metal_q,HL_Gap,HOMO_Energy,LUMO_Energy,Polarizability
0,WIXKOE,-2045.524942,-0.239239,4.2333,2.10934,0.13108,-0.16204,-0.03096,598.457913
1,DUCVIG,-2430.690317,-0.082134,11.7544,0.75994,0.12493,-0.24358,-0.11865,277.750698


In [None]:
# Checking for missing values
print("\nMissing values in the DataFrame:")
df_quantum_properties.isnull().sum()


Missing values in the DataFrame:


CSD_code          0
Electronic_E      0
Dispersion_E      0
Dipole_M          0
Metal_q           0
HL_Gap            0
HOMO_Energy       0
LUMO_Energy       0
Polarizability    0
dtype: int64

In [None]:
# Explore summary statistics
print("\nSummary statistics of the DataFrame:")
df_quantum_properties.describe().transpose()


Summary statistics of the DataFrame:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Electronic_E,86665.0,-2952.142984,1608.798073,-29008.530471,-3509.81392,-2659.883034,-1944.605342,-295.243567
Dispersion_E,86665.0,-0.141043,0.069794,-1.122263,-0.178652,-0.128506,-0.089613,-0.004936
Dipole_M,86665.0,5.746084,3.889717,0.0,2.9352,5.3308,8.1379,81.6985
Metal_q,86665.0,0.150391,0.795228,-3.08337,-0.26333,0.24352,0.6639,2.33013
HL_Gap,86665.0,0.109266,0.033881,0.00222,0.08856,0.11017,0.13054,0.30742
HOMO_Energy,86665.0,-0.1983,0.05426,-0.44203,-0.21133,-0.18725,-0.16878,0.03984
LUMO_Energy,86665.0,-0.089034,0.055013,-0.37732,-0.10814,-0.07864,-0.05599,0.19819
Polarizability,86665.0,393.512906,151.791233,51.24996,282.781464,369.38677,478.148414,3002.513834


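The target variable is the quantity to be predicted or explained from the other variables in the dataset. It is also called the dependent variable or the response variable in statistics and machine learning. In this notebook, HL_Gap is the property attached to each compound in the feature table built in the Feature Engineering section.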

## Explore Geometry Data:




In [None]:
df_molecular_geometry_X2.head()

Unnamed: 0,CSD_code,N_atoms,Stoichiometry,Atom,X,Y,Z
0,GIQVAG,77,C41H31IrN2O2,Ir,5.83029976772319,3.02909946046576,16.71726529330449
1,GIQVAG,77,C41H31IrN2O2,O,4.59159571147428,4.1130169163554,15.21393930222697
2,GIQVAG,77,C41H31IrN2O2,O,4.10270728082861,3.44150876137087,18.05100891333301
3,GIQVAG,77,C41H31IrN2O2,N,6.6515089252403,4.64626395459036,17.52782113997698
4,GIQVAG,77,C41H31IrN2O2,N,5.00392558474341,1.43185042415695,15.85192728526944


In [None]:
# print("Missing values in df_geometry_X1:")
# print(df_molecular_geometry_X1.isnull().sum())

print("\nMissing values in df_geometry_X2:")
print(df_molecular_geometry_X2.isnull().sum())


Missing values in df_geometry_X2:
CSD_code         0
N_atoms          0
Stoichiometry    0
Atom             0
X                0
Y                0
Z                0
dtype: int64


In [None]:
# df_molecular_geometry_X1.info()
df_molecular_geometry_X2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2856636 entries, 0 to 2856635
Data columns (total 7 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   CSD_code       object
 1   N_atoms        int64 
 2   Stoichiometry  object
 3   Atom           object
 4   X              object
 5   Y              object
 6   Z              object
dtypes: int64(1), object(6)
memory usage: 152.6+ MB


# Feature Engineering

In [None]:
# @title Finding compounds with a specific element
import re

def has_specific_element(stoichiometry_str, element_symbol):
    # Match the symbol only when it is not followed by a lowercase letter,
    # so that 'C' does not match 'Cl' or 'Co', and 'N' does not match 'Ni'
    pattern = rf"{re.escape(element_symbol)}(?![a-z])"
    return re.search(pattern, stoichiometry_str) is not None

def filter_by_element(df, element_symbol):
    # Flag rows whose stoichiometry contains the requested element
    filter_series = df["Stoichiometry"].apply(lambda x: has_specific_element(x, element_symbol))

    # Filter the DataFrame based on the result of has_specific_element
    df_filtered = df[filter_series]

    # Select only the 'CSD_code', 'Atom', 'X', 'Y', and 'Z' columns
    df_filtered = df_filtered[['CSD_code', 'Atom', 'X', 'Y', 'Z']]
    return df_filtered
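
One caveat when searching formulas for an element: a plain substring test finds a one-letter symbol such as C inside Cl, Co, or Cu as well. The sketch below contrasts a substring test with a regular expression that requires the symbol not to be followed by a lowercase letter. The helper name contains_element and the example formula are illustrative assumptions.

```python
import re

def contains_element(formula, symbol):
    """True if `symbol` occurs in `formula` as a whole element symbol (hypothetical helper)."""
    # (?![a-z]) rejects matches that continue into a longer symbol, e.g. 'C' inside 'Cl'
    return re.search(rf"{re.escape(symbol)}(?![a-z])", formula) is not None

formula = "Cl2CuN2"  # made-up formula that contains no carbon

print("C" in formula)                   # True: substring test is fooled by Cl and Cu
print(contains_element(formula, "C"))   # False
print(contains_element(formula, "Cu"))  # True
```

For two-letter metal symbols such as Ni both approaches agree, because an uppercase letter always starts a new symbol in a formula.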

In [None]:
# Ask the user for the element to search for and the number of nearest neighbors to keep
selected_atom = input("Enter the element symbol to filter by: ")
num_smallest_distances = int(input("Enter the number of smallest distances to keep (the valence, i.e., how many bonds the atom has to neighboring atoms): "))

df_compound_Ni = filter_by_element(df_molecular_geometry_X2, selected_atom)

Enter the element symbol to filter by: Ni
Enter the number of smallest distances to keep (the valence, i.e., how many bonds the atom has to neighboring atoms): 4


In [None]:
n_unique_compounds = df_compound_Ni['CSD_code'].nunique()
print(f"Number of unique compounds processed: {n_unique_compounds}")



Number of unique compounds processed: 8702


In [None]:
df_compound_Ni.head(5)

Unnamed: 0,CSD_code,Atom,X,Y,Z
336304,NIEPZF,C,4.54247121742774,2.05352162514156,-6.35536427586791
336305,NIEPZF,C,8.10993142118815,1.43573480229266,-1.7417892414254
336306,NIEPZF,C,7.75541421417754,1.76588121446342,-3.0316974821638
336307,NIEPZF,C,2.53458369794125,0.7559816997419,-6.01898837874612
336308,NIEPZF,C,1.65095081280527,1.55361796255334,-5.0621322956816


In [None]:
df_compound_Ni.index.size

579743

In [None]:
some_csd_code = 'NIEPZF'
df_compound_NIEPZF = df_compound_Ni[df_compound_Ni["CSD_code"]==some_csd_code]
df_compound_NIEPZF

Unnamed: 0,CSD_code,Atom,X,Y,Z
336304,NIEPZF,C,4.54247121742774,2.05352162514156,-6.35536427586791
336305,NIEPZF,C,8.10993142118815,1.43573480229266,-1.74178924142540
336306,NIEPZF,C,7.75541421417754,1.76588121446342,-3.03169748216380
336307,NIEPZF,C,2.53458369794125,0.75598169974190,-6.01898837874612
336308,NIEPZF,C,1.65095081280527,1.55361796255334,-5.06213229568160
...,...,...,...,...,...
336360,NIEPZF,H,8.16200988671480,9.39138630740615,-5.68736435882548
336361,NIEPZF,H,8.22647607759332,8.10511425693277,-6.89455066627853
336362,NIEPZF,H,10.22783312346667,7.35243484877937,-2.20888439893288
336363,NIEPZF,H,9.15702455852157,5.96363887282689,-1.98261769460251


In [None]:
from ase.data import atomic_numbers, covalent_radii

def calc_distance_matrix(molecule):
    n = len(molecule)
    distance_matrix = np.zeros((n, n))

    for i in range(n):
        atom_i = molecule[i]
        Ri = atom_i.position

        for j in range(i, n):
            atom_j = molecule[j]
            Rj = atom_j.position
            distance = np.linalg.norm(Ri - Rj)
            distance_matrix[i, j] = distance
            distance_matrix[j, i] = distance  # Matrix is symmetric

    return distance_matrix

def calc_coulomb_matrix(molecule):
    n = len(molecule)
    coulomb_matrix = np.zeros((n, n))

    for i in range(n):
        atom_i = molecule[i]
        Zi = atomic_numbers[atom_i.symbol]  # Atomic number of atom i
        Ri = atom_i.position  # Position vector of atom i

        for j in range(i, n):
            atom_j = molecule[j]
            Zj = atomic_numbers[atom_j.symbol]  # Atomic number of atom j
            Rj = atom_j.position  # Position vector of atom j

            if i == j:
                coulomb_matrix[i, j] = 0.5 * Zi ** 2.4
            else:
                distance = np.linalg.norm(Ri - Rj)
                coulomb_value = Zi * Zj / distance
                coulomb_matrix[i, j] = coulomb_value
                coulomb_matrix[j, i] = coulomb_value  # Matrix is symmetric

    return coulomb_matrix

def calc_connectivity_matrix(molecule):
    n = len(molecule)
    connectivity_matrix = np.zeros((n, n))

    for i in range(n):
        atom_i = molecule[i]
        Ri = atom_i.position  # Position vector of atom i
        covrad_i = covalent_radii[atom_i.number]  # Covalent radius of atom i

        for j in range(i + 1, n):
            atom_j = molecule[j]
            Rj = atom_j.position  # Position vector of atom j
            covrad_j = covalent_radii[atom_j.number]  # Covalent radius of atom j

            distance = np.linalg.norm(Ri - Rj)
            threshold = 1.1 * (covrad_i + covrad_j)

            # Apply the threshold condition
            if distance <= threshold:
                connectivity_matrix[i, j] = distance
                connectivity_matrix[j, i] = distance  # Matrix is symmetric

    return connectivity_matrix
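
The double loops above are easy to follow but can be slow in pure Python for large molecules. As a sketch of an alternative, the Coulomb matrix can be computed with NumPy broadcasting. The function name coulomb_matrix_vectorized and the three-atom geometry are assumptions made for illustration; the diagonal and off-diagonal terms follow the loop version.

```python
import numpy as np

def coulomb_matrix_vectorized(Z, R):
    """Coulomb matrix from atomic numbers Z (n,) and positions R (n, 3); hypothetical helper."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # Pairwise distances via broadcasting: d[i, j] = |R_i - R_j|
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(d, 1.0)             # placeholder to avoid division by zero
    M = np.outer(Z, Z) / d               # off-diagonal: Z_i * Z_j / d_ij
    np.fill_diagonal(M, 0.5 * Z ** 2.4)  # diagonal: 0.5 * Z_i^2.4
    return M

# Made-up three-atom geometry (O, H, H); coordinates are illustrative only
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
M = coulomb_matrix_vectorized(Z, R)
print(M.shape)
```

The same pairwise distance array d (before the diagonal is overwritten) could also serve as the distance matrix, so both quantities come from a single pass.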

In [None]:
def process_molecule(df_molecule, selected_atom, num_smallest_distances=4):
    """Return the nearest-neighbor distances of the selected atom and the
    matching Coulomb matrix elements, sorted in descending order."""
    symbols = df_molecule['Atom'].to_list()
    positions = df_molecule[['X', 'Y', 'Z']].to_numpy().astype(float)
    molecule = ase.Atoms(symbols, positions)

    coulomb_matrix = calc_coulomb_matrix(molecule)
    # connectivity_matrix = calc_connectivity_matrix(molecule)
    distance_matrix = calc_distance_matrix(molecule)

    selected_atom_indices = [i for i, symbol in enumerate(symbols) if symbol == selected_atom]
    if not selected_atom_indices:
        raise ValueError(f"No '{selected_atom}' atom found in this molecule")

    # Only the first matching atom is processed; tmQM complexes are mononuclear,
    # so a metal center occurs once per molecule
    atom_index = selected_atom_indices[0]

    # Extract the corresponding row from the distance matrix
    atom_distances = distance_matrix[atom_index]

    # Sort the distances in ascending order and get the indices
    sorted_indices = np.argsort(atom_distances)

    # Select the indices of the smallest distances, skipping the first (distance to itself)
    smallest_distance_indices = sorted_indices[1:num_smallest_distances + 1]

    # Distances to the nearest neighbors, in ascending order
    smallest_distances = [atom_distances[idx] for idx in smallest_distance_indices]

    # Extract the corresponding Coulomb matrix elements
    coulomb_elements = [coulomb_matrix[atom_index, idx] for idx in smallest_distance_indices]

    # Sort the Coulomb matrix elements in descending order
    largest_coulomb_elements = sorted(coulomb_elements, reverse=True)

    return smallest_distances, largest_coulomb_elements
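
The neighbor selection in process_molecule hinges on np.argsort: sorting one row of the distance matrix puts the atom itself (distance 0) first, so a slice starting at position 1 gives the nearest neighbors. A toy row with made-up distances illustrates this:

```python
import numpy as np

# Made-up distances from a central atom (index 0) to itself and four other atoms
row = np.array([0.0, 2.2, 1.8, 3.5, 2.0])
k = 3  # number of nearest neighbors to keep

order = np.argsort(row)        # indices sorted by increasing distance
neighbor_idx = order[1:k + 1]  # skip position 0, which is the atom itself
neighbor_dist = row[neighbor_idx]

print(neighbor_idx.tolist())   # [2, 4, 1]
print(neighbor_dist.tolist())  # [1.8, 2.0, 2.2]
```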

In [None]:
# Initialize an empty list to store the rows for the new DataFrame
rows = []

# Iterate over each group of molecules by CSD_code
for csd_code, df_molecule in df_compound_Ni.groupby('CSD_code'):
    smallest_distances, largest_coulomb_elements = process_molecule(df_molecule, selected_atom, num_smallest_distances)

    # Create a row with the desired columns
    row = {
        'CSD_code': csd_code,
        'HL_Gap': df_quantum_properties.loc[df_quantum_properties['CSD_code'] == csd_code, 'HL_Gap'].values[0]
    }

    # Add columns for smallest distances
    for i, distance in enumerate(smallest_distances, start=1):
        row[f'smallest_distance{i}'] = distance

    # Add columns for largest Coulomb elements
    for i, coulomb in enumerate(largest_coulomb_elements, start=1):
        row[f'largest_coulomb{i}'] = coulomb

    rows.append(row)

# Create the new DataFrame with the desired columns
results_df = pd.DataFrame(rows)

In [None]:
results_df.head()

Unnamed: 0,CSD_code,HL_Gap,smallest_distance1,smallest_distance2,smallest_distance3,smallest_distance4,largest_coulomb1,largest_coulomb2,largest_coulomb3,largest_coulomb4
0,ABAJAP,0.05942,1.839496,1.981789,2.214582,2.232629,106.550948,98.90053,88.504283,87.788864
1,ABASEC,0.11195,1.82841,1.948955,2.00706,2.011575,107.196982,100.56669,97.655292,97.436084
2,ABEDUH,0.04748,1.86978,2.017333,2.207921,2.268885,202.9058,119.800182,111.037678,74.04518
3,ABEPAA,0.14632,1.882141,2.062249,2.201932,2.209057,230.815942,190.741565,190.126368,89.260068
4,ABEPEE,0.14839,1.885784,2.054287,2.198263,2.213726,231.710584,191.059953,189.725375,89.087629


In [None]:
save_path = '/content/drive/MyDrive/Thesis/Thesis/data/results_{}_compounds.csv'.format(selected_atom)
# Save the DataFrame
results_df.to_csv(save_path, index=False)

print(f"Results of {selected_atom} saved to CSV successfully!")

Results of Ni saved to CSV successfully!


https://www.sciencedirect.com/science/article/pii/S0010465519303042

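The Coulomb interaction is strongest at smaller distances, so the largest Coulomb matrix elements belong to the nearest neighbors. This is why we don't need to flip the sorting order of the Coulomb matrix elements: they are sorted in descending order while the distances are sorted in ascending order.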
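
For reference, the elements computed by calc_coulomb_matrix can be written as follows, where $Z_i$ is the atomic number of atom $i$ and $\mathbf{R}_i$ its position:

$$
M_{ij} =
\begin{cases}
0.5\, Z_i^{2.4} & \text{if } i = j \\
\dfrac{Z_i Z_j}{\lVert \mathbf{R}_i - \mathbf{R}_j \rVert} & \text{if } i \neq j
\end{cases}
$$

For a fixed pair of atomic numbers, the off-diagonal term decreases monotonically as the distance grows. When the neighbors have different atomic numbers, however, a heavier atom slightly farther away can give a larger element than a lighter, closer one, so the descending Coulomb order need not match the ascending distance order element by element.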

To interpret the model and understand the feature importances, we’ll use the trained Gradient Boosting Regressor. This model provides a feature_importances_ attribute that can be used to gain insight into the importance of each feature when making predictions.
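
As a minimal sketch of this idea, the block below fits scikit-learn's GradientBoostingRegressor and reads its feature_importances_ attribute. The data are synthetic and the feature names are placeholders; in this notebook's workflow the inputs would presumably be the smallest_distance and largest_coulomb columns with HL_Gap as the target, but that is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data for illustration only: the target depends mainly on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=200)

feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1; larger means more influential
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

These importances are impurity-based, so they describe how the fitted trees used each feature rather than a causal effect of that feature on the target.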