This code automates the preparation of molecules for quantum chemical calculations with XTB, starting from SMILES strings in an Excel file. It works as follows:

Data Loading: It reads the Excel file CATMoS_TrainingSet_with_ld50_cid_smiles.xlsx into a pandas DataFrame. It then skips the first two rows and keeps the rest (df.iloc[2:]) for processing.

Output Directory Setup: It creates a base directory named xtb_ohess_catmos to store the results, ensuring it exists before proceeding.

process_smiles Function: This is the core function that handles each molecule:

It creates a dedicated subdirectory for each molecule using its 'CID' (Compound ID).
It defines filenames for the XYZ structure and XTB output within this molecule-specific directory.
Duplicate Check: It first checks if the XYZ file for a given molecule already exists. If it does, it skips processing that molecule to avoid redundant work.
SMILES to 3D Structure: It takes a SMILES string, converts it into an RDKit molecule object, adds explicit hydrogen atoms, and then generates initial 3D coordinates using the ETKDG algorithm. If 3D coordinate generation fails, it raises an error.
Geometry Optimization (UFF): It performs a preliminary geometry optimization using the Universal Force Field (UFF) within RDKit.
XYZ File Generation: The optimized 3D structure is then converted into an XYZ format string.
Save XYZ: This XYZ content is saved to a structure.xyz file in the molecule's directory.
XTB Execution: Finally, it changes into the molecule's directory and runs xtb structure.xyz --ohess, which optimizes the geometry and then computes the Hessian, redirecting the output to optimization.out.
Error Handling: It wraps these steps in a try-except block that catches and reports any error raised while processing a specific SMILES string, so one failing molecule does not stop the batch.

Batch Processing Loop: The code then iterates through each row of the loaded DataFrame. For each row, it extracts the SMILES string (from the 'Canonical_QSARr' column) and the 'CID', and calls the process_smiles function to process that specific molecule.

In essence, this script generates 3D molecular structures from SMILES strings, performs a preliminary UFF geometry optimization, and then uses XTB to run a more accurate geometry optimization and Hessian calculation for a batch of compounds, organizing the outputs into CID-specific folders.
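
One consequence of the duplicate check described above is that it keys on structure.xyz, which is written before XTB starts. A molecule whose XTB run was interrupted would therefore be skipped on a rerun. The sketch below shows one possible stricter check. It assumes that a completed optimization leaves a non-empty xtbopt.xyz in the molecule directory; the helper name `is_finished` and the choice of marker file are illustrative and are not part of the original notebook.

```python
import os

def is_finished(mol_dir, marker="xtbopt.xyz"):
    """Return True if the marker file exists in mol_dir and is non-empty.

    The default marker is an assumption about what a completed xtb
    optimization leaves behind; change it to match your xtb version.
    """
    path = os.path.join(mol_dir, marker)
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

Inside process_smiles, a call such as `if is_finished(mol_dir): return` could replace the test on structure.xyz, so that only molecules with a finished XTB result are skipped.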

In [None]:
! pip install rdkit

! pip install xtb   # Does not work in Colab; not needed if the notebook is run locally
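
Because the batch cell calls the xtb command-line program, it can help to confirm that an executable named xtb is reachable before starting a long run. The following is a minimal sketch using only the standard library; the helper name `find_executable` is hypothetical, and the suggestion to install from conda-forge is an assumption about your environment rather than something this notebook requires.

```python
import shutil

def find_executable(name="xtb"):
    """Return the full path of `name` if it is found on PATH, otherwise None."""
    return shutil.which(name)

xtb_path = find_executable("xtb")
if xtb_path is None:
    # Assumption: the command-line program can be installed separately,
    # for example with `conda install -c conda-forge xtb`.
    print("xtb was not found on PATH; the XTB step would fail for every molecule.")
else:
    print(f"Using xtb at {xtb_path}")
```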

In [None]:
import pandas as pd
import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.rdmolfiles import MolToXYZBlock

In [None]:
# Load the SMILES table from the Excel file (adjust the path if the file lives elsewhere)
df = pd.read_excel("CATMoS_TrainingSet_with_ld50_cid_smiles.xlsx")

# Skip the first two rows and process the rest
# (adjust the slice to pick a batch, e.g. df.iloc[0:9] for the first 9 molecules)
df = df.iloc[2:]

# Base directory for output
base_output_dir = "xtb_ohess_catmos"
os.makedirs(base_output_dir, exist_ok=True)

# Function to process each SMILES string with its corresponding CID
def process_smiles(smiles, cid):
    try:
        # Create molecule directory using CID, formatted as an integer
        mol_dir = os.path.join(base_output_dir, f"CID_{int(cid)}")
        os.makedirs(mol_dir, exist_ok=True)

        # Define filenames within molecule directory
        xyz_filename = os.path.join(mol_dir, "structure.xyz")
        out_filename = os.path.join(mol_dir, "optimization.out")  # Same name as the XTB redirect below

        # Check if the XYZ file already exists to avoid overwriting
        if os.path.exists(xyz_filename):
            print(f"Skipping XYZ file creation for CID_{int(cid)} as it already exists.")
            return # Skip further processing for this molecule

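        # Generate RDKit molecule object (MolFromSmiles returns None for unparsable SMILES)
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"RDKit could not parse SMILES {smiles}")
        mol = Chem.AddHs(mol)  # Add explicit hydrogens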

        # Generate 3D coordinates
        if AllChem.EmbedMolecule(mol, AllChem.ETKDG()) != 0:
            raise ValueError(f"RDKit failed to generate 3D coordinates for {smiles}")

        # Optimize geometry using UFF (Universal Force Field)
        AllChem.UFFOptimizeMolecule(mol)

        # Convert to XYZ format
        xyz_content = MolToXYZBlock(mol)

        # Save XYZ file
        with open(xyz_filename, "w") as f:
            f.write(xyz_content)

        print(f"XYZ file created: {xyz_filename}")

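        # Run XTB optimization inside the molecule directory and report a non-zero exit status
        status = os.system(f'cd "{mol_dir}" && xtb structure.xyz --ohess > optimization.out')
        if status != 0:
            print(f"XTB exited with status {status} for CID_{int(cid)}; see optimization.out")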

    except Exception as e:
        print(f"Error processing SMILES {smiles} (CID: {cid}): {e}")

# Apply function to selected SMILES with their CIDs
for _, row in df.iterrows():
    smiles = row["Canonical_QSARr"]
    cid = row["CID"]
    process_smiles(smiles, cid)

print("\nGeometry optimization of this batch is done.")
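
The cell above launches XTB through os.system with a shell `cd` and a redirect. As an alternative, the sketch below uses subprocess.run with the `cwd` argument, which avoids building a shell string and also captures stderr in the log. The helper name `run_logged` and its parameters are illustrative and are not part of the original notebook.

```python
import os
import subprocess
import sys

def run_logged(cmd, workdir, log_name="optimization.out"):
    """Run cmd (a list of arguments) inside workdir, logging stdout and stderr.

    Returns the exit code of the process, or None if the executable
    or the working directory could not be found.
    """
    log_path = os.path.join(workdir, log_name)
    try:
        with open(log_path, "w") as log:
            result = subprocess.run(
                cmd, cwd=workdir, stdout=log, stderr=subprocess.STDOUT
            )
    except FileNotFoundError:
        return None
    return result.returncode

# Intended call inside process_smiles (not executed here):
# status = run_logged(["xtb", "structure.xyz", "--ohess"], mol_dir)
```

A return value of 0 indicates that the program exited normally, while None or a non-zero value suggests that the log file for that CID should be inspected.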

In [None]:
import os

# List the contents of the base output directory to verify creation
print(f"Contents of '{base_output_dir}':")
!ls -R "{base_output_dir}"
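
A recursive listing becomes hard to read once the batch is large. The sketch below is one possible way to summarize which CID folders are missing expected files. The function name `summarize_outputs` is hypothetical, and the default file names are taken from the cell that writes structure.xyz and optimization.out.

```python
import os

def summarize_outputs(base_dir, expected=("structure.xyz", "optimization.out")):
    """Map each CID_* subdirectory of base_dir to the expected files it lacks."""
    missing = {}
    for name in sorted(os.listdir(base_dir)):
        mol_dir = os.path.join(base_dir, name)
        # Only consider the per-molecule folders created by process_smiles
        if not (name.startswith("CID_") and os.path.isdir(mol_dir)):
            continue
        missing[name] = [
            f for f in expected if not os.path.isfile(os.path.join(mol_dir, f))
        ]
    return missing

# Report incomplete folders if the output directory exists
if os.path.isdir("xtb_ohess_catmos"):
    for name, absent in summarize_outputs("xtb_ohess_catmos").items():
        if absent:
            print(f"{name}: missing {', '.join(absent)}")
```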