Description:

This script performs the following tasks:

    1. Reads a CSV file containing a column named "smiles".
    2. Defines the amide group substructure as the SMARTS pattern C(=O)N.
    3. Counts the number of amide groups in each molecule represented by the SMILES strings.
    4. Adds a new column "Amide_Count" to the DataFrame to store the count of amide groups.
    5. Iterates over the DataFrame with a progress bar to update the "Amide_Count" column.
    6. Filters the DataFrame to include only molecules with two or more amide groups.
    7. Saves the filtered DataFrame to a new CSV file.
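
The counting step relies on RDKit substructure matching, so the result depends on how strictly the SMARTS pattern is written. The sketch below compares the script's pattern, C(=O)N, with a stricter pattern that also requires a carbon on the carbonyl. The stricter pattern, the helper count_matches, and the example molecules are illustrative assumptions and are not part of the script.

```python
from rdkit import Chem

# Pattern used by the script: any C(=O)N fragment.
loose = Chem.MolFromSmarts('C(=O)N')
# Assumed stricter alternative: the carbonyl carbon must also carry a carbon,
# which excludes ureas and carbamates.
strict = Chem.MolFromSmarts('[NX3][CX3](=[OX1])[#6]')

# Hypothetical example molecules chosen for illustration.
examples = {
    'acetamide': 'CC(N)=O',
    'urea': 'NC(N)=O',
    'methyl carbamate': 'COC(N)=O',
    'diamide': 'CC(=O)NCCNC(C)=O',
}

def count_matches(smiles, pattern):
    """Count unique substructure matches of pattern in a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return len(mol.GetSubstructMatches(pattern))

# Map each name to (loose count, strict count).
counts = {
    name: (count_matches(smi, loose), count_matches(smi, strict))
    for name, smi in examples.items()
}

for name, (n_loose, n_strict) in counts.items():
    print(f"{name}: loose={n_loose}, strict={n_strict}")
```

With the script's pattern, urea is counted twice (once per nitrogen) and a carbamate is counted once, so such molecules can pass the "two or more" filter without containing two true amides. Whether that matters depends on the dataset.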


Instructions for Use

    Ensure you have the required libraries installed:

'''
        pip install pandas rdkit tqdm
'''

Update the input_file_path and output_file_path variables with the paths to your input and output CSV files.

Run the script. It reads the input CSV file, counts the amide groups in each SMILES string, and saves the rows with two or more amide groups to the CSV file specified by output_file_path.
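
To try the script without a real dataset, a small input file can be generated first. This is a minimal sketch: the sample molecules, the extra "name" column, and the temporary output location are assumptions made for illustration. Only the "smiles" column is required by the script.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; only the "smiles" column is read by the script.
sample = pd.DataFrame({
    'name': ['diamide', 'ethanol', 'acetamide'],
    'smiles': ['CC(=O)NCCNC(C)=O', 'CCO', 'CC(N)=O'],
})

# Write to a temporary directory so no existing file is overwritten.
input_file_path = os.path.join(tempfile.mkdtemp(), 'input.csv')
sample.to_csv(input_file_path, index=False)

# Read the file back the same way the script does and check the required column.
df = pd.read_csv(input_file_path)
print(df.columns.tolist())
```

Pointing input_file_path in the script at this file should leave only the first row after filtering, since it is the only sample molecule with two C(=O)N matches.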


import pandas as pd
from rdkit import Chem
from tqdm import tqdm

# Define the amide group substructure
amide_substructure = Chem.MolFromSmarts('C(=O)N')

# Function to count the number of amide groups in a molecule
def count_amide_groups(mol):
    matches = mol.GetSubstructMatches(amide_substructure)
    return len(matches)

# Read CSV file with a column named "smiles"
input_file_path = 'input.csv'  # Update with your input file path
df = pd.read_csv(input_file_path)

# Add a nullable integer column; rows that cannot be processed stay missing (<NA>)
df['Amide_Count'] = pd.Series(pd.NA, index=df.index, dtype='Int64')

# Iterate over the DataFrame with a progress bar
for index, row in tqdm(df.iterrows(), total=len(df)):
    smiles = row['smiles']
    if not isinstance(smiles, str):  # Skip missing or non-string entries (e.g. NaN)
        continue
    mol = Chem.MolFromSmiles(smiles)  # Returns None if RDKit cannot parse the SMILES
    if mol is not None:
        df.at[index, 'Amide_Count'] = count_amide_groups(mol)

# Filter the DataFrame to include only molecules with two or more amide groups
filtered_df = df[df['Amide_Count'] >= 2]

# Save the filtered DataFrame to a new CSV file
output_file_path = "output.csv"  # Update with your desired output file path
filtered_df.to_csv(output_file_path, index=False)

print(f"Filtered data with Amide_Count >= 2 saved to '{output_file_path}'.")
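
The row-by-row loop can also be written with Series.apply, which keeps parsing and counting in one function and makes the handling of unusable entries explicit. The following is a sketch under stated assumptions, not the script's own method: the helper amide_count_or_na and the demo DataFrame are hypothetical names introduced here.

```python
import pandas as pd
from rdkit import Chem

AMIDE = Chem.MolFromSmarts('C(=O)N')  # same pattern as the script

def amide_count_or_na(smiles):
    """Return the number of pattern matches, or pd.NA if the entry is unusable."""
    if not isinstance(smiles, str):
        return pd.NA  # missing or non-string entry
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return pd.NA  # RDKit could not parse the SMILES
    return len(mol.GetSubstructMatches(AMIDE))

# Hypothetical demo data: a diamide, ethanol, a missing value, and an invalid string.
demo = pd.DataFrame({'smiles': ['CC(=O)NCCNC(C)=O', 'CCO', None, 'not_a_smiles']})

# Nullable Int64 keeps counts as integers while allowing missing values.
demo['Amide_Count'] = demo['smiles'].apply(amide_count_or_na).astype('Int64')

# Treat missing counts as 0 for the purpose of the filter.
kept = demo[demo['Amide_Count'].fillna(0) >= 2]
print(kept)
```

For large files, tqdm also offers a pandas integration (tqdm.pandas() followed by progress_apply) that restores the progress bar in this style.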
