Description

This script performs the following tasks:

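    Reads a CSV file containing a column named "Smiles" (the results preview also expects a "ChEMBL ID" column).
    Defines the alkaloid substructure ([N+], a positively charged nitrogen) and the peptide bond substructure (C(=O)NCC) using SMARTS notation.
    Defines a function to check for the presence of a substructure in a molecule.
    Adds two Boolean columns, "Has_Alkaloid" and "Has_Peptide_Bond", to the DataFrame, initialized to False.
    Iterates over the DataFrame with a progress bar and records whether each molecule contains each substructure; rows whose SMILES cannot be parsed stay False.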
    Filters the DataFrame to include only rows where both substructures are present.
    Saves the filtered DataFrame to a new CSV file.


Instructions for Use

    Ensure you have the required libraries installed:

```
pip install pandas rdkit tqdm
```

    Update the input_file_path variable with the path to your input CSV file and, if needed, the output_file_path variable with the desired output location.

    Run the script. It reads the input CSV file, checks each SMILES string for the alkaloid and peptide bond substructures, prints the matching rows, and saves the rows that contain both substructures to the CSV file specified by output_file_path.
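
To try the script without a real dataset, a small input file can be generated first. The following is a minimal sketch using only the Python standard library. The helper name write_sample_input, the file name sample_input.csv, and the EXAMPLE-n identifiers are hypothetical placeholders, not real ChEMBL entries. The SMILES strings are simple hand-written molecules chosen so that one row is expected to match only the peptide bond pattern, one only the [N+] pattern, and one both; the last row is intended to be unparsable.

```python
import csv

# Hypothetical sample rows: placeholder IDs and hand-written SMILES strings.
SAMPLE_ROWS = [
    {"ChEMBL ID": "EXAMPLE-1", "Smiles": "CC(=O)NCC"},             # amide only
    {"ChEMBL ID": "EXAMPLE-2", "Smiles": "C[N+](C)(C)C"},          # charged nitrogen only
    {"ChEMBL ID": "EXAMPLE-3", "Smiles": "C[N+](C)(C)CC(=O)NCC"},  # both patterns
    {"ChEMBL ID": "EXAMPLE-4", "Smiles": "not_a_smiles"},          # invalid on purpose
]


def write_sample_input(path="sample_input.csv"):
    """Write the sample rows to a CSV file with the columns the script reads."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["ChEMBL ID", "Smiles"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


if __name__ == "__main__":
    print(f"Wrote {write_sample_input()}")
```

Pointing input_file_path at sample_input.csv should then leave only the EXAMPLE-3 row in the output, assuming the two SMARTS patterns behave as described above.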


In [None]:
import pandas as pd
from rdkit import Chem
from tqdm import tqdm

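# Define the substructures as SMARTS patterns.
# '[N+]' matches any positively charged aliphatic nitrogen (used here as the alkaloid marker).
alkaloid_substructure = Chem.MolFromSmarts('[N+]')
# 'C(=O)NCC' matches an amide group whose nitrogen is followed by two aliphatic carbons.
peptide_bond_substructure = Chem.MolFromSmarts('C(=O)NCC')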

# Function to check for substructure presence
def has_substructure(mol, substructure):
    return mol.HasSubstructMatch(substructure)

# Read CSV file with a column named "Smiles"
input_file_path = 'input.csv'  # Update with your input file path
df = pd.read_csv(input_file_path)

# Add columns for alkaloid and peptide bond presence
df['Has_Alkaloid'] = False
df['Has_Peptide_Bond'] = False

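# Iterate over the "Smiles" column with a progress bar
for index, smiles in tqdm(df['Smiles'].items(), total=len(df)):
    # Skip missing values (NaN); MolFromSmiles expects a string
    if not isinstance(smiles, str):
        continue
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None for SMILES it cannot parse; such rows stay False
    if mol is not None:
        df.at[index, 'Has_Alkaloid'] = has_substructure(mol, alkaloid_substructure)
        df.at[index, 'Has_Peptide_Bond'] = has_substructure(mol, peptide_bond_substructure)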

# Filter the DataFrame to include only rows where both substructures are present
filtered_df = df[(df['Has_Alkaloid']) & (df['Has_Peptide_Bond'])]

# Display the results
print(filtered_df[['ChEMBL ID', 'Has_Alkaloid', 'Has_Peptide_Bond']])

# Save the filtered DataFrame to a new CSV file
output_file_path = "output.csv"  # Update with your desired output file path
filtered_df.to_csv(output_file_path, index=False)

print(f"Filtered data with both alkaloid and peptide bond substructures saved to '{output_file_path}'.")
