# **Computational Drug Discovery Download Bioactivity Data**
Ngceboyakwethu Primrose Zinyama

In this Jupyter notebook, bioactivity data from PubChem are collected and preprocessed.

## **PubChem Database**

PubChem is the world's largest collection of freely accessible chemical information. Search chemicals by name, molecular formula, structure, and other identifiers. Find chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more.
Data as of February 6, 2025

## **Installing libraries**

Install the pubchempy package so that we can retrieve bioactivity data from the PubChem Database.

In [1]:
! pip install pubchempy


## **Installing pandas**

pandas is a Python library for loading, cleaning and analysing tabular data. simplejson, installed in the second cell, is a JSON encoder and decoder.

In [None]:
# Dataframe library
! pip install pandas

In [4]:
! pip install simplejson

## **Importing libraries**

In [2]:
# Import the libraries used to fetch and process the PubChem data
import pandas as pd
import simplejson
import requests
import pubchempy as pcp
import csv

### **Getting a CSV file from PubChem for human gamma-secretase inhibitors with nicastrin bioactivity**

The PUG-REST request used to download the file was put together with the help of this page:
 https://iupac.github.io/WFChemCookbook/datasources/pubchem_pugrest1.html

 PUG stands for Power User Gateway, which encompasses several variants of methods for programmatic access to PubChem data and services. This REST-style interface is intended to be a simple access route to PubChem for things like scripts, javascript embedded in web pages, and 3rd party applications, without the overhead of XML, SOAP envelopes, etc. that are required for other versions of PUG. PUG REST also provides convenient access to information on PubChem records that is not possible with any other service.
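
As a rough illustration of how such a request is assembled, the sketch below joins the four parts of a PUG-REST URL: prolog, input specification, operation and output format. The helper name `build_pugrest_url` is hypothetical and not part of any library; the accession Q92542 and the `concise` operation are the ones used in this notebook.

```python
# Hypothetical helper (not part of requests or pubchempy): assembles a PUG-REST URL.
PUGREST_PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pugrest_url(input_spec, operation, output_format):
    """Join prolog, input specification, operation and output format with '/'."""
    parts = (PUGREST_PROLOG, input_spec, operation, output_format)
    # Strip stray slashes so that no part introduces an empty path segment.
    return "/".join(part.strip("/") for part in parts)


url = build_pugrest_url("protein/accession/Q92542", "concise", "csv")
print(url)
```

Keeping the four parts separate makes it easy to swap the target accession or the output format without retyping the whole URL.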

## **Construct a PUG-REST request URL and retrieve data**

In [None]:
pugrest_prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
pugrest_input = "protein/accession/Q92542"  # UniProt accession of human nicastrin
pugrest_operation = "concise"
pugrest_output = "csv"

pugrest_url = "/".join((pugrest_prolog, pugrest_input, pugrest_operation, pugrest_output))
print("REQUEST URL:", pugrest_url)

res = requests.get(pugrest_url)
print("OUTPUT    :", res.text.strip())

In [None]:
print("REQUEST URL:", pugrest_url)

response = requests.get(pugrest_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content of the response to a local CSV file
    with open("downloaded_data.csv", "wb") as f:
        f.write(response.content)
    print("CSV file downloaded successfully")

    # Read the CSV file into a pandas DataFrame only after a successful download
    df = pd.read_csv("downloaded_data.csv")
else:
    print("Failed to download CSV file. Status code:", response.status_code)

In [7]:
import pandas as pd
load = pd.read_csv("downloaded_data.csv")
load.head()  # call the method; without parentheses only the bound method is shown

           baid     activity      aid        sid        cid   geneid  \
0      99544644       Active    45082  134461073   56681654  23385.0   
1      99544673       Active    45082  103437680   44386767  23385.0   
2      99544679       Active    45082  103437853   15344717  23385.0   
3      99544685       Active    45082  103438207   12147040  23385.0   
4      99544742       Active    45082  103437123   44386506  23385.0   

            pmid             aidtype  aidmdate  hasdrc  ...  repacxn taxid  \
0     15050631.0        Con

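## **How to select specific CSV columns using Python and pandas**

We are interested in the columns cid (compound ID), cmpdname (compound name), activity (activity outcome), acname (activity name, e.g. IC50), acvalue (activity value) and aidtype (assay type).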
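
A small, self-contained sketch of the same idea: the `usecols` argument of `pandas.read_csv` keeps only the listed columns at load time, and indexing with a list of names puts them in the requested order. The two rows reuse a few values shown in this notebook, but the compound names are placeholders and the column set is shortened for illustration.

```python
import io

import pandas as pd

# Toy CSV with a few of the columns returned by PubChem; compound names are placeholders.
raw_csv = io.StringIO(
    "baid,activity,aid,cid,cmpdname,acname,acvalue,aidtype\n"
    "99544644,Active,45082,56681654,compound A,IC50,0.27,Confirmatory\n"
    "99544673,Active,45082,44386767,compound B,IC50,0.25,Confirmatory\n"
)

wanted = ["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]

# Keep only the wanted columns while reading the file.
subset = pd.read_csv(raw_csv, usecols=wanted)

# usecols preserves the column order of the file, so reorder explicitly.
subset = subset[wanted]
print(subset.columns.tolist())
```

Selecting at load time avoids holding unused columns in memory, which can matter for larger exports.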

In [13]:
df = df[['cid', 'cmpdname', 'activity','acname','acvalue','aidtype']]

## **How to save the new CSV file with just the required columns**

In [14]:
# Save next to the notebook so that later cells can read 'nct_pubchem.csv' by its relative path
df.to_csv('nct_pubchem.csv', index=False)

In [16]:
df2 = df[df.acname.notna()]
df2

Unnamed: 0,cid,cmpdname,activity,acname,acvalue,aidtype
0,56681654,"methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...",Active,IC50,0.27,Confirmatory
1,44386767,"2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...",Active,IC50,0.25,Confirmatory
2,15344717,"(S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...",Active,IC50,3.10,Confirmatory
3,12147040,"(R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...",Active,IC50,0.65,Confirmatory
4,44386506,"(S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...",Active,IC50,0.23,Confirmatory
...,...,...,...,...,...,...
5001,162649237,N-(2-ethylpyrazol-3-yl)-4-[6-methoxy-5-(4-meth...,Unspecified,IC50,1.00,Confirmatory
5002,162646677,4-[6-methoxy-5-(4-methylimidazol-1-yl)pyridin-...,Unspecified,IC50,1.00,Confirmatory
5003,126599753,5-(4-chlorophenyl)-6-cyclopropyl-3-[6-methoxy-...,Unspecified,IC50,10.00,Confirmatory
5016,107715,Dihydroergocristine,Unspecified,IC50,25.00,Confirmatory


## **Filtering and cleaning the dataset**

In [None]:
# Keep only the rows whose acname (activity name) is IC50
import pandas as pd

file_path = "nct_pubchem.csv"
data = pd.read_csv(file_path)

data = data.query('acname == "IC50"')
data = data[["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]]

# index=False keeps the row index out of the saved file
data.to_csv("filtered_nct_pubchem.csv", index=False)
print(data)

            cid                                           cmpdname  \
0      56681654  methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...   
1      44386767  2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...   
2      15344717  (S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...   
3      12147040  (R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...   
4      44386506  (S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...   
...         ...                                                ...   
5000   89657814  N-(2-ethyl-4,5,6,7-tetrahydroindazol-3-yl)-4-[...   
5001  162649237  N-(2-ethylpyrazol-3-yl)-4-[6-methoxy-5-(4-meth...   
5002  162646677  4-[6-methoxy-5-(4-methylimidazol-1-yl)pyridin-...   
5003  126599753  5-(4-chlorophenyl)-6-cyclopropyl-3-[6-methoxy-...   
5016     107715                                Dihydroergocristine   

         activity acname  acvalue       aidtype  
0          Active   IC50     0.27  Confirmatory  
1          Active   IC50     0.25  Confirmatory  
2        

In [41]:
# Keep only the rows whose aidtype (assay type) is Confirmatory
import pandas as pd

file_path = "filtered_nct_pubchem.csv"
data = pd.read_csv(file_path)

data = data.query('aidtype == "Confirmatory"')
data = data[["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]]

# index=False keeps the row index out of the saved file
data.to_csv("filtered_nct_pubchem1.csv", index=False)
print(data)

            cid                                           cmpdname  \
0      56681654  methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...   
1      44386767  2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...   
2      15344717  (S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...   
3      12147040  (R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...   
4      44386506  (S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...   
...         ...                                                ...   
3528   89657814  N-(2-ethyl-4,5,6,7-tetrahydroindazol-3-yl)-4-[...   
3529  162649237  N-(2-ethylpyrazol-3-yl)-4-[6-methoxy-5-(4-meth...   
3530  162646677  4-[6-methoxy-5-(4-methylimidazol-1-yl)pyridin-...   
3531  126599753  5-(4-chlorophenyl)-6-cyclopropyl-3-[6-methoxy-...   
3532     107715                                Dihydroergocristine   

         activity acname  acvalue       aidtype  
0          Active   IC50     0.27  Confirmatory  
1          Active   IC50     0.25  Confirmatory  
2        

In [42]:
# Keep only the rows whose activity is Active or Inactive
import pandas as pd

file_path = "filtered_nct_pubchem1.csv"
data = pd.read_csv(file_path)

data = data.query('activity == "Active" or activity == "Inactive"')
data = data[["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]]

# index=False keeps the row index out of the saved file
data.to_csv("filtered_nct_pubchem2.csv", index=False)
print(data)

            cid                                           cmpdname  activity  \
0      56681654  methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...    Active   
1      44386767  2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...    Active   
2      15344717  (S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...    Active   
3      12147040  (R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...    Active   
4      44386506  (S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...    Active   
...         ...                                                ...       ...   
2957   11269353                                         Begacestat    Active   
2958   57327010                                    Unii-PX8XQ3H3RV    Active   
2959  160302852  tert-butyl N-[(2S,3R,5R)-6-[[(4S,7R)-8-amino-7...    Active   
2987  137174942  1-benzyl-7-(3-methyl-1,2,4-triazol-1-yl)-5,10-...  Inactive   
2988  137174952  1-benzyl-7-(4-chloroimidazol-1-yl)-5,10-dihydr...  Inactive   

     acname  acvalue       aidtype  
0 
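
The three filtering cells above each write a CSV file and read it back. As a possible simplification, the same three conditions can be combined into a single boolean mask. This is only a sketch on toy rows (the compound names and values are placeholders, not the notebook's data):

```python
import pandas as pd

# Toy rows standing in for nct_pubchem.csv; names and values are placeholders.
data = pd.DataFrame(
    {
        "cid": [1, 2, 3, 4],
        "cmpdname": ["A", "B", "C", "D"],
        "activity": ["Active", "Unspecified", "Inactive", "Active"],
        "acname": ["IC50", "IC50", "IC50", "Ki"],
        "acvalue": [0.27, 1.0, 25.0, 0.5],
        "aidtype": ["Confirmatory", "Confirmatory", "Confirmatory", "Other"],
    }
)

# Combine the three conditions; each comparison yields a boolean Series.
mask = (
    (data["acname"] == "IC50")
    & (data["aidtype"] == "Confirmatory")
    & data["activity"].isin(["Active", "Inactive"])
)

# reset_index gives the filtered table a clean 0..n-1 index.
filtered = data[mask].reset_index(drop=True)
print(filtered)
```

Only rows 1 and 3 of the toy table satisfy all three conditions, so no intermediate files are needed to reach the final subset.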

## **Data pre-processing of the bioactivity data**


The pIC50 is calculated following dataprofessor's example in https://github.com/dataprofessor/code/blob/master/python/CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb. IC50 values given in µM must first be converted to M by dividing by 1,000,000; the pIC50 is then the negative base-10 logarithm of that molar value.
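
As a quick sanity check of the conversion, here is a sketch that uses only the standard library. For an IC50 given in µM, pIC50 = -log10(IC50 × 10⁻⁶) = 6 - log10(IC50). The function name `pic50_from_um` is illustrative and not taken from the referenced notebook.

```python
import math


def pic50_from_um(ic50_um):
    """Convert an IC50 in micromolar to pIC50; returns None for non-positive input."""
    if ic50_um <= 0:
        return None  # log10 is undefined for zero or negative concentrations
    ic50_molar = ic50_um * 1e-6  # µM -> M
    return -math.log10(ic50_molar)


# 1 µM is 1e-6 M, so its pIC50 is 6; 0.27 µM is one of the IC50 values in this dataset.
print(round(pic50_from_um(1.0), 6))
print(round(pic50_from_um(0.27), 6))
```

The second value agrees with the pIC50 of 6.568636 listed for the 0.27 µM compound in the preprocessed table.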

In [3]:
import pandas as pd
import numpy as np

# Function to convert IC50 (uM) to pIC50
def ic50_to_pic50(ic50_um):
    # Convert IC50 from µM to M
    ic50_m = ic50_um / 1_000_000
    
    # Avoid negative or zero IC50 values
    if ic50_m <= 0:
        return np.nan  # Return NaN if IC50 is zero or negative
    
    # Calculate pIC50 using the IC50 in M
    pic50_value = -np.log10(ic50_m)
    return pic50_value

# Read the CSV file
input_file = 'filtered_nct_pubchem2.csv'  # Replace with the path to your input CSV file
output_file = 'pIC50_nct_pubchem.csv'  # Path for the output file

# Load the CSV into a DataFrame
df = pd.read_csv(input_file)

# Column that holds the IC50 values (in µM) in the PubChem export
ic50_column = 'acvalue'

# Convert the IC50 values to pIC50 values
df['pIC50'] = df[ic50_column].apply(ic50_to_pic50)

# Save the new DataFrame with pIC50 values to a new CSV file
df.to_csv(output_file, index=False)

print(f"pIC50 values have been saved to {output_file}")

pIC50 values have been saved to pIC50_nct_pubchem.csv


### **Labeling compounds as either being active, inactive or intermediate**

The bioactivity data are now expressed as pIC50, which is a unitless value. In this data set the pIC50 values range from 4.3 to 11.7. Compounds with a pIC50 of 8 or higher are labelled **active**, those with a pIC50 of 7 or lower are labelled **inactive**, and those with values between 7 and 8 are labelled **intermediate**.

In [5]:
# Keep only the rows that have an activity value
df2 = df[df.acvalue.notna()]

bioactivity_class = []
for i in df2.pIC50:
    if float(i) >= 8:
        bioactivity_class.append("active")
    elif float(i) <= 7:
        bioactivity_class.append("inactive")
    else:
        bioactivity_class.append("intermediate")
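
One detail of the loop above: comparisons with NaN are always False, so a missing pIC50 would fall through to "intermediate". The sketch below wraps the same cut-offs (8 or higher active, 7 or lower inactive) in a small function that treats missing values separately. The function name `classify_pic50` and the choice to return None for missing values are assumptions made for illustration.

```python
import math


def classify_pic50(pic50):
    """Label a pIC50 value using the cut-offs from the cell above."""
    if pic50 is None or math.isnan(pic50):
        return None  # no usable measurement, so no label
    if pic50 >= 8:
        return "active"
    if pic50 <= 7:
        return "inactive"
    return "intermediate"


# The first three values appear in this notebook's preprocessed table.
labels = [classify_pic50(v) for v in (6.568636, 7.962574, 8.207608, float("nan"))]
print(labels)
```

With pandas, the function could be applied as `df2.pIC50.apply(classify_pic50)`.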

### **Iterate the *cid* to a list**

In [8]:
cid = []
for i in df2.cid:
  cid.append(i)

### **Iterate the *cmpdname* to a list**

In [9]:
cmpdname = []
for i in df2.cmpdname:
  cmpdname.append(i)

### **Iterate the *pIC50* to a list**

In [10]:
pIC50 = []
for i in df2.pIC50:
  pIC50.append(i)

### **Iterate the *acvalue* to a list**

In [11]:
acvalue = []
for i in df2.acvalue:
  acvalue.append(i)

### **Combine the 5 lists into a dataframe**

In [12]:
data_tuples = list(zip(cid, cmpdname, bioactivity_class, acvalue, pIC50))
df3 = pd.DataFrame( data_tuples,  columns=['cid', 'cmpdname', 'bioactivity_class', 'acvalue', 'pIC50'])
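
The list-building loops and `zip` above work, but the same table can also be obtained directly from the DataFrame. The sketch below assumes that `df2` holds the `cid`, `cmpdname`, `acvalue` and `pIC50` columns and that `bioactivity_class` is a list of equal length. It uses toy stand-ins with placeholder compound names so that it runs on its own.

```python
import pandas as pd

# Toy stand-ins for df2 and bioactivity_class from the cells above.
df2 = pd.DataFrame(
    {
        "cid": [56681654, 57327010],
        "cmpdname": ["compound A", "compound B"],
        "acvalue": [0.27, 0.00027],
        "pIC50": [6.568636, 9.568636],
        "aidtype": ["Confirmatory", "Confirmatory"],
    },
    index=[0, 2958],  # a filtered frame usually has gaps in its index
)
bioactivity_class = ["inactive", "active"]

# Select the columns, reset the index, then attach the labels by position.
df3 = (
    df2[["cid", "cmpdname", "acvalue", "pIC50"]]
    .reset_index(drop=True)
    .assign(bioactivity_class=bioactivity_class)
)

# Reorder to match the column layout used in the notebook.
df3 = df3[["cid", "cmpdname", "bioactivity_class", "acvalue", "pIC50"]]
print(df3)
```

Resetting the index first keeps the result identical to the `zip` approach, which also ignores the original row labels.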

In [13]:
df3

Unnamed: 0,cid,cmpdname,bioactivity_class,acvalue,pIC50
0,56681654,"methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...",inactive,0.27000,6.568636
1,44386767,"2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...",inactive,0.25000,6.602060
2,15344717,"(S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...",inactive,3.10000,5.508638
3,12147040,"(R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...",inactive,0.65000,6.187087
4,44386506,"(S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...",inactive,0.23000,6.638272
...,...,...,...,...,...
2955,9843750,Semagacestat,intermediate,0.01090,7.962574
2956,73441910,"2-[(1S)-1-[(2S,5R)-5-[4-chloro-5-fluoro-2-(tri...",active,0.00620,8.207608
2957,11269353,Begacestat,intermediate,0.01500,7.823909
2958,57327010,Unii-PX8XQ3H3RV,active,0.00027,9.568636


Save the dataframe to a CSV file.

In [14]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('bioactivity_preprocessed_data.csv')

In [4]:
df.head()

Unnamed: 0,cid,cmpdname,bioactivity_class,acvalue,pIC50
0,56681654,"methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...",inactive,0.27,6.568636
1,44386767,"2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...",inactive,0.25,6.60206
2,15344717,"(S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...",inactive,3.1,5.508638
3,12147040,"(R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...",inactive,0.65,6.187087
4,44386506,"(S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...",inactive,0.23,6.638272


In [5]:
import pubchempy as pcp

## **Getting canonical SMILES**
Fetch the canonical SMILES of all 2960 compounds from the PubChem database. The get_properties function of pubchempy takes the requested property names as a list (here ['CanonicalSMILES']), followed by the identifier and its namespace ('cid').

In [None]:
data = []

for i in df['cid']:
    props = pcp.get_properties(['CanonicalSMILES'], i, 'cid')
    data.append(props)
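
Calling the service once per CID means roughly 2960 separate requests. `pcp.get_properties` also accepts a list of identifiers, so the CIDs could be sent in chunks instead. The sketch below shows only the chunking logic with an injectable `fetch` callable, so it runs without network access. The helper names, the batch size of 100 and the `fake_fetch` stand-in (which returns a placeholder SMILES) are all assumptions made for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(cids, fetch, batch_size=100):
    """Call `fetch` once per batch of CIDs and concatenate the returned records."""
    records = []
    for batch in chunked(list(cids), batch_size):
        records.extend(fetch(batch))
    return records


def fake_fetch(batch):
    """Offline stand-in that mimics the list-of-dicts shape of a property lookup."""
    return [{"CID": cid, "CanonicalSMILES": "C"} for cid in batch]


records = fetch_in_batches(range(250), fake_fetch, batch_size=100)
print(len(records))

# With pubchempy, the fetch callable could look like this (requires network access):
# fetch = lambda batch: pcp.get_properties(["CanonicalSMILES"], batch, "cid")
```

Because the result is a flat list of dicts, it can be passed straight to `pd.DataFrame(records)`.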

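## **Fetch the original properties from the original dataframe**

In [None]:
# Flatten the per-CID results into one table (assumed step; the cell that built props_df is not shown)
props_df = pd.DataFrame([record for props in data for record in props])

props_df.insert(1, 'cid', df['cid'], True)
props_df['pIC50'] = df['pIC50']
props_df['acvalue'] = df['acvalue']
props_df['bioactivity_class'] = df['bioactivity_class']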

In [None]:
props_df.to_csv('nct_with_smiles.csv', index=False)

In [7]:
df = pd.read_csv('nct_with_smiles.csv')

In [9]:
df.head()

Unnamed: 0,CID,CanonicalSMILES,bioactivity_class,pIC50
0,56681654,CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...,inactive,6.568636
1,44386767,CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...,inactive,6.60206
2,15344717,CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...,inactive,5.508638
3,12147040,CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...,inactive,6.187087
4,44386506,CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...,inactive,6.638272


## **Removing duplicates**

In [25]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools

# Load the dataset; the SMILES strings are stored in the 'CanonicalSMILES' column
df = pd.read_csv('nct_with_smiles.csv')

# Function to get canonical SMILES from a SMILES string
def canonicalize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        return Chem.MolToSmiles(mol, canonical=True)
    else:
        return None  # In case the SMILES string is invalid

# Apply the canonicalization function to the SMILES column
df['Canonical_SMILES'] = df['CanonicalSMILES'].apply(canonicalize_smiles)

# Remove rows where the canonical SMILES are duplicates
df_cleaned = df.drop_duplicates(subset=['Canonical_SMILES'])

# Display the number of rows before and after removing duplicates
print(f"Number of rows before removing duplicates: {df.shape[0]}")
print(f"Number of rows after removing duplicates: {df_cleaned.shape[0]}")

# Save the cleaned dataset to a new CSV file (optional)
df_cleaned.to_csv('nct_no_duplicates.csv', index=False)

# Optionally, you can check the first few rows of the cleaned data
print(df_cleaned.head())

Number of rows before removing duplicates: 2960
Number of rows after removing duplicates: 2064
        CID                                    CanonicalSMILES  \
0  56681654  CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...   
1  44386767  CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...   
2  15344717  CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...   
3  12147040  CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...   
4  44386506  CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...   

  bioactivity_class     pIC50  \
0          inactive  6.568636   
1          inactive  6.602060   
2          inactive  5.508638   
3          inactive  6.187087   
4          inactive  6.638272   

                                    Canonical_SMILES  
0  CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...  
1  COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...  
2  COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...  
3  COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...  
4  COC(=O)C(NC(=O)C(CC(C)C)NC(=O)N

In [26]:
df_cleaned.head()

Unnamed: 0,CID,CanonicalSMILES,bioactivity_class,pIC50,Canonical_SMILES
0,56681654,CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...,inactive,6.568636,CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...
1,44386767,CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...,inactive,6.60206,COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...
2,15344717,CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...,inactive,5.508638,COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...
3,12147040,CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...,inactive,6.187087,COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...
4,44386506,CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...,inactive,6.638272,COC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1ccccc1)CC(O...


## **Standardising the dataset**
Standardizing a dataset in the context of cheminformatics typically involves the following steps:

- Canonicalizing SMILES: This ensures that the molecular representation (SMILES) is in a standard form, removing any inconsistencies in notation (e.g., different ways of writing the same structure).
- Removing salts and fragments: Sometimes, datasets contain salt or fragment molecules that need to be excluded.
- Handling stereochemistry: Standardization might involve removing stereochemical information if it's not needed.
- Normalizing charges: Ensuring that molecules are neutral or have consistent charge states (if necessary).
- Handling tautomeric forms: Converting to a single tautomeric form, if applicable.
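
Several of these steps can be sketched with RDKit's `rdMolStandardize` module. The function below is a minimal illustration, not the procedure used in the cells that follow: it keeps the largest fragment, neutralizes charges where possible, optionally drops stereochemistry and writes canonical SMILES. The function name `standardize_smiles` and the `keep_stereo` flag are assumptions, and tautomer handling is left out.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles, keep_stereo=True):
    """Return a standardized canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Keep the largest fragment, which drops counter-ions such as Na+ or Cl-.
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize formal charges where adding or removing a proton allows it.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    if not keep_stereo:
        Chem.RemoveStereochemistry(mol)  # modifies the molecule in place
    # MolToSmiles writes canonical SMILES by default.
    return Chem.MolToSmiles(mol)


print(standardize_smiles("CC(=O)[O-].[Na+]"))  # sodium acetate
```

Applied to a DataFrame column, this could be used as `df['CanonicalSMILES'].apply(standardize_smiles)`, with rows that return None dropped afterwards.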

In [28]:
from rdkit.Chem import Descriptors

df = pd.read_csv('nct_no_duplicates.csv')

# Function to check if a molecule is a fragment based on molecular weight and number of atoms
def is_fragment(mol, min_atoms=6, min_molecular_weight=50.0):
    """Filter out small molecules likely to be fragments based on molecular weight and atom count"""
    if mol:
        # Check the number of atoms and molecular weight of the molecule
        num_atoms = mol.GetNumAtoms()
        mol_weight = Descriptors.MolWt(mol)
        
        # A molecule is considered a fragment if it has fewer atoms or lower molecular weight than the thresholds
        if num_atoms < min_atoms or mol_weight < min_molecular_weight:
            return True
    return False

# Step 1: Convert SMILES to RDKit molecule objects and filter out fragments
valid_mols = []

for smiles in df['CanonicalSMILES']:
    mol = Chem.MolFromSmiles(smiles)
    if mol and not is_fragment(mol):
        valid_mols.append(mol)
    else:
        valid_mols.append(None)  # Mark fragments as None

# Step 2: Store the molecules in a new column (fragments are marked as None)
df['Valid_Molecule'] = valid_mols

# Step 3: Filter out rows where molecules are identified as fragments (None)
df_cleaned = df.dropna(subset=['Valid_Molecule'])

# Step 4: Display results
print(f"Number of rows before filtering fragments: {df.shape[0]}")
print(f"Number of rows after filtering fragments: {df_cleaned.shape[0]}")

# Step 5: Save the cleaned dataset to a new CSV file (optional)
df_cleaned.to_csv('nct_no_fragments.csv', index=False)

# Optionally, you can check the first few rows of the cleaned data
print(df_cleaned.head())

Number of rows before filtering fragments: 2064
Number of rows after filtering fragments: 2064
        CID                                    CanonicalSMILES  \
0  56681654  CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...   
1  44386767  CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...   
2  15344717  CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...   
3  12147040  CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...   
4  44386506  CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...   

  bioactivity_class     pIC50  \
0          inactive  6.568636   
1          inactive  6.602060   
2          inactive  5.508638   
3          inactive  6.187087   
4          inactive  6.638272   

                                    Canonical_SMILES  \
0  CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
1  COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...   
2  COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...   
3  COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
4  COC(=O)C(NC(=O)C(CC(C)C)NC

In [30]:
# Define common counter-ions to filter, written as bracketed SMILES so that RDKit can parse them
# (unbracketed forms such as 'Cl-' and shorthand such as '[NO3-]' are not valid SMILES)
common_ions = ['[Cl-]', '[Na+]', '[K+]', 'S(=O)(=O)[O-]', '[O-][N+](=O)[O-]',
               '[Br-]', '[I-]', '[F-]', 'C(=O)([O-])[O-]']

# Parse the ion patterns once instead of once per molecule
ion_mols = [Chem.MolFromSmiles(ion) for ion in common_ions]

df = pd.read_csv('nct_no_fragments.csv')

# Function to check if a molecule is likely a salt based on its components
def is_salt(mol, ions=ion_mols):
    """Filter out molecules likely to be salts based on their components."""
    if mol:
        # Get the disconnected components (fragments) of the compound
        fragments = Chem.GetMolFrags(mol, asMols=True)

        # If the molecule has more than 1 fragment, it's likely a salt
        if len(fragments) > 1:
            return True

        # Check if any of the common ions (e.g., sulfonate, nitrate) are part of the molecule
        for ion_mol in ions:
            if ion_mol is not None and mol.HasSubstructMatch(ion_mol):
                return True
    return False

# Step 1: Convert SMILES to RDKit molecule objects and filter out salts
valid_mols = []

for smiles in df['CanonicalSMILES']:
    mol = Chem.MolFromSmiles(smiles)
    if mol and not is_salt(mol):
        valid_mols.append(mol)
    else:
        valid_mols.append(None)  # Mark salts as None

# Step 2: Store the molecules in a new column (salts are marked as None)
df['Valid_Molecule'] = valid_mols

# Step 3: Filter out rows where molecules are identified as salts (None)
df_cleaned = df.dropna(subset=['Valid_Molecule'])

# Step 4: Display results
print(f"Number of rows before filtering salts: {df.shape[0]}")
print(f"Number of rows after filtering salts: {df_cleaned.shape[0]}")

# Step 5: Save the cleaned dataset to a new CSV file (optional)
df_cleaned.to_csv('nct_no_salts.csv', index=False)

# Optionally, you can check the first few rows of the cleaned data
print(df_cleaned.head())

Number of rows before filtering salts: 2064
Number of rows after filtering salts: 1995
        CID                                    CanonicalSMILES  \
0  56681654  CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...   
1  44386767  CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...   
2  15344717  CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...   
3  12147040  CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...   
4  44386506  CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...   

  bioactivity_class     pIC50  \
0          inactive  6.568636   
1          inactive  6.602060   
2          inactive  5.508638   
3          inactive  6.187087   
4          inactive  6.638272   

                                    Canonical_SMILES  \
0  CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
1  COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...   
2  COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...   
3  COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
4  COC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc

In [36]:
# Define common sugar-like substructures (e.g., glucose, fructose)
# The original glucose pattern was missing a closing parenthesis and was listed twice,
# so RDKit could not parse it; the fructofuranose pattern is added here as a second example
sugar_smiles = [
    'C1(C(C(C(C(O1)CO)O)O)O)O',  # Glucose ring (pyranose form)
    'OCC1OC(O)(CO)C(O)C1O',      # Fructose ring (furanose form)
    # Add more sugar SMILES if needed
]

# Parse the sugar patterns once instead of once per molecule
sugar_mols = [Chem.MolFromSmiles(sugar) for sugar in sugar_smiles]

df = pd.read_csv('nct_no_salts.csv')

# Function to check if a molecule is likely a sugar based on substructure matching
def is_sugar(mol, sugar_patterns=sugar_mols):
    """Filter out molecules likely to be sugars based on substructure patterns."""
    if mol:
        # Check if the molecule matches any of the sugar patterns (e.g., glucose or fructose)
        for sugar_mol in sugar_patterns:
            if sugar_mol is not None and mol.HasSubstructMatch(sugar_mol):
                return True
    return False

# Step 1: Convert SMILES to RDKit molecule objects and filter out sugars
valid_mols = []

for smiles in df['CanonicalSMILES']:
    mol = Chem.MolFromSmiles(smiles)
    if mol and not is_sugar(mol):
        valid_mols.append(mol)
    else:
        valid_mols.append(None)  # Mark sugars as None

# Step 2: Store the molecules in a new column (sugars are marked as None)
df['Valid_Molecule'] = valid_mols

# Step 3: Filter out rows where molecules are identified as sugars (None)
df_cleaned = df.dropna(subset=['Valid_Molecule'])

# Step 4: Display results
print(f"Number of rows before filtering sugars: {df.shape[0]}")
print(f"Number of rows after filtering sugars: {df_cleaned.shape[0]}")

# Step 5: Save the cleaned dataset to a new CSV file (optional)
df_cleaned.to_csv('cleaned_nct_no_sugars.csv', index=False)

# Optionally, you can check the first few rows of the cleaned data
print(df_cleaned.head())

Number of rows before filtering sugars: 1995
Number of rows after filtering sugars: 1995
        CID                                    CanonicalSMILES  \
0  56681654  CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...   
1  44386767  CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...   
2  15344717  CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...   
3  12147040  CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...   
4  44386506  CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...   

  bioactivity_class     pIC50  \
0          inactive  6.568636   
1          inactive  6.602060   
2          inactive  5.508638   
3          inactive  6.187087   
4          inactive  6.638272   

                                    Canonical_SMILES  \
0  CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
1  COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...   
2  COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...   
3  COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
4  COC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(

In [37]:
from rdkit.Chem.MolStandardize import rdMolStandardize

# Uncharger removes formal charges by adding or removing protons where possible.
# Chem.AddHs alone only makes hydrogens explicit and leaves formal charges unchanged,
# so it is replaced here by RDKit's Uncharger.
uncharger = rdMolStandardize.Uncharger()

# Function to neutralize a molecule
def neutralize_molecule(mol):
    """Neutralize the molecule by removing formal charges where possible."""
    if mol is None:
        return None
    return uncharger.uncharge(mol)

# Function to process the dataset and neutralize all compounds
def neutralize_dataset(df):
    """Neutralize all molecules in the dataset."""
    neutralized_smiles = []

    for smiles in df['CanonicalSMILES']:
        mol = Chem.MolFromSmiles(smiles)

        if mol:
            # Neutralize the molecule
            neutralized_mol = neutralize_molecule(mol)

            if neutralized_mol:
                # Convert the neutralized molecule to SMILES and add to the list
                neutralized_smiles.append(Chem.MolToSmiles(neutralized_mol, canonical=True))
            else:
                neutralized_smiles.append(None)  # Invalid molecule after neutralization
        else:
            neutralized_smiles.append(None)  # Invalid SMILES

    # Add the neutralized SMILES as a new column
    df['Neutralized_SMILES'] = neutralized_smiles

    # Remove rows with invalid molecules (None)
    df_cleaned = df.dropna(subset=['Neutralized_SMILES'])

    return df_cleaned

# Load the dataset produced by the sugar-filtering step
df = pd.read_csv('cleaned_nct_no_sugars.csv')

# Neutralize the dataset
df_neutralized = neutralize_dataset(df)

# Display the results
print(f"Number of rows before neutralization: {df.shape[0]}")
print(f"Number of rows after neutralization: {df_neutralized.shape[0]}")

# Optionally, save the neutralized dataset to a new CSV file
df_neutralized.to_csv('curated_dataset.csv', index=False)

# Optionally, check the first few rows of the cleaned data
print(df_neutralized.head())

Number of rows before neutralization: 1995
Number of rows after neutralization: 1995
        CID                                    CanonicalSMILES  \
0  56681654  CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...   
1  44386767  CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...   
2  15344717  CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...   
3  12147040  CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...   
4  44386506  CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...   

  bioactivity_class     pIC50  \
0          inactive  6.568636   
1          inactive  6.602060   
2          inactive  5.508638   
3          inactive  6.187087   
4          inactive  6.638272   

                                    Canonical_SMILES  \
0  CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
1  COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...   
2  COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...   
3  COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
4  COC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1c
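
The identical row counts above only show that every SMILES string could be parsed; they do not confirm that any charge was actually removed. A minimal sketch of a more direct check follows, assuming RDKit's `rdMolStandardize.Uncharger` is used for neutralization. The three SMILES strings are hypothetical examples, not compounds from the dataset, and `net_charge` is a helper introduced here for illustration.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def net_charge(smiles):
    """Return the net formal charge of a SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.GetFormalCharge(mol)


# Hypothetical examples: a carboxylate, a protonated amine, a quaternary ammonium
examples = ["CC(=O)[O-]", "C[NH3+]", "C[N+](C)(C)C"]

uncharger = rdMolStandardize.Uncharger()
results = {}
for smi in examples:
    mol = Chem.MolFromSmiles(smi)
    # Uncharge, then write the result back out as canonical SMILES
    neutral_smiles = Chem.MolToSmiles(uncharger.uncharge(mol))
    results[smi] = (neutral_smiles, net_charge(neutral_smiles))
    print(f"{smi} -> {neutral_smiles} (net charge {results[smi][1]})")
```

Applying `net_charge` to the `Neutralized_SMILES` column and counting non-zero values would show how many compounds still carry a charge. Some residual charges are expected: a quaternary ammonium has no hydrogen to remove, so it stays charged.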

In [39]:
# Placeholder substructure patterns used to illustrate the filtering step.
# NOTE: these are NOT validated PAINS alerts; replace them with a curated
# PAINS set before drawing conclusions from the filtered data.
pains_patterns = [
    "[*]C([*])([*])C([*])([*])",  # Placeholder pattern, not a curated PAINS alert
    "c1ccc2c(c1)cc(c2C(=O)O)",     # Placeholder; the "Can't kekulize mol" messages in the output
                                   # are consistent with this pattern failing to parse, so it is skipped
    "C1CCCC1",                     # Placeholder: cyclopentane ring (not a PAINS alert)
    "C1=CC2=C(C=C1)C(=O)N2",       # Placeholder: benzo-fused lactam ring
    # Add more patterns as needed
]

# Parse each pattern once instead of once per molecule; patterns that fail to parse are dropped
pains_mols = [m for m in (Chem.MolFromSmiles(p) for p in pains_patterns) if m is not None]

# Function to check if a molecule matches any of the placeholder substructures
def is_pains(mol):
    """Check if a molecule contains any of the placeholder substructures."""
    return any(mol.HasSubstructMatch(pattern) for pattern in pains_mols)

# Function to filter out compounds with PAINS substructures from the dataset
def remove_pains_compounds(df):
    """Remove compounds with PAINS substructures from the dataset."""
    valid_smiles = []
    
    for smiles in df['CanonicalSMILES']:  # Assuming the column name is 'CanonicalSMILES'
        mol = Chem.MolFromSmiles(smiles)
        
        if mol and not is_pains(mol):
            valid_smiles.append(smiles)
        else:
            valid_smiles.append(None)
    
    # Add the valid SMILES to a new column
    df['Valid_SMILES'] = valid_smiles
    
    # Remove rows with invalid SMILES (i.e., PAINS or invalid molecules)
    df_cleaned = df.dropna(subset=['Valid_SMILES'])
    
    return df_cleaned

# Load your dataset
df = pd.read_csv('curated_dataset.csv')

# Filter out PAINS compounds
df_no_pains = remove_pains_compounds(df)

# Display the results
print(f"Number of rows before removing PAINS: {df.shape[0]}")
print(f"Number of rows after removing PAINS: {df_no_pains.shape[0]}")

# Optionally, save the filtered dataset to a new CSV file
df_no_pains.to_csv('filtered_no_pains_nct.csv', index=False)

# Optionally, check the first few rows of the cleaned data
print(df_no_pains.head())

[12:52:21] Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4 5 6 7 8

Number of rows before removing PAINS: 1995
Number of rows after removing PAINS: 1791
        CID                                    CanonicalSMILES  \
0  56681654  CCC(C)C(C(=O)OC)NC(=O)C(C(C)C)NC(=O)C(CC(C)C)N...   
1  44386767  CC(C)CC(C(=O)NC(C(C)C)C(=O)NC(CC1=CC=CC=C1)C(=...   
2  15344717  CC(C)CC(C(=O)NC(CC1=CC=CC=C1)C(=O)OC)NC(=O)N(C...   
3  12147040  CC(C)CC(C(=O)NC(CC(C)C)C(=O)OC)NC(=O)N(CC1=CC=...   
4  44386506  CC(C)CC(C(=O)NC(C(C)C)C(=O)OC)NC(=O)N(CC1=CC=C...   

  bioactivity_class     pIC50  \
0          inactive  6.568636   
1          inactive  6.602060   
2          inactive  5.508638   
3          inactive  6.187087   
4          inactive  6.638272   

                                    Canonical_SMILES  \
0  CCC(C)C(NC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
1  COC(=O)C(Cc1ccccc1)NC(=O)C(NC(=O)C(CC(C)C)NC(=...   
2  COC(=O)C(Cc1ccccc1)NC(=O)C(CC(C)C)NC(=O)N(C)CC...   
3  COC(=O)C(CC(C)C)NC(=O)C(CC(C)C)NC(=O)N(Cc1cccc...   
4  COC(=O)C(NC(=O)C(CC(C)C)NC(=O)N(Cc1c

[12:52:22] Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4 5 6 7 8
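
The pattern list in the cell above is described in its own comments as a set of placeholders, so the 1995 to 1791 reduction should not be read as a true PAINS filter. One possible alternative is RDKit's built-in PAINS catalog, sketched below. The helper name `pains_alerts` and the example SMILES strings are assumptions introduced for illustration; the notebook itself does not use this approach.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog containing RDKit's bundled PAINS filters
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains_catalog = FilterCatalog(params)


def pains_alerts(smiles):
    """Return the names of matching PAINS alerts, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [entry.GetDescription() for entry in pains_catalog.GetMatches(mol)]


# Illustrative inputs (not from the dataset): a hydrazone-phenol compound and cyclopentane
flagged = pains_alerts("O=C(Cn1cnc2c1c(=O)n(C)c(=O)n2C)N/N=C/c1c(O)ccc2c1cccc2")
clean = pains_alerts("C1CCCC1")
print("Hydrazone-phenol alerts:", flagged)
print("Cyclopentane alerts:", clean)
```

To filter the dataset with this sketch, one option would be to keep only the rows where `pains_alerts` returns an empty list for `CanonicalSMILES`. Rows where it returns `None` are unparseable and can be dropped separately, which keeps the two reasons for removal distinct.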