# **Computational Drug Discovery: Download Bioactivity Data**
Ngceboyakwethu Primrose Zinyama

In this Jupyter notebook, PubChem bioactivity data will be collected and preprocessed.

## **PubChem Database**

PubChem is the world's largest collection of freely accessible chemical information. Search chemicals by name, molecular formula, structure, and other identifiers. Find chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more.
Data as of February 6, 2025

## **Installing libraries**

Install the pubchempy package so that we can retrieve bioactivity data from the PubChem Database.

In [1]:
! pip install pubchempy

Defaulting to user installation because normal site-packages is not writeable





## **Installing pandas**

pandas is a Python library for loading, cleaning, and analysing tabular datasets.

In [None]:
# Dataframe library
! pip install pandas

In [4]:
! pip install simplejson

Defaulting to user installation because normal site-packages is not writeable
Collecting simplejson
  Downloading simplejson-3.19.3-cp313-cp313-win_amd64.whl.metadata (3.2 kB)
Downloading simplejson-3.19.3-cp313-cp313-win_amd64.whl (75 kB)
Installing collected packages: simplejson
Successfully installed simplejson-3.19.3

## **Importing libraries**

In [2]:
# Import necessary libraries
# Fetch data through PubChem
import pandas as pd
import simplejson
import requests
import pubchempy as pcp
import csv

### **Getting a CSV file from PubChem for human gamma-secretase inhibitors with nicastrin bioactivity**

The following page was used as a guide for constructing the PUG-REST request that downloads the file:
https://iupac.github.io/WFChemCookbook/datasources/pubchem_pugrest1.html

PUG stands for Power User Gateway, which encompasses several variants of methods for programmatic access to PubChem data and services. This REST-style interface is intended to be a simple access route to PubChem for things like scripts, JavaScript embedded in web pages, and third-party applications, without the overhead of XML, SOAP envelopes, etc. that are required for other versions of PUG. PUG REST also provides convenient access to information on PubChem records that is not possible with any other service.

## **Construct a PUG-REST request URL and retrieve data**

In [None]:
# Build the request URL from its four parts: prolog, input, operation, output
pugrest_prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
pugrest_input = "protein/accession/Q92542"
pugrest_operation = "concise"
pugrest_output = "csv"

pugrest_url = "/".join((pugrest_prolog, pugrest_input, pugrest_operation, pugrest_output))
print("REQUEST URL:", pugrest_url)

res = requests.get(pugrest_url)
print("OUTPUT    :", res.text.strip())
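The four-part URL pattern used above (prolog, input, operation, output) can be wrapped in a small helper so that other requests reuse it. This is only a sketch: the function name `build_pugrest_url` and the constant `PUGREST_PROLOG` are hypothetical names introduced here, and no network request is made.

```python
# Hypothetical helper: joins the PUG-REST URL parts described in this notebook.
PUGREST_PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pugrest_url(input_spec, operation, output):
    """Return prolog/input/operation/output joined with single slashes."""
    parts = [PUGREST_PROLOG, input_spec, operation, output]
    # Strip stray slashes so the parts never produce '//' inside the path
    return "/".join(part.strip("/") for part in parts)


url = build_pugrest_url("protein/accession/Q92542", "concise", "csv")
print(url)
# prints https://pubchem.ncbi.nlm.nih.gov/rest/pug/protein/accession/Q92542/concise/csv
```

The resulting string can be passed to `requests.get` exactly as in the cell above.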

In [None]:
print("REQUEST URL:", pugrest_url)

response = requests.get(pugrest_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content of the response to a local CSV file
    with open("downloaded_data.csv", "wb") as f:
        f.write(response.content)
    print("CSV file downloaded successfully")
else:
    print("Failed to download CSV file. Status code:", response.status_code)

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv("downloaded_data.csv")
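Note that the cell above reads `downloaded_data.csv` even when the download failed, so an older copy of the file could be loaded unnoticed. One possible way around this, sketched here under the assumption that the response body is plain CSV text, is to parse the text in memory. The helper name `csv_text_to_frame` and the sample text are illustrative only.

```python
import io

import pandas as pd


def csv_text_to_frame(csv_text):
    """Parse CSV text (for example `response.text`) into a DataFrame in memory."""
    return pd.read_csv(io.StringIO(csv_text))


# Stand-in for a response body; these values are made up for illustration
sample_text = "cid,activity,acvalue\n1,Active,0.5\n2,Inactive,\n"
sample_df = csv_text_to_frame(sample_text)
print(sample_df.shape)  # (2, 3)

# With a real response one might write (not executed here):
#   response.raise_for_status()   # raises if the HTTP status signals an error
#   df = csv_text_to_frame(response.text)
```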

In [7]:
import pandas as pd
load = pd.read_csv("downloaded_data.csv")
# Print a preview of the table (first and last rows)
print(load)

           baid     activity      aid        sid        cid   geneid  \
0      99544644       Active    45082  134461073   56681654  23385.0   
1      99544673       Active    45082  103437680   44386767  23385.0   
2      99544679       Active    45082  103437853   15344717  23385.0   
3      99544685       Active    45082  103438207   12147040  23385.0   
4      99544742       Active    45082  103437123   44386506  23385.0   
...         ...          ...      ...        ...        ...      ...   
5013  380626213  Unspecified  1872942  482051457  168272247  23385.0   
5014  380626219  Unspecified  1872942  482069019   22204430  23385.0   
5015  380626264  Unspecified  1872941  482051457  168272247  23385.0   
5016  407928578  Unspecified  1929078  103714659     107715  23385.0   
5017  407928617  Unspecified  1929079  103189275       9651  23385.0   

            pmid             aidtype  aidmdate  hasdrc  ...  repacxn taxid  \
0     15050631.0        Con

## **How to Select Specific CSV Columns Using Python and Pandas**

We are interested in the columns cid, cmpdname, activity, acname, acvalue, and aidtype.

In [13]:
df = df[['cid', 'cmpdname', 'activity','acname','acvalue','aidtype']]
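If a requested column is absent, the selection above fails with a `KeyError`. A slightly more defensive variant, sketched below, reports which columns are missing before selecting. The helper `select_columns`, the list `WANTED`, and the toy frame are assumptions made for illustration.

```python
import pandas as pd

WANTED = ["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]


def select_columns(frame, columns):
    """Return a copy of `frame` restricted to `columns`, naming any missing ones."""
    missing = [name for name in columns if name not in frame.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return frame.loc[:, columns].copy()


# Toy frame with one extra column (baid) that should be dropped
toy = pd.DataFrame({
    "baid": [99544644],
    "cid": [56681654],
    "cmpdname": ["example compound"],
    "activity": ["Active"],
    "acname": ["IC50"],
    "acvalue": [0.27],
    "aidtype": ["Confirmatory"],
})
subset = select_columns(toy, WANTED)
print(list(subset.columns))
```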

## **How to save the new CSV file with just the required columns**

In [14]:
df.to_csv('C:/jupiter/nct_pubchem.csv', index=False)

In [16]:
df2 = df[df.acname.notna()]
df2

Unnamed: 0,cid,cmpdname,activity,acname,acvalue,aidtype
0,56681654,"methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...",Active,IC50,0.27,Confirmatory
1,44386767,"2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...",Active,IC50,0.25,Confirmatory
2,15344717,"(S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...",Active,IC50,3.10,Confirmatory
3,12147040,"(R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...",Active,IC50,0.65,Confirmatory
4,44386506,"(S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...",Active,IC50,0.23,Confirmatory
...,...,...,...,...,...,...
5001,162649237,N-(2-ethylpyrazol-3-yl)-4-[6-methoxy-5-(4-meth...,Unspecified,IC50,1.00,Confirmatory
5002,162646677,4-[6-methoxy-5-(4-methylimidazol-1-yl)pyridin-...,Unspecified,IC50,1.00,Confirmatory
5003,126599753,5-(4-chlorophenyl)-6-cyclopropyl-3-[6-methoxy-...,Unspecified,IC50,10.00,Confirmatory
5016,107715,Dihydroergocristine,Unspecified,IC50,25.00,Confirmatory


## **Filtering and cleaning the dataset**

In [None]:
# Keep only the rows whose activity name (acname) is IC50
import pandas as pd

file_path = "nct_pubchem.csv"
data = pd.read_csv(file_path)

data = data.query('acname == "IC50"')
data = data[["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]]

# index=False keeps the row index out of the saved file
data.to_csv("filtered_nct_pubchem.csv", index=False)
print(data)

            cid                                           cmpdname  \
0      56681654  methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...   
1      44386767  2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...   
2      15344717  (S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...   
3      12147040  (R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...   
4      44386506  (S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...   
...         ...                                                ...   
5000   89657814  N-(2-ethyl-4,5,6,7-tetrahydroindazol-3-yl)-4-[...   
5001  162649237  N-(2-ethylpyrazol-3-yl)-4-[6-methoxy-5-(4-meth...   
5002  162646677  4-[6-methoxy-5-(4-methylimidazol-1-yl)pyridin-...   
5003  126599753  5-(4-chlorophenyl)-6-cyclopropyl-3-[6-methoxy-...   
5016     107715                                Dihydroergocristine   

         activity acname  acvalue       aidtype  
0          Active   IC50     0.27  Confirmatory  
1          Active   IC50     0.25  Confirmatory  
2        

In [41]:
# Keep only the rows whose assay type (aidtype) is Confirmatory
import pandas as pd

file_path = "filtered_nct_pubchem.csv"
data = pd.read_csv(file_path)

data = data.query('aidtype == "Confirmatory"')
data = data[["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]]

# index=False keeps the row index out of the saved file
data.to_csv("filtered_nct_pubchem1.csv", index=False)
print(data)

            cid                                           cmpdname  \
0      56681654  methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...   
1      44386767  2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...   
2      15344717  (S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...   
3      12147040  (R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...   
4      44386506  (S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...   
...         ...                                                ...   
3528   89657814  N-(2-ethyl-4,5,6,7-tetrahydroindazol-3-yl)-4-[...   
3529  162649237  N-(2-ethylpyrazol-3-yl)-4-[6-methoxy-5-(4-meth...   
3530  162646677  4-[6-methoxy-5-(4-methylimidazol-1-yl)pyridin-...   
3531  126599753  5-(4-chlorophenyl)-6-cyclopropyl-3-[6-methoxy-...   
3532     107715                                Dihydroergocristine   

         activity acname  acvalue       aidtype  
0          Active   IC50     0.27  Confirmatory  
1          Active   IC50     0.25  Confirmatory  
2        

In [42]:
# Keep only the rows whose activity is Active or Inactive
import pandas as pd

file_path = "filtered_nct_pubchem1.csv"
data = pd.read_csv(file_path)

data = data.query('activity == "Active" or activity == "Inactive"')
data = data[["cid", "cmpdname", "activity", "acname", "acvalue", "aidtype"]]

# index=False keeps the row index out of the saved file
data.to_csv("filtered_nct_pubchem2.csv", index=False)
print(data)

            cid                                           cmpdname  activity  \
0      56681654  methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...    Active   
1      44386767  2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...    Active   
2      15344717  (S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...    Active   
3      12147040  (R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...    Active   
4      44386506  (S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...    Active   
...         ...                                                ...       ...   
2957   11269353                                         Begacestat    Active   
2958   57327010                                    Unii-PX8XQ3H3RV    Active   
2959  160302852  tert-butyl N-[(2S,3R,5R)-6-[[(4S,7R)-8-amino-7...    Active   
2987  137174942  1-benzyl-7-(3-methyl-1,2,4-triazol-1-yl)-5,10-...  Inactive   
2988  137174952  1-benzyl-7-(4-chloroimidazol-1-yl)-5,10-dihydr...  Inactive   

     acname  acvalue       aidtype  
0 
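The three filters above each write an intermediate CSV file and read it back. As a possible alternative, the same three conditions can be combined into one boolean mask applied in memory. This is a sketch only: the function name `filter_bioactivity` and the toy rows are illustrative and are not taken from the downloaded data.

```python
import pandas as pd


def filter_bioactivity(frame):
    """Keep confirmatory IC50 rows that are labelled Active or Inactive."""
    mask = (
        (frame["acname"] == "IC50")
        & (frame["aidtype"] == "Confirmatory")
        & frame["activity"].isin(["Active", "Inactive"])
    )
    # Renumber the surviving rows from 0 so later steps see a clean index
    return frame.loc[mask].reset_index(drop=True)


# Toy rows: only cid 1 and cid 5 satisfy all three conditions
toy = pd.DataFrame({
    "cid": [1, 2, 3, 4, 5],
    "acname": ["IC50", "IC50", "Ki", "IC50", "IC50"],
    "aidtype": ["Confirmatory", "Confirmatory", "Confirmatory", "Other", "Confirmatory"],
    "activity": ["Active", "Unspecified", "Active", "Inactive", "Inactive"],
})
filtered = filter_bioactivity(toy)
print(filtered["cid"].tolist())  # [1, 5]
```

Filtering in memory avoids stale intermediate files, at the cost of not keeping a saved copy of each stage.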

## **Data pre-processing of the bioactivity data**


The pIC50 is calculated following dataprofessor's example at https://github.com/dataprofessor/code/blob/master/python/CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb. IC50 values reported in µM must first be converted to M by dividing by 1,000,000; the pIC50 is then the negative base-10 logarithm of that molar value, that is, pIC50 = -log10(IC50 in M).

In [3]:
import pandas as pd
import numpy as np

# Function to convert IC50 (uM) to pIC50
def ic50_to_pic50(ic50_um):
    # Convert IC50 from µM to M
    ic50_m = ic50_um / 1_000_000
    
    # Avoid negative or zero IC50 values
    if ic50_m <= 0:
        return np.nan  # Return NaN if IC50 is zero or negative
    
    # Calculate pIC50 using the IC50 in M
    pic50_value = -np.log10(ic50_m)
    return pic50_value

# Read the CSV file
input_file = 'filtered_nct_pubchem2.csv'  # Replace with the path to your input CSV file
output_file = 'pIC50_nct_pubchem.csv'  # Path for the output file

# Load the CSV into a DataFrame
df = pd.read_csv(input_file)

# Name of the column that holds the IC50 values (in µM); change it if necessary
ic50_column = 'acvalue'

# Convert the IC50 values to pIC50 values
df['pIC50'] = df[ic50_column].apply(ic50_to_pic50)

# Save the new DataFrame with pIC50 values to a new CSV file
df.to_csv(output_file, index=False)

print(f"pIC50 values have been saved to {output_file}")

pIC50 values have been saved to pIC50_nct_pubchem.csv
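As a possible alternative to applying the scalar function row by row, the same conversion can be sketched on a whole column at once. The function name `ic50_um_to_pic50` is hypothetical. The two sample IC50 values (0.27 and 0.00027 µM) appear in the tables of this notebook, and the zero and missing entries are added to show how unusable values are handled.

```python
import numpy as np
import pandas as pd


def ic50_um_to_pic50(ic50_um):
    """Convert IC50 values in µM to pIC50; non-positive or missing values give NaN."""
    values = pd.Series(ic50_um, dtype="float64")
    # Mask zero and negative entries so the logarithm is never taken of them
    positive = values.where(values > 0)
    # µM -> M, then the negative base-10 logarithm
    return -np.log10(positive / 1_000_000)


pic50 = ic50_um_to_pic50([0.27, 0.00027, 0.0, float("nan")])
print(pic50.round(6).tolist())
```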


### **Labeling compounds as either being active, inactive or intermediate**

The bioactivity data are now expressed as pIC50, which is a unitless value. The inhibitory potencies in the data set range from 4.3 to 11.7. Compounds with a pIC50 greater than or equal to 8 are considered **active**, those with a pIC50 less than or equal to 7 are considered **inactive**, and those with values between 7 and 8 are referred to as **intermediate**.

In [5]:
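# Keep only rows with a usable pIC50 (missing or non-positive IC50 values give NaN,
# which would otherwise fall through to the "intermediate" branch)
df2 = df[df.pIC50.notna()]

# Assign a class label from the pIC50 thresholds described above
bioactivity_class = []
for i in df2.pIC50:
    if float(i) >= 8:
        bioactivity_class.append("active")
    elif float(i) <= 7:
        bioactivity_class.append("inactive")
    else:
        bioactivity_class.append("intermediate")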
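The same thresholds can also be applied to the whole column with `numpy.select`, which picks the first matching condition for each element. This is a sketch, not the notebook's method: the function name `classify_pic50` is hypothetical, and the three sample pIC50 values are taken from the result table in this notebook. A NaN input would fall into the default "intermediate" class, so rows without a pIC50 should be dropped first.

```python
import numpy as np
import pandas as pd


def classify_pic50(pic50):
    """Label pIC50 values: >= 8 active, <= 7 inactive, otherwise intermediate."""
    values = pd.Series(pic50, dtype="float64")
    conditions = [(values >= 8).to_numpy(), (values <= 7).to_numpy()]
    choices = ["active", "inactive"]
    labels = np.select(conditions, choices, default="intermediate")
    return pd.Series(labels, index=values.index, name="bioactivity_class")


labels = classify_pic50([9.568636, 6.568636, 7.823909])
print(labels.tolist())  # ['active', 'inactive', 'intermediate']
```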

### **Iterate the *cid* to a list**

In [8]:
cid = []
for i in df2.cid:
  cid.append(i)

### **Iterate the *cmpdname* to a list**

In [9]:
cmpdname = []
for i in df2.cmpdname:
  cmpdname.append(i)

### **Iterate the *pIC50* to a list**

In [10]:
pIC50 = []
for i in df2.pIC50:
  pIC50.append(i)

### **Iterate the *acvalue* to a list**

In [11]:
acvalue = []
for i in df2.acvalue:
  acvalue.append(i)

### **Combine the 5 lists into a dataframe**

In [12]:
data_tuples = list(zip(cid, cmpdname, bioactivity_class, acvalue, pIC50))
df3 = pd.DataFrame( data_tuples,  columns=['cid', 'cmpdname', 'bioactivity_class', 'acvalue', 'pIC50'])
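The intermediate lists are not strictly needed: a frame with the same five columns can be built directly from the filtered frame plus the class labels. The sketch below assumes the labels are in the same row order as the frame. The helper name `build_summary` is hypothetical, and the two sample rows reuse values shown in this notebook's output.

```python
import pandas as pd


def build_summary(frame, labels):
    """Return cid, cmpdname, bioactivity_class, acvalue, pIC50 with a fresh index."""
    out = frame.loc[:, ["cid", "cmpdname", "acvalue", "pIC50"]].reset_index(drop=True)
    # Insert the labels as the third column to match the column order used above
    out.insert(2, "bioactivity_class", list(labels))
    return out


# Two rows with a non-default index, as left behind by earlier filtering
toy = pd.DataFrame(
    {
        "cid": [11269353, 57327010],
        "cmpdname": ["Begacestat", "Unii-PX8XQ3H3RV"],
        "acvalue": [0.015, 0.00027],
        "pIC50": [7.823909, 9.568636],
    },
    index=[2957, 2958],
)
summary = build_summary(toy, ["intermediate", "active"])
print(summary)
```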

In [13]:
df3

Unnamed: 0,cid,cmpdname,bioactivity_class,acvalue,pIC50
0,56681654,"methyl (2S,3R)-2-[[(2S)-2-[[(2S)-2-[[benzyl-[(...",inactive,0.27000,6.568636
1,44386767,"2-((S)-2-{(S)-2-[3-Benzyl-3-((2R,3S)-3-tert-bu...",inactive,0.25000,6.602060
2,15344717,"(S)-2-{(S)-2-[3-((2R,3S)-3-tert-Butoxycarbonyl...",inactive,3.10000,5.508638
3,12147040,"(R)-methyl 2-((R)-2-(3-benzyl-3-((2S,3R)-3-(te...",inactive,0.65000,6.187087
4,44386506,"(S)-2-((S)-2-{3-Benzyl-3-[(2R,3S)-3-((S)-2-ter...",inactive,0.23000,6.638272
...,...,...,...,...,...
2955,9843750,Semagacestat,intermediate,0.01090,7.962574
2956,73441910,"2-[(1S)-1-[(2S,5R)-5-[4-chloro-5-fluoro-2-(tri...",active,0.00620,8.207608
2957,11269353,Begacestat,intermediate,0.01500,7.823909
2958,57327010,Unii-PX8XQ3H3RV,active,0.00027,9.568636


Save the dataframe to a CSV file.

In [14]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)