<a href="https://colab.research.google.com/github/ShaheerSyed/IGF1R_QSAR_modeling/blob/main/(SHS_11_08_2023)_IGF_1R_1_Dataset_Preperation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Installing libraries**

**Install the Bioservices web service package.**

In [None]:
!pip install bioservices

**Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.**

In [None]:
! pip install chembl_webresource_client

## **Importing Necessary libraries**

In [None]:
from bioservices import UniProt
from chembl_webresource_client.new_client import new_client  # Import the chembl_webresource_client for accessing CHEMBL data.
import pandas as pd  # Import the pandas library for data manipulation and analysis.
import numpy as np

## **Search for Target protein**

The UniProt query for IGF-1R (limited to 3 results) returns the top three matches. We are interested in the human IGF-1R single protein, which is the first entry. Its UniProt ID is P08069.

In [None]:
from bioservices import UniProt
u = UniProt(verbose=False)
data = u.search("Insulin like growth factor 1 receptor", limit=3)
print(data)

Creating directory /root/.cache/bioservices 
Welcome to Bioservices
It looks like you do not have a configuration file.
We are creating one with default values in /root/.config/bioservices/bioservices.cfg .
Done
Entry	Entry Name	Reviewed	Protein names	Gene Names	Organism	Length
P08069	IGF1R_HUMAN	reviewed	Insulin-like growth factor 1 receptor (EC 2.7.10.1) (Insulin-like growth factor I receptor) (IGF-I receptor) (CD antigen CD221) [Cleaved into: Insulin-like growth factor 1 receptor alpha chain; Insulin-like growth factor 1 receptor beta chain]	IGF1R	Homo sapiens (Human)	1367
P24062	IGF1R_RAT	reviewed	Insulin-like growth factor 1 receptor (EC 2.7.10.1) (Insulin-like growth factor I receptor) (IGF-I receptor) (CD antigen CD221) [Cleaved into: Insulin-like growth factor 1 receptor alpha chain; Insulin-like growth factor 1 receptor beta chain]	Igf1r	Rattus norvegicus (Rat)	1370
Q05688	IGF1R_BOVIN	reviewed	Insulin-like growth factor 1 receptor (EC 2.7.10.1) (Insulin-like growth factor I re

### **Target search for Insulin-like growth factor I receptor (IGF-1R)  (CHEMBL ID: CHEMBL1957) (UniProt ID: P08069)**

In [None]:
# Target search for Insulin-like growth factor I receptor from the CHEMBL Library

target = new_client.target  # Create an instance of the target class from the CHEMBL web resource client.
target_query = target.search('insulin')  # Search for targets related to the term 'insulin' in CHEMBL.
targets = pd.DataFrame.from_dict(target_query)  # Convert the query result to a pandas DataFrame for easier data manipulation.
targets[:10]  # Display the first 10 rows of the DataFrame; the human IGF-1R target (CHEMBL1957) appears at index 7.


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P01308', 'xref_name': None, 'xre...",Homo sapiens,Insulin,29.0,False,CHEMBL5881,"[{'accession': 'P01308', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'Insulin_receptor', 'xref_name': ...",Homo sapiens,Insulin receptor,16.0,False,CHEMBL1981,"[{'accession': 'P06213', 'component_descriptio...",SINGLE PROTEIN,9606
2,"[{'xref_id': 'P15208', 'xref_name': None, 'xre...",Mus musculus,Insulin receptor,16.0,False,CHEMBL3187,"[{'accession': 'P15208', 'component_descriptio...",SINGLE PROTEIN,10090
3,"[{'xref_id': 'P15127', 'xref_name': None, 'xre...",Rattus norvegicus,Insulin receptor,16.0,False,CHEMBL5486,"[{'accession': 'P15127', 'component_descriptio...",SINGLE PROTEIN,10116
4,"[{'xref_id': 'P14616', 'xref_name': None, 'xre...",Homo sapiens,Insulin receptor-related protein,15.0,False,CHEMBL5483,"[{'accession': 'P14616', 'component_descriptio...",SINGLE PROTEIN,9606
5,"[{'xref_id': 'Insulin-degrading_enzyme', 'xref...",Homo sapiens,Insulin-degrading enzyme,14.0,False,CHEMBL1293287,"[{'accession': 'P14735', 'component_descriptio...",SINGLE PROTEIN,9606
6,[],Mus musculus,Insulin-degrading enzyme,14.0,False,CHEMBL3232680,"[{'accession': 'Q9JHR7', 'component_descriptio...",SINGLE PROTEIN,10090
7,"[{'xref_id': 'P08069', 'xref_name': None, 'xre...",Homo sapiens,Insulin-like growth factor I receptor,13.0,False,CHEMBL1957,"[{'accession': 'P08069', 'component_descriptio...",SINGLE PROTEIN,9606
8,"[{'xref_id': 'Q60751', 'xref_name': None, 'xre...",Mus musculus,Insulin-like growth factor 1 receptor,13.0,False,CHEMBL5381,"[{'accession': 'Q60751', 'component_descriptio...",SINGLE PROTEIN,10090
9,"[{'xref_id': 'P24062', 'xref_name': None, 'xre...",Rattus norvegicus,Insulin-like growth factor 1 receptor,13.0,False,CHEMBL1075098,"[{'accession': 'P24062', 'component_descriptio...",SINGLE PROTEIN,10116


### **Select and retrieve bioactivity data for IGF-1R (listed at entry id: 7 )**

We will assign the entry at index 7 (which corresponds to the target protein, *IGF-1R*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[7] # IGF-1R Homo Sapiens
selected_target

'CHEMBL1957'
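
Selecting the target by position (`targets.target_chembl_id[7]`) depends on the order in which the search returns its results, and that order may change between ChEMBL releases. A minimal sketch of a label-based selection follows. `targets_demo` is a hypothetical stand-in holding three rows copied from the table above, not the live query result.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` DataFrame (three rows copied from the output above).
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus"],
    "pref_name": [
        "Insulin receptor",
        "Insulin-like growth factor I receptor",
        "Insulin-like growth factor 1 receptor",
    ],
    "target_chembl_id": ["CHEMBL1981", "CHEMBL1957", "CHEMBL5381"],
    "target_type": ["SINGLE PROTEIN"] * 3,
})

# Select by organism, preferred name and target type instead of by row position.
mask = (
    (targets_demo["organism"] == "Homo sapiens")
    & (targets_demo["pref_name"] == "Insulin-like growth factor I receptor")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
)
selected_demo = targets_demo.loc[mask, "target_chembl_id"].iloc[0]
print(selected_demo)
```

The same mask could be applied to the real `targets` DataFrame, assuming its column names match those shown in the output above.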

Here, we will retrieve only the bioactivity data for *IGF-1R* (CHEMBL1957) that are reported as IC50 values.

The following code queries the ChEMBL database for activities recorded against the selected target (`selected_target`) and keeps only those whose standard type is "IC50". The result (`res`) contains potency data for the compounds tested against the chosen target.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
# Take a look at the dimensions of the dataframe
df.shape
len(df)

4424

The data frame contains 4424 IC50 activity records for IGF-1R (a compound can appear in more than one record). This matches what we see on the ChEMBL website: https://www.ebi.ac.uk/chembl/web_components/explore/activities/STATE_ID:IAxGBobScWnGGaMqoluj9A==

In [None]:
# Take a look at all the columns in data frame
df.columns.tolist()

['action_type',
 'activity_comment',
 'activity_id',
 'activity_properties',
 'assay_chembl_id',
 'assay_description',
 'assay_type',
 'assay_variant_accession',
 'assay_variant_mutation',
 'bao_endpoint',
 'bao_format',
 'bao_label',
 'canonical_smiles',
 'data_validity_comment',
 'data_validity_description',
 'document_chembl_id',
 'document_journal',
 'document_year',
 'ligand_efficiency',
 'molecule_chembl_id',
 'molecule_pref_name',
 'parent_molecule_chembl_id',
 'pchembl_value',
 'potential_duplicate',
 'qudt_units',
 'record_id',
 'relation',
 'src_id',
 'standard_flag',
 'standard_relation',
 'standard_text_value',
 'standard_type',
 'standard_units',
 'standard_upper_value',
 'standard_value',
 'target_chembl_id',
 'target_organism',
 'target_pref_name',
 'target_tax_id',
 'text_value',
 'toid',
 'type',
 'units',
 'uo_units',
 'upper_value',
 'value']

In [None]:
## In-depth look at important columns to us.

# Define a function to print value counts for a column
def print_value_counts(df, column_name):
    counts = df[column_name].value_counts()
    print(f"Value Counts for {column_name}:")
    print(counts)
    print("\n")

# Define the list of columns we want to summarize
columns_to_summarize = [
    'standard_type',
    'standard_units',
    'standard_relation'
]

# Perform value counts and print for each column in the list
for column in columns_to_summarize:
    print_value_counts(df, column)


Value Counts for standard_type:
IC50    4424
Name: standard_type, dtype: int64


Value Counts for standard_units:
nM         4369
ug.mL-1      15
µM            1
Name: standard_units, dtype: int64


Value Counts for standard_relation:
=     3179
<      606
>      404
>=     192
>>       1
Name: standard_relation, dtype: int64




In [None]:
## Assessing for any missing values in each column

# Initialize a dictionary to store missing value counts
missing_values_count = {}

# Loop through columns and count missing values
for column in df.columns:
    missing_count = df[column].isna().sum()
    missing_values_count[column] = missing_count

# Create a DataFrame from the dictionary
missing_values_df = pd.DataFrame(missing_values_count.items(), columns=['Column', 'Missing Value Count'])

# Display missing value counts in an organized table format
print(missing_values_df)


                       Column  Missing Value Count
0                 action_type                 4401
1            activity_comment                 3716
2                 activity_id                    0
3         activity_properties                    0
4             assay_chembl_id                    0
5           assay_description                    0
6                  assay_type                    0
7     assay_variant_accession                 4424
8      assay_variant_mutation                 4424
9                bao_endpoint                    0
10                 bao_format                    0
11                  bao_label                    0
12           canonical_smiles                    0
13      data_validity_comment                 4411
14  data_validity_description                 4411
15         document_chembl_id                    0
16           document_journal                 2228
17              document_year                    5
18          ligand_efficiency  
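
The per-column loop above can also be written as a single pandas expression. The sketch below uses a small invented frame (`demo`) rather than the real data, and produces a table with the same two column headers.

```python
import numpy as np
import pandas as pd

# Invented toy data: one missing value in each column.
demo = pd.DataFrame({
    "standard_value": [20000.0, np.nan, 1000.0],
    "canonical_smiles": ["CCO", "CCN", None],
})

# isna().sum() counts missing entries per column in one step.
missing = (
    demo.isna()
    .sum()
    .rename("Missing Value Count")
    .rename_axis("Column")
    .reset_index()
)
print(missing)
```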

Save the resulting ***RAW*** bioactivity data to a CSV file, **01_IGF1R_bioactivity_data_raw.csv**.

In [None]:
df.to_csv('01_IGF1R_bioactivity_data_raw.csv', index=False)

## **Handling Data**
If any compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
# Count the number of rows with missing 'standard_value'
missing_standard_value_count = df['standard_value'].isna().sum()

# Filter the DataFrame to remove rows with missing 'standard_value'
df_filtered = df[df['standard_value'].notna()]

# Display the count of rows with missing 'standard_value'
print(f"# of Rows with missing 'standard_value': {missing_standard_value_count}")

####

# Count the number of rows with missing 'canonical_smiles' in the filtered DataFrame
missing_canonical_smiles_count = df_filtered['canonical_smiles'].isna().sum()

# Further filter the same DataFrame to remove rows with missing 'canonical_smiles'
df_filtered = df_filtered[df_filtered['canonical_smiles'].notna()]

# Display the count of rows with missing 'canonical_smiles'
print(f"# of Rows with missing 'canonical_smiles': {missing_canonical_smiles_count}")


# of Rows with missing 'standard_value': 39
# of Rows with missing 'canonical_smiles': 0


In [None]:
# Check len of the filtered dataframe
len(df_filtered)

4385

In [None]:
# Verify that no missing values remain in 'standard_value', and check that 'standard_units' has none either.

missing_standard_value_count = df_filtered['standard_value'].isna().sum()
missing_standard_units_count = df_filtered['standard_units'].isna().sum()

print("Count of missing 'standard_value':", missing_standard_value_count)
print("Count of missing 'standard_units':", missing_standard_units_count)

Count of missing 'standard_value': 0
Count of missing 'standard_units': 0


In [None]:
# Taking a look at the different types of measurements provided for IC50.
print(df_filtered['standard_units'].value_counts())

nM         4369
ug.mL-1      15
µM            1
Name: standard_units, dtype: int64


For the sake of consistency, we retain only the data points that have nM as the bioactivity unit. This filtering leaves 4,369 records in our dataframe.

In [None]:
# Further filtering of the data to retain only the rows with the specified 'standard_units' equal to nM.
df_filtered = df_filtered[(df_filtered['standard_units'] == 'nM')]
len(df_filtered)

4369

In [None]:
# Taking a look again at the different types of measurements provided for IC50. After filtering, we should expect to only see the 'standard_unit' nM.
print(df_filtered['standard_units'].value_counts())

nM    4369
Name: standard_units, dtype: int64


In [None]:
## Further filtering of data to retain only the rows with the 'assay_type' equal to B.
## What does assay_type 'B' mean ?
## Binding (B) - Data measuring binding of compound to a molecular target, e.g. Ki, IC50, Kd.

df_filtered = df_filtered[(df_filtered['assay_type'] == 'B')]

In [None]:
## Take a look at the dimensions of our dataframe.

df_filtered.shape

(4349, 46)

To summarize what we have done so far:

1.   Filtered out any compounds that had missing values for 'standard_value' and 'canonical_smiles'
2.   Retained only those compounds that had 'standard_units' == 'nM'
3.   Retained only those compounds that had 'assay_type' == 'B'
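
The three filters summarized above can be combined into one boolean mask. This is a minimal sketch on invented rows; `filter_activities` and `demo` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

def filter_activities(frame):
    """Keep rows with a value, a SMILES string, nM units and a binding assay."""
    mask = (
        frame["standard_value"].notna()
        & frame["canonical_smiles"].notna()
        & (frame["standard_units"] == "nM")
        & (frame["assay_type"] == "B")
    )
    return frame[mask]

# Invented rows: only the first one passes every filter.
demo = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCN", None, "CCC", "CCCl"],
    "standard_value": [20000.0, np.nan, 5.0, 10.0, 3.0],
    "standard_units": ["nM", "nM", "nM", "ug.mL-1", "nM"],
    "assay_type": ["B", "B", "B", "B", "F"],
})
kept = filter_activities(demo)
print(len(kept))
```

Applying the filters step by step, as the notebook does, has the advantage of showing how many rows each step removes.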


There are still many columns in the dataframe that are not necessary for our purpose. Therefore, we will further filter the dataframe to keep only those features (columns) that are important to us.

In [None]:
## Further filter the dataframe; dropping all unnecessary columns, and retaining only those that are important to us.

# List of columns we want to keep in the desired order
desired_columns = [
    'molecule_chembl_id',
    'canonical_smiles',
    'assay_type',
    'standard_type',
    'standard_relation',
    'standard_value',
    'standard_units'
]

# Filtering out unnecessary columns
df_filtered = df_filtered[desired_columns]


In [None]:
## Take a look at the dataframe.

df_filtered.head(3)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units
0,CHEMBL281872,COc1cc2c(Nc3ccc(Br)cc3F)ncnc2cc1OCCn1ccnn1,B,IC50,>,20000.0,nM
1,CHEMBL533849,COc1cc2c(Nc3ccc(Br)cc3F)ncnc2cc1OCC1CCN(C)CC1.Cl,B,IC50,>,20000.0,nM
2,CHEMBL296468,CC(C)(C)c1cnc(CSc2cnc(NC(=O)C3CCNCC3)s2)o1,B,IC50,>,25000.0,nM


In [None]:
## One last step. We want to ensure that the type of the 'standard_value' column is a float.
## In summary, ensuring that we are working with the appropriate data type, such as floats for numerical values,
## is essential for data accuracy, consistency, compatibility, and meaningful analysis and visualization.


# Check the current data type of the 'standard_value' column
current_data_type = df_filtered['standard_value'].dtype

# Check if the current data type is not already a float
if current_data_type != float:
    # Convert the 'standard_value' column to float
    df_filtered['standard_value'] = df_filtered['standard_value'].astype(float)

# Now, 'standard_value' is ensured to be of type float without changing existing values.

In [None]:
## Check the data type of the 'standard_value' column to confirm that it is now float.

df_filtered['standard_value'].dtype


dtype('float64')
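
`astype(float)` raises a `ValueError` if any entry cannot be parsed as a number. A more forgiving option, sketched below on invented values, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so that they can be counted or dropped. Whether the real column contains such entries is not shown in this notebook, so treat this as an optional safeguard.

```python
import pandas as pd

# Invented values: two parseable numbers, one missing entry, one non-numeric string.
raw = pd.Series(["20000.0", "25000", None, "n/a"])

# errors="coerce" converts anything unparseable to NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype, int(numeric.isna().sum()))
```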

In [None]:
## Take a quick look at the dataframe again.

df_filtered.head(3)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units
0,CHEMBL281872,COc1cc2c(Nc3ccc(Br)cc3F)ncnc2cc1OCCn1ccnn1,B,IC50,>,20000.0,nM
1,CHEMBL533849,COc1cc2c(Nc3ccc(Br)cc3F)ncnc2cc1OCC1CCN(C)CC1.Cl,B,IC50,>,20000.0,nM
2,CHEMBL296468,CC(C)(C)c1cnc(CSc2cnc(NC(=O)C3CCNCC3)s2)o1,B,IC50,>,25000.0,nM


In [None]:
## Take a quick look at the shape of the dataframe again.

df_filtered.shape

(4349, 7)

**Save the filtered dataframe to a CSV file.**

In [None]:
df_filtered.to_csv('02_IGF1R_bioactivity_data_filtered.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity values are IC50 measurements in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
## Read-in the filtered dataframe and save to bioactivity_df variable.

bioactivity_df = pd.read_csv('02_IGF1R_bioactivity_data_filtered.csv')

In [None]:
# Initialize an empty list to store bioactivity thresholds

bioactivity_threshold = []

# Iterate through 'standard_value' in bioactivity_df and categorize bioactivity
for i in bioactivity_df['standard_value']:
    if float(i) >= 10000:
        bioactivity_threshold.append("inactive")
    elif float(i) <= 1000:
        bioactivity_threshold.append("active")
    else:
        bioactivity_threshold.append("intermediate")

# Create a Pandas Series from the 'bioactivity_threshold' list and name it 'bioactivity_status'.

bioactivity_class = pd.Series(bioactivity_threshold, name='bioactivity_status')

# Reset the index of 'bioactivity_class' to align with the index of bioactivity_df
#bioactivity_class.reset_index(drop=True, inplace=True)

# Add the 'bioactivity_status' column to bioactivity_df.

bioactivity_df = pd.concat([bioactivity_df, bioactivity_class], axis=1)


In [None]:
## Take a look at the dataframe to see if 'bioactivity_status' has been successfully added.
bioactivity_df.head(3)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status
0,CHEMBL281872,COc1cc2c(Nc3ccc(Br)cc3F)ncnc2cc1OCCn1ccnn1,B,IC50,>,20000.0,nM,inactive
1,CHEMBL533849,COc1cc2c(Nc3ccc(Br)cc3F)ncnc2cc1OCC1CCN(C)CC1.Cl,B,IC50,>,20000.0,nM,inactive
2,CHEMBL296468,CC(C)(C)c1cnc(CSc2cnc(NC(=O)C3CCNCC3)s2)o1,B,IC50,>,25000.0,nM,inactive
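
The same labels can be assigned without an explicit Python loop. The sketch below uses `numpy.select` on invented values and reproduces the thresholds used in the loop (10,000 nM or more is inactive, 1,000 nM or less is active). `demo` is a hypothetical frame.

```python
import numpy as np
import pandas as pd

# Invented IC50 values (nM), including both boundary values.
demo = pd.DataFrame({"standard_value": [20000.0, 10000.0, 5000.0, 1000.0, 32.0]})

# Conditions are checked in order; rows matching neither get the default label.
conditions = [
    demo["standard_value"] >= 10000,
    demo["standard_value"] <= 1000,
]
labels = ["inactive", "active"]
demo["bioactivity_status"] = np.select(conditions, labels, default="intermediate")
print(demo)
```

Because the column is assigned directly on the frame, no separate Series and `pd.concat` step is needed, which also avoids any index alignment concerns.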


In [None]:
## Assessing for any missing values in each column

# Initialize a dictionary to store missing value counts
missing_values_count = {}

# Loop through columns and count missing values
for column in bioactivity_df.columns:
    missing_count = bioactivity_df[column].isna().sum()
    missing_values_count[column] = missing_count

# Create a DataFrame from the dictionary
missing_values_df = pd.DataFrame(missing_values_count.items(), columns=['Column', 'Missing Value Count'])

# Display missing value counts in an organized table format
print(missing_values_df)

               Column  Missing Value Count
0  molecule_chembl_id                    0
1    canonical_smiles                    0
2          assay_type                    0
3       standard_type                    0
4   standard_relation                    3
5      standard_value                    0
6      standard_units                    0
7  bioactivity_status                    0


In [None]:
## There appears to be some missing values in the 'standard_relation' column.

## Checking missing values in 'standard_relation'.

# Create a boolean Series indicating rows with missing 'standard_relation'

missing_standard_relation = bioactivity_df['standard_relation'].isna()

# Use boolean indexing to filter and display rows with missing 'standard_relation'
rows_with_missing_standard_relation = bioactivity_df[missing_standard_relation]

# Print or inspect the rows with missing 'standard_relation'
rows_with_missing_standard_relation


Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status
4331,CHEMBL4088216,CN1C(=O)[C@@H](N2CCc3cn(Cc4ccccc4)nc3C2=O)COc2...,B,IC50,,1000.0,nM,active
4332,CHEMBL4549667,CN1C(=O)[C@@H](N2CCc3c(nn(Cc4ccccc4)c3Br)C2=O)...,B,IC50,,1000.0,nM,active
4333,CHEMBL4097778,CN1C(=O)[C@@H](N2CCc3cn(CC4CCS(=O)(=O)CC4)nc3C...,B,IC50,,1000.0,nM,active


In [None]:
## Take a quick look at the dimensions of the dataframe.
bioactivity_df.shape

(4349, 8)

# **Filter Bioactivities**

In [None]:
# Taking a look at the different types of 'standard_relation' in our dataframe.
print(bioactivity_df['standard_relation'].value_counts())

=     3148
<      606
>      399
>=     192
>>       1
Name: standard_relation, dtype: int64


In [None]:
## Sanity check: this count should match len(bioactsDF_Train) in the following code chunk.
## `&` binds more tightly than `|`, so the two 'standard_relation' tests are parenthesized together explicitly.
len(bioactivity_df[((bioactivity_df['standard_relation'] == '=') | (bioactivity_df['standard_relation'].isna())) & (bioactivity_df['assay_type'] == 'B')])

3151
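
A note on operator precedence in boolean masks: in Python, `&` binds more tightly than `|`, so an expression of the form `A | B & C` is evaluated as `A | (B & C)`. Every row remaining at this point has `assay_type` equal to 'B', so the count of 3151 is the same under either grouping, but the two groupings are not equivalent in general. The short sketch below shows the difference with plain booleans.

```python
# Plain booleans standing in for one row of each mask.
a, b, c = True, False, False

without_parens = a | b & c      # parsed as a | (b & c)
with_parens = (a | b) & c       # the grouping usually intended

print(without_parens, with_parens)
```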

In [None]:
## Filtering the bioactivities -- We are essentially splitting the data based on the 'standard_relation' column.
## ~70% of data will be used as an internal (training) set.
## ~30% of the data will be used as an external (testing) set.

bioactsDF_Train = bioactivity_df[((bioactivity_df['standard_relation'] == '=') | bioactivity_df['standard_relation'].isna())]

bioactsDF_Test_gra = bioactivity_df[bioactivity_df['standard_relation'].isin(['>', '>>', '>='])]

bioactsDF_Test_les = bioactivity_df[bioactivity_df['standard_relation'] == '<']

print(len(bioactsDF_Train), len(bioactsDF_Test_gra), len(bioactsDF_Test_les))

3151 592 606
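
As a quick check of the approximate 70/30 proportions mentioned in the comments, the fractions can be computed from the three counts printed above.

```python
# Counts taken from the printed output above.
n_train, n_test_gra, n_test_les = 3151, 592, 606

total = n_train + n_test_gra + n_test_les
train_fraction = n_train / total
test_fraction = (n_test_gra + n_test_les) / total

print(total, round(train_fraction, 3), round(test_fraction, 3))
```

The total equals the 4349 rows of `bioactivity_df`, so every row falls into exactly one of the three subsets.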


## **Checking for redundant compounds: (i) identical `molecule_chembl_id`, (ii) replicate IC50 values with a standard deviation of 2 nM or more, and (iii) missing IC50 values.**

In [None]:
# Calculate the number of rows in bioactsDF_Train
num_rows = len(bioactsDF_Train)

# Calculate the count of unique 'molecule_chembl_id' values in bioactsDF_Train
num_unique_chemblId = len(bioactsDF_Train['molecule_chembl_id'].unique())

# Display the results in formatted sentences
print(f"The number of rows in bioactsDF_Train is: {num_rows}.")
print(f"The count of unique 'chemblId' values in bioactsDF_Train is: {num_unique_chemblId}.")

The number of rows in bioactsDF_Train is: 3151.
The count of unique 'chemblId' values in bioactsDF_Train is: 2293.


In [None]:
len(bioactsDF_Test_gra), len(bioactsDF_Test_gra['molecule_chembl_id'].unique())

(592, 511)

In [None]:
len(bioactsDF_Test_les), len(bioactsDF_Test_les['molecule_chembl_id'].unique())

(606, 402)


In summary, the code below is identifying and separating duplicate rows in `bioactsDF_Train` based on the 'molecule_chembl_id' column, creating two DataFrames: one with duplicates (`bioactsDF_Train_dup`) and one without duplicates (`bioactsDF_Train_non`).

In [None]:
# Identify and extract duplicate rows in bioactsDF_Train based on the 'molecule_chembl_id' column.

bioactsDF_Train_dup = pd.concat(g for _, g in bioactsDF_Train.groupby("molecule_chembl_id") if len(g) > 1)

# Create a DataFrame containing non-duplicate rows by filtering the original DataFrame
# Rows are filtered based on their indices (indexes)
bioactsDF_Train_non = bioactsDF_Train.loc[~bioactsDF_Train.index.isin(bioactsDF_Train_dup.index)]

# Print the count of non-duplicate rows, duplicate rows, and the total count
print("bioactsDF_Train_dup")
print("Number of non-duplicate rows:", len(bioactsDF_Train_non))
print("Number of duplicate rows:", len(bioactsDF_Train_dup))
print("Total number of rows after removing duplicates:", len(bioactsDF_Train_dup) + len(bioactsDF_Train_non))


bioactsDF_Train_dup
Number of non-duplicate rows: 1596
Number of duplicate rows: 1555
Total number of rows after removing duplicates: 3151
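
An alternative way to make the same split is `DataFrame.duplicated` with `keep=False`, which flags every row whose ID occurs more than once. Unlike `pd.concat` over a generator, which raises a `ValueError` when no group has more than one row, this form also works when there are no replicates. The sketch uses invented rows; `demo`, `demo_dup` and `demo_non` are hypothetical names.

```python
import pandas as pd

# Invented rows: A and C are replicated, B occurs once.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "A", "B", "C", "C", "C"],
    "standard_value": [32.0, 100.0, 5.0, 2.0, 3.0, 2.2],
})

# keep=False marks all members of a duplicated group, not just the repeats.
is_replicated = demo.duplicated(subset="molecule_chembl_id", keep=False)
demo_dup = demo[is_replicated]
demo_non = demo[~is_replicated]

print(len(demo_dup), len(demo_non))
```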


In [None]:
# Calculate the number of rows in bioactsDF_Train_non
num_rows = len(bioactsDF_Train_non)

# Calculate the count of unique 'molecule_chembl_id' values in bioactsDF_Train_non
num_unique_chemblId = len(bioactsDF_Train_non['molecule_chembl_id'].unique())

# Display the results in a more user-friendly format
print("bioactsDF_Train_non")
print(f"Number of rows in bioactsDF_Train_non: {num_rows}")
print(f"Number of unique 'molecule_chembl_id' values in bioactsDF_Train_non: {num_unique_chemblId}")


bioactsDF_Train_non
Number of rows in bioactsDF_Train_non: 1596
Number of unique 'molecule_chembl_id' values in bioactsDF_Train_non: 1596


In [None]:
# Calculate the mean and standard deviation of 'standard_value' for each unique compound.

# This aggregates the data and provides statistics for each group
mean_std = bioactsDF_Train_dup.groupby(['molecule_chembl_id'], as_index=False).agg(
    {'standard_value': ['mean', 'std']}
)

# Display the first three rows of the calculated mean and standard deviation
mean_std.head(3)


Unnamed: 0_level_0,molecule_chembl_id,standard_value,standard_value
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,std
0,CHEMBL1078891,66.0,48.083261
1,CHEMBL1090357,270.0,84.852814
2,CHEMBL1090358,215.0,21.213203
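
The `std` reported by pandas is the sample standard deviation (it divides by n - 1). As a worked check, the two IC50 values recorded for CHEMBL1078891 in this dataset, 32.0 and 100.0 nM, reproduce the mean and standard deviation in the first row above. The sketch uses only the standard library.

```python
import statistics

# The two replicate IC50 values (nM) for CHEMBL1078891.
values = [32.0, 100.0]

mean = statistics.mean(values)
sample_std = statistics.stdev(values)       # divides by n - 1, like pandas
population_std = statistics.pstdev(values)  # divides by n, shown for contrast

print(mean, round(sample_std, 6), population_std)
```

Note that the standard deviation is expressed in nM, so a fixed cutoff on it is an absolute tolerance rather than a relative one.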


In [None]:
len(mean_std)

697

In [None]:
# Merge the DataFrame with duplicate rows and the DataFrame containing mean and standard deviation.

# Inner Join: Only the rows with matching 'molecule_chembl_id' values in both DataFrames will be included in the result.

bioactsDF_Train_dup = bioactsDF_Train_dup.merge(mean_std, on='molecule_chembl_id', how='inner')

# Calculate the number of rows in the merged DataFrame
len(bioactsDF_Train_dup)



1555

In [None]:
bioactsDF_Train_dup.head(3)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status,"(standard_value, mean)","(standard_value, std)"
0,CHEMBL1078891,Cc1cc(N2CCN(CCF)CC2)cc2[nH]c(-c3c(NC[C@@H](O)c...,B,IC50,=,32.0,nM,active,66.0,48.083261
1,CHEMBL1078891,Cc1cc(N2CCN(CCF)CC2)cc2[nH]c(-c3c(NC[C@@H](O)c...,B,IC50,=,100.0,nM,active,66.0,48.083261
2,CHEMBL1090357,CN(C)CCc1ccc(Nc2nccc(-c3c(-c4cccc(NC(=O)c5c(F)...,B,IC50,=,210.0,nM,active,270.0,84.852814


In [None]:
# Filter rows based on the per-compound standard deviation of 'standard_value'
# Keep only rows where the standard deviation is less than 2 (in nM, the unit of 'standard_value')

bioactsDF_Train_dup = bioactsDF_Train_dup[(bioactsDF_Train_dup[('standard_value', 'std')] < 2)]

# Calculate the number of rows in the filtered DataFrame
num_rows_filtered = len(bioactsDF_Train_dup)

# Calculate the count of unique 'chemblId' values in the filtered DataFrame
num_unique_chemblId = len(bioactsDF_Train_dup['molecule_chembl_id'].unique())

# Display the number of rows in the filtered DataFrame and the count of unique 'chemblId' values
(num_rows_filtered, num_unique_chemblId)


(494, 234)

In [None]:
# Calculate the absolute difference between 'standard_value' and the per-compound mean of 'standard_value'
bioactsDF_Train_dup['select'] = (bioactsDF_Train_dup['standard_value']- bioactsDF_Train_dup[('standard_value', 'mean')]).abs()

# Group the DataFrame by 'molecule_chembl_id' and sort each group by the 'select' column in ascending order
bioactsDF_Train_dup = bioactsDF_Train_dup.groupby(["molecule_chembl_id"]).apply(lambda x: x.sort_values(["select"], ascending = True)).reset_index(drop=True)

#Display the first 5 rows of the sorted DataFrame
bioactsDF_Train_dup.head(5)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status,"(standard_value, mean)","(standard_value, std)",select
0,CHEMBL1097357,Cc1cc(N2CCNCC2)cc2[nH]c(-c3c(NC[C@@H](O)c4cccc...,B,IC50,=,32.0,nM,active,32.0,0.0,0.0
1,CHEMBL1097357,Cc1cc(N2CCNCC2)cc2[nH]c(-c3c(NC[C@@H](O)c4cccc...,B,IC50,=,32.0,nM,active,32.0,0.0,0.0
2,CHEMBL1097358,Cc1cc(N2CCN(C)CC2)cc2[nH]c(-c3c(NC[C@@H](O)c4c...,B,IC50,=,18.0,nM,active,18.0,0.0,0.0
3,CHEMBL1097358,Cc1cc(N2CCN(C)CC2)cc2[nH]c(-c3c(NC[C@@H](O)c4c...,B,IC50,=,18.0,nM,active,18.0,0.0,0.0
4,CHEMBL1222709,O=C(Nc1ccc(F)nc1)[C@@H]1CCCN1c1nc(Nc2cc(C3CC3)...,B,IC50,=,2.7,nM,active,1.85,1.202082,0.85


In [None]:
# Remove duplicate rows in the DataFrame based on the 'molecule_chembl_id' column
# Keep the first occurrence and drop subsequent duplicates
bioactsDF_Train_dup = bioactsDF_Train_dup.drop_duplicates(subset='molecule_chembl_id', keep='first')

# Display the first 2 rows of the DataFrame after removing duplicates
bioactsDF_Train_dup.head(2)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status,"(standard_value, mean)","(standard_value, std)",select
0,CHEMBL1097357,Cc1cc(N2CCNCC2)cc2[nH]c(-c3c(NC[C@@H](O)c4cccc...,B,IC50,=,32.0,nM,active,32.0,0.0,0.0
2,CHEMBL1097358,Cc1cc(N2CCN(C)CC2)cc2[nH]c(-c3c(NC[C@@H](O)c4c...,B,IC50,=,18.0,nM,active,18.0,0.0,0.0


In [None]:
# Count the total number of rows in the DataFrame 'bioactsDF_Train_dup'
total_rows = len(bioactsDF_Train_dup)

# Count the number of unique 'chemblId' values in the 'bioactsDF_Train_dup' DataFrame
unique_chemblIds = len(bioactsDF_Train_dup['molecule_chembl_id'].unique())

# Display the counts of total rows and unique 'chemblId' values
total_rows, unique_chemblIds


(234, 234)

In [None]:
bioactsDF_Train_dup.tail(2)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status,"(standard_value, mean)","(standard_value, std)",select
490,CHEMBL575447,C[C@@]1(C(=O)Nc2cnccn2)CCCN1c1nc(Nc2cc(C3CC3)[...,B,IC50,=,3.0,nM,active,2.5,0.707107,0.5
492,CHEMBL577385,C[C@@]1(C(=O)Nc2nccs2)CCCN1c1nc(Nc2cc(C3CC3)n[...,B,IC50,=,4.0,nM,active,4.0,0.0,0.0


In [None]:
# Remove the 'select' column from the DataFrame 'bioactsDF_Train_dup'
# (the axis is passed by keyword; the positional form is deprecated and no longer accepted in pandas 2.0)
bioactsDF_Train_dup = bioactsDF_Train_dup.drop('select', axis=1)

# Remove the ('standard_value', 'mean') column from the DataFrame 'bioactsDF_Train_dup'
bioactsDF_Train_dup = bioactsDF_Train_dup.drop(('standard_value', 'mean'), axis=1)

# Remove the ('standard_value', 'std') column from the DataFrame 'bioactsDF_Train_dup'
bioactsDF_Train_dup = bioactsDF_Train_dup.drop(('standard_value', 'std'), axis=1)



In [None]:
# Concatenate two DataFrames 'bioactsDF_Train_non' and 'bioactsDF_Train_dup' to create 'bioactsDF_Train_final'
bioactsDF_Train_final = pd.concat([bioactsDF_Train_non, bioactsDF_Train_dup])

# Calculate the lengths of 'bioactsDF_Train_dup', 'bioactsDF_Train_non', and 'bioactsDF_Train_final'
len_bioactsDF_Train_dup = len(bioactsDF_Train_dup)
len_bioactsDF_Train_non = len(bioactsDF_Train_non)
len_bioactsDF_Train_final = len(bioactsDF_Train_final)

len(bioactsDF_Train_dup), len(bioactsDF_Train_non), len(bioactsDF_Train_final)

(234, 1596, 1830)

In [None]:
bioactsDF_Train_final.tail(2)

Unnamed: 0,molecule_chembl_id,canonical_smiles,assay_type,standard_type,standard_relation,standard_value,standard_units,bioactivity_status
490,CHEMBL575447,C[C@@]1(C(=O)Nc2cnccn2)CCCN1c1nc(Nc2cc(C3CC3)[...,B,IC50,=,3.0,nM,active
492,CHEMBL577385,C[C@@]1(C(=O)Nc2nccs2)CCCN1c1nc(Nc2cc(C3CC3)n[...,B,IC50,=,4.0,nM,active


## **Repeat the same steps that we performed for bioactsDF_Train above. This time, for bioactsDF_Test_gra and bioactsDF_Test_les.**
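
Since the same de-duplication steps are applied to all three subsets, they could be wrapped in one helper. The sketch below is one possible way to do that, not the notebook's own code: `collapse_replicates` and its parameters are hypothetical names, the toy data is invented, and it uses `groupby(...).transform` in place of the merge used above, which keeps the column labels flat.

```python
import pandas as pd

def collapse_replicates(frame, id_col="molecule_chembl_id",
                        value_col="standard_value", max_std=2.0):
    """Keep single measurements as they are; for replicated compounds whose
    sample standard deviation is below max_std, keep the row closest to the mean."""
    counts = frame.groupby(id_col)[value_col].transform("size")
    singles = frame[counts == 1]
    dups = frame[counts > 1]

    # Per-compound mean and sample standard deviation, aligned row by row.
    mean = dups.groupby(id_col)[value_col].transform("mean")
    std = dups.groupby(id_col)[value_col].transform("std")

    # Distance of each replicate from its compound mean, then the std filter.
    dups = dups.assign(_dist=(dups[value_col] - mean).abs())[std < max_std]

    # Within each compound, the closest replicate sorts first and is kept.
    dups = (
        dups.sort_values([id_col, "_dist"])
        .drop_duplicates(subset=id_col, keep="first")
        .drop(columns="_dist")
    )
    return pd.concat([singles, dups])

# Invented data: A is too variable, B has identical replicates,
# C is a single measurement, D has three close replicates.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "A", "B", "B", "C", "D", "D", "D"],
    "standard_value": [32.0, 100.0, 18.0, 18.0, 5.0, 2.0, 3.0, 2.2],
})
result = collapse_replicates(toy)
print(result)
```

With a helper of this kind, each subset would need only one call, for example `collapse_replicates(bioactsDF_Test_gra)`, under the assumption that the same 2 nM cutoff is wanted for every subset.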

In [None]:

# Identify and extract duplicate rows based on the 'chemblId' column
bioactsDF_Test_gra_dup = pd.concat(g for _, g in bioactsDF_Test_gra.groupby("molecule_chembl_id") if len(g) > 1)

# Create a DataFrame containing non-duplicate rows by filtering the original DataFrame
bioactsDF_Test_gra_non = bioactsDF_Test_gra.loc[~bioactsDF_Test_gra.index.isin(bioactsDF_Test_gra_dup.index)]

# Print the count of non-duplicate rows, duplicate rows, and the total count
print("Number of non-duplicate rows in bioactsDF_Test_gra:", len(bioactsDF_Test_gra_non))
print("Number of duplicate rows in bioactsDF_Test_gra:", len(bioactsDF_Test_gra_dup))
print("Total number of rows in bioactsDF_Test_gra after removing duplicates:", len(bioactsDF_Test_gra_dup) + len(bioactsDF_Test_gra_non))


Number of non-duplicate rows in bioactsDF_Test_gra: 431
Number of duplicate rows in bioactsDF_Test_gra: 161
Total number of rows in bioactsDF_Test_gra after removing duplicates: 592


In [None]:
# Calculate the mean and standard deviation of 'standard_value' for each replicated molecule
mean_std = bioactsDF_Test_gra_dup.groupby(['molecule_chembl_id'], as_index=False).agg(
    {'standard_value': ['mean', 'std']}
)

# Merge the DataFrame with duplicate rows and the DataFrame containing mean and standard deviation
bioactsDF_Test_gra_dup = bioactsDF_Test_gra_dup.merge(mean_std, on='molecule_chembl_id', how='inner')

# Filter rows based on the standard deviation of the 'standard_value' column
# Keep only rows where the standard deviation is less than 2
bioactsDF_Test_gra_dup = bioactsDF_Test_gra_dup[(bioactsDF_Test_gra_dup[('standard_value', 'std')] < 2)]

# Calculate the absolute difference between the 'standard_value' column and the mean of 'standard_value' for each group
bioactsDF_Test_gra_dup['select'] = (bioactsDF_Test_gra_dup['standard_value']- bioactsDF_Test_gra_dup[('standard_value', 'mean')]).abs()

# Group the DataFrame by 'molecule_chembl_id' and sort each group by the 'select' column in ascending order
bioactsDF_Test_gra_dup = bioactsDF_Test_gra_dup.groupby(["molecule_chembl_id"]).apply(lambda x: x.sort_values(["select"], ascending = True)).reset_index(drop=True)

# Remove duplicate rows in the DataFrame based on the 'molecule_chembl_id' column
# Keep the first occurrence (the value closest to the group mean) and drop subsequent duplicates
bioactsDF_Test_gra_dup = bioactsDF_Test_gra_dup.drop_duplicates(subset='molecule_chembl_id', keep='first')

# Remove unnecessary columns ('select', ('standard_value', 'mean'), ('standard_value', 'std'))
bioactsDF_Test_gra_dup = bioactsDF_Test_gra_dup.drop(['select', ('standard_value', 'mean'), ('standard_value', 'std')], axis=1)

# Concatenate two DataFrames 'bioactsDF_Test_gra_non' and 'bioactsDF_Test_gra_dup' to create 'bioactsDF_Test_gra_final'
bioactsDF_Test_gra_final = pd.concat([bioactsDF_Test_gra_non, bioactsDF_Test_gra_dup])

# Calculate the lengths of 'bioactsDF_Test_gra_dup', 'bioactsDF_Test_gra_non', and 'bioactsDF_Test_gra_final'
len_bioactsDF_Test_gra_dup = len(bioactsDF_Test_gra_dup)
len_bioactsDF_Test_gra_non = len(bioactsDF_Test_gra_non)
len_bioactsDF_Test_gra_final = len(bioactsDF_Test_gra_final)

len(bioactsDF_Test_gra_dup), len(bioactsDF_Test_gra_non), len(bioactsDF_Test_gra_final)




(75, 431, 506)

In [None]:
# Identify and extract duplicate rows based on the 'molecule_chembl_id' column
bioactsDF_Test_les_dup = pd.concat(g for _, g in bioactsDF_Test_les.groupby("molecule_chembl_id") if len(g) > 1)

# Create a DataFrame containing non-duplicate rows by filtering the original DataFrame
bioactsDF_Test_les_non = bioactsDF_Test_les.loc[~bioactsDF_Test_les.index.isin(bioactsDF_Test_les_dup.index)]

# Print the count of non-duplicate rows, duplicate rows, and their sum (the original row count)
print("Number of non-duplicate rows in bioactsDF_Test_les:", len(bioactsDF_Test_les_non))
print("Number of duplicate rows in bioactsDF_Test_les:", len(bioactsDF_Test_les_dup))
print("Total number of rows in bioactsDF_Test_les before removing duplicates:", len(bioactsDF_Test_les_dup) + len(bioactsDF_Test_les_non))



Number of non-duplicate rows in bioactsDF_Test_les: 198
Number of duplicate rows in bioactsDF_Test_les: 408
Total number of rows in bioactsDF_Test_les before removing duplicates: 606


In [None]:
# Resolve duplicates: for each repeated molecule, keep the measurement closest to the group mean

# Calculate the mean and standard deviation of 'standard_value' for each unique molecule
mean_std = bioactsDF_Test_les_dup.groupby(['molecule_chembl_id'], as_index=False).agg(
    {'standard_value': ['mean', 'std']}
)

# Merge the DataFrame with duplicate rows and the DataFrame containing mean and standard deviation
bioactsDF_Test_les_dup = bioactsDF_Test_les_dup.merge(mean_std, on='molecule_chembl_id', how='inner')

# Filter rows based on the standard deviation of the 'standard_value' column
# Keep only rows where the standard deviation is less than 2
bioactsDF_Test_les_dup = bioactsDF_Test_les_dup[(bioactsDF_Test_les_dup[('standard_value', 'std')] < 2)]

# Calculate the absolute difference between the 'standard_value' column and the mean of 'standard_value' for each group
bioactsDF_Test_les_dup['select'] = (bioactsDF_Test_les_dup['standard_value']- bioactsDF_Test_les_dup[('standard_value', 'mean')]).abs()

# Group the DataFrame by 'molecule_chembl_id' and sort each group by the 'select' column in ascending order
bioactsDF_Test_les_dup = bioactsDF_Test_les_dup.groupby(["molecule_chembl_id"]).apply(lambda x: x.sort_values(["select"], ascending = True)).reset_index(drop=True)

# Remove duplicate rows in the DataFrame based on the 'molecule_chembl_id' column
# Keep the first occurrence (the value closest to the group mean) and drop subsequent duplicates
bioactsDF_Test_les_dup = bioactsDF_Test_les_dup.drop_duplicates(subset='molecule_chembl_id', keep='first')

# Remove unnecessary columns ('select', ('standard_value', 'mean'), ('standard_value', 'std'))
bioactsDF_Test_les_dup = bioactsDF_Test_les_dup.drop(['select', ('standard_value', 'mean'), ('standard_value', 'std')], axis=1)

# Concatenate two DataFrames 'bioactsDF_Test_les_non' and 'bioactsDF_Test_les_dup' to create 'bioactsDF_Test_les_final'
bioactsDF_Test_les_final = pd.concat([bioactsDF_Test_les_non, bioactsDF_Test_les_dup])

# Calculate the lengths of 'bioactsDF_Test_les_dup', 'bioactsDF_Test_les_non', and 'bioactsDF_Test_les_final'
len_bioactsDF_Test_les_dup = len(bioactsDF_Test_les_dup)
len_bioactsDF_Test_les_non = len(bioactsDF_Test_les_non)
len_bioactsDF_Test_les_final = len(bioactsDF_Test_les_final)

len(bioactsDF_Test_les_dup), len(bioactsDF_Test_les_non), len(bioactsDF_Test_les_final)




(204, 198, 402)
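
The duplicate-handling cells for Train, Test_gra and Test_les differ only in the DataFrame they act on, so the recipe could be wrapped in a single function. The following is a minimal sketch, not the notebook's own code: the name `resolve_duplicates` and its parameters are hypothetical, it uses `transform` in place of the merge on multi-level columns, it resets the index of the result, and ties in distance to the mean may be broken differently from the cells above.

```python
import pandas as pd


def resolve_duplicates(df, id_col="molecule_chembl_id",
                       value_col="standard_value", max_std=2.0):
    """Hypothetical helper: return one row per molecule.

    Molecules measured once are kept unchanged. Molecules measured more than
    once are kept only if the standard deviation of their values is below
    `max_std`; in that case the row closest to the group mean is retained.
    """
    # Split rows into molecules that occur once and molecules that repeat
    is_dup = df.duplicated(subset=id_col, keep=False)
    non_dup = df[~is_dup]
    dup = df[is_dup].copy()

    # Per-molecule mean and sample standard deviation, aligned to each row
    grouped = dup.groupby(id_col)[value_col]
    dup["_mean"] = grouped.transform("mean")
    dup["_std"] = grouped.transform("std")

    # Discard molecules whose repeated measurements disagree too much
    dup = dup[dup["_std"] < max_std].copy()

    # Keep the single measurement closest to the group mean
    dup["_dist"] = (dup[value_col] - dup["_mean"]).abs()
    dup = (
        dup.sort_values([id_col, "_dist"])
           .drop_duplicates(subset=id_col, keep="first")
           .drop(columns=["_mean", "_std", "_dist"])
    )

    return pd.concat([non_dup, dup], ignore_index=True)
```

Under these assumptions it would be called as `bioactsDF_Test_gra_final = resolve_duplicates(bioactsDF_Test_gra)`, and likewise for the other two DataFrames.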

In [None]:
# Check the lengths of
# 1) bioactsDF_Train_final 2) bioactsDF_Test_gra_final 3) bioactsDF_Test_les_final

print(len(bioactsDF_Train_final),
      len(bioactsDF_Test_gra_final),
      len(bioactsDF_Test_les_final))

1830 506 402


## **Dropping Rows with Missing SMILES**

In [None]:
# Step 1: Filter rows with non-null 'canonical_smiles' in the Train DataFrame
Train_IGF1R_RAW1 = bioactsDF_Train_final[pd.notnull(bioactsDF_Train_final['canonical_smiles'])]

# Step 2: Filter rows with non-null 'canonical_smiles' in the Test_gra DataFrame
Test_gra_IGF1R_RAW1 = bioactsDF_Test_gra_final[pd.notnull(bioactsDF_Test_gra_final['canonical_smiles'])]

# Step 3: Filter rows with non-null 'canonical_smiles' in the Test_les DataFrame
Test_les_IGF1R_RAW1 = bioactsDF_Test_les_final[pd.notnull(bioactsDF_Test_les_final['canonical_smiles'])]

# Step 4: Display the number of rows before and after filtering for each DataFrame
# This shows how many rows were removed because 'canonical_smiles' was missing.

print('bioactsDF_Train_final of ' + str(len(bioactsDF_Train_final)) + '. After dropping rows with missing SMILES is reduced to ' + str(len(Train_IGF1R_RAW1)))

print('bioactsDF_Test_gra_final of ' + str(len(bioactsDF_Test_gra_final)) + '. After dropping rows with missing SMILES is reduced to ' + str(len(Test_gra_IGF1R_RAW1)))

print('bioactsDF_Test_les_final of ' + str(len(bioactsDF_Test_les_final)) + '. After dropping rows with missing SMILES is reduced to ' + str(len(Test_les_IGF1R_RAW1)))


bioactsDF_Train_final of 1830. After dropping rows with missing SMILES is reduced to 1830
bioactsDF_Test_gra_final of 506. After dropping rows with missing SMILES is reduced to 506
bioactsDF_Test_les_final of 402. After dropping rows with missing SMILES is reduced to 402


In [None]:
# Calculate the lengths of the DataFrames and the number of unique 'molecule_chembl_id' values
train_raw_length = len(Train_IGF1R_RAW1)
train_unique_length = len(Train_IGF1R_RAW1['molecule_chembl_id'].unique())

test_gra_raw_length = len(Test_gra_IGF1R_RAW1)
test_gra_unique_length = len(Test_gra_IGF1R_RAW1['molecule_chembl_id'].unique())

test_les_raw_length = len(Test_les_IGF1R_RAW1)
test_les_unique_length = len(Test_les_IGF1R_RAW1['molecule_chembl_id'].unique())

# Print the results with clear labels
print("Train Data: Total rows =", train_raw_length, "Unique molecules =", train_unique_length)
print("Test_gra Data: Total rows =", test_gra_raw_length, "Unique molecules =", test_gra_unique_length)
print("Test_les Data: Total rows =", test_les_raw_length, "Unique molecules =", test_les_unique_length)


Train Data: Total rows = 1830 Unique molecules = 1830
Test_gra Data: Total rows = 506 Unique molecules = 506
Test_les Data: Total rows = 402 Unique molecules = 402


In [None]:
# Filter rows with non-null 'canonical_smiles' in the Train DataFrames
Train_IGF1R_raw = Train_IGF1R_RAW1[pd.notnull(Train_IGF1R_RAW1['canonical_smiles'])]

# Filter rows with non-null 'canonical_smiles' in the Test DataFrames
Test_gra_IGF1R_raw = Test_gra_IGF1R_RAW1[pd.notnull(Test_gra_IGF1R_RAW1['canonical_smiles'])]
Test_les_IGF1R_raw = Test_les_IGF1R_RAW1[pd.notnull(Test_les_IGF1R_RAW1['canonical_smiles'])]

# Print the results with clear labels, specifying the reduction reason
print("Train Data: Removed rows with missing 'canonical_smiles'. Total compounds reduced from " + str(len(Train_IGF1R_RAW1)) + " to " + str(len(Train_IGF1R_raw)) + " compounds.")
print("Test_gra Data: Removed rows with missing 'canonical_smiles'. Total compounds reduced from " + str(len(Test_gra_IGF1R_RAW1)) + " to " + str(len(Test_gra_IGF1R_raw)) + " compounds.")
print("Test_les Data: Removed rows with missing 'canonical_smiles'. Total compounds reduced from " + str(len(Test_les_IGF1R_RAW1)) + " to " + str(len(Test_les_IGF1R_raw)) + " compounds.")


Train Data: Removed rows with missing 'canonical_smiles'. Total compounds reduced from 1830 to 1830 compounds.
Test_gra Data: Removed rows with missing 'canonical_smiles'. Total compounds reduced from 506 to 506 compounds.
Test_les Data: Removed rows with missing 'canonical_smiles'. Total compounds reduced from 402 to 402 compounds.
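
The checks in this section (missing SMILES, unique molecule IDs) could be gathered into one summary that also reports whether any SMILES string occurs more than once. A minimal sketch, assuming the column names used above; `smiles_report` is a hypothetical helper name, not part of pandas or of this notebook.

```python
import pandas as pd


def smiles_report(df, smiles_col="canonical_smiles", id_col="molecule_chembl_id"):
    """Hypothetical helper: summarize missing and repeated entries in a DataFrame."""
    present = df[smiles_col].notna()
    return {
        # Total number of rows
        "rows": len(df),
        # Rows whose SMILES is missing (NaN/None)
        "missing_smiles": int((~present).sum()),
        # Number of distinct molecule IDs
        "unique_ids": int(df[id_col].nunique()),
        # Non-missing SMILES strings that repeat an earlier row
        "duplicate_smiles": int(df.loc[present, smiles_col].duplicated().sum()),
    }
```

For example, `smiles_report(Train_IGF1R_raw)` would return the four counts for the training set in one dictionary.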


**Save the resulting DataFrames to CSV files.**

In [None]:
# Specify the file paths for saving the DataFrames as CSV
train_csv_path = "Train_IGF1R_raw.csv"
test_gra_csv_path = "Test_gra_IGF1R_raw.csv"
test_les_csv_path = "Test_les_IGF1R_raw.csv"

# Save the DataFrames to CSV files
Train_IGF1R_raw.to_csv(train_csv_path, index=False)
Test_gra_IGF1R_raw.to_csv(test_gra_csv_path, index=False)
Test_les_IGF1R_raw.to_csv(test_les_csv_path, index=False)

# Print confirmation messages
print("Train Data saved to", train_csv_path)
print("Test_gra Data saved to", test_gra_csv_path)
print("Test_les Data saved to", test_les_csv_path)


Train Data saved to Train_IGF1R.csv
Test_gra Data saved to Test_gra_IGF1R.csv
Test_les Data saved to Test_les_IGF1R.csv


## **Zip All Resulting CSV files from Dataset Preparation.**

In [None]:
!zip IGF1R_dataset_preparation.zip *.csv

updating: 01_IGF1R_bioactivity_data_raw.csv (deflated 94%)
updating: 02_IGF1R_bioactivity_data_filtered.csv (deflated 88%)
updating: Test_gra_IGF1R.csv (deflated 82%)
updating: Test_les_IGF1R.csv (deflated 90%)
updating: Train_IGF1R.csv (deflated 86%)


In [None]:
! ls -l

total 3876
-rw-r--r-- 1 root root 2980287 Oct 31 05:08 01_IGF1R_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  412880 Oct 31 05:08 02_IGF1R_bioactivity_data_filtered.csv
-rw-r--r-- 1 root root  282495 Oct 31 05:11 IGF1R_dataset_preparation.zip
drwxr-xr-x 1 root root    4096 Oct 27 13:22 sample_data
-rw-r--r-- 1 root root   48304 Oct 31 05:10 Test_gra_IGF1R.csv
-rw-r--r-- 1 root root   47659 Oct 31 05:10 Test_les_IGF1R.csv
-rw-r--r-- 1 root root  184569 Oct 31 05:10 Train_IGF1R.csv


## **Move this code to another document.**

In [None]:
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2023.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30.5/30.5 MB 46.6 MB/s eta 0:00:00
Installing collected packages: rdkit
Successfully installed rdkit-2023.9.1


In [None]:
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
from rdkit.Chem import MolFromSmiles,MolToSmiles

In [None]:
def clean_smiles(ListSMILEs):
    """
    Cleans SMILES strings by removing common salts.

    Parameters:
        ListSMILEs (list): A list of SMILES strings to be cleaned.

    Returns:
        list: A list of cleaned SMILES strings with salts removed.
    """

    # Create an instance of SaltRemover to remove common salts
    remover = SaltRemover()

    # Initialize an empty list to store cleaned SMILES strings
    SMILES_desalt = []

    # Iterate over each SMILES string in the input list
    for i in ListSMILEs:
        # Parse the SMILES string into a molecule object
        mol = MolFromSmiles(i)

        # Remove common salts from the molecule structure
        mol_desalt = remover.StripMol(mol)

        # Convert the desalted molecule back to a SMILES string
        mol_SMILES = MolToSmiles(mol_desalt)

        # Append the desalted SMILES string to the result list
        SMILES_desalt.append(mol_SMILES)

    # Return the list of cleaned SMILES strings
    return SMILES_desalt
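
`MolFromSmiles` returns `None` when a SMILES string cannot be parsed, and passing `None` on to `StripMol` would raise an error. The following is a hedged variant of `clean_smiles` that records `None` for unparsable or missing entries instead of failing; the name `clean_smiles_safe` is hypothetical, and the default salt definitions shipped with RDKit are assumed.

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover


def clean_smiles_safe(smiles_list):
    """Hypothetical variant of clean_smiles that tolerates invalid input.

    Returns a list of desalted SMILES strings, with None in place of any
    entry that is not a string or cannot be parsed by RDKit.
    """
    remover = SaltRemover()  # uses RDKit's default salt definitions
    cleaned = []
    for smi in smiles_list:
        # Non-string entries (e.g. NaN, None) are treated as unparsable
        mol = Chem.MolFromSmiles(smi) if isinstance(smi, str) else None
        if mol is None:
            cleaned.append(None)
            continue
        # Strip salt fragments and convert back to canonical SMILES
        cleaned.append(Chem.MolToSmiles(remover.StripMol(mol)))
    return cleaned
```

Rows that come back as `None` can then be inspected or dropped explicitly before any duplicate check on the desalted SMILES.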

In [None]:
Train_ER_alpha['SMILES_desalt'] = clean_smiles(Train_ER_alpha.canonical_smiles)
Test_gra_ER_alpha['SMILES_desalt'] = clean_smiles(Test_gra_ER_alpha.canonical_smiles)
Test_les_ER_alpha['SMILES_desalt'] = clean_smiles(Test_les_ER_alpha.canonical_smiles)

In [None]:
# Remove duplicate SMILES and keep the last occurrence for Train_ER_alpha
Train_ER_alpha2 = Train_ER_alpha.drop_duplicates(subset='SMILES_desalt', keep='last')

# Print the reduction in the number of SMILES strings for Train_ER_alpha
print("RAW data of " + str(len(Train_ER_alpha)) + " SMILES has been reduced to " + str(len(Train_ER_alpha2)) + " SMILES.")

# Remove duplicate SMILES and keep the last occurrence for Test_gra_ER_alpha
Test_gra_ER_alpha2 = Test_gra_ER_alpha.drop_duplicates(subset='SMILES_desalt', keep='last')

# Print the reduction in the number of SMILES strings for Test_gra_ER_alpha
print("RAW data of " + str(len(Test_gra_ER_alpha)) + " SMILES has been reduced to " + str(len(Test_gra_ER_alpha2)) + " SMILES.")

# Remove duplicate SMILES and keep the last occurrence for Test_les_ER_alpha
Test_les_ER_alpha2 = Test_les_ER_alpha.drop_duplicates(subset='SMILES_desalt', keep='last')

# Print the reduction in the number of SMILES strings for Test_les_ER_alpha
print("RAW data of " + str(len(Test_les_ER_alpha)) + " SMILES has been reduced to " + str(len(Test_les_ER_alpha2)) + " SMILES.")


RAW data of 1827 SMILES has been reduced to 1827 SMILES.
RAW data of 506 SMILES has been reduced to 506 SMILES.
RAW data of 402 SMILES has been reduced to 402 SMILES.


In [None]:
import os

# Create the 'model' directory if it doesn't exist
if not os.path.exists('model'):
    os.makedirs('model')

# Create the 'smiles' directory if it doesn't exist
if not os.path.exists('smiles'):
    os.makedirs('smiles')

In [None]:
# Save the cleaned Train_ER_alpha2, Test_gra_ER_alpha2, and Test_les_ER_alpha2 DataFrames as CSV files
Train_ER_alpha2.to_csv('model/Train_ER_alpha.csv', sep=',', index=False)
Test_gra_ER_alpha2.to_csv('model/Test_gra_ER_alpha.csv', sep=',', index=False)
Test_les_ER_alpha2.to_csv('model/Test_les_ER_alpha.csv', sep=',', index=False)

# Extract the 'SMILES_desalt' and 'molecule_chembl_id' columns from Train_ER_alpha2, Test_gra_ER_alpha2, and Test_les_ER_alpha2
Train_smiles = Train_ER_alpha2[['SMILES_desalt', 'molecule_chembl_id']]
Test_gra = Test_gra_ER_alpha2[['SMILES_desalt', 'molecule_chembl_id']]
Test_les = Test_les_ER_alpha2[['SMILES_desalt', 'molecule_chembl_id']]

# Save the SMILES in a .smi file without column headers for Train_ER_alpha, Test_gra_ER_alpha, and Test_les_ER_alpha
Train_smiles.to_csv('smiles/Train_ER_alpha.smi', sep='\t', header=False, index=False)
Test_gra.to_csv('smiles/Test_gra_ER_alpha.smi', sep='\t', header=False, index=False)
Test_les.to_csv('smiles/Test_les_ER_alpha.smi', sep='\t', header=False, index=False)

# Extract the 'molecule_chembl_id' and 'value' columns from Train_ER_alpha2 and save as Train_QSAR.csv
Train_QSAR = Train_ER_alpha2[['molecule_chembl_id', 'value']]
Train_QSAR.to_csv('Train_QSAR.csv', sep=',', index=False)
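
The `.smi` files written above are tab-separated with no header row: the SMILES string comes first, then the molecule ID. A minimal sketch of writing such a file and reading it back as a sanity check; `write_smi` and `read_smi` are hypothetical helper names, and the column names are the ones used in this cell.

```python
import pandas as pd


def write_smi(df, path, smiles_col="SMILES_desalt", id_col="molecule_chembl_id"):
    """Hypothetical helper: write SMILES<TAB>ID lines without a header."""
    df[[smiles_col, id_col]].to_csv(path, sep="\t", header=False, index=False)


def read_smi(path, smiles_col="SMILES_desalt", id_col="molecule_chembl_id"):
    """Hypothetical helper: read a headerless SMILES<TAB>ID file into a DataFrame."""
    return pd.read_csv(path, sep="\t", header=None, names=[smiles_col, id_col])
```

Comparing `len(read_smi('smiles/Train_ER_alpha.smi'))` with `len(Train_smiles)` would confirm that every row was written.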


In [None]:
Train_ER_alpha2.tail(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value,STATUS,SMILES_desalt
488,,,16901331,[],CHEMBL3882755,Inhibition of human IGF1R kinase domain (M954 ...,B,,,BAO_0000190,...,9606,,,IC50,nM,UO_0000065,500.0,50.0,active,COc1ccccc1CNc1ncc(Br)c(Nc2cc(C3CC3)[nH]n2)n1
490,,,2925496,[],CHEMBL1045057,Inhibition of IGF1R after 60 mins by fluoresce...,B,,,BAO_0000190,...,9606,,,IC50,nM,UO_0000065,,3.0,active,C[C@@]1(C(=O)Nc2cnccn2)CCCN1c1nc(Nc2cc(C3CC3)[...
492,,,2925494,[],CHEMBL1045057,Inhibition of IGF1R after 60 mins by fluoresce...,B,,,BAO_0000190,...,9606,,,IC50,nM,UO_0000065,,4.0,active,C[C@@]1(C(=O)Nc2nccs2)CCCN1c1nc(Nc2cc(C3CC3)n[...


In [None]:
!zip -r /content/model.zip /content/model

updating: content/model/ (stored 0%)
updating: content/model/Train_ER_alpha.csv (deflated 93%)
updating: content/model/Test_les_ER_alpha.csv (deflated 95%)
updating: content/model/Test_gra_ER_alpha.csv (deflated 91%)


In [None]:
!zip -r /content/smiles.zip /content/smiles

  adding: content/smiles/ (stored 0%)
  adding: content/smiles/Test_les_ER_alpha.smi (deflated 88%)
  adding: content/smiles/Train_ER_alpha.smi (deflated 85%)
  adding: content/smiles/Test_gra_ER_alpha.smi (deflated 76%)


In [None]:
!zip IGF1R.zip *.zip *.csv

  adding: model.zip (stored 0%)
  adding: smiles.zip (stored 0%)
  adding: check.csv (deflated 94%)
  adding: IGF1R_01_bioactivity_data_raw.csv (deflated 94%)
  adding: Train_QSAR.csv (deflated 74%)


In [None]:
! ls -l

total 5848
-rw-r--r-- 1 root root 2956041 Oct 29 06:19 check.csv
-rw-r--r-- 1 root root 2980287 Oct 29 06:19 IGF1R_01_bioactivity_data_raw.csv
drwxr-xr-x 2 root root    4096 Oct 29 06:34 model
drwxr-xr-x 1 root root    4096 Oct 26 13:24 sample_data
drwxr-xr-x 2 root root    4096 Oct 29 06:36 smiles
-rw-r--r-- 1 root root   34530 Oct 29 06:36 Train_QSAR.csv


---