# Descriptor Calculation and Dataset Preparation

## Description
PaDEL-Descriptor is software for calculating molecular descriptors and fingerprints. It currently calculates 1875 descriptors (1444 1D and 2D descriptors and 431 3D descriptors) and 12 types of fingerprints (16092 bits in total). The descriptors and fingerprints are calculated with the Chemistry Development Kit (CDK), supplemented by additional descriptors and fingerprints: atom-type electrotopological state descriptors, Crippen's logP and MR, extended topochemical atom (ETA) descriptors, McGowan volume, molecular linear free energy relation descriptors, ring counts, counts of the chemical substructures identified by Laggner, and binary fingerprints and counts of the chemical substructures identified by Klekota and Roth.

## Prepare the *PaDEL descriptors*

In [1]:
import glob
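# root_dir (Python 3.10+) returns file names relative to the folder;
# dir_fd expects an open directory file descriptor, not a path string.
xml_files = glob.glob("*.xml", root_dir='fingerprints_xml')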
xml_files.sort()
xml_files

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']
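
On Python versions older than 3.10, `glob.glob` has no `root_dir` argument. The following is a minimal sketch of an equivalent listing with `pathlib`; the helper name `list_fingerprint_xml` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_fingerprint_xml(folder="fingerprints_xml"):
    """Return the sorted names of all .xml files directly inside `folder`.

    Hypothetical helper: only bare file names are returned (no folder
    prefix), which matches the list printed above.
    """
    return sorted(path.name for path in Path(folder).glob("*.xml"))
```

Calling `list_fingerprint_xml()` from the notebook's working directory should yield the same twelve names, provided the `fingerprints_xml` folder sits next to the notebook.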

## Load the data

In [1]:
import pandas as pd
df = pd.read_csv('Data/estrogen_receptor_alpha_bioactivity_data_final.csv')
df

,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL431611,Oc1ccc2c(c1)S[C@H](C1CCCC1)[C@H](c1ccc(OCCN3CC...,active,439.621,6.0415,1.0,5.0,8.602060
1,CHEMBL316132,Oc1ccc2c(c1)S[C@H](C1CCCCCC1)[C@H](c1ccc(OCCN3...,active,467.675,6.8217,1.0,5.0,8.124939
2,CHEMBL304552,Oc1ccc([C@H]2Sc3cc(O)ccc3O[C@H]2c2ccc(OCCN3CCC...,active,463.599,5.9296,2.0,6.0,8.508638
3,CHEMBL85881,Oc1ccc2c(c1)S[C@H](CC1CCCCC1)[C@H](c1ccc(OCCN3...,active,467.675,6.8217,1.0,5.0,8.408935
4,CHEMBL85536,Oc1ccc2c(c1)S[C@H](Cc1ccccc1)[C@H](c1ccc(OCCN3...,active,461.627,6.0940,1.0,5.0,8.130768
...,...,...,...,...,...,...,...,...
3048,CHEMBL4475463,O=C(O)c1ccc2c(c1)CCCC(c1ccc(Cl)cc1Cl)=C2c1ccc(...,active,554.489,7.7997,1.0,3.0,7.238824
3049,CHEMBL5185539,CC(F)(F)c1cc(Cl)ccc1-c1sc2c(ccc3[nH]ncc32)c1Oc...,active,572.052,8.4246,1.0,5.0,7.987163
3050,CHEMBL5197035,CC(F)(F)c1cc(Cl)ccc1-c1sc2c(ccc3[nH]ncc32)c1Oc...,active,571.068,8.4577,2.0,5.0,7.882729
3051,CHEMBL5196589,O=S(=O)(Oc1cccc2ccccc12)C1CC2OC1C(c1ccc([Se]c3...,active,625.604,5.0597,1.0,5.0,6.508638


In [15]:
# Save SMILES and ChEMBL IDs as a tab-separated .smi file (no header, no index)
df_id_smiles = df[['canonical_smiles', 'molecule_chembl_id']]
df_id_smiles.to_csv('Data/molecule.smi', sep='\t', index=False, header=False)
df_id_smiles

,canonical_smiles,molecule_chembl_id
0,Oc1ccc2c(c1)S[C@H](C1CCCC1)[C@H](c1ccc(OCCN3CC...,CHEMBL431611
1,Oc1ccc2c(c1)S[C@H](C1CCCCCC1)[C@H](c1ccc(OCCN3...,CHEMBL316132
2,Oc1ccc([C@H]2Sc3cc(O)ccc3O[C@H]2c2ccc(OCCN3CCC...,CHEMBL304552
3,Oc1ccc2c(c1)S[C@H](CC1CCCCC1)[C@H](c1ccc(OCCN3...,CHEMBL85881
4,Oc1ccc2c(c1)S[C@H](Cc1ccccc1)[C@H](c1ccc(OCCN3...,CHEMBL85536
...,...,...
3048,O=C(O)c1ccc2c(c1)CCCC(c1ccc(Cl)cc1Cl)=C2c1ccc(...,CHEMBL4475463
3049,CC(F)(F)c1cc(Cl)ccc1-c1sc2c(ccc3[nH]ncc32)c1Oc...,CHEMBL5185539
3050,CC(F)(F)c1cc(Cl)ccc1-c1sc2c(ccc3[nH]ncc32)c1Oc...,CHEMBL5197035
3051,O=S(=O)(Oc1cccc2ccccc12)C1CC2OC1C(c1ccc([Se]c3...,CHEMBL5196589
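
Before handing the file to PaDEL, it can help to confirm that every line really holds a SMILES string and an ID separated by a single tab. The sketch below uses only the standard library; `check_smi_file` is a hypothetical helper, and the two-field layout is an assumption taken from the `to_csv` call above.

```python
import csv


def check_smi_file(path):
    """Count the records in a tab-separated .smi file.

    Raises ValueError on the first line that does not contain exactly
    two non-empty fields (SMILES, then molecule ID).
    """
    count = 0
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if len(row) != 2 or not row[0] or not row[1]:
                raise ValueError(
                    f"line {line_no}: expected 'SMILES<TAB>ID', got {row!r}"
                )
            count += 1
    return count
```

For the dataset shown above, `check_smi_file('Data/molecule.smi')` should return the same number of rows as `df`.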


## Calculate the *PaDEL descriptors*

In [4]:
from padelpy import padeldescriptor
from sklearn.feature_selection import VarianceThreshold

In [16]:
def calculate_descriptors(fingerprint):
    # Name the output after the XML file, e.g. 'MACCSFingerprinter.xml' -> 'MACCSFingerprinter.csv'.
    # str.removesuffix (Python 3.9+) drops only the literal '.xml' ending, whereas
    # str.strip('.xtml') removes any of those characters from both ends of the name.
    fingerprint_output_file = fingerprint.removesuffix('.xml') + '.csv'
    padeldescriptor(
        mol_dir='Data/molecule.smi',
        d_file='Calculated_descriptors/' + fingerprint_output_file,
        descriptortypes='fingerprints_xml/' + fingerprint,
        detectaromaticity=True,
        standardizenitro=True,
        standardizetautomers=True,
        threads=2,
        removesalt=True,
        log=True,
        fingerprints=True)

In [17]:
def remove_low_variance(input_data, threshold=0.1):
    # Keep only the columns that pass the variance threshold
    selection = VarianceThreshold(threshold)
    selection.fit(input_data)
    return input_data[input_data.columns[selection.get_support(indices=True)]]

### Iterating Through XML Files
A `for` loop iterates over the list of fingerprint XML files and performs the following tasks for each one:

1) **Calculate Descriptors:**
    - Call the *calculate_descriptors* function to compute the descriptors for the current fingerprint.
    - Save the calculated descriptors as a .csv file in the "Calculated_descriptors" folder.
2) **Data Processing:**
    - Read the saved .csv file and drop the 'Name' column.
    - Use the *remove_low_variance* function to eliminate descriptors with low variance (threshold 0.1).
3) **DataFrame Combination:**
    - Append the columns of the current DataFrame to those from all previous fingerprints.

**Final Output:**
- Save the combined DataFrame as *Data/padel_descriptors_cleaned.csv*.

In [18]:
df_fp_combined = pd.DataFrame()

for fingerprint in xml_files:
    calculate_descriptors(fingerprint)

    fingerprint_csv = fingerprint.removesuffix('.xml') + '.csv'
    descriptors = pd.read_csv(f'Calculated_descriptors/{fingerprint_csv}')
    
    X = descriptors.drop('Name', axis=1)
    X = remove_low_variance(X, threshold=0.1)

    df_fp_combined = pd.concat([df_fp_combined, X], axis=1)
    
df_fp_combined.to_csv('Data/padel_descriptors_cleaned.csv')
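
The filtering and combination steps can be tried without running PaDEL. The sketch below applies the same `remove_low_variance` logic to two made-up fingerprint blocks (the column names and bit patterns are invented for illustration) and then joins the survivors column-wise. For a binary bit that is set in a fraction p of the molecules, the variance is p(1 - p), so a threshold of 0.1 keeps bits set in roughly 11% to 89% of the molecules.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def remove_low_variance(input_data, threshold=0.1):
    # Same logic as in the notebook: keep columns that pass the threshold.
    selection = VarianceThreshold(threshold)
    selection.fit(input_data)
    return input_data[input_data.columns[selection.get_support(indices=True)]]


# Toy "fingerprint" blocks for 10 molecules (made-up bits, not PaDEL output).
block_a = pd.DataFrame({
    "bit_rare":   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # p = 0.1, variance 0.09
    "bit_common": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # p = 0.5, variance 0.25
})
block_b = pd.DataFrame({
    "bit_const": [1] * 10,                         # constant, variance 0.0
    "bit_mixed": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # p = 0.3, variance 0.21
})

# Filter each block separately, then join the surviving columns side by side.
filtered = [remove_low_variance(block) for block in (block_a, block_b)]
combined = pd.concat(filtered, axis=1)
print(list(combined.columns))
```

One caveat worth keeping in mind: `VarianceThreshold.fit` raises a `ValueError` when no column in a block passes the threshold, so a fingerprint type with only near-constant bits would stop the loop unless that case is handled.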

In [2]:
y = df['pIC50']
X = pd.read_csv('Data/padel_descriptors_cleaned.csv')
X = X.drop('Unnamed: 0', axis=1)
X

,APC2D1_C_C,APC2D1_C_N,APC2D1_C_O,APC2D1_C_S,APC2D1_C_F,APC2D1_C_Cl,APC2D1_C_X,APC2D1_N_N,APC2D1_O_S,APC2D2_C_C,...,SubFP137,SubFP169,SubFP172,SubFP179,SubFP181,SubFP182,SubFP184,SubFP275,SubFP287,SubFP303
0,25.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,32.0,...,0,1,0,0,0,0,0,1,0,0
1,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,34.0,...,0,1,0,0,0,0,0,1,0,0
2,26.0,3.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,33.0,...,0,1,0,0,0,0,0,1,0,0
3,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,34.0,...,0,1,0,0,0,0,0,1,0,0
4,27.0,3.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,34.0,...,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3048,32.0,3.0,4.0,0.0,1.0,2.0,3.0,0.0,0.0,44.0,...,1,0,0,0,0,0,0,1,1,0
3049,28.0,5.0,4.0,2.0,3.0,1.0,4.0,1.0,0.0,36.0,...,0,0,0,1,1,0,1,1,0,0
3050,28.0,7.0,2.0,2.0,3.0,1.0,4.0,1.0,0.0,36.0,...,0,0,0,1,1,0,1,1,0,0
3051,37.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,3.0,48.0,...,0,1,0,0,0,0,0,1,1,0
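
The `Unnamed: 0` column dropped in the cell above is the DataFrame index that `to_csv` writes by default. The sketch below shows, on a made-up two-column frame, two common ways to avoid carrying it around: reading with `index_col=0`, or writing with `index=False`. It round-trips through an in-memory buffer, so no files are touched.

```python
import io

import pandas as pd

# Made-up stand-in for a descriptor table.
toy = pd.DataFrame({"SubFP1": [0, 1], "SubFP2": [1, 1]})

# Default to_csv writes the index as an unnamed first column ...
with_index = toy.to_csv()
# ... which read_csv surfaces as 'Unnamed: 0' unless it is declared the index.
reloaded_default = pd.read_csv(io.StringIO(with_index))
reloaded_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

# Alternatively, leave the index out when writing.
without_index = toy.to_csv(index=False)
reloaded_plain = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_default.columns))
print(list(reloaded_indexed.columns))
print(list(reloaded_plain.columns))
```

Either option removes the need for a separate `drop` call after loading.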


## Store the file containing features (descriptors) and target (`pIC50`)

In [3]:
df_final = pd.concat([X, y], axis=1)
df_final.to_csv('Data/padel_descriptors_final.csv')
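
`pd.concat(..., axis=1)` aligns rows by index, so a descriptor table with missing or reordered molecules would be padded with NaN values without any warning. The following is a minimal sketch of a guard that could run before the final concatenation; `combine_features_and_target` is a hypothetical helper, and it assumes the descriptor rows are expected in the same order as the input molecules.

```python
import pandas as pd


def combine_features_and_target(X, y):
    """Join features and target column-wise after checking that rows line up.

    Hypothetical helper: raises ValueError instead of letting pd.concat
    fill misaligned rows with NaN.
    """
    if len(X) != len(y):
        raise ValueError(
            f"row count mismatch: {len(X)} feature rows vs {len(y)} targets"
        )
    if not X.index.equals(y.index):
        raise ValueError("X and y have different indexes; rows would misalign")
    return pd.concat([X, y], axis=1)
```

With the notebook's variables, `df_final = combine_features_and_target(X, y)` would produce the same table as the cell above whenever the row counts and indexes agree.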