<a href="https://colab.research.google.com/github/Spycsh/DataScienceNoteBooks/blob/main/rdkit_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install RDKit. Takes 2-3 minutes
!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

In [13]:
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

In [14]:
import numpy as np
import pandas as pd

In [143]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole

m = Chem.MolFromSmiles('[O-][Cl+3]([O-])([O-])[O-].c1ccc(-c2[nH]c[n+](CC3CO3)c2-c2ccccc2)cc1')
print(m.GetNumAtoms())

26


In [144]:
# https://www.rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html
# compute molecular descriptors
import rdkit.Chem.rdMolDescriptors as d
print(d.CalcExactMolWt(m))  # returns the molecule’s exact molecular weight
# https://www.rdkit.org/docs/source/rdkit.Chem.Fragments.html
# functions to match a bunch of fragment descriptors from a file
import rdkit.Chem.Fragments as f
print(f.fr_COO(m)) # Number of carboxylic acids (fr_Al_COO counts only the aliphatic ones)
# https://www.rdkit.org/docs/source/rdkit.Chem.Lipinski.html
# Calculation of Lipinski parameters for molecules
import rdkit.Chem.Lipinski as l
print(l.HeavyAtomCount(m)) # Number of heavy atoms (any atom that is not hydrogen) in a molecule
# Fingerprints are a special class of features that represent the presence or absence of substructures.
# They can be derived in many different ways; one that is included in RDKit is the Morgan fingerprint,
# which can be generated as follows.
# The second argument (the radius) controls the size of the substructures,
# and nBits sets how many dimensions the substructures are mapped to
# (the length of the bit vector).
from rdkit.Chem import AllChem
fp = AllChem.GetMorganFingerprintAsBitVect(m,2,nBits=124)
print(np.array(fp))

376.08259932400006
0
26
[0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0
 0 0 1 1 0 1 0 1 0 0 0 1 0]
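The length of the bit vector matters because many distinct substructures have to share a fixed number of positions. The following is a minimal sketch of that general idea only, not RDKit's actual algorithm: `fold_to_bits` is a hypothetical helper, the substructure labels are made-up strings, and `zlib.crc32` is just a stand-in hash chosen because it is deterministic.

```python
import zlib

def fold_to_bits(substructures, n_bits):
    """Set one bit per substructure label by hashing it into range(n_bits)."""
    bits = [0] * n_bits
    for label in substructures:
        index = zlib.crc32(label.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits

# Hypothetical labels; real Morgan environments are atom neighbourhoods, not strings
labels = ["c1ccccc1", "C(=O)O", "C#C", "N-C(=O)", "c-N"]

short = fold_to_bits(labels, 8)      # few positions: collisions are likely
long = fold_to_bits(labels, 2048)    # many positions: collisions are rare

# Number of bits set can never exceed the number of distinct substructures
print(sum(short), sum(long))
```

With a short vector such as the 124 bits used in this notebook, two different substructures are more likely to land on the same bit than with a longer vector, so the fingerprint length is a parameter worth varying when comparing representations.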


The task should include:

1. Which alternative representations (choice of feature sets and data preparation techniques) and learning algorithms (including parameter settings) have been evaluated. At least two different representations and two different algorithms have to be evaluated. Where applicable, also describe which approaches to combining multiple models and feature sets have been evaluated. Provide pointers to all external packages that you have used.

2. Which method has been employed to choose the model and which model has been chosen; which method has been used to estimate the AUC, and what the estimate is for the chosen model.
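Since the AUC is the quantity to be estimated, it may help to recall what it measures: the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The sketch below computes it directly from that definition. `auc_from_scores` is a hypothetical helper written for illustration and the labels and scores are made-up toy values; for real data `sklearn.metrics.roc_auc_score` is the usual choice.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one active and one inactive example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 3 of the 4 (active, inactive) pairs are ordered correctly
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(y_true_toy, y_score_toy))  # 0.75
```

The pairwise loop is quadratic in the number of examples, so it is only meant to make the definition concrete, not to be run on the full training set.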

In [25]:
# read the dataset 
train_df = pd.read_csv("training_smiles.csv")
train_df.head(5)

Unnamed: 0,INDEX,SMILES,ACTIVE
0,1,CC#CCCCC(=O)Nc1ccccc1C(=O)O,0.0
1,2,[O-][Cl+3]([O-])([O-])[O-].c1ccc(-c2[nH]c[n+](...,0.0
2,3,CCOC(=O)CSc1nnc(NC(=O)c2cccc([N+](=O)[O-])c2C)s1,0.0
3,4,O=C(CN1CCN(S(=O)(=O)c2ccccc2)CC1)Nc1ccc(Cl)c(C...,0.0
4,5,Cc1cc(NN/C=C2\C=CC(=O)C=C2O)nc(N2CCOCC2)n1,0.0


**Extract features**

* CalcExactMolWt
* HeavyAtomCount
* fr_COO
* MFbitV_x

Potential features
* number of atoms
* molecular weight
* fragments
* Lipinski parameters
* Morgan fingerprints (substructures)

## Feature extraction

For a quick start, skip this section and read
train_new_features.csv, which was saved after executing
this section.

In [26]:
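# GetNumAtoms() counts the atoms in the molecular graph; hydrogens are implicit after MolFromSmiles
train_df["HeavyAtomCount"] = train_df["SMILES"].apply(lambda x: Chem.MolFromSmiles(x).GetNumAtoms())
train_df["CalcExactMolWt"] = train_df["SMILES"].apply(lambda x: d.CalcExactMolWt(Chem.MolFromSmiles(x)))
# Note: the column is named fr_COO but is computed with fr_Al_COO (aliphatic carboxylic acids only)
train_df["fr_COO"] = train_df["SMILES"].apply(lambda x: f.fr_Al_COO(Chem.MolFromSmiles(x)))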



In [27]:
# Quick test on a copy: fill the fingerprint columns for the first 5 rows only
kk = train_df.copy()
for i in range(124):
  kk["MFbitV_"+str(i)] = 0
# kk.iloc[:5, 6:] is the 5 x 124 block of fingerprint columns
fps = train_df["SMILES"].iloc[:5].apply(lambda s: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s),2,nBits=124))
# Concatenate the five bit vectors into one flat list of 5*124 values
lists = []
for fp_i in fps:
  lists += np.array(fp_i).tolist()
kk.iloc[:5,6:] = np.array(lists).reshape(5,124)
kk

Unnamed: 0,INDEX,SMILES,ACTIVE,HeavyAtomCount,CalcExactMolWt,fr_COO,MFbitV_0,MFbitV_1,MFbitV_2,MFbitV_3,MFbitV_4,MFbitV_5,MFbitV_6,MFbitV_7,MFbitV_8,MFbitV_9,MFbitV_10,MFbitV_11,MFbitV_12,MFbitV_13,MFbitV_14,MFbitV_15,MFbitV_16,MFbitV_17,MFbitV_18,MFbitV_19,MFbitV_20,MFbitV_21,MFbitV_22,MFbitV_23,MFbitV_24,MFbitV_25,MFbitV_26,MFbitV_27,MFbitV_28,MFbitV_29,MFbitV_30,MFbitV_31,MFbitV_32,MFbitV_33,...,MFbitV_84,MFbitV_85,MFbitV_86,MFbitV_87,MFbitV_88,MFbitV_89,MFbitV_90,MFbitV_91,MFbitV_92,MFbitV_93,MFbitV_94,MFbitV_95,MFbitV_96,MFbitV_97,MFbitV_98,MFbitV_99,MFbitV_100,MFbitV_101,MFbitV_102,MFbitV_103,MFbitV_104,MFbitV_105,MFbitV_106,MFbitV_107,MFbitV_108,MFbitV_109,MFbitV_110,MFbitV_111,MFbitV_112,MFbitV_113,MFbitV_114,MFbitV_115,MFbitV_116,MFbitV_117,MFbitV_118,MFbitV_119,MFbitV_120,MFbitV_121,MFbitV_122,MFbitV_123
0,1,CC#CCCCC(=O)Nc1ccccc1C(=O)O,0.0,18,245.105193,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0
1,2,[O-][Cl+3]([O-])([O-])[O-].c1ccc(-c2[nH]c[n+](...,0.0,26,376.082599,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,1,0,1,0,1,0,0,0,1,0
2,3,CCOC(=O)CSc1nnc(NC(=O)c2cccc([N+](=O)[O-])c2C)s1,0.0,25,382.040562,0,0,0,0,0,1,1,0,1,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,0,1,1,0,0,...,1,0,1,0,0,1,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,1,1,0,1,0
3,4,O=C(CN1CCN(S(=O)(=O)c2ccccc2)CC1)Nc1ccc(Cl)c(C...,0.0,27,427.052418,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0
4,5,Cc1cc(NN/C=C2\C=CC(=O)C=C2O)nc(N2CCOCC2)n1,0.0,24,329.148789,0,0,0,1,1,0,0,0,0,0,1,1,0,1,0,0,1,0,1,1,1,0,0,0,1,0,1,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,0,1,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121369,121370,O=C(NCc1cccs1)C1CCCN(S(=O)(=O)c2cnc[nH]2)C1,0.0,23,354.082032,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
121370,121371,COc1ccc(Cn2nc(C)cc2C)cc1OC,0.0,18,246.136828,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
121371,121372,Cc1ccc(-c2nn(-c3cc(Cl)ccc3[N+](=O)[O-])c(=O)c3...,0.0,29,405.088019,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
121372,121373,O=C(OCCN1C(=O)c2ccccc2C1=O)c1cccc(OC(F)F)c1,0.0,26,361.076179,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [28]:
len(train_df)

121374

In [29]:
for i in range(124):
  train_df["MFbitV_"+str(i)] = 0

import itertools

# One 124-bit Morgan fingerprint per molecule (121374 x 124 values in total)
MF_x = train_df["SMILES"].apply(lambda x: np.array(AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(x),2,nBits=124)))

print("wait 5 minutes please...")
# Flatten the per-molecule arrays into one long list; it is reshaped in the next cell
vector_lists = list(itertools.chain(*MF_x))



wait 5 minutes please...


In [30]:
train_df.iloc[:,6:] = np.array(vector_lists).reshape(len(train_df),124)
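As an alternative to pre-creating 124 zero columns, flattening, and reshaping, the per-molecule arrays can be stacked into a matrix and attached in one step. This is only a sketch under assumptions: the fingerprints below are random stand-in bits rather than real RDKit output, and `base`, `bit_matrix`, `bits_df`, and `features` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

N_BITS = 124  # same fingerprint length as in this notebook

# Stand-in for the per-molecule fingerprint arrays (random bits, not real chemistry)
rng = np.random.default_rng(0)
fingerprints = [rng.integers(0, 2, size=N_BITS) for _ in range(5)]

base = pd.DataFrame({"INDEX": range(1, 6), "ACTIVE": [0.0] * 5})

# Stack the 1-D arrays into an (n_molecules, N_BITS) matrix and name the columns
bit_matrix = np.vstack(fingerprints)
bit_columns = [f"MFbitV_{i}" for i in range(N_BITS)]
bits_df = pd.DataFrame(bit_matrix, columns=bit_columns, index=base.index)

# One concat instead of 124 separate column assignments
features = pd.concat([base, bits_df], axis=1)
print(features.shape)  # (5, 126)
```

Selecting the fingerprint block by column name (`features[bit_columns]`) also avoids relying on a positional offset such as `iloc[:, 6:]`, which silently breaks if a column is added or removed.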

In [31]:
train_df.to_csv('train_new_features.csv', index=False)

If you skipped the section above, read train_new_features.csv directly; it contains the features that have already been
extracted.

In [32]:
train_df = pd.read_csv('train_new_features.csv')

## Data preprocessing
* column filter
* imputation
* normalization
* discretization
* train test split

In [119]:
from sklearn import model_selection
df_X = train_df.drop(columns=['INDEX','SMILES','ACTIVE'])
y = train_df['ACTIVE']

Normalization across instances should be done after splitting the data into training and test sets, using only statistics computed from the training set.

This is because the test set plays the role of fresh, unseen data, so it is not supposed to be accessible at the training stage. Using any information from the test set before or during training can bias the evaluation of the performance.
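In scikit-learn terms this means calling `fit` (or `fit_transform`) on the training data only and plain `transform` on the test data. A minimal sketch with made-up numbers; `X_train_demo`, `X_test_demo`, and `scaler_demo` are illustrative names, not variables from this notebook:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[15.0], [40.0]])

scaler_demo = MinMaxScaler()
# fit_transform learns min=10 and max=30 from the training data only
train_scaled = scaler_demo.fit_transform(X_train_demo)
# transform reuses the training min/max; nothing is learned from the test data
test_scaled = scaler_demo.transform(X_test_demo)

print(train_scaled.ravel().tolist())
print(test_scaled.ravel().tolist())  # 40 maps to 1.5, outside [0, 1]
```

A test value outside the training range ends up outside [0, 1]. That is expected: it reflects how the model would see genuinely new data, whereas refitting the scaler on the test set would hide it.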

In [134]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df_X, y, test_size=0.2, random_state=np.random.randint(1000)) # seed is drawn at random, so the split differs on every run; pass a fixed integer to make it reproducible
y_train.head()

26600    0.0
33380    0.0
96435    0.0
11457    0.0
24424    0.0
Name: ACTIVE, dtype: float64
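Two optional refinements to the split are a fixed seed, so the same rows end up in the test set on every run, and `stratify`, so both parts keep the same share of actives. The sketch below uses made-up labels with 5% actives; that ratio is an assumption for illustration, not a measured property of training_smiles.csv, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 actives, 950 inactives
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,    # fixed seed: identical split on every run
    stratify=y_demo,    # keep the active/inactive ratio in both parts
)

print(y_tr.mean(), y_te.mean())  # both parts contain 5% actives
```

Without stratification a rare class can be over- or under-represented in the test set by chance, which makes an AUC estimate noisier.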

In [121]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
from sklearn.preprocessing import KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=10, encode="ordinal")

In [136]:
df_X.columns

Index(['HeavyAtomCount', 'CalcExactMolWt', 'fr_COO', 'MFbitV_0', 'MFbitV_1',
       'MFbitV_2', 'MFbitV_3', 'MFbitV_4', 'MFbitV_5', 'MFbitV_6',
       ...
       'MFbitV_114', 'MFbitV_115', 'MFbitV_116', 'MFbitV_117', 'MFbitV_118',
       'MFbitV_119', 'MFbitV_120', 'MFbitV_121', 'MFbitV_122', 'MFbitV_123'],
      dtype='object', length=127)

In [137]:
X_train = imp_mean.fit_transform(X_train)
X_train[:,[0,1]] = scaler.fit_transform(X_train[:,[0,1]])
X_train[:,[0,1]] = kbd.fit_transform(X_train[:,[0,1]])
X_train = pd.DataFrame(X_train, columns=df_X.columns)
X_train
# df_X = scaler.fit_transform(X_train)
# df_X = kbd.fit_transform(df_X)
# pd.DataFrame(df_X)

Unnamed: 0,HeavyAtomCount,CalcExactMolWt,fr_COO,MFbitV_0,MFbitV_1,MFbitV_2,MFbitV_3,MFbitV_4,MFbitV_5,MFbitV_6,MFbitV_7,MFbitV_8,MFbitV_9,MFbitV_10,MFbitV_11,MFbitV_12,MFbitV_13,MFbitV_14,MFbitV_15,MFbitV_16,MFbitV_17,MFbitV_18,MFbitV_19,MFbitV_20,MFbitV_21,MFbitV_22,MFbitV_23,MFbitV_24,MFbitV_25,MFbitV_26,MFbitV_27,MFbitV_28,MFbitV_29,MFbitV_30,MFbitV_31,MFbitV_32,MFbitV_33,MFbitV_34,MFbitV_35,MFbitV_36,...,MFbitV_84,MFbitV_85,MFbitV_86,MFbitV_87,MFbitV_88,MFbitV_89,MFbitV_90,MFbitV_91,MFbitV_92,MFbitV_93,MFbitV_94,MFbitV_95,MFbitV_96,MFbitV_97,MFbitV_98,MFbitV_99,MFbitV_100,MFbitV_101,MFbitV_102,MFbitV_103,MFbitV_104,MFbitV_105,MFbitV_106,MFbitV_107,MFbitV_108,MFbitV_109,MFbitV_110,MFbitV_111,MFbitV_112,MFbitV_113,MFbitV_114,MFbitV_115,MFbitV_116,MFbitV_117,MFbitV_118,MFbitV_119,MFbitV_120,MFbitV_121,MFbitV_122,MFbitV_123
0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,6.0,6.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0
2,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,8.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
4,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97094,6.0,7.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
97095,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
97096,8.0,7.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
97097,6.0,6.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0


In [139]:
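# Use transform (not fit_transform) so the test set is processed with the
# statistics and bin edges learned from X_train in the cell above
X_test = imp_mean.transform(X_test)
X_test[:,[0,1]] = scaler.transform(X_test[:,[0,1]])
X_test[:,[0,1]] = kbd.transform(X_test[:,[0,1]])
X_test = pd.DataFrame(X_test, columns=df_X.columns)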

## Building Models
* Naive Bayes (BernoulliNB)
* DecisionTree
* RandomForest
* MLPClassifier(NN)
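One possible way to compare the four listed model families is stratified k-fold cross-validation scored by AUC. The sketch below runs on small synthetic binary data so that it is self-contained; the data, the hyperparameters (`max_depth=5`, `n_estimators=50`, `hidden_layer_sizes=(16,)`, `max_iter=300`), and the choice of 5 folds are all assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for bit-vector features: the label is the AND of two bits, with 10% label noise
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 20))
signal = X_demo[:, 0] * X_demo[:, 1]
noise = rng.random(300) < 0.1
y_demo = np.where(noise, 1 - signal, signal)

models = {
    "BernoulliNB": BernoulliNB(),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "MLPClassifier": MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
}

# The same folds are reused for every model so the comparison is paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: mean AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the model is chosen by its cross-validated AUC on the training portion, the held-out test set can then be scored once with the chosen model to obtain an estimate that the selection step has not influenced.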
