In [7]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem
import pickle
from tqdm import tqdm
tqdm.pandas()
from rdkit.Chem import DataStructs
from sklearn.impute import KNNImputer
from feature_engine.selection import DropCorrelatedFeatures

First, let's load the grouped table, which contains data on critical properties and molecular structures. We drop the columns **Is_organic** and **num_fragments**, as they are no longer needed, and rebuild the **mol** column, because the molecule objects were stored as plain strings in the .csv file.

In [3]:
grouped_table = pd.read_csv('grouped_table.csv', index_col = 0).reset_index(drop = True).drop(columns = ['Is_organic', 'num_fragments'])
grouped_table['mol'] = grouped_table['SMILES'].apply(lambda x: Chem.MolFromSmiles(x))
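
`Chem.MolFromSmiles` returns `None` when it cannot parse a string, and such entries can cause problems in the later steps. A minimal sketch of a sanity check, using a small hypothetical table (`demo_table`, with illustrative SMILES strings) in place of `grouped_table`:

```python
import pandas as pd
from rdkit import Chem

# Hypothetical stand-in for grouped_table; "C1CC" has an unclosed ring and is invalid
demo_table = pd.DataFrame({'SMILES': ['CCO', 'c1ccccc1', 'C1CC']})
demo_table['mol'] = demo_table['SMILES'].apply(Chem.MolFromSmiles)

# MolFromSmiles returns None for strings it cannot parse
invalid_mask = demo_table['mol'].isna()
print('Unparsable SMILES:', demo_table.loc[invalid_mask, 'SMILES'].tolist())

# Keep only the rows that have a valid molecule object
demo_table = demo_table[~invalid_mask].reset_index(drop=True)
```

If the same check on the real table reports nothing, the table can be used as is.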

We define a function that builds a pandas DataFrame of descriptors for a list of molecules, and then apply it to our molecules.

In [4]:
def mols_to_descriptors(list_of_mols):
    # Dictionary to be transformed into a DataFrame: descriptor name -> list of values
    descriptor_dict = {name: [] for name, _ in Descriptors.descList}
    for mol in tqdm(list_of_mols):
        # descList is a list of (name, function) tuples
        for descriptor, func in Descriptors.descList:
            try:
                # Append the numeric value to the list for this descriptor
                descriptor_dict[descriptor].append(func(mol))
            except Exception:
                # If the descriptor cannot be computed for this molecule, store NaN
                descriptor_dict[descriptor].append(np.nan)
    return pd.DataFrame(descriptor_dict)

In [5]:
desc_df = mols_to_descriptors(grouped_table['mol'])

100%|██████████| 7104/7104 [00:40<00:00, 175.44it/s]


Some descriptors are highly correlated with others, so we use [Feature-engine](https://feature-engine.trainindata.com/en/latest/) to find the redundant ones and drop them. With a correlation threshold of 0.9, we remove 52 of the 208 descriptors (25%).

In [6]:
print('Before dropping', desc_df.shape[1])
dropper = DropCorrelatedFeatures(threshold=0.9)
desc_df = dropper.fit_transform(desc_df)
print('After dropping', desc_df.shape[1])

Before dropping 208
After dropping 156
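
To make the idea behind this step concrete, here is a simplified sketch of correlation-based feature dropping written with plain pandas. It is an illustration only: the helper `drop_correlated` and the toy columns are hypothetical, and Feature-engine's actual implementation may differ in details such as the order in which features are examined.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Simplified illustration: keep the first feature of each correlated group."""
    corr = df.corr().abs()
    to_drop = set()
    for i, col in enumerate(corr.columns):
        if col in to_drop:
            continue
        # Later columns strongly correlated with a kept column are marked for removal
        for other in corr.columns[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.add(other)
    return df.drop(columns=sorted(to_drop)), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'a_scaled': 2 * a + 1,       # linear function of 'a', so perfectly correlated
    'b': rng.normal(size=200),   # generated independently of 'a'
})
reduced, dropped = drop_correlated(demo)
print(dropped)
print(reduced.columns.tolist())
```

Only one member of each highly correlated group is kept, which is why the descriptor count shrinks while most of the information is retained.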


Let's save the names of the dropped descriptors so that the same columns can be removed later, at the stage where critical properties are calculated.

In [10]:
with open('features_to_drop.pickle', 'wb') as out:
    pickle.dump(dropper.features_to_drop_, out)
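
A rough sketch of how the saved names might be reused later: load them back and drop the same columns from descriptors computed for new molecules. The column names and the `new_desc_df` table below are hypothetical, and a temporary directory is used so the example does not touch the real *features_to_drop.pickle*.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical names standing in for dropper.features_to_drop_
features_to_drop = ['desc_b']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'features_to_drop.pickle'
    with open(path, 'wb') as out:
        pickle.dump(features_to_drop, out)

    # Later, for example in a calculator script: load the names back
    with open(path, 'rb') as inp:
        loaded = pickle.load(inp)

# Hypothetical descriptors computed for new molecules
new_desc_df = pd.DataFrame({'desc_a': [1.0, 2.0],
                            'desc_b': [3.0, 4.0],
                            'desc_c': [5.0, 6.0]})
# list(...) makes this work whether the names were stored as a list or a set
new_desc_df = new_desc_df.drop(columns=list(loaded))
print(new_desc_df.columns.tolist())
```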

Some descriptors cannot be computed for some molecules, which leaves *NaN* values in the table. To avoid problems in later computations, we fill these gaps with a simple KNNImputer. We then save the fitted imputer, as it is useful in the final calculators (*calculator_py.py* and *calculator_notebook.ipynb*).

In [22]:
imputer = KNNImputer()
desc_df_columns = desc_df.columns
desc_df = pd.DataFrame(imputer.fit_transform(desc_df), columns = desc_df_columns)

In [24]:
with open('imputer.pickle', 'wb') as out:
    pickle.dump(imputer, out)
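
As a small illustration of why the fitted imputer is worth saving, the sketch below fits a `KNNImputer` on a toy table, restores it from its pickled form, and applies it to a new row. The tables and the `n_neighbors=2` setting are assumptions chosen for the tiny example; the notebook itself uses the default settings.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical training descriptors with one missing value
train_df = pd.DataFrame({
    'desc_a': [1.0, 2.0, 3.0, 4.0],
    'desc_b': [10.0, np.nan, 30.0, 40.0],
})
demo_imputer = KNNImputer(n_neighbors=2)
train_filled = pd.DataFrame(demo_imputer.fit_transform(train_df),
                            columns=train_df.columns)

# Round trip through pickle, as when loading a saved imputer elsewhere
restored = pickle.loads(pickle.dumps(demo_imputer))

# New data must have the same columns, in the same order, as the training table
new_df = pd.DataFrame({'desc_a': [3.6], 'desc_b': [np.nan]})
new_filled = pd.DataFrame(restored.transform(new_df), columns=new_df.columns)
print(new_filled)
```

The restored imputer fills gaps in new data using neighbors from the table it was fitted on, so new molecules are treated consistently with the training set.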

We also create circular (Morgan) fingerprints for all molecules and combine them with the numeric descriptors to include structural information in the model. We define a function and then build a separate DataFrame of fingerprints.

In [19]:
def mols_to_fingerprints(list_of_mols):
    # fp_list collects each fingerprint as a NumPy array and is stacked into a DataFrame at the end
    fp_list = []
    for mol in tqdm(list_of_mols):
        # 2048 bits is the default length; it is stated explicitly to match the array below
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius = 3, nBits = 2048)
        # Destination array that receives the fingerprint bits
        dest_array = np.zeros(2048)
        DataStructs.ConvertToNumpyArray(fp, dest_array)
        fp_list.append(dest_array)
    # The final DataFrame has one column per bit: fp0, fp1 ... fp2047
    return pd.DataFrame(np.stack(fp_list), columns = ['fp{}'.format(i) for i in range(2048)])

In [20]:
fp_df = mols_to_fingerprints(grouped_table['mol'])

100%|██████████| 7104/7104 [00:00<00:00, 9218.16it/s]


Finally, we concatenate the original table with the descriptor and fingerprint tables and save the result as *table_with_descriptors.csv*, which is used in the subsequent work with fully connected neural networks.

In [23]:
concatenated_table = pd.concat([grouped_table.reset_index(drop = True), desc_df, fp_df], axis = 1)
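
The `reset_index(drop = True)` call matters because `pd.concat` with `axis = 1` matches rows by index label rather than by position. A minimal sketch with two hypothetical toy tables shows what happens when the indices differ:

```python
import pandas as pd

# Hypothetical tables: 'left' kept an old, non-contiguous index after filtering
left = pd.DataFrame({'SMILES': ['CCO', 'CCC']}, index=[5, 9])
right = pd.DataFrame({'fp0': [1.0, 0.0]})  # default index 0, 1

# Without resetting, rows are matched by index label, so nothing lines up
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)  # four rows, padded with NaN

# With a fresh 0..n-1 index on both sides, rows line up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned.shape)
```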

In [171]:
concatenated_table.to_csv('table_with_descriptors.csv')