This notebook shows how to train a CYP classifier and save it to a file.
The data used to train this model was taken from:

Wang, N.N., Wang, X.G., Xiong, G.L., Yang, Z.Y., Lu, A.P., Chen, X., Liu, S., Hou, T.J. and Cao, D.S., 2022. Machine learning to predict metabolic drug interactions related to cytochrome P450 isozymes. Journal of Cheminformatics, 14(1), p.23.     
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00602-x

In [1]:
import pandas as pd
from tqdm import tqdm
import useful_rdkit_utils as uru
from cyp_classifier import CYPClassifier, CYPModel

The input spreadsheet contains data for substrates and inhibitors of CYP1A2, CYP2C9, CYP2C19, CYP2D6, and CYP3A4. The spreadsheet tabs are labeled with enzyme and type, such as "1A2_Sub" or "1A2_Inh", where "Sub" stands for substrates and "Inh" stands for inhibitors.
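
As a small illustration of the naming scheme described above, the sketch below builds the tab names and splits each one back into enzyme and type. Only "1A2_Sub" and "1A2_Inh" are quoted from the spreadsheet; the short names for the other four enzymes are assumed by analogy, and `parse_tab_name` is a hypothetical helper, not part of `cyp_classifier`.

```python
# Short enzyme names; only "1A2" is confirmed by the text, the rest are assumed by analogy.
ENZYMES = ["1A2", "2C9", "2C19", "2D6", "3A4"]
# "Sub" = substrates, "Inh" = inhibitors
MODES = {"Sub": "substrate", "Inh": "inhibitor"}


def parse_tab_name(tab_name):
    """Split a tab name such as '1A2_Sub' into (target, mode)."""
    target, mode = tab_name.split("_")
    if mode not in MODES:
        raise ValueError(f"unexpected mode in tab name: {tab_name!r}")
    return target, mode


# Five enzymes times two modes gives ten tab names.
tab_names = [f"{enzyme}_{mode}" for enzyme in ENZYMES for mode in MODES]
for name in tab_names:
    print(name, parse_tab_name(name))
```

Five enzymes and two types give ten tabs, which matches the ten sheets read in the next cell.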

Read the data from the spreadsheet and put it into a list of dataframes. Each dataframe is labeled with the target, mode (Inh, Sub), and dataset (target_mode).

In [2]:
excel_path = '13321_2022_602_MOESM1_ESM.xlsx'  # adjust path as needed

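# Fingerprint generator, used after the sheets are combined
smi2fp = uru.Smi2Fp()

# Read only the SMILES and Label columns from all sheets;
# sheet_name=None returns a dict mapping tab name to dataframe
all_sheets = pd.read_excel(excel_path, sheet_name=None, usecols=['SMILES', 'Label'])
dataframes = []
for tab_name, df in tqdm(all_sheets.items()):
    # Tab names look like "1A2_Sub": enzyme, underscore, mode
    df['target'] = tab_name.split('_')[0]
    df['mode'] = tab_name.split('_')[1]
    df['dataset'] = tab_name
    dataframes.append(df)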

100%|██████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 3246.87it/s]


Concatenate the list into a single dataframe, calculate fingerprints, and drop rows with invalid SMILES. RDKit prints a parse error below for each invalid SMILES in the input.

In [3]:
combo_df = pd.concat(dataframes)
num_before = len(combo_df)
combo_df['fp'] = combo_df['SMILES'].apply(smi2fp.get_np)
combo_df.dropna(subset=['fp'],inplace=True)
num_after = len(combo_df)
print(f'{num_before-num_after} rows removed due to invalid SMILES')

[15:31:28] SMILES Parse Error: syntax error while parsing: [Mg-H2][O+H2]
[15:31:28] SMILES Parse Error: Failed parsing SMILES '[Mg-H2][O+H2]' for input: '[Mg-H2][O+H2]'
[15:31:28] SMILES Parse Error: syntax error while parsing: [i+]1c2c(c3c1cccc3)cccc2
[15:31:28] SMILES Parse Error: Failed parsing SMILES '[i+]1c2c(c3c1cccc3)cccc2' for input: '[i+]1c2c(c3c1cccc3)cccc2'
[15:31:28] SMILES Parse Error: syntax error while parsing: P(=O)([O-])(C)C1=CC[N+H2]CC1
[15:31:28] SMILES Parse Error: Failed parsing SMILES 'P(=O)([O-])(C)C1=CC[N+H2]CC1' for input: 'P(=O)([O-])(C)C1=CC[N+H2]CC1'
[15:31:28] SMILES Parse Error: syntax error while parsing: O=C(O)C1([N-H])CCCC1
[15:31:28] SMILES Parse Error: Failed parsing SMILES 'O=C(O)C1([N-H])CCCC1' for input: 'O=C(O)C1([N-H])CCCC1'
[15:31:28] SMILES Parse Error: syntax error while parsing: Clc1cc2N(CCC[N+H](C)C)c3c(Sc2cc1)cccc3
[15:31:28] SMILES Parse Error: Failed parsing SMILES 'Clc1cc2N(CCC[N+H](C)C)c3c(Sc2cc1)cccc3' for input: 'Clc1cc2N(CCC[N+H](C)C

106 rows removed due to invalid SMILES


[15:31:30] SMILES Parse Error: syntax error while parsing: [O+H]=C(Nc1c(C)cc(C)cc1)c1c(O)cc2c(c1)cccc2
[15:31:30] SMILES Parse Error: Failed parsing SMILES '[O+H]=C(Nc1c(C)cc(C)cc1)c1c(O)cc2c(c1)cccc2' for input: '[O+H]=C(Nc1c(C)cc(C)cc1)c1c(O)cc2c(c1)cccc2'
[15:31:31] SMILES Parse Error: syntax error while parsing: O=[N+H]C=C1N(C)C=CC=C1
[15:31:31] SMILES Parse Error: Failed parsing SMILES 'O=[N+H]C=C1N(C)C=CC=C1' for input: 'O=[N+H]C=C1N(C)C=CC=C1'
[15:31:31] SMILES Parse Error: syntax error while parsing: O(CC)c1cc2c(N(C)C(=CC=C3C(=O)c4[n+H]cccc4C(C)=C3)C=C2)cc1
[15:31:31] SMILES Parse Error: Failed parsing SMILES 'O(CC)c1cc2c(N(C)C(=CC=C3C(=O)c4[n+H]cccc4C(C)=C3)C=C2)cc1' for input: 'O(CC)c1cc2c(N(C)C(=CC=C3C(=O)c4[n+H]cccc4C(C)=C3)C=C2)cc1'
[15:31:31] SMILES Parse Error: syntax error while parsing: O=C(O)C=1C(=O)Oc2c(CC(OC)[C-H2])cccc2C=1
[15:31:31] SMILES Parse Error: Failed parsing SMILES 'O=C(O)C=1C(=O)Oc2c(CC(OC)[C-H2])cccc2C=1' for input: 'O=C(O)C=1C(=O)Oc2c(CC(OC)[C-H2])cccc
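
Most of the failing strings above appear to share one pattern: a bracket atom that writes the charge before the hydrogen count, such as `[N+H2]`, whereas SMILES expects the hydrogen count first, as in `[NH2+]`. The sketch below is a diagnostic aid only, using a hypothetical helper `charge_before_h` built on a regular expression. It flags that one pattern, does not repair anything, and does not cover other failures such as `[i+]`.

```python
import re

# Bracket atom in which a charge sign is followed by a hydrogen count,
# e.g. "[N+H2]" or "[Mg-H2]". In valid SMILES the order is "[NH2+]".
CHARGE_BEFORE_H = re.compile(r"\[[^\]]*[+-]\d*H[^\]]*\]")


def charge_before_h(smiles):
    """Return the bracket atoms in `smiles` that put the charge before the H count."""
    return CHARGE_BEFORE_H.findall(smiles)


for smi in ["[Mg-H2][O+H2]", "O=C(O)C1([N-H])CCCC1", "C[NH3+]", "CCO"]:
    print(smi, charge_before_h(smi))
```

Running a check like this on the dropped rows would show how many of the 106 removals come from this ordering problem rather than from other issues.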

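Train one model for each dataset (each enzyme and mode combination) and collect the fitted models in a dictionary keyed by dataset name. This dictionary is what lets all the CYP inhibitor and substrate models be combined into a single class and run on SMILES.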

In [4]:
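model_dict = {}
# groupby already yields the rows for each dataset, so no extra query is needed
for dataset, dataset_df in tqdm(combo_df.groupby('dataset')):
    clf = CYPModel()
    clf.fit(dataset_df['SMILES'].values, dataset_df['Label'].values)
    # Store the fitted model held in clf.clf, keyed by dataset name (e.g. "1A2_Sub")
    model_dict[dataset] = clf.clf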

100%|████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:13<00:00,  1.36s/it]


Instantiate a CYPClassifier object from the dictionary of trained models.

In [5]:
cyp_classifier = CYPClassifier(model_dict)

Save the model to disk.

In [6]:
cyp_classifier.save('CYP_classifier.pkl')

A new CYPClassifier can be instantiated from the stored model file.

In [7]:
new_classifier = CYPClassifier('CYP_classifier.pkl')

The new model can be used to make predictions by supplying the model id (the dataset name, such as '1A2_Sub') followed by a list of SMILES.

In [8]:
new_classifier.predict('1A2_Sub', ['CCO'])

array([1])
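
To run the same molecules through several models, one option is to loop over the model ids and call `predict` once per id. The sketch below assumes only the `predict(model_id, smiles_list)` call shown above. `predict_all` is a hypothetical helper, and `StubClassifier` is a stand-in that returns dummy labels so the example runs without a trained model; in the notebook, `new_classifier` would be passed instead.

```python
def predict_all(classifier, model_ids, smiles_list):
    """Return {model_id: predictions} by calling predict once per model id."""
    return {
        model_id: classifier.predict(model_id, smiles_list)
        for model_id in model_ids
    }


class StubClassifier:
    """Hypothetical stand-in for CYPClassifier; always predicts label 0."""

    def predict(self, model_id, smiles_list):
        return [0 for _ in smiles_list]


results = predict_all(StubClassifier(), ["1A2_Sub", "1A2_Inh"], ["CCO", "c1ccccc1"])
for model_id, labels in results.items():
    print(model_id, labels)
```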