
# Compound Classification Challenge

This notebook is a starting point for the challenge. As a simple demo, we use a Random Forest classifier with Morgan fingerprints as the feature vectors.

In [None]:
import numpy as np
import pandas as pd
import rdkit.Chem as Chem
import rdkit.Chem.AllChem as AllChem
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

## Data

Let's load the compound data file.

In [None]:
cmpd_df = pd.read_csv('../data/cmpd.csv')
cmpd_df.head()

| | inchikey | smiles | group | activity |
|---|---|---|---|---|
| 0 | FNHKPVJBJVTLMP-UHFFFAOYSA-N | CNC(=O)c1cc(Oc2ccc(NC(=O)Nc3ccc(Cl)c(C(F)(F)F)... | train | active |
| 1 | CUDVHEFYRIWYQD-UHFFFAOYSA-N | CNC(=O)c1cccc2cc(Oc3ccnc4cc(OCC5(N)CC5)c(OC)cc... | train | active |
| 2 | TTZSNFLLYPYKIL-UHFFFAOYSA-N | Cc1cc2cc(Oc3ccnc(Nc4cccc(CS(=O)(=O)NCCN(C)C)c4... | test | active |
| 3 | UOVCGJXDGOGOCZ-UHFFFAOYSA-N | COc1cc2c(cc1F)C(c1ccccc1Cl)=Nc1c(n[nH]c1C)N2 | train | active |
| 4 | CUIHSIWYWATEQL-UHFFFAOYSA-N | Cc1ccc(Nc2nccc(N(C)c3ccc4c(C)n(C)nc4c3)n2)cc1S... | test | active |



In [None]:
cmpd_df.shape

(5530, 4)

In [None]:
cmpd_df.isnull().sum()

inchikey    0
smiles      0
group       0
activity    0
dtype: int64

In [None]:
# Distinct activity labels, in order of first appearance
activity_labels = list(cmpd_df.activity.unique())
print(activity_labels)

['active', 'inactive', 'unknown', 'intermediate']


In [None]:
unk = cmpd_df[cmpd_df['activity']=='unknown']
inter = cmpd_df[cmpd_df['activity']=='intermediate']
print('unk length =',len(unk),'\nintermediate length =',len(inter))

unk length = 599 
intermediate length = 341


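There are 5530 compound samples with the following columns:
* smiles - the 2D compound structure as a SMILES string
* inchikey - a hash derived from the InChI identifier
* group - a tag that splits the dataset into train and test
* activity - the target label (active, inactive, intermediate, or unknown)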
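
Before modelling, it can help to see how the activity labels are spread across the train/test split. The following is a minimal sketch on a toy frame whose rows are invented for illustration; with the real data one would pass `cmpd_df` instead of the hypothetical `toy_df`.

```python
import pandas as pd

# Toy stand-in for cmpd_df; these rows are made up for illustration only.
toy_df = pd.DataFrame({
    "group": ["train", "train", "train", "test", "test", "test"],
    "activity": ["active", "inactive", "unknown",
                 "active", "intermediate", "active"],
})

# Rows: split tag; columns: activity label; cells: number of compounds.
label_counts = pd.crosstab(toy_df["group"], toy_df["activity"])
print(label_counts)
```

A table like this makes class imbalance, and any difference between the two splits, visible at a glance.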

In [None]:
cmpd_df['mol'] = cmpd_df.smiles.apply(Chem.MolFromSmiles)

In [None]:
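# With minimal modification, we obtain the fingerprint vector using RDKit.

def get_Xy(df):
    # Morgan fingerprint of radius 4, folded to a 2048-bit vector per molecule
    X = np.vstack(df.mol.apply(lambda m: list(AllChem.GetMorganFingerprintAsBitVect(m, 4, nBits=2048))))
    # Binary target: 1.0 for 'active', 0.0 for every other label
    # (this includes 'unknown' and 'intermediate')
    y = df.activity.eq('active').astype(float).to_numpy()
    return X, y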

In [None]:
X_train, y_train = get_Xy(cmpd_df[cmpd_df.group.eq('train')])
X_test, y_test = get_Xy(cmpd_df[cmpd_df.group.eq('test')])
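
Because `get_Xy` builds the target with `activity.eq('active')`, compounds labelled `unknown` or `intermediate` are treated as negatives. The sketch below shows that mapping on a toy frame, next to one possible variant that drops the ambiguous labels first. The helper `binarize_activity` and its `drop_ambiguous` flag are hypothetical names introduced here, not part of the notebook, and whether dropping rows is acceptable depends on how the challenge is scored.

```python
import pandas as pd

def binarize_activity(df, drop_ambiguous=False):
    """Hypothetical helper: return (possibly filtered frame, 0/1 labels)."""
    if drop_ambiguous:
        # Keep only rows with an unambiguous label.
        df = df[df["activity"].isin(["active", "inactive"])]
    # Same mapping as get_Xy: 'active' -> 1.0, everything else -> 0.0.
    y = df["activity"].eq("active").astype(float).to_numpy()
    return df, y

# Toy frame covering the four labels seen in the dataset.
toy_df = pd.DataFrame(
    {"activity": ["active", "inactive", "unknown", "intermediate"]}
)

_, y_all = binarize_activity(toy_df)
kept, y_clean = binarize_activity(toy_df, drop_ambiguous=True)
print(y_all)
print(y_clean)
```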

In [None]:
for i in cmpd_df.activity:
    print(i)

active
active
active
...
active
active
active

## Model: Random Forest

Random Forest (RF) is probably the simplest classifier for numerical feature vectors that works without much tuning, which makes it a good starting point for our model exploration.

In [None]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.875724404378622
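
For a scikit-learn classifier, `score` reports plain accuracy, which is easier to interpret next to a majority-class baseline. The sketch below runs that comparison on synthetic data from `make_classification`. The data, the 30/70 class weights, and the fixed `random_state` are assumptions made for illustration and say nothing about the compound dataset itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustrative only).
X, y = make_classification(n_samples=600, n_features=50,
                           weights=[0.3, 0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(f"majority-class accuracy: {baseline_acc:.3f}")
print(f"random forest accuracy:  {forest_acc:.3f}")
```

If the two numbers are close, the accuracy mostly reflects class imbalance rather than anything the model has learned.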

In [None]:
y_pred = clf.predict_proba(X_test)[:, 1]

In [None]:
# logloss
metrics.log_loss(y_test, y_pred, labels=[0, 1])

0.42301392571550755
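
To make the metric concrete, binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities. The sketch below writes it out in NumPy. The function name `binary_log_loss` and the clipping constant `eps` are assumptions for illustration. The clipping matters because a forest can output probabilities of exactly 0 or 1, for which the logarithm would be infinite.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for hard 0/1 predictions.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Two confident, correct predictions and one undecided prediction.
loss = binary_log_loss([1, 0, 1], [0.8, 0.2, 0.5])
print(loss)
```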

In [None]:
# AUC PRC
precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred, pos_label=1)
metrics.auc(recall, precision)

0.8793711988076696
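
The cell above integrates the precision-recall curve with the trapezoidal rule. scikit-learn also offers `average_precision_score`, which summarizes the same curve as a step-wise sum and can give a somewhat different number. The toy labels and scores in this sketch are invented to show that the two summaries need not agree.

```python
import sklearn.metrics as metrics

# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, _ = metrics.precision_recall_curve(y_true, y_score)

# Trapezoidal area under the PR curve (the approach used in the notebook).
trapezoid_auc = metrics.auc(recall, precision)

# Step-wise summary: sum of precision weighted by the increase in recall.
avg_precision = metrics.average_precision_score(y_true, y_score)

print(trapezoid_auc, avg_precision)
```

When comparing against numbers reported elsewhere, it is worth checking which of the two definitions was used.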

In [None]:
# AUC ROC
fpr_roc, tpr_roc, _ = metrics.roc_curve(y_test, y_pred, pos_label=1)
metrics.auc(fpr_roc, tpr_roc)

0.8906674951820033
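
ROC AUC also has a ranking interpretation: it is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The helper name `pairwise_roc_auc` is an assumption for illustration, and the all-pairs comparison uses memory proportional to the number of positives times the number of negatives, so it is only meant for small inputs.

```python
import numpy as np

def pairwise_roc_auc(y_true, y_score):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Score difference for every (positive, negative) pair.
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy labels and scores, for illustration only.
auc_value = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value)
```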

## Hints

Although AUCPRC and AUCROC are already quite high, one may suspect overfitting, since the feature dimension is 2048 and there are only 3977 training samples. This is indeed the case, but simple regularization with some hyperparameter tuning of the RF and/or the Morgan fingerprint does not improve the result significantly. Note that some graph-based deep learning models with minimal tuning easily reach both AUCPRC and AUCROC > 0.93, and logloss < 0.35.
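
One simple way to look for overfitting is to compare accuracy on the training split with accuracy on held-out data while varying a regularizing parameter. The sketch below does this for `min_samples_leaf` on synthetic, noisy data. The data, the parameter grid, and the `results` dictionary are assumptions chosen for illustration, so the numbers it prints do not describe the compound dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic high-dimensional problem with 20% label noise (illustrative only).
X, y = make_classification(n_samples=400, n_features=200, n_informative=10,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for min_leaf in (1, 5, 20):
    # Larger leaves mean shallower, more regularized trees.
    clf = RandomForestClassifier(min_samples_leaf=min_leaf, random_state=0)
    clf.fit(X_tr, y_tr)
    results[min_leaf] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for min_leaf, (train_acc, test_acc) in results.items():
    print(f"min_samples_leaf={min_leaf:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between the two columns indicates overfitting. If the test column barely moves as the gap closes, regularization alone is not helping.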

Also, remember that you may freely use other open resources. For example, PubChem, ChEMBL, ChEBI, and similar databases contain a very large number of compound samples, and most compounds there are unlikely to be "active".
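
One way to act on this hint is to add external compounds to the training set as presumed inactives. The sketch below shows only the bookkeeping, using made-up placeholder keys and SMILES. The frames `train_df` and `external_df`, the `source` column, and the decision to label every external compound `inactive` are all assumptions for illustration, and the assumed labels are not measured activities.

```python
import pandas as pd

# Toy stand-in for the challenge training rows (placeholder keys).
train_df = pd.DataFrame({
    "inchikey": ["KEY-A", "KEY-B"],
    "smiles": ["CCO", "c1ccccc1"],
    "activity": ["active", "inactive"],
})

# Hypothetical compounds pulled from a public database.
external_df = pd.DataFrame({
    "inchikey": ["KEY-B", "KEY-C"],
    "smiles": ["c1ccccc1", "CC(=O)O"],
})

# Drop external compounds that already appear in the training data.
decoys = external_df[~external_df["inchikey"].isin(train_df["inchikey"])].copy()
decoys["activity"] = "inactive"  # assumed label, not a measurement
decoys["source"] = "external"

augmented = pd.concat([train_df.assign(source="challenge"), decoys],
                      ignore_index=True)
print(augmented)
```

In practice the external compounds should also be checked against the test group, so that no test compound leaks into training.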