# Random Forest Classifiers and Fibres of Failure
Today I will use my Random Forest Classifier to see whether I can extract prediction error from the model. If I can, I will use that to build a Fibres of Failure model.
The original Fibres of Failure paper uses a four-component lens featuring:
1. The ground truth class (i.e. whether something is active or inactive)
2. The model's confidence in its prediction
3. The probability the model assigns to the ground truth class
4. "Principal Component 1" (which the paper never defines; below I substitute a one-component MDS embedding of the chemical distance matrix)

## Aims
1. Construct the lens as described in the paper

## Pitfalls
1. We only have one class thus far, because of holes in the dataset.
2. The paper also does some clever learning with a second classifier, which is not reproduced here.



In [39]:
import pickle
import sys

import scipy

import rdkit
import rdkit.Chem as Chem
import rdkit.Chem.AllChem as AllChem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import DataStructs

import numpy as np
import pandas as pd

from collections import Counter

import sklearn.ensemble
import sklearn.utils  # used for sklearn.utils.shuffle in the train/validation split
from sklearn.manifold import MDS

Here are the hyperparameters: the activity cutoff and the targets we wish to look at. Note that validating by year is less accurate, presumably because chemistry changes over time and different styles of molecule get made.

In [4]:
ACTIVITY_CUTOFF = 5.0
DESIRED_TARGETS = ["CHEMBL240"]
MAPPER_TARGETS = ["CHEMBL240", "CHEMBL264"]
FP_SIZE = 2048
RANDOM_STATE = 2019
VALIDATE_BY_YEAR = False
if VALIDATE_BY_YEAR:
    YEAR_CUTOFF = 2013
else:
    VALIDATE_FRACTION = 0.15

In [5]:
with open("../data/processed/curated_set_with_publication_year.pd.pkl", "rb") as infile:
    df = pickle.load(infile)

possible_targets = Counter([item for item in df["TGT_CHEMBL_ID"]])
possible_drugs = Counter([item for item in df["CMP_CHEMBL_ID"]])

In [6]:
# vector_df has one row per target and one column per compound:
# 1 = active, -1 = inactive, 0 = not measured.
vector_df = pd.DataFrame(0, columns=possible_drugs.keys(), index=possible_targets.keys(), dtype=np.int8)
fingerprint_dict = {}
for counted, (_, row) in enumerate(df.iterrows()):
    drug = row["CMP_CHEMBL_ID"]
    target = row["TGT_CHEMBL_ID"]
    # Compute each Morgan fingerprint once, and only for compounds measured against a mapper target.
    if target in MAPPER_TARGETS and drug not in fingerprint_dict:
        fingerprint_dict[drug] = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(row["SMILES"]),
                                                                       radius=3,
                                                                       nBits=FP_SIZE)
    if not counted % 10000:
        print("Counted up to", counted)
    # .loc avoids chained assignment, which can silently write to a copy.
    vector_df.loc[target, drug] = 1 if row["BIOACT_PCHEMBL_VALUE"] > ACTIVITY_CUTOFF else -1

Counted up to 0
Counted up to 10000
...
Counted up to 310000


In [38]:
sub_df = df[np.logical_or.reduce([df["TGT_CHEMBL_ID"] == tgt for tgt in DESIRED_TARGETS])]
mapper_df = df[np.logical_or.reduce([df["TGT_CHEMBL_ID"] == tgt for tgt in MAPPER_TARGETS])]
if VALIDATE_BY_YEAR:
    training_df = sub_df[sub_df["DOC_YEAR"] < YEAR_CUTOFF]
    validation_df = sub_df[sub_df["DOC_YEAR"] >= YEAR_CUTOFF]
else:
    sub_df = sklearn.utils.shuffle(sub_df, random_state=RANDOM_STATE)
    split_point = int(sub_df.shape[0] * VALIDATE_FRACTION)
    training_df = sub_df.iloc[split_point:, :]
    validation_df = sub_df.iloc[:split_point, :]

print(training_df.shape)
print(validation_df.shape)
print(mapper_df.shape)

(3998, 33)
(705, 33)
(7251, 33)


In [8]:
def convert_to_sparse(input_df, use_classes=True):
    n_samples = input_df.shape[0]
    print(n_samples)
    arr = np.empty([n_samples, FP_SIZE], dtype=bool)
    if use_classes:
        is_active = np.empty([n_samples], dtype=bool)
    else:
        is_active = np.empty([n_samples], dtype=np.float64)
    for index, (item, row) in enumerate(input_df.iterrows()):
        fingerprint = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(row["SMILES"]),
                                                                  radius=3,
                                                                  nBits=FP_SIZE)
        DataStructs.ConvertToNumpyArray(fingerprint, arr[index, :])
        if use_classes:
            if row["BIOACT_PCHEMBL_VALUE"] < ACTIVITY_CUTOFF:
                is_active[index] = False
            else:
                is_active[index] = True
        else:
            is_active[index] = row["BIOACT_PCHEMBL_VALUE"]

    observations = scipy.sparse.csc_matrix(arr)
    return observations, is_active

In [74]:
compound_names = sub_df["CMP_CHEMBL_ID"].tolist()


In [47]:
training_observations, training_is_active = convert_to_sparse(training_df)
validation_observations, validation_is_active = convert_to_sparse(validation_df)
total_observations, total_is_active = convert_to_sparse(sub_df)

3998
705
4703


How much does the n_estimators parameter actually matter?
Answer: 1024 seems to be just fine.

In [17]:
model = sklearn.ensemble.RandomForestClassifier(n_estimators=1024, criterion="gini", n_jobs=4, bootstrap=False, max_features="log2")
model.fit(training_observations, training_is_active)
model.score(validation_observations, validation_is_active)

0.8028368794326242

In [48]:
predictions = model.predict(total_observations)
probabilities = model.predict_proba(total_observations)
print(probabilities[100])

[0.08740234 0.91259766]


In [50]:
probabilities_ground_truth = np.empty(probabilities.shape[0])
probabilities_predicted = np.empty(probabilities.shape[0])

for i in range(len(probabilities)):
    is_active = int(total_is_active[i])

    probabilities_ground_truth[i] = probabilities[i][is_active]
    probabilities_predicted[i] = max(probabilities[i])
    
print(probabilities_ground_truth, probabilities_predicted)

[0.10644531 1.         0.83691406 ... 1.         1.         1.        ] [0.89355469 1.         0.83691406 ... 1.         1.         1.        ]
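The loop above can also be written with NumPy fancy indexing, and the same quantities give a direct handle on prediction error. This is a minimal sketch on a made-up probability array: the names `toy_probabilities` and `toy_is_active` and their values are illustrative, not taken from the model. It assumes column 0 is the inactive class and column 1 the active class, which is how `predict_proba` orders boolean labels.

```python
import numpy as np

# Toy stand-ins for `probabilities` and `total_is_active` (illustrative values only).
toy_probabilities = np.array([[0.9, 0.1],
                              [0.2, 0.8],
                              [0.4, 0.6]])
toy_is_active = np.array([True, True, False])

rows = np.arange(len(toy_probabilities))
# Probability the model assigned to the true class of each sample.
p_ground_truth = toy_probabilities[rows, toy_is_active.astype(int)]
# Confidence in whichever class the model actually predicted.
p_predicted = toy_probabilities.max(axis=1)
# With two classes, a sample is misclassified when the true class
# received less than half of the probability mass (ties aside).
misclassified = p_ground_truth < 0.5

print(p_ground_truth.tolist())  # [0.1, 0.8, 0.4]
print(misclassified.tolist())   # [True, False, True]
```

In the real output above, the first compound has a ground truth probability of about 0.106 against a predicted confidence of about 0.894, so it is one of the confidently wrong predictions that the lens is meant to surface.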


Now that we have partially constructed the lens, we need the pairwise distances in
chemical space that we will map over. Here the distance is one minus the Tanimoto
similarity between Morgan fingerprints.
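As a worked example of the distance used in the next cell, here is a small sketch that treats a fingerprint as the set of its on-bit indices. The function name `tanimoto_distance` and the toy bit sets are hypothetical; the notebook itself uses RDKit's `TanimotoSimilarity` on bit vectors.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints: treated as identical (a convention chosen for this sketch).
        return 0.0
    return 1.0 - len(bits_a & bits_b) / union

# Hypothetical on-bit sets standing in for Morgan fingerprints.
fp_a = {1, 5, 9}
fp_b = {9, 42}

# One shared bit out of four distinct bits: similarity 0.25, distance 0.75.
print(tanimoto_distance(fp_a, fp_b))  # 0.75
```

The distance is symmetric and zero for identical fingerprints, which is why only the lower triangle of the matrix needs to be computed.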

In [42]:
# Pairwise Tanimoto distance (1 - similarity) between the compounds in sub_df.
compound_ids = sub_df["CMP_CHEMBL_ID"].tolist()
chemical_distance = np.zeros([len(compound_ids), len(compound_ids)])
for index, drug in enumerate(compound_ids):
    fingerprint = fingerprint_dict[drug]
    if not index % 100:
        print(index)
    # The matrix is symmetric, so compute the lower triangle and mirror it.
    for other_index in range(index):
        other_fingerprint = fingerprint_dict[compound_ids[other_index]]
        chem_dissimilarity = 1.0 - DataStructs.TanimotoSimilarity(fingerprint, other_fingerprint)
        chemical_distance[index, other_index] = chem_dissimilarity
        chemical_distance[other_index, index] = chem_dissimilarity
with open("2019-04-17-chemical-distance.pkl", "wb") as outfile:
    pickle.dump(chemical_distance, outfile)

0
100
...
4700


In [16]:
with open("2019-04-17-chemical-distance.pkl", "rb") as infile:
    chemical_distance = pickle.load(infile)
print(chemical_distance)

[[0.         0.87931034 0.94230769 ... 0.93548387 0.91869919 0.91358025]
 [0.87931034 0.         0.86813187 ... 0.88648649 0.9125     0.85789474]
 [0.94230769 0.86813187 0.         ... 0.88607595 0.91729323 0.88023952]
 ...
 [0.93548387 0.88648649 0.88607595 ... 0.         0.92537313 0.9127907 ]
 [0.91869919 0.9125     0.91729323 ... 0.92537313 0.         0.90070922]
 [0.91358025 0.85789474 0.88023952 ... 0.9127907  0.90070922 0.        ]]


In [45]:
mds_component = MDS(n_components=1, dissimilarity="precomputed", metric=False).fit_transform(chemical_distance)

In [56]:
lens = np.empty([probabilities.shape[0], 4])
lens[:, 0] = total_is_active
lens[:, 1] = probabilities_ground_truth
lens[:, 2] = probabilities_predicted
lens[:, 3] = mds_component[:, 0]

# True wherever the model's prediction matches the ground truth.
got_it_right = total_is_active == predictions

In [77]:
import hdbscan
from IPython.display import SVG, IFrame
import kmapper as km
custom_tooltips=np.array([f"<img src='./Figures/{chembl_id}.svg'>" for chembl_id in sub_df["CMP_CHEMBL_ID"]])
mapper = km.KeplerMapper(verbose=1)
graph = mapper.map(lens,
                   X=chemical_distance,
                   precomputed=True,
                   cover=km.Cover(n_cubes=[2, 6, 6, 6], perc_overlap=0.25),
                   clusterer=hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=3, min_samples=1))
mapper.visualize(graph, path_html="2019-04-17-mb-fibres-of-failure.html",
                 title="Testing out Fibres of Failure", color_function=probabilities_ground_truth, custom_tooltips=custom_tooltips)
IFrame("2019-04-17-mb-fibres-of-failure.html", 800, 600)

KeplerMapper(verbose=1)
Mapping on data shaped (4703, 4703) using lens shaped (4703, 4)

Creating 432 hypercubes.

Created 816 edges and 923 nodes in 0:00:00.919853.
Wrote visualization to: 2019-04-17-mb-fibres-of-failure.html


## Summary
Looks like it worked, and it was fairly computationally cheap (although building the chemical distance matrix was slow).
The topological map needs some tuning, but the maps in the original Fibres of Failure paper were also pretty messy.

The next step is to extract these failure groups (coloured green and blue) into classes. For each class,
we can calculate a correction to apply to the output of the Random Forest. Then, when we get a new
drug, we check whether it fits into any failure class and, if so, apply that class's correction.
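One way the correction step could look is sketched below. It assumes the correction for a class is the mean signed error (ground truth minus predicted probability of being active) over that class's members. That choice is an assumption made for illustration, not something settled in this notebook. The names `class_corrections`, `corrected_probability` and `failure_class` are hypothetical, and the data is made up.

```python
from statistics import mean

def class_corrections(p_active, is_active, failure_class):
    """Mean signed error (truth minus predicted P(active)) for each failure class.

    failure_class[i] is a hypothetical label for sample i, or None if the
    sample does not belong to any failure group.
    """
    errors = {}
    for p, truth, label in zip(p_active, is_active, failure_class):
        if label is not None:
            errors.setdefault(label, []).append(float(truth) - p)
    return {label: mean(values) for label, values in errors.items()}

def corrected_probability(p, label, corrections):
    """Shift p by its class correction and clip the result to [0, 1]."""
    shifted = p + corrections.get(label, 0.0)
    return min(1.0, max(0.0, shifted))

# Toy data: class "A" collects actives that the forest scored too low.
p_active = [0.25, 0.5, 0.75, 0.875]
is_active = [True, True, False, True]
failure_class = ["A", "A", "B", None]

corrections = class_corrections(p_active, is_active, failure_class)
print(corrections)  # {'A': 0.625, 'B': -0.75}
print(corrected_probability(0.25, "A", corrections))  # 0.875
```

A drug that matches no failure class keeps its original probability, because the lookup falls back to a correction of zero.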

How do we see if a drug fits into a failure class, though?
1. Compute a chemical space centroid and radius for each class, and see if the new drug falls within that radius (but our fingerprints are boolean vectors, so a centroid may not be meaningful).
2. Compare cross-similarity for each class: compute the new drug's similarity to the class members, and if it is higher than the intra-class similarity, count the drug as being in that class.
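The second option can be sketched as follows. This assumes the "intra-class similarity metric" is the mean pairwise Tanimoto similarity between class members, and that fingerprints are represented as sets of on-bit indices. Both are assumptions made for illustration, and the function names and toy fingerprints are hypothetical.

```python
from itertools import combinations
from statistics import mean

def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 1.0

def intra_class_similarity(members):
    """Mean pairwise similarity between class members (needs at least two members)."""
    return mean(tanimoto_similarity(a, b) for a, b in combinations(members, 2))

def fits_class(new_bits, members):
    """True if the new drug is, on average, more similar to the members than they are to each other."""
    cross = mean(tanimoto_similarity(new_bits, member) for member in members)
    return cross > intra_class_similarity(members)

# Toy failure class: every pair of members shares 3 of 5 distinct bits (similarity 0.6).
members = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}]

print(fits_class({1, 2, 3, 4, 5}, members))  # True: similarity 0.8 to every member
print(fits_class({7, 8, 9}, members))        # False: no bits in common
```

A mean is only one possible summary. A stricter rule could compare against the minimum intra-class similarity, or require the new drug to be close to a minimum number of members.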