# Introduction

Topological Data Analysis (TDA) is a family of methods for simplifying complex, high-dimensional data.
It does this by clustering datapoints and considering the connections between them. One widely used
form of this approach, the `Mapper` algorithm, was introduced by Singh, Mémoli and Carlsson in 2007.\cite{Singh2007}
More recently, it has been implemented in a variety of packages such as (...list...)
In this work, I use the `KeplerMapper` implementation in `Python`.\cite{KeplerMapper}

The clustering is performed by the `HDBSCAN` algorithm, because it copes well with
changes in the density of datapoints.\cite{McInnes2017}

## Related Works
Discuss `FIFA` paper, BitterSweet Forest, etc.
Do a literature review.

# Theory and Computational Method
## Theory
In this section, I will discuss the theory of topological data
analysis. A simple figure demonstrating it will be helpful,
as well as some maths chat. The Fibres of Failure paper 
is good for that!

## Computational Method
All of the computational work in this report was done using `Python 3.7` in a Jupyter notebook.
Python was chosen for its range of available software packages and its ease of use.

In [2]:
import pickle
import sys
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from IPython.display import SVG, IFrame

In [3]:
DESIRED_TARGETS = ["CHEMBL240"]

# To ensure repeatability of runs, the random seed should
# be consistent.
RANDOM_STATE = 42

# Data splitting hyperparameters.
TRAIN_RF_FRACTION = 0.60
TRAIN_FIFA_FRACTION = 0.20
VALIDATE_FRACTION = 1.0 - TRAIN_FIFA_FRACTION - TRAIN_RF_FRACTION

# Community detection hyperparameters.
# Discard any with too small a set of nodes,
# or too small a prediction error.
COMMUNITY_SIZE_CUTOFF = 3
COMMUNITY_ERROR_CUTOFF = 0.20
CORRECTION_STD_WARN = 0.10

The dataset used in this work was taken from a sanitised version of the ChEMBL database created by Lenselink *et al.*\cite{Gaulton2012, Lenselink2017} This dataset contains only the entries in ChEMBL that have minimal experimental error, confident numerical ratings and no duplicate measurements. From this dataset, I used RDKit to parse chemical information into a computationally-accessible format.\cite{rdkit} The drugs were represented as SMILES strings, and their chemical fingerprints were calculated using the Morgan fingerprinting algorithm with a fingerprint radius of 3 bonds and a fingerprint size of 2048 bits.

In [4]:
ACTIVITY_CUTOFF = 6.5

In the dataset, activity is quantified by a `pChEMBL` value, which is logarithmic and ranges between 1 and 10. Examples in the literature often demarcate "active" vs "inactive" at `pChEMBL = 5.0`, but this classifies 90% of 
compounds in the dataset as active. Instead, `pChEMBL = 6.5` is used as a cutoff.
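As a minimal sketch of how the choice of cutoff changes the class balance, the snippet below labels a handful of made-up `pChEMBL` values at both thresholds. The values, the helper names `label_activity` and `active_fraction`, and the choice to treat a value exactly equal to the cutoff as active are all assumptions for illustration; they are not taken from the dataset or from the analysis code.

```python
# Hypothetical pChEMBL values, for illustration only (not from the dataset).
example_pchembl = [4.8, 5.2, 6.4, 6.5, 7.9]


def label_activity(pchembl_values, cutoff):
    """Return True for each value at or above the cutoff.

    Assumption: the cutoff is inclusive (value >= cutoff counts as active).
    """
    return [value >= cutoff for value in pchembl_values]


def active_fraction(pchembl_values, cutoff):
    """Fraction of compounds labelled active at the given cutoff."""
    labels = label_activity(pchembl_values, cutoff)
    return sum(labels) / len(labels)


labels_65 = label_activity(example_pchembl, 6.5)
frac_50 = active_fraction(example_pchembl, 5.0)
frac_65 = active_fraction(example_pchembl, 6.5)
print(labels_65, frac_50, frac_65)
```

On this toy list, raising the cutoff from 5.0 to 6.5 halves the fraction labelled active, which is the kind of rebalancing the higher cutoff is meant to achieve.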

In [5]:
FP_SIZE = 2048

In [6]:
import rdkit
import rdkit.Chem as Chem
import rdkit.Chem.AllChem as AllChem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import DataStructs

In [7]:
input_location = "../data/processed/curated_set_with_publication_year.pd.pkl"
with open(input_location, "rb") as infile:
    df = pickle.load(infile)

possible_targets = Counter([item for item in df["TGT_CHEMBL_ID"]])
possible_drugs = Counter([item for item in df["CMP_CHEMBL_ID"]])

In [8]:
fingerprint_dict = {}
for index, row in df.iterrows():
    target = row["TGT_CHEMBL_ID"]
    if target in DESIRED_TARGETS:
        drug = row["CMP_CHEMBL_ID"]
        molec = Chem.MolFromSmiles(row["SMILES"])
        fingerprint_dict[drug] = AllChem.GetMorganFingerprintAsBitVect(molec,
                                                                       radius=3,
                                                                       nBits=FP_SIZE)

I used KeplerMapper to perform the topological data analysis.\cite{KeplerMapper} This is a free and open source
implementation of the Mapper algorithm which runs in `Python`. To do further graph manipulations including community detection, I used the `igraph` package.\cite{igraph}

In [9]:
import kmapper as km
import igraph

Topological data analysis requires an algorithm to cluster the datapoints into nodes. I used the HDBSCAN algorithm
to perform this task, because it is designed to cope well with varying densities of data points. This is important for analysis of chemical space, because the density of experimental data is often inconsistent.

In [10]:
import hdbscan
clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=3, min_samples=1)
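To make the role of the clusterer concrete, here is a toy sketch of the Mapper idea in plain Python: cover a one-dimensional lens with overlapping intervals, cluster the points inside each interval, and link clusters that share points. This is not KeplerMapper's implementation. A simple distance-threshold (single-linkage) clustering stands in for HDBSCAN, and the names `interval_cover`, `threshold_clusters` and `toy_mapper`, along with all parameter values, are hypothetical.

```python
from itertools import combinations


def interval_cover(lo, hi, n_intervals, overlap):
    """Equal-width intervals over [lo, hi], each padded by a fraction of its width."""
    width = (hi - lo) / n_intervals
    pad = width * overlap
    return [(lo + i * width - pad, lo + (i + 1) * width + pad)
            for i in range(n_intervals)]


def threshold_clusters(indices, distance, eps):
    """Stand-in clusterer: connected components of the 'closer than eps' graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in remaining if distance(current, j) <= eps}
            remaining -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def toy_mapper(lens, distance, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are clusters, edges join clusters sharing points."""
    nodes = []
    for lo, hi in interval_cover(min(lens), max(lens), n_intervals, overlap):
        members = [i for i, value in enumerate(lens) if lo <= value <= hi]
        nodes.extend(threshold_clusters(members, distance, eps))
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges


# Seven points on a line; the lens is simply the point's own coordinate.
points = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
nodes, edges = toy_mapper(
    points,
    distance=lambda i, j: abs(points[i] - points[j]),
    n_intervals=3,
    overlap=0.25,
    eps=0.6,
)
print(nodes)
print(edges)
```

Because neighbouring intervals overlap, the points at 1.0 and 2.0 each fall into two clusters, and those shared points are what produce the edges of the resulting graph.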

Finally, I used Scikit-Learn for the machine learning in this work.\cite{scikit-learn}

In [11]:
import sklearn.ensemble
from sklearn.manifold import MDS

# Qualitative Results
## Exploring Chemical Space
One powerful way to use topological data analysis is to explore the similarities in chemical space in a human-readable form. To do so, we must quantify the "distance" between two compounds in chemical space. I have chosen to do so with the Tanimoto similarity, which is
\begin{equation}
   S_T = \frac{M_{11}}{M_{10} + M_{01} + M_{11}}
\end{equation}
with $M_{11}$ being the number of bits set to 1 in both fingerprints and $M_{10} + M_{01}$ being the number of bits set to 1 in one fingerprint but not in the other. The corresponding distance is $d_T = 1 - S_T$, which is 0 for identical fingerprints and 1 for fingerprints with no bits in common.
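As a small worked example, the sketch below computes the standard Tanimoto coefficient (shared on-bits divided by all on-bits in either fingerprint) and the distance obtained by subtracting it from 1, on two tiny hand-made fingerprints. Fingerprints are represented here as plain Python sets of on-bit positions rather than RDKit bit vectors. The function names and the convention that two empty fingerprints have similarity 1 are assumptions made for this illustration.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    m11 = len(a & b)  # bits on in both fingerprints
    m10 = len(a - b)  # bits on only in the first
    m01 = len(b - a)  # bits on only in the second
    total = m11 + m10 + m01
    if total == 0:
        # Assumption for this sketch: two empty fingerprints count as identical.
        return 1.0
    return m11 / total


def tanimoto_distance(bits_a, bits_b):
    """Distance form used for clustering: one minus the similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

# Here m11 = 2, m10 = 2, m01 = 1, so the similarity is 2 / 5.
similarity = tanimoto_similarity(fp_a, fp_b)
distance = tanimoto_distance(fp_a, fp_b)
print(similarity, distance)
```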

Generating the distance matrix for a set of compounds is an $\order{N^2} $ operation in both time and memory. The
distance matrix calculation was the limiting factor for the amount of data I could analyse.
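To give a feel for the quadratic scaling, the following back-of-the-envelope sketch estimates the memory of a dense matrix of 8-byte floats (the default `float64` produced by `np.zeros`) and the number of unique pairs that must be compared. The compound count of 5000 is a hypothetical round number, and the helper names are mine.

```python
def dense_matrix_bytes(n_compounds, bytes_per_entry=8):
    """Memory for a full N x N matrix of 8-byte floats."""
    return n_compounds * n_compounds * bytes_per_entry


def unique_pairs(n_compounds):
    """Number of distinct compound pairs, i.e. similarity evaluations needed."""
    return n_compounds * (n_compounds - 1) // 2


n = 5000  # hypothetical dataset size
matrix_mb = dense_matrix_bytes(n) / 1e6
pairs = unique_pairs(n)
growth = dense_matrix_bytes(2 * n) / dense_matrix_bytes(n)
print(matrix_mb, pairs, growth)
```

Doubling the number of compounds quadruples the storage and roughly quadruples the number of similarity evaluations, which is why the matrix becomes the bottleneck long before the fingerprints themselves do.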

In [15]:
chem_space_df = df[np.logical_or.reduce([df["TGT_CHEMBL_ID"] == tgt for tgt in DESIRED_TARGETS])]
chem_space_df = sklearn.utils.shuffle(chem_space_df, random_state=RANDOM_STATE)

In [22]:
n_compounds = len(chem_space_df)
distance_matrix = np.zeros([n_compounds, n_compounds])
for index in range(n_compounds):
    # Print progress every 100 rows.
    if not index % 100:
        print(index)
    drug = chem_space_df.iloc[index]["CMP_CHEMBL_ID"]
    fp_1 = fingerprint_dict[drug]
    # Only the lower triangle is computed; the matrix is symmetric,
    # and the diagonal stays at zero.
    for other_index in range(index):
        other_drug = chem_space_df.iloc[other_index]["CMP_CHEMBL_ID"]
        fp_2 = fingerprint_dict[other_drug]
        distance = 1.0 - DataStructs.TanimotoSimilarity(fp_1, fp_2)
        distance_matrix[index, other_index] = distance
        distance_matrix[other_index, index] = distance
# Cache the matrix so the quadratic calculation need not be repeated.
with open("chemical-space-distance.pkl", "wb") as outfile:
    pickle.dump(distance_matrix, outfile)

[Cell output: progress counter printed every 100 rows, from 0 to 4700]


In [None]:
%matplotlib inline
plt.imshow(distance_matrix, zorder=2, cmap='Blues', interpolation='nearest')
plt.colorbar()

The key feature of topological data analysis that makes it useful for exploring chemical space is that it presents the data in low dimensions while preserving the connections present in the high-dimensional data. Here, I use Multi-Dimensional Scaling (MDS) to reduce the data to two dimensions. Like principal component analysis, MDS produces a low-dimensional embedding, but it works directly from pairwise dissimilarities and so is applicable to more general spaces (such as non-metric chemical space).\cite{Martin2015}

In [None]:
# Fix the seed so that the embedding is repeatable between runs.
mds_cooordinate = MDS(n_components=2, dissimilarity="precomputed", metric=False,
                      random_state=RANDOM_STATE).fit_transform(distance_matrix)