# **Getting Started with ChemPlot**

<h>**Installation**
<p>Currently, to install ChemPlot, first install RDKit (a library for cheminformatics). Then you can install ChemPlot using pip.

In [None]:
# Install RDKit, then ChemPlot and Bokeh (both pinned to specific versions)
!pip install rdkit
!pip install chemplot==1.2.0
!pip install bokeh==2.4.3
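
After the install cell finishes, it can be useful to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of ChemPlot.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed above; None means the package was not found.
for name in ("rdkit", "chemplot", "bokeh"):
    print(name, installed_version(name))
```

If the pinned installs succeeded, the lines for `chemplot` and `bokeh` should show the versions requested above.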

In [18]:
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

<h>**Import the library and the Example Datasets**
<p>Let's start by importing ChemPlot and two example datasets to demonstrate its functionalities. We use the following datasets: BBBP (blood-brain barrier penetration) [1] and SAMPL (hydration free energy) [2]. The target of the BBBP dataset is discrete, while the target of the SAMPL dataset is continuous.

---
<p>[1] Martins, Ines Filipa, et al. "A Bayesian approach to in silico blood-brain barrier penetration modeling." Journal of chemical information and modeling 52.6 (2012): 1686-1697.
<p>[2] Mobley, David L., and J. Peter Guthrie. "FreeSolv: a database of experimental and calculated hydration free energies, with input files." Journal of computer-aided molecular design 28.7 (2014): 711-720.

In [7]:
from chemplot import load_data, Plotter

data_BBBP = load_data("BBBP")
data_SAMPL = load_data("SAMPL")

Let's explore the BBBP dataset.

In [8]:
data_BBBP

Unnamed: 0,smiles,target
0,[Cl].CC(C)NCC(O)COc1cccc2ccccc12,1
1,C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl,1
2,c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...,1
3,C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C,1
4,Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...,1
...,...,...
2034,C1=C(Cl)C(=C(C2=C1NC(=O)C(N2)=O)[N+](=O)[O-])Cl,1
2035,[C@H]3([N]2C1=C(C(=NC=N1)N)N=C2)[C@@H]([C@@H](...,1
2036,[O+]1=N[N](C=C1[N-]C(NC2=CC=CC=C2)=O)C(CC3=CC=...,1
2037,C1=C(OC)C(=CC2=C1C(=[N+](C(=C2CC)C)[NH-])C3=CC...,1


Let's explore the SAMPL dataset.

In [9]:
data_SAMPL

Unnamed: 0,smiles,target
0,CN(C)C(=O)c1ccc(cc1)OC,-11.01
1,CS(=O)(=O)Cl,-4.87
2,CC(C)C=C,1.83
3,CCc1cnccn1,-5.45
4,CCCCCCCO,-4.21
...,...,...
637,CCCCCCCC(=O)OC,-2.04
638,C1CCNC1,-5.48
639,c1cc(ccc1C=O)O,-8.83
640,CCCCCCCCl,0.29
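
The two targets shown above differ in kind: BBBP holds class labels, while SAMPL holds real-valued measurements. As a rough sketch of how one might check this programmatically, the helper below applies a simple heuristic. The name `guess_target_type` and the `max_classes` threshold are illustrative assumptions, not ChemPlot's own logic.

```python
def guess_target_type(values, max_classes=10):
    """Guess "C" (discrete classes) or "R" (continuous values) for a target column.

    Heuristic: integer-valued targets with few distinct values are treated as classes.
    """
    values = list(values)
    all_integer_valued = all(float(v).is_integer() for v in values)
    few_distinct = len(set(values)) <= max_classes
    return "C" if all_integer_valued and few_distinct else "R"


# First rows of the target columns displayed above.
print(guess_target_type([1, 1, 1, 1, 1]))                      # BBBP-like labels
print(guess_target_type([-11.01, -4.87, 1.83, -5.45, -4.21]))  # SAMPL-like values
```

The letters match the `target_type` values used when creating Plotter objects in this tutorial: `"C"` for BBBP and `"R"` for SAMPL.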


<h>**Plotting the Datasets**
<p>We can now use the library to create some plots. Let's compare the scatter plot for BBBP with the plots describing the distribution of the chemical space.

Create a Plotter object

In [10]:
cp_BBBP = Plotter.from_smiles(data_BBBP["smiles"], target=data_BBBP["target"], target_type="C")

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
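
The `ValueError` above is the message pandas raises when a `Series` is evaluated in a boolean context. The traceback is not shown, so exactly where this happens is not known here. One possible workaround, offered as an untested assumption, is to hand ChemPlot plain Python lists instead of `Series` objects. The helper `as_plain_list` below is hypothetical.

```python
def as_plain_list(values):
    """Convert a list-like object (list, tuple, pandas Series, NumPy array) to a plain list."""
    # pandas Series and NumPy arrays expose .tolist(), which also converts
    # their elements to native Python scalars.
    if hasattr(values, "tolist"):
        return values.tolist()
    return list(values)


# Assumed usage (not verified against ChemPlot 1.2.0):
# cp_BBBP = Plotter.from_smiles(
#     as_plain_list(data_BBBP["smiles"]),
#     target=as_plain_list(data_BBBP["target"]),
#     target_type="C",
# )
```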

Reduce the dimensions of the molecular descriptors

In [None]:
cp_BBBP.tsne(random_state=0)

Compare "scatter", "hex" and "kde" plots

In [None]:
cp_BBBP.visualize_plot(kind="scatter", size=8)
cp_BBBP.visualize_plot(kind="hex", size=8)
cp_BBBP.visualize_plot(kind="kde", size=8);

<h>**Clustering**
<p>It is also possible to cluster the data before plotting. You can control the number of clusters with the parameter *n_clusters*, whose default value is 5.

In [None]:
cp_BBBP.cluster(n_clusters=6)
cp_BBBP.visualize_plot(size=8,clusters=True)
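
This tutorial does not say which algorithm `cluster()` uses. To illustrate what a parameter like `n_clusters` controls, here is a minimal pure-Python k-means sketch on 2D points, used as an assumed stand-in. The function `kmeans_2d` and its parameters are hypothetical, and the toy points are made up.

```python
def kmeans_2d(points, n_clusters=5, n_iter=20):
    """Assign each 2D point to one of n_clusters groups with plain k-means."""
    # Deterministic start: take evenly spaced input points as initial centers
    # (real implementations usually use random or k-means++ starts).
    centers = [points[i * len(points) // n_clusters] for i in range(n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(n_clusters),
                key=lambda k: (x - centers[k][0]) ** 2 + (y - centers[k][1]) ** 2,
            )
        # Update step: move each center to the mean of its assigned points.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


# Two well-separated groups of points in a 2D embedding.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(kmeans_2d(points, n_clusters=2))  # → [0, 0, 0, 1, 1, 1]
```

Raising `n_clusters` splits the same points into more, smaller groups, which is the effect the parameter has on the coloured clusters in the plot.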

Let's now do the same for a dataset with a continuous target, like SAMPL. First, create a Plotter object.

In [None]:
cp_SAMPL = Plotter.from_smiles(data_SAMPL["smiles"], target=data_SAMPL["target"], target_type="R")

Reduce the dimensions of the molecular descriptors

In [10]:
cp_SAMPL.tsne(random_state=0);

Scatter Plot for SAMPL dataset.

In [None]:
cp_SAMPL.visualize_plot(size=8, colorbar=True);

<h>**Comparing the dimensionality reduction methods**
<p>We will now see how the plots generated by PCA, t-SNE and UMAP differ.

Inspect BBBP and compare "PCA", "t-SNE" and "UMAP" plots.

In [None]:
cp_BBBP.pca()
cp_BBBP.visualize_plot(size=8);
cp_BBBP.tsne()
cp_BBBP.visualize_plot(size=8);
cp_BBBP.umap()
cp_BBBP.visualize_plot(size=8);

**t-SNE perplexity value**
<p>Change the perplexity to obtain plots with smaller or bigger clusters. However, ChemPlot already chooses a suitable perplexity value automatically.
<p>To see this, let's first plot the BBBP data with t-SNE using different values for perplexity.

In [None]:
# t-SNE produces robust results with perplexity values between 5 and 50
cp_BBBP.tsne(perplexity=5, random_state=0)
cp_BBBP.visualize_plot(size=8);
cp_BBBP.tsne(perplexity=15, random_state=0)
cp_BBBP.visualize_plot(size=8);
cp_BBBP.tsne(perplexity=30, random_state=0)
cp_BBBP.visualize_plot(size=8);
cp_BBBP.tsne(perplexity=50, random_state=0)
cp_BBBP.visualize_plot(size=8);
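
The repeated calls above follow one pattern, so they can be wrapped in a loop. This is a small sketch; the helper name `sweep_perplexity` is hypothetical, and it relies only on the `tsne()` and `visualize_plot()` calls already used in this tutorial.

```python
def sweep_perplexity(plotter, perplexities=(5, 15, 30, 50), random_state=0, size=8):
    """Run t-SNE once per perplexity value and draw a plot after each run."""
    plots = []
    for perplexity in perplexities:
        # Recompute the embedding with the current perplexity.
        plotter.tsne(perplexity=perplexity, random_state=random_state)
        # Draw the plot for this embedding and keep whatever visualize_plot returns.
        plots.append(plotter.visualize_plot(size=size))
    return plots


# Assumed usage with the Plotter object created earlier:
# sweep_perplexity(cp_BBBP)
```

Keeping `random_state` fixed across runs means that differences between the plots come from the perplexity value rather than from random initialization.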

<p>Let's now plot the BBBP dataset, leaving the choice of the perplexity value to ChemPlot.

In [None]:
cp_BBBP.tsne(random_state=0)
cp_BBBP.visualize_plot(size=8);

<h>**UMAP n_neighbors value**
<p>Similarly, UMAP takes an n_neighbors parameter to decide which molecules should be clustered together. Here too, ChemPlot automatically selects a suitable value based on the size of your dataset.

<h>**Structural Similarity (Disabled)**
<p>What if you do not have a target property? You can still use ChemPlot by creating a structural similarity based Plotter object.
<p>To demonstrate this we can now create a plot with the BBBP dataset using structural similarity.

In [25]:
cp_BBBP_structural = Plotter.from_smiles(data_BBBP["smiles"], target=data_BBBP["target"], target_type="C", sim_type="structural")
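
Structural similarity between molecules is commonly measured by comparing fingerprints. This tutorial does not state which fingerprint or metric ChemPlot uses, so the following only illustrates the general idea: the Tanimoto coefficient between two sets of "on" fingerprint bits. The function name and the bit sets are made up for the example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of "on" fingerprint bits."""
    bits_a, bits_b = set(bits_a), set(bits_b)
    union = bits_a | bits_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints: indices of the bits that are set for each molecule.
mol_a = {1, 4, 7, 9}
mol_b = {1, 4, 8, 9}
print(tanimoto(mol_a, mol_b))  # 3 shared bits out of 5 distinct bits → 0.6
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so no target property is needed to compare molecules this way.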

<h>**Interactive Plots**
<p>We can create interactive plots using ChemPlot. Let's first make sure the plots will be displayed within the notebook.

In [24]:
from bokeh.io import output_notebook
# Call once to configure Bokeh to display plots inline in the notebook.
output_notebook()

We can now use *interactive_plot()* rather than *visualize_plot()* to generate an interactive plot for BBBP. Use the tools on the right to explore the plot. You can select a group of molecules, zoom, or visualize the molecular structure in 2D.

In [None]:
# cp_SAMPL.tsne(random_state=0)
cp_BBBP.interactive_plot(show_plot=True);