<a href="https://colab.research.google.com/github/DIFACQUIM/Cursos/blob/main/6_2_Chemical_Space_tSNE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **t-distributed stochastic neighbor embedding (t-SNE)**
---
Made by: Ana Chávez, Fernanda Saldivar, Armando Rufino, Hector Ortíz and Karen Pelcastre

Contact: anachavez3026@gmail.com, fer.saldivarg@gmail.com

**Last Update:** March 2025


# Contents
---


>[t-distributed stochastic neighbor embedding (t-SNE)](#scrollTo=o3cP9t0hVqTg)

>[Contents](#scrollTo=7yJ5p5csw-sv)

>[Objectives](#scrollTo=2lfpHkGNIqYX)

>[Introduction](#scrollTo=ZuxaVIFYIlz1)

>[For more information:](#scrollTo=-g2EPPlmg9R_)



# Objectives
---

*   Introduce the visualization of chemical space.
*   Use chemical space visualization methods to obtain profiles of chemical databases.
*   Generate chemical space visualizations using t-SNE.

# Introduction
---

t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction algorithm that projects high-dimensional data into a low-dimensional space (generally two or three dimensions) for visualization. t-SNE finds patterns in how the points are distributed in the high-dimensional space, mainly which points are close neighbors, and tries to preserve those patterns in the low-dimensional space. The algorithm is frequently used for data visualization, especially in fields such as bioinformatics, genomics, and data science in general.
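
To make "preserving patterns" concrete, the following is a minimal pure-Python sketch of the quantity t-SNE minimizes: the Kullback-Leibler (KL) divergence between neighbor affinities in the original space (Gaussian kernel) and in the embedding (Student-t kernel). It is an illustration only, not the scikit-learn implementation: the bandwidth `sigma` is fixed instead of being tuned per point from the perplexity, and the toy points, embeddings, and function names are all hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_affinities(points, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij (assumption: one fixed sigma for all points)."""
    n = len(points)
    cond = []
    for i in range(n):
        row = [0.0 if i == j else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
               for j in range(n)]
        total = sum(row)
        cond.append([w / total for w in row])  # conditional p_{j|i}
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def low_dim_affinities(embedding):
    """Student-t (one degree of freedom) affinities q_ij in the embedding."""
    n = len(embedding)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def kl_divergence(p, q):
    """KL(P || Q) summed over all pairs with p_ij > 0."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)

# Hypothetical toy data: two well-separated clusters in four dimensions
high = [(0, 0, 0, 0), (0.5, 0, 0.5, 0), (0, 0.5, 0, 0.5),
        (5, 5, 5, 5), (5.5, 5, 5.5, 5), (5, 5.5, 5, 5.5)]

# A 2D embedding that keeps the clusters together ...
good = [(0, 0), (0.5, 0.2), (0.2, 0.5), (6, 6), (6.5, 6.2), (6.2, 6.5)]
# ... and one where points 1 and 3 are swapped between the clusters
scrambled = [good[0], good[3], good[2], good[1], good[4], good[5]]

p = high_dim_affinities(high)
kl_good = kl_divergence(p, low_dim_affinities(good))
kl_bad = kl_divergence(p, low_dim_affinities(scrambled))
print(f"KL (neighbors preserved): {kl_good:.3f}")
print(f"KL (neighbors scrambled): {kl_bad:.3f}")
```

The embedding that keeps neighbors together gives the lower KL divergence. This is the same quantity that the t-SNE training log reports as "KL divergence".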

In [None]:
from IPython.utils import io
import tqdm.notebook
import os, sys, random
total = 100
with tqdm.notebook.tqdm(total=total) as pbar:
    with io.capture_output() as captured:
        # pbar.update() adds an increment, so each of the four steps advances the bar by 25
        # Install rdkit
        !pip -q install rdkit.pypi==2021.9.4
        pbar.update(25)
        # Install molplotly
        !pip install molplotly
        pbar.update(25)
        # Install jupyter-dash
        !pip install jupyter-dash
        pbar.update(25)
        # Install the dash application design
        !pip install dash-bootstrap-components
        pbar.update(25)


In [None]:
# Import libraries
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import MACCSkeys, AllChem
from scipy.spatial.distance import pdist


In [None]:
#BIOFACQUIM
url_biofacquim = "https://raw.githubusercontent.com/DIFACQUIM/Cursos/main/Datasets/BIOFACQUIM.V2_curada.csv"
BIOFACQUIM = pd.read_csv(url_biofacquim)
BIOFACQUIM.head(2)

Unnamed: 0,ID,SMILES,Data set
0,FQNP502,c1cc2c(cc1C1OCC3C(c4ccc5c(c4)OCO5)OCC13)OCO2,BIOFACQUIM
1,FQNP281,C=C(C)C(C)(C)CCC(C)C1CCC2(C)C3CCC4C(C)(C)C5CCC...,BIOFACQUIM


In [None]:
#FDA
url_fda = "https://raw.githubusercontent.com/DIFACQUIM/Cursos/main/Datasets/FDA_2022_july_05_curada.csv"
FDA = pd.read_csv(url_fda)
FDA.head(2)

Unnamed: 0,ID,SMILES,NEW_SMILES,Data set
0,DB00006,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,CCC(C)C(NC(=O)C(CCC(=O)O)NC(=O)C(CCC(=O)O)NC(=...,FDA
1,DB00007,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,CCNC(=O)C1CCCN1C(=O)C(CCCN=C(N)N)NC(=O)C(CC(C)...,FDA


In [None]:
#DNMT1
url_dnmt1 = "https://raw.githubusercontent.com/DIFACQUIM/Cursos/main/Datasets/DNMT1_curada.csv"
DNMT1 = pd.read_csv(url_dnmt1)
DNMT1.head(2)

Unnamed: 0,ID,SMILES,Data set
0,"""CHEMBL2336409",Cc1cc(=Nc2ccc(NC(=O)c3ccc(N=c4cc[nH]c5ccccc45)...,DNMT1
1,"""CHEMBL1361703",COc1ccccc1CNC(=O)COC(=O)c1cc(-c2ccco2)nc2ccccc12,DNMT1


In [None]:
# See columns
print(FDA.columns)
print(BIOFACQUIM.columns)
print(DNMT1.columns)

Index(['ID', 'SMILES', 'NEW_SMILES', 'Data set'], dtype='object')
Index(['ID', 'SMILES', 'Data set'], dtype='object')
Index(['ID', 'SMILES', 'Data set'], dtype='object')


In [None]:
# Select columns
FDA = FDA[['ID', 'NEW_SMILES', "Data set"]]
BIOFACQUIM = BIOFACQUIM[['ID', 'SMILES', "Data set"]]
DNMT1 = DNMT1[['ID', 'SMILES', "Data set"]]

# Rename the columns so the three data sets share the same names (NEW_SMILES becomes SMILES for FDA)
FDA.columns = ["ID", "SMILES", "Data set"]
BIOFACQUIM.columns = ['ID',  'SMILES', "Data set"]
DNMT1.columns = ["ID", "SMILES", "Data set"]
FDA.head(2)

Unnamed: 0,ID,SMILES,Data set
0,DB00006,CCC(C)C(NC(=O)C(CCC(=O)O)NC(=O)C(CCC(=O)O)NC(=...,FDA
1,DB00007,CCNC(=O)C1CCCN1C(=O)C(CCCN=C(N)N)NC(=O)C(CC(C)...,FDA


In [None]:
# Concatenate databases
data = pd.concat([FDA, BIOFACQUIM, DNMT1], axis=0).reset_index(drop=True)

In [None]:
# Calculate molecular descriptors
# Parse each SMILES once and reuse the molecule objects for every descriptor
mols = [Chem.MolFromSmiles(x) for x in data["SMILES"]]
data["HBA"] = [Descriptors.NumHAcceptors(m) for m in mols]      # hydrogen bond acceptors
data["HBD"] = [Descriptors.NumHDonors(m) for m in mols]         # hydrogen bond donors
data["RB"] = [Descriptors.NumRotatableBonds(m) for m in mols]   # rotatable bonds
data["LogP"] = [Descriptors.MolLogP(m) for m in mols]           # octanol/water partition coefficient
data["TPSA"] = [Descriptors.TPSA(m) for m in mols]              # topological polar surface area
data["MW"] = [Descriptors.MolWt(m) for m in mols]               # molecular weight
data.head(2)

Unnamed: 0,ID,SMILES,Data set,HBA,HBD,RB,LogP,TPSA,MW
0,DB00006,CCC(C)C(NC(=O)C(CCC(=O)O)NC(=O)C(CCC(=O)O)NC(=...,FDA,29,27,66,-8.3261,904.07,2180.317
1,DB00007,CCNC(=O)C1CCCN1C(=O)C(CCCN=C(N)N)NC(=O)C(CC(C)...,FDA,14,15,32,-1.4381,431.54,1209.421
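
The six descriptors are on very different scales: in the table above, MW is in the thousands while LogP is a single-digit number. The t-SNE cell that follows rescales them with `StandardScaler` so that no descriptor dominates the distance calculation. The sketch below shows, in plain Python, the z-score transformation that `StandardScaler` applies to each column with its default settings (subtract the mean, divide by the population standard deviation). The descriptor values and the `standardize` helper are hypothetical, chosen only for illustration.

```python
import statistics

def standardize(values):
    """Z-score a column: zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values, mu=mean)  # population std (ddof = 0)
    return [(v - mean) / std for v in values]

# Hypothetical descriptor columns on very different scales
mw = [180.2, 320.4, 455.6, 1209.4, 2180.3]
logp = [-8.3, -1.4, 0.5, 2.1, 4.7]

mw_scaled = standardize(mw)
logp_scaled = standardize(logp)

for name, column in [("MW", mw_scaled), ("LogP", logp_scaled)]:
    print(name, [round(v, 2) for v in column])
```

After scaling, both columns are centered at zero with unit spread, so a one-unit difference means "one standard deviation" for every descriptor.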


In [None]:
# Train t-SNE model
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
data_tsne = data.copy()
data_tsne = data_tsne.drop(labels = ["Data set", "ID","SMILES"],axis = 1)
data_tsne = StandardScaler().fit_transform(data_tsne)
tsne = TSNE(n_components=2, verbose=1, perplexity=40, max_iter=300)  # 'max_iter' replaced 'n_iter' in scikit-learn 1.5
tsne_results = tsne.fit_transform(data_tsne)
tsne_results



[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 3231 samples in 0.005s...
[t-SNE] Computed neighbors for 3231 samples in 0.300s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3231
[t-SNE] Computed conditional probabilities for sample 2000 / 3231
[t-SNE] Computed conditional probabilities for sample 3000 / 3231
[t-SNE] Computed conditional probabilities for sample 3231 / 3231
[t-SNE] Mean sigma: 0.199656
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.108345
[t-SNE] KL divergence after 300 iterations: 1.785082


array([[11.04058  ,  0.9407834],
       [10.15655  ,  0.6430354],
       [10.318566 ,  0.6012506],
       ...,
       [ 4.6059785,  7.3025236],
       [ 4.659706 ,  6.995675 ],
       [ 1.720872 ,  5.659061 ]], dtype=float32)
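
The `perplexity=40` argument can be read as the effective number of neighbors each point takes into account. In the original t-SNE formulation, the perplexity of a point's neighbor distribution is 2 raised to its Shannon entropy (in bits), and a binary search finds the Gaussian bandwidth that reaches the requested value. The sketch below illustrates that idea in plain Python; the distances, the search bounds, and the function names are hypothetical, and this is not the scikit-learn code.

```python
import math

def perplexity(probabilities):
    """Perplexity 2**H, where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

def conditional_probabilities(sq_distances, sigma):
    """Gaussian neighbor probabilities for one point, given squared distances to the others."""
    d_min = min(sq_distances)  # shift by the minimum to avoid underflow for small sigma
    weights = [math.exp(-(d - d_min) / (2 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]

def sigma_for_perplexity(sq_distances, target, lo=1e-3, hi=1e3, steps=100):
    """Binary search (on a log scale) for the sigma that gives the target perplexity."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if perplexity(conditional_probabilities(sq_distances, mid)) < target:
            lo = mid  # distribution too peaked: widen the kernel
        else:
            hi = mid  # distribution too flat: narrow the kernel
    return math.sqrt(lo * hi)

# A uniform distribution over 4 neighbors has perplexity 4
uniform_perplexity = perplexity([0.25] * 4)

# Hypothetical squared distances from one point to its 8 neighbors
sq_distances = [0.1, 0.2, 0.5, 1.0, 4.0, 9.0, 16.0, 25.0]
sigma = sigma_for_perplexity(sq_distances, target=4.0)
reached = perplexity(conditional_probabilities(sq_distances, sigma))
print(f"sigma = {sigma:.4f}, perplexity = {reached:.4f}")
```

A larger perplexity widens the kernel so that more distant points count as neighbors, which tends to emphasize global structure over local detail in the final map.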

In [None]:
# Select complementary information
label = data[["Data set", "ID", "SMILES"]]
label = label.to_numpy()
label.shape

(3231, 3)

In [None]:
# Concatenate numpy arrays
arr = np.concatenate((label, tsne_results), axis = 1)
arr.shape

(3231, 5)

In [None]:
# Create a new dataframe
tsne_dataset = pd.DataFrame(data=arr, columns = ['Data set',"ID", "SMILES",'axis 1', 'axis 2'] )
tsne_dataset.head(5)

Unnamed: 0,Data set,ID,SMILES,axis 1,axis 2
0,FDA,DB00006,CCC(C)C(NC(=O)C(CCC(=O)O)NC(=O)C(CCC(=O)O)NC(=...,11.04058,0.940783
1,FDA,DB00007,CCNC(=O)C1CCCN1C(=O)C(CCCN=C(N)N)NC(=O)C(CC(C)...,10.15655,0.643035
2,FDA,DB00014,CC(C)CC(NC(=O)C(COC(C)(C)C)NC(=O)C(Cc1ccc(O)cc...,10.318566,0.601251
3,FDA,DB00027,CC(C)CC(NC(=O)CNC(=O)C(NC=O)C(C)C)C(=O)NC(C)C(...,10.383574,1.154017
4,FDA,DB00035,N=C(N)NCCCC(NC(=O)C1CCCN1C(=O)C1CSSCCC(=O)NC(C...,9.970945,0.338789


In [None]:
# Plot
import plotly.express as px
import molplotly
fig_tsne = px.scatter(tsne_dataset,
                            x='axis 1',
                            y='axis 2',
                            #symbol='Minimum Degree',
                            color='Data set',
                            color_discrete_sequence=["indigo", "green", 'orange',],
                            title='t-SNE',
                            # keys are column names, values are the labels displayed on the plot
                            labels={'axis 1': 'Axis 1',
                                    'axis 2': 'Axis 2'},
                            width=600,
                            height=500)
app_marker = molplotly.add_molecules(fig=fig_tsne,
                                         df=tsne_dataset,
                                         smiles_col='SMILES',
                                         title_col='ID',
                                         color_col='Data set'
                                        )

#fig_tsne.show()
#app_marker.run_server(mode='inline', port=8060, height=1000)
app_marker.run(port=8060)


JupyterDash is deprecated, use Dash instead.
See https://dash.plotly.com/dash-in-jupyter for more details.




---
# For more information:
* Medina-Franco JL, Sánchez-Cruz N, López-López E, Díaz-Eufracio BI (2022) [Progress on open chemoinformatic tools for expanding and exploring the chemical space](https://link.springer.com/article/10.1007/s10822-021-00399-1). J Comput Aided Mol Des 36:341–354.
* Medina-Franco JL, Chávez-Hernández AL, López-López E, Saldívar-González FI (2022) [Chemical multiverse: An expanded view of chemical space](https://onlinelibrary.wiley.com/doi/full/10.1002/minf.202200116). Mol Inf 41:2200116.
* Saldívar-González FI, Medina-Franco JL (2022) [Approaches for enhancing the analysis of chemical space for drug discovery](https://www.tandfonline.com/doi/abs/10.1080/17460441.2022.2084608). Expert Opin Drug Discov 17:789–798.