# Unsupervised Learning

Datasets can be explored without using their labels by applying PCA, t-SNE, KMeans and UMAP.

In [1]:
from deepmol.scalers import MinMaxScaler
from deepmol.loaders import CSVLoader
from deepmol.compound_featurization import TwoDimensionDescriptors

# Load data from CSV file
loader = CSVLoader(dataset_path='../data/CHEMBL217_reduced.csv',
                   smiles_field='SMILES',
                   id_field='Original_Entry_ID',
                   labels_fields=['Activity_Flag'],
                   mode='auto',
                   shard_size=2500)
# create the dataset
data = loader.create_dataset(sep=',', header=0)
scaler = MinMaxScaler()
TwoDimensionDescriptors().featurize(data, scaler=scaler, path_to_save_scaler='scaler.pkl')

2023-03-16 14:23:48,720 — INFO — Assuming classification since there are less than 10 unique y values. If otherwise, explicitly set the mode to 'regression'!


<deepmol.datasets.datasets.SmilesDataset at 0x7f8631971910>

## PCA

PCA (Principal Component Analysis) is a widely used technique in chemoinformatics, which is the application of computational methods to chemical data. In chemoinformatics, PCA is used to analyze molecular descriptors, which are numerical representations of chemical structures.

Molecular descriptors can be used to represent various aspects of a molecule, such as its size, shape, polarity, or electronic properties. However, molecular descriptor sets can be quite large and highly correlated, making it difficult to extract meaningful information from them.

PCA can help address this problem by reducing the dimensionality of the molecular descriptor space while preserving as much of the variance as possible. Specifically, PCA does not select individual descriptors. Instead, it builds new, uncorrelated variables (the principal components), each a linear combination of the original descriptors, ordered by how much of the variation in the data they explain. Keeping only the first few components yields a smaller set of variables that captures most of the information in the original data.
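As a rough illustration of this reduction (a NumPy sketch on synthetic data, not DeepMol's implementation), PCA can be computed from the singular value decomposition of the centered descriptor matrix. The function name `pca_svd` and the toy matrix are assumptions made for this example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)          # PCA works on mean-centered features
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]           # each row mixes all original descriptors
    scores = X_centered @ components.T       # coordinates in the reduced space
    explained = (S ** 2) / np.sum(S ** 2)    # fraction of variance per component
    return scores, components, explained[:n_components]

# Synthetic stand-in for a descriptor matrix: three highly correlated columns
# driven by one hidden factor, plus a little noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
X = np.hstack([factor, 2.0 * factor, -factor]) + 0.05 * rng.normal(size=(200, 3))

scores, components, explained = pca_svd(X, n_components=2)
print(scores.shape)        # (200, 2)
print(explained.round(3))  # the first component carries almost all of the variance
```

Because the three toy columns are nearly redundant, a single component is enough to describe them, which is the situation PCA is meant to exploit with correlated descriptor sets.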

The reduced set of descriptors can then be used for various tasks, such as drug design, virtual screening, or molecular similarity analysis. PCA can also be used for visualization and exploration of chemical data, by projecting the high-dimensional descriptor space onto a lower-dimensional space that can be easily visualized.

Overall, PCA is a powerful tool in chemoinformatics that can help extract meaningful information from complex chemical data sets, and facilitate the discovery and design of new drugs and materials.

In [2]:
from deepmol.unsupervised import PCA

pca = PCA(n_components=2)
pca_df = pca.run_unsupervised(data)
pca.plot(pca_df.X, path='pca_output.png')

2023-03-16 14:23:55,505 — INFO — 2 Components PCA: 


In [3]:
pca = PCA(n_components=3)
pca_df = pca.run_unsupervised(data)
pca.plot(pca_df.X, path='pca_output.png')

2023-03-16 14:23:56,651 — INFO — 3 Components PCA: 


In [4]:
pca = PCA(n_components=6)
pca_df = pca.run_unsupervised(data)
pca.plot(pca_df.X, path='pca_output.png')

2023-03-16 14:23:58,320 — INFO — 6 Components PCA: 


## t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is a popular technique in chemoinformatics for visualizing high-dimensional molecular data in a lower-dimensional space.

In chemoinformatics, t-SNE is often used to explore the structure-activity relationship (SAR) of chemical compounds, by visualizing how similar compounds are clustered in a lower-dimensional space based on their molecular descriptors.

t-SNE works by first computing pairwise similarities between the high-dimensional data points, such as molecular descriptors. These similarities are then used to construct a probability distribution that represents the likelihood of a data point being similar to other data points in the high-dimensional space.

Next, t-SNE creates a similar probability distribution in a lower-dimensional space, and iteratively adjusts the positions of the data points in this space to minimize the difference between the two distributions. The result is typically a 2D or 3D visualization in which similar data points are located close to each other. Because t-SNE prioritizes local neighborhoods, distances between well-separated groups in the embedding should be interpreted with caution.
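The two distributions and the quantity being minimized can be sketched in NumPy as follows. This is a simplified illustration rather than the algorithm as implemented in any library: it uses one fixed Gaussian bandwidth `sigma` for all points (real t-SNE tunes a bandwidth per point from the perplexity) and only evaluates the objective instead of optimizing it. All function names are assumptions made for the example.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows in Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)                # clip tiny negatives from rounding

def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities in the original space, symmetrized to sum to 1."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)        # conditional p(j | i)
    return (P + P.T) / (2.0 * len(X))        # joint p(i, j)

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities in the embedding."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the mismatch that t-SNE reduces by moving the embedded points."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Two well-separated synthetic clusters in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 5)),
               rng.normal(5.0, 0.3, size=(10, 5))])
P = high_dim_affinities(X)

Y_good = X[:, :2]                            # embedding that keeps the clusters intact
perm = np.concatenate([np.arange(10, 15), np.arange(5, 10),
                       np.arange(0, 5), np.arange(15, 20)])
Y_bad = Y_good[perm]                         # same positions, half the points swapped

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(kl_good, kl_bad)                       # the cluster-preserving layout scores lower
```

An embedding that keeps neighbors together gives a smaller divergence than one that scatters them, which is why gradient descent on this objective pulls similar compounds toward each other.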

t-SNE is particularly useful for visualizing complex and non-linear relationships in chemoinformatics data, and for identifying clusters or patterns that may not be easily detectable in the original high-dimensional space. However, t-SNE is a stochastic technique: its results depend on the choice of hyperparameters (notably the perplexity) and on the random initialization, so different runs can produce different embeddings. Therefore, t-SNE should be used in combination with other techniques, such as PCA or hierarchical clustering, to gain a more comprehensive understanding of the chemical data.

In [5]:
from deepmol.unsupervised import TSNE

tsne = TSNE(n_components=2)
tsne_df = tsne.run_unsupervised(data)
tsne.plot(tsne_df.X, path='tsne_output.png')


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



2023-03-16 14:24:03,530 — INFO — 2 Components t-SNE: 


In [6]:
tsne = TSNE(n_components=3)
tsne_df = tsne.run_unsupervised(data)
tsne.plot(tsne_df.X, path='tsne_output.png')





2023-03-16 14:24:16,461 — INFO — 3 Components t-SNE: 


In [7]:
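# method='exact' is needed for 4 components: scikit-learn's default Barnes-Hut
# approximation only supports fewer than 4, and the exact method is much slower.
tsne = TSNE(n_components=4, method='exact')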
tsne_df = tsne.run_unsupervised(data)
tsne.plot(tsne_df.X, path='tsne_output.png')





2023-03-16 14:29:41,488 — INFO — 4 Components t-SNE: 


## KMeans

K-means clustering is a widely used unsupervised learning algorithm in chemoinformatics for identifying groups of similar chemical compounds based on their molecular descriptors.

The algorithm works by iteratively assigning each data point (i.e., chemical compound) to the closest cluster center (i.e., centroid), and updating the cluster centers based on the new assignments. The process continues until the assignments no longer change, or until a maximum number of iterations is reached.
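The assignment and update loop described above can be sketched in plain Python. This is a minimal illustration and not DeepMol's or scikit-learn's implementation: the function name `kmeans` is hypothetical, and the starting centroids are passed in explicitly to keep the example deterministic, whereas library implementations usually choose them randomly or with k-means++.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on a list of coordinate tuples, from given start centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:   # converged: assignments stopped changing
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Toy 2D "descriptor" vectors forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

The loop stops as soon as an assignment step leaves every point in the same cluster as before, or when `max_iter` is reached, matching the two stopping conditions mentioned above.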

In chemoinformatics, k-means clustering is often used for tasks such as compound clustering, lead optimization, and hit identification. By identifying clusters of similar compounds, researchers can gain insights into the structure-activity relationships (SAR) of the compounds, and identify potential candidates for further study.

However, k-means clustering has some limitations in chemoinformatics. Because it minimizes the squared Euclidean distance to the cluster centers, the algorithm implicitly favors roughly spherical clusters of similar size, which may not always be the case for chemical compounds. The result also depends on the initial placement of the centroids. Another limitation is that the number of clusters must be specified in advance, which may be difficult to determine for large and complex data sets.

In [8]:
from deepmol.unsupervised import KMeans

kmeans = KMeans(n_clusters=2)
kmeans_df = kmeans.run_unsupervised(data)
kmeans.plot(kmeans_df.X, path='kmeans_output.png')

2023-03-16 14:29:43,255 — INFO — Plotting the results of the clustering.


In [9]:
kmeans = KMeans(n_clusters=3)
kmeans_df = kmeans.run_unsupervised(data)
kmeans.plot(kmeans_df.X, path='kmeans_output.png')

2023-03-16 14:29:44,910 — INFO — Plotting the results of the clustering.


In [10]:
kmeans = KMeans(n_clusters=6)
kmeans_df = kmeans.run_unsupervised(data)
kmeans.plot(kmeans_df.X, path='kmeans_output.png')

2023-03-16 14:29:47,515 — INFO — Plotting the results of the clustering.


## UMAP

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that has gained popularity in chemoinformatics for visualizing and analyzing high-dimensional molecular data.

Like t-SNE, UMAP works by creating a lower-dimensional representation of the high-dimensional data, but it uses a different approach motivated by ideas from topology and Riemannian geometry. UMAP first constructs a weighted nearest-neighbor graph that captures the local relationships between the data points in the high-dimensional space. It then optimizes the positions of the points in a lower-dimensional space so that the graph of the embedding is as similar as possible to the high-dimensional one.
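The graph of local relationships can be illustrated with a plain k-nearest-neighbor search. The sketch below covers only this first stage and is not the UMAP algorithm itself: UMAP additionally assigns weights to the edges and then optimizes the low-dimensional layout, and practical implementations use approximate neighbor search rather than the brute-force distances shown here. The function name `knn_graph` is an assumption made for the example.

```python
import numpy as np

def knn_graph(X, k):
    """For each row of X, return the indices of its k nearest neighbors (Euclidean)."""
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

# Two small groups of points that are far from each other.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.3]])
neighbors = knn_graph(X, k=2)
print(neighbors.tolist())  # every point's neighbors come from its own group
```

Since each point only links to its closest neighbors, the graph encodes local structure, and an embedding that preserves these links keeps each group together.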

In chemoinformatics, UMAP has been used for tasks such as compound clustering, lead optimization, and molecular visualization. UMAP can reveal complex and non-linear relationships in the data that may not be easily visible in the original high-dimensional space, and it can provide insights into the structure-activity relationships (SAR) of the compounds.

One advantage of UMAP over other dimensionality reduction techniques is its scalability and speed: it can handle large and complex data sets and is often faster than t-SNE on them. Moreover, UMAP has only a few parameters to tune, such as the number of neighbors and the minimum distance between embedded points, making it easy to use and apply in various chemoinformatics applications.

In [11]:
from deepmol.unsupervised import UMAP

ump = UMAP(n_components=2)
umap_df = ump.run_unsupervised(data)
ump.plot(umap_df.X, path='umap_output.png')


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
2023-03-16 14:29:57,720 — INFO — 2 Components UMAP: 


In [12]:
ump = UMAP(n_components=3)
umap_df = ump.run_unsupervised(data)
ump.plot(umap_df.X, path='umap_output.png')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
2023-03-16 14:30:03,058 — INFO — 3 Components UMAP: 


In [13]:
ump = UMAP(n_components=6)
umap_df = ump.run_unsupervised(data)
ump.plot(umap_df.X, path='umap_output.png')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
2023-03-16 14:30:08,928 — INFO — 6 Components UMAP: 
