In [1]:
%load_ext autoreload
%autoreload 2
from src.cdr_bench.io_utils.io import read_features_hdf5_dataframe, read_optimization_results

# Read features used in optimization

# HDF5 File Structure (CHEMBL204.h5)

This HDF5 file contains chemical compound data along with several molecular features. The structure of the file is organized into two main sections: **Dataset and SMILES** and **Features**.

## 1. Dataset and SMILES (smi)
- **dataset**: Contains the identifier of the dataset each compound belongs to (here "CHEMBL204" for every row).
- **smi**: Contains SMILES strings representing the chemical structure of compounds.

## 2. Features
The features section contains several key molecular features:
- **embed**: A numerical feature representation of the chemical compounds. These are lists of floating-point numbers.
- **maccs_keys**: A list of MACCS molecular fingerprints. These are binary fingerprints indicating the presence or absence of certain molecular features (0s and 1s).
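- **mfp_r2_1024**: A list of Morgan molecular fingerprints (radius 2, length 1024). Each fingerprint encodes molecular substructures as a list of integers; the preview further down includes values greater than 1, which suggests counts rather than strict presence/absence bits.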

## Overview of Data
- The dataset contains **4020 rows**, each representing a distinct chemical compound.
- Each row has the following columns:
  - **dataset**: The dataset identifier (the same value, "CHEMBL204", for all rows of this file).
  - **smi**: The SMILES string of the compound.
  - **embed**: Numerical embeddings for each compound, stored as lists of floating-point numbers.
  - **maccs_keys**: Binary MACCS molecular keys, stored as lists of integers.
  - **mfp_r2_1024**: Morgan molecular fingerprints (1024-bit), stored as lists of integers.

Each of these feature columns provides a different numerical representation of the molecular structure for machine learning or cheminformatics analysis.


In [2]:
df_features = read_features_hdf5_dataframe('../datasets/CHEMBL204.h5')

In [3]:
df_features.head()

Unnamed: 0,dataset,smi,embed,maccs_keys,mfp_r2_1024
0,b'CHEMBL204',b'CN1CCN(CC1)C(=O)CC(NC(=O)C=Cc1cc(Cl)ccc1-n1c...,"[-12.185702323913574, -26.70113182067871, 42.6...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ..."
1,b'CHEMBL204',b'Cc1cc(OCCC=NN=C(N)N)cc(OCc2ccccc2C(F)(F)F)c1',"[-13.178268432617188, -31.464393615722656, 49....","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, ..."
2,b'CHEMBL204',b'NCc1ccc(Cl)cc1CNC(=O)CNC(=O)C(CCc1cccc[n+]1[...,"[-15.059322357177734, -29.994260787963867, 46....","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,b'CHEMBL204',b'COc1cc(Cl)cc(C(=O)Nc2ccc(Cl)cn2)c1NC(=O)c1sc...,"[-12.55119514465332, -28.289939880371094, 43.5...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,b'CHEMBL204',b'CC1(C)C2CC1C1(C)OB(OC1C2)C(CCCN=C(N)N)NC(=O)...,"[-7.98114013671875, -36.066646575927734, 38.75...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
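The preview above shows `dataset` and `smi` as byte strings (`b'...'`) and each feature as a Python list per row. A minimal sketch of how one might decode the text columns and stack a feature column into a 2D array follows. The helper names `decode_bytes_columns` and `stack_feature_column` are hypothetical and not part of `cdr_bench`, and the toy frame only mimics the column layout with made-up short fingerprints.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_features: same column layout, made-up 4-element fingerprints.
toy = pd.DataFrame({
    "dataset": [b"CHEMBL204", b"CHEMBL204"],
    "smi": [b"CCO", b"c1ccccc1"],
    "mfp_r2_1024": [[0, 1, 0, 2], [1, 0, 0, 0]],
})

def decode_bytes_columns(df, columns=("dataset", "smi")):
    """Return a copy of df with byte-string cells decoded to str (hypothetical helper)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
    return out

def stack_feature_column(df, column):
    """Turn a column of equal-length lists into an (n_rows, n_features) array."""
    return np.array(df[column].tolist())

decoded = decode_bytes_columns(toy)
X = stack_feature_column(decoded, "mfp_r2_1024")
print(decoded["smi"].tolist())
print(X.shape)
```

Applied to the real `df_features`, the same two calls should give plain-string SMILES and, for `mfp_r2_1024`, an array with one row per compound, assuming every row holds a list of the same length.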


# Read optimization results

## HDF5 File Structure (ambient_dist_and_PCA_results.h5)

This HDF5 file contains datasets related to high-dimensional data and the results of Principal Component Analysis (PCA) performed on this data.

### Datasets
1. **X_HD**: 
   - A high-dimensional dataset with a shape of **(4020, 1024)**.
   - It contains 4020 samples, each with 1024 features.
   - This is the original high-dimensional feature representation of the data.

2. **X_PCA**: 
   - A PCA-transformed dataset with a shape of **(4020, 2)**.
   - It contains the same 4020 samples, but each sample has been reduced to 2 principal components using PCA.
   - This reduced dataset is typically used for visualization or as input for further machine learning analysis.

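Storing **X_HD** and **X_PCA** side by side presumably allows the original and PCA-reduced representations to be compared directly, for example by checking how well neighborhoods in the 1024-dimensional space survive the projection to 2 dimensions.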
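To make the relationship between **X_HD** and **X_PCA** concrete, here is a minimal sketch of a 2-component PCA projection using only NumPy's SVD. It runs on random toy data, not on the file's contents. Whether the stored `X_PCA` was produced this way (for example with scikit-learn, or with scaling applied first) is not stated in this notebook, so treat `pca_project` as a hypothetical illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its first n_components principal axes (illustrative helper)."""
    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Toy high-dimensional data: 50 samples, 16 binary features.
rng = np.random.default_rng(0)
X_HD_toy = rng.integers(0, 2, size=(50, 16)).astype(float)
X_PCA_toy = pca_project(X_HD_toy, n_components=2)
print(X_PCA_toy.shape)
```

With the real data, the input would have shape (4020, 1024) and the output (4020, 2), matching the shapes listed above.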


## HDF5 File Structure (mfp_r2_1024.h5)

This HDF5 file contains datasets and groups related to various dimensionality reduction techniques applied to the Morgan Fingerprint (radius 2, 1024 bits).

### Datasets and Groups:

1. **GTM_coordinates**:
   - Coordinates of the data after applying Generative Topographic Mapping (GTM).
   - These are likely 2D or 3D coordinates for visualization or analysis.

2. **GTM_metrics**:
   - Performance or quality metrics related to the GTM method.
   - Could contain measures like reconstruction error, stress, or variance explained.

3. **PCA_coordinates**:
   - Coordinates of the data after applying Principal Component Analysis (PCA).
   - This dataset contains the principal component scores, often used to visualize the data in reduced dimensions.

4. **PCA_metrics**:
   - Metrics associated with the PCA method, likely showing the explained variance or other statistics for each principal component.

5. **UMAP_coordinates**:
   - Coordinates after applying Uniform Manifold Approximation and Projection (UMAP).
   - These coordinates represent the data in a lower-dimensional space, typically for clustering or visualization.

6. **UMAP_metrics**:
   - Performance metrics for the UMAP method, possibly including measures like neighborhood preservation.

7. **dataframe**:
   - A tabular dataset, possibly containing metadata or other information related to the features or molecules analyzed in the file.

8. **mfp_r2_1024**:
   - The original Morgan Fingerprint dataset (radius 2, 1024 bits).
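   - It encodes molecular substructures as a fixed-length integer vector, used for cheminformatics and molecular modeling. The feature preview earlier in this notebook shows entries greater than 1, so the vector may hold counts rather than strictly binary bits.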

9. **t-SNE_coordinates**:
   - Coordinates after applying t-Distributed Stochastic Neighbor Embedding (t-SNE).
   - These coordinates are typically used for visualizing high-dimensional data in 2D or 3D.

10. **t-SNE_metrics**:
    - Metrics related to the t-SNE method, possibly including perplexity, KL divergence, or neighborhood preservation.

This file organizes molecular data and its corresponding projections in various dimensionality reduction methods, enabling analysis and comparison of the techniques.
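The entries listed above follow a `<method>_coordinates` / `<method>_metrics` naming pattern. As a small sketch, the names can be grouped by method with plain string handling. The function `group_keys_by_method` is a hypothetical helper, and the key list is copied from the description above rather than read from the file.

```python
def group_keys_by_method(keys):
    """Group '<method>_coordinates' and '<method>_metrics' names by method."""
    grouped = {}
    for key in keys:
        for suffix in ("_coordinates", "_metrics"):
            if key.endswith(suffix):
                method = key[: -len(suffix)]
                # Store under 'coordinates' or 'metrics' for that method.
                grouped.setdefault(method, {})[suffix.lstrip("_")] = key
                break
    return grouped

# Key names as listed in the file description above.
keys = [
    "GTM_coordinates", "GTM_metrics",
    "PCA_coordinates", "PCA_metrics",
    "UMAP_coordinates", "UMAP_metrics",
    "dataframe", "mfp_r2_1024",
    "t-SNE_coordinates", "t-SNE_metrics",
]
grouped = group_keys_by_method(keys)
for method, parts in grouped.items():
    print(method, parts)
```

Names without either suffix (`dataframe`, `mfp_r2_1024`) are left out of the grouping, which mirrors their role as shared inputs rather than per-method outputs.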


In [10]:
file_path = '../results/in_sample_eucl/CHEMBL204/mfp_r2_1024/mfp_r2_1024.h5'
descriptor_set = 'mfp_r2_1024'
methods_to_extract = ['PCA', 't-SNE', 'UMAP', 'GTM']
df, fp_array, results = read_optimization_results(file_path, feature_name=descriptor_set, method_names=methods_to_extract)

In [11]:
df.head()

Unnamed: 0,dataset,smi
0,CHEMBL204,CN1CCN(CC1)C(=O)CC(NC(=O)C=Cc1cc(Cl)ccc1-n1cnn...
1,CHEMBL204,Cc1cc(OCCC=NN=C(N)N)cc(OCc2ccccc2C(F)(F)F)c1
2,CHEMBL204,NCc1ccc(Cl)cc1CNC(=O)CNC(=O)C(CCc1cccc[n+]1[O-...
3,CHEMBL204,COc1cc(Cl)cc(C(=O)Nc2ccc(Cl)cn2)c1NC(=O)c1scc(...
4,CHEMBL204,CC1(C)C2CC1C1(C)OB(OC1C2)C(CCCN=C(N)N)NC(=O)c1...


In [13]:
fp_array.shape

(4020, 1024)
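The `results` dictionary is nested and verbose when printed in full, so a compact view of the scalar metrics can be easier to read. The sketch below assumes the layout visible in the printed output in this notebook: each method maps to a dict with a `'metrics'` entry, where scalar metrics such as `AUC` or `Qlocal` are stored as 2-tuples and curve-like metrics such as `QNN` are tuples of arrays. Reading the two tuple elements as a value and its spread is an assumption, `summarize_scalar_metrics` is a hypothetical helper, and the toy numbers are loosely rounded from that output.

```python
import numpy as np

def summarize_scalar_metrics(results):
    """Collect metrics stored as 2-tuples of scalars into flat rows."""
    rows = []
    for method, payload in results.items():
        for name, entry in payload.get("metrics", {}).items():
            # Keep only pairs of scalars; skip array-valued curves such as QNN.
            is_pair = isinstance(entry, tuple) and len(entry) == 2
            if is_pair and all(np.ndim(v) == 0 for v in entry):
                rows.append((method, name, float(entry[0]), float(entry[1])))
    return rows

# Toy stand-in mimicking the nesting of `results` (values are illustrative).
toy_results = {
    "PCA": {
        "metrics": {
            "AUC": (np.float64(0.548), np.float64(0.0007)),
            "Qlocal": (np.float64(0.100), np.float64(0.002)),
            "QNN": (np.array([0.02, 0.05, 1.0]), np.array([0.005, 0.003, 0.0])),
        }
    }
}
rows = summarize_scalar_metrics(toy_results)
for row in rows:
    print(row)
```

Array-valued entries are skipped on purpose, since they are per-neighborhood-size curves and do not reduce to a single table cell.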

In [15]:
results

{'PCA': {'metrics': {'AUC': (np.float64(0.5479564052076352),
    np.float64(0.0006813930286349295)),
   'LCMC': (array([ 2.42665066e-02,  4.45996799e-02,  5.63995198e-02, ...,
           -6.41025333e-07, -1.60192179e-07,  0.00000000e+00]),
    array([0.00475908, 0.00256645, 0.00131993, ..., 0.        , 0.        ,
           0.        ])),
   'QNN': (array([0.02466667, 0.0454    , 0.0576    , ..., 0.99919904, 0.99959968,
           1.        ]),
    array([0.00475908, 0.00256645, 0.00131993, ..., 0.        , 0.        ,
           0.        ])),
   'Qglobal': (np.float64(0.5526558292306168),
    np.float64(0.0010322391467407711)),
   'Qlocal': (np.float64(0.10001809539257533),
    np.float64(0.002001014100452946)),
   'cont_ls': (array([0.93135454, 0.91039158, 0.87825185, 0.831341  , 0.74923995]),
    array([0.00316594, 0.00353948, 0.00313003, 0.00332986, 0.00291319])),
   'kmax': (np.float64(26.666666666666668), np.float64(2.0548046676563256)),
   'nn_overlap': array([ 3.49502488,  6.