# Analysing Neighborhood Preservation

### Generating fingerprints from the .csv file and saving them as an .h5 file

#### Importing necessary libraries

In [12]:
import numpy as np
import pandas as pd
import sys
sys.path.append(r'C:\Users\akash\OneDrive\Desktop\DR\Nbd_Pres\iSIM\cdr_bench')
from rdkit.Chem import AllChem, MACCSkeys, rdMolDescriptors
from src.cdr_bench.io_utils.io import read_features_hdf5_dataframe, read_optimization_results, check_hdf5_file_format


#### Generating RDKit Fingerprints

Before running the script, edit `features.toml` so that it points to your input and output directories. The script currently processes only the first 1000 molecules; to change this limit, edit line 106 of `scripts/generate_descriptors.py`.
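
A quick pre-flight check can catch a wrong input directory before the script is launched. The sketch below assumes the `input_path` and `file_pattern` settings that the script echoes in its log; `list_input_files` is a hypothetical helper and not part of cdr_bench.

```python
from pathlib import Path


def list_input_files(input_path, file_pattern="*.csv"):
    """Return the sorted files in input_path that match file_pattern.

    Hypothetical helper: it mirrors the `input_path` and `file_pattern`
    settings but is not part of cdr_bench.
    """
    directory = Path(input_path)
    if not directory.is_dir():
        raise FileNotFoundError(f"Input directory not found: {directory}")
    # Keep regular files only, in a stable order.
    files = sorted(p for p in directory.glob(file_pattern) if p.is_file())
    if not files:
        raise FileNotFoundError(f"No files matching {file_pattern!r} in {directory}")
    return files
```

Calling `list_input_files('../Test_Dataset')` from the directory in which the script runs would list the `.csv` files it is about to process, or raise if the relative path does not resolve as expected.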

In [27]:
import subprocess

result = subprocess.run(
    ['python', 'C:/Users/akash/OneDrive/Desktop/DR/Nbd_Pres/iSIM/cdr_bench/scripts/generate_descriptors.py',  'C:/Users/akash/OneDrive/Desktop/DR/Nbd_Pres/iSIM/cdr_bench/bench_configs/features.toml'],
    capture_output=True,
    text=True
)

print("Return code:", result.returncode)
print("Output:", result.stdout)
print("Error:", result.stderr)

Return code: 0
Output: INFO: Pandarallel will run on 24 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

https://nalepae.github.io/pandarallel/troubleshooting/
2
C:/Users/akash/OneDrive/Desktop/DR/Nbd_Pres/iSIM/cdr_bench/bench_configs/features.toml
Loading configuration from C:/Users/akash/OneDrive/Desktop/DR/Nbd_Pres/iSIM/cdr_bench/bench_configs/features.toml.
{'input_path': WindowsPath('../Test_Dataset'), 'output_path': '../Test_FPS/', 'file_pattern': '*.csv', 'generate_morgan': {'morgan_radius': 2, 'morgan_fp_size': 1024}, 'preprocess_descriptors': False}
Processing files in ..\Test_Dataset with pattern *.csv.
Processing ..\Test_Dataset\CHEMBL33.csv
Processed CHEMBL33 with 826 molecules.
DataFrame saved to HDF5 file at ../Test_FPS//CHEMBL33.h5 with hierarchical structure.

Error: 
  0%|          | 0/1 [00:00<?, ?it/s][21:14:09] SMILES Parse Error: syntax error while parsing: smiles
[21:14:09] SM
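
The run above exits with return code 0 even though stderr contains RDKit `SMILES Parse Error` messages, so the return code alone does not reveal parsing problems. The following is a minimal sketch, assuming the same style of script invocation, that fails loudly on a non-zero exit and collects the parse-error lines for inspection. `run_script` and `parse_error_lines` are hypothetical helpers, not part of cdr_bench.

```python
import subprocess
import sys


def run_script(script_path, *args):
    """Run a Python script with the current interpreter.

    Returns the CompletedProcess; raises RuntimeError on a non-zero exit code.
    """
    result = subprocess.run(
        [sys.executable, str(script_path), *map(str, args)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{script_path} exited with code {result.returncode}:\n{result.stderr}"
        )
    return result


def parse_error_lines(stderr_text):
    """Return the stderr lines that contain RDKit's 'SMILES Parse Error' message."""
    return [line for line in stderr_text.splitlines() if "SMILES Parse Error" in line]
```

Using `sys.executable` instead of the bare `python` command runs the script in the same environment as the notebook kernel, which avoids picking up a different interpreter from `PATH`.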

# HDF5 File Structure (CHEMBL33.h5)

This HDF5 file contains chemical compound data along with several molecular features. The structure of the file is organized into two main sections: **Dataset and SMILES** and **Features**.

## 1. Dataset and SMILES (smi)
- **dataset**: Contains the label of the source dataset for each compound (e.g., "CHEMBL33").
- **smi**: Contains SMILES strings representing the chemical structure of compounds.

## 2. Features
The features section holds the computed molecular descriptors. This file contains one:
- **RDKit Fingerprints**: One RDKit molecular fingerprint per compound. Each fingerprint encodes molecular substructures as a fixed-length list of 0/1 values.

## Overview of Data
- The dataset contains **4020 rows**, each representing a distinct chemical compound.
- Each row has the following columns:
  - **dataset**: The label of the source dataset.
  - **smi**: The SMILES string of the compound.
  - **RDKit_fp**: RDKit molecular fingerprints (1024-bit), stored as lists of 0/1 values (displayed as floats when loaded).

The feature column provides a numerical representation of the molecular structure for machine learning or cheminformatics analysis.


#### Reading the .h5 file

In [25]:
df_features = read_features_hdf5_dataframe(r'C:\Users\akash\OneDrive\Desktop\DR\Nbd_Pres\iSIM\cdr_bench\Test_FPS\CHEMBL33.h5')

In [26]:
df_features.head()



| | dataset | smi | RDKit_fp |
|---|---|---|---|
| 0 | b'CHEMBL33' | b'BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br' | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | b'CHEMBL33' | b'BrCCCCCCBr' | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | b'CHEMBL33' | b'BrCc1cc(Br)c2cc(NBr)c(Br)c(Br)c2c1' | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | b'CHEMBL33' | b'BrCc1cc(Br)c2cc(NBr)c(Br)cc2c1Br' | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | b'CHEMBL33' | b'BrCc1cc2cc(Br)c(NBr)cc2c(Br)c1Br' | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
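
The `dataset` and `smi` values above are byte strings (`b'...'`), which can be awkward to pass to tools that expect `str`. The following is a minimal sketch for decoding them, assuming UTF-8 encoding. `decode_bytes_columns` is a hypothetical helper, not part of cdr_bench.

```python
import pandas as pd


def decode_bytes_columns(df, columns=("dataset", "smi")):
    """Return a copy of df with bytes values in the given columns decoded to str."""
    out = df.copy()
    for col in columns:
        # Leave values that are already str (or anything else) untouched.
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
    return out
```

The input frame is left unchanged, so `df_text = decode_bytes_columns(df_features)` can be used alongside the original.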


In [19]:
df_features['RDKit_fp']

0       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
                              ...                        
1219    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, ...
1220    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, ...
1221    [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, ...
1222    [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, ...
1223    [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, ...
Name: RDKit_fp, Length: 1224, dtype: object
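
Because `RDKit_fp` has `dtype: object` with one list per row, most numerical routines need it stacked into a 2D array first. The following is a minimal sketch, assuming every fingerprint has the same length. `fingerprints_to_matrix` is a hypothetical helper, not part of cdr_bench.

```python
import numpy as np


def fingerprints_to_matrix(fp_column, dtype=np.uint8):
    """Stack an iterable of equal-length fingerprint lists into a 2D array."""
    rows = [np.asarray(fp, dtype=float) for fp in fp_column]
    lengths = {row.shape[0] for row in rows}
    # Exactly one distinct length is required; this also rejects empty input.
    if len(lengths) != 1:
        raise ValueError(f"Fingerprints have inconsistent lengths: {sorted(lengths)}")
    return np.vstack(rows).astype(dtype)
```

With the frame loaded above, `X = fingerprints_to_matrix(df_features['RDKit_fp'])` should give an array with one row per molecule and one column per fingerprint bit.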

In [28]:
file_path = r'C:\Users\akash\OneDrive\Desktop\DR\Nbd_Pres\iSIM\cdr_bench\Test_Output\CHEMBL33\RDKit_fp\RDKit_fp.h5'
descriptor_set = 'RDKit_fp'
methods_to_extract = ['PCA']
df, fp_array, results = read_optimization_results(file_path, feature_name=descriptor_set, method_names=methods_to_extract)
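
The `results` object returned here is a dictionary keyed by method name. Each entry has a `'metrics'` dictionary holding an `AUC` scalar and an `LCMC` array, as the output of the `results` cell shows. The following is a minimal sketch for condensing it into one record per method. `summarize_metrics` is a hypothetical helper, and it assumes every method provides both keys.

```python
def summarize_metrics(results):
    """Return {method: summary} with the AUC and the range of the LCMC curve."""
    summary = {}
    for method, payload in results.items():
        metrics = payload["metrics"]
        # Works for plain lists and for 1D NumPy arrays alike.
        lcmc = [float(v) for v in metrics["LCMC"]]
        summary[method] = {
            "AUC": float(metrics["AUC"]),
            "LCMC_points": len(lcmc),
            "LCMC_min": min(lcmc),
            "LCMC_max": max(lcmc),
        }
    return summary
```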

In [29]:
results

{'PCA': {'metrics': {'AUC': np.float64(0.7341289707637753),
   'LCMC': array([0.17917529, 0.18401644, 0.18320787, 0.18219752, 0.18522269,
          0.18905496, 0.19490498, 0.19989746, 0.2055289 , 0.20906523,
          0.21019737, 0.21305745, 0.21724671, 0.21971311, 0.22152762,
          0.22243415, 0.22380357, 0.22549148, 0.22948658, 0.23338469,
          0.23495131, 0.23824639, 0.23988624, 0.24254953, 0.24504806,
          0.24740085, 0.24850311, 0.2501751 , 0.25173168, 0.2544354 ,
          0.25626162, 0.25782229, 0.26020544, 0.2623415 , 0.26286804,
          0.26407147, 0.26570056, 0.26679781, 0.26833539, 0.27025002,
          0.27189401, 0.27291196, 0.27512131, 0.27645975, 0.27843813,
          0.2791461 , 0.28142091, 0.28299554, 0.28522236, 0.28716634,
          0.28872543, 0.2903409 , 0.29162125, 0.29301106, 0.29443833,
          0.29620371, 0.29777966, 0.30013614, 0.30140724, 0.30241397,
          0.30348688, 0.30518903, 0.3075289 , 0.30973886, 0.31098675,
          0.31247193, 