Repository for the article Statistical-Based Database Fingerprint: Chemical space dependent representation of compound databases
DNMT1.csv
ECFP4.counts
Epigenetic_datasets.tar.gz
MACCS.counts
README.md
SB-DFPCalc.ipynb
SB-DFPCalc.py
Similarity_Searching.csv
ZINC12_AllClean.tar.gz.part00
ZINC12_AllClean.tar.gz.part01
ZINC12_AllClean.tar.gz.part02
ZINC12_AllClean.tar.gz.part03
ZINC12_AllClean.tar.gz.part04
ZINC12_AllClean.tar.gz.part05
ZINC12_AllClean.tar.gz.part06
ZINC12_AllClean.tar.gz.part07


SB-DFP

Statistical-Based Database Fingerprint: Chemical space dependent representation of compound databases

Epigenetic_datasets.tar.gz Contains the 28 datasets described in the work in CSV format.

ZINC12_AllClean.tar.gz.part00 to ZINC12_AllClean.tar.gz.part07 Together, these 8 parts contain the following 4 files in CSV format:

Not_Processed.csv | Contains 21 structures that were not processed by RDKit.
Repeated_Structures.csv | Contains 154 compounds repeated between the ZINC 12 AllClean subset and the Epigenetic datasets.
Reference_Dataset.csv | Contains 15,403,690 structures used as the reference for SB-DFP calculations.
Decoys.csv | Contains 1 million compounds used as decoys in the similarity searching runs.

To uncompress:
1) Join the 8 parts into a single compressed file with: cat ZINC12_AllClean.tar.gz.part* > ZINC12_AllClean.tar.gz
2) Extract the files with: tar -xvzf ZINC12_AllClean.tar.gz
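For systems without cat and tar, the two uncompress steps above can be approximated with the Python standard library. This is a minimal sketch, not part of the repository: the helper names join_parts and extract_archive are hypothetical, and it assumes the parts sort correctly by file name (part00 to part07).

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(part_paths, archive_path):
    """Concatenate split archive parts, in file-name order, into one file."""
    parts = sorted(Path(p) for p in part_paths)
    if not parts:
        raise FileNotFoundError("no archive parts were given")
    archive_path = Path(archive_path)
    # "wb" overwrites any leftover archive from an earlier attempt
    with archive_path.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return archive_path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
        return tar.getnames()


# Example usage (run from the directory that holds the 8 parts):
#   parts = Path(".").glob("ZINC12_AllClean.tar.gz.part*")
#   archive = join_parts(parts, "ZINC12_AllClean.tar.gz")
#   print(extract_archive(archive))
```

Sorting the parts explicitly mirrors what the shell glob does, so the bytes are joined in the same order as with cat.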

SB-DFPCalc.ipynb and SB-DFPCalc.py The Jupyter notebook with the code used for the DFP and SB-DFP calculations, including some examples, and the same code exported as a Python script.

MACCS.counts and ECFP4.counts Files containing the reference counts for the calculation of SB-DFP based on MACCS and ECFP4, respectively. Both files are needed to run the Python code.

DNMT1.csv Example dataset needed to run the Python code. This file is also contained in Epigenetic_datasets.tar.gz

Similarity_Searching.csv This file contains the results of all similarity searching calculations performed in the work, including Data set, Method, Fingerprint, Confidence level (if applicable), Recovery Rate and Area Under ROC Curve.
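As an illustration of how the results file might be summarized, the sketch below averages a numeric column per group using only the standard library. The helper mean_by_group is hypothetical, and the column names "Method" and "AUC" in the usage comment are assumptions; check the actual header row of Similarity_Searching.csv before running it.

```python
import csv
from collections import defaultdict


def mean_by_group(rows, group_col, value_col):
    """Return the mean of value_col for each distinct value of group_col."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        entry = totals[row[group_col]]
        entry[0] += float(row[value_col])
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Example usage (column names are assumed, not taken from the file):
#   with open("Similarity_Searching.csv", newline="") as fh:
#       print(mean_by_group(csv.DictReader(fh), "Method", "AUC"))
```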