Data Cleaning

ChemicalX ships with pre-processed benchmark datasets. For each dataset in the framework, this section describes how the raw data was obtained and which pre-processing steps were applied.

DrugCombDB

*

DrugComb

*

Drugbank DDI

We used the cleaned dataset from the Therapeutic Data Commons.
Drug identifiers are represented by the DrugBank identifier.
Contexts are represented by drug-drug interaction identifiers from DrugBank.
Using RDKit 2021.09.03, we generated 256-dimensional Morgan fingerprints.
Labels represent the presence of a specific drug-drug interaction.
Context features are one-hot encoded binary vectors.
We generated as many negative samples as there are positive samples.
Negative samples do not contain collisions.
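
The negative sampling described above can be sketched in a few lines. This is a minimal illustration and not the code used to build the dataset: the function name ``sample_negatives``, the fixed seed, the example identifiers, and the reading of "collision" as a negative triple that coincides with a positive triple (in either drug order) are all assumptions.

```python
import random


def sample_negatives(positives, seed=42, max_attempts=100_000):
    """Draw as many negative (drug_1, drug_2, context) triples as positives.

    Assumption: a "collision" is a negative triple that matches a positive
    triple in either drug order. Such candidates are rejected.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    # Candidate drugs and contexts are taken from the positive triples.
    drugs = sorted({drug for left, right, _ in positives for drug in (left, right)})
    contexts = sorted({context for _, _, context in positives})

    negatives = set()
    attempts = 0
    while len(negatives) < len(positives):
        if attempts >= max_attempts:
            raise RuntimeError("could not draw enough collision-free negatives")
        attempts += 1
        # Sorting the pair keeps mirrored duplicates out of the negatives.
        left, right = sorted(rng.sample(drugs, 2))
        context = rng.choice(contexts)
        if (left, right, context) in positive_set:
            continue
        if (right, left, context) in positive_set:
            continue
        negatives.add((left, right, context))
    return sorted(negatives)


# Hypothetical identifiers, used only for illustration.
positives = [
    ("DB00001", "DB00002", "ddi_1"),
    ("DB00002", "DB00003", "ddi_2"),
    ("DB00001", "DB00004", "ddi_1"),
]
negatives = sample_negatives(positives)
```

Rejection sampling is simple and works well when positives are sparse relative to all possible triples. The ``max_attempts`` guard prevents an endless loop when too few collision-free triples exist.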

TwoSides

This dataset is a subsample of TwoSides.
We only included the 100 most common side effects.
We used the cleaned dataset from the Therapeutic Data Commons.
Drug identifiers are represented by the DrugBank identifier.
Contexts are represented by the top 10 most common side effects in TwoSides.
Using RDKit 2021.09.03, we generated 256-dimensional Morgan fingerprints.
Labels represent the presence of a specific drug-drug interaction.
Context features are one-hot encoded binary vectors.
We generated as many negative samples as there are positive samples.
Negative samples do not contain collisions.
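
Restricting a dataset to its most frequent contexts and one-hot encoding them, as described above, might look roughly like the following. This is a sketch under stated assumptions and not the pipeline that produced the dataset: the helper names ``keep_most_common_contexts`` and ``one_hot_contexts``, the cutoff ``k``, and the example triples are hypothetical.

```python
from collections import Counter


def keep_most_common_contexts(triples, k):
    """Keep only triples whose context is among the k most frequent ones."""
    counts = Counter(context for _, _, context in triples)
    kept = {context for context, _ in counts.most_common(k)}
    return [triple for triple in triples if triple[2] in kept]


def one_hot_contexts(triples):
    """Map each context to a binary vector with a single 1 entry."""
    contexts = sorted({context for _, _, context in triples})
    return {
        context: [1 if i == position else 0 for i in range(len(contexts))]
        for position, context in enumerate(contexts)
    }


# Hypothetical (drug_1, drug_2, side_effect) triples, for illustration only.
triples = [
    ("DB00001", "DB00002", "nausea"),
    ("DB00001", "DB00003", "nausea"),
    ("DB00002", "DB00003", "nausea"),
    ("DB00001", "DB00004", "headache"),
    ("DB00002", "DB00004", "headache"),
    ("DB00003", "DB00004", "rash"),
]
filtered = keep_most_common_contexts(triples, k=2)
context_features = one_hot_contexts(filtered)
```

In this toy example the single ``rash`` triple is dropped, and the two remaining contexts each receive a two-dimensional one-hot vector.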
