Data Cleaning

ChemicalX comes with pre-processed benchmark datasets. This section describes how we obtained the raw data and what pre-processing steps we applied to each dataset in the framework.

DrugCombDB

*

DrugComb

*

Drugbank DDI

  • We used the cleaned dataset from the Therapeutic Data Commons.
  • Drug identifiers are represented by the DrugBank identifier.
  • Contexts are represented by drug-drug interaction identifiers from DrugBank.
  • Using RDKit 2021.09.03, we generated 256-dimensional Morgan fingerprints.
  • Labels represent the presence of a specific drug-drug interaction.
  • Context features are one-hot encoded binary vectors.
  • We generated an equal number of negative samples as positives.
  • Negative samples do not contain collisions.
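
The negative sampling described above can be illustrated with a minimal sketch. This is not the script used to build the dataset. It assumes that a "collision" means a sampled negative triple that coincides with a known positive triple, and that negatives are drawn uniformly from the drugs and contexts seen among the positives. The function name ``sample_negatives`` and the identifiers ``D1``, ``D2``, ``C1`` and so on are hypothetical placeholders.

```python
import random


def sample_negatives(positives, seed=42):
    """Draw one negative (drug, drug, context) triple per positive triple.

    Assumption: a collision is a negative that equals a known positive,
    so such candidates are rejected and resampled.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    drugs = sorted({drug for left, right, _ in positives for drug in (left, right)})
    contexts = sorted({context for _, _, context in positives})

    # Conservative guard so that the rejection loop below always terminates.
    max_candidates = len(drugs) * (len(drugs) - 1) * len(contexts)
    if len(positive_set) + len(positives) > max_candidates:
        raise ValueError("Not enough distinct triples to sample from.")

    negatives = set()
    while len(negatives) < len(positives):
        left, right = rng.sample(drugs, 2)  # two distinct drugs
        candidate = (left, right, rng.choice(contexts))
        if candidate not in positive_set:  # reject collisions with positives
            negatives.add(candidate)
    return sorted(negatives)


# Hypothetical placeholder identifiers, not real DrugBank entries.
positives = [("D1", "D2", "C1"), ("D2", "D3", "C2"), ("D1", "D3", "C1")]
negatives = sample_negatives(positives)
```

Storing the negatives in a set also prevents the same negative triple from being drawn twice, which keeps the positive and negative counts exactly equal.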

TwoSides

  • This dataset is a subsample of TwoSides.
  • We only included the 100 most common side effects.
  • We used the cleaned dataset from the Therapeutic Data Commons.
  • Drug identifiers are represented by the DrugBank identifier.
  • Contexts are represented by the top 10 most common side effects in TwoSides.
  • Using RDKit 2021.09.03, we generated 256-dimensional Morgan fingerprints.
  • Labels represent the presence of a specific drug-drug interaction.
  • Context features are one-hot encoded binary vectors.
  • We generated an equal number of negative samples as positives.
  • Negative samples do not contain collisions.
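
As an illustration of the one-hot context features mentioned above, the following sketch maps each distinct context to a binary vector with a single non-zero entry. It is only an assumption about how such an encoding could be produced, not the code used for the dataset. The function name ``one_hot_encode`` and the example context labels are hypothetical.

```python
def one_hot_encode(contexts):
    """Map each distinct context label to a one-hot binary vector."""
    vocabulary = sorted(set(contexts))  # fixed, reproducible ordering
    features = {}
    for position, context in enumerate(vocabulary):
        vector = [0] * len(vocabulary)
        vector[position] = 1  # single active entry marks this context
        features[context] = vector
    return features


# Hypothetical context labels chosen for illustration only.
features = one_hot_encode(["nausea", "headache", "nausea"])
```

With this encoding the vector length equals the number of distinct contexts, so a dataset restricted to a fixed set of side effects yields context features of that fixed dimension.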