Data processing performance needs to be improved #32

@sfluegel05

Description

Data preprocessing is currently less efficient than it should be. On my machine, it took several minutes to create a new dataset (for ChEBI version 231). A large part of that time went into creating the data splits.

Most of this cost is not a problem, since datasets can be reused between training runs. However, since dynamic data splits were added in PR #29, split creation is repeated at the start of every run.

Tasks

  • find out which steps of the preprocessing take up the most time
  • if possible, find more efficient solutions for steps that are currently inefficient
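
As a starting point for the first task, here is a minimal timing sketch. It is not taken from the codebase: `timed`, `slowest_steps`, and the two placeholder step functions are hypothetical names, and the real preprocessing calls would have to be wrapped in their place.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per named step.
timings = {}


@contextmanager
def timed(step):
    """Add the wall-clock time spent inside the block to timings[step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + time.perf_counter() - start


def slowest_steps(n=3):
    """Return up to n (step, seconds) pairs, slowest first."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical stand-ins for the real preprocessing steps (assumption).
def load_raw_data():
    time.sleep(0.01)


def create_splits():
    time.sleep(0.05)


if __name__ == "__main__":
    with timed("load_raw_data"):
        load_raw_data()
    with timed("create_splits"):
        create_splits()
    for step, seconds in slowest_steps():
        print(f"{step}: {seconds:.3f}s")
```

Coarse per-step timings like these should be enough to confirm whether split creation dominates. Python's built-in `cProfile` module could then be used on the slowest step for a per-function breakdown.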
