Data preprocessing is currently not as efficient as it could or should be. On my machine, creating a new dataset (for ChEBI version 231) took several minutes, and a large part of that time went into the creation of the data splits.
For the most part this is not a problem, since datasets can be reused between training runs. However, since the introduction of dynamic data splits in PR #29, the split creation is repeated at the start of each run.
Tasks
- find out which preprocessing steps take up the most time (see the profiling sketch below)
- where possible, find more efficient solutions for the steps that are currently inefficient
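A minimal profiling sketch for the first task, assuming a single preprocessing entry point (the `prepare_data()` function below is a hypothetical placeholder for the actual ChEBI parsing, dataset creation and split creation). Running it once under `cProfile` and sorting by cumulative time should show which steps dominate:

```python
import cProfile
import io
import pstats


def prepare_data():
    """Hypothetical placeholder for the real preprocessing pipeline
    (ChEBI parsing, feature encoding, split creation, ...)."""
    pass


profiler = cProfile.Profile()
profiler.enable()
prepare_data()
profiler.disable()

# Print the 20 calls with the highest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(20)
print(stream.getvalue())
```

Alternatively, running the preprocessing script via `python -m cProfile -o profile.out <script>` and inspecting `profile.out` (e.g. with snakeviz) gives the same information without code changes.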