Data migration / setting fixed splits #34

@sfluegel05

Problem

In issue #10 we introduced a new file structure for the ChEBI datasets. To help users transfer their data into this new structure, we need a migration script that automates this step.
For the most part, this should be relatively easy: taking files from one directory and copying them to another. The splits are a bit more difficult. If we want users to be able to keep using their current splits, this requires a new feature: setting data splits based on a list of IDs.
The latter would also have the advantage that we can circumvent the performance issue (#32) by saving the configuration of the current split as a list of IDs and reloading the split from this list. (This might look like a step back, but importantly, we do not save the splits as separate files. The standard method of creating splits via a seed stays intact.)
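
To make the idea concrete, here is a minimal sketch of saving a seed-based split as a list of IDs and reloading it later. It is not the project's implementation: the function names (`create_split`, `save_split`, `load_split`), the CSV column names (`id`, `split`), the default seed and the 80/10/10 fractions are all assumptions chosen for illustration.

```python
import csv
import random


def create_split(ids, seed=42, fractions=(0.8, 0.1, 0.1)):
    """Assign every ID to train/validation/test, reproducibly from a seed.

    Hypothetical helper: the real split logic may stratify or differ otherwise.
    """
    shuffled = sorted(ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    assignment = {}
    for position, chebi_id in enumerate(shuffled):
        if position < n_train:
            assignment[chebi_id] = "train"
        elif position < n_train + n_val:
            assignment[chebi_id] = "validation"
        else:
            assignment[chebi_id] = "test"
    return assignment


def save_split(assignment, path):
    """Write the ID -> split mapping to a CSV file (assumed columns: id, split)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "split"])
        for chebi_id, split in sorted(assignment.items()):
            writer.writerow([chebi_id, split])


def load_split(path):
    """Read the mapping back, so no new split has to be computed."""
    with open(path, newline="") as f:
        return {int(row["id"]): row["split"] for row in csv.DictReader(f)}
```

Only the assignment is stored, not the data itself, so the seed-based method remains the source of the split and the CSV merely records its result.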

Solution

The behaviour in the end should be:

  • When initialising a dataset, the user has the option to provide a path to a CSV file that contains a list of ChEBI IDs and their assignment to a split (train, validation or test). Then, instead of creating a new split, the provided split is used.
  • When initialising the dataset without providing such a file, the splits are created automatically (as before) and the resulting split is saved as a CSV file.
  • When running the migration script, the ChEBI data files are copied into the new structure. The existing split files are combined into one file, and a CSV file recording the split assignment is created in addition.
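
The migration step for the splits could look roughly like the sketch below. It assumes the old split files have already been read into lists of records that each carry an `id` key; the on-disk format of the old files is not specified here, so loading them is left out. The names `combine_splits` and `write_assignment` and the CSV columns `id` and `split` are hypothetical.

```python
import csv


def combine_splits(old_splits):
    """Merge per-split record lists into one list plus an ID -> split mapping.

    `old_splits` maps a split name ("train", "validation", "test") to the
    records previously stored in that split's own file (assumed structure).
    """
    combined = []
    assignment = {}
    for split_name, records in old_splits.items():
        for record in records:
            record_id = record["id"]
            if record_id in assignment:
                # An ID in two splits would make the assignment ambiguous.
                raise ValueError(f"ID {record_id} appears in more than one split")
            assignment[record_id] = split_name
            combined.append(record)
    return combined, assignment


def write_assignment(assignment, path):
    """Write the split assignment as a CSV file (assumed columns: id, split)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "split"])
        for record_id, split_name in assignment.items():
            writer.writerow([record_id, split_name])
```

The combined record list would then be written as the single data file of the new structure, and the CSV produced by `write_assignment` could be passed to the dataset on initialisation to reproduce the old split.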
