Refactoring #11
Arn-BAuA announced in Announcements
There will be some refactoring in the next weeks. Some aspects of the library are nice and modular, e.g. how the models can be swapped when experimenting. The way parameters are handled, so that they can be saved as JSON, is another aspect I think is rather nice.
What is not nice, however, is that everything is grouped around the benchmark.py script, which is very rigid, so more complex experiments can't be modelled easily.
We will change that, and we will overhaul the data sources as well. The idea is to create classes that encapsulate the functionality of the benchmark.py script in a more modular way, so that they can then be used in the experiment script. We will create a tracker.py class that handles the saving of data. With a polymorphic tracker family, we could even incorporate MLflow. Nice. MLflow.
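A polymorphic tracker family could look roughly like this. This is only a sketch of the idea: the class names `Tracker` and `JsonTracker`, the `hyperParameters()` accessor, and the file-writing details are my assumptions, not the final design.

```python
import json
from abc import ABC, abstractmethod


class Tracker(ABC):
    """Abstract base of the tracker family (hypothetical sketch)."""

    @abstractmethod
    def logDataSource(self, source):
        """Persist the hyperparameters of a data source."""


class JsonTracker(Tracker):
    """Writes hyperparameters to a JSON file, as benchmark.py used to do."""

    def __init__(self, path):
        self.path = path

    def logDataSource(self, source):
        # Assumption: every source exposes its settings as a dict.
        with open(self.path, "w") as f:
            json.dump(source.hyperParameters(), f, indent=2)
```

An MLflow-backed tracker would then just be another `Tracker` subclass that forwards the same dict to `mlflow.log_params()` instead of a file, which is the whole point of keeping the family polymorphic.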
There will be a second class, called the visualizer, that handles the visualisation tasks previously accessible through the Quickoverview.py script. Unit tests will be added along the way.
Maybe the trainers will be restructured as well. Don't know yet.
The way data is handled will be completely changed. At the moment, data is stored in data blocks, which hold the data but also some information on the generation and loading process. The idea was that a data block could be passed around and, upon logging in the benchmark script, the hyperparameters would always travel with the data.
We will change this. Instead, data will just be passed as an array of PyTorch tensors (for now... we can maybe add some of the PyTorch classes for handling datasets later...).
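Concretely, that could be as simple as the following; the shapes and the later upgrade path via `torch.utils.data.TensorDataset` are just illustrative assumptions:

```python
import torch

# A data source now just hands back a plain list of tensors
# instead of a data block carrying metadata.
features = torch.randn(100, 8)          # 100 samples, 8 features
labels = torch.randint(0, 2, (100,))    # binary labels
data = [features, labels]

# Later, the same list could be wrapped into one of PyTorch's
# dataset classes for batching, e.g.:
dataset = torch.utils.data.TensorDataset(*data)
```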
We will create a class called the DataManagementNode, which inherits from the Block. This class has two abstract subclasses: DataProcessor and DataGenerator. Everything that is in the modifyers section (spelled wrong... have to change this as well...) will now be a subclass of DataProcessor. Everything that has at least one input and at least one output is a data processor.
Everything that has only an output is a data generator. The set wrappers and the generators inherit here.
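The planned hierarchy could be sketched like this. Only the class names come from the text above; the `Block` stub, the method names, and the concrete example nodes are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class Block:
    """Stub for the library's existing Block base class."""


class DataManagementNode(Block, ABC):
    """Common ancestor of everything that produces or transforms data."""


class DataProcessor(DataManagementNode):
    """At least one input and at least one output."""

    @abstractmethod
    def process(self, data):
        """Transform incoming data and return the result."""


class DataGenerator(DataManagementNode):
    """Only outputs: set wrappers and synthetic generators inherit here."""

    @abstractmethod
    def generate(self):
        """Produce fresh data."""


# Hypothetical concrete nodes, to show where things would slot in:
class ConstantGenerator(DataGenerator):
    def generate(self):
        return [1.0, 2.0, 3.0]


class Doubler(DataProcessor):
    def process(self, data):
        return [2 * x for x in data]
```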
When the hyperparameters shall be saved, we now must call the tracker, e.g. Tracker.logDataSource(the source), since there is no benchmark script that does this anymore.
Amongst the data generators there will be one called the GeneratingFlow. This is a wrapper class that allows the user to pipe data processors and data generators together in a Unix-like fashion, to create new data sources, and to augment data or create drift on the fly.
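Such a piping wrapper could be sketched as follows. This is a minimal stand-in using plain Python lists and callables instead of tensors and the real node classes; the `|` operator as pipe syntax is my assumption, not a decided API.

```python
import math


class GeneratingFlow:
    """Pipes a generator through processors; acts as a generator itself."""

    def __init__(self, generator, processors=()):
        self.generator = generator
        self.processors = list(processors)

    def __or__(self, processor):
        # flow | processor appends another stage, Unix-pipe style.
        return GeneratingFlow(self.generator, self.processors + [processor])

    def generate(self):
        data = self.generator()
        for p in self.processors:
            data = p(data)
        return data


# Usage: a drifting, scaled sine source composed on the fly.
base = lambda: [math.sin(0.1 * i) for i in range(100)]
add_drift = lambda xs: [x + 0.01 * i for i, x in enumerate(xs)]
scale = lambda xs: [2 * x for x in xs]

flow = GeneratingFlow(base) | add_drift | scale
series = flow.generate()
```

Because the flow exposes the same `generate()` interface as a plain generator, a composed pipeline can be dropped in anywhere a data source is expected.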
Unit testing will be added along the way.
Later I also want to change the way trainers and models work, to make it possible to model more complex workflows in this library, like e.g. the one described here: https://arxiv.org/pdf/2009.02040.
In my imagination, we will have something similar to the pipelines for augmenting and generating data, but geared more towards model training. I have to think about that. Maybe we'll have to alter the classes described above again to arrive at a more general form. We will see. First steps first: get rid of benchmark.py and create the data generator pipeline.