Description
Describe the bug
Datasets that are not shipped by torch and thus have to be downloaded manually (e.g., cinic10, multimodal_base, pascal_voc, and tiny_imagenet) are currently downloaded as a whole (i.e., both the full training and testing sets) in the constructors of their respective DataSource instances.
While this design may work well in a testing environment where the server and all the clients are colocated on one machine, it may run into severe concurrency issues in other situations, such as the one described in Deploying a Plato Federated Learning Server in a Production Environment, which Plato also aims to support.
To see that, consider the two cases separately:
- For the former case, it is always the server that calls its configure() method first, and only after that call returns does the server spawn clients on the same machine. As a result, when the clients call their configure() independently, none of them needs to download the dataset, since it was already prepared in full during the initialization of the server.
- For the latter case, however, the server may not be colocated with the clients. If a remote machine (where there is no server) hosts multiple clients and these clients are initialized concurrently, then under the current design they may all (1) conclude that the desired data is not yet available in local storage, and thus (2) download and preprocess (at least "unzip") the data concurrently. If this is the case,
  - network bandwidth, CPU cycles, and memory will be wasted on redundant work,
  - program runtime will be prolonged for the same reason, and more importantly,
  - unexpected stalls or faults may be caused by concurrent creation of the dataset on the file system.
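The check-then-download pattern in the latter case can be illustrated with a small self-contained sketch. This is not Plato code: `simulated_download`, `init_client`, and `run_clients` are hypothetical names, threads stand in for client processes, and a short sleep stands in for the real download and unzip step.

```python
import os
import tempfile
import threading
import time


def simulated_download(path):
    """Hypothetical stand-in for a real download-and-unzip step."""
    time.sleep(0.2)  # pretend the transfer takes a while
    with open(path, "w") as f:
        f.write("dataset")


def init_client(data_path, downloads, counter_lock, barrier):
    """Mimics a client that checks for the data, then fetches it if missing."""
    barrier.wait()  # release all clients at (nearly) the same moment
    if not os.path.exists(data_path):      # (1) data looks missing
        with counter_lock:                 # protects only the counter
            downloads[0] += 1
        simulated_download(data_path)      # (2) redundant concurrent work


def run_clients(num_clients):
    """Starts num_clients concurrently and returns how many of them downloaded."""
    downloads = [0]
    counter_lock = threading.Lock()
    barrier = threading.Barrier(num_clients)
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "dataset.bin")
        threads = [
            threading.Thread(
                target=init_client,
                args=(data_path, downloads, counter_lock, barrier),
            )
            for _ in range(num_clients)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return downloads[0]


if __name__ == "__main__":
    print("clients that downloaded:", run_clients(4))
```

Because every client passes the existence check before any of them finishes writing, more than one of them typically performs the download, which is the redundant work and concurrent file creation described above.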
To Reproduce
This bug should conceptually make sense. We may provide the steps for reproducing it later, if necessary.
Additional context
We spotted this bug while developing a new feature, FEMNIST. Since the solution looks like a non-trivial design problem, we would prefer to seek the authors' help before working out any premature solution.
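As a starting point for discussion only, one commonly used direction is to let exactly one client on a machine perform the download while the others wait for a completion marker. The sketch below assumes a shared local data directory. `prepare_once`, the `.download.lock` and `.download.done` file names, and `download_fn` are all hypothetical and are not part of Plato. It relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, and it deliberately omits crash recovery and timeouts.

```python
import os
import tempfile
import threading
import time


def prepare_once(data_dir, download_fn, poll_interval=0.05):
    """Runs download_fn(data_dir) at most once among concurrent callers.

    Returns True if this caller performed the download, False otherwise.
    Hypothetical helper: a failed download leaves the lock file behind,
    so waiting callers would block; real code would need recovery logic.
    """
    lock_path = os.path.join(data_dir, ".download.lock")
    done_path = os.path.join(data_dir, ".download.done")

    if os.path.exists(done_path):
        return False  # data was already prepared earlier

    try:
        # Exclusive creation: only one caller can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Someone else is downloading; wait for the completion marker.
        while not os.path.exists(done_path):
            time.sleep(poll_interval)
        return False

    try:
        download_fn(data_dir)
        with open(done_path, "w"):
            pass  # an empty marker file signals completion
    finally:
        os.close(fd)
    return True


def demo(num_clients=4):
    """Returns how many of num_clients concurrent callers actually downloaded."""
    results = []

    def fake_download(data_dir):
        time.sleep(0.1)  # stand-in for download and unzip
        with open(os.path.join(data_dir, "dataset.bin"), "w") as f:
            f.write("dataset")

    with tempfile.TemporaryDirectory() as tmp:
        threads = [
            threading.Thread(
                target=lambda: results.append(prepare_once(tmp, fake_download))
            )
            for _ in range(num_clients)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return sum(results)


if __name__ == "__main__":
    print("clients that downloaded:", demo())
```

Whether such a file-based guard suits Plato's data sources, or whether the download should instead be moved out of the constructors altogether, is exactly the design question raised here.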