
Fix Async #158

Merged
merged 3 commits into from Aug 1, 2022

Conversation

fanlai0990
Member

@fanlai0990 fanlai0990 commented Aug 1, 2022

Why are these changes needed?

In the async FedScale example, (i) training stalls after a while; (ii) there is an API mismatch in the test path.

Related issue number

Closes #148

Checks

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • CIFAR-10 (20 training rounds & 1 evaluation round)
    • Femnist (20 training rounds & 1 evaluation round)

@mosharaf
Member

mosharaf commented Aug 1, 2022

Thanks @fanlai0990.

Quick question: does it address the event mis-ordering issue? (It's fine if not, but we should open a separate issue for it then.)

@fanlai0990
Member Author

Not yet. I need more time to think about it (regarding overhead and fidelity). I'll probably push a new fix in early September.

The current async example makes some sense, but it relies on clairvoyant knowledge of client completion times and breaks our client arrival traces.

@ewenw
Contributor

ewenw commented Aug 1, 2022

Does the current async simulation produce reasonably valid results even though it's not entirely correct?

@fanlai0990
Member Author

To the best of my knowledge, it provides valid and correct results, barring any other unexpected bugs.

The only deficiency is that client arrivals within a single buffer_size do not follow the system trace; they use constant arrival intervals instead. Note that cross-buffer arrivals still follow the system trace, so it can still provide more realistic evaluations than other existing implementations. Please feel free to test it, and let us know if you find bugs.

We plan to implement a much more sophisticated version in the future, which should fix this deficiency and remove the need for clairvoyant client completion times. Understandably, this requires reordering events on the fly, among many other pieces. Please stay tuned.
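
To make the buffering behavior discussed above concrete, here is a minimal, illustrative Python sketch of buffered asynchronous aggregation. This is not FedScale's actual implementation: apart from the async_buffer and arrival_interval knobs mentioned in this thread, the class names, fields, and the simple averaging rule are assumptions for illustration only.

```python
# Illustrative sketch of buffered async aggregation (NOT FedScale's actual code).
# Clients finish at different simulated times; the server applies an update only
# after `async_buffer` client results have accumulated. Within one buffer, client
# arrivals are spaced by a constant `arrival_interval`, while the start of each
# new buffer follows the client arrival trace (the deficiency noted above).

from dataclasses import dataclass, field
from typing import List


@dataclass
class ClientUpdate:
    client_id: int
    weights: List[float]          # flattened model delta, for simplicity
    finish_time: float            # simulated completion time


@dataclass
class AsyncAggregator:
    async_buffer: int             # e.g., 20, as in the configs in this thread
    model: List[float] = field(default_factory=lambda: [0.0] * 4)
    _buffer: List[ClientUpdate] = field(default_factory=list)

    def on_client_complete(self, update: ClientUpdate) -> None:
        """Collect one finished client; aggregate once the buffer is full."""
        self._buffer.append(update)
        if len(self._buffer) >= self.async_buffer:
            self._aggregate()

    def _aggregate(self) -> None:
        """Plain average of buffered deltas; real systems may weight by staleness."""
        n = len(self._buffer)
        for i in range(len(self.model)):
            self.model[i] += sum(u.weights[i] for u in self._buffer) / n
        self._buffer.clear()


if __name__ == "__main__":
    agg = AsyncAggregator(async_buffer=3)
    for cid in range(6):
        agg.on_client_complete(
            ClientUpdate(client_id=cid, weights=[0.1] * 4, finish_time=cid * 3.0)
        )
    print(agg.model)  # two aggregation steps have been applied
```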

@fanlai0990 fanlai0990 merged commit 51cc4a1 into SymbioticLab:master Aug 1, 2022
@ewenw
Contributor

ewenw commented Aug 11, 2022

Hi @fanlai0990, I'm still not seeing any model test results using the latest async code. Were you able to see test outputs when you ran it?

Here are the params I used:

    --data_set femnist
    --data_dir=$(CODE_FETCHER_DEST)/li-cross-device-fl/FedScale/benchmark/dataset/data/femnist
    --data_map_file $(CODE_FETCHER_DEST)/li-cross-device-fl/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv
    --log_path some/path
    --rounds 300
    --eval_interval 2
    --num_participants 800
    --async_buffer 20
    --arrival_interval 3

@fanlai0990
Member Author

fanlai0990 commented Aug 11, 2022

Hi @ewenw, thanks for trying it out! I pulled the latest code and tested it using your configuration (more details attached below). I can see the test results. Besides TensorBoard, you can try cat femnist_logging | grep "test_loss". My output is:
(08-10) 22:45:01 INFO [executor.py:374] After aggregation round 2, CumulTime 90.7529, eval_time 29.2102, test_loss 3.9207, test_accuracy 4.98%, test_5_accuracy 23.55%

However, I do notice some weird test accuracy numbers and am working on it. In the meantime, please let us know if you have any other concerns or features you'd like.

    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - num_participants: 800                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - eval_interval: 2                     # How many rounds to run a testing on the testing set
    - rounds: 300                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - local_steps: 20
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - use_cuda: False
    - decay_round: 50
    - overcommitment: 1.0
    - async_buffer: 20
    - arrival_interval: 3
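
Following up on the grep suggestion above, here is a small, hypothetical Python snippet for pulling the test metrics out of the log file. The line format is inferred only from the sample log line quoted earlier, and the femnist_logging file name comes from that suggestion, so adjust both for your setup.

```python
# Hypothetical helper for extracting test metrics from a FedScale log file.
# The line format is inferred from the sample log line quoted above, e.g.
# "... After aggregation round 2, ..., test_loss 3.9207, test_accuracy 4.98%, ..."
import re

PATTERN = re.compile(
    r"After aggregation round (\d+).*test_loss ([\d.]+), test_accuracy ([\d.]+)%"
)


def parse_test_metrics(log_path):
    """Return (round, test_loss, test_accuracy) tuples for every test line in the log."""
    results = []
    with open(log_path) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                rnd, loss, acc = match.groups()
                results.append((int(rnd), float(loss), float(acc)))
    return results


if __name__ == "__main__":
    # "femnist_logging" is the file name used in the grep suggestion above.
    for rnd, loss, acc in parse_test_metrics("femnist_logging"):
        print(f"round={rnd} test_loss={loss:.4f} test_accuracy={acc:.2f}%")
```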

@ewenw
Contributor

ewenw commented Aug 11, 2022

Thank you, @fanlai0990, for the prompt response! I can see the test results now after changing the number of executors to 1. With an increasing number of executors, I see fewer and fewer test data points.
I also observe some weirdness in accuracy and loss.
[Attached screenshot: accuracy and loss curves]

@fanlai0990
Member Author

Thanks for confirming it! I am fixing it and will get back to you soon.

Successfully merging this pull request may close these issues.

[Core] Async aggregator freezes during evaluation