
Fix Async #158

Merged
merged 3 commits into from Aug 1, 2022

Conversation

fanlai0990
Member

@fanlai0990 fanlai0990 commented Aug 1, 2022

Why are these changes needed?

In the async FedScale example, (i) training stalls after a while; (ii) there is an API mismatch in the test path.

Related issue number

Closes #148

Checks

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • CIFAR-10 (20 training rounds & 1 evaluation round)
    • Femnist (20 training rounds & 1 evaluation round)

@mosharaf
Member

mosharaf commented Aug 1, 2022

Thanks @fanlai0990.

Quick question: does it address the event mis-ordering issue? (It's fine if not, but we should open a separate issue for it then.)

@fanlai0990
Member Author

Not yet. I need more time to think about it (regarding overhead and fidelity). I'll probably push a new fix in early September.

The current async example makes some sense, but it relies on clairvoyant knowledge of client completion times and breaks our client arrival traces.

@ewenw
Contributor

ewenw commented Aug 1, 2022

Does the current async simulation produce reasonably valid results even though it's not entirely correct?

@fanlai0990
Member Author

To the best of my knowledge, it provides valid and correct results, barring any other unexpected bugs.

The only deficiency is that client arrivals within a single buffer_size do not follow the system trace; they use constant arrival intervals instead. Note that cross-buffer arrivals still follow the system trace, so it can still provide more realistic evaluations than other existing implementations. Please feel free to test it, and let us know if you find bugs.

We plan to implement a much more sophisticated version in the future, which should fix this deficiency and remove the need for clairvoyant client completion times. Understandably, this requires reordering events on the fly, among many other pieces. Please stay tuned.
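
To make the buffering behavior discussed above concrete, here is a minimal, illustrative Python sketch of buffered asynchronous aggregation. This is not FedScale's actual implementation: apart from the async_buffer and arrival_interval knobs mentioned in this thread, the class names, fields, and the simple averaging rule are assumptions for illustration only.

```python
# Illustrative sketch of buffered async aggregation (NOT FedScale's actual code).
# Clients finish at different simulated times; the server applies an update only
# after `async_buffer` client results have accumulated. Within one buffer, client
# arrivals are spaced by a constant `arrival_interval`, while the start of each
# new buffer follows the client arrival trace (the deficiency noted above).

from dataclasses import dataclass, field
from typing import List


@dataclass
class ClientUpdate:
    client_id: int
    weights: List[float]          # flattened model delta, for simplicity
    finish_time: float            # simulated completion time


@dataclass
class AsyncAggregator:
    async_buffer: int             # e.g., 20, as in the configs in this thread
    model: List[float] = field(default_factory=lambda: [0.0] * 4)
    _buffer: List[ClientUpdate] = field(default_factory=list)

    def on_client_complete(self, update: ClientUpdate) -> None:
        """Collect one finished client; aggregate once the buffer is full."""
        self._buffer.append(update)
        if len(self._buffer) >= self.async_buffer:
            self._aggregate()

    def _aggregate(self) -> None:
        """Plain average of buffered deltas; real systems may weight by staleness."""
        n = len(self._buffer)
        for i in range(len(self.model)):
            self.model[i] += sum(u.weights[i] for u in self._buffer) / n
        self._buffer.clear()


if __name__ == "__main__":
    agg = AsyncAggregator(async_buffer=3)
    for cid in range(6):
        agg.on_client_complete(
            ClientUpdate(client_id=cid, weights=[0.1] * 4, finish_time=cid * 3.0)
        )
    print(agg.model)  # two aggregation steps have been applied
```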

@fanlai0990 fanlai0990 merged commit 51cc4a1 into SymbioticLab:master Aug 1, 2022
@ewenw
Contributor

ewenw commented Aug 11, 2022

Hi @fanlai0990, I'm still not seeing any model test results using the latest async code. Were you able to see test outputs when you ran it?

Here are the params I used:

    --data_set femnist
    --data_dir=$(CODE_FETCHER_DEST)/li-cross-device-fl/FedScale/benchmark/dataset/data/femnist
    --data_map_file $(CODE_FETCHER_DEST)/li-cross-device-fl/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv
    --log_path some/path
    --rounds 300
    --eval_interval 2
    --num_participants 800
    --async_buffer 20
    --arrival_interval 3

@fanlai0990
Member Author

fanlai0990 commented Aug 11, 2022

Hi @ewenw, thanks for trying it out! I pulled the latest code and tested it using your configuration (more details attached below). I can see the test results. Besides TensorBoard, you can try cat femnist_logging | grep "test_loss". My output is:
(08-10) 22:45:01 INFO [executor.py:374] After aggregation round 2, CumulTime 90.7529, eval_time 29.2102, test_loss 3.9207, test_accuracy 4.98%, test_5_accuracy 23.55%

However, I do notice some weird test accuracy numbers and am working on it. In the meantime, please let us know if you have any other concerns or features you'd like.

    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - num_participants: 800                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - eval_interval: 2                     # How many rounds to run a testing on the testing set
    - rounds: 300                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - local_steps: 20
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - use_cuda: False
    - decay_round: 50
    - overcommitment: 1.0
    - async_buffer: 20
    - arrival_interval: 3
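
Following up on the grep suggestion above, here is a small, hypothetical Python snippet for pulling the test metrics out of the log file. The line format is inferred only from the sample log line quoted earlier, and the femnist_logging file name comes from that suggestion, so adjust both for your setup.

```python
# Hypothetical helper for extracting test metrics from a FedScale log file.
# The line format is inferred from the sample log line quoted above, e.g.
# "... After aggregation round 2, ..., test_loss 3.9207, test_accuracy 4.98%, ..."
import re

PATTERN = re.compile(
    r"After aggregation round (\d+).*test_loss ([\d.]+), test_accuracy ([\d.]+)%"
)


def parse_test_metrics(log_path):
    """Return (round, test_loss, test_accuracy) tuples for every test line in the log."""
    results = []
    with open(log_path) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                rnd, loss, acc = match.groups()
                results.append((int(rnd), float(loss), float(acc)))
    return results


if __name__ == "__main__":
    # "femnist_logging" is the file name used in the grep suggestion above.
    for rnd, loss, acc in parse_test_metrics("femnist_logging"):
        print(f"round={rnd} test_loss={loss:.4f} test_accuracy={acc:.2f}%")
```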

@ewenw
Contributor

ewenw commented Aug 11, 2022

Thank you, @fanlai0990, for the prompt response! I can see the test results now after changing the number of executors to 1. With an increasing number of executors, I see fewer and fewer test data points.
I also observe some weirdness in accuracy and loss.
[Attached screenshot: accuracy and loss curves]

@fanlai0990
Member Author

Thanks for confirming it! I am fixing it and will get back to you soon.

Successfully merging this pull request may close these issues.

[Core] Async aggregator freezes during evaluation