[Core] Async aggregator freezes during evaluation #148

Closed
ewenw opened this issue Jul 26, 2022 · 7 comments · Fixed by #155 or #158
Labels
bug Something isn't working

Comments

ewenw (Contributor) commented Jul 26, 2022

What happened + What you expected to happen

Hi fedscale team, I tried to run the async aggregator locally, but no test metrics are generated. The training seems to work fine, but the system freezes without any error at round 50.

Here are the last events from the aggregator:

(07-26) 11:38:52 INFO [async_aggregator.py:216] Wall clock: 2519 s, round: 49, Remaining participants: 5, Succeed participants: 10, Training loss: 4.433294297636379
(07-26) 11:38:55 INFO [async_aggregator.py:279] Client 2602 train on model 46 during 2274-2535.0060934242283
(07-26) 11:38:55 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
(07-26) 11:38:55 INFO [aggregator.py:812] Issue EVENT (update_model) to EXECUTOR (1)
(07-26) 11:38:56 INFO [async_aggregator.py:279] Client 2667 train on model 46 during 2319-2539.592434184604
(07-26) 11:38:56 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
(07-26) 11:38:56 INFO [async_aggregator.py:279] Client 2683 train on model 46 during 2328-2542.9932767611217
(07-26) 11:38:56 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
(07-26) 11:38:59 INFO [async_aggregator.py:279] Client 2569 train on model 45 during 2253-2605.669321587796
(07-26) 11:38:59 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (2)
(07-26) 11:38:59 INFO [aggregator.py:812] Issue EVENT (update_model) to EXECUTOR (2)
(07-26) 11:39:01 INFO [async_aggregator.py:279] Client 2769 train on model 47 during 2385-2680.206093424228
(07-26) 11:39:01 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (2)

Here's the tail of the executor log:

oving_loss': 4.510447650271058, 'trained_size': 100, 'success': True, 'utility': 752.3330107862802}
(07-26) 11:39:00 INFO [client.py:32] Start to train (CLIENT: 2569) ...
(07-26) 11:39:01 INFO [client.py:68] Training of (CLIENT: 2569) completes, {'clientId': 2569, 'moving_loss': 4.526119316819311, 'trained_size': 100, 'success': True, 'utility': 729.5144631284894}
(07-26) 11:39:01 INFO [client.py:32] Start to train (CLIENT: 2769) ...
(07-26) 11:39:02 INFO [client.py:68] Training of (CLIENT: 2769) completes, {'clientId': 2769, 'moving_loss': 4.5834765700435645, 'trained_size': 100, 'success': True, 'utility': 692.4210048353054}
(07-26) 11:39:04 INFO [client.py:68] Training of (CLIENT: 2667) completes, {'clientId': 2667, 'moving_loss': 4.169509475803674, 'trained_size': 100, 'success': True, 'utility': 556.3458848955673}

Versions / Dependencies

Latest

Reproduction script

Here's my config for the async_aggregator.py example:


# ip address of the parameter server (need 1 GPU process)
ps_ip: localhost

# ip address of each worker : number of available GPU processes on each GPU in this node
# Note that if we collocate the ps and a worker on the same GPU, we need to decrease the number of available processes on that GPU by 1
# E.g., if the master node has 4 available processes, 1 goes to the ps and the worker should be set to worker:3
worker_ips:
    - localhost:[2]

exp_path: $FEDSCALE_HOME/fedscale/core

# Entry function of executor and aggregator under $exp_path
executor_entry: ../../examples/async_fl/async_executor.py

aggregator_entry: ../../examples/async_fl/async_aggregator.py

auth:
    ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa

# Commands to run (in order) before we can launch FAR
setup_commands:
    - source $HOME/anaconda3/bin/activate fedscale

# ========== Additional job configuration ==========
# Default parameters are specified in config_parser.py, where a fuller description of each parameter can be found

job_conf:
    - job_name: asyncfl                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - num_participants: 800                      # Number of participants per round; we use K=100 in our paper, and a large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client; falls back to an IID setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: yogi                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 5                     # Number of rounds between evaluations on the test set
    - rounds: 500                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - yogi_eta: 3e-3
    - yogi_tau: 1e-8
    - local_steps: 5
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4
    - use_cuda: False
    - decay_round: 50
    - overcommitment: 1.0
    - async_buffer: 10
    - checkin_period: 50
    - arrival_interval: 3

Issue Severity

No response

ewenw added the bug label Jul 26, 2022
AmberLJC (Member) commented Jul 26, 2022

I will take a look at this later. In my previous experiments, the training loss was reported at the end of every aggregation. I will also look into the freeze.

ewenw (Contributor, Author) commented Jul 26, 2022

Thank you, @AmberLJC. I've attached a screenshot of my TB output.
[screenshot: Snipaste_2022-07-26_14-33-29]

AmberLJC (Member) commented
I found that the code freezes because the checkin_period (simulated client arrivals) makes check-ins too infrequent, so the aggregator hangs when no client checks in. I will modify the config; checkin_period=30 or num_participants=1500 should work.
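
As a rough sketch of how that suggestion maps onto the job_conf block in the reproduction config above (values taken straight from the comment, not verified here; either change alone may be sufficient):

    # in job_conf of the config above, either:
    - checkin_period: 30          # down from 50, so simulated clients check in more often
    # or:
    - num_participants: 1500      # up from 800, so more clients are available to check in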

AmberLJC (Member) commented
As for the training loss, I think it depends on the buffer size and other configs. Though the original paper suggests buffer_size=10, we need to carefully tune these configs for the model to converge.
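
For orientation only: "buffer_size" in this comment most likely corresponds to the async_buffer key in the config above (that mapping is an assumption), and the other knobs from the same config that plausibly affect convergence are roughly:

    - async_buffer: 10        # buffer size; the value the original paper suggests
    - eval_interval: 5        # how often test metrics are produced
    - learning_rate: 0.05
    - local_steps: 5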

AmberLJC mentioned this issue Jul 29, 2022
AmberLJC linked a pull request Jul 29, 2022 that will close this issue
ewenw (Contributor, Author) commented Jul 29, 2022

Thanks for the investigation. From looking at the code, load_global_model() in the async executor expects a model_id, but the Test() function and testing_handler() aren't customized in the async executor. I'm now seeing this error:
TypeError: load_global_model() missing 1 required positional argument: 'round'
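
To make the mismatch concrete, here is a minimal, self-contained sketch; none of these classes are the real FedScale code, and the names (BaseExecutor, AsyncExecutor, self.round) are stand-ins that only mirror the shape of the error: the inherited testing path calls load_global_model() with no arguments, while the async executor's version requires a round.

class BaseExecutor:
    # Stand-in for the base executor's testing path, which loads the
    # global model without specifying a round.
    def load_global_model(self):
        return "latest-model"

    def testing_handler(self):
        # Inherited testing path: no round is passed.
        return self.load_global_model()


class AsyncExecutor(BaseExecutor):
    # Stand-in for the async executor, which keeps per-round model
    # snapshots and therefore needs a round (the model_id) to load one.
    def __init__(self):
        self.round = 0  # assumed attribute tracking the current round

    def load_global_model(self, round):
        return f"model-at-round-{round}"

    def testing_handler(self):
        # The kind of override the comment above points at: pass the round
        # that the async load_global_model() expects.
        return self.load_global_model(self.round)


# Without the override, the inherited testing_handler() would call the async
# load_global_model() with no argument and raise:
# TypeError: load_global_model() missing 1 required positional argument: 'round'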

fanlai0990 reopened this Jul 29, 2022
fanlai0990 (Member) commented
@AmberLJC Can you please take a look when you are free? Thanks!

AmberLJC (Member) commented
Sure

fanlai0990 mentioned this issue Aug 1, 2022