[Core] Async aggregator freezes during evaluation #148

Closed
ewenw opened this issue Jul 26, 2022 · 7 comments · Fixed by #155 or #158
Labels
bug Something isn't working

Comments

ewenw (Contributor) commented Jul 26, 2022

What happened + What you expected to happen

Hi fedscale team, I tried to run the async aggregator locally, but no test metrics are generated. The training seems to work fine, but the system freezes without any error at round 50.

Here are the last events from the aggregator:

(07-26) 11:38:52 INFO [async_aggregator.py:216] Wall clock: 2519 s, round: 49, Remaining participants: 5, Succeed participants: 10, Training loss: 4.433294297636379
(07-26) 11:38:55 INFO [async_aggregator.py:279] Client 2602 train on model 46 during 2274-2535.0060934242283
(07-26) 11:38:55 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
(07-26) 11:38:55 INFO [aggregator.py:812] Issue EVENT (update_model) to EXECUTOR (1)
(07-26) 11:38:56 INFO [async_aggregator.py:279] Client 2667 train on model 46 during 2319-2539.592434184604
(07-26) 11:38:56 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
(07-26) 11:38:56 INFO [async_aggregator.py:279] Client 2683 train on model 46 during 2328-2542.9932767611217
(07-26) 11:38:56 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (1)
(07-26) 11:38:59 INFO [async_aggregator.py:279] Client 2569 train on model 45 during 2253-2605.669321587796
(07-26) 11:38:59 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (2)
(07-26) 11:38:59 INFO [aggregator.py:812] Issue EVENT (update_model) to EXECUTOR (2)
(07-26) 11:39:01 INFO [async_aggregator.py:279] Client 2769 train on model 47 during 2385-2680.206093424228
(07-26) 11:39:01 INFO [aggregator.py:812] Issue EVENT (client_train) to EXECUTOR (2)

Here's the tail of the executor log:

oving_loss': 4.510447650271058, 'trained_size': 100, 'success': True, 'utility': 752.3330107862802}
(07-26) 11:39:00 INFO [client.py:32] Start to train (CLIENT: 2569) ...
(07-26) 11:39:01 INFO [client.py:68] Training of (CLIENT: 2569) completes, {'clientId': 2569, 'moving_loss': 4.526119316819311, 'trained_size': 100, 'success': True, 'utility': 729.5144631284894}
(07-26) 11:39:01 INFO [client.py:32] Start to train (CLIENT: 2769) ...
(07-26) 11:39:02 INFO [client.py:68] Training of (CLIENT: 2769) completes, {'clientId': 2769, 'moving_loss': 4.5834765700435645, 'trained_size': 100, 'success': True, 'utility': 692.4210048353054}
(07-26) 11:39:04 INFO [client.py:68] Training of (CLIENT: 2667) completes, {'clientId': 2667, 'moving_loss': 4.169509475803674, 'trained_size': 100, 'success': True, 'utility': 556.3458848955673}

Versions / Dependencies

Latest

Reproduction script

Here's my config for the async_aggregator.py example:


# ip address of the parameter server (need 1 GPU process)
ps_ip: localhost

# ip address of each worker : number of available GPU processes on each GPU in this node
# Note that if we collocate the ps and a worker on the same GPU, we need to decrease the number of available processes on that GPU by 1
# E.g., if the master node has 4 available processes, 1 goes to the ps and the worker should be set to worker:3
worker_ips:
    - localhost:[2]

exp_path: $FEDSCALE_HOME/fedscale/core

# Entry function of executor and aggregator under $exp_path
executor_entry: ../../examples/async_fl/async_executor.py

aggregator_entry: ../../examples/async_fl/async_aggregator.py

auth:
    ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa

# Commands to run (in order) before we can launch FAR
setup_commands:
    - source $HOME/anaconda3/bin/activate fedscale

# ========== Additional job configuration ==========
# Default parameters are specified in config_parser.py, where a fuller description of each parameter can be found

job_conf:
    - job_name: asyncfl                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - num_participants: 800                      # Number of participants per round; we use K=100 in our paper, and a large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client; falls back to an IID setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: yogi                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 5                     # Number of rounds between evaluations on the test set
    - rounds: 500                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - yogi_eta: 3e-3
    - yogi_tau: 1e-8
    - local_steps: 5
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4
    - use_cuda: False
    - decay_round: 50
    - overcommitment: 1.0
    - async_buffer: 10
    - checkin_period: 50
    - arrival_interval: 3

Issue Severity

No response

ewenw added the bug label Jul 26, 2022
AmberLJC (Member) commented Jul 26, 2022

I will take a look at this later. In my previous experiments, the training loss was reported at the end of every aggregation. I will also look into the freeze.

ewenw (Contributor, Author) commented Jul 26, 2022

Thank you, @AmberLJC. I've attached a screenshot of my TB output.
[screenshot: Snipaste_2022-07-26_14-33-29]

AmberLJC (Member) commented
I found that the code freezes because the checkin_period (simulated client arrivals) makes check-ins too infrequent, so the aggregator hangs when no client checks in. I will modify the config; checkin_period=30 or num_participants=1500 should work.
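
As a rough sketch of how that suggestion maps onto the job_conf block in the reproduction config above (values taken straight from the comment, not verified here; either change alone may be sufficient):

    # in job_conf of the config above, either:
    - checkin_period: 30          # down from 50, so simulated clients check in more often
    # or:
    - num_participants: 1500      # up from 800, so more clients are available to check in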

AmberLJC (Member) commented
As for the training loss, I think it depends on the buffer size and other configs. Though the original paper suggests buffer_size=10, we need to carefully tune these configs for the model to converge.
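
For orientation only: "buffer_size" in this comment most likely corresponds to the async_buffer key in the config above (that mapping is an assumption), and the other knobs from the same config that plausibly affect convergence are roughly:

    - async_buffer: 10        # buffer size; the value the original paper suggests
    - eval_interval: 5        # how often test metrics are produced
    - learning_rate: 0.05
    - local_steps: 5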

AmberLJC mentioned this issue Jul 29, 2022
AmberLJC linked a pull request Jul 29, 2022 that will close this issue
ewenw (Contributor, Author) commented Jul 29, 2022

Thanks for the investigation. From looking at the code, load_global_model() in the async executor expects a model_id, but the Test() function and testing_handler() aren't customized in the async executor. I'm now seeing this error:
TypeError: load_global_model() missing 1 required positional argument: 'round'
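
To make the mismatch concrete, here is a minimal, self-contained sketch; none of these classes are the real FedScale code, and the names (BaseExecutor, AsyncExecutor, self.round) are stand-ins that only mirror the shape of the error: the inherited testing path calls load_global_model() with no arguments, while the async executor's version requires a round.

class BaseExecutor:
    # Stand-in for the base executor's testing path, which loads the
    # global model without specifying a round.
    def load_global_model(self):
        return "latest-model"

    def testing_handler(self):
        # Inherited testing path: no round is passed.
        return self.load_global_model()


class AsyncExecutor(BaseExecutor):
    # Stand-in for the async executor, which keeps per-round model
    # snapshots and therefore needs a round (the model_id) to load one.
    def __init__(self):
        self.round = 0  # assumed attribute tracking the current round

    def load_global_model(self, round):
        return f"model-at-round-{round}"

    def testing_handler(self):
        # The kind of override the comment above points at: pass the round
        # that the async load_global_model() expects.
        return self.load_global_model(self.round)


# Without the override, the inherited testing_handler() would call the async
# load_global_model() with no argument and raise:
# TypeError: load_global_model() missing 1 required positional argument: 'round'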

fanlai0990 reopened this Jul 29, 2022
fanlai0990 (Member) commented
@AmberLJC Can you please take a look when you are free? Thanks!

AmberLJC (Member) commented
Sure

fanlai0990 mentioned this issue Aug 1, 2022