
Running error with mpi_torch_fedavg_mnist_lr_example #361

Closed
mh-lan opened this issue Jul 12, 2022 · 10 comments

@mh-lan

mh-lan commented Jul 12, 2022

When running the following script:

#!/usr/bin/env bash

WORKER_NUM=$1

PROCESS_NUM=`expr $WORKER_NUM + 1`
echo $PROCESS_NUM

hostname > mpi_host_file

$(which mpirun) -np $PROCESS_NUM \
-hostfile mpi_host_file \
python torch_fedavg_mnist_lr_one_line_example.py --cf config/fedml_config.yaml

I encounter this error:

(base) PS C:\Users\doubl\Desktop\script\fedml> bash run_step_by_step_example.sh 4                                       
5                                                                                                                       
run_step_by_step_example.sh: line 10: -np: command not found   

The code is executed following the instructions at https://doc.fedml.ai/simulation/examples/mpi_torch_fedavg_mnist_lr_example.html, and all the dependencies are installed.

@beiyuouo
Collaborator

You should use mpiexec instead of mpirun on Windows.

You can run it as follows on Windows:

mpiexec -np 5 -hostfile mpi_host_file python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml
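
As a quick sanity check (a minimal sketch, not part of the FedML example), you can also confirm from Python which launcher is actually on PATH. The original run_step_by_step_example.sh fails because $(which mpirun) expands to nothing on Windows, so bash ends up trying to execute -np as a command:

import shutil

# Print the resolved path of each MPI launcher, or None if it is not on PATH.
for launcher in ("mpirun", "mpiexec"):
    print(launcher, "->", shutil.which(launcher))

If mpirun prints None and mpiexec prints a path, use the mpiexec command above.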

@mh-lan
Author

mh-lan commented Jul 13, 2022

@beiyuouo I am using Microsoft MPI Startup Program [Version 10.1.12498.16] [PRE-RELEASE] and there is still an error:

Unknown option: -hostfile 

After checking the list of options, it seems the proper option is -machinefile:

-machinefile <file_name>                                                                                                 
Read the list of hosts from <file_name> on which to run the application. 
The format is one host per line, optionally followed by the number of cores.
Comments are from '#' to end of line, and empty lines are ignored. 
The -n * option uses the sum of cores in the file.

@mh-lan
Author

mh-lan commented Jul 13, 2022

But it still does not work:

(fedml) PS C:\Users\doubl\Desktop\script\fedml> mpiexec -np 5 -machinefile mpi_host_file python
torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml
ERROR: Failed RpcCliCreateContext error 1722 
Aborting: mpiexec on LABPC-MUHANG is unable to connect to the smpd service on LabPC-Muhang:8677
Other MPI error, error stack: 
connect failed - The RPC server is unavailable. (errno 1722)

@beiyuouo
Collaborator

What if you do not add the -machinefile parameter?

Just mpiexec -np 5 python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml

@mh-lan
Author

mh-lan commented Jul 14, 2022

What if you do not add the -machinefile parameter?

Just mpiexec -np 5 python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml

MPI works, but another error occurs...

Uncaught exception
Traceback (most recent call last):
  File "C:\Users\doubl\Desktop\script\fedml\torch_fedavg_mnist_lr_one_line_example.py", line 5, in <module>
    fedml.run_simulation(backend="MPI")
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\launch_simulation.py", line 41, in run_simulation
    simulator = SimulatorMPI(args, device, dataset, model)
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\simulation\simulator.py", line 64, in __init__
    from .mpi.fedavg.FedAvgAPI import FedML_FedAvg_distributed
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\simulation\mpi\fedavg\FedAvgAPI.py", line 8, in <module>
    from ....core.security.fedml_attacker import FedMLAttacker
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\core\security\fedml_attacker.py", line 1, in <module>
    from ...core.security.attack.attack_method_sample_a import AttackMethodA
ModuleNotFoundError: No module named 'fedml.core.security.attack.attack_method_sample_a'

@chaoyanghe
Member

@mh-lan Try the latest version 0.7.200, or you can install from source (https://doc.fedml.ai/starter/installation.html)
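
After upgrading, a quick way to confirm which fedml version is actually importable in the active environment (a minimal check, assuming Python 3.8+; not FedML-specific tooling):

from importlib.metadata import version

# Reports the installed fedml distribution version; the ModuleNotFoundError
# above points at an older or incomplete install in the active environment.
print("fedml", version("fedml"))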

@mh-lan
Author

mh-lan commented Jul 17, 2022

@mh-lan Try the latest version 0.7.200, or you can install from source (https://doc.fedml.ai/starter/installation.html)

It works with fedml-0.7.208. There is another small bug:

[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x000001CCD6275670>, 'process_id': 4, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x000001A3E3115670>, 'process_id': 3, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x000001FA80945670>, 'process_id': 0, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x00000129C79A5670>, 'process_id': 1, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x0000022CAEB55670>, 'process_id': 2, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu                                                                                                             
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu                                                                                                             
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu                                                                                                             
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu 

The device indexes in the log prefix ([FedML-Server(0) @device-id-0]) are the same for every process, even though the process_id values differ. I hope this can be fixed in the next version.
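
For what it is worth, the MPI ranks themselves are distinct (the args dumps above already show process_id values 0 through 4), so the identical @device-id-0 prefix looks like a logging/display issue. A minimal mpi4py check (assuming mpi4py is installed, as the MPI backend requires; the file name check_rank.py is just an example):

from mpi4py import MPI

# Each process launched by mpiexec should report a distinct rank.
comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")

Run it with mpiexec -np 5 python check_rank.py.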

@mh-lan mh-lan closed this as completed Jul 17, 2022
@lyydd

lyydd commented Aug 26, 2022

[FedML-Server(0) @device-id-0] [Fri, 26 Aug 2022 18:07:14] [INFO] [fedml_comm_manager.py:35:receive_message] receive_message. msg_type = 0, sender_id = 3, receiver_id = 3
[FedML-Server(0) @device-id-0] [Fri, 26 Aug 2022 18:07:14] [WARNING] [com_manager.py:133:_notify_connection_ready] Cannot handle connection ready
Why does this happen? I ran it with mpiexec -np 5 python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml on Windows.

@lyydd

lyydd commented Aug 29, 2022

@chaoyanghe could you help me?

@SuperXxts

SuperXxts commented Oct 26, 2022

I also ran into this problem; see "FedML/Python/app/fedcv/object_detection/runs/train/exp/log_0.txt":

client_indexes = [0, 1]
running
receive_message. msg_type = 0, sender_id = 0, receiver_id = 0
Cannot handle connection ready
receive_message. msg_type = 3, sender_id = 1, receiver_id = 0
add_model. index = 0
b_all_received = False
receive_message. msg_type = 3, sender_id = 2, receiver_id = 0
add_model. index = 1
b_all_received = True
len of self.model_dict[idx] = 2
set_model_params
aggregate time cost: 0
round_idx: 0
Saving model at round 0
client_indexes = [0, 1]
send_message_sync_model_to_client. receive_id = 1
send_message_sync_model_to_client. receive_id = 2

The fourth line shows "Cannot handle connection ready". Why does this happen, and will it have any impact?
