
Running error with mpi_torch_fedavg_mnist_lr_example #361

Closed
mh-lan opened this issue Jul 12, 2022 · 10 comments

@mh-lan

mh-lan commented Jul 12, 2022

When running the following script:

#!/usr/bin/env bash

WORKER_NUM=$1

PROCESS_NUM=`expr $WORKER_NUM + 1`
echo $PROCESS_NUM

hostname > mpi_host_file

$(which mpirun) -np $PROCESS_NUM \
-hostfile mpi_host_file \
python torch_fedavg_mnist_lr_one_line_example.py --cf config/fedml_config.yaml

I encounter this error:

(base) PS C:\Users\doubl\Desktop\script\fedml> bash run_step_by_step_example.sh 4                                       
5                                                                                                                       
run_step_by_step_example.sh: line 10: -np: command not found   

The code is executed following the instructions at https://doc.fedml.ai/simulation/examples/mpi_torch_fedavg_mnist_lr_example.html, and all the dependencies are installed.

@beiyuouo
Collaborator

You should use mpiexec instead of mpirun on Windows.

You can run it as follows on Windows:

mpiexec -np 5 -hostfile mpi_host_file python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml
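
As a quick sanity check (a minimal sketch, not part of the FedML example), you can also confirm from Python which launcher is actually on PATH. The original run_step_by_step_example.sh fails because $(which mpirun) expands to nothing on Windows, so bash ends up trying to execute -np as a command:

import shutil

# Print the resolved path of each MPI launcher, or None if it is not on PATH.
for launcher in ("mpirun", "mpiexec"):
    print(launcher, "->", shutil.which(launcher))

If mpirun prints None and mpiexec prints a path, use the mpiexec command above.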

@mh-lan
Author

mh-lan commented Jul 13, 2022

@beiyuouo I am using Microsoft MPI Startup Program [Version 10.1.12498.16] [PRE-RELEASE] and there is still an error:

Unknown option: -hostfile 

After checking the list of options, it seems the proper option is -machinefile:

-machinefile <file_name>                                                                                                 
Read the list of hosts from <file_name> on which to run the application. 
The format is one host per line, optionally followed by the number of cores.
Comments are from '#' to end of line, and empty lines are ignored. 
The -n * option uses the sum of cores in the file.

@mh-lan
Author

mh-lan commented Jul 13, 2022

But it still does not work:

(fedml) PS C:\Users\doubl\Desktop\script\fedml> mpiexec -np 5 -machinefile mpi_host_file python
torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml
ERROR: Failed RpcCliCreateContext error 1722 
Aborting: mpiexec on LABPC-MUHANG is unable to connect to the smpd service on LabPC-Muhang:8677
Other MPI error, error stack: 
connect failed - The RPC server is unavailable. (errno 1722)

@beiyuouo
Collaborator

What if you do not add the -machinefile parameter?

Just mpiexec -np 5 python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml

@mh-lan
Author

mh-lan commented Jul 14, 2022

What if you do not add the -machinefile parameter?

Just mpiexec -np 5 python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml

MPI works, but another error occurs...

Uncaught exception
Traceback (most recent call last):
  File "C:\Users\doubl\Desktop\script\fedml\torch_fedavg_mnist_lr_one_line_example.py", line 5, in <module>
    fedml.run_simulation(backend="MPI")
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\launch_simulation.py", line 41, in run_simulation
    simulator = SimulatorMPI(args, device, dataset, model)
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\simulation\simulator.py", line 64, in __init__
    from .mpi.fedavg.FedAvgAPI import FedML_FedAvg_distributed
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\simulation\mpi\fedavg\FedAvgAPI.py", line 8, in <module>
    from ....core.security.fedml_attacker import FedMLAttacker
  File "C:\Users\doubl\anaconda3\envs\fedml\lib\site-packages\fedml\core\security\fedml_attacker.py", line 1, in <module>
    from ...core.security.attack.attack_method_sample_a import AttackMethodA
ModuleNotFoundError: No module named 'fedml.core.security.attack.attack_method_sample_a'

@chaoyanghe
Member

@mh-lan Try the latest version 0.7.200, or you can install from source (https://doc.fedml.ai/starter/installation.html)
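
After upgrading, a quick way to confirm which fedml version is actually importable in the active environment (a minimal check, assuming Python 3.8+; not FedML-specific tooling):

from importlib.metadata import version

# Reports the installed fedml distribution version; the ModuleNotFoundError
# above points at an older or incomplete install in the active environment.
print("fedml", version("fedml"))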

@mh-lan
Author

mh-lan commented Jul 17, 2022

@mh-lan Try the latest version 0.7.200, or you can install from source (https://doc.fedml.ai/starter/installation.html)

It works with fedml-0.7.208. There is another small bug:

[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x000001CCD6275670>, 'process_id': 4, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x000001A3E3115670>, 'process_id': 3, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x000001FA80945670>, 'process_id': 0, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x00000129C79A5670>, 'process_id': 1, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config\\fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config\\fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'mnist', 'data_cache_dir': '../../../data/mnist', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'lr', 'federated_optimizer': 'FedAvg', 'client_id_list': '[]', 'client_num_in_total': 1000, 'client_num_per_round': 4, 'comm_round': 50, 'epochs': 1, 'batch_size': 10, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'frequency_of_the_test': 5, 'worker_num': 5, 'using_gpu': False, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x0000022CAEB55670>, 'process_id': 2, 'sys_perf_profiling': True}                                                                                                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu                                                                                                             
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu                                                                                                             
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu                                                                                                             
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:12:mapping_processes_to_gpu_device_from_yaml_file_mpi]  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!                                                        
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:13:mapping_processes_to_gpu_device_from_yaml_file_mpi]  ################## You do not indicate gpu_util_file, will use CPU training  #################                 
[FedML-Server(0) @device-id-0] [Sun, 17 Jul 2022 10:20:51] [INFO] [gpu_mapping_mpi.py:17:mapping_processes_to_gpu_device_from_yaml_file_mpi] cpu 

The device indexes in the log prefix ([FedML-Server(0) @device-id-0]) are the same for every process, even though the process_id values differ. I hope this can be fixed in the next version.
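
For what it is worth, the MPI ranks themselves are distinct (the args dumps above already show process_id values 0 through 4), so the identical @device-id-0 prefix looks like a logging/display issue. A minimal mpi4py check (assuming mpi4py is installed, as the MPI backend requires; the file name check_rank.py is just an example):

from mpi4py import MPI

# Each process launched by mpiexec should report a distinct rank.
comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")

Run it with mpiexec -np 5 python check_rank.py.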

@mh-lan mh-lan closed this as completed Jul 17, 2022
@lyydd

lyydd commented Aug 26, 2022

[FedML-Server(0) @device-id-0] [Fri, 26 Aug 2022 18:07:14] [INFO] [fedml_comm_manager.py:35:receive_message] receive_message. msg_type = 0, sender_id = 3, receiver_id = 3
[FedML-Server(0) @device-id-0] [Fri, 26 Aug 2022 18:07:14] [WARNING] [com_manager.py:133:_notify_connection_ready] Cannot handle connection ready
Why does this happen? I ran it with mpiexec -np 5 python torch_fedavg_mnist_lr_one_line_example.py --cf config\fedml_config.yaml on Windows.

@lyydd

lyydd commented Aug 29, 2022

@chaoyanghe could you help me?

@SuperXxts

SuperXxts commented Oct 26, 2022

I also ran into this problem; see "FedML/Python/app/fedcv/object_detection/runs/train/exp/log_0.txt":

client_indexes = [0, 1]
running
receive_message. msg_type = 0, sender_id = 0, receiver_id = 0
Cannot handle connection ready
receive_message. msg_type = 3, sender_id = 1, receiver_id = 0
add_model. index = 0
b_all_received = False
receive_message. msg_type = 3, sender_id = 2, receiver_id = 0
add_model. index = 1
b_all_received = True
len of self.model_dict[idx] = 2
set_model_params
aggregate time cost: 0
round_idx: 0
Saving model at round 0
client_indexes = [0, 1]
send_message_sync_model_to_client. receive_id = 1
send_message_sync_model_to_client. receive_id = 2

The fourth line shows "Cannot handle connection ready". Why does this happen, and will it have any impact?
