
Using your specified gpus' list by customizing the function "init_training_device"! #32

Closed
weiyikang opened this issue Oct 8, 2020 · 8 comments
Labels
good first issue Good for newcomers

Comments

@weiyikang

Running FedAvg with this configuration: 20 rounds, 10 epochs, 2 clients, the CIFAR-10 dataset, and ResNet-56. The program always crashes, with the following errors:

(screenshot of the error log)

I don't know the reason for the errors.

@chaoyanghe
Member

I guess it is because you ran your process in the foreground. Please run it in the background instead (e.g., with nohup).
Let me also double-check your configuration later on our own machine, and get back to you.

@weiyikang
Author

weiyikang commented Oct 8, 2020

  1. What is the difference between the parameters CLIENT_NUM and WORKER_NUM?
  2. How do I use specific GPUs, such as GPUs #6 and #7? The other GPUs are in use by other people.
    (screenshot)

@chaoyanghe
Member

  1. CLIENT_NUM describes how many users are involved in training, while WORKER_NUM is the number of parallel training processes. If the client number is very large (e.g., 1 million users), a common practice for scalability is to uniformly sample WORKER_NUM users (e.g., 10) to train in each round.
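The per-round sampling described above can be sketched as follows (a minimal illustration, not FedML's actual implementation; the function name and the round-index seeding are my own):

```python
import random

def sample_clients(client_num, worker_num, round_idx, seed=0):
    """Uniformly sample worker_num client indices out of client_num for one round."""
    if worker_num >= client_num:
        # Fewer clients than workers: every client participates.
        return list(range(client_num))
    # Seeding with the round index keeps the selection reproducible across runs.
    rng = random.Random(seed + round_idx)
    return rng.sample(range(client_num), worker_num)
```

With CLIENT_NUM = 1,000,000 and WORKER_NUM = 10, each round trains on 10 distinct client indices drawn uniformly at random.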

@weiyikang
Author

I see. The WORKERs are sampled from the CLIENTs, so WORKER_NUM <= CLIENT_NUM.

@chaoyanghe
Member

chaoyanghe commented Oct 9, 2020

  2. In init_training_device (in main_fedavg.py), you can see that GPU_NUM_PER_SERVER is used to assign a GPU device to each worker. You can customize the arrangement by editing this function.

@chaoyanghe
Member

chaoyanghe commented Oct 9, 2020

  2. You can always customize this function to match your own physical configuration. In your case:

import logging
import torch

gpu_num_per_machine = 2

def init_training_device(process_ID, fl_worker_num, gpu_num_per_machine):
    # Initialize the mapping from process ID to GPU ID: <process ID, GPU ID>.
    # Process 0 is the server; it uses GPU 0.
    if process_ID == 0:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        return device
    # Workers (process IDs 1..fl_worker_num) are assigned round-robin over
    # GPUs 6 and 7: the "+ 6" offset skips the GPUs occupied by other users.
    process_gpu_dict = dict()
    for client_index in range(fl_worker_num):
        gpu_index = client_index % gpu_num_per_machine + 6
        process_gpu_dict[client_index] = gpu_index

    logging.info(process_gpu_dict)
    device = torch.device("cuda:" + str(process_gpu_dict[process_ID - 1]) if torch.cuda.is_available() else "cpu")
    logging.info(device)
    return device
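As a quick sanity check of the mapping, here is an index-only version (a hypothetical helper, stripped of the torch and logging calls), assuming fl_worker_num = 2 and gpu_num_per_machine = 2:

```python
def worker_gpu_index(process_ID, gpu_num_per_machine=2, offset=6):
    # Process 0 is the server (GPU 0); workers with process IDs 1..N
    # cycle round-robin over GPUs offset..offset+gpu_num_per_machine-1.
    if process_ID == 0:
        return 0
    return (process_ID - 1) % gpu_num_per_machine + offset

# Server plus two workers:
print([worker_gpu_index(p) for p in range(3)])  # -> [0, 6, 7]
```

So the server process lands on cuda:0 while the two workers land on cuda:6 and cuda:7, leaving GPUs 1-5 untouched.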

@weiyikang
Author

weiyikang commented Oct 9, 2020

The problem has been solved by customizing the function "init_training_device":

(screenshot of the customized function)

The result is as follows:

(screenshot of the result)

@weiyikang weiyikang changed the title Crashed! A recurring problem. Using gpus by customize the function "init_training_device"! Oct 9, 2020
@weiyikang weiyikang changed the title Using gpus by customize the function "init_training_device"! Using specified gpus by customize the function "init_training_device"! Oct 9, 2020
@weiyikang weiyikang changed the title Using specified gpus by customize the function "init_training_device"! Using your specified gpus' list by customizing the function "init_training_device"! Oct 9, 2020
@chaoyanghe
Member

@weiyikang Thank you for sharing your code!

@chaoyanghe chaoyanghe added the good first issue Good for newcomers label Oct 9, 2020