
Using your specified gpus' list by customizing the function "init_training_device"! #32

Closed
weiyikang opened this issue Oct 8, 2020 · 8 comments
Labels
good first issue Good for newcomers

Comments

@weiyikang

Running FedAvg with this configuration: 20 rounds, 10 epochs, 2 clients, the CIFAR-10 dataset, and ResNet-56. The program always crashes, with the following errors:

(screenshot of the error log)

I don't know the reason for the errors.

@chaoyanghe
Member

I guess it is because you ran your process in the foreground. Please run it in the background instead (e.g., with nohup).
Let me also double-check your configuration later on our own machine, and get back to you.

@weiyikang
Author

weiyikang commented Oct 8, 2020

  1. What is the difference between the parameters CLIENT_NUM and WORKER_NUM?
  2. How do I use specific GPUs, such as GPUs #6 and #7? The other GPUs are in use by other people.
    (screenshot)

@chaoyanghe
Member

  1. CLIENT_NUM describes how many users are involved in training, while WORKER_NUM is the number of parallel training processes. If the client number is very large (e.g., 1 million users), a common practice for scalability is to uniformly sample WORKER_NUM users (e.g., 10) to train in each round.
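The per-round sampling described above can be sketched as follows (a minimal illustration, not FedML's actual implementation; the function name and the round-index seeding are my own):

```python
import random

def sample_clients(client_num, worker_num, round_idx, seed=0):
    """Uniformly sample worker_num client indices out of client_num for one round."""
    if worker_num >= client_num:
        # Fewer clients than workers: every client participates.
        return list(range(client_num))
    # Seeding with the round index keeps the selection reproducible across runs.
    rng = random.Random(seed + round_idx)
    return rng.sample(range(client_num), worker_num)
```

With CLIENT_NUM = 1,000,000 and WORKER_NUM = 10, each round trains on 10 distinct client indices drawn uniformly at random.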

@weiyikang
Author

I see. The WORKERs are sampled from the CLIENTs, so WORKER_NUM <= CLIENT_NUM.

@chaoyanghe
Member

chaoyanghe commented Oct 9, 2020

  2. In init_training_device (in main_fedavg.py), you can see that GPU_NUM_PER_SERVER is used to assign a GPU device to each worker. You can customize the arrangement by editing this function.

@chaoyanghe
Member

chaoyanghe commented Oct 9, 2020

  2. You can always customize this function to match your own physical configuration. In your case:

import logging
import torch

gpu_num_per_machine = 2

def init_training_device(process_ID, fl_worker_num, gpu_num_per_machine):
    # Initialize the mapping from process ID to GPU ID: <process ID, GPU ID>.
    # Process 0 is the server; it uses GPU 0.
    if process_ID == 0:
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        return device
    # Workers (process IDs 1..fl_worker_num) are assigned round-robin over
    # GPUs 6 and 7: the "+ 6" offset skips the GPUs occupied by other users.
    process_gpu_dict = dict()
    for client_index in range(fl_worker_num):
        gpu_index = client_index % gpu_num_per_machine + 6
        process_gpu_dict[client_index] = gpu_index

    logging.info(process_gpu_dict)
    device = torch.device("cuda:" + str(process_gpu_dict[process_ID - 1]) if torch.cuda.is_available() else "cpu")
    logging.info(device)
    return device
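As a quick sanity check of the mapping, here is an index-only version (a hypothetical helper, stripped of the torch and logging calls), assuming fl_worker_num = 2 and gpu_num_per_machine = 2:

```python
def worker_gpu_index(process_ID, gpu_num_per_machine=2, offset=6):
    # Process 0 is the server (GPU 0); workers with process IDs 1..N
    # cycle round-robin over GPUs offset..offset+gpu_num_per_machine-1.
    if process_ID == 0:
        return 0
    return (process_ID - 1) % gpu_num_per_machine + offset

# Server plus two workers:
print([worker_gpu_index(p) for p in range(3)])  # -> [0, 6, 7]
```

So the server process lands on cuda:0 while the two workers land on cuda:6 and cuda:7, leaving GPUs 1-5 untouched.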

@weiyikang
Author

weiyikang commented Oct 9, 2020

The problem has been solved by customizing the function "init_training_device":

(screenshot of the customized function)

The result is as follows:

(screenshot of the result)

@weiyikang weiyikang changed the title Crashed! A recurring problem. Using gpus by customize the function "init_training_device"! Oct 9, 2020
@weiyikang weiyikang changed the title Using gpus by customize the function "init_training_device"! Using specified gpus by customize the function "init_training_device"! Oct 9, 2020
@weiyikang weiyikang changed the title Using specified gpus by customize the function "init_training_device"! Using your specified gpus' list by customizing the function "init_training_device"! Oct 9, 2020
@chaoyanghe
Member

@weiyikang Thank you for sharing your code!

@chaoyanghe chaoyanghe added the good first issue Good for newcomers label Oct 9, 2020