FEA: Distributed Recommendation Implemention #1338

Ethan-TZ · 2022-07-06T04:40:22Z

No description provided.

hyp1231

I have just started reviewing configurator.py. Please feel free to push back on any of the review comments.

hyp1231 · 2022-07-07T21:36:08Z

recbole/config/configurator.py

+        gpu_list = self.final_config_dict['gpu_ids']
+        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(gpu_list)


I suggest not changing the name of a widely-used arg like gpu_id. In my opinion, it's OK to use gpu_id even for multiple GPU IDs.

When passing multiple GPU IDs, it may be better to use gpu_id: "1,2,3,4" (a string) rather than gpu_id: [1, 2, 3, 4] (a list). Users may pass this arg on the command line, e.g. python run_recbole.py --gpu_id=1,2,3,4, and a list is awkward to enter there.

Better to assign a default value for gpu_id; otherwise users have to specify an additional arg on every run.

Suggested change

-        gpu_list = self.final_config_dict['gpu_ids']
-        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(gpu_list)
+        gpu_list = self.final_config_dict['gpu_id']
+        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_list
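As a rough illustration of the string-based gpu_id suggested above, the sketch below normalizes a value that might arrive as an int, a comma-separated string, or a list before it is written to CUDA_VISIBLE_DEVICES. The helper name normalize_gpu_id and the sample config dict are hypothetical and not part of the PR. Note that ','.join(...) raises a TypeError on a list of ints, which is one reason to convert each entry to str first.

```python
import os


def normalize_gpu_id(gpu_id):
    """Return a CUDA_VISIBLE_DEVICES-style string such as "1,2,3,4".

    Hypothetical helper: accepts an int, a comma-separated string,
    or a list/tuple of ints or strings.
    """
    if isinstance(gpu_id, (list, tuple)):
        parts = [str(g) for g in gpu_id]
    else:
        parts = str(gpu_id).split(',')
    # Drop surrounding whitespace and empty entries.
    return ','.join(p.strip() for p in parts if p.strip())


# Assumed stand-in for self.final_config_dict in the PR.
config = {'gpu_id': '1,2,3,4'}
os.environ["CUDA_VISIBLE_DEVICES"] = normalize_gpu_id(config['gpu_id'])
```

With a helper like this, both --gpu_id=1,2,3,4 on the command line and a YAML list would end up as the same environment value.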

hyp1231 · 2022-07-07T21:47:28Z

recbole/config/configurator.py

-        self.final_config_dict['device'] = torch.device("cuda" if torch.cuda.is_available() and use_gpu else "cpu")
+        gpu_list = self.final_config_dict['gpu_ids']
+        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(gpu_list)
+        self.final_config_dict['SingleSpec'] = True


Most existing arg names seem to follow a snake_case style like single_spec. Please feel free to point out any specific concerns about the naming style.

Also, would it be clearer to move this line after if 'local_rank' not in self.final_config_dict: and before else:?

hyp1231 · 2022-07-07T21:50:09Z

recbole/config/configurator.py

@@ -16,7 +16,6 @@
 import os
 import sys
 import yaml
-import torch


It seems that we need to import torch regardless of whether local_rank exists. So what is the reason for removing this line here?

Ethan-TZ (Jul 8, 2022): For the environment variable to take effect, os.environ["CUDA_VISIBLE_DEVICES"] must be set before import torch.

hyp1231 (Jul 8, 2022): Cool! Thanks.
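The ordering discussed in this thread can be sketched as follows. This is a minimal illustration, not the PR's code: the function name set_visible_gpus_then_import is hypothetical, and the module name is a parameter only so the sketch runs on machines without torch installed. CUDA reads CUDA_VISIBLE_DEVICES when it is initialized, so setting the variable before the import is the conservative order.

```python
import importlib
import os


def set_visible_gpus_then_import(gpu_id, module_name="torch"):
    """Set CUDA_VISIBLE_DEVICES first, then import the framework.

    Hypothetical helper: the environment variable is written before
    the import so the framework sees the restricted device list.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return importlib.import_module(module_name)


# Intended usage (requires torch to be installed):
# torch = set_visible_gpus_then_import("0,1")
```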

hyp1231 · 2022-07-07T22:00:57Z

recbole/config/configurator.py

+            torch.distributed.init_process_group(backend='nccl', rank = self.final_config_dict['local_rank'], world_size = self.final_config_dict['world_size'], 
+            init_method='tcp://' + self.final_config_dict['ip'] + ':' + str(self.final_config_dict['port']))


Please take care of the coding style. [PEP8]

Suggested change

-            torch.distributed.init_process_group(backend='nccl', rank = self.final_config_dict['local_rank'], world_size = self.final_config_dict['world_size'],
-            init_method='tcp://' + self.final_config_dict['ip'] + ':' + str(self.final_config_dict['port']))
+            torch.distributed.init_process_group(
+                backend='nccl', rank=self.final_config_dict['local_rank'],
+                world_size=self.final_config_dict['world_size'],
+                init_method='tcp://' + self.final_config_dict['ip'] + ':' + str(self.final_config_dict['port']))
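One possible way to keep this call within PEP8 line lengths is to build the keyword arguments first and unpack them. The sketch below is only an illustration: build_dist_kwargs is a hypothetical helper, and the config keys (ip, port, world_size, local_rank) are taken from the diff above. The keyword names backend, init_method, world_size and rank are parameters of torch.distributed.init_process_group.

```python
def build_dist_kwargs(config):
    """Collect the arguments for torch.distributed.init_process_group.

    Hypothetical helper; `config` stands in for self.final_config_dict.
    """
    return dict(
        backend='nccl',
        init_method=f"tcp://{config['ip']}:{config['port']}",
        world_size=config['world_size'],
        rank=config['local_rank'],
    )


# Intended usage inside the configurator (requires torch):
# torch.distributed.init_process_group(**build_dist_kwargs(self.final_config_dict))
```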

hyp1231 · 2022-07-07T22:03:38Z

recbole/config/configurator.py

+        if 'local_rank' not in self.final_config_dict:
+            import torch
+            self.final_config_dict['local_rank'] = 0
+            self.final_config_dict['device'] = torch.device("cpu") if len(gpu_list) == 0 else torch.device("cuda")


Would it be better to also check torch.cuda.is_available() here?
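A minimal sketch of the combined check, assuming gpu_id is a comma-separated string as proposed earlier in this review. The helper pick_device_name is hypothetical; it takes the CUDA availability as a plain boolean so the logic can be exercised without torch.

```python
def pick_device_name(gpu_id, cuda_available):
    """Return "cuda" only if GPUs were requested AND CUDA is usable.

    Hypothetical helper: `gpu_id` is a string such as "0,1" (empty
    means no GPU requested); `cuda_available` is expected to come
    from torch.cuda.is_available().
    """
    requested = bool(str(gpu_id).strip())
    return "cuda" if requested and cuda_available else "cpu"


# Intended usage inside the configurator (requires torch):
# device = torch.device(pick_device_name(gpu_list, torch.cuda.is_available()))
```

This falls back to the CPU when GPU IDs are configured but no usable CUDA device is present, instead of failing later when tensors are moved to the device.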

Ethan-TZ added 2 commits July 6, 2022 11:53

FEA: Distributed Recommendation Implemention

4e432bd

FIX: change name

61aecf5

Ethan-TZ requested review from hyp1231, chenyushuo, 2017pxy, Wicknight, Sherry-XLL and leoleojie July 6, 2022 04:40

Ethan-TZ added 5 commits July 6, 2022 20:18

FIX: Add the default shuffle field

d6a563b

FIX: change the name of batch_size to be private

1abba5f

FIX: change the name of dataset to be private

9f77004

FIX: trivial changes

f83ce3a

FIX: default settings

1c70a4e

hyp1231 requested changes Jul 7, 2022

Ethan-TZ added 3 commits July 8, 2022 10:05

FIX: change code style

c678d18

FIX: overall yaml

ac1859a

FIX: add parameter docs and shuffle interface

ee14ba3

Ethan-TZ merged commit 063bfe7 into RUCAIBox:1.1.x Jul 8, 2022
