Marllib seems to never use GPU devices #203

Open
wmzfight opened this issue Nov 23, 2023 · 3 comments
Comments

@wmzfight

When I use Marllib to train a model on my custom environment, it always shows "(0.0/1.0 accelerator_type:V100S)". I have checked all the configs related to GPU usage, and both PyTorch and TensorFlow can detect all the GPUs on my machine.

However, when I train a model on the MPE environment using the example provided by Marllib, it shows the same info. Do you have any idea how to solve this problem?
[Screenshot attached (WeChat image 20231123145904)]
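
For reference, the framework-level check described above can be reproduced with a few standard calls. This is only a sketch (not taken from the issue) using the public PyTorch and TensorFlow APIs:

# Sketch: confirm the frameworks themselves can see the GPUs
# before suspecting MARLlib/Ray resource allocation.
import torch
import tensorflow as tf

print("PyTorch CUDA available:", torch.cuda.is_available())
print("PyTorch GPU count:", torch.cuda.device_count())
print("TensorFlow GPUs:", tf.config.list_physical_devices("GPU"))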

@fortyMiles (Collaborator)

Hi, glad to hear from you.

Please check your configuration to make sure the GPU is actually being utilized.
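
One way to do that check (a sketch that assumes MARLlib hands resource requests to Ray; it is not an official recipe from the maintainers) is to ask Ray, in a separate diagnostic session, which resources it registers, and then request the GPU explicitly in the fit call:

# Sketch: run in a separate Python session as a diagnostic.
import ray

ray.init(num_gpus=1)              # or rely on Ray's auto-detection
print(ray.cluster_resources())    # expect an entry like 'GPU': 1.0
ray.shutdown()

# Then request the GPU explicitly when training, e.g.
#   mappo.fit(env, model, ..., num_gpus=1)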

@Theohhhu Theohhhu added the help wanted Extra attention is needed label Dec 3, 2023
@libin-star commented Jan 2, 2024

I have another question: when I train the MAPPO algorithm with 2 GPUs, it takes more training time than with one GPU. Why is this? Do you have any solution for this problem? Thank you.
[Screenshot attached]

@satpreetsingh

I have the same issue. I'm using the following code to kick off a job; I see no increase in GPU utilization, and mostly only one CPU runs at 100% while the others stay idle.


from marllib import marl

# cooperative simple_spread scenario from MPE
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)

# MAPPO with the MPE hyperparameter preset
mappo = marl.algos.mappo(hyperparam_source="mpe")

# GRU core with a 128-256 encoder
model = marl.build_model(env, mappo, {"core_arch": "gru", "encode_layer": "128-256"})

# start training: request 1 GPU and 32 rollout workers
mappo.fit(
    env, model,
    stop={"timesteps_total": 100000},
    checkpoint_freq=100,
    share_policy="group",
    num_gpus=1,
    num_workers=32,
)

Example top output:

top - 17:09:04 up 6 days,  1:03,  5 users,  load average: 1.16, 1.05, 0.88
Tasks: 556 total,   2 running, 554 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.6 us,  0.5 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257572.3 total, 249544.2 free,   3948.5 used,   4079.7 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used. 251594.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                       
 186567 satsingh  20   0   78.8g 889732 189528 R 100.7   0.3   1:46.04 python                        
 186582 satsingh  20   0  918108  27648  11264 S   1.0   0.0   0:00.74 gcs_server                    
 186594 satsingh  20   0   77.9g  73984  11776 S   1.0   0.0   0:01.41 raylet                        
 186685 satsingh  20   0   76.1g  76116  35840 S   1.0   0.0   0:00.92 ray::IDLE                     
 186692 satsingh  20   0   76.1g  76604  36096 S   1.0   0.0   0:00.94 ray::IDLE                     
 186695 satsingh  20   0   76.1g  76628  36352 S   1.0   0.0   0:00.91 ray::IDLE                     
 186717 satsingh  20   0   76.1g  75864  35840 S   1.0   0.0   0:00.97 ray::IDLE                     
 186726 satsingh  20   0   76.1g  76100  35840 S   1.0   0.0   0:00.92 ray::IDLE                     
 186727 satsingh  20   0   76.1g  76080  35840 S   1.0   0.0   0:00.95 ray::IDLE                     
 186730 satsingh  20   0   76.1g  76760  36352 S   1.0   0.0   0:00.92 ray::IDLE                     
 186731 satsingh  20   0   76.1g  76384  36096 S   1.0   0.0   0:00.93 ray::IDLE     
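
A quick way to narrow this down (a diagnostic sketch, assuming a local Ray cluster was started by the job above; this is not part of the original report) is to attach to the running cluster, compare Ray's total and free resources, and watch nvidia-smi at the same time:

# Sketch: run while the training job above is still alive.
import ray

ray.init(address="auto")                    # attach to the running local cluster
print("total:", ray.cluster_resources())    # should include 'GPU': 1.0
print("free: ", ray.available_resources())  # GPU should be (partly) reserved by the trainer
ray.shutdown()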
