Expanded Hardware Utilization Information #800

TimZaman · 2016-05-30T13:02:23Z

I needed this to identify potential CPU (and disk) bottlenecks, for example when testing preprocessing for #777.
One issue this immediately reveals is the 1200% CPU usage I had due to some over-optimization in my BLAS library.
It works for Caffe and Torch, but the psutil manual tells me disk usage is not supported on Windows, so I'm checking for that.
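
For context on the 1200% figure: a per-process CPU percentage is conventionally summed over cores, so a process keeping twelve cores busy reads as roughly 1200%. Below is a minimal sketch of scaling such a reading to a 0-100 range. The helper name `normalize_cpu_percent` is hypothetical and not part of this PR, and the raw value is assumed to be a psutil-style reading where 100.0 means one fully busy core.

```python
import multiprocessing


def normalize_cpu_percent(raw_percent, n_cores=None):
    """Scale a summed-over-cores CPU percentage down to a 0-100 range.

    `raw_percent` is assumed to be a per-process reading where 100.0 means
    one fully busy core (so 1200.0 means twelve busy cores).
    """
    if n_cores is None:
        n_cores = multiprocessing.cpu_count()
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # Clamp because sampling jitter can push a reading slightly past the cap.
    return max(0.0, min(100.0, raw_percent / float(n_cores)))
```

Whether to show the raw or the normalized value is a display choice; the raw number is arguably the more useful one for spotting an over-threaded BLAS.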

Ready for review

## Caffe

## Torch

lukeyeager · 2016-05-31T17:15:33Z

TravisCI failed because you didn't add psutil to requirements.txt or here:
https://github.com/NVIDIA/DIGITS/blob/3863a42395/scripts/travis/install-apt.sh

Can you make it work with psutil v1.2.1? That's what's available on 14.04:
http://packages.ubuntu.com/trusty/python-psutil

> the psutil manual tells me disk usage is not supported on Windows so I'm checking for that one

It's probably a good idea to just not display the information if you can't retrieve it on a certain system for whatever reason.

TimZaman · 2016-06-01T13:17:12Z

> TravisCI failed because you didn't add psutil to requirements.txt or here:

Forgot to commit that one.

> It's probably a good idea to just not display the information if you can't retrieve it on a certain system for whatever reason.

Yep, that's what I'm already checking for on the front-end. On the back-end, on Windows, I do not even attempt to obtain it, so it will not be shown in the front-end.
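
The back-end behaviour described here can be sketched roughly as follows. This is an illustration, not the code in this PR: `collect_process_stats` and the dictionary keys are hypothetical, and `proc` is assumed to be a psutil 2.x+ style `Process` object exposing `cpu_percent()`, `memory_info()` and `io_counters()`.

```python
import platform


def collect_process_stats(proc, system=None):
    """Return a dict of whatever stats could be read from `proc`.

    Keys are simply left out when a metric is unavailable, so a front-end
    can check for a key's presence before rendering it.
    """
    if system is None:
        system = platform.system()

    stats = {}
    # Each entry: (output key, method name on proc, field extractor).
    readers = [
        ('cpu_pct', 'cpu_percent', lambda v: v),
        ('mem_rss', 'memory_info', lambda v: v.rss),
    ]
    # Mirror the thread: do not even attempt disk counters on Windows.
    if system != 'Windows':
        readers.append(('disk_read', 'io_counters', lambda v: v.read_bytes))
        readers.append(('disk_write', 'io_counters', lambda v: v.write_bytes))

    for key, method_name, extract in readers:
        method = getattr(proc, method_name, None)
        if method is None:
            continue  # method missing on this object / psutil version
        try:
            stats[key] = extract(method())
        except Exception:
            # Any failure (access denied, unsupported platform, process
            # gone) just means the metric is not reported.
            continue
    return stats
```

With this shape the template only needs a presence check per key, which matches the "don't display what you can't retrieve" suggestion.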

gheinrich · 2016-06-01T14:57:48Z

This is great. Do you think we could show the CPU utilization for dataset creation too?

TimZaman · 2016-06-01T15:01:02Z

Hmm, probably. But is that useful? I haven't found a need for that myself
so far.

gheinrich · 2016-06-01T15:21:49Z

It's useful if you want to make sure you are utilizing your CPUs efficiently when creating a large dataset. But don't go out of your way to support that if it's not trivial.

TimZaman · 2016-06-02T12:52:31Z

I implemented this hardware utilization display because I had a very distinct need for it. I do not have a need for it when creating a dataset. Secondly, implementing it is not trivial at all, as I do not see a way to accurately separate the hardware metrics that correspond exactly to the job; whereas currently the hardware utilization is reported only and exactly for that specific job, which is [to me and other digitizers] really neat and useful.
I think it's ready for review.

N.B. I would also love to log these kinds of metrics, but that's for another PR. For example, my favorite metric is the GPU temperature, because it gives insight into a kind of running average of the usage, i.e. <70 deg = inefficient settings/model.

gheinrich · 2016-06-13T09:25:35Z

This looks good to me thanks. There are conflicts that must be resolved before merging.

Question: the disk write info looks OK; however, the read statistics appear to be underestimated (I have a 4GB dataset and after several epochs the read counter shows only 96kB). Did I misunderstand what it's supposed to show? Or perhaps the process was reading the database from cache and it didn't count?

TimZaman · 2016-06-13T09:38:14Z

Hmm yes indeed it seems the disk statistics are unreliable. Possibly due to
it not getting info from subprocesses. Might as well rip out the disk info.
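
If the undercount does come from child processes, one way to test that would be to sum the I/O counters over the whole process tree. Here is a rough sketch under that assumption: `tree_read_bytes` is a hypothetical helper, and `proc` is assumed to offer psutil 2.x+ style `children(recursive=True)` and `io_counters()`. Reads served from the OS page cache may still not show up in these counters, so this might not explain the whole gap.

```python
def tree_read_bytes(proc):
    """Sum read_bytes over `proc` and all of its descendants.

    Processes whose counters cannot be read (already exited, access denied,
    unsupported platform) are skipped rather than aborting the whole sum.
    """
    total = 0
    try:
        procs = [proc] + list(proc.children(recursive=True))
    except Exception:
        # Could not enumerate children; fall back to the process itself.
        procs = [proc]
    for p in procs:
        try:
            total += p.io_counters().read_bytes
        except Exception:
            continue
    return total
```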

TimZaman · 2016-06-13T11:09:31Z

Okido ready when you are, fixed, squashed & rebased.

gheinrich · 2016-06-13T12:43:42Z

That still looks very good to me, let's see what @lukeyeager thinks :-)

lukeyeager · 2016-06-16T19:32:26Z

digits/templates/models/gpu_utilization.html

@@ -21,3 +20,18 @@
    {% endif %}
 </dl>
 {% endfor %}
+{% if data_cpu %}
+<h3>CPU ({% if 'pid' in data_cpu %}#{{data_cpu.pid}}{% endif %})</h3>


This is a little misleading. When I saw CPU (#10291) I thought it was talking about a CPU core or something. How about Process 10291 instead?

lukeyeager · 2016-06-16T19:35:50Z

Oops I caused a merge conflict with #825.

While you're rebasing, can you also:

- Add psutil to https://github.com/NVIDIA/DIGITS/blob/v4.0.0-rc.1/scripts/travis/install-apt.sh
- Set the psutil requirement to psutil>=1.2.1,<=3.4.2 to match new formatting

lukeyeager · 2016-06-16T19:41:55Z

I'm seeing some training jobs fail to finish with this change. Are you seeing the same? The Caffe task goes to 100% complete, but is stuck at Running and doesn't move to Done.

TimZaman · 2016-06-16T19:43:42Z

No, I haven't seen that. Will try to reproduce. It might be because the process variable p, which is now a class member, needs to be closed explicitly when it's done, since it doesn't automatically go out of scope? That's essentially the only thing that has changed.

TimZaman · 2016-06-16T20:29:24Z

I think I nailed it now @lukeyeager : replaced the shitty try block with a nice if is_running() statement.

- suggestions by luke above
- rebased
- squashed
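
The shape of that change, as a minimal sketch rather than the actual diff: `sample_while_running`, `emit` and `max_samples` are hypothetical names, and `proc` is assumed to be a psutil-style `Process` with `is_running()` and `cpu_percent()`.

```python
import time


def sample_while_running(proc, emit, interval=1.0, max_samples=None):
    """Call emit(cpu_percent) periodically until the process stops.

    Checking is_running() up front lets the loop end cleanly when the
    training process exits, instead of relying on an exception raised by
    a stats call on a dead process.
    """
    sent = 0
    while proc.is_running():
        if max_samples is not None and sent >= max_samples:
            break
        emit(proc.cpu_percent())
        sent += 1
        time.sleep(interval)
    return sent
```

A process can still exit between the check and the read, so real code would probably keep a narrow exception handler around the stats call as well.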

lukeyeager · 2016-07-25T23:19:05Z

I'm trying this now and am getting this error:

Traceback (most recent call last):
  File "/home/lyeager/digits/venv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/lyeager/digits/digits/model/tasks/train.py", line 194, in gpu_socketio_updater
    data_cpu['mem_uss'] = ps.memory_full_info().uss
AttributeError: 'Process' object has no attribute 'memory_full_info'

It kills the background socketio thread and now I'm not getting any GPU or CPU information.

$ python -c 'import psutil;print psutil.__file__;print psutil.__version__'
/usr/lib/python2.7/dist-packages/psutil/__init__.pyc
3.4.2

Have you tried this with older versions of psutil? If I'm reading the docs correctly, memory_full_info() is new with v4.

memory_full_info()
New in version 4.0.0.
https://pythonhosted.org/psutil/#psutil.Process.memory_full_info

TimZaman · 2016-07-25T23:22:39Z

Sorry my bad. Yeah too new. I'll just use the old functions.

TimZaman · 2016-08-09T15:19:42Z

I think I fixed it. psutil turns out to have quite a few catches, but I tested that this works with psutil 1.2.1, 3.4.2 and 4.3.0.
I now also spawn the gpu_socketio_updater if there is no GPU, so I renamed it to hw_socketio_updater.
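
A version fallback along these lines might look like the sketch below. It is not the code from this PR: `read_memory` is a hypothetical helper, and it probes for methods instead of comparing version numbers. The method names are the ones psutil has used over time, to the best of my knowledge: `memory_full_info()` from 4.0.0 (per the docs quoted earlier in the thread), `memory_info()` in 2.x and later, and `get_memory_info()` in 1.x.

```python
def read_memory(proc):
    """Return (field, bytes) using the best memory API `proc` offers.

    Tries the newest method first and falls back to older names; returns
    (None, None) when nothing usable is found.
    """
    candidates = [
        ('memory_full_info', 'uss'),  # assumed: psutil >= 4.0.0
        ('memory_info', 'rss'),       # assumed: psutil >= 2.0
        ('get_memory_info', 'rss'),   # assumed: psutil 1.x
    ]
    for method_name, field in candidates:
        method = getattr(proc, method_name, None)
        if method is None:
            continue  # this psutil version does not have the method
        try:
            return field, getattr(method(), field)
        except Exception:
            # The call failed or the field is missing on this platform;
            # try the next, older call.
            continue
    return None, None
```

Probing with `getattr` avoids the `AttributeError` from the traceback above killing the updater, at the cost of reporting a different field (USS or RSS) depending on what is installed, so the returned field name should be shown alongside the value.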

lukeyeager · 2016-08-09T17:10:13Z

The Travis build is failing, but I think it's related to https://www.traviscistatus.com/incidents/2p40l49r3yxd. I'm following up with them ...

Version fallback for psutil, tested for versions 1,3,4 and added some checks. Implemented showing hw info also for cpu-only systems

TimZaman · 2016-08-09T19:23:03Z

Mkay. I just squashed and rebased, hoping to trigger Travis again. Edit: yep, worked. I do advise running a real test like you did before, just to be sure.

lukeyeager · 2016-08-09T22:27:20Z

Looks good to me! Thanks for the nice addition and for supporting multiple versions of psutil!

TimZaman · 2016-08-09T22:31:23Z

Does Travis also check on Windows OS? If nope, maybe ask @IsaacYangSLA to check this PR's functionality. Some psutil features are OS specific, but it should work because I read the manual ;).

IsaacYangSLA · 2016-08-11T18:05:44Z

I just ran a simple training task on Windows 7. The CPU / Memory usage was shown and updated correctly. So I'd conclude it basically works on Windows.

lukeyeager added enhancement UI labels May 31, 2016

lukeyeager self-assigned this Jun 14, 2016

lukeyeager reviewed Jun 16, 2016

lukeyeager removed their assignment Aug 1, 2016

Expands hardware util info

65de73d

lukeyeager merged commit 0d5d6f2 into NVIDIA:master Aug 9, 2016

lukeyeager mentioned this pull request Aug 9, 2016

Add support for Ubuntu 16.04 #965

Merged

lukeyeager mentioned this pull request Aug 16, 2016

Better error-checking in psutil code #983

Merged

lukeyeager referenced this pull request Aug 22, 2016

Better error-checking in psutil code

7ea7242

lukeyeager mentioned this pull request Oct 25, 2016

incorrect cpu utilization value #1168

Closed

lukeyeager added the python-dependencies label Nov 11, 2016

SlipknotTN pushed a commit to cynnyx/DIGITS that referenced this pull request Mar 30, 2017

Merge pull request NVIDIA#800 from brainstorm-ai/expanded_hardware_ut…

1149a96

…il_info Expanded Hardware Utilization Information
