Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expanded Hardware Utilization Information #800

Merged
merged 1 commit into from Aug 9, 2016
Merged

Expanded Hardware Utilization Information #800

merged 1 commit into from Aug 9, 2016

Conversation

TimZaman
Copy link
Contributor

@TimZaman TimZaman commented May 30, 2016

Needed this for identifying potential CPU (and disk) bottlenecks (for example for testing preprocessing during for #777).
Issues that this will immediately reveal is for example the 1200% CPU usage I had due to some over-optimization in my BLAS library.
Works for Caffe and Torch, but the psutil manual tells me disk usage is not supported on Windows so I'm checking for that one.

  • Ready for review

Caffe

screenshot 2016-05-30 14 49 35

## Torch

screenshot 2016-05-30 14 48 46

@lukeyeager
Copy link
Member

TravisCI failed because you didn't add psutil to requirements.txt or here:
https://github.com/NVIDIA/DIGITS/blob/3863a42395/scripts/travis/install-apt.sh

Can you make it work with psutil v1.2.1? That's what's available on 14.04:
http://packages.ubuntu.com/trusty/python-psutil

the psutil manual tells me disk usage is not supported on Windows so I'm checking for that one

It's probably a good idea to just not display the information if you can't retrieve it on a certain system for whatever reason.

@TimZaman
Copy link
Contributor Author

TimZaman commented Jun 1, 2016

TravisCI failed because you didn't add psutil to requirements.txt or here:

Forgot to commit that one.

It's probably a good idea to just not display the information if you can't retrieve it on a certain system for whatever reason.

Yep that's what I'm already checking for on the front-end. On the back-end, on Windows, I do not even attempt to obtain it (and therefore will not be shown in the front-end).

@gheinrich
Copy link
Contributor

This is great. Do you think we could show the CPU utilization for dataset creation too?

@TimZaman
Copy link
Contributor Author

TimZaman commented Jun 1, 2016

Hmm, probably. But is that useful? I haven't found a need for that myself
so far.

On Wednesday, 1 June 2016, Greg Heinrich notifications@github.com wrote:

This is great. Do you think we could show the CPU utilization for dataset
creation too?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#800 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AHXSRITw2671E-UGmgHG_RlzEo4xVDbmks5qHZ31gaJpZM4Ipx6C
.

@gheinrich
Copy link
Contributor

It's useful if you want to make sure you are utilizing your CPUs efficiently when creating a large dataset. But don't go out of your way to support that if it's not trivial.

@TimZaman
Copy link
Contributor Author

TimZaman commented Jun 2, 2016

I have implemented this hardware utilization because I had a very distinct need for it. I do not have a need for this in creating the dataset at all. Secondly, implementing is not trivial at all as I do not see a way to accurately seperate the relevant hardware metrics exactly corresponding to the job; where-as currently the hardware utilization is reported only and exactly for that distinct, specific job, which is [to me and other digitizers] really neat and useful.
I think it's ready for review.

N.B. I would also love to log these kind of metrics but that's for another PR. For example, my favorite metric is the GPU temperature; because it gives insight into some kind of running-average of the usage. I.E. <70 deg = inefficient settings/model.

@gheinrich
Copy link
Contributor

This looks good to me thanks. There are conflicts that must be resolved before merging.

Question: the disk write info looks OK however the read statistics appear to be underestimated (I have a 4GB dataset and after several epoch the read counter shows only 96kB). Did I misunderstand what it's supposed to show? Or perhaps the process was reading the database from cache and it didn't count?

@TimZaman
Copy link
Contributor Author

Hmm yes indeed it seems the disk statistics are unreliable. Possibly due to
it not getting info from subprocesses. Might as well rip out the disk info.

On Monday, 13 June 2016, Greg Heinrich notifications@github.com wrote:

This looks good to me thanks. There are conflicts that must be resolved
before merging.

Question: the disk write info looks OK however the read statistics appear
to be underestimated (I have a 4GB dataset and after several epoch the read
counter shows only 96kB). Did I misunderstand what it's supposed to show?
Or perhaps the process was reading the database from cache and it didn't
count?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#800 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AHXSRNrpBmuMPBs3ASUJB_T-Vo98xlcXks5qLSIYgaJpZM4Ipx6C
.

@TimZaman
Copy link
Contributor Author

TimZaman commented Jun 13, 2016

Okido ready when you are, fixed, squashed & rebased.

@gheinrich
Copy link
Contributor

That still looks very good to me, let's see what @lukeyeager thinks :-)

@lukeyeager lukeyeager self-assigned this Jun 14, 2016
@@ -21,3 +20,18 @@
{% endif %}
</dl>
{% endfor %}
{% if data_cpu %}
<h3>CPU ({% if 'pid' in data_cpu %}#{{data_cpu.pid}}{% endif %})</h3>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little misleading. When I saw CPU (#10291) I thought it was talking about a CPU core or something. How about Process 20291 instead?

@lukeyeager
Copy link
Member

Oops I caused a merge conflict with #825.

While you're rebasing, can you also:

  1. Add psutil to https://github.com/NVIDIA/DIGITS/blob/v4.0.0-rc.1/scripts/travis/install-apt.sh
  2. Set the psutil requirement to psutil>=1.2.1,<=3.4.2 to match new formatting

@lukeyeager
Copy link
Member

I'm seeing some training jobs fail to finish with this change. Are you seeing the same? The Caffe task goes to 100% complete, but is stuck at Running and doesn't move to Done.

@TimZaman
Copy link
Contributor Author

No haven't seen that. Will try to reproduce. Might be because process variable p that's now a class member might need to be closed explicitly when it's done or something because it doesnt automatically go out of scope? That's essentially the only thing that has changed.

@TimZaman
Copy link
Contributor Author

TimZaman commented Jun 16, 2016

I think I nailed it now @lukeyeager : replaced the shitty try block with a nice if is_running() statement.

  • suggestions by luke above
  • rebased
  • squashed

@lukeyeager
Copy link
Member

I'm trying this now and am getting this error:

Traceback (most recent call last):
  File "/home/lyeager/digits/venv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/lyeager/digits/digits/model/tasks/train.py", line 194, in gpu_socketio_updater
    data_cpu['mem_uss'] = ps.memory_full_info().uss
AttributeError: 'Process' object has no attribute 'memory_full_info'

It kills the background socketio thread and now I'm not getting any GPU or CPU information.

$ python -c 'import psutil;print psutil.__file__;print psutil.__version__'
/usr/lib/python2.7/dist-packages/psutil/__init__.pyc
3.4.2

Have you tried this with older versions of psutil? If I'm reading the docs correctly, memory_full_info() is new with v4.

memory_full_info()
New in version 4.0.0.
https://pythonhosted.org/psutil/#psutil.Process.memory_full_info

@TimZaman
Copy link
Contributor Author

Sorry my bad. Yeah too new. I'll just use the old functions.

On Tuesday, 26 July 2016, Luke Yeager notifications@github.com wrote:

I'm trying this now and am getting this error:

Traceback (most recent call last):
File "/home/lyeager/digits/venv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
result = self._run(_self.args, *_self.kwargs)
File "/home/lyeager/digits/digits/model/tasks/train.py", line 194, in gpu_socketio_updater
data_cpu['mem_uss'] = ps.memory_full_info().uss
AttributeError: 'Process' object has no attribute 'memory_full_info'

It kills the background socketio thread and now I'm not getting any GPU or
CPU information.

$ python -c 'import psutil;print psutil.file;print psutil.version'
/usr/lib/python2.7/dist-packages/psutil/init.pyc
3.4.2

Have you tried this with older versions of psutil? If I'm reading the
docs correctly, memory_full_info() is new with v4.

memory_full_info()
New in version 4.0.0.
https://pythonhosted.org/psutil/#psutil.Process.memory_full_info


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#800 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AHXSRIbH0IzbKpfdGJetKOBrEYEgQpcYks5qZURvgaJpZM4Ipx6C
.

@lukeyeager lukeyeager removed their assignment Aug 1, 2016
@TimZaman
Copy link
Contributor Author

TimZaman commented Aug 9, 2016

I think I fixed it. psutil turns out to have quite some catches, but i tested this work psutil:
1.2.1, 3.4.2, 4.3.0.
I now also spawn the gpu_socketio_updater if there is no GPU, so renamed that to hw_socketio_updater.

@lukeyeager
Copy link
Member

The Travis build is failing, but I think it's related to https://www.traviscistatus.com/incidents/2p40l49r3yxd. I'm following up with them ...

Version fallback for psutil, tested for versions 1,3,4 and added some checks. Implemented showing hw info also for cpu-only systems
@TimZaman
Copy link
Contributor Author

TimZaman commented Aug 9, 2016

Mkay. I just squashed and rebased hoping to trigger Travis again. edit: yep, worked. I do advise to run a real test like you did before just to be sure.

@lukeyeager
Copy link
Member

Looks good to me! Thanks for the nice addition and for supporting multiple versions of psutil!

@lukeyeager lukeyeager merged commit 0d5d6f2 into NVIDIA:master Aug 9, 2016
@TimZaman
Copy link
Contributor Author

TimZaman commented Aug 9, 2016

Does Travis also check on Windows OS? If nope, maybe ask @IsaacYangSLA to check this PR's functionality. Some psutil features are OS specific, but it should work because I read the manual ;).

@IsaacYangSLA
Copy link
Contributor

I just ran a simple training task on Windows 7. The CPU / Memory usage was shown and updated correctly. So that basically concludes it works in Windows.

SlipknotTN pushed a commit to cynnyx/DIGITS that referenced this pull request Mar 30, 2017
…il_info

Expanded Hardware Utilization Information
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants