Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: a sampling profiler #1865

Closed
wants to merge 6 commits into from
Closed

RFC: a sampling profiler #1865

wants to merge 6 commits into from

Conversation

timholy
Copy link
Sponsor Member

@timholy timholy commented Jan 2, 2013

I've started working on a sampling profiler. The basic idea is very simple: as you run your julia code, periodically take a "backtrace snapshot" that captures the current line # and calling stack. By analyzing this information, one gets a statistical sense for the bottlenecks in code, since the same functions or lines show up repeatedly.

This is intended as a supplement to our existing "intrumenting profiler". A sampling profiler does not give the kind of detailed line-by-line information obtained from an instrumenting profiler, but in principle it does have a couple of its own advantages:

  1. It may not require any modification to the code; the compiler is free to optimize without worrying about instrumentation lines.
  2. It's not necessary to wrap code in "@Profile begin...end", and consequently one can even profile code all the way down into Base.
  3. If it takes ~30 microseconds to take a backtrace, and you take one every millisecond, there is little or no performance overhead. Of course, the more rarely you take snapshots, the worse your statistics (or the longer your run times need to be).

The catch, of course, is that it doesn't yet work. The biggest problem is that the backtrace seems to always come back the same. I'm worried that the backtrace is effectively being run in a separate thread and that I'm just backtracing the profiling code, rather than what's currently running in Julia; assuming that's the problem, I'm not quite sure how to solve it (libunwind-ptrace?). Avoiding this outcome is the entire reason I did this in C rather than trying an implementation in terms of Tasks, but it seems this did not suffice.

A smaller issue is that the backtrace parsing is yielding NULL function pointers, and again I'm not sure why. (I separated out the parsing and the storage of the pointers to try to improve efficiency during the run.)

The long-term idea is that much of this might live in a separate package, but for ease of development I'm coding it inside the Julia repository for now (this simplifies the Makefile issues). It's also possible that this would need to be so tightly integrated with Julia that it shouldn't be a separate package.

If you want to play with this, you'll need this gist: https://gist.github.com/4434156

Demo (uses the gist):

julia> require("ProfNew.jl"); using ProfNew

julia> @sprofile testlen()
length = 270
length = 375
length = 410
length = 445
length = 485
1493.063943655766

julia> p, n = parse_backtrace()
buf => [0x00007f4ac6748cbd, 0x00007f4ac6ae268a, 0x00007f4ac77b2e9a, 0x00007f4ac624e40c]
n = 4
  data[i] = 7f4ac6748cbd
  data[i] = 7f4ac6ae268a
  data[i] = 7f4ac77b2e9a
  data[i] = 7f4ac624e40c
({},[104, 104, 104, 104])

@ViralBShah
Copy link
Member

This is awesome. @JeffBezanson may have to say more about the backtrace and the approach here, but this would incredibly improve julia's usability.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 5, 2013

This will be great to have! A few comments:

  • as you said, SIGEV_THREAD creates a new thread, so backtrace() will return info for that thread, not the main one
  • SIGEV_SIGNAL is probably what you want, and also setup a signal handler for it
  • although, signal handlers cannot malloc memory or call any julia functions (or they risk segfaults), so this could be really tricky. perhaps if you start a helper thread (which can malloc memory, but not call julia functions), the hand-off could be done successfully?
  • alternatively, maybe libunwind-ptrace is the way to go
  • what platform are you on? backtrace() can be a bit finicky, but i think it should be possible to get working. I may take a look at libunwind again on my Mac to see if that can be made to work.
  • MAX_BT_SIZE can be 80000 on any platform

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 5, 2013

Thanks for the pointers, Jameson. Along those lines, let me just say I'm very excited about your work in #1831. If you beat me to it, maybe I'll even use some of that functionality in this problem. This is definitely a case where "zero-copy" C struct support could be a big help.

I eventually realized that I misunderstood how unw_getcontext worked, and that this, in addition to the thread issue, is the problem. Basically I don't think there's any way of making this particular approach work. But I'm still giving this a try, it will just have to be a different approach.

As I've read more, this is looking like a pretty tricky problem. I'm on Linux, and in the short term that's the only platform I can imagine pulling this off for (and even there it will stretch my abilities a fair bit). I now think the basic architecture needs to be something along the lines of:

  1. Main julia process starts up a new process
  2. New process ptraces the original julia process. Uses (1) libunwind-ptrace to start grabbing backtraces, or (2) injects code into the julia process causing it to grab its own backtraces. I strongly suspect (1) will be easier, but not certain.
  3. The julia process, however, is all the user will see, so it needs to be able to start & stop profiling, collect results, etc.

My main concern is that technically the julia process will become the child of the ptracing process, until the ptracing process issues a PTRACE_DETACH. In particular, I'm wondering how much trouble we'll get from the fact that any signal sent to the julia process will cause it to stop and wait for a PTRACE_CONT from its parent. Ideally I'd like the ptracing process to also be a julia process, so I can use Dicts and other goodness, but I'm wondering if the remote_call_fetch functions work using signals. If so, writing much of the parent tracing process in Julia will be a non-starter.

I can definitely write the ptracing stuff in C or C++ (C++ STL's map class would be a good substitute for Dicts), but in general less C code seems preferable. But this is getting down to some pretty low-level plumbing, so maybe it's the right way to do it.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 5, 2013

@vtjnash, I will also look at SIGEV_SIGNAL one more time and see if that can work. Maybe if all memory is pre-allocated and it's all in C, it can work. It would be easier to avoid ptrace altogether.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 7, 2013

OK, thanks to @vtjnash's suggestion to use SIGEV_SIGNAL, this now seems to be working on Linux. If OSX implements the same signal and timer functions, this should work there too. On Windows, I've simply tried to make sure it won't break compilation.

To test it, after doing a "make" (there's C code to compile here) do the following:

require("sprofile.jl"); using SProfile
@sprofile your_favorite_function_taking_at_least_tens_of_milliseconds_to_run
sprofile_flat()

Several notes:

  • The higher the count, the more times the given line was current at the moment of a backtrace, and hence the more of a bottleneck it is.
  • You won't see inlined function calls, those are simply absent from the backtrace.
  • If you've not run that function yet in your current julia session, you'll even see calls to functions in inference.jl used to compile the code.
  • A deliberate feature of the design is that multiple calls to @sprofile accumulate their results; sprofile_clear() is used to restart from the beginning.
  • sprofile_flat() is basically the most trivial usage of the backtrace data, though still quite useful. Ultimately we'll probably also want to be able to generate a tree, perhaps even writing the results as html and inspecting it with a browser so that you can "dig down" to lower levels.

The key question now is how much of this should go in Julia, and how much into a package? Basically there are three options: (i) 100% package, (ii) 100% Julia, and (iiii) merge the task.c changes into Julia but put everything else in a package. Option (i) will require a certain amount of duplication of code in task.c. There's a little bit of motivation to customize the backtracing code (although one could further extend task.c to do this), because it might be nice to truncate the backtraces at any point higher than the @sprofile call, getting rid, for example, of the lines in client.jl. In a simple test, I got a median backtrace length of 24, and ~10 of those were at client.jl or above and hence "useless"; being able to discard those wouldn't be a huge memory savings, but might be nice nonetheless (at the cost of slightly greater complexity).

@ViralBShah
Copy link
Member

I suggest keeping it in extras for now, as it is developed.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 7, 2013

At this point basically everything else is "analytics," the core is ready to use (on Linux, at least). If the "right" way to navigate the data is via a browser, that will probably be an "up for grabs" issue as I'm an HTML noob. I can presumably write a text-based tree viewer.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 7, 2013

@ViralBShah, I guess the bigger issue is the C code, not the Julia code. Question is, should the C code be compiled when we build Julia?

@ViralBShah
Copy link
Member

I think it is fine to build the C code when we build julia. Is there a reason not to do so?

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 7, 2013

Only if we don't want to ship a profiler with Julia by default, but instead want to farm it out to a package. And, I want someone on Mac, at least, to see whether they can compile Julia with this pull request.

@StefanKarpinski
Copy link
Sponsor Member

I lean in favor of making Task as first-class as possible so that someone else could do the same sort of thing as easily as possible. So that's a vote for (iii) but making the change to task.c in a way that they could be generally useful, not just specific to this one use case (awesome though this use case happens to be).

@aviks
Copy link
Member

aviks commented Jan 8, 2013

Unfortunately this does not compile yet on MacOSX 10.7.5

aviks ~/dev/julia $ make
    CC src/task.o
    CC src/profile.o
profile.c:12:8: error: unknown type name 'timer_t'; did you mean 'time_t'?
static timer_t timerprof;
       ^~~~~~~
       time_t
/usr/include/time.h:87:26: note: 'time_t' declared here
typedef __darwin_time_t         time_t;
                                ^
profile.c:13:26: warning: tentative definition of variable with internal linkage has incomplete non-array type 'struct itimerspec'
      [-Wtentative-definition-incomplete-type]
static struct itimerspec itsprof;
                         ^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
profile.c:31:20: error: incomplete definition of type 'struct itimerspec'
            itsprof.it_value.tv_nsec = nsecprof;
            ~~~~~~~^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
profile.c:32:13: warning: implicit declaration of function 'timer_settime' is invalid in C99 [-Wimplicit-function-declaration]
            timer_settime(timerprof, 0, &itsprof, NULL);
            ^
profile.c:66:9: warning: implicit declaration of function 'timer_create' is invalid in C99 [-Wimplicit-function-declaration]
    if (timer_create(CLOCK_REALTIME, &sigprof, &timerprof) == -1)
        ^
profile.c:66:22: error: use of undeclared identifier 'CLOCK_REALTIME'
    if (timer_create(CLOCK_REALTIME, &sigprof, &timerprof) == -1)
                     ^
profile.c:70:12: error: incomplete definition of type 'struct itimerspec'
    itsprof.it_value.tv_sec = 0;
    ~~~~~~~^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
profile.c:71:12: error: incomplete definition of type 'struct itimerspec'
    itsprof.it_interval.tv_sec = 0;       // make it fire once
    ~~~~~~~^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
profile.c:72:12: error: incomplete definition of type 'struct itimerspec'
    itsprof.it_interval.tv_nsec = 0;
    ~~~~~~~^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
profile.c:73:12: error: incomplete definition of type 'struct itimerspec'
    itsprof.it_value.tv_nsec = nsecprof;
    ~~~~~~~^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
profile.c:80:5: warning: implicit declaration of function 'timer_delete' is invalid in C99 [-Wimplicit-function-declaration]
    timer_delete(timerprof);
    ^
profile.c:13:26: error: tentative definition has type 'struct itimerspec' that is never completed
static struct itimerspec itsprof;
                         ^
profile.c:13:15: note: forward declaration of 'struct itimerspec'
static struct itimerspec itsprof;
              ^
4 warnings and 8 errors generated.
make[2]: *** [profile.o] Error 1
make[1]: *** [julia-release] Error 2
make: *** [release] Error 2
aviks ~/dev/julia $ clang -v
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin11.4.2
Thread model: posix

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 8, 2013

Thanks, Avik. Knowing that is a big help.

@ViralBShah
Copy link
Member

Should we use the timers from libuv?

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 8, 2013

I had the same thought. I'll first explore how hard it would be to put profile.c (where the timer is) into a package.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 8, 2013

timers from libuv are synchronous and will be of no help here. i'll see if I can find those definitions (for the missing symbols are Mac), so you can put this in task.c

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 9, 2013

@vtjnash, I put in a candidate implementation of BSD timers in the code at https://github.com/timholy/Profile.jl. Once my linking issue gets resolved, I can test them on Linux (which supports both) and then it might be worth a try on OSX.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 9, 2013

Also cc @aviks.

@aviks
Copy link
Member

aviks commented Jan 9, 2013

I can confirm that the c code now compiles on OSX 10.7. One minor change to the build file: -lrt is not available/required on osx afaik. Which leaves me with the linker error that Tim is working on.

gcc -Wall -I /Users/aviks/dev/julia/src -Wl,-rpath,/Users/aviks/dev/julia/usr/lib -L/Users/aviks/dev/julia/usr/lib -O2 -shared -fPIC profile.c  -ljulia-release -o libprofile.so
Undefined symbols for architecture x86_64:
  "_rec_backtrace", referenced from:
      _sprofile_bt in ccvM7VYJ.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status

Also, FWIW, I vote for option (iii) above

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 9, 2013

Thanks for letting me know about -lrt; that should now be fixed.

As you can probably guess from the existence of the package, I've gone with option (iii).

@aviks, you may be getting a slightly different variant of the linker problem. I had the definition of rec_backtrace only on my machine, because I wasn't pushing changes to core Julia until I could test everything. But I haven't seen anything break, so I just pushed.

...and, miraculously, something I did between a while ago and now changed something. I suspect it was going with support/dtypes.h for all the type definitions, but really I have no clue. But...closing and packaging!

@aviks
Copy link
Member

aviks commented Jan 9, 2013

Success, libprofile.so is created. I can now load Profile, and use the @sprofile macro on the mac.

I then saw a

julia> sprofile_flat()
in anonymous: bt not defined
 in anonymous at no file
 in sprint at io.jl:387
 in sprofile_flat at /Users/aviks/.julia/Profile/src/sprofile.jl:103
 in sprofile_flat at /Users/aviks/.julia/Profile/src/sprofile.jl:113

but changing bn to bt in line 99, which I presume you've already got...

julia> sprofile_flat()
 Count                 File                       Function   Line
    13            client.jl                         _start    318
     1            client.jl                   answer_color     20
    12            client.jl                eval_user_input     82
     1            client.jl                eval_user_input     86
    13            client.jl                       run_repl    155
    10              no file                      anonymous    193
     2              no file                      anonymous    193

Yay!! Awesome @timholy !

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 9, 2013

I found this too, already in the pkg.

See the README for a demo.

@timholy timholy closed this Jan 9, 2013
@ViralBShah
Copy link
Member

@timholy I think we should bring the sampling profiler into the julia repo, if you are ok with it. I think a profiler is an essential part of the language stack, and should be part of base.

Unless there is a good reason to keep it in a package, a pull request can get us moving towards having a profiler in base.

@timholy
Copy link
Sponsor Member Author

timholy commented Mar 2, 2013

I'd be fine with that, but since I think I got the opposite advice during its initial development I guess I'm wondering how universal this feeling is.

@StefanKarpinski
Copy link
Sponsor Member

That was from me and I've been convinced that this should be included.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants