Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scaling up to large source base ? #436

Open
moberhuber opened this issue Sep 29, 2015 · 19 comments
Open

Scaling up to large source base ? #436

moberhuber opened this issue Sep 29, 2015 · 19 comments

Comments

@moberhuber
Copy link

Hi rtags team, first of all thanks for this great project !

We've just started looking at it, and while doing so noticed that the rtags database tends to be large (60MB on 62 source files). Also in one instance, navigating a virtual method call was slow (10 sec on the mentioned database).

Has anyone tested how rtags scales up to a large source base like Google Chrome, Mozilla Firefox, or Clang itself ? Has anybody thought about ways to scale up, such as not storing local variables (and references) in the database since they can be re-computed easily on demand ?

Many thanks in advance for any input !

@aaptel
Copy link

aaptel commented Sep 29, 2015

On Tue, Sep 29, 2015 at 7:00 PM, Martin Oberhuber notifications@github.com
wrote:

Hi rtags team, first of all thanks for this great project !

We've just started looking at it, and while doing so noticed that the
rtags database tends to be large (60MB on 62 source files). Also in one
instance, navigating a virtual method call was slow (10 sec on the
mentioned database).

What the size of a ctags file for the same project?

@casch-at
Copy link
Collaborator

Hey,

I use rtags to parse the chromium code base which is the same code base chrome and Chrome OS are based on! Actually, I haven't any speed complains, the database is quit big though (~3gb / ~13k files), 70% bigger than the Eclipse database, but that's not really a concern now a days. Of course parsing takes a while. Jumping around is pretty fast, the delay is not really noticeable. I can take a closer look tomorrow on how long it takes to find virtual functions...

Update:
See #436 (comment)

Regards,
Chris

@casch-at
Copy link
Collaborator

Here are some more accurate key facts.

- Value/Description
Project Chromium -> content_shell
Files to parse ~12000
RTags Database size ~1.6 GB
Parse Time 3-5 hours
CPU Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
RAM 8 GB

Further, I made some time measurements for some of the rtags functions. I'll omitted the rtags- prefix of the functions in the table. I always made 15 runs and calculated the arithmetic mean value (second column).

Function Mean/s
find-symbol-at-point 0.269847666667
find-references-at-point 0.1913876 ?
find-all-references-at-point 0.4015618 ?
find-virtuals-at-point 0.1278864

Well, numbers don't lie, do they? In fact the numbers in bold and with a question mark are questionable. Because when I used one of those function to find references it did work well if the symbol was less than 200 times referenced. But somehow if it was referenced more often the hole thing just hung and I had to hit C-g.

To sum up, rtags is pretty fast but there are still some things which need to be fixed/improved, especially rtags has some problems finding references, if symbols often occur, e.g. 200 times or more like std :: string, as in my case.

Chris

@Andersbakken
Copy link
Owner

Hi Guys

@martin

Thanks for the praise :-)

We don't actually keep the clang translation unit around (among other
things because the serialized contents completely dwarf the rtags database
in size) so I don't actually see how we could reduce the size of database
without giving up considerable functionality. After all I think the best
feature of rtags is rtags-find-references and the only way we can do this
is by relatively quickly looking up precomputed references in our database.
Reparsing on demand generally takes a second or two for source file with
some includes.

I think the best bet is to devote some space to the database at this point.
We could compress the files but then we could no longer mmap around in them
for fast lookups so I think that would be problematic as well.

I don't quite understand why it would take as much as 10 seconds to find
virtual functions on a 62-file project. Is all the time spent in the daemon
or could the elisp be having some problems? rdm should print out the
command line that was executed when it runs something so you could maybe
capture it and time how long rc takes to complete it.

E.g.

time rc --current-file=/path/to/file.cpp -r /path/to/file.cpp:123 -k

It seems to me like something odd is going on. For a codebase that small it
should be instant.

Anders

@christian

Are you able to reproduce this issue with > 200 results? I can't seem to
reproduce it with an example with generated files.

It seems to work correctly as well by finding references to CXCursor in
rtags itself (which seemed to be about 250). I'd love to get to the bottom
of this.

Anders

On Wed, Sep 30, 2015 at 11:36 AM, Christian Schwarzgruber <
notifications@github.com> wrote:

Here are some more accurate key facts.
What Value/Description Project Chromium -> content_shell Files to parse
~12000 RTags Database size ~1.6 GB Parse Time 3-5 hours CPU Intel(R)
Core(TM) i5-2520M CPU @ 2.50GHz RAM 8 GB

Further, I made some time measurements for some of the rtags functions.
I'll omitted the rtags- prefix of the functions in the table. I always
made 15 runs and calculated the arithmetic mean value (second column).
Function Mean/s find-symbol-at-point 0.269847666667
find-references-at-point 0.1913876 ? find-all-references-at-point
0.4015618 ? find-virtuals-at-point 0.1278864

Well, numbers don't lie, do they? In fact the numbers in bold and with a
question mark are questionable. Because when I used one of those function
to find references it did work well if the symbol was less than 200 times
referenced. But somehow if it was referenced more often the hole thing just
hung and I had to hit C-g.

To sum up, rtags is pretty fast but there are still some things which need
to be fixed/improved, especially rtags has some problems finding
references, if symbols often occur, e.g. 200 times or more usually std ::
string is such a candidate.

Chris


Reply to this email directly or view it on GitHub
#436 (comment).

@casch-at
Copy link
Collaborator

casch-at commented Oct 1, 2015

Hey @Andersbakken

Are you able to reproduce this issue with > 200 results? I can't seem to reproduce it with an example with generated files. It seems to work correctly as well by finding references to CXCursor in rtags itself (which seemed to be about 250). I'd love to get to the bottom of this.

I'll look into this in more depth in the next couple days. I should have mentioned that I have used this little function from https://lists.gnu.org/archive/html/help-gnu-emacs/2008-06/msg00087.html to measure the time the bodies of the rtags-... functions take.

@moberhuber
Copy link
Author

Hi @Andersbakken ,

Our test project was rtags itself :)
Looking at QueryJob.h:94 “ virtual int execute() = 0;” and finding references:

time rc -r src/QueryJob.h:94:20
=> src/QueryJob.cpp:256:21: const int ret = execute();

takes in the range of 10-12 seconds on Ubuntu 14.04. Perhaps the problem here was looking at a header file rather than source file ? And perhaps that's the same std::string issue that @cslux sees?

Hi @cslux ,

Your collected data is very interesting ! Boils down to 1MB per file approx, same as what we saw.
When parsing takes 3-5 hours, I'm wondering how long a clean build actually takes (should be approximately as long or slower, since indexing shouldn't take longer than building) ? Also wondering how much main memory is consumed by rdm in this case. For 62 files (rtags itself), we see virtual:100MB with RSS only 10MB active, which I find really good (coming from Java world with Eclipse CDT).

Cheers!
Martin

@moberhuber
Copy link
Author

@aaptel , the ctags file is 114KB but not really comparable to rtags in terms of functionality on C++ code.

@casch-at
Copy link
Collaborator

casch-at commented Oct 1, 2015

Hey @moberhuber

When parsing takes 3-5 hours, I'm wondering how long a clean build actually takes (should be approximately as long or slower, since indexing shouldn't take longer than building)

Building and indexing takes roughly the same time. Maybe I will make a test when I get my new Laptop, so I can use the old one (current) for the tests, otherwise I would sophisticate the result when doing other work.

Also wondering how much main memory is consumed by rdm in this case. For 62 files (rtags itself), we see virtual:100MB with RSS only 10MB active, which I find really good (coming from Java world with Eclipse CDT).

Well, I did not make any serious profiling. But from what top tells me, rdm uses roughly 100-400MB active. Again, I will make a test for this too.

Regarding the database size, the eclipse pdom file is ~600 MB big but can't tell you right now how many files of the chromium code base it has parsed, as I have added a lot "Exclude filters" to reduce the project size, this was necessary because eclipse was completely overwhelmed by the size of the chromium code base :-). My guess would be ~16k files where left to parse.

And perhaps that's the same std::string issue that @cslux sees?

I will definitely investigate the problem in more detail. Will report back when I found something or not.

@casch-at
Copy link
Collaborator

casch-at commented Oct 1, 2015

Hey @Andersbakken

It seems to work correctly as well by finding references to CXCursor in rtags itself (which seemed to be about 250). I'd love to get to the bottom of this.

Nope, doesn't work! The thing just seems to hang in a loop when calling rtags-find-references-at-point, and similar functions. I aborted after one minute. profiler-start and profiler-report afterwards did not provide any significant information what was going on.

- command-execute                                                 634  61%
 - call-interactively                                             634  61%
  + byte-code                                                     593  57%
  - rtags-find-references-at-point                                 33   3%
   - let                                                           33   3%
    - save-current-buffer                                          32   3%
     - rtags-call-rc                                               32   3%
      - let*                                                       32   3%
       - save-excursion                                            32   3%
        - let                                                      32   3%
         - if                                                      32   3%
          - let                                                    32   3%
           - cond                                                  32   3%
              apply                                                32   3%
    - rtags-current-location                                        1   0%
     - let                                                          1   0%
      - and                                                         1   0%
       - format                                                     1   0%
          line-number-at-pos                                        1   0%
  + helm-M-x                                                        8   0%
+ ...                                                             332  32%
+ redisplay_internal (C function)                                  46   4%
+ timer-event-handler                                               8   0%
+ blink-cursor-end                                                  2   0%
+ rtags-post-command-hook                                           1   0%
+ rtags-diagnostics-process-filter                                  1   0%

If I execute the same rc command rtags-find-references-at-point does, from the cmd, gives me instant results. I made a screen shot of the System Monitor showing the rtags processes when calling rtags-find-references-at-point. And one after I have hit C-g. Also I have noticed that the system takes forever opening a file when rtags-find-references-at-point is running.

Further, calling rc from the cmd gives no result until I hit C-g in emacs. The System Monitor shows me that no process spins the CPU :-/, interesting.

Next step:
I'll attach to the rc process, maybe I'll find something.

rtags-find-references-at-point-system-monitor
rtags-find-references-at-point-system-monitor1

@casch-at
Copy link
Collaborator

casch-at commented Oct 1, 2015

@Andersbakken
Update:
When I hit C-g the *RTags* contains some output (134 lines), e.g. when I call rtags-find-references-at-point the last line in *RTags* buffer is, regardless how long I wait.

src/RTagsClang.h:50:25: template <> struct hash<CXCursor> : public unary_function<CXCursor, size_t> function: struct hash<CXCursor>

Next line would be.

src/RTagsClang.h:50:59: template <> struct hash<CXCursor> : public unary_function<CXCursor, size_t>     function: struct hash<CXCursor>

The equivalent cmd call would be, as in my case.

/bin/rc --current-file=/home/cschwarzgruber/Developing/third_party/rtags/src/DumpThread.cpp -o -r home/cschwarzgruber/Developing/third_party/rtags/src/DumpThread.cpp:28:19

Which gives me all matches (235 lines).

The buffer-size is 15362 chars, after I aborted the operation.

Calling rtags-find-references-at-point on gfx::Rect (class of the chromium code base), the buffer contained only 85 lines, but here it comes, every time I called rtags-find-references-at-point and I aborted the operation, the *RTags* buffer always contained around 15500 chars.

  • 15508
  • 15417
  • 15362
  • 15396
  • 15435
  • ...

The buffer size was always the same when calling rtags-find-references-at-point on the same symbol.

Interesting, isn't it? Do you maybe know what's going on here?

@Andersbakken
Copy link
Owner

@moberhuber

Hm. I get vastly better numbers than that:

(master)[src_rtags][abakken@tap:~/dev/rtags] time rc -r src/QueryJob.h:94:20
src/QueryJob.cpp:256:21: const int ret = execute();

real 0m0.375s
user 0m0.000s
sys 0m0.000s
(master)[src_rtags][abakken@tap:~/dev/rtags] time rc -r src/QueryJob.h:94:20
src/QueryJob.cpp:256:21: const int ret = execute();

real 0m0.014s
user 0m0.000s
sys 0m0.000s
(master)[src_rtags][abakken@tap:~/dev/rtags] time rc -r src/QueryJob.h:94:20
src/QueryJob.cpp:256:21: const int ret = execute();

real 0m0.012s
user 0m0.000s
sys 0m0.004s
(master)[src_rtags][abakken@tap:~/dev/rtags] time rc -r src/QueryJob.h:94:20
src/QueryJob.cpp:256:21: const int ret = execute();

real 0m0.015s
user 0m0.000s
sys 0m0.000s

RTags relies on quick disk io. Is your data directory (~/.rtags) possibly mounted on some network drive or something like that?

Anders

@casch-at
Copy link
Collaborator

casch-at commented Oct 2, 2015

Hey @Andersbakken ,

ok some updates regarding the issue that rc only gets part of the request (rtags-find-references*).
When I pass the flag --silent to rdm everything works as expected. But I was still unable to find to root cause.

Chris

@casch-at
Copy link
Collaborator

casch-at commented Oct 2, 2015

Hey @Andersbakken,

one more thing I have just noticed. When I start rdm from the cmd without passing the flag --silent to it, it works too.

To summarize, if rdm is started via emacs the flag --silent must be passed to it, otherwise calling one of the rtags-find-references* functions want return. Or rdm needs to be started via the cmd.

Here my .rdmrc file.

--isystem=/usr/lib/clang/3.5.0/include
--completion-cache-size=200
--job-count=4
--ignore-printf-fixits
# Silent flag needs to be set in case rdm gets started from within emacs otherwise emas will hang when calling `rtags-find-references*`
#--silent
# For debug purpose, can be removed if rp doesn't crash anymore
--suspend-rp-on-crash
# --log-file /home/cschwarzgruber/rdm.log
# --log-file-log-level error

# Local Variables:
# mode: conf-space
# End:

BTW, why does rdm print the requested result, when rc also prints it? Also, what's the purpose of the flag --log-file-log-level error/warning/debug/verbose-debug is this the log level for the indexer? I thought when I pass, e.g. --log-file-log-level error to rdm it will only print errors to the console.

Thanks,
Christian

@Andersbakken
Copy link
Owner

Hi Christian

Ah. That makes some sense. Since rdm is running inside of emacs it gets
into buffering issues when communicating with itself (so to speak).

Maybe --silent should be added to the parameters when rdm is run from
emacs.

By default all log messages end up in the log file but with
--log-file-log-level
you can tweak it. To tweak what is written to stderr you can use a
combination of -v and --silent. What qualifies as an "error" is a little
adhoc. I guess rdm is a little more noisy than it needs to be. I might
change that. The results kinda don't need to be printed out unless you add
a -v I think.

I just turned them into warnings which means that you have to pass -v to
rdm to see all the results. This likely means that you don't need --silent
anymore. I personally don't run rdm inside of emacs which is likely why I
haven't seen this.

Anders

On Fri, Oct 2, 2015 at 3:48 PM, Christian Schwarzgruber <
notifications@github.com> wrote:

Hey @Andersbakken https://github.com/Andersbakken,

one more thing I have just noticed. When I start rdm from the cmd
without passing the flag --silent to it, it works too.

To summarize, if rdm is started via emacs the flag --silent must be
passed to it, otherwise calling one of the rtags-find-references*
functions want return. Or rdm needs to be started via the cmd.

Here my .rdmrc file.

--isystem=/usr/lib/clang/3.5.0/include
--completion-cache-size=200
--job-count=4
--ignore-printf-fixits

Silent flag needs to be set in case rdm gets started from within emacs otherwise emas will hang when calling rtags-find-references*

#--silent

For debug purpose, can be removed if rp doesn't crash anymore

--suspend-rp-on-crash

--log-file /home/cschwarzgruber/rdm.log

--log-file-log-level error

Local Variables:

mode: conf-space

End:

BTW, why does rdm print the request result, when rc also prints it? Also,
what's the purpose of the flag --log-file-log-level
error/warning/debug/verbose-debug is this the log level for the indexer?
I thought when I pass, e.g. --log-file-log-level error to rdm it will
only print errors to the console.

Thanks,
Christian


Reply to this email directly or view it on GitHub
#436 (comment).

Andersbakken added a commit that referenced this issue Oct 4, 2015
@casch-at
Copy link
Collaborator

casch-at commented Oct 4, 2015

By default all log messages end up in the log file but with --log-file-log-level you can tweak it. To tweak what is written to stderr you can use a combination of -v and --silent. What qualifies as an "error" is a little adhoc.
Ok got it.

I guess rdm is a little more noisy than it needs to be.
Yep :-)
I might change that. The results kinda don't need to be printed out unless you add a -v I think.
Looks good now, thanks!

@Andersbakken
Copy link
Owner

Glad to hear.

Anders

On Sun, Oct 4, 2015 at 11:51 AM, Christian Schwarzgruber <
notifications@github.com> wrote:

By default all log messages end up in the log file but with
--log-file-log-level you can tweak it. To tweak what is written to stderr
you can use a combination of -v and --silent. What qualifies as an "error"
is a little adhoc.
Ok got it.

I guess rdm is a little more noisy than it needs to be.
Yep :-)
I might change that. The results kinda don't need to be printed out unless
you add a -v I think.
Looks good now, thanks!


Reply to this email directly or view it on GitHub
#436 (comment).

@moberhuber
Copy link
Author

@Andersbakken

You wrote:

RTags relies on quick disk io. Is your data directory (~/.rtags) possibly mounted
on some network drive or something like that?

Yes, our $HOME is on an NFS share ; still I had not expected a 10x performance degradation down to 10 sec wait time. Also note we see the 10 sec every time (so no disk cache of any sort helps here), and normal navigations are fast whereas only the specific one is slow.

Andersbakken added a commit that referenced this issue Oct 8, 2015
Keep track of how many files were opened by each filemapscope.
@Andersbakken
Copy link
Owner

That is a little strange. It's not always easy to tell how many files were opened for a given query but I just added something that will dump it out as a warning. Can you try this:

pull to 7285645
rebuild

rdm -v

Now it should tell you how many files were opened after each query.

@cviebig
Copy link

cviebig commented Nov 4, 2016

#719 is related I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants