reduce amount of memory used by the analyzers #8

Closed
vladak opened this Issue · 2 comments

3 participants

@vladak
Owner

From discuss mailing list (Knut):

OpenGrok's analyzers read the entire file into a byte buffer before they
start analyzing it, so they may require a lot of memory if some of the
files are big. That is, if individual files are big. The total size of
the code base shouldn't have much impact on the memory requirements, as
far as I know.

Additionally, the analyzers keep the potentially very big byte buffers
around (one per analyzer instance, cached in a ThreadLocal field) to
avoid allocating new ones later, so the buffer space is never freed
after reading a big file.

I have been planning to rewrite the analyzers to make them less
memory-hungry, but I haven't got around to it yet.
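To make the problem concrete, here is a minimal sketch of the pattern described above, with invented class and method names (it is not OpenGrok's actual code): each analyzer keeps a byte buffer in a ThreadLocal, grows it to the size of the largest file it has seen, and never lets it go.

    import java.io.IOException;
    import java.io.InputStream;

    // Illustrative only: an analyzer that reads the whole file into a
    // thread-local byte buffer and keeps the buffer for later reuse.
    public class BufferingAnalyzer {
        private final ThreadLocal<byte[]> cachedBuffer =
                ThreadLocal.withInitial(() -> new byte[8 * 1024]);

        public void analyze(InputStream in, int fileLength) throws IOException {
            byte[] buf = cachedBuffer.get();
            if (buf.length < fileLength) {
                // Grow to the full file size; this large array stays cached
                // in the ThreadLocal even after the big file is done.
                buf = new byte[fileLength];
                cachedBuffer.set(buf);
            }
            int off = 0;
            int n;
            while (off < fileLength && (n = in.read(buf, off, fileLength - off)) > 0) {
                off += n;
            }
            // ... the full-text, symbol and xref passes then work on buf[0..off) ...
        }
    }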


Jens/Knut:

Yep, perhaps just using soft/weak references to the buffers, or to the cached
analyzers altogether, so that - at least in theory - they get automatically
purged/GCed under memory pressure, could fix it with little effort?

Yes, that might help reclaim the space. We would still see spikes in
memory usage when processing big files, though. I was hoping we could
solve the latter (or both) by passing file contents to the analyzers as
streams rather than byte arrays.
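A minimal sketch of the soft-reference idea, reusing the hypothetical buffer cache from the sketch above: the GC may reclaim the cached array under memory pressure, but the allocation spike while a big file is being read remains.

    import java.lang.ref.SoftReference;

    // Illustrative only: the per-thread buffer is held behind a SoftReference,
    // so it can be garbage-collected when memory gets tight.
    public class SoftCachingAnalyzer {
        private final ThreadLocal<SoftReference<byte[]>> cachedBuffer = new ThreadLocal<>();

        private byte[] buffer(int minSize) {
            SoftReference<byte[]> ref = cachedBuffer.get();
            byte[] buf = (ref != null) ? ref.get() : null;
            if (buf == null || buf.length < minSize) {
                buf = new byte[minSize];
                cachedBuffer.set(new SoftReference<>(buf));
            }
            return buf;
        }
    }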

@kahatlen kahatlen was assigned
@kahatlen
Owner

I'm looking at it...

Currently, the analyzers read the entire file into memory and keep it there to avoid having to read it multiple times (once to build the full-text search index, once to build the symbol index, once to produce the xref, and so on). If they instead opened a new stream for each of those tasks, they would not require memory proportional to the file size, and I think it would also make the code cleaner. The downside is that files will have to be read multiple times, so we need to check that indexing performance doesn't take an unacceptable hit.
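A rough sketch of that approach, assuming a hypothetical source-supplier interface (the names are illustrative, not the actual OpenGrok API): every pass opens its own fresh stream, so no pass needs a buffer proportional to the file size.

    import java.io.IOException;
    import java.io.InputStream;

    // Illustrative only: a supplier that can open a new stream over the source
    // file as many times as needed.
    interface StreamSource {
        InputStream openStream() throws IOException;
    }

    class StreamingAnalyzer {
        void analyze(StreamSource src) throws IOException {
            try (InputStream in = src.openStream()) {
                indexFullText(in);   // pass 1: full-text search index
            }
            try (InputStream in = src.openStream()) {
                indexSymbols(in);    // pass 2: symbol/definition index
            }
            try (InputStream in = src.openStream()) {
                writeXref(in);       // pass 3: xref output
            }
        }

        void indexFullText(InputStream in) throws IOException { /* ... */ }
        void indexSymbols(InputStream in) throws IOException { /* ... */ }
        void writeXref(InputStream in) throws IOException { /* ... */ }
    }

The trade-off mentioned above shows up here directly: the file is opened and read three times instead of once.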

@trondn
Owner
@kahatlen kahatlen referenced this issue from a commit in kahatlen/OpenGrok
@kahatlen kahatlen Change the analyzers so that they don't keep the entire file in memory.
Instead of passing a stream to the analyze() methods and having them
cache the entire contents in a byte or char array, open a new stream
each time the source file needs to be read.

Also, write the xref files from the analyze() methods. In many
analyzers, this avoids the need for building up a data structure that
holds the full xref output in memory.

This fixes #8.
c4844f8
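As a hedged illustration of the second part of the commit message, here is what writing the xref directly from the analyzer can look like (names and markup are made up for the example): each output line is written to the destination Writer as soon as it is produced, so nothing proportional to the file size is accumulated in memory.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.Writer;

    // Illustrative only: stream the xref output instead of building it up
    // in an in-memory data structure first.
    class XrefStreamer {
        static void writeXref(Reader in, Writer out) throws IOException {
            BufferedReader reader = new BufferedReader(in);
            String line;
            int lineNo = 1;
            while ((line = reader.readLine()) != null) {
                out.write("<a name=\"" + lineNo + "\"></a>" + escape(line) + "\n");
                lineNo++;
            }
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }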
@kahatlen kahatlen closed this in a39bcfe (same commit message as c4844f8 above)
@kahatlen kahatlen referenced this issue from a commit in kahatlen/OpenGrok
@kahatlen kahatlen Improve cleanup of resources opened by the analyze() methods.
Before issue #8, there would be exactly one FileInputStream per file
that was being analyzed, and it was pretty easy to ensure that this
stream was closed. Issue #8 changed this, but it didn't add logic to
ensure that the extra streams were closed if there was an error before
the document was added to the index.

This change attempts to improve the situation by making sure that
readers or token streams associated with the fields in the Lucene
document are closed if something goes wrong.
d7440ea
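A rough sketch of the shape of that fix, with invented names (not the actual change in d7440ea): keep track of what has been opened for the document's fields, and close it all if something fails before the index writer takes ownership.

    import java.io.Closeable;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: best-effort cleanup of streams opened for a document.
    class DocumentIndexer {
        void addDocument(File file) throws IOException {
            List<Closeable> pending = new ArrayList<>();
            boolean ok = false;
            try {
                pending.add(new FileInputStream(file)); // stream backing the full-text field
                pending.add(new FileInputStream(file)); // stream backing the symbol field
                // ... wrap the streams in readers/token streams, build the Lucene
                // document and hand it to the IndexWriter ...
                ok = true;
            } finally {
                if (!ok) {
                    // An error occurred before the writer consumed the fields, so
                    // close whatever was opened to avoid leaking file descriptors.
                    for (Closeable c : pending) {
                        try {
                            c.close();
                        } catch (IOException ignored) {
                            // best-effort cleanup
                        }
                    }
                }
            }
        }
    }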
@vladak vladak referenced this issue from a commit in vladak/OpenGrok (same commit message as c4844f8): 5755f95
@vladak vladak referenced this issue from a commit in vladak/OpenGrok (same commit message as d7440ea): e813067