reduce amount of memory used by the analyzers #8

Closed
vladak opened this Issue Mar 7, 2013 · 2 comments


vladak (Owner) commented Mar 7, 2013

From the discuss mailing list (Knut):

OpenGrok's analyzers read the entire file into a byte buffer before they
start analyzing it, so they may require a lot of memory if individual
files are big. The total size of the code base shouldn't have much impact
on the memory requirements, as far as I know.

Additionally, the analyzers keep the potentially very big byte buffers
around (one per analyzer instance, cached in a ThreadLocal field) to
avoid allocating new ones later, so the buffer space is never freed
after reading a big file.

I have been planning to rewrite the analyzers to make them less
memory-hungry, but I haven't got around to it yet.


Jens/Knut:

Yep, perhaps just using soft/weak references to the buffers, or to the
cached analyzers altogether, so that they get automatically purged/GCed
under memory pressure (at least in theory), might fix it with little effort?

Yes, that might help reclaim the space. We would still see spikes in
memory usage when processing big files, though. I was hoping we could
solve the latter (or both) by passing file contents to the analyzers as
streams rather than byte arrays.
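
For illustration, a minimal sketch of the soft-reference idea (hypothetical code, not OpenGrok's actual implementation): keep the per-thread buffer behind a SoftReference so the GC is free to reclaim it under memory pressure.

```java
import java.lang.ref.SoftReference;

// Hypothetical sketch: cache the per-thread read buffer behind a SoftReference
// so the GC may reclaim it under memory pressure instead of keeping it alive
// for the lifetime of the indexing thread.
final class BufferCache {
    private static final ThreadLocal<SoftReference<byte[]>> CACHE = new ThreadLocal<>();

    /** Returns a buffer of at least minSize bytes, reusing the cached one when possible. */
    static byte[] acquire(int minSize) {
        SoftReference<byte[]> ref = CACHE.get();
        byte[] buf = (ref == null) ? null : ref.get(); // null if never set or already collected
        if (buf == null || buf.length < minSize) {
            buf = new byte[minSize];
            CACHE.set(new SoftReference<>(buf)); // only softly reachable, so it can be freed
        }
        return buf;
    }
}
```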

kahatlen was assigned Mar 9, 2013

kahatlen (Owner) commented Mar 9, 2013

I'm looking at it...

Currently, the analyzers read the entire file into memory and keep it there to avoid reading it multiple times (once to build the full-text search index, once to build the symbol index, once to produce the xref, and so on). If they instead opened a new stream for each of those tasks, they would not require memory proportional to the file size, and I think it would also make the code cleaner. The downside is that files would have to be read multiple times, so we need to check that indexing performance doesn't take an unacceptable hit.
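
A rough sketch of what that could look like (hypothetical names, not the actual OpenGrok API): instead of handing the analyzer a byte array, hand it something that can open a fresh stream for each pass.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: a factory that reopens the source file on demand, so each
// indexing pass reads its own stream and no whole-file buffer is retained.
interface SourceStreamFactory {
    InputStream openStream() throws IOException;
}

class StreamPerTaskSketch {
    static SourceStreamFactory forFile(File file) {
        return () -> new BufferedInputStream(new FileInputStream(file));
    }

    static void analyze(SourceStreamFactory src) throws IOException {
        try (InputStream in = src.openStream()) {
            // pass 1: feed the full-text tokenizer
        }
        try (InputStream in = src.openStream()) {
            // pass 2: feed the symbol tokenizer
        }
        try (InputStream in = src.openStream()) {
            // pass 3: generate the xref
        }
    }
}
```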

trondn (Owner) commented Mar 10, 2013

For small files this will most likely be in the operating system's file
system cache anyway...

Trond



kahatlen added a commit to kahatlen/OpenGrok that referenced this issue Mar 31, 2013
kahatlen: Change the analyzers so that they don't keep the entire file in memory.
Instead of passing a stream to the analyze() methods and having them
cache the entire contents in a byte or char array, open a new stream
each time the source file needs to be read.

Also, write the xref files from the analyze() methods. In many
analyzers, this avoids the need for building up a data structure that
holds the full xref output in memory.

This fixes #8.
c4844f8
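
The second part of the commit message, writing the xref directly from analyze(), amounts to streaming the markup to the output writer as the source is scanned rather than building it up in memory first. A hypothetical sketch (placeholder markup, not the real xref format):

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical sketch: emit each xref fragment straight to the destination
// Writer while scanning, so no full cross-reference is held in memory.
class XrefWriteSketch {
    static void writeXref(Iterable<String> symbols, Writer out) throws IOException {
        for (String symbol : symbols) {
            out.write("<a class=\"symbol\">"); // placeholder markup for illustration
            out.write(symbol);
            out.write("</a>\n");
        }
        out.flush();
    }
}
```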
kahatlen added a commit that closed this issue Apr 3, 2013
kahatlen: Change the analyzers so that they don't keep the entire file in memory.
Instead of passing a stream to the analyze() methods and having them
cache the entire contents in a byte or char array, open a new stream
each time the source file needs to be read.

Also, write the xref files from the analyze() methods. In many
analyzers, this avoids the need for building up a data structure that
holds the full xref output in memory.

This fixes #8.
a39bcfe
kahatlen closed this in a39bcfe Apr 3, 2013
kahatlen added a commit to kahatlen/OpenGrok that referenced this issue Apr 21, 2013
kahatlen: Improve cleanup of resources opened by the analyze() methods.
Before issue #8, there would be exactly one FileInputStream per file
that was being analyzed, and it was pretty easy to ensure that this
stream was closed. Issue #8 changed this, but it didn't add logic to
ensure that the extra streams were closed if there was an error before
the document was added to the index.

This change attempts to improve the situation by making sure that
readers or token streams associated with the fields in the Lucene
document are closed if something goes wrong.
d7440ea
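
A hedged sketch of the cleanup pattern this commit describes (hypothetical code, not the actual change): track the Readers opened for a document's fields and close them if indexing fails before the index has consumed them.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.List;

// Hypothetical sketch: close every field Reader that was opened for a document
// if adding the document to the index fails partway through.
class FieldCleanupSketch {
    static void addWithCleanup(List<Reader> fieldReaders, Runnable addToIndex) {
        boolean added = false;
        try {
            addToIndex.run();   // may throw before the index has consumed the readers
            added = true;
        } finally {
            if (!added) {
                for (Reader r : fieldReaders) {
                    try {
                        r.close();            // best-effort close of each leftover stream
                    } catch (IOException ignored) {
                        // keep closing the rest even if one close fails
                    }
                }
            }
        }
    }
}
```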
vladak added a commit to vladak/OpenGrok that referenced this issue Aug 8, 2013
kahatlen + vladak: Change the analyzers so that they don't keep the entire file in memory.
Instead of passing a stream to the analyze() methods and having them
cache the entire contents in a byte or char array, open a new stream
each time the source file needs to be read.

Also, write the xref files from the analyze() methods. In many
analyzers, this avoids the need for building up a data structure that
holds the full xref output in memory.

This fixes #8.
5755f95
vladak added a commit to vladak/OpenGrok that referenced this issue Aug 8, 2013
kahatlen + vladak: Improve cleanup of resources opened by the analyze() methods.
Before issue #8, there would be exactly one FileInputStream per file
that was being analyzed, and it was pretty easy to ensure that this
stream was closed. Issue #8 changed this, but it didn't add logic to
ensure that the extra streams were closed if there was an error before
the document was added to the index.

This change attempts to improve the situation by making sure that
readers or token streams associated with the fields in the Lucene
document are closed if something goes wrong.
e813067