Java OutOfMemoryError when Using Astminer with Code2Vec #75

Closed
michnorm opened this issue Jan 20, 2020 · 14 comments
@michnorm

Hi,

I am currently trying to use Astminer to extract paths for use with Code2Vec. I have roughly 9500 Python projects containing a total of about 220,000 code files that I want to extract paths from.

When running the command:
java -jar cli.jar code2vec --lang py --project <data location> --output <output location> --maxH 10 --maxW 6 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
After about 15 minutes I get the error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects.

When I reduce the number of projects from 9500 to 50 it works successfully. Is this likely to be an issue with system resources, or Astminer itself? Any help would be greatly appreciated, thanks!

@vovak
Member

vovak commented Jan 20, 2020

Hi @mdnorman38, thanks for the report.
I'll see if there is anything we can do to reduce memory consumption.

For now, please try increasing the maximum heap size by adding -Xmx8g (or another reasonably large value) to the java arguments. The default maximum heap size is, if I recall correctly, 1/4 of physical memory.
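For example, the original command with an explicit heap ceiling might look like this (the -Xmx value is illustrative, and the placeholder paths are kept from the original command):

```shell
# Raise the JVM heap ceiling; -Xmx8g is an example value, size it to your RAM.
# JVM flags must come before -jar, or they are passed to the application instead.
java -Xmx8g -jar cli.jar code2vec --lang py --project <data location> --output <output location> \
    --maxH 10 --maxW 6 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
```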

@vovak vovak self-assigned this Jan 20, 2020
@michnorm
Author

michnorm commented Jan 21, 2020

Thanks for the suggestion, @vovak. I can now process a lot more projects than previously.

Another question: from what I understand, you need to pass the path_contexts.csv file to Code2Vec. However, because of batching I have lots of path_contexts.csv files, so how do I pass all of them to Code2Vec? Should I just concatenate them?

@egor-bogomolov
Collaborator

Hello! The proper way is to concatenate the path_contexts.csv files into a single long file.
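A minimal sketch of the concatenation step, using two fabricated batch files; in practice the inputs would be the path_contexts files astminer wrote (the numbered-filename scheme below is an assumption, so check your actual output directory):

```shell
# Demo setup: two fabricated batch files standing in for astminer's output.
mkdir -p out
printf 'target1 ctx1 ctx2\n' > out/path_contexts_0.csv
printf 'target2 ctx3\n' > out/path_contexts_1.csv

# The actual merge: concatenate every batch into one long file for code2vec.
cat out/path_contexts_*.csv > out/path_contexts.csv
```

Since each line is an independent example, plain concatenation preserves all of them; ordering between batches should not matter, though you may want to shuffle the combined file before training.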

@michnorm
Author

Great, thanks @egor-bogomolov!

@shaoormunir

@vovak I am facing this issue as well. I am trying to process around 30,000 JavaScript files, and even after giving it 24 gigabytes of memory it fails after around 6,000 files. I am using the library directly from Kotlin instead of the CLI, since the CLI does not yet support JS parsing. Is there a way to do the parsing in parts by providing it with the previous context files, or any other way to make it work for the whole dataset?

@suryadesu

> Another question, from what I understand you need to pass the path_contexts.csv file to Code2Vec. However, because of batching I have lots of path_contexts.csv files, so how do I pass all the files to Code2Vec? Should I just concatenate them?

Hello, correct me if I'm wrong, but I think the right way is to parse all the files together: splitting the projects into batches may produce duplicate tokens, node types, and path contexts, which can give wrong results for code2vec. I checked this by mining AST paths once over the whole dataset and once with it split into batches, and the files created aren't the same.
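One way to check this observation is to compare the vocabulary files of the two runs order-insensitively. The sketch below fabricates two small token files; in a real check they would be the token vocabularies produced by the whole-dataset run and by the batched run (the filenames here are assumptions, not astminer's actual output names):

```shell
# Fabricated stand-ins for the vocabulary files of two runs.
printf 'foo,1\nbar,2\n' > tokens_whole.csv
printf 'bar,2\nbaz,3\nfoo,1\n' > tokens_batched.csv

# Sort both so the comparison ignores ordering, then diff them.
sort tokens_whole.csv > whole.sorted
sort tokens_batched.csv > batched.sorted
if diff -q whole.sorted batched.sorted > /dev/null; then
    echo "vocabularies match"
else
    echo "vocabularies differ"
fi
```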

@suryadesu

suryadesu commented Mar 5, 2020

I am facing this issue too. I am using the C/C++ parser on a dataset of around 650 MB. After around a day of running, it throws this error. I tried increasing the heap size up to a maximum of 120g, but I still get the same java.lang.OutOfMemoryError: GC overhead limit exceeded. I found that the error occurs while parsing with fuzzyc2cpg. Can someone suggest a solution?

@egor-bogomolov
Collaborator

Could you please try running the jar file mentioned in issue #60?
You can find the jar here

@suryadesu

suryadesu commented Mar 6, 2020

> Could you please try running the jar file mentioned in issue #60?
> You can find the jar here

Hi,
Thanks for the reply. I tried running the jar file you suggested. It seemed to run relatively faster, but I am still facing the same error: OutOfMemoryError: GC overhead limit exceeded. I ran it with a maximum heap size of 120GB, so I didn't run it in batches.
Can you let me know if there is some other possible solution?

@michnorm
Author

michnorm commented Mar 26, 2020

> Hello, correct me if I'm wrong. I think the right way to do it is to parse all the files together, as batching may result in duplicate tokens, node types, and path contexts, which may cause wrong results for code2vec.

You're right: there are duplicate tokens, and they seem to be negatively affecting my Code2Vec results. Unfortunately, my dataset is pretty large, so preprocessing without batching seems impossible given this out-of-memory issue.
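As a stopgap, exact duplicate lines in a concatenated path_contexts file can be dropped while keeping the first occurrence and the original order. This is only a workaround sketch for literal duplicates; it does not reconcile vocabulary files that diverged across batches:

```shell
# Demo input with one exact duplicate line.
printf 'a 1 2\nb 3 4\na 1 2\n' > path_contexts.csv

# awk keeps a line only the first time it is seen, preserving order.
awk '!seen[$0]++' path_contexts.csv > path_contexts_dedup.csv
wc -l < path_contexts_dedup.csv   # the duplicate is gone: 2 lines remain
```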

@SimoneBrigante

Hi @suryadesu, I have the same problem as you had. Did you find a solution?

@suryadesu

> Hi @suryadesu, I have the same problem as you had. Did you find out a solution?

No @SimoneBrigante, as of now I still haven't found a solution to this problem.

@vovak
Member

vovak commented Oct 5, 2020

Hi @michnorm @suryadesu @SimoneBrigante,

Sorry for the super long wait. Memory consumption is going to be fixed as soon as #106 is merged. For now you can just use cli.sh code2vec ... in the serial-parsing branch.

@vovak
Member

vovak commented Oct 7, 2020

Fixed by #106.

@vovak vovak closed this as completed Oct 7, 2020