Java OutOfMemoryError when Using Astminer with Code2Vec #75

Closed
michnorm opened this issue Jan 20, 2020 · 14 comments
@michnorm

Hi,

I am currently trying to use Astminer to extract paths for use with Code2Vec. I have roughly 9500 Python projects containing a total of about 220,000 code files that I want to extract paths from.

When running the command:
java -jar cli.jar code2vec --lang py --project <data location> --output <output location> --maxH 10 --maxW 6 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
After about 15 minutes I get the error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects.

When I reduce the number of projects from 9500 to 50 it works successfully. Is this likely to be an issue with system resources, or Astminer itself? Any help would be greatly appreciated, thanks!

@vovak
Member

vovak commented Jan 20, 2020

Hi @mdnorman38, thanks for the report.
I'll see if there is anything we can do to reduce memory consumption.

For now, please try increasing the maximum heap size by adding -Xmx8g (or another reasonably large value) to the java arguments. The default maximum heap size is, if I recall correctly, 1/4 of physical memory.
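For example, the original command with an explicit heap ceiling might look like this (the -Xmx value is illustrative, and the placeholder paths are kept from the original command):

```shell
# Raise the JVM heap ceiling; -Xmx8g is an example value, size it to your RAM.
# JVM flags must come before -jar, or they are passed to the application instead.
java -Xmx8g -jar cli.jar code2vec --lang py --project <data location> --output <output location> \
    --maxH 10 --maxW 6 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
```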

@vovak vovak self-assigned this Jan 20, 2020
@michnorm
Author

michnorm commented Jan 21, 2020

Thanks for the suggestion, @vovak. I can now process a lot more projects than previously.

Another question: from what I understand, you need to pass the path_contexts.csv file to Code2Vec. However, because of batching I have lots of path_contexts.csv files, so how do I pass all of them to Code2Vec? Should I just concatenate them?

@egor-bogomolov
Collaborator

Hello! The proper way is to concatenate the path_contexts.csv files into a single long file.
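A minimal sketch of the concatenation step, using two fabricated batch files; in practice the inputs would be the path_contexts files astminer wrote (the numbered-filename scheme below is an assumption, so check your actual output directory):

```shell
# Demo setup: two fabricated batch files standing in for astminer's output.
mkdir -p out
printf 'target1 ctx1 ctx2\n' > out/path_contexts_0.csv
printf 'target2 ctx3\n' > out/path_contexts_1.csv

# The actual merge: concatenate every batch into one long file for code2vec.
cat out/path_contexts_*.csv > out/path_contexts.csv
```

Since each line is an independent example, plain concatenation preserves all of them; ordering between batches should not matter, though you may want to shuffle the combined file before training.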

@michnorm
Author

Great, thanks @egor-bogomolov!

@shaoormunir

@vovak I am facing this issue as well. I am trying to process around 30,000 JavaScript files, and even after giving it 24 gigabytes of memory it fails after around 6,000 files. I am using the library directly from Kotlin instead of the CLI, since the CLI does not yet support JS parsing. Is there a way to do the parsing in parts by providing it with the previous context files, or any other way to make it work for the whole dataset?

@suryadesu

> Another question, from what I understand you need to pass the path_contexts.csv file to Code2Vec. However, because of batching I have lots of path_contexts.csv files, so how do I pass all the files to Code2Vec? Should I just concatenate them?

Hello, correct me if I'm wrong, but I think the right way is to parse all the files together: splitting the projects into batches may produce duplicate tokens, node types, and path contexts, which can give wrong results for code2vec. I checked this by mining AST paths once over the whole dataset and once with it split into batches, and the files created aren't the same.
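One way to check this observation is to compare the vocabulary files of the two runs order-insensitively. The sketch below fabricates two small token files; in a real check they would be the token vocabularies produced by the whole-dataset run and by the batched run (the filenames here are assumptions, not astminer's actual output names):

```shell
# Fabricated stand-ins for the vocabulary files of two runs.
printf 'foo,1\nbar,2\n' > tokens_whole.csv
printf 'bar,2\nbaz,3\nfoo,1\n' > tokens_batched.csv

# Sort both so the comparison ignores ordering, then diff them.
sort tokens_whole.csv > whole.sorted
sort tokens_batched.csv > batched.sorted
if diff -q whole.sorted batched.sorted > /dev/null; then
    echo "vocabularies match"
else
    echo "vocabularies differ"
fi
```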

@suryadesu

suryadesu commented Mar 5, 2020

I am facing this issue too. I am using the C/C++ parser on a dataset of around 650 MB. After around a day of running, it throws this error. I tried increasing the heap size up to a maximum of 120g, but I still get the same java.lang.OutOfMemoryError: GC overhead limit exceeded. I found that the error occurs while parsing with fuzzyc2cpg. Can someone suggest a solution?

@egor-bogomolov
Collaborator

Could you please try running the jar file mentioned in issue #60?
You can find the jar here

@suryadesu

suryadesu commented Mar 6, 2020

> Could you please try running the jar file mentioned in issue #60?
> You can find the jar here

Hi,
Thanks for the reply. I tried running the jar file you suggested. It seemed to run relatively faster, but I am still facing the same error: OutOfMemoryError: GC overhead limit exceeded. I ran it with a maximum heap size of 120GB, so I didn't run it in batches.
Can you let me know if there is some other possible solution?

@michnorm
Author

michnorm commented Mar 26, 2020

> Hello, correct me if I'm wrong. I think the right way to do it is to parse all the files together, as batching may result in duplicate tokens, node types, and path contexts, which may cause wrong results for code2vec.

You're right: there are duplicate tokens, and they seem to be negatively affecting my Code2Vec results. Unfortunately, my dataset is pretty large, so preprocessing without batching seems impossible given this out-of-memory issue.
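As a stopgap, exact duplicate lines in a concatenated path_contexts file can be dropped while keeping the first occurrence and the original order. This is only a workaround sketch for literal duplicates; it does not reconcile vocabulary files that diverged across batches:

```shell
# Demo input with one exact duplicate line.
printf 'a 1 2\nb 3 4\na 1 2\n' > path_contexts.csv

# awk keeps a line only the first time it is seen, preserving order.
awk '!seen[$0]++' path_contexts.csv > path_contexts_dedup.csv
wc -l < path_contexts_dedup.csv   # the duplicate is gone: 2 lines remain
```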

@SimoneBrigante

Hi @suryadesu, I have the same problem as you had. Did you find a solution?

@suryadesu

> Hi @suryadesu, I have the same problem as you had. Did you find out a solution?

No @SimoneBrigante, as of now I still haven't found a solution to this problem.

@vovak
Member

vovak commented Oct 5, 2020

Hi @michnorm @suryadesu @SimoneBrigante,

Sorry for the super long wait. Memory consumption is going to be fixed as soon as #106 is merged. For now you can just use cli.sh code2vec ... in the serial-parsing branch.

@vovak
Member

vovak commented Oct 7, 2020

Fixed by #106.

@vovak vovak closed this as completed Oct 7, 2020