Java OutOfMemoryError when Using Astminer with Code2Vec #75
Comments
hi @mdnorman38, thanks for the report. For now, please try increasing the maximum heap size by adding […]
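The suggestion above is truncated in this copy of the thread; it presumably refers to the JVM's `-Xmx` option, which raises the maximum heap size. A minimal sketch (the `16g` value is illustrative, not a maintainer recommendation; adjust it to your machine):

```shell
# Allow the JVM up to 16 GB of heap before launching astminer's CLI.
# -Xmx sets the maximum heap size; JVM options must come before -jar.
java -Xmx16g -jar cli.jar code2vec --lang py --project <data location> --output <output location>
```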
Thanks for the suggestion @vovak, I can now process many more projects than before. Another question: from what I understand, you need to pass the […]
Hello! A proper way to do so is to concatenate the […]
Great, thanks @egor-bogomolov!
@vovak I am facing this issue as well. I am trying to process around 30,000 JavaScript files, and even after giving it 24 GB of memory it fails after around 6,000 files. I am using the library directly from Kotlin instead of the CLI, as the CLI does not yet support JS parsing. Is there a way to do the parsing in parts by providing it the previously processed context files, or any other way to make it work for the whole dataset?
Hello, correct me if I'm wrong: I think the right way is to parse all the files together, as splitting the dataset into batches may result in duplicate tokens, node types, and path contexts, which could produce wrong results for code2vec. I checked this by mining AST paths over the whole dataset and then over batches; the resulting files are not the same.
I am facing this issue as well. I am using the C/C++ parser on a dataset of around 650 MB. After a long run of about a day, it throws this error. I tried increasing the heap size up to 120g, but I still get `java.lang.OutOfMemoryError: GC overhead limit exceeded`. I found that the error occurs while parsing with fuzzyc2cpg. Can someone suggest a solution?
Hi […]
You're right, there are duplicate tokens, and they seem to be negatively affecting my code2vec results. Unfortunately, my dataset is large enough that preprocessing without batching seems impossible given this out-of-memory issue.
Hi @suryadesu, I have the same problem as you had. Did you find a solution?
No @SimoneBrigante, […]
Hi @michnorm @suryadesu @SimoneBrigante, sorry for the super long wait. Memory consumption is going to be fixed as soon as #106 is merged. For now you can just use […]
Fixed by #106.
Hi,
I am currently trying to use Astminer to extract paths for use with Code2Vec. I have roughly 9500 Python projects containing a total of about 220,000 code files that I want to extract paths from.
When running the command:
java -jar cli.jar code2vec --lang py --project <data location> --output <output location> --maxH 10 --maxW 6 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
After about 15 minutes I get the error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
When I reduce the number of projects from 9500 to 50, it works successfully. Is this likely to be an issue with system resources, or with Astminer itself? Any help would be greatly appreciated, thanks!
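If the default heap is the bottleneck, one hedged first step is to re-run the same command with an explicit heap limit and a heap dump on failure. Both `-Xmx` and `-XX:+HeapDumpOnOutOfMemoryError` are standard JVM options; the `32g` figure is an illustrative guess for a machine with enough RAM, not a value from the maintainers:

```shell
# Re-run with a larger maximum heap; on OOM, write a heap dump for inspection.
# JVM options must precede -jar; the astminer arguments are unchanged.
java -Xmx32g -XX:+HeapDumpOnOutOfMemoryError \
     -jar cli.jar code2vec --lang py --project <data location> --output <output location> \
     --maxH 10 --maxW 6 --maxContexts 1000000 --maxTokens 100000 --maxPaths 100000
```

The heap dump (an `.hprof` file) can then be opened in a tool such as Eclipse MAT to see what is actually filling the heap.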