Frog breaks while processing a large amount of txt data #86

Frog is used to analyze 64 different txt files on 64 cores. It is initiated in LaMachine with

frog.nf --inputdir chunks --outputdir chunks --inputformat text --sentenceperline --workers 64

However, I started the process several times; it once ran for a whole day, but another time it broke after only a few hours. The total runtime on the data should be around 20 days according to my calculations. Here is an excerpt of the error message.

Comments
Well, I assume Frog runs out of memory in LaMachine. One or two threads would be the maximum for this case, I am afraid. Do you really need FoLiA XML output? XML is verbose and needs a lot of memory.
Well, it is even worse: 10 GB for 30,000 lines means roughly 300 GB needed. The best solution would probably be to split all files into (much) smaller parts and run 64 Frog processes on, say, 6400 files, which is still not a large number of files.
No, FoLiA output is not needed. Actually, I would prefer something like the output of Frog running in Python. And no, the dependency parser and NER are not needed. I was wondering what IOB is? I couldn't find it in the documentation. And yes, it would be possible to split the files even further. However, I have not done that yet, because I wanted to avoid all the splitting and merging.
Perhaps we need to take a closer look at your Python + multiprocessing solution again and determine what went wrong there.
You can find the script attached. It takes a file with one sentence per line as input, automatically splits it into a specified number of text chunks (so that there is a chunk for every core), analyses each chunk using Frog, collects the output of the partial analyses, concatenates it, and writes the result to a single file again. My pragmatic solution to tackle the memory issue would be to write a bash script which splits one huge file into smaller files, which are then split by Python again. However, the Python script regularly breaks; I would be thankful if you could find out why. You can find the file here: frog_multiprocessing.py
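The attached script itself is not reproduced in the thread. As a minimal sketch of the pattern described above, assuming the python-frog binding (the file names, chunking helper, and disabled options here are illustrative, not taken from the attached script):

```python
import multiprocessing

import frog  # python-frog binding, shipped with LaMachine

def analyse_chunk(lines):
    """Run Frog on one chunk of sentences.

    A fresh Frog instance per task, mirroring the attached script;
    this is safer across fork boundaries, at the cost of reloading
    the models for every chunk.
    """
    f = frog.Frog(frog.FrogOptions(parser=False, ner=False))
    return [f.process_raw(line) for line in lines]

def split_into_chunks(lines, n):
    """Split the input into at most n roughly equal chunks, one per core."""
    size = max(1, (len(lines) + n - 1) // n)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    # "input.txt"/"output.txt" are placeholder names; one sentence per line.
    with open("input.txt", encoding="utf-8") as fh:
        sentences = [l.strip() for l in fh if l.strip()]
    n_cores = multiprocessing.cpu_count()
    with multiprocessing.Pool(n_cores) as pool:
        results = pool.map(analyse_chunk, split_into_chunks(sentences, n_cores))
    with open("output.txt", "w", encoding="utf-8") as out:
        for chunk_result in results:  # chunks come back in input order
            out.writelines(chunk_result)
```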
I am now testing Frog on 'nlcow14ax_all_clean_martijn_36.txt' with these options: no parser, no NER, and no IOB chunker, with output to a file in 'tabbed' format.
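The exact command is not preserved above; assuming Frog's documented `--skip` flag (c = chunker, n = named-entity recogniser, p = parser), `-n` for one-sentence-per-line input, and `-o` for tabbed output, the invocation would look something like:

```
# hypothetical reconstruction, not the exact command from the comment
frog -n --skip=cnp -t nlcow14ax_all_clean_martijn_36.txt -o nlcow14ax_all_clean_martijn_36.frog.out
```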
This tabbed format can be converted to JSON or a spreadsheet quite easily. The process is now at line 200,000, so there is still a lot to do, but memory usage is around 1 GB and stable, so no worries there.
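A minimal sketch of such a conversion, assuming Frog's usual ten-column tabbed layout (one token per line, a blank line between sentences; the column labels are my own, not an official schema):

```python
import json
import sys

# Column labels for Frog's tabbed output (assumption: the usual
# ten-column layout; trailing columns may be absent when components
# such as the chunker, NER, or parser are skipped).
COLUMNS = ["index", "text", "lemma", "morph", "pos", "confidence",
           "ner", "chunk", "dep_head", "dep_rel"]

def read_sentences(path):
    """Yield one sentence at a time as a list of token dicts."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:
        yield sentence

if __name__ == "__main__":
    # Usage: python tabbed_to_json.py frog_output.tsv > frog_output.json
    json.dump(list(read_sentences(sys.argv[1])), sys.stdout,
              ensure_ascii=False, indent=2)
```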
Your Python script looks pretty clean and well set up. I was worried there might be a forking issue after Frog instantiation, but that doesn't seem to be the case: you initialise a fresh Frog instance every time, which is actually more than necessary (once per worker would do), but better safe than sorry. When it breaks, do you get any error output? A bad allocation or a segmentation fault?
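For comparison, the "once per worker" variant would look roughly like this (a sketch, not the attached script: each pool worker builds one Frog instance in an initializer, after the fork, and reuses it for all of its tasks):

```python
import multiprocessing

import frog  # python-frog binding

_worker_frog = None  # one Frog instance per worker process

def init_worker():
    """Pool initializer: runs once in each worker process, after the
    fork, so no Frog state ever crosses the fork boundary."""
    global _worker_frog
    _worker_frog = frog.Frog(frog.FrogOptions(parser=False, ner=False))

def analyse_line(line):
    return _worker_frog.process_raw(line)

if __name__ == "__main__":
    with open("input.txt", encoding="utf-8") as fh:  # placeholder name
        sentences = [l.strip() for l in fh if l.strip()]
    with multiprocessing.Pool(initializer=init_worker) as pool:
        results = pool.map(analyse_line, sentences, chunksize=100)
```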
In the beginning I instantiated Frog only once, but that did not work at all, so I ended up with this. No, I don't receive any error message; the script just freezes. If you have access to the ponies, you can run the script using some files in my tensusers directory like this:
To summarize: all "reasonable", I think.
This is also solved by taking the right approach. Closing the issue.