Frog breaks while processing large amount of txt data #86

Closed
hannomuller opened this issue Jan 15, 2020 · 11 comments

@hannomuller

I am using Frog to analyze 64 different txt files on 64 cores. I start it in LaMachine with frog.nf --inputdir chunks --outputdir chunks --inputformat text --sentenceperline --workers 64. I have started the process several times: once it ran for a whole day, but another time it broke after only a few hours. According to my calculations, the total runtime on the data should be around 20 days. Here is an excerpt of the error message.

executor >  local (64)
[7f/9f749c] process > frog_text2folia (48) [ 97%] 62 of 64, failed: 62
WARN: Killing pending tasks (63)
Error executing process > 'frog_text2folia (37)'

Caused by:
  Process `frog_text2folia (37)` terminated with an error exit status (1)

Command executed:

  set +u
        if [ ! -z "/vol/customopt/lamachine.stable" ]; then
            source /vol/customopt/lamachine.stable/bin/activate
        fi
        set -u
  
        opts=""
        if [[ "true" == "true" ]]; then
            opts="$opts -n"
        fi
        if [ ! -z "" ]; then
  frog-mopts="$opts --skip="
  fi
  
        #move input files to separate staging directory
        mkdir input
        mv *.txt input/
  
        #output will be in cwd
        mkdir output
        frog $opts --outputclass "current" --xmldir "output" --nostdout --testdir input/
        cd output
        for f in *.xml; do
            if [[ ${f%.folia.xml} == $f ]]; then
                newf="${f%.xml}.frogged.folia.xml"
            else
                newf="${f%.folia.xml}.frogged.folia.xml"
            fi
            mv $f ../$newf
        done
        cd ..

Command exit status:
  1

Command output:
  Now using node v13.3.0 (npm v6.13.4)

Command error:
  frog-mbma-:	o - 0 
  frog-mbma-:	r - 0 
  frog-mbma-:	t - 0 
  frog-mbma-:	m - N  morpheme ='ma'
  frog-mbma-:	a - 0 
  frog-mbma-:	 - /  INFLECTION: de delete='a' morpheme ='t'
  frog-mbma-:	t - 0 
  frog-mbma-:	 - V  delete='jege'
  frog-mbma-:	 - 0 
  frog-mbma-:	 - /  INFLECTION: pv delete='ge'
  frog-mbma-:	 - 0 
  frog-mbma-:	z - 0 
  frog-mbma-:	o - 0 
  frog-mbma-:	c - 0  insert='ek' delete='ch'
  frog-mbma-:	h - 0 
  frog-mbma-:	t - /  INFLECTION: pv
  frog-mbma-:tag: / infl: morhemes: [sport,ma,t] description:  confidence: 0
  frog-mbma-:
  frog-mbma-:Hmm: deleting ' is impossible. (a != ').
  frog-mbma-:Reject rule: MBMA rule (qatar):
  frog-mbma-:	q - N  morpheme ='q'
  frog-mbma-:	a - /  INFLECTION: de delete='''
executor >  local (64)
[98/46c588] process > frog_text2folia (31) [100%] 64 of 64, failed: 64
WARN: Killing pending tasks (63)
Error executing process > 'frog_text2folia (37)'

Caused by:
  Process `frog_text2folia (37)` terminated with an error exit status (1)

Command executed:

  set +u
        if [ ! -z "/vol/customopt/lamachine.stable" ]; then
            source /vol/customopt/lamachine.stable/bin/activate
        fi
        set -u
  
        opts=""
        if [[ "true" == "true" ]]; then
            opts="$opts -n"
        fi
        if [ ! -z "" ]; then
  frog-mopts="$opts --skip="
  fi
  
        #move input files to separate staging directory
        mkdir input
        mv *.txt input/
  
        #output will be in cwd
        mkdir output
        frog $opts --outputclass "current" --xmldir "output" --nostdout --testdir input/
        cd output
        for f in *.xml; do
            if [[ ${f%.folia.xml} == $f ]]; then
                newf="${f%.xml}.frogged.folia.xml"
            else
                newf="${f%.folia.xml}.frogged.folia.xml"
            fi
            mv $f ../$newf
        done
        cd ..

Command exit status:
  1
  frog-mbma-:	o - 0 
  frog-mbma-:	r - 0 
  frog-mbma-:	t - 0 
  frog-mbma-:	m - N  morpheme ='ma'
  frog-mbma-:	a - 0 
  frog-mbma-:	 - /  INFLECTION: de delete='a' morpheme ='t'
  frog-mbma-:	t - 0 
  frog-mbma-:	 - V  delete='jege'
  frog-mbma-:	 - 0 
  frog-mbma-:	 - /  INFLECTION: pv delete='ge'
  frog-mbma-:	 - 0 
  frog-mbma-:	z - 0 
  frog-mbma-:	o - 0 
  frog-mbma-:	c - 0  insert='ek' delete='ch'
  frog-mbma-:	h - 0 
  frog-mbma-:	t - /  INFLECTION: pv
  frog-mbma-:tag: / infl: morhemes: [sport,ma,t] description:  confidence: 0
  frog-mbma-:
  frog-mbma-:Hmm: deleting ' is impossible. (a != ').
  frog-mbma-:Reject rule: MBMA rule (qatar):
  frog-mbma-:	q - N  morpheme ='q'
  frog-mbma-:	a - /  INFLECTION: de delete='''
  frog-mbma-:	t - 0 
  frog-mbma-:	a - 0 
  frog-mbma-:	r - 0  INFLECTION: e
  frog-mbma-:tag: / infl: morhemes: [q] description:  confidence: 0
  frog-mbma-:
  frog-mbma-:Hmm: deleting 's is impossible. (t != ').
  frog-mbma-:Reject rule: MBMA rule (ruytse):
  frog-mbma-:	r - N  morpheme ='ruy'
  frog-mbma-:	u - 0 
  frog-mbma-:	y - 0 
  frog-mbma-:	t - /  INFLECTION: m delete=''s'
  frog-mbma-:	s - 0 
  frog-mbma-:	e - /  INFLECTION: E/P
  frog-mbma-:tag: / infl: morhemes: [ruy] description:  confidence: 0
  frog-mbma-:
  frog-mbma-:Hmm: deleting 's is impossible. (t != ').
  frog-mbma-:Reject rule: MBMA rule (duyts):
  frog-mbma-:	d - N  morpheme ='d'
  frog-mbma-:	u - N  morpheme ='uy'
  frog-mbma-:	y - 0 
  frog-mbma-:	t - /  INFLECTION: m delete=''s'
  frog-mbma-:	s - 0  INFLECTION: e
  frog-mbma-:tag: / infl: morhemes: [d,uy] description:  confidence: 0
  frog-mbma-:
  frog-:problem frogging: nlcow14ax_all_clean_martijn_36.txt
  frog-:std::bad_alloc
  frog-:Wed Jan 15 17:16:55 2020 Frog finished
  mv: cannot stat '*.xml': No such file or directory

Work dir:
  /vol/tensusers2/hmueller/LAMACHINE/wd3/work/85/5e0fda647124c40fd8fd4d2846df61

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
@kosloot
Collaborator

kosloot commented Jan 16, 2020

Well. I assume Frog runs out of memory in LaMachine.
I am testing Frog now on this file, which has 933,300 lines of input.
After 15,000 lines, the memory used is already around 5.5 GB.
So at least 36 GB will be needed, and in the worst case even more.
On Ponyland, that amount should normally be available, but that also depends on other users.
Running 64 in parallel would need, say, 64 * 40 GB, and that is way too much.

1 or 2 threads would be the maximum for this case, I am afraid.

Do you really need FoLiA XML output? XML is verbose and needs a lot of memory.
Do you really need the Dependency Parser, NER and IOB tagger? These add a lot to the size of the FoLiA AND slow down Frog too.

@kosloot
Collaborator

kosloot commented Jan 16, 2020

Well, it is even worse: 10 GB for 30,000 lines ==> about 300 GB needed.
So only some ponies will support this. And I really wonder why you would want to put this in one big FoLiA XML file.
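
The estimate above can be reproduced with a short calculation. This is only a sketch: it assumes that memory grows linearly with the number of lines processed, and that all 64 input files are about as large as the one measured here (neither is confirmed in this thread).

```python
# Linear extrapolation of memory use from the figures quoted in this thread.
TOTAL_LINES = 933_300      # lines in the input file
SAMPLE_LINES = 30_000      # lines processed when memory was measured
SAMPLE_MEMORY_GB = 10.0    # memory in use at that point


def estimate_memory_gb(total_lines, sample_lines, sample_memory_gb):
    """Assume memory grows linearly with the number of lines processed."""
    return total_lines / sample_lines * sample_memory_gb


per_file = estimate_memory_gb(TOTAL_LINES, SAMPLE_LINES, SAMPLE_MEMORY_GB)
# Assumption: 64 parallel Frog processes, each on a file of similar size.
all_workers = per_file * 64
print(f"per file: {per_file:.0f} GB, 64 in parallel: {all_workers / 1000:.1f} TB")
```

So a single file would need roughly 311 GB, and 64 such processes at once would need far more than any single machine is likely to have.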

The best solution would probably be to split all files into (much) smaller parts and run 64 Frog processes on, say, 6400 files, which is still NOT a large number of files.
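
Splitting a sentence-per-line file into smaller parts could look roughly like the sketch below. The function name `split_lines`, the chunk size and the `.partNNNNN.txt` naming scheme are illustrative assumptions, not something Frog or LaMachine prescribes.

```python
from pathlib import Path


def split_lines(src, out_dir, lines_per_chunk=10_000):
    """Split a sentence-per-line text file into numbered chunk files.

    Returns the chunk paths in order. The file is read line by line,
    so the whole input never has to fit in memory.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    with src.open(encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk lines.
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}.part{len(paths):05d}.txt"
                out = path.open("w", encoding="utf-8")
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Because chunks are cut on line boundaries, no sentence is broken in two, and the zero-padded part numbers keep the chunks in the original order when sorted by name.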

@hannomuller
Author

No, a FoLiA output is not needed. Actually, I would prefer something like the output of Frog running in Python, e.g.
[{'index':'1', 'text':'bijvoorbeeld','morph':'[bij][voor][beeld]'}{W1}{W2}...{Wn}]
Is that possible?

And no, the Dependency Parser and NER are not needed. What is IOB, though? I couldn't find it in the documentation.

And yes, it would be possible to split the files even further. I have not done that yet, because this way I could avoid all the splitting and merging.

@kosloot
Collaborator

kosloot commented Jan 16, 2020

Well...
I assume @proycon can explain which Python script/parameters you need to run Frog in the desired way.
Regarding JSON output: that is a recently added wish, see #85.

@proycon
Member

proycon commented Jan 16, 2020

Perhaps we need to take a closer look at your Python + multiprocessing solution again and determine what went wrong there.

@hannomuller
Author

You can find the script attached. It takes a file with one sentence per line as input, splits it automatically into a specified number of text chunks (so that there is a chunk for every core), analyses each chunk using Frog, collects the output of the partial analyses and concatenates it again, and then writes the result to a single file.

My pragmatic solution to the memory issue would be to write a bash script which splits one huge file into smaller files, which are then split again by Python. However, the Python script regularly breaks. I would be thankful if you could find out why. You can find the file here: frog_multiprocessing.py

@kosloot
Collaborator

kosloot commented Jan 16, 2020

I am now testing frog on 'nlcow14ax_all_clean_martijn_36.txt' with these options:
frog --skip=mpnc -n -o outfile nlcow14ax_all_clean_martijn_36.txt

So no parser, no NER and no IOB chunker, and output to a file in 'tabbed' format, like this:

1	de	de	[de]	LID(bep,stan,rest)	0.981886								
2	<number>	<number>	[<number>]	N(soort,mv,basis)	1.000000							
3	dstudio	dstudio	[d][studio]	N(soort,ev,basis,zijd,stan)	0.803744								
4	bestaat	bestaan	[be][sta][t]	WW(pv,tgw,met-t)	0.998852								
5	uit	uit	[uit]	VZ(init)	0.983871								
6	zestien	zestien	[zes][tien]	TW(hoofd,prenom,stan)	0.941423								
7	werkplekken	werkplek	[werk][plek][en]	N(soort,mv,basis)	0.998020	

This tabbed format can be converted to JSON or a spreadsheet quite easily.

The process is now at line 200,000, so there is still a lot to do, but memory usage is around 1 GB and stable, so no worries there.
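
A conversion from the tabbed output to JSON could be sketched as follows. The column names are assumptions read off the sample above (index, token, lemma, morphemes, POS tag, confidence), and the sketch assumes that sentences are separated by an empty line; both should be checked against the actual output file.

```python
import json

# Assumed names for the first six columns of the tabbed output shown above.
COLUMNS = ("index", "text", "lemma", "morph", "pos", "confidence")


def parse_tabbed(lines):
    """Turn tabbed Frog output into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumed sentence boundary: an empty line.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # zip() ignores the trailing columns that are empty when modules are skipped.
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def tabbed_to_json(in_path, out_path):
    """Read a tabbed output file and write it as one JSON document."""
    with open(in_path, encoding="utf-8") as fh:
        sentences = parse_tabbed(fh)
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(sentences, out, ensure_ascii=False, indent=1)
```

Each token then looks like `{"index": "1", "text": "de", "lemma": "de", "morph": "[de]", ...}`, which is close to the dictionary format asked for earlier in this thread.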

@proycon
Member

proycon commented Jan 16, 2020

Your Python script looks pretty clean and well set up. I was worried there might be a forking issue after Frog instantiation, but that doesn't seem to be the case: you initialise a fresh Frog instance every time (which is actually more than necessary, since once per actual thread would do, but better safe than sorry).

When it breaks do you get any error output? Bad allocation or segmentation fault?

@hannomuller
Author

In the beginning I instantiated Frog only once, but that did not work at all, so I ended up with this. No, I don't receive any error message. The script just freezes. If you have access to the ponies, you can run the script using some files in my tensusers directory like this: python frog_multiprocessing.py nlcow14ax_all_clean_martijn_smaller frog_multiprocessing 5

@kosloot
Collaborator

kosloot commented Jan 16, 2020

To summarize:
frog --skip=mpnc -n -o outfile nlcow14ax_all_clean_martijn_36.txt ran for 2 hours and 48 minutes, taking a maximum of 550 MB resident memory (about 1.2 GB virtual).

All "reasonable", I think.
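
Running this command over many input files with a limited number of parallel processes could be scripted along the following lines. This is only a sketch: the helper names, the `.frog.out` suffix and the default worker count are illustrative assumptions, and only the `frog` options quoted above are taken from this thread.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def frog_command(infile, outfile):
    """Build the Frog invocation quoted in this thread for one input file."""
    return ["frog", "--skip=mpnc", "-n", "-o", str(outfile), str(infile)]


def run_frog(infile, out_dir):
    """Run Frog on one file; raises CalledProcessError if Frog exits non-zero."""
    outfile = Path(out_dir) / (Path(infile).stem + ".frog.out")  # hypothetical naming
    subprocess.run(frog_command(infile, outfile), check=True)
    return outfile


def frog_all(in_dir, out_dir, workers=4):
    """Run Frog on every *.txt file in in_dir, at most `workers` at a time."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    inputs = sorted(Path(in_dir).glob("*.txt"))
    # Threads are enough here: the real work happens in the frog subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: run_frog(f, out_dir), inputs))


def concatenate(parts, target):
    """Append the per-file outputs, in the given order, into one file."""
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as fh:
                shutil.copyfileobj(fh, out)
```

A call such as `concatenate(frog_all("chunks", "frogged", workers=8), "all.frog.out")` would then produce one combined file. The worker count is best chosen from the available memory, using the per-process figures measured above as a guide.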

@kosloot
Collaborator

kosloot commented Jan 31, 2020

This is also solved by taking the right approach. Closing the issue.
