Frog breaks while processing a large amount of txt data #86

Frog is used to analyze 64 different txt files on 64 cores. It is initiated in LaMachine with

frog.nf --inputdir chunks --outputdir chunks --inputformat text --sentenceperline --workers 64

However, I started the process several times; it once ran for a whole day, but another time it broke after only a few hours. The total runtime on the data should be around 20 days according to my calculations. Here is an excerpt of the error message.

Comments
Well, I assume Frog runs out of memory in LaMachine. One or two threads would be the maximum for this case, I am afraid. Do you really need FoLiA XML output? XML is verbose and needs a lot of memory.
Well, it is even worse: 10 GB for 30,000 lines means roughly 300 GB needed. The best solution would probably be to split all files into (much) smaller parts and run 64 Frog processes on, say, 6400 files, which is still not a large number of files.
No, FoLiA output is not needed. Actually, I would prefer something like the output of Frog running in Python. And no, the dependency parser and NER are not needed. I was wondering what IOB is? I couldn't find it in the documentation. And yes, it would be possible to split the files even further. However, I have not done that yet, because I wanted to avoid all the splitting and merging.
Perhaps we need to take a closer look at your Python + multiprocessing solution again and determine what went wrong there.
You can find the script attached. It takes a file with one sentence per line as input, automatically splits it into a specified number of text chunks (so that there is a chunk for every core), analyses each chunk using Frog, collects the output of the partial analyses, concatenates it, and writes the result to a single file again. My pragmatic solution to tackle the memory issue would be to write a bash script which splits one huge file into smaller files, which are then split by Python again. However, the Python script regularly breaks; I would be thankful if you could find out why. You can find the file here: frog_multiprocessing.py
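The attached script itself is not reproduced in the thread. As a minimal sketch of the pattern described above, assuming the python-frog binding (the file names, chunking helper, and disabled options here are illustrative, not taken from the attached script):

```python
import multiprocessing

import frog  # python-frog binding, shipped with LaMachine

def analyse_chunk(lines):
    """Run Frog on one chunk of sentences.

    A fresh Frog instance per task, mirroring the attached script;
    this is safer across fork boundaries, at the cost of reloading
    the models for every chunk.
    """
    f = frog.Frog(frog.FrogOptions(parser=False, ner=False))
    return [f.process_raw(line) for line in lines]

def split_into_chunks(lines, n):
    """Split the input into at most n roughly equal chunks, one per core."""
    size = max(1, (len(lines) + n - 1) // n)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    # "input.txt"/"output.txt" are placeholder names; one sentence per line.
    with open("input.txt", encoding="utf-8") as fh:
        sentences = [l.strip() for l in fh if l.strip()]
    n_cores = multiprocessing.cpu_count()
    with multiprocessing.Pool(n_cores) as pool:
        results = pool.map(analyse_chunk, split_into_chunks(sentences, n_cores))
    with open("output.txt", "w", encoding="utf-8") as out:
        for chunk_result in results:  # chunks come back in input order
            out.writelines(chunk_result)
```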
I am now testing Frog on 'nlcow14ax_all_clean_martijn_36.txt' with these options: no parser, no NER, and no IOB chunker, with output to a file in 'tabbed' format.
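The exact command is not preserved above; assuming Frog's documented `--skip` flag (c = chunker, n = named-entity recogniser, p = parser), `-n` for one-sentence-per-line input, and `-o` for tabbed output, the invocation would look something like:

```
# hypothetical reconstruction, not the exact command from the comment
frog -n --skip=cnp -t nlcow14ax_all_clean_martijn_36.txt -o nlcow14ax_all_clean_martijn_36.frog.out
```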
This tabbed format can be converted to JSON or a spreadsheet quite easily. The process is now at line 200,000, so there is still a lot to do, but memory usage is around 1 GB and stable, so no worries there.
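A minimal sketch of such a conversion, assuming Frog's usual ten-column tabbed layout (one token per line, a blank line between sentences; the column labels are my own, not an official schema):

```python
import json
import sys

# Column labels for Frog's tabbed output (assumption: the usual
# ten-column layout; trailing columns may be absent when components
# such as the chunker, NER, or parser are skipped).
COLUMNS = ["index", "text", "lemma", "morph", "pos", "confidence",
           "ner", "chunk", "dep_head", "dep_rel"]

def read_sentences(path):
    """Yield one sentence at a time as a list of token dicts."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:
        yield sentence

if __name__ == "__main__":
    # Usage: python tabbed_to_json.py frog_output.tsv > frog_output.json
    json.dump(list(read_sentences(sys.argv[1])), sys.stdout,
              ensure_ascii=False, indent=2)
```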
Your Python script looks pretty clean and well set up. I was worried there might be a forking issue after Frog instantiation, but that doesn't seem to be the case: you initialise a fresh Frog instance every time, which is actually more than necessary (once per worker would do), but better safe than sorry. When it breaks, do you get any error output? A bad allocation or a segmentation fault?
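For comparison, the "once per worker" variant would look roughly like this (a sketch, not the attached script: each pool worker builds one Frog instance in an initializer, after the fork, and reuses it for all of its tasks):

```python
import multiprocessing

import frog  # python-frog binding

_worker_frog = None  # one Frog instance per worker process

def init_worker():
    """Pool initializer: runs once in each worker process, after the
    fork, so no Frog state ever crosses the fork boundary."""
    global _worker_frog
    _worker_frog = frog.Frog(frog.FrogOptions(parser=False, ner=False))

def analyse_line(line):
    return _worker_frog.process_raw(line)

if __name__ == "__main__":
    with open("input.txt", encoding="utf-8") as fh:  # placeholder name
        sentences = [l.strip() for l in fh if l.strip()]
    with multiprocessing.Pool(initializer=init_worker) as pool:
        results = pool.map(analyse_line, sentences, chunksize=100)
```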
In the beginning I instantiated Frog only once, but that did not work at all, so I ended up with this. No, I don't receive any error message; the script just freezes. If you have access to the ponies, you can run the script using some files in my tensusers directory like this:
To summarize: all "reasonable", I think.
This is also solved by taking the right approach. Closing the issue.