
Memory exhaustion when processing large files. #20

Open
GoogleCodeExporter opened this issue Aug 4, 2015 · 6 comments

Comments

@GoogleCodeExporter
Contributor

I use the following script to dump records from rather large log files. The
logs are text-based records separated by empty lines, so as you can see I'm
using the RS="" construct to separate records. Each record is around 1000
characters. I've found that when mawk processes roughly 1.6 million records it
stops processing records. At that point I can observe, using 'top', that the
RSS starts growing rapidly until the system is out of memory.

The script works as expected on smaller files. I've tested with mawk versions
from 1.3.3 up through the latest 1.3.4, and they all fail in the same fashion.

I installed gawk and it processed the entire file without problems.

Thanks,
Wayne

---------------------------------------------------------------------- 

#!/bin/sh

if [ -z "$1" ]
then
    {
        echo "No pattern specified"
        echo "Usage: $0 <pattern> [file0 file1 ... fileN]"
    } >&2
    exit 1
fi

pattern="$1"
shift

awk -v pattern="$pattern" -- '

BEGIN {
    RS=""         # paragraph mode: records are separated by blank lines
    ORS="\n\n"    # keep the blank-line separator on output
    FS="\n"       # each line within a record is one field
}

$0 ~ pattern {
    print $0
    nmatches += 1
}

END {
    print "Number of matched records: " nmatches "."
    print "Total records: " NR "."
}

' "$@"

Original issue reported on code.google.com by wcu...@gmail.com on 18 Jan 2013 at 11:19

@GoogleCodeExporter
Contributor Author

I can reproduce the issue with a smaller file,
and will look further to identify the reason for the difference
in behavior.

Original comment by dic...@his.com on 20 Jan 2013 at 8:05

  • Changed state: Accepted

@GoogleCodeExporter
Contributor Author

I'll handle this one.

Original comment by dic...@his.com on 20 Jan 2013 at 8:07

@GoogleCodeExporter
Contributor Author

Essentially what's happening is that REmatch is calling str_str
with much larger sizes than would be expected. (I'm using some of my
build logs as test data; having seen in gprof that the number of calls
from REmatch to str_str is unexpectedly high, I examined the parameters
passed to str_str.) Initially it passes ~4KB to str_str, and that
dwindles down, but once it's down to nothing, it blows up and uses the
whole file (and starts dwindling down from that point, as it did before).
While the 4KB chunks look larger than I'd guess, the blowup is what you
appear to be reporting.

Original comment by dic...@his.com on 20 Jan 2013 at 8:52
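
The profile described above can be gathered with a stock gprof workflow, roughly as follows. This is the ordinary gprof incantation, nothing mawk-specific; script.awk stands in for the failing program and big.log for the failing input, both hypothetical names:

# Build an instrumented mawk from its source tree:
./configure
make CFLAGS="-O2 -pg" LDFLAGS="-pg"

# Run it on the failing input; this writes gmon.out in the current directory:
./mawk -f script.awk big.log

# Call counts for REmatch and str_str appear in the flat profile:
gprof ./mawk gmon.out | head -40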

@mikebrennan000

I cannot reproduce this problem. Can someone post a concrete example: input file and pattern?
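
An input of the shape the original reporter describes (blank-line-separated records of roughly 1000 characters) can be generated with something like the sketch below; the record count and contents are invented for illustration, sized to exceed the ~1.6 million records at which the failure was reported:

# Writes ~2 million records of ~1000 characters each (about 2 GB) to big.log:
awk 'BEGIN {
    line = sprintf("%0100d", 0)              # one 100-character line of zeros
    for (r = 1; r <= 2000000; r++) {
        for (l = 1; l <= 10; l++)            # 10 lines => ~1000 chars per record
            print line
        print ""                             # blank line terminates the record
    }
}' > big.log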

@mikebrennan000

The REmatch/str_str behavior is what one expects to see when breaking the input buffer into records.

@mikebrennan000
Copy link

mikebrennan000 commented Sep 7, 2016

Using a mawk 1.3.4 compiled with -DNO_LEAKS added to CFLAGS, the reported behavior is easy to
duplicate, as the execution of $0 ~ pattern leaks.

Without -DNO_LEAKS, everything I've tried works fine.
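
A reproduction along these lines should exercise the leaking construct; the configure/make steps are the stock mawk build, and only -DNO_LEAKS is specific to this report (big.log is any large blank-line-separated input, e.g. from the generator above):

# Build mawk 1.3.4 with the NO_LEAKS define added:
./configure
make CFLAGS="-O2 -DNO_LEAKS"

# Paragraph-mode dynamic-regexp match, the construct reported to leak:
./mawk -v pattern='ERROR' '
BEGIN { RS=""; FS="\n" }
$0 ~ pattern { n++ }
END { print n+0 }' big.log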
