
Memory exhaustion when processing large files. #20

Open
GoogleCodeExporter opened this issue Aug 4, 2015 · 6 comments

Comments

@GoogleCodeExporter
Contributor

I use the following script to dump records from rather large log files. The
logs are text-based records separated by empty lines, so as you can see I'm
using the RS="" construct to separate records. Each record is around 1000
characters. I've found that when mawk processes roughly 1.6 million records it
stops processing records. At that point I can observe, using 'top', that the
RSS starts growing rapidly until the system is out of memory.

The script works as expected on smaller files. I've tested with mawk versions
from 1.3.3 up through the latest 1.3.4, and they all fail in the same fashion.

I installed gawk and it processed the entire file without problems.

Thanks,
Wayne

---------------------------------------------------------------------- 

#!/bin/sh

if [ -z "$1" ]
then
    {
        echo "No pattern specified"
        echo "Usage: $0 <pattern> [file0 file1 ... fileN]"
    } >&2
    exit 1
fi

pattern="$1"
shift

awk -v pattern="$pattern" -- '

BEGIN {
    RS=""         # paragraph mode: records are separated by blank lines
    ORS="\n\n"    # keep the blank-line separator on output
    FS="\n"       # each line within a record is one field
}

$0 ~ pattern {
    print $0
    nmatches += 1
}

END {
    print "Number of matched records: " nmatches "."
    print "Total records: " NR "."
}

' "$@"

Original issue reported on code.google.com by wcu...@gmail.com on 18 Jan 2013 at 11:19

@GoogleCodeExporter
Contributor Author

I can reproduce the issue with a smaller file,
and will look further to identify the reason for the difference
in behavior.

Original comment by dic...@his.com on 20 Jan 2013 at 8:05

  • Changed state: Accepted

@GoogleCodeExporter
Contributor Author

I'll handle this one.

Original comment by dic...@his.com on 20 Jan 2013 at 8:07

@GoogleCodeExporter
Contributor Author

Essentially what's happening is that REmatch is calling str_str
with much larger sizes than would be expected. (I'm using some of my
build logs as test data; having seen in gprof that the number of calls
from REmatch to str_str is unexpectedly high, I examined the parameters
passed to str_str.) Initially it passes ~4KB to str_str, and that
dwindles down, but once it's down to nothing, it blows up and uses the
whole file (and starts dwindling down from that point, as it did before).
While the 4KB chunks look larger than I'd guess, the blowup is what you
appear to be reporting.

Original comment by dic...@his.com on 20 Jan 2013 at 8:52
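
The profile described above can be gathered with a stock gprof workflow, roughly as follows. This is the ordinary gprof incantation, nothing mawk-specific; script.awk stands in for the failing program and big.log for the failing input, both hypothetical names:

# Build an instrumented mawk from its source tree:
./configure
make CFLAGS="-O2 -pg" LDFLAGS="-pg"

# Run it on the failing input; this writes gmon.out in the current directory:
./mawk -f script.awk big.log

# Call counts for REmatch and str_str appear in the flat profile:
gprof ./mawk gmon.out | head -40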

@mikebrennan000

I cannot reproduce this problem. Can someone post a concrete example: input file and pattern?
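
An input of the shape the original reporter describes (blank-line-separated records of roughly 1000 characters) can be generated with something like the sketch below; the record count and contents are invented for illustration, sized to exceed the ~1.6 million records at which the failure was reported:

# Writes ~2 million records of ~1000 characters each (about 2 GB) to big.log:
awk 'BEGIN {
    line = sprintf("%0100d", 0)              # one 100-character line of zeros
    for (r = 1; r <= 2000000; r++) {
        for (l = 1; l <= 10; l++)            # 10 lines => ~1000 chars per record
            print line
        print ""                             # blank line terminates the record
    }
}' > big.log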

@mikebrennan000

The REmatch/str_str behavior is what one expects to see when breaking the input buffer into records.

@mikebrennan000
Copy link

mikebrennan000 commented Sep 7, 2016

Using a mawk 1.3.4 compiled with -DNO_LEAKS added to CFLAGS, the reported behavior is easy to
duplicate, as the execution of $0 ~ pattern leaks.

Without -DNO_LEAKS, everything I've tried works fine.
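
A reproduction along these lines should exercise the leaking construct; the configure/make steps are the stock mawk build, and only -DNO_LEAKS is specific to this report (big.log is any large blank-line-separated input, e.g. from the generator above):

# Build mawk 1.3.4 with the NO_LEAKS define added:
./configure
make CFLAGS="-O2 -DNO_LEAKS"

# Paragraph-mode dynamic-regexp match, the construct reported to leak:
./mawk -v pattern='ERROR' '
BEGIN { RS=""; FS="\n" }
$0 ~ pattern { n++ }
END { print n+0 }' big.log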
