Fix rule causing "Complex regular subexpression recursion limit" #318

Closed
danmichaelo opened this issue Aug 14, 2017 · 8 comments

@danmichaelo

My OAI harvest crashed with the error:

2017/08/13 09:46:08 [33144] - ERROR Catmandu::Error::BUILD /usr/local/share/perl5/Catmandu/Error.pm (22) time=120659727 : Complex regular subexpression recursion limit (32766) exceeded at (eval 1249) line 113.
Complex regular subexpression recursion limit (32766) exceeded at (eval 1249) line 113.


Trace begun at (eval 1249) line 761
Catmandu::Fix::__ANON__('HASH(0x6542c40)') called at /usr/local/share/perl5/Catmandu/Fix.pm line 115
Catmandu::Fix::__ANON__('HASH(0x6542c40)') called at /usr/local/share/perl5/Catmandu/Fix.pm line 175
Catmandu::Fix::fix('Catmandu::Fix=HASH(0x2f33ee8)', 'HASH(0x6542c40)') called at harvest.pl line 117
main::process_record('HASH(0x6542c40)') called at /usr/local/share/perl5/Catmandu/Iterable.pm line 88
Catmandu::Iterable::each('Catmandu::Importer::OAI=HASH(0x3a86520)', 'CODE(0x5de1970)') called at harvest.pl line 141

Unfortunately I don't know which record or which fix rule caused the problem. My fix file is here: https://gist.github.com/danmichaelo/d52035c4204cbe2b1c21c717102c3161

I can share my harvest.pl also if needed, but here's the short version:

my $fixer = $env->fixer('marc_map.txt');

sub process_record {
    my $item = shift;
    my $fixed = $fixer->fix($item);
    ...
}
$importer->each(\&process_record);

I'll try adding a try/catch to find out which record caused this.

@danmichaelo
Author

danmichaelo commented Aug 14, 2017

Somewhat loosely related to LibreCat/Catmandu-OAI#7

@phochste
Member

I'll check it tomorrow when I'm back from vacation. Can you share the baseUrl, set, and metadataPrefix you are trying to harvest?

@danmichaelo
Author

danmichaelo commented Aug 15, 2017

Tried the following:

use Data::Dumper;
try {
    $fixed = $fixer->fix($item);
} catch {
    warn "caught error: $_";
    print Dumper($item);
    return;
};

This made the harvest continue past the bad record, but for some reason print Dumper($item); didn't print anything. Either $item is empty or undefined at that point, or Dumper doesn't work the way I expected (I'm not very familiar with Perl).
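One way to isolate the failing record might look like the sketch below. It uses core eval instead of a try/catch module and writes the dump to STDERR with warn, which sidesteps any STDOUT buffering or redirection (whether that was the cause here is not established in this thread). The fix_record sub is a hypothetical stand-in for $fixer->fix($item), and the sample records are made up for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for $fixer->fix($item): dies on one record,
# the way the real fixer died somewhere in the harvest.
sub fix_record {
    my ($item) = @_;
    die "simulated fix failure\n" if $item->{title} =~ /bad/;
    return { %$item, fixed => 1 };
}

# Returns the fixed record, or nothing if the fix threw an error.
# The failing record is dumped to STDERR before moving on.
sub process_record {
    my ($item) = @_;
    my $fixed = eval { fix_record($item) };
    if (!defined $fixed) {
        my $err = $@ || 'unknown error';
        local $Data::Dumper::Sortkeys = 1;
        warn "caught error: $err", Dumper($item);
        return;
    }
    return $fixed;
}

# Made-up sample input: one record that fixes cleanly, one that fails.
my @records = ({ title => 'good one' }, { title => 'bad one' });
my @ok = grep { defined } map { scalar process_record($_) } @records;
print scalar(@ok), " record(s) fixed\n";
```

In a real harvest, logging the record's identifier alongside the dump would make it easier to refetch just that record later.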

Repo details:

 oai_nz:
    package: OAI
    options:
      handler: marcxml
      metadataPrefix: marc21
      set: oai_komplett
      url: "http://bibsys-network.alma.exlibrisgroup.com/view/oai/47BIBSYS_NETWORK/request"

but note that the error occurs 5101200 records into the set, so it takes about 1 day and 8 hours to reach it :) If there's something I can test, let me know.

@phochste
Member

I'll run it for a while to see which resumption token crashes the harvest.

@phochste
Member

Interesting... I can boil it down to this kind of regex match, which fails:

# test2.fix
set_field(a,'The Effect of Pharmaceutical Innovation on the Functional Limitations of Elderly Americans                                                                   Evidence from the 2004 National Nursing Home Survey')
replace_all('a','((\s+\W\s*)+|\.)$','')
$ catmandu convert Null to YAML --fix test2.fix
Oops! One of your fixes threw an error...
Use of uninitialized value in concatenation (.) or string at /Users/hochsten/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/Catmandu/CLI.pm line 192.
Source:
Error: Complex regular subexpression recursion limit (32766) exceeded at (eval 162) line 1.

Input:
$VAR1 = {
          'a' => 'The Effect of Pharmaceutical Innovation on the Functional Limitations of Elderly Americans                                                                   Evidence from the 2004 National Nursing Home Survey'
        };

@phochste
Member

We checked it, and the regular expression you are using is probably the culprit. You get the same effect in plain Perl with:

use strict;
use warnings;

my $a = 'The Effect of Pharmaceutical Innovation on the Functional Limitations of Elderly Americans                                                                   Evidence from the 2004 National Nursing Home Survey';

$a =~ s{((\s+\W\s*)+|\.)$}{}g;

print "ok\n";

The problem is that the pattern (\s+\W\s*)+ is ambiguous. \W also matches whitespace, so each repetition matches one or more whitespace characters, then one non-word character (which may itself be a space), then zero or more whitespace characters, and the whole group can repeat many times. Given a long run of spaces, there is a huge number of ways to split that run across the repetitions. The Perl regular expression engine tries them one after another and gives up when it exceeds its recursion limit (32766).
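A less ambiguous pattern could look like the sketch below. This is one possible rewrite, not necessarily what the wiki script ended up using. It assumes the intent was to strip trailing punctuation that is preceded by whitespace (such as " /" or " :") and a trailing period. The names $trailing and strip_trailing are made up for illustration, and the number of spaces in the sample title is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical replacement for ((\s+\W\s*)+|\.)$
# Each repetition is whitespace followed by exactly one punctuation
# character (non-word AND non-space), so a run of spaces can only be
# consumed in one way.
my $trailing = qr/ (?: \s+ [^\w\s] )+ \s* $ | \. $ /x;

sub strip_trailing {
    my ($s) = @_;
    $s =~ s/$trailing//g;
    return $s;
}

# A title with a long run of spaces in the middle, like the one above.
my $title = 'The Effect of Pharmaceutical Innovation' . (' ' x 67)
          . 'Evidence from the 2004 National Nursing Home Survey';

print strip_trailing($title) eq $title ? "unchanged\n" : "changed\n";
print strip_trailing('Some title /'), "\n";
print strip_trailing('Some title.'), "\n";
```

Unlike the original pattern, this one does not strip a title that ends in whitespace alone. If that is wanted, adding a separate \s+$ alternative would cover it without reintroducing the nested ambiguity.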

@danmichaelo
Author

Wow, thanks a lot for checking! Funnily enough, this is a regex I took from https://github.com/LibreCat/Catmandu/wiki/Example-Fix-Script :)

@phochste
Member

@danmichaelo 😃 Oops. I'd better fix that script.

@nics closed this as completed on Sep 4, 2017