Skip to content

Commit

Permalink
Don't copy all of the match string buffer
Browse files Browse the repository at this point in the history
When a pattern matches, and that pattern contains captures (or $`, $&, $'
or /p are present), a copy is made of the whole original string, so
that $1 et al continue to hold the correct value even if the original
string is subsequently modified. This can have severe performance
penalties; for example, this code causes a 1Mb buffer to be allocated,
copied and freed a million times:

    $&;
    $x = 'x' x 1_000_000;
    1 while $x =~ /(.)/g;

This commit changes this so that, where possible, only the needed
substring of the original string is copied: in the above case, only a
1-byte buffer is copied each time. Also, it now reuses or reallocs the
buffer, rather than freeing and mallocing each time.

Now that PL_sawampersand is a 3-bit flag indicating separately whether
$`, $& and $' have been seen, they each contribute only their own
individual penalty; which ones have been seen will limit the extent to
which we can avoid copying the whole buffer.

Note that the above code *without* the $& is not currently slow, but only
because the copying is artificially disabled to avoid the performance hit.
The next but one commit will remove that hack, meaning that it will still
be fast, but will now be correct in the presence of a modified original
string.

We achieve this by by adding suboffset and subcoffset fields to the
existing subbeg and sublen fields of a regex, to indicate how many bytes
and characters have been skipped from the logical start of the string till
the physical start of the buffer. To avoid copying stuff at the end, we
just reduce sublen. For example, in this:

    "abcdefgh" =~ /(c)d/

subbeg points to a malloced buffer containing "c\0"; sublen == 1,
and suboffset == 2 (as does subcoffset).

while if $& has been seen,

subbeg points to a malloced buffer containing "cd\0"; sublen == 2,
and suboffset == 2.

If in addition $' has been seen, then

subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6,
and suboffset == 2.

The regex engine won't do this by default; there are two new flag bits,
REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with
REXEC_COPY_STR, request that the engine skip the start or end of the
buffer (it will still copy in the presence of the relevant $`, $&, $',
/p).

Only pp_match has been enhanced to use these extra flags; substitution
can't easily benefit, since the usual action of s///g is to copy the
whole string first time round, then perform subsequent matching iterations
against the copy, without further copying. So you still need to copy most
of the buffer.
  • Loading branch information
iabyn committed Sep 8, 2012
1 parent 2c7b5d7 commit 6502e08
Show file tree
Hide file tree
Showing 12 changed files with 207 additions and 27 deletions.
4 changes: 4 additions & 0 deletions dump.c
Expand Up @@ -2056,6 +2056,10 @@ Perl_do_sv_dump(pTHX_ I32 level, PerlIO *file, SV *sv, I32 nest, I32 maxnest, bo
(UV)(r->pre_prefix));
Perl_dump_indent(aTHX_ level, file, " SUBLEN = %"IVdf"\n",
(IV)(r->sublen));
Perl_dump_indent(aTHX_ level, file, " SUBOFFSET = %"IVdf"\n",
(IV)(r->suboffset));
Perl_dump_indent(aTHX_ level, file, " SUBCOFFSET = %"IVdf"\n",
(IV)(r->subcoffset));
if (r->subbeg)
Perl_dump_indent(aTHX_ level, file, " SUBBEG = 0x%"UVxf" %s\n",
PTR2UV(r->subbeg),
Expand Down
2 changes: 2 additions & 0 deletions ext/Devel-Peek/t/Peek.t
Expand Up @@ -350,6 +350,8 @@ do_test('reference to regexp',
GOFS = 0
PRE_PREFIX = 4
SUBLEN = 0
SUBOFFSET = 0
SUBCOFFSET = 0
SUBBEG = 0x0
ENGINE = $ADDR
MOTHER_RE = $ADDR
Expand Down
8 changes: 7 additions & 1 deletion mg.c
Expand Up @@ -637,6 +637,8 @@ Perl_magic_regdata_cnt(pTHX_ SV *sv, MAGIC *mg)
return (U32)-1;
}

/* @-, @+ */

int
Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
{
Expand Down Expand Up @@ -665,7 +667,9 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
if (i > 0 && RX_MATCH_UTF8(rx)) {
const char * const b = RX_SUBBEG(rx);
if (b)
i = utf8_length((U8*)b, (U8*)(b+i));
i = RX_SUBCOFFSET(rx) +
utf8_length((U8*)b,
(U8*)(b-RX_SUBOFFSET(rx)+i));
}

sv_setiv(sv, i);
Expand All @@ -675,6 +679,8 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
return 0;
}

/* @-, @+ */

int
Perl_magic_regdatum_set(pTHX_ SV *sv, MAGIC *mg)
{
Expand Down
22 changes: 19 additions & 3 deletions pod/perlreapi.pod
Expand Up @@ -555,6 +555,8 @@ values.
char *subbeg; /* saved or original string so \digit works forever. */
SV_SAVED_COPY /* If non-NULL, SV which is COW from original */
I32 sublen; /* Length of string pointed by subbeg */
I32 suboffset; /* byte offset of subbeg from logical start of str */
I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */

/* Information about the match that isn't often used */
I32 prelen; /* length of precomp */
Expand Down Expand Up @@ -695,9 +697,23 @@ occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if its worth using
the regex engine at all, and if so where in the string to search.

=head2 C<subbeg> C<sublen> C<saved_copy>

Used during execution phase for managing search and replace patterns.
=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>

Used during the execution phase for managing search and replace patterns,
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
buffer (either the original string, or a copy in the case of
C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer. The
C<RX_OFFS> start and end indices index into this buffer.

In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
can choose not to copy the full buffer (although it must still do so in
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
number of bytes from the logical start of the buffer to the physical start
(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
characters in the offset. The latter is needed to support C<@-> and C<@+>
which work in characters, not bytes.

=head2 C<wrapped> C<wraplen>

Expand Down
3 changes: 3 additions & 0 deletions pp.c
Expand Up @@ -5549,6 +5549,9 @@ PP(pp_split)
if (rex_return == 0)
break;
TAINT_IF(RX_MATCH_TAINTED(rx));
/* we never pass the REXEC_COPY_STR flag, so it should
* never get copied */
assert(!RX_MATCH_COPIED(rx));
if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) {
m = s;
s = orig;
Expand Down
13 changes: 8 additions & 5 deletions pp_ctl.c
Expand Up @@ -289,6 +289,7 @@ PP(pp_substcont)
if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) {
m = s;
s = orig;
assert(!RX_SUBOFFSET(rx));
cx->sb_orig = orig = RX_SUBBEG(rx);
s = orig + (m - s);
cx->sb_strend = s + (cx->sb_strend - m);
Expand Down Expand Up @@ -353,9 +354,9 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)

if (!p || p[1] < RX_NPARENS(rx)) {
#ifdef PERL_OLD_COPY_ON_WRITE
i = 7 + RX_NPARENS(rx) * 2;
i = 7 + (RX_NPARENS(rx)+1) * 2;
#else
i = 6 + RX_NPARENS(rx) * 2;
i = 6 + (RX_NPARENS(rx)+1) * 2;
#endif
if (!p)
Newx(p, i, UV);
Expand All @@ -364,7 +365,7 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
*rsp = (void*)p;
}

*p++ = PTR2UV(RX_MATCH_COPIED(rx) ? RX_SUBBEG(rx) : NULL);
*p++ = RX_MATCH_COPIED(rx) ? 1 : 0;
RX_MATCH_COPIED_off(rx);

#ifdef PERL_OLD_COPY_ON_WRITE
Expand All @@ -373,9 +374,10 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
#endif

*p++ = RX_NPARENS(rx);

*p++ = PTR2UV(RX_SUBBEG(rx));
*p++ = (UV)RX_SUBLEN(rx);
*p++ = (UV)RX_SUBOFFSET(rx);
*p++ = (UV)RX_SUBCOFFSET(rx);
for (i = 0; i <= RX_NPARENS(rx); ++i) {
*p++ = (UV)RX_OFFS(rx)[i].start;
*p++ = (UV)RX_OFFS(rx)[i].end;
Expand Down Expand Up @@ -403,9 +405,10 @@ S_rxres_restore(pTHX_ void **rsp, REGEXP *rx)
#endif

RX_NPARENS(rx) = *p++;

RX_SUBBEG(rx) = INT2PTR(char*,*p++);
RX_SUBLEN(rx) = (I32)(*p++);
RX_SUBOFFSET(rx) = (I32)*p++;
RX_SUBCOFFSET(rx) = (I32)*p++;
for (i = 0; i <= RX_NPARENS(rx); ++i) {
RX_OFFS(rx)[i].start = (I32)(*p++);
RX_OFFS(rx)[i].end = (I32)(*p++);
Expand Down
32 changes: 26 additions & 6 deletions pp_hot.c
Expand Up @@ -1325,9 +1325,19 @@ PP(pp_match)
appears to be quite tricky.
Test for the unsafe vars are TODO for now. */
if ( (!global && RX_NPARENS(rx))
|| SvTEMP(TARG) || SvAMAGIC(TARG) || PL_sawampersand
|| (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)))
r_flags |= REXEC_COPY_STR;
|| PL_sawampersand
|| SvTEMP(TARG)
|| SvAMAGIC(TARG)
|| (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY))
) {
r_flags |= (REXEC_COPY_STR|REXEC_COPY_SKIP_PRE);
/* in @a =~ /(.)/g, we iterate multiple times, but copy the buffer
* only on the first iteration. Therefore we need to copy $' as well
* as $&, to make the rest of the string available for captures in
* subsequent iterations */
if (! (global && gimme == G_ARRAY))
r_flags |= REXEC_COPY_SKIP_POST;
};

play_it_again:
if (global && RX_OFFS(rx)[0].start != -1) {
Expand Down Expand Up @@ -1472,6 +1482,8 @@ PP(pp_match)
if (global) {
/* FIXME - should rx->subbeg be const char *? */
RX_SUBBEG(rx) = (char *) truebase;
RX_SUBOFFSET(rx) = 0;
RX_SUBCOFFSET(rx) = 0;
RX_OFFS(rx)[0].start = s - truebase;
if (RX_MATCH_UTF8(rx)) {
char * const t = (char*)utf8_hop((U8*)s, RX_MINLENRET(rx));
Expand Down Expand Up @@ -1507,6 +1519,8 @@ PP(pp_match)
#endif
}
RX_SUBLEN(rx) = strend - t;
RX_SUBOFFSET(rx) = 0;
RX_SUBCOFFSET(rx) = 0;
RX_MATCH_COPIED_on(rx);
off = RX_OFFS(rx)[0].start = s - t;
RX_OFFS(rx)[0].end = off + RX_MINLENRET(rx);
Expand Down Expand Up @@ -2127,9 +2141,14 @@ PP(pp_subst)
pm = PL_curpm;
rx = PM_GETRE(pm);
}
r_flags = (RX_NPARENS(rx) || SvTEMP(TARG) || PL_sawampersand
|| (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)) )
? REXEC_COPY_STR : 0;

r_flags = ( RX_NPARENS(rx)
|| PL_sawampersand
|| SvTEMP(TARG)
|| (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY))
)
? REXEC_COPY_STR
: 0;

orig = m = s;
if (RX_EXTFLAGS(rx) & RXf_USE_INTUIT) {
Expand Down Expand Up @@ -2331,6 +2350,7 @@ PP(pp_subst)
if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) {
m = s;
s = orig;
assert(RX_SUBOFFSET(rx) == 0);
orig = RX_SUBBEG(rx);
s = orig + (m - s);
strend = s + (strend - m);
Expand Down
10 changes: 6 additions & 4 deletions regcomp.c
Expand Up @@ -6722,8 +6722,8 @@ Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren,
&& rx->offs[0].end != -1)
{
/* $', ${^POSTMATCH} */
s = rx->subbeg + rx->offs[0].end;
i = rx->sublen - rx->offs[0].end;
s = rx->subbeg - rx->suboffset + rx->offs[0].end;
i = rx->sublen + rx->suboffset - rx->offs[0].end;
}
else
if ( 0 <= n && n <= (I32)rx->nparens &&
Expand All @@ -6732,7 +6732,7 @@ Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren,
{
/* $&, ${^MATCH}, $1 ... */
i = t1 - s1;
s = rx->subbeg + s1;
s = rx->subbeg + s1 - rx->suboffset;
} else {
goto ret_undef;
}
Expand Down Expand Up @@ -6859,7 +6859,7 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, const SV * const sv,
}
getlen:
if (i > 0 && RXp_MATCH_UTF8(rx)) {
const char * const s = rx->subbeg + s1;
const char * const s = rx->subbeg - rx->suboffset + s1;
const U8 *ep;
STRLEN el;

Expand Down Expand Up @@ -14462,6 +14462,8 @@ Perl_save_re_context(pTHX)

PL_reg_oldsaved = NULL;
PL_reg_oldsavedlen = 0;
PL_reg_oldsavedoffset = 0;
PL_reg_oldsavedcoffset = 0;
PL_reg_maxiter = 0;
PL_reg_leftiter = 0;
PL_reg_poscache = NULL;
Expand Down

0 comments on commit 6502e08

Please sign in to comment.