Commit bd46758

mfwitten authored and Father Chrysostomos committed
[perl #90632] perlfunc: Rewrite `split'
I couldn't stand the way the documentation for `split' was written; it felt like a kludge of broken English dumped into a messy pile by several people, each of whom was unaware of the other's work. This variation completes sentences, adds new ones, rearranges ideas, expands on ideas, simplifies and unifies examples, and includes more cross references. While the original text seemed to be written in a way that touched upon the arguments in reverse order (which did have a hint of elegance), this version attempts to provide the reader with the most useful information upfront. Thanks to Brad Baxter and Thomas R. Sibley for their constructive criticism. [Modified by the committer to incorporate suggestions from Aristotle Pagaltzis and Tom Christiansen.]
1 parent 1887da8 commit bd46758

1 file changed, +114 -80 lines changed

pod/perlfunc.pod

Lines changed: 114 additions & 80 deletions
@@ -6234,117 +6234,151 @@ X<split>
 
 =item split
 
-Splits the string EXPR into a list of strings and returns that list. By
-default, empty leading fields are preserved, and empty trailing ones are
-deleted. (If all fields are empty, they are considered to be trailing.)
+Splits the string EXPR into a list of strings and returns the
+list in list context, or the size of the list in scalar context.
 
-In scalar context, returns the number of fields found.
+If only PATTERN is given, EXPR defaults to C<$_>.
 
-If EXPR is omitted, splits the C<$_> string. If PATTERN is also omitted,
-splits on whitespace (after skipping any leading whitespace). Anything
-matching PATTERN is taken to be a delimiter separating the fields. (Note
-that the delimiter may be longer than one character.)
+Anything in EXPR that matches PATTERN is taken to be a separator
+that separates the EXPR into substrings (called "I<fields>") that
+do B<not> include the separator. Note that a separator may be
+longer than one character or even have no characters at all (the
+empty string, which is a zero-width match).
+
+The PATTERN need not be constant; an expression may be used
+to specify a pattern that varies at runtime.
+
+If PATTERN matches the empty string, the EXPR is split at the match
+position (between characters). As an example, the following:
+
+    print join(':', split('b', 'abc')), "\n";
+
+uses the 'b' in 'abc' as a separator to produce the output 'a:c'.
+However, this:
+
+    print join(':', split('', 'abc')), "\n";
+
+uses empty string matches as separators to produce the output
+'a:b:c'; thus, the empty string may be used to split EXPR into a
+list of its component characters.
+
+As a special case for C<split>, the empty pattern given in
+L<match operator|perlop/"m/PATTERN/msixpodualgc"> syntax (C<//>) specifically matches the empty string, which is contrary to its usual
+interpretation as the last successful match.
+
+If PATTERN is C</^/>, then it is treated as if it used the
+L<multiline modifier|perlreref/OPERATORS> (C</^/m>), since it
+isn't much use otherwise.
+
+As another special case, C<split> emulates the default behavior of the
+command line tool B<awk> when the PATTERN is either omitted or a I<literal
+string> composed of a single space character (such as S<C<' '>> or
+S<C<"\x20">>, but not e.g. S<C</ />>). In this case, any leading
+whitespace in EXPR is removed before splitting occurs, and the PATTERN is
+instead treated as if it were C</\s+/>; in particular, this means that
+I<any> contiguous whitespace (not just a single space character) is used as
+a separator. However, this special treatment can be avoided by specifying
+the pattern S<C</ />> instead of the string S<C<" ">>, thereby allowing
+only a single space character to be a separator.
+
+If omitted, PATTERN defaults to a single space, S<C<" ">>, triggering
+the previously described I<awk> emulation.
 
 If LIMIT is specified and positive, it represents the maximum number
-of fields the EXPR will be split into, though the actual number of
-fields returned depends on the number of times PATTERN matches within
-EXPR. If LIMIT is unspecified or zero, trailing null fields are
-stripped (which potential users of C<pop> would do well to remember).
-If LIMIT is negative, it is treated as if an arbitrarily large LIMIT
-had been specified. Note that splitting an EXPR that evaluates to the
-empty string always returns the empty list, regardless of the LIMIT
-specified.
+of fields into which the EXPR may be split; in other words, LIMIT is
+one greater than the maximum number of times EXPR may be split. Thus,
+the LIMIT value C<1> means that EXPR may be split a maximum of zero
+times, producing a maximum of one field (namely, the entire value of
+EXPR). For instance:
 
-A pattern matching the empty string (not to be confused with
-an empty pattern C<//>, which is just one member of the set of patterns
-matching the empty string), splits EXPR into individual
-characters. For example:
+    print join(':', split(//, 'abc', 1)), "\n";
 
-    print join(':', split(/ */, 'hi there')), "\n";
+produces the output 'abc', and this:
 
-produces the output 'h:i:t:h:e:r:e'.
+    print join(':', split(//, 'abc', 2)), "\n";
 
-As a special case for C<split>, the empty pattern C<//> specifically
-matches the empty string; this is not be confused with the normal use
-of an empty pattern to mean the last successful match. So to split
-a string into individual characters, the following:
+produces the output 'a:bc', and each of these:
 
-    print join(':', split(//, 'hi there')), "\n";
+    print join(':', split(//, 'abc', 3)), "\n";
+    print join(':', split(//, 'abc', 4)), "\n";
 
-produces the output 'h:i: :t:h:e:r:e'.
+produces the output 'a:b:c'.
 
-Empty leading fields are produced when there are positive-width matches at
-the beginning of the string; a zero-width match at the beginning of
-the string does not produce an empty field. For example:
+If LIMIT is negative, it is treated as if it were instead arbitrarily
+large; as many fields as possible are produced.
 
-    print join(':', split(/(?=\w)/, 'hi there!'));
+If LIMIT is omitted (or, equivalently, zero), then it is usually
+treated as if it were instead negative but with the exception that
+trailing empty fields are stripped (empty leading fields are always
+preserved); if all fields are empty, then all fields are considered to
+be trailing (and are thus stripped in this case). Thus, the following:
 
-produces the output 'h:i :t:h:e:r:e!'. Empty trailing fields, on the other
-hand, are produced when there is a match at the end of the string (and
-when LIMIT is given and is not 0), regardless of the length of the match.
-For example:
+    print join(':', split(',', 'a,b,c,,,')), "\n";
 
-    print join(':', split(//, 'hi there!', -1)), "\n";
-    print join(':', split(/\W/, 'hi there!', -1)), "\n";
+produces the output 'a:b:c', but the following:
 
-produce the output 'h:i: :t:h:e:r:e:!:' and 'hi:there:', respectively,
-both with an empty trailing field.
+    print join(':', split(',', 'a,b,c,,,', -1)), "\n";
 
-The LIMIT parameter can be used to split a line partially
+produces the output 'a:b:c:::'.
 
-    ($login, $passwd, $remainder) = split(/:/, $_, 3);
+In time-critical applications, it is worthwhile to avoid splitting
+into more fields than necessary. Thus, when assigning to a list,
+if LIMIT is omitted (or zero), then LIMIT is treated as though it
+were one larger than the number of variables in the list; for the
+following, LIMIT is implicitly 4:
 
-When assigning to a list, if LIMIT is omitted, or zero, Perl supplies
-a LIMIT one larger than the number of variables in the list, to avoid
-unnecessary work. For the list above LIMIT would have been 4 by
-default. In time critical applications it behooves you not to split
-into more fields than you really need.
+    ($login, $passwd, $remainder) = split(/:/);
 
-If the PATTERN contains parentheses, additional list elements are
-created from each matching substring in the delimiter.
+Note that splitting an EXPR that evaluates to the empty string always
+produces zero fields, regardless of the LIMIT specified.
 
-    split(/([,-])/, "1-10,20", 3);
+An empty leading field is produced when there is a positive-width
+match at the beginning of EXPR. For instance:
 
-produces the list value
+    print join(':', split(/ /, ' abc')), "\n";
 
-    (1, '-', 10, ',', 20)
+produces the output ':abc'. However, a zero-width match at the
+beginning of EXPR never produces an empty field, so that:
 
-If you had the entire header of a normal Unix email message in $header,
-you could split it up into fields and their values this way:
+    print join(':', split(//, ' abc'));
 
-    $header =~ s/\n(?=\s)//g; # fix continuation lines
-    %hdrs = (UNIX_FROM => split /^(\S*?):\s*/m, $header);
+produces the output S<' :a:b:c'> (rather than S<': :a:b:c'>).
 
-The pattern C</PATTERN/> may be replaced with an expression to specify
-patterns that vary at runtime. (To do runtime compilation only once,
-use C</$variable/o>.)
+An empty trailing field, on the other hand, is produced when there is a
+match at the end of EXPR, regardless of the length of the match
+(of course, unless a non-zero LIMIT is given explicitly, such fields are
+removed, as in the last example). Thus:
 
-As a special case, specifying a PATTERN of space (S<C<' '>>) will split on
-white space just as C<split> with no arguments does. Thus, S<C<split(' ')>> can
-be used to emulate B<awk>'s default behavior, whereas S<C<split(/ /)>>
-will give you as many initial null fields (empty string) as there are leading spaces.
-A C<split> on C</\s+/> is like a S<C<split(' ')>> except that any leading
-whitespace produces a null first field. A C<split> with no arguments
-really does a S<C<split(' ', $_)>> internally.
+    print join(':', split(//, ' abc', -1)), "\n";
 
-A PATTERN of C</^/> is treated as if it were C</^/m>, since it isn't
-much use otherwise.
+produces the output S<' :a:b:c:'>.
 
-Example:
+If the PATTERN contains
+L<capturing groups|perlretut/Grouping things and hierarchical matching>,
+then for each separator, an additional field is produced for each substring
+captured by a group (in the order in which the groups are specified,
+as per L<backreferences|perlretut/Backreferences>); if any group does not
+match, then it captures the C<undef> value instead of a substring. Also,
+note that any such additional field is produced whenever there is a
+separator (that is, whenever a split occurs), and such an additional field
+does B<not> count towards the LIMIT. Consider the following expressions
+evaluated in list context (each returned list is provided in the associated
+comment):
 
-    open(PASSWD, '/etc/passwd');
-    while (<PASSWD>) {
-        chomp;
-        ($login, $passwd, $uid, $gid,
-         $gcos, $home, $shell) = split(/:/);
-        #...
-    }
+    split(/-|,/, "1-10,20", 3)
+    # ('1', '10', '20')
+
+    split(/(-|,)/, "1-10,20", 3)
+    # ('1', '-', '10', ',', '20')
+
+    split(/-|(,)/, "1-10,20", 3)
+    # ('1', undef, '10', ',', '20')
 
-As with regular pattern matching, any capturing parentheses that are not
-matched in a C<split()> will be set to C<undef> when returned:
+    split(/(-)|,/, "1-10,20", 3)
+    # ('1', '-', '10', undef, '20')
 
-    @fields = split /(A)|B/, "1A2B3";
-    # @fields is (1, 'A', 2, undef, 3)
+    split(/(-)|(,)/, "1-10,20", 3)
+    # ('1', '-', undef, '10', undef, ',', '20')
 
 =item sprintf FORMAT, LIST
 X<sprintf>

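The behaviours the rewritten text documents can be spot-checked with a short script. The following is a minimal sketch and is not part of the commit: the helper name `show` and the hash `%got` are hypothetical names introduced here for illustration. Each expected value in the comments is taken from the examples in the new documentation, apart from the awk-emulation case, which follows from the description of the `' '` pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: join fields with ':' so that empty fields remain
# visible, and render undef (from unmatched capture groups) as 'undef'.
sub show {
    return join ':', map { defined $_ ? $_ : 'undef' } @_;
}

my %got = (
    # A literal separator is not included in the fields.
    literal      => show(split('b', 'abc')),              # a:c
    # An empty-string match splits between characters.
    empty        => show(split('', 'abc')),               # a:b:c
    # LIMIT 2 allows at most one split, so at most two fields.
    limit_2      => show(split(//, 'abc', 2)),            # a:bc
    # With LIMIT omitted, trailing empty fields are stripped.
    stripped     => show(split(',', 'a,b,c,,,')),         # a:b:c
    # With a negative LIMIT, trailing empty fields are kept.
    kept         => show(split(',', 'a,b,c,,,', -1)),     # a:b:c:::
    # The string ' ' triggers awk emulation: leading whitespace is
    # dropped and any run of whitespace acts as one separator.
    awk          => show(split(' ', '  a  b ')),          # a:b
    # The pattern / / is not special, so a leading space yields an
    # empty leading field.
    single_space => show(split(/ /, ' abc')),             # :abc
    # A capture group that does not take part in a match yields undef.
    captures     => show(split(/(-)|,/, '1-10,20', 3)),   # 1:-:10:undef:20
);

print "$_ => '$got{$_}'\n" for sort keys %got;
```

Collecting the results in a hash keeps every call in list context, so the implicit LIMIT that applies when assigning to a list of scalars does not affect any of the cases.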