Skip to content

Commit

Permalink
Docs: Emphasize SvPVbyte and SvPVutf8 over SvPV. This updates
Browse files Browse the repository at this point in the history
perlguts, perlxs, perlxstut, and perlapi.

Issue #18600
  • Loading branch information
FGasper committed Apr 9, 2021
1 parent 9915526 commit 0ea2594
Show file tree
Hide file tree
Showing 5 changed files with 154 additions and 29 deletions.
8 changes: 4 additions & 4 deletions dist/ExtUtils-ParseXS/lib/perlxs.pod
Expand Up @@ -603,7 +603,7 @@ and C<$type> can be used as in typemaps.

bool_t
rpcb_gettime(host,timep)
char *host = (char *)SvPV_nolen($arg);
char *host = (char *)SvPVbyte_nolen($arg);
time_t &timep = 0;
OUTPUT:
timep
Expand All @@ -630,7 +630,7 @@ Here's a truly obscure example:
bool_t
rpcb_gettime(host,timep)
time_t &timep; /* \$v{timep}=@{[$v{timep}=$arg]} */
char *host + SvOK($v{timep}) ? SvPV_nolen($arg) : NULL;
char *host + SvOK($v{timep}) ? SvPVbyte_nolen($arg) : NULL;
OUTPUT:
timep

Expand Down Expand Up @@ -993,7 +993,7 @@ The XS code, with ellipsis, follows.
char *host = "localhost";
CODE:
if( items > 1 )
host = (char *)SvPV_nolen(ST(1));
host = (char *)SvPVbyte_nolen(ST(1));
RETVAL = rpcb_gettime( host, &timep );
OUTPUT:
timep
Expand Down Expand Up @@ -1294,7 +1294,7 @@ prototypes.
char *host = "localhost";
CODE:
if( items > 1 )
host = (char *)SvPV_nolen(ST(1));
host = (char *)SvPVbyte_nolen(ST(1));
RETVAL = rpcb_gettime( host, &timep );
OUTPUT:
timep
Expand Down
3 changes: 2 additions & 1 deletion dist/ExtUtils-ParseXS/lib/perlxstut.pod
Expand Up @@ -1143,7 +1143,8 @@ Mytest.xs:
for (n = 0; n <= numpaths; n++) {
HV * rh;
STRLEN l;
char * fn = SvPV(*av_fetch((AV *)SvRV(paths), n, 0), l);
SV * path = *av_fetch((AV *)SvRV(paths), n, 0);
char * fn = SvPVbyte(path, l);

i = statfs(fn, &buf);
if (i != 0) {
Expand Down
9 changes: 7 additions & 2 deletions pod/perldelta.pod
Expand Up @@ -143,8 +143,13 @@ XXX

=head1 Documentation

XXX Changes to files in F<pod/> go here. Consider grouping entries by
file and be sure to link to the appropriate page, e.g. L<perlfunc>.
L<perlguts> now explains in greater detail the need to consult SvUTF8
when calling SvPV (or variants). A new "How do I pass a Perl string to a C
library?" section in the same document discusses when to use which style of
macro to read an SV's string value.

L<perlapi>, L<perlguts>, L<perlxs>, and L<perlxstut> now prefer SvPVbyte
over SvPV.

=head2 New Documentation

Expand Down
147 changes: 128 additions & 19 deletions pod/perlguts.pod
Expand Up @@ -153,37 +153,86 @@ Perl's own functions typically add a trailing C<NUL> for this reason.
Nevertheless, you should be very careful when you pass a string stored
in an SV to a C function or system call.

To access the actual value that an SV points to, you can use the macros:

SvIV(SV*)
SvUV(SV*)
SvNV(SV*)
SvPV(SV*, STRLEN len)
SvPV_nolen(SV*)

which will automatically coerce the actual scalar type into an IV, UV, double,
or string.

In the C<SvPV> macro, the length of the string returned is placed into the
variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do
not care what the length of the data is, use the C<SvPV_nolen> macro.
Historically the C<SvPV> macro with the global variable C<PL_na> has been
used in this case. But that can be quite inefficient because C<PL_na> must
To access the actual value that an SV points to, Perl's API exposes
several macros that coerce the actual scalar type into an IV, UV, double,
or string:

=over

=item * C<SvIV(SV*)> (C<IV>) and C<SvUV(SV*)> (C<UV>)

=item * C<SvNV(SV*)> (C<double>)

=item * Strings are a bit complicated:

=over

=item * Byte string: C<SvPVbyte(SV*, STRLEN len)> or C<SvPVbyte_nolen(SV*)>

If the Perl string is C<"\xff\xff">, then this returns a 2-byte C<char*>.

This is suitable for Perl strings that represent bytes.

=item * UTF-8 string: C<SvPVutf8(SV*, STRLEN len)> or C<SvPVutf8_nolen(SV*)>

If the Perl string is C<"\xff\xff">, then this returns a 4-byte C<char*>.

This is suitable for Perl strings that represent characters.

B<CAVEAT>: That C<char*> will be encoded via Perl's internal UTF-8 variant,
which means that if the SV contains non-Unicode code points (e.g.,
0x110000), then the result may contain extensions over valid UTF-8.
See L<perlapi/is_strict_utf8_string> for some methods Perl gives
you to check the UTF-8 validity of these macros' returns.

=item * You can also use C<SvPV(SV*, STRLEN len)> or C<SvPV_nolen(SV*)>
to fetch the SV's raw internal buffer. This is tricky, though; if your Perl
string
is C<"\xff\xff">, then depending on the SV's internal encoding you might get
back a 2-byte B<OR> a 4-byte C<char*>.
Moreover, if it's the 4-byte string, that could come from either Perl
C<"\xff\xff"> stored UTF-8 encoded, or Perl C<"\xc3\xbf\xc3\xbf"> stored
as raw octets. To differentiate between these you B<MUST> look up the
SV's UTF8 bit (cf. C<SvUTF8>) to know whether the source Perl string
is 2 characters (C<SvUTF8> would be on) or 4 characters (C<SvUTF8> would be
off).

B<IMPORTANT:> Use of C<SvPV>, C<SvPV_nolen>, or
similarly-named macros I<without> looking up the SV's UTF8 bit is
almost certainly a bug if non-ASCII input is allowed.

When the UTF8 bit is on, the same B<CAVEAT> about UTF-8 validity applies
here as for C<SvPVutf8>.

=back

(See L</How do I pass a Perl string to a C library?> for more details.)

In C<SvPVbyte>, C<SvPVutf8>, and C<SvPV>, the length of the C<char*> returned
is placed into the
variable C<len> (these are macros, so you do I<not> use C<&len>). If you do
not care what the length of the data is, use C<SvPVbyte_nolen>,
C<SvPVutf8_nolen>, or C<SvPV_nolen> instead.
The global variable C<PL_na> can also be given to
C<SvPVbyte>/C<SvPVutf8>/C<SvPV>
in this case. But that can be quite inefficient because C<PL_na> must
be accessed in thread-local storage in threaded Perl. In any case, remember
that Perl allows arbitrary strings of data that may both contain NULs and
might not be terminated by a C<NUL>.

Also remember that C doesn't allow you to safely say C<foo(SvPV(s, len),
Also remember that C doesn't allow you to safely say C<foo(SvPVbyte(s, len),
len);>. It might work with your
compiler, but it won't work for everyone.
Break this sort of statement up into separate assignments:

SV *s;
STRLEN len;
char *ptr;
ptr = SvPV(s, len);
ptr = SvPVbyte(s, len);
foo(ptr, len);

=back

If you want to know if the scalar value is TRUE, you can use:

SvTRUE(SV*)
Expand All @@ -200,7 +249,7 @@ add space for the trailing C<NUL> byte (perl's own string functions typically do
C<SvGROW(sv, len + 1)>).

If you want to write to an existing SV's buffer and set its value to a
string, use SvPV_force() or one of its variants to force the SV to be
string, use SvPVbyte_force() or one of its variants to force the SV to be
a PV. This will remove any of various types of non-stringness from
the SV while preserving the content of the SV in the PV. This can be
used, for example, to append data from an API function to a buffer
Expand Down Expand Up @@ -3243,6 +3292,66 @@ There is no published API for dealing with this, as it is subject to
change, but you can look at the code for C<pp_lc> in F<pp.c> for an
example as to how it's currently done.

=head2 How do I pass a Perl string to a C library?

A Perl string, conceptually, is an opaque sequence of code points.
Many C libraries expect their inputs to be "classical" C strings, which are
arrays of octets 1-255, terminated with a NUL byte. Your job when writing
an interface between Perl and a C library is to define the mapping between
Perl and that library.

Generally speaking, C<SvPVbyte> and related macros suit this task well.
These assume that your Perl string is a "byte string", i.e., is either
raw, undecoded input into Perl or is pre-encoded to, e.g., UTF-8.

Alternatively, if your C library expects UTF-8 text, you can use
C<SvPVutf8> and related macros. This has the same effect as encoding
to UTF-8 then calling the corresponding C<SvPVbyte>-related macro.

Some C libraries may expect other encodings (e.g., UTF-16LE). To give
Perl strings to such libraries
you must either do that encoding in Perl then use C<SvPVbyte>, or
use an intermediary C library to convert from however Perl stores the
string to the desired encoding.

Take care also that NULs in your Perl string don't confuse the C
library. If possible, give the string's length to the C library; if that's
not possible, consider rejecting strings that contain NUL bytes.

=head3 What about C<SvPV>, C<SvPV_nolen>, etc.?

Consider a 3-character Perl string C<$foo = "\x64\x78\x8c">.
Perl can store these 3 characters either of two ways:

=over

=item * bytes: 0x64 0x78 0x8c

=item * UTF-8: 0x64 0x78 0xc2 0x8c

=back

Now let's say you convert C<$foo> to a C string thus:

STRLEN strlen;
char *str = SvPV(foo_sv, strlen);

At this point C<str> could point to a 3-byte C string or a 4-byte one.

Generally speaking, we want C<str> to be the same regardless of how
Perl stores C<$foo>, so the ambiguity here is undesirable. C<SvPVbyte>
and C<SvPVutf8> solve that by giving predictable output: use
C<SvPVbyte> if your C library expects byte strings, or C<SvPVutf8>
if it expects UTF-8.

If your C library happens to support both encodings, then C<SvPV>--always
in tandem with lookups to C<SvUTF8>!--may be safe and (slightly) more
efficient.

B<TESTING> B<TIP:> Use L<utf8>'s C<upgrade> and C<downgrade> functions
in your tests to ensure consistent handling regardless of Perl's
internal encoding.

=head2 How do I convert a string to UTF-8?

If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
Expand Down
16 changes: 13 additions & 3 deletions sv.h
Expand Up @@ -801,7 +801,9 @@ compiler will complain if you were to try to modify the contents of the string,
(unless you cast away const yourself).
=for apidoc Am|STRLEN|SvCUR|SV* sv
Returns the length of the string which is in the SV. See C<L</SvLEN>>.
Returns the length, in bytes, of the PV inside the SV.
Note that this may not match Perl's C<length>; for that, use
C<sv_len_utf8(sv)>. See C<L</SvLEN>> also.
=for apidoc Am|STRLEN|SvLEN|SV* sv
Returns the size of the string buffer in the SV, not including any part
Expand Down Expand Up @@ -855,8 +857,8 @@ Set the value of the MAGIC pointer in C<sv> to val. See C<L</SvIV_set>>.
Set the value of the STASH pointer in C<sv> to val. See C<L</SvIV_set>>.
=for apidoc Am|void|SvCUR_set|SV* sv|STRLEN len
Set the current length of the string which is in the SV. See C<L</SvCUR>>
and C<SvIV_set>>.
Sets the current length, in bytes, of the C string which is in the SV.
See C<L</SvCUR>> and C<SvIV_set>>.
=for apidoc Am|void|SvLEN_set|SV* sv|STRLEN len
Set the size of the string buffer for the SV. See C<L</SvLEN>>.
Expand Down Expand Up @@ -1657,6 +1659,14 @@ see C<L</SvPV_force>>.
The differences between the forms are:
The forms with neither C<byte> nor C<utf8> in their names (e.g., C<SvPV> or
C<SvPV_nolen>) can expose the SV's internal string buffer. If
that buffer consists entirely of bytes 0-255 and includes any bytes above
127, then you B<MUST> consult C<SvUTF8> to determine the actual code points
the string is meant to contain. Generally speaking, it is probably safer to
prefer C<SvPVbyte>, C<SvPVutf8>, and the like. See
L<perlguts/How do I pass a Perl string to a C library?> for more details.
The forms with C<flags> in their names allow you to use the C<flags> parameter
to specify to process 'get' magic (by setting the C<SV_GMAGIC> flag) or to skip
'get' magic (by clearing it). The other forms process 'get' magic, except for
Expand Down

0 comments on commit 0ea2594

Please sign in to comment.