Feature request: raw data API for PyString #1776

Closed
indygreg opened this issue Aug 10, 2021 · 1 comment · Fixed by #1794

Comments

@indygreg
Contributor

PyUnicode internally stores its data in one of several representations. See https://docs.python.org/3/c-api/unicode.html.

PyO3's PyString currently only allows you to get at UTF-8 / Rust str compatible variations of the data.

rust-cpython - by contrast - exposes a PyString.data() returning a PyStringData enum:

pub enum PyStringData<'a> {
    Latin1(&'a [u8]),
    Utf8(&'a [u8]),
    Utf16(&'a [u16]),
    Utf32(&'a [u32]),
}

This API enables Rust to have access to the raw bytes backing a Python string, not the UTF-8 normalization of it (if different).

PyOxidizer was relying on this API for testing. (There are some low-level tests around encoding handling that need to verify exact byte sequences and Python string representations are being handled properly.)

While I'm certainly capable of using unsafe Python C APIs to get at the raw string data to close this feature gap, I was curious if PyO3 would be interested in a PR to expose a PyStringData enumeration for PyString instances. Here is my proposal:

  1. PyString gains a pub fn data(&self) -> PyStringData<'_>
  2. PyStringData is an enum with a variant for each internal Python string variation.
  3. PyString.data() calls out to PyUnicode_READY() + PyUnicode_{KIND, DATA, GET_LENGTH} and constructs a PyStringData with a slice.
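
Step 3 could be sketched roughly as follows. This is only an illustration, not the eventual PyO3 code: `RawStringData` and `raw_string_data` are hypothetical names, and the `kind`, `data` and `len` arguments stand in for the results of `PyUnicode_KIND`, `PyUnicode_DATA` and `PyUnicode_GET_LENGTH` so that the snippet needs nothing outside the standard library. The kind values 1, 2 and 4 are the values of CPython's `PyUnicode_1BYTE_KIND`, `PyUnicode_2BYTE_KIND` and `PyUnicode_4BYTE_KIND`.

```rust
use std::os::raw::c_void;
use std::slice;

// Values of CPython's PyUnicode_{1,2,4}BYTE_KIND constants.
const KIND_1BYTE: u32 = 1;
const KIND_2BYTE: u32 = 2;
const KIND_4BYTE: u32 = 4;

/// Hypothetical stand-in for the proposed enum: one variant per storage width.
#[derive(Debug, PartialEq)]
pub enum RawStringData<'a> {
    OneByte(&'a [u8]),
    TwoByte(&'a [u16]),
    FourByte(&'a [u32]),
}

/// Build a typed slice from a (kind, data pointer, length) triple.
///
/// Returns `None` for a kind this sketch does not know about.
///
/// # Safety
/// `data` must point to `len` initialized, properly aligned code units of
/// the width implied by `kind`, and they must stay alive and unmodified
/// for the lifetime `'a`.
pub unsafe fn raw_string_data<'a>(
    kind: u32,
    data: *const c_void,
    len: usize,
) -> Option<RawStringData<'a>> {
    match kind {
        KIND_1BYTE => Some(RawStringData::OneByte(unsafe {
            slice::from_raw_parts(data as *const u8, len)
        })),
        KIND_2BYTE => Some(RawStringData::TwoByte(unsafe {
            slice::from_raw_parts(data as *const u16, len)
        })),
        KIND_4BYTE => Some(RawStringData::FourByte(unsafe {
            slice::from_raw_parts(data as *const u32, len)
        })),
        _ => None,
    }
}
```

In a real binding the returned lifetime would be tied to the borrow of the `PyString`, so the slice cannot outlive the Python object that owns the buffer.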

I'd be willing to contribute a PR for this feature if there is interest.

indygreg added a commit to indygreg/pyo3 that referenced this issue Aug 10, 2021
pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (PyO3#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.

This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.

I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate `#define`s as inline functions.

The structs are a bit wonky and probably warrant the most review
scrutiny. I haven't tested this code thoroughly.

Missing symbols have been annotated with `skipped` and symbols currently
defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.
@davidhewitt
Member

davidhewitt commented Aug 10, 2021

👍 I'm OK with adding this; however, I think it's going to be quite hard. See my comments on #1777.

FWIW given that PEPs 393 & 623 are removing the "legacy" kinds (and if an object is in the legacy state, we can convert it with PyUnicode_READY), I'd vote we aim to expose just a three-member enum which matches the remaining kinds exposed:

pub enum PyStringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

There's perhaps room for an Ascii variant too. (A special-case of Ucs1 when the ascii flag is set.)
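
To illustrate why the ASCII case is interesting: ASCII-only one-byte data is already valid UTF-8, so it can be borrowed as a `&str` with no copy. The sketch below is an assumption-laden illustration, not PyO3 code. The method name `as_ascii_str` is hypothetical, and it scans the bytes with `is_ascii()` instead of reading the ascii flag from the object's state, which is what a real implementation would more likely do.

```rust
/// The three-member enum suggested above, reproduced so the sketch compiles
/// on its own.
#[derive(Debug, Clone, Copy)]
pub enum PyStringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

impl<'a> PyStringData<'a> {
    /// Hypothetical helper: borrow the data as `&str` without copying when
    /// it is one-byte storage containing only ASCII.
    ///
    /// Assumption: the check scans the bytes; it does not consult CPython's
    /// ascii flag.
    pub fn as_ascii_str(&self) -> Option<&'a str> {
        match *self {
            // ASCII is a strict subset of UTF-8, so from_utf8 succeeds here.
            PyStringData::Ucs1(bytes) if bytes.is_ascii() => std::str::from_utf8(bytes).ok(),
            _ => None,
        }
    }
}
```

Non-ASCII `Ucs1` data (code points U+0080 to U+00FF) is not valid UTF-8 as stored, which is why it falls through to `None` here and needs a transcoding step instead.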

indygreg added a commit to indygreg/pyo3 that referenced this issue Aug 11, 2021
pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (PyO3#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.

This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.

I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate `#define`s as inline functions.

Missing symbols have been annotated with `skipped` and symbols currently
defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.

The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't
support. So we've provided accessor functions for retrieving these
fields' values. No setter functions are present because you shouldn't
be modifying these values from Rust code.

Tests of the bitfield APIs and macro implementations have been added.
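
As a rough illustration of the accessor approach, the bitfield can be treated as a plain `u32` and unpacked with shifts and masks. The sketch below is not the code from the commit: `AsciiState` is a hypothetical wrapper, and the bit positions assume CPython 3.9's field order (`interned:2`, `kind:3`, `compact:1`, `ascii:1`, `ready:1`) with fields allocated starting from the least significant bit, as common compilers do on little-endian targets. C leaves bitfield layout to the implementation, so other targets may differ.

```rust
/// Hypothetical read-only view of the `state` bitfield, stored as a raw u32.
///
/// Assumed layout (least significant bit first):
/// interned: bits 0-1, kind: bits 2-4, compact: bit 5, ascii: bit 6, ready: bit 7.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AsciiState(pub u32);

impl AsciiState {
    /// Interning state (2 bits).
    pub fn interned(self) -> u32 {
        self.0 & 0b11
    }

    /// Storage kind (3 bits): 1, 2 or 4 bytes per code unit once ready.
    pub fn kind(self) -> u32 {
        (self.0 >> 2) & 0b111
    }

    /// 1 if the character data is stored inline after the object header.
    pub fn compact(self) -> u32 {
        (self.0 >> 5) & 1
    }

    /// 1 if the string contains only ASCII code points.
    pub fn ascii(self) -> u32 {
        (self.0 >> 6) & 1
    }

    /// 1 if the object is in its canonical ("ready") representation.
    pub fn ready(self) -> u32 {
        (self.0 >> 7) & 1
    }
}
```

Only getters are defined, matching the read-only intent described in the commit message.
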
indygreg added a commit to indygreg/pyo3 that referenced this issue Aug 14, 2021
pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (PyO3#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.

This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.

I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate `#define`s as inline functions.

Missing symbols have been annotated with `skipped` and symbols currently
defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.

The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't
support. So we've provided accessor functions for retrieving these
fields' values. No setter functions are present because you shouldn't
be modifying these values from Rust code.

Tests of the bitfield APIs and macro implementations have been added.
indygreg added a commit to indygreg/pyo3 that referenced this issue Aug 15, 2021
With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.

This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.

Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a `Cow<str>`.

Closes PyO3#1776.
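
One way the coercion to `Cow<str>` could work is sketched below. This is a standalone illustration under stated assumptions, not the code from the commit: `StringData` and `to_cow` are hypothetical names, and errors are collapsed to `None` for brevity. One-byte data is interpreted as code points U+0000 to U+00FF (the PEP 393 one-byte representation), so only the ASCII case can be borrowed; everything else is transcoded into an owned `String`.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the raw-data enum discussed in this issue.
#[derive(Debug, Clone, Copy)]
pub enum StringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

impl<'a> StringData<'a> {
    /// Convert to UTF-8, borrowing when no transcoding is needed.
    /// Returns `None` if the data contains something that is not a valid
    /// Unicode scalar value (for example a lone surrogate).
    pub fn to_cow(&self) -> Option<Cow<'a, str>> {
        match *self {
            StringData::Ucs1(bytes) => {
                if bytes.is_ascii() {
                    // ASCII is already valid UTF-8: borrow, no allocation.
                    std::str::from_utf8(bytes).ok().map(Cow::Borrowed)
                } else {
                    // Each byte is a code point in U+0000..=U+00FF.
                    Some(Cow::Owned(bytes.iter().map(|&b| b as char).collect::<String>()))
                }
            }
            // Decoded as UTF-16; unpaired surrogates make this fail.
            StringData::Ucs2(units) => String::from_utf16(units).ok().map(Cow::Owned),
            // Each element must be a valid Unicode scalar value.
            StringData::Ucs4(points) => points
                .iter()
                .map(|&cp| char::from_u32(cp))
                .collect::<Option<String>>()
                .map(Cow::Owned),
        }
    }
}
```

Returning a `Cow` keeps the common ASCII case allocation-free while still covering the wider representations.
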
davidhewitt pushed a commit that referenced this issue Aug 15, 2021
pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.

This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.

I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate `#define`s as inline functions.

Missing symbols have been annotated with `skipped` and symbols currently
defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.

The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't
support. So we've provided accessor functions for retrieving these
fields' values. No setter functions are present because you shouldn't
be modifying these values from Rust code.

Tests of the bitfield APIs and macro implementations have been added.
indygreg added a commit to indygreg/pyo3 that referenced this issue Aug 19, 2021
With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.

This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.

Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a `Cow<str>`.

Closes PyO3#1776.
indygreg added a commit to indygreg/pyo3 that referenced this issue Aug 20, 2021
With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.

This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.

Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a `Cow<str>`.

Closes PyO3#1776.
davidhewitt pushed a commit that referenced this issue Aug 21, 2021
With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.

This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.

Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a `Cow<str>`.

Closes #1776.