Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OsStr and Path conversions #1379

Merged
merged 1 commit into from
Feb 14, 2021
Merged

OsStr and Path conversions #1379

merged 1 commit into from
Feb 14, 2021

Conversation

kangalio
Copy link
Contributor

@kangalio kangalio commented Jan 10, 2021

Implements conversions between OsStr/OsString/Path/PathBuf and Python strings: #1377

This is my first PR on this project and my first interaction with the codebase

This PR is a draft because in several places of the code I'm not sure if I did the right thing. Those places are marked with a // HELP comment. I'd greatly appreciate if an experienced PyO3 maintainer could check those places.

@kangalio kangalio marked this pull request as ready for review January 10, 2021 22:32
Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for this PR! This is a really great start. I've given it a thorough read and got various feedback and suggestions from my time spent in battle with the PyO3 codebase 💂.

CHANGELOG.md Outdated Show resolved Hide resolved
guide/src/conversions/tables.md Outdated Show resolved Hide resolved
src/types/mod.rs Outdated Show resolved Hide resolved
src/types/osstr.rs Outdated Show resolved Hide resolved
src/types/osstr.rs Outdated Show resolved Hide resolved
src/types/path.rs Outdated Show resolved Hide resolved
src/types/osstr.rs Outdated Show resolved Hide resolved
src/types/osstr.rs Outdated Show resolved Hide resolved
src/types/osstr.rs Outdated Show resolved Hide resolved
src/types/osstr.rs Outdated Show resolved Hide resolved
@kangalio
Copy link
Contributor Author

Wow, thank you!! That is one detailed code review. I will go over it tomorrow and implement all the suggestions

src/types/osstr.rs Outdated Show resolved Hide resolved
@davidhewitt
Copy link
Member

I was thinking this morning that we should check to understand carefully the differences in behavior on unix vs windows. If String -> OsString -> PyString gives different results on the two platforms this might trip up users.

Also pathlib.Path doesn't have a C-API, but if it gets one we probably want to make Rust paths convert back and forth with those instead of strings. Potentially worth keeping in mind.

@kangalio
Copy link
Contributor Author

I was thinking this morning that we should check to understand carefully the differences in behavior on unix vs windows. If String -> OsString -> PyString gives different results on the two platforms this might trip up users.

Also pathlib.Path doesn't have a C-API, but if it gets one we probably want to make Rust paths convert back and forth with those instead of strings. Potentially worth keeping in mind.

I think this can be taken care of using unit or integration tests. We could feed many different strings through the String -> OsString -> PyString conversion and back. This test will be run on all OSes by CI, so we know if something breaks somewhere.

Also, is there a reason you reacted confused on my previous comment? Now I am confused 😄

@kangalio
Copy link
Contributor Author

By the way, why do all Rust->Python conversions have to be duplicated? There's two traits to implement: IntoPy<PyObject> and ToPyObject and I don't really see the meaningful difference.

This is a bit awkward in the test because I essentially have to test everything twice to account for all the trait implementations

@kngwyu
Copy link
Member

kngwyu commented Jan 11, 2021

why do all Rust->Python conversions have to be duplicated?

It's a common consume/ not consume pattern in Rust, though it does not work nicely sometimes.
I think there are two main problems here (1. the generic parameter of IntoPy is not effectively used 2. we rarely need into operation), which should be discussed in the meantime.

I essentially have to test everything twice

Since they share the implementation, I don't think we need to test all things twice.

Comment on lines +90 to +93
test_roundtrip::<&Path>(py, path);
test_roundtrip::<Cow<'_, Path>>(py, Cow::Borrowed(path));
test_roundtrip::<Cow<'_, Path>>(py, Cow::Owned(path.to_path_buf()));
test_roundtrip::<PathBuf>(py, path.to_path_buf());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If .as_ref() is called at last, do we need to test all these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.as_ref() is only called to verify the end result. The conversion is done on the types as is.

@kangalio
Copy link
Contributor Author

It seems like the codecov check fail is a false positive. The places it marked as "not covered by tests" are actually covered by the tests. Perhaps codecov is getting confused by the generic trickery in the tests.


// TODO: move to other module to prevent accidentally circumventing the new function?
#[cfg(windows)]
struct DropGuard<T>(*mut T);
Copy link
Member

@kngwyu kngwyu Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's good to abstract such a specific operation.
In this struct, *mut T implicitly must be a pointer in the Python heap without any reference to Python objects, so it's unsafe to construct this object.
So I recommend making this a more limited one without new and placing this in the function scope if you like this solution more than Vec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this type is super unsafe so I'd rather it was inside the function too. That way only the function can misuse it.

@davidhewitt
Copy link
Member

davidhewitt commented Jan 11, 2021

Also, is there a reason you reacted confused on my previous comment? Now I am confused 😄

Hehe nope that's just a fat-finger error when checking on the conversation on my phone!

There's two traits to implement: IntoPy<PyObject> and ToPyObject and I don't really see the meaningful difference.

Yes; I would love to rework this pair to also potentially model fallible conversions, but to avoid ecosystem churn until we're really sure what a better design is I think best to leave as-is!

It seems like the codecov check fail is a false positive. The places it marked as "not covered by tests" are actually covered by the tests. Perhaps codecov is getting confused by the generic trickery in the tests.

I think #[inline] functions can have issues with coverage. Probably the generics in the tests got inlined too. Don't need to worry too hard about it; it's annoying and I'd love to improve that one day, but it's fine for now.


Also, I was thinking we might want to consider whether PyString::to_string_lossy gives the same output as converting first to OsString and then calling OsStr::to_string_lossy. Users might assume they're equivalent, though I'm not sure if we should expect them to be. Perhaps we can consider changing PyString::to_string_lossy implementation in this PR.

@konstin
Copy link
Member

konstin commented Jan 11, 2021

There's two traits to implement: IntoPy<PyObject> and ToPyObject and I don't really see the meaningful difference.

Yes; I would love to rework this pair to also potentially model fallible conversions, but to avoid ecosystem churn until we're really sure what a better design is I think best to leave as-is!

My original plan was to have IntoPy<T> and FromPy<T> mirror std's excellent Into<T> and From<T>, with everything equivalent except for an additional gil token parameter. I also removed some of the conversion traits (such as IntoPyObject and IntoPyTuple), but I never finished so the current design is kinda half-baked.

@davidhewitt
Copy link
Member

However I'm wondering why we can assume that PyUnicode_AsWideChar won't return any error in the second invocation?

Also, if I understand the Python docs correctly, the return value of PyUnicode_AsWideChar is the number of bytes read, not the total number of bytes in the string. If that is true, won't the size variable be zero and the whole code blow up?

Docs seem to be slightly misleading in this case - the implementation of PyUnicode_AsWideChar has slightly more helpful docs: https://github.com/python/cpython/blob/fb35fa49d192368e94ffec09c092260ed0fea2e1/Objects/unicodeobject.c#L3270

From that implementation it looks like that function will never raise an exception (as long as the first argument is a valid unicode object), and also that doc describes that passing NULL for the buffer causes the full string size to be returned.

@kangalio
Copy link
Contributor Author

I'm currently working on something else but I hope to continue work on this in the next few days

@kangalio
Copy link
Contributor Author

Also, I was thinking we might want to consider whether PyString::to_string_lossy gives the same output as converting first to OsString and then calling OsStr::to_string_lossy. Users might assume they're equivalent, though I'm not sure if we should expect them to be. Perhaps we can consider changing PyString::to_string_lossy implementation in this PR.

I looked at the source code of Python (unicodeobject.c) in an attempt to figure out what the "surrogatepass" error handler does exactlyy, which is used in PyO3's PyString::to_string_lossy. The Python documentation isn't helping either.

In other words, I'm not able to verify if PyString::to_string_lossy does the same thing as PyString -> OsString -> OsString::to_string_lossy, so at least I can't do it in this PR.

@kangalio
Copy link
Contributor Author

Is there anything left to work on in this PR?

Copy link
Member

@kngwyu kngwyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to adjust the set of traits provided just slightly, see other comments.


R.E. PyString::to_string_lossy - I'm going to open a separate PR for that case. I ran a quick test script:

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};
use std::borrow::Cow;
use std::error::Error;
use std::ffi::OsString;

fn main() -> PyResult<()> {
    Python::with_gil(|py| -> PyResult<()> {
        let locals = PyDict::new(py);
        py.run(
            r#"x = '\udcfa\udcfb\udcfc\udcfd\udcfe\udcff'"#,
            None,
            Some(locals),
        )?;
        let py_str = locals.get_item("x").unwrap().downcast::<PyString>()?;
        dbg!(py_str, py_str.len()?);
        let os_string: OsString = py_str.extract()?;
        dbg!(&os_string, &os_string.len());
        let string: Cow<str> = py_str.to_string_lossy();
        dbg!(&string, &string.chars().count());
        Ok(())
    })
}

The output was this:

[src/main.rs:16] py_str = '\udcfa\udcfb\udcfc\udcfd\udcfe\udcff'
[src/main.rs:16] py_str.len()? = 6
[src/main.rs:18] &os_string = "\xFA\xFB\xFC\xFD\xFE\xFF"
[src/main.rs:18] &os_string.len() = 6
[/home/david/dev/pyo3/src/types/string.rs:86] bytes = b'\xed\xb3\xba\xed\xb3\xbb\xed\xb3\xbc\xed\xb3\x
bd\xed\xb3\xbe\xed\xb3\xbf'
[src/main.rs:20] &string = "������������������"
[src/main.rs:20] &string.chars().count() = 18

It looks like the "surrogatepass" escape handler literally hands the three-byte surrogate sequences to Rust, which then results in three replacement characters being produced by std::String::from_utf8_lossy. I think this is wrong - Python thinks these surrogates are a single codepoint, but they're currently treated as three by PyString::to_string_lossy.

I think that using PyUnicode_EncodeFSDefault is probably better and will then match PyString -> OsString -> String.

}
}

impl<'a> IntoPy<PyObject> for &'a OsString {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an implementation that's not strictly necessary. What's the motivation for having it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

types/string.rs has an equivalent implementation. I oriented myself at those trait implementations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh ok.

I'm not entirely sure why that impl exists... it might be because #[pyo3(get)] on String fields needs it? Not sure.

If we can't figure out why we need this impl, I'd rather skip it for now. We can always add it later!

}
}

impl<'a> IntoPy<PyObject> for &'a PathBuf {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as for &'_ OsString.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing; I oriented myself at the trait implementations for String

}
}

impl ToPyObject for Cow<'_, OsStr> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needs IntoPy<PyObject> for Cow<'_, OsStr>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha same thing, seems like types/string.rs is missing this as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - if you're willing, can you add it there also? 🙏

I think that adding IntoPy for Cow makes sense (as it's the trait needed to be able to return this type from #[pyfunction]).

guide/src/conversions/tables.md Outdated Show resolved Hide resolved
Comment on lines 64 to 68
let fs_encoded_bytes: &crate::types::PyBytes = unsafe {
ob.py()
.from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyUnicode_EncodeFSDefault returns a new reference, so we need to own this pointer or we leak memory. I suggest using Py<PyBytes>.

Suggested change
let fs_encoded_bytes: &crate::types::PyBytes = unsafe {
ob.py()
.from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
};
let fs_encoded_bytes: Py<crate::types::PyBytes> = unsafe {
Py::from_owned_ptr(ob.py(), ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will apply the suggestion. If you have time, I would be interested to hear why you couldn't just replace from_borrowed_ptr with from_owned_ptr

Copy link
Member

@davidhewitt davidhewitt Jan 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. from_owned_ptr returns &PyAny (or another native type reference) where the owned pointer has to be stored by PyO3 in a thread-local vector. In comparison, Py (and PyObject, aka Py<PyAny>) directly hold the owned pointer inside them. This is marginally more efficient, and also means that the temporary bytes will be cleaned up immediately, instead of when PyO3 has a chance to cleanup its internal vector safely.

This has been a long thorn in PyO3's API imo - there's discussion at #1056 and #1308 where I hope to eventually remove this difference and make everything as efficient as possible!

src/conversions/path.rs Show resolved Hide resolved
@davidhewitt
Copy link
Member

I think that using PyUnicode_EncodeFSDefault is probably better and will then match PyString -> OsString -> String.

I changed my mind on this - it actually would change PyString::to_string_lossy to potentially throw exceptions, which would not be great.

Also the inconsistency does not seem to matter; I found someone else who stumbled across similar conversions at rust-lang/rust#56786 - in the end they decided it was better left as-is.

At least the current implementation of PyString::to_string_lossy is infallible and exactly consistent with what std::String::from_utf8_lossy does.

unsafe {
// This will not panic because the data from encode_wide is well-formed Windows
// string data
py.from_owned_ptr::<PyString>(ffi::PyUnicode_FromWideChar(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed that this can use PyObject::from_owned_ptr, which will remove the need for .into() also.

Suggested change
py.from_owned_ptr::<PyString>(ffi::PyUnicode_FromWideChar(
PyObject::from_owned_ptr(py, ffi::PyUnicode_FromWideChar(


// Decode from Python's lossless bytes string representation back into raw bytes
let fs_encoded_bytes: Py<crate::types::PyBytes> = unsafe {
Py::from_owned_ptr(ob.py(), ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
Copy link
Member

@davidhewitt davidhewitt Jan 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using crate::Py here may allow you to avoid the painful OS-specific import at the top.

(sorry about the -D warnings in CI - I think on balance it's useful to help keep the PyO3 code health up, even if it's a little frustrating at times 😬)

fn extract(ob: &PyAny) -> PyResult<Self> {
#[cfg(not(windows))]
{
let pystring = <PyString as PyTryFrom>::try_from(ob)?; // Cast PyAny to PyString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is shared between the two implementations so could be pulled out the top before them both.

}
}

impl ToPyObject for Cow<'_, OsStr> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - if you're willing, can you add it there also? 🙏

I think that adding IntoPy for Cow makes sense (as it's the trait needed to be able to return this type from #[pyfunction]).

}
}

impl<'a> IntoPy<PyObject> for &'a OsString {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh ok.

I'm not entirely sure why that impl exists... it might be because #[pyo3(get)] on String fields needs it? Not sure.

If we can't figure out why we need this impl, I'd rather skip it for now. We can always add it later!

Comment on lines 64 to 68
let fs_encoded_bytes: &crate::types::PyBytes = unsafe {
ob.py()
.from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr()))
};
Copy link
Member

@davidhewitt davidhewitt Jan 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. from_owned_ptr returns &PyAny (or another native type reference) where the owned pointer has to be stored by PyO3 in a thread-local vector. In comparison, Py (and PyObject, aka Py<PyAny>) directly hold the owned pointer inside them. This is marginally more efficient, and also means that the temporary bytes will be cleaned up immediately, instead of when PyO3 has a chance to cleanup its internal vector safely.

This has been a long thorn in PyO3's API imo - there's discussion at #1056 and #1308 where I hope to eventually remove this difference and make everything as efficient as possible!

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think there's some final decisions to be made on the traits to provide, and then this is good to merge. Thanks again!

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the many rounds of iteration - this PR is looking great to me now 🚀

@davidhewitt
Copy link
Member

I'm going to rebase and merge this. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants