Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

【Need Update csv.rs】Update csv.py and test_csv.py from CPython v3.12 #5176

Merged
merged 4 commits into from
Mar 5, 2024

Conversation

Blues-star
Copy link
Contributor

I've updated csv.py and test_csv.py, but version 3.12 has changed the import options, and __version__ now needs to be imported from _csv.c (which is now csv.rs). It seems that csv.rs has not yet aligned with the behavior of cpython upstream. Should we consider updating csv.rs or postponing this commit? Due to this reason, I'm still unable to test test_csv.

@fanninpm
Copy link
Contributor

Please, go ahead and update the behavior of csv.rs.

@Blues-star
Copy link
Contributor Author

Does the current csv implementation not incorporate the dialect feature?

@youknowone
Copy link
Member

I don't remember the exact state. Since test_csv didn't exist, I guess our csv implementation was very immature.

@Blues-star
Copy link
Contributor Author

I have done some work to make the csv module compatible with test_csv.py, but due to my unfamiliarity with the API, the current version contains too much template code and redundant operations. It looks like the implementation is a bit messy. I will try to optimize the code and submit a new commit. I will close the current commit because I have a few more commits to make. Here are the test results.

# cargo run -- -m unittest -v Lib.test.test_csv
----------------------------------------------------------------------
Ran 123 tests in 2.669s

FAILED (failures=4, errors=10, skipped=6, expected failures=11)

@Blues-star
Copy link
Contributor Author

Currently, I have adapted most of the behaviors specified by the Python CSV module, but there are still 18 test cases failing, which I have marked as 'failed.' The functionality of the module should be working as expected at the moment. By the way, I have used modules from std in the CSV module, such as Hashmap and Mutex, and initialized using Onecell. I'm not sure if this is feasible, or if I should switch to using PyMutex instead?

@Blues-star
Copy link
Contributor Author

@youknowone have a review?

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for many efforts to add a new test and implementing a many features in rust native module!

I'm not sure if this is feasible, or if I should switch to using PyMutex instead?
PyMutex is single/multi thread support helpers. You can easilty turns parking_lot::Mutex to PyMutex. (but not std Mutex) I think PyMutex fits here. Otherwise we prefer to use parking_lot::PyMutex rather than std Mutex

The most of changes looks really good, but I have a bit of concerns about using intern_str. Could you check if they are intended to be actually interned or not?

Comment on lines 92 to 89
fn doublequote(&self, vm: &VirtualMachine) -> PyRef<PyInt> {
vm.ctx.new_bool(self.doublequote).to_owned()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type automatically be able to be turned into PyObjectRef.
So this is possible:

Suggested change
fn doublequote(&self, vm: &VirtualMachine) -> PyRef<PyInt> {
vm.ctx.new_bool(self.doublequote).to_owned()
}
fn doublequote(&self, vm: &VirtualMachine) -> bool {
self.doublequote
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

fn lineterminator(&self, vm: &VirtualMachine) -> PyRef<PyStr> {
match self.lineterminator {
Terminator::CRLF => vm.ctx.intern_str("\r\n".to_string()).to_owned(),
Terminator::Any(t) => vm.ctx.intern_str(format!("{}", t as char)).to_owned(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this t static value or dynamic value?

intern_str interns static strings. If this is a dynamic value,

Suggested change
Terminator::Any(t) => vm.ctx.intern_str(format!("{}", t as char)).to_owned(),
Terminator::Any(t) => vm.ctx.new_str(format!("{}", t as char)).to_owned(),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I didn't notice this API when I was looking through the API documentation. I was still wondering why there wasn't a basic API for creating dynamic strings.

stdlib/src/csv.rs Outdated Show resolved Hide resolved
s @ PyStr => {
Ok(if s.as_str().bytes().eq(b"\r\n".iter().copied()) {
csv_core::Terminator::CRLF
} else if let Some(t) = s.as_str().bytes().next() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if let Some(t) = s.as_str().bytes().next() {
} else if let Some(t) = s.as_str().as_bytes().first() {

Does this check require to ensure s.len() == 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this modification may be unnecessary. When s.len()<1, this function will throw an error, propagating the error to the upper layer, which conforms to the expected behavior in Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I need to get ownership of the first character, and because it is of type u8 I believe that copying it is inexpensive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I am sorry to make confusion. The suggestion and question were not related. I thought creating an iterator was not necessary here (for suggestion), and worried if it requires to raise error when s.len() > 1 but it is missed (comment). *t will be a copy for Copy types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I am sorry to make confusion. The suggestion and question were not related. I thought creating an iterator was not necessary here (for suggestion), and worried if it requires to raise error when s.len() > 1 but it is missed (comment). *t will be a copy for Copy types.

Oh, I see. Due to limitations in the current implementation within csv_core, the support for multiple characters in lineterminator is not complete.

Therefore, I intentionally ignored the case where s.len() > 1. I will add comments to explain this and thank you for your suggestion. I have already optimized the iterator part.

})?;
let input = string.as_str().as_bytes();

if input.is_empty() || input.starts_with(&[b'\n']) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if input.is_empty() || input.starts_with(&[b'\n']) {
if input.is_empty() || input.starts_with(b"\n") {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review

@Blues-star
Copy link
Contributor Author

Thank you for many efforts to add a new test and implementing a many features in rust native module!

I'm not sure if this is feasible, or if I should switch to using PyMutex instead?
PyMutex is single/multi thread support helpers. You can easilty turns parking_lot::Mutex to PyMutex. (but not std Mutex) I think PyMutex fits here. Otherwise we prefer to use parking_lot::PyMutex rather than std Mutex

The most of changes looks really good, but I have a bit of concerns about using intern_str. Could you check if they are intended to be actually interned or not?

Haha, I didn't notice this API when I was looking through the API documentation. I was still wondering why there wasn't a basic API for creating strings.

I have modified part of the code, keeping the old code for multiple type returns and operations on one of the bytes.

Now I have another question: is std::hashmap feasible here, or do I need to replace it with another no-std library such as hashbrown?

@Blues-star
Copy link
Contributor Author

PyMutex doesn't seem to implement sync, so I switched back to parking_lot.

@Blues-star
Copy link
Contributor Author

@youknowone

@youknowone
Copy link
Member

Now I have another question: is std::hashmap feasible here, or do I need to replace it with another no-std library such as hashbrown?

Currently we don't support no-std. So that is ok while preparing it is still a good idea.

@Blues-star
Copy link
Contributor Author

Are there any requested changes that I need to address?

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I attached a commit with minor fixes

@Blues-star
Copy link
Contributor Author

Thank you, I attached a commit with minor fixes

I accidentally made changes to csv.py earlier, but now I have reverted it back to the version for Python 3.12.

@youknowone youknowone merged commit 4c8cd67 into RustPython:main Mar 5, 2024
11 checks passed
@youknowone
Copy link
Member

Thank you so much!

@Blues-star
Copy link
Contributor Author

You're welcome.

Actually, there's one more thing I'd like to discuss. Regarding csv_core as the core library for csv.rs, due to issues in the upstream implementation, fully conforming to the behavior specified in csv.py is quite challenging. For example, the linedelimiter enum type is limited to u8 and does not support arbitrary length strings in UTF-8 format. Is there any special significance in choosing csv_core as the core implementation, such as no_std or wsam support? Or should we consider looking for other third-party libraries to extend the current functionality as needed?

@youknowone
Copy link
Member

We probably chose it as harvesting low-hanging fruit. The first csv module writers might not know about those limitation. Any reasonable suggestion will be appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants