Preserve Complex Data Types for to_csv #61157

Open
wants to merge 17 commits into base: main


@Jaspvr Jaspvr commented Mar 20, 2025

This Pull Request solves the issue outlined in:
#60895

Complex data types such as NumPy arrays can now be stored in CSV format and later recovered.
A new parameter, preserve_complex, is introduced. When it is set to True in a to_csv call, complex data types are preserved and can be recovered from the CSV.

With preserve_complex=True, NumPy arrays are serialized to JSON when writing. To read them back, set the same parameter in read_csv, and the original NumPy array is returned.
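For readers who want to see the idea without the patch applied, here is a minimal sketch of the same JSON round trip using only arguments that already exist in released pandas. It does not use preserve_complex (which exists only in this PR); the column names and sample data are made up for illustration.

```python
import io
import json

import numpy as np
import pandas as pd

# Illustrative frame: one object column holding NumPy arrays of different lengths.
df = pd.DataFrame(
    {"id": [1, 2], "vec": [np.array([1, 2, 3]), np.array([4.5, 6.0])]}
)

# Writing: turn each array into a JSON string before calling to_csv.
serialized = df.assign(vec=df["vec"].map(lambda a: json.dumps(a.tolist())))
csv_text = serialized.to_csv(index=False)

# Reading: parse the JSON strings back into arrays with a per-column converter.
restored = pd.read_csv(
    io.StringIO(csv_text),
    converters={"vec": lambda s: np.array(json.loads(s))},
)
```

The trade-off is that the caller has to name the array columns explicitly on both sides, which is the bookkeeping the proposed flag is meant to remove.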

Please refer to tests in scripts/tests/test_csv.py to see how this is used.

Please refer to the original issue for more information on the problem definition.

@snitish
Member

snitish commented Mar 23, 2025

@Jaspvr is there an existing issue that this PR addresses? If so, could you list it in the description? If not, please create an issue describing the bug or proposed enhancement so it can be reviewed by a team member.

@Jaspvr Jaspvr changed the title Csv func Preserve Complex Data Types for to_csv Mar 28, 2025
@Jaspvr
Author

Jaspvr commented Mar 28, 2025

> @Jaspvr is there an existing issue that this PR addresses? If so, could you list it in the description? If not, please create an issue describing the bug or proposed enhancement so it can be reviewed by a team member.

Hey, just updated the description. This is in relation to #60895

Member

@rhshadrach rhshadrach left a comment

@@ -3858,6 +3859,11 @@ def to_csv(

{storage_options}

preserve_complex : bool, default False

As commented in the issue, you can use the dtype argument in read_csv to read complex values already. I'm negative on this approach.
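As a rough illustration that complex values can already be recovered from a CSV with existing read_csv arguments, the sketch below uses the converters argument rather than dtype. That is a deliberate substitution: it is a related but different mechanism from the one named in the comment above, chosen here because it simply calls Python's built-in complex on each field. The CSV text is hand-written sample data.

```python
import io

import pandas as pd

# Hand-written sample CSV with complex numbers in Python's repr form.
csv_text = "id,z\n1,(1+2j)\n2,(3-4j)\n"

# The built-in complex() accepts strings such as "(1+2j)" or "1+2j".
df = pd.read_csv(io.StringIO(csv_text), converters={"z": complex})
```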

@rhshadrach rhshadrach added IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action Enhancement Complex Complex Numbers labels Mar 29, 2025
@a-holm

a-holm commented Mar 31, 2025

This PR introduces a useful feature for preserving complex data types like NumPy arrays during CSV serialization/deserialization via the preserve_complex flag. The implementation integrates well with the existing I/O functions, and the new tests seem to cover the basic functionality (I haven't run them).

Two minor points for consideration:

1. Serialization Logic (pandas/io/formats/csvs.py): The current logic checks only the first value in an object column to decide if it contains arrays/lists needing serialization. This might be fragile if the first row has NaN or a different type. Would checking the first non-NaN value or using pd.api.types.infer_dtype be more robust, balancing robustness vs. performance?
2. Deserialization Logic (pandas/io/parsers/readers.py): The _restore_complex_arrays function uses a heuristic (startswith("[") / endswith("]")) combined with checking all() non-null values. This seems safer than just checking brackets, but could still potentially misidentify columns if they contain a mix of valid JSON strings and other strings. Perhaps a try-except json.JSONDecodeError within the apply could be an alternative? (though likely slower if many non-JSON strings exist)
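To make the two suggestions concrete, here is a rough sketch of what they could look like. Both helper names (column_holds_arrays, parse_json_array) are hypothetical and are not taken from the PR's code. The first inspects the first non-null value instead of row 0; the second falls back to the original value when JSON decoding fails.

```python
import json

import numpy as np
import pandas as pd


def column_holds_arrays(col: pd.Series) -> bool:
    """Hypothetical check: look at the first non-null value, not just row 0."""
    non_null = col.dropna()
    if non_null.empty:
        return False
    return isinstance(non_null.iloc[0], (list, np.ndarray))


def parse_json_array(value):
    """Hypothetical parser: return an array for JSON lists, else the input unchanged."""
    if not isinstance(value, str):
        return value  # NaN and other non-string values pass through
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return value  # not valid JSON, keep the original string
    return np.array(parsed) if isinstance(parsed, list) else value
```

A column could then be restored with something like col.map(parse_json_array). Mixed columns would keep their non-JSON strings as they are, at the cost of one decode attempt per string value.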

Labels
Complex Complex Numbers Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action
4 participants