Dates as datetime64[ms] - remove driving_institution #222

Merged: 2 commits into main from pandas-2 on Jul 10, 2023


@aulemahal aulemahal commented Jul 7, 2023

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • (If applicable) Documentation has been added / updated (for bug fixes / features).
  • (If applicable) Tests have been added.
  • This PR does not seem to break the templates.
  • HISTORY.rst has been updated (with summary of main changes).
    • Link to issue (:issue:number) and pull request (:pull:number) has been added.

What kind of change does this PR introduce?

  • The date_start and date_end columns are now cast to a datetime64[ms] dtype (not a Period).
  • Improvements to date_parser.
  • Rewrite of subset_file_coverage.
  • Removal of driving_institution as an official xscen column.
  • Pin of pandas >= 2.

Pandas 2 now supports datetime columns with s, ms and us resolutions, in addition to the old ns default. This allows storing dates from before 1677 and after 2262, the limits of the ns resolution. However, this support is still partial, as many of the datetime manipulation methods still fail on "out of bounds" dates. This includes pd.read_csv and pd.to_datetime, among others. Because of this bug, I had to implement the parsing directly in the DataCatalog's init, using a solution proposed on Stack Overflow.

Even with this strange workaround, opening simulation.json went from 3 s to 800 ms on my machine!
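For readers unfamiliar with the non-nanosecond resolutions, here is a minimal sketch of the general idea, assuming pandas >= 2. It is not the code from this PR: the column values are made up, and letting NumPy parse the strings is only one possible way around the pandas parsers.

```python
import numpy as np
import pandas as pd

# Bounds of the default nanosecond resolution (years 1677 and 2262).
print(pd.Timestamp.min, pd.Timestamp.max)

# Hypothetical catalog column, as it might be read from a CSV as plain strings.
# The first and last dates fall outside the nanosecond bounds.
raw = ["1500-01-01", "1950-06-15", "2300-12-31"]

# Let NumPy parse the strings directly at millisecond resolution...
parsed = np.array(raw, dtype="datetime64[ms]")

# ...then wrap the result in a Series; pandas >= 2 keeps the ms resolution
# instead of converting to nanoseconds.
date_start = pd.Series(parsed, name="date_start")
print(date_start.dtype)
```

With pandas < 2 the last step would try to convert to nanoseconds and fail on the out-of-bounds dates, which is consistent with the pandas pin described here.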

The change had repercussions in other parts of xscen, especially date_parser and subset_file_coverage. I adapted the former to output pd.Timestamp objects by default and the latter to use more of the Interval magic pandas can already do with datetime bounds.
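As an illustration of the kind of Interval operations pandas offers for datetime bounds, here is a small sketch. It is not the actual subset_file_coverage implementation: the file table, paths and requested period are hypothetical.

```python
import pandas as pd

# Hypothetical file table: one row per file, with the period each file covers.
files = pd.DataFrame(
    {
        "path": ["a.nc", "b.nc", "c.nc"],
        "date_start": pd.to_datetime(["1950-01-01", "1980-01-01", "2050-01-01"]),
        "date_end": pd.to_datetime(["1979-12-31", "2014-12-31", "2100-12-31"]),
    }
)

# One closed interval per file, built from the two datetime columns.
file_intervals = pd.IntervalIndex.from_arrays(
    files["date_start"], files["date_end"], closed="both"
)

# The period we want data for (made-up values).
requested = pd.Interval(
    pd.Timestamp("1971-01-01"), pd.Timestamp("2000-12-31"), closed="both"
)

# Boolean mask of the files whose period intersects the requested one.
overlapping = files[file_intervals.overlaps(requested)]
print(overlapping["path"].tolist())
```

Only the files whose period intersects the request are kept; a real coverage check would additionally have to measure how much of the requested period those files span.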

I also used this PR to remove driving_institution from the official columns, as discussed.

Does this PR introduce a breaking change?

The default output of date_parser has changed.

The default dtype of date_start and date_end has changed.

The driving_institution column has been removed.

Other information:

This required pinning pandas >= 2 and clisops >= 0.10. The latter pin allowed unpinning python.

@aulemahal aulemahal requested a review from RondeauG July 7, 2023 21:04
@aulemahal
Collaborator Author

Feel free to push changes and merge this whenever you want. I'll be on vacation for the next two weeks.

Contributor

@RondeauG RondeauG left a comment

LGTM!

@RondeauG RondeauG merged commit 33137cb into main Jul 10, 2023
11 checks passed
@RondeauG RondeauG deleted the pandas-2 branch July 10, 2023 19:33