We should have some utilities to validate and re-cast to the target schema given they exist in functional forms. #37

Open
mmcdermott opened this issue Aug 1, 2024 · 7 comments

Comments

@mmcdermott
Contributor

E.g., see this function: https://github.com/mmcdermott/MEDS_transforms/blob/573816cbf3f6005a8fc25eb25424706ca0c97b6e/src/MEDS_transforms/extract/finalize_MEDS_metadata.py#L28

That function is polars-specific, which we don't want here, but the ability to check whether a codes.parquet or data/*.parquet file conforms to a valid extended schema, and to convert it to the right pyarrow schema, would be very useful (especially because there are minor type differences we should be aware of, such as large_string vs. string).

Tagging @EthanSteinberg for your input.

@mmcdermott
Contributor Author

@EthanSteinberg, thoughts on this? If you think this would be useful, I'd be a proponent of bringing it over now rather than later.

@EthanSteinberg
Collaborator

I agree, it would be useful to put these validation checks here.

@EthanSteinberg
Collaborator

I think we want this validation code to use pyarrow, though. I don't want to add another dependency (polars) to this repository.

@mmcdermott
Contributor Author

I agree that we don't want to add a dependency and think the main validation code or re-typing code should be in pyarrow. I think we could consider having code that is only runnable if polars is installed, e.g.,

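try:
  import polars as pl
except ImportError:
  pl = None  # polars is not installed; skip the polars-specific helpers

if pl is not None:
  # validation code here...
  pass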

but I think starting with pyarrow would still be very helpful.
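
One way to make a polars-only helper fail loudly rather than silently disappear is sketched below. The `optional_import` and `validate_with_polars` names are hypothetical and exist only for this example; the pattern relies solely on the standard library.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named top-level module if it is installed, else None."""
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


# None when polars is not installed; the pyarrow code path never touches it.
pl = optional_import("polars")


def validate_with_polars(df):
    """Hypothetical polars-specific entry point (body left as a placeholder)."""
    if pl is None:
        raise ImportError(
            "polars is not installed; use the pyarrow validation helpers instead"
        )
    # polars-specific validation of `df` would go here...
    return df
```

Compared with wrapping everything in `try`/`except`, this keeps polars out of the required dependencies while giving callers a clear error if they reach for the optional helper without it.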

@mmcdermott
Contributor Author

Sample code for the label schema as well, in case it is helpful once we decide to implement this: https://github.com/justin13601/ACES/blob/e9655390f25bf79167370a802176bcf671cefa44/src/aces/__main__.py#L35


2 participants