Fix: improve csv seed file type inference #2418
Merged
georgesittas merged 1 commit into main on Apr 9, 2024
ad952e6 to cfc0fb7
tobymao approved these changes on Apr 9, 2024
Prior to this change, large seed files were read in chunks, so pandas inferred column types from each chunk separately rather than from the whole file. The same column could therefore be inferred differently across chunks, and users saw warnings about data type mismatches.
This PR reads seed files without chunking to avoid the mixed type inference issues. The tradeoff is higher memory consumption for larger seed files, which should be acceptable because we don't encourage seed files large enough for memory consumption to become a problem.
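To illustrate the inference issue described above, here is a minimal sketch using plain pandas. It is not the SQLMesh seed reader; the CSV content, the column names, and the helper functions are made up for the example, and it assumes the chunking behaves like an explicit `chunksize` read.

```python
import io

import pandas as pd

# Hypothetical seed data: the "code" column looks numeric until the third row.
CSV_TEXT = "id,code\n1,100\n2,200\n3,abc\n4,400\n"


def dtypes_per_chunk(csv_text, chunksize):
    """Return the dtype pandas infers for "code" in each chunk."""
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunksize)
    return [str(chunk["code"].dtype) for chunk in reader]


def dtype_whole_file(csv_text):
    """Return the dtype pandas infers for "code" when the file is read at once."""
    df = pd.read_csv(io.StringIO(csv_text), low_memory=False)
    return str(df["code"].dtype)


chunk_dtypes = dtypes_per_chunk(CSV_TEXT, chunksize=2)
# The first chunk only contains numbers, the second contains "abc",
# so the two chunks disagree about the column type.
print("per chunk:", chunk_dtypes)
print("chunks disagree:", len(set(chunk_dtypes)) > 1)
# Reading everything at once yields a single, consistent (non-integer) dtype.
print("whole file:", dtype_whole_file(CSV_TEXT))
```

The exact name of the non-integer dtype depends on the pandas version, which is why the sketch only compares dtypes instead of hard-coding them.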
One workaround we discussed was to add both low_memory and dtype to CsvSettings, but the latter was involved because we'd need to allow the properties to be arbitrarily nested in order to support SEED definitions like the following, so I opted for this simpler approach.

    kind SEED (
      path '../seeds/seed_data.csv',
      csv_settings (
        low_memory = False,
        dtype (
          k1 = v2,
          ...
        )
      ),
    )

While testing this, I also noticed a couple of other warnings that I fixed. The loc one was related to pandas' SettingWithCopyWarning. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.SettingWithCopyWarning.html for reference.
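The code that triggered the loc warning is not shown here, so the following is only a generic sketch of the pattern that the linked pandas page describes. The frame, the column names, and the `zero_out_large` helper are hypothetical.

```python
import pandas as pd


def zero_out_large(df, threshold):
    """Set "b" to 0 where "a" exceeds the threshold, via one .loc call."""
    result = df.copy()  # work on an explicit copy, leave the input untouched
    # Chained indexing such as result[result["a"] > threshold]["b"] = 0
    # assigns into a temporary object and is what pandas warns about.
    # A single .loc call selects rows and column in one step instead.
    result.loc[result["a"] > threshold, "b"] = 0
    return result


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
print(zero_out_large(df, threshold=1)["b"].tolist())
```

Taking an explicit copy first makes it unambiguous which object is being modified, which is the ambiguity the warning is about.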
Context: https://tobiko-data.slack.com/archives/C044BRE5W4S/p1712655545849239