Skip to content

Fix: improve csv seed file type inference#2418

Merged
georgesittas merged 1 commit intomainfrom
jo/fix_seed_type_inference
Apr 9, 2024
Merged

Fix: improve csv seed file type inference#2418
georgesittas merged 1 commit intomainfrom
jo/fix_seed_type_inference

Conversation

@georgesittas
Copy link
Copy Markdown
Collaborator

@georgesittas georgesittas commented Apr 9, 2024

Prior to this change, large seed files were chunked and so pandas' type inference would be incomplete, leading to users seeing warnings related to data type mismatches.

This PR makes it so that seed files are not chunked, in order to avoid the mixed type inference issues. The tradeoff here is that this could potentially lead to memory consumption issues for larger seed files, but that should be ok since we don't really encourage users to have really large seed files where memory consumption would actually be a problem.

One workaround we discussed was to add both low_memory and dtype to CsvSettings, but the latter was involved because we'd need to allow the properties to be arbitrarily nested in order to support SEED definitions like the following, so I opted for this simpler approach.

  kind SEED (
    path '../seeds/seed_data.csv',
    csv_settings (
      low_memory = False,
      dtype (
        k1 = v2,
        ...
      )
    ),
  )

While testing this, I also noticed a couple of other warnings that I fixed. The loc one was related to this:

sqlmesh/core/model/definition.py:1166: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.SettingWithCopyWarning.html for reference.

Context: https://tobiko-data.slack.com/archives/C044BRE5W4S/p1712655545849239

@georgesittas georgesittas requested review from a team, eakmanrq and treysp April 9, 2024 22:04
@georgesittas georgesittas force-pushed the jo/fix_seed_type_inference branch from ad952e6 to cfc0fb7 Compare April 9, 2024 22:05
@georgesittas georgesittas changed the title Feat: improve csv seed file type inference Fix: improve csv seed file type inference Apr 9, 2024
@georgesittas georgesittas merged commit 0ba4230 into main Apr 9, 2024
@georgesittas georgesittas deleted the jo/fix_seed_type_inference branch April 9, 2024 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants