Fix: improve csv seed file type inference by georgesittas · Pull Request #2418 · SQLMesh/sqlmesh

georgesittas · 2024-04-09T22:04:03Z

Prior to this change, large seed files were chunked and so pandas' type inference would be incomplete, leading to users seeing warnings related to data type mismatches.

This PR makes it so that seed files are not chunked, in order to avoid the mixed type inference issues. The tradeoff here is that this could potentially lead to memory consumption issues for larger seed files, but that should be ok since we don't really encourage users to have really large seed files where memory consumption would actually be a problem.

One workaround we discussed was to add both low_memory and dtype to CsvSettings, but the latter was involved because we'd need to allow the properties to be arbitrarily nested in order to support SEED definitions like the following, so I opted for this simpler approach.

  kind SEED (
    path '../seeds/seed_data.csv',
    csv_settings (
      low_memory = False,
      dtype (
        k1 = v2,
        ...
      )
    ),
  )

While testing this, I also noticed a couple of other warnings that I fixed. The loc one was related to this:

sqlmesh/core/model/definition.py:1166: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.SettingWithCopyWarning.html for reference.

Context: https://tobiko-data.slack.com/archives/C044BRE5W4S/p1712655545849239

georgesittas requested review from a team, eakmanrq and treysp April 9, 2024 22:04

Feat: improve csv seed file type inference

cfc0fb7

georgesittas force-pushed the jo/fix_seed_type_inference branch from ad952e6 to cfc0fb7 Compare April 9, 2024 22:05

tobymao approved these changes Apr 9, 2024

View reviewed changes

georgesittas changed the title ~~Feat: improve csv seed file type inference~~ Fix: improve csv seed file type inference Apr 9, 2024

georgesittas merged commit 0ba4230 into main Apr 9, 2024

georgesittas deleted the jo/fix_seed_type_inference branch April 9, 2024 22:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: improve csv seed file type inference#2418

Fix: improve csv seed file type inference#2418
georgesittas merged 1 commit intomainfrom
jo/fix_seed_type_inference

georgesittas commented Apr 9, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

georgesittas commented Apr 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

georgesittas commented Apr 9, 2024 •

edited

Loading