
Conversation

@Tian-2017 (Contributor) commented Apr 11, 2025

  • Deduplicate rows by ID and last-updated timestamp (see the sketch after this list).
  • Use the standard logger instead of the Glue logger; otherwise unit tests fail because prepare_increments has no logger.
  • Remove the data quality keyword from the finally clause to avoid confusion.
  • Add coalesce(300) to avoid S3 throttling errors (Service: Amazon S3; Status Code: 503) when running with many workers.
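
A minimal sketch of the deduplication and coalesce steps in PySpark. The column names `id` and `last_updated`, and the toy input DataFrame, are assumptions; the PR does not show the actual schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy input; the real data comes from the Glue job's source table.
df = spark.createDataFrame(
    [(1, "2025-04-10", "old"), (1, "2025-04-11", "new"), (2, "2025-04-11", "x")],
    ["id", "last_updated", "value"],
)

# Keep only the most recently updated row per ID.
window = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
deduplicated = (
    df.withColumn("row_num", F.row_number().over(window))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)

# Cap the number of output files so writes stay under S3's
# request-rate limit (the source of the 503 Slow Down errors).
deduplicated = deduplicated.coalesce(300)
```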

@Tian-2017 Tian-2017 requested review from a team as code owners April 11, 2025 15:27
@Tian-2017 Tian-2017 marked this pull request as draft April 11, 2025 15:30
@Tian-2017 Tian-2017 marked this pull request as ready for review April 11, 2025 15:56
@Tian-2017 Tian-2017 marked this pull request as draft April 14, 2025 09:47

@Tian-2017 (Contributor, Author) commented Apr 15, 2025

Hey @timburke-hackit, could you review this again? I've checked the data and it looks good now.
I'm adding a new commit; this is the version running in prod:

  • Changed duplicate_ids.count() > 0 to duplicate_ids.limit(1).count() > 0 to reduce shuffle (see the sketch below).
  • Added snapshot_df = snapshot_df.coalesce(300) to reduce the number of small files hitting S3's request-rate throttle.
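
For context, a sketch of why the emptiness check is cheaper with limit(1). The duplicate_ids DataFrame below is a hypothetical reconstruction, built here from toy data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,), (1,), (2,)], ["id"])

# IDs that appear more than once.
duplicate_ids = df.groupBy("id").count().filter(F.col("count") > 1)

# Full count scans and aggregates everything, only to be compared to 0.
has_duplicates_slow = duplicate_ids.count() > 0

# limit(1) lets Spark stop as soon as a single row is found, so the
# check does far less work when duplicates exist.
has_duplicates_fast = duplicate_ids.limit(1).count() > 0
```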

Thanks.

@Tian-2017 Tian-2017 marked this pull request as ready for review April 15, 2025 08:41
@Tian-2017 Tian-2017 merged commit 01beda2 into main Apr 15, 2025
6 checks passed
@Tian-2017 Tian-2017 deleted the further-deduplicate-the-incremental-rows branch April 15, 2025 08:49