
Conversation

@Tian-2017 (Contributor) commented Apr 11, 2025

  • Deduplicate rows by ID and last-updated timestamp (see the sketch after this list).
  • Use the standard logger instead of the Glue logger; otherwise unit tests fail because prepare_increments has no logger.
  • Remove the data quality keyword from the finally clause to avoid confusion.
  • Add coalesce(300) to avoid S3 throttling errors (Service: Amazon S3; Status Code: 503) when running with many workers.
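
A minimal sketch of the deduplication and coalesce steps in PySpark. The column names `id` and `last_updated`, and the toy input DataFrame, are assumptions; the PR does not show the actual schema:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy input; the real data comes from the Glue job's source table.
df = spark.createDataFrame(
    [(1, "2025-04-10", "old"), (1, "2025-04-11", "new"), (2, "2025-04-11", "x")],
    ["id", "last_updated", "value"],
)

# Keep only the most recently updated row per ID.
window = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
deduplicated = (
    df.withColumn("row_num", F.row_number().over(window))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)

# Cap the number of output files so writes stay under S3's
# request-rate limit (the source of the 503 Slow Down errors).
deduplicated = deduplicated.coalesce(300)
```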

@Tian-2017 Tian-2017 requested review from a team as code owners April 11, 2025 15:27
@Tian-2017 Tian-2017 marked this pull request as draft April 11, 2025 15:30
@Tian-2017 Tian-2017 marked this pull request as ready for review April 11, 2025 15:56
@Tian-2017 Tian-2017 marked this pull request as draft April 14, 2025 09:47

@Tian-2017 (Contributor, Author) commented Apr 15, 2025

Hey @timburke-hackit, could you review this again? I've checked the data and it looks good now.
I'm adding a new commit; this is the version running in prod:

  • Changed duplicate_ids.count() > 0 to duplicate_ids.limit(1).count() > 0 to reduce shuffle (see the sketch below).
  • Added snapshot_df = snapshot_df.coalesce(300) to reduce the number of small files hitting S3's request-rate throttle.
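
For context, a sketch of why the emptiness check is cheaper with limit(1). The duplicate_ids DataFrame below is a hypothetical reconstruction, built here from toy data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,), (1,), (2,)], ["id"])

# IDs that appear more than once.
duplicate_ids = df.groupBy("id").count().filter(F.col("count") > 1)

# Full count scans and aggregates everything, only to be compared to 0.
has_duplicates_slow = duplicate_ids.count() > 0

# limit(1) lets Spark stop as soon as a single row is found, so the
# check does far less work when duplicates exist.
has_duplicates_fast = duplicate_ids.limit(1).count() > 0
```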

Thanks.

@Tian-2017 Tian-2017 marked this pull request as ready for review April 15, 2025 08:41
@Tian-2017 Tian-2017 merged commit 01beda2 into main Apr 15, 2025
6 checks passed
@Tian-2017 Tian-2017 deleted the further-deduplicate-the-incremental-rows branch April 15, 2025 08:49