
[SNOW-98] Add Trending Snapshots DAG #65

Merged
merged 21 commits into main on Aug 13, 2024

Conversation

@jaymedina (Contributor) commented Jul 31, 2024

This DAG will automate populating the Trending Projects Snapshots table at the users' desired cadence.

  • Successful initial run of DAG on Trending Snapshots Test
  • Fix calculation of TOTAL_DATA_SIZE_IN_GIB
  • Revert changes made for debugging (date, time window)
  • Clean up DAG script (refactor, document, etc.)

Testing

1 : Test query efficacy

Testing the query on this test Project ensures that the project size (i.e. the total number of File entities stored within a project) is being calculated correctly:

[screenshot]

2 : Test query efficacy

Confirming that the query results for the 30 days leading up to 6/29/2024 approach the results from the initial query:

**Current query results:** [screenshot]

**Initial query results:** [screenshot]


3 : Test query efficacy

This query also accounts for cases where there are no new downloads or users within the specified time range (thanks to the LEFT JOIN):

[screenshot]
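The LEFT JOIN behavior described above can be sketched with a toy example (SQLite via Python; the table and column names here are simplified stand-ins for the real warehouse tables, not the actual schema):

```python
import sqlite3

# Toy schema (hypothetical names) illustrating why a LEFT JOIN keeps
# projects that had no downloads in the time window: they appear with
# a zero user count and an 'N/A' last-download date instead of dropping out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE public_projects (project_id INTEGER);
    CREATE TABLE recent_downloads (project_id INTEGER, user_id INTEGER, record_date TEXT);
    INSERT INTO public_projects VALUES (1), (2);
    INSERT INTO recent_downloads VALUES (1, 101, '2024-06-15'), (1, 102, '2024-06-20');
    -- project 2 has no downloads in the window
""")
rows = conn.execute("""
    SELECT pp.project_id,
           COUNT(DISTINCT rd.user_id)           AS n_unique_users,
           COALESCE(MAX(rd.record_date), 'N/A') AS last_download_date
    FROM public_projects pp
    LEFT JOIN recent_downloads rd ON pp.project_id = rd.project_id
    GROUP BY pp.project_id
    ORDER BY pp.project_id
""").fetchall()
print(rows)  # [(1, 2, '2024-06-20'), (2, 0, 'N/A')]
```

Project 2 survives the join with a count of 0 and an 'N/A' date, which is the behavior the comment above verifies against the warehouse.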

4 : Test DAG functionality

Trending Snapshots Test


5 : Test DAG functionality

I changed time_window in the dag_config to 2 days to ensure the n_unique_users reflects this shorter time window. Results:

Trending Snapshots Test 2 days
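The effect of shrinking time_window in the dag_config can be sketched as follows (a minimal illustration; the function and variable names are hypothetical, not the actual DAG code):

```python
from datetime import date, timedelta

# Hypothetical sketch: a time_window parameter bounds which download
# records count toward n_unique_users for a given run date.
def window_start(run_date: date, time_window_days: int) -> date:
    return run_date - timedelta(days=time_window_days)

downloads = [date(2024, 6, 27), date(2024, 6, 20), date(2024, 5, 30)]
run = date(2024, 6, 29)

# With time_window = 2 days, only very recent downloads qualify.
in_window = [d for d in downloads if d >= window_start(run, 2)]
print(in_window)  # [datetime.date(2024, 6, 27)]
```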

@jaymedina jaymedina marked this pull request as ready for review August 2, 2024 02:01
@jaymedina jaymedina requested a review from a team as a code owner August 2, 2024 02:01
SELECT
pp.PROJECT_ID,
COUNT(DISTINCT rd.USER_ID) AS N_UNIQUE_USERS,
COALESCE(TO_CHAR(MAX(rd.RECORD_DATE), 'YYYY-MM-DD'), 'N/A') AS LAST_DOWNLOAD_DATE,
@jaymedina (Contributor Author):

In the event that we have no new downloads within the requested time window for a given project, we LEFT JOIN onto PUBLIC_PROJECTS and fill NULL values with N/A.

Contributor:

What is your rationale for shifting this to a character? The expected output is what we have for "export_date", so keep that in mind. Nice use of COALESCE, though, for those projects that don't have download dates at all.

One query optimization is to first get the top 10 downloaded public projects by user, then only grab the data sizes and activity for those projects.

@jaymedina (Contributor Author) commented Aug 2, 2024:

What is your rationale here for shifting this to a character?

It's because COALESCE needs all of its arguments to be the same data type (in this case, string). I was getting a "Date 'N/A' is not recognized" error otherwise.

the expected output is what we have for "export_date" so keep that in mind

Noted. It looks like the display format on Synapse is determined by how the table schema is configured on Synapse itself. I did a test run using the date formatting as-is ('YYYY-MM-DD'), but it shows up as MM/DD/YYYY on Synapse, the same formatting as export_date:
[screenshot]

@jaymedina (Contributor Author):

One query optimization is to first get the top 10 downloaded public projects by user, then only grab the data sizes and activity for those projects.

Updated ✔️

RECENT_DOWNLOADS still works off PUBLIC_PROJECTS (not TOP_10) because the downloads are what determine the top 10 projects for that month. After that, the remaining CTEs use the results of the TOP_10 project CTE to get the file handles and the corresponding total size for each project, which minimizes calculations.

Here are my results for the month of June:

[screenshot]

I tried comparing the query duration between the old and new versions of this query, but the runtime is so short that the difference is hard to measure. Logically, though, this change should minimize compute.
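The CTE chain described above can be sketched in miniature (SQLite via Python; table names, column names, and the LIMIT value are illustrative stand-ins, not the real warehouse query):

```python
import sqlite3

# Hypothetical miniature of the optimization: downloads determine the top
# projects, and only those projects flow into the file-size aggregation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE downloads (project_id INTEGER, user_id INTEGER);
    CREATE TABLE files (project_id INTEGER, size_bytes INTEGER);
    INSERT INTO downloads VALUES (1,101),(1,102),(1,103),(2,101),(3,101),(3,102);
    INSERT INTO files VALUES (1,500),(1,700),(2,300),(3,200);
""")
rows = conn.execute("""
    WITH recent_downloads AS (
        SELECT project_id, user_id FROM downloads
    ),
    top_projects AS (               -- rank projects by unique downloaders
        SELECT project_id, COUNT(DISTINCT user_id) AS n_unique_users
        FROM recent_downloads
        GROUP BY project_id
        ORDER BY n_unique_users DESC
        LIMIT 2                     -- LIMIT 10 in the real query
    ),
    file_sizes AS (                 -- sizes computed only for top projects
        SELECT project_id, SUM(size_bytes) AS total_size
        FROM files
        WHERE project_id IN (SELECT project_id FROM top_projects)
        GROUP BY project_id
    )
    SELECT t.project_id, t.n_unique_users, f.total_size
    FROM top_projects t JOIN file_sizes f USING (project_id)
    ORDER BY t.n_unique_users DESC
""").fetchall()
print(rows)  # [(1, 3, 1200), (3, 2, 200)]
```

Project 2 never reaches the file-size aggregation because it falls outside the top ranking, which is the compute savings described above.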

One other thing that came to mind while working on this: we should consider expanding the scope of the trending project snapshots to also include upload activity. In other words, N_UNIQUE_USERS would include users both uploading and downloading, and we'd have another column for LAST_UPLOAD_DATE.

Comment on lines 96 to 100
SELECT DISTINCT ID, PROJECT_ID, FILE_HANDLE_ID
FROM SYNAPSE_DATA_WAREHOUSE.SYNAPSE.NODE_LATEST
WHERE 1=1
AND NODE_TYPE = 'file'
AND PROJECT_ID IN (SELECT PROJECT_ID FROM PUBLIC_PROJECTS)
Contributor:

Nit: I was going to say you could combine PUBLIC_PROJECTS / LATEST_FILE_HANDLES / FILE_SIZES into one CTE, but I understand why you have PUBLIC_PROJECTS as a separate CTE.

@jaymedina (Contributor Author) commented Aug 2, 2024:

While I was working on the optimization suggestion from this thread, I considered having the RECENT_DOWNLOADS CTE join FILEDOWNLOAD and NODE_LATEST and getting rid of PUBLIC_PROJECTS altogether, but I decided against it because:

  • I don't believe it would minimize compute (I tried it out a few times and found no significant difference in query duration compared to the version with PUBLIC_PROJECTS)
  • Smaller CTEs are easier to read and make the query easier to understand (from my perspective)

But lmk what you think!

FROM SYNAPSE_DATA_WAREHOUSE.SYNAPSE.NODE_LATEST
WHERE 1=1
AND NODE_TYPE = 'file'
AND PROJECT_ID IN (SELECT PROJECT_ID FROM TOP_10_PUBLIC_PROJECTS)
Contributor:

We should probably add an is_public check here as well, because there can be private files within a public project whose activity we want to filter out.

@jaymedina (Contributor Author):

I decided against this because calculating a project size based on only the public files wouldn't truly reflect the size of the project itself. It would also mislead users into thinking they have access to all of a project when in reality they might only have access to a portion of it. So right now I'm still leaning against filtering for public files. What do you think?

JOIN FILE_SIZES
ON TOP_10_PUBLIC_PROJECTS.PROJECT_ID = FILE_SIZES.PROJECT_ID

GROUP BY
Contributor:

Just some food for thought, as there are many ways to solve a problem. Another approach is to do this in steps:

  1. Calculate the top 10 downloaded projects and their last download activity (DL_INFO_CTE)
  2. Calculate the data size of those top 10 downloaded projects (DATA_SIZE_CTE)
  3. Join DL_INFO_CTE with DATA_SIZE_CTE on the project_id column.

That's effectively what you're doing here, but this GROUP BY clause can be a bit confusing at first glance.

@thomasyu888 (Contributor) left a comment:

🔥 LGTM! I'm going to pre-approve, but please see some of my comments. Excellent work here!

@jaymedina (Contributor Author):

Thanks for the review @thomasyu888! I'll wait until after Sage Week before merging.

Some final to-do items for me:

  • Re-run DAG against Trending Snapshots Test table to make sure results look as expected
  • Revert to pointing at the official table
  • Merge

Quick thought:
I have the DAG scheduled to run on the 2nd of each month, so with our current merge timeline, the next run would be September 2nd. If we need an updated version of the table before then, I can run it manually. Just want to run this by you so we're on the same page.

Thanks!

@thomasyu888 (Contributor):

Quick thought: I have the DAG scheduled to run on the 2nd of each month, so with our current merge timeline, the next run would be September 2nd. If we need an updated version of the table before then, I can run it manually. Just want to run this by you so we're on the same page.

Great work here! Yes, please run it manually this time, and September will be automated.

@BryanFauble (Contributor) left a comment:

1 comment around the export date, but everything else looks good. I will go ahead and pre-approve; please re-request review if there are sizable changes.

@jaymedina (Contributor Author) commented Aug 12, 2024:

Before I merge this in, I'm having second thoughts about what's in the TOTAL_DATA_SIZE column.

The screenshot below shows a backfill of the top 10 trending Projects in February 2024. While N_UNIQUE_USERS and LAST_DOWNLOAD_DATE are snapshots of those metrics for the Project within that month, TOTAL_DATA_SIZE still reflects the most recent size of the project rather than what the size was in February.

I think the important metric is the number of unique users, as that is what determines what the "top 10" trending projects are, but I wonder if that project size column is misleading.

If we want TOTAL_DATA_SIZE to reflect what the project size was for that month and not what it is currently, the change would be simple. I would just add another WHERE clause to the LATEST_FILE_HANDLES CTE, like so (with DATE() being fed programmatically rather than hardcoded as it is below):

AND DATE_TRUNC('MONTH', CHANGE_TIMESTAMP) <= DATE_TRUNC('MONTH', DATE('2024-2-12'))

But I'd like to know your thoughts on this before I move forward @BryanFauble @thomasyu888

[screenshot]

The other (simpler) solution is to just change the column name to CURRENT_TOTAL_DATA_SIZE, but I wonder if users would find it useful to know how big the project is at the time the export happens.

@thomasyu888 (Contributor):

But I'd like to know your thoughts on this before I move forward @BryanFauble @thomasyu888

@jaymedina this is a great catch! I think what we probably care about is when the entity was created, so if we filter out all entities created after the runtime, it would give us a more accurate representation. The issue with this is that we don't capture deletions, so the data won't be fully accurate.

@jaymedina (Contributor Author) commented Aug 12, 2024:

The issue with this is that we don't capture deletions so the data won't be fully accurate.

I see. Then I'll move forward with modifying the total project size calculation, and I can fold the caveat you mention into the column name itself by renaming it to ESTIMATED_PROJECT_SIZE_IN_GIB. What do you think?

@thomasyu888 (Contributor) commented Aug 12, 2024:

modifying the name to be ESTIMATED_PROJECT_SIZE_IN_GIB, what do you think?

Great idea. Before making that change, let's let the frontend team know we are going to make a schema change. Can you post on #synapse? This will impact the homepage.

@jaymedina (Contributor Author) commented Aug 12, 2024:

After Jay's feedback, I've moved forward with the schema modifications and am now calculating the Project size based on the files available in FILE_LATEST that land within the given time window.

Here is a test I just ran with the newest DAG that updates the table for July 2024 and December 2022. Take a look at the new Project size column and notice the difference between Project sizes for the same Projects (notably: AD Knowledge Portal).

I'm re-requesting a review since this is a structural change.

Thank you both for your patience and for your feedback!

@BryanFauble (Contributor) left a comment:

Great change!

SELECT ID, PROJECT_ID, FILE_HANDLE_ID
FROM SYNAPSE_DATA_WAREHOUSE.SYNAPSE.NODE_LATEST
WHERE 1=1
AND DATE_TRUNC('MONTH', CHANGE_TIMESTAMP) <= DATE_TRUNC('MONTH', DATE('{context["params"]["month_to_run"]}'))
Contributor:

This will be an issue because sometimes CHANGE_TIMESTAMP will be greater than the CREATEDON date. Say an entity was created in February but updated in April: with this query, when you calculate the data sizes for February, you're going to leave this entity out.

@jaymedina (Contributor Author):

Got it. I've updated it so that as long as the file was created on or before the month_to_run, it will be used in the calculation of the total Project size.
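The pitfall raised above can be shown with a toy example (SQLite via Python; the schema and column names are hypothetical simplifications): filtering on the last-change timestamp drops an entity that existed in the snapshot month but was modified later, while filtering on the creation date keeps it.

```python
import sqlite3

# Hypothetical single-entity table: created in February, last changed in April.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_latest (id INTEGER, created_on TEXT, change_timestamp TEXT)")
conn.execute("INSERT INTO node_latest VALUES (1, '2024-02-10', '2024-04-05')")

month_end = '2024-02-29'  # end of the snapshot month

# Filter on the change timestamp: the entity is wrongly excluded.
by_change = conn.execute(
    "SELECT COUNT(*) FROM node_latest WHERE change_timestamp <= ?", (month_end,)
).fetchone()[0]
# Filter on the creation date: the entity is correctly included.
by_created = conn.execute(
    "SELECT COUNT(*) FROM node_latest WHERE created_on <= ?", (month_end,)
).fetchone()[0]

print(by_change, by_created)  # 0 1
```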

sonarcloud bot commented Aug 13, 2024

@thomasyu888 (Contributor) left a comment:

🎉 LGTM!

@jaymedina jaymedina merged commit 920bf0c into main Aug 13, 2024
2 checks passed
@jaymedina jaymedina deleted the SNOW-98-append-trending-snapshots branch August 13, 2024 13:53
3 participants