
bug-1886018: support pubsub crashqueue #6554

Merged: 7 commits merged into main from pubsub-queue on Apr 3, 2024

Conversation

relud (Member) commented Mar 11, 2024

No description provided.

@relud relud force-pushed the pubsub-queue branch 9 times, most recently from 7665840 to e7c4b80 on March 18, 2024 at 16:09
@relud relud requested a review from willkg March 18, 2024 16:10
@relud relud marked this pull request as ready for review March 18, 2024 20:14
@relud relud requested a review from a team as a code owner March 18, 2024 20:14
@relud relud changed the title from "bug-1878423: support pubsub crashqueue" to "bug-1886018: support pubsub crashqueue" on Mar 18, 2024
willkg: This comment was marked as resolved.

relud: This comment was marked as resolved.

willkg: This comment was marked as resolved.

relud (Member, Author) commented Apr 1, 2024

The make setup step pauses for about 6 minutes and then fails with a "504 Deadline exceeded". What's going on is that make runservices doesn't start pubsub.

pubsub definitely needs to be added to runservices, but I also feel like either make setup should use a different container for its shell, one that defines depends_on in docker-compose.yml, or app should define depends_on, given that we don't have any extends in docker-compose.yml.

I'm in favor of changing app, but I don't have the context to be sure that's the right choice.

Edit: I'm going to commit the change to app, but I'm happy to remove or change that solution.
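(For illustration only, a hypothetical sketch of what declaring that dependency on the app service could look like; the service names are assumed and this is not necessarily the change committed here:)

```yaml
# docker-compose.yml (hypothetical excerpt)
services:
  app:
    # ...existing build/image/env configuration...
    depends_on:
      # assumed names for the emulator services the app talks to
      - localstack
      - pubsub
```

Note that depends_on without a healthcheck-based condition only orders container startup; it does not wait for the emulator to actually be ready.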

relud (Member, Author) commented Apr 1, 2024

Sometimes the crash report will get processed twice. I saw this happen twice in the minimal testing I did. Also, it looks like two threads each get the crash id, so it's getting processed twice at the same time.

I don't remember having this problem years ago when I did pubsub the first time around. Are we missing a setting somewhere in either the topic or subscription creation? Maybe this is a problem with the emulator we're using?

Also, I took a break to get a cup of coffee, and after I came back I couldn't reproduce it anymore.

I'm pretty sure the issue isn't the emulator, it's the pubsub client SDK. The data platform's ingestion-edge service has a flaky test where sometimes messages are published twice, and that's using a custom emulator.

Last time I looked, the data platform had a duplicate rate of ~6% between the edge and BigQuery, which goes edge -> pubsub -> dataflow -> pubsub -> java app -> BigQuery, and the dataflow job uses Beam's built-in deduplication based on document_id over a 10-minute window.

Is this something that will break Socorro, or is it just wasted compute? If it doesn't break anything, I'd be inclined to let it slide.

@relud relud requested a review from willkg April 1, 2024 17:42
relud: This comment was marked as resolved.

willkg (Collaborator) commented Apr 3, 2024

Getting processed multiple times and even having two things processing at the same time where one stomps on the other's results--Socorro should be fine with both of those situations. It's just wasted compute and if we're looking at numbers of crash reports collected vs. processed, there will be some discrepancy.

willkg (Collaborator) left a comment

I pulled down the changes in this PR. I have a mostly default local dev env--very little is set in my my.env file--so it should be running in "AWS mode". When I run ./bin/process_crashes.sh, it fails with an "unbound variable" error. That should get fixed. I provided a diff.
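(For context, an "unbound variable" error is what bash's set -u option produces when a script expands a variable that isn't set; in a default "AWS mode" dev env, CLOUD_PROVIDER is typically unset. Below is a minimal sketch of that failure mode and one possible guard, assuming the script enables set -u; this is an illustration, not the diff provided in this review.)

```bash
#!/usr/bin/env bash
set -euo pipefail  # -u (nounset): expanding an unset variable is a fatal error

# With set -u, a bare expansion like this aborts with
# "CLOUD_PROVIDER: unbound variable" whenever CLOUD_PROVIDER is not set:
#   if [[ "${CLOUD_PROVIDER^^}" == "GCP" ]]; then ...

# One possible guard: fall back to a default value before uppercasing.
CLOUD_PROVIDER="${CLOUD_PROVIDER:-AWS}"
if [[ "${CLOUD_PROVIDER^^}" == "GCP" ]]; then
    echo "queueing crash ids via pubsub"
else
    echo "queueing crash ids via sqs"
fi
```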

There were a couple of other comments about minor things.

I went through and tested the CLOUD_PROVIDER=GCP mode; processing, reprocessing via the API, and reprocessing via the webapp all work fine.

Almost there!

Makefile (review comment resolved)
docker-compose.yml (review comment resolved)
bin/process_crashes.sh (review comment outdated, resolved)
@relud relud requested a review from willkg April 3, 2024 15:42
relud (Member, Author) commented Apr 3, 2024

Ready for another pass.

willkg (Collaborator) left a comment

Looks good--thank you!

I went through this test plan:

  1. with CLOUD_PROVIDER unset
    1. does it build? does it rebuild?
    2. can I verify it's using sqs and not pubsub?
    3. does processing work?
    4. does reprocessing via the api work?
    5. does reprocessing via the webapp work?
  2. with CLOUD_PROVIDER=GCP
    1. does it build? does it rebuild?
    2. can I verify it's using pubsub and not sqs?
    3. does processing work?
    4. does reprocessing via the api work?
    5. does reprocessing via the webapp work?

@@ -51,7 +52,12 @@ mkdir "${DATADIR}" || echo "${DATADIR} already exists."
./bin/socorro_aws_s3.sh ls --recursive "s3://${CRASHSTORAGE_S3_BUCKET}/"

# Add crash ids to queue
./socorro-cmd sqs publish "${SQS_STANDARD_QUEUE}" $@
# ^^ returns CLOUD_PROVIDER value as uppercase
if [[ "${CLOUD_PROVIDER^^}" == "GCP" ]]; then
willkg (Collaborator) commented:

Wow! That's bonkers! I had no idea that existed.
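(For reference, ${VAR^^} is bash's case-modification parameter expansion, available since bash 4. A small standalone illustration:)

```bash
#!/usr/bin/env bash
# Case-modification parameter expansion (bash 4+).
cloud_provider="gCp"

echo "${cloud_provider^^}"   # GCP - uppercase every character
echo "${cloud_provider,,}"   # gcp - lowercase every character
echo "${cloud_provider^}"    # GCp - uppercase only the first character
```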

@relud relud merged commit e03e1fe into main Apr 3, 2024
1 check passed
@relud relud deleted the pubsub-queue branch April 3, 2024 17:52