Conversation
@@ -45,6 +67,11 @@ def handle_pinpoint_error(err) | |||
) | |||
end | |||
|
|||
def notify_pinpoint_failover(error) | |||
# TODO: log some sort of message? | |||
Telephony.config.logger.warn "error region: region" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @pauldoomgov can you help me figure out what kind of warning, maybe even what format, we should log a warning here when we fail over? This will end up in telephony.log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thus far it seems we log one JSON line per-SMS. We don't have a region defined. So first I would say extend the normal message:
{
"success": true,
"errors": {},
"request_id": "61526818-bd93-476b-bb6b-6c34efaf9655",
"delivery_status": "SUCCESSFUL",
"message_id": "s0f82mp1v3c9ultqqvjpv0gc6t7f4khsf4vqat80",
"status_code": 200,
"status_message": "MessageId: s0f82mp1v3c9ultqqvjpv0gc6t7f4khsf4vqat80",
"adapter": "pinpoint",
"region": "us-west-2",
"channel": "sms",
"context": "authentication"
}
And then two variations. The first would be if a failover occurred, but the message was delivered:
{
"success": true,
"errors": {
"region_failover": {
"status_code": "500",
"region": "us-west-1",
"status_message": "Uhhhh.... SMS don't feel so goood...."
}
},
"request_id": "61526818-bd93-476b-bb6b-6c34efaf9655",
"delivery_status": "SUCCESSFUL",
"message_id": "s0f82mp1v3c9ultqqvjpv0gc6t7f4khsf4vqat80",
"status_code": 200,
"status_message": "MessageId: s0f82mp1v3c9ultqqvjpv0gc6t7f4khsf4vqat80",
"adapter": "pinpoint",
"region": "us-east-1",
"channel": "sms",
"context": "authentication"
}
And another if both failed:
{
"success": false,
"errors": {
"region_failover": {
"status_code": "500",
"region": "us-west-1",
"status_message": "Uhhhh.... SMS don't feel so goood...."
},
"other_stuff_that_normally_goes_here": "yep"
},
"request_id": "61526818-bd93-476b-bb6b-6c34efaf9655",
"delivery_status": "FAILURE",
"message_id": "s0f82mp1v3c9ultqqvjpv0gc6t7f4khsf4vqat80",
"status_code": 500,
"status_message": "Dunno... Maybe try sending a telegram?",
"adapter": "pinpoint",
"region": "us-east-1",
"channel": "sms",
"context": "authentication"
}
This should prevent double-counting message failures or (maybe?) breaking any existing parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already log the message delivery status, so this is a separate message, if we want one.
c.twilio.voice_callback_encryption_key = Base64.strict_encode64('0' * 32) | ||
c.twilio.voice_callback_base_url = 'https://example.com/api/voice' | ||
|
||
c.pinpoint.add_sms_config do |sms| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a prototype of the IDP code that will use this new config stuff. It will check for a newer variable that's an array, otherwise it will fall back and add a single config based on the old variables
Heads up, there are some instructions about using this thing in the README. I think those are out of date after this change. |
Incredible timing |
lib/telephony/pinpoint/sms_sender.rb
Outdated
def client_configs | ||
@client_configs ||= Telephony.config.pinpoint.sms_configs.map do |sms_config| | ||
credentials = AwsCredentialBuilder.new(sms_config).call | ||
args = { region: sms_config.region, retry_limit: 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing the retry_limit param here reminds me of a caveat.
IIRC the request here has a 5 second timeout. With 1 retry that means each config has 10 seconds total is can spend sending an SMS. If Pinpoint is timing out that means 2 configs will take 20 seconds. That will push the user's request to the IdP outside the 15 second timeout window.
Since we have a failover, does it makes sense to go to no retries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah happy to remove that.
Also that reminded me I was talking to @pauldoomgov about setting a timeout that covers our 95th percentile, it I think it's much less than 5 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny thing - NewRelic seems to think all we need is an average :( So I am not sure what our p95 is on PinPoint calls. I would guess 1.5s
would be a sane starting point. That would give us 12sec if we fail, retry, failover+fail, retry. I think @jmhooper has a point about not retrying on failover.
One other item: Might be good to add timing output to the telephony output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you still have 2 CC issues to cleanup but LGTM otherwise
Yup! Waiting on some feedback from Paul, so the TODOs are intentionally failing the build for now :) Thanks for the review! |
Goal: fail over to a backup pinpoint region if the first region errors
Approach: Set up the gem to accept an array of SMS and Voice configs. When making an SMS or Voice call, iterate over each of the configs, returning after the first successful one (otherwise fall back on the next one)
The gem was relying on a singleton config variable and TBH I'm not a fan (especially not a fan of all the stubbing in specs), but I tried to keep my changes to a minimum.