-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add transcode queue insights into health check and integrate with service selection #1758
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great. we'll cut libs EOD today. i can take care of that
libs/src/utils/network.js
Outdated
|
||
// Sort by content node transcode queue load | ||
// defined as the ratio of (active + waiting transcodes) / (transcode slots, same as number of cpu cores) | ||
const healthRatioA = ((a.response.data.data.transcodeActive + a.response.data.data.transcodeWaiting) / a.response.data.data.numberOfCPUs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait this method is shared isn't it? should we be checking on whether these fields exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for backwards compatibility, this would throw if these fields don't exist right? we can't release this until all cnodes update. i think the libs PR should be separated from the cnode PR in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if one/all of these fields don't exist or are 0, as far as the sort is concerned it does nothing so it should pass through to the time tiebreaker
also it doesn't error if these fields don't exist since pulling a nonexistent property off an object is undefined. it's only when you pull a property off another property that doesn't exist it errors. but i can add validation to make it cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const healthRatioA = ((a.response.data.data.transcodeActive + a.response.data.data.transcodeWaiting) / a.response.data.data.numberOfCPUs)
this would be undefined / some value
won't that throw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it actually doesn't, it becomes NaN, but let me just add some checking. i'm also happy to separate but I don't think we need to. i think just defaulting the ratio to 1 to simulate all transcode slots taken should be sufficient. seems like the easiest way to ensure backwards/forwards compat
in the case that a node upgrades, we'll get the real data and since it's likely to be lower than 1 it'll be selected.
in the case everyone upgrades, we'll have accurate data
in the case nobody upgrades, it'll be the same for everyone and the tie breaker will be time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default sounds reasaonble
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be separating the PRs?
@@ -19,7 +19,7 @@ const MIN_FILESYSTEM_SIZE = 1950000000000 // 1950 GB of file system storage | |||
* @param {string?} randomBytesToSign optional bytes string to be included in response object | |||
* and used in signature generation | |||
*/ | |||
const healthCheck = async ({ libs } = {}, logger, sequelize, getMonitors, numberOfCPUs, randomBytesToSign = null) => { | |||
const healthCheck = async ({ libs } = {}, logger, sequelize, getMonitors, numberOfCPUs, getTranscodeQueueJobs, randomBytesToSign = null) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume you meant to change healthCheckController
call to this function and set this param to false? either way i think we should just revert this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah since you never consume it i guess you actualy meant to revert this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah leftover relic, will remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks great, am concerned about a couple scenarios though
libs/src/utils/network.js
Outdated
// If same version, do a tie breaker on the response time | ||
|
||
// Sort by content node transcode queue load | ||
// defined as the ratio of (active + waiting transcodes) / (transcode slots, same as number of cpu cores) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing
libs/src/utils/network.js
Outdated
} | ||
|
||
// lower is better, means less transcode queue utilization | ||
// default ratio for transcode count and num cpus if node doesn't have one of those defined is 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this leads to unintended behavior if one or both nodes haven't defined transcode counts
CN1 = 8 numCPUs
CN2 = 4 numCPUs -> 20 transcodeCount
outcome: 8/8 = 1 & 20/4 = 5; 1 < 5; CN1 preferred. that seems incorrect? what if CN1 has 50 transcodeCount
feels like we should only consume transcodeCounts if both nodes provide them, else only use numCPUs which we know all nodes provide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my rationale was if one reports transcode count and the other doesnt, it's likely behind in version so the higher version will get placed ahead anyway. this case should only really get hit if both nodes don't have something
@@ -143,6 +143,9 @@ const healthCheckVerbose = async ({ libs, snapbackSM } = {}, logger, sequelize, | |||
currentSnapbackReconfigMode = snapbackSM.highestEnabledReconfigMode | |||
} | |||
|
|||
const { active: transcodeActive, waiting: transcodeWaiting } = await getTranscodeQueueJobs() | |||
console.log('transcodeActive', transcodeActive, transcodeWaiting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log
} | ||
console.log(response.transcodeActive, response.transcodeWaiting, 'respszz') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log
looks like some test code so will hold off reviewing ¨ntil you re-request |
Description
Add criteria in content node selection to sort healthy nodes by transcode queue utilization to better balance load across different content nodes
Tests
Added unit tests, testing manually now.
How will this change be monitored?
Amplitude has creator node selection graphs