Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transcode queue insights into health check and integrate with service selection #1758

Merged
merged 14 commits into from
Aug 13, 2021

Conversation

dmanjunath
Copy link
Contributor

@dmanjunath dmanjunath commented Aug 12, 2021

Description

Add criteria in content node selection to sort healthy nodes by transcode queue utilization to better balance load across different content nodes

Tests

Added unit tests, testing manually now.

How will this change be monitored?

Amplitude has creator node selection graphs

@dmanjunath dmanjunath marked this pull request as ready for review August 13, 2021 18:53
@dmanjunath dmanjunath changed the title Add transcode queue insights into health check Add transcode queue insights into health check and integrate with service selection Aug 13, 2021
Copy link
Member

@raymondjacobson raymondjacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great. we'll cut libs EOD today. i can take care of that

libs/src/utils/network.js Outdated Show resolved Hide resolved

// Sort by content node transcode queue load
// defined as the ratio of (active + waiting transcodes) / (transcode slots, same as number of cpu cores)
const healthRatioA = ((a.response.data.data.transcodeActive + a.response.data.data.transcodeWaiting) / a.response.data.data.numberOfCPUs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait this method is shared isn't it? should we be checking on whether these fields exist?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for backwards compatibility, this would throw if these fields don't exist right? we can't release this until all cnodes update. i think the libs PR should be separated from the cnode PR in that case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if one/all of these fields don't exist or are 0, as far as the sort is concerned it does nothing so it should pass through to the time tiebreaker

also it doesn't error if these fields don't exist since pulling a nonexistent property off an object is undefined. it's only when you pull a property off another property that doesn't exist it errors. but i can add validation to make it cleaner

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const healthRatioA = ((a.response.data.data.transcodeActive + a.response.data.data.transcodeWaiting) / a.response.data.data.numberOfCPUs)

this would be undefined / some value

won't that throw?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it actually doesn't, it becomes NaN, but let me just add some checking. i'm also happy to separate but I don't think we need to. i think just defaulting the ratio to 1 to simulate all transcode slots taken should be sufficient. seems like the easiest way to ensure backwards/forwards compat

in the case that a node upgrades, we'll get the real data and since it's likely to be lower than 1 it'll be selected.
in the case everyone upgrades, we'll have accurate data
in the case nobody upgrades, it'll be the same for everyone and the tie breaker will be time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default sounds reasaonble

Copy link
Member

@raymondjacobson raymondjacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be separating the PRs?

@@ -19,7 +19,7 @@ const MIN_FILESYSTEM_SIZE = 1950000000000 // 1950 GB of file system storage
* @param {string?} randomBytesToSign optional bytes string to be included in response object
* and used in signature generation
*/
const healthCheck = async ({ libs } = {}, logger, sequelize, getMonitors, numberOfCPUs, randomBytesToSign = null) => {
const healthCheck = async ({ libs } = {}, logger, sequelize, getMonitors, numberOfCPUs, getTranscodeQueueJobs, randomBytesToSign = null) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume you meant to change healthCheckController call to this function and set this param to false? either way i think we should just revert this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah since you never consume it i guess you actualy meant to revert this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah leftover relic, will remove

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks great, am concerned about a couple scenarios though

libs/src/utils/network.js Show resolved Hide resolved
// If same version, do a tie breaker on the response time

// Sort by content node transcode queue load
// defined as the ratio of (active + waiting transcodes) / (transcode slots, same as number of cpu cores)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing

}

// lower is better, means less transcode queue utilization
// default ratio for transcode count and num cpus if node doesn't have one of those defined is 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this leads to unintended behavior if one or both nodes haven't defined transcode counts
CN1 = 8 numCPUs
CN2 = 4 numCPUs -> 20 transcodeCount
outcome: 8/8 = 1 & 20/4 = 5; 1 < 5; CN1 preferred. that seems incorrect? what if CN1 has 50 transcodeCount

feels like we should only consume transcodeCounts if both nodes provide them, else only use numCPUs which we know all nodes provide

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my rationale was if one reports transcode count and the other doesnt, it's likely behind in version so the higher version will get placed ahead anyway. this case should only really get hit if both nodes don't have something

@raymondjacobson raymondjacobson mentioned this pull request Aug 13, 2021
@@ -143,6 +143,9 @@ const healthCheckVerbose = async ({ libs, snapbackSM } = {}, logger, sequelize,
currentSnapbackReconfigMode = snapbackSM.highestEnabledReconfigMode
}

const { active: transcodeActive, waiting: transcodeWaiting } = await getTranscodeQueueJobs()
console.log('transcodeActive', transcodeActive, transcodeWaiting)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log

}
console.log(response.transcodeActive, response.transcodeWaiting, 'respszz')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log

@SidSethi
Copy link
Contributor

looks like some test code so will hold off reviewing ¨ntil you re-request

@dmanjunath dmanjunath merged commit 355bc0f into master Aug 13, 2021
@dmanjunath dmanjunath deleted the dm-monitor-transcode-queue-length branch August 13, 2021 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants