Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a uid for the kubernetes job creation #1588

Merged
merged 4 commits into from Oct 26, 2023

Conversation

tylerpotts
Copy link
Contributor

Fixes #1554

When running large for-each branches (3000+) Metaflow will sometimes cause job name collisions on kubernetes. This adds a UUID to the prefix which should add sufficient entropy to prevent this from happening.

@elgehelge
Copy link

elgehelge commented Oct 18, 2023

This is quite serious. Running only 25000 jobs within a week results in a 95% chance of error due to job name collisions.

We have run into this problem many times already, and it is super hard to debug because the join-step that fails does not tell you why.

Foot notes:

  • TTL (time to live) is set to 7 days and this cannot be changed without making an extension.
  • Computation of collision chances was computed using the formula from Wikipedia (equation marked with "(1)") like so:
n_possible_job_names = 36**5
n_jobs = 25000
1 - np.prod([(n_possible_job_names - j) / n_possible_job_names for j in range(n_jobs)])

Out[]: 0.9943079470622059

@nflx-mf-bot
Copy link
Collaborator

Testing[528] @ 22ba73f

@nflx-mf-bot
Copy link
Collaborator

Testing[528] @ 22ba73f had 6 FAILUREs.

@@ -178,7 +180,7 @@ def create_job(
job = (
KubernetesClient()
.job(
generate_name="t-",
generate_name="t-{entropy}".format(entropy=secrets.token_hex(2)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_job receives secrets as a parameter, so this won't work. Small change to directly import and use token_hex should fix it.

from secrets import token_hex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work either if Metaflow still plans on supporting python 3.5.x

Running

mamba create -p /tmp/test 'python==3.5.5'
mamba activate /tmp/test
python -c "import secrets"

Results in module not found

@tylerpotts
Copy link
Contributor Author

Went back to UUID to maintain backwards compatibility with python 3.5. Using 3 characters since UUID doesn't have as large a character set as secrets

@elgehelge
Copy link

Went back to UUID to maintain backwards compatibility with python 3.5. Using 3 characters since UUID doesn't have as large a character set as secrets

If I am not mistaken, the character set is the same. Both things are hexadecimal. secrets.token_hex(2) returns 4 characters. And the current code using str(uuid4())[:3] only return 3 characters. Is that on purpose?

@elgehelge
Copy link

I really really appreciate that all of you are pushing this forward. High five 🙌
But I don't understand why you don't want to make it scalable? With this change, we get 36**5 * 16**3 possible job names, which means that running only 25.000 jobs per day will give you a chance of collisions of 6% over the course of a week. With 4 characters as originally proposed, it is 0.39% which is somewhat acceptable if we think 25000 per day is our upper limit. But isn't that quite low?

Apologies, if you find me annoying, I am really just trying to help and avoid future Github issues. As I say, I appreciate what you are doing.

@roofurmston
Copy link

roofurmston commented Oct 25, 2023

I really really appreciate that all of you are pushing this forward. High five 🙌 But I don't understand why you don't want to make it scalable? With this change, we get 36**5 * 16**3 possible job names, which means that running only 25.000 jobs per day will give you a chance of collisions of 6% over the course of a week. With 4 characters as originally proposed, it is 0.39% which is somewhat acceptable if we think 25000 per day is our upper limit. But isn't that quite low?

Apologies, if you find me annoying, I am really just trying to help and avoid future Github issues. As I say, I appreciate what you are doing.

I was wondering the same, to be fair.

It is still not clear to me why a unix timestamp wouldn't just solve the issue, to be honest,

If you don't want to take that approach though, surely using more characters from the UID would be preferable?

@saikonen saikonen merged commit 15cf84b into Netflix:master Oct 26, 2023
22 checks passed
@tylerpotts
Copy link
Contributor Author

@elgehelge Not finding you annoying at all!

Looks like I made 2 missteps when pushing my latest changes.

  1. I misread the secrets code and assumed the 2 in secrets.token_hex(2) meant it was returning 2 characters
  2. Didn't notice that secrets also uses hexadecimal

It was actually my attempt to increase rather than decrease when using 3 characters

@roofurmston

It is still not clear to me why a unix timestamp wouldn't just solve the issue, to be honest,

I spoke with @savingoyal yesterday and now understand that changes to the codebase need to be carefully considered in order to maintain the stability of the project, particularly because Metaflow interacts with so many external systems (AWS Batch, Kubernetes, local machines, etc.) An example of a thorny problem when it comes to UID length is an active issue with Argo Workflows: #1521. A timestamp would bump us up to 10 characters, and it would limit our character set to numbers (10 possibilities vs 16 per character.)

Looks like @saikonen bumped the UID up to 8 characters yesterday, so we should be good to go with entropy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Name collisions on kubernetes jobs cause flow to fail
7 participants