
Acquiring running Dataflow streaming jobs fails #283

Closed
deverant opened this issue Sep 23, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@deverant

Describe the bug
ConfigConnector fails to acquire a running Dataflow streaming job, even in cases where it originally created it. This was discovered while recreating the k8s cluster, after which the Dataflow streaming jobs got this error:

Update call failed: error applying desired state: googleapi: Error 409: (4584d01fb86b1927): The workflow could not be created. Causes: (4dbea8013f4eb866): There is already an active job named XXX. If you want to submit a second job, try again by setting a different name., alreadyExists

ConfigConnector Version
1.20.1

To Reproduce
Steps to reproduce the behavior:

  1. Create a new Dataflow streaming job using kubectl apply
  2. Add the deletion-policy annotation: kubectl annotate dataflowjobs dataflowjob-sample-streaming cnrm.cloud.google.com/deletion-policy=abandon
  3. Delete the resource: kubectl delete dataflowjobs dataflowjob-sample-streaming
  4. Reapply the resource using kubectl apply
  5. Describe the resource with kubectl describe dataflowjobs dataflowjob-sample-streaming and see the error mentioned in the description of the bug.

YAML snippets:
You can reproduce this using the sample given here: https://cloud.google.com/config-connector/docs/reference/resource-docs/dataflow/dataflowjob#streaming_dataflow_job
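
For reference, here is a rough sketch of what the linked streaming sample looks like, with the deletion-policy annotation from step 2 declared inline (the bucket, topic, and table values are placeholders; see the linked docs for the exact manifest):

```yaml
# Rough sketch of the linked streaming sample; all values are placeholders.
apiVersion: dataflow.cnrm.cloud.google.com/v1beta1
kind: DataflowJob
metadata:
  name: dataflowjob-sample-streaming
  annotations:
    # Equivalent to the kubectl annotate command in step 2.
    cnrm.cloud.google.com/deletion-policy: abandon
spec:
  tempGcsLocation: gs://my-bucket/tmp                                  # placeholder
  templateGcsPath: gs://dataflow-templates/latest/PubSub_to_BigQuery   # placeholder
  parameters:
    inputTopic: projects/my-project/topics/my-topic                    # placeholder
    outputTableSpec: my-project:my_dataset.my_table                    # placeholder
```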

@deverant added the bug label Sep 23, 2020
@jcanseco
Member

Hi @deverant, we don't actually support acquisition of Dataflow jobs in general currently. How important would you consider job acquisition for your use-case (i.e. is it a nice-to-have, a friction point, or a blocker)?

@deverant
Author

This would be a blocker for us, since we couldn't really use Config Connector as a way to manage and update the jobs automatically. Every time the k8s cluster would get recreated, we would need to drain all Dataflow jobs manually and then reapply to be able to get them back under management.

If fully automatic acquisition is not possible, it would be great to at least have some labels to control what happens when the job name already exists.

@jcanseco
Member

Gotcha, can you please help me better understand your use-case?

Are you looking to do frequent, mass acquisitions of running jobs? Or do you just need to be able to acquire jobs occasionally? You mentioned "Every time the k8s cluster would get recreated" -- is this something you expect to happen frequently and normally?

@deverant
Author

I guess this was primarily a surprise, since I didn't find any documentation around it. https://cloud.google.com/config-connector/docs/how-to/managing-deleting-resources#creating_a_resource makes it look like acquiring resources is something that KCC tries to do. This is why I considered this a bug to begin with.

If we adopt this, we would have hundreds of Dataflow streaming jobs from multiple teams being managed by it. If for any reason a job got "desynced" (cluster recreated, resource deleted and reapplied), we would always need to drain the job to redeploy it. This would also mean we would need to drain all jobs as we adopt this, to get them moved over to KCC.

Granted, the reasons for a job getting desynced should normally be rare, but they do happen (I already know of cluster recreations that are planned for the future), and having to go and fix hundreds of resources manually to get them back under KCC management seems like a significant operational overhead.

@jcanseco
Member

Most of our resources do support acquisition. DataflowJob is one of a few that don't, since its identifier (job ID) is server-generated upon resource creation. This is in contrast to most resources like SpannerInstance whose identifiers are user-configured (the instance name in this case).

We're actively working on a feature that would allow users to acquire resources with server-generated identifiers by setting a "resource ID" field in the spec equal to the resource's identifier. Example: it would allow you to acquire a Folder by specifying the folder ID.

Do you think something like the above for DataflowJob would work for your use-case? You would have to modify your YAMLs to have the appropriate server-generated identifiers to allow for acquisition, but it would at least remove the need to drain the jobs.
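
For illustration only (the field name and exact shape are not final, and the job ID below is a made-up placeholder), the idea is roughly:

```yaml
# Illustration only: the field is not final and the job ID is a placeholder.
apiVersion: dataflow.cnrm.cloud.google.com/v1beta1
kind: DataflowJob
metadata:
  name: dataflowjob-sample-streaming
spec:
  resourceID: "2020-09-23_03_14_15-1234567890123456789"  # the running job's server-generated ID
  # ...the rest of the spec stays the same as your existing manifest...
```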

Please note that we'd have to investigate this more deeply though, since Dataflow jobs have strange identifier semantics in general, but we can look into the above if it fits your use-case.

@deverant
Author

Being able to specify the exact resource id to acquire would help in the sense that it would at least allow us to build tooling around that to reacquire the jobs when necessary. I think it would at least need to be a mutable field that we can modify after the DataflowJob has been added to the cluster and the operator should react to changes to that field.

Since the job name is unique in Dataflow (in the sense that you can only have one running at any time) (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job), would it not make sense to use that as the resource identifier (only matches if the job is running)? On a quick look that seems very similar to the other user-defined identifiers. Maybe there are some edge cases or implementation hurdles I am not seeing?

@jcanseco
Member

jcanseco commented Sep 26, 2020

Being able to specify the exact resource id to acquire would help in the sense that it would at least allow us to build tooling around that to reacquire the jobs when necessary.

Gotcha, thanks for confirming. Please give us some time to look into it then.

I think it would at least need to be a mutable field that we can modify after the DataflowJob has been added to the cluster and the operator should react to changes to that field.

Can you please elaborate why you'd need the "resource ID" field to be mutable? The way "resource ID" is intended to work right now for resources that are planned to support it is that you can only set it on resource creation/acquisition, since it doesn't really make much sense to allow for updates to the unique identifier of a resource.

We would still expose the job ID via status.jobId, so your operator should still be able to watch that field for any changes. (It remains to be discussed internally whether we're OK with having KCC automatically update the "resource ID" field with any changes as well, since the current expectation is that "resource ID" stays static for all resources.)
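
For example, something like this would let your tooling read the job ID that KCC observed (resource name taken from the reproduction steps above):

```sh
# Read the server-generated job ID that KCC surfaces in the resource's status.
kubectl get dataflowjobs dataflowjob-sample-streaming -o jsonpath='{.status.jobId}'
```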

Since the job name is unique in Dataflow (in the sense that you can only have one running at any time) (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job), would it not make sense to use that as the resource identifier (only matches if the job is running)? On a quick look that seems very similar to the other user-defined identifiers. Maybe there are some edge cases or implementation hurdles I am not seeing?

Yeah, I definitely see where you're coming from. Long story short, it somewhat conflicts with how KCC generally works, which aims to present a 1:1 mapping between K8s and GCP resources. Dataflow job "names" behave more like collection identifiers than resource identifiers, so mapping K8s resources to jobs by name has important implications for the product's consistency (not to mention implementation complexity and UX edge cases). Generally, we strive to avoid absorbing inconsistencies into KCC, since consistency is an important part of our product UX and helps keep complexity manageable for our users.

That said, we have no intention of ignoring the problem. We want to improve the UX around DataflowJob so that it can more closely match how you expect to use them. We just need to spend some time thinking about how we can better (re)model Dataflow jobs in KCC so that we can better support your use-cases without adding too much complexity.

@deverant
Author

Can you please elaborate why you'd need the "resource ID" field to be mutable? The way "resource ID" is intended to work right now for resources that are planned to support it is that you can only set it on resource creation/acquisition, since it doesn't really make much sense to allow for updates to the unique identifier of a resource.

I was mainly thinking that one way to implement this would be to have something that basically maps the k8s resource to the running job and then allows acquisition to happen. I guess the other option would be to figure out the resource ID before applying the resource, with the trade-off that we would need to add this logic to all the build pipelines where resources are being applied. Today we have no operator and are hoping to use KCC directly to manage the streaming jobs for us. Having to build our own operator to do these mappings probably means we could also just call the Dataflow API directly and skip KCC.
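
For example (a sketch only; the region and job name are placeholders), every pipeline would have to look up the running job's ID and substitute it into the manifest before kubectl apply, something like:

```sh
# Sketch: find the ID of the currently running job with a given name,
# so it can be injected into the manifest before applying it.
gcloud dataflow jobs list \
  --region=europe-west1 \
  --status=active \
  --filter="name=dataflowjob-sample-streaming" \
  --format="value(id)"
```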

That said, we have no intention of ignoring the problem. We want to improve the UX around DataflowJob so that it can more closely match how you expect to use them. We just need to spend some time thinking about how we can better (re)model Dataflow jobs in KCC so that we can better support your use-cases without adding too much complexity.

Thanks. I am really hoping I can leverage KCC fully for this use-case. In its current state, I feel like I am writing YAML files to do one-off API calls instead of actually managing a resource. I can basically use KCC to launch a job and, in some cases, update or delete it. For something long-running like streaming jobs, that feels inadequate.

For streaming, I still think using the name of the currently running job makes the most sense as a way of mapping these. The name is how we identify this long-running process and separate it from others. And since you can't edit/relaunch/modify a finished job, it doesn't seem plausible that I would want to manage any job other than the one that is currently running, if any. If anything, having to define a resourceId seems like the edge case (in case I ever want to link to an already cancelled job?) rather than the normal scenario.

@caieo
Contributor

caieo commented Dec 11, 2020

Hi @deverant, we just added support for acquiring running DataflowJobs in ConfigConnector v1.33.0 (release notes). Please update this thread if you run into any issues.
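
With v1.33.0 or later, re-running steps 4-5 from the reproduction above should acquire the running job instead of hitting the 409 error. A quick way to verify (the manifest filename is a placeholder):

```sh
# Reapply the manifest and confirm the resource reports the running job
# instead of the "already exists" error.
kubectl apply -f dataflowjob-sample-streaming.yaml
kubectl describe dataflowjobs dataflowjob-sample-streaming
```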
