Fix issues in containerd-driver #49

Merged: 6 commits, Jan 6, 2021

@shishir-a412ed commented Dec 18, 2020

We found three issues while testing containerd failure scenarios, i.e. how nomad-driver-containerd handles containerd going down or restarting.

Issues:

  1. When containerd-driver makes a gRPC call to containerd, e.g. during fingerprinting, a context timeout must be set. Right now there is no timeout, so if containerd goes down the call never returns, leaving containerd-driver hung.

  2. handleWait() triggers a nil pointer dereference: when it sends an empty return to the Nomad client, the client dereferences a nil pointer.

  3. Task recovery fails if nomad/nomad-driver-containerd restarts. This might also be related to the networking error we are observing:

Dec 17 18:47:14 ip-10-102-98-114 nomad[27030]:  client.driver_mgr.containerd-driver: HELLO: RecoverTask: Failed to decode driver config: driver=containerd-driver @module=containerd-driver timestamp=2020-12-17T18:47:14.167Z
Dec 17 18:47:14 ip-10-102-98-114 nomad[27030]: client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=2d6b365f-cc89-b230-2af9-a37ccbf0f6c8 task=adaas-task error="rpc error: code = Unknown desc = failed to decode driver config: EOF" task_id=2d6b365f-cc89-b230-2af9-a37ccbf0f6c8/adaas-task/43e03bbd
Dec 17 18:47:14 ip-10-102-98-114 nomad[27030]:  client.alloc_runner.task_runner: error destroying unrecoverable task: alloc_id=2d6b365f-cc89-b230-2af9-a37ccbf0f6c8 task=adaas-task error="rpc error: code = Unknown desc = task not found for given id" task_id=2d6b365f-cc89-b230-2af9-a37ccbf0f6c8/adaas-task/43e03bbd
2020-12-16T17:20:11-08:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to create network for alloc: open /var/run/netns/d30ebc57-6ed4-9dc0-c604-e2fea98257f1: operation not permitted

This PR addresses (1) and (2). I will open a separate PR for Issue (3).

@shishir-a412ed

@shivdudhani Can you please review?

@shivdudhani left a comment

LGTM!

@shishir-a412ed merged commit 0c1a1bb into master Jan 6, 2021
@github-actions bot locked and limited conversation to collaborators Jan 6, 2021
@shishir-a412ed deleted the fix_issues branch January 6, 2021 00:51
2 participants