Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MANTA-3311 mako_gc carefully checks lock file but barrels on if it can't create one #4

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

joyent-automation
Copy link

MANTA-3311 mako_gc carefully checks lock file but barrels on if it can't create one

This PR was migrated-from-gerrit, https://cr.joyent.us/#/c/2348/.
The raw archive of this CR is here.
See MANTA-4594 for info on Joyent Eng's migration from Gerrit.

CR discussion

@timkordas commented at 2017-08-09T23:56:42

Uploaded patch set 2: Commit message was updated.

@davepacheco commented at 2017-08-10T18:51:41

Patch Set 3:

(2 comments)

Thanks for doing this.

Patch Set 3 code comments
bin/mako_gc.sh#200 @davepacheco

I'm not sure what type of value this is supposed to be, and what exactly we're checking at line 207. Is it that this string is non-empty? Or that grep returned 0 (indicating it matched lines)?

I think it might be cleaner to check [[ -o noclobber ]] here and predicate on that at L207. What do you think?

bin/mako_gc.sh#200 @timkordas

yep. you are better at bash than I am. I will push a new set.

bin/mako_gc.sh#208 @davepacheco

The indent here does not seem to match the rest of the file.

@timkordas commented at 2017-08-10T19:03:51

Patch Set 3:

(1 comment)

@davepacheco commented at 2017-08-10T20:50:25

Patch Set 4: Code-Review+1

@timkordas commented at 2017-08-21T23:32:56

Uploaded patch set 5: Commit message was updated.

@davepacheco commented at 2017-08-23T16:30:24

Patch Set 5:

Hey Tim, are you waiting for IA on this? What kind of testing have you done already?

@timkordas commented at 2017-08-23T16:51:06

Patch Set 5:

Hey Tim, are you waiting for IA on this? What kind of testing have
you done already?

I've tested the changed no-clobber piece in isolation. I've tried testing the GC itself -- but my test environment doesn't have garbage to collect. If there is a standard way to test mako's GC, I'd love to hear about it.

@davepacheco commented at 2017-08-23T19:11:29

Patch Set 5:

Hey Tim, are you waiting for IA on this? What kind of testing
have
you done already?

I've tested the changed no-clobber piece in isolation. I've tried
testing the GC itself -- but my test environment doesn't have
garbage to collect. If there is a standard way to test mako's GC,
I'd love to hear about it.

I think every test environment will clean up some garbage each day. You should be able to see it by looking at the nightly mako_gc logs. If you truly have none (which would be surprising, because we run jobs every day that create intermediate objects that should be deleted), you could always create a bunch of data and then remove it.

This is the testing that I would probably do in a test deployment:

  • Deploy this script to at least one storage zone and make sure GC works as normal there. That doesn't exercise the changes itself, but is a worthwhile smoke check.
  • I'd also look at a mako_gc log from that run and compare it to a log from a previous run and make sure it looks like what I'd expect, particularly the xtrace output around the lock file logic.
  • Test the original problem described in the ticket: fill up /tmp and wait for mako_gc to run (or run it by hand). After failing to create the lock file, it should bail out with a clear error message instead of proceeding as it did in the initial report. After freeing up the disk space and running it again, it should work fine.
  • Kill the script and its children ungracefully (maybe with SIGKILL) and make sure it doesn't fail the next time it's run just because the lock file is still around. This is extra paranoid because I don't think your changes affected this.
  • Verify the lock file behavior in the context of the script. One way would be to see how long mako_gc normally takes (from its nightly logs) and then running the program by hand during that period and make sure that the second invocation bails out with a clear error message. It may be easier to add an extra crontab entry that executes the script in the same way at the same time, but with output redirected to a different file. One of the runs should fail, and the other should not.
  • make prepush in this repository, if that's pretty easy. It's pretty unlikely that that would be affected by this change, but I generally run the the tests anyway because it's supposed to be easy to do.

That said, I tend to be a little paranoid about introducing breakage. Some of this may be overkill. That said, I think all of that should be reasonably straightforward, and given the importance of this process and its history of silent failures in edge cases that we do hit in production, I think it's worth proactively verifying end-to-end that it's working as we expect, even in those edge cases.

@davepacheco commented at 2017-09-05T19:02:37

Patch Set 6: Code-Review+1

@timkordas commented at 2017-09-13T23:42:50

Uploaded patch set 7: Commit message was updated.

@timkordas commented at 2017-09-13T23:43:08

Patch Set 8: Patch Set 7 was rebased

@timkordas commented at 2017-10-04T22:38:30

Patch Set 9: Patch Set 8 was rebased

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant