Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deal correctly with errors when archiving logfiles #27

Merged
merged 1 commit into from
Sep 3, 2020

Conversation

rtobar
Copy link
Contributor

@rtobar rtobar commented Sep 3, 2020

Before logfiles are rotated (i.e., removed) a series of processing tasks are performed onto them. Among these, the first one is to archive the logfile into the NGAS server itself, if the server has been configured for this. During normal operations this is not a problem, but exactly on the first try, when the janitor process has just been created and the HTTP server might not be bound yet, it might result on an ECONNREFUSED error. This produced big error messages on the logs, while in reality this is a transient error that should disappear on the next try.

A second, more general problem, was found while inspecting this code: logfiles were not renamed back to have their original ".unsaved" extensions when errors happened. This meant that when errors in general were found (and in particular when ECONNREFUSED was raised) logfiles were not picked up by successive janitor cycles.

This commit acknowledges these problems, improving the handling of the ECONNREFUSED error in particular, and of errors in general. On the on hand, when the ECONNREFUSED error is encountered we simply issue a warning log statement instead of letting the exception to propagate up through the stack. On the other hand, if any error happens during archiving we rename the file back to its original *.unsaved name so it gets picked up again in the next janitor cycle.

To make code and error handling a bit simpler I took the chance of moving the archiving of files into a separate try_archiving() function, whose invocation is then surrounded by the error handling block. Additionally I also added a sorted() call to process unsaved logfiles in time order, which until now wasn't guaranteed (and is a nice property to have).

This addresses #26.

Before logfiles are rotated (i.e., removed) a series of processing tasks
are performed onto them. Among these, the first one is to archive the
logfile into the NGAS server itself, if the server has been configured
for this. During normal operations this is not a problem, but exactly on
the first try, when the janitor process has just been created and the
HTTP server might not be bound yet, it might result on an ECONNREFUSED
error. This produced big error messages on the logs, while in reality
this is a transient error that should disappear on the next try.

A second, more general problem, was found while inspecting this code:
logfiles were not renamed back to have their original ".unsaved"
extensions when errors happened. This meant that when errors in general
were found (and in particular when ECONNREFUSED was raised) logfiles
were not picked up by successive janitor cycles.

This commit acknowledges these problems, improving the handling of the
ECONNREFUSED error in particular, and of errors in general. On the on
hand, when the ECONNREFUSED error is encountered we simply issue a
warning log statement instead of letting the exception to propagate up
through the stack. On the other hand, if *any* error happens during
archiving we rename the file back to its original *.unsaved name so it
gets picked up again in the next janitor cycle.

To make code and error handling a bit simpler I took the chance of
moving the archiving of files into a separate try_archiving() function,
whose invocation is then surrounded by the error handling block.
Additionally I also added a sorted() call to process unsaved logfiles in
time order, which until now wasn't guaranteed (and is a nice property to
have).

This commit addresses #26.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@coveralls
Copy link

coveralls commented Sep 3, 2020

Coverage Status

Coverage decreased (-0.08%) to 68.531% when pulling c6556bc on janitor-thread-startup into 3cf85a3 on master.

Comment on lines -68 to +85
mvFile(unsaved, fname)

# Connect to the server and send a pull ARCHIVE request
if cfg.getArchiveRotatedLogfiles():
file_uri = "file://" + fname
host, port = srvObj.get_self_endpoint()
proto = srvObj.get_server_access_proto()
ngamsPClient.ngamsPClient(host, port, proto=proto).archive(
file_uri, 'ngas/nglog')
try:
mvFile(unsaved, fname)
try_archiving(cfg, srvObj, fname)
except Exception as e:
mvFile(fname, unsaved)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our conversation, is it necessary to first mvFile then on a possible exception restore? Is it more worthwhile to mvFile only after the successful archive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I absolutely agree that the first move is not required in principle, and would make exception handling simpler. However, the ngamsPClient.archive method doesn't allow to pass down arbitrary filenames, and will always use the basename of the file on disk that is being archived.

As discussed, I'll just go ahead then with these changes, but will create an issue to remember implementing this improvement in the future.

@rtobar rtobar merged commit c6556bc into master Sep 3, 2020
@rtobar rtobar deleted the janitor-thread-startup branch September 3, 2020 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants