Deal correctly with errors when archiving logfiles #27

rtobar · 2020-09-03T07:11:17Z

Before logfiles are rotated (i.e., removed) a series of processing tasks are performed on them. The first of these is to archive the logfile into the NGAS server itself, if the server has been configured for this. During normal operations this is not a problem, but on the very first try, when the janitor process has just been created and the HTTP server might not be bound yet, it might result in an ECONNREFUSED error. This produced big error messages in the logs, while in reality this is a transient error that should disappear on the next try.

A second, more general problem was found while inspecting this code: logfiles were not renamed back to their original ".unsaved" extension when errors happened. This meant that whenever an error occurred (and in particular when ECONNREFUSED was raised) the affected logfiles were not picked up by successive janitor cycles.

This commit addresses these problems, improving the handling of the ECONNREFUSED error in particular, and of errors in general. On the one hand, when the ECONNREFUSED error is encountered we simply issue a warning log statement instead of letting the exception propagate up through the stack. On the other hand, if any error happens during archiving we rename the file back to its original *.unsaved name so it gets picked up again in the next janitor cycle.

To make the code and error handling a bit simpler I took the opportunity to move the archiving of files into a separate try_archiving() function, whose invocation is then surrounded by the error handling block. I also added a sorted() call to process unsaved logfiles in time order, which until now wasn't guaranteed (and is a nice property to have).
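The flow described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual NGAS code: `process_unsaved_logfiles` and the `archive` callable are hypothetical names (the latter stands in for try_archiving()), `os.rename` stands in for mvFile, and re-raising errors other than ECONNREFUSED is an assumption based on the description.

```python
import errno
import logging
import os

logger = logging.getLogger(__name__)

UNSAVED_EXT = '.unsaved'


def process_unsaved_logfiles(unsaved_files, archive):
    """Archive each '*.unsaved' logfile, restoring its name on failure.

    `archive` is a hypothetical callable standing in for try_archiving():
    it receives the renamed file's path and raises on error.
    Returns the list of files that were archived successfully.
    """
    archived = []
    # sorted() gives a deterministic order; it matches time order only if
    # the filenames embed a sortable timestamp (assumed here).
    for unsaved in sorted(unsaved_files):
        fname = unsaved[:-len(UNSAVED_EXT)]
        try:
            os.rename(unsaved, fname)
            archive(fname)
        except Exception as e:
            # Whatever went wrong, restore the original name so the file
            # is picked up again in the next cycle.
            if os.path.exists(fname):
                os.rename(fname, unsaved)
            if getattr(e, 'errno', None) == errno.ECONNREFUSED:
                # Transient at startup: the server may not be bound yet.
                logger.warning("Server not reachable yet, will retry %s "
                               "in the next cycle", unsaved)
                continue
            # Assumption: other errors still propagate to the caller.
            raise
        archived.append(fname)
    return archived
```

With this shape the rename-back happens in exactly one place, regardless of which error was raised, and only the ECONNREFUSED case is downgraded to a warning.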

This addresses #26.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>

coveralls · 2020-09-03T07:23:49Z

Coverage decreased (-0.08%) to 68.531% when pulling c6556bc on janitor-thread-startup into 3cf85a3 on master.

davepallot · 2020-09-03T07:29:19Z

src/ngamsServer/ngamsServer/janitor/rotated_logfiles_handler.py

-        mvFile(unsaved, fname)
-
-        # Connect to the server and send a pull ARCHIVE request
-        if cfg.getArchiveRotatedLogfiles():
-            file_uri = "file://" + fname
-            host, port = srvObj.get_self_endpoint()
-            proto = srvObj.get_server_access_proto()
-            ngamsPClient.ngamsPClient(host, port, proto=proto).archive(
-                    file_uri, 'ngas/nglog')
+        try:
+            mvFile(unsaved, fname)
+            try_archiving(cfg, srvObj, fname)
+        except Exception as e:
+            mvFile(fname, unsaved)


Based on our conversation, is it necessary to call mvFile first and then restore the name on a possible exception? Would it be better to call mvFile only after a successful archive?

Yeah, I absolutely agree that the first move is not required in principle, and that dropping it would make exception handling simpler. However, the ngamsPClient.archive method doesn't allow passing down arbitrary filenames, and will always use the basename of the file on disk that is being archived.

As discussed, I'll just go ahead with these changes, but will create an issue as a reminder to implement this improvement in the future.

rtobar requested a review from davepallot September 3, 2020 07:11

davepallot reviewed Sep 3, 2020


rtobar merged commit c6556bc into master Sep 3, 2020

rtobar deleted the janitor-thread-startup branch September 3, 2020 07:34

rtobar mentioned this pull request Sep 3, 2020

Race condition on startup with janitor process #26

Closed
