
Implement support of syslog format in the parseDateTimeBestEffort() function #50925

Merged
merged 22 commits into from Jun 15, 2023

Conversation


@sirvickr sirvickr commented Jun 13, 2023

Closes #45306

[attached screenshot: syslog timestamp]

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added the possibility to use date and time arguments in syslog timestamp format in functions parseDateTimeBestEffort*() and parseDateTime64BestEffort*().
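A minimal usage sketch of the change (the exact results depend on the current date, since syslog timestamps omit the year):

```sql
-- Syslog timestamps carry no year, e.g. "Jun 13 14:05:17".
-- With this change, the BestEffort parsers accept such strings and
-- infer the year relative to the current moment.
SELECT
    parseDateTimeBestEffort('Jun 13 14:05:17') AS dt,
    parseDateTime64BestEffort('Jun 13 14:05:17') AS dt64;
```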

@rvasin

This comment was marked as outdated.

@rschu1ze rschu1ze self-assigned this Jun 13, 2023
@rschu1ze rschu1ze added the can be tested Allows running workflows for external contributors label Jun 13, 2023
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-improvement Pull request with some product improvements label Jun 13, 2023

robot-ch-test-poll4 commented Jun 13, 2023

This is an automated comment for commit bd16233 with a description of existing statuses. It is updated for the latest CI run.
The full report is available here
The overall status of the commit is 🔴 failure

| Check name | Description | Status |
|---|---|---|
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | 🟢 success |
| CI running | A meta-check that indicates the running CI. Normally it is in success or pending state. A failed status indicates some problem with the PR | 🟢 success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | 🟢 success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | 🟢 success |
| Docker image for servers | The check to build and optionally push the mentioned image to Docker Hub | 🟢 success |
| Docs Check | Builds and tests the documentation | 🟢 success |
| Fast test | Normally the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | 🟢 success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check turns red. We don't allow flaky tests; read the doc | 🟢 success |
| Install packages | Checks that the built packages are installable in a clean environment | 🟢 success |
| Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | 🟢 success |
| Mergeable Check | Checks if all other necessary checks are successful | 🟢 success |
| Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are given in square brackets | 🟢 success |
| Push to Dockerhub | The check for building and pushing the CI-related Docker images to Docker Hub | 🟢 success |
| SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | 🟢 success |
| Sqllogic | Runs ClickHouse on the sqllogic test set against sqlite and checks that all statements pass | 🟢 success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | 🟢 success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations: release, debug, with sanitizers, etc. | 🔴 failure |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | 🟢 success |
| Style Check | Runs a set of checks to keep the code style clean. If some checks fail, see the related log from the report | 🟢 success |
| Unit tests | Runs the unit tests for different release types | 🟢 success |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | 🟢 success |

@@ -0,0 +1,46 @@
WITH
86400 AS secs_in_day,
now() AS ts_now,
Member

Using now() in a test is a recipe for unstable test results.

In general, the test setup looks a bit overengineered. Maybe we can use a dumber test structure (in particular, constant input values) like in 00813_parse_date_time_best_effort_more.sql or 01442_date_time_with_params.sql?

Contributor Author

I was thinking about constant timestamps in the first place. Unfortunately, parseDateTimeBestEffort() uses now() internally (namely, it uses time() to get the current moment), so it looks like I can't avoid using it in the test.
I came up with calculating a "time shift" between a reference timestamp and the current moment, thereby "freezing" the test's point in time regardless of when it actually runs.
However, the CI run shows that I failed to avoid flakiness, and I'm now trying to tackle this issue and make the test drastically more reliable.

Contributor Author

I managed to pinpoint the origin of the problem and work around it. I'll try to explain everything from scratch.
Here is a debug query close to the original flaky test:

WITH
    now() AS dt_now, 
    86400 AS delta,
    dt_now - delta AS dt_before,
    dt_now + delta AS dt_after,
    toDateTime('2023-06-10 00:00:00') AS dt_ref,
    dateDiff('second', dt_ref, dt_now) AS time_shift,
    formatDateTime(dt_around, '%b %e %T') AS syslog_arg
SELECT
    dt_now,
    time_shift, 
    dt_before, dt_after, dt_around, syslog_arg,
    parseDateTimeBestEffort(syslog_arg) AS res_actual,
    formatDateTime(dt_around - time_shift, '%b %e %T') AS around_ref_point,
    res_actual - time_shift AS res_reference
FROM (SELECT arrayJoin([dt_before, dt_after]) AS dt_around)
FORMAT PrettySpaceNoEscapes;

Here are the results of two consecutive runs of the query (one right after the other, within the same second). The first is correct (the frequent case); the second is erroneous (it reproduces quite rarely):

              dt_now   time_shift             dt_before              dt_after             dt_around   syslog_arg                 res_actual   around_ref_point         res_reference

 2023-06-15 08:27:08       462428   2023-06-14 08:27:08   2023-06-16 08:27:08   2023-06-14 08:27:08   Jun 14 08:27:08   2023-06-14 08:27:08   Jun  9 00:00:00    2023-06-09 00:00:00 
 2023-06-15 08:27:08       462428   2023-06-14 08:27:08   2023-06-16 08:27:08   2023-06-16 08:27:08   Jun 16 08:27:08   2022-06-16 08:27:08   Jun 11 00:00:00    2022-06-11 00:00:00 

              dt_now   time_shift             dt_before              dt_after             dt_around   syslog_arg                 res_actual   around_ref_point         res_reference

 2023-06-15 08:27:08       462428   2023-06-14 08:27:08   2023-06-16 08:27:08   2023-06-14 08:27:09   Jun 14 08:27:09   2023-06-14 08:27:09   Jun  9 00:00:01    2023-06-09 00:00:01 
 2023-06-15 08:27:08       462428   2023-06-14 08:27:08   2023-06-16 08:27:08   2023-06-16 08:27:09   Jun 16 08:27:09   2022-06-16 08:27:09   Jun 11 00:00:01    2022-06-11 00:00:01 

First, I get the current moment: now() AS dt_now - here it happened to be 2023-06-15 08:27:08.
Second, I step backwards and forward in time by one day (minus/plus 86400 seconds):
dt_now - delta AS dt_before and dt_now + delta AS dt_after, i.e. 2023-06-14 08:27:08 and 2023-06-16 08:27:08.
After that, I join these two time points into a temporary dataset, from which they are extracted under the name dt_around:
FROM (SELECT arrayJoin([dt_before, dt_after]) AS dt_around).
Then each point is formatted as the syslog timestamp string under test: formatDateTime(dt_around, '%b %e %T') AS syslog_arg, i.e. Jun 14 08:27:08 and Jun 16 08:27:08, and passed to parseDateTimeBestEffort as an argument:
parseDateTimeBestEffort(syslog_arg) AS res_actual.
Of course, res_actual, as well as dt_now and the other transient values, changes every second the test is run, so these are a poor fit for a .reference file and thus for a functional test.
Therefore, I virtually move every test run to the same, arbitrarily selected "reference point", regardless of when the test actually runs. Here is my reference point: toDateTime('2023-06-10 00:00:00') AS dt_ref. The test virtually runs at this time point every time.
So, I calculate the time shift (in seconds) between the reference point and the current moment:
dateDiff('second', dt_ref, dt_now) AS time_shift.
Then I convert the current (transient) result of the function into the reference result, as if the function had been called at the reference time point: res_actual - time_shift AS res_reference.
On top of all that, just to make the case clearer, I also print the "reference" arguments (around_ref_point) instead of their transient counterparts (syslog_arg):
formatDateTime(dt_around - time_shift, '%b %e %T') AS around_ref_point.
This is how the test is meant to work, and it does work most of the time.
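The "time shift" normalization can be sketched in isolation (a minimal example using the same 2023-06-10 reference point; the output may still be off near year-inference edge cases):

```sql
-- Sketch of the time-shift normalization: parse a transient syslog
-- string derived from now(), then shift the result back to the fixed
-- reference point so the output no longer depends on when it runs.
WITH
    now() AS dt_now,
    toDateTime('2023-06-10 00:00:00') AS dt_ref,
    dateDiff('second', dt_ref, dt_now) AS time_shift
SELECT
    parseDateTimeBestEffort(formatDateTime(dt_now, '%b %e %T')) - time_shift AS res_reference;
-- Should print 2023-06-10 00:00:00 whenever the parser infers the year correctly.
```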

Now let's look at the problem.

Contributor Author

@sirvickr sirvickr Jun 15, 2023

[attached screenshot: issue]

Let's take a close look at this part of both query results.
As a reminder, this is how I create and use the two time points (one in the past, one in the future):
FROM (SELECT arrayJoin([dt_before, dt_after]) AS dt_around).
Namely, I put the previously calculated values into a dataset and then extract them for further use under the name dt_around. So I expect to see them unchanged, and most of the time they are.
However, from time to time the date-times read back from the dataset differ from the values that were put in, as the screenshot above shows.
Namely, I join 2023-06-14 08:27:08 and 2023-06-16 08:27:08 but read 2023-06-14 08:27:09 and 2023-06-16 08:27:09 instead. One second ahead.

Contributor Author

This used to be the only reason for the test's flakiness.
I worked around it by getting rid of arrayJoin() in the queries.
However, the strange behavior of arrayJoin() itself remains unexplained.
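One way to drop arrayJoin() while still probing both time points (a hypothetical sketch, not necessarily the exact shape of the merged test) is to compute each point as its own expression:

```sql
-- Hypothetical arrayJoin()-free variant: each probe point is a separate
-- column, so no values pass through a subquery dataset on the way to
-- parseDateTimeBestEffort().
WITH
    now() AS dt_now,
    86400 AS delta,
    toDateTime('2023-06-10 00:00:00') AS dt_ref,
    dateDiff('second', dt_ref, dt_now) AS time_shift
SELECT
    parseDateTimeBestEffort(formatDateTime(dt_now - delta, '%b %e %T')) - time_shift AS res_before,
    parseDateTimeBestEffort(formatDateTime(dt_now + delta, '%b %e %T')) - time_shift AS res_after;
```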

Member

Thanks for the nice explanation. Very helpful. I think the way you solved the problem is fine.
I also don't have a good explanation of what went wrong. Randomized database settings could be an option (though even then it is not clear how) - anyway, afaik the settings in normal functional tests are unchanged (randomization only happens in stress tests).

If you can reproduce the issue locally, it would be interesting to investigate further; if not, we might as well keep things as they are for now.

@rschu1ze rschu1ze merged commit 74cb797 into ClickHouse:master Jun 15, 2023
279 of 281 checks passed
@rschu1ze
Member

@sirvickr The new tests failed sporadically on ARM for unclear reasons (only the Auckland time zone is affected). I tried to reproduce on an ARM machine but no success, unfortunately.

Perhaps you have an idea what went wrong?

@sirvickr
Contributor Author

I doubt it depends on the architecture. I'll take a look.

@sirvickr
Contributor Author

I guess I have a clue. Robert, should we continue in PR #51081?

@rschu1ze
Member

Doesn't matter too much. If you make a new PR, just set me as reviewer.

@sirvickr
Contributor Author

sirvickr commented Jun 16, 2023

Got it. I'll articulate the idea in #51081.
Looks like I've located the problem.

Successfully merging this pull request may close these issues.

Support syslog datetime format in parseDateTimeBestEffort()