Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement formatDateTime function #2770

Merged
merged 8 commits into from Sep 18, 2018
Merged

Implement formatDateTime function #2770

merged 8 commits into from Sep 18, 2018

Conversation

alex-krash
Copy link
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

In this PR I will add support for formatDateTime function. But to proceed, I need several comments:

  1. Which formats we would like to support
  2. What is correct approach to work with DateLUTImpl (should it consider timezone of DateTime column)?
  3. What is correct implementation for constant column input (for DateTime to format)? Implementations in repo differs (somewhere it requires separate branch, somewhere - just ignored).

@alex-krash alex-krash changed the title [WIP] Implement several formattings for non-constant input [WIP] Implement formatDateTime function Jul 31, 2018
@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jul 31, 2018

Which formats we would like to support

A subset of strftime, that can be easily implemented by calling methods of DateLUTImpl and probably formatting that components in different ways.

Locale dependent formatting shouldn't be supported for now.

What is correct approach to work with DateLUTImpl (should it consider timezone of DateTime column)?

The same way as it is used in toString:

FunctionsDateTime.h:

/// Determine working timezone either from optional argument with time zone name or from time zone in DateTime type of argument.
std::string extractTimeZoneNameFromFunctionArguments(const ColumnsWithTypeAndName & arguments, size_t time_zone_arg_num, size_t datetime_arg_num);
const DateLUTImpl & extractTimeZoneFromFunctionArguments(Block & block, const ColumnNumbers & arguments, size_t time_zone_arg_num, size_t datetime_arg_num);

It will consider optional argument with explicit time zone:

formatDateTime(t, '%m/%d/%Y', 'America/Los_Angeles')

or time zone, embedded in the data type of the first argument:

:) SELECT toTypeName(t) ...

DateTime('W-SU')

:) SELECT formatDateTime(t, '%m/%d/%Y')
...

That is the server time zone by default.

What is correct implementation for constant column input (for DateTime to format)? Implementations in repo differs (somewhere it requires separate branch, somewhere - just ignored).

Look at methods:

    /** If the function have non-zero number of arguments,
      *  and if all arguments are constant, that we could automatically provide default implementation:
      *  arguments are converted to ordinary columns with single value, then function is executed as usual,
      *  and then the result is converted to constant column.
      */
    virtual bool useDefaultImplementationForConstants() const { return false; }

    /** Some arguments could remain constant during this implementation.
      */
    virtual ColumnNumbers getArgumentsThatAreAlwaysConstant() const { return {}; }

In your case, you will useDefaultImplementationForConstants
and getArgumentsThatAreAlwaysConstant are {1, 2}

And you don't need to implement a separate branch in code.

@alex-krash
Copy link
Contributor Author

@alexey-milovidov , I've implemented missing formats, but faced performance problems (my implementation has weaker performance, comparing to reference query, you've provided in #2818):

SELECT count()
FROM 
(
    SELECT replaceRegexpOne(toString(t), '^(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)( \\d\\d:\\d\\d:\\d\\d)$', '\\3/\\2/\\1\\4')
    FROM 
    (
        SELECT toDateTime('2018-01-01 00:00:00') + number AS t
        FROM numbers(20000000) 
    ) 
) 
1 rows in set. Elapsed: 4.955 sec. Processed 20.05 million rows, 160.43 MB (4.05 million rows/s., 32.38 MB/s.) 

For my implementation:

SELECT count()
FROM 
(
    SELECT formatDateTime(t, '%F %T')
    FROM 
    (
        SELECT toDateTime('2018-01-01 00:00:00') + number AS t
        FROM numbers(20000000) 
    ) 
) 
1 rows in set. Elapsed: 9.058 sec. Processed 20.05 million rows, 160.43 MB (2.21 million rows/s., 17.71 MB/s.) 

It is up to you to decide, whether #2818 should be merged, or I should try to optimize this one.
Thanks in advance.

@alex-krash alex-krash closed this Aug 29, 2018
@alex-krash alex-krash reopened this Aug 29, 2018
@alexey-milovidov alexey-milovidov self-assigned this Aug 29, 2018
@alexey-milovidov
Copy link
Member

We've just planned to polish this PR (or #2818),
and the way of implementation in this PR theoretically should be fine.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Sep 10, 2018

There are four reasons why the code is slow:

  1. You append data to a temporary std::string object for each row.
  2. You create extra temporary std::string objects in some cases.
  3. sprintf function is used (it is extremely slow).
  4. std::function is slower or about the same than virtual call that is subsequently slower than function-by-pointer call that have something similar or little worse performance than switch-case.

@alex-krash
Copy link
Contributor Author

Thanks, @alexey-milovidov , will apply changes.

@alex-krash
Copy link
Contributor Author

@alexey-milovidov , here are results after optimization:
Reference:

SELECT count()
FROM 
(
    SELECT replaceRegexpOne(toString(t), '^(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)( \\d\\d:\\d\\d:\\d\\d)$', '\\3/\\2/\\1\\4')
    FROM 
    (
        SELECT toDateTime('2018-01-01 00:00:00') + number AS t
        FROM numbers(20000000) 
    ) 
) 

┌──count()─┐
│ 20000000 │
└──────────┘

1 rows in set. Elapsed: 5.071 sec. Processed 20.05 million rows, 160.43 MB (3.95 million rows/s., 31.64 MB/s.) 

New implementation:

SELECT count()
FROM 
(
    SELECT formatDateTime(t, '%F %T')
    FROM 
    (
        SELECT toDateTime('2018-01-01 00:00:00') + number AS t
        FROM numbers(20000000) 
    ) 
) 

┌──count()─┐
│ 20000000 │
└──────────┘
1 rows in set. Elapsed: 1.326 sec. Processed 20.05 million rows, 160.43 MB (15.12 million rows/s., 120.99 MB/s.)

There is one issue, and one question.
Issue: I am unable to implement %U and %W modifiers (week numbers) :(

Question: I've implemented support only for DateTime input parameter, whereas Date is not implemented. It can be easily templated, and so on, but here comes the question what behavior we expect for non-supported by Date formatting? E.g. do we expect client to receive this exception:

struct ToMinuteImpl
{
    static constexpr auto name = "toMinute";

    static inline UInt8 execute(UInt32 t, const DateLUTImpl & time_zone)
    {
        return time_zone.toMinute(t);
    }
    static inline UInt8 execute(UInt16, const DateLUTImpl &)
    {
        throw Exception("Illegal type Date of argument for function toMinute", ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT);
    }

    using FactorTransform = ToStartOfHourImpl;
};

When query:

SELECT formatDateTime(today(), '%M')

Or we would convert date to date + 00:00:00 ?

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Sep 17, 2018

here are results after optimization

That's much better!

I am unable to implement %U and %W modifiers (week numbers) :(

Just leave it unimplemented (throw exception) and we will have a chance to implement later.
BTW, support for ISO week (%V) was added yesterday (I see you already use it - cool!).

It can be easily templated, and so on, but here comes the question what behavior we expect for non-supported by Date formatting?

I think, the user will expect that we use zeros as time (format Date in the same way as DateTime).

@alex-krash alex-krash changed the title [WIP] Implement formatDateTime function Implement formatDateTime function Sep 18, 2018
@alexey-milovidov
Copy link
Member

I have started to merge.

alexey-milovidov added a commit that referenced this pull request Sep 18, 2018
@alexey-milovidov alexey-milovidov merged commit 64533ff into ClickHouse:master Sep 18, 2018
alexey-milovidov added a commit that referenced this pull request Sep 18, 2018
alexey-milovidov added a commit that referenced this pull request Sep 18, 2018
alexey-milovidov added a commit that referenced this pull request Sep 18, 2018
alexey-milovidov added a commit that referenced this pull request Sep 18, 2018
@alex-krash alex-krash deleted the function_format_date branch September 19, 2018 09:35
alexey-milovidov added a commit that referenced this pull request Sep 19, 2018
alexey-milovidov added a commit that referenced this pull request Sep 19, 2018
alexey-milovidov added a commit that referenced this pull request Sep 19, 2018
alexey-milovidov added a commit that referenced this pull request Sep 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants