Merge remote-tracking branch 'ClickHouse/master' into improve_pr_62592
rschu1ze committed Apr 29, 2024
2 parents 5946ff0 + 8fd38f2 commit 619d030
Showing 32 changed files with 667 additions and 49 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/debug.yml
@@ -8,4 +8,4 @@ jobs:
DebugInfo:
runs-on: ubuntu-latest
steps:
-- uses: hmarr/debug-action@a701ed95a46e6f2fb0df25e1a558c16356fae35a
+- uses: hmarr/debug-action@f7318c783045ac39ed9bb497e22ce835fdafbfe6
2 changes: 1 addition & 1 deletion .github/workflows/master.yml
@@ -16,7 +16,7 @@ jobs:
data: ${{ steps.runconfig.outputs.CI_DATA }}
steps:
- name: DebugInfo
-uses: hmarr/debug-action@a701ed95a46e6f2fb0df25e1a558c16356fae35a
+uses: hmarr/debug-action@f7318c783045ac39ed9bb497e22ce835fdafbfe6
- name: Check out repository code
uses: ClickHouse/checkout@v1
with:
2 changes: 1 addition & 1 deletion .github/workflows/pull_request.yml
@@ -22,7 +22,7 @@ jobs:
data: ${{ steps.runconfig.outputs.CI_DATA }}
steps:
- name: DebugInfo
-uses: hmarr/debug-action@a701ed95a46e6f2fb0df25e1a558c16356fae35a
+uses: hmarr/debug-action@f7318c783045ac39ed9bb497e22ce835fdafbfe6
- name: Check out repository code
uses: ClickHouse/checkout@v1
with:
2 changes: 1 addition & 1 deletion .github/workflows/reusable_simple_job.yml
@@ -63,7 +63,7 @@ jobs:
GITHUB_JOB_OVERRIDDEN: ${{inputs.test_name}}
steps:
- name: DebugInfo
-uses: hmarr/debug-action@a701ed95a46e6f2fb0df25e1a558c16356fae35a
+uses: hmarr/debug-action@f7318c783045ac39ed9bb497e22ce835fdafbfe6
- name: Check out repository code
uses: ClickHouse/checkout@v1
with:
6 changes: 3 additions & 3 deletions docs/en/engines/table-engines/mergetree-family/mergetree.md
@@ -287,9 +287,9 @@ The number of columns in the primary key is not explicitly limited. Depending on

A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during `SELECT` queries.

-You can create a table without a primary key using the `ORDER BY tuple()` syntax. In this case, ClickHouse stores data in insertion order. To preserve the data order when inserting with `INSERT ... SELECT` queries, set [max_insert_threads = 1](/docs/en/operations/settings/settings.md/#settings-max-insert-threads).
+You can create a table without a primary key using the `ORDER BY tuple()` syntax. In this case, ClickHouse stores data in insertion order. To preserve the data order when inserting with `INSERT ... SELECT` queries, set [max_insert_threads = 1](/docs/en/operations/settings/settings.md/#max-insert-threads).

-To select data in the initial order, use [single-threaded](/docs/en/operations/settings/settings.md/#settings-max_threads) `SELECT` queries.
+To select data in the initial order, use [single-threaded](/docs/en/operations/settings/settings.md/#max_threads) `SELECT` queries.
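
For illustration, a minimal sketch of these settings in use (table and column names such as `events` and `source_events` are hypothetical):

```sql
-- A table without a primary key: rows are stored in insertion order.
CREATE TABLE events
(
    ts DateTime,
    message String
)
ENGINE = MergeTree
ORDER BY tuple();

-- Preserve the source order when copying data with INSERT ... SELECT.
SET max_insert_threads = 1;
INSERT INTO events SELECT ts, message FROM source_events;

-- Read the data back in the stored order with a single-threaded SELECT.
SELECT * FROM events SETTINGS max_threads = 1;
```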

### Choosing a Primary Key that Differs from the Sorting Key {#choosing-a-primary-key-that-differs-from-the-sorting-key}

@@ -344,7 +344,7 @@ In the example below, the index can’t be used.
SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%'
```

-To check whether ClickHouse can use the index when running a query, use the settings [force_index_by_date](/docs/en/operations/settings/settings.md/#settings-force_index_by_date) and [force_primary_key](/docs/en/operations/settings/settings.md/#force-primary-key).
+To check whether ClickHouse can use the index when running a query, use the settings [force_index_by_date](/docs/en/operations/settings/settings.md/#force_index_by_date) and [force_primary_key](/docs/en/operations/settings/settings.md/#force-primary-key).
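
As an illustrative sketch only (the `SET` statement and the outcome comments are assumptions, not part of the example above):

```sql
-- With force_primary_key enabled, queries that cannot use the primary key
-- are rejected instead of silently scanning the whole table.
SET force_primary_key = 1;

-- OK: the condition restricts the leading primary key column.
SELECT count() FROM table WHERE CounterID = 34;

-- Throws: the OR branch on URL prevents use of the primary key.
SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%';
```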

The key for partitioning by month allows reading only those data blocks which contain dates from the proper range. In this case, the data block may contain data for many dates (up to an entire month). Within a block, data is sorted by primary key, which might not contain the date as the first column. Because of this, using a query with only a date condition that does not specify the primary key prefix will cause more data to be read than for a single date.
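
A hedged sketch of this pruning behavior, assuming the table is partitioned by `toYYYYMM(Date)`:

```sql
-- Only the parts of partition 201401 are read; within each part, the primary
-- index may still scan many rows if Date is not a prefix of the primary key.
SELECT count() FROM table WHERE Date BETWEEN '2014-01-01' AND '2014-01-31';
```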

@@ -113,7 +113,7 @@ You can specify any existing ZooKeeper cluster and the system will use a directo

If ZooKeeper is not set in the config file, you can’t create replicated tables, and any existing replicated tables will be read-only.

-ZooKeeper is not used in `SELECT` queries because replication does not affect the performance of `SELECT` and queries run just as fast as they do for non-replicated tables. When querying distributed replicated tables, ClickHouse behavior is controlled by the settings [max_replica_delay_for_distributed_queries](/docs/en/operations/settings/settings.md/#settings-max_replica_delay_for_distributed_queries) and [fallback_to_stale_replicas_for_distributed_queries](/docs/en/operations/settings/settings.md/#settings-fallback_to_stale_replicas_for_distributed_queries).
+ZooKeeper is not used in `SELECT` queries because replication does not affect the performance of `SELECT` and queries run just as fast as they do for non-replicated tables. When querying distributed replicated tables, ClickHouse behavior is controlled by the settings [max_replica_delay_for_distributed_queries](/docs/en/operations/settings/settings.md/#max_replica_delay_for_distributed_queries) and [fallback_to_stale_replicas_for_distributed_queries](/docs/en/operations/settings/settings.md/#fallback_to_stale_replicas_for_distributed_queries).
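
A minimal sketch of applying these settings at query time (`dist_table` is a hypothetical `Distributed` table over replicated shards):

```sql
-- Treat replicas lagging more than 300 seconds as unavailable...
SET max_replica_delay_for_distributed_queries = 300;
-- ...and fail the query instead of falling back to a stale replica.
SET fallback_to_stale_replicas_for_distributed_queries = 0;

SELECT count() FROM dist_table;
```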

For each `INSERT` query, approximately ten entries are added to ZooKeeper through several transactions. (To be more precise, this applies to each inserted block of data; an `INSERT` query contains one block, or one block per `max_insert_block_size = 1048576` rows.) This leads to slightly longer latencies for `INSERT` compared to non-replicated tables. But if you follow the recommendations to insert data in batches of no more than one `INSERT` per second, it does not create any problems. A single ZooKeeper cluster coordinating an entire ClickHouse cluster handles a total of several hundred `INSERT`s per second. The throughput on data inserts (the number of rows per second) is just as high as for non-replicated data.
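
As a sketch of the batching recommendation (`replicated_events` is a hypothetical `ReplicatedMergeTree` table):

```sql
-- One batched INSERT forms a single block and pays one round of ZooKeeper
-- entries; issuing each row as its own INSERT would pay that cost per row.
INSERT INTO replicated_events (d, id) VALUES
    ('2024-01-01', 1),
    ('2024-01-01', 2),
    ('2024-01-01', 3);
```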

2 changes: 1 addition & 1 deletion docs/en/engines/table-engines/special/join.md
@@ -83,7 +83,7 @@ When creating a table, the following settings are applied:

#### join_any_take_last_row

-[join_any_take_last_row](/docs/en/operations/settings/settings.md/#settings-join_any_take_last_row)
+[join_any_take_last_row](/docs/en/operations/settings/settings.md/#join_any_take_last_row)
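
A minimal sketch of applying this setting when creating a Join-engine table (the table and column names are hypothetical):

```sql
-- Keep the last inserted row per key instead of the first one.
CREATE TABLE any_left_join
(
    id UInt32,
    val String
)
ENGINE = Join(ANY, LEFT, id)
SETTINGS join_any_take_last_row = 1;
```
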
#### join_use_nulls

#### persistent
112 changes: 110 additions & 2 deletions docs/en/sql-reference/aggregate-functions/parametric-functions.md
@@ -505,9 +505,117 @@ HAVING uniqUpTo(4)(UserID) >= 5

`uniqUpTo(4)(UserID)` calculates the number of unique `UserID` values for each `SearchPhrase`, but it only counts up to 4 unique values. If there are more than 4 unique `UserID` values for a `SearchPhrase`, the function returns 5 (4 + 1). The `HAVING` clause then filters out the `SearchPhrase` values for which the number of unique `UserID` values is less than 5. This will give you a list of search keywords that were used by at least 5 unique users.
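
For reference, a sketch of the full query this paragraph describes (the `hits` table and its columns are assumed from the usual web-analytics examples, not shown in this diff):

```sql
SELECT SearchPhrase
FROM hits
GROUP BY SearchPhrase
HAVING uniqUpTo(4)(UserID) >= 5;
```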

-## sumMapFiltered(keys_to_keep)(keys, values)
+## sumMapFiltered

-Same behavior as [sumMap](../../sql-reference/aggregate-functions/reference/summap.md#agg_functions-summap) except that an array of keys is passed as a parameter. This can be especially useful when working with a high cardinality of keys.
+This function behaves the same as [sumMap](../../sql-reference/aggregate-functions/reference/summap.md#agg_functions-summap) except that it also accepts an array of keys to filter with as a parameter. This can be especially useful when working with a high cardinality of keys.

**Syntax**

`sumMapFiltered(keys_to_keep)(keys, values)`

**Parameters**

- `keys_to_keep`: [Array](../data-types/array.md) of keys to filter with.
- `keys`: [Array](../data-types/array.md) of keys.
- `values`: [Array](../data-types/array.md) of values.

**Returned Value**

- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.

**Example**

Query:

```sql
CREATE TABLE sum_map
(
`date` Date,
`timeslot` DateTime,
`statusMap` Nested(status UInt16, requests UInt64)
)
ENGINE = Log;

INSERT INTO sum_map VALUES
('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10]);
```

```sql
SELECT sumMapFiltered([1, 4, 8])(statusMap.status, statusMap.requests) FROM sum_map;
```

Result:

```response
┌─sumMapFiltered([1, 4, 8])(statusMap.status, statusMap.requests)─┐
1. │ ([1,4,8],[10,20,10]) │
└─────────────────────────────────────────────────────────────────┘
```

## sumMapFilteredWithOverflow

This function behaves the same as [sumMap](../../sql-reference/aggregate-functions/reference/summap.md#agg_functions-summap) except that it also accepts an array of keys to filter with as a parameter. This can be especially useful when working with a high cardinality of keys. It differs from the [sumMapFiltered](#summapfiltered) function in that it does summation with overflow - i.e. it returns the same data type for the sum as the data type of the arguments.

**Syntax**

`sumMapFilteredWithOverflow(keys_to_keep)(keys, values)`

**Parameters**

- `keys_to_keep`: [Array](../data-types/array.md) of keys to filter with.
- `keys`: [Array](../data-types/array.md) of keys.
- `values`: [Array](../data-types/array.md) of values.

**Returned Value**

- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.

**Example**

In this example we create a table `sum_map`, insert some data into it, and then use both `sumMapFilteredWithOverflow` and `sumMapFiltered` together with the `toTypeName` function to compare the results. Because `requests` was of type `UInt8` in the created table, `sumMapFiltered` promotes the type of the summed values to `UInt64` to avoid overflow, whereas `sumMapFilteredWithOverflow` keeps the type as `UInt8`, which may not be large enough to store the sum - i.e. overflow can occur.

Query:

```sql
CREATE TABLE sum_map
(
`date` Date,
`timeslot` DateTime,
`statusMap` Nested(status UInt8, requests UInt8)
)
ENGINE = Log;

INSERT INTO sum_map VALUES
('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10]);
```

```sql
SELECT sumMapFilteredWithOverflow([1, 4, 8])(statusMap.status, statusMap.requests) AS summap_overflow, toTypeName(summap_overflow) FROM sum_map;
```

```sql
SELECT sumMapFiltered([1, 4, 8])(statusMap.status, statusMap.requests) AS summap, toTypeName(summap) FROM sum_map;
```

Result:

```response
┌─summap_overflow──────┬─toTypeName(summap_overflow)───────┐
1. │ ([1,4,8],[10,20,10]) │ Tuple(Array(UInt8), Array(UInt8)) │
└──────────────────────┴───────────────────────────────────┘
```

```response
┌─summap───────────────┬─toTypeName(summap)─────────────────┐
1. │ ([1,4,8],[10,20,10]) │ Tuple(Array(UInt8), Array(UInt64)) │
└──────────────────────┴────────────────────────────────────┘
```

## sequenceNextNode

5 changes: 5 additions & 0 deletions docs/en/sql-reference/aggregate-functions/reference/index.md
@@ -16,7 +16,9 @@ Standard aggregate functions:
- [avg](/docs/en/sql-reference/aggregate-functions/reference/avg.md)
- [any](/docs/en/sql-reference/aggregate-functions/reference/any.md)
- [stddevPop](/docs/en/sql-reference/aggregate-functions/reference/stddevpop.md)
+- [stddevPopStable](/docs/en/sql-reference/aggregate-functions/reference/stddevpopstable.md)
- [stddevSamp](/docs/en/sql-reference/aggregate-functions/reference/stddevsamp.md)
+- [stddevSampStable](/docs/en/sql-reference/aggregate-functions/reference/stddevsampstable.md)
- [varPop](/docs/en/sql-reference/aggregate-functions/reference/varpop.md)
- [varSamp](/docs/en/sql-reference/aggregate-functions/reference/varsamp.md)
- [corr](./corr.md)
@@ -65,6 +67,9 @@ ClickHouse-specific aggregate functions:
- [groupBitmapXor](/docs/en/sql-reference/aggregate-functions/reference/groupbitmapxor.md)
- [sumWithOverflow](/docs/en/sql-reference/aggregate-functions/reference/sumwithoverflow.md)
- [sumMap](/docs/en/sql-reference/aggregate-functions/reference/summap.md)
+- [sumMapWithOverflow](/docs/en/sql-reference/aggregate-functions/reference/summapwithoverflow.md)
+- [sumMapFiltered](/docs/en/sql-reference/aggregate-functions/parametric-functions.md/#summapfiltered)
+- [sumMapFilteredWithOverflow](/docs/en/sql-reference/aggregate-functions/parametric-functions.md/#summapfilteredwithoverflow)
- [minMap](/docs/en/sql-reference/aggregate-functions/reference/minmap.md)
- [maxMap](/docs/en/sql-reference/aggregate-functions/reference/maxmap.md)
- [skewSamp](/docs/en/sql-reference/aggregate-functions/reference/skewsamp.md)
50 changes: 45 additions & 5 deletions docs/en/sql-reference/aggregate-functions/reference/stddevpop.md
@@ -7,10 +7,50 @@ sidebar_position: 30

The result is equal to the square root of [varPop](../../../sql-reference/aggregate-functions/reference/varpop.md).

-Alias:
-- `STD`
-- `STDDEV_POP`
+Aliases: `STD`, `STDDEV_POP`.

:::note
-This function uses a numerically unstable algorithm. If you need [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) in calculations, use the `stddevPopStable` function. It works slower but provides a lower computational error.
-:::
+This function uses a numerically unstable algorithm. If you need [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) in calculations, use the [`stddevPopStable`](../reference/stddevpopstable.md) function. It works slower but provides a lower computational error.
+:::

**Syntax**

```sql
stddevPop(x)
```

**Parameters**

- `x`: Population of values to find the standard deviation of. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).

**Returned value**

Standard deviation of `x`, i.e. the square root of its variance. [Float64](../../data-types/float.md).


**Example**

Query:

```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
population UInt8
)
ENGINE = Log;

INSERT INTO test_data VALUES (3),(3),(3),(4),(4),(5),(5),(7),(11),(15);

SELECT
stddevPop(population) AS stddev
FROM test_data;
```

Result:

```response
┌────────────stddev─┐
│ 3.794733192202055 │
└───────────────────┘
```
@@ -0,0 +1,49 @@
---
slug: /en/sql-reference/aggregate-functions/reference/stddevpopstable
sidebar_position: 30
---

# stddevPopStable

The result is equal to the square root of [varPop](../../../sql-reference/aggregate-functions/reference/varpop.md). Unlike [`stddevPop`](../reference/stddevpop.md), this function uses a numerically stable algorithm. It works slower but provides a lower computational error.

**Syntax**

```sql
stddevPopStable(x)
```

**Parameters**

- `x`: Population of values to find the standard deviation of. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).

**Returned value**

Standard deviation of `x`, i.e. the square root of its variance. [Float64](../../data-types/float.md).

**Example**

Query:

```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
population Float64
)
ENGINE = Log;

INSERT INTO test_data SELECT randUniform(5.5, 10) FROM numbers(1000000);

SELECT
stddevPopStable(population) AS stddev
FROM test_data;
```

Result:

```response
┌─────────────stddev─┐
│ 1.2999977786592576 │
└────────────────────┘
```
45 changes: 43 additions & 2 deletions docs/en/sql-reference/aggregate-functions/reference/stddevsamp.md
@@ -10,5 +10,46 @@ The result is equal to the square root of [varSamp](../../../sql-reference/aggre
Alias: `STDDEV_SAMP`.

:::note
-This function uses a numerically unstable algorithm. If you need [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) in calculations, use the `stddevSampStable` function. It works slower but provides a lower computational error.
-:::
+This function uses a numerically unstable algorithm. If you need [numerical stability](https://en.wikipedia.org/wiki/Numerical_stability) in calculations, use the [`stddevSampStable`](../reference/stddevsampstable.md) function. It works slower but provides a lower computational error.
+:::

**Syntax**

```sql
stddevSamp(x)
```

**Parameters**

- `x`: Values for which to find the square root of sample variance. [(U)Int*](../../data-types/int-uint.md), [Float*](../../data-types/float.md), [Decimal*](../../data-types/decimal.md).

**Returned value**

Square root of sample variance of `x`. [Float64](../../data-types/float.md).

**Example**

Query:

```sql
DROP TABLE IF EXISTS test_data;
CREATE TABLE test_data
(
population UInt8
)
ENGINE = Log;

INSERT INTO test_data VALUES (3),(3),(3),(4),(4),(5),(5),(7),(11),(15);

SELECT
stddevSamp(population)
FROM test_data;
```

Result:

```response
┌─stddevSamp(population)─┐
│ 4 │
└────────────────────────┘
```
