Add math operators and functions to work with multidimensional vectors #27933

mathalex · 2021-08-20T14:53:38Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category:

New Feature

Changelog entry:
This fully closes #4509 and goes beyond it.

Detailed description / Documentation draft:

The function tupleHammingDistance has been refactored: the result is now accumulated in a single variable instead of a column.
operator +, tuplePlus, vectorSum — do tuple-wise addition. Arguments: (Tuple, Tuple). Returns: Tuple.
operator -, tupleMinus, vectorDifference — do tuple-wise subtraction. Arguments: (Tuple, Tuple). Returns: Tuple.
tupleMultiply — do tuple-wise multiplication (compatibility). Arguments: (Tuple, Tuple). Returns: Tuple.
tupleDivide — do tuple-wise division (compatibility). Arguments: (Tuple, Tuple). Returns: Tuple.
unary operator -, tupleNegate — do tuple-wise negation. Arguments: (Tuple). Returns: Tuple.
operator *, tupleMultiplyByNumber — multiplies each element of a tuple by a number. Arguments: (Tuple, Number). Returns: Tuple.
operator /, tupleDivideByNumber — divides each element of a tuple by a number. Arguments: (Tuple, Number). Returns: Tuple.
operator *, dotProduct, scalarProduct — a dot (aka scalar) product of vectors. Arguments: (Tuple, Tuple). Returns: Number.
L1Norm, normL1 — calculates the sum of absolute values of coordinates. Arguments: (Tuple). Returns: Number.
L2Norm, normL2 — calculates the square root of the sum of the squared coordinates. Arguments: (Tuple). Returns: Number.
LinfNorm, normLinf — calculates the maximum absolute value among coordinates. Arguments: (Tuple). Returns: Number.
LpNorm, normLp — calculates the p-th root of the sum of the p-th powers of the absolute values of the coordinates. Arguments: (Tuple, Number). Returns: Number. LpNorm should be reviewed very carefully.
L1Distance, distanceL1 — finds the distance between two points (as tuples) using 1-norm. Arguments: (Tuple, Tuple). Returns: Number.
L2Distance, distanceL2 — finds the distance between two points (as tuples) using 2-norm. Arguments: (Tuple, Tuple). Returns: Number.
LinfDistance, distanceLinf — finds the distance between two points (as tuples) using infinity-norm. Arguments: (Tuple, Tuple). Returns: Number.
LpDistance, distanceLp — finds the distance between two points (as tuples) using p-norm. Arguments: (Tuple, Tuple, Number). Returns: Number.
L1Normalize, normalizeL1 — finds a unit vector of a given vector (tuple) according to 1-norm. Arguments: (Tuple). Returns: Tuple.
L2Normalize, normalizeL2 — finds a unit vector of a given vector (tuple) according to 2-norm. Arguments: (Tuple). Returns: Tuple.
LinfNormalize, normalizeLinf — finds a unit vector of a given vector (tuple) according to infinity-norm. Arguments: (Tuple). Returns: Tuple.
LpNormalize, normalizeLp — finds a unit vector of a given vector (tuple) according to p-norm. Arguments: (Tuple, Number). Returns: Tuple.
cosineDistance — calculates the cosine of the angle between vectors and subtracts it from one. Arguments: (Tuple, Tuple). Returns: Number.
max2 — finds the maximum of two numbers (developed for the LinfNorm function, and it is simply good to have). Arguments: (Number, Number). Returns: Number. It may need to be updated, because as far as I understand the arguments are converted to Float64.
min2 — finds the minimum of two numbers (compatibility). Arguments: (Number, Number). Returns: Number.
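
The semantics listed above can be illustrated with a minimal Python sketch. It is a reference model only, not the ClickHouse implementation: the Python function names (tuple_plus, dot_product, l1_norm and so on) are hypothetical stand-ins for the SQL functions, and type promotion, Nullable handling and Decimal support are ignored.

```python
import math
from typing import Sequence, Tuple

Number = float


def _same_size(u: Sequence[Number], v: Sequence[Number]) -> None:
    # Tuple-wise operations are only defined for tuples of equal size.
    if len(u) != len(v):
        raise ValueError("tuples must have the same size")


def tuple_plus(u: Sequence[Number], v: Sequence[Number]) -> Tuple[Number, ...]:
    _same_size(u, v)
    return tuple(a + b for a, b in zip(u, v))


def tuple_minus(u: Sequence[Number], v: Sequence[Number]) -> Tuple[Number, ...]:
    _same_size(u, v)
    return tuple(a - b for a, b in zip(u, v))


def tuple_multiply_by_number(u: Sequence[Number], k: Number) -> Tuple[Number, ...]:
    return tuple(a * k for a in u)


def dot_product(u: Sequence[Number], v: Sequence[Number]) -> Number:
    _same_size(u, v)
    return sum(a * b for a, b in zip(u, v))


def l1_norm(u: Sequence[Number]) -> Number:
    # Sum of absolute values of the coordinates.
    return sum(abs(x) for x in u)


def l2_norm(u: Sequence[Number]) -> Number:
    # Square root of the sum of squared coordinates.
    return math.sqrt(sum(x * x for x in u))


def linf_norm(u: Sequence[Number]) -> Number:
    # Maximum absolute value among the coordinates (0 for an empty tuple).
    return max((abs(x) for x in u), default=0)


def lp_norm(u: Sequence[Number], p: Number) -> Number:
    # p-th root of the sum of p-th powers of absolute values.
    return sum(abs(x) ** p for x in u) ** (1.0 / p)


print(tuple_plus((1, 2), (3, 4)))         # (4, 6)
print(dot_product((1, 2, 3), (4, 5, 6)))  # 32
print(l2_norm((3, 4)))                    # 5.0
```

The authoritative examples remain the queries in the test attached to this pull request.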

Examples for each of the queries can be found in the test related to this pull request.

The Lp functions make mathematical sense only for p >= 1; for smaller p the result is not a norm. However, there are no restrictions, so the user can pass even a negative number as the parameter.
UPD: added restrictions 1 <= p < inf.
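
A minimal sketch of what the restriction 1 <= p < inf amounts to, written in Python for illustration. The helper names is_valid_p and lp_norm_checked are hypothetical, and treating NaN as invalid is an assumption; the error type and message in ClickHouse will differ.

```python
import math
from typing import Sequence


def is_valid_p(p: float) -> bool:
    # Accept only finite p with p >= 1; NaN fails both comparisons.
    return 1 <= p < math.inf


def lp_norm_checked(u: Sequence[float], p: float) -> float:
    if not is_valid_p(p):
        raise ValueError("p must satisfy 1 <= p < inf")
    return sum(abs(x) ** p for x in u) ** (1.0 / p)


print(is_valid_p(1.5))   # True
print(is_valid_p(0.5))   # False
```

Note that non-integer values such as 1.5 pass the check, so p is not limited to integers.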

LxDistance(u, v) := LxNorm(u - v), LxNormalize(u) := u / LxNorm(u), cosineDistance(u, v) := 1 - (u * v) / (L2Norm(u) * L2Norm(v)).
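
The definitions above can be sketched in Python for the L2 case. This is only an illustration of the formulas: the Python names are hypothetical, and the zero-vector case (where Python raises ZeroDivisionError) is not meant to describe what ClickHouse returns.

```python
import math
from typing import Sequence, Tuple


def l2_norm(u: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in u))


def l2_distance(u: Sequence[float], v: Sequence[float]) -> float:
    # LxDistance(u, v) := LxNorm(u - v)
    return l2_norm(tuple(a - b for a, b in zip(u, v)))


def l2_normalize(u: Sequence[float]) -> Tuple[float, ...]:
    # LxNormalize(u) := u / LxNorm(u)
    n = l2_norm(u)
    return tuple(x / n for x in u)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    # cosineDistance(u, v) := 1 - (u * v) / (L2Norm(u) * L2Norm(v))
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (l2_norm(u) * l2_norm(v))


print(l2_distance((0, 0), (3, 4)))      # 5.0
print(cosine_distance((1, 0), (0, 1)))  # 1.0
```

Orthogonal vectors give a cosine distance of 1, and parallel vectors give a value close to 0 (up to floating-point rounding).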

Operators overloading that can be added:

(not related to numeric vectors) Overload + ((String, String) -> String) to concatenate strings, and * ((String, Integer) -> String) to repeat a string multiple times.

Added operators, tupleHammingDistance has been refactored

vdimir

Excellent work!

Some comments, most of them suggest style changes.

src/Functions/max2.cpp

src/Functions/min2.cpp

src/Functions/vectorFunctions.cpp

vdimir · 2021-09-01T08:27:53Z

src/Functions/vectorFunctions.cpp

+        {
+            try
+            {
+                ColumnWithTypeAndName left{left_elements.empty() ? nullptr : left_elements[i], left_types[i], {}};


Are there real cases where it's empty?

I just copy-pasted it from tupleHammingDistance. Actually, I do not know (and I agree, it would be strange if there were a real case). I can delete it from all places. If the old tests with tupleHammingDistance do not fail, it will be possible to conclude that there are no such cases :)

Yes, there are! The query SELECT tupleHammingDistance(materialize((1, 2)), (1, 4)); fails if you leave only the third part of the ternary operator, though I do not know why.

src/Functions/FunctionBinaryArithmetic.h

vdimir · 2021-09-01T08:34:52Z

src/Functions/max2.cpp

+namespace
+{
+    struct Max2Name { static constexpr auto name = "max2"; };
+    using FunctionMax2 = FunctionMathBinaryFloat64<BinaryFunctionVectorized<Max2Name, max>>;


Maybe should be updated as there is a conversion to Float64, as I can understand.

Yes, it returns Float64 for all input types.

SELECT toTypeName(max2(2, 1))
┌─toTypeName(max2(2, 1))─┐
│ Float64                │
└────────────────────────┘

Thanks. I think it is not good for such an exact function to always return Float64, but I do not know a convenient way to fix it.
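
One reason the Float64 result type matters: integers above 2^53 are not all exactly representable as 64-bit floats. The Python sketch below models a max that converts both arguments to a double first. The function name max2_via_float64 is hypothetical, and the sketch assumes that Python's float (an IEEE 754 double) behaves like Float64 here.

```python
def max2_via_float64(a: int, b: int) -> float:
    # Convert both arguments to a 64-bit float before comparing,
    # mirroring a function that always returns Float64.
    return max(float(a), float(b))


big = 2**53 + 1                        # not exactly representable as a double
exact = max(big, 2**53)                # integer comparison keeps the exact value
lossy = max2_via_float64(big, 2**53)   # rounding happens before the comparison

print(exact)       # 9007199254740993
print(int(lossy))  # 9007199254740992
```

For small integers such as max2(2, 1) the value is preserved and only the type changes.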

vdimir · 2021-09-01T08:35:37Z

src/Functions/vectorFunctions.cpp

+        if (tuple_size == 0)
+            return DataTypeUInt8().createColumnConstWithDefaultValue(input_rows_count);
+
+        const auto & p_column = arguments[1];


Are there use cases for non-constant p, what do you think? Also, should we limit the possible values of p, like 0 < p < inf, or make it an integer?

SELECT LpNorm((3, 1, 4), 0), LpNorm((3, 1, 4), inf);
┌─LpNorm((3, 1, 4), 0)─┬─LpNorm((3, 1, 4), inf)─┐
│                  inf │                      1 │
└──────────────────────┴────────────────────────┘

I decided to not limit p. It definitely should NOT be just integer (I will add a test when p is float). Real math sense is when p >= 1, but probably somebody will want to use it for other purposes...
However, LpNorm with inf is weird as it is not the same as LinfNorm...
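
The two results in the query above are what one would expect if the formula (sum of |x|^p)^(1/p) is evaluated directly in IEEE 754 floating point. The Python sketch below reproduces them under that assumption. naive_lp_norm is a hypothetical helper, not the ClickHouse code, and the p == 0 branch emulates 1 / 0.0 = +inf because Python raises an exception for that division.

```python
import math
from typing import Sequence


def naive_lp_norm(u: Sequence[float], p: float) -> float:
    # Direct evaluation of (sum |x|^p)^(1/p) with no checks on p.
    total = sum(math.pow(abs(x), p) for x in u)
    # IEEE 754 gives 1 / 0.0 = +inf; Python raises instead, so emulate it.
    exponent = math.inf if p == 0 else 1.0 / p
    return math.pow(total, exponent)


print(naive_lp_norm((3, 1, 4), 0))         # inf: 3 terms of 1, then 3 ** inf
print(naive_lp_norm((3, 1, 4), math.inf))  # 1.0: inf ** 0
print(naive_lp_norm((3, 1, 4), 100))       # close to 4, the LinfNorm value
```

For large finite p the value approaches the maximum absolute coordinate, which is why the result 1 at p = inf differs from LinfNorm.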

Let's make it 1 <= p < inf?

Restrictions added in 53d7649.

src/Functions/vectorFunctions.cpp

vdimir · 2021-09-01T09:27:48Z

Btw, did you consider different naming variants for L1/Lp/Linf/Norm/Distance/... functions?

Maybe normL1, normLp, normLInf/normLinf, distanceL1, ... and similar is better (not to start name with capital letter)?

mathalex · 2021-09-02T17:25:42Z

Maybe normL1, normLp, normLInf/normLinf, distanceL1, ... and similar is better (not to start name with capital letter)?

If the goal is to avoid starting with a capital letter, the new names are better. The current names have the advantage that they read naturally (Lp-norm, rather than "the norm in the Lp metric").

UPD: added aliases

Interesting bug or feature: (1, 2) * NULL is NULL, not tuple of NULLs.

+ for Strings and * for String and Number have not been added, as they can be implemented soon. LpNorm cannot accept Decimal because of the pow function.

vdimir · 2021-09-10T06:43:30Z

AST fuzzer (debug)

The error is related to cosineDistance and NULL, see the report.

If it's difficult to fix we can omit support of Nullable arguments (but throw readable error). If it's easy to fix let's do it.

qieqieplus · 2021-09-26T08:40:18Z

Any plan for supporting Array columns?
I know some functions can be implemented with functions like arrayMap and arraySum, but specialized functions would have much better performance.

CLAassistant · 2021-09-28T08:47:15Z

All committers have signed the CLA.

vdimir · 2021-09-30T10:49:50Z

Fuzzer failures on 192633c

AST Fuzzer | failure | Found error: The query formatting is broken

Looks like there are some issues with formatting; 4acd8f3 should help.

gyuton · 2021-10-15T11:56:23Z

Internal documentation ticket: DOCSUP-16593.

Operators, refactoring

b3bd6b5

Added operators, tupleHammingDistance has been refactored

robot-clickhouse added the doc-alert and pr-feature (Pull request with new product feature) labels on Aug 20, 2021

mathalex and others added 21 commits August 20, 2021 18:03

Merge branch 'master' into numeric_tuple_functions

efff19e

Compatibility

00cd335

Deleting cerr

9a8cdee

Style

ba4da13

Divide, negate, test

04222f4

Merge branch 'ClickHouse:master' into numeric_tuple_functions

f6e306c

Reduced copy-paste, dotProduct

be7def3

L1Norm

ace922c

L2Norm

eb4970b

max2, min2, LinfNorm

657a8e1

LpNorm

6fbdda6

Distances, interesting tests

229b227

Reduced copy-paste

8b4c5aa

Reduced copy-paste

85bb176

Names, tupleOperatorByNumber

46596ff

Changed operator, fix clarity of test

91199dc

Normalize

0721041

Speed up

24de78f

cosineDistance

96f8079

operators

6580404

unary operator

7560481

mathalex marked this pull request as ready for review August 30, 2021 17:16

vdimir self-assigned this Aug 31, 2021

vdimir reviewed Sep 1, 2021

Style

89b672b

mathalex added 4 commits September 9, 2021 16:07

NULL support (check description)

7da26f8

Interesting bug or feature: (1, 2) * NULL is NULL, not tuple of NULLs.

Corner cases (check description)

7eb725a

+ for Strings, * for String and Number are not added as it can be implemented soon. LpNorm cannot get Decimal because of the pow function.

Tests with non constant columns

17bbe12

Add aliases

d63a1fb

mathalex and others added 5 commits September 16, 2021 15:43

Constant p

53d7649

Merge branch 'master' into numeric_tuple_functions

1215c25

Small update

6386560

Cosine null fix

e0963c7

Style

a429cbe

Merge branch 'ClickHouse:master' into numeric_tuple_functions

192633c

vdimir self-requested a review September 30, 2021 08:55

Do not add extra parentheses for tuple negate unary operator

4acd8f3

vdimir approved these changes Oct 1, 2021

vdimir merged commit ec966b7 into ClickHouse:master Oct 1, 2021

gyuton mentioned this pull request Oct 19, 2021

DOCSUP-16593: Documented tuple functions and updated operators #30418

Merged

alexey-milovidov mentioned this pull request Mar 10, 2022

Indexed Vector Similarity and kNN Search #35101

Closed

den-crane mentioned this pull request Mar 16, 2022

some geo functions are not documented #35341

Closed
