Add math operators and functions to work with multidimensional vectors #27933

Merged
merged 34 commits into from
Oct 1, 2021

Conversation

mathalex (Contributor) commented Aug 20, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category:

  • New Feature

Changelog entry:
This fully closes #4509 and even more.

Detailed description / Documentation draft:

  • Function tupleHammingDistance has been refactored. Now the summation is performed within only one variable instead of a column to get the result.

  • operator +, tuplePlus, vectorSum — do tuple-wise addition. Arguments: (Tuple, Tuple). Returns: Tuple.

  • operator -, tupleMinus, vectorDifference — do tuple-wise subtraction. Arguments: (Tuple, Tuple). Returns: Tuple.

  • tupleMultiply — do tuple-wise multiplication (compatibility). Arguments: (Tuple, Tuple). Returns: Tuple.

  • tupleDivide — do tuple-wise division (compatibility). Arguments: (Tuple, Tuple). Returns: Tuple.

  • unary operator -, tupleNegate — do tuple-wise negation. Arguments: (Tuple). Returns: Tuple.

  • operator *, tupleMultiplyByNumber — multiplies each element of a tuple by a number. Arguments: (Tuple, Number). Returns: Tuple.

  • operator /, tupleDivideByNumber — divides each element of a tuple by a number. Arguments: (Tuple, Number). Returns: Tuple.

  • operator *, dotProduct, scalarProduct — a dot (aka scalar) product of vectors. Arguments: (Tuple, Tuple). Returns: Number.

  • L1Norm, normL1 — calculates the sum of absolute values of coordinates. Arguments: (Tuple). Returns: Number.

  • L2Norm, normL2 — calculates the square root of the sum of the squares of the coordinates. Arguments: (Tuple). Returns: Number.

  • LinfNorm, normLinf — calculates the maximum absolute value among coordinates. Arguments: (Tuple). Returns: Number.

  • LpNorm, normLp — calculates the p-th root of the sum of the p-th powers of the absolute values of the coordinates. Arguments: (Tuple, Number). Returns: Number. LpNorm should be reviewed very carefully.

  • L1Distance, distanceL1 — finds the distance between two points (as tuples) using 1-norm. Arguments: (Tuple, Tuple). Returns: Number.

  • L2Distance, distanceL2 — finds the distance between two points (as tuples) using 2-norm. Arguments: (Tuple, Tuple). Returns: Number.

  • LinfDistance, distanceLinf — finds the distance between two points (as tuples) using infinity-norm. Arguments: (Tuple, Tuple). Returns: Number.

  • LpDistance, distanceLp — finds the distance between two points (as tuples) using p-norm. Arguments: (Tuple, Tuple, Number). Returns: Number.

  • L1Normalize, normalizeL1 — finds a unit vector of a given vector (tuple) according to 1-norm. Arguments: (Tuple). Returns: Tuple.

  • L2Normalize, normalizeL2 — finds a unit vector of a given vector (tuple) according to 2-norm. Arguments: (Tuple). Returns: Tuple.

  • LinfNormalize, normalizeLinf — finds a unit vector of a given vector (tuple) according to infinity-norm. Arguments: (Tuple). Returns: Tuple.

  • LpNormalize, normalizeLp — finds a unit vector of a given vector (tuple) according to p-norm. Arguments: (Tuple, Number). Returns: Tuple.

  • cosineDistance — calculates the cosine of the angle between vectors and subtracts it from one. Arguments: (Tuple, Tuple). Returns: Number.

  • max2 — finds the maximum of two numbers (developed for the LinfNorm function, and it is simply good to have). Arguments: (Number, Number). Returns: Number. It may need updating because, as far as I understand, the arguments are converted to Float64.

  • min2 — finds the minimum of two numbers (compatibility). Arguments: (Number, Number). Returns: Number.
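
As a rough reference model of the tuple-wise operators listed above, here is a minimal Python sketch. The Python function names are hypothetical stand-ins for the SQL functions, and the size check is an assumption about how mismatched tuples are handled, not a statement about the actual implementation.

```python
def _check_same_size(u, v):
    # Assumption: tuples of different sizes are rejected.
    if len(u) != len(v):
        raise ValueError("tuples must have the same number of elements")


def tuple_plus(u, v):
    """Model of tuplePlus / operator +: element-wise addition."""
    _check_same_size(u, v)
    return tuple(a + b for a, b in zip(u, v))


def tuple_minus(u, v):
    """Model of tupleMinus / operator -: element-wise subtraction."""
    _check_same_size(u, v)
    return tuple(a - b for a, b in zip(u, v))


def tuple_negate(u):
    """Model of tupleNegate / unary operator -."""
    return tuple(-a for a in u)


def tuple_multiply_by_number(u, k):
    """Model of tupleMultiplyByNumber: scale every element by k."""
    return tuple(a * k for a in u)


def tuple_divide_by_number(u, k):
    """Model of tupleDivideByNumber (Python raises on k == 0; SQL may differ)."""
    return tuple(a / k for a in u)


def dot_product(u, v):
    """Model of dotProduct / scalarProduct: sum of pairwise products."""
    _check_same_size(u, v)
    return sum(a * b for a, b in zip(u, v))
```

For example, `tuple_plus((1, 2), (3, 4))` gives `(4, 6)` and `dot_product((1, 2, 3), (4, 5, 6))` gives `32`.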


Examples for each of the queries can be found in the tests related to this pull request.

The Lp functions only make sense for p >= 1, since otherwise the result is not a norm. However, there are no restrictions, so the user can pass even a negative number as the parameter.
UPD: added restrictions 1 <= p < inf.
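
To illustrate why the restriction 1 <= p < inf is reasonable, here is a minimal Python sketch. The names `lp_norm` and `lp_unrestricted` are hypothetical, and raising `ValueError` is only a stand-in for whatever error the real function reports. The unrestricted variant shows that for p = 0.5 the triangle inequality fails, so the result is not a norm.

```python
import math


def lp_norm(u, p):
    """Model of LpNorm with the restriction 1 <= p < inf."""
    # NaN also fails this check, because comparisons with NaN are False.
    if not (1 <= p < math.inf):
        raise ValueError("p must satisfy 1 <= p < inf")
    return sum(abs(x) ** p for x in u) ** (1.0 / p)


def lp_unrestricted(u, p):
    """Same formula without the check, only to show what goes wrong for p < 1."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)


# With p = 0.5 the "norm" of a sum exceeds the sum of the "norms":
# (1, 1) = (1, 0) + (0, 1), yet 4.0 > 1.0 + 1.0.
whole = lp_unrestricted((1, 1), 0.5)
parts = lp_unrestricted((1, 0), 0.5) + lp_unrestricted((0, 1), 0.5)
```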

LxDistance(u, v) := LxNorm(u - v), LxNormalize(u) := u / LxNorm(u), cosineDistance(u, v) := 1 - (u * v) / (L2Norm(u) * L2Norm(v)).
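
The definitions above can be sketched in Python as follows. This is only an illustration of the formulas; the Python names are hypothetical, and zero vectors (which would divide by zero in `normalize` and `cosine_distance`) are deliberately not handled.

```python
import math


def l1_norm(u):
    return sum(abs(x) for x in u)


def l2_norm(u):
    return math.sqrt(sum(x * x for x in u))


def linf_norm(u):
    return max(abs(x) for x in u)


def distance(norm, u, v):
    """LxDistance(u, v) := LxNorm(u - v)."""
    return norm(tuple(a - b for a, b in zip(u, v)))


def normalize(norm, u):
    """LxNormalize(u) := u / LxNorm(u)."""
    n = norm(u)
    return tuple(x / n for x in u)


def cosine_distance(u, v):
    """cosineDistance(u, v) := 1 - (u * v) / (L2Norm(u) * L2Norm(v))."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (l2_norm(u) * l2_norm(v))
```

For instance, `distance(l2_norm, (1, 2), (4, 6))` is the Euclidean distance 5.0, and `cosine_distance((1, 0), (0, 1))` is 1.0 because the vectors are orthogonal.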

Operators overloading that can be added:

  • (not related to numeric vectors) Overload + ((String, String) -> String) to concatenate strings, * ((String, Integer) -> String) to concatenate one string multiple times.

Added operators, tupleHammingDistance has been refactored
@robot-clickhouse added the doc-alert and pr-feature labels Aug 20, 2021
@mathalex mathalex marked this pull request as ready for review August 30, 2021 17:16
@vdimir vdimir self-assigned this Aug 31, 2021
Member

@vdimir vdimir left a comment

Excellent work!

Some comments, most of them suggest style changes.

src/Functions/max2.cpp (resolved)
src/Functions/min2.cpp (resolved)
src/Functions/vectorFunctions.cpp (outdated, resolved)
src/Functions/vectorFunctions.cpp (outdated, resolved)
src/Functions/vectorFunctions.cpp (outdated, resolved)
{
    try
    {
        ColumnWithTypeAndName left{left_elements.empty() ? nullptr : left_elements[i], left_types[i], {}};
Member

Are there real cases where it's empty?

Contributor Author

Just copy-pasted from tupleHammingDistance... Actually, I do not know (and I agree, it would be strange if there were a real case). I can delete it everywhere. If the old tupleHammingDistance tests do not fail, we can conclude that there are no such cases :)

Contributor Author

Yes, there are!!! SELECT tupleHammingDistance(materialize((1, 2)), (1, 4)); -- do not know why, but it fails if you leave only the third part of the ternary operator.

src/Functions/FunctionBinaryArithmetic.h (outdated, resolved)
namespace
{
struct Max2Name { static constexpr auto name = "max2"; };
using FunctionMax2 = FunctionMathBinaryFloat64<BinaryFunctionVectorized<Max2Name, max>>;
Member

> Maybe should be updated as there is a conversion to Float64, as I can understand.

Yes, it returns Float64 for all input types.

SELECT toTypeName(max2(2, 1))

┌─toTypeName(max2(2, 1))─┐
│ Float64                │
└────────────────────────┘

Contributor Author

Thanks. I think it is not good for such an exact function to always produce a Float64 result, but I do not know a convenient way to fix it.
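
One concrete reason the Float64 conversion matters for an otherwise exact function: integers above 2^53 are not all representable as Float64. The Python sketch below models this with a hypothetical `max2_via_float64` helper; it is not the ClickHouse code, only an illustration of the precision loss.

```python
def max2_via_float64(a, b):
    """Model of a max that first converts both arguments to Float64."""
    return max(float(a), float(b))


big = 2**53 + 1                    # smallest positive integer a Float64 cannot hold exactly
exact = max(big, 1)                # integer max keeps the value: 9007199254740993
lossy = max2_via_float64(big, 1)   # rounds to 9007199254740992.0
```

Here `int(lossy)` differs from `exact` by one, which is the kind of silent change an integer-preserving `max2` would avoid.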

if (tuple_size == 0)
    return DataTypeUInt8().createColumnConstWithDefaultValue(input_rows_count);

const auto & p_column = arguments[1];
Member

Are there use cases for a non-constant p, what do you think? Also, should we limit the possible values of p, e.g. 0 < p < inf, or make it an integer?

SELECT LpNorm((3, 1, 4), 0), LpNorm((3, 1, 4), inf);

┌─LpNorm((3, 1, 4), 0)─┬─LpNorm((3, 1, 4), inf)─┐
│                  inf │                      1 │
└──────────────────────┴────────────────────────┘

Contributor Author

@mathalex mathalex Sep 2, 2021

I decided to not limit p. It definitely should NOT be just integer (I will add a test when p is float). Real math sense is when p >= 1, but probably somebody will want to use it for other purposes...
However, LpNorm with inf is weird as it is not the same as LinfNorm...
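
The mismatch between LpNorm with inf and LinfNorm can be reproduced with a naive formula. The Python sketch below assumes the unrestricted implementation computed the sum of powers followed by a 1/p root, which is consistent with the result of 1 shown in the query output above; the function names are hypothetical.

```python
import math


def lp_norm_naive(u, p):
    """Naive p-norm: (sum |x|^p)^(1/p), with no check on p."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)


def linf_norm(u):
    """Infinity norm: the largest absolute coordinate."""
    return max(abs(x) for x in u)


v = (3, 1, 4)

# Mathematically the p-norm tends to the infinity norm as p grows ...
close_to_linf = lp_norm_naive(v, 100.0)

# ... but plugging in p = inf directly yields inf ** 0.0, which is 1.0.
at_inf = lp_norm_naive(v, math.inf)
```

So the limit of the p-norm is 4 (the value of `linf_norm(v)`), while direct evaluation at inf gives 1.0, which is why excluding inf from the allowed range is a sensible choice.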

Member

Let's make it 1 <= p < inf?

Contributor Author

Ok.

Contributor Author

Restrictions added in the 53d7649.

src/Functions/vectorFunctions.cpp (outdated, resolved)
@vdimir
Member

vdimir commented Sep 1, 2021

Btw, did you consider different naming variants for L1/Lp/Linf/Norm/Distance/... functions?

Maybe normL1, normLp, normLInf/normLinf, distanceL1, ... and similar is better (not to start name with capital letter)?

@mathalex
Contributor Author

mathalex commented Sep 2, 2021

> Maybe normL1, normLp, normLInf/normLinf, distanceL1, ... and similar is better (not to start name with capital letter)?

For the purpose of not starting with a capital letter, the new names are better. The current names are good in that they read straightforwardly (Lp-norm rather than the norm in the Lp metric).

UPD: added aliases

Interesting bug or feature: (1, 2) * NULL is NULL, not tuple of NULLs.
+ for Strings and * for String and Number are not added here, since they can be implemented soon. LpNorm cannot accept Decimal because of the pow function.
@vdimir
Member

vdimir commented Sep 10, 2021

AST fuzzer (debug)

The error is related to cosineDistance and NULL, see the report.

If it's difficult to fix we can omit support of Nullable arguments (but throw readable error). If it's easy to fix let's do it.

@qieqieplus
Contributor

Any plan for supporting Array columns?
I know some functions can be implemented with functions like arrayMap and arraySum, but specialized functions would have much better performance.

@CLAassistant

CLAassistant commented Sep 28, 2021

CLA assistant check
All committers have signed the CLA.

@vdimir vdimir self-requested a review September 30, 2021 08:55
@vdimir
Member

vdimir commented Sep 30, 2021

Fuzzer failures on 192633c

AST Fuzzer | failure | Found error: The query formatting is broken

Looks like there are some issues with formatting; 4acd8f3 should help.

@vdimir vdimir merged commit ec966b7 into ClickHouse:master Oct 1, 2021
@gyuton
Contributor

gyuton commented Oct 15, 2021

Internal documentation ticket: DOCSUP-16593.

Successfully merging this pull request may close these issues.

support multidimensional cosine distance and euclidean distance function
6 participants