Better support for byte arrays #228

ChrisJollyAU · 2024-03-06T16:38:27Z

Fixes some long-standing issues with byte arrays

Use the correct function to get the length of the data
Getting the specific index in a byte array now works

Some comments

Contains still doesn't work
LENB seems to only work properly if column is a longbinary. If has a max length set column type becomes varbinary(5) (if length is 5). In this case the return value is 6. Is this a length+1 or the closest multiple of 2 (because double byte char set as LEN returns 3)

ChrisJollyAU · 2024-03-07T13:57:35Z

Update: for varbinary column LENB is always in multiples of 2, because the normal LEN is unicode or 2 byte chars

@lauxjpn Any thoughts on this?
Also, any thoughts on the contains tests with the byte arrays?

lauxjpn · 2024-03-07T17:48:40Z

@ChrisJollyAU Please post me some test names that demonstrate these issues, so I can tinker around with it.

ChrisJollyAU · 2024-03-07T17:55:07Z

For the contains:
Byte_array_contains_literal
Byte_array_contains_parameter

For the one with length on varbinary
Byte_array_filter_by_length_literal_does_not_cast_on_varbinary_n

You can pretty much just filter on byte_array and its only those 2 types of problems that fail.
To keep things simple you can tinker around with them on the GearsOfWar test (not the TPH,TPC,TPT versions - they will probably come right at the same time)

lauxjpn · 2024-03-07T20:32:36Z

So, it is not pretty, but it seems to work to convert the binary data to a unicode string first and then to use normal string functions on it.

For example, for the Byte_array_contains_literal test case, this returns an incorrect result:

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE INSTR(1, `s`.`Banner`, 0x01, 0) > 0

But the result for this one is correct:

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE INSTR(1, STRCONV(`s`.`Banner`, 64), 0x01, 0) > 0

The docs state:

Remarks

When you're converting from a Byte array in ANSI format to a string, you should use the StrConv function. When you're converting from such an array in Unicode format, use an assignment statement.

So the following translation should work:

return _sqlExpressionFactory.GreaterThan(
    _sqlExpressionFactory.Function(
        "INSTR",
        new[]
        {
            _sqlExpressionFactory.Constant(1),
            _sqlExpressionFactory.Function(
                "STRCONV",
                new [] { source, _sqlExpressionFactory.Constant(64) },
                nullable: true,
                argumentsPropagateNullability: new[] { true, false },
                typeof(string)),
            value,
            _sqlExpressionFactory.Constant(0)
        },
        nullable: true,
        argumentsPropagateNullability: new[] { true, true },
        typeof(int)),
    _sqlExpressionFactory.Constant(0));

I haven't checked, whether we should force a specific LCID as the 3rd parameter to STRCONV.

ChrisJollyAU · 2024-03-08T09:48:59Z

Would not not be better to try to convert the 0x01 (the item you are looking for). That way you are only converting 1 byte rather than the whole contents?

lauxjpn · 2024-03-08T10:54:41Z

Would not not be better to try to convert the 0x01 (the item you are looking for). That way you are only converting 1 byte rather than the whole contents?

That shouldn't work.

Remember, the issue is that there seems to be no function to search a binary/blob/byte array in Jet. So we need to use a function to search strings instead. Strings are Unicode (basically USC-2) in Jet, so every character is represented by two bytes.

Therefore, in the clause INSTR(1, STRCONV(`s`.`Banner`, 64), 0x01, 0) > 0, with `s`.`Banner` being 0x0001, the STRCONV function expands to INSTR(1, 0x00000100, 0x0100, 0) > 0 and correctly matches 0x0100 at position 2 (indexing is one-based).

If you would use INSTR(1, `s`.`Banner`, STRCONV(0x01, 64), 0) > 0 instead, the STRCONV function expands to INSTR(1, 0x0001, 0x0100, 0) > 0, which is not the same as the first translation and does not return the result we want, because it does not match 0x0100 at position 2 (or any other position).

ChrisJollyAU · 2024-03-08T11:35:46Z

I'm going to play around a bit more with it.

You will only ever be looking for 1 byte. The Contains function (from https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.contains?view=net-8.0 ) only takes a single item to look for. In the case of byte[] the Contains takes as parameter of type byte.

There are byte versions of most string related functions LENB, ASCB, INSTRB

ChrisJollyAU · 2024-03-08T11:55:44Z

Also just a comment on the length with varbinary, longbinary or the ole object also gives the same multiples of 2

lauxjpn · 2024-03-08T11:56:58Z

There are byte versions of most string related functions LENB, ASCB, INSTRB

INSTRB just returns the byte position of a (2-byte long Unicode) character in a Unicode string. It does not search for a single byte.
INSTRB will only reliably work, if the input string is a Unicode string, in which case, it will return (whatever `INSTR` returns) * 2.

Also just a comment on the length with varbinary, longbinary or the ole object also gives the same multiples of 2

Same mechanism as explained here in this post. The solution should be to convert to a Unicode string first, then use the non-B-version (which is the same as B-version / 2) on the result.

There don't seem to be any binary/blog/byte array targeting functions at all in Jet.

ChrisJollyAU · 2024-03-09T13:30:58Z

There don't seem to be any binary/blog/byte array targeting functions at all in Jet.

Yes and no. Within Jet there is no specific byte type so everything of that is read into strings. Of course being unicode based, the size of the strings are allocated based on unicode (double byte) characters that the data will fit in. But the data within that is still using single-byte data.

Most string related functions have a B version for working on byte related data e.g. the docs for MID have the following

Use the MidB function with byte data contained in a string, as in double-byte character set languages. Instead of specifying the number of characters, the arguments specify numbers of bytes

The functions of MIDB, LEFTB, RIGHTB*,ASCB,LENB** all work pretty much as expected
**LENB returns multiples of 2 as strings are allocated based on number of unicode chars to fit the data in. If raw data is an odd number the final byte is still valid and is a 0
*RIGHTB As it starts from the right, it will include the above mentioned 0 if that is the case

The opposite of ASCB, CHRB is undefined in Jet but available in VBA (because Jet is weird - you have ASC,ASCB,ASCW, CHR,CHRB,CHRW all documented and should be available except for one). Ok, maybe not so weird given the below comments, it's impossible to just have a single byte char.

INSTRB is also documented as working on the byte data. As all the other B functions take the string data happily without any conversion (so stays as single byte data), it should follow that INSTRB behaves the same.

Now that I'm typing this, it comes to me that the value to search FOR is still a string type and no matter whether you have a single byte or more, as it is created by unicode conventions the function will ultimately get a 1 unicode char (2-byte) parameter.

Was hoping to only need to convert the small amount of data, rather than the bigger string to search IN

Slightly off topic but helpful if you need for debugging. You can access 2 nice functions that is also supported in Jet: TypeName and VarType. If you're ever not sure what type Jet is looking at you can use that e.g

TypeName(`s`.`Banner5`) AS Expr6, VarType(`s`.`Banner`) AS Expr7

Expr6 will produce: String and Expr7 will produce 8 (vbString)
See https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/vartype-function for other values

lauxjpn · 2024-03-09T16:18:17Z

@ChrisJollyAU Well that's good, if most of the B-functions work as you expect, then use them in that way.

The fact that LENB returns the size of the allocated storage of a field, instead of the length of the original data that we stored in that field, is indeed an issue for us.

The following test would currently fail:

        [ConditionalTheory]
        [MemberData(nameof(IsAsyncData))]
        public virtual async Task Byte_array_filter_by_length_literal2(bool async)
        {
            await AssertQuery(
                async,
                ss => ss.Set<Squad>().Where(w => w.Banner5.Length == 5),
                ss => ss.Set<Squad>().Where(w => w.Banner5 != null && w.Banner5.Length == 5));

            AssertSql(
"""
SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE LENB(`s`.`Banner5`) = 5
""");
        }

(It would succeed with w.Banner5.Length == 6.)

This would also be the case, if we would translate it to this:

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE LEN(STRCONV(`s`.`Banner5`, 64)) = 5

Looks like `s`.`Banner5` returns whole storage space when used in functions, which includes the unused/non-original byte at the end and therefore results in a length of 6 (even though it does not return the 6th character (or byte if not converted)). It does not return the unused byte at the end in its result set though. Weird.

The same is true for the Banner (instead of the Banner5) field, if the stored data has an uneven length.

ChrisJollyAU · 2024-03-09T16:30:35Z

@lauxjpn Its Jet. Weirdness abounds

Byte_array_contains_literal now works but it still fails if using parameter of other column

Byte_array_contains_parameter
Contains_on_byte_array_property_using_byte_column

Edit:
Found the problem.
Just needs to be CHR(@__someByte_0). No string conversion. A byte parameter or column of type byte is actually a numerical value (type of Long within Jet). Also 0x01 literal in the SQL is actually already a type of String

ChrisJollyAU · 2024-03-09T19:56:26Z

Got a theory for the length

Check if the last byte (using RIGHTB is 0
If the above is true then subtract 1 from the calculated length, otherwise return the normal length

So anything with a length query becomes something like

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE IIF(ASCB(RIGHTB(`s`.`Banner`, 1)) = 0, LENB(`s`.`Banner`) - 1, LENB(`s`.`Banner`)) = 2

This does work for the current test cases. The ONLY exception I can see is if your byte data was an even number of bytes and is MEANT to end in a 0 (0x00). Then your length will be 1 short of what you want.

This is probably the best we can do under the limitations and probably good enough for the simple stuff. If you have any large data its probably better to query it first into the client then do whatever you want with it

Slight tangent: At least in SQL Server you can use the string functions on varchar which is single byte versus using nvarchar which would be 2 byte. Probably end up with the same length problem if you could only use nvarchar

… of bytes

…meworkCore.Jet into bytearray

lauxjpn · 2024-03-10T09:30:45Z

I think there are two ways to deal with this issue:

We do not translate those cases. It is better to throw an exception that explains the issue, than introducing subtle and hard to find bugs to user's code.
We implement your algorithm using an EF.Functions extension method, that clearly explains what its limitations are. That way, users can explicitly opt-in to using the algorithm and we can be sure, that they know what they are doing. The non EF.Functions translation would still throw an exception, explaining the issue and letting the user know about the EF.Functions extension method.

BTW, theoretically, cases that use a fixed binary column length that is even, can be translated normally, in case you want to implement it.

ChrisJollyAU · 2024-03-10T11:18:52Z

I think if I'm going to do anything I would add an option in JetOptions. Default is false and throws an error. If true then do my current code

BTW, theoretically, cases that use a fixed binary column length that is even, can be translated normally, in case you want to implement it.

How do you mean. In JetSqlTranslatingExpressionVisitor you don't know the size yet

lauxjpn · 2024-03-10T11:54:23Z

I think if I'm going to do anything I would add an option in JetOptions. Default is false and throws an error. If true then do my current code

I think that is inferior to using an extension method and kind of dangerous. With an extension method, users can precisely use the workaround where needed, which we should support for one that has quirks.

I would not want to globally enable this workaround in my user code, because it could change the behavior of queries that I add in the future to my code, without any warning. That could then easily lead to subtle and very hard to trace bugs again (that we would probably only hit once the app runs in production), which we should want to avoid at all costs.

BTW, theoretically, cases that use a fixed binary column length that is even, can be translated normally, in case you want to implement it.

How do you mean. In JetSqlTranslatingExpressionVisitor you don't know the size yet

You should generally be able to access the type mapping, which contains the size of the field if the size/field is fixed. If it is not available in the expression visitor you currently use to translate the expression, you can check it later once it is available.

ChrisJollyAU · 2024-03-10T12:03:07Z

I think that is inferior to using an extension method and kind of dangerous. With an extension method, users can precisely use the workaround where needed, which we should support for one that has quirks.

I would not want to globally enable this workaround in my user code, because it could change the behavior of queries that I add in the future to my code, without any warning. That could then easily lead to subtle and very hard to trace bugs again (that we would probably only hit once the app runs in production), which we should want to avoid at all costs.

What about having both options. If they want to turn it on for all cases then they can do that. As well as having the EF.Functions support for if they just want to do it selectively

You should generally be able to access the type mapping, which contains the size of the field if the size/field is fixed. If it is not available in the expression visitor you currently use to translate the expression, you can check it later once it is available.

Will find a test case that I can look at this scenario

lauxjpn · 2024-03-10T12:20:41Z

What about having both options. If they want to turn it on for all cases then they can do that. As well as having the EF.Functions support for if they just want to do it selectively

While that is of course technically possible, it still does not solve the issue that using the global option has a high chance of introducing subtle bugs to code, that I as a use have not even written at the time that I enable that option.

I would avoid this at all costs. We should setup our users for success, not for failure.

Throwing an exception to prompt the user to rewrite a query or use the extension method is cheap.
Having to analyze the whole codebase of an app in production for weird bugs and then fixing it is very expensive. Even more so if the global workaround has already led to data corruption.

…ength due to certain situations with unicode strings

…meworkCore.Jet into bytearray

ChrisJollyAU · 2024-03-11T10:36:59Z

While that is of course technically possible, it still does not solve the issue that using the global option has a high chance of introducing subtle bugs to code, that I as a use have not even written at the time that I enable that option.

Was more thinking of it being there to give them the choice as to the default behaviour. After thinking about it, its not going to be used too much so not too much necessity for it.

Have added the EF.Functions.ByteArrayLength method.
Original tests now assert an error is thrown
New tests added to specifically use EF.Functions.ByteArrayLength

Is the wording of the error message fine or do you have any further suggestions?

lauxjpn

I split the information in two.

src/EFCore.Jet/Properties/JetStrings.resx

src/EFCore.Jet/Query/ExpressionTranslators/Internal/JetByteArrayMethodTranslator.cs

…meworkCore.Jet into bytearray

lauxjpn

LGTM

fix array index and array length for byte arrays

49e1502

ChrisJollyAU added the bug label Mar 6, 2024

ChrisJollyAU added this to the 8.0.0-beta.2 milestone Mar 6, 2024

ChrisJollyAU self-assigned this Mar 6, 2024

[GitHub Actions] Update green tests.

c3b43a3

ChrisJollyAU and others added 4 commits March 10, 2024 04:58

Fix Contains with byte array. and fix getting length if an odd number…

f083485

… of bytes

fix array index and array length for byte arrays

43c9985

[GitHub Actions] Update green tests.

ac6a12c

Fix Contains with byte array. and fix getting length if an odd number…

3a23ffb

… of bytes

ChrisJollyAU force-pushed the bytearray branch from f083485 to 3a23ffb Compare March 9, 2024 20:59

ChrisJollyAU and others added 2 commits March 10, 2024 04:59

Merge branch 'bytearray' of https://github.com/ChrisJollyAU/EntityFra…

8656a70

…meworkCore.Jet into bytearray

[GitHub Actions] Update green tests.

408219b

ChrisJollyAU added 2 commits March 11, 2024 18:29

Enforce an optin methopdology using EF.Functions for the byte array l…

9928c51

…ength due to certain situations with unicode strings

Merge branch 'bytearray' of https://github.com/ChrisJollyAU/EntityFra…

3b5272a

…meworkCore.Jet into bytearray

[GitHub Actions] Update green tests.

8bff3cf

lauxjpn requested changes Mar 12, 2024

View reviewed changes

src/EFCore.Jet/Properties/JetStrings.resx Outdated Show resolved Hide resolved

src/EFCore.Jet/Query/ExpressionTranslators/Internal/JetByteArrayMethodTranslator.cs Show resolved Hide resolved

ChrisJollyAU added 2 commits March 12, 2024 17:59

Split error message and details into 2 parts

1cf60c5

Merge branch 'bytearray' of https://github.com/ChrisJollyAU/EntityFra…

3f7860f

…meworkCore.Jet into bytearray

lauxjpn approved these changes Mar 12, 2024

View reviewed changes

Merge branch 'master' into bytearray

52ae11a

ChrisJollyAU merged commit 41dab6c into CirrusRedOrg:master Mar 13, 2024
13 checks passed

ChrisJollyAU deleted the bytearray branch March 13, 2024 17:44

ChrisJollyAU mentioned this pull request Mar 30, 2024

EF Core 8 Support #143

Open

ChrisJollyAU mentioned this pull request Apr 13, 2024

We should not support SQL Server specific syntax or features #231

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better support for byte arrays #228

Better support for byte arrays #228

ChrisJollyAU commented Mar 6, 2024

ChrisJollyAU commented Mar 7, 2024

lauxjpn commented Mar 7, 2024

ChrisJollyAU commented Mar 7, 2024

lauxjpn commented Mar 7, 2024

Remarks

ChrisJollyAU commented Mar 8, 2024

lauxjpn commented Mar 8, 2024 •

edited

Loading

ChrisJollyAU commented Mar 8, 2024

ChrisJollyAU commented Mar 8, 2024

lauxjpn commented Mar 8, 2024 •

edited

Loading

ChrisJollyAU commented Mar 9, 2024

lauxjpn commented Mar 9, 2024 •

edited

Loading

ChrisJollyAU commented Mar 9, 2024 •

edited

Loading

ChrisJollyAU commented Mar 9, 2024

lauxjpn commented Mar 10, 2024 •

edited

Loading

ChrisJollyAU commented Mar 10, 2024 •

edited

Loading

lauxjpn commented Mar 10, 2024 •

edited

Loading

ChrisJollyAU commented Mar 10, 2024

lauxjpn commented Mar 10, 2024 •

edited

Loading

ChrisJollyAU commented Mar 11, 2024

lauxjpn left a comment

lauxjpn left a comment

Better support for byte arrays #228

Better support for byte arrays #228

Conversation

ChrisJollyAU commented Mar 6, 2024

ChrisJollyAU commented Mar 7, 2024

lauxjpn commented Mar 7, 2024

ChrisJollyAU commented Mar 7, 2024

lauxjpn commented Mar 7, 2024

Remarks

ChrisJollyAU commented Mar 8, 2024

lauxjpn commented Mar 8, 2024 • edited Loading

ChrisJollyAU commented Mar 8, 2024

ChrisJollyAU commented Mar 8, 2024

lauxjpn commented Mar 8, 2024 • edited Loading

ChrisJollyAU commented Mar 9, 2024

lauxjpn commented Mar 9, 2024 • edited Loading

ChrisJollyAU commented Mar 9, 2024 • edited Loading

ChrisJollyAU commented Mar 9, 2024

lauxjpn commented Mar 10, 2024 • edited Loading

ChrisJollyAU commented Mar 10, 2024 • edited Loading

lauxjpn commented Mar 10, 2024 • edited Loading

ChrisJollyAU commented Mar 10, 2024

lauxjpn commented Mar 10, 2024 • edited Loading

ChrisJollyAU commented Mar 11, 2024

lauxjpn left a comment

Choose a reason for hiding this comment

lauxjpn left a comment

Choose a reason for hiding this comment

lauxjpn commented Mar 8, 2024 •

edited

Loading

lauxjpn commented Mar 8, 2024 •

edited

Loading

lauxjpn commented Mar 9, 2024 •

edited

Loading

ChrisJollyAU commented Mar 9, 2024 •

edited

Loading

lauxjpn commented Mar 10, 2024 •

edited

Loading

ChrisJollyAU commented Mar 10, 2024 •

edited

Loading

lauxjpn commented Mar 10, 2024 •

edited

Loading

lauxjpn commented Mar 10, 2024 •

edited

Loading