Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better support for byte arrays #228

Merged
merged 14 commits into from
Mar 13, 2024

Conversation

ChrisJollyAU
Copy link
Member

Fixes some long-standing issues with byte arrays

  • Use the correct function to get the length of the data
  • Getting the specific index in a byte array now works

Some comments

  • Contains still doesn't work
  • LENB seems to only work properly if column is a longbinary. If has a max length set column type becomes varbinary(5) (if length is 5). In this case the return value is 6. Is this a length+1 or the closest multiple of 2 (because double byte char set as LEN returns 3)

@ChrisJollyAU ChrisJollyAU added this to the 8.0.0-beta.2 milestone Mar 6, 2024
@ChrisJollyAU ChrisJollyAU self-assigned this Mar 6, 2024
@ChrisJollyAU
Copy link
Member Author

Update: for varbinary column LENB is always in multiples of 2, because the normal LEN is unicode or 2 byte chars

@lauxjpn Any thoughts on this?
Also, any thoughts on the contains tests with the byte arrays?

@lauxjpn
Copy link
Member

lauxjpn commented Mar 7, 2024

@ChrisJollyAU Please post me some test names that demonstrate these issues, so I can tinker around with it.

@ChrisJollyAU
Copy link
Member Author

For the contains:
Byte_array_contains_literal
Byte_array_contains_parameter

For the one with length on varbinary
Byte_array_filter_by_length_literal_does_not_cast_on_varbinary_n

You can pretty much just filter on byte_array and its only those 2 types of problems that fail.
To keep things simple you can tinker around with them on the GearsOfWar test (not the TPH,TPC,TPT versions - they will probably come right at the same time)

@lauxjpn
Copy link
Member

lauxjpn commented Mar 7, 2024

So, it is not pretty, but it seems to work to convert the binary data to a unicode string first and then to use normal string functions on it.

For example, for the Byte_array_contains_literal test case, this returns an incorrect result:

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE INSTR(1, `s`.`Banner`, 0x01, 0) > 0

But the result for this one is correct:

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE INSTR(1, STRCONV(`s`.`Banner`, 64), 0x01, 0) > 0

The docs state:

Remarks

When you're converting from a Byte array in ANSI format to a string, you should use the StrConv function. When you're converting from such an array in Unicode format, use an assignment statement.

So the following translation should work:

return _sqlExpressionFactory.GreaterThan(
    _sqlExpressionFactory.Function(
        "INSTR",
        new[]
        {
            _sqlExpressionFactory.Constant(1),
            _sqlExpressionFactory.Function(
                "STRCONV",
                new [] { source, _sqlExpressionFactory.Constant(64) },
                nullable: true,
                argumentsPropagateNullability: new[] { true, false },
                typeof(string)),
            value,
            _sqlExpressionFactory.Constant(0)
        },
        nullable: true,
        argumentsPropagateNullability: new[] { true, true },
        typeof(int)),
    _sqlExpressionFactory.Constant(0));

I haven't checked, whether we should force a specific LCID as the 3rd parameter to STRCONV.

@ChrisJollyAU
Copy link
Member Author

Would not not be better to try to convert the 0x01 (the item you are looking for). That way you are only converting 1 byte rather than the whole contents?

@lauxjpn
Copy link
Member

lauxjpn commented Mar 8, 2024

Would not not be better to try to convert the 0x01 (the item you are looking for). That way you are only converting 1 byte rather than the whole contents?

That shouldn't work.

Remember, the issue is that there seems to be no function to search a binary/blob/byte array in Jet. So we need to use a function to search strings instead. Strings are Unicode (basically USC-2) in Jet, so every character is represented by two bytes.

Therefore, in the clause INSTR(1, STRCONV(`s`.`Banner`, 64), 0x01, 0) > 0, with `s`.`Banner` being 0x0001, the STRCONV function expands to INSTR(1, 0x00000100, 0x0100, 0) > 0 and correctly matches 0x0100 at position 2 (indexing is one-based).

If you would use INSTR(1, `s`.`Banner`, STRCONV(0x01, 64), 0) > 0 instead, the STRCONV function expands to INSTR(1, 0x0001, 0x0100, 0) > 0, which is not the same as the first translation and does not return the result we want, because it does not match 0x0100 at position 2 (or any other position).

@ChrisJollyAU
Copy link
Member Author

I'm going to play around a bit more with it.

You will only ever be looking for 1 byte. The Contains function (from https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable.contains?view=net-8.0 ) only takes a single item to look for. In the case of byte[] the Contains takes as parameter of type byte.

There are byte versions of most string related functions LENB, ASCB, INSTRB

@ChrisJollyAU
Copy link
Member Author

Also just a comment on the length with varbinary, longbinary or the ole object also gives the same multiples of 2

@lauxjpn
Copy link
Member

lauxjpn commented Mar 8, 2024

There are byte versions of most string related functions LENB, ASCB, INSTRB

INSTRB just returns the byte position of a (2-byte long Unicode) character in a Unicode string. It does not search for a single byte.
INSTRB will only reliably work, if the input string is a Unicode string, in which case, it will return (whatever `INSTR` returns) * 2.

Also just a comment on the length with varbinary, longbinary or the ole object also gives the same multiples of 2

Same mechanism as explained here in this post. The solution should be to convert to a Unicode string first, then use the non-B-version (which is the same as B-version / 2) on the result.


There don't seem to be any binary/blog/byte array targeting functions at all in Jet.

@ChrisJollyAU
Copy link
Member Author

There don't seem to be any binary/blog/byte array targeting functions at all in Jet.

Yes and no. Within Jet there is no specific byte type so everything of that is read into strings. Of course being unicode based, the size of the strings are allocated based on unicode (double byte) characters that the data will fit in. But the data within that is still using single-byte data.

Most string related functions have a B version for working on byte related data e.g. the docs for MID have the following

Use the MidB function with byte data contained in a string, as in double-byte character set languages. Instead of specifying the number of characters, the arguments specify numbers of bytes

The functions of MIDB, LEFTB, RIGHTB*,ASCB,LENB** all work pretty much as expected
**LENB returns multiples of 2 as strings are allocated based on number of unicode chars to fit the data in. If raw data is an odd number the final byte is still valid and is a 0
*RIGHTB As it starts from the right, it will include the above mentioned 0 if that is the case

The opposite of ASCB, CHRB is undefined in Jet but available in VBA (because Jet is weird - you have ASC,ASCB,ASCW, CHR,CHRB,CHRW all documented and should be available except for one). Ok, maybe not so weird given the below comments, it's impossible to just have a single byte char.

INSTRB is also documented as working on the byte data. As all the other B functions take the string data happily without any conversion (so stays as single byte data), it should follow that INSTRB behaves the same.

Now that I'm typing this, it comes to me that the value to search FOR is still a string type and no matter whether you have a single byte or more, as it is created by unicode conventions the function will ultimately get a 1 unicode char (2-byte) parameter.

Was hoping to only need to convert the small amount of data, rather than the bigger string to search IN

Slightly off topic but helpful if you need for debugging. You can access 2 nice functions that is also supported in Jet: TypeName and VarType. If you're ever not sure what type Jet is looking at you can use that e.g

TypeName(`s`.`Banner5`) AS Expr6, VarType(`s`.`Banner`) AS Expr7

Expr6 will produce: String and Expr7 will produce 8 (vbString)
See https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/vartype-function for other values

@lauxjpn
Copy link
Member

lauxjpn commented Mar 9, 2024

@ChrisJollyAU Well that's good, if most of the B-functions work as you expect, then use them in that way.


The fact that LENB returns the size of the allocated storage of a field, instead of the length of the original data that we stored in that field, is indeed an issue for us.

The following test would currently fail:

        [ConditionalTheory]
        [MemberData(nameof(IsAsyncData))]
        public virtual async Task Byte_array_filter_by_length_literal2(bool async)
        {
            await AssertQuery(
                async,
                ss => ss.Set<Squad>().Where(w => w.Banner5.Length == 5),
                ss => ss.Set<Squad>().Where(w => w.Banner5 != null && w.Banner5.Length == 5));

            AssertSql(
"""
SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE LENB(`s`.`Banner5`) = 5
""");
        }

(It would succeed with w.Banner5.Length == 6.)

This would also be the case, if we would translate it to this:

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE LEN(STRCONV(`s`.`Banner5`, 64)) = 5

Looks like `s`.`Banner5` returns whole storage space when used in functions, which includes the unused/non-original byte at the end and therefore results in a length of 6 (even though it does not return the 6th character (or byte if not converted)). It does not return the unused byte at the end in its result set though. Weird.

The same is true for the Banner (instead of the Banner5) field, if the stored data has an uneven length.

@ChrisJollyAU
Copy link
Member Author

ChrisJollyAU commented Mar 9, 2024

@lauxjpn Its Jet. Weirdness abounds

Byte_array_contains_literal now works but it still fails if using parameter of other column

Byte_array_contains_parameter
Contains_on_byte_array_property_using_byte_column

Edit:
Found the problem.
Just needs to be CHR(@__someByte_0). No string conversion. A byte parameter or column of type byte is actually a numerical value (type of Long within Jet). Also 0x01 literal in the SQL is actually already a type of String

@ChrisJollyAU
Copy link
Member Author

Got a theory for the length

  • Check if the last byte (using RIGHTB is 0
  • If the above is true then subtract 1 from the calculated length, otherwise return the normal length

So anything with a length query becomes something like

SELECT `s`.`Id`, `s`.`Banner`, `s`.`Banner5`, `s`.`InternalNumber`, `s`.`Name`
FROM `Squads` AS `s`
WHERE IIF(ASCB(RIGHTB(`s`.`Banner`, 1)) = 0, LENB(`s`.`Banner`) - 1, LENB(`s`.`Banner`)) = 2

This does work for the current test cases. The ONLY exception I can see is if your byte data was an even number of bytes and is MEANT to end in a 0 (0x00). Then your length will be 1 short of what you want.

This is probably the best we can do under the limitations and probably good enough for the simple stuff. If you have any large data its probably better to query it first into the client then do whatever you want with it

Slight tangent: At least in SQL Server you can use the string functions on varchar which is single byte versus using nvarchar which would be 2 byte. Probably end up with the same length problem if you could only use nvarchar

@lauxjpn
Copy link
Member

lauxjpn commented Mar 10, 2024

I think there are two ways to deal with this issue:

  1. We do not translate those cases. It is better to throw an exception that explains the issue, than introducing subtle and hard to find bugs to user's code.

  2. We implement your algorithm using an EF.Functions extension method, that clearly explains what its limitations are. That way, users can explicitly opt-in to using the algorithm and we can be sure, that they know what they are doing. The non EF.Functions translation would still throw an exception, explaining the issue and letting the user know about the EF.Functions extension method.

BTW, theoretically, cases that use a fixed binary column length that is even, can be translated normally, in case you want to implement it.

@ChrisJollyAU
Copy link
Member Author

ChrisJollyAU commented Mar 10, 2024

I think if I'm going to do anything I would add an option in JetOptions. Default is false and throws an error. If true then do my current code

BTW, theoretically, cases that use a fixed binary column length that is even, can be translated normally, in case you want to implement it.

How do you mean. In JetSqlTranslatingExpressionVisitor you don't know the size yet

@lauxjpn
Copy link
Member

lauxjpn commented Mar 10, 2024

I think if I'm going to do anything I would add an option in JetOptions. Default is false and throws an error. If true then do my current code

I think that is inferior to using an extension method and kind of dangerous. With an extension method, users can precisely use the workaround where needed, which we should support for one that has quirks.

I would not want to globally enable this workaround in my user code, because it could change the behavior of queries that I add in the future to my code, without any warning. That could then easily lead to subtle and very hard to trace bugs again (that we would probably only hit once the app runs in production), which we should want to avoid at all costs.


BTW, theoretically, cases that use a fixed binary column length that is even, can be translated normally, in case you want to implement it.

How do you mean. In JetSqlTranslatingExpressionVisitor you don't know the size yet

You should generally be able to access the type mapping, which contains the size of the field if the size/field is fixed. If it is not available in the expression visitor you currently use to translate the expression, you can check it later once it is available.

@ChrisJollyAU
Copy link
Member Author

I think that is inferior to using an extension method and kind of dangerous. With an extension method, users can precisely use the workaround where needed, which we should support for one that has quirks.

I would not want to globally enable this workaround in my user code, because it could change the behavior of queries that I add in the future to my code, without any warning. That could then easily lead to subtle and very hard to trace bugs again (that we would probably only hit once the app runs in production), which we should want to avoid at all costs.

What about having both options. If they want to turn it on for all cases then they can do that. As well as having the EF.Functions support for if they just want to do it selectively

You should generally be able to access the type mapping, which contains the size of the field if the size/field is fixed. If it is not available in the expression visitor you currently use to translate the expression, you can check it later once it is available.

Will find a test case that I can look at this scenario

@lauxjpn
Copy link
Member

lauxjpn commented Mar 10, 2024

What about having both options. If they want to turn it on for all cases then they can do that. As well as having the EF.Functions support for if they just want to do it selectively

While that is of course technically possible, it still does not solve the issue that using the global option has a high chance of introducing subtle bugs to code, that I as a use have not even written at the time that I enable that option.

I would avoid this at all costs. We should setup our users for success, not for failure.

Throwing an exception to prompt the user to rewrite a query or use the extension method is cheap.
Having to analyze the whole codebase of an app in production for weird bugs and then fixing it is very expensive. Even more so if the global workaround has already led to data corruption.

@ChrisJollyAU
Copy link
Member Author

While that is of course technically possible, it still does not solve the issue that using the global option has a high chance of introducing subtle bugs to code, that I as a use have not even written at the time that I enable that option.

Was more thinking of it being there to give them the choice as to the default behaviour. After thinking about it, its not going to be used too much so not too much necessity for it.

  • Have added the EF.Functions.ByteArrayLength method.
  • Original tests now assert an error is thrown
  • New tests added to specifically use EF.Functions.ByteArrayLength

Is the wording of the error message fine or do you have any further suggestions?

Copy link
Member

@lauxjpn lauxjpn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split the information in two.

Copy link
Member

@lauxjpn lauxjpn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ChrisJollyAU ChrisJollyAU merged commit 41dab6c into CirrusRedOrg:master Mar 13, 2024
13 checks passed
@ChrisJollyAU ChrisJollyAU deleted the bytearray branch March 13, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants