Overload of DetectFromBytes(byte[] bytes, int offset, int len) #106

ied206 · 2020-01-05T04:06:36Z

I added an overload of CharsetDetector.DetectFromBytes() that takes offset and length parameters.

CharsetDetector.DetectFromBytes(byte[] bytes) cannot check only a subset of the byte array, the way Stream.Read() and Stream.Write() can work on one. The new overload enables that.

ied206 · 2020-01-05T04:08:32Z

I also considered adding a ReadOnlySpan<byte> overload by using the System.Memory NuGet package, but did not add the code. UtfUnknown targets .NET Framework 4.0, which System.Memory does not support.

rstm-sf · 2020-01-05T05:01:04Z

Hello!

You can revert and refactor like this:

UTF-unknown/src/UTF-unknown.csproj

Line 19 in d52af8d

    
    <ItemGroup Condition=" '$(TargetFramework)' == 'netstandard1.3' Or '$(TargetFramework)' == 'netstandard2.0' ">

and this

UTF-unknown/src/DetectionDetail.cs

Lines 97 to 101 in d52af8d

    
    #if NETSTANDARD && !NETSTANDARD1_0 || NETCOREAPP3_0
                return CodePagesEncodingProvider.Instance.GetEncoding(encodingName);
    #else
                return null;
    #endif

I think you can also use netstandard and add netcoreapp3_0.

ied206 · 2020-01-05T16:20:02Z

I tried to add Span APIs conditionally, but I ran into some serious issues.

Summary

Writing maintainable code for the Span APIs requires dropping the old targets (.NET Framework 4.0, .NET Standard 1.0) and adding a new target (.NET Framework 4.5).
If we do not compromise on the targets, the code becomes unmaintainable because of the duplication.

Details

Supporting Span conditionally results in duplicated core logic

A Span can be created from an array, but not the other way around (without copying).
So, to support Span alongside the traditional APIs, the easiest implementation is to have Method(byte[], int, int) call Method(Span<byte>) internally.

Original Code:

void Write(byte[] buf, int offset, int count)
{
	for (int i = offset; i < offset + count; i++)
	{ 
		byte b = buf[i];
		/* Do Some Work */
	}
}

Patched Code to support Span (Ideal):

void Write(byte[] buf, int offset, int count)
{
	Write(buf.AsSpan(offset, count));
}

void Write(ReadOnlySpan<byte> span)
{
	for (int i = 0; i < span.Length; i++)
	{
		byte b = span[i];
		/* Do Some Work */
	}
}

However, since System.Memory does not support the .NET Framework 4.0 and .NET Standard 1.0 targets, I cannot use that tactic. Instead, I would have to duplicate the logic, which greatly increases the maintenance burden. That is why I did not add the Span API.

Patched Code to support Span (Duplication):

// Define SPAN_SUPPORT only when neither legacy target is being built.
#if !NET40 && !NETSTANDARD1_0
#define SPAN_SUPPORT
#endif

void Write(byte[] buf, int offset, int count)
{
	for (int i = offset; i < offset + count; i++)
	{
		byte b = buf[i];
		/* Do Some Work */
	}
}

#if SPAN_SUPPORT
void Write(ReadOnlySpan<byte> span)
{
	for (int i = 0; i < span.Length; i++)
	{
		byte b = span[i];
		/* Do Some Work */
	}
}
#endif

I could instead translate the Span API into the byte[] API, but that contradicts the reason Span exists. It introduces additional copying and lowers performance, so the code becomes pointless.

Patched Code to support Span (Translation):

#if !NET40 && !NETSTANDARD1_0
#define SPAN_SUPPORT
#endif

void Write(byte[] buf, int offset, int count)
{
	for (int i = offset; i < offset + count; i++)
	{
		byte b = buf[i];
		/* Do Some Work */
	}
}

#if SPAN_SUPPORT
void Write(ReadOnlySpan<byte> span)
{
	// ToArray() copies the span into a new array, which defeats the purpose of Span.
	byte[] buf = span.ToArray();
	Write(buf, 0, buf.Length);
}
#endif

So, as far as I know, there is no perfect solution. We have to choose one of three options:

1. Duplicate the logic across CharsetDetector and all the Prober subclasses
2. Drop the old targets (.NET Framework 4.0, .NET Standard 1.0)
3. Give up on the Span APIs

Need to add one more target: .NET Framework 4.5

As far as I know, .NET Framework applications always prefer a library target built for .NET Framework. That means a .NET Framework 4.8 application (with Span) will pick the .NET Framework 4.0 target (without Span) rather than the .NET Standard 2.0 target (with Span). (Here, "with Span" means the API is accessible with the help of the NuGet package.)

To solve this, we have to add a .NET Framework 4.5 target to make sure that the latest .NET Framework applications have access to Span. Are you okay with adding more targets?

EDIT: Sorry for the edits; I found an error in the sample code...

ied206 · 2020-01-05T16:31:30Z

The Span API issue is quite complicated, so I suggest discussing the Span APIs in a separate issue. I think it is better to concentrate on reviewing the (byte[], int, int) overload commits in this PR.

I made several new commits, please review.

rstm-sf · 2020-01-05T17:47:00Z

src/CharsetDetector.cs

                }
            }
        }

-        private static string FindCharSetByBom(byte[] buf, int len)
+        private static string FindCharSetByBom(byte[] buf, int offset, int len)


Could you give an example of when this is necessary? As far as I know, the magic number (BOM) always comes first.

ied206 · 2020-01-06

The new overload works under one assumption: the actual data starts at buf[offset]. That is useful when the data was loaded into the middle of the byte array.
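
To illustrate the assumption that the data starts at buf[offset], here is a minimal sketch of a BOM check that looks only at the window [offset, offset + len). BomSketch and FindBom are hypothetical names used for illustration, not the library's FindCharSetByBom, and only the UTF-8 and UTF-16 BOMs are covered.

```csharp
using System;

// Hypothetical helper for illustration only; not the library's implementation.
public static class BomSketch
{
    // Returns a charset name when the window [offset, offset + len) starts
    // with a UTF-8 or UTF-16 BOM, and null otherwise.
    public static string FindBom(byte[] buf, int offset, int len)
    {
        if (buf == null) throw new ArgumentNullException(nameof(buf));
        if (offset < 0) throw new ArgumentOutOfRangeException(nameof(offset));
        if (len < 0) throw new ArgumentOutOfRangeException(nameof(len));
        if (buf.Length - offset < len) throw new ArgumentException("Invalid offset and length.");

        // Every index is relative to offset, so bytes before it are never read.
        if (len >= 3 && buf[offset] == 0xEF && buf[offset + 1] == 0xBB && buf[offset + 2] == 0xBF)
            return "utf-8";
        if (len >= 2 && buf[offset] == 0xFF && buf[offset + 1] == 0xFE)
            return "utf-16le";
        if (len >= 2 && buf[offset] == 0xFE && buf[offset + 1] == 0xFF)
            return "utf-16be";
        return null;
    }
}
```

With a buffer such as { 0x00, 0x00, 0xEF, 0xBB, 0xBF, 0x41 }, calling FindBom(buf, 2, 4) finds the UTF-8 BOM, while FindBom(buf, 0, 6) does not, because the window then starts with padding bytes.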

rstm-sf · 2020-01-05T17:54:05Z

src/CharsetDetector.cs

            {
                // other than 0xa0, if every other character is ascii, the page is ascii
-                if ((buf[i] & 0x80) != 0 && buf[i] != 0xA0)
+                if ((buf[offset + i] & 0x80) != 0 && buf[offset + i] != 0xA0)


Could you clarify why the offset needs to be added again?

ied206 · 2020-01-06

It was written under the same assumption as (new) line 371.

rstm-sf · 2020-01-06

But the index i already starts at offset.

ied206 · 2020-01-06

Oh, I made a mistake. I will patch it ASAP.

rstm-sf · 2020-01-05T17:55:12Z

src/CharsetDetector.cs

@@ -326,46 +357,46 @@ private void FindInputState(byte[] buf, int len)
                else
                {
                    if (InputState == InputState.PureASCII &&
-                        (buf[i] == 0x1B || (buf[i] == 0x7B && _lastChar == 0x7E)))
+                        (buf[offset + i] == 0x1B || (buf[offset + i] == 0x7B && _lastChar == 0x7E)))


And this offset

rstm-sf · 2020-01-05T17:55:23Z

src/CharsetDetector.cs

                    {
                        // found escape character or HZ "~{"
                        InputState = InputState.EscASCII;
                        _escCharsetProber = _escCharsetProber ?? GetNewProbers();
                    }
-                    _lastChar = buf[i];
+                    _lastChar = buf[offset + i];


And this offset

rstm-sf · 2020-01-05T18:02:34Z

> Sorry for the edits; I found an error in the sample code...

Could you add a test for this case (so that future test runs would catch it)?

rstm-sf · 2020-01-05T18:10:38Z

src/CharsetDetector.cs

+            }
+            if (bytes.Length - offset < len)
+            {
+                throw new ArgumentOutOfRangeException(nameof(len));


This exception is similar to the previous one, but the two cases are different.

ied206 · 2020-01-06

My intention was to find the most suitable exception without introducing additional strings.

The .NET Framework's FileStream.Write implementation uses ArgumentException:

if (array.Length - offset < count) throw new ArgumentException(Environment.GetResourceString("Argument_InvalidOffLen"));

Should we take that code as the benchmark and use throw new ArgumentException("Invalid offset and length")?

rstm-sf · 2020-01-07

It seems to me that ArgumentOutOfRangeException looks right here, but @304NotModified will tell you what message to write, since I am only helping with the review.

For example, here is a message in mscorlib.

Also, offset + len > bytes.Length looks better, because offset may be larger than bytes.Length.
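
The two range conditions discussed in this thread can be compared with a small sketch. RangeCheckSketch and Validate are hypothetical names, and the exception types and message are only one of the options mentioned above, not necessarily what was merged. The point it shows: once offset and len are known to be non-negative, the subtraction form also rejects an offset larger than bytes.Length, and it cannot overflow, whereas offset + len can wrap around in the default unchecked context.

```csharp
using System;

// Hypothetical validation helper for a (byte[], int, int) signature.
public static class RangeCheckSketch
{
    public static void Validate(byte[] bytes, int offset, int len)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        if (offset < 0) throw new ArgumentOutOfRangeException(nameof(offset));
        if (len < 0) throw new ArgumentOutOfRangeException(nameof(len));

        // bytes.Length - offset cannot overflow because both values are >= 0.
        // If offset > bytes.Length the difference is negative and the check fails too.
        // By contrast, offset + len may wrap to a negative number for large inputs.
        if (bytes.Length - offset < len)
            throw new ArgumentException("Invalid offset and length.");
    }
}
```

For example, Validate(new byte[4], int.MaxValue, 1) throws here, while a check written as offset + len > bytes.Length would let that call through when overflow checking is off.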

rstm-sf · 2020-01-05T18:23:35Z

It seems to me that it is not worthwhile to detect the BOM at the offset, so the changes should be different, because the Feed(byte[] buf, int offset, int len) function already searches for the BOM.

rstm-sf · 2020-01-05T18:34:31Z

> As far as I know, .NET Framework applications always prefer a library target built for .NET Framework.

Yes. We can also check this with https://nugettoolsdev.azurewebsites.net :)

> To solve this, we have to add a .NET Framework 4.5 target to make sure that the latest .NET Framework applications have access to Span. Are you okay with adding more targets?

I think that is fine, since it has already been started in another PR.

rstm-sf · 2020-01-05T18:36:06Z

> The Span API issue is quite complicated, so I suggest discussing the Span APIs in a separate issue. I think it is better to concentrate on reviewing the (byte[], int, int) overload commits in this PR.
>
> I made several new commits, please review.

/cc @304NotModified

ied206 · 2020-01-06T19:19:09Z

> It seems to me that it is not worthwhile to detect the BOM at the offset, so the changes should be different, because the Feed(byte[] buf, int offset, int len) function already searches for the BOM.

As far as I know, passing an offset to a method with a (byte[] buf, int offset, int count) signature means that the actual data starts at buffer[offset]. Bytes outside [offset, offset + count) are not taken into account. In fact, the pattern was copied from the .NET Framework's MemoryStream and FileStream implementations.

MemoryStream.Write(byte[] buffer, int offset, int count):

if ((count <= 8) && (buffer != _buffer))
{
  int byteCount = count;
  while (--byteCount >= 0)
    _buffer[_position + byteCount] = buffer[offset + byteCount];
}
else
  Buffer.InternalBlockCopy(buffer, offset, _buffer, _position, count);

FileStream.Write(byte[] buffer, int offset, int count):

if (_writePos > 0) {
	int numBytes = _bufferSize - _writePos;   // space left in buffer
	if (numBytes > 0) {
		if (numBytes > count)
			numBytes = count;
		Buffer.InternalBlockCopy(array, offset, _buffer, _writePos, numBytes);
		_writePos += numBytes;
		if (count==numBytes) return;
		offset += numBytes;
		count -= numBytes;
	}

We can see that the actual copy is done by Buffer.InternalBlockCopy, with the given offset.

However, if you still think my explanation is not enough, I will revert the code to always search for the BOM from buf[0].
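
The offset convention described above can be observed from the public API as well. The following is a minimal sketch (StreamOffsetSketch and WriteWindow are hypothetical names) that writes a window of a buffer into a MemoryStream and returns what the stream actually stored.

```csharp
using System;
using System.IO;

// Hypothetical demo of the (buffer, offset, count) convention used by Stream.Write.
public static class StreamOffsetSketch
{
    // Writes only buffer[offset .. offset + count) and returns the stored bytes.
    public static byte[] WriteWindow(byte[] buffer, int offset, int count)
    {
        using (var ms = new MemoryStream())
        {
            ms.Write(buffer, offset, count);
            return ms.ToArray();
        }
    }
}
```

Calling WriteWindow(new byte[] { 10, 20, 30, 40, 50 }, 1, 3) yields the three bytes 20, 30, 40; the bytes before the offset and after the window never reach the stream.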

ied206 · 2020-01-06T19:27:40Z

> Sorry for the edits; I found an error in the sample code...
>
> Could you add a test for this case (so that future test runs would catch it)?

The error I mentioned was in the sample code in my post, not in the committed code.
You can see it in the edit history.

Original Sample Code with Error

#if NET40 || NETSTANDARD1.0
#define SPAN_SUPPORT
#endif

Fixed Sample Code

#if !NET40 && !NETSTANDARD1_0
#define SPAN_SUPPORT
#endif

ied206 · 2020-01-06T20:01:04Z

I fixed the issue of the offset being added twice to the index base in CharsetDetector.FindInputState(). Thanks for pointing it out.
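
The double-offset mistake is easy to reproduce in isolation. Below is a minimal sketch, assuming (as noted in the review) that the loop index already starts at offset; OffsetLoopSketch and CountHighBitBytes are hypothetical names, not the actual FindInputState code.

```csharp
using System;

// Hypothetical illustration of indexing inside a (buf, offset, len) loop.
public static class OffsetLoopSketch
{
    // Counts the bytes with the high bit set inside [offset, offset + len).
    public static int CountHighBitBytes(byte[] buf, int offset, int len)
    {
        int count = 0;
        // i already starts at offset, so buf[i] is the element we want.
        // Writing buf[offset + i] here would skip ahead by offset a second time.
        for (int i = offset; i < offset + len; i++)
        {
            if ((buf[i] & 0x80) != 0)
                count++;
        }
        return count;
    }
}
```

For the buffer { 0x80, 0x41, 0xC3, 0xA9, 0x42 } with offset 1 and len 3, the window is 0x41, 0xC3, 0xA9, so the result is 2. The alternative that also works is to loop i from 0 to len and index buf[offset + i]; mixing the two styles is what produces the bug.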

rstm-sf · 2020-01-07T08:22:00Z

> As far as I know, passing an offset to a method with a (byte[] buf, int offset, int count) signature means that the actual data starts at buffer[offset].

It seems to me that you could look at it more simply: treat it as a way to go through the loop and unify the code. At least, that is how I looked at it before :)

> However, if you still think my explanation is not enough, I will revert the code to always search for the BOM from buf[0].

Thank you for the clarification! I hadn’t thought about this before.

rstm-sf · 2020-01-07T08:25:15Z

> The error I mentioned was in the sample code in my post, not in the committed code.
> You can see it in the edit history.

Thanks! I misunderstood you

ied206 · 2020-01-10T03:10:34Z

> It seems to me that you could look at it more simply: treat it as a way to go through the loop and unify the code. At least, that is how I looked at it before :)

I fixed the issue in commit 58c009. Please review.

rstm-sf · 2020-01-10T21:16:51Z

I think this PR needs a couple of changes before merging:

- indicate in the documentation that, for the DetectFromBytes(byte[], int, int) method, the BOM search starts at offset
- correct the exception message and the condition that triggers it

ied206 · 2020-01-14T05:30:37Z

> indicate in the documentation that, for the DetectFromBytes(byte[], int, int) method, the BOM search starts at offset
>
> correct the exception message and the condition that triggers it

I incorporated your advice into the code.

rstm-sf · 2020-01-14T19:57:20Z

It looks good. Thank you! :)

304NotModified · 2020-01-21T20:21:31Z

Thanks for the PR!

I'm fine with merging this one, but I have one question.

The detection of the BOM in the "middle" of a byte array could be an issue for some use cases. Would it be better to make the BOM search configurable (for example, by adding a bool parameter)?

ied206 · 2020-01-22T02:23:25Z

> The detection of the BOM in the "middle" of a byte array could be an issue for some use cases. Would it be better to make the BOM search configurable (for example, by adding a bool parameter)?

I think adding another overload with an explicit BOM search location parameter would also be a good solution.

304NotModified · 2020-01-22T05:35:51Z

👍. That's for another pull request? (After merging this one)

304NotModified · 2020-01-26T21:00:58Z

merged! Thanks!

ied206 added 2 commits January 5, 2020 02:31

Add an overload of CharsetDetector.DetectFromBytes

91da0d4

Patch IsStartsWithBom to support offset parameter

3ee97d7

Remove System.Memory reference

b5f5466

ied206 added 2 commits January 5, 2020 19:24

Patch CharsetDetector.FindInputState

8a798b9

Patch CharsetDetector.FindInputState (2)

876abcb

rstm-sf reviewed Jan 5, 2020

Fix duplicated offset from the index base

58c0092

304NotModified self-assigned this Jan 10, 2020

Doc/Msg improvement for DetectFromBytes overload

1ecd40f

Added BOM offset info to the docs of DetectFromBytes(byte[], int, int). Improved the exception message of DetectFromBytes(byte[], int, int).

rstm-sf approved these changes Jan 19, 2020

304NotModified approved these changes Jan 21, 2020

304NotModified added this to the 2.3 milestone Jan 26, 2020

304NotModified merged commit f1aa5fd into CharsetDetector:master Jan 26, 2020

304NotModified modified the milestones: 2.3, 2.4 Feb 4, 2020
