Lex UTF-8 text as a byte stream #38

Merged
merged 1 commit into from Sep 16, 2022
rdipardo
Contributor

  1. Execute this Registry script on Windows 10/11, version 1903 or later [1]:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"
  2. Launch Notepad++ with CSVLint installed
  3. Open a CSV file containing international characters
  4. Notice that fields with international characters are wrongly segmented, e.g.,

[Image: Custom lexer and Unicode UTF-8 text file content]

The issue here is that .NET strings are UTF-16; i.e., every char represents a 16-bit code unit [2].

By contrast, UTF-8 is a variable-length encoding: 1 byte when that's enough (code points up to 0x7F); 2 to 4 bytes for higher code points:

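| character | UTF-8 representation | # of bytes | # of .NET chars |
|-----------|----------------------|------------|-----------------|
| A         | 41                   | 1          | 1               |
| ö         | C3 B6                | 2          | 1               |
| 你        | E4 BD A0             | 3          | 1               |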

Given a string like "Aö你", the character count will differ from the byte count:

> csi

Microsoft (R) Visual C# Interactive Compiler version 4.3.0-3.22423.10
Copyright (C) Microsoft Corporation. All rights reserved.

Type "#help" for more information.
> var s = "Aö你";
> s.Length
3
> System.Text.Encoding.Default.GetBytes(s).Length
4
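
One caveat about the session above: `Encoding.Default` is environment-dependent. It is the system ANSI code page on .NET Framework and UTF-8 on .NET Core / .NET 5+, so the byte count it reports can vary between machines. As a minimal sketch (not part of the patch), asking for `Encoding.UTF8` explicitly gives the counts from the table, 1 + 2 + 3 = 6 bytes. The string is written with `\u` escapes only so the result does not depend on how the source file or console is encoded.

```csharp
using System;
using System.Text;

// "Aö你" written with escapes so the literal survives any source/console encoding.
string s = "A\u00F6\u4F60";

int charCount = s.Length;                           // UTF-16 code units (.NET chars)
int utf8ByteCount = Encoding.UTF8.GetByteCount(s);  // bytes when encoded as UTF-8

Console.WriteLine($"chars: {charCount}, UTF-8 bytes: {utf8ByteCount}");
// prints "chars: 3, UTF-8 bytes: 6"

// Per-character breakdown, matching the rows of the table above.
foreach (char c in s)
{
    byte[] bytes = Encoding.UTF8.GetBytes(c.ToString());
    Console.WriteLine($"U+{(int)c:X4} -> {BitConverter.ToString(bytes)} ({bytes.Length} byte(s))");
}
```

Iterating with `foreach (char c in s)` is fine here because all three characters fit in a single UTF-16 code unit; characters outside the Basic Multilingual Plane would need surrogate-pair handling.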

When we iterate this string one char at a time, each step advances one 16-bit code unit, so char offsets stop matching byte offsets as soon as a multi-byte character appears.
Given UTF-8 text, we need to iterate in 8-bit units, or field boundaries will land in the wrong place.
To do that, we can iterate by bytes, which this patch implements.
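
To illustrate the mismatch, here is a rough sketch. It is not the code from this patch, and it assumes the editor reports positions as byte offsets into a UTF-8 buffer. The helpers `FieldStartsInBytes` and `FieldStartsInChars` are hypothetical names made up for this example; they record where each field starts when scanning for an ASCII separator by byte and by char respectively.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: byte offsets at which each field starts in a UTF-8 buffer.
List<int> FieldStartsInBytes(byte[] utf8, byte separator)
{
    var starts = new List<int> { 0 };
    for (int i = 0; i < utf8.Length; i++)
    {
        // Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence,
        // so comparing raw bytes against an ASCII separator is safe.
        if (utf8[i] == separator) starts.Add(i + 1);
    }
    return starts;
}

// Hypothetical helper: the same scan over UTF-16 chars, for comparison.
List<int> FieldStartsInChars(string text, char separator)
{
    var starts = new List<int> { 0 };
    for (int i = 0; i < text.Length; i++)
    {
        if (text[i] == separator) starts.Add(i + 1);
    }
    return starts;
}

string line = "A\u00F6,\u4F60,B";               // "Aö,你,B"
byte[] buffer = Encoding.UTF8.GetBytes(line);

List<int> byteStarts = FieldStartsInBytes(buffer, (byte)',');
List<int> charStarts = FieldStartsInChars(line, ',');

Console.WriteLine("byte offsets: " + string.Join(", ", byteStarts)); // byte offsets: 0, 4, 8
Console.WriteLine("char offsets: " + string.Join(", ", charStarts)); // char offsets: 0, 3, 5
```

The two lists agree only up to the first non-ASCII character. If char offsets are used where byte offsets are expected, every field after "ö" is positioned too far to the left.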

Fixes https://community.notepad-plus-plus.org/topic/23471/custom-lexer-and-unicode-utf-8-text-file-content

Footnotes

  1. https://docs.microsoft.com/en-us/answers/questions/587680/where-can-i-find-34beta-use-unicode-utf-8-for-worl.html

  2. https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction#the-string-and-char-types

Owner

BdR76 commented Sep 16, 2022

Thanks for the PR, looks very useful.

BdR76 merged commit 6a1360c into BdR76:master on Sep 16, 2022