Lex UTF-8 text as a byte stream #38

rdipardo · 2022-09-15T04:16:46Z

Execute this Registry script on Windows 10/11, version 1903 or later ¹:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"

Launch Notepad++ with CSVLint installed
Open a CSV file containing international characters
Notice that fields containing international characters are segmented incorrectly.

The issue here is that .NET strings are UTF-16; i.e., every `char` is a 16-bit code unit ².

By contrast, UTF-8 uses variable byte lengths: 1 byte when that's enough (code points ≤ 0x7F); 2 to 4 bytes for higher code points:

| character | UTF-8 representation (hex) | # of bytes | # of .NET `char`s |
| --- | --- | --- | --- |
| A | 41 | 1 | 1 |
| ö | C3 B6 | 2 | 1 |
| 你 | E4 BD A0 | 3 | 1 |
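
The counts in the table above can be checked directly. This is only a minimal sketch: the `Utf8Widths` class and its `Describe` helper are names made up for illustration, and the fourth input (U+1F600) is not from the table. It is included to show that a code point outside the Basic Multilingual Plane takes two `char`s as well as four bytes.

```csharp
using System;
using System.Text;

static class Utf8Widths
{
    // Hypothetical helper: report the UTF-8 byte count and the UTF-16
    // char count of a string holding a single code point.
    public static string Describe(string s)
    {
        int codePoint = char.ConvertToUtf32(s, 0);
        return string.Format("U+{0:X4}: {1} byte(s), {2} char(s)",
            codePoint, Encoding.UTF8.GetByteCount(s), s.Length);
    }

    static void Main()
    {
        // A, ö, 你, and an emoji written as escapes so the source
        // file's own encoding cannot affect the result.
        foreach (string s in new[] { "A", "\u00F6", "\u4F60", "\U0001F600" })
        {
            Console.WriteLine(Describe(s));
        }
    }
}
```

Run as a console program, this should print `U+0041: 1 byte(s), 1 char(s)`, `U+00F6: 2 byte(s), 1 char(s)`, `U+4F60: 3 byte(s), 1 char(s)` and `U+1F600: 4 byte(s), 2 char(s)`.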

Given a string like "Aö你", the character count will differ from the byte count:

> csi

Microsoft (R) Visual C# Interactive Compiler version 4.3.0-3.22423.10
Copyright (C) Microsoft Corporation. All rights reserved.

Type "#help" for more information.
> var s = "Aö你";
> s.Length
3
> System.Text.Encoding.UTF8.GetBytes(s).Length
6

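When we iterate over this string one `char` at a time, each step advances by one 16-bit code unit.
In UTF-8 text, that same step can cover up to 3 bytes, so `char` indices stop matching byte offsets after the first non-ASCII character and fields end up segmented in the wrong places.
To keep the offsets aligned, we can iterate over the bytes instead, which is what this patch implements.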
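
To illustrate the difference, here is a rough sketch of byte-wise scanning. It is not the code from this patch: `ByteScan` and `SeparatorOffsets` are hypothetical names, and it assumes the editor addresses the document by byte offset into a UTF-8 buffer. It compares the separator positions found in the bytes with the `char` indices of the same separators.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ByteScan
{
    // Hypothetical helper: byte offsets of every separator in UTF-8 text.
    public static List<int> SeparatorOffsets(byte[] utf8, byte separator)
    {
        var offsets = new List<int>();
        for (int i = 0; i < utf8.Length; i++)
        {
            // Every byte of a multi-byte UTF-8 sequence is >= 0x80,
            // so an ASCII separator can never match inside one.
            if (utf8[i] == separator) offsets.Add(i);
        }
        return offsets;
    }

    static void Main()
    {
        string line = "\u00F6,\u4F60,A";              // "ö,你,A"
        byte[] bytes = Encoding.UTF8.GetBytes(line);  // C3 B6 2C E4 BD A0 2C 41

        // Byte offsets of the two commas.
        Console.WriteLine(string.Join(" ", SeparatorOffsets(bytes, (byte)',')));
        // char indices of the same two commas.
        Console.WriteLine(line.IndexOf(',') + " " + line.LastIndexOf(','));
    }
}
```

For this input the first line printed should be `2 6` and the second `1 3`: the same two commas sit at different positions depending on whether we count bytes or `char`s.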

Fixes https://community.notepad-plus-plus.org/topic/23471/custom-lexer-and-unicode-utf-8-text-file-content

BdR76 · 2022-09-16T11:38:04Z

Thanks for the PR, looks very useful.

Lex UTF-8 text as a byte stream

60709b3

BdR76 merged commit 6a1360c into BdR76:master Sep 16, 2022
