Highlight error when Windows is using an East Asian "ANSI" code page #50

myonlylonely · 2023-02-22T14:13:28Z

CSV data for testing:

测试123,测试2,测试,测试填充列
0,1,2,3
4,5,6,7

BdR76 · 2023-02-22T15:23:41Z

Thanks for the issue report, but I can't reproduce this display error. When I open the data, the syntax highlighting looks good. This issue seems similar to issue #47.

Are you using Windows 10 or Windows 11? And which version of the plug-in are you using?

myonlylonely · 2023-02-22T15:39:42Z

I'm using Windows 10 21H2 (64-bit), Notepad++ 8.4.8 (64-bit), CSVLint 0.4.6.2 (installed from Notepad++ plugin management).

myonlylonely · 2023-02-22T15:54:58Z

I found that this issue only occurs with UTF-8 encoding; ANSI encoding works fine.
ANSI encoding works fine:

UTF8 encoding does not work:

myonlylonely · 2023-02-22T15:58:57Z

The file I used is attached:
UTF8 encoding does not work:
test-UTF8.csv
ANSI encoding works fine:
test-ANSI.csv

Friedi · 2023-02-24T10:41:55Z

I'm using Windows 10 (64-bit), Notepad++ 8.4.9 (64-bit), CSVLint 0.4.6.3beta. For me, highlighting with your files works fine. They look the same.
You can try his new beta: #46 (comment)

@BdR76 if there are no further reports against 0.4.6.3beta, you can publish it and close my findings (or am I supposed to do that?)

myonlylonely · 2023-02-24T12:17:53Z

The new version still does not work.

Friedi · 2023-02-24T13:33:07Z

Have you tried a clean Notepad++ (you can use the zipped one without installing) and without additional plugins? Maybe other plugins or settings interfere. I have no other explanation; it works for me with your samples.

myonlylonely · 2023-02-24T16:41:38Z

Yes, I tried a clean portable (zip) version of Notepad++ with CSVLint 0.4.6.3beta; it still doesn't work as expected.

rdipardo · 2023-03-23T04:57:54Z

@myonlylonely, can you check if your Windows system is using an East Asian "ANSI" code page, i.e., one of these?

932 (Japanese Shift-JIS)
936 (Simplified Chinese GBK)
949 (Korean Unified Hangul Code)
950 (Traditional Chinese Big5)
1361 (Korean Johab)

In Notepad++, go to ? on the toolbar, then "Debug Info" and look at "Current ANSI codepage".

Or open the Command Prompt and check the Registry. For example, if the system code page is "Simplified Chinese GBK", it will look something like this:

> reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
    AllowDeprecatedCP    REG_DWORD    0x42414421
    ACP    REG_SZ    936
    OEMCP    REG_SZ    437
    MACCP    REG_SZ    936

End of search: 4 match(es) found.
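
For anyone who would rather check this from a script, here is a rough sketch in Python (not part of the plugin). It assumes Python 3 with `ctypes`; `GetACP` is the same Win32 call the lexer uses, and the helper names `read_system_acp` and `describe_acp` are made up for this example. Off Windows it simply reports that no ANSI code page is available.

```python
import sys

# The East Asian "ANSI" code pages listed above.
EAST_ASIAN_ACPS = {
    932: "Japanese Shift-JIS",
    936: "Simplified Chinese GBK",
    949: "Korean Unified Hangul Code",
    950: "Traditional Chinese Big5",
    1361: "Korean Johab",
}


def read_system_acp():
    """Return the Windows ANSI code page, or None when not running on Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported here so the module still loads elsewhere
    return int(ctypes.windll.kernel32.GetACP())


def describe_acp(code_page):
    """Return the name of an East Asian code page from the list, else None."""
    return EAST_ASIAN_ACPS.get(code_page)


if __name__ == "__main__":
    acp = read_system_acp()
    if acp is None:
        print("Not on Windows; no ANSI code page to report.")
    else:
        name = describe_acp(acp)
        print(f"ACP = {acp}", f"({name})" if name else "(not in the East Asian list)")
```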

I think the reason other people on this thread can't reproduce the issue is that their Windows systems are using a Western European code page (1252) or the new UTF-8 code page (65001).

@BdR76, if my guess is right, the problem comes from what I did here:

CSVLint/CSVLintNppPlugin/PluginInfrastructure/Lexer.cs

Line 1009 in a0fd0fc

    
           return (Win32.GetACP() == 65001U) ? Encoding.UTF8.GetBytes(byteBuf) : Encoding.Default.GetBytes(byteBuf);

The lexer falls back to System.Text.Encoding.Default if the OS is not using UTF-8. It never looks at the document's encoding, i.e., never calls PluginBase.CurrentScintillaGateway.GetCodePage().

Apparently you can have a situation where the OS is using a (fixed-width) DBCS code page, but Notepad++ displays the text as multi-byte UTF-8, so the lexer will still skip over bytes as it did before 60709b3.
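
To picture the mismatch, here is a small Python sketch. It does not reproduce the lexer's C# code; it only encodes the header row of the test data twice, once as UTF-8 (what the editor buffer would hold) and once as code page 936 (what an encoding chosen from the system's ANSI code page would yield), and compares where the commas land. The function name is hypothetical.

```python
from typing import List

HEADER = "测试123,测试2,测试,测试填充列"


def comma_byte_offsets(text: str, encoding: str) -> List[int]:
    """Return the byte offset of every comma after encoding `text`.

    Scanning for the raw byte 0x2C is safe for the two encodings used
    here, because neither uses 0x2C inside a multi-byte character.
    """
    data = text.encode(encoding)
    return [i for i, b in enumerate(data) if b == 0x2C]


if __name__ == "__main__":
    # Offsets in the UTF-8 buffer versus offsets computed via cp936.
    print("utf-8:", comma_byte_offsets(HEADER, "utf-8"))
    print("cp936:", comma_byte_offsets(HEADER, "cp936"))
```

The two lists differ from the first column onward, so any column boundary computed with one encoding and applied to a buffer in the other would land in the wrong place.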

Hard to test this without a spare machine to experiment with the OS encoding . . . 🤔

myonlylonely · 2023-03-23T05:28:13Z

Your guess is right. The debug info:

Current ANSI codepage : 936

It is the default encoding in this language version of Windows.

rdipardo · 2023-03-30T05:04:57Z

Thank you. I can reproduce the bad colorization on the English version of Windows 10 just by setting the "ACP" Registry key to "936". I would also guess that issue #47 has the same root cause.

rdipardo · 2023-04-01T23:52:34Z

Re. the description East Asian "ANSI" code page

My older post needs a slight correction.

"Simplified Chinese GBK" and friends are known as Double Byte Character Sets or DBCS code pages.

The distinction is crucial because the lexer parses files 1 byte at a time. See, for example, how the "next column" style starts between the CR and LF of a Windows-style EOL:

Parsing a byte stream works fine with single-byte OS code pages (e.g. Windows-1252) and, crucially, variable-length multi-byte code pages like UTF-8, which was broken before 60709b3.

Fixing this issue would mean changing the lexer to sometimes parse the file in double-byte mode (when the OS uses a DBCS code page), while still using the current byte-by-byte parsing the rest of the time, i.e., 60709b3 was a good improvement; it just didn't anticipate DBCS code pages, which obviously need special treatment.

Scintilla has built-in support for DBCS code pages; it's not clear how this could benefit the plugin's lexer. Changing the document's properties really isn't a lexer's job, and Notepad++ would just override them as soon as the file was reloaded or saved. It's more likely the lexing algorithm needs to be more adaptable to fixed-width double-byte characters.

On the other hand, Windows introduced the UTF-8 code page in version 1903 to signal that DBCS are "legacy" code pages. The plugin's documentation could just say "DBCS code pages are not supported — use UTF-8 instead". A notice like that could go into a pinned "meta" issue to collect duplicate bug reports like #47.

myonlylonely · 2023-04-02T00:29:40Z

Does that mean this issue will never be fixed？

rdipardo · 2023-04-02T03:10:46Z

It means that supporting DBCS code pages is more of a missing feature than a "bug". When someone has figured out how to do it, then it will be fixed. That probably won't be anytime soon. Unfortunately, this plugin targets a very old .NET Framework version that's poorly suited for interacting with low-level C++ libraries like Scintilla.

myonlylonely · 2023-04-02T05:17:08Z

Thank you for the detailed explanation.
I guess I have to use VSCode or WebStorm with rainbow plugins which provide similar highlight features but have no problem dealing with DBCS.

rdipardo · 2023-04-02T09:56:32Z

@BdR76, for reference, this article explains what the lexer should be doing with DBCS-encoded text:

To interpret a DBCS string, an application must start at the beginning of the string and scan forward. It keeps track when it encounters a lead byte in the string, and treats the next byte as the trailing part of the same character. [...] The application cannot just back up one byte to see if the preceding byte is a lead byte, as that byte value might be eligible to be used as both a lead byte and a trail byte. [...] In other words, substring searches are much more complicated with a DBCS than with either SBCSs [Single Byte Character Sets] or Unicode.
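
The forward scan described in that quote can be sketched in Python as follows. This is an illustration only, not the plugin's code. The lead-byte range 0x81-0xFE is an assumption that applies to code page 936; other DBCS code pages use different ranges, and a real implementation would presumably ask the OS or Scintilla instead. Both function names are hypothetical.

```python
from typing import Callable, List


def is_cp936_lead_byte(b: int) -> bool:
    # Assumption: in code page 936 (GBK), lead bytes span 0x81-0xFE.
    return 0x81 <= b <= 0xFE


def segment_dbcs(data: bytes,
                 is_lead_byte: Callable[[int], bool] = is_cp936_lead_byte) -> List[bytes]:
    """Split DBCS bytes into characters by scanning forward from the start."""
    chunks = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            # Lead byte: the next byte is the trail byte of the same character.
            chunks.append(data[i:i + 2])
            i += 2
        else:
            # Single-byte character (or a dangling lead byte at the very end).
            chunks.append(data[i:i + 1])
            i += 1
    return chunks


if __name__ == "__main__":
    for chunk in segment_dbcs("测试123,".encode("cp936")):
        print(chunk.hex(" "), "->", chunk.decode("cp936", errors="replace"))
```

Because the scan always starts at the beginning and consumes the trail byte together with its lead byte, a trail byte is never tested on its own, which is the pitfall the article warns about.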

BdR76 · 2023-04-09T13:51:06Z

@rdipardo has submitted a fix for this issue and I've rebuilt the DLL, so @myonlylonely, can you verify that it works now?

You can download the latest development build of the DLL (either x86 or x64), place the DLL in your .\Program Files\Notepad++\plugins\CSVLint\ folder, then restart Notepad++ to test it.

myonlylonely · 2023-04-09T14:02:14Z

@BdR76 Unfortunately, it still doesn't work. The result remains the same.
UTF-8 encoded files don't work.

ANSI encoded files work.

rdipardo · 2023-04-09T21:34:41Z

That is to be expected. The OP in #52 most likely has a PC set to Windows 1252, the default ANSI code page for PCs in English and European locales.

To recap, 1252 is a single-byte encoding; that includes even the high ordinals where European vowels are mapped:

> python3 -c "print('é'.encode('cp1252'))"

b'\xe9'

East Asian ANSI code pages like 936 are double-byte:

> python3 -c "print('é'.encode('936'))"

b'\xa8\xa6'

This will continue to be an issue until the lexer knows how to properly segment double-byte characters, which will probably involve some usage of Scintilla's IsDBCSLeadByte API method, or a Win32 equivalent such as IsDBCSLeadByteEx.

rdipardo · 2023-04-09T21:37:23Z

@myonlylonely, have you tried the Notepad2 editor? It's got its own CSV lexer built in:

myonlylonely · 2023-04-10T01:59:23Z

Yes, the same file works on Notepad2.

rdipardo · 2023-04-10T20:23:17Z

The screen capture shows test-ANSI.csv, but the problem is with the UTF-8 file. Does Notepad2 get the columns right when the data is saved in UTF-8 format?

myonlylonely · 2023-04-11T06:06:54Z

Yes, the UTF8 encoding file also works on Notepad2.

BdR76 · 2023-04-15T21:18:39Z

Does that mean this issue will never be fixed？

I can reproduce the error and I've been trying to fix this, but it's not as easy as I thought. I've been delaying the next release of the plug-in, hoping it could include a fix for this issue. In the meantime there have also been a lot of other updates and bug fixes, so maybe I'll make a new release anyway.

Just know that I want to fix this issue. I've also posted a question on the Notepad++ dev community; hopefully that will lead to some new insights.

BdR76 · 2023-04-15T22:41:04Z

@myonlylonely I think I found the fix to make the CSV Lint plug-in work correctly regardless of the OS language settings (code page 936 etc), can you try the development DLL again?

You can download the latest development build of the DLL (either x86 or x64) which has version 4.6.3β6

rdipardo · 2023-04-16T04:45:43Z

I've tested and it solves the issue, with no regressions from before 60709b3 that I can find. 👍🏼

Here's what I missed.

Notepad++ encodes the UTF-8 file as UTF-8, regardless of the system's ANSI code page.
The trick was to follow the editor's encoding, not the system's — i.e., use SCI_GETCODEPAGE, not GetACP() to determine the buffer's encoding.
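
As a rough sketch of that idea in Python (again, not the plugin's C#): pick the codec from the code page the editor reports for the buffer, and never consult the system's ANSI code page. The value 65001 is Scintilla's `SC_CP_UTF8`; the helper names and the mapping of other code pages to Python's `cpNNN` codec names are assumptions made for this example.

```python
SC_CP_UTF8 = 65001  # the value Scintilla reports for a UTF-8 buffer


def codec_for_code_page(code_page: int) -> str:
    """Map a document code page to a Python codec name.

    Assumption: any non-UTF-8 code page has a Python codec named cp<N>,
    which holds for 932, 936, 949, 950, 1252 and 1361.
    """
    return "utf-8" if code_page == SC_CP_UTF8 else f"cp{code_page}"


def encode_for_buffer(text: str, document_code_page: int) -> bytes:
    """Encode text the way the editor buffer stores it."""
    return text.encode(codec_for_code_page(document_code_page))


if __name__ == "__main__":
    sample = "测试123"
    # Same text, two buffer encodings, two different byte lengths.
    print(len(encode_for_buffer(sample, SC_CP_UTF8)))
    print(len(encode_for_buffer(sample, 936)))
```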

You can see how starkly different the UTF-8 encoding is from DBCS by running this C# script:

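// dbcs_encode.csx
// Dumps the bytes of the test CSV header as UTF-8 and as a DBCS code page.
using System;
using System.Linq; // needed for Select()
using System.Text;
using static System.Console;

int cp = 936;
// Note: the three leading chars are U+00EF, U+00BB and U+00BF, not a real BOM (U+FEFF).
// They come out as C3 AF C2 BB C2 BF in UTF-8 and as 3F ('?') in code page 936.
var s = $"{(char)0xef}{(char)0xbb}{(char)0xbf}测试123,测试2,测试,测试填充列\r\n";
var dbcs = Encoding.GetEncoding(cp);
var utf8Bytes = Encoding.UTF8.GetBytes(s);
var dbcsBytes = dbcs.GetBytes(s);

// Print CR and LF by name; every other byte as two hex digits.
Func<byte, string> byteToString = b => {
    if (b == 0xd) return "CR";
    else if (b == 0xa) return "LF";
    else return $"{b:X2}";
};

var asUTF8Bytes = String.Join(" ", utf8Bytes.Select(b => byteToString(b)));
var asDBCSBytes = String.Join(" ", dbcsBytes.Select(b => byteToString(b)));

WriteLine();
WriteLine("As UTF-8:");
WriteLine(asUTF8Bytes);
WriteLine();
WriteLine($"As {dbcs.EncodingName}:");
WriteLine(asDBCSBytes);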

> csi dbcs_encode.csx

As UTF-8:
C3 AF C2 BB C2 BF E6 B5 8B E8 AF 95 31 32 33 2C E6 B5 8B E8 AF 95 32 2C E6 B5 8B E8 AF 95 2C E6 B5 8B E8 AF 95 E5 A1 AB E5 85 85 E5 88 97 CR LF

As Chinese Simplified (GB2312):
3F 3F 3F B2 E2 CA D4 31 32 33 2C B2 E2 CA D4 32 2C B2 E2 CA D4 2C B2 E2 CA D4 CC EE B3 E4 C1 D0 CR LF

When — and only when — the file is saved as ANSI, the editor's encoding will match the system's and use the DBCS encoding.

This is true even if we call SCI_GETCODEPAGE. For some reason, it falls back to the same value as GetACP() whenever the buffer is not UTF-8. Even if the status bar says "ANSI", SCI_GETCODEPAGE will say 936, if that's what the system is using.

By the same token, if the system is using the new UTF-8 code page, SCI_GETCODEPAGE returns 65001 even when the file is really saved as ANSI.

This was always the case, though; it may be a separate issue, but it's not a regression.

myonlylonely · 2023-04-16T06:21:53Z

That's great! 👍🏼 I have confirmed that the development build works great!

BdR76 · 2023-04-16T09:22:58Z

@rdipardo Thanks for clarifying, text encoding can be a tricky subject. I think the plug-in still has an issue with converting ANSI files in some cases (when sorting, reformatting, etc.), but I'm glad the syntax highlighting with code pages is fixed now.

@myonlylonely Thanks for confirming, I'll prepare the new release of the plug-in.

BdR76 added the bug Something isn't working label Mar 4, 2023

myonlylonely changed the title ~~Highlight error when unicode column exists~~ Highlight error when Windows is using an East Asian "ANSI" code page Apr 1, 2023

BdR76 added a commit that referenced this issue Apr 9, 2023

Rebuild DLL to test encoding issues

801ad4d

Rebuild DLL to test ANSI encoding issues and Reformat function, this should fix issues #50 and #52

BdR76 added a commit that referenced this issue Apr 15, 2023

Fix for syntax highlighting issue #50

d6a1fbb

Fix for syntax highlighting issue #50 do not needlessly convert from char array to string and back to char array

myonlylonely closed this as completed Apr 21, 2023

This issue was closed.
