Highlight error when Windows is using an East Asian "ANSI" code page #50
Thanks for the issue report, but I can't reproduce this display error; when I open the data, the syntax highlighting looks good. This issue seems similar to issue #47. Are you using Windows 10 or Windows 11? And which version of the plug-in are you using? |
I'm using Windows 10 21H2 (64bit), Notepad++ 8.4.8 (64bit), CSVLint 0.4.6.2 (installed from Notepad++ plugin management). |
The file I used is attached: |
I'm using Windows 10 (64bit), Notepad++ 8.4.9 (64bit), CSVLint 0.4.6.3beta. For me, highlighting with your files works fine. They look the same. @BdR76, if there are no further reports for the 0.4.6.3beta, you can publish it and close my findings (or am I supposed to do that?) |
@myonlylonely, can you check if your Windows system is using an East Asian "ANSI" code page, i.e., one of these?
In Notepad++, go to ? on the toolbar, then "Debug Info" and look at "Current ANSI codepage". Or open the Command Prompt and check the Registry. For example, if the system code page is "Simplified Chinese GBK", it will look something like this:
> reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
AllowDeprecatedCP REG_DWORD 0x42414421
ACP REG_SZ 936
OEMCP REG_SZ 437
MACCP REG_SZ 936
End of search: 4 match(es) found.
I think the reason other people on this thread can't reproduce the issue is that their Windows systems are using a Western European code page (1252) or the new UTF-8 code page (65001). @BdR76, if my guess is right, the problem comes from what I did here:
The lexer falls back to Apparently you can have a situation where the OS is using a (fixed-width) DBCS code page, but Notepad++ displays the text as multi-byte UTF-8, so the lexer will still skip over bytes as it did before 60709b3. Hard to test this without a spare machine to experiment with the OS encoding . . . 🤔 |
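For anyone who wants to script the check above, here is a minimal Python sketch. It assumes the `reg query` output has the shape shown earlier in this comment; `parse_acp` and `describe_code_page` are hypothetical helper names, and the table lists only the four East Asian ANSI code pages commonly described as DBCS.

```python
import re

# Assumption: these are the East Asian "ANSI" code pages usually classed as DBCS.
DBCS_CODE_PAGES = {
    932: "Japanese (Shift-JIS)",
    936: "Simplified Chinese (GBK)",
    949: "Korean",
    950: "Traditional Chinese (Big5)",
}


def parse_acp(reg_output):
    """Return the ACP value found in `reg query` output, or None if absent."""
    # Anchoring at line start keeps "MACCP" and "OEMCP" from matching.
    match = re.search(r"^\s*ACP\s+REG_SZ\s+(\d+)\s*$", reg_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def describe_code_page(cp):
    """Classify a Windows code page number for the purposes of this thread."""
    if cp in DBCS_CODE_PAGES:
        return "DBCS: " + DBCS_CODE_PAGES[cp]
    if cp == 65001:
        return "UTF-8"
    return "single-byte or other"


SAMPLE = """\
HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control\\Nls\\CodePage
    AllowDeprecatedCP REG_DWORD 0x42414421
    ACP REG_SZ 936
    OEMCP REG_SZ 437
    MACCP REG_SZ 936
"""

if __name__ == "__main__":
    acp = parse_acp(SAMPLE)
    print(acp, describe_code_page(acp))
```

With the sample output above, the ACP is read as 936 and reported as a DBCS code page, while 1252 or 65001 would be reported as not DBCS.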
Your guess is right. The debug info:
It is the default encoding in this language version of Windows. |
Thank you. I can reproduce the bad colorization on the English version of Windows 10 just by setting the "ACP" Registry key to "936". I would also guess that issue #47 has the same root cause. |
Re. the description
My older post needs a slight correction. "Simplified Chinese GBK" and friends are known as Double Byte Character Sets or DBCS code pages. The distinction is crucial because the lexer parses files 1 byte at a time. See, for example, how the "next column" style starts between the
Parsing a byte stream works fine with single-byte OS code pages (e.g. Windows-1252) and, crucially, variable-length multi-byte code pages like UTF-8, which was broken before 60709b3. Fixing this issue would mean changing the lexer to sometimes parse the file in double-byte mode (when the OS uses a DBCS code page), while still using the current byte-by-byte parsing the rest of the time, i.e., 60709b3 was a good improvement; it just didn't anticipate DBCS code pages, which obviously need special treatment.
Scintilla has built-in support for DBCS code pages; it's not clear how this could benefit the plugin's lexer. Changing the document's properties really isn't a lexer's job, and Notepad++ would just override them as soon as the file was reloaded or saved. It's more likely the lexing algorithm needs to be more adaptable to fixed-width double-byte characters.
On the other hand, Windows introduced the UTF-8 code page in version 1903 to signal that DBCS are "legacy" code pages. The plugin's documentation could just say "DBCS code pages are not supported; use UTF-8 instead". A notice like that could go into a pinned "meta" issue to collect duplicate bug reports like #47. |
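To illustrate one general hazard of reading DBCS text a byte at a time, here is a small Python sketch. It is not a diagnosis of this particular bug and not the plugin's code; `ascii_trail_bytes` is a hypothetical helper. It collects ASCII-range values that show up as a non-first byte of an encoded CJK character: in UTF-8 there are none, while in code page 936 the second byte of a character can look like an ASCII letter or symbol.

```python
def ascii_trail_bytes(encoding):
    """Collect ASCII-range values seen as a non-first byte of an encoded CJK character."""
    found = set()
    # Basic CJK Unified Ideographs block, U+4E00 through U+9FA5.
    for code_point in range(0x4E00, 0x9FA6):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        found.update(b for b in encoded[1:] if b < 0x80)
    return found


if __name__ == "__main__":
    for name in ("utf-8", "cp936"):
        values = ascii_trail_bytes(name)
        print(name, len(values), "ASCII-range trail byte values")
```

A byte-wise scanner therefore cannot tell, from one byte alone, whether an ASCII-looking value in code page 936 text is a real character or the second half of a double-byte one. The comma (0x2C) is not among the possible second bytes, so this sketch alone does not explain the comma-delimited case in this report.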
Does that mean this issue will never be fixed? |
It means that supporting DBCS code pages is more of a missing feature than a "bug". When someone has figured out how to do it, then it will be fixed. That probably won't be anytime soon. Unfortunately this plugin targets a very old .NET Framework version that's poorly suited for interacting with low-level C++ libraries like Scintilla. |
Thank you for the detailed explanation. |
@BdR76, for reference, this article explains what the lexer should be doing with DBCS-encoded text:
|
@rdipardo has submitted a fix for this issue and I've rebuilt the DLL, so @myonlylonely, can you verify that it works now? You can download the latest development build of the DLL (either x86 or x64), place the DLL in your |
@BdR76 Unfortunately, it still doesn't work. The result remains the same. |
That is to be expected. The OP in #52 most likely has a PC set to Windows 1252, the default ANSI code page for PCs in English and European locales. To recap,
> python3 -c "print('é'.encode('cp1252'))"
b'\xe9'
East Asian ANSI code pages like
> python3 -c "print('é'.encode('936'))"
b'\xa8\xa6'
This will continue to be an issue until the lexer knows how to properly segment double-byte characters, which will probably involve some usage of Scintilla's |
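As a rough picture of what "segmenting double-byte characters" means, here is a minimal pure-Python sketch. It does not use Scintilla and is not the plugin's lexer; `segment_dbcs` is a hypothetical name, and the lead-byte range 0x81 to 0xFE is an assumption that fits code page 936 (other DBCS code pages use different ranges).

```python
def segment_dbcs(data, lead_range=(0x81, 0xFE)):
    """Split a DBCS byte string into 1-byte and 2-byte character chunks."""
    low, high = lead_range
    chunks = []
    i = 0
    while i < len(data):
        is_lead = low <= data[i] <= high
        # A lead byte takes the following byte with it; a dangling lead byte
        # at the very end of the data is emitted on its own.
        step = 2 if is_lead and i + 1 < len(data) else 1
        chunks.append(data[i:i + step])
        i += step
    return chunks


if __name__ == "__main__":
    print(segment_dbcs("é".encode("cp936")))
    print(segment_dbcs("测试123,测试2".encode("cp936")))
```

Each chunk corresponds to exactly one character, so a delimiter check applied per chunk never looks inside a double-byte character.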
@myonlylonely, have you tried the Notepad2 editor? It's got its own CSV lexer built in: |
Yes, the same file works on Notepad2. |
The screen capture shows |
I can reproduce the error and I've been trying to fix this, but it's not as easy as I thought. I've been delaying the next release of the plug-in, hoping it could include a fix for this issue. In the meantime there have also been a lot of other updates and bugfixes, so maybe I'll make a new release anyway. Just know that I want to fix this issue. I've also posted a question on the Notepad++ dev community; hopefully that will lead to some new insights. |
Fix for syntax highlighting issue #50: do not needlessly convert from char array to string and back to char array
@myonlylonely I think I found the fix to make the CSV Lint plug-in work correctly regardless of the OS language settings (code page 936 etc), can you try the development DLL again? You can download the latest development build of the DLL (either x86 or x64) which has version |
I've tested and it solves the issue, with no regressions from before 60709b3 that I can find. 👍🏼 Here's what I missed. Notepad++ encodes the UTF-8 file as UTF-8, regardless of the system's ANSI code page. You can see how starkly different the UTF-8 encoding is from DBCS by running this C# script:
// dbcs_encode.csx
using System;
using System.Linq; // Select() below comes from LINQ
using System.Text;
using static System.Console;
int cp = 936;
// Note: the first three chars are U+00EF, U+00BB, U+00BF, so they get re-encoded
// (C3 AF C2 BB C2 BF in UTF-8, 3F 3F 3F in code page 936) rather than written as a raw BOM.
var s = $"{(char)0xef}{(char)0xbb}{(char)0xbf}测试123,测试2,测试,测试填充列\r\n";
var dbcs = Encoding.GetEncoding(cp);
var utf8Bytes = Encoding.UTF8.GetBytes(s);
var dbcsBytes = dbcs.GetBytes(s);
Func<byte, string> byteToString = b => {
if (b == 0xd) return "CR";
else if (b == 0xa) return "LF";
else return $"{b:X2}";
};
var asUTF8Bytes = String.Join(" ", utf8Bytes.Select(b => byteToString(b)));
var asDBCSBytes = String.Join(" ", dbcsBytes.Select(b => byteToString(b)));
WriteLine();
WriteLine("As UTF-8:");
WriteLine(asUTF8Bytes);
WriteLine();
WriteLine($"As {dbcs.EncodingName}:");
WriteLine(asDBCSBytes);
> csi dbcs_encode.csx
As UTF-8:
C3 AF C2 BB C2 BF E6 B5 8B E8 AF 95 31 32 33 2C E6 B5 8B E8 AF 95 32 2C E6 B5 8B E8 AF 95 2C E6 B5 8B E8 AF 95 E5 A1 AB E5 85 85 E5 88 97 CR LF
As Chinese Simplified (GB2312):
3F 3F 3F B2 E2 CA D4 31 32 33 2C B2 E2 CA D4 32 2C B2 E2 CA D4 2C B2 E2 CA D4 CC EE B3 E4 C1 D0 CR LF
When, and only when, the file is saved as ANSI, the editor's encoding will match the system's and use the DBCS encoding. This is true even if we call
By the same token, if the system is using the new UTF-8 code page,
This was always the case, though; it may be a separate issue, but it's not a regression. |
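The practical consequence of the two byte dumps is that column boundaries sit at different byte offsets depending on how the file was saved. Here is a small Python sketch of that, using the same test line without the leading three characters; `delimiter_offsets` is a hypothetical helper, not something from the plugin.

```python
def delimiter_offsets(text, encoding, delimiter=","):
    """Return the byte offset of each delimiter once `text` is encoded."""
    offsets = []
    position = 0
    for ch in text:
        if ch == delimiter:
            offsets.append(position)
        # Advance by the encoded width of this character.
        position += len(ch.encode(encoding))
    return offsets


LINE = "测试123,测试2,测试,测试填充列"

if __name__ == "__main__":
    print("utf-8:", delimiter_offsets(LINE, "utf-8"))
    print("cp936:", delimiter_offsets(LINE, "cp936"))
```

For this line the commas fall at byte offsets 9, 17 and 24 in UTF-8, but at 7, 13 and 18 in code page 936, which matches the positions of the `2C` bytes in the dumps above.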
That's great! 👍🏼 I have confirmed that the development build works great! |
@rdipardo Thanks for clarifying, text encoding can be a tricky subject. I think the plug-in still has an issue with converting ANSI files in some cases (when sorting, reformatting, etc.), but I'm glad the syntax highlighting with code pages is fixed now. @myonlylonely Thanks for confirming, I'll prepare the new release of the plug-in. |
CSV data for testing: