WriteConsoleW used with ConEmu duplicates Chinese characters output #945

Open
Nelson-numerical-software opened this issue Nov 9, 2016 · 19 comments

Comments

@Nelson-numerical-software

Versions

ConEmu build: 161023 x64 stable
OS version: Windows 10 x64 (1607)
Microsoft Windows [version 10.0.14959] cmd

Problem description

WriteConsoleW duplicates Chinese characters

Steps to reproduce

Actual results

Output: Traditional Chinese 漢漢字字

Expected results

Original string: Traditional Chinese 漢字

Additional files

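Build this code with VS 2015 C++:

#include <Windows.h>
#include <string>

int main()
{
    // Wide string ending in two double-width CJK glyphs
    std::wstring msg = L"Traditional Chinese 漢字";
    HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
    // The length argument is a DWORD count of characters, not bytes
    WriteConsoleW(consoleHandle, msg.c_str(), static_cast<DWORD>(msg.size()), NULL, NULL);
    return 0;
}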

@Maximus5
Owner

  1. Why do you single out WriteConsoleW? Have you checked the result in the RealConsole (Ctrl-Win-Alt-Space)?

  2. Please run ConEmuC -checkunicode from ConEmu's prompt and post the result here.

@Nelson-numerical-software
Author

1]
This also seems to be a bug in Windows 10 Insider builds 14959 and 14965.
With the stable Windows 10 version 1607 and the same ConEmu version (161023), it works.

2]
Please note the duplicated characters 中中文文

ConEmuC -checkunicode
ConEmu 161022 x86
OS Version: 10.0.14965 (2:)
SM_IMMENABLED=1, SM_DBCSENABLED=0, ACP=1252, OEMCP=850
ConHWND=0x00090634, Class="ConsoleWindowClass"
Console font info: 0, {3x5}, 54, 400, "Lucida Console"
Handles: In=x8 (Mode=x1F7) Out=xC (x3) Err=x10 (x3)
Buffer={131,1000} Window={0,0}-{130,35} MaxSize={131,166}
Cursor: Pos={0,9} Size=25% Visible
ConsoleCP=850, ConsoleOutputCP=850
CP850: Max=1 Def=x3F,x00 UDef=x3F
Lead=x00,x00,x00,x00,x00,x00,x00,x00,x00,x00,x00,x00
Name="850 (OEM - latin multilingue I)"

123456789也也不不是是可可运运行行的的程程序序112233445566778899
Normal Reverse x7 x4007 Normal:x7 Reverse:x4007

Check AÀÀΑΑ╬╬豈豈AAꊠꊠ黠黠だだ➀ጀะڰЯ09
Text: AÀÀΑΑ╬╬豈豈AAꊠꊠ黠黠だだ➀ጀะڰЯ09
Read: A:x7 ÀÀ:x107 ΑΑ:x207 ╬╬:x107 豈豈:x207 AA:x107 ꊠꊠ:x207 黠黠:x107 だだ:x207 ➀:x107 ጀ:x207 ะ:x107 ڰ:x207 Я:x107 0:x207 9:x107
Blck: A:x7 ÀÀ:x107 ÀÀ:x207 ΑΑ:x107 ΑΑ:x207 ╬╬:x107 ╬╬:x207 豈豈:x107 豈豈:x207 AA:x107 AA:x207 ꊠꊠ:x107 ꊠꊠ:x207 黠黠:x107 黠黠:x207 だだ:x107 Info: 0,1,1,16,1,1,24,1

╔══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦╦══
════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗║ 中中文文 ║中中
文文║╚══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
╩╩════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
══╝
Unicode check succeeded

@Maximus5
Owner

@miniksa Can you take a look at this? It has already been reported here several times.

@miniksa

miniksa commented Nov 10, 2016

@Maximus5 I've filed it as MSFT:9751066 internally and assigned it to myself. I'm currently deep in thought on something else, so I'll probably get to it early next week. Thanks for the report.

@miniksa

miniksa commented Nov 15, 2016

I see the issue. There appear to be duplicates coming out of ReadConsoleOutputW/A. I'm not sure what happened there. I'll have to keep investigating, but it looks like it will need a fix on our side once I figure it out.

@Maximus5
Owner

Perhaps this comes from changes in attribute processing. I noted some time ago (not sure where exactly) that new Windows builds process the high byte of console attributes "in proper and better way"...
One of the weirdest things in conhost is COMMON_LVB_LEADING_BYTE/COMMON_LVB_TRAILING_BYTE processing. It works differently on DBCS (Chinese/Japanese/...) Windows distros than on "European" distros. On DBCS systems, when certain CJK codepages are selected, each double-width glyph takes two (or more?) CHAR_INFOs (cells). That never happened on European distros, even if CJK support was installed and these codepages were selected in the console.
I can't reproduce this issue on my test Win 10 boxes yet.
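
To make the leading/trailing cell layout concrete, here is a minimal, platform-independent sketch of how a reader of the screen buffer might collapse such cell pairs back into text. The `Cell` struct and `collapseRow` are hypothetical names made up for this illustration (`Cell` stands in for CHAR_INFO). The two flag values are the ones the Windows SDK defines for COMMON_LVB_LEADING_BYTE and COMMON_LVB_TRAILING_BYTE, and they agree with the x107/x207 attributes in the -checkunicode output above. This is not ConEmu's or conhost's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Same values as COMMON_LVB_LEADING_BYTE / COMMON_LVB_TRAILING_BYTE in the
// Windows SDK; redefined here so the sketch builds on any platform.
constexpr std::uint16_t kLeadingByte  = 0x0100;
constexpr std::uint16_t kTrailingByte = 0x0200;

// Hypothetical stand-in for CHAR_INFO: one screen-buffer cell.
struct Cell {
    char16_t ch;
    std::uint16_t attr;
};

// Rebuild the text of one row. A double-width glyph occupies a leading and a
// trailing cell that both carry the same character, so it is emitted once.
std::u16string collapseRow(const std::vector<Cell>& row) {
    std::u16string text;
    for (const Cell& cell : row) {
        if (cell.attr & kTrailingByte) {
            continue;  // second half of a glyph that was already emitted
        }
        text.push_back(cell.ch);
    }
    return text;
}

// Usage: "A" plus two CJK glyphs, laid out the way a DBCS console reports them.
inline void collapseRowExample() {
    const std::vector<Cell> row = {
        {u'A', 0x0007},
        {u'\u6F22', 0x0007 | kLeadingByte}, {u'\u6F22', 0x0007 | kTrailingByte},
        {u'\u5B57', 0x0007 | kLeadingByte}, {u'\u5B57', 0x0007 | kTrailingByte},
    };
    assert(collapseRow(row) == u"A\u6F22\u5B57");
}
```

A reader that ignores the flags and copies every cell would produce the doubled output reported in this issue.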

@miniksa

miniksa commented Nov 21, 2016

FYI, I haven't forgotten about this investigation. We've just suddenly got slammed with e-mails and bugs from all sources and so getting to investigating this may take me significantly longer than I originally predicted. I will be back when I get a chance.

@miniksa

miniksa commented Jan 20, 2017

FYI, the fix for this should have just landed with Insider Build 15014 today.

@ncihnegn

Just tested Build 15014. Not fixed yet.

@miniksa

miniksa commented Jan 24, 2017

Hmmm. Not sure what's up. I'll dig into character handling stuff today.

@Maximus5
Owner

Maximus5 commented Jan 25, 2017

@miniksa I finally managed to install an Insider build.

First, the expected behavior from the "stable" Win10 build. All glyphs are written and displayed properly, there are no doubled CJK glyphs, and the data fits on screen.
2017-01-25_11-08-09

Now the 15014.

2017-01-25_11-12-26

I'm still checking the results; here are my first notes.

  1. Even though SM_DBCSENABLED is 0, COMMON_LVB_LEADING_BYTE and COMMON_LVB_TRAILING_BYTE are set. Is that intended on a non-DBCS-enabled OS? They were not used previously; only CJK versions of Windows (up to Win 10 14393) used them.
  2. Worse, conhost itself treats CJK glyphs inconsistently.
  • In some places it shows them (as squares, yep) as if they had double-cell width, in others as single-cell width.
  • When ConEmu writes 80 characters (the console width) on non-CJK Windows, the data is expected to be written properly without wrapping. But that's not true anymore. Even in conhost's window we can see that only 77 characters (I counted them) were written under the frame (the line with three CJK glyphs).
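
The 77-character count is what you would expect if each of the three CJK glyphs takes two cells: 3 × 2 + 74 = 80 cells, which holds 77 characters. That is my reading of the numbers, not something confirmed in the thread. A small sketch of the arithmetic follows. `isWideGlyph` and `charsThatFit` are hypothetical helpers, and the width test is deliberately crude: it only covers the CJK Unified Ideographs block.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for illustration only: treat just the CJK Unified Ideographs
// block (U+4E00..U+9FFF) as double-width. Real width tables are far larger.
constexpr bool isWideGlyph(char32_t c) {
    return c >= 0x4E00 && c <= 0x9FFF;
}

// Number of leading characters of `line` that fit into `width` cells.
// `wideTakesTwoCells` selects between the two behaviors being compared.
std::size_t charsThatFit(const std::u32string& line, std::size_t width,
                         bool wideTakesTwoCells) {
    std::size_t cells = 0;
    std::size_t count = 0;
    for (char32_t c : line) {
        const std::size_t need = (wideTakesTwoCells && isWideGlyph(c)) ? 2 : 1;
        if (cells + need > width) {
            break;  // the next glyph would not fit on this row
        }
        cells += need;
        ++count;
    }
    return count;
}

// Hypothetical 80-character line: three CJK glyphs followed by 77 fillers.
inline void charsThatFitExample() {
    const std::u32string line = U"\u4E2D\u6587\u5B57" + std::u32string(77, U'x');
    assert(charsThatFit(line, 80, false) == 80);  // one cell per glyph
    assert(charsThatFit(line, 80, true) == 77);   // 3 * 2 + 74 = 80 cells
}
```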

Finally, there are drawing bugs during selection in conhost's window. I selected cells one by one with the mouse. Cells have unexpected widths during selection, and strangely the line below the selection is broken while selecting.
win10-selection

@Maximus5 Maximus5 reopened this Jan 25, 2017
@Maximus5
Owner

@miniksa The API is inconsistent... WriteConsoleOutputAttribute, WriteConsoleOutputCharacter, ReadConsoleOutputCharacter, ReadConsoleOutputAttribute, ReadConsoleOutput...
Some functions treat CJK as normal single-cell glyphs (WriteConsoleOutputCharacter, ReadConsoleOutputCharacter).
Some functions return COMMON_LVB_LEADING_BYTE/COMMON_LVB_TRAILING_BYTE and therefore double cells (ReadConsoleOutputAttribute, ReadConsoleOutput).
Some functions have undefined behavior (after WriteConsoleOutputAttribute followed by WriteConsoleOutputCharacter, glyphs are "written" after the cells filled with attributes).
This is all on a non-CJK Insider Win 10.

@miniksa

miniksa commented Jan 25, 2017

Yeah, I was finding bad behavior like this yesterday as well. Part of the deal is that it behaves differently with Raster Fonts vs. TrueType fonts as well. I'll probably be spending the rest of the week trying to fix this up and make it consistent. I don't know what SM_DBCSENABLED is/does. Console's DBCS check has always been based on the active code page (whether it is 932, 949, 950, or 936), not that system metric.
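
For reference, the code-page test described above could look roughly like this. It is only a sketch: `isCjkCodePage` is a hypothetical name, and the real conhost check may be structured differently.

```cpp
// Sketch of a code-page-based DBCS check. The four values are the Windows
// code pages named above:
//   932 = Japanese (Shift-JIS), 936 = Simplified Chinese (GBK),
//   949 = Korean, 950 = Traditional Chinese (Big5).
constexpr bool isCjkCodePage(unsigned codePage) {
    return codePage == 932 || codePage == 936 ||
           codePage == 949 || codePage == 950;
}

// Compile-time checks: the -checkunicode report above shows code page 850,
// which is not a DBCS code page.
static_assert(isCjkCodePage(932), "Shift-JIS is a DBCS code page");
static_assert(!isCjkCodePage(850), "OEM 850 is single-byte");
```

On Windows, the argument would typically come from GetConsoleOutputCP().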

I'll try to keep you posted as I figure this out. Sorry about that. A few of us have been working on trying to fit UTF-8 support into the console (not done yet) and it appears to have messed up quite a few DBCS routes.

@Maximus5
Owner

I used to check GetSystemMetrics(SM_DBCSENABLED), which was 1 only for Windows installations built for China, Japan, and Korea (CJK).
If SM_DBCSENABLED returned 0, CJK glyphs used only one cell in conhost, regardless of the codepage.
That was true before.
Now it is broken or changed.
What is the correct behavior?
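
The older rule described above can be sketched as follows. `systemIsDbcsEnabled` and `cellsPerGlyph` are hypothetical helpers written for this illustration. Only GetSystemMetrics and SM_DBCSENABLED are real Win32 names, and the non-Windows fallback exists only so the sketch compiles everywhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// True when the OS reports DBCS support through the system metric.
// On non-Windows builds this sketch simply reports false.
inline bool systemIsDbcsEnabled() {
#ifdef _WIN32
    return GetSystemMetrics(SM_DBCSENABLED) != 0;
#else
    return false;
#endif
}

// Cells per glyph under the older behavior described above: a wide glyph
// takes two cells only on a DBCS-enabled system, otherwise always one cell.
constexpr int cellsPerGlyph(bool dbcsEnabled, bool wideGlyph) {
    return (dbcsEnabled && wideGlyph) ? 2 : 1;
}

static_assert(cellsPerGlyph(false, true) == 1, "non-DBCS system: one cell");
static_assert(cellsPerGlyph(true, true) == 2, "DBCS system: two cells");
```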

@miniksa

miniksa commented Jan 25, 2017

I'll have to get back to you on that. Everything you are telling me about SM_DBCSENABLED is 100% new information to me. I don't really know if that particular metric used to be a part of the console code in XP/Vista/7/8. I can look. I also don't know what in the system turns that metric on or off.

From what I know about the console from Win 8.1 to today, the console always did its conversions and width calculations based on code page. It's just that prior to recently, it used to prohibit changing into a CJK codepage unless your system's non-Unicode region was set to a CJK language (Control Panel-->Region-->Administrative-->Language for non-Unicode programs). I've been trying to remove that restriction to allow anyone to swap into any codepage no matter their "non-Unicode region" because in today's editions of Windows (as opposed to the CJK-specific ones of the 1990s), you can add just about any language pack and IME and font to any language edition of Windows, so the "non-Unicode" region doesn't really matter like it used to several decades ago.

My plan is:

  • Go back into the DBCS tests and expand them significantly across these APIs against the v1 console (legacy) since everything was "fine" before we mucked around with it.
  • Correct anything in v2 that is no longer compliant with the DBCS tests (including all the APIs you listed above).
  • Get that set of fixes and tests shipping up toward the insiders build.
  • Get you some documentation/research on how all this is supposed to work (including that SM_DBCSENABLED flag in the console context) and potentially get that sort of information published to MSDN.

@miniksa

miniksa commented Feb 7, 2017

So I've got through 1, 2, and 3 in MSFT: 10187355, which is checked in and will start shipping up to Insiders builds. It will probably be there in a few weeks. I've basically restored the console's behavior to what it was for the legacy console. If it works against the console with the legacy box checked, it will work again against the updated one once the Insider build updates.

For part 4, I'm still working on it. I basically need to write up the way that the v1/legacy console did it and publish that.

@rprichard

@miniksa @Maximus5 FWIW, this VSCode/winpty issue seems related: microsoft/vscode#19665. ConEmu is broken in exactly the same way (screenshot in this comment, microsoft/vscode#19665 (comment)). I wrote a test case demonstrating the new (broken?) behavior as of Win10 v15048.

@bao-qian

bao-qian commented Jun 30, 2017

Hi,
I had no such problem with the previous Windows build (15063.413) for Simplified Chinese.
I only noticed the issue after the latest stable Windows build, 15063.447, rolled out:
the alpha build works almost fine with the new console.

image

The stable and preview builds work fine with the legacy console.

image

@faiz-lisp

faiz-lisp commented Mar 17, 2020

I tried Chinese with the UTF8 version of Newlisp.
https://github.com/kosh04/newlisp/blob/develop/nl-utf8.c
It works well.

https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80
(I hope it could help.)
