Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow rendering of colored text #4129

Open
joaobzrr opened this issue Jan 7, 2020 · 15 comments
Open

Very slow rendering of colored text #4129

joaobzrr opened this issue Jan 7, 2020 · 15 comments
Labels
Area-Performance Performance-related issue Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Product-Conhost For issues in the Console codebase
Milestone

Comments

@joaobzrr
Copy link

joaobzrr commented Jan 7, 2020

WriteConsole() performs reasonably well when writing plain uncolored text to the console (~2ms at best, ~20ms at worst with a 240x64 buffer), but is incredibly slow when writing text with color escape sequences. Below is some code that demonstrates this.

I write an entire screen worth of colored text as fast as possible shifting every character to the one next to it in ASCII as to avoid any possible caching that might happen. I measure the time it takes for each WriteConsole() call and print the results at the end. No third-party libraries are needed to compile.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>

#define NUM_OF_MEASUREMENTS 100


static LONGLONG perf_counter_freq;


LONGLONG
get_timestamp(void)
{
    LARGE_INTEGER timestamp;
    QueryPerformanceCounter(&timestamp);
    LONGLONG result = timestamp.QuadPart;
    return result;
}


float
get_seconds_elapsed(LONGLONG start, LONGLONG end)
{
    float result = (float)(end - start) / (float)perf_counter_freq;
    return result;
}


void
generate_text(char start_char, char *buffer, size_t buffer_size)
{
    for (size_t i = 0; i < buffer_size; i++)
    {
        buffer[i] = 'A' + (((start_char - 'A') + i) % 26);
    }
}


char *
copy_string(char *dest, char *source)
{
    while (*source)
    {
        *dest++ = *source++;
    }

    return dest;
}


int main(int argc, char **argv)
{
    HANDLE con_out = CreateFile("CONOUT$", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_WRITE, 0, OPEN_EXISTING, 0, 0);
    if (con_out == INVALID_HANDLE_VALUE) goto cleanup;

    DWORD console_mode;
    GetConsoleMode(con_out, &console_mode);

    console_mode |= ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_WRAP_AT_EOL_OUTPUT;
    SetConsoleMode(con_out, console_mode);

    CONSOLE_SCREEN_BUFFER_INFO buffer_info;
    GetConsoleScreenBufferInfo(con_out, &buffer_info);

    LARGE_INTEGER perf_counter_freq_result;
    QueryPerformanceFrequency(&perf_counter_freq_result);
    perf_counter_freq = perf_counter_freq_result.QuadPart;

    int console_width  = (buffer_info.srWindow.Right  - buffer_info.srWindow.Left) + 1;
    int console_height = (buffer_info.srWindow.Bottom - buffer_info.srWindow.Top);
    size_t buffer_size = console_width * console_height;

    char *buffer1 = (char*)malloc(buffer_size);
    if (!buffer1) goto cleanup;

    char *buffer2 = (char*)malloc(buffer_size);
    if (!buffer2) goto cleanup;

    generate_text('A', buffer1, buffer_size);
    generate_text('B', buffer2, buffer_size);

    size_t screen_buffer_size = 1048576;
    char *screen_buffer = (char*)malloc(screen_buffer_size);
    if (!screen_buffer) goto cleanup;

    int measurement_index = 0;
    int num_of_measurements = NUM_OF_MEASUREMENTS;
    float *measurements = (float*)malloc(num_of_measurements*sizeof(float));
    if (!measurements) goto cleanup;

    int c = 0;
    int b = 1;
    for (int count = 0; count < num_of_measurements; count++)
    {
        char *src = b ? buffer1 : buffer2;
        char *at = screen_buffer;

        at = copy_string(at, "\x1b[H");
        for (int j = 0; j < console_height; j++)
        {
            size_t row_offset = j * console_width;
            char *row = &src[row_offset];

            for (int i = 0; i < console_width; i++)
            {
                at = copy_string(at, "\x1b[38;2;");
                switch(c)
                {
                    case 1:  { at = copy_string(at, "28;215;119m"); } break;
                    case 2:  { at = copy_string(at, "0;159;255m");  } break;
                    default: { at = copy_string(at, "226;51;73m");  } break;
                }
                c = (c + 1) % 3;

                *at++ = row[i];
                at = copy_string(at, "\x1b[0m");
            }
            at = copy_string(at, "\n");
        }

        LONGLONG start = get_timestamp();

        DWORD written = 0;
        size_t chars_to_write = at - screen_buffer;
        WriteConsole(con_out, screen_buffer, (DWORD)chars_to_write, &written, 0);

        LONGLONG end = get_timestamp();

        float seconds_elapsed = get_seconds_elapsed(start, end);
        measurements[measurement_index++] = seconds_elapsed;

        b = !b;
    }

    DWORD written = 0;
    WriteConsole(con_out, "\x1b[2J\x1b[H", 7, &written, 0);

    float fastest = measurements[0];
    float slowest = measurements[0];
    float total = 0;
    for (int i = 0; i < num_of_measurements; i++)
    {
        float measurement = measurements[i];

        fastest = measurement < fastest ? measurement : fastest;
        slowest = measurement > slowest ? measurement : slowest;
        total += measurement;
    }
    float average = total / num_of_measurements;

    printf("Fastest: %.4f\n", fastest);
    printf("Slowest: %.4f\n", slowest);
    printf("Average: %.4f\n", average);

cleanup:
    if (con_out != INVALID_HANDLE_VALUE) CloseHandle(con_out);
    if (buffer1)       free(buffer1);
    if (buffer2)       free(buffer2);
    if (screen_buffer) free(screen_buffer);
    if (measurements)  free(measurements);
}

Is there anything I can do to make this run fast enough for animation (at least 30 FPS)? Techniques like only writing what changed are not an option, although I doubt that doing that would be faster than just writing the whole buffer to the console at once. I know about WriteConsoleOutput() and double-buffering with SetConsoleActiveScreenBuffer() but I don't know if they could be made to work with VT escape sequences.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Jan 7, 2020
@zadjii-msft zadjii-msft added Area-Performance Performance-related issue Issue-Question For questions or discussion Product-Conhost For issues in the Console codebase labels Jan 7, 2020
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jan 7, 2020
@DHowett-MSFT
Copy link
Contributor

What version of Windows is this on? We made some improvements here in the past couple releases, but it looks like the bug template got lost.

@DHowett-MSFT DHowett-MSFT added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jan 9, 2020
@DHowett-MSFT DHowett-MSFT added this to the Console Backlog milestone Jan 9, 2020
@DHowett-MSFT DHowett-MSFT removed the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Jan 9, 2020
@joaobzrr
Copy link
Author

joaobzrr commented Jan 9, 2020

Windows 10, 19013.1102 build.

@ghost ghost added Needs-Attention The core contributors need to come back around and look at this ASAP. and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Jan 9, 2020
@joaobzrr
Copy link
Author

@DHowett-MSFT Should I open a bug issue for this?

@DHowett-MSFT
Copy link
Contributor

This is the main bug issue we will be tracking for this report. Thank you!

@DHowett-MSFT DHowett-MSFT added Issue-Bug It either shouldn't be doing this or needs an investigation. and removed Issue-Question For questions or discussion Needs-Attention The core contributors need to come back around and look at this ASAP. labels Jan 24, 2020
@DHowett-MSFT DHowett-MSFT modified the milestones: Console Backlog, 21H1 Jan 24, 2020
@heavelock
Copy link

I noticed that in general, rendering of a text in Windows terminal is very slow compared to conemu. I am currently attached to the same Tmux session on remote linux machine via ssh from legacy WSL. I am in a dir containing ~700 files. I am able to run a simple ls -la command multiple times in Conemu while WT is still rendering output from the first call. The standard output is coloured but even after stripping colour with ls -la | sed -r "s/[[:cntrl:]]\[[0-9]{1,3}m//g" WT is noticeably slower. I have no idea how to perform quantitative tests but I would be happy to provide results if anyone suggests something.

@kevinbeason
Copy link

I noticed that in general, rendering of a text in Windows terminal is very slow

Indeed. As a point of comparison, running seq 1 70000 takes this long on various terminals (local and remote), as determined by using a physical stopwatch:

Windows Terminal + Ubuntu WSL: 12.99s
Ubuntu WSL: 5.32s
Ubuntu WSL + tmux: 0.70s
PuTTY (ssh to remote linux machine): 2.55s
VSCode terminal (in remote SSH session to linux machine): 1.15s
Ubuntu (in Konsole running natively in X11 on slower machine): 0.55s

@davidchisnall
Copy link

There are some very odd performance differences between using the alternative console and not, local and remote. I tried using konsole and WT on the same machine, locally with and without tmux (both in WSL) and remotely via an ssh connection with and without tmux and got the following approximate timings:

Terminal local / remote tmux / native Time
WT local native 14s
WT local tmux <1s
WT remote native 5s
WT remote tmux >1m
Konsole local native <1s
Konsole local tmux <1s
Konsole remote native <1s
Konsole remote tmux <1s

Tmux dropped a few lines from the scrollback with Konsole remote (but not locally), which it does when writes to the console are taking too long. WT drops all tmux scrollback, so it's difficult to tell whether it's actually trying to render all of it. Adding tmux on the remote connection makes WT incredibly slow. The writes are being buffered somewhere, so hitting control-c on a command that produces a lot of output can still cause the console to keep drawing output for several a minute or longer before the terminal window becomes useable again.

@zadjii-msft zadjii-msft modified the milestones: Windows vNext, 22H2 Jan 4, 2022
@zadjii-msft zadjii-msft added the Priority-1 A description (P1) label Jan 4, 2022
@zadjii-msft zadjii-msft removed this from the 22H2 milestone Jun 13, 2022
@zadjii-msft zadjii-msft added this to the EIM Summer 2022 milestone Jun 13, 2022
@zadjii-msft zadjii-msft modified the milestones: EIM Summer 2022, Backlog Nov 28, 2022
@aremmell
Copy link

aremmell commented Sep 5, 2023

This is still an issue in late 2023. I publish a library that, among other things, writes ANSI escape colored text to the console (libsir).

On Windows (using WriteConsole), the performance is so much worse than any other platform/terminal emulator, that I've had to cut the number of iterations to 10% of normal for the performance evaluation test.

Here are some numbers (from the same machine; everything other than macOS is a VM):

Platform # lines Elapsed Lines/sec Emulator
macOS 1m 7.418s 134807.2 iTerm2
Debian 12 1m 7.055s 141743.4 gnome-terminal
FreeBSD 1m 3.310s 302114.8 Konsole
Windows 100k 16.596s 6025.6 Windows Terminal

I know that Microsoft can do better than this, so why has this bug report been ignored for nearly 4 years now? It's the age of cross-platform software, and compatibility (which has never been Windows' strong suit) is becoming more important. You took it seriously enough to implement the parsing and rendering of the color codes, so why not finish the job and make it at least compete with other emulators in terms of throughput?

I, for one, would really appreciate it!

@lhecker
Copy link
Member

lhecker commented Sep 5, 2023

Oh... I forgot about this issue (not the problem in general of course). This issue is technically resolved as described by OP. On the latest versions of Windows Terminal it'll now look like this:

Fastest: 0.0014
Slowest: 0.0034
Average: 0.0017

For reference, at the time this issue was filed in v0.7.3451.0 it looked like this:

Fastest: 0.0132
Slowest: 0.0335
Average: 0.0282

The average has gotten 16.5x faster over the last 3 years. I think the issue as described is resolved now, even if it'll be improved further.


@aremmell If I'm interpreting "# lines" right, I think you're seeing our poor newline performance and not an issue related to colored text in general, right? That issue is difficult to solve unfortunately. It's inherent to the current architecture of ConPTY which acts as a "proxy" to translate console client output to VT (this affects all console clients, including VT to VT translation). Windows Terminal is a pure VT based terminal and uses this proxy to allow for all kinds of clients to connect. All previous attempts at hot-fixing it have failed. We still continuously pursue this however, since the >50x throughput difference to other terminals is "not great".

@aremmell
Copy link

aremmell commented Sep 5, 2023

@zadjii-msft I didn't mean to come off as insulting. I found this thread, saw that it was added to a milestone 2 years ago, and then nothing else, so I assumed (apparently incorrectly) that nobody was paying it any mind. Good to hear you guys are on it!

@lhecker Thanks for the update. I suppose maybe it is just the line breaks that are causing it to run so slowly. I also made an assumption about the cause of the slow throughput because all of the output from my library has some escape codes, even if it's just to use the default foreground and background colors.

Sorry to hear that it is such a difficult technical challenge. Perhaps you will have an epiphany, or these individual tweaks mentioned by @zadjii-msft will have a cumulative effect and make it much less noticeable.

Honestly, if I hadn't run a performance test that hammers the console as fast as one CPU can do it, I would probably not have noticed. The performance is 'good enough' for most applications, but I believe it is probably the bottleneck for any super intensive operation that is also logging its progress to stdout.

Anyhow, godspeed, and I'll keep my eye on this thread for updates (and if you guys need a guinea pig, I'll volunteer to test dev builds).

Cheers

@Mrgaton
Copy link

Mrgaton commented Dec 5, 2023

I got the same problem when i was tring to make an ascii video player, i hope a fix soon :c

@Mrgaton
Copy link

Mrgaton commented Dec 5, 2023

image

It takes like a full second to write a frame

@DHowett
Copy link
Member

DHowett commented Dec 5, 2023

It takes like a full second to write a frame

Which version of Terminal is this on, and are you using the "new" text rendering engine? We've made a lot of improvements over the past couple years since this bug was first files.

@lhecker
Copy link
Member

lhecker commented Dec 5, 2023

Unfortunately even with the faster code we're not at a state yet where watching a video at this resolution would be pleasant. We're currently limited to something in the ballpark of 60MB/s of VT processing at the moment which at 640x180 cells as in your screenshot is 5-10 FPS. That's roughly 1 order of magnitude faster than it used to be (so, try it out if you can!), but we'd need to make it another order of magnitude faster to be decent. That's theoretically possible, but requires lots and lots and lots of code refactorizations in this code base. 1

Footnotes

  1. The renderer holds the text buffer lock across D3D and D2D calls, but fixing that is difficult because there's tons of places that access the buffer directly via huge, narrow interfaces. There's the UTF8<>UTF16 translation which happens 3 times and it uses stringapiset.h which isn't the fastest. The console calls result in lots of individual heap allocations due to the reliance on the STL (as most C++ projects do; but that's bad for perf). The VT parser heavily relies on dynamic dispatch which results in a disappointing upper perf ceiling due to CFG. It uses lots of if conditions for parsing and could use a faster LL(1) design with LUTs instead. The text buffer stores VT sequences and text split up separately using RLE compression for the sequences (as metadata), which is also slow. Basically, there's a long list of long-term cleanups that need to occur and they have been and continue to be long-term ongoing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Performance Performance-related issue Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Product-Conhost For issues in the Console codebase
Projects
None yet
Development

No branches or pull requests

10 participants