New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discussion] Reduce Tearing in NES #5
Comments
@OtherCrashOverride #if 0
back_buffer = bmp_create(width, height, 0); /* no overdraw */
if (NULL == back_buffer)
{
bmp_destroy(&primary_buffer);
return -1;
}
bmp_clear(back_buffer, GUI_BLACK);
#endif I will restore it and use swap instead of memcpy. |
It was likely disabled during early development. I am not currently aware of any issue preventing its use. |
I'm not saying there is an issue in your modification. I just wonder why you are doing this. I understood that you can NOT dynamic allocate memory in Arduino. Perhaps the same applies to ESP32. Thus, you hack bmp_create. In any case, there is no trace in your change. That sucks. My goal is to add back a proper double buffering implementation so that I can experiment interlacing and progressive update. I found the original source nofrendo where double buffering implemented in NES emulator application level. It swap primary buffer and back buffer when flushing https://github.com/rickyzhang82/nofrendo/blob/master/src/vid_drv.c#L359-L362 /* Swap pointers to the main/back buffers */
temp = back_buffer;
back_buffer = primary_buffer;
primary_buffer = temp; I will allocate two buffers in advance for primary and back buffer and swap them in your custom_blit function. More advance experimental stuffs will be done there later. PS: Don't blame the source mixed with tab and space. The original nofrendo code use 3 spaces as indentation. It is nice and clear. We should fix it in your upstream. It is very irritating to me. |
I did an experiment by restoring primary buffer and back buffer. Swapping them after custom_blit function. But I saw more serious screen tearing problem than before. I still tried to figure it out why. But cloning a copy of frame buffer by memcpy before sending it to video queue seems to solve the problem. |
@OtherCrashOverride I made two changes:
Reading through Between xQueueSend from custom_blit function and xQueueReceive from video task, they should synchronize that xQueueSend have to wait until xQueueReceive clear the slot. This alone enforce that before buffer swapping is done, current frame buffer data (either primary or back) has been sent over SPI. This guarantees that NO overwrite in frame buffer during transferring. But the test show otherwise. Is it possible that the completion of transferring SPI data doesn't mean that LCD finish display the result? See snippet of custom_blit and video_task. static void custom_blit(bitmap_t *bmp, int num_dirties, rect_t *dirty_rects) {
if (bmp->line[0] != NULL)
{
xQueueSend(vidQueue, &(bmp->line[0]), portMAX_DELAY);
ESP_LOGD(TAG, "xQueueSend bmp(%p) to queue.", (void *)bmp->line[0]);
}
} and static void videoTask(void *arg) {
uint8_t* bmp = NULL;
while(1)
{
BaseType_t hasReceivedItem = xQueuePeek(vidQueue, &bmp, portMAX_DELAY);
if (bmp == 1)
break;
if (hasReceivedItem)
{
ili9341_write_frame_nes(bmp, myPalette);
ESP_LOGD(TAG, "Sent bmp(%p) over SPI.", (void *)bmp);
odroid_input_battery_level_read(&battery);
xQueueReceive(vidQueue, &bmp, portMAX_DELAY);
}
} Here is my experimental branch: https://github.com/rickyzhang82/go-play/tree/exp-enable-double-buffering |
No such guarantees are made. The buffers are passed by reference (pointer) so it is possible to overwrite data while it is being rasterized.
The completion is asynchronous. There is no concern for their completion other than that a transaction is available for use. This is in contrast to blocking for completion. |
Let me put it this way. Two different overwrite happens here:
The guarantee refers to point 1. I think this is true as I explain in
My question is that once frame buffer data stays in internal GRAM of LCD controller, are there any guarantee that LCD show the frame before next SPI transaction starts? |
As mentioned before, the LCD display will update over 2 times (70 fps)
before a single complete frame can be sent over SPI (40Mhz). When the LCD refresh happens, what ever has been sent is display whether its complete or not.. |
That makes sense to me. Because frame buffer SPI transaction to internal GRAM of LCD controller is not in sync with LCD refresh rate. SPI transferring speed limit the frame rate to be half of LCD refresh rate. I think I should reduce SPI traffic by using techniques like interlacing and minimal rectangle update. In any case, I found that if I follow your old way using memcpy to make a copy of primary/back frame buffer into a separate frame buffer in the heap and then do SPI transfer from that separate frame buffer, I don't see any severe tearing like my Youtube video. Note that the separate frame buffer is not defined as An off-topic: I saw that you throttle rendering frame rate in nes.c. bool renderFrame = ((skipFrame % 2) == 0);
nes_renderframe(renderFrame);
system_video(renderFrame);
if (skipFrame % 7 == 0) ++skipFrame;
++skipFrame; I saw NES emulator 60FPS in output. Does it imply you throttle rendering frame rate in 30 fps or something? I don't quite get |
DMA is never done from the frame buffer. The data has to be processed by the CPU first. As stated in the first post, the SPI LCD has a maximum update rate of 30fps. The emulators run at 60fps and discard frames. The "skipFrame % 7" introduces an irregularity to whether odd or even frames are discarded. |
I see your point. It is |
Thank you @OtherCrashOverride & @rickyzhang82 for keeping this discussion public. I've been reviewing the current state of I'm going to carefully watch this thread as the tearing is probably the biggest limitation to this dev kit that people will complain of. Thank you for your hard work =) |
@OtherCrashOverride
|
@jaygarcia It is not my ideas at all. I got two ideas interlacing and delta update from juj's repository. His ili9341 work build on RasPi. But the improvement algorithm may apply to ESP32. |
hi @rickyzhang82 i was also looking at the same interlacing example and was wondering if the overhead of transactions was too high for interlacing to work well, as each row "paint" would require a whole new transaction, resetting the starting address w/ the LCD and it's required overhead. In regards to your point #2 above, are you swapping memory pointers at the end of of the SPI transfers? E.g. After |
|
Thanks @rickyzhang82 =). I've been trying to to follow your fork in addition to learning the codebase and your explanation has helped me =) |
Sorry I'm trying to understand, does this mean the tearing issue might be fixed in the future? |
@OtherCrashOverride I suspected that loading states from uSD in core 0 over SPI causing the problem Because the heap corruption happened around the time it tries to load saved state from uSD card. I saw the original |
@JasonB32 It is pure experimental work. I'm working on interlacing now. |
Ohh you're a very good man @rickyzhang82 Thank you!!! |
@rickyzhang82 |
@OtherCrashOverride did you notice a significant improvement in migrating from task queues to semaphores? =) |
I noticed you convert task queue into semaphore. Using bitmask The original approach relies on PS: I commented out loading uSD card loading code. The heap corruption error is not related to shared SPI bus between LCD and uSD card. I will look into your approach and see if I can borrow your idea to make sure |
@jaygarcia I tried on smsplusgx branch. I didn't see obvious improvement in terms of tearing. |
I figured out the heap corruption problem. It is because of mismatch between In any case, I implemented interlacing on my exp-interlacing branch. The result doesn't look great as I thought. This reminds me of the old days when my using Sony DV cassette tape, importing video by firewire and converting them by ffmpeg with interlacing enabled. See demo video. But this is a good start for the next progressive update with minimal delta. @jaygarcia, I don't see SPI transaction overhead is an issue so far. I can see that it can stay around 60 fps in interlacing. @OtherCrashOverride I don't migrate my interlacing experiment to your new branch as I haven't seen the real benefit yet. I may revisit it when I finish progressive update next. |
@jaygarcia |
The previous task queue design did have reliability issue but this is due to the fact that mismatch between I'm more curious on the efficiency in your new implementation. Have you done any empirical performance test in semaphore vs task queue? |
Empirical evidence was collected by running emulation tests with audio disabled and observing the frame rate. Evidence can also be observed in the slope change of the tearing patterns present. |
@OtherCrashOverride Could you please tell me how you can do a benchmark test like the slope change and etc? I did find that it computes FPS. But I want to know how can it only measure CPU 0 can tell CPU1 performance. Also, I wonder when thanks |
The new implementation is a rewrite. The performance difference is not due to semaphore/task. It is due to a new algorithm. The video frame rate is controlled by the sound in all emulators. Disabling sound allows measurement of video performance. go-play/odroid-go-common/components/odroid/odroid_audio.c Lines 196 to 202 in b4d6163
To eliminate other factors, create a test program that just displays a blank frame and measures the performance. |
That's a good idea. I will take a close look of your new implementation this weekend. Thanks for pointing it out. If possible, would you share the test program as well? I found that you tried to use two ways to synchronize SPI DMA completion. Reading through ESP-IDF doc, there seem to be two ways to do this.
I implemented method 1 in dual buffer. But there are scars in screen. It seems that PS: Can you tell me the mark down syntax to quote code like you did? I love it. But I can't find the syntax. |
I do not have a test program to share. I do modifications in a branch (that is deleted) for testing. As further evidence of the performance increase, the SMS/GG display functions now display the full 320x240 frame. Previously, they cropped lines due to performance. There is no markdown to quote code. I just pasted the link to the code, and its automatically done by github. |
Thanks for point it out. The current master branch used two methods to synchronize DMA.
go-play/odroid-go-common/components/odroid/odroid_display.c Lines 168 to 180 in 1953f92
go-play/odroid-go-common/components/odroid/odroid_display.c Lines 234 to 261 in 1953f92
I'm not sure which one is the right way to sync with DMA transmission yet. But using one should be good enough. Also, the number of |
Does the LCD have a VSYNC line that's exposed that the CPU can observe so we can just paint at the proper time? I'm guessing no? |
the datasheet for the ILI9341V shows a VSYNC input... so is it just a matter of finding that wire on the LCD panel or flex connector and using some 'conductive glue' (I just ruined a flex connector a few months ago when I tried soldering onto it), then connecting that to a GPIO and rigging up the code to toggle? |
Oh, it's an input? Often it's a signal from the driver... how does it work now if we aren't providing it with a VSYNC input at all? I'm not sure how it's an input unless the LCD hardware itself has double-buffered RAM or something. Of course if the tearing isn't actually VSYNC related that might not help much. VSYNC can't hurt though. |
@yyyc514 check the third reference to vsync in the PDF datasheet posted at the top of this issue. It is definitely an input, and specifies |
All, there aren't any more free pins on the ESP32. A pin would need to be taken away from something else. The only place I could think of are the pins on the external header. Maybe pin4 (IO15)? That should still be available even when using an external DAC I believe. I would rather see it fixed in software but the NES code is a mess. Perhaps someone can port a different emulator to the ESP32. Oh and a build that takes a pin from the external header will never be officially supported or included in Go Play. Just something to think about. |
@kamotswind the tearing has nothing to do with emulation. How would porting a different emulator fix this? |
The VSYNC on the LCD controller is only available when the panel is being driven in parallel RGB mode. The ODROID-GO uses SPI mode and no sync signals are available. [edit] |
Is the interlacing part still worked on? In my eyes this is the only way to reach 60 Hertz rendering. |
This issue is addressed in a different branch. I will need to backport the work to the main branch. Tearing is significantly reduced and an improved scaler was also implemented. Anyone wanting to evaluate it can do so by building the following branch: |
Can you point to which commits theses were? I did try to skim the log but couldn't find it. |
@yyyc514 I felt the same way. |
The new scaler looks nice (odroid-go-common/components/odroid/odroid_display.c). Looks like instead of just mindlessly blending every adjacent pixel 50% that it actually figures out what % of which pixel should be blended... so since we're going from 256 -> 320 that very first native pixel would be 100% of NES pixel 1 but then the second pixel would be 25% NES pixel 1 and 75% NES color from pixel 2. At least that's my understanding from skimming. Not sure I can understand what change reduced the tearing though. |
@kamotswind Thanks for pointing it out. After reading the diff, I have to say the coding style is far from ideal. Commenting out experimental code and careless spacing should and must be eliminated in Git. That's why I don't feel this project will welcome PR from others. To me, It is a disaster to work on any open source project like this. Sorry about my rant. Back to the topic. If this diff does do the magic to reduce tearing, I can disable blending completely to verify it. Correct it if I'm wrong. But I doubt it is blending. I remembered I disabled blending before but tearing still showed up. Regarding to my experiment on interlacing, it did not look that great at all on a small screen where you look at it closely for a long period of time. I gave up trying the 2nd approach which computes the minimum square (or squares) update between two consecutive frame. Perhaps I will try it out later on during Winter holiday. As someone pointed this out, it is likely hardware limitation (no wiring to vsync) which makes software fix impossible to address tearing problem. If this is the case, we need to move on. |
Just to note, I'm also working on this here: https://github.com/Cwiiis/go-play/tree/wip/cwiiis/smooth - I have dual goals of a partial updates/interlaced video update path and full-speed NES emulation. Currently my branch has interlaced, unscaled output on the NES only (the other emulators won't build at the moment). |
Just to note, I've switched branches to https://github.com/Cwiiis/go-play/tree/wip/cwiiis/partial_updates and have been making some great progress. I've still got a few ideas to try out to get a bit more from this before tidying it up, implementing scaling and updating all the other emulators. I'm not sure that I'm going to be able to do any better than nearest-neighbour scaling, we'll have to see... |
how's the progress going? @OtherCrashOverride did you ever merge your "other" branch in? |
I want to reduce tearing problem in NES. I copied our discussion from odroid forum to here
How it works now
The way the display system currently works:
meanwhile on core 1:
meanwhile on DMA:
There are two frame buffers, two scanline buffers of 3 lines, and multiple DMA buffers.
At 40Mhz SPI, there is 40,000Kbps / 8 bits = 5KBps. A frame is 320 * 240 * 2 bytes / 1024 = 150KB. 5KBps / 150KB = 33.3 fps. However, because there is protocol overhead, the achieved frame rate is less (closer to 30fps). The LCD updates at 70fps (61 is the slowest but looked worse). 70fps / 30fps = 2.3 refreshes for every frame displayed.
LCD controller specs
IL9341 specs
Other's improvement on IL9341 over SPI
fbcp-ili9341 repo
My proposal
The text was updated successfully, but these errors were encountered: