Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some unit tests fail with SSE disabled (i.e. scalar path only) #2414

Closed
4 tasks done
gfoidl opened this issue Mar 25, 2023 · 29 comments · Fixed by #2427
Closed
4 tasks done

Some unit tests fail with SSE disabled (i.e. scalar path only) #2414

gfoidl opened this issue Mar 25, 2023 · 29 comments · Fixed by #2427

Comments

@gfoidl
Copy link
Contributor

gfoidl commented Mar 25, 2023

Prerequisites

  • I have written a descriptive issue title
  • I have verified that I am running the latest version of ImageSharp
  • I have verified if the problem exist in both DEBUG and RELEASE mode
  • I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

local build with latest main-branch

Other ImageSharp packages and versions

none

Environment (Operating system, version and so on)

win-x64, DOTNET_EnableSSE=0

.NET Framework version

all

Description

While hunting the bug (fixed via #2413) I tried several things, amongst others disabling SSE for unit tests run with Visual Studio by using this

<RunSettings>

<?xml version="1.0" encoding="utf-8" ?>
<RunSettings>
    <RunConfiguration>
        <!--Used in conjunction with ActiveIssueAttribute to skip tests with known issues-->
        <TestCaseFilter>category!=failing</TestCaseFilter>

        <EnvironmentVariables>
            <DOTNET_EnableSSE>0</DOTNET_EnableSSE>
        </EnvironmentVariables>
    </RunConfiguration>
</RunSettings>

Tests started to fail -- even when going back in the history of main-branch (so no recent commit caused these failures).

To me it looks like there are latent bugs in the scalar code-pathes. Can anyone confirm?

(description copied and adapted from #2413 (comment))

Steps to Reproduce

To reproduce together with that runsettings one can try this playlist.zip (for filtered tests).

Images

No response

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 25, 2023

It looks like CI runs the tests on the agents "as is", so with AVX (except MacOS?) and SSE enabled. When the PR is labeled for ARM then an extra runner for ARM is started.

Thus bugs in the scalar code-paths may not be covered by CI, as the path via AVX/SSE is taken usually.
Some bugs get revealed by running on ARM (e.g. #2409 (comment)) but CI should cover the scalar paths on non-ARM too.

My suggestion is to add a new entry in the

matrix:
isARM:
- ${{ contains(github.event.pull_request.labels.*.name, 'arch:arm32') || contains(github.event.pull_request.labels.*.name, 'arch:arm64') }}
options:
that

  • e.g. ubuntu-latest
  • has environment variable DOTNET_EnableSSE=0 set
    so that JIT treats Sse.IsSupported = false (and all "inherited" intrinsics like SSE2, ..., AVX, AVX2), thus only the scalar code-paths are executed.

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 25, 2023

@brianpopow in #2413 (comment) you wrote about coverage.

Please check if there's coverage for

private static int AddSubtractComponentHalf(int a, int b) => (int)Clip255((uint)(a + ((a - b) / 2)));

I guess that's not covered, as it's a scalar-path and method ClampedAddSubtractHalf (which calls this method) has a pure SSE2-implemation.

@JimBobSquarePants
Copy link
Member

We should be running any decoder/encoder tests using FeatureTestRunner with which we can disable Avx/Sse etc.

@brianpopow
Copy link
Collaborator

@brianpopow in #2413 (comment) you wrote about coverage.

Please check if there's coverage for

private static int AddSubtractComponentHalf(int a, int b) => (int)Clip255((uint)(a + ((a - b) / 2)));

I guess that's not covered, as it's a scalar-path and method ClampedAddSubtractHalf (which calls this method) has a pure SSE2-implemation.

Here is the coverage report For LosslessUtils. ClampedAddSubtractHalf is reported as covered, but the Clip255 which is called by ClampedAddSubtractHalf is only reported as covered partially. I think that's the problem.
I will improve the test, so it will be fully covered.

We should be running any decoder/encoder tests using FeatureTestRunner with which we can disable Avx/Sse etc.

There is a test RunPredictor13Test which uses the FeatureTestRunner to run with and without intrinsics. For context: Predictor13() calls ClampedAddSubtractHalf().

There are some paths not covered, for example BundleColorMap() and SubtractGreenFromBlueAndRedScalar()

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 25, 2023

That's a bit strange. When the private method AddSubtractComponentHalf is covered, and that method had the bug, then it should have shown up. Or is the test -- ran via FeatureTestRunner -- only executing some inputs?
I'm new to the test infrastructure here, so I have to ask?

Besides that I wonder why so many tests (10+) fail when SSE is disabled.

@brianpopow
Copy link
Collaborator

The issue is in Clip255, this method is called by AddSubtractComponentHalf
and is not fully covered. Only the path a < 256 == true is covered, not the case a < 256 == false
This hides the error. This means we need two test cases for RunPredictor13Test one for a < 256 == true and one for a < 256 == false.
The when you revert back your commit b694956 to have the issue, it will show up as a failed test.

FeatureTestRunner allows us to run tests with disabled intrinsics features.
For example:

FeatureTestRunner.RunWithHwIntrinsicsFeature(
            RunTest,
            count,
            HwIntrinsics.AllowAll | HwIntrinsics.DisableAVX2);

This will run RunTest with hardware intrinsics once and once without AVX2 (assuming the host cpu has AVX2).

Besides that I wonder why so many tests (10+) fail when SSE is disabled.

I had no chance yet to look into those failing tests, but we need to fix them all.

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 26, 2023

FeatureTestRunner is a nice solution. Thanks for the explanaition 👍🏻

Is the cost of running of a whole CI-leg with SSE disabled too high?
Or can this be run only on a scheduled basis?
So all code-paths w/o vectorization are touched, and no one has to think about adding a test-case with the FeatureTestRunner.

@JimBobSquarePants
Copy link
Member

To cover all the intrinsic paths on 4 OS (linux is counted 2x now we have ARM) for 2X target frameworks the build matrix explodes to something like 28 images now.

FeatureTestRunner was my workaround to keep numbers low and also so we could get a detailed code coverage report from a single entry in the build matrix.

I don't think there's a way to keep our coverage report intact while splitting out the features into separate matrix entries.

I would suggest we ensure that all decoder/encoder methods use FeatureTestRunner for Avx, Sse, AdvancedSimd, and scalar runs. It's a chore but absolutely necessary.

@gfoidl
Copy link
Contributor Author

gfoidl commented Mar 26, 2023

the build matrix explodes

This should be avoided, for sure.
I'd do only +1, so run a additional ubuntu-latest with DOTNET_EnableSSE=0 set. So that scalar pathes are touched at least once (for sure).
If there's a test missing for running via FeatureTestRunner code will be covered too.

I'm not that familiar with GH-Actions (I'm using Azure Pipelines mostly) but I think that should be possible w/o exploding the matrix.

FeatureTestRunner is good and should be kept, as tests can be run on other operating systems too quite easily.
(BTW: your implementation of that is really nice).

@antonfirsov
Copy link
Member

antonfirsov commented Mar 27, 2023

I can confirm this. I see 133 failing tests going scalar-only on my machine. For example, this is the output when decoding testorig12.jpg:

DecodeBaselineJpeg_Rgb24_testorig12

I recommend fixing these one-by-one, adding the missing test cases for each individual bug in the process.

While FeatureTestRunner is good tool, using it systematically in every single integration test would be very expensive. I think adding one test leg to cover the scalar case on one OS, as #2414 (comment) suggests, would be valuable.

@stefannikolei
Copy link
Contributor

After merging #2416 many errors will be solved. Probably the image you have posted will render correct.

@antonfirsov
Copy link
Member

antonfirsov commented Mar 27, 2023

No, this image is YCbCr, looks like we have more lurking bugs.

@br3aker
Copy link
Contributor

br3aker commented Mar 29, 2023

I can confirm this. I see 133 failing tests going scalar-only on my machine. For example, this is the output when decoding testorig12.jpg:

DecodeBaselineJpeg_Rgb24_testorig12

Example looks like a zig-zag algorithm error, you've tested those with FeatureTestRunner, right?

edit
Never mind, zig-zag should be covered by tests.

@stefannikolei
Copy link
Contributor

stefannikolei commented Mar 29, 2023

I went a bit back in history and could identify the problematic PR which introduced this error.

What I know so far is that the tests are failing after #2076 got merged

@br3aker @antonfirsov

@brianpopow
Copy link
Collaborator

brianpopow commented Mar 31, 2023

Now on the main branch there are 39 failing tests left for dotnet6.0 and it seems
some of them are not present in dotnet7.0. There are 24 failing tests on dotnet7.0.

The difference seem to come from png test which have text chunks and call Latin1Encoding.GetString().
Stack trace (shortend for brevity):

Message:
System.PlatformNotSupportedException : Operation is not supported on this platform.

      Stack Trace: 
Sse2.LoadScalarVector128(UInt64* address)
Latin1Utility.WidenLatin1ToUtf16_Sse2(Byte* pLatin1Buffer, Char* pUtf16Buffer, UIntPtr elementCount)
<>c.<GetString>b__29_0(Span`1 chars, ValueTuple`2 args)
String.Create[TState](Int32 length, TState state, SpanAction`2 action)
Latin1Encoding.GetString(Byte[] bytes)
PngDecoderCore.TryUncompressTextData

Other failing tests are tiff images which use jpeg compression, example:

TiffDecoder_CanDecode_JpegCompressed_Rgba32_Issue2123

Other failing tests seems to have something todo with resizing, for example GifDecoder_Decode_Resize
does not match expectation by a small amount (Total difference: 0,0001%)

@br3aker
Copy link
Contributor

br3aker commented Apr 2, 2023

I've found a nasty bug, not related to the color converter pipeline.
PR is in the works.

@br3aker
Copy link
Contributor

br3aker commented Apr 2, 2023

Test run with DOTNET_EnableSSE=0 for linked PR build:

image

All jpeg tests are passing now.

@brianpopow
Copy link
Collaborator

All jpeg tests are passing now.

@br3aker thanks for looking into this issue!

@brianpopow
Copy link
Collaborator

With #2427 merged, 15 test remain failing on the current main branch now.

Two failing tests are webp encoding tests. I think there is an issue that the AVX version of Vp8Residual.GetResidualCost() does not match the scalar version. I am trying to figure out what's wrong there.

P.S. I whish there would be an option not to close linked issues when PR are merged. I could not find any.

@JimBobSquarePants
Copy link
Member

The only way to avoid closing is not to use the closing keywords when linking to issues.

@brianpopow
Copy link
Collaborator

I cant figure out what's going wrong with those png encoder tests. As mentioned above those failing tests seem to be related to decoding text chunks.

Here is an example stack trace from  PngEncoderTests.Chunks.cs:

Message: 
System.PlatformNotSupportedException : Operation is not supported on this platform.

      Stack Trace: 
Sse2.LoadScalarVector128(UInt64* address)
Latin1Utility.WidenLatin1ToUtf16_Sse2(Byte* pLatin1Buffer, Char* pUtf16Buffer, UIntPtr elementCount)
<>c.<GetString>b__29_0(Span`1 chars, ValueTuple`2 args)
String.Create[TState](Int32 length, TState state, SpanAction`2 action)
Latin1Encoding.GetString(Byte[] bytes)
PngDecoderCore.TryUncompressTextData(ReadOnlySpan`1 compressedData, Encoding encoding, String& value) line 1385
PngDecoderCore.ReadCompressedTextChunk(ImageMetadata baseMetadata, PngMetadata metadata, ReadOnlySpan`1 data) line 1065
PngDecoderCore.Decode[TPixel](BufferedReadStream stream, CancellationToken cancellationToken) line 195
ImageDecoderUtilities.Decode[TPixel](IImageDecoderInternals decoder, Configuration configuration, Stream stream, Func`3 largeImageExceptionFactory, CancellationToken cancellationToken) line 57
ImageDecoderUtilities.Decode[TPixel](IImageDecoderInternals decoder, Configuration configuration, Stream stream, CancellationToken cancellationToken) line 43
PngDecoder.Decode[TPixel](DecoderOptions options, Stream stream, CancellationToken cancellationToken) line 38
<>c__DisplayClass0_0`1.<Decode>b__0(Stream s) line 23
ImageDecoder.<WithSeekableStream>g__PeformActionAndResetPosition|11_0[T](Stream s, Int64 position, <>c__DisplayClass11_0`1& ) line 194
ImageDecoder.WithSeekableStream[T](DecoderOptions options, Stream stream, Func`2 action) line 209
ImageDecoder.Decode[TPixel](DecoderOptions options, Stream stream) line 20
Image.Decode[TPixel](DecoderOptions options, Stream stream) line 122
<>c__DisplayClass84_0`1.<Load>b__0(Stream s) line 234
Image.WithSeekableStream[T](DecoderOptions options, Stream stream, Func`2 action) line 301
Image.Load[TPixel](DecoderOptions options, Stream stream) line 234
Image.Load[TPixel](DecoderOptions options, ReadOnlySpan`1 data) line 139
Image.Load[TPixel](ReadOnlySpan`1 data) line 120
TestFile.CreateRgba32Image() line 110
PngEncoderTests.Chunk_ComesBeforeIDat(Object chunkTypeObj) line 115

The weird thing is, when I execute a single failing test, the test runs fine and is green.

@br3aker
Copy link
Contributor

br3aker commented Apr 23, 2023

Wild guess: maybe tiered compilation (rejitter?) doesn't respect ISA environmental variables on net6? Didn't check on net7.

I've set this env variable DOTNET_TieredCompilation=0 and got following results:
image

Last failing test is related to image diff threshold, not sure whether it's related to the codegen.

Now I'm completely lost, we can't simply set this variable and go on - there's a probability it can be broken by some R2R situation? Sorry to bother you @tannergooding but I'm not sure we can solve this on our own :)

@tannergooding
Copy link
Contributor

Could I get a few more details on exactly what's being set and what OS/architecture you're running on?

@br3aker
Copy link
Contributor

br3aker commented Apr 24, 2023

what OS/architecture you're running on

Win11/x86, no emulation

what's being set

We are trying to test all ImageSharp tests for scalar path so I've set <DOTNET_EnableSSE>0</DOTNET_EnableSSE> in .runsettings for emulating cpu without sse support at all.

We got some flaky tests which end up throwing System.PlatformNotSupportedException : Operation is not supported on this platform.. If these tests are executed among other tests - everything breaks due to this call: Sse2.LoadScalarVector128(UInt64* address) from mscorlib, you can check exact stacktrace from the message above mine.
But if these flacky tests are launched in isolation - they successfully pass. I immediately thought about tiered jitting and decided to turn it off with <DOTNET_TieredCompilation>0</DOTNET_TieredCompilation> in .runsettings file, file looked like this:

<?xml version="1.0" encoding="utf-8" ?>
<RunSettings>
    <RunConfiguration>
        <DOTNET_EnableSSE>0</DOTNET_EnableSSE>
        <DOTNET_TieredCompilation>0</DOTNET_TieredCompilation>
    </RunConfiguration>
</RunSettings>

With this settings flaky tests do not fail, i.e. no more System.PlatformNotSupportedException from SSE calls.

@tannergooding
Copy link
Contributor

This sounds like a bad interaction between EnableSSE=0 and ReadyToRun. Does setting DOTNET_ReadyToRun=0 fix it?

@br3aker
Copy link
Contributor

br3aker commented Apr 24, 2023

Yep, setting DOTNET_ReadyToRun=0 fixed tests. Should we open an issue or it's intended/workaround exists?

@tannergooding
Copy link
Contributor

Please open an issue. I'm pretty sure I know what the issue is, but would be good to get it tracked explicitly as impacting the broader community

@br3aker
Copy link
Contributor

br3aker commented Apr 24, 2023

Got it, will do tomorrow, thank you for the help!

@JimBobSquarePants
Copy link
Member

I think we can go ahead and close this now that the upstream issue is fixed in .NET 7+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants