
Add SIMD versions for performance at newer CPUs #4

Closed

DTL2020 opened this issue Feb 26, 2023 · 23 comments

@DTL2020
Contributor

DTL2020 commented Feb 26, 2023

Users report that the filter is good in quality but very slow. As the current sources show, it is a C-only program, and its main processing loops at

for (int x = 0; x < width; x++)

and

for (int x = 0; x < width; x++)

walk one sample at a time.
So it would be good to put on the to-do list: make SIMD (SSE2/AVX2/AVX-512) versions of these loops that process several samples per loop iteration.
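
For illustration, a rough sketch of what such a loop could look like (not the plugin's actual kernel - the absolute-difference body is just a placeholder for the real filter arithmetic):

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstdlib>

void absdiff_row_sse2(const uint8_t* a, const uint8_t* b, uint8_t* dst, int width)
{
    int x = 0;
    for (; x + 16 <= width; x += 16)   // 16 samples per loop spin
    {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + x));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + x));
        // |a - b| for unsigned bytes: max(a, b) - min(a, b)
        const __m128i vd = _mm_sub_epi8(_mm_max_epu8(va, vb), _mm_min_epu8(va, vb));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), vd);
    }
    for (; x < width; ++x)             // scalar tail for the remaining samples
        dst[x] = static_cast<uint8_t>(std::abs(a[x] - b[x]));
}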

@Asd-g
Owner

Asd-g commented Feb 27, 2023

Ok.

Asd-g added a commit that referenced this issue Feb 28, 2023
Fix processing with float clips. (regression from 1.2.0)
Add parameter opt.
Add SSE2, AVX2, AVX-512 code. (#4)
Fix earlier exit of the scene change detection.
@DTL2020
Contributor Author

DTL2020 commented Feb 28, 2023

Version 1.2.1 already runs several times faster: on an i5-9600K it is about 3.2x faster on a 1920x1080 YV12 frame with maxr=7 and AVX2.
I am trying to finish debugging an 8-bit AVX2 processing function, and if it gets finished I will try to offer a pull request. It does not use external helper classes/libraries for SIMD.

@DTL2020
Contributor Author

DTL2020 commented Feb 28, 2023

Well - when I finally implemented the vector gathering for

weight = weightSaved[useDiff ? static_cast<int>(diff * 255.f) + v : frameIndex];

in the SIMD processing loop

for (int i{ 0 }; i < 8; ++i)

i.e. when useDiff = true and each lane reads a more or less random value from the weightSaved array indexed by diff (or diff + v), performance drops by about half (60 fps vs 120 fps when simply reading weightSaved[frameIndex] and broadcasting it to all 8 values processed in a group) and is now about equal to release 1.2.1.

It is rather strange that gathering from such a small array (256*3 floats?) causes such a big performance hit - it should be cached very well in L1D. Maybe some algorithmic change can work around this point? Maybe even runtime SIMD computation of the weights from the diff value would be faster? I will try to experiment with this more.
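
For clarity, the two paths being compared look roughly like this (an illustration with raw AVX2 intrinsics; the names follow the snippet above, the function itself is hypothetical):

#include <immintrin.h>

__m256 get_weights(__m256i diff_idx, const float* weightSaved, int frameIndex, bool useDiff)
{
    if (useDiff)
    {
        // per-lane 'gather': extract each index, do a scalar table read, repack -
        // 8 serialized loads per vector, which is the slow path measured above
        alignas(32) int   idx[8];
        alignas(32) float w[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(idx), diff_idx);
        for (int i{ 0 }; i < 8; ++i)
            w[i] = weightSaved[idx[i]];
        return _mm256_load_ps(w);
    }
    // fast path: one scalar load broadcast to all 8 lanes
    return _mm256_set1_ps(weightSaved[frameIndex]);
}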

@Asd-g
Owner

Asd-g commented Mar 2, 2023

There are two comments in this commit 7bef2d8, but I cannot see them.

@DTL2020
Contributor Author

DTL2020 commented Mar 2, 2023

That is strange - I thought they are e-mailed to the repository owner automatically.

Copying them here.
For line 49 of _AVX2.cpp:

const auto check_v{ diff < thresh };

A possible performance optimization: after computing the vector of { diff < thresh }, it can be checked whether the condition is met in no lane, in which case the whole block of lines 51..77 can be skipped (as in the original C reference). Checking whether the mask is all zeroes or all ones is fast enough - move the mask from SIMD to an integer with _mm256_movemask_epi8() and test it against zero or 0xFFFFFFFF. An example is https://github.com/DTL2020/AviSynth-vsTTempSmooth/blob/2e15314a55227543d4bb9f5078b3b70aba2eb17d/src/vsTTempSmooth.cpp#L217

So if none of the samples in a group of 8 meets the condition, there is no need to do the SIMD processing at all, and the initial values of weights{} and sum{} can be passed on to the next program block. The same applies to all the other SIMD versions of the functions.
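
A sketch of that check with raw AVX2 intrinsics (the plugin uses VCL types, so the exact spelling there will differ):

#include <immintrin.h>

// Returns 0 if no lane passes (skip the whole weighted-accumulation block),
// 1 if every lane passes (unmasked fast path), 2 for a mixed group.
static inline int classify_check(__m256i diff, __m256i thresh)
{
    const __m256i check_v = _mm256_cmpgt_epi32(thresh, diff);   // diff < thresh, per lane
    const int mask = _mm256_movemask_epi8(check_v);
    if (mask == 0)  return 0;
    if (mask == -1) return 1;   // all 32 mask bits set, i.e. 0xFFFFFFFF
    return 2;
}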

For line 53 (and 54) of _AVX512.cpp (the same applies to lines 72..73 and all the other vector-gathering operations of this kind).
In the current 1.2.2 sources it is

weight.insert(i, _weight[l][(useDiff) ? (diff.extract(i) >> _shift) : frameIndex]);

and

weight.insert(i, _weight[l][(useDiff) ? ((diff.extract(i) >> _shift) + v) : frameIndex]);

Two more optimization ideas (a sketch of the first one follows the list):

1. Maybe hardware gather instructions can be used for this operation? They exist in AVX-512 and a few also in the AVX2 instruction set. This will not bring a huge benefit (the gather still reads from caches and from different cache lines), but it may add something.
2. A more strategic idea: redesign this part to compute the function weight = f(diff >> _shift) at runtime for every element of the SIMD vector, entirely in SIMD, and skip the weights table completely. It may take a bit more computation, but it removes this SIMD-unfriendly random gathering from memory, which causes a significant penalty and greatly limits the total SIMD benefit. The SIMD versions of the processing could then be widened to use the whole register file (like 32..64 or more samples per pass in the AVX-512 version) for better performance. Also, if the current weight = f(diff) is too slow to compute at runtime in SIMD, a simpler function for motion-compensated input clips could be added as a new processing option (as far as I can see, the plugin is used with motion-compensated input in the MCTD.avsi script via mvtools -> MCompensate()).
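
A sketch of idea 1 with raw gather intrinsics, assuming a float LUT and ready 32-bit indices:

#include <immintrin.h>

// AVX2: 8 table reads with one vgatherdps (scale 4 = bytes per float)
static inline __m256 gather_weights_avx2(const float* table, __m256i idx)
{
    return _mm256_i32gather_ps(table, idx, 4);
}

// AVX-512: 16 table reads with one vgatherdps
static inline __m512 gather_weights_avx512(const float* table, __m512i idx)
{
    return _mm512_i32gather_ps(idx, table, 4);
}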

@DTL2020
Contributor Author

DTL2020 commented Mar 3, 2023

Also, after changing the 'scalar' vector gathering to a hardware SIMD gather instruction, a good next optimization step is to tune the single-pass 'workunit size' for each architecture (AVX2 and AVX-512).

I think the current versions, with a WU size of 8 samples for AVX2 and 16 for AVX-512, do not use the whole register file and could be increased.
Also, the latency of the gather instruction is large ( https://www.laruence.com/sse/#techs=AVX2,AVX_512&text=gath&expand=3007,3011,3010 - see for example _mm512_i32gather_ps ), but its throughput (CPI) is about 3 times better, so it should pay off to issue several gather instructions in a row.
Gathering more random LUT entries in a single (or nearby) operation may also hit the same cache line more often and improve performance.

So a test version with a larger WU may look like this (for AVX-512 x += 64, for AVX2 x += 16 (?)):
for (int x{ 0 }; x < width; x += 64)
{
const auto& c01{ load(&pfp[_maxr][x]) };
const auto& srcp_v01{ load(&srcp[_maxr][x]) };

        const auto& c02{ load<T>(&pfp[_maxr][x+16]) };
        const auto& srcp_v02{ load<T>(&srcp[_maxr][x+16]) };

        const auto& c03{ load<T>(&pfp[_maxr][x+32]) };
        const auto& srcp_v03{ load<T>(&srcp[_maxr][x+32]) };

        const auto& c04{ load<T>(&pfp[_maxr][x+48]) };
        const auto& srcp_v04{ load<T>(&srcp[_maxr][x+48]) };


        Vec16f weights01{ _cw };
        Vec16f sum01{ to_float(srcp_v01) * weights01 };

        Vec16f weights02{ _cw };
        Vec16f sum02{ to_float(srcp_v02) * weights02 };

        Vec16f weights03{ _cw };
        Vec16f sum03{ to_float(srcp_v03) * weights03 };

        Vec16f weights04{ _cw };
        Vec16f sum04{ to_float(srcp_v04) * weights04 };


        int frameIndex{ _maxr - 1 };

        if (frameIndex > fromFrame)
        {
            auto t1_01{ load<T>(&pfp[frameIndex][x]) };
            auto diff01{ abs(c01 - t1_01) };
            const auto check_v01{ diff01 < thresh };

            auto t1_02{ load<T>(&pfp[frameIndex][x+16]) };
            auto diff02{ abs(c02 - t1_02) };
            const auto check_v02{ diff02 < thresh };

            auto t1_03{ load<T>(&pfp[frameIndex][x+32]) };
            auto diff03{ abs(c03 - t1_03) };
            const auto check_v03{ diff03 < thresh };

            auto t1_04{ load<T>(&pfp[frameIndex][x+48]) };
            auto diff04{ abs(c04 - t1_04) };
            const auto check_v04{ diff04 < thresh };


            Vec16f weight01;
            Vec16f weight02;
            Vec16f weight03;
            Vec16f weight04;

	if(useDiff)
	{
		weight01 = lookup<16>(diff01 >> _shift, &_weight[l][0]); // should compile into vgatherdps instruction
		weight02 = lookup<16>(diff02 >> _shift, &_weight[l][0]); // should compile into vgatherdps instruction
		weight03 = lookup<16>(diff03 >> _shift, &_weight[l][0]); // should compile into vgatherdps instruction
		weight04 = lookup<16>(diff04 >> _shift, &_weight[l][0]); // should compile into vgatherdps instruction
	}
	else
	{
		weight01=weight02=weight03=weight04=/*(broadcast)? */ _mm512_set1_ps(_weight[l][frameIndex]);
	}

and so on for the while (frameIndex > fromFrame) cycle too.

As a further memory-bandwidth optimization - maybe try FP16 instead of FP32 for the weights LUT? It makes the table twice as compact, increases the cache-line hit rate and lets the caches serve both the table and the frame data better. The FP16/FP32 pack/unpack operations should be fast enough: packing happens once at LUT creation, and only the unpack after the weight-vector gather runs at runtime.
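
A sketch of the FP16 idea, assuming F16C support and a hypothetical 16-bit table built once at filter construction:

#include <immintrin.h>

// Pack once at LUT creation: 8 FP32 weights -> 8 FP16 values.
static inline __m128i pack_weights_fp16(const float* w)   // w[0..7]
{
    return _mm256_cvtps_ph(_mm256_loadu_ps(w), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}

// Unpack at runtime, after gathering 8 16-bit LUT entries into 'gathered16'.
static inline __m256 unpack_weights_fp16(__m128i gathered16)
{
    return _mm256_cvtph_ps(gathered16);   // FP16 -> FP32 for the 8 lanes
}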

@Asd-g
Owner

Asd-g commented Mar 4, 2023

From a quick test (of the new version 1.2.3) with 1080p 8-bit: AVX2 ~55% fps improvement, AVX-512 ~240% fps improvement.

@DTL2020
Contributor Author

DTL2020 commented Mar 4, 2023

Oh - 240% is a good benefit. AVX-512 is starting to show its 'internal multithreading' power. Maybe AVX-512-capable chips also have a visibly improved gather path in the memory subsystem compared to AVX2. I added another note with ideas in the current commit (line 88 of AVX512.cpp) - I hope you can see it?

Fine-tuning the workunit size per SIMD pass for each architecture is a slow process for enthusiasts - manually adjusting lines of source text and testing each build. I will try to make the current version compilable with Visual Studio 2019 (I hope your VCL is compatible with it?) and test different workunit sizes for AVX2 and AVX-512 to see whether it helps performance further. Unfortunately there still seem to be no better program-design tools that make it easy to adjust the processing data size of SIMD programs, like changing a single number and recompiling.

A general note on the current version - is it still missing an 'epilogue' for rows whose width is not a multiple of the 'main' SIMD workunit size (64 columns for AVX-512 8-bit in the current commit)? I only see the SIMD parts of the loops, so the last columns (not mod 64 for AVX-512, for example) are not processed? Adding one extra full-SIMD read/write pass for the remaining columns risks a buffer-overrun protection error or memory corruption. So it looks like the simple answer is to process the last columns of a row per-sample (not SIMD), which is slow. Making the program even more complex by adding tail processing of progressively smaller granularity is hard on the programmer (for example, if the remaining columns are >16 for AVX-512, add a 16-column AVX-512 part, and if the residual is still <16 and >0, add a single-sample epilogue). A generic skeleton of this structure is sketched at the end of this comment.

Also, for possibly better AVX2/AVX-512 performance it would be good to test whether the start addresses of the rows processed in the SIMD part are 32-byte (AVX2) or 64-byte (AVX-512) aligned, so that all load/store operations can use aligned instructions (I am not sure how that is expressed in VCL syntax, or whether it is even possible in the current VCL version). I read some notes on Stack Overflow about a significant penalty on AVX-512 for non-64-byte-aligned loads/stores that cross cache lines, something like 30% (per operation, maybe). Does current AviSynth provide row start addresses in frame buffers aligned to 32/64 bytes?
If not, a possible solution is to add a 'prologue' (single-sample processing, for simplicity) until the next column's start address is 32-byte (AVX2) or 64-byte (AVX-512) aligned, and only then start the SIMD part (also making sure the VCL-based build is forced to use aligned load/store operations for frame-buffer access - as I see, VCL has load_a and store_a for aligned instructions, like:

// Member function to load from array, aligned by 64
// You may use load_a instead of load if you are certain that p points to an address divisible by 64
Vec16f & load_a(float const * p) {
    zmm = _mm512_load_ps(p);
    return *this;
}

from vectori512.h:
// Member function to load from array, aligned by 64
// You may use load_a instead of load if you are certain that p points to an address
// divisible by 64, but there is hardly any speed advantage of load_a on modern processors
Vec512b & load_a(void const * p) {
    zmm = _mm512_load_si512(p);
    return *this;
}
The note about the speed advantage of aligned load/store is something to test on different AVX2/AVX-512 chip generations (Intel and AMD).
).
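
For reference, a generic skeleton of the prologue / aligned main loop / epilogue structure described above (a sketch only; process_one and process16_aligned are hypothetical kernels, and only the source pointer's alignment is handled here for brevity):

#include <cstdint>

float process_one(float s);                           // hypothetical scalar kernel
void  process16_aligned(const float* s, float* d);    // hypothetical aligned AVX-512 kernel

void process_row(const float* src, float* dst, int width)
{
    int x = 0;

    // prologue: scalar samples until src + x is 64-byte aligned
    while (x < width && (reinterpret_cast<std::uintptr_t>(src + x) & 63) != 0)
    {
        dst[x] = process_one(src[x]);
        ++x;
    }

    // main loop: aligned SIMD, 16 floats per pass (load_a/store_a usable here)
    for (; x + 16 <= width; x += 16)
        process16_aligned(src + x, dst + x);

    // epilogue: scalar tail for the last (width mod 16) samples
    for (; x < width; ++x)
        dst[x] = process_one(src[x]);
}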

@Asd-g
Owner

Asd-g commented Mar 4, 2023

VCL2 is OK for Visual Studio 2019.

The new frame is 64-byte aligned by default (you can specify the alignment).
The source frame is not always aligned. Here is an example of making it aligned.

Here I used VecX.load()/store(), which work with unaligned data. There are VecX.load_a()/store_a()/store_nt() for aligned arrays.

If the data is already aligned, using unaligned load/store has the same performance. With the current implementation, processing a 1920px width is ~6% faster than a 1918px width.

@DTL2020
Contributor Author

DTL2020 commented Mar 4, 2023

"Here example for making it aligned."

BitBlt unfortunately full frame copy operation and may significantly impact the performance. So for operating with unaligned source buffers from avisynth looks only special prologue is working (also if we have several input buffers at once it all must be aligned at once or even prologue-based pre-processing can not align all buffers at once as required ?).

May be simply make separate copy of function for initially correctly aligned buffers and only check at startup if alignment condition is met and select which version of processing function to start ? Also we can ask AVS developers to provide aligned storage if possible (and other plugins developers - nice to have feature ?) . So the both start of buffer and start of each row must be aligned (so the pitch need adjusted in such way so each start of each row will be aligned too).

"The new frame is 64 aligned by default (you can specify alignment)."

Do environment also align pitch of each row so each row start address is also aligned as required ?

So for storage always store_a (aligned) may be used now ?

@Asd-g
Owner

Asd-g commented Mar 4, 2023

Maybe simply make a separate copy of the function for initially correctly aligned buffers, check at startup whether the alignment condition is met, and select which version of the processing function to run?

That function already checks whether every plane is aligned. If any plane isn't aligned, it copies the data to a new aligned buffer; otherwise it does nothing.

Does the environment also align the pitch so that each row's start address is aligned as required?

Yes.

@DTL2020
Contributor Author

DTL2020 commented Mar 4, 2023

Also, an additional bool template parameter could be added, say 'useAload', and before calling the processing function we could check whether all source alignment conditions are met (is the input clip from a new enough AVS environment with all rows properly aligned?) and then call the version of the processing that uses aligned SIMD loads (a rough sketch is at the end of this comment).

"If any plane isn't aligned, it copies the data to a new aligned buffer; otherwise it does nothing."

I think that may not be the best way for a relatively simple filter like vsTTempSmooth - the penalty of the copy may outweigh the benefit of aligned loads. Maybe simply add an alignment-checking function before calling the processing function, and if all input frames are OK, use the aligned-load templated version?

The output of the processing can always use aligned stores and avoid partially used cache lines (so more cache lines hold useful data).

So it is the source feeding vsTTempSmooth that should be responsible for providing row starts aligned for up to AVX-512 (64 bytes), and then everything runs more nicely. I hope plugins like RIFE (and the internal AVS Convert() functions) already provide correctly aligned buffers?
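
A rough sketch of that dispatch (names like useAload and filter_frame are illustrative, not the plugin's actual interface):

#include <cstdint>

template <bool useAload>   // true -> load_a/store_a inside, false -> load/store
void filter_frame(const uint8_t* const* srcp, const int* src_pitch,
                  uint8_t* dstp, int dst_pitch, int width, int height);

static inline bool is_aligned64(const void* p, int pitch)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 63) == 0 && (pitch & 63) == 0;
}

// at the call site: if every plane pointer and pitch passes is_aligned64(),
// call filter_frame<true>(...), otherwise fall back to filter_frame<false>(...)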

@Asd-g
Owner

Asd-g commented Mar 6, 2023

load (unaligned) and load_a (aligned) should have the same performance (assuming modern CPU) when the input data is aligned.

RIFE should have aligned output. If the plugin is using NewVideoFrame instead of MakeWriteable (copying the input frame), the output should be aligned.

@Asd-g Asd-g closed this as completed Mar 10, 2023
@DTL2020
Contributor Author

DTL2020 commented Mar 11, 2023

Heh - I see you closed this as completed, but

for (int x{ 0 }; x < width; x += 64)

still has no epilogue for non-mod-64 widths? The same goes for the other multi-samples-per-pass SIMD functions that were added. I assume you are waiting until no more significant SIMD additions are coming and will add the epilogue later? Right now it looks like the end of each row will not be processed if the row width is not evenly divisible by the SIMD workunit size?

So a frame with width 1920 (1920/64 = 30) is processed completely, but for example a width of 1920+35 ((1920+35)/64 = 30.54...) would leave the last 35 samples of each row unprocessed?

@Asd-g
Owner

Asd-g commented Mar 11, 2023

There are no unprocessed row ends.

For your example, ...[x + 32] will be partially loaded with zeroes, and ...[x + 48] will be all zeroes.
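
For reference, one explicit way to get the same behaviour with AVX-512 mask registers (a sketch under the assumption of a float source; not necessarily how the repository does it):

#include <immintrin.h>

// Load up to 16 floats; lanes past 'remaining' come in as zeroes, so they add
// nothing to the weighted sum and no bytes past the end of the row are read.
static inline __m512 load_tail_ps(const float* p, int remaining)
{
    const __mmask16 m = (remaining >= 16)
                      ? static_cast<__mmask16>(0xFFFF)
                      : static_cast<__mmask16>((1u << remaining) - 1u);
    return _mm512_maskz_loadu_ps(m, p);
}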

@DTL2020
Contributor Author

DTL2020 commented May 1, 2023

An addition about superscalar programming for SIMD:
Many modern CPUs (probably since the 199x years already) have some limited capability for superscalar execution, meaning that in some cases more computation can be performed in the same number of clock ticks. In the chip design this is implemented with several dispatch ports able to execute the same instruction. So the total compute performance of a CPU is roughly
Number_of_Cores x SIMD_datawidth x Superscalarity_factor

The number of cores and the maximum SIMD word width are clearly visible from the CPU hardware configuration and SIMD family (64/128/256/512 bit). The superscalarity factor depends on the chip design, on the instruction, and on the number of dispatch ports able to execute that instruction. For some instruction groups the factor is 2 or more and is noted directly in the CPU specs - like the 2 FMA units in some Xeons.

Generally the superscalarity factor is >1, and for some instructions on some chips it may reach 3. It can be found in the CPU documentation as the throughput (CPI) per instruction: if CPI < 1 the instruction executes in 1 clock tick and 2 or more dispatch ports are available, so a CPI of 0.5 means 2 dispatch ports and a CPI of 0.33 means 3 dispatch ports.

The required conditions for superscalar execution:

  1. The data being computed must not have dependencies.
  2. The data should preferably already be in the register file (reading memory, even the L1D cache, is too slow).
  3. There must be 2 or more free dispatch ports that support the instruction.

Example of a program that can execute superscalar:
a=b+c
d=e+f

Not possible (data dependent):
a=b+c
d=a+e

So in SIMD programming, to benefit from superscalarity it is good to group large workunits of data (several SIMD words) and, if they are independent, group several compute instructions to process that data. The CPU's instruction decode unit can then recognize this part of the program as superscalar-ready and route the operations to several free, capable dispatch ports.

Example of a processing loop with little or no superscalar friendliness:
for (int i=0; i < N; i++)
{
data_A=load(mem_A+i)
data_B=load(mem_B+i)
result=data_A+data_B
store(dst+i, result)
}

It uses SIMD but processes only one SIMD word per loop iteration. If the program designer is lucky with the compiler, it may unroll this loop into a more superscalar-friendly form, but that depends on the compiler.

A more explicitly superscalar way to program it is:
for (int i=0; i < N; i+=2)
{
dataA1=load(mem_A+i)
dataA2=load(mem_A+i+1)
dataB1=load(mem_B+i)
dataB2=load(mem_B+i+1)
result1=dataA1+dataB1
result2=dataA2+dataB2
store(dst+i, result1)
store(dst+i+1, result2)
}

This uses a superscalarity factor of 2 if the add instruction is supported on 2 or more dispatch ports. There are also fewer bus direction switches between loading and storing data. As CPU design progresses, the superscalarity factor of more and more instructions can be expected to increase (maybe to 4 and beyond), so it may be advisable to design SIMD programs that can feed 4 or more dispatch ports in the same computation (depending on the available register file space and other factors).

The C source for superscalar computing is not very pretty, with lots of repeated blocks - maybe it can be compacted with language tools into a more compact form.
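
For example, a hedged sketch of one way to compact it, keeping the independent accumulators in a small array and letting the compiler fully unroll a fixed-trip-count inner loop (VCL's Vec8f is used as the example type):

#include <array>
#include "vectorclass.h"   // Agner Fog's VCL, as already used by the plugin

template <int Unroll>
void sum_rows(const float* a, const float* b, float* dst, int n)
{
    constexpr int step = Unroll * 8;   // Vec8f holds 8 floats
    int i = 0;
    for (; i + step <= n; i += step)
    {
        std::array<Vec8f, Unroll> ra, rb;   // independent accumulators -> superscalar friendly
        for (int u = 0; u < Unroll; ++u)
        {
            ra[u].load(a + i + u * 8);
            rb[u].load(b + i + u * 8);
        }
        for (int u = 0; u < Unroll; ++u)
            (ra[u] + rb[u]).store(dst + i + u * 8);
    }
    for (; i < n; ++i)                      // scalar tail
        dst[i] = a[i] + b[i];
}

// usage: sum_rows<4>(a, b, dst, n);  // 4 x 8 floats per pass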

@DTL2020
Contributor Author

DTL2020 commented May 13, 2023

Another addition on high-performance programming:

It looks like the dispatch ports of a core do not support the full range of instructions directly but are designed as a sort of FPGA, reloading their compute configuration to support all required instructions.

So when the instruction decoder sees a new instruction it:

  1. Finds a free dispatch port supporting this instruction.
  2. Checks whether the port is configured to dispatch it.
  3. If the port is not configured, loads the configuration (takes several clock ticks).
  4. Routes the data and instruction to the port.

So an instruction has 2 performance parameters: latency and throughput. Latency applies when it is the first instruction in a sequence and no configured dispatch port is ready, so the first result is ready only after 'latency' clock ticks. If there are several identical instructions in a sequence, they can be pipelined through an already-ready dispatch port at 'throughput' speed. So it is good to arrange many identical instructions in large groups to run at the throughput rate. Good compilers should do this from intrinsics- and VCL-based C code if enough data to compute is prepared.

@DTL2020
Contributor Author

DTL2020 commented May 15, 2023

Can you try to make an LLVM build of this new simple plugin - https://github.com/DTL2020/ConvertYV12toRGB ? LLVM typically seems to produce much faster builds of SIMD programs, and I have not set up the LLVM build tools yet; I only have VS2019 and IC 19. It would be best to make both an AVX2-targeted build and an AVX-512-targeted build (so the compiler can use the larger AVX-512 x64 register file, avoid spilling temporaries to the L1D cache, and get the maximum possible performance out of the AVX2 SIMD code). I am not sure whether that hand-written non-ASM program runs out of the available AVX2 register file. I do not know how to contact you directly by e-mail, so I am using GitHub e-mail notifications.

@Asd-g
Owner

Asd-g commented May 19, 2023

_clang_avx2 - arch:avx2
_clang_avx512 - arch:avx512
_icx_avx2 - arch:avx2 (intel c++ compiler 2023)
_icx_avx512 - arch:avx512 (intel c++ compiler 2023)

DecodeYV12toRGB.zip

@DTL2020
Contributor Author

DTL2020 commented May 19, 2023

Thank you. We will check whether it performs better compared with the VS2019 builds.

Also, pinterf wrote that the AVS+ internal dematrix uses 32-bit intermediate values, which gives better precision. So I will try to make a second version of the processing with 32-bit intermediates later and will ask for builds one more time.

It also shows we can have two versions of the processing with different performance/quality trade-offs. If both are implemented, the user can select the quality/performance balance via processing-function parameters, choosing either higher fps or higher quality.

@DTL2020
Contributor Author

DTL2020 commented Jun 1, 2023

Yes - both LLVM builds run a bit faster compared with the VS2019 builds.

As I do not know another way to send a message, I will post one more strategic request for AVS plugin development support here: the main motion-search engine for AVS is mvtools (a sort of extension to the AVS core, the motion-search supplement used by very many scripts). But the last programmer making 'stable' builds was pinterf in the 201x years. In the 202x years pinterf appears to have little time for AVS plugins and only makes some additions to the AVS core. In 2019..2022 and partially 2023 I added many new performance/quality features to the post-2.7.45 builds in my GitHub, but my programming skills/time are limited, so that branch runs somewhat unstable with some blocksize/bitdepth combinations, and most of the new features were only tested in my own workflow of 8x8 block size and YV12. Users nowadays like to use 16x16 and larger block sizes and 16-bit for UHD/HDR. So can the small community of AVS plugin users ask you to take mvtools, port as many new features as possible from my version into a new 'stable' build, and release it for script developers and users (Dogway is known for the SMDegrain script, which makes MAnalyse/MDegrain settings easy)? If some payment is required, I think we can open a crowdfunding project to collect the required sum. Development of 'stable' mvtools releases stopped at version 2.7.45 in the 201x years, and it uses quite outdated MAnalyse/MDegrainX filters (though it supports most blocksize/bitdepth combinations).

I will try to make a spreadsheet (OpenOffice-based or MS Excel compatible) listing all the new features for the post-2.7.45 version and their implementation status. The idea was for pinterf to pick some of the simpler and most-needed features to port to his 'stable' branch. The list of new features is about 40 items today (and some are not small, but whole new processing modes for future development). Also, in 2023 it became apparent that block-based denoising in mvtools and sample-based denoising in vsTTempSmooth could be combined as additional processing modes for MDegrainN, for example for blocks where motion compensation is very poor (not achievable in a good way even with a small block size, or for other reasons), so for such blocks the denoise engine could fall back to sample-based processing as designed in vsTTempSmooth. So these projects are interconnected in some way.

I understand that the complexity of the mvtools parts used for denoising (MAnalyse + MDegrainN with lots of new features) is comparable to an MPEG encoder engine, and it would be good to have a team of developers (as we see with the x264 MPEG encoder), but it looks like what remains of the video-processing community pays too little attention to denoising before MPEG encoding and mostly puts its effort into MPEG encoder development (though, as I see, the AOM AV1 encoder finally includes an advanced and somewhat user-configurable temporal denoiser, which may still be much weaker than current mvtools/MDegrainN in my post-2.7.45 builds). So we have almost no developers able to support and extend the newly added features across all supported processing modes (block size / bitdepth). And all my newly developed features and processing modes are poorly available to end users because of limited blocksize/bitdepth support and some instability, so Dogway, as the main intermediate script designer, reasonably does not want to spend time looking into new versions and adding them to his SMDegrain script until an 'official stable' release is available from a known, good programmer.

@Asd-g
Owner

Asd-g commented Jun 11, 2023

Don't your features require some Windows-only GPU software? Are they CPU compatible too?

Attached is the LLVM version of ConvertYUVtoRGB 0.4.1.

DecodeYUVtoRGB.zip

@DTL2020
Contributor Author

DTL2020 commented Jun 11, 2023

Only the DX12-ME feature is Windows-only (plus the optional SAD computation using a compute shader via DX12). It can be split into a separate source file, and it is optional at build time (with the DX12_ME global project define). So I typically make two builds - one requiring DX12 (it auto-loads dx12.dll or something like that, so it can only run on Win10 and later) and one not using DX12 (and thus without the option of hardware DX12-ME acceleration). All other features are CPU-only.

I added a list of new features - https://github.com/DTL2020/mvtools/blob/mvtools-pfmod/new_features_list.ods . But it turns out to be mostly features planned for implementation (only some are ready), and some already-implemented features are still not listed - they need to be collected from the release notes after 2.7.45 and added to the list too.

Also, the ME option is not really Windows-specific but GPU-accelerated (a compute shader must have some equivalent API on Unix?). So I hope that as Unix drivers progress, the hardware vendors will provide a Unix API for the same ME feature of the MPEG-encoder ASIC on GPU boards. Then the software could have two separate APIs - DX12-ME for Windows builds and something else for Linux.
Also, the VirtualAlloc memory allocation is behind an #ifdef on the WIN32 define, with a simple malloc() for other OS builds. I hope the Linux API also has some function for 4-KB page-aligned allocation? So my current sources are not Windows-only and can be built for other OSes.
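
For reference, POSIX does have page-aligned allocation, so the non-Windows branch of that #ifdef could look roughly like this (a sketch, not the actual sources):

#include <cstddef>
#if defined(_WIN32)
  #include <windows.h>
#else
  #include <cstdlib>     // posix_memalign / free
#endif

void* alloc_page_aligned(std::size_t bytes)
{
#if defined(_WIN32)
    return VirtualAlloc(nullptr, bytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
#else
    void* p = nullptr;
    if (posix_memalign(&p, 4096, bytes) != 0)   // 4 KB page alignment
        return nullptr;
    return p;   // release with free(); mmap() with MAP_ANONYMOUS is another page-aligned option
#endif
}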

Thank you for the LLVM build of that plugin. About that plugin - users report that the VS2019 build does not run on WinXP 32-bit with an AVX2 CPU and crashes with an 'illegal instruction' error. Maybe it requires an SSE2-targeted build from the C compiler? Or can AVX2 functions really not run on WinXP 32-bit even when built as a 32-bit .dll? The tested builds were made with the AVX2 target platform in the compiler settings, so the compiler may have used some instructions outside the intrinsics-based processing function, or maybe something else is not compatible with WinXP?
