Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient and unlimited implementation of fluid_ct2hz() #569

Merged
merged 9 commits into from Oct 22, 2019

Conversation

derselbst
Copy link
Member

@derselbst derselbst commented Oct 18, 2019

While working on an unlimited implementation of fluid_ct2hz() as discussed in #568, I'm now providing a less branchy and therefore more instruction-cache-friendly version, which has no upper limit (like the old one had). Also, it significantly reduces the number of floating-point comparisons, which is esp. beneficial for non-FPU systems.

Using Intel's VTune Amplifier gives the following results.

Function / Call Stack CPU Time Instructions Retired Microarchitecture Usage Comment
fluid_ct2hz_real 0.072s 312,800,000 56.0% old, legacy version
fluid_ct2hz_real 0.062s 217,600,000 65.4% version of this PR without ldexp()
fluid_ct2hz_real 0.032s 163,200,000 52.0% version of this PR with ldexp()
ldexp 0.067s 268,600,000 70.4% scalbn() and ldexp() turn out to be less efficient than arithmetic operators

Just added a test_ct2hz unit test to master to make sure this change doesn't cause trouble.


Are lookup tables even contemporary? Yes, they are!

Function / Call Stack CPU Time Instructions Retired Microarchitecture Usage Comment
fluid_ct2hz_real 0.048s 204,000,000 52.1% no lookup table, only pow()
pow 0.658s 2,692,800,000 45.8%

This reverts commit ef9e081.

"Although ldexp, scalbn and scalbln are specified to perform
the operation efficiently, on many implementations they are
less efficient than multiplication or division by a power of
two using arithmetic operators."

Which was proven by using VTune.
@jjceresa
Copy link
Collaborator

Thanks for this implementation. Just some details:

if(FLUID_UNLIKELY(res.quot > (int)(sizeof(mult)*8)))
should be
if(FLUID_UNLIKELY(res.quot >= (int)(sizeof(mult)8)))

I wonder if removing else branch will enhance performance ?
// else
{
mult = 1u << (unsigned int)res.quot;
return mult * fluid_ct2hz_tab[(res.rem)];
}

@derselbst
Copy link
Member Author

derselbst commented Oct 19, 2019

Oh, you're right, thanks.

And no, removing the else keyword does not affect performance, as it results in the same assembly being generated.

@jjceresa
Copy link
Collaborator

Something now sound wrong.
When listening a midifile played on new-ct2hz branch, high frequency are much more attenuated than playing the on master branch !.

@jjceresa
Copy link
Collaborator

Something now sound wrong....

Sorry, please ignore my previous comment. All sound fine. (I had compiled a wrong gen_conv.c)

Copy link
Member

@mawe42 mawe42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I've tried to measure the performance gain with profiling on my target platform (ARM Coretex A7 - 1GHz) but could not notice any speed improvements with the built-in profiling feature. The range of "max estimated voices" between different profiling runs spans 146 - 150, both on master and this branch. Would be interesting to see what this does to less capable platforms.

I'm not so sure about the fallback to on-the-fly calculation though. From a sound perspective, I don't think we gain anything from calculating those extremely high frequencies. But they could still happen if a Soundfont designer adds a huge amount into the (mod|vib)-lfo-to-frequency generator. Now I know that doing so would not make any sense, but neither does calculating those frequencies to exact values using lots of CPU cycles. So I would vote to choose a high cutoff and return a fixed high frequency instead.

@mawe42
Copy link
Member

mawe42 commented Oct 20, 2019

The range of "max estimated voices" between different profiling runs spans 146 - 150, both on master and this branch.

Sorry, need to revise this (I worked in the wrong directory and accidentally tested an older master checkout in both cases). Now with the proper branches, these are the real profiling outputs:

estimated maxVoices

  • master: 147 - 150
  • new-ct2hz: 143 - 145

Below are some more details. I chose the slowest run on master and the fastest run on new-ct2hz. Not sure if there are any other performance improvements on current master compared to new-ct2hz, but at least on my platform, profiling suggests that it is a slight performance loss rather than an improvement.

Maybe relevant: I'm compiling with -ffast-math -O3 and -Denable-floats=1.

master

# LD_LIBRARY_PATH=. ./fluidsynth -a alsa /data/sounds/french.sf2 -R0 -C0
FluidSynth runtime version 2.0.7
Copyright (C) 2000-2019 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

fluidsynth: warning: No preset found on channel 9 [bank=128 prog=0]
Type 'help' for help topics.

> prof_set_print 1
> prof_set_notes 120 0 1
> prof_start 2 1000
Generating 120 notes, fluidsynth: warning: Instrument not found on channel 0 [bank=0 prog=1], substituted [bank=0 prog=0]
fluidsynth: warning: Instrument not found on channel 1 [bank=0 prog=1], substituted [bank=0 prog=0]
generated voices:120
Number of measures(n_prof):2, duration of one mesure(dur):1000ms

Profiling time(mm:ss): Total=0:2  Remainder=0:2, press <ENTER> to cancel
 ------------------------------------------------------------------------------
 Duration(microsecond) and cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond)
 ------------------------------------------------------------------------------
 Code under profiling       |Voices|       Duration (microsecond)   |  Load(%)
                            |   nbr|       min|       avg|       max|
 ---------------------------|------|--------------------------------|----------
 synth_write_* ------------>|   120|   1150.00|   1190.80|   1313.00|  82.053
 synth_one_block ---------->|   120|   1141.00|   1180.82|   1302.00|  81.366
 synth_one_block:clear ---->|   120|      1.00|      2.62|     27.00|   0.181
 synth_one_block:one voice->|     1|      8.00|      9.77|    266.00|   0.674
 synth_one_block:all voices>|   120|   1137.00|   1175.10|   1298.00|  80.972
 synth_one_block:reverb --->| no profiling available
 synth_one_block:chorus --->| no profiling available
 voice:note --------------->| no profiling available
 voice:release ------------>| no profiling available
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     120|   82.053|   82.053|     0.000|    0.000|    0.675|              148

Profiling time(mm:ss): Total=0:2  Remainder=0:1, press <ENTER> to cancel
 ------------------------------------------------------------------------------
 Duration(microsecond) and cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond)
 ------------------------------------------------------------------------------
 Code under profiling       |Voices|       Duration (microsecond)   |  Load(%)
                            |   nbr|       min|       avg|       max|
 ---------------------------|------|--------------------------------|----------
 synth_write_* ------------>|   120|   1150.00|   1194.50|   1444.00|  82.308
 synth_one_block ---------->|   120|   1141.00|   1184.47|   1432.00|  81.617
 synth_one_block:clear ---->|   120|      1.00|      2.70|     30.00|   0.186
 synth_one_block:one voice->|     1|      8.00|      9.80|    266.00|   0.675
 synth_one_block:all voices>|   120|   1137.00|   1178.72|   1421.00|  81.221
 synth_one_block:reverb --->| no profiling available
 synth_one_block:chorus --->| no profiling available
 voice:note --------------->| no profiling available
 voice:release ------------>| no profiling available
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     120|   82.308|   82.308|     0.000|    0.000|    0.677|              147
Stopping 120 voices...voices stopped.
> fluidsynth: panic: An error occurred while reading from stdin.
fluid_profiling_print
fluidsynth: Estimated times: min/avg/max (micro seconds)
fluidsynth: synth_write_* ------------>: 1150.000/1194.497/1444.000
fluidsynth: synth_one_block ---------->: 1141.000/1184.466/1432.000
fluidsynth: synth_one_block:clear ---->: 1.000/2.703/30.000
fluidsynth: synth_one_block:one voice->: 8.000/9.803/266.000
fluidsynth: synth_one_block:all voices>: 1137.000/1178.718/1421.000

new-ct2hz

# LD_LIBRARY_PATH=. ./fluidsynth -a alsa /data/sounds/french.sf2 -R0 -C0
FluidSynth runtime version 2.0.7
Copyright (C) 2000-2019 Peter Hanappe and others.
Distributed under the LGPL license.
SoundFont(R) is a registered trademark of E-mu Systems, Inc.

fluidsynth: warning: No preset found on channel 9 [bank=128 prog=0]
Type 'help' for help topics.

> prof_set_print 1
> prof_set_notes 120 0 1
> prof_start 2 1000
Generating 120 notes, fluidsynth: warning: Instrument not found on channel 0 [bank=0 prog=1], substituted [bank=0 prog=0]
fluidsynth: warning: Instrument not found on channel 1 [bank=0 prog=1], substituted [bank=0 prog=0]
generated voices:120
Number of measures(n_prof):2, duration of one mesure(dur):1000ms

Profiling time(mm:ss): Total=0:2  Remainder=0:2, press <ENTER> to cancel
 ------------------------------------------------------------------------------
 Duration(microsecond) and cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond)
 ------------------------------------------------------------------------------
 Code under profiling       |Voices|       Duration (microsecond)   |  Load(%)
                            |   nbr|       min|       avg|       max|
 ---------------------------|------|--------------------------------|----------
 synth_write_* ------------>|   120|   1177.00|   1212.27|   1413.00|  83.533
 synth_one_block ---------->|   120|   1168.00|   1202.39|   1401.00|  82.852
 synth_one_block:clear ---->|   120|      1.00|      2.68|     26.00|   0.185
 synth_one_block:one voice->|     1|      8.00|      9.95|     63.00|   0.685
 synth_one_block:all voices>|   120|   1163.00|   1196.20|   1391.00|  82.425
 synth_one_block:reverb --->| no profiling available
 synth_one_block:chorus --->| no profiling available
 voice:note --------------->| no profiling available
 voice:release ------------>| no profiling available
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     120|   83.533|   83.533|     0.000|    0.000|    0.687|              145

Profiling time(mm:ss): Total=0:2  Remainder=0:1, press <ENTER> to cancel
 ------------------------------------------------------------------------------
 Duration(microsecond) and cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond)
 ------------------------------------------------------------------------------
 Code under profiling       |Voices|       Duration (microsecond)   |  Load(%)
                            |   nbr|       min|       avg|       max|
 ---------------------------|------|--------------------------------|----------
 synth_write_* ------------>|   120|   1175.00|   1218.27|   1485.00|  83.946
 synth_one_block ---------->|   120|   1166.00|   1208.31|   1470.00|  83.260
 synth_one_block:clear ---->|   120|      1.00|      2.81|     31.00|   0.194
 synth_one_block:one voice->|     1|      8.00|     10.00|    609.00|   0.689
 synth_one_block:all voices>|   120|   1161.00|   1201.98|   1464.00|  82.824
 synth_one_block:reverb --->| no profiling available
 synth_one_block:chorus --->| no profiling available
 voice:note --------------->| no profiling available
 voice:release ------------>| no profiling available
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     120|   83.946|   83.946|     0.000|    0.000|    0.690|              144
Stopping 120 voices...voices stopped.
> fluidsynth: panic: An error occurred while reading from stdin.
fluid_profiling_print
fluidsynth: Estimated times: min/avg/max (micro seconds)
fluidsynth: synth_write_* ------------>: 1175.000/1218.270/1485.000
fluidsynth: synth_one_block ---------->: 1166.000/1208.309/1470.000
fluidsynth: synth_one_block:clear ---->: 1.000/2.812/31.000
fluidsynth: synth_one_block:one voice->: 8.000/9.999/609.000
fluidsynth: synth_one_block:all voices>: 1161.000/1201.984/1464.000

@mawe42
Copy link
Member

mawe42 commented Oct 20, 2019

The culprit seems to be the div() stdlib call. If I replace that with icents / 1200 and icents % 1200, then performance of new-ct2hz is at least equal, sometimes a little higher than master.

@derselbst
Copy link
Member Author

Thanks for testing on ARM.

ARM Coretex A7

That CPU has a proper FPU and dynamic branch prediction. Doesn't surprise me that your test don't show a real difference. @carlo-bramini for instance has a Cortex M4, maybe things look different for him.

The culprit seems to be the div() stdlib call.

Oh, you are right.

Function / Call Stack CPU Time Instructions Retired Microarchitecture Usage
div 0.044s 265,200,000 69.5%

Quite disappointing that stdlib calls like div() and ldexp() perform so badly. I'll change it to unsigned integer div and mod then.

I'm not so sure about the fallback to on-the-fly calculation though. From a sound perspective, I don't think we gain anything from calculating those extremely high frequencies.

Thinking about it again, I agree. With this new implementation we have a maximum of 38100 absolute cents, which is 29,527,900,160 Hz . Nothing will ever reach this. Inaudible anyway, so yeah, use an upper limit... Now the question is where to check that upper limit?

  1. if(FLUID_UNLIKELY(cents < 0 || cents > 38100)) - FPU comparison :(
  2. res.quot = min(res.quot, 32/*bits*/) - conditional assignment :(
  3. Don't check anything and just let the shift safely wrap around - :) ?

turned out to be expensive
@mawe42
Copy link
Member

mawe42 commented Oct 20, 2019

Don't check anything and just let the shift safely wrap around - :) ?

Yes, why not? It's not any worse than the current 1.0Hz fallback, and due to the higher cents cutoff actually a little better.

@carlo-bramini
Copy link
Contributor

Excuse me, since icents is an int type, why don't you simply write something like:

quo = icents / 1200;
rem = icents % 1200;

Since the divisor is a constant value known at compile time, the compiler simplifies the calculation of the divisor and the reminder by just multiplying for its reciprocal.

@derselbst
Copy link
Member Author

Should be fine now. Unless smb. objects, I would merge this.

@mawe42
Copy link
Member

mawe42 commented Oct 22, 2019

👍

@derselbst derselbst merged commit fa7354a into master Oct 22, 2019
@derselbst derselbst deleted the new-ct2hz branch October 22, 2019 11:09
@jjceresa
Copy link
Collaborator

Please, as fac has same type then mul what could be the difference between fac and (fac & (sizeof(mult)*8u - 1u)) ?

@derselbst
Copy link
Member Author

If fac becomes 32 or more, the result of the shift would be zero. The & is just a replacement for the quick modulo operation % 32.

@jjceresa
Copy link
Collaborator

Oh yes, ofc.

@mawe42
Copy link
Member

mawe42 commented Oct 22, 2019

Maybe it would be good to extend the comment above that line to explain this?

@derselbst
Copy link
Member Author

Done.

@jjceresa
Copy link
Collaborator

I forgot that (fac & (sizeof(mult)*8u - 1u)) is the fast equivalent of fac % 12. I'm getting old :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants