Using KS with different sampling rates #10

Closed
hagenw opened this issue Feb 29, 2016 · 20 comments

@hagenw
Member

hagenw commented Feb 29, 2016

Our two knowledge sources DnnLocationKS and GmmLocationKS are both trained for a sampling rate of 16000 Hz at the moment and they are used with the following blackboard configuration:

<dataConnection Type="AuditoryFrontEndKS">
   <Param Type="double">16000</Param>
</dataConnection>

Most of our other knowledge sources were developed with 44100 Hz in mind. So my question is, will this be a problem? Is it possible to get the data with two different sampling rates from the auditory front-end in one blackboard? Or should we retrain the location knowledge sources to also use 44100 Hz?

/cc @ningma97 @chrschy @ivo--t

@ivo--t ivo--t assigned ivo--t and chrschy and unassigned ivo--t and chrschy Feb 29, 2016
@ivo--t
Member

ivo--t commented Mar 1, 2016

At the moment, only one sampling rate can be used in the blackboard system. I see the following options:

  1. We could invoke several AFE manager objects in the AuditoryFrontEndKS, one for each distinct sampling frequency (see the sketch after this list). We have to find out whether instantiating several AFE manager and data objects in parallel will lead to problems on the AFE side. @Hardcorehobel
    It will certainly lead to one problem: everything taking even longer than it already does.
  2. We could retrain the DnnLocationKS to use 44100 Hz. @ningma97
  3. We could retrain everything else to use 16 kHz. As far as we're concerned, our models currently don't use anything above 8 kHz, so a 16 kHz sampling rate should be fine, unless I'm overlooking something. Is anybody else using higher frequencies in their models? @chrschy @hagenw
    The benefit would probably be higher speed in the AFE processing.
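
To make option 1 a bit more concrete, here is a rough sketch of what two parallel AFE instantiations could look like, assuming the AFE's dataObject/manager interface (the variable names and exact parameter calls are illustrative and would need checking against the current AFE):

    % Hypothetical sketch: one AFE data object / manager pair per sampling rate.
    % earSignals is assumed to be an [nSamples x 2] matrix at 44.1 kHz.
    fsHigh = 44100;
    fsLow  = 16000;
    earSignalsLow = resample(earSignals, fsLow, fsHigh);

    % High-rate pair, e.g. for KSs that need information above 8 kHz.
    dObjHigh = dataObject(earSignals, fsHigh);
    mObjHigh = manager(dObjHigh);
    mObjHigh.addProcessor('ratemap', ...
        genParStruct('fb_lowFreqHz', 80, 'fb_highFreqHz', 16000));

    % Low-rate pair, e.g. for the localisation KSs trained at 16 kHz.
    dObjLow = dataObject(earSignalsLow, fsLow);
    mObjLow = manager(dObjLow);
    mObjLow.addProcessor('crosscorrelation');

    % Each KS would then read from the data object matching its expected rate.
    mObjHigh.processSignal();
    mObjLow.processSignal();

Whether the AFE copes with two managers living side by side like this is exactly what we would have to test first.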

Please discuss this now so that we can reach a conclusion very soon: we want to start a large round of new ADREAM scene model trainings, and it would be good to already run those at 16 kHz if that is what we will use later.

@Hardcorehobel

I would agree that 16 kHz is suitable for most applications. Although the localization models are trained with signals sampled at 44.1 kHz, they should still work at lower sampling frequencies because the ITD estimation incorporates an interpolation stage.
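
For reference, such an interpolation stage is usually a parabolic fit around the cross-correlation maximum, which yields a sub-sample lag and hence an ITD estimate that depends much less on the sampling rate. A minimal sketch of the idea (not necessarily the exact implementation used in our models):

    % Parabolic interpolation of a cross-correlation peak (illustrative only).
    % c    : cross-correlation values over integer lags
    % lags : corresponding lag values in samples
    % fs   : sampling rate in Hz
    [~, iMax] = max(c);
    if iMax > 1 && iMax < numel(c)
        y1 = c(iMax-1); y2 = c(iMax); y3 = c(iMax+1);
        delta = 0.5 * (y1 - y3) / (y1 - 2*y2 + y3);   % sub-sample offset
    else
        delta = 0;
    end
    itdSeconds = (lags(iMax) + delta) / fs;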

@hagenw
Member Author

hagenw commented Mar 1, 2016

For some of the quality evaluation work (for example, prediction of coloration) we definitely need 44.1 kHz. I would prefer to use this in all our knowledge sources.

But I can see that it may be advantageous to use 16 kHz in most cases, as testing and learning will then be faster in the DASA case. I need to think about this until tomorrow.

@Hardcorehobel

We could incorporate a switch that would allow us to resample the input depending on the task.

@ningma97
Contributor

ningma97 commented Mar 2, 2016

I would also like to work with 16 kHz since everything is so much faster.

Hagen, if prediction of coloration is the only scenario where 44.1 kHz is needed, is it possible to use the switch available in the AFE KS to specify 44.1 kHz?

Currently, when we specify 16 kHz in the AFE KS, the signal is downsampled in every block (here a block is what is defined in the binaural simulator: 4096 samples at 44.1 kHz). It would be nicer to accumulate signals to the desired length in a KS, say 0.5 seconds, and then downsample them as a whole.
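
Roughly what I mean, using Matlab's resample (the actual AFE KS code may differ; block and acc are placeholders for the incoming block and an accumulation buffer):

    % Illustrative sketch: per-block vs. accumulated downsampling, 44.1 kHz -> 16 kHz.
    fsIn  = 44100;
    fsOut = 16000;

    % Per-block: resample every 4096-sample block as it arrives.
    blockOut = resample(block, fsOut, fsIn);        % block is [4096 x 2]

    % Accumulated: collect about 0.5 s of signal first, then resample once.
    acc = [acc; block];
    if size(acc, 1) >= round(0.5 * fsIn)
        chunkOut = resample(acc, fsOut, fsIn);      % fewer calls, fewer block edges
        acc = [];
    end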

@hagenw
Member Author

hagenw commented Mar 3, 2016

I see that there would be a benefit in allowing for 16 kHz.
In that case I would say we should go for it, but create a solution that is not a hack.

First, we should summarize the current behavior and what is possible at which stage. I think @fietew should also be involved in the discussion. I will start with a few bullet points; it would be nice if everyone could add missing points:

Current behavior (please fill in the answers to the questions):

  1. Binaural simulator: What defines the sampling rate? What causes a resampling? What defines the block length the resampling is performed on?
  2. Auditory front-end: What defines the sampling rate? What causes a resampling? What defines the block length the resampling is performed on? Can all processors work with different sampling rates?
  3. Blackboard system: What defines the sampling rate? Is there a possibility for resampling? Which knowledge sources depend on a specific frequency at the moment?

Ideas for proposed behavior:

  1. The input signals define the sampling rate: the binaural simulator decides which sampling rate best suits the defined audio scene. The knowledge sources in the blackboard can ask for arbitrary sampling rates, which are provided by a resampling stage in between (where is the best location for this?).
  2. The blackboard defines the sampling rate: we have to specify for each knowledge source which sampling rate to use (there can of course be default settings). If all knowledge sources need the same sampling rate, it is passed to the binaural simulator, which performs the resampling of the scene. If the knowledge sources need different sampling rates, we have to provide resampling for particular signals in between. Again, the question is where to put this and how to make it efficient.

For the question of where we do the resampling: is there a performance difference between resampling in the binaural simulator and in the auditory front-end, or do they all use the same Matlab function? What is the influence of block length on resampling performance?
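
One way to get numbers for the last two questions would be a quick timing comparison along these lines (just a sketch with a random test signal; the absolute figures will depend on the machine and on which resampling routine the binaural simulator and the AFE actually use):

    % Rough timing: resampling one long signal vs. resampling per 4096-sample block.
    fsIn = 44100; fsOut = 16000;
    x = randn(10 * fsIn, 2);                % 10 s stereo test signal
    blockLen = 4096;

    tic;
    yWhole = resample(x, fsOut, fsIn);
    tWhole = toc;

    tic;
    for k = 1:floor(size(x, 1) / blockLen)
        idx = (k-1)*blockLen + (1:blockLen);
        yBlock = resample(x(idx, :), fsOut, fsIn);
    end
    tBlocks = toc;

    fprintf('whole signal: %.3f s, block-wise: %.3f s\n', tWhole, tBlocks);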

@hagenw
Member Author

hagenw commented Mar 4, 2016

During the discussions for #14 @ningma97 pointed out that it doesn't matter for GmmLocationKS which sampling rate we use, if I understood it correctly.

Does this mean that for GmmLocationKS, and maybe also for DnnLocationKS, the sampling rate is not a big deal, as both use only low-frequency features (ITD, ILD, ...) that are not affected by a change in sampling rate (as long as the sampling rate is not too low)?

@hagenw
Member Author

hagenw commented Mar 4, 2016

I tested GmmLocationKS; it works with different sampling rates without a problem.
For DnnLocationKS I get an error. For example, run localise in the folder localisation_DNNs in TWOEARS/examples after reconfiguring the blackboard to use 44.1 kHz instead of 16 kHz:

>> localise

-------------------------------------------------------------------------
Source direction   DnnLocationKS w head rot.   DnnLocationKS wo head rot.
-------------------------------------------------------------------------
Error using -
Matrix dimensions must agree.

Error in DnnLocationKS/execute (line 95)
                testFeatures = testFeatures - ...


Error in Scheduler/executeFirstExecutableAgendaOrderItem (line 63)
                        nextKsi.ks.execute();

Error in Scheduler/processAgenda (line 29)
                [exctdKsi,cantExctKsis,~] = ...

Error in BlackboardSystem/run (line 217)
                obj.scheduler.processAgenda();


Error in estimateAzimuth (line 18)
bbs.run();

Error in localise (line 37)
    phi1 = estimateAzimuth(sim, 'BlackboardDnn.xml');                % DnnLocationKS w head movements

@ningma97
Contributor

ningma97 commented Mar 4, 2016

DnnLocationKS uses the cross-correlation output, which has a different number of lags depending on the sampling rate, and thus different feature dimensions.
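
In other words, the feature dimension scales with the sampling rate because the maximum lag is fixed in seconds but stored in samples. A small illustration (the 1.1 ms maximum delay is an assumed value; the AFE's actual parameter may differ):

    % Number of cross-correlation lags for a fixed maximum delay in seconds.
    maxDelaySec = 1.1e-3;                   % assumed maximum ITD window
    for fs = [16000 44100]
        maxLag = ceil(maxDelaySec * fs);
        nLags  = 2 * maxLag + 1;            % lags from -maxLag to +maxLag
        fprintf('fs = %5d Hz -> %d lags\n', fs, nLags);
    end

This mismatch in lag count is what produces the "Matrix dimensions must agree" error above.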

@hagenw
Member Author

hagenw commented Mar 4, 2016

Ah, ok, that is what @Hardcorehobel meant with his comment:

I agree. We should ensure that the DNN-based system always gets the cross-correlation function with the correct number of lags (independent of the input sampling frequency), either by resampling the input or by down-sampling the cross-correlation function.

Maybe we could implement this by incorporating the sampling frequency as a mandatory parameter when asking the AFE for crosscorrelation output (if this is not done yet).
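
Independent of where the parameter ends up, one way to make the feature dimension sampling-rate independent would be to interpolate the cross-correlation function onto a fixed lag grid in seconds. A sketch of the idea (ccf and fs are assumed inputs, and the grid of 37 lags over ±1 ms is just an example, not an existing AFE setting):

    % Map a cross-correlation function onto a fixed lag grid in seconds,
    % so the feature dimension no longer depends on the sampling rate.
    lagsSec    = ((1:numel(ccf)) - ceil(numel(ccf)/2)) / fs;   % lag axis in seconds
    targetLags = linspace(-1e-3, 1e-3, 37);                    % fixed grid
    ccfFixed   = interp1(lagsSec, ccf, targetLags, 'spline', 0);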

@fietew
Member

fietew commented Mar 4, 2016

The sampling rate of the binaural simulator is constrained by the sampling rate of the measured HRTF/BRTF datasets. The signals of the involved sound sources can be resampled while loading the respective *.wav files using source.loadFile, if necessary. Resampling the HRTF/BRTF data is imho a bad idea.

@hagenw
Member Author

hagenw commented Mar 4, 2016

But you could still resample the output of the buffer, after the convolution of the signals with the HRIR/BRIR?
Would this be faster if it is performed in the binaural simulator, or could it also simply be done by the AFE? I'm still not sure what the most effective way of handling this issue is.

@ivo--t
Member

ivo--t commented Mar 4, 2016

@Hardcorehobel : so far we've only used the default 80..8000 Hz range for our features. I guess you chose this as the default because for speech processing the higher frequencies are not important. Is this, however, also true for the more general case of sound type detection? Now that I think about it, humans hear up to 20 kHz, probably not without reason, and music is sampled at 44100 Hz so as not to lose information, right?

@ivo--t
Member

ivo--t commented Mar 4, 2016

Hah, and thinking more about it: the default 80..8000 Hz are the center frequencies of the filters, right? So what is the approximate range of the highest filter, then? If we sample only at 16 kHz, all information above 8 kHz is lost, so the highest filter would probably already lose information, correct?

@Hardcorehobel

Hossa, indeed the frequency range determines the range for the center frequencies. Since the bandwidth of the filters increases with frequency, the filters cover a wide range at high frequencies. When using 44.1 kHz one could of course go all the way up to the Nyquist frequency, but the signal will be quite noisy in realistic conditions, and one should avoid placing the highest filter directly at the Nyquist frequency. Only if the number of filters is specified will filters be placed at the lower and the upper end of the frequency range.
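
To put a number on the bandwidth at high frequencies: with the common ERB formula of Glasberg & Moore (assuming that is what the AFE's gammatone filterbank uses), a filter centred at 8 kHz is already almost 900 Hz wide, so with 16 kHz sampling its upper skirt would indeed run into the Nyquist frequency:

    % Equivalent rectangular bandwidth (ERB) after Glasberg & Moore (1990).
    erb = @(fHz) 24.7 * (4.37 * fHz / 1000 + 1);

    fc = 8000;                              % highest default centre frequency
    fprintf('ERB at %d Hz: %.1f Hz\n', fc, erb(fc));
    fprintf('approx. filter edges: %.0f .. %.0f Hz\n', ...
            fc - erb(fc)/2, fc + erb(fc)/2);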

@hagenw
Member Author

hagenw commented Mar 14, 2016

In order to arrive at some action points and conclusions, I created TWOEARS/auditory-front-end#5, which should ensure that the output of the AFE is independent of the sampling frequency used (for example, by returning time in seconds rather than in samples).
This should fix the problems we currently have with DnnLocationKS.

Are there more points where we should do something or change the behavior?

@Hardcorehobel

Just to be clear: you are not requesting a resampling processor, but a representation in terms of a time vector that is independent of the sampling frequency, right?

@hagenw
Member Author

hagenw commented Mar 14, 2016

Yes, I thought this would be the easiest solution to guarantee that different KSs that would otherwise require different sampling rates can work together.


@hagenw
Member Author

hagenw commented Jan 5, 2017

I would like to close this issue. I guess our current solution is to use the same sampling rate for all KSs, isn't it?

@ningma97
Contributor

ningma97 commented Jan 6, 2017

Yes, we specify the sampling rate in the AFE.

@hagenw hagenw closed this as completed Jan 6, 2017