Reproducing results #38

Closed
rohithmars opened this issue Dec 21, 2021 · 15 comments

@rohithmars commented Dec 21, 2021

Hi,

I tried to re-train the DeepFilterNet model using the DNS-3 challenge dataset mentioned in your work.

I don't have the additional 10k impulse responses (IRs); however, the rest of the dataset is the same.

On the VCTK test set, using the config.ini shipped with the pre-trained model as my training config, my "best model" (selected on validation) gives a PESQ score of 2.60, which is much lower than the 2.81 of the pre-trained model.

In config.ini, AdamW is used, while the paper mentions Adam as the optimizer.

Do you think any other factors would result in such a performance drop?

Could you clarify the 3 s samples used for training? Suppose a DNS-3 clip is 10 s long: do I need to split it into 3 s segments to utilize the entire clip, or just use the first 3 seconds? Alternatively, is a random 3 s segment generated on the fly during training?

In the HDF5 setup, do the speech/noise/RIR sets need to have the same number of samples, or are the noise and RIR sampled randomly from a list? For example, if the speech list has 1000 samples while the noise and RIR lists have 100 samples each, is that okay, or should it be 1000 speech, 1000 noise, 1000 RIR? Also, do the speech and noise samples need to have the same duration?

What about the reverberation parameter p_reverb = 0.05? Is the data augmentation performed by default, or is any other config needed? Also, config.ini has conv_lookahead = 2, but the paper mentions a "look-ahead of l = 1 frame for both DF as well as in the DNN convolutions".

@Rikorose (Owner)

Hi there, could you send me the training log as well as the config file used for training? Are you sure the model fully converged? Did you choose the model based on the best validation loss? What batch size did you use? Were there any NaNs during the training run? Which exact commit was used for training?

AdamW and Adam should not make a huge difference. I will double-check which one was used.

3 seconds should be fine. In each epoch, a different 3 s window is used. I may have made some changes since then, e.g. handling training samples shorter than 3 s.
100 RIRs should be fine; they are randomly sampled.
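
Conceptually, the random window selection works roughly like the sketch below. This is not the actual dataloader code from this repo; the 48 kHz sample rate and the function name are just placeholders for illustration:

import numpy as np

SR = 48_000   # assumed sample rate
WIN_S = 3     # window length in seconds

def random_3s_window(audio: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a random 3 s crop; zero-pad if the clip is shorter than 3 s."""
    win = SR * WIN_S
    if len(audio) <= win:
        return np.pad(audio, (0, win - len(audio)))
    start = rng.integers(0, len(audio) - win)
    return audio[start:start + win]

# Each epoch, a new draw yields a different 3 s window of the same 10 s clip.
rng = np.random.default_rng()
clip = np.zeros(10 * SR, dtype=np.float32)  # stand-in for a 10 s DNS clip
segment = random_3s_window(clip, rng)
assert segment.shape[0] == SR * WIN_S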

p_reverb is the reverb probability during training. Since VCTK is non-reverberant, this should not have a large impact. conv_lookahead is a different lookahead (i.e. in the convolutions) than the DF lookahead. The overall latency should be no more than 30 ms, which is less than PercepNet or DCCRN.

@rohithmars (Author) commented Dec 21, 2021

Hi,

Thank you for the reply.

Please find attached the log and config.ini
train.log
config.zip

So you mean that it is okay even if the number of speech samples and the number of noise samples differ? Or should I oversample to match them?

As we discussed before, NaNs do occur, but I then restart training from the previous epoch.

@Rikorose (Owner)

The numbers of clean and noise samples may differ since they are randomly mixed at training time. One thing that I changed for better robustness in multi-speaker scenarios is sampling a different speech sample if the first one is shorter than 3 s. This may cause a performance drop.
It can be disabled by setting the probability to zero:
https://github.com/Rikorose/DeepFilterNet/blob/main/pyDF-data/src/lib.rs#L232

Also, we did not include the singing-voice dataset, since it is correlated with the music that is also included in the noise sets, nor the emotional speech, due to its worse recording quality.
How did you do your train/validation/test split?

The original model was trained in a different repository which I cleaned up for publishing. Maybe I cleaned up too much or introduced a bug somewhere during the refactoring process which causes a performance drop. I will need to double-check.

@rohithmars (Author) commented Dec 21, 2021

Hi,

I included the singing and emotional speech in my target speech, but their contribution is small compared to the read speech from DNS. My split is roughly 75%/15%/15%.

Also, did you include the VCTK train set as part of your training data, or only the DNS dataset?

By the way, I also noticed that when I use the pre-trained model to evaluate the validation set, it gives lower SDR/STOI scores than a model trained for just one epoch. It is very strange and I am not sure where the bug is; the pre-trained model should definitely score much higher than a first-epoch model.

@Rikorose (Owner)

I did not include VCTK as an extra training set. What SDR/STOI scores do you observe, on the validation set or on the VCTK test set?

@rohithmars (Author) commented Dec 21, 2021

Hi,

Sorry for the long post and the many questions. The validation set below is from my own split. [29] shows the scores of the provided pre-trained model on this validation set; [0] shows the validation scores of the first-epoch model. I think something is wrong?

Pre-trained model score on validation set

[29] [valid] | DfAlphaLoss: 0.016212 | SpectralLoss: 0.30626 | loss: 0.32247 | sdr_snr_-5: 3.6033 | sdr_snr_0: 7.6844 | sdr_snr_10: 13.983 | sdr_snr_20: 19.054 | sdr_snr_40: 24.983 | sdr_snr_5: 10.971 | stoi_snr_-5: 0.68123 | stoi_snr_0: 0.77549 | stoi_snr_10: 0.89468 | stoi_snr_20: 0.94909 | stoi_snr_40: 0.983 | stoi_snr_5: 0.844

Epoch-1 model score on validation set

| DF | [0] [valid] | DfAlphaLoss: 0.011831 | SpectralLoss: 0.2401 | loss: 0.25193 | sdr_snr_-5: 4.9098 | sdr_snr_0: 8.912 | sdr_snr_10: 15.457 | sdr_snr_20: 21.364 | sdr_snr_40: 28.92 | sdr_snr_5: 12.27 | stoi_snr_-5: 0.7153 | stoi_snr_0: 0.80117 | stoi_snr_10: 0.90952 | stoi_snr_20: 0.95896 | stoi_snr_40: 0.99052 | stoi_snr_5: 0.86384

@Rikorose (Owner)

Hm, there is certainly room to improve the DeepFilterNet model. It is strange, then, that you are observing worse results on the VCTK set. I am currently working on a different topic but will try to come back to this to provide better reproducibility.

I also noticed a bug in the dataset functionality resulting in a different order of samples during training, which also limits reproducibility.

@rohithmars (Author)

Hi,

Can you tell me how to use the pre-trained model as the starting point for training?

Do I need to do any other configuration?

@Rikorose (Owner) commented Dec 23, 2021

Hi,

You should be able to load the model via the train script and just continue training. You will need to increase max_epochs in the config file.
What you could also try is to not load the last layer(s) of the pretrained model. This is a common approach when fine-tuning a pretrained model on a new dataset/task. You can use the cp_blacklist config parameter for this; just add this option as a comma-separated list to the train section, e.g.:

[train]
cp_blacklist = df_fc_out,df_fc_a,conv0_out
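
Under the hood this amounts to dropping the blacklisted entries from the checkpoint before loading. Here is a minimal PyTorch sketch of that idea; it is not the actual DeepFilterNet checkpoint-loading code, and the function name and the assumption that the checkpoint is a plain state dict are mine:

import torch

def load_with_blacklist(model: torch.nn.Module, ckpt_path: str, blacklist: list[str]):
    """Load a checkpoint but skip parameters whose names start with a blacklisted prefix."""
    state = torch.load(ckpt_path, map_location="cpu")
    filtered = {k: v for k, v in state.items()
                if not any(k.startswith(prefix) for prefix in blacklist)}
    # strict=False keeps the freshly initialized values for the skipped layers
    return model.load_state_dict(filtered, strict=False)

# e.g. re-initialize the output layers while keeping the remaining pretrained weights:
# load_with_blacklist(model, "model_latest.ckpt", ["df_fc_out", "df_fc_a", "conv0_out"])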

FYI, I am currently trying to reproduce your findings using the current repo state (since I trained the published model in a different repo). I have some issues with our cluster though, so no ETA.

@rohithmars (Author)

Hi,
I did try the first method you suggested, i.e. loading the pre-trained model via the train script and just continuing training. However, as I mentioned before, the scores on the validation set are lower than those of the first epoch of a model trained from scratch, which is very strange. The config.ini is the same as the one provided with the pre-trained model, except that I used cuda as the device.

@rohithmars (Author)

Hi @Rikorose

Did you get a chance to look into the issue of reproducing the results? The PESQ score I obtain is 2.60, which is quite a bit lower than the 2.80 from the pre-trained model.

Loading from the pre-trained model also does not help, as I explained above.

@Rikorose (Owner) commented Jan 5, 2022

Hi @rohithmars

Sorry for the late reply. I was finally able to look into it. I found two things:

  1. The loss factors need to be higher by a factor of 5. Originally we had a hard-coded factor in the spectral loss which I have now cleaned up:
[spectralloss]
factor_magnitude = 100000
factor_complex = 100000
gamma = 0.6

Another note on this: you could also try decreasing gamma to 0.3. We found that a stronger compression (i.e. a lower gamma) might benefit PESQ; I think this was also reported by others in the DNS3 challenge. However, other metrics might get slightly worse. If you try this, you would possibly also need to adjust the loss factors (see the loss sketch after this list).

  2. I looked into our HDF5 files and found that we originally over-sampled the high-quality datasets within the DNS corpus, i.e. PTDB and VCTK, by a factor of 10. These datasets are dry close-talk recordings, while Librivox contains all sorts of microphones and recording conditions. I am not quite sure whether these two datasets are actually included in the DNS training set, since the English read-speech folder only contains Common Voice samples even though the readme states otherwise. So I think we just included these datasets manually. I am really sorry that I forgot about this and did not specify it earlier.
    We split the VCTK and PTDB sets on speaker level to ensure no speaker overlap with the VCTK test set; the rest of the DNS set is split on signal level. Here are our train/val/test splits for VCTK and PTDB: traintestsplit.zip
    I will provide an updated paper which includes this note. Thank you for investigating this issue. I retrained our model and got slightly (but probably not significantly) higher scores (PESQ 2.84), so I guess there is still some variance left (e.g. due to NaNs during training #44, Generate dataset samples in reproducible order #42).
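
As a reference when tuning these factors and gamma: the compressed spectral loss boils down to comparing magnitude-compressed spectra plus a phase-aware complex term. The PyTorch sketch below only illustrates that idea with the factors above; it is not the exact SpectralLoss implementation of this repo, and the function name and tensor layout are assumptions:

import torch

def compressed_spectral_loss(pred: torch.Tensor, target: torch.Tensor,
                             gamma: float = 0.6,
                             factor_magnitude: float = 100_000.0,
                             factor_complex: float = 100_000.0) -> torch.Tensor:
    """Simplified compressed spectral loss on complex STFTs of shape [batch, freq, time]."""
    eps = 1e-12
    pred_mag = pred.abs().clamp_min(eps) ** gamma
    tgt_mag = target.abs().clamp_min(eps) ** gamma
    # magnitude term on gamma-compressed magnitudes
    loss_mag = torch.mean((pred_mag - tgt_mag) ** 2)
    # complex term: compressed magnitude, original phase
    pred_c = pred_mag * torch.exp(1j * pred.angle())
    tgt_c = tgt_mag * torch.exp(1j * target.angle())
    loss_c = torch.mean((pred_c - tgt_c).abs() ** 2)
    return factor_magnitude * loss_mag + factor_complex * loss_c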

Regarding the pretraining: have you tried reinitializing the last layers as I suggested earlier? This usually works pretty well and is the standard approach when fine-tuning on a new/different dataset.

@rohithmars (Author) commented Jan 6, 2022

Hi @Rikorose

First of all, thanks for getting back to me and for your effort to address the issue.

  1. If the [spectralloss] factor is 20000 * 5 = 100000 instead of 20000, does that mean the factor for [dfalphaloss] should also be scaled to 1000 * 5 = 5000? I ask because the paper mentions lambda_spec = 1 and lambda_alpha = 0.05.

  2. Okay, so my results may be affected by the lack of VCTK training data. I did not use the VCTK train set; my training data came only from the DNS-3 dataset, and I tested on the VCTK test set.

Could you please show me a sample dataset.cfg which lists multiple .hdf5 files for speech? I don't want to recreate the HDF5 files; I want to know whether it is possible to add VCTK_speech.hdf5 alongside my current speech_train.hdf5. I hope you understand the question.

Thanks a lot again.

@Rikorose (Owner) commented Jan 6, 2022

Hi, here is an example dataset config (without noise datasets):

{
    "test": [
        [
            "DNS_TEST.hdf5",
            1
        ],
        [
            "VCTK_TEST.hdf5",
            10
        ],
        [
            "SLR26_SIM_DNS_TEST.hdf5",
            1
        ]
    ],
    "train": [
        [
            "DNS_TRAIN.hdf5",
            1
        ],
        [
            "VCTK_TRAIN.hdf5",
            10
        ],
        [
            "SLR26_SIM_DNS_TRAIN.hdf5",
            1
        ]
    ],
    "valid": [
        [
            "DNS_VALID.hdf5",
            1
        ],
        [
            "VCTK_VALID.hdf5",
            10
        ],
        [
            "SLR26_SIM_DNS_VALID.hdf5",
            1
        ]
    ]
}
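
The integer next to each file is the per-dataset sampling factor (this is how VCTK ends up over-sampled by a factor of 10, as mentioned above). So for your case, adding your VCTK_speech.hdf5 as an additional entry next to the existing speech_train.hdf5 should work without recreating anything, roughly like this (file names are yours, the factor is just a placeholder):

"train": [
    ["speech_train.hdf5", 1],
    ["VCTK_speech.hdf5", 10]
]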

@github-actions bot commented Apr 7, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
