Reproducing results #38

Hi,
I tried to re-train the DeepFilterNet model using the DNS-3 challenge dataset mentioned in your work. I don't have the additional 10k IRs; the other datasets remain the same.

On the VCTK test set, using the config.ini shipped with the pre-trained model as my training config, my best model on validation gives a PESQ score of 2.60, which is much lower than the 2.81 of the pre-trained model. In config.ini, AdamW is used, while the paper mentions Adam as the optimizer. Do you think any other factors could cause such a performance drop?

Could you clarify the 3 s samples for training? Suppose a DNS-3 clip is 10 s long: do I need to split it into 3 s segments to utilize the entire clip, or just use the first 3 seconds? Alternatively, is a random 3 s window generated on the fly during training?

In the HDF5 setup, do speech/noise/RIR need to have the same number of samples, or are the noise and RIR sampled randomly from their lists? For example, if the speech list has 1000 samples while the noise list and RIR list have 100 samples each, is that okay, or should it be 1000 speech, 1000 noise, and 1000 RIRs? Do the speech and noise samples need to have the same duration?

How about the reverberation parameter p_reverb = 0.05? Is the data augmentation performed by default, or is any other config needed? config.ini sets conv_lookahead = 2, but the paper mentions a "look-ahead of l = 1 frame for both DF as well as in the DNN convolutions".

Comments
Hi there, could you send me the training log as well as the config file used for training? Are you sure the model fully converged? Did you make sure to choose the model based on the best validation loss? What is the batch size? Were there any NaNs in the training run? Which exact commit was used for training?

AdamW and Adam should not make a huge difference; I will double-check which one was used.

3 seconds should be fine. In each epoch, a different 3 s window will be used. I may have made some changes since then, e.g. handling training samples shorter than 3 s.

p_reverb is the reverb probability during training. Since VCTK is non-reverb, this should not have a large impact.

conv_lookahead is a different lookahead (i.e. in the convolutions) than the DF lookahead. The overall latency should be no more than 30 ms, which is less than that of PercepNet or DCRNN.
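For illustration, here is a minimal sketch of the on-the-fly windowing described above, where a different random 3 s window is cut from each clip every epoch. The helper is hypothetical (not the repository's actual loader) and assumes 48 kHz audio stored as NumPy arrays:

```python
import numpy as np

def random_window(audio: np.ndarray, sr: int = 48000, win_s: float = 3.0) -> np.ndarray:
    """Cut a random fixed-length window; a new offset is drawn on every call (i.e. every epoch)."""
    win = int(win_s * sr)
    if len(audio) <= win:
        # Clips shorter than the window are zero-padded to the target length here.
        return np.pad(audio, (0, win - len(audio)))
    start = np.random.randint(0, len(audio) - win + 1)
    return audio[start:start + win]
```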
Hi, thank you for the reply. Please find attached the log and config.ini. So you mean that even if the number of speech samples and the number of noise samples differ, it is okay? Or should I oversample to match them? As we discussed before, NaNs do occur, but then I restart training from the previous epoch.
The numbers of clean and noise samples may differ since they are randomly mixed at training time. One thing that I changed for better robustness in multi-speaker scenarios is sampling a different speech sample if the first one is shorter than 3 s; this may cause a performance drop. Also, we did not include the singing voice dataset, since it is correlated with the music that is also included in the noise sets, nor the emotional speech, due to its worse recording quality. The original model was trained in a different repository which I cleaned up for publishing. Maybe I cleaned up too much or introduced a bug somewhere during the refactoring, which causes a performance drop. I will need to double-check.
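A sketch of the redraw-if-too-short behavior mentioned above; `dataset.random_speech()` is a hypothetical accessor, the real logic lives in the dataset code:

```python
def sample_speech(dataset, sr: int = 48000, win_s: float = 3.0, max_tries: int = 10):
    """Draw a speech sample; if it is shorter than win_s seconds, draw a different one."""
    min_len = int(win_s * sr)
    sample = dataset.random_speech()  # hypothetical accessor returning a 1-D array
    for _ in range(max_tries):
        if len(sample) >= min_len:
            break
        sample = dataset.random_speech()  # too short: try a different sample
    return sample
```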
Hi, I had included the singing and emotional speech in my targets, but the contribution from these data is small compared to the read speech from DNS; the split is similar to 75%/15%/15%. Also, did you include the VCTK train set as part of your training data, or only the DNS dataset? By the way, I also noticed that when I use the pre-trained model to evaluate the validation set, it gives lower SDR/STOI scores than the model after the first training epoch. This is very strange, and I am not sure where the bug is; the pre-trained model should definitely give much higher scores than a first-epoch model.
I did not include VCTK as an extra training set. What SDR/STOI scores do you observe, on the validation set or on the VCTK test set?
Hi, sorry for the long post and the many questions. The validation set below is from my own split. [29] is the provided pre-trained model and its scores on this validation set; [0] is the validation score of the model after the first epoch. I think there is something wrong?

Pre-trained model [29] on the validation set:

```
[29] [valid] | DfAlphaLoss: 0.016212 | SpectralLoss: 0.30626 | loss: 0.32247
  | sdr_snr_-5: 3.6033 | sdr_snr_0: 7.6844 | sdr_snr_10: 13.983 | sdr_snr_20: 19.054 | sdr_snr_40: 24.983 | sdr_snr_5: 10.971
  | stoi_snr_-5: 0.68123 | stoi_snr_0: 0.77549 | stoi_snr_10: 0.89468 | stoi_snr_20: 0.94909 | stoi_snr_40: 0.983 | stoi_snr_5: 0.844
```

Epoch-1 model [0] on the validation set:

```
[0] [valid] | DfAlphaLoss: 0.011831 | SpectralLoss: 0.2401 | loss: 0.25193
  | sdr_snr_-5: 4.9098 | sdr_snr_0: 8.912 | sdr_snr_10: 15.457 | sdr_snr_20: 21.364 | sdr_snr_40: 28.92 | sdr_snr_5: 12.27
  | stoi_snr_-5: 0.7153 | stoi_snr_0: 0.80117 | stoi_snr_10: 0.90952 | stoi_snr_20: 0.95896 | stoi_snr_40: 0.99052 | stoi_snr_5: 0.86384
```
Hm, there is certainly the possibility to improve the DeepFilterNet model. Still, it is strange that you are observing worse results on the VCTK set. I am currently working on a different topic but will try to come back to this to provide better reproducibility. I also noticed a bug in the dataset functionality resulting in a different order of samples during training, which also limits reproducibility.
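A common fix for such a sample-order issue is a deterministic per-epoch shuffle, sketched below; this is a generic pattern, not necessarily the fix applied in the repository:

```python
import numpy as np

def epoch_order(n_samples: int, epoch: int, seed: int = 42) -> np.ndarray:
    """The same seed and epoch always yield the same sample order, restoring reproducibility."""
    rng = np.random.default_rng(seed + epoch)
    return rng.permutation(n_samples)
```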
Hi, can you tell me how to use the pre-trained model as the starting model for training? Do I need any other configuration?
Hi, you should be able to load the model via the train script and just continue training. You will need to increase the number of training epochs. To reinitialize the last layers (e.g. when fine-tuning on a different dataset), you can blacklist them from the checkpoint:

```ini
[train]
cp_blacklist = df_fc_out,df_fc_a,conv0_out
```

FYI, I am currently trying to reproduce your findings using the current repo state (since I trained the published model in a different repo). I ran into some issues with our cluster though, so no ETA.
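In PyTorch terms, such a blacklist could be applied roughly as follows. This is a sketch, not the repository's actual checkpoint code; the key prefixes are taken from the config snippet above:

```python
import torch

def load_with_blacklist(model, ckpt_path, blacklist=("df_fc_out", "df_fc_a", "conv0_out")):
    """Load a checkpoint but drop blacklisted keys so those layers keep their fresh initialization."""
    state = torch.load(ckpt_path, map_location="cpu")
    filtered = {k: v for k, v in state.items()
                if not any(k.startswith(p) for p in blacklist)}
    model.load_state_dict(filtered, strict=False)  # strict=False tolerates the dropped keys
    return model
```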
Hi @Rikorose, did you get a chance to look into the issue of reproducing the results? The PESQ score I obtain is 2.60, which is quite low compared to the 2.80 of the pre-trained model. Loading from the pre-trained model also does not help, as I explained above.
Hi @rohithmars, sorry for the late reply. I finally was able to look into it. I found two things:

```ini
[spectralloss]
factor_magnitude = 100000
factor_complex = 100000
gamma = 0.6
```

Another note on this: you could also try decreasing gamma to 0.3. We found that a stronger compression (i.e. a lower gamma) might benefit PESQ; I think this is also reported by others in the DNS3 challenge. However, other metrics might get slightly worse. If you try this, you would possibly also need to adjust the loss factors.
Regarding the pretraining: have you tried reinitializing the last layers as I suggested earlier? This usually works pretty well and is the usual approach when fine-tuning on a new or different dataset.
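For reference, the compressed spectral loss that gamma controls looks roughly like this. This is a sketch following the paper's formulation, with parameter names mirroring the config above; the actual implementation may differ in details:

```python
import torch

def spectral_loss(pred, target, gamma=0.6, factor_magnitude=1e5, factor_complex=1e5):
    """Magnitude-compressed spectral loss on complex STFTs; a lower gamma means stronger compression."""
    pred_mag, target_mag = pred.abs() ** gamma, target.abs() ** gamma
    loss_mag = ((pred_mag - target_mag) ** 2).mean()
    # Complex term: compressed magnitude recombined with the original phase
    pred_c = pred_mag * torch.exp(1j * pred.angle())
    target_c = target_mag * torch.exp(1j * target.angle())
    loss_c = ((pred_c - target_c).abs() ** 2).mean()
    return factor_magnitude * loss_mag + factor_complex * loss_c
```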
Hi @Rikorose, first of all, thanks for getting back to me and for your effort to address the issue.
Could you please show me a sample dataset.cfg that lists multiple .hdf5 files for speech? I don't want to recreate the HDF5 files; I want to know if it is possible to add a VCTK_speech.hdf5 alongside my current speech_train.hdf5. I hope you understand this question. Thanks a lot again.
Hi, here is an example dataset config (without noise datasets):

```json
{
  "test": [
    ["DNS_TEST.hdf5", 1],
    ["VCTK_TEST.hdf5", 10],
    ["SLR26_SIM_DNS_TEST.hdf5", 1]
  ],
  "train": [
    ["DNS_TRAIN.hdf5", 1],
    ["VCTK_TRAIN.hdf5", 10],
    ["SLR26_SIM_DNS_TRAIN.hdf5", 1]
  ],
  "valid": [
    ["DNS_VALID.hdf5", 1],
    ["VCTK_VALID.hdf5", 10],
    ["SLR26_SIM_DNS_VALID.hdf5", 1]
  ]
}
```
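Note: the integer after each file name appears to act as a per-dataset sampling factor (with the values above, VCTK samples would be drawn about ten times as often); this reading is an assumption, as the thread does not spell it out.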