
RuntimeError: DF dataloader error: ThreadJoinError("Any { .. }") when attempting to train DeepFilterNet #187

Closed
AnonymousEliforp opened this issue Nov 22, 2022 · 13 comments · Fixed by #189

Comments

@AnonymousEliforp

I am trying to train DeepFilterNet but I am running into the following error when training: RuntimeError: DF dataloader error: ThreadJoinError("Any { .. }").

What I did

I installed DeepFilterNet via PyPI using:

pip3 install torch torchvision torchaudio
pip install deepfilternet[train]

Then I generated the following HDF5 dataset files:

TEST_SET_NOISE.hdf5
TEST_SET_SPEECH.hdf5
TRAIN_SET_NOISE.hdf5
TRAIN_SET_SPEECH.hdf5
VALID_SET_NOISE.hdf5
VALID_SET_SPEECH.hdf5

and placed them into the same directory .../data_folder.
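
For reference, HDF5 datasets like these are normally produced with the repository's prepare_data.py script. A minimal sketch of one such invocation, assuming the positional arguments are the dataset type, a text file listing the wav paths, and the output HDF5 file; the script path and the *.txt file lists are placeholders, not the exact commands used here:

# One call per split and type; speech_train.txt / noise_train.txt list one wav path per line (placeholders).
python df/scripts/prepare_data.py --sr 48000 speech speech_train.txt TRAIN_SET_SPEECH.hdf5
python df/scripts/prepare_data.py --sr 48000 noise noise_train.txt TRAIN_SET_NOISE.hdf5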

I also created a config.ini file with the following contents:

config.ini

[deepfilternet]
emb_hidden_dim = 256
df_hidden_dim = 256
df_num_layers = 3
conv_ch = 16
conv_lookahead = 0
conv_depthwise = True
convt_depthwise = True
conv_kernel = 1,3
conv_kernel_inp = 3,3
emb_num_layers = 2
emb_gru_skip = none
df_gru_skip = none
df_pathway_kernel_size_t = 1
enc_concat = False
df_n_iter = 1
linear_groups = 1
enc_linear_groups = 16
mask_pf = False
gru_type = grouped
gru_groups = 1
group_shuffle = True
dfop_method = real_unfold
df_output_layer = linear

[df]
fft_size = 960
nb_erb = 32
nb_df = 96
sr = 48000
hop_size = 480
norm_tau = 1
lsnr_max = 35
lsnr_min = -15
min_nb_erb_freqs = 2
df_order = 5
df_lookahead = 0
pad_mode = input

[train]
model = deepfilternet2
batch_size = 16
batch_size_eval = 16
num_workers = 4
overfit = false
lr = 0.001
max_epochs = 30
seed = 42
device = cuda:1
mask_only = False
df_only = False
jit = False
max_sample_len_s = 5.0
num_prefetch_batches = 32
global_ds_sampling_f = 1.0
dataloader_snrs = -5,0,5,10,20,40
batch_size_scheduling = 
validation_criteria = loss
validation_criteria_rule = min
early_stopping_patience = 5
start_eval = False

[distortion]
p_reverb = 0.0
p_bandwidth_ext = 0.0
p_clipping = 0.0
p_air_absorption = 0.0

[optim]
lr = 0.0005
momentum = 0
weight_decay = 0.05
optimizer = adamw
lr_min = 1e-06
lr_warmup = 0.0001
warmup_epochs = 3
lr_cycle_mul = 1.0
lr_cycle_decay = 0.5
lr_cycle_epochs = -1
weight_decay_end = -1

[maskloss]
factor = 0
mask = iam
gamma = 0.6
gamma_pred = 0.6
f_under = 2

[spectralloss]
factor_magnitude = 0
factor_complex = 0
factor_under = 1
gamma = 1

[multiresspecloss]
factor = 0
factor_complex = 0
gamma = 1
fft_sizes = 512,1024,2048

[sdrloss]
factor = 0

[localsnrloss]
factor = 0.0005

and put it in a directory .../log_folder.

I also created a dataset.cfg file with the following contents:

dataset.cfg

{
  "train": [
    [
      "TRAIN_SET_SPEECH.hdf5",
      1.0
    ],
    [
      "TRAIN_SET_NOISE.hdf5",
      1.0
    ]
  ],
  "valid": [
    [
      "VALID_SET_SPEECH.hdf5",
      1.0
    ],
    [
      "VALID_SET_NOISE.hdf5",
      1.0
    ]
  ],
  "test": [
    [
      "TEST_SET_SPEECH.hdf5",
      1.0
    ],
    [
      "TEST_SET_NOISE.hdf5",
      1.0
    ]
  ]
}

Lastly, I ran the command to train the model:

python df/train.py path/to/dataset.cfg .../data_folder .../log_folder

and this is the output that I get:

2022-11-22 22:10:07 | INFO     | DF | Running on torch 1.13.0+cu117
2022-11-22 22:10:07 | INFO     | DF | Running on host workstation3
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2022-11-22 22:10:07 | INFO     | DF | Loading model settings of log_folder
2022-11-22 22:10:07 | INFO     | DF | Running on device cuda:1
2022-11-22 22:10:07 | INFO     | DF | Initializing model `deepfilternet2`
2022-11-22 22:10:08 | DEPRECATED | DF | Use of `linear` for `df_ouput_layer` is marked as deprecated.
2022-11-22 22:10:08 | INFO     | DF | Initializing dataloader with data directory /home/tester/bokleong/dl_project_dfn/data_folder/
2022-11-22 22:10:08 | INFO     | DF | Loading HDF5 key cache from /home/tester/bokleong/dl_project_dfn/.cache_dataset.cfg
2022-11-22 22:10:08 | INFO     | DF | Start train epoch 0 with batch size 32
thread 'DataLoader Worker 1' panicked at 'called `Option::unwrap()` on a `None` value', libDF/src/dataset.rs:1134:69
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'DataLoader Worker 0' panicked at 'called `Option::unwrap()` on a `None` value', libDF/src/dataset.rs:1134:69
thread 'DataLoader Worker 2' panicked at 'called `Option::unwrap()` on a `None` value', libDF/src/dataset.rs:1134:69
thread 'DataLoader Worker 3' panicked at 'called `Option::unwrap()` on a `None` value', libDF/src/dataset.rs:1134:69
Exception in thread PinMemoryLoop:
Traceback (most recent call last):
  File "/home/tester/miniconda3/envs/user5/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/tester/miniconda3/envs/user5/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tester/miniconda3/envs/user5/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 49, in _pin_memory_loop
    do_one_step()
  File "/home/tester/miniconda3/envs/user5/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 26, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/tester/miniconda3/envs/user5/lib/python3.9/site-packages/libdfdata/torch_dataloader.py", line 206, in get
    self.loader.cleanup()
RuntimeError: DF dataloader error: ThreadJoinError("Any { .. }")

Appreciate any help on this, thank you!

@bommyboy

Update: this was caused by not specifying --mono when using prepare_data.py on mono wav files to generate the HDF5 files.
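
A hedged sketch of a corrected invocation for one of the files, reusing the placeholder arguments from the sketch earlier in this thread:

python df/scripts/prepare_data.py --sr 48000 --mono speech speech_train.txt TRAIN_SET_SPEECH.hdf5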

@Rikorose
Owner

@AnonymousEliforp on which version did this error occur? On main there is no unwrap() at libDF/src/dataset.rs:1134:69. I would like to properly fix this error.

@AnonymousEliforp
Author

> @AnonymousEliforp on which version did this error occur? On main there is no unwrap() at libDF/src/dataset.rs:1134:69. I would like to properly fix this error.

Not sure if this is what you mean, but I was using deepfilternet2 on the main branch

@Rikorose
Owner

At what exact commit? The main branch has changed and does not match your stack trace.

@AnonymousEliforp
Author

> At what exact commit? The main branch has changed and does not match your stack trace.

As far as I can tell from git log, I am at the 17-Nov 15:27:25 commit bc6bd9102212731843ed0a74f30164472865b167

@Rikorose
Owner

Hm, there is also no unwrap call on that line, which reads:

Some(LpParam {

Your stack trace above said something else:
thread 'DataLoader Worker 0' panicked at 'called `Option::unwrap()` on a `None` value', libDF/src/dataset.rs:1134:69

@bommyboy

[screenshot: git log output]
Hi, I am a colleague of @AnonymousEliforp; this is the output from git log -1 on our machine. Also, I have verified that using the prepare_data.py script with --mono enabled on the test, train, and valid sound files separately does not cause this issue. However, using prepare_data.py with --mono enabled and then using the split_hdf5.py script still causes the error.

@Rikorose
Owner

Ah, now I see. You used the main branch but installed the Rust modules from PyPI. The error occurs because you don't have any noise samples in your dataset. If you run train.py with --debug you will see that either your sampling factor is too low or you split your dataset incorrectly.
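
A minimal sketch of the debug run, assuming --debug can simply be added to the training command used above (flag placement is an assumption):

python df/train.py --debug path/to/dataset.cfg .../data_folder .../log_folder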

@bommyboy

[screenshot: train.py --debug output]
Hi, I don't think I quite understand what you mean. I have run the code with --debug and the output seems to register my noise dataset. However, the noise dataset is definitely smaller than my speech dataset. Where might I be able to change the sampling factor?

@Rikorose
Owner

You only have speech datasets.
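
One quick way to verify what actually ended up in the HDF5 files is to list their top-level groups; a minimal sketch, assuming prepare_data.py stores audio under a group named after the dataset type ('speech', 'noise', ...) and that hdf5-tools and h5py are installed:

# Should show a 'noise' group with one entry per sample; a 'speech' group here indicates the wrong type was used.
h5ls -r TRAIN_SET_NOISE.hdf5
# The same check via h5py:
python -c "import h5py; print(list(h5py.File('TRAIN_SET_NOISE.hdf5', 'r').keys()))"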

@bommyboy

Ok, but I meant for the HDF5 files ending with NOISE to be the noise datasets, and in dataset.cfg I have tried to follow the example dataset configuration.
[screenshot: dataset configuration]
Is there an error in how I defined dataset.cfg?

@Rikorose
Owner

No, in your prepare_data.py usage.

@bommyboy

Ah ok, that is my bad then. I have realised my mistake now, so sorry for this and thanks for your help!
