Error opening training file, File contains data in an unknown format. #20

Closed
JihadZoabi opened this issue Apr 25, 2022 · 4 comments

@JihadZoabi

Hi Ziad,
I tried running this script from the readme file to train the MSA model:

python run_common_voice.py --model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="ar" --output_dir=/path/to/output/ --cache_dir=/path/to/cache --overwrite_output_dir="yes" --num_train_epochs="1" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --evaluation_strategy="steps" --learning_rate="3e-4" --warmup_steps="500" --fp16="no" --freeze_feature_extractor="yes" --save_steps="10" --eval_steps="10" --save_total_limit="1" --logging_steps="10" --group_by_length="no" --feat_proj_dropout="0.0" --layerdrop="0.1" --do_train="yes" --do_eval="yes" --max_train_samples 100 --max_val_samples 100

And I got this message:

Traceback (most recent call last):
  File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 511, in <module>
    main()
  File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 400, in main
    train_dataset = train_dataset.map(
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1955, in map
    return self._map_single(
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 520, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 487, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2320, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2220, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1915, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 394, in speech_file_to_array_fn
    speech_array, sampling_rate = torchaudio.load(batch["path"])
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torchaudio\backend\soundfile_backend.py", line 197, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '/path/to/cache\downloads\extracted\31455a499a0212b1751dd0c1547b0d360037f6a8c0a69178647a45a577d0ff67\cv-corpus-6.1-2020-12-11/ar/clips/common_voice_ar_19225971.mp3': File contains data in an unknown format.

I think the reason is that the training files are in .mp3 format instead of .wav.
Any suggestions on how I can tackle this problem?
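
In case it is useful, a minimal way to check whether the installed soundfile/libsndfile build supports MP3 at all (just a sketch, not part of the script):

import soundfile as sf

# Formats the underlying libsndfile build can decode; older builds
# do not include MP3, which would explain the error above.
print(sf.available_formats())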

@MagedSaeed
Member

MagedSaeed commented May 5, 2022

Maybe you need to convert them first?

To be on the safe side, I always convert to .wav when using this wav2vec model on an audio dataset. I usually use pydub for this.

You can loop over your files, processing each one with:

#https://stackoverflow.com/a/12391451/4412324
from pydub import AudioSegment
sound = AudioSegment.from_mp3("/path/to/file.mp3")
# export to the proper place
sound.export("/output/path/file.wav", format="wav")
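
A rough batch version of the above (just a sketch; the input and output directories are placeholders):

from pathlib import Path
from pydub import AudioSegment  # pydub needs ffmpeg on the PATH to decode mp3

clips_dir = Path("/path/to/clips")
out_dir = Path("/output/path")
out_dir.mkdir(parents=True, exist_ok=True)

for mp3_path in clips_dir.glob("*.mp3"):
    sound = AudioSegment.from_mp3(str(mp3_path))
    # keep the original file name, just swap the extension
    sound.export(str(out_dir / (mp3_path.stem + ".wav")), format="wav")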

@JihadZoabi
Author

JihadZoabi commented May 7, 2022

Yes, I iterated over each file and converted it to .wav.
But now I am getting this error:

RuntimeError: Error opening '/path/to/cache\downloads\extracted\31455a499a0212b1751dd0c1547b0d360037f6a8c0a69178647a45a577d0ff67\cv-corpus-6.1-2020-12-11/ar/clips/common_voice_ar_19225971.mp3': System error.

And I think that's because the file common_voice_ar_19225971.mp3 no longer exists; it is now common_voice_ar_19225971.wav.
I also changed the extension of the training files (from .mp3 to .wav) in the tsv files (train.tsv, test.tsv, validated.tsv, invalidated.tsv, etc.).

This begs the question of why the script is still looking for common_voice_ar_19225971.mp3 rather than common_voice_ar_19225971.wav. A possible explanation is that the Arrow files generated from train.tsv, test.tsv, validated.tsv, and invalidated.tsv still contain the old extension (.mp3).
Arrow files cannot be edited with a text editor, and the documentation doesn't explain how to regenerate them from the updated tsv files or edit them directly.
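
One workaround I am considering, instead of regenerating the Arrow files, is to rewrite the path column after the dataset is loaded (just a sketch, I have not verified it against run_common_voice.py):

# sketch: point the already-built dataset at the converted .wav files
# without touching the cached Arrow files themselves
train_dataset = train_dataset.map(
    lambda batch: {"path": batch["path"].replace(".mp3", ".wav")}
)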

Of course, that's just a guess; maybe there is something obvious that I am missing.
What do you think?

@Hochmah

Hochmah commented Sep 15, 2022


Did you find a solution for that?

@bashartalafha

You should install a supported version of FFmpeg:

sudo add-apt-repository -y ppa:savoury1/ffmpeg4
sudo apt-get -qq install -y ffmpeg
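
After installing it, a quick way to check that torchaudio can actually decode the clips again (a sanity check; the clip path is a placeholder):

import torchaudio

print(torchaudio.list_audio_backends())  # backends available on this machine
waveform, sample_rate = torchaudio.load("/path/to/clips/common_voice_ar_19225971.mp3")
print(waveform.shape, sample_rate)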
