Error opening training file, File contains data in an unknown format. #20
Comments
Maybe you need to convert them first? To be on the safe side, I always convert to WAV. You can loop over your files and process each one with:
# https://stackoverflow.com/a/12391451/4412324
from pydub import AudioSegment
sound = AudioSegment.from_mp3("/path/to/file.mp3")
# export to the proper place
sound.export("/output/path/file.wav", format="wav")
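The loop over files mentioned above could look roughly like the following. This is only a sketch: the helper names `wav_path_for` and `convert_clips` and the directory arguments are hypothetical, and it assumes pydub is installed and that FFmpeg is available so pydub can decode MP3.

```python
from pathlib import Path


def wav_path_for(mp3_path: Path, out_dir: Path) -> Path:
    """Return the output path: same file stem, .wav extension, inside out_dir."""
    return out_dir / (mp3_path.stem + ".wav")


def convert_clips(clips_dir: Path, out_dir: Path) -> int:
    """Convert every .mp3 directly inside clips_dir to .wav; return the count."""
    # Imported here so the path helper above works without pydub installed.
    from pydub import AudioSegment  # decoding MP3 requires FFmpeg

    out_dir.mkdir(parents=True, exist_ok=True)
    converted = 0
    for mp3_path in sorted(clips_dir.glob("*.mp3")):
        sound = AudioSegment.from_mp3(str(mp3_path))
        # export to the proper place
        sound.export(str(wav_path_for(mp3_path, out_dir)), format="wav")
        converted += 1
    return converted
```

Calling `convert_clips(Path("/path/to/clips"), Path("/output/path"))` would then write one WAV file per MP3 clip and leave the originals untouched.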
Yes, I iterated over each file and changed its format to .wav.
And I think the error occurs because the file common_voice_ar_19225971.mp3 no longer exists. It is now common_voice_ar_19225971.wav. That raises the question of why the model is still looking for common_voice_ar_19225971.mp3 and not common_voice_ar_19225971.wav. A possible explanation is that the Arrow files for train.tsv, test.tsv, validated.tsv, and invalidated.tsv still have the former extension (mp3). Of course, that's just a possibility; maybe there is something obvious that I am missing.
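If the cached dataset really does still store the `.mp3` paths, one possible workaround is to rewrite the `path` field before the audio is loaded. The sketch below is an assumption, not something confirmed in this thread: `use_wav_path` is a hypothetical helper, and it presumes each WAV file sits next to the original clip under the same file name.

```python
from pathlib import Path


def use_wav_path(batch: dict) -> dict:
    """Point the example's "path" field at the converted .wav file."""
    # Assumption: the .wav lives in the same directory with the same stem.
    batch["path"] = str(Path(batch["path"]).with_suffix(".wav"))
    return batch
```

It could be applied with `train_dataset = train_dataset.map(use_wav_path)` before the existing `speech_file_to_array_fn` mapping runs, so that `torchaudio.load(batch["path"])` receives the `.wav` path.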
Did you find a solution for that?
You should install a supported FFmpeg.
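A quick way to check whether an FFmpeg executable is visible to Python is sketched below. pydub relies on an external FFmpeg (or libav) binary to decode MP3, so this is a reasonable first check. The function name `find_ffmpeg` is hypothetical, and finding the binary says nothing about whether its version is a supported one.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    location = find_ffmpeg()
    print(location or "ffmpeg not found on PATH")
```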
Hi Ziad,
I tried running this script, which is available in the README file, to train the MSA model:
python run_common_voice.py --model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="ar" --output_dir=/path/to/output/ --cache_dir=/path/to/cache --overwrite_output_dir="yes" --num_train_epochs="1" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --evaluation_strategy="steps" --learning_rate="3e-4" --warmup_steps="500" --fp16="no" --freeze_feature_extractor="yes" --save_steps="10" --eval_steps="10" --save_total_limit="1" --logging_steps="10" --group_by_length="no" --feat_proj_dropout="0.0" --layerdrop="0.1" --do_train="yes" --do_eval="yes" --max_train_samples 100 --max_val_samples 100
And I got this message:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 511, in <module>
main()
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 400, in main
train_dataset = train_dataset.map(
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1955, in map
return self._map_single(
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 520, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 487, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\fingerprint.py", line 458, in wrapper
out = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2320, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2220, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1915, in decorated
result = f(decorated_item, *args, **kwargs)
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 394, in speech_file_to_array_fn
speech_array, sampling_rate = torchaudio.load(batch["path"])
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torchaudio\backend\soundfile_backend.py", line 197, in load
with soundfile.SoundFile(filepath, "r") as file:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 629, in __init__
self._file = self._open(file, mode_int, closefd)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1183, in _open
_error_check(_snd.sf_error(file_ptr),
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1357, in _error_check
raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '/path/to/cache\downloads\extracted\31455a499a0212b1751dd0c1547b0d360037f6a8c0a69178647a45a577d0ff67\cv-corpus-6.1-2020-12-11/ar/clips/common_voice_ar_19225971.mp3': File contains data in an unknown format.
I think the reason is that the training files are in .mp3 format instead of .wav.
Any suggestions on how I can tackle this problem?
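One way to test the suspicion about the file format is to look at the first bytes of a clip. The sketch below is a heuristic only, and `sniff_audio_format` is a hypothetical helper: it recognizes a RIFF/WAVE header, an ID3v2 tag, or an MPEG frame sync, and reports everything else as unknown.

```python
def sniff_audio_format(path: str) -> str:
    """Guess the container from the first bytes: "wav", "mp3", or "unknown"."""
    with open(path, "rb") as f:
        header = f.read(12)
    # WAV files are RIFF containers whose form type is "WAVE".
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # MP3 files usually start with an ID3v2 tag ...
    if header[:3] == b"ID3":
        return "mp3"
    # ... or directly with an MPEG frame sync (eleven set bits).
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"
```

If this returns "mp3" for the clip named in the error message, the file itself is probably intact and the audio backend simply cannot decode MP3, which would be consistent with the "unknown format" message.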