Can not load commonvoice dataset on windows #3781

Open
jacobjennings opened this issue Apr 27, 2024 · 1 comment

Comments

@jacobjennings

🐛 Describe the bug

When loading the Common Voice dataset on Windows, train.tsv is decoded with the cp1252 encoding instead of UTF-8, and loading fails with a UnicodeDecodeError.

training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[49], line 1
----> 1 training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

File ~\Documents\GitHub\clarification\venv-pc\Lib\site-packages\torchaudio\datasets\commonvoice.py:55, in COMMONVOICE.__init__(self, root, tsv)
     53 walker = csv.reader(tsv_, delimiter="\t")
     54 self._header = next(walker)
---> 55 self._walker = list(walker)

File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3155: character maps to <undefined>
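
For reference, here is a minimal sketch that reproduces the same kind of failure outside torchaudio and shows one possible workaround. It assumes the cause is that the TSV is opened without an explicit encoding, so Python falls back to the Windows locale encoding (cp1252). The helper `read_tsv` and the sample file contents are hypothetical and are not part of torchaudio. The sample row contains a right curly quote, whose UTF-8 encoding includes the byte 0x9d that cp1252 cannot decode.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding):
    """Hypothetical helper: read a TSV with an explicit encoding.

    Returns (header, rows), mirroring the csv.reader usage in the traceback.
    """
    with open(path, "r", encoding=encoding, newline="") as tsv_:
        walker = csv.reader(tsv_, delimiter="\t")
        header = next(walker)
        return header, list(walker)


# Hypothetical stand-in for train.tsv, written as UTF-8.
# U+201D (right curly quote) is encoded as the bytes E2 80 9D.
tmp_dir = tempfile.mkdtemp()
tsv_path = os.path.join(tmp_dir, "train.tsv")
with open(tsv_path, "w", encoding="utf-8", newline="") as f:
    f.write("client_id\tpath\tsentence\n")
    f.write("abc\tclip_0001.mp3\tShe said \u201chello\u201d\n")

# Decoding as cp1252 (what Windows picks by default) fails on byte 0x9d.
try:
    read_tsv(tsv_path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding as UTF-8 succeeds.
header, rows = read_tsv(tsv_path, "utf-8")
```

If this assumption holds, another possible workaround that needs no code changes is to enable Python's UTF-8 mode before starting the interpreter (set the environment variable `PYTHONUTF8=1`, or run with `python -X utf8`), which makes `open()` default to UTF-8.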

Versions

Python 3.11

@mogwai
mogwai commented May 3, 2024

You can try downloading it from Hugging Face:

https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
