# Convert Common Voice Dataset for Coqui
Convert the dataset once, then reuse the converted version in other notebooks.
1. Set the hardware accelerator to "None" so this notebook does not use up GPU time (see Edit -> Notebook settings).
2. Upload the Common Voice (CV) dataset to Google Drive.
3. Replace "tr" with your language code and "v8.0" with another version if needed.
4. Run this notebook.
5. Afterwards you can delete the original CV dataset; other notebooks will use the converted version.
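
Step 3 means editing the language code and version in several cells. As a minimal sketch, the paths could be derived from two variables instead. `LANG`, `VERSION`, and `dataset_paths` are hypothetical names that do not appear in the cells of this notebook; the directory layout is the one those cells use.

```python
from pathlib import PurePosixPath

# Hypothetical settings: change these two values for another language or release.
LANG = "tr"
VERSION = "v8.0"


def dataset_paths(lang: str, version: str) -> dict:
    """Build the directories this notebook uses for one language and release."""
    drive_root = PurePosixPath("/content/drive/MyDrive/cv-datasets") / lang
    work_root = PurePosixPath("/content/data") / lang
    return {
        "drive_version_dir": drive_root / version,
        "alphabet": drive_root / "alphabet.txt",
        "work_version_dir": work_root / version,
        "clips": work_root / version / "clips",
    }


paths = dataset_paths(LANG, VERSION)
for name, path in paths.items():
    print(f"{name}: {path}")
```

The shell cells could then interpolate these values (Colab expands `{variable}` inside `!` commands), so only the two settings would need to change.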


In [1]:
# Basic imports
import shutil
from google.colab import drive

## Mount Google Drive

In [2]:
# mount your private google drive
drive.mount('/content/drive')

Mounted at /content/drive


## Basic Setup

In [3]:
# install libraries to convert mp3 to wav
!apt-get install -y sox libsox-fmt-mp3

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libid3tag0 libmad0 libmagic-mgc libmagic1 libopencore-amrnb0
  libopencore-amrwb0 libsox-fmt-alsa libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libid3tag0 libmad0 libmagic-mgc libmagic1 libopencore-amrnb0
  libopencore-amrwb0 libsox-fmt-alsa libsox-fmt-base libsox-fmt-mp3 libsox3
  sox
0 upgraded, 11 newly installed, 0 to remove and 37 not upgraded.
Need to get 872 kB of archives.
After this operation, 7,087 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrwb0 amd64 0.1.3-2.1 [45.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.4 [184 kB]
Get:

In [4]:
## Install Coqui STT 
!git clone --depth 1 --branch v1.1.0 https://github.com/coqui-ai/STT.git
!cd STT; pip install -U pip wheel setuptools; pip install .

Cloning into 'STT'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1395/1395), done.
remote: Total 2214 (delta 856), reused 1647 (delta 721), pack-reused 0
Receiving objects: 100% (2214/2214), 12.38 MiB | 25.67 MiB/s, done.
Resolving deltas: 100% (856/856), done.
Note: checking out 'f3605e23cb01a7e86c4d46f09070098a506fac4e'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

Collecting pip
  Downloading pip-21.3.1-py3-none-any.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 5.4 MB/s
Collecting setuptools
  Downloading setuptools-60.

## GET COMMON VOICE DATASET & SUPPLEMENTS

In [5]:
# Prep directories
!mkdir -p /content/data/tr

In [9]:
# Get Dataset and move to place
!tar -xzvf "/content/drive/MyDrive/cv-datasets/tr/v8.0/cv-corpus-8.0-2022-01-19-tr.tar.gz" -C "/content/data"
!mv /content/data/cv-corpus-8.0-2022-01-19/tr /content/data/tr/v8.0
!rmdir /content/data/cv-corpus-8.0-2022-01-19

Streaming output truncated to the last 5000 lines.
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439730.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439731.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439732.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439733.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439749.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439750.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439751.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439752.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439778.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439779.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439780.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439781.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439782.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_30439793.mp3
cv-corpus-8.0-2022-01-19/tr/clips/common_voice_tr_304

In [7]:
# Get Alphabet & Validator
!cp /content/drive/MyDrive/cv-datasets/tr/alphabet.txt /content/data/tr/alphabet.txt
!cp /content/drive/MyDrive/cv-datasets/tr/validate_label_tr.py /content/data/tr/validate_label_tr.py
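
The contents of `validate_label_tr.py` are not shown in this notebook. As a rough sketch only: the Coqui STT importers load the file given to `--validate_label_locale` and call a function named `validate_label(label)` from it, which is expected to return the cleaned transcript or `None` to skip the sample. The character set and cleaning rules below are assumptions for illustration, not the author's validator.

```python
import re

# Assumption: the 29 letters of the Turkish alphabet plus space and apostrophe.
TURKISH_LETTERS = set("abcçdefgğhıijklmnoöprsştuüvyz")
ALLOWED = TURKISH_LETTERS | {" ", "'"}


def validate_label(label):
    """Return a normalized transcript, or None if the sample should be skipped."""
    # Map the Turkish capitals explicitly: str.lower() would turn "I" into "i".
    label = label.replace("I", "ı").replace("İ", "i").lower()
    # Replace common punctuation with spaces, then collapse whitespace.
    label = re.sub(r"[.,!?;:\"“”]", " ", label)
    label = re.sub(r"\s+", " ", label).strip()
    # Reject empty transcripts and anything with characters outside the set.
    if not label or any(ch not in ALLOWED for ch in label):
        return None
    return label


for text in ["Merhaba, Dünya!", "IŞIK", "abc 123"]:
    print(repr(text), "->", repr(validate_label(text)))
```

Whatever rules the real validator applies, its output should only contain characters listed in `alphabet.txt`, since the importer is also run with `--filter_alphabet`.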

In [10]:
# check
!ls /content/data
!ls /content/data/tr
!ls /content/data/tr/v8.0

tr
alphabet.txt  v8.0  validate_label_tr.py
clips	 invalidated.tsv  reported.tsv	train.tsv
dev.tsv  other.tsv	  test.tsv	validated.tsv
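
The same check can be made programmatically, so that a missing file stops the notebook before the long conversion step. This is a sketch: `EXPECTED` and `missing_entries` are hypothetical names, and the list of entries is taken from the `ls` output above.

```python
from pathlib import Path

# Entries seen in the `ls` output above for a Common Voice release.
EXPECTED = ("clips", "train.tsv", "dev.tsv", "test.tsv", "validated.tsv")


def missing_entries(version_dir, expected=EXPECTED):
    """Return the expected names that are absent from version_dir."""
    root = Path(version_dir)
    return [name for name in expected if not (root / name).exists()]


missing = missing_entries("/content/data/tr/v8.0")
print("missing:", missing or "nothing")
```

In a notebook, `assert not missing` after this call would halt execution if the extraction was incomplete.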


## CONVERT DATASET

In [11]:
# This step converts the mp3s to wav-files. The result will be around three times as big as your mp3 folder.
!/content/STT/bin/import_cv2.py --validate_label_locale /content/data/tr/validate_label_tr.py --filter_alphabet /content/data/tr/alphabet.txt /content/data/tr/v8.0

Loading TSV file:  /content/data/tr/v8.0/test.tsv
Importing mp3 files...
Progress |######################################################| 100% completed
Imported 7948 samples.
Skipped 381 samples that failed on transcript validation.
Skipped 10 samples that were longer than 10 seconds.
Final amount of imported audio: 9:04:24 from 9:32:38.
Saving new Coqui STT-formatted CSV file to:  /content/data/tr/v8.0/clips/test.csv
Writing CSV file for train.py as:  /content/data/tr/v8.0/clips/test.csv
Progress |######################################################| 100% completed
Loading TSV file:  /content/data/tr/v8.0/dev.tsv
Importing mp3 files...
Progress |######################################################| 100% completed
Imported 7754 samples.
Skipped 344 samples that failed on transcript validation.
Skipped 12 samples that were longer than 10 seconds.
Final amount of imported audio: 7:48:00 from 8:11:58.
Saving new Coqui STT-formatted CSV file to:  /content/data/tr/v8.0/clips/dev.csv
Wri
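
To see how much audio the validation and length filters removed, the two durations on each "Final amount of imported audio" line can be compared. This sketch assumes the first duration is what was kept and the second is the total before filtering; `to_seconds` and `kept_fraction` are hypothetical helper names.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def kept_fraction(imported: str, total: str) -> float:
    """Fraction of the total audio duration that was imported."""
    return to_seconds(imported) / to_seconds(total)


# Durations copied from the importer output above.
splits = [("test", "9:04:24", "9:32:38"), ("dev", "7:48:00", "8:11:58")]
for split, imported, total in splits:
    print(f"{split}: kept {kept_fraction(imported, total):.1%} of the audio")
```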

In [12]:
# Delete the mp3 files AFTER they have been converted to wav
!find /content/data/tr/v8.0/clips/ -name "*.mp3" -delete
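
Because the deletion cannot be undone without extracting the archive again, it may be worth counting the files by type first. A minimal sketch, with `count_by_suffix` as a hypothetical helper; note that the mp3 count is expected to exceed the wav count, since the importer skips some samples.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(directory):
    """Count files in a directory tree, grouped by lower-cased extension."""
    root = Path(directory)
    if not root.is_dir():
        return Counter()
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


counts = count_by_suffix("/content/data/tr/v8.0/clips")
print(f"mp3: {counts['.mp3']}, wav: {counts['.wav']}, csv: {counts['.csv']}")
```

A wav count of zero at this point would suggest the conversion did not run, in which case the mp3 files should be kept.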

In [13]:
# Check contents
!ls /content/data/tr/v8.0/clips

Streaming output truncated to the last 5000 lines.
common_voice_tr_26124313.wav  common_voice_tr_30439596.wav
common_voice_tr_26124314.wav  common_voice_tr_30439633.wav
common_voice_tr_26124315.wav  common_voice_tr_30439635.wav
common_voice_tr_26124316.wav  common_voice_tr_30439636.wav
common_voice_tr_26124317.wav  common_voice_tr_30439638.wav
common_voice_tr_26124318.wav  common_voice_tr_30439659.wav
common_voice_tr_26124319.wav  common_voice_tr_30439660.wav
common_voice_tr_26124322.wav  common_voice_tr_30439661.wav
common_voice_tr_26124333.wav  common_voice_tr_30439662.wav
common_voice_tr_26124334.wav  common_voice_tr_30439680.wav
common_voice_tr_26124335.wav  common_voice_tr_30439681.wav
common_voice_tr_26124356.wav  common_voice_tr_30439682.wav
common_voice_tr_26124357.wav  common_voice_tr_30439683.wav
common_voice_tr_26124368.wav  common_voice_tr_30439684.wav
common_voice_tr_26124370.wav  common_voice_tr_30439685.wav
common_voice_tr_26124371.wav  common_voice_tr_3043

## BACKUP CONVERTED DATASET


In [14]:
# Do this once to keep a backup for later runs: pack the WAVs and CSVs into a tar.gz inside the workspace
!tar czf /content/data/converted-tr-corpus-v8.0.tar.gz /content/data/tr/v8.0/
# Move the large archive to Google Drive. If space runs low during the transfer, open the terminal and run: find /content/data/tr/v8.0/clips/ -name "*.wav" -delete
shutil.move("/content/data/converted-tr-corpus-v8.0.tar.gz", "/content/drive/MyDrive/cv-datasets/tr/v8.0")
# If the files don't appear in your Google Drive, this often helps (sync and disconnect the drive)
drive.flush_and_unmount()
drive.flush_and_unmount()

tar: Removing leading `/' from member names
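
Since the backup step can run out of space, a free-space check before the move may help. This is a sketch under assumptions: `fits_on_target` is a hypothetical helper, the 5% margin is arbitrary, and the figure `shutil.disk_usage` reports for a mounted Drive folder may not match the actual Drive quota. The demo uses a small temporary file so it runs anywhere.

```python
import os
import shutil
import tempfile


def fits_on_target(archive_path, target_dir, margin=0.05):
    """Return True if target_dir has room for the archive plus a safety margin."""
    size = os.path.getsize(archive_path)
    free = shutil.disk_usage(target_dir).free
    return free >= size * (1 + margin)


# Demo with a 1 KiB temporary file instead of the real archive.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.tar.gz")
    with open(sample, "wb") as handle:
        handle.write(b"x" * 1024)
    print("fits:", fits_on_target(sample, workdir))
```

In the notebook, the call would take the archive path and the Drive folder used in the cell above, and the move would only run when it returns `True`.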
