# Training a speech-to-text model using Huggingsound

<a target="_blank" href="https://colab.research.google.com/github/Koffair/colab_pipelines/blob/main/notebooks/03_train_s2t.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**WARNING**
- Open the notebook in Colab by clicking the "Open in Colab" badge at the top of the notebook
- Save a copy of the notebook to your Google Drive by clicking "File" > "Save a Copy to Drive"
- This notebook assumes that you have downloaded the necessary data and preprocessed it by following the instructions in ```01_train_language_models.ipynb```
- Set the runtime type to "GPU" by clicking "Runtime" > "Change runtime type" and selecting "GPU" under "Hardware accelerator"
- If you have a Colab subscription, set "Runtime class" to "Premium" for better performance
- If you have lots of training data, set "Runtime shape" to "High-RAM"
- Without a GPU, training will take significantly longer


## Setup
### Getting training data
We have to download the Common Voice dataset. **NOTE**: Colab VMs are not persistent. If you have to restart your VM, you have to download and
uncompress the data again.

In [None]:
!wget https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-14.0-2023-06-23/cv-corpus-14.0-2023-06-23-hu.tar.gz

--2023-07-21 06:57:10--  https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-14.0-2023-06-23/cv-corpus-14.0-2023-06-23-hu.tar.gz
Resolving mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com (mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com)... 52.92.226.226, 52.92.195.26, 52.92.192.234, ...
Connecting to mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com (mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com)|52.92.226.226|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3154771110 (2.9G) [application/octet-stream]
Saving to: ‘cv-corpus-14.0-2023-06-23-hu.tar.gz’


2023-07-21 07:00:03 (17.5 MB/s) - ‘cv-corpus-14.0-2023-06-23-hu.tar.gz’ saved [3154771110/3154771110]



In [None]:
!tar -xvf cv-corpus-14.0-2023-06-23-hu.tar.gz

Streaming output truncated to the last 5000 lines.
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842248.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842249.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842255.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842256.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842257.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842258.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842259.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842533.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842534.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842535.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842536.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842537.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842604.mp3
cv-corpus-14.0-2023-06-23/hu/clips/common_voice_hu_37842605.mp3
cv-corpus-14.0-2023-06-23/hu/clips/comm
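
Because the VM is not persistent, the download and extraction cells may have to be rerun after a restart, but not after a mere kernel reset within the same session. The following is a minimal sketch of a guard that reports which steps are still needed. The helper name `pending_steps` is hypothetical; the two paths are the ones used in the cells above. Note that it treats an existing directory as fully extracted, which is an assumption: a partially unpacked archive would go unnoticed.

```python
import os

ARCHIVE = "cv-corpus-14.0-2023-06-23-hu.tar.gz"
EXTRACTED_DIR = "cv-corpus-14.0-2023-06-23"


def pending_steps(archive=ARCHIVE, extracted_dir=EXTRACTED_DIR):
    """Return the setup steps that still have to run in this VM session."""
    if os.path.isdir(extracted_dir):
        return []  # data already unpacked, nothing to do
    if os.path.isfile(archive):
        return ["extract"]  # archive is present, only unpacking is needed
    return ["download", "extract"]


print(pending_steps())
```

An empty list means both the `wget` and the `tar` cell can be skipped.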

### Setting up the environment
Since huggingsound requires `torch = ">=1.7,!=1.12.0,<1.13.0"`, we have to uninstall the preinstalled packages that depend on a newer version of torch. Then we can install a compatible version of PyTorch and the huggingsound package (which does the heavy lifting of training the model).

In [None]:
!pip uninstall torch torchdata torchtext torchvision torchaudio fastai -y # uninstall packages incompatible with huggingsound's torch requirement

Found existing installation: torch 2.0.1+cu118
Uninstalling torch-2.0.1+cu118:
  Successfully uninstalled torch-2.0.1+cu118
Found existing installation: torchdata 0.6.1
Uninstalling torchdata-0.6.1:
  Successfully uninstalled torchdata-0.6.1
Found existing installation: torchtext 0.15.2
Uninstalling torchtext-0.15.2:
  Successfully uninstalled torchtext-0.15.2
Found existing installation: torchvision 0.15.2+cu118
Uninstalling torchvision-0.15.2+cu118:
  Successfully uninstalled torchvision-0.15.2+cu118
Found existing installation: torchaudio 2.0.2+cu118
Uninstalling torchaudio-2.0.2+cu118:
  Successfully uninstalled torchaudio-2.0.2+cu118
Found existing installation: fastai 2.7.12
Uninstalling fastai-2.7.12:
  Successfully uninstalled fastai-2.7.12


In [None]:
!nvcc --version # check the CUDA toolkit version; the CUDA version of the PyTorch build should match it, or at least be close to it

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
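
The comparison described in the comment above can be made explicit. This is a rough sketch, assuming that "close" means "same CUDA major version"; that reading is an assumption, not a rule stated in this notebook. Both helper names are hypothetical, and the inputs are the `nvcc` line printed above and the version string of the PyTorch build installed in the next cell.

```python
import re


def toolkit_version(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def wheel_cuda_version(torch_version):
    """Extract (major, minor) from a local version tag such as '+cu116'."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    return (int(match.group(1)), int(match.group(2))) if match else None


nvcc_line = "Cuda compilation tools, release 11.8, V11.8.89"
toolkit = toolkit_version(nvcc_line)
wheel = wheel_cuda_version("1.12.1+cu116")

# Assumption: builds for the same CUDA major version count as "close enough".
print(toolkit, wheel, toolkit[0] == wheel[0])
```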


In [None]:
!pip install --extra-index-url https://download.pytorch.org/whl/ "torch==1.12.1+cu116"

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/
Collecting torch==1.12.1+cu116
  Downloading https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl (1904.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 GB 666.3 kB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-1.12.1+cu116
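
To confirm that an installed version actually falls inside huggingsound's range, the constraint `>=1.7,!=1.12.0,<1.13.0` can be checked by hand. The sketch below is a simplified check using only the standard library: it compares the first three numeric components and ignores the local tag (`+cu116`) and any pre-release suffix, which is an assumption that holds for the versions shown in this notebook. The function names are hypothetical.

```python
import re


def release_tuple(version):
    """Turn '1.12.1+cu116' into (1, 12, 1); the local '+cu116' part is ignored."""
    public = version.split("+")[0]
    parts = [int(p) for p in re.findall(r"\d+", public)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies_huggingsound_pin(version):
    """Check a version string against >=1.7,!=1.12.0,<1.13.0."""
    v = release_tuple(version)
    return (1, 7, 0) <= v < (1, 13, 0) and v != (1, 12, 0)


for candidate in ["2.0.1+cu118", "1.12.0", "1.12.1+cu116"]:
    print(candidate, satisfies_huggingsound_pin(candidate))
```

Only the last candidate passes. In the notebook, the same function could be called with `torch.__version__` after the install.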


In [None]:
!pip install accelerate

Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.2/244.2 kB 4.6 MB/s eta 0:00:00
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0


In [None]:
!pip install huggingsound

Collecting huggingsound
  Downloading huggingsound-0.1.6-py3-none-any.whl (28 kB)
Collecting datasets<3.0.0,>=2.6.1 (from huggingsound)
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 486.2/486.2 kB 8.3 MB/s eta 0:00:00
Collecting jiwer<3.0.0,>=2.5.1 (from huggingsound)
  Downloading jiwer-2.6.0-py3-none-any.whl (20 kB)
Collecting librosa<0.10.0,>=0.9.2 (from huggingsound)
  Downloading librosa-0.9.2-py3-none-any.whl (214 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 214.3/214.3 kB 9.9 MB/s eta 0:00:00
Collecting transformers<5.0.0,>=4.23.1 (from huggingsound)
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.4/7.4 MB 38.4 MB/s eta 0:00:00
Collecting dill<0.3.7,>=0.3.0 (from datasets<3.0.0,>=2.6.1->huggingsound)
  Downloading dill-0.3.6-py3-none-any.

### Importing packages and other boring things

In [None]:
import os
from datetime import datetime

import pandas as pd
import torch
from huggingsound import (
    ModelArguments,
    SpeechRecognitionModel,
    TokenSet,
    TrainingArguments,
)
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


Let's locate the `clips`, the actual recordings in `Common Voice`.

In [None]:
!ls

cv-corpus-14.0-2023-06-23  cv-corpus-14.0-2023-06-23-hu.tar.gz	sample_data


In [None]:
!ls cv-corpus-14.0-2023-06-23

hu


In [None]:
!ls cv-corpus-14.0-2023-06-23/hu

clips	 invalidated.tsv  reported.tsv	times.txt  validated.tsv
dev.tsv  other.tsv	  test.tsv	train.tsv


## Preprocessing data

In [None]:
clip_prefix = "cv-corpus-14.0-2023-06-23/hu/clips"

df = pd.read_csv("cv-corpus-14.0-2023-06-23/hu/validated.tsv", sep="\t")
df = df[df["down_votes"] == 0]  # use only validated data without down_votes
# extraction had problems: some clips are missing or are zero-byte files, so keep only usable ones
# df = df.sample(n=2000)
def clip_is_usable(name):
    path = os.path.join(clip_prefix, name)
    return os.path.isfile(path) and os.path.getsize(path) > 0


clips = [e for e in df["path"] if clip_is_usable(e)]
df = df[df["path"].isin(clips)]

print(df.shape)

(49420, 11)


In [None]:
train, test = train_test_split(df, test_size=0.15, random_state=42)  # fixed seed so the split is reproducible

In [None]:
trainx = zip(train["path"], train["sentence"])
testx = zip(test["path"], test["sentence"])


def clean_sentence(sentence):
    """Lowercase the sentence and keep only purely alphanumeric tokens (drops punctuation)."""
    wds = word_tokenize(sentence)
    return " ".join(wd.lower() for wd in wds if wd.isalnum())


train_data = [
    {"path": os.path.join(clip_prefix, e[0]), "transcription": clean_sentence(e[1])}
    for e in trainx
]
eval_data = [
    {"path": os.path.join(clip_prefix, e[0]), "transcription": clean_sentence(e[1])}
    for e in testx
]

## Model setup

In [None]:
dname = str(datetime.now())  # timestamp used as the output directory name
model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53", device=device)
output_dir = dname

tokens = [
    "a",
    "á",
    "b",
    "c",
    "d",
    "e",
    "é",
    "f",
    "g",
    "h",
    "i",
    "í",
    "j",
    "k",
    "l",
    "m",
    "n",
    "o",
    "ó",
    "ö",
    "ő",
    "p",
    "q",
    "r",
    "s",
    "t",
    "u",
    "ú",
    "ü",
    "ű",
    "v",
    "w",
    "x",
    "y",
    "z",
]
token_set = TokenSet(tokens)

INFO:huggingsound.speech_recognition.model:Loading model...


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading (…)rocessor_config.json:   0%|          | 0.00/212 [00:00<?, ?B/s]
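
`clean_sentence` keeps every token for which `str.isalnum()` is true, so digits (and letters from other alphabets) survive cleaning even though the token list contains only 35 lowercase Hungarian letters. The following is a minimal sketch of a check for such characters before fine-tuning. The helper name and the sample sentences are made up for illustration, and how huggingsound treats characters outside the token set is not covered here.

```python
TOKENS = set("aábcdeéfghiíjklmnoóöőpqrstuúüűvwxyz")  # the 35 letters of the token list


def uncovered_characters(transcriptions, tokens=TOKENS):
    """Collect characters (other than spaces) that no token covers."""
    seen = set()
    for text in transcriptions:
        seen.update(text)
    return seen - tokens - {" "}


sample = ["jó reggelt", "2023 nyara", "szép idő van"]  # made-up sentences
print(sorted(uncovered_characters(sample)))  # ['0', '2', '3']
```

In the notebook, the function could be called with `[d["transcription"] for d in train_data]`; a non-empty result points at sentences worth filtering or normalizing.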



In [None]:
training_args = TrainingArguments(
    learning_rate=3e-4,
    max_steps=5000,
    eval_steps=500,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
)
model_args = ModelArguments(
    activation_dropout=0.1,
    hidden_dropout=0.1,
)

## Fine-tune model

In [None]:
model.finetune(
    output_dir,
    train_data=train_data,
    eval_data=eval_data,
    token_set=token_set,
    training_args=training_args,
    model_args=model_args,
)

`use_fast` is set to `True` but the tokenizer class does not have a fast version.  Falling back to the slow version.
INFO:huggingsound.speech_recognition.model:Loading training data...
INFO:huggingsound.speech_recognition.model:Converting data format...
INFO:huggingsound.speech_recognition.model:Preparing data input and labels...


Map:   0%|          | 0/42007 [00:00<?, ? examples/s]

INFO:huggingsound.speech_recognition.model:Loading evaluation data...
INFO:huggingsound.speech_recognition.model:Converting data format...
INFO:huggingsound.speech_recognition.model:Preparing data input and labels...


Map:   0%|          | 0/7413 [00:00<?, ? examples/s]

INFO:huggingsound.speech_recognition.model:Starting fine-tuning process...
INFO:huggingsound.trainer:Getting dataset stats...
INFO:huggingsound.trainer:Training dataset size: 42007 samples, 61.07631385416626 hours
INFO:huggingsound.trainer:Evaluation dataset size: 7413 samples, 10.81067883680558 hours
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:huggingsound.trainer:Building trainer...
INFO:huggingsound.trainer:Starting training...


| Step | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|
| 500 | 478.0811 | 376.942657 | 0.760528 | 0.215755 |
| 1000 | 349.4335 | 274.278351 | 0.587026 | 0.151682 |
| 1500 | 308.4796 | 249.277695 | 0.532171 | 0.1399 |
| 2000 | 280.2542 | 232.439056 | 0.485477 | 0.128379 |
| 2500 | 270.2169 | 211.322311 | 0.468795 | 0.12593 |
| 3000 | 261.8612 | 196.620285 | 0.446186 | 0.115897 |
| 3500 | 251.7534 | 186.701126 | 0.446357 | 0.121253 |
| 4000 | 242.5334 | 173.930832 | 0.406139 | 0.106158 |
| 4500 | 217.7602 | 161.678665 | 0.420189 | 0.116815 |
| 5000 | 217.312 | 172.40834 | 0.41411 | 0.112872 |


INFO:huggingsound.speech_recognition.model:Loading fine-tuned model...


***** train metrics *****
  epoch                    =         1.19
  total_flos               = 7493464832GF
  train_loss               =     372.1765
  train_runtime            =   6:53:10.53
  train_samples            =        42007
  train_samples_per_second =        2.017
  train_steps_per_second   =        0.202
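
The reported train metrics can be cross-checked against the training arguments. The sketch below recomputes them from `max_steps=5000`, `per_device_train_batch_size=10`, the 42007 training samples, and the reported runtime. It assumes a single device and no gradient accumulation; the fact that the numbers match the log is consistent with that assumption. The helper names are hypothetical.

```python
def epochs_covered(max_steps, batch_size, num_samples):
    """Number of passes over the training set made in max_steps optimizer steps."""
    return max_steps * batch_size / num_samples


def to_seconds(hms):
    """Convert an 'H:MM:SS.ss' runtime string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


samples_seen = 5000 * 10
runtime = to_seconds("6:53:10.53")

print(round(epochs_covered(5000, 10, 42007), 2))  # 1.19, matches `epoch`
print(round(samples_seen / runtime, 3))  # 2.017, matches `train_samples_per_second`
print(round(5000 / runtime, 3))  # 0.202, matches `train_steps_per_second`
```

In other words, 5000 steps at this batch size amount to a little more than one pass over the training set.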


In [None]:
!ls

'2023-07-21 07:06:56.463229'   cv-corpus-14.0-2023-06-23-hu.tar.gz
 cv-corpus-14.0-2023-06-23     sample_data
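
The output directory name contains a space and colons because it comes from `str(datetime.now())`, which is why it has to be quoted or escaped in the shell commands that follow. The following is a minimal sketch of a more shell-friendly alternative; the function name and the chosen format are assumptions, not what this notebook run used.

```python
from datetime import datetime


def run_dir_name(now):
    """Format a timestamp without spaces or colons."""
    return now.strftime("%Y-%m-%d_%H-%M-%S")


print(run_dir_name(datetime(2023, 7, 21, 7, 6, 56)))  # 2023-07-21_07-06-56
```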


In [None]:
!ls '2023-07-21 07:06:56.463229'

all_results.json  checkpoint-4000	    special_tokens_map.json
checkpoint-1000   checkpoint-4500	    tokenizer_config.json
checkpoint-1500   checkpoint-500	    trainer_state.json
checkpoint-2000   checkpoint-5000	    training_args.bin
checkpoint-2500   config.json		    train_results.json
checkpoint-3000   preprocessor_config.json  vocab.json
checkpoint-3500   pytorch_model.bin
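
With one checkpoint saved per evaluation step, the evaluation table from the training log can be used to decide which one to keep. The notebook does not state a selection criterion, so the sketch below simply reports the best step for each metric; the values are copied from the log above, and `best_step` is a hypothetical helper.

```python
import csv
import io

METRICS = """Step,Training Loss,Validation Loss,Wer,Cer
500,478.0811,376.942657,0.760528,0.215755
1000,349.4335,274.278351,0.587026,0.151682
1500,308.4796,249.277695,0.532171,0.1399
2000,280.2542,232.439056,0.485477,0.128379
2500,270.2169,211.322311,0.468795,0.12593
3000,261.8612,196.620285,0.446186,0.115897
3500,251.7534,186.701126,0.446357,0.121253
4000,242.5334,173.930832,0.406139,0.106158
4500,217.7602,161.678665,0.420189,0.116815
5000,217.312,172.40834,0.41411,0.112872
"""


def best_step(table, metric):
    """Return the step with the lowest value of the given metric column."""
    rows = csv.DictReader(io.StringIO(table))
    best = min(rows, key=lambda row: float(row[metric]))
    return int(best["Step"])


for metric in ("Wer", "Cer", "Validation Loss"):
    print(metric, best_step(METRICS, metric))
```

Word and character error rate are both lowest at step 4000, while validation loss is lowest at step 4500. This is consistent with `checkpoint-4000` being the checkpoint copied to Drive at the end of the notebook.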


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!ls '/content/gdrive'

MyDrive  Shareddrives


In [None]:
!cp  "2023-07-21 07:06:56.463229/pytorch_model.bin" "/content/gdrive/My Drive/Colab Notebooks/models/pytorch_model.bin"

In [None]:
!cp 2023-07-21\ 07:06:56.463229/*.json "/content/gdrive/My Drive/Colab Notebooks/models/"

In [None]:
!cp 2023-07-21\ 07:06:56.463229/checkpoint-4000/* "/content/gdrive/My Drive/Colab Notebooks/models/4000/"
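
The `cp` commands above only succeed if the destination folders already exist on Drive. The following is a minimal Python sketch that creates the destination first and copies the matching files. `copy_matching` is a hypothetical helper, not part of huggingsound or Colab.

```python
import glob
import os
import shutil


def copy_matching(src_dir, pattern, dst_dir):
    """Copy files in src_dir matching pattern into dst_dir, creating it if needed."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    # glob.escape protects special characters in the directory name itself
    for path in sorted(glob.glob(os.path.join(glob.escape(src_dir), pattern))):
        if os.path.isfile(path):
            copied.append(shutil.copy2(path, dst_dir))
    return copied
```

In the notebook this could be called as `copy_matching(output_dir, "*.json", "/content/gdrive/My Drive/Colab Notebooks/models")`, and likewise with `os.path.join(output_dir, "checkpoint-4000")` and `"*"` for the checkpoint files.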