# Training a speech-to-text model using Huggingsound

<a target="_blank" href="https://colab.research.google.com/github/Koffair/colab_pipelines/blob/main/notebooks/03_train_s2t.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**WARNING**
- Open the notebook in Colab by clicking on the "Open in Colab" badge at the top of the notebook
- Save a copy of the notebook to your Google Drive by clicking "File" > "Save a Copy to Drive"
- This notebook assumes that you have downloaded the necessary data and preprocessed it by following the instructions in ```01_train_language_models.ipynb```
- Set the runtime type to "GPU" by clicking "Runtime" > "Change runtime type" and selecting "GPU" from "Hardware accelerator"
- If you have a Colab subscription, set "Runtime class" to "Premium" for better performance
- If you have lots of training data, set "Runtime shape" to "High-RAM"
- Without a GPU, training will take significantly longer


## Setup
### Getting training data
First, we download the Hungarian portion of the Common Voice dataset. **NOTE**: Colab VMs are not persistent. If your VM restarts, you will probably have to download and
uncompress the data again.

In [1]:
!wget https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-13.0-2023-03-09/cv-corpus-13.0-2023-03-09-hu.tar.gz

--2023-04-13 12:28:43--  https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-13.0-2023-03-09/cv-corpus-13.0-2023-03-09-hu.tar.gz
Resolving mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com (mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com)... 52.92.194.98, 52.92.176.130, 52.218.246.17, ...
Connecting to mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com (mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com)|52.92.194.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1771017252 (1.6G) [application/octet-stream]
Saving to: ‘cv-corpus-13.0-2023-03-09-hu.tar.gz’


2023-04-13 12:30:25 (16.6 MB/s) - ‘cv-corpus-13.0-2023-03-09-hu.tar.gz’ saved [1771017252/1771017252]
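
Since the archive may have to be fetched again after a VM restart, a quick size check can tell whether a copy already on disk is complete. This is only a minimal sketch: the helper name `archive_is_complete` is hypothetical, and the expected size is simply the `Length` value reported by `wget` above.

```python
import os

# Byte count taken from the "Length" line of the wget output above.
EXPECTED_BYTES = 1771017252


def archive_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


print(archive_is_complete("cv-corpus-13.0-2023-03-09-hu.tar.gz"))
```

A matching size does not prove the content is intact, but a mismatch is a cheap signal that the download should be repeated.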



In [2]:
!tar -xvf cv-corpus-13.0-2023-03-09-hu.tar.gz

Streaming output truncated to the last 5000 lines.
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233130.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233152.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233153.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233154.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233155.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233156.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233157.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233158.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233159.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233160.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233163.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233164.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233165.mp3
cv-corpus-13.0-2023-03-09/hu/clips/common_voice_hu_37233166.mp3
cv-corpus-13.0-2023-03-09/hu/clips/comm

### Setting up the environment
huggingsound pins `torch = ">=1.7,!=1.12.0,<1.13.0"`, so we first have to uninstall the preinstalled packages that depend on a newer version of torch. Then we can install a compatible version of PyTorch and the huggingsound package (which does the heavy lifting of training the model).
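
To see which torch versions pass the constraint, here is a simplified sketch of the check. It is not how pip evaluates version specifiers (it ignores pre-releases, for example), and the helper names `parse_version` and `torch_version_ok` are hypothetical.

```python
def parse_version(version):
    """Turn '1.12.1+cu116' into (1, 12, 1); the local '+cu116' tag is ignored."""
    numbers = [int(part) for part in version.split("+")[0].split(".")]
    numbers += [0] * (3 - len(numbers))  # pad '1.7' to (1, 7, 0)
    return tuple(numbers)


def torch_version_ok(version):
    """Check a version against '>=1.7,!=1.12.0,<1.13.0'."""
    v = parse_version(version)
    return (1, 7, 0) <= v < (1, 13, 0) and v != (1, 12, 0)


for candidate in ["2.0.0+cu118", "1.12.0", "1.12.1+cu116"]:
    print(candidate, torch_version_ok(candidate))
```

Of these three, only `1.12.1+cu116` passes. The preinstalled `2.0.0+cu118` is too new, and `1.12.0` is excluded explicitly.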

In [3]:
!pip uninstall torch torchdata torchtext torchvision torchaudio fastai -y # uninstall packages incompatible with huggingsound

Found existing installation: torch 2.0.0+cu118
Uninstalling torch-2.0.0+cu118:
  Successfully uninstalled torch-2.0.0+cu118
Found existing installation: torchdata 0.6.0
Uninstalling torchdata-0.6.0:
  Successfully uninstalled torchdata-0.6.0
Found existing installation: torchtext 0.15.1
Uninstalling torchtext-0.15.1:
  Successfully uninstalled torchtext-0.15.1
Found existing installation: torchvision 0.15.1+cu118
Uninstalling torchvision-0.15.1+cu118:
  Successfully uninstalled torchvision-0.15.1+cu118
Found existing installation: torchaudio 2.0.1+cu118
Uninstalling torchaudio-2.0.1+cu118:
  Successfully uninstalled torchaudio-2.0.1+cu118
Found existing installation: fastai 2.7.12
Uninstalling fastai-2.7.12:
  Successfully uninstalled fastai-2.7.12


In [4]:
!nvcc --version # check the CUDA version; the CUDA build of PyTorch should match it, or at least be close to it

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [5]:
!pip install --extra-index-url https://download.pytorch.org/whl/ "torch==1.12.1+cu116"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/
Collecting torch==1.12.1+cu116
  Downloading https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp39-cp39-linux_x86_64.whl (1904.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 GB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: torch
Successfully installed torch-1.12.1+cu116


In [6]:
!pip install huggingsound

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting huggingsound
  Downloading huggingsound-0.1.6-py3-none-any.whl (28 kB)
Collecting librosa<0.10.0,>=0.9.2
  Downloading librosa-0.9.2-py3-none-any.whl (214 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 214.3/214.3 kB 22.0 MB/s eta 0:00:00
Collecting jiwer<3.0.0,>=2.5.1
  Downloading jiwer-2.6.0-py3-none-any.whl (20 kB)
Collecting transformers<5.0.0,>=4.23.1
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.8/6.8 MB[0m [31m105.4 MB/s[0m eta [36m0:00:00[0m
Collecting datasets<3.0.0,>=2.6.1
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.7/468.7 kB 46.9 MB/s eta 0:00:00
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)


### Importing packages and other boring things

In [7]:
import os
from datetime import datetime

import pandas as pd
import torch
from huggingsound import (
    ModelArguments,
    SpeechRecognitionModel,
    TokenSet,
    TrainingArguments,
)
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


Let's locate the `clips`, the actual recordings in `Common Voice`.

In [10]:
!ls

cv-corpus-13.0-2023-03-09  cv-corpus-13.0-2023-03-09-hu.tar.gz	sample_data


In [11]:
!ls cv-corpus-13.0-2023-03-09

hu


In [12]:
!ls cv-corpus-13.0-2023-03-09/hu

clips	 invalidated.tsv  reported.tsv	train.tsv
dev.tsv  other.tsv	  test.tsv	validated.tsv


## Preprocessing data

In [13]:
clip_prefix = "cv-corpus-13.0-2023-03-09/hu/clips"

df = pd.read_csv("cv-corpus-13.0-2023-03-09/hu/validated.tsv", sep="\t")
df = df[df["down_votes"] == 0]  # use only validated data without down_votes
# extraction had problems: some clips are missing or are zero-byte files
# df = df.sample(n=2000)
clips = df["path"]
clips = [e for e in clips if os.path.isfile(os.path.join(clip_prefix, e)) and os.stat(os.path.join(clip_prefix, e)).st_size != 0]
df = df[df["path"].isin(clips)]

print(df.shape)

(28891, 11)
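
The filter above silently drops unusable clips. To see how many clips are missing and how many are empty, a small report like the following could be run first. This is a sketch only; `clip_status` and `summarize_clips` are hypothetical helper names.

```python
import os
from collections import Counter


def clip_status(path):
    """Classify a clip as 'missing', 'empty' (zero bytes) or 'ok'."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


def summarize_clips(clip_dir, filenames):
    """Count how many of the listed clips fall into each status."""
    return Counter(clip_status(os.path.join(clip_dir, name)) for name in filenames)
```

In this notebook it could be called as `summarize_clips(clip_prefix, df["path"])` before the filtering step. If most clips turn out to be missing, re-running the extraction is probably a better fix than filtering.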


In [14]:
train, test = train_test_split(df, test_size=0.15)
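
With `test_size=0.15`, the 28891 rows are split into the 24557 training and 4334 evaluation samples that show up later in the fine-tuning log. The arithmetic can be sketched as follows, assuming the test share is rounded up and the rest goes to training (which agrees with the logged numbers). The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes of a train/test split when the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(28891, 0.15))  # (24557, 4334)
```

Note that the split is random on every run. Passing a fixed `random_state` to `train_test_split` makes it repeatable across VM restarts.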

In [15]:
trainx = zip(train["path"], train["sentence"])
testx = zip(test["path"], test["sentence"])


def clean_sentence(sentence):
    wds = word_tokenize(sentence)
    return " ".join([wd.lower() for wd in wds if wd.isalnum()])


train_data = [
    {"path": os.path.join(clip_prefix, e[0]), "transcription": clean_sentence(e[1])}
    for e in trainx
]
eval_data = [
    {"path": os.path.join(clip_prefix, e[0]), "transcription": clean_sentence(e[1])}
    for e in testx
]
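
`clean_sentence` keeps only tokens for which `str.isalnum()` is true. This removes punctuation, but it also removes any token that contains a hyphen or an apostrophe, so the whole word disappears from the transcription even though it is spoken in the clip. The sketch below shows the filter on a hand-made token list. The list is only a stand-in for the output of `word_tokenize`, and the helper name `split_kept_dropped` is hypothetical.

```python
def split_kept_dropped(tokens):
    """Apply the same isalnum() filter as clean_sentence and report both sides."""
    kept = [token.lower() for token in tokens if token.isalnum()]
    dropped = [token for token in tokens if not token.isalnum()]
    return kept, dropped


# Hand-written tokens; the real notebook gets them from word_tokenize.
tokens = ["Az", "e-mail", "megérkezett", "."]
kept, dropped = split_kept_dropped(tokens)
print(" ".join(kept))  # az megérkezett
print(dropped)         # ['e-mail', '.']
```

If such words are frequent in your data, stripping the punctuation characters from inside the token may work better than dropping the token.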

## Model setup

In [16]:
dname = str(datetime.now())
model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53", device=device)
output_dir = f"{dname}"

tokens = [
    "a",
    "á",
    "b",
    "c",
    "d",
    "e",
    "é",
    "f",
    "g",
    "h",
    "i",
    "í",
    "j",
    "k",
    "l",
    "m",
    "n",
    "o",
    "ó",
    "ö",
    "ő",
    "p",
    "q",
    "r",
    "s",
    "t",
    "u",
    "ú",
    "ü",
    "ű",
    "v",
    "w",
    "x",
    "y",
    "z",
]
token_set = TokenSet(tokens)

INFO:huggingsound.speech_recognition.model:Loading model...


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2ForCTC: ['quantizer.weight_proj.bias', 'quantizer.codevectors', 'project_hid.bias', 'quantizer.weight_proj.weight', 'project_q.bias', 'project_hid.weight', 'project_q.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to u

Downloading (…)rocessor_config.json:   0%|          | 0.00/212 [00:00<?, ?B/s]
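
The token list contains only lowercase Hungarian letters, while `clean_sentence` also lets through digits and letters from other alphabets (anything `isalnum()` accepts). A quick check like the one below can show which characters in the transcriptions are not covered by the token list. This is a sketch: `out_of_vocab_chars` is a hypothetical helper, and it assumes that the space is handled separately as a word delimiter.

```python
from collections import Counter

# Same 35 letters as the tokens list above.
TOKENS = list("aábcdeéfghiíjklmnoóöőpqrstuúüűvwxyz")


def out_of_vocab_chars(transcriptions, tokens):
    """Count characters that appear in the transcriptions but not in tokens."""
    allowed = set(tokens) | {" "}  # assumption: spaces are handled separately
    return Counter(
        ch for text in transcriptions for ch in text if ch not in allowed
    )


print(out_of_vocab_chars(["3 alma", "garçon"], TOKENS))
```

In this notebook the input could be `[d["transcription"] for d in train_data]`. A non-empty result points to sentences that should be filtered out or normalized before training.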



In [17]:
training_args = TrainingArguments(
    learning_rate=3e-4,
    max_steps=3000,
    eval_steps=500,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
)
model_args = ModelArguments(
    activation_dropout=0.1,
    hidden_dropout=0.1,
) 

## Fine-tune model

In [18]:
model.finetune(
    output_dir, 
    train_data=train_data, 
    eval_data=eval_data,
    token_set=token_set, 
    training_args=training_args,
    model_args=model_args,
)

`use_fast` is set to `True` but the tokenizer class does not have a fast version.  Falling back to the slow version.
INFO:huggingsound.speech_recognition.model:Loading training data...
INFO:huggingsound.speech_recognition.model:Converting data format...
INFO:huggingsound.speech_recognition.model:Preparing data input and labels...


Map:   0%|          | 0/24557 [00:00<?, ? examples/s]

INFO:huggingsound.speech_recognition.model:Loading evaluation data...
INFO:huggingsound.speech_recognition.model:Converting data format...
INFO:huggingsound.speech_recognition.model:Preparing data input and labels...


Map:   0%|          | 0/4334 [00:00<?, ? examples/s]

INFO:huggingsound.speech_recognition.model:Starting fine-tuning process...
INFO:huggingsound.trainer:Getting dataset stats...
INFO:huggingsound.trainer:Training dataset size: 24557 samples, 34.64255307291643 hours
INFO:huggingsound.trainer:Evaluation dataset size: 4334 samples, 6.0941796180555725 hours
Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2ForCTC: ['quantizer.weight_proj.bias', 'quantizer.codevectors', 'project_hid.bias', 'quantizer.weight_proj.weight', 'project_q.bias', 'project_hid.weight', 'project_q.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassificat

| Step | Training Loss | Validation Loss | WER | CER |
|------|---------------|-----------------|-----|-----|
| 500 | 580.2699 | 442.697968 | 0.86291 | 0.268527 |
| 1000 | 424.3729 | 315.527588 | 0.761519 | 0.201067 |
| 1500 | 364.8653 | 286.474487 | 0.679187 | 0.182392 |
| 2000 | 336.8181 | 245.54921 | 0.621818 | 0.162022 |
| 2500 | 310.5879 | 228.918411 | 0.55714 | 0.1475 |
| 3000 | 282.8769 | 218.524948 | 0.52529 | 0.13912 |
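
The last two columns are the word error rate (WER) and the character error rate (CER): the edit distance between the reference and the predicted transcription, divided by the reference length, counted in words or in characters. The sketch below follows that textbook definition. It is not the implementation huggingsound uses internally, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "jó reggelt kívánok"
hypothesis = "jó regelt kívánok"   # one missing letter
print(round(wer(reference, hypothesis), 3))  # 0.333
print(round(cer(reference, hypothesis), 3))  # 0.056
```

A single wrong letter makes the whole word count as an error, which is why the WER in the log is much higher than the CER.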


INFO:huggingsound.speech_recognition.model:Loading fine-tuned model...


***** train metrics *****
  epoch                    =         1.22
  total_flos               = 4365507556GF
  train_loss               =     516.9127
  train_runtime            =   1:10:33.58
  train_samples            =        24557
  train_samples_per_second =        7.086
  train_steps_per_second   =        0.709
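
These numbers follow from the training arguments: 3000 steps with a batch size of 10 means 30000 samples seen, a bit more than one pass over the 24557 training samples. The sketch below reproduces the arithmetic, assuming a single GPU and no gradient accumulation. The helper `training_summary` is hypothetical.

```python
def training_summary(max_steps, batch_size, train_samples, runtime_seconds):
    """Derive epoch count and throughput from the training settings."""
    samples_seen = max_steps * batch_size
    return {
        "epoch": samples_seen / train_samples,
        "train_samples_per_second": samples_seen / runtime_seconds,
        "train_steps_per_second": max_steps / runtime_seconds,
    }


runtime = 1 * 3600 + 10 * 60 + 33.58  # train_runtime 1:10:33.58 in seconds
summary = training_summary(3000, 10, 24557, runtime)
for name, value in summary.items():
    print(f"{name} = {value:.3f}")
```

The same arithmetic can be used to choose `max_steps` for a different dataset size or batch size.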


In [19]:
!ls

'2023-04-13 12:36:44.374260'   cv-corpus-13.0-2023-03-09-hu.tar.gz
 cv-corpus-13.0-2023-03-09     sample_data


In [20]:
!ls '2023-04-13 12:36:44.374260'/

all_results.json  checkpoint-2500  preprocessor_config.json  trainer_state.json
checkpoint-1000   checkpoint-3000  pytorch_model.bin	     training_args.bin
checkpoint-1500   checkpoint-500   special_tokens_map.json   train_results.json
checkpoint-2000   config.json	   tokenizer_config.json     vocab.json
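
The output directory name comes from `str(datetime.now())`, so it contains a space and colons and has to be quoted or escaped in every shell command. A hypothetical alternative is to format the timestamp without those characters. The helper name `run_dir_name` is an assumption and is not used elsewhere in this notebook.

```python
from datetime import datetime


def run_dir_name(now=None):
    """Timestamp-based directory name without spaces or colons."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")


print(run_dir_name(datetime(2023, 4, 13, 12, 36, 44)))  # 2023-04-13_12-36-44
```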


In [21]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [27]:
!ls '/content/gdrive'

MyDrive  Shareddrives


In [28]:
!cp  "2023-04-13 12:36:44.374260/pytorch_model.bin" "/content/gdrive/My Drive/Colab Notebooks/models/pytorch_model.bin"

In [30]:
!cp 2023-04-13\ 12:36:44.374260/*.json "/content/gdrive/My Drive/Colab Notebooks/models/"

In [31]:
!cp -r "2023-04-13 12:36:44.374260/checkpoint-1000" "/content/gdrive/My Drive/Colab Notebooks/models/"

In [32]:
!cp -r "2023-04-13 12:36:44.374260/checkpoint-1500" "/content/gdrive/My Drive/Colab Notebooks/models/"

In [33]:
!cp -r "2023-04-13 12:36:44.374260/checkpoint-2000" "/content/gdrive/My Drive/Colab Notebooks/models/"

In [34]:
!cp -r "2023-04-13 12:36:44.374260/checkpoint-2500" "/content/gdrive/My Drive/Colab Notebooks/models/"

In [35]:
!cp -r "2023-04-13 12:36:44.374260/checkpoint-500" "/content/gdrive/My Drive/Colab Notebooks/models/"
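
The copy commands above can also be written as one loop in Python, which avoids the shell quoting of the directory name. This is a sketch with a hypothetical helper, `copy_run_outputs`. Unlike the cells above, it copies everything in the run directory, including `checkpoint-3000` and `training_args.bin`.

```python
import os
import shutil


def copy_run_outputs(run_dir, dest_dir):
    """Copy every file and checkpoint directory from run_dir into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(run_dir)):
        src = os.path.join(run_dir, name)
        dst = os.path.join(dest_dir, name)
        if os.path.isdir(src):
            # Checkpoint directories are copied recursively.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
        copied.append(name)
    return copied
```

In this notebook it could be called as `copy_run_outputs(output_dir, "/content/gdrive/My Drive/Colab Notebooks/models")` after mounting the drive.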