# Master Thesis: First Version of the STT Model

**Author**: Karin Thommen

**Date**: March 2023


---

**Content of the Notebook**: Train and test the STT model (spontaneous and prepared speech)

---
**References:**
- https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLS_R_on_Common_Voice.ipynb#scrollTo=72737oog2F6U

## Import and Setup

In [None]:
%%capture
!pip install torchaudio
!pip install librosa
!pip install jiwer
!pip install torch
!pip install datasets
!pip install transformers==4.28.0
!pip install audio-metadata

In [None]:
pip install "dill<0.3.5"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Installing collected packages: dill
  Attempting uninstall: dill
    Found existing installation: dill 0.3.6
    Uninstalling dill-0.3.6:
      Successfully uninstalled dill-0.3.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
multiprocess 0.70.14 requires dill>=0.3.6, but you have dill 0.3.4 which is incompatible.
Successfully installed dill-0.3.4


In [None]:
# standard library
import json
import os
import pickle
import random
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

# third-party packages
import audio_metadata
import dill
import IPython.display as ipd
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import Audio, ClassLabel, Dataset, load_dataset, load_metric
from datasets.fingerprint import Hasher
from huggingface_hub import notebook_login
from IPython.display import display, HTML
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

In [None]:
transformers.__version__

'4.28.0'

In [None]:
# login to huggingface account for data
notebook_login()


In [None]:
# load dataset from huggingface (after uploading it via local machine to huggingface)
dataset = load_dataset("karinthommen/sds200")

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/karinthommen___parquet/karinthommen--sds200-a1893d366d27240a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Generating train split:   0%|          | 0/135271 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3638 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3636 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/karinthommen___parquet/karinthommen--sds200-a1893d366d27240a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.



# Train Model on Prepared Speech

## Prepare Data

In [None]:
# check if data loading worked
dataset["train"][0]

{'audio': {'path': '09966c7743291ccf1129c8136143bf5a6132947fe352795bc6d5456a3afeb4de.mp3',
  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          1.58690691e-05, -6.36559753e-06, -1.80013558e-05]),
  'sampling_rate': 32000},
 'transcription': 'Dadurch wird auch der Lebensraum von vielen Tier- und Pflanzenarten zerstört.',
 'canton': None,
 'duration': 6.732}

In [None]:
# keep only clips of at most 6 seconds, since the model could have problems with longer audio
dataset["train"] = dataset["train"].filter(lambda example: example["duration"] <= 6)

Filter:   0%|          | 0/135271 [00:00<?, ? examples/s]

In [None]:
# remove columns from dataset that we do not need at the moment
dataset = dataset.remove_columns(["canton", "duration"])

In [None]:
# reference of some code snippets: https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLS_R_on_Common_Voice.ipynb#scrollTo=72737oog2F6U

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))
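
The loop above redraws an index whenever it hits a duplicate. A minimal sketch of the same selection using `random.sample`, which draws without replacement; `pick_random_indices` is a hypothetical helper name and not part of the notebook.

```python
import random
from typing import List


def pick_random_indices(dataset_len: int, num_examples: int = 10) -> List[int]:
    """Return `num_examples` distinct indices in the range [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no duplicate check is needed
    return random.sample(range(dataset_len), num_examples)


picks = pick_random_indices(100, 10)
print(picks)
```

The resulting list could be passed to `dataset[picks]` in the same way as the `picks` list built above.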

### Remove Special Characters and clean all datasets

In [None]:
# Remove special characters

chars = '[\'̈\’\•\‹\₂\›\–\²\½\‑\°\`\&\(\)\*\+\/\=\,\?\.\!\-\;\:\"\“\%\‘\”\�\']'

def remove_special_chars(batch):
  batch["sentence"] = re.sub(chars, '', batch["transcription"]).lower()
  return batch

# do for whole datasets
dataset = dataset.map(remove_special_chars)

Map:   0%|          | 0/113094 [00:00<?, ? examples/s]

Map:   0%|          | 0/3638 [00:00<?, ? examples/s]

Map:   0%|          | 0/3636 [00:00<?, ? examples/s]
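
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same `re.sub(...).lower()` pattern to single strings. The character class `PUNCTUATION` is a shortened, hypothetical subset of the `chars` pattern above, chosen only for readability.

```python
import re

# Hypothetical, shortened subset of the notebook's `chars` character class
PUNCTUATION = r'[,?.!\-;:"()]'


def clean_transcription(text: str) -> str:
    """Delete the listed punctuation and lowercase the result."""
    return re.sub(PUNCTUATION, "", text).lower()


print(clean_transcription("Am Spengler-Cup fokussiert sich vieles auf mich."))
print(clean_transcription("Wie stehen die Chancen dafür?"))
```

Because matched characters are deleted rather than replaced by a space, hyphenated compounds are merged into one word ("Spengler-Cup" becomes "spenglercup").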

In [None]:
show_random_elements(dataset["train"].remove_columns(["audio"]))

| | transcription | sentence |
|---|---|---|
| 0 | Seither seien die internen Prozesse angepasst worden. | seither seien die internen prozesse angepasst worden |
| 1 | Sie hätte in derselben Nacht noch Paris und London kontaktieren müssen. | sie hätte in derselben nacht noch paris und london kontaktieren müssen |
| 2 | Wenig entfernt steht das Chrysler Building. | wenig entfernt steht das chrysler building |
| 3 | Wie stehen die Chancen dafür? | wie stehen die chancen dafür |
| 4 | Am Spengler-Cup fokussiert sich vieles auf mich. | am spenglercup fokussiert sich vieles auf mich |
| 5 | Noch immer ist Sascha sehr zurückhaltend. | noch immer ist sascha sehr zurückhaltend |
| 6 | Dieser Friedhof wird nach wie vor genutzt. | dieser friedhof wird nach wie vor genutzt |
| 7 | Beim Fahrer konnte nur mehr der Tod festgestellt werden. | beim fahrer konnte nur mehr der tod festgestellt werden |
| 8 | Dann fuhr der rote PW weiter. | dann fuhr der rote pw weiter |
| 9 | Nun will der Student mit seinem Fall vor das Bundesgericht ziehen. | nun will der student mit seinem fall vor das bundesgericht ziehen |


### Get Vocabulary

In [None]:
dataset["train"][0]

{'audio': {'path': '09d45d91d4a03720071316419bbf578c677bd4f72722ed8fa14613c244430e6c.mp3',
  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
          9.34452293e-10, -8.30981450e-10,  1.34210865e-10]),
  'sampling_rate': 32000},
 'transcription': 'Karten dieser Bezirke gab es bisher aber nicht.',
 'sentence': 'karten dieser bezirke gab es bisher aber nicht'}

In [None]:
def extract_chars(batch):
  all_text = " ".join(batch["sentence"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

# extract the characters from all datasets
vocab = dataset.map(extract_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=['audio', 'transcription', 'sentence'])

Map:   0%|          | 0/113094 [00:00<?, ? examples/s]

Map:   0%|          | 0/3638 [00:00<?, ? examples/s]

Map:   0%|          | 0/3636 [00:00<?, ? examples/s]

In [None]:
vocab

DatasetDict({
    train: Dataset({
        features: ['vocab', 'all_text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['vocab', 'all_text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['vocab', 'all_text'],
        num_rows: 1
    })
})

In [None]:
# build the vocabulary from the train and test splits (the validation split is not included here), then enumerate the sorted characters in a dictionary
vocab_list = list(set(vocab["train"]["vocab"][0]) | set(vocab["test"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
vocab_dict

{' ': 0,
 '0': 1,
 '1': 2,
 '2': 3,
 '3': 4,
 '4': 5,
 '5': 6,
 '6': 7,
 '7': 8,
 '8': 9,
 '9': 10,
 'a': 11,
 'b': 12,
 'c': 13,
 'd': 14,
 'e': 15,
 'f': 16,
 'g': 17,
 'h': 18,
 'i': 19,
 'j': 20,
 'k': 21,
 'l': 22,
 'm': 23,
 'n': 24,
 'o': 25,
 'p': 26,
 'q': 27,
 'r': 28,
 's': 29,
 't': 30,
 'u': 31,
 'v': 32,
 'w': 33,
 'x': 34,
 'y': 35,
 'z': 36,
 '\xad': 37,
 'à': 38,
 'á': 39,
 'ä': 40,
 'ç': 41,
 'è': 42,
 'é': 43,
 'ë': 44,
 'í': 45,
 'ô': 46,
 'ö': 47,
 'ü': 48}

In [None]:
# Run this cell only once after the cell above. Do not rerun it without rerunning the cell above first.

vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "] # replace the space character in the vocabulary with a character that is more visible
vocab_dict["[UNK]"] = len(vocab_dict) # add unknown token
vocab_dict["[PAD]"] = len(vocab_dict) # add padding token, which also serves as the blank token (important for the CTC algorithm)
print("Length of the vocabulary:", len(vocab_dict))

Length of the vocabulary: 51
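
The vocabulary steps above (collect characters, enumerate them, swap the space for `|`, append the special tokens) can be sketched as one function on a toy corpus. `build_vocab` is a hypothetical helper, not the notebook's code; since it builds a fresh dictionary on every call, it can be rerun safely.

```python
from typing import Dict, List


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Build a character-level CTC vocabulary from cleaned sentences."""
    # every distinct character, including the space between sentences
    chars = sorted(set(" ".join(sentences)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # use a visible word delimiter in place of the space character
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # special tokens get the next free ids
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


toy_vocab = build_vocab(["ab a", "cb"])
print(toy_vocab)
print("Length of the vocabulary:", len(toy_vocab))
```

On the toy corpus this gives four character entries plus the two special tokens, mirroring how the 49 characters above become a vocabulary of 51.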


In [None]:
# save vocabulary file for later usage
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

### Load tokenizer, feature extractor and processor

In [None]:
# load vocabulary with Wav2Vec Tokenizer
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("./", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
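
To illustrate why the vocabulary needs a blank token and a word delimiter, here is a simplified sketch of greedy CTC decoding: repeated tokens are collapsed, blanks are dropped, and `|` becomes a space. This is an illustration under those assumptions, not the actual `Wav2Vec2CTCTokenizer` implementation; the function name and the toy vocabulary are hypothetical.

```python
from typing import Dict, List


def ctc_greedy_decode(
    ids: List[int],
    id_to_token: Dict[int, str],
    blank: str = "[PAD]",
    delimiter: str = "|",
) -> str:
    """Collapse repeats, remove blanks, and turn the delimiter into spaces."""
    tokens = [id_to_token[i] for i in ids]
    # 1) merge consecutive duplicates (one character spans several frames)
    collapsed = [t for k, t in enumerate(tokens) if k == 0 or t != tokens[k - 1]]
    # 2) drop the blank token, which separates genuine double letters
    chars = [t for t in collapsed if t != blank]
    # 3) map the word delimiter back to a space
    return "".join(chars).replace(delimiter, " ").strip()


toy_vocab = {0: "|", 1: "a", 2: "l", 3: "e", 4: "[PAD]"}
frame_ids = [1, 1, 2, 4, 2, 3, 0, 0, 1]  # a a l [PAD] l e | | a
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

The blank between the two `l` frames is what keeps the double letter in "alle" from collapsing into a single `l`.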


In [None]:
# instantiate feature extractor from Wav2Vec2 Feature Extractor with sampling rate of 16kHz.
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)
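
With `do_normalize=True`, each waveform is rescaled to zero mean and unit variance before it reaches the model. A minimal pure-Python sketch of that operation follows; the function name and the epsilon value are assumptions for illustration, not taken from the notebook.

```python
import math
from typing import List


def zero_mean_unit_variance(samples: List[float], eps: float = 1e-7) -> List[float]:
    """Rescale a waveform so its mean is about 0 and its variance about 1."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps (an assumed value) guards against division by zero for silent clips
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]


normalized = zero_mean_unit_variance([0.0, 0.5, -0.5, 1.0])
print(normalized)
```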

In [None]:
# after loading the tokenizer and the feature extractor, both get wrapped in a processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### Prepare audio clips

In [None]:
dataset["train"][0]["audio"]

{'path': '09d45d91d4a03720071316419bbf578c677bd4f72722ed8fa14613c244430e6c.mp3',
 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         9.34452293e-10, -8.30981450e-10,  1.34210865e-10]),
 'sampling_rate': 32000}

In [None]:
# resample audio to 16kHz.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

In [None]:
rand_int = random.randint(0, len(dataset["train"])-1)

print(dataset["train"][rand_int]["sentence"])
ipd.Audio(data=dataset["train"][rand_int]["audio"]["array"], autoplay=True, rate=16000)

am barren gewann er die bronzemedaille


In [None]:
rand_int = random.randint(0, len(dataset["train"])-1)

print("Target text:", dataset["train"][rand_int]["sentence"])
print("Input array shape:", dataset["train"][rand_int]["audio"]["array"].shape)
print("Sampling rate:", dataset["train"][rand_int]["audio"]["sampling_rate"])

Target text: darüber hinaus ist lewis unermüdlich auf tour
Input array shape: (61056,)
Sampling rate: 16000
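
The array shape and the sampling rate together determine the clip length. A short worked example using the numbers printed above; the helper names are hypothetical.

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Clip length in seconds for a mono waveform."""
    return num_samples / sampling_rate


def max_samples(max_duration_s: float, sampling_rate: int) -> int:
    """Largest array length that still fits within `max_duration_s`."""
    return int(max_duration_s * sampling_rate)


# the example above: 61056 samples at 16 kHz
print(duration_seconds(61056, 16000))
# the 6-second filter corresponds to this many samples at 16 kHz
print(max_samples(6, 16000))
```

The example clip is therefore about 3.8 seconds long, and after resampling no training clip should exceed roughly 96,000 samples.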


In [None]:
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

In [None]:
dataset = dataset.map(prepare_dataset, remove_columns=['audio', 'transcription', 'sentence'])

Map:   0%|          | 0/113094 [00:00<?, ? examples/s]

    `as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your labels by using the argument `text` of the regular `__call__` method (either in the same call as your audio inputs, or in a separate call.


Map:   0%|          | 0/3638 [00:00<?, ? examples/s]

Map:   0%|          | 0/3636 [00:00<?, ? examples/s]
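
The deprecation warning above suggests passing the transcription through the `text` argument of the processor call instead of using `as_target_processor`. A minimal sketch of that variant, assuming the processor returns objects with `input_values` and `input_ids` as in `prepare_dataset`; `prepare_example` is a hypothetical name, and the processor is passed in explicitly so the function does not depend on a global.

```python
from typing import Any, Dict


def prepare_example(batch: Dict[str, Any], processor: Any) -> Dict[str, Any]:
    """Same steps as prepare_dataset, without the deprecated context manager."""
    audio = batch["audio"]

    # the processor returns a batch of size one, so take the first element
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # encode the target text via the `text` argument named in the warning
    batch["labels"] = processor(text=batch["sentence"]).input_ids
    return batch
```

It could be applied with `dataset.map(lambda b: prepare_example(b, processor), remove_columns=[...])`; this has not been run against the notebook's data.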

In [None]:
#dataset.push_to_hub("karinthommen/sds200-features-incl-vocab", private=True)

### Training

In [None]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
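
The key step in the collator is that padded label positions are set to `-100` so that they do not contribute to the loss. A minimal pure-Python sketch of that masking on plain lists; `pad_labels` is a hypothetical helper, whereas the collator above does the same on tensors via `masked_fill`.

```python
from typing import List

IGNORE_INDEX = -100  # label value that the loss computation skips


def pad_labels(label_seqs: List[List[int]], ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


print(pad_labels([[11, 12, 13], [14]]))
```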

In [None]:
wer_metric = load_metric("wer")

    load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate


Downloading builder script:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
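
For reference, the word error rate reported by `compute_metrics` can be sketched as a word-level edit distance divided by the number of reference words. The function below is an illustrative reimplementation that pools errors over the whole corpus; it is not the code behind `load_metric("wer")`, and its name is hypothetical.

```python
from typing import List


def word_error_rate(references: List[str], predictions: List[str]) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only the previous row
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return 100 * total_edits / total_words


print(word_error_rate(["dann fuhr der rote pw weiter"], ["dann fuhr der rot pw"]))
```

Because insertions count as errors, the value can exceed 100 when the prediction contains many more words than the reference.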

In [None]:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-xls-r-300m were not used when initializing Wav2Vec2ForCTC: ['project_hid.bias', 'project_q.bias', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'project_q.weight', 'project_hid.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-xls-r-300m and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it 

In [None]:
model.freeze_feature_extractor()

    The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5.Please use the equivalent `freeze_feature_encoder` method instead.


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="./xlsr-V2",
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=10,
  max_steps=4000,
  gradient_checkpointing=True,
  fp16=True,
  save_steps=400,
  eval_steps=400,
  logging_steps=400,
  learning_rate=3e-4,
  warmup_steps=100,
  save_total_limit=2,
  report_to=["tensorboard"],
  load_best_model_at_end=True,
  metric_for_best_model="wer",
  greater_is_better=False,
  push_to_hub=True,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=processor.feature_extractor,
)

Cloning https://huggingface.co/karinthommen/xlsr-V2 into local empty directory.



### Version V2.2
New version, saved on Hugging Face

In [None]:
trainer.train()

    `as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your labels by using the argument `text` of the regular `__call__` method (either in the same call as your audio inputs, or in a separate call.


| Step | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 400 | 3.6174 | 1.961939 | 111.64198 |
| 800 | 1.3765 | 1.180564 | 71.551209 |
| 1200 | 1.0222 | 1.010096 | 61.510045 |
| 1600 | 0.8889 | 0.919222 | 54.051296 |
| 2000 | 0.8098 | 0.848303 | 50.423952 |
| 2400 | 0.7614 | 0.807415 | 46.574957 |
| 2800 | 0.7217 | 0.766979 | 43.253703 |
| 3200 | 0.679 | 0.732038 | 41.976568 |
| 3600 | 0.6463 | 0.710304 | 39.781163 |
| 4000 | 0.585 | 0.700163 | 38.986032 |


TrainOutput(global_step=4000, training_loss=1.1108133392333985, metrics={'train_runtime': 15984.0478, 'train_samples_per_second': 8.008, 'train_steps_per_second': 0.25, 'total_flos': 1.677488954224495e+19, 'train_loss': 1.1108133392333985, 'epoch': 1.13})
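
The `'epoch': 1.13` in the output follows from the training arguments: `max_steps=4000` takes precedence over `num_train_epochs=10`, so the run stops after 4000 optimizer steps. A short worked calculation, assuming a single GPU and the 113,094 training examples left after filtering; the function name is hypothetical.

```python
def epochs_covered(
    max_steps: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_train_examples: int,
    num_devices: int = 1,  # assumption: training ran on one device
) -> float:
    """Number of passes over the training set completed in `max_steps` updates."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return max_steps * effective_batch / num_train_examples


epochs = epochs_covered(
    max_steps=4000,
    per_device_batch_size=16,
    grad_accum_steps=2,
    num_train_examples=113094,
)
print(round(epochs, 2))
```

Each update therefore sees 32 examples, and 4000 updates amount to a little more than one pass over the training set.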

### Version V2.1
Old version, not saved on Hugging Face

In [None]:
trainer.train()

    `as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your labels by using the argument `text` of the regular `__call__` method (either in the same call as your audio inputs, or in a separate call.


| Step | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 400 | 3.4446 | 1.901822 | 101.076593 |
| 800 | 1.3861 | 1.277623 | 76.57179 |
| 1200 | 1.0696 | 1.117697 | 69.232664 |
| 1600 | 0.9572 | 1.023674 | 60.503817 |
| 2000 | 0.8959 | 0.954935 | 57.084052 |
| 2400 | 0.8441 | 0.912878 | 54.35035 |
| 2800 | 0.8015 | 0.873229 | 51.91922 |
| 3200 | 0.7322 | 0.88047 | 52.408261 |
| 3600 | 0.742 | 0.856338 | 49.470499 |
| 4000 | 0.7187 | 0.850561 | 47.961158 |


