# Introduction

The goal of this project is to explore Wav2Vec2, a pretrained transformer-based model for automatic speech recognition (ASR). We train and test the model on short sentences in two languages, English and Spanish, using a different approach for each. For English, we use the TIMIT dataset, which contains hours of speech; for Spanish, we use our own dataset, which contains only about 10 minutes of speech. We want to see where the limits of these two approaches lie, how well the resulting models perform, and how they can be improved. To evaluate the models, we use the word error rate (WER) metric.
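For reference, the standard definition of WER (not spelled out elsewhere in this notebook) is:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

Here $S$, $D$ and $I$ are the numbers of word substitutions, deletions and insertions needed to turn the prediction into the reference transcription, and $N$ is the number of words in the reference. A WER of 1.0 in the training logs therefore corresponds to 100%, and the value can exceed 1 when the prediction contains many inserted words.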

The datasets used can be found at the following link: https://drive.google.com/drive/folders/1LnBKW9yf-r_iueT4ylM96a-3fP4h0X_2?usp=sharing

# ***Fine tuning Wav2Vec2 for Spanish***


## Model

The model used in the first part of this notebook (*facebook/wav2vec2-base*) was pretrained exclusively on English data, so we could not reuse it and expect good results on Spanish. Instead, we used *facebook/wav2vec2-xls-r-300m*, a multilingual pretrained speech model. It was pretrained with the wav2vec 2.0 objective on 436k hours of unlabeled speech in 128 languages.

## Data

We created our own Spanish dataset, which can be found in the archive spanish.zip in the shared folder ASP. It has two folders: the *train* folder contains the audio for training, and the *test* folder contains the audio for testing. The metadata file contains the metadata needed to create the dataset (file name and transcription), plus some additional data (accent, speaker_id, ...).
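As a rough illustration of what such a metadata file can look like, the sketch below writes a CSV in the layout that the Hugging Face `audiofolder` loader reads, namely a `file_name` column followed by arbitrary extra columns. The column names other than `file_name` follow the features listed later in this notebook; the file names, sentences and speaker ids are hypothetical placeholders, not entries from the real dataset.

```python
import csv
import io

# Hypothetical rows: placeholder values, not taken from the real dataset.
rows = [
    {"file_name": "audio_001.wav", "text": "Hola, me llamo Ana.",
     "accent": "Mexico", "speaker_id": "tts_01", "test_or_train": "train"},
    {"file_name": "audio_002.wav", "text": "Vivo en una casa pequeña.",
     "accent": "Peru", "speaker_id": "tts_02", "test_or_train": "train"},
]


def write_metadata(rows, handle):
    """Write one metadata row per audio file, with a header line."""
    fieldnames = ["file_name", "text", "accent", "speaker_id", "test_or_train"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here; in practice this would be a
# metadata.csv file stored next to the audio files.
buffer = io.StringIO()
write_metadata(rows, buffer)
metadata_csv = buffer.getvalue()
print(metadata_csv)
```

Transcriptions that contain commas are quoted automatically by the `csv` module, so the file stays parseable.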

**Training** : We took sentences from Spanish texts and used a *text-to-speech* tool to create the audio. The sentences come from https://lingua.com/spanish/reading/ and the audio was generated with the website https://ttsfree.com/, which lets you choose the accent of the generated voice. Our training set therefore contains accents from Spain, Uruguay, Puerto Rico, Venezuela, Peru, Mexico, Cuba, Colombia, and Argentina. We then converted the audio to WAV format with a sampling frequency of 16 kHz.
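The conversion tool is not part of this notebook, but the result is easy to check. The sketch below, which uses only the standard-library `wave` module, reads the sampling rate from a WAV header. The helper name `wav_sampling_rate` and the synthetic demo file are assumptions made for illustration.

```python
import os
import struct
import tempfile
import wave


def wav_sampling_rate(path):
    """Return the sampling rate (in Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Demo on a synthetic file: 0.1 s of silence, mono, 16-bit, 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(struct.pack("<1600h", *([0] * 1600)))

rate = wav_sampling_rate(demo_path)
print(rate)
```

Running the same helper over every file in the *train* folder would flag any clip that was not resampled to 16 kHz.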

**Testing** : For the test data, we recorded ourselves reading passages from texts. The accent is therefore a "French" accent.


## Code

The code is for the most part the same as in the English part. The main differences are the data itself and the way the dataset is created.

In [None]:
# %%capture
!pip install datasets==1.18.3
!pip install -U transformers==4.24.0  # 4.17.0
!pip install jiwer

# %%capture
!apt install git-lfs
!pip install evaluate

import json
import random
import re
from pathlib import Path

import datasets
import evaluate
import IPython.display as ipd
import numpy as np
import pandas as pd
import torch
import transformers
from dataclasses import dataclass, field
from datasets import ClassLabel, load_dataset  # load_metric
from huggingface_hub import notebook_login
from IPython.display import display, HTML
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from typing import Any, Dict, List, Optional, Union

from google.colab import drive
drive.mount('/content/drive')


In [None]:
# downloading the data
#!wget http://data.deepai.org/timit.zip

# decompressing the data
from zipfile import ZipFile

# the archive is read directly from Google Drive, so there is no local copy to delete
with ZipFile('/content/drive/MyDrive/ASR/spanish.zip', 'r') as archive:
    archive.extractall()
    print('Data decompressed successfully')

Data decompressed successfully


In [None]:
DATA = Path("./spanish").expanduser()
DATA.is_dir()

True

In [None]:
spanish_dataset = load_dataset("audiofolder", data_dir="/content/spanish")


In [None]:
spanish_dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'test_or_train', 'speaker_id', 'accent', 'text'],
        num_rows: 72
    })
    test: Dataset({
        features: ['audio', 'test_or_train', 'speaker_id', 'accent', 'text'],
        num_rows: 20
    })
})

In [None]:
spanish_dataset = spanish_dataset.remove_columns(
    [
        "accent",
        "speaker_id",
        "test_or_train",
    ]
)

In [None]:
from IPython.display import display

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(spanish_dataset["train"].remove_columns(["audio"]), num_examples=3)

Unnamed: 0,text
0,"Mi nombre es Laura y mi familia es muy divertida. Mi abuela Rocío cocina delicioso, por lo que nos encanta comer y hacer sobremesa. Mi abuelo Julio encuentra monedas detrás de las orejas de las personas, no sé por qué se queja sobre la vida de pensionado."
1,El armario está situado a la derecha del escritorio y tiene mucho espacio para poder colocar la ropa. Me gusta tenerla bien doblada y organizada.
2,"También me animó a tocar el piano para mejorar la coordinación entre mi cerebro y mi cuerpo, mi capacidad cognitiva, perseverancia y disciplina."


In [None]:
# raw string, so the backslashes reach the regex engine without invalid-escape warnings
chars_to_ignore_regex = r'[\,\?\¿\.\…\!\¡\-\;\:"]'


def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, "", batch["text"]).lower() + " "
    return batch
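
To make the effect of this normalization concrete, here is a standalone sketch that applies the same steps (strip punctuation, lower-case, append a trailing space) to a single string instead of a dataset batch. The character class lists the same punctuation marks without the redundant escapes, and the sample sentence is made up for the example.

```python
import re

# Same punctuation set as above, written without redundant escapes.
CHARS_TO_IGNORE = r'[,?¿.…!¡\-;:"]'


def normalize_text(text):
    """Strip punctuation, lower-case, and append the trailing space."""
    return re.sub(CHARS_TO_IGNORE, "", text).lower() + " "


sample = "¡Hola! ¿Cómo estás? Bien, gracias."  # made-up sentence
cleaned = normalize_text(sample)
print(repr(cleaned))
```

Accented letters such as `ó` and `á` are kept, which is why they show up in the vocabulary built in the following cells.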

In [None]:
spanish_dataset = spanish_dataset.map(remove_special_characters)


In [None]:
show_random_elements(spanish_dataset["train"].remove_columns(["audio"]))

Unnamed: 0,text
0,poco antes de cerrar la tienda llega un hombre muy apurado porque debe hacer un viaje a finlandia y no tiene ropa abrigada
1,hace un mes que he hecho la reserva de la habitación y estoy encantado la habitación es de tamaño mediano tiene mucha luz y una gran ventana tengo una cama grande y una gran caja fuerte
2,encima del escritorio tengo un estante donde almaceno todos mis libros y también otros objetos que me gustan y decoran la habitación como un cuadro o una flor de tela de bonitos colores


In [None]:
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
vocabs = spanish_dataset.map(
    extract_all_chars,
    batched=True,
    batch_size=-1,
    keep_in_memory=True,
    remove_columns=spanish_dataset.column_names["train"],
)


In [None]:
# the vocabulary is built from the training split only; characters that occur
# only in the test split will be mapped to [UNK]
vocab_list = list(set(vocabs["train"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'ñ': 0,
 'u': 1,
 'k': 2,
 's': 3,
 'ú': 4,
 'n': 5,
 'j': 6,
 'y': 7,
 'á': 8,
 'm': 9,
 '2': 10,
 'b': 11,
 'i': 12,
 'p': 13,
 'v': 14,
 'q': 15,
 'e': 16,
 'z': 17,
 'h': 18,
 'ó': 19,
 'r': 20,
 't': 21,
 'l': 22,
 'é': 23,
 'a': 24,
 '1': 25,
 'd': 26,
 'g': 27,
 'c': 28,
 ' ': 29,
 'o': 30,
 'f': 31,
 'x': 32,
 'í': 33,
 '0': 34}

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

with open("vocab.json", "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)

spanish_tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
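
The vocabulary steps above can be condensed into one small function. This is a minimal sketch on toy sentences rather than the notebook's exact code: it sorts the characters so that the ids are reproducible (the cells above rely on `set` ordering, which is why the ids look shuffled), and the name `build_ctc_vocab` is hypothetical.

```python
def build_ctc_vocab(sentences):
    """Build a character-level vocabulary for a CTC tokenizer."""
    # One id per distinct character, sorted for reproducibility.
    chars = sorted(set(" ".join(sentences)))
    vocab = {char: index for index, char in enumerate(chars)}
    # The space is replaced by a visible word delimiter.
    vocab["|"] = vocab.pop(" ")
    # Special tokens take the next free ids.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


vocab = build_ctc_vocab(["hola mundo ", "adiós "])
print(vocab)
```

Since `[PAD]` is added last, its id is always `len(vocab) - 1`, and that id is later passed to the model as `pad_token_id`.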

In [None]:
spanish_repo_name = "wav2vec2-base-spanish-demo-google-colab"
spanish_repo_name2 = "Cachoups/wav2vec2-base-spanish-demo-google-colab"

!huggingface-cli login



In [None]:
spanish_tokenizer.push_to_hub(spanish_repo_name)

CommitInfo(commit_url='https://huggingface.co/Cachoups/wav2vec2-base-spanish-demo-google-colab/commit/b639b3257f72174d756cb7ccd4d4f58bd0e0d6e7', commit_message='Upload tokenizer', commit_description='', oid='b639b3257f72174d756cb7ccd4d4f58bd0e0d6e7', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=False,
)
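
As far as we understand it, `do_normalize=True` asks the feature extractor to rescale each waveform to zero mean and unit variance before it reaches the model. The sketch below shows that operation in plain Python on a few made-up sample values; the function name and the `eps` constant are assumptions for illustration, not the library's code.

```python
import math


def zero_mean_unit_variance(samples, eps=1e-7):
    """Rescale a waveform so that its mean is 0 and its variance is about 1."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps guards against division by zero on a constant (silent) signal.
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]


normalized = zero_mean_unit_variance([0.1, -0.2, 0.05, 0.3, -0.25])
print(normalized)
```

This makes the input scale independent of the recording volume, which matters here because the training audio is synthetic and the test audio was recorded with a microphone.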

In [None]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=spanish_tokenizer)

In [None]:
spanish_dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'text'],
        num_rows: 72
    })
    test: Dataset({
        features: ['audio', 'text'],
        num_rows: 20
    })
})

### Preprocess Data


In [None]:
# verifying that the audio was correctly loaded
rand_int = random.randint(0, len(spanish_dataset["train"]) - 1)  # randint includes both endpoints

print(spanish_dataset["train"][rand_int]["text"])
ipd.Audio(
    data=np.asarray(spanish_dataset["train"][rand_int]["audio"]["array"]),
    autoplay=True,
    rate=16000,
)

así que comencé a probar otros deportes como el basquetbol y el tenis  


In [None]:
rand_int = random.randint(0, len(spanish_dataset["train"]) - 1)  # randint includes both endpoints

print("Target text:", spanish_dataset["train"][rand_int]["text"])
print(
    "Input array shape:", np.asarray(spanish_dataset["train"][rand_int]["audio"]["array"]).shape
)
print("Sampling rate:", spanish_dataset["train"][rand_int]["audio"]["sampling_rate"])

Target text: así que comencé a probar otros deportes como el basquetbol y el tenis  
Input array shape: (74880,)
Sampling rate: 16000


In [None]:
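# check that every training clip is sampled at 16 kHz (prints nothing if they all are)
for i in range(len(spanish_dataset['train'])):
  if (spanish_dataset['train'][i]['audio']['sampling_rate'] != 16000):
    print(spanish_dataset['train'][i])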

In [None]:
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(
          audio["array"], sampling_rate=audio["sampling_rate"]
      ).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

In [None]:
spanish_dataset = spanish_dataset.map(
    prepare_dataset, remove_columns=spanish_dataset.column_names["train"], num_proc=4
)

        




In [None]:
#max_input_length_in_sec = 4.0
#spanish_dataset["train"] = spanish_dataset["train"].filter(
#    lambda x: x < max_input_length_in_sec * processor.feature_extractor.sampling_rate,
#    input_columns=["input_length"],
#)
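
The filter above is left disabled. The sketch below shows what it would have done, using the same threshold of 4.0 seconds at 16 kHz. The length 74880 is the sample printed earlier in this notebook; the other two lengths are hypothetical.

```python
SAMPLING_RATE = 16_000
MAX_INPUT_LENGTH_IN_SEC = 4.0


def is_short_enough(input_length,
                    max_seconds=MAX_INPUT_LENGTH_IN_SEC,
                    sampling_rate=SAMPLING_RATE):
    """Return True if a clip of `input_length` samples is under the limit."""
    return input_length < max_seconds * sampling_rate


# 74880 comes from the sample shown above; the other two are made up.
lengths = [74_880, 48_000, 64_000]
durations = [n / SAMPLING_RATE for n in lengths]
kept = [n for n in lengths if is_short_enough(n)]
print(durations, kept)
```

The sample shown earlier lasts 4.68 seconds, so a 4-second limit would have discarded it. With only 72 training clips, dropping the longer ones would shrink an already small dataset.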

## Training & Evaluation

In [None]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str`,
                 or :class:`~transformers.tokenization_utils_base.PaddingStrategy`,
                 `optional`,
                 defaults to :obj:`True`):
            Select a strategy to pad the returned sequences
            (according to the model's padding side and padding index) among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch
                (or no padding if only a single sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument
                :obj:`max_length` or to the maximum acceptable input length for the model
                if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding
                (i.e., can output a batch with sequences of different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [
            {"input_values": feature["input_values"]} for feature in features
        ]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
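
The label half of the collator can be illustrated without tensors. This minimal sketch pads label sequences of different lengths to the longest one in the batch and fills the padded positions with -100, the value the collator uses so that those positions are ignored by the loss. The function name and the label ids are hypothetical.

```python
IGNORE_INDEX = -100


def pad_labels(label_sequences, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in label_sequences)
    return [
        list(seq) + [ignore_index] * (max_len - len(seq))
        for seq in label_sequences
    ]


# Hypothetical label ids for a batch of three transcriptions.
padded = pad_labels([[5, 3, 9], [7], [2, 4]])
print(padded)
```

The audio inputs are padded separately because they are float waveforms with their own padding value (0.0), while the labels are integer token ids.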

In [None]:
wer_metric = evaluate.load("wer")


In [None]:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)


Some weights of the model checkpoint at facebook/wav2vec2-xls-r-300m were not used when initializing Wav2Vec2ForCTC: ['project_hid.bias', 'project_q.bias', 'quantizer.codevectors', 'project_q.weight', 'quantizer.weight_proj.bias', 'project_hid.weight', 'quantizer.weight_proj.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-xls-r-300m and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it 

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
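
The comment about `group_tokens=False` is easier to follow with a toy example of greedy CTC decoding. In this sketch, which assumes that the pad token doubles as the CTC blank, the per-frame predictions are decoded by merging consecutive repeats and then dropping blanks. The reference labels must not go through that merge step, otherwise genuine double letters are lost. The toy vocabulary and frame ids are made up.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id):
    """Merge consecutive repeated ids, then drop the blank id."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Toy vocabulary; id 0 plays the role of [PAD], used as the CTC blank.
id_to_char = {0: "[PAD]", 1: "c", 2: "a", 3: "l", 4: "e"}

# Per-frame argmax ids: the blank between the two 3s keeps both "l"s.
frames = [1, 1, 2, 3, 0, 3, 4, 4]
decoded = "".join(id_to_char[i] for i in ctc_greedy_decode(frames, blank_id=0))

# Reference labels already hold one id per character. Grouping them
# would wrongly merge the double "l".
labels = [1, 2, 3, 3, 4]
wrongly_grouped = "".join(id_to_char[t] for t, _ in groupby(labels))
print(decoded, wrongly_grouped)
```

This is why predictions are decoded with the default grouping while the labels are decoded with `group_tokens=False`.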

In [None]:
model.freeze_feature_encoder() #maybe unfreeze if bad results

In [None]:
spanish_training_args = TrainingArguments(
    output_dir=spanish_repo_name2,
    group_by_length=True,
    per_device_train_batch_size=4, #reducing batch_size was necessary to remove the "out of memory" error
    evaluation_strategy="steps",
    num_train_epochs=100,
    # fp16=True,    # For GPU
    gradient_checkpointing=True,
    save_steps=10,
    eval_steps=10,
    logging_steps=10,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=5,
    save_total_limit=2,
    push_to_hub=True
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
spanish_dataset["train"]

Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 72
})

In [None]:
spanish_trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=spanish_training_args,
    compute_metrics=compute_metrics,
    train_dataset=spanish_dataset["train"],
    eval_dataset=spanish_dataset["test"],
    tokenizer=processor.feature_extractor,
)

/content/Loelia/wav2vec2-base-spanish-demo-google-colab is already a clone of https://huggingface.co/Cachoups/wav2vec2-base-spanish-demo-google-colab. Make sure you pull the latest changes with `repo.git_pull()`.


In [None]:
spanish_trainer.train()

The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 72
  Num Epochs = 100
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 1800
  Number of trainable parameters = 311266469


Step,Training Loss,Validation Loss,Wer
10,3.0399,3.679806,1.0
20,2.9586,3.797618,1.0
30,2.9662,3.666707,1.0
40,2.9781,3.400784,1.0
50,2.9261,3.255877,1.0
60,3.0358,3.198736,1.0
70,2.9321,3.600662,1.0
80,2.9916,3.332069,1.0
90,2.9219,3.389026,1.0
100,2.8993,3.162867,1.0


The following columns in the evaluation set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 20
  Batch size = 8
Saving model checkpoint to Loelia/wav2vec2-base-spanish-demo-google-colab/checkpoint-10
Configuration saved in Loelia/wav2vec2-base-spanish-demo-google-colab/checkpoint-10/config.json
Model weights saved in Loelia/wav2vec2-base-spanish-demo-google-colab/checkpoint-10/pytorch_model.bin
Feature extractor saved in Loelia/wav2vec2-base-spanish-demo-google-colab/checkpoint-10/preprocessor_config.json
Feature extractor saved in Loelia/wav2vec2-base-spanish-demo-google-colab/preprocessor_config.json
Several commits (3) will be pushed upstream.
Deleting older checkpoint [Loelia/wav2vec2-base-spanish-demo-google-colab/checkpoint-170] due to args.save_total_limit
The following columns in t

TrainOutput(global_step=1800, training_loss=1.0308863363001082, metrics={'train_runtime': 8431.5251, 'train_samples_per_second': 0.854, 'train_steps_per_second': 0.213, 'total_flos': 2.6590790589323187e+18, 'train_loss': 1.0308863363001082, 'epoch': 100.0})

In [None]:
spanish_trainer.push_to_hub()

Saving model checkpoint to Loelia/wav2vec2-base-spanish-demo-google-colab
Configuration saved in Loelia/wav2vec2-base-spanish-demo-google-colab/config.json
Model weights saved in Loelia/wav2vec2-base-spanish-demo-google-colab/pytorch_model.bin
Feature extractor saved in Loelia/wav2vec2-base-spanish-demo-google-colab/preprocessor_config.json
Several commits (4) will be pushed upstream.
The progress bars may be unreliable.



remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/Cachoups/wav2vec2-base-spanish-demo-google-colab
   b639b32..dbc29d3  main -> main


Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Automatic Speech Recognition', 'type': 'automatic-speech-recognition'}, 'metrics': [{'name': 'Wer', 'type': 'wer', 'value': 0.7383177570093458}]}
To https://huggingface.co/Cachoups/wav2vec2-base-spanish-demo-google-colab
   dbc29d3..9795233  main -> main

   dbc29d3..9795233  main -> main



'https://huggingface.co/Cachoups/wav2vec2-base-spanish-demo-google-colab/commit/dbc29d3967d7cb6a5ad50646ee3d1792f0c08556'

In [None]:
from transformers import AutoModelForCTC, Wav2Vec2Processor

### *Remarks* :

- We first trained for a small number of epochs (5), but the predictions were random sequences of letters. We increased the number to 30, but then the model predicted nothing at all (an empty string). In both cases, the WER was therefore 100%. Only at around 100 epochs did we obtain a usable result. The WER also stays constant at 100% at first and only begins to fall after step 520.

- In the end, the training loss is around 0.1, while the validation loss stagnates at 1.2. The model is probably overfitted, but given the time it takes to train, and the fact that the validation loss did not rise again, we kept this model for the tests.
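
The step counts in these remarks can be related to epochs with a little arithmetic. The sketch below uses only numbers that appear in this notebook (72 training examples, batch size 4, 100 epochs, step 520) and assumes a single device without gradient accumulation, as reported in the training log.

```python
import math

num_train_examples = 72
batch_size = 4
num_epochs = 100

# One optimization step per batch (single device, no accumulation).
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epoch at which the WER starts to fall, according to the remark above.
epoch_at_step_520 = 520 / steps_per_epoch

# Number of optimization steps in the earlier 30-epoch attempt.
steps_in_30_epochs = 30 * steps_per_epoch

print(steps_per_epoch, total_steps, round(epoch_at_step_520, 1), steps_in_30_epochs)
```

The total matches the 1800 optimization steps reported by the trainer. Step 520 falls just before epoch 29, while the 30-epoch attempt ran for 540 steps. If that run followed a similar trajectory, it stopped right where the WER begins to move, which would be consistent with the empty predictions we observed. That is only a plausible reading, since the learning-rate schedule depends on the total number of steps.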

### Evaluate

In [None]:
processor = Wav2Vec2Processor.from_pretrained(
    "Cachoups/wav2vec2-base-spanish-demo-google-colab"
)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
model = Wav2Vec2ForCTC.from_pretrained(
    "Cachoups/wav2vec2-base-spanish-demo-google-colab"
).cuda()

In [None]:
def map_to_result(batch):
    with torch.no_grad():
        input_values = torch.tensor(batch["input_values"], device = "cuda").unsqueeze(0)
        logits = model(input_values).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_str"] = processor.batch_decode(pred_ids)[0]
    batch["text"] = processor.decode(batch["labels"], group_tokens=False)

    return batch


In [None]:
def prepare_dataset_test(batch):
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

In [None]:
# PRED FOR NORMAL TEST
results = spanish_dataset["test"].map(map_to_result, remove_columns=spanish_dataset["test"].column_names)


In [None]:
print(
    "Test WER: {:.3f}".format(
        wer_metric.compute(predictions=results["pred_str"], references=results["text"])
    )
)

Test WER: 0.668


In [None]:
show_random_elements(results)

Unnamed: 0,pred_str,text
0,los liblo asso s son me jugaes que la tere ibición,los libros son mejores que la televisión
1,todos emos ouido ablar de la clostro fobia,todos hemos oído hablar de la claustrofobia
2,po ejemplo los ñiños no soportan la osculidado,por ejemplo los niños no soportan la oscuridad


The results are quite poor. The model seems to recognize the sounds, but it struggles to assemble them into the right words. However, since we used our own recordings for the tests, this could partly be due to our French accent, which was not represented in the training data, or to the other reasons detailed earlier in the English part.
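
One way to probe the impression that the sounds are mostly right while the words are wrong is to compare the word error rate with a character error rate (CER) on one of the examples above. The CER is not used elsewhere in this notebook; the sketch below is only an illustration, with a plain Levenshtein distance written from scratch instead of the `evaluate` metric.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference."""
    return edit_distance(reference, hypothesis) / len(reference)


# Second example from the table above.
reference = "todos hemos oído hablar de la claustrofobia"
hypothesis = "todos emos ouido ablar de la clostro fobia"

wer = error_rate(reference.split(), hypothesis.split())
cer = error_rate(list(reference), list(hypothesis))
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```

On this sentence, five of the seven reference words count as errors, yet most individual characters are correct, so the CER comes out well below the WER. That supports the reading that the acoustic part works reasonably well and that spelling, not sound recognition, is the main weakness.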

# Conclusion
To conclude, for English, the model fine-tuned from the Facebook pretrained checkpoint gives good results when the speaker talks clearly and without background noise. Clipping the audio does not significantly affect the quality of the predictions. For Spanish, because of the lack of data, we had to overtrain the model to get a decent result. We could have added white noise or clipped versions of the audio to the training set, both to make the model more robust and to obtain more training data.