Carlos Henrique Amorim Dutra


In [2]:
%%capture
# Local installation
!git clone https://github.com/speechbrain/speechbrain/
%cd /content/speechbrain/
!pip install -r requirements.txt
!pip install -e .

In [3]:
import speechbrain as sb

In [12]:
%cd /content/speechbrain/templates/speaker_id
!python train.py train.yaml --number_of_epochs=15

/content/speechbrain/templates/speaker_id
./data/rirs_noises.zip exists. Skipping download
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: ./results/speaker_id/1986
mini_librispeech_prepare - Preparation completed in previous run, skipping.
speechbrain.dataio.encoder - Load called, but CategoricalEncoder is not empty. Loaded data will overwrite everything. This is normal if there is e.g. an unk label defined at init.
speechbrain.core - Info: ckpt_interval_minutes arg from hparam file is used
speechbrain.core - 4.5M trainable parameters in SpkIdBrain
speechbrain.utils.checkpoints - Loading a checkpoint from results/speaker_id/1986/save/CKPT+2023-09-24+11-45-19+00
speechbrain.utils.checkpoints - Loading a checkpoint from results/speaker_id/1986/save/CKPT+2023-09-24+11-45-19+00
100% 10/10 [00:02<00:00,  3.57it/s]
speechbrain.utils.train_logger - Epoch loaded: 15 - test loss: 4.07e-03, test error: 0.00e+00


In [6]:
# Create folder for best model
!mkdir -p /content/best_model/

# Copy label encoder
!cp results/speaker_id/1986/save/label_encoder.txt /content/best_model/

# Copy best model (`ls -t` lists newest first, so `tail -1` picks the oldest CKPT folder)
!cp "`ls -td results/speaker_id/1986/save/CKPT* | tail -1`"/* /content/best_model/
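
The shell pipeline above can also be written in Python. This is a minimal sketch, assuming the checkpoint to copy is the `CKPT*` folder with the oldest modification time, which is what `ls -td ... | tail -1` selects. The helper name `oldest_checkpoint` is hypothetical and not part of SpeechBrain.

```python
from pathlib import Path
from typing import Optional


def oldest_checkpoint(save_dir: str) -> Optional[Path]:
    """Return the CKPT* folder with the oldest modification time, or None."""
    candidates = [p for p in Path(save_dir).glob("CKPT*") if p.is_dir()]
    if not candidates:
        return None
    # `ls -td ... | tail -1` sorts newest first and takes the last entry,
    # i.e. the oldest one; min() over the mtime selects the same folder.
    return min(candidates, key=lambda p: p.stat().st_mtime)
```

The files inside the returned folder can then be copied with `shutil.copy` into `/content/best_model/`.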

In [13]:
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
signal, fs = torchaudio.load('/content/speechbrain/tests/samples/single-mic/example1.wav')

# Compute speaker embeddings
embeddings = classifier.encode_batch(signal)

# Perform classification
output_probs, score, index, text_lab = classifier.classify_batch(signal)

# Posterior log probabilities
print(output_probs)

# Score (i.e., max log posterior)
print(score)

# Index of the predicted speaker
print(index)

# Text label of the predicted speaker
print(text_lab)


tensor([[-31.8672, -35.2024, -25.7931,  ..., -21.0045, -12.4279, -21.5266]])
tensor([-1.1278])
tensor([2710])
['id10892']
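
The relation between the printed values can be sketched in plain Python: the score is the largest log posterior and the index is its position. The sketch assumes the values are log posteriors, as the comments in the cell state. The toy numbers and the helper name `top_prediction` are made up for illustration.

```python
import math
from typing import List, Tuple


def top_prediction(log_posteriors: List[float]) -> Tuple[float, int]:
    """Return (score, index), where score is the largest log posterior."""
    index = max(range(len(log_posteriors)), key=lambda i: log_posteriors[i])
    return log_posteriors[index], index


# Toy values (made up for illustration, not taken from the model output).
toy_log_posteriors = [-3.2, -0.4, -1.9]
toy_score, toy_index = top_prediction(toy_log_posteriors)

# exp() turns a log posterior back into a probability.
toy_probability = math.exp(toy_score)
```

Under the same assumption, the score of -1.1278 printed above corresponds to a posterior of about 0.32 for speaker `id10892`.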


For those of you interested in speaker verification, we also created an inference interface called `SpeakerRecognition`:

In [14]:
from speechbrain.pretrained import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")

file1 = '/content/speechbrain/tests/samples/single-mic/example1.wav'
file2 = '/content/speechbrain/tests/samples/single-mic/example2.flac'

score, prediction = verification.verify_files(file1, file2)

print(score)
print(prediction)  # True = same speaker, False = different speakers

tensor([0.1799])
tensor([False])
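
The verification call returns a similarity score together with a boolean decision. A minimal sketch of that kind of decision rule follows, assuming cosine similarity between two embedding vectors and a fixed threshold. The helper name `verify` and the default threshold of `0.25` are assumptions for illustration and are not confirmed by this notebook.

```python
import math
from typing import Sequence, Tuple


def verify(
    emb1: Sequence[float], emb2: Sequence[float], threshold: float = 0.25
) -> Tuple[float, bool]:
    """Return (cosine similarity, decision) for two embedding vectors."""
    dot = sum(a * b for a, b in zip(emb1, emb2))
    norm1 = math.sqrt(sum(a * a for a in emb1))
    norm2 = math.sqrt(sum(b * b for b in emb2))
    score = dot / (norm1 * norm2)
    # The threshold value is an assumption; tune it on held-out trials.
    return score, score > threshold
```

With such a rule and that threshold, a score of 0.1799 falls below it, which is consistent with the `False` prediction printed above.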


In [15]:
%%writefile /content/best_model/hparams_inference.yaml

# #################################
# Basic inference parameters for speaker-id. A first network computes
# the embeddings; on top of it, we employ a classifier.
#
# Author:
#  * Mirco Ravanelli 2021
# #################################

# pretrain folders:
pretrained_path: /content/best_model/


# Model parameters
n_mels: 23
sample_rate: 16000
n_classes: 28 # In this case, we have 28 speakers
emb_dim: 512 # dimensionality of the embeddings

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>

# Mean and std normalization of the input features
mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False

# To design a custom model, either just edit the simple CustomModel
# class that's listed here, or replace this `!new` call with a line
# pointing to a different file you've defined.
embedding_model: !new:custom_model.Xvector
    in_channels: !ref <n_mels>
    activation: !name:torch.nn.LeakyReLU
    tdnn_blocks: 5
    tdnn_channels: [512, 512, 512, 512, 1500]
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    lin_neurons: !ref <emb_dim>

classifier: !new:custom_model.Classifier
    input_shape: [null, null, !ref <emb_dim>]
    activation: !name:torch.nn.LeakyReLU
    lin_blocks: 1
    lin_neurons: !ref <emb_dim>
    out_neurons: !ref <n_classes>

label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class.
modules:
    compute_features: !ref <compute_features>
    embedding_model: !ref <embedding_model>
    classifier: !ref <classifier>
    mean_var_norm: !ref <mean_var_norm>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        embedding_model: !ref <embedding_model>
        classifier: !ref <classifier>
        label_encoder: !ref <label_encoder>
    paths:
        embedding_model: !ref <pretrained_path>/embedding_model.ckpt
        classifier: !ref <pretrained_path>/classifier.ckpt
        label_encoder: !ref <pretrained_path>/label_encoder.txt


Overwriting /content/best_model/hparams_inference.yaml
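
Before loading the model, it can help to confirm that the folder holds every file named in the `paths` section of the YAML above. This is a minimal sketch; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# File names referenced by the `paths` section of hparams_inference.yaml.
REQUIRED_FILES = ("embedding_model.ckpt", "classifier.ckpt", "label_encoder.txt")


def missing_files(folder: str, required: Sequence[str] = REQUIRED_FILES) -> List[str]:
    """Return the required file names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in required if not (root / name).is_file()]
```

An empty list from `missing_files("/content/best_model/")` means all three files are in place.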


In [16]:
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="/content/best_model/", hparams_file='hparams_inference.yaml', savedir="/content/best_model/")

# Perform classification
audio_file = 'data/LibriSpeech/train-clean-5/5789/70653/5789-70653-0036.flac'
signal, fs = torchaudio.load(audio_file) # test_speaker: 5789
output_probs, score, index, text_lab = classifier.classify_batch(signal)
print('Target: 5789, Predicted: ' + text_lab[0])

# Another speaker
audio_file = 'data/LibriSpeech/train-clean-5/460/172359/460-172359-0012.flac'
signal, fs = torchaudio.load(audio_file) # test_speaker: 460
output_probs, score, index, text_lab = classifier.classify_batch(signal)
print('Target: 460, Predicted: ' + text_lab[0])

# And if you want to extract embeddings...
embeddings = classifier.encode_batch(signal)


Target: 5789, Predicted: 5789
Target: 460, Predicted: 460
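
The target speaker in the cell above is typed by hand. Since LibriSpeech names its files `<speaker>-<chapter>-<utterance>.flac`, the target can be read from the path instead. This is a small sketch; the helper name `speaker_from_path` is hypothetical.

```python
from pathlib import Path


def speaker_from_path(audio_file: str) -> str:
    """Extract the speaker ID from a LibriSpeech-style file name."""
    # The file stem looks like "5789-70653-0036"; the first field is the speaker.
    return Path(audio_file).stem.split("-")[0]
```

Comparing `speaker_from_path(audio_file)` with `text_lab[0]` over many files would give an accuracy estimate without typing each target.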


In [18]:
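# With the commands below, several trainings can be run from the same parameter file.
# Each run overrides only the variables specific to it and is identified by its seed,
# which is also set in the parameters and creates a separate results folder per run.
# Every override must match a key defined in train.yaml; otherwise hyperpyyaml raises
# a KeyError, as the traceback after this cell shows for `rnn_layers`.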
%cd /content/speechbrain/templates/speaker_id

!python train.py train.yaml --number_of_epochs=15 --n_mels=10 --seed=123 --rnn_layers=3 --learning_rate=0.001

!python train.py train.yaml --number_of_epochs=15 --n_mels=20 --seed=456 --rnn_layers=2 --learning_rate=0.01

!python train.py train.yaml --number_of_epochs=20 --n_mels=15 --seed=789 --rnn_layers=4 --learning_rate=0.0005

!python train.py train.yaml --number_of_epochs=20 --n_mels=12 --seed=321 --rnn_layers=2 --rnn_neurons=512 --learning_rate=0.0015

!python train.py train.yaml --number_of_epochs=12 --n_mels=18 --seed=555 --rnn_layers=3 --rnn_neurons=256 --learning_rate=0.002


/content/speechbrain/templates/speaker_id
Traceback (most recent call last):
  File "/content/speechbrain/templates/speaker_id/train.py", line 290, in <module>
    hparams = load_hyperpyyaml(fin, overrides)
  File "/usr/local/lib/python3.10/dist-packages/hyperpyyaml/core.py", line 157, in load_hyperpyyaml
    yaml_stream = resolve_references(yaml_stream, overrides, overrides_must_match)
  File "/usr/local/lib/python3.10/dist-packages/hyperpyyaml/core.py", line 316, in resolve_references
    recursive_update(preview, overrides, must_match=overrides_must_match)
  File "/usr/local/lib/python3.10/dist-packages/hyperpyyaml/core.py", line 778, in recursive_update
    raise KeyError(f"Override '{k}' not found in: {[key for key in d.keys()]}")
KeyError: "Override 'rnn_layers' not found in: ['seed', '__set_seed', 'data_folder', 'output_folder', 'save_folder', 'train_log', 'rir_folder', 'train_annotation', 'valid_annotation', 'test_annotation', 'split_ratio', 'skip_prep', 'train_logger', 'error_
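
The failure above can be caught before launching a run by comparing the `--key=value` overrides with the keys of the parameter file. This is a minimal sketch. The helper name `unknown_overrides` is hypothetical, and the key list is only the start of the truncated list in the error message, so the real file has more keys. Launcher flags that are not YAML keys would also be reported by this check.

```python
from typing import Iterable, List


def unknown_overrides(cli_args: Iterable[str], known_keys: Iterable[str]) -> List[str]:
    """Return override names that do not appear among the known YAML keys."""
    known = set(known_keys)
    unknown = []
    for arg in cli_args:
        if not arg.startswith("--"):
            continue
        key = arg[2:].split("=", 1)[0]
        if key not in known:
            unknown.append(key)
    return unknown


# Keys copied from the (truncated) KeyError message; the real file has more.
known_keys = ["seed", "data_folder", "output_folder", "save_folder", "train_log"]
args = ["--seed=123", "--rnn_layers=3"]
print(unknown_overrides(args, known_keys))  # prints ['rnn_layers']
```

In practice the key list would be read from `train.yaml` itself rather than typed in.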

In [None]:
# In this step, five training configurations (train1.yaml to train5.yaml) were created
# by changing the hyperparameters, but the previous step makes this one unnecessary
%cd /content/speechbrain/templates/speaker_id
!python train.py train1.yaml
!python train.py train2.yaml
!python train.py train3.yaml
!python train.py train4.yaml
!python train.py train5.yaml

[WinError 3] The system cannot find the path specified: '/content/speechbrain/templates/speaker_id'
C:\Users\amori\Downloads


python: can't open file 'C:\\Users\\amori\\Downloads\\train.py': [Errno 2] No such file or directory
python: can't open file 'C:\\Users\\amori\\Downloads\\train.py': [Errno 2] No such file or directory
python: can't open file 'C:\\Users\\amori\\Downloads\\train.py': [Errno 2] No such file or directory
python: can't open file 'C:\\Users\\amori\\Downloads\\train.py': [Errno 2] No such file or directory
python: can't open file 'C:\\Users\\amori\\Downloads\\train.py': [Errno 2] No such file or directory
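
The output above shows the cell running from a folder that does not contain `train.py`. A guard like the following would fail early with a clearer message. This is a sketch; the helper name `training_commands` is hypothetical, and the YAML names are the ones used in the cell.

```python
import sys
from pathlib import Path
from typing import List


def training_commands(template_dir: str, yaml_names: List[str]) -> List[List[str]]:
    """Build one `python train.py <yaml>` command per YAML file that exists."""
    root = Path(template_dir)
    if not (root / "train.py").is_file():
        raise FileNotFoundError(f"train.py not found in {root}")
    commands = []
    for name in yaml_names:
        # Skip parameter files that are missing instead of failing mid-run.
        if (root / name).is_file():
            commands.append([sys.executable, "train.py", name])
    return commands
```

Each returned command can be passed to `subprocess.run(cmd, cwd=template_dir, check=True)`.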


In [19]:
import torchaudio
from speechbrain.pretrained import EncoderClassifier
import os

# Path to the trained model
caminho_modelo_treinado = "/content/speechbrain/templates/speaker_id/results/speaker_id/1988"

# Load the trained model
classifier = EncoderClassifier.from_hparams(source=caminho_modelo_treinado)

# Load an audio signal
signal, fs = torchaudio.load('/content/speechbrain/tests/samples/single-mic/example1.wav')

# Compute speaker embeddings
embeddings = classifier.encode_batch(signal)

# Perform classification
output_probs, score, index, text_lab = classifier.classify_batch(signal)

# Posterior log probabilities
print(output_probs)

# Score (i.e., max log posterior)
print(score)

# Index of the predicted speaker
print(index)

# Text label of the predicted speaker
print(text_lab)


HFValidationError: ignored

In [None]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="/content/speechbrain/templates/speaker_id/results/speaker_id/1988/save/CKPT+2023-09-23+14-18-21+00", hparams_file='/content/speechbrain/templates/speaker_id/results/speaker_id/1988/save/CKPT+2023-09-23+14-18-21+00/CKPT.yaml', savedir="/content/speechbrain/templates/speaker_id/results/speaker_id/1988/save")
audio_file = '/content/speechbrain/tests/samples/single-mic/example1.wav'
asr_model.transcribe_file(audio_file)

KeyError: ignored

In [None]:
%%writefile /content/speechbrain/templates/speaker_id/results/speaker_id/1986/save/CKPT+2023-09-23+12-23-22+00/CKPT.yaml

# #################################
# Basic inference parameters for speaker-id. A first network computes
# the embeddings; on top of it, we employ a classifier.
#
# Author:
#  * Mirco Ravanelli 2021
# #################################

# pretrain folders:
pretrained_path: /content/best_model/


# Model parameters
n_mels: 23
sample_rate: 16000
n_classes: 28 # In this case, we have 28 speakers
emb_dim: 512 # dimensionality of the embeddings

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>

# Mean and std normalization of the input features
mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False

# To design a custom model, either just edit the simple CustomModel
# class that's listed here, or replace this `!new` call with a line
# pointing to a different file you've defined.
embedding_model: !new:custom_model.Xvector
    in_channels: !ref <n_mels>
    activation: !name:torch.nn.LeakyReLU
    tdnn_blocks: 5
    tdnn_channels: [512, 512, 512, 512, 1500]
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    lin_neurons: !ref <emb_dim>

classifier: !new:custom_model.Classifier
    input_shape: [null, null, !ref <emb_dim>]
    activation: !name:torch.nn.LeakyReLU
    lin_blocks: 1
    lin_neurons: !ref <emb_dim>
    out_neurons: !ref <n_classes>

label_encoder: !new:speechbrain.dataio.encoder.CategoricalEncoder

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class.
modules:
    compute_features: !ref <compute_features>
    embedding_model: !ref <embedding_model>
    classifier: !ref <classifier>
    mean_var_norm: !ref <mean_var_norm>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        embedding_model: !ref <embedding_model>
        classifier: !ref <classifier>
        label_encoder: !ref <label_encoder>
    paths:
        embedding_model: !ref <pretrained_path>/embedding_model.ckpt
        classifier: !ref <pretrained_path>/classifier.ckpt
        label_encoder: !ref <pretrained_path>/label_encoder.txt


Overwriting /content/speechbrain/templates/speaker_id/results/speaker_id/1986/save/CKPT+2023-09-23+12-23-22+00/CKPT.yaml


In [None]:
from speechbrain.pretrained import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source=caminho_modelo_treinado)

file1 = '/content/speechbrain/tests/samples/single-mic/example1.wav'
file2 = '/content/speechbrain/tests/samples/single-mic/example2.flac'

score, prediction = verification.verify_files(file1, file2)

print(score)
print(prediction)  # True = same speaker, False = different speakers


./data/rirs_noises.zip exists. Skipping download


KeyError: ignored



## **Conclusion**

In this notebook, a model was built that compares audio recordings to decide statistically whether two recordings contain the same voice, by training a neural network that extracts features from speech files. The trained variants could not be compared, because it was not possible to switch inference from best_model to the folders of the other runs.
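
One possible way to locate each variant is sketched below. It assumes every run writes to a folder named after its seed, following the pattern of the `./results/speaker_id/1986` folder in the training log. The helper name `run_folders` is hypothetical, and the seeds are the ones passed in the override commands.

```python
from pathlib import Path
from typing import Dict, Iterable


def run_folders(results_root: str, seeds: Iterable[int]) -> Dict[int, Path]:
    """Map each seed to the save folder its run is assumed to write to."""
    return {seed: Path(results_root) / str(seed) / "save" for seed in seeds}


# Seeds used in the override commands of this notebook.
folders = run_folders("results/speaker_id", [123, 456, 789, 321, 555])
```

Each folder could then get its own copy of `label_encoder.txt` and checkpoint files, prepared in the same way as `/content/best_model/`.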