# Create an Audio Dataset
This Jupyter Notebook builds an audio dataset.

It extracts sentences from a document (for example, a PDF), shows them one at a time to be read aloud into the microphone, and saves each sentence together with its recording.

In [1]:
import os
import pathlib
import re
import sys
import time
import wave

import IPython
import librosa
import numpy as np
import pandas as pd
import pdfplumber
import pyaudio
import soundfile as sf  # used by Wave2Vec2Inference.file_to_text
import torch
import webrtcvad
from transformers import AutoModelForCTC, AutoProcessor, Wav2Vec2Processor

To list the names of all microphones that support a given configuration:

In [2]:
# List every input device that supports the requested format
p = pyaudio.PyAudio()

for i in range(p.get_device_count()):
    devinfo = p.get_device_info_by_index(i)  # Or whatever device you care about.
    try:
        if p.is_format_supported(48000,  # Sample rate
                                 input_device=devinfo['index'],
                                 input_channels=devinfo['maxInputChannels'],
                                 input_format=pyaudio.paInt16):
            print(p.get_device_info_by_index(i).get('name'))
    except Exception as e:
        continue

HDA Intel: ALC897 Analog (hw:0,0)
HDA Intel: ALC897 Alt Analog (hw:0,2)
HyperX SoloCast: USB Audio (hw:1,0)
sysdefault
pulse
default



## Create the dataset

In [3]:
all_clean_text = {}

### Load the document and split it into sentences

**Note:** This code must be customized for each document.

The data is cleaned by removing characters introduced when the document is converted to text in Python.

#### PDF: Soltando a imaginação lendas e contos infantis

In [4]:
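def clean_text(extracted_text):
    text = extracted_text.replace('-\n', '')  # re-join words hyphenated at line breaks
    text = text.replace('\n', ' ')
    text = text.replace(' -', '')
    text = text.replace('— ', '')  # drop the dash that opens dialogue lines
    # Cut the page footer, but only when it is present: rfind returns -1
    # otherwise, and slicing with -1 would drop the last character.
    footer_start = text.rfind(' soltando a imaginação soltando_a_imaginacao_c')
    if footer_start != -1:
        text = text[:footer_start]
    text = text.strip()
    return text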
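
To illustrate what the cleaning step does, here is a minimal sketch on a made-up page. The function name `clean_page_text`, the `footer` parameter, and the sample string are assumptions for illustration only; the real footer text depends on the PDF.

```python
def clean_page_text(extracted_text, footer=None):
    """Join hyphenated line breaks, flatten newlines, and drop dialogue dashes."""
    text = extracted_text.replace('-\n', '')   # re-join words hyphenated across lines
    text = text.replace('\n', ' ')             # the whole page becomes one line
    text = text.replace('— ', '')              # drop the dash that opens dialogue
    if footer:
        cut = text.rfind(footer)
        if cut != -1:                          # only cut when the footer is present
            text = text[:cut]
    return text.strip()


# Hypothetical page text: a hyphenated word, a dialogue line, and a footer.
sample = "Era uma vez um pa-\ntinho muito feio.\n— Que pato estranho!\nrodapé 12"
print(clean_page_text(sample, footer=' rodapé'))
```

Checking the result of `rfind` before slicing matters: when the footer is missing, `rfind` returns -1 and an unguarded slice would silently drop the last character of the page.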

Reads the PDF, extracts the text, and cleans it so it can serve as the script to read aloud when creating the dataset.

In [5]:
current_tale_name = 'temp'
#all_clean_text[current_tale_name] = ''
new_tale = False
with pdfplumber.open('01. Soltando a imaginação lendas e contos infantis autor Hans Christian Andersen e Oscar Wilde.pdf') as pdf:
    # Page range to use
    for i in range(8, 98):
        # The second page at the start of each tale is blank
        if new_tale:
            new_tale = False
            continue
        text = pdf.pages[i]
        # Keep only regular text (ignore bold text)
        text_filtered = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" not in obj["fontname"])
        text_cleaned = clean_text(text_filtered.extract_text())
        # Detect the start of a new tale
        if text_cleaned.startswith('tradução do conto'):
            new_tale = True
            title = text.filter(lambda obj: obj["object_type"] == "char" and "Bold" in obj["fontname"]).extract_text()
            current_tale_name = title[:title.find('Tradução')-1]
            all_clean_text[current_tale_name] = ''
            continue

        all_clean_text[current_tale_name] += f' {text_cleaned}'

#### PDF: Contos tradicionais, fábulas, lendas e mitos

In [6]:
def clean_text(extracted_text):
    text = extracted_text.replace('\n— ', ' ')
    text = text.replace('-\n', '')
    text = text.replace('\n', ' ')
    text = text.replace('— ', '')
    text = text.strip()
    return text

Reads the PDF, extracts the text, and cleans it so it can serve as the script to read aloud when creating the dataset.

The tales are not separated and all titles are removed. To keep track of which story is being read, follow along in the PDF or improve the code :)

In [7]:
current_tale_name = 'contos_tradicionais'
all_clean_text[current_tale_name] = ''
with pdfplumber.open('03. Contos tradicionais, fábulas, lendas e mitos autor Ana Rosa Abreu, Claudia Rosenberg Aratangy, Eliane Mingues, Marília Costa Dias, Marta Durante e Telma Weisz.pdf') as pdf:
    for i in range(6, 126):
        text = pdf.pages[i]
        #print(text.extract_text())
        text_filtered = text.filter(lambda obj: obj["object_type"] == "char" and not "Bold" in obj["fontname"])
        all_clean_text[current_tale_name] += f' {clean_text(text_filtered.extract_text())}'

### Load the microphone and its dependencies

To make it easier to capture audio correctly, webrtcvad flags the frames that contain voice and Wav2Vec2 is used to help tell human speech apart from noise.

In [8]:
def list_microphones(pyaudio_instance):
    info = pyaudio_instance.get_host_api_info_by_index(0)
    numdevices = info.get('deviceCount')

    result = []
    for i in range(0, numdevices):
        if (pyaudio_instance.get_device_info_by_host_api_device_index(0, i).get('maxInputChannels')) > 0:
            name = pyaudio_instance.get_device_info_by_host_api_device_index(
                0, i).get('name')
            result += [[i, name]]
    return result

In [9]:
def get_input_device_id(device_name, microphones):
    for device in microphones:
        if device_name in device[1]:
            return device[0]

In [10]:
vad = webrtcvad.Vad()
vad.set_mode(1)

audio = pyaudio.PyAudio()
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 48000
# A frame must be either 10, 20, or 30 ms in duration for webrtcvad
FRAME_DURATION = 30
CHUNK = int(RATE * FRAME_DURATION / 1000)

# Matched as a substring of the device name, so the ALSA index (hw:X,0) may change between sessions
device_name = 'HyperX SoloCast'  # or "default"
#asr_input_queue = Queue()
microphones = list_microphones(audio)
selected_input_device_id = get_input_device_id(
    device_name, microphones)

stream = audio.open(input_device_index=selected_input_device_id,
                    format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
stream.stop_stream()
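
The `CHUNK` value above follows from simple arithmetic: webrtcvad only accepts frames of 10, 20, or 30 ms, so the number of samples per read depends on the sample rate. A small sketch of that calculation, where `frame_size` and `SAMPLE_WIDTH_BYTES` are illustrative names that are not part of the notebook:

```python
SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM, as in pyaudio.paInt16


def frame_size(rate_hz, frame_ms, channels=1):
    """Return (samples, bytes) contained in one frame of the given duration."""
    samples = rate_hz * frame_ms // 1000
    return samples, samples * SAMPLE_WIDTH_BYTES * channels


for rate in (16000, 48000):
    for ms in (10, 20, 30):
        samples, n_bytes = frame_size(rate, ms)
        print(f'{rate} Hz, {ms} ms -> {samples} samples, {n_bytes} bytes')
```

With the settings used here (48000 Hz, 30 ms, mono), each read returns 1440 samples, or 2880 bytes.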

### Load the Wav2Vec2 model

Source code: https://github.com/oliverguhr/wav2vec2-live

In [11]:
class Wave2Vec2Inference:
    def __init__(self, model_name, hotwords=None, use_lm_if_possible=True, use_gpu=True):
        self.device = "cuda" if use_gpu and torch.cuda.is_available() else "cpu"
        if use_lm_if_possible:
            self.processor = AutoProcessor.from_pretrained(model_name)
        else:
            self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.model = AutoModelForCTC.from_pretrained(model_name)
        self.model.to(self.device)
        # Avoid a shared mutable default argument
        self.hotwords = hotwords if hotwords is not None else []
        self.use_lm_if_possible = use_lm_if_possible

    def buffer_to_text(self, audio_buffer):
        if len(audio_buffer) == 0:
            # Same (transcription, confidence) shape as the normal return value
            return "", 0.0

        inputs = self.processor(torch.tensor(audio_buffer), sampling_rate=16_000, return_tensors="pt", padding=True)

        with torch.no_grad():
            logits = self.model(inputs.input_values.to(self.device),
                                attention_mask=inputs.attention_mask.to(self.device)).logits            

        if hasattr(self.processor, 'decoder') and self.use_lm_if_possible:
            transcription = \
                self.processor.decode(logits[0].cpu().numpy(),                                      
                                      hotwords=self.hotwords,
                                      #hotword_weight=self.hotword_weight,  
                                      output_word_offsets=True,                                      
                                   )                             
            confidence = transcription.lm_score / len(transcription.text.split(" "))
            transcription = transcription.text       
        else:
            predicted_ids = torch.argmax(logits, dim=-1)
            transcription = self.processor.batch_decode(predicted_ids)[0]
            confidence = self.confidence_score(logits,predicted_ids)

        return transcription, confidence   

    def confidence_score(self, logits, predicted_ids):
        scores = torch.nn.functional.softmax(logits, dim=-1)                                                           
        pred_scores = scores.gather(-1, predicted_ids.unsqueeze(-1))[:, :, 0]
        mask = torch.logical_and(
            predicted_ids.not_equal(self.processor.tokenizer.word_delimiter_token_id), 
            predicted_ids.not_equal(self.processor.tokenizer.pad_token_id))

        character_scores = pred_scores.masked_select(mask)
        total_average = torch.sum(character_scores) / len(character_scores)
        return total_average

    def file_to_text(self, filename):
        audio_input, samplerate = sf.read(filename)
        assert samplerate == 16000
        return self.buffer_to_text(audio_input)
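
The idea behind `confidence_score` can be shown without torch: take the softmax probability of the winning token in each frame, skip the frames where padding or the word delimiter wins, and average the rest. The sketch below is a plain-Python illustration under those assumptions; `greedy_confidence`, the toy vocabulary, and the logits are made up for the example.

```python
import math


def greedy_confidence(logits, ignored_ids):
    """Average softmax probability of the arg-max token, skipping ignored ids."""
    kept = []
    for frame in logits:
        peak = max(frame)
        exps = [math.exp(v - peak) for v in frame]   # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        if best not in ignored_ids:
            kept.append(probs[best])
    return sum(kept) / len(kept) if kept else 0.0


# Toy vocabulary: id 0 = padding, id 1 = word delimiter, ids 2 and 3 = letters.
logits = [
    [5.0, 0.0, 0.0, 0.0],   # padding wins, so this frame is ignored
    [0.0, 0.0, 3.0, 0.0],   # letter id 2
    [0.0, 0.0, 0.0, 3.0],   # letter id 3
]
print(round(greedy_confidence(logits, ignored_ids={0, 1}), 3))
```

Returning 0.0 when every frame is ignored avoids a division by zero for audio in which no character was predicted.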


In [12]:
asr = Wave2Vec2Inference("facebook/wav2vec2-large-xlsr-53-portuguese")

## Function to listen to the microphone and ignore noise

Captures audio and applies a few rules to avoid recording noise, so that only speech is saved even if the speaker takes a while to start talking.

In [13]:
# Seconds of silence after which the phrase is considered finished
silence_time = 0.4
def listen_speech(text, df, file_id, path_to_save_data='data'):
    print(text)
    while True:
        stream.start_stream()
        tic = time.time()
        frames = []
        f = b''
        while True:
            frame = stream.read(CHUNK, exception_on_overflow=False)
            is_speech = vad.is_speech(frame, RATE)
            if is_speech:
                frames.append(frame)
                f += frame
                tic = time.time()
            elif time.time() - tic < silence_time:
                continue
            else:
                if len(frames) > 1:
                    if RATE == 16000:
                        audio_frames = f
                        float64_buffer = np.frombuffer(audio_frames, dtype=np.int16) / 32767
                        output_model = asr.buffer_to_text(float64_buffer)
                    else:
                        waveFile = wave.open(f'{path_to_save_data}/audio/temp/temp.wav', 'wb')
                        waveFile.setnchannels(CHANNELS)
                        waveFile.setsampwidth(audio.get_sample_size(FORMAT))
                        #waveFile.setframerate(RATE)
                        waveFile.setframerate(RATE)
                        waveFile.writeframes(b''.join(frames))
                        duration_seconds = waveFile.getnframes() / waveFile.getframerate()
                        #print(f'Video Duration: {duration_seconds}')
                        waveFile.close()

                        audio_file, _ = librosa.load(f'{path_to_save_data}/audio/temp/temp.wav', sr=16000)
                        output_model = asr.buffer_to_text(audio_file)
                        #print(output_model)
                    
                    # Minimum transcription length and confidence
                    if len(output_model[0]) < 2 or output_model[1] < 0.8:
                        #print(f'Not saved: {output_model}')
                        frames = []
                        f = b''  # also reset the raw buffer used when RATE == 16000
                        tic = time.time()
                        continue
                    print(f'Wav2Vec2 result: {output_model}')
                    stream.stop_stream()

                    # Write the microphone audio to disk as a WAV file
                    waveFile = wave.open(f'{path_to_save_data}/audio/{file_id}.wav', 'wb')
                    waveFile.setnchannels(CHANNELS)
                    waveFile.setsampwidth(audio.get_sample_size(FORMAT))
                    waveFile.setframerate(RATE)
                    waveFile.writeframes(b''.join(frames))
                    duration_seconds = waveFile.getnframes() / waveFile.getframerate()
                    print(f'Audio duration: {duration_seconds}')
                    waveFile.close()
                    break
                frames = []
                f = b''
        #stream.stop_stream()
        #stream.close()
        #audio.terminate()
        result_input = input()
        if result_input == 'r':
            continue
        # Jump == Do not save this audio
        elif result_input == 'j':
            os.remove(f'{path_to_save_data}/audio/{file_id}.wav')
            IPython.display.clear_output()
            return False, df, result_input
        else:
            IPython.display.clear_output()
            df = pd.concat([df, pd.DataFrame([[f'{path_to_save_data}/audio/{file_id}.wav', text.strip()]], columns=list(df.columns))])
            return True, df, result_input
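
The core rule in `listen_speech` is that an utterance ends once no voiced frame has been seen for `silence_time` seconds. The sketch below simulates that rule on a list of per-frame speech flags instead of a live microphone. It is a simplification: it counts frames rather than wall-clock time and leaves out the Wav2Vec2 length and confidence check. `segment_speech` and its parameters are illustrative names.

```python
def segment_speech(flags, frame_ms=30, silence_ms=400):
    """Group per-frame speech flags into utterances.

    An utterance ends once silence has lasted at least `silence_ms`.
    Returns a list of lists holding the indices of the voiced frames.
    """
    utterances, current, silent_for = [], [], 0
    for idx, is_speech in enumerate(flags):
        if is_speech:
            current.append(idx)
            silent_for = 0
        else:
            silent_for += frame_ms
            if silent_for >= silence_ms and current:
                utterances.append(current)
                current = []
    if current:
        utterances.append(current)
    return utterances


# 3 voiced frames, a short 5-frame pause (150 ms), 2 voiced frames,
# 14 silent frames (420 ms) that close the utterance, then 1 voiced frame.
flags = [True] * 3 + [False] * 5 + [True] * 2 + [False] * 14 + [True]
print(segment_speech(flags))
```

As in the notebook, silent frames inside a short pause are not stored, so a pause shorter than the threshold does not split the utterance.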


In [None]:
#stream.stop_stream()
#stream.close()
#audio.terminate()

### Show sentences and capture the microphone

In [14]:
def get_df_annotation(path='data'):
    if os.path.exists(f'{path}/annotation.tsv'):
        return pd.read_csv(f'{path}/annotation.tsv', sep='\t')
    return pd.DataFrame(columns=['path', 'sentence'])

Each audio file has a unique ID. Before capturing starts, check whether audio files already exist and, if so, continue the count from the last one.

In [15]:
def get_next_file_id(df_annotation):
    if len(df_annotation) > 0:
        # The file name (without directory or extension) is the numeric ID
        last_filename = df_annotation.iloc[-1]['path']
        return int(pathlib.Path(last_filename).stem) + 1
    return 0
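
A quick way to check the ID logic is to run it on plain lists of paths. The sketch below assumes, as the notebook does, that every file is named `<id>.wav`; `next_file_id` is an illustrative stand-in that takes a list rather than a DataFrame.

```python
from pathlib import Path


def next_file_id(saved_paths):
    """Return the ID after the last saved file, or 0 when nothing is saved."""
    if not saved_paths:
        return 0
    return int(Path(saved_paths[-1]).stem) + 1


print(next_file_id([]))
print(next_file_id(['data/audio/0.wav', 'data/audio/1.wav']))
print(next_file_id(['7.wav']))
```

Using `Path(...).stem` keeps the parsing correct even when the directory part of the path contains a dot, such as `./data/audio/3.wav`.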

In [16]:
def save_annotation(is_to_save, df_annotation, path_to_save_data):
    if is_to_save:
        df_annotation.to_csv(f'{path_to_save_data}/annotation.tsv', sep='\t', index=None)
        return True
    return False

Looks for the last N sentences that match those saved in the annotation file, so the session can resume where it stopped.

In [17]:
count_search_where_stopped_default = 1
def check_is_where_stopped(text, count_search_where_stopped):
    if len(df_annotation) > 0:
        if text.strip() == df_annotation.iloc[-count_search_where_stopped].sentence:
            count_search_where_stopped -= 1
            if count_search_where_stopped == 0:
                return True, count_search_where_stopped
        else:
            count_search_where_stopped = count_search_where_stopped_default
    
    return False, count_search_where_stopped

This is where the magic happens. A sentence is shown in the output and microphone capture starts. Once speech has been detected and has stopped, an input prompt opens with 4 possible actions:

- Leave blank = Save the audio and the annotation
- Type "j" = Skip the sentence; neither audio nor annotation is saved
- Type "r" = Record the same sentence again, discarding the previous audio. Use it after a misreading
- Type "end" = Stop showing new sentences and exit, saving the last recorded audio (the exit only happens if the sentence ends with a period)

In [18]:
is_where_stopped = True  # True == start from beginning; False == Continue from where stopped

path_to_save_data = 'data'
pathlib.Path(path_to_save_data).mkdir(exist_ok=True, parents=True)
pathlib.Path(path_to_save_data + '/audio').mkdir(exist_ok=True, parents=True)
pathlib.Path(path_to_save_data + '/audio/temp').mkdir(exist_ok=True, parents=True)
df_annotation = get_df_annotation(path_to_save_data)
file_id = get_next_file_id(df_annotation)
result_input = ''

count_search_where_stopped = count_search_where_stopped_default
for tale_name, tale_text in all_clean_text.items():
    for text in tale_text.split('.'):
        if len(text) > 0:
            is_subsplited = False
            for character_split in [';', '!', '?']:
                if text.find(character_split) != -1:
                    splited = text.split(character_split)
                    for idx, text2 in enumerate(splited):
                        if len(text2) > 0:
                            # Add the character removed by split
                            if idx+1 != len(splited):
                                text2 += character_split
                            else:
                                text2 += '.'
                            
                            if is_where_stopped is False:
                                is_where_stopped, count_search_where_stopped = check_is_where_stopped(text2, count_search_where_stopped)
                                continue
                            
                                
                            is_file_saved, df_annotation, result_input = listen_speech(text2, df_annotation, file_id, path_to_save_data)
                            file_id += save_annotation(is_file_saved, df_annotation, path_to_save_data)
                    is_subsplited = True
                    break
            
            if is_subsplited is False:
                if is_where_stopped is False:
                    is_where_stopped, count_search_where_stopped = check_is_where_stopped(text + '.', count_search_where_stopped)
                    continue
                
                is_file_saved, df_annotation, result_input = listen_speech(text + '.', df_annotation, file_id, path_to_save_data)
                file_id += save_annotation(is_file_saved, df_annotation, path_to_save_data)
                
            if result_input == 'end':
                sys.exit()

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
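
The nested loops above split each tale on `.` and then on the first of `;`, `!`, or `?` found in each chunk. As a possible alternative, and only as a sketch, a single regular expression can split on all four marks and keep the punctuation. `split_sentences` is a hypothetical helper and is not part of the notebook.

```python
import re


def split_sentences(text):
    """Split after '.', ';', '!' or '?' and keep the punctuation mark."""
    # Each match is a run of non-terminator characters plus the marks that end it.
    parts = re.findall(r'[^.;!?]+[.;!?]*', text)
    return [part.strip() for part in parts if part.strip()]


print(split_sentences("Quem vem lá? Sou eu; abre a porta! Entrou."))
```

Unlike the loop in the notebook, this version does not append a period to a trailing fragment that has no punctuation, so the resume check would need to compare sentences produced by the same splitter.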


## Compute total audio duration

In [20]:
path_data = 'data/audio'

total_time = 0
total_files = 0
for filename in os.listdir(path_data):
    if filename.endswith('wav'):
        audio_file, _ = librosa.load(f'{path_data}/{filename}', sr=16000)
        total_time += librosa.get_duration(y=audio_file, sr=16000)
        total_files += 1
print(f'Total audio files: {total_files}')
print(f'Mean time audio files (sec): {total_time/total_files}')
print(f'Total time audio files (min): {total_time/60}')

Total audio files: 8
Mean time audio files (sec): 3.1575
Total time audio files (min): 0.42100000000000004
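
Because the recordings are uncompressed WAV files, their duration can also be read from the header without decoding or resampling: it is the number of frames divided by the frame rate. A minimal sketch using only the standard library, with a temporary file of silence standing in for a real recording (`wav_duration_seconds` is an illustrative name):

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Duration = number of frames / frame rate, read from the WAV header."""
    with wave.open(path, 'rb') as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Write one second of 16-bit mono silence at 48 kHz to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'silence.wav')
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(48000)
        wav_file.writeframes(b'\x00\x00' * 48000)
    duration = wav_duration_seconds(path)
    print(duration)
```

This may be noticeably faster than loading every file with librosa when the dataset grows, at the cost of working only for WAV files.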


## Prepare the dataset for training

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
def get_df_annotation(path='data'):
    if os.path.exists(f'{path}/annotation.tsv'):
        return pd.read_csv(f'{path}/annotation.tsv', sep='\t')
    return None

In [23]:
def save_dataset_splited(df_annotation, filename='train'):
    df_annotation.to_csv(f'{filename}.tsv', sep='\t', index=None)

In [24]:
df_annotation = get_df_annotation('data')

In [25]:
train, test = train_test_split(df_annotation, test_size=0.4)
test, validation = train_test_split(test, test_size=0.5)

In [26]:
save_dataset_splited(train, 'train')
save_dataset_splited(test, 'test')
save_dataset_splited(validation, 'validation')
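
The two `train_test_split` calls produce roughly a 60/20/20 split: 40% is held out first and then halved into test and validation. The plain-Python sketch below shows the same idea with a fixed seed so the split is reproducible. `split_60_20_20` is an illustrative stand-in and is not the scikit-learn implementation, which may round the subset sizes differently.

```python
import random


def split_60_20_20(rows, seed=42):
    """Shuffle and split rows into train (60%), test (20%) and validation (20%)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # seeded shuffle for reproducibility
    n_train = len(rows) * 6 // 10
    n_test = (len(rows) - n_train) // 2
    train_rows = rows[:n_train]
    test_rows = rows[n_train:n_train + n_test]
    validation_rows = rows[n_train + n_test:]
    return train_rows, test_rows, validation_rows


train_rows, test_rows, validation_rows = split_60_20_20(range(10))
print(len(train_rows), len(test_rows), len(validation_rows))
```

With scikit-learn, passing the same `random_state` to both `train_test_split` calls gives a similarly reproducible split.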