# Voice Chatbot with Deep Learning

In [1]:
!git clone https://github.com/AcecomFCUNI/Chatbot-Acecom.git
%cd Chatbot-Acecom/src

Cloning into 'Chatbot-Acecom'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 22 (delta 2), reused 22 (delta 2), pack-reused 0
Unpacking objects: 100% (22/22), done.
/content/Chatbot-Acecom/src


## 1 - The problem to solve

The idea is to build a chatbot that interprets human speech and generates the conversation as text, using <ins>state-of-the-art Deep Learning architectures</ins>:

![](https://github.com/AcecomFCUNI/Chatbot-Acecom/blob/master/assets/idea_general_chatbot.png?raw=1)

## 2 - Chatbot components

We will use *wav2vec2* for speech-to-text conversion and *BlenderBot* to generate the conversation:

![](https://github.com/AcecomFCUNI/Chatbot-Acecom/blob/master/assets/chatbot_detallado.png?raw=1)
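
The two-stage flow in the diagram can be sketched as a composition of two functions. This is only a minimal illustration of the data flow: `chatbot_turn`, `fake_transcribe`, and `fake_reply` are hypothetical names that stand in for wav2vec2 and BlenderBot, and none of them belong to any library.

```python
from typing import Callable, Sequence, Tuple


def chatbot_turn(
    audio: Sequence[float],
    transcribe: Callable[[Sequence[float]], str],
    reply: Callable[[str], str],
) -> Tuple[str, str]:
    """Run one turn: audio -> text (speech-to-text) -> response (dialogue model)."""
    text = transcribe(audio)
    response = reply(text)
    return text, response


# Toy stand-ins (hypothetical): a real system would call wav2vec2 and BlenderBot here.
def fake_transcribe(audio: Sequence[float]) -> str:
    return "HELLO"


def fake_reply(text: str) -> str:
    return f"You said: {text.lower()}"


text, response = chatbot_turn([0.0, 0.1, -0.1], fake_transcribe, fake_reply)
print(f"{text} -> {response}")
```

Keeping the two stages behind plain callables makes it possible to test the loop logic without a microphone or model downloads.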

Both *wav2vec2* and *BlenderBot* are based on [Transformer networks](https://youtu.be/Wp8NocXW_C4):

![](https://github.com/AcecomFCUNI/Chatbot-Acecom/blob/master/assets/red-transformer.png?raw=1)
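
The core operation of a Transformer is scaled dot-product attention. The following NumPy sketch shows a single attention head without masking or learned projections. The sizes (4 tokens, 8 dimensions) are arbitrary assumptions chosen for illustration; the real models stack many such heads with trained weights.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q @ k.T / sqrt(d_k)) @ v for one head, without masking."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                       # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings (arbitrary sizes)
out, weights = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```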

## 3 - Speech-to-text conversion with *wav2vec2*

[*wav2vec2*](https://arxiv.org/pdf/2006.11477.pdf) was developed by Facebook in 2020:

![](https://github.com/AcecomFCUNI/Chatbot-Acecom/blob/master/assets/wav2vec2.png?raw=1)



In [2]:
!pip install transformers  # wav2vec2 and BlenderBot
!pip install git+https://github.com/ricardodeazambuja/colab_utils.git  # microphone capture
!pip install librosa  # audio preprocessing


In [3]:
# Import libraries
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from colab_utils import getAudio
import librosa
import numpy as np

# Load the pretrained speech-to-text model and its processor
w2v2 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
w2v2_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Capture audio from the microphone (at 48 kHz)
audio, sr = getAudio()

In [5]:
# Resample to 16 kHz (required by wav2vec2)
audio_float = audio.astype(np.float32)
audio_16k = librosa.resample(audio_float, orig_sr=sr, target_sr=16000)
print(f'Resampled audio size: {audio_16k.shape}')

# Speech to text
entrada = w2v2_processor(audio_16k, sampling_rate=16000, return_tensors="pt").input_values
print(f'wav2vec2 input size: {entrada.shape}')
probabilidades = w2v2(entrada).logits  # unnormalized scores (logits) per frame and vocabulary token
print(f'Logits array size (wav2vec2 output): {probabilidades.shape}')
predicciones = torch.argmax(probabilidades, dim=-1)  # most likely token per frame
print(f'Predictions array size: {predicciones.shape}')
transcripcion = w2v2_processor.decode(predicciones[0])
print(transcripcion)

Resampled audio size: (48960,)
wav2vec2 input size: torch.Size([1, 48960])
Logits array size (wav2vec2 output): torch.Size([1, 152, 32])
Predictions array size: torch.Size([1, 152])
E OMESTASAN
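
The shapes printed above follow from two steps: the convolutional feature encoder reduces 48960 audio samples to 152 frames, and the decoder turns one predicted token id per frame into text by collapsing repeats and dropping blank tokens (greedy CTC decoding). The sketch below reproduces both steps in plain Python. The kernel sizes and strides are assumed from the default wav2vec2 base configuration (they can be checked in `w2v2.config`), and `toy_vocab` with blank id 0 is a made-up vocabulary for illustration, not the model's real one.

```python
# Kernel sizes and strides of the wav2vec2 base feature encoder
# (assumption: default configuration values; verify against w2v2.config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of frames left after the stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def ctc_greedy_decode(ids, vocab, blank_id=0):
    """Collapse consecutive repeated ids, then drop the blank token."""
    chars = []
    previous = None
    for token_id in ids:
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {0: "<blank>", 1: "H", 2: "I", 3: " "}

print(num_frames(48960))  # 152, matching the logits shape [1, 152, 32]
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2], toy_vocab))  # HI HI
```

The blank token is what lets CTC emit a doubled letter: the ids `[1, 0, 1]` decode to `HH`, while `[1, 1]` collapses to a single `H`.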


## 4 - *BlenderBot*

[*BlenderBot*](https://ai.facebook.com/blog/state-of-the-art-open-source-chatbot/) was also developed by Facebook in 2020, with the goal of enabling more human-like, natural interaction:

![](https://github.com/AcecomFCUNI/Chatbot-Acecom/blob/master/assets/blenderbot.png?raw=1)

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
blender = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill")

In [7]:
blender.generate?

In [8]:
# Initial test
entradaBlender = tokenizer([transcripcion], return_tensors='pt')
print(f'Input phrase: {transcripcion}')
print(f'Input to BlenderBot: {entradaBlender}')
ids_respuesta = blender.generate(**entradaBlender)
print(f'BlenderBot output: {ids_respuesta}')
respuesta = tokenizer.batch_decode(ids_respuesta)
print(f'Output after the tokenizer: {respuesta}')

Input phrase: E OMESTASAN
Input to BlenderBot: {'input_ids': tensor([[ 477,  471,   52, 2291,   59, 3159, 2159,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}


To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


BlenderBot output: tensor([[   1, 2219,  304,  957,  635,  324,  265,  885,   92,  923,   38,  281,
          615,  635,  324,  487,  298,  312,  372, 1874,    8,    2]])
Output after the tokenizer: ["<s> Have you ever been on a cruise? I've been on one and it was amazing!</s>"]


In [9]:
# Remove the start-of-sentence and end-of-sentence tokens
respuesta = respuesta[0].replace('<s>','').replace('</s>','')
print(f'Correctly formatted output: {respuesta}')

Correctly formatted output:  Have you ever been on a cruise? I've been on one and it was amazing!
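
The string replacement removes the special tokens but leaves a leading space in the result. The Hugging Face `batch_decode` method also accepts `skip_special_tokens=True`, which drops tokens such as `<s>` and `</s>` during decoding. As a library-independent alternative, the helper below is a small sketch that does the cleanup by hand and trims the whitespace; `clean_response` is a hypothetical name, not part of `transformers`.

```python
def clean_response(decoded: str, special_tokens=("<s>", "</s>")) -> str:
    """Remove special tokens from a decoded string and trim surrounding whitespace."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()


raw = "<s> Have you ever been on a cruise? I've been on one and it was amazing!</s>"
print(clean_response(raw))
```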


In [10]:
# Create a short test chat
NFRASES = 5
nfrase = 1
while nfrase <= NFRASES:
  frase = input('-> MIGUEL: ')
  entradaBlender = tokenizer([frase], return_tensors='pt')
  ids_respuesta = blender.generate(**entradaBlender)
  respuesta = tokenizer.batch_decode(ids_respuesta)
  respuesta = respuesta[0].replace('<s>','').replace('</s>','')
  print(f'-> BLENDERBOT: {respuesta}')

  nfrase += 1

-> MIGUEL: hola
-> BLENDERBOT:  Holidays are my favorite time of the year.  Do you like holidays?
-> MIGUEL: what i do?
-> BLENDERBOT:  What do you mean what do you do? I'm not sure what you mean by what you do.


KeyboardInterrupt: ignored

## 5 - *wav2vec2* + *BlenderBot* and testing the chatbot

Now we put audio capture -> wav2vec2 -> BlenderBot in a loop:

In [11]:
NFRASES = 5
nfrase = 1

while nfrase <= NFRASES:
  input()     # Wait for Enter to start recording
  
  # Capture audio and resample it to 16 kHz
  audio, sr = getAudio()
  audio_float = audio.astype(np.float32)
  audio_16k = librosa.resample(audio_float, orig_sr=sr, target_sr=16000)

  # Speech to text
  entrada = w2v2_processor(audio_16k, sampling_rate=16000, return_tensors="pt").input_values
  probabilidades = w2v2(entrada).logits
  predicciones = torch.argmax(probabilidades, dim=-1)
  frase = w2v2_processor.decode(predicciones[0])
  
  # Print the transcription
  print(f'-> MIGUEL: {frase}')

  # BlenderBot
  entradaBlender = tokenizer([frase], return_tensors='pt')
  ids_respuesta = blender.generate(**entradaBlender)
  respuesta = tokenizer.batch_decode(ids_respuesta)
  respuesta = respuesta[0].replace('<s>','').replace('</s>','')
  print(f'-> BLENDERBOT: {respuesta}')

  nfrase += 1




-> MIGUEL: PORING IN A
-> BLENDERBOT:  Have you ever been in a situation where you felt like you had to work really hard to get a good grade?



-> MIGUEL: WHAT CAN I DO
-> BLENDERBOT:  I don't know what to do. I feel like I'm going to lose my job.


KeyboardInterrupt: ignored