<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural Language Processing
## LSTM Bot QA

### Datos
El objecto es utilizar datos disponibles del challenge ConvAI2 (Conversational Intelligence Challenge 2) de conversaciones en inglés. Se construirá un BOT para responder a preguntas del usuario (QA).\
[LINK](http://convai.io/data/)

In [44]:
!pip install --upgrade --no-cache-dir gdown --quiet

In [45]:
import re

import numpy as np
import pandas as pd

import tensorflow as tf
from keras.preprocessing.text import one_hot
from tensorflow.keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dropout, Dense
#from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten, LSTM, SimpleRNN
from keras.models import Model
from tensorflow.keras.layers import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.layers import Input

In [46]:
# Descargar la carpeta de dataset
import os
import gdown
if os.access('data_volunteers.json', os.F_OK) is False:
    url = 'https://drive.google.com/uc?id=1awUxYwImF84MIT5-jCaYAPe2QwSgS1hN&export=download'
    output = 'data_volunteers.json'
    gdown.download(url, output, quiet=False)
else:
    print("El dataset ya se encuentra descargado")

El dataset ya se encuentra descargado


In [47]:
# dataset_file
import json

text_file = "data_volunteers.json"
with open(text_file) as f:
    data = json.load(f) # the data variable will be a dictionary


In [48]:
# Observing the disponibles fields in every line of the dataset
data[0].keys()

dict_keys(['dialog', 'start_time', 'end_time', 'bot_profile', 'user_profile', 'eval_score', 'profile_match', 'participant1_id', 'participant2_id'])

In [49]:
chat_in = []
chat_out = []

input_sentences = []
output_sentences = []
output_sentences_inputs = []
max_len = 30

def clean_text(txt):
    txt = txt.lower()    
    txt.replace("\'d", " had")
    txt.replace("\'s", " is")
    txt.replace("\'m", " am")
    txt.replace("don't", "do not")
    txt = re.sub(r'\W+', ' ', txt)
    
    return txt

for line in data:
    for i in range(len(line['dialog'])-1):
        # vamos separando el texto en "preguntas" (chat_in)
        # y "respuestas" (chat_out)
        chat_in = clean_text(line['dialog'][i]['text'])
        chat_out = clean_text(line['dialog'][i+1]['text'])

        if len(chat_in) >= max_len or len(chat_out) >= max_len:
            continue

        input_sentence, output = chat_in, chat_out
        
        # output sentence (decoder_output) tiene <eos>
        output_sentence = output + ' <eos>'
        # output sentence input (decoder_input) tiene <sos>
        output_sentence_input = '<sos> ' + output

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

print("Number of rows used:", len(input_sentences))

Number of rows used: 6033


In [50]:
input_sentences[1], output_sentences[1], output_sentences_inputs[1]

('hi how are you ', 'not bad and you  <eos>', '<sos> not bad and you ')

### 2 - Preprocessing
Realizar el preprocesamiento necesario para obtener:
- word2idx_inputs, max_input_len
- word2idx_outputs, max_out_len, num_words_output
- encoder_input_sequences, decoder_output_sequences, decoder_targets

In [51]:
# Define the maximun number of words
MAX_VOCABULARY_SIZE = 8000

In [52]:
from keras.preprocessing.text import Tokenizer


# Create tokenizer for the input text and fit it to them
tokenizer_inputs= Tokenizer(num_words=MAX_VOCABULARY_SIZE)
tokenizer_inputs.fit_on_texts(input_sentences)

# Tokenize and transform input texts to sequence of integers
input_integer_seq = tokenizer_inputs.texts_to_sequences(input_sentences)

In [53]:
word2indx_inputs = tokenizer_inputs.word_index
print('Words in the vocabulary', len(word2indx_inputs))

# Calculate the max length
max_input_len = max(len(sentence) for sentence in input_integer_seq )
print('The longest sentence', max_input_len)

Words in the vocabulary 1799
The longest sentence 9


In [54]:
# Check the tokenization
print(input_integer_seq[200])

[22]


In [55]:
# Create tokenizer for the outpu text and fit it to them
output_tokenizer = Tokenizer(num_words=MAX_VOCABULARY_SIZE, filters='')
#output_tokenizer = Tokenizer(num_words=MAX_VOCABULARY_SIZE, filters='!"#$%&()*+,-./:;=¿?@[\\]^_`{|}~\t\n')

output_tokenizer.fit_on_texts(output_sentence)

In [56]:
# Get the word to index mapping for output answer
word2indx_outputs = output_tokenizer.word_index
print('Found %s unique output tokens.' %len(word2indx_outputs))

Found 5 unique output tokens.


In [57]:
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentence)

In [58]:
# Calculate the max length for the ouput
max_output_len = max(len(output_sentence) for output_sentence in output_integer_seq)
print('The longest sentence in the output', max_input_len)

The longest sentence in the output 9


In [59]:
# One is added to include the toke of unknown word
number_word_output = min(len(word2indx_outputs) + 1, MAX_VOCABULARY_SIZE) 

### 3 - Preparing the embeddings
Utilizar los embeddings de Glove o FastText para transformar los tokens de entrada en vectores

### 4 - Training the model
Entrenar un modelo basado en el esquema encoder-decoder utilizando los datos generados en los puntos anteriores. Utilce como referencias los ejemplos vistos en clase.

### 5 - Inference
Experimentar el funcionamiento de su modelo. Recuerde que debe realizar la inferencia de los modelos por separado de encoder y decoder.