In this lesson we explore some of the tasks we can solve with the concepts learned in the last two lessons.

The first task is **text classification**, similar to the task explored in the first week.

The task consists of analyzing a news article and assigning it to one of a fixed set of topics, for example *technology*, *sports*, etc.

For this, we will use the dataset released by the [BBC](http://mlg.ucd.ie/datasets/bbc.html).



In [1]:
import pandas as pd

data = pd.read_csv("data/text_classification/bbc-text.csv")

num_classes = len(data.category.value_counts())


training_proportion = 0.9

training_index = int(training_proportion * len(data))

train_data = data.iloc[:training_index,:]
test_data = data.iloc[training_index:, :]

train_text = train_data.loc[:, 'text'].tolist()

data.head()

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


In the previous lessons, although we used embeddings in every task, we were learning them from scratch from the training set.

In fact, there is no need to do so, because a wide variety of pre-trained embeddings is already available.

For this exercise we will use [GloVe](https://nlp.stanford.edu/projects/glove/), but there are many other options for pre-trained embeddings, including ones for Portuguese.

Before running the next lines, download the file [glove.6B.50d.txt](https://github.com/uclnlp/inferbeddings/blob/master/data/glove/glove.6B.50d.txt.gz). The link points to a gzip-compressed file, so decompress it before use.

We will read the contents of the file with the pre-trained embeddings so that we can later use them in our Embedding layer. The purpose of the next cell is to determine the vocabulary of our training set and to load the embeddings for every word found in it.

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

max_vocab = 2000

tokenizer = Tokenizer(num_words = max_vocab)
tokenizer.fit_on_texts(train_text)

indexes = min(max_vocab, len(tokenizer.word_index))

num_emb = 0
cons_keys = []
for word in tokenizer.word_index.keys():
    cons_keys.append(word)
    num_emb += 1
    if num_emb == max_vocab:
        break


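embeddings_index = dict()
wanted_words = set(cons_keys)  # set membership is much faster than scanning a list
# Each line of the GloVe file holds a word followed by its coefficients.
with open('data/text_classification/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        if word in wanted_words:
            coefs = np.array(values[1:], dtype='float32')
            embeddings_index[word] = coefs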
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_matrix = np.zeros((max_vocab, 50))
for word, i in tokenizer.word_index.items():
    if i>=max_vocab:
        break
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
print(embedding_matrix[10:13])


Loaded 1995 word vectors.
[[ 0.88387001 -0.14199001  0.13565999  0.098682    0.51217997  0.49138001
  -0.47154999 -0.30741999  0.01963     0.12685999  0.073524    0.35835999
  -0.60873997 -0.18675999  0.78934997  0.54534     0.1106     -0.29229999
   0.059041   -0.69551003 -0.18804     0.19454999  0.32269001 -0.49981001
   0.30599999 -2.3901999  -0.60749     0.37107     0.078912   -0.23896
   3.83899999 -0.20355    -0.35613    -0.69185001 -0.17497    -0.35323
   0.10598    -0.039303    0.015701    0.038279   -0.35282999  0.44881999
  -0.16534001  0.31579     0.14963    -0.071277   -0.53505999  0.52710998
  -0.20148     0.0095952 ]
 [ 0.61183    -0.22071999 -0.10898    -0.052967    0.50804001  0.34683999
  -0.33557999 -0.19152001 -0.035865    0.1051      0.07935     0.2449
  -0.4373     -0.33344001  0.57479     0.69051999  0.29712999  0.090669
  -0.54992002 -0.46176001  0.10113    -0.02024     0.28479001  0.043512
   0.45734999 -2.0466001  -0.58083999  0.61796999  0.65179998 -0.58262998
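
The loading logic above can be hard to follow on a large file, so here is a minimal, framework-free sketch of the same idea on toy data. The three "GloVe" lines, their values, the 3-dimensional vectors, and the `word_index` mapping are all made up for illustration; only the file layout (a word followed by its coefficients) comes from the cell above.

```python
# Toy stand-in for a few lines of a GloVe text file (made-up values,
# 3 dimensions instead of 50, purely for illustration).
glove_lines = [
    "the 0.1 0.2 0.3",
    "said -0.5 0.0 0.5",
    "zebra 0.9 0.9 0.9",
]

# Hypothetical word -> index mapping in the style of tokenizer.word_index
# (indices start at 1; index 0 is reserved for padding).
word_index = {"the": 1, "said": 2, "unknownword": 3}
max_vocab = 4
dim = 3

wanted = set(word_index)

# Keep only the vectors of words that occur in our vocabulary.
embeddings_index = {}
for line in glove_lines:
    values = line.split()
    word = values[0]
    if word in wanted:
        embeddings_index[word] = [float(v) for v in values[1:]]

# Rows start as zeros: row 0 (padding) and words missing from GloVe stay zero.
embedding_matrix = [[0.0] * dim for _ in range(max_vocab)]
for word, i in word_index.items():
    if i < max_vocab and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

print(embedding_matrix)
```

Words that GloVe does not know (here `unknownword`) keep an all-zero row, which is also what happens to the few vocabulary words not counted in the `Loaded ... word vectors` message.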

Now carry out the remaining preprocessing, including padding.

In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

sentence_size = 500

dataset_train_sequences = tokenizer.texts_to_sequences(train_data.loc[:,'text'])
dataset_test_sequences = tokenizer.texts_to_sequences(test_data.loc[:,'text'])

padded_train = pad_sequences(dataset_train_sequences, maxlen=sentence_size, padding='post', truncating='post')
padded_test = pad_sequences(dataset_test_sequences, maxlen=sentence_size, padding='post', truncating='post')
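
To make the effect of `padding='post'` and `truncating='post'` concrete, the sketch below reproduces that behaviour in plain Python. The helper name `pad_post` is hypothetical and exists only for this illustration; it is not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Pure-Python illustration of post-padding and post-truncation."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                   # truncating='post': keep the first maxlen items
        seq = seq + [value] * (maxlen - len(seq))  # padding='post': append zeros at the end
        padded.append(seq)
    return padded

example = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(example)  # [[5, 8, 2, 0], [7, 1, 9, 4]]
```

Every row ends up with exactly `maxlen` entries, which is what lets the sequences be stacked into a single array for the model.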




In [4]:
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM
from tensorflow.keras import Sequential


model = Sequential()
# 50 is the number of embedding dimensions (fixed by the file we downloaded)
# trainable=False means these weights are not updated during training.
model.add(Embedding(max_vocab, 50, weights=[embedding_matrix], input_length=sentence_size, trainable=False))
model.add(Bidirectional(LSTM(64)))
# Integer division: the number of units must be an int (max_vocab/2 yields a float).
model.add(Dense(max_vocab // 2, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

optimizer = 'adam'
loss = 'categorical_crossentropy'

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 50)           100000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               58880     
_________________________________________________________________
dense (Dense)                (None, 1000)              129000    
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 5005      
Total params: 292,885
Trainable params: 192,885
Non-trainable params: 100,000
_________________________________________________________________
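
As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below uses the standard counting formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the variable names are chosen here for illustration only.

```python
vocab, emb_dim = 2000, 50   # max_vocab and the GloVe dimension
lstm_units = 64
dense_units = 1000          # half of max_vocab
num_classes = 5

embedding = vocab * emb_dim
# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_one_direction = 4 * (lstm_units * (emb_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction
# The bidirectional layer concatenates both directions: 2 * 64 = 128 inputs.
dense = (2 * lstm_units) * dense_units + dense_units
output = dense_units * num_classes + num_classes

total = embedding + bidirectional + dense + output
trainable = total - embedding   # the embedding layer is frozen

print(embedding, bidirectional, dense, output)  # 100000 58880 129000 5005
print(total, trainable)                         # 292885 192885
```

The non-trainable count matches the embedding layer exactly, which confirms that `trainable=False` froze the GloVe weights and nothing else.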


In [11]:
from tensorflow.keras.utils import to_categorical
import tensorflow.keras.utils as ku
from sklearn.preprocessing import LabelEncoder

# Fit the encoder once on the training labels and reuse it for the test labels,
# so both splits (and predict_topic below) share the same class-to-index mapping.
labelencoder = LabelEncoder()
label_train = labelencoder.fit_transform(train_data.loc[:, 'category'])
label_test = labelencoder.transform(test_data.loc[:, 'category'])

label_train = ku.to_categorical(label_train, num_classes = num_classes)
label_test = ku.to_categorical(label_test, num_classes = num_classes)


[[0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 ...
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]
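
To see why the label mapping should be fitted once and then reused, here is a small plain-Python sketch of label encoding followed by one-hot encoding. The helpers `fit_label_encoder` and `one_hot` are hypothetical stand-ins written for this example; they imitate, but are not, `LabelEncoder` and `to_categorical`. The toy labels are also made up.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, using sorted order."""
    classes = sorted(set(labels))
    return classes, {label: i for i, label in enumerate(classes)}

def one_hot(indices, num_classes):
    """Turn integer class ids into one-hot rows."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]

train_labels = ["tech", "business", "sport", "sport", "entertainment", "politics"]
test_labels = ["sport", "tech"]  # a toy test split that lacks some classes

classes, mapping = fit_label_encoder(train_labels)   # fit on the training labels only
train_ids = [mapping[label] for label in train_labels]
test_ids = [mapping[label] for label in test_labels] # reuse the same mapping

print(classes)
print(one_hot(test_ids, len(classes)))
```

Had the mapping been fitted separately on `test_labels`, `sport` and `tech` would have received the ids 0 and 1 instead of 3 and 4, and the test labels would no longer line up with the model's outputs.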


In [12]:
num_epochs = 10
model.fit(padded_train,label_train, epochs = num_epochs, validation_data=(padded_test,label_test), verbose=2 )

Train on 2002 samples, validate on 223 samples
Epoch 1/10
2002/2002 - 17s - loss: 0.7438 - accuracy: 0.7398 - val_loss: 0.3838 - val_accuracy: 0.8610
Epoch 2/10
2002/2002 - 2s - loss: 0.2770 - accuracy: 0.9116 - val_loss: 0.2811 - val_accuracy: 0.9193
Epoch 3/10
2002/2002 - 2s - loss: 0.2461 - accuracy: 0.9221 - val_loss: 0.3172 - val_accuracy: 0.9013
Epoch 4/10
2002/2002 - 2s - loss: 0.1748 - accuracy: 0.9461 - val_loss: 0.2820 - val_accuracy: 0.9103
Epoch 5/10
2002/2002 - 2s - loss: 0.1547 - accuracy: 0.9476 - val_loss: 0.2689 - val_accuracy: 0.9193
Epoch 6/10
2002/2002 - 3s - loss: 0.1071 - accuracy: 0.9620 - val_loss: 0.3902 - val_accuracy: 0.9058
Epoch 7/10
2002/2002 - 3s - loss: 0.1146 - accuracy: 0.9565 - val_loss: 0.3175 - val_accuracy: 0.9193
Epoch 8/10
2002/2002 - 3s - loss: 0.0892 - accuracy: 0.9715 - val_loss: 0.3330 - val_accuracy: 0.9327
Epoch 9/10
2002/2002 - 3s - loss: 0.0704 - accuracy: 0.9760 - val_loss: 0.2688 - val_accuracy: 0.9372
Epoch 10/10
2002/2002 - 3s - loss:

<tensorflow.python.keras.callbacks.History at 0x7f7bd01b3c90>

In [34]:
def predict_topic(text):
    seq = tokenizer.texts_to_sequences(text)
    proc_text = pad_sequences(seq, maxlen=sentence_size, padding='post', truncating='post')
    topic = np.argmax(model.predict(proc_text), axis=1)
    topic = labelencoder.inverse_transform(topic)
    return topic

predict_topic(['A lot of data centers',"A lot of stock prices and financial data", 'A lot of goals scored'])
    

array(['tech', 'business', 'sport'], dtype=object)

Now that we are experts at classifying texts, let us consider a different but equally important task: **translation**.

Since translating between two languages requires large models and a large training corpus, we will consider a simpler "translation" task.

The task is to translate a date written in free text into a format that is easier for a computer to handle. Some example translations are shown below:

`9 may 1998 -> 1998-05-09
10.11.19 -> 2019-11-10
9/10/70 -> 1970-09-10
saturday april 28 1990 -> 1990-04-28
thursday january 26 1995 -> 1995-01-26
monday march 7 1983 -> 1983-03-07`

To solve this task, we model the problem a little differently. To build our vocabulary, we assign a code to each character (not to each word), producing one source vocabulary and one target vocabulary.

Your model will read an input sequence of `n` characters, process it, and output a sequence of `10` characters.
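
The character-level vocabularies described above can be sketched without any framework. The functions `fit_char_vocab` and `encode` below are hypothetical helpers for illustration; the frequency-based ordering mimics the general idea of a tokenizer, but the tie-breaking rule is an arbitrary choice made for this sketch.

```python
from collections import Counter

def fit_char_vocab(texts):
    """Assign an integer code to each character; 0 is kept free for padding."""
    counts = Counter(ch for text in texts for ch in text)
    # More frequent characters get smaller codes; ties are broken
    # alphabetically (an arbitrary choice for this sketch).
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode(text, vocab):
    """Replace each known character by its code."""
    return [vocab[ch] for ch in text if ch in vocab]

sources = ["9 may 1998", "10.11.19", "9/10/70"]
targets = ["1998-05-09", "2019-11-10", "1970-09-10"]

src_vocab = fit_char_vocab(sources)   # source-side vocabulary
dst_vocab = fit_char_vocab(targets)   # target-side vocabulary

print(len(src_vocab), len(dst_vocab))
print(encode("1998-05-09", dst_vocab))
```

Because the target is always a date of the form `YYYY-MM-DD`, every encoded target has exactly 10 codes, which is why the output length can be fixed in advance.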



In [76]:
data = pd.read_csv('data/translation/dates_dataset.csv', header=None)
train_index = int(0.9 * len(data))
train_data = data[:train_index]
test_data = data[train_index:]

tok_origem = Tokenizer(char_level=True)
tok_destino = Tokenizer(char_level = True)

tok_origem.fit_on_texts(train_data.loc[:,0])
tok_destino.fit_on_texts(train_data.loc[:,1])

seq_train = tok_origem.texts_to_sequences(train_data.loc[:,0])
seq_test = tok_origem.texts_to_sequences(test_data.loc[:,0])

label_train = np.array(tok_destino.texts_to_sequences(train_data.loc[:,1]))
label_test = np.array(tok_destino.texts_to_sequences(test_data.loc[:,1]))

label_train = ku.to_categorical(label_train, num_classes = len(tok_destino.word_index)+1)
label_test = ku.to_categorical(label_test, num_classes = len(tok_destino.word_index)+1)



(9000, 10, 12)
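
The shape `(9000, 10, 12)` reads as: 9000 training examples, 10 output positions, and 12 possible codes per position. The sketch below rebuilds the last two dimensions for a single date. It assumes the target alphabet is the ten digits plus `-`, with code 0 reserved for padding; the actual codes assigned by the tokenizer depend on character frequency, so the ordering here is only illustrative.

```python
# Assumed target alphabet: ten digits plus '-'; code 0 is reserved for padding.
alphabet = "0123456789-"
char_to_code = {ch: i + 1 for i, ch in enumerate(alphabet)}
num_codes = len(char_to_code) + 1   # 11 characters + 1 padding code = 12

def one_hot_date(date):
    """One row per character, with a single 1.0 at that character's code."""
    rows = []
    for ch in date:
        row = [0.0] * num_codes
        row[char_to_code[ch]] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_date("1998-05-09")
print(len(encoded), len(encoded[0]))  # 10 12
```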

In [71]:
max_size_orig = 30
max_size_dest = 10

padded_train = pad_sequences(seq_train, maxlen=max_size_orig, padding='post', truncating='post')
padded_test = pad_sequences(seq_test, maxlen=max_size_orig, padding='post', truncating='post')



In [96]:
from tensorflow.keras.layers import Reshape, Lambda, Dropout
import tensorflow.keras.backend as K

model = Sequential()
model.add(Embedding(len(tok_origem.word_index)+1, 50,  input_length=max_size_orig))
model.add(LSTM(64))
model.add(Dense(128, activation='relu'))
model.add(Dense(max_size_dest*(len(tok_destino.word_index)+1), activation='sigmoid'))
model.add(Reshape((max_size_dest,len(tok_destino.word_index)+1)))
model.add(Dense(len(tok_destino.word_index)+1, activation='softmax'))


optimizer = 'adam'
loss = 'categorical_crossentropy'

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_20 (Embedding)     (None, 30, 50)            1800      
_________________________________________________________________
lstm_21 (LSTM)               (None, 64)                29440     
_________________________________________________________________
dense_50 (Dense)             (None, 128)               8320      
_________________________________________________________________
dense_51 (Dense)             (None, 120)               15480     
_________________________________________________________________
reshape_14 (Reshape)         (None, 10, 12)            0         
_________________________________________________________________
dense_52 (Dense)             (None, 10, 12)            156       
Total params: 55,196
Trainable params: 55,196
Non-trainable params: 0
_________________________________________________

In [97]:
num_epochs = 50

model.fit(padded_train,label_train, epochs = num_epochs, validation_data=(padded_test,label_test), verbose=2 )

Train on 9000 samples, validate on 1000 samples
Epoch 1/50
9000/9000 - 2s - loss: 1.5737 - accuracy: 0.4613 - val_loss: 1.3637 - val_accuracy: 0.4933
Epoch 2/50
9000/9000 - 2s - loss: 1.3011 - accuracy: 0.4853 - val_loss: 1.2577 - val_accuracy: 0.4889
Epoch 3/50
9000/9000 - 2s - loss: 1.2389 - accuracy: 0.4852 - val_loss: 1.2222 - val_accuracy: 0.4910
Epoch 4/50
9000/9000 - 2s - loss: 1.2129 - accuracy: 0.4864 - val_loss: 1.2043 - val_accuracy: 0.4920
Epoch 5/50
9000/9000 - 2s - loss: 1.1991 - accuracy: 0.4868 - val_loss: 1.1933 - val_accuracy: 0.4933
Epoch 6/50
9000/9000 - 2s - loss: 1.1911 - accuracy: 0.4863 - val_loss: 1.1867 - val_accuracy: 0.4938
Epoch 7/50
9000/9000 - 2s - loss: 1.1855 - accuracy: 0.4865 - val_loss: 1.1831 - val_accuracy: 0.4916
Epoch 8/50
9000/9000 - 1s - loss: 1.1814 - accuracy: 0.4857 - val_loss: 1.1803 - val_accuracy: 0.4920
Epoch 9/50
9000/9000 - 2s - loss: 1.1786 - accuracy: 0.4864 - val_loss: 1.1773 - val_accuracy: 0.4898
Epoch 10/50
9000/9000 - 2s - loss:

<tensorflow.python.keras.callbacks.History at 0x7f7a3bd2c1d0>

In [120]:
def predict_dates(dates):
    seqs = tok_origem.texts_to_sequences(dates)
    padded = pad_sequences(seqs, maxlen=max_size_orig, padding='post', truncating='post')
    converted = np.argmax(model.predict(padded), axis=2)
    converted = tok_destino.sequences_to_texts(converted)
    return converted
                           
predict_dates(['august 26 1975', 'friday july 12 1991', '10 sep 1975'])


['1 9 7 5 - 0 8 - 1 6', '1 9 9 1 - 0 7 - 1 2', '1 9 7 5 - 0 9 - 1 0']
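
The predictions come back with a space between characters, and the first one is off by one digit (day 16 instead of 26). A small sketch like the following, with hypothetical helpers `clean` and `char_accuracy`, removes the spaces and contrasts per-character accuracy with exact-match accuracy on these three examples. The expected strings are simply the ISO forms of the three inputs.

```python
def clean(prediction):
    """Remove the spaces placed between the decoded characters."""
    return "".join(prediction.split())

def char_accuracy(pred, truth):
    """Fraction of positions where the two strings agree."""
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / len(truth)

# Predictions copied from the output above.
predicted = ['1 9 7 5 - 0 8 - 1 6', '1 9 9 1 - 0 7 - 1 2', '1 9 7 5 - 0 9 - 1 0']
expected = ['1975-08-26', '1991-07-12', '1975-09-10']

cleaned = [clean(p) for p in predicted]
exact = sum(c == e for c, e in zip(cleaned, expected)) / len(expected)
per_char = sum(char_accuracy(c, e) for c, e in zip(cleaned, expected)) / len(expected)

print(cleaned)
print(exact, per_char)
```

On this tiny sample, a single wrong digit barely moves the per-character score but costs a full third of the exact-match score, so the two metrics can tell quite different stories.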

For the last problem, we consider something a little more exotic, similar to the word-generation problem studied in the second lesson.


For this task we will generate **dinosaur names**.

Dinosaur names share a certain resemblance, for example:

`
Abdallahsaurus
Abelisaurus
Abrictosaurus
`

So, let us train a model that completes the name of a dinosaur!

To do this, we will once again generate one token per character. The model should be trained to receive a partial name and complete it, for example:

`x = 'Abda', y = 'Abdallahsaurus'
x = 'Abelisa', y = 'Abelisaurus'
`

The model can be trained with a training set generated in the same way as in the sonnet-generation problem, but with a "\n" token appended to the output to mark the end of the dinosaur name:


`x = 'A', y = 'Aardonyx\n'
x = 'Aa', y = 'Aardonyx\n'
x = 'Aar', y = 'Aardonyx\n'
...
x = 'Aardony', y = 'Aardonyx\n'
`



In [138]:
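row_list = []
largest_name = 0
with open("data/text_generation/dinos.txt", "r") as a_file:
    for line in a_file:
        # Each line keeps its trailing "\n", which serves as the end-of-name marker.
        largest_name = max(largest_name, len(line))
        # Pair every non-empty prefix (up to the full name, without the marker) with the full line.
        for i in range(len(line) - 1):
            row_list.append([line[:i+1], line])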
data = pd.DataFrame(row_list, columns = ['x', 'y'])
data.head(12)
        

| | x | y |
|---|---|---|
| 0 | A | Aachenosaurus\n |
| 1 | Aa | Aachenosaurus\n |
| 2 | Aac | Aachenosaurus\n |
| 3 | Aach | Aachenosaurus\n |
| 4 | Aache | Aachenosaurus\n |
| 5 | Aachen | Aachenosaurus\n |
| 6 | Aacheno | Aachenosaurus\n |
| 7 | Aachenos | Aachenosaurus\n |
| 8 | Aachenosa | Aachenosaurus\n |
| 9 | Aachenosau | Aachenosaurus\n |


In [143]:
data = data.sample(frac=1).reset_index(drop=True)
training_index = int(0.9*len(data))
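# Note: .loc slices by label and includes the end label, so row training_index
# ends up in both splits; .iloc[:training_index] would give disjoint splits.
train_data = data.loc[:training_index,:]
test_data = data.loc[training_index:, :]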

tok = Tokenizer(char_level=True)
tok.fit_on_texts(train_data.loc[:,'y'])
words_num = len(tok.word_index)

train_seq = tok.texts_to_sequences(train_data.loc[:,'x'])
test_seq = tok.texts_to_sequences(test_data.loc[:,'x'])

train_label_seq = tok.texts_to_sequences(train_data.loc[:,'y'])
test_label_seq = tok.texts_to_sequences(test_data.loc[:,'y'])

padded_train_seq = pad_sequences(train_seq, maxlen=largest_name, padding='post', truncating='post')
padded_test_seq = pad_sequences(test_seq, maxlen=largest_name, padding='post', truncating='post')
padded_label_train = pad_sequences(train_label_seq, maxlen=largest_name, padding='post', truncating='post')
padded_label_test = pad_sequences(test_label_seq, maxlen=largest_name, padding='post', truncating='post')

train_labels = ku.to_categorical(padded_label_train, num_classes =words_num+1)
test_labels = ku.to_categorical(padded_label_test, num_classes =words_num+1)
train_labels.shape

(16536, 27, 28)

In [159]:
model = Sequential()
model.add(Embedding(words_num+1, 50,  input_length=largest_name))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(32, activation='relu'))
model.add(Dense(largest_name*(words_num+1), activation='sigmoid'))
model.add(Reshape((largest_name,words_num+1)))
model.add(Dense(words_num+1, activation='softmax'))


optimizer = 'adam'
loss = 'binary_crossentropy'

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()


Model: "sequential_23"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_23 (Embedding)     (None, 27, 50)            1400      
_________________________________________________________________
bidirectional_19 (Bidirectio (None, 128)               58880     
_________________________________________________________________
dense_59 (Dense)             (None, 32)                4128      
_________________________________________________________________
dense_60 (Dense)             (None, 756)               24948     
_________________________________________________________________
reshape_17 (Reshape)         (None, 27, 28)            0         
_________________________________________________________________
dense_61 (Dense)             (None, 27, 28)            812       
Total params: 90,168
Trainable params: 90,168
Non-trainable params: 0
_________________________________________________

In [160]:
num_epochs = 5

model.fit(padded_train_seq,train_labels, epochs = num_epochs, validation_data=(padded_test_seq,test_labels), verbose=2 )

Train on 16536 samples, validate on 1838 samples
Epoch 1/5
16536/16536 - 5s - loss: 0.0836 - accuracy: 0.9772 - val_loss: 0.0720 - val_accuracy: 0.9798
Epoch 2/5
16536/16536 - 4s - loss: 0.0721 - accuracy: 0.9796 - val_loss: 0.0713 - val_accuracy: 0.9798
Epoch 3/5
16536/16536 - 4s - loss: 0.0718 - accuracy: 0.9796 - val_loss: 0.0712 - val_accuracy: 0.9798
Epoch 4/5
16536/16536 - 4s - loss: 0.0714 - accuracy: 0.9797 - val_loss: 0.0702 - val_accuracy: 0.9798
Epoch 5/5
16536/16536 - 4s - loss: 0.0696 - accuracy: 0.9799 - val_loss: 0.0683 - val_accuracy: 0.9802


<tensorflow.python.keras.callbacks.History at 0x7f79c019c310>
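
The accuracy of about 0.98 from the very first epoch deserves some caution. Assuming Keras resolves `'accuracy'` to an element-wise binary accuracy when the loss is `binary_crossentropy` (our understanding of tf.keras 2.x, not something verified in this notebook), each of the 28 entries per position is scored separately, and most of them are zeros that are easy to get right. The sketch below, with a hypothetical `binary_accuracy` helper, shows how high that score is even for useless predictions.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual 0/1 entries that match after thresholding."""
    total = 0
    matches = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            matches += int((p > threshold) == (t == 1.0))
            total += 1
    return matches / total

num_codes = 28  # words_num + 1 in the model above

# One output position whose true character has code 3.
y_true = [[1.0 if j == 3 else 0.0 for j in range(num_codes)]]
# A confidently wrong prediction: all probability mass on code 7.
y_wrong = [[1.0 if j == 7 else 0.0 for j in range(num_codes)]]
# A prediction that never commits: uniform probabilities, all below 0.5.
y_flat = [[1.0 / num_codes] * num_codes]

print(binary_accuracy(y_true, y_wrong))
print(binary_accuracy(y_true, y_flat))
```

Under this assumption a completely wrong character still scores 26/28 and an uncommitted prediction scores 27/28, so the reported number says little about how many characters are actually correct.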

In [193]:
def predict_dino(prefixos):
    seqs = tok.texts_to_sequences(prefixos)
    padded = pad_sequences(seqs, maxlen=largest_name, padding='post', truncating='post')
    predictions = model.predict(padded)
    predictions = np.argmax(predictions,axis=2)
    converted = tok.sequences_to_texts(predictions)
    # Split on single spaces only: a bare split() would also swallow the "\n" end marker.
    converted = ["".join(x.split(' ')) for x in converted]
    final = [x.split('\n')[0] for x in converted]
    final = [prefixos[i]+final[i][len(prefixos[i]):] for i in range(len(final))]
    return final
             
predict_dino(['test', 'brasil', 'fiep'])                        

['testosaaauuu', 'brasiloaauuu', 'fiepoaaaauuu']
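
The post-processing inside `predict_dino` does three things: it joins the decoded characters, cuts the name at the end marker, and keeps the user's prefix in place of whatever the model wrote in those positions. The sketch below isolates that logic in a hypothetical helper, `finish_name`, applied to a made-up decoded string. It splits on a single space rather than on any whitespace, so that the `"\n"` marker survives the join.

```python
def finish_name(prefix, decoded):
    """Combine a user prefix with a space-separated decoded output string."""
    joined = "".join(decoded.split(" "))  # drop the separators, keep "\n"
    name = joined.split("\n")[0]          # keep only what precedes the end marker
    return prefix + name[len(prefix):]    # trust the prefix, take the rest from the model

# Made-up decoded output in the space-separated style of sequences_to_texts,
# with stray characters after the end marker.
decoded = "t a s t o s a u r u s \n a a"
print(finish_name("test", decoded))  # testosaurus
```

The model's guess for the prefix positions (`tast`) is discarded in favour of the user's `test`, and everything after the marker is ignored.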