## Lesson 1 - NLP

In this exercise we will carry out all the preprocessing needed for text sequences to be interpreted by neural networks.

The task we will explore is sentiment classification, using a dataset of restaurant reviews (Yelp), product reviews (Amazon), and movie reviews (IMDb) [link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences).

Our task is to analyze each review and classify it as either "positive" or "negative".

First, let's explore the dataset:

In [1]:
import tensorflow as tf
tf.__version__  # You may need to install TensorFlow 2 before running this notebook

'2.1.0'

Our dataset has 3 columns:

- sentence: the text of the review
- label: 1 for positive text and 0 for negative
- source: yelp, amazon, or imdb


In [2]:
import pandas as pd
filepath_dict = {'yelp':   'data/sentiment/yelp_labelled.txt',
                 'amazon': 'data/sentiment/amazon_cells_labelled.txt',
                 'imdb':   'data/sentiment/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
df = df.sample(frac=1).reset_index(drop=True)
df.head()


|   | sentence | label | source |
|---|---|---|---|
| 0 | The "1.2" mega pixel camera, being a part of a... | 1 | amazon |
| 1 | This isn't a small family restaurant, this is ... | 1 | yelp |
| 2 | Give it a try, you will be happy you did. | 1 | yelp |
| 3 | I love that they put their food in nice plasti... | 1 | yelp |
| 4 | If you do watch it, however, there are small c... | 1 | imdb |


Next, let's split the dataset so that 15% of it is held out for testing.

In [3]:
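perc_train = 0.85
len_train = int(len(df)*perc_train)

# Drop the last column ('source'); .copy() avoids pandas' SettingWithCopyWarning
# when the 'sentence' column is modified later on.
dataset_train = df.iloc[0:len_train, :-1].copy()
dataset_test = df.iloc[len_train:, :-1].copy()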

print(len(dataset_train))
print(len(dataset_test))

dataset_train.head()

2335
413


|   | sentence | label |
|---|---|---|
| 0 | The "1.2" mega pixel camera, being a part of a... | 1 |
| 1 | This isn't a small family restaurant, this is ... | 1 |
| 2 | Give it a try, you will be happy you did. | 1 |
| 3 | I love that they put their food in nice plasti... | 1 |
| 4 | If you do watch it, however, there are small c... | 1 |
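
The split depends on the shuffle in the previous cell, which is not seeded, so the rows that end up in the test set change on every run. As a rough sketch of the same idea in plain Python (the `shuffle_and_split` helper and the seed value are illustrative, not part of the notebook), a seeded shuffle followed by an 85/15 cut looks like this:

```python
import random

def shuffle_and_split(items, train_fraction=0.85, seed=42):
    """Shuffle a copy of `items` reproducibly and cut it into train/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)          # seeded, so the split is repeatable
    n_train = int(len(shuffled) * train_fraction)  # same rounding rule as the notebook
    return shuffled[:n_train], shuffled[n_train:]

# 2748 is the total row count implied by the notebook output (2335 + 413).
train, test = shuffle_and_split(range(2748))
print(len(train), len(test))
```

The sizes match the 2335 and 413 printed by the notebook. With pandas, passing `random_state` to `df.sample` makes the shuffle repeatable in the same way.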


Now that the dataset is organized, the next step is to process the text so that a neural network can read it.

We start by building the vocabulary from the training set with the [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class.

This class performs several useful preprocessing routines, including:

- Removing punctuation.
- Limiting the vocabulary size through the `num_words` parameter, which discards uncommon words.
- Normalizing capitalization with `lower=True`.
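
To make these three routines concrete, here is a simplified pure-Python sketch of what such a tokenizer does. It is not the Keras implementation: the function names are hypothetical, and the `PUNCTUATION` string is modeled on the Keras default `filters` value as an assumption.

```python
from collections import Counter

# Characters replaced by spaces; the apostrophe is kept so "don't" stays one token.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_STRIP_TABLE = str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION))

def simple_tokenize(text, lower=True):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    if lower:
        text = text.lower()
    return text.translate(_STRIP_TABLE).split()

def build_word_index(texts, oov_token='<OOV>'):
    """Map words to integer ids: OOV gets 1, then words by descending frequency."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def encode(text, word_index, num_words, oov_token='<OOV>'):
    """Replace each word by its id; unknown or too-rare words become OOV."""
    oov_id = word_index[oov_token]
    ids = []
    for word in simple_tokenize(text):
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids

toy_texts = ["Good food, good service!", "The food was bad."]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(encode("Good pizza, bad service", toy_index, num_words=4))
```

In this toy run only ids below `num_words` survive encoding, so both the unseen word "pizza" and the rarer known words collapse to the OOV id.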

However, before using the class, let's remove stopwords from the text.

Stopwords are words that serve only a syntactic purpose, that is, they are irrelevant for classifying the "sentiment" of a sentence (read more about stopwords [here](https://demacdolincoln.github.io/anotacoes-nlp/posts/pre-processamento-de-textos/#id2)).

In [4]:
# We add the stopwords manually here. They can also be downloaded through the nltk package.
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]


# Remove all stopwords from every training example.
# The comparison is case-sensitive and tokens keep their punctuation,
# so words such as "The" or "it," are not removed.
dataset_train.loc[:, 'sentence'] = dataset_train.loc[:, 'sentence'].apply(
    lambda x: ' '.join(item for item in x.split() if item not in stopwords))
dataset_train.head()

|   | sentence | label |
|---|---|---|
| 0 | The "1.2" mega pixel camera, part phone, reaso... | 1 |
| 1 | This isn't small family restaurant, fine dinin... | 1 |
| 2 | Give try, will happy did. | 1 |
| 3 | I love put food nice plastic containers oppose... | 1 |
| 4 | If watch it, however, small consolations: The ... | 1 |
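
In this output, capitalized stopwords such as "The", "This" and "If" survive, as does "it," with its trailing comma, because the filter compares raw whitespace-separated tokens against a lowercase list. A possible variant, sketched below with a hypothetical `remove_stopwords` helper and a shortened stopword set, normalizes each token before the lookup. Whatever rule is chosen, applying it to the test sentences as well keeps both splits consistent.

```python
import string

# Shortened set for illustration; the notebook's full `stopwords` list would be used instead.
STOPWORDS = {"a", "the", "this", "is", "it", "if", "you", "of", "being"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Drop tokens whose lowercase, punctuation-stripped form is a stopword."""
    kept = []
    for token in sentence.split():
        normalized = token.lower().strip(string.punctuation)
        if normalized not in stopwords:
            kept.append(token)  # keep the original token untouched
    return ' '.join(kept)

print(remove_stopwords("If you do watch it, however, this is the one."))
```

Using a `set` instead of a list also makes each membership test constant-time, which matters little here but helps on larger corpora.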


Now we can build the vocabulary and encode the sentences.

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_vocab_size = 500   # Maximum vocabulary size
oov_token = '<OOV>'   # Token used for words that are not found in the vocabulary

tokenizer = Tokenizer(num_words=max_vocab_size, lower=True, oov_token = oov_token)
tokenizer.fit_on_texts(dataset_train.loc[:, 'sentence'])

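The `word_index` attribute lets us inspect the generated vocabulary. Words are ordered from most to least frequent. Note that `word_index` lists every word seen during fitting; the `num_words` limit only takes effect when texts are converted to sequences.

After that, we encode the training and test datasets.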

In [6]:
vocab_size = len(tokenizer.word_index)
tokenizer.word_index

{'<OOV>': 1,
 'i': 2,
 'the': 3,
 'not': 4,
 'it': 5,
 'good': 6,
 'great': 7,
 'this': 8,
 'movie': 9,
 'film': 10,
 'phone': 11,
 '0': 12,
 '1': 13,
 'one': 14,
 'food': 15,
 'like': 16,
 'place': 17,
 'just': 18,
 'time': 19,
 'really': 20,
 'service': 21,
 'bad': 22,
 'well': 23,
 'no': 24,
 'best': 25,
 'even': 26,
 'ever': 27,
 'also': 28,
 'back': 29,
 "don't": 30,
 'will': 31,
 'go': 32,
 'quality': 33,
 'love': 34,
 "i've": 35,
 'made': 36,
 'nice': 37,
 'can': 38,
 'all': 39,
 'recommend': 40,
 "it's": 41,
 'if': 42,
 'work': 43,
 'product': 44,
 'get': 45,
 'very': 46,
 'excellent': 47,
 "i'm": 48,
 'better': 49,
 'never': 50,
 'much': 51,
 'use': 52,
 'way': 53,
 'a': 54,
 'battery': 55,
 'and': 56,
 'think': 57,
 'headset': 58,
 'works': 59,
 'there': 60,
 'first': 61,
 'acting': 62,
 'make': 63,
 'sound': 64,
 'pretty': 65,
 'now': 66,
 'but': 67,
 'you': 68,
 'see': 69,
 'characters': 70,
 "didn't": 71,
 'we': 72,
 'my': 73,
 'amazing': 74,
 'everything': 75,
 'enough': 

In [7]:
dataset_train_sequences = tokenizer.texts_to_sequences(dataset_train.loc[:,'sentence'])
dataset_test_sequences = tokenizer.texts_to_sequences(dataset_test.loc[:,'sentence'])
print(dataset_train_sequences[0:2])

[[3, 13, 86, 1, 1, 124, 208, 11, 1, 1], [8, 467, 173, 191, 160, 192, 1, 1]]
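
In these two encoded sentences, id 1 is the `<OOV>` token, so every 1 marks a word that fell outside the `max_vocab_size` most frequent words. One quick way to judge whether the vocabulary limit is too tight is to measure how often that happens. The sketch below uses a hypothetical `oov_rate` helper on the two sequences printed by the notebook:

```python
def oov_rate(sequences, oov_id=1):
    """Fraction of tokens across all sequences that were mapped to the OOV id."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    oov = sum(1 for seq in sequences for token in seq if token == oov_id)
    return oov / total

# The two encoded training sentences shown in the notebook output.
sample = [[3, 13, 86, 1, 1, 124, 208, 11, 1, 1],
          [8, 467, 173, 191, 160, 192, 1, 1]]
print(f"{oov_rate(sample):.2f}")
```

For this tiny sample about a third of the tokens are OOV. The measurement is meant for unpadded sequences, since padding zeros would inflate the token count.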


The last preprocessing step is to pad the sequences.

For this we will use the [`pad_sequences`](https://keras.io/preprocessing/sequence/) function.

Its main arguments are:

- `maxlen`: length of the sequences to be generated.
- `padding`: 'pre' to add zeros on the left, 'post' to add zeros on the right.
- `truncating`: 'pre' to drop words from the start of a sentence that is longer than `maxlen`, 'post' to drop them from the end.
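
The effect of these arguments can be illustrated with a plain-Python sketch that handles a single sequence. The `pad_sequence` helper below is hypothetical and only mimics the behavior described in the list; its defaults mirror the 'pre' defaults of the Keras function.

```python
def pad_sequence(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Return `seq` cut or zero-filled to exactly `maxlen` items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end.
        seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the zeros on the left, 'post' on the right.
    return fill + seq if padding == 'pre' else seq + fill

print(pad_sequence([3, 13, 86], 5, padding='post'))
print(pad_sequence([3, 13, 86], 5, padding='pre'))
print(pad_sequence([1, 2, 3, 4, 5, 6], 4, truncating='post'))
print(pad_sequence([1, 2, 3, 4, 5, 6], 4, truncating='pre'))
```

With 'post' truncation the beginning of a long review is kept, while 'pre' keeps its ending, which can matter when the verdict tends to appear in one part of the text.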

In [8]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 500  # Maximum sentence length, in tokens
padding_type = 'post'
truncating_type = 'post'

dataset_train_sequences = pad_sequences(dataset_train_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)
dataset_test_sequences = pad_sequences(dataset_test_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

print(len(dataset_train_sequences[0]))
print(len(dataset_train_sequences[1]))
print(dataset_train_sequences)

500
500
[[  3  13  86 ...   0   0   0]
 [  8 467 173 ...   0   0   0]
 [146 209  31 ...   0   0   0]
 ...
 [439   1   1 ...   0   0   0]
 [305   1   1 ...   0   0   0]
 [108  43   0 ...   0   0   0]]


Now that the sentences are in a suitable format, we can train our model.

In [9]:
# Add your architecture here. The input has length maxlen and the output
# is a single sigmoid unit (the probability of the positive class).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

embedding_dim = 16

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))


In [10]:
# Define your optimizer and loss here
optimizer = 'adam'
loss = 'binary_crossentropy'

In [11]:
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 16)           77680     
_________________________________________________________________
flatten (Flatten)            (None, 8000)              0         
_________________________________________________________________
dense (Dense)                (None, 64)                512064    
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 591,857
Trainable params: 591,857
Non-trainable params: 0
_________________________________________________________________
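
The parameter counts in the summary can be reproduced by hand, which makes it easier to see where the model's capacity goes. The sketch below assumes `vocab_size` is 4855, a value inferred from the embedding row (77,680 / 16) rather than printed in the notebook.

```python
vocab_size = 4855      # inferred from the summary: 77680 / 16
embedding_dim = 16
maxlen = 500

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

flattened = maxlen * embedding_dim  # width of the Flatten output (8000)
layers = {
    'embedding': vocab_size * embedding_dim,  # one vector per vocabulary id
    'dense': dense_params(flattened, 64),
    'dense_1': dense_params(64, 32),
    'dense_2': dense_params(32, 1),
}
total = sum(layers.values())
print(layers, total)
```

Most parameters sit in the first Dense layer, whose size grows with `maxlen * embedding_dim`. Also, because the tokenizer was created with `num_words=max_vocab_size`, the encoded sequences only contain ids below 500, so only the first 500 rows of the embedding table are ever looked up.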


In [12]:
num_epochs = 20

train_seqs = dataset_train_sequences
train_labels = np.array(dataset_train.loc[:, 'label'])
test_seqs = dataset_test_sequences
test_labels = np.array(dataset_test.loc[:, 'label'])

print(len(train_seqs))
print(len(train_labels))
print(len(test_seqs))
print(len(test_labels))

model.fit(train_seqs,train_labels, epochs = num_epochs, validation_data=(test_seqs,test_labels), verbose=2 )

2335
2335
413
413
Train on 2335 samples, validate on 413 samples
Epoch 1/20
2335/2335 - 1s - loss: 0.6944 - accuracy: 0.5062 - val_loss: 0.6952 - val_accuracy: 0.4600
Epoch 2/20
2335/2335 - 1s - loss: 0.6805 - accuracy: 0.5884 - val_loss: 0.6450 - val_accuracy: 0.7070
Epoch 3/20
2335/2335 - 1s - loss: 0.4791 - accuracy: 0.7961 - val_loss: 0.4877 - val_accuracy: 0.7506
Epoch 4/20
2335/2335 - 1s - loss: 0.3295 - accuracy: 0.8608 - val_loss: 0.5331 - val_accuracy: 0.7264
Epoch 5/20
2335/2335 - 1s - loss: 0.2887 - accuracy: 0.8831 - val_loss: 0.5095 - val_accuracy: 0.7506
Epoch 6/20
2335/2335 - 1s - loss: 0.2449 - accuracy: 0.9054 - val_loss: 0.5350 - val_accuracy: 0.7530
Epoch 7/20
2335/2335 - 1s - loss: 0.2086 - accuracy: 0.9216 - val_loss: 0.5554 - val_accuracy: 0.7676
Epoch 8/20
2335/2335 - 1s - loss: 0.1767 - accuracy: 0.9405 - val_loss: 0.6219 - val_accuracy: 0.7579
Epoch 9/20
2335/2335 - 1s - loss: 0.1439 - accuracy: 0.9478 - val_loss: 0.6793 - val_accuracy: 0.7627
Epoch 10/20
2335/

<tensorflow.python.keras.callbacks.History at 0x7f0f3c07a050>
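
The log shows the training loss falling at every epoch while the validation loss stops improving early, a typical sign of overfitting. As a small sketch, the epoch with the lowest validation loss can be located from the values printed in the log (only epochs 1 to 9 are used, because the log is cut off after that):

```python
# val_loss values copied from the training log, epochs 1 through 9.
val_loss = [0.6952, 0.6450, 0.4877, 0.5331, 0.5095, 0.5350, 0.5554, 0.6219, 0.6793]

# Index of the smallest value, shifted by one because epochs are 1-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])
```

Among the logged epochs, the best validation loss occurs at epoch 3. In Keras, the `EarlyStopping` callback (`tf.keras.callbacks.EarlyStopping`) can automate this kind of check by monitoring `val_loss` during `fit`.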

Let's check whether the classifications make sense.

In [13]:
or_test_sentences = ['very good movie', 'terrible taste', 'worst product ever']
# Encode the sentences
test_sentences = tokenizer.texts_to_sequences(or_test_sentences)
test_sentences = pad_sequences(test_sentences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

print(test_sentences)


[[ 46   6   9 ...   0   0   0]
 [ 96 276   0 ...   0   0   0]
 [ 77  44  27 ...   0   0   0]]


In [14]:
predictions = model.predict(test_sentences)
print(or_test_sentences)
print(predictions > 0.5)

['very good movie', 'terrible taste', 'worst product ever']
[[ True]
 [False]
 [False]]


Evaluate how the number of embedding dimensions, the padding type, the vocabulary size, the maximum sentence length, etc. affect the quality of the model.

Overall the model seems OK, so why is the accuracy still low?

In [15]:
or_test_sentences = ['not bad']
# Encode the sentences
test_sentences = tokenizer.texts_to_sequences(or_test_sentences)
test_sentences = pad_sequences(test_sentences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)
predictions = model.predict(test_sentences)
print(predictions > 0.5)

[[False]]
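
One possible reading of this result is that the model scores "bad" largely on its own and has seen too few examples to learn that a preceding "not" reverses it. A common workaround, which is not part of this notebook, is to fuse a negation word with the word that follows it before tokenizing, so that "not bad" becomes a single vocabulary entry. A minimal sketch, with a hypothetical `mark_negations` helper and an illustrative negation list:

```python
NEGATIONS = {"not", "no", "never"}  # illustrative, not exhaustive

def mark_negations(sentence, joiner="_"):
    """Fuse each negation word with the token that follows it."""
    tokens = sentence.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + joiner + tokens[i + 1])
            i += 2  # skip the word that was just fused
        else:
            out.append(tokens[i])
            i += 1
    return ' '.join(out)

print(mark_negations("not bad"))
print(mark_negations("The food was not good at all"))
```

If this were combined with the Keras `Tokenizer`, the joiner would have to survive its `filters` argument. As far as I recall, the default filter string includes the underscore, so either a different joiner or a custom `filters` value would be needed.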
