## Lesson 1 - NLP

In this exercise we carry out all the preprocessing needed for text sequences to be consumed by neural networks.

The task we explore is Sentiment Classification, using a dataset of reviews of restaurants (Yelp), products (Amazon), and movies (IMDb) ([link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)).

Our task is to read each review and classify it as "positive" or "negative".

First, let's explore the dataset:

In [1]:
import tensorflow as tf
tf.__version__  # You may need to install TensorFlow 2 before starting this notebook

'2.1.0'

Our dataset has 3 columns:

- sentence: the review text
- label: 1 for positive text and 0 for negative
- source: yelp, amazon or imdb


In [2]:
import pandas as pd
filepath_dict = {'yelp':   'data/sentiment/yelp_labelled.txt',
                 'amazon': 'data/sentiment/amazon_cells_labelled.txt',
                 'imdb':   'data/sentiment/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
df = df.sample(frac=1).reset_index(drop=True)
df.head()


Unnamed: 0,sentence,label,source
0,"After my phone got to be about a year old, it'...",0,amazon
1,"Pros:-Good camera - very nice pictures , also ...",1,amazon
2,The manager was the worst.,0,yelp
3,clipping this to your belt will deffinitely ma...,1,amazon
4,No buyers remorse on this one!.,0,amazon


Next, we split the dataset so that 15% of it is reserved for testing. The `:-1` column slice also drops the `source` column, which is not used from here on.

In [3]:
perc_train = 0.85
len_train = int(len(df)*perc_train)

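# .copy() makes independent frames, so later in-place edits do not raise SettingWithCopyWarning
dataset_train = df.iloc[0:len_train, :-1].copy()
dataset_test = df.iloc[len_train:, :-1].copy()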

print(len(dataset_train))
print(len(dataset_test))

dataset_train.head()

2335
413


Unnamed: 0,sentence,label
0,"After my phone got to be about a year old, it'...",0
1,"Pros:-Good camera - very nice pictures , also ...",1
2,The manager was the worst.,0
3,clipping this to your belt will deffinitely ma...,1
4,No buyers remorse on this one!.,0
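The split above depends on the shuffle done by `df.sample(frac=1)`, which is not seeded, so the rows in each split change from run to run. The following is a minimal sketch of the same 85/15 slicing on a plain list, assuming a fixed seed; the helper name `split_rows` and the seed value are illustrative and not part of the notebook.

```python
import random

def split_rows(rows, perc_train=0.85, seed=42):
    """Shuffle a copy of rows with a fixed seed, then slice it into train/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    len_train = int(len(shuffled) * perc_train)  # truncates, like the notebook
    return shuffled[:len_train], shuffled[len_train:]

# 2748 rows reproduces the split sizes printed above (2335 and 413).
train, test = split_rows(range(2748))
print(len(train), len(test))
```

With pandas, the equivalent would be to pass a `random_state` to `df.sample`.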


Now that the dataset is organized, we need to process the text so that a neural network can read it.

The first step is to build the vocabulary from the training set with the [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class.

This class performs several useful preprocessing routines, including:

- Removing punctuation.
- Limiting the vocabulary size through the `num_words` parameter, which discards uncommon words.
- Normalizing capitalization with `lower=True`.

Before using the class, however, we remove stopwords from the text.

Stopwords are words that serve only a syntactic role, that is, they are irrelevant for classifying the "sentiment" of a sentence (read more about stopwords [here](https://demacdolincoln.github.io/anotacoes-nlp/posts/pre-processamento-de-textos/#id2)).

In [4]:
# We add the stopwords manually here. They can also be downloaded with the nltk module
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]


# Add your code to remove all stopwords from every training example
dataset_train.loc[:,'sentence'] = dataset_train.loc[:,'sentence'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopwords]))
dataset_train.head()

Unnamed: 0,sentence,label
0,"After phone got year old, slowly breaking desp...",0
1,"Pros:-Good camera - nice pictures , also cool ...",1
2,The manager worst.,0
3,clipping belt will deffinitely make feel like ...,1
4,No buyers remorse one!.,0
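In the output above, capitalized stopwords such as "After" and "The" survive, because the stopword list is lowercase and the comparison is case-sensitive. The following is a minimal sketch of a case-insensitive filter, assuming a shortened stopword set and a hypothetical helper `remove_stopwords`; neither is the notebook's own code.

```python
import string

# Shortened, illustrative stopword set; the notebook uses a much longer list.
STOPWORDS = {"the", "was", "this", "on", "after", "my", "to", "be", "about", "a"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Drop tokens whose lowercased, punctuation-stripped form is a stopword."""
    kept = []
    for token in sentence.split():
        normalized = token.lower().strip(string.punctuation)
        if normalized not in stopwords:
            kept.append(token)
    return " ".join(kept)

print(remove_stopwords("The manager was the worst."))       # manager worst.
print(remove_stopwords("No buyers remorse on this one!."))  # No buyers remorse one!.
```

Using a set keeps each membership test constant-time. Note also that the notebook filters only the training sentences; applying the same filter to the test sentences would keep the two splits consistent.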


Now we can build the vocabulary and encode the sentences.

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_vocab_size = 500   # Maximum vocabulary size
oov_token = '<OOV>'   # Token used when a word is not found in the vocabulary

tokenizer = Tokenizer(num_words=max_vocab_size, lower=True, oov_token = oov_token)
tokenizer.fit_on_texts(dataset_train.loc[:, 'sentence'])

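Through the `word_index` attribute we can inspect the generated vocabulary. The first words are the most common ones. Note that `word_index` lists every word seen during fitting; the `num_words` limit is applied only when the texts are encoded.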
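To make the indexing concrete, here is a simplified pure-Python sketch of the idea: words are ranked by frequency, index 1 is taken by the OOV token, and at encoding time any word that is unknown or whose index is not below `num_words` is replaced by the OOV index. It approximates the behavior and is not the Keras implementation; `build_word_index` and `encode` are hypothetical names.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank lowercased words by frequency; the OOV token gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices, sending unknown or too-rare words to the OOV index."""
    oov = word_index[oov_token]
    encoded = []
    for word in text.lower().split():
        index = word_index.get(word)
        encoded.append(index if index is not None and index < num_words else oov)
    return encoded

texts = ["good food", "good service", "bad food good price"]
word_index = build_word_index(texts)
print(word_index)
print(encode("good food terrible price", word_index, num_words=4))  # [2, 3, 1, 1]
```

In the example, "terrible" is unknown and "price" is ranked too low for `num_words=4`, so both map to 1. This is why a small `num_words` produces many occurrences of index 1 in the encoded sequences.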

Next, we encode the training and test datasets.

In [6]:
vocab_size = len(tokenizer.word_index)
tokenizer.word_index

{'<OOV>': 1,
 'i': 2,
 'the': 3,
 'not': 4,
 'it': 5,
 'good': 6,
 'great': 7,
 'this': 8,
 'movie': 9,
 'phone': 10,
 'film': 11,
 '0': 12,
 'one': 13,
 '1': 14,
 'food': 15,
 'just': 16,
 'like': 17,
 'time': 18,
 'place': 19,
 'bad': 20,
 'really': 21,
 'service': 22,
 'well': 23,
 'no': 24,
 'even': 25,
 'ever': 26,
 'best': 27,
 "don't": 28,
 'back': 29,
 'also': 30,
 'will': 31,
 'quality': 32,
 'made': 33,
 "it's": 34,
 'can': 35,
 "i've": 36,
 'go': 37,
 'love': 38,
 'nice': 39,
 'better': 40,
 'work': 41,
 'very': 42,
 'get': 43,
 'much': 44,
 'all': 45,
 'product': 46,
 'use': 47,
 'excellent': 48,
 'recommend': 49,
 'never': 50,
 'headset': 51,
 "i'm": 52,
 'acting': 53,
 'works': 54,
 'first': 55,
 'see': 56,
 'and': 57,
 'think': 58,
 'way': 59,
 "didn't": 60,
 'battery': 61,
 'make': 62,
 'sound': 63,
 'now': 64,
 'if': 65,
 'a': 66,
 'worst': 67,
 'my': 68,
 'but': 69,
 '10': 70,
 'we': 71,
 'there': 72,
 'pretty': 73,
 'everything': 74,
 'say': 75,
 'still': 76,
 'got':

In [7]:
dataset_train_sequences = tokenizer.texts_to_sequences(dataset_train.loc[:,'sentence'])
dataset_test_sequences = tokenizer.texts_to_sequences(dataset_test.loc[:,'sentence'])
print(dataset_train_sequences)

[[209, 10, 77, 246, 192, 1, 1, 1, 44, 300, 210], [1, 6, 146, 39, 447, 30, 211, 1, 17, 193, 301, 405], [3, 1, 67], [1, 1, 31, 1, 62, 194, 17, 1, 1, 155, 57, 270], [24, 1, 1, 13], [2, 1, 1, 6], [448, 1, 1, 302, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [130, 2, 21, 17, 1, 1], [8, 13, 228, 366], [5, 303, 88, 211], [3, 19, 1, 406, 15, 228, 173, 127, 5], [55, 45, 78, 1, 1, 1, 1, 1, 304], [156, 114, 1, 131, 406], [271, 407, 97, 1], [42, 1, 1, 1, 1, 1], [3, 22, 1], [2, 1], [34, 30, 7, 56, 1, 1, 1, 1, 1, 1, 1, 1, 104, 1, 63], [25, 1, 121, 157], [3, 1, 1, 1, 1, 1, 17, 1, 1, 1, 1, 1], [2, 60, 1, 128, 272, 21, 174, 81, 147], [2, 1, 1, 1, 1, 1, 332, 1, 16, 195], [247, 1, 136, 37, 56, 5], [1, 61, 6, 158], [10, 1, 4, 97], [8, 19, 4, 32, 305, 4, 32, 137], [55, 93, 196, 273, 2, 50, 82, 1, 26], [1, 1, 138, 47], [1, 197, 51, 70], [1, 2, 367], [175, 1, 1, 197, 52, 4, 1, 5], [3, 1, 30, 39], [1, 131, 6], [148, 128, 194, 6, 333], [229, 89, 2, 35, 58, 19, 2, 44, 274, 1], [1, 1, 9, 23], [7, 10], [5, 306, 1, 73, 1], [

The last preprocessing step is to pad the sequences.

For this we use the [`pad_sequences`](https://keras.io/preprocessing/sequence/) function.

Its main arguments are:

- `maxlen`: length of the output sequences.
- `padding`: 'pre' to add zeros on the left, 'post' to add zeros on the right.
- `truncating`: 'pre' to drop words from the start of a sentence longer than `maxlen`, 'post' to drop them from the end.
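The effect of these arguments can be illustrated with a small pure-Python sketch that handles one sequence at a time. The helper `pad_sequence` is hypothetical and only mimics the semantics listed above; it is not the Keras function.

```python
def pad_sequence(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one list of token ids to exactly maxlen entries (maxlen > 0)."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + list(seq) if padding == "pre" else list(seq) + pad

print(pad_sequence([3, 1, 67], maxlen=5, padding="post"))            # [3, 1, 67, 0, 0]
print(pad_sequence([3, 1, 67], maxlen=5, padding="pre"))             # [0, 0, 3, 1, 67]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4, truncating="post")) # [1, 2, 3, 4]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4, truncating="pre"))  # [3, 4, 5, 6]
```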

In [8]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 500  # Maximum sentence length, in tokens
padding_type = 'post'
truncating_type = 'post'

dataset_train_sequences = pad_sequences(dataset_train_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)
dataset_test_sequences = pad_sequences(dataset_test_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

print(len(dataset_train_sequences[0]))
print(len(dataset_train_sequences[1]))
print(dataset_train_sequences)

500
500
[[209  10  77 ...   0   0   0]
 [  1   6 146 ...   0   0   0]
 [  3   1  67 ...   0   0   0]
 ...
 [ 57 427 125 ...   0   0   0]
 [  2  17 328 ...   0   0   0]
 [  3   1   1 ...   0   0   0]]


Now that the sentences are in a suitable format, we can train our model.

In [9]:
# Add your architecture here: the input has length maxlen and the output
# distinguishes 2 classes (below, a single sigmoid unit)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

embedding_dim = 16

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))


In [10]:
# Define your optimizer and loss here
optimizer = 'adam'
loss = 'binary_crossentropy'
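For reference, binary cross-entropy for a label `y` in {0, 1} and a predicted probability `p` is `-(y*log(p) + (1-y)*log(1-p))`, averaged over the examples. The sketch below is a plain-Python illustration (the helper `binary_crossentropy` is not the Keras function). It shows that a model that outputs 0.5 for every example scores ln 2, about 0.693, which is a useful baseline when reading a training log.

```python
import math

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probs are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

uninformed = binary_crossentropy([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
confident = binary_crossentropy([1, 0], [0.9, 0.1])
print(uninformed)  # about 0.693 (ln 2)
print(confident)   # about 0.105
```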

In [11]:
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 16)           77056     
_________________________________________________________________
flatten (Flatten)            (None, 8000)              0         
_________________________________________________________________
dense (Dense)                (None, 64)                512064    
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 591,233
Trainable params: 591,233
Non-trainable params: 0
_________________________________________________________________
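The parameter counts in the summary can be checked by hand: an embedding layer has `vocab_size * embedding_dim` weights, and a dense layer has `inputs * units` weights plus `units` biases. The sketch below redoes that arithmetic; the value 4816 for `vocab_size` is inferred from the summary (77056 / 16) and does not appear explicitly in the notebook, and the helper names are illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary index."""
    return vocab_size * embedding_dim

def dense_params(n_inputs, n_units):
    """Weight matrix plus one bias per unit."""
    return n_inputs * n_units + n_units

vocab_size = 4816    # inferred from the summary: 77056 / 16
embedding_dim = 16
maxlen = 500

counts = [
    embedding_params(vocab_size, embedding_dim),  # embedding
    dense_params(maxlen * embedding_dim, 64),     # dense (input is the flattened 500 x 16)
    dense_params(64, 32),                         # dense_1
    dense_params(32, 1),                          # dense_2
]
print(counts, sum(counts))  # [77056, 512064, 2080, 33] 591233
```

Since the tokenizer only emits indices below `max_vocab_size` (500), most embedding rows are never looked up. Passing `max_vocab_size` instead of `vocab_size` as the embedding's first argument would presumably shrink that layer to 500 * 16 = 8000 parameters.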


In [12]:
from tensorflow.keras.utils import to_categorical
num_epochs = 20

train_seqs = dataset_train_sequences
train_labels = np.array(dataset_train.loc[:, 'label'])
test_seqs = dataset_test_sequences
test_labels = np.array(dataset_test.loc[:, 'label'])

print(len(train_seqs))
print(len(train_labels))
print(len(test_seqs))
print(len(test_labels))

model.fit(train_seqs,train_labels, epochs = num_epochs, validation_data=(test_seqs,test_labels), verbose=2 )

2335
2335
413
413
Train on 2335 samples, validate on 413 samples
Epoch 1/20
2335/2335 - 1s - loss: 0.6939 - accuracy: 0.5113 - val_loss: 0.6917 - val_accuracy: 0.5303
Epoch 2/20
2335/2335 - 1s - loss: 0.6712 - accuracy: 0.5829 - val_loss: 0.6181 - val_accuracy: 0.6973
Epoch 3/20
2335/2335 - 1s - loss: 0.4658 - accuracy: 0.8158 - val_loss: 0.5099 - val_accuracy: 0.7288
Epoch 4/20
2335/2335 - 1s - loss: 0.3461 - accuracy: 0.8574 - val_loss: 0.4755 - val_accuracy: 0.7482
Epoch 5/20
2335/2335 - 1s - loss: 0.2850 - accuracy: 0.8852 - val_loss: 0.5268 - val_accuracy: 0.7385
Epoch 6/20
2335/2335 - 1s - loss: 0.2521 - accuracy: 0.9024 - val_loss: 0.5489 - val_accuracy: 0.7409
Epoch 7/20
2335/2335 - 1s - loss: 0.2085 - accuracy: 0.9238 - val_loss: 0.5456 - val_accuracy: 0.7676
Epoch 8/20
2335/2335 - 1s - loss: 0.1707 - accuracy: 0.9413 - val_loss: 0.5637 - val_accuracy: 0.7700
Epoch 9/20
2335/2335 - 1s - loss: 0.1444 - accuracy: 0.9529 - val_loss: 0.6333 - val_accuracy: 0.7748
Epoch 10/20
2335/

<tensorflow.python.keras.callbacks.History at 0x7f8794081f10>
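In the log above, the training loss keeps falling while the validation loss reaches its minimum at epoch 4 and then rises, a typical sign of overfitting. Keras offers the `tf.keras.callbacks.EarlyStopping` callback for this situation. The sketch below is a simplified illustration of patience-based stopping applied to the logged values, not the callback's implementation; `best_epoch` and `stop_epoch` are hypothetical helpers.

```python
# val_loss values copied from the first nine epochs of the log above.
val_loss = [0.6917, 0.6181, 0.5099, 0.4755, 0.5268, 0.5489, 0.5456, 0.5637, 0.6333]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def stop_epoch(losses, patience):
    """Return the 1-based epoch at which patience-based early stopping would halt."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses)

print(best_epoch(val_loss))              # 4
print(stop_epoch(val_loss, patience=3))  # 7
```

Because the notebook passes the test split as `validation_data`, selecting the stopping epoch this way would tune the model on the test set; a separate validation split would avoid that.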

Let's check whether the classifications make sense.

In [13]:
or_test_sentences = ['very good movie', 'terrible tast', 'worst product ever']
# encoding
test_sentences = tokenizer.texts_to_sequences(or_test_sentences)
test_sentences = pad_sequences(test_sentences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

print(test_sentences)


[[ 42   6   9 ...   0   0   0]
 [103   1   0 ...   0   0   0]
 [ 67  46  26 ...   0   0   0]]


In [14]:
predictions = model.predict(test_sentences)
print(or_test_sentences)
print(predictions > 0.5)

['very good movie', 'terrible tast', 'worst product ever']
[[ True]
 [False]
 [False]]


Overall the model looks OK, so why is the accuracy still low?

In [15]:
or_test_sentences = ['not bad']
# encoding
test_sentences = tokenizer.texts_to_sequences(or_test_sentences)
test_sentences = pad_sequences(test_sentences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)
predictions = model.predict(test_sentences)
print(predictions > 0.5)

[[False]]
