# Building a sentiment analyzer with LSTM

This notebook is an extension activity that is not part of the course book. In this activity you will learn how to develop a sentiment analyzer that uses an LSTM recurrent network to process a text and classify ...

```
This notebook is based on the material made available by Garrett Hoffman at: https://github.com/GarrettHoffman/lstm-oreilly
```

## 1 - Packages

Run the cell below to import the required packages.

- [tensorflow](https://www.tensorflow.org/): a framework for machine learning.
- [numpy](http://www.numpy.org): a package of libraries for scientific computing.
- [matplotlib](http://matplotlib.org): a library for plotting.
- [pandas](https://pandas.pydata.org/): a library for data analysis.


In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from collections import Counter

# local module with helper functions
import utils as utl



## 2 - Dataset


The dataset consists of approximately 100 thousand StockTwits messages posted in 2017 and tagged with the $SPY ticker. Each message is labeled with the sentiment it expresses:

- bullish (optimistic)
- bearish (pessimistic)


In [2]:
# read the data from the gzip-compressed CSV
data = pd.read_csv("data/StockTwits_SPY_Sentiment_2017.gz",
                   encoding="utf-8",
                   compression="gzip",
                   index_col=0)

# get the list of messages and their labels
messages = data.message.values
labels = data.sentiment.values

# print the first 10 messages
for i in range(10):
    print("Mensagem:", messages[i], "| Rótulo:", labels[i])

Mensagem: $SPY crazy day so far! | Rótulo: bearish
Mensagem: $SPY Will make a new ATH this week. Watch it! | Rótulo: bullish
Mensagem: $SPY $DJIA white elephant in room is $AAPL. Up 14% since election. Strong headwinds w/Trump trade & Strong dollar. How many 7's do you see? | Rótulo: bearish
Mensagem: $SPY blocks above. We break above them We should push to double top | Rótulo: bullish
Mensagem: $SPY Nothing happening in the market today, guess I'll go to the store and spend some $. | Rótulo: bearish
Mensagem: $SPY What an easy call. Good jobs report: good economy, markets go up.  Bad jobs report: no more rate hikes, markets go up.  Win-win. | Rótulo: bullish
Mensagem: $SPY BS market. | Rótulo: bullish
Mensagem: $SPY this rally all the cheerleaders were screaming about this morning is pretty weak. I keep adding 2 my short at all spikes | Rótulo: bearish
Mensagem: $SPY Dollar ripping higher! | Rótulo: bearish
Mensagem: $SPY no reason to go down ! | Rótulo: bullish


In [3]:
# normalize every message with the helper from utils; the output below shows the
# effect: lowercase text, <TICKER> and <NUMBER> placeholders, no punctuation
messages = np.array([utl.preprocess_ST_message(message) for message in messages])

# print the first 10 messages
for i in range(10):
    print("Mensagem:", messages[i])

Mensagem: <TICKER> crazy day so far
Mensagem: <TICKER> will make a new ath this week watch it
Mensagem: <TICKER> <TICKER> white elephant in room is <TICKER> up <NUMBER> since election strong headwinds wtrump trade strong dollar how many <NUMBER> s do you see
Mensagem: <TICKER> blocks above we break above them we should push to double top
Mensagem: <TICKER> nothing happening in the market today guess ill go to the store and spend some
Mensagem: <TICKER> what an easy call good jobs report good economy markets go up bad jobs report no more rate hikes markets go up winwin
Mensagem: <TICKER> bs market
Mensagem: <TICKER> this rally all the cheerleaders were screaming about this morning is pretty weak i keep adding <NUMBER> my short at all spikes
Mensagem: <TICKER> dollar ripping higher
Mensagem: <TICKER> no reason to go down
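
The helper `utl.preprocess_ST_message` lives in the local `utils` module and is not shown in this notebook. The sketch below is one possible way to reproduce the outputs above, assuming the cleaning is done with regular expressions. The name `preprocess_message_sketch` and the patterns are illustrative guesses, not the actual implementation.

```python
import re

def preprocess_message_sketch(text):
    """Hypothetical stand-in for utl.preprocess_ST_message."""
    text = text.lower()
    # Replace cashtags such as $spy with a placeholder token
    text = re.sub(r"\$[a-z]+", " <TICKER> ", text)
    # Replace integers and decimals with a placeholder token
    text = re.sub(r"\d+(\.\d+)?", " <NUMBER> ", text)
    # Drop everything except letters, whitespace and the placeholder brackets
    text = re.sub(r"[^A-Za-z<>\s]", "", text)
    # Collapse repeated whitespace
    return " ".join(text.split())

print(preprocess_message_sketch("$SPY crazy day so far!"))
print(preprocess_message_sketch("Up 14% since election. How many 7's do you see?"))
```

Replacing tickers and numbers with placeholders keeps the vocabulary small: the network learns one embedding for "some ticker" instead of one per symbol.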


In [4]:
print(messages.shape)

(96967,)


In [5]:
full_lexicon = " ".join(messages).split()
vocab_to_int, int_to_vocab = utl.create_lookup_tables(full_lexicon)

In [8]:
print(len(vocab_to_int))

31980
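
`utl.create_lookup_tables` is also defined in the local `utils` module. A minimal sketch of what such a function could do is shown here, assuming that ids are assigned by descending word frequency and that id 0 is left free for padding (the later `vocab_size = len(vocab_to_int) + 1` suggests this, but it is an assumption). The function name is hypothetical.

```python
from collections import Counter

def create_lookup_tables_sketch(words):
    """Hypothetical stand-in for utl.create_lookup_tables."""
    counts = Counter(words)
    # Most frequent words first; ties broken alphabetically for determinism
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Start at 1 so that 0 stays free for padding
    word_to_id = {word: idx for idx, word in enumerate(ordered, 1)}
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word

toy_lexicon = "<TICKER> bs market <TICKER> crazy day".split()
toy_vocab_to_int, toy_int_to_vocab = create_lookup_tables_sketch(toy_lexicon)
print(toy_vocab_to_int)
```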


In [10]:
# the messages are still plain strings here, so these lengths are measured in characters
messages_lens = Counter([len(x) for x in messages])
print("Zero-length messages: {}".format(messages_lens[0]))
print("Maximum message length: {}".format(max(messages_lens)))
print("Average message length: {}".format(np.mean([len(x) for x in messages])))

# drop the zero-length messages
messages, labels = utl.drop_empty_messages(messages, labels)

Zero-length messages: 1
Maximum message length: 244
Average message length: 78.21856920395598


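Encoding the messages: each word is mapped to its integer id in `vocab_to_int`, and the labels are converted to integers.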

In [11]:
messages = utl.encode_ST_messages(messages, vocab_to_int)
labels = utl.encode_ST_labels(labels)
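
As a rough illustration of this step, the sketch below encodes two toy messages and their labels. Both function names are hypothetical, and the choice of `bullish -> 1`, `bearish -> 0` is an assumption, since the real `utl.encode_ST_labels` is not shown.

```python
def encode_messages_sketch(messages, vocab_to_int):
    """Hypothetical stand-in for utl.encode_ST_messages."""
    return [[vocab_to_int[word] for word in message.split()] for message in messages]

def encode_labels_sketch(labels):
    """Hypothetical stand-in for utl.encode_ST_labels (bullish -> 1 is an assumption)."""
    return [1 if label == "bullish" else 0 for label in labels]

toy_vocab = {"<TICKER>": 1, "bs": 2, "market": 3}
toy_encoded = encode_messages_sketch(["<TICKER> bs market", "<TICKER> market"], toy_vocab)
toy_encoded_labels = encode_labels_sketch(["bullish", "bearish"])
print(toy_encoded)
print(toy_encoded_labels)
```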

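Padding: every encoded message is padded with zeros to a fixed length. The value 244 is the maximum message length in characters measured earlier, so it is a safe upper bound on the number of tokens.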

In [12]:
messages = utl.zero_pad_messages(messages, seq_len=244)
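
A minimal sketch of zero padding on plain lists follows. It assumes left padding (zeros in front), which places the real tokens at the end of the sequence and suits a model that reads only the last time step, as `lstm_outputs[:, -1]` does later on. Whether `utl.zero_pad_messages` pads on the left or on the right is not visible in this notebook.

```python
def zero_pad_sketch(encoded_messages, seq_len):
    """Hypothetical stand-in for utl.zero_pad_messages (left padding is an assumption)."""
    padded = []
    for message in encoded_messages:
        kept = message[:seq_len]                            # truncate overly long messages
        padded.append([0] * (seq_len - len(kept)) + kept)   # zeros go in front
    return padded

toy_padded = zero_pad_sketch([[1, 2, 3], [4]], seq_len=5)
print(toy_padded)
```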

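Dataset split: 80% of the data is used for training, and the remainder is divided equally between validation and test.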

In [13]:
train_x, val_x, test_x, train_y, val_y, test_y = utl.train_val_test_split(messages, labels, split_frac=0.80)

print("Data Set Size")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))


Data Set Size
Train set: 		(77572, 244) 
Validation set: 	(9697, 244) 
Test set: 		(9697, 244)
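
The sizes printed above can be reproduced with a simple split. The sketch below is a guess at what `utl.train_val_test_split` does: the shuffling, the `seed` parameter and the function name are assumptions, and only the resulting sizes are checked against the output.

```python
import random

def train_val_test_split_sketch(items, labels, split_frac, seed=0):
    """Hypothetical stand-in for utl.train_val_test_split."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)            # shuffling is an assumption
    n_train = int(len(order) * split_frac)
    n_val = (len(order) - n_train) // 2           # the rest is halved: validation / test
    parts = (order[:n_train],
             order[n_train:n_train + n_val],
             order[n_train + n_val:])
    xs = [[items[i] for i in part] for part in parts]
    ys = [[labels[i] for i in part] for part in parts]
    return xs[0], xs[1], xs[2], ys[0], ys[1], ys[2]

n_messages = 96966  # 96967 messages minus the single empty one
toy_splits = train_val_test_split_sketch(list(range(n_messages)), [0] * n_messages,
                                         split_frac=0.80)
print([len(part) for part in toy_splits[:3]])     # [77572, 9697, 9697]
```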


## 3 - Building the LSTM network

In [14]:
def model_inputs():
    """
    Create the model inputs
    """
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob_ = tf.placeholder(tf.float32, name='keep_prob')
    
    return inputs_, labels_, keep_prob_

In [16]:
def build_embedding_layer(inputs_, vocab_size, embed_size):
    """
    Create the embedding layer
    """
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
    
    return embed
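
Conceptually, the embedding lookup just selects one row of the embedding matrix per token id. The plain-Python sketch below illustrates that idea with toy sizes; it is not how TensorFlow implements it, and all names are local to the example.

```python
import random

def embedding_lookup_sketch(embedding, token_ids):
    """Pick one row of the embedding matrix per token id."""
    return [embedding[token_id] for token_id in token_ids]

rng = random.Random(0)
toy_vocab_size, toy_embed_size = 6, 4   # toy sizes, far smaller than the notebook's
# Rows drawn uniformly from [-1, 1], mirroring the tf.random_uniform initializer
toy_embedding = [[rng.uniform(-1, 1) for _ in range(toy_embed_size)]
                 for _ in range(toy_vocab_size)]

toy_sequence = [0, 0, 1, 5]             # a padded, encoded message
toy_embedded = embedding_lookup_sketch(toy_embedding, toy_sequence)
print(len(toy_embedded), len(toy_embedded[0]))   # 4 4
```

Note that the padding id 0 has its own row in the matrix, which is why the vocabulary size needs one extra entry.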

In [17]:
def build_lstm_layers(lstm_sizes, embed, keep_prob_, batch_size):
    """
    Create the LSTM layers
    """
    lstms = [tf.contrib.rnn.BasicLSTMCell(size) for size in lstm_sizes]
    # Add dropout to the cell
    drops = [tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob_) for lstm in lstms]
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell(drops)
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    lstm_outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    
    return initial_state, lstm_outputs, cell, final_state
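
To make the recurrence inside each cell more concrete, here is a didactic single-unit LSTM step in plain Python. It follows the standard LSTM gate equations; the weight names and values are made up, and it is not the `BasicLSTMCell` implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step_sketch(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM; w maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)          # how much new information to write
    f = gate("forget", sigmoid)         # how much of the old cell state to keep
    o = gate("output", sigmoid)         # how much of the cell state to expose
    g = gate("candidate", math.tanh)    # proposed new content
    c = f * c_prev + i * g              # updated cell state
    h = o * math.tanh(c)                # updated hidden state (the cell output)
    return h, c

toy_weights = {name: (0.5, 0.5, 0.0)
               for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0                         # all-zero initial state
for x in [1.0, -1.0, 0.5]:              # a toy input sequence
    h, c = lstm_step_sketch(x, h, c, toy_weights)
print(h, c)
```

In the network above the same recurrence runs with vector-valued states, and the hidden state of the last time step is what the output layer reads.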

In [18]:
def build_cost_fn_and_opt(lstm_outputs, labels_, learning_rate):
    """
    Create the Loss function and Optimizer
    """
    predictions = tf.contrib.layers.fully_connected(lstm_outputs[:, -1], 1, activation_fn=tf.sigmoid)
    loss = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdadeltaOptimizer(learning_rate).minimize(loss)
    
    return predictions, loss, optimizer

In [19]:
def build_accuracy(predictions, labels_):
    """
    Create accuracy
    """
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    
    return accuracy
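
The loss and accuracy defined in the two cells above can be traced by hand on a few values. This sketch uses made-up logits and labels and plain Python arithmetic in place of the TensorFlow ops.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

toy_logits = [2.0, -1.0, 0.3, -3.0]     # made-up pre-activation outputs
toy_targets = [1, 0, 0, 0]              # made-up labels

toy_predictions = [sigmoid(z) for z in toy_logits]
# Mean squared error between the labels and the sigmoid outputs
toy_loss = sum((y - p) ** 2 for y, p in zip(toy_targets, toy_predictions)) / len(toy_targets)
# Round each probability to 0 or 1 and compare it with the label
toy_correct = [int(round(p)) == y for p, y in zip(toy_predictions, toy_targets)]
toy_accuracy = sum(toy_correct) / len(toy_correct)

print(toy_loss, toy_accuracy)
```

The third example has a probability slightly above 0.5, so it is rounded to 1 and counted as an error, giving an accuracy of 0.75.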

In [20]:
def build_and_train_network(lstm_sizes, vocab_size, embed_size, epochs, batch_size,
                            learning_rate, keep_prob, train_x, val_x, train_y, val_y):
    
    inputs_, labels_, keep_prob_ = model_inputs()
    embed = build_embedding_layer(inputs_, vocab_size, embed_size)
    initial_state, lstm_outputs, lstm_cell, final_state = build_lstm_layers(lstm_sizes, embed, keep_prob_, batch_size)
    predictions, loss, optimizer = build_cost_fn_and_opt(lstm_outputs, labels_, learning_rate)
    accuracy = build_accuracy(predictions, labels_)
    
    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        
        sess.run(tf.global_variables_initializer())
        n_batches = len(train_x)//batch_size
        for e in range(epochs):
            state = sess.run(initial_state)
            
            train_acc = []
            for ii, (x, y) in enumerate(utl.get_batches(train_x, train_y, batch_size), 1):
                feed = {inputs_: x,
                        labels_: y[:, None],
                        keep_prob_: keep_prob,
                        initial_state: state}
                loss_, state, _,  batch_acc = sess.run([loss, final_state, optimizer, accuracy], feed_dict=feed)
                train_acc.append(batch_acc)
                
                if (ii + 1) % n_batches == 0:
                    
                    val_acc = []
                    val_state = sess.run(lstm_cell.zero_state(batch_size, tf.float32))
                    for xx, yy in utl.get_batches(val_x, val_y, batch_size):
                        feed = {inputs_: xx,
                                labels_: yy[:, None],
                                keep_prob_: 1,
                                initial_state: val_state}
                        val_batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                        val_acc.append(val_batch_acc)
                    
                    print("Epoch: {}/{}...".format(e+1, epochs),
                          "Batch: {}/{}...".format(ii+1, n_batches),
                          "Train Loss: {:.3f}...".format(loss_),
                          "Train Accuracy: {:.3f}...".format(np.mean(train_acc)),
                          "Val Accuracy: {:.3f}".format(np.mean(val_acc)))
    
        saver.save(sess, "checkpoints/sentiment.ckpt")
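
Because the initial LSTM state is created for a fixed `batch_size`, every batch fed to the network must have exactly that many rows. A batch generator compatible with this constraint could look like the sketch below, which yields full batches only and drops the remainder. This is an assumption about `utl.get_batches`, whose code is not shown; the slicing works the same way on NumPy arrays.

```python
def get_batches_sketch(x, y, batch_size):
    """Hypothetical stand-in for utl.get_batches: yields full batches only."""
    n_batches = len(x) // batch_size
    for start in range(0, n_batches * batch_size, batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

toy_x = list(range(10))
toy_y = [i % 2 for i in toy_x]
toy_batches = list(get_batches_sketch(toy_x, toy_y, batch_size=4))
print(len(toy_batches))     # 2 full batches; the last 2 samples are dropped
print(77572 // 256)         # 303 full batches per epoch for the training set above
```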

In [21]:
# Define Inputs and Hyperparameters
lstm_sizes = [128, 64]
vocab_size = len(vocab_to_int) + 1 #add one for padding
embed_size = 300
epochs = 50
batch_size = 256
learning_rate = 0.1
keep_prob = 0.5
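
A back-of-the-envelope count of trainable parameters for these hyperparameters is sketched below. It assumes the standard LSTM layout (four gates, each seeing the concatenated input and hidden state, each with a bias) and a single dense output unit with bias. Names with a trailing underscore are local to this sketch.

```python
def lstm_params(input_size, hidden_size):
    """Weights and biases of one LSTM layer: four gates over [input, hidden]."""
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

vocab_size_ = 31980 + 1              # vocabulary size printed earlier, plus padding
embed_size_ = 300
lstm_sizes_ = [128, 64]

embedding_params = vocab_size_ * embed_size_
layer_inputs = [embed_size_] + lstm_sizes_[:-1]   # each layer reads the previous one
recurrent_params = [lstm_params(i, h) for i, h in zip(layer_inputs, lstm_sizes_)]
dense_params = lstm_sizes_[-1] + 1   # one sigmoid output unit with bias

print(embedding_params)              # 9594300
print(recurrent_params)              # [219648, 49408]
print(embedding_params + sum(recurrent_params) + dense_params)  # 9863421
```

Under these assumptions the embedding matrix accounts for the vast majority of the parameters, so `embed_size` and the vocabulary size dominate the model's memory footprint.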

In [22]:
with tf.Graph().as_default():
    build_and_train_network(lstm_sizes, vocab_size, embed_size, epochs, batch_size,
                            learning_rate, keep_prob, train_x, val_x, train_y, val_y)

KeyboardInterrupt: 

## Testing the model

In [None]:
def test_network(model_dir, batch_size, test_x, test_y):
    
    inputs_, labels_, keep_prob_ = model_inputs()
    embed = build_embedding_layer(inputs_, vocab_size, embed_size)
    initial_state, lstm_outputs, lstm_cell, final_state = build_lstm_layers(lstm_sizes, embed, keep_prob_, batch_size)
    predictions, loss, optimizer = build_cost_fn_and_opt(lstm_outputs, labels_, learning_rate)
    accuracy = build_accuracy(predictions, labels_)
    
    saver = tf.train.Saver()
    
    test_acc = []
    with tf.Session() as sess:
        saver.restore(sess, tf.train.latest_checkpoint(model_dir))
        test_state = sess.run(lstm_cell.zero_state(batch_size, tf.float32))
        for ii, (x, y) in enumerate(utl.get_batches(test_x, test_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob_: 1,
                    initial_state: test_state}
            batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
            test_acc.append(batch_acc)
        print("Test Accuracy: {:.3f}".format(np.mean(test_acc)))

In [None]:
with tf.Graph().as_default():
    test_network('checkpoints', batch_size, test_x, test_y)
