# Analyzing Stock Sentiment from Twits

## Load Packages

In [None]:
import json
import nltk
import os
import random
import re
import torch

from torch import nn, optim
import torch.nn.functional as F


## Introduction
When deciding the value of a company, it's important to follow the news. For example, a product recall or a natural disaster in a company's product chain can change what the company is worth. This information can be turned into a signal by using a neural network.

In this project, posts from the social media site [StockTwits](https://en.wikipedia.org/wiki/StockTwits) are used. The community on StockTwits is full of investors, traders, and entrepreneurs. Each message posted is called a Twit. A model will be built to generate a sentiment score for these twits.

A set of twits has been collected, each labeled with its sentiment. There are five sentiment levels: very negative, negative, neutral, positive, and very positive, scored from -2 to 2 in steps of 1 respectively. Using this labeled data, a sentiment analysis model will be trained to assign sentiment to twits on its own.
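The label scale described above can be sketched as a small lookup. The names `SENTIMENT_LABELS`, `to_class_index`, and `to_score` are illustrative and not part of the project code; the shift by 2 mirrors what the notebook does when it prepares the labels for the network.

```python
# Hypothetical helper names, for illustration only.
SENTIMENT_LABELS = {
    -2: "very negative",
    -1: "negative",
    0: "neutral",
    1: "positive",
    2: "very positive",
}

def to_class_index(score):
    """Shift a sentiment score in [-2, 2] to a class index in [0, 4]."""
    if score not in SENTIMENT_LABELS:
        raise ValueError(f"unexpected sentiment score: {score}")
    return score + 2

def to_score(class_index):
    """Inverse of to_class_index."""
    return class_index - 2

for score, name in SENTIMENT_LABELS.items():
    print(score, name, to_class_index(score))
```

Keeping the mapping explicit makes it harder to confuse the raw score 0 (neutral) with the class index 0 (very negative).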

## Import Twits and Load Twits Data 

The fields represent the following:

* `'message_body'`: The text of the twit.
* `'sentiment'`: Sentiment score for the twit, ranges from -2 to 2 in steps of 1, with 0 being neutral.


In [None]:
with open(os.path.join('..', '..', 'data', 'project_6_stocktwits', 'twits.json'), 'r') as f:
    twits = json.load(f)

print(twits['data'][:10])

[{'message_body': '$FITB great buy at 26.00...ill wait', 'sentiment': 2, 'timestamp': '2018-07-01T00:00:09Z'}, {'message_body': '@StockTwits $MSFT', 'sentiment': 1, 'timestamp': '2018-07-01T00:00:42Z'}, {'message_body': '#STAAnalystAlert for $TDG : Jefferies Maintains with a rating of Hold setting target price at USD 350.00. Our own verdict is Buy  http://www.stocktargetadvisor.com/toprating', 'sentiment': 2, 'timestamp': '2018-07-01T00:01:24Z'}, {'message_body': '$AMD I heard there’s a guy who knows someone who thinks somebody knows something - on StockTwits.', 'sentiment': 1, 'timestamp': '2018-07-01T00:01:47Z'}, {'message_body': '$AMD reveal yourself!', 'sentiment': 0, 'timestamp': '2018-07-01T00:02:13Z'}, {'message_body': '$AAPL Why the drop? I warren Buffet taking out his position?', 'sentiment': 1, 'timestamp': '2018-07-01T00:03:10Z'}, {'message_body': '$BA bears have 1 reason on 06-29 to pay more attention https://dividendbot.com?s=BA', 'sentiment': -2, 'timestamp': '2018-07-01T

### Length of Data is the number of twits in dataset.

In [None]:
"""print out the number of twits"""

# TODO Implement 
len(twits['data'])

1548010

### Split Message Body and Sentiment Score

In [None]:
messages = [twit['message_body'] for twit in twits['data']]
# The sentiment scores are discrete values from -2 to 2; shift them by 2 so they become
# class indices 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]
print(messages[:10])
print(sentiments[:10])

['$FITB great buy at 26.00...ill wait', '@StockTwits $MSFT', '#STAAnalystAlert for $TDG : Jefferies Maintains with a rating of Hold setting target price at USD 350.00. Our own verdict is Buy  http://www.stocktargetadvisor.com/toprating', '$AMD I heard there’s a guy who knows someone who thinks somebody knows something - on StockTwits.', '$AMD reveal yourself!', '$AAPL Why the drop? I warren Buffet taking out his position?', '$BA bears have 1 reason on 06-29 to pay more attention https://dividendbot.com?s=BA', '$BAC ok good we&#39;re not dropping in price over the weekend, lol', '$AMAT - Daily Chart, we need to get back to above 50.', '$GME 3% drop per week after spike... if no news in 3 months, back to 12s... if BO, then bingo... what is the odds?']
[4, 3, 4, 3, 2, 3, 0, 3, 4, 0]


## Preprocessing the Data

### Remove unuseful symbols with regex using the re module

In [None]:
nltk.download('wordnet')


def preprocess(message):
    """
    This function takes a string as input, then performs these operations: 
        - lowercase
        - remove URLs
        - remove ticker symbols 
        - remove usernames
        - remove punctuation and digits
        - tokenize by splitting the string on whitespace 
        - lemmatize and remove any single character tokens
    
    Parameters
    ----------
        message : The text message to be preprocessed.
        
    Returns
    -------
        tokens: The preprocessed text into tokens.
    """ 
    #TODO: Implement 
    
    # Lowercase the twit message
    text = message.lower()
    
    # Replace URLs with a space in the message
    text = re.sub(r'http\S+', ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    # A raw string keeps `\$` and `\S` from being treated as invalid string escapes.
    text = re.sub(r'\$\S+', ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r'@([A-Za-z0-9_]+)', ' ', text)

    # Replace everything not a letter with a space
    text = re.sub(r'[^a-zA-Z]',' ',text)
    
    # Tokenize by splitting the string on whitespace into a list of words.
    # `split()` with no argument collapses runs of spaces, so no empty tokens are produced.
    tokens = text.split()

    # Lemmatize words using the WordNetLemmatizer. You can ignore any word that is not longer than one character.
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(t) for t in tokens if len(t)>1]
    
    return tokens

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
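To see what the regex steps do on their own, here is a minimal sketch that leaves out the lemmatization step so it runs without NLTK. The function name `clean_and_split` is made up for this example; the patterns follow the ones used in `preprocess`, applied after lowercasing.

```python
import re

def clean_and_split(message):
    """Regex-only part of the preprocessing (no lemmatization)."""
    text = message.lower()
    text = re.sub(r'http\S+', ' ', text)       # URLs
    text = re.sub(r'\$\S+', ' ', text)         # ticker symbols such as $fitb
    text = re.sub(r'@[a-z0-9_]+', ' ', text)   # usernames such as @stocktwits
    text = re.sub(r'[^a-z]', ' ', text)        # anything that is not a letter
    # Split on whitespace and drop single-character tokens.
    return [t for t in text.split() if len(t) > 1]

print(clean_and_split('$FITB great buy at 26.00...ill wait'))
# ['great', 'buy', 'at', 'ill', 'wait']
print(clean_and_split('@StockTwits $MSFT'))
# []
```

The second message shows that a twit can end up with no tokens at all once tickers and usernames are stripped.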


### Preprocess All the Twits 

In [None]:
# TODO Implement
tokenized = [preprocess(message) for message in messages]

### Bag of Words
With all messages tokenized, create a vocabulary and count how often each word appears in the entire corpus. Use the [`Counter`](https://docs.python.org/3.1/library/collections.html#collections.Counter) class to count all the tokens.
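As a small illustration of the counting step, the sketch below builds a bag of words from a toy corpus. The variable names `toy_tokenized` and `toy_bow` and the third message are invented for the example; `itertools.chain.from_iterable` is just one way to flatten the nested token lists.

```python
from collections import Counter
from itertools import chain

# Toy corpus: a list of token lists, one per message.
toy_tokenized = [
    ['great', 'buy', 'at', 'ill', 'wait'],
    [],
    ['buy', 'the', 'dip'],
]

# Flatten the nested lists and count every token.
toy_bow = Counter(chain.from_iterable(toy_tokenized))

print(toy_bow.most_common(1))   # [('buy', 2)]
print(len(toy_bow))             # 7 distinct words
```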

In [None]:
messages[:10]

['$FITB great buy at 26.00...ill wait',
 '@StockTwits $MSFT',
 '#STAAnalystAlert for $TDG : Jefferies Maintains with a rating of Hold setting target price at USD 350.00. Our own verdict is Buy  http://www.stocktargetadvisor.com/toprating',
 '$AMD I heard there’s a guy who knows someone who thinks somebody knows something - on StockTwits.',
 '$AMD reveal yourself!',
 '$AAPL Why the drop? I warren Buffet taking out his position?',
 '$BA bears have 1 reason on 06-29 to pay more attention https://dividendbot.com?s=BA',
 '$BAC ok good we&#39;re not dropping in price over the weekend, lol',
 '$AMAT - Daily Chart, we need to get back to above 50.',
 '$GME 3% drop per week after spike... if no news in 3 months, back to 12s... if BO, then bingo... what is the odds?']

In [None]:
tokenized[:10]

[['great', 'buy', 'at', 'ill', 'wait'],
 [],
 ['staanalystalert',
  'for',
  'jefferies',
  'maintains',
  'with',
  'rating',
  'of',
  'hold',
  'setting',
  'target',
  'price',
  'at',
  'usd',
  'our',
  'own',
  'verdict',
  'is',
  'buy'],
 ['heard',
  'there',
  'guy',
  'who',
  'know',
  'someone',
  'who',
  'think',
  'somebody',
  'know',
  'something',
  'on',
  'stocktwits'],
 ['reveal', 'yourself'],
 ['why',
  'the',
  'drop',
  'warren',
  'buffet',
  'taking',
  'out',
  'his',
  'position'],
 ['bear', 'have', 'reason', 'on', 'to', 'pay', 'more', 'attention'],
 ['ok',
  'good',
  'we',
  're',
  'not',
  'dropping',
  'in',
  'price',
  'over',
  'the',
  'weekend',
  'lol'],
 ['daily', 'chart', 'we', 'need', 'to', 'get', 'back', 'to', 'above'],
 ['drop',
  'per',
  'week',
  'after',
  'spike',
  'if',
  'no',
  'news',
  'in',
  'month',
  'back',
  'to',
  'if',
  'bo',
  'then',
  'bingo',
  'what',
  'is',
  'the',
  'odds']]

In [None]:
from collections import Counter


"""
Create a vocabulary by using Bag of words
"""

# TODO: Implement 
all_words = []
for w in tokenized:
    for w1 in w:
        all_words.append(w1)
bow = Counter(all_words)
len(bow)

98376

### Frequency of Words Appearing in Message
With the vocabulary, we'll remove some of the most common words such as 'the', 'and', 'it', etc. These words don't contribute to identifying sentiment and are really common, resulting in a lot of noise in our input. If we can filter these out, then our network should have an easier time learning.

We also want to remove really rare words that show up in only a few twits. Here we divide the count of each word by the number of messages, then remove words that appear in only a small fraction of the messages.
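The two filters can be tried out on a toy corpus first. Everything in this sketch (the corpus, the cutoff values, and the `toy_` names) is made up for illustration; only the filtering rule itself follows the description above.

```python
from collections import Counter

toy_tokenized = [
    ['the', 'stock', 'is', 'up'],
    ['the', 'stock', 'is', 'down'],
    ['the', 'earnings', 'beat'],
    ['the', 'stock', 'moonshot'],
]

toy_bow = Counter(t for message in toy_tokenized for t in message)
n_messages = len(toy_tokenized)

# Count of each word divided by the number of messages.
toy_freqs = {word: count / n_messages for word, count in toy_bow.items()}

toy_low_cutoff = 0.25   # drop words at or below this frequency
toy_high_cutoff = 1     # drop the single most common word
too_common = {word for word, _ in toy_bow.most_common(toy_high_cutoff)}

kept = [w for w in toy_freqs if toy_freqs[w] > toy_low_cutoff and w not in too_common]
print(kept)   # ['stock', 'is']
```

Here 'the' is dropped for being too common, and the words that appear only once are dropped for being too rare.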

In [None]:
"""
Set the following variables:
    freqs
    low_cutoff
    high_cutoff
    K_most_common
"""

# TODO Implement 

# Dictionary that contains the frequency of words appearing in messages.
# The key is the token and the value is that word's count divided by the number of messages.
total_appearing = len(tokenized)
freqs = {key:word_count/total_appearing for key, word_count in bow.items()}

# Float that is the frequency cutoff. Drop words with a frequency that is lower or equal to this number.
low_cutoff = 0.000001

# Integer that is the cut off for most common words. Drop words that are the `high_cutoff` most common words.
high_cutoff = 20

# The k most common words in the corpus. Use `high_cutoff` as the k.
#print(bow.most_common(high_cutoff))
K_most_common = dict(bow.most_common(high_cutoff))

filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]
print(K_most_common)
print(len(filtered_words))

{'the': 398754, 'to': 379487, 'is': 284865, 'for': 273537, 'on': 241663, 'of': 211334, 'and': 208471, 'in': 205307, 'this': 203542, 'it': 193484, 'at': 138453, 'will': 128180, 'up': 121567, 'are': 101424, 'you': 94278, 'that': 89655, 'be': 89277, 'short': 86642, 'what': 79115, 'today': 76240}
47981


### Updating Vocabulary by Removing Filtered Words

In [None]:
"""
Set the following variables:
    vocab
    id2vocab
    filtered
"""
#TODO Implement
# A dictionary for the `filtered_words`. The key is the word and value is an id that represents the word. 
# Ids start at 1 so that 0 stays reserved for the padding value used when batching.
vocab = {w:i for i, w in enumerate(filtered_words, 1)}
# Reverse of the `vocab` dictionary. The key is word id and value is the word. 
id2vocab = {i:w for i, w in enumerate(filtered_words, 1)}
# tokenized with the words not in `filtered_words` removed.
filtered = [[word for word in sentence if word in vocab] for sentence in tokenized]
print(len(filtered), filtered[:10])

1548010 [['great', 'buy', 'ill', 'wait'], [], ['staanalystalert', 'jefferies', 'maintains', 'with', 'rating', 'hold', 'setting', 'target', 'price', 'usd', 'our', 'own', 'verdict', 'buy'], ['heard', 'there', 'guy', 'who', 'know', 'someone', 'who', 'think', 'somebody', 'know', 'something', 'stocktwits'], ['reveal', 'yourself'], ['why', 'drop', 'warren', 'buffet', 'taking', 'out', 'his', 'position'], ['bear', 'have', 'reason', 'pay', 'more', 'attention'], ['ok', 'good', 'we', 're', 'not', 'dropping', 'price', 'over', 'weekend', 'lol'], ['daily', 'chart', 'we', 'need', 'get', 'back', 'above'], ['drop', 'per', 'week', 'after', 'spike', 'if', 'no', 'news', 'month', 'back', 'if', 'bo', 'then', 'bingo', 'odds']]


### Balancing the classes
About 50% of the labeled twits are neutral. This means that the network could be 50% accurate just by guessing 0 (neutral) every single time. To help the network learn appropriately, the classes need to be balanced.
That is, each of the different sentiment scores should show up roughly as frequently in the data.

We'll go through each of the examples and randomly drop twits with neutral sentiment to get around 20% neutral twits.
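The keep probability for neutral twits follows from a short calculation: if the non-neutral twits are all kept and the neutral class should end up about as large as the average of the other four classes, each neutral twit is kept with probability `(N - n_neutral) / 4 / n_neutral`. The sketch below checks that arithmetic with made-up counts; the function names are illustrative.

```python
def neutral_keep_prob(n_total, n_neutral, n_classes=5):
    """Probability of keeping a neutral twit so that the neutral class is
    about as large as the average of the other classes."""
    n_other = n_total - n_neutral
    return n_other / (n_classes - 1) / n_neutral

def expected_neutral_share(n_total, n_neutral):
    """Expected fraction of neutral twits after random dropping."""
    kept_neutral = neutral_keep_prob(n_total, n_neutral) * n_neutral
    return kept_neutral / (kept_neutral + (n_total - n_neutral))

# Illustrative counts, not the real dataset: half of 1,000 twits are neutral.
print(neutral_keep_prob(1000, 500))        # 0.25
print(expected_neutral_share(1000, 500))   # 0.2
```

Because the dropping is random, the share actually observed in a given run will only be close to 20%, not exactly equal to it.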

In [None]:
balanced = {'messages': [], 'sentiments':[]}

n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if len(message) == 0:
        # skip this message because it has length zero
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 

In [None]:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

0.19510884980164975

Convert tokens into integer ids which can be passed to the network.

In [None]:
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

## Build Neural Network LSTM

### Implement the text classifier

Softmax is used instead of sigmoid because the network's output is not binary: each twit falls into one of 5 sentiment classes. The model outputs log-probabilities over those classes, and the class with the highest probability is taken as the prediction.
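To make the softmax step concrete, here is a dependency-free sketch of log-softmax over five made-up scores. It is not the PyTorch implementation, only the same formula written with the standard library; the notebook itself relies on `nn.LogSoftmax` paired with `nn.NLLLoss`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

logits = [0.1, 1.2, -0.3, 2.5, 0.4]   # made-up scores for the 5 classes
logps = log_softmax(logits)
probs = [math.exp(lp) for lp in logps]

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

# Negative log-likelihood if the true class were 3.
nll_if_true_class_is_3 = -logps[3]

print(round(sum(probs), 6))   # 1.0
print(predicted_class)        # 3
```

Unlike five independent sigmoids, the softmax probabilities sum to 1, so raising the probability of one class necessarily lowers the others.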

In [None]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.
        
        Parameters
        ----------
            vocab_size : The vocabulary size.
            embed_size : The embedding layer size.
            lstm_size : The LSTM layer size.
            output_size : The output size.
            lstm_layers : The number of LSTM layers.
            dropout : The dropout probability.
        """
        
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.dropout = dropout
        
        # TODO Implement

        # Setup embedding layer
        self.embedding = nn.Embedding(vocab_size,embed_size)
        
        # Setup additional layers
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers,
                            dropout = dropout, batch_first = False)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def init_hidden(self, batch_size):
        """ 
        Initializes hidden state
        
        Parameters
        ----------
            batch_size : The size of batches.
        
        Returns
        -------
            hidden_state
            
        """
        
        # TODO Implement 
        
        
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        
        weight = next(self.parameters()).data
        hidden_state = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                        weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden_state


    def forward(self, nn_input, hidden_state):
        """
        Perform a forward pass of our model on nn_input.
        
        Parameters
        ----------
            nn_input : The batch of input to the NN.
            hidden_state : The LSTM hidden state.

        Returns
        -------
            logps: log softmax output
            hidden_state: The new hidden state.

        """
        
        # TODO Implement 
        #print(nn_input)
        batch_size = nn_input.size(1)

        # embeddings and lstm_out
        nn_input_long = nn_input.long()
        embeds = self.embedding(nn_input_long)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
    
        # keep only the LSTM output at the last time step: (batch_size, lstm_size)
        #lstm_out = lstm_out.contiguous().view(-1, self.lstm_size)
        lstm_out = lstm_out[-1,:,:]  
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # softmax function
        logps = self.log_softmax(out)
        
        #logps = logps.view(-1, batch_size, self.output_size)
        #logps = logps[-1,:,:] # get last batch of labels
        
        # return the log-softmax output for the last time step and the hidden state

        return logps, hidden_state

### View Model

In [None]:
model = TextClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
hidden = model.init_hidden(4)

logps, _ = model.forward(input, hidden)
print(logps)
print(logps.size())

tensor([[-1.7876, -1.4405, -1.5648, -1.7997, -1.5081],
        [-1.7840, -1.4768, -1.5033, -1.7929, -1.5381],
        [-1.7983, -1.4138, -1.5963, -1.8173, -1.4868],
        [-1.7808, -1.4354, -1.5691, -1.8099, -1.5071]])
torch.Size([4, 5])


## Training
### DataLoaders and Batching
Now we'll build a generator that we can use to loop through our data. It'll be more efficient if we can pass our sequences in as batches. Our input tensors should look like `(sequence_length, batch_size)`. So if our sequences are 40 tokens long and we pass in 25 sequences, then we'd have an input size of `(40, 25)`.

If we set our sequence length to 40, for messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to **left** pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we'll just keep the first 40 tokens.
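The padding and batch layout can be sketched with plain lists before moving to tensors. The helpers `left_pad` and `make_batch` are hypothetical names for this example, and a sequence length of 4 is used only to keep the output short.

```python
def left_pad(token_ids, sequence_length, pad_id=0):
    """Left-pad with pad_id, or keep only the first sequence_length ids."""
    kept = token_ids[:sequence_length]
    return [pad_id] * (sequence_length - len(kept)) + kept

def make_batch(messages, sequence_length):
    """Return a batch laid out as (sequence_length, batch_size)."""
    padded = [left_pad(m, sequence_length) for m in messages]
    # Transpose from (batch_size, sequence_length) to (sequence_length, batch_size).
    return [list(step) for step in zip(*padded)]

batch = make_batch([[5, 6], [7, 8, 9, 10, 11, 12]], sequence_length=4)
for row in batch:
    print(row)
# [0, 7]
# [0, 8]
# [5, 9]
# [6, 10]
```

Each printed row is one time step across the batch, which is the layout an LSTM created with `batch_first=False` expects.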

In [None]:
def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
    """ 
    Build a dataloader.
    """
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(messages)

    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])
        
        yield batch, label_tensor

### Training and Validation
With our data in nice shape, we'll split it into training and validation sets.

In [None]:
"""
Split data into training and validation datasets. Use an appropriate split size.
The features are the `token_ids` and the labels are the `sentiments`.
"""   

# TODO Implement 
split_frac = 0.8
split_idx = int(len(token_ids)*split_frac)

train_features = token_ids[:split_idx]
valid_features = token_ids[split_idx:]
train_labels = sentiments[:split_idx]
valid_labels = sentiments[split_idx:]

print(valid_features[:10])
print(valid_labels[:10])

[[1859, 3050, 5107, 465, 1514, 484, 1452, 4623, 350, 366, 230, 108, 454, 1276], [426, 503, 728, 752, 38027, 1488, 5611, 46807], [134, 57, 654, 80], [84, 336, 1444, 758], [209, 1382, 1986, 2805, 1761, 358, 1025, 3251, 1184, 59], [41, 1332, 1243, 4239, 144], [7359, 126, 697, 1774, 1369, 1434, 6216, 7, 714, 1416, 355, 454, 5040, 149, 263], [4446, 7111, 3189, 488, 1284], [3453, 135], [6979, 641, 5765, 4506]]
[0, 4, 4, 0, 3, 1, 3, 3, 3, 2]


In [None]:
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)
print(logps.size())

torch.Size([64, 5])


### Training
It's time to train the neural network!

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)

TextClassifier(
  (embedding): Embedding(47982, 1024)
  (lstm): LSTM(1024, 512, num_layers=2, dropout=0.2)
  (dropout): Dropout(p=0.2)
  (fc): Linear(in_features=512, out_features=5, bias=True)
  (log_softmax): LogSoftmax()
)

In [None]:
import numpy as np
"""
Train your model with dropout. Make sure to clip your gradients.
Print the training loss, validation loss, and validation accuracy for every 100 steps.
"""
epochs = 4
batch_size = 512
learning_rate = 0.001
clip = 5

print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()

sequence_length = 20
for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    
    steps = 0
    # initialize hidden state
    
    for text_batch, labels in dataloader(
            train_features, train_labels, batch_size=batch_size, sequence_length=sequence_length, shuffle=True):
        steps += 1
        min_batch_size = min(batch_size,text_batch.size(1))
        hidden = model.init_hidden(min_batch_size)
        
        text_batch, labels = text_batch.to(device), labels.to(device)
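        # `Tensor.to` returns a new tensor rather than moving it in place,
        # so rebuild the tuple instead of discarding the result
        hidden = tuple(each.to(device) for each in hidden)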
        
        # TODO Implement: Train Model
        
        # zero accumulated gradients
        model.zero_grad()
        
        # get the output from the model
        output, h = model(text_batch, hidden)


        # calculate the loss and perform backprop
        loss = criterion(output.squeeze().view(min_batch_size,-1), labels)
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()


        if steps % print_every == 0:
            model.eval()
            
            # TODO Implement: Print metrics
            # Get validation loss
            
            val_losses = []
            number_correct = 0
            for text_batch, labels in dataloader(valid_features, valid_labels, batch_size=batch_size, 
                    sequence_length=sequence_length, shuffle=False):

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                min_batch_size = min(text_batch.size(1),batch_size)
                val_h = model.init_hidden(min_batch_size)
                #text_batch, labels = text_batch.cuda(), labels.cuda()
                text_batch, labels = text_batch.to(device), labels.to(device)
                # move the validation hidden state (not the training one) to the device
                val_h = tuple(each.to(device) for each in val_h)
                #print(text_batch.size())
                output, val_h = model(text_batch, val_h)
                #print(output.size())
                val_loss = criterion(output, labels)
                val_losses.append(val_loss.item())
                #print(output)
                # convert output log-probabilities to the predicted class (0 to 4)
                pred = torch.argmax(torch.exp(output),dim=-1)  # index of the most probable class
                #print(pred,labels.size())
                
                # compare predictions to true label
                correct_tensor = pred.eq(labels.view_as(pred))
                train_on_gpu=torch.cuda.is_available()
                correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    
                number_correct += np.sum(correct)

            # accuracy over the whole validation set
            test_acc = number_correct/len(valid_labels)

            print("Epoch: {}/{}...".format(epoch+1, epochs),
                  "Step: {}...".format(steps),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)),
                  "Test accuracy: {:.3f}".format(test_acc))
            model.train()

Starting epoch 1
Epoch: 1/4... Step: 100... Loss: 1.049599... Val Loss: 1.083315 Test accuracy: 0.573
Epoch: 1/4... Step: 200... Loss: 0.891162... Val Loss: 0.952108 Test accuracy: 0.620
Epoch: 1/4... Step: 300... Loss: 0.895416... Val Loss: 0.922335 Test accuracy: 0.637
Epoch: 1/4... Step: 400... Loss: 0.862422... Val Loss: 0.877372 Test accuracy: 0.662
Epoch: 1/4... Step: 500... Loss: 0.886698... Val Loss: 0.875703 Test accuracy: 0.660
Epoch: 1/4... Step: 600... Loss: 0.804387... Val Loss: 0.848205 Test accuracy: 0.672
Epoch: 1/4... Step: 700... Loss: 0.840885... Val Loss: 0.839681 Test accuracy: 0.675
Epoch: 1/4... Step: 800... Loss: 0.757674... Val Loss: 0.823805 Test accuracy: 0.682
Epoch: 1/4... Step: 900... Loss: 0.759832... Val Loss: 0.807475 Test accuracy: 0.689
Epoch: 1/4... Step: 1000... Loss: 0.805853... Val Loss: 0.817683 Test accuracy: 0.683
Epoch: 1/4... Step: 1100... Loss: 0.757313... Val Loss: 0.806333 Test accuracy: 0.688
Epoch: 1/4... Step: 1200... Loss: 0.792271... 

## Making Predictions
### Prediction 
Now implement the `predict` function to generate the prediction vector from a message.

In [None]:
def predict(text, model, vocab):
    """ 
    Make a prediction on a single sentence.

    Parameters
    ----------
        text : The string to make a prediction on.
        model : The model to use for making the prediction.
        vocab : Dictionary for word to word ids. The key is the word and the value is the word id.

    Returns
    -------
        pred : Prediction vector
    """    
    
    # TODO Implement
    
    tokens = preprocess(text)
    #print(tokens)
    # Filter non-vocab words; a dict lookup is much faster than scanning the `filtered_words` list
    tokens_vocab = [vocab[word] for word in tokens if word in vocab]
    #print(tokens_vocab)
    tokens_tensor = torch.tensor(tokens_vocab)
    # Adding a batch dimension
    batch_size = 1
    seq_length=sequence_length
    
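    # Left pad: place the (possibly truncated) tokens at the end of the sequence.
    # If no token is in the vocabulary, the input stays all zeros.
    features = torch.zeros((seq_length, batch_size), dtype=torch.int64)
    n_tokens = min(seq_length, len(tokens_vocab))
    if n_tokens > 0:
        features[seq_length - n_tokens:, 0] = tokens_tensor[:n_tokens]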
    #print(features)
    # Get the NN output
    hidden = model.init_hidden(batch_size)
    logps, _ = model(features, hidden)
    # Take the exponent of the NN output to get a range of 0 to 1 for each label.
    pred = torch.exp(logps) 
    
    return pred

In [None]:
text = "Google is working on self driving cars, I'm bullish on $goog"
model.eval()
model.to("cpu")
predict(text, model, vocab)

tensor([[ 0.0003,  0.0607,  0.0125,  0.8170,  0.1094]])

### The model predicts positive sentiment (class 3) with a probability of about 81.7%. The remaining 18.3% is spread over the other classes.

Now we have a trained model and we can make predictions. We can use this model to track the sentiments of various stocks by predicting the sentiments of twits as they are coming in. Now we have a stream of twits. For each of those twits, pull out the stocks mentioned in them and keep track of the sentiments. Remember that in the twits, ticker symbols are encoded with a dollar sign as the first character, all caps, and 2-4 letters, like $AAPL. Ideally, you'd want to track the sentiments of the stocks in your universe and use this as a signal in your larger model(s).
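Pulling ticker symbols out of raw twit text can be sketched as follows. The pattern encodes the convention described above (a dollar sign followed by 2 to 4 capital letters); the helper name `tickers_in` and the third sample message are made up for the example.

```python
import re
from collections import defaultdict

# A dollar sign followed by 2 to 4 capital letters.
TICKER_RE = re.compile(r'\$[A-Z]{2,4}')

def tickers_in(text):
    """Return every ticker-like symbol found in the text."""
    return TICKER_RE.findall(text)

sample_twits = [
    '$AAPL Why the drop?',
    '$AMD reveal yourself!',
    'Long $AAPL and $MU into earnings',   # invented message
]

# Count how many times each symbol is mentioned.
mentions = defaultdict(int)
for text in sample_twits:
    for symbol in tickers_in(text):
        mentions[symbol] += 1

print(dict(mentions))   # {'$AAPL': 2, '$AMD': 1, '$MU': 1}
```

Note that the match has to run on the original text: `preprocess` lowercases the message and strips tickers, so the symbols are no longer available afterwards.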

## Testing
### Load the Data 

In [None]:
with open(os.path.join('..', '..', 'data', 'project_6_stocktwits', 'test_twits.json'), 'r') as f:
    test_data = json.load(f)

### Twit Stream

In [None]:
def twit_stream():
    for twit in test_data['data']:
        yield twit

next(twit_stream())

{'message_body': '$JWN has moved -1.69% on 10-31. Check out the movement and peers at  https://dividendbot.com?s=JWN',
 'timestamp': '2018-11-01T00:00:05Z'}

Let's apply the `predict` function to a stream of twits.

In [None]:
def score_twits(stream, model, vocab, universe):
    """ 
    Given a stream of twits and a universe of tickers, return sentiment scores for tickers in the universe.
    """
    for twit in stream:

        # Get the message text
        text = twit['message_body']
        # Match tickers on the raw text; a raw string avoids an invalid `\$` escape
        symbols = re.findall(r'\$[A-Z]{2,4}', text)
        score = predict(text, model, vocab)

        for symbol in symbols:
            if symbol in universe:
                yield {'symbol': symbol, 'score': score, 'timestamp': twit['timestamp']}

In [None]:
universe = {'$BBRY', '$AAPL', '$AMZN', '$BABA', '$YHOO', '$LQMT', '$FB', '$GOOG', '$BBBY', '$JNUG', '$SBUX', '$MU'}
score_stream = score_twits(twit_stream(), model, vocab, universe)

next(score_stream)

{'symbol': '$AAPL',
 'score': tensor([[ 0.1609,  0.0094,  0.0307,  0.0378,  0.7612]]),
 'timestamp': '2018-11-01T00:00:18Z'}
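One possible way to turn these per-twit probability vectors into a single number per stock is to take the expected sentiment score and average it by symbol. This is an assumption about how the signal could be aggregated, not something the notebook prescribes. The sketch uses plain lists under a hypothetical `'probs'` key; with the real stream, `item['score'][0].tolist()` would give the same kind of list from the tensor.

```python
from collections import defaultdict

def expected_sentiment(probs):
    """Collapse 5 class probabilities into one score in [-2, 2]."""
    return sum(p * (i - 2) for i, p in enumerate(probs))

def average_by_symbol(scored):
    """Average the expected sentiment of all scored twits per symbol."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in scored:
        totals[item['symbol']] += expected_sentiment(item['probs'])
        counts[item['symbol']] += 1
    return {symbol: totals[symbol] / counts[symbol] for symbol in totals}

scored = [
    # Probabilities copied from the output above.
    {'symbol': '$AAPL', 'probs': [0.1609, 0.0094, 0.0307, 0.0378, 0.7612]},
    # Made-up, fully neutral twit.
    {'symbol': '$AAPL', 'probs': [0.0, 0.0, 1.0, 0.0, 0.0]},
]
print(average_by_symbol(scored))
```

A plain average treats every twit equally; weighting by recency or by prediction confidence would be a natural variation.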

That's it. We have successfully built a model for sentiment analysis! 