
# Project 6: Analyzing Stock Sentiment from Twits

## Instructions


Each problem consists of a function to implement and instructions on how to implement the function. The parts of the function that need to be implemented are marked with a # TODO comment. After implementing the function, run the cell to test it against the unit tests we've provided. For each problem, we provide one or more unit tests from our project_tests package. These unit tests won't tell you if your answer is correct, but will warn you of any major errors. Your code will be checked for the correct solution when you submit it to Udacity.



## Packages
When you implement the functions, you'll only need to use the packages you've used in the classroom, such as PyTorch and NLTK. These packages will be imported for you. We recommend you don't add any import statements; otherwise, the grader might not be able to run your code.

The other packages that we're importing are `project_helper` and `project_tests`. These are custom packages built to help you solve the problems. The `project_helper` module contains utility functions and graph functions. The `project_tests` module contains the unit tests for all the problems.

In [None]:
import json
import re
import nltk
import torch
import random
import project_tests
import project_helper



## Introduction (~Subject to change) 

**Parni's suggestions** 

In this project, your aim is to build a model that can predict the sentiment of the text in the twits. 

To do this, follow these four steps:
 1. Import the data
 2. Preprocess the data
 3. Build your neural network model
 4. Make predictions with your model


**End** 


You've been considering using sentiment around specific stocks in your models. You've been subscribed to StockTwits for a while to stay up to date on trading news. You can collect these twits (similar to tweets) and now you want to build a model that can predict the sentiment of the text in the twits. Many of the existing models perform feature extraction manually, basically assigning sentiment scores to individual words by hand. Instead you'd like to train a neural network to learn the features itself then have it predict sentiment. This means you'll need labeled data.

You collected a bunch of twits, then hand labeled the sentiment of each with the help of some interns. You wanted to capture the degree of sentiment so you decided to use a five-point scale: very negative, negative, neutral, positive, very positive. Each tweet is labeled -2 to 2 in steps of 1, from very negative to very positive. 

Here then, you'll build a sentiment analysis model that will learn to assign sentiment to tweets on its own, using this labeled data.

 <img src="image.png">

 <img src="image2.png">


Load in the twits. This is a JSON object with a structure like this:

```
{'data': [
  {'message_body': 'Tweet body text here',
   'sentiment': 0},
  {'message_body': 'Happy tweet body text here',
   'sentiment': 1},
   ...
]}
```

## 1. Import Twits

In [1]:
import json

In [2]:
with open('twits.json', 'r') as f:  # twits.json holds the hand-labeled twits
    twits = json.load(f)

In [3]:
twits['data'][0:10]  # look at the first 10 tweets in the list

[{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG',
  'sentiment': 0,
  'timestamp': '2012-01-01T00:06:01Z'},
 {'message_body': '$GOOG http://stks.co/1jQs Many market leaders appear extended. Are these moves sustainable? I think not.',
  'sentiment': -1,
  'timestamp': '2012-01-01T00:18:17Z'},
 {'message_body': '"Deconstructing A Trade: $AAPL 12/29/2011"-New Blog Post.  Yeah it\'s New Year\'s Eve but I\'m married with kids http://t.co/6VV31tBY $STUDY',
  'sentiment': 0,
  'timestamp': '2012-01-01T00:26:18Z'},
 {'message_body': 'My prediction for 2012 is that the $spx and $djia ($spy and $dia) will make all time highs. And $aapl right along with them.',
  'sentiment': 2,
  'timestamp': '2012-01-01T00:30:36Z'},
 {'message_body': 'RT @bclund &quot;Deconstructing A Trade: $AAPL 12/29/2011&quot;New Blog Post. Yeah it&#39;s NY&#39;s Eve but I&#39;m married with kids http://stks.co/1jQy $STUDY',
  'sentiment': 0,
  'timestamp'

Print out one tweet and look at the fields. Each tweet has these fields:

* `'message_body'`: The actual text in the tweet
* `'sentiment'`: Score on the sentiment of the tweet, ranges from -2 to 2 in steps of 1, with 0 being neutral
* `'timestamp'`: When the tweet was posted

Remember that we want our network to look at some text and predict the sentiment. Our training input will be the message bodies, and we can use the sentiment score as training label for our data.

In [4]:
len(twits['data'])

3000000

We have 3 million tweets overall. For development purposes, let's use only the first 2 million tweets.

In [5]:
data = twits['data'][:2000000]

In [6]:
# This is the data we'll train & test on
messages = [twit['message_body'] for twit in data]
# Adding 2 here to scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in data]


## 2. Preprocessing


With our data in hand, we need to preprocess our text. These tweets were collected by filtering on ticker symbols, which are denoted with a leading `$` symbol in the tweet itself. For example,

`{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG',
 'sentiment': 0}`


The ticker symbols don't provide information on the sentiment, and they are in every tweet, so we should remove them. This tweet also has the `@google` username, which again provides no sentiment information, so we should remove it too. We also see a URL, `http://t.co/sptHOAh8`; let's remove URLs as well.



<img src="image3.png">

The easiest way to remove specific words or phrases is with regex, the `re` module. You can sub out specific patterns with a space:

```python
re.sub(pattern, ' ', text)
```
This replaces every match of the pattern in the text with a space. Later, when we tokenize the text, we'll split on those spaces.
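
As a quick illustration of this substitution, the sketch below strips a URL from a made-up message and then splits on whitespace. The sample text and the `url_pattern` name are assumptions for illustration only. Note the raw string (`r'...'`), which passes the backslash through to the regex engine unchanged.

```python
import re

# Hypothetical sample message, not taken from the dataset
sample = "Big move today http://example.com/abc worth watching"

# Raw string so that \S reaches the regex engine unchanged
url_pattern = r'http\S+'

# Replace every URL with a single space
cleaned = re.sub(url_pattern, ' ', sample)

# Splitting on whitespace discards the extra spaces left behind
tokens = cleaned.split()
print(tokens)
```

Because `split()` with no arguments collapses runs of whitespace, the extra spaces left by the substitution don't produce empty tokens.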

In [7]:
import re
import nltk
nltk.download('wordnet')
wnl = nltk.stem.WordNetLemmatizer()

def preprocess(message):
    ''' This function takes a string as input, then performs these operations: 
        * lowercase
        * remove URLs
        * remove ticker symbols
        * remove usernames
        * remove punctuation and numbers
        * tokenize
        * lemmatize
        * remove single-character tokens
    ''' 
    #TODO: Implement function
    
    # Lowercase 
    text = message.lower()
    
    # Match and remove URLs
    text = re.sub(r'http\S+', ' ', text)
    
    # Match and remove ticker symbols that start with $
    text = re.sub(r'\$\S{2,5}', ' ', text)
    
    # Match and remove twitter usernames that start with @
    text = re.sub(r'@\S+', ' ', text)

    # Replace punctuation and numbers (anything not a letter) with spaces
    text = re.sub('[^A-Za-z]', ' ', text)
    
    # Tokenize by splitting the string on whitespace
    tokens = text.split()
    
    # Lemmatize and remove any tokens with only one character
    tokens = [wnl.lemmatize(token, pos='n') for token in tokens if len(token) > 1]
    
    return tokens

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/pbarekatain/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now we can preprocess each of the twits in our dataset. This will take a while since we have millions of twits.

In [8]:
tokenized = [preprocess(message) for message in messages]

### Collect the vocabulary

Now with all of our messages tokenized, we want to create a vocabulary and count up how often each word appears in our entire corpus.

In [None]:
from collections import Counter

In [None]:
# Bag of words
bow = Counter()
for tokens in tokenized:
    bow.update(tokens)

#### Remove common and rare words

With our vocabulary, we'll now remove some of the most common words such as 'the', 'and', 'it', etc. These words don't contribute to identifying sentiment and are really common, resulting in a lot of noise in our input. If we can filter these out, then our network should have an easier time learning. It's up to you to decide how many to remove.

We also want to remove really rare words that show up in only a few tweets. Here you'll want to divide the count of each word by the number of messages, then remove words that only appear in some small fraction of the messages. Again, it's up to you how much you want to keep.

In [None]:
## Filter high and low frequency words
# Frequency of words appearing in messages
freqs = {key: val/len(tokenized) for key, val in bow.items()} 
low_cutoff = 0.000005 # frequency cutoff, this is probably better as a percent of the low frequency words

# Filter high frequency words
# Here I'm setting the number of high frequency words instead of a frequency threshold.
# The distribution of word counts is peaked at the most frequent words with a really long tail,
# so it's usually better to just cut off the first K words
high_cutoff = 10 # K high frequency words 
k_most_common = {each[0] for each in bow.most_common(high_cutoff)}
print(k_most_common)

filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in k_most_common)]
len(filtered_words)

{'for', 'of', 'on', 'is', 'to', 'in', 'quot', 'the', 'and', 'it'}


22810

In [None]:
# Word ids start at 1 so that 0 stays free for sequence padding
vocab = {word: word_id for word_id, word in enumerate(filtered_words, 1)}
id2vocab = {val: key for key, val in vocab.items()}

In [None]:
# Go through all the data and remove words that aren't in our vocab
filtered = [[token for token in message if token in vocab] for message in tokenized]

#### Balancing the classes

Let's do a few last pre-processing steps. If we look at how our tweets are labeled, we'll find that 50% of them are neutral. This means that our network will be 50% accurate just by guessing 0 every single time. To help our network learn appropriately, we'll want to balance our classes. That is, make sure each of our different sentiment scores show up roughly as frequently in the data. What we can do here is go through each of our examples and randomly drop tweets with neutral sentiment. What should be the probability we drop these tweets if we want to get around 20% neutral tweets starting at 50% neutral?

We should also take this opportunity to remove messages with length 0.
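
Returning to the drop-probability question: if we keep every non-neutral tweet and want neutral tweets to make up a fraction `target` of the balanced set, the number of neutral tweets to keep is `target / (1 - target)` times the non-neutral count. The sketch below works through the 50% to 20% case with made-up counts; `n_total`, `n_other`, `target`, and `n_keep` are illustrative names, not part of the project code.

```python
# Made-up counts for illustration: 1,000 tweets, half of them neutral
n_total = 1000
n_neutral = 500
n_other = n_total - n_neutral

# Target fraction of neutral tweets after balancing
target = 0.2

# Keeping k neutral tweets gives a neutral fraction of k / (k + n_other).
# Setting that equal to target and solving for k:
n_keep = target / (1 - target) * n_other

# Probability of keeping (or dropping) any single neutral tweet
keep_prob = n_keep / n_neutral
drop_prob = 1 - keep_prob
print(keep_prob, drop_prob)
```

For a target of 20%, `target / (1 - target)` is 1/4, which matches the division by 4 in the `keep_prob` expression used in this notebook. With half the tweets neutral, that means keeping roughly one neutral tweet in four.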

In [None]:
n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral

In [None]:
import random
balanced = {'messages': [], 'sentiments':[]}

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if len(message) == 0:
        # skip this message because it has length zero
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment)

In [None]:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

0.1976361248285764

Finally let's convert our tokens into integer ids which we can pass to the network.

In [None]:
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

## 3. Building the Neural Network

#### Add description with images of RNN (~Subject to change + diagram)
#### 1. TextClassifier -- already made
#### 2. Forward Pass -- ask student to build this - know more about LSTM exercise 
#### 3. Hidden Layer -- ask student to build this 
#### 4. Make our models -- ask student to build ( tell them the input that your layer get) === embed_size, lstm_size, output_size

Now we have our vocabulary which means we can transform our tokens into ids, which are then passed to our network. So, let's define the network now!

TODO: Have a nice diagram showing the network we'd like to build

#### Embed -> RNN -> Dense -> Softmax



In [None]:
import torch
from torch import nn, optim
import torch.nn.functional as F

In [None]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.lstm_layers = lstm_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers, dropout=dropout)
        
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        
    def forward(self, input, hidden):
        embed = self.embedding(input)
        #print(embed.shape)
        lstm_out, hidden = self.lstm(embed, hidden)
        
        # We just want the last sequence step output which we'll use to classify
        # lstm_out shape is (sequence_steps, batch_size, hidden_units)
        #print(lstm_out)
        lstm_out = lstm_out[-1, :, :]
        #print(lstm_out)

        # Apply dropout to the LSTM output, then the fully-connected output layer
        out = self.fc(self.dropout(lstm_out))
        
        logps = F.log_softmax(out, dim=1)
        
        return logps, hidden
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        hidden = (weight.new_zeros(self.lstm_layers, batch_size, self.lstm_size),
                  weight.new_zeros(self.lstm_layers, batch_size, self.lstm_size))
        
        return hidden

In [None]:
model = TextClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
hidden = model.init_hidden(4)

logps, _ = model.forward(input, hidden)
print(logps)

tensor([[-1.7422, -1.4971, -1.8657, -1.7451, -1.3031],
        [-1.6957, -1.4752, -1.8318, -1.4041, -1.7033],
        [-1.7506, -1.7483, -1.8730, -1.4560, -1.3264],
        [-1.8018, -1.5647, -1.9210, -1.4892, -1.3711]],
       grad_fn=<LogSoftmaxBackward>)


Now we should build a generator that we can use to loop through our data. It'll be more efficient if we can pass our sequences in as a batch, that is some number of sequences all at the same time. Our input tensors should look like `(sequence_length, batch_size)`. So if our sequences are 40 tokens long and we pass in 25 sequences, then we'd have an input size of `(40, 25)`.

If we set our sequence length to 40, what do we do with messages that are more or less than 40 tokens? For messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to **left** pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we'll just keep the first 40 tokens.
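
The padding and truncation rule can be sketched on plain Python lists before moving to tensors. The `left_pad` helper below is a hypothetical name used only for this illustration, and it uses a sequence length of 6 to keep the output short.

```python
def left_pad(token_ids, sequence_length):
    """Left-pad with zeros or truncate to exactly sequence_length ids."""
    # Keep only the first sequence_length tokens of long messages
    kept = token_ids[:sequence_length]
    # Fill the front of short messages with the padding id 0
    padding = [0] * (sequence_length - len(kept))
    return padding + kept

short = left_pad([5, 8, 3], sequence_length=6)
long = left_pad([1, 2, 3, 4, 5, 6, 7, 8], sequence_length=6)
print(short)  # [0, 0, 0, 5, 8, 3]
print(long)   # [1, 2, 3, 4, 5, 6]
```

This is why the vocabulary ids need to leave 0 unused: the network has to be able to tell padding apart from real words.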

In [None]:
def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
    
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]
    
    total_sequences = len(messages)
    
    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])

        yield batch, label_tensor

In [None]:
valid_split = int(len(token_ids)*.9)

train_features = token_ids[:valid_split]
valid_features = token_ids[valid_split:]

train_labels = sentiments[:valid_split]
valid_labels = sentiments[valid_split:]

In [None]:
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The embedding layer needs len(vocab) + 1 rows because id 0 is reserved for padding
model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
epochs = 5
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)
print_every = 100
batch_size = 256

model.to(device)
for epoch in range(epochs):
    print(f"Starting epoch {epoch+1}")
    steps = 0
    running_loss = 0
    for text_batch, labels in dataloader(train_features, train_labels, 
                                         batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        
        # We're not passing the hidden state between batches, so create a new one here
        hidden = model.init_hidden(labels.shape[0])
        
        text_batch, labels = text_batch.to(device), labels.to(device)
        # .to() returns a new tensor rather than moving it in place, so keep the result
        hidden = tuple(each.to(device) for each in hidden)
        
        optimizer.zero_grad()
        logps, hidden = model(text_batch, hidden)
        
        loss = criterion(logps, labels)
        loss.backward()
        
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), 5)
        
        optimizer.step()
        
        running_loss += loss.item()
        
        if steps % print_every == 0:
            model.eval()
            val_steps = 0
            val_loss, accuracy = 0, 0
            # Gradients aren't needed for validation
            with torch.no_grad():
                for text_batch, labels in dataloader(valid_features, valid_labels, batch_size=batch_size):
                    val_steps += 1
                    # We're not passing the hidden state between batches, so create a new one here
                    hidden = model.init_hidden(labels.shape[0])

                    text_batch, labels = text_batch.to(device), labels.to(device)
                    # .to() returns a new tensor, so keep the result
                    hidden = tuple(each.to(device) for each in hidden)

                    logps, hidden = model(text_batch, hidden)
                    loss = criterion(logps, labels)

                    val_loss += loss.item()

                    topval, topclass = torch.exp(logps).topk(1)
                    # .item() gives a Python int, so the accuracy below is a float percentage
                    accuracy += torch.sum(topclass.squeeze() == labels).item()

            print(f"Train loss: {running_loss/print_every:.3f}, Val. Loss: {val_loss/val_steps:.3f}, Val Acc.: {100*accuracy/len(valid_labels):.3f}")
            model.train()
            running_loss = 0

Starting epoch 1
Train loss: 1.218, Val. Loss: 1.072, Val Acc.: 59.000
Train loss: 0.952, Val. Loss: 0.968, Val Acc.: 63.000
Train loss: 0.867, Val. Loss: 0.912, Val Acc.: 66.000
Train loss: 0.818, Val. Loss: 0.879, Val Acc.: 68.000
Train loss: 0.784, Val. Loss: 0.868, Val Acc.: 68.000
Train loss: 0.758, Val. Loss: 0.854, Val Acc.: 69.000
Train loss: 0.749, Val. Loss: 0.834, Val Acc.: 70.000
Train loss: 0.752, Val. Loss: 0.834, Val Acc.: 69.000
Train loss: 0.737, Val. Loss: 0.817, Val Acc.: 70.000
Train loss: 0.729, Val. Loss: 0.816, Val Acc.: 70.000
Train loss: 0.712, Val. Loss: 0.803, Val Acc.: 70.000
Train loss: 0.715, Val. Loss: 0.796, Val Acc.: 71.000
Train loss: 0.716, Val. Loss: 0.799, Val Acc.: 70.000
Train loss: 0.706, Val. Loss: 0.807, Val Acc.: 71.000
Train loss: 0.705, Val. Loss: 0.802, Val Acc.: 71.000
Train loss: 0.702, Val. Loss: 0.807, Val Acc.: 71.000
Train loss: 0.697, Val. Loss: 0.799, Val Acc.: 71.000
Train loss: 0.703, Val. Loss: 0.806, Val Acc.: 71.000
Train loss:

## 4. Making Predictions

Okay, now that you have a trained model, try it on some new tweets and see if it works appropriately. Remember that for any new text, you'll need to preprocess it first before passing it to the network. You should also think about how to handle input words that aren't in your vocabulary.

We also want to use these sentiment scores in a larger ensemble model which you'll be learning about next. For this, we'll need the output to be some continuous value, typically on a scale of -3 to 3. Our model is predicting the probability on this discrete five-value scale. Since we have a probability distribution, a good way to convert this to a continuous value is to use the expected value of the score. The expected value $\bar{s}$ is the sum of each score $s_i$ multiplied by the probability $p_i$ of getting that score.

$$
\large \bar{s} = \sum_i p_i s_i
$$

This is nice because it captures uncertainty in our model's predictions. For example, if it predicts 50% in positive ($s = 1$) and 50% in strongly positive ($s=2$), the expected value will be in between at 1.5.
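
The example above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, with hand-written probabilities standing in for a model's output; the `expected_score` helper is a hypothetical name.

```python
# Sentiment scores for the five classes, from very negative to very positive
scores = [-2, -1, 0, 1, 2]

def expected_score(probs, scores=scores):
    """Sum of each score weighted by its predicted probability."""
    return sum(p * s for p, s in zip(probs, scores))

# 50% positive (s = 1) and 50% very positive (s = 2)
probs = [0.0, 0.0, 0.0, 0.5, 0.5]
expected = expected_score(probs)
print(expected)  # 1.5
```

A prediction that puts all of its probability on one class gives back exactly that class's score, so the expectation only moves between classes when the model is uncertain.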

In [None]:
def predict(text, model, vocab):
    ''' Return the expected sentiment score (-2 to 2) for a text,
        or None if none of its tokens are in the vocabulary. '''
    tokens = preprocess(text)
    # Filter non-vocab words and convert to ids
    tokens = [vocab[word] for word in tokens if word in vocab]
    if len(tokens) == 0:
        return None
        
    # Adding a batch dimension
    text_input = torch.tensor(tokens).view(-1, 1)
    hidden = model.init_hidden(1)
    
    # Turn off dropout and gradient tracking for inference
    model.eval()
    with torch.no_grad():
        logps, _ = model(text_input, hidden)
    ps = torch.exp(logps)

    # Sentiment expectation
    expectation = torch.dot(torch.tensor([-2., -1., 0., 1., 2.]), ps.squeeze())
    
    return expectation.item()

In [None]:
text = "Good earnings this year, I'm bullish on $goog"
model.to("cpu")
predict(text, model, vocab)

Okay, last part. Now we have a trained model and we can make predictions. We can use this model to track the sentiments of various stocks by predicting the sentiments of twits as they are coming in. Now we have a stream of twits. For each of those twits, pull out the stocks mentioned in them and keep track of the sentiments. Remember that in the twits, ticker symbols are encoded with a dollar sign as the first character, all caps, and 2-4 letters, like $AAPL. Ideally, you'd want to track the sentiments of the stocks in your universe and use this as a signal in your larger model(s).
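
To make the ticker-extraction step concrete, here is a minimal sketch using `re.findall` on a made-up twit. The sample text, the `ticker_pattern` name, and the tiny `universe` set are assumptions for illustration.

```python
import re

# Hypothetical twit text for illustration
text = "Adding to $AAPL and $FB, trimming $goog ahead of earnings"

# A dollar sign followed by 2 to 4 capital letters
ticker_pattern = r'\$[A-Z]{2,4}'

symbols = re.findall(ticker_pattern, text)
print(symbols)  # ['$AAPL', '$FB']

# Keep only the tickers we are tracking
universe = {'$AAPL', '$GOOG'}
tracked = [symbol for symbol in symbols if symbol in universe]
print(tracked)  # ['$AAPL']
```

The lowercase `$goog` is not matched because the pattern only accepts capital letters. The pattern also has no end boundary, so a longer symbol such as `$GOOGL` would be matched as `$GOOG`; whether that matters depends on the universe of tickers you choose.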

**Note from Mat:** I'm leaving this to Brok and Parnian to finish up. At this point we need to determine the universe of stocks/tickers students will use. I'm using data from the original StockTwits dataset, but we probably want something else. In this last part, we want to pretend students are getting a stream of twits and classifying them. Justin said the final output of the model should be tickers, timestamps, and sentiment scores.

In [None]:
with open('test_twits.json', 'r') as f:
    test_data = json.load(f)

In [None]:
# Provide this stream for students
def twit_stream():
    for twit in test_data['data'][:1000]:
        yield twit

In [None]:
next(twit_stream())

In [None]:
def score_twits(stream, model, vocab, universe):
    """ Given a stream of twits and a universe of tickers, yield a
        (ticker, sentiment score, timestamp) tuple for each ticker in the universe. """
    
    for twit in stream:
        
        text = twit['message_body']
        symbols = re.findall(r'\$[A-Z]{2,4}', text)

        score = predict(text, model, vocab)
        
        for symbol in symbols:
            if symbol in universe:
                # Yield the matched ticker, not the whole list of symbols
                yield symbol, score, twit['timestamp']
        

In [None]:
universe = {'$BBRY', '$AAPL', '$AMZN', '$BABA', '$YHOO', '$LQMT', '$FB', '$GOOG', '$BBBY', '$JNUG', '$SBUX', '$MU'}
score_stream = score_twits(twit_stream(), model, vocab, universe)

In [None]:
next(score_stream)

#### Ideas 
 1. Add a diagram 
 2. Add a Stock Twits - maybe record the video to watch how to train stock 
 3. Add a storyboard of what is happening 
 4. Break the code into pieces 