# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that in this assignment, we're trying to beat the benchmark of random chance (aka better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predicting whether a stock return will be positive or negative in the same day of a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting is done. This is largely irrevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the json file. Note there are three types of data: short: 1 day return, mid: 7 day return, long 28 day return.

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs needs to be joined on the date and ticker - For each date of news publication, we want to join the corresponding corporation's news on its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.


# Your models - RNN, CNN, and RNN+CNN

For your RNN model, it needs to be based on word inputs, embedding the word inputs, encoding them with an RNN layer, and finally a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your python notebook
2. Report the peformance according to some metric
3. Compare the performance of all of these models in a table (precision and recall)
4. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
5. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.

### Good luck!

In [71]:
import json
import pandas as pd
import nltk
import itertools
import re

import numpy
import keras 
from keras import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Reshape, Conv2D, GlobalMaxPooling2D, MaxPooling2D
from keras.layers import Dense, Dropout, Activation, Conv1D, MaxPooling1D, Embedding, Flatten
from keras import optimizers
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

from sklearn.model_selection import train_test_split

In [72]:
# Download NLTK model data (you need to do this once)
nltk.download("book")

[nltk_data] Downloading collection u'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    

True

# Load Data

In [73]:
# Load Data
with open('stockReturns.json', 'r') as f:
    stockReturns = json.load(f)
news_df = pd.read_csv('news_reuters.csv', header = None, 
                      names= ['sticker', 'company', 'publication_date', 'headline', 'first_sentence', 'news_category'],
                     encoding = 'utf-8');


# Join the Data Sets

In [74]:
# join the returns and the news
results = []
short_gains = stockReturns['short']
for index, row in news_df.iterrows():
    sticker_name = row['sticker']
    publication_date = str(row['publication_date'])
    try:
        gain = short_gains[sticker_name][publication_date]
        if (gain > 0):
            results.append(1)
        else:
            results.append(0)
    except:
        results.append(-1)

news_df['outcome'] = pd.Series(results)
news_outcome = news_df[news_df['outcome'] != -1]
news_outcome = news_outcome.reset_index()

# Preprocess the news content into tokens and indicies

In [75]:
vocab_size = 8000
sentence_size = 120
# Tokenize Words
tokenized_sentences1 = [nltk.word_tokenize(sent) for sent in news_outcome['headline']]
tokenized_sentences2 = [nltk.word_tokenize(sent) for sent in news_outcome['first_sentence']]
tokenized_sentences = tokenized_sentences1 + tokenized_sentences2
words = []
for tokens in tokenized_sentences: 
    for token in tokens:
        # remove numbers 
        if re.search('\d', token) == None:
            words.append(token)
word_freq = nltk.FreqDist(words)
vocab = word_freq.most_common(vocab_size - 1)
index_to_word = [x[0] for x in vocab]
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
X = []
for i in range(0, len(tokenized_sentences1)):
    row = []
    for token in tokenized_sentences1[i]:
        index = word_to_index.get(token)
        if index != None: 
            row.append(index)
    for token in tokenized_sentences2[i]:
        index = word_to_index.get(token)
        if index != None:
            row.append(index)
    X.append(row)
    
X = sequence.pad_sequences(X, maxlen=sentence_size)
y = news_outcome['outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model 1: RNN

In [76]:
def rnn():
    embedding_vecor_length = 32
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_vecor_length, input_length=len(X_train[0])))
    model.add(LSTM(100))
    # model.add(Dense(10, activation='sigmoid'))
    # model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model
rnn_model = rnn()
rnn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_51 (Embedding)     (None, 120, 32)           256000    
_________________________________________________________________
lstm_15 (LSTM)               (None, 100)               53200     
_________________________________________________________________
dense_63 (Dense)             (None, 1)                 101       
Total params: 309,301
Trainable params: 309,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 24722 samples, validate on 6181 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x139e09c10>

In [77]:
scores = rnn_model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 64.89%


## Model 2: CNN

In [78]:
embedding_dimension = 100
def cnn():
    model = Sequential()

    model.add(Embedding(input_dim = vocab_size, output_dim = embedding_dimension, input_length = sentence_size))
    model.add(Reshape((sentence_size, embedding_dimension, 1), input_shape = (sentence_size, embedding_dimension)))
    model.add(Conv2D(filters = 50, kernel_size = (5, embedding_dimension), strides = (1,1), padding = 'valid'))
    model.add(GlobalMaxPooling2D())

    model.add(Dense(20))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10))
    model.add(Activation('relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    adam = optimizers.Adam(lr = 0.001)

    model.compile(loss='binary_crossentropy', optimizer=adam , metrics=['accuracy'])
    print(model.summary())
    return model

In [79]:
cnn_model = cnn()
cnn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_52 (Embedding)     (None, 120, 100)          800000    
_________________________________________________________________
reshape_27 (Reshape)         (None, 120, 100, 1)       0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 116, 1, 50)        25050     
_________________________________________________________________
global_max_pooling2d_17 (Glo (None, 50)                0         
_________________________________________________________________
dense_64 (Dense)             (None, 20)                1020      
_________________________________________________________________
activation_38 (Activation)   (None, 20)                0         
_________________________________________________________________
dropout_13 (Dropout)         (None, 20)                0         
__________

<keras.callbacks.History at 0x1221dbbd0>

In [80]:
cnn_scores = cnn_model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (cnn_scores[1]*100))

Accuracy: 69.00%


## Model 3: RNN+CNN

In [83]:
def cnn_and_rnn():
    embedding_dimension = 100
    sentence_size = 120
    
    model_cnn = Sequential()

    model_cnn.add(Embedding(input_dim = vocab_size, output_dim = embedding_dimension, input_length = sentence_size))
    model_cnn.add(Reshape((sentence_size, embedding_dimension, 1), input_shape = (sentence_size, embedding_dimension)))
    model_cnn.add(Conv2D(filters = 50, kernel_size = (5, embedding_dimension), strides = (1,1), padding = 'valid'))
    model_cnn.add(GlobalMaxPooling2D())
    
    model_cnn.add(Dense(20))
    model_cnn.add(Activation('relu'))
    print('CNN Branch Architecture ------------------------------------------------------')
    print('------------------------------------------------------------------------------')
    print(model_cnn.summary())
    
    model_rnn = Sequential()
    model_rnn.add(Embedding(vocab_size, embedding_vecor_length, input_length=len(X_train[0])))
    model_rnn.add(LSTM(100))
    model_rnn.add(Dense(20))
    model_rnn.add(Activation('relu'))
    print('RNN Branch Architecture ------------------------------------------------------ ')
    print('------------------------------------------------------------------------------ ')
    print(model_rnn.summary())
    
    model = Sequential()
    model.add(keras.layers.Merge([model_cnn, model_rnn], mode='concat'))
    model.add(Dropout(0.3))
    model.add(Dense(10, activation = 'relu'))
    model.add(Dense(1, activation='sigmoid'))

    adam = optimizers.Adam(lr = 0.001)
    model.compile(loss='binary_crossentropy', optimizer=adam , metrics=['accuracy'])
    print('Merged Model Architecture ---------------------------------------------------- ')
    print('------------------------------------------------------------------------------ ')   
    print(model.summary())
    return model
merged_model = cnn_and_rnn()
merged_model.fit([X_train, X_train], y_train, validation_data=([X_test, X_test], y_test), epochs=5, batch_size=256)

CNN Branch Architecture ------------------------------------------------------
------------------------------------------------------------------------------
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_53 (Embedding)     (None, 120, 100)          800000    
_________________________________________________________________
reshape_28 (Reshape)         (None, 120, 100, 1)       0         
_________________________________________________________________
conv2d_28 (Conv2D)           (None, 116, 1, 50)        25050     
_________________________________________________________________
global_max_pooling2d_18 (Glo (None, 50)                0         
_________________________________________________________________
dense_67 (Dense)             (None, 20)                1020      
_________________________________________________________________
activation_41 (Activation)   (None, 20)           



Train on 24722 samples, validate on 6181 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1208c6cd0>

In [86]:
merged_scores = merged_model.evaluate([X_test, X_test], y_test, verbose=0)
print("Accuracy: %.2f%%" % (merged_scores[1]*100))

Accuracy: 68.45%


# Evaluation 

In [88]:
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(test_labels, predictions): 
    precision = precision_score(test_labels, predictions, average='macro') 
    recall = recall_score(test_labels, predictions, average='macro') 
    return [precision, recall]
    #print("Precision: {:.4f}, Recall: {:.4f}".format(precision, recall))

cnn_predicted = cnn_model.predict_classes(X_test)
rnn_predicted = rnn_model.predict_classes(X_test)
merged_predicted = merged_model.predict_classes([X_test, X_test])

In [92]:
cnn_metrics = evaluate(cnn_predicted, y_test)
rnn_metrics = evaluate(rnn_predicted, y_test)
merged_metrics = evaluate(merged_predicted, y_test)
data = {'metrics': ['Precision', 'Recall'] ,'CNN': cnn_metrics, 'RNN': rnn_metrics, 'MERGED': merged_metrics}
metrics_df = pd.DataFrame(data=data)
print(metrics_df.set_index('metrics'))

                CNN    MERGED       RNN
metrics                                
Precision  0.690018  0.684512  0.648926
Recall     0.690018  0.685246  0.648997


# The Good and The Bad - Classifications 

In [105]:
def get_sentence(X_test, index):
    word_indicies = X_test[index]
    words = []
    for word_index in word_indicies:
        if (word_index != 0):
            words.append(index_to_word[word_index])
    return ' '.join(words)

## CNN

In [109]:
print('Good Classification')
print('')
print(get_sentence(X_test, 0))
print('')
print(get_sentence(X_test, 2))


Good Classification

Nvidia shows off smaller artificial intelligence computer for car Sept U.S. chipmaker Nvidia Corp showed off on Monday a smaller and more efficient artificial intelligence computer for self-driving cars saying it would power 's mapping and autonomous vehicle technology .

FDA approves longer-term use of AstraZeneca blood thinner AstraZeneca Plc on Thursday said the U.S. Food and Drug Administration approved a new dose of its blood thinner Brilinta intended for longer-term use in patients with a history of heart attack or a condition known as .


In [110]:
print('Bad Classification')
print('')
print(get_sentence(X_test, 1))
print('')
print(get_sentence(X_test, 3))

Bad Classification

MarkWest shareholder says he opposes MPLX deal Nov A shareholder of natural gas processor MarkWest Energy Partners LP John Fox said he was opposed refiner Marathon Petroleum Corp 's proposed $ billion acquisition of the company through its pipeline unit MPLX LP .

Wall Street up on jobs data off Greek default NEW YORK Stocks advanced on Friday as investors off the technical default by Greece and focused instead on another strong monthly jobs report .


### CNN did a poor job on the second bad classification because it contained a mix of good and bad sentiment words. The first bad classification 

## RNN

In [116]:
print('Good Classification')
print('')
print(get_sentence(X_test, 6))
print('')
print(get_sentence(X_test, 8))

Good Classification

Chipotle Massachusetts shut after workers fall ill Chipotle Mexican Grill Inc which is trying recover from a series of food-borne illness outbreaks temporarily shut a Massachusetts restaurant after four employees fell sick .

Allergan says it stands by statements on Valeant BOSTON Oct Allergan Inc said on Tuesday that it believes there is no evidence support Valeant Pharmaceuticals and hedge fund Pershing Square Capital Management 's claims that its chief executive officer a campaign spread about Valeant .


In [121]:
print('Bad Classification')
print('')
print(get_sentence(X_test, 5))
print('')
print(get_sentence(X_test, 20))

Bad Classification

BRIEF-Vodafone CEO urges close scrutiny of BT deal * BT deal dominant player in Britain requires scrutiny

UPDATE Garden Red Lobster kids ' menus * Move comes amid calls help reduce obesity


### RNN seems to bias toward a couple of positive words 

# RNN + CNN

In [132]:
print('Good Classification')
print('')
print(get_sentence(X_test, 31))
print('')
print(get_sentence(X_test, 32))

Good Classification

UPDATE hands out $ million of shares in ' pay awards * bonuses for top managers scrapped ( Adds details of 's award )

UPDATE 's net profit down pct on weak home front * Shares up percent vs index up pct


In [133]:
print('Bad Classification')
print('')
print(get_sentence(X_test, 30))
print('')
print(get_sentence(X_test, 37))

Bad Classification

BRIEF-Allergan Teva entered amendment master purchase agreement * Co and pharmaceutical industries entered into an amendment dated master purchase agreement

ON THE MOVE-Morgan Stanley hires three brokers from Citi Barclays Sept Morgan Stanley the world 's largest retail brokerage by its number of advisers said it hired two brokers from Citigroup Inc 's private banking unit .


### These bad classifications doesn't contain any straight forward positive sentiments

# Three Most Probable Predicted Stocks

# CNN

In [161]:
stickers = news_outcome['company'].tolist()
def top_three_stickers(model, data):
    predicted = model.predict(data)
    index1 = numpy.argmax(predicted)
    predicted[index1] = -1;
    index2 = numpy.argmax(predicted)
    predicted[index2] = -1;
    index3 = numpy.argmax(predicted)
    top_stickers = []
    top_stickers.append(stickers[index1])
    top_stickers.append(stickers[index2])
    top_stickers.append(stickers[index3])
    return top_stickers 

In [162]:
cnn_top = top_three_stickers(cnn_model, X)
rnn_top = top_three_stickers(rnn_model, X)
cnn_and_rnn = top_three_stickers(merged_model, [X, X])

In [164]:
top_companies = {'CNN': cnn_top, 'RNN': rnn_top, 'CNN and RNN': cnn_and_rnn}
top_companies_df = pd.DataFrame(data=top_companies)
print(top_companies_df)

            CNN                 CNN and RNN                     RNN
0     Apple Inc          Vodafone Group Plc                  BRF SA
1  Allergan plc  McDonald&#39;s Corporation  Intuitive Surgical Inc
2     Amgen Inc  McDonald&#39;s Corporation  Intuitive Surgical Inc
