# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that the benchmark we're trying to beat is random chance (that is, better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predict whether a stock's return will be positive or negative on the same day as a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. The `news_reuters.csv` file stores this news data in columns: `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment, it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting was done. This is largely irrelevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.
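
As one possible cleaning step, the sketch below strips a leading location (and an optional "Month day" date) from a first sentence. The regular expression and the `strip_dateline` name are assumptions made for illustration, not part of the assignment; the pattern is a heuristic that treats leading all-caps words as the location, so it will also remove legitimate all-caps words such as "US" at the start of a sentence.

```python
import re

# Hypothetical heuristic: one or more leading ALL-CAPS words (the location),
# optionally followed by a "Month day" date such as "July 8".
_DATELINE = re.compile(
    r"^(?:[A-Z][A-Z.\-/]+\s+)+"
    r"(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{1,2}\s+)?"
)


def strip_dateline(sentence):
    """Remove a leading location/date prefix if the heuristic matches."""
    return _DATELINE.sub("", sentence, count=1)


cleaned = strip_dateline("NEW YORK July 8 Wall Street heads into earnings")
print(cleaned)
```

Sentences that do not start with an all-caps word, such as the bullet-style ones beginning with `*`, pass through unchanged.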

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared to the S&P and then judged on whether it is outperforming (positive value) or underperforming (negative value) the S&P. Each value is reported at a daily interval from 2004 to now.

Below is a diagram of the data in the JSON file. Note that there are three types of data: short (1-day return), mid (7-day return), and long (28-day return).

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```
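
To make the nesting concrete, here is a minimal sketch that flattens one term of such a structure into `(ticker, date, value)` rows. The dictionary below is a made-up stand-in for the parsed JSON, and `flatten_returns` is a hypothetical helper name; the date strings are assumed to use the same `YYYYMMDD` format as the news data.

```python
# Toy stand-in for the parsed JSON; tickers, dates, and values are made up.
stock_returns = {
    "short": {"AA": {"20110707": 0.012, "20110708": -0.004}},
    "mid": {"AA": {"20110707": 0.030}},
}


def flatten_returns(returns, term):
    """Return a list of (ticker, date, value) rows for one term."""
    return [
        (ticker, date, value)
        for ticker, by_date in returns[term].items()
        for date, value in by_date.items()
    ]


rows = flatten_returns(stock_returns, "short")
for row in rows:
    print(row)
```

A flat list of rows like this is easier to load into a table and join against the news data than the nested dictionary.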

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```
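
The pseudocode above could be written with NumPy roughly as follows. The return values are made up for illustration.

```python
import numpy as np

# Made-up relative returns: negative, exactly zero, positive.
returns = np.array([-0.02, 0.0, 0.013])

# 1 where the return is non-negative, 0 where it is negative.
labels = (returns >= 0).astype(int)
print(labels)
```

Note that a return of exactly zero falls in the positive class under this rule.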

Finally, this data needs to be joined on the date and ticker: for each date of news publication, we want to join the corresponding corporation's news with its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.
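
One way to express this join is a pandas merge on both keys. The sketch below uses made-up rows and hypothetical column names (`ticker`, `date`, `ret`), and it assumes the returns have already been flattened to one row per ticker and date.

```python
import pandas as pd

# Made-up rows standing in for the cleaned news data.
news = pd.DataFrame({
    "ticker": ["AA", "AA", "XYZ"],
    "date": ["20110707", "20110708", "20110708"],
    "headline": ["headline one", "headline two", "headline three"],
})

# Made-up returns in long form: one row per (ticker, date).
returns = pd.DataFrame({
    "ticker": ["AA", "AA"],
    "date": ["20110707", "20110708"],
    "ret": [0.012, -0.004],
})

# An inner join keeps only news items that have a matching return.
joined = news.merge(returns, on=["ticker", "date"], how="inner")
joined["label"] = (joined["ret"] >= 0).astype(int)
print(joined)
```

The `XYZ` row is dropped here because it has no return on that date; a left join would keep it with a missing value instead.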


## Your models - RNN, CNN, and RNN+CNN

Your RNN model needs to be based on word inputs: embed the word inputs, encode them with an RNN layer, and finish with a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.
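
As a rough illustration of what character inputs can look like, the sketch below maps each text to a fixed-length list of character indices. The alphabet, the `encode_chars` name, and the choice to share index 0 between padding and unknown characters are assumptions for this example, not the approach from the class demonstration.

```python
import string

# Assumed alphabet: lowercase letters, digits, punctuation, and space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "

# Indices start at 1; 0 is shared by padding and unknown characters.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_chars(text, max_len):
    """Lowercase, truncate to max_len, map to indices, and right-pad with 0."""
    ids = [CHAR_TO_INDEX.get(ch, 0) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


encoded = encode_chars("Alcoa up 5%", max_len=16)
print(encoded)
```

The resulting index lists can be fed to an embedding layer whose input dimension is `len(ALPHABET) + 1`.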

Finally, you will combine the architectures of both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your Python notebook
3. Report the performance according to some metric
4. Compare the performance of all of these models in a table (precision and recall)
5. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
6. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.
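
The evaluation steps in the list above could be sketched as follows. Both helper names are hypothetical, the inputs are made up, and the "return from the three most probable positive stock returns" is interpreted here as the equal-weighted mean of the actual returns of the three highest-probability predictions, which is an assumption rather than the required definition.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def top_k_mean_return(probabilities, actual_returns, k=3):
    """Mean actual return of the k items with the highest predicted probability."""
    ranked = sorted(zip(probabilities, actual_returns),
                    key=lambda pair: pair[0], reverse=True)
    top = [ret for _, ret in ranked[:k]]
    return sum(top) / len(top)


# Made-up labels, predictions, probabilities, and returns.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)

probs = [0.9, 0.2, 0.7, 0.8, 0.4]
rets = [0.02, -0.01, -0.03, 0.01, 0.00]
avg_return = top_k_mean_return(probs, rets)
print(precision, recall, avg_return)
```

Running the same two helpers on each model's predictions for the shared test set gives directly comparable rows for the results tables.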

### Good luck!

In [1]:
import json
import pandas as pd
import nltk
import itertools
import re


In [2]:
# Download NLTK model data (you need to do this once)
nltk.download("book")

[nltk_data] Downloading collection u'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/jahuang/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    

True

In [3]:
with open('stockReturns.json', 'r') as f:
    stockReturns = json.load(f)
news_df = pd.read_csv('news_reuters.csv', header=None,
                      names=['sticker', 'company', 'publication_date', 'headline', 'first_sentence', 'news_category'],
                      encoding='utf-8')


In [4]:
news_df

| | sticker | company | publication_date | headline | first_sentence | news_category |
|---|---|---|---|---|---|---|
| 0 | AA | Alcoa Corporation | 20110707 | Alcoa profit seen higher on aluminum price surge | * Analysts expect profit of 34 cts/shr vs yea... | topStory |
| 1 | AA | Alcoa Corporation | 20110708 | Global markets weekahead: Lacking conviction | LONDON Investors are unlikely to gain strong c... | normal |
| 2 | AA | Alcoa Corporation | 20110708 | Jobs halt Wall Street rally investors eye ear... | NEW YORK Stocks fell on Friday as a weak jobs ... | topStory |
| 3 | AA | Alcoa Corporation | 20110708 | REFILE-TABLE-Australia's top carbon polluters | CANBERRA July 8 Following is a list of Austr... | normal |
| 4 | AA | Alcoa Corporation | 20110708 | US STOCKS-Jobs data hits stocks but earnings ... | * Google slumps on downgrade one of Nasdaq's... | normal |
| 5 | AA | Alcoa Corporation | 20110708 | US STOCKS-Jobs halt Wall St rally investors e... | * Dow off 0.5 pct S&P down 0.7 pct Nasdaq o... | normal |
| 6 | AA | Alcoa Corporation | 20110708 | Wall St Week Ahead: Recipe for a rally? Beat l... | NEW YORK July 8 Wall Street heads into earni... | normal |
| 7 | AA | Alcoa Corporation | 20110708 | Wall St Week Ahead: Recipe for a rally? Beat l... | NEW YORK Wall Street heads into earnings seaso... | normal |
| 8 | AA | Alcoa Corporation | 20110709 | Recipe for a rally? Beat lowered estimates | NEW YORK Wall Street heads into earnings seaso... | topStory |
| 9 | AA | Alcoa Corporation | 20110710 | Earnings surprises may spark rally | NEW YORK Wall Street heads into earnings seaso... | topStory |


In [5]:
results = []
short_gains = stockReturns['short']
for index, row in news_df.iterrows():
    sticker_name = row['sticker']
    publication_date = str(row['publication_date'])
    try:
        gain = short_gains[sticker_name][publication_date]
        # Label 1 for a non-negative return, 0 otherwise (label[y >= 0] = 1)
        results.append(1 if gain >= 0 else 0)
    except (KeyError, TypeError):
        # Missing or non-numeric return: mark the row so it can be dropped later
        results.append(-1)

In [6]:
news_df['outcome'] = pd.Series(results)
len(news_df)

215288

In [7]:
# Keep only rows with a known outcome and renumber them from 0
news_outcome = news_df[news_df['outcome'] != -1]
news_outcome = news_outcome.reset_index(drop=True)

In [8]:
vocab_size = 8000
sentence_size = 120

In [9]:
# Tokenize Words
tokenized_sentences1 = [nltk.word_tokenize(sent) for sent in news_outcome['headline']]
tokenized_sentences2 = [nltk.word_tokenize(sent) for sent in news_outcome['first_sentence']]

In [10]:
tokenized_sentences = tokenized_sentences1 + tokenized_sentences2
words = []
for tokens in tokenized_sentences:
    for token in tokens:
        # Skip any token that contains a digit
        if re.search(r'\d', token) is None:
            words.append(token)

In [11]:
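word_freq = nltk.FreqDist(words)
# Keep the vocab_size - 1 most frequent words; index 0 is reserved for padding
vocab = word_freq.most_common(vocab_size - 1)
index_to_word = [x[0] for x in vocab]
# Word indices start at 1 so they never collide with the 0 used by pad_sequences
word_to_index = {w: i + 1 for i, w in enumerate(index_to_word)}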

In [12]:
X = []
for i in range(len(tokenized_sentences1)):
    row = []
    # Headline tokens first, then first-sentence tokens; out-of-vocabulary tokens are dropped
    for token in tokenized_sentences1[i] + tokenized_sentences2[i]:
        index = word_to_index.get(token)
        if index is not None:
            row.append(index)
    X.append(row)

In [13]:
import numpy
import keras
from keras import Sequential
from keras import optimizers
from keras.layers import (Activation, Conv1D, Conv2D, Dense, Dropout, Embedding, Flatten,
                          GlobalMaxPooling2D, LSTM, MaxPooling1D, Reshape)
from keras.preprocessing import sequence

from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [None]:
X = sequence.pad_sequences(X, maxlen=sentence_size)
y = news_outcome['outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model 1: RNN

In [None]:
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_length, input_length=len(X_train[0])))
model.add(LSTM(100))
# model.add(Dense(10, activation='sigmoid'))
# model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 120, 32)           256000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 309,301
Trainable params: 309,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 24722 samples, validate on 6181 samples
Epoch 1/5

In [None]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
model.predict_classes(X_test)

In [None]:
X_train.shape

## Model 2: CNN

In [None]:
embedding_dimension = 100
def cnn():
    model = Sequential()

    model.add(Embedding(input_dim = vocab_size, output_dim = embedding_dimension, input_length = sentence_size))
    model.add(Reshape((sentence_size, embedding_dimension, 1), input_shape = (sentence_size, embedding_dimension)))
    model.add(Conv2D(filters = 50, kernel_size = (5, embedding_dimension), strides = (1,1), padding = 'valid'))
    model.add(GlobalMaxPooling2D())

    model.add(Dense(20))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10))
    model.add(Activation('relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    adam = optimizers.Adam(lr = 0.001)

    model.compile(loss='binary_crossentropy', optimizer=adam , metrics=['accuracy'])
    
    return model

In [None]:
cnn_model = cnn()
cnn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)

## Model 3: RNN+CNN