# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that in this assignment, we're trying to beat the benchmark of random chance (aka better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predicting whether a stock return will be positive or negative in the same day of a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting is done. This is largely irrevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the json file. Note there are three types of data: short: 1 day return, mid: 7 day return, long 28 day return.

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs needs to be joined on the date and ticker - For each date of news publication, we want to join the corresponding corporation's news on its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.


# Your models - RNN, CNN, and RNN+CNN

For your RNN model, it needs to be based on word inputs, embedding the word inputs, encoding them with an RNN layer, and finally a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your python notebook
2. Report the peformance according to some metric
3. Compare the performance of all of these models in a table (precision and recall)
4. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
5. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.

### Good luck!

## Model 1: RNN

In [2]:
# Import Data
import pandas as pd
import numpy as np

news_reuters=pd.read_csv("news_reuters.csv",header=None)
stockReturns=pd.read_json("stockReturns.json")

In [9]:
stockReturns=stockReturns.iloc[0:len(stockReturns)-1,:] #remove SP500
tickers=stockReturns.index.tolist() #collect all tickers

#create a data frame that show return relative to SP500 for each ticker at available dates
returns=[]
for i in range(len(tickers)):
    for date, short_return in stockReturns.iloc[i,2].items():
        returns.append([tickers[i],date,short_return])

for j in range(len(returns)):
    if returns[j][2]>=0:
        returns[j][2]=1
    else:
        returns[j][2]=0
        
returns=pd.DataFrame(returns,columns=["Ticker","Date","Return"])
returns.head(3)

Unnamed: 0,Ticker,Date,Return
0,AAPL,20040106,0
1,AAPL,20040107,1
2,AAPL,20040108,1


In [11]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jiach\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [12]:
#tokenize news data

from nltk import word_tokenize

news=[]
text=[]
for i in range(len(news_reuters)):
    headline=str(news_reuters.iloc[i,3])
    sent=str(news_reuters.iloc[i,4])
    total=word_tokenize(headline+sent)
    for j in range(len(total)):
        text.append(total[j])        
    news.append([news_reuters.iloc[i,0],news_reuters.iloc[i,2],total])

In [13]:
news=pd.DataFrame(news,columns=["Ticker","Date","News"])
news["Date"]=news["Date"].astype("str")
data=pd.merge(returns,news,on=["Ticker","Date"])
data.head(3)

Unnamed: 0,Ticker,Date,Return,News
0,AAPL,20110706,1,"[Hackers, expose, flaw, in, Apple, iPad, iPhon..."
1,AAPL,20110706,1,"[Hackers, expose, flaw, in, Apple, iPad, iPhon..."
2,AAPL,20110706,1,"[Samsung, estimates, Q2, profit, down, 26, pct..."


In [14]:
#create a dictionary

import nltk

vocabulary_size = 30000
unknown_token = "UNKNOWN_TOKEN"

word_freq=nltk.FreqDist(text)
print("Found %d unique words tokens." % len(word_freq.items()))

vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
print("Using vocabulary size %d." % vocabulary_size)

data=data.values.tolist()

length=[]
for i in range(len(data)):
    length.append(len(data[i][3]))
max_length=max(length)
    
for i in range(len(data)):
    for j in range(len(data[i][3])):
        if data[i][3][j] in word_to_index:
            data[i][3][j]=word_to_index[data[i][3][j]]
        else:
            data[i][3][j]=word_to_index[unknown_token]

Found 114388 unique words tokens.
Using vocabulary size 30000.


In [15]:
#padding

from keras.preprocessing.sequence import pad_sequences

X=[]
Y=[]
for i in range(len(data)):
    X.append(data[i][3])
    Y.append(data[i][2])
X = pad_sequences(maxlen=max_length, sequences=X, value=0)

from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.3)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [16]:
# Model
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense, Dropout
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
from keras.layers import Dense, Activation


model_1=Sequential()
model_1.add(Embedding(vocabulary_size, 200, input_length=max_length))
model_1.add(SimpleRNN(100))  
model_1.add(Dropout(0.2))
model_1.add(Dense(20,activation="relu"))
model_1.add(Dense(1, activation='sigmoid'))
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_1.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 129, 200)          6000000   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 100)               30100     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 20)                2020      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 21        
Total params: 6,032,141
Trainable params: 6,032,141
Non-trainable params: 0
_________________________________________________________________
None


In [17]:
model_1.fit(X_tr, y_tr, batch_size=200, epochs=4, validation_split=0.1,verbose=1)

Train on 19468 samples, validate on 2164 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1c05e0698d0>

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

prediction_1=model_1.predict(X_te)
prediction_1=np.argmax(prediction_1,axis=-1)

precision_1 = precision_score(y_true=y_te, y_pred=prediction_1, average='weighted')
recall_1 = recall_score(y_true=y_te, y_pred=prediction_1, average='weighted')
f1_1 = f1_score(y_true=y_te, y_pred=prediction_1, average='weighted')

print("PRECISION: {:.3f}".format(precision_1))
print("RECALL: {:.3f}".format(recall_1))
print("F1: {:.3f}".format(f1_1))

PRECISION: 0.256
RECALL: 0.506
F1: 0.340


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [19]:
# show some good and bad example

correct=[]
incorrect=[]

for i in range(len(y_te)):
    if prediction_1[i]==y_te[i]:
        correct.append(i)
    else:
        incorrect.append(i)
        
print("First good example")
print("Actual Return: %s" % y_te[correct[0]])
print("Predicted Return: %s" % prediction_1[correct[0]])

information=[]
for i in range(len(X_te[correct[0]])):
    if X_te[correct[0]][i]!=0:
        information.append(X_te[correct[0]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First good example
Actual Return: 0
Predicted Return: 0
Stock news: ['Bristol-Myers', 'pulls', 'U.S.', 'marketing', 'application', 'for', 'hepatitis', 'C', 'treatment', 'Oct', '7', 'Bristol-Myers', 'Squibb', 'said', 'it', 'withdrew', 'its', 'U.S.', 'marketing', 'application', 'for', 'a', 'drug', 'combination', 'treat', 'hepatitis', 'C', '.']


In [20]:
print("Second good example")
print("Actual Return: %s" % y_te[correct[1]])
print("Predicted Return: %s" % prediction_1[correct[1]])

information=[]
for i in range(len(X_te[correct[1]])):
    if X_te[correct[1]][i]!=0:
        information.append(X_te[correct[1]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

Second good example
Actual Return: 0
Predicted Return: 0
Stock news: ['Lehman', 'payout', 'plan', 'OK', "'d", 'for', 'creditor', 'vote', 'NEW', 'YORK', 'Creditors', 'of', 'Lehman', 'Brothers', 'Holdings', 'Inc', 'will', 'be', 'allowed', 'vote', 'on', 'the', 'failed', 'bank', "'s", '$', '65', 'billion', 'payback', 'plan', 'clearing', 'a', 'major', 'hurdle', 'in', 'the', 'path', 'ending', 'the', 'biggest', 'bankruptcy', 'in', 'U.S.', 'history', '.']


In [21]:
print("First bad example")
print("Actual Return: %s" % y_te[incorrect[0]])
print("Predicted Return: %s" % prediction_1[incorrect[0]])

information=[]
for i in range(len(X_te[incorrect[0]])):
    if X_te[incorrect[0]][i]!=0:
        information.append(X_te[incorrect[0]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First bad example
Actual Return: 1
Predicted Return: 0
Stock news: ['Credit', 'Suisse', 'CEO', 'says', 'bank', 'does', 'not', 'need', 'raise', 'UNKNOWN_TOKEN', 'Aug', '9', 'Swiss', 'bank', 'Credit', 'Suisse', 'AG', 'does', 'not', 'need', 'raise', 'capital', '``', 'in', 'most', 'foreseeable', 'scenarios', "''", 'Chief', 'Executive', 'Tidjane', 'Thiam', 'said', 'in', 'an', 'interview', 'with', 'Bloomberg', '.']


In [22]:
print("Second bad example")
print("Actual Return: %s" % y_te[incorrect[1]])
print("Predicted Return: %s" % prediction_1[incorrect[1]])

information=[]
for i in range(len(X_te[incorrect[1]])):
    if X_te[incorrect[1]][i]!=0:
        information.append(X_te[incorrect[1]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

Second bad example
Actual Return: 1
Predicted Return: 0
Stock news: ['US', 'STOCKS-Apple', 'lifts', 'Nasdaq', ';', 'Ukraine', 'drags', 'on', 'broader', 'market', '*', 'Apple', 'rallies', 'a', 'day', 'after', 'announcing', '7-for-1', 'stock', 'split']


## Model 2: CNN

In [23]:
from keras.layers import Conv1D, MaxPooling1D, Flatten
model_2 = Sequential()
model_2.add(Embedding(input_dim =vocabulary_size, output_dim = 200, input_length = max_length))
model_2.add(Conv1D(filters = 30, kernel_size = 3, strides = 1, padding = 'valid'))
model_2.add(MaxPooling1D(2, padding = 'valid'))
model_2.add(Flatten())
model_2.add(Dense(10,activation="relu"))
model_2.add(Dense(1,activation="sigmoid"))
model_2.compile(loss='binary_crossentropy', optimizer='adam' , metrics=['accuracy'])
print(model_2.summary())

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 129, 200)          6000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 127, 30)           18030     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 63, 30)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1890)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                18910     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 6,036,951
Trainable params: 6,036,951
Non-tra

In [24]:
model_2.fit(X_tr, y_tr, batch_size=200, epochs=4, validation_split=0.1,verbose=1)

Train on 19468 samples, validate on 2164 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1c063a93dd8>

In [25]:
prediction_2=model_2.predict(X_te)
prediction_2=np.argmax(prediction_2,axis=-1)

precision_2 = precision_score(y_true=y_te, y_pred=prediction_2, average='weighted')
recall_2 = recall_score(y_true=y_te, y_pred=prediction_2, average='weighted')
f1_2 = f1_score(y_true=y_te, y_pred=prediction_2, average='weighted')

print("PRECISION: {:.3f}".format(precision_2))
print("RECALL: {:.3f}".format(recall_2))
print("F1: {:.3f}".format(f1_2))

PRECISION: 0.256
RECALL: 0.506
F1: 0.340


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [26]:
correct=[]
incorrect=[]

for i in range(len(y_te)):
    if prediction_2[i]==y_te[i]:
        correct.append(i)
    else:
        incorrect.append(i)
        
print("First good example")
print("Actual Return: %s" % y_te[correct[4]])
print("Predicted Return: %s" % prediction_2[correct[4]])

information=[]
for i in range(len(X_te[correct[4]])):
    if X_te[correct[4]][i]!=0:
        information.append(X_te[correct[4]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First good example
Actual Return: 0
Predicted Return: 0
Stock news: ['AstraZeneca', 'wins', 'U.S.', 'approval', 'for', 'longer', 'use', 'of', 'blood', 'thinner', 'U.S.', 'regulators', 'have', 'approved', 'a', 'new', 'dose', 'of', 'AstraZeneca', "'s", 'blood', 'thinner', 'Brilinta', 'for', 'longer-term', 'use', 'in', 'patients', 'with', 'a', 'history', 'of', 'heart', 'attacks', 'boosting', 'prospects', 'for', 'a', 'drug', 'the', 'company', 'thinks', 'will', 'eventually', 'sell', '$', '3.5', 'billion', 'a', 'year', '.']


In [27]:
print("Second good example")
print("Actual Return: %s" % y_te[correct[3]])
print("Predicted Return: %s" % prediction_2[correct[3]])

information=[]
for i in range(len(X_te[correct[3]])):
    if X_te[correct[3]][i]!=0:
        information.append(X_te[correct[3]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

Second good example
Actual Return: 0
Predicted Return: 0
Stock news: ['BRIEF-Bristol-Myers', 'says', 'EU', 'advanced', 'melanoma', 'drug', '*', 'European', 'Commission', 'UNKNOWN_TOKEN', 'The', 'First', 'And', 'Only', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', 'Bristol-Myers', 'Squibb', "'s", 'UNKNOWN_TOKEN', '(', 'UNKNOWN_TOKEN', ')', '+', 'UNKNOWN_TOKEN', '(', 'UNKNOWN_TOKEN', ')', 'UNKNOWN_TOKEN', 'For', 'Treatment', 'Of', 'Advanced', 'UNKNOWN_TOKEN']


In [28]:
print("First bad example")
print("Actual Return: %s" % y_te[incorrect[3]])
print("Predicted Return: %s" % prediction_2[incorrect[3]])

information=[]
for i in range(len(X_te[incorrect[3]])):
    if X_te[incorrect[3]][i]!=0:
        information.append(X_te[incorrect[3]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First bad example
Actual Return: 1
Predicted Return: 0
Stock news: ['Campbell', 'Soup', "'s", 'profit', 'beats', 'expectations', 'Nov', '22', 'Campbell', 'Soup', 'Co', 'the', 'world', "'s", 'largest', 'UNKNOWN_TOKEN', 'reported', 'a', 'better-than-expected', 'quarterly', 'profit', 'on', 'Tuesday', 'helped', 'by', 'cost-cutting', 'and', 'lower', 'commodity', 'prices', '.']


In [29]:
print("Second bad example")
print("Actual Return: %s" % y_te[incorrect[4]])
print("Predicted Return: %s" % prediction_2[incorrect[4]])

information=[]
for i in range(len(X_te[incorrect[4]])):
    if X_te[incorrect[4]][i]!=0:
        information.append(X_te[incorrect[4]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

Second bad example
Actual Return: 1
Predicted Return: 0
Stock news: ['BRIEF-Barclays', 'Bank', 'begins', 'note', 'buyback', 'LONDON', 'March', '27', 'Barclays', 'Bank', 'PLC', ':', '*', 'Offer', 'purchase', 'notes', 'for', 'cash', 'UNKNOWN_TOKEN', 'TO', 'UNKNOWN_TOKEN', 'NOTES', 'FOR', 'CASH', 'UNKNOWN_TOKEN', 'A', 'UNKNOWN_TOKEN', '``', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', "''", 'UNKNOWN_TOKEN']


## Model 3: RNN+CNN

In [30]:
from keras.layers import Concatenate
from keras.models import Model, Input

input_data=Input(shape=(max_length,))
rnn_embedding=Embedding(input_dim=vocabulary_size,output_dim=200,
                        input_length=max_length)(input_data)

cnn_embedding=Embedding(input_dim=vocabulary_size,output_dim=200,
                        input_length=max_length)(input_data)

rnn_1=LSTM(100,return_sequences=True,recurrent_dropout=0.1)(rnn_embedding)
rnn_2=Flatten()(rnn_1)
cnn_1=Conv1D(filters = 50, kernel_size = 2, strides = 1, padding = 'valid')(cnn_embedding)
cnn_2=MaxPooling1D(2, padding = 'valid')(cnn_1)
cnn_3=Flatten()(cnn_2)

combine=Concatenate(axis=-1)([rnn_2, cnn_3])

model_3=Dense(100, activation='relu')(combine)
out=Dense(1,activation="sigmoid")(model_3)

model_3=Model(input_data,out)

model_3.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
print(model_3.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 129)          0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 129, 200)     6000000     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 129, 200)     6000000     input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 128, 50)      20050       embedding_4[0][0]                
__________________________________________________________________________________________________
lstm_1 (LS

In [31]:
model_3.fit(X_tr, y_tr, batch_size=200, epochs=4, validation_split=0.1, verbose=1)

Train on 19468 samples, validate on 2164 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1c064069f60>

In [32]:
prediction_3=model_3.predict(X_te)
prediction_3=np.argmax(prediction_3,axis=-1)

precision_3 = precision_score(y_true=y_te, y_pred=prediction_3, average='weighted')
recall_3 = recall_score(y_true=y_te, y_pred=prediction_3, average='weighted')
f1_3 = f1_score(y_true=y_te, y_pred=prediction_3, average='weighted')

print("PRECISION: {:.3f}".format(precision_3))
print("RECALL: {:.3f}".format(recall_3))
print("F1: {:.3f}".format(f1_3))

PRECISION: 0.256
RECALL: 0.506
F1: 0.340


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [33]:
correct=[]
incorrect=[]

for i in range(len(y_te)):
    if prediction_3[i]==y_te[i]:
        correct.append(i)
    else:
        incorrect.append(i)
        
print("First good example")
print("Actual Return: %s" % y_te[correct[5]])
print("Predicted Return: %s" % prediction_3[correct[5]])

information=[]
for i in range(len(X_te[correct[5]])):
    if X_te[correct[5]][i]!=0:
        information.append(X_te[correct[5]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First good example
Actual Return: 0
Predicted Return: 0
Stock news: ['UBS', 'axes', '3', '500', 'jobs', 'in', 'cost-cutting', 'push', 'ZURICH', 'Switzerland', "'s", 'biggest', 'bank', 'UBS', 'AG', 'is', 'axe', '3', '500', 'jobs', 'shave', '2', 'billion', 'Swiss', 'francs', '(', '$', '2.5', 'billion', ')', 'off', 'annual', 'costs', 'as', 'it', 'joins', 'rival', 'investment', 'banks', 'in', 'reversing', 'the', 'post-crisis', 'hiring', 'binge', 'and', 'preparing', 'for', 'a', 'tough', 'few', 'years', '.', '|', 'Video']


In [34]:
print("First good example")
print("Actual Return: %s" % y_te[correct[6]])
print("Predicted Return: %s" % prediction_3[correct[6]])

information=[]
for i in range(len(X_te[correct[6]])):
    if X_te[correct[6]][i]!=0:
        information.append(X_te[correct[6]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First good example
Actual Return: 0
Predicted Return: 0
Stock news: ['UPDATE', 'UNKNOWN_TOKEN', 'brews', 'in', 'Gulf', 'of', 'Mexico', 'as', 'energy', 'ops', 'resume', '*', 'Lingering', 'bad', 'weather', 'slowed', 'restart', 'efforts', 'post-Lee']


In [35]:
print("First bad example")
print("Actual Return: %s" % y_te[incorrect[5]])
print("Predicted Return: %s" % prediction_3[incorrect[5]])

information=[]
for i in range(len(X_te[incorrect[5]])):
    if X_te[incorrect[5]][i]!=0:
        information.append(X_te[incorrect[5]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

First bad example
Actual Return: 1
Predicted Return: 0
Stock news: ['Fitch', ':', 'Barclays', "'", 'Retail', 'and', 'Cards', 'Businesses', 'UNKNOWN_TOKEN', 'Weak', 'Investment', 'Bank', '(', 'The', 'following', 'statement', 'was', 'released', 'by', 'the', 'rating', 'agency', ')', 'LONDON', 'October', '31', '(', 'Fitch', ')', 'Fitch', 'Ratings', 'says', 'the', 'UNKNOWN_TOKEN', 'of', 'Barclays', 'Plc', "'s", '(', 'Barclays', 'A/Stable/a', ')', 'investment', 'bank', '(', 'IB', ')', 'in', 'UNKNOWN_TOKEN', 'where', 'pre-tax', 'profit', 'fell', '39', '%', 'yoy', 'was', 'compensated', 'by', 'the', 'solid', 'performance', 'of', 'its', 'other', 'core', 'businesses', 'in', 'personal', 'and', 'corporate', 'banking', '(', 'UNKNOWN_TOKEN', ')', 'Barclaycard', 'and', 'Africa', 'Banking', '.', 'The', 'results', 'have', 'no', 'immediate', 'effect', 'on', 'Barclays', "'", 'ratings', '.', 'Results', 'in', '3Q', 'were', 'affected', 'by', 'changes', 'in', 'provisions', 'for', 'UNKNOWN_TOKEN']


In [36]:
print("Second bad example")
print("Actual Return: %s" % y_te[incorrect[6]])
print("Predicted Return: %s" % prediction_3[incorrect[6]])

information=[]
for i in range(len(X_te[incorrect[6]])):
    if X_te[incorrect[6]][i]!=0:
        information.append(X_te[incorrect[6]][i])
information=[index_to_word[w] for w in information]
    
print("Stock news: %s" %information)

Second bad example
Actual Return: 1
Predicted Return: 0
Stock news: ['Deals', 'of', 'the', 'day-', 'Mergers', 'and', 'acquisitions', 'Oct', '14', 'The', 'following', 'bids', 'mergers', 'acquisitions', 'and', 'disposals', 'were', 'reported', 'by', '2000', 'GMT', 'on', 'Wednesday', ':']


In [39]:
# Model comparision

from tabulate import tabulate
print (tabulate([['RNN',precision_1, recall_1,f1_1],
                 ['CNN',precision_2, recall_2,f1_2],
                 ['RNN+CNN',precision_3,recall_3,f1_3]],
                headers=['Model','Precision','Recall', 'f1_score']))

Model      Precision    Recall    f1_score
-------  -----------  --------  ----------
RNN         0.255913  0.505879    0.339885
CNN         0.255913  0.505879    0.339885
RNN+CNN     0.255913  0.505879    0.339885
