# Using NLP to play the stock market

we'll use everything we've learned to analyze corporate news and pick stocks. 

This project will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predicting whether a stock return will be positive or negative in the same day of a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

It is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting is done. This is largely irrevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the json file. Note there are three types of data: short: 1 day return, mid: 7 day return, long 28 day return.

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs needs to be joined on the date and ticker - For each date of news publication, we want to join the corresponding corporation's news on its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.


# Your models - RNN, CNN, and RNN+CNN

For your RNN model, it needs to be based on word inputs, embedding the word inputs, encoding them with an RNN layer, and finally a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

### Good luck!

In [427]:
import pandas as pd
news=pd.read_csv("news_reuters.csv",header=None)

In [428]:
news.head()

Unnamed: 0,0,1,2,3,4,5
0,AA,Alcoa Corporation,20110707,Alcoa profit seen higher on aluminum price surge,* Analysts expect profit of 34 cts/shr vs yea...,topStory
1,AA,Alcoa Corporation,20110708,Global markets weekahead: Lacking conviction,LONDON Investors are unlikely to gain strong c...,normal
2,AA,Alcoa Corporation,20110708,Jobs halt Wall Street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory
3,AA,Alcoa Corporation,20110708,REFILE-TABLE-Australia's top carbon polluters,CANBERRA July 8 Following is a list of Austr...,normal
4,AA,Alcoa Corporation,20110708,US STOCKS-Jobs data hits stocks but earnings ...,* Google slumps on downgrade one of Nasdaq's...,normal


In [429]:
#cleaning format 
news[4]=news[4].replace('\*','',regex=True)

In [430]:
news[4]=news[4].replace('\|\xa0Video','',regex=True)

In [431]:
#choose March the whole month 
date=[
 20160301,
 20160302,
 20160303,
 20160304,
 20160308,
 20160309,
 20160310,
 20160311,
 20160315,
 20160316,
 20160317,
 20160318,
 20160322,
 20160323,
 20160324,
 20160329,
 20160330,
 20160331,
 ]

In [432]:
#Mrach news data
news_d=news[news[2].isin(date)]

In [433]:
len(news_d)

2162

In [434]:
return_data=pd.read_json("stockReturns.json")

In [435]:
return_data.head()

Unnamed: 0,long,mid,short
AAPL,"{'20040106': -0.0023, '20040107': -0.0016, '20...","{'20040106': 0.06760000000000001, '20040107': ...","{'20040106': -0.0013000000000000002, '20040107..."
ABB,"{'20040106': 0.09630000000000001, '20040107': ...","{'20040106': 0.09340000000000001, '20040107': ...","{'20040106': 0.0015, '20040107': -0.0107000000..."
ABMD,"{'20040106': 0.08360000000000001, '20040107': ...","{'20040106': 0.039400000000000004, '20040107':...","{'20040106': 0.0102, '20040107': 0.0217, '2004..."
ABR,"{'20040413': 0.0367, '20040414': 0.0053, '2004...","{'20040413': 0.0082, '20040414': 0.01970000000...","{'20040413': 0.013900000000000001, '20040414':..."
ACAD,"{'20040602': -0.049300000000000004, '20040603'...","{'20040602': -0.0821, '20040603': -0.0611, '20...","{'20040602': -0.0346, '20040603': -0.0005, '20..."


In [436]:
return_data_l=pd.read_json((return_data["long"]).to_json(), orient="index")

In [437]:
return_data_m=pd.read_json((return_data["mid"]).to_json(), orient="index")

In [438]:
return_data_s=pd.read_json((return_data["short"]).to_json(), orient="index")

In [439]:
return_data_l.head()

Unnamed: 0,20040106,20040107,20040108,20040109,20040113,20040114,20040115,20040116,20040121,20040122,...,20180308,20180309,20180313,20180314,20180315,20180316,20180320,20180321,20180322,20180323
AAPL,-0.0023,-0.0016,-0.0376,-0.0423,-0.0556,-0.0741,-0.0411,0.0269,0.0021,0.0235,...,0.0108,0.0045,-0.0034,0.0019,0.0056,0.0052,0.016,0.0209,0.0389,0.0046
ABB,0.0963,0.0916,0.1032,-0.0069,0.0404,0.0003,-0.0378,-0.0684,-0.0017,-0.0395,...,-0.0155,-0.0069,0.0214,0.0227,0.0138,0.0122,0.0087,0.0134,0.0206,0.0639
ABMD,0.0836,0.0283,-0.0199,-0.0829,-0.0392,0.0117,-0.0115,0.0258,-0.0825,-0.1271,...,0.0416,0.0403,0.0403,0.064,0.0417,0.0388,0.0416,0.0387,0.0523,0.0492
ABR,,,,,,,,,,,...,0.0641,0.0431,0.0493,0.032,0.0204,0.0171,-0.0076,-0.0121,-0.0239,-0.0415
ACAD,,,,,,,,,,,...,-0.0978,-0.1037,-0.3875,-0.3022,-0.2314,-0.2368,-0.2455,-0.2055,-0.2092,-0.2117


In [440]:
#March long return data
data_l=return_data_l[date]

In [441]:
#March mid return data
data_m=return_data_m[date]

In [442]:
#March short return 
data_s=return_data_s[date]

In [443]:
#choose to focus on march long return data
data_l.index

Index(['AAPL', 'ABB', 'ABMD', 'ABR', 'ACAD', 'ACAT', 'ACFC', 'ACRX', 'ADMA',
       'ADMS',
       ...
       'WPXP', 'WSFSL', 'WSO.B', 'WU', 'XCO', 'XLNX', 'ZBK', 'ZBRA', 'ZIXI',
       '^GSPC'],
      dtype='object', length=501)

In [445]:
#news data with tickers in long return data
newd=news_d[news_d[0].isin(data_l.index)]

In [446]:
rlist=[]
for i in newd.index:
    for e in range(len(data_l.index)):
          for d in range(len(data_l.columns)):     
                 if newd[0][i]==data_l.index[e]:
                         if newd[2][i]==data_l.columns[d]:
                                 rlist.append(data_l[data_l.columns[d]][e])

In [447]:
newd["long_return"]=rlist

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [448]:
#there are some missing data
newd

Unnamed: 0,0,1,2,3,4,5,long_return
7086,AAPL,Apple Inc,20160301,Judicial panel members consider legal brief in...,WASHINGTON Members of the House Judiciary Comm...,normal,0.0315
7087,AAPL,Apple Inc,20160301,N.Y. judge backs Apple in encryption fight wit...,The U.S. government cannot force Apple Inc to...,normal,0.0315
7088,AAPL,Apple Inc,20160301,UPDATE 3-N.Y. judge backs Apple in encryption ...,Feb 29 The U.S. government cannot force Apple ...,normal,0.0315
7089,AAPL,Apple Inc,20160301,U.S. attorney general worried encryption debat...,SAN FRANCISCO March 1 U.S. Attorney General L...,normal,0.0315
7090,AAPL,Apple Inc,20160301,U.S. attorney general worried encryption debat...,SAN FRANCISCO U.S. Attorney General Loretta Ly...,topStory,0.0315
7091,AAPL,Apple Inc,20160301,U.S. judicial panel members consider legal bri...,WASHINGTON Feb 29 Members of the U.S. House J...,normal,0.0315
7092,AAPL,Apple Inc,20160301,U.S. judicial panel members consider legal bri...,WASHINGTON Members of the U.S. House Judiciary...,normal,0.0315
7093,AAPL,Apple Inc,20160302,Apple should not try making a car on its own ...,GENEVA March 2 U.S. technology giant Apple s...,normal,0.0307
7094,AAPL,Apple Inc,20160302,Apple should not try making a car on its own ...,GENEVA U.S. technology giant Apple should coll...,topStory,0.0307
7095,AAPL,Apple Inc,20160303,Apple's new San Francisco office could be a to...,SAN FRANCISCO From Apple’s earliest days exec...,normal,0.0455


In [467]:
#clean nan value
newdd=newd.dropna(axis=0, how='any')

In [450]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
#from nltk.corpus import stopwords

In [451]:
#import nltk
#nltk.download('punkt')

In [587]:
#using tokenize to clean location word
words=[nltk.word_tokenize(d)for d in newdd[4]]

In [453]:
#import nltk
#nltk.download('perluniprops')

In [588]:
for i in range(len(words)):
    w=words[i][0].isupper()
    if w ==True:
        words[i].remove(words[i][0])

In [589]:
words

[['Members',
  'of',
  'the',
  'House',
  'Judiciary',
  'Committee',
  'are',
  'considering',
  'filing',
  'a',
  '``',
  'friend',
  'of',
  'the',
  'court',
  "''",
  'brief',
  'in',
  'Apple',
  'Inc',
  "'s",
  '[',
  'AAPL.O',
  ']',
  'encryption',
  'dispute',
  'with',
  'the',
  'U.S.',
  'government',
  'to',
  'argue',
  'that',
  'the',
  'case',
  'should',
  'be',
  'decided',
  'by',
  'Congress',
  'and',
  'not',
  'the',
  'courts',
  'five',
  'sources',
  'familiar',
  'with',
  'the',
  'matter',
  'said',
  '.'],
 ['The',
  'U.S.',
  'government',
  'can',
  'not',
  'force',
  'Apple',
  'Inc',
  'to',
  'unlock',
  'an',
  'iPhone',
  'in',
  'a',
  'New',
  'York',
  'drug',
  'case',
  'a',
  'federal',
  'judge',
  'in',
  'Brooklyn',
  'said',
  'on',
  'Monday',
  'a',
  'ruling',
  'that',
  'bolsters',
  'the',
  'company',
  "'s",
  'arguments',
  'in',
  'its',
  'landmark',
  'legal',
  'showdown',
  'with',
  'the',
  'Justice',
  'Department',


In [590]:
from nltk.tokenize.moses import MosesDetokenizer
detokenizer = MosesDetokenizer(words)
senten=[]
for i in range(len(words)):
         d=detokenizer.detokenize(words[i], return_str=True)
         senten.append(d)

In [591]:
#new sentences without location 
newdd[4]=senten

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [592]:
newdd[4].head()

7086    Members of the House Judiciary Committee are c...
7087    The U.S. government can not force Apple Inc to...
7088    Feb 29 The U.S. government can not force Apple...
7089    FRANCISCO March 1 U.S. Attorney General Lorett...
7090    FRANCISCO U.S. Attorney General Loretta Lynch ...
Name: 4, dtype: object

In [471]:
perform=[]
for i in newdd['long_return'].index:
         if newdd['long_return'][i]>=0:
                perform.append("outperforming")
         else:   
                perform.append("underperforming")

In [473]:
newdd["perform"]=perform

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [593]:
newdd

Unnamed: 0,0,1,2,3,4,5,long_return,perform
7086,AAPL,Apple Inc,20160301,Judicial panel members consider legal brief in...,Members of the House Judiciary Committee are c...,normal,0.0315,outperforming
7087,AAPL,Apple Inc,20160301,N.Y. judge backs Apple in encryption fight wit...,The U.S. government can not force Apple Inc to...,normal,0.0315,outperforming
7088,AAPL,Apple Inc,20160301,UPDATE 3-N.Y. judge backs Apple in encryption ...,Feb 29 The U.S. government can not force Apple...,normal,0.0315,outperforming
7089,AAPL,Apple Inc,20160301,U.S. attorney general worried encryption debat...,FRANCISCO March 1 U.S. Attorney General Lorett...,normal,0.0315,outperforming
7090,AAPL,Apple Inc,20160301,U.S. attorney general worried encryption debat...,FRANCISCO U.S. Attorney General Loretta Lynch ...,topStory,0.0315,outperforming
7091,AAPL,Apple Inc,20160301,U.S. judicial panel members consider legal bri...,Feb 29 Members of the U.S. House Judiciary Com...,normal,0.0315,outperforming
7092,AAPL,Apple Inc,20160301,U.S. judicial panel members consider legal bri...,Members of the U.S. House Judiciary Committee ...,normal,0.0315,outperforming
7093,AAPL,Apple Inc,20160302,Apple should not try making a car on its own ...,March 2 U.S. technology giant Apple should col...,normal,0.0307,outperforming
7094,AAPL,Apple Inc,20160302,Apple should not try making a car on its own ...,U.S. technology giant Apple should collaborate...,topStory,0.0307,outperforming
7095,AAPL,Apple Inc,20160303,Apple's new San Francisco office could be a to...,FRANCISCO From Apple ’ s earliest days executi...,normal,0.0455,outperforming


In [475]:
size_mapping = {
        'outperforming': 1,
        'underperforming': 0,
        }

# apply the mapping dictionary with map()
Y= newdd['perform'].map(size_mapping)




## Model 1: RNN

In [493]:
import numpy as np
from numpy import array

In [598]:
#word based 
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing import sequence
from numpy import argmax
t = Tokenizer()
t.fit_on_texts(newdd[4])
vocab_size_1 = len(t.word_index) + 1
encoded_x = t.texts_to_sequences(newdd[4])
max_length = 700
padded_x= pad_sequences(encoded_x, maxlen=max_length)
print(padded_x.shape)

(489, 700)


In [599]:
import nltk
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
x_train, x_test, y_train, y_test = train_test_split(padded_x,Y,test_size=0.3,random_state=0)

In [600]:
from keras.utils import np_utils
y_tr = np_utils.to_categorical(y_train)

In [601]:
y_te = np_utils.to_categorical(y_test)

In [602]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding,GRU,Flatten
from keras.layers import LSTM

In [603]:
model = Sequential()
model.add(Embedding(vocab_size_1, 32, input_length=max_length))
model.add(LSTM(256,return_sequences=True,recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(LSTM(128,activation="tanh"))
model.add(Dense(y_tr.shape[1], activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [604]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_24 (Embedding)     (None, 700, 32)           78400     
_________________________________________________________________
lstm_52 (LSTM)               (None, 700, 256)          295936    
_________________________________________________________________
dropout_42 (Dropout)         (None, 700, 256)          0         
_________________________________________________________________
lstm_53 (LSTM)               (None, 128)               197120    
_________________________________________________________________
dense_47 (Dense)             (None, 2)                 258       
Total params: 571,714
Trainable params: 571,714
Non-trainable params: 0
_________________________________________________________________


In [605]:
model.fit(x_train, y_tr, epochs=10, batch_size=20)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc16f62c6d8>

In [606]:
scores = model.evaluate(x_test, y_te, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 71.43%


In [607]:
pp=model.predict_classes(x_test)

In [608]:
from sklearn.metrics import confusion_matrix
confusion_matrixr = confusion_matrix(y_test, pp)
confusion_matrixr

array([[55, 17],
       [25, 50]])

In [609]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pp))

             precision    recall  f1-score   support

          0       0.69      0.76      0.72        72
          1       0.75      0.67      0.70        75

avg / total       0.72      0.71      0.71       147



In [610]:
from sklearn.metrics import precision_recall_fscore_support
rnnfr=precision_recall_fscore_support(y_test, pp,average='macro')


In [611]:
#get the test data content
test_data=newdd[newdd.index.isin(y_test.index)]

In [612]:
#RNN good classification 
print("{:10} {:5} {:5} {}".format("Date", "True", "Pred","Sentence"))
print(30 * "-")
for w, t, pred,sen in zip(test_data[2], y_test, pp,test_data[4]):
     if t==pred:
        print("{}: {:5} {:5}\n{}".format(w, t, pred,sen))

Date       True  Pred  Sentence
------------------------------
20160301:     1     1
The U.S. government can not force Apple Inc to unlock an iPhone in a New York drug case a federal judge in Brooklyn said on Monday a ruling that bolsters the company 's arguments in its landmark legal showdown with the Justice Department over encryption and privacy.
20160301:     1     1
FRANCISCO U.S. Attorney General Loretta Lynch said on Tuesday her Justice Department 's court battle to force Apple Inc to unlock an iPhone linked to one of the San Bernardino shooters ran the risk of becoming ``all about Apple ''and that the company should not be able to decide the broader encryption debate alone.
20160301:     0     0
Feb 29 Members of the U.S. House Judiciary Committee are considering filing a ``friend of the court ''brief in Apple Inc 's encryption dispute with the U.S. government to argue that the case should be decided by Congress and not the courts five sources familiar with the matter said.
201

In [613]:
#RNN bad classification
print("{:10} {:5} {:5} {}".format("Date", "True", "Pred","Sentence"))
print(30 * "-")
for w, t, pred, sen in zip(test_data[2], y_test, pp,test_data[4]):
     if t!=pred:
        print("{}: {:5} {:5}\n{}".format(w, t, pred,sen))

Date       True  Pred  Sentence
------------------------------
20160301:     0     1
Members of the U.S. House Judiciary Committee are considering filing a ``friend of the court ''brief in Apple Inc 's encryption dispute with the U.S. government to argue that the case should be decided by Congress and not the courts five sources familiar with the matter said.
20160302:     1     0
U.S. technology giant Apple should collaborate with carmakers to make a vehicle and use the expertise already available rather than attempt to do it on its own Fiat Chrysler Chief Executive Sergio Marchionne said.
20160303:     1     0
FRANCISCO March 3 From Apple 's earliest days executives insisted that employees work from its headquarters in sleepy suburban Cupertino.
20160308:     0     1
The U.S. Justice Department on Monday sought to overturn a ruling which protects Apple from unlocking an iPhone in a New York drug case.
20160315:     1     0
March 15 China 's annual consumer rights day TV show took aim

## Model 2: CNN

In [614]:
docs = []
sentences = []


for sentences, label in zip(newdd[4], Y):
    sentences_cleaned = [sent.lower() for sent in sentences]
    docs.append(sentences_cleaned)
    
len(docs)

489

In [615]:
txt = ''
for doc in docs:
    for s in doc:
        txt += s
chars = set(txt)
vocab_size = len(chars)
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 56


In [616]:
def vectorize_sentences(data, char_indices):
    X = []
    for sentences in data:
        x = [char_indices[w] for w in sentences]
        x2 = np.eye(len(char_indices))[x]
        X.append(x2)
    return (pad_sequences(X, maxlen=max_length))

padded_x_2 = vectorize_sentences(docs,char_indices)
padded_x_2.shape



(489, 700, 56)

In [617]:
x_trainc, x_testc, y_trainc, y_testc = train_test_split(padded_x_2,Y,test_size=0.3,random_state=0)

In [618]:
#get the test data for CNN and CNN+RNN 
test_datac=newdd[newdd.index.isin(y_testc.index)]

In [619]:
y_tr2 = np_utils.to_categorical(y_trainc)

In [620]:
y_te2 = np_utils.to_categorical(y_testc)

In [621]:
nb_filter = 256
dense_outputs = 700
filter_kernels = [7, 7, 5, 5, 3, 3]
n_out = 2

In [622]:
from keras.models import Model
from keras.layers import Input
from keras.layers.convolutional import Convolution1D, MaxPooling1D

inputs = Input(shape=(max_length, vocab_size), name='input', dtype='float32')

conv = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[0],
                     border_mode='valid', activation='relu',
                     input_shape=(max_length, vocab_size))(inputs)
conv = MaxPooling1D(pool_length=3)(conv)


conv1 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[1],
                      border_mode='valid', activation='relu')(conv)
conv1 = MaxPooling1D(pool_length=3)(conv1)

conv2 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[2],
                      border_mode='valid', activation='relu')(conv1)

conv3 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[3],
                      border_mode='valid', activation='relu')(conv2)

conv4 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[4],
                      border_mode='valid', activation='relu')(conv3)

conv5 = Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[5],
                      border_mode='valid', activation='relu')(conv4)
conv5 = MaxPooling1D(pool_length=3)(conv5)
conv5 = Flatten()(conv5)

z = Dropout(0.25)(Dense(dense_outputs, activation='relu')(conv5))
z = Dropout(0.25)(Dense(dense_outputs, activation='relu')(z))

pred = Dense(n_out, activation='softmax', name='output')(z)

modelc = Model(input=inputs, output=pred)

modelc.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])




In [623]:
modelc.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 700, 56)           0         
_________________________________________________________________
conv1d_71 (Conv1D)           (None, 694, 256)          100608    
_________________________________________________________________
max_pooling1d_36 (MaxPooling (None, 231, 256)          0         
_________________________________________________________________
conv1d_72 (Conv1D)           (None, 225, 256)          459008    
_________________________________________________________________
max_pooling1d_37 (MaxPooling (None, 75, 256)           0         
_________________________________________________________________
conv1d_73 (Conv1D)           (None, 71, 256)           327936    
_________________________________________________________________
conv1d_74 (Conv1D)           (None, 67, 256)           327936    
__________

In [624]:
modelc.fit(x_trainc, y_tr2, batch_size=30,
         epochs=20, validation_split=0.2, verbose=True)

Train on 273 samples, validate on 69 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fc1ee46bc88>

In [625]:
loss, accuracy = modelc.evaluate(x_testc, y_te2, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 71.428572


In [626]:
ppc=modelc.predict(x_testc)

In [627]:
ppcc=np.argmax(ppc,axis=1) 

In [628]:
confusion_matrix_c = confusion_matrix(y_testc, ppcc)
confusion_matrix_c

array([[55, 17],
       [25, 50]])

In [629]:
print(classification_report(y_testc, ppcc))

             precision    recall  f1-score   support

          0       0.69      0.76      0.72        72
          1       0.75      0.67      0.70        75

avg / total       0.72      0.71      0.71       147



In [630]:
ppccp=precision_recall_fscore_support(y_testc, ppcc,average='macro')

## Model 3: RNN+CNN

In [631]:
modelcr = Sequential()
modelcr.add(Convolution1D(filters=256, kernel_size=3, padding='valid', activation='relu', kernel_initializer='lecun_uniform',input_shape=(x_trainc.shape[1], x_trainc.shape[2])))
modelcr.add(MaxPooling1D(pool_size=2))
modelcr.add(Convolution1D(filters=128, kernel_size=3, padding='valid', activation='relu'))
modelcr.add(LSTM(100,dropout=0.2))
modelcr.add(Dense(2, activation='softmax'))
modelcr.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(modelcr.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_77 (Conv1D)           (None, 698, 256)          43264     
_________________________________________________________________
max_pooling1d_39 (MaxPooling (None, 349, 256)          0         
_________________________________________________________________
conv1d_78 (Conv1D)           (None, 347, 128)          98432     
_________________________________________________________________
lstm_54 (LSTM)               (None, 100)               91600     
_________________________________________________________________
dense_50 (Dense)             (None, 2)                 202       
Total params: 233,498
Trainable params: 233,498
Non-trainable params: 0
_________________________________________________________________
None


In [632]:
modelcr.fit(x_trainc, y_tr2, batch_size=20,
         epochs=20, validation_split=0.1, verbose=True)

Train on 307 samples, validate on 35 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fc17e0eff98>

In [633]:
ppcr=modelcr.predict_classes(x_testc)

In [634]:
ppcr

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1])

In [635]:
print(classification_report(y_testc, ppcr))

             precision    recall  f1-score   support

          0       0.68      0.88      0.77        72
          1       0.84      0.61      0.71        75

avg / total       0.76      0.74      0.74       147



In [636]:
ppcrp=precision_recall_fscore_support(y_testc, ppcr,average='macro')

In [637]:
#!pip install tabulate
from tabulate import tabulate
print( tabulate([["RNN Precision", rnnfr[0]], ["CNN Precision", ppccp[0]],["CNN+RNN Precision",ppcrp[0]],["RNN Recall", rnnfr[1]],["CNN Recall", ppccp[1]],["CNN+RNN Recall", ppcrp[1]]]))

-----------------  --------
RNN Precision      0.716884
CNN Precision      0.716884
CNN+RNN Precision  0.760573
RNN Recall         0.715278
CNN Recall         0.715278
CNN+RNN Recall     0.744167
-----------------  --------


In [638]:
#CNN good classification 

print("{:10} {:5} {:5} {}".format("Date", "True", "Pred","Sentence"))
print(40 * "-")
for w, t, pred,sen in zip(test_datac[2], y_testc, ppcc,test_datac[4]):
     if t==pred:
        print("{}: {:5} {:5}\n{}".format(w, t, pred,sen))

Date       True  Pred  Sentence
----------------------------------------
20160301:     1     1
The U.S. government can not force Apple Inc to unlock an iPhone in a New York drug case a federal judge in Brooklyn said on Monday a ruling that bolsters the company 's arguments in its landmark legal showdown with the Justice Department over encryption and privacy.
20160301:     0     0
Feb 29 Members of the U.S. House Judiciary Committee are considering filing a ``friend of the court ''brief in Apple Inc 's encryption dispute with the U.S. government to argue that the case should be decided by Congress and not the courts five sources familiar with the matter said.
20160302:     0     0
March 2 U.S. technology giant Apple should collaborate with carmakers to make a vehicle and use the expertise already available rather than attempt to do it on its own Fiat Chrysler Chief Executive Sergio Marchionne said.
20160303:     0     0
Chipmaker Broadcom Ltd the company created following the merger of

In [639]:
#CNN bad classification 

print("{:10} {:5} {:5} {}".format("Date", "True", "Pred","Sentence"))
print(40 * "-")
for w, t, pred,sen in zip(test_datac[2], y_testc, ppcc,test_datac[4]):
     if t!=pred:
        print("{}: {:5} {:5}\n{}".format(w, t, pred,sen))

Date       True  Pred  Sentence
----------------------------------------
20160301:     1     0
FRANCISCO U.S. Attorney General Loretta Lynch said on Tuesday her Justice Department 's court battle to force Apple Inc to unlock an iPhone linked to one of the San Bernardino shooters ran the risk of becoming ``all about Apple ''and that the company should not be able to decide the broader encryption debate alone.
20160301:     0     1
Members of the U.S. House Judiciary Committee are considering filing a ``friend of the court ''brief in Apple Inc 's encryption dispute with the U.S. government to argue that the case should be decided by Congress and not the courts five sources familiar with the matter said.
20160302:     1     0
U.S. technology giant Apple should collaborate with carmakers to make a vehicle and use the expertise already available rather than attempt to do it on its own Fiat Chrysler Chief Executive Sergio Marchionne said.
20160303:     1     0
FRANCISCO March 3 From Apple 's

In [640]:
#CNN+RNN good classification
print("{:10} {:5} {:5} {}".format("Date", "True", "Pred","Sentence"))
print(40 * "-")
for w, t, pred,sen in zip(test_datac[2], y_testc, ppcr,test_datac[4]):
     if t==pred:
        print("{}: {:5} {:5}\n{}".format(w, t, pred,sen))

Date       True  Pred  Sentence
----------------------------------------
20160301:     0     0
Feb 29 Members of the U.S. House Judiciary Committee are considering filing a ``friend of the court ''brief in Apple Inc 's encryption dispute with the U.S. government to argue that the case should be decided by Congress and not the courts five sources familiar with the matter said.
20160301:     0     0
Members of the U.S. House Judiciary Committee are considering filing a ``friend of the court ''brief in Apple Inc 's encryption dispute with the U.S. government to argue that the case should be decided by Congress and not the courts five sources familiar with the matter said.
20160302:     0     0
March 2 U.S. technology giant Apple should collaborate with carmakers to make a vehicle and use the expertise already available rather than attempt to do it on its own Fiat Chrysler Chief Executive Sergio Marchionne said.
20160302:     1     1
U.S. technology giant Apple should collaborate with carm

In [641]:
#CNN+RNN bad classification 

print("{:10} {:5} {:5} {}".format("Date", "True", "Pred","Sentence"))
print(40 * "-")
for w, t, pred,sen in zip(test_datac[2], y_testc, ppcr,test_datac[4]):
     if t!=pred:
        print("{}: {:5} {:5}\n{}".format(w, t, pred,sen))

Date       True  Pred  Sentence
----------------------------------------
20160301:     1     0
The U.S. government can not force Apple Inc to unlock an iPhone in a New York drug case a federal judge in Brooklyn said on Monday a ruling that bolsters the company 's arguments in its landmark legal showdown with the Justice Department over encryption and privacy.
20160301:     1     0
FRANCISCO U.S. Attorney General Loretta Lynch said on Tuesday her Justice Department 's court battle to force Apple Inc to unlock an iPhone linked to one of the San Bernardino shooters ran the risk of becoming ``all about Apple ''and that the company should not be able to decide the broader encryption debate alone.
20160303:     1     0
FRANCISCO March 3 From Apple 's earliest days executives insisted that employees work from its headquarters in sleepy suburban Cupertino.
20160308:     0     1
The U.S. Justice Department on Monday sought to overturn a ruling which protects Apple from unlocking an iPhone in a 

All the model's precision is not that high.I tired several times that the precision range is from 60% to 74% accoding to the test set.Initially, the precision of rnn model is below 50%, after I changed one the lstm layer's activation function to tanh, it rised. Comparing the good and bad classiication, I found some news and returns are classified wrong in more than one model. For example, "March 18 The following are mergers under review by the European Commission and a brief guide to the EU merger process:", "March 10 Taiwan Semiconductor Manufacturing Co Ltd". I suppose that is because actually they does not offer much information about what the news really are. They are just a introcuction to the following sentence. While I also found I should do more cleaning with the data, including brackets, date like March 3 in the sentence.And some location data I did not clean it thoroughly like YORK(New York) still remains. They make the data noisier as well. And after I choose 2016 March as my time of the data, the records is only 522 including missing value. After I removed all missing value, it only remains 489. The size is too small.I should train more and test more to get better accuracy.