# Daily news for stock market prediction

To practice natural language processing, I downloaded the news and stock dataset from Kaggle (https://www.kaggle.com/aaron7sun/stocknews). The dataset provides two kinds of data:
- News data: Historical news headlines from the Reddit WorldNews channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are kept for each date. (Range: 2008-06-08 to 2016-07-01)
- Stock data: Dow Jones Industrial Average (DJIA) from Yahoo Finance is used to "prove the concept". (Range: 2008-08-08 to 2016-07-01)

In this practice I use the data above to train models that predict whether the stock market goes up or down on a given day.

In [123]:
# import libraries needed
import pandas as pd
import numpy as np
import nltk
#from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jing\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jing\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jing\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Jing\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# load data
news = pd.read_csv('stocknews/RedditNews.csv')
stock = pd.read_csv('stocknews/DJIA_table.csv')

In [3]:
# convert news date to datetime format and filter news only in work days
news['Date'] = pd.to_datetime(news.Date, format='%Y-%m-%d')
news = news[news.Date.dt.dayofweek<5].reset_index()
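
The weekday filter above can be checked on a toy frame. This is only a sketch: the four headlines are made up, and only the `Date`/`News` column names are taken from the notebook.

```python
import pandas as pd

# Toy frame with hypothetical headlines; 2016-07-02 and 2016-07-03 fall on a weekend.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-07-01", "2016-07-02", "2016-07-03", "2016-07-04"]),
    "News": ["friday headline", "saturday headline", "sunday headline", "monday headline"],
})

# dayofweek is 0 for Monday ... 6 for Sunday, so "< 5" keeps Monday to Friday.
weekdays = toy[toy.Date.dt.dayofweek < 5].reset_index(drop=True)
print(weekdays)
```

Note that Monday 2016-07-04 passes this filter although it was a US market holiday. In this notebook such days are only dropped later, when the news is merged with the stock table and rows without stock data are removed by `dropna()`.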


In [5]:
# nltk preprocess modules
tokenizer = RegexpTokenizer(r'\w+')  # raw string, so that \w is not treated as an invalid string escape
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
senti_analyze = SentimentIntensityAnalyzer()

In [9]:
# preprocess news: lowercase, tokenize, remove stopwords, stem and lemmatize, then run sentiment analysis on the
# cleaned text to get the negative, neutral, positive and compound scores of each news headline
News_token = []
News_stem = []
News_lem = []
neg = []
neu = []
pos = []
compound = []
news['News_lower'] = news['News'].str.lower()

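stop_words = set(stopwords.words('english'))  # build the stopword set once; reloading it for every token is very slow

for i in range(len(news)):
    lower = news['News_lower'][i]
    token = [t for t in tokenizer.tokenize(lower) if t not in stop_words]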
    News_token.append(token)
    stem = [stemmer.stem(word) for word in token]
    News_stem.append(stem)
    lem = [lemmatizer.lemmatize(word) for word in stem]
    News_lem.append(lem)
    News_lem_sent = ' '.join(lem)
    scores = senti_analyze.polarity_scores(News_lem_sent)
    neg.append(scores['neg'])
    neu.append(scores['neu'])
    pos.append(scores['pos'])
    compound.append(scores['compound'])
    
news['News_token'] = News_token 
news['News_stem'] = News_stem
news['News_lem'] = News_lem
news['neg'] = neg
news['neu'] = neu
news['pos'] = pos
news['compound'] = compound
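
The tokenize, stopword and stemming steps of the loop above can be tried on a single headline. The sketch below makes some assumptions: `STOP_WORDS` is a hypothetical mini list standing in for `stopwords.words('english')` so that nothing has to be downloaded, `preprocess` is an illustrative helper name, and the lemmatization and VADER scoring steps are left out because they need the `wordnet` and `vader_lexicon` data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical mini stopword list (assumption), used so the sketch runs
# without nltk.download('stopwords').
STOP_WORDS = {"a", "as", "in", "of", "the"}

tokenizer = RegexpTokenizer(r"\w+")
stemmer = PorterStemmer()


def preprocess(headline):
    """Lowercase, tokenize, drop stopwords and stem one headline."""
    tokens = [t for t in tokenizer.tokenize(headline.lower()) if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]


stems = preprocess("IMF chief backs Athens as permanent Olympic host")
print(stems)
```

For this headline the result should match the `News_stem` entry of row 1 in the `news.head()` output.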

In [48]:
news.head()

Unnamed: 0,index,Date,News,News_lower,News_token,News_stem,News_lem,neg,neu,pos,compound
0,0,2016-07-01,A 117-year-old woman in Mexico City finally re...,a 117-year-old woman in mexico city finally re...,"[117, year, old, woman, mexico, city, finally,...","[117, year, old, woman, mexico, citi, final, r...","[117, year, old, woman, mexico, citi, final, r...",0.163,0.837,0.0,-0.5994
1,1,2016-07-01,IMF chief backs Athens as permanent Olympic host,imf chief backs athens as permanent olympic host,"[imf, chief, backs, athens, permanent, olympic...","[imf, chief, back, athen, perman, olymp, host]","[imf, chief, back, athen, perman, olymp, host]",0.0,1.0,0.0,0.0
2,2,2016-07-01,"The president of France says if Brexit won, so...","the president of france says if brexit won, so...","[president, france, says, brexit, donald, trump]","[presid, franc, say, brexit, donald, trump]","[presid, franc, say, brexit, donald, trump]",0.0,1.0,0.0,0.0
3,3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...,british man who must give police 24 hours' not...,"[british, man, must, give, police, 24, hours, ...","[british, man, must, give, polic, 24, hour, no...","[british, man, must, give, polic, 24, hour, no...",0.303,0.591,0.105,-0.5187
4,4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...,100+ nobel laureates urge greenpeace to stop o...,"[100, nobel, laureates, urge, greenpeace, stop...","[100, nobel, laureat, urg, greenpeac, stop, op...","[100, nobel, laureat, urg, greenpeac, stop, op...",0.239,0.761,0.0,-0.296


In [28]:
# select the date and scores from news, and convert the scores to numeric
# (.copy() makes news_sentiment an independent DataFrame, which avoids the SettingWithCopyWarning)
score_cols = ['neg','neu','pos','compound']
news_sentiment = news[['Date'] + score_cols].copy()
news_sentiment[score_cols] = news_sentiment[score_cols].apply(pd.to_numeric, errors='coerce')


In [14]:
news_sentiment.head(10)

Unnamed: 0,Date,neg,neu,pos,compound
0,2016-07-01,0.163,0.837,0.000,-0.5994
1,2016-07-01,0.000,1.000,0.000,0.0000
2,2016-07-01,0.000,1.000,0.000,0.0000
3,2016-07-01,0.303,0.591,0.105,-0.5187
4,2016-07-01,0.239,0.761,0.000,-0.2960
5,2016-07-01,0.329,0.420,0.252,-0.4767
6,2016-07-01,0.187,0.813,0.000,-0.3182
7,2016-07-01,0.000,0.592,0.408,0.6369
8,2016-07-01,0.000,1.000,0.000,0.0000
9,2016-07-01,0.402,0.598,0.000,-0.6908


In [19]:
# label each trading day 1 if the adjusted close is higher than the open, otherwise 0
# convert date to datetime
stock['label']= (stock['Open']<stock['Adj Close']).astype(int)
stock['Date'] = pd.to_datetime(stock.Date, format='%Y-%m-%d')
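
As a quick check of the labelling rule, here is a minimal sketch that applies it to two rows copied from the DJIA table (2016-06-28 and 2016-06-27). The name `toy_stock` is only illustrative.

```python
import pandas as pd

# Two rows taken from the stock table; only the columns the label needs.
toy_stock = pd.DataFrame({
    "Date": ["2016-06-28", "2016-06-27"],
    "Open": [17190.509766, 17355.210938],
    "Adj Close": [17409.720703, 17140.240234],
})

# 1 when the adjusted close is above the open (an "up" day), otherwise 0.
toy_stock["label"] = (toy_stock["Open"] < toy_stock["Adj Close"]).astype(int)
print(toy_stock)
```

The first row is an up day (label 1) and the second a down day (label 0), which agrees with the `stock.head()` output.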

In [20]:
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,label
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141,1
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234,1
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688,1
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703,1
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234,0


In [None]:
classifiers = [
    KNeighborsClassifier(),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    SGDClassifier(),
    SVC(),
    GaussianProcessClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    MLPClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

In [230]:
scores = []
for j in range(1,25):
    # drop the last j headlines of each date (the lowest-ranked ones, assuming the rows are in rank order),
    # then average the scores of the remaining top (25 - j) headlines per day
    news_sentiment_tail = news_sentiment.drop(news_sentiment.groupby('Date').tail(j).index)
    test_news = news_sentiment_tail.groupby('Date').agg({'neg':'mean','neu':'mean','pos':'mean','compound':'mean'}).reset_index()
    # merge the news and stock tables, and extract time-related features: year, month, day and day of week
    news_stock = pd.merge(test_news, stock, on='Date', how='left').dropna().reset_index()
    news_stock['year'] = news_stock.Date.dt.year
    news_stock['month'] = news_stock.Date.dt.month
    news_stock['day'] = news_stock.Date.dt.day
    news_stock['dayofweek'] = news_stock.Date.dt.dayofweek
    # use the daily compound scores of the previous 7 trading days as lag features;
    # shift(k) moves each score down k rows, so the first k rows become NaN (they are dropped below)
    for k in range(1, 8):
        news_stock['compound{}'.format(k)] = news_stock['compound'].shift(k)

    # in earlier tests, the compound scores of yesterday and the day before, plus the time-related features,
    # predicted today's stock market up/down best, with a 56% accuracy score on test data;
    # the feature set below uses lags 1 to 4
    news_stock_model = news_stock[['label','compound1','compound2','compound3','compound4','year','month','day','dayofweek']][7:]
    X = news_stock_model.drop(['label'],axis=1)
    y = news_stock_model['label']
    # chronological split instead of a random one: the first 1600 days train the model, the rest test it
    X_train, X_test, y_train, y_test = (X[:1600],X[1600:],y[:1600],y[1600:])

    score = []
    score_comparison = pd.DataFrame()
    for clf in classifiers:
        clf.fit(X_train, y_train)
        score.append(clf.score(X_test, y_test))

    score_comparison['classifier'] = classifiers
    score_comparison['tail_removed'] = j
    score_comparison['test_score'] = score
    scores.append(score_comparison)
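
Two steps of the loop above are easy to misread: dropping the last `j` headlines per date, and building lagged `compound` features. The sketch below reproduces both on made-up numbers, with 3 headlines per date instead of 25. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical scores: 3 headlines per date, listed in rank order.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-06-30"] * 3 + ["2016-07-01"] * 3),
    "compound": [0.5, -0.1, 0.3, -0.4, 0.2, 0.8],
})

j = 1  # number of lowest-ranked headlines to drop per date

# Dropping the last j rows of each date ...
dropped_tail = toy.drop(toy.groupby("Date").tail(j).index)
# ... keeps the same rows as taking the first (3 - j) rows of each date.
kept_head = toy.groupby("Date").head(3 - j)

# Daily mean of the remaining headlines, sorted by date.
daily = dropped_tail.groupby("Date")["compound"].mean().reset_index()

# Lag feature: yesterday's daily score; the first day has no predecessor (NaN).
daily["compound1"] = daily["compound"].shift(1)
print(daily)
```

The equivalence with `head()` only holds when every date has the same number of headlines, which is why the toy data uses exactly 3 per date.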


In [247]:
# the best test result
all_scores = pd.concat(scores)  # concatenate once and reuse
best_test_result = all_scores[all_scores['test_score']==all_scores['test_score'].max()]
best_test_result

Unnamed: 0,classifier,tail_removed,test_score
13,"(DecisionTreeClassifier(class_weight=None, cri...",20,0.581152


In [270]:
print("The top {} news are picked to train the model".format(25 - best_test_result['tail_removed'].values.tolist()[0]))
print("The best algorithm is: ", classifiers[best_test_result.index.tolist()[0]])
print("The highest test score is: ",best_test_result['test_score'].values.tolist()[0])

The top 5 news are picked to train the model
The best algorithm is:  AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
The highest test score is:  0.581151832460733
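
The lookup above relies on each per-`j` table keeping its own 0-based index, so that the index of the best row doubles as a position in `classifiers`. A possible alternative, sketched here on made-up score tables (the classifier names "A" and "B" and all scores are hypothetical), stores everything needed in columns and picks the best row with `idxmax`.

```python
import pandas as pd

# Hypothetical score tables, one per value of tail_removed (values are made up).
scores = [
    pd.DataFrame({"classifier": ["A", "B"], "tail_removed": 1, "test_score": [0.51, 0.53]}),
    pd.DataFrame({"classifier": ["A", "B"], "tail_removed": 2, "test_score": [0.55, 0.52]}),
]

# ignore_index=True gives every row a unique label, so idxmax is unambiguous.
all_scores = pd.concat(scores, ignore_index=True)
best = all_scores.loc[all_scores["test_score"].idxmax()]

print(best["classifier"], best["tail_removed"], best["test_score"])
```

Unlike the boolean mask used in the notebook, `idxmax` returns only the first row when several rows tie for the maximum.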
