# Forecasting the Stock Market using News Sentiment Analysis

In [1]:
import pandas as pd
import numpy as np
from textblob import TextBlob

### 1.A. News articles preprocessing (all-the-news-2-1.csv):
* Keep the date, title, section, and publication columns.
* Rows without a title are dropped so that every entry has a non-empty title.
* Some publishers leave the section column empty; these entries are set to “unknown”. This study focuses on financial and business news, however, and an “unknown” section makes it impossible to tell whether an article belongs to that category.
* Therefore, this study uses two news article datasets: (1) all articles with a financial or business-related section, and (2) dataset 1 combined with all articles whose section is unknown.

**keep 'date', 'title', 'section', and 'publication' columns**

In [2]:
news_fields = ['date', 'title', 'section', 'publication']
news_df = pd.read_csv('../data/all-the-news-2-1.csv', usecols=news_fields)

In [3]:
news_df.head(3)

Unnamed: 0,date,title,section,publication
0,2016-12-09 18:31:00,We should take concerns about the health of li...,,Vox
1,2016-10-07 21:26:46,Colts GM Ryan Grigson says Andrew Luck's contr...,,Business Insider
2,2018-01-26 00:00:00,Trump denies report he ordered Mueller fired,Davos,Reuters


In [4]:
news_df.keys()

Index(['date', 'title', 'section', 'publication'], dtype='object')

In [5]:
len(news_df)

2688878

**all the rows without a title are filtered to ensure that each entry has a filled title column**

In [6]:
news_df = news_df[news_df['title'].notna()]

In [7]:
len(news_df)

2688841

In [8]:
news_df['section'].value_counts()

Market News                                                                  108724
World News                                                                   108651
Business News                                                                 96395
Wires                                                                         67352
Financials                                                                    57845
                                                                              ...  
news-qs-2015-was-hottest-year-in-recorded-history-scientists-say                  1
news-qs-tzi-the-icemans-stomach-bacteria-offers-clues-on-human-migration          1
news-qs-el-chapo-captured-a-secret-interview-comes-to-light-one-day-later         1
news-qs-a-rare-chance-to-pursue-a-new-path-for-u-s-and-iran                       1
Manufacturing the Future                                                          1
Name: section, Length: 7509, dtype: int64

**Some publishers do not have a filled section column; these entries are replaced with “unknown”.**

* Here it is unclear whether NaN should be renamed to 'unknown' or whether only the 7 existing 'unknown' entries should be taken into account.

In [9]:
news_df['section'] = news_df['section'].replace(np.nan, 'unknown')

**this study uses two news article datasets, (1) with all the news articles with a financial or business-related section...**

In [10]:
valid_section_values_1 = ['Financials', 'Business News']
news_df_1 = news_df[news_df['section'].isin(valid_section_values_1)]
news_df_1 = news_df_1.reset_index(drop=True)

In [11]:
news_df_1

Unnamed: 0,date,title,section,publication
0,2019-06-17 00:00:00,"ECB's Coeure: If we decide to cut rates, we'd ...",Financials,Reuters
1,2019-06-23 00:00:00,Hudson's Bay's chairman's buyout bid pits reta...,Business News,Reuters
2,2018-12-28 00:00:00,Wells Fargo to pay $575 million in settlement ...,Business News,Reuters
3,2019-05-21 00:00:00,Factbox: Investments by automakers in the U.S....,Business News,Reuters
4,2019-02-05 00:00:00,Exclusive: Britain's financial heartland unbow...,Business News,Reuters
...,...,...,...,...
154235,2020-03-26,BlackRock says 'rebalancing into risky assets'...,Business News,Reuters
154236,2020-03-23,"Croatia cenbank rolls over liquidity boost, in...",Financials,Reuters
154237,2020-03-26,"Putin, at G20 summit, proposes lifting sanctio...",Business News,Reuters
154238,2020-03-23,"Most North Sea oil output ""in the money"", but ...",Financials,Reuters


**...and (2) all the news articles with unknown sections combined with dataset 1**

In [12]:
valid_section_values_2 = ['unknown', 'Financials', 'Business News']
news_df_2 = news_df[news_df['section'].isin(valid_section_values_2)]
news_df_2 = news_df_2.reset_index(drop=True)

In [13]:
news_df_2

Unnamed: 0,date,title,section,publication
0,2016-12-09 18:31:00,We should take concerns about the health of li...,unknown,Vox
1,2016-10-07 21:26:46,Colts GM Ryan Grigson says Andrew Luck's contr...,unknown,Business Insider
2,2016-01-27 00:00:00,Paris Hilton: Woman In Black For Uncle Monty's...,unknown,TMZ
3,2019-06-17 00:00:00,"ECB's Coeure: If we decide to cut rates, we'd ...",Financials,Reuters
4,2016-05-18 13:00:06,How to watch the Google I/O keynote live,unknown,Vox
...,...,...,...,...
1066483,2020-04-01 00:00:00,Florida Ammo Selling Out On Heels of Stay-At-H...,unknown,TMZ
1066484,2020-04-01 00:00:00,Disney Forcing Annual Pass Holders to Continue...,unknown,TMZ
1066485,2020-04-01 00:00:00,Nick Cannon Pimps Out His Impala with Custom N...,unknown,TMZ
1066486,2020-04-01 00:00:00,Pete Buttigieg Says Governors Showing More Lea...,unknown,TMZ


### 1.B. News Articles Sentiment Scores
* After preprocessing, the sentiment score features are computed. As mentioned in Section 3.2, the following sentiment scores are calculated and added to the two news article datasets: VADER, **TextBlob**, and LM, so each row (news article) has several sentiment score columns. Because the news article and price datasets are merged on the date column, the rows are grouped by date and the sentiment scores are averaged within each group.

In [14]:
def text_blob_sentiment(some_df):
    scores = []
    sentences = list(some_df['title'])
    
    for sent in sentences:
        testimonial = TextBlob(sent)
        scores.append(testimonial.sentiment.polarity)
        
    some_df['TextBlob_Sentiment'] = scores
    return some_df

In [15]:
news_df_1 = text_blob_sentiment(news_df_1)

In [16]:
news_df_2 = text_blob_sentiment(news_df_2)

In [17]:
news_df_1.head(3)

Unnamed: 0,date,title,section,publication,TextBlob_Sentiment
0,2019-06-17 00:00:00,"ECB's Coeure: If we decide to cut rates, we'd ...",Financials,Reuters,0.0
1,2019-06-23 00:00:00,Hudson's Bay's chairman's buyout bid pits reta...,Business News,Reuters,0.2
2,2018-12-28 00:00:00,Wells Fargo to pay $575 million in settlement ...,Business News,Reuters,0.0


In [18]:
news_df_2.head(3)

Unnamed: 0,date,title,section,publication,TextBlob_Sentiment
0,2016-12-09 18:31:00,We should take concerns about the health of li...,unknown,Vox,-0.333333
1,2016-10-07 21:26:46,Colts GM Ryan Grigson says Andrew Luck's contr...,unknown,Business Insider,-0.5
2,2016-01-27 00:00:00,Paris Hilton: Woman In Black For Uncle Monty's...,unknown,TMZ,-0.166667


**a group by date function has been executed. While grouping the rows by the date column, the sentiment scores will be averaged.**

In [19]:
news_df_1 = news_df_1.groupby(['date'])
news_df_2 = news_df_2.groupby(['date'])

In [20]:
news_df_1 = pd.DataFrame(news_df_1)
news_df_2 = pd.DataFrame(news_df_2)

In [21]:
print(type(news_df_1))

<class 'pandas.core.frame.DataFrame'>


In [22]:
news_df_1.head(3)

Unnamed: 0,0,1
0,2016-01-01,date ...
1,2016-01-01 00:00:00,date ...
2,2016-01-02,date ...


In [23]:
news_df_2.head(3)

Unnamed: 0,0,1
0,2016-01-01,date ...
1,2016-01-01 00:00:00,date \ 7212 2016-01...
2,2016-01-01 01:41:26,date ...
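
The output above shows that wrapping a GroupBy object in `pd.DataFrame` yields one row per group, with the key in column 0 and the whole sub-frame in column 1, and that the same calendar day appears under several keys (`2016-01-01`, `2016-01-01 00:00:00`, `2016-01-01 01:41:26`). A minimal sketch of the averaging step described in 1.B, assuming every date string starts with `YYYY-MM-DD` (as in the rows shown so far); `toy_news` and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for news_df_1 after sentiment scoring (illustrative values).
toy_news = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-01 00:00:00",
             "2016-01-01 01:41:26", "2016-01-02"],
    "title": ["a", "b", "c", "d"],
    "TextBlob_Sentiment": [0.2, 0.0, -0.5, 0.4],
})

# Keep only the YYYY-MM-DD prefix so that all articles published on the
# same calendar day share one key, then parse it into a timestamp.
day = pd.to_datetime(toy_news["date"].str[:10], format="%Y-%m-%d")

# Group by day and average the sentiment scores within each group.
daily = (
    toy_news.assign(date=day)
    .groupby("date", as_index=False)["TextBlob_Sentiment"]
    .mean()
)
print(daily)
```

The result keeps `date` as an ordinary column, so a later merge with `on='date'` can find it.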


### 2. S&P 500 preprocessing (YahooFinancials package.xls) (1785 entries before preprocessing, including the header row)
* keep columns: close price, volume, and date
* A movement column is added; this is the binary dependent variable in this study. A 1 is assigned if the index closed higher than or equal to the previous day's close, and a 0 if it closed lower.
* Since the study uses fundamental as well as technical analysis, a column with a technical indicator is added. Many technical indicators can be calculated from the features in the S&P 500 dataset; the most common one, the simple moving average (SMA), was chosen. To calculate the SMA, the closing prices over a given period are summed and divided by the number of periods: SMA = (A1 + A2 + ... + An) / n, where A1 ... An are the closing prices and n is the number of periods.
* Since predicting a stock's movement for the following day is considered a short-term prediction, the period is set to 10 days.

**keep columns: close price, volume, and date**

In [23]:
fields = ['Date', 'Close*', 'Volume']

sandp_df = pd.read_excel('../data/YahooFinancials package.xls', usecols=fields)
sandp_df.rename(columns = {'Date':'date'}, inplace = True)

In [24]:
sandp_df.head(3)

Unnamed: 0,date,Close*,Volume
0,"Jan 31, 2020",3225.52,4527830000
1,"Jan 30, 2020",3283.66,3787250000
2,"Jan 29, 2020",3273.4,3584500000


**Another column has been added with the movement, this is the binary dependent variable in this study. A 1 is assigned if the stock market value went up or remained the same compared to the day before and 0 if the stock market values went down.**

In [25]:
def calculate_movement(some_df):
    
    close_list = list(some_df['Close*'])
    
    movement = []  
    for i in range(len(close_list)-1):
        if close_list[i] < close_list[i+1]:
            movement.append(0)
        else:
            movement.append(1)
    movement.append(1)   # the last row is the oldest day and has no previous close to compare with; set to 1 (remained the same)
    some_df['movement'] = movement
    return some_df

In [26]:
sandp_df = calculate_movement(sandp_df)

In [27]:
sandp_df.tail(3)

Unnamed: 0,date,Close*,Volume,movement
1781,"Jan 03, 2013",1459.37,3829730000,0
1782,"Jan 02, 2013",1462.42,4202600000,1
1783,"Dec 31, 2012",1426.19,3204330000,1


In [28]:
sandp_df.head(15)

Unnamed: 0,date,Close*,Volume,movement
0,"Jan 31, 2020",3225.52,4527830000,0
1,"Jan 30, 2020",3283.66,3787250000,1
2,"Jan 29, 2020",3273.4,3584500000,0
3,"Jan 28, 2020",3276.24,3526720000,1
4,"Jan 27, 2020",3243.63,3823100000,0
5,"Jan 24, 2020",3295.47,3707130000,0
6,"Jan 23, 2020",3325.54,3764860000,1
7,"Jan 22, 2020",3321.75,3619850000,1
8,"Jan 21, 2020",3320.79,4105340000,0
9,"Jan 17, 2020",3329.62,3698170000,1
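
The same labelling rule can also be written without an explicit loop. The following is only a sketch on the first five closing prices shown above; it assumes the rows stay in newest-first order, and `toy` is a hypothetical stand-in for `sandp_df`:

```python
import pandas as pd

# First five closes from the table above, newest day first.
toy = pd.DataFrame({"Close*": [3225.52, 3283.66, 3273.40, 3276.24, 3243.63]})

# With newest-first ordering, shift(-1) pulls the previous trading day's
# close into the current row.
prev_close = toy["Close*"].shift(-1)

# 1 if the close rose or stayed flat, 0 if it fell.
toy["movement"] = (toy["Close*"] >= prev_close).astype(int)

# The oldest row has no predecessor; set it to 1, as calculate_movement does.
toy.loc[prev_close.isna(), "movement"] = 1
print(toy)
```

The first four labels agree with the notebook output; the fifth differs only because the toy frame ends there and that row therefore has no earlier day to compare with.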


**To calculate the SMA, all the closing prices are summed up over a given period and divided by the number of periods SMA = (A1 + A2 + ... + An) / n.**

**Since the prediction of a stock for the following day is considered as a short-term prediction, the period is set to 10 days**

In [29]:
def calculate_SMA(some_df):
    
    close_list = list(some_df['Close*'])
   
    sma = []
    for i in range(len(close_list)-1):
        previous = close_list[i+1:i+11]
        sma.append( (sum(previous))/len(previous) )

    sma.append(close_list[-1])  # the last row is the oldest day and has no earlier closes; use its own close
    some_df['SMA'] = sma
    return some_df

In [30]:
sandp_df = calculate_SMA(sandp_df)

In [31]:
sandp_df.head(3)

Unnamed: 0,date,Close*,Volume,movement,SMA
0,"Jan 31, 2020",3225.52,4527830000,0,3298.691
1,"Jan 30, 2020",3283.66,3787250000,1,3299.254
2,"Jan 29, 2020",3273.4,3584500000,0,3300.229


In [32]:
sandp_df.tail(3)

Unnamed: 0,date,Close*,Volume,movement,SMA
1781,"Jan 03, 2013",1459.37,3829730000,0,1444.305
1782,"Jan 02, 2013",1462.42,4202600000,1,1426.19
1783,"Dec 31, 2012",1426.19,3204330000,1,1426.19
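
Note that `calculate_SMA` averages the closes of the days before the current one (up to 10 of them) and excludes the current day's close. A possible vectorized equivalent using pandas rolling windows is sketched below; the helper name `sma_previous_days` is hypothetical, and the short window of 3 is chosen only to keep the toy example small:

```python
import pandas as pd

def sma_previous_days(close_desc, window=10):
    """Mean of up to `window` closes before each day; input is newest-first."""
    chrono = close_desc.iloc[::-1].reset_index(drop=True)   # oldest first
    # shift(1) excludes the current day, so each window holds earlier closes only.
    sma = chrono.shift(1).rolling(window, min_periods=1).mean()
    # The oldest day has no history; fall back to its own close.
    sma = sma.fillna(chrono)
    return sma.iloc[::-1].reset_index(drop=True)            # back to newest first

# Closing prices from the head(15) output above, newest day first.
closes = pd.Series([3225.52, 3283.66, 3273.40, 3276.24, 3243.63,
                    3295.47, 3325.54, 3321.75, 3320.79, 3329.62])
sma = sma_previous_days(closes, window=3)
print(sma)
```

As in the loop version, the window shrinks near the oldest rows instead of producing missing values.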


In [33]:
print(len(sandp_df))

1784


In [34]:
# normalize datetime

### 3. Merge Datasets:
* After preprocessing, the price dataset is left joined to each news article dataset, creating two datasets that serve as input and output for the machine learning and deep learning models.
* The datasets are merged on the date column. The two resulting datasets are fully preprocessed and both contain 1,096 rows. Because both datasets consist of the same features, only one table has been created to give an overview of all the features.
* The dependent variable has 589 entries where the stock market went down and 480 entries where it went up or remained the same.
* One small adjustment has been made to the final datasets: standardization (or Z-score normalization), which rescales each feature to have zero mean and unit variance, the properties of a standard normal distribution (formula on page 20). This may help to minimize dissimilarities between the datasets.

In [35]:
#merge_1 = news_df_1.merge(df,d, how='inner', left_index=True, right_index=True)

In [36]:
merge_1 = news_df_1.merge(sandp_df, how='left', on='date')

KeyError: 'date'
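
The `KeyError` arises because `news_df_1` no longer has a `date` column after the grouping step. Once the news data has one row per day with `date` as a regular column, the left join described above could look like the sketch below. The frames `daily_news` and `prices` are toy stand-ins with made-up sentiment values, and the `%b %d, %Y` format assumes an English locale for month abbreviations:

```python
import pandas as pd

# Daily average sentiment, one row per calendar day (illustrative values).
daily_news = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-29", "2020-01-30",
                            "2020-01-31", "2020-02-01"]),
    "TextBlob_Sentiment": [0.10, -0.05, 0.00, 0.20],
})

# Price rows in the string format shown in Section 2.
prices = pd.DataFrame({
    "date": ["Jan 31, 2020", "Jan 30, 2020", "Jan 29, 2020"],
    "Close*": [3225.52, 3283.66, 3273.40],
})

# Parse "Jan 31, 2020" style strings so both merge keys are timestamps.
prices["date"] = pd.to_datetime(prices["date"], format="%b %d, %Y")

# Left join: keep every news day, attach prices where a trading day exists.
merged = daily_news.merge(prices, how="left", on="date")

# Days without a trading session (here 2020-02-01, a Saturday) have NaN prices.
trading_days = merged.dropna(subset=["Close*"])
print(merged)
```

Whether non-trading days are dropped or handled differently is a modelling decision that the text above does not specify.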

In [None]:
# date, title, section, publication, TextBlob_Sentiment
# Date, Close*, Volume, movement, SMA
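
The standardization step mentioned above rescales each feature x to z = (x - mean) / std. A minimal sketch on a toy frame built from the first S&P 500 rows, assuming the population standard deviation (`ddof=0`) is intended, which is the convention scikit-learn's `StandardScaler` uses:

```python
import pandas as pd

# Four rows of price features taken from the tables above.
features = pd.DataFrame({
    "Close*": [3225.52, 3283.66, 3273.40, 3276.24],
    "Volume": [4527830000, 3787250000, 3584500000, 3526720000],
})

# z = (x - mean) / std, computed column by column.
standardized = (features - features.mean()) / features.std(ddof=0)
print(standardized)
```

In a modelling pipeline the mean and standard deviation would normally be estimated on the training split only and then reused for the test split.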

### 4/5. Training, Validation, and Test Set
* The dataset for this study has been divided into two parts: 80% for training and 20% for testing the models.
* Hence, the training data contains 855 observations and the test data 214 observations.
* Real-world datasets should use stratified 10-fold cross-validation (Kohavi, 1995). Stratification ensures that the training and test folds include the same proportions of both target groups, which in this study are the upward and downward movements of the S&P 500 index price. Cross-validation supports the model's validity and reliability; see Section 4.3 for a more comprehensive explanation. The validation set was used as part of a grid search, covered in more detail in Subsection 4.2.3.
* Each of these algorithms was used to create models, and a grid search was applied to find the best model settings, as described in the following subsection. Subsequently, the models were evaluated, and the best-performing model for predicting daily S&P 500 movements was selected.
* Automated hyperparameter optimization (HPO) has been used in this study. Grid search, also known as full factorial design, is a commonly used HPO method that automatically evaluates every combination in a predefined hyperparameter grid and selects the best-scoring one.
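
The split and search described above can be sketched with scikit-learn. Everything below is illustrative: the features are random numbers, `LogisticRegression` and its `C` grid are stand-ins chosen for brevity (the algorithms actually used are not listed in this section), and only the class counts (589 down, 480 up) are taken from the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in with the reported class counts: 589 down (0), 480 up (1).
y = np.array([0] * 589 + [1] * 480)
X = rng.normal(size=(len(y), 2))   # two made-up features

# 80/20 split that keeps the up/down proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Exhaustive search over a small hypothetical grid, scored with
# stratified 10-fold cross-validation on the training data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(len(X_train), len(X_test))
print(search.best_params_)
```

With 1,069 labelled rows, this split gives 855 training and 214 test observations, matching the counts stated above.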