# Portfolio Piece 5

## Predicting the Behavior of Financial Markets from News Headlines


### Introduction
Behavioral economics holds that emotions play an important role in human decision making and, through investor and social mood, influence financial markets. In recent years, a variety of methods have been investigated for computing indicators of public sentiment and mood from large-scale online data rather than from traditional surveys. As a result, financial market prediction based on online sentiment tracking has drawn a lot of attention.
One such online data source is news media content, which has been shown to be an important factor in shaping investor sentiment. In this project, the relationship between news headlines and the market was investigated to explore the possibility of predicting market trends from news.
Three measures were selected as market indicators: the Dow Jones Industrial Average (DJIA), the S&P 500 Volatility Index (VIX), and the gold price. The data for these indicators were collected from the Yahoo Finance website and http://www.gold.org/ for the years 2008 to 2016.

### Methods
Two news headline datasets were used in this project. The archive of the New York Times website was scraped to collect the top business news headlines for each day from January 2016 to January 2017.

A more comprehensive dataset was downloaded from Kaggle. It contains the top 25 business headlines from Reddit, which aggregates a variety of sources, for 2008 to 2016.
To get the sentiment of the stories and news headlines, two dictionaries of negative words were used. Both dictionaries were downloaded from http://www3.nd.edu/~mcdonald/Word_Lists.html.
One dictionary contains general negative words from the Harvard Psychosociological Dictionary (~4000 words). The other was developed by Loughran and McDonald at the University of Notre Dame (Mendoza College of Business) and contains negative words specific to financial texts (~2100 words).

To quantify the negative emotional content of a news headline, the ratio of negative words was calculated as the number of words in the headline that appear in the negative dictionary divided by the total number of words in the headline. These ratios were then summed over all headlines from the same day and divided by the number of headlines on that day, yielding the negative news sentiment score (NNSS) for that day.
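The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names, the regex tokenizer, and the toy word list and headlines are all assumptions made for the example.

```python
import re

def negative_ratio(headline, negative_words):
    """Fraction of words in one headline that appear in the negative word list."""
    # Assumed tokenization: lowercase alphabetic tokens (the source does not specify one).
    words = re.findall(r"[a-z']+", headline.lower())
    if not words:
        return 0.0
    return sum(w in negative_words for w in words) / len(words)

def daily_nnss(headlines, negative_words):
    """Mean negative-word ratio over all headlines from a single day."""
    if not headlines:
        return 0.0
    return sum(negative_ratio(h, negative_words) for h in headlines) / len(headlines)

# Hypothetical toy inputs, not taken from the project data.
negative_words = {"crisis", "loss", "fraud"}
headlines = ["Banking crisis deepens", "Markets rally on earnings"]
score = daily_nnss(headlines, negative_words)
```

In practice the word set would be loaded from one of the two dictionaries, and the score would be computed once per day and per dictionary.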

The correlation between the negative news sentiment score and each market indicator was calculated to investigate the possibility of predicting the market with a linear regression, using the negative news sentiment score as the predictor and the market indicator as the dependent variable.
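As a rough illustration of this step, the sketch below computes a Pearson correlation and a one-variable least-squares line without any external libraries. The helper names and the toy series are hypothetical, and the project may well have used a data-analysis library instead.

```python
from math import sqrt

def _centered_sums(x, y):
    """Means and centered sums of squares/products for two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxy, sxx, syy

def pearson(x, y):
    """Pearson correlation coefficient between x and y."""
    _, _, sxy, sxx, syy = _centered_sums(x, y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(x, y)
    if sxx == 0:
        raise ValueError("cannot fit a line to a constant predictor")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical toy series: daily sentiment scores and an indicator's daily values.
nnss = [0.02, 0.05, 0.03, 0.08]
vix = [15.0, 22.0, 18.0, 30.0]
r = pearson(nnss, vix)
slope, intercept = fit_line(nnss, vix)
```

A correlation close to zero means the fitted line explains very little of the indicator's variation, so it is worth checking before relying on a linear model.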

The other approach was to turn this into a classification problem. If the closing index or price of the chosen market indicator was higher than or equal to its opening value for that day, the market was considered to have had a positive day and that day was labeled as class one. Days with a lower closing value were labeled as class zero.
In this approach, all the words in the news headlines for that day were used for the classification.
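The labeling rule above ("higher than or equal to the opening value") can be sketched as follows. The function name, the row layout, and the sample values are assumptions for illustration only.

```python
def label_day(open_value, close_value):
    """Return 1 for a positive day (close >= open), otherwise 0."""
    return 1 if close_value >= open_value else 0

# Hypothetical rows; the dates and values are made up for illustration.
days = [
    {"date": "2016-01-04", "open": 100.0, "close": 101.5},
    {"date": "2016-01-05", "open": 101.5, "close": 99.0},
    {"date": "2016-01-06", "open": 99.0, "close": 99.0},
]
labels = [label_day(d["open"], d["close"]) for d in days]
```

Under this rule, a day that closes exactly at its opening value counts as positive.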

### Results and Discussions
As shown in Figure 1, the VIX index and the gold price show almost opposite trends to the DJIA index. This is expected, since higher market volatility translates into lower DJIA prices while investment moves toward gold, raising gold prices. The negative news sentiment score, however, is very noisy, with no apparent trend except for a few spikes over this period. Nevertheless, the peaks in the negative news sentiment score coincide in time with the volatility peaks.

Figure 1. Gold price, DJIA index, VIX index, and negative news sentiment score for news headlines (January 2008 - January 2017)

![trend1](images/Capstone_trend1.png)
![trend2](images/Capstone_trend2.png)

To calculate the sentiment of the news headlines, a list of negative words was used. The most frequent words in the negative word list and the most frequent words in the news headlines are shown below in Figure 2.

Figure 2. Most frequent words in the negative dictionary list and in the headlines
![word1](images/Capstone_word1.png)


Most frequent words in news headlines from 2008-2016
![word2](images/Capstone_word2.png)

Only a very weak correlation was found between the negative news sentiment score and the market indicators. However, the signs of the correlations are in the expected direction: a higher negativity score was associated with a higher gold price, a higher volatility index, and a lower DJIA index, all of which indicate negative trends in financial markets. The volatility index (VIX) showed the highest correlation among the market indicators and was therefore used for predictions.

Table 1. Correlation between market indicators and negative news sentiment score

``` python

                   NNSS      NNSS2*  DJIA_Index   VIX_Index  Gold price
NNSS           1.000000    0.723482   -0.055177    0.081742    0.025809
NNSS2*         0.723482    1.000000   -0.076706    0.110759    0.036842
DJIA_Index    -0.055177   -0.076706    1.000000   -0.681593    0.182802
VIX_Index      0.081742    0.110759   -0.681593    1.000000   -0.408126
Gold price     0.025809    0.036842    0.182802   -0.408126    1.000000
* NNSS2: 2-day moving average of the negative news sentiment score

``` 
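The NNSS2 series in Table 1 is a 2-day moving average of the daily score. A minimal sketch is given below, assuming a trailing window (the current day and the previous day); the source does not say whether the window was trailing or centered, and the function name is hypothetical.

```python
def moving_average(values, window=2):
    """Trailing moving average; positions without a full window are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet for a full window.
            averaged.append(None)
        else:
            averaged.append(sum(values[i - window + 1:i + 1]) / window)
    return averaged

# Hypothetical daily scores, for illustration only.
nnss2 = moving_average([2.0, 4.0, 6.0], window=2)
```

With a trailing window, each NNSS2 value mixes the current day's score with the previous day's score.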

Because the data are time series, Granger causality was also investigated. The negative news sentiment score was found to Granger-cause the VIX index at a time lag of one day (p-value < 0.01). This means that the previous day's negative news sentiment score can improve predictions of the VIX index for the next day. This is consistent with the correlation matrix above, which shows a higher correlation between the VIX index and the 2-day moving average of the negative news sentiment score (NNSS2).

Because the correlations between the derived headline sentiment and the market indicators were very low, linear regression did not produce satisfactory results. As an alternative, an attempt was made to predict whether the market would react positively or negatively based on each day's news headlines. The difference between the closing and opening prices of the DJIA was calculated for each day. If the DJIA closing price was higher than its opening price, the market had a positive day and the day was labeled 1; otherwise it was labeled 0. A bag-of-words representation of each day's headlines was then used to classify the day as 1 or 0.
Random forest and support vector machine (SVM) models showed the most promising results in terms of accuracy and AUC score. Note that the baseline accuracy for this dataset is around 0.53. In addition to single words, two-word phrases were also considered in these models (n-gram range of 1 to 2).
``` python
Model          AUC score  Accuracy score
Random Forest   0.63      0.64
SVM             0.63      0.65
```
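To make the feature representation and the baseline concrete, here is a small dependency-free sketch of a bag-of-words with one- and two-word n-grams, plus a majority-class baseline. It assumes that the reported baseline is the share of the more frequent class, which the source does not state explicitly. The function names and the whitespace tokenizer are hypothetical, and this is not the library code used to train the models.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def bag_of_words(text):
    """Count unigrams and bigrams in a lowercased, whitespace-split text."""
    return Counter(ngrams(text.lower().split()))

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical inputs, for illustration only.
features = bag_of_words("Oil prices fall")
baseline = majority_baseline([1, 1, 0, 1])
```

Each day's feature vector would be built from the concatenated headlines for that day, and a classifier is only useful to the extent that it beats the baseline.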
As mentioned above, the market indicators correlated more strongly with headline sentiment from the previous day than with same-day headline sentiment. Using the headlines from the previous day in addition to the same-day headlines increased the accuracy of the random forest model to 0.69.
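One simple way to include the previous day's headlines is to concatenate them with the same-day text before building features. The sketch below assumes the rows are sorted by date and that consecutive rows are consecutive trading days; the function name and sample data are hypothetical, and the project may have combined the days differently.

```python
def with_previous_day(daily_text):
    """Pair each day (except the first) with previous-day plus same-day text.

    daily_text is a list of (date, text) tuples sorted by date.
    """
    combined = []
    for i in range(1, len(daily_text)):
        date, text = daily_text[i]
        previous_text = daily_text[i - 1][1]
        combined.append((date, previous_text + " " + text))
    return combined

# Hypothetical daily headline text, for illustration only.
daily_text = [
    ("2016-01-04", "stocks slide"),
    ("2016-01-05", "oil rebounds"),
    ("2016-01-06", "banks report losses"),
]
combined = with_previous_day(daily_text)
```

The first day is dropped because it has no predecessor, so the labels must be aligned to the remaining dates.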

### Conclusion
Using news headlines improved the prediction of market trends relative to the baseline (accuracy of up to 0.69 versus a baseline of about 0.53).

None of the most predictive words in the model come from the negative word lists, and the correlation between the negative news sentiment score and the market indicators is very low. Together, these findings suggest that the negative word dictionaries used were not appropriate, or that a better weighting method is required for calculating the negative news sentiment score.

### Recommendations 
Using more market-focused news websites, such as Bloomberg, Forbes, or the Financial Times, would provide more market-related headlines.
A better weighting method or better word lists might be required for calculating the negative news sentiment score.
The weak signal from news headlines could be amplified by combining it with other online sentiment sources, such as search engine data (for example, Google Insights for Search) or social media data (for example, Twitter).