<h2>SPYPredictor: Sentiment Analysis</h2>
Final Project Component Shuzo Katayama, 11 December 2020

For a significant portion of this project, I would like to focus on sentiment analysis of news headlines as data to train the SPYPredictor. Broadly, I plan to take the news headlines published on a given day, run each headline through a sentiment model, average the scores across all of that day's headlines, and use the average as the news sentiment indicator for that day.

This is a much more difficult task than I previously imagined because convenient data is scarce. Datasets that cover many dates are not pre-labelled as 'positive' or 'negative', which makes it impossible to train a supervised model on them directly, while labelled datasets are often irrelevant (they consist of tweets or other text rather than newspaper headlines) or cover very few dates. To mitigate this problem as best I can, I will first train a model on labelled newspaper headlines, and then use that model to label a larger dataset as 'positive' or 'negative'. Ultimately, I will use this self-labelled larger dataset to train the SPYPredictor.

Data: \
Labelled Headlines : https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news \
Unlabelled Headlines: https://www.kaggle.com/therohk/million-headlines

Works Cited:

"6.2. Feature extraction", Scikit-Learn Documentation  \
https://scikit-learn.org/stable/modules/feature_extraction.html

"VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", by C.J. Hutto and Eric Gilbert, Georgia Institute of Technology, 2014 http://eegilbert.org/papers/icwsm14.vader.hutto.pdf

"Python | Sentiment Analysis using VADER", Geeksforgeeks.org, 2019 https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/


In [1]:
import pandas as pd
import numpy as np

In [2]:
l_headlines = pd.read_csv('FinancialHeadlines.csv', encoding = "ISO-8859-1")
l_headlines

# Encoding change from StackOverflow:
# https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python

Unnamed: 0,sentiment,headline
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower...
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...
4844,negative,Net sales of the Paper segment decreased to EU...


In [3]:
a = l_headlines.to_numpy()

In [4]:
# Convert sentiments to values: -1, negative; 0, neutral; 1, positive
for item in a:
    if item[0] == "negative":
        item[0] = -1
    elif item[0] == "neutral":
        item[0] = 0
    else:
        item[0] = 1
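
The loop above sends every label that is not "negative" or "neutral" to 1, so a misspelt or missing label would silently count as positive. A minimal sketch of a stricter mapping, assuming the three label strings shown in the table are the only valid ones (`LABEL_TO_INT` and `encode_sentiment` are hypothetical names, not part of the notebook):

```python
# Hypothetical helper: explicit label mapping that rejects unknown labels.
LABEL_TO_INT = {"negative": -1, "neutral": 0, "positive": 1}


def encode_sentiment(labels):
    """Map sentiment strings to -1, 0 or 1, raising on anything unexpected."""
    encoded = []
    for label in labels:
        if label not in LABEL_TO_INT:
            raise ValueError(f"Unexpected sentiment label: {label!r}")
        encoded.append(LABEL_TO_INT[label])
    return encoded


print(encode_sentiment(["neutral", "negative", "positive"]))  # [0, -1, 1]
```

With the notebook's DataFrame, something like `encode_sentiment(l_headlines['sentiment'])` should give the same `y` as the loop, provided the column holds only these three strings.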

In [5]:
X = a[:, 1]
y = a[:, 0]
y = y.astype('int')

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

Reference:
"6.2. Feature extraction", Scikit-Learn Documentation
https://scikit-learn.org/stable/modules/feature_extraction.html

In [7]:
# Feature extraction from documentation: https://scikit-learn.org/stable/modules/feature_extraction.html
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X).toarray()
X_vectorized

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
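
To make the bag-of-words representation concrete, here is a small sketch on two made-up headlines (the toy strings are assumptions for illustration, not rows from the dataset). Each column corresponds to one vocabulary word, and each entry counts how often that word occurs in the headline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up headlines, used only to illustrate the encoding.
toy_headlines = ["profit rose", "profit fell sharply"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_headlines).toarray()

# vocabulary_ maps each word to its column index (columns are in alphabetical order).
vocabulary = {word: int(idx) for word, idx in toy_vectorizer.vocabulary_.items()}
print(sorted(vocabulary, key=vocabulary.get))  # ['fell', 'profit', 'rose', 'sharply']
print(toy_counts.tolist())                      # [[0, 1, 1, 0], [1, 1, 0, 1]]
```

Because each headline uses only a handful of the vocabulary words, the real matrix is almost entirely zeros, which matches the output above.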

In [8]:
# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.4, random_state=0)

In [9]:
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

RandomForestClassifier(n_estimators=1000, random_state=0)
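
One caveat with the cells above is that the vectorizer is fitted on all headlines before the train/test split, so the vocabulary is built with knowledge of the test headlines. A hedged sketch of one way to avoid this, wrapping both steps in a scikit-learn `Pipeline` and splitting the raw text first (the toy headlines, their labels and the small `n_estimators` are assumptions chosen to keep the example fast):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up headlines with labels -1 (negative), 0 (neutral), 1 (positive).
toy_text = [
    "operating profit fell", "net sales decreased", "share prices ended lower",
    "company has no plans to move", "area will be developed in stages",
    "the meeting is held in march",
    "operating profit rose", "net sales increased", "orders grew strongly",
]
toy_labels = [-1, -1, -1, 0, 0, 0, 1, 1, 1]

# Split the raw text first, so the vocabulary is learned from training rows only.
text_train, text_test, label_train, label_test = train_test_split(
    toy_text, toy_labels, test_size=3, random_state=0)

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(text_train, label_train)

# Words that appear only in the test headlines are ignored at prediction time.
test_predictions = model.predict(text_test)
print(len(test_predictions))  # 3
```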

In [10]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

MSE train: 0.000, test: 0.329
R^2 train: 0.999, test: 0.154


I first trained the model with 100 trees, and it performed very poorly. It fit the training data almost perfectly, but on the test set the R<sup>2</sup> dropped to 0.107, meaning that the model overfit and did not generalise. Even with 1000 trees, the test R<sup>2</sup> was only 0.154. (R<sup>2</sup> and MSE are regression metrics; because the labels here are three classes, accuracy or a confusion matrix would measure the classifier more directly.)
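
Since the targets are class labels rather than continuous values, classification metrics may be easier to interpret than MSE and R<sup>2</sup>. A minimal sketch using scikit-learn's `accuracy_score` and `confusion_matrix` on made-up labels (the arrays below are illustrative assumptions, not results from this notebook):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels: -1 negative, 0 neutral, 1 positive.
toy_true = [-1, 0, 1, 0, 1, -1]
toy_pred = [-1, 0, 0, 0, 1, 1]

toy_accuracy = accuracy_score(toy_true, toy_pred)
# Rows are true classes, columns are predicted classes, both ordered -1, 0, 1.
toy_confusion = confusion_matrix(toy_true, toy_pred, labels=[-1, 0, 1])

print(round(toy_accuracy, 3))   # 0.667
print(toy_confusion.tolist())   # [[1, 0, 1], [0, 2, 0], [0, 1, 1]]
```

In the notebook, the same calls could be applied to `y_test` and `y_test_pred`.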

Because suitable labelled data is flawed and difficult to get hold of, I will instead use VADER. Unlike the classifier above, VADER is not trained on my data. It is a rule-based model published by researchers: it combines a preset lexicon of words with human-rated sentiment intensities and a small number of general rules that take grammar and syntax into consideration (for example negation, punctuation, capitalisation and degree modifiers) to predict sentiment in unseen text. To use headline sentiment in the SPYPredictor, I will run VADER over the large unlabelled dataset and use the resulting collection of sentiment predictions as data for the SPYPredictor.
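
The value used later in this notebook is VADER's `compound` score. As I understand the reference implementation, it is the sum of the adjusted word valences squashed into the range -1 to 1 by x / sqrt(x² + α) with α = 15. A minimal sketch of that normalisation step alone (the function name is hypothetical, and this is not the library's own code):

```python
import math


def normalise_valence_sum(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clamp to guard against floating-point overshoot for very large sums.
    return max(-1.0, min(1.0, score))


for total in (-4.0, 0.0, 1.0, 4.0):
    print(total, round(normalise_valence_sum(total), 4))
```

Under this formula, a headline whose word valences sum to 1.0 gets a compound score of 0.25, and larger sums approach but never exceed 1.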

Reference:

"VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", by C.J. Hutto and Eric Gilbert, Georgia Institute of Technology, 2014 http://eegilbert.org/papers/icwsm14.vader.hutto.pdf

In [11]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 

In [12]:
headlines = pd.read_csv('abcnews-date-text.csv', encoding = "ISO-8859-1")
headlines

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers
...,...,...
1186013,20191231,vision of flames approaching corryong in victoria
1186014,20191231,wa police and government backflip on drug amne...
1186015,20191231,we have fears for their safety: victorian premier
1186016,20191231,when do the 20s start


In [13]:
a = headlines.to_numpy()

In [33]:
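# Allocate the array of daily average scores: 6151 rows (one per day) and 2 columns (date, average score).
# np.empty leaves the values uninitialised until the loop below fills them in.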
r = 6151
c = 2
total_scores = np.empty((r, c))

In [36]:
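# Loop through the large headlines dataset and get a compound score for each headline.
# Once every headline for a particular date has been processed, store that date and its
# average score in total_scores.
# Note: counter starts at 1, the headline that triggers a date change is not scored, and the
# final date (20191231) is never written, so the averages are approximate.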

sid = SentimentIntensityAnalyzer()
publish_date = a[0, 0]
counter = 1
score_sum = 0
total_counter = 0

for item in a:
    if item[0] == publish_date:
        scores = sid.polarity_scores(item[1])
        compound_score = scores['compound']
        score_sum = score_sum + compound_score
        counter = counter+1
    else:
        total_scores[total_counter, 0] = publish_date
        total_scores[total_counter, 1] = (score_sum/counter)
        
        publish_date = item[0]
        counter = 1
        total_counter = total_counter+1
        score_sum = 0

total_scores
    

array([[ 2.00302190e+07, -1.07072864e-01],
       [ 2.00302200e+07, -1.07921200e-01],
       [ 2.00302210e+07, -9.93488000e-02],
       ...,
       [ 2.01912280e+07, -9.96333333e-03],
       [ 2.01912290e+07, -9.81674419e-02],
       [ 2.01912300e+07, -1.71289130e-01]])

Reference: "Python | Sentiment Analysis using VADER", Geeksforgeeks.org, 2019 https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
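
The loop above depends on the rows being sorted by date, does not score the first headline of each new date, and never writes the final date. A hedged sketch of an alternative that keeps a running sum and count per date, so every headline is scored exactly once (`daily_mean_scores` and the toy scorer are hypothetical; in the notebook the scorer would be VADER's compound score, and the third toy headline is made up):

```python
from collections import defaultdict


def daily_mean_scores(rows, score_fn):
    """Return {date: mean score} for an iterable of (date, headline) pairs."""
    score_sums = defaultdict(float)
    headline_counts = defaultdict(int)
    for publish_date, headline in rows:
        score_sums[publish_date] += score_fn(headline)
        headline_counts[publish_date] += 1
    return {date: score_sums[date] / headline_counts[date]
            for date in sorted(score_sums)}


def toy_score(headline):
    """Stand-in scorer for illustration only: +1 per 'rise', -1 per 'strike'."""
    words = headline.split()
    return float(words.count("rise") - words.count("strike"))


toy_rows = [
    (20030219, "air nz staff in aust strike for pay rise"),
    (20030219, "air nz strike to affect australian travellers"),
    (20030220, "pay rise approved"),
]
print(daily_mean_scores(toy_rows, toy_score))  # {20030219: -0.5, 20030220: 1.0}
```

With VADER, the call might look like `daily_mean_scores(a, lambda text: sid.polarity_scores(text)['compound'])`, using the `a` and `sid` objects defined above.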

Using VADER, I now have an average sentiment score for each date between 19 February 2003 and 30 December 2019. I will now add this data to the SPYPredictor to see whether it predicts the movement of SPY better.