Evaluating the efficiency of Twitter Sentiment Analysis as a tool of prediction for the stock market

This research builds a time series of the polarity of tweets about a cluster of firms and compares it with the time series of the same firms' stock prices. The chosen firms are Apple, Google, Nike, Nestlé, Beyond Meat, Bayer and NovaVax.
If the Twitter sentiment series had real forecasting power, the method could be a valuable addition to trading bots.
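
A minimal sketch of how such a pair of series could be assembled in base R, assuming per-tweet polarity scores are already available. The data frames `tweets` and `prices`, their column names and all values are made up for illustration and do not come from the project.

```r
# Hypothetical per-tweet polarity scores (made-up values).
tweets <- data.frame(
  date  = as.Date(c("2021-03-01", "2021-03-01", "2021-03-02", "2021-03-03")),
  score = c(0.5, -0.1, 0.3, -0.4)
)

# Hypothetical daily closing prices for the same firm (made-up values).
prices <- data.frame(
  date  = as.Date(c("2021-03-01", "2021-03-02", "2021-03-03")),
  close = c(100, 102, 101)
)

# Average the tweet scores per day to obtain the sentiment time series.
daily_sentiment <- aggregate(score ~ date, data = tweets, FUN = mean)

# Join on the date so that both series share the same time index.
series <- merge(daily_sentiment, prices, by = "date")
series <- series[order(series$date), ]
print(series)
```

Because `merge` performs an inner join by default, days with tweets but no closing price (weekends, holidays) are dropped in this sketch; how such days should be handled is a separate modelling choice.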

Download and Analysis

To download all the tweets related to the firms, I used Twitter's API. I wrote all the code in R and automated it with Windows Task Scheduler, so that the download started by itself every day at the same hour. The downloaded tweets were then automatically uploaded to OneDrive, so that I could access them at any time.

I also implemented an automated Gmail notification that confirmed the download had completed correctly and sent me some general statistics.

I then proceeded to clean, lemmatize and analyze all the data. For the sentiment analysis I used three methods:

Naive Bayes
Based on Bayes' theorem, the algorithm classifies every tweet as "positive" or "negative" using the MPQA Subjectivity Lexicon by Janyce Wiebe.
Syuzhet
Uses the syuzhet package and its dictionary of the same name to give a score to each tweet.
udpipe
Uses the udpipe package (with the MPQA Subjectivity Lexicon) to give a score to each tweet. It can take intensifiers, weakeners and modifiers into account (so that it can, for example, distinguish between "good", "very good", "quite good" and "not good").
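
To make the idea of modifier-aware scoring concrete, here is a small base R sketch. It is not the udpipe implementation and does not use the MPQA lexicon: the tiny word lists, the function name `score_text` and the `amp_weight` parameter are all assumptions chosen for illustration.

```r
# Hypothetical mini-lexicon and modifier lists (NOT the MPQA lexicon).
polarity   <- c(good = 1, bad = -1)
amplifiers <- c("very")
weakeners  <- c("quite")
negators   <- c("not")

score_text <- function(text, amp_weight = 0.5) {
  # Lowercase the text and split on anything that is not a letter.
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  total <- 0
  for (i in seq_along(tokens)) {
    # Only words found in the lexicon carry polarity.
    if (!tokens[i] %in% names(polarity)) next
    value <- polarity[[tokens[i]]]
    # Look at the word immediately before the polar word.
    previous <- if (i > 1) tokens[i - 1] else ""
    if (previous %in% amplifiers) value <- value * (1 + amp_weight)
    if (previous %in% weakeners)  value <- value * (1 - amp_weight)
    if (previous %in% negators)   value <- -value
    total <- total + value
  }
  total
}

# Scores for the four phrases used as an example above.
print(sapply(c("good", "very good", "quite good", "not good"), score_text))
```

With these made-up weights the four phrases receive four different scores, which is the distinction a plain word-count lexicon cannot make. This sketch only inspects the single preceding word; a real implementation would typically look at a wider window.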

Visualizations

The data was then visualized using the R package "ggplot2". Here are some examples of the graphs I built, using Google as the reference:

Conclusions

To evaluate whether a causal relationship exists between the tweets and the closing price of each firm, I built a test based on the Granger causality test, which I called the "Close Test". The test gave very positive results, highlighting numerous causal relationships, summarized in the next table:

where:
"n" = number of significant relationships found
"*" = number of relationships with a p-value < 0.10
"**" = number of relationships with a p-value < 0.05
"***" = number of relationships with a p-value < 0.01
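
The Close Test itself is not reproduced here. As a rough sketch of the Granger idea it is based on, the base R code below checks whether lagged sentiment improves a regression of the closing-price change on its own lags, using an F-test between the two nested models. The function name `granger_f_test`, the lag of one day and the simulated data are assumptions made for illustration only.

```r
granger_f_test <- function(y, x, lag = 1) {
  n <- length(y)
  stopifnot(length(x) == n, n > 2 * lag + 1)
  # embed() puts the current value in column 1 and lags 1..lag in the next columns.
  y_mat <- embed(y, lag + 1)
  x_mat <- embed(x, lag + 1)
  y_now  <- y_mat[, 1]
  y_lags <- y_mat[, -1, drop = FALSE]
  x_lags <- x_mat[, -1, drop = FALSE]

  restricted   <- lm(y_now ~ y_lags)           # own past only
  unrestricted <- lm(y_now ~ y_lags + x_lags)  # own past plus past sentiment
  comparison <- anova(restricted, unrestricted)
  c(F = comparison$F[2], p_value = comparison$`Pr(>F)`[2])
}

# Simulated data: the price change depends on the previous day's sentiment.
set.seed(1)
sentiment <- rnorm(200)
close_change <- c(0, 0.8 * sentiment[-200]) + rnorm(200, sd = 0.5)

result <- granger_f_test(close_change, sentiment, lag = 1)
print(result)
```

A small p-value means that past sentiment adds predictive information beyond the price's own history, which is what "Granger causality" refers to. Packaged versions of this test exist, for example `grangertest` in the lmtest package.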

The test also found that, in our cluster of firms, using only the tweets that refer to the firm's value in the stock market (for example those containing $AAPL, $GOOGL, etc.; dataset "stock") is more suitable for short-term forecasting (the closing price of the same day), while using the tweets that refer to the company in general (for example those also containing Apple, Google, etc.; dataset "score") is more suitable for longer-term forecasting (the value in the following days), as summarized by the next table:


DavideGiardini/Twitter-Sentiment-Analysis-to-predict-the-stock-market
