# Capstone: Trump vs Stock Market

# Data Collection Notebook

## Contents
- [Problem statement](#Problem-statement)
- [Data dictionary](#Data-dictionary)
- [Importing Trump twitter dataset](#Importing-Trump-twitter-dataset)
- [Data Cleaning](#Data-Cleaning)
- [Feature engineering](#Feature-engineering)
- [Identifying positive class](#Identifying-positive-class)
- [Train Validation and Holdout Data](#Train-Validation-and-Holdout-Data)

## Problem statement

The US market team at a local bank has seen literature on models that are able to predict market movement based on tweets by Donald Trump:
- [JP Morgan creating a 'Volfefe' index to track tweets vs bond interest rates](https://www.marketwatch.com/story/are-trump-tweets-influencing-bond-volatility-jp-morgans-volfefe-index-aims-to-find-out-2019-09-09)
- [On days when President Trump tweets a lot, the stock market falls, Bank of America](https://www.cnbc.com/2019/09/03/on-days-when-president-trump-tweets-a-lot-the-stock-market-falls-investment-bank-finds.html)

They have tasked the data science team with building a classification model that uses Natural Language Processing to predict whether Donald Trump's tweets are market moving. The models used are:
- Logistic Regression
- XGBoost
- Long Short Term Memory Neural Network

The models are evaluated on:
- accuracy (% of predictions the model gets correct, covering both significant and non-significant movements)
- precision (% of predicted significant movements that are actually significant)
- sensitivity (% of actual significant movements that the model predicts as significant)

The best performing model is then tested on the holdout csv.
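
As a minimal sketch of the three evaluation metrics, the snippet below computes them from true and predicted labels, where 1 marks a significant movement. The function name `classification_metrics` and the toy labels are assumptions made for illustration; they are not taken from the modelling notebooks.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and sensitivity for binary labels (1 = significant move)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "sensitivity": sensitivity}


# Toy labels, made up for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target, accuracy alone can look high even when few significant movements are caught, which is why precision and sensitivity are tracked alongside it.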

### Datasets:
- Tweets from Donald Trump, May 2009 to June 2020
    - sourced from https://www.kaggle.com/austinreese/trump-tweets
    - 43352 recorded tweets since May 2009
- Historical S&P500 stock price from Yahoo Finance
- Loughran McDonald financial sentiment library
    - sourced from https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary
    
### Assumptions made:
- Donald Trump's time of tweet is in the same time zone as the NYSE
- Stock market trading time is from 9:30am to 4pm
- He officially became president on 20th Jan 2017

## Data dictionary

### df_tweets
| Feature     	| Type 	| Description|                                                  
|:------------------	|:----------	|:----------------------------------------------------------------------------------|
| content | object | Raw tweets from @realDonaldTrump |
| date | object | Date and time of his tweet. YYYY-MM-DD HH:MM:SS format |
| retweets	| int64 | Number of retweets by others at the time of data collection |
| favorites	| int64 | Number of favourites at the time of data collection |
| cleaned_tweets | object | Pre-processed tweets |
| tweet_day	| datetime64[ns] | Date of tweet YYYY-MM-DD format |
| month_sin | float64 | Cyclical sine feature engineered from the month. Range of -1 to 1 |
| month_cos | float64 | Cyclical cosine feature engineered from the month. Range of -1 to 1 |
| day_sin | float64 | Cyclical sine feature engineered from the day of month. Range of -1 to 1 |
| day_cos | float64 | Cyclical cosine feature engineered from the day of month. Range of -1 to 1 |
| hour_sin | float64 | Cyclical sine feature engineered from the hour. Range of -1 to 1 |
| hour_cos | float64 | Cyclical cosine feature engineered from the hour. Range of -1 to 1 |
| min_sin | float64 | Cyclical sine feature engineered from the minute. Range of -1 to 1 |
| min_cos | float64 | Cyclical cosine feature engineered from the minute. Range of -1 to 1 |
| time_to_open	| float64 | Time before the stock market opens on that day. In seconds | 	
| time_after_close	| float64 | Time after the stock market closes on that day. In seconds |	
| potus_status | int64 | President of the United States status. 1 if tweeted on or after 20 Jan 2017, else 0 |
| vader_negative	| float64 | Negative sentiment analysis using Vader. Scale of 0 to 1 |
| vader_neutral	| float64 | Neutral sentiment analysis using Vader. Scale of 0 to 1 |
| vader_positive	| float64 | Positive sentiment analysis using Vader. Scale of 0 to 1 |
| vader_compound | float64 | Overall sentiment analysis using Vader. Scale of -1 to 1 |
| negative_word	| int64 | Negative word count using Loughran McDonald Financial sentiment analysis |
| positive_word	| int64 | Positive word count using Loughran McDonald Financial sentiment analysis |
| uncertainty_word	| int64 | Uncertainty word count using Loughran McDonald Financial sentiment analysis |
| litigious_word	| int64 | Litigious word count using Loughran McDonald Financial sentiment analysis |
| constraining_word	| int64 | Constraining word count using Loughran McDonald Financial sentiment analysis |
| interesting_word	| int64 | Interesting word count using Loughran McDonald Financial sentiment analysis |
| modal_strong_word	| int64 | Modal Strong word count using Loughran McDonald Financial sentiment analysis |
| modal_neutral_word	| int64 | Modal Neutral word count using Loughran McDonald Financial sentiment analysis |
| modal_weak_word	| int64 | Modal Weak word count using Loughran McDonald Financial sentiment analysis |
| difference | float64 | Intra-day S&P500 return: (close - open) / open |
| target | int64 | Target variable. 1 if the intra-day return is more than 1 standard deviation from the mean, else 0 |

In [1]:
# importing libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
from progressbar import ProgressBar

# web scraping libraries
import re
import random
import requests
from bs4 import BeautifulSoup
from datetime import datetime, time

# NLP libraries
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#library for historical stock prices
import yfinance as yf
from datetime import timedelta
from dateutil.relativedelta import relativedelta

# Importing Trump twitter dataset

In [2]:
# importing the dataset
df_tweets = pd.read_csv('../datasets/realdonaldtrump.csv')

In [3]:
# checking first few rows
df_tweets.head()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,,


In [4]:
# checking last few rows
df_tweets.tail()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
43347,1273405198698975232,https://twitter.com/realDonaldTrump/status/127...,Joe Biden was a TOTAL FAILURE in Government. H...,2020-06-17 19:00:32,23402,116377,,
43348,1273408026968457216,https://twitter.com/realDonaldTrump/status/127...,Will be interviewed on @ seanhannity tonight a...,2020-06-17 19:11:47,11810,56659,@seanhannity,
43349,1273442195161387008,https://twitter.com/realDonaldTrump/status/127...,pic.twitter.com/3lm1spbU8X,2020-06-17 21:27:33,4959,19344,,
43350,1273442469066276864,https://twitter.com/realDonaldTrump/status/127...,pic.twitter.com/vpCE5MadUz,2020-06-17 21:28:38,4627,17022,,
43351,1273442528411385858,https://twitter.com/realDonaldTrump/status/127...,pic.twitter.com/VLlc0BHW41,2020-06-17 21:28:52,3951,14344,,


In [5]:
# checking shape
df_tweets.shape

(43352, 8)

In [6]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43352 entries, 0 to 43351
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         43352 non-null  int64 
 1   link       43352 non-null  object
 2   content    43352 non-null  object
 3   date       43352 non-null  object
 4   retweets   43352 non-null  int64 
 5   favorites  43352 non-null  int64 
 6   mentions   20386 non-null  object
 7   hashtags   5583 non-null   object
dtypes: int64(3), object(5)
memory usage: 2.6+ MB


# Data Cleaning

### Missing values

In [7]:
# checking for missing values
df_tweets.isnull().sum()

id               0
link             0
content          0
date             0
retweets         0
favorites        0
mentions     22966
hashtags     37769
dtype: int64

In [8]:
# dropping the columns that contain missing values (mentions, hashtags)
df_tweets.dropna(axis=1, inplace=True)

In [9]:
# also drop the id and link columns as these are unnecessary
df_tweets.drop(['id', 'link'], axis=1, inplace=True)

### Text Data Pre-Processing

In [10]:
# Instantiating Tokenizer, PorterStemmer and lemmatizer

tokenizer = RegexpTokenizer(r'\w+')
p_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [11]:
# loading NLTK stopwords
stopwords_nltk = set(stopwords.words('english'))

# loading in spacy stopwords
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words

for i in stopwords_nltk:
    all_stopwords.add(i)

additional_stop_words = ['twitter', 'pic', 'https', 'com', 'http', 'www', 'bit', 'ly']    

for i in additional_stop_words:
    all_stopwords.add(i)

In [12]:
#taken from DSI-14 week 5.03 NLP startercode
def pre_processing(raw_text):

    # 1. Remove HTML.
    review_text = BeautifulSoup(raw_text).get_text()
    
    # 2. Remove non-letters.
    # regex modified from https://towardsdatascience.com/covfefe-nlp-do-trumps-tweets-move-the-stock-market-42a83ab17fea
    letters_only = re.sub(r'(@[A-Za-z]+)|([^A-Za-z \t])|(\w+:\/\/\S+)', ' ', review_text)
    
    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 4. Searching a set is much faster than searching a list
    #    Adding in additional stopwords.
    stops = all_stopwords
    
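    # 5. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]
    
    # 5.5 Lemmatize the words.
    # NOTE: this lemmatizes `words`, not the filtered `meaningful_words`, so the
    # stopword removal in step 5 is overwritten and stopwords remain in the output
    # (as the cleaned tweets shown later confirm). Iterate over `meaningful_words`
    # here instead if stopwords should actually be dropped.
    meaningful_words = [lemmatizer.lemmatize(w) for w in words]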
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

In [13]:
# Get the number of tweets from the dataframe size.
total_tweets= df_tweets.shape[0]
print(f'There are {total_tweets} tweets')

There are 43352 tweets


In [14]:
# Initialize an empty list to hold the cleaned tweets.
pbar = ProgressBar()
cleaned_tweets = []

for tweets in pbar(df_tweets['content']):
    cleaned_tweets.append(pre_processing(tweets))

100% (43352 of 43352) |##################| Elapsed Time: 0:00:33 Time:  0:00:33


In [15]:
#appending cleaned_tweets back to the df
df_tweets['cleaned_tweets'] = cleaned_tweets

# Feature engineering

### Date into cyclical features

credit to: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
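
To see why sine and cosine pairs are used, the sketch below encodes a few months on the unit circle and compares distances. The helper names `cyclical_encode` and `distance` are assumptions for illustration; the notebook itself applies the same formula with NumPy on whole columns.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value (e.g. month 1-12) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


dec = cyclical_encode(12, 12)
jan = cyclical_encode(1, 12)
jun = cyclical_encode(6, 12)

# December and January are neighbours on the circle, even though 12 and 1 are far apart as integers.
print(round(distance(dec, jan), 3))  # 0.518
print(round(distance(dec, jun), 3))  # 2.0
```

Both components are needed: the sine alone maps two different months to the same value, and the cosine disambiguates them.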

In [16]:
# Extracting only the date of the tweet
df_tweets['tweet_day'] = pd.to_datetime(df_tweets['date'])
df_tweets['tweet_day'] = df_tweets['tweet_day'].map(lambda x: x.date())
df_tweets['tweet_day'] = pd.to_datetime(df_tweets['tweet_day'])

In [17]:
#extracting the necessary features from date
df_tweets['month'] = df_tweets['date'].map(lambda x: int(x[5:7]))
df_tweets['day'] = df_tweets['date'].map(lambda x: int(x[8:10]))
df_tweets['hour'] = df_tweets['date'].map(lambda x: int(x[11:13]))
df_tweets['min'] = df_tweets['date'].map(lambda x: int(x[-5:-3]))

In [18]:
#cyclical features for month
df_tweets['month_sin'] = np.sin(df_tweets['month']*(2.*np.pi/12))
df_tweets['month_cos'] = np.cos(df_tweets['month']*(2.*np.pi/12))

#cyclical features for day
#assumption that every month has 31 days
df_tweets['day_sin'] = np.sin(df_tweets['day']*(2.*np.pi/31))
df_tweets['day_cos'] = np.cos(df_tweets['day']*(2.*np.pi/31))

#cyclical features for hour
df_tweets['hour_sin'] = np.sin(df_tweets['hour']*(2.*np.pi/24))
df_tweets['hour_cos'] = np.cos(df_tweets['hour']*(2.*np.pi/24))

#cyclical features for minute
df_tweets['min_sin'] = np.sin(df_tweets['min']*(2.*np.pi/60))
df_tweets['min_cos'] = np.cos(df_tweets['min']*(2.*np.pi/60))

In [19]:
df_tweets.head()

Unnamed: 0,content,date,retweets,favorites,cleaned_tweets,tweet_day,month,day,hour,min,month_sin,month_cos,day_sin,day_cos,hour_sin,hour_cos,min_sin,min_cos
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,be sure to tune in and watch donald trump on l...,2009-05-04,5,4,13,54,0.5,-0.866025,0.724793,0.688967,-0.258819,-0.965926,-0.587785,0.809017
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,donald trump will be appearing on the view tom...,2009-05-04,5,4,20,0,0.5,-0.866025,0.724793,0.688967,-0.866025,0.5,0.0,1.0
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,donald trump read top ten financial tip on lat...,2009-05-08,5,8,8,38,0.5,-0.866025,0.998717,-0.050649,0.866025,-0.5,-0.743145,-0.669131
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,new blog post celebrity apprentice finale and ...,2009-05-08,5,8,15,40,0.5,-0.866025,0.998717,-0.050649,-0.707107,-0.707107,-0.866025,-0.5
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,my persona will never be that of a wallflower ...,2009-05-12,5,12,9,7,0.5,-0.866025,0.651372,-0.758758,0.707107,-0.707107,0.669131,0.743145


In [20]:
# Extracting the time of tweet
df_tweets['tweet_time'] = pd.to_datetime(df_tweets['date'])
df_tweets['tweet_time'] = df_tweets['tweet_time'].map(lambda x: x.time())
df_tweets['tweet_time'] = pd.to_datetime(df_tweets['tweet_time'], format='%H:%M:%S') - pd.to_datetime(df_tweets['tweet_time'], format='%H:%M:%S').dt.normalize()

In [21]:
# new columns that measure how close the tweet is to market opening/closing
# market opens at 9:30am and closes at 4pm
market_open = timedelta(hours=9, minutes=30)
market_close = timedelta(hours=16)

def time_to_open(time):
    if time <= market_open:
        return market_open - time
    elif time > market_open:
        return timedelta(0)
    
def time_after_close(time):
    if time >= market_close:
        return time - market_close
    elif time < market_close:
        return timedelta(0)  


#converting to seconds    
df_tweets['time_to_open'] = df_tweets['tweet_time'].map(lambda x: time_to_open(x).total_seconds())
df_tweets['time_after_close'] = df_tweets['tweet_time'].map(lambda x: time_after_close(x).total_seconds())

In [22]:
df_tweets.head()

Unnamed: 0,content,date,retweets,favorites,cleaned_tweets,tweet_day,month,day,hour,min,month_sin,month_cos,day_sin,day_cos,hour_sin,hour_cos,min_sin,min_cos,tweet_time,time_to_open,time_after_close
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,be sure to tune in and watch donald trump on l...,2009-05-04,5,4,13,54,0.5,-0.866025,0.724793,0.688967,-0.258819,-0.965926,-0.587785,0.809017,13:54:25,0.0,0.0
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,donald trump will be appearing on the view tom...,2009-05-04,5,4,20,0,0.5,-0.866025,0.724793,0.688967,-0.866025,0.5,0.0,1.0,20:00:10,0.0,14410.0
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,donald trump read top ten financial tip on lat...,2009-05-08,5,8,8,38,0.5,-0.866025,0.998717,-0.050649,0.866025,-0.5,-0.743145,-0.669131,08:38:08,3112.0,0.0
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,new blog post celebrity apprentice finale and ...,2009-05-08,5,8,15,40,0.5,-0.866025,0.998717,-0.050649,-0.707107,-0.707107,-0.866025,-0.5,15:40:15,0.0,0.0
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,my persona will never be that of a wallflower ...,2009-05-12,5,12,9,7,0.5,-0.866025,0.651372,-0.758758,0.707107,-0.707107,0.669131,0.743145,09:07:28,1352.0,0.0


### POTUS status

In [23]:
# Adding a president of the united states column
potus_day = pd.to_datetime('2017-01-20')
df_tweets['potus_status'] = df_tweets['tweet_day'].map(lambda x: 1 if x >= potus_day else 0)

In [24]:
#dropping unnecessary columns
df_tweets.drop(['month', 'day', 'hour', 'min', 'tweet_time'], axis=1, inplace=True)

### Vader Sentiment Analysis

In [25]:
#instantiating Vader
vader = SentimentIntensityAnalyzer()

In [26]:
#creating functions to pull the scores
def neg_score(text):
    score = vader.polarity_scores(text)
    return score['neg']

def neu_score(text):
    score = vader.polarity_scores(text)
    return score['neu']
    
def positive_score(text):
    score = vader.polarity_scores(text)
    return score['pos']
    
def compound_score(text):
    score = vader.polarity_scores(text)
    return score['compound']    

In [27]:
#adding the vader sentiment to the dataframe
df_tweets['vader_negative'] = df_tweets['content'].apply(neg_score)
df_tweets['vader_neutral'] = df_tweets['content'].apply(neu_score)
df_tweets['vader_positive'] = df_tweets['content'].apply(positive_score)
df_tweets['vader_compound'] = df_tweets['content'].apply(compound_score)
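
The four helper functions above each call `vader.polarity_scores`, so every tweet is scored four times. A possible alternative, sketched below, scores each text once and splits the returned dictionary into columns. `expand_scores` and `fake_scorer` are hypothetical names; `fake_scorer` is only a stand-in so that the sketch runs without VADER installed.

```python
def expand_scores(texts, score_fn):
    """Call score_fn once per text and collect each returned key into its own list."""
    columns = {}
    for text in texts:
        for key, value in score_fn(text).items():
            columns.setdefault(key, []).append(value)
    return columns


def fake_scorer(text):
    """Hypothetical stand-in that mimics the keys returned by VADER's polarity_scores."""
    is_happy = "great" in text.lower()
    return {
        "neg": 0.0,
        "neu": 0.5 if is_happy else 1.0,
        "pos": 0.5 if is_happy else 0.0,
        "compound": 0.6 if is_happy else 0.0,
    }


scores = expand_scores(["A great day", "New blog post"], fake_scorer)
print(scores["compound"])
```

In the notebook, passing `df_tweets['content']` and `vader.polarity_scores` should yield four lists, under the keys `neg`, `neu`, `pos` and `compound`, that can be assigned to the four `vader_*` columns.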

In [28]:
df_tweets.head()

Unnamed: 0,content,date,retweets,favorites,cleaned_tweets,tweet_day,month_sin,month_cos,day_sin,day_cos,hour_sin,hour_cos,min_sin,min_cos,time_to_open,time_after_close,potus_status,vader_negative,vader_neutral,vader_positive,vader_compound
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,be sure to tune in and watch donald trump on l...,2009-05-04,0.5,-0.866025,0.724793,0.688967,-0.258819,-0.965926,-0.587785,0.809017,0.0,0.0,0,0.0,0.827,0.173,0.5255
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,donald trump will be appearing on the view tom...,2009-05-04,0.5,-0.866025,0.724793,0.688967,-0.866025,0.5,0.0,1.0,0.0,14410.0,0,0.0,0.749,0.251,0.7712
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,donald trump read top ten financial tip on lat...,2009-05-08,0.5,-0.866025,0.998717,-0.050649,0.866025,-0.5,-0.743145,-0.669131,3112.0,0.0,0,0.0,0.739,0.261,0.6468
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,new blog post celebrity apprentice finale and ...,2009-05-08,0.5,-0.866025,0.998717,-0.050649,-0.707107,-0.707107,-0.866025,-0.5,0.0,0.0,0,0.0,1.0,0.0,0.0
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,my persona will never be that of a wallflower ...,2009-05-12,0.5,-0.866025,0.651372,-0.758758,0.707107,-0.707107,0.669131,0.743145,1352.0,0.0,0,0.0,1.0,0.0,0.0


### Loughran McDonald Financial sentiment analysis

In [30]:
#loading the financial sentiment dictionary
df_sentiment = pd.read_csv('../datasets/LoughranMcDonald_MasterDictionary_2018.csv')

In [31]:
#creating dummies for modal
df_sentiment = pd.get_dummies(df_sentiment, columns = ['Modal'], drop_first=True)

In [32]:
#checking the df
df_sentiment.head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Irr_Verb,Harvard_IV,Syllables,Source,Modal_1,Modal_2,Modal_3
0,AARDVARK,1,277,1.48e-08,1.24e-08,3.56e-06,84,0,0,0,0,0,0,0,0,0,2,12of12inf,0,0,0
1,AARDVARKS,2,3,1.6e-10,9.73e-12,9.86e-09,1,0,0,0,0,0,0,0,0,0,2,12of12inf,0,0,0
2,ABACI,3,8,4.28e-10,1.39e-10,6.23e-08,7,0,0,0,0,0,0,0,0,0,3,12of12inf,0,0,0
3,ABACK,4,12,6.41e-10,3.16e-10,9.38e-08,12,0,0,0,0,0,0,0,0,0,2,12of12inf,0,0,0
4,ABACUS,5,7250,3.87e-07,3.68e-07,3.37e-05,914,0,0,0,0,0,0,0,0,0,3,12of12inf,0,0,0


In [33]:
#Renaming column headers using .rename takes significantly longer
#df_sentiment.rename({'Modal_1':'modal_strong', 'Modal_2':'modal_neutral', 'Modal_3':'modal_weak'}, axis=1)

df_sentiment.columns = ['Word', 'Sequence Number', 'Word Count', 'Word Proportion',
       'Average Proportion', 'Std Dev', 'Doc Count', 'Negative', 'Positive',
       'Uncertainty', 'Litigious', 'Constraining', 'Superfluous',
       'Interesting', 'Irr_Verb', 'Harvard_IV', 'Syllables', 'Source',
       'modal_strong', 'modal_neutral', 'modal_weak']

In [34]:
#Picking out the sentiment topics to use
sentiments = ['negative', 'positive', 'uncertainty', 'litigious', 'constraining', 
              'interesting', 'modal_strong', 'modal_neutral', 'modal_weak']
df_sentiment.columns = [column.lower() for column in df_sentiment.columns]

In [35]:
# changing the dictionary to only include words that appear in sentiments
df_sentiment = df_sentiment[['word'] + sentiments].copy()
df_sentiment[sentiments] = df_sentiment[sentiments].astype(bool)
# keep only the words flagged for at least one sentiment
df_sentiment = df_sentiment[df_sentiment[sentiments].any(axis=1)]

In [36]:
# Initialize an empty list to hold the cleaned dictionary words.
clean_words = []

for words in df_sentiment['word']:
    
    # Apply the same pre-processing as the tweets, then append to clean_words.
    clean_words.append(pre_processing(words))

df_sentiment['word'] = clean_words
df_sentiment = df_sentiment.drop_duplicates('word')

In [37]:
#negative words
def negative_word(text):
    negative = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['negative']==1]['word']):
            negative += 1
    return negative

#positive words
def positive_word(text):
    positive = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['positive']==1]['word']):
            positive += 1
    return positive

#uncertainty words
def uncertainty_word(text):
    uncertainty = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['uncertainty']==1]['word']):
            uncertainty += 1
    return uncertainty

#litigious words
def litigious_word(text):
    litigious = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['litigious']==1]['word']):
            litigious += 1
    return litigious

#constraining words
def constraining_word(text):
    constraining = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['constraining']==1]['word']):
            constraining += 1
    return constraining

#interesting words
def interesting_word(text):
    interesting = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['interesting']==1]['word']):
            interesting += 1
    return interesting

#modal_strong words
def modal_strong_word(text):
    modal_strong = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['modal_strong']==1]['word']):
            modal_strong += 1
    return modal_strong

#modal_neutral words
def modal_neutral_word(text):
    modal_neutral = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['modal_neutral']==1]['word']):
            modal_neutral += 1
    return modal_neutral

#modal_weak words
def modal_weak_word(text):
    modal_weak = 0
    for word in tokenizer.tokenize(text.lower()):
        if word in set(df_sentiment[df_sentiment['modal_weak']==1]['word']):
            modal_weak += 1
    return modal_weak

functions = [negative_word, positive_word, uncertainty_word, litigious_word, constraining_word, 
             interesting_word, modal_strong_word, modal_neutral_word, modal_weak_word]
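
Each function above rebuilds its word set from `df_sentiment` for every token, which is a likely reason the counts take a long time to run. A faster variant, sketched below under stated assumptions, builds one set per sentiment up front and reuses it. `build_word_sets`, `count_sentiment_words` and the toy rows are hypothetical and only illustrate the idea. The sketch splits on whitespace rather than using the notebook's `RegexpTokenizer`, which should be equivalent for the already cleaned tweets.

```python
def build_word_sets(rows, sentiments):
    """Map each sentiment name to the set of words flagged for it."""
    word_sets = {s: set() for s in sentiments}
    for row in rows:
        for s in sentiments:
            if row[s]:
                word_sets[s].add(row["word"])
    return word_sets


def count_sentiment_words(text, word_sets):
    """Count, per sentiment, how many tokens of `text` appear in that sentiment's set."""
    tokens = text.lower().split()
    return {s: sum(1 for t in tokens if t in words) for s, words in word_sets.items()}


# Toy dictionary rows, made up for illustration (not taken from the real dictionary).
toy_rows = [
    {"word": "loss", "negative": True, "positive": False},
    {"word": "gain", "negative": False, "positive": True},
]
word_sets = build_word_sets(toy_rows, ["negative", "positive"])
counts = count_sentiment_words("Big loss then gain then loss", word_sets)
print(counts)  # {'negative': 2, 'positive': 1}
```

With the notebook's dataframe, each set could be built once with `set(df_sentiment.loc[df_sentiment[s], 'word'])` for every `s` in `sentiments`.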

In [38]:
df_dict = df_tweets.copy()

In [39]:
# apply functions to the new columns
''' Commenting out as these take a long time to run'''
# df_dict['negative_word'] = df_dict['cleaned_tweets'].apply(negative_word)
# df_dict['positive_word'] = df_dict['cleaned_tweets'].apply(positive_word)
# df_dict['uncertainty_word'] = df_dict['cleaned_tweets'].apply(uncertainty_word)
# df_dict['litigious_word'] = df_dict['cleaned_tweets'].apply(litigious_word)
# df_dict['constraining_word'] = df_dict['cleaned_tweets'].apply(constraining_word)
# df_dict['interesting_word'] = df_dict['cleaned_tweets'].apply(interesting_word)
# df_dict['modal_strong_word'] = df_dict['cleaned_tweets'].apply(modal_strong_word)
# df_dict['modal_neutral_word'] = df_dict['cleaned_tweets'].apply(modal_neutral_word)
# df_dict['modal_weak_word'] = df_dict['cleaned_tweets'].apply(modal_weak_word)

' Commenting out as these take a long time to run'

In [40]:
# # saving
# df_dict[['negative_word','positive_word','uncertainty_word', 'litigious_word', 'constraining_word','interesting_word','modal_strong_word', 'modal_neutral_word', 'modal_weak_word']].to_csv('../datasets/df_dict.csv')

In [41]:
df_dict = pd.read_csv('../datasets/df_dict.csv', index_col=0)

In [42]:
df_dict.head()

Unnamed: 0,negative_word,positive_word,uncertainty_word,litigious_word,constraining_word,interesting_word,modal_strong_word,modal_neutral_word,modal_weak_word
0,1,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,1,0,1
2,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,2,0,0


In [43]:
df_tweets = df_tweets.join(df_dict)

# Identifying positive class

### Adding the opening and closing price of the S&P500

In [44]:
# Creating a dataframe of the S&P500 historical prices
df_stock_price = yf.Ticker('^GSPC').history(start='2009-05-01',end='2020-06-20')

In [45]:
# checking shape
df_stock_price.shape

(2805, 7)

In [46]:
# checking df
df_stock_price.head()

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2009-04-30,876.59,888.7,868.51,872.81,6862540000,0,0
2009-05-01,872.74,880.48,866.1,877.52,5312170000,0,0
2009-05-04,879.21,907.85,879.21,907.24,7038840000,0,0
2009-05-05,906.1,907.7,897.34,903.8,6882860000,0,0
2009-05-06,903.95,920.28,903.95,919.53,8555040000,0,0


In [47]:
# finding the S&P500 return within the day
df_stock_price['difference'] = (df_stock_price['Close'] - df_stock_price['Open'])/df_stock_price['Open']

In [48]:
df_stock_price.to_csv('../datasets/df_stock_price.csv')

In [49]:
df_stock_price.describe()

Unnamed: 0,Open,High,Low,Close,Volume,Dividends,Stock Splits,difference
count,2805.0,2805.0,2805.0,2805.0,2805.0,2805.0,2805.0,2805.0
mean,1947.225597,1957.165786,1936.463836,1947.679084,3874781000.0,0.0,0.0,0.000359
std,643.904533,646.016805,641.199932,643.627539,1002020000.0,0.0,0.0,0.009356
min,872.74,880.48,866.1,872.81,1025000000.0,0.0,0.0,-0.065934
25%,1337.39,1343.33,1330.03,1337.77,3275080000.0,0.0,0.0,-0.003149
50%,1972.73,1981.8,1962.42,1972.74,3665340000.0,0.0,0.0,0.000554
75%,2469.12,2477.96,2460.29,2472.1,4240570000.0,0.0,0.0,0.004611
max,3380.45,3393.52,3378.83,3386.15,10617810000.0,0.0,0.0,0.054876


In [50]:
# mean and standard deviation of `difference`, copied from the describe() output above
difference_mean = 0.000359
std_1 = 0.009356
std_2 = 2*0.009356
std_3 = 3*0.009356

In [51]:
# joining the df onto df_tweets
df_tweets = df_tweets.join(df_stock_price['difference'], on='tweet_day')

In [52]:
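# Positive class: intra-day return more than 1 standard deviation away from the mean
# Tweets on non-trading days have no matching `difference` (NaN) and fall into class 0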
df_tweets['target'] = df_tweets['difference'].map(lambda x: 1 if x > difference_mean+std_1 or x < difference_mean-std_1 else 0)
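
The labelling rule can be checked on a few values. In the sketch below, the mean and standard deviation are the ones from the `describe()` output, the first two returns come from the dataframe shown in this notebook, and `-0.012` and the NaN are made-up inputs. `label_move` is a hypothetical helper that mirrors the lambda above.

```python
def label_move(diff, mean, std, k=1):
    """Return 1 if diff is more than k standard deviations from the mean, else 0.

    Comparisons with NaN are always False, so a missing return is labelled 0.
    """
    return 1 if (diff > mean + k * std or diff < mean - k * std) else 0


mean, std = 0.000359, 0.009356
examples = [0.031881, -0.002383, -0.012, float("nan")]
labels = [label_move(d, mean, std) for d in examples]
print(labels)  # [1, 0, 1, 0]
```

The last case shows that a tweet with no matching trading day is treated as not market moving rather than being dropped.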

In [53]:
# # positive class as gain in stock market, and negative class as loss in stock market
# df_tweets['target'] = df_tweets['difference'].map(lambda x: 1 if x >= 0 else 0)

In [54]:
df_tweets['target'].value_counts(normalize=True)

0    0.846282
1    0.153718
Name: target, dtype: float64

In [55]:
df_tweets.head()

Unnamed: 0,content,date,retweets,favorites,cleaned_tweets,tweet_day,month_sin,month_cos,day_sin,day_cos,hour_sin,hour_cos,min_sin,min_cos,time_to_open,time_after_close,potus_status,vader_negative,vader_neutral,vader_positive,vader_compound,negative_word,positive_word,uncertainty_word,litigious_word,constraining_word,interesting_word,modal_strong_word,modal_neutral_word,modal_weak_word,difference,target
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,be sure to tune in and watch donald trump on l...,2009-05-04,0.5,-0.866025,0.724793,0.688967,-0.258819,-0.965926,-0.587785,0.809017,0.0,0.0,0,0.0,0.827,0.173,0.5255,1,0,0,0,0,0,0,0,0,0.031881,1
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,donald trump will be appearing on the view tom...,2009-05-04,0.5,-0.866025,0.724793,0.688967,-0.866025,0.5,0.0,1.0,0.0,14410.0,0,0.0,0.749,0.251,0.7712,0,0,1,0,0,0,1,0,1,0.031881,1
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,donald trump read top ten financial tip on lat...,2009-05-08,0.5,-0.866025,0.998717,-0.050649,0.866025,-0.5,-0.743145,-0.669131,3112.0,0.0,0,0.0,0.739,0.261,0.6468,1,0,0,0,0,0,0,0,0,0.022221,1
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,new blog post celebrity apprentice finale and ...,2009-05-08,0.5,-0.866025,0.998717,-0.050649,-0.707107,-0.707107,-0.866025,-0.5,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0.022221,1
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,my persona will never be that of a wallflower ...,2009-05-12,0.5,-0.866025,0.651372,-0.758758,0.707107,-0.707107,0.669131,0.743145,1352.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,2,0,0,-0.002383,0


# Train Validation and Holdout Data

In [56]:
df_tweets.to_csv('../datasets/df_tweets.csv')

In [57]:
df_tweets[df_tweets['cleaned_tweets']== '']

Unnamed: 0,content,date,retweets,favorites,cleaned_tweets,tweet_day,month_sin,month_cos,day_sin,day_cos,hour_sin,hour_cos,min_sin,min_cos,time_to_open,time_after_close,potus_status,vader_negative,vader_neutral,vader_positive,vader_compound,negative_word,positive_word,uncertainty_word,litigious_word,constraining_word,interesting_word,modal_strong_word,modal_neutral_word,modal_weak_word,difference,target
3338,@sayhigreg http://fxn.ws/VdRXwh,2012-09-20 15:16:36,6,5,,2012-09-20,-1.0,-1.83697e-16,-0.7907757,-0.612106,-0.7071068,-0.7071068,0.9945219,-0.1045285,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,-0.000541,0
3339,. @ janinegibsonhttp://fxn.ws/VdRXwh,2012-09-20 15:19:44,16,11,,2012-09-20,-1.0,-1.83697e-16,-0.7907757,-0.612106,-0.7071068,-0.7071068,0.9135455,-0.4067366,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,-0.000541,0
6874,@JaylonSeales,2013-02-27 01:39:28,1,2,,2013-02-27,0.8660254,0.5,-0.7247928,0.688967,0.258819,0.9659258,-0.809017,-0.5877853,28232.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0.012726,1
22171,# MakeAmericaGreatAgainhttp://on.fb.me/1zMGDu5,2015-04-22 12:49:22,45,58,,2015-04-22,0.8660254,-0.5,-0.9680771,-0.250653,1.224647e-16,-1.0,-0.9135455,0.4067366,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0.004618,0
24953,# CNNDebatehttp://time.com/4037510/poll-second...,2015-09-16 23:10:37,1391,1680,,2015-09-16,-1.0,-1.83697e-16,-0.1011683,-0.994869,-0.258819,0.9659258,0.8660254,0.5,0.0,25837.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0.008741,0
29512,# MakeAmericaGreatAgainhttps://twitter.com/emi...,2016-07-07 15:06:17,5317,14962,,2016-07-07,-0.5,-0.8660254,0.9884683,0.151428,-0.7071068,-0.7071068,0.5877853,0.809017,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,-0.0012,0
29801,# CrookedHillaryhttp://www.dailymail.co.uk/new...,2016-08-01 09:48:22,7186,16171,,2016-08-01,-0.8660254,-0.5,0.2012985,0.97953,0.7071068,-0.7071068,-0.9510565,0.309017,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,-0.001063,0
30578,# DrainTheSwamphttp://thehill.com/blogs/pundit...,2016-10-18 20:00:40,8082,13262,,2016-10-18,-0.8660254,0.5,-0.485302,-0.874347,-0.8660254,0.5,0.0,1.0,0.0,14440.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0.000603,0
30737,# ObamacareFailhttp://thehill.com/policy/healt...,2016-10-25 10:21:37,5905,11374,,2016-10-25,-0.8660254,0.5,-0.9377521,0.347305,0.5,-0.8660254,0.809017,-0.5877853,0.0,0.0,0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,-0.003052,0
31872,# MakeAmericaGreatAgainhttps://twitter.com/i/m...,2017-06-02 13:30:41,13612,68498,,2017-06-02,1.224647e-16,-1.0,0.3943559,0.918958,-0.258819,-0.9659258,1.224647e-16,-1.0,0.0,0.0,1,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0.003204,0


In [58]:
df_tweets = df_tweets[df_tweets['cleaned_tweets'] != '']

## Holdout Data

In [59]:
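# NOTE: X still contains `difference`, from which `target` is derived;
# it should be excluded from the model features to avoid target leakage
X = df_tweets.drop('target', axis=1)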
y = df_tweets['target']

In [60]:
# Holdout data split
X_train_validate, X_holdout, y_train_validate, y_holdout = train_test_split(X,y,stratify=y, random_state=42, test_size=0.2)

In [61]:
X_holdout.shape

(8649, 31)

In [62]:
# Saving holdout data to csv
X_holdout.to_csv('../datasets/X_holdout.csv')
y_holdout.to_csv('../datasets/y_holdout.csv')

## Train and Validation

In [63]:
df_train_validate = X_train_validate.join(y_train_validate)

In [64]:
df_train_validate.head()

Unnamed: 0,content,date,retweets,favorites,cleaned_tweets,tweet_day,month_sin,month_cos,day_sin,day_cos,hour_sin,hour_cos,min_sin,min_cos,time_to_open,time_after_close,potus_status,vader_negative,vader_neutral,vader_positive,vader_compound,negative_word,positive_word,uncertainty_word,litigious_word,constraining_word,interesting_word,modal_strong_word,modal_neutral_word,modal_weak_word,difference,target
24974,""" @ LeghanLiptak712: CNN tried to destroy Dona...",2015-09-20 04:27:54,1123,2400,leghanliptak cnn tried to destroy donald trump...,2015-09-20,-1.0,-1.83697e-16,-0.790776,-0.612106,0.866025,0.5,0.309017,-0.951057,18126.0,0.0,0,0.078,0.658,0.264,0.8113,2,0,0,0,0,0,0,0,0,,0
17951,""" @ InsideCableNews @ FTVLive Dream on. You ha...",2014-10-20 18:59:09,17,38,insidecablenews ftvlive dream on you have a be...,2014-10-20,-0.866025,0.5,-0.790776,-0.612106,-1.0,-1.83697e-16,-0.104528,0.994522,0.0,10749.0,0,0.152,0.638,0.21,0.2263,1,2,0,0,0,0,0,0,0,0.009753,1
1138,"""The American work ethic is what led generatio...",2012-01-13 13:48:23,93,40,the american work ethic is what led generation...,2012-01-13,0.5,0.8660254,0.485302,-0.874347,-0.258819,-0.9659258,-0.951057,0.309017,0.0,0.0,0,0.0,0.755,0.245,0.6369,0,1,0,0,0,0,0,0,0,-0.004425,0
17957,"Wow, one of the all-time greats in fashion, OS...",2014-10-21 04:20:56,157,192,wow one of the all time great in fashion oscar...,2014-10-21,-0.866025,0.5,-0.897805,-0.440394,0.866025,0.5,0.866025,-0.5,18544.0,0.0,0,0.068,0.653,0.279,0.7953,0,3,0,0,0,1,0,0,0,0.016707,1
32423,Received a # HurricaneHarvey briefing this mor...,2017-08-25 11:02:33,6920,34234,received a hurricaneharvey briefing this morni...,2017-08-25,-0.866025,-0.5,-0.937752,0.347305,0.258819,-0.9659258,0.207912,0.978148,0.0,0.0,1,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,-0.000683,0


In [65]:
df_train_validate.to_csv('../datasets/df_train_validate.csv')