## Sample Data - Tweets From A World Leader

### Data collection
This sample data consists of Donald Trump's tweets from 2019, when he was President and before he was permanently banned from Twitter (now X). The main problem statement is: can the tweets and their sentiment predict behaviour at the individual, organizational, or government level?

At an individual level, the questions could deal with antecedents that predict the sentiment of the tweets. For example, we could check whether the day, month, or time of day predicts an individual's mood and hence the sentiment of the tweet. At an organizational level, we could ask whether the language of the tweets predicts the performance of an organization, for example changes in monthly sales or revenues. At a government level, we could check whether the monthly aggregate nature of the tweets predicts certain country-level macro indicators, such as monthly jobs added.

**In this example, we try to predict the movement of the NASDAQ given the text of the tweet by a world leader.** 

#### Collecting Tweets of Trump:
The data comes from a web-based Twitter archive, https://www.thetrumparchive.com/faq , which holds all of Donald Trump's tweets. From this set I extracted the tweets from the year 2019.


#### Collecting NASDAQ data:
Many online databases provide daily stock and index prices. The included file NASDAQ_Close_2019.csv has the daily NASDAQ index data for 2019. It was downloaded from:
https://finance.yahoo.com/quote/%5EIXIC/history?period1=1546300800&period2=1577836800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true. 

### Data Preparation

The collected data may need some transformation. In this example, we separate the day, month, and year of each tweet and store them in separate columns. You can use Python or MS Excel for this data manipulation. Note: Python has a built-in module called "datetime" that is designed for manipulating dates and times.
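
As a minimal sketch of that step, the snippet below splits one raw timestamp into day, month, and year with `datetime`. The format string is an assumption inferred from the `Tweet_date` values shown later in this example (such as `Tue Dec 31 23:35:58 +0000 2019`), and `split_tweet_date` is a hypothetical helper name, not part of the original workflow.

```python
from datetime import datetime

# Assumed layout of the raw Tweet_date strings, e.g. "Tue Dec 31 23:35:58 +0000 2019"
TWEET_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def split_tweet_date(raw):
    """Return (day, month, year) for one raw tweet timestamp (hypothetical helper)."""
    parsed = datetime.strptime(raw, TWEET_DATE_FORMAT)
    return parsed.day, parsed.month, parsed.year

day, month, year = split_tweet_date("Tue Dec 31 23:35:58 +0000 2019")
print(day, month, year)
```

The timestamps carry a `+0000` offset, so the day obtained this way is the UTC day, which can differ from the US trading day for tweets sent late in the evening.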

### Data Preparation - Merging Tweet Data and Stock Index Data
With the tweets ready, the next task is to merge the tweet data with the target variable. In this case we are trying to predict the direction of movement of the NASDAQ index, so we need to merge the daily stock index data with the collected tweet data. You can use MS Excel operations to clean and manipulate the data (for example, to separate the day, month, and year).

To merge csv files in Python, I use a module called pandas, which provides a tabular data structure called DataFrame. This module provides a simple and efficient merge operation. You could also merge using for-loops, but that would be slow and inefficient. I merge on the day, month, and year information of the tweets and the index data.
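
Before running the merge on the real files, it may help to see it on a toy example. The two small frames below are made up for illustration (the texts and the `Direction` value are assumptions); only the column names follow this example. A left merge keeps every tweet, and tweets sent on a day without index data get `NaN` in the index columns, which can then be dropped.

```python
import pandas as pd

# Toy tweet table: one tweet on a trading day (Fri 4 Jan 2019), one on a Saturday
tweets = pd.DataFrame({
    "Tweet_text": ["tweet on a trading day", "tweet on a Saturday"],
    "Tweet_day": [4, 5],
    "Tweet_month": [1, 1],
    "Tweet_year": [2019, 2019],
})

# Toy index table: only the trading day is present (Direction value is made up)
index_days = pd.DataFrame({"Day": [4], "Month": [1], "Year": [2019], "Direction": [1]})

merged = pd.merge(
    tweets, index_days, how="left",
    left_on=["Tweet_day", "Tweet_month", "Tweet_year"],
    right_on=["Day", "Month", "Year"],
)
print(merged[["Tweet_text", "Direction"]])

# Keep only the tweets that found a matching trading day
labelled = merged.dropna(subset=["Direction"])
print(len(labelled))
```

The `NaN` values introduced by the left merge are also why the `Day`, `Month`, `Year`, and `Direction` columns come out as floats in the merged data.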

In [1]:
# To use the code, assign the file path to the NEW_TWEET_CSV, MERGE_CSV variables.
# Assign the path of NASDAQ_Close_2019.csv file to INDEX_CSV variable

import pandas as pd

NEW_TWEET_CSV = "https://raw.githubusercontent.com/Minh-Khoa-Pham/DSS_Assignment_2/main/Trump_Tweets_2019.csv?token=GHSAT0AAAAAACKHOFUVB3SR4G6Z5VVFXTGSZKSBJTA"
INDEX_CSV = "https://raw.githubusercontent.com/Minh-Khoa-Pham/DSS_Assignment_2/main/NASDAQ_Close_2019.csv?token=GHSAT0AAAAAACKHOFUU4ER3L3NHTBQLIE56ZKSBPLQ"
# MERGE_CSV is written to, not read from, so it should be a writable local path rather than a URL
MERGE_CSV = "Merged_Tweet_NASDAQ_2019.csv"

# Here I read the csv file and import the data into Python. Since the csv has no header row,
# names = ["Tweet_date","Tweet_text","Tweet_day","Tweet_month","Tweet_year"] is used to
# assign column names to the csv data
df_tweet = pd.read_csv(NEW_TWEET_CSV, header = None, names= ["Tweet_date","Tweet_text","Tweet_day","Tweet_month","Tweet_year"])

# The INDEX data csv file has header information at the '0' row
df_idx = pd.read_csv(INDEX_CSV, header= 0)

# .merge is used to merge the NEW_TWEET_CSV and INDEX_CSV using the date information
# left_on specifies the column headers in the tweet csv that should be matched. 
# right_on specifies the column headers on INDEX_CSV that need to be matched 
merge_df = pd.merge(df_tweet, df_idx,  how='left', left_on=['Tweet_day','Tweet_month','Tweet_year'], right_on = ['Day','Month','Year'])
# Here I drop the rows that do not have stock index data. Note: the stock market is closed on
# Saturdays and Sundays, but tweets keep coming!
merge_df= merge_df.dropna(axis=0,subset = ['Direction'])
merge_df.to_csv(MERGE_CSV) #Save the data in the MERGE_CSV file
print("Done")

Done


In [2]:
#Check the merged data frame
print(merge_df.head(1))

                       Tweet_date  \
0  Tue Dec 31 23:35:58 +0000 2019   

                                          Tweet_text  Tweet_day  Tweet_month  \
0  RT @WhiteHouse: Americans saw plenty of Washin...         31           12   

   Tweet_year        Date         Open         High          Low        Close  \
0        2019  31/12/2019  8918.740234  8975.360352  8912.769531  8972.599609   

     Adj Close        Volume  Direction   Day  Month    Year  
0  8972.599609  2.182800e+09        1.0  31.0   12.0  2019.0  


In [3]:
import numpy as np

label_data = merge_df
#Check the dataframe size
print("Size of labelled data ", len(label_data))

Size of labelled data  5359


In [4]:
from sklearn.utils import shuffle #To shuffle the dataframe
from sklearn.model_selection import train_test_split

label_data = shuffle(label_data)
# We need to split the labelled data into training and testing sets. Use an 80-20 split for the labelled data.
df_train, df_test = train_test_split(label_data, test_size=0.2)
print("Size of training data ", len(df_train))

Size of training data  4287
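
One caveat with a purely random split: all tweets sent on the same day share the same `Direction` label, so tweets from one day can land in both the training and the test set. A hedged alternative, not used in this example, is to split by day with scikit-learn's `GroupShuffleSplit`. The sketch below runs on made-up dates; `dates` and `tweet_ids` are illustrative stand-ins for the merged data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the merged data: one entry per tweet, several tweets per day
dates = np.array(
    ["02/01/2019"] * 3 + ["03/01/2019"] * 2 + ["04/01/2019"] * 4
    + ["07/01/2019"] * 1 + ["08/01/2019"] * 2
)
tweet_ids = np.arange(len(dates))

# Each day (group) goes entirely to either the training or the test side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(tweet_ids, groups=dates))

train_days = set(dates[train_idx])
test_days = set(dates[test_idx])
print("days in both sets:", train_days & test_days)
```

On the real data, the groups could be taken from the `Date` column of the merged frame. The accuracy figures reported in this example come from the random split, so they would need to be recomputed under a grouped split.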


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

def feature_extractor(label_text):
  """Fit a CountVectorizer (unigrams and bigrams, English stop words removed,
  at most 10,000 features) on the given texts and return the fitted vectorizer."""
  gen_feature = CountVectorizer(strip_accents = "unicode", analyzer="word", stop_words="english", ngram_range=(1,2), max_features=10000)
  gen_feature.fit(label_text)
  return gen_feature

# Note: df_train and df_test were created in the previous cell, so the cleaning below
# changes only label_data (used to fit the vocabulary), not the split copies.
label_data['Tweet_text'] = label_data['Tweet_text'].str.lower() # Change all the tweets to lower case
label_data['Tweet_text'] = label_data['Tweet_text'].str.replace(r'[^\w\s]', '', regex=True) # Remove all punctuation in the tweets
label_data['Tweet_text'] = label_data['Tweet_text'].replace(r'\d+', 'NUM', regex=True)  # Replace numbers with NUM
label_data['Tweet_text'] = label_data['Tweet_text'].replace(r'http\S+', '', regex=True)  # Remove URLs

gen_feature = feature_extractor(label_data['Tweet_text']) 
train_x = gen_feature.transform(df_train['Tweet_text'])
test_x = gen_feature.transform(df_test['Tweet_text'])
print(type(train_x))


<class 'scipy.sparse.csr.csr_matrix'>

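To make the n-gram settings concrete, here is a small sketch on three made-up tweets (the toy texts are assumptions, not rows from the data set). It uses `ngram_range=(1, 2)` as in the cell above, but leaves out `stop_words` and `max_features` so that every feature stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up tweets for illustration only
toy_tweets = ["great jobs report", "great trade deal", "trade deal talks"]

# Unigrams and bigrams, as in the feature extractor of this example
vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))
X = vec.fit_transform(toy_tweets)

# Each column of X counts one unigram or bigram; rows are tweets
for feature in sorted(vec.vocabulary_):
    column = vec.vocabulary_[feature]
    print(feature, X[:, column].toarray().ravel())
```

Each tweet becomes one row of counts, and a bigram such as "trade deal" gets its own column next to the single words "trade" and "deal".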

### Generate the Naive Bayes Classifier

In [6]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB() #initialize the classifier object
classifier = mnb.fit(train_x, df_train['Direction']) #train the model
acc = classifier.score(test_x,df_test['Direction']) #check the accuracy score
print("accuracy of mnb is = ", acc)

accuracy of mnb is =  0.6138059701492538


In [7]:
#Check other metrics of the model
from sklearn.metrics import classification_report

y_pred = classifier.predict(test_x)
print(classification_report(df_test['Direction'], y_pred))


              precision    recall  f1-score   support

        -1.0       0.53      0.50      0.51       441
         1.0       0.66      0.70      0.68       631

    accuracy                           0.61      1072
   macro avg       0.60      0.60      0.60      1072
weighted avg       0.61      0.61      0.61      1072
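
As a quick check on how to read this report, the sketch below recomputes the f1-scores from the (rounded) precision and recall values above, and compares the accuracy with a simple reference point: always predicting the majority class 1.0. The numbers are copied from the report; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall taken from the report above
f1_up = f1_from(0.66, 0.70)    # class 1.0
f1_down = f1_from(0.53, 0.50)  # class -1.0

# Support column: 631 test tweets labelled 1.0, 441 labelled -1.0
up_support, down_support = 631, 441
majority_baseline = up_support / (up_support + down_support)

print(round(f1_up, 2), round(f1_down, 2), round(majority_baseline, 3))
```

Because 631 of the 1072 test tweets fall on up days, a classifier that always answers 1.0 would already be right about 59% of the time, which is the level the accuracy scores in this example should be compared against.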



### Generate the Random Forest Classifier

In [8]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()
rf_model = rf_classifier.fit(train_x, df_train['Direction'])
rf_acc = rf_model.score(test_x, df_test['Direction'])
print("Accuracy of Random Forest classifier is =", rf_acc)


Accuracy of Random Forest classifier is = 0.5998134328358209


In [9]:
from sklearn.metrics import classification_report

rf_y_pred = rf_classifier.predict(test_x)
print(classification_report(df_test['Direction'], rf_y_pred))


              precision    recall  f1-score   support

        -1.0       0.52      0.38      0.44       441
         1.0       0.64      0.75      0.69       631

    accuracy                           0.60      1072
   macro avg       0.58      0.57      0.56      1072
weighted avg       0.59      0.60      0.59      1072



### Generate the Decision Tree Classifier

In [10]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier()
dt_model = dt_classifier.fit(train_x, df_train['Direction'])

dt_acc = dt_model.score(test_x, df_test['Direction'])
print("Accuracy of Decision Tree classifier is =", dt_acc)

Accuracy of Decision Tree classifier is = 0.585820895522388


In [11]:
from sklearn.metrics import classification_report

dt_y_pred = dt_classifier.predict(test_x)
print(classification_report(df_test['Direction'], dt_y_pred))


              precision    recall  f1-score   support

        -1.0       0.50      0.51      0.51       441
         1.0       0.65      0.64      0.64       631

    accuracy                           0.59      1072
   macro avg       0.57      0.58      0.57      1072
weighted avg       0.59      0.59      0.59      1072



### Evaluate the model
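Since the Naive Bayes model has the highest accuracy on the test set (0.61, compared with 0.60 for Random Forest and 0.59 for Decision Tree), we use it to predict the direction of the NASDAQ from the text of a tweet. Note that the margins between the three models are small.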

In [14]:
s = input("Enter the tweet for analysis: ").lower()

s_l = pd.Series(s)
x = gen_feature.transform(s_l)
print('NASDAQ stock direction: ',classifier.predict(x))

Enter the tweet for analysis: I am asking for everyone at the U.S. Capitol to remain peaceful. No violence! Remember, WE are the Party of Law & Order – respect the Law and our great men and women in Blue. Thank you!
NASDAQ stock direction:  [1.]
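
The cell above only lower-cases the typed tweet. A minimal sketch of a helper that applies the same kind of cleaning used earlier (lower case, no URLs, no punctuation, numbers replaced by NUM) to a single string is shown below. `clean_tweet` is a hypothetical name, and removing URLs before punctuation is an assumption about the intended order.

```python
import re

def clean_tweet(text):
    """Clean one tweet string (hypothetical helper, not part of the original cells)."""
    text = text.lower()                   # lower case
    text = re.sub(r"http\S+", "", text)   # remove URLs first, while they are still intact
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"\d+", "NUM", text)    # replace numbers with NUM
    return text.strip()

print(clean_tweet("Jobs up 300,000! https://t.co/abc"))
```

With this helper, the prediction step could be written as `classifier.predict(gen_feature.transform([clean_tweet(s)]))`, reusing the `gen_feature` and `classifier` objects from the cells above.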
