In [1]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import naive_bayes, svm, metrics
import pandas as pd
# reset colwidth options when running all cells
pd.reset_option('display.max_colwidth')

### Load dataset (all tweets and corresponding stock prices)
... and group by day / timestamp, even though the groups are not used in the classifier yet.

In [2]:
data = pd.read_json('processed_data/data_merged.json')
# remove columns that were unexpectedly generated during saving
# data.drop(columns=['level_0', 'index'], inplace=True)

In [3]:
# group data by day
daily_data = data.groupby(data['timestamp'])
daily_data.first()

timestamp,hashtags,text,username,likes,replies,retweets,Open,Close,PriceUp
2018-01-02,"[Tesla, ModelS]","In the past 2 years, I've driven 18,823 miles ...",Ben Sullins 💪,110,6,10,312.00,320.53,True
2018-01-03,[Tesla],Día de piernas ... Estrenando mallas...#Tesla ...,El CaZador,107,3,4,321.00,317.25,False
2018-01-04,"[Innovation, Tesla, electricvehicles, Cars, br...",This is awesome! New brand technology #Innovat...,Gabriela Mascaró,6,1,3,312.87,314.62,True
2018-01-05,"[Tesla, TeslaModel3]",#Tesla #TeslaModel3 hahapic.twitter.com/DMlxOf...,WtaFiGO,19,1,7,316.62,316.58,False
2018-01-08,[Tesla],Tesla Planning Supercharger Station With ‘Old ...,Tesla Motors Club,132,4,16,316.00,336.41,True
...,...,...,...,...,...,...,...,...,...
2018-05-29,"[Tesla, Elektroautopic]","#Tesla Model 3: Europa-Start ""erste Jahreshälf...",ecomento.de,3,2,1,278.51,283.76,True
2018-05-30,[Tesla],Oh well... in case you were wondering why @Tes...,Safer Vehicles Proved,27,4,9,283.29,291.72,True
2018-05-31,[Tesla],Take comfort #Tesla friends. All the alleged ...,Groggy T. Bear,22,4,3,287.21,284.73,False
2018-06-01,"[FBI, Tesla]",Según documento desclasificado del #FBI Niko...,Misterio Desconocido,40,3,14,285.86,291.82,True


In [4]:
# get groups' names
daily_data.groups.keys()
groups = [name for name, _ in daily_data]
groups[0]

Timestamp('2018-01-02 00:00:00')

In [5]:
# get all tweets from the first day
first_day_data = daily_data.get_group(groups[0])
first_day_data.head()

,hashtags,text,username,likes,replies,retweets,Open,Close,PriceUp
0,"[Tesla, ModelS]","In the past 2 years, I've driven 18,823 miles ...",Ben Sullins 💪,110,6,10,312.0,320.53,True
1,"[Tesla, P90D, Blog, Youtube]",Ya estamos en @louesfera probando un #Tesla #P...,Fco Javier,2,1,2,312.0,320.53,True
2,"[Snapchat, Uber, Twitter, Facebook, Tesla, Goo...",Here's how old these companies will be turning...,Imran,53,7,41,312.0,320.53,True
3,[Muskwatchpic],"From SpaceX to Tesla, here are our biggest que...",Nerdist,37,5,10,312.0,320.53,True
4,"[Braunschweig, VW, Tesla]",In #Braunschweig produziert #VW seine Batterie...,HAZ,5,3,2,312.0,320.53,True


## A first very simple classifier
Use each tweet to predict whether it was written on a day on which the stock price rose (PriceUp == True) or not.

As classifiers, use the algorithms learned in class: naive Bayes and SVMs.

In [6]:
# generate the train and test sets
tweets_train, tweets_test, labels_train, labels_test = train_test_split(data['text'], data['PriceUp'], 
                                                   test_size=0.15, random_state=333, shuffle=True)
len(tweets_train)

4206

In [7]:
# vectorize train and test data with TF-IDF
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(tweets_train)
test_matrix = vectorizer.transform(tweets_test)
print(train_matrix.shape)
type(train_matrix)

(4206, 22108)


scipy.sparse.csr.csr_matrix

### Observation:
Our feature space (22,108 dimensions) is far larger than the number of training samples (4,206). The classifiers are therefore very likely to overfit and to generalize poorly to the test data. We can still use them as a simple baseline.
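One possible way to shrink the feature space is to let `TfidfVectorizer` drop rare terms. The sketch below shows the effect of `min_df` on a few made-up tweets; the toy texts and the threshold are assumptions for illustration, not values tuned on our data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus standing in for tweets_train (hypothetical texts)
toy_tweets = [
    "tesla model 3 production ramps up",
    "tesla stock drops after production delay",
    "new tesla supercharger station opens",
    "model 3 delivery delayed again",
]

# baseline: every token with at least two characters becomes a feature
full_vectorizer = TfidfVectorizer()
full_matrix = full_vectorizer.fit_transform(toy_tweets)

# min_df=2 keeps only terms that occur in at least two documents
pruned_vectorizer = TfidfVectorizer(min_df=2)
pruned_matrix = pruned_vectorizer.fit_transform(toy_tweets)

print(full_matrix.shape, pruned_matrix.shape)
print(sorted(pruned_vectorizer.vocabulary_))
```

`max_features` is a related option that caps the vocabulary at the most frequent terms. Whether either setting helps on the real tweets would have to be checked on held-out data.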

In [8]:
svm_classifier = svm.LinearSVC(max_iter=int(1e6))
svm_classifier.fit(train_matrix, labels_train)

nb_classifier = naive_bayes.GaussianNB()
nb_classifier.fit(train_matrix.toarray(), labels_train)

GaussianNB(priors=None, var_smoothing=1e-09)

#### Test if training was successful
Because we have too little data for this many features, a working classifier should overfit to the training data and hence predict the labels of the training set almost perfectly.

In [9]:
# check if classifier has really overfitted to the data by testing it on the training data
preds_svm = svm_classifier.predict(train_matrix)
svm_acc = metrics.accuracy_score(labels_train, preds_svm)

preds_nb = nb_classifier.predict(train_matrix.toarray())
nb_acc = metrics.accuracy_score(labels_train, preds_nb)


svm_prec, svm_rec, svm_fscore, svm_sup = \
metrics.precision_recall_fscore_support(labels_train, preds_svm, pos_label=True, average='binary')

nb_prec, nb_rec, nb_fscore, nb_sup = \
metrics.precision_recall_fscore_support(labels_train, preds_nb, pos_label=True, average='binary')

print('   \t\tSVM \t\tNaive Bayes')
print('Acc \t\t {0:.3f} \t\t {1:.3f}'.format(svm_acc, nb_acc))
print('Prec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_prec, nb_prec))
print('Rec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_rec, nb_rec))
print('FMeas \t\t {0:.3f} \t\t {1:.3f}'.format(svm_fscore, nb_fscore))

   		SVM 		Naive Bayes
Acc 		 0.996 		 0.977
Prec 		 0.995 		 1.000
Rec 		 0.997 		 0.957
FMeas 		 0.996 		 0.978


Check! Both classifiers reach almost 100% accuracy on the training data, so we can be sure that they really learned a model from the training data.

### Test learned models on the test set

In [10]:
# test the classifiers
preds_svm = svm_classifier.predict(test_matrix)
svm_acc = metrics.accuracy_score(labels_test, preds_svm)

preds_nb = nb_classifier.predict(test_matrix.toarray())
nb_acc = metrics.accuracy_score(labels_test, preds_nb)


svm_prec, svm_rec, svm_fscore, svm_sup = \
metrics.precision_recall_fscore_support(labels_test, preds_svm, pos_label=True, average='binary')

nb_prec, nb_rec, nb_fscore, nb_sup = \
metrics.precision_recall_fscore_support(labels_test, preds_nb, pos_label=True, average='binary')

print('   \t\tSVM \t\tNaive Bayes')
print('Acc \t\t {0:.3f} \t\t {1:.3f}'.format(svm_acc, nb_acc))
print('Prec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_prec, nb_prec))
print('Rec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_rec, nb_rec))
print('FMeas \t\t {0:.3f} \t\t {1:.3f}'.format(svm_fscore, nb_fscore))

   		SVM 		Naive Bayes
Acc 		 0.569 		 0.556
Prec 		 0.584 		 0.592
Rec 		 0.599 		 0.473
FMeas 		 0.592 		 0.526


To interpret the results, we first have to look at the distribution of labels in the test dataset.
An even simpler baseline to compare our results against is a classifier that always predicts the class that is most common in the test set.

In [11]:
# share of 'True' labels; value_counts() sorts by frequency, so its result cannot safely be unpacked as (True, False)
true_share = (labels_test == True).mean()
print("A classifier that always predicts 'True' would get an accuracy of: %.3f" % true_share)

A classifier that always predicts 'True' would get an accuracy of: 0.521
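
A comparable baseline can also be obtained from scikit-learn's `DummyClassifier`. The following is a minimal sketch on made-up labels (the `toy_*` and `placeholder_*` names are hypothetical). Note that it picks the majority class from the training labels rather than from the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# toy labels standing in for labels_train / labels_test (hypothetical values)
toy_labels_train = np.array([True, True, True, False, False])
toy_labels_test = np.array([True, False, True, True])

# the dummy model ignores the features, so a placeholder column is enough
placeholder_train = np.zeros((len(toy_labels_train), 1))
placeholder_test = np.zeros((len(toy_labels_test), 1))

# always predict the most frequent class seen during fit()
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_train, toy_labels_train)
baseline_acc = accuracy_score(toy_labels_test, baseline.predict(placeholder_test))
print("Majority-class baseline accuracy: %.3f" % baseline_acc)
```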


At first sight, our classifiers seem to have learned a little: they reach accuracies of 56.9 percent (SVM) and 55.6 percent (naive Bayes), while the constant prediction reaches 52.1 percent. In our case of predicting, based on a tweet, whether the stock price will close higher than it opened, recall is much more important to us.

Altogether, the difference is too small to be significant and is not expected to be reproducible under different settings such as the size of the training set or another random seed.
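
One way to check how much the result depends on the random seed is to repeat the split with several seeds and look at the spread of the accuracies. The sketch below does this on a made-up corpus (the `toy_*` names and texts are hypothetical). Because the toy texts are repeated, the scores themselves say nothing about the real data; only the procedure is of interest.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy corpus standing in for data['text'] / data['PriceUp'] (hypothetical)
toy_texts = ["tesla rally record high", "tesla recall bad news",
             "model 3 demand strong", "production delay worries investors"] * 10
toy_labels = np.array([True, False, True, False] * 10)

accuracies = []
for seed in range(5):
    # same split settings as above, only the seed changes
    X_train, X_test, y_train, y_test = train_test_split(
        toy_texts, toy_labels, test_size=0.15, random_state=seed, shuffle=True)
    # the vectorizer is refitted on each training split inside the pipeline
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

print("mean accuracy: %.3f, std: %.3f" % (np.mean(accuracies), np.std(accuracies)))
```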

**Update**: After experimenting with the hyperparameters, we can say that the accuracy of our classifiers is always slightly above that of the constant predictor. The SVM classifier always reaches a higher accuracy than the NB classifier, as well as a higher recall, which is especially important for our goal because we want to avoid false negatives.
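
To make the link between recall and false negatives concrete, here is a small worked example on made-up predictions (the label vectors are hypothetical). Recall is TP / (TP + FN), so every false negative lowers it.

```python
from sklearn.metrics import confusion_matrix, recall_score

# hypothetical ground truth and predictions for six tweets
toy_true = [True, True, True, False, False, True]
toy_pred = [True, False, True, True, False, True]

# with labels=[False, True], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[False, True]).ravel()
manual_recall = tp / (tp + fn)
sklearn_recall = recall_score(toy_true, toy_pred, pos_label=True)

print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
print("recall: %.2f" % manual_recall)
```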

**Discussion:** It looks like this very simple classifier has already learned some patterns in the data. This is unexpected, but can be explained as follows:

- Test and training data are expected to be correlated, as the test data contains tweets from days we have already trained on. A better evaluation would use data from new days.
- Think about it again: we only predict whether a tweet was written on a good day for TSLA and do not consider the time at which it was written, so it is likely that people write about the positive development of the stock price...
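
The evaluation on new days mentioned in the first point could look roughly like the following chronological hold-out. This is a sketch under assumptions: `toy_data` is a made-up frame that only mimics the column names of `data`, and the 25% share of test days is an arbitrary choice.

```python
import pandas as pd

# toy frame standing in for `data` (hypothetical rows, same column names)
toy_data = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2018-01-02", "2018-01-02", "2018-01-03", "2018-01-03",
         "2018-01-04", "2018-01-04", "2018-01-05", "2018-01-05"]),
    "text": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "PriceUp": [True, True, False, False, True, True, False, False],
})

# sort the distinct days and reserve the most recent ones for testing
days = toy_data["timestamp"].drop_duplicates().sort_values()
n_test_days = max(1, int(round(0.25 * len(days))))
test_days = days.iloc[-n_test_days:]

# every tweet of a day goes to the same side of the split
is_test = toy_data["timestamp"].isin(test_days)
train_part, test_part = toy_data[~is_test], toy_data[is_test]

print(len(train_part), "training tweets,", len(test_part), "test tweets")
```

With such a split, no test tweet shares a day (and therefore a label) with any training tweet, so the classifier cannot benefit from having seen other tweets of the same day.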
