In [1]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import naive_bayes, svm, metrics 
from sklearn.utils import shuffle
import pandas as pd
import numpy as np
# reset the column-width option when re-running all cells
pd.reset_option('display.max_colwidth')

### Load dataset (all tweets and corresponding stock prices)
The tweets are also grouped by day (timestamp), although the groups are not yet used by the classifier.

In [2]:
data = pd.read_json('processed_data/data_final_merged.json')
# remove columns that were unexpectedly generated during saving
# data.drop(columns=['level_0', 'index'], inplace=True)

In [3]:
# group data by day
daily_data = data.groupby(data['timestamp'])
daily_data.first()

timestamp,hashtags,text,username,likes,replies,retweets,Open,Close,PriceUp
2018-01-02,[Muskwatchpic],"From SpaceX to Tesla, here are our biggest que...",Nerdist,37,5,10,312.00,320.53,True
2018-01-03,[Tesla],#Tesla just released record delivery numbers f...,InsideEVs Forum,11,1,5,321.00,317.25,False
2018-01-04,[],Tesla struggles with Model 3 production pic.t...,Automotive News,5,0,5,312.87,314.62,True
2018-01-05,[munilandhttps],Head of Puerto Rico electric utility says they...,Cate Long,17,4,9,316.62,316.58,False
2018-01-08,[],“Bırakın doğruları gelecek söylesin ve herkesi...,[n]Beyin,324,2,86,316.00,336.41,True
...,...,...,...,...,...,...,...,...,...
2018-05-29,[],You know Erin let us forget for a short period...,Darji,23,1,4,278.51,283.76,True
2018-05-30,[],Tesla Autopilot blamed for crash with parked p...,BBC News Technology,11,3,11,283.29,291.72,True
2018-05-31,[Tesla],Weekly #Tesla short update. $TSLA short intere...,Ihor Dusaniwsky,12,4,8,287.21,284.73,False
2018-06-01,[1u],Tesla and Elon Musk face tough questions from ...,Minnesota AFL-CIO,19,0,10,285.86,291.82,True


In [4]:
# get the groups' names (one key per day)
groups = [name for name, _ in daily_data]
groups[0]

Timestamp('2018-01-02 00:00:00')

In [5]:
# get all tweets from the first day
first_day_data = daily_data.get_group(groups[0])
first_day_data.head()

,hashtags,text,username,likes,replies,retweets,Open,Close,PriceUp
0,[Muskwatchpic],"From SpaceX to Tesla, here are our biggest que...",Nerdist,37,5,10,312.0,320.53,True
1,"[Snapchat, Uber, Twitter, Facebook, Tesla, Goo...",Here's how old these companies will be turning...,Imran,53,7,41,312.0,320.53,True
2,"[Model3, Autopilot2, pasatealoelectrico, Tesla]","Primera prueba del @Tesla #Model3 en la nieve,...",PasatealoElectrico,23,0,6,312.0,320.53,True
3,[],Know the whirr sound a Tesla makes?\n\nThat's ...,Elon Musk News,8,0,5,312.0,320.53,True
4,[],"In Norway, @Tesla finished Q4 with 3,753 Model...",Tesla Daily,28,0,6,312.0,320.53,True
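
The grouping step can be tried on a small scale. The following is a minimal sketch on a made-up toy table (the `toy` frame and its values are assumptions, not the real dataset); it shows how `groupby` yields one group per day and how a per-day summary could be derived from it.

```python
import pandas as pd

# Toy stand-in for the merged tweet/price table (values are made up).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-02", "2018-01-02", "2018-01-03"]),
    "text": ["tweet a", "tweet b", "tweet c"],
    "PriceUp": [True, True, False],
})

toy_daily = toy.groupby("timestamp")

# One row per day: number of tweets and the (constant) label of that day.
per_day = toy_daily.agg(n_tweets=("text", "size"), price_up=("PriceUp", "first"))
print(per_day)

# Sorted group keys, then all tweets of the first day.
toy_days = sorted(toy_daily.groups)
first_day_tweets = toy_daily.get_group(toy_days[0])
print(first_day_tweets)
```

Because `PriceUp` is identical for all tweets of a day, a per-day table like `per_day` has exactly one label per group.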


## A first very simple classifier
Use each tweet on its own to predict whether it was written on a day on which the stock price rose (PriceUp == True) or not.

As classifiers, use the algorithms learned in class: naive Bayes and SVMs.

In [6]:
# generate the train and test sets
# shuffle=False keeps the original row order, so the last 15% of rows form the test set
# (random_state has no effect when shuffle=False)
tweets_train, tweets_test, labels_train, labels_test = train_test_split(data['text'], data['PriceUp'], 
                                                   test_size=0.15, random_state=333, shuffle=False)

# now shuffle within each set so that neighbouring samples are no longer in their original order
tweets_train, labels_train = shuffle(tweets_train, labels_train)
tweets_test, labels_test = shuffle(tweets_test, labels_test)

print('Training Set size:\t', len(tweets_train))
print('Test Set size:\t\t ', len(tweets_test))

Training Set size:	 19991
Test Set size:		  3528


In [7]:
# vectorize train and test data with TF-IDF
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(tweets_train)
test_matrix = vectorizer.transform(tweets_test)
print(train_matrix.shape)
# vocabulary_ is available in all scikit-learn versions (get_feature_names() was removed in newer releases)
print("Size of the vocabulary: ", len(vectorizer.vocabulary_))

(19991, 59290)
Size of the vocabulary:  59290


### Observation:
The dimensionality of our vector space (59,290) is far higher than the number of training samples (19,991). The classifier is therefore very likely to overfit and to generalize poorly to the test data. We can still use it as a simple baseline.

But before that...
### Refit the vectorizer with the most important tweets
To reduce the vocabulary of the vectorizer, we first select the most important tweets (more than 25 retweets and, in addition, more than 100 likes or more than 10 replies) and use only them to determine the vocabulary.

In [8]:
vocab_data = data[(data['retweets']>25) &
                  ((data['likes']>100) | (data['replies']>10))]

# display whole tweet texts
# None disables truncation (-1 is deprecated in newer pandas versions)
pd.set_option('display.max_colwidth', None)
print("Number of considered tweets: ", vocab_data.shape[0])

Number of considered tweets:  2684


In [9]:
vectorizer = TfidfVectorizer(dtype=np.float32)
vectorizer.fit(vocab_data['text'])
print("Size of the vocabulary: ", len(vectorizer.vocabulary_))

Size of the vocabulary:  13399


In [10]:
# now transform the train and test set with the limited vocabulary
train_matrix = vectorizer.transform(tweets_train)
test_matrix = vectorizer.transform(tweets_test)
print(train_matrix.shape)

(19991, 13399)
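
To illustrate what fitting on a subset and transforming the full data does, here is a minimal sketch on made-up texts (`popular_texts` and `all_texts` are hypothetical stand-ins for `vocab_data['text']` and the full tweet set). Terms that never occur in the subset are silently dropped, so a tweet consisting only of such terms becomes an all-zero vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts; the "popular" subset plays the role of vocab_data.
popular_texts = ["tesla stock rises", "model 3 production"]
all_texts = ["tesla stock falls today", "new model 3 colours", "unrelated words only"]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(popular_texts)           # vocabulary comes from the subset only
toy_matrix = toy_vectorizer.transform(all_texts)

print(sorted(toy_vectorizer.vocabulary_))   # terms learned from popular_texts
print(toy_matrix.shape)                     # (number of texts, number of learned terms)
# number of non-zero entries per row; 0 means no known term occurred
print(toy_matrix.getnnz(axis=1))
```

Note that the default tokenizer ignores single-character tokens such as "3", so the toy vocabulary has five terms. Counting all-zero rows in the real `train_matrix` would show how many tweets carry no information under the reduced vocabulary.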


Now we have about 1.5 times as many samples (19,991) as dimensions (13,399). This is definitely better than having fewer samples than dimensions, but it is still not expected to be enough to train a good model.

We still try it and use it as a simple baseline that we will try to beat!
An SVM is also reported to be effective even when the number of dimensions exceeds the number of samples, so we definitely give it a try.

Normalization may not be necessary, as a TF-IDF vectorizer outputs vectors with values between 0 and 1. We should still check the feature means. Note that subtracting the means turns the sparse matrix into a dense one. We do not additionally scale to unit variance, as we are dealing with a sparse matrix.

In [11]:
# column means of the TF-IDF matrix (one mean per vocabulary term)
means = np.asarray(train_matrix.mean(axis=0))
print('Min and Max means: ', means.min(), means.max())
# centering destroys sparsity: from here on train_matrix is a dense array
train_matrix = train_matrix.toarray() - means
means = train_matrix.mean(axis=0)
print('Min and Max after subtracting the means: ', means.min(), means.max())

Min and Max means:  0.0 0.05854584
Min and Max after subtracting the means:  -7.129554e-08 8.599791e-08


After centering, the largest column mean is on the order of 1e-08, which is effectively zero compared with about 0.06 before.
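
Centering has a cost that is easy to overlook: almost every zero entry becomes non-zero, so the matrix can no longer be stored sparsely. The sketch below demonstrates this on a small made-up sparse matrix (`toy_sparse` is an assumption) and then estimates the dense float32 footprint for the training-matrix shape printed in In [10].

```python
import numpy as np
from scipy import sparse

# Small sparse matrix standing in for the TF-IDF train_matrix (made-up values).
rng = np.random.default_rng(0)
dense_seed = np.zeros((200, 50), dtype=np.float32)
dense_seed[rng.integers(0, 200, size=100), rng.integers(0, 50, size=100)] = 1.0
toy_sparse = sparse.csr_matrix(dense_seed)

col_means = np.asarray(toy_sparse.mean(axis=0))   # shape (1, 50)
centered = toy_sparse.toarray() - col_means        # dense ndarray

print("stored values before centering:", toy_sparse.nnz)
print("stored values after centering: ", centered.size)

# Dense float32 footprint for the shape (19991, 13399) reported in In [10].
n_rows, n_cols = 19991, 13399
dense_bytes = n_rows * n_cols * 4
print("dense float32 size: %.2f GB" % (dense_bytes / 1e9))
```

At roughly 1 GB for the centered matrix alone, this is a plausible explanation for the memory caution attached to the classifier training further down.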

In [12]:
# use min-max normalization instead
# doesn't work because of mins and maxs being sparse matrices
mins = np.amin(train_matrix, axis=0)
maxs = np.max(train_matrix, axis=0)
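
If a min-max style scaling is wanted without giving up sparsity, one option (not used in this notebook, shown only as an alternative) is scikit-learn's `MaxAbsScaler`, which divides each column by its largest absolute value. For non-negative features such as uncentered TF-IDF values this maps every column to the range [0, 1]. The matrices below are made-up toy data.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Toy non-negative sparse matrices standing in for uncentered TF-IDF output.
toy_train = sparse.csr_matrix(np.array([[0.0, 0.2, 0.0],
                                        [0.5, 0.0, 0.0],
                                        [0.0, 0.4, 0.1]]))
toy_test = sparse.csr_matrix(np.array([[0.25, 0.1, 0.0]]))

scaler = MaxAbsScaler()
scaled_train = scaler.fit_transform(toy_train)   # result is still sparse
scaled_test = scaler.transform(toy_test)         # reuses the training maxima

print(sparse.issparse(scaled_train))
print(scaled_train.toarray())
print(scaled_test.toarray())
```

The scaler is fitted on the training data only and then applied to the test data, so both sets are transformed in the same way.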

In [14]:
svm_classifier = svm.LinearSVC(C=1.5, max_iter=int(1e8))
svm_classifier.fit(train_matrix, labels_train)

# Caution: you might not have enough memory to train the NB classifier
nb_classifier = naive_bayes.GaussianNB()
nb_classifier.fit(train_matrix, labels_train)

GaussianNB(priors=None, var_smoothing=1e-09)
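
`GaussianNB` needs a dense array, which is where the memory pressure comes from. A commonly used alternative for text features is `MultinomialNB`, which accepts sparse input directly but requires non-negative features, so it would have to be fed the uncentered TF-IDF matrix. This is a different model from the one trained above; the sketch uses made-up toy texts and labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data; True stands for "price went up".
toy_texts = ["stock up great day", "great rally up",
             "stock down bad day", "bad crash down"]
toy_labels = [True, True, False, False]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)    # sparse and non-negative

toy_nb = MultinomialNB()
toy_nb.fit(toy_X, toy_labels)               # no .toarray() needed

toy_pred = toy_nb.predict(toy_vec.transform(["great day up", "bad day down"]))
print(toy_pred)
```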

#### Test if training was successful
Because we do not have enough data relative to the number of features, a working classifier should overfit the training data and hence predict the labels of the training set almost perfectly.

In [16]:
# check if classifier has really overfitted to the data by testing it on the training data
preds_svm = svm_classifier.predict(train_matrix)
svm_acc = metrics.accuracy_score(labels_train, preds_svm)

preds_nb = nb_classifier.predict(train_matrix)
nb_acc = metrics.accuracy_score(labels_train, preds_nb)
# fallback if the NB classifier could not be trained due to insufficient memory:
# nb_acc = 0

svm_prec, svm_rec, svm_fscore, svm_sup = \
metrics.precision_recall_fscore_support(labels_train, preds_svm, pos_label=True, average='binary')

nb_prec, nb_rec, nb_fscore, nb_sup = \
metrics.precision_recall_fscore_support(labels_train, preds_nb, pos_label=True, average='binary')

print('   \t\tSVM \t\tNaive Bayes')
print('Acc \t\t {0:.3f} \t\t {1:.3f}'.format(svm_acc, nb_acc))
print('Prec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_prec, nb_prec))
print('Rec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_rec, nb_rec))
print('FMeas \t\t {0:.3f} \t\t {1:.3f}'.format(svm_fscore, nb_fscore))

   		SVM 		Naive Bayes
Acc 		 0.825 		 0.654
Prec 		 0.826 		 0.990
Rec 		 0.832 		 0.326
FMeas 		 0.829 		 0.490


OLD: Check! Both classifiers reach almost 100% accuracy on the training data. We can therefore be sure that the classifiers really learned a model from the training data.

UPDATED: Both classifiers perform clearly better than random guessing on the training data (accuracy 0.825 for the SVM and 0.654 for naive Bayes). Although they do not reach the 100% accuracy that would clearly indicate overfitting, the models have evidently learned patterns in the training data. With a simple linear SVM, without complex kernels or hyperparameter tuning, better results cannot be expected.
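
The naive Bayes row in the table above combines a very high precision (0.990) with a low recall (0.326). The F-measure is the harmonic mean of the two and is therefore dominated by the smaller value. A short sketch that recomputes it from the rounded table values (the helper `f_measure` is introduced here for illustration):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded training-set values taken from the table above.
svm_f = f_measure(0.826, 0.832)
nb_f = f_measure(0.990, 0.326)
print("SVM: %.3f" % svm_f)
print("NB:  %.3f" % nb_f)
```

Both values agree with the reported F-measures up to the rounding of the inputs.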

### Evaluate learned models on the test set

In [18]:
# test the classifiers
# note: test_matrix is used as produced in In [10]; unlike train_matrix it was not mean-centered
preds_svm = svm_classifier.predict(test_matrix)
svm_acc = metrics.accuracy_score(labels_test, preds_svm)

preds_nb = nb_classifier.predict(test_matrix.toarray())
nb_acc = metrics.accuracy_score(labels_test, preds_nb)
#nb_acc = 0

svm_prec, svm_rec, svm_fscore, svm_sup = \
metrics.precision_recall_fscore_support(labels_test, preds_svm, pos_label=True, average='binary')

nb_prec, nb_rec, nb_fscore, nb_sup = \
metrics.precision_recall_fscore_support(labels_test, preds_nb, pos_label=True, average='binary')

print('   \t\tSVM \t\tNaive Bayes')
print('Acc \t\t {0:.3f} \t\t {1:.3f}'.format(svm_acc, nb_acc))
print('Prec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_prec, nb_prec))
print('Rec \t\t {0:.3f} \t\t {1:.3f}'.format(svm_rec, nb_rec))
print('FMeas \t\t {0:.3f} \t\t {1:.3f}'.format(svm_fscore, nb_fscore))

   		SVM 		Naive Bayes
Acc 		 0.496 		 0.499
Prec 		 0.504 		 0.529
Rec 		 0.477 		 0.133
FMeas 		 0.490 		 0.212


### Observation

As expected, we get classifiers that are only as good as random guessing. Although we did not expect better results, we tried several different hyperparameters, none of which improved the results. Trying out an SVM with polynomial

OLD (getting about 57% accuracy when shuffling the data during the train-test split): To make a statement about the results, we first have to look at the distribution of labels in the test dataset.
An even simpler baseline to compare our results with is a classifier that always predicts the most common class in the test set.

In [19]:
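# value_counts() sorts by frequency, so the first entry is the majority class (not necessarily True)
majority_count = labels_test.value_counts().iloc[0]
print("A classifier that always predicts the majority class would get an accuracy of: %.3f" % (majority_count/labels_test.count()))

A classifier that always predicts the majority class would get an accuracy of: 0.508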
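
The same kind of baseline can be obtained with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels and placeholder features (both are assumptions). Unlike the cell above, it learns the majority class from the training labels and is then scored on the test labels, which avoids looking at the test distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; the features are irrelevant for this baseline.
toy_train_labels = np.array([True, True, False, True, False])
toy_test_labels = np.array([True, False, False, True])
placeholder_train = np.zeros((len(toy_train_labels), 1))
placeholder_test = np.zeros((len(toy_test_labels), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_train, toy_train_labels)
baseline_preds = baseline.predict(placeholder_test)
baseline_acc = accuracy_score(toy_test_labels, baseline_preds)
print(baseline_acc)
```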


OLD (getting about 57% accuracy when shuffling the data during the train-test split): At first sight, our classifiers seem to have learned a little, with accuracies of 55.4 and 54.9 percent, while the constant prediction would reach 53.7 percent. In our case of predicting from a tweet whether the stock price will close higher than it opened, recall is much more important to us.

Altogether, the difference is too small to be significant and is not expected to be reproducible under different settings, such as another training set size or another random seed.

**Update**: After experimenting with hyperparameters, we can say that the accuracy of our classifiers is always slightly above that of the constant predictor. The SVM classifier always reaches a higher accuracy than the NB classifier, as well as a higher recall, which is especially important for our goal because we want to avoid false negatives.

**Discussion:** It looks like this very simple classifier has already learned some patterns in the data, which is unexpected but can be explained as follows:

- Test and training data are expected to be correlated, as the test data contains tweets from days we have already trained on. A better evaluation would use data from new days.
- Think about it again: as we only predict whether a tweet was written on a good day for TSLA and do not consider the time of day at which it was written, it is likely that people write about the positive development of the stock price...


### Another Update:
After we stopped shuffling the data in the train-test split, the evaluation of our model yields the expected accuracy of just below 50%. When the data was shuffled beforehand, the test set contained tweets from the same days that were already used during training.
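
A split by position can still cut through a single day, so that day's tweets end up in both sets. One way to rule this out is to split on whole days. The following is a minimal sketch on a made-up table (`toy` and `n_test_days` are assumptions), holding out the most recent days as the test set.

```python
import pandas as pd

# Made-up toy table with several tweets per day.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-02", "2018-01-02", "2018-01-03",
                                 "2018-01-04", "2018-01-04", "2018-01-05"]),
    "text": ["a", "b", "c", "d", "e", "f"],
    "PriceUp": [True, True, False, True, True, False],
})

# Distinct days in chronological order; the last ones become the test days.
days = toy["timestamp"].drop_duplicates().sort_values()
n_test_days = 1                      # hypothetical choice
test_days = days.iloc[-n_test_days:]

is_test = toy["timestamp"].isin(test_days)
toy_train, toy_test = toy[~is_test], toy[is_test]

# No day may appear in both sets.
assert set(toy_train["timestamp"]).isdisjoint(set(toy_test["timestamp"]))
print(len(toy_train), len(toy_test))
```

With this kind of split every test tweet comes from a day the model has never seen, which matches the "use data of new days" suggestion in the discussion.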