# Overview

This notebook uses the 'TextFeaturesGenerator' class (from text_features) to convert textual data into quantitative data.

For now, it creates a bag-of-words representation and a tf-idf representation, plus an SVD/PCA reduction of these matrices. A Word2Vec representation is planned.

The TextFeaturesGenerator class will be updated on an ongoing basis, and the usage shown here will be updated to match.

In [1]:
from text_features import TextFeaturesGenerator
from project_helper import TweetData
import pandas as pd
import numpy as np
from datetime import timedelta  
import datetime

Reusing the TweetData class to get cleaned tweets.

In [2]:
tweet_data = TweetData()
tweet_data.clean_tweets.head()

timestamp,tweets,timestamp,after4_date
2019-11-17 19:57:12-06:00,tell jennifer williams whoever that is to read...,2019-11-17 19:57:12-06:00,2019-11-18
2019-11-17 19:56:02-06:00,,2019-11-17 19:56:02-06:00,2019-11-18
2019-11-17 19:49:47-06:00,paul krugman of has been wrong about me from t...,2019-11-17 19:49:47-06:00,2019-11-18
2019-11-17 19:47:32-06:00,schiff is a corrupt politician,2019-11-17 19:47:32-06:00,2019-11-18
2019-11-17 19:30:09-06:00,blew the nasty amp obnoxious chris wallace wil...,2019-11-17 19:30:09-06:00,2019-11-18


# Daily Tweets

This does the following two things:

1) Change the date of tweets posted at or after 3 PM Chicago time (4 PM Eastern, when trading closes) to the following day
2) Concatenate all tweets in a given day to one large document
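
The two steps above can be sketched on toy data as follows. This is a minimal illustration, not the TweetData implementation: the sample tweets, the `date` column name, and the use of the `America/Chicago` time zone are assumptions made for the example.

```python
import pandas as pd

# Toy tweets with Chicago-local timestamps (hypothetical data).
tweets = pd.DataFrame(
    {"tweets": ["morning tweet", "late tweet", "next morning tweet"]},
    index=pd.to_datetime(
        ["2019-11-15 09:00", "2019-11-15 15:30", "2019-11-16 08:00"]
    ).tz_localize("America/Chicago"),
)

# Step 1: tweets at or after 15:00 local time roll over to the following day.
after_close = tweets.index.hour >= 15
day = tweets.index.normalize().tz_localize(None)
tweets["date"] = day + pd.to_timedelta(after_close.astype(int), unit="D")

# Step 2: join all tweets that share a date into one document.
daily = tweets.groupby("date")["tweets"].apply(" ".join)
print(daily)
```

The 15:30 tweet ends up in the same document as the tweet from the next morning, which is the behaviour the list describes.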

In [3]:
tweet_data.daily_tweets.head()

date,tweets
2009-05-05,donald trump will be appearing on the view tom...
2009-05-08,donald trump reads top ten financial tips on l...
2009-05-09,new blog post celebrity apprentice finale and ...
2009-05-12,my persona will never be that of a wallflower ...
2009-05-13,miss usa tara conner will not be fired ive alw...


# Feature Generator

Creating a 'TextFeaturesGenerator' instance which takes the tweets as an argument

In [4]:
feature_generator = TextFeaturesGenerator(tweet_data.clean_tweets.tweets)

'get_bow_matrix' creates the bag-of-words matrix

In [5]:
bow_mat = feature_generator.get_bow_matrix()

In [6]:
bow_mat.shape

(28813, 17035)

The matrix has 28,813 rows (one per tweet) and 17,035 columns (one per unique word in the vocabulary).

In [7]:
bow_mat[:10,:10].todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

As you can see, most of the values are zero, which is why the matrix is stored as a sparse matrix.
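
To make the sparsity point concrete, here is a small sketch that measures the fraction of non-zero entries. It assumes the matrix is a SciPy sparse matrix (consistent with the `.todense()` call and the `.npz` files used later); the toy matrix stands in for `bow_mat`.

```python
import numpy as np
from scipy import sparse

# Toy count matrix: 4 documents, 6 vocabulary words, mostly zeros.
dense = np.zeros((4, 6), dtype=np.int64)
dense[0, 1] = 2
dense[2, 4] = 1
dense[3, 0] = 1

mat = sparse.csr_matrix(dense)

# nnz is the number of stored non-zero entries.
density = mat.nnz / (mat.shape[0] * mat.shape[1])
print(f"non-zero entries: {mat.nnz}, density: {density:.3f}")
```

The same `nnz` calculation can be applied to the real matrix to see how little of it is actually populated.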

Bag-of-words is simply a count of the words in each tweet. A better representation is 'tf-idf', which down-weights words that appear in many tweets. The 'get_tfidf_matrix' method creates the tf-idf matrix.

In [8]:
tfidf_mat = feature_generator.get_tfidf_matrix()
tfidf_mat.shape

(28813, 17035)
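
For intuition, the difference between the two representations can be worked out by hand on three tiny documents. This sketch uses the plain formula count × log(N / df); it is only an illustration, and the weighting actually used inside TextFeaturesGenerator is not shown in this notebook and may differ (libraries often use smoothed and normalized variants).

```python
import math
from collections import Counter

docs = ["great great deal", "bad deal", "great news"]
tokens = [d.split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})

# Bag-of-words: raw counts per document, one column per vocabulary word.
bow = [[Counter(doc)[w] for w in vocab] for doc in tokens]

# Document frequency: number of documents that contain each word.
df = [sum(1 for doc in tokens if w in doc) for w in vocab]

# Plain tf-idf: count * log(N / df).
n_docs = len(docs)
tfidf = [
    [count * math.log(n_docs / df[j]) for j, count in enumerate(row)]
    for row in bow
]

print(vocab)
print(bow)
print([[round(v, 3) for v in row] for row in tfidf])
```

In the third document, "news" (which appears in one document) receives a higher weight than "great" (which appears in two), even though both occur once.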

The matrices can be saved using the 'save_matrices' function. If you specify a 'folder', it will be created and both matrices will be stored in it; otherwise they are stored in the working directory.

In [9]:
feature_generator.save_matrices()

The two matrices will be saved with the names "bow_mat.npz" and "tfidf_mat.npz"

You can also specify a folder and a suffix to the file names.

In [10]:
feature_generator.save_matrices(folder="matrices",suffix="_v2")
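
A rough sketch of what saving with a folder and a suffix might look like is given below. The helper `save_sparse` is hypothetical and is not the actual `save_matrices` code; it only illustrates the folder creation and file naming described above, using `scipy.sparse.save_npz`.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_sparse(mat, name, folder=None, suffix=""):
    """Hypothetical helper: write `mat` to <folder>/<name><suffix>.npz."""
    if folder is not None:
        os.makedirs(folder, exist_ok=True)  # create the folder if missing
    path = os.path.join(folder or "", f"{name}{suffix}.npz")
    sparse.save_npz(path, mat)
    return path


mat = sparse.csr_matrix(np.eye(3, dtype=np.int64))
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "matrices")
    path = save_sparse(mat, "bow_mat", folder=target, suffix="_v2")
    loaded = sparse.load_npz(path)
    same = (loaded != mat).nnz == 0  # no differing entries after the round trip
    file_name = os.path.basename(path)

print(file_name, same)
```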

The files can be loaded using the following commands:

In [11]:
from scipy import sparse
bow_loaded = sparse.load_npz("bow_mat.npz")
tfidf_loaded = sparse.load_npz("tfidf_mat.npz")
print(bow_loaded.shape)
print(tfidf_loaded.shape)

(28813, 17035)
(28813, 17035)


## PCA (through SVD) of the matrices

You can get the SVD of the bow and tfidf matrices as well.

In [12]:
svd_bow_mat = feature_generator.get_svd_bow_mat()

In [13]:
svd_bow_mat.shape

(28813, 2)

By default, it returns two components. You can change that using the n_components argument.

In [14]:
svd_bow_mat = feature_generator.get_svd_bow_mat(n_components=100)

In [15]:
svd_bow_mat.shape

(28813, 100)
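
The reduction itself can be sketched with NumPy on a small dense matrix. This assumes a truncated SVD applied to the uncentered matrix; strict PCA would center the columns first, and the notebook does not show which of the two the class performs. The toy matrix `X` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(8, 5)).astype(float)  # toy document-term counts

# Thin SVD: X = U @ diag(S) @ Vt, with singular values sorted descending.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

n_components = 2
# Keep the leading components; each row is a document in the reduced space.
X_reduced = U[:, :n_components] * S[:n_components]

# Equivalent view: project the rows of X onto the top right singular vectors.
X_projected = X @ Vt[:n_components].T

print(X_reduced.shape)
```

The number of rows is unchanged and the number of columns equals `n_components`, matching the shapes shown above.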

You can get the SVD of the tf-idf as well.

In [16]:
svd_tfidf_mat = feature_generator.get_svd_tfidf_mat(n_components=100)

In [17]:
svd_tfidf_mat.shape

(28813, 100)

These matrices can be saved as well.

In [18]:
feature_generator.save_matrices()

You can load them back using np.load

In [19]:
svd_loaded_mat = np.load('svd_tfidf_mat.npy')

In [20]:
svd_loaded_mat.shape

(28813, 100)

# Aggregate SVD per day

In [21]:
svd_df = pd.DataFrame(svd_loaded_mat)

In [22]:
svd_df['timestamp'] = tweet_data.clean_tweets.index
svd_df['date'] = svd_df.timestamp.dt.date

In [23]:
svd_df.head()

index,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,timestamp,date
0,3.827242,1.058184,-0.753201,0.539504,0.672026,1.173379,-0.282925,0.09573,0.447752,-0.022805,...,-0.098207,0.069798,-0.232117,-0.009961,0.229828,0.013722,0.137623,0.391273,2019-11-17 19:57:12-06:00,2019-11-17
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2019-11-17 19:56:02-06:00,2019-11-17
2,3.06019,0.759136,0.960683,-0.707494,1.130351,1.936883,-0.004514,-0.147008,-0.626272,-0.132273,...,-0.313087,-0.077174,0.074933,0.180533,-0.033349,0.239652,0.08065,-0.146537,2019-11-17 19:49:47-06:00,2019-11-17
3,0.200777,-0.107046,0.113282,0.87704,-0.034224,0.142449,-0.0589,0.020884,-0.023058,-0.125923,...,0.008438,0.006316,0.007866,-0.016235,-0.028435,0.028861,0.034558,0.014253,2019-11-17 19:47:32-06:00,2019-11-17
4,2.915336,0.145921,0.789791,-0.586309,1.237927,-0.773927,-0.802347,-0.924382,-0.588655,-0.114364,...,0.025276,-0.276758,0.008522,0.255128,0.254288,0.186863,0.014944,0.031535,2019-11-17 19:30:09-06:00,2019-11-17


In [24]:
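# Drop the timestamp column so only the numeric SVD components are averaged per day
svd_df_daily = svd_df.drop(columns='timestamp').groupby('date').mean()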

In [25]:
svd_df_daily

date,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
2009-05-04,1.914085,-0.744047,-0.003781,-0.297262,0.104558,-0.762712,0.079807,-0.860134,-0.830334,-0.303307,...,-0.176265,-0.086100,-0.242026,0.129208,-0.092467,0.132687,-0.161863,0.043532,-0.050128,-0.006094
2009-05-05,1.728747,-0.735490,-0.032372,-0.510345,-0.136988,-0.583485,-0.960771,-0.846735,-0.502048,-0.394207,...,-0.090684,0.022433,0.068426,0.093914,0.060350,-0.042107,-0.200073,0.030385,0.038850,-0.038555
2009-05-08,0.656670,0.017658,0.343568,-0.132163,-0.182062,-0.136581,-0.153954,-0.149953,-0.287268,-0.003448,...,-0.081094,-0.014594,-0.072290,0.057428,-0.013680,0.023649,-0.012842,-0.003922,-0.106709,0.005820
2009-05-12,0.759489,-0.616653,-0.256694,-0.132355,0.892657,-0.322315,-0.301618,-0.619333,0.309500,-0.410880,...,0.027074,-0.104524,0.164644,0.037623,0.076811,0.158482,-0.176192,-0.124771,0.014102,-0.031679
2009-05-13,0.549987,-0.714711,-0.650531,-0.033119,-0.053220,0.134759,0.064928,-0.091967,-0.262836,-0.087195,...,-0.036659,0.093412,0.030083,0.022980,0.088272,-0.080342,-0.147136,0.130038,-0.022252,-0.069660
2009-05-14,0.727417,-0.468012,0.820975,-0.391198,0.348320,0.668746,-0.038269,0.298439,-0.459662,0.435132,...,-0.087160,0.019333,-0.003791,0.101827,-0.015975,-0.000250,-0.021610,0.071288,-0.005891,0.006866
2009-05-15,1.016232,0.217992,0.566398,-0.296191,-0.626761,-0.083145,0.005869,-0.009849,-0.186846,-0.188720,...,0.040316,0.003404,0.091805,-0.013832,0.014922,-0.037120,-0.022741,0.066674,0.010254,-0.039019
2009-05-16,0.911011,0.485667,-0.143426,0.200778,-0.203702,-0.216453,-0.121470,0.065230,0.061738,-0.114868,...,0.019313,-0.007365,-0.134006,-0.019780,-0.283361,0.056332,0.122664,-0.101825,-0.027381,-0.201861
2009-05-17,0.748257,-0.077040,0.151531,-0.336610,1.387736,1.182505,-0.317207,-0.203950,0.058316,-0.495735,...,-0.060321,-0.074810,0.520986,0.644740,-0.343636,0.377943,-0.095240,-0.151847,0.533804,-0.594654
2009-05-18,0.693850,-0.194201,0.225672,-0.171285,1.127994,0.249324,0.814233,-0.075671,0.363008,0.173169,...,0.065754,0.083678,-0.020261,0.092459,0.002077,0.064020,-0.038542,0.156425,-0.044433,0.015944


In [26]:
svd_df_daily.to_csv('svd_df_daily.csv')

# 4 PM

Tweets posted at or after 3 PM Chicago time (4 PM Eastern) are assigned to the following day.

In [27]:
tweet_data.clean_tweets['timestamp'] = tweet_data.clean_tweets.index
after_4_tweets = tweet_data.clean_tweets.timestamp.dt.hour >= 15
tweet_data.clean_tweets['after4_date'] = tweet_data.clean_tweets.timestamp.dt.date
tweet_data.clean_tweets.loc[after_4_tweets,'after4_date'] = tweet_data.clean_tweets.timestamp[after_4_tweets].dt.date + timedelta(days=1)

In [28]:
tweet_data.clean_tweets.head(100)

timestamp,tweets,timestamp,after4_date
2019-11-17 19:57:12-06:00,tell jennifer williams whoever that is to read...,2019-11-17 19:57:12-06:00,2019-11-18
2019-11-17 19:56:02-06:00,,2019-11-17 19:56:02-06:00,2019-11-18
2019-11-17 19:49:47-06:00,paul krugman of has been wrong about me from t...,2019-11-17 19:49:47-06:00,2019-11-18
2019-11-17 19:47:32-06:00,schiff is a corrupt politician,2019-11-17 19:47:32-06:00,2019-11-18
2019-11-17 19:30:09-06:00,blew the nasty amp obnoxious chris wallace wil...,2019-11-17 19:30:09-06:00,2019-11-18
2019-11-17 19:26:04-06:00,blew the nasty amp obnoxious chris wallace wil...,2019-11-17 19:26:04-06:00,2019-11-18
2019-11-17 18:34:46-06:00,thanks eric,2019-11-17 18:34:46-06:00,2019-11-18
2019-11-17 18:10:59-06:00,,2019-11-17 18:10:59-06:00,2019-11-18
2019-11-17 18:10:19-06:00,,2019-11-17 18:10:19-06:00,2019-11-18
2019-11-17 17:54:12-06:00,,2019-11-17 17:54:12-06:00,2019-11-18


In [29]:
combined_daily_tweets = tweet_data.clean_tweets.groupby('after4_date')['tweets'].apply(lambda x: ' '.join(x))
combined_daily_tweets.head()

after4_date
2009-05-05    donald trump will be appearing on the view tom...
2009-05-08    donald trump reads top ten financial tips on l...
2009-05-09    new blog post celebrity apprentice finale and ...
2009-05-12    my persona will never be that of a wallflower ...
2009-05-13    miss usa tara conner will not be fired ive alw...
Name: tweets, dtype: object

In [30]:
combined_daily_tweets.to_csv('combined_daily_tweets.csv')



# Check if the concatenation is correct

In [31]:
tweet_data.clean_tweets.tweets[tweet_data.clean_tweets.after4_date==pd.to_datetime("2019-10-03")]

Series([], Name: tweets, dtype: object)

In [32]:
combined_daily_tweets[combined_daily_tweets.index.values==pd.to_datetime("2019-10-03")]

Series([], Name: tweets, dtype: object)
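
Both lookups return an empty Series, so the check above is inconclusive. One possible explanation, which the source does not confirm, is a type mismatch: 'after4_date' is built from `.dt.date` and so holds plain `datetime.date` objects, while `pd.to_datetime` returns a `Timestamp`. Another is that the loaded data simply has no tweets for that date. A sketch of a lookup that avoids the type question, on made-up data:

```python
import datetime

import pandas as pd

# Toy stand-ins for the after4_date column and the tweets column.
dates = pd.Series(
    [datetime.date(2019, 10, 2), datetime.date(2019, 10, 3), datetime.date(2019, 10, 3)]
)
tweets = pd.Series(["a", "b", "c"])

# Convert the lookup key to a plain date so both sides have the same type.
target = pd.to_datetime("2019-10-03").date()
matches = tweets[dates == target]
print(matches)
```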

# Create SVD matrix of the combined 4 PM tweets

In [33]:
combined_generator = TextFeaturesGenerator(combined_daily_tweets)

In [34]:
n_components = 2
combined_svd_df = pd.DataFrame(combined_generator.get_svd_tfidf_mat(n_components=n_components))

In [35]:
combined_svd_df['after4_date'] = combined_daily_tweets.index.values

In [36]:
combined_svd_df.head()

index,0,1,after4_date
0,0.229959,0.196038,2009-05-05
1,0.052085,0.062532,2009-05-08
2,0.079564,0.035637,2009-05-09
3,0.101352,0.043731,2009-05-12
4,0.068212,0.061998,2009-05-13


In [37]:
combined_svd_df.to_csv('combined_svd_df.csv')

# Scoring Tweets

In [3]:
tweet_data = TweetData()
tweet_data.clean_tweets.head()

timestamp,tweets,timestamp,after4_date
2019-10-02 23:41:51-05:00,democrats want to steal the election,2019-10-02 23:41:51-05:00,2019-10-03
2019-10-02 23:27:52-05:00,mississippi there is a very important election...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 23:27:52-05:00,he loves our military and supports our vets de...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 21:06:36-05:00,look at this photograph,2019-10-02 21:06:36-05:00,2019-10-03
2019-10-02 19:51:56-05:00,schiff house intel chairman got early account ...,2019-10-02 19:51:56-05:00,2019-10-03


In [4]:
tweet_data.daily_tweets.head()

date,tweets
2009-05-05,donald trump will be appearing on the view tom...
2009-05-08,donald trump reads top ten financial tips on l...
2009-05-09,new blog post celebrity apprentice finale and ...
2009-05-12,my persona will never be that of a wallflower ...
2009-05-13,miss usa tara conner will not be fired ive alw...


Split into train and test sets at a certain date (in this example, 2018-01-01).

In [18]:
train_tweets = tweet_data.daily_tweets[tweet_data.daily_tweets.index<=pd.to_datetime("2018-01-01")]
score_tweets = tweet_data.daily_tweets[tweet_data.daily_tweets.index>pd.to_datetime("2018-01-01")]

Create the feature generator class

In [19]:
feature_generator_with_scores = TextFeaturesGenerator(train_tweets.tweets,score_tweets.tweets)

In [20]:
train_svd, test_svd = feature_generator_with_scores.get_svd_tfidf_mat(n_components=10)

In [21]:
print(train_svd.shape)
print(test_svd.shape)

(2395, 10)
(636, 10)
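
The idea of scoring new data can be sketched as: learn the components from the training rows only, then project both sets onto them. This assumes the generator fits on its first argument and only transforms the second, which the notebook does not show; the random matrices are stand-ins for the tf-idf matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.random((6, 4))  # stand-in for the training tf-idf matrix
test = rng.random((3, 4))   # stand-in for the scoring tf-idf matrix

n_components = 2
# Learn the component directions from the training rows only.
_, _, Vt = np.linalg.svd(train, full_matrices=False)
components = Vt[:n_components]

# Project both sets onto the same components.
train_svd = train @ components.T
test_svd = test @ components.T

print(train_svd.shape, test_svd.shape)
```

Fitting on the training period alone keeps information from the scoring period out of the learned components.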


Convert to dataframe and add date

In [22]:
train_svd_df = pd.DataFrame(train_svd)
train_svd_df['date'] = train_tweets.index
train_svd_df.head()

index,0,1,2,3,4,5,6,7,8,9,date
0,0.255393,0.094646,0.166428,0.268642,0.086863,0.041839,0.005266,-0.011504,0.035837,0.034239,2009-05-05
1,0.060725,0.020172,0.073589,0.057457,0.092638,0.029433,0.026468,0.003534,0.04732,0.005016,2009-05-08
2,0.08115,0.018445,0.059621,0.137287,-0.036615,0.029662,-0.144165,-0.02869,0.015878,-0.031538,2009-05-09
3,0.1083,0.008986,0.051234,0.012044,0.094238,0.004119,0.044145,0.03785,0.015892,0.036891,2009-05-12
4,0.076058,0.024465,0.065035,0.045956,0.075716,0.012833,0.009636,0.029604,0.005604,0.03297,2009-05-13


In [23]:
test_svd_df = pd.DataFrame(test_svd)
test_svd_df['date'] = score_tweets.index
test_svd_df.head()

index,0,1,2,3,4,5,6,7,8,9,date
0,0.47717,-0.055577,-0.089174,-0.017984,-0.019911,-0.018469,-0.111293,0.081631,-0.007596,-0.042455,2018-01-02
1,0.481051,-0.085808,-0.085502,-0.002189,-0.024158,-0.01511,0.015735,0.073181,-0.022097,-0.033393,2018-01-03
2,0.397135,-0.071496,-0.070673,-0.022737,-0.01664,-0.041272,-0.041109,0.038034,0.002848,-0.015921,2018-01-04
3,0.442611,-0.027943,-0.130676,-0.003118,0.000328,-0.038041,-0.064087,0.045842,0.063076,-0.067497,2018-01-05
4,0.3656,-0.071949,-0.074393,0.006848,-0.016845,0.085461,-0.022985,0.045839,0.038582,-0.042177,2018-01-06


In [None]:
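import matplotlib.pyplot as plt  # plt was not imported earlier in this notebook
plt.plot(final_daily_tweets.groupby('after4_date').max().single_ret)  # final_daily_tweets is not defined in this notebook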