# Overview

This notebook uses the 'TextFeaturesGenerator' class (from text_features) to convert textual data into qunatitaive data. 

For now, it creates a bag-of-words representation and a tf-idf representation. We will also add SVD/PCA of these matrices and a Word2Vec representation in the next few days.

Will update the TextFeaturesGenerator class on an ongoing basis and update the usage here.

In [2]:
from text_features import TextFeaturesGenerator
from project_helper import TweetData
import pandas as pd
import numpy as np
from datetime import timedelta  
import datetime

Reusing the TweetData class to get cleaned tweets.

In [2]:
tweet_data = TweetData()
tweet_data.clean_tweets.head()

Unnamed: 0_level_0,tweets,timestamp,after4_date
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-10-02 23:41:51-05:00,democrats want to steal the election,2019-10-02 23:41:51-05:00,2019-10-03
2019-10-02 23:27:52-05:00,mississippi there is a very important election...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 23:27:52-05:00,he loves our military and supports our vets de...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 21:06:36-05:00,look at this photograph,2019-10-02 21:06:36-05:00,2019-10-03
2019-10-02 19:51:56-05:00,schiff house intel chairman got early account ...,2019-10-02 19:51:56-05:00,2019-10-03


# Daily Tweets

This does the following two things:

1) Change the date of the tweets after 3 PM Chicago time to the following day (as trading closes then)
2) Concatenate all tweets in a given day to one large document

In [3]:
tweet_data.daily_tweets.head()

Unnamed: 0_level_0,tweets
date,Unnamed: 1_level_1
2009-05-05,donald trump will be appearing on the view tom...
2009-05-08,donald trump reads top ten financial tips on l...
2009-05-09,new blog post celebrity apprentice finale and ...
2009-05-12,my persona will never be that of a wallflower ...
2009-05-13,miss usa tara conner will not be fired ive alw...


# Feature Generator

Creating a 'TextFeaturesGenerator' instance which takes the tweets as an argument

In [4]:
feature_generator = TextFeaturesGenerator(tweet_data.clean_tweets.tweets)

'get_bow_matrix' creates the bag-of-words matrix

In [5]:
bow_mat = feature_generator.get_bow_matrix()

In [6]:
bow_mat.shape

(27960, 16781)

The shape of this matrix is 27.96K rows (same number as the tweets) and the columns are 16,781, which is equal to the unique number of words in the vocabulary.

In [7]:
bow_mat[:10,:10].todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

As you can see, most of the values are zero which is why it is stored as a 'sparse-matrix'

Bag-of-words is simply a count of words in the tweet. A better representation is 'tf-idf'. The 'get_tfidf_matrix' creates

In [8]:
tfidf_mat = feature_generator.get_tfidf_matrix()
tfidf_mat.shape

(27960, 16781)

The matrices can be saved using the matrices function. You can either specify a 'folder' which will be created and both matrices stored in it, else will store in the working directory.

In [9]:
feature_generator.save_matrices()

The two matrices will be saved with the names "bow_mat.npz" and "tfidf_mat.npz"

You can also specify a folder and a suffix to the file names.

In [10]:
feature_generator.save_matrices(folder="matrices",suffix="_v2")

The files can be loaded using the following commands:

In [11]:
from scipy import sparse
bow_loaded = sparse.load_npz("bow_mat.npz")
tfidf_loaded = sparse.load_npz("tfidf_mat.npz")
print(bow_loaded.shape)
print(tfidf_loaded.shape)

(27960, 16781)
(27960, 16781)


## PCA (through SVD) of the matrices

You can get the SVD of the bow and tfidf matrices as well.

In [12]:
svd_bow_mat = feature_generator.get_svd_bow_mat()

In [13]:
svd_bow_mat.shape

(27960, 2)

By default, it gives back two components. You can changet that using the n_components argument.

In [14]:
svd_bow_mat = feature_generator.get_svd_bow_mat(n_components=100)

In [15]:
svd_bow_mat.shape

(27960, 100)

You can get the SVD of the tf-idf as well.

In [16]:
svd_tfidf_mat = feature_generator.get_svd_bow_mat(n_components=100)

In [17]:
svd_tfidf_mat.shape

(27960, 100)

These matrices can be saved as well.

In [18]:
feature_generator.save_matrices()

You can load them back using np.load

In [19]:
svd_loaded_mat = np.load('svd_tfidf_mat.npy')

In [20]:
svd_loaded_mat.shape

(27960, 100)

# Aggregagte SVD per day 

In [21]:
svd_df = pd.DataFrame(svd_loaded_mat)

In [22]:
svd_df['timestamp'] = tweet_data.clean_tweets.index
svd_df['date'] = svd_df.timestamp.dt.date

In [23]:
svd_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,timestamp,date
0,1.067114,0.077396,-0.803793,-0.205652,-0.50518,0.004843,0.039186,-0.049387,-0.095551,-0.017207,...,-0.097385,-0.040557,-0.074507,-0.15403,-0.093137,-0.052937,0.143799,-0.088854,2019-10-02 23:41:51-05:00,2019-10-02
1,2.398382,-2.635966,1.53775,0.957777,0.272032,-0.429117,-1.631507,0.552008,-1.720561,1.136831,...,0.258984,-0.070872,0.086509,-0.116275,-0.196379,0.337119,-0.056392,0.015231,2019-10-02 23:27:52-05:00,2019-10-02
2,2.039926,-2.169685,2.635236,0.428123,-0.238948,-0.022691,-0.655132,-0.46228,-1.005659,1.190836,...,-0.414179,0.352808,-0.303401,-0.299134,0.059244,0.560983,0.063551,-0.360784,2019-10-02 23:27:52-05:00,2019-10-02
3,0.116174,-0.057314,0.003153,0.019891,0.105173,-0.114806,-0.126431,-0.100204,-0.022856,-0.030552,...,-0.004304,-0.024867,-0.074782,-0.03856,0.000536,0.116081,0.002978,0.120855,2019-10-02 21:06:36-05:00,2019-10-02
4,0.473519,-0.025296,0.216236,0.710462,0.506245,0.856471,-0.055029,-0.01595,-0.015656,-0.262654,...,-0.043025,-0.079855,0.032954,-0.131985,0.04881,0.047779,-0.098556,-0.062281,2019-10-02 19:51:56-05:00,2019-10-02


In [24]:
svd_df_daily = svd_df.groupby('date').agg(np.mean)

In [25]:
svd_df_daily

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2009-05-04,1.917882,-0.737776,-0.008240,-0.307825,0.125388,-0.765858,0.104030,-0.852788,-0.843460,-0.306545,...,0.256583,0.000425,-0.069494,0.123169,-0.048137,0.105338,-0.111068,-0.012371,-0.147085,0.008953
2009-05-05,1.734408,-0.729916,-0.037521,-0.528596,-0.111331,-0.610496,-0.951095,-0.822610,-0.527628,-0.409864,...,-0.002147,-0.020303,-0.172469,0.097354,-0.069305,0.014423,-0.017498,-0.009654,-0.056903,0.008024
2009-05-08,0.656465,0.025012,0.336565,-0.142836,-0.174618,-0.143168,-0.148943,-0.144386,-0.311437,-0.002341,...,0.065810,0.043735,-0.001123,0.063384,-0.043466,0.046604,-0.050328,-0.021959,-0.066037,0.044944
2009-05-12,0.764610,-0.615122,-0.252076,-0.132610,0.900807,-0.321875,-0.307898,-0.607115,0.308673,-0.375667,...,-0.012838,-0.133662,-0.115200,0.160354,-0.034786,0.205643,0.098947,0.132937,-0.059507,0.090006
2009-05-13,0.552576,-0.718454,-0.642428,-0.030314,-0.052141,0.129544,0.069837,-0.090266,-0.274465,-0.070405,...,0.026995,-0.030393,-0.121937,0.121079,0.077812,-0.030831,-0.011993,-0.094668,-0.088516,-0.087727
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-09-28,1.766456,-0.018465,-0.006562,-0.071148,-0.085782,-0.067368,0.134505,0.280319,-0.044777,0.094144,...,-0.009233,0.009786,0.080701,-0.028289,0.015793,-0.058447,0.042836,0.063615,-0.049853,0.026295
2019-09-29,1.818351,0.067952,-0.139502,-0.020150,-0.080780,-0.107743,0.294175,-0.018606,0.171831,0.062312,...,-0.029585,0.041702,0.003403,-0.041324,-0.018078,-0.042881,0.034329,-0.051666,-0.012458,-0.036183
2019-09-30,2.838771,0.208005,0.114255,0.087938,-0.168927,-0.027974,0.171122,0.041970,0.033269,-0.042353,...,-0.022258,0.032971,0.008412,-0.039595,0.001071,0.044836,-0.008554,0.002438,-0.020873,0.069324
2019-10-01,2.743603,-0.035050,-0.160602,0.004014,-0.083679,0.294312,-0.183208,0.268333,0.017663,0.096170,...,-0.036987,0.052118,0.026958,0.015224,-0.034136,0.038097,-0.089549,0.044991,-0.020062,-0.024954


In [26]:
svd_df_daily.to_csv('svd_df_daily.csv')

# 4 PM

In [27]:
tweet_data.clean_tweets['timestamp'] = tweet_data.clean_tweets.index
after_4_tweets = tweet_data.clean_tweets.timestamp.dt.hour >= 15
tweet_data.clean_tweets['after4_date'] = tweet_data.clean_tweets.timestamp.dt.date
tweet_data.clean_tweets.loc[after_4_tweets,'after4_date'] = tweet_data.clean_tweets.timestamp[after_4_tweets].dt.date + timedelta(days=1)

In [28]:
tweet_data.clean_tweets.head(100)

Unnamed: 0_level_0,tweets,timestamp,after4_date
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-10-02 23:41:51-05:00,democrats want to steal the election,2019-10-02 23:41:51-05:00,2019-10-03
2019-10-02 23:27:52-05:00,mississippi there is a very important election...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 23:27:52-05:00,he loves our military and supports our vets de...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 21:06:36-05:00,look at this photograph,2019-10-02 21:06:36-05:00,2019-10-03
2019-10-02 19:51:56-05:00,schiff house intel chairman got early account ...,2019-10-02 19:51:56-05:00,2019-10-03
...,...,...,...
2019-09-28 03:55:17-05:00,thank you to general mcmaster just more fake n...,2019-09-28 03:55:17-05:00,2019-09-28
2019-09-28 02:03:00-05:00,,2019-09-28 02:03:00-05:00,2019-09-28
2019-09-27 19:41:18-05:00,i am draining the swamp,2019-09-27 19:41:18-05:00,2019-09-28
2019-09-27 15:24:05-05:00,if that perfect phone call with the president ...,2019-09-27 15:24:05-05:00,2019-09-28


In [29]:
combined_daily_tweets = tweet_data.clean_tweets.groupby('after4_date')['tweets'].apply(lambda x: ' '.join(x))
combined_daily_tweets.head()

after4_date
2009-05-05    donald trump will be appearing on the view tom...
2009-05-08    donald trump reads top ten financial tips on l...
2009-05-09    new blog post celebrity apprentice finale and ...
2009-05-12    my persona will never be that of a wallflower ...
2009-05-13    miss usa tara conner will not be fired ive alw...
Name: tweets, dtype: object

In [30]:
combined_daily_tweets.to_csv('combined_daily_tweets.csv')

  """Entry point for launching an IPython kernel.


# Check if the concatenation is correct

In [31]:
tweet_data.clean_tweets.tweets[tweet_data.clean_tweets.after4_date==pd.to_datetime("2019-10-03")]

timestamp
2019-10-02 23:41:51-05:00                democrats want to steal the election 
2019-10-02 23:27:52-05:00    mississippi there is a very important election...
2019-10-02 23:27:52-05:00    he loves our military and supports our vets de...
2019-10-02 21:06:36-05:00                             look at this photograph 
2019-10-02 19:51:56-05:00    schiff house intel chairman got early account ...
2019-10-02 15:48:47-05:00    the do nothing democrats should be focused on ...
2019-10-02 15:39:07-05:00    adam schiff should only be so lucky to have th...
2019-10-02 15:31:53-05:00    democrats are trying to undo the election rega...
2019-10-02 15:31:03-05:00    nancy pelosi just said that she is interested ...
2019-10-02 15:19:09-05:00    all of this impeachment nonsense which is goin...
2019-10-02 15:02:11-05:00    now the press is trying to sell the fact that ...
Name: tweets, dtype: object

In [32]:
combined_daily_tweets[combined_daily_tweets.index.values==pd.to_datetime("2019-10-03")]

after4_date
2019-10-03    democrats want to steal the election  mississi...
Name: tweets, dtype: object

# Create SVD matrix of the combiened 4 PM tweets

In [33]:
combined_generator = TextFeaturesGenerator(combined_daily_tweets)

In [34]:
n_components = 2
combined_svd_df = pd.DataFrame(combined_generator.get_svd_tfidf_mat(n_components=n_components))

In [35]:
combined_svd_df['after4_date'] = combined_daily_tweets.index.values

In [36]:
combined_svd_df.head()

Unnamed: 0,0,1,after4_date
0,0.231443,0.194499,2009-05-05
1,0.052787,0.062074,2009-05-08
2,0.079677,0.035338,2009-05-09
3,0.102123,0.042774,2009-05-12
4,0.068873,0.061957,2009-05-13


In [37]:
combined_svd_df.to_csv('combined_svd_df.csv')

# Scoring Tweets

In [3]:
tweet_data = TweetData()
tweet_data.clean_tweets.head()

Unnamed: 0_level_0,tweets,timestamp,after4_date
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-10-02 23:41:51-05:00,democrats want to steal the election,2019-10-02 23:41:51-05:00,2019-10-03
2019-10-02 23:27:52-05:00,mississippi there is a very important election...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 23:27:52-05:00,he loves our military and supports our vets de...,2019-10-02 23:27:52-05:00,2019-10-03
2019-10-02 21:06:36-05:00,look at this photograph,2019-10-02 21:06:36-05:00,2019-10-03
2019-10-02 19:51:56-05:00,schiff house intel chairman got early account ...,2019-10-02 19:51:56-05:00,2019-10-03


In [4]:
tweet_data.daily_tweets.head()

Unnamed: 0_level_0,tweets
date,Unnamed: 1_level_1
2009-05-05,donald trump will be appearing on the view tom...
2009-05-08,donald trump reads top ten financial tips on l...
2009-05-09,new blog post celebrity apprentice finale and ...
2009-05-12,my persona will never be that of a wallflower ...
2009-05-13,miss usa tara conner will not be fired ive alw...


Split into train at test a certain date (in the example, 2018-01-01)

In [18]:
train_tweets = tweet_data.daily_tweets[tweet_data.daily_tweets.index<=pd.to_datetime("2018-01-01")]
score_tweets = tweet_data.daily_tweets[tweet_data.daily_tweets.index>pd.to_datetime("2018-01-01")]

Create the feature generator class

In [19]:
feature_generator_with_scores = TextFeaturesGenerator(train_tweets.tweets,score_tweets.tweets)

In [20]:
train_svd, test_svd = feature_generator_with_scores.get_svd_tfidf_mat(n_components=10)

In [21]:
print(train_svd.shape)
print(test_svd.shape)

(2395, 10)
(636, 10)


Convert to dataframe and add date

In [22]:
train_svd_df = pd.DataFrame(train_svd)
train_svd_df['date'] = train_tweets.index

train_svd_df = pd.DataFrame(train_svd)
train_svd_df['date'] = train_tweets.index
train_svd_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,date
0,0.255393,0.094646,0.166428,0.268642,0.086863,0.041839,0.005266,-0.011504,0.035837,0.034239,2009-05-05
1,0.060725,0.020172,0.073589,0.057457,0.092638,0.029433,0.026468,0.003534,0.04732,0.005016,2009-05-08
2,0.08115,0.018445,0.059621,0.137287,-0.036615,0.029662,-0.144165,-0.02869,0.015878,-0.031538,2009-05-09
3,0.1083,0.008986,0.051234,0.012044,0.094238,0.004119,0.044145,0.03785,0.015892,0.036891,2009-05-12
4,0.076058,0.024465,0.065035,0.045956,0.075716,0.012833,0.009636,0.029604,0.005604,0.03297,2009-05-13


In [23]:
test_svd_df = pd.DataFrame(test_svd)
test_svd_df['date'] = score_tweets.index
test_svd_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,date
0,0.47717,-0.055577,-0.089174,-0.017984,-0.019911,-0.018469,-0.111293,0.081631,-0.007596,-0.042455,2018-01-02
1,0.481051,-0.085808,-0.085502,-0.002189,-0.024158,-0.01511,0.015735,0.073181,-0.022097,-0.033393,2018-01-03
2,0.397135,-0.071496,-0.070673,-0.022737,-0.01664,-0.041272,-0.041109,0.038034,0.002848,-0.015921,2018-01-04
3,0.442611,-0.027943,-0.130676,-0.003118,0.000328,-0.038041,-0.064087,0.045842,0.063076,-0.067497,2018-01-05
4,0.3656,-0.071949,-0.074393,0.006848,-0.016845,0.085461,-0.022985,0.045839,0.038582,-0.042177,2018-01-06
