# Overview

This notebook uses the 'TextFeaturesGenerator' class (from text_features) to convert textual data into quantitative data.

For now, it creates a bag-of-words representation and a tf-idf representation, along with an SVD/PCA reduction of each matrix. A Word2Vec representation is planned for the next few days.

The TextFeaturesGenerator class will be updated on an ongoing basis, and the usage shown here will be updated to match.

In [1]:
from text_features import TextFeaturesGenerator
from project_helper import TweetData
import pandas as pd
import numpy as np

Reusing the TweetData class to get cleaned tweets.

In [2]:
tweet_data = TweetData()
tweet_data.clean_tweets.head()

timestamp,tweets
2019-10-02 23:41:51-05:00,democrats want to steal the election
2019-10-02 23:27:52-05:00,mississippi there is a very important election...
2019-10-02 23:27:52-05:00,he loves our military and supports our vets de...
2019-10-02 21:06:36-05:00,look at this photograph
2019-10-02 19:51:56-05:00,schiff house intel chairman got early account ...


Create a 'TextFeaturesGenerator' instance, which takes the tweets as an argument.

In [3]:
feature_generator = TextFeaturesGenerator(tweet_data.clean_tweets.tweets)

'get_bow_matrix' creates the bag-of-words matrix

In [4]:
bow_mat = feature_generator.get_bow_matrix()

In [5]:
bow_mat.shape

(27960, 17359)

The matrix has 27,960 rows, one per tweet, and 17,359 columns, one per unique word in the vocabulary.

In [6]:
bow_mat[:10,:10].todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

As you can see, most of the values are zero, which is why the matrix is stored as a sparse matrix.
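
To put a number on "mostly zero", the density of a sparse matrix can be read off its `nnz` attribute. The sketch below uses a small made-up matrix as a stand-in for `bow_mat`; it assumes only that the matrix is a SciPy sparse matrix, which the `.todense()` call above suggests.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for a bag-of-words matrix: 4 documents, 6 vocabulary terms.
dense = np.array([
    [1, 0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
])
toy_mat = sparse.csr_matrix(dense)

# nnz is the number of explicitly stored (non-zero) entries.
density = toy_mat.nnz / (toy_mat.shape[0] * toy_mat.shape[1])
print(f"stored entries: {toy_mat.nnz}, density: {density:.3f}")
```

The same expression applied to `bow_mat` would report how sparse the real matrix is.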

Bag-of-words is simply a count of the words in each tweet. A better representation is 'tf-idf', which down-weights words that appear in many tweets. The 'get_tfidf_matrix' method creates the tf-idf matrix.

In [7]:
tfidf_mat = feature_generator.get_tfidf_matrix()
tfidf_mat.shape

(27960, 17359)
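
For intuition about what the two matrices contain, here is a minimal hand-rolled sketch on three short documents (two taken from the tweets above, one made up). It uses the textbook weighting tf × log(N / df); the exact tokenization and tf-idf variant used inside 'TextFeaturesGenerator' are not shown in this notebook and may differ (for example, by smoothing or normalizing rows).

```python
import numpy as np

docs = [
    "democrats want to steal the election",
    "look at this photograph",
    "the election is very important",  # made-up toy document
]

# Vocabulary: every distinct whitespace-separated token, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: j for j, word in enumerate(vocab)}

# Bag-of-words: entry (i, j) counts how often term j occurs in document i.
bow = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc.split():
        bow[i, col[word]] += 1

# Textbook tf-idf: term count times log(N / document frequency).
doc_freq = (bow > 0).sum(axis=0)
idf = np.log(len(docs) / doc_freq)
tfidf = bow * idf

print(bow.shape)
print(tfidf[0, col["the"]], tfidf[0, col["steal"]])
```

Both matrices have the same shape, as in the notebook output. In the first document, "the" appears in two of the three documents and so receives a lower weight than "steal", which appears in only one.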

The matrices can be saved using the 'save_matrices' method. You can specify a 'folder', which will be created and both matrices stored in it; otherwise, the files are stored in the working directory.

In [8]:
feature_generator.save_matrices()

The two matrices will be saved with the names "bow_mat.npz" and "tfidf_mat.npz".

You can also specify a folder and a suffix to the file names.

In [9]:
feature_generator.save_matrices(folder="matrices",suffix="_v2")

The files can be loaded using the following commands:

In [10]:
from scipy import sparse
bow_loaded = sparse.load_npz("bow_mat.npz")
tfidf_loaded = sparse.load_npz("tfidf_mat.npz")
print(bow_loaded.shape)
print(tfidf_loaded.shape)

(27960, 17359)
(27960, 17359)
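
The save-and-load round trip can be sketched with SciPy alone. The helper `save_sparse` below is hypothetical: it only illustrates one way a `folder` and `suffix` could be combined into a file name, and is not the actual implementation of 'save_matrices'. The resulting path layout is likewise an assumption.

```python
import os
import tempfile
from scipy import sparse

def save_sparse(mat, name, folder=None, suffix=""):
    """Hypothetical helper: write `mat` to <folder>/<name><suffix>.npz."""
    if folder is not None:
        os.makedirs(folder, exist_ok=True)  # create the folder if missing
        path = os.path.join(folder, f"{name}{suffix}.npz")
    else:
        path = f"{name}{suffix}.npz"  # fall back to the working directory
    sparse.save_npz(path, mat)
    return path

toy_mat = sparse.csr_matrix([[0, 1, 0], [2, 0, 0]])

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "matrices")
    path = save_sparse(toy_mat, "bow_mat", folder=folder, suffix="_v2")
    reloaded = sparse.load_npz(path)
    # Two sparse matrices are equal when no entry differs.
    same = (reloaded != toy_mat).nnz == 0
    print(os.path.basename(path), reloaded.shape, same)
```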


## PCA (through SVD) of the matrices

You can get the SVD of the bow and tfidf matrices as well.

In [11]:
svd_bow_mat = feature_generator.get_svd_bow_mat()

In [12]:
svd_bow_mat.shape

(27960, 2)

By default, it returns two components. You can change that using the n_components argument.

In [13]:
svd_bow_mat = feature_generator.get_svd_bow_mat(n_components=100)

In [14]:
svd_bow_mat.shape

(27960, 100)

You can get the SVD of the tf-idf as well.

In [15]:
# Assumes the tf-idf counterpart of get_svd_bow_mat is named get_svd_tfidf_mat.
svd_tfidf_mat = feature_generator.get_svd_tfidf_mat(n_components=100)

In [16]:
svd_tfidf_mat.shape

(27960, 100)
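
As a rough illustration of what such a reduction computes, the sketch below runs a truncated SVD on a small random sparse matrix with `scipy.sparse.linalg.svds`. Which routine 'get_svd_bow_mat' uses internally is not shown in this notebook, so treat this as one possible approach. Note also that a truncated SVD of the raw matrix matches PCA only if the columns are mean-centered first, and centering a sparse matrix would make it dense.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Toy stand-in for a count matrix: 50 "documents" by 20 "terms".
rng = np.random.default_rng(0)
dense = rng.poisson(0.3, size=(50, 20)).astype(float)
toy_mat = sparse.csr_matrix(dense)

n_components = 2
# svds returns the k largest singular triplets; sort them explicitly so the
# strongest component comes first.
u, s, vt = svds(toy_mat, k=n_components)
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order]

# Each row of `reduced` is one document expressed in n_components dimensions.
reduced = u * s
print(reduced.shape)
```

The reduced matrix has one row per document and `n_components` columns, which mirrors the shapes shown above.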

These matrices can be saved as well.

In [17]:
feature_generator.save_matrices()

You can load them back using np.load.

In [18]:
svd_loaded_mat = np.load('svd_tfidf_mat.npy')

In [19]:
svd_loaded_mat.shape

(27960, 100)
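
The SVD output is a dense NumPy array, so it is stored in `.npy` format rather than the `.npz` format used for the sparse matrices. A minimal round trip, using a made-up array and a temporary file name chosen only for this example:

```python
import os
import tempfile
import numpy as np

toy_svd = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "svd_toy_mat.npy")
    np.save(path, toy_svd)     # writes a single dense array in .npy format
    reloaded = np.load(path)   # reads it back as an ndarray

print(reloaded.shape, np.array_equal(reloaded, toy_svd))
```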

# Aggregate SVD per day

In [20]:
svd_df = pd.DataFrame(svd_loaded_mat)

In [30]:
svd_df['timestamp'] = tweet_data.clean_tweets.index
svd_df['date'] = svd_df.timestamp.dt.date

In [31]:
svd_df.head()

,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,timestamp,date
0,1.069162,0.073472,-0.803908,-0.202682,-0.503965,0.011639,0.038897,-0.052699,-0.081935,-0.020411,...,-0.008533,0.103911,0.132162,0.06493,-0.044275,0.374643,-0.025156,-0.07736,2019-10-02 23:41:51-05:00,2019-10-02
1,2.39901,-2.64069,1.537751,0.988793,0.253774,-0.426361,-1.617954,0.713951,-1.783919,0.935433,...,0.037253,0.329752,-0.163356,-0.255373,0.15259,0.148491,-0.060753,-0.12584,2019-10-02 23:27:52-05:00,2019-10-02
2,2.036368,-2.172903,2.635054,0.428314,-0.241706,-0.040652,-0.673243,-0.356856,-1.043026,1.124688,...,-0.081089,0.268323,0.137235,0.54285,0.196629,-0.039206,-0.293467,-0.561141,2019-10-02 23:27:52-05:00,2019-10-02
3,0.116249,-0.056975,0.003445,0.020397,0.103324,-0.11921,-0.130229,-0.093681,-0.024051,-0.028973,...,-0.031563,0.010567,0.022513,0.062977,0.055526,0.042436,-0.069955,0.007539,2019-10-02 21:06:36-05:00,2019-10-02
4,0.474283,-0.025551,0.216644,0.714577,0.52169,0.847947,-0.057162,-0.040227,0.004714,-0.261764,...,-0.063327,0.126635,0.115928,-0.038622,-0.059819,0.106529,0.051002,0.144991,2019-10-02 19:51:56-05:00,2019-10-02


In [32]:
# Average each component per day; numeric_only=True leaves out the timestamp column.
svd_df_daily = svd_df.groupby('date').mean(numeric_only=True)

In [33]:
svd_df_daily

date,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
2009-05-04,1.917539,-0.741860,-0.006851,-0.307680,0.114204,-0.790058,0.078518,-0.845050,-0.826082,-0.329043,...,-0.114264,-0.068996,-0.046006,-0.175538,-0.010240,0.157938,-0.052772,-0.065883,-0.006727,0.118449
2009-05-05,1.736160,-0.734847,-0.035298,-0.524848,-0.118777,-0.626884,-0.976174,-0.785034,-0.511813,-0.418263,...,-0.066591,-0.081658,0.033595,0.037179,0.072670,0.088817,-0.207716,0.094504,-0.026172,0.103785
2009-05-08,0.657092,0.022901,0.337051,-0.141766,-0.178049,-0.143132,-0.152205,-0.137561,-0.320182,-0.021014,...,-0.093939,0.005010,0.015407,-0.009648,-0.053923,0.001675,0.063372,-0.009658,0.004553,0.038659
2009-05-12,0.763673,-0.616437,-0.249637,-0.133147,0.902717,-0.351933,-0.332287,-0.596769,0.318566,-0.321152,...,0.033095,0.065421,0.094737,-0.058205,-0.053908,0.269022,-0.069903,0.140128,0.008667,0.028387
2009-05-13,0.552410,-0.719471,-0.641173,-0.030115,-0.049287,0.127540,0.068206,-0.100774,-0.274819,-0.080950,...,-0.076818,-0.134267,0.038221,0.078122,0.074399,0.082008,-0.129604,0.094293,0.063868,0.117058
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-09-28,1.764919,-0.019437,-0.006130,-0.071961,-0.089839,-0.060733,0.146705,0.280572,-0.044637,0.081915,...,-0.061211,-0.042600,-0.097677,0.009592,-0.046995,-0.090214,-0.007210,0.063117,-0.031090,-0.039078
2019-09-29,1.811718,0.070875,-0.140102,-0.027907,-0.079576,-0.108192,0.294700,-0.024340,0.161656,0.077195,...,-0.020664,0.014886,0.014062,0.019580,0.045472,-0.029547,0.006700,0.066310,-0.079988,0.013068
2019-09-30,2.817178,0.213592,0.116190,0.067874,-0.165713,-0.020973,0.172849,0.026442,0.002949,-0.039970,...,0.007047,-0.020523,0.024337,0.007288,-0.048441,0.005252,0.051084,-0.012075,-0.052275,-0.003989
2019-10-01,2.740777,-0.038259,-0.159981,0.008356,-0.081524,0.297672,-0.174806,0.275577,0.013806,0.085301,...,-0.038705,0.050866,0.028910,-0.011732,0.001326,-0.091190,0.062107,-0.024470,0.021412,0.095215


In [34]:
svd_df_daily.to_csv('svd_df_daily.csv')
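
The daily aggregation can be checked on a toy example where the averages are easy to verify by hand. The array and timestamps below are made up for illustration; the steps mirror the cells above, followed by a CSV round trip through an in-memory buffer.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-in for the SVD matrix: 4 tweets, 2 components.
toy_svd = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]])
timestamps = pd.to_datetime([
    "2019-10-02 23:41:51-05:00",
    "2019-10-02 21:06:36-05:00",
    "2019-10-01 19:51:56-05:00",
    "2019-10-01 08:00:00-05:00",
])

toy_df = pd.DataFrame(toy_svd)
toy_df["timestamp"] = timestamps
# .dt.date keeps the local calendar date of each timezone-aware timestamp.
toy_df["date"] = toy_df["timestamp"].dt.date

# One row per day, holding the mean of each component over that day's tweets.
daily = toy_df.drop(columns="timestamp").groupby("date").mean()
print(daily)

# Round trip through CSV; column labels come back as strings.
buf = io.StringIO()
daily.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col="date", parse_dates=True)
print(restored.shape)
```

Reading the file back with `index_col="date"` and `parse_dates=True` restores the dates as the index, which is convenient when joining the daily features with other date-indexed data.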
