## Twitter Sentiment Analysis 

---

### Pre-process cleaned data for machine learning

### Part 1: Bag of Words

Cleanup simply reformatted each Tweet's text by standardizing it and reducing the feature space (removing punctuation, replacing usernames and URLs, lowercasing, tokenizing, lemmatizing, etc.). Pre-processing for machine learning is often more involved. It includes further cleanup steps such as imputing NAs, feature engineering and, perhaps most importantly, a method of representing text in numerical form, such as [Document Term Matrices](./01_Document_Term_Matrices.ipynb), since most machine-learning algorithms do not accept raw text as input. This notebook creates a simple Bag of Words document-feature matrix (DFM), that is, a document-term matrix of raw token counts.
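
As a minimal sketch of what a Bag of Words matrix looks like, the example below vectorizes three made-up sentences (they are illustrative, not drawn from the dataset). Each row is a document, each column a vocabulary term, and each cell a count. Note that `CountVectorizer`'s default tokenizer ignores single-character tokens such as `i`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = ["i love this song", "i hate this song", "love love love"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

# Terms ordered by their column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = dtm.toarray()

print(vocab)   # ['hate', 'love', 'song', 'this']
print(dense)   # rows: [0 1 1 1], [1 0 1 1], [0 3 0 0]
```

Word order is discarded: only the per-document counts survive, which is why the representation is called a "bag" of words.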

---

### Load cleaned TRAIN data


In [1]:
import os 
import re
import time

import numpy as np
import pandas as pd
import scipy.sparse as sp

# for ML preprocessing
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# custom (see script)
import loading_module as lm

start_time = time.time()
X_train, y_train = lm.load_clean_data('X_train')

### Save y as .npy

In [2]:
proc_dir = os.path.join("..","data","3_processed","sentiment140")
y_filepath = os.path.join(proc_dir, "y_train.npy")

with open(y_filepath, 'wb') as f:
    np.save(f, y_train)

### Quick EDA

In [3]:
X_train.shape, y_train.shape

((1199999, 3), (1199999, 1))

In [4]:
X_train.head() 

Unnamed: 0,username,text,lemmatized
66270,rohdesign,My favorite part of the Jimmy Fallon show: The...,my favorite part jimmy fallon show root crew t...
428045,blettany,A Chorus Line at the Aronoff with Dad...then C...,chorus line aronoff dadthen cadillac ranch dinner
1307927,bryan_wilson,@judahworldchamp - I think for next season of ...,USERNAME i think next season 30rock sometime a...
1112400,bowieblue,Moving into my new place today &lt;3,moving into my new place today 3
840793,nkeeyah,I'm a little sad today I don't know why.,im little sad today i dont know why


In [5]:
y_train.head()

Unnamed: 0,target
66270,0
428045,0
1307927,1
1112400,1
840793,1


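Search the lemmatized text for the `EMOJI` placeholder. Rows where `lemmatized` is `NaN` raise a `TypeError` and are collected separately: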

In [6]:
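error_ix = []  # positions where `lemmatized` is NaN (re.search raises TypeError)
emoji_ix = []  # positions whose lemmatized text contains the EMOJI placeholder
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'EMOJI', tweet):
            emoji_ix.append(i)
    except TypeError:
        error_ix.append(i)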

In [7]:
# only a few NaNs
X_train.iloc[error_ix, ]

Unnamed: 0,username,text,lemmatized
394523,Jmoux,are on..,
738615,Spacegirlspif13,Is... ...,
371034,WMonk,It will,
666330,LukeOgle,is in IT,
1032283,dianamra,and it was,
1224562,ChickWithAName,. . . . . and it's on!,
704528,rooroocachoo,It will,
935651,geegeeludlow,is in IT,
1472714,sangofsorrow,He is...,
986445,SquarahFaggins,to it!!,


In [8]:
# emojis
X_train.iloc[emoji_ix[:5], ]

Unnamed: 0,username,text,lemmatized
917087,Lesley_M,@DarrenRoberts Hope youÃ¢ÂÂre having a blast...,USERNAME hope you EMOJI re having blast i EMOJ...
175035,violetMars,RIP- Cpl. Charles Dustin Ã¢ÂÂDustyÃ¢Â? Parr...,rip cpl charles dustin EMOJI dusty EMOJI parri...
490152,Johnsito127,he said noo ok doesnÃÂ´t matter i think we a...,said noo ok doesn EMOJI t matter i think we go...
621566,MelllTGP,@babycarrot5 your kind words made me feel so m...,USERNAME your kind word made me feel so much b...
80064,barraisah,Ouvindo The Killers. AtÃ¯Â¿Â½ron,ouvindo killer EMOJI ron


In [9]:
len(emoji_ix) # could be better?

10856
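
The same search can be sketched without an explicit loop using pandas string methods. This is an alternative under the assumption that only a literal substring match is needed. The small `toy` frame below is hypothetical and stands in for `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train with one missing value
toy = pd.DataFrame({
    "lemmatized": [
        "hope you EMOJI re having blast",
        np.nan,
        "moving into my new place today",
        "ouvindo killer EMOJI ron",
    ]
})

# na=False treats missing values as "no match" instead of raising
has_emoji = toy["lemmatized"].str.contains("EMOJI", regex=False, na=False)

# Positional indices, comparable to emoji_ix and error_ix in the loop
emoji_pos = np.flatnonzero(has_emoji.to_numpy())
nan_pos = np.flatnonzero(toy["lemmatized"].isna().to_numpy())

print(emoji_pos)  # [0 3]
print(nan_pos)    # [1]
```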

### Impute NAs created during cleanup

We do not want to drop these rows, since the fact that they ended up empty after cleanup is possibly informative. Instead, the missing values are replaced with the placeholder string `NULL`.

In [10]:
X_train.isnull().sum()

username       0
text           0
lemmatized    10
dtype: int64

In [11]:
# Collect positions of tweets that already contain the string NULL
# (NaN rows raise TypeError and are recorded in error_ix)
error_ix = []
NULL_ix = []
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        error_ix.append(i)

# Impute with NULL as a string; a single .loc call avoids chained assignment
NA_ix = X_train.index[X_train['lemmatized'].isnull()]
X_train.loc[NA_ix, 'lemmatized'] = 'NULL'

In [12]:
# double check
X_train.isnull().sum()

username      0
text          0
lemmatized    0
dtype: int64
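
For comparison, the imputation step can also be sketched with `fillna`, which replaces every missing value in the column in one call. The `toy` frame is a hypothetical stand-in for `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one tweet was reduced to nothing during cleanup
toy = pd.DataFrame({
    "text": ["It will", "so happy today"],
    "lemmatized": [np.nan, "so happy today"],
})

# Replace missing lemmatized text with the placeholder string
toy["lemmatized"] = toy["lemmatized"].fillna("NULL")

n_missing = int(toy["lemmatized"].isna().sum())
print(n_missing)                   # 0
print(toy["lemmatized"].tolist())  # ['NULL', 'so happy today']
```

Keep in mind that `CountVectorizer` lowercases input by default, so the placeholder would appear in a vocabulary as `null`.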

In [13]:
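# Re-scan after imputation: the NULL placeholder should now be found
error_ix = []
NULL_ix = []  # reset so positions from the earlier scan are not mixed in
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        continue

X_train.iloc[NULL_ix, ]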

Unnamed: 0,username,text,lemmatized
394523,Jmoux,are on..,
738615,Spacegirlspif13,Is... ...,
371034,WMonk,It will,
666330,LukeOgle,is in IT,
1032283,dianamra,and it was,
1224562,ChickWithAName,. . . . . and it's on!,
704528,rooroocachoo,It will,
935651,geegeeludlow,is in IT,
1472714,sangofsorrow,He is...,
986445,SquarahFaggins,to it!!,


### Create BoW DFM

In [14]:
# lemmatized column (2) as array, ravel will flatten the structure
X_array = np.array(X_train.iloc[:, 2]).ravel()

In [15]:
X_array

array(['my favorite part jimmy fallon show root crew tuba player',
       'chorus line aronoff dadthen cadillac ranch dinner',
       'USERNAME i think next season 30rock sometime alternate between yes or no hat every other shot breadbutter',
       ...,
       'i wish my head wasnt so sore plaza tapatia shorebird game whitney my momma',
       'USERNAME could you post link internet version time so i can read pllleeeaaassseeeeee',
       'USERNAME try listen USERNAME song i think youll like'],
      dtype=object)

In [17]:
# create a BoW DFM
bow_vectorizer_ung = CountVectorizer(max_features=10000) 
bow_vectorizer_big = CountVectorizer(max_features=10000, ngram_range=(1,2))

X_bow_ung = bow_vectorizer_ung.fit_transform(X_array)
X_bow_big = bow_vectorizer_big.fit_transform(X_array)

In [18]:
# Total space <1199999x329492 sparse matrix of type '<class 'numpy.int64'>' with 11446957 stored elements
X_bow_ung, X_bow_big

(<1199999x10000 sparse matrix of type '<class 'numpy.int64'>'
 	with 10626030 stored elements in Compressed Sparse Row format>,
 <1199999x10000 sparse matrix of type '<class 'numpy.int64'>'
 	with 13212223 stored elements in Compressed Sparse Row format>)
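
To illustrate what `ngram_range` and `max_features` do, here is a small sketch on a made-up corpus (not from the dataset). With `ngram_range=(1, 2)` the vocabulary holds both single tokens and adjacent token pairs, and `max_features` keeps only the terms with the highest total counts across the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = ["not good at all", "good movie", "not good"]

unigram = CountVectorizer().fit(docs)
bigram = CountVectorizer(ngram_range=(1, 2)).fit(docs)
capped = CountVectorizer(ngram_range=(1, 2), max_features=3).fit(docs)

uni_terms = sorted(unigram.vocabulary_)
bi_terms = sorted(bigram.vocabulary_)
top_terms = sorted(capped.vocabulary_)

print(len(uni_terms))  # 5 unigrams
print(len(bi_terms))   # 9 = 5 unigrams + 4 bigrams
print(top_terms)       # ['good', 'not', 'not good']
```

Bigrams such as `not good` can capture negation that unigram counts miss, at the cost of a much larger candidate vocabulary before the cap is applied.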

In [19]:
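# Only 0.0028951048 % nonzero for ALL 329492 features (no max_features cap)
def calc_sparsity(X):
    """Print the percentage of nonzero cells in the sparse matrix X."""
    total_space = X.shape[0] * X.shape[1]   # number of cells
    total_store = X.getnnz()                # number of stored (nonzero) entries
    pct_nonzero = 100 * (total_store/total_space)
    print(f'Only {pct_nonzero:0.10f} % nonzero.')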

calc_sparsity(X_bow_ung)

Only 0.0885503238 % nonzero.


In [21]:
calc_sparsity(X_bow_big)

Only 0.1101019501 % nonzero.
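
The percentage reported above is the number of stored entries divided by the number of cells. A short sketch that returns the value instead of printing it, checked against a toy matrix and against the unigram figures shown earlier (the function name `pct_nonzero` is introduced here for illustration):

```python
import scipy.sparse as sp

def pct_nonzero(X):
    """Return the percentage of nonzero cells in a sparse matrix."""
    n_cells = X.shape[0] * X.shape[1]
    return 100 * X.nnz / n_cells

# Toy matrix: 2 nonzero entries out of 8 cells
toy = sp.csr_matrix([[0, 2, 0, 0], [1, 0, 0, 0]])
density = pct_nonzero(toy)
print(density)  # 25.0

# Same arithmetic using the unigram matrix shape and stored-element count
reported = 100 * 10626030 / (1199999 * 10000)
print(f"{reported:0.10f}")  # 0.0885503238
```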


### Save BoW

In [22]:
savepath = os.path.join("..","data","3_processed","sentiment140")
sp.save_npz(os.path.join(savepath, 'X_bow_ung.npz'), X_bow_ung)
sp.save_npz(os.path.join(savepath, 'X_bow_big.npz'), X_bow_big)

# print total running time
mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed Time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Elapsed Time: 7 minute(s) and 5 second(s)
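
A saved matrix can be read back with `scipy.sparse.load_npz`. The sketch below does a round trip on a toy matrix in a temporary directory (the file name `X_toy.npz` is hypothetical) and checks that nothing changed.

```python
import os
import tempfile

import scipy.sparse as sp

# Toy CSR matrix standing in for X_bow_ung
toy = sp.csr_matrix([[0, 1, 0], [2, 0, 3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_toy.npz")
    sp.save_npz(path, toy)
    restored = sp.load_npz(path)

# Element-wise inequality has no stored entries when the matrices match
same = (toy != restored).nnz == 0
print(same, restored.shape, restored.format)  # True (2, 3) csr
```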


---