## Twitter Sentiment Analysis 

---

### Pre-process cleaned data for machine learning 

Cleanup only reformatted each Tweet's text: it standardized the text and reduced the feature space (removing punctuation, replacing usernames and URLs, lowercasing, tokenizing, lemmatizing, etc.). Pre-processing for machine learning is often more involved. It includes further cleanup steps such as imputing NAs, feature engineering, and, perhaps most importantly, representing text in numerical form, for example as [Document Term Matrices](./01_Document_Term_Matrices.ipynb), since most machine-learning algorithms do not accept raw text as input.
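
To make "representing text in numerical form" concrete, here is a minimal sketch of a document-term matrix built with scikit-learn's `CountVectorizer`. The three-document corpus and the names `docs`, `dtm`, and `terms` are made up for illustration and are not part of the dataset or the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
docs = [
    "working madly",
    "mad boy here tweet",
    "tweet tweet working",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(dtm.toarray())  # dense view is only practical for tiny examples
```

Each row is one document and each column counts how often one term occurs in it, which is the form most estimators expect.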

---

### Load cleaned TRAIN data


In [11]:
import os 
import re
import time

import numpy as np
import pandas as pd
import scipy.sparse as sp

# for ML preprocessing
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# custom (see script)
import loading_module as lm

start_time = time.time()
df = lm.load_clean_data('X_train')
mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed Time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Elapsed Time: 0 minute(s) and 5 second(s)


### Quick EDA

In [12]:
df.shape

(1199999, 3)

In [13]:
df.head() 

Unnamed: 0,username,text,lemmatized
0,TessFTW,@haleyxfax at least your phone didn't get stol...,USERNAME least your phone didnt get stoled min...
0,haleybear,I shall leave for school when i am done jammin...,i shall leave school when i am done jammin kri...
0,Pace,Wish I was in Bournemouth today - how's it lo...,wish i bournemouth today hows looking down the...
0,goatkinghoang,working madly,working madly
0,area259,The mad boys are here to tweet,mad boy here tweet


In [14]:
# load original train indices and subset
raw_path = os.path.join("..","data","1_raw","sentiment140")  
train_ix = np.load(os.path.join(raw_path, "train_ix.npy"))
df.index = list(train_ix)

In [15]:
df.head()

Unnamed: 0,username,text,lemmatized
66270,TessFTW,@haleyxfax at least your phone didn't get stol...,USERNAME least your phone didnt get stoled min...
428045,haleybear,I shall leave for school when i am done jammin...,i shall leave school when i am done jammin kri...
1307927,Pace,Wish I was in Bournemouth today - how's it lo...,wish i bournemouth today hows looking down the...
1112400,goatkinghoang,working madly,working madly
840793,area259,The mad boys are here to tweet,mad boy here tweet


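Searching the lemmatized text for the `EMOJI` placeholder also surfaces `NaN`s: `re.search` raises a `TypeError` on non-string values, so those rows are collected separately: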

In [16]:
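error_ix = []  # positions where `lemmatized` is not a string (NaN)
emoji_ix = []  # positions of Tweets containing the EMOJI placeholder
for i, tweet in enumerate(df['lemmatized']):
    try:
        if re.search(r'EMOJI', tweet):
            emoji_ix.append(i)
    except TypeError:  # re.search rejects NaN, which is a float
        error_ix.append(i)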

In [17]:
# only a few NaNs
df.iloc[error_ix, ]

Unnamed: 0,username,text,lemmatized
1345707,ChickWithAName,. . . . . and it's on!,NaN
1384923,sangofsorrow,He is...,NaN
127147,LukeOgle,is in IT,NaN
332364,dianamra,and it was,NaN
1045989,Spacegirlspif13,Is... ...,NaN
859506,WMonk,It will,NaN
229686,Jmoux,are on..,NaN
968236,SquarahFaggins,to it!!,NaN
1463690,geegeeludlow,is in IT,NaN
891561,rooroocachoo,It will,NaN


In [19]:
# emojis
df.iloc[emoji_ix[:5], ]

Unnamed: 0,username,text,lemmatized
490715,oeaejung,"@FlowGoTom Hey, I watched your clip. wanna say...",USERNAME hey i watched your clip wan na say tr...
471789,javadimon,"ÃÂ¡ÃÂ¸ÃÂ¶ÃÂ ÃÂ² ÃÂÃÂµÃÂÃÂ, ÃÂ½Ã...",EMOJI EMOJI EMOJI EMOJI mt g3 EMOJI EMOJI 1mb ...
143909,jstn7,Time to pick the dragon upÃ¯Â¼?I'm sure she'll...,time pick dragon up EMOJI im sure shell have p...
1252403,Hanescymru,Cardiff 1989! There's lovely! Ã¢ÂÂ« http://b...,cardiff 1989 there lovely EMOJI URL
1249804,edwinduinkerken,Not so motivated for work today. Since that is...,not so motivated work today since not good thi...


In [20]:
len(emoji_ix) # could be better?

10856
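
The same two lookups can also be done without a Python loop. Below is a minimal sketch on a toy frame, assuming the placeholder is the literal string `EMOJI`. The rows and the names `toy`, `has_emoji`, `is_missing`, `emoji_pos`, and `missing_pos` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical rows)
toy = pd.DataFrame({
    "lemmatized": ["time pick dragon up EMOJI", "working madly", np.nan, "EMOJI EMOJI mt"],
})

# na=False treats missing values as "no match" instead of propagating NaN
has_emoji = toy["lemmatized"].str.contains("EMOJI", regex=False, na=False)
is_missing = toy["lemmatized"].isna()

# Positional indices, comparable to emoji_ix and error_ix in the loop
emoji_pos = np.flatnonzero(has_emoji.to_numpy())
missing_pos = np.flatnonzero(is_missing.to_numpy())
print(emoji_pos.tolist(), missing_pos.tolist())
```

The boolean masks can also be used directly for filtering, for example `toy[has_emoji]`, which avoids positional indexing altogether.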

### Impute NAs created during cleanup

We do not drop these rows: the fact that a Tweet was reduced to an empty string during cleanup may itself be informative.
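
One way to keep that information is to fill the gaps with a sentinel token after confirming the token is not already in use. The sketch below runs on toy data. The `was_empty` indicator column is an optional extra that the notebook does not create, and the names `SENTINEL`, `toy`, and `already_used` are illustrative. The check is case-insensitive because `CountVectorizer` lowercases text by default, so `NULL` and `null` would end up as the same token.

```python
import numpy as np
import pandas as pd

SENTINEL = "NULL"  # placeholder token; any string absent from the corpus would do

# Toy frame standing in for df (hypothetical rows)
toy = pd.DataFrame({
    "text": ["It will", "working madly", "is in IT"],
    "lemmatized": [np.nan, "working madly", np.nan],
})

# Make sure the sentinel does not already occur, ignoring case
already_used = (
    toy["lemmatized"].str.lower().str.contains(SENTINEL.lower(), regex=False, na=False).any()
)
assert not already_used

# Optional indicator feature (an assumption of this sketch), recorded before filling
toy["was_empty"] = toy["lemmatized"].isna()
toy["lemmatized"] = toy["lemmatized"].fillna(SENTINEL)
print(toy)
```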

In [21]:
df.isnull().sum()

username       0
text           0
lemmatized    10
dtype: int64

In [24]:
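# Impute with the string 'NULL'? First check that no Tweet already contains it.
error_ix = []  # positions where `lemmatized` is not a string (NaN)
NULL_ix = []   # positions of Tweets that already contain 'NULL'
for i, tweet in enumerate(df['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:  # re.search rejects NaN, which is a float
        error_ix.append(i)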

In [25]:
df.iloc[NULL_ix, ]

Unnamed: 0,username,text,lemmatized


In [26]:
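# assign through a single .loc call; chained indexing (df['col'].loc[...] = ...)
# may write to a temporary copy and leave df unchanged
NA_mask = df['lemmatized'].isnull()
df.loc[NA_mask, 'lemmatized'] = 'NULL'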

In [27]:
# double check
df.isnull().sum()

username      0
text          0
lemmatized    0
dtype: int64

In [28]:
# repeat the search: the imputed rows should now match 'NULL'
error_ix = []
NULL_ix = []
for i, tweet in enumerate(df['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:  # no longer expected after imputation
        error_ix.append(i)

In [29]:
df.iloc[NULL_ix, ]

Unnamed: 0,username,text,lemmatized
1345707,ChickWithAName,. . . . . and it's on!,NULL
1384923,sangofsorrow,He is...,NULL
127147,LukeOgle,is in IT,NULL
332364,dianamra,and it was,NULL
1045989,Spacegirlspif13,Is... ...,NULL
859506,WMonk,It will,NULL
229686,Jmoux,are on..,NULL
968236,SquarahFaggins,to it!!,NULL
1463690,geegeeludlow,is in IT,NULL
891561,rooroocachoo,It will,NULL


### Create a Bag-of-Words (BoW) document-feature matrix (DFM)

In [31]:
# target
# load target...... why?

#y = np.array(df.iloc[:, 0]).ravel()

In [33]:
# lemmatized column (as array), selected by name rather than position
X_array = df['lemmatized'].to_numpy()

In [34]:
X_array

array(['USERNAME least your phone didnt get stoled mine did',
       'i shall leave school when i am done jammin kris allen i miss him so much 3',
       'wish i bournemouth today hows looking down there dorset folk',
       ...,
       'USERNAME i mean USERNAME USERNAME never invite me 2 cicis USERNAME i when come back r gon na go do our own thang',
       'im gutted im night i want some eye candy',
       'mmmmm bedtime sound good today i found out i might have have down payment buy housewtf up'],
      dtype=object)

In [35]:
# create a BoW DFM
bow_vectorizer = CountVectorizer()

start_bow = time.time()

X_bow = bow_vectorizer.fit_transform(X_array)

mins, secs = divmod(time.time() - start_bow, 60)
print(f"BoW vectorization time: {mins:0.0f} minute(s) and {secs:0.0f} second(s).")

In [38]:
X_bow

<1199999x329492 sparse matrix of type '<class 'numpy.int64'>'
	with 11446957 stored elements in Compressed Sparse Row format>
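
The repr above already shows how sparse the matrix is: 11,446,957 stored entries out of 1,199,999 × 329,492 cells is a density of about 2.9e-5 (roughly 0.003% of cells), or about 9.5 distinct terms per Tweet. The sketch below shows the calculation on a toy matrix. The helper name `density` is hypothetical and is not part of SciPy.

```python
import scipy.sparse as sp


def density(matrix):
    """Fraction of cells that are explicitly stored (non-zero for a count matrix)."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Toy 3 x 4 count matrix with 3 non-zero entries
toy = sp.csr_matrix([[1, 0, 0, 0],
                     [0, 0, 2, 0],
                     [0, 1, 0, 0]])

print(density(toy))             # fraction of stored cells: 3 / 12
print(toy.nnz / toy.shape[0])   # average stored entries per row
```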

In [40]:
# visualize tiny portion, sparse indeed
X_bow[0:10, 0:20].todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int64)
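
When `CountVectorizer` learns its own vocabulary, it assigns column indices in the sorted order of the terms. A plausible reason the first 20 columns are all zero for these rows is therefore that they hold digit-leading tokens that few Tweets contain, though the notebook does not check this. The sketch below uses a toy corpus (the documents and the name `terms_by_column` are illustrative) to show how to list the terms behind the leading columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical); single-character tokens such as "2" are
# dropped by the default token pattern
docs = ["im gutted 2 night", "wish 100 folk", "working madly 3am"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Sort terms by their column index to see which term each column represents
terms_by_column = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(terms_by_column[:3])
```

Running the same lookup on `bow_vectorizer.vocabulary_` would show which terms occupy the first columns of `X_bow`.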

In [None]:
# use log(tf)? do this now? what about saving the idfs for the test set?
# tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True) 
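
One possible answer to the question about IDFs, sketched on toy data: fit the vectorizer and a `TfidfTransformer` on the training text only, persist both fitted objects, and reuse them to transform the test text so that both matrices share the same columns and IDF weights. The corpora and variable names are illustrative, and pickling to an in-memory blob stands in for writing a file. This is a sketch of a common approach, not the notebook's chosen method.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["working madly", "mad boy here tweet", "tweet tweet working"]  # toy
test_docs = ["boy working late"]  # "late" is unseen in training and is ignored

bow = CountVectorizer()
tfidf = TfidfTransformer(sublinear_tf=True)  # replaces tf with 1 + log(tf)

# Vocabulary and IDF weights are learned from the training text only
X_train_tfidf = tfidf.fit_transform(bow.fit_transform(train_docs))

# Persist the fitted objects (in practice, write the bytes to a file)
blob = pickle.dumps((bow, tfidf))
bow_loaded, tfidf_loaded = pickle.loads(blob)

# Reuse them on the test text: transform only, never fit
X_test_tfidf = tfidf_loaded.transform(bow_loaded.transform(test_docs))
print(X_train_tfidf.shape, X_test_tfidf.shape)
```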

### Save pre-processed DFMs

In [43]:
savepath = os.path.join("..","data","3_processed","sentiment140")
filename = 'X_bow'
filepath = os.path.join(savepath, ''.join([filename, '.npz']))
filepath

'..\\data\\3_processed\\sentiment140\\X_bow.npz'

In [44]:
sp.save_npz(filepath, X_bow)
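
A quick round trip can confirm that a saved matrix loads back unchanged. The sketch below uses a toy matrix and a temporary directory rather than the project paths. The file name `X_toy.npz` and the variable names are illustrative.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Toy count matrix standing in for X_bow
toy = sp.csr_matrix(np.array([[0, 1, 0], [2, 0, 0]], dtype=np.int64))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "X_toy.npz")
    sp.save_npz(path, toy)
    restored = sp.load_npz(path)

# Same shape and no differing entries
same = restored.shape == toy.shape and (restored != toy).nnz == 0
print(same)
```

Note that the `.npz` file stores only the matrix. The fitted vectorizer's vocabulary has to be kept separately if the test set is to be mapped onto the same columns.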

In [45]:
# save y target vector == we already have this!?
#np.save(os.path.join(savepath, 'y'), y)

---