## Twitter Sentiment Analysis 

---

### Pre-process cleaned data for machine learning

### Part 1: Bag of Words

Cleanup only reformatted each Tweet's text: it standardized the text and reduced the feature space (removing punctuation, replacing usernames and URLs, lowercasing, tokenizing, lemmatizing, etc.). Pre-processing for machine learning is often more involved. It includes further cleanup steps such as imputing NAs, feature engineering and, perhaps most importantly, a way of representing text in numerical form, such as [Document Term Matrices](./01_Document_Term_Matrices.ipynb), since most machine-learning algorithms do not accept raw text as input. This notebook builds a simple Bag of Words Document-Frequency Matrix (DFM).
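
As a minimal sketch of what a Bag of Words matrix looks like, the example below vectorizes a three-document toy corpus. The corpus and the `toy_*` names are made up for illustration and are not part of the project data. Each row is a document, each column is a token, and each cell is a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), mimicking the cleaned "lemmatized" column
toy_docs = [
    "USERNAME lol i love this song",
    "URL bump this",
    "i love love this",
]

toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_docs)  # SciPy sparse matrix

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_bow.toarray())
```

With default settings, `CountVectorizer` lowercases the text and only keeps tokens of two or more word characters, so placeholders such as `USERNAME` become `username` and single-character tokens such as `i` are dropped.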

---

### Load cleaned TRAIN data


In [2]:
import os 
import re
import time

import numpy as np
import pandas as pd
import scipy.sparse as sp

# for ML preprocessing
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# custom (see script)
import loading_module as lm

start_time = time.time()
X_train, y_train = lm.load_clean_data('X_train')
df = X_train  # the cells below refer to the training features as df

### Save y as .npy

In [3]:
proc_dir = os.path.join("..","data","3_processed","sentiment140")
y_filepath = os.path.join(proc_dir, "y_train.npy")

with open(y_filepath, 'wb') as f:
    np.save(f, y_train)

### Quick EDA

In [4]:
X_train.shape, y_train.shape

((1199999, 3), (1199999, 1))

In [5]:
X_train.head() 

Unnamed: 0,username,text,lemmatized
66270,kylefong,Please pray for my house. There is a major wat...,please pray my house there major water leakage...
428045,Abcmsaj,http://twitpic.com/kfu7 - Bump this,URL bump this
1307927,rkguruparan,@sanjanah just saw the 'not' part in your last...,USERNAME just saw not part your last message
1112400,Breedon,"@Princesz22 lol, I put up 100 tracks u haven't...",USERNAME lol i put up 100 track u havent retwe...
840793,redgehomes,@NBA u gotta endorse an application if u gonna...,USERNAME u got ta endorse application if u gon...


In [4]:
y_train.head()

Unnamed: 0,target
66270,0
428045,0
1307927,1
1112400,1
840793,1


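Search the lemmatized text for the `EMOJI` placeholder. Rows where the text is `NaN` raise a `TypeError` in `re.search`, so we record those as well: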

In [5]:
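error_ix = []  # positions where 'lemmatized' is not a string (NaN)
emoji_ix = []  # positions whose lemmatized text contains the EMOJI placeholder
for i, tweet in enumerate(df['lemmatized']):
    try:
        if re.search(r'EMOJI', tweet):
            emoji_ix.append(i)
    except TypeError:
        # re.search raises TypeError on NaN (a float)
        error_ix.append(i)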

In [6]:
# only a few NaNs
df.iloc[error_ix, ]

Unnamed: 0,username,text,lemmatized
1023057,ChickWithAName,. . . . . and it's on!,
301554,sangofsorrow,He is...,
473192,LukeOgle,is in IT,
690091,dianamra,and it was,
910200,Spacegirlspif13,Is... ...,
248115,WMonk,It will,
229686,Jmoux,are on..,
1058230,SquarahFaggins,to it!!,
213856,geegeeludlow,is in IT,
1288815,rooroocachoo,It will,


In [7]:
# emojis
df.iloc[emoji_ix[:5], ]

Unnamed: 0,username,text,lemmatized
752651,oeaejung,"@FlowGoTom Hey, I watched your clip. wanna say...",USERNAME hey i watched your clip wan na say tr...
962667,javadimon,"ÃÂ¡ÃÂ¸ÃÂ¶ÃÂ ÃÂ² ÃÂÃÂµÃÂÃÂ, ÃÂ½Ã...",EMOJI EMOJI EMOJI EMOJI mt g3 EMOJI EMOJI 1mb ...
228718,jstn7,Time to pick the dragon upÃ¯Â¼?I'm sure she'll...,time pick dragon up EMOJI im sure shell have p...
1269140,Hanescymru,Cardiff 1989! There's lovely! Ã¢ÂÂ« http://b...,cardiff 1989 there lovely EMOJI URL
956933,edwinduinkerken,Not so motivated for work today. Since that is...,not so motivated work today since not good thi...


In [8]:
len(emoji_ix)  # tweets containing EMOJI; the placeholder handling could probably be improved

10856
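
The row-by-row loop above can also be expressed with vectorized pandas string methods. The sketch below runs on a small made-up frame (`toy_df` and the other `toy_*` names are assumptions for illustration) and collects the same two sets of positions: rows containing the `EMOJI` placeholder and rows that are missing.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing value in the text column
toy_df = pd.DataFrame(
    {"lemmatized": ["time pick dragon up EMOJI", np.nan, "URL bump this", "EMOJI EMOJI mt"]}
)

# na=False treats missing values as "no match" instead of propagating NaN
has_emoji = toy_df["lemmatized"].str.contains("EMOJI", regex=False, na=False)
is_missing = toy_df["lemmatized"].isna()

# Positional indices, comparable to emoji_ix and error_ix in the loop
toy_emoji_ix = np.flatnonzero(has_emoji.to_numpy())
toy_error_ix = np.flatnonzero(is_missing.to_numpy())

print(toy_emoji_ix, toy_error_ix)
```

On a frame with over a million rows this form is usually faster than a Python loop, and it avoids using exception handling to detect missing values.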

### Impute NAs created during cleanup

We impute these rows rather than drop them: the fact that a tweet was reduced to an empty string during cleanup is possibly informative.

In [9]:
df.isnull().sum()

username       0
text           0
lemmatized    10
dtype: int64

In [10]:
# Impute with NULL as a string
# first check whether any tweet already contains 'NULL'
error_ix = []
NULL_ix = []
for i, tweet in enumerate(df['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        error_ix.append(i)

# assign through a single .loc call to avoid chained assignment
NA_ix = df.index[df['lemmatized'].isnull()]
df.loc[NA_ix, 'lemmatized'] = 'NULL'

In [11]:
# double check
df.isnull().sum()

username      0
text          0
lemmatized    0
dtype: int64
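
A shorter way to perform the same imputation is `Series.fillna`. The sketch below uses a made-up three-row frame (`toy_df` is an assumption for illustration) and also checks how the placeholder shows up after vectorization.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame (hypothetical): two tweets were reduced to NaN during cleanup
toy_df = pd.DataFrame(
    {"lemmatized": [np.nan, "URL bump this", np.nan]},
    index=[301554, 428045, 473192],
)

n_missing_before = int(toy_df["lemmatized"].isna().sum())
toy_df["lemmatized"] = toy_df["lemmatized"].fillna("NULL")
n_missing_after = int(toy_df["lemmatized"].isna().sum())

# The placeholder becomes an ordinary (lowercased) token in the vocabulary
toy_vectorizer = CountVectorizer().fit(toy_df["lemmatized"])

print(n_missing_before, n_missing_after)
print("null" in toy_vectorizer.vocabulary_)
```

Because `CountVectorizer` lowercases by default, the `NULL` placeholder ends up as the feature `null`, which is how the "empty after cleanup" signal can reach a model.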

In [12]:
NULL_ix = []  # reset so matches from the earlier pass are not duplicated
for i, tweet in enumerate(df['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        continue

df.iloc[NULL_ix, ]

Unnamed: 0,username,text,lemmatized
1023057,ChickWithAName,. . . . . and it's on!,NULL
301554,sangofsorrow,He is...,NULL
473192,LukeOgle,is in IT,NULL
690091,dianamra,and it was,NULL
910200,Spacegirlspif13,Is... ...,NULL
248115,WMonk,It will,NULL
229686,Jmoux,are on..,NULL
1058230,SquarahFaggins,to it!!,NULL
213856,geegeeludlow,is in IT,NULL
1288815,rooroocachoo,It will,NULL


### Create BoW DFM

In [13]:
# lemmatized column as a 1-D array of strings
X_array = df['lemmatized'].to_numpy()

In [14]:
# create a BoW DFM
bow_vectorizer = CountVectorizer()

X_bow = bow_vectorizer.fit_transform(X_array)

In [15]:
X_bow

<1199999x329492 sparse matrix of type '<class 'numpy.int64'>'
	with 11446957 stored elements in Compressed Sparse Row format>

In [16]:
# visualize tiny portion
X_bow[0:6, 0:20].todense()

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int64)
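
The all-zero block is expected in a matrix this sparse. One plausible reason, offered as an assumption rather than something checked against the real vocabulary, is that `CountVectorizer` orders columns by sorted token, so the first columns tend to be digit-led tokens that few tweets contain. The sketch below shows how to map column indices back to tokens on a made-up corpus (`toy_*` names are hypothetical).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical) containing a couple of numeric tokens
toy_docs = [
    "put up 100 track",
    "cardiff 1989 there lovely",
    "please pray my house",
]

toy_vectorizer = CountVectorizer().fit(toy_docs)

# Invert the token -> column mapping to look up tokens by column index
index_to_token = {ix: tok for tok, ix in toy_vectorizer.vocabulary_.items()}
first_tokens = [index_to_token[ix] for ix in range(2)]

print(first_tokens)
```

Applying the same inversion to `bow_vectorizer.vocabulary_` would show which tokens the first 20 columns of `X_bow` correspond to.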

In [17]:
# calculate density (share of nonzero entries)
total_space = X_bow.shape[0] * X_bow.shape[1]
total_store = X_bow.getnnz()
pct_nonzero = 100 * (total_store/total_space)
print(f'Only {pct_nonzero:0.10f} % nonzero.')

Only 0.0028951048 % nonzero.
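
To make the figure above concrete, the sketch below recomputes the density from the shape and stored-element count reported for `X_bow`, and estimates how much memory a dense `int64` array of the same shape would need. The helper `density_pct` and the toy matrix are assumptions introduced for illustration.

```python
import numpy as np
import scipy.sparse as sp


def density_pct(n_rows, n_cols, n_stored):
    """Percentage of cells that hold an explicitly stored (nonzero) value."""
    return 100 * n_stored / (n_rows * n_cols)


# Toy matrix (hypothetical): 2 stored values in a 3 x 4 grid
toy = sp.csr_matrix(np.array([[0, 0, 3, 0], [0, 0, 0, 0], [1, 0, 0, 0]]))
toy_pct = density_pct(*toy.shape, toy.getnnz())

# Shape and stored-element count as reported for X_bow in this notebook
reported_pct = density_pct(1199999, 329492, 11446957)

# Size of an equivalent dense int64 array, in bytes
dense_bytes = 1199999 * 329492 * np.dtype(np.int64).itemsize

print(f"toy: {toy_pct:0.2f} % nonzero")
print(f"X_bow: {reported_pct:0.10f} % nonzero")
print(f"dense equivalent: {dense_bytes / 1e12:0.2f} TB")
```

A dense array of this shape would need terabytes of memory, which is why the matrix is kept in Compressed Sparse Row format and only a tiny slice is ever converted with `todense()`.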


### Save BoW

In [18]:
savepath = os.path.join("..","data","3_processed","sentiment140")
filename = 'X_bow'
filepath = os.path.join(savepath, ''.join([filename, '.npz']))

sp.save_npz(filepath, X_bow)

# print total running time
mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed Time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Elapsed Time: 1 minute(s) and 1 second(s)
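
A later notebook would read these files back with `scipy.sparse.load_npz` and `numpy.load`. The round trip below is a minimal sketch that writes toy data to a temporary directory instead of the project's `data/3_processed` folder; the `toy_*` names and temporary paths are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Toy stand-ins (hypothetical) for X_bow and y_train
toy_X = sp.csr_matrix(np.array([[0, 1, 0], [2, 0, 0]]))
toy_y = np.array([[0], [1]])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_bow.npz")
    y_path = os.path.join(tmp_dir, "y_train.npy")

    # Save in the same formats used above
    sp.save_npz(x_path, toy_X)
    np.save(y_path, toy_y)

    # Load back: the sparse format and the label array are preserved
    X_loaded = sp.load_npz(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, X_loaded.format, y_loaded.shape)
```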


---