## Twitter Sentiment Analysis 

---

### Pre-process cleaned data for machine learning

### Part 1: Bag of Words

While cleanup simply reformatted each Tweet's text by standardizing it and reducing the feature space (removing punctuation, replacing usernames and URLs, lowercasing, tokenizing, lemmatizing, etc.), pre-processing for machine learning is often more involved. It consists of further cleanup steps such as imputing NAs, feature engineering and, perhaps most importantly, representing text in numerical form, for example as [Document Term Matrices](./01_Document_Term_Matrices.ipynb), since most machine-learning algorithms do not accept raw text as input. This notebook explores the creation of a simple Bag of Words Document-Frequency Matrix (DFM).
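
As a minimal sketch of what such a numerical representation looks like, the toy corpus below (hypothetical, not taken from the dataset) is turned into a small document-term matrix with scikit-learn's `CountVectorizer`. Each row is a document, each column a term, and each cell a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus (hypothetical, not taken from the dataset)
toy_docs = ["good day good night", "bad day"]

toy_vectorizer = CountVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# list the terms in column order
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_terms)
print(toy_dtm.toarray())
```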

---

### Load cleaned TRAIN data


In [6]:
import os 
import re
import time

import numpy as np
import pandas as pd
import scipy.sparse as sp

# for ML preprocessing
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# custom (see script)
import loading_module as lm

start_time = time.time()
X_train, y_train = lm.load_clean_data('X_train')

### Save target as npy

In [7]:
proc_dir = os.path.join("..","data","3_processed","sentiment140")
y_filepath = os.path.join(proc_dir, "y_train.npy")

with open(y_filepath, 'wb') as f:
    np.save(f, y_train)

### Quick EDA

In [8]:
X_train.shape, y_train.shape

((1199999, 3), (1197471, 1))

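## ISSUE:

- `loading_module` is not reproducible: rows do not come back in the same order from one run to the next.
- This is probably due to async multiprocessing, which does not guarantee the order of results.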
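
One possible guard against this, sketched below on hypothetical stand-in frames, is to align features and targets on a shared index and sort it. This assumes both frames carry the original row index, which the loading script may or may not preserve.

```python
import pandas as pd

# hypothetical stand-ins for the loaded frames; rows arrive in arbitrary order
X_demo = pd.DataFrame({"lemmatized": ["c", "a", "b"]}, index=[2, 0, 1])
y_demo = pd.DataFrame({"target": [0, 1]}, index=[0, 2])

# keep only indices present in both frames, in a deterministic (sorted) order
shared_ix = X_demo.index.intersection(y_demo.index).sort_values()
X_aligned = X_demo.loc[shared_ix]
y_aligned = y_demo.loc[shared_ix]

print(X_aligned.shape, y_aligned.shape)
```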

In [9]:
#X_train.head() 

In [10]:
#X_train.tail()

In [11]:
#y_train.head()

In [12]:
#y_train.tail()

Searching for the `EMOJI` placeholder also surfaces a few `NaN` values, which raise a `TypeError` in `re.search`:

In [13]:
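error_ix = []   # positions where the lemmatized text is NaN
emoji_ix = []   # positions of tweets containing the EMOJI placeholder
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'EMOJI', tweet):
            emoji_ix.append(i)
    except TypeError:  # NaN is a float, which re.search rejects
        error_ix.append(i)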
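
The same two position lists can probably be obtained without an explicit loop. The sketch below uses pandas string methods on a hypothetical stand-in for `X_train['lemmatized']`; `na=False` makes missing values count as non-matches rather than raising.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for X_train['lemmatized']
lemmas = pd.Series(["hello EMOJI", np.nan, "no marker here", "EMOJI EMOJI"])

# na=False treats missing values as non-matches
has_emoji = lemmas.str.contains("EMOJI", regex=False, na=False)
emoji_positions = np.flatnonzero(has_emoji.to_numpy())
nan_positions = np.flatnonzero(lemmas.isna().to_numpy())

print(emoji_positions, nan_positions)
```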

In [14]:
# only a few NaNs
X_train.iloc[error_ix, ]

Unnamed: 0,username,text,lemmatized
174958,dianamra,and it was,
230300,geegeeludlow,is in IT,
279405,Jmoux,are on..,
326761,Spacegirlspif13,Is... ...,
356389,ChickWithAName,. . . . . and it's on!,
510320,LukeOgle,is in IT,
607035,sangofsorrow,He is...,
629899,SquarahFaggins,to it!!,
1043379,rooroocachoo,It will,
1128263,WMonk,It will,


In [15]:
# emojis
X_train.iloc[emoji_ix[:5], ]

Unnamed: 0,username,text,lemmatized
92,Applechic,Like this cover a lot! Yup tiz anzum. @Gregdt...,like this cover lot yup tiz anzum USERNAME i w...
171,Applechic,Hope your now unbirthday is good too! @The_Kra...,hope your now unbirthday good too USERNAME EMO...
224,d_whiteplume,@panda951 no one makes cooler videos than BjÃ...,USERNAME no one make cooler video than bj EMOJ...
239,JustTooBusy,Car boot was a complete wash out - got soaked ...,car boot complete wash out got soaked supposed...
253,hindy_cindy,I love how down-to-earth BeyoncÃÂ© is. She di...,i love how downtoearth beyonc EMOJI she ditche...


In [16]:
len(emoji_ix) # could be better?

10856

### Impute NAs created during cleanup

We impute these rows rather than drop them, since the fact that a tweet was reduced to an empty string during cleanup is possibly informative.
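
A minimal sketch of this kind of imputation on a hypothetical frame, using `fillna` so that the row survives and the placeholder becomes an ordinary token:

```python
import numpy as np
import pandas as pd

# hypothetical frame with one tweet that was emptied during cleanup
demo = pd.DataFrame({"lemmatized": ["good day", np.nan, "bad day"]})

# keep the row, but mark it with an explicit placeholder token
demo["lemmatized"] = demo["lemmatized"].fillna("NULL")
print(demo)
```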

In [17]:
X_train.isnull().sum()

username       0
text           0
lemmatized    10
dtype: int64

In [18]:
# Impute with NULL as a string
error_ix = []
NULL_ix = []
# record tweets that already contain the literal 'NULL'
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:  # NaN entries
        error_ix.append(i)

NA_ix = X_train.loc[X_train['lemmatized'].isnull(), ].index
# assign through a single .loc call; chained indexing may write to a copy
X_train.loc[NA_ix, 'lemmatized'] = 'NULL'

In [19]:
# double check
#X_train.isnull().sum()

In [20]:
error_ix = []
NULL_ix = []  # reset so matches from the previous cell are not duplicated
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        continue

X_train.iloc[NULL_ix, ]

Unnamed: 0,username,text,lemmatized
174958,dianamra,and it was,
230300,geegeeludlow,is in IT,
279405,Jmoux,are on..,
326761,Spacegirlspif13,Is... ...,
356389,ChickWithAName,. . . . . and it's on!,
510320,LukeOgle,is in IT,
607035,sangofsorrow,He is...,
629899,SquarahFaggins,to it!!,
1043379,rooroocachoo,It will,
1128263,WMonk,It will,


### Create BoW DFM

In [22]:
# lemmatized column (2) as array, ravel will flatten the structure
X_array = np.array(X_train.iloc[:, 2]).ravel()

In [23]:
X_train.iloc[:3, 2]

0      she why didn she call just please come hom soon
1    had blast studio last friday making my first a...
2    USERNAME amazoncom mar caneuon n amazing ar y ...
Name: lemmatized, dtype: object

In [24]:
X_array[:3]

array(['she why didn she call just please come hom soon',
       'had blast studio last friday making my first album bandlife good',
       'USERNAME amazoncom mar caneuon n amazing ar y live chat ma nwn chware darn o turn right'],
      dtype=object)

In [26]:
# credit: A. Géron
from sklearn.base import BaseEstimator, TransformerMixin
from collections import Counter

class DocumentToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_numbers=True):
        self.replace_numbers = replace_numbers
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for doc in X:
            if self.replace_numbers:
                # match integers, decimals and scientific notation (e.g. 42, 3.14, 1e5)
                doc = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', doc)
            word_counts = Counter(doc.split())
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [27]:
X_few = X_array[:10]
X_few_wordcounts = DocumentToWordCounterTransformer().fit_transform(X_few)
for word_counts in X_few_wordcounts:
    print(dict(word_counts.most_common()))

{'she': 2, 'why': 1, 'didn': 1, 'call': 1, 'just': 1, 'please': 1, 'come': 1, 'hom': 1, 'soon': 1}
{'had': 1, 'blast': 1, 'studio': 1, 'last': 1, 'friday': 1, 'making': 1, 'my': 1, 'first': 1, 'album': 1, 'bandlife': 1, 'good': 1}
{'USERNAME': 1, 'amazoncom': 1, 'mar': 1, 'caneuon': 1, 'n': 1, 'amazing': 1, 'ar': 1, 'y': 1, 'live': 1, 'chat': 1, 'ma': 1, 'nwn': 1, 'chware': 1, 'darn': 1, 'o': 1, 'turn': 1, 'right': 1}
{'i': 3, 'lost': 1, 'NUMBER': 1, 'follower': 1, 'do': 1, 'you': 1, 'not': 1, 'know': 1, 'who': 1, 'am': 1, 'demand': 1, 'steward': 1, 'enquiry': 1}
{'have': 3, 'even': 2, 'iphone': 2, 'USERNAME': 1, 'glad': 1, 'youre': 1, 'able': 1, 'i': 1, 'verizon': 1, 'they': 1, 'dont': 1, 'option': 1}
{'i': 2, 'USERNAME': 1, 'aww': 1, 'im': 1, 'sorry': 1, 'hope': 1, 'change': 1, 'soon': 1, 'hate': 1, 'people': 1, 'who': 1, 'dont': 1, 'comment': 1, 'when': 1, 'they': 1, 'read': 1}
{'playing': 1, 'some': 1, 'combat': 1, 'arm': 1, 'you': 1, 'should': 1, 'check': 1, 'out': 1, 'pretty': 1,

In [28]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 5) # cap each document's contribution at 5
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows, cols, data = [], [], []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))
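
To illustrate why column 0 ends up counting unknown words, here is a small sketch with a hypothetical two-word vocabulary. Every word missing from the vocabulary is sent to column 0, and `csr_matrix` sums duplicate `(row, col)` entries when it is built from coordinate lists.

```python
from collections import Counter
from scipy.sparse import csr_matrix

# hypothetical two-word vocabulary; index 0 is reserved for unknown words
toy_vocab = {"good": 1, "day": 2}
toy_counts = [Counter("good day sunny warm".split()), Counter("day day".split())]

rows, cols, data = [], [], []
for row, word_count in enumerate(toy_counts):
    for word, count in word_count.items():
        rows.append(row)
        cols.append(toy_vocab.get(word, 0))  # unknown words map to column 0
        data.append(count)

# duplicate (row, col) pairs are summed when the matrix is built
toy_matrix = csr_matrix((data, (rows, cols)), shape=(len(toy_counts), len(toy_vocab) + 1))
print(toy_matrix.toarray())
```

In the first toy document, `sunny` and `warm` are both unknown, so column 0 holds 2.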

In [29]:
vocab_transformer = WordCounterToVectorTransformer()
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<10x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 112 stored elements in Compressed Sparse Row format>

In [30]:
X_few_vectors.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0]], dtype=int32)

In [31]:
vocab_transformer.vocabulary_

{'i': 1,
 'USERNAME': 2,
 'have': 3,
 'she': 4,
 'soon': 5,
 'first': 6,
 'NUMBER': 7,
 'you': 8,
 'who': 9,
 'even': 10,
 'iphone': 11,
 'they': 12,
 'dont': 13,
 'out': 14,
 'her': 15,
 'why': 16,
 'didn': 17,
 'call': 18,
 'just': 19,
 'please': 20,
 'come': 21,
 'hom': 22,
 'had': 23,
 'blast': 24,
 'studio': 25,
 'last': 26,
 'friday': 27,
 'making': 28,
 'my': 29,
 'album': 30,
 'bandlife': 31,
 'good': 32,
 'amazoncom': 33,
 'mar': 34,
 'caneuon': 35,
 'n': 36,
 'amazing': 37,
 'ar': 38,
 'y': 39,
 'live': 40,
 'chat': 41,
 'ma': 42,
 'nwn': 43,
 'chware': 44,
 'darn': 45,
 'o': 46,
 'turn': 47,
 'right': 48,
 'lost': 49,
 'follower': 50,
 'do': 51,
 'not': 52,
 'know': 53,
 'am': 54,
 'demand': 55,
 'steward': 56,
 'enquiry': 57,
 'glad': 58,
 'youre': 59,
 'able': 60,
 'verizon': 61,
 'option': 62,
 'aww': 63,
 'im': 64,
 'sorry': 65,
 'hope': 66,
 'change': 67,
 'hate': 68,
 'people': 69,
 'comment': 70,
 'when': 71,
 'read': 72,
 'playing': 73,
 'some': 74,
 'combat': 75,
 '

In [29]:
#from sklearn.pipeline import Pipeline
#
#preprocess_pipeline = Pipeline([
#    ("document_to_wordcount", DocumentToWordCounterTransformer()),
#    ("wordcount_to_vector", WordCounterToVectorTransformer()),
#])

# the fitted vocabulary is not lost; it stays on the pipeline step:
# preprocess_pipeline.named_steps["wordcount_to_vector"].vocabulary_
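
As a sketch of where fitted attributes end up, the stand-in pipeline below uses stock scikit-learn steps instead of the custom transformers (the step names are arbitrary). After fitting, the vocabulary can be read from the step that learned it through `named_steps`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# stand-in pipeline built from stock scikit-learn steps (step names are arbitrary)
demo_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
demo_pipeline.fit(["good day", "bad day"])

# fitted attributes stay on the step that learned them
demo_vocab = demo_pipeline.named_steps["counts"].vocabulary_
print(demo_vocab)
```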

In [61]:
#X_train_transformed = preprocess_pipeline.fit_transform(X_array)

In [57]:
X_train_transformed

<1199999x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 10232362 stored elements in Compressed Sparse Row format>

In [58]:
for i,v in enumerate(X_train_transformed[:33,:30].todense()):
    if i > 30:
        print(i,v)

31 [[2 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
32 [[1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]]


In [62]:
X_wordcounts = DocumentToWordCounterTransformer().fit_transform(X_array)

vocabulary_transformer = WordCounterToVectorTransformer()
X_vectors = vocabulary_transformer.fit_transform(X_wordcounts)

In [63]:
vocabulary_transformer.vocabulary_ # 1000! yay

{'USERNAME': 1,
 'i': 2,
 'my': 3,
 'you': 4,
 'NUMBER': 5,
 'im': 6,
 'me': 7,
 'so': 8,
 'have': 9,
 'but': 10,
 'just': 11,
 'not': 12,
 'day': 13,
 'this': 14,
 'now': 15,
 'good': 16,
 'up': 17,
 'get': 18,
 'URL': 19,
 'all': 20,
 'out': 21,
 'like': 22,
 'go': 23,
 'no': 24,
 'got': 25,
 'u': 26,
 'love': 27,
 'dont': 28,
 'work': 29,
 'do': 30,
 'today': 31,
 'your': 32,
 'going': 33,
 'too': 34,
 'time': 35,
 'cant': 36,
 'back': 37,
 'one': 38,
 'lol': 39,
 'know': 40,
 'what': 41,
 'we': 42,
 'about': 43,
 'can': 44,
 'really': 45,
 'am': 46,
 'want': 47,
 'had': 48,
 'there': 49,
 'see': 50,
 'some': 51,
 'well': 52,
 'night': 53,
 'think': 54,
 'if': 55,
 'still': 56,
 'new': 57,
 'na': 58,
 'how': 59,
 'need': 60,
 'thanks': 61,
 'home': 62,
 'when': 63,
 'oh': 64,
 'miss': 65,
 'more': 66,
 'here': 67,
 'much': 68,
 'off': 69,
 'they': 70,
 'last': 71,
 'feel': 72,
 'hope': 73,
 'make': 74,
 'morning': 75,
 'been': 76,
 'then': 77,
 'tomorrow': 78,
 'great': 79,
 'twitte

In [64]:
X_vectors

<1199999x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 10232362 stored elements in Compressed Sparse Row format>

In [72]:
import sys
np.set_printoptions(threshold=sys.maxsize)

In [103]:
ix, vals = [], []
for i, v in enumerate(X_vectors[4].toarray()[0]):
    # skip column 0 (out-of-vocabulary count) and empty columns
    if i != 0 and v != 0:
        ix.append(i)
        vals.append(v)

In [104]:
# length 1001: column 0 counts the tokens in this doc that are not in the vocabulary
X_vectors[4].toarray()

array([[4, 4, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [105]:
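# vocabulary_ maps word -> column index starting at 1 (column 0 is the
# out-of-vocabulary bucket), so look words up by their stored index.
# The listing below was printed with enumerate() from 0, which pairs
# index k with the word stored at k + 1 (index 1 is 'USERNAME', not 'i').
for word, index in vocabulary_transformer.vocabulary_.items():
    if index in ix:
        print(index, word)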

1 i
2 my
6 me
25 u
26 love
55 still
58 how
109 though
169 wont
264 call


In [106]:
ix, vals

([1, 2, 6, 25, 26, 55, 58, 109, 169, 264], [4, 1, 1, 1, 2, 1, 1, 1, 1, 1])
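
One way to decode a row back into words is to invert the vocabulary mapping. The sketch below uses a hypothetical miniature vocabulary in the same word-to-column format, with column 0 reserved for out-of-vocabulary counts.

```python
# hypothetical miniature vocabulary in the same word -> column format
mini_vocab = {"USERNAME": 1, "i": 2, "my": 3}
mini_row = [1, 4, 1, 0]  # column 0 holds the out-of-vocabulary count

# invert the mapping so columns can be looked up directly
index_to_word = {index: word for word, index in mini_vocab.items()}
decoded = {
    index_to_word[col]: count
    for col, count in enumerate(mini_row)
    if col != 0 and count != 0
}
print(decoded)
```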

In [108]:
X_array[:6]

array(['please pray my house there major water leakage thats causing my entire house crack possibly fall apart',
       'URL bump this', 'USERNAME just saw not part your last message',
       'USERNAME lol i put up 100 track u havent retweeted 1 hun smh',
       'USERNAME u got ta endorse application if u gon na our courtside tweep USERNAME im USERNAME i USERNAME',
       'best night ever got spend better part half hour letting off firework'],
      dtype=object)

In [17]:
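# create BoW DFMs: unigrams only (ung) and unigrams + bigrams (big),
# each limited to the 10,000 most frequent features
bow_vectorizer_ung = CountVectorizer(max_features=10000)
bow_vectorizer_big = CountVectorizer(max_features=10000, ngram_range=(1,2))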

X_bow_ung = bow_vectorizer_ung.fit_transform(X_array)
X_bow_big = bow_vectorizer_big.fit_transform(X_array)
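
To make the effect of `ngram_range` concrete, here is a small sketch on a hypothetical two-document corpus. With `(1, 2)` the vocabulary contains the unigrams plus every pair of adjacent words, which is why the bigram model has to share its `max_features` budget between both kinds of term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical two-document corpus
toy_docs = ["not good", "good day"]

uni = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

uni_terms = sorted(uni.vocabulary_)
uni_bi_terms = sorted(uni_bi.vocabulary_)
print(uni_terms)
print(uni_bi_terms)
```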

In [18]:
# For reference, the full (uncapped) feature space is <1199999x329492 sparse matrix of type '<class 'numpy.int64'>' with 11446957 stored elements>
X_bow_ung, X_bow_big

(<1199999x10000 sparse matrix of type '<class 'numpy.int64'>'
 	with 10626030 stored elements in Compressed Sparse Row format>,
 <1199999x10000 sparse matrix of type '<class 'numpy.int64'>'
 	with 13212223 stored elements in Compressed Sparse Row format>)

In [19]:
# Only 0.0028951048 % nonzero when ALL features are kept
def calc_sparsity(X):
    """Print the percentage of nonzero cells in sparse matrix X."""
    total_space = X.shape[0] * X.shape[1]
    total_store = X.getnnz()
    pct_nonzero = 100 * (total_store/total_space)
    print(f'Only {pct_nonzero:0.10f} % nonzero.')

calc_sparsity(X_bow_ung)

Only 0.0885503238 % nonzero.


In [21]:
calc_sparsity(X_bow_big)

Only 0.1101019501 % nonzero.
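
The figures above are densities (share of nonzero cells); sparsity is the complement. A minimal sketch of the arithmetic on a hypothetical 2x4 matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# hypothetical 2x4 matrix with two nonzero cells
toy = csr_matrix(np.array([[0, 2, 0, 0], [1, 0, 0, 0]]))

n_cells = toy.shape[0] * toy.shape[1]
pct_nonzero = 100 * toy.nnz / n_cells  # density, in percent
pct_zero = 100 - pct_nonzero           # sparsity, in percent
print(f"{pct_nonzero:.1f} % nonzero, {pct_zero:.1f} % zero")
```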


### Save BoW

In [22]:
savepath = os.path.join("..","data","3_processed","sentiment140")
sp.save_npz(os.path.join(savepath, 'X_bow_ung.npz'), X_bow_ung)
sp.save_npz(os.path.join(savepath, 'X_bow_big.npz'), X_bow_big)

# print total running time
mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed Time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Elapsed Time: 7 minute(s) and 5 second(s)
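
A quick way to check that a saved matrix can be read back unchanged is a round trip through `save_npz` and `load_npz`. The sketch below does this on a hypothetical toy matrix in a temporary directory, so it does not touch the project's data folder.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# hypothetical toy matrix standing in for X_bow_ung
toy = sp.csr_matrix(np.array([[0, 2], [1, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_bow.npz")
    sp.save_npz(toy_path, toy)
    reloaded = sp.load_npz(toy_path)

# compare dense copies; fine for a toy matrix, too costly for the full DFM
round_trip_ok = np.array_equal(toy.toarray(), reloaded.toarray())
print(round_trip_ok)
```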


---