## Twitter Sentiment Analysis 

---

### Pre-process cleaned data for machine learning

### Part 1: Bag of Words

Cleanup simply reformatted each Tweet's text, standardizing it and reducing the feature space (stripping punctuation, replacing usernames and URLs, lowercasing, tokenizing, lemmatizing, etc.). Pre-processing for machine learning is often more involved. It includes further cleanup steps such as imputing NAs, feature engineering and, perhaps most importantly, a way of representing text in numerical form, such as [Document Term Matrices](./01_Document_Term_Matrices.ipynb), since most machine-learning algorithms do not accept raw text as input. This notebook builds a simple Bag of Words Document-Frequency Matrix (DFM).
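
As a minimal illustration of the idea (a toy sketch on made-up documents, not part of the notebook's data), scikit-learn's `CountVectorizer` turns a few short texts into a matrix with one row per document and one column per vocabulary term:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents (hypothetical, for illustration only).
toy_docs = ["good morning twitter", "good night", "not good not bad"]

toy_vectorizer = CountVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column; sort the terms by column position.
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_counts = toy_dtm.toarray()

print(toy_terms)
print(toy_counts)
```

Each cell holds the number of times the column's term occurs in the row's document; word order within a document is discarded, which is what "bag of words" refers to.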

---

### Load cleaned TRAIN data


In [18]:
import os 
import re
import time

import numpy as np
import pandas as pd
import scipy.sparse as sp

# for ML preprocessing
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# custom (see script)
import loading_module as lm

start_time = time.time()
X_train, y_train = lm.load_clean_data('X_train')

### Save target as npy

In [19]:
proc_dir = os.path.join("..","data","3_processed","sentiment140")
y_filepath = os.path.join(proc_dir, "y_train.npy")

with open(y_filepath, 'wb') as f:
    np.save(f, y_train)

### Quick EDA

In [20]:
X_train.shape, y_train.shape

((1199999, 3), (1199999, 1))

## ISSUE:

- `loading_module` is not reproducible: the row order changes from run to run
- probably due to async multiprocessing, which does not return rows in a fixed order
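
One possible way to make the order deterministic, sketched under the assumption that the loader cleans chunks in parallel and collects them as they finish, is to tag every chunk with its position and sort on that tag before concatenating. The function and variable names below are hypothetical and not taken from `loading_module`; threads are used instead of processes only to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def clean_chunk(position, chunk):
    """Hypothetical stand-in for the real cleanup work on one chunk."""
    return position, [text.lower() for text in chunk]

chunks = [["A", "B"], ["C", "D"], ["E", "F"]]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(clean_chunk, i, c) for i, c in enumerate(chunks)]
    # as_completed yields results in completion order, which can vary between runs
    results = [f.result() for f in as_completed(futures)]

# Sorting on the position tag restores the input order regardless of timing.
ordered = [row for _, rows in sorted(results) for row in rows]
print(ordered)
```

With `multiprocessing.Pool`, `map` and `imap` already return results in input order, whereas `imap_unordered` and callback-based `apply_async` do not, so switching the call may be enough.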

In [21]:
X_train.head() 

Unnamed: 0,username,text,lemmatized
0,pareidoliac,@Auckland_Museum I was thinking precisely of y...,USERNAME i thinking precisely your late progra...
1,sugarloves,@hp4ever13 Something HP... I *heart* mine- The...,USERNAME something hp i heart mine theyre suga...
2,joshhl,So annoyed that America have The Sims 3 already,so annoyed america have sims 3 already
3,ValbuenaMusic,NEWEST SONG &quot;Hey Mr. Bossa!!!&quot; @yout...,newest song hey mr bossa USERNAME URL swing ja...
4,JackieHagerman,Got up a 5 a.m. to workout and now I'm exhaust...,got up 5 am workout now im exhausted really wi...


In [5]:
X_train.tail()

Unnamed: 0,username,text,lemmatized
1199994,taylaar,"3 finals this week But Andrew is here, so I a...",3 final this week but andrew here so i am happy
1199995,Pewari,"@carocat yeah, I have a whole list of tv shows...",USERNAME yeah i have whole list tv show i real...
1199996,jamieallison,Sorry you missed it. Good night!,sorry you missed good night
1199997,alisha_J,Yo this 16 and pregnant show on mtv is sad to ...,yo this 16 pregnant show mtv sad me i feel bad...
1199998,Shelbayyyyy,Bed. SAT in the am,bed sat am


In [7]:
y_train.head()

Unnamed: 0,target
0,0
1,0
2,1
3,1
4,1


In [8]:
y_train.tail()

Unnamed: 0,target
1199994,0
1199995,1
1199996,0
1199997,0
1199998,0


Searching for the `EMOJI` placeholder also turns up a few `NaN`s, which make `re.search` raise a `TypeError`:

In [9]:
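error_ix = []  # rows where 'lemmatized' is NaN (not a string)
emoji_ix = []  # rows containing the EMOJI placeholder
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'EMOJI', tweet):
            emoji_ix.append(i)
    except TypeError:  # re.search cannot handle NaN (a float)
        error_ix.append(i)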
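
The same two index lists can also be obtained without an explicit loop. A sketch on a made-up frame (the toy rows are an assumption; only the `lemmatized` column name comes from the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"lemmatized": ["i EMOJI m starving", np.nan, "bed sat am", "c EMOJI ta"]}
)

# na=False treats missing values as "no match" instead of propagating NaN
has_emoji = toy["lemmatized"].str.contains("EMOJI", regex=False, na=False)
is_missing = toy["lemmatized"].isna()

# Positions of the True entries, as plain Python lists
toy_emoji_ix = np.flatnonzero(has_emoji.to_numpy()).tolist()
toy_error_ix = np.flatnonzero(is_missing.to_numpy()).tolist()
print(toy_emoji_ix, toy_error_ix)
```

On 1.2 million rows the vectorized string methods are usually noticeably faster than a Python-level loop, although the loop has the advantage of making the `TypeError` cases explicit.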

In [10]:
# only a few NaNs
X_train.iloc[error_ix, ]

Unnamed: 0,username,text,lemmatized
43379,rooroocachoo,It will,
57035,sangofsorrow,He is...,
79899,SquarahFaggins,to it!!,
130300,geegeeludlow,is in IT,
310320,LukeOgle,is in IT,
474958,dianamra,and it was,
676761,Spacegirlspif13,Is... ...,
929405,Jmoux,are on..,
956389,ChickWithAName,. . . . . and it's on!,
1028263,WMonk,It will,


In [11]:
# emojis
X_train.iloc[emoji_ix[:5], ]

Unnamed: 0,username,text,lemmatized
244,PolaScheps,@ddlovato hahaha can't wait to c it! u rock gi...,USERNAME hahaha cant wait c u rock girl i real...
510,barbbs,at psychology class iÃÂ´m STARVING 2 DEATH i ...,psychology class i EMOJI m starving 2 death i ...
651,bit_crusherrr,@busta_grimes I know but Asda doesnt sell Mika...,USERNAME i know but asda doesnt sell mikado li...
657,ligiagalvao,@tommcfly You're reading the saga of twilight ...,USERNAME youre reading saga twilight i read al...
972,ruoivietnam,@Poohnine: em cÃ¯Â¿Â½ t?a ?? ti?ng Anh khÃ¯Â¿Â...,USERNAME em c EMOJI ta ting anh kh EMOJI ng an...


In [12]:
len(emoji_ix) # could be better?

10856

### Impute NAs created during cleanup

We do not want to drop these rows, since the fact that they ended up as empty strings is possibly informative.

In [13]:
X_train.isnull().sum()

username       0
text           0
lemmatized    10
dtype: int64

In [14]:
# Impute with NULL as a string
# first check that 'NULL' does not already occur in any tweet
error_ix = []
NULL_ix = []
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        error_ix.append(i)

# assign through a single .loc call to avoid chained assignment
NA_ix = X_train.index[X_train['lemmatized'].isnull()]
X_train.loc[NA_ix, 'lemmatized'] = 'NULL'

In [15]:
# double check
X_train.isnull().sum()

username      0
text          0
lemmatized    0
dtype: int64

In [16]:
# rows that now contain the NULL placeholder
NULL_ix = []  # reset so that re-running the cell does not append duplicates
for i, tweet in enumerate(X_train['lemmatized']):
    try:
        if re.search(r'NULL', tweet):
            NULL_ix.append(i)
    except TypeError:
        continue

X_train.iloc[NULL_ix, ]

Unnamed: 0,username,text,lemmatized
43379,rooroocachoo,It will,
57035,sangofsorrow,He is...,
79899,SquarahFaggins,to it!!,
130300,geegeeludlow,is in IT,
310320,LukeOgle,is in IT,
474958,dianamra,and it was,
676761,Spacegirlspif13,Is... ...,
929405,Jmoux,are on..,
956389,ChickWithAName,. . . . . and it's on!,
1028263,WMonk,It will,


In [17]:
print(X_train.loc[4, 'lemmatized'])

exam doneeeee


### Create BoW DFM

In [35]:
# lemmatized column (2) as array, ravel will flatten the structure
X_array = np.array(X_train.iloc[:, 2]).ravel()

In [43]:
X_train.iloc[:10, 2]

0    please pray my house there major water leakage...
1                                        URL bump this
2         USERNAME just saw not part your last message
3    USERNAME lol i put up 100 track u havent retwe...
4    USERNAME u got ta endorse application if u gon...
5    best night ever got spend better part half hou...
6    moved inside i am wayyyyy too white sit sun to...
7    USERNAME wish i could your signing nashville b...
8                             me USERNAME hummin along
9     USERNAME wrong suggestion doesnt translate farsi
Name: lemmatized, dtype: object

In [46]:
X_array[:10]

array(['please pray my house there major water leakage thats causing my entire house crack possibly fall apart',
       'URL bump this', 'USERNAME just saw not part your last message',
       'USERNAME lol i put up 100 track u havent retweeted 1 hun smh',
       'USERNAME u got ta endorse application if u gon na our courtside tweep USERNAME im USERNAME i USERNAME',
       'best night ever got spend better part half hour letting off firework',
       'moved inside i am wayyyyy too white sit sun too long literally im reflective haha loungin sofa',
       'USERNAME wish i could your signing nashville but i got sick cant go 3 hr drive',
       'me USERNAME hummin along',
       'USERNAME wrong suggestion doesnt translate farsi'], dtype=object)

In [47]:
X_array[4]

'USERNAME u got ta endorse application if u gon na our courtside tweep USERNAME im USERNAME i USERNAME'

In [48]:
# credit: A. Géron
from sklearn.base import BaseEstimator, TransformerMixin
from collections import Counter

class DocumentToWordCounterTransformer(BaseEstimator, TransformerMixin):
    """Turn each document into a Counter of its whitespace-separated tokens."""
    def __init__(self, replace_numbers=True):
        self.replace_numbers = replace_numbers
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X, y=None):
        X_transformed = []
        for doc in X:
            if self.replace_numbers:
                # replace digit runs with NUMBER; note that the decimal part
                # is only consumed when an exponent follows it
                doc = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', doc)
            word_counts = Counter(doc.split())
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [56]:
X_few = X_array[:10]
X_few_wordcounts = DocumentToWordCounterTransformer().fit_transform(X_few)
for word_counts in X_few_wordcounts:
    print(dict(word_counts.most_common()))

{'my': 2, 'house': 2, 'please': 1, 'pray': 1, 'there': 1, 'major': 1, 'water': 1, 'leakage': 1, 'thats': 1, 'causing': 1, 'entire': 1, 'crack': 1, 'possibly': 1, 'fall': 1, 'apart': 1}
{'URL': 1, 'bump': 1, 'this': 1}
{'USERNAME': 1, 'just': 1, 'saw': 1, 'not': 1, 'part': 1, 'your': 1, 'last': 1, 'message': 1}
{'NUMBER': 2, 'USERNAME': 1, 'lol': 1, 'i': 1, 'put': 1, 'up': 1, 'track': 1, 'u': 1, 'havent': 1, 'retweeted': 1, 'hun': 1, 'smh': 1}
{'USERNAME': 4, 'u': 2, 'got': 1, 'ta': 1, 'endorse': 1, 'application': 1, 'if': 1, 'gon': 1, 'na': 1, 'our': 1, 'courtside': 1, 'tweep': 1, 'im': 1, 'i': 1}
{'best': 1, 'night': 1, 'ever': 1, 'got': 1, 'spend': 1, 'better': 1, 'part': 1, 'half': 1, 'hour': 1, 'letting': 1, 'off': 1, 'firework': 1}
{'too': 2, 'moved': 1, 'inside': 1, 'i': 1, 'am': 1, 'wayyyyy': 1, 'white': 1, 'sit': 1, 'sun': 1, 'long': 1, 'literally': 1, 'im': 1, 'reflective': 1, 'haha': 1, 'loungin': 1, 'sofa': 1}
{'i': 2, 'USERNAME': 1, 'wish': 1, 'could': 1, 'your': 1, 'signin

In [57]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 5) # cap each document's contribution at 5
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows, cols, data = [], [], []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))
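
In the transformer above, `self.vocabulary_.get(word, 0)` sends every word outside the vocabulary to column 0. The following self-contained sketch (the vocabulary and document are made up) shows the effect: because `csr_matrix` sums duplicate `(row, col)` entries, column 0 ends up holding the total count of out-of-vocabulary tokens.

```python
from collections import Counter
from scipy.sparse import csr_matrix

toy_vocabulary = {"USERNAME": 1, "i": 2, "love": 3}  # indices start at 1
toy_doc = Counter("USERNAME i love love sushi ramen ramen".split())

rows, cols, data = [], [], []
for word, count in toy_doc.items():
    rows.append(0)
    cols.append(toy_vocabulary.get(word, 0))  # unknown words share column 0
    data.append(count)

# Duplicate (row, col) pairs are summed when the matrix is built.
toy_vector = csr_matrix((data, (rows, cols)), shape=(1, len(toy_vocabulary) + 1))
print(toy_vector.toarray())
```

Here `sushi` (1) and `ramen` (2) are both unknown, so column 0 contains 3, followed by the counts for `USERNAME`, `i` and `love`.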

In [58]:
vocab_transformer = WordCounterToVectorTransformer()
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<10x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 105 stored elements in Compressed Sparse Row format>

In [59]:
X_few_vectors.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        ...,
        [0, 1, 2, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]], dtype=int32)

In [60]:
vocab_transformer.vocabulary_

{'USERNAME': 1,
 'i': 2,
 'NUMBER': 3,
 'u': 4,
 'got': 5,
 'my': 6,
 'house': 7,
 'part': 8,
 'your': 9,
 'im': 10,
 'too': 11,
 'please': 12,
 'pray': 13,
 'there': 14,
 'major': 15,
 'water': 16,
 'leakage': 17,
 'thats': 18,
 'causing': 19,
 'entire': 20,
 'crack': 21,
 'possibly': 22,
 'fall': 23,
 'apart': 24,
 'URL': 25,
 'bump': 26,
 'this': 27,
 'just': 28,
 'saw': 29,
 'not': 30,
 'last': 31,
 'message': 32,
 'lol': 33,
 'put': 34,
 'up': 35,
 'track': 36,
 'havent': 37,
 'retweeted': 38,
 'hun': 39,
 'smh': 40,
 'ta': 41,
 'endorse': 42,
 'application': 43,
 'if': 44,
 'gon': 45,
 'na': 46,
 'our': 47,
 'courtside': 48,
 'tweep': 49,
 'best': 50,
 'night': 51,
 'ever': 52,
 'spend': 53,
 'better': 54,
 'half': 55,
 'hour': 56,
 'letting': 57,
 'off': 58,
 'firework': 59,
 'moved': 60,
 'inside': 61,
 'am': 62,
 'wayyyyy': 63,
 'white': 64,
 'sit': 65,
 'sun': 66,
 'long': 67,
 'literally': 68,
 'reflective': 69,
 'haha': 70,
 'loungin': 71,
 'sofa': 72,
 'wish': 73,
 'could'

In [29]:
#from sklearn.pipeline import Pipeline
#
#preprocess_pipeline = Pipeline([
#    ("document_to_wordcount", DocumentToWordCounterTransformer()),
#    ("wordcount_to_vector", WordCounterToVectorTransformer()),
#])

# lose the vocabulary_ ? WHERE IS THE VOCABULARY...
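
On the question in the comment above: a fitted `Pipeline` keeps its fitted steps, so learned attributes stay reachable through `named_steps`. A sketch using scikit-learn's built-in `CountVectorizer` and `TfidfTransformer` as stand-ins for the two custom transformers (the step names and toy documents are assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
toy_pipeline.fit_transform(["good night", "good morning"])

# Attributes learned during fit live on the individual steps.
toy_vocab = toy_pipeline.named_steps["counts"].vocabulary_
print(toy_vocab)
```

With the notebook's own pipeline, the equivalent lookup would presumably be `preprocess_pipeline.named_steps["wordcount_to_vector"].vocabulary_`, available once the pipeline has been fitted.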

In [61]:
#X_train_transformed = preprocess_pipeline.fit_transform(X_array)

In [57]:
X_train_transformed

<1199999x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 10232362 stored elements in Compressed Sparse Row format>

In [58]:
for i,v in enumerate(X_train_transformed[:33,:30].todense()):
    if i > 30:
        print(i,v)

31 [[2 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
32 [[1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]]


In [62]:
X_wordcounts = DocumentToWordCounterTransformer().fit_transform(X_array)

vocabulary_transformer = WordCounterToVectorTransformer()
X_vectors = vocabulary_transformer.fit_transform(X_wordcounts)

In [63]:
vocabulary_transformer.vocabulary_ # 1000! yay

{'USERNAME': 1,
 'i': 2,
 'my': 3,
 'you': 4,
 'NUMBER': 5,
 'im': 6,
 'me': 7,
 'so': 8,
 'have': 9,
 'but': 10,
 'just': 11,
 'not': 12,
 'day': 13,
 'this': 14,
 'now': 15,
 'good': 16,
 'up': 17,
 'get': 18,
 'URL': 19,
 'all': 20,
 'out': 21,
 'like': 22,
 'go': 23,
 'no': 24,
 'got': 25,
 'u': 26,
 'love': 27,
 'dont': 28,
 'work': 29,
 'do': 30,
 'today': 31,
 'your': 32,
 'going': 33,
 'too': 34,
 'time': 35,
 'cant': 36,
 'back': 37,
 'one': 38,
 'lol': 39,
 'know': 40,
 'what': 41,
 'we': 42,
 'about': 43,
 'can': 44,
 'really': 45,
 'am': 46,
 'want': 47,
 'had': 48,
 'there': 49,
 'see': 50,
 'some': 51,
 'well': 52,
 'night': 53,
 'think': 54,
 'if': 55,
 'still': 56,
 'new': 57,
 'na': 58,
 'how': 59,
 'need': 60,
 'thanks': 61,
 'home': 62,
 'when': 63,
 'oh': 64,
 'miss': 65,
 'more': 66,
 'here': 67,
 'much': 68,
 'off': 69,
 'they': 70,
 'last': 71,
 'feel': 72,
 'hope': 73,
 'make': 74,
 'morning': 75,
 'been': 76,
 'then': 77,
 'tomorrow': 78,
 'great': 79,
 'twitte

In [64]:
X_vectors

<1199999x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 10232362 stored elements in Compressed Sparse Row format>

In [72]:
import sys
np.set_printoptions(threshold=sys.maxsize)

In [103]:
# collect the nonzero columns of document 4, skipping column 0
ix, vals = [], []
for i, v in enumerate(X_vectors[4].toarray()[0]):
    if i != 0 and v != 0:
        ix.append(i)
        vals.append(v)

In [104]:
# 1001 columns: column 0 counts the tokens in this doc that are not in the vocab
X_vectors[4].toarray()

array([[4, 4, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [105]:
# vocabulary_ maps word -> column index (starting at 1), so match on the index itself
for word, index in vocabulary_transformer.vocabulary_.items():
    if index in ix:
        print(index, word)

1 USERNAME
2 i
6 im
25 got
26 u
55 if
58 na
109 ...
169 ...
264 ...


In [106]:
ix, vals

([1, 2, 6, 25, 26, 55, 58, 109, 169, 264], [4, 1, 1, 1, 2, 1, 1, 1, 1, 1])
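
The nonzero columns and their counts can also be read directly from the sparse row, without converting it to a dense array first. A sketch on a made-up row (the values are an assumption, chosen to resemble the layout above, with column 0 as the out-of-vocabulary count):

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_vectors = csr_matrix(np.array([[4, 4, 1, 0, 0, 0, 1, 0, 2]]))
toy_row = toy_vectors[0]

# indices/data hold the column positions and values of the stored entries
toy_cols, toy_counts = toy_row.indices, toy_row.data
keep = (toy_cols != 0) & (toy_counts != 0)  # drop column 0 and any explicit zeros
order = np.argsort(toy_cols[keep])          # CSR indices are not guaranteed sorted

toy_ix = toy_cols[keep][order].tolist()
toy_vals = toy_counts[keep][order].tolist()
print(toy_ix, toy_vals)
```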

In [108]:
X_array[:6]

array(['please pray my house there major water leakage thats causing my entire house crack possibly fall apart',
       'URL bump this', 'USERNAME just saw not part your last message',
       'USERNAME lol i put up 100 track u havent retweeted 1 hun smh',
       'USERNAME u got ta endorse application if u gon na our courtside tweep USERNAME im USERNAME i USERNAME',
       'best night ever got spend better part half hour letting off firework'],
      dtype=object)

In [17]:
# create BoW DFMs: unigrams only (ung) and unigrams + bigrams (big),
# each limited to the 10,000 most frequent features
bow_vectorizer_ung = CountVectorizer(max_features=10000)
bow_vectorizer_big = CountVectorizer(max_features=10000, ngram_range=(1,2))

X_bow_ung = bow_vectorizer_ung.fit_transform(X_array)
X_bow_big = bow_vectorizer_big.fit_transform(X_array)
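
Two `CountVectorizer` defaults are worth keeping in mind when comparing these matrices with the custom transformers used earlier: input is lowercased, so the `USERNAME` and `URL` placeholders become `username` and `url`, and the default `token_pattern` only keeps tokens of two or more word characters, so single-character tokens such as `i` and `u` are dropped. A small check on one of the tweets shown earlier (the alternative settings in the second vectorizer are an illustration, not what the notebook uses):

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ["USERNAME lol i put up 100 track u havent retweeted 1 hun smh"]

# Default settings: lowercase=True, token_pattern=r"(?u)\b\w\w+\b"
default_vec = CountVectorizer().fit(sample)
default_terms = sorted(default_vec.vocabulary_)

# Keep the original case and single-character tokens
keep_all_vec = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b").fit(sample)
keep_all_terms = sorted(keep_all_vec.vocabulary_)

print(default_terms)
print(keep_all_terms)
```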

In [18]:
# With ALL features: <1199999x329492 sparse matrix of type '<class 'numpy.int64'>' with 11446957 stored elements>
X_bow_ung, X_bow_big

(<1199999x10000 sparse matrix of type '<class 'numpy.int64'>'
 	with 10626030 stored elements in Compressed Sparse Row format>,
 <1199999x10000 sparse matrix of type '<class 'numpy.int64'>'
 	with 13212223 stored elements in Compressed Sparse Row format>)

In [19]:
# Only 0.0028951048 % nonzero for ALL features
def calc_sparsity(X):
    """Print the percentage of nonzero entries in sparse matrix X."""
    total_space = X.shape[0] * X.shape[1]
    total_store = X.getnnz()
    pct_nonzero = 100 * (total_store/total_space)
    print(f'Only {pct_nonzero:0.10f} % nonzero.')

calc_sparsity(X_bow_ung)

Only 0.0885503238 % nonzero.


In [21]:
calc_sparsity(X_bow_big)

Only 0.1101019501 % nonzero.
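
To see why the sparse format matters at this density, the sketch below compares the memory held by a CSR matrix with what its dense equivalent would need. It uses a small random matrix as a stand-in (shape and density are arbitrary assumptions), not the notebook's data:

```python
import scipy.sparse as sp

toy_sparse = sp.random(1000, 500, density=0.001, format="csr", random_state=0)

# CSR stores three arrays: the nonzero values, their column indices, and row pointers.
sparse_bytes = (
    toy_sparse.data.nbytes + toy_sparse.indices.nbytes + toy_sparse.indptr.nbytes
)
n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
dense_bytes = n_cells * toy_sparse.dtype.itemsize

pct_nonzero = 100 * toy_sparse.nnz / n_cells
print(f"{pct_nonzero:.2f} % nonzero: {sparse_bytes} bytes sparse vs {dense_bytes} bytes dense")
```

By the same arithmetic, a dense `int64` version of the 1199999x10000 matrices above would need roughly 96 GB.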


### Save BoW

In [22]:
savepath = os.path.join("..","data","3_processed","sentiment140")
sp.save_npz(os.path.join(savepath, 'X_bow_ung.npz'), X_bow_ung)
sp.save_npz(os.path.join(savepath, 'X_bow_big.npz'), X_bow_big)

# print total running time
mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed Time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Elapsed Time: 7 minute(s) and 5 second(s)
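
For completeness, a saved matrix can be read back with `sp.load_npz`. A minimal round trip on a made-up matrix, written to a temporary directory rather than the project's data folder:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

toy_matrix = sp.csr_matrix(np.array([[0, 2, 0], [1, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_bow.npz")
    sp.save_npz(toy_path, toy_matrix)
    reloaded = sp.load_npz(toy_path)  # fully loaded into memory

# No differing entries means the reloaded matrix matches the original.
same = (reloaded != toy_matrix).nnz == 0
print(reloaded.format, reloaded.shape, same)
```

Note that the `.npz` file holds only the matrix. If the test set is to be transformed later with the same vocabulary, the fitted vectorizers would need to be saved separately, for example with `joblib.dump`.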


---