# Twitter Sentiment Analysis - POC
---

## 4. Cleaning Pipeline

This is the most involved part and where the POC really begins. For the final project I will not sample the data; here I use a very small sample so that I can iterate quickly and move forward with the project.


## POC Only - Sample Data

In [1]:
import os
import time

import numpy as np
import pandas as pd

# time notebook
start_notebook = time.time()

# load minimally prepared X, y train subsets
raw_path = os.path.join("..","data","1_raw","sentiment140")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

In [48]:
# sample down considerably to X, y sample subsets
from sklearn.model_selection import train_test_split

X, X_rest, y, y_rest = train_test_split(X_train, y_train, test_size=0.9999, random_state=158)

From here on, the `_rest` datasets are set aside and the small `X`, `y` subsets are treated as if they were the entire training data.

In [49]:
print(f'Dataset size: {len(X):0.0f}')
print(f'Target distribution: {sum(y["target"])/len(y):0.3f}')

Dataset size: 119
Target distribution: 0.521


In [50]:
X.head(10)

| | ID | username | tweet |
|---|---|---|---|
| 848825 | 1827547913 | CarissaCruz | @CassXavier hahaha. yes, i know. it's good fo... |
| 147277 | 2204021385 | CarlaCh | @shawnieora Been sad lately. Just found out my... |
| 755568 | 2257571603 | Joulez217 | @conjunkie ah see my hotel only booked till sa... |
| 15878 | 1550708931 | himmelgarten | Google thinks the Cafe is a spam blog. They'r... |
| 291177 | 1977512948 | murnisitanggang | i love sunday |
| 1116805 | 1981582888 | noodles2007 | walkin the zoo. already seen the monkeys and b... |
| 401889 | 1827697408 | haylz4000 | i'm helping my friend with his maths |
| 875770 | 2177034912 | blaisegv | @alisonmichalk My pleasure |
| 424813 | 1983995946 | brycefury | No up today |
| 780553 | 1823735990 | msunique85 | @deangeloredman @bmarzmusic feeling neglected |


In [51]:
# create an array of Tweets
X_array = np.array(X.iloc[:, 2]).ravel()
X_array[:10]

array(["@CassXavier hahaha. yes, i know.  it's good for him. and us! ;)",
       '@shawnieora Been sad lately. Just found out my sister has stage 1 colon cancer. I already lost a sister ',
       "@conjunkie ah see my hotel only booked till sat. Sorry  I didn't book my room another friend did who can't do full weekend.",
       "Google thinks the Cafe is a spam blog.  They're recognised by &quot;irrelevant, repetitive, or nonsensical text&quot;.  That's told me ",
       'i love sunday ',
       'walkin the zoo. already seen the monkeys and birds and hell of a lot of all there animals.. this is fun ',
       "i'm helping my friend with his maths ",
       '@alisonmichalk My pleasure ', 'No up today ',
       '@deangeloredman @bmarzmusic feeling neglected '], dtype=object)

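Converting to a NumPy array drops the DataFrame index, so I store the original row labels in a column and then reset the index to match the array positions.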

In [52]:
X.insert(3, 'index', X.index)
X.index = range(len(X))
X.head()

| | ID | username | tweet | index |
|---|---|---|---|---|
| 0 | 1827547913 | CarissaCruz | @CassXavier hahaha. yes, i know. it's good fo... | 848825 |
| 1 | 2204021385 | CarlaCh | @shawnieora Been sad lately. Just found out my... | 147277 |
| 2 | 2257571603 | Joulez217 | @conjunkie ah see my hotel only booked till sa... | 755568 |
| 3 | 1550708931 | himmelgarten | Google thinks the Cafe is a spam blog. They'r... | 15878 |
| 4 | 1977512948 | murnisitanggang | i love sunday | 291177 |


In [53]:
import re
import urlextract

from html import unescape
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin

def is_ascii(doc):
    """Return True if `doc` contains only ASCII characters."""
    try:
        doc.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()

class DocumentToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, lower_case=True, replace_usernames=True,
                 unescape_html=True, replace_urls=True, 
                 replace_numbers=True, remove_junk=True, 
                 remove_punctuation=True, replace_emojis=True,
                 replace_nonascii=True, tokenize=True, 
                 remove_stopwords=True, lemmatization=True):
        self.lower_case = lower_case
        self.replace_usernames = replace_usernames
        self.unescape_html = unescape_html
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.remove_junk = remove_junk
        self.remove_punctuation = remove_punctuation
        self.replace_emojis = replace_emojis
        self.replace_nonascii = replace_nonascii
        self.tokenize = tokenize
        self.remove_stopwords = remove_stopwords
        self.lemmatization = lemmatization
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for doc in X:
            if self.lower_case:
                doc = doc.lower()
            if self.replace_usernames:
                doc = re.sub(r'@([^\s]+)',' USERNAME ', doc)
            if self.unescape_html:
                doc = unescape(doc)
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(doc)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    doc = doc.replace(url, ' URL ')
            if self.replace_numbers:
                doc = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', ' NUMBER ', doc)
            if self.remove_junk:
                # Stray symbols and C1 control bytes to drop. A backslash-newline
                # inside a raw string is not a line continuation, so the pattern
                # is built from adjacent string literals instead.
                pattern = (r'[¥â«»ÑÐ¼½¾!?¿°µ´º¹³'
                           r'\x82-\x8e]')
                doc = re.sub(pattern, '', doc)
            if self.remove_punctuation:
                doc = re.sub(r'\W+', ' ', doc, flags=re.M)
            if self.replace_emojis:
                doc = re.sub(r'[^\x00-\x7F]+', ' EMOJI ', doc)
            if self.replace_nonascii:
                if not is_ascii(doc):
                    doc = ' NONASCII '
            word_counts = Counter(doc.split())
            if self.remove_stopwords:
                # 25 semantically non-selective words from the Reuters-RCV1 dataset,
                # plus single letters left behind by contractions (e.g. the "t" in "don't")
                stop_words = ['a','an','and','are','as','at','be','by','for','from',
                              'has','he','in','is','it','its','of','on','that','the',
                              'to','was','were','will','with','t','s','d','m']
                for word in stop_words:
                    # drop the word if present; a missing key is not an error
                    word_counts.pop(word, None)
            if self.lemmatization and lemmatizer is not None:
                lemmatized_word_counts = Counter()
                for word, count in word_counts.items():
                    lemmatized_word = lemmatizer.lemmatize(word)
                    lemmatized_word_counts[lemmatized_word] += count
                word_counts = lemmatized_word_counts      
            X_transformed.append(word_counts)
        return np.array(X_transformed)
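
The order of the steps in `transform` matters. Below is a minimal sketch, using only the standard library, of a subset of those steps applied to one tweet from the sample. The name `clean_tweet` is hypothetical, the number pattern is simplified to integers, and URL extraction, junk removal, stop words and lemmatization are left out, so this is an illustration rather than a replacement for the class above.

```python
import re
from collections import Counter
from html import unescape

def clean_tweet(doc):
    """Apply a subset of the transformer's steps, in the same order."""
    doc = doc.lower()                               # lower-case before adding placeholders
    doc = re.sub(r'@([^\s]+)', ' USERNAME ', doc)   # mentions -> placeholder
    doc = unescape(doc)                             # decode HTML entities such as &quot;
    doc = re.sub(r'\d+', ' NUMBER ', doc)           # simplified: integers only
    doc = re.sub(r'\W+', ' ', doc)                  # punctuation -> single space
    return Counter(doc.split())

tweet = "@shawnieora Been sad lately. Just found out my sister has stage 1 colon cancer."
counts = clean_tweet(tweet)
print(counts)

# Why entities are decoded before punctuation is stripped:
raw = "recognised by &quot;irrelevant&quot;"
without_unescape = re.sub(r'\W+', ' ', raw).split()          # 'quot' leaks in as a token
with_unescape = re.sub(r'\W+', ' ', unescape(raw)).split()   # clean tokens only
print(without_unescape)
print(with_unescape)
```

Because lower-casing runs first, the upper-case placeholders `USERNAME` and `NUMBER` cannot collide with ordinary words in the tweet.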

In [54]:
X_wordcounts = DocumentToWordCounterTransformer(tokenize=False).fit_transform(X_array)

In [55]:
len(X_wordcounts)

119

In [56]:
for counter in X_wordcounts[:10]:
    counter = str(counter).split("(")[1]
    print(counter)

{'USERNAME': 1, 'hahaha': 1, 'yes': 1, 'i': 1, 'know': 1, 'good': 1, 'him': 1, 'u': 1})
{'sister': 2, 'USERNAME': 1, 'been': 1, 'sad': 1, 'lately': 1, 'just': 1, 'found': 1, 'out': 1, 'my': 1, 'stage': 1, 'NUMBER': 1, 'colon': 1, 'cancer': 1, 'i': 1, 'already': 1, 'lost': 1})
{'my': 2, 'USERNAME': 1, 'ah': 1, 'see': 1, 'hotel': 1, 'only': 1, 'booked': 1, 'till': 1, 'sat': 1, 'sorry': 1, 'i': 1, 'didn': 1, 'book': 1, 'room': 1, 'another': 1, 'friend': 1, 'did': 1, 'who': 1, 'can': 1, 'do': 1, 'full': 1, 'weekend': 1})
{'google': 1, 'think': 1, 'cafe': 1, 'spam': 1, 'blog': 1, 'they': 1, 're': 1, 'recognised': 1, 'irrelevant': 1, 'repetitive': 1, 'or': 1, 'nonsensical': 1, 'text': 1, 'told': 1, 'me': 1})
{'i': 1, 'love': 1, 'sunday': 1})
{'walkin': 1, 'zoo': 1, 'already': 1, 'seen': 1, 'monkey': 1, 'bird': 1, 'hell': 1, 'lot': 1, 'all': 1, 'there': 1, 'animal': 1, 'this': 1, 'fun': 1})
{'i': 1, 'helping': 1, 'my': 1, 'friend': 1, 'his': 1, 'math': 1})
{'USERNAME': 1, 'my': 1, 'pleasure':

In [57]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [58]:
vocab_transformer = WordCounterToVectorTransformer()
X_vectors = vocab_transformer.fit_transform(X_wordcounts)
X_vectors

<119x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 1187 stored elements in Compressed Sparse Row format>

In [59]:
len(vocab_transformer.vocabulary_)

640
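
The vocabulary holds 640 words even though the matrix has 1,001 columns: `vocabulary_size=1000` is only an upper bound, the sample contains fewer distinct tokens than that, and one extra column (index 0) is reserved for words that are not in the vocabulary. The sketch below reproduces that logic with plain Python on two toy documents; `build_vocabulary` and the toy counters are hypothetical names and data used only for illustration.

```python
from collections import Counter

def build_vocabulary(word_counts, vocabulary_size):
    """Map the most frequent words to indices 1..k; index 0 stays free for unknown words."""
    total = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            total[word] += min(count, 10)   # same per-document cap as the transformer
    most_common = total.most_common(vocabulary_size)
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}

# toy documents (hypothetical, not taken from the sample)
docs = [Counter({"i": 2, "love": 1, "sunday": 1}),
        Counter({"i": 1, "love": 1, "monday": 1})]

vocabulary_size = 1000
vocab = build_vocabulary(docs, vocabulary_size)
n_columns = vocabulary_size + 1             # +1 for the out-of-vocabulary column

print(len(vocab), n_columns)
print(vocab.get("tuesday", 0))              # unseen word falls back to column 0
```

With fewer distinct words than `vocabulary_size`, the trailing columns of the matrix simply stay empty, which costs nothing in a sparse format.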

In [60]:
# from original data
X.loc[19:21,]

| | ID | username | tweet | index |
|---|---|---|---|---|
| 19 | 2063447346 | dearscarlett | I accidentally scratched jeremie and she start... | 1058998 |
| 20 | 1562359688 | Mizzgp10 | @PALMTREEENT lol I'm okay lol but I'm m@dd cuz... | 14605 |
| 21 | 1881043184 | markrfletcher | @SaritaAgerman But still 2 more to go And one... | 2906 |


In [61]:
# convert once, then show the first ten columns of rows 19-21
dense = X_vectors.toarray()
for i in range(19, 22):
    print(i, dense[i][:10])

19 [0 1 0 0 0 0 0 0 0 0]
20 [0 3 2 0 1 0 0 0 0 0]
21 [0 1 1 0 1 0 0 0 0 0]


In [62]:
X_array[19:22]

array(['I accidentally scratched jeremie and she started bleeding.  note to self: clip nails...',
       "@PALMTREEENT lol I'm okay lol but I'm m@dd cuz I don't get 2 meet yall nxt week when yall come down here! ",
       '@SaritaAgerman But still 2 more to go  And one of them is MediEVIL. I hate it.'],
      dtype=object)

In [63]:
for k,v in vocab_transformer.vocabulary_.items():
    if v < 11:
        print(v, k)

1 i
2 USERNAME
3 my
4 NUMBER
5 you
6 good
7 day
8 this
9 today
10 have


In [64]:
X_wordcounts[20]

Counter({'USERNAME': 2,
         'lol': 2,
         'i': 3,
         'okay': 1,
         'but': 1,
         'cuz': 1,
         'don': 1,
         'get': 1,
         'NUMBER': 1,
         'meet': 1,
         'yall': 2,
         'nxt': 1,
         'week': 1,
         'when': 1,
         'come': 1,
         'down': 1,
         'here': 1})
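
As a sanity check, the counter for tweet 20 can be mapped by hand onto the first ten columns of its vector. The sketch below uses only the ten vocabulary entries printed earlier; it assumes every other word in the counter has an index of 11 or higher in the full vocabulary, so those words are skipped rather than sent to column 0.

```python
from collections import Counter

# first ten vocabulary entries as printed earlier (position + 1 = column index)
top_words = ["i", "USERNAME", "my", "NUMBER", "you", "good", "day", "this", "today", "have"]
vocabulary = {word: index + 1 for index, word in enumerate(top_words)}

# the counter printed for tweet 20
counts = Counter({'USERNAME': 2, 'lol': 2, 'i': 3, 'okay': 1, 'but': 1, 'cuz': 1,
                  'don': 1, 'get': 1, 'NUMBER': 1, 'meet': 1, 'yall': 2, 'nxt': 1,
                  'week': 1, 'when': 1, 'come': 1, 'down': 1, 'here': 1})

first_ten = [0] * 10   # columns 0..9 of the row
for word, count in counts.items():
    column = vocabulary.get(word)
    # assumption: words missing from this partial vocabulary live in later columns
    if column is not None and column < 10:
        first_ten[column] += count

print(first_ten)
```

The result agrees with row 20 of the matrix output: three occurrences of `i` in column 1, two of `USERNAME` in column 2 and one `NUMBER` in column 4.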

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("document_to_wordcount", DocumentToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_array)

In [66]:
X_train_transformed

<119x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 1187 stored elements in Compressed Sparse Row format>

In [67]:
# 1-D array of labels, taken from the same "target" column used above
y_array = y["target"].to_numpy()

In [76]:
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_array, cv=5, verbose=3, scoring='accuracy')
print('Mean accuracy: ' + str(score.mean()))

[CV]  ................................................................
[CV] .................................... , score=0.708, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.542, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.708, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.542, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.652, total=   0.0s
Mean accuracy: 0.6304347826086957


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [77]:
from sklearn.naive_bayes import MultinomialNB

NB_clf = MultinomialNB()
score = cross_val_score(NB_clf, X_train_transformed, y_array, cv=5, verbose=3, scoring='accuracy')
print('Mean accuracy: ' + str(score.mean()))

[CV]  ................................................................
[CV] .................................... , score=0.875, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.667, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.667, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.667, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.609, total=   0.0s
Mean accuracy: 0.6967391304347825


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
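
With only 119 tweets, each fold is tiny, so the fold scores move in large steps. The sketch below is a back-of-the-envelope check, not logged output: it assumes the five folds hold 24, 24, 24, 24 and 23 tweets, and the correct-prediction counts are inferred from the rounded fold scores printed above.

```python
from fractions import Fraction

# assumed fold sizes for 119 tweets split five ways
fold_sizes = [24, 24, 24, 24, 23]

# correct predictions per fold, inferred from the rounded scores (an inference)
logreg_correct = [17, 13, 17, 13, 15]   # 0.708, 0.542, 0.708, 0.542, 0.652
nb_correct = [21, 16, 16, 16, 14]       # 0.875, 0.667, 0.667, 0.667, 0.609

def mean_accuracy(correct, sizes):
    """Average the per-fold accuracies exactly, then convert to float."""
    scores = [Fraction(c, n) for c, n in zip(correct, sizes)]
    return float(sum(scores) / len(scores))

logreg_mean = mean_accuracy(logreg_correct, fold_sizes)
nb_mean = mean_accuracy(nb_correct, fold_sizes)
step = 1 / 24   # accuracy change caused by a single tweet in a 24-tweet fold

print(logreg_mean, nb_mean, round(step, 3))
```

Under these assumptions the reconstructed means match the reported ones, and one tweet shifts a fold's accuracy by roughly four percentage points, so the gap between the two classifiers should be read cautiously at this sample size.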


---