# Cleanup Pipeline 1

This notebook develops the cleanup pipeline; it does not implement it. The implementation lives in `cleanup_module.py`, which is imported into a notebook or script.

Preprocessing targets a simple Bag-of-Words representation. Text-to-DFM (document-feature matrix) representations are explained in more detail in the [Document Term Matrices notebook](10.extra_Document_Term_Matrices.ipynb).

---

In [1]:
import re
import os
import time
import json

import numpy as np
import pandas as pd

import urlextract
from html import unescape
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin

In [2]:
# load minimally prepared X, y train subsets
raw_path = os.path.join("..","data","1_raw","sentiment140")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

In [3]:
X_train.shape, y_train.shape

((1197471, 3), (1197471, 1))

### Sample down

Working with the entire training set is time-consuming and unnecessary for developing a solution, so I'm sampling down considerably here, from ~1.2M to ~10k instances.

In [4]:
from sklearn.model_selection import train_test_split

X, X_rest, y, y_rest = train_test_split(X_train, y_train, test_size=0.9915, random_state=42)

In [5]:
print(f'Dataset size: {len(X):0.0f}')
print(f'Target distribution: {sum(y["target"])/len(y):0.3f}')

Dataset size: 10178
Target distribution: 0.502
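
The sampled size can be checked by hand. This sketch assumes that `train_test_split` rounds the held-out share up and gives the remainder to the kept split, which is consistent with the 10178 rows reported above; the variable names are illustrative only.

```python
import math

n_samples = 1_197_471      # rows in X_train, from the shape shown earlier
test_size = 0.9915         # fraction passed to train_test_split

# Assumption: the held-out share is rounded up and the rest is kept.
n_rest = math.ceil(test_size * n_samples)
n_kept = n_samples - n_rest

print(n_kept)
```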


In [6]:
X.head(5)

Unnamed: 0,ID,username,tweet
746256,1962093226,phdbre,@lruettimann promise me you'll stop by CPP's b...
1055340,2232997122,DUNX,On the rooftop floor. Theres a gym I climbed b...
452532,1469846269,joeypiet,new incubus song
717583,1969116195,theitalianjob,@theitalianjob: funkciï¿½kkal szï¿½vjak
23382,2183646936,Yema,"Teehee, I just looked up &quot;Yema&quot; in t..."


Next we extract the tweet column into an array. The array does not carry the shuffled row labels left over from random sampling, so we first store them in an `index` column and reset the DataFrame index to 0..n-1:

In [7]:
X.insert(3, 'index', X.index)
X.index = range(len(X))
X.head()

Unnamed: 0,ID,username,tweet,index
0,1962093226,phdbre,@lruettimann promise me you'll stop by CPP's b...,746256
1,2232997122,DUNX,On the rooftop floor. Theres a gym I climbed b...,1055340
2,1469846269,joeypiet,new incubus song,452532
3,1969116195,theitalianjob,@theitalianjob: funkciï¿½kkal szï¿½vjak,717583
4,2183646936,Yema,"Teehee, I just looked up &quot;Yema&quot; in t...",23382


In [8]:
# create an array of Tweets
X_array = np.array(X.iloc[:, 2]).ravel()
X_array[:2]

array(["@lruettimann promise me you'll stop by CPP's booth this year at ASTD!  ",
       "On the rooftop floor. Theres a gym I climbed but sakurity wouldnt let me go down the slide.  shit's craaazy!"],
      dtype=object)

In [9]:
y[:2]

Unnamed: 0,target
746256,1
1055340,0


### Acknowledgements

The contraction map and `expand_contractions` function were adapted from this [KDnuggets article.](https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html)

The **DocumentToWordCounterTransformer** class was inspired by Aurélien Géron's **EmailToWordCounterTransformer** class from his famous [classification notebook](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb), and the **WordCounterToVectorTransformer** is a straight copy of Géron's class of the same name.

In [10]:
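def expand_contractions(text, contractions_map):
    """Replace each contraction in `text` with its expansion from `contractions_map`."""
    # one alternation over all known contractions, matched case-insensitively
    pattern = re.compile('({})'.format('|'.join(contractions_map.keys())),
                         flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # try the exact spelling first, then fall back to the lower-cased key
        expanded_contraction = (contractions_map.get(match)
                                or contractions_map.get(match.lower()))
        # keep the matched first character so capitalization is preserved
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = pattern.sub(expand_match, text)
    # drop any apostrophes left over after expansion
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text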

def is_ascii(doc):
    try:
        doc.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
           
class DocumentToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, expand_contractions=True, lower_case=True, 
                 replace_usernames=True, unescape_html=True, 
                 replace_urls=True, replace_numbers=True, 
                 remove_junk=True, remove_punctuation=True, 
                 replace_emojis=True, replace_nonascii=True, 
                 remove_stopwords=True, lemmatization=True):
        self.expand_contractions = expand_contractions
        self.lower_case = lower_case
        self.replace_usernames = replace_usernames
        self.unescape_html = unescape_html
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.remove_junk = remove_junk
        self.remove_punctuation = remove_punctuation
        self.replace_emojis = replace_emojis
        self.replace_nonascii = replace_nonascii
        self.remove_stopwords = remove_stopwords
        self.lemmatization = lemmatization
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for doc in X:
            if self.lower_case:
                doc = doc.lower()
            if self.expand_contractions and contractions_map is not None:
                doc = expand_contractions(doc, contractions_map)
            if self.replace_usernames:
                doc = re.sub(r'^@([^\s]+)',' USERNAME ', doc)
            if self.unescape_html:
                doc = unescape(doc)
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(doc)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    doc = doc.replace(url, ' URL ')
            if self.replace_numbers:
                doc = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', ' NUMBER ', doc)
            if self.remove_junk:
                # remove mis-encoding debris, fractions, and ! ? ¿ in one character class
                pattern = r'[¥â«»ÑÐ¼½¾!?¿\x82-\x8e°µ´º¹³]'
                doc = re.sub(pattern, '', doc)
            if self.remove_punctuation:
                doc = re.sub(r'\W+', ' ', doc, flags=re.M)
            if self.replace_emojis:
                doc = re.sub(r'[^\x00-\x7F]+', ' EMOJI ', doc)
            if self.replace_nonascii:
                if not is_ascii(doc):
                    doc = ' NONASCII '
            word_counts = Counter(doc.split())
            if self.remove_stopwords:
                # 25 semantically non-selective words from the Reuters-RCV1 dataset
                stop_words = ['a','an','and','are','as','at','be','by','for','from',
                              'has','he','in','is','it','its','of','on','that','the',
                              'to','was','were','will','with']
                for word in stop_words:
                    # pop with a default so absent stop words are skipped silently
                    word_counts.pop(word, None)
            if self.lemmatization and lemmatizer is not None:
                lemmatized_word_counts = Counter()
                for word, count in word_counts.items():
                    lemmatized_word = lemmatizer.lemmatize(word)
                    lemmatized_word_counts[lemmatized_word] += count
                word_counts = lemmatized_word_counts      
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [11]:
with open("contractions_map.json") as f:
    contractions_map = json.load(f)

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()
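
The contents of `contractions_map.json` are not shown in this notebook. As a rough illustration only, assuming the file maps each contraction to its expanded form, a toy map and a simplified expansion step might look like the sketch below. The three entries and the helper name `expand_simple` are hypothetical, and the helper is a reduced stand-in for `expand_contractions`, not a replacement.

```python
import re

# Hypothetical excerpt; the real contractions_map.json is not shown here.
toy_map = {"can't": "cannot", "i'm": "i am", "you'll": "you will"}

# Longest keys first so a longer contraction wins over a shorter one.
keys = sorted(toy_map, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys), flags=re.IGNORECASE)

def expand_simple(text):
    # Look each match up in lower case, since the toy keys are lower case.
    return pattern.sub(lambda m: toy_map[m.group(0).lower()], text)

print(expand_simple("i'm sure you'll see why we can't stop"))
```

Unlike `expand_contractions`, this reduced version does not preserve the capitalization of the first character.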

In [12]:
X_wordcounts = DocumentToWordCounterTransformer().fit_transform(X_array)

In [13]:
for ix, counter in enumerate(X_wordcounts[13:18]):
    counter = str(counter).split("(")[1]
    print(f'Counter {ix+13:0.0f} => {counter}\n')

Counter 13 => {'USERNAME': 1, 'no': 1, 'question': 1, 'nothing': 1, 'can': 1, 'beat': 1, 'firefox': 1})

Counter 14 => {'me': 2, 'USERNAME': 1, 'hiya': 1, 'maybe': 1, 'you': 1, 'able': 1, 'enlighten': 1, 'why': 1, 'NUMBER': 1, 'first': 1, 'song': 1, 'rule': 1, 'concert': 1, 'photographer': 1, 'like': 1, 'wt': 1})

Counter 15 => {'am': 2, 'going': 2, 'look': 1, 'like': 1, 'my': 1, 'kizzy': 1, 'okay': 1, 'so': 1, 'mighty': 1, 'relieved': 1, 'missing': 1, 'her': 1, 'though': 1, 'exhausted': 1, 'after': 1, 'work': 1, 'feeding': 1, 'others': 1, 'then': 1, 'bed': 1})

Counter 16 => {'congrats': 1, 'all': 1, 'newest': 1, 'linfield': 1, 'wildcat': 1, 'alumnus': 1})

Counter 17 => {'USERNAME': 1, 'bte': 1, 'nola': 1, 'tonight': 1})
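
The counters above are the result of several regex steps applied in sequence. As a stripped-down illustration (not the full transformer), the sketch below applies only lower-casing, the leading-username rule, the number rule, and punctuation removal to a made-up tweet. `normalize` is a hypothetical helper name; URL handling, contraction expansion, stop words, and lemmatization are left out.

```python
import re
from collections import Counter

def normalize(doc):
    doc = doc.lower()
    doc = re.sub(r'^@([^\s]+)', ' USERNAME ', doc)              # leading @mention only
    doc = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', ' NUMBER ', doc)  # digits -> placeholder
    doc = re.sub(r'\W+', ' ', doc)                              # punctuation -> space
    return Counter(doc.split())

counts = normalize("@sam Sold 3 tickets, 3 more left!!")
print(counts)
```

Because lower-casing runs first, the upper-case placeholders `USERNAME` and `NUMBER` cannot collide with ordinary words in the tweet.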



In [14]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [15]:
vocab_transformer = WordCounterToVectorTransformer()
X_vectors = vocab_transformer.fit_transform(X_wordcounts) 
X_vectors

<10178x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 89988 stored elements in Compressed Sparse Row format>

In [16]:
vocab_transformer.vocabulary_size

1000

In [17]:
# from original data
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated)
print(X.loc[15:15, 'tweet'])

15    looks like my Kizzy will be okay, so am mighty relieved, missing her though  Am exhausted after work. Going feeding others then going bed.
Name: tweet, dtype: object


In [18]:
print(X_vectors.toarray()[15][0:11]) 

[5 0 0 0 0 1 0 2 0 0 1]


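- position 0 counts the words that are not in the vocabulary (5 here)
- the first 1, at position 5, is **my** (see the vocabulary listing below), "looks like *my* Kizzy"
- the 2, at position 7, is **am**, "so *am* mighty... *Am* exhausted"
- the last 1, at position 10, is **so**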
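
To make the index layout concrete, here is a small pure-Python sketch of the same mapping. It uses a made-up three-word vocabulary rather than the fitted one, and `to_row` is a hypothetical helper that builds one dense row instead of a sparse matrix.

```python
from collections import Counter

# Hypothetical vocabulary: rank 1 is the most frequent word.
vocabulary = {"my": 1, "am": 2, "so": 3}

def to_row(word_counts, vocabulary):
    row = [0] * (len(vocabulary) + 1)      # slot 0 is the out-of-vocabulary bucket
    for word, count in word_counts.items():
        row[vocabulary.get(word, 0)] += count
    return row

row = to_row(Counter({"am": 2, "my": 1, "so": 1, "kizzy": 1, "bed": 1}), vocabulary)
print(row)
```

In the transformer itself the accumulation into slot 0 happens implicitly, because `csr_matrix` sums entries that share the same row and column.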

In [19]:
for k,v in vocab_transformer.vocabulary_.items():
    if v < 11:
        print(v, k)

1 i
2 USERNAME
3 NUMBER
4 you
5 my
6 not
7 am
8 have
9 me
10 so


In [20]:
# array version
print(X_array[15:16])

['looks like my Kizzy will be okay, so am mighty relieved, missing her though  Am exhausted after work. Going feeding others then going bed.']


In [21]:
# word counter version
print(X_wordcounts[15])

Counter({'am': 2, 'going': 2, 'look': 1, 'like': 1, 'my': 1, 'kizzy': 1, 'okay': 1, 'so': 1, 'mighty': 1, 'relieved': 1, 'missing': 1, 'her': 1, 'though': 1, 'exhausted': 1, 'after': 1, 'work': 1, 'feeding': 1, 'others': 1, 'then': 1, 'bed': 1})


#### Using Pipeline

In [22]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("document_to_wordcount", DocumentToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_array)

In [23]:
X_train_transformed

<10178x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 89988 stored elements in Compressed Sparse Row format>

In [24]:
y_array = y.iloc[:, 0].to_numpy()

#### Train a couple of baseline models

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

NB_clf = MultinomialNB()
score = cross_val_score(NB_clf, X_train_transformed, y_array, cv=5, verbose=3, scoring='accuracy')
print('Mean accuracy: ' + str(score.mean()))

[CV]  ................................................................
[CV] .................................... , score=0.733, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.745, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.727, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.737, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.743, total=   0.0s
Mean accuracy: 0.7371788398507456


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


In [26]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_array, cv=5, verbose=3, scoring='accuracy')
print('Mean accuracy: ' + str(score.mean()))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.737, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.741, total=   0.1s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] .................................... , score=0.732, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.730, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.742, total=   0.1s
Mean accuracy: 0.7361956526985998


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
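
With fold scores this close, the spread across folds is worth reading alongside the mean. The sketch below uses the rounded per-fold accuracies printed above, so its means differ slightly from the unrounded values reported by the notebook.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the verbose output above (3 decimal places).
nb_scores = [0.733, 0.745, 0.727, 0.737, 0.743]
lr_scores = [0.737, 0.741, 0.732, 0.730, 0.742]

for name, scores in [("MultinomialNB", nb_scores), ("LogisticRegression", lr_scores)]:
    print(f"{name}: mean={mean(scores):.3f}, std={stdev(scores):.3f}")
```

On these numbers the gap between the two means is smaller than either standard deviation, so this run alone does not separate the two baselines.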


---