## Data Cleaning


This notebook takes the data created in the **Data Collection** notebook and applies standard text preprocessing.  

**Begin by removing items from the text that are not needed because they add no value to the classification model:**

* URLs  
* hashtags and Twitter @usernames  
* Emoji  
* Punctuation  

**Next, we perform several common NLP preprocessing tasks:**

* Tokenization
* Removal of Stopwords  
* Lemmatization
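
As a rough illustration of the removal steps listed above, the sketch below applies them to a single made-up tweet using only the standard library. The sample text and the helper name `strip_noise` are hypothetical and not part of the notebook, and the final whitespace collapse is an extra convenience added for readability.

```python
import re

def strip_noise(text):
    """Remove URLs, @usernames, non-ASCII characters and punctuation."""
    text = re.sub(r"https?://\S+", "", text)                # URLs
    text = re.sub(r"@\S+", "", text)                        # Twitter @usernames
    text = text.encode("ascii", "ignore").decode("ascii")   # emoji and other non-ASCII
    text = re.sub(r"[^\w\s]", "", text)                     # punctuation, including '#'
    return " ".join(text.split())                           # collapse leftover whitespace

# Hypothetical sample tweet, for illustration only
sample = "RT @someone: Great day! 🎉 #vote https://example.com/x"
print(strip_noise(sample))
```

Note that in this sketch only the `#` symbol is dropped; the hashtag word itself (`vote`) survives, and the `RT` marker is left for the stopword step.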

In [8]:
import pandas as pd

import numpy as np
import os
import pickle
import boto3
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

import re
import spacy
#nlp = spacy.load("en_core_web_sm")

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
#nlp.add_pipe(nlp.create_pipe('sentencizer'))

import nltk
from nltk import FreqDist
import string

from tqdm.notebook import tqdm
tqdm.pandas()  # registers `progress_apply` on pandas objects

## Download the Data from AWS S3

To keep the code repository small, I have stored all of the data in an S3 object store; the other option would be to use Git LFS. All of the intermediate data has been stored as .pkl (pickle) files, which is a convenient and portable way to serialize most Python objects.  

`tweet_df.pkl` is a serialized pandas DataFrame.
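
To make the serialization idea concrete, here is a minimal sketch of a pickle round trip using an in-memory buffer. The `record` dictionary is a hypothetical stand-in for the DataFrame; the same `pickle.dump` and `pickle.load` calls apply to other picklable objects.

```python
import io
import pickle

# Hypothetical stand-in for the real data
record = {"tweet": "Great to meet the team today", "class": "L"}

buffer = io.BytesIO()
pickle.dump(record, buffer)     # serialize the object to bytes
buffer.seek(0)                  # rewind before reading
restored = pickle.load(buffer)  # rebuild an equal object from the bytes

print(restored == record)
```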

In [9]:
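# Download the pickle from S3 into a temporary local file
with open('outdata/tweet_df.pkl', 'wb') as data:
    s3.Bucket(bucket_name).download_fileobj('tweet_df.pkl', data)

# Load the DataFrame, then delete the local copy
tweet_df = pd.read_pickle('outdata/tweet_df.pkl')
os.remove('outdata/tweet_df.pkl')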

In [17]:
tweet_df.sample(10)

Unnamed: 0,tweet,class
144834,"Two years after the Emancipation Proclamation,...",L
1061475,RT @HelenRosenthal: Insights about how to addr...,L
81576,I spoke with @deni_kamper of @KNWAnews about C...,C
1072562,UPDATE ON OUR TOWNHALL!!! We have a new locati...,L
1196154,A majority of Americans support background che...,L
859700,"My wife and I raised our four kids in Bozeman,...",C
1253233,RT @GunnelsWarren: Spoiler alert: Jamie Dimon ...,L
677861,We should open an impeachment inquiry so we ca...,L
110756,RT @SXMProgress: “I think [healthcare] is the ...,L
1164044,We mourn the loss of two Georgia heroes. My co...,C


## Import Stopwords from NLTK and Define Text Cleaning Functions

NLTK keeps a list of "stopwords". These are words that typically appear most often in a text but add very little substance to the analysis. Examples of stopwords are "the", "an", and "a".

We can also add words to the list of stopwords. This is done on a project by project basis dependent upon the origin of the text. In our case the corpus came from Twitter so we know a good portion of it will start with "RT" which stands for "retweet". It adds nothing to the analysis so we will add it to the list of stopwords. 
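
The idea of extending a stopword list can be sketched without NLTK. The three-word list below is a hypothetical stand-in for NLTK's English list, and `drop_stopwords` is an illustrative name. One thing the sketch makes visible is that membership tests are case-sensitive: a lowercase list does not catch "The" unless the text is lowercased first.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
stop_words = {"the", "an", "a"}
stop_words.add("RT")  # project-specific addition for retweets

def drop_stopwords(tokens, stop_words=stop_words):
    """Keep only tokens that are not in the stopword set (case-sensitive)."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["RT", "The", "vote", "is", "a", "right"]
print(drop_stopwords(tokens))
```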

In [37]:
# import stopwords 
stopwords = nltk.corpus.stopwords.words('english') 
stopwords.extend(['RT'])

In [48]:


def tokenize(text):
    """Split text into word tokens with NLTK."""
    return nltk.word_tokenize(text)

def remove_stopwords(words):
    """Drop tokens found in the stopword list (case-sensitive match)."""
    return [word for word in words if word not in stopwords]

def lemmatize(text, nlp=nlp):
    """Replace each token with its spaCy lemma and rejoin into one string."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

def clean_text(df):
    """Lemmatize the `clean_tweets` column of one DataFrame partition."""
    df["clean_tweets"] = [lemmatize(x) for x in df["clean_tweets"].tolist()]
    print('done')
    return df

# Gets rid of emojis and some oddly formatted strings
def remove_emoji(inputString):
    """Strip emoji and any other non-ASCII characters."""
    return inputString.encode('ascii', 'ignore').decode('ascii')

In [39]:
# Strip URLs and @usernames (raw strings avoid invalid-escape warnings)
tweet_df['clean_tweets'] = tweet_df['tweet'].progress_apply(lambda x: re.sub(r'http://\S+', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(lambda x: re.sub(r'https://\S+', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(lambda x: re.sub(r'@\S+', '', x))
# Strip emoji, newlines and leftover HTML ampersand entities
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(remove_emoji)
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(lambda x: re.sub(r'\n', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(lambda x: re.sub(r'&amp;', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(lambda x: re.sub(r'&amp', '', x))
# Strip punctuation, then tokenize, drop stopwords and rejoin into a string
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(lambda x: re.sub(r'[^\w\s]', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(tokenize)
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(remove_stopwords)
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].progress_apply(" ".join)

In [57]:
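import dask.dataframe as ddf
from dask.diagnostics import ProgressBar

# Number of partitions (assumed value; it matches the twelve `done`
# messages in the progress output below)
parts = 12

# Lemmatize each partition in a separate process
dask_df = ddf.from_pandas(tweet_df, npartitions=parts)
result = dask_df.map_partitions(clean_text, meta=tweet_df)
with ProgressBar():
    df = result.compute(scheduler='processes')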

[                                        ] | 0% Completed | 16min 33.9sdone
[###                                     ] | 8% Completed | 16min 45.9sdone
[######                                  ] | 16% Completed | 16min 56.9sdone
[##########                              ] | 25% Completed | 17min  7.7sdone
[#############                           ] | 33% Completed | 17min 15.9sdone
[################                        ] | 41% Completed | 17min 25.3sdone
[####################                    ] | 50% Completed | 17min 34.7sdone
[#######################                 ] | 58% Completed | 17min 41.4sdone
[##########################              ] | 66% Completed | 17min 46.9sdone
[##############################          ] | 75% Completed | 17min 52.4sdone
[#################################       ] | 83% Completed | 17min 56.6sdone
[####################################    ] | 91% Completed | 18min  0.1sdone
[########################################] | 100% Completed | 18min  1.4s


In [58]:
df


Unnamed: 0,tweet,class,clean_tweets
0,RT @aafb: Congrats to ⁦@RepOHalleran⁩ &amp; ⁦@...,L,congrat appointment look forward work together h
1,Great to meet the new Lake County Farm Bureau ...,L,great meet new lake county farm bureau executi...
2,Congratulations to @waynestcollege women's rug...,C,congratulation women rugby win sixth national ...
3,Great to meet with the Erickson Air Crane team...,C,great meet erickson air crane team medford tod...
4,Always wonderful to be part of the Back to Sch...,L,always wonderful part back school jam resource...
...,...,...,...
1350301,We should be upholding the National Environmen...,L,-PRON- uphold national environmental policy ac...
1350302,"If anything is to be investigated, I think we ...",C,if anything investigate -PRON- think need inve...
1350303,TODAY: Federal judge rules in favor of House R...,C,today federal judge rule favor house republica...
1350304,"In the words of an old proverb, ""A hit dog wil...",L,in word old proverb a hit dog holler


In [59]:
with open('outdata/tweets_clean_df.pkl', 'wb') as f:
    pickle.dump(df, f)
    
s3.meta.client.upload_file('outdata/tweets_clean_df.pkl',
                           bucket_name,
                           'tweets_clean_df.pkl')


In [60]:
os.remove('outdata/tweets_clean_df.pkl')

Process ForkPoolWorker-207239:
Traceback (most recent call last):
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
_pickle.UnpicklingError: unpickling stack underflow
Process ForkPoolWorker-354237:
Traceback (most recent call last):
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/scarey/anaconda3/lib/python3.7/multiprocessing/pool.py", line