***

## Data Cleaning


This notebook takes the data created in the **Data Collection** notebook and applies standard text preprocessing.  

**Begin by removing items from the text that are not needed because they add no value to the classification model:**

* URLs  
* Hashtags and Twitter @usernames  
* Emoticons  
* Punctuation  

**Next we perform some more common NLP Preprocessing tasks:**

* Tokenization
* Removal of Stopwords  
* Lemmatization

***

In [6]:
import pandas as pd

import numpy as np
import os
import pickle
import boto3
s3 = boto3.resource('s3')
bucket_name = "msds-practicum-carey"

import re
import spacy
#nlp = spacy.load("en_core_web_sm")

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
#nlp.add_pipe(nlp.create_pipe('sentencizer'))

import nltk
from nltk import FreqDist
import string



## Download the Data from AWS S3. 

To keep the code repository small, I have stored all of the data in an S3 object store. The other option would be to use Git LFS. All of the intermediate data has been stored as .pkl (pickle) files, which is a convenient way to serialize most Python objects to disk.  

tweet_df.pkl is a serialized pandas DataFrame.
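As a minimal sketch of the pickle round trip described above, the snippet below writes a small DataFrame to a temporary file and reads it back. The frame `example_df` and the file name are hypothetical stand-ins for illustration, not the project's real data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for tweet_df.
example_df = pd.DataFrame(
    {"tweet": ["first tweet", "second tweet"], "class": ["L", "C"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_df.pkl")
    example_df.to_pickle(path)          # serialize the DataFrame to disk
    restored_df = pd.read_pickle(path)  # load it back into memory

# The restored frame has the same columns, index, and values.
print(restored_df.equals(example_df))
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust.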

In [7]:
with open('outdata/tweet_df.pkl', 'wb') as data:
    s3.Bucket(bucket_name).download_fileobj('tweet_df.pkl', data)
    
tweet_df = pd.read_pickle('outdata/tweet_df.pkl')

os.remove('outdata/tweet_df.pkl')

In [8]:
tweet_df.sample(10)

Unnamed: 0,tweet,class
688755,NBC News reports Obama knew for at least 3 yea...,C
1114683,Thank you Naomi and family for visiting and sh...,L
660825,Bernie Sanders and I and all the members of th...,C
757280,RT @SpeakerPelosi: The House will vote on Thur...,L
1199501,"RI manufacturers employ over 42,000 workers &a...",L
1025881,Just visited Mason Valley's Peri &amp; Sons Fa...,L
179315,Great discussion about the future of housing i...,C
429852,"When Trump ended DACA, he left hundreds of tho...",L
1349528,RT @Ryan_ILFB: Great to hear from @RodneyDavis...,C
958124,Thanks to the folks in the Air Traffic Control...,C


## Import Stopwords from NLTK and define text cleaning functions. 

NLTK keeps a library of "stopwords". These are words that typically show up the most in a text but add very little substance to the analysis. Examples of stopwords are "the", "an", and "a".

We can also add words to the list of stopwords. This is done on a project by project basis dependent upon the origin of the text. In our case the corpus came from Twitter so we know a good portion of it will start with "RT" which stands for "retweet". It adds nothing to the analysis so we will add it to the list of stopwords. 
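One detail worth keeping in mind: NLTK's English stopword list is all lowercase, so an exact-match filter leaves capitalized forms such as "The" in place. The sketch below shows a case-insensitive variant. The tiny `example_stopwords` set and the function name `remove_stopwords_casefold` are hypothetical and are not part of this notebook's pipeline.

```python
# Hypothetical miniature stopword set; the notebook uses NLTK's full English list.
example_stopwords = {"the", "an", "a", "rt"}


def remove_stopwords_casefold(words, stopword_set=example_stopwords):
    """Drop every token whose lowercase form appears in stopword_set."""
    return [word for word in words if word.lower() not in stopword_set]


tokens = ["RT", "The", "House", "will", "vote", "on", "a", "bill"]
filtered_tokens = remove_stopwords_casefold(tokens)
print(filtered_tokens)  # ['House', 'will', 'vote', 'on', 'bill']
```

Whether to lowercase before filtering is a design choice. Lowercasing catches more stopwords, but it also treats a stopword and an identically spelled proper noun or acronym as the same token.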

In [9]:
# import stopwords 
stopwords = nltk.corpus.stopwords.words('english') 
stopwords.extend(['RT'])

In [10]:

# breaks text up into a list of individual words 
def tokenize(text):
    
    tokens = nltk.word_tokenize(text)
    
    return tokens

# removes stopwords 
def remove_stopwords(words):
  
    
    filtered = filter(lambda word: word not in stopwords, words)
    
    return list(filtered)

#  lemmatizes text based on the part of speech tags 
def lemmatize(text, nlp=nlp):
    
    doc = nlp(text)
    
    lemmatized = [token.lemma_ for token in doc]
    
    return " ".join(lemmatized)

# applies the lemmatize function to a dataframe
# allows us to use Dask to run function in parallel
def clean_text(df):
   
    df["clean_tweets"] = [lemmatize(x) for x in df['clean_tweets'].tolist()]
    print('done')
    return df

# Gets rid of emojis and any other non-ASCII characters
def remove_emoji(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
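
To illustrate what the encode/decode approach in `remove_emoji` does, here is a small sketch using a made-up sample string. The helper name `strip_non_ascii` is hypothetical and uses the same technique. Note that it removes every non-ASCII character, so accented letters are dropped along with emojis.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


# "\U0001F389" is a party-popper emoji; "\u00e9" is an accented "e".
sample = "Great day \U0001F389 in San Jos\u00e9"
stripped = strip_non_ascii(sample)
print(repr(stripped))  # 'Great day  in San Jos'
```

The spaces on either side of the emoji both survive, which leaves a double space. Tokenization later in the pipeline splits on whitespace, so the extra space does not produce an extra token.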

## Use REGEX and the defined functions to perform preprocessing. 

### 1. Remove URLs

In [None]:
# raw strings keep \S from being treated as an invalid string escape
tweet_df['clean_tweets'] = tweet_df['tweet'].apply(lambda x: re.sub(r'http://\S+', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: re.sub(r'https://\S+', '', x))
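
As a quick check of the URL pattern on a single string, the sketch below combines both schemes into one expression with an optional `s`. The sample text and the name `URL_PATTERN` are made up for illustration.

```python
import re

# One pattern for both http:// and https:// links.
URL_PATTERN = re.compile(r"https?://\S+")

sample = "Read the bill here: https://example.com/bill and http://example.org today"
cleaned = URL_PATTERN.sub("", sample)
print(repr(cleaned))  # 'Read the bill here:  and  today'
```

`\S+` consumes everything up to the next whitespace, so punctuation attached to the end of a link is removed along with it.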

### 2. Remove @name mentions and Emojis

In [None]:
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: re.sub(r'@\S+', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: remove_emoji(x))

### 3. Remove newline characters and punctuation

In [None]:
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: re.sub('\n', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: re.sub(r'[^\w\s]', '', x))


### 4. Remove ampersand (&) 

In [None]:
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: re.sub('&amp;', '', x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: re.sub('&amp', '', x))
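
The order of steps 3 and 4 matters. If the cells run in the order listed, the punctuation pass has already stripped `&` and `;`, so the `&amp;` pattern has nothing left to match and the letters "amp" stay in the text. The sketch below compares both orders on a made-up sample string. The regular expressions match the ones above, except that the two ampersand patterns are merged into `&amp;?`.

```python
import re

raw = "Women- &amp; Minority-Owned"

# Order as listed: punctuation first, then the HTML entity.
punct_first = re.sub(r"[^\w\s]", "", raw)
punct_first = re.sub(r"&amp;?", "", punct_first)

# Alternative order: HTML entity first, then punctuation.
entity_first = re.sub(r"&amp;?", "", raw)
entity_first = re.sub(r"[^\w\s]", "", entity_first)

print(repr(punct_first))   # 'Women amp MinorityOwned'
print(repr(entity_first))  # 'Women  MinorityOwned'
```

Running the entity removal before the punctuation removal avoids the stray "amp" token.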

### 5. Tokenize, remove stopwords, rejoin into string

In [11]:
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x: tokenize(x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x : remove_stopwords(x))
tweet_df['clean_tweets'] = tweet_df['clean_tweets'].apply(lambda x : " ".join(x) )

## Use Dask to parallelize the lemmatization of the words. 

The goal of lemmatization is to remove the inflection from words, returning only the base form. For example, "mornings" becomes "morning".  

Processing each of the 1.3 million tweets one at a time takes a long time because lemmatizing a sentence is computationally expensive. To speed up this process we will use the Dask package.  

Using Dask we can break the dataframe into separate partitions and have each one processed by a separate processor core. This is known as parallel computing. 

We begin by getting the number of cores in the computer's processor. Note that os.cpu_count() reports logical cores. 

In [12]:

parts = os.cpu_count()
parts

12

Then we use Dask to break the pandas DataFrame into the same number of partitions as we have cores. Then we map the 'clean_text' function to each partition and process.  

On my machine a 60 minute operation was reduced to 15 minutes. 
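
To make the partition-then-map idea concrete without Dask, the sketch below splits a small frame into contiguous slices, applies a function to each slice, and concatenates the results. It runs the slices one after another. Dask follows the same pattern but hands each partition to a separate worker process. The function `shout` and the frame `example_df` are hypothetical stand-ins for `clean_text` and `tweet_df`.

```python
import numpy as np
import pandas as pd


def shout(partition):
    """Hypothetical stand-in for clean_text: adds a derived text column."""
    partition = partition.copy()
    partition["clean_tweets"] = partition["tweet"].str.upper()
    return partition


example_df = pd.DataFrame({"tweet": [f"tweet {i}" for i in range(10)]})
n_parts = 3

# Row positions that cut the frame into n_parts contiguous, nearly equal slices.
bounds = np.linspace(0, len(example_df), n_parts + 1, dtype=int)
partitions = [example_df.iloc[bounds[i]:bounds[i + 1]] for i in range(n_parts)]

# Apply the function to each slice, then stitch the pieces back together.
combined = pd.concat([shout(part) for part in partitions])

# Same result as processing the whole frame in one pass.
print(combined.equals(shout(example_df)))
```

This only works because the function treats each row independently. An operation that needs to see the whole column at once, such as a global word count, cannot be split this way without a separate combine step.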

In [13]:
import dask.dataframe as ddf
from dask.diagnostics import ProgressBar

dask_df = ddf.from_pandas(tweet_df, npartitions = parts)
result = dask_df.map_partitions(clean_text, meta = tweet_df)
with ProgressBar():
    df = result.compute(scheduler='processes')



[                                        ] | 0% Completed | 14min 41.8sdone
[###                                     ] | 8% Completed | 14min 49.5sdone
[######                                  ] | 16% Completed | 14min 55.8sdone
[##########                              ] | 25% Completed | 15min  4.0sdone
[#############                           ] | 33% Completed | 15min  9.1sdone
[################                        ] | 41% Completed | 15min 13.5sdone
[####################                    ] | 50% Completed | 15min 17.5sdone
[#######################                 ] | 58% Completed | 15min 20.6sdone
[##########################              ] | 66% Completed | 15min 23.6sdone
[##############################          ] | 75% Completed | 15min 26.2sdone
[##############################          ] | 75% Completed | 15min 27.5sdone
[####################################    ] | 91% Completed | 15min 29.3sdone
[########################################] | 100% Completed | 15min 30.5s


The result is a new dataframe that contains all of the original data plus a new column that contains the lemmatized text.  

Lemmatizing the text makes word counts more reliable, because inflected variants of the same word are counted together. 

In [21]:
df.sample(20)


Unnamed: 0,tweet,class,clean_tweets
1098648,.@POTUS doesn't know what it's like to live pa...,L,do not know like live paycheck paycheck hell n...
460691,RT @SDAgriculture: Thank you @SDGovDaugaard fo...,C,thank declare yesterday e15 day south dakota -...
844170,ICYMI: Always enjoy mornings with @cspan @cspa...,C,icymi always enjoy morning thank cspanwj
536961,Kathleen's Women- &amp; Minority-Owned Busines...,L,kathleens women minorityowned business resourc...
901461,Members of @VETSports discuss efforts to assis...,L,member discuss effort assist veteran veteransd...
1189330,Bad #trade deals have resulted in lost #jobs a...,L,bad trade deal result lose job shuttered facto...
333424,The Ag portion of the minibus really focuses o...,C,the ag portion minibus really focus ruralameri...
693819,What's in the bill? Critical support for:\n\n•...,L,what s bill critical support forpuerto ricos r...
605614,Looking forward to joining @CNNSOTU on Sunday ...,L,look forward join sunday morning tune 8 a.m. cst
1217669,"#Bismarck Century students Ronak, Bryce, Erik ...",C,bismarck century student ronak bryce erik doug...


Unnamed: 0,tweet,class,clean_tweets
0,RT @aafb: Congrats to ⁦@RepOHalleran⁩ &amp; ⁦@...,L,congrat appointment look forward work together h


In [16]:

with open('outdata/tweets_clean_df.pkl', 'wb') as f:
    pickle.dump(df, f)
    
s3.meta.client.upload_file('outdata/tweets_clean_df.pkl',
                           bucket_name,
                           'tweets_clean_df.pkl')


In [17]:
os.remove('outdata/tweets_clean_df.pkl')