# Parsing Text (aka Prepping Text Data)

What is it?
- Breaking our text data into smaller components and reducing variability between words

Why do we care? 
- It lets us understand our data programmatically and gets it ready for exploration and modeling

Workflow

original text--->
1. lowercase text
2. remove accented and non-ASCII characters
3. remove special characters
4. tokenize the strings into discrete units
5. stem/lemmatize words
6. remove stopwords

ready for exploration!
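
The first three steps need only the standard library, so they can be sketched as a single helper. The name `basic_clean` is an assumption made for illustration and is not used in the cells of this notebook; steps 4 to 6 rely on nltk and are left out of this sketch.

```python
import re
import unicodedata


def basic_clean(text):
    """Lowercase, strip accents and non-ASCII characters, drop special characters."""
    # step 1: lowercase
    text = text.lower()
    # step 2: decompose accented characters, then drop anything non-ASCII
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    # step 3: keep only a-z, digits, single quotes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", "", text)


print(basic_clean("Erdős's name; Pólya's too!"))
```

The order matters here: the regular expression in step 3 only keeps lowercase letters, so it assumes step 1 has already run.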

## Let's see it in action

In [1]:
#standard imports
import pandas as pd
import numpy as np

### original text

In [2]:
original = "Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"
original

"Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

### 1. lowercase text

In [3]:
article = original.lower()
article

"paul erdős and george pólya were influential hungarian mathematicians who contributed a lot to the field. erdős's name contains the hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as erdos or erdös either by mistake or out of typographical necessity"

### 2. remove any accented characters and non-ASCII characters

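- `unicodedata.normalize` with the `'NFKD'` form decomposes each accented character into a base letter plus separate combining marks
- `.encode('ascii', 'ignore')` converts the result to ASCII bytes and drops anything that can't be represented, including those combining marks
- `.decode` turns the resulting bytes object back into a string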
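
To make the three bullets concrete, here is a small sketch that looks at what normalization does to individual characters. The helper name `to_ascii` is hypothetical and only wraps the same three calls.

```python
import unicodedata


def to_ascii(text):
    # same normalize / encode / decode chain, wrapped for reuse
    return (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )


for ch in ["ő", "ö", "ó"]:
    decomposed = unicodedata.normalize("NFKD", ch)
    # each accented letter becomes a plain 'o' followed by one combining mark
    names = [unicodedata.name(c) for c in decomposed]
    print(repr(ch), names, "->", repr(to_ascii(ch)))

# caveat: letters with no decomposition are dropped entirely, not replaced
print(repr(to_ascii("søren")))
```

The last line is worth keeping in mind: characters such as 'ø' have no base-letter decomposition, so they disappear rather than turning into 'o'.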

In [4]:
#import unicode character database
import unicodedata

In [5]:
article = unicodedata.normalize('NFKD', article)\
.encode('ascii', 'ignore')\
.decode('utf-8')

article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field. erdos's name contains the hungarian letter 'o' ('o' with double acute accent), but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

### 3. remove special characters

- remove anything that isn't a lowercase letter (a-z), a digit, a single quote, or whitespace
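
Before applying the pattern to the article, it can help to try it on a few short strings. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# negated character class: match any single character that is NOT
# a lowercase letter, a digit, a single quote, or whitespace
pattern = r"[^a-z0-9'\s]"

samples = [
    "the field. erdos's name",
    "well-known result (1913)",
    "Paul",  # not lowercased first
]

for s in samples:
    print(repr(s), "->", repr(re.sub(pattern, "", s)))
```

Two side effects show up: removing a hyphen fuses the words around it ("well-known" becomes "wellknown"), and uppercase letters are deleted if the text was not lowercased beforehand.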

In [6]:
#import regular expression operations
import re

In [7]:
#use re.sub to remove special characters
article = re.sub(r'[^a-z0-9\'\s]', '', article)
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

### 4. tokenize

Tokenization is the process of breaking something down into smaller, discrete units. These units are called tokens.

It's common to tokenize strings so that words and any leftover punctuation are separated into discrete units.
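
As a rough illustration of the idea, here is a standard-library stand-in. It is not the Toktok algorithm, and the name `simple_tokenize` is hypothetical; it only shows how punctuation gets split away from words.

```python
import re


def simple_tokenize(text):
    # a run of word characters is one token;
    # any other non-whitespace character is a token by itself
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("erdos's name contains the letter 'o'")
print(tokens)

# joining with spaces gives a string where every token is separated
print(" ".join(tokens))
```

A real tokenizer handles many more cases (URLs, numbers, abbreviations), which is why a library implementation is the better choice on real data.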

Use `nltk.tokenize.ToktokTokenizer`

In [16]:
#import natural language toolkit
import nltk

In [17]:
#create the tokenizer
tokenize = nltk.tokenize.ToktokTokenizer()
tokenize

<nltk.tokenize.toktok.ToktokTokenizer at 0x14ecab100>

In [18]:
#use the tokenizer
article = tokenize.tokenize(article, return_str=True)
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

### 5. stem or lemmatize words (choose one!)

Stemming
- **truncates** words to their "stem"
- rule-based (algorithmic), not linguistic
- example: "calls", "called", "calling" --> "call"
- fast and efficient


Lemmatize
- **changes** words to their dictionary base form (the "lemma")
- it can map irregular forms back to the base word
- example: "mouse", "mice" --> "mouse"
- slower than stemming
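
The difference can be sketched with two toy functions. Neither is the Porter algorithm or the WordNet lemmatizer; `toy_stem`, `toy_lemmatize`, and the tiny `TOY_LEMMAS` dictionary are assumptions made up for this illustration.

```python
def toy_stem(word):
    # rule-based: chop off the first matching suffix, no dictionary involved
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# dictionary-based: look the word up and return its base form
TOY_LEMMAS = {"mice": "mouse", "calls": "call", "called": "call", "calling": "call"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


for w in ["calls", "called", "calling", "mice"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The suffix rules handle the regular forms but leave "mice" untouched, while the lookup can return "mouse" because it knows the word. That lookup is also why lemmatizing tends to be slower.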

#### stemmer

Use `nltk.porter.PorterStemmer`

In [19]:
#create porter stemmer
ps = nltk.porter.PorterStemmer()
ps

<PorterStemmer>

In [20]:
#test stemmer
ps.stem('calling'), ps.stem('calls'), ps.stem('called')

('call', 'call', 'call')

In [21]:
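#stemming the whole string treats it as one word, so only the ending changes ('necessity' --> 'necess')
ps.stem(article)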

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necess"

In [22]:
#use stemmer - apply stem to each word in our string
stems = [ps.stem(word) for word in article.split()]
stems[:10]

['paul',
 'erdo',
 'and',
 'georg',
 'polya',
 'were',
 'influenti',
 'hungarian',
 'mathematician',
 'who']

In [23]:
#join words back together
article_stemmed = ' '.join(stems)
article_stemmed

"paul erdo and georg polya were influenti hungarian mathematician who contribut a lot to the field erdo ' s name contain the hungarian letter ' o ' ' o ' with doubl acut accent but is often incorrectli written as erdo or erdo either by mistak or out of typograph necess"

#### lemmatize

Use `nltk.stem.WordNetLemmatizer`

In [24]:
# download the first time ('all' fetches every nltk resource; the lemmatizer itself relies on the 'wordnet' corpus)
# nltk.download('all')

In [25]:
#create the lemmatizer
wnl = nltk.stem.WordNetLemmatizer()
wnl

<WordNetLemmatizer>

In [26]:
#test lemmatizer
wnl.lemmatize('mouses'), wnl.lemmatize('mice')

('mouse', 'mouse')

In [27]:
#lemmatizing the whole string looks it up as a single word, so it comes back unchanged
wnl.lemmatize(article)

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

In [28]:
#use lemmatizer - apply lemmatize to each word in our string
lemmas = [wnl.lemmatize(word) for word in article.split()]
lemmas[:10]

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematician',
 'who']

In [29]:
#join words back together
article_lemmatized = ' '.join(lemmas)
article_lemmatized

"paul erdos and george polya were influential hungarian mathematician who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written a erdos or erdos either by mistake or out of typographical necessity"

### 6. remove stopwords

Words that carry little or no meaning on their own, especially when constructing features from text, are known as stopwords.
- examples: a, an, the, and the like

We will use the standard English stopword list from nltk.

Use `nltk.corpus.stopwords`

In [51]:
#import our stopwords list
from nltk.corpus import stopwords

In [52]:
#only need to do once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mistygarcia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [32]:
#save stopwords
stopwords_ls = stopwords.words('english')
stopwords_ls[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [33]:
stopwords_ls.sort()

In [34]:
stopwords_ls[:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

In [35]:
#words we might want to keep out of the stopword list
extra = ['all','about','after']
extra

['all', 'about', 'after']

In [54]:
#set difference shows the stopwords without those words (stopwords_ls itself is not modified)
set(stopwords_ls) - set(extra)

{"'",
 'a',
 'above',
 'again',
 'against',
 'ain',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 's

In [37]:
stopwords_ls[:15]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't"]

In [38]:
'am' in stopwords_ls

True

In [39]:
len(stopwords_ls)

179

In [40]:
#add to stopword list
stopwords_ls.append('o')

In [41]:
len(stopwords_ls)

180

In [42]:
#remove from stopword list
stopwords_ls.remove('o')

In [43]:
len(stopwords_ls)

179

In [44]:
#split words in lemmatized article
words = article_lemmatized.split()
words[:10]

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematician',
 'who']

In [45]:
#add the stray single quote left over from tokenizing to the stopword list
stopwords_ls.append("'")

In [46]:
#remove stopwords from list of words
filtered_words = [word for word in words if word not in stopwords_ls]
filtered_words

['paul',
 'erdos',
 'george',
 'polya',
 'influential',
 'hungarian',
 'mathematician',
 'contributed',
 'lot',
 'field',
 'erdos',
 'name',
 'contains',
 'hungarian',
 'letter',
 'double',
 'acute',
 'accent',
 'often',
 'incorrectly',
 'written',
 'erdos',
 'erdos',
 'either',
 'mistake',
 'typographical',
 'necessity']

In [47]:
#show how many words we removed
len(words) - len(filtered_words)

24

In [48]:
#join words back together
parsed_article = ' '.join(filtered_words)
parsed_article

'paul erdos george polya influential hungarian mathematician contributed lot field erdos name contains hungarian letter double acute accent often incorrectly written erdos erdos either mistake typographical necessity'
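
The add, remove, and filter steps above can be bundled into one function. This is a minimal sketch: the name `remove_stopwords`, its `extra_words` and `exclude_words` parameters, and the tiny `base` list are all hypothetical, standing in for the nltk list so the example runs without any downloads.

```python
def remove_stopwords(text, stopwords, extra_words=(), exclude_words=()):
    # start from the base list, add the extras, then drop the exclusions
    stopword_set = (set(stopwords) | set(extra_words)) - set(exclude_words)
    # a set makes each membership check fast
    return " ".join(w for w in text.split() if w not in stopword_set)


# hypothetical tiny stopword list, used instead of stopwords.words('english')
base = ["a", "and", "as", "the", "to", "were", "who"]

text = (
    "paul erdos and george polya were influential hungarian mathematician "
    "who contributed a lot to the field erdos ' s name"
)

print(remove_stopwords(text, base, extra_words=["'", "s"]))
print(remove_stopwords(text, base, extra_words=["'", "s"], exclude_words=["who"]))
```

Building a set once, instead of appending to and removing from the list in place, also means the original stopword list is never changed between runs.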

#### ready for exploration!

In [49]:
original

"Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

In [50]:
parsed_article

'paul erdos george polya influential hungarian mathematician contributed lot field erdos name contains hungarian letter double acute accent often incorrectly written erdos erdos either mistake typographical necessity'