# Parsing Text (aka Prepping Text Data)

What is it?
- Breaking our text data into smaller components and reducing variability between words

Why do we care? 
- Allows us to better understand our data programmatically and gets us ready for exploration and modeling

Workflow

original text--->
1. lowercase text
2. remove accented and non-ASCII characters
3. remove special characters
4. tokenize the strings into discrete units
5. stem/lemmatize words
6. remove stopwords

ready for exploration!

## Let's see it in action

In [None]:
#standard imports
import pandas as pd
import numpy as np

### original text

In [None]:
original = "Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

### 1. lowercase text
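
A minimal sketch of this step, assuming plain Python strings: `str.lower()` returns a lowercased copy of the text. The `sample` string is a shortened stand-in for `original`, used only for illustration.

```python
# Shortened stand-in for the `original` string (illustrative only)
sample = "Paul Erdős and George Pólya were influential Hungarian mathematicians"

# str.lower() returns a new string; the input is left unchanged
lowered = sample.lower()
print(lowered)  # paul erdős and george pólya were influential hungarian mathematicians
```

Accented characters are lowercased too; they are dealt with in the next step.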

### 2. remove any accented characters and non-ASCII characters

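- `unicodedata.normalize` removes inconsistencies in Unicode character encoding; the `NFKD` form also splits an accented letter into a base letter plus a separate accent mark
- `.encode('ascii', 'ignore')` converts the resulting string to the ASCII character set, dropping anything that can't be represented
- `.decode` turns the resulting bytes object back into a string

Use `unicodedata.normalize().encode().decode()`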
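
The chain above might look like the following sketch. The `NFKD` normalization form and the `'ignore'` error handler are assumptions chosen for this example, and `sample` is an illustrative stand-in for the lowercased text.

```python
import unicodedata

# Lowercased stand-in text containing accented characters (illustrative only)
sample = "paul erdős and george pólya, often written erdös"

ascii_only = (
    unicodedata.normalize("NFKD", sample)  # split accented letters into base letter + combining mark
    .encode("ascii", "ignore")             # drop everything outside ASCII, including the combining marks
    .decode("utf-8")                       # turn the bytes object back into a str
)
print(ascii_only)  # paul erdos and george polya, often written erdos
```

Because non-ASCII characters are dropped rather than transliterated, any character without an ASCII base letter simply disappears.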

In [None]:
#import

### 3. remove special characters

- remove anything that isn't a letter a-z, a digit, a single quote, or whitespace
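
One way to express that rule is a negated character class passed to `re.sub`. This is a sketch that assumes the text has already been lowercased and reduced to ASCII; the pattern and the `sample` string are illustrative.

```python
import re

# Stand-in text that is already lowercased and ASCII-only (illustrative only)
sample = "erdos's name contains the hungarian letter 'o' ('o' with double acute accent)."

# [^...] matches any character NOT listed: a-z, 0-9, single quote, whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", "", sample)
print(cleaned)  # erdos's name contains the hungarian letter 'o' 'o' with double acute accent
```

Uppercase letters are not in the character class, so running this before lowercasing would delete them.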

In [None]:
#import regular expression operations


In [None]:
#use re.sub to remove special characters


### 4. tokenize

Tokenization is the process of breaking something down into smaller, discrete units. These units are called tokens.

It's common to tokenize strings so that words and any leftover punctuation are split into discrete units.

Use `nltk.tokenize.ToktokTokenizer`
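
A short sketch of the tokenizer in use, assuming `nltk` is installed (this tokenizer is rule-based and needs no corpus download). The `sample` sentence is illustrative and keeps its final period to show punctuation being split off.

```python
from nltk.tokenize import ToktokTokenizer

# Create the tokenizer once and reuse it
tokenizer = ToktokTokenizer()

# Stand-in sentence (illustrative only)
sample = "erdos and polya were influential mathematicians."

# Default: a list of tokens
tokens = tokenizer.tokenize(sample)
print(tokens)  # ['erdos', 'and', 'polya', 'were', 'influential', 'mathematicians', '.']

# return_str=True: the same tokens joined by single spaces
as_string = tokenizer.tokenize(sample, return_str=True)
print(as_string)  # erdos and polya were influential mathematicians .
```

Returning a string is convenient when the next step expects text rather than a list.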

In [None]:
#import natural language toolkit


In [None]:
#create the tokenizer


In [None]:
#use the tokenizer


### 5. stem or lemmatize words (choose one!)

Stemming
- **truncates** words to their "stem"
- algorithmic rules (non-linguistic)
- example: "calls", "called", "calling" --> "call"
- fast and efficient


Lemmatize
- **changes** words to their "root"
- maps inflected forms back to the base (dictionary) word
- example: "mouse", "mice" --> "mouse"
- slower than stemming

#### stemmer

Use `nltk.stem.PorterStemmer`
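
The stemming steps could be sketched as below, assuming `nltk` is installed. `PorterStemmer` is imported here from `nltk.stem`, and the `sample` text is an illustrative stand-in for the cleaned article.

```python
from nltk.stem import PorterStemmer

# Create the stemmer (no corpus download required)
stemmer = PorterStemmer()

# Test it on a few related word forms
stems = [stemmer.stem(word) for word in ["calls", "called", "calling"]]
print(stems)  # ['call', 'call', 'call']

# Apply the stemmer to each word, then join the words back together
sample = "he calls and she called while they were calling"
stemmed = " ".join(stemmer.stem(word) for word in sample.split())
print(stemmed)
```

Stems are not guaranteed to be dictionary words, which is the trade-off for speed.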

In [None]:
#create porter stemmer


In [None]:
#test stemmer


In [None]:
#use stemmer - apply stem to each word in our string


In [None]:
#join words back together


#### lemmatize

Use `nltk.stem.WordNetLemmatizer`
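
A possible helper for this step is sketched below. It assumes `nltk` is installed and that the WordNet corpus has been downloaded before the helper is called with its default lemmatizer. The name `lemmatize` and its optional `lemmatizer` parameter are hypothetical, introduced only for this example.

```python
from nltk.stem import WordNetLemmatizer


def lemmatize(text, lemmatizer=None):
    """Lemmatize each whitespace-separated word (hypothetical helper)."""
    # Default to WordNetLemmatizer; calling it requires the WordNet corpus
    wnl = lemmatizer if lemmatizer is not None else WordNetLemmatizer()
    # Apply .lemmatize() to each word, then join the words back together
    return " ".join(wnl.lemmatize(word) for word in text.split())
```

With the corpus available, `lemmatize("mouse mice")` should give `mouse mouse`. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verb forms such as "calling" are left unchanged by this sketch.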

In [None]:
# download the first time
# nltk.download('all')

In [None]:
#create the lemmatizer


In [None]:
#test lemmatizer


In [None]:
#use lemmatizer - apply lemmatize to each word in our string


In [None]:
#join words back together


### 6. remove stopwords

Words that have little or no significance, especially when constructing meaningful features from text, are known as stopwords.
- examples: a, an, the, and the like

We will use a standard English-language stopword list from nltk.

Use `nltk.corpus.stopwords`
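
The stopword steps (save the list, add to it, remove from it, filter the words, rejoin) could be combined into one helper, sketched below. The function name and its `extra_words`, `exclude_words`, and `base_words` parameters are hypothetical. The demo passes a tiny stand-in list so that it runs without the corpus download.

```python
from nltk.corpus import stopwords


def remove_stopwords(text, extra_words=(), exclude_words=(), base_words=None):
    """Return `text` with stopwords removed (hypothetical helper)."""
    if base_words is None:
        # Requires the one-time nltk.download('stopwords')
        base_words = stopwords.words("english")
    # Start from the base list, add custom stopwords, then drop any we want to keep
    stopword_set = (set(base_words) | set(extra_words)) - set(exclude_words)
    kept = [word for word in text.split() if word not in stopword_set]
    return " ".join(kept)


# Demo with a tiny stand-in stopword list (illustrative only)
sample = "paul erdos and george polya were influential hungarian mathematicians"
demo = remove_stopwords(sample, base_words=["a", "an", "the", "and", "were"])
print(demo)  # paul erdos george polya influential hungarian mathematicians

# Show how many words were removed
print(len(sample.split()) - len(demo.split()))  # 2
```

Calling `remove_stopwords(sample)` without `base_words` loads NLTK's English list instead.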

In [None]:
#import stopwords list


In [None]:
#only need to do once
# nltk.download('stopwords')

In [None]:
#save stopwords


In [None]:
#set a list to remove some stopwords


In [None]:
#add to stopword list


In [None]:
#remove from stopword list


In [None]:
#split words in lemmatized article


In [None]:
#remove stopwords from list of words


In [None]:
#show how many words we removed


In [None]:
#join words back together


#### ready for exploration!