# Parsing Text (aka Prepping Text Data)

What is it?
- Breaking our text data into smaller components and reducing variability between words

Why do we care? 
- It lets us understand our data programmatically and gets us ready for exploration and modeling

Workflow

original text--->
1. lowercase text
2. remove accented and non-ASCII characters
3. remove special characters
4. tokenize the strings into discrete units
5. stem/lemmatize words
6. remove stopwords

ready for exploration!
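As a rough preview of how these steps chain together, here is a minimal sketch using only the standard library. The names `basic_clean`, `prep_text`, and `TOY_STOPWORDS` are made up for illustration, the stopword set is a tiny stand-in for the nltk list, tokenizing is approximated with a whitespace split, and the stem/lemmatize step is left out.

```python
import re
import unicodedata

# Hypothetical, tiny stopword set for illustration only (not the nltk list)
TOY_STOPWORDS = {"a", "an", "and", "the", "were", "to", "of"}

def basic_clean(text):
    """Steps 1-3: lowercase, strip accents/non-ASCII, drop special characters."""
    text = text.lower()
    text = (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8"))
    return re.sub(r"[^a-z0-9'\s]", "", text)

def prep_text(text, stopwords=TOY_STOPWORDS):
    """Steps 4 and 6 with a plain whitespace split; step 5 is skipped here."""
    words = basic_clean(text).split()
    return " ".join(word for word in words if word not in stopwords)

print(prep_text("Erdős and Pólya were mathematicians."))
```

Each step is walked through one at a time below, with nltk doing the tokenizing, stemming/lemmatizing, and stopword work.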

## Let's see it in action

In [1]:
#standard imports
import pandas as pd
import numpy as np

### original text

In [25]:
original = "Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

### 1. lowercase text

In [26]:
article = original.lower()
article

"paul erdős and george pólya were influential hungarian mathematicians who contributed a lot to the field. erdős's name contains the hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as erdos or erdös either by mistake or out of typographical necessity"

### 2. remove any accented characters and non-ASCII characters

- `unicodedata.normalize` removes any inconsistencies in unicode character encoding
- `.encode` to convert the resulting string to the ASCII character set
- `.decode` to turn the resulting bytes object back into a string

Use `unicodedata.normalize().encode().decode()`
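To see why the three calls are chained, it can help to look at each intermediate value. This is a small sketch on a single word (written with a `\u0151` escape for 'ő'); the variable names are just for illustration.

```python
import unicodedata

word = "erd\u0151s"  # 'erdős', with a single precomposed 'ő' character

# NFKD splits 'ő' into a plain 'o' followed by a separate combining accent
decomposed = unicodedata.normalize("NFKD", word)
print(len(word), len(decomposed))

# The combining accent has no ASCII equivalent, so 'ignore' drops it
ascii_bytes = decomposed.encode("ascii", "ignore")
print(ascii_bytes)

# decode turns the bytes object back into a regular string
cleaned = ascii_bytes.decode("utf-8")
print(cleaned)
```

Without the normalize step, `.encode('ascii', 'ignore')` would drop the whole 'ő' character and leave 'erds'.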

In [27]:
#import
import unicodedata

In [28]:
article = unicodedata.normalize('NFKD', article).encode('ascii', 'ignore')\
.decode('utf-8') ## turns it back into a string
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field. erdos's name contains the hungarian letter 'o' ('o' with double acute accent), but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

### 3. remove special characters

- remove anything that isn't a-z, a number, a single quote, or a whitespace
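A quick sketch of what that rule looks like as a regular expression, tried on made-up sample strings. It also shows why lowercasing comes first: the character class only lists `a-z`, so capital letters would be stripped too.

```python
import re

# [^...] is a negated character class: it matches any character NOT listed
pattern = r"[^a-z0-9'\s]"

sample = "it's 3 p.m. (really)!"
cleaned = re.sub(pattern, "", sample)
print(cleaned)

# Capital letters are not in a-z, so they get removed as well
not_lowercased = re.sub(pattern, "", "Hello World")
print(not_lowercased)
```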

In [29]:
#import regular expression operations
import re

In [72]:
#use re.sub to remove special characters
article = re.sub(r"[^a-z0-9\'\s]", '', article)
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

### 4. tokenize

Tokenization is the process of breaking something down into smaller, discrete units. These units are called tokens.

It's common to tokenize strings so that words and any leftover punctuation are broken up into discrete units.

Use `nltk.tokenize.ToktokTokenizer`
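To get a feel for what "discrete units" means before using the library, here is a rough regex stand-in. `rough_tokenize` is a hypothetical helper for illustration; it is not how `ToktokTokenizer` is implemented and will not match it in every case.

```python
import re

def rough_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokenize("erdos's name, in full")
print(tokens)

# Joining with spaces gives a string where every token stands on its own
print(" ".join(tokens))
```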

In [40]:
#import natural language toolkit
import nltk

In [41]:
#create the tokenizer
tokenize = nltk.tokenize.ToktokTokenizer()
tokenize

<nltk.tokenize.toktok.ToktokTokenizer at 0x13beb2220>

In [52]:
#use the tokenizer
article = tokenize.tokenize(article, return_str = True)
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

### 5. stem or lemmatize words (choose one!)

Stemming
- **truncates** words to their "stem"
- algorithmic rules (non-linguistic)
- example: "calls", "called", "calling" --> "call"
- fast and efficient


Lemmatize
- **changes** words to their "root"
- it can map inflected forms back to the base word
- example: "mouse", "mice" --> "mouse"
- slower than stemming
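The difference between truncating and looking up can be sketched with two toy functions. Neither is the Porter algorithm or WordNet; `SUFFIXES` and `LEMMA_TABLE` are made-up stand-ins that only cover the examples above.

```python
# Toy illustrations only: neither function is the Porter algorithm or WordNet
SUFFIXES = ("ing", "ed", "s")      # hypothetical rule list
LEMMA_TABLE = {"mice": "mouse"}    # hypothetical lookup table

def toy_stem(word):
    """Truncate the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMA_TABLE.get(word, word)

print([toy_stem(w) for w in ("calls", "called", "calling")])
print(toy_stem("mice"), toy_lemmatize("mice"))
```

Suffix rules can't relate "mice" to "mouse", which is why lemmatizing needs a vocabulary and tends to be slower.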

#### stemmer

Use `nltk.porter.PorterStemmer`

In [53]:
#create porter stemmer
ps = nltk.porter.PorterStemmer()

In [54]:
#test stemmer
ps.stem('calling'), ps.stem('calls'), ps.stem('called'), ps.stem('call')

('call', 'call', 'call', 'call')

In [55]:
ps.stem('mouse'), ps.stem('mice')

('mous', 'mice')

In [56]:
#calling stem on the whole string treats it as one word, so only the very end changes
ps.stem(article)

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necess"

In [57]:
article.split()

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematicians',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'as',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']

In [62]:
#stem each word individually
stems = [ps.stem(word) for word in article.split()]
stems[:10]

['paul',
 'erdo',
 'and',
 'georg',
 'polya',
 'were',
 'influenti',
 'hungarian',
 'mathematician',
 'who']
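Applying a function word by word comes up for both stemming and lemmatizing, so the split, transform, and join steps could be wrapped in one helper. This is a sketch: `apply_per_word` and `strip_plural` are hypothetical names, and `strip_plural` is only a stand-in so the example runs without nltk.

```python
def apply_per_word(text, transform):
    """Split on whitespace, apply `transform` to each word, and re-join."""
    return " ".join(transform(word) for word in text.split())

# Stand-in transform so the sketch runs without nltk installed
def strip_plural(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

print(apply_per_word("two mathematicians wrote papers", strip_plural))
```

With the objects in this notebook, `apply_per_word(article, ps.stem)` should give the same result as stemming each word and joining the list by hand.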

In [65]:
#join words back together
' '.join(stems)

"paul erdo and georg polya were influenti hungarian mathematician who contribut a lot to the field erdo ' s name contain the hungarian letter ' o ' ' o ' with doubl acut accent but is often incorrectli written as erdo or erdo either by mistak or out of typograph necess"

#### lemmatize

Use `nltk.stem.WordNetLemmatizer`

In [66]:
# download the first time (the lemmatizer needs the 'wordnet' corpus; 'all' fetches every nltk package)
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |     /Users/divante/nltk_data...

[nltk_data]    | Downloading package paradigms to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping corpora/paradigms.zip.
[nltk_data]    | Downloading package pe08 to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping corpora/pe08.zip.
[nltk_data]    | Downloading package perluniprops to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping misc/perluniprops.zip.
[nltk_data]    | Downloading package pil to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping corpora/pil.zip.
[nltk_data]    | Downloading package pl196x to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping corpora/pl196x.zip.
[nltk_data]    | Downloading package porter_test to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |   Unzipping stemmers/porter_test.zip.
[nltk_data]    | Downloading package ppattach to
[nltk_data]    |     /Users/divante/nltk_data...
[nltk_data]    |

True

In [81]:
#create the lemmatizer
wnl = nltk.stem.WordNetLemmatizer()
wnl

<WordNetLemmatizer>

In [82]:
#test lemmatizer
wnl.lemmatize('mouse'), wnl.lemmatize('mice')

('mouse', 'mouse')

In [83]:
#lemmatizing the whole string at once leaves it unchanged, so we need to go word by word
wnl.lemmatize(article)

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

In [84]:
#split the string into words
[word for word in article.split()]

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematicians',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'as',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']

In [85]:
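#lemmatize each word (the default part of speech is noun, so verbs like 'contributed' stay as-is)
#note that wordnet also turns 'as' into 'a'
article_lemma = [wnl.lemmatize(word) for word in article.split()]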
article_lemma

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematician',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'a',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']

In [86]:
#join words back together
article_lemma  = ' '.join(article_lemma)
article_lemma

"paul erdos and george polya were influential hungarian mathematician who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written a erdos or erdos either by mistake or out of typographical necessity"

### 6. remove stopwords

Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords
- examples: a, an, the, and the like

We will use a standard English language stopwords list from nltk

Use `nltk.corpus.stopwords`

In [87]:
#import stopwords list
from nltk.corpus import stopwords

In [88]:
#only need to do once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/divante/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [91]:
#save stopwords
stopwords_ls = stopwords.words('english')
stopwords_ls[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [93]:
stopwords_ls.sort()

In [94]:
stopwords_ls[:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

In [96]:
#words we want to keep in our text, so they should come out of the stopword list
extra = ['all', 'about', 'after']
extra

['all', 'about', 'after']

In [98]:
#set difference: the stopwords minus the words in extra (the result is displayed, not saved)
set(stopwords_ls) - set(extra)

{'a',
 'above',
 'again',
 'against',
 'ain',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',


In [99]:
#add to stopword list ('o' is already in the list, so this adds a second copy)
stopwords_ls.append('o')

In [103]:
#remove from stopword list (removes only the first 'o' it finds)
stopwords_ls.remove('o')
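Because the stopwords are stored in a list, `append` and `remove` can behave in ways that are easy to miss when a word is already present. A small sketch on a made-up three-word list, with a set shown for comparison:

```python
stop_list = ["a", "o", "the"]

stop_list.append("o")      # a list accepts duplicates
print(stop_list)           # ['a', 'o', 'the', 'o']

stop_list.remove("o")      # removes only the first match
print(stop_list)           # ['a', 'the', 'o']
print("o" in stop_list)    # True

stop_set = {"a", "o", "the"}
stop_set.add("o")          # no effect, 'o' is already a member
stop_set.discard("o")      # removes the single copy
print("o" in stop_set)     # False
```

Checking `word in stopwords_ls` before and after a change is a cheap way to confirm the list is in the state you expect.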

In [104]:
stopwords_ls

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 '

In [106]:
#split the lemmatized article into words
words = article_lemma.split()
words

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematician',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'a',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']

In [109]:
#count the words that are not stopwords
len([word for word in words if word not in stopwords_ls])

34

In [110]:
#total number of words
len(words)

51

In [112]:
#add the leftover single quotes to the stopword list
stopwords_ls.append("'")

In [114]:
#add the stray letter 'o' to the stopword list
stopwords_ls.append("o")

In [116]:
#remove stopwords from list of words
filtered = [word for word in words if word not in stopwords_ls]
filtered

['paul',
 'erdos',
 'george',
 'polya',
 'influential',
 'hungarian',
 'mathematician',
 'contributed',
 'lot',
 'field',
 'erdos',
 'name',
 'contains',
 'hungarian',
 'letter',
 'double',
 'acute',
 'accent',
 'often',
 'incorrectly',
 'written',
 'erdos',
 'erdos',
 'either',
 'mistake',
 'typographical',
 'necessity']
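The add-to and remove-from adjustments can be folded into a single function so the base list is never mutated. This is a sketch under assumptions: `remove_stopwords`, `extra_words`, and `exclude_words` are hypothetical names, and `base` is a short made-up list standing in for `stopwords.words('english')`.

```python
def remove_stopwords(text, stopwords, extra_words=(), exclude_words=()):
    """Drop stopwords from `text`.

    extra_words: additional words to treat as stopwords
    exclude_words: words to keep even though they are in `stopwords`
    """
    active = (set(stopwords) | set(extra_words)) - set(exclude_words)
    return " ".join(word for word in text.split() if word not in active)

base = ["a", "the", "to", "and", "all"]   # hypothetical short list, not nltk's
sentence = "paul ' s name and all the letters"

print(remove_stopwords(sentence, base, extra_words=["'", "s"], exclude_words=["all"]))
```

Building a set once also makes each `word not in active` check faster than scanning a list.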

In [118]:
#show how many words we removed
len(filtered) - len(words)

-24

In [124]:
original

"Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

In [122]:
#join words back together
parsed_article = ' '.join(filtered)
parsed_article

'paul erdos george polya influential hungarian mathematician contributed lot field erdos name contains hungarian letter double acute accent often incorrectly written erdos erdos either mistake typographical necessity'


#### ready for exploration!