<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocessing-Text" data-toc-modified-id="Preprocessing-Text-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing Text</a></span><ul class="toc-item"><li><span><a href="#Where-did-the-text-originate?" data-toc-modified-id="Where-did-the-text-originate?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Where did the <em>text</em> originate?</a></span></li><li><span><a href="#Removing-irrelevant-information" data-toc-modified-id="Removing-irrelevant-information-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Removing irrelevant information</a></span></li><li><span><a href="#Useful-tools" data-toc-modified-id="Useful-tools-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Useful tools</a></span><ul class="toc-item"><li><span><a href="#Introducing-Natural-Language-Toolkit-(NLTK)" data-toc-modified-id="Introducing-Natural-Language-Toolkit-(NLTK)-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Introducing Natural Language Toolkit (NLTK)</a></span></li><li><span><a href="#Regular-Expression-(Regex)" data-toc-modified-id="Regular-Expression-(Regex)-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Regular Expression (Regex)</a></span></li></ul></li></ul></li><li><span><a href="#Steps-to-Processing-Text" data-toc-modified-id="Steps-to-Processing-Text-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Steps to Processing Text</a></span><ul class="toc-item"><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Cleaning</a></span></li><li><span><a href="#Normalization" data-toc-modified-id="Normalization-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Normalization</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Stopword-removal" 
data-toc-modified-id="Stopword-removal-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Stopword removal</a></span></li><li><span><a href="#Stemming" data-toc-modified-id="Stemming-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Stemming</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Lemmatization</a></span></li><li><span><a href="#Note-on-Part-of-Speech-(POS)-Tagging" data-toc-modified-id="Note-on-Part-of-Speech-(POS)-Tagging-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Note on Part-of-Speech (POS) Tagging</a></span></li></ul></li></ul></div>

# Preprocessing Text

## Where did the _text_ originate?

Where the text came from determines how we preprocess it.

Examples:
- Speech --> Convert into text/words
- Web pages --> HTML tags
- Word Doc, other text formats --> More "junk" to consider

## Removing irrelevant information

> The dogs in Alaska are cold, hungry, and lonely.

- Punctuation likely can be removed without drastically changing the meaning
- Capitalization rarely changes meaning
- Some common words really don't add to meaning: "a", "the", "are", "of"


## Useful tools 

### Introducing Natural Language Toolkit (NLTK)

NLTK is a great library that can help with both text preprocessing and feature extraction.

### Regular Expression (Regex)

A useful way to search and move through text based on its structure (we won't go through it here; there are lots of resources).

<img src='https://imgs.xkcd.com/comics/regular_expressions.png' width=60%/>

Personally, I like this web app for testing out pattern matching: [Regexr](https://regexr.com/)
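
As a small taste, here is a sketch using Python's built-in `re` module to pull date-like strings out of text and to collapse repeated whitespace. The sample text and both patterns are made up for illustration.

```python
import re

# Hypothetical sample text, made up for this example
text = "Posted   on 2020-01-15,\tupdated 2020-02-03."

# \d{4}-\d{2}-\d{2} matches strings shaped like YYYY-MM-DD
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# \s+ matches any run of whitespace; replace each run with a single space
tidy = re.sub(r"\s+", " ", text)

print(dates)
print(tidy)
```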

# Steps to Processing Text

## Cleaning

You can use regex ([Python doc](https://docs.python.org/3/library/re.html)) and packages like [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to get rid of the extra junk so that only the natural-language material is left.
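
A minimal sketch of the regex route, using only the standard library so it runs without extra installs. The helper name `strip_html` and the sample page are made up for illustration, and the regexes are deliberately crude; for messy real-world pages, a real parser such as BeautifulSoup is likely the safer choice.

```python
import html
import re

def strip_html(raw):
    """Crude HTML cleanup: drop script/style blocks and tags, decode entities."""
    # Remove <script>...</script> and <style>...</style> together with their contents
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw)
    # Replace every remaining tag with a space
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Turn entities such as &amp; back into plain characters
    text = html.unescape(text)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical page, made up for this example
page = "<html><body><h1>Dogs &amp; Cats</h1><p>The dogs in <b>Alaska</b> are cold.</p></body></html>"
cleaned = strip_html(page)
print(cleaned)
```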

## Normalization

- Capitalization
- Punctuation (dependent on task)
    - Useful for text document as a whole
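
The two normalization steps above can be sketched with the standard library alone. The function name `normalize` and its `strip_punctuation` flag are made up for illustration; whether punctuation should be dropped at all depends on the task.

```python
import string

def normalize(text, strip_punctuation=True):
    """Lowercase the text and optionally remove ASCII punctuation."""
    text = text.lower()
    if strip_punctuation:
        # The third argument of str.maketrans lists characters to delete
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text

sentence = "The dogs in Alaska are cold, hungry, and lonely."
normalized = normalize(sentence)
print(normalized)
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need extra handling.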

## Tokenization

A token (a symbol) holds meaning and can't meaningfully be split up any further (in English, tokens are usually words).

`nltk.tokenize` has a variety of tokenizers (http://www.nltk.org/api/nltk.tokenize.html): 

- `sent_tokenize` finds sentences (often done for translation)
- `word_tokenize` is like `split` but is a little smarter in how it tokenizes the text
- `RegexpTokenizer` gives more advanced control, such as tokenizing the words and removing punctuation in one pass (http://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#module-nltk.tokenize.regexp)
- `TweetTokenizer` is designed specifically for tweets from Twitter (http://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#nltk.tokenize.casual.TweetTokenizer)
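
To make the difference from a plain `split` concrete, here is a standard-library sketch that needs no NLTK data downloads. The helpers `regex_tokenize` and `naive_sent_tokenize` are made up for illustration and are much cruder than the NLTK tokenizers listed above.

```python
import re

text = "The dogs aren't cold. They're hungry, and lonely!"

# Plain whitespace split: punctuation stays glued to the words
split_tokens = text.split()

def regex_tokenize(s):
    """Keep runs of word characters and apostrophes; drop other punctuation."""
    return re.findall(r"[\w']+", s)

def naive_sent_tokenize(s):
    """Split after ., ! or ? when followed by whitespace (fails on abbreviations)."""
    return re.split(r"(?<=[.!?])\s+", s.strip())

print(split_tokens)
print(regex_tokenize(text))
print(naive_sent_tokenize(text))
```

The naive sentence splitter breaks on abbreviations such as "Dr." or "U.S.", which is one reason to prefer a trained tokenizer like `sent_tokenize`.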

## Stopword removal

- Makes the set of tokens smaller while keeping the text mostly readable
- If left in, these common stop words usually dominate the list of most frequent words
- Which words count as stop words depends on the context of the task

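```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time download of the stopword lists
eng_stopwords = set(stopwords.words('english'))  # a set makes membership tests fast

# Example tokens; in practice these come from the tokenization step
words = ['the', 'dogs', 'in', 'alaska', 'are', 'cold']
words = [w for w in words if w not in eng_stopwords]
```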

## Stemming 

- Reduces words to a root form --> less complexity while keeping most of the meaning
- Fast and crude
- Not all stems are actual _words_
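
The sketch below is a deliberately crude, made-up suffix stripper (`toy_stem` is not an NLTK function) that shows why stemming is fast but can produce non-words. For real work, NLTK ships proper stemmers such as `PorterStemmer` in `nltk.stem`.

```python
def toy_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Keep at least three characters so short words survive untouched
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "jumps", "ponies"]:
    print(w, "->", toy_stem(w))
```

Here "ponies" is cut down to "pon", which is not a word. That is often acceptable as long as all variants of a word collapse to the same stem.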

## Lemmatization

- Uses a dictionary to map variants to their root word
- Converts to an actual _word_
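
The dictionary idea can be sketched with a tiny hand-written lookup table. The `LEMMAS` table and `toy_lemmatize` are made up for illustration; in NLTK, `WordNetLemmatizer` does this against the WordNet dictionary (which has to be downloaded first).

```python
# Tiny hand-written lemma dictionary, made up for illustration
LEMMAS = {
    "was": "be", "were": "be", "is": "be", "are": "be",
    "mice": "mouse", "geese": "goose",
    "ran": "run", "running": "run",
}

def toy_lemmatize(word):
    """Look the word up; unknown words pass through unchanged."""
    return LEMMAS.get(word.lower(), word)

lemmas = [toy_lemmatize(w) for w in ["The", "mice", "were", "running"]]
print(lemmas)
```

Unlike the stemmer's output, every result here is a real word, at the cost of needing a dictionary entry for each variant.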

## Note on Part-of-Speech (POS) Tagging

> A simple but limited solution since someone has to laboriously label the entire corpus. This process is extremely error-prone.
>
> There are other strategies to learn sentence structure and tags (HMMs & RNNs)
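
One way to read the "simple but limited solution" above is as a lookup table built from a hand-labeled corpus. Assuming that reading, the sketch below tags each word with its most frequent label. The tiny corpus, the tag names, and the `lookup_tag` helper are all made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled corpus, made up for illustration
tagged = [
    [("I", "PRON"), ("shot", "VERB"), ("an", "DET"), ("elephant", "NOUN")],
    [("the", "DET"), ("shot", "NOUN"), ("missed", "VERB")],
    [("I", "PRON"), ("shot", "VERB"), ("the", "DET"), ("target", "NOUN")],
]

# Count how often each word appears with each tag
counts = defaultdict(Counter)
for sent in tagged:
    for word, tag in sent:
        counts[word.lower()][tag] += 1

# Lookup table: the most frequent tag for each word
table = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def lookup_tag(tokens, default="NOUN"):
    """Tag each token from the table; unseen words get the default tag."""
    return [(t, table.get(t.lower(), default)) for t in tokens]

result = lookup_tag(["the", "shot", "missed"])
print(result)
```

In "the shot missed" the word "shot" is a noun, but the table has seen it more often as a verb and tags it `VERB`. A lookup table ignores context, which is the gap that sequence models such as HMMs and RNNs try to close.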

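```python
from nltk.corpus import treebank

# First parsed sentence of the Penn Treebank sample (requires the 'treebank' corpus)
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()  # opens the tree in a separate window
```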

```python
import os
from IPython.display import Image, display
from nltk.draw import TreeWidget
from nltk.draw.util import CanvasFrame

def jupyter_draw_nltk_tree(tree, fn):
    """Render an NLTK tree and save it as the PostScript file `<fn>.ps`."""
    cf = CanvasFrame()
    tc = TreeWidget(cf.canvas(), tree)
    # Fonts and colors for inner nodes, leaves, and connecting lines
    tc['node_font'] = 'arial 13 bold'
    tc['leaf_font'] = 'arial 14'
    tc['node_color'] = '#005990'
    tc['leaf_color'] = '#3F8F57'
    tc['line_color'] = '#175252'
    cf.add_widget(tc, 10, 10)
    cf.print_to_file(f'{fn}.ps')
    cf.destroy()
```

```python
# Source: https://www.nltk.org/book/ch08.html#ubiquitous-ambiguity
import nltk
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)

for i, tree in enumerate(parser.parse(sent)):
    print(tree)
    jupyter_draw_nltk_tree(tree, '_'.join(sent) + f'_{i}')
```

```
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
```


![](images/I_shot_an_elephant_in_my_pajamas_0.png)

![](images/I_shot_an_elephant_in_my_pajamas_1.png)