# Preprocessing Text

## Where did the _text_ originate?

Where the text came from determines how we preprocess it.

Examples:
- Speech --> Convert into text/words
- Web pages --> Remove HTML tags
- Word docs, other text formats --> More "junk" to consider

## Removing irrelevant information

> The dogs in Alaska are cold, hungry, and lonely.

- Punctuation likely can be removed without drastically changing the meaning
- Capitalization rarely changes meaning
- Some common words really don't add to meaning: "a", "the", "are", "of"


## Useful tools 

### Introducing Natural Language Toolkit (NLTK)

NLTK is a great library that can help with preprocessing text as well as feature extraction.

### Regular Expression (Regex)

A useful way to move through language structurally by matching patterns (won't go through it here; lots of resources are available)

<img src='https://imgs.xkcd.com/comics/regular_expressions.png' width=60%/>

I personally like this web app for testing out pattern matching: [Regexr](https://regexr.com/)

# Steps to Processing Text

## Cleaning

Use regex and packages like [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to get rid of the extra junk so that only the natural language material remains.
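As a rough illustration of the regex route, the sketch below strips tags from a small HTML string using only the standard library. The helper name `strip_html` and the sample page are made up for this example, and the patterns are deliberately crude; for real pages a proper parser such as BeautifulSoup is likely the safer choice.

```python
import html
import re

def strip_html(raw):
    """Crudely remove tags and decode entities (hypothetical helper, not a full HTML parser)."""
    # Drop <script> and <style> blocks entirely, including their contents
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw)
    # Replace every remaining tag with a space so neighbouring words don't merge
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    # Turn entities such as &amp; back into plain characters
    text = html.unescape(no_tags)
    # Collapse the leftover runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

page = "<html><body><h1>Sled dogs</h1><p>The dogs in Alaska are cold &amp; hungry.</p></body></html>"
clean = strip_html(page)
print(clean)  # Sled dogs The dogs in Alaska are cold & hungry.
```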

## Normalization

- Capitalization
- Punctuation (dependent on task)
    - Useful for text document as a whole
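A minimal sketch of both normalization steps, using only the standard library. The function name `normalize` and its `strip_punctuation` flag are assumptions made for this example; the flag is there because whether to drop punctuation depends on the task.

```python
import string

def normalize(text, strip_punctuation=True):
    """Lowercase the text and optionally drop ASCII punctuation."""
    text = text.lower()
    if strip_punctuation:
        # str.maketrans with a third argument maps those characters to None (deletes them)
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text

sentence = "The dogs in Alaska are cold, hungry, and lonely."
print(normalize(sentence))                           # the dogs in alaska are cold hungry and lonely
print(normalize(sentence, strip_punctuation=False))  # the dogs in alaska are cold, hungry, and lonely.
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need separate handling.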

## Tokenization

A token (a symbol) holds meaning and can't meaningfully be split up any further (in English, tokens are usually words)

`nltk.tokenize` has a variety of tokenizers (http://www.nltk.org/api/nltk.tokenize.html): 

- `sent_tokenize` finds sentences (often done for translation)
- `word_tokenize` is like `split` but is a little smarter in how it tokenizes the text
- `RegexpTokenizer` gives more advanced control, such as tokenizing the words and removing punctuation in one pass (http://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#module-nltk.tokenize.regexp)
- `TweetTokenizer` is designed specifically for tweets from Twitter (http://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#nltk.tokenize.casual.TweetTokenizer)
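To see why a plain `split` is often not enough, the sketch below compares it with a regex-based tokenization using only the standard library. It is a stand-in for the NLTK tokenizers rather than a demonstration of them; the pattern `\w+` is the kind of pattern one could also hand to NLTK's `RegexpTokenizer`.

```python
import re

text = "The dogs in Alaska are cold, hungry, and lonely."

# Naive whitespace split: punctuation stays glued to the neighbouring word
split_tokens = text.split()

# Regex tokenization: keep only runs of word characters, which drops punctuation
word_tokens = re.findall(r"\w+", text)

print(split_tokens)  # includes 'cold,' and 'lonely.'
print(word_tokens)   # includes 'cold' and 'lonely'
```

A pattern this simple also splits contractions such as "don't" into two pieces, which is one reason the smarter tokenizers listed above exist.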

## Stopword removal

- Makes the set of tokens smaller while leaving the text mostly readable
- These common stop words usually dominate word-frequency lists
- Which words count as stop words depends on the context of the task

```python
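import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time download of the stop word lists

# Example tokens; in practice these come from the tokenization step
words = ['the', 'dogs', 'in', 'alaska', 'are', 'cold', 'hungry', 'and', 'lonely']

# A set makes membership tests fast; NLTK's English list is all lowercase
eng_stopwords = set(stopwords.words('english'))

words = [w for w in words if w.lower() not in eng_stopwords]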
```
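The claim that stop words dominate the word list can be checked with a quick frequency count. The sketch below uses a tiny hand-picked stop word set (`demo_stopwords`, made up for this example and not NLTK's list) and a toy sentence, so it runs without any downloads.

```python
from collections import Counter

# Hypothetical, hand-picked stop word set for illustration only
demo_stopwords = {"the", "in", "are", "and", "a", "of"}

tokens = "the dogs in the north are cold and the dogs in the south are warm".split()

# Word frequencies before and after dropping the stop words
before = Counter(tokens)
after = Counter(t for t in tokens if t not in demo_stopwords)

print(before.most_common(1))  # [('the', 4)]
print(after.most_common(1))   # [('dogs', 2)]
print(len(tokens), sum(after.values()))  # 15 6
```

Here more than half of the tokens disappear, and the most frequent remaining word is one that actually says something about the content.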