# Preprocessing Text

Working with text data in its raw form is rarely a good idea. Raw text is messy and full of low-value noise. To build reliable analyses on this type of data, we first need to preprocess it to reduce that noise.

In [None]:
import nltk
import pandas as pd
import numpy as np

# Widen the max column width so long strings are not truncated when DataFrames display
pd.options.display.max_colwidth = 500

### Define a sample passage to begin with

In [None]:
text = """Overfitting means that a model was trained too well and it's fitting too closely to the training dataset. 
A model has been overfit when the model is too complex (i.e. too many features/variables compared to the number of observations). 
An overfit model may achieve greater than 90 percent accuracy on the training set, but will likely perform poorly on the test set."""

# Preprocessing

### Expand contractions

In [None]:
import contractions

no_contractions = contractions.fix(text)

pd.DataFrame({"Before": [text], "After": [no_contractions]}).T

### Lowercase words

In [None]:
lower_text = no_contractions.lower()

pd.DataFrame({"Before": [no_contractions], "After": [lower_text]}).T

### Remove digits

Using [regular expressions](https://docs.python.org/3/library/re.html), we can identify sequences of characters that follow a particular pattern (e.g., phone numbers, zip codes, or phrases that begin or end with a given word or character). In the cell below, we're removing any standalone numbers, meaning runs of digits with a word boundary (`\b`) on each side.

[Python regex cheat sheet](https://www.dataquest.io/blog/regex-cheatsheet/)

In [None]:
import re

no_digits = re.sub(r'\b\d+\b', '', lower_text)

pd.DataFrame({"Before": [lower_text], "After": [no_digits]}).T
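
To make the behavior of the `\b\d+\b` pattern concrete, here is a small sketch on a few made-up strings (the sample strings and the whitespace cleanup at the end are illustrative additions, not part of the original pipeline). It shows that digits attached to letters are left alone, while a decimal number loses both of its digit groups.

```python
import re

# Same pattern as the cell above: a run of digits with a word boundary on each side.
STANDALONE_NUMBER = re.compile(r'\b\d+\b')

digit_samples = [
    "greater than 90 percent accuracy",  # standalone number: removed
    "covid19 and the 2nd run",           # digits attached to letters: kept
    "version 3.5",                       # decimal: both digit groups removed
]

digits_removed = [STANDALONE_NUMBER.sub('', s) for s in digit_samples]
for before, after in zip(digit_samples, digits_removed):
    print(repr(before), '->', repr(after))

# Optional follow-up (an assumption, not in the original cell):
# collapse the double spaces left behind where a number was removed.
whitespace_collapsed = [re.sub(r'\s+', ' ', s).strip() for s in digits_removed]
```

In this notebook the leftover double spaces are harmless because the tokenizer in the next step ignores extra whitespace.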

### Tokenize text

NLTK's `word_tokenize` function is more advanced than Python's built-in `str.split` method. `word_tokenize` applies linguistic rules, so it handles compound terms, contractions, and punctuation much better than `split`, which only breaks on whitespace.

In [None]:
from nltk import word_tokenize

tokens = word_tokenize(no_digits)

pd.DataFrame({"Before": [no_digits], "After": [tokens]}).T
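
The difference from `split` is easiest to see side by side. The sketch below uses only the standard library: the regular expression is a rough stand-in that separates punctuation from words, and it is not the algorithm `word_tokenize` uses, which applies many more rules. The sample sentence is made up for illustration.

```python
import re

split_sample = "An overfit model performs poorly, sadly."

# str.split only breaks on whitespace, so punctuation stays glued to words.
whitespace_tokens = split_sample.split()

# Rough approximation of a tokenizer: words OR single non-space symbols.
approx_tokens = re.findall(r"\w+|[^\w\s]", split_sample)

print(whitespace_tokens)
print(approx_tokens)
```

With `split`, "poorly," and "poorly" would count as two different terms; separating the punctuation into its own tokens lets the next step remove it cleanly.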

### Remove punctuation

Python's `string` module provides a constant, `string.punctuation`, that contains the ASCII punctuation characters. This saves us from having to remember which punctuation characters to remove.

In [None]:
import string

# capture only the tokens that are not a part of the punctuation list
no_punctuation = [w for w in tokens if w not in string.punctuation]

pd.DataFrame({"Before": [tokens], "After": [no_punctuation]}).T
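
One subtlety: `string.punctuation` is a single string, so `w not in string.punctuation` is a substring test. It removes single-character tokens such as `,` and `.`, but multi-character punctuation tokens (for example the `` and '' tokens that Treebank-style tokenizers can emit for quotation marks) pass through. The sketch below compares that filter with a stricter per-character check; `is_punctuation` and the token list are hypothetical names and data introduced only for this illustration.

```python
import string

def is_punctuation(token):
    """Return True when the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

punct_demo_tokens = ['``', 'overfit', "''", 'model', ',', '...', 'i.e.', '.']

# Substring test, as in the cell above: only single-character punctuation is dropped here.
substring_filtered = [w for w in punct_demo_tokens if w not in string.punctuation]

# Per-character test: any token made up entirely of punctuation is dropped.
per_char_filtered = [w for w in punct_demo_tokens if not is_punctuation(w)]

print(substring_filtered)
print(per_char_filtered)
```

For the sample text in this notebook both filters give the same result, since it contains no quotation marks or ellipses.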

### Remove low-value words

In this cell, we're performing two tasks:

##### 1. Removing stopwords

Stopwords are frequently used words that add very little to the meaning of a sentence (e.g., 'a', 'the', 'of', 'and'). NLTK comes prepackaged with stopword lists for various languages. In the cell below, we store the English list as `stop_words` and then use `stop_words.extend()` to add further terms that we would like to remove. These additional words would come from your exploration of the data.

##### 2. Removing short words

We're also removing words that are shorter than a given length. In this scenario, we're only keeping words that are at least 3 characters long. Again, this threshold is something you would decide during your exploration.

In [None]:
from nltk.corpus import stopwords

# capture nltk's English stopwords list
stop_words = stopwords.words('english')
stop_words.extend(['said', 'would', 'subject', 'use', 'also', 'like', 'know', 'well', 'could', 'thing'])

# filter out stopwords and short words
no_stopwords = [w for w in no_punctuation if ((w not in stop_words) and (len(w) >= 3))]

pd.DataFrame({"Before": [no_punctuation], "After": [no_stopwords]}).T
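
The filter relies on exact string matching, so every custom stopword has to match a token character for character. The sketch below uses a tiny hand-written stopword set (hypothetical, far shorter than NLTK's list) to show both conditions at work, and stores the words in a `set`, which is an optional choice that makes membership checks faster on large corpora.

```python
# Hypothetical mini stopword list for illustration; NLTK's English list is much longer.
demo_stop_words = {'a', 'the', 'of', 'and', 'is', 'to', 'on', 'well', 'could'}
MIN_LENGTH = 3  # keep words with at least this many characters

stopword_demo_tokens = ['the', 'model', 'could', 'fit', 'well', 'on', 'ml', 'data']

kept_tokens = [
    w for w in stopword_demo_tokens
    if w not in demo_stop_words and len(w) >= MIN_LENGTH
]
print(kept_tokens)

# Matching is exact: an entry with stray whitespace never matches a clean token.
padded_entry_matches = 'could' in {' could'}
print(padded_entry_matches)
```

Note that 'ml' is dropped by the length rule even though it is not a stopword, which is worth keeping in mind when short domain terms matter.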

### Perform stemming

Stemming is a method of reducing the inflected forms of related terms to a common root form. The resulting stem is not always a real word. Harmonizing variations of a term helps represent that term's presence in the text more accurately.

[More information on stemming and lemmatization.](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stems = [ps.stem(w) for w in no_stopwords]

pd.DataFrame({"Before": [no_stopwords], "After": [stems]}).T
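
To see why harmonizing variations matters for counting, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which uses a much larger, ordered set of rules; `naive_stem` and its three-suffix rule set are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter

SUFFIXES = ('ing', 'ed', 's')  # hypothetical, deliberately tiny rule set

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

stem_demo_words = ['train', 'trains', 'trained', 'training']

raw_counts = Counter(stem_demo_words)                        # four separate terms
stem_counts = Counter(naive_stem(w) for w in stem_demo_words)  # one shared root

print(raw_counts)
print(stem_counts)
```

Without stemming, the four variations each appear once; after stemming they collapse into a single term with a count of four. The naive rules also show why real stemmers are more involved: `naive_stem('fitting')` returns 'fitt', because nothing handles the doubled consonant.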

### Perform lemmatization

[Lemmatization](https://pythonprogramming.net/lemmatizing-nltk-tutorial/) is very similar to stemming, except that it maps each word to its dictionary form (lemma), so the results are readable English words rather than truncated stems. This can be helpful when you need to prioritize the interpretability of your model's features. The drawback is that lemmatization does not harmonize variations of terms as aggressively as stemming.

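NLTK's `WordNetLemmatizer` is built on top of the large lexical database known as [WordNet](https://wordnet.princeton.edu/). Its `lemmatize` method takes a part-of-speech tag through the `pos` argument (`n` - noun (default), `v` - verb, `a` - adjective, `r` - adverb). Because `pos` defaults to noun, a verb form such as "trained" is returned unchanged unless you pass `pos='v'`, which is why the cell below lemmatizes each word twice, first as a noun and then as a verb.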

In [None]:
from nltk.stem import WordNetLemmatizer

wn = WordNetLemmatizer()

lemmas = []
for word in no_stopwords:
    clean_word = wn.lemmatize(word, pos='n')
    clean_word = wn.lemmatize(clean_word, pos='v')
    lemmas.append(clean_word)
    
pd.DataFrame({"Before": [no_stopwords], "After": [lemmas]}).T

## Before and After

In [None]:
pd.DataFrame({"Before": [text], "After": [lemmas]}).T

## Import the preprocessing function

In [None]:
from utils import preprocess_text

pd.DataFrame({"Before": [text], "After": [preprocess_text(text)]}).T