# NLP Basics: Implementing A Pipeline To Clean Text

### Pre-processing Text Data

Cleaning the text data highlights the attributes you want your machine learning model to pick up on. This lesson covers three pre-processing steps:
1. Remove punctuation
2. Tokenization
3. Remove stopwords

In [None]:
# Read in raw data and clean up the column names
import pandas as pd
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages.head()

### Remove Punctuation

In [None]:
# What punctuation is included in the default list?
import string

string.punctuation

In [None]:
# Why is it important to remove punctuation?
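One way to answer this, as a minimal sketch: to Python (and to any model that counts tokens) a sentence with a trailing period and the same sentence without one are different strings. The helper name `strip_punct` and the two sample sentences are made up for illustration and are not part of the dataset.

```python
import string

def strip_punct(text):
    # Hypothetical helper: drop every character listed in string.punctuation
    return "".join(char for char in text if char not in string.punctuation)

with_period = "This message is spam."
without_period = "This message is spam"

# The raw strings differ only by the final period, yet they do not compare equal
raw_equal = with_period == without_period

# Once punctuation is stripped, both collapse to the same text
clean_equal = strip_punct(with_period) == strip_punct(without_period)

print(raw_equal, clean_equal)
```

Removing punctuation up front keeps such near-duplicates from being treated as distinct features later on.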

In [None]:
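# Define a function to remove punctuation in our messages
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

# Create the text_clean column that the tokenization step reads from
messages['text_clean'] = messages['text'].apply(lambda x: remove_punct(x))

messages.head()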

### Tokenize

In [None]:
# Define a function to split our sentences into a list of words
import re

def tokenize(text):
    # Raw string so that \W is passed to the regex engine unchanged
    tokens = re.split(r'\W+', text)
    return tokens

messages['text_tokenized'] = messages['text_clean'].apply(lambda x: tokenize(x.lower()))

messages.head()
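A small sketch of how splitting on `\W+` behaves on a single string may help here. Note that `re.split` returns an empty string when the text begins or ends with non-word characters. The `tokenize_nonempty` variant below is a hypothetical alternative based on `re.findall`, not something the notebook uses, and the sample sentence is invented.

```python
import re

def tokenize(text):
    # Split on runs of non-word characters
    return re.split(r'\W+', text)

def tokenize_nonempty(text):
    # Hypothetical variant: collect runs of word characters instead,
    # so no empty tokens can appear at the edges
    return re.findall(r'\w+', text)

sample = "free entry, text now!"

split_tokens = tokenize(sample)
found_tokens = tokenize_nonempty(sample)

print(split_tokens)   # the trailing "!" leaves an empty string at the end
print(found_tokens)
```

Because punctuation is removed before tokenizing, the empty token mostly shows up when a message has leading or trailing whitespace. Filtering out empty strings afterwards is another reasonable option.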

### Remove Stopwords

In [None]:
# What does an example look like?
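As an illustration, here is a tokenized sentence with its stopwords picked out. The sentence and the small `sample_stopwords` set are assumptions made for this sketch. The set is a hand-picked stand-in for a real stopword list.

```python
# Hand-picked stand-in for a stopword list (an assumption for this sketch)
sample_stopwords = {"i", "am", "to", "the", "a", "for"}

# An already cleaned and tokenized sentence
example_tokens = ["i", "am", "going", "to", "the", "store", "for", "a", "new", "phone"]

# Tokens that carry little meaning on their own
stopword_hits = [tok for tok in example_tokens if tok in sample_stopwords]

# Tokens that carry most of the sentence's meaning
content_words = [tok for tok in example_tokens if tok not in sample_stopwords]

print(stopword_hits)
print(content_words)
```

More than half of the tokens in this sentence are stopwords, yet the remaining words still convey what the sentence is about.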

In [None]:
# Load the list of stopwords built into nltk
# (if the corpus is not installed yet, run nltk.download('stopwords') once first)
import nltk

stopwords = nltk.corpus.stopwords.words('english')

In [None]:
# Define a function to remove all stopwords
def remove_stopwords(tokenized_text):
    text = [word for word in tokenized_text if word not in stopwords]
    return text

messages['text_nostop'] = messages['text_tokenized'].apply(lambda x: remove_stopwords(x))

messages.head()

In [None]:
# Remove stopwords in our example
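The three steps can be chained on a single example sentence, roughly as sketched below. The function name `clean_text`, the example sentence, and the `sample_stopwords` set are all assumptions for illustration. In the notebook, the list loaded from nltk would take the place of the hand-picked set.

```python
import re
import string

# Hypothetical stand-in for the nltk English stopword list
sample_stopwords = {"i", "am", "to", "the", "a", "for", "is", "this"}

def clean_text(text, stopword_set=sample_stopwords):
    # 1. Remove punctuation
    no_punct = "".join(char for char in text if char not in string.punctuation)
    # 2. Lowercase, then tokenize on runs of non-word characters
    tokens = re.split(r'\W+', no_punct.lower())
    # 3. Drop stopwords, plus any empty tokens left over from the split
    return [tok for tok in tokens if tok and tok not in stopword_set]

example = "I am going to the store for a new phone!"
cleaned = clean_text(example)

print(cleaned)
```

Using a set for the stopwords keeps each membership check fast, which starts to matter once the function is applied to every row of a DataFrame.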