# Cleaning Text Data

### Text is Messy

You cannot go straight from raw text to fitting a machine learning or deep learning model.

You must clean your text first, which means splitting it into words and handling issues such as:

- Upper and lower case characters.
- Punctuation within and around words.
- Numbers such as amounts and dates.
- Spelling mistakes and regional variations.
- Unicode characters.
- And much more…

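
Two of the issues listed above, letter case and Unicode characters, can be handled with the standard library alone. The sketch below is only an illustration: `normalize_text` is a hypothetical helper name, and the choice of the NFKC normalization form is an assumption that may not suit every dataset.

```python
import unicodedata


def normalize_text(text):
    """Hypothetical helper: Unicode-normalize the text, then lowercase it."""
    # NFKC composes combining sequences (e + combining accent -> é) and
    # folds compatibility characters such as the 'fi' ligature into plain letters
    text = unicodedata.normalize('NFKC', text)
    return text.lower()


# 'Cafe' followed by a combining acute accent, then a word starting with the 'fi' ligature
sample = 'Cafe\u0301 \ufb01nal'
normalized = normalize_text(sample)
print(normalized)
```

After normalization, visually identical strings compare equal, which matters once tokens are counted or looked up in a vocabulary.
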
### Manual Tokenization

Generally, we refer to the process of turning raw text into something we can model as "tokenization", where we are left with a list of words or "tokens".

We can manually develop Python code to clean text, and often this is a good approach given that each text dataset must be tokenized in a unique way.

For example, the snippet of code below will load a text file, split tokens by whitespace and convert each token to lowercase.

    # load the text file ('...' is a placeholder for your own path)
    filename = '...'
    with open(filename, 'rt') as file:
        text = file.read()
    # split into words by white space
    words = text.split()
    # convert to lowercase
    words = [word.lower() for word in words]
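
Splitting on whitespace leaves punctuation attached to the words. One possible next step, sketched below on a made-up sample string rather than a file, is to delete punctuation characters from each token with a translation table. Treating every character in `string.punctuation` as noise is an assumption; it only covers ASCII punctuation and may be too aggressive for some tasks.

```python
import string

# made-up sample text, used here in place of a loaded file
text = "Hello, World! It's 2024: text-cleaning isn't trivial."

# split on whitespace and lowercase each token
words = [word.lower() for word in text.split()]

# build a translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
stripped = [word.translate(table) for word in words]
print(stripped)
```

Note that this also removes apostrophes and hyphens, so "isn't" becomes "isnt" and "text-cleaning" becomes "textcleaning". Whether that is acceptable depends on the dataset.
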

### NLTK Tokenization

Many of the best practices for tokenizing raw text have been captured and made available in a Python library called the Natural Language Toolkit, or NLTK for short.

You can install this library using pip by typing the following on the command line:

    pip install -U nltk

After it is installed, you must also download the data packages used by the library, either from a Python script (calling `nltk.download()` with no arguments opens an interactive downloader):

    import nltk
    nltk.download()

Or from the command line:

    python -m nltk.downloader all

Once the data is installed, you can use the API to tokenize text. For example, the snippet below will load and tokenize an ASCII text file.

    from nltk.tokenize import word_tokenize
    # load data ('...' is a placeholder for your own path)
    filename = '...'
    with open(filename, 'rt') as file:
        text = file.read()
    # split into words
    tokens = word_tokenize(text)
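
`word_tokenize` returns punctuation marks as separate tokens, so a filtering pass is often useful afterwards. The sketch below uses a hand-written token list as a stand-in for tokenizer output, so that it runs without NLTK or its data packages. The list is illustrative only and is not the result of running NLTK on a real file.

```python
# hand-written stand-in for the kind of list a tokenizer returns;
# it is NOT actual output from running NLTK on a file
tokens = ['The', 'price', 'rose', '5', '%', ',', 'then', 'fell', '.']

# keep purely alphabetic tokens and convert them to lowercase
words = [token.lower() for token in tokens if token.isalpha()]
print(words)
```

Filtering with `str.isalpha()` drops numbers as well as punctuation. If amounts or dates matter for the task, a less strict test would be needed.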