## **How to Clean Text Manually and with NLTK**

**1. Text Cleaning Is Task Specific**

**2. Manual Tokenization**

Text cleaning is hard, but the text we have chosen to work with is pretty clean already. We
could just write some Python code to clean it up manually, and this is a good exercise for those
simple problems that you encounter. Tools like regular expressions and splitting strings can get
you a long way.

**2.1 Load Data**

Let's load the text data so that we can work with it. The text is small and will load quickly
and easily fit into memory. This will not always be the case and you may need to write code
to memory map the file. Tools like NLTK (covered in the next section) will make working
with large files much easier. We can load the entire metamorphosis_clean.txt into memory as
follows:

In [1]:
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
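
For files that do not fit comfortably in memory, the memory-mapping idea mentioned above might look like the following sketch. The helper name `iter_words_mmap` is hypothetical, and the sketch assumes a non-empty UTF-8 file; it relies only on Python's standard `mmap` module.

```python
import mmap


def iter_words_mmap(path, encoding='utf-8'):
    """Yield whitespace-separated words without reading the whole file at once."""
    with open(path, 'rb') as handle:
        # length 0 maps the whole file; mmap raises ValueError for an empty file
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # readline() returns b'' at the end of the mapping, which stops iter()
            for line in iter(mapped.readline, b''):
                yield from line.decode(encoding).split()
```

Calling `list(iter_words_mmap('metamorphosis_clean.txt'))` would produce the same words as reading the file and calling `split()`, but the operating system pages the file in as needed.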

**2.2 Split by Whitespace**

Clean text often means a list of words or tokens that we can work with in our machine learning
models. This means converting the raw text into a list of words and saving it again. A very
simple way to do this would be to split the document by white space, including " " (space), new
lines, tabs and more. We can do this in Python with the split() function on the loaded string.

In [2]:
# load text 
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

# split into words by white space
words = text.split()
print(words[:100])

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
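
As a small illustration of why `split()` is called without arguments, the sketch below compares it with splitting on a single space. The sample string is made up for the example.

```python
sample = "One morning,\twhen  Gregor\nSamsa woke"

# no argument: split on any run of white space and drop empty strings
by_whitespace = sample.split()
print(by_whitespace)
# ['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke']

# explicit single space: tabs and new lines stay inside tokens,
# and the doubled space yields an empty string
by_space = sample.split(' ')
print(by_space)
# ['One', 'morning,\twhen', '', 'Gregor\nSamsa', 'woke']
```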


**2.3 Select Words**

Another approach might be to use the regex module (re) and split the document into words by
selecting for strings of alphanumeric characters (a-z, A-Z, 0-9 and '_'). For example:

In [7]:
import re
# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
words = re.split(r'\W+', text)
print(words[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']


Again, running the example we can see that we get our list of words. This time, we can see
that armour-like is now two words, armour and like (fine), but contractions like What's are also
split into two words, What and s (not great).
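
One possible variation, shown here as a sketch and not as part of the original example, is to select words with `re.findall()` and allow an apostrophe inside a word. The pattern is an assumption chosen for illustration. The sketch also shows that `re.split()` returns empty strings when the text starts or ends with a non-word character.

```python
import re

sample = '"What\'s happened to me?" he thought.'

# splitting on non-word characters leaves empty strings at both ends
split_words = re.split(r'\W+', sample)
print(split_words)
# ['', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', '']

# selecting word characters directly avoids the empty strings
plain_words = re.findall(r'\w+', sample)
print(plain_words)
# ['What', 's', 'happened', 'to', 'me', 'he', 'thought']

# allowing an apostrophe between word characters keeps contractions together
contraction_words = re.findall(r"\w+(?:'\w+)*", sample)
print(contraction_words)
# ["What's", 'happened', 'to', 'me', 'he', 'thought']
```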

**2.4 Split by Whitespace and Remove Punctuation**

We may want the words, but without punctuation such as commas and quotes. We also want to
keep contractions together. One way would be to split the document into words by white space
(as in the section Split by Whitespace), then use string translation to replace all punctuation with
nothing (i.e. remove it). Python provides a constant called string.punctuation that holds a
list of punctuation characters. For example:

In [10]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
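
The string translation approach described above can be sketched with `str.maketrans()` and `str.translate()`. The token list is a small hand-picked sample and not the full document.

```python
import string

# a translation table that maps every punctuation character to None (delete it)
table = str.maketrans('', '', string.punctuation)

sample_words = ['"What\'s', 'happened', 'to', 'me?"', 'armour-like', 'back,']
stripped_sample = [w.translate(table) for w in sample_words]
print(stripped_sample)
# ['Whats', 'happened', 'to', 'me', 'armourlike', 'back']
```

Note that contractions and hyphenated words stay together as single tokens, but they lose their apostrophe or hyphen.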


We can use regular expressions to select for the punctuation characters and use the sub()
function to replace them with nothing. For example:

In [19]:
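import re
import string

# build a pattern that matches any single punctuation character
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# split by white space (text was loaded in the cells above)
words = text.split()
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])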


**3. Tokenization and Cleaning with NLTK**

The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and
modeling text. It provides good tools for loading and cleaning text that we can use to get our
data ready for working with machine learning and deep learning algorithms.

**3.1 Install NLTK**

You can install NLTK with a package manager such as pip (for example, pip install -U nltk).
After installation, you will need to install the data used with the library, including a great
set of documents that you can use later for testing other tools in NLTK. There are a few ways to
do this, such as from within a script:

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Or from the command line:
python -m nltk.downloader all

**3.2 Split into Sentences**

A useful first step is to split the text into sentences. Some modeling tasks, such as Word2Vec,
prefer input in the form of paragraphs or sentences. You could first split your
text into sentences, split each sentence into words, then save each sentence to file, one per line.
NLTK provides the sent_tokenize() function to split text into sentences. The example below
loads the metamorphosis_clean.txt file into memory, splits it into sentences, and prints the
first sentence.

In [2]:
from nltk import sent_tokenize

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

# split into sentences
sentences = sent_tokenize(text)
print(sentences[0])

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.
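
The "one sentence per line" idea described above could be sketched as follows. The helper name `save_sentences` and the output file name are hypothetical. The function accepts any list of sentences, such as the result of `sent_tokenize(text)`, and by default splits each sentence on white space as a stand-in for a real word tokenizer.

```python
def save_sentences(sentences, path, tokenize=str.split):
    """Write one sentence per line, with tokens separated by single spaces."""
    with open(path, 'w', encoding='utf-8') as handle:
        for sentence in sentences:
            tokens = tokenize(sentence)
            if tokens:  # skip sentences that produce no tokens
                handle.write(' '.join(tokens) + '\n')
```

For example, `save_sentences(sentences, 'metamorphosis_sentences.txt')` would write each sentence on its own line. Because the tokens are re-joined with spaces, a sentence that wraps across lines in the source text ends up on a single line.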


**3.3 Split into Words**

NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally
words). It splits tokens based on white space and punctuation. For example, commas and
periods are taken as separate tokens. Contractions are split apart (e.g. What's becomes What
and 's). Quotes are kept, and so on. For example:

In [3]:
from nltk.tokenize import word_tokenize
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
tokens = word_tokenize(text)
print(tokens[:100])

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']


**3.4 Filter Out Punctuation**

We can filter out all tokens that we are not interested in, such as all standalone punctuation. This
can be done by iterating over all tokens and only keeping those tokens that are all alphabetic.
Python strings provide the isalpha() method, which can be used for this. For example:

In [7]:
from nltk.tokenize import word_tokenize
# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])


['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']
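
The `isalpha()` filter is strict: it also drops tokens such as armour-like and 's, which is why they are missing from the output above. The sketch below, using a hand-picked token list, contrasts it with a more lenient filter that keeps any token containing at least one letter. The lenient variant is an assumption offered for comparison and is not part of the original example.

```python
sample_tokens = ['``', 'What', "'s", 'happened', 'armour-like', 'back', ',', 'me', '?']

# strict: every character in the token must be alphabetic
strict = [t for t in sample_tokens if t.isalpha()]
print(strict)
# ['What', 'happened', 'back', 'me']

# lenient: keep tokens that contain at least one alphabetic character
lenient = [t for t in sample_tokens if any(ch.isalpha() for ch in t)]
print(lenient)
# ['What', "'s", 'happened', 'armour-like', 'back', 'me']
```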



**3.5 Filter out Stop Words (and Pipeline)**

Stop words are those words that do not contribute to the deeper meaning of the phrase. They
are the most common words, such as the, a, and is. For some applications, like document
classification, it may make sense to remove stop words. NLTK provides a list of commonly
agreed upon stop words for a variety of languages, such as English. They can be loaded as
follows:

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

You can see that they are all lower case and have punctuation removed. You could compare
your tokens to the stop words and filter them out, but you must ensure that your text is prepared
the same way. Let's demonstrate this with a small pipeline of text preparation including:
- Load the raw text.
- Split into tokens.
- Convert to lowercase.
- Remove punctuation from each token.
- Filter out remaining tokens that are not alphabetic.
- Filter out tokens that are stop words.
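
The steps listed above, after loading the text, might be combined as in the following sketch. The function name `clean_tokens` and its parameters are hypothetical. To keep the example self-contained, it defaults to whitespace splitting and uses a tiny made-up stop word set; these are stand-ins, not NLTK's tokenizer or stop word list.

```python
import string


def clean_tokens(text, tokenize=str.split, stop_words=frozenset()):
    """Apply the pipeline steps to raw text and return the cleaned words."""
    # split into tokens
    tokens = tokenize(text)
    # convert to lowercase
    tokens = [t.lower() for t in tokens]
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    stripped = [t.translate(table) for t in tokens]
    # keep only alphabetic tokens (this also drops tokens that became empty)
    words = [w for w in stripped if w.isalpha()]
    # filter out stop words
    return [w for w in words if w not in stop_words]


demo_stop_words = {'the', 'a', 'is', 'it', 'to', 'he'}  # tiny hypothetical list
sample = 'One morning, when Gregor Samsa woke, he found it hard to move.'
cleaned = clean_tokens(sample, stop_words=demo_stop_words)
print(cleaned)
# ['one', 'morning', 'when', 'gregor', 'samsa', 'woke', 'found', 'hard', 'move']
```

To use NLTK instead of the stand-ins, you could pass `tokenize=word_tokenize` and `stop_words=set(stopwords.words('english'))`. Because punctuation is stripped before the stop word comparison, any stop words that contain an apostrophe would need the same treatment in order to match.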