# Data Preparation

You cannot go straight from raw text to fitting a machine learning or deep learning model. You must clean your text first, which means splitting it into words and handling punctuation and case.

There is a whole suite of text preparation methods that you may need to use, and the right choice depends on your natural language processing task. This tutorial covers:

- How to get started by developing your own very simple text cleaning tools
- How to take a step up and use the more sophisticated methods in the NLTK library
- How to prepare text when using modern text representation methods like word embeddings

## Metamorphosis by Franz Kafka

We will use the text from the book _Metamorphosis_ by _Franz Kafka_.

The original file contains header and footer information that we are not interested in, specifically copyright and license information. This has been removed from the copy used here.

## Text Cleaning is Task Specific

Once you have your text data, the first step in cleaning it is to have a clear idea of what you are trying to achieve and, in that context, to review the text to see what might help. A quick look at this text shows:

- It's plain text so there is no markup to parse.
- The translation of the original German uses UK English.
- The lines are artificially wrapped with new lines at about 70 characters.
- There are no obvious typos or spelling mistakes.
- There's punctuation like commas, apostrophes, quotes, question marks, and more.
- There are hyphenated descriptions like 'armour-like'.
- There's a lot of use of the em dash ('-') to continue sentences (maybe replace with commas).
- There are names/proper nouns.
- There do not appear to be numbers that require handling (e.g. 1999).
- There are section markers (chapters, sections), and we have removed the first.
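
Observations like these can be checked rather than eyeballed. The following is a minimal sketch, assuming a short stand-in string (`sample`, not the actual book text) and a hypothetical helper named `profile_text`. It reports the longest line length, which punctuation characters occur, and whether any digits are present.

```python
import string
from collections import Counter


def profile_text(text):
    """Summarize a few surface properties of a text (hypothetical helper)."""
    lines = text.splitlines()
    # Count only the characters that string.punctuation treats as punctuation.
    punct_counts = Counter(ch for ch in text if ch in string.punctuation)
    return {
        'max_line_length': max((len(line) for line in lines), default=0),
        'punctuation': punct_counts,
        'has_digits': any(ch.isdigit() for ch in text),
    }


# Stand-in sample, not the real file contents.
sample = 'He lay on his armour-like back,\nand "What\'s happened to me?" he thought.'
report = profile_text(sample)
print(report['max_line_length'])
print(report['punctuation'].most_common())
print(report['has_digits'])
```

Running the same helper on the full loaded text would show at a glance whether the wrapping, punctuation, and number observations hold for the whole document.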

We are going to look at general text cleaning steps in this tutorial. Nevertheless, consider some possible objectives we may have when working with this text document.

- If we were interested in developing a _Kafka-esque_ language model, we may want to keep all of the case, quotes, and other punctuation in place.
- If we were interested in classifying documents as '**Kafka**' and '**Not Kafka**', maybe we would want to strip case, punctuation, and even trim words back to their stem.

## Manual Tokenization

### 1. Load Data

In [1]:
# load text; the with block closes the file automatically
filename = './res/data/metamorphosis_clean.txt'
with open(filename, 'rt') as file:
    text = file.read()

### 2. Split by Whitespace


Clean text often means a list of words or tokens that we can work with in our machine learning models. This means converting the raw text into a list of words and saving it again.

A simple way to do this would be to split the document by whitespace, including " ", new lines, tabs and more. We can do this in Python with the split() function on the loaded string.
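
To see what splitting on whitespace means in practice, here is a small sketch on a made-up string (`sample` is an illustration, not text from the book). Called with no arguments, split() treats any run of spaces, tabs, and new lines as a single separator and drops leading and trailing whitespace.

```python
# Made-up sample mixing spaces, a tab, and new lines.
sample = '  One morning,\twhen Gregor\nSamsa   woke\n'

# With no arguments, split() collapses runs of whitespace.
tokens = sample.split()
print(tokens)

# Passing an explicit separator behaves differently: empty strings appear
# wherever two separators are adjacent.
spaced = 'a  b'.split(' ')
print(spaced)
```

This is why the artificially wrapped lines in the book need no special handling at this step: the new lines are consumed as separators like any other whitespace.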

In [4]:
# split into words by whitespace
words = text.split()
print(words[:100])

['\ufeffOne', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
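
The first token in the output above begins with '\ufeff', a byte order mark (BOM) left at the start of the file. A minimal sketch of one way to drop it, assuming the file is UTF-8 encoded: open it with the 'utf-8-sig' codec, which strips a leading BOM if one is present. The temporary file below is a stand-in for the real data file.

```python
import os
import tempfile

# Write a small stand-in file that starts with a UTF-8 byte order mark.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfOne morning, when Gregor Samsa woke')
    path = tmp.name

# Plain UTF-8 decoding keeps the BOM as the character '\ufeff'.
with open(path, 'rt', encoding='utf-8') as f:
    with_bom = f.read().split()

# The 'utf-8-sig' codec removes a leading BOM while decoding.
with open(path, 'rt', encoding='utf-8-sig') as f:
    without_bom = f.read().split()

os.remove(path)

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

The remaining outputs in this tutorial were produced without this change, so they still show the BOM on the first token.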


### 3. Split by Whitespace and Remove Punctuation

In [7]:
import string

In [9]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


We can use the function str.maketrans() to create a mapping table. The first two arguments can be empty strings, which creates no character-to-character mappings, while the third argument lists all of the characters to remove during the translation process.
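
A small sketch of what such a mapping table looks like, using a two-character removal set chosen only for illustration. Each character in the third argument is mapped to None, which translate() interprets as "delete this character".

```python
# Remove only commas and question marks (illustrative choice).
small_table = str.maketrans('', '', ',?')

# The table is a dict keyed by Unicode code point; None means "delete".
print(small_table)

cleaned_a = 'morning,'.translate(small_table)
cleaned_b = 'me?"'.translate(small_table)
print(cleaned_a)
print(cleaned_b)
```

Characters that are not in the table, such as the closing quote in the second example, pass through unchanged.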

In [12]:
# make a translation table for str.translate() function, then translate
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]

print(stripped[:100])


['\ufeffOne', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']


### 4. Normalizing Case

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. 'Apple' the company vs. 'apple' the fruit). We can convert all words to lowercase by calling the lower() function on each word.

In [13]:
# convert every token to lowercase
lower = [w.lower() for w in stripped]
print(lower[:100])

['\ufeffone', 'morning', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'he', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'his', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'whats', 'happened', 'to', 'me', 'he', 'thought', 'it', 'wasnt', 'a', 'dream', 'his', 'room', 'a', 'proper', 'human']


Remember, simple is better.

Simpler text data, simpler models, smaller vocabularies. You can always make things more complex later to see if it results in better model skill.
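
The effect on vocabulary size can be measured directly. This sketch uses a short made-up token list (not the book) and compares the number of distinct tokens before and after lowercasing.

```python
# Made-up tokens; 'He'/'he' and 'His'/'his' differ only by case.
tokens = ['He', 'lay', 'on', 'his', 'back', 'His', 'room', 'he', 'thought']

vocab_original = set(tokens)
vocab_lower = {t.lower() for t in tokens}

print(len(vocab_original))
print(len(vocab_lower))
```

Running the same comparison on the full token lists from the earlier steps gives a concrete number for how much case normalization shrinks the vocabulary for this text.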

## Tokenization and Cleaning with **NLTK**

The **Natural Language Toolkit** (NLTK) is a Python library written for working with and modeling text. It provides good tools for loading and cleaning text that we can use to get our data ready for working with machine learning and deep learning algorithms.

In [15]:
import nltk

### 2. Split into Sentences

A useful first step is to split the text into sentences.

Some modeling tasks prefer input to be in the form of paragraphs or sentences, such as _word2vec_. You could first split your text into sentences, split each sentence into words, then save each sentence to file, one per line.
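
The workflow just described can be sketched as follows. NLTK's sent_tokenize() is the usual tool for the sentence split, but it needs NLTK and its tokenizer data installed, so this dependency-free sketch swaps in a naive regular expression (split after '.', '!' or '?' followed by whitespace) as a stand-in. The helper name `naive_sentences`, the sample text, and the output file are assumptions for illustration.

```python
import os
import re
import tempfile


def naive_sentences(text):
    """Naive stand-in for a real sentence tokenizer such as NLTK's sent_tokenize()."""
    # Undo the artificial line wrapping, then split after ., ! or ? plus whitespace.
    flat = ' '.join(text.split())
    return [s for s in re.split(r'(?<=[.!?])\s+', flat) if s]


# Made-up sample with a wrapped line, not the real file contents.
sample = 'He lay on his back. It was not\na dream. What had happened?'
sentences = naive_sentences(sample)

# Split each sentence into words, then write one sentence per line.
out_path = os.path.join(tempfile.mkdtemp(), 'sentences.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    for sentence in sentences:
        words = sentence.split()
        f.write(' '.join(words) + '\n')

print(sentences)
```

A regex like this breaks on abbreviations and on sentences that end inside quotation marks, which is the main reason to prefer a trained tokenizer for real text.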
