## Introduction
Natural Language Processing (NLP) is commonly used for text classification tasks such as spam detection and sentiment analysis, as well as text generation, language translation and document classification. Text data can be treated as a sequence of characters, a sequence of words or a sequence of sentences. For most problems, it is treated as a sequence of words. In this article we will delve into pre-processing using simple example text data. However, the material discussed here is applicable to any NLP task. In particular, we'll use TensorFlow 2.x Keras for text pre-processing, which includes:
- Tokenization
- Sequencing
- Padding

First, let's import required libraries.

In [1]:
import tensorflow as tf
#from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Tokenizer is an API available in TensorFlow Keras which is used to tokenize sentences. We define our text data as a Python list of strings, one sentence per string. There are 4 sentences, the longest of which has 5 words. Our text data also includes punctuation, as shown below.

In [2]:
sentences = ["I want to go out.",
             " I like to play.",
             " No eating - ",
             "No play!",
            ]
print(sentences)

['I want to go out.', ' I like to play.', ' No eating - ', 'No play!']


### Tokenization

As deep learning models do not understand text, we need to convert text into a numerical representation. The first step is tokenization. The Tokenizer API from TensorFlow Keras splits sentences into words and encodes them as integers. Below are the arguments and attributes of the Tokenizer API used here:
- num_words: The maximum number of words to keep, based on word frequency. Only the most frequent num_words - 1 words are kept when texts are converted to sequences.
- filters: Characters to strip from the text. If not provided, all punctuation except the apostrophe is removed, along with tabs and newlines (!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n).
- lower: Defaults to True, which converts all words to lower case.
- oov_token: When it is set, an out-of-vocabulary token is added to the word index. It replaces out-of-vocabulary words (words that are not in our corpus) during texts_to_sequences calls (see below).
- word_index: An attribute available after fitting, which maps every word to an integer index. The full list of words is available as a dictionary: key = word and value = token for that word.

Let's use the Tokenizer below and print out the word index. We have used num_words=100, which is far more than this data needs, as there are only 9 distinct words plus the <OOV> string for the out-of-vocabulary token.
   

In [3]:
tokenizer = Tokenizer(num_words=100, lower=True, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'i': 2, 'to': 3, 'play': 4, 'no': 5, 'want': 6, 'go': 7, 'out': 8, 'like': 9, 'eating': 10}


As seen above, each word in our sentences has been mapped to a numerical token. For instance, "i" has a value of 2. The tokenizer also ignored punctuation such as the exclamation mark after a word. For example, "play" and "play!" share a single token, i.e. 4.
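
To make the numbering less mysterious, here is a minimal pure-Python sketch of how a frequency-ranked word index like the one above could be built. It is not the Keras implementation: `split_words` and `build_word_index` are hypothetical helper names, and the sketch assumes the default filters, lower-casing, and that words with equal frequency are ranked by order of first appearance.

```python
from collections import Counter

# Default punctuation filter, assumed to match the Tokenizer default.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=FILTERS, lower=True):
    """Lower-case the text, blank out filtered characters, split on whitespace."""
    if lower:
        text = text.lower()
    text = text.translate(str.maketrans({ch: " " for ch in filters}))
    return text.split()


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # most_common() sorts by count, keeping first-seen order for ties.
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


sentences = ["I want to go out.", " I like to play.", " No eating - ", "No play!"]
print(build_word_index(sentences))
# {'<OOV>': 1, 'i': 2, 'to': 3, 'play': 4, 'no': 5, 'want': 6, 'go': 7, 'out': 8, 'like': 9, 'eating': 10}
```

The words "i", "to", "play" and "no" each appear twice, so they receive the lowest indices after the OOV token. Index 0 is left unused, which is why it is free to serve as the padding value later.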

### Sequencing

Next, let's represent each sentence as a sequence of numbers using texts_to_sequences from the tokenizer object. Below, we print the raw sentences, the word index and the sequences.

In [6]:
# Text to sequences
sequences = tokenizer.texts_to_sequences(sentences)
print(sentences)
print(word_index)
print(sequences)

['I want to go out.', ' I like to play.', ' No eating - ', 'No play!']
{'<OOV>': 1, 'i': 2, 'to': 3, 'play': 4, 'no': 5, 'want': 6, 'go': 7, 'out': 8, 'like': 9, 'eating': 10}
[[2, 6, 3, 7, 8], [2, 9, 3, 4], [5, 10], [5, 4]]


As shown above, each sentence in the list has been converted to a sequence of integers. For example:
- "I want to go out" ---> [2,6,3,7,8]
- "I like to play" ---> [2,9,3,4]
- "No eating" ---> [5,10]
- "No play!" ---> [5,4]
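
The role of the out-of-vocabulary token is easiest to see with a sentence containing a word the tokenizer never saw. The following is a rough pure-Python sketch of the lookup step, not the Keras code itself: `text_to_sequence` is a hypothetical helper, the word index is copied from the output above, and the default filters are assumed.

```python
# Default punctuation filter, assumed to match the Tokenizer default.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Word index copied from the tokenizer output shown earlier.
WORD_INDEX = {"<OOV>": 1, "i": 2, "to": 3, "play": 4, "no": 5,
              "want": 6, "go": 7, "out": 8, "like": 9, "eating": 10}


def text_to_sequence(text, word_index=WORD_INDEX, oov_token="<OOV>"):
    """Map each word to its index; unknown words map to the OOV index."""
    cleaned = text.lower().translate(str.maketrans({ch: " " for ch in FILTERS}))
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in cleaned.split()]


print(text_to_sequence("I want to go out."))  # [2, 6, 3, 7, 8]
print(text_to_sequence("I want to eat!"))     # [2, 6, 3, 1]
```

The word "eat" is not in the word index, so it is replaced by 1, the index of the OOV token. Without an OOV token, such words would simply be dropped and the sequence would become shorter.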

### Padding
In any raw text data, there will naturally be sentences of different lengths. However, neural networks generally require all inputs in a batch to have the same size. Padding is used for this purpose. Below, let's use pad_sequences for padding. pad_sequences takes arguments such as sequences, padding, maxlen, truncating, value and dtype.
- sequences: the list of sequences that we created earlier
- padding: 'pre' or 'post' (default 'pre'). With 'pre', padding is added before each sequence; with 'post', it is added after each sequence.
- maxlen: the maximum length of all sequences. If not provided, the length of the longest sequence is used.
- truncating: 'pre' or 'post' (default 'pre'). If a sequence is longer than the provided maxlen, it is truncated to maxlen. 'pre' removes values from the beginning of the sequence, whereas 'post' removes values from the end.
- value: the padding value (default is 0)
- dtype: the output sequence type (default is int32)

Let's focus on the important arguments used in pad_sequences: padding, maxlen and truncating.

##### pre and post padding

The choice between 'pre' and 'post' padding depends on the analysis. In some cases padding at the beginning is appropriate, while in others it is not. For instance, if we use a Recurrent Neural Network (RNN) for spam detection, padding at the beginning would be appropriate, as RNNs struggle to learn long-distance patterns. Padding at the beginning keeps the actual tokens at the end of the sequence, so the RNN processes them last and can make use of them for its prediction. In any case, the padding strategy should be chosen after careful consideration and with business knowledge.

Below, the outputs for 'pre' followed by 'post' padding are shown, with maxlen left at its default, the length of the longest sequence.

In [7]:
# pre padding
pre_pad = pad_sequences(sequences, padding='pre')
print("\nword_index = ", word_index)
print("\nsequences = ", sequences)
print("\npadded_seq = " )
print(pre_pad)


word_index =  {'<OOV>': 1, 'i': 2, 'to': 3, 'play': 4, 'no': 5, 'want': 6, 'go': 7, 'out': 8, 'like': 9, 'eating': 10}

sequences =  [[2, 6, 3, 7, 8], [2, 9, 3, 4], [5, 10], [5, 4]]

padded_seq = 
[[ 2  6  3  7  8]
 [ 0  2  9  3  4]
 [ 0  0  0  5 10]
 [ 0  0  0  5  4]]


In our example above, the longest sequence is [2, 6, 3, 7, 8], which corresponds to "I want to go out". When padding='pre' is used, the padding value 0 is added at the beginning of all other sequences. Because the other sequences are shorter than [2, 6, 3, 7, 8], padding brings them all to the same length as this sequence.

In contrast, when padding='post' is used, the padding value 0 is added at the end of the sequences.

In [8]:
# post padding
post_pad = pad_sequences(sequences, padding='post')
print("\nword_index = ", word_index)
print("\nsequences = ", sequences)
print("\npadded_seq = " )
print(post_pad)


word_index =  {'<OOV>': 1, 'i': 2, 'to': 3, 'play': 4, 'no': 5, 'want': 6, 'go': 7, 'out': 8, 'like': 9, 'eating': 10}

sequences =  [[2, 6, 3, 7, 8], [2, 9, 3, 4], [5, 10], [5, 4]]

padded_seq = 
[[ 2  6  3  7  8]
 [ 2  9  3  4  0]
 [ 5 10  0  0  0]
 [ 5  4  0  0  0]]
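
The two padded outputs above can be reproduced with a few lines of plain Python, which may help show what the padding argument actually does. This is only an illustrative sketch with maxlen left at its default: `pad_to_longest` is a hypothetical helper, and it returns nested lists rather than the NumPy array that pad_sequences produces.

```python
def pad_to_longest(sequences, padding="pre", value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    target = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (target - len(seq))
        # 'pre' puts the fill before the tokens, 'post' puts it after them.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[2, 6, 3, 7, 8], [2, 9, 3, 4], [5, 10], [5, 4]]

print(pad_to_longest(sequences, padding="pre"))
# [[2, 6, 3, 7, 8], [0, 2, 9, 3, 4], [0, 0, 0, 5, 10], [0, 0, 0, 5, 4]]
print(pad_to_longest(sequences, padding="post"))
# [[2, 6, 3, 7, 8], [2, 9, 3, 4, 0], [5, 10, 0, 0, 0], [5, 4, 0, 0, 0]]
```

The longest sequence is left unchanged in both cases, since it needs no fill values.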


##### pre and post padding with maxlen and truncating option
By using maxlen=4, we fix the length of every padded sequence at 4. As shown below, maxlen=4 affects the first sequence [2, 6, 3, 7, 8]. This sequence has a length of 5 and is truncated to 4. Truncating with the 'pre' option removes values from the beginning of the sequence, whereas truncating with 'post' removes values from the end.

In [9]:
# pre padding, maxlen and pre truncation
prepad_maxlen_pretrunc = pad_sequences(sequences, padding = 'pre', maxlen =4, truncating = 'pre')
print(prepad_maxlen_pretrunc)

[[ 6  3  7  8]
 [ 2  9  3  4]
 [ 0  0  5 10]
 [ 0  0  5  4]]


In [10]:
# pre padding, maxlen and post truncation
prepad_maxlen_posttrunc = pad_sequences(sequences, padding = 'pre', maxlen =4, truncating = 'post')
print(prepad_maxlen_posttrunc)

[[ 2  6  3  7]
 [ 2  9  3  4]
 [ 0  0  5 10]
 [ 0  0  5  4]]


In [11]:
# post padding, maxlen and pre truncation
postpad_maxlen_pretrunc = pad_sequences(sequences, padding = 'post', maxlen =4, truncating = 'pre')
print(postpad_maxlen_pretrunc)

[[ 6  3  7  8]
 [ 2  9  3  4]
 [ 5 10  0  0]
 [ 5  4  0  0]]


In [12]:
# post padding, maxlen and post truncation
postpad_maxlen_posttrunc = pad_sequences(sequences, padding = 'post', maxlen =4, truncating = 'post')
print(postpad_maxlen_posttrunc)

[[ 2  6  3  7]
 [ 2  9  3  4]
 [ 5 10  0  0]
 [ 5  4  0  0]]
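
In all four outputs above, truncation only changes the first row, because it is the only sequence longer than maxlen. The effect of the truncating option on its own can be sketched with list slicing. Here `truncate` is a hypothetical helper for illustration, not part of Keras.

```python
def truncate(seq, maxlen, truncating="pre"):
    """Cut a sequence down to maxlen items; shorter sequences are untouched."""
    if len(seq) <= maxlen:
        return list(seq)
    if truncating == "pre":
        # Drop items from the beginning, keeping the last maxlen items.
        return seq[len(seq) - maxlen:]
    # Drop items from the end, keeping the first maxlen items.
    return seq[:maxlen]


longest = [2, 6, 3, 7, 8]
print(truncate(longest, maxlen=4, truncating="pre"))   # [6, 3, 7, 8]
print(truncate(longest, maxlen=4, truncating="post"))  # [2, 6, 3, 7]
print(truncate([5, 10], maxlen=4))                     # [5, 10]
```

With 'pre' the first token (2, i.e. "i") is lost, while with 'post' the last token (8, i.e. "out") is lost, so the choice determines which end of a long text survives.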


### Summary

In this article, we focused on pre-processing raw text data and preparing it for deep learning models. Specifically, we covered tokenizing sentences, representing them as sequences and padding them so that all sequences have the same length. These padded sequences are now ready for a train/test split and can then be used in a neural network. For further reading, please refer to Laurence Moroney's [NLP Zero to Hero video](https://www.youtube.com/watch?v=fNxaJsNG3-s).

In a future article, I will explain how we can apply this pre-processing to real-world data and use such padded sequence data with embeddings in deep learning models.

# Thank you!