# Padding


When encoded sentences are fed to a neural network, all of them must have the same length, just as the images in a batch must share the same dimensions. Sentences of different lengths therefore have to be padded (or truncated) to a common length, and there are APIs that do this.
In this notebook, **pad_sequences** from `tensorflow.keras.preprocessing.sequence` is used.
```
from tensorflow.keras.preprocessing.sequence import pad_sequences
```

* **maxlen**: Optional Int, maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence.
* **dtype**: Optional, defaults to int32. Type of the output sequences. To pad sequences of variable-length strings, you can use object.
* **padding**: String, 'pre' or 'post' (optional, defaults to 'pre'): pad either before or after each sequence.
* **truncating**: String, 'pre' or 'post' (optional, defaults to 'pre'): remove values from sequences longer than maxlen, either at the beginning or at the end of the sequence.
* **value**: Float or String, padding value (optional, defaults to 0).

In [None]:
tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=None, dtype='int32', padding='pre',
    truncating='pre', value=0.0
)
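
To make the interaction between these options concrete, here is a minimal pure-Python sketch. `pad_sketch` is a hypothetical helper written for illustration only; it is not the Keras implementation and it ignores `dtype`. It follows the behaviour described in the parameter list: sequences longer than `maxlen` are truncated first, and shorter ones are then filled with `value`.

```python
def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical stand-in that mimics the documented pad_sequences options."""
    if padding not in ("pre", "post") or truncating not in ("pre", "post"):
        raise ValueError("padding and truncating must be 'pre' or 'post'")

    # Without maxlen, pad everything to the length of the longest sequence.
    if maxlen is None:
        maxlen = max((len(seq) for seq in sequences), default=0)

    rows = []
    for seq in sequences:
        seq = list(seq)
        # Truncate sequences that are too long, from the front or the back.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        # Fill the remaining positions with the padding value.
        fill = [value] * (maxlen - len(seq))
        rows.append(fill + seq if padding == "pre" else seq + fill)
    return rows


sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

default_rows = pad_sketch(sequences)
pre_rows = pad_sketch(sequences, maxlen=5)
post_rows = pad_sketch(sequences, maxlen=5, padding="post", truncating="post")

print(default_rows)  # [[0, 0, 0, 5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(pre_rows)      # [[0, 5, 3, 2, 4], [9, 2, 4, 10, 11]]
print(post_rows)     # [[5, 3, 2, 4, 0], [8, 6, 9, 2, 4]]
```

Unlike this sketch, the real `pad_sequences` returns a NumPy array rather than a list of lists.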

In the following example, the list of sentences is padded out into a matrix in which every row has the same length.
By default this is done by putting the appropriate number of zeros before each sentence. With `maxlen=5`, a four-token sentence such as `[5, 3, 2, 4]` gets a single leading zero, while the longer seven-token sentence needs no padding and is instead truncated from the front to its last five tokens.
The padding can go either before or after the sentence.
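
The "appropriate number of zeros" is simple arithmetic on the sequence lengths. The sketch below works it out for the encoded sentences of the example that follows, assuming a target length of 5; `padding_plan` is a hypothetical helper name used only for this illustration.

```python
# Encoded sentences, copied from the "Sequences" output of the example.
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
maxlen = 5  # assumed target length, matching the example


def padding_plan(length, maxlen):
    """Return (zeros to add, tokens to drop) for one sequence length."""
    zeros_added = max(0, maxlen - length)
    tokens_dropped = max(0, length - maxlen)
    return zeros_added, tokens_dropped


plan = [padding_plan(len(seq), maxlen) for seq in sequences]
for seq, (zeros, dropped) in zip(sequences, plan):
    print(f"length {len(seq)}: add {zeros} zero(s), drop {dropped} token(s)")
```

Each four-token sentence needs one zero, and the seven-token sentence loses two tokens, which is what the padded matrix in the example output shows.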

**Example 2:**

In [None]:
import tensorflow as tf
from tensorflow import keras


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define the list of sentences
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
# Create a tokenizer limited to a 100-word vocabulary, with "<OOV>" standing in for unseen words
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
# Build the vocabulary from the list of sentences
tokenizer.fit_on_texts(sentences)
# Get the dictionary that maps each word to its integer index
word_index = tokenizer.word_index

# Convert each sentence into a list of word indices
sequences = tokenizer.texts_to_sequences(sentences)

# Pad (or truncate) every sequence to length 5
padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)


# Test with sentences containing words the tokenizer was not fit on,
# this time padding to length 10.

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)

# Pad after each sequence instead of before, to length 9
padded_after = pad_sequences(test_seq, maxlen=9, padding='post')
print("\nPadded Test Sequence After")
print(padded_after)

# Shorten each sequence to length 4 by dropping tokens from the front
padded_truncated = pad_sequences(test_seq, maxlen=4, truncating='pre')
print("\nPadded Test Sequence Truncated before")
print(padded_truncated)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padded Sequences:
[[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]

Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]

Padded Test Sequence After
[[5 1 3 2 4 0 0 0 0]
 [2 4 1 2 1 0 0 0 0]]

Padded Test Sequence Truncated before
[[1 3 2 4]
 [4 1 2 1]]
