# padding for sequences of variable length

because of the input requirements of neural nets, deep learning libraries expect a fixed-length vectorization of data. NN models involve many matrix operations, and performing them efficiently depends on inputs that are all the same length.

this can pose a problem for NLP purposes: words and phrases vary in length.

a solution is to "pad" the sequences with a filler value so that they all have the same length, while the original values are left unchanged.
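to make the idea concrete, here is a minimal pure-python sketch of padding. the helper name `pad_to_longest` and its `pad_value` argument are hypothetical (they are not part of keras), and padding at the front is just one possible choice for the illustration.

```python
# hypothetical helper for illustration only (not a keras function):
# add filler values at the front of each sequence until it is as long
# as the longest sequence in the batch
def pad_to_longest(sequences, pad_value=0):
    target = max(len(s) for s in sequences)
    return [[pad_value] * (target - len(s)) + list(s) for s in sequences]


ragged = [[2, 4, 6, 8], [2, 4, 6], [2, 4]]
padded = pad_to_longest(ragged)

# the original values are still there; only filler zeros were added
print(padded)

# every row now has the same length, so the rows form a rectangular matrix
print({len(row) for row in padded})
```

the original values keep their order in every row; only the filler is new, so a model can be told (or can learn) to ignore it.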

keras offers a convenient function, pad_sequences, that pads variable-length sequences with a filler value (0 by default) so they are ready for a deep learning model.

### pre-sequence padding vs post-sequence padding

the keras default is to add the padding values (zeros) to the beginning of the sequence; this is referred to as pre-sequence padding. post-sequence padding, which adds them to the end instead, is also possible by setting the 'padding' parameter to 'post'.

the decision of whether to use pre- or post-sequence padding depends on the problem being modeled.
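the difference between the two modes can be sketched in plain python. the function `sketch_pad` below is a hypothetical stand-in written for this note, not the keras implementation; it only mimics the 'pre' and 'post' values of the 'padding' parameter and assumes zero as the filler.

```python
# hypothetical stand-in for illustration; this is not how keras implements it
def sketch_pad(sequences, padding="pre", pad_value=0):
    if padding not in ("pre", "post"):
        raise ValueError("padding must be 'pre' or 'post'")
    target = max(len(s) for s in sequences)
    result = []
    for s in sequences:
        filler = [pad_value] * (target - len(s))
        if padding == "pre":
            result.append(filler + list(s))   # filler goes in front
        else:
            result.append(list(s) + filler)   # filler goes at the end
    return result


example = [[1, 2, 3], [4]]
print(sketch_pad(example, padding="pre"))   # [[1, 2, 3], [0, 0, 4]]
print(sketch_pad(example, padding="post"))  # [[1, 2, 3], [4, 0, 0]]
```

with 'pre', the real values sit at the end of each row; with 'post', they sit at the start. which layout works better depends on the model and the task.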

##### for more information about keras pre-processing:

https://keras.io/preprocessing/sequence/

In [1]:
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


let's create some data. we'll make several sequences of different lengths.

In [4]:
seq = [
    [2, 4, 6, 8],
    [2, 4, 6],
    [2, 4]]

print(seq)

[[2, 4, 6, 8], [2, 4, 6], [2, 4]]


### pre-sequence padding

(the keras default)

In [6]:
# pad with the default settings (zeros are added at the front)

pre_seq_padded = pad_sequences(seq)

# inspect the result

print(pre_seq_padded)

[[2 4 6 8]
 [0 2 4 6]
 [0 0 2 4]]


### post-sequence padding

set the pad_sequences function's 'padding' parameter to 'post'

In [7]:
# back to the original set of sequences

print(seq)

[[2, 4, 6, 8], [2, 4, 6], [2, 4]]


In [8]:
# set padding parameter to 'post'

post_seq_padded = pad_sequences(seq, padding='post')

# inspect the result

print(post_seq_padded)

[[2 4 6 8]
 [2 4 6 0]
 [2 4 0 0]]
