# padding and truncating

because of the input requirements of neural nets, deep learning libraries expect a fixed-length vectorization of data. NN models involve many matrix operations, and performing them efficiently depends on every input in a batch being the same length.

this poses a problem for NLP, where words and phrases vary in length.

one solution is to "pad" the sequences with empty values so that they are all the same length, while the original values in each sequence stay unchanged.
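as a rough illustration of the idea (not how keras implements it), here is a minimal pure-python sketch. the helper name pad_to_length and its parameters are hypothetical, chosen only for this example; it assumes zero is a safe "empty" value for the data.

```python
def pad_to_length(seq, length, pad_value=0, side="pre"):
    """Return a copy of seq padded with pad_value until it has `length` items.

    Hypothetical helper for illustration only. Sequences that are already
    `length` items or longer are returned unchanged (no truncation here).
    """
    missing = max(length - len(seq), 0)
    filler = [pad_value] * missing
    if side == "pre":
        return filler + list(seq)   # dummy values go at the beginning
    if side == "post":
        return list(seq) + filler   # dummy values go at the end
    raise ValueError("side must be 'pre' or 'post'")


sequences = [[2, 4, 6, 8], [2, 4, 6], [2, 4]]

# pad everything up to the length of the longest sequence
target = max(len(s) for s in sequences)
padded = [pad_to_length(s, target) for s in sequences]

print(padded)  # [[2, 4, 6, 8], [0, 2, 4, 6], [0, 0, 2, 4]]
```

every row now has four items, and the original values are all still there in their original order.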

another solution is to truncate sequences to the same length, by removing values from either the beginning or end of sequences whose lengths are above the specified maximum.
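the truncation idea can be sketched with plain list slicing. this is an illustrative example only: truncate_to_length is a hypothetical helper, not a keras function.

```python
def truncate_to_length(seq, max_length, side="pre"):
    """Return a copy of seq with at most max_length items.

    Hypothetical helper for illustration only. Sequences that are already
    short enough are returned unchanged (no padding here).
    """
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if side == "pre":
        # drop values from the beginning, keep the last max_length items
        start = max(len(seq) - max_length, 0)
        return list(seq[start:])
    if side == "post":
        # drop values from the end, keep the first max_length items
        return list(seq[:max_length])
    raise ValueError("side must be 'pre' or 'post'")


long_sequence = [1, 2, 3, 4, 5, 6, 7, 8]

print(truncate_to_length(long_sequence, 5))               # [4, 5, 6, 7, 8]
print(truncate_to_length(long_sequence, 5, side="post"))  # [1, 2, 3, 4, 5]
print(truncate_to_length([1, 2], 5))                      # [1, 2]
```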

keras offers convenient ways to both truncate and pad variable-length sequences, so they are ready for a deep learning model.

## padding for sequences of variable length


### pre-sequence padding vs post-sequence padding

padding means adding empty, or "dummy", values to either the beginning or the end of a sequence so that it reaches the pre-set length.

the keras default is to add zeros to the beginning of the sequence; this is referred to as pre-sequence padding.

post-sequence padding--adding empty values to the end of a sequence--is also possible by setting the pad_sequences() function's padding parameter to "post". 

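the decision of whether to use pre- or post-sequence padding depends on the problem being modeled. for example, with a model that reads a sequence from left to right, pre-sequence padding puts the real values right next to the final step.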

## truncation

### pre-sequence truncation vs post-sequence truncation

keras doesn't have a truncation-specific function in its sequence preprocessing library. instead, truncation is accomplished using the maxlen parameter of the pad_sequences() function described above: any sequence longer than maxlen is cut down to maxlen values.

the default here, again, is to remove values from the beginning of a sequence. post-sequence truncation can be done by setting the truncating parameter equal to 'post'.

## padding examples

In [13]:
from keras.preprocessing.sequence import pad_sequences

let's create some data. we'll make several sequences of different lengths.

In [14]:
seq = [
    [2, 4, 6, 8],
    [2, 4, 6],
    [2, 4]]

print(seq)

[[2, 4, 6, 8], [2, 4, 6], [2, 4]]


### pre-sequence padding

(the keras default)

In [15]:
# padding

pre_seq_padded = pad_sequences(seq)

# test

print(pre_seq_padded)

[[2 4 6 8]
 [0 2 4 6]
 [0 0 2 4]]


### post-sequence padding

set the pad_sequences() function's padding parameter to 'post':

In [16]:
# back to the original set of sequences

print(seq)

[[2, 4, 6, 8], [2, 4, 6], [2, 4]]


In [17]:
# set padding parameter to 'post'

post_seq_padded = pad_sequences(seq, padding='post')

# test

print(post_seq_padded)

[[2 4 6 8]
 [2 4 6 0]
 [2 4 0 0]]


you can see the zeroes added to the ends of each of the shorter rows in our array.

our dataset is now post-sequence padded.

## truncation examples

let's create more sequences to work with:

In [18]:
long_seq = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6]]

### pre-sequence truncation

we use the same pad_sequences() function, but this time set maxlen equal to 5:

In [19]:
# truncate using the pre-sequence default

pre_seq_truncated = pad_sequences(long_seq, maxlen=5)

# test

print(pre_seq_truncated)

[[4 5 6 7 8]
 [3 4 5 6 7]
 [2 3 4 5 6]]


because values were removed from the beginning of each sequence, the rows now begin with 4, 3, and 2 respectively, instead of 1.

### post-sequence truncation

to truncate from the ends of sequences, we use pad_sequences() again, with the truncating parameter set to 'post'. maxlen is still set to 5.

In [20]:
# truncate from the ends of sequences using truncating='post'

post_seq_truncated = pad_sequences(long_seq, maxlen=5, truncating='post')

# test

print(post_seq_truncated)

[[1 2 3 4 5]
 [1 2 3 4 5]
 [1 2 3 4 5]]


each row has had values removed from its end, leaving only its first 5 values.

## that's it for sequence padding & truncation in keras (for now)

there are of course ways to do this manually in python, which means you can adapt these basic ideas to suit your specific needs (or data!). when you don't need or want a custom implementation, keras offers a handy, quick way to prepare sequences for DL models.
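as one possible starting point for a manual version, here is a small pure-python sketch that pads and truncates in a single pass. fit_sequences is a hypothetical function written for this example; its parameter names mirror the pad_sequences() arguments used above, but it is not the keras implementation and it returns a list of lists rather than an array.

```python
def fit_sequences(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to exactly maxlen items.

    Hypothetical sketch for illustration; not the keras implementation.
    If maxlen is None, the length of the longest sequence is used.
    """
    for name, option in (("padding", padding), ("truncating", truncating)):
        if option not in ("pre", "post"):
            raise ValueError(f"{name} must be 'pre' or 'post'")
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)

    result = []
    for s in sequences:
        s = list(s)
        # step 1: truncate sequences that are too long
        if truncating == "pre":
            kept = s[max(len(s) - maxlen, 0):]   # keep the last maxlen items
        else:
            kept = s[:maxlen]                    # keep the first maxlen items
        # step 2: pad sequences that are too short
        filler = [value] * (maxlen - len(kept))
        result.append(filler + kept if padding == "pre" else kept + filler)
    return result


# sequences that are both longer and shorter than the target length
mixed = [[1, 2, 3, 4, 5, 6], [1, 2]]

print(fit_sequences(mixed, maxlen=4))
# [[3, 4, 5, 6], [0, 0, 1, 2]]

print(fit_sequences(mixed, maxlen=4, padding="post", truncating="post"))
# [[1, 2, 3, 4], [1, 2, 0, 0]]
```

writing it by hand like this makes it easy to change the details, such as using a different filler value or treating some sequences specially.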


##### for more information about keras pre-processing:

https://keras.io/preprocessing/sequence/