This is a training notebook to help me understand some **tensorflow.keras.preprocessing** classes and methods:
* Tokenizer --> [Tokenizer docs](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)
* pad_sequences (padding) --> [padding docs](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
sentences = [
    'I love my dog',
    'i love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
    'dog-cat is your love'
]

In [3]:
tokenizer = Tokenizer(num_words = 100) # num_words = only indices below this value are kept in sequences
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [4]:
# Tokenizer?

In [5]:
# dir(tokenizer)

In [6]:
sequences = tokenizer.texts_to_sequences(sentences)

In [7]:
print('word_index: \n{}\n'.format(word_index))
print('sequences: \n{}'.format(sequences)) 

word_index: 
{'love': 1, 'my': 2, 'dog': 3, 'i': 4, 'cat': 5, 'you': 6, 'is': 7, 'do': 8, 'think': 9, 'amazing': 10, 'your': 11}

sequences: 
[[4, 1, 2, 3], [4, 1, 2, 5], [6, 1, 2, 3], [8, 6, 9, 2, 3, 7, 10], [3, 5, 7, 11, 1]]


In [8]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [9]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 1, 2, 3], [2, 3, 2]]


Words that are not in tokenizer.word_index ('really', 'loves', 'manatee') were silently dropped from the sequences.
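
The lookup behind this result can be sketched in plain Python. This is only an illustration, not Keras's actual code: `encode` is a hypothetical helper, the `word_index` is copied from the output of cell 7, and the input is assumed to be already lowercased and free of punctuation.

```python
# word_index copied from the notebook output above
word_index = {'love': 1, 'my': 2, 'dog': 3, 'i': 4, 'cat': 5, 'you': 6,
              'is': 7, 'do': 8, 'think': 9, 'amazing': 10, 'your': 11}

def encode(text, word_index):
    """Hypothetical helper: map known words to indices, skip unknown ones."""
    return [word_index[w] for w in text.split() if w in word_index]

test_data = ['i really love my dog', 'my dog loves my manatee']
print([encode(t, word_index) for t in test_data])
# [[4, 1, 2, 3], [2, 3, 2]]
```

Note that the second sequence is shorter than the sentence it came from, so the positions of the remaining words shift.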

### tokenizer with an out-of-vocabulary token

In [10]:
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

You can use any string as the out-of-vocabulary token, e.g. oov_token='blablabla' gives word_index: 
{'blablabla': 1, (...)} (:
Just pick one that cannot collide with a real word in your texts.

In [11]:
sequences = tokenizer.texts_to_sequences(sentences)

In [12]:
print('word_index: \n{}\n'.format(word_index))
print('sequences: \n{}'.format(sequences))    

word_index: 
{'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'cat': 6, 'you': 7, 'is': 8, 'do': 9, 'think': 10, 'amazing': 11, 'your': 12}

sequences: 
[[5, 2, 3, 4], [5, 2, 3, 6], [7, 2, 3, 4], [9, 7, 10, 3, 4, 8, 11], [4, 6, 8, 12, 2]]


In [13]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [14]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[5, 1, 2, 3, 4], [3, 4, 1, 3, 1]]


Words that are not in tokenizer.word_index ('really', 'loves', 'manatee') were indexed as 1, the index of the out-of-vocabulary token.
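
A plain-Python sketch of the same fallback, again only as an illustration: `encode_with_oov` is a hypothetical helper, the `word_index` is copied from the output of cell 12, and the input is assumed to be already lowercased and free of punctuation.

```python
# word_index copied from the notebook output above
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'cat': 6,
              'you': 7, 'is': 8, 'do': 9, 'think': 10, 'amazing': 11, 'your': 12}

def encode_with_oov(text, word_index, oov_token='<OOV>'):
    """Hypothetical helper: unknown words fall back to the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(w, oov_index) for w in text.split()]

test_data = ['i really love my dog', 'my dog loves my manatee']
print([encode_with_oov(t, word_index) for t in test_data])
# [[5, 1, 2, 3, 4], [3, 4, 1, 3, 1]]
```

With the fallback, every sequence keeps the length of its sentence, so word positions are preserved.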

### the num_words param
* num_words set to 3

In [15]:
tokenizer = Tokenizer(num_words = 3)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [16]:
sequences = tokenizer.texts_to_sequences(sentences)

In [17]:
print('word_index: \n{}\n'.format(word_index))
print('sequences: \n{}'.format(sequences))    

word_index: 
{'love': 1, 'my': 2, 'dog': 3, 'i': 4, 'cat': 5, 'you': 6, 'is': 7, 'do': 8, 'think': 9, 'amazing': 10, 'your': 11}

sequences: 
[[1, 2], [1, 2], [1, 2], [2], [1]]


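num_words has no impact on what fit_on_texts records: word_index, word_counts and word_docs still contain every word. It only affects the sequences built afterwards (e.g. by texts_to_sequences or texts_to_matrix), where only indices below num_words are kept. With num_words = 3 that leaves just 'love' (1) and 'my' (2).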
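
The filtering step can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `apply_num_words` is a hypothetical helper, and `full` holds the unfiltered sequences printed in cell 7. The optional `oov_index` argument mirrors what happens when an oov_token is set: words cut off by num_words are mapped to the OOV index instead of being dropped.

```python
def apply_num_words(sequence, num_words, oov_index=None):
    """Hypothetical helper: keep only indices below num_words.

    Indices at or above num_words are dropped, or replaced by
    oov_index when one is given.
    """
    out = []
    for i in sequence:
        if i < num_words:
            out.append(i)
        elif oov_index is not None:
            out.append(oov_index)
    return out

# unfiltered sequences copied from the output of cell 7
full = [[4, 1, 2, 3], [4, 1, 2, 5], [6, 1, 2, 3],
        [8, 6, 9, 2, 3, 7, 10], [3, 5, 7, 11, 1]]
print([apply_num_words(s, num_words=3) for s in full])
# [[1, 2], [1, 2], [1, 2], [2], [1]]
```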

### How does a Tokenizer change a text?

In [18]:
sentences = [
    'I \n \n love my \n dog',
    'i love my           cat...',
    'He loves my dog!',
    '   Do you think           my dog is amazing? ',
    'dog-cat is your love',
    'i found 1234 cats in my house',
    'Is 12cat3445H2O or dog15-14 a good name for a cat? '
]

In [19]:
tokenizer = Tokenizer(num_words = 100) # num_words = only indices below this value are kept in sequences
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [20]:
print('word_index: \n{}\n'.format(word_index))

word_index: 
{'my': 1, 'dog': 2, 'i': 3, 'love': 4, 'cat': 5, 'is': 6, 'a': 7, 'he': 8, 'loves': 9, 'do': 10, 'you': 11, 'think': 12, 'amazing': 13, 'your': 14, 'found': 15, '1234': 16, 'cats': 17, 'in': 18, 'house': 19, '12cat3445h2o': 20, 'or': 21, 'dog15': 22, '14': 23, 'good': 24, 'name': 25, 'for': 26}



Some Tokenizer default values (they can be overridden when Tokenizer() is instantiated):

In [21]:
tokenizer.filters

'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

In [22]:
tokenizer.lower, tokenizer.split, tokenizer.char_level

(True, ' ', False)

#### What tensorflow.keras.preprocessing.text.Tokenizer does by default:
* lowercases the text --> I = i, Do = do
* ignores redundant whitespace
* replaces the punctuation listed in tokenizer.filters with whitespace --> 'dog15-14' became 'dog15 14' and was indexed as 2 words: 'dog15' and '14'
* special characters such as newline \n and tab \t are part of tokenizer.filters, so they are removed together with the punctuation
* numbers are not removed, neither standalone ones ('1234') nor those joined with letters ('12cat3445h2o')
* no stemming: you and your, cat and cats, love and loves are different words/indexes
* no stop-word removal: 'a', 'is', 'my' are all indexed
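
The steps in this list can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming the default `filters`, `lower` and `split` values shown above; `to_words` is a hypothetical helper written for this notebook, not a Keras function.

```python
# default value of tokenizer.filters, as printed in cell 21
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text, filters=FILTERS, lower=True, split=' '):
    """Hypothetical helper: lowercase, replace filter chars, split, drop empties."""
    if lower:
        text = text.lower()
    # every filtered character becomes the split character
    text = text.translate(str.maketrans({c: split for c in filters}))
    # repeated separators produce empty strings, which are discarded
    return [w for w in text.split(split) if w]

print(to_words('I \n \n love my \n dog'))
# ['i', 'love', 'my', 'dog']
print(to_words('Is 12cat3445H2O or dog15-14 a good name for a cat? '))
# ['is', '12cat3445h2o', 'or', 'dog15', '14', 'a', 'good', 'name', 'for', 'a', 'cat']
```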

# padding
* making the sentences the same length

In [23]:
sentences = [
    'I love my dog',
    'i love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

In [24]:
#  no changes :)
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>') 
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

In [26]:
#  padding
padded = pad_sequences(sequences)

In [27]:
print('word_index: \n{}\n'.format(word_index))
print('sequences: \n{}\n'.format(sequences))
print('padded: \n{}'.format(padded)) 

word_index: 
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

sequences: 
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

padded: 
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


* so the default is padding='pre' (zeros are added at the beginning)
* this is also why index 0 is never used in word_index: it is reserved for padding (:
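
A small plain-Python check of why a reserved 0 matters: because no word ever gets index 0, the padding can be stripped again without losing anything. The lists below are copied from the output of cell 27, written as nested lists rather than a NumPy array.

```python
# values copied from the notebook output above
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
padded = [[0, 0, 0, 5, 3, 2, 4],
          [0, 0, 0, 5, 3, 2, 7],
          [0, 0, 0, 6, 3, 2, 4],
          [8, 6, 9, 2, 4, 10, 11]]

# dropping every 0 recovers the original sequences exactly
unpadded = [[t for t in row if t != 0] for row in padded]
print(unpadded == sequences)
# True
```

If some real word were indexed as 0, it would be indistinguishable from padding and this round trip would fail.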

#### padding = 'post'

In [29]:
padded = pad_sequences(sequences, padding = 'post')
print('padded: \n{}'.format(padded)) 

padded: 
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


#### maxlen param
* by default, sequences longer than maxlen are cut at the front (truncating = 'pre')

In [30]:
padded = pad_sequences(sequences, padding = 'post', maxlen=5)
print('padded: \n{}'.format(padded)) 

padded: 
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]


#### truncating param
* truncating = 'post' --> sequences longer than maxlen are cut at the back

In [31]:
padded = pad_sequences(sequences, padding = 'post', 
                       truncating = 'post', maxlen=5)
print('padded: \n{}'.format(padded)) 

padded: 
[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
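
The interplay of padding, truncating and maxlen seen in the last few cells can be summed up in a plain-Python sketch. `pad` is a hypothetical stand-in written for this notebook: it returns nested lists rather than a NumPy array and ignores the other pad_sequences options.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical pure-Python stand-in for pad_sequences (returns lists)."""
    if maxlen is None:
        # default: pad everything to the longest sequence
        maxlen = max(len(s) for s in sequences)
    out = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops items from the front, 'post' from the back
            s = s[len(s) - maxlen:] if truncating == 'pre' else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        out.append(fill + s if padding == 'pre' else s + fill)
    return out

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(pad(sequences, padding='post', maxlen=5)[-1])
# [9, 2, 4, 10, 11]
print(pad(sequences, padding='post', truncating='post', maxlen=5)[-1])
# [8, 6, 9, 2, 4]
```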


Just out of curiosity, the types of the two results:

In [32]:
type(padded)

numpy.ndarray

In [33]:
type(sequences)

list