In [0]:
import tensorflow as tf
from tensorflow import keras

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [0]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my cat is amazing',
    'I have a new car'
]

The Tokenizer class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, a word count, a tf-idf weight, and so on.
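To make the two output styles concrete, here is a minimal pure-Python sketch that does not use Keras. The vocabulary `toy_index` and the helpers `as_sequence` and `as_vector` are made up for illustration; the real Tokenizer builds its index from the corpus, as the cells below show.

```python
# Hypothetical hand-written vocabulary; indices start at 1.
toy_index = {"my": 1, "love": 2, "dog": 3, "cat": 4}

def as_sequence(text, index):
    """Turn a text into a list of token indices, skipping unknown words."""
    return [index[w] for w in text.lower().split() if w in index]

def as_vector(text, index, mode="binary"):
    """Turn a text into a fixed-size vector with one slot per token.

    Slot 0 is left unused because indices start at 1.
    mode="binary" marks presence, mode="count" counts occurrences.
    """
    vec = [0] * (len(index) + 1)
    for w in text.lower().split():
        if w in index:
            if mode == "binary":
                vec[index[w]] = 1
            else:
                vec[index[w]] += 1
    return vec

text = "my dog love my cat"
print(as_sequence(text, toy_index))          # [1, 3, 2, 1, 4]
print(as_vector(text, toy_index, "binary"))  # [0, 1, 1, 1, 1]
print(as_vector(text, toy_index, "count"))   # [0, 2, 1, 1, 1]
```

The sequence keeps word order and varies in length, while the vector has a fixed size but discards order.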

In [0]:
# Creating an instance of a tokenizer

tokenizer = Tokenizer(num_words = 100)


In [0]:
# Fit the tokenizer on the data; this builds the vocabulary (word index) from the sentences
tokenizer.fit_on_texts(sentences)



In [0]:
word_index = tokenizer.word_index

In [32]:
# Note that the Tokenizer lowercases the text and strips out the punctuation

print (word_index)

{'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'cat': 5, 'you': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10, 'have': 11, 'a': 12, 'new': 13, 'car': 14}
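The ordering of the index above follows word frequency: 'my' appears four times, so it gets index 1, and words with equal counts keep their order of first appearance. The following is a minimal pure-Python sketch of that idea, not the Keras implementation. The helper names `simple_tokenize` and `build_word_index` are made up for illustration, and the punctuation handling only approximates the Tokenizer's default filters.

```python
from collections import Counter
import string

def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my cat is amazing',
    'I have a new car',
]

sketch_index = build_word_index(sentences)
print(sketch_index)
```

For these five sentences the sketch reproduces the same index as the Tokenizer output above.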


**Text to Sequence**

Now our goal is to turn our sentences into lists of values based on the tokens that we created above.
We also need to manipulate these lists, including but not limited to making each list the same length. Otherwise, it may be hard to train a neural network with them.

In [0]:
sequences = tokenizer.texts_to_sequences(sentences)

In [34]:
print(sequences)

[[2, 3, 1, 4], [2, 3, 1, 5], [6, 3, 1, 4], [7, 6, 8, 1, 5, 9, 10], [2, 11, 12, 13, 14]]


**Let us encode unseen text with this tokenizer**
One thing to keep in mind when doing inference is that the words or sentences you use as test data must be encoded with the same word index that was used for training. Otherwise, the result would be meaningless.

In [0]:
test_sequence = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [0]:
test_sequence = tokenizer.texts_to_sequences(test_sequence)

In [38]:
print (test_sequence)

[[2, 3, 1, 4], [1, 4, 1]]
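Notice that the output above is shorter than the input: 'really', 'loves' and 'manatee' were never seen during fitting, so they are silently dropped. The following pure-Python sketch illustrates that lookup behaviour under the assumption that encoding is a plain dictionary lookup; `encode` and `dropped_words` are hypothetical helpers, and the index is copied from the output printed earlier.

```python
# Word index copied from the tokenizer output above.
word_index = {'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'cat': 5, 'you': 6,
              'do': 7, 'think': 8, 'is': 9, 'amazing': 10, 'have': 11,
              'a': 12, 'new': 13, 'car': 14}

def encode(texts, index):
    """Look up each word; words missing from the index are skipped."""
    return [[index[w] for w in text.lower().split() if w in index]
            for text in texts]

def dropped_words(texts, index):
    """List the words in each text that the index does not know."""
    return [[w for w in text.lower().split() if w not in index]
            for text in texts]

test_sentences = [
    'i really love my dog',
    'my dog loves my manatee',
]

print(encode(test_sentences, word_index))         # [[2, 3, 1, 4], [1, 4, 1]]
print(dropped_words(test_sentences, word_index))  # [['really'], ['loves', 'manatee']]
```

Checking the dropped words like this is a quick way to see how much of the test text the vocabulary actually covers.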


Efficiency Tip:
Use a special token for words that appear in your test data but were not seen when fitting the tokenizer, instead of letting them be dropped. We can do this in Keras very easily with the oov_token argument.
Let's create the tokenizer again, this time with an out-of-vocabulary token.

In [40]:
tokenizer = Tokenizer(num_words = 100, oov_token = '<OOV>')

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print (word_index)
sequences = tokenizer.texts_to_sequences(sentences)

{'<OOV>': 1, 'my': 2, 'i': 3, 'love': 4, 'dog': 5, 'cat': 6, 'you': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11, 'have': 12, 'a': 13, 'new': 14, 'car': 15}


In [0]:
test_sequence = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [0]:
test_sequence = tokenizer.texts_to_sequences(test_sequence)

In [43]:
print (test_sequence)

[[3, 1, 4, 2, 5], [2, 5, 1, 2, 1]]
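With the out-of-vocabulary token in place, each encoded sequence has the same number of tokens as its sentence, and every unknown word maps to index 1. The sketch below mimics that fallback in pure Python and also measures how much of the test text is unknown. The helpers `encode_with_oov` and `oov_rate` are hypothetical, and the index is written out by hand to match the output above.

```python
OOV = '<OOV>'

# Word index written out by hand, with the OOV token at index 1.
word_index = {OOV: 1, 'my': 2, 'i': 3, 'love': 4, 'dog': 5, 'cat': 6,
              'you': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11,
              'have': 12, 'a': 13, 'new': 14, 'car': 15}

def encode_with_oov(texts, index, oov_token=OOV):
    """Look up each word; unknown words fall back to the OOV index."""
    oov_id = index[oov_token]
    return [[index.get(w, oov_id) for w in text.lower().split()]
            for text in texts]

def oov_rate(encoded, oov_id):
    """Fraction of all tokens that were mapped to the OOV index."""
    total = sum(len(seq) for seq in encoded)
    unknown = sum(seq.count(oov_id) for seq in encoded)
    return unknown / total if total else 0.0

test_sentences = [
    'i really love my dog',
    'my dog loves my manatee',
]

encoded = encode_with_oov(test_sentences, word_index)
print(encoded)                             # [[3, 1, 4, 2, 5], [2, 5, 1, 2, 1]]
print(oov_rate(encoded, word_index[OOV]))  # 0.3
```

A high OOV rate on real test data usually means the training corpus was too small to cover the vocabulary.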


**Padding**
As we mentioned earlier, a neural network needs to be trained on data that is uniform in size, but sentences rarely all have the same length. To deal with this issue, we use padding.

In [0]:
# First, import pad_sequences from TensorFlow.
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
# Pad the sequences
padded = pad_sequences(sequences)

In [51]:
print ("Sequences are " + str(sequences))
print ("Padded Sequences are "+ str(padded))

Sequences are [[3, 4, 2, 5], [3, 4, 2, 6], [7, 4, 2, 5], [8, 7, 9, 2, 6, 10, 11], [3, 12, 13, 14, 15]]
Padded Sequences are [[ 0  0  0  3  4  2  5]
 [ 0  0  0  3  4  2  6]
 [ 0  0  0  7  4  2  5]
 [ 8  7  9  2  6 10 11]
 [ 0  0  3 12 13 14 15]]


**As we can see, every sequence now has the same length as the longest sequence, with zeros added at the front**

**If we want to pad the sentences after the text instead, we can make a small tweak to the above code**

In [52]:
padded = pad_sequences(sequences , padding = 'post')
print ("Sequences are " + str(sequences))
print ("Padded Sequences are "+ str(padded))

Sequences are [[3, 4, 2, 5], [3, 4, 2, 6], [7, 4, 2, 5], [8, 7, 9, 2, 6, 10, 11], [3, 12, 13, 14, 15]]
Padded Sequences are [[ 3  4  2  5  0  0  0]
 [ 3  4  2  6  0  0  0]
 [ 7  4  2  5  0  0  0]
 [ 8  7  9  2  6 10 11]
 [ 3 12 13 14 15  0  0]]


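We can set the maximum length of any sequence with the **maxlen** argument, as in the following code. Note that longer sequences are cut from the front by default.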

In [53]:
padded = pad_sequences(sequences , padding = 'post' , maxlen = 5)
print ("Sequences are " + str(sequences))
print ("Padded Sequences are "+ str(padded))

Sequences are [[3, 4, 2, 5], [3, 4, 2, 6], [7, 4, 2, 5], [8, 7, 9, 2, 6, 10, 11], [3, 12, 13, 14, 15]]
Padded Sequences are [[ 3  4  2  5  0]
 [ 3  4  2  6  0]
 [ 7  4  2  5  0]
 [ 9  2  6 10 11]
 [ 3 12 13 14 15]]


With the "**maxlen**" constraint, sequences that are too long lose tokens from the start of the sentence by default. To avoid this, we can set the **truncating** argument to 'post'. This way, we lose the information from the end of the sentence rather than from the front.

In [54]:
padded = pad_sequences(sequences , padding = 'post' , maxlen = 5 , truncating = 'post')
print ("Sequences are " + str(sequences))
print ("Padded Sequences are "+ str(padded))

Sequences are [[3, 4, 2, 5], [3, 4, 2, 6], [7, 4, 2, 5], [8, 7, 9, 2, 6, 10, 11], [3, 12, 13, 14, 15]]
Padded Sequences are [[ 3  4  2  5  0]
 [ 3  4  2  6  0]
 [ 7  4  2  5  0]
 [ 8  7  9  2  6]
 [ 3 12 13 14 15]]
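To summarise how the three arguments interact, here is a minimal pure-Python sketch of the padding and truncating logic. It is an illustration written for this notebook, not the Keras implementation: the function name `pad` is made up, it returns plain lists instead of a NumPy array, and it assumes the defaults are 'pre' for both padding and truncating, which is what the outputs above show.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or cut every sequence to the same length.

    padding    -- 'pre' adds fill values at the front, 'post' at the end
    truncating -- 'pre' drops tokens from the front, 'post' from the end
    """
    if maxlen is None:
        # Without maxlen, use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == 'pre':
                seq = seq[len(seq) - maxlen:]  # keep the last maxlen tokens
            else:
                seq = seq[:maxlen]             # keep the first maxlen tokens
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result

sequences = [[3, 4, 2, 5], [3, 4, 2, 6], [7, 4, 2, 5],
             [8, 7, 9, 2, 6, 10, 11], [3, 12, 13, 14, 15]]

print(pad(sequences))
print(pad(sequences, padding='post'))
print(pad(sequences, padding='post', maxlen=5))
print(pad(sequences, padding='post', maxlen=5, truncating='post'))
```

The four calls correspond to the four padded outputs shown in this section, so the rows can be compared one by one.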
