In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Below is the signature of `Tokenizer` from the Keras documentation:

```
keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None)
```

- `num_words`: the maximum number of words to keep, based on word frequency. Only the most common num_words words will be kept.

According to this description of `num_words`, the vocabulary should contain only `num_words` words. In practice, the vocabulary stored in `word_index` contains every word seen during fitting.

In [10]:
num_words = 3
tk = Tokenizer(num_words=num_words)
texts = ["my name is far faraway asdasd", "my name is","your name is"]
tk.fit_on_texts(texts)
print(tk.word_index)
print(tk.texts_to_sequences(texts))

{'name': 1, 'is': 2, 'my': 3, 'far': 4, 'faraway': 5, 'asdasd': 6, 'your': 7}
[[1, 2], [1, 2], [1, 2]]


- `tk.word_index` contains every word even though we set `num_words=3`; the indices are assigned in order of decreasing word frequency.

- `0` is a reserved index, never assigned to an existing word. `texts_to_sequences` keeps only words whose index is less than `num_words`, so if we want it to output the 3 most common words, we should set `num_words=3+1`.
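
The two points above can be reproduced with a small plain-Python sketch. It is a simplified model of the behaviour observed in this notebook, not Keras's actual implementation: the helper names `build_word_index` and `to_sequences` are made up for illustration, and punctuation filtering is left out.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, starting at 1 (0 stays reserved)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Keep only words whose index is strictly below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["my name is far faraway asdasd", "my name is", "your name is"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequences(texts, word_index, num_words=3))  # [[1, 2], [1, 2], [1, 2]]
print(to_sequences(texts, word_index, num_words=4))  # [[3, 1, 2], [3, 1, 2], [1, 2]]
```

Because the filter is `index < num_words` and indices start at 1, passing `num_words=3` keeps only two words, which is why `3+1` is needed to keep three.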

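We can also set `oov_token='UNK'` to add `UNK` to the vocabulary. In the Keras version used here, the index of `UNK` is `word_count + 1`, where `word_count` is the number of distinct words (7 in this example).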

In [12]:
num_words = 3
tk = Tokenizer(num_words=num_words+1, oov_token='UNK')
texts = ["my name is far faraway asdasd", "my name is","your name is"]
tk.fit_on_texts(texts)
print(tk.word_index)
print(tk.texts_to_sequences(texts))

{'name': 1, 'is': 2, 'my': 3, 'far': 4, 'faraway': 5, 'asdasd': 6, 'your': 7, 'UNK': 8}
[[3, 1, 2], [3, 1, 2], [1, 2]]


When `texts_to_sequences` processes each sentence, it only considers words whose index is below the tokenizer's `num_words` (4 here), so the dictionary it effectively uses is `{'name': 1, 'is': 2, 'my': 3}`. Sentences 1 and 3 contain **UNKNOWN** words that are not in `{'name': 1, 'is': 2, 'my': 3}`, and `texts_to_sequences` simply skips them.

This is not ideal when dealing with OOV (out-of-vocabulary) words. We would like the output to look like this:

```[[3, 1, 2, UNK, UNK, UNK], [3, 1, 2], [UNK, 1, 2]]```

To get this output, we modify `tk.word_index` directly:


In [13]:
tk.oov_token

'UNK'

In [14]:
tk.word_index = {e:i for e,i in tk.word_index.items() if i <= num_words} # <= because tokenizer is 1 indexed
tk.word_index[tk.oov_token] = num_words + 1 

In [15]:
print(tk.word_index)
print(tk.texts_to_sequences(texts))

{'name': 1, 'is': 2, 'my': 3, 'UNK': 4}
[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
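
The effect of the trimmed `word_index` can be illustrated without Keras. The following is a minimal sketch, assuming whitespace tokenization only; `encode_with_oov` is a hypothetical helper written for this illustration, not part of the Keras API.

```python
def encode_with_oov(texts, word_index, oov_token="UNK"):
    """Map each word to its index, or to the OOV index if it is unknown."""
    oov_index = word_index[oov_token]
    return [
        [word_index.get(word, oov_index) for word in text.lower().split()]
        for text in texts
    ]

# The trimmed vocabulary from the cell above: top 3 words plus UNK.
word_index = {'name': 1, 'is': 2, 'my': 3, 'UNK': 4}
texts = ["my name is far faraway asdasd", "my name is", "your name is"]

encoded = encode_with_oov(texts, word_index)
print(encoded)  # [[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
```

Every word outside the top 3 now maps to index 4 instead of being dropped, so sentence lengths are preserved.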


In [18]:
sequences = tk.texts_to_sequences(texts)
print(sequences)
data = pad_sequences(sequences, maxlen=10, padding='post')
print(data)

[[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
[[3 1 2 4 4 4 0 0 0 0]
 [3 1 2 0 0 0 0 0 0 0]
 [4 1 2 0 0 0 0 0 0 0]]
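
For reference, here is a plain-Python sketch of what post-padding does. It is an illustration rather than the Keras implementation: `pad_post` is a made-up name, it returns nested lists instead of a NumPy array, and it assumes that overlong sequences are cut from the front, as with the `pad_sequences` default `truncating='pre'`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Assumption: drop items from the front when too long.
            seq = seq[len(seq) - maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

sequences = [[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
data = pad_post(sequences, maxlen=10)
for row in data:
    print(row)
```

Padding with `0` is safe here because index `0` is never assigned to a word.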


By the way, if we pad with zeros, we should set `mask_zero=True` in the `Embedding` layer so that downstream layers such as LSTM ignore the padded positions. For the UNKNOWN vector, we cannot simply use zeros, because UNKNOWN also carries some information. Options to try include averaging the vectors of many infrequent words or using a random vector. You can find more info [here](https://groups.google.com/d/msg/word2vec-toolkit/TgMeiJJGDc0/d1vueZkqeHIJ).

Another reference: https://github.com/keras-team/keras/issues/8092
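
The two options for the UNKNOWN vector can be sketched as follows. Everything here is an assumption made for illustration: the 3-dimensional `embeddings` table holds made-up numbers, and `average_vector` and `random_vector` are hypothetical helpers, not functions from Keras or word2vec.

```python
import random

# Made-up 3-dimensional embeddings, for illustration only.
embeddings = {
    "name": [0.2, 0.1, 0.4],
    "is": [0.0, 0.3, 0.1],
    "far": [0.6, 0.2, 0.0],
    "faraway": [0.4, 0.4, 0.2],
    "asdasd": [0.2, 0.0, 0.4],
}
rare_words = ["far", "faraway", "asdasd"]

def average_vector(words, table):
    """Element-wise mean of the vectors of the given words."""
    vectors = [table[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def random_vector(dim, scale=0.1, seed=0):
    """Small random vector drawn uniformly from [-scale, scale]."""
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(dim)]

unk_from_average = average_vector(rare_words, embeddings)
unk_from_random = random_vector(dim=3)
print(unk_from_average)
print(unk_from_random)
```

Either vector would then be placed in the embedding matrix row for the `UNK` index.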

If we use TensorFlow directly, we can compute the mask with `tf.sequence_mask`: https://www.tensorflow.org/api_docs/python/tf/sequence_mask
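
To show what such a mask looks like, here is a plain-Python sketch of the same idea: position `j` of row `i` is `True` when `j` is smaller than the length of sequence `i`. This is an illustration of the semantics, not TensorFlow code, and the local `sequence_mask` function is a stand-in written for this example.

```python
def sequence_mask(lengths, maxlen):
    """Boolean mask: True for real tokens, False for padded positions."""
    return [[position < length for position in range(maxlen)]
            for length in lengths]

sequences = [[3, 1, 2, 4, 4, 4], [3, 1, 2], [4, 1, 2]]
lengths = [len(seq) for seq in sequences]
mask = sequence_mask(lengths, maxlen=10)
for row in mask:
    print([int(flag) for flag in row])
```

Multiplying a per-token loss by this mask, for example, keeps the padded positions from contributing to training.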