# How to load the Tokenizer class in Keras

In [4]:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.python.client import device_lib
import tensorflow

In [5]:
print ("Current keras version: ", tensorflow.keras.__version__)
print ("Available resources: ", device_lib.list_local_devices())

Current keras version:  2.8.0
Available resources:  [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 4088150603172129951
xla_global_id: -1
]


**How to transform texts to IDs. IDs are assigned by frequency, i.e., the most common word receives ID 1.**

In [6]:
tokenizer = Tokenizer()
texts = ["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do.", 
         "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid).",
         "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself."]
tokenizer.fit_on_texts(texts)  # Builds the vocabulary (word counts and word index) from a list of texts

texts2ids = tokenizer.texts_to_sequences(texts)  # Tokenizes the texts and transforms them into sequences of IDs
print ("Texts as IDs:", texts2ids)
ids2texts = tokenizer.sequences_to_texts(texts2ids)
print ("IDs back to texts:", ids2texts)



Texts as IDs: [[8, 4, 14, 1, 15, 2, 16, 5, 17, 18, 6, 19, 20, 3, 21, 9, 5, 22, 10, 1, 23], [7, 11, 4, 24, 12, 6, 25, 26, 13, 27, 13, 11, 28, 29, 3, 30, 31, 32, 6, 33, 2, 34, 9, 35], [36, 4, 10, 7, 2, 37, 12, 38, 39, 40, 8, 41, 42, 7, 2, 43, 44, 5, 3, 45, 1, 46, 3, 47, 48, 1, 49]]
IDs back to texts: ['alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do', 'so she was considering in her own mind as well as she could for the hot day made her feel very sleepy and stupid', 'there was nothing so very remarkable in that nor did alice think it so very much out of the way to hear the rabbit say to itself']
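
The frequency-based assignment can be reproduced in a few lines of plain Python. The sketch below approximates what `fit_on_texts` and `texts_to_sequences` do with the default settings; it is not the Keras implementation. `simple_tokenize` and `build_word_index` are hypothetical helper names, and ties between equally frequent words are assumed to be broken by order of first appearance, which is consistent with the output above.

```python
# Plain-Python approximation of the default Tokenizer behaviour (illustrative only).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # default value of `filters`

def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an ID; more frequent words get smaller IDs."""
    counts = {}
    for text in texts:
        for word in simple_tokenize(text):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words keep first-appearance order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

texts = ["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do.",
         "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid).",
         "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself."]

word_index = build_word_index(texts)
first_ids = [word_index[word] for word in simple_tokenize(texts[0])]
print(word_index["to"], word_index["very"], word_index["the"])
print(first_ids)
```

"to", "very" and "the" each occur four times, so they share the top frequency and receive IDs 1, 2 and 3 in the order in which they first appear.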


We can inspect what the Tokenizer has learned, e.g., obtain the index of a given word using tokenizer.word_index.

In [7]:
print ("tokenizer.word_counts:", tokenizer.word_counts)
print ("tokenizer.document_counts:", tokenizer.document_count)
print ("tokenizer.word_index:",tokenizer.word_index)
print ("tokenizer.word_docs", tokenizer.word_docs)

tokenizer.word_counts: OrderedDict([('alice', 2), ('was', 3), ('beginning', 1), ('to', 4), ('get', 1), ('very', 4), ('tired', 1), ('of', 3), ('sitting', 1), ('by', 1), ('her', 3), ('sister', 1), ('on', 1), ('the', 4), ('bank', 1), ('and', 2), ('having', 1), ('nothing', 2), ('do', 1), ('so', 3), ('she', 2), ('considering', 1), ('in', 2), ('own', 1), ('mind', 1), ('as', 2), ('well', 1), ('could', 1), ('for', 1), ('hot', 1), ('day', 1), ('made', 1), ('feel', 1), ('sleepy', 1), ('stupid', 1), ('there', 1), ('remarkable', 1), ('that', 1), ('nor', 1), ('did', 1), ('think', 1), ('it', 1), ('much', 1), ('out', 1), ('way', 1), ('hear', 1), ('rabbit', 1), ('say', 1), ('itself', 1)])
tokenizer.document_counts: 3
tokenizer.word_index: {'to': 1, 'very': 2, 'the': 3, 'was': 4, 'of': 5, 'her': 6, 'so': 7, 'alice': 8, 'and': 9, 'nothing': 10, 'she': 11, 'in': 12, 'as': 13, 'beginning': 14, 'get': 15, 'tired': 16, 'sitting': 17, 'by': 18, 'sister': 19, 'on': 20, 'bank': 21, 'having': 22, 'do': 23, 'con
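
The difference between `word_counts` (total occurrences) and `word_docs` (number of texts containing the word) is easy to miss. The sketch below recomputes both with `collections.Counter`; it is an illustration, not the Keras code, and the regular-expression tokenization is a simplifying assumption that happens to give the same words for these three sentences.

```python
import re
from collections import Counter

texts = ["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do.",
         "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid).",
         "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself."]

word_counts = Counter()  # total occurrences across all texts
word_docs = Counter()    # number of texts in which each word appears

for text in texts:
    words = re.findall(r"[a-z]+", text.lower())  # simplified tokenization (assumption)
    word_counts.update(words)
    word_docs.update(set(words))  # a set counts each word once per text

for word in ("very", "she", "alice"):
    print(word, "occurrences:", word_counts[word], "texts:", word_docs[word])
```

For instance, "she" occurs twice but only in the second sentence, so its occurrence count is 2 while its document count is 1.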

Notice that punctuation has been removed and the text has been lowercased. This is because of the default parameter values of the Tokenizer class, which are documented at: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

    tf.keras.preprocessing.text.Tokenizer(
        num_words=None,
        filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
        lower=True, split=' ', char_level=False, oov_token=None,
        document_count=0, **kwargs
    )

You are free to set the other parameters as you prefer and to use them in the assignments as you please to create your language model (LM).

In the lab assignment we will modify two parameters in particular, num_words and oov_token:

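**num_words**: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept. The limit is applied when texts are converted to sequences; word_index still contains every word seen during fitting.

**oov_token**: if given, it will be added to word_index (with index 1) and used to replace out-of-vocabulary words during texts_to_sequences calls. Since it occupies an index, it counts towards the num_words limit.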
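
To make the interaction between the two parameters concrete, here is a minimal plain-Python sketch of the lookup rule described above. The function `encode`, its arguments, and the toy `word_index` are hypothetical and only mimic the documented behaviour; this is not the Keras source.

```python
def encode(words, word_index, num_words=None, oov_index=None):
    """Convert words to IDs, honouring a vocabulary limit and an optional OOV ID."""
    ids = []
    for word in words:
        idx = word_index.get(word)
        # A word is usable if it is known and its ID is below the num_words limit.
        in_vocab = idx is not None and (num_words is None or idx < num_words)
        if in_vocab:
            ids.append(idx)
        elif oov_index is not None:
            ids.append(oov_index)
        # With no OOV ID configured, the word is silently dropped.
    return ids

# Toy vocabulary (assumption): index 1 is reserved for the OOV token.
word_index = {"<unk>": 1, "to": 2, "very": 3, "the": 4, "was": 5}
words = ["the", "rabbit", "was", "very", "late"]

no_oov = encode(words, word_index)                            # unknown words dropped
with_oov = encode(words, word_index, oov_index=1)             # unknown words -> 1
limited = encode(words, word_index, num_words=5, oov_index=1) # IDs >= 5 -> 1
print(no_oov, with_oov, limited)
```

Without an OOV ID the output sequence can be shorter than the input, which matters if positions in the text need to be preserved.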





**How to save and load a tokenizer**

In [8]:
import pickle

# Save the fitted tokenizer to disk
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
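
Pickle files are specific to Python and should only be loaded from trusted sources. If only the vocabulary is needed, one possible alternative is to store `word_index` as JSON. The sketch below uses only the standard library; the small dictionary is a stand-in for `tokenizer.word_index`, and the file name `word_index.json` is a hypothetical choice. To serialize the whole tokenizer as JSON, Keras also offers `Tokenizer.to_json()` and `tokenizer_from_json` in the same module.

```python
import json
import os
import tempfile

# Stand-in for tokenizer.word_index (assumption: a plain dict of str -> int).
word_index = {"to": 1, "very": 2, "the": 3}

# Hypothetical file location inside the system temporary directory.
path = os.path.join(tempfile.gettempdir(), "word_index.json")

with open(path, "w", encoding="utf-8") as handle:
    json.dump(word_index, handle, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored == word_index)
```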
  

**Example of a Tokenizer with an oov_token**

In [9]:
tokenizer = Tokenizer(oov_token="<unk>")
texts = ["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do.", 
         "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid).",
         "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself."]
tokenizer.fit_on_texts(texts) 

oov_texts = ["Harry Potter and the Philosopher stone"]
texts2ids = tokenizer.texts_to_sequences(oov_texts)  # Words not seen during fitting map to the <unk> ID
print ("Texts as IDs:", texts2ids)
ids2texts = tokenizer.sequences_to_texts(texts2ids)
print ("IDs back to texts:", ids2texts)

Texts as IDs: [[1, 1, 10, 4, 1, 1]]
IDs back to texts: ['<unk> <unk> and the <unk> <unk>']


**Example of a Tokenizer with an oov_token and a limited num_words**

In [10]:
tokenizer = Tokenizer(oov_token="<unk>", num_words=15)
texts = ["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do.", 
         "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid).",
         "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself."]
tokenizer.fit_on_texts(texts) 

texts2ids = tokenizer.texts_to_sequences(texts)  # Words with an ID of 15 or more map to the <unk> ID
print ("Texts as IDs:", texts2ids)
ids2texts = tokenizer.sequences_to_texts(texts2ids)
print ("IDs back to texts:", ids2texts)

Texts as IDs: [[9, 5, 1, 2, 1, 3, 1, 6, 1, 1, 7, 1, 1, 4, 1, 10, 6, 1, 11, 2, 1], [8, 12, 5, 1, 13, 7, 1, 1, 14, 1, 14, 12, 1, 1, 4, 1, 1, 1, 7, 1, 3, 1, 10, 1], [1, 5, 11, 8, 3, 1, 13, 1, 1, 1, 9, 1, 1, 8, 3, 1, 1, 6, 4, 1, 2, 1, 4, 1, 1, 2, 1]]
IDs back to texts: ['alice was <unk> to <unk> very <unk> of <unk> <unk> her <unk> <unk> the <unk> and of <unk> nothing to <unk>', 'so she was <unk> in her <unk> <unk> as <unk> as she <unk> <unk> the <unk> <unk> <unk> her <unk> very <unk> and <unk>', '<unk> was nothing so very <unk> in <unk> <unk> <unk> alice <unk> <unk> so very <unk> <unk> of the <unk> to <unk> the <unk> <unk> to <unk>']
