<a href="https://colab.research.google.com/github/AbhishekMajhi/Deep-Learning-Work-Space/blob/main/Week1_Word_Embedding_Using_TensorflowKeras_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer Introduction

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
sentences = [
    'I love my cat',
    'I love animals'
]

In [None]:
# num_words limits the vocabulary used when encoding texts: only the most frequent words are kept
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
# word_index is a dict mapping each word (key) to its integer token (value)
word_index = tokenizer.word_index
# the tokenizer lowercases the text and strips punctuation

In [None]:
word_index

{'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'animals': 5}
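To see where these numbers come from, here is a rough pure-Python sketch of a frequency-ranked word index. It is not the Keras implementation: `build_word_index` is a hypothetical helper, and the `FILTERS` string is assumed to match the Tokenizer's default `filters` argument. More frequent words get smaller indices, ties keep first-seen order, and index 0 is left free for padding.

```python
from collections import Counter

# Assumed to match the Tokenizer's default `filters` argument (note: no apostrophe).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    # Replace every filtered character with a space.
    strip_table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_table).split())
    # most_common() sorts by count; ties keep first-seen order.
    # Ranks start at 1 so that 0 stays free for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


print(build_word_index(["I love my cat", "I love animals"]))
# {'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'animals': 5}
```

Both 'i' and 'love' appear twice, so they take the first two indices in the order they were first seen.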

# Text to sequence

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
sentences = [
    'I love my cat',
    'I love animals',
    'Do you also love animals?',
    "Don't you think my cat is cute?"
]

In [None]:
# initialize the tokenizer
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary from the input sentences
tokenizer.fit_on_texts(sentences)
# Get the word index dictionary
word_index = tokenizer.word_index

In [None]:
# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences) # gives one list of integer tokens per sentence

In [None]:
print(word_index)
print(sequences)

{'love': 1, 'i': 2, 'my': 3, 'cat': 4, 'animals': 5, 'you': 6, 'do': 7, 'also': 8, "don't": 9, 'think': 10, 'is': 11, 'cute': 12}
[[2, 1, 3, 4], [2, 1, 5], [7, 6, 8, 1, 5], [9, 6, 10, 3, 4, 11, 12]]
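With only 12 distinct words, num_words=100 never removes anything here. As a rough sketch of what a smaller limit would do, assume that texts_to_sequences keeps only tokens whose index is below num_words, while word_index itself still lists every word. `cap_vocabulary` is a hypothetical helper applied to the sequences printed above, not part of Keras, and it assumes no "oov_token" is set.

```python
def cap_vocabulary(sequences, num_words):
    """Hypothetical helper: drop tokens whose index is not below num_words."""
    return [[token for token in seq if token < num_words] for seq in sequences]


sequences = [[2, 1, 3, 4], [2, 1, 5], [7, 6, 8, 1, 5], [9, 6, 10, 3, 4, 11, 12]]
print(cap_vocabulary(sequences, num_words=4))
# [[2, 1, 3], [2, 1], [1], [3]]
```

Only 'love', 'i' and 'my' (indices 1 to 3) survive, so rarer words vanish from the encoded sentences.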


* If a word was not in the vocabulary when the tokenizer was fitted, it is silently dropped when new text is converted to sequences at inference time.
* So we need a large training set, so that the vocabulary is big enough to cover most of the words seen at inference time.
* Instead of dropping unseen words, we can map them to a special value with the tokenizer's "oov_token" argument.

In [None]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")   # the OOV token should be unique and distinct from any real word
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences) # gives one list of integer tokens per sentence
print(word_index)
print(sequences)

{'<OOV>': 1, 'love': 2, 'i': 3, 'my': 4, 'cat': 5, 'animals': 6, 'you': 7, 'do': 8, 'also': 9, "don't": 10, 'think': 11, 'is': 12, 'cute': 13}
[[3, 2, 4, 5], [3, 2, 6], [8, 7, 9, 2, 6], [10, 7, 11, 4, 5, 12, 13]]
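The sentences encoded above were all seen during fitting, so token 1 never shows up in the output. A minimal pure-Python sketch of the lookup, using a subset of the two word indexes printed above, shows the difference on a new sentence. `encode` is a hypothetical helper that only mimics the behaviour described here, and `FILTERS` is assumed to match the Tokenizer's default `filters` argument.

```python
# Assumed to match the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def encode(text, word_index, oov_token=None):
    """Hypothetical helper: look up each word, handling unseen words."""
    strip_table = str.maketrans({ch: " " for ch in FILTERS})
    tokens = []
    for word in text.lower().translate(strip_table).split():
        if word in word_index:
            tokens.append(word_index[word])
        elif oov_token is not None:
            # Unseen word: fall back to the index of the OOV token.
            tokens.append(word_index[oov_token])
        # Without an OOV token the unseen word is simply skipped.
    return tokens


plain_index = {'love': 1, 'i': 2, 'my': 3, 'cat': 4, 'animals': 5}
oov_index = {'<OOV>': 1, 'love': 2, 'i': 3, 'my': 4, 'cat': 5, 'animals': 6}

new_text = "I really love my dog"
print(encode(new_text, plain_index))                   # [2, 1, 3]
print(encode(new_text, oov_index, oov_token="<OOV>"))  # [3, 1, 2, 4, 1]
```

Without the OOV token the five-word sentence shrinks to three tokens. With it, the length is preserved and 'really' and 'dog' both become 1.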


# Padding

* Before we feed data to a model for training, all sequences must have the same length, so we pad the shorter ones.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
sentences = [
    'I love my cat',
    'I love animals',
    'Do you also love animals?',
    "Don't you think my cat is cute?"
]

In [None]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

In [None]:
# tokenize the sentences
sequences = tokenizer.texts_to_sequences(sentences)
# pad the sequences with zeros
# by default the padding is added before the sequence, so use padding='post' to pad at the end
# by default every sequence is padded to the length of the longest one (num_words limits the vocabulary, not the length); 'maxlen' sets the length explicitly
# like padding, truncation of sequences longer than maxlen happens at the start by default; truncating='post' trims from the end instead
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=7)
print(word_index)
print()
print(padded)

{'<OOV>': 1, 'love': 2, 'i': 3, 'my': 4, 'cat': 5, 'animals': 6, 'you': 7, 'do': 8, 'also': 9, "don't": 10, 'think': 11, 'is': 12, 'cute': 13}

[[ 3  2  4  5  0  0  0]
 [ 3  2  6  0  0  0  0]
 [ 8  7  9  2  6  0  0]
 [10  7 11  4  5 12 13]]
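Since maxlen=7 equals the longest sentence, nothing is truncated in the output above. To make the 'pre' and 'post' options concrete, here is a small pure-Python sketch of the same idea with maxlen=5. `pad` is a hypothetical helper that returns plain lists rather than the NumPy array pad_sequences gives, and it is only meant to illustrate the arguments used above.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper mirroring the pad_sequences arguments used above."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' keeps the last maxlen tokens, 'post' the first.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[3, 2, 4, 5], [10, 7, 11, 4, 5, 12, 13]]
print(pad(sequences, maxlen=5))
# [[0, 3, 2, 4, 5], [11, 4, 5, 12, 13]]
print(pad(sequences, maxlen=5, padding="post", truncating="post"))
# [[3, 2, 4, 5, 0], [10, 7, 11, 4, 5]]
```

With the 'pre' defaults the long sentence loses its first two tokens. With 'post' it loses its last two, and the zeros move to the end of the short one.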
