# Tokenization
- The process of representing words as numbers so that a computer can process them and a neural network (NN) can later be trained to learn their meaning.

### Concept

- ASCII codes can be used to represent the letters of a word.
#### Example
1. LISTEN = 076, 073, 083, 084, 069, 078
2. SILENT = 083, 073, 076, 069, 078, 084

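- **`Disadvantage`** Both words contain the same letters, so they yield the same ASCII values in a different order, even though their meanings are unrelated. Letter codes alone therefore say little about what a word means.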

- Hence, instead of encoding letters, **encode words**.
#### Example
- I Love my Dog : 001, 002, 003, `004`
- I Love my Cat : 001, 002, 003, `005`
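
A minimal sketch of word-level encoding, assuming each new word simply gets the next free integer in order of first appearance. The helper `build_word_index` is hypothetical and only meant to reproduce the two encodings listed above.

```python
# Hypothetical helper: number words in the order they are first seen.
def build_word_index(sentences):
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

sentences = ["I Love my Dog", "I Love my Cat"]
word_index = build_word_index(sentences)

# Encode each sentence as the list of its word ids.
encoded = [[word_index[w] for w in s.lower().split()] for s in sentences]

print(word_index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)     # [[1, 2, 3, 4], [1, 2, 3, 5]]
```

The two sentences now differ only in their last token, which reflects how similar they are.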


In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
# Represent the sentences as a Python list of strings
sentences = [
    'I love my dog',
    'I love my cat',
    'you love my dog!'
]

In [3]:
# Create an instance of the Tokenizer object.
# num_words is the maximum number of words to keep, based on word frequency
tokenizer = Tokenizer(num_words=100)

# The tokenizer goes through all the text and fits itself to it
tokenizer.fit_on_texts(sentences)

# The full list of words is available as the word_index property
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
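
The printed index has three visible properties: words are lowercased, the `!` in `dog!` is ignored, and more frequent words get lower numbers. The sketch below approximates that behaviour in plain Python. The helper `approx_word_index` is hypothetical, and stripping everything in `string.punctuation` is an assumption that only roughly matches the tokenizer's default filter.

```python
import string
from collections import Counter

# Hypothetical helper approximating a frequency-ordered word index.
def approx_word_index(sentences):
    # Assumption: replace punctuation with spaces, lowercase, split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.translate(table).lower().split())
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ['I love my dog', 'I love my cat', 'you love my dog!']
approx_index = approx_word_index(sentences)
print(approx_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```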


## Sequencing - Turning sentences into data.

In [4]:
sentences1 = [
    'I love my dog',
    'I love my cat',
    'you love my dog!',
    'Do you think my dog is amazing?'
]

In [5]:
tokenizer = Tokenizer(num_words=100)

# We have a set of sentences that we will use to train a NN
tokenizer.fit_on_texts(sentences1)

# The tokenizer builds the word index from the training sentences;
# the index is then used to create the sequences
word_index = tokenizer.word_index

In [6]:
# Create a sequence of tokens representing each sentence

sequences=tokenizer.texts_to_sequences(sentences1)

In [7]:
print(word_index)
print(sentences1)
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
['I love my dog', 'I love my cat', 'you love my dog!', 'Do you think my dog is amazing?']
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


## Problem

Q1. What happens when the NN needs to classify text that contains words it has never seen before?
> The tokenizer has no entry for such words. How should this be handled?

In [8]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [9]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


- Words such as `really`, `loves` and `manatee` are not present in `word_index` because they were not in the initial set of data, so the tokenizer silently drops them.

#### Result
- The five-word first sentence ends up as the four-token sequence 4, 2, 1, 3, and the five-word second sentence ends up as 1, 3, 1, because the corpus used to build the index did not contain **`really`, `loves` or `manatee`**.


#### Conclusion
- We therefore need a large word index, built from a broad training corpus, to handle sentences that are not in the training set.

### Solution
- To avoid losing the length of the sequence, set the **`oov_token` parameter** of the `Tokenizer`. It reserves a token for words that are not in the corpus.
- The tokenizer creates an index entry for that token and replaces every word it does not recognize with the out-of-vocabulary (OOV) token.
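
The difference between dropping unknown words and mapping them to an OOV token can be sketched with a plain dictionary lookup. The helper `to_sequence` and the variable `oov_index` are hypothetical. The index is copied from the printed output of the sequencing step, and shifting every id by one so that `'<OOV>'` takes index 1 is an assumption about how the reserved token is numbered.

```python
# Word index copied from the printed output of the sequencing step.
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

# Hypothetical helper: unknown words are dropped unless oov_id is given.
def to_sequence(sentence, index, oov_id=None):
    ids = []
    for word in sentence.lower().split():
        if word in index:
            ids.append(index[word])
        elif oov_id is not None:
            ids.append(oov_id)
    return ids

sentence = 'my dog loves my manatee'

# Without an OOV token the unknown words vanish and the length shrinks.
dropped = to_sequence(sentence, word_index)
print(dropped)  # [1, 3, 1]

# Assumption: reserve id 1 for '<OOV>' and shift every other word up by one.
oov_index = {'<OOV>': 1, **{w: i + 1 for w, i in word_index.items()}}
kept = to_sequence(sentence, oov_index, oov_id=oov_index['<OOV>'])
print(kept)  # [2, 4, 1, 2, 1]
```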

In [10]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences1)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences1)

In [11]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

In [12]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_seq)

# Result: the sentences keep their length
# Token 1 is assigned to unrecognized words

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


- We still lose some of the meaning of the sentence, but we at least get the correct length.

- The OOV token keeps each sequence as long as its sentence, but the sentences themselves still differ in length. How can a NN be trained on sequences of different lengths?

- Images fed to a NN usually all have the same size, but sentences have different lengths!

**1. One solution is to use a `RaggedTensor`.**

**2. A simpler solution is padding.**

### Padding

In [13]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
sentences1 = [
    'I love my dog',
    'I love my cat',
    'you love my dog!',
    'Do you think my dog is amazing?'
]

In [15]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences1)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences1)

In [16]:
padded = pad_sequences(sequences)
print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


### NOTE

- The longest sentence has 7 words, so the shorter sequences are padded with 0's up to that length. By default the 0's are added at the front.
- The `padding` parameter can be set to `'pre'` or `'post'` to choose whether the 0's go at the front or at the end. Either way, all sequences end up the same size.

In [17]:
# Pad the sequences with 0's at the end ('post')

padded = pad_sequences(sequences, padding='post')
print(padded)

[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


In [18]:
# If the padded length should not equal the length of the longest
# sentence, use the maxlen parameter. Longer sequences are truncated,
# by default from the front ('pre').

padded = pad_sequences(sequences, padding='post', maxlen=5)
print(padded)

[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]


In [19]:
# The truncating parameter decides where words are dropped from:
# 'post' removes them from the end of the sequence

padded = pad_sequences(sequences, truncating='post', maxlen=5)
print(padded)

[[0 5 3 2 4]
 [0 5 3 2 7]
 [0 6 3 2 4]
 [8 6 9 2 4]]
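
How `padding`, `truncating` and `maxlen` interact can be sketched in plain Python. The function `pad` below is a hypothetical stand-in written for illustration, not the library implementation, and the token ids in `toy` are made up rather than taken from the cells above.

```python
# Hypothetical pure-Python sketch of padding/truncating to a common length.
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        # Truncate first: 'pre' drops from the front, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: 'pre' puts the fill in front, 'post' puts it after.
        result.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return result

toy = [[1, 2], [3, 4, 5, 6]]  # illustrative token ids only

print(pad(toy))                               # [[0, 0, 1, 2], [3, 4, 5, 6]]
print(pad(toy, padding='post'))               # [[1, 2, 0, 0], [3, 4, 5, 6]]
print(pad(toy, maxlen=3))                     # [[0, 1, 2], [4, 5, 6]]
print(pad(toy, maxlen=3, truncating='post'))  # [[0, 1, 2], [3, 4, 5]]
```

Truncation and padding are independent choices: a sequence longer than `maxlen` is cut, a shorter one is filled, and each parameter only controls which end is affected.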
