<a href="https://colab.research.google.com/github/Manjunath727/DLwithTF/blob/master/NLP_Using_TF/W1/L1_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Tokenization

    Tokenization is the process of converting text into numeric values, with each number representing a word or a character.

    Keras provides a Tokenizer class that builds a dictionary of word encodings and turns sentences into vectors. Its word_index property returns a dictionary of key-value pairs, where the key is a word and the value is the token for that word. Tokens start at 1, and more frequent words get lower numbers.

    By default, the Tokenizer lowercases the text and strips punctuation.

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'Chelsea beat Barcelona',
    'I pity Arsenal',
    'Mou is the best manager in the world'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_indices = tokenizer.word_index
print(word_indices)

{'the': 1, 'chelsea': 2, 'beat': 3, 'barcelona': 4, 'i': 5, 'pity': 6, 'arsenal': 7, 'mou': 8, 'is': 9, 'best': 10, 'manager': 11, 'in': 12, 'world': 13}
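
To make the ordering of the dictionary above less mysterious, here is a minimal pure-Python sketch of how such a word index can be built. It is not the Keras implementation: `split_words` and `build_word_index` are hypothetical helpers, and the `FILTERS` string is assumed to match the documented default of the Tokenizer's `filters` argument.

```python
from collections import Counter

# Separator characters, assumed to match the documented default of the
# Keras Tokenizer `filters` argument (the apostrophe is not in the list).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase the text, turn filtered characters into spaces, then split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, starting at index 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sentences = [
    'Chelsea beat Barcelona',
    'I pity Arsenal',
    'Mou is the best manager in the world',
]
word_index = build_word_index(sentences)
print(word_index)
print(split_words("Chelsea, beat Barcelona!"))
```

'the' occurs twice, so it gets index 1; every other word occurs once and is numbered in order of first appearance.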


## Text to Sequences

    The next step is turning sentences into lists of values using the tokens generated above.
    The Tokenizer provides the texts_to_sequences method for this.
    texts_to_sequences takes any set of sentences and encodes them using the vocabulary that the tokenizer learned in fit_on_texts().
    

In [4]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[2, 3, 4], [5, 6, 7], [8, 9, 1, 10, 11, 12, 1, 13]]
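
The encoding step is essentially a dictionary lookup per word. The sketch below is a hypothetical stand-in, not the Keras method: it assumes plain whitespace splitting and copies the word index from the output printed earlier. It also illustrates what `num_words` appears to do at this stage: the full index is still stored, but only indices below `num_words` are emitted.

```python
# Word index copied from the output printed earlier in this notebook.
word_index = {'the': 1, 'chelsea': 2, 'beat': 3, 'barcelona': 4, 'i': 5,
              'pity': 6, 'arsenal': 7, 'mou': 8, 'is': 9, 'best': 10,
              'manager': 11, 'in': 12, 'world': 13}


def encode_texts(texts, word_index, num_words=None):
    """Hypothetical stand-in: replace each known word with its index."""
    sequences = []
    for text in texts:
        encoded = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is None:
                continue  # word not in the index: nothing to emit
            if num_words is not None and index >= num_words:
                continue  # index outside the num_words budget: skipped
            encoded.append(index)
        sequences.append(encoded)
    return sequences


sentences = [
    'Chelsea beat Barcelona',
    'I pity Arsenal',
    'Mou is the best manager in the world',
]
full = encode_texts(sentences, word_index)
limited = encode_texts(sentences, word_index, num_words=10)
print(full)
print(limited)
```

With `num_words=100` and only 13 distinct words, the limit never takes effect in this notebook; with `num_words=10`, the rarer words 'best', 'manager', 'in' and 'world' would drop out of the third sequence.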


## Testing 

    Now test this tokenizer on a new sentence.

In [6]:
# Try a sentence containing words that the tokenizer was not fit on

test_sentences = [
    'Mou is the greatest football manager in the world',
]

test_seq = tokenizer.texts_to_sequences(test_sentences)

print(test_seq)

[[8, 9, 1, 11, 12, 1, 13]]


## Out-of-vocabulary and Padding

    The test data above contains words such as 'greatest' and 'football' that the tokenizer was not fit on. When tokenizer.texts_to_sequences was run on that sentence, those words were silently dropped, so the encoded sequence is shorter than the sentence. To avoid this, create the tokenizer with the oov_token argument, which reserves a token (index 1) for every out-of-vocabulary word.
    
    

In [0]:
new_tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
new_tokenizer.fit_on_texts(sentences)

sequences = new_tokenizer.texts_to_sequences(sentences)

In [10]:
test_seq = new_tokenizer.texts_to_sequences(test_sentences)
print(test_seq)

[[9, 10, 2, 1, 1, 12, 13, 2, 14]]
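
The effect of the OOV token can be reproduced with a small pure-Python sketch. `encode` is a hypothetical helper, not a Keras function, and the vocabulary list is copied from the word index printed earlier. The sketch assumes the OOV token takes index 1 and shifts every learned word up by one, which matches the output above.

```python
# Vocabulary learned from the three training sentences, most frequent first.
vocabulary = ['the', 'chelsea', 'beat', 'barcelona', 'i', 'pity', 'arsenal',
              'mou', 'is', 'best', 'manager', 'in', 'world']

OOV = '<OOV>'
# Without an OOV token, real words start at index 1.
plain_index = {word: i for i, word in enumerate(vocabulary, start=1)}
# Reserving index 1 for the OOV token shifts every real word up by one.
oov_index = {word: i for i, word in enumerate([OOV] + vocabulary, start=1)}


def encode(text, word_index, oov_token=None):
    """Hypothetical helper: look up each word, falling back to the OOV index."""
    fallback = word_index.get(oov_token)  # None when no OOV token is set
    encoded = []
    for word in text.lower().split():
        index = word_index.get(word, fallback)
        if index is not None:
            encoded.append(index)
    return encoded


test_sentence = 'Mou is the greatest football manager in the world'
without_oov = encode(test_sentence, plain_index)
with_oov = encode(test_sentence, oov_index, oov_token=OOV)
print(len(test_sentence.split()), len(without_oov), len(with_oov))
print(with_oov)
```

With the OOV token, the encoded sequence has one value per word, so word positions are preserved even when the word itself is unknown.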


### Padding

    Just as images must have a uniform size before being passed into a neural network, texts used to train a neural network must have a uniform length.

    The pad_sequences utility from Keras pads or truncates sentences of different lengths so that all of them have the same length. By default, it pads with zeros at the front and truncates from the front, which is why the third sequence below keeps only its last five tokens.
    
    

In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, maxlen=5)

print(sequences)
print(padded_sequences)

[[3, 4, 5], [6, 7, 8], [9, 10, 2, 11, 12, 13, 2, 14]]
[[ 0  0  3  4  5]
 [ 0  0  6  7  8]
 [11 12 13  2 14]]
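
The padding and truncation rules can be sketched in plain Python. `pad_to_length` is a hypothetical helper that returns nested lists rather than a NumPy array; its `padding` and `truncating` parameters are modeled on the `'pre'`/`'post'` options of `pad_sequences`, and the sketch is a simplified illustration rather than the library code.

```python
def pad_to_length(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical list-based stand-in for padding and truncating sequences."""
    if maxlen is None:
        # Without an explicit length, pad everything to the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end.
            if truncating == 'pre':
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler before the tokens, 'post' after them.
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


sequences = [[3, 4, 5], [6, 7, 8], [9, 10, 2, 11, 12, 13, 2, 14]]
front = pad_to_length(sequences, maxlen=5)
back = pad_to_length(sequences, maxlen=5, padding='post', truncating='post')
print(front)
print(back)
```

Choosing `'post'` for both options keeps the start of each sentence and puts the zeros at the end, which may be preferable when the beginning of a text carries the most information.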
