# First TensorFlow program from the Coursera NLP course


**Description** - Using TensorFlow's Keras ***Tokenizer*** class to tokenize words. The Tokenizer does the heavy lifting of managing tokens, turning text into streams of tokens, and so on.

In [0]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

2. Creating a list of sentences

In [0]:
sentence = [ 'I love my dog',
             'I love my cat'
            ]

3. Create an instance of ***Tokenizer*** and pass the ***num_words*** parameter to it. The value 100 is deliberately larger than needed, because we don't know in advance how many unique words the sentence list contains. With this hyper-parameter set, the Tokenizer encodes only the most frequent words when it converts text to sequences (at most ***num_words*** - 1 of them, so 99 here); rarer words are left out.

In [0]:
tokenizer = Tokenizer(num_words=100)
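
As a rough illustration of the cutoff, here is a minimal pure-Python sketch. It assumes the words have already been ranked by frequency (the ***word_index*** step further down produces such a ranking) and that only ranks below ***num_words*** survive encoding. The `word_index` dictionary and the `encode` helper are hypothetical stand-ins written for this example, not Keras code.

```python
# Hypothetical frequency ranking; rank 1 is the most frequent word.
word_index = {'my': 1, 'love': 2, 'i': 3, 'dog': 4, 'cat': 5}

def encode(text, word_index, num_words=None):
    """Encode known words, keeping only ranks below num_words (assumed rule)."""
    ranks = [word_index[w] for w in text.lower().split() if w in word_index]
    if num_words is None:
        return ranks
    return [rank for rank in ranks if rank < num_words]

print(encode('I love my dog', word_index, num_words=100))  # [3, 2, 1, 4]
print(encode('I love my dog', word_index, num_words=3))    # [2, 1]
```

With a cap of 100 nothing is lost on this tiny vocabulary; with a cap of 3 only the two most frequent words remain.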

4. The ***fit_on_texts*** method takes the sentences, splits them into words, and builds the vocabulary from them

In [0]:
tokenizer.fit_on_texts(sentence)

5. The tokenizer provides a ***word_index*** property, which returns a dictionary of word/index pairs. Note that the I in the sentence list was capitalized, but in the output it is lowercase. Thus the tokenizer also ***lowercases the text***

In [0]:
word_index = tokenizer.word_index
print (word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
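
To make the numbering less mysterious, here is a minimal sketch of how such an index could be built by hand: lowercase the text, count the words, and rank them by frequency, with ties keeping first-seen order. The helper `build_word_index` is a hypothetical name for this example; it imitates the output above and is not the library's implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank lowercased words by frequency; rank 1 is the most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

index = build_word_index(['I love my dog', 'I love my cat'])
print(index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
```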


6. Let's take another example to verify that the tokenizer ***strips the punctuation***

In [0]:
sentence = [
            'I love my dog',
            'I love my cat',
            'You love my dog!'
            ]

In [0]:
tokenizer.fit_on_texts(sentence)

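7. In the output, ***dog!*** is not treated as a different word from ***dog***; the only new entry is ***you***. The indices have also changed: the same tokenizer was fitted a second time, so the word counts accumulate across calls, and ***love*** and ***my*** are now the most frequent words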

In [0]:
word_index = tokenizer.word_index
print (word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
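
The two effects seen here, punctuation being dropped and counts carrying over between fits, can be sketched in plain Python. `FILTERS` is an assumed character set meant to approximate the tokenizer's default filter (note it leaves the apostrophe alone), and `CountingIndexer` is a hypothetical helper written for this example only.

```python
from collections import Counter

# Assumed filter set: common punctuation plus tab/newline, but no apostrophe.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text):
    """Lowercase the text, turn filtered characters into spaces, then split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

class CountingIndexer:
    """Hypothetical helper whose word counts persist across fit() calls."""

    def __init__(self):
        self.counts = Counter()

    def fit(self, texts):
        for text in texts:
            self.counts.update(to_words(text))

    def word_index(self):
        ranked = sorted(self.counts.items(), key=lambda item: item[1], reverse=True)
        return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

indexer = CountingIndexer()
indexer.fit(['I love my dog', 'I love my cat'])
indexer.fit(['I love my dog', 'I love my cat', 'You love my dog!'])
print(to_words('You love my dog!'))  # ['you', 'love', 'my', 'dog']
print(indexer.word_index())
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Creating a fresh indexer before the second fit would instead rank the words from the three sentences alone.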


## Text to sequence
Here we will create a list of sequences: the sentences encoded with the tokens that we generated.

In [0]:
sentence = [
            'I love my dog',
            'I love my cat',
            'You love my dog!',
            'She really loves my dog!!'
            ]

Notice that the list above has a sentence that is longer than the others.

In [0]:
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
print (word_index)

{'my': 1, 'love': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6, 'she': 7, 'really': 8, 'loves': 9}


Now we will create the list of sequences. Each sequence holds the numbers corresponding to the words of one sentence. For example, the first one, [3, 2, 1, 4], is 'I love my dog', and so on.

In [0]:
sequences = tokenizer.texts_to_sequences(sentence)
print (sequences)

[[3, 2, 1, 4], [3, 2, 1, 5], [6, 2, 1, 4], [7, 8, 9, 1, 4]]


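***The disadvantage is that if you pass sentences containing words the fitted tokenizer has never seen, those words are silently dropped and the sequence is built only from the tokens it knows, which gives an incorrect sequence. Below, 'does' and 'not' are missing from the second sequence.***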

In [0]:
test_sentences = [
            'I love my dog',
            'She really does not love my dog!!'
            ]
sequences = tokenizer.texts_to_sequences(test_sentences)
print (sequences)

[[3, 2, 1, 4], [7, 8, 2, 1, 4]]
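
To see why the dropped words matter, the sketch below encodes the test sentence with a plain dictionary lookup that skips unknown words, then decodes the result back to text. The `word_index` is copied from the output printed earlier; `encode` and `decode` are hypothetical helpers for this illustration, not tokenizer methods.

```python
# Word index copied from the output printed earlier.
word_index = {'my': 1, 'love': 2, 'i': 3, 'dog': 4, 'cat': 5,
              'you': 6, 'she': 7, 'really': 8, 'loves': 9}
reverse_index = {idx: word for word, idx in word_index.items()}

def encode(text):
    """Map known words to their indices; unknown words are skipped."""
    words = text.lower().replace('!', ' ').split()
    return [word_index[w] for w in words if w in word_index]

def decode(sequence):
    """Turn a list of indices back into a space-separated string."""
    return ' '.join(reverse_index[idx] for idx in sequence)

encoded = encode('She really does not love my dog!!')
print(encoded)          # [7, 8, 2, 1, 4]
print(decode(encoded))  # she really love my dog
```

The decoded sentence says the opposite of the original, and nothing in the sequence signals that two words went missing.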


***To overcome this disadvantage we will use the OOV feature of the tokenizer. OOV, or out of vocabulary, replaces unseen words with the OOV string. The OOV string should be unique and shouldn't collide with a real word in the vocabulary, like '\<OOV\>'.***

In [0]:
tokenizer1 = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer1.fit_on_texts(sentence)  # Train the tokenizer on the sentences in the list sentence
word_index = tokenizer1.word_index
print (word_index)

sequences = tokenizer1.texts_to_sequences(test_sentences)  # Encode the sentences in the list test_sentences
print (sequences)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'cat': 6, 'you': 7, 'she': 8, 'really': 9, 'loves': 10}
[[5, 3, 2, 4], [8, 9, 1, 1, 3, 2, 4]]
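
The OOV substitution amounts to a dictionary lookup with a fallback value. A minimal sketch, assuming the word index printed above and a hypothetical helper `to_sequence_with_oov`:

```python
OOV = '<OOV>'
# Word index copied from the output above; the OOV token holds index 1.
word_index = {OOV: 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'cat': 6,
              'you': 7, 'she': 8, 'really': 9, 'loves': 10}

def to_sequence_with_oov(text, word_index, oov_token=OOV):
    """Encode every word; unseen words fall back to the OOV index."""
    words = text.lower().replace('!', ' ').split()
    oov_id = word_index[oov_token]
    return [word_index.get(w, oov_id) for w in words]

print(to_sequence_with_oov('She really does not love my dog!!', word_index))
# [8, 9, 1, 1, 3, 2, 4]
```

The sequence now keeps its original length, and the two 1s mark where unknown words stood.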


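***Padding*** -> The sequences have different lengths; by default ***pad_sequences*** adds zeros at the beginning so that every sequence is as long as the longest one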

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
padded = pad_sequences(sequences)
print (padded)

[[0 0 0 5 3 2 4]
 [8 9 1 1 3 2 4]]


***Padding at the end of sentences***

In [0]:
padded1 = pad_sequences(sequences, padding = 'post')
print (padded1)

[[5 3 2 4 0 0 0]
 [8 9 1 1 3 2 4]]
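
Both padding modes can be mimicked with plain lists. The `pad` function below is a hypothetical stand-in that returns nested lists rather than an array, and it assumes zeros as the fill value, as in the outputs above.

```python
def pad(sequences, padding='pre', value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (width - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

sequences = [[5, 3, 2, 4], [8, 9, 1, 1, 3, 2, 4]]
print(pad(sequences))                  # [[0, 0, 0, 5, 3, 2, 4], [8, 9, 1, 1, 3, 2, 4]]
print(pad(sequences, padding='post'))  # [[5, 3, 2, 4, 0, 0, 0], [8, 9, 1, 1, 3, 2, 4]]
```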


***Padding with maxlen = 5*** -> Removes words from the beginning of the sentence (the default truncation)

In [0]:
padded2 = pad_sequences(sequences, padding = 'post', maxlen=5)
print (padded2)

[[5 3 2 4 0]
 [1 1 3 2 4]]


***Padding with maxlen = 5 and truncating = 'post'*** -> Removes words from the end of the sentence

In [0]:
padded3 = pad_sequences(sequences, padding = 'post', maxlen=5, truncating='post')
print (padded3)

[[5 3 2 4 0]
 [8 9 1 1 3]]
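
The difference between the two truncation modes comes down to which end of the list is sliced off. A minimal sketch with a hypothetical `truncate` helper:

```python
def truncate(seq, maxlen, truncating='pre'):
    """Cut a sequence down to maxlen items from the chosen end."""
    if len(seq) <= maxlen:
        return list(seq)
    # 'pre' drops items from the front, 'post' drops them from the back.
    return seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]

long_seq = [8, 9, 1, 1, 3, 2, 4]
print(truncate(long_seq, 5))                     # [1, 1, 3, 2, 4]
print(truncate(long_seq, 5, truncating='post'))  # [8, 9, 1, 1, 3]
```

Which end to cut depends on where the important words tend to sit in the sentences being encoded.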
