<a href="https://colab.research.google.com/github/FadouaKhm/NOTES--DeepLearning.AI-TensorFlow-Developer/blob/master/Natural_Language_Processing_in_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


##### Copyright 2019 The TensorFlow Authors.


# Week 1 - Lesson 1

TensorFlow and Keras give us a number of ways to encode words, but the one I'm going to focus on is the Tokenizer. It handles the heavy lifting for us: it builds the dictionary of word encodings and turns the sentences into vectors (sequences of integer codes).

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Builds a word index (a dict with words as keys and integer codes as values).
# It also strips punctuation and lowercases the text.
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
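
To make the numbering above less magical, here is a rough pure-Python sketch of how such a word index can be built: lowercase the text, replace punctuation with spaces, count the words, and give the most frequent word index 1. The helper name `build_word_index` is hypothetical, and this is only an approximation of what the Tokenizer does, not its actual implementation.

```python
from collections import Counter
import string

def build_word_index(texts, oov_token=None):
    """Rank words by frequency; index 1 goes to the OOV token if given, else to the most frequent word."""
    # Replace punctuation with spaces before splitting (apostrophes are kept, an assumption of this sketch).
    strip_punct = str.maketrans({ch: " " for ch in string.punctuation if ch != "'"})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    vocab = ([oov_token] if oov_token else []) + ranked
    return {word: index for index, word in enumerate(vocab, start=1)}

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
sketch_index = build_word_index(sentences)
print(sketch_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

'love' and 'my' each appear three times, so they get the lowest indices; 'I,' and 'i' collapse into the same entry once punctuation and case are removed.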


**Encoding a list of sentences**

Issues:
1. Handling unseen words
2. Handling sentences of different lengths


In [6]:
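# 1. Handling unseen words
# Define an out-of-vocabulary (OOV) token for unseen words when creating the Tokenizer.
# Every word below is seen during fitting, so the OOV index (1) does not show up in these sequences yet.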
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'Emily loves her dog'
]
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

{'<oov>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'cat': 6, 'you': 7, 'emily': 8, 'loves': 9, 'her': 10}
[[5, 2, 3, 4], [5, 2, 3, 6], [7, 2, 3, 4], [8, 9, 10, 4]]
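
In the cell above every word was part of the fitting data, so index 1 never appears. The following pure-Python sketch shows what the OOV token is for: any word missing from the index is mapped to the OOV index. The helper `to_sequence` and the name `demo_index` are hypothetical, the input is assumed to be free of punctuation, and the `num_words` handling reflects my understanding of the Tokenizer (indices at or above `num_words` are treated as out of vocabulary), so take it as an illustration rather than the library's code.

```python
def to_sequence(text, word_index, oov_token="<oov>", num_words=None):
    """Map each word to its index, falling back to the OOV index when needed."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        # Unknown words, and words ranked at or beyond num_words, become OOV.
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence

# Word index copied from the output above.
demo_index = {'<oov>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'cat': 6,
              'you': 7, 'emily': 8, 'loves': 9, 'her': 10}

print(to_sequence('my cat loves pizza', demo_index))
# [3, 6, 9, 1]  ('pizza' was never seen)
print(to_sequence('emily loves her dog', demo_index, num_words=9))
# [8, 1, 1, 4]  ('loves' and 'her' have indices >= 9)
```

Keeping a placeholder for unseen words preserves the length and word positions of the sentence, which would be lost if unknown words were simply dropped.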


In [10]:
# 2. Handling sentences of different lengths
# Pad the sequences so they all have the same length
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'Emily loves her dog Rex',
    'My dinner is yummier than yours lol!'
]
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)
print("Word encoding : ", word_index)
print(" Original encoded sequences : ", sequences)
print(" Padded sequences : ", padded)

Word encoding :  {'<oov>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'cat': 6, 'you': 7, 'emily': 8, 'loves': 9, 'her': 10, 'rex': 11, 'dinner': 12, 'is': 13, 'yummier': 14, 'than': 15, 'yours': 16, 'lol': 17}
 Original encoded sequences :  [[5, 3, 2, 4], [5, 3, 2, 6], [7, 3, 2, 4], [8, 9, 10, 4, 11], [2, 12, 13, 14, 15, 16, 17]]
 Padded sequences :  [[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  6]
 [ 0  0  0  7  3  2  4]
 [ 0  0  8  9 10  4 11]
 [ 2 12 13 14 15 16 17]]


**Parameters to play with**


*   padding = 'post' (default is 'pre')
*   maxlen = 5 (default is length of longest sentence)
*   If maxlen is set, choose which end to cut when a sentence is longer than maxlen: truncating = 'post' (default is 'pre')
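
To see what these three options do, here is a small pure-Python sketch of the padding logic. The function name `pad_sequences_sketch` is hypothetical and it returns plain lists instead of a NumPy array, so treat it as an illustration of the behaviour rather than the Keras implementation.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Truncate and pad lists of integers to a common length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' drops leading tokens, 'post' drops trailing ones.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: 'pre' puts the zeros in front, 'post' puts them at the end.
        padded.append(seq + fill if padding == 'post' else fill + seq)
    return padded

demo_sequences = [[5, 3, 2, 4], [8, 9, 10, 4, 11], [2, 12, 13, 14, 15, 16, 17]]

print(pad_sequences_sketch(demo_sequences, maxlen=5))
# [[0, 5, 3, 2, 4], [8, 9, 10, 4, 11], [13, 14, 15, 16, 17]]
print(pad_sequences_sketch(demo_sequences, maxlen=5, padding='post', truncating='post'))
# [[5, 3, 2, 4, 0], [8, 9, 10, 4, 11], [2, 12, 13, 14, 15]]
```

With the real function, the second call would correspond to pad_sequences(sequences, maxlen=5, padding='post', truncating='post').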





In [11]:
# Try sentences containing words the tokenizer was not fit on; they map to the OOV index (1)
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)


Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 9, 2, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 9 2 1]]
