# **Udacity: Intro to TensorFlow for Deep Learning**
### **Lesson 9 NLP: Tokenization and Embeddings**

**Introduction to NLP**   
This lesson was an introduction to natural language processing (NLP), which involves analysing the meaning of text and speech data. Applications of NLP include:
- Dictation and translation
- Sentiment analysis
- Text and speech generation

We have also seen NLP applied in commercial products like voice assistants and other smart devices.

<br/>

This lesson focuses on how we can prepare text data for our NLP models.

## **Preparing Text for Natural Language Models**

To prepare text for NLP models we need to
- **Tokenize the text**, which assigns a numerical value to each word in the training text.
- **Convert sentences into sequences**. We typically work with sentences rather than individual words in NLP, so after tokenization each sentence becomes a sequence of tokens.
- Apply **padding and truncating**. Sentences of different lengths produce sequences of different lengths, so we pad or truncate them until all sequences have equal length.

<br/>

Other things and parameters to account for   
- The tokenizer is fitted only once, on the training set. Any new word encountered in the test set is represented by an **out-of-vocabulary (OOV) token**.
- We can define the **number of words which get tokenized**.
- We can vary the **length of our sequences**, using padding and truncating.
- We can define **where padding and truncating are applied**, at the start or end of sequences.

## **Import required packages**

In [1]:
import tensorflow as tf
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(tf.__version__)

2.8.2


## **Tokenization**

Let's use text from Green Eggs and Ham by Dr. Seuss.



In [2]:
# define a list of sentences
Green_eggs_and_ham = ["I AM SAM. I AM SAM. SAM I AM.",
                      "THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!",
                      "DO WOULD YOU LIKE GREEN EGGS AND HAM?",
                      "I DO NOT LIKE THEM,SAM-I-AM.",
                      "I DO NOT LIKE GREEN EGGS AND HAM.",
                      "WOULD YOU LIKE THEM HERE OR THERE?",
                      "I WOULD NOT LIKE THEM HERE OR THERE.",
                      "I WOULD NOT LIKE THEM ANYWHERE.",
                      "I DO NOT LIKE GREEN EGGS AND HAM.",
                      "I DO NOT LIKE THEM, SAM-I-AM.",

                      "WOULD YOU LIKE THEM IN A HOUSE?",
                      "WOULD YOU LIKE THEN WITH A MOUSE?",

                      "I DO NOT LIKE THEM IN A HOUSE.",
                      "I DO NOT LIKE THEM WITH A MOUSE.",
                      "I DO NOT LIKE THEM HERE OR THERE.",
                      "I DO NOT LIKE THEM ANYWHERE.",
                      "I DO NOT LIKE GREEN EGGS AND HAM.",
                      "I DO NOT LIKE THEM, SAM-I-AM.",

                      "WOULD YOU EAT THEM IN A BOX?",
                      "WOULD YOU EAT THEM WITH A FOX?",

                      "NOT IN A BOX. NOT WITH A FOX.",
                      "NOT IN A HOUSE. NOT WITH A MOUSE.",
                      "I WOULD NOT EAT THEM HERE OR THERE.",
                      "I WOULD NOT EAT THEM ANYWHERE.",
                      "I WOULD NOT EAT GREEN EGGS AND HAM.",
                      "I DO NOT LIKE THEM, SAM-I-AM.",

                      "WOULD YOU? COULD YOU? IN A CAR?",
                      "EAT THEM! EAT THEM! HERE THEY ARE.",

                      "I WOULD NOT, COULD NOT, IN A CAR.",

                      "YOU MAY LIKE THEM. YOU WILL SEE.",
                      "YOU MAY LIKE THEM IN A TREE!",

                      "I WOULD NOT, COULD NOT IN A TREE.",
                      "NOT IN A CAR! YOU LET ME BE.",
                      "I DO NOT LIKE THEM IN A BOX.",
                      "I DO NOT LIKE THEM WITH A FOX.",
                      "I DO NOT LIKE THEM IN A HOUSE.",
                      "I DO NOT LIKE THEM WITH A MOUSE.",
                      "I DO NOT LIKE THEM HERE OR THERE.",
                      "I DO NOT LIKE THEM ANYWHERE.",
                      "I DO NOT LIKE GREEN EGGS AND HAM.",
                      "I DO NOT LIKE THEM, SAM-I-AM.",

                      "A TRAIN! A TRAIN! A TRAIN! A TRAIN!",
                      "COULD YOU, WOULD YOU ON A TRAIN?",

                      "NOT ON TRAIN! NOT IN A TREE!",
                      "NOT IN A CAR! SAM! LET ME BE!",
                      "I WOULD NOT, COULD NOT, IN A BOX.",
                      "I WOULD NOT, COULD NOT, WITH A FOX.",
                      "I WILL NOT EAT THEM IN A HOUSE.",
                      "I WILL NOT EAT THEM HERE OR THERE.",
                      "I WILL NOT EAT THEM ANYWHERE.",
                      "I DO NOT EAT GREEM EGGS AND HAM.",
                      "I DO NOT LIKE THEM, SAM-I-AM.",
]

**Defining and fitting the tokenizer and word index**

In [3]:
# initialize the tokenizer; num_words=100 limits text-to-sequence conversion to the
# 99 most common words, and words outside the vocabulary map to "<OOV>"
green_eggs_and_ham_tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# fit the tokenizer to the list of sentences
green_eggs_and_ham_tokenizer.fit_on_texts(Green_eggs_and_ham)


View the word index for the tokenizer

In [4]:
word_index = green_eggs_and_ham_tokenizer.word_index
print(word_index)
print("Number of tokenized words: {}".format(len(word_index)))

{'<OOV>': 1, 'i': 2, 'not': 3, 'them': 4, 'like': 5, 'a': 6, 'do': 7, 'would': 8, 'in': 9, 'you': 10, 'sam': 11, 'am': 12, 'eat': 13, 'with': 14, 'eggs': 15, 'and': 16, 'ham': 17, 'here': 18, 'green': 19, 'or': 20, 'there': 21, 'could': 22, 'train': 23, 'anywhere': 24, 'house': 25, 'mouse': 26, 'box': 27, 'fox': 28, 'car': 29, 'will': 30, 'that': 31, 'tree': 32, 'may': 33, 'let': 34, 'me': 35, 'be': 36, 'on': 37, 'then': 38, 'they': 39, 'are': 40, 'see': 41, 'greem': 42}
Number of tokenized words: 42


An interesting note,
- it hasn't picked up any of the exclamation marks, commas or periods in the sentences, and it has lowercased every word.
- the word index has 42 entries and not 100. The text only contains 41 distinct words (plus the `<OOV>` token), and `word_index` always lists every word seen during fitting; `num_words` is only applied later, when text is converted to sequences.


<br/>

Some questions to raise
- How often does a word have to come up to be added to the word index? There is no minimum count: the tokenizer counts the frequency of every word, sorts by frequency, and assigns indices in that order, so the most frequent word gets the lowest index.
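
One way to picture how a word index can be built is to count every word, sort by frequency, and number the words from 1. The sketch below does that in plain Python. It is a simplified illustration and not the Keras implementation; `split_words`, `build_word_index` and the set of filtered characters are assumptions made for this example.

```python
from collections import Counter

# Characters replaced by spaces before splitting (an assumed, simplified filter set).
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def split_words(text):
    """Lowercase the text, turn punctuation into spaces, split on whitespace."""
    return text.lower().translate(_TO_SPACE).split()


def build_word_index(texts, oov_token=None):
    """Count every word, sort by frequency, and number the words from 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    # The OOV token, when given, takes the first index.
    vocab = [oov_token] if oov_token is not None else []
    # most_common() sorts by count; ties keep their first-seen order.
    vocab += [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(vocab, start=1)}


toy_texts = ["I AM SAM. I AM SAM. SAM I AM.", "I DO NOT LIKE THEM, SAM-I-AM."]
toy_index = build_word_index(toy_texts, oov_token="<OOV>")
print(toy_index)
```

Every word ends up in the index, however rare it is; a cap on vocabulary size only has to be enforced when the index is used.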


**Convert text to sequences**

In [5]:
# get a single sentence from the list of sentences
sample_sentence = Green_eggs_and_ham[0]
print(sample_sentence)


I AM SAM. I AM SAM. SAM I AM.


In [6]:
# convert the sentence into a sequence
green_eggs_and_ham_tokenizer.texts_to_sequences([sample_sentence])

[[2, 12, 11, 2, 12, 11, 11, 2, 12]]

In [7]:
# get more sentences from the list and convert them into sequences

sample_sentences = Green_eggs_and_ham[0:10]
sample_sequences = green_eggs_and_ham_tokenizer.texts_to_sequences(sample_sentences)

for text, sequence in zip(sample_sentences, sample_sequences):
  print("Text: {}".format(text))
  print("Sequence: {}".format(sequence))
  print("\n")


Text: I AM SAM. I AM SAM. SAM I AM.
Sequence: [2, 12, 11, 2, 12, 11, 11, 2, 12]


Text: THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!
Sequence: [31, 11, 2, 12, 31, 11, 2, 12, 2, 7, 3, 5, 31, 11, 2, 12]


Text: DO WOULD YOU LIKE GREEN EGGS AND HAM?
Sequence: [7, 8, 10, 5, 19, 15, 16, 17]


Text: I DO NOT LIKE THEM,SAM-I-AM.
Sequence: [2, 7, 3, 5, 4, 11, 2, 12]


Text: I DO NOT LIKE GREEN EGGS AND HAM.
Sequence: [2, 7, 3, 5, 19, 15, 16, 17]


Text: WOULD YOU LIKE THEM HERE OR THERE?
Sequence: [8, 10, 5, 4, 18, 20, 21]


Text: I WOULD NOT LIKE THEM HERE OR THERE.
Sequence: [2, 8, 3, 5, 4, 18, 20, 21]


Text: I WOULD NOT LIKE THEM ANYWHERE.
Sequence: [2, 8, 3, 5, 4, 24]


Text: I DO NOT LIKE GREEN EGGS AND HAM.
Sequence: [2, 7, 3, 5, 19, 15, 16, 17]


Text: I DO NOT LIKE THEM, SAM-I-AM.
Sequence: [2, 7, 3, 5, 4, 11, 2, 12]




Side note
- The texts_to_sequences function takes in a list of texts, so we cannot just pass in a raw string. A raw string would be iterated over character by character.
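
To make the roles of the word index, `num_words` and the OOV token concrete, here is a plain-Python sketch of the lookup step. It is a simplified stand-in for illustration only: the function below is hypothetical, it is not the Keras method of the same name, and the word splitting is an assumed regular expression.

```python
import re


def texts_to_sequences(texts, word_index, num_words=None, oov_token="<OOV>"):
    """Map each text to a list of word indices (simplified sketch)."""
    oov_index = word_index.get(oov_token)
    sequences = []
    for text in texts:
        sequence = []
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index = word_index.get(word)
            # Unknown words, and words ranked at or beyond num_words, become OOV.
            if index is None or (num_words is not None and index >= num_words):
                index = oov_index
            # Without an OOV token such words are simply dropped.
            if index is not None:
                sequence.append(index)
        sequences.append(sequence)
    return sequences


toy_word_index = {"<OOV>": 1, "i": 2, "not": 3, "them": 4, "like": 5, "do": 6}
print(texts_to_sequences(["I do not like spam"], toy_word_index))
print(texts_to_sequences(["I do not like spam"], toy_word_index, num_words=4))
```

In the first call only "spam" falls back to the OOV index; in the second, every word with an index of 4 or more does as well.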

**Closer look at the Tokenizer Docs**

[Tokenizer Doc](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)

<br/>

**tf.keras.preprocessing is deprecated**   
It looks like the keras.preprocessing module is deprecated as of version 2.9.1; it is now suggested to use 
*tf.keras.utils.text_dataset_from_directory* and *tf.keras.layers.TextVectorization* for preprocessing text input.

For the sake of continuity with the course, I will follow along using the deprecated code, but I will still check out the new methods for preprocessing text in another notebook.

<br/>

**Other methods on the Tokenizer**
- fit_on_sequences / fit_on_texts
- sequences_to_matrix / sequences_to_texts and sequences_to_texts_generator
- texts_to_matrix / texts_to_sequences and texts_to_sequences_generator
- get_config
- to_json

<br/>

Without going into too much detail on each function, it seems that we can not only convert text into sequences but also convert it into a matrix, and we can also fit the tokenizer on sequences.


## **Padding and Truncating**

Side note again
- Looking at the documentation for version 2.9.1, pad_sequences has been moved to *tf.keras.utils.pad_sequences*

<br/>

Default behaviours
- Padding and truncating are both applied at the start of the sequences
- The sequence length is set by the longest sequence in the list, unless `maxlen` is passed to pad_sequences
- The padding value is 0

<br/>

I will still try pad_sequences as imported above from *tf.keras.preprocessing.sequence*.





In [8]:
# let's print sample_sequences
for sequence in sample_sequences:
    print(sequence)

[2, 12, 11, 2, 12, 11, 11, 2, 12]
[31, 11, 2, 12, 31, 11, 2, 12, 2, 7, 3, 5, 31, 11, 2, 12]
[7, 8, 10, 5, 19, 15, 16, 17]
[2, 7, 3, 5, 4, 11, 2, 12]
[2, 7, 3, 5, 19, 15, 16, 17]
[8, 10, 5, 4, 18, 20, 21]
[2, 8, 3, 5, 4, 18, 20, 21]
[2, 8, 3, 5, 4, 24]
[2, 7, 3, 5, 19, 15, 16, 17]
[2, 7, 3, 5, 4, 11, 2, 12]


In [9]:
# pad or truncate every sequence to a length of 10
padded_sample_sequences = pad_sequences(sample_sequences, maxlen=10,
                                        padding='pre', truncating='post')

for sequence in padded_sample_sequences:
  print(sequence)

[ 0  2 12 11  2 12 11 11  2 12]
[31 11  2 12 31 11  2 12  2  7]
[ 0  0  7  8 10  5 19 15 16 17]
[ 0  0  2  7  3  5  4 11  2 12]
[ 0  0  2  7  3  5 19 15 16 17]
[ 0  0  0  8 10  5  4 18 20 21]
[ 0  0  2  8  3  5  4 18 20 21]
[ 0  0  0  0  2  8  3  5  4 24]
[ 0  0  2  7  3  5 19 15 16 17]
[ 0  0  2  7  3  5  4 11  2 12]


That seems straightforward. I've applied pre-padding, so zeros are added at the start of each short sequence, and post-truncating, so sequences longer than 10 tokens are cut from the end.
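
The effect of the `padding` and `truncating` options can be mimicked for a single sequence in plain Python. This is a minimal sketch for intuition, not the library code; `pad_sequence` is a hypothetical helper and it assumes `maxlen` is at least 1.

```python
def pad_sequence(sequence, maxlen, padding="pre", truncating="pre", value=0):
    """Return a copy of `sequence` with exactly `maxlen` items (maxlen >= 1)."""
    # Truncate first: 'pre' drops items from the start, 'post' from the end.
    if truncating == "pre":
        trimmed = sequence[-maxlen:]
    else:
        trimmed = sequence[:maxlen]
    # Then fill whatever space is left with the padding value.
    filler = [value] * (maxlen - len(trimmed))
    return filler + trimmed if padding == "pre" else trimmed + filler


short_sequence = [2, 8, 3, 5, 4, 24]
long_sequence = [31, 11, 2, 12, 31, 11, 2, 12, 2, 7, 3, 5, 31, 11, 2, 12]

print(pad_sequence(short_sequence, 10, padding="pre", truncating="post"))
print(pad_sequence(long_sequence, 10, padding="pre", truncating="post"))
print(pad_sequence(long_sequence, 10))  # both defaults are 'pre'
```

With `truncating="post"` the long sequence keeps its first 10 tokens, while the default `"pre"` keeps its last 10.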

## **Word Embeddings**

Word embeddings represent individual words in our tokenized vocabulary as vectors in an n-dimensional space.

<br/>

What does this mean?   
- A word is represented as a vector of n dimensions
- For example, with 5 dimensions, a word like "Ham" might be represented as [0, 4, 2, 9, 10]


<br/>

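My issue with this: while I would not pretend to fully understand embeddings, does this not make the whole process of tokenization, sequencing and padding obsolete? If we simply end up with words converted into vectors of n dimensions, why go through the previous steps? As far as I can tell, the answer is that the Embedding layer takes integer token ids as input and looks up one vector per id, so it still needs the tokenized, padded sequences.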




Link to the [Embeddings Doc](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)

In [10]:
# import the Embedding layer
from tensorflow.keras.layers import Embedding



In [11]:
# define an Embedding layer, which maps each token id (0 to 99) to a 10-dimensional vector.

green_eggs_and_ham_embedding_layer = Embedding(input_dim = 100, output_dim = 10, input_length=10)

Let's try out the embedding layer on one of the sequences and view the output.

In [12]:
one_padded_sequence = padded_sample_sequences[0]
print(one_padded_sequence)

output = green_eggs_and_ham_embedding_layer(one_padded_sequence)
print(output)

[ 0  2 12 11  2 12 11 11  2 12]
tf.Tensor(
[[ 2.7152192e-02  1.5747037e-02 -1.1287771e-02 -4.7525443e-02
  -1.1992801e-02  1.8999342e-02 -4.1127075e-02 -2.0254672e-02
   1.1161648e-02 -3.4348868e-02]
 [-3.6322713e-02 -5.0235540e-05  3.4889225e-02  3.1428341e-02
  -3.1471837e-02  2.7836237e-02 -4.8321035e-02  2.6914254e-03
  -2.5630344e-02 -3.5016216e-02]
 [ 4.3754950e-03 -2.6388740e-02  2.6722047e-02 -4.2404056e-02
  -4.9804937e-02  1.6483966e-02 -7.5172409e-03  2.3057345e-02
  -7.1607828e-03  8.6241364e-03]
 [ 3.3015240e-02  4.0695835e-02  2.2828404e-02 -6.9152191e-04
  -2.7136970e-02 -4.9355485e-02 -2.2118116e-02  1.2197804e-02
   4.8632573e-02 -1.1537384e-02]
 [-3.6322713e-02 -5.0235540e-05  3.4889225e-02  3.1428341e-02
  -3.1471837e-02  2.7836237e-02 -4.8321035e-02  2.6914254e-03
  -2.5630344e-02 -3.5016216e-02]
 [ 4.3754950e-03 -2.6388740e-02  2.6722047e-02 -4.2404056e-02
  -4.9804937e-02  1.6483966e-02 -7.5172409e-03  2.3057345e-02
  -7.1607828e-03  8.6241364e-03]
 [ 3.3015240e-0

It looks like it has done exactly what we want. For each word (token) in the sequence, it has created a vector with 10 dimensions.

Are the vector representations of each token consistent?   
- Yes, they are. Compare tokens 2, 12 and 11: each token maps to the same vector every time it appears.

**Using the embedding layer with Flatten and GlobalAveragePooling1D**
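
The consistency follows from how an embedding works: it is a lookup table with one row per token id. The plain-Python sketch below shows the idea. The helper names are hypothetical, and the random initial range is an assumption chosen to resemble the small values printed above; in a real model the rows are learned during training.

```python
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    """Build a table with one randomly initialised vector per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(sequence, table):
    """Replace every token id with its row of the table."""
    return [table[token] for token in sequence]


toy_table = make_embedding_table(input_dim=100, output_dim=10)
toy_sequence = [0, 2, 12, 11, 2, 12, 11, 11, 2, 12]
toy_vectors = embed(toy_sequence, toy_table)

print(len(toy_vectors), len(toy_vectors[0]))  # 10 tokens, 10 dimensions each
print(toy_vectors[1] == toy_vectors[4])       # token 2 appears at both positions
```

Because the layer only indexes rows by token id, it needs the integer sequences produced by the tokenizer as its input.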

In [13]:
from tensorflow.keras.layers import Flatten, GlobalAveragePooling1D


In [14]:
def embeddings_with_flatten(sequence):
  """ Output result of calling embedding with flatten layer."""
  embedding_output = Embedding(input_dim=100, output_dim=10, input_length=10)(sequence)
  return Flatten()(embedding_output)


In [15]:
# calling embeddings with flatten
output_1 = embeddings_with_flatten(one_padded_sequence)
print(output_1)


tf.Tensor(
[[-0.0354768   0.03246372 -0.01754789 -0.04727766  0.02521421  0.03086747
  -0.04990344 -0.0017138   0.03231838  0.02451508]
 [ 0.01058905 -0.01629823  0.02109941 -0.01163485 -0.01796778  0.04683198
  -0.00786849 -0.02776681 -0.00875662 -0.016039  ]
 [ 0.0208096  -0.04613874  0.00910821 -0.03084538  0.04396031 -0.03314225
   0.0214645  -0.03278995 -0.00110868  0.01102009]
 [-0.01283943  0.00863726  0.02854401  0.03662533  0.04537231 -0.04758253
  -0.03559171 -0.01177174  0.04042183  0.04093733]
 [ 0.01058905 -0.01629823  0.02109941 -0.01163485 -0.01796778  0.04683198
  -0.00786849 -0.02776681 -0.00875662 -0.016039  ]
 [ 0.0208096  -0.04613874  0.00910821 -0.03084538  0.04396031 -0.03314225
   0.0214645  -0.03278995 -0.00110868  0.01102009]
 [-0.01283943  0.00863726  0.02854401  0.03662533  0.04537231 -0.04758253
  -0.03559171 -0.01177174  0.04042183  0.04093733]
 [-0.01283943  0.00863726  0.02854401  0.03662533  0.04537231 -0.04758253
  -0.03559171 -0.01177174  0.04042183  0

In [16]:
# define the function
def embeddings_with_globalaveragepool(sequence):
  embedding_output = Embedding(input_dim=100, output_dim=10, input_length=10)(sequence)
  print(embedding_output)
  reshaped_embedding = tf.keras.layers.Reshape((10, 1))(embedding_output)
  print(reshaped_embedding)
  return GlobalAveragePooling1D()(reshaped_embedding)

# calling embeddings with global average pooling
output_2 = embeddings_with_globalaveragepool(one_padded_sequence)
print(output_2)

tf.Tensor(
[[ 0.04871067 -0.042537   -0.03624497  0.03204564  0.04190827  0.0003333
   0.02057016  0.03037613  0.01111329 -0.00014178]
 [-0.03969458  0.01574237 -0.02752073  0.00113036 -0.04066811 -0.04456582
  -0.04061954  0.02596376 -0.01813868  0.02695204]
 [ 0.01085353  0.0105276   0.0446629   0.01968182 -0.00687555  0.00590048
  -0.02397422  0.03255136 -0.03061882 -0.04055713]
 [ 0.00226391 -0.0293314   0.02051053 -0.02150265 -0.01139201 -0.03864883
   0.04020679  0.03042111 -0.00959812 -0.0390706 ]
 [-0.03969458  0.01574237 -0.02752073  0.00113036 -0.04066811 -0.04456582
  -0.04061954  0.02596376 -0.01813868  0.02695204]
 [ 0.01085353  0.0105276   0.0446629   0.01968182 -0.00687555  0.00590048
  -0.02397422  0.03255136 -0.03061882 -0.04055713]
 [ 0.00226391 -0.0293314   0.02051053 -0.02150265 -0.01139201 -0.03864883
   0.04020679  0.03042111 -0.00959812 -0.0390706 ]
 [ 0.00226391 -0.0293314   0.02051053 -0.02150265 -0.01139201 -0.03864883
   0.04020679  0.03042111 -0.00959812 -0.
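
Keras layers expect a leading batch dimension. For one padded sequence of shape (1, 10), the embedding output has shape (1, 10, 10), which Flatten turns into (1, 100) and GlobalAveragePooling1D into (1, 10). As far as I can tell, the cells above pass a 1-D sequence, so the 10 tokens are read as the batch axis, which would explain why the flattened output still shows 10 rows. The plain-Python sketch below mimics what the two layers do to a single embedded sequence; the helper names and toy values are made up for illustration.

```python
def flatten(vectors):
    """Concatenate the token vectors into one long vector."""
    return [value for vector in vectors for value in vector]


def global_average_pool(vectors):
    """Average over the tokens, leaving one value per embedding dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# 3 tokens with 2 embedding dimensions each (toy values).
toy_embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(flatten(toy_embedded))              # 3 * 2 = 6 values
print(global_average_pool(toy_embedded))  # 2 values, one per dimension
```

Flatten keeps every value, so its output grows with the sequence length, while average pooling always returns one value per embedding dimension.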

## **Subwords**

Subwords are common parts of an individual word. For example
- nevertheless: never + the + less 

<br/>

**Cool, but why should we care about subwords?**   
- Certain words (ones that don't appear very often) will not be in the vocabulary, but the subwords within them might appear much more often. If we tokenize the subwords, we can still represent uncommon words that are built from them.


🆒 But....   
Having a vocabulary that contains subwords would not help by itself. We also need to consider the order in which the subwords appear to make sense of the overall meaning.

Example given in the lesson
- Decay: Dec + ay    (Negative sentiment)
- Decent: Dec + ent  (Positive sentiment)

Without considering the next subword in the example above, it would be very difficult to determine whether Dec is linked to positive or negative terms.
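
To see how a word can be broken into known subwords, here is a small greedy longest-match splitter in plain Python. It is only a sketch for intuition: the function name and the toy vocabulary are made up, and it is not the algorithm used by SubwordTextEncoder.

```python
def split_into_subwords(word, subword_vocab):
    """Greedily split `word` into the longest subwords found in the vocabulary."""
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in subword_vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known subword starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


toy_subwords = {"never", "the", "less", "dec", "ay", "ent"}

print(split_into_subwords("nevertheless", toy_subwords))
print(split_into_subwords("decay", toy_subwords))
print(split_into_subwords("decent", toy_subwords))
```

"decay" and "decent" share the piece "dec" and differ only in the piece that follows, which is why the order of subwords matters to the model.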


Hopefully you've been convinced to build a vocabulary containing subwords. 

We can create subwords from the training text using
`tfds.features.text.SubwordTextEncoder`

Note that `tfds.features.text.SubwordTextEncoder` is deprecated in v2.9.1


In [19]:
try:
  import tensorflow_datasets
  from tensorflow_datasets.features.text import SubWordTextEncoder
except ImportError:
  print("Unable to import SubWordTextEncoder")

Unable to import SubWordTextEncoder


Welp....   
It seems the import no longer works with the versions installed here (TensorFlow 2.8). One thing to check: the class is spelled `SubwordTextEncoder`, and the docs linked below list it under `tfds.deprecated.text`.

A cautionary tale with the free course: much of the code and many of the methods may be out of date, so it's always worth checking the docs.

Key point here:

If using a version where SubwordTextEncoder is still available, we can create a tokenizer from it and fit it to the training text. From the fitted tokenizer we can then create tokens that represent subwords and take it from there.


[SubWordTextEncoder](https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/SubwordTextEncoder)