# **Natural Language Processing (NLP)**

### Terminology

##### **Tokenization**

- Tokenization is the process of representing words as numbers so that a computer can process text. The `Tokenizer` class in TensorFlow's Keras API builds a dictionary that maps each word to an integer token. By default it lowercases the text and strips punctuation, and more frequent words receive lower indices.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer


In [14]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

In [15]:
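# num_words caps the vocabulary used when texts are converted to sequences;
# word_index itself still lists every word seen during fitting.
tokenizer = Tokenizer(num_words=100)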
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
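To make the mapping above concrete, here is a minimal pure-Python sketch of how a frequency-ordered word index could be built. It is not Keras's implementation: `build_word_index` is a hypothetical helper, and `string.punctuation` only approximates the tokenizer's default filter (which, for instance, keeps apostrophes).

```python
import string
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first (hypothetical helper)."""
    # Assumption: stripping string.punctuation approximates the default filter.
    strip_punct = str.maketrans('', '', string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() sorts by descending count; ties keep first-seen order.
    # Indices start at 1 so that 0 stays free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
word_index = build_word_index(sentences)
print(word_index)
```

For these four sentences the sketch gives the same mapping as the tokenizer output shown above.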


##### **Sequences**

- A sequence is a list of tokens that represents a sentence or document, with the tokens in the same order as the words.

In [16]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


Because the tokenizer has seen every word in these sentences, each sequence keeps the length and word order of its sentence.

However, sometimes we might have words that the tokenizer doesn't recognize.

For example:

In [17]:
new_sentence = [
    'I really love my dog',
    'My dog loves my manatee'
]
new_sequences = tokenizer.texts_to_sequences(new_sentence)
print(word_index)
print(new_sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [1, 3, 1]]


From these results we can see that the sequences are shorter than the sentences: the words `really`, `loves`, and `manatee`, which the tokenizer never saw during fitting, were dropped. To keep each sequence the same length as its sentence, we set an out-of-vocabulary (`OOV`) token, which stands in for every unseen word.

In [22]:
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

In [23]:
print(tokenizer.word_index)
new_sentences = tokenizer.texts_to_sequences(
    new_sentence
)

print(new_sentences)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
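The difference between dropping unseen words and replacing them can be sketched in plain Python. This is an illustration only, not how Keras implements `texts_to_sequences`: `to_sequences` is a hypothetical helper, the two dictionaries are copied from the printed word indices, and the punctuation handling is an approximation.

```python
import string


def to_sequences(texts, word_index, oov_token=None):
    """Look up each word; unseen words are dropped or mapped to the OOV index."""
    # Assumption: stripping string.punctuation approximates the default filter.
    strip_punct = str.maketrans('', '', string.punctuation)
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().translate(strip_punct).split():
            idx = word_index.get(word, oov_id)
            if idx is not None:  # no OOV index available: skip the unseen word
                seq.append(idx)
        sequences.append(seq)
    return sequences


plain_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
               'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
oov_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
             'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
new_sentence = ['I really love my dog', 'My dog loves my manatee']

dropped = to_sequences(new_sentence, plain_index)
replaced = to_sequences(new_sentence, oov_index, oov_token='<OOV>')
print(dropped)   # unseen words vanish, so the sequences get shorter
print(replaced)  # unseen words become 1, so the lengths are preserved
```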


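For the model to handle sentences of different lengths, we pad the sequences to a common length. By default, `pad_sequences` adds zeros to the front of each sequence until it matches the longest one.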

In [24]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [27]:
padded = pad_sequences(sequences)
print(padded)

[[ 0  0  0  4  2  1  3]
 [ 0  0  0  4  2  1  6]
 [ 0  0  0  5  2  1  3]
 [ 7  5  8  1  3  9 10]]
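As a rough illustration of what this default padding does, the following pure-Python sketch left-pads each sequence with zeros up to the longest length. `pad_left` is a hypothetical helper written for this note, not part of Keras, and it returns plain lists rather than a NumPy array.

```python
def pad_left(sequences, value=0):
    """Prepend `value` to each sequence until all match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]


sequences = [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
padded_rows = pad_left(sequences)
for row in padded_rows:
    print(row)
```

`pad_sequences` itself also accepts `padding='post'` to append the zeros instead, and `maxlen` to fix the target length.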
