<a href="https://colab.research.google.com/github/EteimZ/Deep_Learning-Notebooks/blob/main/TensorFlow/NLP_Basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Basics with TensorFlow

In this notebook I will go through basic NLP operations such as tokenization and sequencing using TensorFlow.


In [None]:
# Import required Libraries
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Tokenization

Tokenization is the process of taking a sentence or a corpus and mapping each unique word to an integer, so that every occurrence of a word is represented by the same token (number).

Keras has a Tokenizer [class](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) for performing tokenization.

In [None]:
sentences = ['I love my dog', 'I love my cat', 'You love my Dog!', 'Do you think my dog is amazing?']

In [None]:
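# num_words caps the vocabulary by word frequency; oov_token stands in for unseen words
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)  # build the vocabulary from the sentences
word_index = tokenizer.word_index  # dict mapping each word to its integer token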


print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
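The mapping above reflects three things the tokenizer does: it lowercases the text, strips punctuation, and assigns lower integers to more frequent words. The sketch below is a rough pure-Python approximation of that behaviour, not the Keras implementation. The function name `build_word_index` and the `FILTERS` constant are hypothetical, and the punctuation set is assumed to match the Tokenizer's default `filters` argument.

```python
# Hypothetical pure-Python sketch; this is not how Keras is implemented internally.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # assumed default punctuation filter


def build_word_index(texts, oov_token="<OOV>"):
    """Return a word -> integer mapping ordered by descending word frequency."""
    # Replace every filtered character with a space before splitting on whitespace
    table = str.maketrans({ch: " " for ch in FILTERS})
    counts = {}
    for text in texts:
        for word in text.lower().translate(table).split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts, key=counts.get, reverse=True)
    # Index 0 is left unused (it is the padding value); the OOV token takes index 1
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


sentences = ['I love my dog', 'I love my cat', 'You love my Dog!',
             'Do you think my dog is amazing?']
sketch_index = build_word_index(sentences)
print(sketch_index)
```

For these four sentences the sketch reproduces the mapping printed above: `my` appears four times and gets index 2, while `dog` and `Dog!` are counted as the same word.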


## Sequencing

Tokenization maps each word to a particular integer, but we need to convert each sentence into a sequence of integers to perform any useful NLP operation.

Each sequence is then padded to the length of the longest sentence so that all sequences have the same length. With `padding='post'`, zeros are appended to the end of the shorter sequences.


In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post') # pad sequences
print(padded)

[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
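To make the padding step concrete, here is a minimal pure-Python sketch of what post-padding does to the sequences above. The helper name `pad_post` is hypothetical; it only illustrates the idea and returns plain lists rather than the NumPy array that `pad_sequences` produces.

```python
def pad_post(sequences, value=0):
    """Append `value` to each sequence until all match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


# The integer sequences for the four training sentences
sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
padded_rows = pad_post(sequences)
for row in padded_rows:
    print(row)
```

The longest sentence has seven tokens, so each of the three four-token sentences receives three trailing zeros. Zero is safe as a padding value because the tokenizer never assigns index 0 to a word.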


Now let's test our tokenizer on text it has never seen before. Any word it has not seen will be assigned the integer 1, which corresponds to the out-of-vocabulary (OOV) token.

In [None]:
test_data = ['I really like my dog', 'my dog loves my coat']

In [None]:
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[5, 1, 1, 2, 4], [2, 4, 1, 2, 1]]
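The out-of-vocabulary behaviour amounts to a dictionary lookup with a fallback. The sketch below approximates it in pure Python, assuming the `word_index` printed earlier. The function name `texts_to_sequences_sketch` is hypothetical, and the sketch skips punctuation filtering because the test sentences contain none.

```python
# Word index copied from the tokenizer output earlier in the notebook
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def texts_to_sequences_sketch(texts, word_index, oov_token="<OOV>"):
    """Map each word to its integer, falling back to the OOV index when unknown."""
    oov_id = word_index[oov_token]
    # Lowercase and split on whitespace; punctuation handling is omitted here
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]


test_data = ['I really like my dog', 'my dog loves my coat']
sketch_seq = texts_to_sequences_sketch(test_data, word_index)
print(sketch_seq)
```

Note that `loves` maps to 1 even though `love` is in the vocabulary: the lookup is an exact string match, with no stemming.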
