
# **Tokenize Basics (Out of vocabulary problem)**

**Steps:**
1. Encode the words in the sentences as integer tokens (`word_index`)
2. Turn the sentences into lists of values based on these tokens
3. Manipulate these lists (not least to make every sentence the same length)
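
Step 3 is not demonstrated in the cells of this notebook, so here is a minimal pure-Python sketch of what "making every sentence the same length" can look like. The function name `pad_to_same_length` and its parameters are hypothetical, chosen only for illustration; Keras ships its own `pad_sequences` helper for this job, and this sketch does not claim to match its defaults. The value 0 is used for padding on the assumption that real word indices start at 1.

```python
def pad_to_same_length(sequences, pad_value=0, padding="post"):
    """Pad every sequence with pad_value up to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        filler = [pad_value] * (max_len - len(seq))
        # "post" appends the padding, anything else prepends it
        padded.append(seq + filler if padding == "post" else filler + seq)
    return padded


# Hypothetical encoded sentences of different lengths
encoded = [[4, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
print(pad_to_same_length(encoded))
# [[4, 2, 1, 3, 0, 0, 0], [7, 5, 8, 1, 3, 9, 10]]
```

Passing `padding="pre"` puts the zeros in front of the shorter sentences instead of after them.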


### **Import libraries**

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

**Tokenizer API**: 
- It handles the heavy lifting for us by generating the dictionary of word encodings and creating vectors out of the sentences.
- It also removes the punctuation from the sentences.
- It converts all uppercase characters to lowercase.
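
To make these three behaviours concrete, the following is a rough pure-Python sketch of the same idea. It is not the Keras implementation: the helper names `normalize` and `build_word_index` are hypothetical, and stripping everything in `string.punctuation` is an assumption that may differ from the Tokenizer's own default filter characters. Words are numbered from 1, most frequent first.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, and split into words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer, starting at 1 for the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(normalize(text))
    # most_common() sorts by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


print(normalize("You love my dog!"))  # ['you', 'love', 'my', 'dog']
print(build_word_index(["I love my dog", "I love my cat"]))
```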

### **Start building the tokenizer**

In [2]:
# Define input sentences 
sentences = ['I love my dog',
             'I love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?']

In [4]:
# Initialize the tokenizer class
tokenizer = Tokenizer(num_words=100)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word-to-index dictionary
word_index = tokenizer.word_index

# Convert each sentence into a list of word indices
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print()
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


So, the **texts_to_sequences** function converts each sentence into a list of integers according to the word index.

In [5]:
test_data = ['I really love my dog',
             'My dog loves my manatee']

test_seq =  tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


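From the test output above, we can conclude that the encoded sequences do not include the new words. The first sentence is encoded exactly like "I love my dog" because "really" is dropped, and the second sentence keeps only "my dog my" because "loves" and "manatee" are not in the word index.

**So, the tokenizer has ignored the unseen words in the test data.**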
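
A quick way to see exactly which words were lost is to compare each test sentence against the word index. The sketch below is plain Python with a hypothetical helper, `unseen_words`; it copies the `word_index` printed earlier and assumes simple whitespace splitting, which is enough here because the test sentences contain no punctuation.

```python
# Word index copied from the tokenizer output above
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


def unseen_words(text, word_index):
    """Return the words in text that have no entry in word_index."""
    return [word for word in text.lower().split() if word not in word_index]


for sentence in ['I really love my dog', 'My dog loves my manatee']:
    print(sentence, '->', unseen_words(sentence, word_index))
# I really love my dog -> ['really']
# My dog loves my manatee -> ['loves', 'manatee']
```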

## **Adding the `oov_token` parameter to keep unseen words**

In [6]:
# Initialize the tokenizer class with an out-of-vocabulary (OOV) token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word-to-index dictionary
word_index = tokenizer.word_index

# Convert each sentence into a list of word indices
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print()
print(sequences)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


In [7]:
test_seq =  tokenizer.texts_to_sequences(test_data)
print(test_seq)

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


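So here, each unseen word is replaced with the `<OOV>` token (index 1) instead of being removed entirely, which means the encoded sequences keep the length of the original sentences.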
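
The replacement behaviour can be pictured as a dictionary lookup with a fallback. The following pure-Python sketch is only an illustration of that idea, not the Tokenizer's actual code: `encode_with_oov` is a hypothetical helper, the `word_index` is copied from the output above, and splitting on whitespace is an assumption that works because the test sentences have no punctuation.

```python
# Word index copied from the tokenizer output above (with the OOV token at index 1)
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def encode_with_oov(text, word_index, oov_token='<OOV>'):
    """Encode text word by word, falling back to the OOV index for unknown words."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]


print(encode_with_oov('I really love my dog', word_index))     # [5, 1, 3, 2, 4]
print(encode_with_oov('My dog loves my manatee', word_index))  # [2, 4, 1, 2, 1]
```

Both results match the `test_seq` output printed above.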