<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W1/ungraded_labs/C3_W1_Lab_1_tokenize_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Tokenizer Basics

In most Natural Language Processing tasks, the initial step in preparing your data is to extract a vocabulary of words from your *corpus* (i.e. input texts). You will need to define how to convert the texts into numerical representations that can be used to train a neural network. These representations are called *tokens*, and TensorFlow and Keras make it easy to generate them with their APIs. You will see how to do that in the next cells.

## Generating the vocabulary

In this notebook, you will first look at how to build a lookup dictionary that maps each word to an integer. The code below takes a list of sentences, then takes each word in those sentences and assigns it an integer. This is done using the [fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) method and you can get the result by looking at the `word_index` property. More frequent words have a lower index.

## Key points:


*   First, create a `Tokenizer` object and pass it the `num_words` parameter.
*   Then feed the training data to that object through the `fit_on_texts()` method.
*   `fit_on_texts()` assigns a unique integer to every word, so the same word is always represented by the same number.
*   After fitting, the word-to-integer mapping is available through the `word_index` property.
*   The more frequent a word is, the lower its index.
*   By default, the tokenizer ignores punctuation and converts words to lower case.







In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
    ]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the indices and print them
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


The `num_words` parameter used in the initializer caps the vocabulary used when generating sequences: only the `num_words - 1` most frequent words are kept. You will see this in a later exercise. For now, the important thing to note is that it does not affect how the `word_index` dictionary is generated. You can try passing `1` instead of `100`, as shown in the next cell, and you will arrive at the same `word_index`.
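To make the cutoff concrete, here is a minimal pure-Python sketch of the rule, assuming that only words whose index is strictly below `num_words` survive when sequences are generated. The `encode` helper is hypothetical and written only for illustration; it is not the Keras implementation.

```python
# Illustrative sketch only: `encode` is a hypothetical helper, not a Keras API.
def encode(words, word_index, num_words=None):
    """Map words to integers, keeping only indices below num_words."""
    sequence = []
    for word in words:
        index = word_index.get(word)
        if index is None:
            continue  # word was never seen while building the index
        if num_words is not None and index >= num_words:
            continue  # not among the num_words - 1 most frequent words
        sequence.append(index)
    return sequence

# The word_index printed by the first cell
word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
words = ['i', 'love', 'my', 'dog']

full = encode(words, word_index, num_words=100)
top_two = encode(words, word_index, num_words=3)
print(full)     # [1, 2, 3, 4]
print(top_two)  # [1, 2]
```

Note that `word_index` itself is the same in both calls; under this assumption only the generated sequence changes.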

Also notice that by default, all punctuation is ignored and words are converted to lower case. You can override these behaviors by modifying the `filters` and `lower` arguments of the `Tokenizer` class as described [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#arguments). You can try modifying these in the next cell below and compare the output to the one generated above.

In [None]:
# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 1)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the indices and print them
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
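The ordering in this output can be reproduced with a small pure-Python sketch: count the words, then number them from 1 in order of decreasing count, with ties keeping the order of first appearance. The `build_word_index` helper and its regular-expression word splitter are assumptions made for illustration; they only approximate what the `Tokenizer` does.

```python
import re
from collections import Counter

# Illustrative sketch only: `build_word_index` is a hypothetical helper.
def build_word_index(sentences):
    """Number words from 1, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase, then keep runs of letters and apostrophes as words.
        # This regex stands in for the tokenizer's own preprocessing.
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    # most_common() sorts by count, descending; words with equal counts
    # stay in the order they were first encountered.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

index = build_word_index(sentences)
print(index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Here `love` and `my` each appear three times, so they take the first two indices; `i` and `dog` appear twice and come next.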


## Demonstrating Parameters:
* `filters`: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character.

* `lower`: boolean. Whether to convert the texts to lowercase.

* `char_level`: if `True`, every character will be treated as a token.

* `oov_token`: if given, it will be added to `word_index` and used to replace out-of-vocabulary words during `texts_to_sequences()` calls.

* `analyzer`: function. A custom analyzer used to split the text. The default analyzer is `text_to_word_sequence`.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'i lOve my dOG',
    'I, love my cat!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    ]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100, filters='?', lower=False)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the indices and print them
word_index = tokenizer.word_index
print(word_index)

{'my': 1, 'i': 2, 'lOve': 3, 'dOG': 4, 'I,': 5, 'love': 6, 'cat!"#$%&()*+,-./:;<=>': 7, '@[\\]^_`{|}~\t\n': 8}


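NOTE: with `filters='?'`, the tokenizer filtered out only the `?` character. It replaced the `?` with a space, which split that string into two tokens. All other punctuation stayed attached to the words.

NOTE: with `lower=False`, the tokenizer did not convert uppercase letters to lowercase. It left each word's case unchanged, so `lOve` and `love` are separate entries.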
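The effect of both arguments can be sketched in a few lines of plain Python, assuming the splitting works by replacing each filtered character with a space and then splitting on spaces. The `split_words` helper is hypothetical; it illustrates the idea and is not the library's code.

```python
# Default filter characters listed in the Tokenizer documentation:
# all punctuation except the apostrophe, plus tab and newline.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Illustrative sketch only: `split_words` is a hypothetical helper.
def split_words(text, filters=DEFAULT_FILTERS, lower=True):
    """Split text into words after replacing filtered characters with spaces."""
    if lower:
        text = text.lower()
    # Each filtered character becomes a space, so it also acts as a separator.
    table = str.maketrans({ch: ' ' for ch in filters})
    return [word for word in text.translate(table).split(' ') if word]

print(split_words('I, love my cat!'))
# ['i', 'love', 'my', 'cat']
print(split_words('I, love my cat!', filters='?', lower=False))
# ['I,', 'love', 'my', 'cat!']
print(split_words('cat?dog', filters='?'))
# ['cat', 'dog']
```

In the last call the `?` disappears and leaves two tokens behind, which mirrors the split seen in the output above.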

That concludes this short exercise on tokenizing input texts!