# NLP Tokenization with TensorFlow Keras
### Overview

In Natural Language Processing (NLP), tokenization is the process of splitting text into smaller units called tokens, usually words or subwords.
Once text is broken into tokens, each token can be mapped to a number, which is the form a model can actually process.

In this example, I'll use **TensorFlow's Keras Tokenizer** to break a few sentences into **tokens** and create a **word index**: a dictionary that maps each unique word to an integer.

### TensorFlow
TensorFlow is an end-to-end open source machine learning platform.

### Keras
Keras is a deep learning API designed for human beings, not machines. Keras focuses on debugging speed, code elegance & conciseness, maintainability, and deployability. When you choose Keras, your codebase is smaller, more readable, easier to iterate on.

### Tokenizer
I import the `Tokenizer` class from Keras.
This class converts text into sequences of integers that a neural network can take as input.

```python
# Import the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

# Define the sentences
sentences = [
    'i love my cat',
    'I, love my dog',
    'You love my dog!'
]

# Initialize the Tokenizer and build its vocabulary
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index
print(word_index)
```

Output:

```
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```


I have three short text samples.
Notice that they include differences in:

* Capitalization (i vs I)

* Punctuation (, and !)

The Tokenizer will handle these automatically by:

* Converting text to lowercase

* Ignoring punctuation
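
A rough pure-Python sketch of that cleanup, not the library's own code. The `FILTERS` string copies the default value of the Tokenizer's `filters` argument; the helper name `normalize` is hypothetical and only used for illustration.

```python
# Default value of the Tokenizer's `filters` argument: every character
# listed here is replaced by a space before the text is split.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(text):
    """Lowercase the text, blank out filtered characters, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

print(normalize("I, love my dog"))    # ['i', 'love', 'my', 'dog']
print(normalize("You love my dog!"))  # ['you', 'love', 'my', 'dog']
```

Note that the apostrophe is not in the default filter list, so a word like `don't` stays a single token.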



---

### Creating the Tokenizer Object

`tokenizer = Tokenizer(num_words=100)`

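This line creates a **Tokenizer** with a vocabulary cap. When text is later converted to sequences, only the **most frequent words** are kept: those whose index is below `num_words`, so at most 99 words here. The cap does not shrink `word_index` itself.
Since my dataset has only six unique words, nothing is dropped.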
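
A minimal sketch of how that limit behaves, assuming (as I understand the Keras implementation) that it is applied when text is converted to sequences and keeps only indices strictly below `num_words`. The helper `apply_num_words` is hypothetical, not part of Keras.

```python
# Word index as produced for the three sample sentences.
word_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

def apply_num_words(indices, num_words=None):
    """Keep only indices strictly below num_words; keep all if it is None."""
    if num_words is None:
        return list(indices)
    return [i for i in indices if i < num_words]

sequence = [word_index[w] for w in ['i', 'love', 'my', 'cat']]
print(apply_num_words(sequence, num_words=100))  # [3, 1, 2, 5]
print(apply_num_words(sequence, num_words=3))    # [1, 2]
```

With `num_words=3`, only the two most frequent words survive, which is why the effective vocabulary size is `num_words - 1`.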


---
### Fitting the Tokenizer on Texts

`tokenizer.fit_on_texts(sentences)`


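This step **analyzes** all sentences, finds every **unique word**, and assigns each one a **unique index number**, starting at 1 (index 0 is reserved and never assigned to a word).

For example:

* The most frequent word gets index 1

* The next most frequent gets 2, and so on

* Words with the same count keep the order in which they first appeared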


---

### Viewing the Word Index

`word_index = tokenizer.word_index`

`print(word_index)`


This prints a dictionary showing the mapping from each word to its unique integer ID.


---

### Expected Output:

`{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}`


Break that down:

* 'love' → 1 → most frequent word (appears in all three sentences)

* 'my' → 2 → also appears in every sentence; it ties with 'love', which was seen first

* 'i' → 3 and 'dog' → 4 → each appears twice; 'i' was seen first

* 'cat' → 5 and 'you' → 6 → each appears once


---



### What Happens Behind the Scenes

When `fit_on_texts()` runs, it:

1. Cleans the text: removes punctuation, lowercases everything.

2. Splits sentences into words (tokens).

3. Counts word frequencies.

4. Assigns indices based on how frequent each word is.
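
The four steps above can be sketched in plain Python. This is a simplified illustration under my own assumptions, not the Keras source: `fit_word_index` is a hypothetical name, and `FILTERS` copies the Tokenizer's default `filters` value.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def fit_word_index(texts):
    """Build a word -> index mapping, most frequent word first."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        # Steps 1 and 2: clean the text, then split it into tokens.
        tokens = text.lower().translate(table).split()
        # Step 3: count word frequencies.
        counts.update(tokens)
    # Step 4: sort by descending count. sorted() is stable, so tied words
    # keep the order in which they were first seen.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['i love my cat', 'I, love my dog', 'You love my dog!']
print(fit_word_index(sentences))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

The stable sort is what puts 'i' ahead of 'dog': both appear twice, but 'i' shows up first in the data.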

This dictionary (`word_index`) can then be used to:

* Convert text to sequences (for input into a model)

* Understand which words are most common

* Build embeddings or vocabularies for neural networks
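
As a sketch of the first use, here is how a word index turns tokens into an integer sequence. The Tokenizer's own method for this is `texts_to_sequences`; the `to_sequence` helper below is a hypothetical stand-in that assumes the text is already tokenized and that no `oov_token` is set, so unknown words are simply skipped.

```python
# Word index as produced for the three sample sentences.
word_index = {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

def to_sequence(tokens, word_index):
    """Map tokens to their indices, skipping words missing from the vocabulary."""
    return [word_index[t] for t in tokens if t in word_index]

print(to_sequence(['i', 'love', 'my', 'dog'], word_index))
# [3, 1, 2, 4]
print(to_sequence(['i', 'really', 'love', 'my', 'cat'], word_index))
# [3, 1, 2, 5]
```

The word 'really' never appeared during fitting, so it leaves no trace in the second sequence.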



---
### ✨ Summary
| Step                   | Purpose                                    |
| ---------------------- | ------------------------------------------ |
| **Tokenizer creation** | Define how many words to keep              |
| **fit_on_texts()**     | Learn all unique words and assign indices  |
| **word_index**         | Get a mapping of each word → integer       |
| **Result**             | Vocabulary for encoding text as integers   |

