### 1. Text Vectorization Layer in TensorFlow

#### What It Is
The `TextVectorization` layer in TensorFlow converts raw strings (sentences) into sequences of integers.

It handles:
 * Lowercasing
 * Removing punctuation
 * Tokenizing into words or characters
 * Creating vocabulary
 * Mapping each token to an integer

It’s super helpful when you want to build your own vocabulary from raw text rather than relying on datasets like IMDb that already return tokenized inputs.



In [4]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Example corpus
texts = [
    "This is a good movie",
    "This movie is bad",
    "What a fantastic film!"
]

# Create TextVectorization layer
"""
This layer will:
    1. Build a vocabulary of the most common words (up to max_tokens). With max_tokens=1000, the vocabulary holds at most 1000 entries, including the reserved padding and [UNK] tokens.
    2. Convert text to integers, where each word maps to a unique integer (based on frequency).
    3. Pad or truncate sequences to exactly output_sequence_length tokens.
"""
vectorizer = TextVectorization(
    max_tokens=1000,     # vocab size
    output_mode='int',   # convert to integer IDs
    output_sequence_length=6  # fixed length sequences
)

# Build vocabulary
"""
The adapt() method analyzes the text corpus and builds the vocabulary.
It learns which words occur and assigns each word a unique index based on frequency (most frequent = lowest index).
In the output, "this", "movie", "is", and "a" each occur twice, so they get the lowest indices and appear at the beginning of the vocabulary list.
"""
vectorizer.adapt(texts)

# View the vocabulary
"""
This shows the vocabulary that the layer has learned. Each word is assigned a unique index. In 'int' output mode the first two indices are reserved:
    ''      → index 0 (empty string, used for padding)
    "[UNK]" → index 1 (used for unknown/out-of-vocab words)
"""
vocab = vectorizer.get_vocabulary()
print("Vocabulary:\n", vocab)

# Transform the text
"""
This line transforms the original sentences into sequences of integers using the vocabulary.
Each sentence is converted into a list of 6 integers (as specified by output_sequence_length).
"""
vectorized_text = vectorizer(texts)
# print("vectorized_text: ", vectorized_text)
print("\nVectorized Output:\n", vectorized_text.numpy())


Vocabulary:
 ['', '[UNK]', np.str_('this'), np.str_('movie'), np.str_('is'), np.str_('a'), np.str_('what'), np.str_('good'), np.str_('film'), np.str_('fantastic'), np.str_('bad')]

Vectorized Output:
 [[ 2  4  5  7  3  0]
 [ 2  3  4 10  0  0]
 [ 6  5  9  8  0  0]]


For example, the sentence `This is a good movie` is transformed to `[ 2  4  5  7  3  0]`, where the trailing `0` is padding up to length 6 and the other IDs map as follows:

* this:2 
* is:4 
* a:5 
* good:7 
* movie:3
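
To make the mapping above concrete, here is a rough plain-Python sketch of the same idea. It is not how `TextVectorization` is implemented; the helper names (`standardize`, `build_vocab`, `vectorize`) are made up for illustration, and the tie-breaking rule for equally frequent words is an assumption picked only because it reproduces the vocabulary printed above.

```python
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # reserved entries at index 0 and 1

def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus, max_tokens):
    """Rank whitespace-separated tokens by frequency, most frequent first."""
    counts = Counter(tok for line in corpus for tok in standardize(line).split())
    # Assumption: ties are broken by descending token order, chosen only
    # because it matches the vocabulary shown in the notebook output.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    words = [tok for tok, _ in ranked][: max_tokens - 2]  # two slots are reserved
    return [PAD, UNK] + words

def vectorize(text, vocab, seq_len):
    """Map tokens to IDs, then truncate or zero-pad to seq_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

texts = [
    "This is a good movie",
    "This movie is bad",
    "What a fantastic film!",
]

vocab = build_vocab(texts, max_tokens=1000)
print(vocab)
print([vectorize(t, vocab, seq_len=6) for t in texts])
print(vectorize("A great movie", vocab, seq_len=6))  # [5, 1, 3, 0, 0, 0]
```

The last line shows a word that was never seen while building the vocabulary (`great`) falling back to index 1, the `[UNK]` slot.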

### How TextVectorization Connects with Embedding

#### Overview of Flow:
Raw Text → TextVectorization → Sequence of Token IDs → Embedding → Dense Representation for each token


Step-by-Step Connection:
1. TextVectorization Layer
    * Takes raw strings and converts them into a fixed-length integer sequence (as in the output above).
    * Each integer represents a word/token from the vocabulary it learned.

2. Embedding Layer
    * Takes the token IDs (integers from above) and looks up a dense vector (embedding) for each token.
    * For a batch of sequences, the output is a 3D tensor of shape (batch_size, sequence_length, embedding_dim), where each token is represented by a dense vector of fixed size.
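
Conceptually, the embedding lookup in step 2 is just row indexing into a weight table. The NumPy sketch below illustrates that idea only; it does not use Keras, the table is filled with random numbers instead of trained weights, and the sizes (`vocab_size = 11`, `embedding_dim = 4`) are illustrative assumptions.

```python
import numpy as np

vocab_size, embedding_dim = 11, 4  # illustrative sizes, not from a trained model

# One row per token ID; a real Embedding layer learns these values in training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of two padded ID sequences, as produced by the vectorization step.
token_ids = np.array([
    [2, 4, 5, 7, 3, 0],
    [2, 3, 4, 10, 0, 0],
])

# Integer-array indexing picks one table row per token ID.
embedded = embedding_table[token_ids]

print(embedded.shape)  # (2, 6, 4): (batch, sequence_length, embedding_dim)
```

Both sequences start with ID 2, so `embedded[0, 0]` and `embedded[1, 0]` are the same vector: the lookup depends only on the token ID, not on where the token appears.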

In [11]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Embedding
from tensorflow.keras.models import Sequential

# Step 1: TextVectorization Layer
vectorizer = TextVectorization(
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=10
)

# Sample data
texts = ["TensorFlow is powerful", "I love deep learning"]
vectorizer.adapt(texts)

# Step 2: Embedding Layer
"""
vocab_size: The number of unique tokens including special tokens like [PAD] and [UNK].
embedding_dim=16: Each word/token will be represented as a dense vector of 16 dimensions.
"""
vocab_size = len(vectorizer.get_vocabulary())  # Embedding input_dim must cover every token ID
embedding_dim = 16

# Step 3: Create model
"""
The model below includes:

vectorizer: Converts input text to integer sequences.
Embedding: Converts those integers into dense 16-dimensional vectors.
mask_zero=True: Treats ID 0 as padding and produces a mask so that downstream layers that support masking can skip those positions.
"""
model = Sequential([
    vectorizer,
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True)
])

# Forward pass on example
print("tf constant: ", tf.constant(texts))
output = model(tf.constant(texts))
# print("output: ", output)
print("Embedded Output Shape:", output.shape)


tf constant:  tf.Tensor([b'TensorFlow is powerful' b'I love deep learning'], shape=(2,), dtype=string)
Embedded Output Shape: (2, 10, 16)


This pipeline:
1. Converts raw text into integer sequences using a vocabulary.
2. Maps those integers to dense 16D vectors using an Embedding layer.


We use a `Sequential()` model because the layers run strictly in order:
1. First, the input text goes into the TextVectorization layer, which converts text to integers.
2. Next, those integers go into the Embedding layer, which turns them into dense vectors.