# Model Building (reusable model class)

**Goal: You want to build a reusable, modular Keras model for text classification. Key points:**
- Inputs: Integer token IDs (produced by your pipeline). Shape (batch_size, max_len).
- Embedding: Converts token IDs → dense vectors. Essential for NLP.
- Sequence encoder: Can be a BiLSTM or a Transformer block — this encodes order and context.
- Dense layers: Optional intermediate processing to extract features.
- Output layer: Softmax (or sigmoid if binary) — produces predictions.
- Independent of training: This file just builds the model; compiling, LR schedules, and loss go in the training script.
- Config-driven: So you can swap hyperparameters easily.

**Why modular?**
- Experimentation: Swap LSTM → Transformer → CNN blocks without touching training code.
- Clarity: Separation of concerns: pipeline → model → training loop.
- Reproducibility: build_model(config) ensures everyone can rebuild the same model easily.
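As a rough sketch of the config-driven idea, the hyperparameters can live in a small frozen dataclass that is validated once and then unpacked into `build_model(**kwargs)`. The `ModelConfig` class and its validation rules are hypothetical and not part of this notebook; the field names simply mirror the builder's arguments.

```python
from dataclasses import dataclass, asdict, replace


# Hypothetical config object (an assumption for illustration);
# field names mirror the arguments of build_model().
@dataclass(frozen=True)
class ModelConfig:
    vocab_size: int
    max_len: int = 128
    embed_dim: int = 128
    num_classes: int = 2
    encoder_type: str = "bilstm"
    dense_units: int = 64

    def __post_init__(self):
        # Fail early on values the builder cannot handle.
        if self.encoder_type.lower() not in {"bilstm", "transformer"}:
            raise ValueError(f"Unknown encoder_type={self.encoder_type}")
        if self.vocab_size <= 0 or self.max_len <= 0:
            raise ValueError("vocab_size and max_len must be positive")


base = ModelConfig(vocab_size=10000)

# Swap the encoder for an experiment without touching anything else.
variant = replace(base, encoder_type="transformer")

# Plain dict, ready to be unpacked as build_model(**kwargs).
kwargs = asdict(variant)
print(kwargs)
```

Because the dataclass is frozen, each experiment gets its own immutable config, which makes it easier to log exactly which settings produced a given model.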



In [None]:
%pip install tensorflow keras numpy

In [2]:
# Example

"""
model_builder — modular Keras model for text classification
"""

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model(
    # now we define all the parameters for our model
    vocab_size: int, 
    max_len: int = 128,
    embed_dim: int = 128,
    num_classes: int = 2,
    encoder_type: str = 'bilstm',  # options: 'bilstm', 'transformer'
    lstm_units: int = 128,
    transformer_heads: int = 4,
    transformer_ff_dim: int = 128,
    dense_units: int = 64,
) -> keras.Model:
    """
    Build a text classification model with modular encoder options.
    
    Parameters:
        vocab_size: Size of vocabulary
        max_len: Max sequence length
        embed_dim: Embedding dimension
        num_classes: Number of output classes
        encoder_type: 'bilstm' or 'transformer'
        lstm_units: Units for BiLSTM layer
        transformer_heads: Attention heads for transformer
        transformer_ff_dim: Feed-forward dim for transformer
        dense_units: Units for intermediate Dense layer
    
    Returns:
        Keras Model (not compiled)
    """
    # Input layer: defines the shape and dtype of the input data.
    # It takes sequences of integer token IDs of length max_len (i.e. tokenized text).
    inputs = keras.Input(shape=(max_len,), dtype='int32', name='input_ids')

    # 1️⃣ Embedding: maps each token ID to a dense vector of size embed_dim.
    # mask_zero=True treats ID 0 as padding so that mask-aware layers can skip it.
    x = layers.Embedding(vocab_size, embed_dim, mask_zero=True, name='embedding')(inputs)

    # 2️⃣ Sequence encoder (help understand the context of the sequence)
    # Choose between BiLSTM or Transformer for sequence encoding
    # Use BiLSTM when you need to capture sequential dependencies in both directions
    # and your dataset is relatively small or you want a simpler model.
    # Use Transformer when you need to capture global context, work with longer sequences,
    # or have a larger dataset that can benefit from its parallel processing capabilities.
    if encoder_type.lower() == 'bilstm':  # BiLSTM = bidirectional LSTM, a type of RNN
        
        # BiLSTM (Bidirectional LSTM) Layer:
        # LSTM (Long Short-Term Memory) is a type of RNN (Recurrent Neural Network) that is capable of learning long-term dependencies in sequential data.
        # Bidirectional LSTM processes the input sequence in both forward and backward directions, capturing context from both past and future tokens.
        # This is particularly useful in NLP tasks where understanding the context of a word depends on both preceding and succeeding words.
        x = layers.Bidirectional(
            layers.LSTM(lstm_units),  # LSTM layer with `lstm_units` specifying the number of units in the LSTM cell
            name='bilstm'             # Name of the layer for identification
        )(x)
    elif encoder_type.lower() == 'transformer':
        # Simple transformer block
        # Multi-Head Attention: This layer allows the model to focus on different parts of the input sequence
        # simultaneously. It computes attention scores for each token in the sequence relative to all other tokens.
        # `num_heads` specifies the number of attention heads, and `key_dim` is the size of each head's query/key vectors.
        attn_output = layers.MultiHeadAttention(
            num_heads=transformer_heads,  # Number of attention heads
            key_dim=embed_dim,           # Per-head size of the query/key vectors
            name='transformer_attn'      # Name of the attention layer
        )(x, x)  # The input `x` is used as both the query and the key/value (self-attention).

        # Residual Connection: Adds the original input `x` back to the attention output.
        # This helps preserve the original information and improves gradient flow during training.
        x = layers.Add(name='residual_add')([x, attn_output])

        # Layer Normalization: Normalizes the output of the residual connection.
        # This stabilizes training and ensures that the values are on a similar scale.
        x = layers.LayerNormalization(name='layer_norm')(x)

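        # Feed-Forward Network (FFN): a ReLU dense layer followed by a projection back to embed_dim.
        # This introduces non-linearity and allows the model to learn more complex representations.
        ff_output = layers.Dense(
            transformer_ff_dim,  # Dimensionality of the feed-forward layer
            activation='relu',   # Activation function
            name='ff_dense'      # Name of the dense layer
        )(x)
        ff_output = layers.Dense(embed_dim, name='ff_proj')(ff_output)  # Project back so shapes match for the residual

        # Second residual connection + normalization, so the FFN output is actually used.
        x = layers.Add(name='ff_residual_add')([x, ff_output])
        x = layers.LayerNormalization(name='ff_layer_norm')(x)

        # Pool over the time axis: (batch, max_len, embed_dim) -> (batch, embed_dim),
        # so the classifier head gets one vector per sequence, like the BiLSTM branch.
        x = layers.GlobalAveragePooling1D(name='pool')(x)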
    else:
        raise ValueError(f"Unknown encoder_type={encoder_type}")

    # 3️⃣ Dense intermediate layer: an extra ReLU layer before the output
    # that adds non-linearity and lets the model learn more complex features.
    x = layers.Dense(dense_units, activation='relu', name='dense')(x)

    # 4️⃣ Output layer (classification layer)
    if num_classes == 1:
        # Binary classification: a single sigmoid unit outputs the probability of the
        # positive class (close to 1 = positive, close to 0 = negative).
        outputs = layers.Dense(1, activation='sigmoid', name='output')(x)
    else:
        # Multi-class classification: softmax outputs a probability distribution over all
        # classes (probabilities sum to 1); the highest-probability class is the prediction.
        outputs = layers.Dense(num_classes, activation='softmax', name='output')(x)

    # Assemble the Keras model from the input and output tensors. The model is still
    # untrained and uncompiled here; compiling and fitting happen in the training script.
    model = keras.Model(inputs=inputs, outputs=outputs, name='text_classifier')
    return model


### Key Notes
- Encoders modular: Swap encoder_type without changing downstream code.
- Dense units independent: You can tune intermediate features easily.
- Output flexible: Uses a single sigmoid unit when num_classes=1 (binary) and softmax otherwise. Note that num_classes=2 produces a two-unit softmax, not a sigmoid.
- No compile: Leave optimizer, loss, LR schedules, metrics in train_fit.py.
- Embedding uses mask_zero=True: Ensures LSTM ignores padded positions.
- Transformer block: Minimal example to understand mechanics; can expand to full stacked blocks later.
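Since compiling is left to the training script, the loss there has to agree with the output head chosen here. The helper below is a minimal sketch of that pairing; `head_and_loss` is a hypothetical name, and it assumes labels are integer class IDs (one-hot labels would need `categorical_crossentropy` instead).

```python
from typing import Tuple


def head_and_loss(num_classes: int) -> Tuple[int, str, str]:
    """Return (output_units, activation, loss) matching build_model's head logic.

    Hypothetical helper for illustration; assumes integer class labels.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be >= 1")
    if num_classes == 1:
        # Single sigmoid unit: labels are 0/1.
        return 1, "sigmoid", "binary_crossentropy"
    # Softmax over num_classes units: labels are integers in [0, num_classes).
    return num_classes, "softmax", "sparse_categorical_crossentropy"


print(head_and_loss(1))
print(head_and_loss(2))
```

In a training script, the third element could be passed as `loss=` to `model.compile(...)`, so the head and the loss never drift apart.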


In [4]:
# example usage 
config = {
    "vocab_size": 10000,
    "max_len": 128,
    "embed_dim": 128,
    "num_classes": 2,
    "encoder_type": "bilstm",
    "lstm_units": 128,
    "dense_units": 64,
}

model = build_model(**config)
model.summary()



### NOTE:

- What it is (in short): the NotEqual entry in your model is just a check that creates a True/False mask saying “is this token not padding (0)?”
- Why we need it: when sentences are different lengths we pad short ones with 0 so all rows have the same length. The mask tells the model which positions are real words and which are just padding, so the model can ignore padding.
- Tiny concrete example:
    - Input token ids (two examples, padded to length 4):
      [[5, 3, 0, 0],
       [2, 7, 9, 0]]
    - Mask computed by input_ids != 0:
      [[True, True, False, False],
       [True, True, True, False]]
- Meaning: False positions are padding and should be ignored.

- How it appears in your model: Embedding(..., mask_zero=True) tells Keras to treat 0 as padding; Keras creates the not_equal op (no trainable params) and passes that mask to layers like LSTM so they skip padded timesteps.
- Simple takeaway: masking prevents padding from changing model outputs. If you don’t want it, set mask_zero=False on the Embedding (but then padding will be treated like a real token).
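To make the effect of the mask concrete, here is a small plain-Python sketch (no Keras involved) that computes the same `input_ids != 0` mask and uses it to average per-token values. The `padding_mask` and `masked_mean` helpers and the toy scores are assumptions for illustration only; they show the idea, not how Keras implements masking internally.

```python
PAD_ID = 0  # token ID reserved for padding


def padding_mask(batch):
    """True where the token is real, False where it is padding."""
    return [[tok != PAD_ID for tok in row] for row in batch]


def masked_mean(values, mask):
    """Average only the values whose mask entry is True."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


input_ids = [[5, 3, 0, 0],
             [2, 7, 9, 0]]
mask = padding_mask(input_ids)
print(mask)  # [[True, True, False, False], [True, True, True, False]]

# Toy per-token scores for the first row; padded positions carry junk values.
scores = [1.0, 3.0, 100.0, 100.0]
print(masked_mean(scores, mask[0]))  # 2.0  (padding ignored)
print(sum(scores) / len(scores))     # 51.0 (padding distorts the result)
```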


In [None]:
# using model 
import tensorflow as tf
from tensorflow.keras import layers

# ----------------------------
# 1️⃣ Tiny example dataset
# ----------------------------
texts = [
    "I love this movie",
    "This film was terrible",
    "Amazing plot and acting",
    "Horrible movie experience",
    "Loved it, would watch again",
    "Worst movie ever"
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Convert to tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

# ----------------------------
# 2️⃣ Text preprocessing (tokenization)
# ----------------------------
vocab_size = 50
max_len = 10
vectorize_layer = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=max_len
)
vectorize_layer.adapt(dataset.map(lambda x, y: x))

def preprocess(text, label):
    # `text` is a scalar string tensor (e.g. b"I love this movie") because
    # we map before batching. TextVectorization turns it into a 1-D integer
    # sequence of length max_len; batching afterwards stacks these into
    # shape (batch_size, max_len), which is what the model expects.
    # Do NOT expand dimensions here.
    return vectorize_layer(text), label

dataset = dataset.map(preprocess).batch(2).prefetch(tf.data.AUTOTUNE)

# ----------------------------
# 3️⃣ Build model
# ----------------------------
model = build_model(
    vocab_size=vocab_size,
    max_len=max_len,
    embed_dim=16,
    num_classes=1,
    encoder_type='bilstm',
    lstm_units=8,
    dense_units=4
)

# Compile model (optimizer, loss, metrics)
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

model.summary()

# ----------------------------
# 4️⃣ Train model on tiny dataset
# ----------------------------
model.fit(dataset, epochs=5)

# ----------------------------
# 5️⃣ Test with a new example
# ----------------------------
test_text = tf.constant(["I hated this movie"])  # shape (1,)
# vectorize_layer expects a 1-D batch of strings; pass test_text directly
test_input = vectorize_layer(test_text)
pred = model.predict(test_input)
# The output is a probability between 0 and 1 (0 = negative, 1 = positive).
print("Prediction (probability of positive):", pred[0][0])


Epoch 1/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 4s 46ms/step - accuracy: 0.5000 - loss: 0.6935
Epoch 2/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - accuracy: 0.5000 - loss: 0.6930
Epoch 3/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step - accuracy: 0.5000 - loss: 0.6925
Epoch 4/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - accuracy: 0.5833 - loss: 0.6921
Epoch 5/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step

### What’s happening here
- Dataset: Tiny list of movie reviews + labels → converted to tf.data.Dataset.
- TextVectorization: Tokenizes words → integers, pads/truncates to max_len=10.
- Pipeline: .map(preprocess), .batch(2), .prefetch() ensures efficient feeding.
- Model: Uses the build_model() you created — embedding → BiLSTM → dense → sigmoid.
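- Training: We train for a few epochs on the tiny dataset. With only six examples, accuracy stays near chance (0.50 to 0.58 in the logged run), so this demonstrates the workflow rather than a useful classifier.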
- Prediction: Shows how to pass a new raw text through the same preprocessing pipeline → model.
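For intuition about the tokenization step, the sketch below approximates in plain Python what `TextVectorization` does with its default settings: lowercase, strip punctuation, split on whitespace, reserve ID 0 for padding and ID 1 for out-of-vocabulary tokens, then pad or truncate to `max_len`. The function names are hypothetical, and the exact IDs Keras assigns (for example, how it orders equally frequent words) may differ from this approximation.

```python
import string
from collections import Counter

PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary tokens


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to IDs starting at 2."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, max_len):
    """Convert text to a fixed-length list of token IDs."""
    ids = [vocab.get(tok, OOV_ID) for tok in tokenize(text)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


texts = [
    "I love this movie",
    "This film was terrible",
    "Amazing plot and acting",
    "Horrible movie experience",
    "Loved it, would watch again",
    "Worst movie ever",
]
vocab = build_vocab(texts, max_tokens=50)
vec = vectorize("I hated this movie", vocab, max_len=10)
print(vec)  # "hated" was never seen, so position 1 holds OOV_ID
```

The same vocabulary must be used at training and prediction time, which is why the notebook reuses the adapted `vectorize_layer` for the test sentence.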
