# Basics of Model Training in TensorFlow

### model.fit() (higher-level)
Keras manages the training loop, metrics, logging, and optimizations for you.
- Keras handles the whole training step:
  - runs the forward and backward passes automatically
  - tracks metrics and logs progress per batch
  - supports callbacks such as early stopping and checkpoints
- Use when: you want productivity and standard training behavior.

### GradientTape loop (lower-level)
You write the forward, backward, and update steps yourself.
- You manually control each step:
  - compute the loss inside the tape
  - get the gradients yourself
  - apply the optimizer updates manually
- Use when: you want to learn how training works, or you need custom per-step logic, research flexibility, or non-standard training.

In [None]:
%pip install tensorflow

### imports + tiny dataset

In [None]:
# Cell 1: imports and tiny dataset

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import time

# tiny dataset: 6 sentences (toy)
texts = [
    "I love this movie",
    "This film was terrible",
    "Amazing plot and acting",
    "Horrible movie experience",
    "Loved it, would watch again",
    "Worst movie ever"
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# wrap as tf.data.Dataset (strings, ints)
ds = tf.data.Dataset.from_tensor_slices((texts, labels))
# Why: small data so you can step through everything quickly. tf.data.Dataset simply holds your pairs of (text, label).

### preprocessing with TextVectorization

In [16]:
# Cell 2: TextVectorization to convert strings -> token ids
vocab_size = 1000
max_len = 10

vectorize = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=max_len,
)

# adapt the vectorizer on the text data
vectorize.adapt(ds.map(lambda t, y: t))

# helper to apply vectorization and batch
def prepare(ds, batch_size=2, shuffle=True):
    ds2 = ds
    if shuffle:
        ds2 = ds2.shuffle(100)
    # Vectorize each example, then batch (TextVectorization also accepts already-batched strings)
    ds2 = ds2.map(lambda t, y: (vectorize(t), y), num_parallel_calls=tf.data.AUTOTUNE)
    ds2 = ds2.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds2

train_ds = prepare(ds, batch_size=2)
# show one batch to inspect shapes / values
for xb, yb in train_ds.take(1):
    print("input_ids shape:", xb.shape, "labels:", yb.numpy())
    print("input_ids example (first batch row):", xb.numpy()[0])

# Why: TextVectorization is a convenient tokenizer + indexer. You see the concrete token IDs and shapes before training.

input_ids shape: (2, 10) labels: [0 1]
input_ids example (first batch row): [ 5  2 17  0  0  0  0  0  0  0]


2025-11-16 14:21:22.996823: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
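
To make the token IDs above easier to read, here is a minimal sketch of how `TextVectorization` assigns them. The two sentences and the name `vectorizer` are illustrative and separate from the notebook's `vectorize` layer. With the default settings, index 0 is reserved for padding and index 1 for out-of-vocabulary words; the remaining tokens are ordered by frequency.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative two-sentence corpus (not the notebook's dataset)
texts = tf.constant(["I love this movie", "This film was terrible"])

vectorizer = layers.TextVectorization(
    max_tokens=1000,
    output_mode="int",
    output_sequence_length=6,
)
vectorizer.adapt(texts)

# Index 0 is the padding token '', index 1 is the OOV token '[UNK]'
vocab = vectorizer.get_vocabulary()
print("first vocabulary entries:", vocab[:4])

# "unwatchable" was never seen during adapt(), so it maps to the OOV index;
# the sequence is padded with zeros up to output_sequence_length
ids = vectorizer(tf.constant(["this movie was unwatchable"])).numpy()[0]
print("token ids:", ids)
```

Text is lowercased and stripped of punctuation by default, which is why "This" and "this" share one ID.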


### small model builder (standalone)

In [17]:
# Cell 3: tiny model builder (embedding -> pooling -> dense)
def build_tiny_model(vocab_size=vocab_size, embed_dim=16, max_len=max_len, num_classes=1):
    inputs = keras.Input(shape=(max_len,), dtype='int32', name='input_ids')
    x = layers.Embedding(vocab_size, embed_dim, name='embed')(inputs)
    x = layers.GlobalAveragePooling1D(name='pool')(x)
    x = layers.Dense(8, activation='relu', name='dense')(x)
    if num_classes == 1:
        outputs = layers.Dense(1, activation='sigmoid', name='out')(x)
    else:
        outputs = layers.Dense(num_classes, activation='softmax', name='out')(x)
    return keras.Model(inputs, outputs)

model = build_tiny_model()
model.summary()

# Why: simple model so training finishes instantly — good for learning.

### Train using model.fit() (high-level)

**Keras gives you a full training engine:**

- batches NumPy arrays and tensors automatically (a `tf.data.Dataset` must already be batched)
- runs the forward pass
- calculates the loss
- applies the optimizer
- tracks metrics
- supports callbacks (EarlyStopping, checkpoints, etc.)
- supports distributed training with tf.distribute

*You only specify:*
```py
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.fit(ds, epochs=10)
```
**Pros**
- Very short code
- Very reliable & battle-tested
- Automatically handles edge cases
- Works perfectly with built-in layers/models
- Easy multi-GPU/multi-worker training
- Built-in callbacks (logging, checkpoints, LR schedules, etc.)
  
**Cons**
- Hard to customize per-step behavior
- Hard to do weird training loops (RL, GANs, meta-learning)
- Hard to inject custom losses that depend on intermediate tensors or multiple passes

**When to use**
- Use model.fit() when your training is standard supervised learning and you don’t need custom step-by-step behavior.
- It’s the right choice for most projects because it’s simple, clean, and handles metrics, callbacks, and the epoch loop automatically.
- Best for productivity, clean code, and standard training.
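
As a rough illustration of the callback support listed above, the sketch below attaches `EarlyStopping` to `model.fit()`. The synthetic NumPy data, the layer sizes, and the `patience` value are assumptions made for the example; NumPy arrays are used here because `validation_split` is not supported for `tf.data.Dataset` inputs.

```python
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])

# Stop when val_loss has not improved for 2 epochs and restore the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True
)

history = model.fit(
    x, y,
    validation_split=0.25,  # hold out the last 25% of the arrays for validation
    epochs=20,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0,
)
epochs_run = len(history.history["loss"])
print("epochs actually run:", epochs_run)
```

Whether training stops before epoch 20 depends on the random initialization, so the printed count can vary between runs.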

In [18]:
# Cell 4: high-level training with model.fit()
model = build_tiny_model()
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()]
)

# Train for a few epochs
t0 = time.time()
history = model.fit(train_ds, epochs=5, verbose=2)
print("fit() took %.3f s" % (time.time() - t0))

# What happened: fit() ran an internal loop: for each batch it called the model, computed the loss, computed gradients, and applied the optimizer update.

Epoch 1/5


2025-11-16 14:21:36.538275: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


3/3 - 1s - 366ms/step - binary_accuracy: 0.5000 - loss: 0.6942
Epoch 2/5
3/3 - 0s - 21ms/step - binary_accuracy: 0.3333 - loss: 0.6939
Epoch 3/5
3/3 - 0s - 20ms/step - binary_accuracy: 0.6667 - loss: 0.6926
Epoch 4/5
3/3 - 0s - 19ms/step - binary_accuracy: 0.8333 - loss: 0.6920
Epoch 5/5
3/3 - 0s - 19ms/step - binary_accuracy: 0.8333 - loss: 0.6912
fit() took 1.350 s


### Manual training loop with GradientTape (low-level)

**You write everything explicitly:**
```py
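# `model` and `ds` are assumed to exist already
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.BinaryCrossentropy()  # any Keras loss; this one matches the tutorial's model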

for x, y in ds:
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```
**Pros**
- You can modify every step
- Add custom logic:
    - reinforcement learning updates
    - multiple losses
    - contrastive learning
    - gradient accumulation
    - clipping, freezing, mixing
    - GAN training (two models, two optimizers)
- Perfect for research and experiments
  
**Cons**
- More code to write
- Easier to make mistakes
- You must manage metrics manually
- You must write your own loop for epochs, batches, validation
- Harder to integrate with distribution strategies

**When to use**
- Use a custom training loop (GradientTape) when you need full control over the training step — e.g., custom losses, unusual update rules, multi-model setups (GANs), reinforcement learning, or any behavior that model.fit() can’t express.
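
One of the customizations listed above, gradient clipping, can be sketched as follows. The toy regression data, the `clip_norm` threshold, and the helper name `clipped_train_step` are assumptions for illustration; `tf.clip_by_global_norm` rescales all gradients together so that their combined norm does not exceed the threshold.

```python
import tensorflow as tf
from tensorflow import keras

# Toy regression data (illustrative only)
x = tf.random.normal((8, 4), seed=0)
y = tf.random.normal((8, 1), seed=1)

model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
optimizer = keras.optimizers.SGD(learning_rate=0.1)
loss_fn = keras.losses.MeanSquaredError()
clip_norm = 0.5  # assumed threshold, chosen only for the example


def clipped_train_step(x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    # Returns the rescaled gradients and the global norm measured before clipping
    clipped, norm_before = tf.clip_by_global_norm(grads, clip_norm)
    optimizer.apply_gradients(zip(clipped, model.trainable_variables))
    return loss, norm_before, tf.linalg.global_norm(clipped)


loss, norm_before, norm_after = clipped_train_step(x, y)
print("norm before:", float(norm_before), "norm after:", float(norm_after))
```

The same place in the step, between `tape.gradient` and `apply_gradients`, is where gradient accumulation or per-variable freezing would go.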


In [20]:
# Cell 5: low-level training with GradientTape
# We'll recreate model & optimizer to compare fairly
model2 = build_tiny_model()
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = keras.losses.BinaryCrossentropy()
train_loss_metric = keras.metrics.Mean(name="train_loss")
train_acc_metric = keras.metrics.BinaryAccuracy(name="train_acc")

# We'll run the same number of epochs and iterate datasets manually
epochs = 5
t0 = time.time()

# Optional: wrap train_step in @tf.function to compile to graph after you confirm correctness
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        # model2 ends in a sigmoid, so it returns probabilities, not logits
        probs = model2(x, training=True)
        loss_value = loss_fn(y, probs)
    grads = tape.gradient(loss_value, model2.trainable_variables)
    optimizer.apply_gradients(zip(grads, model2.trainable_variables))
    train_loss_metric.update_state(loss_value)
    train_acc_metric.update_state(y, probs)

for epoch in range(epochs):
    # reset metrics at start of epoch (use singular API `reset_state()`)
    train_loss_metric.reset_state()
    train_acc_metric.reset_state()

    for step, (x_batch, y_batch) in enumerate(train_ds):
        train_step(x_batch, tf.cast(y_batch, tf.float32))
    print(f"Epoch {epoch+1}: loss={train_loss_metric.result():.4f}, acc={train_acc_metric.result():.4f}")

print("custom loop took %.3f s" % (time.time() - t0))

# What happened: You implemented the forward pass, computed loss, computed gradients via GradientTape, 
# applied updates, and manually updated metrics. @tf.function compiles train_step into a graph for speed.

Epoch 1: loss=0.6956, acc=0.5000
Epoch 2: loss=0.6958, acc=0.3333
Epoch 3: loss=0.6949, acc=0.3333
Epoch 4: loss=0.6944, acc=0.3333
Epoch 5: loss=0.6943, acc=0.3333
custom loop took 0.778 s
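
The `@tf.function` decorator used on `train_step` traces the Python function into a graph the first time it is called with a given input signature, then replays that graph. The sketch below makes this visible with a Python counter; `square` and `trace_count` are illustrative names, not part of the notebook.

```python
import tensorflow as tf

trace_count = 0  # counts how often the Python body actually runs


@tf.function
def square(x):
    global trace_count
    trace_count += 1  # Python side effect: runs during tracing, not on graph replays
    return x * x


square(tf.constant(2.0))  # first call: traces a graph for float32 scalars
square(tf.constant(3.0))  # same dtype and shape: reuses the traced graph
traces_after_floats = trace_count

square(tf.constant(2))    # int32 input: a new signature, so a second trace
traces_after_int = trace_count

print(traces_after_floats, traces_after_int)
```

This is why `print` statements and other Python side effects inside a decorated function seem to run only once. For debugging, `tf.config.run_functions_eagerly(True)` makes decorated functions run eagerly without removing the decorator.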


### Quick comparison & notes


In [22]:
# Cell 6: quick notes (print a prediction)
sample = tf.constant(["I absolutely loved this film"])
sample_vec = vectorize(tf.expand_dims(sample, -1))
print("Vectorized:", sample_vec.numpy())
print("fit() model predicts:", model.predict(sample_vec)[0][0])
print("custom-loop model predicts:", model2.predict(sample_vec)[0][0])

Vectorized: [[13  1 10  3 15  0  0  0  0  0]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
fit() model predicts: 0.5008707
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
custom-loop model predicts: 0.49770308


**Observations / learning points**
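- Both methods train the same architecture. Results differ slightly because each model starts from different random initial weights and sees the batches in a different shuffled order, unless you fix the seeds.
- After only 5 epochs on 6 sentences, both losses are still close to 0.69 (chance level for binary cross-entropy) and both predictions are close to 0.5, so neither model has learned much yet.
- model.fit() is concise; GradientTape gives you control.
- Use @tf.function on train_step for speed, but debug in eager mode first (remove @tf.function) if things break.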
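
To compare the two approaches from identical starting weights, one option is to reset the random seed before building each model. This is a minimal sketch using `keras.utils.set_random_seed`; the `build` helper and the seed value 42 are assumptions for the example.

```python
import numpy as np
from tensorflow import keras


def build():
    # Hypothetical small model, used only to compare initial weights
    return keras.Sequential([
        keras.Input(shape=(4,)),
        keras.layers.Dense(3, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])


keras.utils.set_random_seed(42)  # seeds Python, NumPy, and TensorFlow
model_a = build()

keras.utils.set_random_seed(42)  # same seed again before the second build
model_b = build()

same_init = all(
    np.array_equal(wa, wb)
    for wa, wb in zip(model_a.get_weights(), model_b.get_weights())
)
print("identical initial weights:", same_init)
```

Identical initial weights alone do not guarantee identical training curves; the shuffle order of the dataset and any nondeterministic GPU kernels also matter.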


### Other Custom Loops

* **`tf.GradientTape` is the main and modern way** to write custom training loops in TensorFlow.
* You can also make custom loops by **overriding `model.train_step()`**, which still uses GradientTape internally but keeps compatibility with `model.fit()`.
* Older methods exist (`optimizer.compute_gradients`, TF1-style training ops), but they are **deprecated and not recommended**.

**In practice:**
* Use **GradientTape** for full control.
* Use a **`train_step()` override** if you want custom behavior **and** want to keep using `model.fit()`.
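
A `train_step()` override might look like the sketch below. The class name `TinyClassifier`, the layer sizes, and the synthetic data are assumptions for illustration. The loss and metrics are managed inside the model, so `compile()` only receives an optimizer, and `model.fit()` still drives the epochs and logging.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras


class TinyClassifier(keras.Model):
    """Hypothetical model whose per-batch logic is written by hand."""

    def __init__(self):
        super().__init__()
        self.hidden = keras.layers.Dense(8, activation="relu")
        self.out = keras.layers.Dense(1, activation="sigmoid")
        self.loss_fn = keras.losses.BinaryCrossentropy()
        self.loss_tracker = keras.metrics.Mean(name="loss")
        self.acc_metric = keras.metrics.BinaryAccuracy(name="acc")

    def call(self, inputs, training=False):
        return self.out(self.hidden(inputs))

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            probs = self(x, training=True)
            loss = self.loss_fn(y, probs)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.loss_tracker.update_state(loss)
        self.acc_metric.update_state(y, probs)
        return {"loss": self.loss_tracker.result(), "acc": self.acc_metric.result()}

    @property
    def metrics(self):
        # Listing the metrics here lets fit() reset them at the start of each epoch
        return [self.loss_tracker, self.acc_metric]


# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4)).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 0).astype("float32")

model = TinyClassifier()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3))
history = model.fit(x, y, epochs=3, batch_size=8, verbose=0)
print(sorted(history.history.keys()))
```

Callbacks and `tf.distribute` strategies keep working with this pattern because the outer loop is still `fit()`.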
