# Chapter 30: Model Training Best Practices

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Prepare time‑series data correctly for neural network training (batching, shuffling, normalization)
- Choose an appropriate batch size and understand its impact on convergence
- Implement learning rate schedules and adaptive methods to improve training
- Select loss functions that align with your prediction task (regression, classification, probabilistic)
- Apply early stopping to prevent overfitting and save the best model
- Use model checkpointing to preserve training progress
- Leverage mixed precision training to speed up computation on compatible hardware
- Scale training across multiple GPUs with distributed strategies
- Monitor training metrics effectively with tools like TensorBoard
- Diagnose and debug common training issues (vanishing gradients, poor convergence, NaN losses)

---

## **30.1 Data Preparation**

Proper data preparation is the foundation of successful model training. For time‑series forecasting, we must respect temporal order while still providing the model with enough variety to generalize.

### **30.1.1 Temporal Shuffling**

In standard machine learning, we shuffle the dataset so that each batch reflects the overall distribution. With time series, the order of observations **inside** each sample carries the signal, so it must never be shuffled. Whether the samples themselves can be shuffled depends on the model. If each sample is a self-contained window (lagged features for an MLP, or a fixed-length sequence for a stateless RNN or CNN), the order of the windows can be shuffled during training. If the model carries state from one batch to the next (a stateful RNN), the batches must stay in chronological order.

**When to shuffle:**
- For MLPs on lagged features: shuffling is fine, because each row is a self-contained sample.
- For stateless RNNs/LSTMs/CNNs on sequences (like our sequence models): never reorder the time steps inside a window, but the order of the windows can be shuffled, provided each window was built without look-ahead.
- For stateful RNNs/LSTMs: do **not** shuffle; the hidden state carried between batches only makes sense if batches arrive in chronological order.

For our NEPSE sequence data, we can safely shuffle the training windows because they are independent samples (each is a fixed 20‑day window). The validation and test sets must remain in temporal order to evaluate true out‑of‑sample performance.

```python
# Example: shuffling training windows (X and y stay aligned)
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
indices = rng.permutation(len(X_train))
X_train_shuffled = X_train[indices]
y_train_shuffled = y_train[indices]
```

**Explanation:** Shuffling breaks the sequential correlation between consecutive windows, which usually leads to more stable training. Overlapping windows inside the training set are not a problem, since each one is still a valid input-output pair. The risk is overlap across the split boundary: split chronologically first, make sure no training window (including its target) reaches into the validation or test period, and only then shuffle the training windows.
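The split-then-shuffle order described above can be sketched with NumPy alone. This is a minimal illustration, assuming a synthetic return series rather than real NEPSE data; the helper name `make_windows` and the 80/20 split are hypothetical choices.

```python
import numpy as np

def make_windows(series, window=20):
    """Build (X, y) pairs: `window` past values -> the next value."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Synthetic stand-in for a daily return series (not real NEPSE data)
rng = np.random.default_rng(0)
series = rng.normal(0.0, 0.01, size=500)

X, y = make_windows(series, window=20)

# 1) Split chronologically: the first 80% of windows train, the rest test
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# 2) Shuffle only the training windows, keeping X and y aligned
perm = rng.permutation(len(X_train))
X_train_shuffled, y_train_shuffled = X_train[perm], y_train[perm]
```

Because the split happens before the shuffle, every training target is earlier in time than every test target, and the test windows stay in chronological order.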

### **30.1.2 Batching**

We feed data to the model in mini‑batches. The batch size affects training dynamics:

- **Larger batches:** more stable gradient estimates, can use higher learning rates, but require more memory and may converge to sharper minima.
- **Smaller batches:** noisier gradients, can help escape local minima, but training may be less stable.

For time‑series, we often use batch sizes of 32, 64, or 128. The optimal size depends on the dataset size and model complexity.

```python
# In Keras, batch size is set during fit
history = model.fit(X_train, y_train, batch_size=32, epochs=50)
```
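To make the effect of `batch_size` concrete, the following sketch shows roughly what happens to the data during one epoch: it is cut into consecutive mini-batches, and the last one may be smaller. The generator `iterate_minibatches` is a hypothetical helper written for illustration, not a Keras function, and the arrays are placeholders.

```python
import math
import numpy as np

def iterate_minibatches(X, y, batch_size):
    """Yield consecutive (X, y) mini-batches; the last one may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Synthetic stand-in: 1000 windows of 20 time steps with 1 feature each
X = np.zeros((1000, 20, 1))
y = np.zeros(1000)

batch_size = 32
batches = list(iterate_minibatches(X, y, batch_size))
steps_per_epoch = math.ceil(len(X) / batch_size)

print(len(batches), steps_per_epoch)   # 32 32
print(batches[-1][0].shape)            # (8, 20, 1)
```

Each batch produces one weight update, so halving the batch size doubles the number of updates per epoch.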

### **30.1.3 Normalization/Standardization**

Neural networks train best when inputs are scaled to similar ranges. For time‑series, we typically scale each feature independently using statistics computed on the **training set only**. We then apply the same transformation to validation and test sets.

- **Standardization (Z‑score):** `(x - mean) / std` – good for features with different units.
- **Min‑max scaling:** `(x - min) / (max - min)` – scales to [0,1]; sensitive to outliers.

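For financial returns, standardization is common because returns are already centered near zero. Raw prices can be standardized too, but because prices trend, training-set statistics may no longer describe later periods; this is one reason returns are usually the safer input.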

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
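# Fit on training data only (flatten to 2D: samples * timesteps, features)
X_train_2d = X_train.reshape(-1, X_train.shape[-1])
scaler.fit(X_train_2d)
X_train_scaled = scaler.transform(X_train_2d).reshape(X_train.shape)
# Reuse the training statistics for validation and test data
X_val_scaled = scaler.transform(X_val.reshape(-1, X_val.shape[-1])).reshape(X_val.shape)
X_test_scaled = scaler.transform(X_test.reshape(-1, X_test.shape[-1])).reshape(X_test.shape)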
```

If the target is also scaled, remember to inverse‑transform predictions for final evaluation.
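As a minimal sketch of that round trip, the example below standardizes a target with training-set statistics and maps scaled predictions back to the original units. It uses plain NumPy, which for a single column matches what `StandardScaler` computes; the numbers and the helper names `scale_target` and `unscale_target` are made up for illustration.

```python
import numpy as np

# Made-up training targets (daily returns)
y_train = np.array([0.010, -0.020, 0.005, 0.015])

# Statistics come from the training targets only
y_mean, y_std = y_train.mean(), y_train.std()

def scale_target(y):
    """Standardize targets with the training-set mean and std."""
    return (y - y_mean) / y_std

def unscale_target(y_scaled):
    """Map scaled values (e.g. model predictions) back to original units."""
    return y_scaled * y_std + y_mean

# Pretend these came from model.predict on scaled inputs
y_pred_scaled = np.array([0.5, -1.0])
y_pred = unscale_target(y_pred_scaled)
print(y_pred)  # predictions expressed as returns again
```

Error metrics such as MAE should be computed on `y_pred`, not on `y_pred_scaled`, so that they are reported in the units of the original series.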

---

## **30.2 Batch Size Selection**

The batch size is a critical hyperparameter. Larger batches provide more accurate gradient estimates but require more memory and, at a fixed learning rate, sometimes generalize worse. Smaller batches introduce gradient noise that can act as a regularizer.

**Guidelines:**
- Start with a batch size of 32 or 64.
- If you have a lot of memory, try 128 or 256.
- Monitor training and validation loss; if validation loss is much higher than training, you may be overfitting – try reducing batch size or increasing regularization.
- For time‑series, avoid extremely large batches that may not represent the full variety of temporal patterns.

```python
# Experiment with different batch sizes
batch_sizes = [16, 32, 64]
histories = {}
for bs in batch_sizes:
    model = create_model()  # your model definition
    history = model.fit(X_train, y_train, batch_size=bs, epochs=50, validation_data=(X_val, y_val), verbose=0)
    histories[bs] = history.history['val_loss'][-1]  # final validation loss
    print(f"Batch size {bs}: final val loss = {histories[bs]:.4f}")
```

**Explanation:** This simple loop can help you choose a batch size that gives the lowest validation loss. However, batch size interacts with learning rate; often you need to adjust the learning rate when changing batch size.
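One common starting point for that adjustment is to scale the learning rate with the batch size, either linearly or by the square root of the ratio. The sketch below is only a heuristic, not a rule the chapter's results depend on; the function name `scale_learning_rate` and the reference values (0.001 at batch size 32) are assumptions for illustration, and the scaled rate should still be validated.

```python
import math

def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Heuristically rescale a learning rate tuned at `base_batch_size`."""
    ratio = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Assume 0.001 worked well at batch size 32
for bs in (16, 32, 64, 128):
    linear = scale_learning_rate(0.001, 32, bs, rule="linear")
    sqrt = scale_learning_rate(0.001, 32, bs, rule="sqrt")
    print(f"batch size {bs}: linear={linear:.5f}, sqrt={sqrt:.5f}")
```

In a batch size sweep like the one above, the scaled rate could be passed to the optimizer for each run so that batch sizes are compared on more equal terms.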

---

## **30.3 Learning Rate Scheduling**

The learning rate determines how much we adjust the weights in response to the gradient. Too high and training diverges; too low and training stalls. Using a schedule that reduces the learning rate over time can improve convergence.

### **30.3.1 Fixed Learning Rate with Decay**

We can start with a higher learning rate and exponentially decay it.

```python
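import math

def lr_scheduler(epoch, lr):
    # Hold the initial rate for 10 epochs, then decay by exp(-0.1) each epoch
    if epoch < 10:
        return lr
    return float(lr * math.exp(-0.1))  # the callback expects a plain float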

callback = tf.keras.callbacks.LearningRateScheduler(lr_scheduler)
model.fit(..., callbacks=[callback])
```
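To see what this hold-then-decay rule produces, the closed form can be tabulated in plain Python. This is a sketch under the assumption that no other callback modifies the learning rate; `lr_at_epoch` is a hypothetical helper, not part of Keras.

```python
import math

def lr_at_epoch(initial_lr, epoch, hold_epochs=10, decay=0.1):
    """Learning rate in effect during `epoch` (0-indexed) when the rate is
    held for `hold_epochs` epochs and then multiplied by exp(-decay) at the
    start of every later epoch."""
    decay_steps = max(0, epoch - hold_epochs + 1)
    return initial_lr * math.exp(-decay * decay_steps)

for epoch in (0, 9, 10, 19, 29, 49):
    print(f"epoch {epoch:2d}: lr = {lr_at_epoch(0.001, epoch):.6f}")
```

With these settings the rate falls by a factor of about e (roughly 2.7) every 10 epochs after the hold period, which is worth checking against the number of epochs you plan to train.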

### **30.3.2 Reduce on Plateau**

Reduce the learning rate when validation loss stops improving.

```python
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6
)
```

### **30.3.3 Cyclical Learning Rates**

Cyclical learning rates vary the learning rate between bounds, which can help escape local minima.

```python
# Using custom callback or tensorflow_addons
```
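A minimal sketch of the triangular cyclical policy is shown below as a pure function of the training step. The bounds, the `step_size`, and the name `triangular_lr` are illustrative assumptions; wiring the value into the optimizer (for example from a custom per-batch callback) is left out.

```python
import math

def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate climbs linearly from `base_lr` to `max_lr` over `step_size`
    steps, then descends back over the next `step_size` steps, and repeats.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle takes 2 * step_size = 200 steps
for step in (0, 50, 100, 150, 200, 300):
    print(step, triangular_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=100))
```

A frequently suggested choice is to set `step_size` to a few epochs' worth of batches, but the right bounds depend on the model and should be found experimentally.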

### **30.3.4 Adaptive Optimizers**

Optimizers like Adam, RMSprop, and AdaGrad automatically adjust learning rates per parameter. They are often good defaults, but still benefit from global scheduling.

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse')
```

---

## **30.4 Loss Function Selection**

The loss function should match the prediction task and the desired output distribution.

### **30.4.1 Regression**

- **Mean Squared Error (MSE):** `'mse'` – sensitive to outliers.
- **Mean Absolute Error (MAE):** `'mae'` – more robust to outliers.
- **Huber Loss:** quadratic (like MSE) for small errors and linear (like MAE) for errors larger than a threshold `delta`; less sensitive to outliers than MSE.

```python
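# delta is the error size at which the loss switches from quadratic to linear
model.compile(loss=tf.keras.losses.Huber(delta=1.0), optimizer='adam')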
```

### **30.4.2 Classification**

- **Binary crossentropy:** for binary classification (direction).
- **Categorical crossentropy:** for multi‑class (e.g., up/flat/down).
- **Sparse categorical crossentropy:** when labels are integers.

```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

### **30.4.3 Probabilistic Forecasting**

Use negative log‑likelihood of a chosen distribution. For example, for Gaussian output:

```python
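import tensorflow_probability as tfp

def negative_log_likelihood(y_true, y_pred):
    # y_pred holds two values per sample: predicted mean and log std
    mean = y_pred[..., 0]
    log_std = y_pred[..., 1]
    std = tf.exp(log_std)
    # Match y_true to mean's shape so (batch, 1) targets do not broadcast
    y_true = tf.reshape(y_true, tf.shape(mean))
    return -tf.reduce_mean(tfp.distributions.Normal(mean, std).log_prob(y_true))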
```
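For reference, the Gaussian negative log-likelihood has a simple closed form, which the NumPy sketch below writes out term by term. It is meant as a sanity check on the idea rather than a replacement for the loss above; the function name `gaussian_nll` and the sample values are made up.

```python
import numpy as np

def gaussian_nll(y_true, mean, log_std):
    """Mean negative log-likelihood of y_true under Normal(mean, exp(log_std)).

    Per sample: 0.5 * log(2 * pi) + log_std + 0.5 * ((y - mean) / std) ** 2
    """
    std = np.exp(log_std)
    z = (y_true - mean) / std
    return np.mean(0.5 * np.log(2.0 * np.pi) + log_std + 0.5 * z ** 2)

y_true = np.array([0.01, -0.02, 0.00])
mean = np.array([0.00, -0.01, 0.01])

# A confident (small std) forecast is rewarded when it is right
# and penalized heavily when it is wrong.
print(gaussian_nll(y_true, mean, log_std=np.log(0.01)))
print(gaussian_nll(y_true, mean, log_std=np.log(0.001)))
```

The `log_std` term keeps the model from inflating its predicted uncertainty without cost, while the squared term punishes overconfident misses.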

---

## **30.5 Early Stopping**

Early stopping halts training when a monitored metric (e.g., validation loss) stops improving for a certain number of epochs (patience). This prevents overfitting and saves time.

```python
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True  # revert to best model
)

history = model.fit(..., callbacks=[early_stop])
```

**Explanation:** `restore_best_weights=True` ensures that after stopping, the model weights are restored to those from the epoch with the lowest validation loss, rather than left at the weights of the final, usually worse, epoch. This is crucial for getting the best model.
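The patience logic is easy to misjudge, so here is a simplified pure-Python simulation of it. It is a sketch of the idea (stop after `patience` consecutive epochs without a new best), not the Keras implementation; the function name and the loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation curve with a late improvement at epoch 6
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.69, 0.9]

print(simulate_early_stopping(val_losses, patience=3))  # (5, 2)
print(simulate_early_stopping(val_losses, patience=5))  # (7, 6)
```

With `patience=3` training stops before the better epoch 6 is ever reached, which is why noisy validation curves usually call for a larger patience.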

---

## **30.6 Checkpointing**

Model checkpointing saves the model (or just weights) periodically. This is useful for resuming interrupted training or for later analysis.

```python
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True,
    mode='min',
    verbose=1
)

model.fit(..., callbacks=[checkpoint])
```

**Explanation:** `save_best_only=True` saves only when the monitored metric improves, so you end up with the best model on disk.

---

## **30.7 Mixed Precision Training**

Mixed precision training performs most computation in float16 while keeping the model variables in float32. On NVIDIA GPUs with Tensor Cores (Volta, Turing, Ampere and newer) it reduces memory usage and can speed up training substantially, usually with minimal loss in accuracy.

```python
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

# Build the model as usual, but keep the final layer in float32 for
# numerically stable outputs, e.g. layers.Dense(1, dtype='float32').
```

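**Note:** Ensure your GPU supports mixed precision. With `model.fit`, Keras applies dynamic loss scaling automatically under the `mixed_float16` policy; in a custom training loop you need to wrap the optimizer in `mixed_precision.LossScaleOptimizer` yourself.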
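Why loss scaling matters can be shown with NumPy alone: very small gradient values underflow to zero in float16, and multiplying by a scale factor before the cast keeps them representable. The gradient value and the scale of 1024 below are illustrative assumptions, not values Keras necessarily uses.

```python
import numpy as np

tiny_grad = 1e-8  # a gradient value too small for float16

# Cast directly: the value underflows to zero and the update is lost
print(np.float16(tiny_grad))  # 0.0

# Scale first, cast to float16, then unscale in float32
loss_scale = 1024.0
scaled_fp16 = np.float16(tiny_grad * loss_scale)
recovered = np.float32(scaled_fp16) / loss_scale

print(scaled_fp16 > 0)  # True
print(recovered)        # close to 1e-8 again
```

Scaling the loss has the same effect on every gradient via the chain rule, which is why the scale is applied once to the loss and divided out before the weight update.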

---

## **30.8 Distributed Training**

When you have multiple GPUs, you can distribute training to speed up large models or datasets.

### **30.8.1 MirroredStrategy**

The simplest strategy for synchronous training on a single machine with multiple GPUs.

```python
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = create_model()
    model.compile(...)

model.fit(...)
```

**Explanation:** Inside the strategy scope, all variables are mirrored across GPUs, and gradients are aggregated.

### **30.8.2 MultiWorkerMirroredStrategy**

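For synchronous multi-machine training. Each worker runs the same script and finds its peers through the `TF_CONFIG` environment variable.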

```python
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = create_model()  # your model definition
    model.compile(optimizer='adam', loss='mse')
```

### **30.8.3 ParameterServerStrategy**

For asynchronous training, where workers compute gradients independently and apply them to variables hosted on parameter servers.

For NEPSE, with a small dataset, distributed training is overkill, but it's good to know for scaling.

---

## **30.9 Training Monitoring**

Monitoring training helps you detect issues early. Use TensorBoard to visualize losses, metrics, and even histograms of weights.

```python
tensorboard = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)

model.fit(..., callbacks=[tensorboard])
```

Then run `tensorboard --logdir ./logs` in the terminal.

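You can also log custom scalars, such as the current learning rate, by writing them with `tf.summary.scalar` from a custom callback.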

### **30.9.1 What to Monitor**

- Training and validation loss – look for overfitting (val loss increasing).
- Learning rate (if scheduled).
- Gradient norms – if they become very small or very large, there may be vanishing/exploding gradients.
- Weights and biases distributions – can indicate saturation.

---

## **30.10 Debugging Training**

Common issues and how to fix them:

### **30.10.1 Loss Not Decreasing**

- Learning rate too low or too high.
- Data not normalized.
- Wrong loss function.
- Model too shallow (underfitting).

### **30.10.2 NaN Loss**

- Exploding gradients – reduce learning rate, add gradient clipping, check data for NaNs.
- Numerical instability – with mixed precision, keep loss scaling enabled and the output layer in float32; add a small epsilon to denominators and inside logarithms.

### **30.10.3 Overfitting**

- Add regularization (dropout, weight decay).
- Reduce model capacity.
- Increase training data (augmentation).
- Early stopping.

### **30.10.4 Underfitting**

- Increase model capacity.
- Train longer.
- Adjust learning rate.

### **30.10.5 Vanishing Gradients**

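- Use ReLU activations instead of sigmoid/tanh in feed-forward layers (LSTM and GRU cells keep their gated sigmoid/tanh design, which is itself a remedy for vanishing gradients).
- Add batch normalization (or layer normalization for recurrent layers).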
- Use residual connections.
- Check initial weights.

### **30.10.6 Gradient Clipping**

To prevent exploding gradients, clip gradients during optimization.

```python
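# Option 1: rescale each gradient tensor so its L2 norm is at most 1.0
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# Option 2: clamp every gradient element to the range [-0.5, 0.5]
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)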
```
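Clipping by norm can be applied per tensor or across all tensors at once, and the two behave differently. In the Keras optimizers, `clipnorm` is documented as clipping each weight's gradient individually, while `global_clipnorm` clips the norm of all gradients taken together. The NumPy sketch below illustrates the difference with made-up gradients; the helper names are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, clip):
    """Rescale one gradient tensor so its L2 norm is at most `clip`."""
    norm = np.linalg.norm(grad)
    return grad * (clip / norm) if norm > clip else grad

def clip_by_global_norm(grads, clip):
    """Rescale all tensors by one shared factor so that the L2 norm of the
    concatenated gradients is at most `clip`."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip / global_norm)
    return [g * scale for g in grads], global_norm

# Two made-up gradient tensors with norms 5.0 and 0.5
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]

per_tensor = [clip_by_norm(g, 1.0) for g in grads]
global_clipped, global_norm = clip_by_global_norm(grads, 1.0)

print(per_tensor)       # only the first tensor is rescaled
print(global_clipped)   # both tensors shrink by the same factor
```

Per-tensor clipping changes the relative size of the two gradients, whereas global clipping preserves the overall update direction and only shortens the step.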

### **30.10.7 Debugging with a Small Subset**

If the model doesn't learn, try overfitting a single batch. If it can't overfit one batch, there's a bug.

```python
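# Take one small batch from the training arrays
x_batch, y_batch = X_train[:32], y_train[:32]
model.fit(x_batch, y_batch, batch_size=32, epochs=50, verbose=0)
# The loss on this batch should approach zero; if not, suspect a bug
print(model.evaluate(x_batch, y_batch, verbose=0))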
```

---

## **30.11 Putting It All Together: NEPSE Training Pipeline**

Let's combine the best practices into a complete training pipeline for an LSTM on NEPSE returns.

```python
import tensorflow as tf
from tensorflow.keras import layers, callbacks
from sklearn.preprocessing import StandardScaler
import numpy as np

# Assume X_train, y_train, X_val, y_val, X_test, y_test are prepared and scaled

# Define model
def create_lstm_model(input_shape):
    model = tf.keras.Sequential([
        layers.LSTM(64, return_sequences=True, input_shape=input_shape),
        layers.Dropout(0.3),
        layers.LSTM(32, return_sequences=False),
        layers.Dropout(0.3),
        layers.Dense(1)
    ])
    return model

input_shape = (X_train.shape[1], X_train.shape[2])
model = create_lstm_model(input_shape)

# Optimizer: start from a fixed rate so that ReduceLROnPlateau can lower it.
# (A LearningRateSchedule object such as ExponentialDecay makes the optimizer's
# learning rate non-settable, which conflicts with ReduceLROnPlateau.)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=optimizer, loss='huber', metrics=['mae'])

# Callbacks
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)
checkpoint = callbacks.ModelCheckpoint('best_lstm.h5', monitor='val_loss', save_best_only=True)
tensorboard = callbacks.TensorBoard(log_dir='./logs/lstm')

# Train
history = model.fit(
    X_train, y_train,
    batch_size=32,
    epochs=100,
    validation_data=(X_val, y_val),
    callbacks=[early_stop, reduce_lr, checkpoint, tensorboard],
    verbose=1
)

# Evaluate on test
test_loss, test_mae = model.evaluate(X_test, y_test)
print(f"Test MAE: {test_mae:.4f}")
```

**Explanation:**

- We start Adam at a learning rate of 0.001 and let ReduceLROnPlateau halve it whenever validation loss stalls for 5 epochs. We do not also pass an `ExponentialDecay` schedule to the optimizer, because ReduceLROnPlateau needs a learning rate it can overwrite.
- Dropout is applied after each LSTM layer.
- Huber loss is less sensitive to outliers than MSE.
- Early stopping prevents overfitting; checkpoint saves the best model.
- TensorBoard logs for monitoring.

---

## **30.12 Chapter Summary**

In this chapter, we covered essential best practices for training neural networks, particularly for time‑series forecasting with the NEPSE dataset.

- **Data preparation:** correct scaling, batching, and shuffling strategies.
- **Batch size selection:** impact on training dynamics and how to choose.
- **Learning rate scheduling:** fixed decay, reduce on plateau, adaptive optimizers.
- **Loss functions:** choosing based on task (regression, classification, probabilistic).
- **Early stopping** to avoid overfitting.
- **Checkpointing** to preserve the best model.
- **Mixed precision** for faster training on compatible hardware.
- **Distributed training** for scaling.
- **Monitoring** with TensorBoard.
- **Debugging** common training issues.

### **Practical Takeaways for the NEPSE System:**

- Always scale features using training set statistics.
- Start with Adam and a moderate learning rate (0.001).
- Use early stopping with patience 10‑20.
- Monitor both training and validation loss; if they diverge, increase regularization.
- Experiment with batch size (32, 64) and learning rate schedules.
- For small datasets like a single stock, avoid overly complex models; strong regularization is key.

In the next chapter, **Chapter 31: Evaluation Metrics for Regression**, we will dive into the metrics used to assess forecast accuracy, including MAE, RMSE, MAPE, and specialized financial metrics.

---

**End of Chapter 30**