# **Chapter 25: Neural Network Fundamentals**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the biological inspiration and basic architecture of artificial neural networks
- Explain the role of perceptrons and activation functions in introducing non‑linearity
- Build a multi‑layer perceptron (MLP) for regression and classification tasks using the NEPSE dataset
- Grasp the concept of backpropagation and how neural networks learn from data
- Choose appropriate loss functions for different prediction problems
- Implement common optimization algorithms (SGD, Adam, RMSprop) and understand their differences
- Apply regularization techniques (dropout, batch normalization, early stopping) to prevent overfitting
- Train a neural network effectively, including data preparation, scaling, and monitoring
- Recognize and avoid common pitfalls when applying neural networks to time‑series data

---

## **25.1 Introduction to Neural Networks**

Artificial neural networks (ANNs) are a class of machine learning models loosely inspired by the biological neural networks in animal brains. They consist of interconnected units called **neurons** (or nodes) organized in layers. Each connection has a weight that is adjusted during learning. Neural networks are universal function approximators: with enough hidden neurons, a feedforward network can approximate any continuous function on a bounded input range to arbitrary accuracy. This is a statement about what the network can represent, not a guarantee that training will find those weights.

In the context of the NEPSE prediction system, neural networks can capture complex, non‑linear relationships between engineered features (lagged returns, volume, technical indicators) and future returns or price direction. They can outperform traditional models when sufficient data is available, but they require careful tuning and regularization to avoid overfitting, especially with noisy financial time‑series.

### **25.1.1 The Building Blocks**

A neural network is composed of:

- **Input layer:** Each neuron corresponds to a feature (e.g., `Return_Lag1`, `RSI`, etc.).
- **Hidden layers:** One or more layers between input and output, where the network learns representations.
- **Output layer:** Produces the final prediction (e.g., a single neuron for regression, or multiple for classification).

Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an **activation function** to introduce non‑linearity.

---

## **25.2 Perceptrons and Activation Functions**

### **25.2.1 The Perceptron**

The perceptron is the simplest form of a neural network – a single neuron. It takes input vector `x`, multiplies by weights `w`, adds bias `b`, and outputs:

`output = activation(w·x + b)`

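For binary classification, the classic perceptron uses a step function (e.g., output 1 if `w·x + b > 0`, else 0). However, the step function's gradient is zero everywhere except at the jump, where it is undefined, so gradient-based training gets no signal from it. Modern networks therefore use smooth, or at least piecewise-differentiable, activation functions.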
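As a minimal sketch of the formula above, the following plain-Python perceptron uses a step activation to reproduce the logical AND of two binary inputs. The function names and the weight values are illustrative assumptions; the weights are hand-picked here, not learned.

```python
def step(z):
    """Step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def perceptron(x, w, b, activation=step):
    """Compute activation(w·x + b) for a single neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Hand-picked weights and bias (an assumption for illustration, not learned values)
w_and, b_and = [1.0, 1.0], -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(x, w_and, b_and) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input `(1, 1)` pushes the weighted sum above zero, so the neuron fires for that case alone.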

### **25.2.2 Common Activation Functions**

- **Sigmoid:** `σ(z) = 1 / (1 + e⁻ᶻ)`. Outputs between 0 and 1. Used for binary classification output. Suffers from vanishing gradient for large |z|.
- **Tanh:** `tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)`. Outputs between -1 and 1. Zero‑centered, but still saturates.
- **ReLU (Rectified Linear Unit):** `ReLU(z) = max(0, z)`. Most popular for hidden layers. Does not saturate for positive inputs and is computationally cheap. Can cause dead neurons: if a neuron's pre‑activation is negative for every input, its gradient is always zero and it stops learning.
- **Leaky ReLU:** `max(αz, z)` with small α (e.g., 0.01) to allow gradient for negative inputs.
- **Softmax:** Used in output layer for multi‑class classification; converts logits to probabilities summing to 1.
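The activation functions listed above are short enough to write out directly. The sketch below uses only the standard library and scalar inputs to keep the formulas visible; in practice the framework's vectorized versions would be used. The simple `sigmoid` shown here is fine for moderate `z` but would overflow for very large negative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def softmax(logits):
    # Subtract the largest logit first for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4), round(tanh(z), 4), relu(z), leaky_relu(z))

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

At `z = -5` and `z = 5` the sigmoid and tanh outputs are already close to their limits, which is the saturation that leads to small gradients.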

For the NEPSE prediction, we typically use ReLU for hidden layers. For regression output (predicting return), we use a linear activation (no activation). For binary classification (direction), we use sigmoid.

---

## **25.3 Multi‑Layer Perceptrons (MLP)**

An MLP is a feedforward neural network with one or more hidden layers. Each layer is fully connected to the next. The network learns by adjusting weights to minimize a loss function.

### **25.3.1 Building an MLP for NEPSE Return Prediction**

We'll use TensorFlow/Keras to build a simple MLP for regression (predicting next day's return). We'll reuse the same feature set as in previous chapters.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load and prepare NEPSE data (as before)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Use a single symbol for simplicity
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Create features and target
df_stock['Return'] = df_stock['Close'].pct_change() * 100
df_stock['Return_Lag1'] = df_stock['Return'].shift(1)
df_stock['Return_Lag2'] = df_stock['Return'].shift(2)
df_stock['Volume_Lag1'] = df_stock['Vol'].shift(1)
df_stock['MA_5'] = df_stock['Close'].rolling(5).mean()
df_stock['Volatility_5'] = df_stock['Return'].rolling(5).std()
# RSI
delta = df_stock['Close'].diff()
gain = delta.where(delta > 0, 0)
loss = -delta.where(delta < 0, 0)
avg_gain = gain.rolling(14).mean()
avg_loss = loss.rolling(14).mean()
rs = avg_gain / avg_loss
df_stock['RSI'] = 100 - (100 / (1 + rs))

# Target: next day's return
df_stock['Target'] = df_stock['Return'].shift(-1)

# Drop NaN
df_stock = df_stock.dropna()

# Feature columns
feature_cols = ['Return_Lag1', 'Return_Lag2', 'Volume_Lag1', 'MA_5', 'Volatility_5', 'RSI']
X = df_stock[feature_cols].values
y = df_stock['Target'].values

# Temporal split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build MLP model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)  # linear activation for regression
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
```

**Explanation:**

- We define a sequential model with two hidden layers (64 and 32 neurons) using ReLU activation.
- The output layer has a single neuron with linear activation (default), suitable for regression.
- We compile with Adam optimizer, mean squared error loss, and track mean absolute error as a metric.
- `model.summary()` prints the architecture, showing the number of parameters.
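The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name below is an illustrative assumption; the layer sizes are the six features and the 64, 32, and 1 units used in the model above.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [6, 64, 32, 1]  # 6 features -> 64 -> 32 -> 1
per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(per_layer)
print(per_layer, total_params)  # [448, 2080, 33] 2561
```

Even this small network has a few thousand trainable parameters, which is worth comparing with the number of training rows available for a single stock.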

### **25.3.2 Training the Model**

We'll train for a fixed number of epochs and track the validation loss so that overfitting shows up in the learning curves.

```python
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,  # use 20% of training for validation
    epochs=100,
    batch_size=32,
    verbose=0  # set to 1 to see progress
)

# Plot training history
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['mae'], label='train_mae')
plt.plot(history.history['val_mae'], label='val_mae')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()
plt.show()

# Evaluate on test set
test_loss, test_mae = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Test MSE: {test_loss:.4f}, Test MAE: {test_mae:.4f}")

# Predictions
y_pred = model.predict(X_test_scaled).flatten()
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: {rmse:.4f}")
```

**Explanation:**

- `validation_split=0.2` holds out the last 20% of the training samples for validation. Keras takes this slice before any shuffling, so with chronologically ordered data the validation period follows the fitting period, as a time‑series split should. `fit` still shuffles the remaining training rows between epochs by default, which is harmless for an MLP that treats each row independently. An explicit manual split passed through `validation_data` makes the boundary easier to control and inspect.
- We train for 100 epochs and watch whether the validation loss plateaus or starts increasing, which signals overfitting.
- The history plot shows learning curves.
- Test evaluation gives final performance.
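A manual temporal split, as suggested in the first point above, could look like the following sketch. The function name `temporal_split` is hypothetical; it only computes index lists, so it can be applied to any chronologically ordered arrays.

```python
def temporal_split(n_samples, val_fraction=0.2):
    """Return (train_indices, val_indices) with validation strictly after training."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    n_val = int(round(n_samples * val_fraction))
    cut = n_samples - n_val
    return list(range(cut)), list(range(cut, n_samples))

train_idx, val_idx = temporal_split(10, val_fraction=0.2)
print(train_idx, val_idx)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

The resulting slices of `X_train_scaled` and `y_train` can then be passed to `model.fit` through its `validation_data=(X_val, y_val)` argument instead of `validation_split`.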

---

## **25.4 Backpropagation**

Backpropagation is the algorithm used to train neural networks. It computes the gradient of the loss function with respect to each weight by applying the chain rule of calculus, propagating errors backward from the output layer to the input layer. These gradients are then used by an optimizer to update weights.

While we don't implement backpropagation manually (frameworks like TensorFlow do it automatically), understanding it helps in debugging and choosing hyperparameters.

**Key steps:**

1. **Forward pass:** Compute predictions and loss.
2. **Backward pass:** Compute gradients of loss w.r.t. each weight.
3. **Update weights:** Adjust weights in the opposite direction of the gradient (gradient descent).
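The three steps can be traced by hand on a tiny network. The sketch below assumes a scalar input, one hidden ReLU unit, and a squared-error loss; the parameter values are arbitrary. It computes the gradients with the chain rule, checks them against finite differences, and applies one gradient-descent update.

```python
def forward(x, params):
    """y_hat = w2 * relu(w1 * x + b1) + b2 for scalar inputs."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = max(0.0, z)
    y_hat = w2 * h + b2
    return z, h, y_hat

def loss_fn(x, y, params):
    return (forward(x, params)[2] - y) ** 2

def backward(x, y, params):
    """Gradients of the squared error w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    z, h, y_hat = forward(x, params)
    d_yhat = 2.0 * (y_hat - y)            # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat       # output layer
    d_h = d_yhat * w2                     # error propagated to the hidden output
    d_z = d_h * (1.0 if z > 0 else 0.0)   # ReLU derivative
    d_w1, d_b1 = d_z * x, d_z             # hidden layer
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_grad(x, y, params, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(x, y, up) - loss_fn(x, y, down)) / (2 * eps))
    return grads

params = [0.5, 0.1, -1.2, 0.3]   # arbitrary starting values
x, y = 2.0, 1.0

analytic = backward(x, y, params)          # backward pass
numeric = numerical_grad(x, y, params)

lr = 0.01
new_params = [p - lr * g for p, g in zip(params, analytic)]  # update step

print(loss_fn(x, y, params), loss_fn(x, y, new_params))
```

The loss after the update is lower than before, and the analytic gradients agree with the finite-difference estimates, which is a common sanity check when implementing gradients by hand.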

---

## **25.5 Loss Functions**

The loss function measures how well the model's predictions match the true targets. Choosing the right loss is crucial.

### **25.5.1 Regression Losses**

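- **Mean Squared Error (MSE):** `mean((y_true - y_pred)²)`, the average squared error. It penalizes large errors heavily, so it is sensitive to outliers.
- **Mean Absolute Error (MAE):** `mean(|y_true - y_pred|)`, the average absolute error. More robust to outliers.
- **Huber loss:** Quadratic for errors smaller than a threshold `δ` and linear beyond it, so it behaves like MSE near zero and like MAE for large errors. Less sensitive to outliers than MSE.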

For NEPSE return prediction, MSE is common, but Huber may be better if there are extreme returns.
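To see how differently the three losses react to one extreme return, the sketch below implements them in plain Python. The return values are made up for illustration, and `delta=1.0` is an assumed threshold for the Huber loss.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2               # quadratic region
        else:
            total += delta * (err - 0.5 * delta)  # linear region
    return total / len(y_true)

# Made-up daily returns in percent; the last one is an extreme move
y_true = [0.5, -0.3, 0.2, 8.0]
y_pred = [0.0, 0.0, 0.0, 0.0]

print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

The single 8% move accounts for almost all of the MSE, while MAE and Huber grow only linearly with it.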

### **25.5.2 Classification Losses**

- **Binary Crossentropy:** For binary classification (direction). Measures the difference between true labels (0/1) and predicted probabilities.
- **Categorical Crossentropy:** For multi‑class classification.

For direction prediction, we use binary crossentropy.
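Binary crossentropy is the average of `-[y·log(p) + (1 - y)·log(1 - p)]` over the samples. The sketch below computes it by hand for three sets of made-up predictions; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
uninformed = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])
confident_right = binary_crossentropy(labels, [0.9, 0.1, 0.9, 0.1])
confident_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.1, 0.9])

print(round(uninformed, 4), round(confident_right, 4), round(confident_wrong, 4))
```

A model that always predicts 0.5 scores `ln 2 ≈ 0.693`, which is a useful reference: a direction classifier whose validation loss stays near that value has learned little beyond a coin flip.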

### **25.5.3 Example: Binary Classification with MLP**

```python
# Prepare binary target
y_binary = (df_stock['Target'] > 0).astype(int).values
y_train_bin, y_test_bin = y_binary[:split_idx], y_binary[split_idx:]

# Build classifier
model_clf = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # sigmoid for probability
])

model_clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history_clf = model_clf.fit(
    X_train_scaled, y_train_bin,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    verbose=0
)

# Evaluate
test_loss_clf, test_acc = model_clf.evaluate(X_test_scaled, y_test_bin, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
```

---

## **25.6 Optimization Algorithms**

Optimizers update the weights to minimize the loss. They differ in how they use gradients and adapt learning rates.

### **25.6.1 Stochastic Gradient Descent (SGD)**

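Basic SGD updates weights using the gradient of the loss on a mini‑batch:  
`w = w - η * ∇L(w)`

where `η` is the learning rate and `∇L(w)` is the gradient of the mini‑batch loss with respect to the weights. Plain SGD can converge slowly and may oscillate when the loss surface is much steeper in some directions than in others.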
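The update rule above can be sketched on a one-parameter problem. The example below fits `y ≈ w * x` with mini-batch SGD on noiseless synthetic data whose true slope is 3; the function name, data, and hyperparameters are illustrative assumptions.

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=50, batch_size=4, seed=0):
    """Fit y ≈ w * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the batch with respect to w
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [i / 10.0 for i in range(-10, 11)]   # 21 points in [-1, 1]
ys = [3.0 * x for x in xs]                # noiseless target, true slope 3
w_hat = sgd_fit(xs, ys)
print(round(w_hat, 3))
```

Each mini-batch gives a slightly different gradient, which is the "stochastic" part; with noisy targets the estimate would keep fluctuating around the optimum instead of settling exactly.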

### **25.6.2 Adam (Adaptive Moment Estimation)**

Adam combines momentum and adaptive learning rates. It maintains moving averages of gradients and squared gradients. It often works well out‑of‑the‑box and is a good default choice.
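The two moving averages can be written out in a few lines. The sketch below applies the Adam update to a single parameter on the toy objective `f(w) = (w - 3)²`. The `beta1`, `beta2`, and `eps` values are the commonly used defaults; the learning rate of 0.1 is an assumption chosen for this toy problem only.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g       # moving average of gradients
        v = beta2 * v + (1.0 - beta2) * g * g   # moving average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)          # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3)
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_adam, 4))
```

Dividing by the square root of `v_hat` makes the step size roughly independent of the gradient's scale, which is one reason Adam tends to need less learning-rate tuning than plain SGD.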

### **25.6.3 RMSprop**

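Like Adam, RMSprop adapts the learning rate per parameter by dividing each gradient by the root of a moving average of recent squared gradients. In its basic form it omits Adam's moving average of the gradients themselves (the momentum term) and the bias correction.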

In Keras, we can easily use different optimizers:

```python
# SGD
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss='mse')

# Adam (default lr=0.001)
model.compile(optimizer='adam', loss='mse')

# RMSprop
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001), loss='mse')
```

### **25.6.4 Learning Rate Scheduling**

Adjusting the learning rate during training can improve convergence. Common schedules:

- Step decay: reduce by factor every few epochs.
- Exponential decay.
- Reduce on plateau: reduce when validation loss stagnates.
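The first two schedules are simple functions of the epoch index. The sketch below shows one possible form of each; the function names and the default factors are illustrative assumptions.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(initial_lr, epoch, decay_rate=0.95):
    """Multiply the rate by `decay_rate` every epoch."""
    return initial_lr * decay_rate ** epoch

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.01, epoch), round(exponential_decay(0.01, epoch), 6))
```

In Keras, a schedule like this can be wrapped in a function of the epoch index and passed to `keras.callbacks.LearningRateScheduler`.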

```python
# Example: reduce learning rate when validation loss plateaus
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6
)

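# Same training call as in Section 25.3.2, with the callback attached
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2, epochs=100, batch_size=32,
    callbacks=[reduce_lr], verbose=0
)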
```

---

## **25.7 Regularization in Neural Networks**

Neural networks are prone to overfitting, especially with small datasets. Regularization techniques help.

### **25.7.1 Dropout**

During training, dropout randomly sets a fraction of neuron outputs to zero on each forward pass; at inference time all neurons are active. This discourages co‑adaptation between neurons and behaves roughly like averaging an ensemble of thinned networks.

```python
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    layers.Dropout(0.3),  # 30% dropout
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1)
])
```
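To make the mechanism concrete, here is a plain-Python sketch of inverted dropout, the variant in which kept values are scaled up by `1 / (1 - rate)` during training so that their expected sum is unchanged (the Keras `Dropout` layer applies the same rescaling). The function name and the all-ones input are illustrative assumptions.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass values through unchanged
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
hidden = [1.0] * 10000

dropped = dropout(hidden, rate=0.3, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
inference = dropout(hidden, rate=0.3, rng=rng, training=False)

print(round(zero_fraction, 3), round(mean_after, 3))
```

Roughly 30% of the values are zeroed, yet the mean stays close to 1, so the next layer sees inputs on the same scale in training and at inference.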

### **25.7.2 Batch Normalization**

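Batch normalization standardizes the inputs to a layer using the mean and variance of the current mini‑batch, then applies a learned scale and shift. This stabilizes training and allows higher learning rates. It also has a slight regularizing effect.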

```python
model = keras.Sequential([
    layers.Dense(64, input_shape=(X_train_scaled.shape[1],)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(32),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dense(1)
])
```
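The training-time computation for a single feature can be sketched as follows. `gamma` and `beta` stand for the learned scale and shift, fixed here at 1 and 0; `eps=1e-3` and the batch values are assumptions for illustration. At inference time the layer uses moving averages of the mean and variance instead of batch statistics, which this sketch does not cover.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance of the batch
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [120.0, 135.0, 150.0, 165.0, 180.0]   # made-up pre-activation values
normed = batch_norm(batch)

normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
print([round(v, 3) for v in normed], round(normed_mean, 6), round(normed_var, 6))
```

Whatever the scale of the incoming values, the normalized batch has mean close to 0 and variance close to 1.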

### **25.7.3 Early Stopping**

Stop training when validation performance stops improving, preventing overfitting.

```python
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

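# Same training call as in Section 25.3.2, with the callback attached
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2, epochs=100, batch_size=32,
    callbacks=[early_stop], verbose=0
)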
```
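The patience logic behind early stopping is easy to trace on a list of validation losses. The sketch below is a simplified imitation of the idea, not the Keras implementation; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) using 0-based epoch indices."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # improvement: reset counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch                   # patience exhausted
    return len(val_losses) - 1, best_epoch                 # never triggered

val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.90, 0.95]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

Training stops at epoch 4, two epochs after the best value at epoch 2. With `restore_best_weights=True`, Keras would then roll the model back to the weights from the best epoch.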

### **25.7.4 L1/L2 Weight Regularization**

Add a penalty to the loss for large weights, similar to Lasso/Ridge.

```python
from tensorflow.keras import regularizers

model = keras.Sequential([
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(X_train_scaled.shape[1],)),
    layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1)
])
```
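The penalty itself is simple arithmetic: an L2 regularizer adds `λ · Σ w²` to the loss and an L1 regularizer adds `λ · Σ |w|`. In the sketch below the weights and the data loss are hypothetical numbers chosen to show the relative sizes.

```python
def l2_penalty(weights, lam):
    """lam times the sum of squared weights (Ridge-style)."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """lam times the sum of absolute weights (Lasso-style)."""
    return lam * sum(abs(w) for w in weights)

weights = [0.5, -1.0, 2.0]   # hypothetical layer weights
data_loss = 0.40             # hypothetical MSE on a batch

total_l2 = data_loss + l2_penalty(weights, lam=0.001)
total_l1 = data_loss + l1_penalty(weights, lam=0.001)
print(round(total_l2, 5), round(total_l1, 5))  # 0.40525 0.4035
```

Because the L2 term grows with the square of each weight, the largest weight (2.0) contributes most of the penalty, so the optimizer is pushed to shrink large weights first.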

---

## **25.8 Training Neural Networks**

### **25.8.1 Data Preparation**

- Scale features to zero mean and unit variance (or to [0,1] range). Neural networks benefit from normalized inputs.
- For time‑series, maintain temporal order; do not shuffle across time.
- Create a validation set that is temporally after the training set.

### **25.8.2 Choosing Hyperparameters**

- **Number of layers and neurons:** Start simple (1‑2 hidden layers) and increase if underfitting. Too many neurons can overfit.
- **Batch size:** Typical values 32, 64, 128. Smaller batches introduce noise, which can help generalization; larger batches are more stable.
- **Learning rate:** Critical; too high causes divergence, too low leads to slow convergence. Often tuned via trial or learning rate finder.
- **Epochs:** Use early stopping to determine automatically.

### **25.8.3 Monitoring Training**

Always monitor both training and validation loss. If validation loss starts increasing while training loss continues decreasing, you're overfitting. Apply regularization or stop earlier.

### **25.8.4 Example with All Techniques**

```python
# Build a regularized MLP
model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])

callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)
]

history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=200,
    batch_size=32,
    callbacks=callbacks,
    verbose=0
)

# Evaluate
test_loss, test_mae = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Test MAE: {test_mae:.4f}")
```

---

## **25.9 Common Pitfalls**

### **25.9.1 Look‑Ahead Bias**

If a feature depends on information from the future (e.g., using tomorrow's high in today's features), the model will appear accurate in backtests but fail in production. Ensure every feature is computable at prediction time: use `shift()` for lags and rolling windows that end at or before the prediction date.

### **25.9.2 Data Leakage**

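Fitting the scaler on the entire dataset before splitting leaks test‑set statistics (mean and variance) into training. Always fit the scaler on the training set only, then apply the same transformation to validation and test data.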
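The effect is easy to see on a toy series whose level shifts in the test period. The sketch below uses made-up numbers and hypothetical helper names; it standardizes with the population standard deviation, as `StandardScaler` does.

```python
import statistics

def fit_scaler(values):
    """Return (mean, population standard deviation) of the given values."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

# Made-up series: the last two points (the "test period") sit at a new level
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0, 60.0]
train, test = series[:8], series[8:]

mean_train, std_train = fit_scaler(train)   # correct: training data only
mean_all, std_all = fit_scaler(series)      # leaky: includes the test period

scaled_train = transform(train, mean_train, std_train)
honest_scaled_test = transform(test, mean_train, std_train)
leaky_scaled_test = transform(test, mean_all, std_all)

print(mean_train, round(mean_all, 2))
print([round(v, 2) for v in honest_scaled_test], [round(v, 2) for v in leaky_scaled_test])
```

With training-only statistics the test points show up as far outside the training range, which is what the model would actually face in production. The leaky scaler hides that shift because it has already seen the future values.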

### **25.9.3 Overfitting to Noise**

Financial data is noisy. A complex network can easily memorize noise. Use strong regularization, simple architectures, and monitor validation loss.

### **25.9.4 Non‑Stationarity**

Financial markets change over time. A model trained on old data may not generalize to new regimes. Use walk‑forward validation and consider retraining frequently.
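Walk-forward validation can be sketched as a generator of index splits in which the training window grows and each test block lies strictly after it. The function name and the expanding-window design are assumptions for illustration; a rolling fixed-length window is an equally valid variant.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

For each split the model and the scaler would be refit on the training indices and evaluated on the test block. scikit-learn's `TimeSeriesSplit` provides similar expanding-window splits.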

### **25.9.5 Vanishing/Exploding Gradients**

Deep networks can suffer from vanishing or exploding gradients, especially with sigmoid/tanh. Use ReLU activations, batch normalization, and proper weight initialization (e.g., He initialization).
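A short calculation shows why sigmoid layers make gradients vanish. The derivative of the sigmoid never exceeds 0.25, so the activation-derivative factors alone shrink the gradient by at least that much per layer. The sketch below ignores the weight factors, which is a simplification, and also computes the standard deviation that He initialization would use for a layer with 64 inputs, `sqrt(2 / fan_in)`.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

max_slope = sigmoid_derivative(0.0)        # 0.25, the largest possible value
upper_bound_10_layers = max_slope ** 10    # about 9.5e-07

# He initialization: weights drawn with standard deviation sqrt(2 / fan_in)
fan_in = 64
he_std = math.sqrt(2.0 / fan_in)

print(max_slope, upper_bound_10_layers, round(he_std, 4))
```

ReLU has derivative 1 for positive inputs, so it does not introduce this per-layer shrinkage, and He initialization is scaled to keep activation variance roughly constant across ReLU layers.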

### **25.9.6 Insufficient Data**

Neural networks typically require large datasets. For a single stock, we may have only a few thousand rows. Consider using data from multiple stocks (pooling) or transfer learning.

### **25.9.7 Hyperparameter Tuning on Test Set**

Never tune hyperparameters based on test set performance. Use a separate validation set or time‑series CV. The test set should be used only once, at the end.

---

## **Chapter Summary**

In this chapter, we covered the fundamentals of neural networks, with applications to the NEPSE prediction system.

- **Neural network architecture:** input, hidden, output layers; neurons and activation functions.
- **MLP for regression and classification:** we built models to predict next‑day returns and direction.
- **Backpropagation** is the learning algorithm; frameworks handle it automatically.
- **Loss functions:** MSE/MAE for regression, binary crossentropy for classification.
- **Optimizers:** SGD, Adam, RMSprop; Adam is a good default.
- **Regularization:** dropout, batch normalization, early stopping, and L1/L2 weight penalties help reduce overfitting.
- **Training best practices:** scale data, monitor validation loss, use callbacks.
- **Common pitfalls:** look‑ahead bias, leakage, overfitting, non‑stationarity.

### **Practical Takeaways for the NEPSE System:**

- Start with a simple MLP (1‑2 hidden layers) as a baseline.
- Use ReLU activation, Adam optimizer, and early stopping.
- Always scale features using training set statistics.
- Regularize heavily given the noisy nature of financial returns.
- Validate with temporal splits and walk‑forward to ensure robustness.
- Compare performance with simpler models (linear, tree‑based) to justify complexity.

In the next chapter, **Chapter 26: Recurrent Neural Networks**, we will explore networks designed for sequential data, such as LSTMs and GRUs, which are particularly suited for time‑series forecasting.

---

**End of Chapter 25**

