# Chapter 10: Introduction to Artificial Neural Networks with Keras

## 1. Chapter Overview
**Goal:** This chapter marks the transition from traditional Machine Learning to **Deep Learning**. We will explore Artificial Neural Networks (ANNs), moving from the simplest architecture (the Perceptron) to Multi-Layer Perceptrons (MLPs). We will use **Keras**, a high-level API running on top of TensorFlow, to build models that classify fashion images and predict housing prices.

**Key Concepts:**
* **Perceptrons & TLUs:** The basic logic units of neural networks.
* **Multi-Layer Perceptron (MLP):** Stacking layers to solve non-linear problems (XOR problem).
* **Backpropagation:** The revolutionary training algorithm that computes gradients efficiently.
* **Activation Functions:** Why we need ReLU, Sigmoid, or Softmax.
* **Keras API:**
    * *Sequential API:* For simple stacks of layers.
    * *Functional API:* For complex topologies (Wide & Deep).
    * *Subclassing API:* For full control (research level).
* **Hyperparameter Tuning:** Using RandomizedSearch to find optimal neuron counts and learning rates.

**Practical Skills:**
* Building an Image Classifier for the **Fashion MNIST** dataset.
* Building a Regressor for California housing data.
* Using Callbacks (EarlyStopping, ModelCheckpoint) to prevent overfitting.
* Saving and loading Keras models.

In [None]:
# Setup
import sys
assert sys.version_info >= (3, 5)

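# Scikit-Learn ≥0.20 is required
# (versions are parsed, not compared as strings: "0.9" >= "0.20" is True for strings)
from packaging import version
import sklearn
assert version.parse(sklearn.__version__) >= version.parse("0.20")

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert version.parse(tf.__version__) >= version.parse("2.0")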

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow version:", tf.__version__)
print("Keras version:", keras.__version__)

## 2. Theoretical Explanation (In-Depth)

### 1. From Biological to Artificial (The Perceptron)
The basic idea of ANNs is inspired by biological neurons. Neurons receive input signals via dendrites, process them, and if the signal is strong enough, fire an output signal via the axon.

**Perceptron (Frank Rosenblatt, 1957):**
This is the simplest ANN architecture. It consists of a single layer of **Threshold Logic Units (TLUs)**. Each input is a number with an associated weight ($w_i$). The TLU computes a weighted sum of its inputs plus a bias term ($z = w_1 x_1 + \dots + w_n x_n + b$), then applies a *step function* to $z$.
* Limitation: A Perceptron can only learn linearly separable patterns, so it fails on even the simple XOR problem.
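
The TLU computation and the XOR limitation can be illustrated with a minimal NumPy sketch. The function name `tlu`, the hand-picked weights, and the search grid are illustrative assumptions, not part of the chapter's code. A single unit reproduces AND and OR, but a brute-force search over a grid of weights finds no setting that reproduces XOR.

```python
import numpy as np

def tlu(x, w, b):
    """Threshold logic unit: weighted sum plus bias, then a Heaviside step."""
    z = np.dot(x, w) + b
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (hypothetical) weights; only the bias differs between the gates.
and_out = tlu(X, w=np.array([1.0, 1.0]), b=-1.5)
or_out = tlu(X, w=np.array([1.0, 1.0]), b=-0.5)
print("AND:", and_out)  # [0 0 0 1]
print("OR: ", or_out)   # [0 1 1 1]

# XOR is not linearly separable: no (w1, w2, b) on this grid reproduces it.
xor_target = np.array([0, 1, 1, 0])
grid = np.linspace(-2, 2, 21)
found = any(
    np.array_equal(tlu(X, np.array([w1, w2]), b), xor_target)
    for w1 in grid for w2 in grid for b in grid
)
print("Single TLU solves XOR on this grid:", found)  # False
```

The grid search is only a demonstration. The general result follows from the fact that one TLU draws a single straight decision boundary.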

### 2. Multi-Layer Perceptron (MLP) and Backpropagation
To overcome the limitations of the Perceptron, we stack multiple layers of TLUs. This structure is called an MLP:
1.  **Input Layer:** Receives the data features.
2.  **Hidden Layers:** One or more layers in the middle that transform the representation.
3.  **Output Layer:** Produces the final prediction.

**Backpropagation (Rumelhart, Hinton, Williams, 1986):**
How do we train such a deep network? Backpropagation is key. For each training batch it works in three steps:
1.  **Forward Pass:** Data flows from input to output, generating predictions, and the error is calculated using a *Loss Function*.
2.  **Backward Pass:** The algorithm computes the gradient of the error with regard to every parameter (weights and biases) by moving backward from output to input, using the *Chain Rule* of calculus. The gradient tells us how much each parameter contributed to the error.
3.  **Update:** A Gradient Descent step uses these gradients to update the weights to reduce the error.
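
The three steps above can be sketched in plain NumPy for a tiny network. This is an illustration under assumptions, not the chapter's code: the toy data, the layer sizes, the `tanh` hidden activation, and the learning rate of `0.01` are all made up. The sketch also compares one analytic gradient against a finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 2))   # toy batch: 8 instances, 2 features
y = rng.normal(size=(8, 1))   # toy regression targets

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # hidden layer: 3 tanh units
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # linear output layer

def forward(W1, b1, W2, b2):
    """Forward pass: hidden activations, predictions, and MSE loss."""
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    return h, y_hat, np.mean((y_hat - y) ** 2)

def backward(h, y_hat, W2):
    """Backward pass: chain rule applied from the output back to the input."""
    d_yhat = 2 * (y_hat - y) / len(X)        # dLoss/dy_hat
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_z1 = (d_yhat @ W2.T) * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)
    return dW1, db1, dW2, db2

# 1. Forward pass
h, y_hat, loss_before = forward(W1, b1, W2, b2)
# 2. Backward pass
dW1, db1, dW2, db2 = backward(h, y_hat, W2)

# Sanity check: analytic gradient vs. central finite difference for W1[0, 0]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[2] - forward(W1_minus, b1, W2, b2)[2]) / (2 * eps)
grad_ok = np.isclose(numeric, dW1[0, 0], atol=1e-6)

# 3. Gradient Descent update
lr = 0.01
W1, b1 = W1 - lr * dW1, b1 - lr * db1
W2, b2 = W2 - lr * dW2, b2 - lr * db2
loss_after = forward(W1, b1, W2, b2)[2]

print("Gradient check passed:", grad_ok)
print("Loss before / after one update:", loss_before, loss_after)
```

Keras performs the same steps automatically inside `fit()`, using automatic differentiation rather than hand-written derivatives.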

### 3. Activation Functions
For an MLP to learn non-linear patterns, we **must** use non-linear activation functions between linear layers.
* **Sigmoid:** Squashes its input into the range (0, 1). Good for probabilities, but it saturates for large positive or negative inputs, which causes *vanishing gradients* in deep networks.
* **ReLU (Rectified Linear Unit):** $\mathrm{ReLU}(z) = \max(0, z)$. Fast to compute and does not saturate for positive values. It is the de facto standard for hidden layers.
* **Softmax:** Used in the output layer for multiclass classification. Ensures probabilities sum to 1.
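
As a rough illustration, the three functions can be written in a few lines of NumPy. The helper names and sample inputs below are assumptions chosen for the example. The last lines show why sigmoid is prone to vanishing gradients: its derivative is close to zero at both extremes, while ReLU's derivative stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid:", sigmoid(z).round(4))
print("relu:   ", relu(z))
print("softmax:", softmax(z).round(4), "sum =", softmax(z).sum())

# Derivatives: sigmoid saturates at both ends, ReLU does not for z > 0.
sig_grad = sigmoid(z) * (1 - sigmoid(z))
relu_grad = (z > 0).astype(float)
print("sigmoid gradient:", sig_grad.round(5))
print("relu gradient:   ", relu_grad)
```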

### 4. Keras API
TensorFlow 2 adopted Keras as its official high-level API. It is user-friendly:
* **Sequential API:** Easiest. Just stack layers with `model.add(layer)`. Sufficient whenever the model is a plain stack of layers.
* **Functional API:** More flexible. Allows multiple inputs/outputs and non-linear topologies (like Wide & Deep).
* **Subclassing API:** Most flexible but also most complex. You write your own Python class and define the forward pass yourself. Used mainly for research.
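
The Subclassing API is the only one of the three not shown later in these notes, so here is a minimal sketch of what it can look like. The class name `WideAndDeepModel`, the `units` argument, the attribute names, and the 8-feature dummy input are illustrative assumptions. Layers are created in `__init__()` and wired together in `call()`.

```python
import numpy as np
from tensorflow import keras

class WideAndDeepModel(keras.Model):
    """Hypothetical subclassed model: inputs reach the output directly and via two hidden layers."""

    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.concat = keras.layers.Concatenate()
        self.main_output = keras.layers.Dense(1)

    def call(self, inputs):
        # The forward pass is ordinary Python, so loops and conditionals are allowed here.
        hidden1 = self.hidden1(inputs)
        hidden2 = self.hidden2(hidden1)
        return self.main_output(self.concat([inputs, hidden2]))

model = WideAndDeepModel()
dummy_batch = np.zeros((4, 8), dtype="float32")   # 4 instances, 8 features
out = model(dummy_batch)
print(out.shape)
```

The trade-off is that Keras cannot inspect the architecture ahead of time, so features such as `summary()` give less detail than with the Sequential or Functional API.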

## 3. Code Reproduction

### 3.1 Building an Image Classifier (Fashion MNIST)
We use the Fashion MNIST dataset (70,000 grayscale 28×28 images in 10 fashion categories, split by `load_data()` into 60,000 training and 10,000 test images) because it is somewhat harder than the classic MNIST digits.

In [None]:
# 1. Load Data
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# 2. Validation Split & Normalization (Scaling)
# Pixel values are 0-255. We scale them to 0-1 for Neural Networks.
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

print("Training data shape:", X_train.shape)
print("Example class:", class_names[y_train[0]])

# Visualize one image
plt.imshow(X_train[0], cmap="binary")
plt.axis('off')
plt.show()

### 3.2 Creating a Model using Sequential API
We will build an MLP with 2 hidden layers.

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]), # Flattens 2D (28x28) to 1D (784)
    keras.layers.Dense(300, activation="relu"), # Hidden Layer 1: 300 neurons, ReLU
    keras.layers.Dense(100, activation="relu"), # Hidden Layer 2: 100 neurons, ReLU
    keras.layers.Dense(10, activation="softmax") # Output Layer: 10 classes, Softmax
])

# View architecture summary
model.summary()
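
The parameter counts that `summary()` reports can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per neuron, and `Flatten` has none. A small sketch of that arithmetic for the layer sizes used above:

```python
layer_sizes = [784, 300, 100, 10]   # flattened input, two hidden layers, output

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_in * n_out + n_out   # weights + biases
    total += n_params
    print(f"Dense({n_out}): {n_params:,} parameters")
print(f"Total: {total:,}")   # Total: 266,610
```

Most of the parameters sit in the first hidden layer, which gives the model a lot of flexibility but also increases the risk of overfitting.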

### 3.3 Compile and Train
* **Loss:** `sparse_categorical_crossentropy` because our labels are integers (0-9), not one-hot vectors.
* **Optimizer:** `sgd` (Stochastic Gradient Descent).
* **Metrics:** `accuracy`.
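
To see why the "sparse" variant fits integer labels, the following NumPy sketch computes the cross-entropy both ways on made-up softmax outputs (the probabilities and labels are assumptions for illustration). Both give the same value. The sparse form simply indexes the predicted probability of the true class instead of multiplying by a one-hot vector.

```python
import numpy as np

y_proba = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])   # made-up softmax outputs for 2 instances
y_sparse = np.array([0, 2])              # integer class labels
y_onehot = np.eye(3)[y_sparse]           # the same labels, one-hot encoded

# Sparse form: pick the probability assigned to the true class.
sparse_loss = -np.log(y_proba[np.arange(len(y_sparse)), y_sparse]).mean()
# One-hot form: sum over classes, weighted by the one-hot target.
onehot_loss = -(y_onehot * np.log(y_proba)).sum(axis=1).mean()

print("sparse: ", sparse_loss)
print("one-hot:", onehot_loss)
```

With one-hot labels, the matching Keras loss would be `categorical_crossentropy`.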

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

# Train the model (this may take a few minutes, depending on hardware)
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))

### 3.4 Visualizing Learning Curves
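The `history` object returned by `fit()` stores the per-epoch loss and accuracy for both the training and validation sets in its `history` dictionary. We plot these curves to detect overfitting, which shows up as validation loss rising while training loss keeps falling.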

In [None]:
import pandas as pd

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.title("Learning Curves")
plt.show()

# Evaluate on Test Set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {test_acc:.4f}")

### 3.5 Prediction
Using the model to predict class probabilities on new instances.

In [None]:
X_new = X_test[:3]
y_proba = model.predict(X_new)
print("Probabilities:\n", y_proba.round(2))

# Get class with highest probability
y_pred = np.argmax(y_proba, axis=1)
print("Predictions:", np.array(class_names)[y_pred])
print("Actual Labels:", np.array(class_names)[y_test[:3]])

### 3.6 Regression MLP (Functional API)
For regression (predicting California housing prices), the MLP structure differs:
* Output layer has only **1 neuron**.
* **No activation function** at output (we want continuous values).
* Loss function: **MSE**.

We will also use the **Functional API** to build a **Wide & Deep** architecture. It connects the inputs directly to the output layer (the wide path) and also routes them through a stack of hidden layers (the deep path). This allows the model to learn both simple (linear) patterns and complex (deep) patterns simultaneously.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load & Split Data
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

# Scaling (CRITICAL for Neural Networks!)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

# --- Functional API ---
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)

# Concatenate input directly with output of hidden layer 2
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)

model_func = keras.models.Model(inputs=[input_], outputs=[output])

model_func.compile(loss="mean_squared_error", optimizer="sgd")

print("Training Wide & Deep Model...")
history = model_func.fit(X_train, y_train, epochs=20,
                         validation_data=(X_valid, y_valid))

### 3.7 Callbacks
Training for too many epochs leads to overfitting. Instead of guessing the number of epochs, use `EarlyStopping` to halt training once the validation loss stops improving. We also use `ModelCheckpoint` to save the best model seen so far.

In [None]:
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)

# patience=10: stop if val_loss (the metric monitored by default) has not improved for 10 consecutive epochs
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

model_func.fit(X_train, y_train, epochs=100,
               validation_data=(X_valid, y_valid),
               callbacks=[checkpoint_cb, early_stopping_cb])
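
A saved model can later be restored with `keras.models.load_model()`. The sketch below is self-contained and uses a tiny stand-in model and a temporary directory, both assumptions for illustration. The same call should work on the `my_keras_model.h5` file written by `ModelCheckpoint` above.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Tiny stand-in model (hypothetical); the chapter's model_func is saved the same way.
model = keras.models.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1),
])

path = os.path.join(tempfile.mkdtemp(), "my_keras_model.h5")
model.save(path)   # stores the architecture and the weights in one HDF5 file

# compile=False skips restoring the training configuration; fine for inference.
restored = keras.models.load_model(path, compile=False)

X_sample = np.random.rand(3, 8).astype("float32")
same = np.allclose(model.predict(X_sample, verbose=0),
                   restored.predict(X_sample, verbose=0))
print("Restored model reproduces predictions:", same)
```

To continue training a restored model, load it without `compile=False` (or call `compile()` again) so that the loss and optimizer are available.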

## 4. Step-by-Step Explanation

### 1. Image Data Preprocessing
**Input:** Images of `28x28` pixels with values `0-255`.
**Process:** Neural networks trained with Gradient Descent tend to converge faster when the inputs are on a small scale (around 0-1), so we divide by `255.0`. Skipping this step can make training unstable or cause it to stall.
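
A quick sketch of what the division does, using random stand-in pixels rather than the real dataset (the array below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)   # stand-in for Fashion MNIST
images[0, 0, 0], images[0, 0, 1] = 0, 255   # make sure both extremes are present

scaled = images / 255.0   # true division converts the uint8 pixels to floats

print(images.dtype, images.min(), images.max())
print(scaled.dtype, scaled.min(), scaled.max())
```

Dividing by the float `255.0` also changes the dtype from integer to floating point, which is what the `Dense` layers expect.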

### 2. Classification Architecture
* `Flatten`: Has no parameters. It just reshapes the 2D matrix into a long 1D vector. It bridges the gap between image data and Dense layers.
* `Dense`: Fully Connected layers. Every neuron connects to every neuron in the previous layer.
* `ReLU`: Without a non-linear activation such as ReLU, a stack of Dense layers is mathematically equivalent to a single linear layer. ReLU adds non-linearity, allowing the model to learn complex shapes.
* `Softmax`: Converts raw scores (logits) into a probability distribution (sum = 1).
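
The claim that stacked linear layers collapse into one can be checked numerically. In this NumPy sketch (random weights and shapes are assumptions for illustration), two activation-free layers are reproduced exactly by a single layer with merged weights, while inserting ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

# Two Dense layers with no activation...
two_layers = (x @ W1 + b1) @ W2 + b2
# ...equal one Dense layer with merged weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
collapses = np.allclose(two_layers, one_layer)

# With ReLU in between, the merged layer no longer reproduces the output.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_layer)

print("Linear stack equals single layer:", collapses)
print("ReLU stack differs from it:      ", relu_differs)
```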

### 3. Wide & Deep Architecture
In the Functional API code, `Concatenate` merges the raw inputs with the processed deep features. 
* **Deep path** learns abstract patterns.
* **Wide path** (direct connection) learns simple rules.
* This can outperform a plain MLP on tabular data, where some features relate to the target in a simple way.

### 4. Early Stopping
Early stopping acts as a form of regularization. Instead of monitoring training manually, the callback watches `val_loss`. If it has not improved for 10 consecutive epochs, the callback stops training and (with `restore_best_weights=True`) restores the best weights, which limits both overfitting and wasted compute.
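
The patience logic can be sketched in plain Python. This is a simplified illustration, not Keras's actual implementation: the function name `early_stopping` and the validation losses are made up, and epochs are counted from 0.

```python
def early_stopping(val_losses, patience=10):
    """Return (epoch at which training stops, epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0   # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop early
    return len(val_losses) - 1, best_epoch   # ran through all epochs

val_losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
# Stopped at epoch 5, best epoch was 2
```

Restoring the best weights means the final model corresponds to `best_epoch`, not to the epoch at which training stopped.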

## 5. Chapter Summary

* **Deep Learning** uses multi-layer neural networks to learn complex data representations.
* **Keras** is the user-friendly, high-level API for TensorFlow.
* **Sequential API** is great for simple stacks.
* **Functional API** is needed for complex topologies (branching, multiple inputs).
* **Preprocessing:** Always scale inputs before feeding them into a Neural Net (for example, divide pixel values by 255, or use `StandardScaler` for tabular features).
* **Loss Function:** Use `sparse_categorical_crossentropy` for integer classification, `mse` for regression.
* **Activation:** `ReLU` for hidden layers; `Softmax` (multiclass classification) or no activation, i.e. a linear output (regression), for the output layer.
* **Optimization:** Avoid overfitting using `EarlyStopping` and save models with `ModelCheckpoint`.