
# From Biological to Artificial Neurons

Surprisingly, artificial neural networks (ANNs) have been around for quite a while. They were first introduced in 1943 by neurophysiologist **Warren McCulloch** and mathematician **Walter Pitts**. In their landmark paper *“A Logical Calculus of the Ideas Immanent in Nervous Activity”*, they presented a simplified computational model of how biological neurons might work together to perform complex computations using propositional logic. This was the first artificial neural network architecture.

Early successes led to the belief that intelligent machines were just around the corner. When this failed to materialize in the 1960s, funding dried up and ANNs entered a long winter. A revival in the 1980s brought new architectures and better training techniques, but progress remained slow. By the 1990s, other machine learning methods such as support vector machines offered better results and stronger theoretical foundations, causing neural networks to fall out of favor again.

Today, ANNs are experiencing another resurgence — and this time, it’s different.


## Why Neural Networks Succeeded This Time

Several key factors explain the renewed success of neural networks:

- **Massive datasets** are now available, and ANNs often outperform other models on large, complex problems.
- **Computing power** has increased dramatically since the 1990s, driven by Moore’s Law and the widespread availability of GPUs originally developed for gaming.
- **Cloud platforms** make powerful hardware accessible to nearly everyone.
- **Training algorithms** have improved: the changes since the 1990s are small tweaks, but they have had a large positive effect.
- **Theoretical concerns**, such as getting stuck in local optima, turned out to be far less problematic in practice.
- **The Transformer architecture (2017)** revolutionized the field by enabling models that scale well and work across text, images, audio, robotics, and protein folding.
- **Transfer learning, in-context learning, few-shot learning, and chain-of-thought prompting** have unlocked new capabilities.

ANNs now benefit from a virtuous cycle of funding, research, and real-world impact. AI is no longer hidden behind the scenes — tools like ChatGPT have brought it directly to the public.


## Biological Neurons

A biological neuron consists of:
- A **cell body** containing the nucleus
- Many branching **dendrites**
- One long **axon**
- Branching **telodendria** ending in **synapses**

Neurons communicate via **electrical impulses** called action potentials. When enough neurotransmitters are received in a short time window, the neuron fires — unless inhibitory signals prevent it.

Individual biological neurons behave in a fairly simple way, yet they form massive networks with billions of nodes, each connected to thousands of others. These networks often organize into **layers**, especially in the cerebral cortex.


## Logical Computations with Neurons

McCulloch and Pitts proposed a simplified neuron model with:
- Binary inputs
- A binary output
- An activation threshold

They showed that networks of such neurons can compute **any logical proposition**.

Using neurons that activate when at least two inputs are active, simple networks can compute:
- Identity
- AND
- OR
- NOT (with inhibitory connections)

By combining these networks, arbitrarily complex logical expressions can be constructed.
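
The four building blocks listed above can be sketched in a few lines of Python. This is a minimal illustration, assuming (as in the text) that a neuron fires when at least two of its excitatory connections are active and that any active inhibitory connection blocks it. The helper names (`mp_neuron`, `logical_and`, and so on) are invented for this example.

```python
def mp_neuron(excitatory, inhibitory=(), threshold=2):
    """Return 1 if the neuron fires, else 0.

    Assumption: the neuron fires when at least `threshold` excitatory
    connections are active and no inhibitory connection is active.
    """
    if any(inhibitory):
        return 0
    return int(sum(excitatory) >= threshold)


def identity(a):
    # Two connections from A: C fires exactly when A is active.
    return mp_neuron([a, a])


def logical_and(a, b):
    # One connection each: both A and B must be active to reach the threshold.
    return mp_neuron([a, b])


def logical_or(a, b):
    # Two connections each: either input alone reaches the threshold.
    return mp_neuron([a, a, b, b])


def a_and_not_b(a, b):
    # B is inhibitory; if A is always active, this reduces to NOT B.
    return mp_neuron([a, a], inhibitory=[b])


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```

Chaining these functions (feeding the output of one into another) is the code equivalent of combining the networks.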


## The Perceptron

Invented in 1957 by **Frank Rosenblatt**, the perceptron is based on a **threshold logic unit (TLU)**.

A TLU:
1. Computes a weighted sum  
   \[
   z = w^\top x + b
   \]
2. Applies a **step function**  
   \[
   h_w(x) = \text{step}(z)
   \]

This is similar to logistic regression, except it uses a step function instead of a sigmoid.


### Step Functions

Common step functions include:
- **Heaviside step function**
- **Sign function**

These functions output discrete values. They are flat everywhere except at the jump, where they are not differentiable, so they provide no useful gradient.
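
As a rough sketch, a TLU with either step function can be written with NumPy as follows. The weights, bias, and inputs are made-up values for illustration, and the function names are not from any library.

```python
import numpy as np


def heaviside(z):
    # 0 for negative inputs, 1 otherwise (the value at z = 0 is a convention)
    return np.where(z >= 0, 1, 0)


def sign(z):
    # -1 for negative inputs, 0 at zero, +1 for positive inputs
    return np.sign(z)


def tlu(x, w, b, step=heaviside):
    """Threshold logic unit: weighted sum followed by a step function."""
    z = x @ w + b
    return step(z)


w = np.array([1.0, -2.0])   # illustrative weights
b = 0.5                     # illustrative bias
x = np.array([[3.0, 1.0],   # z = 3 - 2 + 0.5 = 1.5
              [0.0, 1.0]])  # z = 0 - 2 + 0.5 = -1.5

print(tlu(x, w, b))             # [1 0]
print(tlu(x, w, b, step=sign))  # [ 1. -1.]
```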


## Perceptron Architecture

A perceptron consists of:
- An **input layer**
- A single **fully connected output layer**

A perceptron with multiple outputs can perform **multilabel classification** or **multiclass classification**.

The outputs of a fully connected layer can be computed efficiently using matrix operations:
\[
Y = \phi(XW + b)
\]

Where:
- \(X\): input matrix
- \(W\): weight matrix
- \(b\): bias vector
- \(\phi\): activation function
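
The shapes involved are easier to see in code. The sketch below uses random values and a Heaviside step as a stand-in for \(\phi\); the sizes (4 instances, 3 inputs, 2 output neurons) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_instances, n_inputs, n_outputs = 4, 3, 2
X = rng.normal(size=(n_instances, n_inputs))   # one row per instance
W = rng.normal(size=(n_inputs, n_outputs))     # one column per output neuron
b = np.zeros(n_outputs)                        # one bias per output neuron


def phi(z):
    # Heaviside step, chosen here only as an example activation
    return (z >= 0).astype(int)


Y = phi(X @ W + b)   # b is broadcast across the rows of XW
print(Y.shape)       # (4, 2)

# Same result computed one instance and one neuron at a time
Y_loop = np.array([[phi(X[i] @ W[:, j] + b[j]) for j in range(n_outputs)]
                   for i in range(n_instances)])
print(np.array_equal(Y, Y_loop))
```

The matrix form gives the same outputs as the nested loop, but lets NumPy do the work in a single optimized operation.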


## Perceptron Training (Hebbian Learning)

Perceptron training is inspired by **Hebb’s rule**:

> “Cells that fire together, wire together.”

For each training instance:
- The model predicts outputs
- Weights connected to incorrect outputs are updated to reduce error

The update rule is:
\[
w_{i,j} \leftarrow w_{i,j} + \eta (y_j - \hat{y}_j)x_i
\]

Perceptrons converge **only if the data is linearly separable**.
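
Here is a minimal NumPy sketch of the update rule, trained on the AND truth table (which is linearly separable). It assumes a single output neuron with a Heaviside step, and it treats the bias as a weight whose input is always 1; the learning rate and epoch limit are illustrative values.

```python
import numpy as np

# AND truth table: linearly separable, so the rule is guaranteed to converge
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
eta = 1.0  # learning rate (illustrative value)


def predict(X, w, b):
    return (X @ w + b >= 0).astype(int)  # Heaviside step


for epoch in range(20):
    n_errors = 0
    for x_i, y_i in zip(X, y):
        y_hat = predict(x_i, w, b)
        error = y_i - y_hat
        # w <- w + eta * (y - y_hat) * x; no change when the prediction is right
        w += eta * error * x_i
        b += eta * error
        n_errors += int(error != 0)
    if n_errors == 0:
        break  # a full pass without mistakes: converged

print(predict(X, w, b))  # [0 0 0 1]
```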


## Limitations of Perceptrons

Perceptrons:
- Have **linear decision boundaries**
- Cannot solve problems like **XOR**
- Do not output class probabilities
- Use no regularization by default

These limitations led to a major loss of interest in neural networks in the late 1960s.
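
The XOR limitation is easy to reproduce. The sketch below uses Scikit-Learn's `Perceptron` class on the four-point AND and XOR truth tables; the hyperparameter values are arbitrary choices for the example. No straight line separates the XOR classes, so a linear model can classify at most three of the four points correctly.

```python
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

# tol=None makes training run for the full max_iter epochs
and_clf = Perceptron(max_iter=1000, tol=None, random_state=42).fit(X, y_and)
xor_clf = Perceptron(max_iter=1000, tol=None, random_state=42).fit(X, y_xor)

print("AND accuracy:", and_clf.score(X, y_and))
print("XOR accuracy:", xor_clf.score(X, y_xor))  # at most 0.75
```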


## Multilayer Perceptrons (MLPs)

Stacking perceptrons produces a **multilayer perceptron (MLP)**.

An MLP includes:
- One input layer
- One or more **hidden layers**
- One output layer

MLPs can solve **nonlinear problems**, including XOR.
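
To see why one hidden layer is enough for XOR, here is a tiny MLP with hand-picked weights (not learned, and only one of many possible choices). The first hidden neuron behaves like OR, the second like AND, and the output neuron fires when OR is active but AND is not.

```python
import numpy as np


def step(z):
    return (z >= 0).astype(int)


def xor_mlp(X):
    # Hidden layer: h1 acts as OR, h2 acts as AND (weights chosen by hand)
    W_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])
    H = step(X @ W_hidden + b_hidden)
    # Output layer: fires when OR is active but AND is not
    w_out = np.array([1.0, -1.0])
    b_out = -0.5
    return step(H @ w_out + b_out)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_mlp(X))  # [0 1 1 0]
```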


## Deep Neural Networks

When an ANN contains many hidden layers, it is called a **deep neural network (DNN)**.

The field of **deep learning** studies such models, though the term is often used loosely for all neural networks.

MLPs are **feedforward neural networks**, meaning signals flow in one direction only.


## Backpropagation

Training MLPs was difficult until **reverse-mode automatic differentiation** was introduced by **Seppo Linnainmaa** in 1970.

Backpropagation combines:
- Reverse-mode autodiff
- Gradient descent

It computes gradients efficiently in two passes:
- Forward pass
- Backward pass


### How Backpropagation Works

1. Process one **mini-batch** at a time
2. Perform a **forward pass**
3. Compute the **loss**
4. Compute gradients using the **chain rule**
5. Propagate gradients backward through layers
6. Update weights using **gradient descent**

Random weight initialization is essential to break symmetry.
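
The steps above can be sketched by hand in NumPy for a tiny network. This is an illustrative toy, not how any library implements it: it assumes one hidden layer of 8 tanh neurons, a linear output neuron, mean squared error, synthetic data, and a single mini-batch that is reused at every step.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy regression data treated as a single mini-batch
X = rng.normal(size=(32, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# Random initialization breaks the symmetry between hidden neurons
W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

eta = 0.05  # learning rate (illustrative value)
losses = []

for step in range(300):
    # Forward pass (intermediate results are kept for the backward pass)
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2

    # Loss: mean squared error
    loss = np.mean((y_hat - y) ** 2)
    losses.append(loss)

    # Backward pass: chain rule, from the output layer back to the input layer
    grad_y_hat = 2 * (y_hat - y) / len(X)
    grad_W2 = h.T @ grad_y_hat
    grad_b2 = grad_y_hat.sum(axis=0)
    grad_h = grad_y_hat @ W2.T
    grad_z1 = grad_h * (1 - h ** 2)   # derivative of tanh
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    # Gradient descent step
    W1 -= eta * grad_W1
    b1 -= eta * grad_b1
    W2 -= eta * grad_W2
    b2 -= eta * grad_b2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice, frameworks derive the backward pass automatically; writing it out once mainly helps to see where each gradient comes from.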


## Activation Functions

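Step functions were replaced because they are flat everywhere except at the jump: their gradient is zero wherever it is defined, so gradient descent has nothing to work with.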

Common activation functions:
- **Sigmoid**
- **Tanh**
- **ReLU**

ReLU is now the default for most architectures due to speed and effectiveness.

Without nonlinear activation functions, deep networks collapse into linear models.
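
The sketch below defines the three activation functions with NumPy and then checks the last claim numerically: two stacked linear layers with no activation in between give the same outputs as one linear layer. The layer sizes and random values are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def relu(z):
    return np.maximum(0, z)


z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # values in (0, 1)
print(np.tanh(z))   # values in (-1, 1)
print(relu(z))      # [0. 0. 2.]

# Two linear layers without an activation in between...
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 2)), rng.normal(size=2)
two_layers = (X @ W1 + b1) @ W2 + b2

# ...are equivalent to a single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b
print(np.allclose(two_layers, one_layer))  # True
```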


## Why Neural Networks Work

A deep neural network with nonlinear activations can theoretically approximate **any continuous function**.

This makes modern neural networks incredibly powerful — and explains why they now dominate machine learning.


# Building and Training MLPs with Scikit-Learn

Multilayer perceptrons (MLPs) can tackle a wide range of tasks, but the most common are **regression** and **classification**. Scikit-Learn provides tools for both. Let’s start with regression.


## Regression MLPs

For a regression task where you want to predict a **single numeric value** (for example, the price of a house), you only need **one output neuron**. Its output is the predicted value.

For **multivariate regression**, where you predict multiple values at once, you need **one output neuron per output dimension**. For example:
- Predicting an object’s center in an image requires **2 output neurons** (x and y).
- Adding a bounding box requires **2 more neurons** (width and height).
- Total: **4 output neurons**.
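
As a quick check of the "one output neuron per output dimension" rule, the sketch below fits an `MLPRegressor` on synthetic data with four targets. The data, layer size, and iteration limit are made up for the example; the point is only the shape of the output layer.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))            # 200 instances, 5 synthetic features
Y = X @ rng.normal(size=(5, 4))          # 4 targets, e.g. x, y, width, height

mlp = MLPRegressor(hidden_layer_sizes=[20], max_iter=500, random_state=42)
mlp.fit(X, Y)

print(mlp.n_outputs_)            # 4
print(mlp.coefs_[-1].shape)      # (20, 4): one column of weights per output neuron
print(mlp.predict(X[:3]).shape)  # (3, 4)
```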


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


In [None]:
housing = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, random_state=42
)


## Using `MLPRegressor`

Scikit-Learn provides the `MLPRegressor` class. As an example, we can build an MLP with:
- **3 hidden layers**
- **50 neurons per hidden layer**
- **ReLU activation** in hidden layers
- **No activation function** in the output layer

The model automatically adapts the input and output sizes when training starts.


In [None]:
mlp_reg = MLPRegressor(
    hidden_layer_sizes=[50, 50, 50],
    early_stopping=True,
    verbose=True,
    random_state=42
)


Neural networks are sensitive to feature scaling.  
We therefore standardize the input features using `StandardScaler` and train the model using a pipeline.


In [None]:
pipeline = make_pipeline(StandardScaler(), mlp_reg)
pipeline.fit(X_train, y_train)


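With `early_stopping=True`, `MLPRegressor` holds out part of the training set (10% by default) as a validation set and scores it with **R²** after each epoch.  
The best validation score achieved during training is: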


In [None]:
mlp_reg.best_validation_score_


Next, we evaluate the model on the test set using Root Mean Squared Error (RMSE).


In [None]:
y_pred = pipeline.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
rmse


This MLP does not use an activation function in the output layer, so it can predict any real value.

To constrain outputs:
- Positive-only → ReLU or Softplus
- Bounded range → Sigmoid (0–1) or Tanh (–1 to 1)

Unfortunately, `MLPRegressor` does not support output-layer activations.


## Classification MLPs

MLPs can also handle classification tasks:

### Binary classification
- 1 output neuron
- Sigmoid activation
- Cross-entropy loss

### Multilabel classification
- 1 output neuron per label
- Sigmoid activation
- Probabilities do not sum to 1

### Multiclass classification
- 1 output neuron per class
- Softmax activation
- Probabilities sum to 1
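
The difference between the sigmoid and softmax output layers shows up directly in the numbers. In this sketch the logits are made-up raw outputs of three output neurons.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def softmax(z):
    # Subtracting the row maximum avoids overflow without changing the result
    exp_z = np.exp(z - z.max(axis=-1, keepdims=True))
    return exp_z / exp_z.sum(axis=-1, keepdims=True)


logits = np.array([[2.0, 1.0, -1.0]])   # raw outputs of 3 output neurons

multilabel_proba = sigmoid(logits)      # each label scored independently
multiclass_proba = softmax(logits)      # classes compete with one another

print(multilabel_proba.sum(axis=1))     # generally not 1
print(multiclass_proba.sum(axis=1))     # 1 (up to rounding)
```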


## Fashion MNIST Classification Example

Fashion MNIST contains:
- 70,000 grayscale images
- 28×28 pixels
- 10 classes of clothing items

We load it using `fetch_openml`.


In [None]:
from sklearn.datasets import fetch_openml

fashion_mnist = fetch_openml(name="Fashion-MNIST", as_frame=False)
targets = fashion_mnist.target.astype(int)


In [None]:
X_train, y_train = fashion_mnist.data[:60_000], targets[:60_000]
X_test, y_test = fashion_mnist.data[60_000:], targets[60_000:]


In [None]:
import matplotlib.pyplot as plt

X_sample = X_train[0].reshape(28, 28)
plt.imshow(X_sample, cmap="binary")
plt.show()


Fashion MNIST class labels:


In [None]:
class_names = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"
]


In [None]:
class_names[y_train[0]]


We now build an MLP classifier with two hidden layers and train it using pixel values scaled to the 0–1 range.


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

mlp_clf = MLPClassifier(
    hidden_layer_sizes=[300, 100],
    verbose=True,
    early_stopping=True,
    random_state=42
)

pipeline = make_pipeline(MinMaxScaler(), mlp_clf)
pipeline.fit(X_train, y_train)

pipeline.score(X_test, y_test)


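MinMaxScaler works better than StandardScaler for images because:
- Pixel values naturally lie in [0, 255], so they map cleanly onto the 0–1 range
- Some pixels have very low variance
- Standard scaling divides by each pixel's standard deviation, which would blow up those near-constant pixels and overemphasize them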


In [None]:
X_new = X_test[:15]
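# Predict through the pipeline so the new images get the same MinMaxScaler
# scaling as the training data (calling mlp_clf directly would skip it)
pipeline.predict(X_new)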


In [None]:
# Use the pipeline so the inputs are scaled before reaching the classifier
y_proba = pipeline.predict_proba(X_new)
y_proba[12]


Neural networks are often **overconfident** in their predictions.

One mitigation technique is **label smoothing**, where the true class probability is reduced slightly and the remaining probability mass is distributed across other classes.
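
A minimal sketch of label smoothing, following the description above: the true class gets probability 1 − ε and the remaining ε is split evenly across the other classes. The function name and the default ε = 0.1 are choices made for this example. Another common variant spreads ε across all classes, including the true one.

```python
import numpy as np


def smooth_labels(y, n_classes, epsilon=0.1):
    """Turn integer class labels into smoothed target distributions.

    The true class gets 1 - epsilon; the remaining epsilon is split
    evenly across the other classes.
    """
    targets = np.full((len(y), n_classes), epsilon / (n_classes - 1))
    targets[np.arange(len(y)), y] = 1 - epsilon
    return targets


y = np.array([9, 0, 3])                 # three example Fashion MNIST labels
smoothed = smooth_labels(y, n_classes=10)
print(smoothed[0].round(3))
print(smoothed.sum(axis=1))             # each row still sums to 1
```

Training against these softened targets instead of hard one-hot vectors discourages the model from pushing its predicted probabilities all the way to 0 or 1.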


# Hyperparameter Tuning Guidelines

Neural networks are extremely flexible, but that flexibility comes with a cost: **many hyperparameters to tune**.  
Even a simple MLP has dozens of choices, including:

- Number of layers
- Number of neurons per layer
- Activation functions
- Optimizer and learning rate
- Batch size
- Regularization methods

This section gives **practical rules of thumb** to narrow down good choices.


## Number of Hidden Layers

For many problems, you can start with **one hidden layer** and already get reasonable results.  
In theory, a single hidden layer with enough neurons can approximate any continuous function.

However, **deep networks are more parameter-efficient**:
- They reuse features across layers
- They need fewer neurons overall
- They often generalize better

### Why depth helps
Lower layers learn simple features (edges, curves),  
higher layers combine them into complex concepts (faces, objects).

This layered structure:
- Speeds up convergence
- Improves generalization
- Enables **transfer learning**


### Transfer Learning

If you already trained a model on a similar task, you can reuse its lower layers.

Example:
- A face recognition model → hairstyle detection
- Reuse early layers that detect edges and shapes
- Train only the higher layers

This dramatically reduces:
- Training time
- Required data


### Practical Rule

- Start with **1–2 hidden layers**
- Increase depth **only if underfitting**
- For very complex tasks (vision, speech), deep architectures are required
- In practice, large models are rarely trained from scratch


## Number of Neurons per Hidden Layer

The input and output sizes are fixed by the task:
- MNIST: 784 inputs, 10 outputs

Hidden layers are more flexible.


### The “Stretch Pants” Strategy

Use **slightly larger networks than needed**, then rely on:
- Early stopping
- ℓ2 regularization

This avoids bottleneck layers that permanently lose information.

Example:
- PCA shows that preserving 95% of the variance in Fashion MNIST takes 187 dimensions
- First hidden layer with 200 neurons avoids information loss
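
One way to get this kind of estimate is to ask Scikit-Learn's `PCA` how many components are needed for a given share of the variance. The helper below is a sketch (its name and the 95% default are assumptions), and the demo uses synthetic low-rank data so that it runs without downloading anything.

```python
import numpy as np
from sklearn.decomposition import PCA


def dims_to_preserve(X, variance=0.95):
    """Number of principal components needed to keep `variance` of the variance."""
    pca = PCA(n_components=variance)   # a float in (0, 1) is a variance ratio
    pca.fit(X)
    return pca.n_components_


# Synthetic stand-in: 50 features that really only span 3 directions
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50))
print(dims_to_preserve(X_demo))   # at most 3

# On the real data: dims_to_preserve(X_train) with the Fashion MNIST training set
```

A first hidden layer at least as wide as the returned number is then a reasonable starting point for avoiding an early bottleneck.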


### Rule of Thumb

- Increase neurons **until overfitting starts**
- Prefer **more layers over more neurons**
- Bottlenecks can help denoising—but don’t overdo it


## Learning Rate

The learning rate is one of the **most important hyperparameters**.

A good rule:
> Optimal learning rate ≈ **½ of the maximum stable learning rate**


### Learning Rate Finder Strategy

1. Start with a very small learning rate (e.g., 1e-5)
2. Gradually increase it each iteration
3. Plot loss vs learning rate (log scale)
4. Pick a learning rate **slightly before loss explodes**


In [None]:
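# Example: sweeping the learning rate with Scikit-Learn.
# Note: X_train, y_train should hold a (scaled) regression training set here.
import warnings
from sklearn.exceptions import ConvergenceWarning

mlp = MLPRegressor(
    hidden_layer_sizes=(100,),
    warm_start=True,  # each fit() call continues from the current weights
    max_iter=1,       # one epoch per fit() call
    random_state=42
)

learning_rate = 1e-5
learning_rates, losses = [], []

with warnings.catch_warnings():
    # max_iter=1 triggers a ConvergenceWarning on every fit() call
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    for _ in range(500):
        mlp.learning_rate_init = learning_rate
        try:
            mlp.fit(X_train, y_train)  # fit() rebuilds the optimizer with the new rate
        except ValueError:
            break  # the weights diverged to non-finite values
        learning_rates.append(learning_rate)
        losses.append(mlp.loss_)
        learning_rate *= 1.05


> ⚠️ In Scikit-Learn, changing the learning rate between epochs this way relies on `warm_start=True`, `max_iter=1`, and repeated `fit()` calls. `partial_fit()` keeps the optimizer it created on its first call, so later changes to `learning_rate_init` have no effect there.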
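
Steps 3 and 4 of the learning rate finder can be sketched as a small helper that takes the recorded learning rates and losses from a sweep. Everything here is an assumption made for illustration: the function name, the use of the lowest-loss point as the "turning point", and the `divisor=10` safety margin are heuristics, not a fixed rule. The sweep itself is synthetic.

```python
import numpy as np


def suggest_learning_rate(learning_rates, losses, divisor=10):
    """Heuristic pick from a learning-rate sweep.

    Assumption: the turning point is taken to be the learning rate with the
    lowest recorded loss, and the suggestion is `divisor` times smaller.
    """
    learning_rates = np.asarray(learning_rates)
    losses = np.asarray(losses)
    turning_point = learning_rates[np.nanargmin(losses)]
    return turning_point / divisor


# Synthetic sweep standing in for real measurements
learning_rates = 1e-5 * 1.05 ** np.arange(500)
losses = (np.log10(learning_rates) + 2) ** 2 + 0.1   # minimum near 1e-2

print(suggest_learning_rate(learning_rates, losses))
```

With real measurements, plotting `losses` against `learning_rates` with a logarithmic x-axis (for example `plt.plot(learning_rates, losses)` followed by `plt.xscale("log")`) is still worth doing, since a noisy curve can make the lowest-loss point misleading.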


## Batch Size

Batch size affects:
- Training speed
- Stability
- Generalization


### Small vs Large Batches

**Small batches (2–32):**
- Better generalization
- More stable training
- Slower per-epoch speed

**Large batches (1,000+):**
- Faster on GPUs
- Can be unstable early on
- Often require learning-rate warmup


### Practical Strategy

1. Start with a **moderate batch size**
2. Increase batch size if training is slow
3. Reduce batch size if:
   - Training is unstable
   - Validation performance degrades


## Other Important Hyperparameters


### Optimizer

Advanced optimizers (Adam, RMSProp) often:
- Converge faster
- Need less tuning

But:
- They still depend heavily on learning rate


> ⚠️ If you change **any hyperparameter**, re-tune the learning rate.


## Recommended Reading

- Leslie Smith (2018): *A disciplined approach to neural network hyperparameters*
- Google: *Deep Learning Tuning Playbook*
- Andrew Ng: *Machine Learning Yearning*


## Summary

- Start simple, scale gradually
- Prefer depth over width
- Tune learning rate early and often
- Use early stopping and regularization
- Always validate on unseen data
