# Mohammad Mahdi Razmjoo - 400101272

**Multilayer Perceptron with Scikit-Learn**

binary classification

In [21]:
# Binary classification on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Load data
X, y_multi = load_iris(return_X_y=True)

# Convert to binary target: class “setosa” (0) vs. “not-setosa” (1)
y = (y_multi != 0).astype(int)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Train MLP classifier
clf = MLPClassifier(hidden_layer_sizes=(50, 25),
                    max_iter=1000,
                    early_stopping=True,
                    random_state=42)
clf.fit(X_train, y_train)

# Evaluate
print("F1-score:", f1_score(y_test, clf.predict(X_test)))

F1-score: 0.975609756097561


## Step-by-Step Explanation & Take-aways

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load data**<br>`X, y_multi = load_iris(return_X_y=True)` | Bring the classic Iris measurements (sepal L/W, petal L/W) and their species labels into memory. | We have a small, clean benchmark dataset to test ideas rapidly. |
| **Binarize the target**<br>`y = (y_multi != 0).astype(int)` | Collapse the 3-class label to a binary task: *setosa vs. non-setosa*. This simplifies evaluation (single F1-score) and satisfies the assignment’s “binary classification” requirement. | We see how label engineering (re-labelling) can adapt the same data to different problem formulations. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2, stratify=y)` | Reserve 20 % of the samples as an untouched test set. Stratification preserves the class ratio (one setosa sample to two non-setosa samples) in both splits. | We obtain an unbiased estimate of generalisation performance and obey the rubric’s 20 %-test rule. |
| **Feature scaling**<br>`StandardScaler().fit_transform` | The MLP’s gradient descent converges faster and more stably when features are zero-mean and unit-variance. | We experience first-hand how preprocessing affects optimisation and model quality. |
| **Define & train the MLP**<br>`MLPClassifier(hidden_layer_sizes=(50,25), ...)` | Build a 4-layer (input → 50 → 25 → 1) feed-forward network, train with early stopping to prevent overfitting. | We practise choosing hidden sizes, epochs and regularisation tricks that matter in real scenarios. |
| **Evaluate on the test set**<br>`f1_score(y_test, clf.predict(X_test))` | The F1-score balances precision and recall, making it a robust metric for class-imbalance scenarios. | We verify that the model exceeds the ≥ 0.75 F1 threshold and internalise how metrics inform model acceptance. |

### Key understanding
* **Data preparation (splits, scaling)** is as critical as the model itself.  
* A small network can achieve high performance (≈ 0.98 F1 in the run above) on a simple, well-separated dataset, which shows that model complexity should match data complexity.  
* Early stopping turns training into an empirical search for the sweet-spot epoch, illustrating the importance of validation monitoring.
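
The early-stopping behaviour described above can be inspected after fitting. The sketch below is a minimal illustration that reuses the same data preparation and hyper-parameters as the cell above; it relies on scikit-learn's defaults, under which 10 % of the training data is held out for validation and training stops after 10 epochs without sufficient improvement. The printed values depend on the library version, so none are quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Same binary task and split as in the classification cell
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train = StandardScaler().fit_transform(X_train)

clf = MLPClassifier(hidden_layer_sizes=(50, 25),
                    max_iter=1000,
                    early_stopping=True,   # holds out part of X_train for validation
                    random_state=42)
clf.fit(X_train, y_train)

# Attributes populated by the fit when early_stopping=True
print("Epochs actually run:", clf.n_iter_)
print("Best validation accuracy:", clf.best_validation_score_)
print("First validation scores:", clf.validation_scores_[:5])
```

If `n_iter_` is far below `max_iter`, early stopping ended training; the validation curve shows where the score stopped improving.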


regression

In [22]:
# Regression on the same Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

# Load data (again—but you could reuse X from the first cell)
iris   = load_iris()
X_full = iris.data
# Predict petal length (column index 2) from the other three features
y      = X_full[:, 2]
X      = X_full[:, [0, 1, 3]]

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Train MLP regressor
regr = MLPRegressor(hidden_layer_sizes=(50, 25),
                    max_iter=5000,
                    early_stopping=True,
                    random_state=42)
regr.fit(X_train, y_train)

# Evaluate
print("R²-score:", r2_score(y_test, regr.predict(X_test)))

R²-score: 0.9359089784533952


## Step-by-Step Explanation & Take-aways

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load data**<br>`iris = load_iris()` | Bring the four numeric Iris measurements into memory. | Same dataset lets us compare classification and regression on identical real data. |
| **Select target & predictors**<br>`y = petal length (col 2)`<br>`X = [sepal L, sepal W, petal W]` | Framing a regression task: predict one feature (petal length) from the other three. | Shows how the same table can be sliced into input/target pairs for different problems. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2)` | Hold out 20 % of the samples for unbiased evaluation, matching the rubric. | Reinforces best practice: never judge a model on the data it learned from. |
| **Feature scaling**<br>`StandardScaler()` | Centering and scaling accelerates gradient descent and avoids one feature dominating the loss. | Observe the vital role of preprocessing in neural-network optimisation. |
| **Define & train the MLP regressor**<br>`MLPRegressor(hidden_layer_sizes=(50,25), …)` | Builds a 4-layer feed-forward network (input → 50 → 25 → 1) and trains with early stopping. | Practice in selecting hidden sizes and regularisation to reach high R² (> 0.8). |
| **Evaluate on the test set**<br>`r2_score(...)` | R² measures the proportion of variance explained by the model. | Confirms the network meets the ≥ 0.80 requirement and strengthens intuition for regression metrics. |

### Key understanding
* **Feature choice shapes difficulty** – petal length is strongly correlated with the other measurements, petal width in particular, so the target is easy to predict and the high R² (≈ 0.94) follows.  
* **Early stopping guards against over-fitting** – training halts when validation loss stops improving, a simple yet powerful regularisation technique.  
* **Same architecture, different task** – by swapping loss functions and output activations, a feed-forward network can shift seamlessly between classification and regression.
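
One quick way to check the claim about feature correlations is to compute the Pearson correlation between petal length and each predictor. This is an illustrative check and not part of the original pipeline; the `corrs` dictionary is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X_full = iris.data
target = X_full[:, 2]            # petal length, the regression target

corrs = {}
for col in [0, 1, 3]:            # sepal length, sepal width, petal width
    r = np.corrcoef(X_full[:, col], target)[0, 1]
    corrs[iris.feature_names[col]] = r
    print(f"{iris.feature_names[col]:<20s} r = {r:+.3f}")
```

Petal width shows the strongest linear relationship with the target, which helps explain why a small network reaches a high R² on this task.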

**4-layer feedforward network with Keras**

binary classification

In [23]:
# 4-layer feedforward network – binary classification (setosa vs. non-setosa)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras import layers, models

# Load data and create binary target
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Build 4-layer model (3 hidden + 1 output)
model = models.Sequential([
    layers.Input(shape=(4,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(8,  activation='relu'),
    layers.Dense(1,  activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train
model.fit(X_train, y_train, epochs=100, batch_size=16, verbose=0)

# Evaluate
y_pred = (model.predict(X_test) > 0.5).astype(int)
print("F1-score:", f1_score(y_test, y_pred))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 81ms/step
F1-score: 1.0


## Step-by-Step Explanation & Take-aways: 4-Layer Keras Feed-Forward Classifier

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load data & binarise target**<br>`load_iris()` → `y = (y_multi != 0)` | Convert the 3-class Iris problem into a binary task: *setosa* (0) vs. *non-setosa* (1). | Demonstrates how the same dataset can be reframed for different objectives; simplifies evaluation to a single F1-score. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2, stratify=y)` | Reserve 20 % of samples for an unbiased test set and preserve class ratio with `stratify`. | Ensures the reported F1-score reflects true generalisation, not memorisation. |
| **Feature scaling**<br>`StandardScaler()` | Standardise each feature to zero mean and unit variance. Neural nets train faster and more stably when inputs are on comparable scales. | Reinforces the importance of preprocessing for gradient-based optimisation. |
| **Define 4-layer model**<br>`Dense(32) → Dense(16) → Dense(8) → Dense(1)` | Create three hidden layers plus one sigmoid output layer, satisfying the “4-layer” criterion. Hidden sizes (32-16-8) strike a balance between capacity and overfitting risk. | Shows the effect of depth and width choices on expressiveness; illustrates that a modest network often suffices for simple tabular data. |
| **Compile**<br>`optimizer='adam', loss='binary_crossentropy'` | Adam offers adaptive learning rates; binary cross-entropy matches the Bernoulli target distribution. | Underlines the tight coupling between loss function and task type. |
| **Train**<br>`fit(..., epochs=100, batch_size=16)` | Optimise weights for a fixed 100 epochs with mini-batches of 16; no early-stopping callback is used, so every epoch runs. | Observe how epoch count and batch size influence convergence speed and generalisation. |
| **Predict & evaluate**<br>`f1_score(y_test, y_pred)` | Convert probabilities to class labels (`> 0.5`) and compute F1-score, which balances precision and recall. | Confirms the model surpasses the ≥ 0.75 threshold; highlights F1 as a robust metric for potential class imbalance. |

### Key understanding
* **Architectural simplicity can be sufficient** — a shallow, fully-connected net (> 0.95 F1 in practice) handles the separable Iris data without convolutions or attention.  
* **Preprocessing and correct loss choice** are as crucial as the network depth.  
* **Hold-out testing** is non-negotiable for honest reporting; the 20 % split enforces this discipline.
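
To make "F1 balances precision and recall" concrete, the sketch below computes the score by hand from true positives, false positives and false negatives and compares it with `f1_score`. The label arrays are made-up toy values, not predictions from the model above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels for illustration only (1 = non-setosa, 0 = setosa)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly predicted positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # negatives predicted as positive
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # positives predicted as negative

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)  # harmonic mean

print("manual F1 :", f1_manual)
print("sklearn F1:", f1_score(y_true, y_pred))
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why it is preferred over accuracy when the classes are unevenly sized.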

regression

In [24]:
# 4-layer feedforward network – regression predicting petal length
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.keras import layers, models

# Load data
iris   = load_iris()
X_full = iris.data
y      = X_full[:, 2]
X      = X_full[:, [0, 1, 3]]

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Build 4-layer model (3 hidden + 1 output)
model = models.Sequential([
    layers.Input(shape=(3,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(8,  activation='relu'),
    layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')

# Train
model.fit(X_train, y_train, epochs=500, batch_size=16, verbose=0)

# Evaluate
y_pred = model.predict(X_test).flatten()
print("R²-score:", r2_score(y_test, y_pred))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 76ms/step
R²-score: 0.9755104234980874


## Step-by-Step Explanation & Take-aways: 4-Layer Keras Feed-Forward Regressor

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load data**<br>`load_iris()` | Bring the four numeric Iris measurements into memory. | Re-using the same dataset keeps the experimental context consistent across tasks. |
| **Define target & predictors**<br>`y = petal length (col 2)`<br>`X = [sepal L, sepal W, petal W]` | Frame a regression problem: predict petal length from the other three dimensions. | Shows how the same table can support a different learning objective by simply re-selecting columns. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2)` | Reserve 20 % of samples for an unbiased evaluation set. | Upholds the rubric’s rule and yields a trustworthy R² estimate. |
| **Feature scaling**<br>`StandardScaler()` | Standardise inputs to zero mean and unit variance. Neural nets converge faster and more stably with scaled features. | Reinforces preprocessing as a key ingredient for effective gradient descent. |
| **Define 4-layer model**<br>`Dense(32) → 16 → 8 → 1` | Three hidden layers + one linear output layer satisfy the “4-layer” requirement; ReLU activations provide non-linearity. | Highlights how modest depth (and width) can model the non-linear relationships in small tabular data. |
| **Compile**<br>`optimizer='adam', loss='mse'` | Adam gives adaptive learning rates; mean-squared error is the standard loss for regression. | Matches the optimisation objective to the task’s statistical assumptions. |
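| **Train**<br>`fit(..., epochs=500, batch_size=16)` | Run a fixed 500 epochs on the tiny dataset; mini-batches of 16 give noisier gradient estimates than full-batch training but many more weight updates per epoch. | No early-stopping callback is configured here, so the only protection against over-training is that the loss plateaus; the test R² suggests this was sufficient for this run. |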
| **Predict & evaluate**<br>`r2_score(y_test, y_pred)` | R² quantifies the proportion of variance explained by the model. | Confirms the net surpasses the ≥ 0.80 hurdle and deepens intuition for regression metrics. |

### Key understanding
* **Column selection = problem formulation** – by slicing the same matrix differently, we switch seamlessly from classification to regression.  
* **Simple feed-forward nets handle small tabular tasks** – the high R² (≈ 0.98 in the run above) shows that even “classic” MLPs remain competitive on low-dimensional data.  
* **Preprocessing, loss choice and evaluation protocol** are just as vital as the network architecture for achieving reliable performance.
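
The R² metric used above is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces `r2_score` by hand on made-up toy values (not outputs of the network above) to show what "proportion of variance explained" means.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy petal-length values in cm, for illustration only
y_true = np.array([1.4, 4.7, 5.1, 1.5, 4.0, 6.0])
y_pred = np.array([1.6, 4.5, 5.4, 1.3, 4.2, 5.7])

ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print("manual R² :", r2_manual)
print("sklearn R²:", r2_score(y_true, y_pred))
```

An R² of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values mean it is worse.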

**4-layer feedforward network with PyTorch**

binary classification

In [25]:
# 4-layer feed-forward network – binary classification (setosa vs. non-setosa)
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

# Reproducibility
torch.manual_seed(42);  np.random.seed(42)

# ---------- Data ----------
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test  = scaler.transform(X_test).astype(np.float32)

# Tensors
X_train_t = torch.tensor(X_train)
y_train_t = torch.tensor(y_train).view(-1, 1)
X_test_t  = torch.tensor(X_test)

# ---------- Model ----------
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 32),  nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16,  8), nn.ReLU(),
            nn.Linear( 8,  1), nn.Sigmoid()
        )
    def forward(self, x): return self.net(x)

model = Net()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# ---------- Training ----------
for _ in range(300):
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

# ---------- Evaluation ----------
with torch.no_grad():
    preds = (model(X_test_t) > 0.5).cpu().numpy().astype(int)
print("F1-score:", f1_score(y_test.astype(int), preds))

F1-score: 1.0


## Step-by-Step Explanation & Take-aways: 4-Layer PyTorch Feed-Forward Classifier

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Reproducibility seeds**<br>`torch.manual_seed(42); np.random.seed(42)` | Fix pseudo-random sequences in NumPy and PyTorch. | Ensures results can be replicated and debugging is deterministic. |
| **Load & binarise data**<br>`load_iris()` → `y = (y_multi != 0)` | Collapse the 3-class Iris task to *setosa* (0) vs. *non-setosa* (1). | Illustrates label engineering; prepares data for binary cross-entropy loss. |
| **Train/test split (80 % / 20 %)**<br>`test_size=0.2, stratify=y` | Hold out 20 % for an unbiased test set and preserve class ratios via stratification. | Satisfies rubric and guarantees fair F1 evaluation. |
| **Feature scaling**<br>`StandardScaler()` | Standardise features to zero mean / unit variance to stabilise gradient descent. | Demonstrates the importance of preprocessing for neural-network convergence. |
| **Tensor conversion**<br>`torch.tensor(...)` | Move NumPy arrays into PyTorch tensors—PyTorch’s computation primitive. | Enables GPU/CPU tensor arithmetic in the training loop. |
| **Define 4-layer network**<br>`Linear 4→32→16→8→1` | Three hidden layers plus a sigmoid output layer = 4 layers total. ReLU brings non-linearity; sigmoid outputs Bernoulli probabilities. | Shows how to express MLP architecture in PyTorch’s `nn.Sequential`. |
| **Loss & optimiser**<br>`BCELoss`, `Adam(lr=0.01)` | Binary cross-entropy matches the Bernoulli target; Adam adapts learning rates. | Emphasises pairing the right loss with the task and using modern optimisers to speed convergence. |
| **Training loop (300 epochs)** | Manually zero grads, forward pass, backward pass, optimiser step. | Reinforces core mechanics of gradient-based learning and how few epochs suffice for a small dataset. |
| **Prediction & evaluation**<br>`(model(X_test_t) > 0.5)` → `f1_score` | Convert probabilities to class labels and compute F1 on the unseen 20 % split. | Validates the model exceeds the ≥ 0.75 threshold (usually ≳ 0.95), illustrating that modest MLPs can perform well on separable tabular data. |

### Key understanding
* **PyTorch’s explicit training loop** exposes every gradient step, deepening intuition for back-prop mechanics.  
* **Preprocessing and architecture** jointly determine performance; a simple 4-layer MLP can excel on linearly-separable data.  
* **Reproducibility practices** (random seeds) are crucial when sharing experimental results.
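
The effect of seeding can be verified directly: two layers created after the same `torch.manual_seed` call receive identical initial weights. The helper `make_layer` below is a hypothetical name introduced for this sketch. Note that seeding alone may not guarantee bit-identical results on a GPU, where some operations are non-deterministic by default.

```python
import torch
import torch.nn as nn

def make_layer(seed):
    """Create a 4 -> 32 linear layer after resetting PyTorch's RNG."""
    torch.manual_seed(seed)
    return nn.Linear(4, 32)

a = make_layer(42)
b = make_layer(42)   # same seed: same initial weights
c = make_layer(7)    # different seed: different initial weights

same_seed_equal = torch.equal(a.weight, b.weight)
diff_seed_equal = torch.equal(a.weight, c.weight)
print("same seed -> equal weights:     ", same_seed_equal)
print("different seed -> equal weights:", diff_seed_equal)
```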

regression

In [26]:
# 4-layer feed-forward network – regression predicting petal length
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

torch.manual_seed(42);  np.random.seed(42)

# ---------- Data ----------
iris   = load_iris()
X_full = iris.data.astype(np.float32)
y      = X_full[:, 2]
X      = X_full[:, [0, 1, 3]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test  = scaler.transform(X_test).astype(np.float32)

X_train_t = torch.tensor(X_train)
y_train_t = torch.tensor(y_train).view(-1, 1)
X_test_t  = torch.tensor(X_test)

# ---------- Model ----------
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32),  nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16,  8), nn.ReLU(),
            nn.Linear( 8,  1)
        )
    def forward(self, x): return self.net(x)

model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# ---------- Training ----------
for _ in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

# ---------- Evaluation ----------
with torch.no_grad():
    preds = model(X_test_t).cpu().numpy().flatten()
print("R²-score:", r2_score(y_test, preds))

R²-score: 0.9644560217857361


## Step-by-Step Explanation & Take-aways: 4-Layer PyTorch Feed-Forward Regressor

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Seeds for reproducibility**<br>`torch.manual_seed(42); np.random.seed(42)` | Fix random number generators in NumPy & PyTorch. | Guarantees results can be replicated, debugging is deterministic. |
| **Load & slice data**<br>`load_iris()` → `y = petal length`, `X = [sepal L, sepal W, petal W]` | Formulate a regression problem: predict petal length from the other three numeric variables. | Shows that simple column selection can turn the same dataset into a new task. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2)` | Reserve 20 % of samples for a held-out evaluation set, matching rubric. | Provides an unbiased R² estimate of model generalisation. |
| **Feature scaling**<br>`StandardScaler()` | Standardise inputs to zero mean and unit variance to stabilise gradient descent. | Highlights how preprocessing accelerates learning and prevents feature-scale dominance. |
| **Tensor conversion**<br>`torch.tensor(...)` | Move NumPy arrays into PyTorch tensors, the library’s computation objects. | Enables GPU/CPU tensor math in the training loop. |
| **Define 4-layer network**<br>`Linear 3→32→16→8→1` with ReLU in hidden layers | Three hidden layers plus one linear output layer (4 layers total). ReLU provides non-linearity. | Demonstrates PyTorch’s modular `nn.Sequential` for MLP architecture. |
| **Loss & optimiser**<br>`MSELoss`, `Adam(lr=0.01)` | Mean-squared error is standard for regression; Adam adapts learning rates for faster convergence. | Emphasises pairing appropriate loss functions and optimisers with the task. |
| **Training loop (1 000 epochs)** | Explicitly zero gradients, forward pass, back-propagate, and update weights. | Builds intuition for every step of gradient descent; shows how many passes a tiny dataset may need. |
| **Prediction & evaluation**<br>`r2_score(y_test, preds)` | Compute R² on the unseen 20 % split to measure variance explained. | Verifies the model surpasses the ≥ 0.80 requirement (typically ≳ 0.90) and cements understanding of regression metrics. |

### Key understanding
* **Column selection drives task definition** – with only slicing, we jump from classification to regression.  
* **Explicit training loops in PyTorch** reveal the mechanics of back-prop and how learning rates and epochs affect convergence.  
* **Preprocessing remains crucial** even for small networks; proper scaling can be the difference between divergence and a high R².  
* **Modest MLPs are often enough** – a 4-layer network easily captures the non-linear correlations within low-dimensional tabular data.
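
To put "modest" into numbers, the sketch below rebuilds the same `3 → 32 → 16 → 8 → 1` stack and counts its trainable parameters. Each `nn.Linear(in, out)` contributes `in * out` weights plus `out` biases.

```python
import torch.nn as nn

# Same layer sizes as the regressor above
net = nn.Sequential(
    nn.Linear(3, 32),  nn.ReLU(),   # 3*32  + 32 = 128
    nn.Linear(32, 16), nn.ReLU(),   # 32*16 + 16 = 528
    nn.Linear(16, 8),  nn.ReLU(),   # 16*8  + 8  = 136
    nn.Linear(8, 1),                # 8*1   + 1  = 9
)

n_params = sum(p.numel() for p in net.parameters())
print("Trainable parameters:", n_params)  # 128 + 528 + 136 + 9 = 801
```

With 801 parameters and 120 training samples the network is small by deep-learning standards, yet it already has more parameters than samples, which is one reason held-out evaluation matters even here.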

**4-layer non-sequential feedforward network with Keras**

binary classification

In [27]:
# 4-layer non-sequential network – binary classification (setosa vs. non-setosa)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

# ---------- Data ----------
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr);  X_te = scaler.transform(X_te)

# ---------- Model (Functional API, 4 layers total) ----------
inp = Input(shape=(4,))
x   = layers.Dense(32, activation="relu")(inp)
x   = layers.Dense(16, activation="relu")(x)
x   = layers.Dense(8,  activation="relu")(x)
out = layers.Dense(1,  activation="sigmoid")(x)
model = Model(inputs=inp, outputs=out)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=100, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = (model.predict(X_te) > 0.5).astype(int)
print("F1-score:", f1_score(y_te, y_pred))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 80ms/step
F1-score: 1.0


## Step-by-Step Explanation & Take-aways: 4-Layer Keras Functional-API Classifier

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load & binarise data**<br>`load_iris()` → `y = (y_multi != 0)` | Turn the 3-class Iris task into a binary one: *setosa* vs. *non-setosa*. | Demonstrates label engineering; simplifies evaluation to a single F1-score. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., stratify=y)` | Keep 20 % as an untouched test set and preserve the class ratio by stratifying. | Provides an unbiased generalisation metric and satisfies the rubric’s 20 % rule. |
| **Feature scaling**<br>`StandardScaler()` | Standardise features to zero mean and unit variance for faster, stabler gradient descent. | Reinforces the importance of preprocessing in neural-network optimisation. |
| **Define 4-layer model with Functional API**<br>`Input → Dense32 → Dense16 → Dense8 → Dense1(sigmoid)` | Three hidden layers plus one sigmoid output layer (4 layers total). Using Functional API (non-sequential) illustrates flexible graph construction—inputs/outputs are explicitly wired. | Shows how to create DAG-style models (skip connections, multi-input, etc.) beyond what `Sequential` allows. |
| **Compile**<br>`optimizer="adam", loss="binary_crossentropy"` | Adam provides adaptive learning rates; binary cross-entropy matches the Bernoulli target distribution. | Underlines the direct link between task type and loss function. |
| **Train**<br>`fit(X_tr, y_tr, epochs=100, batch_size=16)` | Optimise the network weights over 100 epochs with mini-batches of 16. | Observes that modest training budgets suffice for small, separable tabular data. |
| **Predict & evaluate**<br>`f1_score(y_te, y_pred)` | Convert probabilities to class labels (`>0.5`) and compute F1 on the 20 % hold-out. | Confirms the model surpasses the ≥ 0.75 requirement (typically ≳ 0.95), highlighting F1 as a balanced metric when class distribution might be skewed. |

### Key understanding
* **Functional API unlocks architecture flexibility**—while this example is a straight feed-forward graph, the same pattern extends to multi-branch or residual designs.  
* **Model depth vs. dataset complexity** – a small 4-layer net is enough for the well-separated Iris features; deeper or wider nets add capacity that this task does not need and, with only 120 training samples, may overfit.  
* **Data handling, loss selection and evaluation protocol** are as critical as the network definition in achieving trustworthy performance.
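
As an illustration of the flexibility mentioned above, the sketch below wires a simple residual (skip) connection with the Functional API, something `Sequential` cannot express. The layer sizes are arbitrary choices for this example and the model is not trained or evaluated here.

```python
from tensorflow.keras import layers, Model, Input

inp = Input(shape=(4,))
h1 = layers.Dense(16, activation="relu")(inp)   # 4*16  + 16 = 80 parameters
h2 = layers.Dense(16, activation="relu")(h1)    # 16*16 + 16 = 272 parameters

# Skip connection: add the first hidden representation to the second
merged = layers.Add()([h1, h2])                 # no parameters

out = layers.Dense(1, activation="sigmoid")(merged)  # 16 + 1 = 17 parameters
skip_model = Model(inputs=inp, outputs=out)

print("Output shape:", skip_model.output_shape)
print("Parameters:", skip_model.count_params())  # 80 + 272 + 17 = 369
```

`layers.Add` requires both inputs to have the same shape, which is why the two hidden layers share a width of 16.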

regression

In [28]:
# 4-layer non-sequential network – regression predicting petal length
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

# ---------- Data ----------
iris = load_iris()
X    = iris.data.astype(float)[:, [0, 1, 3]]
y    = iris.data.astype(float)[:, 2]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr);  X_te = scaler.transform(X_te)

# ---------- Model (Functional API, 4 layers total) ----------
inp = Input(shape=(3,))
x   = layers.Dense(32, activation="relu")(inp)
x   = layers.Dense(16, activation="relu")(x)
x   = layers.Dense(8,  activation="relu")(x)
out = layers.Dense(1)(x)
model = Model(inputs=inp, outputs=out)

model.compile(optimizer="adam", loss="mse")
model.fit(X_tr, y_tr, epochs=500, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = model.predict(X_te).flatten()
print("R²-score:", r2_score(y_te, y_pred))



[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 122ms/step
R²-score: 0.973116774108986


## Step-by-Step Explanation & Take-aways: 4-Layer Keras Functional-API Regressor

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load & slice data**<br>`load_iris()` → `X = [sepal L, sepal W, petal W]`, `y = petal length` | Frame a regression task: predict petal length (target) from the other three numeric measurements (features). | Demonstrates how simple column selection turns the same dataset into a new supervised-learning problem. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2)` | Reserve 20 % of the samples for unbiased evaluation. | Satisfies the rubric’s rule and yields a trustworthy R² estimate of generalisation. |
| **Feature scaling**<br>`StandardScaler()` | Standardise inputs to zero mean and unit variance to stabilise gradient descent. | Reinforces that preprocessing is vital for neural-network convergence, especially on small tabular data. |
| **Define 4-layer model (Functional API)**<br>`Input → Dense32 → Dense16 → Dense8 → Dense1` | Three hidden ReLU layers plus one linear output layer (4 layers total). Functional API allows flexible, non-sequential graph construction. | Shows how to build feed-forward nets that could easily be extended with skip or multi-branch connections. |
| **Compile**<br>`optimizer="adam", loss="mse"` | Adam supplies adaptive learning rates; mean-squared error is the canonical loss for regression. | Emphasises pairing the correct loss function with the task type. |
| **Train**<br>`fit(..., epochs=500, batch_size=16)` | Optimise network weights for a fixed 500 epochs with mini-batches of 16. | Illustrates that small datasets often converge quickly; in this run the extra epochs did little harm because the loss plateaus, although no early-stopping callback enforces this. |
| **Predict & evaluate**<br>`r2_score(y_te, y_pred)` | Compute R² on the 20 % hold-out to measure variance explained by the model. | Confirms the network exceeds the ≥ 0.80 requirement (typically ≳ 0.90) and builds intuition for regression metrics. |

### Key understanding
* **Functional API > Sequential for flexibility** – even though the graph is linear here, the same pattern enables complex architectures (multi-input, residual, etc.).  
* **Small MLPs can model modest nonlinearities** – a 4-layer net captures most variance in low-dimensional tabular data without risk of severe overfitting.  
* **Proper preprocessing and evaluation protocol** are just as critical as architecture: scaling the features and holding out a test split underpin the high R² score.
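
The Keras cells above train for a fixed number of epochs. A minimal sketch of how explicit early stopping could be added is shown below, using the `EarlyStopping` callback with a validation split carved out of the training data. The `patience=20` and `validation_split=0.2` values are assumptions chosen for illustration, not settings from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import layers, Model, Input
from tensorflow.keras.callbacks import EarlyStopping

# Same regression task and split as above
iris = load_iris()
X = iris.data[:, [0, 1, 3]]
y = iris.data[:, 2]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr = StandardScaler().fit_transform(X_tr)

inp = Input(shape=(3,))
x = layers.Dense(32, activation="relu")(inp)
x = layers.Dense(16, activation="relu")(x)
x = layers.Dense(8, activation="relu")(x)
out = layers.Dense(1)(x)
model = Model(inputs=inp, outputs=out)
model.compile(optimizer="adam", loss="mse")

# Stop once validation loss has not improved for 20 epochs (assumed value)
stopper = EarlyStopping(monitor="val_loss", patience=20,
                        restore_best_weights=True)
history = model.fit(X_tr, y_tr,
                    validation_split=0.2,   # assumed: 20 % of training data
                    epochs=500, batch_size=16,
                    callbacks=[stopper], verbose=0)

epochs_run = len(history.history["loss"])
print("Epochs actually run:", epochs_run)
```

With the callback in place, 500 becomes an upper bound on the number of epochs and the weights from the best validation epoch are restored.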

**Neural networks are powerful because they can approximate almost any function, build hierarchical feature representations directly from raw data, and scale effectively with modern hardware—but designing them is hard because the vast space of architectures and training settings offers few guarantees and many pitfalls.**

## Why neural networks are so powerful
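1. **Universal approximation** – A feed-forward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, provided it has enough hidden units and a suitable non-linear activation (the universal-approximation theorem). The theorem guarantees that such a network exists, not that training will find it.  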
2. **Depth builds hierarchies** – Stacking layers lets a model reuse simple patterns to construct higher-level features, which gives deep nets far greater expressive power per parameter than shallow ones.  
3. **End-to-end feature learning** – Back-propagation trains all layers jointly, so the network discovers task-specific features without manual engineering.  
4. **Scalability with compute** – GPUs/TPUs allow massive parallelism, so networks grow to billions of parameters while training times stay practical.  
5. **Cross-domain versatility** – The same core idea (with tweaks like convolutions or attention) solves vision, language, audio, and time-series tasks.
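
The universal-approximation point can be illustrated with a small experiment: a network with one hidden layer fitted to `sin(x)` on a closed interval. This is a sketch under assumed settings (50 tanh units, the `lbfgs` solver) and it measures the fit on the training points only, so it demonstrates approximation capacity and not generalisation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

# Target: a smooth non-linear function on the compact interval [-pi, pi]
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()

# One hidden layer only; the width and solver are illustrative choices
net = MLPRegressor(hidden_layer_sizes=(50,),
                   activation="tanh",
                   solver="lbfgs",
                   max_iter=5000,
                   random_state=0)
net.fit(x, y)

fit_r2 = r2_score(y, net.predict(x))
print("R² of the one-hidden-layer fit:", fit_r2)
```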

## What’s difficult about designing neural networks
1. **Huge design space** – Choosing depth, width, activations, optimizers, learning rates, regularizers, etc. is largely empirical; exhaustive search is expensive.  
2. **Training instabilities** – Deep nets can suffer vanishing/exploding gradients; careful initialization, normalization, and residual connections only partially tame this.  
3. **Overfitting vs. generalization** – Powerful models memorize easily; countering this typically takes more data, augmentation, or explicit regularization (dropout, weight decay, early stopping).  
4. **Compute and energy cost** – Cutting-edge models require substantial hardware, time, and power, limiting accessibility and raising environmental concerns.  
5. **Interpretability & safety** – Networks behave like black boxes; it remains hard to explain or verify their decisions, complicating debugging and trust.
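
The vanishing-gradient problem from point 2 can be seen with plain arithmetic. The following is a minimal sketch under simplifying assumptions: a chain of scalar sigmoid layers that all share one weight, with a hypothetical helper `input_gradient` that applies the chain rule layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, weight=1.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` scalar sigmoid layers."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        # Chain rule: d a_new / d a_prev = weight * sigmoid'(.) = weight * a_new * (1 - a_new)
        grad *= weight * a * (1.0 - a)
    return grad

for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: gradient = {input_gradient(depth):.3e}")
```

Because the sigmoid's derivative never exceeds 0.25, each extra layer with unit weight shrinks the gradient by at least a factor of four, so the earliest layers receive almost no learning signal in a deep chain.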

**3-layer Recurrent Neural Network with Keras**

**Bonus**

Although the Iris dataset is not a true time-series, we treat each feature as a “time-step” so the data can flow through an RNN while still meeting the “same dataset for all parts” requirement.
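
To make the reshaping idea concrete, here is a small sketch using only the first two Iris rows (typed in by hand, so nothing needs to be downloaded). It shows what "each feature becomes a time-step" means for the array shape; the variable names are illustrative.

```python
import numpy as np

# First two Iris rows: sepal length, sepal width, petal length, petal width (cm).
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2]])

# (samples, features) -> (samples, timesteps, features_per_step)
X_seq = X.reshape(X.shape[0], 4, 1)

print(X_seq.shape)      # (2, 4, 1)
print(X_seq[0, :, 0])   # the four measurements of sample 0, read as 4 steps
```

The values are unchanged; only the interpretation differs. The order of the "time-steps" is simply the column order of the dataset, which is an arbitrary choice for tabular data.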

binary classification

In [29]:
# 3-layer LSTM network – binary classification (setosa vs. non-setosa)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# ---------- Data ----------
X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)

# Scale each of the four “time-steps”
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reshape so each feature becomes a time-step: (samples, timesteps, features)
X_seq = X_scaled.reshape(X_scaled.shape[0], 4, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_seq, y, test_size=0.2, random_state=42, stratify=y
)

# ---------- Model (3 recurrent layers + 1 dense) ----------
inp = Input(shape=(4, 1))
x   = LSTM(32, return_sequences=True)(inp)
x   = LSTM(16, return_sequences=True)(x)
x   = LSTM(8)(x)
out = Dense(1, activation="sigmoid")(x)
model = Model(inp, out)

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=100, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = (model.predict(X_te) > 0.5).astype(int).ravel()  # flatten (n, 1) to (n,)
print("F1-score:", f1_score(y_te, y_pred))



[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 531ms/step
F1-score: 0.9523809523809523
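
As a sanity check on the printed value, F1 can be recomputed from confusion counts. The run did not print its confusion matrix, so the counts below are an inference rather than a recorded result: with a stratified 30-sample test split (10 setosa, 20 non-setosa), 20 true positives, 2 false positives and 0 false negatives give the same score. The helper names are hypothetical.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into hard 0/1 labels."""
    return [int(p > cutoff) for p in probabilities]

def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

labels = threshold([0.03, 0.51, 0.97])   # made-up probabilities
print(labels)                            # [0, 1, 1]

f1 = f1_from_counts(tp=20, fp=2, fn=0)   # inferred counts, see note above
print(f1)                                # 20/21, about 0.952
```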


## Step-by-Step Explanation & Take-aways: 3-Layer LSTM Classifier (Iris, Binary)

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load & binarise data**<br>`load_iris()` → `y = (y_multi != 0)` | Collapse the 3-class Iris labels to *setosa* (0) vs. *non-setosa* (1). | Illustrates how label engineering adapts a dataset to a binary problem. |
| **Feature scaling**<br>`StandardScaler()` | Standardise each feature to zero mean and unit variance before feeding it to the network. | Prevents one time-step (feature) from dominating gradients and speeds convergence. Because the scaler is fitted on all 150 samples before the split, the test rows contribute to the means and variances; fitting on the training portion only is the stricter protocol. |
| **Reshape to sequence**<br>`X_scaled.reshape(n_samples, 4, 1)` | Treat the **4 features as 4 time-steps** in a univariate sequence so an LSTM can process them. | Shows a way to shoe-horn tabular data into an RNN when no genuine temporal dimension exists, satisfying the “time-series-like” requirement. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., stratify=y)` | Keep 20 % of samples for an unbiased test set; stratification preserves class balance. | Ensures F1 is measured on unseen data per rubric. |
| **Define 3-layer LSTM + dense output**<br>`LSTM32 → LSTM16 → LSTM8 → Dense1(sigmoid)` | Three recurrent layers model sequential dependencies; final sigmoid outputs a probability. | Demonstrates stacking LSTMs (deep RNN) and finishing with a dense layer for binary classification. |
| **Compile**<br>`optimizer="adam", loss="binary_crossentropy"` | Adam adapts learning rates; binary cross-entropy matches Bernoulli targets. | Reinforces correct loss choice for classification probabilities. |
| **Train**<br>`fit(..., epochs=100, batch_size=16)` | Optimise weights on 80 % training data for 100 epochs. | Illustrates that even small RNNs can converge quickly on miniature datasets. |
| **Predict & evaluate**<br>`f1_score(y_te, y_pred)` | Threshold probabilities at 0.5, compute F1 on the 20 % hold-out. | Confirms the network exceeds the ≥ 0.75 F1 requirement (0.952 in this run) and shows F1 summarising precision and recall in a single number, which is useful when classes are imbalanced (here 1 : 2). |
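
To see why the first two LSTM layers need `return_sequences=True` while the last one does not, the sketch below runs a hand-written LSTM forward pass in NumPy. It is an illustration of the shapes only: the weights are random, the function `lstm_forward` is hypothetical, and it is not claimed to match Keras's internal weight layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, units, rng, return_sequences):
    """x has shape (batch, timesteps, features). Random weights, shapes only."""
    batch, timesteps, features = x.shape
    W = rng.normal(scale=0.1, size=(features, 4 * units))  # input weights
    U = rng.normal(scale=0.1, size=(units, 4 * units))     # recurrent weights
    b = np.zeros(4 * units)
    h = np.zeros((batch, units))   # hidden state
    c = np.zeros((batch, units))   # cell state
    outputs = []
    for t in range(timesteps):
        z = x[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(z, 4, axis=1)   # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # Either every step's hidden state, or only the last one.
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4, 1))   # 5 made-up samples, 4 time-steps, 1 feature
h1 = lstm_forward(x, 32, rng, return_sequences=True)
h2 = lstm_forward(h1, 16, rng, return_sequences=True)
h3 = lstm_forward(h2, 8, rng, return_sequences=False)
print(h1.shape, h2.shape, h3.shape)   # (5, 4, 32) (5, 4, 16) (5, 8)
```

A layer that feeds another recurrent layer must emit one vector per time-step (3-D output), whereas the layer feeding the dense head only needs its final hidden state (2-D output).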

### Key understanding
* **Sequentialising tabular data**—re-interpreting features as time-steps—lets us satisfy RNN requirements when no real chronology exists.  
* **Layer depth in RNNs** can boost representational capacity, yet small LSTMs already achieve high F1 on well-separated data.  
* **Evaluation discipline** (20 % test split) remains paramount: without it, high train metrics could mask poor generalisation.
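
One way to tighten the evaluation protocol is to split before scaling, so the scaler never sees the test rows. The sketch below mirrors the order used in the Scikit-Learn cell at the top of the notebook and is offered as an alternative preprocessing order, not as the code that produced the score above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y_multi = load_iris(return_X_y=True)
y = (y_multi != 0).astype(int)

# Split first, so test rows never influence the scaler's mean and variance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_tr)                  # statistics from training rows only
X_tr_seq = scaler.transform(X_tr).reshape(-1, 4, 1)  # (samples, timesteps, features)
X_te_seq = scaler.transform(X_te).reshape(-1, 4, 1)

print(X_tr_seq.shape, X_te_seq.shape)   # (120, 4, 1) (30, 4, 1)
```

On a dataset as small and clean as Iris the effect on the score is likely minor, but the habit matters on real data.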

regression

In [30]:
# 3-layer LSTM network – regression predicting petal length
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# ---------- Data ----------
iris = load_iris()
# Predict petal length (feature 2) from the other three features (0,1,3)
y = iris.data[:, 2]
X = iris.data[:, [0, 1, 3]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reshape to (samples, timesteps=3, features=1)
X_seq = X_scaled.reshape(X_scaled.shape[0], 3, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_seq, y, test_size=0.2, random_state=42
)

# ---------- Model (3 recurrent layers + 1 dense) ----------
inp = Input(shape=(3, 1))
x   = LSTM(32, return_sequences=True)(inp)
x   = LSTM(16, return_sequences=True)(x)
x   = LSTM(8)(x)
out = Dense(1)(x)
model = Model(inp, out)

model.compile(optimizer="adam", loss="mse")
model.fit(X_tr, y_tr, epochs=500, batch_size=16, verbose=0)

# ---------- Evaluation ----------
y_pred = model.predict(X_te).flatten()
print("R²-score:", r2_score(y_te, y_pred))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 445ms/step
R²-score: 0.974721401952081


## Step-by-Step Explanation & Take-aways: 3-Layer LSTM Regressor (Iris, Petal-Length)

| Code section | Why we do it | What we gain / understand |
|--------------|--------------|---------------------------|
| **Load & slice data**<br>`load_iris()` → `y = petal length`, `X = [sepal L, sepal W, petal W]` | Frame a regression task: predict petal length from the other three numeric measurements. | Demonstrates how the same matrix can be partitioned to define a new supervised problem. |
| **Feature scaling**<br>`StandardScaler()` | Standardise each predictor to zero mean / unit variance. | Prevents scale disparities from skewing gradient magnitudes and accelerates convergence. |
| **Reshape to sequences**<br>`X_scaled.reshape(n_samples, 3, 1)` | Treat the **3 predictors as 3 time-steps** in a single-feature sequence, so each sample has shape (*timesteps*, *features*) = (3, 1). | Satisfies the “time-series-like” requirement even though the data are tabular; lets an RNN process it. |
| **Train/test split (80 % / 20 %)**<br>`train_test_split(..., test_size=0.2)` | Hold out 20 % of samples for an unbiased R² evaluation. | Meets rubric and ensures we assess true generalisation, not memorisation. |
| **Define 3-layer LSTM + linear head**<br>`LSTM32 → LSTM16 → LSTM8 → Dense1` | Stack three recurrent layers (32, 16, 8 units) followed by a dense layer producing a scalar. | Shows how deep RNNs can model sequential dependencies (here, inter-feature relations). |
| **Compile**<br>`optimizer="adam", loss="mse"` | Adam adapts learning rates; mean-squared error is standard for regression. | Reinforces alignment of loss with task type. |
| **Train**<br>`fit(..., epochs=500, batch_size=16)` | Optimise weights for 500 epochs on the 80 % training portion (no early-stopping callback is set, so all 500 epochs run). | Illustrates that a small RNN can fit low-dimensional data; the long schedule gives the loss time to plateau. |
| **Predict & evaluate**<br>`r2_score(y_te, y_pred)` | Compute R² on the unseen 20 % subset: proportion of variance explained. | Confirms the network exceeds the ≥ 0.80 requirement (0.975 in this run) and deepens intuition for regression metrics. |
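
It helps to know how large this "small" network actually is. The sketch below counts trainable parameters with the standard LSTM formula (four gates, each with input weights, recurrent weights and a bias), assuming the Keras defaults used in the cell above (biases enabled, no extra options). The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

layers = [
    ("LSTM(32)", lstm_params(1, 32)),
    ("LSTM(16)", lstm_params(32, 16)),
    ("LSTM(8)", lstm_params(16, 8)),
    ("Dense(1)", dense_params(8, 1)),
]
total = sum(n for _, n in layers)

for name, n in layers:
    print(f"{name:9s} {n:5d}")
print(f"{'total':9s} {total:5d}")   # 8297 trainable parameters
```

With 120 training samples, the model has roughly 69 parameters per sample, which is why the hold-out evaluation matters even for a network this size.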

### Key understanding
* **Sequentialising features** lets us apply RNNs to otherwise static tabular data; the recurrence then models dependencies between features, although the order in which features are presented is an arbitrary choice.  
* **Layer depth in LSTMs** improves representational capacity; even so, modest widths suffice for this simple dataset.  
* **Preprocessing & evaluation discipline** remain vital: scaling and a strict 20 % hold-out underpin the high R² score and prevent misleading results.
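
For intuition about the metric itself, R² can be computed by hand as one minus the ratio of residual to total sum of squares. The values below are hypothetical petal lengths chosen for illustration, not predictions from the run above.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical petal lengths in cm.
y_true = [1.4, 4.7, 5.1, 6.0]
y_close = [1.5, 4.5, 5.2, 5.8]                       # small errors everywhere
y_mean = [sum(y_true) / len(y_true)] * len(y_true)   # always predict the mean

print(round(r2(y_true, y_close), 3))
print(r2(y_true, y_mean))   # 0.0: no better than predicting the mean
```

A score of 0 corresponds to always predicting the mean, 1 to a perfect fit, and a model worse than the mean baseline gets a negative score.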