<a href="https://colab.research.google.com/github/Anaya666/Anaya666/blob/main/Lab05_NN1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 LAB: Manual MLPs for Classification and Regression

In this lab, you will use `PyTorch` to manually implement a multi-layer perceptron (MLP) for three different tasks: binary classification, multi-class classification, and regression.

## General instructions to complete in ALL three tasks:

1. ***IMPLEMENTATION***:
   
Implement a separate class for each task:

  - `BinaryMLP` for the binary classification task  
  - `MultiClassMLP` for the multi-class classification task  
  - `RegressionMLP` for the regression task

   Each class must include the following methods:

  - `__init__` for initializing 1 or 2 hidden layers.
  - `forward` to transfer information from the input to the output layer.
  - `cost` computing the cost.
  - `fit` for training, using autograd and manual updates. **Use stochastic gradient descent to update your weights**. *N.B.* You can probably reuse much of the code from this week's tutorial.
  - `predict` to convert the information at the output layer into the required output.

2. ***DATA PREPARATION***

For each task, you will be provided with a toy dataset. For each dataset:

  - Split into training and test sets (use an 80/20 split)
  - Standardize the features properly, avoiding data leakage
  - Convert all data into `PyTorch` Tensors for compatibility

3. ***MODEL TRAINING AND EVALUATION***

Instantiate the model for each task and train it under different hyperparameter configurations. In each case, record the performance on both the training and test sets using the appropriate metric for the task (accuracy, MSE, etc.). You should explore the following configurations:

  - One hidden layer, varying the number of hidden units (use ReLU as their activation function)
  - Same number of hidden units across one or two hidden layers (use ReLU as their activation function)
  - Repeat the above setups using Tanh activation instead of ReLU

Present your results in a compact way (e.g. a summary table, a data frame etc).
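
As an illustration of such a summary table, here is a minimal sketch that collects per-configuration accuracies into a `pandas` data frame. The numbers are the ReLU accuracies for 5, 10, and 20 hidden units recorded in the multi-class section of this notebook; the `gap` column is an added convenience, not something the assignment requires.

```python
import pandas as pd

# Accuracies copied from the ReLU runs recorded in the multi-class
# section of this notebook (5, 10 and 20 hidden units per layer).
results = {
    5: {"train_acc": 0.8075, "test_acc": 0.7850},
    10: {"train_acc": 0.8013, "test_acc": 0.7950},
    20: {"train_acc": 0.8263, "test_acc": 0.7900},
}

# One row per configuration, one column per metric.
summary = pd.DataFrame.from_dict(results, orient="index")
summary.index.name = "hidden_units"

# The train/test gap is a quick indicator of overfitting.
summary["gap"] = summary["train_acc"] - summary["test_acc"]

print(summary.round(4))
```

The same pattern works for any metric (for example MSE in the regression task): each experiment loop only needs to fill a dictionary keyed by configuration.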

**NOTE**: When training your model, use a fixed learning rate of your choice (e.g., 0.01 is a reasonable starting point) and a reasonably large number of epochs (e.g., 100–200) based on how training and test performance evolve.

4. ***REFLECTION AND DISCUSSION***

Reflect on the impact of the different hyperparameter settings:

- How does the number of hidden units affect performance?
- What changes when using two layers instead of one?
- How does the activation function (ReLU vs. Tanh) influence results?

Please elaborate your answers.

---

**Collaboration Note**: This assignment is designed to support collaborative work. We encourage you to divide tasks among group members so that everyone can contribute meaningfully. Many components of the assignment can be approached in parallel or split logically across team members. Good coordination and thoughtful integration of your work will lead to a stronger final result.

**Ideally, each group member should be responsible for one of the separate tasks.** BUT, everyone should help each other along the way, both reviewing and refining results and discussion.

---

In total, this lab assignment will be worth **100 points**.

---
**Submission notes**:

* Write down all group members' names, or at least the group name (if you have one and you previously provided it), in the first cell of the notebook.

* Verify that the notebook runs as expected and that all required outputs are included.


In [None]:
NAME(s) =

## 1. Pre-implementation Group Discussion (15 points)

Discuss and agree on:

- What cost function should be used for each of the tasks below.
- What changes are needed in the output layer for each of these tasks. In particular, consider the number of units and the activation function.
- Why it is important to standardize the data before training each model.
- How you could detect overfitting when training your models.

1.
- Binary classification: binary cross-entropy
- Multi-class classification: categorical cross-entropy
- Regression: mean squared error
2.
- Binary classification: 1 output unit with a sigmoid activation, giving a probability that is thresholded to 0 or 1
- Multi-class classification: 3 output units (one per class) with a softmax activation
- Regression: 1 output unit (since we are predicting a single continuous value) with no activation function, i.e. the identity
3. To put all features on the same scale, so that features with numerically larger values do not exert more weight in the model.
4.
- Binary: high accuracy on the training data but low accuracy on the test data
- Multi-class: high accuracy on the training data but low accuracy on the test data
- Regression: a very high R^2 on the training data but a low R^2 on the test data, indicating poor generalisability

## 2. Binary Classification (25 points)

Use the dataset below to complete points 1 to 4 in the general instructions for this task.

Use as many cells as needed.

In [None]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

In [None]:
# USE AS MANY CELLS AS NEEDED


## 2. Multi-Class Classification (25 points)

Use the dataset below to complete points 1 to 4 in the general instructions for this task.

Use as many cells as needed.

---
**NOTE**: This will likely be the most challenging exercise. To help you, here are some pointers:

 - The **output (last) layer** should have as many units as there are classes in your data.  

 - The **activation function of the output layer must be softmax**, not sigmoid. Softmax ensures that all output values are between 0 and 1 and sum to 1, so they can be interpreted as probabilities across the different classes.  

 - For a multi-class problem, your target variable `y` should be **one-hot encoded**. For example:  
   - Label = 0 → [1, 0, 0]  
   - Label = 1 → [0, 1, 0]  
   - Label = 2 → [0, 0, 1]  
   You can easily achieve this with `OneHotEncoder` from `sklearn.preprocessing`.  

 - The **predicted class** corresponds to the unit with the highest probability.  
   Example:  
   - `[0.1, 0.3, 0.6] → class 2`  
   - `[0.6, 0.2, 0.2] → class 0`  

 - For this exercise, you need to **implement the categorical cross-entropy loss** (an extension of binary cross-entropy to multiple classes). It is defined as:  

   $$-\sum_{c=1}^{l} y_{o,c}\,\log(p_{o,c}),$$

   where $l$ is the number of classes, $y_{o,c}$ is the one-hot encoded label for observation $o$, and $p_{o,c}$ is the predicted probability for class $c$ (after applying softmax). The log is the natural logarithm.
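
A small worked example may help here. The sketch below takes the loss as the negative of the sum, so that it is non-negative and can be minimized, and averages it over observations. It reuses the two probability vectors from the pointers above; the helper name `categorical_cross_entropy` and the choice of true labels (2 and 0) are assumptions made for illustration.

```python
import torch

def categorical_cross_entropy(probs, y_onehot):
    """Mean over observations of -sum_c y_{o,c} * log(p_{o,c})."""
    return -(y_onehot * torch.log(probs)).sum(dim=1).mean()

# Two observations with assumed true labels 2 and 0, using the
# probability vectors from the examples above.
probs = torch.tensor([[0.1, 0.3, 0.6],
                      [0.6, 0.2, 0.2]])
labels = torch.tensor([2, 0])

# One-hot encode: label 2 -> [0, 0, 1], label 0 -> [1, 0, 0]
y_onehot = torch.nn.functional.one_hot(labels, num_classes=3).float()

loss = categorical_cross_entropy(probs, y_onehot)
predicted = torch.argmax(probs, dim=1)

print(loss.item())   # both rows contribute -ln(0.6), so about 0.5108
print(predicted)     # tensor([2, 0])
```

Because the label is one-hot, only the probability assigned to the true class enters the sum; the loss shrinks toward 0 as that probability approaches 1.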

---

In [2]:
import numpy as np
import matplotlib.pylab as plt
import torch

In [3]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=3,
                           n_clusters_per_class=1,
                           n_features=2, n_informative=2, n_redundant=0, random_state=1234, flip_y=0.15)

In [4]:
import torch

class MultiClassMLP(object):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim, learning_rate=0.1):
        """
        Initialize a 2-hidden-layer multi-class MLP.
        Args:
            input_dim: number of input features
            hidden_dim1: number of units in first hidden layer
            hidden_dim2: number of units in second hidden layer
            output_dim: number of output classes
            learning_rate: learning rate for SGD
        """
        # First hidden layer
        self.W_1 = torch.randn(input_dim, hidden_dim1, requires_grad=True)
        self.b_1 = torch.randn(hidden_dim1, requires_grad=True)

        # Second hidden layer
        self.W_2 = torch.randn(hidden_dim1, hidden_dim2, requires_grad=True)
        self.b_2 = torch.randn(hidden_dim2, requires_grad=True)

        # Output layer
        self.W_out = torch.randn(hidden_dim2, output_dim, requires_grad=True)
        self.b_out = torch.randn(output_dim, requires_grad=True)

        self.learning_rate = learning_rate

    # ---------------- Activation functions ----------------
    def relu(self, z):
        return torch.maximum(z, torch.tensor(0.0))

    def softmax(self, x):
        return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

    # ---------------- Forward pass ----------------
    def forward(self, X):
        # First hidden layer
        h1 = torch.matmul(X, self.W_1) + self.b_1
        h1_relu = self.relu(h1)

        # Second hidden layer
        h2 = torch.matmul(h1_relu, self.W_2) + self.b_2
        h2_relu = self.relu(h2)

        # Output layer
        out = torch.matmul(h2_relu, self.W_out) + self.b_out

        # Softmax probabilities
        return self.softmax(out)

    # ---------------- Loss ----------------
    def categorical_cross_entropy(self, y_pred, y_true):
        """
        Compute categorical cross-entropy loss.
        y_pred: model predictions after softmax, shape (N, C)
        y_true: one-hot encoded true labels, shape (N, C)
        """
        eps = torch.finfo(y_pred.dtype).eps
        y_pred = torch.clamp(y_pred, eps, 1 - eps)
        loss = -torch.mean(torch.sum(y_true * torch.log(y_pred), dim=1))
        return loss

    # ---------------- Training ----------------
    def fit(self, X, y, n_epochs=100):
        for epoch in range(n_epochs):
            randperm = torch.randperm(X.shape[0])
            for ii in randperm:
                x_batch = X[ii].unsqueeze(0)
                y_batch = y[ii].unsqueeze(0)

                # Forward pass
                probs = self.forward(x_batch)

                # Compute loss
                cost = self.categorical_cross_entropy(probs, y_batch)

                # Backward pass
                cost.backward()

                # Gradient descent update
                with torch.no_grad():
                    self.W_1 -= self.learning_rate * self.W_1.grad
                    self.b_1 -= self.learning_rate * self.b_1.grad
                    self.W_2 -= self.learning_rate * self.W_2.grad
                    self.b_2 -= self.learning_rate * self.b_2.grad
                    self.W_out -= self.learning_rate * self.W_out.grad
                    self.b_out -= self.learning_rate * self.b_out.grad

                    # Zero gradients
                    self.W_1.grad.zero_()
                    self.b_1.grad.zero_()
                    self.W_2.grad.zero_()
                    self.b_2.grad.zero_()
                    self.W_out.grad.zero_()
                    self.b_out.grad.zero_()

            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Cost: {cost.item():.4f}")

        return self

    # ---------------- Prediction ----------------
    def predict(self, X):
        with torch.no_grad():
            probs = self.forward(X)
            predicted = torch.argmax(probs, dim=1)
        return predicted
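
The `softmax` method above exponentiates the raw output-layer values directly. One plausible cause of the `nan` costs logged for the wider networks in this notebook is that `torch.exp` overflows to `inf` for large inputs, and `inf / inf` evaluates to `nan`. A common remedy is to subtract the row-wise maximum before exponentiating. The sketch below compares the two; `stable_softmax` is a hypothetical helper, not part of the class above, and whether overflow is the actual cause here has not been verified.

```python
import torch

def naive_softmax(x):
    # Same formula as the softmax method in the class above.
    return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

def stable_softmax(x):
    # Hypothetical helper: shift each row so its largest logit is 0.
    # Softmax is unchanged by adding a constant to every logit in a row.
    shifted = x - x.max(dim=1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=1, keepdim=True)

logits = torch.tensor([[1000.0, 0.0, -1000.0]])

naive = naive_softmax(logits)
stable = stable_softmax(logits)

print(torch.isnan(naive).any().item())   # True: exp(1000) overflows in float32
print(stable)                            # tensor([[1., 0., 0.]])
```

Smaller initial weights (for example scaling `torch.randn` by a small factor) would also keep the logits in a safer range, which matters more as the hidden layers get wider.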


In [5]:
# Part 2: data preparation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import torch
X, y = make_classification(n_samples=1000, n_classes=3,
                           n_clusters_per_class=1,
                           n_features=2, n_informative=2, n_redundant=0, random_state=1234, flip_y=0.15)
# Step 2: Split into train and test (avoid leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: Standardize features using only training data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: One-hot encode labels (multi-class targets)
encoder = OneHotEncoder(sparse_output=False)
y_train_onehot = encoder.fit_transform(y_train.reshape(-1, 1))
y_test_onehot = encoder.transform(y_test.reshape(-1, 1))

# Step 5: Convert everything to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train_onehot, dtype=torch.float32)
y_test = torch.tensor(y_test_onehot, dtype=torch.float32)

In [6]:
# Part 3.1: vary the number of hidden units (ReLU activation).
# NOTE: MultiClassMLP as defined above always has two hidden layers, so h sets the width of both.
import torch
from sklearn.metrics import accuracy_score

torch.manual_seed(1234)

# Hidden layer sizes to test
hidden_dims = [5, 10, 20, 50, 100]

results = {}

for h in hidden_dims:
    print(f"\nTraining model with {h} hidden units...")

    # 1. Instantiate model with 2 hidden layers
    multiclass_mlp_model = MultiClassMLP(
        input_dim=X_train.shape[1],
        hidden_dim1=h,         # first hidden layer
        hidden_dim2=h,         # second hidden layer (can be same or different)
        output_dim=3,          # 3 classes
        learning_rate=0.01
    )

    # 2. Train model
    multiclass_mlp_model.fit(X_train, y_train, n_epochs=50)

    # 3. Predict on training and test sets
    y_pred_train = multiclass_mlp_model.predict(X_train).numpy()
    y_pred_test = multiclass_mlp_model.predict(X_test).numpy()

    # Convert one-hot test labels to class indices
    y_true_train = torch.argmax(y_train, dim=1).numpy()
    y_true_test = torch.argmax(y_test, dim=1).numpy()

    # 4. Compute accuracy
    train_acc = accuracy_score(y_true_train, y_pred_train)
    test_acc = accuracy_score(y_true_test, y_pred_test)

    results[h] = {"train_acc": train_acc, "test_acc": test_acc}
    print(f"→ Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}")

# 5. Summary of results
print("\nPerformance summary:")
for h, metrics in results.items():
    print(f"Hidden units: {h:3d} | Train Acc: {metrics['train_acc']:.4f} | Test Acc: {metrics['test_acc']:.4f}")


Training model with 5 hidden units...
Epoch 0, Cost: 0.7541
Epoch 10, Cost: 0.1853
Epoch 20, Cost: 0.2302
Epoch 30, Cost: 1.8952
Epoch 40, Cost: 0.1648
→ Train Acc: 0.8075, Test Acc: 0.7850

Training model with 10 hidden units...
Epoch 0, Cost: 0.1088
Epoch 10, Cost: 0.2316
Epoch 20, Cost: 0.4442
Epoch 30, Cost: 1.9294
Epoch 40, Cost: 0.0988
→ Train Acc: 0.8013, Test Acc: 0.7950

Training model with 20 hidden units...
Epoch 0, Cost: 0.9895
Epoch 10, Cost: 0.0816
Epoch 20, Cost: 0.9297
Epoch 30, Cost: 1.1835
Epoch 40, Cost: 0.4996
→ Train Acc: 0.8263, Test Acc: 0.7900

Training model with 50 hidden units...
Epoch 0, Cost: nan
Epoch 10, Cost: nan
Epoch 20, Cost: nan
Epoch 30, Cost: nan
Epoch 40, Cost: nan
→ Train Acc: 0.3350, Test Acc: 0.3200

Training model with 100 hidden units...
Epoch 0, Cost: nan
Epoch 10, Cost: nan
Epoch 20, Cost: nan
Epoch 30, Cost: nan
Epoch 40, Cost: nan
→ Train Acc: 0.3350, Test Acc: 0.3200

Performance summary:
Hidden units:   5 | Train Acc: 0.8075 

In [7]:
#part 3.2 - Same number of hidden units across one or two hidden layers - ReLU as activation function
import torch
from sklearn.metrics import accuracy_score

torch.manual_seed(1234)

# List of hidden units to test
hidden_units = [5, 10, 20, 50]

results = {}

for h in hidden_units:
    print(f"\n=== Testing {h} hidden units per layer ===")

    # ---------- 1 hidden layer ----------
    model_1hl = MultiClassMLP(
        input_dim=X_train.shape[1],
        hidden_dim1=h,
        hidden_dim2=h,       # NOTE: MultiClassMLP always applies both hidden layers
        output_dim=3,
        learning_rate=0.01
    )

    # MultiClassMLP.forward cannot skip the second hidden layer, so this
    # "1 hidden layer" run is in fact another two-layer model with h units per layer.

    # Train model
    model_1hl.fit(X_train, y_train, n_epochs=50)

    # Predict
    y_pred_train = model_1hl.predict(X_train).numpy()
    y_pred_test = model_1hl.predict(X_test).numpy()

    y_true_train = torch.argmax(y_train, dim=1).numpy()
    y_true_test = torch.argmax(y_test, dim=1).numpy()

    train_acc = accuracy_score(y_true_train, y_pred_train)
    test_acc = accuracy_score(y_true_test, y_pred_test)

    results[f"{h}-1hl"] = {"train_acc": train_acc, "test_acc": test_acc}
    print(f"1 Hidden Layer → Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}")

    # ---------- 2 hidden layers ----------
    model_2hl = MultiClassMLP(
        input_dim=X_train.shape[1],
        hidden_dim1=h,
        hidden_dim2=h,  # second hidden layer
        output_dim=3,
        learning_rate=0.01
    )

    model_2hl.fit(X_train, y_train, n_epochs=50)

    y_pred_train2 = model_2hl.predict(X_train).numpy()
    y_pred_test2 = model_2hl.predict(X_test).numpy()

    train_acc2 = accuracy_score(y_true_train, y_pred_train2)
    test_acc2 = accuracy_score(y_true_test, y_pred_test2)

    results[f"{h}-2hl"] = {"train_acc": train_acc2, "test_acc": test_acc2}
    print(f"2 Hidden Layers → Train Acc: {train_acc2:.4f}, Test Acc: {test_acc2:.4f}")

# ---------- Summary ----------
print("\n=== Performance Summary ===")
for key, metrics in results.items():
    print(f"{key} | Train Acc: {metrics['train_acc']:.4f} | Test Acc: {metrics['test_acc']:.4f}")



=== Testing 5 hidden units per layer ===
Epoch 0, Cost: 0.7541
Epoch 10, Cost: 0.1853
Epoch 20, Cost: 0.2302
Epoch 30, Cost: 1.8952
Epoch 40, Cost: 0.1648
1 Hidden Layer → Train Acc: 0.8075, Test Acc: 0.7850
Epoch 0, Cost: 0.6461
Epoch 10, Cost: 0.1503
Epoch 20, Cost: 2.7269
Epoch 30, Cost: 0.2269
Epoch 40, Cost: 0.0610
2 Hidden Layers → Train Acc: 0.8075, Test Acc: 0.8000

=== Testing 10 hidden units per layer ===
Epoch 0, Cost: 0.0316
Epoch 10, Cost: 0.3140
Epoch 20, Cost: 2.1872
Epoch 30, Cost: 1.4099
Epoch 40, Cost: 0.0009
1 Hidden Layer → Train Acc: 0.8175, Test Acc: 0.8000
Epoch 0, Cost: 1.8301
Epoch 10, Cost: 0.5501
Epoch 20, Cost: 0.1744
Epoch 30, Cost: 0.2358
Epoch 40, Cost: 2.8817
2 Hidden Layers → Train Acc: 0.8213, Test Acc: 0.8000

=== Testing 20 hidden units per layer ===
Epoch 0, Cost: 0.0045
Epoch 10, Cost: 0.2070
Epoch 20, Cost: 0.3825
Epoch 30, Cost: 0.3369
Epoch 40, Cost: 0.4455
1 Hidden Layer → Train Acc: 0.8237, Test Acc: 0.8000
Epoch 0, Cost: nan
Epoch 

In [8]:
#part 3.3- One hidden layer, varying the number of hidden units (use Tanh as their activation function)
torch.manual_seed(1234)

# List of hidden layer sizes to test
hidden_units = [5, 10, 20, 50, 100]

results = {}

# Define a new class for 1-hidden-layer MLP with Tanh activation
class MultiClassMLP_Tanh:
    def __init__(self, input_dim, hidden_dim, output_dim, learning_rate=0.01):
        self.W_1 = torch.randn(input_dim, hidden_dim, requires_grad=True)
        self.b_1 = torch.randn(hidden_dim, requires_grad=True)
        self.W_out = torch.randn(hidden_dim, output_dim, requires_grad=True)
        self.b_out = torch.randn(output_dim, requires_grad=True)
        self.learning_rate = learning_rate

    def tanh(self, z):
        return torch.tanh(z)

    def softmax(self, x):
        return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

    def forward(self, X):
        h = torch.matmul(X, self.W_1) + self.b_1
        h_tanh = self.tanh(h)
        out = torch.matmul(h_tanh, self.W_out) + self.b_out
        return self.softmax(out)

    def categorical_cross_entropy(self, y_pred, y_true):
        eps = torch.finfo(y_pred.dtype).eps
        y_pred = torch.clamp(y_pred, eps, 1 - eps)
        return -torch.mean(torch.sum(y_true * torch.log(y_pred), dim=1))

    def fit(self, X, y, n_epochs=100):
        for epoch in range(n_epochs):
            randperm = torch.randperm(X.shape[0])
            for ii in randperm:
                x_batch = X[ii].unsqueeze(0)
                y_batch = y[ii].unsqueeze(0)

                probs = self.forward(x_batch)
                cost = self.categorical_cross_entropy(probs, y_batch)
                cost.backward()

                with torch.no_grad():
                    self.W_1 -= self.learning_rate * self.W_1.grad
                    self.b_1 -= self.learning_rate * self.b_1.grad
                    self.W_out -= self.learning_rate * self.W_out.grad
                    self.b_out -= self.learning_rate * self.b_out.grad

                    self.W_1.grad.zero_()
                    self.b_1.grad.zero_()
                    self.W_out.grad.zero_()
                    self.b_out.grad.zero_()

            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Cost: {cost.item():.4f}")

        return self

    def predict(self, X):
        with torch.no_grad():
            probs = self.forward(X)
            predicted = torch.argmax(probs, dim=1)
        return predicted

# ---------- Train and evaluate ----------
for h in hidden_units:
    print(f"\nTraining 1-hidden-layer model with {h} units (Tanh)...")
    model = MultiClassMLP_Tanh(input_dim=X_train.shape[1], hidden_dim=h, output_dim=3, learning_rate=0.01)
    model.fit(X_train, y_train, n_epochs=50)

    y_pred_train = model.predict(X_train).numpy()
    y_pred_test = model.predict(X_test).numpy()

    y_true_train = torch.argmax(y_train, dim=1).numpy()
    y_true_test = torch.argmax(y_test, dim=1).numpy()

    train_acc = accuracy_score(y_true_train, y_pred_train)
    test_acc = accuracy_score(y_true_test, y_pred_test)

    results[h] = {"train_acc": train_acc, "test_acc": test_acc}
    print(f"→ Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}")
# ---------- Summary ----------
print("\nPerformance summary (Tanh):")
for h, metrics in results.items():
    print(f"Hidden units: {h:3d} | Train Acc: {metrics['train_acc']:.4f} | Test Acc: {metrics['test_acc']:.4f}")


Training 1-hidden-layer model with 5 units (Tanh)...
Epoch 0, Cost: 0.2956
Epoch 10, Cost: 0.3922
Epoch 20, Cost: 0.1392
Epoch 30, Cost: 0.2840
Epoch 40, Cost: 0.3862
→ Train Acc: 0.8125, Test Acc: 0.7800

Training 1-hidden-layer model with 10 units (Tanh)...
Epoch 0, Cost: 0.7430
Epoch 10, Cost: 0.1674
Epoch 20, Cost: 0.0714
Epoch 30, Cost: 0.1640
Epoch 40, Cost: 0.0861
→ Train Acc: 0.8050, Test Acc: 0.7850

Training 1-hidden-layer model with 20 units (Tanh)...
Epoch 0, Cost: 0.6556
Epoch 10, Cost: 2.7477
Epoch 20, Cost: 3.1624
Epoch 30, Cost: 0.1481
Epoch 40, Cost: 0.0213
→ Train Acc: 0.8113, Test Acc: 0.8000

Training 1-hidden-layer model with 50 units (Tanh)...
Epoch 0, Cost: 0.0064
Epoch 10, Cost: 0.0821
Epoch 20, Cost: 0.5028
Epoch 30, Cost: 0.9176
Epoch 40, Cost: 0.1576
→ Train Acc: 0.8187, Test Acc: 0.8150

Training 1-hidden-layer model with 100 units (Tanh)...
Epoch 0, Cost: 0.7957
Epoch 10, Cost: 0.6381
Epoch 20, Cost: 0.7280
Epoch 30, Cost: 0.1860
Epoch 40, Cost: 0.

In [15]:
# Same number of hidden units across one or two hidden layers (use Tanh as their activation function)
from sklearn.metrics import accuracy_score

# Hidden units to test
hidden_units = [5, 10, 20, 50]

results = {}

for h in hidden_units:
    print(f"\n=== Testing {h} hidden units per layer ===")

    # ---------- 1 hidden layer ----------
    model_1hl = MultiClassMLP_2Hidden_Tanh(
        input_dim=X_train.shape[1],
        hidden_dim1=h,
        hidden_dim2=None,  # None means skip second hidden layer
        output_dim=3,
        learning_rate=0.01
    )
    print(f"\nTraining 1-hidden-layer model with {h} units (Tanh)...")
    model_1hl.fit(X_train, y_train, n_epochs=50)

    y_pred_train1 = model_1hl.predict(X_train).numpy()
    y_pred_test1 = model_1hl.predict(X_test).numpy()

    y_true_train = torch.argmax(y_train, dim=1).numpy()
    y_true_test = torch.argmax(y_test, dim=1).numpy()

    train_acc1 = accuracy_score(y_true_train, y_pred_train1)
    test_acc1 = accuracy_score(y_true_test, y_pred_test1)
    print(f"1 Hidden Layer → Train Acc: {train_acc1:.4f}, Test Acc: {test_acc1:.4f}")

    # ---------- 2 hidden layers ----------
    model_2hl = MultiClassMLP_2Hidden_Tanh(
        input_dim=X_train.shape[1],
        hidden_dim1=h,
        hidden_dim2=h,  # second hidden layer
        output_dim=3,
        learning_rate=0.01
    )
    print(f"\nTraining 2-hidden-layer model with {h} units per layer (Tanh)...")
    model_2hl.fit(X_train, y_train, n_epochs=50)

    y_pred_train2 = model_2hl.predict(X_train).numpy()
    y_pred_test2 = model_2hl.predict(X_test).numpy()

    train_acc2 = accuracy_score(y_true_train, y_pred_train2)
    test_acc2 = accuracy_score(y_true_test, y_pred_test2)
    print(f"2 Hidden Layers → Train Acc: {train_acc2:.4f}, Test Acc: {test_acc2:.4f}")

    results[h] = {
        "1hl": {"train_acc": train_acc1, "test_acc": test_acc1},
        "2hl": {"train_acc": train_acc2, "test_acc": test_acc2}
    }

# ----------Summary----------
print("\n=== Performance summary (Tanh) ===")
for h, metrics in results.items():
    print(f"{h} units per layer | "
          f"1HL → Train: {metrics['1hl']['train_acc']:.4f}, Test: {metrics['1hl']['test_acc']:.4f} | "
          f"2HL → Train: {metrics['2hl']['train_acc']:.4f}, Test: {metrics['2hl']['test_acc']:.4f}")



=== Testing 5 hidden units per layer ===

Training 1-hidden-layer model with 5 units (Tanh)...
Epoch 0, Cost: 0.2005
Epoch 10, Cost: 0.0542
Epoch 20, Cost: 0.1312
Epoch 30, Cost: 0.0936
Epoch 40, Cost: 1.7767
1 Hidden Layer → Train Acc: 0.8137, Test Acc: 0.7950

Training 2-hidden-layer model with 5 units per layer (Tanh)...
Epoch 0, Cost: 0.1849
Epoch 10, Cost: 0.2050
Epoch 20, Cost: 0.2379
Epoch 30, Cost: 0.1435
Epoch 40, Cost: 0.1159
2 Hidden Layers → Train Acc: 0.8100, Test Acc: 0.7900

=== Testing 10 hidden units per layer ===

Training 1-hidden-layer model with 10 units (Tanh)...
Epoch 0, Cost: 0.1699
Epoch 10, Cost: 0.5133
Epoch 20, Cost: 0.2876
Epoch 30, Cost: 0.1317
Epoch 40, Cost: 0.1635
1 Hidden Layer → Train Acc: 0.8050, Test Acc: 0.7800

Training 2-hidden-layer model with 10 units per layer (Tanh)...
Epoch 0, Cost: 0.1479
Epoch 10, Cost: 0.1388
Epoch 20, Cost: 0.1118
Epoch 30, Cost: 1.2206
Epoch 40, Cost: 0.0809
2 Hidden Layers → Train Acc: 0.8200, Test Acc: 0.8000
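
The definition of `MultiClassMLP_2Hidden_Tanh` used in the cell above does not appear in this notebook. As a rough illustration of how a class can skip the second hidden layer when `hidden_dim2=None`, here is a minimal sketch covering the forward pass only. The class name `FlexibleTanhMLP` and its internals are assumptions, not the implementation that produced the results above, and it uses the built-in `torch.softmax` for brevity.

```python
import torch

class FlexibleTanhMLP:
    """Hypothetical Tanh MLP with one or two hidden layers (forward pass only)."""

    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        self.W_1 = torch.randn(input_dim, hidden_dim1, requires_grad=True)
        self.b_1 = torch.randn(hidden_dim1, requires_grad=True)
        self.params = [self.W_1, self.b_1]

        # hidden_dim2=None means the second hidden layer is skipped entirely.
        self.has_second = hidden_dim2 is not None
        last_dim = hidden_dim1
        if self.has_second:
            self.W_2 = torch.randn(hidden_dim1, hidden_dim2, requires_grad=True)
            self.b_2 = torch.randn(hidden_dim2, requires_grad=True)
            self.params += [self.W_2, self.b_2]
            last_dim = hidden_dim2

        # The output layer reads from whichever hidden layer comes last.
        self.W_out = torch.randn(last_dim, output_dim, requires_grad=True)
        self.b_out = torch.randn(output_dim, requires_grad=True)
        self.params += [self.W_out, self.b_out]

    def forward(self, X):
        h = torch.tanh(X @ self.W_1 + self.b_1)
        if self.has_second:
            h = torch.tanh(h @ self.W_2 + self.b_2)
        logits = h @ self.W_out + self.b_out
        return torch.softmax(logits, dim=1)


torch.manual_seed(0)
X_demo = torch.randn(4, 2)                      # 4 samples, 2 features
one_layer = FlexibleTanhMLP(2, 5, None, 3)
two_layer = FlexibleTanhMLP(2, 5, 5, 3)

print(len(one_layer.params), len(two_layer.params))   # 4 6
print(one_layer.forward(X_demo).shape)                # torch.Size([4, 3])
```

Keeping the trainable tensors in a single `params` list means a `fit` method can loop over them for the SGD update and gradient reset, regardless of how many layers are present.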

4. ***REFLECTION AND DISCUSSION***

Reflect on the impact of the different hyperparameter settings:

- How does the number of hidden units affect performance?
When using the ReLU activation function, performance changes only slightly as the number of hidden units increases from 5 to 20: training accuracy stays between roughly 0.80 and 0.83, and test accuracy between 0.785 and 0.795. This suggests that adding more units gives the network a little more capacity to fit the training data, without a clear gain in generalization. However, when the number of hidden units increases to 50 and 100, both training and test accuracies drop sharply to about 0.33, which is chance level for three classes. Because the training accuracy collapses as well and the cost is logged as `nan` from the first epoch, this points to numerically unstable training rather than overfitting; larger networks with this initialization probably require a smaller learning rate, smaller initial weights, or a numerically stable softmax.
In contrast, with the Tanh activation function, performance improves more steadily as the number of hidden units increases from 5 to 50 (test accuracy rises from 0.780 to 0.815), showing that moderate model complexity helps capture more non-linear patterns. However, at 100 hidden units, both training and testing accuracy slightly decline, suggesting diminishing returns and possible overfitting.
Overall, both activation functions show that increasing hidden units improves performance up to a point, after which accuracy either plateaus or decreases. Moderate network sizes (around 20–50 hidden units) seem to provide the best balance between model flexibility and generalization.

- What changes when using two layers instead of one?
When using the Tanh activation function, performance improves as the number of hidden units increases from 5 to 20, for both one and two hidden layers. Both training and test accuracies rise slightly, showing that increasing model capacity helps capture more patterns in the data. However, at 50 units per layer, accuracy begins to decline marginally, suggesting that additional complexity does not necessarily translate to better generalization. The difference between one and two hidden layers is relatively small, though the two-layer models tend to perform just slightly better overall.
In contrast, with the ReLU activation function, accuracy improves up to 20 hidden units but drops sharply at 50 and 100 units. The `nan` costs logged for these runs suggest that the larger ReLU networks suffer from numerical instability (for example, exploding activations or gradients) rather than overfitting or vanishing gradients. Note also that the ReLU "1 hidden layer" runs reused the two-layer class, so they do not isolate the effect of depth.
Overall, moderate network sizes (around 20 hidden units) consistently yield the best results for both activations. Tanh shows more stable and consistent performance across network sizes, while ReLU is more sensitive to the number of hidden units, especially as the network grows deeper or wider.

- How does the activation function (ReLU vs. Tanh) influence results?
The activation function strongly influences how effectively the model learns and generalizes. In this experiment, models using Tanh achieved more stable and consistent performance as hidden units increased, with both training and test accuracy improving gradually before leveling off. In contrast, ReLU models performed well with fewer hidden units but showed a sharp decline in accuracy for larger numbers of hidden units, where the cost became `nan`. A likely explanation is that ReLU outputs are unbounded, so large randomly initialized weights can produce very large activations and unstable gradients, whereas Tanh keeps hidden activations within [-1, 1]. Overall, Tanh led to smoother learning and better generalization across network sizes, while ReLU was competitive for small networks but less reliable as model complexity increased.



## 3: Regression Task (25 points)

Use the dataset below to complete points 1 to 4 in the general instructions for this task.

Use as many cells as needed.

In [None]:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=2, n_informative=2, random_state=1234, noise=75)

In [None]:
# USE AS MANY CELLS AS NEEDED

## 4. Discussion (5 points)

You created a separate class for each task and likely repeated much of the same code across implementations.

Discuss within your group how you could have leveraged inheritance to make your code more reusable and avoid duplication. Provide examples. Be specific.
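
As one possible starting point for this discussion, the sketch below puts the shared pieces (weight initialization, the hidden-layer forward pass, and the SGD loop) in a hypothetical `BaseMLP`, so that each task class only overrides the output activation, the cost, and `predict`. The base-class name, the single ReLU hidden layer, and the use of built-in `torch.sigmoid` and `torch.softmax` are simplifying assumptions, not the structure the lab requires.

```python
import torch

class BaseMLP:
    """Hypothetical shared base: one hidden ReLU layer plus SGD training."""

    def __init__(self, input_dim, hidden_dim, output_dim, learning_rate=0.01):
        self.params = [
            torch.randn(input_dim, hidden_dim, requires_grad=True),
            torch.randn(hidden_dim, requires_grad=True),
            torch.randn(hidden_dim, output_dim, requires_grad=True),
            torch.randn(output_dim, requires_grad=True),
        ]
        self.learning_rate = learning_rate

    def forward(self, X):
        W_1, b_1, W_out, b_out = self.params
        hidden = torch.relu(X @ W_1 + b_1)
        return self.output_activation(hidden @ W_out + b_out)

    def fit(self, X, y, n_epochs=10):
        for _ in range(n_epochs):
            for i in torch.randperm(X.shape[0]):
                y_pred = self.forward(X[i].unsqueeze(0))
                loss = self.cost(y_pred, y[i].unsqueeze(0))
                loss.backward()
                with torch.no_grad():
                    for p in self.params:
                        p -= self.learning_rate * p.grad
                        p.grad.zero_()
        return self

    # Subclasses override only these task-specific pieces.
    def output_activation(self, z):
        raise NotImplementedError

    def cost(self, y_pred, y_true):
        raise NotImplementedError


class BinaryMLP(BaseMLP):
    def output_activation(self, z):
        return torch.sigmoid(z)

    def cost(self, y_pred, y_true):
        y_pred = torch.clamp(y_pred, 1e-7, 1 - 1e-7)
        return -(y_true * torch.log(y_pred)
                 + (1 - y_true) * torch.log(1 - y_pred)).mean()

    def predict(self, X):
        with torch.no_grad():
            return (self.forward(X) >= 0.5).long().squeeze(1)


class MultiClassMLP(BaseMLP):
    def output_activation(self, z):
        return torch.softmax(z, dim=1)

    def cost(self, y_pred, y_true):
        y_pred = torch.clamp(y_pred, min=1e-7)
        return -(y_true * torch.log(y_pred)).sum(dim=1).mean()

    def predict(self, X):
        with torch.no_grad():
            return torch.argmax(self.forward(X), dim=1)


class RegressionMLP(BaseMLP):
    def output_activation(self, z):
        return z  # identity: no activation on the output unit

    def cost(self, y_pred, y_true):
        return ((y_pred - y_true) ** 2).mean()

    def predict(self, X):
        with torch.no_grad():
            return self.forward(X).squeeze(1)


torch.manual_seed(0)
X_demo = torch.randn(8, 2)
reg = RegressionMLP(2, 4, 1).fit(X_demo, torch.randn(8, 1), n_epochs=2)
print(reg.predict(X_demo).shape)   # torch.Size([8])
```

With this layout, `fit` is written once, and a change such as adding a second hidden layer or a Tanh option would be made in `BaseMLP` only.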

YOUR TEXT HERE

## 5. Collaboration Reflection (5 points)

As a group, briefly reflect on the following (max 1–2 short paragraphs):

- How did the group dynamics work throughout the assignment?
- Were there any major disagreements or diverging approaches?
- How did you resolve conflicts or make final modeling decisions?
- What did you learn from each other during this project?

YOUR TEXT HERE