# 3. MLP

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Activity: Understanding Multi-Layer Perceptrons (MLPs)
This activity is designed to test your skills in Multi-Layer Perceptrons (MLPs).

## Exercise 1

Consider a simple MLP with 2 input features, 1 hidden layer containing 2 neurons, and 1 output neuron. Use the hyperbolic tangent (tanh) function as the activation for both the hidden layer and the output layer. The loss function is mean squared error (MSE): L = (1/N) * Σ (y - ŷ)², where ŷ is the network's output and N is the number of samples (N = 1 in this exercise, so the sum has a single term).

For this exercise, use the following specific values:

* Input and output vectors:
    * X: [0.5, -0.2]
    * Y: 1.0

* Hidden layer weights:
    * W¹ = [[0.3, -0.1], 
             [0.2, 0.4]]  (2x2 matrix)
 
* Hidden layer biases:
    * b¹ = [0.1, -0.2]  (1x2 vector)

* Output layer weights:
    * W² = [0.5, -0.3]

* Output layer bias:
    * b² = 0.2

* Learning rate: 
    * η = 0.1

* Activation function: tanh

Perform the following steps explicitly, showing all mathematical derivations and calculations with the provided values:

1. **Forward Pass**:
   * Compute the hidden layer pre-activations: Z¹ = W¹ * X + b¹.
   * Apply tanh to get hidden activations: a¹ = tanh(Z¹).
   * Compute the output pre-activation: Z² = W² * a¹ + b².
   * Compute the final output: ŷ = tanh(Z²).

2. **Loss Calculation**:
   * Compute the loss: L = (1/N) * (Y - ŷ)², with N = 1 for this single sample.

3. **Backward Pass (Backpropagation)**: Compute the gradients of the loss with respect to all weights and biases. Start with ∂L/∂ŷ, then compute:
   * ∂L/∂Z² (using the tanh derivative: d/dZ tanh(Z) = 1 - tanh²(Z)).
   * Gradients for output layer: ∂L/∂W², ∂L/∂b².
   * Propagate to hidden layer: ∂L/∂a¹, ∂L/∂Z¹.
   * Gradients for hidden layer: ∂L/∂W¹, ∂L/∂b¹.
   * Show all intermediate steps and calculations.

4. **Parameter Update**: Using the learning rate η = 0.1, update all weights and biases via gradient descent:
   * W² <- W² - η * ∂L/∂W²
   * b² <- b² - η * ∂L/∂b²
   * W¹ <- W¹ - η * ∂L/∂W¹
   * b¹ <- b¹ - η * ∂L/∂b¹
   * Provide the numerical values for all updated parameters.

**Submission Requirements**: Show all mathematical steps explicitly, including intermediate calculations (e.g., matrix multiplications, tanh applications, gradient derivations). Use exact numerical values throughout and avoid rounding excessively to maintain precision (at least 4 decimal places).
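
The chain-rule steps requested in the backward pass can be written out compactly. The block below is a sketch under stated assumptions: a single sample (N = 1), column vectors, and lowercase symbols x, y, z, a and the shorthand δ, which are introduced here only for illustration.

```latex
\begin{align*}
z^{1} &= W^{1} x + b^{1}, \qquad a^{1} = \tanh(z^{1}) \\
z^{2} &= W^{2} a^{1} + b^{2}, \qquad \hat{y} = \tanh(z^{2}), \qquad L = (y - \hat{y})^{2} \\[6pt]
\frac{\partial L}{\partial \hat{y}} &= 2\,(\hat{y} - y) \\
\delta^{2} \equiv \frac{\partial L}{\partial z^{2}} &= \frac{\partial L}{\partial \hat{y}}\,\bigl(1 - \hat{y}^{2}\bigr) \\
\frac{\partial L}{\partial W^{2}} &= \delta^{2}\,(a^{1})^{\top}, \qquad \frac{\partial L}{\partial b^{2}} = \delta^{2} \\
\frac{\partial L}{\partial a^{1}} &= (W^{2})^{\top}\,\delta^{2} \\
\delta^{1} \equiv \frac{\partial L}{\partial z^{1}} &= \frac{\partial L}{\partial a^{1}} \odot \bigl(1 - (a^{1})^{2}\bigr) \\
\frac{\partial L}{\partial W^{1}} &= \delta^{1}\,x^{\top}, \qquad \frac{\partial L}{\partial b^{1}} = \delta^{1}
\end{align*}
```

Here ⊙ denotes the element-wise product, and 1 - (a¹)² is the tanh derivative evaluated at z¹, since a¹ = tanh(z¹).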

In [2]:
# --- Helpers ---
def tanh(x):
    return np.tanh(x)

def dtanh(z):
    return 1.0 - np.tanh(z)**2

def fmt(x):
    if isinstance(x, float):
        return f"{x:.6f}"
    arr = np.array(x, dtype=float)
    return np.array2string(arr, formatter={'float_kind':lambda v: f"{v:.6f}"},
                           floatmode='maxprec', suppress_small=False)

def p(title, value):
    print(f"{title}: {fmt(value)}")

# --- Exercise data ---
X = np.array([0.5, -0.2], dtype=float)
Y = 1.0 

W1 = np.array([[0.3, -0.1],
               [0.2,  0.4]], dtype=float)

b1 = np.array([0.1, -0.2], dtype=float)

W2 = np.array([0.5, -0.3], dtype=float)
b2 = 0.2

eta_update = 0.1

print("=== EXERCISE 1: MLP (tanh, MSE) ===\n")
p("X", X); p("Y", Y)
print("\n--- Initial parameters ---")
p("W1", W1); p("b1", b1); p("W2", W2); p("b2", b2)


=== EXERCISE 1: MLP (tanh, MSE) ===

X: [0.500000 -0.200000]
Y: 1.000000

--- Initial parameters ---
W1: [[0.300000 -0.100000]
 [0.200000 0.400000]]
b1: [0.100000 -0.200000]
W2: [0.500000 -0.300000]
b2: 0.200000


In [3]:
# === 1) Forward pass ===
Z1 = W1 @ X + b1
A1 = tanh(Z1)
Z2 = float(W2 @ A1 + b2)
Y_hat = float(tanh(Z2))

print("\n--- Forward Pass ---")
p("Z1 = W1 @ X + b1", Z1)
p("A1 = tanh(Z1)", A1)
p("Z2 = W2 · A1 + b2", Z2)
p("ŷ = tanh(Z2)", Y_hat)

# === 2) Loss ===
L = (Y - Y_hat)**2
print("\n--- Loss (MSE) ---")
p("L = (Y - ŷ)^2", L)



--- Forward Pass ---
Z1 = W1 @ X + b1: [0.270000 -0.180000]
A1 = tanh(Z1): [0.263625 -0.178081]
Z2 = W2 · A1 + b2: 0.385237
ŷ = tanh(Z2): 0.367247

--- Loss (MSE) ---
L = (Y - ŷ)^2: 0.400377


In [4]:
# === 3) Backpropagation ===

# Output layer
dL_dYhat = 2.0*(Y_hat - Y)
dYhat_dZ2 = dtanh(Z2)
dL_dZ2 = dL_dYhat * dYhat_dZ2

print("\n-- Output --")
p("dL/dŷ = 2*(ŷ - Y)", dL_dYhat)
p("dtanh(Z2) = 1 - tanh^2(Z2)", dYhat_dZ2)
p("dL/dZ2", dL_dZ2)

# Output-layer gradients
dL_dW2 = dL_dZ2 * A1            # (2,)
dL_db2 = dL_dZ2                 # scalar

print("\n-- Output-layer gradients --")
p("dL/dW2 = dL/dZ2 * A1", dL_dW2)
p("dL/db2 = dL/dZ2", dL_db2)

# Propagate to the hidden layer
dL_dA1 = dL_dZ2 * W2            # (2,)
dA1_dZ1 = dtanh(Z1)             # (2,)
dL_dZ1 = dL_dA1 * dA1_dZ1       # (2,)

print("\n-- Propagation to the hidden layer --")
p("dL/dA1 = dL/dZ2 * W2", dL_dA1)
p("dtanh(Z1) = 1 - tanh^2(Z1)", dA1_dZ1)
p("dL/dZ1 = dL/dA1 ⊙ dtanh(Z1)", dL_dZ1)

# Hidden-layer gradients
dL_dW1 = np.outer(dL_dZ1, X)    # (2,2)
dL_db1 = dL_dZ1                 # (2,)

print("\n-- Hidden-layer gradients --")
p("dL/dW1 = outer(dL/dZ1, X)", dL_dW1)
p("dL/db1 = dL/dZ1", dL_db1)


-- Output --
dL/dŷ = 2*(ŷ - Y): -1.265507
dtanh(Z2) = 1 - tanh^2(Z2): 0.865130
dL/dZ2: -1.094828

-- Output-layer gradients --
dL/dW2 = dL/dZ2 * A1: [-0.288624 0.194968]
dL/db2 = dL/dZ2: -1.094828

-- Propagation to the hidden layer --
dL/dA1 = dL/dZ2 * W2: [-0.547414 0.328448]
dtanh(Z1) = 1 - tanh^2(Z1): [0.930502 0.968287]
dL/dZ1 = dL/dA1 ⊙ dtanh(Z1): [-0.509370 0.318032]

-- Hidden-layer gradients --
dL/dW1 = outer(dL/dZ1, X): [[-0.254685 0.101874]
 [0.159016 -0.063606]]
dL/db1 = dL/dZ1: [-0.509370 0.318032]
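
One way to sanity-check analytic gradients like these is a central finite-difference comparison. The sketch below is not part of the original notebook: the helper names `loss` and `numerical_grad` and the step size `eps` are assumptions chosen for illustration, and the block redefines the exercise values so it can run on its own.

```python
import numpy as np

# Exercise values, repeated so this block is self-contained
X = np.array([0.5, -0.2])
Y = 1.0
W1 = np.array([[0.3, -0.1],
               [0.2,  0.4]])
b1 = np.array([0.1, -0.2])
W2 = np.array([0.5, -0.3])
b2 = 0.2

def loss(W1, b1, W2, b2):
    """Single-sample squared error of the 2-2-1 tanh network."""
    A1 = np.tanh(W1 @ X + b1)
    y_hat = np.tanh(W2 @ A1 + b2)
    return (Y - y_hat) ** 2

def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences of scalar f with respect to each entry of theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Perturb one parameter group at a time, holding the others fixed
num_dW2 = numerical_grad(lambda w: loss(W1, b1, w, b2), W2)
num_db2 = numerical_grad(lambda b: loss(W1, b1, W2, b), b2)
num_dW1 = numerical_grad(lambda w: loss(w, b1, W2, b2), W1)
num_db1 = numerical_grad(lambda b: loss(W1, b, W2, b2), b1)

print("dL/dW2 (numerical):", num_dW2)
print("dL/db2 (numerical):", float(num_db2))
print("dL/dW1 (numerical):\n", num_dW1)
print("dL/db1 (numerical):", num_db1)
```

If backpropagation is implemented correctly, these numerical estimates should agree with the analytic gradients printed above to several decimal places.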


In [5]:
# === 4) Parameter update (η = 0.1) ===
W2_new = W2 - eta_update * dL_dW2
b2_new = b2 - eta_update * dL_db2
W1_new = W1 - eta_update * dL_dW1
b1_new = b1 - eta_update * dL_db1

p("\nW2_new = W2 - η*dL/dW2", W2_new)
p("b2_new = b2 - η*dL/db2", b2_new)
p("W1_new = W1 - η*dL/dW1", W1_new)
p("b1_new = b1 - η*dL/db1", b1_new)


W2_new = W2 - η*dL/dW2: [0.528862 -0.319497]
b2_new = b2 - η*dL/db2: 0.309483
W1_new = W1 - η*dL/dW1: [[0.325468 -0.110187]
 [0.184098 0.406361]]
b1_new = b1 - η*dL/db1: [0.150937 -0.231803]


In [6]:
# === Optional check: forward pass with the updated parameters ===
Z1_new = W1_new @ X + b1_new
A1_new = tanh(Z1_new)
Z2_new = float(W2_new @ A1_new + b2_new)
Y_hat_new = float(tanh(Z2_new))
L_new = (Y - Y_hat_new)**2

p("\nŷ (before)", Y_hat)
p("L (before)", L)
p("ŷ (after)", Y_hat_new)
p("L (after)", L_new)



ŷ (before): 0.367247
L (before): 0.400377
ŷ (after): 0.500620
L (after): 0.249380
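
The exercise asks for a single update, but the same forward, backward, and update steps can be repeated to watch the loss evolve. The loop below is a minimal sketch that goes beyond the assignment: the step count `n_steps = 20` is an arbitrary assumption, and the values are redefined so the block runs on its own.

```python
import numpy as np

# Exercise values, repeated so this block is self-contained
X = np.array([0.5, -0.2])
Y = 1.0
W1 = np.array([[0.3, -0.1],
               [0.2,  0.4]])
b1 = np.array([0.1, -0.2])
W2 = np.array([0.5, -0.3])
b2 = 0.2
eta = 0.1
n_steps = 20  # hypothetical choice, not specified by the exercise

losses = []
for step in range(n_steps):
    # Forward pass
    A1 = np.tanh(W1 @ X + b1)
    y_hat = np.tanh(W2 @ A1 + b2)
    losses.append(float((Y - y_hat) ** 2))  # loss before this step's update

    # Backward pass (same chain rule as in the cells above)
    dZ2 = 2.0 * (y_hat - Y) * (1.0 - y_hat ** 2)
    dW2 = dZ2 * A1
    db2 = dZ2
    dZ1 = dZ2 * W2 * (1.0 - A1 ** 2)  # uses W2 before it is updated
    dW1 = np.outer(dZ1, X)
    db1 = dZ1

    # Gradient descent update
    W2 = W2 - eta * dW2
    b2 = b2 - eta * db2
    W1 = W1 - eta * dW1
    b1 = b1 - eta * db1

for i, value in enumerate(losses[:5]):
    print(f"step {i}: L = {value:.6f}")
```

The first two recorded losses correspond to the values reported by the check cell (0.400377 before the first update, 0.249380 after it); later values show how further steps with the same η behave on this single sample.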


## Exercise 2

Using the `make_classification` function from scikit-learn, generate a synthetic dataset with the following specifications:

* Number of samples: 1000

* Number of classes: 2

* Number of clusters per class: Use the `n_clusters_per_class` parameter creatively to achieve 1 cluster for one class and 2 for the other (hint: you may need to generate subsets separately and combine them, as the function applies the same number of clusters to all classes by default).

* Other parameters: Set `n_features=2` for easy visualization, `n_informative=2`, `n_redundant=0`, `random_state=42` for reproducibility, and adjust `class_sep` or `flip_y` as needed for a challenging but separable dataset.
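
One possible reading of the hint is to call `make_classification` twice, once with one cluster per class and once with two, and keep a different class from each call. The sketch below follows that idea under stated assumptions: `class_sep=1.5` and `flip_y=0.0` are illustrative choices, and the variable names are hypothetical. Because the two classes come from separate draws, their separation is not guaranteed and is worth checking with a scatter plot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Settings shared by both calls; class_sep and flip_y are assumed values
common = dict(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
              n_classes=2, flip_y=0.0, class_sep=1.5, random_state=42)

# Draw A: one cluster per class. Draw B: two clusters per class.
X_a, y_a = make_classification(n_clusters_per_class=1, **common)
X_b, y_b = make_classification(n_clusters_per_class=2, **common)

# Keep class 0 from draw A (1 cluster) and class 1 from draw B (2 clusters)
X0 = X_a[y_a == 0]
X1 = X_b[y_b == 1]
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(len(X0), dtype=int),
                    np.ones(len(X1), dtype=int)])

# 80/20 split, stratified so both sets keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("X:", X.shape, "class counts:", np.bincount(y))
print("train:", X_train.shape, "test:", X_test.shape)
```

With balanced default class weights and no label flipping, each call contributes half of its 1000 samples per class, so the combined set again has 1000 samples.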

Implement an MLP from scratch (without using libraries like TensorFlow or PyTorch for the model itself; you may use NumPy for array operations) to classify this data. You have full freedom to choose the architecture, including:

* Number of hidden layers (at least 1)

* Number of neurons per layer

* Activation functions (e.g., sigmoid, ReLU, tanh)

* Loss function (e.g., binary cross-entropy)

* Optimizer (e.g., gradient descent, with a chosen learning rate)

Steps to follow:

1. Generate and split the data into training (80%) and testing (20%) sets.

2. Implement the forward pass, loss computation, backward pass, and parameter updates in code.

3. Train the model for a reasonable number of epochs (e.g., 100-500), tracking training loss.

4. Evaluate on the test set: Report accuracy, and optionally plot decision boundaries or confusion matrix.

5. Submit your code and results, including any visualizations.
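
Steps 1 to 4 can be sketched with NumPy alone. The block below is one possible design, not a reference solution: it assumes a single tanh hidden layer of 8 neurons, a sigmoid output with binary cross-entropy, and full-batch gradient descent with η = 0.5 for 300 epochs. To stay free of extra dependencies it trains on a hypothetical two-blob toy dataset; in the actual exercise the `make_classification` data would take its place.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data: two Gaussian blobs (replace with the real dataset)
n = 200
X = np.vstack([rng.normal(-1.5, 1.0, size=(n, 2)),
               rng.normal(+1.5, 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Step 1: shuffle and split 80/20
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Return hidden activations and predicted probabilities."""
    A1 = np.tanh(X @ params["W1"] + params["b1"])
    y_hat = sigmoid(A1 @ params["W2"] + params["b2"]).ravel()
    return A1, y_hat

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy, clipped to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backward(params, X, y, A1, y_hat):
    """Gradients of the mean BCE loss with respect to all parameters."""
    dZ2 = ((y_hat - y) / len(y)).reshape(-1, 1)  # sigmoid + BCE simplification
    dZ1 = (dZ2 @ params["W2"].T) * (1.0 - A1 ** 2)  # tanh derivative
    return {"W1": X.T @ dZ1, "b1": dZ1.sum(axis=0),
            "W2": A1.T @ dZ2, "b2": dZ2.sum(axis=0)}

# Step 2: parameters for a 2-8-1 network (sizes are an assumed choice)
params = {"W1": rng.normal(0, 0.5, size=(2, 8)), "b1": np.zeros(8),
          "W2": rng.normal(0, 0.5, size=(8, 1)), "b2": np.zeros(1)}

# Step 3: full-batch gradient descent, tracking the training loss
eta, n_epochs = 0.5, 300
losses = []
for epoch in range(n_epochs):
    A1, y_hat = forward(params, X_train)
    losses.append(bce(y_train, y_hat))
    grads = backward(params, X_train, y_train, A1, y_hat)
    for name in params:
        params[name] -= eta * grads[name]

# Step 4: test accuracy with a 0.5 decision threshold
_, p_test = forward(params, X_test)
test_acc = float(np.mean((p_test >= 0.5) == (y_test == 1)))
print(f"final training loss: {losses[-1]:.4f}, test accuracy: {test_acc:.3f}")
```

The `losses` list can be passed to `plt.plot` to show the training curve, and evaluating `forward` on a grid of points is one way to draw the decision boundary requested in step 4.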