# Feed-forward Neural Networks 
## aka Multi-Layer Perceptrons (MLPs)

These networks are useful for modelling nonlinear relations. There is an input layer, one or more hidden layers, and an output layer. Each hidden layer applies a linear transformation followed by a nonlinearity (the activation function).

In [None]:
import numpy as np
import pandas as pd 
import sys, os
import matplotlib.pyplot as plt

# Add the project root to sys.path (one level up from this notebook)
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from hypotai.data import generate_triangle_data
from hypotai.plotting import plot_triangle


from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

Neural networks need scaled inputs, i.e. every feature you pass in should be centered at 0 and have a comparable spread (for example unit variance, or a range of roughly -1 to 1).

Otherwise, if one feature has a much larger range than the others (e.g. angles of 0-180 versus distances of 0-10), that feature dominates the weighted sums and the network focuses on it. Similarly, the network minimizes the error with gradient descent, and if the scales differ a lot the error surface is stretched along some parameters and squeezed along others, which makes the minimizing path harder (sometimes impossible) to find. Lastly, activation functions like tanh and sigmoid saturate for large input values.
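As a rough illustration of the scaling point, the sketch below standardizes three made-up features with ranges similar to the ones mentioned above. The data is random and only stands in for the triangle dataset. `StandardScaler` subtracts each column's mean and divides by its standard deviation, so the result is centered at 0 with unit variance rather than strictly bounded to [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: two side lengths in [0, 10] and an angle in [0, 180]
X_demo = np.column_stack([
    rng.uniform(0, 10, size=1000),
    rng.uniform(0, 10, size=1000),
    rng.uniform(0, 180, size=1000),
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Before scaling the angle column has a much larger spread than the sides;
# after scaling every column has mean ~0 and standard deviation ~1.
print("raw std per column:    ", X_demo.std(axis=0).round(2))
print("scaled mean per column:", X_demo_scaled.mean(axis=0).round(2))
print("scaled std per column: ", X_demo_scaled.std(axis=0).round(2))
```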

In [14]:
df = generate_triangle_data(n_samples=10_000, angle_mode="random")
X = df[["a", "b", "angle_deg"]]
y = df["c"]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build pipeline: scale → feedforward net (I also just learned this, cool)
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(10, 10),  # 2 hidden layers, 10 neurons each
                 activation='relu',
                 solver='adam',
                 max_iter=500,
                 random_state=42,
                 verbose=False)
)

mlp.fit(X_train, y_train)



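Pipeline

| Parameter | Value |
|---|---|
| steps | [('standardscaler', ...), ('mlpregressor', ...)] |
| transform_input | |
| memory | |
| verbose | False |

StandardScaler

| Parameter | Value |
|---|---|
| copy | True |
| with_mean | True |
| with_std | True |

MLPRegressor

| Parameter | Value |
|---|---|
| loss | 'squared_error' |
| hidden_layer_sizes | (10, ...) |
| activation | 'relu' |
| solver | 'adam' |
| alpha | 0.0001 |
| batch_size | 'auto' |
| learning_rate | 'constant' |
| learning_rate_init | 0.001 |
| power_t | 0.5 |
| max_iter | 500 |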


In [16]:
y_pred = mlp.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))

MSE: 2.5427842533319414
R²: 0.9982086437280188


Good, but now we understand why people call this a black box. <br>
Let's try to understand what is actually going on: peek behind the curtain and check what is "hidden".

In [35]:
## something small
mlp2 = MLPRegressor(
    hidden_layer_sizes=(3, 2),  # small on purpose: 3 neurons → 2 neurons → 1 output
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42)

## Fit this model with scaled data (no pipe here, so we scale manually)
scaler2 = StandardScaler()
X_train_scaled = scaler2.fit_transform(X_train)
mlp2.fit(X_train_scaled, y_train)

MLPRegressor

| Parameter | Value |
|---|---|
| loss | 'squared_error' |
| hidden_layer_sizes | (3, ...) |
| activation | 'relu' |
| solver | 'adam' |
| alpha | 0.0001 |
| batch_size | 'auto' |
| learning_rate | 'constant' |
| learning_rate_init | 0.001 |
| power_t | 0.5 |
| max_iter | 500 |


In [36]:

## one triangle
x_input = np.array([[3.0, 4.0, 90.0]])  # a, b, angle in degrees
x_input_scaled = scaler2.transform(x_input)



In [37]:
mlp2_pred = mlp2.predict(x_input_scaled)
print("Predicted c for triangle with a=3, b=4, angle=90:", mlp2_pred[0])

Predicted c for triangle with a=3, b=4, angle=90: 1.2127995658217303


Wow, what a terrible guess, right? This is the 3-4-5 right triangle, so the true answer is $c = 5$. Our network is neither deep nor wide, and hidden layers of 3 and 2 neurons are clearly not enough here. However, we can look at what it did under the hood. <br>

We can pull the trained weights and biases out of the fitted model.


For each hidden layer $l$, the output $h^{(l)}$ is computed as:

$$
h^{(l)} = \sigma \left( W^{(l)} \cdot h^{(l-1)} + b^{(l)} \right)
$$

Where:

- $h^{(l)}$ is the output of layer $l$ (the activations)
- $W^{(l)}$ is the weight matrix connecting layer $l-1$ to layer $l$
- $b^{(l)}$ is the bias vector at layer $l$
- $\sigma$ is the activation function (e.g. ReLU, sigmoid, tanh)
- $h^{(0)} = x$ is the input vector
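The layer formula above can be sketched in a few lines of NumPy. The weights and inputs here are made-up numbers, and `dense_layer` is a hypothetical helper, not part of any library. The sketch follows the column-vector convention of the equation ($W \cdot h$); scikit-learn stores its weight matrices transposed and multiplies row vectors from the left instead.

```python
import numpy as np

def relu(z):
    """ReLU activation: negative entries become 0, positive ones pass through."""
    return np.maximum(0, z)

def dense_layer(h_prev, W, b, activation=relu):
    """One layer of the network: sigma(W @ h_prev + b)."""
    return activation(W @ h_prev + b)

# Made-up parameters for a layer with 3 inputs and 2 neurons
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 1.0]])
b = np.array([0.5, -1.0])
h0 = np.array([1.0, 2.0, 1.0])  # h^(0) = x, the input vector

h1 = dense_layer(h0, W, b)
print(h1)  # first neuron is clipped to 0 by ReLU, second stays positive
```

Stacking several such calls, each feeding its output into the next, gives the full forward pass.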


The network has 2 hidden layers; the first has 3 neurons and the second has 2 neurons (`hidden_layer_sizes=(3,2)`). <br>

When we fitted the model above, it found the weights and biases that minimize the training loss, and they are now fixed. When we pass in a new data point like `(3, 4, 90 degrees)`, the first layer multiplies it by its weights (one set per neuron, so three) and adds the biases (again three), then applies the activation function. The result goes to the second layer, which multiplies it by its own weights (now two sets), adds its biases and applies the activation. Then the final layer spits out the output.

In [38]:
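# Trained weights and biases: one matrix/vector per layer (2 hidden + 1 output)
W1, W2, W3 = mlp2.coefs_
b1, b2, b3 = mlp2.intercepts_

# Forward pass
# First hidden layer: linear step (z1), then ReLU activation (a1)
z1 = np.dot(x_input_scaled, W1) + b1
a1 = np.maximum(0, z1)

# Second hidden layer: same pattern
z2 = np.dot(a1, W2) + b2
a2 = np.maximum(0, z2)

# Output layer: linear only, MLPRegressor applies no activation here
z3 = np.dot(a2, W3) + b3
output = z3[0][0]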

print("Predicted c:", output)

Predicted c: 1.2127995658217303


In [40]:
print(z1)
print(z2)
print(z3)

[[1.64509661 4.87946435 1.97592469]]
[[-1.60445521 -0.45739404]]
[[1.21279957]]


In [44]:
print("Weights and biases:")
print("W1:", W1)
print("b1:", b1)
print("W2:", W2)
print("b2:", b2)
print("W3:", W3)
print("b3:", b3)

Weights and biases:
W1: [[-1.30478874 -1.38753549  2.74534161]
 [ 2.19858441 -1.33585828 -1.58085049]
 [ 2.10384646  3.62818951  2.79644833]]
b1: [3.02333578 0.49697946 3.96247483]
W2: [[ 3.58163660e+000  8.24171364e-135]
 [-3.22862247e+000  2.43399471e-128]
 [ 2.83250778e+000 -2.28864307e-227]]
b2: [ 2.66053273 -0.45739404]
W3: [[3.45974893e+000]
 [4.58471270e-102]]
b3: [1.21279957]
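One way to see why the prediction is so far off is to redo the last two steps with the numbers printed above. Both entries of `z2` are negative, so ReLU sets them to zero and the output collapses to the bias `b3`, whatever `W3` is. The arrays below are copied (rounded) from the printed output, not read from the fitted model.

```python
import numpy as np

# Values copied (rounded) from the printed output above
z2 = np.array([[-1.60445521, -0.45739404]])
W3 = np.array([[3.45974893e+000],
               [4.58471270e-102]])
b3 = np.array([1.21279957])

a2 = np.maximum(0, z2)   # ReLU: both negative entries become 0
z3 = a2 @ W3 + b3        # zero activations times W3 vanish, only the bias survives

print("a2:", a2)
print("output:", z3[0][0])
```

So for this particular input the second hidden layer is switched off entirely, and the network returns the same constant it would return for any other input that lands in that region.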


In a neural network, each layer transforms its input into an output by applying a weight matrix and a bias, followed by an activation function. The number of neurons in a layer determines the size of that layer's output, and the weight matrix is shaped to match the input and output dimensions. So if one layer has 3 neurons and the next has 2, the weights between them have shape (3, 2), connecting every neuron of the first layer to every neuron of the second. This continues layer by layer, and the final layer is shaped to produce the desired output size, for example a single number for regression. No matter how wide or deep the network is, the shapes chain together so that the output comes out the right size.
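The shape bookkeeping can be checked directly on a fitted model. The sketch below trains a throwaway `MLPRegressor` on random data (the data and the target are made up and carry no meaning) purely to inspect the shapes stored in `coefs_` and `intercepts_`.

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))   # 3 made-up input features
y_demo = X_demo.sum(axis=1)          # arbitrary target, just so there is something to fit

net = MLPRegressor(hidden_layer_sizes=(3, 2), max_iter=50, random_state=0)
with warnings.catch_warnings():
    # 50 iterations may not converge; only the shapes matter here
    warnings.simplefilter("ignore")
    net.fit(X_demo, y_demo)

weight_shapes = [W.shape for W in net.coefs_]
bias_shapes = [b.shape for b in net.intercepts_]
print(weight_shapes)  # [(3, 3), (3, 2), (2, 1)]
print(bias_shapes)    # [(3,), (2,), (1,)]
```

Each weight matrix is (inputs to the layer, neurons in the layer), so the second dimension of one matrix always matches the first dimension of the next.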

**How do I structure my model?** <br>

This depends on the task. What you want is a model that gives good predictions on new data it has not seen before. <br>

Sometimes your data is complex and you need more layers and more neurons. But sometimes the problem is not that complicated, and if your model is too complicated it will overfit, i.e. it will learn all the details of your training data but fail to capture the general distribution. Think of it like this: you try to teach a model to tell cats from dogs, and it memorizes every single cat and dog picture you show it. It can say "Oh I know him, that's Frank!" but if you show it a photo of Frank's cousin, your model goes like "I have never seen anything like that in my entire life". So you want it to know what dogs and cats generally look like, but not get too specific.
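A common way to spot overfitting is to compare the score on the training data with the score on held-out data. The sketch below does this for a tiny and a deliberately oversized network on a small, made-up noisy dataset. The layer sizes, the `lbfgs` solver and `alpha=0.0` (no regularization) are choices for this toy setup only, and the exact numbers depend on the seed.

```python
import warnings
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Made-up data: a simple linear trend plus noise, with deliberately few samples
X_demo = rng.uniform(-1, 1, size=(60, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

scores = {}
for name, layers in [("small", (2,)), ("large", (100, 100))]:
    net = MLPRegressor(hidden_layer_sizes=layers, solver="lbfgs", alpha=0.0,
                       max_iter=2000, random_state=0)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence convergence warnings in this toy run
        net.fit(X_tr, y_tr)
    # score() returns R^2; store (train, test) for each model
    scores[name] = (net.score(X_tr, y_tr), net.score(X_te, y_te))
    print(f"{name}: train R2 = {scores[name][0]:.3f}, test R2 = {scores[name][1]:.3f}")
```

A train score that is clearly higher than the test score is the "that's Frank!" situation: the model has memorized the training points instead of the general trend.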