Seeing a linear model fail made the need for deep learning feel obvious rather than theoretical.

Earlier, you added the non-linearity manually (polynomial features).

Here, the model learns the non-linearity automatically. That's deep learning.

# First Neural Network Experiment

## Research Question
Can a neural network learn a non-linear relationship without manual feature engineering?

## Hypothesis
A neural network with non-linear activation functions can model non-linear data effectively.


In [10]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [11]:
rng = np.random.default_rng(seed=42)
X = rng.random((200, 1)) * 5
y = X.squeeze()**2 + rng.standard_normal(200) * 2

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [13]:
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)

X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

This code converts a data structure (like a NumPy array or Python list) into a formatted PyTorch Tensor suitable for a neural network. It performs three critical operations in one line:

torch.tensor(y_train, ...)
This creates a PyTorch tensor from your raw data. Tensors are the fundamental data structures in PyTorch, similar to NumPy arrays but optimized for GPU acceleration and automatic differentiation (calculating gradients).

dtype=torch.float32
This explicitly sets the data type to 32-bit floating point.
Why it's necessary: PyTorch layers such as nn.Linear create their weights as float32 by default, while NumPy float arrays default to float64. torch.tensor keeps the NumPy dtype, so without the explicit dtype the inputs would be float64 and the layer would raise a dtype mismatch RuntimeError.
Performance: float32 is usually precise enough for training and is faster and lighter on memory than float64, especially on GPUs.

.view(-1, 1)
This reshapes the tensor into a 2D column vector.
The -1 (Infer Dimension): Tells PyTorch to calculate the number of rows from the total number of elements in the tensor.
The 1: Explicitly sets the second dimension to 1 column.
Why it's necessary: The model's output has shape (number_of_samples, 1), so the target variable (\(y\)) should have the same shape. If \(y\) is left as a 1D "flat" tensor of shape (number_of_samples,), MSELoss broadcasts the two shapes against each other, emits a warning, and averages over a (number_of_samples, number_of_samples) grid of differences, which silently gives the wrong loss.
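The dtype and shape points above can be checked on a toy example. This is a minimal sketch, not part of the original notebook; the three-element arrays and the names `raw`, `pred`, `target_flat` and `target_col` are made up for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

# torch.tensor keeps NumPy's float64 unless a dtype is given
raw = np.array([1.0, 2.0, 3.0])
as_f64 = torch.tensor(raw)
as_f32 = torch.tensor(raw, dtype=torch.float32)
print(as_f64.dtype, as_f32.dtype)

# Stand-in for a model output: shape (3, 1)
pred = torch.tensor([[1.0], [2.0], [3.0]])
target_flat = as_f32                 # shape (3,)
target_col = as_f32.view(-1, 1)      # shape (3, 1)

# (3, 1) minus (3,) broadcasts to a (3, 3) grid of differences
grid = pred - target_flat
print(grid.shape)

# Matching shapes give the intended element-wise loss
loss_ok = nn.MSELoss()(pred, target_col)
# Mean over the broadcast grid: non-zero although every prediction is exact
loss_grid = (grid ** 2).mean()
print(loss_ok.item(), loss_grid.item())
```

Every prediction matches its target here, so the column-shaped comparison gives a loss of zero, while the broadcast grid does not.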

In [14]:
# Define the neural network (Core Learning)
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 16),
            nn.ReLU(),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.net(x)


This code defines a neural network architecture in PyTorch. As of 2026, inheriting from the nn.Module base class remains the standard way to build custom models.

Class Definition & Initialization
class SimpleNN(nn.Module):
Inheriting from nn.Module gives your class all the necessary tools to handle parameters, move to GPUs, and perform backpropagation.
super().__init__()
This line is mandatory. It initializes the internal PyTorch machinery within the parent nn.Module class so that your custom layers are tracked correctly. [1]

nn.Sequential (The "Container")
Instead of defining layers separately, nn.Sequential wraps them into a single object. Data flows through these layers in the exact order they are listed. [1, 2]
nn.Linear(1, 16) (Input Layer): This is a "Fully Connected" (Dense) layer. It takes 1 input (your single feature \(X\)) and projects it into a 16-dimensional hidden space. This expansion allows the model to learn more complex patterns.
nn.ReLU() (Activation Function): ReLU stands for Rectified Linear Unit. It replaces all negative values with zero. This is the most critical part: without this, your model is just a series of linear equations (a straight line). ReLU adds the "non-linearity" needed to fit curves. [3]
nn.Linear(16, 1) (Output Layer): This compresses the 16 hidden signals back down to 1 single output (your prediction \(y\)).

Input: 1 value (e.g., \(X=2.5\))
Linear 1: Becomes 16 different weighted values.
ReLU: Any negative values in those 16 are "turned off" (set to 0).
Linear 2: Combines those 16 values into 1 final prediction (e.g., \(y=11.2\))
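The four steps above can be traced by hand in NumPy. This is a sketch under stated assumptions: it uses 3 hidden units instead of 16 to stay readable, and the weights `W1`, `b1`, `W2`, `b2` are hypothetical values chosen for easy arithmetic, not weights learned by SimpleNN.

```python
import numpy as np

x = np.array([[2.5]])                      # Input: one sample, one feature

# Hypothetical weights, stored as (out_features, in_features) like nn.Linear
W1 = np.array([[1.0], [-1.0], [0.5]])
b1 = np.array([0.0, 1.0, -0.5])
W2 = np.array([[2.0, 3.0, 4.0]])
b2 = np.array([0.25])

hidden = x @ W1.T + b1                     # Linear 1: 1 value becomes 3 weighted values
activated = np.maximum(hidden, 0.0)        # ReLU: negative values are set to 0
output = activated @ W2.T + b2             # Linear 2: 3 values combine into 1 prediction

print(hidden)      # [[ 2.5  -1.5   0.75]]
print(activated)   # [[2.5  0.   0.75]]
print(output)      # [[8.25]]
```

The second hidden unit is negative, so ReLU switches it off and it contributes nothing to the final prediction.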

First layer: learns features

ReLU: introduces non-linearity

Output layer: regression prediction
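To see why the ReLU matters, the sketch below removes it and checks that the two linear layers then collapse into a single linear map. The random weights are placeholders with the same shapes as SimpleNN's layers, not the trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights with the shapes of nn.Linear(1, 16) and nn.Linear(16, 1)
W1, b1 = rng.standard_normal((16, 1)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((1, 16)), rng.standard_normal(1)

x = np.linspace(0, 5, 50).reshape(-1, 1)

# Two stacked linear layers with no activation in between
stacked = (x @ W1.T + b1) @ W2.T + b2

# The same function written as one linear layer: y = W x + b
W = W2 @ W1            # shape (1, 1): a single slope
b = W2 @ b1 + b2       # shape (1,): a single intercept
single = x @ W.T + b

print(np.allclose(stacked, single))
```

However many linear layers are stacked this way, the result is still one slope and one intercept, which cannot follow a curve like \(x^2\).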

In [15]:
# Training setup
model = SimpleNN()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

This code snippet initializes the three core components of a PyTorch training pipeline: the architecture, the error metric, and the update mechanism.

model = SimpleNN() 
Purpose: This creates an instance of a neural network class (which you must define beforehand).
Function: It allocates the layers (weights and biases) in memory. The variable model now represents the specific mathematical function that will take input data and produce a prediction. 

criterion = nn.MSELoss() 
Purpose: Defines the Loss Function, which measures how "wrong" the model is.
Function: MSELoss stands for Mean Squared Error. It calculates the average squared difference between the model's predictions and the actual target values. It is the standard choice for regression tasks.
Documentation: PyTorch MSELoss
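A small worked example of the arithmetic behind MSELoss, using three made-up predictions and targets (these numbers are illustrative, not from the notebook's data):

```python
import torch
import torch.nn as nn

predictions = torch.tensor([[2.0], [4.0], [6.0]])
targets = torch.tensor([[1.0], [4.0], [8.0]])

# Differences are 1, 0, -2; squares are 1, 0, 4; their mean is 5/3
manual = ((predictions - targets) ** 2).mean()
builtin = nn.MSELoss()(predictions, targets)

print(manual.item(), builtin.item())
```

Squaring means the single error of 2 contributes four times as much as the error of 1, so large misses dominate the loss.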

optimizer = optim.Adam(model.parameters(), lr=0.01) 
Purpose: Defines the Optimization Algorithm used to update the model's weights to reduce the loss.
Components:
optim.Adam: An algorithm that adapts the step size for each parameter individually, often converging faster than standard Stochastic Gradient Descent (SGD).
model.parameters(): Tells the optimizer which weights and biases it is allowed to modify during training.
lr=0.01: Sets the Learning Rate, which scales the size of the "step" the optimizer takes. With plain SGD, each weight would move by 0.01 times its gradient. Adam instead rescales each gradient by a running estimate of its magnitude, so each weight moves by roughly 0.01 or less per step, largely independent of how big the raw gradient is.
Documentation: PyTorch Adam Optimizer
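How the learning rate turns into an actual step differs between optimizers. The sketch below compares the first update of SGD and Adam on a single weight for a small and a large gradient. The helper `first_step` and the hand-set gradients are hypothetical and exist only for this illustration.

```python
import torch
import torch.optim as optim

def first_step(optimizer_cls, grad_value, lr=0.01):
    """Return how far one weight moves on the first update for a given gradient."""
    w = torch.zeros(1, requires_grad=True)
    opt = optimizer_cls([w], lr=lr)
    w.grad = torch.tensor([grad_value])  # hand-set gradient, for illustration only
    opt.step()
    return abs(w.item())

for g in (0.5, 500.0):
    sgd_move = first_step(optim.SGD, g)
    adam_move = first_step(optim.Adam, g)
    print(f"gradient {g}: SGD moves {sgd_move:.4f}, Adam moves {adam_move:.4f}")
```

SGD's step is the learning rate times the gradient, so it grows with the gradient. Adam's first step stays close to the learning rate in both cases, because the gradient is divided by an estimate of its own magnitude.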

In [16]:
# Train the network
epochs = 500

for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X_train_t)
    loss = criterion(outputs, y_train_t)
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")


Epoch 0, Loss: 129.4090
Epoch 100, Loss: 6.7439
Epoch 200, Loss: 4.3391
Epoch 300, Loss: 4.1285
Epoch 400, Loss: 4.0952


This loop is the heart of deep learning: the standard PyTorch training loop. It repeats a fixed set of steps so the model learns from your data over many iterations.

Core Loop Structure
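epochs = 500: An epoch is one full pass through your entire training dataset. This line specifies that the model will see and learn from the data 500 times. Because the whole training set is passed through the model at once here, each epoch is also exactly one weight update.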
for epoch in range(epochs):: This initiates the loop, running the training steps 500 times in sequence. 

The 5 Essential Steps of Training
Inside the loop, every iteration follows this mandatory order:
optimizer.zero_grad() (#Reset): In PyTorch, gradients are accumulated (added together) by default. This command clears out the old gradients from the previous pass so you start each epoch with a clean slate.
outputs = model(X_train_t) (#Forward Pass): The model takes your training data (X_train_t) and makes a prediction based on its current internal weights.
loss = criterion(outputs, y_train_t) (#Calculate Error): The criterion compares the model's predictions (outputs) against the actual correct answers (y_train_t) to see how much error it made.
loss.backward() (#Backpropagation): This is the "math" step. It computes the gradient of the loss with respect to every weight, that is, how much the error would change if that weight changed slightly.
optimizer.step() (#Update Weights): The optimizer uses the information from the backward pass to slightly adjust the model's weights in the direction that will reduce the error next time.
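The need for the reset step can be seen on a single weight. This is a minimal sketch with a made-up loss of \(3w\), whose gradient is always 3; the SGD optimizer is used only because it is the simplest object that provides zero_grad().

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3.0 * w).sum().backward()
after_first = w.grad.item()      # d(3w)/dw = 3

(3.0 * w).sum().backward()       # no reset in between
after_second = w.grad.item()     # 3 + 3 = 6: the gradients were added together

optimizer.zero_grad()            # clears the stored gradient
(3.0 * w).sum().backward()
after_reset = w.grad.item()      # back to 3

print(after_first, after_second, after_reset)
```

Without the reset, the second backward pass reports a gradient twice as large as the true one, and the optimizer would take a correspondingly wrong step.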

Progress Monitoring
if epoch % 100 == 0:: This logic ensures you don't clutter your screen; it prints a status update every 100 epochs.
loss.item():.4f: loss.item() extracts the error value from a PyTorch tensor and converts it into a standard Python number. The .4f formats it to 4 decimal places for readability. 
In the output above, the loss falls from 129.4 at epoch 0 to about 4.1 by epoch 400 and then levels off. If it had stayed high or gone up, the learning rate (lr) would be the first thing to adjust.
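The training loss settles near 4.1 rather than 0. One way to judge whether that is good is to measure the error of the true curve \(y = x^2\) against the noisy targets. The data were generated with noise of standard deviation 2, so the expected value is about \(2^2 = 4\). The sketch below regenerates the data with the same generator calls as the data cell; the name `noise_floor` is introduced here for illustration.

```python
import numpy as np

# Same generator calls, in the same order, as the notebook's data cell
rng = np.random.default_rng(seed=42)
X = rng.random((200, 1)) * 5
y = X.squeeze() ** 2 + rng.standard_normal(200) * 2

# Error of the true curve y = x^2 against the noisy targets
noise_floor = np.mean((y - X.squeeze() ** 2) ** 2)
print(f"MSE of the true curve: {noise_floor:.3f} (noise variance is 2**2 = 4)")
```

A model whose loss sits close to this value has learned about as much as the noise allows. Pushing the training loss well below it would mean fitting the noise.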



In [17]:
# Evaluate model
model.eval()
with torch.no_grad():
    predictions = model(X_test_t)
    mse_nn = mean_squared_error(
        y_test_t.numpy(), predictions.numpy()
    )

mse_nn

4.2512125968933105

This code performs a "clean" evaluation of your model on the test dataset to measure how well it generalizes to unseen data.

model.eval()
Purpose: Switches the model from Training Mode to Evaluation Mode.
Effect: This changes the behavior of layers that act differently during training, such as Dropout (which stops randomly zeroing activations) and Batch Normalization (which switches to its running statistics instead of the current batch's statistics). SimpleNN contains neither, so the call does not change its predictions here, but it is a good habit that keeps evaluation consistent and deterministic.
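The effect of the mode switch is easiest to see on a layer that actually depends on it. The sketch below uses a standalone nn.Dropout layer, which is hypothetical here since SimpleNN has no dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # hypothetical layer; SimpleNN has no dropout
x = torch.ones(1, 8)

drop.train()
train_out = drop(x)        # each entry is either 0 or scaled to 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)         # dropout is a no-op in evaluation mode

print(train_out)
print(eval_out)
```

In training mode the output changes from call to call. In evaluation mode the input passes through unchanged.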

with torch.no_grad():
Purpose: Temporarily deactivates the Autograd engine.
Effect: Since you are only making predictions and not updating weights, you don't need to track gradients. This reduces memory consumption and speeds up computation, because the intermediate values required for backpropagation are not stored.
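A short sketch of what "not tracking gradients" means in practice, using a throwaway nn.Linear layer (not the trained model):

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
x = torch.ones(2, 1)

tracked = layer(x)             # autograd records this operation
with torch.no_grad():
    untracked = layer(x)       # same numbers, no graph attached

print(tracked.requires_grad, untracked.requires_grad)
print(tracked.grad_fn is not None, untracked.grad_fn is None)
```

Both calls produce the same values. Only the first result carries the bookkeeping needed for a later backward() call.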

Making Predictions and Converting to NumPy
predictions = model(X_test_t): Pass the test data through the model to get its final predictions.
.numpy(): Scikit-Learn's metrics are built around NumPy arrays, so both the actual values (y_test_t) and the model's predictions are converted back into NumPy arrays for the calculation. This works directly here because the tensors live on the CPU and, inside torch.no_grad(), the predictions carry no gradient history.

mean_squared_error(...)
Purpose: Calculates the final score for your model.
Function: It computes the average squared difference between the true test values and the predicted values. A lower value indicates a more accurate model.

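Tip for 2026: Compare mse_nn with the final training loss. Here the test MSE (about 4.25) is close to the last printed training loss (about 4.10), which suggests the model is generalizing rather than memorizing the training data. A test error far above the training loss would point to overfitting.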

## Model Comparison

- Linear Regression: High error
- Polynomial Regression: Low error
- Neural Network: Comparable or better error (test MSE ≈ 4.25)

### Interpretation
The neural network learned non-linear patterns automatically without manual feature engineering.


## Reflection
- Neural networks are flexible function approximators
- Non-linearity is learned via activation functions
- This feels like a natural extension of earlier models


Neural Networks (NN) are more flexible, but Polynomial Regression can match or beat them on test data when working with smaller, structured, or low-dimensional datasets. That is especially likely here, because the data were generated from \(y = x^2\) plus noise, which a degree-2 polynomial can represent exactly.
If my polynomial model's test error is lower, it indicates that the polynomial approach is a better "fit" for the specific complexity of my data.
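To put numbers next to the comparison, the sketch below fits a straight line and a polynomial on the same data and split as this notebook and scores both on the test set. It is an illustration, not a reproduction of the earlier experiments: the degree of 2, the use of np.polyfit, and the helper `poly_test_mse` are assumptions, since the earlier polynomial model is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Same data and split as the cells above
rng = np.random.default_rng(seed=42)
X = rng.random((200, 1)) * 5
y = X.squeeze() ** 2 + rng.standard_normal(200) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def poly_test_mse(degree):
    """Fit a polynomial of the given degree on the training split; score on the test split."""
    coeffs = np.polyfit(X_train.squeeze(), y_train, deg=degree)
    preds = np.polyval(coeffs, X_test.squeeze())
    return mean_squared_error(y_test, preds)

mse_linear = poly_test_mse(1)      # straight line
mse_quadratic = poly_test_mse(2)   # assumed degree for the polynomial model

print(f"linear: {mse_linear:.3f}, quadratic: {mse_quadratic:.3f}")
```

These values can be set beside mse_nn from the evaluation cell to see how the three approaches compare on identical test points.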