# Module 19 - Hyperparameters

## Learning outcomes

- LO 1: Dissect a simple neural network using the PyTorch optimisation framework.
- LO 2: Evaluate the trade-offs of employing deep neural networks in your machine learning projects.
- LO 3: Refine an existing neural network for a specific application.
- LO 4: Develop intuition on the structure and components of a neural network.
- LO 5: Refine a codebase for machine learning competitions.

## Misc and Keywords
- **Gradient descent** with momentum is designed to help the gradient descent algorithm converge faster.
- **Fully connected** means that every node in one layer is connected by an edge, or weight, to every node in the next layer.
- A **Torch tensor** is an *n-dimensional* array, much like a NumPy array, except that it can also run on GPUs.
- In PyTorch, the **forward function** defines the computation that the model performs on input data to produce output.
- **Tokenisation** is the process of breaking a text sample into smaller units (tokens), such as words.

## Module Summary Description
One of the keys to building a neural network is making the various design choices that arise along the way. Alex explains that these design parameters, commonly known as hyperparameters, are like barcodes for different models on your model shelf. In this module, you’ll learn about hyperparameter optimization, which will help you to choose the best model for a given task.

#### Learn PyTorch in 60 minutes: 
https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## Hyperparameters

In deep learning, **hyperparameters** are configuration settings used to control the training process of a model.  
Unlike model parameters (which are learned during training), hyperparameters are set before training begins and influence how learning happens.

### Why Hyperparameters Matter

Hyperparameters impact:
- **Model performance:** Poor choices can lead to underfitting or overfitting.
- **Training time:** Some settings speed up or slow down convergence.
- **Resource usage:** Affects memory, CPU/GPU usage, etc.

### Examples of hyperparameter choices:

- Which step size (learning rate), $s$, to use for stochastic gradient descent (SGD)
- Which batch size, $B$, to use for SGD
- Which dropout rate, $r$, to use for regularisation
- Whether to regularise with weight decay, and at which rate
- Which loss function to use: quadratic or cross-entropy?

Hyperparameters are not the same as the weights in the network, which are learnt through gradient descent; hyperparameters must instead be defined by the user.
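
To make the distinction concrete, here is a minimal PyTorch sketch. The `hparams` dictionary, its keys and its values are illustrative assumptions rather than settings from this module; the point is only that hyperparameters are written down by the user, while parameters live inside the model and are handed to the optimiser.

```python
import torch
import torch.nn as nn

# Hyperparameters: chosen by the user before training (illustrative values).
hparams = {"hidden_size": 16, "lr": 0.01, "weight_decay": 1e-4}

# Parameters: the weights and biases that training will adjust.
model = nn.Sequential(
    nn.Linear(4, hparams["hidden_size"]),
    nn.ReLU(),
    nn.Linear(hparams["hidden_size"], 1),
)

# The optimiser receives the parameters and is configured by hyperparameters.
optimizer = torch.optim.SGD(
    model.parameters(), lr=hparams["lr"], weight_decay=hparams["weight_decay"]
)

n_params = sum(p.numel() for p in model.parameters())
print("learned parameters:", n_params)
print("hyperparameters:", hparams)
```

Changing `hidden_size` changes how many parameters exist, which is one way a hyperparameter shapes the model before any learning happens.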

### Key Hyperparameters

#### Learning Rate (`lr`)

Controls how much to change the model in response to the estimated error each time the model weights are updated.

**Trade-off:**  
- Too high → unstable training, may diverge.
- Too low → slow convergence or stuck in local minima.

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)
```
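
The trade-off above can be seen on a toy problem. The sketch below minimises $f(w) = w^2$ with plain gradient descent; the function `gradient_descent` and the three learning rates are hypothetical choices made for illustration only.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimise f(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # gradient descent update
    return w


w_high = gradient_descent(lr=1.1)    # each step overshoots, so |w| grows
w_good = gradient_descent(lr=0.1)    # w shrinks steadily towards the minimum at 0
w_low = gradient_descent(lr=0.001)   # w has barely moved after 20 steps

print(w_high, w_good, w_low)
```

Each update multiplies `w` by `1 - 2 * lr`, so a learning rate above 1 makes that factor larger than 1 in magnitude and the iterates diverge.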

#### Batch Size
 
Number of samples processed before the model updates.

**Trade-off:**  
- Small → noisy updates but better generalisation.
- Large → stable updates but risk of poorer generalisation.

```python
model.fit(X_train, y_train, batch_size=32, epochs=10)
```
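
Since this module works with PyTorch, a rough equivalent is sketched below using `torch.utils.data.DataLoader`. The random data and the dataset size of 100 are assumptions for illustration; the sketch shows how the batch size determines the number of weight updates per epoch.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative random data: 100 samples with 5 features each.
X = torch.randn(100, 5)
y = torch.randn(100, 1)
dataset = TensorDataset(X, y)

batch_size = 32
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# One optimiser step is normally taken per batch, so this is the
# number of updates in one epoch.
updates_per_epoch = len(loader)
batch_sizes = [xb.shape[0] for xb, _ in loader]

print(updates_per_epoch, batch_sizes)
```

With the default `drop_last=False`, the final batch holds whatever samples remain, so it can be smaller than `batch_size`.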

#### Number of Epochs

Number of complete passes through the entire training dataset.

**Trade-off:**  
- Too few → underfitting.
- Too many → overfitting.

```python
model.fit(X_train, y_train, epochs=50)
```

#### Optimiser Type

Algorithm used to update network weights.

**Examples:**  
- SGD (Stochastic Gradient Descent) → simple but may struggle with complex landscapes.
- Adam → adaptive learning rate, works well in most cases.
- RMSprop → works well with recurrent networks.

```python
from tensorflow.keras.optimizers import SGD, Adam

# Pick one of these; distinct names avoid overwriting the first optimiser.
sgd_optimizer = SGD(learning_rate=0.01)
adam_optimizer = Adam(learning_rate=0.001)
```
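
The same optimisers are available in `torch.optim`. The sketch below is a minimal illustration, assuming a small `nn.Linear` model and random data that are not part of this module; it also shows SGD with momentum, mentioned in the keywords.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# Candidate optimisers; the learning rates are illustrative, not recommendations.
optimizers = {
    "sgd": torch.optim.SGD(model.parameters(), lr=0.01),
    "sgd_momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "adam": torch.optim.Adam(model.parameters(), lr=0.001),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=0.001),
}

# One training step with the chosen optimiser on random data.
x, y = torch.randn(16, 10), torch.randn(16, 1)
optimizer = optimizers["adam"]
before = model.weight.detach().clone()

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

changed = not torch.equal(before, model.weight)
print("weights changed:", changed)
```

Swapping the key used to pick `optimizer` is enough to compare optimisers while the rest of the training step stays the same.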

#### Dropout Rate
 
Fraction of neurons randomly set to zero during training to prevent overfitting.

**Trade-off:**  
- Low dropout → risk of overfitting.
- High dropout → risk of underfitting.

```python
from tensorflow.keras.layers import Dropout

model.add(Dropout(0.5))
```
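
In PyTorch, dropout is provided by `nn.Dropout`, and it behaves differently in training and evaluation mode. The sketch below uses an input of ones (an assumption chosen to make the effect easy to see).

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
drop.train()
out_train = drop(x)

# Evaluation mode: dropout is disabled and the input passes through unchanged.
drop.eval()
out_eval = drop(x)

print(out_train.unique())
print(torch.equal(out_eval, x))
```

This is why `model.eval()` should be called before validation or inference: otherwise dropout keeps randomly zeroing activations.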

#### Weight Initialisation
 
How initial values for weights are set before training.

**Trade-off:**  
- Poor initialisation → slow or failed convergence.
- Good initialisation (e.g., He or Xavier) → faster and stable training.

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import HeNormal

model.add(Dense(64, kernel_initializer=HeNormal()))
```
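
A PyTorch counterpart can be sketched with the functions in `torch.nn.init`. The layer sizes below are arbitrary assumptions; the check at the end compares the sampled weights with the standard deviation that He initialisation targets, $\sqrt{2 / \text{fan\_in}}$.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# He (Kaiming) initialisation, commonly paired with ReLU activations.
hidden = nn.Linear(256, 64)
nn.init.kaiming_normal_(hidden.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(hidden.bias)

# Xavier (Glorot) initialisation, shown here on a second layer.
out_layer = nn.Linear(64, 10)
nn.init.xavier_uniform_(out_layer.weight)

expected_std = math.sqrt(2.0 / 256)          # target std for He init, fan_in = 256
actual_std = hidden.weight.std().item()
xavier_bound = math.sqrt(6.0 / (64 + 10))    # uniform bound for Xavier init

print(actual_std, expected_std)
```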

#### Regularisation (L1, L2)

Penalties added to the loss function to discourage large weights and prevent overfitting.

**Trade-off:**  
- Too strong → underfitting.
- Too weak → overfitting.

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

model.add(Dense(64, kernel_regularizer=l2(0.01)))
```
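
In PyTorch, L2 regularisation is usually applied through the optimiser's `weight_decay` argument. The sketch below, using a made-up loss and hypothetical values for `lam` and `lr`, checks that for plain SGD a weight decay of $\lambda$ gives the same update as adding the penalty $\frac{\lambda}{2}\lVert w \rVert^2$ to the loss.

```python
import torch

torch.manual_seed(0)
lam, lr = 0.01, 0.1
x = torch.randn(5)

# Two identical copies of the same weight vector.
w_a = torch.randn(5, requires_grad=True)
w_b = w_a.detach().clone().requires_grad_(True)

# (a) Explicit L2 penalty added to the loss.
opt_a = torch.optim.SGD([w_a], lr=lr)
loss_a = (w_a * x).sum() ** 2 + 0.5 * lam * (w_a ** 2).sum()
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# (b) The same data loss, with the penalty expressed as weight decay.
opt_b = torch.optim.SGD([w_b], lr=lr, weight_decay=lam)
loss_b = (w_b * x).sum() ** 2
opt_b.zero_grad()
loss_b.backward()
opt_b.step()

same_update = torch.allclose(w_a, w_b, atol=1e-6)
print("updates match:", same_update)
```

The equivalence shown here is specific to plain SGD; adaptive optimisers such as Adam treat the two formulations differently.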

#### Early Stopping
  
Stops training when the model performance stops improving on validation data.

**Trade-off:**  
- Helps avoid overfitting.
- If used improperly, may stop too early (underfitting).

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5)

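# validation_split holds out data so that 'val_loss' exists for the callback to monitor
model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])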
```
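
The patience logic behind early stopping can be sketched in plain Python. The function `early_stopping_epoch` and the list of validation losses are hypothetical, written only to illustrate the idea; this is not how Keras implements its callback.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.70, 0.74]
stop = early_stopping_epoch(val_losses, patience=3)
print("stopped at epoch index", stop)
```

With `patience=3`, the three non-improving epochs after index 2 trigger the stop at index 5. A smaller patience stops sooner and risks underfitting, which is the trade-off described above.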



## Introduction to PyTorch

#### Torch Tensors
A **Torch tensor** is an *n-dimensional* array, much like a NumPy array, with these differences:
- It can run on GPUs or other accelerators (with `.to(device)` or `.cuda()`), while NumPy is CPU-only.
- It supports automatic differentiation (with `requires_grad=True`), which allows gradients to be calculated and used for optimization (NumPy does not support this natively).
- It integrates seamlessly with PyTorch's neural network modules and optimizers.
- It has extra functionality specific to deep learning (e.g., in-place ops, mixed precision, hooks).

An example is shown below:

```python
import torch
import numpy as np

# NumPy array
np_array = np.array([[1, 2], [3, 4]])

# Torch tensor
tensor = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32, requires_grad=True)

# GPU tensor (optional): only move the tensor if a CUDA device is available
if torch.cuda.is_available():
    tensor_gpu = tensor.to("cuda")
```

#### What is *requires_grad* in PyTorch?
In PyTorch, every Tensor has a flag called requires_grad.
When set to True, PyTorch tracks all operations on the tensor so that it can automatically compute gradients during backpropagation.

This enables the use of automatic differentiation (autograd), which is essential for training neural networks.
Leaf tensors (such as model parameters) with *requires_grad=True* will accumulate gradients in their `.grad` attribute after `.backward()` is called.

##### How it works:
- Forward pass: PyTorch records the operations in a computational graph.
- Backward pass (.backward()): PyTorch traverses this graph in reverse, applying the chain rule to compute gradients.
- Result: Each leaf tensor with `requires_grad=True` (for example, a model parameter) will have its `.grad` populated with the gradient of the loss with respect to that tensor.
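
The steps above can be checked on a one-parameter example. The values of `x`, `y` and `w` below are arbitrary assumptions; the sketch compares the gradient computed by autograd with the one obtained by applying the chain rule by hand to $(wx - y)^2$, namely $2(wx - y)\,x$.

```python
import torch

x = torch.tensor(3.0)
y = torch.tensor(10.0)
w = torch.tensor(2.0, requires_grad=True)  # the only tensor we differentiate

# Forward pass: PyTorch records the operations that produce the loss.
loss = (w * x - y) ** 2

# Backward pass: the chain rule is applied through the recorded graph.
loss.backward()

# Hand-derived gradient: d/dw (w*x - y)^2 = 2 * (w*x - y) * x
manual_grad = 2 * (w.item() * x.item() - y.item()) * x.item()

print(w.grad.item(), manual_grad)
```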

##### Why is this useful?
Gradients indicate how much a small change in the tensor affects the loss.

Optimizers like torch.optim.SGD use these gradients to update parameters and minimize the loss.

##### Preventing gradient tracking (*torch.no_grad*)
In scenarios where gradient computation is unnecessary, such as model evaluation or inference, we can temporarily disable gradient tracking using torch.no_grad().
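
A minimal sketch of this, using an arbitrary random tensor, shows that results computed inside the `torch.no_grad()` block are not tracked:

```python
import torch

x = torch.randn(3, requires_grad=True)

# Normal operation: the result is part of the computational graph.
y_tracked = x * 2

# Inside no_grad: the same operation is computed, but not recorded.
with torch.no_grad():
    y_untracked = x * 2

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
```

Skipping the graph saves memory and computation, which is why evaluation loops are commonly wrapped in `torch.no_grad()`.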

```python
import torch

# Tensor with gradients enabled
x = torch.randn(3, requires_grad=True)

# Operation
y = x * 2

# Backward pass
y.sum().backward()

print(x.grad)  # Gradient of y.sum() w.r.t. x: tensor([2., 2., 2.])
```


---
## Vanilla NumPy Neural Network

```python
import numpy as np

# N is batch size; D_in is input dimension or number of features;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss (sum of squared errors)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0  # ReLU passes no gradient where its input was negative
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print('This is the shape of w2:', w2.shape)

# This is the trained neural network, wrapped as a function
def simple_nn(x):
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    return y_pred
```

```text
0 25344617.80024845
1 19241213.077182896
2 16542566.454395559
3 14849562.121751975
4 13073525.812466811
5 10990722.99077887
6 8671290.910533313
7 6490916.280778909
8 4642667.820183601
9 3253403.8974768245
10 2261362.1082796818
11 1588702.3658847143
12 1137093.229593238
13 836452.3221806737
14 633186.1476644496
15 493606.2001597121
16 395103.45354633604
17 323644.12690757733
18 270139.9049382714
19 228924.04802692542
20 196377.3004188963
21 170141.81240083894
22 148531.91507549188
23 130457.2192998358
24 115151.67019756013
25 102049.25741051119
26 90751.39043669734
27 80947.29451531041
28 72400.74906637982
29 64911.51236956321
30 58330.32798915943
31 52519.47797378483
32 47379.431599016825
33 42837.71283637645
34 38794.97972227994
35 35192.964329675306
36 31975.509668982988
37 29092.521114024465
38 26505.626401555837
39 24179.97715220185
40 22089.419774448073
41 20205.548797438147
42 18503.3526295163
43 16963.30617017058
44 15568.402758653636
45 14303.03564684394
46 13153.310028642969
...
```

---
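## PyTorch Version
The same two-layer architecture, built in PyTorch. Note that `nn.Linear` adds bias terms and uses its own default initialisation, and the learning rate here is `1e-4`, so the loss values differ from the NumPy run: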

```python
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

```text
0 680.4976806640625
1 624.5360717773438
2 577.7935180664062
3 537.4030151367188
4 502.2336730957031
5 471.02789306640625
6 442.8564453125
7 417.1272277832031
8 393.5264587402344
9 371.5996398925781
10 351.2056884765625
11 332.0414733886719
12 314.080078125
13 297.07354736328125
14 280.8624267578125
15 265.52313232421875
16 250.883056640625
17 237.0072021484375
18 223.77545166015625
19 211.11370849609375
20 199.01454162597656
21 187.4600830078125
22 176.50665283203125
23 166.07212829589844
24 156.16192626953125
25 146.7483673095703
26 137.82666015625
27 129.40196228027344
28 121.45137786865234
29 113.9508285522461
30 106.88719177246094
31 100.21592712402344
32 93.94758605957031
33 88.05029296875
34 82.52310943603516
35 77.31945037841797
36 72.44427490234375
37 67.87490844726562
38 63.600650787353516
39 59.59953308105469
40 55.85410690307617
41 52.354244232177734
42 49.082672119140625
43 46.025733947753906
44 43.160709381103516
45 40.480628967285156
46 37.97616195678711
47 35.63505172729
...
```

```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(5, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNN()
input_data = torch.randn(1, 5)

output = model(input_data)  # This calls forward internally
```
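
To dissect a model like this one, it can help to list its learnable parameters and check the output shape. The sketch below repeats the `SimpleNN` definition so that it runs on its own; the batch of four random inputs is an assumption for illustration.

```python
import torch
import torch.nn as nn


class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))


model = SimpleNN()

# Each nn.Linear contributes a weight matrix and a bias vector.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

n_params = sum(p.numel() for p in model.parameters())
print("total learnable parameters:", n_params)

# A batch of 4 samples with 5 features each gives 4 outputs of size 1.
batch = torch.randn(4, 5)
out = model(batch)
print(out.shape)
```

Note that `nn.Linear(5, 8)` stores its weight with shape `(8, 5)`, that is, `(out_features, in_features)`.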