# Discovering activation functions

**Limitations of the Sigmoid and Softmax functions**

In [None]:
"""

The outputs of the sigmoid are bounded between zero and one, meaning that for any input, the output will be greater than zero and less than one.
Unlike the softmax function, it can be used anywhere in the network.

When we plot the derivative (or gradient) of the sigmoid function, we find that the gradient is always small and approaches zero for both very low and very high values of x.
This behavior is called "saturation". This property of the sigmoid function creates a challenge during backpropagation
because each local gradient is a function of the previous gradient. For high and low values of x, the gradient will be so small
that it can prevent the weights from changing or updating.

This phenomenon is called vanishing gradients and makes training the network challenging.
Because each element of the output vector of a softmax activation function is also bounded between zero and one, softmax saturates in the same way.

"""

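The saturation described above can be checked numerically. The following is a minimal sketch, assuming `torch.sigmoid` and autograd are used to evaluate the gradient; the three input values and the `gradients` dictionary are chosen only for illustration.

```python
import torch

# Gradient of the sigmoid at a very low, a central, and a very high input
gradients = {}
for value in [-10.0, 0.0, 10.0]:
    x = torch.tensor(value, requires_grad=True)
    y = torch.sigmoid(x)
    y.backward()  # fills x.grad with d(sigmoid)/dx at this input
    gradients[value] = x.grad.item()
    print(f"x = {value:6.1f}  sigmoid(x) = {y.item():.5f}  gradient = {gradients[value]:.5f}")
```

The gradient peaks at x = 0 and is close to zero at both extremes, which is the saturation behavior.
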
**ReLU**

In [None]:
"""

The ReLU function outputs the maximum between its input and zero. For positive inputs, the output of the function is equal to the input.
For strictly negative inputs, the output of the function is equal to zero.

This function does not have an upper bound and the gradients do not converge to zero for high values of x, which overcomes the vanishing gradients problem.
In PyTorch, the ReLU function may be called using the nn module.


relu = nn.ReLU()

"""

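As a quick illustration of the description above, this sketch applies `nn.ReLU` to a few hand-picked values and checks the gradient at a large positive input. The input values and variable names are assumptions made for the example.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# Negative inputs map to zero, positive inputs pass through unchanged
inputs = torch.tensor([-2.0, 0.0, 3.0])
outputs = relu(inputs)
print(outputs)

# The gradient stays at 1 for a large positive input (no saturation)
x = torch.tensor(100.0, requires_grad=True)
relu(x).backward()
large_input_gradient = x.grad.item()
print(large_input_gradient)
```
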
**Leaky ReLU**

In [None]:
"""

The leaky ReLU is a variation of the ReLU function. For positive inputs, it behaves identically to the ReLU function.
For negative inputs, however, it multiplies them by a small coefficient (0.01 by default in PyTorch).
By doing this, the leaky ReLU function has nonzero gradients for negative inputs.

In PyTorch, the leaky ReLU function is called using the nn module as well. The negative_slope parameter indicates the coefficient by which
negative inputs are multiplied.

leaky_relu = nn.LeakyReLU(negative_slope = 0.05)

"""

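To see the nonzero gradient for a negative input, here is a small sketch that runs a backward pass through a leaky ReLU with the same `negative_slope` of 0.05. The input value of -2.0 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

leaky_relu = nn.LeakyReLU(negative_slope=0.05)

x = torch.tensor(-2.0, requires_grad=True)
y = leaky_relu(x)  # negative input is scaled by negative_slope
y.backward()

output_value = y.item()
gradient_value = x.grad.item()
print(output_value, gradient_value)
```

For a negative input, the gradient equals the negative slope itself, so the weight can still be updated.
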
In [None]:
"""

Calculate the gradient of the ReLU function at x: apply the relu_pytorch function you defined to x, then run a backward pass.
Read the gradient at x from x.grad.

"""

import torch
import torch.nn as nn

# Create a ReLU function with PyTorch
relu_pytorch = nn.ReLU()

# Apply your ReLU function on x, and calculate gradients
x = torch.tensor(-1.0, requires_grad=True)
y = relu_pytorch(x)
y.backward()

# Print the gradient of the ReLU function for x
gradient = x.grad
print(gradient)

In [None]:
"""

Create a leaky ReLU function in PyTorch with a negative slope of 0.05.
Call the function on the tensor x, which has already been defined for you.

"""

# Create a leaky relu function in PyTorch
leaky_relu_pytorch = nn.LeakyReLU(negative_slope=0.05)

x = torch.tensor(-2.0)
# Call the above function on the tensor x
output = leaky_relu_pytorch(x)
print(output)

# A deeper dive into neural network architecture

**Counting the number of parameters**

In [None]:
### The capacity of a network reflects the number of parameters in said network.

total = 0
for parameter in model.parameters():
  total += parameter.numel()
print(total)

In [None]:
"""
Create a three-layer linear neural network with fewer than 120 parameters, using n_features as the input size and n_classes as the output size.
"""

def calculate_capacity(model):
  total = 0
  for p in model.parameters():
    total += p.numel()
  return total

n_features = 8
n_classes = 2

input_tensor = torch.Tensor([[3, 4, 6, 2, 3, 6, 8, 9]])

# Create a neural network with fewer than 120 parameters
model = nn.Sequential(
    nn.Linear(n_features , 4),
    nn.Linear(4 , 3),
    nn.Linear(3 , n_classes)
)
output = model(input_tensor)

print(calculate_capacity(model))
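The count printed above can be cross-checked by hand. This sketch rebuilds the same three-layer model and compares the hand calculation with the value obtained from `model.parameters()`; the names `by_hand` and `counted` are introduced only for this example.

```python
import torch.nn as nn

n_features, n_classes = 8, 2
model = nn.Sequential(
    nn.Linear(n_features, 4),
    nn.Linear(4, 3),
    nn.Linear(3, n_classes),
)

# Each nn.Linear(n_in, n_out) holds n_in * n_out weights plus n_out biases
by_hand = (8 * 4 + 4) + (4 * 3 + 3) + (3 * 2 + 2)
counted = sum(p.numel() for p in model.parameters())
print(by_hand, counted)
```
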

# Learning rate and momentum

**Momentum**

In [None]:
"""

This parameter provides momentum to the optimizer, enabling it to overcome local dips. Momentum keeps the step size large when previous steps were also large,
even if the current gradient is small.

"""

In [None]:
"""

Two parameters of an optimizer can be adjusted when training a model: the learning rate and the momentum.
The learning rate controls the step size taken by the optimizer. Typical learning rate values range from 10^-2 to 10^-4.
If the learning rate is too high, the optimizer may never be able to minimize the loss function. If it is too low, training may take longer.


The other parameter, momentum, controls the inertia of the optimizer. Without momentum, the optimizer may get stuck in a local optimum.
Momentum usually ranges from 0.85 to 0.99.


sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

"""

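The effect of momentum on step size can be sketched in plain Python. The update rule below (velocity = momentum * velocity + gradient, step = lr * velocity) is my understanding of classic SGD with momentum, with no dampening or Nesterov correction; the helper `sgd_step_sizes` and the gradient sequence are hypothetical and exist only for this illustration.

```python
def sgd_step_sizes(gradients, lr, momentum):
    """Return the size of each update for a sequence of gradients."""
    velocity = 0.0
    steps = []
    for grad in gradients:
        # Accumulate a running velocity from past gradients
        velocity = momentum * velocity + grad
        steps.append(lr * velocity)
    return steps

# Two large gradients followed by a small one (made-up values)
gradients = [1.0, 1.0, 0.1]
plain = sgd_step_sizes(gradients, lr=0.1, momentum=0.0)
with_momentum = sgd_step_sizes(gradients, lr=0.1, momentum=0.9)
print(plain)
print(with_momentum)
```

With these made-up gradients, the third step without momentum shrinks to 0.01, while the third step with momentum is still about 0.181 because the earlier large steps carry over.
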
# Layer initialization, transfer learning and fine-tuning

**Layer Initialization**

In [None]:
"""

In general, when training a neural network, we initialize the weights of each layer to small values. So why is this required?
We previously discussed how the output of a neuron in a linear layer is a weighted sum of the outputs of the previous layer.
By keeping both the input data and the layer's weights small, we ensure that the outputs of our layers remain small.
A layer can be initialized in different ways, for example using the uniform distribution. Layer initialization is an active topic of research.

"""

In [1]:
import torch.nn as nn
layer = nn.Linear(64, 128)
print(layer.weight.min(), layer.weight.max())

tensor(-0.1249, grad_fn=<MinBackward1>) tensor(0.1250, grad_fn=<MaxBackward1>)
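The printed range is consistent with the default initialization of `nn.Linear`, which, as I understand the PyTorch documentation, draws weights uniformly from -1/sqrt(in_features) to 1/sqrt(in_features). The sketch below checks that bound for the same layer shape; `bound` and `largest` are names introduced here.

```python
import math

import torch.nn as nn

layer = nn.Linear(64, 128)

# Expected bound under the assumed default initialization: 1 / sqrt(in_features)
bound = 1 / math.sqrt(layer.in_features)
largest = layer.weight.abs().max().item()
print(bound, largest)
```
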


**Transfer learning and fine tuning**

In [None]:
"""

In practice, machine learning engineers rarely train a model from randomly initialized weights.
Instead, they rely on a very powerful concept called transfer learning. Transfer learning consists of taking a model that was trained on a first task
and reusing it for a second task.

For example, suppose we trained a model on a large dataset of data scientist salaries in the US, and some more data became available, this time of salaries in Europe.
Instead of training a model using randomly initialized weights, we can load the weights from the first model and use them as a starting point to train on this new dataset.
Saving and loading weights in PyTorch can be done using the torch.save and torch.load functions.
These functions work on any type of PyTorch object, whether it's a single layer or a full model.


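import torch
import torch.nn as nn

layer = nn.Linear(64, 128)
torch.save(layer, 'layer.pth')
# Recent PyTorch versions need weights_only=False to unpickle a full module
new_layer = torch.load('layer.pth', weights_only=False)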

"""

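A common alternative to saving the whole object is to save only the layer's `state_dict` and load it into a freshly created layer. This is a sketch of that variant, not the approach used in the text above; the temporary directory and the file name `layer_state.pth` are assumptions for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn

layer = nn.Linear(64, 128)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "layer_state.pth")
    # Save only the tensors (weight and bias), not the pickled module
    torch.save(layer.state_dict(), path)
    state = torch.load(path)

# Load the saved tensors into a new layer with the same shape
new_layer = nn.Linear(64, 128)
new_layer.load_state_dict(state)
weights_match = torch.equal(layer.weight, new_layer.weight)
print(weights_match)
```
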
In [None]:
"""

Sometimes, the second task is similar to the first task and we want to perform a specific type of transfer learning, called fine-tuning.
In this case, we load weights from a previously trained model, but train the model with a smaller learning rate. We can even train part of a network,
if we decide some of the network layers do not need to be trained and choose to freeze them.

A rule of thumb is to freeze early layers of the network and fine-tune layers closer to the output layer.
This can be achieved by setting each parameter's requires_grad attribute to False. Here, we use the model's named_parameters() method,
which returns the name and the parameter itself. We set requires_grad of the first layer's weight to False.


import torch.nn as nn
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.Linear(128, 256)
)

for name, param in model.named_parameters():
  if name == '0.weight':
    param.requires_grad = False

"""

In [None]:
"""

You are about to fine-tune a model on a new task after loading pre-trained weights. The model contains three linear layers.
However, because your dataset is small, you only want to train the last linear layer of this model and freeze the first two linear layers.

The model has already been created and exists under the variable model. You will be using the named_parameters method of the model to list the parameters of the model.
Each parameter is described by a name. This name is a string with the following naming convention: x.name where x is the index of the layer.

Remember that a linear layer has two parameters: the weight and the bias.

"""




"""

Use an if statement to determine if the parameter should be frozen or not based on its name.
Freeze the parameters of the first two layers of this model.

"""

for name, param in model.named_parameters():

    # Check if the parameters belong to the first layer
    if name == '0.weight' or name == '0.bias':

        # Freeze the parameters
        param.requires_grad = False

    # Check if the parameters belong to the second layer
    if name == '1.weight' or name == '1.bias':

        # Freeze the parameters
        param.requires_grad = False
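To confirm that freezing worked, one option is to list the parameters that still require gradients and hand only those to the optimizer. The sketch below uses a hypothetical three-layer stand-in for the pre-trained model, matches layer indices with `str.startswith`, and picks a learning rate of 0.001 purely for illustration.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for the pre-trained three-layer model
model = nn.Sequential(nn.Linear(16, 8), nn.Linear(8, 4), nn.Linear(4, 2))

# Freeze every parameter whose name starts with layer index 0 or 1
for name, param in model.named_parameters():
    if name.startswith(("0.", "1.")):
        param.requires_grad = False

# Only the last layer's weight and bias should remain trainable
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Pass only the trainable parameters to the optimizer (lr is an assumed value)
optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.001)
```
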


In [None]:
"""

For each layer (layer0 and layer1), use the uniform initialization method to initialize the weights.

"""

layer0 = nn.Linear(16, 32)
layer1 = nn.Linear(32, 64)

# Use uniform initialization for layer0 and layer1 weights
nn.init.uniform_(layer0.weight)
nn.init.uniform_(layer1.weight)

model = nn.Sequential(layer0, layer1)
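Called without extra arguments, `nn.init.uniform_` samples from the interval between 0 and 1, which may be larger than the "small values" discussed earlier. This sketch compares the default range with a narrower one set through the `a` and `b` arguments; the bounds of -0.1 and 0.1 are hypothetical and chosen only for illustration.

```python
import torch.nn as nn

layer0 = nn.Linear(16, 32)

# Default uniform initialization: values between 0 and 1
nn.init.uniform_(layer0.weight)
default_range = (layer0.weight.min().item(), layer0.weight.max().item())

# Narrower range via the a (lower) and b (upper) arguments
nn.init.uniform_(layer0.weight, a=-0.1, b=0.1)
narrow_range = (layer0.weight.min().item(), layer0.weight.max().item())

print(default_range)
print(narrow_range)
```
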