# PyTorch and object-oriented programming

**PyTorch Dataset**

In [None]:
"""

To train a model, we need to build a PyTorch Dataset, set up a DataLoader, and define the model.

We start with the __init__ method, which reads a CSV file into a pandas DataFrame and stores it in the data attribute as a NumPy array.
Calling super().__init__() ensures our WaterDataset class is initialized like its parent class, torch.utils.data.Dataset.

Next, PyTorch requires us to implement the __len__ method, which returns the total number of samples in the dataset. We read it from the 0th element of the array's shape.

Finally, we add the __getitem__ method, which takes one argument called idx, the index of a sample, and returns the features (all columns but the last one)
and the label (the final column) for that sample.

"""


import pandas as pd
from torch.utils.data import Dataset

class WaterDataset(Dataset):
    def __init__(self, csv_path):
        super().__init__()
        df = pd.read_csv(csv_path)
        # Cast to float32 so the features match the default dtype of nn.Linear weights
        self.data = df.to_numpy(dtype="float32")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        features = self.data[idx, :-1]
        label = self.data[idx, -1]
        return features, label
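
As a minimal sketch of how these two methods are used, the example below builds a hypothetical in-memory stand-in, `ArrayDataset`, with the same `__len__` and `__getitem__` logic as `WaterDataset`. The class name and the random toy data are assumptions made for illustration, so that no CSV file is needed.

```python
import numpy as np
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical in-memory stand-in for WaterDataset (no CSV needed)."""

    def __init__(self, array):
        super().__init__()
        self.data = array

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx, :-1], self.data[idx, -1]


# 5 made-up samples: 9 feature columns plus 1 label column
rng = np.random.default_rng(0)
toy = rng.random((5, 10)).astype(np.float32)
toy_dataset = ArrayDataset(toy)

n_samples = len(toy_dataset)                   # calls __len__
first_features, first_label = toy_dataset[0]   # calls __getitem__(0)
print(n_samples, first_features.shape)
```

Python's built-in `len()` and square-bracket indexing dispatch to `__len__` and `__getitem__`, which is how a DataLoader can work with any class that implements them.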

**PyTorch DataLoader**

In [None]:
"""

With the WaterDataset class defined, we create an instance of the Dataset, passing it the training data file path.

Then, we pass the Dataset to the PyTorch DataLoader, setting the batch size to two and shuffling the training samples randomly.


We call next(iter(dataloader_train)) to get one batch from the DataLoader. With a batch size of two, we get two samples,
each consisting of nine features and a target label.

"""

dataset_train = WaterDataset("water_train.csv")

from torch.utils.data import DataLoader

dataloader_train = DataLoader(
    dataset_train,
    batch_size=2,
    shuffle=True,
)

features, labels = next(iter(dataloader_train))
print(f"Features: {features},\nLabels: {labels}")
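
To make the batch shapes concrete without needing `water_train.csv`, here is a small sketch that swaps in a `TensorDataset` of random values. The toy tensors and the `toy_` names are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# 6 made-up samples with 9 features each, plus binary labels
toy_features = torch.rand(6, 9)
toy_labels = torch.randint(0, 2, (6,)).float()
toy_dataset = TensorDataset(toy_features, toy_labels)

toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=True)

batch_features, batch_labels = next(iter(toy_loader))
print(batch_features.shape)   # torch.Size([2, 9])
print(batch_labels.shape)     # torch.Size([2])

n_batches = len(toy_loader)   # 6 samples / batch size 2 = 3 batches
```

Note that the labels arrive as a 1-D tensor of shape `(batch_size,)`, which is why they are reshaped before computing the loss later on.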

**PyTorch Model**

In [None]:
"""

PyTorch models are also best defined as classes. We may have seen models defined with nn.Sequential before. That's fine for small models,
but using classes gives us more flexibility to customize as complexity grows.

We can rewrite such a model using OOP. The Net class is based on nn.Module, PyTorch's base class for neural networks.
We define the model layers we want to use in the __init__ method.

The forward method describes what happens to the input when it is passed to the model.
Here, we pass it through the layers that we defined in the __init__ method and wrap each layer's output in an activation function.

"""

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = nn.functional.sigmoid(self.fc3(x))
        return x

net = Net()
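
For comparison, the sketch below shows what an equivalent model could look like with `nn.Sequential`. It assumes the same layer sizes and activations as the `Net` class; the names `sequential_net` and `dummy_batch` are made up for illustration.

```python
import torch
import torch.nn as nn

# Same architecture as Net, expressed as a flat sequence of layers
sequential_net = nn.Sequential(
    nn.Linear(9, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

dummy_batch = torch.rand(4, 9)        # 4 made-up samples with 9 features
output = sequential_net(dummy_batch)
print(output.shape)                   # torch.Size([4, 1])
```

The sequential form is compact, but anything beyond a straight chain of layers (for example branching or reusing a layer) needs a custom `forward` method, which is where the class-based definition helps.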

# Optimizers, training, and evaluation

**Training loop**

In [None]:
import torch.nn as nn
import torch.optim as optim

criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

for epoch in range(1000):
    for features, labels in dataloader_train:
        optimizer.zero_grad()
        outputs = net(features)
        # Reshape the labels with the view method to match the shape of the outputs
        loss = criterion(outputs, labels.view(-1, 1))
        loss.backward()
        optimizer.step()
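
The label reshaping step is easy to overlook, so here is a small sketch of it in isolation. `nn.BCELoss` expects the target to have the same shape as the model output; the tensor values below are made up for illustration.

```python
import torch
import torch.nn as nn

toy_criterion = nn.BCELoss()

toy_outputs = torch.tensor([[0.8], [0.3]])   # model outputs: shape (batch_size, 1)
toy_labels = torch.tensor([1.0, 0.0])        # labels from a DataLoader: shape (batch_size,)

# -1 lets PyTorch infer the batch dimension; the result has shape (batch_size, 1)
reshaped_labels = toy_labels.view(-1, 1)
print(toy_outputs.shape, toy_labels.shape, reshaped_labels.shape)

toy_loss = toy_criterion(toy_outputs, reshaped_labels)
print(toy_loss.item())
```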

**Stochastic Gradient Descent (SGD)**

In [None]:
"""

In Stochastic Gradient Descent, or SGD, each parameter update is the gradient scaled by the learning rate, a single predefined hyperparameter shared by all parameters.
SGD is computationally efficient, but because of its simplicity, it is often outperformed in practice by optimizers that adapt the learning rate.

"""

optimizer = optim.SGD(net.parameters(), lr=0.01)
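
As a sketch of what a single plain SGD step does, the example below compares `optim.SGD` (without momentum) against the hand-computed update `w - lr * grad`. The two-element parameter `w` and the quadratic loss are assumptions chosen so the gradient is easy to verify.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
toy_optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()        # gradient of this loss is 2 * w
loss.backward()

# Hand-computed update, evaluated before the optimizer changes w
expected = w.detach() - 0.1 * w.grad

toy_optimizer.step()
print(w.detach())            # tensor([ 0.8000, -1.6000])
```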

**Adaptive Gradient (Adagrad)**

In [None]:
"""

Using the same learning rate for every parameter is rarely optimal. Adaptive Gradient, or Adagrad, improves on it
by decreasing the learning rate during training for parameters that are frequently updated, while infrequently updated parameters keep a larger learning rate.

This makes it well-suited for sparse data, that is, data in which some features are not often observed.
However, Adagrad tends to decrease the learning rate too fast.

"""

optimizer = optim.Adagrad(net.parameters(), lr=0.01)
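
The per-parameter scaling can be sketched in plain Python. This is a simplified model of the Adagrad idea (the learning rate divided by the square root of the accumulated squared gradients), not the library's implementation; the two named parameters and their gradient patterns are made up.

```python
import math

lr, eps = 0.01, 1e-10

# Running sum of squared gradients for two hypothetical parameters
grad_sq_sum = {"frequent": 0.0, "rare": 0.0}

for step in range(100):
    grads = {
        "frequent": 1.0,                            # gradient on every step
        "rare": 1.0 if step % 10 == 0 else 0.0,     # gradient on every 10th step
    }
    for name, g in grads.items():
        grad_sq_sum[name] += g ** 2

effective_lr = {
    name: lr / (math.sqrt(total) + eps) for name, total in grad_sq_sum.items()
}
print(effective_lr)
```

After 100 steps the rarely updated parameter still has a larger effective learning rate than the frequently updated one, which is the behaviour that suits sparse features.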

**Root Mean Square Propagation (RMSprop)**

In [None]:
"""

Root Mean Square Propagation, or RMSprop, addresses Adagrad's aggressive learning rate decay by adapting the learning rate for each parameter
based on a decaying average of the size of its recent gradients, rather than on the sum over all past gradients.

"""

optimizer = optim.RMSprop(net.parameters(), lr=0.01)
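
To illustrate the difference from Adagrad, the following plain-Python sketch feeds a constant gradient to both accumulators. It is a simplified model rather than the library code; `alpha = 0.99` is used as the decay factor, which matches the default of `optim.RMSprop`.

```python
import math

lr, alpha, eps = 0.01, 0.99, 1e-8

adagrad_sum = 0.0    # Adagrad: ever-growing sum of squared gradients
rmsprop_avg = 0.0    # RMSprop: exponentially decaying average of squared gradients

for step in range(1000):
    g = 1.0          # made-up constant gradient
    adagrad_sum += g ** 2
    rmsprop_avg = alpha * rmsprop_avg + (1 - alpha) * g ** 2

adagrad_lr = lr / (math.sqrt(adagrad_sum) + eps)
rmsprop_lr = lr / (math.sqrt(rmsprop_avg) + eps)
print(f"Adagrad effective lr: {adagrad_lr:.6f}")
print(f"RMSprop effective lr: {rmsprop_lr:.6f}")
```

The Adagrad sum keeps growing, so its effective learning rate keeps shrinking, while the RMSprop average levels off and the effective learning rate stays close to `lr`.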

**Adaptive Moment Estimation (Adam)**

In [None]:
"""

Finally, Adaptive Moment Estimation, or Adam, is arguably the most versatile and widely used optimizer.
It combines RMSprop with the concept of momentum: an average of past gradients in which the most recent gradients have more weight.

Basing the update on both gradient size and momentum helps accelerate training, which is why Adam is often the go-to optimizer.

"""

optimizer = optim.Adam(net.parameters(), lr=0.01)
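
The sketch below works through the first Adam step by hand and compares it with `optim.Adam`. It assumes the commonly documented update (bias-corrected first and second moment estimates) with the default `betas=(0.9, 0.999)` and `eps=1e-8`; the parameter `w` and the quadratic loss are made up.

```python
import torch
import torch.optim as optim

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

w = torch.tensor([1.0, -2.0], requires_grad=True)
toy_optimizer = optim.Adam([w], lr=lr, betas=(beta1, beta2), eps=eps)

loss = (w ** 2).sum()
loss.backward()
g = w.grad.clone()

# First step by hand: both moment estimates start at zero
m = (1 - beta1) * g                 # momentum: average of past gradients
v = (1 - beta2) * g ** 2            # RMSprop-style average of squared gradients
m_hat = m / (1 - beta1 ** 1)        # bias correction for step 1
v_hat = v / (1 - beta2 ** 1)
manual_w = w.detach() - lr * m_hat / (v_hat.sqrt() + eps)

toy_optimizer.step()
print(w.detach(), manual_w)
```

On the first step the update has magnitude close to `lr` for every parameter regardless of gradient size, which shows how the squared-gradient term normalizes the step.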

**Model evaluation**

In [None]:
"""

Once the model is trained, we can evaluate its performance on test data.

"""

import torch
from torchmetrics import Accuracy

# dataloader_test is assumed to be built from the test data the same way as dataloader_train
acc = Accuracy(task="binary")
net.eval()

with torch.no_grad():
    for features, labels in dataloader_test:
        outputs = net(features)
        preds = (outputs >= 0.5).float()
        acc(preds, labels.view(-1, 1))

accuracy = acc.compute()
print(f"Accuracy: {accuracy}")
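
To show what the thresholding and the accuracy metric compute, here is a hand-rolled sketch using plain tensor operations instead of torchmetrics. The output and label values are made up.

```python
import torch

toy_outputs = torch.tensor([[0.9], [0.2], [0.6], [0.4]])   # made-up sigmoid outputs
toy_labels = torch.tensor([1.0, 0.0, 0.0, 0.0])            # made-up ground truth

# Probabilities of at least 0.5 are mapped to class 1, the rest to class 0
toy_preds = (toy_outputs >= 0.5).float()

# Fraction of predictions that match the labels
correct = (toy_preds == toy_labels.view(-1, 1)).float()
manual_accuracy = correct.mean().item()
print(f"Accuracy: {manual_accuracy}")   # 3 of 4 correct -> 0.75
```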

# Vanishing and exploding gradients

**Vanishing Gradients**

In [None]:
"""

Neural networks often suffer from gradient instability during training. Sometimes, the gradients get progressively smaller as they propagate backward through the layers.
This is known as vanishing gradients. As a result, earlier layers receive hardly any parameter updates and the model doesn't learn.

"""

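The effect can be observed on a small, hypothetical network. The sketch below stacks ten linear layers with sigmoid activations (an architecture chosen only to provoke the problem) and compares the gradient norm at the first and last linear layers after one backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical deep network: 10 small linear layers, each followed by a sigmoid
depth = 10
layers = []
for _ in range(depth):
    layers += [nn.Linear(16, 16), nn.Sigmoid()]
deep_net = nn.Sequential(*layers)

# One backward pass on random data with an arbitrary scalar loss
inputs = torch.rand(32, 16)
loss = deep_net(inputs).sum()
loss.backward()

first_linear = deep_net[0]
last_linear = deep_net[2 * (depth - 1)]   # index of the final nn.Linear
first_layer_grad = first_linear.weight.grad.norm().item()
last_layer_grad = last_linear.weight.grad.norm().item()
print(f"First layer gradient norm: {first_layer_grad:.2e}")
print(f"Last layer gradient norm:  {last_layer_grad:.2e}")
```

The gradient reaching the first layer is much smaller than the one at the last layer, so the earliest weights barely move.
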
**Exploding Gradients**

In [None]:
"""

In other cases, the gradients get increasingly large, leading to huge parameter updates and divergent training. This is known as exploding gradients.

"""

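A crude way to see why depth matters is to assume that every layer scales the gradient by a constant factor during the backward pass. This is a simplifying assumption for illustration, not how real networks behave exactly.

```python
depth = 50   # hypothetical number of layers

# Each layer multiplies the gradient by a factor below 1: it vanishes
shrinking = 0.5 ** depth

# Each layer multiplies the gradient by a factor above 1: it explodes
growing = 1.5 ** depth

print(f"Factor 0.5 over {depth} layers: {shrinking:.2e}")
print(f"Factor 1.5 over {depth} layers: {growing:.2e}")
```

Even modest per-layer factors compound into extremely small or extremely large gradients once the network is deep enough.
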
**Solution to unstable gradients**

In [None]:
"""

To address these problems, we can use a three-step solution consisting of proper weight initialization, good activations, and batch normalization.

"""

**Weights Initialization**

In [None]:
"""
Good initialization ensures that:
    The variance of a layer's inputs equals the variance of its outputs
    The variance of the gradients is the same before and after a layer

The way to achieve this is different for each activation function. For ReLU, or Rectified Linear Unit, and similar activations, we can use He initialization,
also known as Kaiming initialization.

"""

layer = nn.Linear(8, 1)
print(layer.weight)



import torch.nn.init as init
init.kaiming_uniform_(layer.weight)
print(layer.weight)
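
As a sketch of what `kaiming_uniform_` does with its default arguments, the example below recomputes the sampling bound by hand. With the defaults the gain is `sqrt(2)`, and weights are drawn uniformly from `[-bound, bound]` where `bound = gain * sqrt(3 / fan_in)`. The name `toy_layer` is made up so the example does not overwrite `layer`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.init as init

torch.manual_seed(0)

toy_layer = nn.Linear(8, 1)
init.kaiming_uniform_(toy_layer.weight)   # defaults: mode="fan_in", gain of sqrt(2)

fan_in = toy_layer.weight.shape[1]        # 8 inputs feed each output unit
gain = math.sqrt(2.0)
bound = gain * math.sqrt(3.0 / fan_in)

max_abs_weight = toy_layer.weight.abs().max().item()
print(f"Bound: {bound:.4f}, largest weight magnitude: {max_abs_weight:.4f}")
```

The bound shrinks as `fan_in` grows, which keeps the variance of a layer's outputs from scaling with the number of inputs.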

**He / Kaiming initialization**

In [None]:
import torch.nn as nn
import torch.nn.init as init
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

        init.kaiming_uniform_(self.fc1.weight)
        init.kaiming_uniform_(self.fc2.weight)
        init.kaiming_uniform_(
            self.fc3.weight,
            nonlinearity="sigmoid",
        )
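
The `nonlinearity` argument changes the gain used to scale the initial weights. A short sketch with `init.calculate_gain` shows the difference for the last layer, which has 8 inputs; the variable names are made up.

```python
import math
import torch.nn.init as init

relu_gain = init.calculate_gain("relu")         # sqrt(2)
sigmoid_gain = init.calculate_gain("sigmoid")   # 1

fan_in = 8   # inputs to fc3
relu_bound = relu_gain * math.sqrt(3.0 / fan_in)
sigmoid_bound = sigmoid_gain * math.sqrt(3.0 / fan_in)
print(f"ReLU bound: {relu_bound:.4f}, sigmoid bound: {sigmoid_bound:.4f}")
```

Because the sigmoid gain is smaller, the weights of the final layer are drawn from a narrower range than those of the ReLU layers.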

**Activation functions**

In [None]:
"""

The ReLU, or Rectified Linear Unit, is arguably the most commonly used activation. It's available as nn.functional.relu.
It has several advantages, but also an important drawback. It suffers from the dying neuron problem: during training, some neurons only output zero.
This happens because ReLU, and therefore its gradient, is zero for any negative input. If the inputs to a neuron's activation stay negative, it stops receiving updates and effectively dies.


The ELU, or Exponential Linear Unit, is one activation designed to improve upon ReLU. It's available as nn.functional.elu.
Thanks to non-zero gradients for negative values, it doesn't suffer from the dying neuron problem. Additionally, its average output is near zero,
so it's less prone to vanishing gradients.

"""

**Batch Normalization**

In [None]:
"""

After a layer:
    1. Normalize the layer's outputs across the batch by:
           subtracting the mean
           dividing by the standard deviation
    2. Scale and shift the normalized outputs using learned parameters

"""

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.bn1 = nn.BatchNorm1d(16)
        ...

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = nn.functional.elu(x)
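
The two batch normalization steps can be reproduced by hand. The sketch below compares `nn.BatchNorm1d` in training mode with a manual computation on a made-up batch; it relies on the layer's learned scale and shift starting at 1 and 0, and on a small `eps` being added to the variance for numerical stability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.rand(4, 16)      # 4 made-up samples with 16 features

bn = nn.BatchNorm1d(16)
bn.train()                     # training mode: normalize with the batch statistics
bn_output = bn(batch)

# Step 1: normalize each feature across the batch
mean = batch.mean(dim=0)
var = batch.var(dim=0, unbiased=False)
normalized = (batch - mean) / torch.sqrt(var + bn.eps)

# Step 2: scale and shift with the learned parameters
manual_output = normalized * bn.weight + bn.bias

print(torch.allclose(bn_output, manual_output, atol=1e-5))
```

In evaluation mode (`bn.eval()`), the layer uses running estimates of the mean and variance instead of the statistics of the current batch.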

In [None]:
"""

Call the He (Kaiming) initializer on the weight attribute of the second layer, fc2, similarly to how it's done for fc1.
Call the He (Kaiming) initializer on the weight attribute of the third layer, fc3, accounting for the different activation function used in the final layer.
Update the activation functions in the forward() method from relu to elu.

"""

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

        # Apply He initialization
        init.kaiming_uniform_(self.fc1.weight)
        init.kaiming_uniform_(self.fc2.weight)
        init.kaiming_uniform_(
            self.fc3.weight,
            nonlinearity="sigmoid",
        )

    def forward(self, x):
        # Update ReLU activation to ELU
        x = nn.functional.elu(self.fc1(x))
        x = nn.functional.elu(self.fc2(x))
        x = nn.functional.sigmoid(self.fc3(x))
        return x
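
To see why swapping ReLU for ELU helps with dying neurons, this small sketch compares the gradient of each activation at a negative input. The input value of -1.0 is arbitrary.

```python
import torch
import torch.nn.functional as F

x_relu = torch.tensor(-1.0, requires_grad=True)
F.relu(x_relu).backward()
relu_grad = x_relu.grad.item()    # ReLU is flat for negative inputs

x_elu = torch.tensor(-1.0, requires_grad=True)
F.elu(x_elu).backward()
elu_grad = x_elu.grad.item()      # ELU with alpha=1 has slope exp(x) for x < 0

print(f"ReLU gradient at -1: {relu_grad}")
print(f"ELU gradient at -1:  {elu_grad:.4f}")
```

With ReLU no gradient flows back through a negative pre-activation, while ELU still passes a non-zero signal, so the neuron can recover.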

In [None]:
"""

Add two BatchNorm1d layers, assigning them to self.bn1 and self.bn2.

In the forward() method, pass x through the second set of layers (the linear layer, the batch norm layer, and the activation), similarly to how it's done for the first set of layers.


"""

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(9, 16)
        # Add two batch normalization layers
        self.bn1 = nn.BatchNorm1d(16)
        self.fc2 = nn.Linear(16, 8)
        self.bn2 = nn.BatchNorm1d(8)
        self.fc3 = nn.Linear(8, 1)

        init.kaiming_uniform_(self.fc1.weight)
        init.kaiming_uniform_(self.fc2.weight)
        init.kaiming_uniform_(self.fc3.weight, nonlinearity="sigmoid")