<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_3_dropout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks

**Module 4: Training for Tabular Data**

- Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
- For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).


# Module 4 Material

- Part 4.1: Using K-Fold Cross-validation with PyTorch [[Video]](https://www.youtube.com/watch?v=Q8ZQNvZwsNE&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_1_kfold.ipynb)
- Part 4.2: Training Schedules for PyTorch  [[Video]](https://www.youtube.com/watch?v=lMMlbmfvKDQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_2_schedule.ipynb)
- **Part 4.3: Dropout Regularization** [[Video]](https://www.youtube.com/watch?v=4ixjgw6Q42U&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_3_dropout.ipynb)
- Part 4.4: Batch Normalization [[Video]](https://www.youtube.com/watch?v=1U5nOKh9OLQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_4_batch_norm.ipynb)
- Part 4.5: RAPIDS for Tabular Data [[Video]](https://www.youtube.com/watch?v=KgoXuhG_kfs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_5_rapids.ipynb)


# Google CoLab Instructions

The following code detects whether Google CoLab is running and initializes the PyTorch device to either GPU/MPS (if available) or CPU. It also defines the `EarlyStopping` class from module 3.4, which the training loop later in this part relies on.


In [1]:
import copy
import torch

try:
    import google.colab

    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")


# Early stopping (see module 3.4)
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

Note: not using Google CoLab
Using device: mps


# Part 4.3: Dropout Regularization

When building deep learning models, we frequently encounter overfitting, where a model performs exceptionally well on training data but fails to generalize to unseen data. Regularization techniques address this pitfall by striking a balance between bias (underfitting) and variance (overfitting). Many regularization methods work by adding a penalty on the complexity of the model, which discourages the model from fitting noise in the training data and so improves its ability to generalize.

One such powerful regularization technique is 'Dropout', a concept that metaphorically 'drops out' or temporarily 'turns off' a fraction of neurons in the model during training, thereby reducing the interdependencies of neurons. The randomness introduced by dropping out neurons compels the network to learn more robust features, leading to a more generalized and less overfit model.

Dropout operates differently from most other regularization techniques. Instead of adding a penalty to the loss function, it randomly disables a fraction of neurons (defined by a probability parameter, typically ranging from 0.2 to 0.5), effectively creating a different architecture of the network for each training instance. This can be seen as training an ensemble of networks, which results in a more robust and generalizable final model.
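The mechanism described above can be sketched in a few lines. This is a minimal illustration of "inverted" dropout, not the library's internal implementation: the helper name `manual_dropout` is hypothetical, and the seeded generator is only there to keep the demonstration repeatable.

```python
import torch


def manual_dropout(x: torch.Tensor, p: float, generator: torch.Generator) -> torch.Tensor:
    """Hypothetical helper: zero each element with probability p, then
    scale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # keep_mask is 1.0 where an element survives and 0.0 where it is dropped.
    keep_mask = (torch.rand(x.shape, generator=generator) >= p).float()
    return x * keep_mask / (1.0 - p)


gen = torch.Generator().manual_seed(0)
activations = torch.ones(10_000)
dropped = manual_dropout(activations, p=0.3, generator=gen)

zero_fraction = (dropped == 0).float().mean().item()
print(f"Fraction zeroed: {zero_fraction:.3f}")             # close to p
print(f"Mean after dropout: {dropped.mean().item():.3f}")  # close to the input mean
```

Because a fresh mask is drawn on every call, each forward pass effectively trains a different thinned sub-network, which is where the ensemble interpretation comes from.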

In this section, we will look at how dropout works, where it came from, and how to apply it in PyTorch. Understanding and correctly implementing dropout is a valuable tool in the deep learning practitioner's toolkit, aiding the creation of robust and generalized models.

As we go through the section, you will see how dropout fits into the broader family of regularization methods, learn about its benefits and limitations, and see how to use it effectively in your own models.

Dropout is a regularization technique introduced by Geoffrey Hinton, a pioneer in the field of deep learning, together with his co-authors Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, in the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," which was published in 2014.

The development of dropout came from the recognition of the challenges with overfitting in large neural networks. Large neural networks with millions of parameters are prone to overfitting because of their capacity to memorize training data. This is especially true when the amount of training data available is small relative to the complexity of the network.

Geoffrey Hinton, who is often referred to as the "godfather of deep learning," has made several seminal contributions to the field of artificial intelligence, with dropout being just one of them. He and his team were looking for simple and effective ways to make neural networks more robust and to improve their generalization capabilities. Inspired by biological systems, they proposed the idea of dropout, which involves randomly "dropping out" or deactivating a subset of neurons during the training process to prevent them from co-adapting too much to the data.

This simple yet effective technique has since been widely adopted in the deep learning community and has formed the basis of much subsequent research and development. Dropout has proven to be a powerful tool in the training of neural networks, helping to mitigate overfitting and improve model generalization, particularly in scenarios with limited training data.

We will begin by modifying the previous example to use dropout. We will preprocess the data in the same way as before.


In [2]:
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job",dtype=int)],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area",dtype=int)],axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df,pd.get_dummies(df['product'],prefix="product",dtype=int)],axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

To add dropout to the existing code, we use the **nn.Dropout** module from PyTorch. During training, it randomly zeroes elements of the input tensor with probability p (the constructor argument) and scales the remaining elements by 1/(1-p), which can help prevent overfitting. In evaluation mode it passes its input through unchanged. The modified model is shown below with the dropout layers added.

```
# Create the model with dropout (p=0.1, matching the complete code)
model = nn.Sequential(
    nn.Linear(x.shape[1], 20),
    nn.Dropout(0.1),
    nn.ReLU(),
    nn.Linear(20, 10),
    nn.Dropout(0.1),
    nn.ReLU(),
    nn.Linear(10, 1)
)
```
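Before running the full example, it may help to see how a single `nn.Dropout` layer behaves in training versus evaluation mode. The following is a small standalone sketch; the names `demo_drop` and `sample` are illustrative only and are not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_drop = nn.Dropout(p=0.5)
sample = torch.ones(8)

demo_drop.train()              # training mode: dropout is active
train_out = demo_drop(sample)  # each element is either 0 or 1 / (1 - 0.5) = 2

demo_drop.eval()               # evaluation mode: dropout is a no-op
eval_out = demo_drop(sample)

print("train:", train_out)
print("eval: ", eval_out)
```

This is why the training loop calls `model.train()` before each epoch and `model.eval()` before validation: without the switch, validation predictions would be randomly perturbed by dropout.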

The complete changes can be seen here.

In [3]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Convert to PyTorch Tensors
x_columns = df.columns.drop(["age", "id"])
x = torch.tensor(df[x_columns].values, dtype=torch.float32, device=device)
y = torch.tensor(df["age"].values, dtype=torch.float32, device=device).view(-1, 1)

# Set random seed for reproducibility
torch.manual_seed(42)

# Cross-Validate
kf = KFold(n_splits=5, shuffle=True, random_state=42)

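# Early stopping parameters
# Note: this variable is not passed to EarlyStopping() below, which
# therefore uses its default patience of 5.
patience = 10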

fold = 0
for train_idx, test_idx in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")

    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # PyTorch DataLoader
    train_dataset = TensorDataset(x_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    # Create the model with dropout
    model = nn.Sequential(
        nn.Linear(x.shape[1], 20),
        nn.Dropout(0.1),
        nn.ReLU(),
        nn.Linear(20, 10),
        nn.Dropout(0.1),
        nn.ReLU(),
        nn.Linear(10, 1),
    )
    model = torch.compile(model,backend="aot_eager").to(device)

    # Create the optimizer
    optimizer = optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()

    # Early Stopping variables
    best_loss = float("inf")
    early_stopping_counter = 0

    # Training loop
    EPOCHS = 500
    epoch = 0
    done = False
    es = EarlyStopping()

    while not done and epoch < EPOCHS:
        epoch += 1
        model.train()
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(x_batch)
            loss = loss_fn(output, y_batch)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        with torch.no_grad():
            val_output = model(x_test)
            val_loss = loss_fn(val_output, y_test)

        if es(model, val_loss):
            done = True

    print(
        f"Epoch {epoch}/{EPOCHS}, Validation Loss: " f"{val_loss.item()}, {es.status}"
    )

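# Final evaluation. This block sits outside the fold loop, so it runs once
# and scores only the model from the last fold.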
model.eval()
with torch.no_grad():
    oos_pred = model(x_test)
score = torch.sqrt(loss_fn(oos_pred, y_test)).item()
print(f"Fold score (RMSE): {score}")

Fold #1
Epoch 69/500, Validation Loss: 8.185691833496094, Early stopping triggered after 5 epochs.
Fold #2
Epoch 32/500, Validation Loss: 15.895475387573242, Early stopping triggered after 5 epochs.
Fold #3
Epoch 70/500, Validation Loss: 11.07032585144043, Early stopping triggered after 5 epochs.
Fold #4
Epoch 72/500, Validation Loss: 5.993424415588379, Early stopping triggered after 5 epochs.
Fold #5
Epoch 36/500, Validation Loss: 15.81657886505127, Early stopping triggered after 5 epochs.
Fold score (RMSE): 3.933539867401123


The changes to the code are straightforward: we simply add Dropout layers to the model's architecture. Here, we've added two Dropout layers, one after the first linear layer and another after the second. With p=0.1, each dropout layer randomly sets 10% of its input elements to zero during training. Dropout layers are typically added after non-linear activation functions. In this code they sit just before each ReLU instead, which gives the same result: dropout only zeroes an element or multiplies it by a positive constant, and ReLU commutes with both operations.
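The model above applies each Dropout layer before its ReLU. A quick way to check whether that ordering matters for ReLU is to run both orderings with the same random seed, so that the same dropout mask is drawn each time. This is a small sketch under that assumption (CPU tensors, identical shapes, seed reset before each call); the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sample = torch.randn(4, 20)

drop = nn.Dropout(p=0.1)
relu = nn.ReLU()
drop.train()  # dropout must be active for the comparison to be meaningful

torch.manual_seed(123)
dropout_then_relu = relu(drop(sample))

torch.manual_seed(123)  # same seed, so the same mask is drawn
relu_then_dropout = drop(relu(sample))

print(torch.allclose(dropout_then_relu, relu_then_dropout))
```

If the two orderings agree, the script prints `True`. The equivalence relies on ReLU specifically; for activations that do not map zero to zero or are not positively homogeneous, the two orderings can differ.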

Dropout is a regularization technique that helps prevent overfitting by reducing interdependent learning among the neurons, encouraging each neuron to be useful on its own. During training, dropout randomly disables some neurons, which forces the network to propagate information along different paths. As a result, the network generalizes better and is less likely to overfit the training data.