<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_1_kfold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks

**Module 4: Training for Tabular Data**

- Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
- For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).


# Module 4 Material

- **Part 4.1: Using K-Fold Cross-validation with PyTorch** [[Video]](https://www.youtube.com/watch?v=Q8ZQNvZwsNE&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_1_kfold.ipynb)
- Part 4.2: Training Schedules for PyTorch  [[Video]](https://www.youtube.com/watch?v=lMMlbmfvKDQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_2_schedule.ipynb)
- Part 4.3: Dropout Regularization [[Video]](https://www.youtube.com/watch?v=4ixjgw6Q42U&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_3_dropout.ipynb)
- Part 4.4: Batch Normalization [[Video]](https://www.youtube.com/watch?v=1U5nOKh9OLQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_4_batch_norm.ipynb)
- Part 4.5: RAPIDS for Tabular Data [[Video]](https://www.youtube.com/watch?v=KgoXuhG_kfs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_04_5_rapids.ipynb)


# Google CoLab Instructions

The following code ensures that Google CoLab is running and maps Google Drive if needed. We also initialize the PyTorch device to either GPU/MPS (if available) or CPU.


In [1]:
import copy
import torch

try:
    import google.colab

    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")


# Early stopping (see module 3.4)
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

Note: not using Google CoLab
Using device: mps


# Part 4.1: Using K-Fold Cross-validation with PyTorch

You can use cross-validation for a variety of purposes in predictive modeling:

- Generating out-of-sample predictions from a neural network
- Estimating a good number of epochs to train a neural network for (early stopping)
- Evaluating the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses several folds and multiple models to provide each data segment a chance to serve as both the validation and training set. Figure 5.CROSS shows cross-validation.

**Figure 5.CROSS: K-Fold Crossvalidation**
![K-Fold Crossvalidation](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_1_kfold.png "K-Fold Crossvalidation")

It is important to note that each fold will have one model (neural network). To generate predictions for new data (not present in the training set), predictions from the fold models can be handled in several ways:

- Choose the model with the highest validation score as the final model.
- Present new data to the five models (one for each fold) and average the result (this is an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning)).
- Retrain a new model (using the same settings as the cross-validation) on the entire dataset. Train for the number of epochs that the cross-validation suggested and with the same hidden layer structure.

Generally, I prefer the last approach and will retrain a model on the entire data set once I have selected hyper-parameters. Of course, I will always set aside a final holdout set for model validation that I do not use in any aspect of the training process.
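
The ensemble option in the list above can be sketched in a few lines. This is a minimal illustration, not the notebook's own method: `make_model` is a hypothetical helper, the five fold models are untrained stand-ins, and `x_new` is random placeholder data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def make_model(n_features):
    # Hypothetical helper: the same small architecture for every fold
    return nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, 1))


# Stand-ins for five trained fold models (untrained here, for illustration only)
fold_models = [make_model(3) for _ in range(5)]

# New data: 4 rows, 3 features (random placeholder values)
x_new = torch.randn(4, 3)

with torch.no_grad():
    for m in fold_models:
        m.eval()
    # Shape (5, 4, 1): one set of predictions per fold model
    all_preds = torch.stack([m(x_new) for m in fold_models])
    # Average across the model dimension to get the ensemble prediction
    ensemble_pred = all_preds.mean(dim=0)

print(ensemble_pred.shape)
```

In practice each entry of `fold_models` would be the trained network kept from one fold of the cross-validation.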

## Regression vs Classification K-Fold Cross-Validation

Regression and classification are handled somewhat differently concerning cross-validation. Regression is the simpler case where you can break up the data set into K folds with little regard for where each item lands. For regression, the data items should fall into the folds as randomly as possible. It is also important to remember that not every fold will necessarily have the same number of data items. It is not always possible for the data set to be evenly divided into K folds. For regression cross-validation, we will use the Scikit-Learn class **KFold**.
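
To illustrate the point about uneven folds, here is a small sketch that splits 11 placeholder rows into 5 folds with **KFold**. The array `data` is made up for the example; only the fold sizes matter.

```python
import numpy as np
from sklearn.model_selection import KFold

# 11 placeholder rows cannot be divided evenly into 5 folds
data = np.arange(11).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_folds = [test_idx for _, test_idx in kf.split(data)]
fold_sizes = [len(test_idx) for test_idx in test_folds]

# One fold holds 3 rows and the other four hold 2 rows each
print(fold_sizes)
```

Every row still lands in exactly one validation fold; the folds simply differ in size by at most one row.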

Cross-validation for classification could also use the **KFold** object; however, this technique would not ensure that the class balance in each fold remains the same as in the original data set. The balance of classes in the data a model sees after training should remain the same as (or similar to) the balance in its training set. Drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use. Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. Keeping the class proportions of each fold close to those of the full data set is called stratified sampling, and it is accomplished by using the Scikit-Learn object **StratifiedKFold** in place of **KFold** whenever you use classification. In summary, you should use the following two objects in Scikit-Learn:

- **KFold**: when dealing with a regression problem.
- **StratifiedKFold**: when dealing with a classification problem.
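
The difference between the two objects can be seen on a small imbalanced example. The sketch below uses made-up labels (80 rows of class 0, 20 rows of class 1) and a hypothetical helper `minority_counts` that counts the class-1 rows in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Placeholder labels: 80 rows of class 0 and 20 rows of class 1
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((len(labels), 1))  # feature values do not affect the split


def minority_counts(splitter):
    # Count the class-1 rows that fall into each validation fold
    return [
        int(labels[test_idx].sum())
        for _, test_idx in splitter.split(features, labels)
    ]


plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With **StratifiedKFold**, each validation fold of 20 rows contains 4 rows of class 1, matching the 20% share in the full data. With **KFold**, the count depends on the shuffle and can vary from fold to fold.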

The following two sections demonstrate cross-validation with regression and classification.

## Out-of-Sample Regression Predictions with K-Fold Cross-Validation

The following code trains a neural network on the simple dataset using 5-fold cross-validation. The expected performance of a neural network of the type trained here would be the score for the generated out-of-sample predictions. We begin by preparing a feature vector using the **jh-simple-dataset** to predict age. This model is set up as a regression problem.


In [2]:
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job",dtype=int)],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area",dtype=int)],axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df,pd.get_dummies(df['product'],prefix="product",dtype=int)],axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions. We allow up to 500 epochs per fold and use early stopping on each fold's validation data to end training sooner. The epoch at which each fold stops also helps us estimate a suitable epoch count.


In [3]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Convert to PyTorch Tensors
x_columns = df.columns.drop(['age', 'id'])
x = torch.tensor(df[x_columns].values, dtype=torch.float32, device=device)
y = torch.tensor(df['age'].values, dtype=torch.float32, device=device).view(-1, 1)

# Set random seed for reproducibility
torch.manual_seed(42)

# Cross-Validate
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Early stopping is handled by the EarlyStopping class defined above,
# created inside the fold loop with its default patience of 5

fold = 0
for train_idx, test_idx in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")

    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # PyTorch DataLoader
    train_dataset = TensorDataset(x_train, y_train)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    # Create the model and optimizer
    model = nn.Sequential(
        nn.Linear(x.shape[1], 20),
        nn.ReLU(),
        nn.Linear(20, 10),
        nn.ReLU(),
        nn.Linear(10, 1)
    )
    model = torch.compile(model,backend="aot_eager").to(device)


    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    # Training loop
    EPOCHS = 500
    epoch = 0
    done = False
    es = EarlyStopping()

    while not done and epoch<EPOCHS:
        epoch += 1
        model.train()
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(x_batch)
            loss = loss_fn(output, y_batch)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        with torch.no_grad():
            val_output = model(x_test)
            val_loss = loss_fn(val_output, y_test)

        if es(model, val_loss):
            done = True

    print(f"Epoch {epoch}/{EPOCHS}, Validation Loss: "
      f"{val_loss.item()}, {es.status}")

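# Final evaluation. Note: this code is outside the fold loop, so it scores
# only the last fold's model on the last fold's validation data.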
model.eval()
with torch.no_grad():
    oos_pred = model(x_test)
score = torch.sqrt(loss_fn(oos_pred, y_test)).item()
print(f"Fold score (RMSE): {score}")


Fold #1
Epoch 157/500, Validation Loss: 0.7110835909843445, Early stopping triggered after 5 epochs.
Fold #2
Epoch 149/500, Validation Loss: 0.49808964133262634, Early stopping triggered after 5 epochs.
Fold #3
Epoch 151/500, Validation Loss: 0.7314692735671997, Early stopping triggered after 5 epochs.
Fold #4
Epoch 191/500, Validation Loss: 0.4292869567871094, Early stopping triggered after 5 epochs.
Fold #5
Epoch 139/500, Validation Loss: 1.2475141286849976, Early stopping triggered after 5 epochs.
Fold score (RMSE): 1.1104767322540283


As you can see, the above code reports the epoch at which early stopping ended training in each fold. A common technique is to average these epoch counts and then train on the entire dataset for that average number of epochs.
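
As a small worked example, the average can be computed from the five stopping epochs printed in the output above. The variable names here are illustrative and not part of the notebook.

```python
# Stopping epochs copied from the five fold results printed above
stop_epochs = [157, 149, 151, 191, 139]

avg_epochs = sum(stop_epochs) / len(stop_epochs)
print(f"Average stopping epoch: {avg_epochs}")  # 157.4

# Round to a whole number of epochs for the final training run
final_epochs = round(avg_epochs)
print(f"Epochs for final training: {final_epochs}")  # 157
```

Because the **EarlyStopping** class waits for 5 epochs without improvement before it stops, the best weights in each fold were found 5 epochs before the reported count, so you may prefer a slightly smaller number.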

## Classification with Stratified K-Fold Cross-Validation

The following code trains a classifier on the **jh-simple-dataset** with cross-validation to generate out-of-sample predictions. It also collects the out-of-sample results (predictions on each fold's test set) and scores them.

It is good to perform stratified k-fold cross-validation with classification data. This technique ensures that the percentages of each class remain the same across all folds. Use the **StratifiedKFold** object instead of the **KFold** object used in the regression.


In [4]:
# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=["NA", "?"],
)

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df["job"], prefix="job",dtype=int)], axis=1)
df.drop("job", axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df["area"], prefix="area",dtype=int)], axis=1)
df.drop("area", axis=1, inplace=True)

# Missing values for income
med = df["income"].median()
df["income"] = df["income"].fillna(med)

# Standardize ranges
df["income"] = zscore(df["income"])
df["aspect"] = zscore(df["aspect"])
df["save_rate"] = zscore(df["save_rate"])
df["age"] = zscore(df["age"])
df["subscriptions"] = zscore(df["subscriptions"])

# Convert to numpy - Classification
x_columns = df.columns.drop("product").drop("id")
x = df[x_columns].values
dummies = pd.get_dummies(df["product"],dtype=int)  # Classification
products = dummies.columns
y = dummies.values

We now loop through the five folds and use the validation data in each fold for early stopping. We also keep the validated predictions to build a complete set of "out of sample" predictions. These "out of sample" predictions allow us to have predictions across the entire dataset that were not in the training data. It is important to note that this separation is not 100% pure in that the validation set was used for early stopping. This small crossover is a tradeoff that allows us to use a large amount of training data in each fold.
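
Before the full PyTorch loop, the out-of-sample bookkeeping can be sketched in isolation. This is a simplified illustration under stated assumptions: the labels are made up, and a majority-class guess stands in for a trained neural network. It stores each fold's predictions at the original row positions, so the result lines up with the data set row by row.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels: 15 rows of class 0, 10 of class 1, 5 of class 2
labels = np.repeat([0, 1, 2], [15, 10, 5])
features = np.zeros((len(labels), 1))  # feature values are not used here

oos_pred = np.full(len(labels), -1)  # -1 marks "not yet predicted"
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in skf.split(features, labels):
    # Stand-in "model": predict the most common class among the training rows
    majority = np.bincount(labels[train_idx]).argmax()
    # Store the predictions at the original positions of the held-out rows
    oos_pred[test_idx] = majority

# Every row is held out exactly once, so no placeholder value remains
print("Rows without a prediction:", int((oos_pred == -1).sum()))
print("Out-of-sample accuracy:", (oos_pred == labels).mean())
```

The notebook code that follows appends each fold's predictions to a list and concatenates them instead; both approaches cover every row once, but only the indexed version keeps the original row order.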

In [5]:
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold

# Assuming your data is in Numpy Arrays. If not, convert them into Numpy Arrays
x = np.array(x)
y = np.array(y)

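# Use the nn.Sequential API.
# Note: the model and optimizer are created once, before the fold loop, so
# the trained weights carry over from each fold to the next.
model = nn.Sequential(
    nn.Linear(x.shape[1], 50),
    nn.ReLU(),
    nn.Linear(50, 25),
    nn.ReLU(),
    nn.Linear(25, y.shape[1]),
    # Note: nn.CrossEntropyLoss applies log-softmax itself and expects raw
    # logits, so with this Softmax layer the loss sees probabilities instead.
    nn.Softmax(dim=1),
)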
model = torch.compile(model,backend="aot_eager").to(device)

# Cross-validate
kf = StratifiedKFold(5, shuffle=True, random_state=42)

# Defining Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

oos_y = []
oos_pred = []
fold = 0

for train, test in kf.split(x, df["product"]):
    fold += 1
    print(f"Fold #{fold}")

    x_train = torch.tensor(x[train], device=device, dtype=torch.float32)
    y_train = torch.tensor(np.argmax(y[train], axis=1),device=device, dtype=torch.long)  # Convert to class indices
    x_test = torch.tensor(x[test],device=device, dtype=torch.float32)
    y_test = torch.tensor(np.argmax(y[test], axis=1),device=device, dtype=torch.long)  # Convert to class indices

    # Training loop
    EPOCHS = 500
    epoch = 0
    done = False
    es = EarlyStopping(restore_best_weights=True)

    while not done and epoch < EPOCHS:
        epoch += 1
        model.train()
        optimizer.zero_grad()
        output = model(x_train)
        loss = criterion(output, y_train)
        loss.backward()
        optimizer.step()

        # Evaluate validation loss
        model.eval()
        with torch.no_grad():
            y_val = model(x_test)
            val_loss = criterion(y_val, y_test)

        if es(model, val_loss):
            done = True

    # Prediction
    with torch.no_grad():
        y_val = model(x_test)
        _, pred = torch.max(y_val, 1)

    oos_y.append(y_test.cpu().numpy())
    oos_pred.append(pred.cpu().numpy())

    print(
        f"Epoch {epoch}/{EPOCHS}, Validation Loss: " f"{val_loss.item()}, {es.status}"
    )

    # Measure this fold's accuracy
    score = metrics.accuracy_score(y_test.cpu().numpy(), pred.cpu().numpy())
    print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)

score = metrics.accuracy_score(oos_y, oos_pred)
print(f"Final score (accuracy): {score}")

Fold #1
Epoch 384/500, Validation Loss: 1.4796711206436157, Early stopping triggered after 5 epochs.
Fold score (accuracy): 0.69
Fold #2
Epoch 6/500, Validation Loss: 1.453275442123413, Early stopping triggered after 5 epochs.
Fold score (accuracy): 0.73
Fold #3
Epoch 7/500, Validation Loss: 1.4182788133621216, Early stopping triggered after 5 epochs.
Fold score (accuracy): 0.765
Fold #4
Epoch 6/500, Validation Loss: 1.4109307527542114, Early stopping triggered after 5 epochs.
Fold score (accuracy): 0.76
Fold #5
Epoch 7/500, Validation Loss: 1.4493036270141602, Early stopping triggered after 5 epochs.
Fold score (accuracy): 0.7275
Final score (accuracy): 0.7345
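
In the loop above, the model and optimizer are created once, before the folds begin, so every fold after the first starts from weights that were already trained on earlier folds. This may be why folds 2 through 5 stop after only 6 or 7 epochs. The following is a minimal sketch of one alternative, not a rerun of the notebook: `make_classifier` is a hypothetical helper, and `x_demo` and `y_demo` are random placeholder data. It builds a new network and optimizer for each fold and outputs raw logits, which is the input `nn.CrossEntropyLoss` expects because it applies log-softmax internally.

```python
import torch
import torch.nn as nn


def make_classifier(n_features, n_classes):
    # Hypothetical helper: returns a new, randomly initialized network.
    # The last layer outputs raw logits (no Softmax layer).
    return nn.Sequential(
        nn.Linear(n_features, 50),
        nn.ReLU(),
        nn.Linear(50, 25),
        nn.ReLU(),
        nn.Linear(25, n_classes),
    )


torch.manual_seed(42)
x_demo = torch.randn(8, 4)          # placeholder batch: 8 rows, 4 features
y_demo = torch.randint(0, 3, (8,))  # placeholder class indices (3 classes)
criterion = nn.CrossEntropyLoss()

for fold in range(1, 4):
    model = make_classifier(4, 3)                     # new weights per fold
    optimizer = torch.optim.Adam(model.parameters())  # new optimizer state
    logits = model(x_demo)
    loss = criterion(logits, y_demo)
    print(f"Fold #{fold} initial loss: {loss.item():.4f}")

# Class probabilities, when needed, come from a softmax over the logits
probs = torch.softmax(logits, dim=1)
```

Since `torch.max` over logits picks the same class as `torch.max` over probabilities, the prediction step does not need the Softmax layer either.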


# Module 4 Assignment

You can find the fourth assignment here: [assignment 4](https://github.com/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class4.ipynb)