# Don't Repeat Yourself!
## Keep the Momentum: Evolving Neural Networks Without Starting Over

Training deep neural networks is a time-intensive process, often requiring hours, days, or even weeks of fine-tuning on high-powered GPUs. However, when a network saturates and no longer improves in accuracy, conventional approaches would call for retraining from scratch with a new, deeper, or wider architecture. This can lead to wasted computational resources and long downtimes as models are redesigned and retrained from the beginning. Our innovation addresses this challenge by allowing a trained neural network to evolve without losing any of its achieved accuracy. With this approach, networks can continue learning and refining their performance without the need to start over, effectively speeding up the training process and boosting productivity.

### Layer Insertion and Modification Without Accuracy Loss

When a neural network hits its performance ceiling, it may require a more complex architecture—deeper layers, additional neurons, or more filters in convolutional layers—to push accuracy higher. Traditionally, this would mean designing a new model and restarting training from scratch. However, our method allows for the seamless insertion of new layers, neurons, or filters into a trained network, all without compromising the current accuracy level.

The key lies in how weights are initialized. For fully connected layers, we insert identity matrices for the new layers, preserving the function of the existing network. In convolutional layers, we use identity filters, which act similarly to the identity matrix but for feature maps. This ensures that the newly inserted layers do not alter the input-output relationship learned by the network, maintaining the accuracy already achieved. Additionally, when neurons are added to existing layers, their weights are initialized using standard initialization techniques, but the connections to subsequent layers are zero-weighted. This setup guarantees that the new neurons do not interfere with the performance of the already trained part of the network.

### Handling Non-Linearity with the ActiSwitch Layer

Introducing new layers or neurons also brings the challenge of activation functions. Activation functions play a crucial role in introducing non-linearity, and their behavior can impact how well new components integrate into an existing network. To solve this, we have developed the ActiSwitch layer—a mechanism that allows a smooth transition between linear and non-linear activation functions.

The ActiSwitch layer operates using two parameters that control the ratio between linearity and non-linearity, creating a dynamic blend between the two extremes. As the network trains, the model can adjust these parameters to smoothly switch between linear behavior and the desired non-linearity. This capability is particularly valuable when adding new neurons or layers, as it allows the network to incorporate the new elements without destabilizing the already trained sections. The ActiSwitch layer ensures that activation functions evolve in sync with the expanded architecture, providing a smooth learning curve for the newly added components.

### Increasing Productivity in Neural Network Research

In the fast-paced world of AI research, productivity is paramount. This method accelerates the iterative process of neural network design, enabling faster experimentation without the need to restart each time a change is made to the architecture. Instead of retraining from scratch, researchers can continue from where they left off, modifying the network incrementally to achieve better performance.

Furthermore, this technique is highly adaptable. Whether you need to insert a few extra neurons in a fully connected layer, expand the number of filters in a convolutional layer, or even alter the size of filters themselves, our method can accommodate these changes without disrupting the training process. It provides an efficient, flexible way to adapt and scale neural networks without sacrificing prior progress.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import time

from atgen.network import ATNetwork
from atgen.layers import Linear, Flatten, Conv2D, MaxPool2D, ActiSwitch, Pass

import numpy as np
import random

seed = 0
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.use_deterministic_algorithms(True)

In [2]:
# Device configuration
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

# Hyperparameters
num_epochs = 10
learning_rate = 0.001
batch_size = 128

# CIFAR-100 dataset (100 classes, 32x32 images)
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_dataset = torchvision.datasets.CIFAR100(root='../data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.CIFAR100(root='../data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

Files already downloaded and verified
Files already downloaded and verified


In [3]:
model = ATNetwork(
    Conv2D(3, 32, kernel_size=3, norm=True),
    ActiSwitch(nn.ReLU),
    MaxPool2D(),
    Conv2D(32, 64, kernel_size=3, norm=True),
    ActiSwitch(nn.ReLU),
    MaxPool2D(),
    Flatten(),
    Linear(64*8*8, 100),
    input_size=(32, 32)
)

model.summary()

[1m[38;5;153mModel Summary[0m[1m:
----------------------------------------------------------------------------------------------------
Layer      Type           Output Shape                  Parameters     Activation     
----------------------------------------------------------------------------------------------------
Layer 1    Conv2D         (batch_size, 32, 32, 32)      896            ActiSwitch(ReLU, 100.00%)
Layer 2    MaxPool2D      (batch_size, 32, 16, 16)      0              Pass           
Layer 3    Conv2D         (batch_size, 64, 16, 16)      18496          ActiSwitch(ReLU, 100.00%)
Layer 4    MaxPool2D      (batch_size, 64, 8, 8)        0              Pass           
Layer 5    Flatten        (batch_size, 4096)            0              Pass           
Layer 6    Linear         (batch_size, 100)             409700         Pass           
----------------------------------------------------------------------------------------------------
[38;5;153mTotal Parameters:  

In [4]:
# Initialize network and optimizer
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [5]:
# Training function
def train_model(model, train_loader, num_epochs):
    start_time = time.time()
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0
        for i, (images, labels) in enumerate(train_loader):
            images = images.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Track accuracy
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
            running_loss += loss.item()

        accuracy = 100. * correct / total
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/i:.4f}, Accuracy: {accuracy:.2f}%')

    training_time = time.time() - start_time
    print(f'Training completed in: {training_time:.2f} seconds')
    return training_time

In [6]:
# Evaluate function
def evaluate_model(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    accuracy = 100. * correct / total
    print(f'Test Accuracy: {accuracy:.2f}%')

In [7]:
# Training the initial network
print("Training the initial network...")
initial_training_time = train_model(model, train_loader, num_epochs)
evaluate_model(model, test_loader)

Training the initial network...
Epoch [1/10], Loss: 3.6536, Accuracy: 16.84%
Epoch [2/10], Loss: 3.0037, Accuracy: 27.24%
Epoch [3/10], Loss: 2.7334, Accuracy: 32.54%
Epoch [4/10], Loss: 2.5920, Accuracy: 35.64%
Epoch [5/10], Loss: 2.4756, Accuracy: 37.80%
Epoch [6/10], Loss: 2.3892, Accuracy: 39.63%
Epoch [7/10], Loss: 2.3144, Accuracy: 41.38%
Epoch [8/10], Loss: 2.2570, Accuracy: 42.30%
Epoch [9/10], Loss: 2.2051, Accuracy: 43.81%
Epoch [10/10], Loss: 2.1673, Accuracy: 44.64%
Training completed in: 261.27 seconds
Test Accuracy: 41.09%


### Now you can apply the method we’ve discussed (inserting new layers, neurons, or filters) to enhance performance without resetting the training process.
You’ll also notice that the accuracy starts off significantly higher than with the traditional CNN. This improvement is thanks to the `ActiSwitch` layer, which strikes an optimal balance between linearity and non-linearity. As we explored earlier, `ActiSwitch` not only outperforms traditional skip connections but also shows great promise as an architecture that can surpass the capabilities of `ResNet`.

In [8]:
model.layers.insert(1, Conv2D.init_identity_layer(32, kernel_size=3, norm=True))
model.layers.insert(4, Conv2D.init_identity_layer(64, kernel_size=3, norm=True))
model.layers.insert(8, Linear.init_identity_layer(100))
model.activation.insert(1, ActiSwitch(nn.ReLU, True))
model.activation.insert(4, ActiSwitch(nn.ReLU, True))
model.activation.insert(7, ActiSwitch(nn.ReLU, True))
model.store_sizes((32, 32))
model.summary()
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

[1m[38;5;153mModel Summary[0m[1m:
----------------------------------------------------------------------------------------------------
Layer      Type           Output Shape                  Parameters     Activation     
----------------------------------------------------------------------------------------------------
Layer 1    Conv2D         (batch_size, 32, 32, 32)      896            ActiSwitch(ReLU, 78.79 %)
Layer 2    Conv2D         (batch_size, 32, 32, 32)      9248           ActiSwitch(ReLU, 0.00  %)
Layer 3    MaxPool2D      (batch_size, 32, 16, 16)      0              Pass           
Layer 4    Conv2D         (batch_size, 64, 16, 16)      18496          ActiSwitch(ReLU, 97.36 %)
Layer 5    Conv2D         (batch_size, 64, 16, 16)      36928          ActiSwitch(ReLU, 0.00  %)
Layer 6    MaxPool2D      (batch_size, 64, 8, 8)        0              Pass           
Layer 7    Flatten        (batch_size, 4096)            0              Pass           
Layer 8    Linear       

In [9]:
# Continue training the network
print("Continue training the network...")
initial_training_time = train_model(model, train_loader, num_epochs)
evaluate_model(model, test_loader)

Continue training the network...
Epoch [1/10], Loss: 2.9718, Accuracy: 40.72%
Epoch [2/10], Loss: 2.0917, Accuracy: 46.24%
Epoch [3/10], Loss: 2.0208, Accuracy: 47.49%
Epoch [4/10], Loss: 1.9759, Accuracy: 48.70%
Epoch [5/10], Loss: 1.9405, Accuracy: 49.46%
Epoch [6/10], Loss: 1.8927, Accuracy: 50.40%
Epoch [7/10], Loss: 1.8541, Accuracy: 51.28%
Epoch [8/10], Loss: 1.8206, Accuracy: 51.68%
Epoch [9/10], Loss: 1.7869, Accuracy: 52.70%
Epoch [10/10], Loss: 1.7584, Accuracy: 53.31%
Training completed in: 446.89 seconds
Test Accuracy: 46.42%


In [10]:
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(4096, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
)
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [11]:
# Training the initial network
print("Training the initial network...")
initial_training_time = train_model(model, train_loader, num_epochs)
evaluate_model(model, test_loader)

Training the initial network...
Epoch [1/10], Loss: 3.9360, Accuracy: 9.58%
Epoch [2/10], Loss: 3.2361, Accuracy: 19.91%
Epoch [3/10], Loss: 2.8938, Accuracy: 26.38%
Epoch [4/10], Loss: 2.7147, Accuracy: 30.13%
Epoch [5/10], Loss: 2.5752, Accuracy: 32.80%
Epoch [6/10], Loss: 2.4843, Accuracy: 34.69%
Epoch [7/10], Loss: 2.4160, Accuracy: 36.10%
Epoch [8/10], Loss: 2.3641, Accuracy: 37.27%
Epoch [9/10], Loss: 2.3018, Accuracy: 38.64%
Epoch [10/10], Loss: 2.2578, Accuracy: 39.25%
Training completed in: 300.83 seconds
Test Accuracy: 35.20%


### Conclusion: Evolving Without Restarting

The ability to insert new layers, neurons, or filters into a trained neural network without losing accuracy represents a significant breakthrough in neural network training. By leveraging identity matrices and zero-weighted connections, we can preserve the model’s learned knowledge, while the ActiSwitch layer ensures smooth transitions between activation functions. This innovation opens up new possibilities for evolving neural network architectures and allows researchers to push the boundaries of model accuracy without retraining from scratch.

#### In a field where every hour of training counts, this method enables you to "Keep the Momentum" and continue improving your models without unnecessary delays or wasted resources.