In [1]:
# Imports

# Import basic libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
from collections import OrderedDict
from PIL import Image

# Import PyTorch
import torch # import main library
from torch.autograd import Variable
import torch.nn as nn # import modules
from torch import optim # import optimizers for demonstrations
import torch.nn.functional as F # import torch functions
from torchvision import transforms # import transformations to use for demo
from torch.utils.data import Dataset, DataLoader 

![image](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/raw/master/assets/background-card-chip.jpg)

[_Photo by Fancycrave.com from Pexels_](https://www.pexels.com/photo/green-ram-card-collection-825262/)

# In-Place Operations in PyTorch
_What are they and why avoid them_

## Introduction


Today's advanced deep neural networks have millions of parameters (for example, see the comparison in [this paper](https://arxiv.org/pdf/1905.11946.pdf)), and training them on free GPUs such as those offered by Kaggle or Google Colab often ends with the GPU running out of memory. There are several simple ways to reduce the GPU memory occupied by the model, for example:
* Change the architecture of the model or use a type of model with fewer parameters (for example, choose [DenseNet](https://arxiv.org/pdf/1608.06993.pdf)-121 over DenseNet-169). This approach can affect the model's performance metrics.
* Reduce the batch size or manually set the number of data loader workers. In this case it will take longer for the model to train.

Using in-place operations in neural networks may help to avoid the downsides of the approaches mentioned above while saving some GPU memory. However, it is strongly __not recommended to use in-place operations__ for several reasons.

In this kernel I would like to:
* Describe what in-place operations are and demonstrate how they might help to save GPU memory.
* Explain why we should avoid in-place operations or use them with great caution.

## In-place Operations
`In-place operation is an operation that changes directly the content of a given linear algebra, vector, matrices(Tensor) without making a copy. The operators which helps to do the operation is called in-place operator.` See the [tutorial](https://www.tutorialspoint.com/inplace-operator-in-python) on in-place operations in Python.

As the definition says, in-place operations don't make a copy of the input. That is why they can help to reduce memory usage when operating on high-dimensional data.

I would like to run a simple model on the [Fashion MNIST dataset](https://www.kaggle.com/zalando-research/fashionmnist) to demonstrate how in-place operations help to consume less GPU memory. In this demonstration I use a simple fully-connected deep neural network with four linear layers and a [ReLU](https://pytorch.org/docs/stable/nn.html#relu) activation after each hidden layer.
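To make the "no copy" part concrete, here is a minimal sketch that compares memory addresses before and after an out-of-place and an in-place ReLU. It assumes only the standard `Tensor.data_ptr()` and `Tensor.relu_()` methods; the variable names are illustrative.

```python
import torch

t = torch.tensor([-1.0, 0.0, 2.0])
address_before = t.data_ptr()  # memory address of the tensor's first element

# Out-of-place: the result lives in newly allocated memory, t is unchanged
out = torch.relu(t)
out_shares_memory = out.data_ptr() == address_before

# In-place (trailing underscore): t itself is overwritten
t.relu_()
inplace_keeps_memory = t.data_ptr() == address_before

print(out_shares_memory, inplace_keeps_memory)
print(t)
```

The out-of-place result should report a different address, while the in-place call keeps the original one and changes the values of `t`.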

## Setting Up the Demo
In this section I will prepare everything for the demonstration:
* Load the Fashion MNIST dataset with PyTorch,
* Introduce transformations for Fashion MNIST images using PyTorch,
* Prepare the model training procedure.

If you are familiar with PyTorch basics, just skip this part and go straight to the rest of the kernel.

### Introduce Transformations

The most efficient way to transform the input data is to use built-in PyTorch transformations:

In [2]:
# Define a transform
transform = transforms.Compose([transforms.ToTensor()])

### Load the Data

To load the data I used the standard `Dataset` and `DataLoader` classes from PyTorch and the [FashionMNIST class code from this kernel](https://www.kaggle.com/arturlacerda/pytorch-conditional-gan):

In [3]:
class FashionMNIST(Dataset):
    '''
    Dataset class to load Fashion MNIST data from csv.
    Code from original kernel:
    https://www.kaggle.com/arturlacerda/pytorch-conditional-gan
    '''
    def __init__(self, transform=None):
        self.transform = transform
        fashion_df = pd.read_csv('../input/fashion-mnist_train.csv')
        self.labels = fashion_df.label.values
        self.images = fashion_df.iloc[:, 1:].values.astype('uint8').reshape(-1, 28, 28)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        label = self.labels[idx]
        img = Image.fromarray(self.images[idx])
        
        if self.transform:
            img = self.transform(img)

        return img, label

# Load the training data for Fashion MNIST
trainset = FashionMNIST(transform=transform)
# Define the dataloader
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

### Setup Training Procedure
I wrote a small training procedure, which runs 5 training epochs and prints the loss for each epoch:

In [4]:
def train_model(model, device):
    '''
    Function trains the model and prints out the training log.
    '''
    #setup training
    
    #define loss function
    criterion = nn.NLLLoss()
    #define learning rate
    learning_rate = 0.003
    #define number of epochs
    epochs = 5
    #initialize optimizer
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    model.to(device)

    #run training and print out the loss to make sure that we are actually fitting to the training set
    print('Training the model \n')
    for e in range(epochs):
        running_loss = 0
        for images, labels in trainloader:
            
            images, labels = images.to(device), labels.to(device)
            images = images.view(images.shape[0], -1)
            log_ps = model(images)
            loss = criterion(log_ps, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
        else:
            # print out the loss to make sure it is decreasing
            print(f"Training loss: {running_loss}")

### Define the Model

PyTorch provides us with an in-place implementation of the ReLU activation function. I will run training twice in a row: once with the in-place ReLU implementation and once with vanilla ReLU.

In [5]:
# create class for basic fully-connected deep neural network
class Classifier(nn.Module):
    '''
    Demo classifier model class to demonstrate in-place operations
    '''
    def __init__(self, inplace = False):
        super().__init__()

        # initialize layers
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)
        
        self.relu = nn.ReLU(inplace = inplace) # pass inplace as parameter to ReLU

    def forward(self, x):
        # make sure the input tensor is flattened
        x = x.view(x.shape[0], -1)

        # apply activation function
        x = self.relu(self.fc1(x))

        # apply activation function
        x = self.relu(self.fc2(x))
        
        # apply activation function
        x = self.relu(self.fc3(x))
        
        x = F.log_softmax(self.fc4(x), dim=1)

        return x
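Before comparing memory, it is worth checking that the two ReLU variants are interchangeable in this kind of model. The sketch below is an illustration, not part of the original demo: it uses a hypothetical helper `relu_layer_output_and_grad` and a tiny linear layer, and assumes that a linear layer's backward pass does not need the layer's own output, so overwriting that output in place should be safe.

```python
import torch
import torch.nn as nn

def relu_layer_output_and_grad(inplace):
    # Hypothetical helper, used for this illustration only
    torch.manual_seed(0)              # same weights and input for both variants
    layer = nn.Linear(4, 3)
    x = torch.randn(2, 4)
    out = nn.ReLU(inplace=inplace)(layer(x))
    out.sum().backward()
    return out.detach(), layer.weight.grad.clone()

out_plain, grad_plain = relu_layer_output_and_grad(inplace=False)
out_inplace, grad_inplace = relu_layer_output_and_grad(inplace=True)

print(torch.allclose(out_plain, out_inplace))    # outputs agree
print(torch.allclose(grad_plain, grad_inplace))  # gradients agree
```

If both checks print `True`, switching the `inplace` flag changes only where the activations are stored, not what the model computes.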

## Compare Memory Usage for In-place and Vanilla Operations


Let's compare memory usage for one single call of ReLU activation function:

In [6]:
# empty caches and setup the device
torch.cuda.empty_cache()

device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [7]:
def get_memory_allocated(device, inplace = False):
    '''
    Function measures allocated memory before and after the ReLU function call.
    '''
    
    # Create a large tensor
    t = torch.randn(10000, 10000, device=device)
    
    # Measure allocated memory
    torch.cuda.synchronize()
    start_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    start_memory = torch.cuda.memory_allocated() / 1024**2
    
    # Call in-place or normal ReLU
    if inplace:
        F.relu_(t)
    else:
        output = F.relu(t)
    
    # Measure allocated memory after the call
    torch.cuda.synchronize()
    end_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    end_memory = torch.cuda.memory_allocated() / 1024**2
    
    # Return amount of memory allocated for ReLU call
    return end_memory - start_memory, end_max_memory - start_max_memory

Run out-of-place ReLU:

In [8]:
memory_allocated, max_memory_allocated = get_memory_allocated(device, inplace = False)
print('Allocated memory: {}'.format(memory_allocated))
print('Allocated max memory: {}'.format(max_memory_allocated))

Allocated memory: 382.0
Allocated max memory: 382.0
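The reported number can be sanity-checked with a little arithmetic. The sketch below assumes the tensor uses the default `float32` dtype of `torch.randn`. The small gap between the computed size and the reported 382.0 is presumably due to the CUDA caching allocator rounding large allocations up, which is an assumption on my side rather than something measured here.

```python
import torch

rows, cols = 10000, 10000
# Size of one float32 element in bytes
bytes_per_element = torch.tensor([], dtype=torch.float32).element_size()
size_mb = rows * cols * bytes_per_element / 1024**2

print(f"{size_mb:.2f} MB")
```

So the out-of-place ReLU allocates roughly one more full copy of the input tensor for its output.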


Run in-place ReLU:

In [9]:
memory_allocated_inplace, max_memory_allocated_inplace = get_memory_allocated(device, inplace = True)
print('Allocated memory: {}'.format(memory_allocated_inplace))
print('Allocated max memory: {}'.format(max_memory_allocated_inplace))

Allocated memory: 0.0
Allocated max memory: 0.0


Now let's do the same while training a simple classifier.
Run training with vanilla ReLU:

In [10]:
# initialize classifier
model = Classifier(inplace = False)

# measure allocated memory
torch.cuda.synchronize()
start_max_memory = torch.cuda.max_memory_allocated() / 1024**2
start_memory = torch.cuda.memory_allocated() / 1024**2

# train the classifier
train_model(model, device)

# measure allocated memory after training
torch.cuda.synchronize()
end_max_memory = torch.cuda.max_memory_allocated() / 1024**2
end_memory = torch.cuda.memory_allocated() / 1024**2

Training the model 

Training loss: 490.5504989773035
Training loss: 361.1345275044441
Training loss: 329.05726308375597
Training loss: 306.97832968086004
Training loss: 292.6471059694886


In [11]:
print('Allocated memory: {}'.format(end_memory - start_memory))
print('Allocated max memory: {}'.format(end_max_memory - start_max_memory))

Allocated memory: 1.853515625
Allocated max memory: 0.0


Run training with in-place ReLU:

In [12]:
# initialize model with in-place ReLU
model = Classifier(inplace = True)

# measure allocated memory
torch.cuda.synchronize()
start_max_memory = torch.cuda.max_memory_allocated() / 1024**2
start_memory = torch.cuda.memory_allocated() / 1024**2

# train the classifier with in-place ReLU
train_model(model, device)

# measure allocated memory after training
torch.cuda.synchronize()
end_max_memory = torch.cuda.max_memory_allocated() / 1024**2
end_memory = torch.cuda.memory_allocated() / 1024**2

Training the model 

Training loss: 485.5531446188688
Training loss: 359.61066341400146
Training loss: 329.1772850751877
Training loss: 307.14213905483484
Training loss: 292.3229675516486


In [13]:
print('Allocated memory: {}'.format(end_memory - start_memory))
print('Allocated max memory: {}'.format(end_max_memory - start_max_memory))

Allocated memory: 1.853515625
Allocated max memory: 0.0
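A note on the `0.0` peak difference in both training runs: `torch.cuda.max_memory_allocated()` reports the peak since the start of the program (or since the last reset), so the large 10000 x 10000 tensors allocated earlier probably still define the peak, and the small classifier never exceeds it. A minimal sketch of a measurement that resets the peak counter first could look like this. The helper name `peak_memory_mb` is hypothetical, and `torch.cuda.reset_peak_memory_stats` requires a reasonably recent PyTorch version.

```python
import torch

def peak_memory_mb(fn, device):
    """Hypothetical helper: peak CUDA memory (in MB) allocated while running fn()."""
    if device.type != "cuda":
        return None  # CUDA memory statistics are not available on CPU
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)   # forget earlier peaks
    baseline = torch.cuda.memory_allocated(device)
    fn()
    torch.cuda.synchronize(device)
    return (torch.cuda.max_memory_allocated(device) - baseline) / 1024**2

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(peak_memory_mb(lambda: torch.relu(torch.randn(1000, 1000, device=device)), device))
```

With such a helper, a training run could be wrapped as `peak_memory_mb(lambda: train_model(model, device), device)` to compare the peaks of the two models directly.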


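For a single ReLU call on a large tensor, the in-place version really does save GPU memory: it avoids the extra 382 MB allocated by the out-of-place call. For this small classifier, however, both training runs report the same numbers, so the measurements above show no difference between the two variants. Either way, we should be __extremely cautious when using in-place operations and check twice__. In the next section I will show why.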

## Downsides of In-place Operations

The major downside of in-place operations is that __they might overwrite values required to compute gradients__, which breaks the training procedure of the model. Here is what [the official PyTorch autograd documentation](https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd) says:
> Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

> There are two main reasons that limit the applicability of in-place operations:

> 1. In-place operations can potentially overwrite values required to compute gradients.
> 2. Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations, require changing the creator of all inputs to the Function representing this operation. This can be tricky, especially if there are many Tensors that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other Tensor.
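
The first reason is easy to reproduce. In the sketch below (an illustration with made-up variable names), the backward pass of `torch.sigmoid` reuses the output of the forward pass, so modifying that output in place leaves autograd without the values it needs, and PyTorch raises an error instead of silently computing wrong gradients.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # the backward pass of sigmoid reuses its output y
y.mul_(2)              # in-place change overwrites the values autograd saved

error_message = None
try:
    y.sum().backward()
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

Replacing `y.mul_(2)` with the out-of-place `y = y * 2` keeps the saved values intact, and the backward pass then runs without an error.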

Another reason to be careful with in-place operations is that implementing them correctly is extremely tricky. That is why I would __recommend using PyTorch's standard in-place operations__ (like `torch.tanh_` or `torch.sigmoid_`) instead of implementing one manually.

Let's see an example of [SiLU](https://arxiv.org/pdf/1606.08415.pdf) (or Swish-1) activation function. This is the normal implementation of SiLU:

In [14]:
def silu(input):
    '''
    Normal implementation of SiLU activation function
    https://arxiv.org/pdf/1606.08415.pdf
    '''
    return input * torch.sigmoid(input)

Let's try to implement in-place SiLU using torch.sigmoid_ in-place function:

In [15]:
def silu_inplace_1(input):
    '''
    Incorrect implementation of in-place SiLU activation function
    https://arxiv.org/pdf/1606.08415.pdf
    '''
    return input * torch.sigmoid_(input) # THIS IS INCORRECT!!!

The code above implements in-place SiLU __incorrectly__. We can verify that:

In [16]:
t = torch.randn(3)

# print result of original SiLU
print("Original SiLU: {}".format(silu(t)))

# change the value of t with in-place function
silu_inplace_1(t)
print("In-place SiLU: {}".format(t))

Original SiLU: tensor([ 0.0796, -0.2744, -0.2598])
In-place SiLU: tensor([0.5370, 0.2512, 0.2897])


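The function `silu_inplace_1` in fact returns `sigmoid(input) * sigmoid(input)`! On top of that, the result is returned as a new tensor, while `t` itself is left holding `sigmoid(input)`, which is the value printed above.

A working in-place implementation of SiLU using `torch.sigmoid_` could look like this (note that it needs a temporary copy of the input, which is freed when the function returns):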
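
A small check with a fixed input separates the two effects of `silu_inplace_1`: what it returns and what it leaves behind in its argument. This is a sketch that repeats the incorrect function so it can be run on its own; the input values are arbitrary.

```python
import torch

def silu_inplace_1(input):
    # Incorrect in-place SiLU, repeated here so the check is self-contained
    return input * torch.sigmoid_(input)

x = torch.tensor([0.5, -1.0, 2.0])
sigmoid_x = torch.sigmoid(x)       # reference values, computed out-of-place

t = x.clone()
returned = silu_inplace_1(t)

print(torch.allclose(t, sigmoid_x))              # t now holds sigmoid(x)
print(torch.allclose(returned, sigmoid_x ** 2))  # return value is sigmoid(x)^2
print(torch.allclose(returned, x * sigmoid_x))   # and it is not SiLU
```

By the time the multiplication runs, both operands refer to the same, already overwritten tensor, which is where the squared sigmoid comes from.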

In [17]:
def silu_inplace_2(input):
    '''
    Example of implementation of in-place SiLU activation function using torch.sigmoid_
    https://arxiv.org/pdf/1606.08415.pdf
    '''
    result = input.clone()
    torch.sigmoid_(input)
    input *= result
    return input

In [18]:
t = torch.randn(3)

# print result of original SiLU
print("Original SiLU: {}".format(silu(t)))

# change the value of t with in-place function
silu_inplace_2(t)
print("In-place SiLU #2: {}".format(t))

Original SiLU: tensor([ 0.7774, -0.2767,  0.2967])
In-place SiLU #2: tensor([ 0.7774, -0.2767,  0.2967])


This small example demonstrates why we should be extremely careful and check twice when using the in-place operations.
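
One way to "check twice" is to compare an in-place implementation against an out-of-place reference on a copy of the input, and to confirm that the input was really modified in place. The sketch below repeats `silu_inplace_2` so it runs on its own. It also compares against `F.silu(..., inplace=True)`, which I assume is available (it ships with newer PyTorch releases and would be the standard operation to prefer where it exists).

```python
import torch
import torch.nn.functional as F

def silu_inplace_2(input):
    # In-place SiLU from above, repeated so the check is self-contained
    result = input.clone()
    torch.sigmoid_(input)
    input *= result
    return input

x = torch.randn(1000)
reference = x * torch.sigmoid(x)   # out-of-place reference

t = x.clone()                      # keep x intact for the comparison
out = silu_inplace_2(t)

matches_reference = torch.allclose(out, reference, atol=1e-6)
modified_in_place = out.data_ptr() == t.data_ptr()
matches_builtin = torch.allclose(F.silu(x.clone(), inplace=True), reference, atol=1e-6)

print(matches_reference, modified_in_place, matches_builtin)
```

Checking many random values with `torch.allclose` is more reliable than comparing three printed numbers by eye.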

## Conclusion
In this article:
* I described in-place operations and their purpose, and demonstrated on a single ReLU call how they can help to __consume less GPU memory__.
* I described the major __downsides of in-place operations__. One should be very careful about using them and check the result twice.

## Additional References
Links to the additional resources and further reading:

1. [PyTorch Autograd documentation](https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd)

## PS
![echo logo](https://github.com/Lexie88rus/Activation-functions-examples-pytorch/blob/master/assets/echo_logo.png?raw=true)

I am participating in the implementation of the __Echo package__, a mathematical backend for neural networks that can be used with the most popular existing packages (TensorFlow, Keras and [PyTorch](https://pytorch.org/)). We have done a lot for PyTorch and Keras so far. Here is a [link to the repository on GitHub](https://github.com/digantamisra98/Echo/tree/Dev-adeis). __I would highly appreciate your feedback__ on it.