# Problem with Batch Normalization layer for action choice

#### Background knowledge

- Batch Normalization

Batch normalization is intended to increase stability when training a neural network. Batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. [source](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c)
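
The normalization step described above can be sketched in a few lines of NumPy. This is only an illustration, not PyTorch's implementation: the helper name `batch_norm` and the `eps` value are assumptions, and the learnable scale and shift are left out. It shows that the output for one row depends on which other rows share its batch.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) with the mean and variance of the batch (rows)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)  # biased variance over the batch dimension
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch_a = rng.random((3, 4))
batch_b = batch_a.copy()
batch_b[1:] = rng.random((2, 4))  # same first row, different companion rows

out_a = batch_norm(batch_a)
out_b = batch_norm(batch_b)

# Each feature has approximately zero mean after normalization.
print("per-feature mean:", out_a.mean(axis=0))
# The first row is identical in both inputs, yet its normalized value differs.
print("row 0 identical after normalization:", np.allclose(out_a[0], out_b[0]))
```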

- Actor network for DDPG and MADDPG

Both DDPG and MADDPG use a local actor network for action selection and a target actor network for estimating the next actions used in the update. Both networks take state(s)/observation(s) as input and output an action based on the model parameters. When selecting an action, the **local** actor network typically takes **one** state/observation as input. During the update, the **target** actor network typically takes a **batch** of states as input.

#### Problem statement

If we add a batchnorm layer to the actor network, the activations computed from the input **batch** of states/observations in the **target actor** network will be **normalized**, while those computed from the single state/observation in the **local actor network** remain the **same** (in the model used here, batchnorm is not applied to a single input). In other words, we choose the action based on the observation, but estimate the action based on a **changed version** of the same state. The **estimated action** (from the target actor network) will therefore be noticeably **different** from the **real action** (from the local actor network), even if the target actor network is identical to the local actor network. See the example below.

#### Example
First, define an actor model. Here I directly use the actor model for MADDPG, which applies batchnorm in its forward function whenever the input is not a single 1-D state (line 47).

In [1]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

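def norm_init(layer):  # not used by the Actor below
    fan_in = layer.weight.data.size()[0]  # index 0 of an nn.Linear weight is out_features
    return 1./np.sqrt(fan_in)

def hidden_init(layer):  # returns the (low, high) limits for uniform initialization
    fan_in = layer.weight.data.size()[0]  # index 0 of an nn.Linear weight is out_features
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)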

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units = 256, fc2_units = 128):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units,fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)
        self.batchnorm_1 = nn.BatchNorm1d(fc1_units)
        
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""
        #x = F.relu(self.batchnorm_1(self.fc1(state)))
        if state.dim() != 1:
            x = F.relu(self.batchnorm_1(self.fc1(state)))
        else:
            # A single (1-D) state skips batchnorm entirely.
            x = F.relu(self.fc1(state))

        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # torch.tanh replaces the deprecated F.tanh

Then instantiate an actor and prepare random states as input.

In [2]:
actor = Actor(24,2,4)
single_state = torch.rand(24)
batch_states = torch.stack([single_state,torch.rand(24),torch.randn(24)]) # put the single_state along with
                                                                          # other random states

Pass the states into the actor to get the actions.

In [3]:
single_action = actor(single_state)
batch_action = actor(batch_states)
print("single_action: ", single_action)
print("batch_action: ", batch_action)
print("batch_action[0]: ", batch_action[0])

single_action:  tensor([-0.0143,  0.0236], grad_fn=<TanhBackward>)
batch_action:  tensor([[-0.0164,  0.0191],
        [-0.0141,  0.0240],
        [-0.0198,  0.0162]], grad_fn=<TanhBackward>)
batch_action[0]:  tensor([-0.0164,  0.0191], grad_fn=<SelectBackward>)




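Note that `batch_action[0]` differs from `single_action`, even though both come from the same state and the same network. The difference comes from the normalization step: the single state skips batchnorm, while the batch is normalized with the statistics of the whole batch.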
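
To check that the batch statistics are what cause the mismatch, one option is to compare training mode with evaluation mode, where `nn.BatchNorm1d` uses its running statistics instead of the current batch. The sketch below uses a small hypothetical network (`net`), not the `Actor` above, and passes the single state as a batch of one via `unsqueeze(0)`; it is meant as a diagnostic, not as a recommended fix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for an actor that contains a batchnorm layer.
net = nn.Sequential(
    nn.Linear(24, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Linear(16, 2), nn.Tanh(),
)

state = torch.rand(24)
batch_a = torch.stack([state, torch.rand(24), torch.rand(24)])
batch_b = torch.stack([state, torch.rand(24), torch.rand(24)])

# Training mode: normalization uses the statistics of the current batch,
# so the output for `state` depends on the other rows in the batch.
net.train()
with torch.no_grad():
    train_a = net(batch_a)[0]
    train_b = net(batch_b)[0]
train_mode_differs = not torch.allclose(train_a, train_b, atol=1e-6)

# Evaluation mode: normalization uses the stored running statistics,
# so each row is processed independently of its batch.
net.eval()
with torch.no_grad():
    eval_batch = net(batch_a)[0]
    eval_single = net(state.unsqueeze(0))[0]
eval_mode_matches = torch.allclose(eval_batch, eval_single, atol=1e-5)

print("train mode, same state in two batches differs:", train_mode_differs)
print("eval mode, batch row matches single state:", eval_mode_matches)
```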

Since the target actor network is used to estimate the actions for the next states, and these actions are then fed into the target critic network to derive the TD error term, the estimated TD error will be inaccurate. This may leave the agents unable to solve the environment.
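
The dependence of the TD target on the estimated next actions can be sketched as follows. The networks here are hypothetical one-layer stand-ins, and `gamma`, the batch size, and the `0.1` action offset are arbitrary assumptions. The sketch uses the standard DDPG-style target, reward plus the discounted target-critic value of the next state and next action, and shows that a change in the estimated actions carries over into the target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_size, action_size, batch_size, gamma = 24, 2, 4, 0.99

# Hypothetical stand-ins for the target networks (not the author's models).
target_actor = nn.Sequential(nn.Linear(state_size, action_size), nn.Tanh())
target_critic = nn.Linear(state_size + action_size, 1)

next_states = torch.rand(batch_size, state_size)
rewards = torch.rand(batch_size, 1)
dones = torch.zeros(batch_size, 1)

def td_target_for(actions):
    """TD target: r + gamma * (1 - done) * Q_target(s', a')."""
    q_next = target_critic(torch.cat([next_states, actions], dim=1))
    return rewards + gamma * (1 - dones) * q_next

with torch.no_grad():
    next_actions = target_actor(next_states)
    td_target = td_target_for(next_actions)
    # Simulate a distorted action estimate by adding an arbitrary offset.
    td_target_shifted = td_target_for(next_actions + 0.1)

max_gap = (td_target - td_target_shifted).abs().max().item()
print("largest change in TD target:", max_gap)
```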