## BatchNorm

**Meaning**

- Batch

    mini-batch

- Standardization

    zero mean, unit variance

In [1]:
import torch
import numpy as np
import torch.nn as nn

**Types of normalization**

![](./img/batch.png)

**Overview**

![](./nb.jpg)

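**Explanation**

- Batch_Norm

         Per channel: compute the mean and variance over all samples in the mini-batch and all spatial positions, then standardize that channel (statistics are pooled over N×H×W)
         
- Layer_Norm

        Per sample: compute the mean and variance over all elements of that sample, then standardize (statistics are pooled over C×H×W)

- Instance_Norm

        Per sample and per channel: compute the mean and variance over H×W

Picture the feature maps as a stack of books: there are N samples, each sample has C channels, each channel has H rows, and each row has W pixels.

1. BN sums the values of all samples channel by channel and divides by N×H×W, so each channel gets one mean and one variance, C in total.

2. LN sums all values within one sample and divides by C×H×W, so each sample gets one mean and one variance, N in total.

3. IN treats every sample separately, sums the values within one channel and divides by H×W, so each (sample, channel) pair gets one mean and one variance, N×C in total.

4. GN splits the channels of each sample into groups and computes one mean and one variance per group, N×G in total for G groups.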
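The four pooling rules above can be checked on a toy tensor. The sketch below is illustrative only: it uses plain nested lists rather than PyTorch tensors, and the names `x`, `flat`, and `G` are assumptions introduced for this example.

```python
from statistics import fmean

# Toy feature map of shape (N=2, C=4, H=1, W=2) as nested lists: x[n][c][h][w].
x = [
    [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]],
    [[[9.0, 10.0]], [[11.0, 12.0]], [[13.0, 14.0]], [[15.0, 16.0]]],
]
N, C, G = len(x), len(x[0]), 2  # G: number of channel groups used for GN

def flat(block):
    """Flatten an arbitrarily nested list of floats into one flat list."""
    if isinstance(block, float):
        return [block]
    return [v for sub in block for v in flat(sub)]

# BN: one mean per channel, pooled over N, H, W -> C values
bn_means = [fmean(flat([x[n][c] for n in range(N)])) for c in range(C)]
# LN: one mean per sample, pooled over C, H, W -> N values
ln_means = [fmean(flat(x[n])) for n in range(N)]
# IN: one mean per (sample, channel), pooled over H, W -> N*C values
in_means = [[fmean(flat(x[n][c])) for c in range(C)] for n in range(N)]
# GN: one mean per (sample, group), pooled over the group's channels, H, W -> N*G values
size = C // G
gn_means = [[fmean(flat(x[n][g * size:(g + 1) * size])) for g in range(G)] for n in range(N)]

print(len(bn_means), len(ln_means), N * C, N * G)  # number of statistics: 4 2 8 4
```

With `G = 1` the GN result coincides with LN, and with `G = C` it coincides with IN, which is why GN is often described as sitting between the two.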

**How BatchNorm works**

![](./img/batchnorm.png)
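For reference, the usual textbook form of the batch-normalization transform for one feature over a mini-batch of $m$ values is given below. The symbols are assumptions chosen for this note and may differ from those used in the figure.

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

Here $\epsilon$ corresponds to `eps`, and $\gamma$, $\beta$ correspond to the layer's `weight` and `bias`.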

### Effect of the BN layer

In [9]:
class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):

        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            # x = bn(x)
            x = torch.relu(x)

            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

            # Note: the label reads "mean", but the value printed is the standard deviation.
            print("layers:{}, mean:{}".format(i, x.std().item()))

        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):

                # method 1
                # nn.init.normal_(m.weight.data, std=1)    # normal: mean=0, std=1

                # method 2 kaiming
                nn.init.kaiming_normal_(m.weight.data)

In [10]:
'''
Without normalization
'''

neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize() # without BN, a suitable initialization is required

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

layers:0, mean:0.8383101224899292
layers:1, mean:0.810845136642456
layers:2, mean:0.8057557940483093
layers:3, mean:0.8525007963180542
layers:4, mean:0.838342010974884
layers:5, mean:0.7958686947822571
layers:6, mean:0.8045035004615784
layers:7, mean:0.8100224137306213
layers:8, mean:0.7280009984970093
layers:9, mean:0.7112995982170105
layers:10, mean:0.7215366363525391
layers:11, mean:0.769201397895813
layers:12, mean:0.7079854011535645
layers:13, mean:0.7077028155326843
layers:14, mean:0.6731024384498596
layers:15, mean:0.6838957071304321
layers:16, mean:0.659018874168396
layers:17, mean:0.6689760684967041
layers:18, mean:0.7156397104263306
layers:19, mean:0.7458699941635132
layers:20, mean:0.7870975136756897
layers:21, mean:0.7753603458404541
layers:22, mean:0.7260276079177856
layers:23, mean:0.7045286893844604
layers:24, mean:0.5795301198959351
layers:25, mean:0.6172888278961182
layers:26, mean:0.6577270030975342
layers:27, mean:0.5839077830314636
layers:28, mean:0.5290199518203735

In [12]:
class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):

        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            x = bn(x)  # BN layer
            x = torch.relu(x)

            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

            print("layers:{}, std:{}".format(i, x.std().item()))
        return x

In [13]:
'''
Result with BN layers added
'''
neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums) # with BN layers and no custom initialization, the activation scale still stays stable

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

layers:0, std:0.5878563523292542
layers:1, std:0.5806277990341187
layers:2, std:0.5738962292671204
layers:3, std:0.575210690498352
layers:4, std:0.5790040493011475
layers:5, std:0.5819441080093384
layers:6, std:0.5825987458229065
layers:7, std:0.582497775554657
layers:8, std:0.5798506140708923
layers:9, std:0.5772049427032471
layers:10, std:0.579689085483551
layers:11, std:0.5835692286491394
layers:12, std:0.5786635875701904
layers:13, std:0.5837998986244202
layers:14, std:0.5886130332946777
layers:15, std:0.5854383111000061
layers:16, std:0.579085648059845
layers:17, std:0.5850254893302917
layers:18, std:0.5762629508972168
layers:19, std:0.5866146683692932
layers:20, std:0.5781558752059937
layers:21, std:0.574286937713623
layers:22, std:0.5843595862388611
layers:23, std:0.571588397026062
layers:24, std:0.5902033448219299
layers:25, std:0.5834580659866333
layers:26, std:0.575657069683075
layers:27, std:0.5830262303352356
layers:28, std:0.5779280662536621
layers:29, std:0.57693606615066

### Using the BN layer

**Main attributes of the BN layer**

    - running_mean
        running estimate of the mean
    
    - running_var 
        running estimate of the variance
    
    - weight 
        gamma        
    
    - bias
        beta

In [5]:
# ======================================== attributes of the nn.BatchNorm layer

x = torch.rand(100, 16, 28 * 28)

layer = nn.BatchNorm1d(16)

out = layer(x)
print(out.shape)
print(layer.running_mean)
print(layer.running_var)

torch.Size([100, 16, 784])
tensor([0.0501, 0.0496, 0.0500, 0.0502, 0.0499, 0.0498, 0.0500, 0.0501, 0.0499,
        0.0500, 0.0500, 0.0500, 0.0502, 0.0498, 0.0500, 0.0500])
tensor([0.9083, 0.9084, 0.9084, 0.9084, 0.9084, 0.9083, 0.9083, 0.9083, 0.9083,
        0.9084, 0.9084, 0.9083, 0.9083, 0.9083, 0.9083, 0.9083])


    torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

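- Function

    Applies Batch Normalization to a 2D or 3D mini-batch input, computing the mean and standard deviation of each feature (channel) over the mini-batch.
    
    It makes training less sensitive to weight initialization and has a mild regularizing effect, so it can reduce (though not strictly replace) the need for careful initialization and weight decay.

- num_features: 

    Number of features C in the expected input of size (N, C) or (N, C, L)

- eps: 

    Value added to the denominator for numerical stability (so the denominator can never be zero or close to zero). Default: 1e-5.

- momentum: 

    Momentum used for the running mean and running variance. Default: 0.1.
    
    The running statistics are an exponentially weighted average of the batch statistics:
    
        running_mean=(1-momentum)*pre_running_mean+momentum*mean_t
        
        running_var=(1-momentum)*pre_running_var+momentum*var_t

- affine: 

    Boolean; when True, the layer has learnable affine parameters (weight = gamma, bias = beta).

- track_running_stats
    
    Boolean; when True (the default), the layer keeps running_mean and running_var, updates them in training mode, and uses them in eval mode. When False, batch statistics are used in both modes. It does not itself select training or test mode; that is done with train() / eval().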
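To make the interplay of `momentum`, the running statistics, and training versus eval mode concrete, here is a minimal single-feature sketch in plain Python. `TinyBatchNorm` is a hypothetical class written for this note, not PyTorch code. It assumes the PyTorch convention that the batch is normalized with the biased variance while the running estimate is updated with the unbiased variance, and it omits the affine parameters.

```python
from statistics import fmean, pvariance, variance

class TinyBatchNorm:
    """Single-feature batch norm without affine parameters (illustrative only)."""

    def __init__(self, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.running_mean = 0.0   # same initial values as nn.BatchNorm1d
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            mean = fmean(batch)
            var = pvariance(batch, mean)  # biased variance normalizes the batch
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            # assumption: the unbiased variance feeds the running estimate
            self.running_var = (1 - m) * self.running_var + m * variance(batch, mean)
        else:
            # eval mode: use the stored running statistics, do not update them
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]

bn = TinyBatchNorm(momentum=0.3)
train_out = bn([1.0, 2.0, 3.0])   # training step: batch mean 2, running_mean becomes 0.6
bn.training = False
eval_out = bn([0.6])              # eval step: normalized with the running statistics
print(bn.running_mean, bn.running_var, eval_out)
```

The eval call leaves `running_mean` and `running_var` untouched, which is the behaviour one gets from a real layer after calling `eval()`.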

**How BatchNorm1d computes**

![](./bn1.jpg)

In [14]:
# ======================================== nn.BatchNorm1d

'''
Used with fully connected layers
'''
batch_size = 3
num_features = 5 # 5 features
momentum = 0.3

features_shape = (1,)  # one value per feature; the trailing comma makes this a tuple

feature_map = torch.ones(features_shape)                                                    # 1D
feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)         # 2D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)             # 3D
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
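    '''
    Note: BN computes one mean and one variance per feature, pooled over the batch
    '''
    outputs = bn(feature_maps_bs) # input is 3*5*1; the statistics have one entry per feature (5)

    print("\niteration:{}, running mean: {} ".format(i, bn.running_mean))
    print("iteration:{}, running var:{} ".format(i, bn.running_var))

    mean_t, var_t = 2, 0  # batch statistics of the second feature (every value is 2)
    
    # manual computation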
    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    running_var = (1 - momentum) * running_var + momentum * var_t

    print("iteration:{}, running mean of the second feature: {} ".format(i, running_mean))
    print("iteration:{}, running var of the second feature:{}".format(i, running_var))

input data:
tensor([[[1.],
         [2.],
         [3.],
         [4.],
         [5.]],

        [[1.],
         [2.],
         [3.],
         [4.],
         [5.]],

        [[1.],
         [2.],
         [3.],
         [4.],
         [5.]]]) shape is torch.Size([3, 5, 1])

iteration:0, running mean: tensor([0.3000, 0.6000, 0.9000, 1.2000, 1.5000]) 
iteration:0, running var:tensor([0.7000, 0.7000, 0.7000, 0.7000, 0.7000]) 
iteration:0, running mean of the second feature: 0.6 
iteration:0, running var of the second feature:0.7

iteration:1, running mean: tensor([0.5100, 1.0200, 1.5300, 2.0400, 2.5500]) 
iteration:1, running var:tensor([0.4900, 0.4900, 0.4900, 0.4900, 0.4900]) 
iteration:1, running mean of the second feature: 1.02 
iteration:1, running var of the second feature:0.48999999999999994


    torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True)

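- num_features: 

    Number of channels C of the expected input, which has size 'batch_size x num_features x height x width'

- eps: 

    Value added to the denominator for numerical stability (so the denominator can never be zero or close to zero). Default: 1e-5.

- momentum: 

    Momentum used for the running mean and running variance. Default: 0.1.

- affine: 

    Boolean; when True, the layer has learnable affine parameters.

- Shape: 

     Input: (N, C, H, W) - Output: (N, C, H, W) (same shape as the input)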

**How BatchNorm2d computes**

![](./bn2d.jpg)

In [17]:
# ======================================== nn.BatchNorm2d
batch_size = 3
num_features = 3
momentum = 0.3
    
features_shape = (2, 2)

feature_map = torch.ones(features_shape)                                                    # 2D
feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)         # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)             # 4D

print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm2d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    outputs = bn(feature_maps_bs)
    
    print("\niteration:{}, running mean: {} ".format(i, bn.running_mean))
    
    print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
    print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))

    print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
    print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))

input data:
tensor([[[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]],


        [[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]],


        [[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]]]) shape is torch.Size([3, 3, 2, 2])

iteration:0, running mean: tensor([0.3000, 0.6000, 0.9000]) 

iter:0, running_mean.shape: torch.Size([3])
iter:0, running_var.shape: torch.Size([3])
iter:0, weight.shape: torch.Size([3])
iter:0, bias.shape: torch.Size([3])

iteration:1, running mean: tensor([0.5100, 1.0200, 1.5300]) 

iter:1, running_mean.shape: torch.Size([3])
iter:1, running_var.shape: torch.Size([3])
iter:1, weight.shape: torch.Size([3])
iter:1, bias.shape: torch.Size([3])


In [8]:
vars(layer)

{'_backend': <torch.nn.backends.thnn.THNNFunctionBackend at 0x2a45377a160>,
 '_parameters': OrderedDict([('weight', Parameter containing:
               tensor([0.3507, 0.0740, 0.1083, 0.2431, 0.4291, 0.2763, 0.9373, 0.5150, 0.5816,
                       0.0219, 0.4399, 0.6164, 0.8034, 0.0686, 0.5728, 0.1637],
                      requires_grad=True)), ('bias', Parameter containing:
               tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
                      requires_grad=True))]),
 '_buffers': OrderedDict([('running_mean',
               tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])),
              ('running_var',
               tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])),
              ('num_batches_tracked', tensor(0))]),
 '_backward_hooks': OrderedDict(),
 '_forward_hooks': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_load_state_dict_pre_hooks'

    torch.nn.BatchNorm3d(num_features, eps=1e-05, momentum=0.1, affine=True)

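- num_features: 

    Number of channels C of the expected input, which has size 'batch_size x num_features x depth x height x width'

- eps: 

    Value added to the denominator for numerical stability (so the denominator can never be zero or close to zero). Default: 1e-5.

- momentum: 

    Momentum used for the running mean and running variance. Default: 0.1.

- affine: 

    Boolean; when True, the layer has learnable affine parameters.

- Shape: 
    
    Input: (N, C, D, H, W)
    
    Output: (N, C, D, H, W) (same shape as the input)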

In [None]:
# ======================================== nn.BatchNorm3d

batch_size = 3
num_features = 4
momentum = 0.3

features_shape = (2, 2, 3)

feature = torch.ones(features_shape)                                                # 3D
feature_map = torch.stack([feature * (i + 1) for i in range(num_features)], dim=0)  # 4D
feature_maps = torch.stack([feature_map for i in range(batch_size)], dim=0)         # 5D

print("input data:\n{} shape is {}".format(feature_maps, feature_maps.shape))

bn = nn.BatchNorm3d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    outputs = bn(feature_maps)

    print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
    print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))

    print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
    print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))

## Other types of normalization

### **Layer Norm**

![](./ln.jpg)

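    - addresses the fact that BN is a poor fit for networks such as RNNs, where the feature size differs from layer to layer

    - computes the mean and variance over the feature dimensions of each sample
    
    - no running_mean or running_var
    
    - gamma and beta are element-wise

    nn.LayerNorm(normalized_shape, eps, elementwise_affine)
    
- normalized_shape
    
    shape of the trailing dimensions to normalize over; the mean and variance are computed over all elements of this shape

- eps
    
    term added to the denominator for numerical stability
    
- elementwise_affine

    whether to apply the learnable gamma and beta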
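As a quick check of what normalizing over `normalized_shape` means, the sketch below standardizes one flattened sample in plain Python. `layer_norm` is a helper written for this note, not the PyTorch implementation; it assumes the biased variance and no affine transform.

```python
from statistics import fmean, pvariance

def layer_norm(values, eps=1e-5):
    """Standardize one sample's flattened features to zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mean)  # biased (population) variance
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# One sample of shape (3, 2, 2) whose channels are filled with 1, 2, 3, flattened.
sample = [1.0] * 4 + [2.0] * 4 + [3.0] * 4
out = layer_norm(sample)
print([round(v, 4) for v in (out[0], out[4], out[8])])  # [-1.2247, 0.0, 1.2247]
```

The mean over the 12 elements is 2 and the variance is 2/3, so the three channels map to roughly -1.2247, 0, and 1.2247.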

In [22]:
# ======================================== nn.layer norm
batch_size = 8
num_features = 3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)  # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

# feature_maps_bs shape is [8, 3, 2, 2],  B * C * H * W
ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=True)

output = ln(feature_maps_bs)

print("Layer Normalization")
print(ln.weight.shape) # weight is gamma
print(feature_maps_bs[0, ...])
print(output[0, ...]) # mean and variance are computed over the 12 elements of each sample

Layer Normalization
torch.Size([3, 2, 2])
tensor([[[1., 1.],
         [1., 1.]],

        [[2., 2.],
         [2., 2.]],

        [[3., 3.],
         [3., 3.]]])
tensor([[[-1.2247, -1.2247],
         [-1.2247, -1.2247]],

        [[ 0.0000,  0.0000],
         [ 0.0000,  0.0000]],

        [[ 1.2247,  1.2247],
         [ 1.2247,  1.2247]]], grad_fn=<SelectBackward>)


In [23]:
(1 + 2 + 3) * 4 / (3 * 2 * 2)

2.0

In [25]:
# ======================================== nn.layer norm  elementwise_affine=False
batch_size = 8
num_features = 3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)  # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

# feature_maps_bs shape is [8, 3, 2, 2],  B * C * H * W
ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=False) # no affine transform

output = ln(feature_maps_bs)

print("Layer Normalization")
print(ln.weight) # there is no weight
print(feature_maps_bs[0, ...])
print(output[0, ...]) # mean and variance are computed over the 12 elements of each sample

Layer Normalization
None
tensor([[[1., 1.],
         [1., 1.]],

        [[2., 2.],
         [2., 2.]],

        [[3., 3.],
         [3., 3.]]])
tensor([[[-1.2247, -1.2247],
         [-1.2247, -1.2247]],

        [[ 0.0000,  0.0000],
         [ 0.0000,  0.0000]],

        [[ 1.2247,  1.2247],
         [ 1.2247,  1.2247]]])


In [26]:
# ======================================== nn.layer norm  custom normalized_shape
batch_size = 8
num_features = 6

features_shape = (3, 4)

feature_map = torch.ones(features_shape)  # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

# feature_maps_bs shape is [8, 6, 3, 4],  B * C * H * W
ln = nn.LayerNorm([3, 4])

output = ln(feature_maps_bs)

print("Layer Normalization")
print(ln.weight.shape)
print(feature_maps_bs[0, ...])
print(output[0, ...]) # each (3, 4) slice is standardized with its own mean; the slices are constant, so the output is all zeros

Layer Normalization
torch.Size([3, 4])
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[2., 2., 2., 2.],
         [2., 2., 2., 2.],
         [2., 2., 2., 2.]],

        [[3., 3., 3., 3.],
         [3., 3., 3., 3.],
         [3., 3., 3., 3.]],

        [[4., 4., 4., 4.],
         [4., 4., 4., 4.],
         [4., 4., 4., 4.]],

        [[5., 5., 5., 5.],
         [5., 5., 5., 5.],
         [5., 5., 5., 5.]],

        [[6., 6., 6., 6.],
         [6., 6., 6., 6.],
         [6., 6., 6., 6.]]])
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0.,

### **Instance Norm**


![](./In.jpg)


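**Characteristics**

    - BN is a poor fit for generative models, where images differ widely in style; IN instead normalizes every channel of every sample on its own
    
    

    nn.InstanceNorm2d(num_features,eps,momentum,affine,track_running_stats)

- num_features

    number of features (channels)

- eps

    term added to the denominator for numerical stability
    
- momentum
    
    momentum for the running statistics

- affine

    whether to apply a learnable affine transform
    
- track_running_stats

    whether to keep running statistics for use in eval mode (it does not select training or test mode)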
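Because IN pools only over H×W within one channel of one sample, a channel whose values are all equal is mapped to zeros. The plain-Python sketch below shows this; `instance_norm` is a helper written for this note and assumes the biased variance and no affine transform.

```python
from statistics import fmean, pvariance

def instance_norm(sample, eps=1e-5):
    """sample: list of channels, each a flat list of H*W values."""
    out = []
    for channel in sample:
        mean = fmean(channel)          # statistics of this channel only
        var = pvariance(channel, mean)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in channel])
    return out

constant = instance_norm([[1.0] * 4, [2.0] * 4])   # every channel is constant
varied = instance_norm([[0.0, 2.0, 0.0, 2.0]])     # values differ within the channel
print(constant)
print(varied)
```

A constant channel has zero deviation from its own mean, so only the second input produces non-zero values (close to -1 and 1).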

In [4]:
# ======================================== nn.instance norm 2d
batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)    # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

print("Instance Normalization")
print("input data:\n{} shape is {}".format(feature_maps_bs[0], feature_maps_bs.shape))

instance_n = nn.InstanceNorm2d(num_features=num_features, momentum=momentum)

outputs = instance_n(feature_maps_bs)

print(outputs[0])
# print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
# print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))
# print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
# print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))

Instance Normalization
input data:
tensor([[[1., 1.],
         [1., 1.]],

        [[2., 2.],
         [2., 2.]],

        [[3., 3.],
         [3., 3.]]]) shape is torch.Size([3, 3, 2, 2])
tensor([[[0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.]]])


### **Group_Norm**

![](./gn.jpg)

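Characteristics:

    - addresses the inaccurate mean and variance estimates that BN gets when the batch is too small
    
    - suited to large models trained with small batches
    
    - splits the feature channels into groups and computes the mean and variance within each group
    
    - no running_mean or running_var
    
    - gamma and beta are per channel

    nn.GroupNorm(num_groups,num_channels,eps,affine)

- num_groups

    number of groups

- num_channels

    number of features (channels)
    
- eps

    term added to the denominator for numerical stability
    
- affine

    whether to apply a learnable affine transform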
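The grouping rule can be sketched in plain Python for a single sample. `group_norm` is a helper written for this note, not the PyTorch implementation; it assumes the biased variance, no affine transform, and channels stored as flat lists.

```python
from statistics import fmean, pvariance

def group_norm(sample, num_groups, eps=1e-5):
    """sample: list of C channels, each a flat list of H*W values."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        pooled = [v for channel in group for v in channel]  # all values of the group
        mean = fmean(pooled)
        var = pvariance(pooled, mean)
        out.extend([[(v - mean) / (var + eps) ** 0.5 for v in channel] for channel in group])
    return out

# Four 2x2 channels filled with 1, 2, 3, 4, split into two groups of two channels.
sample = [[float(c + 1)] * 4 for c in range(4)]
out = group_norm(sample, num_groups=2)
print([round(channel[0], 4) for channel in out])
```

Channels 1 and 2 share a mean of 1.5 and channels 3 and 4 share a mean of 3.5, so each group maps to values close to -1 and 1.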

In [7]:
# ======================================== nn.group norm

batch_size = 2
num_features = 4
num_groups = 2   # must divide num_features; 3 raises "Expected number of channels in input to be divisible by num_groups"

features_shape = (2, 2)

feature_map = torch.ones(features_shape)    # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps * (i + 1) for i in range(batch_size)], dim=0)  # 4D

gn = nn.GroupNorm(num_groups, num_features)
outputs = gn(feature_maps_bs)

print("Group Normalization")
print(feature_maps_bs)
print(gn.weight.shape)
print(outputs[0])

Group Normalization
tensor([[[[1., 1.],
          [1., 1.]],

         [[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]],

         [[4., 4.],
          [4., 4.]]],


        [[[2., 2.],
          [2., 2.]],

         [[4., 4.],
          [4., 4.]],

         [[6., 6.],
          [6., 6.]],

         [[8., 8.],
          [8., 8.]]]])
torch.Size([4])
tensor([[[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]],

        [[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]]], grad_fn=<SelectBackward>)
