In [1]:
import torch
import torch.nn as nn

In [2]:
help(nn)

Help on package torch.nn in torch:

NAME
    torch.nn

PACKAGE CONTENTS
    _reduction
    backends (package)
    common_types
    cpp
    functional
    grad
    init
    intrinsic (package)
    modules (package)
    parallel (package)
    parameter
    qat (package)
    quantized (package)
    utils (package)

FILE
    d:\programmefiles\python\miniconda3\lib\site-packages\torch\nn\__init__.py




In [5]:
for k, v in sorted(nn.Module.__dict__.items()):
    if callable(v):
        print(k)

__call__
__delattr__
__dir__
__getattr__
__init__
__repr__
__setattr__
__setstate__
_apply
_call_impl
_get_name
_load_from_state_dict
_named_members
_register_load_state_dict_pre_hook
_register_state_dict_hook
_replicate_for_data_parallel
_save_to_state_dict
_slow_forward
add_module
apply
bfloat16
buffers
children
cpu
cuda
double
eval
extra_repr
float
forward
half
load_state_dict
modules
named_buffers
named_children
named_modules
named_parameters
parameters
register_backward_hook
register_buffer
register_forward_hook
register_forward_pre_hook
register_parameter
requires_grad_
share_memory
state_dict
to
train
type
zero_grad


# nn.Module()
`nn.Module()`

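The base class for all neural network modules; every model you build should subclass it. A module can contain other modules, nesting them into a tree: assign a submodule (for example an `nn.Conv2d`) as a regular attribute. Submodules assigned this way are registered, so their parameters are also converted when you call `to` and similar methods. `__init__` initializes the internal module state shared by `nn.Module` and `ScriptModule`.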
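
As a rough illustration of submodule registration, here is a minimal sketch. The class name `TinyNet` and the layer sizes are made up for this example; the point is only that modules assigned as attributes show up in `named_children()` / `named_parameters()` and are converted together by `to`:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical example class
    def __init__(self):
        super().__init__()
        # Assigning a module as an attribute registers it as a submodule.
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        x = self.conv(x).mean(dim=(2, 3))  # global average pool -> (N, 8)
        return self.fc(x)

net = TinyNet()
child_names = [name for name, _ in net.named_children()]
param_names = [name for name, _ in net.named_parameters()]
print(child_names)  # ['conv', 'fc']
print(param_names)  # ['conv.weight', 'conv.bias', 'fc.weight', 'fc.bias']

# `to` converts the parameters of every registered submodule.
net.to(torch.float64)
dtypes = {p.dtype for p in net.parameters()}
out = net(torch.randn(4, 3, 5, 5, dtype=torch.float64))
print(dtypes, out.shape)
```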

**File**: torch\nn\modules\module.py

**Type**:           type

**methods**

    __call__
    __dir__
    __repr__
    __setstate__
    _apply
    _call_impl
    _get_name
    _load_from_state_dict
    _named_members
    _register_load_state_dict_pre_hook
    _register_state_dict_hook
    _replicate_for_data_parallel
    _save_to_state_dict
    _slow_forward
    add_module
    apply
    bfloat16
    buffers
    children
    cpu
    cuda
    double
    eval
    extra_repr
    float
    forward
    half
    load_state_dict
    modules
    named_buffers
    named_children
    named_modules
    named_parameters
    parameters
    register_backward_hook
    register_buffer
    register_forward_hook
    register_forward_pre_hook
    register_parameter
    requires_grad_
    share_memory
    state_dict
    to
    train
    type
    zero_grad

## nn.Module.train()

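`<module>.train(mode=True)`


Sets the module to training mode when `mode` is True and to evaluation mode when it is False. This only affects certain modules (for example `Dropout` and `BatchNorm`); see the documentation of each module for its behavior in training/evaluation mode.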

**Type**:      function

## nn.Module.eval()

`<module_name>.eval()`

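Sets the module to evaluation mode. This only affects certain modules; see the documentation of each module for its behavior in training/evaluation mode. The method is equivalent to `<module_name>.train(False)`.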

**Type**:      function
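
To make the effect of the two modes concrete, the sketch below uses `nn.Dropout` as one example of a mode-dependent module (the tensor size and `p=0.5` are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()        # same as drop.train(True)
y_train = drop(x)   # entries are either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()         # same as drop.train(False)
y_eval = drop(x)    # dropout is the identity in evaluation mode

print(drop.training)                          # False
print(torch.equal(y_eval, x))                 # True
print(set(y_train.tolist()) <= {0.0, 2.0})    # True
```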

## nn.Module.state_dict()

`<module>.state_dict(destination=None, prefix='', keep_vars=False)`

**Docstring:**

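Returns a dictionary containing the whole state of the module. It includes both parameters and persistent buffers (e.g. running averages); the keys are the names of the corresponding parameters and buffers.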

**Type**: function

### Example
```python
module.state_dict().keys()  # => ['bias', 'weight']
```

## nn.Module.load_state_dict
```python
<module>.load_state_dict(
    state_dict: Dict[str, torch.Tensor],
    strict=True,
)
```
**Docstring**

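Copies parameters and buffers from `state_dict` into this module and its descendants, and returns a ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: ``missing_keys`` is a list of strings containing the missing keys, and ``unexpected_keys`` is a list of strings containing the unexpected keys. If `strict` is `True`, the keys of `state_dict` must exactly match the keys returned by this module's `torch.nn.Module.state_dict` function.

**Args**

- state_dict: a dictionary containing parameters and persistent buffers

- strict: when True, strictly enforces that the keys in `state_dict` match the keys returned by this module's `state_dict` function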


**Type**: function
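
A minimal round-trip sketch of `state_dict` and `load_state_dict` follows. The two `nn.Linear` layers and the key named `extra` are chosen only for illustration:

```python
import torch
import torch.nn as nn

src = nn.Linear(4, 2)
dst = nn.Linear(4, 2)

# Copy every parameter and buffer from `src` into `dst`.
result = dst.load_state_dict(src.state_dict())
print(result.missing_keys, result.unexpected_keys)  # [] []
same = all(torch.equal(a, b) for a, b in zip(src.parameters(), dst.parameters()))

# With strict=False, mismatched keys are reported instead of raising an error.
partial = {"weight": torch.zeros(2, 4), "extra": torch.zeros(1)}
report = dst.load_state_dict(partial, strict=False)
print(report.missing_keys)     # ['bias']
print(report.unexpected_keys)  # ['extra']
```

With the default `strict=True`, the second call would raise a `RuntimeError` instead of returning the report.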

## nn.Module.named_children()
`<module>.named_children(self) -> Iterator[Tuple[str, ForwardRef('Module')]]`

**Docstring**

Returns an iterator over the module's immediate children, yielding the name of each child together with the child itself as `(string, Module)`

**File**:   \torch\nn\modules\module.py

**Type**:      function

### Example

In [None]:
class TestModel(nn.Module):
    def __init__(self):
        super(TestModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(64, 128, 3)

model = TestModel()
for named_children in model.named_children():
    print(named_children)

## nn.Module.children()
`nn.Module.children(self) -> Iterator[ForwardRef('Module')]`

Returns an iterator over the immediate child modules

**Type**:      function

In [39]:
def has_children(module):
    try:
        next(module.children())
        print("Iterator `module.children()`:", module.children())
        return True
    except StopIteration:
        print("{} is not an Iterator".format(module))
        return False

class TestModel_Parent(nn.Module):
    def __init__(self):
        super(TestModel_Parent, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(64, 128, 3)
        self.model_chilren = TestModel()

model_parent = TestModel_Parent()
for name, module in model_parent.named_children():
    if has_children(module):
        pass

Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1)) is not an Iterator
ReLU() is not an Iterator
Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1)) is not an Iterator
Iterator `module.children()`: <generator object Module.children at 0x0000025673457648>


## nn.Module.apply()
`<model>.apply(fn: Callable[[ForwardRef('Module')], NoneType]) -> ~T`

**Docstring**

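Applies `fn` recursively to every submodule (as returned by ``children()``) and to the module itself, then returns `self`. Typical use includes initializing the parameters of a model; see also `nn-init-doc`.

**Args**

- fn: the function to apply to each module; it takes a module as input and returns None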

### Example

In [None]:
@torch.no_grad()
def init_weights(m):
    print(m)
    if type(m) == nn.Linear:
        m.weight.fill_(1.0)
        print(m.weight)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

## nn.Module.register_forward_hook()
```python
model.register_forward_hook(hook: Callable[..., NoneType]) 
-> torch.utils.hooks.RemovableHandle
```

**Docstring**

Registers a forward hook on the module. The hook is called every time after `forward` has computed an output.

The hook should have the following signature:

```python
hook(module, input, output) -> None or modified output
```

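Here `input` contains only the positional arguments given to `module`; keyword arguments are passed only to ``forward`` and not to the hook. The hook can modify `output`. It can also modify `input` in place, but that has no effect on the forward computation because the hook is called after `forward`. The function returns a handle of type `torch.utils.hooks.RemovableHandle`; calling ``handle.remove()`` removes the added hook.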

**Type**:      function
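
The hook mechanism described above can be sketched as follows. The hook name `save_output` and the `captured` list are hypothetical names introduced for this example:

```python
import torch
import torch.nn as nn

captured = []

def save_output(module, inputs, output):
    # `inputs` is a tuple of the positional arguments passed to the module.
    captured.append(output.detach().clone())
    # Returning None leaves the output unchanged.

layer = nn.Linear(3, 2)
handle = layer.register_forward_hook(save_output)

y = layer(torch.randn(5, 3))  # the hook fires once per forward call
handle.remove()               # detach the hook again
layer(torch.randn(5, 3))      # this call is no longer recorded

print(len(captured), captured[0].shape)
```

Keeping the returned handle around is what makes the hook removable; a hook that returns a tensor instead of `None` would replace the module's output.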

## nn.Module.modules()
`nn.Module.modules(self) -> Iterator[ForwardRef('Module')]`

**Docstring**:

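Returns an iterator over all modules in the network, i.e. the module itself and all of its descendants. Note that duplicate modules are returned only once, as the example below shows.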

**File**:    torch\nn\modules\module.py

**Type**:      function

### Example

In [11]:
linear = nn.Linear(2, 2)
net = nn.Sequential(
    linear,
    linear,
    nn.ReLU(inplace=True),
    linear
)
for module in net.modules():
    print(module, sep="\n", end="\n\n")

Sequential(
  (0): Linear(in_features=2, out_features=2, bias=True)
  (1): Linear(in_features=2, out_features=2, bias=True)
  (2): ReLU(inplace=True)
  (3): Linear(in_features=2, out_features=2, bias=True)
)

Linear(in_features=2, out_features=2, bias=True)

ReLU(inplace=True)



## nn.Module.named_modules()
```python
model.named_modules(
    memo: Union[Set[ForwardRef('Module')], NoneType] = None,
    prefix: str = '',
)
```
**Docstring**

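Returns an iterator over all modules in the network, i.e. the module itself and all of its descendants, yielding the name of each module together with the module itself as `(name, Module)`. Note that duplicate modules are returned only once, as the example below shows.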

**File**:  \torch\nn\modules\module.py

**Type**:      function

### Example

In [11]:
linear = nn.Linear(2, 2)
net = nn.Sequential(
    linear,
    linear,
    nn.ReLU(inplace=True),
    linear
)
print(dict(net.named_modules())["0"])

# for name, m in net.named_modules():
#     print(name+":", m, sep="\n", end="\n\n")

Linear(in_features=2, out_features=2, bias=True)



## nn.Sequential()
Init signature: nn.Sequential(*args: Any)
Docstring:     
A sequential container.
Modules will be added to it in the order they are passed in the constructor.
Alternatively, an ordered dict of modules can also be passed in.

To make it easier to understand, here is a small example::

    # Example of using Sequential
    model = nn.Sequential(
              nn.Conv2d(1,20,5),
              nn.ReLU(),
              nn.Conv2d(20,64,5),
              nn.ReLU()
            )

    # Example of using Sequential with OrderedDict
    model = nn.Sequential(OrderedDict([
              ('conv1', nn.Conv2d(1,20,5)),
              ('relu1', nn.ReLU()),
              ('conv2', nn.Conv2d(20,64,5)),
              ('relu2', nn.ReLU())
            ]))
Init docstring: Initializes internal Module state, shared by both nn.Module and ScriptModule.
File:           d:\programfiles\miniconda3\lib\site-packages\torch\nn\modules\container.py
Type:           type
Subclasses:     ConvReLU1d, ConvReLU2d, ConvReLU3d, LinearReLU, ConvBn1d, ConvBn2d, ConvBnReLU1d, ConvBnReLU2d, ConvBn3d, ConvBnReLU3d, ...

## nn.ModuleDict()
Init signature:
nn.ModuleDict(
    modules: Union[Mapping[str, torch.nn.modules.module.Module], NoneType] = None,
) -> None
Docstring:     
Holds submodules in a dictionary.

:class:`~torch.nn.ModuleDict` can be indexed like a regular Python dictionary,
but modules it contains are properly registered, and will be visible by all
:class:`~torch.nn.Module` methods.

:class:`~torch.nn.ModuleDict` is an **ordered** dictionary that respects

* the order of insertion, and

* in :meth:`~torch.nn.ModuleDict.update`, the order of the merged 
  ``OrderedDict``, ``dict`` (started from Python 3.6) or another
  :class:`~torch.nn.ModuleDict` (the argument to 
  :meth:`~torch.nn.ModuleDict.update`).

Note that :meth:`~torch.nn.ModuleDict.update` with other unordered mapping
types (e.g., Python's plain ``dict`` before Python version 3.6) does not
preserve the order of the merged mapping.

Arguments:
    modules (iterable, optional): a mapping (dictionary) of (string: module)
        or an iterable of key-value pairs of type (string, module)

Example::

    class MyModule(nn.Module):
        def __init__(self):
            super(MyModule, self).__init__()
            self.choices = nn.ModuleDict({
                    'conv': nn.Conv2d(10, 10, 3),
                    'pool': nn.MaxPool2d(3)
            })
            self.activations = nn.ModuleDict([
                    ['lrelu', nn.LeakyReLU()],
                    ['prelu', nn.PReLU()]
            ])

        def forward(self, x, choice, act):
            x = self.choices[choice](x)
            x = self.activations[act](x)
            return x
Init docstring: Initializes internal Module state, shared by both nn.Module and ScriptModule.
File:           d:\programfiles\miniconda3\lib\site-packages\torch\nn\modules\container.py
Type:           type
Subclasses: 

## nn.ModuleList()
Init signature:
nn.ModuleList(
    modules: Union[Iterable[torch.nn.modules.module.Module], NoneType] = None,
) -> None
Docstring:     
Holds submodules in a list.

:class:`~torch.nn.ModuleList` can be indexed like a regular Python list, but
modules it contains are properly registered, and will be visible by all
:class:`~torch.nn.Module` methods.

Arguments:
    modules (iterable, optional): an iterable of modules to add

Example::

    class MyModule(nn.Module):
        def __init__(self):
            super(MyModule, self).__init__()
            self.linears = nn.ModuleList([nn.Linear(10, 10) for i in range(10)])

        def forward(self, x):
            # ModuleList can act as an iterable, or be indexed using ints
            for i, l in enumerate(self.linears):
                x = self.linears[i // 2](x) + l(x)
            return x
Init docstring: Initializes internal Module state, shared by both nn.Module and ScriptModule.
File:           d:\programfiles\miniconda3\lib\site-packages\torch\nn\modules\container.py
Type:           type
Subclasses: 
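
To illustrate why registration matters, here is a minimal sketch comparing a plain Python list with `nn.ModuleList`. Both classes are hypothetical and exist only for this comparison:

```python
import torch.nn as nn

class WithList(nn.Module):        # hypothetical: layers kept in a plain list
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(nn.Module):  # hypothetical: layers kept in nn.ModuleList
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

# Layers in a plain list are invisible to `parameters()`.
n_plain = len(list(WithList().parameters()))
# Layers in a ModuleList contribute a weight and a bias each.
n_registered = len(list(WithModuleList().parameters()))
print(n_plain, n_registered)  # 0 6
```

An optimizer built from `WithList().parameters()` would therefore have nothing to update, which is the usual symptom of forgetting the container.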

In [None]:
nn.ParameterDict()


In [None]:
nn.DataParallel

# nn.DataParallel()
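`nn.DataParallel(module, device_ids=None, output_device=None, dim=0)`

**Docstring**: Implements data parallelism at the module level. This container parallelizes the application of `module` by splitting the input across the specified devices along the batch dimension, so the batch size should be larger than the number of GPUs used; other objects passed as input are copied once per device. Before running a `DataParallel` module, the parallelized `module` must have its parameters and buffers on ``device_ids[0]``.

In each forward pass, `module` is replicated on every device, so any updates made to the running module inside `forward` are lost. However, `nn.DataParallel` guarantees that the replica on ``device_ids[0]`` shares storage for its parameters and buffers with the base parallelized `module`, so in-place updates to the parameters or buffers on ``device_ids[0]`` are recorded; `nn.BatchNorm2d` and `torch.nn.utils.spectral_norm`, for example, rely on this behavior to update their buffers. When `module` returns a scalar in `forward`, this wrapper returns a vector whose length equals the number of devices used, containing the result from each device.

In the backward pass, the gradients from each replica are summed into the original module. Forward and backward hooks defined on `module` and its submodules are invoked ``len(device_ids)`` times, each time with inputs located on a particular device. Hooks are only guaranteed to execute in the correct order with respect to operations on the corresponding device: for example, there is no guarantee that hooks registered via `Module.register_forward_pre_hook` run before all of the `Module.forward` calls, but each such hook is guaranteed to run before the `Module.forward` call on its own device.

For multi-GPU training, `nn.parallel.DistributedDataParallel` is recommended over this class, even on a single node; see `cuda-nn-ddp-instead` and `ddp` for more information.

There is a subtlety in using the ``pack sequence -> recurrent network -> unpack sequence`` pattern inside an `nn.Module` wrapped in `DataParallel`; see the `pack-rnn-unpack-with-data-parallelism` section of the FAQ.


**Args**:
- module: an `nn.Module`; the model to parallelize
- device_ids: a list of ints or `torch.device`s; the CUDA devices to use, defaulting to all devices
- output_device: an int or `torch.device`; the device on which the output is placed, defaulting to `device_ids[0]`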

**File**:      \torch\nn\parallel\data_parallel.py

**Type**:           type
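
A minimal usage sketch, assuming a toy `nn.Linear` model and an arbitrary batch of 8 samples. The single-device fallback branch is an addition for this example so that it also runs on machines without several GPUs:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

if torch.cuda.device_count() > 1:
    # Parameters must live on device_ids[0] before wrapping.
    model = nn.DataParallel(model.cuda())
    device = torch.device("cuda:0")
else:
    # Fallback so the sketch also runs on a single GPU or on CPU.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

batch = torch.randn(8, 10, device=device)  # split along dim 0 across the GPUs
out = model(batch)                         # gathered on the output device
print(out.shape)  # torch.Size([8, 2])
```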


# nn.parallel.DistributedDataParallel()
```python
nn.parallel.DistributedDataParallel(
    module,
    device_ids=None,
    output_device=None,
    dim=0,
    broadcast_buffers=True,
    process_group=None,
    bucket_cap_mb=25,
    find_unused_parameters=False,
    check_reduction=False,
)
```
**Docstring**: Implements distributed data parallelism at the module level, based on the ``torch.distributed`` package.


This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.

See also: :ref:`distributed-basics` and :ref:`cuda-nn-ddp-instead`.
The same constraints on input as in :class:`torch.nn.DataParallel` apply.

Creation of this class requires that ``torch.distributed`` to be already
initialized, by calling :func:`torch.distributed.init_process_group`.

``DistributedDataParallel`` is proven to be significantly faster than
:class:`torch.nn.DataParallel` for single-node multi-GPU data
parallel training.

Here is how to use it: on each host with N GPUs, you should spawn up N processes, while ensuring that each process individually works on a single GPU 
from 0 to N-1. Therefore, it is your job to ensure that your training script operates on a single given GPU by calling:

    >>> torch.cuda.set_device(i)

where i is from 0 to N-1. In each process, you should refer the following
to construct this module:

    >>> torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
    >>> model = DistributedDataParallel(model, device_ids=[i], output_device=i)

In order to spawn up multiple processes per node, you can use either
``torch.distributed.launch`` or ``torch.multiprocessing.spawn``

.. note ::
    Please refer to `PyTorch Distributed Overview <https://pytorch.org/tutorials/beginner/dist_overview.html>`__
    for a brief introduction to all features related to distributed training.

.. note:: ``nccl`` backend is currently the fastest and
    highly recommended backend to be used with Multi-Process Single-GPU
    distributed training and this applies to both single-node and multi-node
    distributed training

.. note:: This module also supports mixed-precision distributed training.
    This means that your model can have different types of parameters such
    as mixed types of fp16 and fp32, the gradient reduction on these
    mixed types of parameters will just work fine.
    Also note that ``nccl`` backend is currently the fastest and highly
    recommended backend for fp16/fp32 mixed-precision training.

.. note:: If you use ``torch.save`` on one process to checkpoint the module,
    and ``torch.load`` on some other processes to recover it, make sure that
    ``map_location`` is configured properly for every process. Without
    ``map_location``, ``torch.load`` would recover the module to devices
    where the module was saved from.

.. warning::
    This module works only with the ``gloo`` and ``nccl`` backends.

.. warning::
    Constructor, forward method, and differentiation of the output (or a
    function of the output of this module) is a distributed synchronization
    point. Take that into account in case different processes might be
    executing different code.

.. warning::
    This module assumes all parameters are registered in the model by the
    time it is created. No parameters should be added nor removed later.
    Same applies to buffers.

.. warning::
    This module assumes all parameters are registered in the model of each
    distributed processes are in the same order. The module itself will
    conduct gradient all-reduction following the reverse order of the
    registered parameters of the model. In other words, it is users'
    responsibility to ensure that each distributed process has the exact
    same model and thus the exact same parameter registration order.

.. warning::
    This module allows parameters with non-rowmajor-contiguous strides.
    For example, your model may contain some parameters whose
    :class:`torch.memory_format` is ``torch.contiguous_format``
    and others whose format is ``torch.channels_last``.  However,
    corresponding parameters in different processes must have the
    same strides.

.. warning::
    This module doesn't work with :func:`torch.autograd.grad` (i.e. it will
    only work if gradients are to be accumulated in ``.grad`` attributes of
    parameters).

.. warning::

    If you plan on using this module with a ``nccl`` backend or a ``gloo``
    backend (that uses Infiniband), together with a DataLoader that uses
    multiple workers, please change the multiprocessing start method to
    ``forkserver`` (Python 3 only) or ``spawn``. Unfortunately
    Gloo (that uses Infiniband) and NCCL2 are not fork safe, and you will
    likely experience deadlocks if you don't change this setting.

.. warning::
    Forward and backward hooks defined on :attr:`module` and its submodules
    won't be invoked anymore, unless the hooks are initialized in the
    :meth:`forward` method.

.. warning::
    You should never try to change your model's parameters after wrapping
    up your model with DistributedDataParallel. In other words, when
    wrapping up your model with DistributedDataParallel, the constructor of
    DistributedDataParallel will register the additional gradient
    reduction functions on all the parameters of the model itself at the
    time of construction. If you change the model's parameters after
    the DistributedDataParallel construction, this is not supported and
    unexpected behaviors can happen, since some parameters' gradient
    reduction functions might not get called.

.. note::
    Parameters are never broadcast between processes. The module performs
    an all-reduce step on gradients and assumes that they will be modified
    by the optimizer in all processes in the same way. Buffers
    (e.g. BatchNorm stats) are broadcast from the module in process of rank
    0, to all other replicas in the system in every iteration.

.. note::
    If you are using DistributedDataParallel in conjunction with the
    :ref:`distributed-rpc-framework`, you should always use
    :meth:`torch.distributed.autograd.backward` to compute gradients and
    :class:`torch.distributed.optim.DistributedOptimizer` for optimizing
    parameters.



.. warning::
    Using DistributedDataParallel in conjunction with the
    :ref:`distributed-rpc-framework` is experimental and subject to change.

Args:
    module (Module): module to be parallelized
    device_ids (list of int or torch.device): CUDA devices. This should
               only be provided when the input module resides on a single
               CUDA device. For single-device modules, the ``i``th
               :attr:`module` replica is placed on ``device_ids[i]``. For
               multi-device modules and CPU modules, device_ids must be None
               or an empty list, and input data for the forward pass must be
               placed on the correct device. (default: all devices for
               single-device modules)
    output_device (int or torch.device): device location of output for
                  single-device CUDA modules. For multi-device modules and
                  CPU modules, it must be None, and the module itself
                  dictates the output location. (default: device_ids[0] for
                  single-device modules)
    broadcast_buffers (bool): flag that enables syncing (broadcasting) buffers of
                      the module at beginning of the forward function.
                      (default: ``True``)
    process_group: the process group to be used for distributed data
                   all-reduction. If ``None``, the default process group, which
                   is created by ```torch.distributed.init_process_group```,
                   will be used. (default: ``None``)
    bucket_cap_mb: DistributedDataParallel will bucket parameters into
                   multiple buckets so that gradient reduction of each
                   bucket can potentially overlap with backward computation.
                   :attr:`bucket_cap_mb` controls the bucket size in MegaBytes (MB)
                   (default: 25)
    find_unused_parameters (bool): Traverse the autograd graph of all tensors
                                   contained in the return value of the wrapped
                                   module's ``forward`` function.
                                   Parameters that don't receive gradients as
                                   part of this graph are preemptively marked
                                   as being ready to be reduced. Note that all
                                   ``forward`` outputs that are derived from
                                   module parameters must participate in
                                   calculating loss and later the gradient
                                   computation. If they don't, this wrapper will
                                   hang waiting for autograd to produce gradients
                                   for those parameters. Any outputs derived from
                                   module parameters that are otherwise unused can
                                   be detached from the autograd graph using
                                   ``torch.Tensor.detach``. (default: ``False``)
    check_reduction: when setting to ``True``, it enables DistributedDataParallel
                     to automatically check if the previous iteration's
                     backward reductions were successfully issued at the
                     beginning of every iteration's forward function.
                     You normally don't need this option enabled unless you
                     are observing weird behaviors such as different ranks
                     are getting different gradients, which should not
                     happen if DistributedDataParallel is correctly used.
                     (default: ``False``)

Attributes:
    module (Module): the module to be parallelized

Example::

    >>> torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
    >>> net = torch.nn.parallel.DistributedDataParallel(model)
Init docstring: Initializes internal Module state, shared by both nn.Module and ScriptModule.
File:           d:\programmefiles\python\anaconda3\envs\tensorflow2.2\lib\site-packages\torch\nn\parallel\distributed.py
Type:           type
Subclasses:  

In [None]:
Example::
    >>> import torch.distributed.autograd as dist_autograd
    >>> from torch.nn.parallel import DistributedDataParallel as DDP
    >>> from torch import optim
    >>> from torch.distributed.optim import DistributedOptimizer
    >>> import torch.distributed.rpc as rpc
    >>> from torch.distributed.rpc import RRef
    >>>
    >>> t1 = torch.rand((3, 3), requires_grad=True)
    >>> t2 = torch.rand((3, 3), requires_grad=True)
    >>> rref = rpc.remote("worker1", torch.add, args=(t1, t2))
    >>> ddp_model = DDP(my_model)
    >>>
    >>> # Setup optimizer
    >>> optimizer_params = [rref]
    >>> for param in ddp_model.parameters():
    >>>     optimizer_params.append(RRef(param))
    >>>
    >>> dist_optim = DistributedOptimizer(
    >>>     optim.SGD,
    >>>     optimizer_params,
    >>>     lr=0.05,
    >>> )
    >>>
    >>> with dist_autograd.context() as context_id:
    >>>     pred = ddp_model(rref.to_here())
    >>>     loss = loss_func(pred, target)  # `target`: ground-truth labels, assumed to be defined elsewhere
    >>>     dist_autograd.backward(context_id, loss)
    >>>     dist_optim.step()


## nn.LogSoftmax()
`activ_fn = nn.LogSoftmax(dim=None)
 Output = activ_fn(Input)`

**Docstring**:     Applies the function $\log(\texttt{Softmax}(x))$ to an n-dimensional input tensor $x$ along dimension `dim`, and returns an `Output` with the same shape as `Input`

$$
\texttt{LogSoftmax}(x_{i}) = \log\left(\frac{\exp(x_i) }{ \sum_{j=1}^{n} \exp(x_j)} \right)\,,\, i=1, 2, \cdots, n
$$

**File**:    \nn\modules\activation.py

**Type**:           type

**Examples**:
```python
inputs = torch.tensor([[-3, -2, -1], [1, 2, 3]], dtype=float)
outputs = nn.LogSoftmax(dim=1)(inputs)

# manual computation; should match `outputs`
log_softmax = (inputs.exp() / inputs.exp().sum(1, True)).log()
```
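
The direct formula above overflows for large inputs. The sketch below compares it with the log-sum-exp form $x_i - \log\sum_j \exp(x_j)$, which is mathematically equivalent; the input values are chosen here only to trigger the overflow:

```python
import torch
import torch.nn as nn

x = torch.tensor([[-3.0, -2.0, -1.0], [1000.0, 1001.0, 1002.0]], dtype=torch.float64)

builtin = nn.LogSoftmax(dim=1)(x)
stable = x - torch.logsumexp(x, dim=1, keepdim=True)     # log-sum-exp form
naive = (x.exp() / x.exp().sum(1, keepdim=True)).log()   # exp(1000) overflows to inf

print(torch.allclose(builtin, stable))  # True
print(torch.isnan(naive[1]).all())      # tensor(True): inf / inf gives nan
```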


## nn.NLLLoss()
```python
loss_fn = nn.NLLLoss(
    weight: Union[torch.Tensor, NoneType] = None,
    size_average=None,
    ignore_index: int = -100,
    reduce=None,
    reduction: str = 'mean',
)
loss = loss_fn(output, target)
```
**Docstring**:      The negative log-likelihood loss, commonly used for classification problems with C classes. The log-probabilities can be obtained by adding a `LogSoftmax` layer as the last layer of the network. Let BS be the batch size, $\mathtt{output} = (O_1, O_2, \cdots, O_{BS})$, $\mathtt{target} = (T_1, T_2, \cdots, T_{BS})$ and $\mathtt{weight} = (W_1, W_2, \cdots, W_C)$; the loss of the i-th sample is then:
$$
\ell_i = - W_{T_i} O_{i,T_i}\mathbb{I}\{T_i \neq \text{ignore_index}\}
$$
The total loss for each value of `reduction` is:
$$
\ell = \begin{cases}
    \frac{\sum_{i=1}^{BS}\ell_i}{\sum_{i=1}^{BS} W_{T_i}\mathbb{I}\{T_i \neq \text{ignore_index}\}}, &
    \text{if }\mathtt{reduction="mean"}\\
    \sum_{i=1}^{BS} \ell_i,  &
    \text{if }\mathtt{reduction="sum"}
\end{cases}
$$

**Args**:

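- weight: a weight assigned to each class; should be a 1-D tensor of length C, and defaults to a tensor of all ones

- size_average: deprecated, see the `reduction` argument. Defaults to True, in which case the returned loss is the average of the individual losses in the batch; note that for some losses a single sample contains several elements that contribute to the loss, as in pixel-wise classification. When False, the loss is the sum of all losses in the batch. Ignored when `reduce` is False.

- ignore_index: specifies a `target` value (i.e. a class index) that is ignored and does not contribute to the gradient. When `size_average` is True, the total loss is the average over the losses of the non-ignored `target` values.

- reduce: deprecated, see the `reduction` argument. Defaults to True; when False, one loss value is returned per batch element and `size_average` is ignored.

- reduction: `size_average` and `reduce` are being deprecated, but for now specifying either of them overrides `reduction`. Defaults to ``'mean'``; the accepted values and their meanings are:
    - `"none"`: no reduction is applied
    - `"mean"`: the sum of the outputs is divided by the number of output elements (by the total weight of those elements when `weight` is given)
    - `"sum"`: the outputs are summed

- output: the output of the network, which should contain the log-probability of each class. Its shape is $(BS, C)$, or $(BS, C, D_1, D_2, \cdots, D_K)$ for a K-dimensional loss such as semantic segmentation, where C is the number of classes and BS is the batch size.

- target: the labels, which should contain the class index of each sample, i.e. each element lies in $\{0, 1, \cdots, C-1\}$. If `ignore_index` is specified, the loss also accepts that index, which need not lie in this range. Its shape should be $(BS,)$, or $(BS, D_1, D_2, \cdots, D_K)$ for a K-dimensional loss such as pixel-wise classification.

- loss: if `reduction` is ``'none'``, the output has the same shape as `target`; otherwise it is a scalar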

**File**: \nn\modules\loss.py

**Type**:           type


**Examples**
```python
input = torch.tensor(
    [[ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845],
     [-1.3986,  0.4033,  0.8380, -0.7193, -0.4033],
     [-0.5966,  0.1820, -0.8567,  1.1006, -1.0712]],
    dtype=torch.float64)
target = torch.tensor([1, 0, 4])
# built-in API
loss = nn.NLLLoss()(nn.LogSoftmax(dim=1)(input), target)
# manually
output = (input.exp() / input.exp().sum(1, True)).log()
nllloss = -output[range(3), target].mean()
assert torch.isclose(nllloss, loss)  # compare floats with a tolerance, not ==
```
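
The role of `weight` and `ignore_index` in the `mean` reduction can be checked by hand. This is a sketch with randomly generated log-probabilities and made-up class weights; it assumes the mean is taken over the total weight of the non-ignored samples, which is how PyTorch computes it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
log_probs = nn.LogSoftmax(dim=1)(torch.randn(4, 3, dtype=torch.float64))
target = torch.tensor([0, 2, -100, 1])  # the third sample is ignored
weight = torch.tensor([1.0, 2.0, 0.5], dtype=torch.float64)

builtin = nn.NLLLoss(weight=weight, ignore_index=-100)(log_probs, target)

keep = target != -100                   # indicator from the per-sample formula
t = target[keep]
picked = log_probs[keep].gather(1, t.unsqueeze(1)).squeeze(1)  # O_{i, T_i}
per_sample = -weight[t] * picked        # l_i = -W_{T_i} O_{i, T_i}
manual = per_sample.sum() / weight[t].sum()  # weighted mean over kept samples

print(torch.isclose(builtin, manual))   # tensor(True)
```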

## nn.CrossEntropyLoss()
```python
loss_fn = nn.CrossEntropyLoss(
    weight= None,
    size_average=None,
    ignore_index=-100,
    reduce=None,
    reduction='mean',
)
loss = loss_fn(output, target)
```

This class combines `nn.LogSoftmax` and `nn.NLLLoss`. For a classification problem with C classes, let BS be the batch size, $\mathtt{output} = (O_1, O_2, \cdots, O_{BS})\,,\,\mathtt{target} = (T_1, T_2, \cdots, T_{BS})\,,\,\mathtt{weight} = (W_1, W_2, \cdots, W_C)$; the loss of the i-th sample is then:

$$
\ell(O, T_i) = - W_{T_i} \log\left(\frac{\exp(O_{i,T_i})}{\sum_{j=1}^{C} \exp(O_{i,j})}\right)
$$
The total loss for each value of `reduction` is:
$$
\ell = \begin{cases}
    \dfrac{\sum_{i=1}^{BS}\ell(O, T_i)}{\sum_{i=1}^{BS} W_{T_i}} &
    \text{, if }\mathtt{reduction="mean"}\\[5pt]
    \displaystyle\sum_{i=1}^{BS} \ell(O, T_i)  &
    \text{, if }\mathtt{reduction="sum"}
\end{cases}
$$
Note that the cross-entropy shows up in the term $O_{i,T_i}$: it is what remains after multiplying the output for the i-th sample elementwise by its one-hot encoded label and summing.


**Args**

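- weight: a weight assigned to each class; should be a 1-D tensor of length C

- size_average: deprecated, see the `reduction` argument. When True, the returned loss is the average of the losses of all samples in the batch; when False, it is the sum of the losses of all samples in the batch. Ignored when `reduce` is ``False``. Note that for some losses a single sample contains several elements that contribute to the loss.

- ignore_index: specifies a `target` value (i.e. a class index) that is ignored and does not contribute to the gradient. When `size_average` is True, the total loss is the average over the losses of the non-ignored `target` values.

- reduce: deprecated, see `reduction`. When ``False``, one loss value is returned per batch element and `size_average` is ignored. Defaults to ``True``.

- reduction: `size_average` and `reduce` are being deprecated, but for now specifying either of them overrides `reduction`. Defaults to ``'mean'``; the accepted values and their meanings are:
    - `"none"`: no reduction is applied
    - `"mean"`: the sum of the outputs is divided by the number of output elements (by the total weight of those elements when `weight` is given)
    - `"sum"`: the outputs are summed

- output: the output of the network, a tensor holding the predicted score of each class; the scores should be unnormalized. Its shape is $(BS, C)$, or $(BS, C, D_1, D_2, \cdots, D_K)$ for a K-dimensional loss such as semantic segmentation, where C is the number of classes and BS is the batch size.

- target: the labels, which should contain the class index of each sample, i.e. each element lies in $\{0, 1, \cdots, C-1\}$. If `ignore_index` is specified, the loss also accepts that index, which need not lie in this range. Its shape should be $(BS,)$, or $(BS, D_1, D_2, \cdots, D_K)$ for a K-dimensional loss such as pixel-wise classification.

- loss: if `reduction` is ``'none'``, the output has the same shape as `target`; otherwise it is a scalar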

**File**:   \torch\nn\modules\loss.py

**Type**:           type

In [73]:
loss_func = nn.CrossEntropyLoss(reduction="none")
output = torch.tensor([[1.0, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=torch.float32, requires_grad=True)
# output = torch.ones([3, 3])
target = torch.arange(3, dtype=torch.long)
loss = loss_func(output, target)
print(loss)

tensor([0.5514, 0.5514, 0.5514], grad_fn=<NllLossBackward>)
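
As a quick check of the statement that this class combines `nn.LogSoftmax` and `nn.NLLLoss`, the sketch below reuses the logits from the cell above and also derives the value 0.5514 in closed form:

```python
import math
import torch
import torch.nn as nn

logits = torch.tensor([[1.0, 0, 0], [0, 1, 0], [0, 0, 1]])
target = torch.arange(3)

ce = nn.CrossEntropyLoss(reduction="none")(logits, target)
nll = nn.NLLLoss(reduction="none")(nn.LogSoftmax(dim=1)(logits), target)

# Closed form for each row: -log(e / (e + 2)) = log(e + 2) - 1
expected = math.log(math.e + 2) - 1

print(torch.allclose(ce, nll))  # True
print(round(expected, 4))       # 0.5514
```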



# nn.Embedding()
```python
nn.Embedding(
    num_embeddings,
    embedding_dim,
    padding_idx=None,
    max_norm=None,
    norm_type=2.0,
    scale_grad_by_freq=False,
    sparse=False,
    _weight=None,
) -> None
```
**Docstring**

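A simple lookup table that stores the embeddings of a fixed dictionary of a fixed size. This module is often used to store word embeddings: *its input is a list of indices and its output is the corresponding word embeddings*

**Args**:

- num_embeddings: size of the embedding dictionary, i.e. the number of embedding vectors it holds

- embedding_dim: size of each embedding vector, i.e. the dimension of each word vector

- padding_idx: if given, the vector at position `padding_idx` of the lookup table is initialized to all zeros. The other vectors in the table are updated during training, but the gradient of this vector is always 0, so it is never updated

- max_norm: if given, every embedding vector whose norm exceeds `max_norm` is renormalized to have norm `max_norm`

- norm_type: the p of the p-norm used for `max_norm`; defaults to 2

- scale_grad_by_freq: if given, gradients are scaled by the inverse of the frequency of the words in the mini-batch; defaults to ``False``

- sparse: when ``True``, the gradient with respect to the weight matrix is a sparse tensor. Only a limited number of optimizers support sparse gradients; currently these are `optim.SGD` (`CUDA` and `CPU`), `optim.SparseAdam` (`CUDA` and `CPU`) and `optim.Adagrad` (`CPU`)

**Attributes**

- weight: the learnable weights of the module, of shape (num_embeddings, embedding_dim), initialized from $\mathcal{N}(0, 1)$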



**File**:   \torch\nn\modules\sparse.py

**Type**:           type

### Examples

In [3]:
embedding = nn.Embedding(10, 3)
input = torch.LongTensor([[0, 1],[2, 3]])  # get the 1st, 2nd vector in the 1st batch, and the 3rd, 4th vector in the 2nd batch
print(embedding(input))
print(embedding.weight)

tensor([[[-0.2980,  1.1256, -0.5481],
         [ 0.3493,  0.9749, -0.9385]],

        [[-1.5044, -0.6946,  0.2112],
         [-0.8295,  0.4603, -0.3248]]], grad_fn=<EmbeddingBackward>)
Parameter containing:
tensor([[-0.2980,  1.1256, -0.5481],
        [ 0.3493,  0.9749, -0.9385],
        [-1.5044, -0.6946,  0.2112],
        [-0.8295,  0.4603, -0.3248],
        [-1.7520,  0.1621, -1.2363],
        [ 0.6270, -1.0117,  0.4207],
        [-1.1248,  0.2388,  1.0833],
        [ 0.7649, -0.8190, -0.3316],
        [ 1.1387, -0.5347, -0.0481],
        [ 0.2216,  2.0920,  1.7258]], requires_grad=True)


In [None]:
# example with padding_idx
embedding = nn.Embedding(10, 3, padding_idx=3)
input = torch.LongTensor([3, 2, 1, 0, 1, 2, 3])
print(embedding(input))
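
The effect of `padding_idx` on both the stored vector and its gradient can be sketched as follows (the index tensor is arbitrary and chosen only for illustration):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=3)
idx = torch.tensor([3, 2, 1, 3])

out = emb(idx)
out.sum().backward()

pad_row = emb.weight[3].detach()
print(pad_row)             # all zeros at initialization
print(emb.weight.grad[3])  # zero gradient, so the row is never updated
print(emb.weight.grad[2])  # ones: index 2 was looked up once
```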


# nn.Linear()
`nn.Linear(in_features, out_features, bias=True) -> None`

**Docstring**

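Applies a linear transformation to the input: $y = x A^T + b$

**Args**

- in_features: size of each input sample; the input tensor of this layer should have shape `(N, *, in_features)`, where `*` means any number of additional dimensions

- out_features: size of each output sample; the output tensor has shape `(N, *, out_features)`

- bias: if False, the layer does not learn an additive bias; defaults to True

**Attributes**

- weight: learnable parameter of shape `(out_features, in_features)`, initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{1}{\text{in_features}}$

- bias: when `bias` is True, a learnable parameter of shape `(out_features)`, initialized in the same way as the weight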
            
**File**:  \torch\nn\modules\linear.py

**Type**:           type

**Subclasses**:     _LinearWithBias, Linear

### Examples

In [None]:
m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
print(output.size())
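
The formula $y = x A^T + b$ and the initialization range can be checked directly. This is a small sketch; the tolerances are arbitrary choices to absorb floating-point rounding.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
m = nn.Linear(20, 30)
x = torch.randn(128, 20)

# Manual evaluation of y = x A^T + b using the module's own parameters.
manual = x @ m.weight.t() + m.bias
matches = torch.allclose(m(x), manual, atol=1e-5)

# Weight and bias should lie within [-sqrt(k), sqrt(k)] with k = 1 / in_features.
bound = 1 / math.sqrt(m.in_features)
within_bound = bool(m.weight.abs().max() <= bound + 1e-6) and bool(m.bias.abs().max() <= bound + 1e-6)

# Extra leading dimensions are passed through unchanged.
out_shape = tuple(m(torch.randn(4, 7, 20)).shape)
print(matches, within_bound, out_shape)
```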

#  

# nn.Conv2d()
```python
nn.Conv2d(
    in_channels: int,
    out_channels: int,
    kernel_size: Union[int, Tuple[int, int]],
    stride: Union[int, Tuple[int, int]] = 1,
    padding: Union[int, Tuple[int, int]] = 0,
    dilation: Union[int, Tuple[int, int]] = 1,
    groups: int = 1,
    bias: bool = True,
    padding_mode: str = 'zeros',
)
```
**Docstring**:

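Applies a 2D convolution over an input signal composed of several input planes. For an input feature map $\boldsymbol{X}^{(in)}$ of shape $(N \times C_{i} \times H_{i} \times W_{i})$ and an output feature map $\boldsymbol{X}^{(out)}$ of shape $(N \times C_{o} \times H_{o} \times W_{o})$, the convolution can be written as
$$
\boldsymbol{X}^{(out)}_{n c_o}= \boldsymbol{b}_{c_o} + \sum_{c_i = 1}^{C_{i}} \boldsymbol{w}_{c_o c_i} \star \boldsymbol{X}^{(in)}_{n c_i} \;,\; c_o = 1, 2, \cdots, C_{o} \;,\; n = 1, 2, \cdots, N
$$
with output sizes
$$
\begin{aligned}
H_{o} &= \left\lfloor\frac{H_{i}  + 2 P_h - D_h \times (K_h - 1) - 1}{S_h} + 1\right\rfloor\\
W_{o} &= \left\lfloor\frac{W_{i}  + 2 P_w - D_w \times (K_w - 1) - 1}{S_w} + 1\right\rfloor
\end{aligned}
$$
where $\star$ denotes the [cross-correlation operator](https://en.wikipedia.org/wiki/Cross-correlation), and $P$, $D$, $K$, $S$ are the padding, dilation, kernel size, and stride along the height ($h$) and width ($w$) dimensions. For animated illustrations of convolution arithmetic, see [here](https://github.com/vdumoulin/conv_arithmetic)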
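
The output-size formula can be checked against an actual layer. In this sketch `conv2d_out_size` is a hypothetical helper written for illustration (it is not part of PyTorch), and the layer hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

def conv2d_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension, following the floor formula."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

conv = nn.Conv2d(3, 8, kernel_size=(3, 5), stride=(2, 1), padding=(1, 2), dilation=(2, 1))
x = torch.randn(4, 3, 32, 40)
y = conv(x)

h_out = conv2d_out_size(32, kernel=3, stride=2, padding=1, dilation=2)
w_out = conv2d_out_size(40, kernel=5, stride=1, padding=2, dilation=1)
print(tuple(y.shape), (h_out, w_out))
```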

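Note that when the CUDA backend with cuDNN is used, this operator may select a nondeterministic algorithm to improve performance. If deterministic behavior is required, set ``torch.backends.cudnn.deterministic = True``, possibly at some cost in performance; for more background see `/notes/randomness`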

**Args**:

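- in_channels, out_channels: Number of channels in the input feature map and number of channels produced by the convolution, respectively

- kernel_size, stride, padding, dilation: Each can be an integer or a tuple. An integer `n` is applied to both the height and the width dimension; for a tuple `(h, w)`, `h` applies to the height dimension and `w` to the width dimension. `dilation` is the spacing at which the input feature map is sampled by the kernel

- padding_mode: One of ``'zeros'``, ``'reflect'``, ``'replicate'``, ``'circular'``; default ``'zeros'``

- groups: An integer; both `in_channels` and `out_channels` must be divisible by `groups`. It determines how many independent convolutions are performed. For example, `groups=2` means the input feature map is split into two halves along the channel dimension, each half is convolved separately, and the results are concatenated (as in AlexNet). When `groups == in_channels` and `out_channels == K * in_channels` for a positive integer `K`, the operation is also called a depthwise convolution

- bias: If ``True``, a learnable bias is added to the output; default ``True``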



**Attributes**:

- weight: Of shape $(C_{o} \times \frac{C_{i}}{\text{groups}} \times K_h \times K_w)$; by default its elements are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{\text{groups}}{C_{i} \; K_h \; K_w}$

- bias: Of shape $(C_{o})$; by default its elements are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{\text{groups}}{C_{i} \; K_h \; K_w}$
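
The weight shape and the meaning of `groups` can be illustrated as follows. This is a sketch with arbitrary channel counts; the manual split-and-concatenate path uses `torch.nn.functional.conv2d` on each half of the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(4, 8, kernel_size=3, groups=2)
grouped_weight_shape = tuple(conv.weight.shape)  # (C_out, C_in / groups, K_h, K_w)

# Depthwise case: groups == in_channels, out_channels == K * in_channels with K = 2.
depthwise = nn.Conv2d(4, 8, kernel_size=3, groups=4)
depthwise_weight_shape = tuple(depthwise.weight.shape)

# groups=2 behaves like two independent convolutions whose outputs are concatenated.
x = torch.randn(1, 4, 5, 5)
x1, x2 = x.chunk(2, dim=1)
w1, w2 = conv.weight.chunk(2, dim=0)
b1, b2 = conv.bias.chunk(2, dim=0)
y_manual = torch.cat([F.conv2d(x1, w1, b1), F.conv2d(x2, w2, b2)], dim=1)
same = torch.allclose(conv(x), y_manual, atol=1e-5)
print(grouped_weight_shape, depthwise_weight_shape, same)
```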


**File**:          \torch\nn\modules\conv.py

**Type**:           type

**Subclasses**:     Conv2d, ConvBn2d


#  

# nn.ReLU()
` nn.ReLU(inplace: bool = False)`

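When `inplace` is True, the operation is applied directly to the input tensor; otherwise the input is left untouched and a new tensor holding the `ReLU` result is returned. Default False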


**File**:     torch\nn\modules\activation.py

**Type**:           type

**Subclasses**:     ReLU, ReLU6

### Examples

In [8]:
m1 = nn.ReLU()
m2 = nn.ReLU(inplace=True)
inputs = torch.arange(1, 9) - 4
outputs = m1(inputs)  # inputs == tensor([-3, -2, -1,  0,  1,  2,  3,  4]), outputs == tensor([0, 0, 0, 0, 1, 2, 3, 4])
outputs = m2(inputs)  # outputs == inputs == tensor([0, 0, 0, 0, 1, 2, 3, 4])

#  

# nn.AdaptiveAvgPool2d()
`nn.AdaptiveAvgPool2d(output_size: Union[int, Tuple[int, ...]]) -> None`

**Docstring**:

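Applies 2D adaptive average pooling over an input composed of several channels. `output_size` can be a tuple `(H, W)`, a single integer `H`, or a tuple in which an entry is None. In the first case the output has shape `(..., H, W)` for an input of any spatial size; in the second case the output has shape `(..., H, H)`; in the third case None means the corresponding output dimension is the same as that of the input. In all three cases the number of output channels equals the number of input channels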

**File**:          \torch\nn\modules\pooling.py

**Type**:           type 

### Examples

In [None]:
avgpool1 = nn.AdaptiveAvgPool2d((2, 1))
avgpool2 = nn.AdaptiveAvgPool2d(1)
avgpool3 = nn.AdaptiveAvgPool2d((None, 2))  # None keeps the input height
x = torch.arange(-9, 9, dtype=torch.float32).reshape(1, 2, 3, 3)
y1 = avgpool1(x)  # y1.shape => (1, 2, 2, 1)
y2 = avgpool2(x)  # y2.shape => (1, 2, 1, 1)
y3 = avgpool3(x)  # y3.shape => (1, 2, 3, 2)
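
With `output_size=1`, adaptive average pooling reduces to a per-channel mean over the spatial dimensions. A minimal sketch of that equivalence, reusing the same input as above:

```python
import torch
import torch.nn as nn

x = torch.arange(-9, 9, dtype=torch.float32).reshape(1, 2, 3, 3)

global_pool = nn.AdaptiveAvgPool2d(1)(x)
manual_mean = x.mean(dim=(2, 3), keepdim=True)  # channel 0 holds -9..-1, channel 1 holds 0..8

keep_height = nn.AdaptiveAvgPool2d((None, 2))(x)  # None leaves the height at 3
print(global_pool.flatten(), tuple(keep_height.shape))
```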

#  

# nn.BatchNorm1d()
```python
nn.BatchNorm1d(
    num_features,
    eps=1e-05,
    momentum=0.1,
    affine=True,
    track_running_stats=True,
)
```

**Docstring**:<br>
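Applies batch normalization (BN) to a 2D or 3D input tensor, where a 2D input has shape `(batch_size, features)` and a 3D input has shape `(batch_size, features, length)`; features are sometimes also called channels. BN is computed as:
$$
y = \gamma\frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta
$$
where $\gamma$ and $\beta$ are learnable parameter vectors of length `features`, initialized to 1 and 0 respectively. The mean and standard deviation are computed over the batch dimension (and over `length` for 3D input), independently for each feature. The standard deviation is computed with the biased estimator, equivalent to `torch.var(input, unbiased=False)`. During training the module by default keeps running estimates of the mean and variance, and the running statistics accumulated over the batches are then used for normalization during evaluation. The `momentum` of the running estimates defaults to 0.1, and they are updated as:
$$
\hat{x}_\text{new} = (1 - \mathtt{momentum}) \cdot \hat{x} + \mathtt{momentum} \cdot x_t
$$
where $\hat{x}$ is the running estimate and $x_t$ is the newly observed value;<br>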
[References](https://arxiv.org/abs/1502.03167)



**Args**:
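- num_features: The size of `features` for an input of shape `(batch_size, features)` or `(batch_size, features, length)`;
- eps: A value added to the denominator for numerical stability; default 1e-5
- momentum: The value used when computing `running_mean` and `running_var`; default 0.1. Setting it to None uses a cumulative moving average (i.e. a simple average) instead of the exponential one;
- affine: If True, the module has the learnable affine parameters $\gamma$ and $\beta$; default True;
- track_running_stats: If True, the module tracks `running_mean` and `running_var`; otherwise it does not, the statistics buffers `running_mean` and `running_var` are initialized to None, and the module uses batch statistics in both training and evaluation; default True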


**File**:    \torch\nn\modules\batchnorm.py

**Type**:           type

### Examples
```python
m = nn.BatchNorm1d(100, affine=False)
inputs = torch.randn(20, 100, 32)
outputs = m(inputs)
```
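
The normalization formula and the running-statistics update can be reproduced by hand. This is a sketch under one assumption worth flagging: to my knowledge the PyTorch documentation states that `running_var` is updated with the unbiased batch variance, even though the forward pass normalizes with the biased one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)  # gamma = 1, beta = 0, running_mean = 0, running_var = 1
x = torch.randn(16, 4) * 3 + 2

y = bn(x)  # modules start in training mode, so batch statistics are used

batch_mean = x.mean(dim=0)
batch_var = x.var(dim=0, unbiased=False)  # biased estimator, as in the forward pass
y_manual = (x - batch_mean) / torch.sqrt(batch_var + bn.eps)

# One step of the running update, starting from the initial buffers.
expected_running_mean = (1 - bn.momentum) * torch.zeros(4) + bn.momentum * batch_mean
expected_running_var = (1 - bn.momentum) * torch.ones(4) + bn.momentum * x.var(dim=0, unbiased=True)
print(bn.running_mean, expected_running_mean)
```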

# nn.BatchNorm2d()
```python
nn.BatchNorm2d(
    num_features,
    eps=1e-05,
    momentum=0.1,
    affine=True,
    track_running_stats=True,
)
```
**Docstring**

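Applies batch normalization (BN) to a 4D input tensor of shape `(batch_size, channel, height, width)`; channels are sometimes also called features. BN is computed as:
$$
y = \gamma\frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta
$$
where $\gamma$ and $\beta$ are learnable parameter vectors of length `channel`, initialized to 1 and 0 respectively. The mean and standard deviation are computed over the batch and spatial dimensions, independently for each channel. The standard deviation is computed with the biased estimator, equivalent to `torch.var(input, unbiased=False)`. During training the module by default keeps running estimates of the mean and variance, and the running statistics accumulated over the batches are then used for normalization during evaluation. The `momentum` of the running estimates defaults to 0.1, and they are updated as:
$$
\hat{x}_\text{new} = (1 - \mathtt{momentum}) \cdot \hat{x} + \mathtt{momentum} \cdot x_t
$$
where $\hat{x}$ is the running estimate and $x_t$ is the newly observed value;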

[References](https://arxiv.org/abs/1502.03167)


**Args**:
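- num_features: The size of `channel` for an input of shape `(batch_size, channel, height, width)`;
- eps: A value added to the denominator for numerical stability; default 1e-5
- momentum: The value used when computing `running_mean` and `running_var`; default 0.1. Setting it to None uses a cumulative moving average (i.e. a simple average) instead of the exponential one;
- affine: If True, the module has the learnable affine parameters $\gamma$ and $\beta$; default True;
- track_running_stats: If True, the module tracks `running_mean` and `running_var`; otherwise it does not, the statistics buffers `running_mean` and `running_var` are initialized to None, and the module uses batch statistics in both training and evaluation; default True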

**File**:    \torch\nn\modules\batchnorm.py

**Type**:           type
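
### Examples

A minimal sketch of the difference between training and evaluation mode: in training mode each channel is normalized with statistics taken over the batch and spatial dimensions, while in evaluation mode the stored running statistics are used. The input shape is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 5, 5)

# Training mode: per-channel statistics over (batch, height, width).
y_train = bn(x)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_train_manual = (x - mean) / torch.sqrt(var + bn.eps)

# Evaluation mode: the running statistics collected so far are used instead.
bn.eval()
y_eval = bn(x)
running_mean = bn.running_mean.view(1, -1, 1, 1)
running_var = bn.running_var.view(1, -1, 1, 1)
y_eval_manual = (x - running_mean) / torch.sqrt(running_var + bn.eps)
print(y_train.shape, y_eval.shape)
```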

# 

# 

# nn.Parameter()
`nn.Parameter(data=None, requires_grad=True)`

**Docstring**:

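`Parameter` is a subclass of `torch.Tensor`. When it is assigned as an attribute of an `nn.Module`, it is automatically added to the module's list of parameters and appears, for example, in the `Module.parameters` iterator. Assigning a plain `torch.Tensor` to an `nn.Module` has no such effect, because the tensor may only be there to cache some temporary state in the model, for example the last hidden state of an RNN. Without a class like `Parameter`, these temporaries would be registered in the model as well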

**Args**:
- data: The parameter tensor; should be a `torch.Tensor`
- requires_grad: Whether the parameter requires gradient; default True. See `excluding-subgraphs` for more details

**File**:  \torch\nn\parameter.py

**Type**:           type
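
### Examples

The registration behavior can be seen with a toy module. `Toy`, `scale`, and `cache` are hypothetical names chosen for this sketch.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.cache = torch.zeros(3)               # plain tensor: not registered

    def forward(self, x):
        return x * self.scale

toy = Toy()
param_names = [name for name, _ in toy.named_parameters()]
state_keys = list(toy.state_dict().keys())
print(param_names, state_keys)
```

Only `scale` shows up in `named_parameters()` and in `state_dict()`, so only it would be seen by an optimizer or saved in a checkpoint.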

# 

# 

# nn.UpsamplingBilinear2d()
```python
nn.UpsamplingBilinear2d(
    size: Union[int, Tuple[int, int], NoneType] = None,
    scale_factor: Union[float, Tuple[float, float], NoneType] = None,
) -> None
```
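Applies 2D bilinear upsampling to the input. The shape of the output feature map is specified through either `size` or `scale_factor`.

Note that this class is deprecated; use `F.interpolate(..., mode='bilinear', align_corners=True)` instead.

**Args**
- size: The output size; an integer or a tuple of integers;
- scale_factor: A float or a tuple of floats giving the scaling ratio; when the computed output size is not an integer, it is rounded down;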

**File**:     \torch\nn\modules\upsampling.py

**Type**:           type

### Examples

In [26]:
x = torch.arange(1, 5, dtype=torch.float32).view(1, 1, 2, 2)
y1 = nn.UpsamplingBilinear2d(scale_factor=2)(x)
y2 = nn.functional.interpolate(x, scale_factor=2, mode='bilinear',
    align_corners=False)
print(y1)
print(y2)

tensor([[[[1.0000, 1.3333, 1.6667, 2.0000],
          [1.6667, 2.0000, 2.3333, 2.6667],
          [2.3333, 2.6667, 3.0000, 3.3333],
          [3.0000, 3.3333, 3.6667, 4.0000]]]])
tensor([[[[1.0000, 1.2500, 1.7500, 2.0000],
          [1.5000, 1.7500, 2.2500, 2.5000],
          [2.5000, 2.7500, 3.2500, 3.5000],
          [3.0000, 3.2500, 3.7500, 4.0000]]]])
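
The two outputs above differ because `y2` uses `align_corners=False`. As a sketch of the suggested replacement, the functional call with `align_corners=True` should reproduce the module's output (the deprecated module may emit a warning, which can be ignored here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.arange(1, 5, dtype=torch.float32).view(1, 1, 2, 2)

y_module = nn.UpsamplingBilinear2d(scale_factor=2)(x)
y_aligned = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
y_unaligned = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

same_as_aligned = torch.allclose(y_module, y_aligned)
same_as_unaligned = torch.allclose(y_module, y_unaligned)
print(same_as_aligned, same_as_unaligned)
```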
