In [1]:
import torch
import torch.nn as nn

# nn.Module()
`nn.Module()`

**Docstring**

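Base class for all neural network modules; every model you build should subclass it. A module can contain other modules, which lets you nest them in a tree structure: assign a submodule as a regular attribute, for example an `nn.Conv2d`. Submodules assigned this way are registered, so their parameters are also converted when you call `to`, etc. The internal module state is shared by both `nn.Module` and `ScriptModule`.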

**File**: torch\nn\modules\module.py

**Type**:           type

**Subclasses**:     Identity, Linear, Bilinear, \_ConvNd, Threshold, ReLU, RReLU, Hardtanh, Sigmoid, Hardsigmoid, ...
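
The registration behaviour described in the docstring can be seen in a small sketch. `TinyNet` is a hypothetical model made up for illustration; only the `nn.Module` API itself comes from PyTorch.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        # Assigning modules as attributes registers them as submodules.
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        x = torch.relu(self.conv(x))  # (N, 8, H - 2, W - 2)
        x = x.mean(dim=(2, 3))        # global average pooling -> (N, 8)
        return self.fc(x)


net = TinyNet()
param_names = [name for name, _ in net.named_parameters()]
print(param_names)  # ['conv.weight', 'conv.bias', 'fc.weight', 'fc.bias']

# `to` converts the parameters of every registered submodule as well.
net.to(torch.float64)
print(net.conv.weight.dtype, net.fc.bias.dtype)  # torch.float64 torch.float64
```

Parameter names are prefixed with the attribute name of the submodule that owns them, which is also how the keys of `state_dict()` are formed.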



## nn.Module.train()

`<module>.train(mode=True)`


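When `mode` is True the module is set to training mode, and when False to evaluation mode. This only affects certain modules (for example `Dropout` and `BatchNorm`); see the documentation of each module for its behavior in training/evaluation mode.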

**Type**:      function

## nn.Module.eval()

`<module_name>.eval()`

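Sets the module to evaluation mode. This only affects certain modules; see the documentation of each module for its behavior in training/evaluation mode. Equivalent to `<module_name>.train(False)`.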

**Type**:      function
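
As a minimal sketch of a module whose behaviour depends on the mode, `nn.Dropout` is used below; the tensor size and `p=0.5` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()         # same as drop.train(True)
train_out = drop(x)  # entries are zeroed at random, survivors scaled by 1 / (1 - p) = 2

drop.eval()          # same as drop.train(False)
eval_out = drop(x)   # dropout is the identity in evaluation mode

print(drop.training)             # False
print(torch.equal(eval_out, x))  # True
print(train_out.unique())        # the values that occur are 0 and 2
```

The current mode is stored in the `training` attribute, and `train()`/`eval()` set it recursively on all submodules.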

## nn.Module.state_dict

`<module>.state_dict(destination=None, prefix='', keep_vars=False)`

**Docstring:**

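Returns a dictionary containing the whole state of the module. Both parameters and persistent buffers (e.g. running averages) are included; the keys are the names of the corresponding parameters and buffers.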

**Type**: function

### Example
```python
module.state_dict().keys()  # => ['bias', 'weight']
```

## nn.Module.load_state_dict
```python
<module>.load_state_dict(
    state_dict: Dict[str, torch.Tensor],
    strict=True,
)
```
**Docstring**

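Copies parameters and buffers from `state_dict` into this module and its descendants, and returns a ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields. ``missing_keys`` is a list of strings containing the missing keys, and ``unexpected_keys`` is a list of strings containing the unexpected keys. If `strict` is `True`, the keys of `state_dict` must exactly match the keys returned by this module's `torch.nn.Module.state_dict` function.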

**Args**

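- state_dict: a dict containing parameters and persistent buffers

- strict: when True, strictly enforces that the keys in `state_dict` match the keys returned by this module's `state_dict` function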


**Type**: function
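
The round trip between `state_dict` and `load_state_dict` can be sketched as follows. The two `nn.Sequential` models and the key `extra.weight` are made up for illustration.

```python
import torch
import torch.nn as nn

src = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
dst = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))

state = src.state_dict()
print(list(state.keys()))
# ['0.weight', '0.bias', '1.weight', '1.bias',
#  '1.running_mean', '1.running_var', '1.num_batches_tracked']

# strict=True (the default): the keys must match exactly.
result = dst.load_state_dict(state)
print(result.missing_keys, result.unexpected_keys)  # [] []

# Drop one key and add an unknown one; strict=False reports them instead of raising.
partial = {k: v for k, v in state.items() if k != "0.bias"}
partial["extra.weight"] = torch.zeros(1)  # hypothetical key, not part of the model
report = dst.load_state_dict(partial, strict=False)
print(report.missing_keys)     # ['0.bias']
print(report.unexpected_keys)  # ['extra.weight']
```

Note that the BatchNorm running statistics appear in the dictionary even though they are buffers rather than parameters.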

## nn.Module.named_children()
`<module>.named_children(self) -> Iterator[Tuple[str, ForwardRef('Module')]]`

**Docstring**

Returns an iterator over the immediate child modules, yielding both the name of each child and the module itself as `(string, Module)`.

**File**:   \torch\nn\modules\module.py

**Type**:      function

### Example

In [None]:
class TestModel(nn.Module):
    def __init__(self):
        super(TestModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(64, 128, 3)

model = TestModel()
for named_children in model.named_children():
    print(named_children)

## nn.Module.children()
`nn.Module.children(self) -> Iterator[ForwardRef('Module')]`

Returns an iterator over the immediate child modules.

**Type**:      function

In [39]:
def has_children(module):
    """Return True if `module` has at least one direct child module."""
    try:
        # children() always returns an iterator; it is simply empty for leaf modules
        next(module.children())
        print("Iterator `module.children()`:", module.children())
        return True
    except StopIteration:
        print("{} has no children".format(module))
        return False

class TestModel_Parent(nn.Module):
    def __init__(self):
        super(TestModel_Parent, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(64, 128, 3)
        self.model_children = TestModel()  # nested module with its own children

model_parent = TestModel_Parent()
for name, module in model_parent.named_children():
    has_children(module)

Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1)) has no children
ReLU() has no children
Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1)) has no children
Iterator `module.children()`: <generator object Module.children at 0x0000025673457648>


## nn.Module.apply()
`<model>.apply(fn: Callable[[ForwardRef('Module')], NoneType]) -> ~T`

**Docstring**

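Applies `fn` recursively to every submodule (as returned by ``children()``) as well as to `self`, and returns `self`. A typical use is initializing the parameters of a model; see also `nn-init-doc`.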

**Args**

- fn: the function to apply to each submodule; it takes a module as input and returns None

### Example

In [None]:
@torch.no_grad()
def init_weights(m):
    print(m)
    if type(m) == nn.Linear:
        m.weight.fill_(1.0)
        print(m.weight)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

## nn.Module.register_forward_hook()
`<model>.register_forward_hook(hook: Callable[..., NoneType]) -> torch.utils.hooks.RemovableHandle`

**Docstring**

Registers a forward hook on the module. The hook is called every time after `forward` has computed an output. It should have the following signature:
```py
hook(module, input, output) -> None or modified output
```
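Here `input` contains only the positional arguments given to the module; keyword arguments are passed to ``forward`` but not to the hook. The hook can modify `output`. It can also modify `input` in place, but because the hook is called after `forward`, such changes have no effect on the forward computation. `register_forward_hook` returns a handle, an instance of `torch.utils.hooks.RemovableHandle`, and calling ``handle.remove()`` removes the added hook.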

**Type**:      function
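
A minimal sketch of the two kinds of hooks described above: one that only observes and one that replaces the output. The hook names `save_output` and `double_output` are made up for illustration.

```python
import torch
import torch.nn as nn

captured = {}


def save_output(module, inputs, output):
    # `inputs` is a tuple holding the positional arguments given to the module.
    captured["in_shape"] = inputs[0].shape
    captured["out_shape"] = output.shape
    # Returning None leaves the output unchanged.


def double_output(module, inputs, output):
    return output * 2  # a non-None return value replaces the output


layer = nn.Linear(3, 2)
x = torch.randn(5, 3)
plain = layer(x)

h1 = layer.register_forward_hook(save_output)
h2 = layer.register_forward_hook(double_output)
hooked = layer(x)
print(captured)  # {'in_shape': torch.Size([5, 3]), 'out_shape': torch.Size([5, 2])}
print(torch.allclose(hooked, plain * 2))  # True

# Removing the handles restores the original behaviour.
h1.remove()
h2.remove()
print(torch.allclose(layer(x), plain))  # True
```

Keeping the returned handles around is what makes temporary instrumentation possible, for example when extracting intermediate features.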


# nn.DataParallel()
`nn.DataParallel(module, device_ids=None, output_device=None, dim=0)`

**Docstring**

Implements data parallelism at the module level.

This container parallelizes the application of the given :attr:`module` by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.

The batch size should be larger than the number of GPUs used.

.. warning::
    It is recommended to use :class:`~torch.nn.parallel.DistributedDataParallel`,
    instead of this class, to do multi-GPU training, even if there is only a single
    node. See: :ref:`cuda-nn-ddp-instead` and :ref:`ddp`.

Arbitrary positional and keyword inputs are allowed to be passed into
DataParallel but some types are specially handled. tensors will be
**scattered** on dim specified (default 0). tuple, list and dict types will
be shallow copied. The other types will be shared among different threads
and can be corrupted if written to in the model's forward pass.

The parallelized :attr:`module` must have its parameters and buffers on
``device_ids[0]`` before running this :class:`~torch.nn.DataParallel`
module.

.. warning::
    In each forward, :attr:`module` is **replicated** on each device, so any
    updates to the running module in ``forward`` will be lost. For example,
    if :attr:`module` has a counter attribute that is incremented in each
    ``forward``, it will always stay at the initial value because the update
    is done on the replicas which are destroyed after ``forward``. However,
    :class:`~torch.nn.DataParallel` guarantees that the replica on
    ``device[0]`` will have its parameters and buffers sharing storage with
    the base parallelized :attr:`module`. So **in-place** updates to the
    parameters or buffers on ``device[0]`` will be recorded. E.g.,
    :class:`~torch.nn.BatchNorm2d` and :func:`~torch.nn.utils.spectral_norm`
    rely on this behavior to update the buffers.

.. warning::
    Forward and backward hooks defined on :attr:`module` and its submodules
    will be invoked ``len(device_ids)`` times, each with inputs located on
    a particular device. Particularly, the hooks are only guaranteed to be
    executed in correct order with respect to operations on corresponding
    devices. For example, it is not guaranteed that hooks set via
    :meth:`~torch.nn.Module.register_forward_pre_hook` be executed before
    `all` ``len(device_ids)`` :meth:`~torch.nn.Module.forward` calls, but
    that each such hook be executed before the corresponding
    :meth:`~torch.nn.Module.forward` call of that device.

.. warning::
    When :attr:`module` returns a scalar (i.e., 0-dimensional tensor) in
    :func:`forward`, this wrapper will return a vector of length equal to
    number of devices used in data parallelism, containing the result from
    each device.

.. note::
    There is a subtlety in using the
    ``pack sequence -> recurrent network -> unpack sequence`` pattern in a
    :class:`~torch.nn.Module` wrapped in :class:`~torch.nn.DataParallel`.
    See :ref:`pack-rnn-unpack-with-data-parallelism` section in FAQ for
    details.


Args:
    module (Module): module to be parallelized
    device_ids (list of int or torch.device): CUDA devices (default: all devices)
    output_device (int or torch.device): device location of output (default: device_ids[0])

Attributes:
    module (Module): the module to be parallelized

Example::

    >>> net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
    >>> output = net(input_var)  # input_var can be on any device, including CPU

**Init docstring**: Initializes internal Module state, shared by both nn.Module and ScriptModule.

**File**: \torch\nn\parallel\data_parallel.py

**Type**:           type
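
The docstring example assumes several GPUs. The sketch below adds a guard so that the same script also runs on a machine with one GPU or none; the layer sizes and batch size are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 4)
batch = torch.randn(8, 10)  # batch size larger than the number of GPUs

if torch.cuda.device_count() > 1:
    # Parameters and buffers must live on device_ids[0] before wrapping.
    model = nn.DataParallel(model.cuda(0))

# With DataParallel the input is scattered along dim 0 across the devices;
# without it this is an ordinary forward pass.
output = model(batch)
print(output.shape)  # torch.Size([8, 4])

# The original network is reachable as `.module` when it has been wrapped.
inner = model.module if isinstance(model, nn.DataParallel) else model
print(type(inner).__name__)  # Linear
```

Unwrapping through `.module` is useful when saving a `state_dict`, because the keys of the wrapper are prefixed with `module.`.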


# nn.parallel.DistributedDataParallel()
```python
nn.parallel.DistributedDataParallel(
    module,
    device_ids=None,
    output_device=None,
    dim=0,
    broadcast_buffers=True,
    process_group=None,
    bucket_cap_mb=25,
    find_unused_parameters=False,
    check_reduction=False,
)
```
**Docstring**

Implements distributed data parallelism at the module level, based on the ``torch.distributed`` package.


This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.

See also: :ref:`distributed-basics` and :ref:`cuda-nn-ddp-instead`.
The same constraints on input as in :class:`torch.nn.DataParallel` apply.

Creation of this class requires that ``torch.distributed`` is already
initialized, by calling :func:`torch.distributed.init_process_group`.

``DistributedDataParallel`` is proven to be significantly faster than
:class:`torch.nn.DataParallel` for single-node multi-GPU data
parallel training.

Here is how to use it: on each host with N GPUs, you should spawn up N
processes, while ensuring that each process individually works on a single GPU
from 0 to N-1. Therefore, it is your job to ensure that your training script
operates on a single given GPU by calling:

    >>> torch.cuda.set_device(i)

where i is from 0 to N-1. In each process, you should refer to the following
to construct this module:

    >>> torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
    >>> model = DistributedDataParallel(model, device_ids=[i], output_device=i)

In order to spawn up multiple processes per node, you can use either
``torch.distributed.launch`` or ``torch.multiprocessing.spawn``

.. note ::
    Please refer to `PyTorch Distributed Overview <https://pytorch.org/tutorials/beginner/dist_overview.html>`__
    for a brief introduction to all features related to distributed training.

.. note:: ``nccl`` backend is currently the fastest and
    highly recommended backend to be used with Multi-Process Single-GPU
    distributed training and this applies to both single-node and multi-node
    distributed training

.. note:: This module also supports mixed-precision distributed training.
    This means that your model can have different types of parameters such
    as mixed types of fp16 and fp32, the gradient reduction on these
    mixed types of parameters will just work fine.
    Also note that ``nccl`` backend is currently the fastest and highly
    recommended backend for fp16/fp32 mixed-precision training.

.. note:: If you use ``torch.save`` on one process to checkpoint the module,
    and ``torch.load`` on some other processes to recover it, make sure that
    ``map_location`` is configured properly for every process. Without
    ``map_location``, ``torch.load`` would recover the module to devices
    where the module was saved from.

.. warning::
    This module works only with the ``gloo`` and ``nccl`` backends.

.. warning::
    Constructor, forward method, and differentiation of the output (or a
    function of the output of this module) is a distributed synchronization
    point. Take that into account in case different processes might be
    executing different code.

.. warning::
    This module assumes all parameters are registered in the model by the
    time it is created. No parameters should be added nor removed later.
    Same applies to buffers.

.. warning::
    This module assumes all parameters are registered in the model of each
    distributed processes are in the same order. The module itself will
    conduct gradient all-reduction following the reverse order of the
    registered parameters of the model. In other words, it is users'
    responsibility to ensure that each distributed process has the exact
    same model and thus the exact same parameter registration order.

.. warning::
    This module allows parameters with non-rowmajor-contiguous strides.
    For example, your model may contain some parameters whose
    :class:`torch.memory_format` is ``torch.contiguous_format``
    and others whose format is ``torch.channels_last``.  However,
    corresponding parameters in different processes must have the
    same strides.

.. warning::
    This module doesn't work with :func:`torch.autograd.grad` (i.e. it will
    only work if gradients are to be accumulated in ``.grad`` attributes of
    parameters).

.. warning::

    If you plan on using this module with a ``nccl`` backend or a ``gloo``
    backend (that uses Infiniband), together with a DataLoader that uses
    multiple workers, please change the multiprocessing start method to
    ``forkserver`` (Python 3 only) or ``spawn``. Unfortunately
    Gloo (that uses Infiniband) and NCCL2 are not fork safe, and you will
    likely experience deadlocks if you don't change this setting.

.. warning::
    Forward and backward hooks defined on :attr:`module` and its submodules
    won't be invoked anymore, unless the hooks are initialized in the
    :meth:`forward` method.

.. warning::
    You should never try to change your model's parameters after wrapping
    up your model with DistributedDataParallel. In other words, when
    wrapping up your model with DistributedDataParallel, the constructor of
    DistributedDataParallel will register the additional gradient
    reduction functions on all the parameters of the model itself at the
    time of construction. If you change the model's parameters after
    the DistributedDataParallel construction, this is not supported and
    unexpected behaviors can happen, since some parameters' gradient
    reduction functions might not get called.

.. note::
    Parameters are never broadcast between processes. The module performs
    an all-reduce step on gradients and assumes that they will be modified
    by the optimizer in all processes in the same way. Buffers
    (e.g. BatchNorm stats) are broadcast from the module in process of rank
    0, to all other replicas in the system in every iteration.

.. note::
    If you are using DistributedDataParallel in conjunction with the
    :ref:`distributed-rpc-framework`, you should always use
    :meth:`torch.distributed.autograd.backward` to compute gradients and
    :class:`torch.distributed.optim.DistributedOptimizer` for optimizing
    parameters.

Example::
    >>> import torch.distributed.autograd as dist_autograd
    >>> from torch.nn.parallel import DistributedDataParallel as DDP
    >>> from torch import optim
    >>> from torch.distributed.optim import DistributedOptimizer
    >>> import torch.distributed.rpc as rpc
    >>> from torch.distributed.rpc import RRef
    >>>
    >>> t1 = torch.rand((3, 3), requires_grad=True)
    >>> t2 = torch.rand((3, 3), requires_grad=True)
    >>> rref = rpc.remote("worker1", torch.add, args=(t1, t2))
    >>> ddp_model = DDP(my_model)
    >>>
    >>> # Setup optimizer
    >>> optimizer_params = [rref]
    >>> for param in ddp_model.parameters():
    >>>     optimizer_params.append(RRef(param))
    >>>
    >>> dist_optim = DistributedOptimizer(
    >>>     optim.SGD,
    >>>     optimizer_params,
    >>>     lr=0.05,
    >>> )
    >>>
    >>> with dist_autograd.context() as context_id:
    >>>     pred = ddp_model(rref.to_here())
    >>>     loss = loss_func(pred, target)
    >>>     dist_autograd.backward(context_id, loss)
    >>>     dist_optim.step()

.. warning::
    Using DistributedDataParallel in conjunction with the
    :ref:`distributed-rpc-framework` is experimental and subject to change.

Args:
    module (Module): module to be parallelized
    device_ids (list of int or torch.device): CUDA devices. This should
               only be provided when the input module resides on a single
               CUDA device. For single-device modules, the ``i``th
               :attr:`module` replica is placed on ``device_ids[i]``. For
               multi-device modules and CPU modules, device_ids must be None
               or an empty list, and input data for the forward pass must be
               placed on the correct device. (default: all devices for
               single-device modules)
    output_device (int or torch.device): device location of output for
                  single-device CUDA modules. For multi-device modules and
                  CPU modules, it must be None, and the module itself
                  dictates the output location. (default: device_ids[0] for
                  single-device modules)
    broadcast_buffers (bool): flag that enables syncing (broadcasting) buffers of
                      the module at beginning of the forward function.
                      (default: ``True``)
    process_group: the process group to be used for distributed data
                   all-reduction. If ``None``, the default process group, which
                   is created by ```torch.distributed.init_process_group```,
                   will be used. (default: ``None``)
    bucket_cap_mb: DistributedDataParallel will bucket parameters into
                   multiple buckets so that gradient reduction of each
                   bucket can potentially overlap with backward computation.
                   :attr:`bucket_cap_mb` controls the bucket size in MegaBytes (MB)
                   (default: 25)
    find_unused_parameters (bool): Traverse the autograd graph of all tensors
                                   contained in the return value of the wrapped
                                   module's ``forward`` function.
                                   Parameters that don't receive gradients as
                                   part of this graph are preemptively marked
                                   as being ready to be reduced. Note that all
                                   ``forward`` outputs that are derived from
                                   module parameters must participate in
                                   calculating loss and later the gradient
                                   computation. If they don't, this wrapper will
                                   hang waiting for autograd to produce gradients
                                   for those parameters. Any outputs derived from
                                   module parameters that are otherwise unused can
                                   be detached from the autograd graph using
                                   ``torch.Tensor.detach``. (default: ``False``)
    check_reduction: when setting to ``True``, it enables DistributedDataParallel
                     to automatically check if the previous iteration's
                     backward reductions were successfully issued at the
                     beginning of every iteration's forward function.
                     You normally don't need this option enabled unless you
                     are observing weird behaviors such as different ranks
                     are getting different gradients, which should not
                     happen if DistributedDataParallel is correctly used.
                     (default: ``False``)

Attributes:
    module (Module): the module to be parallelized

Example::

    >>> torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')
    >>> net = torch.nn.parallel.DistributedDataParallel(model)

**Init docstring**: Initializes internal Module state, shared by both nn.Module and ScriptModule.

**File**: \torch\nn\parallel\distributed.py

**Type**:           type


# nn.CrossEntropyLoss()
```python
nn.CrossEntropyLoss(
    weight= None,
    size_average=None,
    ignore_index=-100,
    reduce=None,
    reduction='mean',
) -> None
```
**Docstring**

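It combines `nn.LogSoftmax` and `nn.NLLLoss` in a single class and is commonly used for classification problems with n classes. `weight` assigns a weight to each class and should be a 1D tensor. `input` should contain the raw, unnormalized scores for each class, with shape $(m \times n)$ or, for the K-dimensional case, $(m \times n \times d_1 \times d_2 \times \cdots \times d_K), K \geq 1$, where m is the batch size and $d_1, \cdots, d_K$ are the additional dimensions.

The class expects `target` to be a 1D tensor of length m in which every value is a class index in the range $[0, n - 1]$; if `ignore_index` is specified, the target may also contain that value. Writing the input for one sample as $\boldsymbol{x} = (x_1, x_2, \cdots, x_n)$ and its target class as $i$, the loss is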

$$\text{loss}(x_i) = -\log\left( \frac{\exp(x_i)}{\sum_{j = 1}^n \exp(x_j)} \right) = -x_i + \log\left(\sum_{j = 1}^n \exp(x_j)\right)$$

When `weight` is specified, with weights $\boldsymbol{w} = (w_1, w_2, \cdots, w_n)$, the loss is

$$\text{loss}(x_i) = w_i \left(-x_i + \log\left(\sum_{j = 1}^n \exp(x_j)\right)\right)$$


**Args**

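- weight: a weight assigned to each class; should be a 1D tensor of length n

- size_average: deprecated, see `reduction`. Defaults to ``True``, in which case the loss is averaged over all samples in the batch; if `size_average` is set to ``False``, the losses are instead summed for each batch. Ignored when `reduce` is ``False``. Note that for some losses, there are multiple elements per sample.

- ignore_index: specifies a target value that is ignored and does not contribute to the input gradient; when `size_average` is ``True``, the loss is averaged over the non-ignored targets

- reduce: deprecated, see `reduction`. When ``False``, returns one loss value per batch element and ignores `size_average`; defaults to ``True``

- reduction: one of ``'none'``, ``'mean'``, ``'sum'``; defaults to ``'mean'``. `size_average` and `reduce` are being deprecated, but for now specifying either of them overrides `reduction`
    - ``'none'``: no reduction is applied
    - ``'mean'``: the sum of the output is divided by the number of elements in the output
    - ``'sum'``: the output is summed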

**Shape**

- input: shape $(m \times n)$, where n is the number of classes; or, for a K-dimensional loss, shape $(m \times n \times d_1 \times d_2 \times \cdots \times d_K), K \geq 1$

- target: shape $(m)$, or, for a K-dimensional loss, shape $(m \times d_1 \times d_2 \times \cdots \times d_K)\,, K \geq 1$; each element satisfies $0 \leq \text{target}[i] \leq n - 1$

- output: if `reduction` is ``'none'``, the same shape as `target`; otherwise a scalar

**File**:   \torch\nn\modules\loss.py

**Type**:           type

In [73]:
loss_func = nn.CrossEntropyLoss(reduction="none")
output = torch.tensor([[1.0, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=torch.float32, requires_grad=True)
# output = torch.ones([3, 3])
target = torch.arange(3, dtype=torch.long)
loss = loss_func(output, target)
print(loss)

tensor([0.5514, 0.5514, 0.5514], grad_fn=<NllLossBackward>)
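
The statements in the docstring (the `LogSoftmax` + `NLLLoss` decomposition, the closed-form expression, the effect of `weight` and of `ignore_index`) can be checked numerically. This is a small sketch on the same logits as the cell above; the weight vector is an arbitrary choice.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
target = torch.tensor([0, 1, 2])

# 1) CrossEntropyLoss equals NLLLoss applied to LogSoftmax.
ce = nn.CrossEntropyLoss(reduction="none")(logits, target)
nll = nn.NLLLoss(reduction="none")(nn.LogSoftmax(dim=1)(logits), target)

# 2) Closed form: -x_i + log(sum_j exp(x_j)), with i the target class.
picked = logits.gather(1, target.unsqueeze(1)).squeeze(1)
manual = -picked + torch.logsumexp(logits, dim=1)

# 3) Per-class weights scale each sample's loss by w_i.
w = torch.tensor([1.0, 2.0, 3.0])
weighted = nn.CrossEntropyLoss(weight=w, reduction="none")(logits, target)

# 4) Targets equal to ignore_index contribute a loss of zero.
ignored = nn.CrossEntropyLoss(reduction="none", ignore_index=-100)(
    logits, torch.tensor([0, -100, 2]))

print(ce)        # tensor([0.5514, 0.5514, 0.5514])
print(weighted)  # tensor([0.5514, 1.1029, 1.6543])
print(ignored)   # tensor([0.5514, 0.0000, 0.5514])
```

With `reduction="none"` the per-sample values stay visible, which makes these identities easy to compare element by element.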



# nn.Embedding()
```python
nn.Embedding(
    num_embeddings,
    embedding_dim,
    padding_idx=None,
    max_norm=None,
    norm_type=2.0,
    scale_grad_by_freq=False,
    sparse=False,
    _weight=None,
) -> None
```
**Docstring**

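A simple lookup table that stores the embeddings of a fixed dictionary and size. This module is often used to store word embeddings: *its input is a list of indices, and its output is the corresponding word embeddings*.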

**Args**:

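- num_embeddings: size of the embedding dictionary, i.e. the number of word vectors it holds

- embedding_dim: size of each embedding vector, i.e. the dimension of each word vector

- padding_idx: if specified, the vector at position `padding_idx` of the lookup table is initialized to all zeros. Every other vector in the table is updated during training, but this one always receives a zero gradient, so it is not updated

- max_norm: if specified, every embedding vector whose norm exceeds `max_norm` is renormalized to have norm `max_norm`

- norm_type: the p of the p-norm used for the `max_norm` option; default 2

- scale_grad_by_freq: if set, gradients are scaled by the inverse of the frequency of the words in the mini-batch; default ``False``

- sparse: when ``True``, the gradient with respect to the weight matrix is a sparse tensor. Only a limited number of optimizers support sparse gradients; those currently supported are `optim.SGD` (`CUDA` and `CPU`), `optim.SparseAdam` (`CUDA` and `CPU`) and `optim.Adagrad` (`CPU`)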

**Attributes**

- weight: the learnable weights of the module, of shape (num_embeddings, embedding_dim), initialized from $\mathcal{N}(0, 1)$



**File**:   \torch\nn\modules\sparse.py

**Type**:           type

### Examples

In [3]:
embedding = nn.Embedding(10, 3)
input = torch.LongTensor([[0, 1],[2, 3]])  # get the 1st, 2nd vector in the 1st batch, and the 3rd, 4th vector in the 2nd batch
print(embedding(input))
print(embedding.weight)

tensor([[[-0.2980,  1.1256, -0.5481],
         [ 0.3493,  0.9749, -0.9385]],

        [[-1.5044, -0.6946,  0.2112],
         [-0.8295,  0.4603, -0.3248]]], grad_fn=<EmbeddingBackward>)
Parameter containing:
tensor([[-0.2980,  1.1256, -0.5481],
        [ 0.3493,  0.9749, -0.9385],
        [-1.5044, -0.6946,  0.2112],
        [-0.8295,  0.4603, -0.3248],
        [-1.7520,  0.1621, -1.2363],
        [ 0.6270, -1.0117,  0.4207],
        [-1.1248,  0.2388,  1.0833],
        [ 0.7649, -0.8190, -0.3316],
        [ 1.1387, -0.5347, -0.0481],
        [ 0.2216,  2.0920,  1.7258]], requires_grad=True)


In [None]:
# example with padding_idx
embedding = nn.Embedding(10, 3, padding_idx=3)
input = torch.LongTensor([3, 2, 1, 0, 1, 2, 3])
print(embedding(input))
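
To make the effect of `padding_idx` concrete, the sketch below repeats the example and additionally inspects the gradient; summing the output is just a convenient way to obtain a scalar to differentiate.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=3)
idx = torch.tensor([3, 2, 1, 0, 1, 2, 3])

out = emb(idx)
print(out.shape)      # torch.Size([7, 3])
print(emb.weight[3])  # the padding row is all zeros

out.sum().backward()
print(emb.weight.grad[3])  # tensor([0., 0., 0.]): the padding row receives no gradient
print(emb.weight.grad[1])  # tensor([2., 2., 2.]): index 1 occurs twice in idx
```

Rows that never appear in `idx` also have a zero gradient, but only the padding row is guaranteed to keep it regardless of the input.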


# nn.Linear()
`nn.Linear(in_features, out_features, bias=True) -> None`

**Docstring**

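Applies a linear transformation to the input: $y = x A^T + b$. The init docstring reads: initializes internal module state, shared by both `nn.Module` and `ScriptModule`.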

**Args**

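- in_features: size of each input sample, i.e. the input tensor of this layer should have shape `(N, *, in_features)`, where `*` means any number of additional dimensions

- out_features: size of each output sample; analogously, the output tensor has shape `(N, *, out_features)`

- bias: if set to False, the layer does not learn an additive bias; default True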

**Attributes**

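- weight: learnable parameter of shape `(out_features, in_features)`, initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{1}{\text{in_features}}$

- bias: when `bias` is True, a learnable parameter of shape `(out_features)`, initialized in the same way as the weight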
            
**File**:  \torch\nn\modules\linear.py

**Type**:           type

**Subclasses**:     _LinearWithBias, Linear

### Examples

In [None]:
m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
print(output.size())
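
The formula $y = x A^T + b$, the `(N, *, in_features)` input convention and the initialization range can be verified directly. This is a minimal sketch; the extra dimension of size 5 is an arbitrary choice.

```python
import torch
import torch.nn as nn

m = nn.Linear(20, 30)
x = torch.randn(128, 5, 20)  # one extra dimension between N and in_features

y = m(x)
manual = x @ m.weight.t() + m.bias  # y = x A^T + b

print(m.weight.shape, m.bias.shape)            # torch.Size([30, 20]) torch.Size([30])
print(y.shape)                                 # torch.Size([128, 5, 30])
print(torch.allclose(y, manual, atol=1e-5))    # True

# Initial weights lie in [-sqrt(k), sqrt(k)] with k = 1 / in_features.
bound = (1.0 / 20) ** 0.5
print(bool(m.weight.abs().max() <= bound))     # True
```

Only the last dimension is transformed; all leading dimensions are treated as batch dimensions.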


# nn.Conv2d()
```python
nn.Conv2d(
    in_channels: int,
    out_channels: int,
    kernel_size: Union[int, Tuple[int, int]],
    stride: Union[int, Tuple[int, int]] = 1,
    padding: Union[int, Tuple[int, int]] = 0,
    dilation: Union[int, Tuple[int, int]] = 1,
    groups: int = 1,
    bias: bool = True,
    padding_mode: str = 'zeros',
)
```
**Docstring**:

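Applies a 2D convolution over an input signal composed of several input planes. For an input feature map $\boldsymbol{X}^{(in)}$ of shape $(N \times C_{i} \times H_{i} \times W_{i})$ and an output feature map $\boldsymbol{X}^{(out)}$ of shape $(N \times C_{o} \times H_{o} \times W_{o})$, the convolution can be described as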
$$
\boldsymbol{X}^{(out)}_{n c_o}= \boldsymbol{b}_{c_o} + \sum_{c_i = 1}^{C_{i}} \boldsymbol{w}_{c_o c_i} \star \boldsymbol{X}^{(in)}_{n c_i} \;,\; c_o = 1, 2, \cdots, C_{o} \;,\; n = 1, 2, \cdots, N
$$
with the output size given by
$$
H_{o} = \left\lfloor\frac{H_{i}  + 2 P_h - D_h \times (K_h - 1) - 1}{S_h} + 1\right\rfloor\\
W_{o} = \left\lfloor\frac{W_{i}  + 2 P_w - D_w \times (K_w - 1) - 1}{S_w} + 1\right\rfloor
$$
where $\star$ is the [cross-correlation operator](https://en.wikipedia.org/wiki/Cross-correlation); for animated illustrations of convolution see [here](https://github.com/vdumoulin/conv_arithmetic).
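
The output-size formula can be checked against an actual layer. In the sketch below `conv_out_size` is a hypothetical helper written for this check, not a PyTorch function, and the layer hyperparameters are arbitrary.

```python
import math

import torch
import torch.nn as nn


def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """floor((size + 2P - D * (K - 1) - 1) / S + 1) for one spatial dimension."""
    return math.floor((size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)


conv = nn.Conv2d(3, 16, kernel_size=(3, 5), stride=(2, 1), padding=(1, 2), dilation=(2, 1))
x = torch.randn(4, 3, 32, 40)

h_out = conv_out_size(32, kernel=3, stride=2, padding=1, dilation=2)
w_out = conv_out_size(40, kernel=5, stride=1, padding=2, dilation=1)
print(h_out, w_out)   # 15 40
print(conv(x).shape)  # torch.Size([4, 16, 15, 40])
```

Using different values for height and width shows that each spatial dimension is computed independently from its own kernel size, stride, padding and dilation.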

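Note that when the CUDA backend with CuDNN is used, this operator may select a nondeterministic algorithm to increase performance. If deterministic behavior is required, you can set ``torch.backends.cudnn.deterministic = True``, possibly at a performance cost. See `/notes/randomness` for more background.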

**Args**:

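- in_channels, out_channels: the number of channels in the input feature map and the number of channels produced by the convolution

- kernel_size, stride, padding, dilation: each can be an integer or a tuple. An integer `n` is used for both the height and width dimensions; with a tuple `(h, w)`, h is used for the height dimension and w for the width dimension. `dilation` is the spacing at which the kernel samples the input feature map

- padding_mode: one of ``'zeros'``, ``'reflect'``, ``'replicate'``, ``'circular'``; default ``'zeros'``

- groups: an integer that must divide both $C_{i}$ and $C_{o}$; it controls how many convolutions are performed independently. For example, `groups=2` means the input feature map is split into two halves along the channel dimension, each half is convolved separately, and the results are then concatenated (the mechanism used in AlexNet). When `groups == in_channels` and `out_channels == K * in_channels`, the operation is also known as a depthwise convolution

- bias: when ``True``, a learnable bias is added to the output; default ``True``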



**Attributes**:

- weight: shape $(C_{o} \times \frac{C_{i}}{groups} \times K_h \times K_w)$; by default its elements are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{groups}{C_{i} \; K_h \; K_w}$

- bias: shape $(C_{o})$; by default its elements are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{groups}{C_{i} \; K_h \; K_w}$


**File**:          \torch\nn\modules\conv.py

**Type**:           type

**Subclasses**:     Conv2d, ConvBn2d
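
The `groups` argument and the resulting weight shape can be illustrated with a short sketch. The channel counts are arbitrary, and the manual split into two halves is only there to show the equivalence described for `groups=2`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 8)

grouped = nn.Conv2d(4, 6, kernel_size=3, groups=2, bias=False)
print(grouped.weight.shape)  # torch.Size([6, 2, 3, 3]): (C_out, C_in / groups, K_h, K_w)

# groups=2 equals two independent convolutions on the channel halves, concatenated.
first = F.conv2d(x[:, :2], grouped.weight[:3])
second = F.conv2d(x[:, 2:], grouped.weight[3:])
stacked = torch.cat([first, second], dim=1)
print(torch.allclose(grouped(x), stacked, atol=1e-5))  # True

# Depthwise: groups == in_channels and out_channels == K * in_channels (here K = 2).
depthwise = nn.Conv2d(4, 8, kernel_size=3, groups=4)
print(depthwise.weight.shape)  # torch.Size([8, 1, 3, 3])
print(depthwise(x).shape)      # torch.Size([1, 8, 6, 6])
```

Each depthwise filter sees a single input channel, which is why the second dimension of its weight is 1.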
