## Swin Transformer from Scratch

### 1. How do we generate patch embeddings from an image?
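Method 1
- Split the image into patches with PyTorch's unfold API. This mimics a convolution: setting kernel_size = stride = patch_size yields non-overlapping patches.
- unfold returns a tensor of shape [bs, patch_depth, num_patch]; transpose the last two dimensions to get [bs, num_patch, patch_depth].
- Multiply this tensor by a weight matrix of shape [patch_depth, model_dim_C] to obtain the patch embedding of shape [bs, num_patch, model_dim_C].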
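As a quick sanity check on the shapes in Method 1, the following minimal sketch unfolds a random batch and confirms that each patch is flattened in (channel, row, column) order. The sizes used here (2 images, 3 channels, 16×16 pixels, patch_size 4) are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions for this sketch)
bs, channels, height, width = 2, 3, 16, 16
patch_size = 4
image = torch.randn(bs, channels, height, width)

# F.unfold returns [bs, channels * patch_size**2, num_patch]
patches = F.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(-1, -2)  # [bs, num_patch, patch_depth]

num_patch = (height // patch_size) * (width // patch_size)
patch_depth = channels * patch_size * patch_size
assert patches.shape == (bs, num_patch, patch_depth)

# Patch 0 is the top-left block, flattened in (channel, row, column) order
top_left = image[:, :, :patch_size, :patch_size].reshape(bs, -1)
assert torch.equal(patches[:, 0], top_left)
```

This flattening order is what allows the same weight matrix to be reshaped into a convolution kernel of shape [model_dim_C, input_channel, patch_size, patch_size].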

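Method 2
- patch_depth equals input_channel × patch_size × patch_size.
- model_dim_C plays the role of the number of output channels of a 2D convolution.
- Reshape the weight matrix of shape [patch_depth, model_dim_C] into a convolution kernel of shape [model_dim_C, input_channel, patch_size, patch_size].
- Call PyTorch's conv2d API with stride = patch_size to get the convolution output, a tensor of shape [bs, output_channel, height, width], where output_channel = model_dim_C and height × width = num_patch.
- Flatten the two spatial dimensions and transpose to [bs, num_patch, model_dim_C]; this is the patch embedding.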
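The steps of Method 2 can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `image2emb_conv` and the sizes used are assumptions for this example. The result is compared against an unfold-based reference to check that the two methods agree.

```python
import torch
import torch.nn.functional as F

def image2emb_conv(image, kernel, stride):
    """Patch embedding via a strided 2D convolution (hypothetical helper)."""
    conv_output = F.conv2d(image, kernel, stride=stride)  # [bs, model_dim_C, h, w]
    bs, out_channels, out_h, out_w = conv_output.shape
    # Flatten the spatial grid, then move channels last: [bs, num_patch, model_dim_C]
    return conv_output.reshape(bs, out_channels, out_h * out_w).transpose(-1, -2)

# Illustrative sizes (assumptions for this sketch)
patch_size, model_dim_C, in_channels = 4, 8, 3
image = torch.randn(2, in_channels, 16, 16)
weight = torch.randn(in_channels * patch_size * patch_size, model_dim_C)

# [patch_depth, model_dim_C] -> [model_dim_C, in_channels, patch_size, patch_size]
kernel = weight.transpose(0, 1).reshape(model_dim_C, in_channels, patch_size, patch_size)

# Reference result from the unfold-based approach (Method 1)
reference = F.unfold(image, kernel_size=patch_size, stride=patch_size).transpose(-1, -2) @ weight

emb_conv = image2emb_conv(image, kernel, stride=patch_size)
assert emb_conv.shape == (2, 16, model_dim_C)
assert torch.allclose(reference, emb_conv, atol=1e-4)
```

The transpose before the reshape matters: each column of the weight matrix corresponds to one output channel, so it has to become one row before being folded into a [input_channel, patch_size, patch_size] kernel.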



In [15]:

import torch
import torch.nn as nn
import torch.nn.functional as F

import math

# Key step 1: patch embedding
def image2emb_naive(image, patch_size, weight):
    """Patch embedding, implemented the straightforward way with unfold."""
    # F.unfold only accepts a batched 4-D tensor here, so image has shape bs*channel*h*w
    patch_image = F.unfold(
        image,
        kernel_size=(patch_size, patch_size),
        stride=(patch_size, patch_size),
    ).transpose(-1, -2)  # bs*num_patch*patch_depth (patch_depth = number of values in one patch)
    patch_embedding = patch_image @ weight  # bs*num_patch*model_dim_C
    return patch_embedding

# Verify
patch_size = 4
model_dim_C = 100
images = torch.randn(2, 3, 16, 16)
weight = torch.randn(patch_size * patch_size * 3, model_dim_C)
print(image2emb_naive(images, patch_size, weight).shape)

torch.Size([2, 16, 100])
