# Studying nn.Embedding and nn.Linear

https://medium.com/@gautam.e/what-is-nn-embedding-really-de038baadd24

https://www.youtube.com/watch?v=XswEBzNgIYc

In [1]:
# !pip install scipy
import numpy as np
from scipy.sparse import csr_matrix

In [2]:
import torch
import torch.nn as nn

## Check nn.Embedding

In [3]:
embedding_layer = nn.Embedding(num_embeddings=10, embedding_dim=4)
optimizer = torch.optim.Adam(embedding_layer.parameters(), lr=0.001)
loss_func = nn.MSELoss()  # instantiate the loss module (the bare class is not callable on tensors)

In [4]:
print(embedding_layer)

Embedding(10, 4)


In [5]:
embedding_layer.weight.data

tensor([[ 0.2753,  0.6012, -0.7758,  0.4877],
        [-0.4799,  0.9522,  1.6895, -0.7348],
        [ 1.0987, -0.0538, -0.6264, -1.1798],
        [-0.2704,  0.8823, -0.6534, -0.2049],
        [-0.7768,  0.1396,  0.5873, -1.7623],
        [ 0.6802,  1.6255, -0.2280,  1.4884],
        [ 0.1690,  2.7113, -0.1665, -0.8009],
        [-0.0663,  0.2883, -1.2854, -0.0727],
        [ 0.4348, -0.2138,  0.1903, -1.4876],
        [-0.3714, -0.6321,  0.5146, -1.1418]])

In [6]:
embedding_layer.weight.data[9]

tensor([-0.3714, -0.6321,  0.5146, -1.1418])

embedding_layer is just a lookup table: it has 10 indices (0 to 9), and each index maps to one vector (dim = 4).
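The lookup-table view can be checked directly. The following is a minimal sketch, with illustrative names (`table`, `idx`) and an arbitrary seed that are not part of the notebook: it indexes the weight matrix by hand and compares the result with the layer's forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable
table = nn.Embedding(num_embeddings=10, embedding_dim=4)

idx = torch.tensor([3, 2, 0], dtype=torch.long)
via_layer = table(idx)            # forward pass of the layer
via_indexing = table.weight[idx]  # plain row indexing into the weight matrix

# Both routes return exactly the same rows.
same = torch.equal(via_layer, via_indexing)
print(same)
```

The forward pass involves no arithmetic at all; it only selects rows, which is why the two results are bit-for-bit identical.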

In [7]:
input_tensor1 = torch.tensor([3,2,0], dtype=torch.long)
result1 = embedding_layer(input_tensor1) # look up the embeddings at indices 3, 2 and 0
print(result1.shape)
print(result1)

torch.Size([3, 4])
tensor([[-0.2704,  0.8823, -0.6534, -0.2049],
        [ 1.0987, -0.0538, -0.6264, -1.1798],
        [ 0.2753,  0.6012, -0.7758,  0.4877]], grad_fn=<EmbeddingBackward0>)


In [8]:
input_tensor2 = torch.tensor([[1, 3], [0,1]], dtype=torch.long)
result2 = embedding_layer(input_tensor2) # rows 1 and 3 form the first 2x4 matrix, rows 0 and 1 the second; stacked, they give a 2x2x4 tensor
print(result2.shape)
print(result2)

torch.Size([2, 2, 4])
tensor([[[-0.4799,  0.9522,  1.6895, -0.7348],
         [-0.2704,  0.8823, -0.6534, -0.2049]],

        [[ 0.2753,  0.6012, -0.7758,  0.4877],
         [-0.4799,  0.9522,  1.6895, -0.7348]]], grad_fn=<EmbeddingBackward0>)
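The `grad_fn=<EmbeddingBackward0>` in the outputs above indicates that the lookup is differentiable. The optimizer and loss function created earlier are never used in this notebook, so here is a hedged sketch of what a single training step might look like. The all-zero `target` is made up purely for illustration, and the names `layer`, `rows_with_grad` and `rows_changed` are not from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable
layer = nn.Embedding(num_embeddings=10, embedding_dim=4)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
loss_func = nn.MSELoss()

idx = torch.tensor([3, 2, 0], dtype=torch.long)
target = torch.zeros(3, 4)  # made-up regression target, for illustration only

before = layer.weight.detach().clone()
optimizer.zero_grad()
loss = loss_func(layer(idx), target)
loss.backward()
optimizer.step()

# Rows of the table that received a non-zero gradient.
rows_with_grad = layer.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
# Rows of the table whose values changed after the optimizer step.
rows_changed = (layer.weight.detach() != before).any(dim=1).nonzero().flatten().tolist()
print(rows_with_grad, rows_changed)
```

Only the rows that were looked up (0, 2 and 3) receive gradient and get updated in this step; the rest of the table is left untouched.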


## Now let us look at nn.Linear
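Embedding vs. Linear: how do they differ by definition?
An embedding is the same thing as a bias-free linear layer applied to one-hot inputs, but it works differently in that it does a **lookup** instead of a matrix-vector multiplication.
In other words, nn.Embedding is a layer built for lookups, while an ordinary nn.Linear is built for matrix multiplication.
Having two separate layers mainly reflects their different use cases, which lead to different storage and compute efficiency.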
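The equivalence between a lookup and a matrix multiplication can be made concrete with one-hot vectors. The sketch below uses illustrative names and an arbitrary seed (assumptions, not from the notebook): it multiplies one-hot encoded indices by the embedding weight matrix and compares the product with the direct lookup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
idx = torch.tensor([3, 2, 0], dtype=torch.long)

lookup = emb(idx)                                  # direct row lookup, shape (3, 4)
one_hot = F.one_hot(idx, num_classes=10).float()   # dense one-hot rows, shape (3, 10)
matmul = one_hot @ emb.weight                      # matrix product, shape (3, 4)

match = torch.allclose(lookup, matmul)
print(match)
```

The matrix product multiplies every row of the table by mostly zeros, whereas the lookup skips that work entirely. That is the efficiency difference the text refers to.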


### Why use an embedding when we have a linear layer?
An embedding is an efficient alternative to a single linear layer when one has a large number of input features.
This may happen in natural language processing (NLP) when one is working with text data, or with some (language-like) tabular data that is treated as a bag-of-words (BoW). In such cases it's also quite common to have the input data available as a sparse matrix (typically the output of sklearn’s CountVectorizer or TfidfVectorizer, as a scipy.sparse.csr_matrix). Converting that into a dense matrix is memory-inefficient, but it is really easy to access its non-zero elements and their positions directly instead (using the data and indices attributes).

In [9]:
# Initialize a sparse matrix: This could be your training set
X_train = csr_matrix(np.array([[1, 0, 1, 0],
                               [0, 0, 1, 1],
                               [1, 1, 1, 0]]))
# Get one row: One sample in the training set
row = X_train.getrow(0)

In [10]:
row

<1x4 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>
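To see what the sparse row actually stores, the short sketch below rebuilds the same matrix and reads the two attributes mentioned above. The names `col_positions` and `values` are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

X_train = csr_matrix(np.array([[1, 0, 1, 0],
                               [0, 0, 1, 1],
                               [1, 1, 1, 0]]))
row = X_train.getrow(0)

col_positions = row.indices.tolist()  # column indices of the non-zero entries
values = row.data.tolist()            # the non-zero values themselves
print(col_positions, values)
```

For the first row, the stored positions are columns 0 and 2 and both stored values are 1, so the zeros never need to be materialised.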

Now let’s pass the training example row through the linear layer and the embedding so that we get the same result in each case.

In [11]:
w_linear = nn.Linear(4,3,bias=False)
print(w_linear.weight)

Parameter containing:
tensor([[ 0.1443, -0.1548, -0.3186,  0.0693],
        [ 0.4708, -0.4276,  0.4181,  0.3375],
        [-0.4092,  0.4163, -0.4208, -0.0187]], requires_grad=True)


In [12]:
row_dense = torch.FloatTensor(row.toarray())
print(row_dense)

tensor([[1., 0., 1., 0.]])


In [13]:
prob_linear = w_linear(row_dense)

In [14]:
print(prob_linear)

tensor([[-0.1743,  0.8889, -0.8300]], grad_fn=<MmBackward0>)


In [15]:
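# from_pretrained is a classmethod, so no throwaway instance is needed; the weights are frozen by default
w_embedding = nn.Embedding.from_pretrained(w_linear.weight.T)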
print(w_embedding.weight)

Parameter containing:
tensor([[ 0.1443,  0.4708, -0.4092],
        [-0.1548, -0.4276,  0.4163],
        [-0.3186,  0.4181, -0.4208],
        [ 0.0693,  0.3375, -0.0187]])


In [16]:
print(w_embedding(torch.tensor(row.indices)).sum(0))

tensor([-0.1743,  0.8889, -0.8300])


The outputs are the same. Yay! 
A couple of observations to keep in mind when you’re using this in your own nn.Module:

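1. The embedding weights and the linear layer’s weights are transposes of each other.
2. The linear layer w_linear does the actual matrix-vector multiplication and therefore needs the row to be converted to dense format. In contrast, w_embedding only needs the indices of row to do a lookup. Not only is this faster, but it’s also quite convenient, because the csr_matrix exposes those positions directly through its indices attribute.
3. The embedding requires the sum(0): the lookup returns one vector per non-zero index, and summing them over dimension 0 reproduces the matrix-vector product. This works here because every non-zero entry of row equals 1; for other values, each looked-up vector would first have to be scaled by the matching entry of row.data. Don’t forget it!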
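The plain `sum(0)` matches the linear layer only when all non-zero entries are 1. As a hedged sketch of the more general case, the example below uses a hypothetical count row containing a 3 (not from the notebook) and scales each looked-up vector by its value before summing. It also shows `nn.EmbeddingBag` with `mode="sum"` and `per_sample_weights`, which fuses the lookup and the weighted sum into one call. The variable names and the seed are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.sparse import csr_matrix

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable
# Hypothetical count row with a value other than 1 (e.g. a word seen 3 times).
row = csr_matrix(np.array([[3, 0, 1, 0]]))

w_linear = nn.Linear(4, 3, bias=False)
weight_t = w_linear.weight.detach().T.contiguous()  # transposed weights, shape (4, 3)

indices = torch.tensor(row.indices, dtype=torch.long)
values = torch.tensor(row.data, dtype=torch.float32)

# Reference: dense matrix-vector product through the linear layer.
dense = w_linear(torch.tensor(row.toarray(), dtype=torch.float32)).detach()

# Manual version: scale each looked-up vector by its value, then sum.
w_embedding = nn.Embedding.from_pretrained(weight_t)
manual = (w_embedding(indices) * values.unsqueeze(1)).sum(0)

# Fused version: one bag starting at offset 0, weighted by the stored values.
bag = nn.EmbeddingBag.from_pretrained(weight_t, mode="sum")
fused = bag(indices, offsets=torch.tensor([0]), per_sample_weights=values)

manual_ok = torch.allclose(dense[0], manual, atol=1e-6)
fused_ok = torch.allclose(dense, fused, atol=1e-6)
print(manual_ok, fused_ok)
```

Both variants agree with the dense result up to floating-point rounding. For TF-IDF style inputs, where the values are rarely 1, one of these weighted forms would be needed instead of the bare `sum(0)`.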
