## How to represent meaning?

Meaning is the combination of the signifier(symbol) and the signified(idea or thing). For example, the English word 'apple' acts as a signifier for the concept of a particular type of fruit. In Chinese, the signifier changes to '苹果' (píngguǒ), while the signified—the mental concept of that fruit—remains essentially the same."

## How to organize words in computers?
WordNet organize words by meaning. Specifically, it utilizes synonyms and antonyms, hypernyms and hyponyms, meronyms and holonyms.
However, WordNet sometimes misclassifies words like good and proficient into synonyms, or misses new meaning of words like wicked, ninja, etc. Especially, it can't compute accurate word similarity!

Traditional NLP represents words as one-hot vectors. The problem is that the vetor dimension depends on the size of the vocabulary. Say you have a dictionary as vocabulary, dimension disaster is bound to occur. And chances are that the number of dimensions is greater than the number of samples. Besides, one-hot word vectors and orthogonal, namely no similarity.

## Starting from Distributed Semantics: A word's meaning is given by the words that frequently appear close-by
*"You shall know a word by the company it keeps"  --J.R. Firth*

Based on this idea, people proposed Word2Vec (an implementation of word embedding that represents words using low-dimensional, dense real-valued vectors) to obtain word vectors with semantic features, which is somewhat analogous to a backbone network extracting features from images in computer vision. So unlike one-hot word vectors that are acquired through a tool class, Word2Vec word vectors are learnt.

Word2Vec mainly has two implementations, forecasting context given centre(Skip-gram) and forecasting centre given context(CBOW). Take Skip-gram as an example, essentially, it is a multi-classification task. We want that, for every position *t*, for every word in the context with window step *j* and window size *m*, the product $P(w_{t+j} \mid w_t)$ is maximal.

$$
\prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t)
$$


For better optimization, we convert it to Negative Log Likelihood: $$
-\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)
$$

This is actually Cross Entropy Loss, where $P(w_{t+j} \mid w_t)$ is softmaxing model's output logits. So in reality, we use nn.CrossEntropyLoss() as the loss function. Specifically, we calculate $P(w_{t+j} \mid w_t)$ this way:
$$
P(w_{t+j} \mid w_t)=P(o \mid c) = \frac{\exp \big( v_c \cdot u_o \big)}{\sum_{w=1}^V \exp \big( v_c \cdot u_w \big)}
$$
, where *v* is the word vector of centre words and *u* is the word vector of context words.
### How to build the dataset?

As shown in the formula $P(w_{t+j} \mid w_t)$, every sample is composed of a centre word *$w_{t}$* and a context word *$w_{t+j}$*, so our features is the vectorization representation of *$w_{t}$*, and our label is that of *$w_{t+j}$*. If you use nn.Linear to implement the model, you should use one-hot encoding to vectorize words. For example, we have a corpus "I love learning German and German history". The vocabulary should be like \[I, love, learning, German, and, history \], which determines our one-hot vectors to be 6 dimensional(the size of the vocabulary). So word "*I*" is denoted as \[1, 0, 0, 0, 0, 0\] while word "*German*" is denoted as\[0, 0, 0, 1, 0, 0\]. However, due to the reason that nn.CrossEntropyLoss receives ordinal indices as label y, we need an different approach when handling the generation the y. The word "*German*", for instance, is encoded as *3*.

### How to build the model architecture?

Although we already have pytorch implementations like nn.Embedding, we can also implement this with nn.Linear. See details in the codes way below.

```
# Word Representation 结构图

- **Word Vector（词向量，大类）**
  - One-hot Vector（最原始的，高维、稀疏、无语义）
  - **Word Embedding（更高级的，低维、稠密、有语义）**
    - Word2Vec（Google 提出的具体算法，CBOW/Skip-gram）
    - GloVe
    - FastText
    - （更复杂的神经网络模型）
      - ELMo
      - BERT
      - GPT 系列

---

**Neural Word Representation（神经词表示）**  
: 用神经网络学出来的词向量（涵盖 Word2Vec、GloVe、FastText、ELMo、BERT、GPT 等）
```


Token vs Type:

"Word word word word."

​​Tokens​​ = 4  
​​Types​​ = 1


## Below is an example of Skip-Gram(btw, skip-gram and CBOW are two implementations of Word2Vec. This note is all about Skip-Gram).

In [465]:
import torch
import torch.nn as nn
import collections

corpus="""Python is a popular programming language for machine learning and data science. 
Machine learning algorithms require large datasets for training neural networks. 
Deep learning models use multiple layers to extract features from raw data. 
Artificial intelligence is transforming various industries with automation solutions. 
Natural language processing helps computers understand human language patterns. 
Computer vision enables machines to recognize objects in images and videos. 
Data scientists use pandas numpy and matplotlib for data analysis tasks. 
Reinforcement learning agents learn through trial and error interactions. 
Cloud computing provides scalable resources for deploying AI applications. 
GitHub is a platform for version control and collaborative software development."""

tokens=corpus.lower().replace(".",'').replace("\n"," ").split(" ")

counter=collections.Counter(tokens)

vocab=[tp[0]for tp in sorted(counter.items(), key=lambda x:x[1], reverse=True)]
vocab_size=len(vocab)

word2idx={word:i for i,word in enumerate(vocab)}

In [467]:
def one_hot_encoder(word):
    vector=torch.zeros(vocab_size)
    vector[word2idx[word]]=1
    return vector.reshape(1,-1)

In [504]:
def generate_dataset(tokens, window=5):
    X=None
    y=None
    for token in tokens:
        for step in range(-window,window+1):
            if step == 0:
                continue
            context_idx=tokens.index(token)+step
            if context_idx < 0 or context_idx > len(tokens)-1:
                continue
            context=tokens[context_idx]
            context_vocab_idx=vocab.index(context)
            if X is None:
                X=one_hot_encoder(token)
            else:
                X=torch.cat((X,one_hot_encoder(token)), dim=0)
            if y is None:
                y=torch.tensor(context_vocab_idx).reshape(1)
            else:
                y=torch.cat((y, torch.tensor(context_vocab_idx).reshape(1)), dim=0)
    return X,y

In [505]:
X,y=generate_dataset(tokens)

In [529]:
class Skip_Gram(nn.Module):
    def __init__(self, vocab_size, embedding_dim=20):
        super().__init__()
        self.in_embed=nn.Linear(vocab_size, embedding_dim, bias=False) #its paramters are word vectors of centre words
        self.out_embed=nn.Linear(embedding_dim, vocab_size, bias=False) #its paramters are word vectors of context words
    def forward(self, x):
        #事实上，用中心词向量和上下文词向量做点积计算条件概率的过程已经蕴含在这了
        #In fact, the process of calculating conditional probabilities using dot product of center word vectors and context word vectors is already inherent in this
        x=self.in_embed(x) #这一层的参数是所有中心词向量，用输入词的one-hot向量x 乘以中心词矩阵self.in_embed，相当于查表得到这个词的中心词向量
        x=self.out_embed(x) #这一层的参数所有是上下文词向量，所以用上一层的输出点乘这一层，得到这个中心词与所有上下文词的余弦相似度。
        return x

In [525]:
model=Skip_Gram(vocab_size)
trainer=torch.optim.Adam(model.parameters(), lr=0.1)

In [508]:
nn.init.xavier_normal_(model.in_embed.weight)
nn.init.xavier_normal_(model.out_embed.weight)

Parameter containing:
tensor([[ 0.0707,  0.2475, -0.0775,  ..., -0.0339, -0.0314,  0.0535],
        [-0.1219, -0.1411,  0.0738,  ...,  0.2032,  0.1827, -0.1409],
        [-0.0892, -0.0587, -0.0309,  ...,  0.3755,  0.1125, -0.0848],
        ...,
        [ 0.2516,  0.0962,  0.0764,  ..., -0.0179, -0.0129,  0.0526],
        [-0.0294, -0.2043,  0.1738,  ..., -0.0911, -0.1177, -0.1590],
        [-0.2249,  0.1198, -0.1531,  ...,  0.2079, -0.1011,  0.1384]],
       requires_grad=True)

In [509]:
loss=nn.CrossEntropyLoss()

In [517]:
for i in range(10):
    trainer.zero_grad()
    y_hat=model(X)
    l=loss(y_hat,y)
    l.backward()
    print(l)
    trainer.step()

tensor(2.2059, grad_fn=<NllLossBackward0>)
tensor(2.2057, grad_fn=<NllLossBackward0>)
tensor(2.2057, grad_fn=<NllLossBackward0>)
tensor(2.2057, grad_fn=<NllLossBackward0>)
tensor(2.2055, grad_fn=<NllLossBackward0>)
tensor(2.2055, grad_fn=<NllLossBackward0>)
tensor(2.2055, grad_fn=<NllLossBackward0>)
tensor(2.2054, grad_fn=<NllLossBackward0>)
tensor(2.2053, grad_fn=<NllLossBackward0>)
tensor(2.2053, grad_fn=<NllLossBackward0>)


In [518]:
torch.softmax(model(X[0]), dim=0)

tensor([6.7468e-05, 7.7630e-05, 6.1342e-06, 1.8399e-06, 1.8599e-05, 2.0142e-01,
        2.0714e-01, 1.8279e-01, 1.4354e-04, 2.6581e-11, 1.1548e-13, 3.0521e-04,
        2.0582e-01, 2.0219e-01, 1.8282e-05, 1.6840e-09, 4.7620e-13, 5.4735e-14,
        1.0344e-11, 3.4670e-12, 6.1300e-11, 5.9820e-13, 1.3153e-13, 3.8012e-14,
        5.7839e-13, 8.4667e-13, 1.1247e-12, 2.6608e-12, 5.3314e-11, 1.2991e-13,
        1.6635e-09, 6.0428e-12, 8.2682e-12, 5.6700e-12, 2.5304e-10, 1.4865e-10,
        4.0416e-11, 2.7512e-09, 6.4789e-10, 3.1522e-09, 8.6565e-11, 2.1888e-11,
        3.5582e-12, 7.0046e-14, 1.0264e-11, 1.5316e-14, 1.5014e-15, 8.0476e-15,
        9.8369e-15, 5.7794e-14, 5.8810e-13, 6.1597e-14, 5.0919e-15, 2.1632e-12,
        4.8041e-10, 2.4262e-10, 1.6591e-08, 5.8982e-10, 8.1182e-12, 7.3183e-11,
        1.4661e-13, 1.0191e-10, 2.0629e-12, 1.3437e-13, 1.7435e-11, 3.1886e-12,
        2.6553e-13, 1.5877e-13, 4.6620e-10, 1.1778e-09, 2.9650e-09, 6.0570e-09,
        3.9406e-10, 2.0229e-09, 2.0453e-

In [522]:
from torch.nn.functional import cosine_similarity

def word_similarity(word1, word2):
    vec1 = model.in_embed(torch.tensor(one_hot_encoder(word1))).detach()
    vec2 = model.in_embed(torch.tensor(one_hot_encoder(word2))).detach()
    print(vec1,vec2)
    return cosine_similarity(vec1, vec2, dim=1).item()

print(f"相似度： {word_similarity('machine', 'learning'):.3f}")

tensor([[-0.1796, -1.2016,  2.1759, -0.2336,  1.0390,  1.0127,  0.5082,  1.5670,
         -1.5129, -1.2642, -1.1036, -1.6216, -1.0297, -1.3005, -0.5369,  1.7246,
          0.8961,  1.3457,  1.3210, -1.3263]]) tensor([[-2.2303,  0.1507,  1.3409, -1.0065,  2.3670,  1.2989,  1.0662,  0.4756,
          0.0467, -0.7359, -0.6774, -1.5686, -0.5068, -1.0053,  0.0666,  0.9162,
          0.7425,  1.7015,  1.4628, -0.2554]])
相似度： 0.724


  vec1 = model.in_embed(torch.tensor(one_hot_encoder(word1))).detach()
  vec2 = model.in_embed(torch.tensor(one_hot_encoder(word2))).detach()
