# Word Embedding Training

Skip-Gram Vs Contineous Bag Of Words with self-Attention

- Situation:
    - Just learned Word Embedding, where I was taught there are two ways to generate meaninggul semantic word embeddings:    Skip-Gram and CBOW. I thought it was funny to try it out myself. So I decided to train embedding model using these two different methods, and give a rough comparsion to the result.

- Task:
    - Training two embedding models with Skip-Gram vs CBOW, and see which one may behave better by my own using experience.

- Action:
    - I pulled off an excerpt from a novel (titled "The death of Hero") as training corpus.
    - Preprocessing the text, deleting unnecessary characters and symbols, spliting the text, forming a word2idx and idx2word

        - Added \<PAD\>, \<UNK\> to handle unkown and padding too

    - From here on, I took two routes:

        - Route 1: training with Skip-Gram.
            - Scan with a context window, form center, target pairs.   PS. center : a single word, target : a list of words appeared in the context of the center word. 
            - The key task is asking model to learn how to predict the context word from a given center word.
            - The model I defined is the classic Word2vec Model, embedding layer + linear layer + softmax + crossentropyLoss

        - Route 2: training with CBOW
            - Scan with a context window again, form context, target pair. PS. context : list of words, target : a single word
            - The task is training model to guess the missing value given a the surrounding words.
            - The model I defined is embedding + self-attention + linear layer + softmax + crossentropyLoss
- Result:
    - The skip-gram approach had loss hard stuck at a very large number, not doing well
    - The CBOW approach successfully converge to a small loss after 400 epoches, harry. Checked the embeding output, the result is interesting.
    

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import re
from torch.utils.data import DataLoader, TensorDataset

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [3]:
# Sample corpus
corpus:str = ''
with open('corpus.txt', 'r') as file:
    corpus = file.read()

In [4]:
# text cleaning
text = corpus.lower()
text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
corpus = re.sub(r"\s+", " ", text).strip()
corpus = corpus.split()

In [5]:
# Build vocab 
vocab = set(corpus)

word2idx = {w: i for i, w in enumerate(vocab)}
word2idx['<PAD>'] = len(word2idx) # special tokens
word2idx['<UNK>'] = len(word2idx) # special tokens
idx2word = {i: w for w, i in word2idx.items()}
vocab_size = len(word2idx)

## Skip-Gram

In [5]:
# generate context, and tokenize
data = []
window_size = 2
for i, center in enumerate(corpus):
    context = []
    for j in range(max(0, i - window_size), min(len(corpus), i + window_size + 1)):
        if i != j:
            context.append(word2idx[corpus[j]])  # [W]
    data.append((word2idx[center], context))

data[:3]

[(899, [13, 1294]), (13, [899, 1294, 943]), (1294, [899, 13, 943, 329])]

In [6]:
#to tensor
c,context = zip(*data)
center = torch.tensor(c, dtype=torch.long)

targets = torch.zeros((len(center), vocab_size))
for row in range(len(context)):
    for col in context[row]: 
        targets[row, col] = 1

targets[:5, :5]



tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

In [7]:
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(center, targets)
dataloader = DataLoader(dataset, batch_size=20, shuffle=True)

### Define Model

In [None]:
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.linear1 = nn.Linear(embed_dim, vocab_size)
    
    def forward(self, center_words):
        embeds = self.embeddings(center_words)  # [batch_size, embed_dim]
        x = self.linear1(embeds)                # [batch_size, vocab_dim]
        return x

In [12]:
model = SkipGram(vocab_size, embed_dim=220)
model.to(device)

SkipGram(
  (embeddings): Embedding(1365, 220)
  (linear1): Linear(in_features=220, out_features=1365, bias=True)
)

In [15]:

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
model.to(device)
# training loop
for epoch in range(501):
    total_loss = 0
    for center, context in dataloader:
        center_batch = center.to(device)  # [B]
        context_batch = context.to(device) # [B, V]

        output = model(center_batch) # [B, V]
        loss = loss_fn(output, context_batch)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        total_loss += loss.item()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")


Epoch 0, Loss: 3137.0091
Epoch 10, Loss: 3130.8706
Epoch 20, Loss: 3127.5402
Epoch 30, Loss: 3122.9321
Epoch 40, Loss: 3117.6036
Epoch 50, Loss: 3113.7666
Epoch 60, Loss: 3110.3580
Epoch 70, Loss: 3105.8761
Epoch 80, Loss: 3100.1104
Epoch 90, Loss: 3098.2304
Epoch 100, Loss: 3094.0000
Epoch 110, Loss: 3089.6313
Epoch 120, Loss: 3086.4785
Epoch 130, Loss: 3083.3954
Epoch 140, Loss: 3079.4711
Epoch 150, Loss: 3074.7757
Epoch 160, Loss: 3070.9609
Epoch 170, Loss: 3070.0774
Epoch 180, Loss: 3067.0077
Epoch 190, Loss: 3061.8876
Epoch 200, Loss: 3058.9116
Epoch 210, Loss: 3056.8749
Epoch 220, Loss: 3054.2147
Epoch 230, Loss: 3051.5206
Epoch 240, Loss: 3045.6635
Epoch 250, Loss: 3043.7747
Epoch 260, Loss: 3041.2251
Epoch 270, Loss: 3039.6581
Epoch 280, Loss: 3036.6365
Epoch 290, Loss: 3033.3091
Epoch 300, Loss: 3029.6859
Epoch 310, Loss: 3028.3598
Epoch 320, Loss: 3024.9406
Epoch 330, Loss: 3021.1941
Epoch 340, Loss: 3021.8115
Epoch 350, Loss: 3017.9976
Epoch 360, Loss: 3015.3558
Epoch 370, L

## CBOW

### Form data

In [6]:
# get contexts and target pair

data = []
window_size = 3 # look 3 to the left and right
for i, target in enumerate(corpus):
    context = []
    for j in range(max(0, i - window_size), min(len(corpus), i + window_size + 1)):
        if i != j:
            context.append(word2idx[corpus[j]])

    # pad for different lens
    while len(context) < 6:
        context.append(word2idx["<PAD>"])
    data.append((context, word2idx[target]))

data[:5]

    
    

[([814, 1003, 240, 1365, 1365, 1365], 180),
 ([180, 1003, 240, 211, 1365, 1365], 814),
 ([180, 814, 240, 211, 426, 1365], 1003),
 ([180, 814, 1003, 211, 426, 723], 240),
 ([814, 1003, 240, 426, 723, 114], 211)]

In [7]:
# to tensor

context_list, target_list = zip(*data)

target_tensor = torch.tensor(target_list)

context_tensor = torch.tensor(context_list, dtype=torch.long) # [T, 6]

context_tensor

tensor([[ 814, 1003,  240, 1365, 1365, 1365],
        [ 180, 1003,  240,  211, 1365, 1365],
        [ 180,  814,  240,  211,  426, 1365],
        ...,
        [1003,  522,  835,  994,  174, 1365],
        [ 522,  835, 1003,  174, 1365, 1365],
        [ 835, 1003,  994, 1365, 1365, 1365]])

In [8]:
ds = TensorDataset(context_tensor,target_tensor)
dataloader = DataLoader(ds, batch_size=1, shuffle=True)

### Define Self-Attention Model

In [9]:
class SimpleSelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key   = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.scale = d_model**0.5

    def forward(self, x):
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
        attn_scores = Q @ K.transpose(-2, -1) / self.scale
        attn_weights = F.softmax(attn_scores, dim=-1)
        return attn_weights @ V # [S, D]

### Define CBOW Model

In [10]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, seq_len, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.self_attention = SimpleSelfAttention(embed_dim)
        nodes = embed_dim * seq_len // 3 if embed_dim * seq_len > 3 else 3
        self.linear1 = nn.Linear(embed_dim * seq_len, nodes)
        self.linear2 = nn.Linear(nodes, vocab_size)
    
    def forward(self, x): # [S]
        embed = self.embedding(x)
        x = self.self_attention(embed).view((1, -1))
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x # [V]


In [11]:
model = CBOW(vocab_size, seq_len=window_size*2, embed_dim=100)

In [12]:
model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
# training loop
for epoch in range(501):
    total_loss = 0
    for context, target in dataloader:
        target_ = target.to(device)  # [1]
        context_ = context.to(device) # [S]

        output = model(context_) # [V]
        loss = loss_fn(output, target_)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        total_loss += loss.item()


    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")


Epoch 0, Loss: 29553.0728
Epoch 10, Loss: 24741.6069
Epoch 20, Loss: 23634.5078
Epoch 30, Loss: 22257.3190
Epoch 40, Loss: 20532.0775
Epoch 50, Loss: 18555.5391
Epoch 60, Loss: 16459.4669
Epoch 70, Loss: 14386.7752
Epoch 80, Loss: 12418.4807
Epoch 90, Loss: 10614.9805
Epoch 100, Loss: 8963.7750
Epoch 110, Loss: 7499.7051
Epoch 120, Loss: 6238.4146
Epoch 130, Loss: 5160.8037
Epoch 140, Loss: 4272.6087
Epoch 150, Loss: 3508.1090
Epoch 160, Loss: 2875.7679
Epoch 170, Loss: 2363.8509
Epoch 180, Loss: 1903.0090
Epoch 190, Loss: 1527.5460
Epoch 200, Loss: 1210.3182
Epoch 210, Loss: 960.7232
Epoch 220, Loss: 757.4371
Epoch 230, Loss: 580.9135
Epoch 240, Loss: 447.8394
Epoch 250, Loss: 337.9303
Epoch 260, Loss: 254.7353
Epoch 270, Loss: 194.7736
Epoch 280, Loss: 139.9516
Epoch 290, Loss: 112.7440
Epoch 300, Loss: 88.0148
Epoch 310, Loss: 68.0574
Epoch 320, Loss: 53.6810
Epoch 330, Loss: 43.5100
Epoch 340, Loss: 34.0334
Epoch 350, Loss: 27.5803
Epoch 360, Loss: 22.5688
Epoch 370, Loss: 16.4490


In [13]:
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.data.norm(2).item()
        print(f"{name} grad norm: {grad_norm:.6f}")


embedding.weight grad norm: 0.016792
self_attention.query.weight grad norm: 0.056579
self_attention.query.bias grad norm: 0.005814
self_attention.key.weight grad norm: 0.053965
self_attention.key.bias grad norm: 0.000000
self_attention.value.weight grad norm: 0.056995
self_attention.value.bias grad norm: 0.012227
linear1.weight grad norm: 0.131273
linear1.bias grad norm: 0.004745
linear2.weight grad norm: 0.198066
linear2.bias grad norm: 0.002480


In [15]:
embeddings = model.embedding.weight.data
print(f"Embedding for 'say': {embeddings[word2idx['the']]}")


Embedding for 'say': tensor([-0.6516, -0.5228, -2.6756,  1.0574, -1.0365, -0.9951, -0.4743,  0.6886,
        -0.0177,  1.1187, -0.5014,  0.1908, -1.3898, -1.5563,  1.4845,  1.0136,
        -1.0156, -0.5403,  0.5217,  1.2004, -0.4451,  0.1688,  2.5683, -0.2973,
        -0.2957,  0.3516, -0.1253,  1.1774, -0.8941, -0.2151, -0.9814,  1.4599,
         1.0003, -0.6810,  1.0453,  1.0736, -1.6421, -0.7032, -0.8412,  0.6861,
         0.7410,  0.3247,  0.2107,  0.4246,  0.8145, -0.4981,  0.4476, -2.1404,
        -2.7737,  1.0773,  0.4306,  1.7925,  0.8574, -0.7026,  0.6058,  1.9203,
        -0.5291, -0.8033, -0.9161, -0.8780,  0.3470,  0.5998,  0.9965,  0.5328,
        -0.2157, -0.2082,  0.9472,  1.0053, -1.0803, -0.8169,  2.5357,  0.5682,
        -1.7280,  1.4474, -1.3752, -2.2571,  1.9295, -0.5921, -0.3369,  1.0598,
         0.0266,  0.6273,  2.0813, -0.7890, -0.6126,  0.4479,  0.0099, -0.5917,
        -1.1267,  0.2658,  1.4372, -0.8784,  2.9819,  0.7703, -1.3374, -0.2320,
         0.1725, -1

In [56]:
import torch.nn.functional as F

new_model = model.to('cpu')
with torch.no_grad():
    center_vecs = new_model.embedding.weight
    sim = F.cosine_similarity(center_vecs[word2idx["vaguely"]].unsqueeze(0), center_vecs)
    topk = torch.topk(sim, k=10)
    for i in topk.indices:
        print(idx2word[i.item()])


vaguely
possibly
damn
noble
other
corpse
chemise
avoiding
inadequate
unless


In [54]:
test_words = '''even in a of the intelligentsia'''.split()

vec = [word2idx[t] for t in test_words]

with torch.inference_mode():
    word_vecs = torch.tensor(vec, dtype=torch.long)
    #print(word_vecs.shape)

    result = F.softmax(new_model(word_vecs), dim = -1).squeeze()
    topk = torch.topk(result, k = 10)
    for i in topk.indices:
        print(idx2word[i.item()])



dictatorship
peoples
matter
except
love
toils
light
out
more
parts
