Explanation:

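1. Data Preparation:
   - The corpus is split into words on whitespace, and a vocabulary is built from the unique words.
   - Context-target pairs are generated using a window size of 2: each word is paired with up to two neighbors on either side within the same sentence, and each pair is stored as (target, context).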

2. Skip-gram Model: The SkipGramModel class defines a neural network with an embedding layer that learns the word vectors and a linear output layer that scores every vocabulary word as a possible context word.

3. Training Loop: The model is trained on the context-target pairs. For each target word, it scores the candidate context words from the target's embedding, then updates the embeddings and the output-layer weights to minimize the cross-entropy loss against the true context word.
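
To make the prediction loss in item 3 concrete: Step 3 uses nn.CrossEntropyLoss, which applies a softmax to the scores and takes the negative log-probability of the true context word. The sketch below reproduces that arithmetic in plain Python for a hypothetical four-word vocabulary. The scores and the helper name cross_entropy are made up for illustration and are not part of the notebook.

```python
import math

def cross_entropy(scores, true_idx):
    """Negative log-probability of the true class under softmax(scores)."""
    # Subtract the max score before exponentiating for numerical stability
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[true_idx]), probs

# Hypothetical scores for a 4-word vocabulary; index 0 is the true context word
scores = [2.0, 1.0, 0.1, 0.1]
loss, probs = cross_entropy(scores, true_idx=0)
print(f"probabilities: {[round(p, 3) for p in probs]}")
print(f"loss: {loss:.4f}")

# With all scores equal the model is guessing uniformly, so the loss is ln(4)
uniform_loss, _ = cross_entropy([0.0, 0.0, 0.0, 0.0], true_idx=0)
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

Training lowers this value by raising the score of the true context word relative to the others. A model that guesses uniformly over V words pays ln(V) per pair, which is a useful baseline when reading the loss values printed during training.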

Step 1: Data Preparation

We'll start by preparing a small corpus and generating context-target pairs for training.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np

# Sample text corpus
corpus = [
    "we are learning nlp",
    "nlp is fun",
    "we love deep learning",
    "deep learning is powerful"
]

# Tokenize the corpus
tokenized_corpus = [sentence.split() for sentence in corpus]

# Build vocabulary
vocabulary = Counter()
for sentence in tokenized_corpus:
    for word in sentence:
        vocabulary[word] += 1

# Create word to index and index to word mappings
word_to_idx = {word: i for i, word in enumerate(vocabulary)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

vocab_size = len(vocabulary)

# Generate context-target pairs
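def generate_skipgram_pairs(tokenized_corpus, window_size=2):
    """Return (target, context) word pairs drawn from within each sentence."""
    pairs = []
    for sentence in tokenized_corpus:
        sentence_len = len(sentence)
        for idx, word in enumerate(sentence):
            # Window bounds, clipped to the start and end of the sentence
            start = max(idx - window_size, 0)
            end = min(idx + window_size + 1, sentence_len)
            for neighbor in range(start, end):
                if neighbor != idx:  # skip the target word itself
                    pairs.append((word, sentence[neighbor]))
    return pairs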

pairs = generate_skipgram_pairs(tokenized_corpus)

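To see what the pair generator yields, the sketch below applies it to the first sentence only. The function is repeated from the cell above so the snippet can run on its own; the name example_pairs is introduced here purely for illustration.

```python
def generate_skipgram_pairs(tokenized_corpus, window_size=2):
    # Same logic as the cell above, repeated so this snippet runs standalone
    pairs = []
    for sentence in tokenized_corpus:
        for idx, word in enumerate(sentence):
            start = max(idx - window_size, 0)
            end = min(idx + window_size + 1, len(sentence))
            for neighbor in range(start, end):
                if neighbor != idx:
                    pairs.append((word, sentence[neighbor]))
    return pairs

sentence = "we are learning nlp".split()
example_pairs = generate_skipgram_pairs([sentence], window_size=2)

# Each pair is (target, context)
for target, context in example_pairs:
    print(f"{target} -> {context}")

# "we" sits at the sentence edge, so it only has right-hand neighbors
print(len(example_pairs))  # 10 pairs for this 4-word sentence
```

Across the four sentences of the corpus the same procedure yields 36 pairs (10 + 6 + 10 + 10), which is the number of training examples seen per epoch in Step 3.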

Step 2: Define the Skip-gram Model
We'll create a simple neural network with an embedding layer and a linear layer to predict context words.

In [2]:
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # One trainable vector per vocabulary word
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Maps an embedding to one raw score (logit) per vocabulary word
        self.output_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, target_word):
        # target_word: LongTensor of word indices, shape (batch,)
        word_embed = self.embeddings(target_word)   # (batch, embedding_dim)
        # Raw scores for all words in the vocabulary; the loss function applies the softmax
        output = self.output_layer(word_embed)      # (batch, vocab_size)
        return output

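The forward pass above amounts to a row lookup followed by an affine map. The NumPy sketch below mirrors that computation with random stand-in parameters; the array names embeddings, W and b are assumptions chosen to parallel nn.Embedding.weight, nn.Linear.weight and nn.Linear.bias, and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 9, 10  # sizes used in this walkthrough

# Stand-ins for the learned parameters (random here, purely illustrative)
embeddings = rng.normal(size=(vocab_size, embedding_dim))  # like nn.Embedding.weight
W = rng.normal(size=(vocab_size, embedding_dim))           # like nn.Linear.weight
b = np.zeros(vocab_size)                                   # like nn.Linear.bias

target_idx = np.array([3])           # a batch holding a single word index
word_embed = embeddings[target_idx]  # embedding lookup is plain row selection
scores = word_embed @ W.T + b        # one raw score per vocabulary word

print(word_embed.shape)  # (1, 10)
print(scores.shape)      # (1, 9)
```

The scores are unnormalized. In the notebook, the softmax that turns them into probabilities is applied inside nn.CrossEntropyLoss during training rather than in the model itself.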

Step 3: Training the Model
Next, we'll train the Skip-gram model using the context-target pairs.

In [3]:
# Hyperparameters
embedding_dim = 10
learning_rate = 0.01
epochs = 100

# Initialize model, loss function, and optimizer
model = SkipGramModel(vocab_size, embedding_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# Prepare data for training
def prepare_data(pairs, word_to_idx):
    inputs = [word_to_idx[target] for target, context in pairs]
    targets = [word_to_idx[context] for target, context in pairs]
    return torch.LongTensor(inputs), torch.LongTensor(targets)

inputs, targets = prepare_data(pairs, word_to_idx)

# Training loop
for epoch in range(epochs):
    total_loss = 0.0  # running sum of per-pair losses for this epoch (not an average)
    for input_word, target_word in zip(inputs, targets):
        input_word = input_word.unsqueeze(0)
        target_word = target_word.unsqueeze(0)

        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        output = model(input_word)
        
        # Compute loss
        loss = criterion(output, target_word)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

# Display learned embeddings
for word, idx in word_to_idx.items():
    print(f"Word: {word}, Embedding: {model.embeddings.weight[idx].detach().numpy()}")


Epoch 10/100, Loss: 69.3557
Epoch 20/100, Loss: 63.8174
Epoch 30/100, Loss: 61.0445
Epoch 40/100, Loss: 59.2488
Epoch 50/100, Loss: 57.9731
Epoch 60/100, Loss: 57.0197
Epoch 70/100, Loss: 56.2831
Epoch 80/100, Loss: 55.7019
Epoch 90/100, Loss: 55.2374
Epoch 100/100, Loss: 54.8633
Word: we, Embedding: [ 1.409335   -0.20863475  0.44019634 -0.47989354 -1.0884212  -0.25774264
  1.3875313  -1.5579122  -0.3796208   1.0906088 ]
Word: are, Embedding: [-0.9366768   0.8951899   0.04885015  1.4833986  -0.12077451  0.4680458
 -1.2635247   0.3657455   1.430025   -0.02664759]
Word: learning, Embedding: [-0.09689061 -0.2076085   1.3457878   0.6046952  -0.4199099  -1.0195935
  0.66363126  2.0995212  -0.13935253 -0.45863977]
Word: nlp, Embedding: [ 2.5507257e+00  1.9211942e+00 -6.9080037e-01  2.9292544e-03
 -6.4891368e-02  4.6898937e-01 -6.4207202e-01 -6.9626361e-01
  2.0395450e-03  1.3152207e-01]
Word: is, Embedding: [-0.52267057 -0.5367874   0.49218756  1.5211189  -0.28172144  1.0779531
  1.1760077  