<a href="https://colab.research.google.com/github/Omnamdeo912/DeepLearning_Word2Vec/blob/main/Activity6_2023204017.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [27]:
import torch
from torch import nn
import torchtext
import numpy as np


In [2]:
class DatasetForCBOW(torch.utils.data.Dataset):
  def __init__(self, text, context_length=5):  # 1) Changed context_length to 5
    # Split the corpus into sentences on "." and each sentence into tokens on " "
    text_list = text.split(".")
    text_list_token = [t.split(" ") for t in text_list]
    n_grams = []
    self.vocab = torchtext.vocab.build_vocab_from_iterator(text_list_token)

    # Slide a window of context_length tokens over every sufficiently long sentence
    for sent in text_list_token:
      len_sent = len(sent)
      if len_sent >= context_length:
        for i in range(len_sent - context_length + 1):
          n_gram = sent[i: i + context_length]
          n_grams.append(n_gram)
    self.n_grams_indices = [self.vocab.lookup_indices(n_gram) for n_gram in n_grams]

  def __len__(self):
    return len(self.n_grams_indices)

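  def __getitem__(self, index):
    # Context: first and last token of the n-gram; target: the token at position 1
    # (note: this is not the middle token of the 5-token window)
    x = (self.n_grams_indices[index][0], self.n_grams_indices[index][-1])
    y = self.n_grams_indices[index][1]
    return torch.tensor(x), torch.tensor(y)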
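
The `__getitem__` above uses only the first and last token of each window as context and the token at position 1 as the target. CBOW is usually described with the middle word as the target and all surrounding words as context. A minimal sketch of that centered variant is shown here in plain Python; `centered_windows` is a hypothetical helper that is not part of the notebook, and it assumes an odd `context_length`.

```python
def centered_windows(tokens, context_length=5):
    """Return (context, target) pairs where target is the middle token.

    Hypothetical helper, not part of the notebook. context_length is
    assumed to be odd so that a single middle token exists.
    """
    if context_length % 2 == 0:
        raise ValueError("context_length must be odd")
    half = context_length // 2
    pairs = []
    for i in range(len(tokens) - context_length + 1):
        window = tokens[i:i + context_length]
        target = window[half]                         # middle token
        context = window[:half] + window[half + 1:]   # all other tokens
        pairs.append((context, target))
    return pairs


sentence = "Technology has changed the way we live".split(" ")
for context, target in centered_windows(sentence):
    print(context, "->", target)
```

With this layout each sample would carry four context indices instead of two, so the `sum(dim=1)` in the model would add four embeddings per sample.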

2) Here I am using a bigger corpus:
```
" Technology has changed the way we live in big ways. It affects our society in many different aspects. Let's look at some ways technology has made an impact.

Communication is way faster now. We can talk to anyone, anywhere in an instant. Social media and texting help us stay connected. But sometimes, we forget to talk face-to-face.

Learning is easier with technology. We can study online and use apps to help us understand better. But not everyone has access to these tools, which isn't fair.

Shopping has shifted online too. We can buy things from home. It's convenient, but traditional stores struggle.

Jobs have changed with technology. Some jobs disappeared, but new ones popped up. Working from home is more common now.

Healthcare got better with technology. Doctors use machines to find problems early. But we worry about keeping our health data safe.

Entertainment is everywhere with technology. We watch shows, play games, and even explore different worlds in virtual reality. But too much screen time isn't good for us.

Technology helps the environment too. Solar panels and smart devices save energy. But making tech gadgets can harm the Earth.

In the end, technology is a mixed bag. It brings good things like fast communication, easy learning, and better healthcare. But it also brings challenges like privacy worries and environmental concerns. We need to use technology wisely to make sure it helps our society without causing harm."
```

In [12]:
# 2) Using a bigger corpus

dataset = DatasetForCBOW("Technology has changed the way we live in big ways. It affects our society in many different aspects. Let's look at some ways technology has made an impact.Communication is way faster now. We can talk to anyone, anywhere in an instant. Social media and texting help us stay connected. But sometimes, we forget to talk face-to-face.Learning is easier with technology. We can study online and use apps to help us understand better. But not everyone has access to these tools, which isn't fair.Shopping has shifted online too. We can buy things from home. It's convenient, but traditional stores struggle.Jobs have changed with technology. Some jobs disappeared, but new ones popped up. Working from home is more common now.Healthcare got better with technology. Doctors use machines to find problems early. But we worry about keeping our health data safe.Entertainment is everywhere with technology. We watch shows, play games, and even explore different worlds in virtual reality. But too much screen time isn't good for us.Technology helps the environment too. Solar panels and smart devices save energy. But making tech gadgets can harm the Earth.In the end, technology is a mixed bag. It brings good things like fast communication, easy learning, and better healthcare. But it also brings challenges like privacy worries and environmental concerns. We need to use technology wisely to make sure it helps our society without causing harm.")

print("dataset size:", len(dataset))
print(dataset[4])

for i in range(len(dataset)):  # Printing the list of tensors after n-gram processing
    sample = dataset[i]
    print("Sample {}: {}".format(i, sample))

print(dataset.vocab.get_itos())

dataset size: 141
(tensor([39, 67]), tensor(17))
Sample 0: (tensor([19, 39]), tensor(8))
Sample 1: (tensor([ 8, 17]), tensor(23))
Sample 2: (tensor([ 23, 108]), tensor(10))
Sample 3: (tensor([10,  9]), tensor(39))
Sample 4: (tensor([39, 67]), tensor(17))
Sample 5: (tensor([17, 40]), tensor(108))
Sample 6: (tensor([ 0, 36]), tensor(18))
Sample 7: (tensor([18,  9]), tensor(59))
Sample 8: (tensor([ 59, 114]), tensor(13))
Sample 9: (tensor([13, 24]), tensor(36))
Sample 10: (tensor([36, 64]), tensor(9))
Sample 11: (tensor([  0, 135]), tensor(50))
Sample 12: (tensor([50, 40]), tensor(109))
Sample 13: (tensor([109,   1]), tensor(65))
Sample 14: (tensor([65,  8]), tensor(135))
Sample 15: (tensor([135, 111]), tensor(40))
Sample 16: (tensor([40, 20]), tensor(1))
Sample 17: (tensor([  1, 103]), tensor(8))
Sample 18: (tensor([41, 34]), tensor(6))
Sample 19: (tensor([0, 2]), tensor(5))
Sample 20: (tensor([ 5, 61]), tensor(7))
Sample 21: (tensor([ 7, 62]), tensor(37))
Sample 22: (tensor([37,  9]), t

In [4]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=3)

In [5]:
class CBOWModeler(nn.Module):

    def __init__(self, vocab, embedding_dim, context_size):
        super(CBOWModeler, self).__init__()
        vocab_size = len(vocab)
        self.embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
        self.linear = torch.nn.Linear(embedding_dim, vocab_size)
        self.relu = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=-1)
        self.loss_fun = torch.nn.CrossEntropyLoss()
        self.vocab = vocab

    def forward(self, inputs):
        # inputs: b, s
        embeds = self.embeddings(inputs) # b, s, dim
        context_embed = embeds.sum(dim=1) # b, dim
        out1 = self.relu(context_embed) # b, dim
        out2 = self.linear(out1) # b, v (raw logits, used by CrossEntropyLoss)
        probs = self.softmax(out2) # b, v (probabilities, not log-probabilities)
        return out2, probs
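
`forward` returns both the raw output of the linear layer and its `Softmax`. `torch.nn.CrossEntropyLoss` expects the raw scores (logits) and applies log-softmax internally, which is why the training loop passes the first return value to the loss. The relationship can be checked with a small framework-free sketch; the helper names and the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 0.5, -1.0]   # illustrative scores for a 3-word vocabulary
probs = softmax(logits)
print(sum(probs))                                       # probabilities sum to 1 (up to rounding)
print(cross_entropy(logits, 0), -math.log(probs[0]))    # the two values agree
```

Feeding already-normalised probabilities into a loss that applies softmax again would normalise twice, so keeping the logits and the probabilities as separate outputs is a reasonable design.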






4) Increasing the embedding dimension from 100 to 200.

In [29]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU
model = CBOWModeler(dataset.vocab, 200, 5).to(device)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
  for batch in dataloader:
    x, y = batch
    x = x.to(device)
    y = y.to(device)
    optim.zero_grad()
    out, preds = model(x)  # out: raw logits, preds: softmax probabilities
    loss = model.loss_fun(out, y)  # CrossEntropyLoss takes the logits
    loss.backward()
    optim.step()
  print(loss)  # loss of the last batch in this epoch

tensor(5.1955, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(3.2636, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(1.7995, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.8923, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.4731, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.2888, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.1962, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.1434, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.1102, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0879, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0720, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0603, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0513, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0443, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0387, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0341, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.0304, device='cuda:0', grad_fn=

In [30]:
embeddings = model.embeddings.weight
embeddings.shape

torch.Size([158, 200])

In [13]:
# Print the token-to-index mapping of the vocabulary
print(dataset.vocab.get_stoi())


{'worry': 157, 'worries': 156, 'worlds': 155, 'wisely': 153, 'which': 152, 'watch': 151, 'virtual': 150, 'up': 149, 'time': 145, 'these': 144, 'texting': 143, 'sure': 141, 'sometimes,': 136, 'shows,': 133, 'study': 140, 'shifted': 132, 'screen': 131, 'save': 130, 'safe': 129, 'problems': 127, 'about': 57, 'ways': 40, 'struggle': 139, 'Technology': 19, 'causing': 69, 'a': 56, 'stores': 138, 'Working': 55, 'many': 114, 'Learning': 49, 'explore': 89, 'Healthcare': 45, 'But': 3, 'not': 121, 'Entertainment': 44, 'stay': 137, 'Earth': 43, 'make': 112, 'Doctors': 42, 'much': 118, 'Shopping': 51, 'In': 46, 'learning,': 107, 'made': 111, 'way': 39, 'changed': 23, 'for': 95, 'online': 35, 'now': 34, 'even': 86, 'things': 38, 'society': 36, 'challenges': 70, 'aspects': 64, 'like': 33, 'Solar': 53, 'Jobs': 48, 'popped': 125, '': 0, 'environment': 84, 'play': 124, 'helps': 29, 'Some': 54, 'better': 12, 'understand': 148, 'is': 6, 'buy': 68, 'and': 4, 'mixed': 116, "It's": 47, "Let's": 50, 'keeping'
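
The mapping above contains an empty-string token (`''` at index 0), tokens with attached punctuation such as `'sometimes,'`, and capitalised forms such as `'Technology'` next to the lowercase `technology` used elsewhere in the corpus. These come from splitting on `"."` and `" "` only. A minimal sketch of a stricter tokenizer follows; `tokenize` is a hypothetical helper, and lowercasing is an assumption that would change the vocabulary size and the indices shown in this notebook.

```python
import re


def tokenize(text):
    """Split text into sentences, then into lowercase word tokens.

    Hypothetical helper, not part of the notebook. Punctuation is dropped,
    empty tokens never appear, and inner apostrophes/hyphens are kept.
    """
    sentences = []
    for raw in re.split(r"[.!?]", text):
        tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", raw.lower())
        if tokens:  # skip empty sentences
            sentences.append(tokens)
    return sentences


print(tokenize("But sometimes, we forget to talk face-to-face. It's convenient."))
```

The resulting list of token lists has the same shape as `text_list_token`, so it could be passed to `build_vocab_from_iterator` in the same way.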

3) Comparing the embeddings of two semantically close words with the embedding of an unrelated word
  > The norm of the difference between two embeddings (Euclidean distance) is used as the distance measure

In [35]:
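# Rows of the embedding matrix, selected by hard-coded vocabulary indices
# (see the get_stoi() output above; index 152 is 'which')
embedding1 = embeddings[142]
embedding2 = embeddings[97]
embedding3 = embeddings[152]

# Euclidean (L2) distance between the embedding vectors
distance_similar = torch.norm(embedding1 - embedding2).item()
distance_different = torch.norm(embedding1 - embedding3).item()
print(distance_similar)
print(distance_different)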


19.647817611694336
21.88470458984375
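
Euclidean distance depends on the length of the vectors as well as their direction, so cosine similarity is a common alternative for comparing word embeddings. The sketch below computes both on made-up 3-dimensional vectors; the words and values in `toy` are purely illustrative and are not taken from the trained model. In the notebook, the index of a word could be looked up with `dataset.vocab.get_stoi()[word]` instead of being hard-coded, and PyTorch offers `torch.nn.functional.cosine_similarity` for tensors.

```python
import math


def euclidean(u, v):
    """L2 distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for rows of the embedding matrix
toy = {
    "learning": [1.0, 2.0, 0.0],
    "study": [2.0, 4.0, 0.0],   # same direction as "learning", twice as long
    "which": [0.0, 0.0, 3.0],   # orthogonal to both
}

print(euclidean(toy["learning"], toy["study"]))
print(cosine_similarity(toy["learning"], toy["study"]))
print(cosine_similarity(toy["learning"], toy["which"]))
```

In this toy case the first two vectors point in the same direction, so their cosine similarity is 1 (up to rounding) even though their Euclidean distance is not zero.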
