# Lecture 19: 2023-04-06 Vector Semantics and Word Embeddings

## Lecture Overview

* Recap on Word2Vec, GloVe, and FastText
* ELMo and BERT Embeddings
* NLP Applications of Word Embeddings
* Word Embeddings as inputs to Neural Networks

## Recap on Word2Vec, GloVe, and FastText

How can we represent words as vectors? One option is one-hot vectors, but they are very sparse, grow with the vocabulary size, and capture no semantic information: every pair of distinct words is equally dissimilar. The alternative is dense, low-dimensional vectors (embeddings) that are learned from data so that words used in similar contexts receive similar vectors. The methods below are designed to learn dense vectors that capture both semantic and syntactic information.
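As a small illustration of why one-hot vectors carry no similarity information, the sketch below compares cosine similarities for one-hot and dense vectors. The three-word vocabulary and the dense values are made up for illustration; they are not trained embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]  # toy vocabulary (assumption)

# One-hot: one dimension per vocabulary word, with a single 1 per vector.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense: hand-picked 2-d values chosen so that "cat" and "dog" point the same way.
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

print(cosine_similarity(one_hot["cat"], one_hot["dog"]))  # 0.0: distinct one-hot vectors are orthogonal
print(cosine_similarity(dense["cat"], dense["dog"]))      # close to 1
print(cosine_similarity(dense["cat"], dense["car"]))      # much smaller
```

Any two distinct one-hot vectors have cosine similarity 0, so "cat" is exactly as far from "dog" as from "car". Dense vectors can place related words close together.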

### Word2Vec

<img src="./images/mikolov.png" width="900" height="500" />

### GloVe (Global Vectors for Word Representation)

<img src="./images/glove-2.png" width="900" height="200" />

### FastText

<img src="./images/fasttext.png" width="900" height="500" />

___

References:
* [Mikolov et al. (2013)](https://arxiv.org/abs/1301.3781)
* [Pennington et al. (2014)](https://nlp.stanford.edu/pubs/glove.pdf)
* [Joulin et al. (2016)](https://arxiv.org/abs/1607.01759)
* [Bojanowski et al. (2017)](https://arxiv.org/abs/1607.04606)


### Word2Vec, GloVe, and FastText in Python

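The following libraries can be used to train or load word vectors in Python (note that scikit-learn's `CountVectorizer` produces sparse count vectors rather than learned embeddings):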

* [gensim](https://radimrehurek.com/gensim/models/word2vec.html)
* [spaCy](https://spacy.io/usage/vectors-similarity)
* [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
* [fastText](https://fasttext.cc/docs/en/python-module.html)


#### Word2Vec with gensim

```python
# from the docs: https://radimrehurek.com/gensim/models/word2vec.html

from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# gensim >= 4.0 names the embedding dimension `vector_size` (older releases used `size`)
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
```

#### GloVe with Python

Download the pretrained GloVe embeddings from [here](https://nlp.stanford.edu/projects/glove/). The following code loads the embeddings into a dictionary.

```python
import numpy as np

glove_file = 'glove.6B.50d.txt'

def load_glove_embeddings(glove_file):
    """Read a GloVe text file into a vocabulary set and a word -> vector dictionary."""
    words = set()
    word_to_vec_map = {}
    # each line holds a word followed by its vector components, separated by spaces
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float32)
    return words, word_to_vec_map

words, word_to_vec_map = load_glove_embeddings(glove_file)
```
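Once the vectors are in a dictionary, a common next step is a nearest-neighbour lookup by cosine similarity. The sketch below assumes the same `word -> vector` layout; so that it runs without the download, it parses three made-up lines in the GloVe text format instead of a real file. The helper `most_similar` is a hypothetical name, not part of any library.

```python
import io
import numpy as np

# Three made-up lines in the GloVe text format: a word followed by its vector.
toy_glove = io.StringIO(
    "king 0.9 0.8 0.1\n"
    "queen 0.85 0.9 0.15\n"
    "apple 0.1 0.2 0.9\n"
)

word_to_vec_map = {}
for line in toy_glove:
    parts = line.strip().split()
    word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float32)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, word_to_vec_map):
    """Return the other word whose vector has the highest cosine similarity."""
    target = word_to_vec_map[word]
    candidates = [w for w in word_to_vec_map if w != word]
    return max(candidates, key=lambda w: cosine_similarity(target, word_to_vec_map[w]))

print(most_similar("king", word_to_vec_map))  # queen
```

With a real GloVe file the loop over all candidates is slow; stacking the vectors into one matrix and using a single matrix-vector product would be the usual optimization.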

#### FastText with Python

```python
# from the docs: https://fasttext.cc/docs/en/python-module.html

import fasttext

# Unsupervised word vectors; data.txt is a plain-text training corpus
skgrm_model = fasttext.train_unsupervised('data.txt', model='skipgram')
cbow_model = fasttext.train_unsupervised('data.txt', model='cbow')

# Text classification with fastText
# train.txt holds one example per line, with its label(s) prefixed by __label__
model = fasttext.train_supervised('train.txt')

# returns a tuple (labels, probabilities) with the k most likely labels
model.predict('This is a test sentence', k=3)

# predict more than one sentence at once by passing a list
model.predict(['This is a test sentence', 'This is another test sentence'], k=3)
```





## ELMo and BERT Embeddings

Word2Vec, GloVe, and FastText are static embeddings, meaning that there is a fixed vector representation for each word. ELMo and BERT are contextual embeddings, meaning that the vector representation of a word depends on the context in which it is used. ELMo and BERT are pre-trained on a large corpus of text and can be used to extract features from text data.

### ELMo (Embeddings from Language Models)

<img src="./images/elmo.png" width="900" height="300" />

Paper: [Deep contextualized word representations](https://arxiv.org/abs/1802.05365)

### BERT (Bidirectional Encoder Representations from Transformers)


Like ELMo, BERT produces deep contextualized word representations that model both (1) complex characteristics of word use (e.g., syntax and semantics) and (2) how these uses vary across linguistic contexts (i.e., polysemy). ELMo representations are learned functions of the internal states of a deep bidirectional language model (biLM) that is pre-trained on a large text corpus. BERT instead pre-trains a deep Transformer encoder that conditions jointly on both left and right context in every layer. A pre-trained BERT model can be fine-tuned with a small task-specific output layer, and it improved the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis.

<center><img src="./images/bert-1.png" height="400" width="800" /></center>


BERT is pre-trained on two tasks: (1) Masked Language Modeling (MLM) and (2) Next Sentence Prediction (NSP). MLM is a fill-in-the-blank task: some input tokens are masked, and the model is trained to predict them from the surrounding context. NSP is a binary classification task: given a pair of sentences, the model predicts whether the second sentence follows the first in the original text. Pre-training runs on a large unlabeled corpus, such as Wikipedia, for about 1 million steps. The pre-trained model is then fine-tuned on a specific task, such as text classification, for far fewer steps (e.g., 3,000). Alternatively, the pre-trained model can be used without fine-tuning to extract features from text data.
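To make the MLM task concrete, here is a minimal sketch of how the input might be corrupted. It follows the recipe described in the BERT paper (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). The function name `mask_tokens` and the whitespace tokenization are illustrative assumptions; the real implementation works on WordPiece tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, targets); targets[i] is None where nothing must be predicted."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            targets.append(token)                      # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)                 # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))    # 10%: replace with a random token
            else:
                corrupted.append(token)                # 10%: keep the token unchanged
        else:
            targets.append(None)                       # not selected: no prediction here
            corrupted.append(token)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, vocab=sorted(set(tokens)))
print(corrupted)
print(targets)
```

The training loss is computed only at positions where `targets` is not `None`. With a six-token sentence and a 15% rate, many calls select no token at all, so the effect is easier to see on longer inputs.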

<center><img src="./images/bert.png" height="400" width="800" /></center>


code: https://github.com/google-research/bert <br />
paper: [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805)

## NLP Applications of Word Embeddings

We can use word embeddings to solve a variety of NLP tasks, such as:

* Text Classification
* Text Summarization
* Question Answering
* Sentiment Analysis
* Machine Translation
* Text Generation
* Text Similarity
* Text Clustering
* Recommendation Systems
* Chatbots
* Information Retrieval
* and more...

## Completing our understanding of Neural Networks - Gradient Descent


<img src="./images/Neuron.drawio.png">

### PyTorch Autograd

* Computes the gradients in the Neural Network

```python
import torch

# Create a tensor with 5 elements
x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)
x
```

```
tensor([1., 2., 3., 4., 5.])
```

```python
# requires_grad=True tells autograd to track operations on this tensor
x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, requires_grad=True)
x
```

```
tensor([1., 2., 3., 4., 5.], requires_grad=True)
```

```python
# operations on the tensor
# the computational graph is created in the forward pass
y = x**2
```

* Computes the gradients w.r.t. the parameters of the model

$$\frac{\partial y}{\partial x}$$

```python
y  # grad_fn=<PowBackward0> is the gradient function (for backpropagation)
```

```
tensor([ 1.,  4.,  9., 16., 25.], grad_fn=<PowBackward0>)
```

```python
z = 2*y + x
z
```

```
tensor([ 3., 10., 21., 36., 55.], grad_fn=<AddBackward0>)
```

```python
z.mean()  # mean of all the elements in the tensor
```

```
tensor(25., grad_fn=<MeanBackward0>)
```

```python
# backward() without arguments requires a scalar, so we call it on z.mean()
z.mean().backward()
x.grad  # gradient of z.mean() with respect to x, i.e. (4x + 1) / 5
```

```
tensor([1.0000, 1.8000, 2.6000, 3.4000, 4.2000])
```

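Vector-Jacobian product (chain rule): when the output is not a scalar, `backward()` takes a vector `v` and computes the product of `v` with the Jacobian of the output with respect to the input.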

<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/e343f872b676a0e64646f27593d03c77c53cbaf3" height="400" width="800"></center>

```python
x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, requires_grad=True)
y = x**2

y
```

```
tensor([ 1.,  4.,  9., 16., 25.], grad_fn=<PowBackward0>)
```

```python
# y is not a scalar, so backward() needs a vector v with the same shape as y;
# autograd then computes the vector-Jacobian product of v with dy/dx
v = torch.tensor([0.1, 1.0, 0.001, 0.0001, 0.00001], dtype=torch.float32)
y.backward(v)

# x.grad holds v * dy/dx = v * 2x
print(x.grad)
```

```
tensor([2.0000e-01, 4.0000e+00, 6.0000e-03, 8.0000e-04, 1.0000e-04])
```
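The vector-Jacobian product can also be written out explicitly. The sketch below uses NumPy rather than autograd and builds the full Jacobian of the elementwise function `y = x**2`, which is diagonal with entries `2 * x_i`. It then multiplies a vector `v` by it, using the same `x` and vector values as the cells above. Autograd never materializes this matrix; the code is only meant to show what is being computed.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([0.1, 1.0, 0.001, 0.0001, 0.00001])  # the vector passed to backward()

# y = x**2 is elementwise, so dy_i/dx_j is 2*x_i when i == j and 0 otherwise.
J = np.diag(2 * x)

vjp = v @ J  # vector-Jacobian product, shape (5,)
print(vjp)
```

For an elementwise function the product reduces to `v * 2 * x`, which is why no full matrix is needed in practice.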


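```python
# to suspend tracking of gradients
with torch.no_grad():
    y = x**2
    print(y)
```

```
tensor([ 1.,  4.,  9., 16., 25.])
```

```python
# other options to stop tracking gradients
x.requires_grad_(False)  # in place: x itself no longer requires gradients
x.detach()               # returns a tensor that shares data with x but is detached from the graph
```

```
tensor([1., 2., 3., 4., 5.])
```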

```python
# Training example

weights = torch.ones(4, requires_grad=True)

for epoch in range(3):
    model_output = (weights*3).sum()
    model_output.backward()  # gradients are accumulated into weights.grad on every call
    print(weights.grad)

    # uncomment to empty the gradients before the next iteration
    # weights.grad.zero_()
```

```
tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([9., 9., 9., 9.])
```


### Backpropagation


* Chain rule = product of partial derivatives

Suppose `x` is the input, `y = f(x)`, and `z = b(y)`.

We can use the chain rule to compute the derivative of z with respect to x: multiply the derivative of z with respect to y by the derivative of y with respect to x.

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$

PyTorch creates the computational graph automatically, and then uses the chain rule to compute the gradients of the loss with respect to the parameters of the model.
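The chain rule can be checked numerically without any autograd machinery. The sketch below picks two concrete functions as an assumption, `f(x) = x**2` and `b(y) = 2*y + 3` (the notes leave `f` and `b` unspecified), and compares the product of the local derivatives with a central finite difference.

```python
def f(x):        # inner function: y = f(x)
    return x ** 2

def b(y):        # outer function: z = b(y)
    return 2 * y + 3

def df_dx(x):    # local derivative dy/dx
    return 2 * x

def db_dy(y):    # local derivative dz/dy
    return 2.0

x = 3.0
y = f(x)

# chain rule: dz/dx = dz/dy * dy/dx
chain_rule = db_dy(y) * df_dx(x)

# numerical check with a central finite difference
h = 1e-6
finite_difference = (b(f(x + h)) - b(f(x - h))) / (2 * h)

print(chain_rule)         # 12.0
print(finite_difference)  # approximately 12
```

Backpropagation applies this same product of local derivatives at every node of the computational graph, moving from the output back to the inputs.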

```python
# Calculating the gradients

# 1. forward pass
# 2. compute gradients
# 3. backward pass (d loss / d weights)

import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)

w = torch.tensor(1.0, requires_grad=True)

# forward pass and compute the loss
y_hat = w * x
loss = (y_hat - y)**2  # squared error, as in linear regression

print(loss)
```

```
tensor(1., grad_fn=<PowBackward0>)
```

```python
# backward pass
loss.backward()
w.grad
```

```
tensor(-2.)
```
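The value `tensor(-2.)` can be verified by hand with the chain rule. Writing $\hat{y} = wx$ and using the same values as the example ($x = 1$, $y = 2$, $w = 1$):

$$\frac{\partial\, \text{loss}}{\partial w} = \frac{\partial\, \text{loss}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = 2(\hat{y} - y) \cdot x = 2(1 - 2) \cdot 1 = -2$$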

### A Numpy Example

```python
import numpy as np

# linear regression
X = np.array([1, 2, 3, 4], dtype=np.float32)
Y = np.array([2, 4, 6, 8], dtype=np.float32)

w = 0.0

# model prediction
def forward(x):
    return w * x

# loss = Mean Squared Error (MSE)
def loss(y, y_predicted):
    return ((y_predicted - y)**2).mean()


# gradient of the squared error with respect to w: sum_i 2 * x_i * (w * x_i - y_i)
# np.dot already sums over the samples and the trailing .mean() acts on a scalar,
# so this is N times the gradient of the MSE loss (1/N * sum_i 2 * x_i * (w * x_i - y_i))
# and each update takes a step that is N times larger
def gradient(x, y, y_predicted):
    return np.dot(2*x, y_predicted - y).mean()


print(f'Prediction before training: f(5) = {forward(5):.3f}')

# training
learning_rate = 0.01

n_iters = 20

for epoch in range(n_iters):
    # prediction = forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)

    # gradients
    dw = gradient(X, Y, y_pred)

    # update weights
    w -= learning_rate * dw

    if epoch % 2 == 0:
        print(f'[INFO]: epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {forward(5):.3f}')
```

```
Prediction before training: f(5) = 0.000
[INFO]: epoch 1: w = 1.200, loss = 30.00000000
[INFO]: epoch 3: w = 1.872, loss = 0.76800019
[INFO]: epoch 5: w = 1.980, loss = 0.01966083
[INFO]: epoch 7: w = 1.997, loss = 0.00050331
[INFO]: epoch 9: w = 1.999, loss = 0.00001288
[INFO]: epoch 11: w = 2.000, loss = 0.00000033
[INFO]: epoch 13: w = 2.000, loss = 0.00000001
[INFO]: epoch 15: w = 2.000, loss = 0.00000000
[INFO]: epoch 17: w = 2.000, loss = 0.00000000
[INFO]: epoch 19: w = 2.000, loss = 0.00000000
Prediction after training: f(5) = 10.000
```
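A hand-written gradient is easy to get wrong by a constant factor, so it helps to compare it against a finite-difference estimate. The sketch below does this for the MSE loss on the same `X` and `Y`. The function names `mse` and `mse_gradient` are illustrative, and `float64` is used to keep the numerical estimate accurate.

```python
import numpy as np

X = np.array([1, 2, 3, 4], dtype=np.float64)
Y = np.array([2, 4, 6, 8], dtype=np.float64)

def mse(w):
    return np.mean((w * X - Y) ** 2)

def mse_gradient(w):
    # d/dw mean((w*x - y)^2) = mean(2 * x * (w*x - y))
    return np.mean(2 * X * (w * X - Y))

w = 0.0
h = 1e-6
numeric = (mse(w + h) - mse(w - h)) / (2 * h)  # central finite difference

print(mse_gradient(w))  # -30.0
print(numeric)          # approximately -30
```

At `w = 0` the MSE gradient is -30, whereas the summed form `np.dot(2*X, w*X - Y)` gives -120, a factor of N = 4 larger. With a learning rate of 0.01, that difference is the gap between a first update to `w = 0.300` and one to `w = 1.200`.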


### A PyTorch Example

```python
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
Y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)

w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

def forward(x):
    return w * x

def loss(y, y_predicted):
    return ((y_predicted - y)**2).mean()

print(f'Prediction before training: f(5) = {forward(5):.3f}')

learning_rate = 0.01
n_iters = 20

for epoch in range(n_iters):
    # prediction = forward pass
    y_pred = forward(X)

    # loss
    l = loss(Y, y_pred)

    # gradients = backward pass
    l.backward()  # dl/dw

    # update weights (outside the computational graph)
    with torch.no_grad():
        w -= learning_rate * w.grad

    # zero gradients so they do not accumulate across epochs
    w.grad.zero_()

    if epoch % 2 == 0:
        print(f'[INFO]: epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {forward(5):.3f}')
```

```
Prediction before training: f(5) = 0.000
[INFO]: epoch 1: w = 0.300, loss = 30.00000000
[INFO]: epoch 3: w = 0.772, loss = 15.66018772
[INFO]: epoch 5: w = 1.113, loss = 8.17471695
[INFO]: epoch 7: w = 1.359, loss = 4.26725292
[INFO]: epoch 9: w = 1.537, loss = 2.22753215
[INFO]: epoch 11: w = 1.665, loss = 1.16278565
[INFO]: epoch 13: w = 1.758, loss = 0.60698116
[INFO]: epoch 15: w = 1.825, loss = 0.31684780
[INFO]: epoch 17: w = 1.874, loss = 0.16539653
[INFO]: epoch 19: w = 1.909, loss = 0.08633806
Prediction after training: f(5) = 9.612
```


### Putting it all together

* Model Design (input, output size, forward pass)
* Create loss and optimizer
* Training Loop
    * Forward pass - compute prediction
    * Backward pass - gradients
    * Update weights - optimizer.step()

```python
import torch
import torch.nn as nn  # neural networks

# linear regression
X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32)
Y = torch.tensor([[2], [4], [6], [8]], dtype=torch.float32)

n_samples, n_features = X.shape  # 4, 1 (n_samples, n_features)

test = torch.tensor([5], dtype=torch.float32)

input_size = n_features
output_size = n_features

# class LinearRegression(nn.Module):
#     def __init__(self, input_dim, output_dim):
#         super(LinearRegression, self).__init__() # super class constructor
#         self.lin = nn.Linear(input_dim, output_dim) # define the linear layer

#     def forward(self, x):
#         return self.lin(x)

model = nn.Linear(input_size, output_size)  # == LinearRegression(input_size, output_size)

print(f'Prediction before training: f(5) = {model(test).item():.3f}')

learning_rate = 0.01
n_iters = 20

loss = nn.MSELoss()  # mean squared error loss
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  # stochastic gradient descent

# training
for epoch in range(n_iters):
    # prediction = forward pass
    y_pred = model(X)

    # loss; nn.MSELoss expects (input, target)
    l = loss(y_pred, Y)

    # gradients = backward pass
    l.backward()  # dl/dw

    optimizer.step()  # update weights

    # zero gradients
    optimizer.zero_grad()

    if epoch % 2 == 0:
        print(f'[INFO]: epoch {epoch + 1}: w = {model.weight.item():.3f}, loss = {l:.8f}')

print(f'Prediction after training: f(5) = {model(test).item():.3f}')
```

```
Prediction before training: f(5) = -3.665
[INFO]: epoch 1: w = -0.411, loss = 53.89147186
[INFO]: epoch 3: w = 0.204, loss = 26.07174301
[INFO]: epoch 5: w = 0.631, loss = 12.67595291
[INFO]: epoch 7: w = 0.928, loss = 6.22484207
[INFO]: epoch 9: w = 1.135, loss = 3.11738729
[INFO]: epoch 11: w = 1.279, loss = 1.61981273
[INFO]: epoch 13: w = 1.380, loss = 0.89736181
[INFO]: epoch 15: w = 1.450, loss = 0.54812449
[INFO]: epoch 17: w = 1.500, loss = 0.37859368
[INFO]: epoch 19: w = 1.535, loss = 0.29560304
Prediction after training: f(5) = 8.867
```


## Word Embeddings as Input to a Neural Network

```python
# working with word embeddings in PyTorch
import torch
import torch.nn as nn

# https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

embeddings = nn.Embedding(10, 3)  # 10 words, 3 dimensions
type(embeddings)
```

```
torch.nn.modules.sparse.Embedding
```

```python
# padding_idx=0 reserves index 0 for padding: its vector is all zeros and is not updated in training
embeddings = nn.Embedding(10, 3, padding_idx=0)  # 10 words, 3 dimensions
embeddings
```

```
Embedding(10, 3, padding_idx=0)
```
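Conceptually, an embedding layer is a lookup table: a weight matrix with one row per word, indexed by integer token ids. The NumPy sketch below mimics that lookup with the same sizes as `nn.Embedding(10, 3, padding_idx=0)`. The random weights and token ids are made up, and unlike the PyTorch layer nothing here is trainable.

```python
import numpy as np

rng = np.random.default_rng(0)

num_embeddings, embedding_dim = 10, 3   # same sizes as nn.Embedding(10, 3)
weight = rng.normal(size=(num_embeddings, embedding_dim))
weight[0] = 0.0                         # mimic padding_idx=0: row 0 is all zeros

# a batch of 2 sequences of length 3; the first one is padded with index 0
token_ids = np.array([[1, 4, 0],
                      [7, 2, 9]])

vectors = weight[token_ids]             # row lookup via integer indexing
print(vectors.shape)                    # (2, 3, 3) = (batch, seq_len, embedding_dim)
```

The output gains one trailing dimension of size `embedding_dim`, which is why the classifier later in these notes reduces over `dim=1` to obtain one vector per sentence.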

## Word Embeddings in PyTorch

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
```

```python
sentences = [
    "The cat is on the mat",
    "The dog is on the couch",
    "The bird is in the sky",
    "The fish is in the water",
]

labels = [0, 1, 2, 3]  # corresponding labels for the sentences
```

### Tokenize our text

```python
def tokenize(sentences):
    # lowercase each sentence and split it on whitespace
    tokens = [s.lower().split() for s in sentences]
    return tokens

tokens = tokenize(sentences)
```

### Create our vocabulary

```python
def create_vocab(tokens):
    # collect the unique words across all tokenized sentences
    vocab = set()
    for s in tokens:
        vocab.update(s)
    return vocab

vocab = create_vocab(tokens)
vocab_size = len(vocab)

vocab
```

```
{'bird',
 'cat',
 'couch',
 'dog',
 'fish',
 'in',
 'is',
 'mat',
 'on',
 'sky',
 'the',
 'water'}
```

### Create our word to index mapping

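```python
# sort the vocabulary so the mapping is reproducible across runs (set order is not),
# and start at 1 so that index 0 stays reserved for padding
word_to_idx = {word: idx for idx, word in enumerate(sorted(vocab), start=1)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
vocab_size = len(word_to_idx) + 1  # +1 for the padding index 0

def encode_sentences(tokens, word_to_idx):
    # replace every word by its integer index
    encoded_sentences = []
    for s in tokens:
        encoded_sentences.append([word_to_idx[word] for word in s])
    return encoded_sentences

encoded_sentences = encode_sentences(tokens, word_to_idx)
```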
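A plain dictionary lookup raises a `KeyError` for any word that was not in the training vocabulary. A common remedy, sketched below, is to reserve an index for unknown words. The special tokens `<pad>` and `<unk>` and the helper names are hypothetical choices for this illustration, not part of the notebook.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_word_to_idx(vocab):
    """Reserve index 0 for padding and index 1 for unknown words."""
    word_to_idx = {PAD: 0, UNK: 1}
    for word in sorted(vocab):               # sorted for a reproducible mapping
        word_to_idx[word] = len(word_to_idx)
    return word_to_idx

def encode(sentence, word_to_idx):
    """Map a sentence to indices, falling back to UNK for unseen words."""
    return [word_to_idx.get(w, word_to_idx[UNK]) for w in sentence.lower().split()]

word_to_idx = build_word_to_idx({"the", "cat", "is", "on", "mat"})
print(encode("The zebra is on the mat", word_to_idx))  # [6, 1, 3, 5, 6, 4]
```

With this scheme the embedding layer needs `len(word_to_idx)` rows, since the two special tokens occupy indices of their own.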

### Padding

```python
def pad_sequences(encoded_sentences, seq_length):
    # pad short sentences with index 0 and truncate long ones to seq_length
    padded_sentences = []
    for s in encoded_sentences:
        if len(s) < seq_length:
            padded_sentences.append(s + [0] * (seq_length - len(s)))
        else:
            padded_sentences.append(s[:seq_length])
    return padded_sentences

seq_length = 6
padded_sentences = pad_sequences(encoded_sentences, seq_length)
```

```python
class ToyDataset(Dataset):
    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # return one (token indices, label) pair as tensors
        return torch.tensor(self.sentences[idx]), torch.tensor(self.labels[idx])

dataset = ToyDataset(padded_sentences, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
```

```python
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)     # (batch, seq_len) -> (batch, seq_len, embedding_dim)
        x = torch.mean(x, dim=1)  # average over the words: one vector per sentence
        x = F.relu(self.fc1(x))
        x = self.fc2(x)           # raw class scores (logits)
        return x

embedding_dim = 10
hidden_dim = 32
output_dim = 4

model = TextClassifier(vocab_size, embedding_dim, hidden_dim, output_dim)
```
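Averaging over `dim=1` treats padding positions like real words. In this toy dataset every sentence has exactly `seq_length` tokens, so nothing is padded, but with sentences of different lengths a masked mean is a common refinement. The NumPy sketch below illustrates the idea on made-up values; `masked_mean` is a hypothetical helper, not a PyTorch function.

```python
import numpy as np

def masked_mean(embedded, token_ids, pad_idx=0):
    """Average embeddings over the sequence axis, ignoring padded positions.

    embedded:  (batch, seq_len, dim) array of token vectors
    token_ids: (batch, seq_len) array of token indices
    """
    mask = (token_ids != pad_idx).astype(embedded.dtype)     # 1 for real tokens, 0 for padding
    summed = (embedded * mask[:, :, None]).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    return summed / counts

token_ids = np.array([[3, 5, 0, 0]])  # two real tokens followed by two pads
embedded = np.array([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0], [9.0, 9.0]]])

print(masked_mean(embedded, token_ids))  # [[2. 2.]]
print(embedded.mean(axis=1))             # [[5.5 5.5]]
```

The plain mean is pulled toward whatever vector sits at the padded positions, while the masked mean depends only on the real tokens.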
        

```python
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class labels
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 50
```

```python
for epoch in range(num_epochs):
    epoch_loss = 0.0
    epoch_accuracy = 0.0

    # batch_* names keep the global `sentences` and `labels` lists from being overwritten
    for batch_sentences, batch_labels in dataloader:
        optimizer.zero_grad()
        outputs = model(batch_sentences)

        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        predictions = torch.argmax(outputs, dim=1)
        epoch_accuracy += (predictions == batch_labels).sum().item()

    epoch_loss /= len(dataloader)   # average loss per batch
    epoch_accuracy /= len(dataset)  # fraction of correctly classified sentences
    print(f'Epoch {epoch+1}, Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.4f}')
```

```
Epoch 1, Loss: 1.4143, Accuracy: 0.2500
Epoch 2, Loss: 1.4051, Accuracy: 0.2500
Epoch 3, Loss: 1.3979, Accuracy: 0.2500
Epoch 4, Loss: 1.3912, Accuracy: 0.2500
Epoch 5, Loss: 1.3849, Accuracy: 0.2500
Epoch 6, Loss: 1.3796, Accuracy: 0.2500
Epoch 7, Loss: 1.3728, Accuracy: 0.2500
Epoch 8, Loss: 1.3671, Accuracy: 0.2500
Epoch 9, Loss: 1.3615, Accuracy: 0.2500
Epoch 10, Loss: 1.3547, Accuracy: 0.2500
Epoch 11, Loss: 1.3492, Accuracy: 0.2500
Epoch 12, Loss: 1.3432, Accuracy: 0.2500
Epoch 13, Loss: 1.3371, Accuracy: 0.2500
Epoch 14, Loss: 1.3310, Accuracy: 0.2500
Epoch 15, Loss: 1.3248, Accuracy: 0.2500
Epoch 16, Loss: 1.3187, Accuracy: 0.2500
Epoch 17, Loss: 1.3124, Accuracy: 0.2500
Epoch 18, Loss: 1.3061, Accuracy: 0.2500
Epoch 19, Loss: 1.2999, Accuracy: 0.2500
Epoch 20, Loss: 1.2940, Accuracy: 0.2500
Epoch 21, Loss: 1.2872, Accuracy: 0.2500
Epoch 22, Loss: 1.2808, Accuracy: 0.2500
Epoch 23, Loss: 1.2749, Accuracy: 0.5000
Epoch 24, Loss: 1.2682, Accuracy: 0.5000
Epoch 25, Loss: 1.2617, A
```
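The first-epoch loss of about 1.41 is close to ln(4) ≈ 1.386, which is the cross-entropy of a classifier that assigns equal probability to each of the four classes. The sketch below computes cross-entropy for a single example in plain Python to show where that number comes from. `nn.CrossEntropyLoss` combines the same log-softmax and negative log-likelihood steps, but the function here is an illustrative re-implementation, not the library code.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

num_classes = 4

# equal logits mean equal probabilities of 1/4 for every class
uninformed = cross_entropy([0.0] * num_classes, target=2)
print(f"{uninformed:.4f}")  # 1.3863, i.e. ln(4)

# a confident, correct prediction drives the loss toward 0
confident = cross_entropy([10.0, 0.0, 0.0, 0.0], target=0)
print(confident < 0.001)    # True
```

A loss that stays near ln(number of classes) therefore indicates that the model has not yet learned anything beyond guessing.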

```python
# these two sentences also appear in the training data
test_sentences = [
    "The cat is on the mat",
    "The dog is on the couch",
]

test_tokens = tokenize(test_sentences)
encoded_test_sentences = encode_sentences(test_tokens, word_to_idx)
padded_test_sentences = pad_sequences(encoded_test_sentences, seq_length)

# no gradients are needed for inference
with torch.no_grad():
    test_sentences_tensor = torch.tensor(padded_test_sentences)
    test_outputs = model(test_sentences_tensor)
    test_predictions = torch.argmax(test_outputs, dim=1)
    print(f'Test predictions: {test_predictions}')
```

```
Test predictions: tensor([0, 1])
```
