# Embeddings HW

## Exercise 1: Word analogies

An analogy explains one thing in terms of another to highlight the ways in which they are alike. For example, *paris* is to *france* as *rome* is to *italy*. Word2Vec vectors can sometimes solve analogy problems of the form **a is to b as a\* is to what?**

In the cell below, we show you how to use word vectors to find the missing word x. The `most_similar` function finds the words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. The answer to the analogy is the top-ranked word, i.e. the one with the largest similarity score. In the example below, the top-ranked word is *italy*, so this analogy is solved successfully.
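
As a rough illustration of the idea behind this ranking, the sketch below solves the same kind of analogy with hand-made 4-dimensional toy vectors. The `toy_vectors` dictionary and the `solve_analogy` helper are hypothetical names invented for this example and are not part of gensim; the real `most_similar` works on the pre-trained 300-dimensional vectors and may differ in detail. The sketch adds the normalized `positive` vectors, subtracts the normalized `negative` ones, and ranks the remaining words by cosine similarity to the result.

```python
import math

# Hand-made toy vectors (assumption: purely illustrative, not real embeddings).
# Dimensions loosely mean: [capital, country, French, Italian]
toy_vectors = {
    "paris":  [1.0, 0.0, 1.0, 0.0],
    "france": [0.0, 1.0, 1.0, 0.0],
    "rome":   [1.0, 0.0, 0.0, 1.0],
    "italy":  [0.0, 1.0, 0.0, 1.0],
    "spain":  [0.0, 1.0, 0.0, 0.0],
    "pizza":  [0.0, 0.0, 0.0, 1.0],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def solve_analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]

    # Words used in the query are excluded from the candidates
    query = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = solve_analogy(toy_vectors, positive=["rome", "france"], negative=["paris"])
print(ranking[0][0])  # italy
```

With real embeddings the top-ranked word is not always the expected one, which is exactly what the task below asks you to explore.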

**Your task** is to find one analogy that this pre-trained Word2Vec model solves successfully and one analogy that it fails to solve. You can check out [this paper](https://www.semanticscholar.org/paper/Efficient-Estimation-of-Word-Representations-in-Mikolov-Chen/330da625c15427c6e42ccfa3b747fb29e5835bf0) for inspiration.

Please submit a clean, easy-to-read notebook.

In [None]:
# !pip3 install gensim

In [1]:
import gensim.downloader as api
import time

In [10]:
def fprint_pairs(pairs: list):
    """Formatted print function on list of pairs"""
    
    print("-"*38)
    print("{: ^4} {: ^19} {: ^13}".format("Rank", "Word", "Similarity"))
    print("-"*4,"-"*19,"-"*13)

    i = 1
    for k,v in pairs:
        print("{: <4} {: <19} {:>13.2f}".format(i,k,v))
        i += 1
    
    print("-"*38)
    print()

In [8]:
# Load 3 million Word2Vec vectors pre-trained on Google News, each of dimension 300
# This model may take a few minutes to load for the first time.

start_time = time.time()
w2v_google = api.load("word2vec-google-news-300")
print("--- %s seconds ---" % (time.time() - start_time))

--- 66.94558310508728 seconds ---


In [159]:
print(f"Loaded vocab size: {len(w2v_google.index_to_key)}")

Loaded vocab size: 3000000


In [None]:
w2v_google["cat"].shape  # Embedding vector length

(300,)

In [11]:
# Run this cell to answer the analogy -- paris : france :: rome : x
similarity_paris = w2v_google.most_similar(positive=['rome', 'france'], negative=['paris'])

fprint_pairs(similarity_paris)

--------------------------------------
Rank        Word          Similarity  
---- ------------------- -------------
1    italy                        0.52
2    european                     0.51
3    italian                      0.51
4    epl                          0.49
5    spain                        0.49
6    england                      0.49
7    italians                     0.48
8    kosovo                       0.48
9    lampard                      0.48
10   malta                        0.48
--------------------------------------



In [None]:
# TODO: Add your code here

## Exercise 2: Classification (OPTIONAL)

Implement the data processing part of a classification pipeline that uses a simple neural network.

### Download data

[Data description](https://www.kaggle.com/datasets/parulpandey/emotion-dataset)

You can download the data from the web using the link, or from code by following the instructions below:

1. Go to [Kaggle](https://www.kaggle.com) and register
2. In your profile settings, scroll down to the API section
3. Generate a key with `Create New Token`. This downloads a file named `kaggle.json`
4. Place it in the [appropriate location](https://github.com/Kaggle/kaggle-api/blob/main/docs/README.md#api-credentials)
5. Now, you can download all the necessary files from code.
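
Before authenticating, it can help to check that the credentials file is where the client will look for it. The sketch below assumes the commonly documented default of `~/.kaggle/kaggle.json`, or the directory named by the `KAGGLE_CONFIG_DIR` environment variable if it is set; the linked README remains the authoritative source, and `kaggle_credentials_path` is a hypothetical helper, not part of the Kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_path() -> Path:
    """Return the path where kaggle.json is expected (assumed default locations)."""
    config_dir = os.environ.get("KAGGLE_CONFIG_DIR")
    if config_dir:
        # An explicitly configured directory takes precedence
        return Path(config_dir) / "kaggle.json"
    return Path.home() / ".kaggle" / "kaggle.json"


path = kaggle_credentials_path()
if path.is_file():
    print(f"Found credentials at {path}")
else:
    print(f"No kaggle.json at {path}; check the README for the expected location")
```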

In [None]:
# !pip3 install kaggle torch

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi
import os
import torch

In [153]:
# Named kaggle_api so it does not shadow the `api` alias of gensim.downloader from Exercise 1
kaggle_api = KaggleApi()
kaggle_api.authenticate()

dataset_slug = 'parulpandey/emotion-dataset'  # Dataset
download_path = os.path.join("..", "data", "emotion_dataset")  # Destination folder

kaggle_api.dataset_download_files(dataset_slug, path=download_path, unzip=True)

Dataset URL: https://www.kaggle.com/datasets/parulpandey/emotion-dataset


In [27]:
os.listdir(download_path)

['training.csv', 'test.csv', 'validation.csv']

### Process data

Make train and test datasets. Note that this dataset has 6 different labels: sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5). Keep only the rows with label 0 or 1 and filter out everything else.
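
One possible shape for this filtering step is sketched below on a tiny in-memory CSV instead of the real files. It assumes the columns are named `text` and `label`; check the header of `training.csv` before relying on that. The `load_binary_subset` helper and the sample rows are made up for illustration, and a pandas-based solution would work just as well.

```python
import csv
import io

# Tiny stand-in for training.csv (assumption: columns are "text" and "label")
sample_csv = io.StringIO(
    "text,label\n"
    "i feel so low today,0\n"
    "i am thrilled about the trip,1\n"
    "i adore my little sister,2\n"
    "i am furious about the delay,3\n"
)


def load_binary_subset(csv_file, keep_labels=(0, 1)):
    """Read rows from an open CSV file and keep only the requested labels."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        label = int(row["label"])
        if label in keep_labels:
            texts.append(row["text"])
            labels.append(label)
    return texts, labels


X_sample, y_sample = load_binary_subset(sample_csv)
print(len(X_sample), y_sample)  # 2 [0, 1]
```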

In [None]:
# TODO: Add your code here

len(X_train), len(y_train), len(X_test), len(y_test)

(10028, 10028, 1276, 1276)

### Convert the text to vector

Do some kind of preprocessing and average the word vectors of each text. Use your favorite library, e.g. spacy or gensim. The conversion can take a few seconds, so save the final vectors for later so that you don't have to wait every time.

For the example outputs, I used the gensim library.
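
To make the averaging step concrete, here is a minimal sketch that uses a made-up 3-dimensional embedding table instead of a real model. The `toy_embeddings` dictionary and the `average_vector` helper are hypothetical; lowercasing and whitespace splitting stand in for whatever preprocessing you choose, and returning a zero vector for texts without known words is just one possible convention.

```python
# Toy 3-dimensional embeddings (assumption: made up for illustration only)
toy_embeddings = {
    "happy": [0.9, 0.1, 0.0],
    "sad":   [-0.8, 0.2, 0.1],
    "today": [0.0, 0.5, 0.5],
}


def average_vector(text, embeddings, dim=3):
    """Average the vectors of all known tokens in the text."""
    tokens = text.lower().split()  # very simple preprocessing
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known word: fall back to a zero vector of the right size
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all tokens
    return [sum(column) / len(known) for column in zip(*known)]


print(average_vector("Happy today", toy_embeddings))
```

With a real model, the lookup table would be the loaded word vectors and `dim` would be 300, which matches the tensor shapes shown further down.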

In [None]:
def text2vec(text: str) -> list:
    # TODO: Add your code here
    pass

    return list(sentence_vector)

In [112]:
X_train = torch.tensor([text2vec(text_input) for text_input in X_train])
y_train = torch.tensor(y_train)

X_test = torch.tensor([text2vec(text_input) for text_input in X_test])
y_test = torch.tensor(y_test)

In [113]:
X_train.size(), y_train.size(), X_test.size(), y_test.size()

(torch.Size([10028, 300]),
 torch.Size([10028]),
 torch.Size([1276, 300]),
 torch.Size([1276]))

In [None]:
# Save
torch.save({"X_train": X_train, "y_train": y_train}, 'train_data.pth')
torch.save({"X_test": X_test, "y_test": y_test}, 'test_data.pth')

# Load
# train_data = torch.load('train_data.pth')
# X_train = train_data["X_train"]
# y_train = train_data["y_train"]

# test_data = torch.load('test_data.pth')
# X_test = test_data["X_test"]
# y_test = test_data["y_test"]

### Data loader



In [116]:
from torch.utils.data import TensorDataset, DataLoader

In [117]:
batch_size = 32

train_loader = DataLoader(dataset=TensorDataset(X_train, y_train), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=TensorDataset(X_test, y_test), batch_size=batch_size, shuffle=False)

In [None]:
# Test the DataLoader

for i, (X_batch, y_batch) in enumerate(train_loader):
    print(f"Batch {i+1}:")
    print("X_batch shape:", X_batch.shape)
    print("y_batch shape:", y_batch.shape)
    print()
    
    break

Batch 1:
X_batch shape: torch.Size([32, 300])
y_batch shape: torch.Size([32])



### Model

In [119]:
class EmotionClassifier(torch.nn.Module):
    def __init__(self, input_dim, num_classes):
        super(EmotionClassifier, self).__init__()

        self.fc = torch.nn.Linear(input_dim, 1024)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(1024, 512)
        self.relu2 = torch.nn.ReLU()
        self.fc3 = torch.nn.Linear(512, 256)
        self.relu3 = torch.nn.ReLU()
        self.fc4 = torch.nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.fc(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.relu3(x)
        x = self.fc4(x)

        return x

### Train

In [131]:
model = EmotionClassifier(input_dim=300, num_classes=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
def train(model, train_loader, criterion, optimizer, device, num_epochs=5):

    model.train()

    for epoch in range(1, num_epochs+1):
        total_loss = 0.0
        correct = 0
        total = 0

        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            optimizer.zero_grad()

            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)

            loss.backward()
            optimizer.step()

            predicted_labels = torch.argmax(outputs.data, dim=1)

            total += y_batch.size(0)

            total_loss += loss.item()
            correct += (predicted_labels == y_batch).sum().item()
        
        print(f"Epoch [{epoch}/{num_epochs}], Loss: {total_loss/len(train_loader):.4f}, Accuracy: {correct/total:.4f}")

    return model

model = train(model, train_loader, criterion, optimizer, device, num_epochs=5)

Epoch [1/5], Loss: 0.3506, Accuracy: 0.8427
Epoch [2/5], Loss: 0.2694, Accuracy: 0.8857
Epoch [3/5], Loss: 0.2396, Accuracy: 0.8985
Epoch [4/5], Loss: 0.2112, Accuracy: 0.9102
Epoch [5/5], Loss: 0.1767, Accuracy: 0.9225


In [None]:
def evaluate(model, test_loader, criterion, device):
    """Compute the average loss and the accuracy on the test set."""
    # Named `evaluate` so that it does not shadow Python's built-in eval()
    model.eval()

    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)

            predicted_labels = torch.argmax(outputs.data, dim=1)

            total += y_batch.size(0)

            total_loss += loss.item()
            correct += (predicted_labels == y_batch).sum().item()

    return total_loss / len(test_loader), correct / total

test_loss, test_accuracy = evaluate(model, test_loader, criterion, device)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

Test Loss: 0.2455, Test Accuracy: 0.9013


In [151]:
def eval_example(model, sentence, device):
    sentence_vector = torch.tensor(text2vec(sentence)).unsqueeze(0).to(device)

    model.eval()
    with torch.no_grad():
        output = model(sentence_vector)
        predicted_label = torch.argmax(output.data, dim=1)
        confidence = torch.nn.functional.softmax(output, dim=1)[0][predicted_label]
        
        print(f"Input sentence: {sentence}")
        print(f"Predicted label: {predicted_label.item()}")
        print(f"Confidence: {confidence.item():.3f}\n")

In [152]:
sentence_0 = "i am quite happy to be here today with you all"
sentence_1 = "i am quite happy to be here today with you all, what a shame that it's raining"

eval_example(model, sentence_0, device)
eval_example(model, sentence_1, device)

Input sentence: i am quite happy to be here today with you all
Predicted label: 1
Confidence: 1.000

Input sentence: i am quite happy to be here today with you all, what a shame that it's raining
Predicted label: 0
Confidence: 0.559
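
The confidence printed above is the softmax probability of the predicted class. The sketch below shows, in plain Python and with hypothetical logits, why two outputs that are far apart give a confidence close to 1, while two nearly equal outputs give a confidence close to 0.5. The numbers are invented for illustration and are not the model's actual outputs.

```python
import math


def softmax(logits):
    """Turn raw model outputs (logits) into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


clear_case = softmax([-4.0, 6.0])  # hypothetical logits, far apart
close_call = softmax([0.3, 0.1])   # hypothetical logits, nearly equal

print(f"{max(clear_case):.3f}", f"{max(close_call):.3f}")
```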

