# NLP Assignment 1

Overview:
In this assignment, you will be asked to:

  - generate batches for the skip-gram model
  - implement two loss functions to train word embeddings
  - tune the parameters for word embeddings
  - apply the best learned word embeddings to a word analogy task
  - calculate the bias score of your best models
  - create a new task on which to run the WEAT test

NOTE:

Please only make your code edits in the TODO(students) blocks in the notebook, and make sure you have run the earlier notebook cells before running the later ones. Add comments to explain your code well, and use meaningful identifier names.


How to use this notebook:
  - This notebook is best viewed and executed in Google Colab.
  - Please upload the .ipynb version of this notebook to Google Drive on your SBU CS account.
  - Double-click the file and select Open with Colab.
  - Upload the provided files to the current working directory of the Colab notebook.

Please use the following Google Colab Tutorial in case you are not familiar with the tool: [Link](https://colab.research.google.com/drive/16pBJQePbqkz3QFV54L4NIkOn1kwpuRrj)

## Setting up the data and needed libraries

In [None]:
# Download datafile
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
!rm text8.zip

In [None]:
from google.colab import drive
drive.mount('/content/drive')

<b>Importing needed libraries and setting up random seeds</b>

In [None]:
# All import statements

import collections
import json

import numpy as np
from scipy.spatial import distance

import torch
import torch.nn as nn
import random
import math

from tqdm import tqdm

import os
import pickle

# Setting up all the seeds for repeatable experiments
# DO NOT CHANGE
np.random.seed(1234)
torch.manual_seed(1234)

## Generating the Data


To train word vectors, you need to generate training instances from the given data. You will implement a method that generates training instances in batches. For the skip-gram model, you slide a window over the text and sample training instances from the words inside the window.

<b>For example:</b>

Suppose that we have the text "The quick brown fox jumps over the lazy dog."
with batch_size = 8 and window_size = 3.

"<font color = red>[The quick brown]</font> fox jumps over the lazy dog"

The context word is 'quick' and the words to predict are 'The' and 'brown'.

This generates training examples of the form context(x), predicted_word(y), such as:
<ul>
      <li>(quick    ,       The)
      <li>(quick    ,     brown)
</ul>
And then move the sliding window.

"The <font color = red>[quick brown fox]</font> jumps over the lazy dog"

In the same way, we have two more examples:
<ul>
    <li>(brown, quick)
    <li>(brown, fox)
</ul>

Moving the window again:

"The quick <font color = red>[brown fox jumps]</font> over the lazy dog"

We get,

<ul>
    <li>(fox, brown)
    <li>(fox, jumps)
</ul>

Finally we get two more instances from the moved window,

"The quick brown <font color = red>[fox jumps over]</font> the lazy dog"

<ul>
    <li>(jumps, fox)
    <li>(jumps, over)
</ul>

Since now we have 8 training instances, which is the batch size,
stop generating this batch and return batch data.
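
The walkthrough above can be reproduced with a short sketch. This is a simplified, deterministic illustration only: the helper name `sliding_window_pairs` is hypothetical and not part of the assignment code, and it takes every neighbour inside the window instead of sampling `num_skips` of them.

```python
def sliding_window_pairs(tokens, skip_window=1, batch_size=8):
    """Collect (context, predicted_word) pairs until the batch is full."""
    pairs = []
    # Centre the window on every position that has skip_window neighbours on both sides.
    for center in range(skip_window, len(tokens) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # skip the centre word itself
            pairs.append((tokens[center], tokens[center + offset]))
            if len(pairs) == batch_size:
                return pairs  # batch is full, stop generating
    return pairs


sentence = "The quick brown fox jumps over the lazy dog".split()
pairs = sliding_window_pairs(sentence, skip_window=1, batch_size=8)
for context, predicted in pairs:
    print(context, "->", predicted)
```

With `skip_window=1` the window covers three words, so each window position contributes two pairs and the batch of 8 is full after four positions, as in the example.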

The two functions given below are helper functions that assist in fetching the data from the file streams.

In [None]:
# Read the data into a list of strings.
def read_data(filename):
    with open(filename) as file:
        text = file.read()
        data = [token.lower() for token in text.strip().split(" ")]
    return data

def build_dataset(words, vocab_size):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocab_size - 1))
    # vocab_token_to_id maps token -> id; vocab_id_to_token is the reverse mapping
    vocab_token_to_id = dict()
    for word, _ in count:
        vocab_token_to_id[word] = len(vocab_token_to_id)
    data = list()
    unk_count = 0
    for word in words:
        if word in vocab_token_to_id:
            index = vocab_token_to_id[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    vocab_id_to_token = dict(zip(vocab_token_to_id.values(), vocab_token_to_id.keys()))
    return data, count, vocab_token_to_id, vocab_id_to_token

<b>Variable Description</b>

`data_index` is the index of a word. You can access a word using `data[data_index]`.

`batch_size` is the number of instances in one batch.

`num_skips` is the number of samples you want to draw in a window (in the example above, it was 2).

`skip_window` decides how many words to consider to the left and right of a context word (so `skip_window*2+1 = window_size`).

`batch` will contain word ids for context words. Dimension is [batch_size].

`labels` will contain word ids for predicting words. Dimension is [batch_size, 1].
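
To make the relationship between these variables concrete, here is a small sketch. The helper name `describe_batch` is hypothetical; the two checks mirror the assertions in the `Dataset` constructor in the next cell.

```python
def describe_batch(batch_size, num_skips, skip_window):
    """Return (window_size, number of window positions needed for one batch)."""
    window_size = 2 * skip_window + 1
    # Each window position contributes exactly num_skips instances.
    if batch_size % num_skips != 0:
        raise ValueError("batch_size must be a multiple of num_skips")
    # A window holds only 2 * skip_window words besides the centre word.
    if num_skips > 2 * skip_window:
        raise ValueError("num_skips cannot exceed 2 * skip_window")
    positions_per_batch = batch_size // num_skips
    return window_size, positions_per_batch


window_size, positions_per_batch = describe_batch(batch_size=128, num_skips=8, skip_window=4)
print(window_size, positions_per_batch)
```

For the constructor defaults (`batch_size=128`, `num_skips=8`, `skip_window=4`) the window spans 9 words and one batch is built from 16 window positions.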


Please fill the TODO section in the code below to generate data batches.

In [None]:
class Dataset:
    def __init__(self, data, batch_size=128, num_skips=8, skip_window=4):
        """
        @data_index: the index of a word. You can access a word using data[data_index]
        @batch_size: the number of instances in one batch
        @num_skips: the number of samples you want to draw in a window
                (In the example above, it was 2)
        @skip_window: decides how many words to consider left and right from a context word.
                    (So, skip_window*2+1 = window_size)
        """

        self.data_index=0
        self.data = data
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window

        self.batch_size = batch_size
        self.num_skips = num_skips
        self.skip_window = skip_window

    def reset_index(self, idx=0):
        self.data_index=idx

    def generate_batch(self):
        """
        Write the code to generate a training batch

        batch will contain word ids for context words. Dimension is [batch_size].
        labels will contain word ids for predicting(target) words. Dimension is [batch_size, 1].
        """

        center_word = np.ndarray(shape=(self.batch_size), dtype=np.int32) # not initialized
        context_word = np.ndarray(shape=(self.batch_size), dtype=np.int32)# not initialized

        # stride: for the rolling window
        stride = 1

        ### TODO(students): start
        span = 2 * self.skip_window + 1  # [ skip_window target skip_window ]

        buffer = collections.deque(maxlen=span)
        for _ in range(span):
            buffer.append(self.data[self.data_index])
            self.data_index = (self.data_index + 1) % len(self.data)

        for i in range(self.batch_size // self.num_skips):
            target = self.skip_window  # target label at the center of the buffer
            targets_to_avoid = [self.skip_window]
            for j in range(self.num_skips):
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                center_word[i * self.num_skips + j] = buffer[self.skip_window]
                context_word[i * self.num_skips + j] = buffer[target]
            buffer.append(self.data[self.data_index])
            self.data_index = (self.data_index + 1) % len(self.data)

        return torch.LongTensor(center_word), torch.LongTensor(context_word)



## Building the Model



<b>Negative Log Likelihood (NLL): </b>
We discussed the log likelihood function in class. This is the negative of the same. These are called “loss” functions since they measure how far the current model is from the expected behavior. Refer to the class notes on this topic.

To understand it better, you may also refer to Section 4.3 [here](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf).

Training a word2vec model with this loss and the default settings took ~50 mins on Google Colab with a GPU accelerator. It will take ~10 hrs on a 2018 MacBook Pro CPU.
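
As a rough illustration of this loss for a single training pair, the sketch below computes the negative log of the softmax probability of the true target word using plain Python lists. The function name `nll_for_pair` and the toy vectors are hypothetical; the assignment code works on batched PyTorch tensors instead.

```python
import math


def nll_for_pair(center_vec, context_matrix, target_id):
    """Negative log-likelihood of target_id under a softmax over all context vectors."""
    # Score of every vocabulary word against the centre word (dot products).
    scores = [sum(c * u for c, u in zip(center_vec, row)) for row in context_matrix]
    # Numerically stable log-sum-exp over the whole (toy) vocabulary.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    # -log softmax(target) = logsumexp(scores) - score(target)
    return log_sum_exp - scores[target_id]


# Hypothetical 4-word vocabulary with 2-dimensional context vectors.
toy_context_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
loss_untrained = nll_for_pair([0.0, 0.0], toy_context_matrix, target_id=2)
loss_aligned = nll_for_pair([2.0, 2.0], toy_context_matrix, target_id=2)
print(loss_untrained, loss_aligned)
```

A zero centre vector gives every word the same probability, so the loss equals log(4) for this 4-word vocabulary; a centre vector aligned with the target's context vector gives a smaller loss.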

<br>

<b>Negative Sampling (NEG): </b>
Negative sampling formulates a slightly different classification task and a corresponding loss.
[This paper](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) describes the method in detail.

The idea here is to build a classifier that can give high probabilities to words that are the correct target words and low probabilities to words that are incorrect target words.
As with negative log likelihood loss, here we define the classifier using a function that uses the word vectors of the context and target as free parameters.
The key difference however is that instead of using the entire vocabulary, here we sample a set of k negative words for each instance, and create an augmented instance which is a collection of the true target word and k negative words.
Now the vectors are trained to maximize the probability of this augmented instance.
To understand it better, you may also refer to Section 4.4 [here](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf).

Training a word2vec model with this loss and the default settings took ~2h30 mins on Google Colab with a GPU accelerator.
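
The objective described above can be sketched for one augmented instance as follows. This is a minimal illustration in plain Python, with hypothetical names (`neg_sampling_loss`, `log_sigmoid`) and toy vectors; it assumes the k negative words have already been sampled.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_loss(center_vec, target_vec, negative_vecs):
    """Loss for one true (center, target) pair plus k sampled negative words."""
    # Push the true pair towards a high score.
    positive_term = log_sigmoid(dot(center_vec, target_vec))
    # Push each sampled negative word towards a low score.
    negative_term = sum(log_sigmoid(-dot(center_vec, neg)) for neg in negative_vecs)
    return -(positive_term + negative_term)


# Hypothetical vectors with k = 2 negative samples.
loss_zero = neg_sampling_loss([0.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
loss_good = neg_sampling_loss([2.0, 0.0], [2.0, 0.0], [[-2.0, 0.0], [-1.0, 0.0]])
print(loss_zero, loss_good)
```

Only k + 1 dot products are needed per instance, rather than one per vocabulary word as in the full softmax.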




<b> Please implement the models and loss functions in the code below in the TODO sections </b>

In [None]:
# Defining the sigmoid function
sigmoid = lambda x: 1/(1 + torch.exp(-x))

class WordVec(nn.Module):
    def __init__(self, V, embedding_dim, loss_func, counts, num_neg_samples_per_center = 1):
        super(WordVec, self).__init__()
        self.center_embeddings = nn.Embedding(num_embeddings=V, embedding_dim=embedding_dim)
        self.center_embeddings.weight.data.normal_(mean=0, std=1/math.sqrt(embedding_dim))
        self.center_embeddings.weight.data[self.center_embeddings.weight.data<-1] = -1
        self.center_embeddings.weight.data[self.center_embeddings.weight.data>1] = 1

        self.context_embeddings = nn.Embedding(num_embeddings=V, embedding_dim=embedding_dim)
        self.context_embeddings.weight.data.normal_(mean=0, std=1/math.sqrt(embedding_dim))
        self.context_embeddings.weight.data[self.context_embeddings.weight.data<-1] = -1 + 1e-10
        self.context_embeddings.weight.data[self.context_embeddings.weight.data>1] = 1 - 1e-10

        self.loss_func = loss_func
        self.counts = counts

        self.num_neg_samples_per_center = num_neg_samples_per_center

    def forward(self, center_word, context_word):

        if self.loss_func == "nll":
            return self.negative_log_likelihood_loss(center_word, context_word)
        elif self.loss_func == "neg":
            return self.negative_sampling(center_word, context_word)
        else:
            raise Exception("No implementation found for %s"%(self.loss_func))

    def negative_log_likelihood_loss(self, center_word, context_word):

        # Notes (page 9): http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf
        center_word_embeddings = self.center_embeddings(center_word) # batches, dims
        context_word_embeddings = self.context_embeddings(context_word) # batches, dims

        a = torch.sum(torch.mul(center_word_embeddings, context_word_embeddings), axis=1) # batches
        # (batches, dims) @ (dims, V) = (batches, V);
        b = torch.logsumexp(center_word_embeddings @ self.context_embeddings.weight.T, dim=1) # batches
        loss = torch.mean(b - a)

        return loss

    def negative_sampling(self, center_word, context_word):
        # Use this variable to control the number of negative samples for every positive sample
        num_neg_samples_per_center = self.num_neg_samples_per_center

        # Calculate unigram distribution
        unigram_dist = np.array(self.counts, dtype=np.int64)
        unigram_dist = unigram_dist / np.sum(unigram_dist)
        unigram_dist = np.power(unigram_dist, 0.75)
        unigram_dist /= np.sum(unigram_dist)

        # Sample negative words
        batch_size = center_word.size(0)
        neg_samples = torch.tensor(
            np.random.choice(len(self.counts),
                             size=(batch_size, num_neg_samples_per_center),
                             replace=True,
                             p=unigram_dist
                             ),
                            dtype=torch.int64).to(center_word.device)

        # Calculate the positive term
        center_word_embeddings = self.center_embeddings(center_word)
        context_word_embeddings = self.context_embeddings(context_word)
        pos_term = torch.sum(torch.mul(center_word_embeddings, context_word_embeddings), axis=1)
        pos_term = torch.sigmoid(pos_term)
        pos_term = torch.log(pos_term)

        # Calculate the negative term
        neg_word_embeddings = self.context_embeddings(neg_samples)
        neg_term = torch.bmm(neg_word_embeddings, center_word_embeddings.unsqueeze(2)).squeeze(2)
        neg_term = torch.sigmoid(-neg_term)
        neg_term = torch.sum(torch.log(neg_term), axis=1)

        # Calculate loss
        loss = -torch.mean(pos_term + neg_term)

        return loss



    def print_closest(self, validation_words, reverse_dictionary, top_k=8):
        print('Printing closest words')
        embeddings = torch.zeros(self.center_embeddings.weight.shape).copy_(self.center_embeddings.weight)
        embeddings = embeddings.data.cpu().numpy()

        validation_ids = validation_words
        norm = np.sqrt(np.sum(np.square(embeddings),axis=1,keepdims=True))
        normalized_embeddings = embeddings/norm
        validation_embeddings = normalized_embeddings[validation_ids]
        similarity = np.matmul(validation_embeddings, normalized_embeddings.T)
        for i in range(len(validation_ids)):
            word = reverse_dictionary[validation_words[i]]
            nearest = (-similarity[i, :]).argsort()[1:top_k+1]
            print(word, [reverse_dictionary[nearest[k]] for k in range(top_k)])

## Training and Data Loading Loops

The code below uses the models and losses built above and runs the actual training process.

In [None]:
class Trainer:
    def __init__(self, model, ckpt_save_path, reverse_dictionary):
        self.model = model
        self.ckpt_save_path = ckpt_save_path
        self.reverse_dictionary = reverse_dictionary

    def training_step(self, center_word, context_word):
        loss =  self.model(center_word, context_word)
        return loss

    def train(self, dataset, max_training_steps, ckpt_steps, validation_words, device="cpu", lr = 1):

        optim = torch.optim.SGD(self.model.parameters(), lr = lr)
        self.model.to(device)
        self.model.train()
        self.losses = []

        t = tqdm(range(max_training_steps))
        for curr_step in t:
            optim.zero_grad()
            center_word, context_word = dataset.generate_batch()
            loss = self.training_step(center_word.to(device), context_word.to(device))
            loss.backward()
            optim.step()
            self.losses.append(loss.item())
            if curr_step:
                t.set_description("Avg loss: %s"%(round(sum(self.losses[-2000:])/len(self.losses[-2000:]), 3)))
            if curr_step % 10000 == 0:
                self.model.print_closest(validation_words, self.reverse_dictionary)
            if curr_step%ckpt_steps == 0 and curr_step > 0:
                self.save_ckpt(curr_step)

    def save_ckpt(self, curr_step):
        torch.save(self.model, "%s/%s.pt"%(self.ckpt_save_path, str(curr_step)))

## Training Framework

The following run_training function will train a model for you as shown in the results below. Please use it for the various experiments you will perform. The parameters of the run_training() function include many hyperparameters that you should experiment with, such as vector size, batch size, vocabulary size, and number of training steps. Use a function call similar to the one in the following cell as guidance on how to train the model.

In [None]:
def create_path(path):
    if not os.path.exists(path):
        os.mkdir(path)
        print ("Created a path: %s"%(path))

def run_training(
    model_type = 'nll', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log likelihood and 'neg' for negative sampling
    lr = 0.1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 1, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints', # defines path to the checkpoint of the model
    final_model_path = './final_model', # location to save the final model
    skip_window = 4, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
):

    checkpoint_model_path = f'{checkpoint_model_path}_{model_type}/'
    create_path(checkpoint_model_path)

    # Read data
    words = read_data("./text8")
    print('Data size', len(words))

    data, count, vocab_token_to_id, vocab_id_to_token = build_dataset(words, vocab_size)
    # save dictionary as vocabulary
    print('Most common words (+UNK)', count[:5])
    print('Sample data', data[:10], [vocab_id_to_token[i] for i in data[:10]])
    # Calculate the probability of unigrams
    # unigram_cnt = [c for w, c in count]
    count_dict = dict(count)
    unigram_cnt = [count_dict[vocab_id_to_token[i]] for i in sorted(list(vocab_token_to_id.values()))]
    data_index = 0

    dataset = Dataset(data, batch_size=batch_size, num_skips=num_skips, skip_window=skip_window)
    center, context = dataset.generate_batch()

    for i in range(8):
        print(center[i].item(), vocab_id_to_token[center[i].item()],'->', context[i].item(), vocab_id_to_token[context[i].item()])
    dataset.reset_index()

    valid_size = 16     # Random set of words to evaluate similarity on.
    valid_window = 100  # Only pick dev samples in the head of the distribution.
    valid_examples = np.random.choice(valid_window, valid_size, replace=False)

    embedding_size = embedding_size
    model = WordVec(V=vocab_size, embedding_dim=embedding_size, loss_func=model_type, counts=np.array(unigram_cnt), num_neg_samples_per_center = num_neg_samples_per_center)
    trainer = Trainer(model, checkpoint_model_path, vocab_id_to_token)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f'Device: {device}')
    trainer.train(dataset, max_num_steps, checkpoint_step, valid_examples, device, lr = lr)
    model_path = final_model_path
    create_path(model_path)
    model_filepath = os.path.join(model_path, 'word2vec_%s.model'%(model_type))
    # Use a context manager so the file handle is closed after writing
    with open(model_filepath, 'wb') as model_file:
        pickle.dump([vocab_token_to_id, model.center_embeddings.weight.detach().cpu().numpy()], model_file)

The following cell shows a demo with far fewer training steps and a much smaller embedding size, so that you can test your code quickly. Please use values closer to the function defaults in the above cell for experimentation in the following sections.

Please keep an eye on the Avg. Loss value printed as the model trains. This value should gradually decrease if you have implemented your code correctly.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log likelihood and 'neg' for negative sampling
    lr = 0.1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 5, # controls the number of negative samples per center word
    checkpoint_model_path = './demo_checkpoints', # defines path to the checkpoint of the model
    final_model_path = './final_demo_model', # location to save the final model
    skip_window = 4, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 8, # Number of samples to be drawn from a window
    batch_size = 256, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 4, # size of the embedding vectors
    checkpoint_step = 500, # Number of steps after which checkpoint is saved
    max_num_steps = 2001 # Maximum number of steps to train for
)

<b>Please train at least one NLL model and at least one NEG model using the above function. Make sure you name the files and checkpoints appropriately.</b>

In [None]:
# TODO(student): start

# train at least 1 NEG model

# train at least 1 NLL model

# TODO(student): end

## Testing Framework

<b>Analogies using word vectors</b>

You will use the word vectors you learned from both approaches in the following word analogy task.

Each question/task is in the following form.
```
Consider the following word pairs that share the same relation, R:

    pilgrim:shrine, hunter:quarry, assassin:victim, climber:peak

Among these word pairs,

(1) pig:mud
(2) politician:votes
(3) dog:bone
(4) bird:worm

Q1. Which word pair is the MOST illustrative (most similar) example of the relation R?

The word pair that has the most illustrative example of the relation R is pilgrim:shrine,
because it implies that a pilgrim visits a shrine, just as a hunter visits a quarry,
an assassin targets a victim, and a climber seeks to reach a peak.

Q2. Which word pair is the LEAST illustrative (least similar) example of the relation R?

The word pair that has the LEAST illustrative example of the relation R is bird:worm,
because while it implies that a bird eats a worm,
it is not as clear or explicit in demonstrating the same type of direct relationship as the other pairs.
```

For each question, there are example pairs of a certain relation. Your task is to find the most and least illustrative word pairs for that relation. One simple way to answer these questions is to measure the similarity of difference vectors.

Recall that vectors represent directions in space. If (a, b) and (c, d) are analogous pairs, then the transformation from a to b (i.e., some vector x that, when added to a, gives b: a + x = b) should be highly similar to the transformation from c to d (i.e., some vector y that, when added to c, gives d: c + y = d). In other words, the difference vector (b-a) should be similar to the difference vector (d-c).

This difference vector can be thought to represent the relation between the two words.
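
A toy sketch of this idea, using cosine similarity as the similarity measure (one reasonable choice, not the only one). The words and 2-dimensional vectors below are made up for illustration and do not come from a trained model.

```python
import math


def difference(pair, vectors):
    """Difference vector (b - a) for a word pair (a, b)."""
    a, b = pair
    return [vb - va for va, vb in zip(vectors[a], vectors[b])]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: the second coordinate encodes the shared relation.
toy_vectors = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "dog": [0.0, 2.0], "bone": [2.0, 2.0],
}

relation = difference(("man", "woman"), toy_vectors)
analogous = cosine(relation, difference(("king", "queen"), toy_vectors))
unrelated = cosine(relation, difference(("dog", "bone"), toy_vectors))
print(analogous, unrelated)
```

In this toy setup the analogous pair has a difference vector pointing in the same direction as the relation, while the unrelated pair's difference vector is orthogonal to it.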

<b>Please fill-in the TODO section below to implement the above mentioned task.</b>

Because the annotation data is noisy, the expected accuracy is not high. The default overall accuracy is 33.5% for NLL and 33.6% for negative sampling.
Your goal is to improve this score by 1 to 3 percentage points.


<b>Further implementation instructions:</b>

  - `In the next 2 cells`:
    You will write code in the TODO section to evaluate the relation between pairs of words -- called the [MaxDiff question](https://en.wikipedia.org/wiki/MaxDiff).
    You will generate a file with your predictions following the format of `word_analogy_sample_predictions.txt`.

  - `evaluate_word_analogy.pl`:
    This is a perl script to evaluate YOUR PREDICTIONS on development data. Use it as shown in the next cell. You do not need to submit prediction or score files related to the dev set.

  - `word_analogy_dev.txt`:
    This is some data for development.
    Each line of this file is divided into "examples" and "choices" by "||".
        [examples]||[choices]
    Within "examples" and within "choices", word pairs are delimited by commas.
      For example:  "tailor:suit","oracle:prophesy","baker:flour"

  - `word_analogy_dev_sample_predictions.txt`:
    A sample prediction file. Pay attention to the format of this file.
    Your prediction file must follow this format so that it can be scored by the evaluation script (`evaluate_word_analogy.pl`).
    Each row is in this format:
    
      `<pair1> <pair2> <pair3> <pair4> <least_illustrative_pair> <most_illustrative_pair>`

    The order of word pairs should match their original order found in `word_analogy_dev.txt`.

  - `word_analogy_dev_mturk_answers.txt`:
    These are the answers collected using Amazon Mechanical Turk for `word_analogy_dev.txt`.
    The answers in this file are treated as correct and are used to evaluate your analogy predictions (using "evaluate_word_analogy.pl").
    For your information, the answers here are a little bit noisy.

  - `word_analogy_test.txt`:
    Test data file. When you are done experimenting with your model, you will generate predictions for this test data using your best models (NLL/negative sampling).
    You will not be able to run the evaluation script on the test set.

    Make sure your submission files are named: `test_preds_nll.txt`, `test_preds_neg.txt`.
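
The file formats described above can be sketched on a single line of input. The input line below is made up by combining the example pairs from the format description with the choices from the sample question, and the chosen indices are arbitrary rather than real answers. The helper names are hypothetical, and quoting each pair in the output is an assumption that mirrors what `write_solution` does in the next cell.

```python
def parse_analogy_line(line):
    """Split one '[examples]||[choices]' line into two lists of (word1, word2) tuples."""
    examples_part, choices_part = line.strip().split("||")

    def parse(part):
        # Each item looks like "tailor:suit" including the double quotes.
        return [tuple(item.strip('"').split(":")) for item in part.split(",")]

    return parse(examples_part), parse(choices_part)


def format_prediction_row(choices, least_index, most_index):
    """Build one prediction row: the choices, then the least and most illustrative pair."""
    quoted = [f'"{first}:{second}"' for first, second in choices]
    return " ".join(quoted + [quoted[least_index], quoted[most_index]])


line = '"tailor:suit","oracle:prophesy","baker:flour"||"pig:mud","politician:votes","dog:bone","bird:worm"'
examples, choices = parse_analogy_line(line)
row = format_prediction_row(choices, least_index=3, most_index=1)
print(row)
```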


In [None]:
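# NOTE: this redefines read_data from the data-generation cell above.
# Re-run that earlier cell before calling run_training() again.
def read_data(file_path):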
    with open(file_path,'r') as f:
        data = f.readlines()

    candidate, test = [], []
    for line in data:
        a, b = line.strip().split("||")
        a = [i[1:-1].split(":") for i in a.split(",")]
        b = [i[1:-1].split(":") for i in b.split(",")]
        candidate.append(a)
        test.append(b)

    return candidate, test

def get_embeddings(examples, embeddings, dictionary):

    """
    For the word pairs in the 'examples' array, fetch embeddings and return.
    You can access your trained model via dictionary and embeddings.
    dictionary[word] will give you word_id
    and embeddings[word_id] will return the embedding for that word.

    word_id = dictionary[word]
    v1 = embeddings[word_id]

    or simply

    v1 = embeddings[dictionary[word]]
    """

    norm = np.sqrt(np.sum(np.square(embeddings),axis=1,keepdims=True))
    normalized_embeddings = embeddings/norm

    embs = []
    for line in examples:
        temp = []
        for pairs in line:
            temp.append([ normalized_embeddings[dictionary[pairs[0]]], normalized_embeddings[dictionary[pairs[1]]] ])
        embs.append(temp)

    result = np.array(embs)

    return result

def evaluate_pairs(candidate_embs, test_embs):

    """
    Write code to evaluate a relation between pairs of words.
    Find the best and worst pairs and return that.
    """

    best_pairs = []
    worst_pairs = []

    ### TODO(students): start
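    for line_candidate_embs, line_test_embs in zip(candidate_embs, test_embs):
        line_candidate_embs = np.asarray(line_candidate_embs)  # (num_examples, 2, dims)
        line_test_embs = np.asarray(line_test_embs)            # (num_choices, 2, dims)

        # Relation vector: mean of the example difference vectors (b - a)
        relation = (line_candidate_embs[:, 1] - line_candidate_embs[:, 0]).mean(axis=0)

        # Cosine similarity between every choice's difference vector (d - c) and the relation.
        # Scoring each choice separately ensures all choices are considered.
        choice_diffs = line_test_embs[:, 1] - line_test_embs[:, 0]
        norms = np.linalg.norm(choice_diffs, axis=1) * np.linalg.norm(relation)
        similarities = choice_diffs @ relation / (norms + 1e-12)

        best_pairs.append(int(np.argmax(similarities)))
        worst_pairs.append(int(np.argmin(similarities)))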
    ### TODO(students): end

    return best_pairs, worst_pairs

def write_solution(best_pairs, worst_pairs, test, path):

    """
    Write best and worst pairs to a file, that can be evaluated by evaluate_word_analogy.pl
    """

    ans = []
    for i, line in enumerate(test):
        temp = [f'"{pairs[0]}:{pairs[1]}"' for pairs in line]
        temp.append(f'"{line[worst_pairs[i]][0]}:{line[worst_pairs[i]][1]}"')
        temp.append(f'"{line[best_pairs[i]][0]}:{line[best_pairs[i]][1]}"')
        ans.append(" ".join(temp))

    with open(path, 'w') as f:
        f.write("\n".join(ans))


def run_word_analogy_eval(
    model_path = './final_model', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_demo_results.txt', # predicted results
    model_type = 'nll' # type of model being used, NLL or NEG
):

    print(f'Model file: {model_path}/word2vec_{model_type}.model')
    model_filepath = os.path.join(model_path, 'word2vec_%s.model'%(model_type))

    # Use a context manager so the file handle is closed after reading
    with open(model_filepath, 'rb') as model_file:
        dictionary, embeddings = pickle.load(model_file)

    candidate, test = read_data(input_filepath)

    candidate_embs = get_embeddings(candidate, embeddings, dictionary)
    test_embs = get_embeddings(test, embeddings, dictionary)

    best_pairs, worst_pairs = evaluate_pairs(candidate_embs, test_embs)

    out_filepath = output_filepath
    print(f'Output file: {out_filepath}')
    write_solution(best_pairs, worst_pairs, test, out_filepath)

Once the word_analogy code is complete with the TODO section, you can use the following function call to generate the results of the word analogy task. A demo result set is also provided for reference.

In [None]:
run_word_analogy_eval(
    model_path = './final_demo_model', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_demo_dev_results.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG

)

The results can finally be converted into numeric metrics using the Perl script below. A demo score result is also provided for reference.

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_demo_dev_results.txt demo_score_neg.txt

## Running Experiments with various hyper-parameters for the Neg models

You should run five experiments. Each experiment involves learning word vectors using the Negative Sampling model with a specific setting for the three hyperparameters listed below AND evaluating the resulting word vectors on the test set of the word analogy task.

Hyper parameters to try:
  - Number of Neg samples (Can vary from 1 to 5)
  - Learning Rate (Can vary from 0.1 to 10)
  - Window size (Can vary from 1 to 10)

For example one experiment can be where you train a model with number of negative samples set to 5, the learning rate set to 6.4, and window size set to 7. With this you will get a set of word vectors, which you will evaluate on the test set for the word analogy task.
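
One way to keep such experiments organised is to build the keyword arguments for `run_training` from the three hyperparameters. The sketch below is only a suggestion: the helper `make_experiment_kwargs` and the `tag` naming scheme are hypothetical, and it assumes that "window size" is passed as the `skip_window` argument.

```python
def make_experiment_kwargs(tag, num_neg_samples, learning_rate, skip_window):
    """Validate one hyperparameter setting and return kwargs for run_training."""
    if not 1 <= num_neg_samples <= 5:
        raise ValueError("number of negative samples should be between 1 and 5")
    if not 0.1 <= learning_rate <= 10:
        raise ValueError("learning rate should be between 0.1 and 10")
    if not 1 <= skip_window <= 10:
        raise ValueError("window size should be between 1 and 10")
    return {
        "model_type": "neg",
        "lr": learning_rate,
        "num_neg_samples_per_center": num_neg_samples,
        "skip_window": skip_window,
        # Separate directories per experiment so results are not overwritten.
        "checkpoint_model_path": f"./checkpoints_{tag}",
        "final_model_path": f"./final_model_{tag}",
    }


kwargs = make_experiment_kwargs("exp_example", num_neg_samples=5, learning_rate=6.4, skip_window=7)
print(kwargs)
# Inside the notebook, the experiment would then be started with:
# run_training(**kwargs)
```

All keys in the returned dictionary are existing parameters of `run_training`; the remaining parameters keep their defaults.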

For each experiment, you will record your guess for what you think will happen, what actually happened, and a guess at an explanation for why it happened. You are doing these experiments to see what happens. Your guesses for what should happen and the explanations themselves won't be graded.

<b> Please make sure that you also test your code on word_analogy_test.txt </b>

### Experiment 1

<b> What are the hyperparameters selected for this experiment and why did you select them?</b>

Hyperparameters: Number of Neg samples = 5, Learning Rate = 1, Window size = 5.

I selected these hyperparameters as a starting point since they are the default values used in the original code. I want to see how the model performs with these default values and to use it as a baseline to compare against other experiments.

<b> How do you expect the model to behave? </b>

With the chosen hyperparameters, I expect the model to exhibit a decent performance, as these settings have been commonly used and proven to work well in many scenarios. However, this does not guarantee the best possible results, and there might be room for improvement through hyperparameter tuning. Using these default values will allow us to establish a baseline from which we can refine the model by adjusting the hyperparameters based on the specific dataset and task at hand.

In [None]:
## TODO(students): start

## Train a new model with the suggested hyperparameters and run the word analogy test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>


Generated by:                                     score_maxdiff.pl
Mechanical Turk File:                             word_analogy_dev_mturk_answers.txt
Test File:                                        word_analogy_demo_results.txt
Number of MaxDiff Questions:                      914
Number of Least Illustrative Guessed Correctly:   296
Number of Least Illustrative Guessed Incorrectly: 618
Accuracy of Least Illustrative Guesses:            32.4%
Number of Most Illustrative Guessed Correctly:    307
Number of Most Illustrative Guessed Incorrectly:  607
Accuracy of Most Illustrative Guesses:             33.6%
Overall Accuracy:                                  33.0%


These hyperparameters are the defaults, so this run serves as the baseline for the remaining experiments. The outcome was lower than expected: the overall accuracy is 33.0%, with 32.4% accuracy on the least illustrative guesses and 33.6% on the most illustrative guesses. The observations therefore do not match the expectation of decent performance from the default settings.
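The percentages in the score report can be re-derived from the raw counts. The following is a minimal sketch, assuming that the overall accuracy pools the least and most illustrative guesses over twice the number of questions. This assumption reproduces the reported figures, but the scoring script itself is not shown in this notebook. The helper name `maxdiff_accuracies` is hypothetical.

```python
def maxdiff_accuracies(least_correct, most_correct, num_questions):
    """Return (least %, most %, overall %) rounded to one decimal place."""
    least = 100.0 * least_correct / num_questions
    most = 100.0 * most_correct / num_questions
    # Assumption: overall accuracy pools both guess types over 2 * num_questions.
    overall = 100.0 * (least_correct + most_correct) / (2 * num_questions)
    return round(least, 1), round(most, 1), round(overall, 1)


# Counts taken from the Experiment 1 report above.
print(maxdiff_accuracies(296, 307, 914))
```

The same helper can be applied to the counts of the later experiments to check the reported percentages.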

Some plausible causes for unexpected observations could include:

1. Insufficient training data: If the model has not been exposed to a diverse and representative dataset, it might not generalize well to new data.


2. Overfitting or underfitting: Overfitting occurs when the model learns the training data too well, including noise, while underfitting happens when the model fails to capture the underlying patterns in the data. Both situations can lead to poor performance on unseen data.


3. Suboptimal hyperparameters: The chosen hyperparameters might not be the best fit for the dataset or task. Experimenting with different hyperparameter values can help find a better configuration that results in improved performance.

### Experiment 2

<b> What are the hyperparameters selected for this experiment and why did you select them?</b>

Hyperparameters: Number of Neg samples = 5, Learning Rate = 1, Window size = 9.

I selected these alternative hyperparameters to explore how different configurations affect the performance of the Word2Vec model. Changing the hyperparameters allows us to investigate how each change impacts the model's ability to capture semantic and syntactic relationships between words. The rationale for each hyperparameter choice is as follows:

Window Size = 9: By increasing the window size, the model can capture a larger context for the target word, allowing it to learn more meaningful relationships between words and better understand their semantic meanings. A larger window size provides more context, which can be beneficial for learning, but it also increases the computational cost of training the model.

<b> How do you expect the model to behave? </b>

With the alternative hyperparameters chosen (Number of Negative Samples = 5, Learning Rate = 1, and Window Size = 9), I expect the model to behave differently compared to the default settings.

Increasing the window size to 9 allows the model to consider a broader context when learning word representations. This could potentially lead to better capture of long-range dependencies between words, and therefore, improved performance on tasks that require understanding of broader context.
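The effect of the window size on the amount of training data can be sketched as follows. This is a minimal illustration, not the notebook's batch generator: the function name `skipgram_pairs` is hypothetical, and it assumes `window_size` counts words on each side of the center word. If the notebook's window sizes of 5 and 9 denote the total span, they would correspond to 2 and 4 words per side here.

```python
def skipgram_pairs(tokens, window_size):
    """Yield (center, context) pairs within `window_size` words on each side."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


tokens = "the quick brown fox jumps over the lazy dog".split()
for window_size in (2, 4):
    pairs = list(skipgram_pairs(tokens, window_size))
    print(f"window_size={window_size}: {len(pairs)} training pairs")
```

A wider window produces more pairs per center word, which gives more context but also more work per pass over the corpus.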

In [None]:
## TODO(students): start

## Train a new model with the suggested hyperparameters and run the word analogy test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>


Result: Generated by:                                     score_maxdiff.pl
Mechanical Turk File:                             word_analogy_dev_mturk_answers.txt
Test File:                                        word_analogy_demo_dev_results.txt
Number of MaxDiff Questions:                      914
Number of Least Illustrative Guessed Correctly:   287
Number of Least Illustrative Guessed Incorrectly: 627
Accuracy of Least Illustrative Guesses:            31.4%
Number of Most Illustrative Guessed Correctly:    337
Number of Most Illustrative Guessed Incorrectly:  577
Accuracy of Most Illustrative Guesses:             36.9%
Overall Accuracy:                                  34.1%


The results show that the overall accuracy of the model with the alternative hyperparameters (Number of Negative Samples = 5, Learning Rate = 1, and Window Size = 9) is 34.1%. This is a slight improvement over the default settings of Experiment 1, where the overall accuracy was 33.0%.

The observations partially follow the expectations. The increased window size likely allowed the model to better capture long-range dependencies between words, leading to a slight improvement in the overall accuracy. However, the improvement is not substantial, which could be attributed to the high learning rate of 1: a high learning rate can lead to less stable learning and difficulty converging to an optimal solution.

The plausible cause for the observed improvement in performance is the increased window size, which allowed the model to consider a broader context when learning word representations. However, the high learning rate might have limited the extent of the improvement by causing less stable learning.

### Experiment 3

<b> What are the hyperparameters selected for this experiment and why did you select them?</b>

Hyperparameters: Number of Neg samples = 1, Learning Rate = 1, Window size = 5.

The only change from the default settings is:

Number of Negative Samples = 1: By reducing the number of negative samples, the model will now have fewer negative examples to compare against the positive example for each center word. This change might result in faster training but could also lead to a less robust representation, as the model is exposed to fewer contrasting examples.

These hyperparameters were selected to investigate the impact of reducing the number of negative samples while keeping the other parameters at their default values.

<b> How do you expect the model to behave? </b>

Since the number of negative samples has been reduced, the model may train faster due to having fewer contrasting examples to process. However, this could lead to a less robust representation and possibly lower performance, as the model has less exposure to various negative examples.
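To make the role of the negative samples concrete, here is a minimal NumPy sketch of the commonly used negative sampling loss for one (center, context) pair. It is an illustration under stated assumptions, not the notebook's implementation: the function names and the random toy vectors are hypothetical, and the notebook's loss may differ in details such as batching or how negatives are drawn.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair with k sampled negative words."""
    positive_term = log_sigmoid(context_vec @ center_vec)
    # One term per negative sample, so the cost grows linearly with k.
    negative_term = log_sigmoid(-(negative_vecs @ center_vec)).sum()
    return -(positive_term + negative_term)


rng = np.random.default_rng(0)
dim = 8
center, context = rng.normal(size=dim), rng.normal(size=dim)
for k in (1, 5):
    negatives = rng.normal(size=(k, dim))
    print(k, neg_sampling_loss(center, context, negatives))
```

With a single negative sample, each update contrasts the true context word against only one other word, which is cheaper but gives a weaker training signal per step.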

In [None]:
## TODO(students): start

## Train a new model with the suggested hyperparameters and run the word analogy test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>


Result: Generated by:                                     score_maxdiff.pl
Mechanical Turk File:                             word_analogy_dev_mturk_answers.txt
Test File:                                        word_analogy_demo_dev_results.txt
Number of MaxDiff Questions:                      914
Number of Least Illustrative Guessed Correctly:   286
Number of Least Illustrative Guessed Incorrectly: 628
Accuracy of Least Illustrative Guesses:            31.3%
Number of Most Illustrative Guessed Correctly:    341
Number of Most Illustrative Guessed Incorrectly:  573
Accuracy of Most Illustrative Guesses:             37.3%
Overall Accuracy:                                  34.3%

Based on the results obtained, the overall accuracy is 34.3%, which is slightly higher than the 33.0% of the default settings in Experiment 1, although the difference is small. The accuracy of least illustrative guesses is 31.3%, and the accuracy of most illustrative guesses is 37.3%.

The observations partially follow the expectations. As expected, the model trained faster due to the reduction in the number of negative samples, but the overall performance did not improve significantly. The plausible cause for this observation could be that the reduction in the number of negative samples had a limited impact on the model's ability to learn a more robust representation. With fewer contrasting examples, the model may not have been able to capture the full complexity of the relationships between words.

However, since the overall accuracy did not decrease substantially, this suggests that the reduction in negative samples did not have a detrimental effect on the model's performance. The model could still learn meaningful relationships between words, although the overall improvement was minimal.


### Experiment 4

<b> What are the hyperparameters selected for this experiment and why did you select them?</b>

Hyperparameters: Number of Neg samples = 1, Learning Rate = 0.1, Window size = 9.

I chose these hyperparameters to explore the effect of reducing the number of negative samples while also decreasing the learning rate and increasing the window size. The lower learning rate is expected to make the model learn more slowly, allowing for more fine-grained adjustments to the weights. The increased window size allows the model to consider a broader context while learning word relationships. By adjusting these hyperparameters, I aim to investigate the impact of these changes on the model's performance and gain insights into the model's behavior under different conditions.

<b> How do you expect the model to behave? </b>

I expect the lower number of negative samples to make the model less effective at distinguishing between positive and negative relationships, but it will also make the training process faster.

The reduced learning rate should make the model's learning process slower and more gradual. This can lead to better convergence and a more fine-grained understanding of the relationships between words, although it might also require more training iterations to reach optimal performance.

The increased window size allows the model to consider a larger context when learning word relationships. This can help the model to capture more complex associations and dependencies between words, potentially leading to a better understanding of the word relationships and improved overall performance.
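The intuition about the learning rate can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is not the word2vec loss; it only shows why a step size of 1 can overshoot while 0.1 converges gradually. The function name `gradient_descent` is hypothetical.

```python
def gradient_descent(learning_rate, start=1.0, steps=5):
    """Minimise f(x) = x**2 (gradient 2x) and return the visited points."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * 2 * x
        path.append(x)
    return path


print(gradient_descent(1.0))  # each step jumps to the other side of the minimum
print(gradient_descent(0.1))  # each step shrinks x towards the minimum at 0
```

On this toy function a learning rate of 1 never gets closer to the minimum, whereas 0.1 approaches it steadily. The behaviour on real embeddings will differ, but the trade-off between step size and stability is the same.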


In [None]:
## TODO(students): start

## Train a new model with the suggested hyperparameters and run the word analogy test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>

Result:
Generated by:                                     score_maxdiff.pl
Mechanical Turk File:                             word_analogy_dev_mturk_answers.txt
Test File:                                        word_analogy_demo_dev_results.txt
Number of MaxDiff Questions:                      914
Number of Least Illustrative Guessed Correctly:   286
Number of Least Illustrative Guessed Incorrectly: 628
Accuracy of Least Illustrative Guesses:            31.3%
Number of Most Illustrative Guessed Correctly:    342
Number of Most Illustrative Guessed Incorrectly:  572
Accuracy of Most Illustrative Guesses:             37.4%
Overall Accuracy:                                  34.4%


The results show that the overall accuracy of the model with the selected hyperparameters (Number of Negative Samples = 1, Learning Rate = 0.1, and Window Size = 9) is 34.4%. The accuracy of the least illustrative guesses is 31.3%, while the accuracy of the most illustrative guesses is 37.4%.

These results are slightly better than the default settings but not by a significant margin. This might be due to the trade-offs involved in selecting the hyperparameters. The lower number of negative samples and reduced learning rate might have helped improve the model's learning process, but the benefits could have been offset by the slower convergence and the need for more training iterations.

The increased window size might have helped the model to capture more complex associations and dependencies between words. However, it is possible that the chosen dataset does not contain enough examples of such complex relationships to significantly improve the model's performance.

Overall, the observations are in line with the expectations to some extent, but they also highlight the challenges involved in fine-tuning hyperparameters for optimal performance. Further experimentation with different combinations of hyperparameters and larger datasets may be needed to achieve more significant improvements in the model's performance.


### Experiment 5

<b> What are the hyperparameters selected for this experiment and why did you select them?</b>

Hyperparameters: Number of Neg samples = 5, Learning Rate = 0.1, Window size = 9.

The selected hyperparameters were chosen to explore the impact of different values on the model's performance. The reasoning behind each selection is as follows:

Number of Negative Samples = 5: Increasing the number of negative samples can improve the model's ability to learn meaningful relationships between words by providing more negative examples for comparison. However, it also increases the training complexity. By selecting a moderate number of negative samples, such as 5, we aim to balance the benefits of having more negative samples with the computational cost.

Learning Rate = 0.1: Lowering the learning rate from the default value helps prevent overshooting the optimal solution during the gradient descent optimization process. A lower learning rate can lead to more stable convergence and potentially better model performance. However, it may also slow down the training process, so a balance between convergence speed and stability is desired.

Window Size = 9: By increasing the window size, the model can capture a larger context for the target word, allowing it to learn more meaningful relationships between words and better understand their semantic meanings. A larger window size provides more context, which can be beneficial for learning, but it also increases the computational cost of training the model.

The selected hyperparameters aim to strike a balance between improving the model's ability to learn meaningful word relationships and managing the computational cost of training. These values can be further fine-tuned based on the model's performance to optimize the results.

<b> How do you expect the model to behave? </b>


We expect the model to learn meaningful word relationships and have a better understanding of the semantic meanings of words in its vocabulary. It may take longer to train due to the lower learning rate, but the results could be more stable and accurate. The larger window size should help capture more context, while the moderate number of negative samples should provide a good balance between performance and computational cost.

In [None]:
## TODO(students): start

## Train a new model with the suggested hyperparameters and run the word analogy test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>


Result: Generated by:                                     score_maxdiff.pl
Mechanical Turk File:                             word_analogy_dev_mturk_answers.txt
Test File:                                        word_analogy_demo_dev_results.txt
Number of MaxDiff Questions:                      914
Number of Least Illustrative Guessed Correctly:   286
Number of Least Illustrative Guessed Incorrectly: 628
Accuracy of Least Illustrative Guesses:            31.3%
Number of Most Illustrative Guessed Correctly:    342
Number of Most Illustrative Guessed Incorrectly:  572
Accuracy of Most Illustrative Guesses:             37.4%
Overall Accuracy:                                  34.4%

The overall accuracy of the model is 34.4%, a slight improvement over the default overall accuracy of 33.0% from Experiment 1. Relative to the default settings, this experiment lowered the learning rate and increased the window size while keeping the number of negative samples at 5. The result is identical to Experiment 4, which used the same learning rate and window size with a single negative sample, so the number of negative samples appears to have had little effect here.

A plausible cause for the improvement over the baseline is that the increased window size allowed the model to consider a broader context while learning word representations. Lowering the learning rate might also have helped the model converge more smoothly during training, avoiding large weight updates that could lead to unstable learning.

# Running the Negative Log-likelihood (NLL) method.

Learn word vectors using the negative log-likelihood method with the same hyperparameter settings as in your Experiment 1 above. (Note that the number of negative samples does not apply in this case.) Test the resulting vectors on the test set of the word analogy task.
<br/>


<b>What do you observe? How much is the model accuracy? How long did it take for you to train?</b>

Result:
Generated by:                                     score_maxdiff.pl
Mechanical Turk File:                             word_analogy_dev_mturk_answers.txt
Test File:                                        word_analogy_demo_dev_results.txt
Number of MaxDiff Questions:                      914
Number of Least Illustrative Guessed Correctly:   277
Number of Least Illustrative Guessed Incorrectly: 637
Accuracy of Least Illustrative Guesses:            30.3%
Number of Most Illustrative Guessed Correctly:    332
Number of Most Illustrative Guessed Incorrectly:  582
Accuracy of Most Illustrative Guesses:             36.3%
Overall Accuracy:                                  33.3%

Running the Negative Log-likelihood (NLL) method with the same hyperparameter settings as in Experiment 1 (Learning Rate = 1 and Window Size = 5) gave the following results:

- Overall accuracy: 33.3%
- Accuracy of least illustrative guesses: 30.3%
- Accuracy of most illustrative guesses: 36.3%

These results are close to the 33.0% overall accuracy of the negative sampling model in Experiment 1, so the two loss functions perform similarly on the word analogy task with these hyperparameters. The training time for this run was not recorded; it can be expected to vary depending on the hardware and the size of the dataset used.
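For reference, the NLL objective can be sketched in a few lines of NumPy. This is a minimal illustration of a full-softmax negative log-likelihood for one (center, context) pair, not the notebook's implementation; the function name `nll_loss` and the random toy matrices are hypothetical.

```python
import numpy as np


def nll_loss(center_vec, output_matrix, context_id):
    """Negative log-likelihood of the true context word under a full softmax."""
    scores = output_matrix @ center_vec  # one score per vocabulary word
    scores = scores - scores.max()       # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[context_id]


rng = np.random.default_rng(0)
vocab_size, dim = 1000, 8
loss = nll_loss(
    rng.normal(size=dim), rng.normal(size=(vocab_size, dim)), context_id=3
)
print(loss)
```

The normaliser sums over the whole vocabulary, which is why the number of negative samples does not apply to this method and why each update is more expensive than with negative sampling.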


### Conclude the results of your experiments. Include a table in the notebook showing all your results.

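| Experiment | Method | Neg samples | Learning rate | Window size | Least illustrative | Most illustrative | Overall |
|---|---|---|---|---|---|---|---|
| 1 | NEG | 5 | 1 | 5 | 32.4% | 33.6% | 33.0% |
| 2 | NEG | 5 | 1 | 9 | 31.4% | 36.9% | 34.1% |
| 3 | NEG | 1 | 1 | 5 | 31.3% | 37.3% | 34.3% |
| 4 | NEG | 1 | 0.1 | 9 | 31.3% | 37.4% | 34.4% |
| 5 | NEG | 5 | 0.1 | 9 | 31.3% | 37.4% | 34.4% |
| NLL | NLL | n/a | 1 | 5 | 30.3% | 36.3% | 33.3% |

The choice of hyperparameters had only a small effect in these experiments: overall accuracy ranged from 33.0% to 34.4% across all configurations.
All four non-default negative sampling configurations scored slightly above the baseline (34.1% to 34.4% versus 33.0%), and the NLL model (33.3%) performed close to the negative sampling baseline.
No single configuration stands out as clearly optimal, and different settings might work better for different tasks and datasets.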

It is essential to note that these experiments were conducted using specific datasets and tasks, and the results might vary when applied to other tasks or datasets. Furthermore, more exhaustive hyperparameter tuning and optimization could potentially improve model performance.

## WEAT Test

We have now seen the power of word embeddings to help learn analogies, so it is also appropriate to show the unwanted associations that the generated embeddings can learn.
In this task, we will look at how to evaluate whether the embeddings are biased.

The WEAT test provides a way to quantify the bias in word embeddings. [This paper](https://arxiv.org/pdf/1810.03611.pdf) describes the method in detail.

The basic idea is to examine the associations in word embeddings between concepts.
It measures the degree to which a model associates sets of target words (e.g., African American names, European American names, flowers, insects) with sets of attribute words (e.g., "stable", "pleasant" or "unpleasant").
The association between two given words is defined as the cosine similarity between the embedding vectors for the words.
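The measure described above can be sketched as follows. This is a minimal illustration with made-up 2-D vectors, and the function names are hypothetical. It normalises each word vector individually before taking dot products, which may not match the notebook's helper functions exactly.

```python
import numpy as np


def cosine_matrix(P, Q):
    """Pairwise cosine similarities between the rows of P and the rows of Q."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return P @ Q.T


def association(W, A, B):
    # For each row w of W: mean similarity to A minus mean similarity to B.
    return cosine_matrix(W, A).mean(axis=1) - cosine_matrix(W, B).mean(axis=1)


def effect_size(X, Y, A, B):
    """Difference of mean associations, scaled by the std over all targets."""
    x_assoc, y_assoc = association(X, A, B), association(Y, A, B)
    pooled_std = np.std(np.concatenate([x_assoc, y_assoc]))
    return (x_assoc.mean() - y_assoc.mean()) / pooled_std


# Toy embeddings, invented for illustration: X lies near A and Y lies near B.
X = np.array([[1.0, 0.1], [0.9, 0.2]])
Y = np.array([[0.1, 1.0], [0.2, 0.9]])
A = np.array([[1.0, 0.0]])
B = np.array([[0.0, 1.0]])
print(effect_size(X, Y, A, B))
```

A positive score means the X targets sit closer to attribute set A than the Y targets do; swapping X and Y flips the sign.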


The code below generates the bias scores for 5 different tasks, each with different sets of attributes (A and B) and targets (X and Y), as defined in the file given by `weat_file_path` (`weat.json` for the given data). It prints the scores and writes them to the output file path you specify.

To add a task to the WEAT test, follow the task definitions used for the other WEAT tasks to add a custom task of your own.

Add another task to the JSON file `custom_weat.json` in the following format:
```
{
  # initial tasks....
  "custom_task": {
    "A_key": "A_val",
    "B_key": "B_val",
    "X_key": "X_val",
    "Y_key": "Y_val",
    "A_val": [
      # list of words for attribute A
    ],
    "B_val": [
      # list of words for attribute B
    ],
    "X_val": [
      # list of words for target X
    ],
    "Y_val": [
      # list of words for target Y
    ]
  }
}
```

Ensure that the task name is `custom_task`; this will be verified automatically. Have a look at the other tasks for more clarity.
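Before running the bias evaluation, it can help to check that the task file follows this format. The sketch below is a hypothetical helper, not part of the assignment code, and the word lists are placeholders. Note that strict JSON allows neither `#` comments nor trailing commas, so the comment lines in the template above must be replaced by real entries.

```python
import json

REQUIRED_KEYS = ("A_key", "B_key", "X_key", "Y_key")


def validate_weat_tasks(tasks):
    """Check that `custom_task` exists and every task follows the format above."""
    if "custom_task" not in tasks:
        raise ValueError("missing required task name 'custom_task'")
    for name, task in tasks.items():
        for key in REQUIRED_KEYS:
            if key not in task:
                raise ValueError(f"{name}: missing '{key}'")
            list_name = task[key]  # e.g. "A_val"
            words = task.get(list_name)
            if not isinstance(words, list) or not words:
                raise ValueError(f"{name}: '{list_name}' must be a non-empty list")
    return True


# Placeholder word lists; replace them with your own task definition.
example = json.loads("""
{
  "custom_task": {
    "A_key": "A_val", "B_key": "B_val", "X_key": "X_val", "Y_key": "Y_val",
    "A_val": ["word_a1", "word_a2"],
    "B_val": ["word_b1", "word_b2"],
    "X_val": ["word_x1", "word_x2"],
    "Y_val": ["word_y1", "word_y2"]
  }
}
""")
print(validate_weat_tasks(example))
```

To check a real file, load it with `json.load` and pass the resulting dictionary to the same function.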

Your submission bias output files should be named `nll_bias_output.json` and `neg_bias_output.json`.

After you complete the `custom_weat.json` task, you can run the script for the given data as well as your custom data.
Your submission custom bias output files should be named `nll_custom_bias_output.json` and `neg_custom_bias_output.json`.


In [None]:
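def str2bool(v):
    """Convert a yes/no style string (or a bool) to a bool."""
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    # Reject anything else instead of silently returning None.
    raise ValueError(f'Boolean value expected, got {v!r}')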

def unit_vector(vec):
    return vec / np.linalg.norm(vec)

def cos_sim(v1, v2):

    """
    Cosine Similarity between the 2 vectors
    """

    v1_u = unit_vector(v1)
    v2_u = unit_vector(v2)
    return np.clip(np.tensordot(v1_u, v2_u, axes=(-1, -1)), -1.0, 1.0)

def weat_association(W, A, B):

    """
    Compute Weat score for given target words W, along the attributes A & B.
    """

    return np.mean(cos_sim(W, A), axis=-1) - np.mean(cos_sim(W, B), axis=-1)

def weat_score(X, Y, A, B):

    """
    Compute differential weat score across the given target words X & Y along the attributes A & B.
    """

    x_association = weat_association(X, A, B)
    y_association = weat_association(Y, A, B)

    tmp1 = np.mean(x_association, axis=-1) - np.mean(y_association, axis=-1)
    tmp2 = np.std(np.concatenate((x_association, y_association), axis=0))

    return tmp1 / tmp2

def balance_word_vectors(vec1, vec2):
    diff = len(vec1) - len(vec2)

    if diff > 0:
        vec1 = np.delete(vec1, np.random.choice(len(vec1), diff, 0), axis=0)
    else:
        vec2 = np.delete(vec2, np.random.choice(len(vec2), -diff, 0), axis=0)

    return (vec1, vec2)

def get_word_vectors(words, model, vocab_token_to_id):

    """
    Return list of word embeddings for the given words using the passed model and tokeniser
    """

    output = []

    emb_size = len(model[0])

    for word in words:
        try:
            output.append(model[vocab_token_to_id[word]])
        except KeyError:
            # Skip words that are not in the vocabulary.
            pass

    return np.array(output)

def compute_weat(weat_path, model, vocab_token_to_id):

    """
    Compute WEAT score for the task as defined in the file at `weat_path`, and generating word embeddings from the passed model and tokeniser.
    """

    with open(weat_path) as f:
        weat_dict = json.load(f)

    all_scores = {}

    for data_name, data_dict in weat_dict.items():
        # Target
        X_key = data_dict['X_key']
        Y_key = data_dict['Y_key']

        # Attributes
        A_key = data_dict['A_key']
        B_key = data_dict['B_key']

        X = get_word_vectors(data_dict[X_key], model, vocab_token_to_id)
        Y = get_word_vectors(data_dict[Y_key], model, vocab_token_to_id)
        A = get_word_vectors(data_dict[A_key], model, vocab_token_to_id)
        B = get_word_vectors(data_dict[B_key], model, vocab_token_to_id)

        # Skip the task if any target or attribute set has no words in the vocabulary.
        if min(len(X), len(Y), len(A), len(B)) == 0:
            print('Not enough matching words in dictionary')
            continue

        X, Y = balance_word_vectors(X, Y)
        A, B = balance_word_vectors(A, B)

        score = weat_score(X, Y, A, B)
        all_scores[data_name] = str(score)

    return all_scores

def dump_dict(obj, output_path):
    with open(output_path, "w") as file:
        json.dump(obj, file)

def run_bias_eval(
    weat_file_path = 'weat.json', # weat file where the tasks are defined
    out_file = 'weat_demo_results.json', # output JSON file where the output is stored
    model_path = '/content/final_demo_model/word2vec_nll.model' # Full model path (including filename) to load from
):

    # Use a context manager so the model file is closed after loading.
    with open(model_path, 'rb') as f:
        vocab_token_to_id, model = pickle.load(f)

    bias_score = compute_weat(weat_file_path, model, vocab_token_to_id)

    print("Final Bias Scores")
    print(json.dumps(bias_score, indent=4))

    dump_dict(bias_score, out_file)

In [None]:
run_bias_eval(
    weat_file_path = 'weat.json', # weat file where the tasks are defined
    out_file = 'neg_bias_output.json', # output JSON file where the output is stored
    model_path = '/content/final_demo_model/word2vec_neg.model' # Full model path (including filename) to load from
)

Please refer to `weat.json`, as shown in the code above, and create 5 new tests for your best NLL and NEG models.

### WEAT Experiment 1 (NLL Model)

<b> What tests did you create and why do you expect these biases to exist in the model?</b>


This test measures the bias in the word embeddings between renewable and nonrenewable energy sources in association with positive and negative words. It is designed to determine whether the word embeddings favor one type of energy source over the other in terms of positive and negative attributes.

The test contains the following categories:

1. Renewable energy sources (target X): solar, wind, hydro, geothermal, biomass

2. Nonrenewable energy sources (target Y): coal, oil, natural gas, nuclear, fossil fuel

3. Positive words (attribute A): good, clean, beneficial, sustainable, efficient

4. Negative words (attribute B): bad, dirty, harmful, unsustainable, inefficient


I expect these biases to exist in the model because word embeddings are learned from the text corpus they are trained on, which often contains human biases present in the written text. In the case of renewable and nonrenewable energy sources, there may be biases in the text corpus that favor renewable energy as more positive and nonrenewable energy as more negative, due to environmental concerns and public opinion.

It's essential to be aware of these biases, as they may have unintended consequences when using word embeddings in various applications, such as natural language processing tasks or recommendation systems. By measuring and understanding these biases, we can work towards developing more fair and unbiased models.


<b> How do you expect the model to behave? What is the expected score in your opinion? </b>

I expect the model to show some degree of bias in favor of renewable energy sources when associated with positive words and nonrenewable energy sources when associated with negative words. This expectation stems from the fact that the training corpus likely contains biases present in human-written text, which may reflect a preference for renewable energy due to environmental concerns and the push for sustainable development.

The WEAT score ranges from -2 to 2, with a positive score indicating a stronger association between the first target set (renewable energy) and the first attribute set (positive words), and a negative score indicating a stronger association between the first target set (renewable energy) and the second attribute set (negative words). In this case, I would expect a positive score, as renewable energy sources are generally perceived more positively than nonrenewable ones.

However, it is difficult to predict the exact score without analyzing the model and its embeddings. The magnitude of the score would depend on the strength of the association between the target and attribute sets in the word embeddings. A higher magnitude score would indicate a stronger bias, while a score close to zero would suggest a weaker or no bias in the model's embeddings.

In [None]:
## TODO(students): start

## Run your custom WEAT test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>

The observation that the custom task has a positive WEAT score of 0.2431915 aligns with the expectation that the model associates renewable energy sources more with positive words and nonrenewable energy sources more with negative words. This bias could be attributed to the fact that the training data used to generate the word embeddings likely contains a significant amount of information that positively portrays renewable energy sources due to their environmental benefits and sustainability.

On the other hand, nonrenewable energy sources are often associated with negative consequences such as pollution, climate change, and resource depletion. As a result, the model reflects these biases present in the training data, which have shaped the associations between renewable/nonrenewable energy sources and positive/negative words.

<b> Please suggest 2 possible ways to remove bias and why do you think they will work? </b>

1. Counterfactual Data Augmentation:
One way to remove bias from the model is to use counterfactual data augmentation. This method involves creating new training instances by swapping attribute words or phrases while keeping the rest of the sentence intact. For example, you can replace "solar energy" with "coal energy" in a sentence and vice versa. By doing this, you create a more balanced dataset that captures the same contextual information for both renewable and nonrenewable energy sources, thus reducing the bias in the word embeddings. This method works because it helps the model learn that both types of energy sources can appear in similar contexts, which reduces the strength of the associations between energy types and positive/negative words.

2. Post-processing Embeddings:
Another way to remove bias is to apply post-processing techniques to the word embeddings after they have been trained. One such method is the "Hard Debias" algorithm, which identifies the direction of bias in the embedding space and then projects the embeddings onto a subspace orthogonal to that bias direction. This has the effect of removing the bias while preserving other useful semantic information. This method works because it directly addresses the bias present in the embeddings by neutralizing the differences between the two groups (renewable and nonrenewable energy sources) along the bias direction, ensuring that the associations between these groups and positive/negative words are weakened.
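The projection step of the post-processing approach can be sketched as follows. This is a minimal illustration of removing the component along a single bias direction; it covers only the projection described above, not the full Hard Debias algorithm. The function name `remove_direction` and the toy vectors are hypothetical.

```python
import numpy as np


def remove_direction(vectors, bias_direction):
    """Project each row of `vectors` onto the subspace orthogonal to `bias_direction`."""
    unit = bias_direction / np.linalg.norm(bias_direction)
    # Subtract each vector's component along the bias direction.
    return vectors - np.outer(vectors @ unit, unit)


# Made-up 3-D vectors. In practice the direction would have to be estimated
# from the embeddings, e.g. as a difference of group centroids (an assumption).
embeddings = np.array([[1.0, 2.0, 0.5], [0.3, -1.0, 2.0]])
direction = np.array([1.0, 0.0, 0.0])
debiased = remove_direction(embeddings, direction)
print(debiased @ direction)  # remaining components along the bias direction
```

After the projection, the vectors have no component along the chosen direction, while their components in the orthogonal directions are unchanged.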

### WEAT Experiment 2 (NEG Model)

<b> What tests did you create and why do you expect these biases to exist in the model?</b>

The test consists of two sets of target words representing traditional work settings and remote work settings. The attribute words are divided into positive and negative categories. The test checks if there is a systematic bias in the embeddings of these words, which could be a consequence of the way the model has been trained on the data.

We expect biases to exist in the model because word embedding models, such as the word2vec model trained here, learn word associations from their training corpus (text8 in this notebook). The corpus may contain biased opinions or sentiments regarding traditional and remote work settings. Consequently, the model may learn these biases and reflect them in the word embeddings.

By performing the WEAT test, we can quantify the degree of bias present in the model, identify potential areas of improvement, and take appropriate steps to mitigate the biases.

<b> How do you expect the model to behave? What is the expected score in your opinion? </b>

In this custom WEAT test, we expect the model to show some degree of bias due to the potential influence of public opinion and media coverage in the training data. The direction and magnitude of the bias would depend on the prevailing sentiments during the time the training data was collected.

If the training data contains predominantly positive sentiments about remote work and negative sentiments about traditional work, we may expect a positive WEAT score. In this case, the model would associate remote work settings more strongly with positive attributes and traditional work settings with negative attributes. Conversely, if the training data contains more positive sentiments about traditional work and negative sentiments about remote work, we may expect a negative WEAT score.

The exact value of the expected score is difficult to predict without analyzing the data and running the test. However, a high absolute score would indicate a strong bias in the model, while a score close to zero would suggest a weaker or no bias. By examining the WEAT score, we can gain insights into the potential biases in the model and consider steps to mitigate them.
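
To make the sign and magnitude of the score concrete, here is a rough sketch of the standard WEAT effect size. The function names, the toy 2-dimensional vectors, and the use of the sample standard deviation (`ddof=1`) are assumptions for illustration; the scoring function provided in this notebook may differ in its details.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, attrs_a, attrs_b):
    """Mean cosine similarity of w to attribute set A minus that to set B."""
    sim_a = np.mean([cosine(w, a) for a in attrs_a])
    sim_b = np.mean([cosine(w, b) for b in attrs_b])
    return sim_a - sim_b

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Normalize by the spread of associations over all target words
    pooled_std = np.std(s_x + s_y, ddof=1)
    return (np.mean(s_x) - np.mean(s_y)) / pooled_std

# Toy vectors (hypothetical values, not taken from a trained model)
positive = np.array([[1.0, 0.1], [0.9, 0.2]])
negative = np.array([[-1.0, 0.1], [-0.9, 0.2]])
remote = np.array([[0.8, 0.3], [0.7, 0.1]])
traditional = np.array([[-0.6, 0.4], [-0.8, 0.2]])

score = weat_effect_size(remote, traditional, positive, negative)
swapped = weat_effect_size(traditional, remote, positive, negative)
```

In this toy setup the remote-work vectors sit closer to the positive attributes, so `score` is positive. Swapping the two target sets gives exactly the negated value, which is why the sign of the result depends on the order in which the target sets are listed.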

In [None]:
## TODO(students): start

## Run your custom WEAT test

## TODO(students): end

<b> What did you observe? Do the observations follow your expectations? Give a plausible cause.</b>

I got the following result:

`"custom_task": "0.7961551"`

The higher positive WEAT score in this case indicates an even stronger association between remote work settings and positive attributes, and between traditional work settings and negative attributes, than in the previous result. This observation still matches the expectation that the model might have captured prevailing sentiments about remote and traditional work from the training data.

A plausible cause for this stronger bias could be that the training data contains an even larger proportion of positive opinions, articles, or discussions about remote work, emphasizing its benefits such as flexibility, work-life balance, and reduced commute time. At the same time, the data might contain even more negative sentiments about traditional work settings, focusing on issues like rigid schedules, long commutes, and lack of flexibility. The stronger bias in this case may reflect the strength of these opinions in the data.

<b> Please suggest 2 possible ways to remove bias and why do you think they will work? </b>

1. Data Preprocessing: One way to remove bias from the model is to curate the training data carefully. This involves identifying and addressing any potential imbalances or biases in the data. For example, you can collect more diverse and balanced examples, including positive and negative aspects of both traditional and remote work settings. By ensuring the data is more representative and balanced, the model will learn less biased associations, thus reducing the bias in its word embeddings.

2. Post-hoc Bias Mitigation: Another approach is to apply bias mitigation techniques to the learned embeddings after the model is trained. One such method is the "Bias-Direction Debiasing" or "Neutralize and Equalize" method, which involves identifying a bias direction in the embeddings and then neutralizing and equalizing the target embeddings along this direction. This process adjusts the embeddings of the target words so that they are not influenced by the bias direction and are equally distant from the attribute word embeddings. As a result, the model's biased associations are reduced, giving a fairer representation of the target words.
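
The equalize step mentioned in the second approach can be sketched roughly as below. The function names, the toy vectors, and the choice of the first coordinate axis as the bias direction are assumptions for illustration only. The sketch works on unit-length vectors and makes a pair of words differ only along the bias direction.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def equalize(e1, e2, bias_direction):
    """Make two embeddings symmetric about the given bias direction."""
    g = unit(bias_direction)
    e1, e2 = unit(e1), unit(e2)
    mu = (e1 + e2) / 2
    mu_b = (mu @ g) * g           # part of the pair's mean along the bias axis
    nu = mu - mu_b                # shared part orthogonal to the bias axis
    scale = np.sqrt(1 - nu @ nu)  # keeps each result at unit length
    equalized = []
    for e in (e1, e2):
        e_b = (e @ g) * g
        equalized.append(nu + scale * unit(e_b - mu_b))
    return equalized

# Toy vectors (hypothetical values); the bias axis is taken to be the first coordinate
bias_direction = np.array([1.0, 0.0, 0.0])
word_a = np.array([0.6, 0.7, 0.1])
word_b = np.array([-0.2, 0.5, 0.4])
neutral = np.array([0.0, 0.3, 0.9])  # already has no component along the bias axis

eq_a, eq_b = equalize(word_a, word_b, bias_direction)
```

After equalizing, `eq_a` and `eq_b` share the same component orthogonal to the bias direction, so any vector with no component along that direction (such as `neutral` here) has the same dot product with both.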

## Conclusion

<b> Please provide an appropriate conclusion to the experiments and results you obtained </b>

In conclusion, the custom WEAT experiments show that biases exist in the trained word embeddings. These biases likely result from the training data, which may contain skewed representations of certain concepts: in our case, renewable and nonrenewable energy sources, and traditional and remote work settings. The results align with our expectations to some extent, revealing biased associations in the model.

However, it's essential to recognize that no model is perfect, and there will always be some level of bias in language models. By being aware of these biases, we can work towards mitigating them through techniques like data preprocessing and post-hoc bias mitigation. Implementing these strategies can help create more balanced, fair, and representative models that perform better across a wide range of tasks and applications.

As AI technologies continue to advance and become more integrated into our lives, it's crucial to prioritize addressing biases and promoting fairness in these systems. By doing so, we can ensure that the benefits of AI are accessible to everyone, and that the technology serves as an empowering tool for society as a whole.

## Submission Guidelines

Create a folder for your solution. It should contain the following:
  - This notebook with your solution
  - All linked files provided in the 'Files to upload in notebook' folder
  - A 'solution/' folder containing the files mentioned below

Files to be generated and submitted:
Create a new folder called `submission/` and place the following files in it:
   - `test_preds_nll.txt` - Your best NLL model predictions for `word_analogy_test.txt`
   - `test_preds_neg.txt` - Your best negative sampling model predictions for `word_analogy_test.txt`
   - `nll_bias_output.json` - Results for the WEAT task on `weat.json` using your best NLL model
   - `neg_bias_output.json` - Results for the WEAT task on `weat.json` using your best negative sampling model
   - `nll_custom_bias_output.json` - Results for the custom WEAT task on `custom_weat.json` using your best NLL model
   - `neg_custom_bias_output.json` - Results for the custom WEAT task on `custom_weat.json` using your best negative sampling model
   - `gdrive_link.txt` - Should contain a `wget`able link to a folder that contains your best models. The model files should be named `word2vec_nll.model` and `word2vec_neg.model`, and the folder should be named `538-hw1-<SBUID>-models`. Please make sure you provide the necessary permissions.
   - `<SBUID>_Report.pdf` - A PDF report as detailed below.
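
Before submitting, a quick check like the following may help catch a missing file. It is only a sketch: the helper name `missing_submission_files` and the placeholder ID `000000000` are made up for illustration, and the demo runs against a temporary folder rather than your real one.

```python
import tempfile
from pathlib import Path

# File names taken from the list above; the report name depends on your SBU ID
REQUIRED_FILES = [
    "test_preds_nll.txt",
    "test_preds_neg.txt",
    "nll_bias_output.json",
    "neg_bias_output.json",
    "nll_custom_bias_output.json",
    "neg_custom_bias_output.json",
    "gdrive_link.txt",
]

def missing_submission_files(folder, sbu_id):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    expected = REQUIRED_FILES + [f"{sbu_id}_Report.pdf"]
    return [name for name in expected if not (folder / name).is_file()]

# Demo on a temporary folder that contains only one of the expected files
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp) / "submission"
    demo_folder.mkdir()
    (demo_folder / "gdrive_link.txt").write_text("placeholder")
    missing = missing_submission_files(demo_folder, "000000000")
```

For a real check, call `missing_submission_files("submission", "<your SBU ID>")` and confirm that it returns an empty list.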


## Collaboration Guidelines

  - You can collaborate to discuss ideas and to help each other for better understanding of concepts and math.
  - You should NOT collaborate on the code level. This includes all implementation activities: design, coding, and debugging.
  - You should NOT use any code that you did not write to complete the assignment.
  - The homework will be **cross-checked**. Do not cheat at all! It is better to submit a partial solution than to copy code and get 0 for the whole homework. In previous years, students have faced harsh disciplinary action for cheating.


## Extra Notes

  - If you add any code apart from the TODOs in the codebase (note that you don't need to), please mark it by commenting in the code itself.
  An example of the same could be:
    ```
    # Adding some_global_var for XXX
    some_global_var
    ```
  - General tips when you work on tensor computations:
    - Break the whole list of operations into smaller ones.
    - Write down the shapes of the tensors.
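
As a small illustration of both tips, the sketch below splits one computation into steps and records the shape after each one. It uses NumPy arrays with made-up sizes and variable names; the same habit applies to `torch` tensors, where `.shape` can be checked in the same way.

```python
import numpy as np

# Made-up sizes: B = batch size, V = vocabulary size, D = embedding dimension
batch_size, vocab_size, embed_dim = 4, 10, 3
rng = np.random.default_rng(0)

center_ids = rng.integers(0, vocab_size, size=batch_size)  # (B,)
center_table = rng.normal(size=(vocab_size, embed_dim))     # (V, D)
context_table = rng.normal(size=(vocab_size, embed_dim))    # (V, D)

# Step 1: look up one embedding per batch item. (V, D) indexed by (B,) -> (B, D)
center_vecs = center_table[center_ids]
assert center_vecs.shape == (batch_size, embed_dim)

# Step 2: score every vocabulary word. (B, D) @ (D, V) -> (B, V)
scores = center_vecs @ context_table.T
assert scores.shape == (batch_size, vocab_size)

# Step 3: turn each row into a probability distribution. (B, V) -> (B, V)
shifted = scores - scores.max(axis=1, keepdims=True)  # subtract row max for stability
exp_scores = np.exp(shifted)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
assert probs.shape == (batch_size, vocab_size)
```

Keeping an assertion after each step means a shape mistake fails at the line that caused it instead of several operations later.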


## Credits and Disclaimer

**Credits**: This code is part of the starter package of the assignment(s) used in the NLP course at Stony Brook University.
This assignment has been designed, implemented and revamped as required by many NLP TAs to varying degrees.
In chronological order of TAship they include Heeyoung Kwon, Jun Kang, Mohaddeseh Bastan, Harsh Trivedi, Matthew Matero, Nikita Soni, Sharvil Katariya, Yash Kumar Lal, Adithya V. Ganesan, Sounak Mondal, Saqib Hasan, and Jasdeep Grover. Thanks to all of them!

**Disclaimer/License**: This code is for school assignment purposes only, and **any version of this should NOT be shared publicly on GitHub or otherwise, even after the semester ends**.
Public availability of answers devalues the usability of the assignment and the work of the several TAs who have contributed to it.
We hope you'll respect this restriction.