# Word2Vec Implementation and Experiments 

Overview:
  - generate batch for skip-gram model
  - implement two loss functions to train word embeddings
  - tune the parameters for word embeddings
  - apply best learned word embeddings to word analogy task
  - calculate bias score on your best models
  - create a new task on which you would run WEAT test


How to use this notebook:
  - This notebook is best viewed and executed in Google Colab.
  - Please upload the .ipynb version of this notebook into Google Drive.
  - Double click and select Open with Colab
  - Upload the files provided in the current working directory of the Colab notebook

## Setting up the data and needed libraries

In [None]:
# Download datafile for Linux
# !wget http://mattmahoney.net/dc/text8.zip
# !unzip text8.zip
# !rm text8.zip

In [None]:
# Download datafile for Windows
import requests
import zipfile
import os

# Define the URL of the file
url = "http://mattmahoney.net/dc/text8.zip"
# Define the file name to save
filename = "text8.zip"

# Download the file
response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)

# Extract the contents of the zip file
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall()

# Remove the zip file
os.remove(filename)

<b>Importing needed libraries and setting up random seeds</b>

In [None]:
# All import statements

import collections
import json

import numpy as np
from scipy.spatial import distance

import torch
import torch.nn as nn

import math

from tqdm import tqdm

import os
import pickle

# Setting up all the seeds for repeatable experiments
np.random.seed(1234)
torch.manual_seed(1234)

## Generating the Data


Training word vectors requires generating training instances from the raw text. The generator below produces these instances in batches: for the skip-gram model, it slides a window over the data and samples training instances from the words inside the window.

<b>For example:</b>

Suppose that we have a text: "The quick brown fox jumps over the lazy dog."
and batch_size = 8, window_size = 3

"<font color = red>[The quick brown]</font> fox jumps over the lazy dog"

The context word is 'quick' (the code below calls it the center word), and the words to predict are 'The' and 'brown'.

This generates training examples of the form (context word x, predicted word y):
<ul>
      <li>(quick    ,       The)
      <li>(quick    ,     brown)
</ul>
And then move the sliding window.

"The <font color = red>[quick brown fox]</font> jumps over the lazy dog"

In the same way, we have two more examples:
<ul>
    <li>(brown, quick)
    <li>(brown, fox)
</ul>

Moving the window again:

"The quick <font color = red>[brown fox jumps]</font> over the lazy dog"

We get,

<ul>
    <li>(fox, brown)
    <li>(fox, jumps)
</ul>

Finally we get two more instances from the moved window,

"The quick brown <font color = red>[fox jumps over]</font> the lazy dog"

<ul>
    <li>(jumps, fox)
    <li>(jumps, over)
</ul>

We now have 8 training instances, which equals the batch size,
so generation stops here and the batch is returned.
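
The walkthrough above can be sketched in a few lines of plain Python. This is a simplified, deterministic illustration rather than the notebook's `generate_batch`: it takes every neighbour of each center word in order, whereas the class further down samples neighbours at random. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, skip_window, batch_size):
    """Return up to batch_size (center, neighbour) pairs, scanning left to right."""
    pairs = []
    # Start at the first position that has a full window on both sides.
    for center in range(skip_window, len(tokens) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # a word is not paired with itself
            pairs.append((tokens[center], tokens[center + offset]))
            if len(pairs) == batch_size:
                return pairs
    return pairs


text = "The quick brown fox jumps over the lazy dog".split()
batch = skipgram_pairs(text, skip_window=1, batch_size=8)
for center, neighbour in batch:
    print(center, "->", neighbour)
```

With `skip_window=1` and `batch_size=8` this reproduces the eight pairs listed in the example, from (quick, The) to (jumps, over).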


The two functions given below can fetch the data from the file streams.

In [None]:
# Read the data into a list of strings.
def read_data(filename):
    with open(filename) as file:
        text = file.read()
        data = [token.lower() for token in text.strip().split(" ")]
    return data

def build_dataset(words, vocab_size):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocab_size - 1))
    # token_to_id dictionary, id_to_token reverse dictionary
    vocab_token_to_id = dict()
    for word, _ in count:
        vocab_token_to_id[word] = len(vocab_token_to_id)
    data = list()
    unk_count = 0
    for word in words:
        if word in vocab_token_to_id:
            index = vocab_token_to_id[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    vocab_id_to_token = dict(zip(vocab_token_to_id.values(), vocab_token_to_id.keys()))
    return data, count, vocab_token_to_id, vocab_id_to_token

<b>Variable Description</b>

data_index is the index of a word. Access a word using data[data_index].

batch_size is the number of instances in one batch.

num_skips is the number of samples drawn in a window (in the example above, it was 2).

skip_window decides how many words to consider to the left and right of a context word (so skip_window*2+1 = window_size).

batch will contain word ids for context words. Dimension is [batch_size].

labels will contain word ids for the words to predict. Dimension is [batch_size, 1] (the implementation below returns it flattened to [batch_size]).
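
As a quick sanity check on how these quantities relate, the sketch below computes the window size and the number of center words that fit in one batch. `batch_layout` is a hypothetical helper, not part of the notebook; it repeats the two assertions made by the `Dataset` constructor.

```python
def batch_layout(batch_size, num_skips, skip_window):
    """Return (window_size, center_words_per_batch) for a skip-gram batch."""
    # Each center word contributes exactly num_skips training pairs.
    assert batch_size % num_skips == 0
    # A window holds 2 * skip_window neighbours, so no more can be drawn.
    assert num_skips <= 2 * skip_window
    window_size = 2 * skip_window + 1
    centers_per_batch = batch_size // num_skips
    return window_size, centers_per_batch


# Settings from the worked example: 8 pairs, 2 per window, 1 word each side.
print(batch_layout(batch_size=8, num_skips=2, skip_window=1))    # (3, 4)
# Default arguments of the Dataset class.
print(batch_layout(batch_size=128, num_skips=8, skip_window=4))  # (9, 16)
```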


In [None]:
class Dataset:
    def __init__(self, data, batch_size=128, num_skips=8, skip_window=4):
        """
        @data_index: the index of a word. You can access a word using data[data_index]
        @batch_size: the number of instances in one batch
        @num_skips: the number of samples you want to draw in a window
                (In the example above, it was 2)
        @skip_window: decides how many words to consider left and right from a context word.
                    (So, skip_window*2+1 = window_size)
        """

        self.data_index=0
        self.data = data
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window

        self.batch_size = batch_size
        self.num_skips = num_skips
        self.skip_window = skip_window

    def reset_index(self, idx=0):
        self.data_index=idx

    def generate_batch(self):
        """
        Generate one training batch.

        Returns two LongTensors of shape [batch_size]: the word ids of the
        center words and the word ids of the sampled words to predict.
        """
        # print("Generating batchs...")
        center_word = np.ndarray(shape=(self.batch_size), dtype=np.int32)
        context_word = np.ndarray(shape=(self.batch_size), dtype=np.int32)

        # stride: for the rolling window
        stride = 1

        # print("DATA: ")
        # print("Shape: ", np_data.shape)
        if (self.data_index == 0):
            self.data_index = self.skip_window
        else:
            self.data_index+= self.skip_window+1
        # print("data_index")
        # print(self.data_index)
        # print(self.data[self.data_index])
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        for i in range(self.batch_size // self.num_skips):
          center_word_index = self.data_index
          selected_context_words = set()
          for j in range(self.num_skips):
            while True:
              context_word_index = np.random.randint(
                  max(0, self.data_index - self.skip_window),
                  min(len(self.data)-1, self.data_index + self.skip_window)+1
              )
              if context_word_index != self.data_index and context_word_index not in selected_context_words:
                    break
            center_word[i * self.num_skips + j] = self.data[center_word_index]
            context_word[i * self.num_skips + j] = self.data[context_word_index]
            selected_context_words.add(context_word_index)

          self.data_index = (self.data_index + stride) % len(self.data)
        # print("CONTEXT: ")
        # print(context_word.shape)
        # print("CENTER: ")
        # print(center_word.shape)


        return torch.LongTensor(center_word), torch.LongTensor(context_word)

## Building the Model



<b>Negative Log Likelihood (NLL): </b>
A metric used in statistics and machine learning to evaluate how well a model fits the observed data. It measures the dissimilarity between the predicted probability distribution and the actual distribution of the data. By taking the negative logarithm of the likelihood function, NLL converts the task of maximizing likelihood into minimizing a loss, making it suitable for optimization algorithms. Lower values of NLL indicate better agreement between the model's predictions and the actual data, making it a commonly used measure in tasks like classification and regression.

See [these notes](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf) for details.
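
For reference, the per-instance loss can be written out as follows. The notation is an assumption borrowed from the linked notes, since the notebook itself does not define symbols: $v_c$ is the center-word vector, $u_w$ is the context-side vector of word $w$, $o$ is the observed neighbour, and $V$ is the vocabulary.

$$
J_{\text{NLL}}(c, o) = -\log \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} = -u_o^\top v_c + \log \sum_{w \in V} \exp(u_w^\top v_c)
$$

The second form is a dot product subtracted from a log-sum-exp over the whole vocabulary. In `negative_log_likelihood_loss` these two terms are `a` and `b`, and the returned loss is the mean of `b - a` over the batch.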

Training a word2vec model with this loss and the default settings took ~50 mins on Google Colab with a GPU accelerator. It will take ~10 hrs on a 2018 MacBook Pro CPU.

<br>

<b>Negative Sampling (NEG): </b>
Negative sampling formulates a slightly different classification task and a corresponding loss.
[This paper](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) describes the method in detail.

The idea here is to build a classifier that can give high probabilities to words that are the correct target words and low probabilities to words that are incorrect target words.
As with negative log likelihood loss, here we define the classifier using a function that uses the word vectors of the context and target as free parameters.
The key difference however is that instead of using the entire vocabulary, here we sample a set of k negative words for each instance, and create an augmented instance which is a collection of the true target word and k negative words.
Now the vectors are trained to maximize the probability of this augmented instance.
For further explanation, see [these notes](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf).
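
A minimal sketch of the negative-sampling loss for a single training instance, written in plain Python so the arithmetic is visible. It assumes the form given in the paper linked above, which sums over the k negatives; the helper names (`neg_sampling_loss`, `noise_distribution`) and the toy numbers are illustrative only. The 0.75 exponent is the unigram smoothing described in the same paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_loss(center, positive, negatives):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c) for one instance."""
    loss = -math.log(sigmoid(dot(positive, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalised to sum to 1."""
    weights = [c ** power for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


center = [0.5, -0.2]
positive = [0.4, -0.1]                   # true neighbour, should score high
negatives = [[-0.3, 0.6], [0.1, 0.9]]    # sampled words, should score low

print(neg_sampling_loss(center, positive, negatives))
print(noise_distribution([100, 10, 1]))
```

The notebook's `negative_sampling` method takes `torch.mean` over both the batch and the sampled words, so its negative term is scaled by 1/k relative to the summed form sketched here.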

Training a word2vec model with this loss and the default settings took ~2h30 mins on Google Colab with a GPU accelerator.




In [None]:
# Defining the sigmoid function
sigmoid = lambda x: 1/(1 + torch.exp(-x))

class WordVec(nn.Module):
    def __init__(self, V, embedding_dim, loss_func, counts, num_neg_samples_per_center = 1):
        super(WordVec, self).__init__()
        self.center_embeddings = nn.Embedding(num_embeddings=V, embedding_dim=embedding_dim)
        self.center_embeddings.weight.data.normal_(mean=0, std=1/math.sqrt(embedding_dim))
        self.center_embeddings.weight.data[self.center_embeddings.weight.data<-1] = -1
        self.center_embeddings.weight.data[self.center_embeddings.weight.data>1] = 1

        self.context_embeddings = nn.Embedding(num_embeddings=V, embedding_dim=embedding_dim)
        self.context_embeddings.weight.data.normal_(mean=0, std=1/math.sqrt(embedding_dim))
        self.context_embeddings.weight.data[self.context_embeddings.weight.data<-1] = -1 + 1e-10
        self.context_embeddings.weight.data[self.context_embeddings.weight.data>1] = 1 - 1e-10

        self.loss_func = loss_func
        self.counts = counts

        self.num_neg_samples_per_center = num_neg_samples_per_center

    def forward(self, center_word, context_word):

        if self.loss_func == "nll":
            return self.negative_log_likelihood_loss(center_word, context_word)
        elif self.loss_func == "neg":
            return self.negative_sampling(center_word, context_word)
        else:
            raise Exception("No implementation found for %s"%(self.loss_func))

    def negative_log_likelihood_loss(self, center_word, context_word):

        # Notes (page 9): http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf
        center_word_embeddings = self.center_embeddings(center_word) # batches, dims
        context_word_embeddings = self.context_embeddings(context_word) # batches, dims

        a = torch.sum(torch.mul(center_word_embeddings, context_word_embeddings), axis=1) # batches
        # (batches, dims) @ (dims, V) = (batches, V);
        b = torch.logsumexp(center_word_embeddings @ self.context_embeddings.weight.T, dim=1) # batches
        loss = torch.mean(b - a)

        return loss

    def negative_sampling(self, center_word, context_word):

        # use this variable to control the number of negative samples for every positive sample
        # print(center_word[:20])
        # print(context_word[:20])
        num_neg_samples_per_center = self.num_neg_samples_per_center


        batch_size = center_word.shape[0]
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        # create a tensor with the negative samples
        neg_samples = []
        for i in range(batch_size):
            neg_samples.append(torch.multinomial(torch.tensor(self.counts).float().pow(0.75), self.num_neg_samples_per_center, replacement=True))
        neg_samples = torch.stack(neg_samples).to(device)  # (batch_size, num_neg_samples_per_center)

        center_word_embeddings = self.center_embeddings(center_word).to(device)  # (batch_size, embedding_dim)
        context_word_embeddings = self.context_embeddings(context_word).to(device)  # (batch_size, embedding_dim)

        neg_word_embeddings = self.context_embeddings(neg_samples).to(device)  # (batch_size, num_neg_samples_per_center, embedding_dim)

        # compute the dot product of the center and context word embeddings
        pos_scores = torch.sum(torch.mul(center_word_embeddings, context_word_embeddings), dim=1)  # (batch_size,)

        # compute the dot product of the center word embeddings and the negative samples embeddings
        neg_scores = torch.bmm(neg_word_embeddings, center_word_embeddings.unsqueeze(2)).squeeze()  # (batch_size, num_neg_samples_per_center)

        # compute the loss
        pos_loss = torch.mean(-torch.log(sigmoid(pos_scores)))
        neg_loss = torch.mean(-torch.log(sigmoid(-neg_scores)))
        loss = pos_loss + neg_loss

        return loss

    def print_closest(self, validation_words, reverse_dictionary, top_k=8):
        print('Printing closest words')
        embeddings = torch.zeros(self.center_embeddings.weight.shape).copy_(self.center_embeddings.weight)
        embeddings = embeddings.data.cpu().numpy()

        validation_ids = validation_words
        norm = np.sqrt(np.sum(np.square(embeddings),axis=1,keepdims=True))
        normalized_embeddings = embeddings/norm
        validation_embeddings = normalized_embeddings[validation_ids]
        similarity = np.matmul(validation_embeddings, normalized_embeddings.T)
        for i in range(len(validation_ids)):
            word = reverse_dictionary[validation_words[i]]
            nearest = (-similarity[i, :]).argsort()[1:top_k+1]
            print(word, [reverse_dictionary[nearest[k]] for k in range(top_k)])

## Training and Data Loading Loops

The code below uses the models and losses built above and runs the actual training process.

In [None]:
class Trainer:
    def __init__(self, model, ckpt_save_path, reverse_dictionary):
        self.model = model
        self.ckpt_save_path = ckpt_save_path
        self.reverse_dictionary = reverse_dictionary

    def training_step(self, center_word, context_word):
        loss =  self.model(center_word, context_word)
        return loss

    def train(self, dataset, max_training_steps, ckpt_steps, validation_words, device="cpu", lr = 1):

        optim = torch.optim.SGD(self.model.parameters(), lr = lr)
        self.model.to(device)
        self.model.train()
        self.losses = []

        t = tqdm(range(max_training_steps))
        for curr_step in t:
            optim.zero_grad()
            center_word, context_word = dataset.generate_batch()
            loss = self.training_step(center_word.to(device), context_word.to(device))
            loss.backward()
            optim.step()
            self.losses.append(loss.item())
            if curr_step:
                t.set_description("Avg loss: %s"%(round(sum(self.losses[-2000:])/len(self.losses[-2000:]), 3)))
            # if curr_step % 10000 == 0:
            #     self.model.print_closest(validation_words, self.reverse_dictionary)
            if curr_step%ckpt_steps == 0 and curr_step > 0:
                self.save_ckpt(curr_step)

    def save_ckpt(self, curr_step):
        torch.save(self.model, "%s/%s.pt"%(self.ckpt_save_path, str(curr_step)))

## Training Framework

The following run_training function trains a model as shown in the results below. Its parameters include many hyperparameters that can be experimented with, such as embedding size, batch size, vocabulary size, and the number of training steps.

In [None]:
def create_path(path):
    if not os.path.exists(path):
        os.mkdir(path)
        print ("Created a path: %s"%(path))

def run_training(
    model_type = 'nll', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 1, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints', # defines path to the checkpoint of the model
    final_model_path = './final_model', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
):

    checkpoint_model_path = f'{checkpoint_model_path}_{model_type}/'
    create_path(checkpoint_model_path)

    # Read data
    words = read_data("./text8")
    print('Data size', len(words))

    data, count, vocab_token_to_id, vocab_id_to_token = build_dataset(words, vocab_size)
    # save dictionary as vocabulary
    print('Most common words (+UNK)', count[:5])
    print('Sample data', data[:10], [vocab_id_to_token[i] for i in data[:10]])
    # Calculate the probability of unigrams
    # unigram_cnt = [c for w, c in count]
    count_dict = dict(count)
    unigram_cnt = [count_dict[vocab_id_to_token[i]] for i in sorted(list(vocab_token_to_id.values()))]
    data_index = 0

    dataset = Dataset(data, batch_size=batch_size, num_skips=num_skips, skip_window=skip_window)
    center, context = dataset.generate_batch()
    for i in range(8):
        print(center[i].item(), vocab_id_to_token[center[i].item()],'->', context[i].item(), vocab_id_to_token[context[i].item()])
    dataset.reset_index()

    valid_size = 16     # Random set of words to evaluate similarity on.
    valid_window = 100  # Only pick dev samples in the head of the distribution.
    valid_examples = np.random.choice(valid_window, valid_size, replace=False)

    embedding_size = embedding_size
    model = WordVec(V=vocab_size, embedding_dim=embedding_size, loss_func=model_type, counts=np.array(unigram_cnt), num_neg_samples_per_center = num_neg_samples_per_center)
    trainer = Trainer(model, checkpoint_model_path, vocab_id_to_token)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f'Device: {device}')
    trainer.train(dataset, max_num_steps, checkpoint_step, valid_examples, device, lr = lr)
    model_path = final_model_path
    create_path(model_path)
    model_filepath = os.path.join(model_path, 'word2vec_%s.model'%(model_type))
    pickle.dump([vocab_token_to_id, model.center_embeddings.weight.detach().cpu().numpy()], open(model_filepath, 'wb'))

The following cell runs a quick demo with far fewer training steps and a much smaller embedding size, to test the code.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 10, # defines the learning rate used for training the model
    num_neg_samples_per_center = 3, # controls the number of negative samples per center word
    checkpoint_model_path = './demo_checkpoints', # defines path to the checkpoint of the model
    final_model_path = './final_demo_model', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 256, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 4, # size of the embedding vectors
    checkpoint_step = 500, # Number of steps after which checkpoint is saved
    max_num_steps = 2001 # Maximum number of steps to train for
)

<b>Train models of NLL and NEG using the above function.</b>

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 1, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints', # defines path to the checkpoint of the model
    final_model_path = './final_model', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)
run_training(
    model_type = 'nll', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 1, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints', # defines path to the checkpoint of the model
    final_model_path = './final_model', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)

## Testing Framework

<b>Analogies using word vectors</b>

Use the word vectors learned from both approaches in the following word analogy task.

Each question/task is in the following form.
```
Consider the following word pairs that share the same relation, R:

    pilgrim:shrine, hunter:quarry, assassin:victim, climber:peak

Among these word pairs,

(1) pig:mud
(2) politician:votes
(3) dog:bone
(4) bird:worm

Q1. Which word pair is the MOST illustrative (similar) example of the relation R?
Q2. Which word pair is the LEAST illustrative (similar) example of the relation R?
```

For each question, there are example pairs of a certain relation. The task is to find the most and least illustrative word pair for that relation. One simple way to answer these questions is to measure the similarity of difference vectors.

Vectors represent directions in space. If (a, b) and (c, d) are analogous pairs, then the transformation from a to b (i.e., some vector x that gives b when added to a: a + x = b) should be highly similar to the transformation from c to d (i.e., some vector y that gives d when added to c: c + y = d). In other words, the difference vector (b-a) should be similar to the difference vector (d-c).

This difference vector can be thought to represent the relation between the two words.

Due to the noisy annotation data, the expected accuracy is not high. The NLL default overall accuracy is 33.5% and negative sampling default overall accuracy is 33.6%.
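
A toy sketch of this scoring idea, using hand-made 2-D vectors rather than trained embeddings. Each choice pair is scored by the sum of dot products between its difference vector and the difference vectors of the example pairs; the highest score is taken as most illustrative and the lowest as least illustrative. The function name `rank_choices` and all vectors are made up for illustration.

```python
def diff(pair):
    """Difference vector b - a for a pair of word vectors (a, b)."""
    a, b = pair
    return [y - x for x, y in zip(a, b)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def rank_choices(example_pairs, choice_pairs):
    """Return (most_illustrative_index, least_illustrative_index)."""
    example_diffs = [diff(p) for p in example_pairs]
    scores = [
        sum(dot(diff(choice), ex) for ex in example_diffs)
        for choice in choice_pairs
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return best, worst


# Example pairs whose relation is roughly "move right along the x axis".
examples = [([0.0, 0.0], [1.0, 0.1]), ([0.2, 0.5], [1.1, 0.5])]
choices = [
    ([0.0, 1.0], [0.9, 1.0]),   # same direction as the examples
    ([0.5, 0.5], [0.5, 1.0]),   # orthogonal direction
    ([1.0, 0.0], [0.0, 0.0]),   # opposite direction
]
best, worst = rank_choices(examples, choices)
print("most illustrative:", best, "least illustrative:", worst)
```

Raw dot products are sensitive to vector length, which is why the notebook unit-normalises the embeddings in `get_embeddings` before taking differences.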


<b>Further implementation explanation:</b>

  - `In the next 2 cells`:
    Evaluating relation between pairs of words -- called the [MaxDiff question](https://en.wikipedia.org/wiki/MaxDiff).
    Generate a file with the predictions following the format of `word_analogy_sample_predictions.txt`.

  - `evaluate_word_analogy.pl`:
    This is a Perl script that evaluates the predictions on the development data. Use it as shown in a later cell.

  - `word_analogy_dev.txt`:
    This is some data for development.
    Each line of this file is divided into "examples" and "choices" by "||".
        [examples]||[choices]
    "Examples" and "choices" are delimited by a comma.
      For example:  "tailor:suit","oracle:prophesy","baker:flour"

  - `word_analogy_dev_sample_predictions.txt`:
    A sample prediction file. Pay attention to the format of this file.
    Prediction files must follow this format so that the "score_maxdiff.pl" script can read them.
    Each row is in this format:

      <pair1> <pair2> <pair3> <pair4> <least_illustrative_pair> <most_illustrative_pair>

    The order of word pairs matches their original order in `word_analogy_dev.txt`.

  - `word_analogy_dev_mturk_answers.txt`:
    These are the answers collected using Amazon Mechanical Turk for `word_analogy_dev.txt`.
    They are treated as the correct answers when evaluating the analogy predictions (using "evaluate_word_analogy.pl").

  - `word_analogy_test.txt`:
    Test data file.
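
To make the two file formats concrete, here is a small sketch that parses one line in the `word_analogy_dev.txt` layout and formats one prediction row. The sample line is assembled from the pairs quoted above, the chosen indices are arbitrary, and `parse_line` and `format_prediction` are hypothetical names; the notebook's own `read_data_analogy` and `write_solution` do the real work.

```python
def parse_line(line):
    """Split '[examples]||[choices]' into two lists of (word1, word2) tuples."""
    examples, choices = line.strip().split("||")

    def to_pairs(field):
        # Each item looks like "tailor:suit", including the double quotes.
        return [tuple(item.strip('"').split(":")) for item in field.split(",")]

    return to_pairs(examples), to_pairs(choices)


def format_prediction(choices, least_idx, most_idx):
    """Build one row: all choices, then the least and most illustrative pair."""
    quoted = ['"%s:%s"' % pair for pair in choices]
    return " ".join(quoted + [quoted[least_idx], quoted[most_idx]])


line = ('"tailor:suit","oracle:prophesy","baker:flour"'
        '||"pig:mud","politician:votes","dog:bone","bird:worm"')
examples, choices = parse_line(line)
print(examples)
# Indices 0 and 3 are arbitrary placeholders, not real model predictions.
print(format_prediction(choices, least_idx=0, most_idx=3))
```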


In [None]:
def read_data_analogy(file_path): # NAME MODIFIED FOR BETTER EXPERIMENT EXPERIENCE
    with open(file_path,'r') as f:
        data = f.readlines()

    candidate, test = [], []
    for line in data:
        a, b = line.strip().split("||")
        a = [i[1:-1].split(":") for i in a.split(",")]
        b = [i[1:-1].split(":") for i in b.split(",")]
        candidate.append(a)
        test.append(b)

    return candidate, test

def get_embeddings(examples, embeddings, dictionary):

    """
    For the word pairs in the 'examples' array, fetch embeddings and return.
    You can access your trained model via dictionary and embeddings.
    dictionary[word] will give you word_id
    and embeddings[word_id] will return the embedding for that word.

    word_id = dictionary[word]
    v1 = embeddings[word_id]

    or simply

    v1 = embeddings[dictionary[word]]
    """

    norm = np.sqrt(np.sum(np.square(embeddings),axis=1,keepdims=True))
    normalized_embeddings = embeddings/norm

    embs = []
    for line in examples:
        temp = []
        for pairs in line:
            temp.append([ normalized_embeddings[dictionary[pairs[0]]], normalized_embeddings[dictionary[pairs[1]]] ])
        embs.append(temp)

    result = np.array(embs)

    return result

def evaluate_pairs(candidate_embs, test_embs):

    """
    Write code to evaluate a relation between pairs of words.
    Find the best and worst pairs and return that.
    """

    best_pairs = []
    worst_pairs = []

    #print("candidate_embs\n")
    #print(candidate_embs[:2])
    #print("test_embs\n")
    #print(test_embs[:3])
    candidate_embs = np.array(candidate_embs)
    test_embs = np.array(test_embs)
    diff_of_candidate = np.zeros((len(candidate_embs),len(candidate_embs[0]),
                                  len(candidate_embs[0][0][0])),
                                 dtype = np.float32)
    for i in range (len(candidate_embs)):
      for j in range (len(candidate_embs[0])):
        diff_of_candidate[i][j]=np.array(candidate_embs[i][j][0]-candidate_embs[i][j][1])
    diff_of_test = test_embs [:,:,0,:] - test_embs [:,:,1,:]
    #print("candidate_diff\n")
    #print(diff_of_candidate[:3])
    #print("test_diff\n")
    #print(diff_of_test[:3])
    for i, line in enumerate(diff_of_test):
      #print()
      #print("line")
      #print(line)
      line_test_pairs_score = []
      for j, pair in enumerate(diff_of_test[i]):
        #print("pair")
        #print(pair)
        pair_dot_sum = 0
        for k, candidate_pair in enumerate(diff_of_candidate[i]):
          pair_dot_sum = pair_dot_sum + np.dot(pair,candidate_pair)
          #print ("pair_dot_sum")
          #print (pair_dot_sum)
        line_test_pairs_score.append(pair_dot_sum)
      #print(line_test_pairs_score)
      # Use explicit names so the built-ins min and max are not shadowed
      worst_idx = np.argmin(line_test_pairs_score)
      best_idx = np.argmax(line_test_pairs_score)
      best_pairs.append(best_idx)
      worst_pairs.append(worst_idx)

    return best_pairs, worst_pairs


def write_solution(best_pairs, worst_pairs, test, path):

    """
    Write best and worst pairs to a file, that can be evaluated by evaluate_word_analogy.pl
    """

    ans = []
    for i, line in enumerate(test):
        temp = [f'"{pairs[0]}:{pairs[1]}"' for pairs in line]
        temp.append(f'"{line[worst_pairs[i]][0]}:{line[worst_pairs[i]][1]}"')
        temp.append(f'"{line[best_pairs[i]][0]}:{line[best_pairs[i]][1]}"')
        ans.append(" ".join(temp))

    with open(path, 'w') as f:
        f.write("\n".join(ans))


def run_word_analogy_eval(
    model_path = './final_model', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_demo_results.txt', # predicted results
    model_type = 'nll' # type of model being used, NLL or NEG
):

    print(f'Model file: {model_path}/word2vec_{model_type}.model')
    model_filepath = os.path.join(model_path, 'word2vec_%s.model'%(model_type))

    dictionary, embeddings = pickle.load(open(model_filepath, 'rb'))

    candidate, test = read_data_analogy(input_filepath) # READ DATA NAME MODIFIED

    candidate_embs = get_embeddings(candidate, embeddings, dictionary)
    test_embs = get_embeddings(test, embeddings, dictionary)

    best_pairs, worst_pairs = evaluate_pairs(candidate_embs, test_embs)

    out_filepath = output_filepath
    print(f'Output file: {out_filepath}')
    write_solution(best_pairs, worst_pairs, test, out_filepath)

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
run_word_analogy_eval(
    model_path = './final_demo_model', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_demo_dev_results.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG

)

Finally, the results can be converted into numeric metrics using the Perl script below. A demo score result is also provided for reference.

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_demo_dev_results.txt demo_score_neg.txt

In [None]:
# !perl evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_demo_dev_results.txt > demo_score_neg.txt

## Running Experiments with various hyper-parameters for the Neg models

There are five experiments, where each experiment involves learning word vectors using Negative Sampling model with a specific setting for the three hyper parameters listed below and evaluating the resulting word vectors on the test set of the word analogy task.

Hyper parameters to try:
  - Number of Neg samples (Can vary from 1 to 5)
  - Learning Rate (Can vary from 0.1 to 10)
  - Window size (Can vary from 1 to 10)
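
One way to keep the experiment settings organised is to build the keyword arguments for `run_training` from a small table. This is only a sketch: `experiment_kwargs` is a hypothetical helper, the path pattern copies the one used in the cells below, and the two rows are the settings of Experiments 1 and 2. It assumes "window size" means the full window, `2 * skip_window + 1`, which is how the experiments below appear to use it (window size 3 with `skip_window = 1`).

```python
def experiment_kwargs(index, num_neg, lr, window_size):
    """Translate one experiment's settings into run_training keyword arguments."""
    if window_size % 2 == 0 or window_size < 3:
        # Assumption: a full window has one center word plus equal sides.
        raise ValueError("window_size must be odd and at least 3")
    return {
        "model_type": "neg",
        "lr": lr,
        "num_neg_samples_per_center": num_neg,
        "skip_window": (window_size - 1) // 2,
        "checkpoint_model_path": "./checkpoints_%d" % index,
        "final_model_path": "./final_model_%d" % index,
    }


settings = [
    # (experiment index, negative samples, learning rate, window size)
    (1, 1, 1, 3),
    (2, 3, 1, 3),
]
for row in settings:
    kwargs = experiment_kwargs(*row)
    print(kwargs)
    # run_training(**kwargs)  # uncomment inside the notebook to launch training
```

Parameters not listed in the dictionary fall back to the defaults of `run_training`.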

### Experiment 1

* Number of Neg samples 1

* Learning Rate 1

* Window size 3



Expectation: more accurate than the demo, slower to train than the demo, with overall accuracy approaching 33%.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 1, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints_1', # defines path to the checkpoint of the model
    final_model_path = './final_model_1', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)


In [None]:
run_word_analogy_eval(
    model_path = './final_model_1', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_dev_results_1.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG

)

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_dev_results_1.txt score_neg_1.txt

<b> Result </b>


* Accuracy of Least Illustrative Guesses:            33.2%

* Accuracy of Most Illustrative Guesses:             31.7%

* Overall Accuracy:                                  32.4%

* Takes: 2:23:20 h

* Avg loss: 0.065: 100%|██████████| 200001/200001 [2:23:20<00:00, 23.25it/s]

Overall accuracy is 32.4%, slightly below the expected 33%, possibly because only one negative sample was used per center word.


### Experiment 2


* Number of Neg samples 3

* Learning Rate 1

* Window size 3

The previous experiment did not exceed 33% accuracy, so this experiment adds negative samples to try to improve it.


Expectation: slower than the first experiment, but more accurate.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 3, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints_2', # defines path to the checkpoint of the model
    final_model_path = './final_model_2', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)

In [None]:
run_word_analogy_eval(
    model_path = './final_model_2', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_dev_results_2.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG

)

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_dev_results_2.txt score_neg_2.txt

<b> Result </b>



* Accuracy of Least Illustrative Guesses: 34.1%

* Accuracy of Most Illustrative Guesses: 35.7%

* Overall Accuracy: 34.9%

* Takes: 2:21:55 h

* Avg loss: 0.059: 100%|██████████| 200001/200001 [2:21:55<00:00, 23.49it/s]

Accuracy improves noticeably (34.9% vs. 32.4%), as expected: more negative samples give the model more contrast between true context words and unrelated words. Training did not take longer, however, which is unexpected. It may be due to how Colab schedules GPU resources, or to other factors.
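To make the role of the number of negative samples concrete, here is a minimal NumPy sketch of the standard negative-sampling objective for a single (center, context) pair. It is not the notebook's implementation; the function names, argument names and toy vectors are assumptions made for illustration.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)


def neg_sampling_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair.

    center:    (d,)   center-word vector
    context:   (d,)   true context-word vector
    negatives: (k, d) vectors of the k sampled negative words
    """
    positive_term = log_sigmoid(context @ center)
    negative_term = log_sigmoid(-(negatives @ center)).sum()
    return -(positive_term + negative_term)


# Toy vectors only; real embeddings would come from the trained model.
rng = np.random.default_rng(0)
center = rng.normal(size=8)
context = rng.normal(size=8)
for k in (1, 3, 4):
    negatives = rng.normal(size=(k, 8))
    print(k, neg_sampling_loss(center, context, negatives))
```

Each extra negative adds one more term to the sum, so the work per step would normally grow with the number of negatives. That is why a longer run was expected here.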

### Experiment 3

* Number of Neg samples 3

* Learning Rate 0.5

* Window size 3

After the first two experiments, I lowered the learning rate to study its effect on the model, keeping the number of negative samples and the window size unchanged.



<b> Expectation </b>

I expect this experiment to take longer and to produce more accurate results than the previous one.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 0.5, # defines the learning rate used for training the model
    num_neg_samples_per_center = 3, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints_3', # defines path to the checkpoint of the model
    final_model_path = './final_model_3', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)

In [None]:
run_word_analogy_eval(
    model_path = './final_model_3', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_dev_results_3.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG

)

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_dev_results_3.txt score_neg_3.txt

<b> Result </b>


* Accuracy of Least Illustrative Guesses: 34.1%

* Accuracy of Most Illustrative Guesses: 35.7%

* Overall Accuracy: 34.9%

* Takes: 5:10:09 h

* Avg loss: 0.057: 100%|██████████| 200001/200001 [5:10:09<00:00, 10.75it/s]

Surprisingly, every accuracy figure is identical to the previous experiment, and the average losses are also very close (0.057 vs. 0.059). Because Colab's GPU allocation is opaque, I cannot tell whether the longer run time was caused by the lower learning rate. A possible explanation for the unchanged accuracy is that both runs optimize the same loss function (all settings other than the learning rate are identical) and that, after 200001 steps, both have converged to roughly the same minimum. Once that point is reached, further training keeps the result near the same value, so the learning rate has little effect on the final embeddings.
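The timing difference can be quantified from the tqdm output. The short sketch below converts the reported elapsed times to seconds and compares throughput; `hms_to_seconds` is a hypothetical helper. The learning rate only scales the size of each update, so on its own it should not change the amount of work per step, which points to the environment rather than the hyper-parameter.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string (as printed by tqdm) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


STEPS = 200001  # max_num_steps used in every run

runs = {"Experiment 2 (lr=1)": "2:21:55", "Experiment 3 (lr=0.5)": "5:10:09"}
for name, elapsed in runs.items():
    secs = hms_to_seconds(elapsed)
    print(f"{name}: {secs} s, {STEPS / secs:.2f} it/s")

slowdown = hms_to_seconds("5:10:09") / hms_to_seconds("2:21:55")
print(f"Experiment 3 took {slowdown:.2f}x as long for the same number of steps")
```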

### Experiment 4

* Number of Neg samples 3

* Learning Rate 1

* Window size 5

After the above experiments, I used a larger window size to study the effect of the window size on the model, keeping the number of negative samples and the learning rate unchanged.

Expectation: with a larger window, the model samples more context words for each center word, so it should capture the relationships between words more accurately and assign each word a better vector.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 3, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints_4', # defines path to the checkpoint of the model
    final_model_path = './final_model_4', # location to save the final model
    skip_window = 2, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 4, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)

In [None]:
run_word_analogy_eval(
    model_path = './final_model_4', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_dev_results_4.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG
)

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_dev_results_4.txt score_neg_4.txt

<b> Result </b>


* Accuracy of Least Illustrative Guesses: 34.1%

* Accuracy of Most Illustrative Guesses: 35.7%

* Overall Accuracy: 34.9%

* Takes: 2:21:55 h

* Avg loss: 0.058: 100%|██████████| 200001/200001 [2:21:55<00:00, 23.49it/s]

The accuracy results are again exactly the same as before, which was a shock. By rechecking the runs and comparing the average losses, I confirmed that I had not evaluated the same model twice, which made me wonder whether there is an error in the implementation. If the implementation is correct, the likely explanation is the same as in the previous experiment: after long training, these runs reach roughly the same minimum of the loss function, and once it is reached the word vectors change very little.
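One way to check whether identical accuracy comes from identical predictions is to compare the prediction files directly. The sketch below is a hypothetical helper that counts differing lines between two text files. It makes no assumption about the file format beyond one prediction per line, and the demo uses toy temporary files rather than the real outputs.

```python
import os
import tempfile


def count_differing_lines(path_a, path_b):
    """Return (number of differing lines, total lines compared)."""
    with open(path_a) as file_a, open(path_b) as file_b:
        lines_a = file_a.read().splitlines()
        lines_b = file_b.read().splitlines()
    if len(lines_a) != len(lines_b):
        raise ValueError("files have different numbers of lines")
    differing = sum(a != b for a, b in zip(lines_a, lines_b))
    return differing, len(lines_a)


# Toy demonstration: two temporary files stand in for prediction files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_a = os.path.join(tmp_dir, "results_a.txt")
    path_b = os.path.join(tmp_dir, "results_b.txt")
    with open(path_a, "w") as f:
        f.write("a:b c:d\ne:f g:h\n")
    with open(path_b, "w") as f:
        f.write("a:b c:d\ng:h e:f\n")
    demo_result = count_differing_lines(path_a, path_b)
    print(demo_result)
```

Applied to `word_analogy_dev_results_2.txt` and `word_analogy_dev_results_4.txt`, a count of zero would mean the two models made exactly the same predictions, while a non-zero count would mean different predictions happened to give the same score.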

### Experiment 5

* Number of Neg samples 4

* Learning Rate 1

* Window size 3

Since increasing the number of negative samples in the second experiment significantly improved accuracy, I increased the number of negative samples further to see whether accuracy would keep improving.

Expectation: a further improvement in accuracy.

In [None]:
run_training(
    model_type = 'neg', # defines which loss function is being used to train the model
                        # can take values 'nll' for negative log loss and 'neg' for negative sampling
    lr = 1, # defines the learning rate used for training the model
    num_neg_samples_per_center = 4, # controls the number of negative samples per center word
    checkpoint_model_path = './checkpoints_5', # defines path to the checkpoint of the model
    final_model_path = './final_model_5', # location to save the final model
    skip_window = 1, # size of the skip window
    vocab_size = int(1e5), # size of the vocabulary used in the experiments
    num_skips = 2, # Number of samples to be drawn from a window
    batch_size = 64, # Size of the batches in terms of number of x,y pairs used for training
    embedding_size = 128, # size of the embedding vectors
    checkpoint_step = 50000, # Number of steps after which checkpoint is saved
    max_num_steps = 200001 # Maximum number of steps to train for
)

In [None]:
run_word_analogy_eval(
    model_path = './final_model_5', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_dev_results_5.txt', # predicted results
    model_type = 'neg' # type of model being used, NLL or NEG
)

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_dev_results_5.txt score_neg_5.txt

<b> Result </b>

* Accuracy of Least Illustrative Guesses: 29.5%

* Accuracy of Most Illustrative Guesses: 33.5%

* Overall Accuracy: 31.5%

* Takes: 2:22:35 h

* Avg loss: 0.059: 100%|██████████| 200001/200001 [2:22:35<00:00, 23.38it/s]

Accuracy did not improve; it decreased to 31.5%. Too many negative samples may interfere with the model's estimate of word relationships, making many related word pairs appear less related.

## Running the Negative Log-likelihood (NLL) method

Learn word vectors using the negative log-likelihood method with the same settings of hyper parameters as in Experiment 1 above. (Note that number of negative samples does not apply in this case). Test the resulting vectors on the test set of the word analogy task.
<br/>


<b>Result</b>

* Accuracy of Least Illustrative Guesses: 26.0%
* Accuracy of Most Illustrative Guesses: 30.4%
* Overall Accuracy: 28.2%
* Time: 0:30:56 h

With the same settings, the NLL model is less accurate than the NEG model, but it trains much faster.
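For comparison with negative sampling, here is a minimal NumPy sketch of a textbook negative log-likelihood loss, normalized with a softmax over the whole vocabulary. The function and argument names are assumptions, and the notebook's own implementation may normalize differently, which would also affect its run time.

```python
import numpy as np


def nll_loss(center, output_embeddings, context_id):
    """Negative log-likelihood of the true context word under a softmax.

    center:            (d,)   center-word vector
    output_embeddings: (V, d) one output vector per vocabulary word
    context_id:        index of the true context word
    """
    scores = output_embeddings @ center        # (V,) one score per word
    log_norm = np.logaddexp.reduce(scores)     # log sum_j exp(score_j)
    return log_norm - scores[context_id]


# Toy example: with all-zero vectors every word is equally likely,
# so the loss equals log(V).
toy_loss = nll_loss(np.zeros(4), np.zeros((10, 4)), context_id=3)
print(toy_loss, np.log(10))
```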

In [None]:
run_word_analogy_eval(
    model_path = './final_model', # path to the location where the model being evaluated is stored
    input_filepath = 'word_analogy_dev.txt', # Word analogy file to evaluate on
    output_filepath = 'word_analogy_dev_results_nll.txt', # predicted results
    model_type = 'nll' # type of model being used, NLL or NEG
)

In [None]:
!chmod 777 evaluate_word_analogy.pl
!./evaluate_word_analogy.pl word_analogy_dev_mturk_answers.txt word_analogy_dev_results_nll.txt score_nll.txt

### Conclusion

| Experiment | Overall accuracy | Time (h:mm:ss) | Avg loss |
| ---------- | ---------------- | -------------- | -------- |
| NEG 1      | 32.4%            | 2:23:20        | 0.065    |
| NEG 2      | 34.9%            | 2:21:55        | 0.059    |
| NEG 3      | 34.9%            | 5:10:09        | 0.057    |
| NEG 4      | 34.9%            | 2:21:55        | 0.058    |
| NEG 5      | 31.5%            | 2:22:35        | 0.059    |
| NLL 1      | 28.2%            | 0:30:56        | 1.091    |

Overall, the experiments suggest the following. A smaller learning rate can locate the minimum of the loss function slightly more precisely (average loss 0.057 vs. 0.059), but below a certain point it has no significant effect: accuracy did not change. More negative samples improve accuracy (from 1 to 3), but too many (4) hurt it. The window size would be expected to behave similarly, but with only one experiment varying it, no effect could be observed.

## WEAT Test

This section examines unwanted associations learned by the embeddings.
The task is to evaluate whether the embeddings are biased.

The Word Embedding Association Test (WEAT) provides a way to quantify the bias in word embeddings. [This paper](https://arxiv.org/pdf/1810.03611.pdf) describes the method in detail.

The basic idea is to examine the associations in word embeddings between concepts.
It measures the degree to which a model associates sets of target words (e.g., African American names, European American names, flowers, insects) with sets of attribute words (e.g., "stable", "pleasant" or "unpleasant").
The association between two given words is defined as the cosine similarity between the embedding vectors for the words.
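In formula form, the description above can be sketched as follows. The notation is mine: $s(w, A, B)$ is the association of a single word $w$ with the two attribute sets, and the score is the difference between the mean associations of the two target sets, divided by the standard deviation of the associations pooled over both target sets.

$$
s(w, A, B) = \frac{1}{|A|}\sum_{a \in A} \cos(w, a) \;-\; \frac{1}{|B|}\sum_{b \in B} \cos(w, b)
$$

$$
\mathrm{score}(X, Y, A, B) = \frac{\frac{1}{|X|}\sum_{x \in X} s(x, A, B) \;-\; \frac{1}{|Y|}\sum_{y \in Y} s(y, A, B)}{\operatorname{std}_{w \in X \cup Y}\, s(w, A, B)}
$$

A positive score means that the words in X are, on average, more strongly associated with A (relative to B) than the words in Y are; a negative score means the reverse.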


The code below generates the bias scores for 5 different tasks, each with its own attribute sets (A and B) and target sets (X and Y), as defined in the file given by `weat_file_path` (`weat.json` for the given data). It prints the scores and writes them to the output file.


Add another task to the JSON file `custom_weat.json` in the following format:
```
{
  # initial tasks....
  "custom_task": {
    "A_key": "A_val",
    "B_key": "B_val",
    "X_key": "X_val",
    "Y_key": "Y_val",
    "A_val": [
      # list of words for attribute A
    ],
    "B_val": [
      # list of words for attribute B
    ],
    "X_val": [
      # list of words for target X
    ],
    "Y_val": [
      # list of words for target Y
    ]
  }
}
```

The task name must be `custom_task`; this is verified automatically. See the other tasks in the file for reference.
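Note that real JSON allows neither `#` comments nor trailing commas, so the template above has to be filled in accordingly. A minimal sketch of building and checking such a task in Python before adding it to the file. `validate_weat_task` is a hypothetical helper, and the word lists are placeholders, not the ones used in this notebook:

```python
import json

REQUIRED_KEYS = ("A_key", "B_key", "X_key", "Y_key")


def validate_weat_task(task):
    """Check that each *_key entry names a non-empty word list in the task."""
    for key in REQUIRED_KEYS:
        if key not in task:
            raise KeyError(f"missing '{key}'")
        words = task.get(task[key])
        if not isinstance(words, list) or not words:
            raise ValueError(f"'{task[key]}' must be a non-empty list of words")
    return True


# Illustrative task only; choose your own attribute and target words.
custom_task = {
    "A_key": "Pleasant",
    "B_key": "Unpleasant",
    "X_key": "Flowers",
    "Y_key": "Insects",
    "Pleasant": ["love", "peace", "happy"],
    "Unpleasant": ["hatred", "pain", "ugly"],
    "Flowers": ["rose", "tulip", "daisy"],
    "Insects": ["ant", "wasp", "moth"],
}

validate_weat_task(custom_task)
print(json.dumps({"custom_task": custom_task}, indent=2))
```

The lookup pattern mirrors how `compute_weat` reads a task: the value stored under `X_key` is itself the name of the entry holding the word list.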


In [None]:
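def str2bool(v):
    """Convert a yes/no style string to a bool."""
    if isinstance(v, bool):
        return v
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    # Unrecognized input: fail loudly instead of silently returning None
    raise ValueError(f'Boolean value expected, got {v!r}')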

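def unit_vector(vec):
    # Note: np.linalg.norm without `axis` returns a single norm for the whole
    # array, so a 2-D input is scaled as a whole rather than row by row.
    return vec / np.linalg.norm(vec)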

def cos_sim(v1, v2):

    """
    Cosine Similarity between the 2 vectors
    """

    v1_u = unit_vector(v1)
    v2_u = unit_vector(v2)
    return np.clip(np.tensordot(v1_u, v2_u, axes=(-1, -1)), -1.0, 1.0)

def weat_association(W, A, B):

    """
    Compute Weat score for given target words W, along the attributes A & B.
    """

    return np.mean(cos_sim(W, A), axis=-1) - np.mean(cos_sim(W, B), axis=-1)

def weat_score(X, Y, A, B):

    """
    Compute differential weat score across the given target words X & Y along the attributes A & B.
    """

    x_association = weat_association(X, A, B)
    y_association = weat_association(Y, A, B)

    tmp1 = np.mean(x_association, axis=-1) - np.mean(y_association, axis=-1)
    tmp2 = np.std(np.concatenate((x_association, y_association), axis=0))

    return tmp1 / tmp2

def balance_word_vectors(vec1, vec2):
    """
    Randomly drop rows from the larger set so that both sets have the same size.
    """
    diff = len(vec1) - len(vec2)

    if diff > 0:
        vec1 = np.delete(vec1, np.random.choice(len(vec1), diff, replace=False), axis=0)
    else:
        vec2 = np.delete(vec2, np.random.choice(len(vec2), -diff, replace=False), axis=0)

    return (vec1, vec2)

def get_word_vectors(words, model, vocab_token_to_id):

    """
    Return an array of word embeddings for the given words using the passed model and vocabulary.
    Words that are not in the vocabulary are skipped.
    """

    output = []

    for word in words:
        try:
            output.append(model[vocab_token_to_id[word]])
        except KeyError:
            # word is not in the vocabulary
            pass

    return np.array(output)

def compute_weat(weat_path, model, vocab_token_to_id):

    """
    Compute WEAT score for the task as defined in the file at `weat_path`, and generating word embeddings from the passed model and tokeniser.
    """

    with open(weat_path) as f:
        weat_dict = json.load(f)

    all_scores = {}

    for data_name, data_dict in weat_dict.items():
        # Target
        X_key = data_dict['X_key']
        Y_key = data_dict['Y_key']

        # Attributes
        A_key = data_dict['A_key']
        B_key = data_dict['B_key']

        X = get_word_vectors(data_dict[X_key], model, vocab_token_to_id)
        Y = get_word_vectors(data_dict[Y_key], model, vocab_token_to_id)
        A = get_word_vectors(data_dict[A_key], model, vocab_token_to_id)
        B = get_word_vectors(data_dict[B_key], model, vocab_token_to_id)

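        # Skip the task if any target or attribute set has no in-vocabulary words
        if min(len(X), len(Y), len(A), len(B)) == 0:
            print(f'Not enough matching words in dictionary for task {data_name}')
            continue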

        X, Y = balance_word_vectors(X, Y)
        A, B = balance_word_vectors(A, B)

        score = weat_score(X, Y, A, B)
        all_scores[data_name] = str(score)

    return all_scores

def dump_dict(obj, output_path):
    with open(output_path, "w") as file:
        json.dump(obj, file)

def run_bias_eval(
    weat_file_path = 'weat.json', # weat file where the tasks are defined
    out_file = 'weat_demo_results.json', # output JSON file where the output is stored
    model_path = '/content/final_demo_model/word2vec_nll.model' # Full model path (including filename) to load from
):

    # Use a context manager so the model file is closed after loading
    with open(model_path, 'rb') as f:
        vocab_token_to_id, model = pickle.load(f)

    bias_score = compute_weat(weat_file_path, model, vocab_token_to_id)

    print("Final Bias Scores")
    print(json.dumps(bias_score, indent=4))

    dump_dict(bias_score, out_file)

In [None]:
run_bias_eval(
    weat_file_path = 'weat.json', # weat file where the tasks are defined
    out_file = 'nll_bias_output.json', # output JSON file where the output is stored
    model_path = '/content/final_model/word2vec_nll.model' # Full model path (including filename) to load from
)

### WEAT Experiment 1 (NLL Model)

Tests Created:
* AmusementPark_Hospital_Pleasant_Unpleasant

* ItalianCuisines_MexicanCuisines_Healthy_Unhealthy

* JapeneseCar_AmericanCar_Good_Bad

* EuropeanCountries_AfricanCountries_Developed_FallBehind

* Male_Female_Careless_Careful

I created these tests because the biases they probe are common in many people's minds, so there is a good chance that they are reflected in the training corpus.

I expect the AmusementPark_Hospital_Pleasant_Unpleasant and EuropeanCountries_AfricanCountries_Developed_FallBehind tests to show the strongest associations, because these prejudices are more common and deep-rooted in daily life. I expect the other scores to be above 0, but not by much.

In [None]:
run_bias_eval(
    weat_file_path = 'custom_weat.json', # weat file where the tasks are defined
    out_file = 'weat_results_nll.json', # output JSON file where the output is stored
    model_path = '/content/final_model/word2vec_nll.model' # Full model path (including filename) to load from
)

* "AmusementPark_Hospital_Pleasant_Unpleasant": "1.3171837"
* "ItalianCuisines_MexicanCuisines_Healthy_Unhealthy": "0.42497307"
* "JapeneseCar_AmericanCar_Good_Bad": "-0.20283286"
* "EuropeanCountries_AfricanCountries_Developed_FallBehind": "0.12414764"
* "Male_Female_Careless_Careful": "1.2796223"

The AmusementPark_Hospital_Pleasant_Unpleasant result is as expected: amusement parks are usually pleasant, while hospitals are associated with illness and pain. The ItalianCuisines_MexicanCuisines_Healthy_Unhealthy result is also broadly in line with expectations. The JapeneseCar_AmericanCar_Good_Bad result is the opposite of what I expected: the negative score means the model associates American cars more strongly with "good". For EuropeanCountries_AfricanCountries_Developed_FallBehind, the bias is small. Finally, I was surprised by the bias relating gender and personality: the model associates Male more closely with Careless and Female more closely with Careful.
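The sign convention used in this interpretation can be checked on toy data. The sketch below is a compact, self-contained version of the effect size with hand-made 2-D vectors; it is not the notebook's code. In particular it normalizes each row separately, which may differ from the notebook's `unit_vector`, and all names and vectors are assumptions for illustration.

```python
import numpy as np


def cosine_matrix(U, V):
    """Pairwise cosine similarities between rows of U and rows of V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T


def toy_weat_score(X, Y, A, B):
    """Effect size: positive when X leans toward A and Y toward B."""
    def association(W):
        return cosine_matrix(W, A).mean(axis=1) - cosine_matrix(W, B).mean(axis=1)

    x_assoc, y_assoc = association(X), association(Y)
    pooled_std = np.std(np.concatenate([x_assoc, y_assoc]))
    return (x_assoc.mean() - y_assoc.mean()) / pooled_std


# Attribute A points along the first axis, attribute B along the second.
A = np.array([[1.0, 0.0], [0.9, 0.1]])
B = np.array([[0.0, 1.0], [0.1, 0.9]])
# Target X lies close to A, target Y close to B.
X = np.array([[1.0, 0.2], [0.8, 0.1]])
Y = np.array([[0.2, 1.0], [0.1, 0.8]])

print(toy_weat_score(X, Y, A, B))  # positive: X goes with A
print(toy_weat_score(Y, X, A, B))  # negative: targets swapped
```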

<b>2 possible ways to remove bias? </b>

* One way is to ensure that the data used to train the model is diverse and representative of the population. Data should be collected from different sources, covering different age groups, genders, ethnicities, and socioeconomic backgrounds. With a diverse dataset, the model can learn patterns and relationships that are not biased towards a particular group.

* Another way is to monitor some of the more common biases during training. Once a bias becomes severe, add training samples with the opposite bias to correct it.
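A minimal sketch of what such monitoring could look like, using the score format that `run_bias_eval` writes (task name mapped to a score stored as a string). `flag_biased_tasks` is a hypothetical helper, and the threshold of 1.0 is an arbitrary example value, not a recommended cut-off:

```python
def flag_biased_tasks(scores, threshold=1.0):
    """Return the task names whose absolute WEAT score exceeds `threshold`."""
    return [name for name, value in scores.items() if abs(float(value)) > threshold]


# Scores reported above for the NLL model.
nll_scores = {
    "AmusementPark_Hospital_Pleasant_Unpleasant": "1.3171837",
    "ItalianCuisines_MexicanCuisines_Healthy_Unhealthy": "0.42497307",
    "JapeneseCar_AmericanCar_Good_Bad": "-0.20283286",
    "EuropeanCountries_AfricanCountries_Developed_FallBehind": "0.12414764",
    "Male_Female_Careless_Careful": "1.2796223",
}

print(flag_biased_tasks(nll_scores))
```

In a training loop, the same check could be run on each saved checkpoint to see when a bias starts to grow.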

### WEAT Experiment 2 (NEG Model)

Tests Created:
* AmusementPark_Hospital_Pleasant_Unpleasant

* ItalianCuisines_MexicanCuisines_Healthy_Unhealthy

* JapeneseCar_AmericanCar_Good_Bad

* EuropeanCountries_AfricanCountries_Developed_FallBehind

* Male_Female_Careless_Careful

I created these tests because the biases they probe are common in many people's minds, so there is a good chance that they are reflected in the training corpus.

I expected results similar to the previous experiment, since both models were trained on the same dataset. The results of this experiment may be somewhat more accurate.

In [None]:
run_bias_eval(
    weat_file_path = 'custom_weat.json', # weat file where the tasks are defined
    out_file = 'weat_results_neg.json', # output JSON file where the output is stored
    model_path = '/content/final_model/word2vec_neg.model' # Full model path (including filename) to load from
)

* "AmusementPark_Hospital_Pleasant_Unpleasant": "1.4622178"
* "ItalianCuisines_MexicanCuisines_Healthy_Unhealthy": "0.82199323"
* "JapeneseCar_AmericanCar_Good_Bad": "-0.23306134"
* "EuropeanCountries_AfricanCountries_Developed_FallBehind": "0.060268328"
* "Male_Female_Careless_Careful": "-0.4530505"

Except for the Male_Female_Careless_Careful test, whose result differs considerably from the NLL model's, the results of the other tests are relatively similar. The difference in the Male_Female_Careless_Careful test may arise because many male-related and careless-related words were drawn as negative samples, which corrected the original bias and even pushed it in the opposite direction.
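The comparison can be made explicit with a few lines of Python. The sketch below uses the scores reported for the two models (rounded here to three decimals) and lists the tasks whose sign differs between them; the variable names are my own.

```python
# WEAT scores reported above, rounded to three decimals.
nll_scores = {
    "AmusementPark_Hospital_Pleasant_Unpleasant": 1.317,
    "ItalianCuisines_MexicanCuisines_Healthy_Unhealthy": 0.425,
    "JapeneseCar_AmericanCar_Good_Bad": -0.203,
    "EuropeanCountries_AfricanCountries_Developed_FallBehind": 0.124,
    "Male_Female_Careless_Careful": 1.280,
}
neg_scores = {
    "AmusementPark_Hospital_Pleasant_Unpleasant": 1.462,
    "ItalianCuisines_MexicanCuisines_Healthy_Unhealthy": 0.822,
    "JapeneseCar_AmericanCar_Good_Bad": -0.233,
    "EuropeanCountries_AfricanCountries_Developed_FallBehind": 0.060,
    "Male_Female_Careless_Careful": -0.453,
}

for task, nll in nll_scores.items():
    neg = neg_scores[task]
    print(f"{task}: NLL={nll:+.3f} NEG={neg:+.3f} diff={neg - nll:+.3f}")

# Tasks where the two models disagree on the direction of the bias.
sign_flips = [t for t in nll_scores if (nll_scores[t] > 0) != (neg_scores[t] > 0)]
print(sign_flips)
```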

<b>2 possible ways to remove bias? </b>

* One way is as before: ensure that the data used to train the model is diverse and representative of the population. Data should be collected from different sources, covering different age groups, genders, ethnicities, and socioeconomic backgrounds. With a diverse dataset, the model can learn patterns and relationships that are not biased towards a particular group.

* Another possible method is to deliberately include some biased samples when drawing negative samples.

## Conclusion

The experiments above lead to the following conclusions. With the same hyper-parameters, the NEG model is more accurate than the NLL model. For the NEG model, a suitable window size and number of negative samples make the model more accurate, while too many or too few hurt accuracy. A smaller learning rate can locate the minimum more precisely. Finally, many biases that are common in real life are reflected in the embeddings. Whether they are measured with the NLL or the NEG model, the bias scores for most tests are fairly close as long as the training data is the same.