# SI 630: Homework 2: Word2Vec

This homework will have you implementing word2vec using PyTorch and let you familiarize yourself with building more complex neural networks and the larger PyTorch development infrastructure.

Broadly, this homework consists of a few major parts:
1. Implement a `Corpus` class that will load the dataset and convert it to a sequence of token ids
2. Implement negative sampling to select tokens to be used as negative examples of words in the context
3. Create your dataset of positive and negative examples per context and load it into PyTorch's `DataLoader` to use for sampling
4. Implement a `Word2Vec` class that is a PyTorch neural network
5. Implement a training loop that samples a _batch_ of target words and their respective positive/negative context words
6. Implement rare word removal and frequent word subsampling
7. Run your model on the full dataset for at least one epoch
8. Do the exploratory parts of the homework
9. Make a copy of this notebook and change your implementation so it learns word vectors with less bias

After Step 5, you should be able to run your word2vec implementation on a small dataset and verify that it's learning correctly. Once you can verify everything is working, proceed with steps 6 and beyond. **Please note that this list is a general sketch and the homework PDF has the full list/description of to-dos and all your deliverables.**

In [1]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init
from tqdm.auto import tqdm, trange
from collections import Counter
import random
from torch import optim
from torch.utils.tensorboard import SummaryWriter
import json
import pickle
import pandas as pd
# Helpful for computing cosine similarity--Note that this is NOT a similarity!
from scipy.spatial.distance import cosine
# Handy command-line argument parsing
import argparse
import re

random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)



<torch._C.Generator at 0x7f52340a1970>

In [8]:
min_token_freq = 5
window_size = 2
num_negative_samples_per_target = 2
batch_size = 512
k = 100
lr = 5e-4
epochs = 1
penalty = 0.0005

## Create a class to hold the data

Before we get to training word2vec, we'll need to process the corpus into some representation. The `Corpus` class will handle much of the functionality for corpus reading and keeping track of which word types belong to which ids. The `Corpus` class will also handle the crucial functionality of generating negative samples for training (i.e., randomly-sampled words that were not in the target word's context).

Some parts of this class can be completed after you've gotten word2vec up and running, so see the notes below and the details in the homework PDF.

In [3]:
with open("words.json", "r") as f:
    words = json.loads(f.read())
    synonyms = words["synonyms"]
    stopwords = words["stopwords"]
    prefixes = words["prefix"]
    suffixes = words["suffixes"]
    phrases = words["phrases"]

def UNKing(token):
    return "<UNK>"
    if " " in token:
        return token.lower()
    if not token.isalpha():
        return "<UNK_NUM>"
    elif token.upper() == token:
        return "<UNK_CAP>"
    token = token.lower()
    pre, suf = "", ""
    for i, prefix in enumerate(prefixes):
        if isinstance(prefix, list):
            prefix = tuple(prefix)
        if token.startswith(prefix):
            pre = i
            break
    for i, suffix in enumerate(suffixes):
        if isinstance(suffix, list):
            suffix = tuple(suffix)
        if token.startswith(suffix):
            suf = i
            break
    return "<UNK_{}_{}>".format(pre, suf)

In [4]:
class Corpus:
    def __init__(self):
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.word_to_index = dict() # word to unique-id
        self.index_to_word = dict() # unique-id to word
        self.word_counts = Counter()
        self.negative_sampling_table = list()
        self.full_token_sequence_as_ids = list()
        self.training_data = list()
        
    def tokenize(self, text):
        """Tokenize the document and returns a list of the tokens"""
        text = self.tokenizer.tokenize(text) 
        return text
        i = 0
        words = list()
        while i < len(text):
            offset = 0
            phrase = phrases
            while i + offset < len(text) and text[i + offset] in phrase:
                phrase = phrase[text[i + offset]]
                offset += 1
            if "" in phrase.keys() and offset:
                words.append(" ".join(text[i:i + offset]))
                i += offset
            else:
                words.append(text[i])
                i += 1
        return words    

    def load_data(self, file_name, min_token_freq):
        """
        Reads the data from the specified file as long long sequence of text
        (ignoring line breaks) and populates the data structures of this
        word2vec object.
        """
        counter = Counter()
        self.word_counts.clear()
        with open(file_name, "r") as f:
            texts = f.read()
        all_tokens = self.tokenize(texts)
        counter.update(all_tokens)
        for i, token in enumerate(tqdm(all_tokens)):
            if counter[token] < min_token_freq:
                all_tokens[i] = UNKing(token)  
            else:
                if token.upper() != token:
                    token = token.lower()
                    if token in synonyms.keys():
                        p = list()
                        for syn in synonyms[token]:
                            p.append(1 / counter.get(syn, 1))
                        words = [token] + synonyms[token]
                        p = np.array([sum(p)] + p) / (2 * sum(p))
                        token = np.random.choice(words, p=p)
                all_tokens[i] = token
        self.word_counts.update(all_tokens)
        for i, word in enumerate(sorted(self.word_counts.keys())):
            self.word_to_index[word] = i
            self.index_to_word[i] = word
        prob = np.array([self.word_counts[self.index_to_word[i]] for i in sorted(self.index_to_word.keys())]) / len(all_tokens)
        prob = (np.power((prob / 1e-3), 0.5) + 1) * 1e-3 / prob
        self.full_token_sequence_as_ids = [self.word_to_index[word] for word in all_tokens]
        # subsampling
        self.full_token_sequence_as_ids = [index for index in self.full_token_sequence_as_ids if np.random.binomial(1, min(prob[index], 1))]
        print('Loaded all data from %s; saw %d tokens (%d unique)' \
              % (file_name, len(self.full_token_sequence_as_ids),
                 len(self.word_to_index)))
        
    def generate_negative_sampling_table(self, exp_power=0.75, table_size=1e6):
        '''
        Generates a big list data structure that we can quickly randomly index into
        in order to select a negative training example (i.e., a word that was
        *not* present in the context). 
        '''             
        print("Generating sampling table")
        prob = np.power([self.word_counts[self.index_to_word[i]] for i in sorted(self.index_to_word.keys())], exp_power)
        prob = prob / prob.sum()
        self.negative_sampling_table = np.random.choice(len(prob), int(table_size), p=prob)


    def generate_negative_samples(self, cur_context_word_id, num_samples):
        '''
        Randomly samples the specified number of negative samples from the lookup
        table and returns this list of IDs as a numpy array. As a performance
        improvement, avoid sampling a negative example that has the same ID as
        the current positive context word.
        '''

        results = list()
        context_words = set(cur_context_word_id)
        while len(results) < num_samples:
            sample = int(np.random.choice(self.negative_sampling_table, 1)[0])
            if sample not in cur_context_word_id and self.index_to_word[sample] not in stopwords:
                results.append(sample)

        return np.array(results)
    
    def generate_training_data(self):
        for i, index in enumerate(tqdm(self.full_token_sequence_as_ids)):
            pos_samples = list()
            if self.index_to_word[index].startswith("<UNK_"): # don;t use unk as target words
                continue
            for offset in range(-window_size, window_size + 1):
                if offset == 0:
                    continue
                if i + offset < len(self.full_token_sequence_as_ids) and i + offset >= 0:
                    if self.index_to_word[self.full_token_sequence_as_ids[i + offset]].lower() in stopwords:
                        continue # remove the stopwords
                    pos_samples.append(self.full_token_sequence_as_ids[i + offset])
            num_samples = 2 * window_size * (1 + num_negative_samples_per_target) - len(pos_samples)
            neg_samples = self.generate_negative_samples(pos_samples, num_samples)
            samples = np.append(pos_samples, neg_samples) if pos_samples else neg_samples
            labels = np.append(
                np.ones(len(pos_samples), dtype=np.float32), 
                np.zeros(len(neg_samples), dtype=np.float32)
            )
            self.training_data.append([index, samples, labels])

## Create the corpus

Now that we have code to turn the text into training data, let's do so. We've provided several files for you to help:

* `wiki-bios.DEBUG.txt` -- use this to debug your corpus reader
* `wiki-bios.10k.txt` -- use this to debug/verify the whole word2vec works
* `wiki-bios.med.txt` -- use this when everything works to generate your vectors for later parts
* `wiki-bios.HUGE.txt.gz` -- _do not use this_ unless (1) everything works and (2) you really want to test/explore. This file is not needed at all to do your homework.

We recommend startin to debug with the first file, as it is small and fast to load (quicker to find bugs). When debugging, we recommend setting the `min_token_freq` argument to 2 so that you can verify that part of the code is working but you still have enough word types left to test the rest.

You'll use the remaining files later, where they're described.

In the next cell, create your `Corpus`, read in the data, and generate the negative sampling table.

## Generate the training data

Once we have the corpus ready, we need to generate our training dataset. Each instance in the dataset is a target word and positive and negative examples of contexts words. Given the target word as input, we'll want to predict (or not predict) these positive and negative context words as outputs using our network. Your task here is to create a python `list` of instances. 

Your final training data should be a list of tuples in the format ([target_word_id], [word_id_1, ...], [predicted_labels]), where each item in the list is a list:
1. The first item is a list consisting only of the target word's ID.
2. The second item is a list of word ids for both context words and negative samples 
3. The third item is a list of labels to predicted for each of the word ids in the second list (i.e., `1` for context words and `0` for negative samples). 

You will feed these tuples into the PyTorch `DatasetLoader` later that will do the converstion to `Tensor` objects. You will need to make sure that all of the lists in each tuple are `np.array` instances and are not plain python lists for this `Tensor` converstion to work.

In [7]:
corpus = Corpus()
# path = 'man.txt'
# path = 'wiki-bios.10k.txt'
path = 'wiki-bios.med.txt' # 2437890
# path = 'wiki-bios.HUGE.txt'

%time corpus.load_data(path, min_token_freq) # 1h
%time corpus.generate_negative_sampling_table() # 1min
%time corpus.generate_training_data() # 10h

  0%|          | 0/23015360 [00:00<?, ?it/s]

Loaded all data from wiki-bios.med.txt; saw 17892944 tokens (95227 unique)
CPU times: user 5min 30s, sys: 32.6 s, total: 6min 3s
Wall time: 5min 20s
Generating sampling table
CPU times: user 164 ms, sys: 4.42 ms, total: 169 ms
Wall time: 168 ms


  0%|          | 0/17892944 [00:00<?, ?it/s]

CPU times: user 2h 45min 2s, sys: 41min, total: 3h 26min 2s
Wall time: 2h 38min 34s


In [14]:
corpus.training_data

[[5320,
  array([11363, 18513, 10460, 14099,  7301,  1506,  9976,  7764, 14322,
         16468, 13983, 17516]),
  array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)],
 [11363,
  array([ 5320,  8797, 16435, 12616, 21529,  8926, 20876, 18437,  2688,
          4954,   829, 12573]),
  array([1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)],
 [10372,
  array([ 5320, 11363,  8797,   636, 11637, 21656, 22778,  5359, 20125,
           664, 21927,  7014]),
  array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)],
 [8797,
  array([11363,  1267, 20439,   640, 20970, 14211, 22566,  4221, 16717,
         13525,   631, 24437]),
  array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)],
 [24816,
  array([ 8797, 22009,  2293, 10357, 18283,  1065, 18616, 12458,  2849,
          7563, 15207, 10249]),
  array([1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)],
 [11607,
  array([ 8797, 22009,   266, 11173, 10337,  8221, 19

In [None]:
%%time
with open("corpus.pkl", "wb") as f:
    f.write(pickle.dumps(corpus))

## Create the network

We'll create a new neural network as a subclass of `nn.Module` like we did in Homework 1. However, _unlike_ the network you built in Homework 1, we do not need to used linear layers to implement word2vec. Instead, we will use PyTorch's `Emedding` class, which maps an index (e.g., a word id in this case) to an embedding. 

Roughly speaking, word2vec's network makes a prediction by computing the dot product of the target word's embedding and a context word's embedding and then passing this dot product through the sigmoid function ($\sigma$) to predict the probability that the context word was actually in the context. The homework write-up has lots of details on how this works. Your `forward()` function will have to implement this computation.

In [12]:
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(Word2Vec, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.init_emb(init_range=0.5 / self.vocab_size)
        self.output_layer = nn.Sigmoid()
        
    def init_emb(self, init_range):
        self.context_embeddings = nn.Embedding(self.vocab_size, self.embedding_size)
        self.context_embeddings.weight.data.uniform_(-init_range, init_range)
        self.target_embeddings = nn.Embedding(self.vocab_size, self.embedding_size)
        self.target_embeddings.weight.data.uniform_(-init_range, init_range)
        
    def forward(self, target_word_id, context_word_ids):
        ''' 
        Predicts whether each context word was actually in the context of the target word.
        The input is a tensor with a single target word's id and a tensor containing each
        of the context words' ids (this includes both positive and negative examples).
        '''
        target_vec = self.context_embeddings(target_word_id).unsqueeze(1)
        context_vec = self.target_embeddings(context_word_ids)
        # random dropoff
        target_rand, context_rand = torch.rand(target_vec.shape), torch.rand(context_vec.shape)
        target_vec = torch.where(target_rand > 0.1, target_vec, torch.zeros(target_vec.shape))
        context_vec = torch.where(context_rand > 0.1, context_vec, torch.zeros(context_rand.shape))
        #
        return self.output_layer(torch.sum(target_vec * context_vec, dim=-1)) 
    
    def save(self, path: str):
        torch.save(self.state_dict(), path)

## Train the network!

Now that you have data in the right format and a neural network designed, it's time to train the network and see if it's all working. The trainin code will look surprisingly similar at times to your pytorch code from Homework 1 since all networks share the same base training setup. However, we'll add a few new elements to get you familiar with more common training techniques. 

For all steps, be sure to use the hyperparameters values described in the write-up.

1. Initialize your optimizer and loss function 
2. Create your network
3. Load your dataset into PyTorch's `DataLoader` class, which will take care of batching and shuffling for us (yay!)
4. Create a new `SummaryWriter` to periodically write our running-sum of the loss to a tensorboard
5. Train your model 

Two new elements show up. First, we'll be using `DataLoader` which is going to sample data for us and put it in a batch (and also convert the data to `Tensor` objects. You can iterate over the batches and each iteration will return all the items eventually, one batch at a time (a full epoch's worth).

The second new part is using `tensorboard`. As you might have noticed in Homework 1, training neural models can take some time. [TensorBoard](https://www.tensorflow.org/tensorboard/) is a handy web-based view that you can check during training to see how the model is doing. We'll use it here and periodically log a running sum of the loss after a set number of steps. The Homework write up has a plot of what this looks like. We'll be doing something simple here with tensorboard but it will come in handy later as you train larger models (for longer) and may want to visually check if your model is converging. TensorBoard was initially written for another deep learning framework, TensorFlow, but proved so useful it was ported to work in PyTorch too and is [easy to integrate](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html).

To start training, we recommend training on the `wiki-bios.10k.txt` dataset. This data is small enough you can get through an epoch in a few minutes (or less) while still being large enough you can test whether the model is learning anything by examining common words. Below this cell we've added a few helper functions that you can use to debug and query your model. In particular, the `get_neighbors()` function is a great way to test: if your model has learned anything, the nearest neighbors for common words should seem reasonable (without having to jump through mental hoops). An easy word to test on the `10k` data is "january" which should return month-related words as being most similar.

**NOTE**: Since we're training biographies, the text itself will be skewed towards words likely to show up biographices--which isn't necessary like "regular" text. You may find that your model has few instances of words you think are common, or that the model learns poor or unusual neighbors for these. When querying the neighbors, it can help to think of which words you think are likely to show up in biographies on Wikipedia and use those as probes to see what the model has learned.

Once you're convinced the model is learning, switch to the `med` data and train your model as specified in the PDF. Once trained, save your model using the `save()` function at the end of the notebook. This function records your data in a common format for word2vec vectors and lets you load the vectors into other libraries that have more advanced functionality. In particular, you can use the [gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html) code in other notebook included to explore the vectors and do simple vector analogies.

In [13]:
class Evaulator(object):
    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus
    
    def get_vec(self, word):
        if word.upper() != word:
            word = word.lower()
        if word not in self.corpus.word_to_index.keys():
            word = UNKing(word)
        return self.model.target_embeddings(torch.LongTensor([self.corpus.word_to_index[word]])).detach().numpy()
    
    @staticmethod
    def compute_cosine(vec1, vec2):
        return 1 - abs(float(cosine(vec1, vec2)))
    
    def compute_cosine_similarity(self, word_one, word_two):
        """Computes the cosine similarity between the two words"""
        try:
            return self.compute_cosine(self.get_vec(word_one),self.get_vec(word_two))
        except Exception as e:
            print(e)
            return 0
    
    def get_neighbors(self, target_word):
        """ Finds the top 10 most similar words to a target word"""
        outputs = []
        for word, index in tqdm(self.corpus.word_to_index.items(), total=len(self.corpus.word_to_index)):
            similarity = self.compute_cosine_similarity(target_word, word)
            result = {"word": word, "score": similarity}
            outputs.append(result)

        # Sort by highest scores
        neighbors = sorted(outputs, key=lambda o: o['score'], reverse=True)
        return neighbors[1:10]
    
    def measure_bias(self):
        error = 0
        for man, woman in words["gender"]:
            vec_m, vec_w = eva.get_vec(man), eva.get_vec(woman)
            for equal_job in words["equal_job"]:
                vec = eva.get_vec(equal_job)
                error += abs(eva.compute_cosine(vec_m, vec) - eva.compute_cosine(vec_w, vec))
            for man_job, woman_job in words["jobs"]:
                vec_man_job = vec_m - eva.get_vec(man_job)
                vec_woman_job = vec_w - eva.get_vec(woman_job)
                error += np.sum((vec_man_job - vec_woman_job) ** 2)
        return error

In [14]:
%%time
model = Word2Vec(len(corpus.index_to_word), k)
eva = Evaulator(model, corpus)
eva.measure_bias()

CPU times: user 935 ms, sys: 0 ns, total: 935 ms
Wall time: 954 ms


394.93403935794976

In [None]:
model = Word2Vec(len(corpus.index_to_word), k)
dataloader = DataLoader(corpus.training_data, batch_size=512, shuffle=True)
criterion = torch.nn.BCELoss()
optimizer = optim.AdamW(model.parameters(), lr=lr)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda = lambda step: max(0.995 ** step, 5e-3))
writer = SummaryWriter("Logs") # output the result to tensor board
eva = Evaulator(model, corpus)
loss_sum, bias = 0, 0
for epoch in trange(epochs):
    for step, data in enumerate(tqdm(dataloader)):
        if not step % 100:
            bias = penalty * eva.measure_bias()
        optimizer.zero_grad()
        target_ids, context_ids, labels = data 
        output = model(target_ids, context_ids)
        loss = criterion(torch.squeeze(output), torch.squeeze(labels)) + bias
        loss.backward()
        loss_sum += float(loss.detach().numpy())
        optimizer.step()
        if not step % 100: # plot the loss sum every 100 step
            writer.add_scalar('Loss/train', loss_sum, step)
            writer.add_scalar('Learning Rate/train', optimizer.param_groups[0]["lr"], step)
            loss_sum = 0
            scheduler.step()                  
writer.close()
model.eval()

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/34948 [00:00<?, ?it/s]

## Verify things are working

Once you have an initial model trained, try using the following code to query the model for what are the nearest neighbor of a word. This code is intended to help you debug

In [18]:
words = {"january", "girl", "language", "medicine", "president", "vehicle", "walk", "carefully", "various", "calculator"}
res = dict()
for word in words:
    res[word] = [x["word"] for x in eva.get_neighbors(word)]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

  0%|          | 0/95227 [00:00<?, ?it/s]

In [20]:
pd.DataFrame(res).to_csv("Neighbours.csv")

In [22]:
vec_queen = eva.get_vec("king") - eva.get_vec("man") + eva.get_vec("woman")
eva.compute_cosine(vec_queen, eva.get_vec("queen"))

0.8872133493423462

In [32]:
vec_artist = eva.get_vec("scientist") - eva.get_vec("science") + eva.get_vec("art")
eva.compute_cosine(vec_artist, eva.get_vec("artist"))

0.7452678680419922

In [33]:
vec_walk = eva.get_vec("run") - eva.get_vec("fast") + eva.get_vec("slow")
eva.compute_cosine(vec_walk, eva.get_vec("walk"))

0.8028973340988159

It looks like the anology also exists for professions, such as scientist - science + art, can lead to an artist. The similar thing also happen with verb, such as we can remove the fast from run, and get walk.

# Save your model!

Once you have a fully trained model, save it using the code below. Note that we only save the `target_embeddings` from the model, but you could modify the code if you want to save the context vectors--or even try doing fancier things like saving the concatenation of the two or the average of the two!

In [27]:
%%time
torch.save(model.state_dict(), "Models/0215/model.pt")

CPU times: user 365 ms, sys: 29.8 ms, total: 395 ms
Wall time: 768 ms


In [None]:
def save(model, corpus, filename):
    '''
    Saves the model to the specified filename as a gensim KeyedVectors in the
    text format so you can load it separately.
    '''

    # Creates an empty KeyedVectors with our embedding size
    kv = KeyedVectors(vector_size=model.embedding_size)        
    vectors = []
    words = []
    # Get the list of words/vectors in a consistent order
    for index in trange(model.target_embeddings.num_embeddings):
        word = corpus.index_to_word[index]
        vectors.append(model.target_embeddings(torch.LongTensor([index])).detach().numpy()[0])
        words.append(word)

    # Fills the KV object with our data in the right order
    kv.add_vectors(words, vectors) 
    kv.save_word2vec_format(filename, binary=False)


# FINAL PART: DO THIS LAST AND READ CAREFULLY

Before you start this part, you need to have a fully working solution and completed the exploratory part of the assignment.

**Once you are ready, create a copy of your working notebook and call it `Debiased Word2Vec.ipynb`. Do not do this part in your working code for the assignment!!!**

## Seriously, save your code in a new file and then start reading the rest of these instructions there.

Ok, hopefully you're reading these in a new file... For this last part of the assignment, we're going to _change_ how word2vec learns at a fundamental level. 

As you might have noticed in your exploratory analysis, the word2vec model learns to weird and sometimes biased associations between words. In particular, your word2vec model has likely learned some unfortunate gender biases, e.g., that the vector for "nurse" is closer to "woman" than "man". The algorithm itself isn't to blame since it is learning these from a corpus (here, Wikipedia biographies) that contain biases already based on how people write. Wikipedia [is](http://markusstrohmaier.info/documents/2015_icwsm2015_wikipedia_gender.pdf) [well](http://dcs.gla.ac.uk/~mounia/Papers/wiki_gender_bias.pdf) [known](https://www.academia.edu/download/64856696/genderanalysisofWikipediabiostext_self_archived.pdf) for having gender biases in how it writes about men and women.


**Side note**: Some times this bias-learning behavior is useful: We can use word2vec to uncover these biases and analyze their trends, like this PNAS paper did for [looking at bias in news writing along multiple dimensions of identity](https://www.pnas.org/content/pnas/115/16/E3635.full.pdf)

In this last part of the homework, we'll ask how we might try to _prevent_ these biases by modifying the training. You won't need to solve this problem by any means, but the act of trying to reduce the biases will open up a whole new toolbox for how you (the experimenter/practioner) can change how and what models learn.

There are many potential ways to _debias_ word embeddings so that their representations are not skewed along one "latent dimension" like gender. In this homework, you'll be trying one of a few different ideas for how to do it. **You are not expected to solve gender bias! This part of the assignment is to have to start grappling with a hard challenge but there is no penalty for doing less-well!** 

One common technique to have models avoid learning bias is similar to another one you already&mdash;**regularization**. In Logistic Regression, we could use L2 regularization to have our model avoid learning $\beta$ weights that are overfit to specific or low-frequency features by adding a regularizer penalty where the larger the weight, the more penalty the model paid. Recall that this forces the model to only pick the most useful (generalizable) weights, since it has to pay a penalty for any non-zero weight. 

In word2vec, we can adapt the idea to think about whether our model's embeddings are closer or farther to different gender dimensions. For example, if we consider the embedding for "president", ideally, we'd want it to be equally similar to the embeddings for "man" and "woman". One idea then is to penalize the model based on how uneven the similarity is. We can do this by directly modifying the loss:
```
loss = loss_criteron(preds, actual_vals) + some_bias_measuring_function(model)
```
Here, the `some_bias_measuring_function` function takes in your model as input and returns how much bias you found. Continuing our example, we might implement it in pseudocode as
```
def some_bias_measuring_function(model):
    pres_woman_sim = cosine_similarity(model, "president", "woman")
    pres_man_sim = cosine_similarity(model, "president", "man")
    return abs(pres_woman_sim - pres_man_sim)
```
This simple example would penalize the model for learning a representation of "president" that is more simular to one of the two words. Of course, this example is overly simple. Why just "president"? Why just "man" and "woman"? Why not other words or other gender-related words or other gender identities?? 

Another idea might be to just make the vectors for "man" and "woman" be as similar as possible:
```
def some_bias_measuring_function(model):
    # cosine similarity is in [-1,1] but we mostly expect it in [0,1]
    man_woman_sim = cosine_similarity(model, "man", "woman")
    # penalize vectors that are not maximally similar, and avoid the edge case 
    # of negative cosine similarity
    return 1 - max(man_woman_sim, 0)
```

All of this works in practice because PyTorch is fantastic about tracking the gradient with respect to the loss. This ability lets us easily define a loss function so that our word2vec model (1) learns to predict the right context words while (2) avoids learning biases. If we compare this code to the numpy part of Homework 1, it's easy to see how powerful PyTorch can be as an experimenter for helping you control what and how your models learn!

Your task is to expand this general approach by coming up with an extension to word2vec that adds some new term to the `loss` value that penalizes bias in the gender dimension. There is no right way to do this and even some right-looking approaches may not work&mdash;or might word but simultaneously destroy the information in the word vectors (all-zero vectors are unbiased but also uninformative!). 

**Suggestion:** You may need to weight your bias term in the loss function (remember that $\lambda_1 x_1 + \lambda_2 x_2$ interpolation? This is sort of similar) so that your debiasing regularizer doesn't overly penalize your model.

Once you have generated your model, record word vector similarities for the pairs listed on canvas in `word-pair-similarity-predictions.csv` where your file writes a result like
```
word1,word2,sim
dog,puppy,0.91234123
woman,preseident,0.81234
```
You'll record the similarity for each pair of words in the file and upload it to CodaLab, which is kind of like Kaggle but lets use a custom scoring program. We'll evaluate your embeddings based on how unbiased they are and how much information they still capture after debiasing. **Your grade does not depend on how well you do in CodaLab, just that you tried something and submitted.** However, the CodaLab leaderboard will hopefully provide a fun and insightful way of comparing just how much bias we can remove from our embeddings.

The CodaLab link will be posted to Piazza

In [24]:
df = pd.read_csv("word_pair_similarity_predictions.csv")
df["sim" ] = df.apply(lambda x: eva.compute_cosine(eva.get_vec(x["word1"].lower()), eva.get_vec(x["word2"].lower())), axis=1)
# df.to_csv("label.csv", index=False)
df.tail(20)

Unnamed: 0,word1,word2,sim
1610,wedding,she,-0.0298
1611,wedding,her,0.098532
1612,wedding,hers,0.03497
1613,wedding,daughter,0.57403
1614,relatives,male,0.576557
1615,relatives,man,0.60586
1616,relatives,boy,0.594422
1617,relatives,brother,0.647661
1618,relatives,he,-0.078551
1619,relatives,him,-0.025218
