<a href="https://colab.research.google.com/github/FreemindTrader/nlp-in-practice/blob/master/Chapter_5_Backprop_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
-----------------------



In this Notebook, we're going to step through an update of the weight matrices for a skip-gram word2vec model using Negative Sampling. This means we'll be implementing both negative sampling and backprop from scratch!

This Notebook has two major parts.

### Part 1 - Pre-training with gensim

In **Part 1**, we are going to use `gensim` to *begin* training a word2vec model on the Wikipedia comments dataset. `gensim` will handle the construction of the vocabulary for us and perform the first two training passes. Two passes are enough to make the word vectors reasonable, though still fairly poor. I chose this because a weight update in this state is more informative than one on either *fully trained* or *completely random* vectors.

*You may __skip reading Part 1__ and __go straight to Part 2__ if you'd like.*

### Part 2 - Manual weight update

In **Part 2**, we will take a single training word pair ("thought", "well"), and implement the weight update from scratch. This will require implementing negative sampling (to select the negative samples) and implementing the gradient calculations for updating the word vectors.


# Contents
-----------------

**Part 1 - Pretraining with gensim**
* [Dataset Preparation](#Dataset-Preparation)
    * [Download the dataset](#Download-the-dataset)
    * [Parse the dataset file](#Parse-the-dataset-file)
    * [Tokenize the comments](#Tokenize-the-comments)
* [Pre-Training](#Training)
    * [Configure logging](#Configure-logging)
    * [Set model parameters](#Set-model-parameters)
    * [Build the vocabulary](#Build-the-vocabulary)
    * [Train the model](#Train-the-model)
    * [Play with results](#Play-with-results)
    
**Part 2 - Manual Weight Update**    
* [Negative Sampling from Scratch](#Negative-Sampling-from-Scratch)
    * [Generating the Unigram Table](#Generating-Unigram-Table)
    * [Inspect the Table](#Inspect-the-Table)
        * [Row Counts](#Row-Counts)
        * [Compare Probability Distributions](#Compare-Probability-Distributions)
        * [20 Random Words](#20-Random-Words)
    * [Backprop](#Backprop)
        * [Retrieve Weight Matrices](#Retrieve-Weight-Matrices)
        * [Picking Samples](#Picking-Samples)
        * [Weight Updates](#Weight-Updates)


# Part 1 - Pretraining with gensim
------------------------------------------------------


# Dataset Preparation
-----------------------------------
In this section we'll download a text dataset comprised of comments on Wikipedia which contain "attacks" on other users (plus counter-examples).

We'll use `pandas` for CSV parsing and `gensim` for tokenization.

## Download the dataset
--------------------------------------
Download the text.

In [None]:
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=79830acb96fb80c2523d64204b187e06a415dffc4b2b1b9eb27f5ad4ea572751
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
import wget
import os

# Create the data subdirectory if not there.
if not os.path.exists('./data/'):
    os.mkdir('./data/')

filename = './data/attack_annotated_comments.tsv'

# Skip the download if we already have the file!
if not os.path.exists(filename):

    # URL for the TSV file (~55.4MB) containing the Wikipedia comments.
    url = 'https://ndownloader.figshare.com/files/7554634'

    # Download the dataset.
    print('Downloading Wikipedia Attack Comments dataset (~55.4MB)...')
    wget.download(url, filename)

    print('  DONE.')

# We won't use these, but FYI, this is the file containing the labels
# for the comments.
#   url = 'https://ndownloader.figshare.com/files/7554637'
#   filename = './data/attack_annotated_comments.tsv'

Downloading Wikipedia Attack Comments dataset (~55.4MB...
  DONE.


## Parse the dataset file
--------------------------------
We'll use `pandas` just to help us parse the tab-separated `.tsv` file.


In [None]:
import pandas as pd

print('Parsing the dataset .tsv file...')
comments = pd.read_csv('./data/attack_annotated_comments.tsv', sep = '\t')

print('    Done.')


Parsing the dataset .tsv file...
    Done.


## Tokenize the comments
------------------------------------
This dataset uses the special labels "NEWLINE_TOKEN" and "TAB_TOKEN" to represent the newline and tab characters. We'll replace these with a single space.

In [None]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

Next, use gensim to apply a simple tokenization strategy to the text and turn each comment into a list of words.
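To get a feel for what this step produces, and why replacing the marker tokens first matters, here is a rough plain-Python approximation. The function `rough_preprocess` is hypothetical and is not gensim's implementation; the regular expression and the 2 to 15 character length limits are assumptions meant to mimic the defaults of `gensim.utils.simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, split on non-letters, filter by length."""
    # Runs of word characters that are not digits (letters and underscores).
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# A made-up comment using the dataset's newline marker.
raw = "Thanks!NEWLINE_TOKENI thought it was a well-reasoned reply."

# Without the replacement, the marker fuses with the neighboring word.
print(rough_preprocess(raw))

# With the replacement, the tokens come out clean.
cleaned = raw.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
tokens = rough_preprocess(cleaned)
print(tokens)
```

In this sketch, single-letter words such as "I" and "a" are dropped by the length filter, and punctuation never makes it into a token.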

In [None]:
%%time

import gensim
import io

print('Tokenizing comments...')

# Track the total number of tokens in the dataset.
num_tokens = 0

# List of sentences to use for training.
sentences = []

# For each comment...
for i, row in comments.iterrows():

    # Report progress.
    if ((i % 20000) == 0):
        print('  Read {:,} comments.'.format(i))

    # Tokenize the comment. This returns a list of words.
    parsed = gensim.utils.simple_preprocess(row.comment)

    # Accumulate the total number of words in the dataset.
    num_tokens += len(parsed)

    # Add the comment to the list.
    sentences.append(parsed)

print('DONE.')
print('')
print('{:>10,} comments'.format(i))
print('{:>10,} tokens'.format(num_tokens))
print('{:>10,} avg. tokens / comment'.format(int(num_tokens / len(sentences))))
print('')

Tokenizing comments...
  Read 0 comments.
  Read 20,000 comments.
  Read 40,000 comments.
  Read 60,000 comments.
  Read 80,000 comments.
  Read 100,000 comments.
DONE.

   115,863 comments
 7,651,029 tokens
        66 avg. tokens / comment

CPU times: user 25 s, sys: 655 ms, total: 25.7 s
Wall time: 26.2 s


# Training
----------------

Time to train the model!

## Configure Logging
-----------------------------
`gensim` provides some valuable information about the training process using the `logging` module in Python.

In order to see this log output, we first need to set up logging.


In [None]:
import logging

# Enable logging at the `INFO` level and set a custom format--the
# default log format is pretty wordy.
logging.basicConfig(
    format='%(asctime)s : %(message)s', # Display just time and message.
    datefmt='%H:%M:%S', # Display time, but not the date.
    level=logging.INFO)


Let's also suppress any pesky warnings from libraries that gensim references.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


## Set Model Parameters
----------------------------------

We define all of the parameters for our model upfront. Take a look at the code comments for each parameter below.

Also, for reference:
* Documentation for [gensim.models.Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).
* Source code for [gensim.models.Word2Vec](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L659) constructor.


In [None]:
model = gensim.models.Word2Vec (
    size=100,    # Number of features in word vector

    window=10,   # Context window size (in each direction)
                 #   Default is 5

    min_count=2, # Words must appear this many times to be in vocab.
                 #   Default is 5

    workers=10,  # Training thread count

    sg=0,        # 0: CBOW, 1: Skip-gram.
                 #   Default is 0, CBOW

    hs=0,        # 0: Negative Sampling, 1: Hierarchical Softmax
                 #   Default is 0, NS

    negative=5   # Number of negative samples
                 #   Default is 5
)

## Build the Vocabulary
---------------------------------

Before we can train the word2vec neural network, we need to create a vocabulary. The vocabulary contains the full list of words that we will end up learning word vectors for.

* Source code for `build_vocab` is at [base_any2vec.py#L896](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/base_any2vec.py#L896)
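The core idea of vocabulary building can be sketched in a few lines: count every token, then keep only the words that reach `min_count`. The toy corpus and variable names below are made up for illustration, and gensim's real `build_vocab` does considerably more than this.

```python
from collections import Counter

# Toy corpus: each "sentence" is a list of tokens (hypothetical data).
toy_sentences = [
    ["your", "questions", "are", "well", "thought", "out"],
    ["well", "thought", "out", "and", "reasoned"],
    ["your", "questions", "are", "fine"],
]

toy_min_count = 2

# Count every token in the corpus.
raw_counts = Counter(tok for sent in toy_sentences for tok in sent)

# Keep only words that meet the threshold, most frequent first.
toy_vocab = [(w, c) for w, c in raw_counts.most_common() if c >= toy_min_count]

# Words that fall below the threshold never get a word vector.
toy_dropped = sorted(w for w, c in raw_counts.items() if c < toy_min_count)

print('{:,} word types, {:,} retained'.format(len(raw_counts), len(toy_vocab)))
print('dropped:', toy_dropped)
```

Rare words make up a large share of the word *types* but a small share of the total *tokens*, which is why a low threshold can shrink the vocabulary a lot while discarding very little of the corpus.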

In [None]:
# Build the vocabulary using the comments in "sentences".
model.build_vocab(
    sentences, # Our comments dataset
    progress_per=20000  # Update after this many sentences.
                        # Too many progress updates is annoying!
)

22:58:55 : collecting all words and their counts
22:58:55 : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
22:58:56 : PROGRESS: at sentence #20000, processed 1399843 words, keeping 51238 word types
22:58:56 : PROGRESS: at sentence #40000, processed 2764033 words, keeping 76280 word types
22:58:56 : PROGRESS: at sentence #60000, processed 4091968 words, keeping 94720 word types
22:58:57 : PROGRESS: at sentence #80000, processed 5354741 words, keeping 112164 word types
22:58:57 : PROGRESS: at sentence #100000, processed 6640746 words, keeping 129161 word types
22:58:57 : collected 141062 word types from a corpus of 7651029 raw words and 115864 sentences
22:58:57 : Loading a fresh vocabulary
22:58:58 : effective_min_count=2 retains 71038 unique words (50% of original 141062, drops 70024)
22:58:58 : effective_min_count=2 leaves 7581005 word corpus (99% of original 7651029, drops 70024)
22:58:58 : deleting the raw counts dictionary of 141062 items
22:58:58 : sample=0.001 

## Train the model
-------------------------

Now that we have a vocabulary built, we're ready to train the model.

*IMPORTANT: We are only running two training passes (epochs=2) so that we get the weights into a reasonable, but still imperfect state.*

In [None]:
%%time

print('Training the model...')

model.train(
    sentences,
    total_examples=len(sentences),
    epochs=2,        # How many training passes to take.
    report_delay=10.0 # Report progress every 10 seconds.
)

print('  Done.')
print('')

22:59:13 : training model with 10 workers on 71038 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=10


Training the model...


22:59:14 : EPOCH 1 - PROGRESS: at 5.39% examples, 368547 words/s, in_qsize 19, out_qsize 0
22:59:24 : EPOCH 1 - PROGRESS: at 78.82% examples, 425096 words/s, in_qsize 19, out_qsize 0
22:59:27 : worker thread finished; awaiting finish of 9 more threads
22:59:27 : worker thread finished; awaiting finish of 8 more threads
22:59:27 : worker thread finished; awaiting finish of 7 more threads
22:59:27 : worker thread finished; awaiting finish of 6 more threads
22:59:27 : worker thread finished; awaiting finish of 5 more threads
22:59:27 : worker thread finished; awaiting finish of 4 more threads
22:59:27 : worker thread finished; awaiting finish of 3 more threads
22:59:27 : worker thread finished; awaiting finish of 2 more threads
22:59:27 : worker thread finished; awaiting finish of 1 more threads
22:59:27 : worker thread finished; awaiting finish of 0 more threads
22:59:27 : EPOCH - 1 : training on 7651029 raw words (5906604 effective words) took 13.8s, 427547 effective words/s
22:59:28 : 

  Done.

CPU times: user 53.1 s, sys: 268 ms, total: 53.4 s
Wall time: 27.4 s


## Play with results
---------------------------
As a quick, informal test of the quality of the model, let's look at some word comparisons that we think might appear in this dataset.

In [None]:
model.wv.most_similar('condescending')

[('childish', 0.788465142250061),
 ('rude', 0.7828449606895447),
 ('arrogant', 0.7772617340087891),
 ('malleus', 0.7670767903327942),
 ('aggressive', 0.7646420001983643),
 ('sarcastic', 0.759050726890564),
 ('nasty', 0.7552060484886169),
 ('pedantic', 0.7531112432479858),
 ('insulting', 0.7473530769348145),
 ('reprimanded', 0.7345737814903259)]

In [None]:
print(model.wv.similarity('stupid', 'dumb'))

0.8387435


# Part 2 - Manual Weight Update
----------------------------------------------------
In Part 1, we partially trained a word2vec model using gensim. Now, in Part 2, we're going to implement the training from scratch, and walk through a single training sample in order to see it in detail.

We'll take the word pair ("thought", "well"), and make the necessary weight updates.

# Negative Sampling from Scratch
-------------------------------------------------------

In order to update the weights properly, we also need to implement Negative Sampling so that we can randomly choose 5 words as negative samples *using the correct probabilities*.

## Generating Unigram Table
----------------------------------------
I've implemented negative sampling here by porting the code from the `InitUnigramTable` function in the original [word2vec.c](https://github.com/chrisjmccormick/word2vec_commented/blob/master/word2vec.c).

Each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of the weights for all words.

To implement this, we have a large array, and we fill it with the word ids from our vocabulary. Word ids appear multiple times in the table such that `(number of rows with word i) / (table size) = probability of choosing word i`.

Every vocab word appears at least once in the table.

The size of the table relative to the size of the vocab dictates the resolution of the sampling. A larger unigram table means the negative samples will be selected with a probability that more closely matches the probability calculated by the equation.
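Before building the full-size table, it can help to see the same idea at toy scale. The sketch below uses four made-up words and a 1,000-row table; all names prefixed with `toy_` are hypothetical, and the fill loop is a simplified take on the approach described above rather than a copy of the original C code.

```python
import random
from collections import Counter

# Hypothetical word counts, already sorted by decreasing frequency.
toy_counts = [("the", 1000), ("well", 100), ("thought", 10), ("steamboat", 1)]
toy_power = 0.75
toy_table_size = 1000

# Weight of each word, and the target probability from the formula.
toy_weights = [count ** toy_power for _, count in toy_counts]
toy_total = sum(toy_weights)
toy_probs = [w / toy_total for w in toy_weights]

# Fill the table: word i gets a share of rows close to toy_probs[i].
toy_table = []
i = 0
cumulative = toy_probs[0]
for a in range(toy_table_size):
    toy_table.append(i)
    # Once we've filled this word's share, move on to the next word.
    if a / toy_table_size > cumulative and i < len(toy_counts) - 1:
        i += 1
        cumulative += toy_probs[i]

# Compare the share of rows per word against the target probability.
toy_rows = Counter(toy_table)
for idx, (word, _) in enumerate(toy_counts):
    print("{:>10}  target {:.4f}  table {:.4f}".format(
        word, toy_probs[idx], toy_rows[idx] / toy_table_size))

# Drawing a negative sample is then a uniform pick of a row.
toy_sample = toy_counts[toy_table[random.randrange(toy_table_size)]][0]
print("sampled:", toy_sample)
```

With only 1,000 rows the table shares can only match the target probabilities to about three decimal places, which is the "resolution" effect: a larger table tracks the formula more closely.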

In [None]:
import numpy as np

# The original code used a table size of 100M for a vocab of 3M words.
# Since our vocab is only ~71K, we'll use 10M instead.
table_size = int(10e6)
power = 0.75

# Bonus - What table size should we use to be proportional to the
#         original 3M word model? (Run this once `entries` is defined
#         below.)
# print('{:,}'.format(int(100e6 / 3e6 * len(entries))))

# Allocate the table.
uni_table = np.ndarray((table_size,1), dtype='int')

# Get all of the vocab entries as a list.
entries = model.wv.vocab.values()

print('Sorting vocab...')

# Sort them by decreasing frequency...
entries = sorted(entries, key=lambda entry: entry.count, reverse=True)

# Also total up the counts, so we can compare to unigram.
total_count = 0
train_words_pow = 0

print('Accumulating denominator...')

# Calculate the denominator, which is the sum of weights for all words.
for entry in entries:
    train_words_pow += pow(entry.count, power)
    total_count += entry.count

print('Done.')

print('Filling out unigram table...')

# 'i' is the vocabulary index of the current word, whereas 'a' will be
# the index into the unigram table.
i = 0

# d1 will store the probability that we choose word `i` as a fraction
# between 0 and 1.
d1 = pow(entries[i].count, power) / train_words_pow;

# Loop over all positions in the table.
for a in range(0, table_size):

  # Update progress every 1M entries.
  if a % int(1e6) == 0:
    print('    At table position {:<10,} / {:,}'.format(a, len(uni_table)))

  # Store word 'i' in this position. Word 'i' will appear multiple times
  # in the table, based on its frequency in the training data.
  uni_table[a] = i;

  # If the fraction of the table we have filled is greater than the
  # cumulative probability of the words so far, move to the next word.
  if (float(a) / float(table_size) > d1):

    # Move to the next word.
    i += 1

    # Don't go past the end of the vocab. The probabilities for all words
    # sum to 1, so there shouldn't be extra space at the end of the table,
    # but floating-point rounding could leave us off by one. Check before
    # indexing `entries`, which would otherwise raise an IndexError.
    if (i >= len(entries)):
      print('Triggered the end check!')
      i = len(entries) - 1

    else:
      # Calculate the probability for the new word, and accumulate it with
      # the probabilities of all previous words, so that we can compare d1
      # to the fraction of the table that we have filled.
      d1 += pow(entries[i].count, power) / train_words_pow

print('Done!')

Sorting vocab...
Accumulating denominator...
Done.
Filling out unigram table...
    At table position 0          / 10,000,000
    At table position 1,000,000  / 10,000,000
    At table position 2,000,000  / 10,000,000
    At table position 3,000,000  / 10,000,000
    At table position 4,000,000  / 10,000,000
    At table position 5,000,000  / 10,000,000
    At table position 6,000,000  / 10,000,000
    At table position 7,000,000  / 10,000,000
    At table position 8,000,000  / 10,000,000
    At table position 9,000,000  / 10,000,000
Done!


## Inspect the Table
---------------------------
Let's check out a few properties of the table.

### Row Counts
------------------
Each word has a number of spots in the table proportional to its sampling probability, so as a point of reference, let's see how many spots are occupied by the *least common word* and by the *most common word*.

In [None]:
# Get the last word in the table.
last_word = uni_table[-1]
num_spaces = 0

# Loop backwards through the table...
for i in range(-1, -len(uni_table), -1):

    # Stop when the word changes.
    if not uni_table[i] == last_word:
        break

    num_spaces += 1

print('The least common word has {:,} spots'.format(num_spaces))

# Look up the first word in the table.
first_word = uni_table[0]
num_spaces = 0

# Loop forward through the table
for i in range(0, len(uni_table)):

    # Stop when the word changes.
    if not uni_table[i] == first_word:
        break

    num_spaces += 1

print('The most common word has {:,} spots'.format(num_spaces))

The least common word has 13 spots
The most common word has 130,193 spots


### Compare Probability Distributions
-------------------------------------------

Recall that the unigram distribution is given as:

$ P(w_i) = \frac{  f(w_i)  }{\sum_{j=0}^{n}\left(  f(w_j) \right) } $

But the authors found that the following modification produced better results:

$ P(w_i) = \frac{  {f(w_i)}^{3/4}  }{\sum_{j=0}^{n}\left(  {f(w_j)}^{3/4} \right) } $

Let's compare the probabilities for the 10 most common and 10 least common words to see how the modified distribution affects them.

In [None]:
# List to hold the data to report.
rows = []

# Indices for the first ten and last ten words.
indices = list(range(0, 10))
indices += list(range(len(entries)-10, len(entries)))

# For each word...
for i in indices:
    # Look up the word.
    word = model.wv.index2word[entries[i].index]

    # Get the probability of selection with the unigram distribution.
    # Format it as a percentage with 6 decimal places.
    uni_prob = '%.6f%%' % (float(entries[i].count) / float(total_count) * 100.0)

    # Get the probability of selection with the unigram^(3/4) distribution.
    uni34_prob = '%.6f%%' % (pow(entries[i].count, power) / train_words_pow * 100.0)

    rows.append((i, word, uni_prob, uni34_prob))

# Convert the results to a DataFrame to display as a nice table.
df = pd.DataFrame(rows, columns=['Rank', 'Word', 'Unigram', 'Unigram^3/4'])
df = df.set_index('Rank')

display(df)


| Rank | Word | Unigram | Unigram^3/4 |
|------|------|---------|-------------|
| 0 | the | 4.759013% | 1.301913% |
| 1 | to | 2.846351% | 0.885443% |
| 2 | and | 2.271361% | 0.747582% |
| 3 | you | 2.251390% | 0.742647% |
| 4 | of | 2.232659% | 0.738008% |
| 5 | is | 1.794802% | 0.626552% |
| 6 | that | 1.572681% | 0.567446% |
| 7 | in | 1.441577% | 0.531585% |
| 8 | it | 1.438820% | 0.530822% |
| 9 | this | 0.956311% | 0.390745% |
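One quick way to summarize the effect is to add up each column. The snippet below simply copies the ten percentages shown above into lists (the list names are made up here) and sums them.

```python
# Percentages copied from the table above (top ten words only).
unigram_pct = [4.759013, 2.846351, 2.271361, 2.251390, 2.232659,
               1.794802, 1.572681, 1.441577, 1.438820, 0.956311]
unigram34_pct = [1.301913, 0.885443, 0.747582, 0.742647, 0.738008,
                 0.626552, 0.567446, 0.531585, 0.530822, 0.390745]

# Total share of draws that would go to the ten most common words.
top10_unigram = sum(unigram_pct)
top10_unigram34 = sum(unigram34_pct)

print("Top-10 share, unigram:     {:.2f}%".format(top10_unigram))
print("Top-10 share, unigram^3/4: {:.2f}%".format(top10_unigram34))
```

Under the plain unigram distribution, the ten most common words would account for roughly 21.6% of all negative samples; with the 3/4 power that drops to roughly 7.1%, and the difference is redistributed to less common words.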


### 20 Random Words
-------------------------
It's interesting to use this table now and select a handful of words at random to see what we get.

What we see, most notably, is that common words are still sampled very often!

In [None]:
import random

print('  --Rank--   --Word--')

# Pick some negative samples!
for i in range(0, 20):

    # Pick a random index in the unigram table. `randrange` excludes the
    # upper bound, so the index is always valid.
    j = random.randrange(table_size)

    # Look up the word at position 'j'.
    word_i = uni_table[j, 0]

    # Print the word's frequency rank along with the word.
    print("{:>10,}   {:}".format(word_i, model.wv.index2word[int(word_i)]))


  --Rank--   --Word--
        60   here
     1,105   error
         8   it
    19,841   virulent
        86   re
    22,321   musk
         0   the
    68,145   darknesshines
    32,757   sprawling
        14   as
     2,282   promise
    40,531   spiketoronto
     2,034   cellspacing
       366   important
       368   states
       423   following
        49   who
    32,269   ded
        34   what
     5,623   shocked


## Backprop
-----------------
Now we're ready to run Backprop!


### Retrieve Weight Matrices
--------------------------------

We're going to use the weight matrices from our partially trained gensim model.

The input vectors are stored in `model.wv.vectors`, and the output vectors are stored in `model.trainables.syn1neg`.

We'll run some sanity checks here to ensure that the vectors aren't normalized and that we are indexing them correctly.

In [None]:
# Get the (partially-trained) input vectors matrix and the
# output vectors matrix from the gensim model.
in_vecs = model.wv.vectors
out_vecs = model.trainables.syn1neg

# Retrieve the word vector for "stupid".
vec_a_i = model.wv.vocab['stupid'].index
vec_a = in_vecs[vec_a_i,:]

# Retrieve the word vector for "dumb".
vec_b_i = model.wv.vocab['dumb'].index
vec_b = in_vecs[vec_b_i,:]

# Ensure that the vectors aren't normalized (i.e., their length isn't ~1).
print('Verifying vectors are not normalized...')
assert(abs(np.linalg.norm(vec_a) - 1.0) > 0.01)
assert(abs(np.linalg.norm(vec_b) - 1.0) > 0.01)

print("Cosine similarity between 'stupid' and 'dumb'...")
print("    Using gensim: %.4f" % model.wv.similarity('stupid', 'dumb'))
print("        Manually: %.4f" % np.dot(vec_a / np.linalg.norm(vec_a), vec_b / np.linalg.norm(vec_b)))


Verifying vectors are not normalized...
Cosine similarity between 'stupid' and 'dumb'...
    Using gensim: 0.8387
        Manually: 0.8387


### Picking Samples
-----------------------

Here is an example sentence from the training data: "your questions are well thought out and reasoned"

Let's suppose that our context window is currently centered around `thought`, and we are currently looking at the word at -1, `well`. This is our positive sample.

```
            -3       -2     -1      input     +1     +2      +3      
"your  (questions)  (are)  (well)  [thought]  (out)  (and) (reasoned)"
```

We'll start by:
1. Defining our input word ('thought').
2. Defining our positive output word ('well').
3. Selecting five random words as negative samples.

We'll store the output words as a list with entries of the form (`word`, `label`), where `label` is 1 for the positive sample and 0 for the negative samples.

In [None]:
import random

# The word at the center of our context window.
input_word = 'thought'

# Build a table to report the words--this table is just for information
# and won't be used in the training.
word_stats = []

# Create a list of positive and negative output words with their labels.
output_words = [('well', 1.0)]

# Record the word, its label, its frequency rank, and the number of occurrences in the training data.
word_stats.append(('well',
                   1.0,  # Label
                   "{:,}".format(model.wv.vocab['well'].index), # Frequency rank
                   "{:,}".format(model.wv.vocab['well'].count) # Number of occurrences
                  ))

# The number of negative samples is a parameter--the default is 5.
num_neg_samples = 5

# Randomly choose 5 negative samples.
for i in range(0, num_neg_samples):

    # Pick a random index in the unigram table. `randrange` excludes the
    # upper bound, so the index is always valid.
    j = random.randrange(table_size)

    # Look up the word at position 'j'.
    word_i = uni_table[j, 0]

    # Retrieve the string version of the word.
    out_word = model.wv.index2word[int(word_i)]

    # Record the word, its label, its frequency rank, and the number of occurrences in the training data.
    word_stats.append((out_word,
                       0.0, # Label
                       "{:,}".format(word_i),  # Frequency rank
                       "{:,}".format(model.wv.vocab[out_word].count) # Number of occurrences
                      ))

    # Add the word to the list, with label '0' to indicate it's a negative
    # sample.
    output_words.append((out_word, 0.0))

# Display our output words and their statistics as a table.
df = pd.DataFrame(word_stats, columns=['Word', 'Label', 'Rank', 'Occurrences'])
df = df.set_index('Word')
display(df)


| Word | Label | Rank | Occurrences |
|------|-------|------|-------------|
| well | 1.0 | 94 | 9342 |
| jihads | 0.0 | 43314 | 3 |
| steamboat | 0.0 | 27608 | 7 |
| you | 0.0 | 3 | 170678 |
| gfdl | 0.0 | 2846 | 227 |
| to | 0.0 | 1 | 215782 |


### Weight Updates
----------------------

We now have 6 word pairs to train the network on, one positive pair and five negative pairs.

For each output word, we will:

1. Calculate the network's current output for this output word.
2. Calculate the error.
3. Update the output weights for the output word.
4. Accumulate the gradient for the input word.

Here, "gradient" just means the amount by which to update each weight (the learning rate is already folded in). Specifically, we'll keep a gradient vector the same size as the input word vector, storing the amount by which to adjust each feature of the input word vector.
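For readers who want to see where the "error" term comes from, here is a short derivation. It assumes the standard logistic (binary cross-entropy) loss for a single (input, output) pair, which is how negative sampling is usually described; the symbols $v_{in}$, $v_{out}$, $y$ (the label), and $\alpha$ (the learning rate) are notation introduced here.

$ z = v_{in} \cdot v_{out}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}} $

$ L = -\left[ \, y \log \sigma(z) + (1 - y) \log\left(1 - \sigma(z)\right) \right] $

$ \frac{\partial L}{\partial z} = \sigma(z) - y $

$ \frac{\partial L}{\partial v_{out}} = \left(\sigma(z) - y\right) v_{in}, \qquad \frac{\partial L}{\partial v_{in}} = \left(\sigma(z) - y\right) v_{out} $

Stepping *against* the gradient flips the sign, which gives the updates:

$ v_{out} \leftarrow v_{out} + \alpha \left(y - \sigma(z)\right) v_{in}, \qquad v_{in} \leftarrow v_{in} + \sum_{\text{output words}} \alpha \left(y - \sigma(z)\right) v_{out} $

So the quantity $\alpha \left(y - \sigma(z)\right)$ is the "error" for one output word: positive when the model's output is too low for the positive sample, negative when it is too high for a negative sample.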

In [None]:
# Learning rate. This starts out at this value, but changes
# over the course of training.
alpha = 0.025

# For reference...
# index 2 word: model.wv.index2word
# word 2 index: model.wv.vocab

# Record the activation and error values to display in a table at
# the end.
out_word_stats = []

# Look up the index for the input word.
in_vec_i = model.wv.vocab[input_word].index

# Select the input word vector.
in_vec = in_vecs[in_vec_i, :]

# Create an empty vector to hold the gradient for the input word.
in_vec_grad = np.zeros(in_vecs[0,:].shape)

# Print header
#print('   Input         Output   Activ.  Error')
#print('   -----         ------   ------  -----')

# For each output word...
for (out_word, label) in output_words:

    # ======== Calculate Network Output ========

    # Look up the output word.
    out_vec_i = model.wv.vocab[out_word].index

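    # Select the output word vector. Copy it so that the gradient for the
    # input vector (accumulated below) uses the values from *before* this
    # word's update; plain slicing returns a view into `out_vecs` that
    # would reflect the update.
    out_vec = out_vecs[out_vec_i, :].copy()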

    # Take their dot product.
    z = np.dot(in_vec, out_vec)

    # Apply the sigmoid activation. This is the model's output for
    # `out_word`
    activation = 1 / (1 + np.exp(-z))

    # ======== Calculate Error ========

    # Calculate the output error and apply the learning rate (alpha).
    err = (label - activation) * alpha

    # ======== Update Output Word Vector ========

    # Update the output vector by multiplying the error with the input
    # vector.
    out_vecs[out_vec_i, :] = out_vec + (err * in_vec)

    # ======== Accumulate Input Word Vector Gradient ========

    # Multiply the error with the output vector and accumulate this
    # as the gradient for the input vector.
    in_vec_grad += err * out_vec

    # Record the activation and error for this word pair.
    # Leave the last column empty--it will hold the new activation value
    # in the next code block.
    out_word_stats.append([input_word, out_word, '%.4f' % activation, '%.4f' % err, ''])

    #print('%10s  %12s  %.4f  %.4f' % (input_word, out_word, activation, err))

# ======== Apply Input Word Vector Gradient ========

# Update the input word vector weights by applying the gradient.
in_vecs[in_vec_i, :] = in_vec + in_vec_grad

# Display the activation and error values as a table.
df = pd.DataFrame(out_word_stats,
                  columns=['Input', 'Output', 'Activation', 'Error', ''])

display(df)

|   | Input | Output | Activation | Error |
|---|-------|--------|------------|-------|
| 0 | thought | well | 0.8085 | 0.0048 |
| 1 | thought | jihads | 0.1248 | -0.0031 |
| 2 | thought | steamboat | 0.066 | -0.0017 |
| 3 | thought | you | 0.7931 | -0.0198 |
| 4 | thought | gfdl | 0.0023 | -0.0001 |
| 5 | thought | to | 0.0852 | -0.0021 |


#### Side Note - Interpreting the Activation
-----------------------------------------
Don't get confused about the meaning of the activation value--*it is not a measure of word similarity*. Rather, it reflects how likely you are to find the word "well" in the vicinity of "thought".

For example, the words "thought" and "think" are very similar in meaning, but are unlikely to appear close together. The following code confirms this with our model.

In [None]:
# Select output vector for "think".
out_vec = out_vecs[model.wv.vocab['think'].index, :]

# Take dot product of "thought" and "think".
z = np.dot(in_vec, out_vec)

# Apply the sigmoid activation.
activation = 1 / (1 + np.exp(-z))

# Show the word vector similarity versus output value.
print("Similarity for 'thought' and 'think':    %.4f" % model.wv.similarity('thought', 'think'))
print("Network output for ('thought', 'think'): %.4f" % activation)

Similarity for 'thought' and 'think':    0.4503
Network output for ('thought', 'think'): 0.1073


------------------------
Now that we've updated the weights, we can try running another forward pass to see the impact of our changes.

We'll see that the positive output word now has a higher activation, and the negative output words now all have lower activations.

In [None]:
# Select the updated input word vector.
in_vec = in_vecs[in_vec_i, :]

i = 0

# For each output word...
for (out_word, label) in output_words:

    # Look up the output word.
    out_vec_i = model.wv.vocab[out_word].index

    # Select the updated output word vector.
    out_vec = out_vecs[out_vec_i, :]

    # Take their dot product.
    z = np.dot(in_vec, out_vec)

    # Apply the sigmoid activation. This is the model's output for
    # `out_word`
    activation = 1 / (1 + np.exp(-z))

    # Record the new activation value
    out_word_stats[i][4] = '%.4f' % activation

    i += 1

df = pd.DataFrame(out_word_stats, columns=['Input', 'Output', 'Prev. Activation', 'Prev. Error', 'New Activation'])
df

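|   | Input | Output | Prev. Activation | Prev. Error | New Activation |
|---|-------|--------|------------------|-------------|----------------|
| 0 | thought | well | 0.8085 | 0.0048 | 0.8831 |
| 1 | thought | jihads | 0.1248 | -0.0031 | 0.0875 |
| 2 | thought | steamboat | 0.066 | -0.0017 | 0.0535 |
| 3 | thought | you | 0.7931 | -0.0198 | 0.2406 |
| 4 | thought | gfdl | 0.0023 | -0.0001 | 0.0022 |
| 5 | thought | to | 0.0852 | -0.0021 | 0.0646 |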
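If you'd like to try the same update without the gensim model, here is a minimal plain-Python sketch of one training step with one positive and one negative output word. The 3-dimensional vectors and all names (`toy_in`, `toy_out`, and so on) are invented for illustration; it follows the same four steps as above, under the assumption that the input gradient is accumulated from the output vectors as they were *before* their own update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

toy_alpha = 0.025

# Hypothetical 3-dimensional vectors (the real ones have 100 features).
toy_in = [0.5, -0.2, 0.1]
toy_out = {
    "positive": ([0.4, 0.1, -0.3], 1.0),
    "negative": ([0.3, -0.5, 0.2], 0.0),
}

# Forward pass before the update.
before = {name: sigmoid(dot(toy_in, vec)) for name, (vec, _) in toy_out.items()}

toy_grad = [0.0, 0.0, 0.0]
updated_out = {}
for name, (vec, label) in toy_out.items():
    # Error for this output word, with the learning rate folded in.
    err = (label - sigmoid(dot(toy_in, vec))) * toy_alpha
    # Accumulate the input gradient using the not-yet-updated output vector.
    toy_grad = [g + err * o for g, o in zip(toy_grad, vec)]
    # Nudge the output vector toward (or away from) the input vector.
    updated_out[name] = [o + err * x for o, x in zip(vec, toy_in)]

# Apply the accumulated gradient to the input vector once, at the end.
new_in = [x + g for x, g in zip(toy_in, toy_grad)]

# Forward pass after the update.
after = {name: sigmoid(dot(new_in, vec)) for name, vec in updated_out.items()}

for name in toy_out:
    print("{:>8}: {:.4f} -> {:.4f}".format(name, before[name], after[name]))
```

For these particular vectors, the positive pair's activation goes up and the negative pair's goes down, mirroring the table above on a much smaller scale.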


# Conclusion
---------------------

Here's what we covered:
* We saw how negative sampling is implemented using the unigram table approach.
* We looked at how negative sampling behaves on real data.
* We saw how to update the skip-gram model weights by walking through a single training sample.