<a href="https://colab.research.google.com/github/Mjh9122/ML_lit_review/blob/main/word2vec/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Efficient Estimation of Word Representations in Vector Space
## Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
### Notes: Michael Holtz

### Abstract

The authors describe two neural network architectures for embedding words into a vector space. Embedding quality is measured with a word similarity task, on which the new models reach higher accuracy at much lower computational cost than earlier neural network approaches. Furthermore, the resulting vectors provide state-of-the-art performance on the authors' test set for syntactic and semantic word similarities.

### Intro

Many earlier NLP systems treat words as atomic units: each word is just an index in a vocabulary, with no notion of similarity between words. These simple techniques hit notable limits in many tasks. Simple models may need trillions of words to perform well, but tasks such as automatic speech recognition or machine translation often have corpora of only millions or billions of words. In these scenarios, a more sophisticated strategy is needed.

### Paper Goals

The goal is to create high-quality vector embeddings for a vocabulary of millions of words from a corpus of a billion or more words. The embedding is expected to place similar words close to one another and to allow words to have multiple degrees of similarity (for example, words sharing a similar ending). More surprising is that algebraic operations on these vectors preserve meaning. For example, vector("King") - vector("Man") + vector("Woman") lands closest to vector("Queen"). The authors also develop a test set for syntactic and semantic regularities and discuss how training time and accuracy depend on the embedding dimension.
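
To make the analogy arithmetic concrete, here is a minimal sketch using hand-picked 2-D toy vectors. The values are hypothetical, not learned embeddings, and excluding the three input words from the candidates is an assumption of this sketch. The answer is the remaining word whose vector has the highest cosine similarity to the query vector.

```python
import math

# Hypothetical 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.05],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))
```

With real embeddings the match is rarely exact, which is why the nearest neighbour is used rather than an equality test.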


### Previous work

Previous work learned word embeddings through a neural network language model (NNLM). The first proposed models jointly learned a word vector representation and a statistical language model. Later architectures first learned the word vectors with a network that has a single hidden layer and then used those vectors to train the NNLM. Other work also found that NLP tasks became easier when working with word vectors. The architectures in this paper seek to find these vectors in a much more computationally efficient way.

### NNLM Architectures

#### Feedforward NNLM
This model takes in the N previous words, each encoded with one-of-V coding, where V is the vocabulary size. The input layer is mapped to a projection layer P through a shared projection matrix. P is passed to a non-linear hidden layer, which in turn predicts a probability distribution over all V words in the output layer.
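
The projection step can be illustrated with a small sketch: multiplying a one-of-V vector by a V x D projection matrix simply selects one row of that matrix, which is why it is usually implemented as a table lookup. The sizes and matrix values below are made up for the example.

```python
V, D = 5, 3  # hypothetical vocabulary size and projection dimension

# Hypothetical V x D projection matrix; row i is the vector for word index i.
projection = [[float(10 * i + j) for j in range(D)] for i in range(V)]

def one_of_v(index, size):
    """Encode a word index as a one-of-V (one-hot) vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(one_hot, matrix):
    """Vector-matrix product: one_hot (1 x V) times matrix (V x D)."""
    return [
        sum(one_hot[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(len(matrix[0]))
    ]

word_index = 2
via_matmul = project(one_of_v(word_index, V), projection)
via_lookup = projection[word_index]
print(via_matmul == via_lookup)
```

The lookup avoids touching the other V - 1 rows, so the cost per word is D rather than V x D.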

#### Recurrent NNLM (RNNLM)
This model removes the need to specify the context length for the input. The RNNLM has no projection layer; it consists of only input, hidden, and output layers. It is a recurrent architecture because time-delayed connections link the hidden layer to itself. These connections theoretically provide a form of short-term memory, allowing past words to influence future predictions.
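
A minimal sketch of the recurrent idea, assuming a simple sigmoid hidden layer and fixed, made-up weights (a trained model would learn them). It only computes the hidden state, not the output distribution, and shows that the same final word produces different hidden states depending on what came before.

```python
import math

H = 2  # hypothetical hidden size
V = 3  # hypothetical vocabulary size

# Hypothetical fixed weights, chosen by hand for illustration.
W_in = [[0.5, -0.3, 0.8],   # H x V: input word -> hidden
        [0.1, 0.7, -0.6]]
W_rec = [[0.4, -0.2],       # H x H: previous hidden -> hidden (time-delayed connection)
         [0.3, 0.9]]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(word_index, h_prev):
    """One recurrent step: h_t = sigmoid(W_in[:, word] + W_rec @ h_prev)."""
    return [
        sigmoid(W_in[r][word_index] + sum(W_rec[r][c] * h_prev[c] for c in range(H)))
        for r in range(H)
    ]

def run(word_indices):
    """Feed a sequence of word indices through the recurrence, starting from zeros."""
    h = [0.0] * H
    for w in word_indices:
        h = step(w, h)
    return h

# Same final word (index 2), different histories.
print(run([0, 2]))
print(run([1, 2]))
```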


### New log-linear models
These models are the main focus of the paper. They avoid the non-linear hidden layer of the neural nets above, which allows much more efficient training. The word vectors they produce can then serve as a much lower-dimensional input when training the architectures above.

#### CBOW
Continuous bag of words (CBOW) is similar to the feedforward NNLM, but without the non-linear hidden layer. Instead, the projection layer is shared by all words, so the context vectors are averaged into a single vector and word order does not matter. The input contains words from both before and after the target position, and the goal is to classify the word in the middle.
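
As a rough illustration of the efficiency gain, the paper gives per-example training complexities of Q = N x D + N x D x H + H x V for the feedforward NNLM, where the H x V output term drops to about H x log2(V) with hierarchical softmax, and Q = N x D + D x log2(V) for CBOW. Here N is the number of context words, D the vector dimension, H the hidden size, and V the vocabulary size. The sketch below plugs in illustrative sizes; the specific numbers are assumptions chosen for the example, not results from the paper.

```python
import math

def nnlm_cost(N, D, H, V):
    """Feedforward NNLM: Q = N*D + N*D*H + H*log2(V), hierarchical softmax assumed."""
    return N * D + N * D * H + H * math.log2(V)

def cbow_cost(N, D, V):
    """CBOW: Q = N*D + D*log2(V), hierarchical softmax assumed."""
    return N * D + D * math.log2(V)

# Illustrative sizes only (assumptions for this example).
N, D, H, V = 8, 300, 500, 2**20

print(f"NNLM: {nnlm_cost(N, D, H, V):,.0f}")
print(f"CBOW: {cbow_cost(N, D, V):,.0f}")
print(f"ratio: {nnlm_cost(N, D, H, V) / cbow_cost(N, D, V):.1f}x")
```

The N x D x H term dominates the NNLM cost, and it is exactly the term that disappears once the hidden layer is removed.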

In [50]:
# @title Imports
import torch
import torch.nn as nn

import random
import re

from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset

In [51]:
# @title Corpus
corpus = """
The Cat in the Hat

By Dr. Seuss

The sun did not shine.
It was too wet to play.
So we sat in the house
All that cold, cold, wet day.

I sat there with Sally.
We sat there, we two.
And I said, "How I wish
We had something to do!"

Too wet to go out
And too cold to play ball.
So we sat in the house.
We did nothing at all.

So all we could do was to

Sit!
Sit!
Sit!
Sit!

And we did not like it.
Not one little bit.

BUMP!

And then
something went BUMP!
How that bump made us jump!

We looked!
Then we saw him step in on the mat!
We looked!
And we saw him!
The Cat in the Hat!
And he said to us,
"Why do you sit there like that?"
"I know it is wet
And the sun is not sunny.
But we can have
Lots of good fun that is funny!"

"I know some good games we could play,"
Said the cat.
"I know some new tricks,"
Said the Cat in the Hat.
"A lot of good tricks.
I will show them to you.
Your mother
Will not mind at all if I do."

Then Sally and I
Did not know what to say.
Our mother was out of the house
For the day.

But our fish said, "No! No!
Make that cat go away!
Tell that Cat in the Hat
You do NOT want to play.
He should not be here.
He should not be about.
He should not be here
When your mother is out!"

"Now! Now! Have no fear.
Have no fear!" said the cat.
"My tricks are not bad,"
Said the Cat in the Hat.
"Why, we can have
Lots of good fun, if you wish,
with a game that I call
UP-UP-UP with a fish!"

"Put me down!" said the fish.
"This is no fun at all!
Put me down!" said the fish.
"I do NOT wish to fall!"

"Have no fear!" said the cat.
"I will not let you fall.
I will hold you up high
As I stand on a ball.
With a book on one hand!
And a cup on my hat!
But that is not ALL I can do!"
Said the cat...

"Look at me!
Look at me now!" said the cat.
"With a cup and a cake
On the top of my hat!
I can hold up TWO books!
I can hold up the fish!
And a litte toy ship!
And some milk on a dish!
And look!
I can hop up and down on the ball!
But that is not all!
Oh, no.
That is not all...

"Look at me!
Look at me!
Look at me NOW!
It is fun to have fun
But you have to know how.
I can hold up the cup
And the milk and the cake!
I can hold up these books!
And the fish on a rake!
I can hold the toy ship
And a little toy man!
And look! With my tail
I can hold a red fan!
I can fan with the fan
As I hop on the ball!
But that is not all.
Oh, no.
That is not all...."

That is what the cat said...
Then he fell on his head!
He came down with a bump
From up there on the ball.
And Sally and I,
We saw ALL the things fall!

And our fish came down, too.
He fell into a pot!
He said, "Do I like this?"
Oh, no! I do not.
This is not a good game,"
Said our fish as he lit.
"No, I do not like it,
Not one little bit!"

"Now look what you did!"
Said the fish to the cat.
"Now look at this house!
Look at this! Look at that!
You sank our toy ship,
Sank it deep in the cake.
You shook up our house
And you bent our new rake.
You SHOULD NOT be here
When our mother is not.
You get out of this house!"
Said the fish in the pot.

"But I like to be here.
Oh, I like it a lot!"
Said the Cat in the Hat
To the fish in the pot.
"I will NOT go away.
I do NOT wish to go!
And so," said the Cat in the Hat,

"So
so
so...

I will show you
Another good game that I know!"
And then he ran out.
And, then, fast as a fox,
The Cat in the Hat
Came back in with a box.
A big red wood box.
It was shut with a hook.
"Now look at this trick,"
Said the cat.
"Take a look!"

Then he got up on top
With a tip of his hat.
"I call this game FUN-IN-A-BOX,"
Said the cat.
"In this box are two things
I will show to you now.
You will like these two things,"
Said the cat with a bow.

"I will pick up the hook.
You will see something new.
Two things. And I call them
Thing One and Thing Two.
These Things will not bite you.
They want to have fun."
Then, out of the box
Came Thing Two and Thing One!
And they ran to us fast.
They said, "How do you do?
Would you like to shake hands
With Thing One and Thing Two?"

And Sally and I
Did not know what to do.
So we had to shake hands
With Thing One and Thing Two.
We shook their two hands.
But our fish said, "No! No!
Those Things should not be
In this house! Make them go!
"They should not be here
When your mother is not!
Put them out! Put them out!"
Said the fish in the pot.

"Have no fear, little fish,"
Said the Cat in the Hat.
"These Things are good Things."
And he gave them a pat.
"They are tame. Oh, so tame!
They have come here to play.
They will give you some fun
On this wet, wet, wet day."

"Now, here is a game that they like,"
Said the cat.
"They like to fly kites,"
Said the Cat in the Hat

"No! Not in the house!"
Said the fish in the pot.
"They should not fly kites
In a house! They should not.
Oh, the things they will bump!
Oh, the things they will hit!
Oh, I do not like it!
Not one little bit!" Then Sally and I
Saw them run down the hall.
We saw those two Things
Bump their kites on the wall!
Bump! Thump! Thump! Bump!
Down the wall in the hall.

Thing Two and Thing One!
They ran up! They ran down!
On the string of one kite
We saw Mother's new gown!
Her gown with the dots
That are pink, white and red.
Then we saw one kite bump
On the head of her bed!

Then those Things ran about
With big bumps, jumps and kicks
And with hops and big thumps
And all kinds of bad tricks.
And I said,
"I do NOT like the way that they play
If Mother could see this,
Oh, what would she say!"

Then our fish said, "Look! Look!"
And our fish shook with fear.
"Your mother is on her way home!
Do you hear?
Oh, what will she do to us?
What will she say?
Oh, she will not like it
To find us this way!"

"So, DO something! Fast!" said the fish.
"Do you hear!
I saw her. Your mother!
Your mother is near!
So, as fast as you can,
Think of something to do!
You will have to get rid of
Thing One and Thing Two!"

So, as fast as I could,
I went after my net.
And I said, "With my net
I can get them I bet.
I bet, with my net,
I can get those Things yet!"

Then I let down my net.
It came down with a PLOP!
And I had them! At last!
Thoe two Things had to stop.
Then I said to the cat,
"Now you do as I say.
You pack up those Things
And you take them away!"

"Oh dear!" said the cat,
"You did not like our game...
Oh dear.

What a shame!
What a shame!
What a shame!"

Then he shut up the Things
In the box with the hook.
And the cat went away
With a sad kind of look.

"That is good," said the fish.
"He has gone away. Yes.
But your mother will come.
She will find this big mess!
And this mess is so big
And so deep and so tall,
We ca not pick it up.
There is no way at all!"

And THEN!
Who was back in the house?
Why, the cat!
"Have no fear of this mess,"
Said the Cat in the Hat.
"I always pick up all my playthings
And so...
I will show you another
Good trick that I know!"

Then we saw him pick up
All the things that were down.
He picked up the cake,
And the rake, and the gown,
And the milk, and the strings,
And the books, and the dish,
And the fan, and the cup,
And the ship, and the fish.
And he put them away.
Then he said, "That is that."
And then he was gone
With a tip of his hat.

Then our mother came in
And she said to us two,
"Did you have any fun?
Tell me. What did you do?"

And Sally and I did not know
What to say.
Should we tell her
The things that went on there that day?

Should we tell her about it?
Now, what SHOULD we do?
Well...
What would YOU do
If your mother asked YOU?


"""

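# Replace every run of characters other than letters, digits, and spaces
# (punctuation, quotes, newlines) with a single space
corpus_filtered = re.sub('[^A-Za-z0-9 ]+', " ", corpus)
# Lowercase and split on whitespace; `corpus` is now a list of tokens
corpus = corpus_filtered.lower().split()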
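
To see what the cleaning step above does to individual lines, here is a small sketch that applies the same regular expression to two lines taken from the corpus. The `tokenize` helper is a name introduced for this example only.

```python
import re

def tokenize(text):
    """Replace runs of non-alphanumeric, non-space characters with a space, lowercase, split."""
    return re.sub('[^A-Za-z0-9 ]+', ' ', text).lower().split()

sample = 'UP-UP-UP with a fish!"\nWe saw Mother\'s new gown!'
print(tokenize(sample))
```

Note two side effects: hyphenated words are split into their parts, and the apostrophe in "Mother's" leaves a stray `s` token, so `s` ends up as its own vocabulary entry.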

In [52]:
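# Create vocab (sorted so that word indices are reproducible across runs;
# iteration order of a set of strings can change between Python sessions)
vocab = sorted(set(corpus))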
vocab_size = len(vocab)
print(f'vocab size: {vocab_size}')

# To convert from word to embedding index
word_to_ix = {word:ix for ix, word in enumerate(vocab)}
ix_to_word = {ix:word for ix, word in enumerate(vocab)}

# Create dataset by taking four words on either side of a target word
context_length = 4

Xs = []
ys = []
for i in range(context_length, len(corpus) - context_length):
    context = corpus[i - context_length: i] + corpus[i + 1 : i + context_length + 1]
    target = corpus[i]
    Xs.append(context)
    ys.append(target)

Xs_t = [torch.tensor([word_to_ix[w] for w in x]) for x in Xs]
ys_t = [torch.tensor([word_to_ix[y]]) for y in ys]

vocab size: 242
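
The windowing loop above can be checked on a short token list. This sketch uses the same slicing with a smaller window; `make_cbow_pairs` is a helper name introduced for the example.

```python
def make_cbow_pairs(tokens, context_length):
    """Pair each interior token with the context_length tokens on either side of it."""
    pairs = []
    for i in range(context_length, len(tokens) - context_length):
        context = tokens[i - context_length:i] + tokens[i + 1:i + context_length + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the sun did not shine it was".split()
for context, target in make_cbow_pairs(tokens, context_length=2):
    print(context, "->", target)
```

Only positions with a full window on both sides are used, so a corpus of T tokens yields T - 2 * context_length examples. The first and last `context_length` tokens never appear as targets.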


In [53]:
# Quick dataset class for dataloading
class CBOW_Dataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Returns (target, context): label first, features second
        return self.Y[idx], self.X[idx]

In [54]:
# CBOW model proper. As in the paper, there is no non-linear hidden layer.
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        self.softmax = nn.LogSoftmax(dim = -1)

    def forward(self, inputs):
        # inputs: (batch, 2 * context_length) word indices.
        # Context vectors are summed here; the paper averages them.
        embeds = self.embeddings(inputs).sum(dim=1)
        out = self.linear(embeds)
        return self.softmax(out)

    def get_word_emdedding(self, word):
        word = torch.tensor(word_to_ix[word])
        return self.embeddings(word).view(1,-1)

In [55]:
# Create model instance, loss, optimizer, dataset, and dataloader
model = CBOW(vocab_size, 64)
loss_function = nn.NLLLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.1)

dataset = CBOW_Dataset(Xs_t, ys_t)
dataloader = DataLoader(dataset, batch_size=256, shuffle=True)

# training loop
for epoch in range(1000):
    epoch_loss = 0.0
    iters = 0
    correct = 0

    for labels, features in dataloader:
        labels = labels.squeeze(1)  # (batch, 1) -> (batch,)
        optim.zero_grad()
        y_pred = model(features)
        loss = loss_function(y_pred, labels)
        loss.backward()
        optim.step()
        # .item() converts to Python numbers so no autograd graph is kept alive
        correct += (torch.argmax(y_pred, dim=1) == labels).sum().item()
        epoch_loss += loss.item()
        iters += 1

    if epoch % 50 == 0:
        print(f'Epoch: {epoch}, Loss {epoch_loss/iters}, Accuracy: {correct/len(Xs_t)}, Incorrect classifications: {(len(Xs_t) - correct)}/{len(Xs_t)}')


Epoch: 0, Loss 6.716022491455078, Accuracy: 0.012907191179692745, Incorrect classifications: 1606/1627
Epoch: 50, Loss 1.8378186907087053, Accuracy: 0.563614010810852, Incorrect classifications: 710/1627
Epoch: 100, Loss 1.2621653420584542, Accuracy: 0.7277197241783142, Incorrect classifications: 443/1627
Epoch: 150, Loss 0.9797385079520089, Accuracy: 0.8014751076698303, Incorrect classifications: 323/1627
Epoch: 200, Loss 0.8073596954345703, Accuracy: 0.8469575643539429, Incorrect classifications: 249/1627
Epoch: 250, Loss 0.6417697497776577, Accuracy: 0.882606029510498, Incorrect classifications: 191/1627
Epoch: 300, Loss 0.5369910853249686, Accuracy: 0.9108788967132568, Incorrect classifications: 145/1627
Epoch: 350, Loss 0.4448422704424177, Accuracy: 0.9397664666175842, Incorrect classifications: 98/1627
Epoch: 400, Loss 0.3771602766854422, Accuracy: 0.9508297443389893, Incorrect classifications: 80/1627
Epoch: 450, Loss 0.30984534536089214, Accuracy: 0.9674246907234192, Incorrect 
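
To help read the loss values, here is a minimal pure-Python sketch of what the `LogSoftmax` plus `NLLLoss` combination computes for a single example: the negative log-probability assigned to the correct word. The scores are made-up numbers. In the training loop the reported value is this quantity averaged over each batch and then over the batches in an epoch.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)  # subtract the max before exponentiating
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, target_index):
    """Negative log-likelihood of the target class."""
    return -log_probs[target_index]

scores = [2.0, 0.5, -1.0]  # hypothetical raw scores for a 3-word vocabulary
log_probs = log_softmax(scores)

print(sum(math.exp(lp) for lp in log_probs))  # probabilities sum to ~1
print(nll(log_probs, 0), nll(log_probs, 2))   # likely target vs unlikely target

# Loss of a uniform guess over a 242-word vocabulary, for reference.
print(math.log(242))
```

A uniform guess over the 242 words would score roughly 5.49, so the first-epoch loss above that level reflects confident wrong predictions from the random initialization, and values well below it indicate the model has learned the contexts.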

In [56]:
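# Test the model on a sentence adapted from the text. The second "bump" of the
# original line is dropped, so this exact context is not in the training set.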
test = 'and then something went bump how that made us'
test = test.lower().split()
context = test[:4] + test[5:]
target = test[4]

test_vec = torch.tensor([word_to_ix[w] for w in context]).unsqueeze(0)
with torch.no_grad():  # inference only, no gradients needed
    log_probs = model(test_vec)
pred = ix_to_word[torch.argmax(log_probs).item()]

print(f'Test context: {context}')
print(f'Test target: {target}')
print(f'Test prediction: {pred}')

Test context: ['and', 'then', 'something', 'went', 'how', 'that', 'made', 'us']
Test target: bump
Test prediction: bump
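
A single argmax hides how confident the model is. As a sketch, the helper below ranks the top candidates from a vector of log-probabilities. The four-word vocabulary and the probabilities are hypothetical stand-ins, not output from the trained model, and `top_k` is a name introduced for this example.

```python
import math

def top_k(log_probs, ix_to_word, k=3):
    """Return the k most likely words with their probabilities, best first."""
    ranked = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)
    return [(ix_to_word[i], math.exp(log_probs[i])) for i in ranked[:k]]

# Hypothetical vocabulary and log-probabilities for illustration.
toy_ix_to_word = {0: "bump", 1: "jump", 2: "cat", 3: "hat"}
toy_log_probs = [math.log(p) for p in (0.7, 0.2, 0.06, 0.04)]

for word, prob in top_k(toy_log_probs, toy_ix_to_word):
    print(f"{word}: {prob:.2f}")
```

With the notebook's model, the same helper could be fed `log_probs.squeeze(0).tolist()` and the real `ix_to_word` mapping.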
