In this project, we use certain segments of code from the Udacity Deep Learning Specialization to estimate word embeddings with the Word2Vec algorithm. This is permissible under the terms of the MIT License associated with the original source code. 

# Skip-gram Word2Vec

---
## Word embeddings

When you're dealing with words in text, you end up with tens of thousands of word classes to analyze; one for each word in a vocabulary. Trying to one-hot encode these words is massively inefficient because most values in a one-hot vector will be set to zero. So, the matrix multiplication that happens in between a one-hot input vector and a first, hidden layer will result in mostly zero-valued hidden outputs.

To solve this problem and greatly increase the efficiency of our networks, we use what are called **embeddings**. An embedding is just a fully connected layer like you've seen before. We call this layer the embedding layer and its weights the embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because multiplying a one-hot encoded vector by a matrix returns the row of the matrix corresponding to the index of the "on" input unit.

<img src='lookup_matrix.png' width=50%>

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.
 
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix.
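
The equivalence between the one-hot multiplication and a row lookup can be checked numerically. The sketch below uses NumPy with a made-up vocabulary of 5 words and 3 hidden units; the sizes and the names `weights` and `word_idx` are illustrative assumptions, not values from this project.

```python
import numpy as np

# Hypothetical sizes for illustration only: 5 words, 3 hidden units
n_vocab, n_embed = 5, 3
rng = np.random.default_rng(0)
weights = rng.normal(size=(n_vocab, n_embed))  # the embedding weight matrix

word_idx = 2                       # integer code of some word
one_hot = np.zeros(n_vocab)
one_hot[word_idx] = 1.0            # one-hot encoding of that word

via_matmul = one_hot @ weights     # full matrix multiplication
via_lookup = weights[word_idx]     # embedding lookup: just take the row

print(np.allclose(via_matmul, via_lookup))  # prints True
```

The lookup touches a single row, whereas the multiplication touches every entry of the matrix, which is why the shortcut matters for large vocabularies.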

Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.

---
## Word2Vec

The Word2Vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words.

<img src="assets/context_drink.png" width=40%>

Words that show up in similar **contexts**, such as "coffee", "tea", and "water" will have vectors near each other. Different words will be further away from one another, and relationships can be represented by distance in vector space.


There are two architectures for implementing Word2Vec:
* CBOW (Continuous Bag-Of-Words)
* Skip-gram

<img src="assets/word2vec_architectures.png" width=60%>

In this implementation, we'll be using the **skip-gram architecture** with **negative sampling**: skip-gram typically performs better than CBOW, and negative sampling makes training faster. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.
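
To make the prediction task concrete, here is a minimal sketch of how (input word, context word) training pairs could be formed from a token list. It assumes a fixed window size for simplicity (the notebook itself samples the window size randomly further down), and the helper name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs using a fixed window size."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)                 # clip the window at the left edge
        stop = min(len(tokens), i + window + 1)    # clip the window at the right edge
        for j in range(start, stop):
            if j != i:                             # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["prices", "to", "japan", "remain", "flat"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each input word appears once per context word, which is the same layout the batch generator in this notebook produces for `x` and `y`.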

---
## Loading Data

First, we load a csv file `articles_daily_ts.csv`, where each news article is assigned a day on which market participants can react to it. The file is imported to the data frame `data`.

In [1]:
import pandas as pd
data = pd.read_csv('articles_daily_ts.csv', encoding = 'utf-8-sig', sep=';')

data_texts = list(data.texts)

Next, we pre-process the texts (convert all upper-case to lower-case letters, remove punctuation, remove non-alphabetic characters, remove multiple whitespaces, and add collocations) and split them into tokens.

In [2]:
from datetime import datetime
startTime = datetime.now()
# The class 'tokens' pre-processes the texts and splits them into tokens.
import tokens

import multiprocessing as mp
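# Leave 8 cores free, but always use at least one worker
# (cpu_count()-8 would be zero or negative on machines with 8 or fewer cores).
NUM_CORE = max(1, mp.cpu_count() - 8)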

# The function necessary for multiprocessing in Jupyter Notebooks.
import worker_preprocess

if __name__ == "__main__":
    # Create the document objects
    list_of_objects = [tokens.tokens(text) for text in data_texts]
    
    pool = mp.Pool(NUM_CORE)
    # Apply the 'preprocess' method and output a list of tokens for each text.
    list_of_tokens = pool.map(worker_preprocess.worker_preprocess, ((obj) for obj in list_of_objects))
    pool.close()
    pool.join()

print(datetime.now() - startTime)

0:00:05.651768


## Pre-processing

* Remove all words that show up five or *fewer* times in the dataset. This will greatly reduce issues due to noise in the data and improve the quality of the vector representations. 

In [3]:
import itertools
from collections import Counter

# Combine all words from all texts into one list 'words'
words = list(itertools.chain.from_iterable(list_of_tokens))

word_counts = Counter(words)
# Remove all words with 5 or fewer occurrences
words = [word for word in words if word_counts[word] > 5]

In [4]:
print(words[:100])

['chilean', 'salmon_prices', 'see', 'a', 'lift', 'in', 'week', 'prices', 'to', 'both', 'the', 'united', 'states', 'and', 'brazil', 'climb', 'in', 'the', 'opening', 'week', 'of', 'the', 'year', 'prices', 'to', 'japan', 'remain', 'flat', 'chilean', 'salmon_prices', 'heading', 'into', 'the', 'united', 'states', 'nudged', 'up', 'by', 'six', 'cents', 'in', 'the', 'first', 'week', 'of', 'reaching', 'per', 'pound', 'for', 'average', 'd', 'trim', 'atlantic', 'salmon', 'at', 'wholesale', 'prices', 'for', 'headon', 'atlantic', 'salmon', 'from', 'chile', 'size', 'at', 'wholesale', 'in', 'brazil', 'also', 'went', 'up', 'during', 'the', 'week', 'to', 'hit', 'per', 'kilo', 'meanwhile', 'the', 'average_prices', 'for', 'frozen', 'headed', 'and', 'gutted', 'hg', 'trout', 'heading', 'to', 'japan', 'sizes', 'and', 'for', 'frozen', 'hg', 'coho', 'remained', 'unchanged', 'the']


In [5]:
# print some stats about this word data
print("Total words in text: {}".format(len(words)))
print("Unique words: {}".format(len(set(words)))) # `set` removes any duplicate words

Total words in text: 1776632
Unique words: 11594


### Dictionaries

Next, we create two dictionaries to convert words to integers and back again (integers to words). This is done with the function `create_lookup_tables` in the `utils.py` file, which takes a list of words and returns the two dictionaries.
>* The integers are assigned in descending frequency order, so the most frequent word ("the") is given the integer 0 and the next most frequent is 1, and so on. 

Once we have our dictionaries, the words are converted to integers and stored in the list `int_words`.
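
Since `utils.py` is not shown here, the following is only a sketch of what such a function might look like, assuming it sorts the vocabulary by descending frequency as described above. The name `create_lookup_tables_sketch` and the toy word list are hypothetical.

```python
from collections import Counter

def create_lookup_tables_sketch(words):
    """Map words to integers in descending frequency order, and back."""
    counts = Counter(words)
    # Most frequent word first; ties keep first-seen order because sorted() is stable
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = {i: w for i, w in enumerate(sorted_vocab)}
    vocab_to_int = {w: i for i, w in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

toy_words = ["the", "salmon", "the", "prices", "salmon", "the"]
toy_vocab_to_int, toy_int_to_vocab = create_lookup_tables_sketch(toy_words)
print(toy_vocab_to_int)  # {'the': 0, 'salmon': 1, 'prices': 2}
```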

In [6]:
import utils

vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]

print(int_words[:30])

[98, 153, 177, 5, 1904, 2, 64, 39, 1, 131, 0, 134, 146, 4, 719, 2158, 2, 0, 1382, 64, 3, 0, 35, 39, 1, 607, 476, 1110, 98, 153]


## Subsampling

Words that show up often such as "the", "of", and "for" don't provide much context to the nearby words. If we discard some of them, we can remove some of the noise from our data and in return get faster training and better representations. Mikolov et al. call this process subsampling. For each word $w_i$ in the training set, we'll discard it with probability given by 

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

Next, we implement subsampling for the words in `int_words`. That is, we go through `int_words` and discard each word with the probability $P(w_i)$ shown above. Note that $P(w_i)$ is the probability that a word is discarded, not the probability that it is kept. We assign the subsampled data to `train_words`.
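
To get a feel for the formula, the sketch below evaluates $P(w_i)$ for a few relative frequencies with $t = 10^{-5}$. The frequencies are hypothetical and chosen only for illustration; `discard_probability` is not part of the notebook.

```python
import math

def discard_probability(freq, threshold=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), the subsampling formula from the text."""
    return 1 - math.sqrt(threshold / freq)

# Hypothetical relative frequencies, for illustration only
examples = [("very frequent", 0.05), ("moderate", 1e-4), ("rare", 1e-6)]
for label, freq in examples:
    print(label, round(discard_probability(freq), 4))
```

A word with relative frequency 0.05 is discarded almost every time, while a word rarer than the threshold gets a negative value, which in practice means it is never discarded.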

In [7]:
from collections import Counter
import random
import numpy as np

threshold = 1e-5
word_counts = Counter(int_words)
#print(list(word_counts.items())[0])  # dictionary of int_words, how many times they appear

total_count = len(int_words)
freqs = {word: count/total_count for word, count in word_counts.items()}
p_drop = {word: 1 - np.sqrt(threshold/freqs[word]) for word in word_counts}
# discard some frequent words, according to the subsampling equation
# create a new list of words for training
train_words = [word for word in int_words if random.random() < (1 - p_drop[word])]

print(train_words[:30])

[131, 719, 1382, 5142, 4780, 1702, 343, 1016, 1356, 4, 1053, 1355, 2405, 82, 75, 1302, 38, 320, 186, 1733, 245, 299, 1107, 697, 383, 514, 1302, 23, 118, 65]


In [8]:
len(train_words)

336822

## Making batches

Now that our data is in good shape, we need to get it into the proper form to pass it into our network. With the skip-gram architecture, for each word in the text, we want to define a surrounding _context_ and grab all the words in a window around that word, with size $C$. 

From [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf): 

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $[ 1: C ]$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."

We implement a function `get_target` that receives a list of words, an index, and a window size, then returns a list of words in the window around the index. We use the algorithm described above, where we choose a random number of words from the window to use as context words.

Say, we have an input and we're interested in the idx=2 token, `741`: 
```
[5233, 58, 741, 10571, 27349, 0, 15067, 58112, 3580, 58, 10712]
```

For `R=2`, `get_target` will return a list of four values:
```
[5233, 58, 10571, 27349]
```

In [9]:
def get_target(words, idx, window_size=10):
    ''' Get a list of words in a window around an index. '''
    
    R = np.random.randint(1, window_size+1)
    start = idx - R if (idx - R) > 0 else 0
    stop = idx + R
    target_words = words[start:idx] + words[idx+1:stop+1]
    
    return list(target_words)

In [10]:
# run this cell multiple times to check for random window selection
int_text = [i for i in range(10)]
print('Input: ', int_text)
idx=5 # word index of interest

target = get_target(int_text, idx=idx, window_size=5)
print('Target: ', target)  # we should get some indices around the idx

Input:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Target:  [3, 4, 6, 7]


### Generating Batches 

Here's a generator function that returns batches of input and target data for our model, using the `get_target` function from above. The idea is that it grabs `batch_size` words from a words list. Then for each of those batches, it gets the target words in a window.

In [11]:
def get_batches(words, batch_size, window_size=10):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y  

In [12]:
int_text = [i for i in range(20)]
x,y = next(get_batches(int_text, batch_size=4, window_size=10))

print('x\n', x)
print('y\n', y)

x
 [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y
 [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2]


---
## Validation

Here, we create a function that will help us observe our model as it learns. We're going to choose a few common words and a few uncommon words. Then, we'll print out the closest words to them using the cosine similarity: 

<img src="assets/two_vectors.png" width=30%>

$$
\mathrm{similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|}
$$


We can encode the validation words as vectors $\vec{a}$ using the embedding table, then calculate the similarity with each word vector $\vec{b}$ in the embedding table. With the similarities, we can print out the validation words and words in our embedding table semantically similar to those words. It's a nice way to check that our embedding table is grouping together words with similar semantic meanings.
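
As a small worked example of the formula, the sketch below computes the cosine similarity for three hand-picked vector pairs in plain Python. The vectors and the helper `cosine` are illustrative assumptions and are independent of the PyTorch implementation used in this notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity: (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # a scaled copy of v
orthogonal = [0.0, 0.0, 3.0]       # shares no component with v
opposite = [-1.0, -2.0, 0.0]       # points the other way

print(cosine(v, same_direction), cosine(v, orthogonal), cosine(v, opposite))
```

The similarity depends only on direction, not on vector length: it is close to 1 for the scaled copy, 0 for the orthogonal vector, and close to -1 for the opposite one.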

In [13]:
def cosine_similarity(embedding, valid_size=16, valid_window=100, device='cpu'):
    """ Returns the cosine similarity of validation words with words in the embedding matrix.
        Here, embedding should be a PyTorch embedding module.
    """
    
    # Here we're calculating the cosine similarity between some random words and 
    # our embedding vectors. With the similarities, we can look at what words are
    # close to our random words.
    
    # sim = (a . b) / |a||b|
    
    embed_vectors = embedding.weight
    
    # magnitude of embedding vectors, |b|
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)
    
    # pick N words from our ranges (0,window) and (1000,1000+window). lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000,1000+valid_window), valid_size//2))
    valid_examples = torch.LongTensor(valid_examples).to(device)
    
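    valid_vectors = embedding(valid_examples)
    # Note: we divide by |b| only, not by |a|. Each row is therefore scaled by the
    # positive constant |a|, which leaves the ranking of closest words per row unchanged.
    similarities = torch.mm(valid_vectors, embed_vectors.t())/magnitudes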
        
    return valid_examples, similarities

---
# SkipGram model

Here we define and train the SkipGram model. We'll define an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) and a final, softmax output layer.

An Embedding layer takes in a number of inputs, importantly:
* **num_embeddings** – the size of the dictionary of embeddings, or how many rows you'll want in the embedding weight matrix
* **embedding_dim** – the size of each embedding vector; the embedding dimension

Below is an approximate diagram of the general structure of our network.
<img src="assets/skip_gram_arch.png" width=60%>

* The input words are passed in as batches of input word tokens. 
* This will go into a hidden layer of linear units (our embedding layer). 
* Then, finally into a softmax output layer. 

We'll use the softmax layer to make a prediction about the context words by sampling, as usual.
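
The forward pass described above can be sketched in a few lines of NumPy: an embedding lookup followed by a softmax over the whole vocabulary. This is an illustration under assumed toy sizes, with hypothetical names (`in_embed`, `out_weights`); it is not the model trained in this notebook.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_vocab, n_embed = 6, 4                             # hypothetical toy sizes
in_embed = rng.normal(size=(n_vocab, n_embed))      # embedding (hidden) layer
out_weights = rng.normal(size=(n_embed, n_vocab))   # output layer weights

word_idx = 3
hidden = in_embed[word_idx]             # embedding lookup for the input word
probs = softmax(hidden @ out_weights)   # one probability per vocabulary word

print(probs.shape, probs.sum())
```

Note that the output has one entry per vocabulary word, so the cost of this step grows with the vocabulary size.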

---
## Negative Sampling

For every example we give the network, we train it using the output from the softmax layer. That means for each input, we're making very small changes to millions of weights even though we only have one true example. This makes training the network very inefficient. We can approximate the loss from the softmax layer by only updating a small subset of all the weights at once. We'll update the weights for the correct example, but only a small number of incorrect, or noise, examples. This is called ["negative sampling"](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). 

There are two modifications we need to make. First, since we're not taking the softmax output over all the words, we're really only concerned with one output word at a time. Similar to how we use an embedding table to map the input word to the hidden layer, we can now use another embedding table to map the hidden layer to the output word. Now we have two embedding layers, one for input words and one for output words. Secondly, we use a modified loss function where we only care about the true example and a small subset of noise examples.

$$
- \large \log{\sigma\left(u_{w_O}\hspace{0.001em}^\top v_{w_I}\right)} -
\sum_i^N \mathbb{E}_{w_i \sim P_n(w)}\log{\sigma\left(-u_{w_i}\hspace{0.001em}^\top v_{w_I}\right)}
$$

This is a little complicated so we'll go through it bit by bit. $u_{w_O}\hspace{0.001em}^\top$ is the embedding vector for our "output" target word (transposed, that's the $^\top$ symbol) and $v_{w_I}$ is the embedding vector for the "input" word. Then the first term 

$$\large \log{\sigma\left(u_{w_O}\hspace{0.001em}^\top v_{w_I}\right)}$$

says we take the log-sigmoid of the inner product of the output word vector and the input word vector. Now the second term, let's first look at 

$$\large \sum_i^N \mathbb{E}_{w_i \sim P_n(w)}$$ 

This means we're going to take a sum over words $w_i$ drawn from a noise distribution $w_i \sim P_n(w)$. The noise distribution is basically our vocabulary of words that aren't in the context of our input word. In effect, we can randomly sample words from our vocabulary to get these words. $P_n(w)$ is an arbitrary probability distribution though, which means we get to decide how to weight the words that we're sampling. This could be a uniform distribution, where we sample all words with equal probability. Or it could be according to the frequency that each word shows up in our text corpus, the unigram distribution $U(w)$. The authors found the best distribution to be $U(w)^{3/4}$, empirically. 

Finally, in 

$$\large \log{\sigma\left(-u_{w_i}\hspace{0.001em}^\top v_{w_I}\right)},$$ 

we take the log-sigmoid of the negated inner product of a noise vector with the input vector. 
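
For the noise distribution $P_n(w)$ discussed above, the effect of raising the unigram distribution to the power $3/4$ can be seen on a toy example. The four probabilities below are hypothetical.

```python
import numpy as np

# Hypothetical unigram distribution over a 4-word vocabulary
unigram = np.array([0.7, 0.2, 0.08, 0.02])

noise = unigram ** 0.75
noise = noise / noise.sum()   # renormalize so the probabilities sum to 1

print(np.round(noise, 3))
```

Compared with the plain unigram distribution, the exponent lowers the probability of the most frequent word and raises the probability of the rarest one, so rare words are drawn as noise samples more often than their raw frequency would suggest.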

<img src="assets/neg_sampling_loss.png" width=50%>

To give you an intuition for what we're doing here, remember that the sigmoid function returns a probability between 0 and 1. The first term in the loss pushes the probability that our network will predict the correct word $w_O$ towards 1. In the second term, since we are negating the sigmoid input, we're pushing the probabilities of the noise words towards 0.
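
This intuition can be checked with a small numerical sketch. It evaluates the loss for one input word from inner products that are simply made up, with the expectation replaced by a sum over a few sampled noise words; the helper names are hypothetical.

```python
import math

def log_sigmoid(x):
    """log(sigma(x)) for a scalar x."""
    return math.log(1.0 / (1.0 + math.exp(-x)))

def neg_sampling_loss(score_true, scores_noise):
    """Loss for one input word, given the inner products u^T v.

    score_true   : inner product with the true context word
    scores_noise : inner products with the sampled noise words
    """
    return -log_sigmoid(score_true) - sum(log_sigmoid(-s) for s in scores_noise)

# Hypothetical inner products, for illustration only
good = neg_sampling_loss(score_true=4.0, scores_noise=[-3.0, -2.5])
bad = neg_sampling_loss(score_true=-1.0, scores_noise=[2.0, 1.5])

print(good < bad)  # prints True
```

The loss is small when the true word has a large positive inner product and the noise words have negative ones, and large in the reverse case.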

In [14]:
import torch
from torch import nn
import torch.optim as optim

In [15]:
class SkipGramNeg(nn.Module):
    def __init__(self, n_vocab, n_embed, noise_dist=None):
        super().__init__()
        
        self.n_vocab = n_vocab
        self.n_embed = n_embed
        self.noise_dist = noise_dist
        
        # define embedding layers for input and output words
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        self.out_embed = nn.Embedding(n_vocab, n_embed)
        
        # Initialize embedding tables with uniform distribution
        # (this helps with convergence)
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)
        
    def forward_input(self, input_words):
        input_vectors = self.in_embed(input_words)
        return input_vectors
    
    def forward_output(self, output_words):
        output_vectors = self.out_embed(output_words)
        return output_vectors
    
    def forward_noise(self, batch_size, n_samples):
        """ Generate noise vectors with shape (batch_size, n_samples, n_embed)"""
        if self.noise_dist is None:
            # Sample words uniformly
            noise_dist = torch.ones(self.n_vocab)
        else:
            noise_dist = self.noise_dist
            
        # Sample words from our noise distribution
        noise_words = torch.multinomial(noise_dist,
                                        batch_size * n_samples,
                                        replacement=True)
        
        # Use the module's own device rather than relying on a global `model` variable
        device = self.out_embed.weight.device
        noise_words = noise_words.to(device)
        
        noise_vectors = self.out_embed(noise_words).view(batch_size, n_samples, self.n_embed)
        
        return noise_vectors

In [16]:
class NegativeSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input_vectors, output_vectors, noise_vectors):
        
        batch_size, embed_size = input_vectors.shape
        
        # Input vectors should be a batch of column vectors
        input_vectors = input_vectors.view(batch_size, embed_size, 1)
        
        # Output vectors should be a batch of row vectors
        output_vectors = output_vectors.view(batch_size, 1, embed_size)
        
        # bmm = batch matrix multiplication
        # correct log-sigmoid loss
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log()
        out_loss = out_loss.squeeze()
        
        # incorrect log-sigmoid loss
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        noise_loss = noise_loss.squeeze().sum(1)  # sum the losses over the sample of noise vectors

        # negate and sum correct and noisy log-sigmoid losses
        # return average batch loss
        return -(out_loss + noise_loss).mean()

### Training

Below is our training loop. We train the model on a GPU (NVIDIA GeForce RTX 2080 Ti).

In [69]:
from datetime import datetime
startTime = datetime.now() # track time

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Get our noise distribution
# Using word frequencies calculated earlier in the notebook
word_freqs = np.array(sorted(freqs.values(), reverse=True))
unigram_dist = word_freqs/word_freqs.sum()
noise_dist = torch.from_numpy(unigram_dist**(0.75)/np.sum(unigram_dist**(0.75)))

# instantiating the model
embedding_dim = 256
model = SkipGramNeg(len(vocab_to_int), embedding_dim, noise_dist=noise_dist).to(device)

# using the loss that we defined
criterion = NegativeSamplingLoss() 
optimizer = optim.Adam(model.parameters(), lr=0.0001)

print_every = 10000
steps = 0
epochs = 500

# train for some number of epochs
for e in range(epochs):
    
    # get our input, target batches
    for input_words, target_words in get_batches(train_words, 64):
        steps += 1
        inputs, targets = torch.LongTensor(input_words), torch.LongTensor(target_words)
        inputs, targets = inputs.to(device), targets.to(device)
        
        # input, output, and noise vectors
        input_vectors = model.forward_input(inputs)
        output_vectors = model.forward_output(targets)
        noise_vectors = model.forward_noise(inputs.shape[0], 5)

        # negative sampling loss
        loss = criterion(input_vectors, output_vectors, noise_vectors)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # loss stats
        if steps % print_every == 0:
            print("Epoch: {}/{}".format(e+1, epochs))
            print("Loss: ", loss.item()) # avg batch loss at this point in training
            valid_examples, valid_similarities = cosine_similarity(model.in_embed, device=device)
            _, closest_idxs = valid_similarities.topk(6)

            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))
            print("...\n")

Epoch: 2/500
Loss:  11.358478546142578
kilo | plant, gronbrekk, lent, throughput, trigger
business | diagnosis, prospects, peerreviewed, named, fjord
their | antibiotic, ajinomoto, offload, ssf, huang
years | alexandra, contract, posts, mandatory, regrets
than | fish, sea, evolving, reform, andre
industry | eventful, consultancy, ecuador, goes, concentrating
market | brim, catastrophic, measured, matjes, kingfish
during | referendum, salaks, shoot, skyrocketing, recognizing
factors | pretty, recalls, sank, pignatel, significantly
story | suspending, reported, norwegianbacked, restrictions, sjur
license | previous, prawns, depend, tourism, san
makes | cents, emergency, maloy, supposed, williksen
russias | shipment, soften, between, ties, lox
cage | became, injury, publication, warn, epicenter
soon | supplements, tine, subcontractors, laborintensive, apicda
declined | vikings, browns, announced, helgason, infestation
...

Epoch: 4/500
Loss:  9.527480125427246
ceo | seafood_companies, map

Epoch: 21/500
Loss:  3.356534004211426
have | the, more, shows, be, received
are | to, has, the, in, and
has | are, the, to, on, around
companies | primarily, second_quarter, domestic, of, keep
first | the, to, allstar, employ, ratheb
for | the, and, of, with, in
not | and, the, long, start, myself
that | and, to, the, its, implemented
approval | indicator, evelyn, garden, grocer, around
institute | sponsored, operating_profits, featured, board, floridabased
theres | stayhome, exceeding, nzd, criticized, concessions
story | reported, rolled, phenomenon, driver, withdrawn
went | alternative, assumed, npfmc, changes, licenses
begin | injury, referring, flatfish, allocated, inlet
options | farmraised, upscale, backdrop, sotenas, generation
update | funds, pilar, on, kristoffer, rivers
...

Epoch: 23/500
Loss:  3.3433878421783447
but | is, salmon, come, helping, high_prices
new | from, is, chile, at, potentially
about | on, egil, chilean, burns, chang
price | marketable, average, replaced,

Epoch: 42/500
Loss:  2.6636464595794678
percent | compared, per, up, fell, year
week | prices, kilo, per, ago, average
in | of, and, to, the, a
so | we, industry, are, said, challenging
there | that, we, for, it, innovation
was | a, the, of, percent, earlier
intrafish | are, buyer, at, there, salmon
million | first_quarter, same_period, earnings, percent, sales
staff | frequent, exchange, jordan, rational, reaped
meat | quota, food, watershed, their, is
concerns | played, workers, road, threatened, offloads
called | delivery, delegation, stressed, online, coop
meeting | expanded, ancillary, logistical, ceos, superintendency
sockeye | game, season, kongbased, alleging, walker
french | utilize, samherjis, manages, integrated, judgement
soon | opening, subcontractors, benchmark, john, play
...

Epoch: 44/500
Loss:  2.597674608230591
are | to, the, said, intrafish, we
percent | compared, per, up, fell, year
group | in, million, also, the, same_period
so | we, industry, are, said, it
their 

Epoch: 63/500
Loss:  1.7224234342575073
down | percent, fell, kilo, drop, compared
per | kilo, prices, week, compared, average
companies | of, a, one, is, it
new | and, is, including, ceo, in
percent | compared, fell, per, year, down
by | to, of, the, a, and
also | to, in, the, on, as
with | a, of, the, to, on
course | consistent, theme, asda, series, event
uncertainty | gill, go, difficult, salmon_price, there
wrote | an, farmed_salmon_prices, with, throughout, permissions
helped | down, tax, arm, quarter, earnings_before_interest
domestic | trout, buoyed, actually, finland, olafsson
authority | inspect, detected, lifting, salmon_farms, hoaas
french | france, boulognesurmer, retail, manages, smoked
communities | fishery, workers, concerns, elected, politicians
...

Epoch: 65/500
Loss:  2.0380609035491943
about | but, are, that, your, have
were | the, was, of, during, to
it | to, the, is, for, are
some | told, but, are, argument, have
salmon | the, a, of, to, in
salmon_farming | mowi, 

Epoch: 84/500
Loss:  2.214724540710449
that | there, they, is, said, not
sales | increased, earnings, ebitda, same_period, percent
now | for, we, it, is, will
one | industry, the, companies, is, that
week | kilo, prices, nasdaq, kilo_on_average, popular_size
and | in, a, of, the, to
from | the, of, salmon, a, percent
so | we, they, industry, said, can
buyer | exporter, sell, intrafish, next_week, prices
beyond | always, food, partnered, said, protein
falling | week, kilo, sharply, salmon_prices, d
greater | year, months, too, challenge, poised
cage | concept, municipality, cages, hostlund, offshore
concerns | communities, false, workers, green, meeting
practices | standard, standards, sustainability, sustainable, gsi
problem | sailing, willing, fad, cause, found
...

Epoch: 86/500
Loss:  2.252810001373291
ceo | company, as, companys, new, and
first | landbased, facility, of, will, on
of | the, in, to, and, a
operations | group, companys, loss, first_quarter, in
fish | are, in, at, anot

Epoch: 105/500
Loss:  2.08170223236084
other | the, while, be, which, to
market | prices, chinese, china, per, into
landbased | project, construction, projects, facility, first
year | percent, same_period, record, higher, amortization
mowi | mowis, salmon_farming, canada, scotland, by
this | to, of, be, the, is
can | to, it, said, industry, as
production | to, for, the, biomass, in
options | heavyweight, combination, lacks, cite, kits
lawsuit | filed, plaintiffs, alleges, alleging, plaintiff
agency | russia, rosrybolovstvo, placed, import, dump
followed | prototype, stem, monday, giant, feb
ssc | bakkafrost, sscs, salmon_company, craig, anderson
meeting | former, votes, elected, concerns, proposal
kilo_on_average | nasdaq, week, weekonweek, popular_size, average_prices
holds | takeover, salmon_farmer, competent, quota, transaction
...

Epoch: 107/500
Loss:  2.090384006500244
is | a, to, the, for, of
their | of, public, it, need, and
per | kilo, prices, average, compared, average_price


Epoch: 126/500
Loss:  2.039677381515503
while | the, percent, strong, higher, other
is | a, to, the, of, for
us | united, with, continue, weekly, seafood
their | of, public, need, by, and
to | the, of, on, a, is
companys | company, ceo, percent, from, operations
according | figures, week, norwegian, percent, than
also | to, as, the, in, an
factors | performance, differential, healthier, vaccination, methodology
brands | brand, smoked_salmon, online, shopping, right
developed | individually, rubicon, posed, perspective, can
french | france, boulognesurmer, gras, frances, saumon
helped | earnings_before_interest, led, jonah, arm, due
fire | electrical, blaze, cause, wednesdays, sande
domestic | trout, lithuania, russian, imports, cod
raising | asset, flexibility, capital, euronext, shares
...

Epoch: 128/500
Loss:  2.0708529949188232
by | which, to, and, of, percent
up | the, large, per, was, to
years | marine, a, with, research, as
is | a, to, the, for, of
which | by, to, the, and, in
w

Epoch: 147/500
Loss:  2.067282199859619
it | the, to, is, are, that
landbased | projects, project, evolution, construction, facility
percent | fell, compared, down, per, increased
first | landbased, facility, its, site, will
sales | increased, same_period, percent, percent_increase, earnings
years | a, marine, with, as, research
mowi | mowis, salmon_farming, scotland, canada, salmon_farmer
that | there, not, is, they, said
yearonyear | depreciation, taxes, earnings_before_interest, compared, amortization
ssc | sscs, bakkafrost, craig, salmon_company, anderson
french | france, boulognesurmer, gras, novo, saumon
begin | construction, facility, bangor, november, vast
growout | facility, amar, design, operation, landbased
brands | brand, smoked_salmon, online, shopping, right
communities | rural, fishery, indigenous, workers, community
metric_tons | total, center, bulk, landings, metric_ton
...

Epoch: 149/500
Loss:  2.1893811225891113
norwegian | the, last_week, farmed_salmon_prices, acco

Epoch: 166/500
Loss:  2.1786913871765137
were | the, higher, a, in, increased
than | per, higher, kilograms, nok, by
by | which, and, of, to, percent
was | a, the, up, to, last_year
that | is, not, there, they, said
and | a, in, the, of, to
first | landbased, facility, launched, site, phase
us | united, with, weekly, continue, brazil
largely | harvest_volumes, quarter, fell, severely, costs
beyond | meat, alternative, food, always, theres
holds | takeover, directors, quota, shares, competent
known | government, pathogen, scientists, daily, name
lawsuit | filed, alleges, plaintiffs, alleging, plaintiff
multiexport | lyon, andres, multiexports, arturo, multi
buyer | exporter, next_week, quiet, holidays, sell
greater | year, betting, errors, challenge, months
...

Epoch: 168/500
Loss:  1.974951148033142
it | to, the, is, that, we
fish | another, probably, of, farmed_salmon, prices
companys | company, percent, ceo, from, operations
increase | increased, increasing, operating_profit, volume

Epoch: 187/500
Loss:  1.7626034021377563
operations | quarter, operational, first_quarter, earnings, group
facility | construction, expansion, plans, recirculating_aquaculture_system, currently
group | subsidiary, announced, which, a, groups
now | is, a, will, that, to
some | buyers, level, argument, said, sources
were | the, a, in, higher, increased
not | that, to, must, know, have
by | which, and, of, percent, to
domestic | lithuania, cod, russian, trout, reidie
worked | hansen, team, officer, marketing, solutions
approval | granted, controversial, aquafarms, joel, objection
eight | balmaqueen, applying, approved, licenses, previously
sernapesca | surveillance, lagos, gallardo, sma, service
kingdom | canner, united, salmon_products, eu, europe
ownership | nrs, cooperation, shares, cdq, malek
outlook | bounce, quarter, segment, bullish, reflects
...

Epoch: 189/500
Loss:  2.041111469268799
out | the, to, it, is, not
have | said, are, we, know, is
which | by, to, group, and, the
more |

Epoch: 206/500
Loss:  1.5426603555679321
other | the, to, if, one, said
some | buyers, said, level, argument, can
while | percent, the, was, higher, groups
ceo | companys, and, company, regin, executive
facility | construction, expansion, plans, recirculating_aquaculture_system, currently
production | for, to, continuous, biomass, salmon_production
he | said, i, the, has, that
companys | company, percent, ceo, from, of
longer | term, gases, effectiveness, srs, swift
jbs | forrest, huon, twiggy, tattarang, unsolicited
experienced | caused, losing, dieoff, severe, million
russias | rosrybolovstvo, russian, pink, rub, allrussian
average_prices | kilo_on_average, popular_size, week, kilo, nasdaq
approval | granted, controversial, intertidal, joel, objection
begin | construction, facility, bangor, aquafarms, vast
practices | sustainability, standards, indicators, environmental, gsi
...

Epoch: 208/500
Loss:  1.9213230609893799
over | up, steadily, as, salmon, down
around | is, of, kilos, fe

Epoch: 225/500
Loss:  1.9720542430877686
out | the, it, is, to, he
reported | taxes, same_period, slight, compared, ebit
mowi | mowis, salmon_farming, canada, ivan, scotland
aquaculture | salmon_farming, of, in, a, equipment
a | the, with, is, and, in
more | are, than, on, frozen, demand
it | is, to, the, we, that
atlantic | sapphire, sapphires, company, runar, bjornvegard
managed | sluggish, adoption, counts, improvement, reference
retailer | spencer, castle, pennsylvaniabased, customer, waitrose
license | application, manufactured, permits, licenses, facility
risks | fairr, approach, various, continuity, agrees
course | going, consistent, fiorillo, others, met
cash | flow, debt, equivalents, share, credit
outlook | bounce, quarter, bullish, rebound, segment
factors | differential, reset, performance, fairr, tightness
...

Epoch: 227/500
Loss:  1.554993748664856
they | is, that, i, why, know
during | percent, increased, due, came, year
with | a, for, the, and, as
will | of, and, in, a

Epoch: 246/500
Loss:  1.4247580766677856
some | level, buyers, said, can, argument
into | market, to, a, trim, salmon
other | to, the, if, one, said
while | percent, was, the, groups, norwegian
have | said, we, are, so, is
it | is, the, to, we, that
million | first_quarter, amounted, depreciation, last_year, posted
also | to, as, an, in, that
shipping | container, containers, fredriksen, freight, vladimir
outlook | rebound, bounce, bullish, quarter, segment
brands | brand, smoked_salmon, shelves, mornings, chains
went | datasalmon, week, last_week, popular_sizes, reversing
french | france, boulognesurmer, gras, intermarche, novo
domestic | lithuania, cod, russian, frozenatsea, reidie
tassal | tassals, advocate, abc, aud, ryan
staff | workforce, trained, travel, staggered, governors
...

Epoch: 248/500
Loss:  2.141298294067383
which | by, to, group, and, the
out | the, it, is, he, at
were | a, in, the, increased, per
at | are, on, per, the, is
and | a, of, in, the, by
more | than, are, 

Epoch: 265/500
Loss:  2.1779944896698
during | percent, increased, due, same_quarter, came
with | a, for, the, and, in
also | to, an, as, said, that
after | company, puts, from, average_price, according
there | that, are, told, think, i
an | in, a, of, the, to
told | intrafish, are, that, weve, difficult
per | kilo, average_price, prices, compared, average
short | lagging, term, salmon_price, perfect, cheuvreux
brands | brand, smoked_salmon, shelves, mornings, chains
license | application, manufactured, permits, licenses, facility
probably | sizes, exporter, prices, fall_in_prices, next_week
meat | pork, brazilian, protein, outlets, twiggy
risks | fairr, approach, various, continuity, potential
tassal | tassals, advocate, abc, aud, ryan
russias | rosrybolovstvo, russian, allrussian, wild_salmon_harvest, moscow
...

Epoch: 267/500
Loss:  1.3850011825561523
be | in, this, an, have, do
percent | fell, increased, down, year, compared
norway | nrs, royal, salmar, norwegian, downgrading
over

Epoch: 286/500
Loss:  1.79985511302948
is | a, to, for, the, that
increase | increased, operating_profit, increasing, revenue, harvest_volumes
with | a, for, the, and, in
so | else, said, have, we, they
can | dont, that, traditional, we, it
from | the, percent, of, year, company
facility | construction, expansion, plans, recirculating_aquaculture_system, landbased
out | the, it, is, he, at
concept | directorate, submersible, rejection, donut, roxel
operate | notifying, aimed, uncertainties, rrg, ii
uncertainty | salmon_price, difficult, paretos, factors, priced
license | application, manufactured, permits, licenses, facility
falling | kilo, salmon_prices, fell, drop, average_prices
french | france, boulognesurmer, gras, intermarche, novo
fillets | frozen, dtrim, domestic_market, pmc, loins
located | clayoquot, project, temuco, digby, province
...

Epoch: 287/500
Loss:  1.8011406660079956
over | up, percent, past, down, as
in | of, a, the, and, on
now | is, will, a, have, that
seafood |

Epoch: 305/500
Loss:  1.7345231771469116
ceo | companys, and, company, executive, regin
was | a, the, of, up, percent
our | we, unique, responsible, impact, work
from | the, percent, of, year, last_year
told | intrafish, are, that, weve, at
per | kilo, average_price, prices, compared, week
week | popular_size, kilo_on_average, nasdaq, prices, average_prices
after | puts, company, from, average_price, the
cash | flow, debt, revolving, share, credit
date | expediting, discussed, spc, methodology, authorities
followed | confirmed, prototype, bonafide, are, sees
longer | term, effectiveness, farming_industry, gases, swift
declined | earlier, bid, rose, amounted, drags
called | government, reform, aside, label, largescale
story | tell, incorrectly, affordable, journalists, version
meeting | elected, agm, votes, bondholders, proposing
...

Epoch: 306/500
Loss:  1.864445447921753
to | the, on, a, is, in
atlantic | sapphire, sapphires, runar, company, bjornvegard
it | the, is, to, that, we
an 

Epoch: 324/500
Loss:  1.7619249820709229
years | be, and, a, manager, the
on | in, to, well, of, a
that | is, not, told, said, there
after | puts, company, from, average_price, salmon
he | said, i, that, why, theyve
our | we, unique, impact, responsible, develop
for | a, the, of, is, to
was | a, the, of, percent, up
deliver | peptides, optimize, methods, barcaldine, profitable
salmon_farming_industry | global, hurdle, phenomenal, nonmedicinal, trillion
gutted | quarter, same_quarter, hog, compared, weight
operate | notifying, aimed, institutions, rrg, stayhome
average_prices | popular_size, kilo_on_average, week, kilo, nasdaq
followed | confirmed, bonafide, sees, prototype, frode
risks | fairr, various, approach, continuity, potential
domestic | cod, lithuania, reidie, midaugust, russian
...

Epoch: 325/500
Loss:  1.239789605140686
aquaculture | salmon_farming, of, equipment, in, that
have | said, we, are, now, be
production | in, first, to, salmon_production, for
first | phase, will, 

Epoch: 343/500
Loss:  1.482650637626648
week | popular_size, kilo_on_average, nasdaq, prices, average_prices
first | phase, will, site, landbased, facility
has | in, of, been, to, as
which | by, to, group, of, the
landbased | evolution, projects, project, salmon_farm, facility
that | is, not, told, it, said
are | up, told, at, number, we
we | do, said, our, you, know
action | euclid, courts, cartel, request, lawsuit
options | rumored, studying, evaluating, combination, performance
date | expediting, discussed, spc, methodology, authorities
update | dieoff, gaard, mass, expects, annualized
fire | electrical, extinguished, knocking, attackers, destroyed
soon | them, haigh, are, ronning, advantages
authority | mattilsynet, inspection, goahead, inspect, detected
concerns | tribes, directly, trident, citizen, treaty
...

Epoch: 344/500
Loss:  1.6607465744018555
during | percent, due, year, same_quarter, came
aquaculture | salmon_farming, of, equipment, in, that
told | intrafish, are, that, 

Epoch: 362/500
Loss:  1.4740664958953857
aquaculture | salmon_farming, of, in, equipment, a
percent | fell, increased, year, same_period, down
landbased | projects, evolution, project, salmon_farm, facility
intrafish | told, are, buyer, we, it
at | are, on, per, the, that
have | we, said, are, be, so
new | in, and, will, its, on
fish | this, of, probably, farmed_salmon, prices
license | application, zoning, manufactured, applied, licenses
concerns | tribes, citizen, trident, directly, prohibiting
russias | russian, rosrybolovstvo, allrussian, moscow, wild_salmon_harvest
progress | gsi, knowledge, model, isles, discusses
located | temuco, clayoquot, project, digby, province
meeting | elected, agm, bondholders, votes, proposing
media | social, inperson, influencers, sometimes, troubling
member | runar, skarbo, consisting, serves, shares
...

Epoch: 363/500
Loss:  1.6128199100494385
kilo | average_prices, per, week, popular_size, sizes
can | that, dont, we, is, to
an | in, a, of, the, to


Epoch: 381/500
Loss:  1.8212461471557617
seafood | and, business, has, to, frozen
some | level, said, told, think, so
year | percent, same_period, taxes, first_quarter, first_half
a | the, with, is, in, and
million | first_quarter, amounted, last_year, same_period, depreciation
of | in, and, the, for, its
into | market, trim, to, salmon, of
after | puts, from, average_price, company, in
kingdom | canner, tariffs, united, salmon_products, democratic
progress | gsi, knowledge, model, isles, discusses
media | social, inperson, influencers, sometimes, stories
sernapesca | surveillance, gallardo, lagos, superintendency, sernapescas
worked | solutions, team, hansen, joining, marine
sockeye | copper, sockeye_salmon, kings, sockeye_salmon_harvest, pink
practices | sustainability, standards, gsi, taskforce, wwf
theres | revolution, somebody, i, yes, barramundi
...

Epoch: 382/500
Loss:  1.3281960487365723
million | first_quarter, amounted, last_year, same_period, depreciation
be | in, this, hav

Epoch: 400/500
Loss:  1.925899624824524
that | is, not, told, there, said
also | to, an, said, that, salmon
companys | percent, company, year, ceo, of
group | subsidiary, announced, a, groups, which
they | is, know, that, but, not
he | said, i, that, theyve, why
all | the, for, these, we, with
atlantic | sapphire, sapphires, company, runar, bjornvegard
acquisitions | mergers, greenfield, lofotprodukt, havfisk, canfisco
gutted | quarter, same_quarter, hog, weight, compared
fillets | frozen, loins, pmc, dtrim, domestic_market
acquire | offer, takeover, shareholders, acquisition, shareholder
worked | solutions, team, marine, hansen, roles
declined | earlier, underperformed, amounted, unrest, bid
goal | differentiation, economy, rearing, mapping, fao
member | runar, skarbo, serves, consisting, upped
...

Epoch: 401/500
Loss:  1.5842844247817993
by | which, and, of, percent, than
or | average_prices, market_share, emailing, prior, whether
he | said, i, that, theyve, why
in | of, a, the, on,

Epoch: 419/500
Loss:  1.7492321729660034
by | which, and, of, percent, than
fish | this, of, farmed_salmon, probably, another
seafood | and, business, has, frozen, to
chilean | australis, salmones, brazil, chile, salmon
after | puts, from, in, average_price, salmon
he | said, i, that, theyve, why
aquaculture | salmon_farming, of, equipment, in, announced
companies | it, them, global, a, one
reason | tomjorgen, lye, smallest, poisoning, cheap
update | dieoff, gaard, annualized, expects, parcel
raising | raised, capital, manage, asset, israelbased
cash | flow, revolving, debt, private_placement, share
experienced | caused, losing, dieoff, wounds, impacted
retailer | pennsylvaniabased, spencer, stores, castle, aquascot
wrote | annual_report, turnover, terrible, consequently, yearend
called | government, reform, aside, administration, hedging
...

Epoch: 420/500
Loss:  1.2697274684906006
with | a, the, for, and, in
norway | nrs, royal, norwegian, nts, salmar
not | that, to, they, have, kno

Epoch: 438/500
Loss:  1.6771382093429565
told | intrafish, that, are, weve, there
industry | we, done, one, are, know
now | will, is, a, have, new
from | the, percent, of, year, loss
us | with, united, brazil, weekly, lago
or | average_prices, percent, market_share, emailing, prior
been | has, said, are, of, on
million | first_quarter, amounted, last_year, same_period, depreciation
sernapesca | surveillance, gallardo, superintendency, lagos, alicia
firms | mitsubishi, katz, renamed, stock_exchange, seeking
theres | i, revolution, somebody, yes, barramundi
longer | term, effectiveness, wrasse, farming_industry, respective
fire | electrical, extinguished, attackers, knocking, destroyed
media | social, inperson, influencers, obama, sometimes
soon | them, haigh, ronning, are, it
approval | goahead, controversial, provincial, joel, shovel
...

Epoch: 439/500
Loss:  1.6051489114761353
into | market, trim, salmon, of, hg
that | is, told, not, it, said
some | think, side, told, so, luck
so | h

Epoch: 457/500
Loss:  1.8447883129119873
has | in, of, to, and, as
which | by, of, group, to, with
out | the, it, he, at, their
down | drop, percent, kilo, prices, week
mowi | mowis, canada, ivan, newfoundland, salmon_farming
will | of, as, now, and, in
than | higher, per, by, market, weeks
their | people, by, choice, said, everyone
online | front, engage, ecommerce, ocado, instore
acquisitions | mergers, greenfield, havfisk, lofotprodukt, pattison
kingdom | canner, tariffs, salmon_products, united, democratic
gsi | gsis, cochair, transparency, precompetitive, wwf
risks | fairr, various, approach, continuity, potential
called | government, reform, aside, administration, caring
probably | prices, exporter, next_week, unsold, sizes
retailer | pennsylvaniabased, spencer, stores, castle, aquascot
...

Epoch: 459/500
Loss:  1.531076192855835
its | of, in, company, group, the
so | have, else, we, how, if
other | to, if, the, one, of
year | percent, same_period, taxes, first_quarter, first_ha

Epoch: 476/500
Loss:  1.4526039361953735
price | prices, per, market, salmon_prices, week
industry | we, one, done, are, know
also | to, an, that, salmon, in
our | we, responsible, unique, business, priorities
down | percent, drop, kilo, prices, week
will | of, and, as, now, in
production | first, have, and, in, company
their | people, by, choice, said, everyone
gsi | gsis, cochair, transparency, precompetitive, wwf
lack | freezing, do, they, avoided, too
average_prices | popular_size, kilo_on_average, kilo, week, weekonweek
growout | amar, phase, facility, construction, batch
firms | mitsubishi, katz, renamed, stock_exchange, seeking
located | temuco, clayoquot, project, buildings, digby
soon | them, haigh, ronning, are, bondo
falling | kilo, salmon_prices, drop, average_prices, farmed_salmon_prices
...

Epoch: 478/500
Loss:  1.5530085563659668
also | to, an, that, salmon, in
percent | fell, increased, year, same_period, down
salmon_farming | aquaculture, salmar, region, salmon_farms,

Epoch: 495/500
Loss:  1.5745924711227417
sales | same_period, earnings, percent, compared, amortization
been | has, said, are, of, on
they | is, know, that, not, we
in | of, a, the, on, and
over | up, percent, past, reported, salmon
were | down, percent, per, in, a
salmon | in, the, a, by, of
chilean | australis, salmones, brazil, chile, coho
institute | sciences, veterinary, scientist, americans, universidad
concept | roxel, directorate, donut, rejection, rejected
sockeye | copper, sockeye_salmon, kings, sockeye_salmon_harvest, pink
ended | taxes, amounted, operating_profit, amortization, yearend
beyond | plantbased, bowl, alternative, etc, upended
fire | electrical, extinguished, attackers, firefighters, destroyed
story | tell, incorrectly, journalists, perspectives, impactful
member | runar, skarbo, shares, serves, upped
...

Epoch: 497/500
Loss:  1.7618788480758667
a | the, is, with, in, and
also | to, that, an, salmon, with
price | prices, per, market, week, salmon_prices
now | wi

In [70]:
print(datetime.now()-startTime)

2:56:40.748683


# Export vectors as txt

In [72]:
# get the trained input-embedding weights from the model as a NumPy array
embeddings = model.in_embed.weight.detach().cpu().numpy()
# our vocabulary
vocab = [int_to_vocab[i] for i in range(len(embeddings))]
# integer ids of all the words in the vocabulary
id_list = list(range(len(embeddings)))
# dictionary where the key is the word and the value is its embedding
word_embed = dict(zip(vocab, embeddings))
# dictionary where the key is the integer id and the value is the word
id_word = dict(zip(id_list, vocab))
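
The `word | neighbours` lines printed during training can be spot-checked afterwards from a dictionary shaped like `word_embed`. The block below is a minimal sketch, assuming cosine similarity is the metric of interest. The helper names `cosine_similarity` and `nearest_words` and the toy vectors are hypothetical and not part of the original notebook. It uses only the standard library, so it runs without the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_words(word, word_embed, top_k=5):
    """Return the top_k words whose vectors are most similar to `word` (hypothetical helper)."""
    query = word_embed[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in word_embed.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:top_k]]

# toy vectors standing in for the real `word_embed` dictionary (assumption)
toy_embed = {
    "salmon": [1.0, 0.1, 0.0],
    "trout": [0.9, 0.2, 0.0],
    "price": [0.0, 1.0, 0.1],
    "prices": [0.1, 0.9, 0.1],
}
neighbours = nearest_words("salmon", toy_embed, top_k=2)
print("salmon | " + ", ".join(neighbours))
```

With the real data, passing `word_embed` in place of `toy_embed` should work as well, since the helper only iterates over the vector components. For a large vocabulary a vectorised NumPy or PyTorch version would be considerably faster.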

In [77]:
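import codecs
# export vectors as txt "my_word2vec_Fish.txt": one line per word, "word v1 v2 ... vN"
with codecs.open("my_word2vec_Fish.txt", "w", "utf-8") as f:
    for idx, word in id_word.items():  # `idx` avoids shadowing the built-in `id`
        vec = embeddings[idx, :].tolist()
        # round each component to 6 decimal places to keep the file compact
        vec = [np.round(num, 6) for num in vec]
        vec_str = " ".join(map(str, vec))
        f.write("%s %s\n" % (word, vec_str))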
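
To check the export, the file can be read back into a dictionary. The block below is a minimal sketch, assuming the line layout written above (a word followed by space-separated floats, with no header line). The function name `load_text_vectors` and the small demo file are hypothetical. The demo file is written to a temporary directory so the sketch runs without the real `my_word2vec_Fish.txt`.

```python
import os
import tempfile

def load_text_vectors(path):
    """Read 'word v1 v2 ... vN' lines into a {word: [floats]} dictionary (hypothetical helper)."""
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, values = parts[0], parts[1:]
            vectors[word] = [float(v) for v in values]
    return vectors

# small demo file in the same layout as the export above (made-up values)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_vectors.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("salmon 0.125 -0.5 1.0\n")
    f.write("salmon_farming 0.25 0.75 -1.5\n")

loaded = load_text_vectors(demo_path)
print(loaded["salmon"])
```

Note that the original word2vec text format begins with a header line holding the vocabulary size and the vector dimension. The export above does not write one, so a loader that expects that header may need it added or may need to be told that it is absent.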