# Generate word embeddings using Swivel

## Overview

In this notebook we show how to generate word embeddings from a small sample of the UMBC corpus using the [Swivel algorithm](https://arxiv.org/pdf/1602.02215). In particular, we reuse the implementation included in the [TensorFlow models repo on GitHub](https://github.com/tensorflow/models/tree/master/research/swivel) (with some small modifications).

## Download a small text corpus
First, let's download a corpus into our environment. We will use a small sample of the UMBC corpus that has been pre-tokenized and that we have included as part of our GitHub repository. To access it from this environment, we start by cloning the repo.

In [0]:
%ls

In [0]:
!git clone https://github.com/hybridNLP2018/tutorial.git

The dataset comes as a zip file, so we unzip it by executing the following cell. We also define a variable pointing to the corpus file:

In [0]:
!unzip /content/tutorial/datasamples/umbc_t_5K.zip -d /content/tutorial/datasamples/
input_corpus='/content/tutorial/datasamples/umbc_t_5K'

You can inspect the file using the `%less` command, which displays the whole input file in a pager at the bottom of the screen. It is quicker to print just the first line:

In [0]:
#%less {input_corpus}
!head -n1 {input_corpus}

The output above shows that the input text has already been pre-processed:
 * All words have been converted to lower case (this avoids having two separate words for *The* and *the*).
 * Punctuation marks have been separated from words. This avoids creating "words" such as "staff." or "grimm," in the example above.
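
The two pre-processing steps listed above can be sketched in a few lines of Python. This is only an illustration: the function name `preprocess` is hypothetical, and it is not the procedure that was used to prepare the UMBC sample. Note that this simple rule also splits apostrophes and hyphens, which a real tokenizer would probably treat more carefully.

```python
import re

def preprocess(line):
    """Lower-case a line and put spaces around punctuation marks."""
    line = line.lower()
    # Surround every character that is neither a word character nor
    # whitespace with spaces, so that "staff." becomes "staff ."
    line = re.sub(r"([^\w\s])", r" \1 ", line)
    # Collapse the extra whitespace introduced by the substitution
    return " ".join(line.split())

print(preprocess("The staff, led by Grimm, arrived."))
```

After this step, a plain whitespace split is enough to recover the tokens.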

## `swivel`: an algorithm for learning word embeddings
Now that we have a corpus, we need an (implementation of an) algorithm for learning embeddings. There are various libraries and implementations for this:
  * [word2vec](https://pypi.org/project/word2vec/): the system proposed by Mikolov that introduced many of the techniques now commonly used for learning word embeddings. It generates word embeddings directly from the text corpus by using a sliding window and trying to predict a target word based on neighbouring context words.
  * [GloVe](https://github.com/stanfordnlp/GloVe): an alternative algorithm by Pennington, Socher and Manning. It splits the process into two steps:
    1. calculating a word-word co-occurrence matrix
    2. learning embeddings from this matrix
  * [FastText](https://fasttext.cc/): a more recent algorithm by Mikolov et al. (now at Facebook) that extends the original word2vec algorithm in various ways. Among other things, it takes subword information into account.
  
In this tutorial we will be using [Swivel](https://github.com/tensorflow/models/tree/master/research/swivel), an algorithm similar to GloVe, which makes it easier to extend to include both words and concepts (which we will do in [notebook 03 vecsigrafo](https://colab.research.google.com/github/HybridNLP2018/tutorial/blob/master/03_vecsigrafo.ipynb)). As with GloVe, Swivel first extracts a word-word co-occurrence matrix from a text corpus and then uses this matrix to learn the embeddings.

The official [Swivel](https://github.com/tensorflow/models/tree/master/research/swivel) implementation has a few issues when running on Colaboratory, hence we have included a slightly modified version as part of the HybridNLP2018 GitHub repository.

In [0]:
%ls /content/tutorial/scripts/swivel/

## Learn embeddings

### Generate co-occurrence matrix using Swivel `prep`

Call Swivel's `prep` command to calculate the word co-occurrence matrix. We use the `!` shell escape to run the Python file as a program, which lets us pass parameters as if using a command-line terminal.

We set the `shard_size` to 512 since the corpus is quite small. For larger corpora we could use the standard value of 4096.

In [0]:
coocs_path = '/content/umbc/coocs/t_5K/'
shard_size = 512
!python /content/tutorial/scripts/swivel/prep.py \
  --input={input_corpus} \
  --output_dir={coocs_path} \
  --shard_size={shard_size}

The expected output is:

```
   ... tensorflow parameters ...
    vocabulary contains 5120 tokens

    writing shard 100/100
    done!
```

We see that the algorithm first determined the **vocabulary** $V$, the list of words for which an embedding will be generated. Since the corpus is fairly small, so is the vocabulary, which consists of only about 5K words (large corpora can result in vocabularies with millions of words).

The co-occurrence matrix is a sparse matrix of $|V| \times |V|$ elements. Swivel uses shards to split it into submatrices of $s \times s$ elements, where $s$ is the `shard_size` specified above. In this case, we have 100 sub-matrices.
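
The number of sub-matrices follows directly from the figures in the expected output. As a quick check, assuming the 5120-token vocabulary and the shard size of 512 reported above:

```python
vocab_size = 5120   # reported by prep in the expected output
shard_size = 512    # the value passed as --shard_size

# The vocabulary size is a multiple of the shard size, so the division is exact
shards_per_side = vocab_size // shard_size
num_shards = shards_per_side ** 2

print(shards_per_side, num_shards)
```

With a shard size of 4096 the same vocabulary would only fill a single full shard per side, which is why a smaller value suits this corpus.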

All this information is stored in the output folder we specified above. It consists of 100 files, one per shard/sub-matrix, and a few additional files:

In [0]:
%ls {coocs_path} | head -n 10


The `prep` step does the following:
  - it uses basic whitespace tokenization to get sequences of tokens
  - in a first pass through the corpus, it counts all tokens and keeps only those that reach a minimum frequency (5) in the corpus. It then truncates this list so that its length is a multiple of the `shard_size`. The tokens that are kept form the **vocabulary**, with size $v = |V|$.
  - in a second pass through the corpus, it uses a sliding window to count co-occurrences between the focus token and the context tokens (similar to `word2vec`). The result is a sparse co-occurrence matrix of size $v \times v$.
  - for easier storage and manipulation, Swivel uses *sharding* to split the co-occurrence matrix into sub-matrices of size $s \times s$, where $s$ is the `shard_size`.
  ![Swivel co-occurrence matrix sharding](https://github.com/hybridNLP2018/tutorial/blob/master/images/swivel-sharding.PNG?raw=1)
  - it stores the sharded co-occurrence submatrices as [protobuf files](https://developers.google.com/protocol-buffers/).
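
The two passes described above can be illustrated with a small, self-contained sketch. It is not the code in `prep.py`: the function names, the `min_count` and `window` parameters, and the use of plain unweighted counts are all assumptions made for this example (the real script may, for instance, weight pairs by their distance), and sharding and protobuf output are left out.

```python
from collections import Counter

def build_vocab(lines, min_count, shard_size):
    """First pass: count tokens and keep the frequent ones."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Most frequent first; ties broken alphabetically to stay deterministic
    frequent = sorted((t for t, c in counts.items() if c >= min_count),
                      key=lambda t: (-counts[t], t))
    # Truncate so that the vocabulary size is a multiple of shard_size
    keep = len(frequent) - len(frequent) % shard_size
    return frequent[:keep]

def count_cooccurrences(lines, vocab, window):
    """Second pass: count pairs at most `window` tokens apart."""
    index = {tok: i for i, tok in enumerate(vocab)}
    coocs = Counter()  # sparse matrix stored as (row, col) -> count
    for line in lines:
        # Out-of-vocabulary tokens become None but keep their position
        ids = [index.get(tok) for tok in line.split()]
        for pos, focus in enumerate(ids):
            if focus is None:
                continue
            for ctx in ids[pos + 1: pos + 1 + window]:
                if ctx is None:
                    continue
                # Record both directions so the matrix stays symmetric
                coocs[(focus, ctx)] += 1
                coocs[(ctx, focus)] += 1
    return coocs

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = build_vocab(corpus, min_count=2, shard_size=3)
coocs = count_cooccurrences(corpus, vocab, window=2)
print(vocab)
print(coocs[(vocab.index("sat"), vocab.index("on"))])
```

Storing the counts in a dictionary keyed by `(row, col)` mirrors the sparsity of the matrix: only pairs that actually co-occur take up space.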

### Learn embeddings from the co-occurrence matrix
With the sharded co-occurrence matrix it is now possible to learn embeddings:
 - the input is the folder with the co-occurrence matrix (protobuf files with the sparse matrix).
 - `submatrix_rows` and `submatrix_cols` need to be the same as the `shard_size` used in the `prep` step.

In [0]:
vec_path = '/content/umbc/vec/t_5K/'
!python /content/tutorial/scripts/swivel/swivel.py --input_base_path={coocs_path} \
    --output_base_path={vec_path} \
    --num_epochs=40 --dim=150 \
    --submatrix_rows={shard_size} --submatrix_cols={shard_size}

This should take a few minutes, depending on your machine.
The result is a list of files in the specified output folder, including:
 - checkpoints of the model
 - `tsv` files for the column and row embeddings.

In [0]:
%ls {vec_path}

One thing missing from the output folder is a file with just the vocabulary, which we'll need later on. We copy this file from the folder with the co-occurrence matrix.

In [0]:
%cp {coocs_path}row_vocab.txt {vec_path}vocab.txt

### Convert `tsv` files to `bin` file
The `tsv` files are easy to inspect, but they take too much space and they are slow to load since we need to convert the different values to floats and pack them as vectors. Swivel offers a utility to convert the `tsv` files into a `bin`ary format. At the same time it combines the column and row embeddings into a single space (it simply adds the two vectors for each word in the vocabulary).

In [0]:
!python /content/tutorial/scripts/swivel/text2bin.py --vocab={vec_path}vocab.txt --output={vec_path}vecs.bin \
        {vec_path}row_embedding.tsv \
        {vec_path}col_embedding.tsv

This adds `vecs.bin` to the folder with the vectors, next to the `vocab.txt` file we copied earlier:

In [0]:
%ls -lah {vec_path}
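
To give an idea of what the conversion step does, here is a minimal sketch that adds row and column vectors and packs the result as 32-bit floats. It makes several assumptions that are not confirmed by the tutorial: that each `tsv` line holds a word followed by tab-separated values, that both files list words in the same order, and that the binary layout is a flat sequence of little-endian floats. The function names are hypothetical and this is not the code in `text2bin.py`.

```python
import struct

def parse_tsv(text):
    """Parse lines of the form 'word<TAB>v1<TAB>v2...' into (word, vector) pairs."""
    rows = []
    for line in text.strip().splitlines():
        word, *values = line.split("\t")
        rows.append((word, [float(v) for v in values]))
    return rows

def combine_and_pack(row_tsv, col_tsv):
    """Add row and column vectors per word and pack them as 32-bit floats."""
    vocab, chunks = [], []
    for (word, row_vec), (_, col_vec) in zip(parse_tsv(row_tsv), parse_tsv(col_tsv)):
        summed = [r + c for r, c in zip(row_vec, col_vec)]
        vocab.append(word)
        chunks.append(struct.pack(f"<{len(summed)}f", *summed))
    return vocab, b"".join(chunks)

def unpack_vector(blob, position, dim):
    """Read the vector stored at `position` back out of the packed bytes."""
    size = dim * 4  # 4 bytes per 32-bit float
    return list(struct.unpack(f"<{dim}f", blob[position * size:(position + 1) * size]))

# Toy embeddings with two dimensions
row_tsv = "web\t0.5\t1.0\nsemantic\t0.25\t-1.0\n"
col_tsv = "web\t0.5\t0.5\nsemantic\t0.25\t2.0\n"
vocab, blob = combine_and_pack(row_tsv, col_tsv)
print(vocab, len(blob), unpack_vector(blob, 1, dim=2))
```

A fixed-width binary layout like this lets a reader jump straight to the vector of the word at a given vocabulary position, without parsing any text.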

## Read stored binary embeddings and inspect them

Swivel provides the `vecs` library which implements the basic `Vecs` class. It accepts a `vocab_file` and a file for the binary serialization of the vectors (`vecs.bin`).

In [0]:
from tutorial.scripts.swivel import vecs

Now we can load the vectors. We assume you managed to generate the embeddings by following the tutorial up to this point. Note that, due to the random initialization of the weights during training, your results may differ from the ones presented below.

In [0]:
# uncomment the following two lines if you did not manage to train the embeddings above
#!tar -xzf /content/tutorial/datasamples/umbc_swivel_vec_t_5K.tar.gz -C /
#vec_path = '/content/umbc/vec/t_5K/'
vectors = vecs.Vecs(vec_path + 'vocab.txt', 
            vec_path + 'vecs.bin')

We have extended the standard implementation of `swivel.vecs.Vecs` to include a method `k_neighbors`. It accepts a string with the word and an optional `k` parameter that defaults to 10. It returns a list of Python dictionaries with the fields:
  * `word`: a word in the vocabulary that is near the input word
  * `cosim`: the cosine similarity between the input word and the near word.

It's easier to display the results as a `pandas` table:

In [0]:
import pandas as pd
pd.DataFrame(vectors.k_neighbors('california'))

In [0]:
pd.DataFrame(vectors.k_neighbors('knowledge'))

In [0]:
pd.DataFrame(vectors.k_neighbors('semantic'))

In [0]:
pd.DataFrame(vectors.k_neighbors('conference'))

The cells above should display results similar to the following (for the words *california* and *conference*):

| cosim | neighbor of *california* | cosim | neighbor of *conference* |
| ------ | ---------- | ------ | ------------- |
| 1.0000 | california | 1.0000 | conference |
| 0.5060 | university | 0.4320 | international |
| 0.4239 | berkeley | 0.4063 | secretariat |
| 0.4103 | barbara | 0.3857 | jcdl |
| 0.3941 | santa | 0.3798 | annual |
| 0.3899 | southern | 0.3708 | conferences |
| 0.3673 | uc | 0.3705 | forum |
| 0.3542 | johns | 0.3629 | presentations |
| 0.3396 | indiana | 0.3601 | workshop |
| 0.3388 | melvy | 0.3580 | ... |
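
For readers curious about what a nearest-neighbor lookup of this kind involves, the following sketch ranks all words by cosine similarity to a query word. It is not the `k_neighbors` method shipped with the tutorial: it works on a plain dictionary of toy vectors, and the function name `k_neighbors_sketch` is hypothetical. It only mimics the shape of the result (a list of dictionaries with `word` and `cosim` fields).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def k_neighbors_sketch(embeddings, word, k=10):
    """Return the k vocabulary entries most similar to `word`."""
    query = embeddings[word]
    scored = [{"word": other, "cosim": cosine(query, vec)}
              for other, vec in embeddings.items()]
    # Highest similarity first; the query word itself comes out on top
    scored.sort(key=lambda item: item["cosim"], reverse=True)
    return scored[:k]

# Toy two-dimensional embeddings, made up for this example
toy = {
    "california": [1.0, 0.0],
    "berkeley": [0.8, 0.6],
    "workshop": [0.0, 1.0],
}
for item in k_neighbors_sketch(toy, "california", k=2):
    print(item["word"], round(item["cosim"], 4))
```

As in the table above, the query word is its own nearest neighbor with a similarity of 1.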



### Compound words

Note that the vocabulary only has single words, i.e. compound words are not present:

In [0]:
pd.DataFrame(vectors.k_neighbors('semantic web'))

A common way to work around this issue is to use the average vector of the two individual words (of course this only works if both words are in the vocabulary):

In [0]:
semantic_vec = vectors.lookup('semantic')
web_vec = vectors.lookup('web')
semweb_vec = (semantic_vec + web_vec)/2
pd.DataFrame(vectors.k_neighbors(semweb_vec))
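
The same idea can be wrapped in a small helper that also handles the case where one of the words is missing. The sketch below uses a plain dictionary of toy vectors instead of the `Vecs` class, and the name `phrase_vector` is hypothetical:

```python
def phrase_vector(embeddings, phrase):
    """Average the vectors of the words in `phrase`; None if any word is unknown."""
    words = phrase.split()
    if not words or any(w not in embeddings for w in words):
        return None
    dim = len(embeddings[words[0]])
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]

# Toy two-dimensional embeddings, made up for this example
toy = {"semantic": [1.0, 3.0], "web": [3.0, 1.0]}
print(phrase_vector(toy, "semantic web"))
print(phrase_vector(toy, "semantic wiki"))
```

Returning `None` for unknown words makes the limitation explicit instead of silently averaging over the words that happen to be present.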

## Conclusion

In this notebook, we used Swivel to generate word embeddings and we explored the resulting embeddings using the `k_neighbors` method.

# Optional Exercise

## Create word embeddings for texts from Project Gutenberg

### Download and pre-process the corpus

You can try generating new embeddings using a small `gutenberg` corpus that is provided as part of the NLTK library. It consists of a few public-domain works published as part of Project Gutenberg.

First, we download the dataset into our environment:

In [0]:
import os
import nltk
nltk.download('gutenberg')
%ls '/root/nltk_data/corpora/gutenberg/'

As you can see, the corpus consists of various books, one per file. Most word2vec implementations require you to pass a corpus as a single text file. We can issue a few commands to do this by concatenating all the `txt` files in the folder into a single `all.txt` file, which we will use later on.

A couple of the files are encoded using iso-8859-1 or binary encodings, which will cause trouble later on, so we rename them to avoid including them in our corpus.

In [0]:
%cd /root/nltk_data/corpora/gutenberg/
# avoid including books with incorrect encoding
!mv chesterton-ball.txt chesterton-ball.badenc-txt
!mv milton-paradise.txt milton-paradise.badenc-txt
!mv shakespeare-caesar.txt shakespeare-caesar.badenc-txt
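# now concatenate all other files into 'all.txt'
# (remove any earlier copy first, so re-running the cell does not read all.txt into itself)
!rm -f all.txt
!cat *.txt > all.txt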
# print result
%ls -lah '/root/nltk_data/corpora/gutenberg/all.txt'
# go back to standard folder 
%cd /content/

The full dataset is about 11MB.
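
As an alternative to renaming the problematic files by hand, the concatenation could also be done in Python, skipping any file that does not decode as UTF-8. This is only a sketch under that assumption; the function name `concatenate_utf8` is hypothetical, and the demo below runs on a temporary folder rather than on the NLTK data.

```python
import tempfile
from pathlib import Path

def concatenate_utf8(folder, output_name="all.txt"):
    """Concatenate the UTF-8 decodable .txt files in `folder`; return skipped names."""
    folder = Path(folder)
    output = folder / output_name
    parts, skipped = [], []
    for path in sorted(folder.glob("*.txt")):
        if path == output:
            continue  # never read a previous result back into itself
        try:
            parts.append(path.read_text(encoding="utf-8"))
        except UnicodeDecodeError:
            skipped.append(path.name)
    output.write_text("".join(parts), encoding="utf-8")
    return skipped

# Small demo: one valid file and one with iso-8859-1 bytes
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("alpha\n", encoding="utf-8")
    Path(tmp, "b.txt").write_bytes(b"caf\xe9\n")  # not valid UTF-8
    print(concatenate_utf8(tmp))
    print(Path(tmp, "all.txt").read_text(encoding="utf-8"))
```

On the notebook environment the call would be `concatenate_utf8('/root/nltk_data/corpora/gutenberg/')`; the returned list shows which books were left out.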

### Learn embeddings

Run the steps described above to generate embeddings for the gutenberg dataset.

### Inspect embeddings
Use methods similar to the ones shown above to get a feeling for whether the generated embeddings have captured interesting relations between words.