# Generate word embeddings using Swivel

## Overview

In this notebook we show how to generate word embeddings from a small text corpus (the `gutenberg` corpus available through NLTK) using the [Swivel algorithm](https://arxiv.org/pdf/1602.02215). In particular, we reuse the implementation included in the [TensorFlow models repo on GitHub](https://github.com/tensorflow/models/tree/master/research/swivel), with some small modifications.

## Download a small text corpus
First, let's download a corpus into our environment. We will use the `gutenberg` corpus, which consists of a few public-domain works published as part of Project Gutenberg.

In [0]:
import os

In [0]:
import nltk

In [46]:
%ls

gutenberg/        sample_data/  swivel.zip  wget-log.1
iswc18-tutorial/  swivel/       wget-log    wget-log.2


In [45]:
!git clone https://github.com/rdenaux/iswc18-tutorial.git

Cloning into 'iswc18-tutorial'...
remote: Counting objects: 26, done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 26 (delta 1), reused 23 (delta 1), pack-reused 0
Unpacking objects: 100% (26/26), done.


In [6]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

Let's verify the corpus that was downloaded. It should be in the path specified below.

In [7]:
%ls '/root/nltk_data/corpora/gutenberg/'

austen-emma.txt          carroll-alice.txt        README
austen-persuasion.txt    chesterton-ball.txt      shakespeare-caesar.txt
austen-sense.txt         chesterton-brown.txt     shakespeare-hamlet.txt
bible-kjv.txt            chesterton-thursday.txt  shakespeare-macbeth.txt
blake-poems.txt          edgeworth-parents.txt    whitman-leaves.txt
bryant-stories.txt       melville-moby_dick.txt
burgess-busterbrown.txt  milton-paradise.txt


As you can see, the corpus consists of various books, one per file. Most word2vec implementations require you to pass a corpus as a single text file. We can issue a few commands to do this by concatenating all the `txt` files in the folder into a single `all.txt` file, which we will use later on.

In [8]:
%cd '/root/nltk_data/corpora/gutenberg/'

/root/nltk_data/corpora/gutenberg


A few of the files are encoded using iso-8859-1 or contain binary data, which will cause trouble later on, so we rename them to keep them out of our corpus.

In [0]:
!mv chesterton-ball.txt chesterton-ball.badenc-txt
!mv milton-paradise.txt milton-paradise.badenc-txt
!mv shakespeare-caesar.txt shakespeare-caesar.badenc-txt
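
The three files above were picked by hand. As a minimal sketch (not part of the original notebook), files that are not valid UTF-8 could also be found by trying to decode them. The helper name `find_non_utf8` and the `*.txt` pattern are assumptions made for illustration.

```python
import glob
import os


def find_non_utf8(folder, pattern='*.txt'):
    """Return the sorted names of files in `folder` that fail UTF-8 decoding."""
    bad = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, 'rb') as f:
            data = f.read()
        try:
            data.decode('utf-8')
        except UnicodeDecodeError:
            bad.append(os.path.basename(path))
    return bad
```

Calling `find_non_utf8('/root/nltk_data/corpora/gutenberg/')` before renaming anything should list the files that need to be excluded.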

In [0]:
!cat *.txt >> all.txt

In [11]:
%ls -lah '/root/nltk_data/corpora/gutenberg/all.txt'

-rw-r--r-- 1 root root 11M Sep 14 09:04 /root/nltk_data/corpora/gutenberg/all.txt


The full dataset is about 11MB, which is sufficient for demonstration purposes. In practice, you want to use much larger corpora to train high-quality word embeddings. Since we now have a corpus to train the word embeddings on, we can go back to the initial folder.

In [12]:
%cd '/content'

/content


## Download `swivel`, an algorithm for learning word embeddings
Now that we have a corpus, we need an (implementation of an) algorithm for learning embeddings. There are various libraries and implementations for this:
  * [word2vec](https://pypi.org/project/word2vec/), the system proposed by Mikolov that introduced many of the techniques now commonly used for learning word embeddings. It generates word embeddings directly from the text corpus by using a sliding window and trying to predict a target word based on neighbouring context words.
  * [GloVe](https://github.com/stanfordnlp/GloVe), an alternative algorithm by Pennington, Socher and Manning. It splits the process in two steps:
    1. calculating a word-word co-occurrence matrix
    2. learning embeddings from this matrix
  * [FastText](https://fasttext.cc/), a more recent algorithm by Mikolov et al. (now at Facebook) that extends the original word2vec algorithm in various ways. Among other things, it takes subword information into account.

In this tutorial we will be using [Swivel](https://github.com/tensorflow/models/tree/master/research/swivel), an algorithm similar to GloVe, which makes it easier to extend to include both words and concepts. As with GloVe, Swivel first extracts a word-word co-occurrence matrix from a text corpus and then uses this matrix to learn the embeddings.

Let's download and unzip the swivel code first.

In [30]:
!wget http://expertsystemlab.com/hybridNLP18/swivel.zip


Redirecting output to ‘wget-log.2’.


In [31]:
!unzip swivel.zip

Archive:  swivel.zip
  inflating: swivel/analogy.cc       
  inflating: swivel/distributed.sh   
  inflating: swivel/eval.mk          
  inflating: swivel/fastprep.cc      
  inflating: swivel/fastprep.mk      
  inflating: swivel/glove_to_shards.py  
  inflating: swivel/nearest.py       
  inflating: swivel/prep.py          
  inflating: swivel/README.md        
  inflating: swivel/swivel.py        
  inflating: swivel/text2bin.py      
  inflating: swivel/vecs.py          
  inflating: swivel/wordsim.py       


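In [0]:
# Cleanup only: these commands delete a previous download of the Swivel code.
# Do not run them here, since the following steps need the files unzipped above.
#!rm swivel/*
#!rm swivel.zip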

In [11]:
!grep -n .tell swivel/prep.py

94:  nbytes = lines.tell()
104:      pos = lines.tell()
146:  nbytes = lines.tell()
206:      #pos = lines.tell()


## Learn embeddings

### Generate co-occurrence matrix using Swivel `prep`

Call swivel's `prep` command to calculate the word co-occurrence matrix. We use the `%run` magic command, which runs the named python file as a program, allowing us to pass parameters as if using a command-line terminal.

We set the `shard_size` to 512 since the corpus is quite small. For larger corpora we could use the standard value of 4096.
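
Outside a notebook, the same step could be launched as a regular subprocess instead of through `%run`. The sketch below only assembles the command line; `build_prep_command` is a hypothetical helper, and the flags are the ones used for `prep` in this section (`--input`, `--output_dir`, `--shard_size`).

```python
import sys


def build_prep_command(corpus_path, coocs_path, shard_size=512,
                       script='swivel/prep.py'):
    """Assemble the argument list for running Swivel's prep script."""
    return [
        sys.executable, script,
        '--input=%s' % corpus_path,
        '--output_dir=%s' % coocs_path,
        '--shard_size=%d' % shard_size,
    ]


cmd = build_prep_command('/root/nltk_data/corpora/gutenberg/all.txt',
                         '/content/gutenberg/coocs')
print(' '.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```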

In [15]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          40G  7.6G   30G  21% /
tmpfs           6.4G     0  6.4G   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
tmpfs           6.4G     0  6.4G   0% /opt/bin
/dev/sda1        46G  8.4G   37G  19% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           6.4G     0  6.4G   0% /sys/firmware


In [21]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          40G  7.6G   30G  21% /
tmpfs           6.4G     0  6.4G   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
tmpfs           6.4G     0  6.4G   0% /opt/bin
/dev/sda1        46G  8.5G   37G  19% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           6.4G     0  6.4G   0% /sys/firmware


In [23]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %shell  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%bigquery  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  

In [33]:
corpus_path = '/root/nltk_data/corpora/gutenberg/all.txt'
coocs_path = '/content/gutenberg/coocs'
shard_size = 512
%run -i -e swivel/prep --input={corpus_path} --output_dir={coocs_path} --shard_size={shard_size}

running with flags 
swivel/prep.py:
  --bufsz: The number of co-occurrences to buffer
    (default: '16777216')
    (an integer)
  --input: The input text.
    (default: '')
  --max_vocab: The maximum vocabulary size
    (default: '1048576')
    (an integer)
  --min_count: The minimum number of times a word should occur to be included in
    the vocabulary
    (default: '5')
    (an integer)
  --output_dir: Output directory for Swivel data
    (default: '/tmp/swivel_data')
  --shard_size: The size for each shard
    (default: '4096')
    (an integer)
  --vocab: Vocabulary to use instead of generating one
    (default: '')
  --window_size: The window size
    (default: '10')
    (an integer)

tensorflow.python.platform.app:
  -h,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')

absl.flags:
  --flagfile: Insert flag definitions from the given file into the command line.
    (

The `prep` step first determines the **vocabulary** $V$, that is, the list of words for which an embedding will be generated. Since the corpus is fairly small, so is the vocabulary, which contains only about 23K words (large corpora can result in vocabularies with millions of words).

The co-occurrence matrix is a sparse matrix of $|V| \times |V|$ elements. Swivel splits it into shards, i.e. sub-matrices of $s \times s$ elements, where $s$ is the `shard_size` specified above. In this case, we have 2025 sub-matrices.
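
The number of shards follows directly from the vocabulary size and the shard size. A quick check, using the vocabulary size of 23040 that `prep` reports for this corpus:

```python
vocab_size = 23040   # reported by prep for this corpus
shard_size = 512     # value passed via --shard_size

# The vocabulary size is a multiple of shard_size, so this divides evenly.
shards_per_side = vocab_size // shard_size
num_shards = shards_per_side ** 2

print(shards_per_side, num_shards)  # 45 2025
```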

All this information is stored in the output folder we specified above. It contains 2025 files, one per shard (sub-matrix), plus a few additional files:

In [37]:
%ls /content/gutenberg/coocs/ | head -n 10

col_sums.txt
col_vocab.txt
row_sums.txt
row_vocab.txt
shard-000-000.pb
shard-000-001.pb
shard-000-002.pb
shard-000-003.pb
shard-000-004.pb
shard-000-005.pb


The expected output of the `prep` step includes:

    vocabulary contains 23040 tokens

    writing shard 2025/2025
    done!
    
The `prep` step does the following:
  - it uses basic whitespace tokenization to get sequences of tokens
  - in a first pass through the corpus, it counts all tokens and keeps only those that reach a minimum frequency (5) in the corpus. It then truncates this list so that its length is a multiple of the `shard_size`. The tokens that are kept form the **vocabulary** with size $v = |V|$.
  - in a second pass through the corpus, it uses a sliding window to count co-occurrences between the focus token and the context tokens (similar to `word2vec`). The result is a sparse co-occurrence matrix of size $v \times v$.
  - for easier storage and manipulation, Swivel uses *sharding* to split the co-occurrence matrix into sub-matrices of size $s \times s$, where $s$ is the `shard_size`.
  ![Swivel co-occurrence matrix sharding](https://github.com/hybridNLP2018/tutorial/blob/master/images/swivel-sharding.PNG?raw=1)
  - finally, it stores the sharded co-occurrence sub-matrices as [protobuf files](https://developers.google.com/protocol-buffers/).
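
The first two passes listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in `prep.py`: the function names are hypothetical, co-occurrences are counted without any distance weighting (the real implementation may weight them), and the sharding and protobuf serialization are left out.

```python
from collections import Counter


def build_vocab(tokens, min_count=5, shard_size=4096):
    """Keep frequent tokens, truncated to a multiple of shard_size."""
    counts = Counter(tokens)
    vocab = [w for w, c in counts.most_common() if c >= min_count]
    keep = len(vocab) - len(vocab) % shard_size
    return vocab[:keep]


def count_coocs(tokens, vocab, window_size=10):
    """Count symmetric co-occurrences within a sliding window."""
    index = {w: i for i, w in enumerate(vocab)}
    coocs = Counter()  # sparse matrix stored as {(row, col): count}
    for pos, focus in enumerate(tokens):
        if focus not in index:
            continue
        for context in tokens[pos + 1:pos + 1 + window_size]:
            if context in index:
                i, j = index[focus], index[context]
                coocs[(i, j)] += 1
                coocs[(j, i)] += 1
    return coocs


text = 'the cat sat on the mat and the cat ate'
tokens = text.split()  # basic whitespace tokenization
vocab = build_vocab(tokens, min_count=2, shard_size=2)
coocs = count_coocs(tokens, vocab, window_size=2)
print(vocab)        # ['the', 'cat']
print(dict(coocs))  # {(0, 1): 2, (1, 0): 2}
```

In the toy example only `the` and `cat` occur at least twice, and they appear within two tokens of each other twice, which gives the two symmetric entries.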

### Learn embeddings from the co-occurrence matrix
With the sharded co-occurrence matrix it is now possible to learn embeddings:
 - the input is the folder with the co-occurrence matrix (protobuf files with the sparse matrix).
 - `submatrix_rows` and `submatrix_cols` must be equal to the `shard_size` used in the `prep` step.

In [47]:
vec_path = '/content/gutenberg/txt/vec/'
%run -i swivel/swivel --input_base_path={coocs_path} \
    --output_base_path={vec_path} \
    --num_epochs=20 --dim=300 \
    --submatrix_rows={shard_size} --submatrix_cols={shard_size}

INFO:tensorflow:local_step=39990 global_step=39990 loss=20.6, 98.7% complete
INFO:tensorflow:local_step=40000 global_step=40000 loss=20.7, 98.8% complete
INFO:tensorflow:local_step=40010 global_step=40010 loss=19.5, 98.8% complete
INFO:tensorflow:local_step=40020 global_step=40020 loss=20.2, 98.8% complete
INFO:tensorflow:local_step=40030 global_step=40030 loss=20.6, 98.8% complete
INFO:tensorflow:local_step=40040 global_step=40040 loss=21.7, 98.9% complete
INFO:tensorflow:local_step=40050 global_step=40050 loss=19.8, 98.9% complete
INFO:tensorflow:local_step=40060 global_step=40060 loss=20.8, 98.9% complete
INFO:tensorflow:local_step=40070 global_step=40070 loss=19.7, 98.9% complete
INFO:tensorflow:local_step=40080 global_step=40080 loss=20.0, 99.0% complete
INFO:tensorflow:local_step=40090 global_step=40090 loss=21.3, 99.0% complete
INFO:tensorflow:local_step=40100 global_step=40100 loss=22.3, 99.0% complete
INFO:tensorflow:local_step=40110 global_step=40110 loss=20.1, 99.0% complete

This should take a few minutes, depending on your machine.
The result is a list of files in the specified output folder, including:
 - checkpoints of the model
 - `tsv` files for the column and row embeddings.

In [48]:
os.listdir(vec_path)

['model.ckpt-26113.data-00000-of-00001',
 'row_embedding.tsv',
 'graph.pbtxt',
 'model.ckpt-40500.meta',
 'col_embedding.tsv',
 'model.ckpt-40500.index',
 'model.ckpt-40500.data-00000-of-00001',
 'checkpoint',
 'model.ckpt-26113.meta',
 'model.ckpt-0.data-00000-of-00001',
 'model.ckpt-26113.index',
 'model.ckpt-0.meta',
 'model.ckpt-0.index',
 'events.out.tfevents.1536918848.0873453509b3']
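
The number in the last checkpoint name can be related to the settings used above. The numbers are consistent with one training step per shard per epoch. This is an inference from the output shown here, not something confirmed from the Swivel code.

```python
num_shards = 2025   # shard files written by prep
num_epochs = 20     # value passed via --num_epochs

total_steps = num_shards * num_epochs
print(total_steps)  # 40500, matching the final checkpoint model.ckpt-40500

# Progress reported in the log at global_step=40110
print('%.1f%% complete' % (100.0 * 40110 / total_steps))  # 99.0% complete
```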

### Convert `tsv` files to `bin` file
The `tsv` files are easy to inspect, but they take up a lot of space and are slow to load, since each value has to be parsed into a float and packed into a vector. Swivel offers a utility to convert the `tsv` files into a binary format. At the same time, it combines the column and row embeddings into a single space by adding the two vectors for each word in the vocabulary.

In [49]:
%run -i swivel/text2bin --vocab={vec_path}vocab.txt --output={vec_path}vecs.bin \
        {vec_path}row_embedding.tsv \
        {vec_path}col_embedding.tsv

executing text2bin
merging files ['/content/gutenberg/txt/vec/row_embedding.tsv', '/content/gutenberg/txt/vec/col_embedding.tsv'] into output bin


This adds the `vocab.txt` and `vecs.bin` to the folder with the vectors:

In [50]:
%ls {vec_path}

checkpoint
col_embedding.tsv
events.out.tfevents.1536918848.0873453509b3
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-26113.data-00000-of-00001
model.ckpt-26113.index
model.ckpt-26113.meta
model.ckpt-40500.data-00000-of-00001
model.ckpt-40500.index
model.ckpt-40500.meta
row_embedding.tsv
vecs.bin
vocab.txt
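
The conversion described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual `text2bin` code: it assumes each `tsv` line holds a word followed by tab-separated values, and that the binary file stores the vectors as consecutive little-endian 32-bit floats. The function names are hypothetical.

```python
import struct


def read_tsv(lines):
    """Parse 'word<TAB>v1<TAB>v2...' lines into (word, vector) pairs."""
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        yield parts[0], [float(x) for x in parts[1:]]


def merge_to_binary(row_lines, col_lines):
    """Add row and column vectors per word; pack the sums as float32."""
    vocab, chunks = [], []
    for (word, row), (col_word, col) in zip(read_tsv(row_lines),
                                            read_tsv(col_lines)):
        if word != col_word:
            raise ValueError('row and column files must be aligned')
        summed = [r + c for r, c in zip(row, col)]
        vocab.append(word)
        chunks.append(struct.pack('<%df' % len(summed), *summed))
    return vocab, b''.join(chunks)


rows = ['cat\t1.0\t2.0\n', 'dog\t0.5\t0.5\n']
cols = ['cat\t0.5\t0.0\n', 'dog\t1.5\t2.5\n']
vocab, blob = merge_to_binary(rows, cols)
print(vocab, len(blob))  # ['cat', 'dog'] 16
```

Each value takes 4 bytes in the packed form, which is why the binary file is smaller and faster to load than its text counterpart.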


## Read stored binary embeddings and inspect them

Swivel provides the `vecs` library which implements the basic `Vecs` class. It accepts a `vocab_file` and a file for the binary serialization of the vectors (`vecs.bin`).

In [0]:
from swivel import vecs

Now we can load the vectors. The cell below loads the embeddings computed above. If you did not manage to train them, uncomment the two lines that point `vec_path` to pre-computed embeddings. Because the weights are randomly initialized during training, your results may differ from those shown here.

In [52]:
#precomp_vec_path = 'corpora/kcap15-17/precomp-txt/vec/'
#vec_path = precomp_vec_path # uncomment if you did not manage to train embeddings above
vecs = vecs.Vecs(vec_path + 'vocab.txt', 
            vec_path + 'vecs.bin')

Opening vector with expected size 23040 from file /content/gutenberg/txt/vec/vocab.txt
vocab size 23040 (unique 23040)
read rows


Next, let's define a basic method for printing the `k` nearest neighbors for a given word:

In [0]:
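def k_neighbors(word, k=10):
    # Look up the neighbors of `word`, sorted by decreasing similarity
    res = vecs.neighbors(word)
    if not res:
        print('%s is not in the vocabulary, try e.g. %s' % (word, vecs.random_word_in_vocab()))
    else:
        # Print only the top k results (the first one is the word itself)
        for neighbor, sim in res[:k]:
            print('%0.4f: %s' % (sim, neighbor))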

And let's use the method on a few words:

In [59]:
k_neighbors('California')

1.0000: California
0.4007: Pennsylvania,
0.3726: wool
0.3176: Clear
0.2998: forenoon
0.2880: lakes,
0.2822: song,
0.2791: lakes
0.2786: Missouri,
0.2732: life,


In [60]:
k_neighbors('knowledge')

1.0000: knowledge
0.3061: utter
0.2724: stature,
0.2605: kinds
0.2522: Spirit;
0.2414: imperfect
0.2411: partial
0.2399: Colonel's
0.2279: wisdom,
0.2238: mayest


In [63]:
k_neighbors('science')

1.0000: science
0.3302: marine
0.3165: matters.
0.3127: pretensions
0.3025: equality,
0.2901: march
0.2698: whales.
0.2684: sentimental
0.2623: falsely
0.2591: generations


In [67]:
k_neighbors('national')

1.0000: national
0.4614: prejudices
0.3875: hero
0.3769: buoyancy
0.3386: habitual
0.3348: importance.
0.2786: conceal'd
0.2770: reserve
0.2615: Many
0.2599: faults,


In [70]:
k_neighbors('conference')

1.0000: conference
0.3847: afterwards,
0.3199: private
0.2983: added
0.2583: opportunity
0.2553: concluding
0.2468: list
0.2365: soon
0.2356: friendly
0.2229: horizontal


## Conclusion

In this notebook, we used Swivel to generate word embeddings from a small corpus and explored the resulting embeddings by listing the nearest neighbors of a few words.