# Generate a Vecsigrafo using Swivel


In this notebook we show how to generate a Vecsigrafo based on a subset of the [UMBC corpus](https://ebiquity.umbc.edu/resource/html/id/351/UMBC-webbase-corpus).

We follow the procedure described in [Towards a Vecsigrafo: Portable Semantics in Knowledge-based Text Analytics](https://pdfs.semanticscholar.org/b0d6/197940d8f1a5fa0d7474bd9a94bd9e44a0ee.pdf) and depicted in the following figure:

![Generic Vecsigrafo Creation](https://github.com/hybridNLP2018/tutorial/blob/master/images/generic-vecsigrafo-creation.png?raw=1)




## Tokenization and Word Sense Disambiguation

The main differences from standard Swivel are that:
 - we use word-sense disambiguation on the text as a pre-processing step (Swivel simply uses white-space tokenization)
 - each 'token' in the resulting sequences is composed of a lemma and an optional concept identifier.
 
### Disambiguators
If we are going to apply WSD, we will need a disambiguation strategy. Unfortunately, there are not many open-source, high-performance disambiguators available. At [Expert System](https://www.expertsystem.com/), we have a [state-of-the-art disambiguator](https://www.expertsystem.com/products/cogito-cognitive-technology/semantic-technology/disambiguation/) that assigns **syncon**s (our version of synsets) to lemmas in the text.

Since Expert System's disambiguator and semantic KG are proprietary, in this notebook we will be mostly using WordNet (although we may present some results and examples based on Expert System's results). We have implemented a lightweight disambiguation strategy, proposed by [Mancini, M., Camacho-Collados, J., Iacobacci, I., & Navigli, R. (2017). Embedding Words and Senses Together via Joint Knowledge-Enhanced Training. CoNLL.](http://arxiv.org/abs/1612.02703), which has allowed us to produce disambiguated corpora based on WordNet 3.1.

To be able to inspect the disambiguated corpus, let's make sure we have access to WordNet in our environment by executing the following cell.






In [30]:
import nltk
from nltk.corpus import wordnet as wn  # provides the `wn` name used below

nltk.download('wordnet')
wn.synset('Maya.n.02')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Synset('maya.n.02')

### Tokenizations
When applying a disambiguator, the tokens are no longer (groups of) words. Each token can carry different types of information; we generally keep the following:
  * `t`: text, the original text (possibly normalised, e.g. lower-cased)
  * `l`: lemma, the lemma form of the word
  * `g`: grammar, the grammar type
  * `s`: syncon (or synset) identifier
  
### Example WordNet

We have included a small sample of our disambiguated UMBC corpus as part of our [GitHub tutorial repo](https://github.com/HybridNLP2018/tutorial). Execute the following cell to clone the repo, unzip the sample corpus and print the first line of the corpus.

In [24]:
%cd /content/
!git clone https://github.com/HybridNLP2018/tutorial.git
%cd /content/tutorial/datasamples/
!unzip umbc_tlgs_wnscd_5K.zip
toked_corpus = '/content/tutorial/datasamples/umbc_tlgs_wnscd_5K'
!head -n1 {toked_corpus}
%cd /content/

/content
fatal: destination path 'tutorial' already exists and is not an empty directory.
/content/tutorial/datasamples
Archive:  umbc_tlgs_wnscd_5K.zip
replace umbc_tlgs_wnscd_5K? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
the%7CGT_ART mayan%7Clem_Mayan%7CGT_ADJ%7Cwn31_Maya.n.03 image%7Clem_image%7CGT_NOU%7Cwn31_effigy.n.01 collection%7Clem_collection%7CGT_NOU%7Cwn31_collection.n.01 was%7Clem_be%7CGT_AUX%7Cwn31_embody.v.02 contributed%7Clem_contribute%7CGT_VER%7Cwn31_contribute.v.02 by%7Clem_by%7CGT_PRE%7Cwn31_aside.r.06 oberlin+college%7Clem_Oberlin+College%7CGT_NPR faculty%7Clem_faculty%7CGT_NOU%7Cwn31_staff.n.03 and%7CGT_CON library%7Clem_library%7CGT_NOU%7Cwn31_library.n.02 staff%7Clem_staff%7CGT_NOU%7Cwn31_staff.n.03 .%7CGT_PNT professor%7Clem_professor%7CGT_NOU%7Cwn31_professor.n.01 linda+grimm%7Clem_Linda+Grimm%7CGT_NPH %2C%7CGT_PNT associate+professor%7Clem_associate+professor%7CGT_NOU%7Cwn31_associate+professor.n.01 of%7CGT_PRE anthropology%7Clem_anthropology%7CGT_NOU%7Cwn31_ant

You should see, among others, the first line in the corpus, which starts with:

```
the%7CGT_ART mayan%7Clem_Mayan%7CGT_ADJ%7Cwn31_Maya.n.03 image%7Clem_image%7CGT_NOU%7Cwn31_effigy.n.01 
```



The file included in the GitHub repo for this tutorial is a subset of a disambiguated tokenization of the UMBC corpus. It only contains the first 5 thousand lines of that corpus (the full corpus has about 40 million lines), as we only need it to show the steps necessary to generate embeddings.

The last output from the cell above shows the format we use to represent the tokenized corpus. We use white space to separate the tokens and URL-encode each token to avoid mixing up tokens. Since this format is hard to read, we provide a library to inspect the lines more easily. Execute the following cell to display the first two lines in the corpus as a table.

In [0]:
%cd /content/
import tutorial.scripts.wntoken as wntoken
import pandas

# open the file and produce a list of python dictionaries describing the tokens
corpus_tokens = wntoken.open_as_token_dicts(toked_corpus, max_lines=2)
# convert the tokens into a pandas DataFrame to display in table form
pandas.DataFrame(corpus_tokens, columns=['line', 't', 'l', 'g', 's', 'glossa'])
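
To make the encoding concrete, the following is a minimal sketch of how a single token could be decoded by hand. It is not the `wntoken` implementation: the helper name `decode_token` is hypothetical, and the prefixes `lem_`, `GT_` and `wn31_`, as well as the use of `+` for spaces, are assumptions read off the sample line printed above.

```python
from urllib.parse import unquote_plus

# Prefixes observed in the sample line; treated here as an assumption.
PREFIX_TO_FIELD = {'lem_': 'l', 'GT_': 'g', 'wn31_': 's'}

def decode_token(token):
    """Split one URL-encoded token into its t/l/g/s fields (hypothetical helper)."""
    # Subtokens are separated by an encoded '|' (%7C); decode each part separately.
    parts = [unquote_plus(part) for part in token.split('%7C')]
    fields = {'t': parts[0], 'l': None, 'g': None, 's': None}
    for part in parts[1:]:
        for prefix, field in PREFIX_TO_FIELD.items():
            if part.startswith(prefix):
                fields[field] = part[len(prefix):]
                break
    return fields

print(decode_token('mayan%7Clem_Mayan%7CGT_ADJ%7Cwn31_Maya.n.03'))
print(decode_token('oberlin+college%7Clem_Oberlin+College%7CGT_NPR'))
```

Tokens without a lemma or synset (for example punctuation) simply leave the corresponding fields as `None`.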

### Example Cogito
As a second example, analysing the original sentence:
   
    EXPERIMENTAL STUDY  We conducted an empirical evaluation to assess the effectiveness
    
using Cogito gives us     
![Full Cogito Analysis of example sentence](https://github.com/hybridNLP2018/tutorial/blob/master/images/example-sentence-cogito.PNG?raw=1)

We filter out some of the words, keep only the lemmas and the syncon ids, and encode them as the following sequence of disambiguated tokens:

    en#86052|experimental en#2686|study en#76710|conduct en#86047|empirical en#3546|evaluation en#68903|assess 
    en#25094|effectiveness  

## Vocabulary and Co-occurrence matrix

Next, we need to count the co-occurrences in the disambiguated corpus. We can either:
 - use **standard swivel prep**: in this case each *<text>|<lemma>|<grammar>|<synset>* tuple will be treated as a separate token. For the example sentence from UMBC, presented above, we would then get that `mayan|lem_Mayan|GT_ADJ|wn31_Maya.n.03` has a co-occurrence count of 1 with `image|lem_image|GT_NOU|wn31_effigy.n.01`. This would result in a very large vocabulary.
 - use **joint-subtoken prep**: in this case, you can specify which individual subtoken information you want to take into account. In this notebook we will use **ls** information, hence each synset and each lemma are treated as separate entities in the vocabulary and will be represented with different embeddings. For the example sentence we would get that `lem_Mayan` has a co-occurrence count of 1 with `wn31_Maya.n.03`, `lem_image` and `wn31_effigy.n.01`. 
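
The difference between the two options can be illustrated with a small sketch that counts co-occurrences for the two example tokens. This is only an illustration under simplifying assumptions: the function names are hypothetical, counts are unweighted (every pair inside the window counts as 1), and the subtokens of a single token are assumed to co-occur with each other, as the `lem_Mayan` / `wn31_Maya.n.03` example suggests.

```python
from collections import Counter
from itertools import combinations

def standard_counts(tokens, window=10):
    """Count co-occurrences treating each full token as one vocabulary entry."""
    counts = Counter()
    for i, token in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            counts[(token, other)] += 1
            counts[(other, token)] += 1
    return counts

def joint_subtoken_counts(tokens, keep_prefixes=('lem_', 'wn31_'), window=10):
    """Count co-occurrences between the selected subtokens (lemma and synset)."""
    groups = [[p for p in token.split('|') if p.startswith(keep_prefixes)]
              for token in tokens]
    counts = Counter()
    for i, group in enumerate(groups):
        # Subtokens of the same token co-occur with each other.
        for a, b in combinations(group, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
        # Subtokens also co-occur with the subtokens of nearby tokens.
        for other_group in groups[i + 1:i + 1 + window]:
            for a in group:
                for b in other_group:
                    counts[(a, b)] += 1
                    counts[(b, a)] += 1
    return counts

tokens = ['mayan|lem_Mayan|GT_ADJ|wn31_Maya.n.03',
          'image|lem_image|GT_NOU|wn31_effigy.n.01']
print(len(standard_counts(tokens)))        # number of distinct ordered pairs
print(joint_subtoken_counts(tokens)[('lem_Mayan', 'wn31_effigy.n.01')])
```

With joint-subtoken counting, the same lemma or synset accumulates counts across all the tokens it appears in, which keeps the vocabulary smaller than with one entry per full tuple.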
 

In [0]:
import os
import numpy as np

### Standard Swivel Prep
For the **standard swivel prep**, we can simply call `prep` using the `!python` command. In this case we have the `toked_corpus`, which contains the disambiguated sequences as shown above. The output will be a set of sharded co-occurrence submatrices, as explained in the notebook for creating word vectors.

We set the `shard_size` to 512 since the corpus is quite small. For larger corpora we could use the standard value of 4096.

In [40]:
!mkdir /content/umbc/
!mkdir /content/umbc/coocs
!mkdir /content/umbc/coocs/tlgs_wnscd_5k_standard
coocs_path = '/content/umbc/coocs/tlgs_wnscd_5k_standard/'
!python tutorial/scripts/swivel/prep.py --input={toked_corpus} --output_dir={coocs_path} --shard_size=512

mkdir: cannot create directory ‘/content/umbc/’: File exists
mkdir: cannot create directory ‘/content/umbc/coocs’: File exists
mkdir: cannot create directory ‘/content/umbc/coocs/tlgs_wnscd_5k_standard’: File exists
running with flags 
tutorial/scripts/swivel/prep.py:
  --bufsz: The number of co-occurrences to buffer
    (default: '16777216')
    (an integer)
  --input: The input text.
    (default: '')
  --max_vocab: The maximum vocabulary size
    (default: '1048576')
    (an integer)
  --min_count: The minimum number of times a word should occur to be included in
    the vocabulary
    (default: '5')
    (an integer)
  --output_dir: Output directory for Swivel data
    (default: '/tmp/swivel_data')
  --shard_size: The size for each shard
    (default: '4096')
    (an integer)
  --vocab: Vocabulary to use instead of generating one
    (default: '')
  --window_size: The window size
    (default: '10')
    (an integer)

tensorflow.python.platform.app:
  -h,--[no]help: show this help
  

Expected output:

    ... tensorflow flags ....
    
    vocabulary contains 8192 tokens

    writing shard 256/256
    Wrote vocab and sum files to /content/umbc/coocs/tlgs_wnscd_5k_standard/
    Wrote vocab and sum files to /content/umbc/coocs/tlgs_wnscd_5k_standard/
    done!



In [48]:
!head -n15  /content/umbc/coocs/tlgs_wnscd_5k_standard/row_vocab.txt

the%7CGT_ART
%2C%7CGT_PNT
.%7CGT_PNT
of%7CGT_PRE
and%7CGT_CON
to%7CGT_PRE
a%7CGT_ART
in%7CGT_PRE
for%7CGT_PRE
%22%7CGT_PNT
is%7Clem_be%7CGT_VER%7Cwn31_be.v.01
with%7CGT_PRE
%29%7CGT_PNT
%28%7CGT_PNT
on%7Clem_on%7CGT_PRE


As the cells above show, applying standard prep results in a vocabulary of over 8K "tokens". However, each token is still represented as a URL-encoded combination of the plain text, lemma, grammar type and synset (when available).

### Joint-subtoken Prep
For the **joint-subtoken prep** step, we have a Java implementation that is not open source yet: it is still tied to proprietary code, and we are working on refactoring it so that Cogito subtokens are just a special case. However, we ***provide pre-computed co-occurrence files***.

Although the implementation is not open source, we describe the steps we executed to help you implement a similar pipeline.

First,  we ran our implementation of subtoken prep on the corpus. Notice:
  * we are only including lemma and synset information (i.e. we are not including plain text and grammar information). 
  * furthermore, we are filtering the corpus by
     1. removing any tokens related to punctuation marks (PNT), auxiliary verbs (AUX) and articles (ART), since we think these do not contribute much to the semantics of words.
     2. replacing tokens with grammar types `ENT` (entities) and `NPH` (proper names) with generic variants `grammar#ENT` and `grammar#NPH` respectively. The rationale is that, depending on the input corpus, names of people or organizations may appear a few times, but may be filtered out if they do not appear enough times. This ensures such tokens are kept in the vocabulary and contribute to the embeddings of words nearby. The main disadvantage is that we will not have some proper names in our final vocabulary.
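
The filtering and generalisation rules above could be sketched in Python as follows. This is not the proprietary Java code: the function name `to_subtokens` is hypothetical, the input is assumed to be a token dictionary with the `l`, `g` and `s` keys described earlier, and dropping tokens that have neither lemma nor synset is our own assumption.

```python
REMOVE_GRAMMAR_TYPES = {'PNT', 'AUX', 'ART'}   # filtered out entirely
GENERALISE_GRAMMAR_TYPES = {'ENT', 'NPH'}      # replaced by a generic variant

def to_subtokens(token):
    """Map one token dict to the lemma/synset subtokens kept in the vocabulary."""
    grammar = token.get('g')
    if grammar in REMOVE_GRAMMAR_TYPES:
        return []
    if grammar in GENERALISE_GRAMMAR_TYPES:
        return ['grammar#' + grammar]
    subtokens = []
    if token.get('l'):
        subtokens.append('lem_' + token['l'])
    if token.get('s'):
        subtokens.append('wn31_' + token['s'])
    return subtokens

print(to_subtokens({'t': 'the', 'g': 'ART'}))
print(to_subtokens({'t': 'linda grimm', 'l': 'Linda Grimm', 'g': 'NPH'}))
print(to_subtokens({'t': 'mayan', 'l': 'Mayan', 'g': 'ADJ', 's': 'Maya.n.03'}))
```

The command we actually ran on the sample corpus was: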

```
java $JAVA_OPTIONS net.expertsystem.word2vec.swivel.SubtokPrep \
  --input C:/hybridNLP2018/tutorial/datasamples/umbc_tlgs_wnscd_5K \
  --output_dir C:/corpora/umbc/coocs/tlgs_wnscd_5K_ls_f/  \
  --expected_seq_encoding TLGS_WN \
  --sub_tokens \
  --output_subtokens "LEMMA,SYNSET" \
  --remove_tokens_with_grammar_types "PNT,AUX,ART"  \
  --generalise_tokens_with_grammar_types "ENT,NPH" \
  --shard_size 512
```

The output log looked as follows:

```
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - expected_seq_encoding set to 'TLGS_WN'
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - remove_tokens_with_grammar_types set to PNT,AUX,ART
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - generalise_tokens_with_grammar_types set to ENT,NPH
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - Creating vocab for C:\hybridNLP2018\tutorial\datasamples\umbc_tlgs_wnscd_5K
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - read 5000 lines from C:\hybridNLP2018\tutorial\datasamples\umbc_tlgs_wnscd_5K
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - filtered 166152 tokens from a total of 427796 (38,839%)
generalised 1899 tokens from a total of 427796 (0,444%)
full vocab size 21321
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - Vocabulary contains 5632 tokens (21321 full count, 5913 appear > 5 times)
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - Flushing 1279235 co-occ pairs
INFO  net.expertsystem.word2vec.swivel.SubtokPrep - Wrote 121 tmpShards to disk
```
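
The numbers in this log are consistent with the vocabulary being truncated to a multiple of the shard size: 5913 candidate tokens give 11 full blocks of 512, i.e. 5632 tokens and 11 x 11 = 121 shards. The sketch below reproduces that arithmetic; the helper name `shard_layout` is hypothetical and the truncation rule is an assumption inferred from the logs, not taken from the Java code.

```python
def shard_layout(num_candidate_tokens, shard_size):
    """Return (vocabulary size, number of shards) for a square sharded matrix."""
    blocks_per_side = num_candidate_tokens // shard_size
    vocab_size = blocks_per_side * shard_size   # leftover tokens are dropped
    return vocab_size, blocks_per_side ** 2

print(shard_layout(5913, 512))   # joint-subtoken prep log above
print(shard_layout(8192, 512))   # standard prep: vocabulary of 8192 tokens
```

The same calculation gives the 256 shards reported for the standard prep run with a vocabulary of 8192 tokens.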

We have included the output of this process as part of the GitHub repo for the tutorial. We will unzip this folder to inspect the results:

In [44]:
!unzip /content/tutorial/datasamples/precomp-coocs-tlgs_wnscd_5K_ls_f.zip -d /content/umbc/coocs/
precomp_coocs_path = '/content/umbc/coocs/tlgs_wnscd_5K_ls_f'

Archive:  /content/tutorial/datasamples/precomp-coocs-tlgs_wnscd_5K_ls_f.zip
   creating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/col_sums.txt  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/col_vocab.txt  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/init_vocab.txt  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/row_sums.txt  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/row_vocab.txt  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-000.pb  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-001.pb  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-002.pb  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-003.pb  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-004.pb  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-005.pb  
  inflating: /content/umbc/coocs/tlgs_wnscd_5K_ls_f/shard-000-006.pb  
  inflating: /content/umbc/coocs/t

The previous cell extracts the pre-computed co-occurrence shards and defines a variable `precomp_coocs_path` that points to the folder where these shards are stored.

Next, we print the first 10 elements of the vocabulary to see the format that we are using to represent the lemmas and synsets:

In [45]:
!head -n10 {precomp_coocs_path}/row_vocab.txt

lem_be
wn31_be.v.01
lem_that
lem_this
lem_on
lem_by
lem_information
lem_as
lem_use
lem_from


As the output above shows, the vocabulary we get with `subtoken prep` is smaller (5.6K elements instead of over 8K) and it contains individual lemmas and synsets (it also contains *special* elements grammar#ENT and grammar#NPH, as described above).

**More importantly**, the co-occurrence counts reflect the fact that certain lemmas co-occur more frequently with certain other lemmas and synsets, which should be taken into account when learning embedding representations.

## Learn embeddings from co-occurrence matrix

With the sharded co-occurrence matrices created in the previous section, we can now learn embeddings by calling the `swivel.py` script. This launches a TensorFlow application controlled by several parameters, most of which are self-explanatory:

 - `input_base_path`: the folder with the co-occurrence matrix (protobuf files with the sparse matrix) generated above.
 - `submatrix_rows` and `submatrix_cols`: these need to be the same size as the `shard_size` used in the `prep` step.
 - `num_epochs`: the number of times to go through the input data (all the co-occurrences in the shards). We have found that for large corpora the learning algorithm converges after a few epochs, while smaller corpora need a larger number of epochs.
 
 Execute the following cell to generate embeddings for the pre-computed co-occurrences.

In [49]:
vec_path = '/content/umbc/vec/tlgs_wnscd_5k_ls_f'
!python /content/tutorial/scripts/swivel/swivel.py --input_base_path={precomp_coocs_path} \
    --output_base_path={vec_path} \
    --num_epochs=40 --dim=150 \
    --submatrix_rows=512 --submatrix_cols=512

Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
'Tensor' object has no attribute 'to_proto'
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path /content/umbc/vec/tlgs_wnscd_5k_ls_f/model.ckpt
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
'Tensor' object has no attribute 'to_proto'
INFO:tensorflow:local_step=10 global_step=10 loss=82.3, 0.2% complete
INFO:tensorflow:local_step=20 global_step=20 loss=78.6, 0.4% complete
INFO:tensorflow:local_step=30 global_step=30 loss=76.6, 0.6% complete
INFO:tensorflow:local_step=40 global_step=40 loss=74.5, 0.8% complete
INFO:tensorflow:local_step=50 global_step=50 loss=7

This will take a few minutes, depending on your machine.
The result is a list of files in the specified output folder, including:
 - the tensorflow graph, which defines the architecture of the model being trained
 - checkpoints of the model (intermediate snapshots of the weights)
 - `tsv` files for the final state of the column and row embeddings.

In [51]:
%ls {vec_path}

checkpoint
col_embedding.tsv
events.out.tfevents.1538661737.5b2dbcb47226
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-4840.data-00000-of-00001
model.ckpt-4840.index
model.ckpt-4840.meta
row_embedding.tsv
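
The final checkpoint number, `model.ckpt-4840`, is consistent with processing each of the 121 shards once per epoch for 40 epochs. The sketch below reproduces that arithmetic. The helper name `total_steps` is hypothetical, and the assumption that one training step consumes exactly one shard is inferred from these numbers rather than taken from the Swivel code.

```python
def total_steps(num_epochs, vocab_size, submatrix_rows, submatrix_cols):
    """Estimate training steps, assuming one shard is consumed per step."""
    num_shards = (vocab_size // submatrix_rows) * (vocab_size // submatrix_cols)
    return num_epochs * num_shards

steps = total_steps(num_epochs=40, vocab_size=5632,
                    submatrix_rows=512, submatrix_cols=512)
print(steps)
# Progress after 10 steps, to compare with the "% complete" value in the log.
print('%.1f%% complete' % (100.0 * 10 / steps))
```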


### Convert `tsv` files to `bin` file

As we've seen in previous notebooks, the `tsv` files are easy to inspect, but they take too much space and they are slow to load since we need to convert the different values to floats and pack them as vectors. Swivel offers a utility to convert the `tsv` files into a `bin`ary format. At the same time it combines the column and row embeddings into a single space (it simply adds the two vectors for each word in the vocabulary).

In [52]:
!python /content/tutorial/scripts/swivel/text2bin.py --vocab={precomp_coocs_path}/row_vocab.txt --output={vec_path}/vecs.bin \
        {vec_path}/row_embedding.tsv \
        {vec_path}/col_embedding.tsv

executing text2bin
merging files ['/content/umbc/vec/tlgs_wnscd_5k_ls_f/row_embedding.tsv', '/content/umbc/vec/tlgs_wnscd_5k_ls_f/col_embedding.tsv'] into output bin


This adds `vecs.bin` to the folder with the vectors:

In [53]:
%ls {vec_path}

checkpoint
col_embedding.tsv
events.out.tfevents.1538661737.5b2dbcb47226
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-4840.data-00000-of-00001
model.ckpt-4840.index
model.ckpt-4840.meta
row_embedding.tsv
vecs.bin
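
The merging step described above (adding the row and column vector of each word) can be pictured with the following sketch. It is not the `text2bin.py` code: the function name `merge_embeddings` is hypothetical, and the `tsv` layout (a word followed by tab-separated float values on each line) is an assumption.

```python
import numpy as np

def merge_embeddings(row_lines, col_lines):
    """Sum row and column embeddings line by line into a single matrix."""
    vocab, vectors = [], []
    for row_line, col_line in zip(row_lines, col_lines):
        row_word, *row_values = row_line.rstrip('\n').split('\t')
        col_word, *col_values = col_line.rstrip('\n').split('\t')
        if row_word != col_word:
            raise ValueError('row/column vocabularies differ: %r vs %r'
                             % (row_word, col_word))
        row_vec = np.array([float(v) for v in row_values], dtype=np.float32)
        col_vec = np.array([float(v) for v in col_values], dtype=np.float32)
        vocab.append(row_word)
        vectors.append(row_vec + col_vec)
    return vocab, np.vstack(vectors)

vocab, matrix = merge_embeddings(['lem_be\t1.0\t2.0\n', 'lem_on\t0.0\t1.0\n'],
                                 ['lem_be\t0.5\t0.5\n', 'lem_on\t1.0\t1.0\n'])
print(vocab)
print(matrix)
```

Storing the summed vectors as packed 32-bit floats is what makes the binary file smaller and faster to load than the two `tsv` files.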


## Inspect the embeddings

As in previous notebooks, we can now use Swivel to inspect the vectors using the `Vecs` class. It accepts a `vocab_file` and a file for the binary serialization of the vectors (`vecs.bin`).

In [0]:
from tutorial.scripts.swivel import vecs

...and we can load existing vectors. The cell below loads the vectors stored in `vec_path`, together with the vocabulary of the pre-computed co-occurrences. If you computed the embeddings yourself by following the steps above, your results may differ from the ones shown here due to the random initialization of the weights during training.

In [57]:
vectors = vecs.Vecs(precomp_coocs_path + '/row_vocab.txt', 
            vec_path + '/vecs.bin')

Opening vector with expected size 5632 from file /content/umbc/coocs/tlgs_wnscd_5K_ls_f/row_vocab.txt
vocab size 5632 (unique 5632)
read rows


Next, let's define a basic method for printing the `k` nearest neighbors for a given word:

In [0]:
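def k_neighbors(word, k=10):
    """Print the k nearest neighbors of `word` with their similarity scores."""
    res = vectors.neighbors(word)
    if not res:
        # call random_word_in_vocab on the Vecs instance, not on the vecs module
        print('%s is not in the vocabulary, try e.g. %s' % (word, vectors.random_word_in_vocab()))
    else:
        # honour the k argument instead of a hard-coded 10
        for neighbor, sim in res[:k]:
            print('%0.4f: %s' % (sim, neighbor))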

And let's use the method on a few lemmas and synsets in the vocabulary:

In [62]:
k_neighbors('lem_California')

1.0000: lem_California
0.5068: lem_University of California
0.3764: wn31_recognize.v.08
0.3148: lem_comprise
0.2934: lem_battle
0.2912: lem_deployment
0.2840: wn31_map.v.04
0.2792: grammar#ENT
0.2624: wn31_publication.n.04
0.2565: lem_Caribbean


In [64]:
k_neighbors('lem_semantic')

1.0000: lem_semantic
0.3592: wn31_map.v.01
0.3559: lem_procedural
0.3517: lem_dictionary
0.3468: lem_relationship
0.3334: lem_object-oriented
0.3277: lem_similarity
0.3247: wn31_common.a.01
0.3139: wn31_similarity.n.01
0.3134: lem_header


In [65]:
k_neighbors('lem_conference')

1.0000: lem_conference
0.6992: wn31_conference.n.01
0.4336: wn31_session.n.01
0.4307: lem_proceedings
0.4257: lem_annual
0.4160: wn31_conference.n.03
0.3906: wn31_seminar.n.01
0.3890: lem_workshop
0.3809: lem_seminar
0.3803: lem_hold


In [68]:
k_neighbors('wn31_conference.n.01')

1.0000: wn31_conference.n.01
0.6992: lem_conference
0.4521: wn31_seminar.n.01
0.4392: wn31_annual.a.01
0.4329: lem_annual
0.4231: wn31_external.a.03
0.4072: lem_seminar
0.3822: lem_proceedings
0.3613: wn31_meeting.n.01
0.3539: wn31_session.n.01


Note that the Vecsigrafo approach gives us very different results from standard Swivel (notebook 01):
 * the results now include concepts (synsets) in addition to words. Without further information, this makes the results harder to interpret, since we only have the concept id. However, we can look these concepts up in the underlying KG (WordNet in this case) to explore the semantic network and get further information.
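
One way to look up a concept id is to strip the `wn31_` prefix and query WordNet through NLTK, as in the sketch below. The helper names `synset_name` and `describe` are hypothetical. Note also that the `wordnet` package installed by NLTK may be a different WordNet version from the 3.1 ids used in the corpus, so sense numbers might not always line up; `wn.synset` raises an error for names it does not know.

```python
def synset_name(vocab_entry, prefix='wn31_'):
    """Return the WordNet synset name for a vocabulary entry, or None for lemmas."""
    if not vocab_entry.startswith(prefix):
        return None
    return vocab_entry[len(prefix):]

def describe(vocab_entry):
    """Return the WordNet gloss for a synset entry (requires the NLTK WordNet data)."""
    # Imported lazily so that synset_name can be used without NLTK installed.
    from nltk.corpus import wordnet as wn
    name = synset_name(vocab_entry)
    if name is None:
        return None
    return wn.synset(name).definition()

print(synset_name('wn31_conference.n.01'))
print(synset_name('lem_conference'))
# With the WordNet data downloaded, try: print(describe('wn31_conference.n.01'))
```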
 
Of course, the results may not be very good, since they have been derived from a very small corpus (5K lines from UMBC). In the exercise below, we encourage you to download and inspect pre-computed embeddings based on the full UMBC corpus.

In [69]:
k_neighbors('lem_semantic web')

1.0000: lem_semantic web
0.3511: wn31_technology.n.01
0.3413: lem_machine learning
0.3242: wn31_employment.n.01
0.3156: lem_emergence
0.2912: wn31_model.n.04
0.2868: lem_incorporation
0.2858: lem_infrastructure
0.2830: lem_technology
0.2819: lem_educationally


In [70]:
k_neighbors('lem_ontology')

1.0000: lem_ontology
0.3664: lem_eye
0.3107: lem_extend
0.3028: wn31_function.n.01
0.2987: lem_rdf
0.2955: lem_concepts
0.2910: wn31_existent.a.01
0.2886: lem_mapping
0.2807: lem_unified
0.2806: lem_relationship


# Conclusion and Exercises

In this notebook we generated a vecsigrafo based on a disambiguated corpus. The resulting embedding space combines concept ids and lemmas.

We have seen that the resulting space:
 1. may be harder to inspect due to the potentially opaque concept ids
 2. is clearly different from standard Swivel embeddings
 
The question is: are the resulting embeddings *better*? 

To get an answer, in the next notebook, we will look at **evaluation methods for embeddings**.

## Exercise 1: Explore full precomputed embeddings

We have also pre-computed embeddings for the full UMBC corpus. The provided `tar.gz` file is about 1.1GB, hence downloading it may take several minutes.

In [0]:
full_precomp_url = 'https://esdrive.expertsystem.com/public/file/AYBZoTKJ80yibBVv-z84iA/vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz'
full_precomp_targz = '/content/umbc/vec/tlgs_wnscd_ls_f_6e_160d_row_embedding.tar.gz'
!wget {full_precomp_url} -O {full_precomp_targz}

--2018-10-04 15:05:57--  https://esdrive.expertsystem.com/public/file/AYBZoTKJ80yibBVv-z84iA/vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz
Resolving esdrive.expertsystem.com (esdrive.expertsystem.com)... 195.78.200.194
Connecting to esdrive.expertsystem.com (esdrive.expertsystem.com)|195.78.200.194|:443... connected.
HTTP request sent, awaiting response... 200 Ok
Length: 1166454112 (1.1G) [application/x-gzip]
Saving to: ‘/content/umbc/vec/tlgs_wnscd_ls_f_6e_160d_row_embedding.tar.gz’


In [0]:

!tar -xzf {full_precomp_targz} -C /content/umbc/vec/
full_precomp_vec_path = '/content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d'