Welcome! This is a simple notebook to demo Small-Fry usage. If you have any questions, please feel free to contact [tginart](https://github.com/tginart).

For the purposes of this demo, we will compress 1,000 rows (words) from the official [Wiki Gigawords GloVe embeddings]. Our demo embeddings are lightweight enough to keep within the code repository.

Although Small-Fry can be used as a command-line utility, we recommend using it as an API.

First, import the library:

In [None]:
import smallfry as sfry

Space constraints:
* We are going to use the default bitrate, R = 1, for compression.
* If you have more space for your embeddings, be sure to give yourself a looser memory constraint!
* You may specify the space constraint either as an approximate memory budget (in bytes) or as a bitrate (average bits per entry in the embeddings matrix).
* For most downstream applications, the bitrate should fall somewhere between 0.1 and 3 in order to incur <1% loss in extrinsic performance. See TODO:paperlink for more details.
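To make the relationship between the two ways of stating the constraint concrete, here is a rough sketch of the arithmetic. It assumes the demo matrix is 1,000 rows by 50 dimensions (the dimension is inferred from the `glove.6B.50d` source) and it counts only the matrix entries, ignoring the word list and any metadata. The helper names are made up for illustration and are not part of the Small-Fry API.

```python
def budget_from_bitrate(n_rows, dim, bitrate):
    """Approximate payload size in bytes at `bitrate` average bits per entry."""
    total_bits = n_rows * dim * bitrate
    return total_bits / 8


def bitrate_from_budget(n_rows, dim, budget_bytes):
    """Average bits per entry that fit into `budget_bytes`."""
    return budget_bytes * 8 / (n_rows * dim)


# Assumed demo shape: 1,000 words x 50 dimensions.
demo_bytes = budget_from_bitrate(1000, 50, 1)    # default bitrate R = 1
fp32_bytes = budget_from_bitrate(1000, 50, 32)   # uncompressed 32-bit floats
print(demo_bytes, fp32_bytes)
```

Under these assumptions, R = 1 corresponds to roughly 6 KB of payload, against about 200 KB for the same matrix stored as 32-bit floats.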

Output directory:
* You may optionally specify an output directory using the `outdir` parameter.
* Otherwise, Small-Fry will write to the same directory containing the source embeddings.

Prior:
* The prior should be specified as word frequency counts over a corpus. Small-Fry automatically normalizes the frequency counts into a probability vector. 
* The prior should be a Python dictionary mapping each word to a float, saved in the `.npy` format. See [`numpy.save`](https://docs.scipy.org/doc/numpy-1.14.0/re).
* As discussed in TODO:paperlink, Small-Fry is robust to noisy priors. If you do not have a prior of your own, and your application will be processing common English, please use [this prior], collected from the wiki16 corpus [cite].
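As a hedged illustration of the format described above, the following sketch builds a toy prior, saves it with `numpy.save`, and reads it back. The words and counts are made up, and the file name `toy_prior.npy` is hypothetical. The normalization step is shown only to illustrate what turning counts into a probability vector means; how Small-Fry does this internally is not covered here.

```python
import os
import tempfile

import numpy as np

# Made-up frequency counts, for illustration only.
counts = {"the": 1200.0, "of": 640.0, "them": 160.0}

# What normalizing counts into a probability vector looks like.
total = sum(counts.values())
probs = {word: count / total for word, count in counts.items()}

# numpy.save pickles the dict into a 0-d object array inside the .npy file.
path = os.path.join(tempfile.mkdtemp(), "toy_prior.npy")
np.save(path, counts)

# Reading a pickled object back requires allow_pickle=True on recent NumPy versions;
# .item() unwraps the 0-d array into the original dict.
loaded = np.load(path, allow_pickle=True).item()
print(loaded == counts)
```

Note that the raw counts are what gets saved, since the text above says Small-Fry performs the normalization itself.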

We proceed to define inputs for the compressor.

In [None]:
source_path = "glove.head.txt" # 1,000 lines from the official glove.6B.50d embeddings
prior_path = "prior.npy" # A prior for these 1,000 words, as a dict saved in .npy format
word_rep = "trie" # Use the marisa-trie representation for the word list. It's more compact than a dict!

We are now ready to make the API call:

In [None]:
word2idx, sfry_path = sfry.compress(source_path, prior_path, word_rep=word_rep)

Your Small-Fry embeddings have been written to file! I bet they're pretty small! Let's check the file sizes before and after!

First, let's see how big the original embeddings are:

In [None]:
import os
os.popen("ls -lha glove.head.txt").read()

And now, let's check out the size of the Small-Fry embeddings:

In [None]:
os.popen("du -h glove.head.txt.sfry --apparent-size").read()

Wow! That's small!
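If you would rather not shell out to `ls` and `du` (for example, on a platform without GNU coreutils), a minimal sketch of the same comparison in pure Python might look like this. The helper names `apparent_size` and `compression_ratio` are made up for this example. The directory branch sums file sizes only, so the result can differ slightly from what `du` reports.

```python
import os


def apparent_size(path):
    """Total size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compression_ratio(original_path, compressed_path):
    """How many times smaller the compressed artifact is than the original."""
    return apparent_size(original_path) / apparent_size(compressed_path)
```

In this notebook, calling `compression_ratio(source_path, sfry_path)` should give the size reduction as a single number.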

Now, let's see how we can efficiently query for word vectors without inflating the entire `.sfry` representation.

As with other embeddings, the user is responsible for keeping track of the word representation returned by the Small-Fry compressor, in `word2idx`. Optionally, the `compress` call can save the word representation automatically, using the `write-word-rep` flag.

The path to Small-Fry's compressed embeddings is returned in `sfry_path`. Both return values are used in Small-Fry's `query` API call.
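To illustrate why the word representation has to be kept around, here is a toy sketch of the general idea: a word maps to a row index, and the index selects a row of the embeddings matrix. It uses a plain `dict` and a list of lists, with made-up words and values. Whether Small-Fry's own word representation behaves exactly like this internally is an assumption.

```python
# Toy vocabulary and matrix; all names and values here are hypothetical.
words = ["the", "of", "them"]
word2idx_toy = {word: idx for idx, word in enumerate(words)}
rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]


def lookup(word, word_index, matrix):
    """Return the matrix row for `word`, using the word-to-index mapping."""
    return matrix[word_index[word]]


print(lookup("them", word2idx_toy, rows))
```

Without the mapping, the rows of the matrix cannot be tied back to words, which is why it travels alongside the compressed embeddings.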

We proceed to define the inputs for the querying:

In [None]:
word = 'them' #The query word 
word_rep = word2idx #This can be a path to a saved word rep, or the Python object itself -- either way works.
query_path = sfry_path #This must be a path to a saved Small-Fry directory

We proceed to call the query routine:

In [None]:
word_vector = sfry.query(word, word_rep, query_path) #returns a numpy vector

print(word_vector)

But reading from disk can be slow! That's why Small-Fry can create a memory-mapped representation using `numpy.memmap`.
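For readers unfamiliar with `numpy.memmap`, the sketch below shows the general mechanism on a toy matrix: the file is mapped rather than read up front, and only the pages backing the requested row are touched. The file name, shape, and raw `float32` layout are assumptions made for this example and say nothing about the actual `.sfry` on-disk format.

```python
import os
import tempfile

import numpy as np

n_rows, dim = 1000, 50  # toy shape, chosen to resemble the demo embeddings
path = os.path.join(tempfile.mkdtemp(), "toy_matrix.bin")

# Write a toy matrix to disk as raw float32 values.
matrix = np.arange(n_rows * dim, dtype=np.float32).reshape(n_rows, dim)
matrix.tofile(path)

# Map the file read-only; no row data is loaded until it is indexed.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, dim))

# Copy a single row out of the mapping into an ordinary in-memory array.
row = np.array(mapped[42])
print(row[:3])
```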

Use `sfry.load` to generate a Python wrapper object as follows: 

In [None]:
my_smallfry = sfry.load(sfry_path, word2idx) #generates a wrapper object for memory-mapped Small-Fry

And to query with your Small-Fry wrapper, just try:

In [None]:
print(my_smallfry.query(word))

It is as easy as that! You are now ready to use Small-Fry embeddings for your favorite NLP apps!