Welcome! This is a simple notebook to demo Small-Fry usage. If you have any questions, please feel free to contact [tginart](https://github.com/tginart).

For the purposes of this demo, we will be compressing 1,000 rows (words) from the official [Wiki Gigawords GloVe embeddings]. Our demo embeddings are lightweight enough to keep within the code repository.

Although Small-Fry can be used as a command-line utility, we recommend using it as an API.

First, import the library:

In [1]:
import smallfry as sfry

Space constraints:
* We are going to use the default bitrate, R = 1, for compression.
* If you have more space for your embeddings, be sure to give yourself a looser memory constraint!
* You may specify the constraint either as an approximate memory budget (in bytes) or as a bitrate (average bits per entry in the embedding matrix).
* For most downstream applications, keeping the loss in extrinsic performance below 1% requires a bitrate somewhere between 0.1 and 3. See TODO:paperlink for more details.
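
To make the relationship between the two ways of stating the constraint concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical and not part of the Small-Fry API, and the estimate covers only the matrix entries, ignoring any overhead (such as codebooks or metadata) that a compressed file may carry.

```python
def bitrate_to_bytes(num_rows, dim, bitrate):
    """Approximate size in bytes of a num_rows x dim matrix at `bitrate` bits per entry."""
    return num_rows * dim * bitrate / 8.0


def bytes_to_bitrate(num_rows, dim, budget_bytes):
    """Average bits per entry that fit in `budget_bytes` for a num_rows x dim matrix."""
    return budget_bytes * 8.0 / (num_rows * dim)


# The demo matrix has 1,000 rows of 50 dimensions each.
print(bitrate_to_bytes(1000, 50, 1))     # 6250.0
print(bytes_to_bitrate(1000, 50, 6250))  # 1.0
```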

Output directory:
* You must specify an output directory using the `outdir` parameter.
* The compressed embeddings will be written to this directory.

Prior:
* The prior should be specified as word frequency counts over a corpus. Small-Fry automatically normalizes the frequency counts into a probability vector. 
* The prior should be a Python dictionary mapping each word to a float, saved in the `.npy` format. See [`numpy.save`](https://docs.scipy.org/doc/numpy-1.14.0/re)
* Small-Fry is robust to noisy priors. If you do not have a prior of your own, and your application will be processing common English, please use [this prior], collected from the wiki16 corpus [cite].
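
As a rough illustration of the prior format described above, the sketch below saves a word-to-count dictionary with `numpy.save`, loads it back, and normalizes the counts into probabilities. The counts and the temporary file path are made up for the example, and the normalization is only a plain-Python picture of the idea, not Small-Fry's own code.

```python
import os
import tempfile

import numpy as np

# Hypothetical raw frequency counts; real counts would come from a corpus.
counts = {"the": 6.0, "of": 3.0, "them": 1.0}

# Save the dictionary in .npy format (NumPy pickles non-array objects).
prior_path = os.path.join(tempfile.mkdtemp(), "prior.npy")
np.save(prior_path, counts)

# Loading a pickled object requires allow_pickle=True; .item() unwraps the dict.
loaded = np.load(prior_path, allow_pickle=True).item()

# Normalize the counts so that they sum to one.
total = sum(loaded.values())
probs = {word: count / total for word, count in loaded.items()}
print(probs["the"])  # 0.6
```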

Before making the call, let's describe the two arguments we will need. The `sourcepath` argument points to 1,000 lines from the official glove.6B.50d embeddings, while `priorpath` points to a prior for these 1,000 words, stored as a dictionary in `.npy` format. We are now ready to make the API call:

In [2]:
word2idx, sfry_path = sfry.compress(sourcepath="data/glove.head.txt",
                                    priorpath="data/prior.npy",
                                    outdir="data/glove.head.sfry")

Saving Small-Fry representation to file: data/glove.head.sfry
Compression complete!!!


Your Small-Fry embeddings have been written to file!

If we want to query them, let's load them into 

I bet you they're pretty small! Let's check the filesizes before and after!

First, let's see how big the original embeddings are:

In [3]:
import os
print(str(os.path.getsize("data/glove.head.txt")) + " bytes")

428849 bytes


And now, let's check out the size of the Small-Fry embeddings:
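
A minimal sketch of one way to do this, using only the standard library. The helper `path_size_bytes` is hypothetical (it is not part of Small-Fry); it returns the size of a single file or the total size of the files under a directory, so it works whichever of the two `sfry_path` points to. The demonstration runs on a throwaway directory so that it is self-contained.

```python
import os
import tempfile


def path_size_bytes(path):
    """Return the size of a file, or the total size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Self-contained check on a temporary directory holding 100 + 28 bytes.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.bin"), "wb") as f:
    f.write(b"\0" * 100)
with open(os.path.join(demo_dir, "b.bin"), "wb") as f:
    f.write(b"\0" * 28)
print(str(path_size_bytes(demo_dir)) + " bytes")  # 128 bytes
```

In this notebook, the corresponding call would be `path_size_bytes(sfry_path)`.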

Wow! That's small!

Now, let's see how we can efficiently query for word vectors without inflating the entire `.sfry` representation.

As with other embeddings, the user is responsible for keeping track of the word representation returned by the Small-Fry compressor in `word2idx`. Optionally, the `compress` call can save the word representation automatically when given the `write-word-rep` flag.

The path to Small-Fry's compressed embeddings is returned in `sfry_path`. Both return values are used in Small-Fry's `query` API call.
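
For intuition, a word representation of this kind can be pictured as a dictionary from each word to its row in the embedding matrix. The sketch below builds such a mapping from lines in GloVe's text format (a word followed by its vector components). This is an assumption for illustration only: the sample values are made up, and the exact structure Small-Fry returns in `word2idx` may differ.

```python
# Two illustrative lines in GloVe text format: a word, then its vector components.
sample_lines = [
    "the 0.1 0.2 0.3",
    "of 0.4 0.5 0.6",
]

# Assumed layout: each word maps to the index of its row in the source file.
example_word2idx = {}
for row, line in enumerate(sample_lines):
    word = line.split(" ", 1)[0]
    example_word2idx[word] = row

print(example_word2idx)  # {'the': 0, 'of': 1}
```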

We proceed to define the inputs for the querying:

We proceed to call the query routine on the word 'them':

But reading from disk can be slow! That's why Small-Fry can create a memory-mapped representation using `numpy.memmap`.
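
To show what memory mapping buys in general, here is a small standalone `numpy.memmap` sketch. It assumes a raw, uncompressed float32 matrix on disk as a stand-in; this is not Small-Fry's actual file format, and the file name is made up for the example.

```python
import os
import tempfile

import numpy as np

# Write a small float32 matrix to disk as raw bytes (a stand-in for stored embeddings).
rows, dim = 1000, 50
matrix = np.arange(rows * dim, dtype=np.float32).reshape(rows, dim)
path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
matrix.tofile(path)

# Map the file into memory; data is paged in on access rather than read up front.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(rows, dim))

# Copy out a single row; only the pages backing this slice need to be read.
row = np.array(mapped[42])
print(row[:3])  # [2100. 2101. 2102.]
```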

Use `sfry.load` to generate a Python wrapper object as follows: 

In [4]:
my_smallfry = sfry.load(sfry_path, word2idx)  # generates a wrapper object for memory-mapped Small-Fry
print(my_smallfry.get_size())

8810.0


And to query with your Small-Fry wrapper, just try:

In [5]:
print(my_smallfry.query("the"))

[ 0.42144004  0.25169128 -0.45076984  0.09734009  0.33648273 -0.05103369
 -0.45076984 -0.18810399  0.02078924 -0.6736526   0.25169128 -0.11995012
 -0.55109245  0.17368607  0.02078924  0.02078924  0.09734009 -0.11995012
 -0.80504096 -0.11995012 -0.05103369 -0.3521969  -0.18810399 -0.2619221
 -0.18810399 -1.8357271  -0.80504096  0.09734009 -0.45076984 -0.18810399
  4.0157175  -0.18810399 -0.55109245 -0.3521969   0.02078924  0.02078924
  0.17368607 -0.18810399  0.02078924 -0.05103369 -0.2619221  -0.18810399
 -0.3521969  -0.05103369 -0.45076984  0.17368607  0.02078924 -0.18810399
 -0.11995012 -0.80504096]


It is as easy as that! You are now ready to use Small-Fry embeddings for your favorite NLP apps!