# Data Encoding

This notebook provides a short walkthrough of some of the data encoding features of the `sharrow` library.

In [None]:
# HIDDEN
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [None]:
import numpy as np
import pandas as pd
import xarray as xr
from io import StringIO

import sharrow as sh
sh.__version__

In [None]:
# check versions
import packaging.version  # import the submodule explicitly; a bare `import packaging` may not load it
assert packaging.version.parse(xr.__version__) >= packaging.version.parse("0.20.2")

## Example Data

We'll begin by importing some example data to work with.  We'll be using 
some test data taken from the MTC example in the ActivitySim project. For
this data encoding walkthrough, we'll focus on the
skims containing transportation level of service information for travel around
a tiny slice of San Francisco.

We'll load them as a multi-dimensional `xarray.Dataset` (or, more exactly, a 
`sharrow.Dataset`, which is a subclass of the xarray version that adds some 
useful features, including compatibility with automatic tools for recoding data).

In [None]:
skims = sh.example_data.get_skims()
skims

Because sharrow uses the `xarray.Dataset` format to work with data, individual 
variables in each Dataset can be encoded in different data types.
For example, automobile travel times can be stored with 
high(er) precision floating point numbers, while transit 
fares, which vary less and have a narrower range, can be 
stored with lower precision.  This allows a user to choose 
the most efficient encoding for each variable, if desired. 
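
As a rough illustration of why per-variable data types matter, the sketch below builds two plain NumPy arrays and compares their memory footprints. The variable names, array shape, and random values are made up for this example and are not taken from the skims.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (25, 25)  # hypothetical zone-to-zone matrix size

# Hypothetical variables: travel times kept at higher precision,
# fares (a handful of distinct values) kept at lower precision.
variables = {
    "travel_time": rng.uniform(0, 90, size=shape).astype(np.float64),
    "fare": rng.choice([0.0, 152.0, 474.0, 626.0], size=shape).astype(np.float32),
}

for name, arr in variables.items():
    print(name, arr.dtype, arr.nbytes)
# travel_time float64 5000
# fare float32 2500
```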

## Fixed Point Encoding

Very often, data (especially skim matrices like the ones here) can be expressed adequately 
with far less precision than a standard 32-bit floating point representation allows.
In these cases, it may be beneficial to store this 
data with "fixed point" encoding, which is also 
sometimes called scaled integers.

Instead of storing values as 32-bit floating point values, 
they could be multiplied by a scale factor (e.g., 100) 
and then converted to 16-bit integers. This uses half the
RAM and can still express any value (to two decimal 
places) from -327.68 to 327.67.  If the lowest 
values in that range are never needed, the range can also be shifted,
moving both the bottom and top limits by a fixed amount. Then, 
for a particular scale $\mu$ and shift $\xi$ (stored in metadata),
the implied (original) value $x$ can quickly be recovered from 
any array element $i$ by evaluating $(i / \mu) - \xi$.
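
The scale-and-shift arithmetic described above can be sketched with plain NumPy. The helper names `encode_fixed_point` and `decode_fixed_point` are hypothetical and are not part of sharrow's API; they follow the convention of this paragraph, where values are multiplied by $\mu$ to encode and divided by $\mu$ to decode.

```python
import numpy as np

def encode_fixed_point(x, mu=100, xi=0.0, dtype=np.int16):
    """Shift, scale, and round floats into a small integer type (illustrative only)."""
    scaled = np.round((np.asarray(x, dtype=np.float64) + xi) * mu)
    info = np.iinfo(dtype)
    if scaled.min() < info.min or scaled.max() > info.max:
        raise ValueError("values do not fit in the target integer type")
    return scaled.astype(dtype)

def decode_fixed_point(i, mu=100, xi=0.0):
    """Recover the implied values by evaluating (i / mu) - xi."""
    return (np.asarray(i) / mu - xi).astype(np.float32)

x = np.array([0.12, 0.24, 0.44, 300.0], dtype=np.float32)
i = encode_fixed_point(x)
x_back = decode_fixed_point(i)

print(i.tolist())          # [12, 24, 44, 30000]
print(x.nbytes, i.nbytes)  # 16 8
print(np.allclose(x, x_back))  # True

# A value of 500.0 does not fit with xi=0, but does after shifting by xi=-300.
i_shifted = encode_fixed_point(np.array([500.0]), xi=-300.0)
x_shifted = decode_fixed_point(i_shifted, xi=-300.0)
print(i_shifted.tolist(), x_shifted.tolist())  # [20000] [500.0]
```

The round trip is exact here only because the inputs already have two decimal places; values with more digits would be rounded to the nearest hundredth.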

Sharrow includes a pair of functions to encode and decode arrays in
this manner. These functions also attach the necessary metadata
to the Dataset objects, so that later when we construct `sharrow.Flow` 
instances, they can decode arrays automatically.

In [None]:
from sharrow.digital_encoding import array_encode, array_decode

The distance data in the skims is a great candidate for fixed point
encoding.  We can peek at the top corner of this array:

In [None]:
skims.DIST.values[:2,:3]

The data are all small(ish) values recorded to two decimal places,
so we can probably encode them efficiently by scaling by 100.
If we're not sure, we can confirm by checking the range of values, to make
sure it fits inside the 16-bit integers we're hoping to use.

In [None]:
skims.DIST.values.min(), skims.DIST.values.max()

That's a really small range because this is only test data.  But even 
the full-scale MTC skims spanning the entire region don't contain distances
over 300 miles.
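
To double check that claim against the integer limits, the representable range for a given scale and shift can be computed from `numpy.iinfo`. The helper `representable_range` below is a hypothetical name used only for this sketch, and it uses the multiply-by-100 convention from the prose rather than any sharrow function.

```python
import numpy as np

def representable_range(mu=100, xi=0.0, dtype=np.int16):
    """Smallest and largest values expressible as (i / mu) - xi for integer i."""
    info = np.iinfo(dtype)
    return info.min / mu - xi, info.max / mu - xi

lo, hi = representable_range()
print(lo, hi)             # -327.68 327.67
print(lo <= 300.0 <= hi)  # True
```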

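We can create a new DataArray and apply fixed point encoding using the
`array_encode` function. Note that the `scale` argument below is 0.01, not 100:
judging by the encoded values, it is the factor applied to the stored integers 
to recover the original values, i.e. the reciprocal of the multiplier described above.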

In [None]:
distance_encoded = array_encode(skims.DIST, scale=0.01, offset=0)
distance_encoded.values[:2,:3]

In [None]:
# TEST encoding
assert distance_encoded.dtype == np.int16
np.testing.assert_array_equal(
    distance_encoded.values[:2,:3],
    np.array([[12, 24, 44], [37, 14, 28]], dtype=np.int16)
)
assert distance_encoded.attrs['digital_encoding'] == {'scale': 0.01, 'offset': 0, 'missing_value': None}

We can apply that function for any number of variables in the skims, and
create a new Dataset that includes the encoded arrays.

In [None]:
skims_encoded = skims.assign(
    {'DIST': array_encode(skims.DIST, scale=0.01, offset=0)}
)

To manage the digital encodings across an entire dataset, sharrow implements
a `digital_encoding` accessor.  You can use it to apply encodings to one or more
variables in a simple fashion.

In [None]:
skims_encoded = skims_encoded.digital_encoding.set(['DISTWALK', 'DISTBIKE'], scale=0.01, offset=0)

And you can review the encodings for every variable in the dataset like this:

In [None]:
skims_encoded.digital_encoding.info()

In [None]:
# TEST
assert skims_encoded.digital_encoding.info() == {
 'DIST': {'scale': 0.01, 'offset': 0, 'missing_value': None},
 'DISTBIKE': {'scale': 0.01, 'offset': 0, 'missing_value': None},
 'DISTWALK': {'scale': 0.01, 'offset': 0, 'missing_value': None},
}

To demonstrate that the encoding works transparently with a `Flow`,
we can construct a simple flow that extracts the distance and 
square of distance for the top corner of values we looked at above.

First we'll do so for a flow with the original float32 encoded skims.

In [None]:
pairs = pd.DataFrame({'orig': [0,0,0,1,1,1], 'dest': [0,1,2,0,1,2]})
tree = sh.DataTree(
    base=pairs, 
    skims=skims.drop_dims('time_period'), 
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow = tree.setup_flow({'d1': 'DIST', 'd2': 'DIST**2'})
arr = flow.load()
arr

We can do the same for the encoded skims, and we get the
same result (to within floating point tolerance), even though the 
encoded skims use less RAM.

In [None]:
tree_enc = sh.DataTree(
    base=pairs, 
    skims=skims_encoded.drop_dims('time_period'), 
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow_enc = tree_enc.setup_flow({'d1': 'DIST', 'd2': 'DIST**2'}, hashing_level=2)
arr_enc = flow_enc.load()
arr_enc

In [None]:
# TEST
np.testing.assert_array_almost_equal(arr, arr_enc)

Because the encoded flow uses exactly the same definition as the first flow, 
only with a modified DataTree, we passed `hashing_level=2` above to avoid 
accidentally picking up and running the compiled code from the first flow.
That code expects a float32 array rather than a scaled int16 array, so 
reusing it gives erroneous results, as the following cell shows.

In [None]:
tree_bad = sh.DataTree(
    base=pairs, 
    skims=skims_encoded.drop_dims('time_period'), 
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow_bad = tree_bad.setup_flow({'d1': 'DIST', 'd2': 'DIST**2'})
arr_bad = flow_bad.load()
arr_bad

In [None]:
# TEST
np.testing.assert_raises(
    AssertionError, 
    np.testing.assert_array_almost_equal, arr, arr_bad
)


## Dictionary Encoding

For skim matrices where the universe of all possible 
cell values can be adequately represented by just 255 
unique values, we can use an explicit mapping process
called "dictionary encoding", which works by storing 
those unique values in a tiny base array.  Then, in the 
main body of the skim data, we store only pointers to 
positions in that base array. This reduces the marginal 
memory footprint of each array cell to just an 8-bit 
integer, reducing memory requirements by up to 75% for 
these arrays compared to float32. This approach is 
particularly appropriate for many transit skims, as fares, 
wait times, and transfers can almost always be reduced 
to a dictionary encoding with no meaningful information 
loss.
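
The idea can be sketched in a few lines of NumPy, independent of sharrow. This is a minimal illustration under stated assumptions, not the library's implementation: the small `fares` array is made up for the example, and the 255-value limit is simply the one quoted in the text above.

```python
import numpy as np

# Hypothetical fare matrix with only a few distinct values.
fares = np.array([[0.0, 152.0, 152.0],
                  [474.0, 0.0, 626.0]], dtype=np.float32)

# np.unique returns the sorted unique values; with return_inverse=True it
# also returns, for each element, its position within that sorted array.
dictionary, positions = np.unique(fares, return_inverse=True)
if dictionary.size > 255:  # limit quoted in the text above
    raise ValueError("too many unique values for an 8-bit dictionary encoding")

# Store only the 8-bit pointers in the main array body.
pointers = positions.reshape(fares.shape).astype(np.uint8)

# Decoding is a single indexing operation into the base array.
decoded = dictionary[pointers]

print(dictionary.tolist())            # [0.0, 152.0, 474.0, 626.0]
print(pointers.tolist())              # [[0, 1, 1], [2, 0, 3]]
print(fares.nbytes, pointers.nbytes)  # 24 6
print(np.array_equal(decoded, fares)) # True
```

For an array this small the base array outweighs the savings, which is why the 75% figure is an upper bound that is approached only as the main array grows.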

For example, the `'WLK_LOC_WLK_FAR'` array containing fares
only has four unique values:

In [None]:
np.unique(skims.WLK_LOC_WLK_FAR)

We can see various fares applied at different time periods if we
look at the top corner of the array:

In [None]:
skims.WLK_LOC_WLK_FAR.values[:2,:3,:]

Once encoded, the array itself contains only offset pointers (small integers),
and the original values are stored in metadata.

In [None]:
wlwfare_enc = array_encode(skims.WLK_LOC_WLK_FAR, bitwidth=8, by_dict=True)
wlwfare_enc.values[:2,:3,:]

In [None]:
wlwfare_enc.attrs['digital_encoding']['dictionary']

In [None]:
# TEST encoding
assert wlwfare_enc.dtype == np.uint8
np.testing.assert_array_equal(
    wlwfare_enc.values[:2,:3,:],
    np.array([[[0, 0, 0, 0, 0],
        [1, 2, 2, 1, 2],
        [1, 2, 2, 1, 2]],

       [[1, 1, 2, 2, 1],
        [0, 0, 0, 0, 0],
        [1, 2, 2, 1, 2]]], dtype=np.uint8)
)
np.testing.assert_array_equal(
    wlwfare_enc.attrs['digital_encoding']['dictionary'],
    np.array([   0.,  152.,  474.,  626.], dtype=np.float32)
)

If we want to recover the original data for analysis (other than in
a Flow, which can decode it automatically), we can use the `array_decode` function.

In [None]:
array_decode(wlwfare_enc)

In [None]:
# TEST
xr.testing.assert_equal(array_decode(wlwfare_enc), skims.WLK_LOC_WLK_FAR)