# Creating decoding quantities

You can create decoding quantities either from scratch or from pre-computed CSFS values.

In addition, there are a number of built-in defaults that can be selected (see [here](https://github.com/PalamaraLab/ASMC_data) for details):
 - built-in haploid demographies ("ACB", "ASW", "BEB", "CDX", "CEU", "CHB", "CHS", "CLM", "ESN", "FIN", "GBR", "GIH", "GWD", "IBS", "ITU", "JPT", "KHV", "LWK", "MSL", "MXL", "PEL", "PJL", "PUR", "STU", "TSI", "YRI")
 - built-in frequency information, currently available only for UKBB with 50, 100, 200 or 300 samples

The discretization can be loaded from file or specified by providing the required quantiles. Full details [here](https://github.com/PalamaraLab/PrepareDecoding/blob/master/docs/api.md#discretization).

Further details are available for:
 - [the API](https://github.com/PalamaraLab/PrepareDecoding/blob/master/docs/api.md)
 - [file formats](https://github.com/PalamaraLab/PrepareDecoding/blob/master/docs/file_formats.md)

## Creating decoding quantities from scratch

If you do not have pre-computed CSFS values, then these will be generated when calculating the decoding quantities.
This step will likely dominate the runtime, so you may wish to save the CSFS to file and re-use them for subsequent runs.

To load CSFS from file instead, specify the `csfs_file` parameter, as shown in the "Using precomputed CSFS" section below.

In [None]:
import pathlib

# Import only the two functions used in this notebook
from asmc.preparedecoding import prepare_decoding, save_demography

In [None]:
files_dir = (pathlib.Path('..') / 'test' / 'regression').resolve()

demo_file = str(files_dir / 'input_CEU.demo')
disc_file = str(files_dir / 'input_30-100-2000.disc')
freq_file = str(files_dir / 'input_UKBB.frq')

# Calculate with a discretization file
dq = prepare_decoding(
    demography=demo_file,
    discretization=disc_file,
    frequencies=freq_file,
    samples=50, # Use a larger number (300 is suggested) for real analysis
)

# Or calculate with pre-defined quantiles, and a user-defined number of additional quantiles
dq = prepare_decoding(
    demography=demo_file,
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies=freq_file,
    samples=50, # Use a larger number (300 is suggested) for real analysis
)

# Or use a built-in demography
dq = prepare_decoding(
    demography='CEU',
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies=freq_file,
    samples=50, # Use a larger number (300 is suggested) for real analysis
)

# Or use built-in frequency information from UKBB
dq = prepare_decoding(
    demography='CEU',
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies='UKBB',
    samples=50, # Use a larger number (300 is suggested) for real analysis
)
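To make the quantile form of `discretization` more concrete, the sketch below expands the fixed part of a specification such as `[[30.0, 15], [100.0, 15], 39]` into interval boundaries. It assumes that each pair means `[step size in generations, number of steps]` and that the trailing integer is the number of additional quantiles the library places afterwards. That reading is an assumption here, so check it against the discretization section of the API documentation linked above. `fixed_boundaries` is a hypothetical helper for illustration, not part of `asmc.preparedecoding`.

```python
def fixed_boundaries(spec):
    """Expand the [step, count] pairs of a discretization spec into boundaries.

    Assumption: each pair is [step size, number of steps], starting at 0.0.
    The trailing integer (additional quantiles) is returned untouched, because
    those points depend on the demography and are computed by the library.
    """
    *pairs, n_additional = spec
    boundaries = [0.0]
    for step, count in pairs:
        for _ in range(count):
            # Each new boundary sits one step beyond the previous one
            boundaries.append(boundaries[-1] + step)
    return boundaries, n_additional


boundaries, n_additional = fixed_boundaries([[30.0, 15], [100.0, 15], 39])
print(boundaries[:4], boundaries[-1], n_additional)
```

Under this assumption, the example describes 15 intervals of 30 generations followed by 15 intervals of 100 generations, with 39 further quantiles chosen by the library.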

## Write the decoding quantities to file

Once you have generated the decoding quantities object, you can save it to file.
It may also be worth saving the CSFS to file, so that they do not need to be recalculated in subsequent runs.

In [None]:
# This will create a file `files_dir/output.decodingQuantities.gz`
dq.save_decoding_quantities(str(files_dir / 'output'))

# This will create a file `files_dir/output.csfs`
dq.save_csfs(str(files_dir / 'output'))

# You can also save other files that may be useful:
dq.save_intervals(str(files_dir / 'output'))
dq.save_discretization(str(files_dir / 'output'))
save_demography(str(files_dir), 'CEU')

## Using precomputed CSFS

If you have pre-computed CSFS values, you can use those to speed up calculation of the decoding quantities.

In [None]:
dq = prepare_decoding(
    csfs_file=str(files_dir / 'output.csfs'),
    demography='CEU',
    discretization=[[30.0, 15], [100.0, 15], 39],
    frequencies='UKBB',
    samples=50, # Use a larger number (300 is suggested) for real analysis
)
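Saving and re-loading CSFS can be combined into a simple cache: pass `csfs_file` only when a saved file already exists. The sketch below is one way to do this, assuming (as the comments in the saving cell state) that `dq.save_csfs(root)` writes `<root>.csfs`. `csfs_kwargs` is a hypothetical helper, not part of `asmc.preparedecoding`.

```python
import pathlib
import tempfile


def csfs_kwargs(output_root):
    """Return {'csfs_file': path} if `<output_root>.csfs` exists, else {}.

    `output_root` is the same root that would be passed to `dq.save_csfs`.
    The suffix is appended as a string so that dots in the root are preserved.
    """
    csfs_path = pathlib.Path(str(output_root) + '.csfs')
    return {'csfs_file': str(csfs_path)} if csfs_path.is_file() else {}


# Demonstration in a temporary directory, without calling the library
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp) / 'output'
    print(csfs_kwargs(root))                      # no CSFS file yet
    pathlib.Path(str(root) + '.csfs').touch()     # stand-in for dq.save_csfs(str(root))
    print(csfs_kwargs(root))                      # now points at the saved file
```

The result can then be unpacked into the call, for example `prepare_decoding(demography='CEU', discretization=..., frequencies='UKBB', samples=50, **csfs_kwargs(files_dir / 'output'))`, followed by `dq.save_csfs(...)` on the first run so that later runs pick up the file.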

## Accessing properties of the decoding quantities object

In [None]:
{"states": dq.states, "samples": dq.samples, "mu": dq.mu}

Eigen matrices are converted to numpy arrays:

In [None]:
X = dq.compressedEmission
type(X), X.shape

Maps can be iterated over. They are not directly converted to Python dicts to allow [passing by reference](https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#making-opaque-types) if necessary.

In [None]:
len([x for x in dq.CSFS])

In [None]:
c0 = dq.CSFS[0]
# Only the last expression in a cell is displayed, so include the shape in the same dict
{"mu": c0.mu, "from": c0.csfsFrom, "to": c0.csfsTo, "samples": c0.samples, "shape": c0.csfs.shape}
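If a plain Python dict is more convenient than the bound map, one option is to copy the entries explicitly. The sketch below assumes that iterating over the map yields its keys and that indexing by key returns the value, which matches the way `dq.CSFS` is iterated and indexed in the cells above but is not confirmed here. `map_to_dict` and `FakeMap` are hypothetical names used only for illustration; `FakeMap` stands in for the bound map so the example runs without the library.

```python
def map_to_dict(opaque_map):
    """Copy a key-iterable, indexable map into a plain Python dict."""
    return {key: opaque_map[key] for key in opaque_map}


class FakeMap:
    """Stand-in for a bound map: iterating yields keys, indexing yields values."""

    def __init__(self, data):
        self._data = dict(data)

    def __iter__(self):
        return iter(self._data)

    def __getitem__(self, key):
        return self._data[key]


snapshot = map_to_dict(FakeMap({0.0: 'first', 30.0: 'second'}))
print(snapshot)
```

The resulting dict is a snapshot of the mapping itself: entries added to the original map later will not appear in it, and the values are not deep-copied.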