# Parsing the SFS from a VCF file

fastDFE provides parser utilities that enable convenient parsing of frequency spectra from VCF files. By default, {class}`~fastdfe.parser.Parser` looks at the ``AA`` tag in the VCF file's info field to retrieve the correct polarization. Sites for which this tag is not well-defined are by default included (see {attr}`~fastdfe.parser.Parser.skip_not_polarized`). Note that non-polarized frequency spectra provide little information on the distribution of beneficial mutations.
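To make the role of the ``AA`` tag concrete, here is a minimal sketch of polarizing a single biallelic site. It is not fastDFE's implementation: the function name `derived_allele_count` and its arguments are hypothetical, and it only shows the idea of counting derived alleles relative to the ancestral allele.

```python
from typing import Optional, Sequence


def derived_allele_count(
    ref: str,
    alt: str,
    aa: Optional[str],
    alleles: Sequence[str],
) -> Optional[int]:
    """Count derived alleles at a biallelic site (illustrative helper only).

    ref, alt : the two alleles observed at the site
    aa       : ancestral allele, e.g. as read from an ``AA`` info tag
    alleles  : the sampled alleles, one entry per haploid genome
    """
    # The site can only be polarized if the ancestral allele is one of
    # the two observed alleles; otherwise signal "not polarized".
    if aa not in (ref, alt):
        return None

    # Every sampled allele that differs from the ancestral one is derived.
    return sum(1 for allele in alleles if allele != aa)


sample = ["A", "A", "G", "G", "G"]
count_if_g_ancestral = derived_allele_count("A", "G", "G", sample)
count_if_a_ancestral = derived_allele_count("A", "G", "A", sample)
count_if_unknown = derived_allele_count("A", "G", "N", sample)
```

The same sample gives a different derived allele count depending on which allele is ancestral, which is why the polarization matters for the resulting spectrum. How unpolarized sites (`None` here) are handled is left to the caller in this sketch.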

We might also want to stratify the SFS by some property of the sites, such as synonymous vs. non-synonymous mutations. This is done by passing a list of stratifications to the parser. In this example, we stratify the SFS by synonymous vs. non-synonymous mutations using a VCF file for *Betula* spp.

In [None]:
from fastdfe import Parser, DegeneracyStratification, Spectra

# instantiate parser
p = Parser(
    n=10,
    vcf="../../resources/genome/betula/all.vcf.gz",
    stratifications=[DegeneracyStratification()]
)

# parse SFS
spectra: Spectra = p.parse()

Counting sites: 2439037it [02:16, 17821.79it/s]
INFO:fastdfe:Using stratification: [neutral, selected].
INFO:fastdfe:Starting to parse.
Processing sites:  79%|███████▉  | 1935472/2439037 [12:56<03:15, 2581.94it/s]

In [None]:
# visualize SFS
spectra.plot();

fastDFE relies on VCF info tags to determine the degeneracy of a site, but this behaviour can be customized (cf. {class}`~fastdfe.parser.DegeneracyStratification`).
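Conceptually, a stratified SFS is one frequency spectrum per site type. The following is a minimal sketch of that bookkeeping in plain Python, not fastDFE's implementation: the function `stratified_sfs` and its input format are hypothetical, and the type labels `neutral` and `selected` are borrowed from the log output above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def stratified_sfs(
    sites: Iterable[Tuple[str, int]],
    n: int,
) -> Dict[str, List[int]]:
    """Build one SFS per site type (illustrative helper only).

    sites : pairs of (type label, derived allele count) for each site
    n     : sample size, so each spectrum has n + 1 entries (counts 0..n)
    """
    spectra: Dict[str, List[int]] = defaultdict(lambda: [0] * (n + 1))

    for site_type, derived_count in sites:
        # Increment the bin of this site's derived allele count
        # within the spectrum belonging to its type.
        spectra[site_type][derived_count] += 1

    return dict(spectra)


example_sites = [
    ("neutral", 1),
    ("neutral", 1),
    ("selected", 2),
    ("neutral", 0),
]
example_spectra = stratified_sfs(example_sites, n=4)
```

Each site contributes to exactly one spectrum, so the per-type spectra sum to the unstratified SFS.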

We can also combine several stratifications by passing more than one to the parser. In this example, we stratify the SFS by synonymous vs. non-synonymous mutations and by ancestral base.

In [None]:
from fastdfe import AncestralBaseStratification

# instantiate parser
p = Parser(
    n=10,
    vcf="../../resources/genome/betula/all.vcf.gz",
    stratifications=[DegeneracyStratification(), AncestralBaseStratification()]
)

# parse SFS
spectra: Spectra = p.parse()

In [None]:
# visualize SFS
spectra.plot();

Note that fastDFE requires the ancestral state of sites to be determined. {class}`~fastdfe.parser.Parser` does this by looking at the ``AA`` field, but this can be customized. There is currently no easy method for determining the ancestral states directly from a VCF file, but we are working on implementing one. You can also define custom stratifications by extending {class}`~fastdfe.parser.Stratification`.
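
As a rough illustration of what a custom stratification does, the sketch below assigns each site to a transition or transversion class. It deliberately does not import fastDFE: the base class `SiteStratification`, its methods `get_type` and `get_types`, and the dict-based site representation are all assumptions made for this example, so consult the {class}`~fastdfe.parser.Stratification` reference for the actual interface to extend.

```python
from abc import ABC, abstractmethod
from typing import List, Optional

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


class SiteStratification(ABC):
    """Hypothetical stand-in for a stratification base class (not fastDFE's)."""

    @abstractmethod
    def get_type(self, site: dict) -> Optional[str]:
        """Return the type label of a site, or None if it cannot be typed."""

    @abstractmethod
    def get_types(self) -> List[str]:
        """Return all type labels this stratification can produce."""


class TransitionTransversion(SiteStratification):
    """Classify SNPs as transitions or transversions."""

    def get_types(self) -> List[str]:
        return ["transition", "transversion"]

    def get_type(self, site: dict) -> Optional[str]:
        ref, alt = site.get("ref"), site.get("alt")
        bases = PURINES | PYRIMIDINES

        # Only single-base substitutions between distinct bases are typed.
        if ref not in bases or alt not in bases or ref == alt:
            return None

        # A transition stays within purines or within pyrimidines.
        same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
        return "transition" if same_class else "transversion"


strat = TransitionTransversion()
type_ag = strat.get_type({"ref": "A", "alt": "G"})
type_ac = strat.get_type({"ref": "A", "alt": "C"})
type_indel = strat.get_type({"ref": "A", "alt": "AT"})
```

The design idea is that a stratification only needs to map a site to a label and declare the set of possible labels; sites for which it returns `None` cannot be assigned to any spectrum.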