# Extract a SNP

This notebook will extract a single variant from the "ACAF" dataset. Information about this dataset can be found here: https://support.researchallofus.org/hc/en-us/articles/14929793660948
We can run this with a very small cluster. The default Hail "genomic analysis" cluster works great!

In [None]:
import pandas
import os
bucket = os.environ['WORKSPACE_BUCKET']
bucket

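Since the rest of the notebook depends on two environment variables (WORKSPACE_BUCKET here and WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH further down), it can help to confirm that both are set before starting Hail. The following is a minimal sketch of such a check; the helper name `missing_env_vars` and the `REQUIRED_VARS` list are hypothetical and are not part of the workbench or of Hail.

```python
import os

# Environment variables read by the cells in this notebook.
REQUIRED_VARS = ["WORKSPACE_BUCKET", "WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If a variable is reported as missing, the later cells would fail with a KeyError, so it is cheaper to catch the problem here.
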
## Start Hail

This initializes the Hail backend on the Spark cluster.

In [None]:
# Initialize Hail
import hail as hl

hl.init(default_reference='GRCh38', app_name='snp-extract')

In [None]:
# Read the full matrix table (the path comes from the workspace environment)
ds_full = hl.read_matrix_table(os.environ["WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH"])

## Filter to our variant

We use the filter_intervals() function here as it leverages the index so that the entire genome does not need to be read/processed. You can find more information about this variant in AoU data here: https://databrowser.researchallofus.org/variants/rs8050136

In [None]:
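# An interval written without brackets is left-inclusive and right-exclusive,
# so this selects only position chr16:53782363.
ds_filtered = hl.filter_intervals(ds_full, [hl.parse_locus_interval("chr16:53782363-53782364")])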

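The interval string used above follows the pattern contig:start-end, with the end one base past the position of interest. As a small sketch, assuming the default left-inclusive, right-exclusive reading of an unbracketed interval, the string for any single position could be built as follows. The helper `single_position_interval` is hypothetical and not part of Hail.

```python
def single_position_interval(contig, position):
    """Build a 'contig:start-end' string that covers exactly one base.

    Assumes the interval is read as [start, end), so end = position + 1.
    """
    if position < 1:
        raise ValueError("position must be 1-based and positive")
    return f"{contig}:{position}-{position + 1}"


# Reproduces the interval string used in this notebook.
print(single_position_interval("chr16", 53782363))
# prints: chr16:53782363-53782364
```

The resulting string can be passed to hl.parse_locus_interval() in place of the hard-coded value.
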
### Check our work

This should return a single variant of interest for all samples.
Note that Hail is "lazy": the filter statement above did not actually run the filtering step. Filtering happens only when we ask for a result that depends on it, such as a count.

In [None]:
print('Samples: %d  Variants: %d' % (ds_filtered.count_cols(), ds_filtered.count_rows()))

We can also inspect the row (variant-level) fields stored for this variant in the Hail MatrixTable.

In [None]:
ds_filtered.rows().collect()

## Save the variant

We export the filtered data to our bucket as a PLINK binary fileset (.bed, .bim, and .fam files that share the fto_filtered prefix).

In [None]:
hl.export_plink(ds_filtered, f'{bucket}/fto_filtered', ind_id=ds_filtered.s, fam_id="0")
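
A PLINK binary export consists of three files with the same prefix and the extensions .bed, .bim, and .fam. The sketch below lists the paths to look for after the export finishes; the helper `plink_output_paths` and the bucket name `gs://example-bucket` are hypothetical placeholders, so substitute the value of `bucket` from the first cell.

```python
def plink_output_paths(prefix):
    """Return the three file paths of a PLINK binary fileset (hypothetical helper)."""
    return {ext: f"{prefix}.{ext}" for ext in ("bed", "bim", "fam")}


# Placeholder prefix; in the notebook this would be f'{bucket}/fto_filtered'.
paths = plink_output_paths("gs://example-bucket/fto_filtered")
for ext, path in paths.items():
    print(ext, path)
```

All three files are needed together when loading the variant into PLINK or another downstream tool.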