# Processing a VCF file using Hail

In [2]:
%scala
displayHTML("https://us.v-cdn.net/5019796/uploads/FileUpload/8d/a8143b3b21ee2f98fc83b8528f469b.png")

In [3]:
%scala
displayHTML("https://en.wikipedia.org/wiki/Variant_Call_Format")

## This Notebook is based on the tutorial [Analyzing 1000 Genomes with Spark and Hail](https://docs.databricks.com/spark/latest/training/1000-genomes.html)

## Cluster setup

In the Databricks interface, click the `Clusters` icon in the left sidebar and then `+Create Cluster`. In the cluster creation dialog, click `Show advanced settings` at the bottom, open the `Spark` tab, and paste the text below into the `Spark config` box.

```
spark.hadoop.io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec
spark.sql.files.openCostInBytes 1099511627776
spark.sql.files.maxPartitionBytes 1099511627776
spark.hadoop.mapreduce.input.fileinputformat.split.minsize 1099511627776
spark.hadoop.parquet.block.size 1099511627776
```

Start the cluster and attach this notebook to it by clicking your cluster name in the `Detached` menu at the top left of this notebook. Now you're ready to Hail!

In [6]:
from hail import *
hc = HailContext(sc)

In [7]:
import pandas as pd
import numpy as np
from math import log, isnan
import seaborn

In [8]:
display(dbutils.fs.ls("/databricks-datasets/hail/data-001"))

path,name,size
dbfs:/databricks-datasets/hail/data-001/1kg_annotations.txt,1kg_annotations.txt,22784
dbfs:/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz,1kg_sample.vcf.bgz,39767725
dbfs:/databricks-datasets/hail/data-001/purcell5k.interval_list,purcell5k.interval_list,192078


## Locate the data

In [10]:
vcf_path = '/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz'
annotation_path = '/databricks-datasets/hail/data-001/1kg_annotations.txt'
purcell_5k_path = '/databricks-datasets/hail/data-001/purcell5k.interval_list'

In [12]:
vds = hc.import_vcf(vcf_path)

This method produces a [VariantDataset](https://hail.is/hail/hail.VariantDataset.html), Hail's primary representation of genomic data. The link below leads to Hail's Python API documentation, which describes the many methods it offers.

In [14]:
%scala 
displayHTML("https://hail.is/docs/0.1/overview.html#variant-dataset-vds")
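
For orientation, a VCF data line is tab-separated: eight fixed site-level columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), then a FORMAT column and one column per sample. The sketch below parses a single made-up line with the standard library only. The sample names `S1` and `S2` and all values are invented for illustration, and this is not how Hail reads VCFs internally.

```python
# A minimal sketch: split one VCF data line into site-level and per-sample fields.
# The header and record below are invented for illustration.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
record = "1\t12345\t.\tG\tA\t50.0\t.\tAC=2;AF=0.5\tGT:DP:GQ\t0/1:12:45\t./.:3:9"

def parse_vcf_line(header_line, data_line):
    columns = header_line.lstrip("#").split("\t")
    values = data_line.split("\t")
    site = dict(zip(columns[:8], values[:8]))
    # INFO is a semicolon-separated list of KEY=VALUE pairs (flags carry no value).
    site["INFO"] = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in site["INFO"].split(";")
    )
    # FORMAT names the colon-separated fields found in each sample column.
    format_keys = values[8].split(":")
    genotypes = {
        sample: dict(zip(format_keys, value.split(":")))
        for sample, value in zip(columns[9:], values[9:])
    }
    return site, genotypes

site, genotypes = parse_vcf_line(header, record)
print(site["POS"], site["INFO"])
print(genotypes["S1"], genotypes["S2"])
```

The per-sample `DP` and `GQ` fields are the ones the genotype filter later in this notebook refers to as `g.dp` and `g.gq`.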

In [15]:
vds = vds.annotate_samples_table(annotation_path,
                                 root='sa.myAnnot',
                                 sample_expr='Sample',
                                 config=TextTableConfig(impute=True))

Use the VariantDataset method `count()` to get a sense of the number of samples, genotypes and variants. Be aware that aggregate functions like this one are usually slow.

In [17]:
vds.count()

In [18]:
vds.num_samples

In [19]:
vds.count_variants()

In [20]:
dir(vds)

### Initial exploration of dataset

If the Boolean parameter `genotypes` is set to `True`, the overall call rate across all genotypes is computed as well:

In [23]:
vds.count(genotypes=True)

So the overall call rate before any QC filtering is about 99%.

Let's print the variant and sample schemas:

In [25]:
print(vds.variant_schema)

In [26]:
print(vds.sample_schema)

Now it's easy to count samples by population using the [counter](https://hail.is/expr_lang.html#counter) aggregator:

In [28]:
counter = vds.query_samples('samples.map(s => sa.myAnnot.Population).counter()')[0]
for x in counter:
    print('population %s found %s times' % (x.key, x.count))
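
For readers more familiar with plain Python, the `counter()` aggregator behaves much like `collections.Counter` applied to one value per sample. The sketch below is only an analogy: the sample-to-population pairs are a small hypothetical list, not the notebook's data.

```python
from collections import Counter

# Hypothetical (sample, population) pairs standing in for sa.myAnnot.Population.
sample_population = [
    ("sample_a", "GBR"),
    ("sample_b", "GBR"),
    ("sample_c", "YRI"),
]

# Count how many samples fall into each population.
population_counts = Counter(population for _, population in sample_population)
for population, count in sorted(population_counts.items()):
    print('population %s found %s times' % (population, count))
```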

### A VCF file contains many annotation scores that describe the quality of genotypes and variants, which allows genotype and variant filtering

In [30]:
%scala
displayHTML("https://en.wikipedia.org/wiki/Phred_quality_score")
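
A Phred-scaled score Q relates to an error probability P by Q = -10 * log10(P). The short sketch below converts in both directions; the function names are chosen here for illustration. It shows, for example, that a quality of 20 corresponds to a 1% chance that the call is wrong.

```python
from math import log10

def phred_to_error_probability(q):
    # Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)
    return 10 ** (-q / 10.0)

def error_probability_to_phred(p):
    return -10.0 * log10(p)

for q in (10, 20, 30):
    print('Q%d -> error probability %g' % (q, phred_to_error_probability(q)))
```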

### Let's first filter genotypes based on genotype quality (GQ) and read coverage (DP).

In [32]:
filter_condition_gDP_gGQ = 'g.dp >= 5 && g.gq >= 20'
vds_gDP_gGQ = vds.filter_genotypes(filter_condition_gDP_gGQ)

In [33]:
vds_gDP_gGQ.count(genotypes=True)

Now the call rate is about 54%, so roughly 45% of all genotypes failed the filter. Filtering out a genotype is equivalent to setting the genotype call to missing.
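
To make "setting the call to missing" concrete, here is a small sketch on a toy genotype matrix. It assumes the same thresholds as the filter above (DP >= 5 and GQ >= 20); the data and the helper names `call_rate` and `filter_genotypes` are invented for illustration and are not Hail functions.

```python
# Toy genotypes: one row per variant, one entry per sample.
# Each entry is (call, dp, gq); call is None when the genotype is missing.
genotypes = [
    [("0/1", 12, 45), ("0/0", 3, 9), ("1/1", 8, 30)],
    [("0/0", 6, 18), (None, 0, 0), ("0/1", 10, 60)],
]

def call_rate(rows):
    # Fraction of all genotypes that carry a call.
    calls = [call for row in rows for (call, _, _) in row]
    return sum(call is not None for call in calls) / float(len(calls))

def filter_genotypes(rows, min_dp=5, min_gq=20):
    # Failing genotypes stay in the matrix but lose their call.
    return [
        [(call if dp >= min_dp and gq >= min_gq else None, dp, gq)
         for (call, dp, gq) in row]
        for row in rows
    ]

filtered = filter_genotypes(genotypes)
print('call rate before: %.3f, after: %.3f' % (call_rate(genotypes), call_rate(filtered)))
```

The matrix keeps its shape; only the call rate drops, which is why `count(genotypes=True)` reports the same number of samples and variants before and after this step.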

### Let's next remove variants with low call rate (vCR)

In [36]:
vds_gDP_gGQ_vCR = (vds_gDP_gGQ
    .filter_variants_expr('gs.fraction(g => g.isCalled) >= 0.40')
    .sample_qc())

Check how many variants are retained after filtering.

In [38]:
vds_gDP_gGQ_vCR.count(genotypes=True)
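
The expression `gs.fraction(g => g.isCalled) >= 0.40` keeps a variant when at least 40% of its genotypes are called. A plain-Python sketch of the same rule follows, using an invented toy call matrix and a hypothetical helper `variant_call_rate`.

```python
# Toy calls after genotype filtering: one list of per-sample calls per variant.
# None marks a missing call; all values are invented for illustration.
calls_by_variant = {
    "1:1000": ["0/1", None, "0/0", "1/1", None],   # 3 of 5 called
    "1:2000": [None, None, "0/1", None, None],     # 1 of 5 called
    "1:3000": ["0/0", "0/0", None, None, None],    # 2 of 5 called
}

def variant_call_rate(calls):
    return sum(call is not None for call in calls) / float(len(calls))

MIN_CALL_RATE = 0.40
kept = {
    variant: calls
    for variant, calls in calls_by_variant.items()
    if variant_call_rate(calls) >= MIN_CALL_RATE
}
print(sorted(kept))
```

Because the comparison is `>=`, a variant sitting exactly at the threshold (2 of 5 calls here) is kept.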

### Filter samples according to their average read depth (sDP) and genotype quality (sGQ)

In [40]:
print(vds_gDP_gGQ_vCR.sample_schema)

In [41]:
sampleqc_table = vds_gDP_gGQ_vCR.samples_keytable().to_dataframe()

# Replace dots in the column names with underscores.
new_column_name_list = [name.replace(".", "_") for name in sampleqc_table.columns]
sampleqc_table = sampleqc_table.toDF(*new_column_name_list)

In [42]:
display(sampleqc_table)

s,sa_myAnnot_Sample,sa_myAnnot_Population,sa_myAnnot_SuperPopulation,sa_myAnnot_isFemale,sa_myAnnot_PurpleHair,sa_myAnnot_CaffeineConsumption,sa_qc_callRate,sa_qc_nCalled,sa_qc_nNotCalled,sa_qc_nHomRef,sa_qc_nHet,sa_qc_nHomVar,sa_qc_nSNP,sa_qc_nInsertion,sa_qc_nDeletion,sa_qc_nSingleton,sa_qc_nTransition,sa_qc_nTransversion,sa_qc_dpMean,sa_qc_dpStDev,sa_qc_gqMean,sa_qc_gqStDev,sa_qc_nNonRef,sa_qc_rTiTv,sa_qc_rHetHomVar,sa_qc_rInsertionDeletion
HG00096,HG00096,GBR,EUR,False,False,77.0,0.2749232343909928,2686,7084,1130,1248,308,1864,0,0,1,1524,340,7.4586746090841425,1.829162383895932,45.90655249441547,28.46876215640281,1556,4.482352941176471,4.051948051948052,
HG00100,HG00100,GBR,EUR,True,False,59.0,0.9616171954964176,9395,375,5081,2779,1535,5849,0,0,2,4762,1087,13.224374667376209,3.712169739189841,55.6111761575305,27.06547877264492,4314,4.380864765409384,1.8104234527687295,
HG00105,HG00105,GBR,EUR,False,False,77.0,0.5797338792221085,5664,4106,2793,2024,847,3718,0,0,0,2993,725,8.650600282485852,2.2771253327175165,44.382591807909634,27.39013726273073,2871,4.128275862068966,2.389610389610389,
HG00114,HG00114,GBR,EUR,False,True,72.0,0.1494370522006141,1460,8310,509,776,175,1126,0,0,0,924,202,6.986301369863015,1.6055823053687388,43.37945205479454,25.086450301723687,951,4.574257425742574,4.434285714285714,
HG00115,HG00115,GBR,EUR,False,True,71.0,0.6856704196519959,6699,3071,3479,2218,1002,4222,0,0,1,3451,771,9.22779519331246,2.5460202117218307,44.704433497536925,26.70956053033648,3220,4.476005188067445,2.2135728542914173,
HG00116,HG00116,GBR,EUR,False,False,71.0,0.252200614124872,2464,7306,1065,1075,324,1723,0,0,0,1403,320,7.394074675324675,1.7025517387131073,40.97767857142858,23.627226861323056,1399,4.384375,3.317901234567901,
HG00119,HG00119,GBR,EUR,False,True,70.0,0.3195496417604913,3122,6648,1363,1365,394,2153,0,0,0,1743,410,7.527866752082009,1.7771483099316476,42.03939782190902,24.84769850706469,1759,4.251219512195122,3.464467005076142,
HG00120,HG00120,GBR,EUR,True,True,68.0,0.2860798362333674,2795,6975,1238,1241,316,1873,0,0,0,1517,356,7.446153846153844,1.6802170140360535,41.18246869409654,23.966680366447957,1557,4.26123595505618,3.9272151898734178,
HG00128,HG00128,GBR,EUR,True,True,67.0,0.6401228249744114,6254,3516,3277,2036,941,3918,0,0,0,3153,765,9.280620402942136,2.5949974996440823,45.32523185161495,27.475594538376384,2977,4.12156862745098,2.1636556854410203,
HG00131,HG00131,GBR,EUR,False,False,72.0,0.309007164790174,3019,6751,1311,1317,391,2099,0,0,0,1712,387,7.452798940046386,1.6856292863721216,41.58860549850948,24.326264763553837,1708,4.423772609819121,3.368286445012788,


In [44]:
vds_gDP_gGQ_vCR_sDP_sGQ = (vds_gDP_gGQ_vCR
    .annotate_samples_vds(vds_gDP_gGQ_vCR, code='sa.qc = vds.qc')
    .filter_samples_expr('sa.qc.dpMean >= 5 && sa.qc.gqMean >= 40'))

As before, we can count the number of samples that remain in the dataset after filtering. In this case, no samples were filtered out.

In [46]:
vds_gDP_gGQ_vCR_sDP_sGQ.count(genotypes=True)
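
As a quick sanity check on the thresholds, the sketch below applies the same rule (`dpMean >= 5` and `gqMean >= 40`) to the ten samples shown in the sample QC table above, with their values rounded to two decimals. It covers only those displayed rows, not the full dataset.

```python
# (dpMean, gqMean) for the ten displayed samples, rounded to two decimals.
sample_qc = {
    "HG00096": (7.46, 45.91),
    "HG00100": (13.22, 55.61),
    "HG00105": (8.65, 44.38),
    "HG00114": (6.99, 43.38),
    "HG00115": (9.23, 44.70),
    "HG00116": (7.39, 40.98),
    "HG00119": (7.53, 42.04),
    "HG00120": (7.45, 41.18),
    "HG00128": (9.28, 45.33),
    "HG00131": (7.45, 41.59),
}

MIN_DP_MEAN, MIN_GQ_MEAN = 5, 40
passing = [
    sample
    for sample, (dp_mean, gq_mean) in sample_qc.items()
    if dp_mean >= MIN_DP_MEAN and gq_mean >= MIN_GQ_MEAN
]
print('%d of %d displayed samples pass' % (len(passing), len(sample_qc)))
```

All ten displayed samples clear both thresholds, although a few `gqMean` values sit only slightly above 40.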

### More detailed variant filtering (vFilter) according to their distribution plots

We now have `vds_gDP_gGQ_vCR_sDP_sGQ`, a VDS where low-quality genotypes and samples have been removed.

Let's use the [variant_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variant_qc) method to start exploring variant metrics to perform more detailed variant filtering:

In [48]:
vds_gDP_gGQ_vCR_sDP_sGQ = vds_gDP_gGQ_vCR_sDP_sGQ.split_multi()
vds_gDP_gGQ_vCR_sDP_sGQ = vds_gDP_gGQ_vCR_sDP_sGQ.variant_qc()
print(vds_gDP_gGQ_vCR_sDP_sGQ.variant_schema)

In [49]:
variantqc_table = vds_gDP_gGQ_vCR_sDP_sGQ.variants_keytable().to_dataframe()
display(variantqc_table)

v.contig,v.start,v.ref,v.altAlleles,va.rsid,va.qual,va.filters,va.pass,va.info.AC,va.info.AF,va.info.AN,va.info.BaseQRankSum,va.info.ClippingRankSum,va.info.DP,va.info.DS,va.info.FS,va.info.HaplotypeScore,va.info.InbreedingCoeff,va.info.MLEAC,va.info.MLEAF,va.info.MQ,va.info.MQ0,va.info.MQRankSum,va.info.QD,va.info.ReadPosRankSum,va.info.set,va.aIndex,va.wasSplit,va.qc.callRate,va.qc.AC,va.qc.AF,va.qc.nCalled,va.qc.nNotCalled,va.qc.nHomRef,va.qc.nHet,va.qc.nHomVar,va.qc.dpMean,va.qc.dpStDev,va.qc.gqMean,va.qc.gqStDev,va.qc.nNonRef,va.qc.rHeterozygosity,va.qc.rHetHomVar,va.qc.rExpectedHetFrequency,va.qc.pHWE
1,904165,G,"List(List(G, A))",.,52346.37000000001,List(),False,List(518),List(0.103),5020,-3.394,-0.17,17827,,2.233,,0.0988,List(514),List(0.102),59.05,0,1.447,15.02,6.286,,1,False,0.5181950509461426,92,0.1292134831460674,356,331,272,76,8,9.870786516853933,3.2099109948524105,39.80898876404496,23.730877058037567,84,0.2134831460674157,9.5,0.2253512223644495,0.2900388345474841
1,1707740,T,"List(List(T, G))",.,93517.82,List(),False,List(997),List(0.198),5034,-40.42,-0.287,19902,,3.311,,0.0387,List(983),List(0.195),58.32,0,9.478,13.59,2.259,,1,False,0.6259097525473072,186,0.2162790697674418,430,257,264,146,20,10.202325581395352,3.507267429813318,47.57906976744185,27.84244758469988,166,0.3395348837209302,7.3,0.3393995180983837,0.9433988035930664
1,2284195,T,"List(List(T, C))",.,142480.77,List(),False,List(1559),List(0.312),4990,-45.982,0.35,18176,,2.945,,0.0925,List(1552),List(0.311),58.57,0,16.136,15.48,-0.682,,1,False,0.5749636098981077,265,0.3354430379746835,395,292,174,177,44,10.124050632911397,3.822293538723904,50.76455696202529,28.571877108663653,221,0.4481012658227848,4.0227272727272725,0.4464070847571834,0.9551366128713484
1,2944527,G,"List(List(G, A))",.,124329.15,List(),False,List(1206),List(0.245),4928,0.063,-0.655,17698,,0.449,,0.1232,List(1192),List(0.242),58.15,0,12.011,17.54,21.754,,1,False,0.5342066957787481,191,0.2602179836512261,367,320,199,145,23,10.21798365122616,3.7316531418923233,50.10081743869208,28.09255187271957,168,0.3950953678474114,6.304347826086956,0.3855344205255547,0.6363296264457586
1,3761547,C,"List(List(C, A))",.,1614.69,List(),False,List(30),List(0.005948),5044,-4.468,-8.82,16845,,2.055,,-0.0047,List(28),List(0.005551),56.98,0,6.299,7.99,-1.752,,1,False,0.5443959243085881,10,0.0133689839572192,374,313,364,10,0,8.79411764705883,2.938381024472951,35.890374331550824,15.117941438430414,10,0.0267379679144385,,0.0264158237226982,0.529558260774204
1,3803755,T,"List(List(T, C))",.,383548.94,List(),False,List(3368),List(0.673),5008,-53.782,-6.841,17687,,9.582,,0.0812,List(3376),List(0.674),58.0,0,26.256,24.93,-0.433,,1,False,0.5662299854439592,522,0.6709511568123393,389,298,41,174,174,9.347043701799482,3.3586588017654098,50.33676092544985,28.10780222298778,348,0.4473007712082262,1.0,0.4421196811942313,0.8638672568716725
1,4121584,A,"List(List(A, G))",.,115117.7,List(),False,List(1489),List(0.3),4962,-26.975,-0.719,16179,,22.674,,0.104,List(1465),List(0.295),58.55,0,2.845,14.32,-4.907,,1,False,0.5502183406113537,227,0.3002645502645503,378,309,188,153,37,9.235449735449736,3.263573597825833,49.96560846560851,27.783105183267025,190,0.4047619047619047,4.135135135135135,0.4207680717614492,0.4277922729402096
1,4170048,C,"List(List(C, T))",.,120311.95,List(),False,List(1323),List(0.266),4980,-12.472,-10.32,16375,,4.365,,0.1302,List(1313),List(0.264),58.96,0,7.409,16.72,1.056,,1,False,0.4919941775836972,176,0.2603550295857988,338,349,176,148,14,9.446745562130182,3.549123327088349,48.57100591715978,27.46707527990407,162,0.4378698224852071,10.571428571428571,0.3857111549419242,0.013401595751732
1,4180842,C,"List(List(C, T))",.,138252.12,List(),False,List(1429),List(0.286),4996,54.319,-0.207,15845,,0.321,,0.0815,List(1430),List(0.286),59.15,0,-1.479,17.99,-1.476,,1,False,0.519650655021834,223,0.3123249299719888,357,330,158,175,24,8.700280112044819,2.876997526498,49.128851540616246,28.410066115097894,199,0.4901960784313725,7.291666666666667,0.4301585992040574,0.0080247270522444
1,6279383,G,"List(List(G, C))",.,1268.87,List(),False,List(16),List(0.003197),5004,-4.319,-1.24,15942,,3.279,,-0.028,List(15),List(0.002998),58.22,0,0.258,10.75,1.466,,1,False,0.4366812227074236,3,0.005,300,387,297,3,0,9.413333333333332,2.9814686910238635,29.430000000000007,10.575053979373656,3,0.01,,0.0099666110183639,0.5025041736227043
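
Several of the `va.qc` columns can be recomputed by hand from the genotype counts, which helps when reading the table. The sketch below does this for the first displayed variant (1:904165). The expected-heterozygosity formula, including its finite-sample correction, is an assumption chosen here because it reproduces the displayed value; it is not taken from Hail's source.

```python
# Genotype counts for variant 1:904165 as shown in the table above.
n_hom_ref, n_het, n_hom_var = 272, 76, 8
n_not_called = 331

n_called = n_hom_ref + n_het + n_hom_var      # va.qc.nCalled
n_samples = n_called + n_not_called
call_rate = n_called / float(n_samples)       # va.qc.callRate
ac = n_het + 2 * n_hom_var                    # va.qc.AC, alternate allele count
n_alleles = 2 * n_called
af = ac / float(n_alleles)                    # va.qc.AF
r_heterozygosity = n_het / float(n_called)    # va.qc.rHeterozygosity
r_het_hom_var = n_het / float(n_hom_var)      # va.qc.rHetHomVar
# Assumed form: Hardy-Weinberg expectation with a finite-sample correction.
r_expected_het = 2.0 * ac * (n_alleles - ac) / (n_alleles * (n_alleles - 1))

print('callRate=%.4f AC=%d AF=%.4f' % (call_rate, ac, af))
print('rHeterozygosity=%.4f rHetHomVar=%.2f rExpectedHetFrequency=%.4f'
      % (r_heterozygosity, r_het_hom_var, r_expected_het))
```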


In [51]:
variant_filter = ('va.info.MQ >= 55.00 && va.info.QD >= 2.00 && '
                  'va.info.FS <= 60.000 && va.info.MQRankSum >= -20.000 && '
                  'va.info.ReadPosRankSum >= -10.000 && va.info.ReadPosRankSum <= 10.000')
vds_gDP_gGQ_vCR_sDP_sGQ_vFilter = vds_gDP_gGQ_vCR_sDP_sGQ.filter_variants_expr(variant_filter)
print('variants before filtering: %d' % vds_gDP_gGQ_vCR_sDP_sGQ.count_variants())
print('variants after filtering: %d' % vds_gDP_gGQ_vCR_sDP_sGQ_vFilter.count_variants())
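
To see the thresholds at work, the sketch below applies the same conditions in plain Python to three variants from the variant QC table above. The helper `passes_hard_filters` is a name invented here; the INFO values are copied from the displayed rows.

```python
# INFO values for three displayed variants:
# (MQ, QD, FS, MQRankSum, ReadPosRankSum)
info = {
    "1:904165":  (59.05, 15.02, 2.233, 1.447, 6.286),
    "1:2944527": (58.15, 17.54, 0.449, 12.011, 21.754),
    "1:3761547": (56.98, 7.99, 2.055, 6.299, -1.752),
}

def passes_hard_filters(mq, qd, fs, mq_rank_sum, read_pos_rank_sum):
    # Same thresholds as the filter_variants_expr call above.
    return (mq >= 55.0 and qd >= 2.0 and fs <= 60.0
            and mq_rank_sum >= -20.0
            and -10.0 <= read_pos_rank_sum <= 10.0)

kept = set(v for v, values in info.items() if passes_hard_filters(*values))
removed = set(info) - kept
print('kept: %s' % sorted(kept))
print('removed: %s' % sorted(removed))
```

Variant 1:2944527 is dropped because its `ReadPosRankSum` of 21.754 exceeds the upper bound of 10; the other two satisfy every condition.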

In [52]:
variantqc_table = vds_gDP_gGQ_vCR_sDP_sGQ_vFilter.variants_keytable().to_dataframe()
display(variantqc_table)

v.contig,v.start,v.ref,v.altAlleles,va.rsid,va.qual,va.filters,va.pass,va.info.AC,va.info.AF,va.info.AN,va.info.BaseQRankSum,va.info.ClippingRankSum,va.info.DP,va.info.DS,va.info.FS,va.info.HaplotypeScore,va.info.InbreedingCoeff,va.info.MLEAC,va.info.MLEAF,va.info.MQ,va.info.MQ0,va.info.MQRankSum,va.info.QD,va.info.ReadPosRankSum,va.info.set,va.aIndex,va.wasSplit,va.qc.callRate,va.qc.AC,va.qc.AF,va.qc.nCalled,va.qc.nNotCalled,va.qc.nHomRef,va.qc.nHet,va.qc.nHomVar,va.qc.dpMean,va.qc.dpStDev,va.qc.gqMean,va.qc.gqStDev,va.qc.nNonRef,va.qc.rHeterozygosity,va.qc.rHetHomVar,va.qc.rExpectedHetFrequency,va.qc.pHWE
1,904165,G,"List(List(G, A))",.,52346.37000000001,List(),False,List(518),List(0.103),5020,-3.394,-0.17,17827,,2.233,,0.0988,List(514),List(0.102),59.05,0,1.447,15.02,6.286,,1,False,0.5181950509461426,92,0.1292134831460674,356,331,272,76,8,9.870786516853933,3.2099109948524105,39.80898876404496,23.730877058037567,84,0.2134831460674157,9.5,0.2253512223644495,0.2900388345474841
1,1707740,T,"List(List(T, G))",.,93517.82,List(),False,List(997),List(0.198),5034,-40.42,-0.287,19902,,3.311,,0.0387,List(983),List(0.195),58.32,0,9.478,13.59,2.259,,1,False,0.6259097525473072,186,0.2162790697674418,430,257,264,146,20,10.202325581395352,3.507267429813318,47.57906976744185,27.84244758469988,166,0.3395348837209302,7.3,0.3393995180983837,0.9433988035930664
1,2284195,T,"List(List(T, C))",.,142480.77,List(),False,List(1559),List(0.312),4990,-45.982,0.35,18176,,2.945,,0.0925,List(1552),List(0.311),58.57,0,16.136,15.48,-0.682,,1,False,0.5749636098981077,265,0.3354430379746835,395,292,174,177,44,10.124050632911397,3.822293538723904,50.76455696202529,28.571877108663653,221,0.4481012658227848,4.0227272727272725,0.4464070847571834,0.9551366128713484
1,3761547,C,"List(List(C, A))",.,1614.69,List(),False,List(30),List(0.005948),5044,-4.468,-8.82,16845,,2.055,,-0.0047,List(28),List(0.005551),56.98,0,6.299,7.99,-1.752,,1,False,0.5443959243085881,10,0.0133689839572192,374,313,364,10,0,8.79411764705883,2.938381024472951,35.890374331550824,15.117941438430414,10,0.0267379679144385,,0.0264158237226982,0.529558260774204
1,3803755,T,"List(List(T, C))",.,383548.94,List(),False,List(3368),List(0.673),5008,-53.782,-6.841,17687,,9.582,,0.0812,List(3376),List(0.674),58.0,0,26.256,24.93,-0.433,,1,False,0.5662299854439592,522,0.6709511568123393,389,298,41,174,174,9.347043701799482,3.3586588017654098,50.33676092544985,28.10780222298778,348,0.4473007712082262,1.0,0.4421196811942313,0.8638672568716725
1,4121584,A,"List(List(A, G))",.,115117.7,List(),False,List(1489),List(0.3),4962,-26.975,-0.719,16179,,22.674,,0.104,List(1465),List(0.295),58.55,0,2.845,14.32,-4.907,,1,False,0.5502183406113537,227,0.3002645502645503,378,309,188,153,37,9.235449735449736,3.263573597825833,49.96560846560851,27.783105183267025,190,0.4047619047619047,4.135135135135135,0.4207680717614492,0.4277922729402096
1,4170048,C,"List(List(C, T))",.,120311.95,List(),False,List(1323),List(0.266),4980,-12.472,-10.32,16375,,4.365,,0.1302,List(1313),List(0.264),58.96,0,7.409,16.72,1.056,,1,False,0.4919941775836972,176,0.2603550295857988,338,349,176,148,14,9.446745562130182,3.549123327088349,48.57100591715978,27.46707527990407,162,0.4378698224852071,10.571428571428571,0.3857111549419242,0.013401595751732
1,4180842,C,"List(List(C, T))",.,138252.12,List(),False,List(1429),List(0.286),4996,54.319,-0.207,15845,,0.321,,0.0815,List(1430),List(0.286),59.15,0,-1.479,17.99,-1.476,,1,False,0.519650655021834,223,0.3123249299719888,357,330,158,175,24,8.700280112044819,2.876997526498,49.128851540616246,28.410066115097894,199,0.4901960784313725,7.291666666666667,0.4301585992040574,0.0080247270522444
1,6279383,G,"List(List(G, C))",.,1268.87,List(),False,List(16),List(0.003197),5004,-4.319,-1.24,15942,,3.279,,-0.028,List(15),List(0.002998),58.22,0,0.258,10.75,1.466,,1,False,0.4366812227074236,3,0.005,300,387,297,3,0,9.413333333333332,2.9814686910238635,29.430000000000007,10.575053979373656,3,0.01,,0.0099666110183639,0.5025041736227043
1,7569602,C,"List(List(C, T))",.,329869.45,List(),False,List(2774),List(0.551),5036,14.403,-0.417,17990,,7.683,,0.0757,List(2777),List(0.551),58.92,0,6.004,23.33,-0.65,,1,False,0.586608442503639,429,0.532258064516129,403,284,82,213,108,9.21836228287841,3.003217513272482,55.05707196029776,29.906458274316478,321,0.5285359801488834,1.9722222222222223,0.4985373672610699,0.2123100307528677


### Finally, we perform PCA on the clean variant dataset

We will use principal component analysis (PCA) to check if there is any genetic structure.

In [54]:
vds_pca = vds_gDP_gGQ_vCR_sDP_sGQ_vFilter.pca(scores='sa.pca')
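
To give an intuition for what the first principal component captures, here is a minimal standard-library sketch on an invented genotype matrix with two obvious groups of samples. It only centers each variant and uses power iteration on the sample-by-sample Gram matrix; Hail's `pca` may normalize and scale differently, so treat this as an illustration of the idea rather than a reimplementation.

```python
# Toy genotype matrix: rows are samples, columns are variants, entries are
# alternate-allele counts (0, 1 or 2). Samples 0-2 and 3-5 form two invented groups.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]
n_samples, n_variants = len(genotypes), len(genotypes[0])

# Center each variant (column) on its mean.
means = [sum(row[j] for row in genotypes) / float(n_samples) for j in range(n_variants)]
centered = [[row[j] - means[j] for j in range(n_variants)] for row in genotypes]

# Sample-by-sample Gram matrix; its leading eigenvector gives PC1 scores up to scale.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]

def leading_eigenvector(matrix, iterations=200):
    # Power iteration: repeatedly multiply by the matrix and renormalize.
    vector = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector

pc1 = leading_eigenvector(gram)
for index, score in enumerate(pc1):
    print('sample %d: PC1 = %+.3f' % (index, score))
```

The two groups land on opposite sides of zero along PC1, which is the kind of separation one looks for when plotting `sa.pca.PC1` against `sa.pca.PC2` and coloring by population.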

In [55]:
pca_table = vds_pca.samples_keytable().to_dataframe()
display(pca_table)

s,sa.myAnnot.Sample,sa.myAnnot.Population,sa.myAnnot.SuperPopulation,sa.myAnnot.isFemale,sa.myAnnot.PurpleHair,sa.myAnnot.CaffeineConsumption,sa.qc.callRate,sa.qc.nCalled,sa.qc.nNotCalled,sa.qc.nHomRef,sa.qc.nHet,sa.qc.nHomVar,sa.qc.nSNP,sa.qc.nInsertion,sa.qc.nDeletion,sa.qc.nSingleton,sa.qc.nTransition,sa.qc.nTransversion,sa.qc.dpMean,sa.qc.dpStDev,sa.qc.gqMean,sa.qc.gqStDev,sa.qc.nNonRef,sa.qc.rTiTv,sa.qc.rHetHomVar,sa.qc.rInsertionDeletion,sa.pca.PC1,sa.pca.PC2,sa.pca.PC3,sa.pca.PC4,sa.pca.PC5,sa.pca.PC6,sa.pca.PC7,sa.pca.PC8,sa.pca.PC9,sa.pca.PC10
HG00096,HG00096,GBR,EUR,False,False,77.0,0.2749232343909928,2686,7084,1130,1248,308,1864,0,0,1,1524,340,7.4586746090841425,1.829162383895932,45.90655249441547,28.46876215640281,1556,4.482352941176471,4.051948051948052,,0.0339854341184076,-0.0669290296841473,-3.918654050448722e-05,-0.0281726963491451,-0.0118317260285843,0.0175199711669721,-0.001045970131701,0.0296021690851765,-0.0019836075783904,-0.0017771707280481
HG00100,HG00100,GBR,EUR,True,False,59.0,0.9616171954964176,9395,375,5081,2779,1535,5849,0,0,2,4762,1087,13.224374667376209,3.712169739189841,55.6111761575305,27.06547877264492,4314,4.380864765409384,1.8104234527687295,,0.1458948130976565,-0.264013951598218,0.0158046083051394,-0.123153657617185,0.0298905596925979,-0.0132612260012892,-0.0836652529465023,-0.1321839057543317,-0.1051296225605967,0.0792733515145119
HG00105,HG00105,GBR,EUR,False,False,77.0,0.5797338792221085,5664,4106,2793,2024,847,3718,0,0,0,2993,725,8.650600282485852,2.2771253327175165,44.382591807909634,27.39013726273073,2871,4.128275862068966,2.389610389610389,,0.0896623808558018,-0.1329886614076017,0.0187129597207611,-0.0612303317023324,-0.0192212574619952,0.0099091198648107,0.0151664377471947,-0.0061477456995093,-0.0053606531790385,0.0062030627039544
HG00114,HG00114,GBR,EUR,False,True,72.0,0.1494370522006141,1460,8310,509,776,175,1126,0,0,0,924,202,6.986301369863015,1.6055823053687388,43.37945205479454,25.086450301723687,951,4.574257425742574,4.434285714285714,,0.0162272475550337,-0.0351944450091645,0.0048659798935873,-0.0170523751493668,-0.007862523467512,0.0068920771228971,-0.003898509985393,0.0055467946686808,0.0012987331760997,-0.0019128138048738
HG00115,HG00115,GBR,EUR,False,True,71.0,0.6856704196519959,6699,3071,3479,2218,1002,4222,0,0,1,3451,771,9.22779519331246,2.5460202117218307,44.704433497536925,26.70956053033648,3220,4.476005188067445,2.2135728542914173,,0.0887336255256423,-0.1864476391899,-1.150393449029586e-05,-0.0408445364624962,-0.0026755734912689,0.0027208264303513,-0.0127655326452375,0.0502095251138228,0.013885103961345,-0.0114127348151012
HG00116,HG00116,GBR,EUR,False,False,71.0,0.252200614124872,2464,7306,1065,1075,324,1723,0,0,0,1403,320,7.394074675324675,1.7025517387131073,40.97767857142858,23.627226861323056,1399,4.384375,3.317901234567901,,0.030360474135906,-0.0600164437801835,0.0060656758150249,-0.0175891387450457,-0.0018471100897441,0.0076827922867729,-0.0010094758696741,0.0212723386732146,0.004779394964683,-0.0021225561961096
HG00119,HG00119,GBR,EUR,False,True,70.0,0.3195496417604913,3122,6648,1363,1365,394,2153,0,0,0,1743,410,7.527866752082009,1.7771483099316476,42.03939782190902,24.84769850706469,1759,4.251219512195122,3.464467005076142,,0.0443606948844794,-0.0715003761285376,0.0008518414179108384,-0.0255236488092138,-0.0027943270683119,0.0131707263528077,-0.0054185836512832,0.020802873692448,-0.002365140755455,-0.0060587279137972
HG00120,HG00120,GBR,EUR,True,True,68.0,0.2860798362333674,2795,6975,1238,1241,316,1873,0,0,0,1517,356,7.446153846153844,1.6802170140360535,41.18246869409654,23.966680366447957,1557,4.26123595505618,3.9272151898734178,,0.0348894154799068,-0.0762717583295123,0.0036710378846882,-0.0306936533841754,-0.005691016826978,0.0096578486287151,-0.0011829546882877,0.0135604981440989,0.0086063975834412,0.0009229437137270208
HG00128,HG00128,GBR,EUR,True,True,67.0,0.6401228249744114,6254,3516,3277,2036,941,3918,0,0,0,3153,765,9.280620402942136,2.5949974996440823,45.32523185161495,27.475594538376384,2977,4.12156862745098,2.1636556854410203,,0.0718412691139752,-0.1842544365834433,-0.0023018817155439,-0.0704483755302232,0.0108878156160324,-0.003978294665389,-0.0051287956501131,-0.0203962321533912,0.0243594032658542,0.0164299620291913
HG00131,HG00131,GBR,EUR,False,False,72.0,0.309007164790174,3019,6751,1311,1317,391,2099,0,0,0,1712,387,7.452798940046386,1.6856292863721216,41.58860549850948,24.326264763553837,1708,4.423772609819121,3.368286445012788,,0.035529651301294,-0.0873794285589938,-9.755293400657022e-05,-0.0300848814599541,-0.0041658588306447,0.0044089869556906,-0.0038209154540117,0.018906429846461,-0.0085042192263344,0.0041660729280597


## Summary:

Data filtering:
 - Filter genotypes
 - Filter samples
 - Filter variants

PCA
 - Genetic structure of super-populations