# Processing a VCF file using Hail

In [2]:
%scala
displayHTML(frameIt("https://en.wikipedia.org/wiki/Variant_Call_Format",500))

## This Notebook is based on the tutorial [Analyzing 1000 Genomes with Spark and Hail](https://docs.databricks.com/spark/latest/training/1000-genomes.html)

## Cluster setup

On the Databricks interface, click the `Clusters` icon on the left sidebar and then `+Create Cluster`. In the cluster creation dialog, click `Show advanced settings` at the bottom, open the `Spark` tab, and paste the text below into the `Spark config` box.

```
spark.hadoop.io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec
spark.sql.files.openCostInBytes 1099511627776
spark.sql.files.maxPartitionBytes 1099511627776
spark.hadoop.mapreduce.input.fileinputformat.split.minsize 1099511627776
spark.hadoop.parquet.block.size 1099511627776
```

Start the cluster and attach this notebook to it by clicking your cluster name in the `Detached` menu at the top left of this notebook. Now you're ready to Hail!

In [5]:
from hail import *
hc = HailContext(sc)

In [6]:
import pandas as pd
import numpy as np
from math import log, isnan
import seaborn

In [7]:
display(dbutils.fs.ls("/databricks-datasets/hail/data-001"))

path,name,size
dbfs:/databricks-datasets/hail/data-001/1kg_annotations.txt,1kg_annotations.txt,22784
dbfs:/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz,1kg_sample.vcf.bgz,39767725
dbfs:/databricks-datasets/hail/data-001/purcell5k.interval_list,purcell5k.interval_list,192078


## Download the data

In [9]:
vcf_path = '/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz'
annotation_path = '/databricks-datasets/hail/data-001/1kg_annotations.txt'
purcell_5k_path = '/databricks-datasets/hail/data-001/purcell5k.interval_list'

In [11]:
vds = hc.import_vcf(vcf_path)

This method produces a [VariantDataset](https://hail.is/hail/hail.VariantDataset.html), Hail's primary representation of genomic data. Following that link to Hail's Python API documentation will let you see the many methods it offers.

In [13]:
%scala 
displayHTML(frameIt("https://hail.is/docs/0.1/overview.html#variant-dataset-vds", 500))

In [14]:
vds = vds.annotate_samples_table(annotation_path,
                                 root='sa.myAnnot',
                                 sample_expr='Sample',
                                 config=TextTableConfig(impute=True))

Use the VariantDataset method `count()` to get a sense of the number of samples, genotypes, and variants. Note that aggregate functions like this one usually run slowly because of the cost of collecting the data.

In [16]:
vds.count()

In [17]:
vds.num_samples

In [18]:
vds.count_variants()

In [19]:
dir(vds)

## Start exploring

If the Boolean parameter `genotypes` is set to `True`, the overall call rate across all genotypes is computed as well:

In [22]:
vds.count(genotypes=True)

So the call rate before any QC filtering is about 99%.

Let's print the variant and sample schemas:

In [24]:
print(vds.variant_schema)

In [25]:
print(vds.sample_schema)

Now it's easy to count samples by population using the [counter](https://hail.is/expr_lang.html#counter) aggregator:

In [27]:
counter = vds.query_samples('samples.map(s => sa.myAnnot.Population).counter()')[0]
for x in counter:
    print('population %s found %s times' % (x.key, x.count))
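For readers without a Hail cluster at hand, the idea behind the `counter` aggregator can be sketched in plain Python with `collections.Counter`. The population labels below are made up for illustration and are not read from the annotation file.

```python
from collections import Counter

# Hypothetical population labels, one per sample (not taken from 1kg_annotations.txt).
populations = ['GBR', 'GBR', 'FIN', 'GBR', 'FIN', 'CHS']

# Counter tallies how often each label occurs, much like counter() does per key.
population_counts = Counter(populations)

for population, n in sorted(population_counts.items()):
    print('population %s found %s times' % (population, n))
```

The Hail version does the same tally, but distributed over the samples of the VDS.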

## A VCF file contains many annotation scores that describe the quality of genotypes and variants, which makes it possible to filter genotypes and variants

In [29]:
%scala
displayHTML(frameIt("https://en.wikipedia.org/wiki/Phred_quality_score",500))
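Since GQ and several variant annotations are Phred-scaled, it may help to see the conversion between a Phred score and an error probability. The helper names below are illustrative and are not part of Hail.

```python
from math import log10

def phred_to_error_prob(q):
    """Probability that a call is wrong for Phred score q: P = 10^(-q/10)."""
    return 10 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Phred score for error probability p: Q = -10 * log10(p)."""
    return -10.0 * log10(p)

for q in (10, 20, 30):
    print('Q%d -> error probability %g' % (q, phred_to_error_prob(q)))
```

Under this definition, a genotype quality of 20 corresponds to a 1% chance that the genotype call is wrong.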

## Let's first filter genotypes based on genotype quality (GQ) and read coverage (DP).

In [31]:
filter_condition_gDP_gGQ = 'g.dp >= 5 && g.gq >= 20'
vds_gDP_gGQ = vds.filter_genotypes(filter_condition_gDP_gGQ)

In [32]:
vds_gDP_gGQ.count(genotypes=True)

Now the call rate is about 50%, so nearly half of the genotypes failed the filter. Filtering out a genotype is equivalent to setting the genotype call to missing.
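The effect of a genotype filter on the call rate can be illustrated without Hail. The following is a minimal sketch on made-up genotypes, where a filtered genotype is represented as `None`; the function names are hypothetical.

```python
# Hypothetical genotypes for one variant: (call, depth, genotype quality).
# A call of None means the genotype is already missing.
genotypes = [
    ('0/0', 12, 45),
    ('0/1', 3, 30),    # fails dp >= 5
    ('1/1', 8, 15),    # fails gq >= 20
    (None, 0, 0),      # missing before filtering
    ('0/1', 9, 60),
]

def filter_genotype(call, dp, gq, min_dp=5, min_gq=20):
    """Return the call if it passes both thresholds, otherwise None (missing)."""
    if call is not None and dp >= min_dp and gq >= min_gq:
        return call
    return None

def call_rate(calls):
    """Fraction of genotypes that are not missing."""
    return sum(c is not None for c in calls) / float(len(calls))

before = [call for call, dp, gq in genotypes]
after = [filter_genotype(call, dp, gq) for call, dp, gq in genotypes]

print('call rate before: %.2f' % call_rate(before))
print('call rate after:  %.2f' % call_rate(after))
```

The number of genotypes stays the same; only the fraction that carries a call goes down.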

Having removed suspect genotypes, let's next remove variants with low call rate and then calculate summary statistics per sample with the [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) method.

In [34]:
vds_gDP_gGQ_vCR = (vds_gDP_gGQ
    .filter_variants_expr('gs.fraction(g => g.isCalled) >= 0.40')
    .sample_qc())

Check how many variants were retained after filtering.

In [36]:
vds_gDP_gGQ_vCR.count(genotypes=True)

### Filter samples

The call rate for each variant is calculated using the `fraction` [aggregable](https://hail.is/expr_lang.html#aggregables) on the genotypes `gs`. [sample_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.sample_qc) adds a number of statistics to sample annotations:

In [39]:
print(vds_gDP_gGQ_vCR.sample_schema)
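The per-variant call-rate filter, `gs.fraction(g => g.isCalled) >= 0.40`, can be mimicked in plain Python. This is only a sketch on invented calls; the variant names and the `fraction_called` helper are assumptions for illustration.

```python
# Hypothetical genotype calls per variant; None marks a missing (filtered) genotype.
calls_by_variant = {
    'variant_a': ['0/0', '0/1', None, '1/1', '0/0'],   # 4 of 5 called
    'variant_b': [None, None, '0/1', None, None],      # 1 of 5 called
    'variant_c': ['0/0', None, '1/1', '0/1', None],    # 3 of 5 called
}

MIN_CALL_RATE = 0.40  # same threshold as in the filter expression

def fraction_called(calls):
    """Share of genotypes with a non-missing call."""
    return sum(c is not None for c in calls) / float(len(calls))

# Keep only variants whose call rate reaches the threshold.
kept = sorted(v for v, calls in calls_by_variant.items()
              if fraction_called(calls) >= MIN_CALL_RATE)
print(kept)
```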

We examine the samples' call rate (`callRate`) and mean read depth (`dpMean`) to determine the filtering criteria.

In [41]:
sampleqc_table = vds_gDP_gGQ_vCR.samples_keytable().to_dataframe()

new_column_name_list= list(map(lambda x: x.replace(".", "_"), sampleqc_table.columns))
sampleqc_table = sampleqc_table.toDF(*new_column_name_list)

In [42]:
display(sampleqc_table)

s,sa_myAnnot_Sample,sa_myAnnot_Population,sa_myAnnot_SuperPopulation,sa_myAnnot_isFemale,sa_myAnnot_PurpleHair,sa_myAnnot_CaffeineConsumption,sa_qc_callRate,sa_qc_nCalled,sa_qc_nNotCalled,sa_qc_nHomRef,sa_qc_nHet,sa_qc_nHomVar,sa_qc_nSNP,sa_qc_nInsertion,sa_qc_nDeletion,sa_qc_nSingleton,sa_qc_nTransition,sa_qc_nTransversion,sa_qc_dpMean,sa_qc_dpStDev,sa_qc_gqMean,sa_qc_gqStDev,sa_qc_nNonRef,sa_qc_rTiTv,sa_qc_rHetHomVar,sa_qc_rInsertionDeletion
HG00096,HG00096,GBR,EUR,False,False,77.0,0.2749232343909928,2686,7084,1130,1248,308,1864,0,0,1,1524,340,7.4586746090841425,1.829162383895932,45.90655249441547,28.46876215640281,1556,4.482352941176471,4.051948051948052,
HG00100,HG00100,GBR,EUR,True,False,59.0,0.9616171954964176,9395,375,5081,2779,1535,5849,0,0,2,4762,1087,13.224374667376209,3.712169739189841,55.6111761575305,27.06547877264492,4314,4.380864765409384,1.8104234527687295,
HG00105,HG00105,GBR,EUR,False,False,77.0,0.5797338792221085,5664,4106,2793,2024,847,3718,0,0,0,2993,725,8.650600282485852,2.2771253327175165,44.382591807909634,27.39013726273073,2871,4.128275862068966,2.389610389610389,
HG00114,HG00114,GBR,EUR,False,True,72.0,0.1494370522006141,1460,8310,509,776,175,1126,0,0,0,924,202,6.986301369863015,1.6055823053687388,43.37945205479454,25.086450301723687,951,4.574257425742574,4.434285714285714,
HG00115,HG00115,GBR,EUR,False,True,71.0,0.6856704196519959,6699,3071,3479,2218,1002,4222,0,0,1,3451,771,9.22779519331246,2.5460202117218307,44.704433497536925,26.70956053033648,3220,4.476005188067445,2.2135728542914173,
HG00116,HG00116,GBR,EUR,False,False,71.0,0.252200614124872,2464,7306,1065,1075,324,1723,0,0,0,1403,320,7.394074675324675,1.7025517387131073,40.97767857142858,23.627226861323056,1399,4.384375,3.317901234567901,
HG00119,HG00119,GBR,EUR,False,True,70.0,0.3195496417604913,3122,6648,1363,1365,394,2153,0,0,0,1743,410,7.527866752082009,1.7771483099316476,42.03939782190902,24.84769850706469,1759,4.251219512195122,3.464467005076142,
HG00120,HG00120,GBR,EUR,True,True,68.0,0.2860798362333674,2795,6975,1238,1241,316,1873,0,0,0,1517,356,7.446153846153844,1.6802170140360535,41.18246869409654,23.966680366447957,1557,4.26123595505618,3.9272151898734178,
HG00128,HG00128,GBR,EUR,True,True,67.0,0.6401228249744114,6254,3516,3277,2036,941,3918,0,0,0,3153,765,9.280620402942136,2.5949974996440823,45.32523185161495,27.475594538376384,2977,4.12156862745098,2.1636556854410203,
HG00131,HG00131,GBR,EUR,False,False,72.0,0.309007164790174,3019,6751,1311,1317,391,2099,0,0,0,1712,387,7.452798940046386,1.6856292863721216,41.58860549850948,24.326264763553837,1708,4.423772609819121,3.368286445012788,


In [44]:
vds_gDP_gGQ_vCR_sDP_sGT = (vds_gDP_gGQ_vCR
    .annotate_samples_vds(vds_gDP_gGQ_vCR, code = 'sa.qc = vds.qc' )
    .filter_samples_expr('sa.qc.dpMean > 0.50 && sa.qc.dpMean >=5 && sa.qc.gqMean >=20'))

As before, we can count the samples that remain in the dataset after filtering. (In this case, no samples were filtered out.)

In [46]:
vds_gDP_gGQ_vCR_sDP_sGT.count(genotypes=True)
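To see why the depth and quality thresholds remove no samples, here is a standalone sketch that applies them to three of the rows displayed earlier (values rounded). The stricter `passes_with_call_rate` variant is hypothetical and is not what the notebook runs; it only shows how a call-rate cutoff would change the outcome.

```python
# (sample, callRate, dpMean, gqMean), rounded from rows of the sample QC table.
samples = [
    ('HG00096', 0.2749, 7.4587, 45.9066),
    ('HG00100', 0.9616, 13.2244, 55.6112),
    ('HG00114', 0.1494, 6.9863, 43.3795),
]

def passes_depth_quality(dp_mean, gq_mean):
    """Mirror of the dpMean >= 5 and gqMean >= 20 thresholds."""
    return dp_mean >= 5 and gq_mean >= 20

def passes_with_call_rate(call_rate, dp_mean, gq_mean, min_call_rate=0.50):
    """Hypothetical stricter filter that also requires a minimum call rate."""
    return call_rate >= min_call_rate and passes_depth_quality(dp_mean, gq_mean)

kept = [s for s, cr, dp, gq in samples if passes_depth_quality(dp, gq)]
kept_strict = [s for s, cr, dp, gq in samples if passes_with_call_rate(cr, dp, gq)]

print('depth/quality filter keeps: %s' % kept)
print('with call-rate cutoff keeps: %s' % kept_strict)
```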

### Filter variants

We now have `vds_gDP_gGQ_vCR_sDP_sGT`, a VDS where low-quality genotypes and samples have been removed.

Let's use the [variant_qc](https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.variant_qc) method to start exploring variant metrics:

In [48]:
vds_gDP_gGQ_vCR_sDP_sGT = vds_gDP_gGQ_vCR_sDP_sGT.split_multi()
vds_gDP_gGQ_vCR_sDP_sGT = vds_gDP_gGQ_vCR_sDP_sGT.variant_qc()
print(vds_gDP_gGQ_vCR_sDP_sGT.variant_schema)

Next, we will filter variants following the [GATK Best Practices recommendations](https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set).

These recommendations are for human data, but our data is not human and the distribution of quality statistics will differ from what is expected for human data. (Explaining why is beyond the scope of this tutorial.)

Let's have a look at the distribution of different variant quality statistics:

- QD - variant confidence standardized by depth.

  This annotation puts the variant confidence QUAL score into perspective by normalizing for the amount of coverage available. Because each read contributes a little to the QUAL score, variants in region with deep coverage can have artificially inflated QUAL scores, giving the impression that the call is supported by more evidence than it really is. To compensate for this, we normalize the variant confidence by depth, which gives us a more objective picture of how well supported the call is.
   
- MQ - root mean square mapping quality of the reads covering the site.

- FS - strand bias in support for REF vs ALT allele calls.

  Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. The FisherStrand annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It uses Fisher's Exact Test to determine if there is strand bias between forward and reverse strands for the reference or alternate allele. The output is a Phred-scaled p-value. The higher the output value, the more likely there is to be bias. More bias is indicative of false positive calls.
  
- SOR - strand bias estimated with the symmetric odds ratio test.
   
   Like FS, this annotation is used to determine whether there is strand bias between forward and reverse strands for the reference or alternate allele. The reported value is ln-scaled.
   
- MQRankSum - Rank sum test for mapping qualities of REF vs. ALT reads.
   
   This variant-level annotation compares the mapping qualities of the reads supporting the reference allele with those supporting the alternate allele. The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the reads supporting the alternate allele have lower mapping quality scores than those supporting the reference allele. Conversely, a positive value indicates that the reads supporting the alternate allele have higher mapping quality scores than those supporting the reference allele.

- ReadPosRankSum - rank sum test for the position of REF vs. ALT alleles within reads; it checks whether the reads supporting a SNP call tend to carry it near the end of the read.
   
   The ideal result is a value close to zero, which indicates there is little to no difference in where the alleles are found relative to the ends of reads. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele. Conversely, a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele.
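To make the FS description more concrete, the sketch below computes a textbook two-sided Fisher's exact test on a 2x2 strand table and Phred-scales the p-value. It assumes Python 3.8+ for `math.comb`, uses invented read counts, and is not GATK's implementation, which may treat the table differently.

```python
from math import comb, log10

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_observed * (1 + 1e-9))

def phred_scaled(p):
    """Phred-scale a p-value; the floor avoids log10(0)."""
    return -10.0 * log10(max(p, 1e-300))

# Hypothetical read counts: rows are REF/ALT, columns are forward/reverse strand.
balanced = fisher_exact_two_sided(10, 10, 10, 10)
biased = fisher_exact_two_sided(18, 2, 3, 17)

print('balanced table: p = %.3g, Phred-scaled = %.1f' % (balanced, phred_scaled(balanced)))
print('biased table:   p = %.3g, Phred-scaled = %.1f' % (biased, phred_scaled(biased)))
```

The strongly unbalanced table yields a much smaller p-value and therefore a much larger Phred-scaled score, which is why high FS values point to strand bias.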

Next, we display the variant table to inspect the distribution of these six summary statistics.

In [51]:
variantqc_table = vds_gDP_gGQ_vCR_sDP_sGT.variants_keytable().to_dataframe()
display(variantqc_table)


v.contig,v.start,v.ref,v.altAlleles,va.rsid,va.qual,va.filters,va.pass,va.info.AC,va.info.AF,va.info.AN,va.info.BaseQRankSum,va.info.ClippingRankSum,va.info.DP,va.info.DS,va.info.FS,va.info.HaplotypeScore,va.info.InbreedingCoeff,va.info.MLEAC,va.info.MLEAF,va.info.MQ,va.info.MQ0,va.info.MQRankSum,va.info.QD,va.info.ReadPosRankSum,va.info.set,va.aIndex,va.wasSplit,va.qc.callRate,va.qc.AC,va.qc.AF,va.qc.nCalled,va.qc.nNotCalled,va.qc.nHomRef,va.qc.nHet,va.qc.nHomVar,va.qc.dpMean,va.qc.dpStDev,va.qc.gqMean,va.qc.gqStDev,va.qc.nNonRef,va.qc.rHeterozygosity,va.qc.rHetHomVar,va.qc.rExpectedHetFrequency,va.qc.pHWE
1,904165,G,"List(List(G, A))",.,52346.37000000001,List(),False,List(518),List(0.103),5020,-3.394,-0.17,17827,,2.233,,0.0988,List(514),List(0.102),59.05,0,1.447,15.02,6.286,,1,False,0.5098591549295775,96,0.1325966850828729,362,348,275,78,9,9.83425414364641,3.20912685351177,39.65469613259668,23.58784199754261,87,0.2154696132596685,8.666666666666666,0.2303477682767474,0.2090316153427615
1,1707740,T,"List(List(T, G))",.,93517.82,List(),False,List(997),List(0.198),5034,-40.42,-0.287,19902,,3.311,,0.0387,List(983),List(0.195),58.32,0,9.478,13.59,2.259,,1,False,0.6211267605633802,189,0.2142857142857142,441,269,272,149,20,10.21315192743764,3.486925277157898,47.4965986394558,27.806353574108623,169,0.3378684807256236,7.45,0.3371169125993189,0.9436905368317896
1,2284195,T,"List(List(T, C))",.,142480.77,List(),False,List(1559),List(0.312),4990,-45.982,0.35,18176,,2.945,,0.0925,List(1552),List(0.311),58.57,0,16.136,15.48,-0.682,,1,False,0.5633802816901409,268,0.335,400,310,176,180,44,10.092500000000006,3.8103731772623015,50.7575,28.513394988145485,224,0.45,4.090909090909091,0.446107634543179,0.8666508974008196
1,2944527,G,"List(List(G, A))",.,124329.15,List(),False,List(1206),List(0.245),4928,0.063,-0.655,17698,,0.449,,0.1232,List(1192),List(0.242),58.15,0,12.011,17.54,21.754,,1,False,0.5253521126760563,194,0.260053619302949,373,337,202,148,23,10.171581769436994,3.7226279911110978,49.97050938337803,28.09409102780986,171,0.3967828418230563,6.434782608695652,0.3853680479334976,0.5467951045957519
1,3761547,C,"List(List(C, A))",.,1614.69,List(),False,List(30),List(0.005948),5044,-4.468,-8.82,16845,,2.055,,-0.0047,List(28),List(0.005551),56.98,0,6.299,7.99,-1.752,,1,False,0.5464788732394367,11,0.0141752577319587,388,322,377,11,0,8.744845360824744,2.919777781899568,35.70618556701032,14.995403518652449,11,0.0283505154639175,,0.0279847023611573,0.534665074597403
1,3803755,T,"List(List(T, C))",.,383548.94,List(),False,List(3368),List(0.673),5008,-53.782,-6.841,17687,,9.582,,0.0812,List(3376),List(0.674),58.0,0,26.256,24.93,-0.433,,1,False,0.5619718309859155,535,0.6704260651629073,399,311,41,181,177,9.28822055137844,3.3425391563162665,50.33834586466166,27.90654486651193,358,0.4536340852130325,1.0225988700564972,0.4424643792668623,0.6111203676316223
1,4121584,A,"List(List(A, G))",.,115117.7,List(),False,List(1489),List(0.3),4962,-26.975,-0.719,16179,,22.674,,0.104,List(1465),List(0.295),58.55,0,2.845,14.32,-4.907,,1,False,0.5408450704225352,229,0.2981770833333333,384,326,192,155,37,9.195312499999996,3.2557447474697825,49.64583333333334,27.71553008042402,192,0.4036458333333333,4.1891891891891895,0.4190806986093003,0.4298698169607571
1,4170048,C,"List(List(C, T))",.,120311.95,List(),False,List(1323),List(0.266),4980,-12.472,-10.32,16375,,4.365,,0.1302,List(1313),List(0.264),58.96,0,7.409,16.72,1.056,,1,False,0.4859154929577465,180,0.2608695652173913,345,365,179,152,14,9.391304347826088,3.5403160069563744,48.275362318840585,27.32354935776841,166,0.4405797101449275,10.857142857142858,0.3861929702782861,0.0064648281742706
1,4180842,C,"List(List(C, T))",.,138252.12,List(),False,List(1429),List(0.286),4996,54.319,-0.207,15845,,0.321,,0.0815,List(1430),List(0.286),59.15,0,-1.479,17.99,-1.476,,1,False,0.5098591549295775,228,0.3149171270718232,362,348,159,178,25,8.674033149171263,2.8679674553362657,49.04143646408841,28.30137622588541,203,0.4917127071823204,7.12,0.4320854634235804,0.0087983658934937
1,6279383,G,"List(List(G, C))",.,1268.87,List(),False,List(16),List(0.003197),5004,-4.319,-1.24,15942,,3.279,,-0.028,List(15),List(0.002998),58.22,0,0.258,10.75,1.466,,1,False,0.4380281690140845,3,0.0048231511254019,311,399,308,3,0,9.363344051446967,2.951887576494954,29.27009646302251,10.447435707025488,3,0.0096463022508038,,0.009615235254827,0.5024154589371982


In [55]:
vds_gDP_gGQ_vCR_sDP_sGT_vFilter = vds_gDP_gGQ_vCR_sDP_sGT.filter_variants_expr('va.info.MQ >= 55.00 && va.info.QD >= 2.00 && va.info.FS <= 60.000 && va.info.MQRankSum >= -20.000 && va.info.ReadPosRankSum >= -10.000 && va.info.ReadPosRankSum <= 10.000')
print('variants before filtering: %d' % vds_gDP_gGQ_vCR_sDP_sGT.count_variants())
print('variants after filtering: %d' % vds_gDP_gGQ_vCR_sDP_sGT_vFilter.count_variants())
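
The filter expression above can be read as a per-variant predicate. As a minimal sketch in plain Python, independent of Hail: the helper name `passes_variant_filters` and the dict layout are hypothetical, and treating a missing field as a failure is an assumption of this sketch rather than a statement about how Hail handles missing values. The two test rows are copied from the table displayed before this cell.

```python
def passes_variant_filters(info):
    """Return True if a variant's INFO fields satisfy the hard filters.

    `info` is a plain dict keyed by INFO field name (hypothetical layout).
    A missing (None) field counts as a failure; this is an assumption.
    """
    required = ("MQ", "QD", "FS", "MQRankSum", "ReadPosRankSum")
    if any(info.get(name) is None for name in required):
        return False
    return (
        info["MQ"] >= 55.0
        and info["QD"] >= 2.0
        and info["FS"] <= 60.0
        and info["MQRankSum"] >= -20.0
        and -10.0 <= info["ReadPosRankSum"] <= 10.0
    )

# Variant 1:3761547, all fields within the thresholds.
kept = {"MQ": 56.98, "QD": 7.99, "FS": 2.055,
        "MQRankSum": 6.299, "ReadPosRankSum": -1.752}
# Variant 1:2944527, ReadPosRankSum of 21.754 exceeds the upper bound of 10.
dropped = {"MQ": 58.15, "QD": 17.54, "FS": 0.449,
           "MQRankSum": 12.011, "ReadPosRankSum": 21.754}

print(passes_variant_filters(kept), passes_variant_filters(dropped))
```

This agrees with the filtered table shown further down, where position 3761547 is still present and position 2944527 is gone.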

Verify the filtering results with plots:

In [57]:
variantqc_table = vds_gDP_gGQ_vCR_sDP_sGT_vFilter.variants_keytable().to_dataframe()

display(variantqc_table)

v.contig,v.start,v.ref,v.altAlleles,va.rsid,va.qual,va.filters,va.pass,va.info.AC,va.info.AF,va.info.AN,va.info.BaseQRankSum,va.info.ClippingRankSum,va.info.DP,va.info.DS,va.info.FS,va.info.HaplotypeScore,va.info.InbreedingCoeff,va.info.MLEAC,va.info.MLEAF,va.info.MQ,va.info.MQ0,va.info.MQRankSum,va.info.QD,va.info.ReadPosRankSum,va.info.set,va.aIndex,va.wasSplit,va.qc.callRate,va.qc.AC,va.qc.AF,va.qc.nCalled,va.qc.nNotCalled,va.qc.nHomRef,va.qc.nHet,va.qc.nHomVar,va.qc.dpMean,va.qc.dpStDev,va.qc.gqMean,va.qc.gqStDev,va.qc.nNonRef,va.qc.rHeterozygosity,va.qc.rHetHomVar,va.qc.rExpectedHetFrequency,va.qc.pHWE
1,904165,G,"List(List(G, A))",.,52346.37000000001,List(),False,List(518),List(0.103),5020,-3.394,-0.17,17827,,2.233,,0.0988,List(514),List(0.102),59.05,0,1.447,15.02,6.286,,1,False,0.5098591549295775,96,0.1325966850828729,362,348,275,78,9,9.83425414364641,3.20912685351177,39.65469613259668,23.58784199754261,87,0.2154696132596685,8.666666666666666,0.2303477682767474,0.2090316153427615
1,1707740,T,"List(List(T, G))",.,93517.82,List(),False,List(997),List(0.198),5034,-40.42,-0.287,19902,,3.311,,0.0387,List(983),List(0.195),58.32,0,9.478,13.59,2.259,,1,False,0.6211267605633802,189,0.2142857142857142,441,269,272,149,20,10.21315192743764,3.486925277157898,47.4965986394558,27.806353574108623,169,0.3378684807256236,7.45,0.3371169125993189,0.9436905368317896
1,2284195,T,"List(List(T, C))",.,142480.77,List(),False,List(1559),List(0.312),4990,-45.982,0.35,18176,,2.945,,0.0925,List(1552),List(0.311),58.57,0,16.136,15.48,-0.682,,1,False,0.5633802816901409,268,0.335,400,310,176,180,44,10.092500000000006,3.8103731772623015,50.7575,28.513394988145485,224,0.45,4.090909090909091,0.446107634543179,0.8666508974008196
1,3761547,C,"List(List(C, A))",.,1614.69,List(),False,List(30),List(0.005948),5044,-4.468,-8.82,16845,,2.055,,-0.0047,List(28),List(0.005551),56.98,0,6.299,7.99,-1.752,,1,False,0.5464788732394367,11,0.0141752577319587,388,322,377,11,0,8.744845360824744,2.919777781899568,35.70618556701032,14.995403518652449,11,0.0283505154639175,,0.0279847023611573,0.534665074597403
1,3803755,T,"List(List(T, C))",.,383548.94,List(),False,List(3368),List(0.673),5008,-53.782,-6.841,17687,,9.582,,0.0812,List(3376),List(0.674),58.0,0,26.256,24.93,-0.433,,1,False,0.5619718309859155,535,0.6704260651629073,399,311,41,181,177,9.28822055137844,3.3425391563162665,50.33834586466166,27.90654486651193,358,0.4536340852130325,1.0225988700564972,0.4424643792668623,0.6111203676316223
1,4121584,A,"List(List(A, G))",.,115117.7,List(),False,List(1489),List(0.3),4962,-26.975,-0.719,16179,,22.674,,0.104,List(1465),List(0.295),58.55,0,2.845,14.32,-4.907,,1,False,0.5408450704225352,229,0.2981770833333333,384,326,192,155,37,9.195312499999996,3.2557447474697825,49.64583333333334,27.71553008042402,192,0.4036458333333333,4.1891891891891895,0.4190806986093003,0.4298698169607571
1,4170048,C,"List(List(C, T))",.,120311.95,List(),False,List(1323),List(0.266),4980,-12.472,-10.32,16375,,4.365,,0.1302,List(1313),List(0.264),58.96,0,7.409,16.72,1.056,,1,False,0.4859154929577465,180,0.2608695652173913,345,365,179,152,14,9.391304347826088,3.5403160069563744,48.275362318840585,27.32354935776841,166,0.4405797101449275,10.857142857142858,0.3861929702782861,0.0064648281742706
1,4180842,C,"List(List(C, T))",.,138252.12,List(),False,List(1429),List(0.286),4996,54.319,-0.207,15845,,0.321,,0.0815,List(1430),List(0.286),59.15,0,-1.479,17.99,-1.476,,1,False,0.5098591549295775,228,0.3149171270718232,362,348,159,178,25,8.674033149171263,2.8679674553362657,49.04143646408841,28.30137622588541,203,0.4917127071823204,7.12,0.4320854634235804,0.0087983658934937
1,6279383,G,"List(List(G, C))",.,1268.87,List(),False,List(16),List(0.003197),5004,-4.319,-1.24,15942,,3.279,,-0.028,List(15),List(0.002998),58.22,0,0.258,10.75,1.466,,1,False,0.4380281690140845,3,0.0048231511254019,311,399,308,3,0,9.363344051446967,2.951887576494954,29.27009646302251,10.447435707025488,3,0.0096463022508038,,0.009615235254827,0.5024154589371982
1,7569602,C,"List(List(C, T))",.,329869.45,List(),False,List(2774),List(0.551),5036,14.403,-0.417,17990,,7.683,,0.0757,List(2777),List(0.551),58.92,0,6.004,23.33,-0.65,,1,False,0.5816901408450704,438,0.5302663438256658,413,297,84,220,109,9.181598062953984,2.990452987169473,54.903147699757874,29.766408009399772,329,0.5326876513317191,2.018348623853211,0.4987717367378384,0.1532558338674897


## PCA

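To check whether the samples show any genetic (population) structure, we run a principal component analysis (PCA) on the filtered variants and store the resulting scores in the sample annotation `sa.pca`.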

In [59]:
vds_pca = vds_gDP_gGQ_vCR_sDP_sGT_vFilter.pca(scores='sa.pca')
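
To give an intuition for what the scores represent, here is a small pure-Python sketch of genotype PCA on a toy matrix. It is not Hail's implementation: the function names and the toy genotypes are hypothetical, the per-variant normalization `(g - 2p) / sqrt(2p(1 - p))` is a commonly used one that is assumed here, and power iteration stands in for a full eigendecomposition, so only the first component is recovered.

```python
import math

def standardize(genotypes):
    """Center and scale each variant column of a samples x variants matrix.

    genotypes[i][j] is the alternate-allele count (0, 1 or 2) of sample i
    at variant j. Monomorphic columns (p == 0 or p == 1) are dropped.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        p = sum(col) / (2.0 * n)  # alternate-allele frequency
        if p <= 0.0 or p >= 1.0:
            continue
        sd = math.sqrt(2.0 * p * (1.0 - p))
        columns.append([(g - 2.0 * p) / sd for g in col])
    # Transpose back to samples x variants.
    return [[c[i] for c in columns] for i in range(n)]

def first_pc_scores(matrix, iterations=200):
    """Unit-length PC1 sample scores via power iteration on the Gram matrix."""
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[k]))
             for k in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Toy data: samples 0-2 and samples 3-5 form two artificial groups.
toy = [
    [0, 0, 2, 2], [0, 1, 2, 2], [1, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],
]
scores = first_pc_scores(standardize(toy))
print(scores)
```

In this toy example the two groups end up on opposite sides of zero along the first component, which is the kind of separation one looks for between super-populations in the `sa.pca.PC1` and `sa.pca.PC2` columns.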

In [60]:
type(vds_pca)

In [61]:
pca_table = vds_pca.samples_keytable().to_dataframe()
display(pca_table)

s,sa.myAnnot.Sample,sa.myAnnot.Population,sa.myAnnot.SuperPopulation,sa.myAnnot.isFemale,sa.myAnnot.PurpleHair,sa.myAnnot.CaffeineConsumption,sa.qc.callRate,sa.qc.nCalled,sa.qc.nNotCalled,sa.qc.nHomRef,sa.qc.nHet,sa.qc.nHomVar,sa.qc.nSNP,sa.qc.nInsertion,sa.qc.nDeletion,sa.qc.nSingleton,sa.qc.nTransition,sa.qc.nTransversion,sa.qc.dpMean,sa.qc.dpStDev,sa.qc.gqMean,sa.qc.gqStDev,sa.qc.nNonRef,sa.qc.rTiTv,sa.qc.rHetHomVar,sa.qc.rInsertionDeletion,sa.pca.PC1,sa.pca.PC2,sa.pca.PC3,sa.pca.PC4,sa.pca.PC5,sa.pca.PC6,sa.pca.PC7,sa.pca.PC8,sa.pca.PC9,sa.pca.PC10
HG00096,HG00096,GBR,EUR,False,False,77.0,0.2749232343909928,2686,7084,1130,1248,308,1864,0,0,1,1524,340,7.4586746090841425,1.829162383895932,45.90655249441547,28.46876215640281,1556,4.482352941176471,4.051948051948052,,-0.0325923472843964,-0.0685573311994893,-0.0001123812246816334,0.0282811776825848,-0.014010634542255,-0.017776752541682,0.0070538929090744,0.0278181992401504,0.0022370033606121,-0.00020094674282807024
HG00100,HG00100,GBR,EUR,True,False,59.0,0.9616171954964176,9395,375,5081,2779,1535,5849,0,0,2,4762,1087,13.224374667376209,3.712169739189841,55.6111761575305,27.06547877264492,4314,4.380864765409384,1.8104234527687295,,-0.1411342165515785,-0.2695722630186621,0.0151371346882598,0.1239393425826786,0.0377280802646084,0.0170086582109924,-0.1158179938298054,-0.1166263268686007,0.1062295803008534,0.0842077556500663
HG00105,HG00105,GBR,EUR,False,False,77.0,0.5797338792221085,5664,4106,2793,2024,847,3718,0,0,0,2993,725,8.650600282485852,2.2771253327175165,44.382591807909634,27.39013726273073,2871,4.128275862068966,2.389610389610389,,-0.0868458559218454,-0.1365471331196452,0.0185967429517303,0.0611515792044504,-0.0195489893306977,-0.0076328286456307,0.0137735707058318,-0.0123320662131688,0.004046293374303,0.0060683292505109
HG00114,HG00114,GBR,EUR,False,True,72.0,0.1494370522006141,1460,8310,509,776,175,1126,0,0,0,924,202,6.986301369863015,1.6055823053687388,43.37945205479454,25.086450301723687,951,4.574257425742574,4.434285714285714,,-0.0153890976402962,-0.0359986302251286,0.0046913408539963,0.0173470607358387,-0.0089425822318202,-0.0072195810959567,-0.0026005684325834,0.006588598068711,-0.0005640274308805656,-0.0020790033670626
HG00115,HG00115,GBR,EUR,False,True,71.0,0.6856704196519959,6699,3071,3479,2218,1002,4222,0,0,1,3451,771,9.22779519331246,2.5460202117218307,44.704433497536925,26.70956053033648,3220,4.476005188067445,2.2135728542914173,,-0.0849685066794085,-0.1907138587745121,-0.0004696177875208916,0.0406969824992708,-0.0057361345117335,-0.0047331208751515,0.0014882309275965,0.0522990703804851,-0.0046335620200133,-0.0152350422812655
HG00116,HG00116,GBR,EUR,False,False,71.0,0.252200614124872,2464,7306,1065,1075,324,1723,0,0,0,1403,320,7.394074675324675,1.7025517387131073,40.97767857142858,23.627226861323056,1399,4.384375,3.317901234567901,,-0.0290276964778749,-0.0616094307004349,0.0058841810945058,0.0175856791539295,-0.0029524673108539,-0.0081618674081923,0.0048589195411199,0.0203366335535753,-0.0029991626469734,-0.0028072224002579
HG00119,HG00119,GBR,EUR,False,True,70.0,0.3195496417604913,3122,6648,1363,1365,394,2153,0,0,0,1743,410,7.527866752082009,1.7771483099316476,42.03939782190902,24.84769850706469,1759,4.251219512195122,3.464467005076142,,-0.0427009722248152,-0.0734706147843711,0.0006949008546658025,0.0254972745542106,-0.004740550859982,-0.0144889210099777,0.0010899562550298,0.0225521909346353,0.0033400517817615,-0.0052905018432835
HG00120,HG00120,GBR,EUR,True,True,68.0,0.2860798362333674,2795,6975,1238,1241,316,1873,0,0,0,1517,356,7.446153846153844,1.6802170140360535,41.18246869409654,23.966680366447957,1557,4.26123595505618,3.9272151898734178,,-0.0334172898004762,-0.0779457863149751,0.0035228419963734,0.0309019533042058,-0.0066378903238628,-0.0099762418869531,0.0020642819454803,0.0135567442364605,-0.0093524971372831,0.0011281933530189
HG00128,HG00128,GBR,EUR,True,True,67.0,0.6401228249744114,6254,3516,3277,2036,941,3918,0,0,0,3153,765,9.280620402942136,2.5949974996440823,45.32523185161495,27.475594538376384,2977,4.12156862745098,2.1636556854410203,,-0.0685885430498648,-0.187549447420306,-0.0028406656410715,0.0708426662913436,0.0116936170262754,0.0028093349274797,-0.0134119759305212,-0.0142278585206974,-0.0294974430721741,0.0175350703409932
HG00131,HG00131,GBR,EUR,False,False,72.0,0.309007164790174,3019,6751,1311,1317,391,2099,0,0,0,1712,387,7.452798940046386,1.6856292863721216,41.58860549850948,24.326264763553837,1708,4.423772609819121,3.368286445012788,,-0.0337305413800772,-0.0891131817732007,-0.0002461137258943689,0.0302978444179001,-0.0054696389329686,-0.0050234759834465,0.002512517328665,0.0191718522895119,0.0073201882024065,0.0058429520097317
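
One simple way to summarize such a table is to average the first two components per super-population and compare the group centroids. The sketch below is an illustration only: `pc_centroids` is a hypothetical helper, and the input uses just the first two rows of the table above, both of which are EUR samples, so it yields a single centroid.

```python
from collections import defaultdict

def pc_centroids(rows):
    """Average PC1 and PC2 per super-population label.

    `rows` is an iterable of (super_population, pc1, pc2) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for label, pc1, pc2 in rows:
        acc = sums[label]
        acc[0] += pc1
        acc[1] += pc2
        acc[2] += 1
    return {label: (s1 / n, s2 / n) for label, (s1, s2, n) in sums.items()}

# (SuperPopulation, sa.pca.PC1, sa.pca.PC2) copied from the table above.
rows = [
    ("EUR", -0.0325923472843964, -0.0685573311994893),  # HG00096
    ("EUR", -0.1411342165515785, -0.2695722630186621),  # HG00100
]
centroids = pc_centroids(rows)
print(centroids)
```

Applied to all samples, well separated centroids across super-populations would suggest that the leading components capture population structure.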


Install packages that are necessary to produce a geographic map:

## Summary

Data filtering:
 - Filter genotypes
 - Filter samples
 - Filter variants

PCA:
 - Clustering of super-populations