# Where do kākāpō SNVs come from?

All living kākāpō descend from two sources: the Stewart Island population and a single male (Richard Henry) from Fiordland, South Island. In this analysis we ask, for each polymorphic site, where the polymorphism came from. That is, a minor allele exists at the site (because it is polymorphic), so did that minor allele come from the Stewart Island founders, from Richard Henry, from both of these sources, or does it appear to be a _de novo_ mutation within the pedigree?

We can approach this question using the filtered pop gen SNPs and a map assigning individuals to an origin. There are 35 founders from Stewart Island with at least one chick, 123 descendant birds and our single Fiordland representative.

In [1]:
from collections import Counter
with open("pops.tsv") as handle:
    population_map = {bird: pop for bird, pop in (line.split() for line in handle)}
Counter(population_map.values())

Counter({'SI': 35, 'SI_noSI': 12, 'descendant': 123, 'fiordland': 1})
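
The comprehension above expects `pops.tsv` to hold one bird per line as two whitespace-separated columns (name, then population label). A minimal sketch of that format, with a parse that skips blank lines; the bird names are made up for illustration and `read_population_map` is a hypothetical helper, not part of the original analysis:

```py
from collections import Counter
import io

# Hypothetical contents of pops.tsv: bird name, whitespace, population label.
# The bird names are placeholders; the labels are ones seen in the Counter above.
example_tsv = io.StringIO(
    "bird_A\tSI\n"
    "bird_B\tSI\n"
    "\n"
    "bird_C\tdescendant\n"
    "bird_D\tfiordland\n"
)

def read_population_map(handle):
    """Build a {bird: population} dict from a two-column file handle."""
    pop_map = {}
    for line in handle:
        if not line.strip():
            continue  # a blank line would break the two-value unpacking below
        bird, pop = line.split()
        pop_map[bird] = pop
    return pop_map

example_map = read_population_map(example_tsv)
print(Counter(example_map.values()))
```

With the real file this could be called as `with open("pops.tsv") as handle: population_map = read_population_map(handle)`.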

Now we can make the filtered SNP set and read it into pyVCF.

The header gets a little munted in the bcf -> vcf conversion, and grepping on "#" seems to cut off some sample names. The code below writes the header lines to a separate file, which then needs a few lines edited by hand in a text editor before pyVCF can parse it:

```py
with open("vars/filtered.recode.vcf") as vcf_in, open("header", "w") as out:
    for line in vcf_in:
        # header lines all start with "#"; stop at the first record
        if not line.startswith("#"):
            break
        out.write(line)
```

Then some shell to stitch it back together:

```sh
cat header <(grep -v "^#" vars/filtered.recode.vcf ) |  gzip > vars/filt_snps.vcf.gz
rm vars/filtered.recode.vcf
```

**RUNNING THE NEXT CELL TAKES ~10 MINUTES ON MY DESKTOP**

In [2]:
! mkdir -p vars
! vcftools --bcf ../../vars/Trained.bcf --minQ 80 --hwe 0.001 --max-alleles 2 --max-missing .4  --remove-indels --recode --out vars/filtered


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--bcf ../../vars/Trained.bcf
	--max-alleles 2
	--hwe 0.001
	--minQ 80
	--max-missing 0.4
	--out vars/filtered
	--recode
	--remove-indels

Error: Could not open VCF file: ../../vars/Trained.bcf


In [3]:
import vcf 
import gzip
from collections import defaultdict
from collections import Counter

recs = vcf.Reader(open("vars/filt_snps.vcf.gz", mode="rb"), compressed=True)
#tiny sample of the vcf to test everything works
recs = [next(recs) for _ in range(1000)]

Now that the SNPs are ready to parse, we define functions to assign each site to an origin. In short, we first determine whether the reference or non-reference allele is the minor allele at the site. We then count the number of minor alleles carried by birds in each possible category. The function `counts_to_cat` takes these counts as input and assigns them to a category. A site whose minor allele is seen in Richard Henry but not in the Stewart Island founders, yet occurs more than 10 times among the descendants, is flagged `fiordland_dodgy`. A number of test cases are run below to confirm that this function works as intended.

In [4]:

def count_minor(sample, minor="1"):
    "Count the number of minor alleles in a vcf sample"
    return(sample.gt_alleles.count(minor))

def counts_to_cat(count_dict):
    "Assign a count-dictionary to an SNP-origin category"
    if count_dict["SI"] > 0:
        if count_dict["fiordland"] > 0:
            return("Both")
        else:
            return("SI")
    #not a SI origin
    if count_dict["fiordland"] > 0:
        if count_dict["descendant"] > 10:
            return("fiordland_dodgy")
        else:
            return("fiordland")
    return("de novo")
    

def assign_allele(site, pop_map):
    """Assign a given site to an origin category"""
    # aaf is the non-reference allele frequency; above 0.5 the reference allele is the minor one
    minor = "0" if site.aaf[0] > 0.5 else "1"
    pop_counts = defaultdict(int)
    for s in site.samples:
        pop_counts[pop_map[s.sample]] += count_minor(s, minor)
    return(counts_to_cat(pop_counts))
    

In [5]:
fiordland = {'SI': 0, 'SI_no_rep': 0, 'descendant': 2, 'fiordland': 1}
both = {'SI': 1, 'SI_no_rep': 0, 'descendant': 2, 'fiordland': 1}
SI = {'SI': 1, 'SI_no_rep': 0, 'descendant': 2, 'fiordland': 0}
dodgy_fiordland = {'SI': 0, 'SI_no_rep': 0, 'descendant': 12, 'fiordland': 1}
good_denovo = {'SI': 0, 'SI_no_rep': 0, 'descendant': 2, 'fiordland': 0}



In [6]:
print( counts_to_cat(fiordland) )
print( counts_to_cat(both) )
print( counts_to_cat(SI) )
print( counts_to_cat(dodgy_fiordland) )
print( counts_to_cat(good_denovo) )

fiordland
Both
SI
fiordland_dodgy
de novo
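
The test cases above exercise `counts_to_cat` only. The counting step in `assign_allele` can be checked in the same spirit without a VCF. The sketch below is a stand-in under stated assumptions: plain lists of allele strings replace pyVCF's `gt_alleles`, the bird names are hypothetical, and `count_site` is a hypothetical helper that mirrors the logic rather than the actual function.

```py
from collections import defaultdict

def count_site(genotypes, pop_map):
    """Count minor alleles per population for one biallelic site.

    genotypes maps bird -> list of allele strings, e.g. ["0", "1"];
    this stands in for pyVCF's sample.gt_alleles.
    Assumes at least one called allele at the site.
    """
    called = [a for alleles in genotypes.values() for a in alleles if a in ("0", "1")]
    alt_freq = called.count("1") / len(called)
    # If the non-reference allele is the common one, the reference allele is minor.
    minor = "0" if alt_freq > 0.5 else "1"
    pop_counts = defaultdict(int)
    for bird, alleles in genotypes.items():
        pop_counts[pop_map[bird]] += alleles.count(minor)
    return minor, dict(pop_counts)

# Hypothetical birds: two Stewart Island founders, Richard Henry, one descendant.
pop_map = {"si_1": "SI", "si_2": "SI", "rh": "fiordland", "chick": "descendant"}
site = {"si_1": ["0", "0"], "si_2": ["0", "0"], "rh": ["0", "1"], "chick": ["0", "1"]}

minor, counts = count_site(site, pop_map)
print(minor, counts)
```

Here the non-reference allele is the minor one and is carried only by Richard Henry and the descendant, so these counts would fall in the `fiordland` category.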


With the functions in hand, it's just a matter of running them on every site:

**RUNNING THE FOLLOWING CELL TAKES A LOOOONG TIME. COULD EASILY BE PARALLELISED IF WE REPEAT THIS OFTEN**

In [7]:
all_types = Counter( (assign_allele(site, population_map) for site in recs) )
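
One way the parallelisation could look: each site is classified independently, so the sites can be split into chunks, each chunk tallied in its own process, and the per-chunk `Counter` objects summed. This is a sketch under assumptions, not something that was run for this analysis. It works on plain count dictionaries rather than pyVCF records (whether those pickle cleanly between processes is not checked here), `classify_chunk`, `chunked` and `parallel_counts` are hypothetical names, and `counts_to_cat` is repeated only so the block runs on its own.

```py
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def counts_to_cat(count_dict):
    "Assign a count-dictionary to an SNP-origin category (copied from above)"
    if count_dict["SI"] > 0:
        return "Both" if count_dict["fiordland"] > 0 else "SI"
    if count_dict["fiordland"] > 0:
        return "fiordland_dodgy" if count_dict["descendant"] > 10 else "fiordland"
    return "de novo"

def classify_chunk(chunk):
    """Tally origin categories for one chunk of per-site count dictionaries."""
    return Counter(counts_to_cat(c) for c in chunk)

def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_counts(count_dicts, chunk_size=1000, workers=2):
    """Classify chunks in separate processes and merge the tallies."""
    chunks = chunked(count_dicts, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_tallies = pool.map(classify_chunk, chunks)
        # Counter addition merges the per-chunk tallies into one
        return sum(partial_tallies, Counter())
```

On platforms that start worker processes by spawning rather than forking, `parallel_counts` would need to be called from a script under an `if __name__ == "__main__":` guard.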


In [8]:
all_types

Counter({'Both': 328, 'SI': 226, 'de novo': 15, 'fiordland': 431})

In [9]:
n = sum(all_types.values())
for k,v in all_types.items():
    print(k, v/n)

fiordland 0.431
Both 0.328
de novo 0.015
SI 0.226


In [10]:
with open("snp_origin.tsv", "w") as out:
    for k,v in all_types.items():
        out.write("{}\t{}\n".format(k,v))
