In this notebook I combine the results of three candidate outlier detection approaches, implemented in:

 - Stacks
 - Bayescan
 - Bayenv
 


#*Stacks*

Fst was calculated for pairwise comparisons using `populations` from the Stacks program suite. Initial Fst significance levels were calculated across 50kb sliding windows using 10,000 bootstrap replicates. For windows with p < 0.001, significance levels were recalculated using 1,000,000 bootstrap replicates.

In [1]:
!mkdir Stacks

In [2]:
cd Stacks/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/Stacks


Identify all tags that were assigned __*p < 0.00005*__ in at least one pairwise comparison by `populations`.

In [40]:
%%bash

#specify the directory where populations had been run
stacks_dir=/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/3-populations/pairwise_single_SNP/Diplotaxodon/excl_singletons

#extract the most significant SNPs
for b in $(ls -1 $stacks_dir | grep "Di"); do
    for a in $(zcat $stacks_dir/$b/r_0.8-p_2-w50kb-1M_bs/bootstrap_whitelist.txt.gz); do
        zcat $stacks_dir/$b/r_0.8-p_2-w50kb-1M_bs/batch_1.fst_$b.tsv.gz | grep -P "1\t$a\tDi"
    done | perl -ne 'chomp; @a=split("\t"); if ($a[-2] < 0.00005){print "$a[1]\n"}'
done | sort -n | uniq > Stacks.candidates.txt

echo -e "Number of candidates highlighted by Stacks (list of tag IDs saved in 'Stacks.candidates.txt'):\n$(cat Stacks.candidates.txt |wc -l)"

Number of candidates highlighted by Stacks (list of tag IDs saved in 'Stacks.candidates.txt'):
125
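
The nested shell loop above boils down to one filter: keep a tag ID whenever the p-value in its Fst row falls below the threshold. The following is a minimal Python sketch of that filter. It assumes, as the Perl one-liner does, that the tag ID sits in the second tab-separated column and the p-value in the second-to-last one. The function name `significant_tags` and the toy rows are hypothetical and only serve as illustration.

```python
def significant_tags(rows, threshold=0.00005):
    """Return the sorted unique tag IDs whose p-value is below threshold.

    rows: iterable of tab-separated lines; the second column holds the
    tag ID and the second-to-last column the p-value (assumed layout).
    """
    tags = set()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if float(fields[-2]) < threshold:
            tags.add(int(fields[1]))
    return sorted(tags)


# toy rows (made up): batch, tag ID, population, value, p-value, trailing column
toy_rows = [
    "1\t101\tDi_1\t0.30\t0.00001\tx\n",
    "1\t102\tDi_1\t0.05\t0.20000\tx\n",
    "1\t101\tDi_2\t0.28\t0.00002\tx\n",
]
print(significant_tags(toy_rows))  # prints [101]
```

Collecting the IDs in a set plays the role of `sort -n | uniq`: a tag that is significant in several pairwise comparisons is reported once.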


In [3]:
cd ..

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER


#*Bayescan*

Fst outlier scans with `Bayescan` were performed for all pairwise comparisons and also for a global dataset containing all 4 populations. In all cases prior odds for the neutral model were set to 10 (`--pr_odds 10`) and false discovery rate was set to 0.05.
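
To illustrate what a false discovery rate cutoff of 0.05 means for a list of outliers, here is a minimal sketch of one common way to derive q-values from per-locus posterior probabilities of selection: the q-value of a locus is the mean posterior error probability over that locus and all loci with a higher posterior probability. Whether Bayescan computes its q-values in exactly this way is an assumption, and the function name `q_values` and the toy probabilities are hypothetical.

```python
def q_values(posterior_probs):
    """q-value per locus: expected false discovery rate if this locus and
    every locus with a higher posterior probability were called outliers.
    Ties are not treated specially in this sketch."""
    order = sorted(range(len(posterior_probs)),
                   key=lambda i: posterior_probs[i], reverse=True)
    qvals = [0.0] * len(posterior_probs)
    error_sum = 0.0
    for n_called, i in enumerate(order, start=1):
        error_sum += 1.0 - posterior_probs[i]  # posterior error probability
        qvals[i] = error_sum / n_called
    return qvals


# toy posterior probabilities for four loci (made up)
probs = [0.99, 0.50, 0.97, 0.10]
qvals = q_values(probs)
outliers = [i for i, q in enumerate(qvals) if q <= 0.05]
print(outliers)  # prints [0, 2]
```

With these toy values the two loci with posterior probabilities 0.99 and 0.97 pass the 0.05 cutoff, while adding the third-ranked locus would raise the expected false discovery rate to 0.18.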

In [13]:
!mkdir Bayescan

In [4]:
cd Bayescan/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/Bayescan


In [18]:
%%bash
#summarize results from pairwise runs

#this is where the pairwise results are
bayescan_dir=/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/3-populations/pairwise_single_SNP/Diplotaxodon/excl_singletons/BAYESCAN_pairwise

for a in $(ls -1 $bayescan_dir/ | grep "^Di_"); do cat $bayescan_dir/$a/$a-10-FDR-0.05.outlier_stacks_ID.list; done |sort -n |uniq > Bayescan.pairwise.candidates.txt

echo -e "Number of candidates highlighted by pairwise Bayescan runs \n\
(list of tag IDs saved in 'Bayescan.pairwise.candidates.txt'):\n\
$(cat Bayescan.pairwise.candidates.txt |wc -l)"


Number of candidates highlighted by pairwise Bayescan runs 
(list of tag IDs saved in 'Bayescan.pairwise.candidates.txt'):
98


In [20]:
%%bash
#summarize results from global run

bayescan_dir=/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/3-populations/Diplotaxodon_4pop/m5_mpop5_kernel_iterate_ONE_SNP_PER_TAG_EXCL_SINGLETONS_BOOTSTRAP/r_0.8-p_4/Bayescan

cat $bayescan_dir/Di_4pop_r_0.8_p4/Di_4pop_r_0.8_p4-10-FDR-0.05.outlier_stacks_ID.list > Bayescan.global.candidates.txt

echo -e "Number of candidates highlighted by global Bayescan run \n\
(list of tag IDs saved in 'Bayescan.global.candidates.txt'):\n\
$(cat Bayescan.global.candidates.txt |wc -l)"

Number of candidates highlighted by global Bayescan run 
(list of tag IDs saved in 'Bayescan.global.candidates.txt'):
76


In [21]:
%%bash
#summarize all Bayescan results

cat Bayescan.global.candidates.txt Bayescan.pairwise.candidates.txt | sort -n |uniq > Bayescan.candidates.txt

echo -e "Number of candidates highlighted by Bayescan runs \n\
(list of tag IDs saved in 'Bayescan.candidates.txt'):\n\
$(cat Bayescan.candidates.txt |wc -l)"

Number of candidates highlighted by Bayescan runs 
(list of tag IDs saved in 'Bayescan.candidates.txt'):
122


In [5]:
cd ..

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER


#*Bayenv*

XTX was calculated using `Bayenv` for the global dataset containing all 4 Diplotaxodon populations. Bayenv was run 20 times. We first identified the SNPs with the strongest signal of selection (top 5% XTX across the 20 independent Bayenv runs, i.e. ARR > 0.95), then assigned a p-value to each of these SNPs using 10,000 bootstrap replicates. Bayenv candidate SNPs have __ARR > 0.95 and p < 0.005__.
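
ARR is not spelled out in this notebook. One reading that matches "top 5% XTX across 20 independent runs" is a rank-based score averaged over runs, and the sketch below assumes exactly that: within each run SNPs are ranked by XTX, the rank is divided by the number of SNPs, and the result is averaged across runs. It is not taken from the original analysis scripts; the function name `average_relative_rank` and the toy values are hypothetical.

```python
def average_relative_rank(xtx_runs):
    """xtx_runs: list of runs, each a dict {snp_id: XTX value}.

    Returns {snp_id: mean over runs of rank / n_snps}, where the SNP with
    the largest XTX in a run gets rank n_snps (relative rank 1.0).
    """
    snps = sorted(xtx_runs[0])
    n_snps = float(len(snps))
    totals = dict((snp, 0.0) for snp in snps)
    for run in xtx_runs:
        ranked = sorted(snps, key=lambda snp: run[snp])  # ascending XTX
        for rank, snp in enumerate(ranked, start=1):
            totals[snp] += rank / n_snps
    return dict((snp, totals[snp] / len(xtx_runs)) for snp in snps)


# two toy runs over four SNPs (made-up XTX values)
runs = [
    {'a': 1.0, 'b': 5.0, 'c': 2.0, 'd': 3.0},
    {'a': 1.5, 'b': 6.0, 'c': 2.5, 'd': 2.0},
]
arr = average_relative_rank(runs)
candidates = sorted(snp for snp in arr if arr[snp] > 0.95)
print(candidates)  # prints ['b']
```

Averaging ranks rather than raw XTX values keeps a single run with unusually large values from dominating the score.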

In [27]:
mkdir Bayenv

In [6]:
cd Bayenv/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/Bayenv


In [29]:
%%bash

Bayenv_dir=/media/chrishah/STORAGE/RAD/popgen/Bayenv/Diplotaxodon/4_populations/M_zebra-BWA-8MM-stacks_m5_n5_r_0.8_p4_ONLY_ONE_SNP/ANALYSES_FOR_DIPLOTAXODON_PAPER_EXCLUDE_SINGLETONS/XTX

cat $Bayenv_dir/XTX.candidates.txt > Bayenv.candidates.txt

echo -e "Number of candidates highlighted by Bayenv runs \n\
(list of tag IDs saved in 'Bayenv.candidates.txt'):\n\
$(cat Bayenv.candidates.txt |wc -l)"

Number of candidates highlighted by Bayenv runs 
(list of tag IDs saved in 'Bayenv.candidates.txt'):
96


In [7]:
cd ..

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER


#Identify overlaps between the three approaches

In [31]:
mkdir summarize

In [8]:
cd summarize

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/summarize


In [9]:
%%bash

cat ../Stacks/Stacks.candidates.txt ../Bayescan/Bayescan.candidates.txt ../Bayenv/Bayenv.candidates.txt | sort -n | uniq > Candidates.cummulative.txt

echo -e "In total the three approaches identified $(cat Candidates.cummulative.txt |wc -l) candidates\n"

cat ../Stacks/Stacks.candidates.txt ../Bayescan/Bayescan.candidates.txt ../Bayenv/Bayenv.candidates.txt | sort -n > Candidates.cummulative.redundant.txt



In total the three approaches identified 276 candidates



In [360]:

Stacks = []
IN = open('../Stacks/Stacks.candidates.txt')
for line in IN:
    Stacks.append(line.strip())
IN.close()
print len(Stacks)

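#note: only the global Bayescan list (76 tags) is read here, not the combined
#global + pairwise list (Bayescan.candidates.txt, 122 tags); this is why the
#cumulative count printed by this cell (243) is lower than the 276 reported above
Bayescan = []
IN = open('../Bayescan/Bayescan.global.candidates.txt')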
for line in IN:
    Bayescan.append(line.strip())
IN.close()
print len(Bayescan)

Bayenv = []
IN = open('../Bayenv/Bayenv.candidates.txt')
for line in IN:
    Bayenv.append(line.strip())
IN.close()
print len(Bayenv)

tags = {'Stacks':Stacks, 'Bayescan':Bayescan, 'Bayenv':Bayenv}

cummulative = []
cummulative.extend(Stacks)
cummulative.extend(Bayescan)
cummulative.extend(Bayenv)
cummulative=list(set(cummulative))
print "Cummulative Number of candidates: %i" %len(cummulative)

Headers=['Stacks','Bayescan','Bayenv']

matrix = {}

OUT=open('matrix.csv','w')
OUT.write("tag_ID,"+",".join(Headers)+"\n")

for i in cummulative:
    outstring=str(i)
    matrix[i] = {}
    for h in Headers:
        if i in tags[h]:
            matrix[i][h] = 1
        else:
            matrix[i][h] = 0
        outstring+=","+str(matrix[i][h])
    OUT.write(outstring+"\n")

OUT.close()

125
76
96
Cummulative Number of candidates: 243
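
The presence/absence matrix written above can also be summarized directly as "how many methods flag each tag". A minimal sketch using `collections.Counter` follows. The toy tag IDs are made up, and the helper name `support_counts` is hypothetical.

```python
from collections import Counter


def support_counts(candidate_lists):
    """Map each tag ID to the number of methods that flagged it."""
    counts = Counter()
    for tags in candidate_lists.values():
        counts.update(set(tags))  # set(): each method counts at most once
    return counts


# toy candidate lists (made up), keyed like the `tags` dictionary above
toy = {
    'Stacks':   ['11', '12', '13'],
    'Bayescan': ['12', '13', '14'],
    'Bayenv':   ['13', '15'],
}
counts = support_counts(toy)
by_support = Counter(counts.values())
print(sorted(by_support.items()))  # prints [(1, 3), (2, 1), (3, 1)]
```

In the toy data three tags are flagged by one method only, one tag by two methods and one tag by all three, which are the quantities a three-set Venn diagram summarizes.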


In [361]:
%%bash

Rscript Venn.R

[1] 125
[1] 96
[1] 76
[1] 1


Loading required package: VennDiagram
Loading required package: grid


In [377]:
#find reference locations for relevant tag IDs from stacks catalog

print "\n### minimum 3 of 3 ###\n"

out='min_3_of_3.tsv'
catalog='/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/2-cstacks/m5/n0/data/batch_1.catalog.tags.tsv.gz'

minimum = ['Stacks','Bayescan','Bayenv']

#minimum = ['Bayenv']
tags_coordinates = {}
tags_ids = {}
count=0

import gzip

FH = gzip.open(catalog,'rb')

for line in FH:
    if line.split("\t")[2] in matrix.keys():
        to_test = minimum[:]
#        print matrix[line.split("\t")[2]]
        for m in reversed(range(len(to_test))):
            if not matrix[line.split("\t")[2]][to_test[m]]:
                continue
            else:
                del to_test[m]
        if len(to_test) <= 0:
            count+=1
#            print "OK\n"
#            print line.strip().split("\t")[2:5]
            if not line.split("\t")[3] in tags_coordinates.keys():
                tags_coordinates[line.split("\t")[3]] = [line.split("\t")[4]]
                tags_ids[line.split("\t")[3]] = [line.split("\t")[2]]
            else:
                tags_coordinates[line.split("\t")[3]].append(line.split("\t")[4])
                tags_ids[line.split("\t")[3]].append(line.split("\t")[2])
#            print line.split("\t")[3],tags_coordinates[line.split("\t")[3]]
        else:
#            print "not good enough\n"
            pass
    
print "Total number of candidates at these filtering criteria: %i" %count
print "Distributed across a total number of scaffolds: %i" %len(tags_coordinates)


OUT=open(out,'w')
for scf in sorted(tags_coordinates):
    print scf, tags_coordinates[scf]
    for i in range(len(tags_coordinates[scf])):
        
        OUT.write("%s\t%s\t%s\n" %(scf, tags_coordinates[scf][i], tags_ids[scf][i]))
OUT.close()
print



### minimum 3 of 3 ###

Total number of candidates at these filtering criteria: 13
Distributed across a total number of scaffolds: 8
scaffold_114 ['1587947']
scaffold_174 ['404159']
scaffold_197 ['125372', '143015', '150458']
scaffold_219 ['11706', '774']
scaffold_242 ['261422']
scaffold_29 ['1759443', '1854859']
scaffold_45 ['357551']
scaffold_81 ['374911', '450783']
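
The lookup above keeps a tag only if no method is missing from its row of `matrix`. The same test can be written as a count of supporting methods, which turns the threshold into a parameter. A minimal sketch, assuming the catalog layout used above (tag ID, scaffold and position at 0-based indices 2, 3 and 4); the function name `locate_candidates` and the toy rows are hypothetical.

```python
def locate_candidates(catalog_lines, matrix, min_methods):
    """Group candidate tags by scaffold.

    catalog_lines: iterable of tab-separated catalog rows.
    matrix: {tag_id: {method: 0 or 1}}.
    Returns {scaffold: [(position, tag_id), ...]} for all tags flagged by
    at least min_methods methods.
    """
    located = {}
    for line in catalog_lines:
        fields = line.rstrip("\n").split("\t")
        tag_id, scaffold, position = fields[2], fields[3], fields[4]
        if tag_id in matrix and sum(matrix[tag_id].values()) >= min_methods:
            located.setdefault(scaffold, []).append((position, tag_id))
    return located


# toy data (made up)
toy_matrix = {
    '7': {'Stacks': 1, 'Bayescan': 1, 'Bayenv': 1},
    '8': {'Stacks': 1, 'Bayescan': 0, 'Bayenv': 0},
}
toy_catalog = [
    "0\t1\t7\tscaffold_1\t100\t+\n",
    "0\t1\t8\tscaffold_1\t200\t+\n",
    "0\t1\t9\tscaffold_2\t300\t+\n",
]
print(locate_candidates(toy_catalog, toy_matrix, 3))
print(locate_candidates(toy_catalog, toy_matrix, 1))
```

Splitting each line once and using `setdefault` avoids the repeated `line.split("\t")` calls and the separate coordinate and ID dictionaries.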



In [376]:
#find reference locations for relevant tag IDs from stacks catalog

print "\n### minimum 2 of 3 ###\n"

out='min_2_of_3.tsv'
catalog='/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/2-cstacks/m5/n0/data/batch_1.catalog.tags.tsv.gz'

minimum = ['Stacks','Bayescan','Bayenv']

#minimum = ['Bayenv']
tags_coordinates = {}
tags_ids = {}
count=0

import gzip

FH = gzip.open(catalog,'rb')

for line in FH:
    if line.split("\t")[2] in matrix.keys():
        to_test = minimum[:]
#        print matrix[line.split("\t")[2]]
        for m in reversed(range(len(to_test))):
            if not matrix[line.split("\t")[2]][to_test[m]]:
                continue
            else:
                del to_test[m]
        if len(to_test) <= 1:
            count+=1
#            print matrix[line.split("\t")[2]]
            #            print "OK\n"
#            print line.strip().split("\t")[2:5]
            if not line.split("\t")[3] in tags_coordinates.keys():
                tags_coordinates[line.split("\t")[3]] = [line.split("\t")[4]]
                tags_ids[line.split("\t")[3]] = [line.split("\t")[2]]
            else:
                tags_coordinates[line.split("\t")[3]].append(line.split("\t")[4])
                tags_ids[line.split("\t")[3]].append(line.split("\t")[2])
#            print line.split("\t")[3],tags_coordinates[line.split("\t")[3]]
        else:
#            print "not good enough\n"
            pass
    
print "Total number of candidates at these filtering criteria: %i" %count
print "Distributed across a total number of scaffolds: %i" %len(tags_coordinates)


OUT=open(out,'w')
for scf in sorted(tags_coordinates):
    print scf, tags_coordinates[scf]
    for i in range(len(tags_coordinates[scf])):
        
        OUT.write("%s\t%s\t%s\n" %(scf, tags_coordinates[scf][i], tags_ids[scf][i]))
OUT.close()
print



### minimum 2 of 3 ###

Total number of candidates at these filtering criteria: 41
Distributed across a total number of scaffolds: 26
scaffold_103 ['1580155']
scaffold_111 ['2213607', '2232468']
scaffold_114 ['1587947', '1587950']
scaffold_12 ['3799856']
scaffold_125 ['1591798']
scaffold_133 ['466413', '466416']
scaffold_136 ['317152']
scaffold_162 ['1301127']
scaffold_174 ['404159']
scaffold_190 ['169576']
scaffold_197 ['125372', '143015', '150458', '212091']
scaffold_203 ['246632', '270184', '275720']
scaffold_212 ['153555']
scaffold_215 ['135319', '176053', '174230']
scaffold_219 ['105692', '11706', '774']
scaffold_227 ['206705']
scaffold_242 ['261422']
scaffold_29 ['1759443', '1854859']
scaffold_344 ['7411']
scaffold_39 ['1771946', '1815005']
scaffold_45 ['357551']
scaffold_48 ['2642963']
scaffold_53 ['3346004']
scaffold_55 ['1871864']
scaffold_58 ['942901']
scaffold_81 ['374911', '450783']



In [379]:
#find reference locations for relevant tag IDs from stacks catalog

print "\n### minimum 1 of 3 ###\n"

out='min_1_of_3.tsv'
catalog='/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/2-cstacks/m5/n0/data/batch_1.catalog.tags.tsv.gz'

minimum = ['Stacks','Bayescan','Bayenv']

#minimum = ['Bayenv']
tags_coordinates = {}
tags_ids = {}
count=0

import gzip

FH = gzip.open(catalog,'rb')

for line in FH:
    if line.split("\t")[2] in matrix.keys():
        to_test = minimum[:]
#        print len(to_test)
#        print matrix[line.split("\t")[2]]
        for m in reversed(range(len(to_test))):
            if not matrix[line.split("\t")[2]][to_test[m]]:
                continue
            else:
                del to_test[m]
        if len(to_test) <= 2: #at most 2 methods missing, i.e. at least 1 of 3 supports the tag
#            print matrix[line.split("\t")[2]]
            count+=1
#            print "OK\n"
#            print line.strip().split("\t")[2:5]
            if not line.split("\t")[3] in tags_coordinates.keys():
                tags_coordinates[line.split("\t")[3]] = [line.split("\t")[4]]
                tags_ids[line.split("\t")[3]] = [line.split("\t")[2]]
            else:
                tags_coordinates[line.split("\t")[3]].append(line.split("\t")[4])
                tags_ids[line.split("\t")[3]].append(line.split("\t")[2])
#            print line.split("\t")[3],tags_coordinates[line.split("\t")[3]]
        else:
#            print "not good enough\n"
            pass
    
print "Total number of candidates at these filtering criteria: %i" %count
print "Distributed across a total number of scaffolds: %i" %len(tags_coordinates)


OUT=open(out,'w')
for scf in sorted(tags_coordinates):
    print scf, tags_coordinates[scf], tags_ids[scf]
    for i in range(len(tags_coordinates[scf])):
        
        OUT.write("%s\t%s\t%s\n" %(scf, tags_coordinates[scf][i], tags_ids[scf][i]))
OUT.close()
print



### minimum 1 of 3 ###

Total number of candidates at these filtering criteria: 243
Distributed across a total number of scaffolds: 103
scaffold_0 ['12990281', '8979020'] ['117', '864']
scaffold_1 ['330745', '6150413'] ['17242', '17509']
scaffold_103 ['1580155', '1964245', '1971523', '1979541'] ['1562', '1611', '1614', '1616']
scaffold_104 ['2286919'] ['1868']
scaffold_106 ['638748'] ['2264']
scaffold_110 ['2268305', '636788', '718485'] ['3405', '3461', '3464']
scaffold_111 ['1164830', '1209473', '1456276', '2183279', '2193336', '2202289', '2213607', '2232468', '2248059'] ['3505', '3514', '3556', '3652', '3658', '3661', '3666', '3667', '3669']
scaffold_112 ['514446', '577510'] ['3836', '3839']
scaffold_114 ['1532328', '1532602', '1587947', '1587950'] ['4118', '4119', '4122', '4123']
scaffold_119 ['1188769', '1259748', '1566980', '1569637', '852616'] ['4826', '4834', '4860', '4863', '4955']
scaffold_12 ['3799856'] ['6728']
scaffold_122 ['1482278'] ['5764']
scaffold_125 ['1591798', '159

#Find genes

In [240]:
!ln -sf /media/chrishah/STORAGE/Dropbox/Github/genomisc/popogeno/QTlight/QTLight_functions.py .


In [5]:
import QTLight_functions as QTL

In [6]:
files = ['min_3_of_3.tsv', 'min_2_of_3.tsv', 'min_1_of_3.tsv']

In [7]:
gff_per_scaffold = QTL.parse_gff(gff='/media/chrishah/STORAGE/DATA/Cichlids/reference_data/M_zebra/annotations/Metriaclima_zebra.BROADMZ2.gtf')

In [8]:
genes_per_analysis = QTL.find_genes(rank_stats = files, gff = gff_per_scaffold, distance = 50)

processing rank statistic file: min_3_of_3.tsv
processing rank statistic file: min_2_of_3.tsv
processing rank statistic file: min_1_of_3.tsv
min_1_of_3:
identified 1333 gene(s)
min_2_of_3:
identified 236 gene(s)
min_3_of_3:
identified 91 gene(s)
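
The internals of `QTL.find_genes` are not shown in this notebook, and neither is the unit of its `distance` argument. As an independent illustration of the general idea (report every gene that lies within a given distance of a candidate position), here is a minimal sketch. The function name `genes_near`, the gene coordinates and the choice of base pairs as the unit are all assumptions made for the example.

```python
def genes_near(position, genes, distance):
    """Return the IDs of genes whose [start, end] interval lies within
    `distance` bases of `position`; a gene containing the position counts
    as distance 0."""
    hits = []
    for gene_id, start, end in genes:
        if start - distance <= position <= end + distance:
            hits.append(gene_id)
    return hits


# toy genes on one scaffold (made up): (gene ID, start, end)
toy_genes = [
    ('geneA', 1000, 2000),
    ('geneB', 5000, 6000),
    ('geneC', 2040, 2500),
]
print(genes_near(2020, toy_genes, 50))  # prints ['geneA', 'geneC']
```

A position at 2020 is 20 bases downstream of `geneA` and 20 bases upstream of `geneC`, so both are reported, while `geneB` is too far away.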


In [9]:
QTL.annotate_genes(SNPs_to_genes=genes_per_analysis, annotations='/media/chrishah/STORAGE/DATA/Cichlids/reference_data/M_zebra/annotations/blast2GO/blast2GO/blast2go_table_20150630_0957.txt')

min_3_of_3
adding annoation for min_3_of_3
min_1_of_3
adding annoation for min_1_of_3
min_2_of_3
adding annoation for min_2_of_3


In [257]:
mkdir -p find_genes


In [10]:
QTL.write_candidates(SNPs_to_genes=genes_per_analysis, whitelist=genes_per_analysis.keys(), out_dir='./find_genes/')

min_1_of_3
writing to: ./find_genes/min_1_of_3.genes.annotated.tsv
min_2_of_3
writing to: ./find_genes/min_2_of_3.genes.annotated.tsv
min_3_of_3
writing to: ./find_genes/min_3_of_3.genes.annotated.tsv


Prepare lists of genes as txt files for enrichment analyses in Blast2GO.

In [11]:
%%bash


cat find_genes/min_1_of_3.genes.annotated.tsv | cut -f 4 | grep "gene" -v | sort -n | uniq > find_genes/min_1_of_3.genes.txt
cat find_genes/min_2_of_3.genes.annotated.tsv | cut -f 4 | grep "gene" -v | sort -n | uniq > find_genes/min_2_of_3.genes.txt
cat find_genes/min_3_of_3.genes.annotated.tsv | cut -f 4 | grep "gene" -v | sort -n | uniq > find_genes/min_3_of_3.genes.txt

Prepare non-redundant lists of genes for enrichment analyses.

In [8]:
%%bash

for a in $(ls -1 find_genes/*of_3.genes.txt)
do
    new=$(echo "$a" | sed 's/\.genes./.genes.nr./g')
    echo "$new"
    cat $a | grep "\.1$" | sort | uniq > $new
done

find_genes/min_1_of_3.genes.nr.txt
find_genes/min_2_of_3.genes.nr.txt
find_genes/min_3_of_3.genes.nr.txt
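
The `grep "\.1$"` filter above keeps only identifiers that end in `.1`. A minimal Python equivalent is sketched below. That the `.1` suffix marks one representative entry per gene is an assumption, and the function name `nonredundant` and the toy identifiers are hypothetical.

```python
def nonredundant(gene_ids, suffix=".1"):
    """Keep identifiers ending in `suffix`, without duplicates, sorted."""
    return sorted(set(g for g in gene_ids if g.endswith(suffix)))


# toy identifiers (made up)
toy_ids = ['g1.1', 'g1.2', 'g2.1', 'g1.1', 'g3.2']
print(nonredundant(toy_ids))  # prints ['g1.1', 'g2.1']
```

Note that a gene represented only by an identifier with a different suffix (`g3.2` in the toy list) drops out entirely, which is also the behavior of the shell filter.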


Create non-redundant list of full gene complement.

In [11]:
%%bash

cat /media/chrishah/STORAGE/DATA/Cichlids/functional_annotation/BLAST2GO/22_02_2015/full_blast_mapped_ips_annotated_annexed_EC.txt | \
cut -f 1 | grep "SeqName" -v | grep "\.1$" |sort | uniq > find_genes/genes.list.nr.txt


#Identify outliers in the Di_2 vs. Di_4 pairwise comparison

###*Stacks*

In [337]:
cd ..

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER


In [344]:
%%bash

stacks_dir=/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/3-populations/pairwise_single_SNP/Diplotaxodon/excl_singletons

for a in $(zcat $stacks_dir/Di_2-Di_4/r_0.8-p_2-w50kb-1M_bs/bootstrap_whitelist.txt.gz); do zcat $stacks_dir/Di_2-Di_4/r_0.8-p_2-w50kb-1M_bs/batch_1.fst_Di_2-Di_4.tsv.gz | grep -P "1\t$a\tDi"; done | \
perl -ne 'chomp; @a=split("\t"); if ($a[-2] < 0.00005){print "$a[1]\n"}' | sort | uniq > Stacks/Stacks.Di_2-Di_4.candidates.txt 

In [345]:
catalog='/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/2-cstacks/m5/n0/data/batch_1.catalog.tags.tsv.gz'
cat={}
by_scaffold = {}
count=0
out='Stacks/Stacks.Di_2-Di_4.candidates.locations.txt'

import gzip

FH = gzip.open(catalog,'rb')

for line in FH:
    cat[line.split("\t")[2]] = line.split("\t")[3:5]
    
FH.close()

IN=open('Stacks/Stacks.Di_2-Di_4.candidates.txt')
for line in IN:
    count+=1
    if line.strip() in cat.keys():
#        print line.strip(),cat[line.strip()]
#        print by_scaffold
        if not cat[line.strip()][0] in by_scaffold.keys():
            by_scaffold[cat[line.strip()][0]] = {}
#            print by_scaffold
            by_scaffold[cat[line.strip()][0]]['coordinate'] = [cat[line.strip()][1]]
#            print by_scaffold
            by_scaffold[cat[line.strip()][0]]['id'] = [line.strip()]
#            print by_scaffold
        else:
            by_scaffold[cat[line.strip()][0]]['coordinate'].append(cat[line.strip()][1])
            by_scaffold[cat[line.strip()][0]]['id'].append(line.strip())

#        print
#        print cat[line.strip()][0],by_scaffold[cat[line.strip()][0]]
        
IN.close()

print "Total candidates: %i" %count
print "on %i scaffolds" %len(by_scaffold)

OUT=open(out,'w')
for scf in sorted(by_scaffold):
    print scf, by_scaffold[scf]['coordinate'], by_scaffold[scf]['id']
    
    
    for i in range(len(by_scaffold[scf]['coordinate'])):
        
        OUT.write("%s\t%s\t%s\n" %(scf, by_scaffold[scf]['coordinate'][i], by_scaffold[scf]['id'][i]))
OUT.close()


Total candidates: 18
on 7 scaffolds
scaffold_212 ['552616', '558131', '553169'] ['19079', '19081', '52692']
scaffold_240 ['106994', '146332', '149815', '69072', '60090'] ['21681', '21682', '21685', '21745', '52400']
scaffold_31 ['5910726'] ['28090']
scaffold_39 ['1771946', '1815002', '1815005'] ['31651', '31654', '31655']
scaffold_61 ['81559', '113270', '113273'] ['40655', '53385', '60983']
scaffold_9 ['1045424'] ['49117']
scaffold_96 ['316998', '363044'] ['48627', '48629']


###*Bayescan*

In [291]:
!cat /media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/3-populations/pairwise_single_SNP/Diplotaxodon/excl_singletons/BAYESCAN_pairwise/Di_2-Di_4/Di_2-Di_4-10-FDR-0.05.outlier_stacks_ID.list > Bayescan/Bayescan.pairwise.Di_2-Di_4.candidates.txt

In [293]:
!cat Bayescan/Bayescan.pairwise.Di_2-Di_4.candidates.txt

6728
7014
30641
60983


In [296]:
catalog='/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/2-cstacks/m5/n0/data/batch_1.catalog.tags.tsv.gz'
cat={}
by_scaffold = {}
count=0
out='Bayescan/Bayescan.pairwise.Di_2-Di_4.candidates.locations.txt'

import gzip

FH = gzip.open(catalog,'rb')

for line in FH:
    cat[line.split("\t")[2]] = line.split("\t")[3:5]
    
FH.close()

IN=open('Bayescan/Bayescan.pairwise.Di_2-Di_4.candidates.txt')
for line in IN:
    if line.strip() in cat.keys():
#        print line.strip(),cat[line.strip()]
#        print by_scaffold
        if not cat[line.strip()][0] in by_scaffold.keys():
            by_scaffold[cat[line.strip()][0]] = {}
#            print by_scaffold
            by_scaffold[cat[line.strip()][0]]['coordinate'] = [cat[line.strip()][1]]
#            print by_scaffold
            by_scaffold[cat[line.strip()][0]]['id'] = [line.strip()]
#            print by_scaffold
        else:
            by_scaffold[cat[line.strip()][0]]['coordinate'].append(cat[line.strip()][1])
            by_scaffold[cat[line.strip()][0]]['id'].append(line.strip())

#        print
#        print cat[line.strip()][0],by_scaffold[cat[line.strip()][0]]
        
IN.close()


OUT=open(out,'w')
for scf in sorted(by_scaffold):
    print scf, by_scaffold[scf]['coordinate'], by_scaffold[scf]['id']
    
    
    for i in range(len(by_scaffold[scf]['coordinate'])):
        
        OUT.write("%s\t%s\t%s\n" %(scf, by_scaffold[scf]['coordinate'][i], by_scaffold[scf]['id'][i]))
OUT.close()


scaffold_12 ['3799856', '8415472'] ['6728', '7014']
scaffold_36 ['5276102'] ['30641']
scaffold_61 ['113273'] ['60983']


In [346]:
cd summarize/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/summarize


In [298]:
!mkdir pairwise.Di_2-Di_4

In [347]:
cd pairwise.Di_2-Di_4/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/summarize/pairwise.Di_2-Di_4


In [348]:
!cat ../../Stacks/Stacks.Di_2-Di_4.candidates.locations.txt ../../Bayescan/Bayescan.pairwise.Di_2-Di_4.candidates.locations.txt > Di_2-Di_4.pairwise.tsv

In [349]:
files = ['Di_2-Di_4.pairwise.tsv']

In [350]:
genes_per_analysis = QTL.find_genes(rank_stats = files, gff = gff_per_scaffold, distance = 50)

processing rank statistic file: Di_2-Di_4.pairwise.tsv
Di_2-Di_4.pairwise:
identified 115 gene(s)


In [351]:
QTL.annotate_genes(SNPs_to_genes=genes_per_analysis, annotations='/media/chrishah/STORAGE/DATA/Cichlids/reference_data/M_zebra/annotations/blast2GO/blast2GO/blast2go_table_20150630_0957.txt')

Di_2-Di_4.pairwise
adding annoation for Di_2-Di_4.pairwise


In [352]:
QTL.write_candidates(SNPs_to_genes=genes_per_analysis, whitelist=genes_per_analysis.keys(), out_dir='./')

Di_2-Di_4.pairwise
writing to: ./Di_2-Di_4.pairwise.genes.annotated.tsv


In [354]:
cd ..

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER


#Investigate Stacks pairwise

In [388]:
cd Stacks/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/Stacks


In [329]:
mkdir pairwise

In [389]:
cd pairwise/

/media/chrishah/STORAGE/RAD/popgen/Fst-outlier/Diplotaxodon_FOR_PAPER/Stacks/pairwise


In [393]:
%%bash

#specify the directory where populations had been run
stacks_dir=/media/chrishah/STORAGE/RAD/stacks/ALL/mapping/excl_PCR_dupl/BWA-8MM/M_zebra/3-populations/pairwise_single_SNP/Diplotaxodon/excl_singletons

#fetch all pairwise candidates (tags with p < 0.00005 in at least one comparison)
for b in $(ls -1 $stacks_dir | grep "Di"); do
    for a in $(zcat $stacks_dir/$b/r_0.8-p_2-w50kb-1M_bs/bootstrap_whitelist.txt.gz); do
        zcat $stacks_dir/$b/r_0.8-p_2-w50kb-1M_bs/batch_1.fst_$b.tsv.gz | grep -P "1\t$a\tDi"
    done | perl -ne 'chomp; @a=split("\t"); if ($a[-2] < 0.00005){print "$a[1]\n"}'
done | sort -n | uniq > Stacks.pairwise.candidates.txt

#extract selected columns of the Fst table for each candidate tag in every pairwise comparison
for b in $(ls -1 $stacks_dir | grep "Di"); do
    for a in $(cat Stacks.pairwise.candidates.txt); do
        zcat $stacks_dir/$b/r_0.8-p_2-w50kb/batch_1.fst_$b.tsv.gz | grep -P "1\t$a\tDi"
    done | perl -ne 'chomp; @a=split("\t"); print "$a[4]\t$a[5]\t$a[1]\t$a[8]\t$a[-5]\t$a[-2]\t$a[2],$a[3]\n"'
done | sort -n > Stacks.pairwise.Fsts.txt
