# Explore usage of PyBedtools for Marker Design I/O

See 

- http://bedtools.readthedocs.org/en/latest/
- http://pythonhosted.org/pybedtools/


BedTool objects can be tabix-indexed; see http://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.tabix.html#pybedtools.bedtool.BedTool.tabix

It might be simpler to slice sequences out of the FASTA file using pyfaidx.

-------

### Explore usage with Fasta

This could possibly be used to get the sequence of an amplicon:

http://pythonhosted.org/pybedtools/autodocs/pybedtools.bedtool.BedTool.seq.html#pybedtools.bedtool.BedTool.seq

Also consider other pure-Python tools such as pyfaidx: https://pypi.python.org/pypi/pyfaidx

In [78]:
!pip freeze | grep pybedtools

pybedtools==0.7.6


In [79]:
!conda info -e

Using Anaconda Cloud api site https://api.anaconda.org
# conda environments:
#
_build                   /Users/johnmccallum/miniconda3/envs/_build
py2PCR                   /Users/johnmccallum/miniconda3/envs/py2PCR
py3-r-env                /Users/johnmccallum/miniconda3/envs/py3-r-env
py3markers            *  /Users/johnmccallum/miniconda3/envs/py3markers
root                     /Users/johnmccallum/miniconda3



In [2]:
from pybedtools import BedTool

In [3]:
from pybedtools import example_filename

### Example from docs

In [4]:
a = BedTool("""
... chr1 1 10
... chr1 50 55""", from_string=True)
fasta = example_filename('test.fa')
a = a.sequence(fi=fasta)
print(open(a.seqfn).read())

>chr1:1-10
GATGAGTCT
>chr1:50-55
CCATC



**NOTE**: VCF positions are 1-based, whereas BED intervals and Python slices are 0-based.

### Do the same with the original test data
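To make the coordinate note concrete, here is a minimal sketch of turning a 1-based VCF position into a 0-based, half-open BED interval. The helper name `vcf_pos_to_bed` is hypothetical and not part of pybedtools.

```python
def vcf_pos_to_bed(chrom, pos, ref_len=1):
    """Convert a 1-based VCF position to a 0-based, half-open BED interval.

    `vcf_pos_to_bed` is an illustrative helper, not a pybedtools function.
    """
    if pos < 1:
        raise ValueError("VCF positions are 1-based, so pos must be >= 1")
    start = pos - 1        # BED start is 0-based
    end = start + ref_len  # BED end is exclusive
    return chrom, start, end


bed_interval = vcf_pos_to_bed("k69_93535", 1147)
print(bed_interval)
```

A SNP reported at position 1147 therefore becomes the BED interval 1146 to 1147, which is the interval used for the target further down.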

In [5]:
ls ../test/test-data/

384um_251453690362217.txt  targets                    targets.fasta              targets.fasta.fai          targets.gff


#### Need an indexed genome file

In [6]:
!samtools faidx ../test/test-data/targets.fasta

In [7]:
cat ../test/test-data/targets.fasta.fai

k69_93535	1628	11	60	61
k69_98089	749	1678	60	61
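The `.fai` index is tab-delimited, with the sequence name and its length in the first two columns. As a rough sketch, the lengths can be read into a dict with plain Python; `parse_fai` is a hypothetical helper and the text is inlined here so the example runs without the test data.

```python
def parse_fai(text):
    """Map sequence name -> length from the text of a samtools .fai index.

    `parse_fai` is an illustrative helper; only the first two columns
    (name, length) are used.
    """
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Same content as targets.fasta.fai shown above
fai_text = "k69_93535\t1628\t11\t60\t61\nk69_98089\t749\t1678\t60\t61\n"
contig_lengths = parse_fai(fai_text)
print(contig_lengths)
```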


In [8]:
cat ../test/test-data/targets

k69_93535:SAMTOOLS:SNP:1147
k69_93535:SAMTOOLS:SNP:1336
k69_98089:SAMTOOLS:SNP:30
k69_98089:SAMTOOLS:SNP:550
k69_98089:SAMTOOLS:SNP:625


In [9]:
PRIMER_PRODUCT_SIZE_RANGE=[60,120]

In [10]:
PRIMER_PRODUCT_SIZE_RANGE[1]

120

### Make a target bed

In [11]:
b=BedTool("k69_93535 1146 1147",from_string=True)
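The BED string above was typed by hand. A sketch of deriving it from a target ID follows, assuming the IDs always have the form `contig:source:type:position` with a 1-based position, as in the `targets` file. The function name `target_to_bed_line` is hypothetical.

```python
def target_to_bed_line(target_id):
    """Build a 'chrom start end' BED string from a target ID.

    Assumes IDs look like 'k69_93535:SAMTOOLS:SNP:1147', where the last
    field is a 1-based position. Illustrative helper only.
    """
    chrom, _source, _feature_type, pos = target_id.rsplit(":", 3)
    pos = int(pos)
    # 1-based position -> 0-based, half-open interval of length 1
    return "{} {} {}".format(chrom, pos - 1, pos)


bed_line = target_to_bed_line("k69_93535:SAMTOOLS:SNP:1147")
print(bed_line)
```

The resulting string can then be passed to `BedTool(..., from_string=True)` in the same way as the hand-written one.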

### Check the properties of this interval and how to access them

See https://pythonhosted.org/pybedtools/intervals.html

In [12]:
b[0].chrom

'k69_93535'

In [13]:
b[0].start

1146

In [14]:
b[0].length

1

### Attach a sequence to it

In [15]:
b=b.sequence(fi='../test/test-data/targets.fasta')

In [16]:
print(b)

k69_93535	1146	1147



In [17]:
c=BedTool('../test/test-data/targets.gff')

In [18]:
print(c.intersect(b))

k69_93535	SAMTOOLS	SNP	1147	1147	999	.	.	ID=k69_93535:SAMTOOLS:SNP:1147;Variant_seq=G;Reference_seq=C;DP=2645;VDB=0.0371;AF1=0.3527;G3=0.2771,0.7229,6.934e-153;HWE=0.0248;AC1=8;DP4=733,804,447,519;MQ=42;FQ=999;PV4=0.51,0,0.027,1



### Can use subtraction to exclude our target

Here we get the features in our design window and then remove the target to create the list of excluded regions.

In [19]:
max_product_size=PRIMER_PRODUCT_SIZE_RANGE[1]
target_exclude=c.intersect(b.slop(b=max_product_size,g='../test/test-data/targets.fasta.fai')).subtract(b)
print(target_exclude)

k69_93535	SAMTOOLS	SNP	1141	1142	999	.	.	ID=k69_93535:SAMTOOLS:SNP:1141;Variant_seq=T;Reference_seq=C;DP=2644;VDB=0.0374;AF1=0.1882;AC1=5;DP4=748,786,225,294;MQ=42;FQ=999;PV4=0.037,0,0.036,0.39
k69_93535	SAMTOOLS	SNP	1147	1148	999	.	.	ID=k69_93535:SAMTOOLS:SNP:1147;Variant_seq=G;Reference_seq=C;DP=2645;VDB=0.0371;AF1=0.3527;G3=0.2771,0.7229,6.934e-153;HWE=0.0248;AC1=8;DP4=733,804,447,519;MQ=42;FQ=999;PV4=0.51,0,0.027,1



In [20]:
print(b.slop(b=max_product_size,g='../test/test-data/targets.fasta.fai'))

k69_93535	1026	1267
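The arithmetic behind this window can be sketched in plain Python: the interval is extended by the flank on both sides and clamped to the contig bounds, which is why a genome or `.fai` file is needed. This is an illustration of the idea with a hypothetical `slop_interval` helper, not the bedtools implementation.

```python
def slop_interval(start, end, flank, chrom_length):
    """Extend a 0-based, half-open interval by `flank` on both sides,
    clamped to [0, chrom_length]. Illustrative helper only."""
    new_start = max(0, start - flank)
    new_end = min(chrom_length, end + flank)
    return new_start, new_end


# Target at 1146-1147 on k69_93535 (length 1628), flank of 120
window = slop_interval(1146, 1147, 120, 1628)
print(window)

# A target near the start of a contig is clamped at 0
clamped = slop_interval(29, 30, 120, 749)
print(clamped)
```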



### Use getfasta to extract the window of sequence

- `slop` needs the genome file or the `.fai` index

In [21]:
target_window=b.slop(b=max_product_size,g='../test/test-data/targets.fasta.fai')

In [22]:
print(target_window)

k69_93535	1026	1267



In [23]:
target_window=target_window.sequence(fi='../test/test-data/targets.fasta')

In [24]:
open(target_window.seqfn).read()

'>k69_93535:1026-1267\nAGATGAATCAGACTCTTCAGTTGCTTCCTGCCCTCCTACACTTAATGAAGGAAAGAAAAAAAGGACAGGGAAGCTTCATAGGCCTTTGAGTCTGAACGCATTTGACATAATTTCCTTTTCCAGAGGATTTGATCTTTCAGGTTTGTTTGAAGAAACGGGAGATGAAACAAGATTTGTGTCGGGTGAAACGATACCAAACATCATATCGAAATTGGAGGAGATTGCAAAAGTGGGTAGTTTC\n'

In [25]:
fo=open(target_window.seqfn)

In [26]:
fo.readlines()[1].strip('\n')

'AGATGAATCAGACTCTTCAGTTGCTTCCTGCCCTCCTACACTTAATGAAGGAAAGAAAAAAAGGACAGGGAAGCTTCATAGGCCTTTGAGTCTGAACGCATTTGACATAATTTCCTTTTCCAGAGGATTTGATCTTTCAGGTTTGTTTGAAGAAACGGGAGATGAAACAAGATTTGTGTCGGGTGAAACGATACCAAACATCATATCGAAATTGGAGGAGATTGCAAAAGTGGGTAGTTTC'
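Taking `readlines()[1]` works here because the sequence is written on a single line. As a hedged alternative, the sketch below parses FASTA text into a dict and also copes with wrapped sequence lines; `read_fasta_text` is a hypothetical helper and the input is a shortened, made-up example.

```python
def read_fasta_text(text):
    """Parse FASTA-formatted text into {header: sequence}.

    Handles sequences wrapped over several lines. Illustrative helper only.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Shortened example with the sequence wrapped over two lines
example = ">k69_93535:1026-1267\nAGATGAATCA\nGACTCTTCAG\n"
seqs = read_fasta_text(example)
print(seqs)
```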

### Do the same with pyfaidx

In [27]:
from pyfaidx import Fasta
target_seqs=Fasta('../test/test-data/targets.fasta')
target_seqs.keys()

odict_keys(['k69_93535', 'k69_98089'])

In [28]:
target_seqs['k69_93535'][250:350].seq

'GACAAAGAGAAAATCCTCAAATCCGGCCTCGTCAACCACACCAAACGCGAGATCTCAATCCTCCGCCGTCTTCGTCATCCGAACGTCGTCGAGCTCTTCG'

This is a much nicer way to do it.

### HOWTO nudge all the annotations relative to the reference slice

We should be able to use bedtools [shift](http://bedtools.readthedocs.org/en/latest/content/tools/shift.html?highlight=shift),
but that doesn't seem to be implemented in pybedtools 0.7.6.

In [29]:
print(c.intersect(target_window))

k69_93535	SAMTOOLS	SNP	1141	1142	999	.	.	ID=k69_93535:SAMTOOLS:SNP:1141;Variant_seq=T;Reference_seq=C;DP=2644;VDB=0.0374;AF1=0.1882;AC1=5;DP4=748,786,225,294;MQ=42;FQ=999;PV4=0.037,0,0.036,0.39
k69_93535	SAMTOOLS	SNP	1147	1148	999	.	.	ID=k69_93535:SAMTOOLS:SNP:1147;Variant_seq=G;Reference_seq=C;DP=2645;VDB=0.0371;AF1=0.3527;G3=0.2771,0.7229,6.934e-153;HWE=0.0248;AC1=8;DP4=733,804,447,519;MQ=42;FQ=999;PV4=0.51,0,0.027,1



#### Just need to get tuples in the form (start, length) for primer3-py

*Why are the SNP intervals 2 bp long?*

In [54]:
print(target_window)

k69_93535	1026	1267



In [31]:
[(X.start,X.length) for X in c.intersect(target_window)]

[(1140, 2), (1146, 2)]
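A likely explanation for the 2 bp lengths: the GFF records shown above run from 1141 to 1142 and from 1147 to 1148, and GFF coordinates are 1-based and inclusive. The sketch below reproduces the arithmetic under the assumption that the start is reported 0-based and the length is `end - start`; `gff_to_start_length` is a hypothetical helper.

```python
def gff_to_start_length(gff_start, gff_end):
    """Convert 1-based inclusive GFF coordinates to (0-based start, length).

    Illustrative helper only; it mirrors the values printed above.
    """
    start = gff_start - 1  # 1-based inclusive -> 0-based
    end = gff_end          # inclusive 1-based end == exclusive 0-based end
    return start, end - start


pairs = [gff_to_start_length(1141, 1142), gff_to_start_length(1147, 1148)]
print(pairs)

# A record with identical start and end would be a 1 bp feature
single = gff_to_start_length(1147, 1147)
print(single)
```

If that reading is right, the SNP records in this GFF span two positions each, so a true 1 bp SNP would need identical start and end columns.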

In [32]:
offset=target_window[0].start
offset

1026

### Can adjust annotations for design like so

In [33]:
[(X.start - offset,X.length) for X in c.intersect(target_window)]

[(114, 2), (120, 2)]
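The same offset adjustment can be wrapped in a small function that also clips features to the window, so that nothing falls outside the template. This is a sketch with plain tuples rather than pybedtools intervals, and `to_window_coords` is a hypothetical name.

```python
def to_window_coords(features, window_start, window_end):
    """Shift (start, length) features into window-relative coordinates.

    Features are clipped to the window and dropped if they fall entirely
    outside it. Illustrative helper only.
    """
    local = []
    for start, length in features:
        end = start + length
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_start < clipped_end:
            local.append((clipped_start - window_start,
                          clipped_end - clipped_start))
    return local


shifted = to_window_coords([(1140, 2), (1146, 2)], 1026, 1267)
print(shifted)
```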

### Could create a design target dict like so

In [62]:
target_dict={'SEQUENCE_ID': b[0].chrom + "_" + str(b[0].start) }

In [63]:
target_dict['SEQUENCE_TARGET']=[b[0].start - offset,b[0].length]

In [64]:
target_dict['SEQUENCE_EXCLUDED_REGION']=[(X.start - offset,X.length) for X in c.intersect(target_window) - b]

In [65]:
target_seqs[target_window[0].chrom][target_window[0].start:target_window[0].end].seq

'AGATGAATCAGACTCTTCAGTTGCTTCCTGCCCTCCTACACTTAATGAAGGAAAGAAAAAAAGGACAGGGAAGCTTCATAGGCCTTTGAGTCTGAACGCATTTGACATAATTTCCTTTTCCAGAGGATTTGATCTTTCAGGTTTGTTTGAAGAAACGGGAGATGAAACAAGATTTGTGTCGGGTGAAACGATACCAAACATCATATCGAAATTGGAGGAGATTGCAAAAGTGGGTAGTTTC'

In [66]:
target_dict['SEQUENCE_TEMPLATE']=target_seqs[target_window[0].chrom][target_window[0].start:target_window[0].end].seq

In [73]:
target_dict

{'SEQUENCE_EXCLUDED_REGION': [(114, 2)],
 'SEQUENCE_ID': 'k69_93535_1146',
 'SEQUENCE_TARGET': [120, 1],
 'SEQUENCE_TEMPLATE': 'AGATGAATCAGACTCTTCAGTTGCTTCCTGCCCTCCTACACTTAATGAAGGAAAGAAAAAAAGGACAGGGAAGCTTCATAGGCCTTTGAGTCTGAACGCATTTGACATAATTTCCTTTTCCAGAGGATTTGATCTTTCAGGTTTGTTTGAAGAAACGGGAGATGAAACAAGATTTGTGTCGGGTGAAACGATACCAAACATCATATCGAAATTGGAGGAGATTGCAAAAGTGGGTAGTTTC'}
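For bulk design, the steps above could be collected into one function. The sketch below takes plain values instead of pybedtools objects so it runs on its own; `build_target_dict` and its parameters are hypothetical, and the template in the example is a placeholder rather than the real sequence.

```python
def build_target_dict(chrom, target_start, target_length,
                      window_start, template, excluded):
    """Assemble a primer3 sequence-argument dict in window coordinates.

    `excluded` is a list of (start, length) tuples in contig coordinates.
    Illustrative helper only.
    """
    local_target = target_start - window_start
    if not 0 <= local_target < len(template):
        raise ValueError("target lies outside the template window")
    return {
        "SEQUENCE_ID": "{}_{}".format(chrom, target_start),
        "SEQUENCE_TEMPLATE": template,
        "SEQUENCE_TARGET": [local_target, target_length],
        "SEQUENCE_EXCLUDED_REGION": [
            (start - window_start, length) for start, length in excluded
        ],
    }


# Placeholder template of the same length as the 1026-1267 window
demo = build_target_dict("k69_93535", 1146, 1, 1026, "A" * 241, [(1140, 2)])
print(demo["SEQUENCE_ID"], demo["SEQUENCE_TARGET"],
      demo["SEQUENCE_EXCLUDED_REGION"])
```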

In [74]:
import primer3 as P3

In [75]:
p3_test_globals={
        'PRIMER_OPT_SIZE': 20,
        'PRIMER_PICK_INTERNAL_OLIGO': 0,
        'PRIMER_INTERNAL_MAX_SELF_END': 8,
        'PRIMER_MIN_SIZE': 18,
        'PRIMER_MAX_SIZE': 25,
        'PRIMER_OPT_TM': 60.0,
        'PRIMER_MIN_TM': 57.0,
        'PRIMER_MAX_TM': 63.0,
        'PRIMER_MIN_GC': 20.0,
        'PRIMER_MAX_GC': 80.0,
        'PRIMER_MAX_POLY_X': 100,
        'PRIMER_INTERNAL_MAX_POLY_X': 100,
        'PRIMER_SALT_MONOVALENT': 50.0,
        'PRIMER_DNA_CONC': 50.0,
        'PRIMER_MAX_NS_ACCEPTED': 0,
        'PRIMER_MAX_SELF_ANY': 12,
        'PRIMER_MAX_SELF_END': 8,
        'PRIMER_PAIR_MAX_COMPL_ANY': 12,
        'PRIMER_PAIR_MAX_COMPL_END': 8,
        'PRIMER_PRODUCT_SIZE_RANGE': [[75,100]],
    }

In [76]:
P3.designPrimers(target_dict,p3_test_globals)

{'PRIMER_INTERNAL_EXPLAIN': 'considered 11816, unacceptable product size 11808, ok 8',
 'PRIMER_INTERNAL_NUM_RETURNED': 0,
 'PRIMER_LEFT_0': (64, 20),
 'PRIMER_LEFT_0_END_STABILITY': 5.19,
 'PRIMER_LEFT_0_GC_PERCENT': 55.0,
 'PRIMER_LEFT_0_HAIRPIN_TH': 0.0,
 'PRIMER_LEFT_0_PENALTY': 0.5485098850288068,
 'PRIMER_LEFT_0_SELF_ANY_TH': 14.281185332787231,
 'PRIMER_LEFT_0_SELF_END_TH': 7.548278344245659,
 'PRIMER_LEFT_0_SEQUENCE': 'ACAGGGAAGCTTCATAGGCC',
 'PRIMER_LEFT_0_TM': 59.45149011497119,
 'PRIMER_LEFT_1': (64, 20),
 'PRIMER_LEFT_1_END_STABILITY': 5.19,
 'PRIMER_LEFT_1_GC_PERCENT': 55.0,
 'PRIMER_LEFT_1_HAIRPIN_TH': 0.0,
 'PRIMER_LEFT_1_PENALTY': 0.5485098850288068,
 'PRIMER_LEFT_1_SELF_ANY_TH': 14.281185332787231,
 'PRIMER_LEFT_1_SELF_END_TH': 7.548278344245659,
 'PRIMER_LEFT_1_SEQUENCE': 'ACAGGGAAGCTTCATAGGCC',
 'PRIMER_LEFT_1_TM': 59.45149011497119,
 'PRIMER_LEFT_2': (79, 20),
 'PRIMER_LEFT_2_END_STABILITY': 4.84,
 'PRIMER_LEFT_2_GC_PERCENT': 55.0,
 'PRIMER_LEFT_2_HAIRPIN_TH': 0.0,
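The result is a flat dict with one key per primer attribute, so some post-processing is useful. A minimal sketch for pulling out the left primer sequences in rank order follows; `left_primer_sequences` is a hypothetical helper and the sample dict is a cut-down copy of the output above.

```python
import re


def left_primer_sequences(results):
    """Return left primer sequences ordered by their primer3 index.

    Looks for keys of the form PRIMER_LEFT_<n>_SEQUENCE.
    Illustrative helper only.
    """
    pattern = re.compile(r"^PRIMER_LEFT_(\d+)_SEQUENCE$")
    found = {}
    for key, value in results.items():
        match = pattern.match(key)
        if match:
            found[int(match.group(1))] = value
    return [found[index] for index in sorted(found)]


# Cut-down copy of the primer3 output shown above
sample = {
    "PRIMER_LEFT_0": (64, 20),
    "PRIMER_LEFT_0_SEQUENCE": "ACAGGGAAGCTTCATAGGCC",
    "PRIMER_LEFT_0_TM": 59.45149011497119,
    "PRIMER_LEFT_1": (64, 20),
    "PRIMER_LEFT_1_SEQUENCE": "ACAGGGAAGCTTCATAGGCC",
}
left_seqs = left_primer_sequences(sample)
print(left_seqs)
```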


In [77]:
!gister -d 'How to use pybedtools to drive bulk design' 2016-01-24PyBedToolsforMarkerDesign.ipynb

https://gist.github.com/3f0147e858c40eb95433


In [68]:
!gister -e https://gist.github.com/3b24004be19d1f55ad25 2016-01-24PyBedToolsforMarkerDesign.ipynb

https://gist.github.com/3b24004be19d1f55ad25
