# Explore usage of PyBedtools for Marker Design I/O

See:

- http://bedtools.readthedocs.org/en/latest/
- http://pythonhosted.org/pybedtools/

### Explore usage with Fasta

This could be used to get the sequence of an amplicon:

http://pythonhosted.org/pybedtools/autodocs/pybedtools.bedtool.BedTool.seq.html#pybedtools.bedtool.BedTool.seq

In [97]:
import pybedtools
from pybedtools import BedTool

### Example from docs

In [38]:
a = BedTool("""
chr1 1 10
chr1 50 55""", from_string=True)
fasta = pybedtools.example_filename('test.fa')
a = a.sequence(fi=fasta)
print(open(a.seqfn).read())

>chr1:1-10
GATGAGTCT
>chr1:50-55
CCATC



### Do same with original test data

In [2]:
!grep '>' /Users/cfljam/Documents/galaxy-pcr-markers/test-data/targets.fasta

>k69_93535
>k69_98089


In [90]:
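# fn is not defined in the cells shown; it is assumed here to be the targets FASTA grepped above
fn = '/Users/cfljam/Documents/galaxy-pcr-markers/test-data/targets.fasta'
b = BedTool("k69_93535 25 255", from_string=True)
b = b.sequence(fi=fn)
print(b)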

k69_93535	25	255



In [96]:
b.seq?

In [91]:
open(b.seqfn).read()

'>k69_93535:25-255\nAGTTTCTCGCCCTCTCCACAGCGCTTTGAAGCGCTTTTAATGGCAGCAGATGCCGCCTCAGATCTCCCAAACCGACCCATTCCTAACACTCTCAGAAGATCCGATTCCAATTCCGTTCTACTGAACAAATACGAGCTGGGCAAGCTCCTCGGCCATGGAAATTTCGCCAAGGTTTACCTCGCCCGCAACCTCGCCTCCAACGAAGAAGTCGCTATCAAAGTCTTCGACAA\n'

We will be working with design globals accessible as a dict.

In design mode we would like to:

- load up a FASTA
- load up a VCF
- create dicts for a given target from these, holding:
    - just the sub-sequence of the FASTA, i.e. target length +/- maximum product size as slop
    - just the variants in this region
    - the target interval, which will be the length of the VCF record's ref feature
    - the mask, which will be all regions but the target
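The list above can be sketched in plain Python, without pybedtools, to pin down the coordinate arithmetic. Everything named here (`build_target_context`, the dict keys, the tuple layout for variants) is hypothetical and only for illustration. The mask is read literally as the two flanks of the window outside the target; if the intent is instead to mask the non-target variants, it would be built from `variants` rather than from the flanks.

```python
def build_target_context(seq, variants, target_start, target_end, max_product_size):
    """Collect the per-target pieces into one dict (hypothetical layout).

    seq              -- full sequence of the contig (str)
    variants         -- list of (pos0, ref, alt), pos0 being 0-based
    target_start/end -- 0-based, half-open target interval on the contig
    max_product_size -- slop added on each side of the target
    """
    # Window = target +/- slop, clipped to the contig ends.
    win_start = max(0, target_start - max_product_size)
    win_end = min(len(seq), target_end + max_product_size)
    sub_seq = seq[win_start:win_end]

    # Keep only variants inside the window, in window coordinates.
    local_variants = [
        (pos0 - win_start, ref, alt)
        for pos0, ref, alt in variants
        if win_start <= pos0 < win_end
    ]

    # Target interval in window coordinates.
    target = (target_start - win_start, target_end - win_start)

    # Mask = everything in the window except the target (empty flanks dropped).
    flanks = [(0, target[0]), (target[1], len(sub_seq))]
    mask = [(s, e) for s, e in flanks if e > s]

    return {
        "sequence": sub_seq,
        "variants": local_variants,
        "target": target,
        "mask": mask,
    }


# Toy contig and variants, made up for this example only.
contig = "ACGT" * 10
ctx = build_target_context(
    contig, [(2, "G", "T"), (20, "A", "C"), (23, "T", "G")], 20, 21, 5
)
print(ctx)
```

Working in window coordinates keeps the dict self-contained, so it could be handed to a primer design step without reference to the original contig.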

Try using test data from galaxy-pcr-markers and Hong Yang

Kiwifruit_pseudomolecule.fa.gz (needs to be bgzipped, not gzipped, and indexed with tabix)

Try with specific target in vcf file
```
Chr1	10082	.	A	C
```

(This record has no name. How do we handle naming? Assume that in many contexts these will have a name.)
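One possible way to handle the naming question, offered only as an assumption: use the VCF ID column when it is present and fall back to a name built from contig and position when it is the missing-value marker `.`. The function name `target_name` and the `chrom_pos` pattern are hypothetical.

```python
def target_name(chrom, pos, vcf_id="."):
    """Return a name for a target (hypothetical convention).

    Uses the VCF ID if one is given; otherwise builds 'chrom_pos'
    from the contig name and the 1-based VCF position.
    """
    if vcf_id and vcf_id != ".":
        return vcf_id
    return "{}_{}".format(chrom, pos)


print(target_name("Chr1", 10082, "."))
```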

**NB** The following works fine with an uncompressed FASTA but fails with a bgzipped one.

In [156]:
my_target = BedTool("Chr1 9000 10000", from_string=True)
HY_ref = '/Users/cfljam/Documents/Kiwifruit_pseudomolecule.fa'
my_target = my_target.sequence(fi=HY_ref)
open(my_target.seqfn).read()

'>Chr1:9000-10000\nTATAACCTGACTAACCATGAACCTGGGTAGAATTCCACTCCTCCACCAAATTTTTTAACTTAACCAAGCATAAATGAATCTGAACGGTTTTAGCCCCCAATTTATATCCTCGTCCACTAACAATAGAAGGCAATGATCAGACAAGTTTCTGTGTAGACCCTACAAAATCAATTTTGGGAAATTCTGTCATTCTCTATCAATAAGAACTCTATCCAACTTGCTCCAGCTTTCATTTTCTTGAAGATTCAACCAAGTGAATTTCCTAACACATAGTGGTAAGTCAACCAAATCCATAAAATGGATGAAATTGTTGAATTCCCTCGTGGCACTGGTGACCCTTGTACTCCCCTTTCTTTTTCCTATTCTTCTTATCTCATTAAGAAGCCTGTATTTAGTGTTACGTGGGGAATCCAAAAAATTGTTTCGGAAGAAGCTAGCGTCACACTACTTAAATGACATGATCTAGTGAGTCTAATCAGCTTCACAGGGCTGAAGCCCTTGCTTCTAGATGCTGTTTTATTTATTTTTCGCACTGTGCTATCTTCCATTTCTGATATCATTTGGTTTGTTGCAAAATCATTATCTAACCAATGTATATCGTACTGTCATCTAAATTACCATAAGAAAAGAAATGTAGAGATGCTTCATTAGATAGTATTGTAGAAAATAAACAAATCTCTTTGGAAGTTAATTTAGAAAGTAACTGAGGGAGTTTGGTCATTCTTGCAGTCAGGGTATGCTTTGTTTTTTACCTGGCTCAGGTGATGCCATTAGCGCAACGCTTGGAAAAGAATTTTGTGTTTGAATCTCTTAATTTTGTGCTGAGGTTATTATTGGTGAATAGCAATCTTTTACGGAAGTACTCTTCCAGATGAGGTGCCAAAAAGTGGAAAATTATAAAGGTACACCTCGAGGTTTATTTTTGCAAGGTGAAAAAATTCTTTGTTTCTCCTTGTATTTCTACAGGTTTTCTCTCTTAGT

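**NOTE** VCF positions are 1-based, whereas BED intervals (and pybedtools) are 0-based and half-open.

So the bedtools interval for our SNP at VCF position 10082 is 10081-10082. The cell below adds one base of flank on each side (10080-10083), so the reference allele A should be the middle base.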
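The conversion can be written down as a small helper, so the off-by-one lives in one place. This is a minimal sketch; `vcf_to_bed_interval` and its `flank` parameter are hypothetical names, not part of pybedtools.

```python
def vcf_to_bed_interval(chrom, pos, ref, flank=0):
    """Convert a 1-based VCF position to a 0-based, half-open BED interval.

    pos   -- 1-based POS from the VCF record
    ref   -- REF allele; its length sets the interval length
    flank -- optional number of bases to add on each side
    """
    start = pos - 1 - flank            # 1-based -> 0-based start
    end = pos - 1 + len(ref) + flank   # half-open end
    return chrom, max(0, start), end


print(vcf_to_bed_interval("Chr1", 10082, "A"))
print(vcf_to_bed_interval("Chr1", 10082, "A", flank=1))
```

With `flank=1` this gives the 10080-10083 interval used for the SNP check.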

In [157]:
my_snp = BedTool("Chr1 10080 10083", from_string=True).sequence(fi=HY_ref)
open(my_snp.seqfn).read()

'>Chr1:10080-10083\nGAT\n'

So we could get the sequence string for Primer3 by reading the second line of this temporary file.
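A minimal sketch of that step, assuming the temporary file holds ordinary FASTA text: take the first record and join its sequence lines, which also covers the case where the sequence is wrapped over several lines. The helper name `first_sequence` is hypothetical.

```python
def first_sequence(fasta_text):
    """Return the sequence of the first record in a FASTA string."""
    seq_lines = []
    seen_header = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if seen_header:
                break              # second record starts: stop
            seen_header = True
        elif seen_header:
            seq_lines.append(line.strip())
    return "".join(seq_lines)


# The string is the output shown above for the SNP interval.
print(first_sequence(">Chr1:10080-10083\nGAT\n"))
# With pybedtools this would be called as, e.g.:
#   first_sequence(open(my_snp.seqfn).read())
```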

### Try with bedtools

In [146]:
%%bash
cat /Users/cfljam/Documents/test.bed
bedtools getfasta -fi /Users/cfljam/Documents/Kiwifruit_pseudomolecule.fa \
                    -bed /Users/cfljam/Documents/test.bed \
                    -fo test.fasta
head test.fasta

Chr1	9000	10000
>Chr1:9000-10000
TATAACCTGACTAACCATGAACCTGGGTAGAATTCCACTCCTCCACCAAATTTTTTAACTTAACCAAGCATAAATGAATCTGAACGGTTTTAGCCCCCAATTTATATCCTCGTCCACTAACAATAGAAGGCAATGATCAGACAAGTTTCTGTGTAGACCCTACAAAATCAATTTTGGGAAATTCTGTCATTCTCTATCAATAAGAACTCTATCCAACTTGCTCCAGCTTTCATTTTCTTGAAGATTCAACCAAGTGAATTTCCTAACACATAGTGGTAAGTCAACCAAATCCATAAAATGGATGAAATTGTTGAATTCCCTCGTGGCACTGGTGACCCTTGTACTCCCCTTTCTTTTTCCTATTCTTCTTATCTCATTAAGAAGCCTGTATTTAGTGTTACGTGGGGAATCCAAAAAATTGTTTCGGAAGAAGCTAGCGTCACACTACTTAAATGACATGATCTAGTGAGTCTAATCAGCTTCACAGGGCTGAAGCCCTTGCTTCTAGATGCTGTTTTATTTATTTTTCGCACTGTGCTATCTTCCATTTCTGATATCATTTGGTTTGTTGCAAAATCATTATCTAACCAATGTATATCGTACTGTCATCTAAATTACCATAAGAAAAGAAATGTAGAGATGCTTCATTAGATAGTATTGTAGAAAATAAACAAATCTCTTTGGAAGTTAATTTAGAAAGTAACTGAGGGAGTTTGGTCATTCTTGCAGTCAGGGTATGCTTTGTTTTTTACCTGGCTCAGGTGATGCCATTAGCGCAACGCTTGGAAAAGAATTTTGTGTTTGAATCTCTTAATTTTGTGCTGAGGTTATTATTGGTGAATAGCAATCTTTTACGGAAGTACTCTTCCAGATGAGGTGCCAAAAAGTGGAAAATTATAAAGGTACACCTCGAGGTTTATTTTTGCAAGGTGAAAAAATTCTTTGTTTCTCCTTGTATTTCTACAGG