# Example of how to use the new design program            
Dec 2018  
Gabriel Besombes  
#### We will use:
* reference file : "/output/genomic/plant/Actinidia/chinensis/CK51F3_01/Genome/Assembly/PS1/1.68.5/AllChromosomes/PS1.1.68.5.fasta"
* annotation file : "/output/genomic/plant/Actinidia/chinensis/Resequencing/Variants/PS1.1.68.5/52DiploidGenomes/Combined_diploidCK_basic_NS30_Q50_SAFR3_DP50_PAIR0.8_PS1.1.68.5_ann.vcf.gz"

In [1]:
%load_ext autoreload
%autoreload 2

## Imports:
---

Import a Python file from another directory in Jupyter:

In [2]:
%run "../design.py"
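
%run executes the script and puts its names directly in the notebook namespace. If you would rather have a regular module object, a minimal sketch of the usual alternative is to add the directory to sys.path and import it. The helper name import_from_dir and the throwaway demo module are hypothetical and only there to make the example self-contained:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(directory, module_name):
    """Import `module_name` from `directory` by putting the directory on sys.path."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    importlib.invalidate_caches()  # make sure freshly created files are seen
    return importlib.import_module(module_name)


# Demonstration with a throwaway module (hypothetical name, not part of the tool)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_design_module.py").write_text("ANSWER = 42\n")
    demo = import_from_dir(tmp, "demo_design_module")
    demo_answer = demo.ANSWER
```

In this notebook the equivalent call would be import_from_dir("..", "design"), after which the class is reached as design.PrimerDesign instead of PrimerDesign.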

---

## Creating a PrimerDesign object:
---

Without a targets file and with default parameters:

In [3]:
d = PrimerDesign(reference_file="/output/genomic/plant/Actinidia/chinensis/CK51F3_01/Genome/Assembly/PS1/1.68.5/AllChromosomes/PS1.1.68.5.fasta",
                 annotation_file="/output/genomic/plant/Actinidia/chinensis/Resequencing/Variants/PS1.1.68.5/52DiploidGenomes/Combined_diploidCK_basic_NS30_Q50_SAFR3_DP50_PAIR0.8_PS1.1.68.5_ann.vcf.gz",
                 description="PS1.1.68.5",
                 targets_file=None,
                 output_dir=None,
                 amplicon_size_range=None,
                 primer_size_range=None,
                 p3_globals=None)

---

## Quick test design on a small region:
---

With auto you can run a quick design with default parameters.  
It will:
* Create targets in the specified region if no targets file was given to the PrimerDesign object, and write them to the output directory in a bed file called "targets.bed"
* Create a csv and a bed file for the amplicons, and the same for the primers

In [4]:
d.auto(region={"CHR1":[[0, 2500], [5000, 7500]], "CHR2":[[0, 10000]]})
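
The region argument is a dictionary that maps each chromosome name to a list of [start, end] intervals. As a rough sketch, a region can be sanity-checked before a long run like this. The helper validate_region is hypothetical (it is not part of design.py), it assumes start < end and non-negative coordinates, and it does not merge overlapping intervals:

```python
def validate_region(region):
    """Check a {chromosome: [[start, end], ...]} mapping and return the bases covered."""
    total = 0
    for chrom, intervals in region.items():
        if not isinstance(chrom, str):
            raise TypeError(f"chromosome name must be a string, got {chrom!r}")
        for start, end in intervals:
            if start < 0 or end <= start:
                raise ValueError(f"bad interval on {chrom}: [{start}, {end}]")
            total += end - start  # overlapping intervals would be counted twice
    return total


region = {"CHR1": [[0, 2500], [5000, 7500]], "CHR2": [[0, 10000]]}
total_bases = validate_region(region)  # 2500 + 2500 + 10000 = 15000
```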

In [5]:
d.p3_out.head()

Unnamed: 0,CHROMOSOME,PRIMER_INTERNAL_0,PRIMER_INTERNAL_0_GC_PERCENT,PRIMER_INTERNAL_0_HAIRPIN_TH,PRIMER_INTERNAL_0_PENALTY,PRIMER_INTERNAL_0_SELF_ANY_TH,PRIMER_INTERNAL_0_SELF_END_TH,PRIMER_INTERNAL_0_SEQUENCE,PRIMER_INTERNAL_0_TM,PRIMER_INTERNAL_1,...,PRIMER_RIGHT_1_GC_PERCENT,PRIMER_RIGHT_1_HAIRPIN_TH,PRIMER_RIGHT_1_PENALTY,PRIMER_RIGHT_1_SELF_ANY_TH,PRIMER_RIGHT_1_SELF_END_TH,PRIMER_RIGHT_1_SEQUENCE,PRIMER_RIGHT_1_TM,PRIMER_RIGHT_EXPLAIN,PRIMER_RIGHT_NUM_RETURNED,REF_OFFSET
0,CHR1,"(141, 22)",54.545455,0.0,2.660472,0.0,0.0,TCCAACGCCCACCATGAATGCA,59.339528,"(141, 22)",...,47.619048,32.4404,2.80025,0.0,0.0,AATTGACCCTGAAGCAGAAGG,58.19975,"considered 317, overlap excluded region 35, lo...",2,114
1,CHR1,,,,,,,,,,...,,,,,,,,"considered 165, low tm 3, high tm 86, ok 76",0,166
2,CHR1,"(152, 20)",70.0,33.236722,0.23011,0.0,0.0,CCGAGACTTGGCTGACCCGC,59.76989,"(152, 20)",...,40.909091,0.0,4.974645,0.0,0.0,TTCCTATGAACAGACAACAGGT,57.025355,"considered 507, GC content failed 85, low tm 3...",2,329
3,CHR1,"(152, 20)",70.0,33.236722,0.23011,0.0,0.0,CCGAGACTTGGCTGACCCGC,59.76989,"(152, 20)",...,40.909091,0.0,4.974645,0.0,0.0,TTCCTATGAACAGACAACAGGT,57.025355,"considered 57, low tm 43, high hairpin stabili...",2,329
4,CHR1,,,,,,,,,,...,,,,,,,,"considered 190, overlap excluded region 36, GC...",0,454
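
The PRIMER_*_EXPLAIN columns hold primer3's summary of how many candidate primers were considered and why some were rejected. A small sketch for turning such a string into a dictionary of counts follows. The function name parse_explain is hypothetical, and truncated entries (such as the "lo..." shown by head()) are simply skipped:

```python
def parse_explain(text):
    """Turn 'considered 165, low tm 3, ok 76' into {'considered': 165, 'low tm': 3, 'ok': 76}."""
    counts = {}
    for item in text.split(","):
        item = item.strip()
        # The count is the last whitespace-separated token of each item
        label, _, number = item.rpartition(" ")
        if label and number.isdigit():
            counts[label] = int(number)
    return counts


explain = parse_explain("considered 165, low tm 3, high tm 86, ok 76")
# {'considered': 165, 'low tm': 3, 'high tm': 86, 'ok': 76}
```

This makes it easy to see, for example, that the second row above returned no right primer even though 76 candidates were marked "ok".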


---

## Bed files analysis:
---

All of the bed files associated with the PrimerDesign object can be analysed using the data in the vcf annotation file:

In [6]:
d.amplicons.analyse(d.annotations._pyvcf_reader, d.reference, write="analysed.csv")

This will create a pandas dataframe that looks like this:

In [7]:
d.amplicons.analysed.head()

Unnamed: 0,CHR,START,END,SSRs,SSRs_variants,PI,min_call_rate
0,CHR2,75,297,"{'AAACCAA': [2], 'CCCTAAA': [3]}","{'CHR2:136-145': {'CCTAAGC': [2]}, 'CHR2:227-2...",4.801258,1.0
1,CHR2,74,297,"{'AAACCAA': [2], 'CCCTAAA': [3]}","{'CHR2:136-145': {'CCTAAGC': [2]}, 'CHR2:227-2...",4.801258,1.0
2,CHR2,203,413,"{'AT': [11], 'AGA': [6], 'GAA': [4]}","{'CHR2:227-236': {'AAACCTG': [3, 2]}}",1.36442,1.0
3,CHR2,203,412,"{'AT': [11], 'AGA': [6], 'GAA': [4]}","{'CHR2:227-236': {'AAACCTG': [3, 2]}}",1.36442,1.0
4,CHR1,159,414,"{'CA': [30, 24, 21, 12], 'AT': [24], 'TAT': [4...",{},0.145193,1.0
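
The SSRs column maps each repeat motif to a list of repeat counts. To illustrate the idea only, here is one common way to find perfect tandem repeats with a regular expression. This is a sketch, not the code used by design.py, whose counting rules may differ; the function name find_ssrs and its parameters are hypothetical:

```python
import re


def find_ssrs(sequence, min_motif=2, max_motif=10, min_repeats=2):
    """Return {motif: [repeat counts]} for perfect tandem repeats in `sequence`."""
    ssrs = {}
    for k in range(min_motif, max_motif + 1):
        # A motif of length k followed by at least (min_repeats - 1) more copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "ATAT" = "AT" x 2
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            ssrs.setdefault(motif, []).append(len(match.group(0)) // k)
    return ssrs


example_ssrs = find_ssrs("CCATATATATGGTTGTTGTTGC")
# {'AT': [4], 'GTT': [3]}
```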


You can then filter this:

In [8]:
d.amplicons.filter()

That will also create a pandas dataframe:

In [9]:
d.amplicons.filtered.head()

Unnamed: 0,CHR,START,END,SSRs,SSRs_variants,PI,min_call_rate
0,CHR2,75,297,"{'AAACCAA': [2], 'CCCTAAA': [3]}","{'CHR2:136-145': {'CCTAAGC': [2]}, 'CHR2:227-2...",4.801258,1.0
1,CHR2,74,297,"{'AAACCAA': [2], 'CCCTAAA': [3]}","{'CHR2:136-145': {'CCTAAGC': [2]}, 'CHR2:227-2...",4.801258,1.0
2,CHR2,203,413,"{'AT': [11], 'AGA': [6], 'GAA': [4]}","{'CHR2:227-236': {'AAACCTG': [3, 2]}}",1.36442,1.0
3,CHR2,203,412,"{'AT': [11], 'AGA': [6], 'GAA': [4]}","{'CHR2:227-236': {'AAACCTG': [3, 2]}}",1.36442,1.0
4,CHR2,203,456,"{'AT': [12], 'AGA': [8], 'GAA': [7], 'CCT': [2...","{'CHR2:227-236': {'AAACCTG': [3, 2]}}",1.401797,1.0


You can then create a new bed file from this filtered result and point the object to this new file:

In [10]:
d.amplicons.replace(name="filtered_targets.bed")

In [11]:
d.amplicons.name

'filtered_targets'
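
For reference, a bed file is plain tab-separated text with the chromosome, start and end in the first three columns. A minimal sketch of writing filtered rows in that layout is shown below. The helper write_bed is hypothetical and is not how replace() is implemented; the BED convention is 0-based, half-open coordinates, and whether the START and END columns above follow it is an assumption:

```python
import io


def write_bed(rows, handle):
    """Write (chromosome, start, end) tuples as tab-separated BED lines."""
    for chrom, start, end in rows:
        handle.write(f"{chrom}\t{start}\t{end}\n")


# Two rows taken from the filtered dataframe above
rows = [("CHR2", 75, 297), ("CHR2", 203, 413)]
buffer = io.StringIO()  # stands in for open("some_file.bed", "w")
write_bed(rows, buffer)
bed_text = buffer.getvalue()
```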

---

## More detailed and flexible design:
---

You can specify a target file directly during the object creation or afterwards like so:

In [13]:
d.targets = "targets.bed"

If you don't have a targets file you can create it like so:

In [14]:
d.gettargets(region={"CHR1":[[0, 2500], [5000, 7500]], "CHR11":[[0, 5000]]}, write="advanced_targets.bed")

And change the targets file used by the PrimerDesign object like this:

In [15]:
d.targets = "advanced_targets.bed"

You can then analyse and filter your targets on different criteria:
* SSR contents of the region
* SSR contents in the variants
* Nucleotide diversity
* Minimum call_rate in the region   
     
If any of these is set to False it will not be analysed.  
These results are available in a pandas dataframe and can be written to a csv file if necessary.

In [16]:
d.targets.analyse(d.annotations._pyvcf_reader,
                  d.reference,
                  SSRs=True,
                  SSRs_variants=True,
                  PI=True,
                  call_rate=True,
                  write="analysed_targets.csv")

In [17]:
d.targets.analysed.head()

Unnamed: 0,CHR,START,END,SSRs,SSRs_variants,PI,min_call_rate
0,CHR1,215,394,"{'CA': [22, 19, 10], 'TAT': [4], 'TTG': [3], '...",{},0.145193,1.0
1,CHR1,215,446,"{'CA': [26, 23, 14], 'TAT': [5], 'TTG': [4], '...",{},0.695238,1.0
2,CHR1,435,609,"{'CACTC': [2], 'TC': [14], 'GA': [10], 'TCT': ...",{},1.016173,1.0
3,CHR1,360,609,"{'CT': [24], 'CACTC': [2], 'TC': [14], 'GA': [...",{},1.123989,1.0
4,CHR1,532,734,"{'TCT': [6], 'TGT': [3], 'TATT': [3], 'CTA': [3]}",{},0.503504,1.0
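
To give an idea of what the PI and min_call_rate columns measure, here is a sketch based on the textbook definitions: nucleotide diversity as the unbiased expected heterozygosity at each variant site, summed over the region, and call rate as the fraction of samples with a called genotype. The function names are hypothetical, and how design.py actually computes or normalises these values is not shown in this notebook:

```python
def site_pi(genotypes):
    """Unbiased expected heterozygosity at one site from genotypes such as '0/1'."""
    alleles = [a for gt in genotypes
               for a in gt.replace("|", "/").split("/") if a != "."]
    n = len(alleles)
    if n < 2:
        return 0.0
    freqs = [alleles.count(a) / n for a in set(alleles)]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))


def call_rate(genotypes):
    """Fraction of samples whose genotype is fully called (no '.')."""
    return sum(1 for gt in genotypes if "." not in gt) / len(genotypes)


# Two made-up variant sites in one region, four samples each
sites = [["0/0", "0/1", "1/1", "./."], ["0/0", "0/0", "0/0", "0/0"]]
region_pi = sum(site_pi(s) for s in sites)
region_min_call_rate = min(call_rate(s) for s in sites)
```

With these made-up genotypes the first site contributes 0.6 to PI and has a call rate of 0.75, while the second site is monomorphic and fully called.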


The formats and defaults for the filtering are as follows (in the SSR dictionaries each key is a motif length and each value is the minimum number of repeats):
* SSRs in the region : SSRs = {2: 10, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}
* SSRs in the variants : SSRs_variants = {2: 10, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}
* Nucleotide diversity : PI = 0.5
* Minimum call_rate in the region : call_rate = 0.5

If any of these is set to False it will not be used for filtering.  
These results are available in a pandas dataframe and can be written to a csv file if necessary.

In [18]:
d.targets.filter(write="filtered_targets.csv")

You can see that only the targets with PI > 0.5, min_call_rate > 0.5, and SSRs (both in the region and in the variants) of length 2 repeated 10 or more times, or of length 3 to 10 repeated 2 or more times, were kept.
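
The filtering rule described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the implementation in design.py: the function names are hypothetical, the criteria are assumed to be combined with a logical AND, and the two example rows are abbreviated from the analysed amplicons shown earlier:

```python
# Default thresholds: motif length -> minimum number of repeats
DEFAULT_SSRS = {2: 10, **{k: 2 for k in range(3, 11)}}


def has_qualifying_ssr(ssrs, thresholds):
    """True if any motif repeats at least as often as the threshold for its length."""
    return any(max(counts) >= thresholds[len(motif)]
               for motif, counts in ssrs.items()
               if counts and len(motif) in thresholds)


def passes(row, ssrs=DEFAULT_SSRS, ssrs_variants=DEFAULT_SSRS, pi=0.5, call_rate=0.5):
    """Apply every criterion that is not set to False; all of them must hold."""
    checks = []
    if ssrs is not False:
        checks.append(has_qualifying_ssr(row["SSRs"], ssrs))
    if ssrs_variants is not False:
        # SSRs_variants is nested: {position: {motif: [counts]}}
        checks.append(any(has_qualifying_ssr(motifs, ssrs_variants)
                          for motifs in row["SSRs_variants"].values()))
    if pi is not False:
        checks.append(row["PI"] > pi)
    if call_rate is not False:
        checks.append(row["min_call_rate"] > call_rate)
    return all(checks)


kept = {"SSRs": {"AAACCAA": [2], "CCCTAAA": [3]},
        "SSRs_variants": {"CHR2:136-145": {"CCTAAGC": [2]}},
        "PI": 4.801258, "min_call_rate": 1.0}
dropped = {"SSRs": {"CA": [30, 24, 21, 12], "AT": [24]},
           "SSRs_variants": {}, "PI": 0.145193, "min_call_rate": 1.0}
```

Here passes(kept) is True, while passes(dropped) is False because the row has no SSRs in variants and a PI below 0.5. Switching those two criteria off with ssrs_variants=False and pi=False lets the second row through.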

In [19]:
d.targets.filtered.head()

Unnamed: 0,CHR,START,END,SSRs,SSRs_variants,PI,min_call_rate
0,CHR1,5658,5861,"{'GGGAGG': [2], 'AG': [12, 9], 'TCT': [5], 'GC...",{'CHR1:5843-5861': {'AGCG': [3]}},0.838994,1.0
1,CHR1,5623,5861,"{'GGGAGG': [2], 'AG': [12, 9], 'TCT': [5], 'GC...",{'CHR1:5843-5861': {'AGCG': [3]}},0.838994,1.0
2,CHR1,5773,5934,"{'CT': [6], 'CA': [15, 8], 'AT': [10, 7], 'AGC...","{'CHR1:5843-5861': {'AGCG': [3]}, 'CHR1:5914-5...",1.259659,0.962264
3,CHR1,5734,5934,"{'TCT': [5], 'GCT': [3], 'TG': [7], 'CT': [6],...","{'CHR1:5843-5861': {'AGCG': [3]}, 'CHR1:5914-5...",1.297035,0.962264
4,CHR1,5821,5989,"{'TA': [8], 'CA': [9], 'AT': [8], 'AGCG': [4],...","{'CHR1:5843-5861': {'AGCG': [3]}, 'CHR1:5914-5...",1.295597,0.962264


You can then change the targets file like this:

In [20]:
d.targets.replace("filtered_targets.bed")

You can check that it worked:

In [21]:
d.targets.file_name

'filtered_targets.bed'

#### You can now run the design on a different region, or without specifying a region:

In [22]:
d.run_p3()

This only stores the raw primer3 results in self.p3_out:

In [23]:
d.p3_out.head()

Unnamed: 0,CHROMOSOME,PRIMER_INTERNAL_0,PRIMER_INTERNAL_0_GC_PERCENT,PRIMER_INTERNAL_0_HAIRPIN_TH,PRIMER_INTERNAL_0_PENALTY,PRIMER_INTERNAL_0_SELF_ANY_TH,PRIMER_INTERNAL_0_SELF_END_TH,PRIMER_INTERNAL_0_SEQUENCE,PRIMER_INTERNAL_0_TM,PRIMER_INTERNAL_1,...,PRIMER_RIGHT_1_GC_PERCENT,PRIMER_RIGHT_1_HAIRPIN_TH,PRIMER_RIGHT_1_PENALTY,PRIMER_RIGHT_1_SELF_ANY_TH,PRIMER_RIGHT_1_SELF_END_TH,PRIMER_RIGHT_1_SEQUENCE,PRIMER_RIGHT_1_TM,PRIMER_RIGHT_EXPLAIN,PRIMER_RIGHT_NUM_RETURNED,REF_OFFSET
0,CHR1,"(258, 20)",65.0,41.005307,0.113248,6.191775,6.191775,TTGACAGCGAGCGAGCGAGC,60.113248,"(258, 20)",...,55.0,45.415575,0.677731,0.864186,0.0,TTCCAATGGAGCTTCTGCGG,60.677731,"considered 133, overlap excluded region 46, lo...",2,5581
1,CHR1,"(258, 20)",65.0,41.005307,0.113248,6.191775,6.191775,TTGACAGCGAGCGAGCGAGC,60.113248,"(258, 20)",...,55.0,40.049105,0.455005,13.838267,5.482251,GCTTCCAATGGAGCTTCTGC,59.544995,"considered 98, overlap excluded region 11, low...",2,5581
2,CHR1,"(185, 20)",65.0,41.005307,0.113248,6.191775,6.191775,TTGACAGCGAGCGAGCGAGC,60.113248,"(185, 20)",...,55.0,0.0,0.179029,0.0,0.0,ACACCAGTTACACCACCAGC,60.179029,"considered 475, overlap excluded region 25, lo...",2,5654
3,CHR1,"(185, 20)",65.0,41.005307,0.113248,6.191775,6.191775,TTGACAGCGAGCGAGCGAGC,60.113248,"(185, 20)",...,55.0,0.0,0.034896,0.0,0.0,ACCACCCGTCAACTCACATC,59.965104,"considered 241, overlap excluded region 25, lo...",2,5654
4,CHR1,"(130, 20)",65.0,41.005307,0.113248,6.191775,6.191775,TTGACAGCGAGCGAGCGAGC,60.113248,"(130, 20)",...,55.0,0.0,0.179032,0.0,0.0,CACCAGTTACACCACCAGCA,60.179032,"considered 433, overlap excluded region 25, lo...",2,5709


If you want a bit more than this you can use self.cleanoutput().

In [24]:
d.cleanoutput()

This creates pandas dataframes for the amplicons and the primers and exports them as csv and bed files:

In [25]:
d.amplicons_df.head()

Unnamed: 0,CHROMOSOME,END,PRIMER_INTERNAL,PRIMER_INTERNAL_EXPLAIN,PRIMER_INTERNAL_GC_PERCENT,PRIMER_INTERNAL_HAIRPIN_TH,PRIMER_INTERNAL_NUM_RETURNED,PRIMER_INTERNAL_PENALTY,PRIMER_INTERNAL_REGION,PRIMER_INTERNAL_SELF_ANY_TH,...,PRIMER_RIGHT_HAIRPIN_TH,PRIMER_RIGHT_NUM_RETURNED,PRIMER_RIGHT_PENALTY,PRIMER_RIGHT_REGION,PRIMER_RIGHT_SELF_ANY_TH,PRIMER_RIGHT_SELF_END_TH,PRIMER_RIGHT_SEQUENCE,PRIMER_RIGHT_TM,REF_OFFSET,START
21,CHR11,395,"(195, 20)","considered 3185, GC content failed 3, low tm 2...",70.0,41.637802,2.0,0.033108,CHR11:287-307,0.0,...,0.0,2.0,5.799681,CHR11:372-395,0.0,0.0,ACTCTACTACTCGAAGTTAGCAC,57.200319,92.0,128
20,CHR11,395,"(195, 20)","considered 3185, GC content failed 3, low tm 2...",70.0,41.637802,2.0,0.033108,CHR11:287-307,0.0,...,0.0,2.0,5.799681,CHR11:372-395,0.0,0.0,ACTCTACTACTCGAAGTTAGCAC,57.200319,92.0,127
3,CHR1,5894,"(258, 20)","considered 3005, GC content failed 5, low tm 2...",65.0,41.005307,2.0,0.113248,CHR1:5839-5859,6.191775,...,40.049105,2.0,0.455005,CHR1:5874-5894,13.838267,5.482251,GCTTCCAATGGAGCTTCTGC,59.544995,5581.0,5599
0,CHR1,5894,"(258, 20)","considered 3355, GC content failed 5, low tm 2...",65.0,41.005307,2.0,0.113248,CHR1:5839-5859,6.191775,...,40.049105,2.0,0.455005,CHR1:5874-5894,13.838267,5.482251,GCTTCCAATGGAGCTTCTGC,59.544995,5581.0,5634
2,CHR1,5894,"(258, 20)","considered 3005, GC content failed 5, low tm 2...",65.0,41.005307,2.0,0.113248,CHR1:5839-5859,6.191775,...,40.049105,2.0,0.455005,CHR1:5874-5894,13.838267,5.482251,GCTTCCAATGGAGCTTCTGC,59.544995,5581.0,5599


In [26]:
d.primers_df.head()

Unnamed: 0,CHROMOSOME,END,PRIMER,PRIMER_END_STABILITY,PRIMER_GC_PERCENT,PRIMER_HAIRPIN_TH,PRIMER_PENALTY,PRIMER_SELF_ANY_TH,PRIMER_SELF_END_TH,PRIMER_SEQUENCE,PRIMER_TM,REF_OFFSET,START
4,CHR1,5623,"(18, 24)",3.66,37.5,0.0,5.649557,0.0,0.0,TGAAACAAATTGACTTGGACACAG,58.350443,5581.0,5599
6,CHR1,5622,"(18, 23)",3.72,34.782609,0.0,5.651387,0.0,0.0,TGAAACAAATTGACTTGGACACA,57.348613,5581.0,5599
0,CHR1,5656,"(53, 22)",4.75,40.909091,0.0,3.754965,0.0,0.0,ACAGTCAATAGCAGAAAAGGCA,58.245035,5581.0,5634
14,CHR1,5704,"(29, 21)",3.71,52.380952,0.0,1.134189,0.0,0.0,TCGGATCGCTTGAGTAGAGGA,60.134189,5654.0,5683
12,CHR1,5705,"(30, 21)",3.85,57.142857,0.0,1.002251,0.0,0.0,CGGATCGCTTGAGTAGAGGAC,60.002251,5654.0,5684
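
In the rows shown above, the START and END columns are consistent with adding the primer3 (position, length) pair in the PRIMER column to REF_OFFSET. A small sketch of that arithmetic is given below. The helper primer_span is hypothetical, and the relationship is inferred from the displayed rows rather than from the design.py source:

```python
import ast


def primer_span(primer, ref_offset):
    """Convert a primer3 '(position, length)' pair plus a reference offset to (start, end)."""
    if isinstance(primer, str):
        primer = ast.literal_eval(primer)  # safely parse the "(18, 24)" string
    position, length = primer
    start = int(ref_offset) + position
    return start, start + length


span = primer_span("(18, 24)", 5581.0)  # first row of primers_df: (5599, 5623)
```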


You can analyse and filter these bed files through d.primers.analyse, d.amplicons.analyse, d.amplicons.filter, etc.

---