<a href="https://colab.research.google.com/github/AvantiShri/colab_notebooks/blob/master/domainadaptation/DomainAdaptationSimulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install simdna

Collecting simdna
  Downloading https://files.pythonhosted.org/packages/14/c6/dc6cc2e9ac09c85d5ec6d896c6c43c8dd5ef50bb9c14423e9290131dce27/simdna-0.4.3.2.tar.gz (634kB)
     |████████████████████████████████| 634kB 2.8MB/s 
Building wheels for collected packages: simdna
  Building wheel for simdna (setup.py) ... error
  ERROR: Failed building wheel for simdna
  Running setup.py clean for simdna
Failed to build simdna
Installing collected packages: simdna
    Running setup.py install for simdna ... done
Successfully installed simdna-0.4.3.2


The simulation is as follows:
- For each sequence, a background sequence of length `seqLength` is simulated with a 40% GC content.
- The simulator then iterates over the REST_known1 motif and the HNF4_known2 motif (the PWMs can be found at http://compbio.mit.edu/encode-motifs/).
- For each motif type, the number of instances to insert is determined as follows. With probability `zero_prob` (here 0.5), no instances of that motif are inserted. Otherwise (with probability `1-zero_prob`), the number of instances is sampled from a Poisson distribution with a mean of `mean_motifs`. If the sampled number lies below `min_motifs` or above `max_motifs`, it is resampled until it falls within the min/max range. (Aside: this is why you occasionally see the message `warning: made (num) tries at trying to sample from distribution with min/max limits`.)
- Once the number of instances is determined, they are embedded in the background sequence. Each instance is sampled from the corresponding PWM and embedded uniformly at random at an unoccupied position in the background sequence. "Unoccupied" means that the instance does not overlap any other embedded motif; this is why you see warnings to the effect of `made (num) attemps at trying to embed (motif instance) in region of length (length) with (num) occupied sites`. The instance is reverse-complemented with probability `rc_prob`, specified at the command line.
- A total of `numSeqs` sequences are simulated.
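
The count-sampling step described above can be sketched in a few lines of plain Python. This is only an illustration of the described procedure (a zero-inflated Poisson with rejection sampling to enforce the min/max limits), not simdna's actual implementation. The function names `sample_poisson` and `sample_motif_count` are hypothetical, and the Poisson sampler uses Knuth's multiplication method as a standard-library stand-in for a library sampler.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw from a Poisson distribution using Knuth's multiplication method.

    Hypothetical helper: a stdlib stand-in for a library Poisson sampler.
    """
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_motif_count(rng, zero_prob=0.5, mean_motifs=1,
                       min_motifs=1, max_motifs=3):
    """Sample how many instances of one motif to insert (hypothetical helper)."""
    # With probability zero_prob, insert no instances of this motif.
    if rng.random() < zero_prob:
        return 0
    # Otherwise resample from the Poisson until the draw is within the limits.
    while True:
        count = sample_poisson(mean_motifs, rng)
        if min_motifs <= count <= max_motifs:
            return count


rng = random.Random(0)
counts = [sample_motif_count(rng) for _ in range(20000)]
zero_fraction = counts.count(0) / len(counts)
nonzero = [c for c in counts if c > 0]
one_fraction = nonzero.count(1) / len(nonzero)
print(zero_fraction, one_fraction)
```

With the settings used here (`mean_motifs` of 1, limits of 1 and 3), about half of the draws should be zero, and a nonzero draw should equal 1 about 60% of the time, since the Poisson weights for 1, 2, and 3 are in the ratio 6:3:1.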

In [2]:
!densityMotifSimulation.py --prefix sourcedomain --motifNames REST_known1 HNF4_known2 --rc-prob 0.5  --mean-motifs 1 --max-motifs 3 --min-motifs 1 --zero-prob 0.5 --seqLength 200 --numSeqs 50000



In [3]:
!head -10 DensityEmbedding_prefix-sourcedomain_motifs-REST_known1+HNF4_known2_min-1_max-3_mean-1_zeroProb-0p5_seqLength-200_numSeqs-50000.simdata

seqName	sequence	embeddings
sourcedomain-synth0	CTACAAACCGCGATAGCGAATTCGTTAAATACTGGCGTATTTATACTAAAAGAACAGAGGACGCAGGGTGTAATCATCTTTGTCATCTGGATCGCATGAGCGTCTTCAGCACCAAGGTCAGAAGCGGTAAGTGATATCTTGAAAAAATGAATGAATAGTTACGTATTCAGCACCACGGACAGCGACAGGTAATTTTAAGT	pos-104_REST_known1-TTCAGCACCAAGGTCAGAAGC,pos-165_REST_known1-TTCAGCACCACGGACAGCGAC
sourcedomain-synth1	TCTGTGGAGTTTGTCGCCCCCGCTGACCTTTGAATGTGTGTGCCACATATTGGACCAGCTTAAACTAATCGAAGCAGGGATAGTACTTGTTTTAGTGTATTAAGCTCTTCATGGGCTCTGTCCGTGGTGCTGAAATCAGATAAGTCTGGTTATCTAAACAGTGTCTCTGAAAACAATCTCCCCTCTATAGAGATTTTAAA	pos-113_revComp-REST_known1-GGCTCTGTCCGTGGTGCTGAA
sourcedomain-synth2	AGCGTCACTTTTAAACAGCTATAGCAAGCGAGTTGTTGAAACCTATTTGGCACCTCGCGGCTGTCCTTGGTGCTGATGGCTGACTTGGGAATATTAGAGGTCACTTAAAACAAGTAAATTGCTCTTCATAACCATAGTAGTGGATTCTATACCATCGCAAGTCCAAAGTTCAAGTAGATTATATGCTTTAATTTGTTCTT	pos-56_revComp-REST_known1-GCGGCTGTCCTTGGTGCTGAT,pos-158_revComp-HNF4_known2-AAGTCCAAAGTTCA,pos-89_revComp-HNF4_known2-AATATTAGAGGTCA
sourcedomain-synth3	ATTCTCGCGATAAAATGTGTG
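
Judging from the output above, each entry in the `embeddings` column appears to follow the pattern `pos-<start>_[revComp-]<motif name>-<instance>`, with entries separated by commas. The sketch below parses that pattern into named fields. The format is inferred from the displayed rows only, and `Embedding` and `parse_embeddings` are hypothetical names; the notebook itself relies on `simdna.synthetic.read_simdata_file` to read the file.

```python
from typing import List, NamedTuple


class Embedding(NamedTuple):
    position: int
    motif_name: str
    reverse_complemented: bool
    instance: str


def parse_embeddings(column: str) -> List[Embedding]:
    """Parse one embeddings column (format inferred from the output above)."""
    parsed = []
    for entry in column.split(","):
        if not entry:
            continue  # a sequence with no embedded motifs has an empty column
        position_part, description = entry.split("_", 1)
        position = int(position_part[len("pos-"):])
        reverse_complemented = description.startswith("revComp-")
        if reverse_complemented:
            description = description[len("revComp-"):]
        # The motif name may itself contain underscores, so split on the last "-".
        motif_name, instance = description.rsplit("-", 1)
        parsed.append(Embedding(position, motif_name, reverse_complemented, instance))
    return parsed


example = ("pos-56_revComp-REST_known1-GCGGCTGTCCTTGGTGCTGAT,"
           "pos-158_revComp-HNF4_known2-AAGTCCAAAGTTCA")
for embedding in parse_embeddings(example):
    print(embedding)
```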

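Label each sequence as follows: 0 if no REST motif has been embedded; 1 if BOTH REST and HNF4 have been embedded; and, if only REST motifs are present, 1 with probability sigmoid(number of REST motifs) and 0 otherwise.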

In [0]:
import simdna.synthetic
data = simdna.synthetic.read_simdata_file("DensityEmbedding_prefix-sourcedomain_motifs-REST_known1+HNF4_known2_min-1_max-3_mean-1_zeroProb-0p5_seqLength-200_numSeqs-50000.simdata")  

In [0]:
import numpy as np
from scipy.special import expit  # expit is the logistic sigmoid

rng = np.random.RandomState(1)

# Assign a label to each sequence and write the sequence/label pairs to disk
labels = []
with open("sourcedomain_sequences_and_labels.txt", "w") as outf:
    for sequence, embeddings in zip(data.sequences, data.embeddings):
        num_rest = sum('REST' in x.what.stringDescription for x in embeddings)
        num_hnf = sum('HNF4' in x.what.stringDescription for x in embeddings)
        if num_rest == 0:
            # all sequences that do not have a REST motif get a label of 0
            label = 0
        elif num_hnf > 0:
            # the sequence has BOTH REST and HNF4, so the label is 1
            label = 1
        else:
            # ONLY REST motifs (num_rest > 0, num_hnf == 0): the label is 1
            # with probability sigmoid(num_rest)
            label = int(rng.uniform() < expit(num_rest))
        outf.write(sequence + "\t" + str(label) + "\n")
        labels.append(label)

In [6]:
import numpy as np
np.mean(labels)

0.44876
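
As a sanity check, the expected fraction of positive labels can be worked out from the simulation settings and the labeling rule in the cell above. The sketch below assumes that the REST and HNF4 counts are drawn independently, that every sampled motif instance is successfully embedded, and that the count distribution is a Poisson with mean 1 truncated to the range 1 to 3. The helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def truncated_poisson_pmf(mean, lo, hi):
    """Poisson probabilities renormalized over the range lo..hi (inclusive)."""
    weights = {k: math.exp(-mean) * mean ** k / math.factorial(k)
               for k in range(lo, hi + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


zero_prob = 0.5
count_pmf = truncated_poisson_pmf(mean=1, lo=1, hi=3)

p_rest = 1 - zero_prob  # probability that at least one REST motif is present
p_hnf = 1 - zero_prob   # probability that at least one HNF4 motif is present

# Probability of label 1 when only REST is present, averaged over REST counts.
p_one_given_rest_only = sum(p * sigmoid(k) for k, p in count_pmf.items())

expected_label_mean = p_rest * (p_hnf + (1 - p_hnf) * p_one_given_rest_only)
print(round(expected_label_mean, 4))  # about 0.4495
```

Under these assumptions the expected mean is close to the observed value of 0.44876, which suggests the labels behave as intended.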

In [7]:
!head sourcedomain_sequences_and_labels.txt

CTACAAACCGCGATAGCGAATTCGTTAAATACTGGCGTATTTATACTAAAAGAACAGAGGACGCAGGGTGTAATCATCTTTGTCATCTGGATCGCATGAGCGTCTTCAGCACCAAGGTCAGAAGCGGTAAGTGATATCTTGAAAAAATGAATGAATAGTTACGTATTCAGCACCACGGACAGCGACAGGTAATTTTAAGT	1
TCTGTGGAGTTTGTCGCCCCCGCTGACCTTTGAATGTGTGTGCCACATATTGGACCAGCTTAAACTAATCGAAGCAGGGATAGTACTTGTTTTAGTGTATTAAGCTCTTCATGGGCTCTGTCCGTGGTGCTGAAATCAGATAAGTCTGGTTATCTAAACAGTGTCTCTGAAAACAATCTCCCCTCTATAGAGATTTTAAA	1
AGCGTCACTTTTAAACAGCTATAGCAAGCGAGTTGTTGAAACCTATTTGGCACCTCGCGGCTGTCCTTGGTGCTGATGGCTGACTTGGGAATATTAGAGGTCACTTAAAACAAGTAAATTGCTCTTCATAACCATAGTAGTGGATTCTATACCATCGCAAGTCCAAAGTTCAAGTAGATTATATGCTTTAATTTGTTCTT	1
ATTCTCGCGATAAAATGTGTGTCGATTTAGCACCATGGTCAGACCCGCTTAGTCTAGGAACTCTACATGGAGACCCAGTGGAAGTACCCACATTCGAGATGGACTTGAGGCCGTTGTATAACAGAAATGCATCTCGAGCACCCTCTGTCATATGCTAGAGAGTTCTTCGACATCCTGCAACAAGTAGCTAACTATAAACC	1
TACGGATTTAGCGTCGCTCTTTCAACAGGCTGAAACAGGGCTCATGTTAAACTACATTACATTGTCTCATGATTGTAGTAATGGAATCATATTCGGGTATGAAGAAATGTCGGGATTTGACCCTTGCACCTCAGAATACATGATGCGACCAGGATTTTACCGCTGGGATTTAACCTGTCAACATGGCA

Simulate the target domain, but with GATA_disc1 as the cofactor motif, rather than HNF4_known2

In [8]:
!densityMotifSimulation.py --prefix targetdomain --motifNames REST_known1 GATA_disc1 --rc-prob 0.5  --mean-motifs 1 --max-motifs 3 --min-motifs 1 --zero-prob 0.5 --seqLength 200 --numSeqs 50000



In [9]:
!head DensityEmbedding_prefix-targetdomain_motifs-REST_known1+GATA_disc1_min-1_max-3_mean-1_zeroProb-0p5_seqLength-200_numSeqs-50000.simdata

seqName	sequence	embeddings
targetdomain-synth0	CTACAAACCGCGATAGCGAATTCGTTAAATACTGGCGTATTTATACTAAAAGAACAGAGGACGCAGGGTGTAATCATCTTTGTCATCTGGATCGCATGAGCGTCTTCAGCACCAAGGTCAGAAGCGGTAAGTGATATCTTGAAAAAATGAATGAATAGTTACGTATTCAGCACCACGGACAGCGACAGGTAATTTTAAGT	pos-104_REST_known1-TTCAGCACCAAGGTCAGAAGC,pos-165_REST_known1-TTCAGCACCACGGACAGCGAC
targetdomain-synth1	TCTGTGGAGTTTGTCGCCCCCGCTGACCTTTGAATGTGTGTGCCACATATTGGACCAGCTTAAACTAATCGAAGCAGGGATAGTACTTGTTTTAGTGTATTAAGCTCTTCATGGGCTCTGTCCGTGGTGCTGAAATCAGATAAGTCTGGTTATCTAAACAGTGTCTCTGAAAACAATCTCCCCTCTATAGAGATTTTAAA	pos-113_revComp-REST_known1-GGCTCTGTCCGTGGTGCTGAA
targetdomain-synth2	AGCGTCACTTTTAAACAGCTATAGCAAGCGAGTTGTTGAAACCTATTTGGCACCTCGCGGCTGTCCTTGGTGCTGATGGCTGACTTGGGAGGCTATATCTCATCTTAAACCTTATCTCGTGCTCTTCATAACCATAGTAGTGGATTCTATACCATCGCAAAGACATATAGATAGTAGATTATATGCTTCCTTATCTGATT	pos-56_revComp-REST_known1-GCGGCTGTCCTTGGTGCTGAT,pos-109_revComp-GATA_disc1-CCTTATCTCG,pos-188_revComp-GATA_disc1-CCTTATCTGA
targetdomain-synth3	ACTATTCATCATTCTCGCGATAAAATGTGT