<a href="https://colab.research.google.com/github/PStettler/Schneider-Group-DCBP/blob/main/RNAiResistantGeneGenerator_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the RNAi Resistant Gene Generator of the Schneider Group!

This is a randomised RNAi-resistant gene generator that takes the codon usage of T. brucei into account. This version (version 2) replaces every codon with a different synonymous codon wherever one exists, and chooses the replacement according to the codon frequencies in T. brucei. Codons without a synonym (ATG and TGG) are kept. As a result, the most common T. brucei codons will likely be underrepresented in the output sequence, because they are the ones most often present in the input and therefore replaced. The advantage is a more divergent output gene: usually around 38% of all nucleotides differ from the input sequence.
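
The codon choice described above can be sketched as follows. This is a minimal illustration, not the tool's own code: `pick.replacement` is a hypothetical helper, and the leucine fractions are the ones from the Kazusa codon usage table used further down in this notebook.

```r
# Illustrative sketch only; pick.replacement is a hypothetical helper.
# Leucine codons and their T. brucei usage fractions.
leu.codons    = c("TTA","TTG","CTT","CTC","CTA","CTG")
leu.fractions = c(0.10,0.21,0.24,0.17,0.09,0.20)

pick.replacement = function(codon, codons, fractions){
  keep = codons != codon                 #exclude the original codon
  if(!any(keep)) return(codon)           #no synonym available (e.g. ATG, TGG)
  probs = fractions[keep]/sum(fractions[keep]) #renormalise so probabilities sum to 1
  sample(codons[keep], 1, prob = probs)  #weighted draw among the remaining codons
}

set.seed(1)
pick.replacement("CTT", leu.codons, leu.fractions) #returns some leucine codon other than CTT
pick.replacement("ATG", "ATG", 1)                  #returns "ATG" unchanged
```

Because the original codon is removed before sampling, a frequent codon such as CTT can never be drawn at a position where it already occurs in the input, which is where the bias against common codons comes from.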

In [1]:
sequence = 
  #Please replace the example DNA sequence below with your sequence of interest. Use 
  #only DNA characters (GATC) in capital letters. 
  #Now go to Runtime and select "Run all". The results will be displayed below 
  #this box. Have fun.
  "
      ATGATTCCTCCGGCAAAATTAAATGATTTCTTCAACATCGTCGACGACTTCCTGAAGAAAACTTTCAGAGATGAGTCGTT
      TCTATTCGCTGCAGTCAAGTCCAGGCAGTATTCCGAGTCTTTTCCGGGTGAACAACTCTTCTTCACCTCCCCGCCTTCGG
      CGACGACTGACCAGCCCTCGGATGAGAGCCTTCAAGCAGGAGGAGGCGAGTTGAGGAAAACGCAAAATTTTATGTACTTC
      TCGCCCCGTCTAATTTTCAACAGAGACGGAGGGTACCATGGGAAGGTGAAGTTGCACAGTGGTGTTCACGTGCCGCAGGT
      GTGTCGGTTTGAGCAAGGGATTGCTGTGAACAGCGAGGGGTTGGCATCGGGTAGCGTCAAGGTTAGTGATCTTCTTGAGG
      GAATGGAGGTAAAAGGGCGTCTCGCAGTAAACACAATTGCTCCTCCGTCAAAAGATGTGTGGTCCGTTGCAATGGATTAC
      CAGCGTCGTGACTTCTATTCCACACTTAACTATCAGAGGAACGGATTGGGAAGCAGTGACTTGCTTGTTGATTGTGGCAC
      GAAATTTTTTAATCTTCTTGCGGGTGCCGGATTTGAGCGGCAGAAAGTTTCGTTTCTTGAGCAACAGGATCACACCGCGC
      AGTTGGACGTGCTTTATGCGGGGGTTGGCTTTACTGGGGTGAATTGGTCCGTTGGTGCAAAGTTAGTGAGAGCGAACGAT
      ATGTGGAGCGCGGCACGCATTGCTTTTTATCAGCGCGTGGTGCCCGACACATCGGTGGCCTGCGCATACAATTTCGATAT
      GGAGGAGTCACGCGTACATGTCTCGCTGGGATTTTCACAGGGTTTTCGGCTTCGCGTCCCCACGATTCTGCAGCAGCGGG
      CGTGTGAGCAGTTGGACGTATGGACCGCCATTCTCCCGTTTGTGGGAGCATTCAAGGTGGAGAGCGGTGGCTTATGTGCG
      GCAACCATTCGTGGCATATTTAATGGCGTGGTGCATTGGGGTTTGGTGGCGCAGAAGAACGTGTTGGTGGAGAACAGTCC
      CATCCGTTTCGGTCTCACTCTTTCCGTGGAATCCGGTTGA
  
  "

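Before running the generator it can help to check the input by hand. The sketch below applies the same pattern that the generator uses for its own check, but with base R `gsub`/`grepl` instead of stringr so that it runs on its own; `is.valid.orf` is a hypothetical name used only for this illustration.

```r
# Illustrative sketch only; is.valid.orf is a hypothetical helper.
# Strips every character that is not A, T, G or C, then checks for an ATG
# start, whole codons, and a terminal stop codon.
is.valid.orf = function(seq){
  seq = gsub("[^ATGC]", "", seq)
  grepl("^ATG([ATGC]{3})*((TGA)|(TAA)|(TAG))$", seq)
}

is.valid.orf("ATG AAA\nTGA") #TRUE: whitespace and line breaks are ignored
is.valid.orf("ATGAATGA")     #FALSE: length is not a multiple of three
```

Note that lowercase bases are not converted but removed together with all other non-ATGC characters, so a lowercase sequence would be silently shortened rather than rejected outright.
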
In [4]:
#packages
require(stringr)

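#functions
#sequence.misidentities returns the fraction of positions at which two
#sequences of equal length carry different nucleotides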
sequence.misidentities = function(seq1,seq2){
  seq1 = paste(seq1,sep="",collapse="")
  seq1 = str_split(seq1,"")[[1]]
  seq1.A = which(seq1=="A")
  seq1.T = which(seq1=="T")
  seq1.G = which(seq1=="G")
  seq1.C = which(seq1=="C")
  
  seq2 = paste(seq2,sep="",collapse="")
  seq2 = str_split(seq2,"")[[1]]
  seq2.A = which(seq2=="A")
  seq2.T = which(seq2=="T")
  seq2.G = which(seq2=="G")
  seq2.C = which(seq2=="C")
  
  both.A = seq1.A[seq1.A%in%seq2.A] #A in both sequences
  both.T = seq1.T[seq1.T%in%seq2.T] #T in both sequences
  both.C = seq1.C[seq1.C%in%seq2.C] #C in both sequences
  both.G = seq1.G[seq1.G%in%seq2.G] #G in both sequences
  
  all.identicals = sum(c(length(both.A),length(both.T),length(both.C),length(both.G)))
  score = 1-all.identicals/length(seq1) #fraction of different nucleotides
  return(score)
}
###############################################################################################################

#This RNAi-resistant gene generator always tries to replace
#a given codon with a different synonymous codon, chosen according
#to T. brucei codon frequencies. This introduces a bias: the most
#frequent codons are likely to be present in the input sequence and
#are therefore removed from the output.
#The algorithm usually produces genes that differ at ~38% of positions.

RNAiResGeneGenerator2 = function(seq = sequence)
{
    seq = str_replace_all(seq,"[^ATGC]","")
    seq = paste(seq,sep="",collapse="")

    #test if sequence was given correctly
    if(!str_detect(seq,"^ATG([ATGC]{3})*((TGA)|(TAA)|(TAG))$")){
      cat("The given sequence cannot be used. Please check the sequence and \n",
          "provide a sequence of DNA characters only (ATGC), starting with ATG and ending \n",
          "with a stop codon (TAA, TAG, or TGA).\n")
      return(invisible(NULL)) #exit the function; break() is only valid inside a loop
    }
  
  #get codons and unique codons
  seq.tripplets = str_extract_all(seq,"[ATGC]{3}")[[1]]
  seq.tripplets.unique = unique(seq.tripplets)
  
  { #load codon table
    codons.1 = c(rep("T",16),rep("C",16),rep("A",16),rep("G",16))
    codons.2 = rep(c(rep("T",4),rep("C",4),rep("A",4),rep("G",4)),4)
    codons.3 = rep(c("T","C","A","G"),16)
    one.letter.not = c("F","F","L","L","S","S","S","S","Y","Y","*","*","C","C","*","W",
                       "L","L","L","L","P","P","P","P","H","H","Q","Q","R","R","R","R",
                       "I","I","I","M","T","T","T","T","N","N","K","K","S","S","R","R",
                       "V","V","V","V","A","A","A","A","D","D","E","E","G","G","G","G")
    codons = cbind(paste(codons.1,codons.2,codons.3,sep=""),one.letter.not)
  }#load codon table end
  
  { #codon usage table of T. brucei from kazusa.org
    # UUU F 0.56 20.5 ( 53927)  UCU S 0.16 12.6 ( 33149)  UAU Y 0.44 11.3 ( 29639)  UGU C 0.49 10.9 ( 28666)
    # UUC F 0.44 15.9 ( 41800)  UCC S 0.16 12.5 ( 32896)  UAC Y 0.56 14.1 ( 37043)  UGC C 0.51 11.3 ( 29707)
    # UUA L 0.10  9.8 ( 25683)  UCA S 0.17 13.5 ( 35511)  UAA * 0.36  0.7 (  1874)  UGA * 0.36  0.7 (  1875)
    # UUG L 0.21 19.5 ( 51383)  UCG S 0.15 11.5 ( 30350)  UAG * 0.27  0.5 (  1389)  UGG W 1.00 10.9 ( 28786)
    # 
    # CUU L 0.24 22.3 ( 58573)  CCU P 0.23 11.1 ( 29130)  CAU H 0.47 11.3 ( 29758)  CGU R 0.23 15.9 ( 41800)
    # CUC L 0.17 15.6 ( 40909)  CCC P 0.23 11.1 ( 29151)  CAC H 0.53 13.0 ( 34152)  CGC R 0.21 14.5 ( 38107)
    # CUA L 0.09  8.3 ( 21743)  CCA P 0.29 13.9 ( 36504)  CAA Q 0.45 16.9 ( 44506)  CGA R 0.13  9.0 ( 23569)
    # CUG L 0.20 18.5 ( 48568)  CCG P 0.25 11.8 ( 30918)  CAG Q 0.55 21.0 ( 55275)  CGG R 0.18 12.1 ( 31803)
    # 
    # AUU I 0.47 19.0 ( 49922)  ACU T 0.23 13.0 ( 34212)  AAU N 0.49 18.2 ( 47864)  AGU S 0.19 15.1 ( 39666)
    # AUC I 0.29 11.6 ( 30419)  ACC T 0.21 12.1 ( 31844)  AAC N 0.51 19.2 ( 50557)  AGC S 0.17 13.6 ( 35753)
    # AUA I 0.25 10.0 ( 26313)  ACA T 0.30 17.5 ( 45955)  AAA K 0.44 20.8 ( 54602)  AGA R 0.10  6.8 ( 17992)
    # AUG M 1.00 23.4 ( 61545)  ACG T 0.26 14.8 ( 39063)  AAG K 0.56 26.6 ( 69988)  AGG R 0.15 10.0 ( 26246)
    # 
    # GUU V 0.30 22.9 ( 60206)  GCU A 0.25 20.8 ( 54709)  GAU D 0.55 28.1 ( 73880)  GGU G 0.34 22.6 ( 59534)
    # GUC V 0.15 11.4 ( 29932)  GCC A 0.22 18.3 ( 48021)  GAC D 0.45 22.7 ( 59712)  GGC G 0.22 14.9 ( 39184)
    # GUA V 0.17 12.6 ( 33047)  GCA A 0.28 23.3 ( 61330)  GAA E 0.46 31.7 ( 83517)  GGA G 0.23 15.6 ( 41118)
    # GUG V 0.38 28.6 ( 75313)  GCG A 0.25 20.7 ( 54480)  GAG E 0.54 38.0 ( 99928)  GGG G 0.21 13.9 ( 36525)
    t.brucei.codon.usage.fraction = list(c(0.56,0.44),c(0.1,0.21,0.24,0.17,0.09,0.2),c(0.16,0.16,0.17,0.15,0.19,0.17),
                                         c(0.44,0.56),c(0.36,0.28,0.36),c(0.49,0.51),c(1),c(0.23,0.23,0.29,0.25),
                                         c(0.47,0.53),c(0.45,0.55),c(0.23,0.21,0.13,0.18,0.10,0.15),c(0.46,0.29,0.25),
                                         c(1),c(0.23,0.21,0.3,0.26),c(0.49,0.51),c(0.44,0.56),c(0.3,0.15,0.17,0.38),
                                         c(0.25,0.22,0.28,0.25),c(0.55,0.45),c(0.46,0.54),c(0.34,0.22,0.23,0.21))
    t.brucei.codon.usage.codon = list(c("TTT","TTC"),c("TTA","TTG","CTT","CTC","CTA","CTG"),c("TCT","TCC","TCA","TCG","AGT","AGC"),
                                      c("TAT","TAC"),c("TAA","TAG","TGA"),c("TGT","TGC"),c("TGG"),c("CCT","CCC","CCA","CCG"),
                                      c("CAT","CAC"),c("CAA","CAG"),c("CGT","CGC","CGA","CGG","AGA","AGG"),c("ATT","ATC","ATA"),
                                      c("ATG"),c("ACT","ACC","ACA","ACG"),c("AAT","AAC"),c("AAA","AAG"),c("GTT","GTC","GTA","GTG"),
                                      c("GCT","GCC","GCA","GCG"),c("GAT","GAC"),c("GAA","GAG"),c("GGT","GGC","GGA","GGG"))
    names(t.brucei.codon.usage.fraction) = c("F","L","S","Y","*","C","W","P","H","Q","R","I","M","T","N","K","V","A","D","E","G")
    names(t.brucei.codon.usage.codon) = c("F","L","S","Y","*","C","W","P","H","Q","R","I","M","T","N","K","V","A","D","E","G")
  } #codon usage table end
  
  
  
  { #the code below generates one new sequence; we generate 100 and always check 
    #if the current one is the most diverged so far. If so we keep it, otherwise we 
    #keep the previous most diverged one.
    set.seed(length(seq.tripplets))
    for(j in 1:100){
      
      {
        #generate resistance sequence
        seq.out = rep(NA,length(seq.tripplets))
        #find the amino acid of each triplet, then draw a different codon for it using the usage fractions
        for (i in 1:length(seq.tripplets.unique)){
          aa = codons[which(seq.tripplets.unique[i]==codons[,1]),2]
          pos.in.seq = which(seq.tripplets==seq.tripplets.unique[i])
          
          #remove the codon from the frequency lists if more than one codon encodes the aa
          if(length(unlist(t.brucei.codon.usage.codon[aa])) > 1){
            codons.to.choosefrom = unlist(t.brucei.codon.usage.codon[aa])[-which(unlist(
              t.brucei.codon.usage.codon[aa])==seq.tripplets.unique[i])]
            
            codons.to.choosefrom.frequencies = unlist(t.brucei.codon.usage.fraction[aa])[-which(
              unlist(t.brucei.codon.usage.codon[aa])==seq.tripplets.unique[i])]
            
            codons.to.choosefrom.frequencies = codons.to.choosefrom.frequencies/sum(
              codons.to.choosefrom.frequencies) #all probabilities have to sum up to 1
            new.tripplets = sample(codons.to.choosefrom,
                                   length(pos.in.seq),
                                   prob = codons.to.choosefrom.frequencies,
                                   replace = TRUE)
            seq.out[pos.in.seq] = sample(new.tripplets,length(pos.in.seq))
          }
          else{ #if only one codon encodes the aa then just keep them of course.
            seq.out[pos.in.seq] = seq.tripplets.unique[i]
          }
        }
        
        if(j>1){#check if the current sequence is less identical
          #to the input sequence than the current best; replace 
          #the current best if this is the case
          score.current = sequence.misidentities(seq.tripplets,seq.out)
          if(score.current > score.current.best){
            seq.current.best = seq.out
            score.current.best = score.current
          }
        }else{
          score.current.best = sequence.misidentities(seq.tripplets,seq.out)
          seq.current.best = seq.out
        }
        
        #cat(score.current.best)
      }
    }
    seq.final = paste(seq.current.best,sep="",collapse = "")
    cat("\nThe final sequence is: ","\n",seq.final,"\nthe sequence differs from the input sequence in ",
        score.current.best*100,"% of all positions")
    cat("\n\nPlease remember to check for unwanted Restriction enzyme sites that might be\n",
        "present in the output sequence. Also have a look at the Stop codon,\n",
        "you may want to keep the original one there.")
  }
}

RNAiResGeneGenerator2()
###############################################################################################################


The final sequence is:  
 ATGATCCCCCCTGCGAAGCTAAACGACTTTTTTAATATTGTAGATGATTTTTTAAAAAAGACCTTTCGGGACGAAAGCTTCCTTTTTGCGGCTGTAAAAAGTCGTCAATACAGCGAAAGCTTCCCCGGCGAGCAGCTGTTTTTTACTTCGCCTCCCTCCGCAACTACGGATCAACCAAGCGACGAATCATTGCAGGCCGGCGGGGGTGAACTCCGCAAGACCCAGAACTTCATGTATTTTTCTCCGCGCCTTATCTTTAATCGGGATGGGGGATATCACGGCAAAGTTAAACTTCATTCGGGCGTACATGTCCCTCAAGTTTGCCGCTTCGAACAGGGTATCGCCGTAAATTCTGAAGGTCTTGCCAGTGGGTCTGTAAAAGTATCGGACCTGCTCGAAGGTATGGAAGTTAAGGGAAGATTGGCCGTGAATACTATAGCGCCACCCTCTAAGGACGTATGGAGTGTCGCCATGGACTATCAACGCCGCGATTTTTACTCAACCTTAAATTACCAAAGAAATGGCCTCGGTTCGTCTGATCTCTTGGTGGACTGCGGGACCAAGTTCTTCAACCTCTTGGCAGGAGCGGGTTTCGAACGCCAAAAGGTGAGTTTCTTAGAACAGCAAGACCATACGGCACAACTCGATGTTTTGTACGCTGGTGTAGGGTTCACAGGCGTCAACTGGTCAGTCGGGGCGAAACTAGTCCGTGCTAATGACATGTGGTCAGCCGCGCGTATCGCATTCTACCAACGAGTCGTTCCTGATACGAGCGTAGCTTGTGCCTATAACTTTGACATGGAAGAAAGCCGGGTGCACGTAAGTCTTGGGTTCTCTCAAGGCTTCCGTCTACGTGTACCAACCATCCTCCAACAAAGAGCCTGCGAACAACTAGATGTTTGGACTGCGATCCTGCCCTTCGTTGGGGCTTTTAAAGTTGAATCAGGCGGACTTTGCGCAGCCACTATCAGGG