<a href="https://colab.research.google.com/github/PStettler/Schneider-Group-DCBP/blob/main/sequenceCleaner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2> Welcome to the sequence cleaner! </h2> <br>

Is there anything more annoying than copying a sequence from somewhere and finding that it contains line breaks, numbers, or even U instead of T? I don't think so. That is why you need this algorithm!

In [None]:
sequence = 
  #Please replace the example DNA sequence below with your sequence of interest. Use
  #only DNA characters (GATC + see list below).
  #Then go to Runtime and select "Run all". The results will be displayed below
  #this box. Have fun.

  # R for A or G
  # Y for C or T
  # S for G or C
  # W for A or T
  # K for G or T
  # M for A or C
  # B for C, G, or T
  # D for A, G, or T
  # H for A, C, or T
  # V for A, C, or G
  # N for A, G, C, or T


  "
      1 atgagtcagg gacgagaagc caaagtctac gtcaacggaa gcctggtcgg gacgcacccc
       61 gacccgaaca gactcgcgag cgacatccga caggcccgcc gccgtggcga cgtcagtcag
      121 atggtcaacg tctcggtgcg cgagcgcacc ggcgaggtca tcgtcaacgc cgacgccggc
      181 cgcgcccgcc gaccgctcat cgtcgtcgag gacggcgagc cgcgcatcgg tgacgaggac
      241 atcgaggcgc tcgaggccgg ccagctcgac ttcgaggact tcgtcgagac cggtgccatc
      301 gagttcatcg acgccgagga ggaagaggac atctacgtcg cggtcgacga ggaggaggtc
      361 agcgaggacc acacccacct cgagatcgac ccacagctca tcttcggtat cggtgccggg
      421 atgattccct accccgagca caacgcttcc ccacgaatta cgatggggtc ggggatggtc
      481 aagcagtcgc tcgggctgcc gagcgcgaac taccgcatcc ggccggacac gcgccagcac
      541 ctgctgcact acccacagct ctcgatggtc aaaacgcaga ccaccgagca gatcgggtac
      601 gacgaccggc ccgccgcaca gaacttcacc gtcgcggtga tgagctacga ggggttcaac
      661 atcgaggacg cgctcgtcat gaacaaggcg tcggtcgagc gcgcgctcgc ccggtcgcac
      721 ttcttccgca cctacgaggg tgaggagcga cgctaccccg gcggccagga ggaccgcttc
      781 gagattccct ccgacgaggt ccgtggcgca cgcggcgagg aggcgtacac gcacctcgac
      841 gacgacggtc tcgtcaaccc ggagacgacc gtcgacgaga acgacgtgct gctcgggaag
      901 acctccccgc cccggttcct cgaagagccg gacgacatgg gtgggctgag cccccagaag
      961 cgccgcgaga cgagcgtcac gatgcgttcg ggcgaagacg gcgtcgtcga cacggtgacg
     1021 ctgatggagg gcgaggacgg gtcgaagctc tcgaaggtct cggtgcgcga ccagcgaatc
     1081 cccgaactcg gggacaagtt cgcgtcgcga cacggccaga agggggtcgt gggccacctc
     1141 gcgccccagg aggacatgcc gttcacccag gagggcgtcg tgcccgacct catcgtcaac
     1201 ccgcacgcgc tgccgtcgcg gatgacggtg ggtcacgtgc tggagatgct cggcgggaag
     1261 gtcggcgcgc tcgaaggccg ccgtgtcgac gggaccgcct tccagggcga ggacgaggag
     1321 gaactgcgtg cggcgctgga ggagaagggg tacaactccg cgggcaagga gacgatgtac
     1381 tccggtgtca ccggcgagaa gatcgaggcc gagatcttcg tcggggtcat cttctaccag
     1441 aagctgtacc acatggtctc gaacaagatt cacgcgcgtt cgcgtgggcc ggtccaggtg
     1501 ctgacccgcc agcccaccga agggcgtgcg cgtgaaggtg ggctccgtgt cggagagatg
     1561 gagcgcgacg tgctcatcgg tcacggcgcg gcgatggcgc tcaaagagcg cctcctcgac
     1621 gagtccgacc gcgagtacat cgacatctgt gggaactgtg ggatgaccgc cgtcgagaac
     1681 gtcgagcaac ggcgcatcta ctgtccgaac tgcgaggagg agacggacat ccaccgcgtc
     1741 gagatgagct acgcgttcaa actactgctc gacgagatga aggcgctggg catcgccccg
     1801 cgaatcgaac tggaggacgc agtatga
  
  "

In [None]:
#@title 
#packages
library(stringr)

#Sequence Cleaner is an algorithm that
#takes any messy sequence as input and
#returns a nice and clean DNA sequence.

sequenceCleaner = function(seq = sequence)
{

  seq = paste(seq,sep="",collapse="")

  #convert everything to upper case
  seq = toupper(seq) #this leaves any \n \r \t unchanged,
                     #so they can be removed later.

  #replace U (RNA) with T (DNA)
  seq = str_replace_all(seq,"U","T")

  #remove anything that is not DNA
  seq = str_replace_all(seq,"[^ATGCNRYSWKMBDHV]","")

  cat("the cleaned-up sequence is:\n",seq,"\n\n")

  #if the sequence contains an open reading frame (ORF), i.e. an ATG
  #followed by an in-frame stop codon
  regex = "ATG([ATGC]{3})*(?=((TGA)|(TAA)|(TAG)))"
  if(str_detect(seq,regex))
  {
    ORF = seq
    while(str_detect(ORF,regex)){
      #the greedy match first runs up to the last in-frame stop codon;
      #repeating the anchored match shortens it until the ORF starts at the
      #first ATG and ends just before the closest in-frame stop codon
      ORF = str_extract(ORF,regex)
      if(!str_detect(regex,"\\^")) {regex = paste0("^",regex)}
    }
    #split the ORF into codons
    seq_ORF = str_extract_all(ORF,"[ATGC]{3}")[[1]]
    seq_ORF_location = str_locate(seq,ORF)

    #put the sequence before the ORF in front of the codons and the
    #sequence after the ORF (starting with the stop codon) behind them
    seq_ORF[2:(length(seq_ORF)+1)] = seq_ORF
    seq_ORF[1] = substr(seq,1,seq_ORF_location[1]-1)
    seq_ORF[length(seq_ORF)+1] = substr(seq,seq_ORF_location[2]+1,nchar(seq))

    cat("the cleaned-up sequence with the open reading frame split into codons:\n",seq_ORF,sep=" ")
  }
}

sequenceCleaner()
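
The ORF search in the function above shortens a greedy match step by step. As a minimal sketch of an alternative, and not part of the original notebook, the same idea can be written with a lazy quantifier in base R. The helper name `findFirstORF` and the short example sequence are hypothetical, and the sketch assumes the input has already been cleaned to upper-case DNA letters. In the common case it should return the same ORF as the loop: from the first ATG that has an in-frame stop codon up to, but not including, the closest in-frame stop codon.

```r
# Hypothetical helper (not part of the original notebook).
# Assumes `seq` is already cleaned: upper case, DNA letters only.
findFirstORF = function(seq)
{
  # "*?" is lazy: codons are added one at a time until the lookahead
  # sees an in-frame stop codon, so the match ends at the closest one
  pattern = "ATG(?:[ATGC]{3})*?(?=TGA|TAA|TAG)"
  hit = regexpr(pattern, seq, perl = TRUE)

  # regexpr returns -1 when nothing matches
  if (hit == -1) return(NA_character_)

  regmatches(seq, hit)
}

# made-up example sequence, for illustration only
example = "CCATGAAACCCTAGGGGTAA"
orf = findFirstORF(example)
cat("ORF:", orf, "\n")
```

Because the regex engine tries start positions from left to right, an ATG without any in-frame stop codon is skipped and the search continues at the next ATG.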