## Week 5 practical: Exploring Biopython's Sequence Object part 2 

## Jason Okwuonu

In [2]:
import os 
os.chdir("C:/Users/smart/Documents/Bioinformatics/Data Science for bioinformatics/Python code")

In [4]:
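# Importing packages from Biopython
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import SeqFeature
# Note: newer Biopython releases replace GC with gc_fraction,
# which returns a fraction (0 to 1) rather than a percentage
from Bio.SeqUtils import GC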

In [19]:
# Creating a sequence object
test_seq = Seq("GCTTCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC")
print(test_seq)

GCTTCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC


In [10]:
# Using string-style indexing to identify the nucleotides at specific positions

print("first letter of your sequence:" , test_seq[0])
print("third letter of your sequence:" , test_seq[2])
print("last letter of your sequence:" , test_seq[-1])


first letter of your sequence: G
third letter of your sequence: T
last letter of your sequence: C


## Calculating GC content in the sequence 

In [11]:
# Using an equation: the percentage of G and C letters over the sequence length
print("GC content:" , 100 * float(test_seq.count("G") + test_seq.count("C"))/ len(test_seq))

GC content: 49.01960784313726


In [13]:
# The same calculation as in the cell above can be done
# with Biopython's GC function

print("GC content:", GC(test_seq))

GC content: 49.01960784313726
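
The equation and the `GC` call agree because both count the G and C letters and divide by the sequence length. As a minimal plain-Python sketch of that calculation (the function name `gc_percent` is hypothetical, and this is not Biopython's implementation), the version below also treats lower-case letters as matches and guards against an empty sequence:

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a nucleotide sequence."""
    sequence = str(sequence).upper()  # str() also accepts a Biopython Seq
    if not sequence:
        return 0.0  # assumption: report 0 for an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


test_str = "GCTTCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC"
print("GC content:", gc_percent(test_str))
```

Upper-casing first matters for mixed-case input such as `"acgtATGT"`, where a plain count of `"G"` and `"C"` would miss the lower-case letters.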


In [15]:
# Slicing can be used to extract a subsequence, e.g. to assess a candidate primer region

slice1 = test_seq[4:12]

print("a little slice:", slice1)

a little slice: CTTTTAGC


In [20]:
# Slicing with a step can also pick out the nucleotide at
# every third position, for example
print("every 3rd position from 0:", test_seq[0::3])
# A step of -1 reverses the sequence
print("String reverse" , test_seq[::-1])

every 3rd position from 0: GTTACCCTAAAGGCCCT
String reverse CCTGTCTTCGTCGAGTGGGGACAACAACTTATCCACTTCCGATTTTCTTCG


## Task 

* Printing every third nucleotide from position 1 and every third nucleotide from position 2

In [21]:
print("every 3rd nucleotide from 1:", test_seq[1::3])
print("every 3rd nucleotide from 2:", test_seq[2::3])

every 3rd nucleotide from 1: CCTGTATTAAGGATTTC
every 3rd nucleotide from 2: TTTCTCACCCGTGGTGC
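
Slices that start at 0, 1 and 2 with a step of 3 pick out the first, second and third base of each codon when the sequence is read in frame from its first nucleotide. The sketch below shows this on a plain string (slicing a `Seq` behaves the same way). The helper name `split_codons` is hypothetical and not part of Biopython.

```python
def split_codons(sequence):
    """Split a sequence into consecutive, non-overlapping triplets.

    Trailing bases that do not fill a whole codon are dropped.
    """
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


test_str = "GCTTCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC"
codons = split_codons(test_str)
print(codons[:4])  # the first four codons

# Joining the first base of every codon reproduces the slice test_str[0::3]
first_bases = "".join(codon[0] for codon in codons)
print(first_bases == test_str[0::3])
```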


### Concatenating sequences 


In [23]:
piece1 = Seq("GCTTC")
piece2 = Seq("GCCTTC")
combined = piece1 + piece2
print(combined)

GCTTCGCCTTC


* Sequences stored in a list can also be concatenated using a loop

In [25]:
list_of_seqs = [Seq("TTCACC"), Seq("GGGT"), Seq("TGTCC")]
concatenated = Seq("")

for s in list_of_seqs:
    concatenated += s

In [27]:
print(concatenated)

TTCACCGGGTTGTCC


* The join method can also be used to concatenate Biopython Seq objects. The joined sequences are separated by a specified delimiter (such as a comma or a tab, \t)

In [37]:
contigs = [Seq("GGGTG"), Seq("ATGTCC"), Seq("TCACA")]
spacer = Seq("\t"*10)
print(spacer.join(contigs))

GGGTG										ATGTCC										TCACA


* The case of the sequence can be changed.

In [31]:
d_seq = Seq("acgtATGT")

print(d_seq)

print("d seq upper case:", d_seq.upper())

print(d_seq.lower())

acgtATGT
d seq upper case: ACGTATGT
acgtatgt


In [32]:
"ACGT" in d_seq

False

In [35]:
"ACGT" in d_seq.upper()

True

## Translation Tables

In [38]:
from Bio.Data import CodonTable 

In [41]:
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

In [44]:
print(standard_table)
print(mito_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
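
Each table maps a three-letter codon to one amino acid, and Stop codons end translation. As a rough illustration of how such a lookup works, the sketch below builds the standard code from its 64-letter amino-acid string (bases ordered T, C, A, G at each position, as in the printed table) and then applies the vertebrate mitochondrial differences (TGA = W, ATA = M, AGA and AGG = Stop). It is plain Python rather than Biopython's own translation code, and the names `STANDARD`, `VERT_MITO` and `translate` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)

# Pair every codon (TTT, TTC, TTA, ...) with its amino acid; "*" marks a stop
STANDARD = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: the standard code with four codons reassigned
VERT_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(sequence, table):
    """Translate whole codons from position 0, writing stops as '*'."""
    sequence = str(sequence).upper()
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        protein.append(table[sequence[i:i + 3]])
    return "".join(protein)


test_str = "GCTTCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC"
print("standard:     ", translate(test_str, STANDARD))
print("mitochondrial:", translate(test_str, VERT_MITO))
```

The two results differ at the eleventh codon, AGG, which codes for arginine (R) in the standard table but is a stop in the vertebrate mitochondrial table.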

## Comparing Seq Objects

In [45]:
seq1 = Seq("ACGT")
"ACGT" == seq1

True

* Sequence objects cannot be changed once they are made (they are immutable)

In [48]:
 
test_seq[5] = "G"

TypeError: 'Seq' object does not support item assignment

* But a mutable copy can be created using MutableSeq

In [50]:
from Bio.Seq import MutableSeq 

In [52]:
test_str = str(test_seq)
mutable_seq = MutableSeq(test_str)

print(mutable_seq)

GCTTCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC


In [55]:
mutable_seq[3] = "C"

print(mutable_seq)

GCTCCTTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC


In [57]:
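# remove() deletes only the first matching "T" each time it is called;
# the output below shows the result after two calls
mutable_seq.remove("T")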
print(mutable_seq)

GCCCTTTAGCCTTCACCTATTCAACAACAGGGGTGAGCTGCTTCTGTCC


In [73]:
# reverse() reverses the sequence in place and returns None,
# so call it first and then print the sequence
mutable_seq.reverse()
print(mutable_seq)

CCTGTCTTCGTCGAGTGGGGACAACAACTTATCCACTTCCGATTTCCCG


## Working with Unknown Sequences 

In [74]:
# Note: UnknownSeq is only available in older Biopython releases
from Bio.Seq import UnknownSeq

In [75]:
unk = UnknownSeq(20)
print("unknown seq:" , unk)
print("length unknown:", len(unk))

unknown seq: ????????????????????
length unknown: 20


* The question marks can also be replaced with Ns

In [76]:

unk_DNA = UnknownSeq(20, character="N")
print(unk_DNA)

NNNNNNNNNNNNNNNNNNNN
