
# Sequence objects
**Seq objects**: Biopython objects that store biological sequences. They behave much like Python strings, but some of their methods differ from the string methods of the same name. For instance, the **translate()** method performs biological translation (nucleotides to amino acids) rather than character mapping.


In [3]:
!pip install Bio


In [8]:
# Example 1:
from Bio.Seq import Seq
my_seq = Seq("GATCG")
for index, letter in enumerate(my_seq):
    print(f"{index} {letter}")
print("My sequence length is:", len(my_seq))

0 G
1 A
2 T
3 C
4 G
My sequence length is: 5


Elements of a sequence can be accessed by index, the same way as characters in a string.

In [9]:
#Example 2 :
print(my_seq[0])
print(my_seq[2])

G
T


The **Seq** object also has a **.count()** method. Like the string method of the same name, it gives a non-overlapping count.

In [15]:
# Example 3:
AA_count = Seq("AAAA").count("AA")
print(f"There are {AA_count} AA occurrence(s).")
# Example 4:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
count_of_G = my_seq.count("G")
print(f"There are {count_of_G} G occurrence(s).")
GC_percentage = 100 * (my_seq.count("G") + my_seq.count("C")) / len(my_seq)
print(f"There are {GC_percentage}% GC in this sequence")


There are 2 AA occurrence(s).
There are 9 G occurrence(s).
There are 46.875% GC in this sequence
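
Because `.count()` skips overlapping matches, `"AA"` is counted twice in `"AAAA"` even though it starts at three different positions. When overlapping matches matter, one option is to scan with `find()`. The sketch below works on a plain string (pass `str(my_seq)` for a Seq object); `count_overlapping` is a hypothetical helper written for this note, not a Biopython function. Recent Biopython releases also provide a `count_overlap()` method on Seq for the same purpose.

```python
def count_overlapping(text: str, sub: str) -> int:
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character so the next match may overlap this one.
        start = text.find(sub, start + 1)
    return count


non_overlapping = "AAAA".count("AA")           # same rule as Seq.count()
overlapping = count_overlapping("AAAA", "AA")  # matches at positions 0, 1 and 2
print(non_overlapping, overlapping)
```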


Biopython already provides built-in functions for GC calculations, such as **gc_fraction()** in `Bio.SeqUtils`:

In [21]:
#Example:
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
gc_fraction(my_seq)


0.46875

Note that **gc_fraction()** already accounts for the ambiguous nucleotide S (which can be G or C) and counts it as GC:

In [20]:
#Example
my_seq = Seq("SATSSATGGGCCTATATAGGATCGAAAATCSS")
gc_fraction(my_seq)

0.46875
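
To make the rule concrete, here is a rough pure-Python sketch of what such a calculation could look like. It assumes the behaviour the Biopython documentation describes for the default `ambiguous="remove"` mode: G, C and S count as GC, and only the letters A, C, G, T, S, W and U count toward the length. `gc_fraction_sketch` is a hypothetical name, and the real function may differ in details.

```python
def gc_fraction_sketch(seq: str) -> float:
    """Approximate GC fraction: G, C and S count as GC; other ambiguity codes are dropped."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GCS")
    # Only these letters contribute to the length; codes such as N are ignored.
    total = sum(seq.count(base) for base in "ACGTSWU")
    return gc / total if total else 0.0


print(gc_fraction_sketch("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
print(gc_fraction_sketch("SATSSATGGGCCTATATAGGATCGAAAATCSS"))
```

Both calls give the same value because every S in the second sequence stands in for a G or C of the first.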

Like Python strings, the Seq object is immutable, so you cannot change a letter in place by item assignment. The MutableSeq section goes into more detail on how to edit sequences.

In [25]:
#Example:
my_seq[0]= "G"

TypeError: 'Seq' object does not support item assignment
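
Python strings raise the same kind of error, and the usual workaround is to build a new value from slices instead of assigning in place. A minimal sketch on a plain string follows; `seq_text` and `edited` are illustrative names, and the same slice-and-concatenate pattern should carry over to Seq objects, which support both slicing and `+`.

```python
seq_text = "SATSSATGGG"

try:
    seq_text[0] = "G"  # item assignment fails on str, just as it does on Seq
except TypeError as err:
    print(f"TypeError: {err}")

# Build a new value instead: replacement letter + everything after position 0.
edited = "G" + seq_text[1:]
print(edited)
```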

# Slicing a sequence
Slicing follows the normal Python string rules (start, stop, and stride).

In [28]:
#Example:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(my_seq[4:12])
print(my_seq[0::3])
print(my_seq[::-1])
# Note: print() displays each slice as plain text,
# but the slices themselves are still Seq objects


GATGGGCC
GCTGTAGTAAG
CGCTAAAAGCTAGGATATATCCGGGTAGCTAG
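
The stride slice `my_seq[0::3]` picks the first letter of every codon in reading frame 0. A related and common use of slicing is cutting a sequence into codon-sized chunks. The sketch below uses a plain string, but only relies on `len()` and slicing, so it should work on Seq objects too; `split_codons` is a hypothetical helper name.

```python
def split_codons(seq, size=3):
    """Return consecutive, non-overlapping chunks of `size` letters."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


codons = split_codons("GATCGATGGGCC")
print(codons)
```

If the length is not a multiple of `size`, the last chunk is simply shorter.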


# Turning Seq objects into strings

In [29]:
# Just pass the Seq object to str()
str(my_seq)


'GATCGATGGGCCTATATAGGATCGAAAATCGC'

In [30]:
# You can also reverse the operation by passing the string to Seq()
Seq(str(my_seq))

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')

# Concatenating or Adding sequences

In [31]:
# Concatenation follows the normal Python string rules
seq1 = Seq("ACGT")
seq2 = Seq("AACCGG")
seq1 + seq2

Seq('ACGTAACCGG')

Biopython does not check the sequence contents and will not raise an exception if, for example, you concatenate a protein sequence and a DNA sequence (which is likely a mistake):

In [33]:
#Example:
protein_seq = Seq("EVRNAK")
dna_seq = Seq("ACGT")
protein_seq + dna_seq

Seq('EVRNAKACGT')
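
Since no check is done for you, a small guard before concatenating can catch this kind of mix-up. The sketch below is one possible approach, not a Biopython feature: `DNA_LETTERS`, `looks_like_dna` and `concat_dna` are hypothetical names, and the allowed alphabet (A, C, G, T, N) is an assumption you may need to widen. It converts inputs with `str()` for the check, so it should accept Seq objects as well as strings.

```python
DNA_LETTERS = set("ACGTN")  # assumed alphabet for this sketch; extend as needed


def looks_like_dna(seq) -> bool:
    """Return True if every letter is one of the assumed DNA letters."""
    return set(str(seq).upper()) <= DNA_LETTERS


def concat_dna(left, right):
    """Concatenate two sequences, refusing anything that is not plain DNA."""
    for seq in (left, right):
        if not looks_like_dna(seq):
            raise ValueError(f"not a DNA sequence: {seq!s}")
    return left + right


print(concat_dna("ACGT", "AACCGG"))
try:
    concat_dna("EVRNAK", "ACGT")
except ValueError as err:
    print(err)
```

This is only a heuristic: a short peptide made up entirely of the letters A, C, G, T and N would still pass the check.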

In [38]:
# Example: adding multiple sequences

list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")]
concatenated_seqs = Seq("")
for seq in list_of_seqs:
    concatenated_seqs += seq
concatenated_seqs

Seq('ACGTAACCGGTT')

Like Python strings, the Biopython Seq object also has a **.join()** method:

In [39]:
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N"*10)
spacer.join(contigs)

Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')

# Changing case

The case methods follow the same rules as Python strings.

In [42]:
from Bio.Seq import Seq
dna_seq = Seq("acgtACGT")
dna_seq          # Seq('acgtACGT')
dna_seq.upper()  # Seq('ACGTACGT')
dna_seq.lower()  # Seq('acgtacgt')
# Membership tests are case-sensitive
"GTAC" in dna_seq          # False
"GTAC" in dna_seq.upper()  # True


True

# Nucleotide sequences and (reverse) complements


The **complement()** and **reverse_complement()** methods return the complement and the reverse complement of a nucleotide sequence.

In [43]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq                       # Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')
my_seq.complement()          # Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG')
my_seq.reverse_complement()  # Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC')

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC')

In [44]:
from Bio.Seq import Seq
protein_seq = Seq("EVRNAK")
# Biopython does not stop you, but the complement of a protein is biologically meaningless
protein_seq.complement()

Seq('EBYNTM')
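
The odd result for the protein sequence makes sense once you notice that V, R, N and K are also IUPAC nucleotide ambiguity codes, so they get "complemented" like any other nucleotide letter, while E has no nucleotide meaning and passes through. The pure-Python sketch below illustrates the idea with an assumed table of IUPAC complement pairs (uppercase only). The names `COMPLEMENT_TABLE`, `complement_sketch` and `reverse_complement_sketch` are hypothetical, and Biopython's own table may differ in details.

```python
# Assumed IUPAC complement pairs (uppercase only); letters not listed pass through unchanged.
_LETTERS = "ACGTRYKMBVDHSWN"
_PARTNER = "TGCAYRMKVBHDSWN"
COMPLEMENT_TABLE = str.maketrans(_LETTERS, _PARTNER)


def complement_sketch(seq: str) -> str:
    """Replace each letter by its complement, keeping the order."""
    return seq.translate(COMPLEMENT_TABLE)


def reverse_complement_sketch(seq: str) -> str:
    """Complement the sequence, then read it backwards."""
    return complement_sketch(seq)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement_sketch(dna))
print(reverse_complement_sketch(dna))
print(complement_sketch("EVRNAK"))  # runs, but the result has no biological meaning
```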

# Transcription

Now let’s get down to doing a transcription in Biopython.