### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 09 - Introduction to Biopython - Sequences

*Written by:* Mateusz Kaczyński

**This notebook provides a general introduction to the Biopython library and takes an in-depth look at working with sequences, the cornerstone of bioinformatics.**


### What is Biopython?

**Biopython** is a popular open-source toolbox for computational biology and bioinformatics. It contains tools and connectors for various resources and provides functions for running common bioinformatics tasks and parsing common bioinformatics data formats.


### Why Biopython?
**Biopython** is a de facto standard for accessing many databases and tools, which makes it easier to share your work and results. It simplifies the use of libraries written in other languages and technologies, as well as those hosted online. While it is possible to use those directly, Biopython simplifies the process and provides the components you need.
 


## Contents


1. [Sequence basics](#Sequence-basics)
2. [Transcription and translation](#Transcription-and-translation)
3. [Alignment](#Alignment)
4. [Downloading and reading FASTA files](#Downloading-and-reading-fasta-files)
5. [Discussion](#Discussion)
-----

#### Extra Resources:

- [Official Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) - A comprehensive guide to the library's capabilities.
- [Rosalind](http://rosalind.info) - A bioinformatics learning platform that includes exercises.




## Sequence basics

In [64]:
import Bio
print("Module", Bio.__name__, "version", Bio.__version__)

Module Bio version 1.79


In [65]:
# Defining a sequence
from Bio.Seq import Seq
dna_sequence = Seq("ACTG")

In [66]:
print("The sequence compared with the original string:", dna_sequence == "ACTG")
print("The type of dna_sequence:", type(dna_sequence))

The sequence compared with the original string: True
The type of dna_sequence: <class 'Bio.Seq.Seq'>


In [67]:
help(dna_sequence)

Help on Seq in module Bio.Seq object:

class Seq(_SeqAbstractBaseClass)
 |  Seq(data, length=None)
 |  
 |  Read-only sequence object (essentially a string with biological methods).
 |  
 |  Like normal python strings, our basic sequence object is immutable.
 |  This prevents you from doing my_seq[5] = "A" for example, but does allow
 |  Seq objects to be used as dictionary keys.
 |  
 |  The Seq object provides a number of string like methods (such as count,
 |  find, split and strip).
 |  
 |  The Seq object also provides some biological methods, such as complement,
 |  reverse_complement, transcribe, back_transcribe and translate (which are
 |  not applicable to protein sequences).
 |  
 |  Method resolution order:
 |      Seq
 |      _SeqAbstractBaseClass
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __hash__(self)
 |      Hash of the sequence as a string for comparison.
 |      
 |      See Seq object comparison documentation (method ``__eq__`` in

In [68]:
# String-like behaviour
print("The length of the sequence:", len(dna_sequence))
print("AC found in the sequence:", "AC" in dna_sequence)
print("Sequence ends with 'TG':", dna_sequence.endswith("TG"))
print("The first base:", dna_sequence[0])
print("The last base:", dna_sequence[-1])
print("Slicing - first two bases:", dna_sequence[0:2])
print("How many adenines are in the sequence:", dna_sequence.count("A"))

The length of the sequence: 4
AC found in the sequence: True
Sequence ends with 'TG': True
The first base: A
The last base: G
Slicing - first two bases: AC
How many adenines are in the sequence: 1


In [69]:
# Case sensitivity: Seq comparisons are case-sensitive, just like string comparisons.
dna_sequence_mixedcase = Seq("acTG")
print("Upper case sequence equal to mixed case sequence:       ",
      dna_sequence == dna_sequence_mixedcase)
print("How about if we change the original to lower case?:     ",
      dna_sequence.lower() == dna_sequence_mixedcase)
print("How about moving both to the same case? Are they equal?:",
      dna_sequence.lower() == dna_sequence_mixedcase.lower())

Upper case sequence equal to mixed case sequence:        False
How about if we change the original to lower case?:      False
How about moving both to the same case? Are they equal?: True


In [70]:
# Extending the sequence
print("Extended sequence:", dna_sequence + "TGG")

Extended sequence: ACTGTGG


## Transcription and translation

In [71]:
# Remove gaps
sequence_with_gaps = Seq("-".join(3*[str(dna_sequence)]))
print("Sequence with gaps   :", sequence_with_gaps)
print("Sequence without gaps:", sequence_with_gaps.ungap())

Sequence with gaps   : ACTG-ACTG-ACTG
Sequence without gaps: ACTGACTGACTG


In [72]:
# GC content count
from Bio.SeqUtils import GC
print("The guanine-cytosine content of the sequence (%):", GC(dna_sequence))

The guanine-cytosine content of the sequence (%): 50.0
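
To make the calculation behind the GC percentage explicit, here is a minimal plain-Python sketch. The helper name `gc_percent` is hypothetical and not part of Biopython, and this version only counts unambiguous `G` and `C` characters, so it may differ from the library function on sequences that contain ambiguity codes.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C characters in a sequence (hypothetical helper)."""
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    upper = sequence.upper()  # count lower-case bases as well
    gc_count = upper.count("G") + upper.count("C")
    return 100.0 * gc_count / len(upper)


print(gc_percent("ACTG"))  # 50.0
```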


In [73]:
# Complement
print("Sequence:          ", dna_sequence)
print("Complement:        ", dna_sequence.complement())
print("Reverse:           ", dna_sequence[::-1])
print("Reverse complement:", dna_sequence.reverse_complement())

Sequence:           ACTG
Complement:         TGAC
Reverse:            GTCA
Reverse complement: CAGT
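
For intuition, the same complement and reverse complement can be reproduced with plain strings. This is only a sketch for unambiguous DNA bases. `COMPLEMENT_TABLE`, `complement` and `reverse_complement` below are illustrative names defined here, not the Biopython methods.

```python
# Map each DNA base to its Watson-Crick partner (upper and lower case).
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Swap every base for its partner, keeping the original order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna: str) -> str:
    """Complement the sequence and then read it backwards."""
    return complement(dna)[::-1]


print(complement("ACTG"))          # TGAC
print(reverse_complement("ACTG"))  # CAGT
```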


In [75]:
# DNA to RNA
print("Sequence:              ", dna_sequence)
print("Transcribed:           ", dna_sequence.transcribe())
print("Reverse complement RNA:", dna_sequence.reverse_complement_rna())

Sequence:               ACTG
Transcribed:            ACUG
Reverse complement RNA: CAGU


In [76]:
# From RNA to DNA
transcribed = dna_sequence.transcribe()
print("Transcribed:     ", transcribed)
print("Back transcribed:", transcribed.back_transcribe())

Transcribed:      ACUG
Back transcribed: ACTG
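
The outputs above suggest that transcription here treats the sequence as the coding strand, so it amounts to swapping `T` for `U`, and back-transcription swaps `U` for `T`. A minimal sketch with plain strings follows. The function names are hypothetical and only unambiguous bases are considered.

```python
def transcribe_dna(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace thymine with uracil."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe_rna(rna: str) -> str:
    """mRNA back to coding-strand DNA: replace uracil with thymine."""
    return rna.replace("U", "T").replace("u", "t")


rna = transcribe_dna("ACTG")
print(rna)                       # ACUG
print(back_transcribe_rna(rna))  # ACTG
```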


In [77]:
# To Protein
sequence = Seq("ACGCGACGA")
sequence.translate()

Seq('TRR')
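
Translation reads the sequence three bases (one codon) at a time and looks each codon up in a genetic code table. The sketch below builds the standard genetic code from its usual compact representation and translates the same sequence. It is a simplified illustration, not Biopython's implementation: `translate_dna` is a hypothetical helper, it ignores any trailing incomplete codon, and it raises `KeyError` on ambiguous bases such as `N`.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna: str) -> str:
    """Translate coding-strand DNA codon by codon ('*' marks a stop codon)."""
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable_length, 3))


print(translate_dna("ACGCGACGA"))  # TRR
```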

In [78]:
# Operations on strings directly.
from Bio.Seq import transcribe
print("Transcription directly from 'ACGT' string, without using Seq object:", transcribe("ACGT"))

Transcription directly from 'ACGT' string, without using Seq object: ACGU


## Alignment

In [79]:
# Two-sequence alignment
from Bio import pairwise2

# Notice we can supply either Seq objects or plain strings.
global_alignments = pairwise2.align.globalxx(Seq("ACGT"), "ACGC")
for ga in global_alignments:
    print(ga)

Alignment(seqA='ACGT-', seqB='ACG-C', score=3.0, start=0, end=5)
Alignment(seqA='ACGT', seqB='ACGC', score=3.0, start=0, end=4)


In [80]:
# Print the first available alignment.
# The * unpacks the Alignment named tuple into the positional arguments expected by format_alignment.
print(pairwise2.format_alignment(*global_alignments[0]))

ACGT-
|||  
ACG-C
  Score=3



In [81]:
# Similarly we can use local alignments.
local_alignments = pairwise2.align.localxx(Seq("ACGT"), "ACGC")
print(pairwise2.format_alignment(*local_alignments[0]))

1 ACG
  |||
1 ACG
  Score=3



In [82]:
# Multiple sequence alignment
from Bio.Align import MultipleSeqAlignment, AlignInfo
from Bio.SeqRecord import SeqRecord

seqs = [
    "ACGTACGT",
    "ACGTGCGC",
    "ACGTA--T",
    "CCGTACGG",
    "A-GTACCC",
    "ACGTA--T",
    "CTG-ACG-",
    "AGGTACG-"
]
aligned = MultipleSeqAlignment([SeqRecord(Seq(s)) for s in seqs])
align_info = AlignInfo.SummaryInfo(aligned)
print("The simple consensus:", align_info.dumb_consensus())

The simple consensus: ACGTACGX
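
The consensus above can be reproduced by a simple column-wise vote. The sketch below assumes the rule is: ignore gaps, take the most common base in each column, and emit a placeholder when that base falls below a threshold or is tied. The 0.7 threshold and the `X` placeholder are assumptions about the defaults, and `simple_consensus` is a hypothetical helper, not Biopython's code.

```python
from collections import Counter

seqs = [
    "ACGTACGT", "ACGTGCGC", "ACGTA--T", "CCGTACGG",
    "A-GTACCC", "ACGTA--T", "CTG-ACG-", "AGGTACG-",
]


def simple_consensus(sequences, threshold=0.7, ambiguous="X"):
    """Column-wise majority vote over an alignment, ignoring gap characters."""
    consensus = []
    for column in zip(*sequences):
        counts = Counter(base for base in column if base not in "-.")
        if not counts:  # the column contains only gaps
            consensus.append(ambiguous)
            continue
        ranked = counts.most_common(2)
        top_base, top_count = ranked[0]
        is_tie = len(ranked) > 1 and ranked[1][1] == top_count
        total = sum(counts.values())
        if not is_tie and top_count / total >= threshold:
            consensus.append(top_base)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


print(simple_consensus(seqs))  # ACGTACGX
```

The last column contains T three times, C twice and G once among six non-gap characters, so no base reaches the threshold and the placeholder is used.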


In [83]:
print("Number of times each base occurs at each alignment position.")
print("Rows: alignment positions, labelled by the simple consensus; columns: bases.")
print(align_info.pos_specific_score_matrix())

Number of times each base occurs at each alignment position.
Rows: alignment positions, labelled by the simple consensus; columns: bases.
    A   C   G   T
A  6.0 2.0 0.0 0.0
C  0.0 5.0 1.0 1.0
G  0.0 0.0 8.0 0.0
T  0.0 0.0 0.0 7.0
A  7.0 0.0 1.0 0.0
C  0.0 6.0 0.0 0.0
G  0.0 1.0 5.0 0.0
X  0.0 2.0 1.0 3.0
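
Each row of the matrix above is consistent with a plain count of bases in one alignment column, with gaps left out. A minimal sketch of that counting, using the same eight sequences, is shown below. `column_counts` is a hypothetical helper and the unweighted counting is an assumption based on the printed values.

```python
from collections import Counter

seqs = [
    "ACGTACGT", "ACGTGCGC", "ACGTA--T", "CCGTACGG",
    "A-GTACCC", "ACGTA--T", "CTG-ACG-", "AGGTACG-",
]


def column_counts(sequences, alphabet="ACGT"):
    """Return one dict per alignment column with the count of each base."""
    table = []
    for column in zip(*sequences):
        counts = Counter(column)  # gap characters are counted but not reported
        table.append({base: counts[base] for base in alphabet})
    return table


counts_per_column = column_counts(seqs)
for position, counts in enumerate(counts_per_column):
    print(position, counts)
```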



## Downloading and reading fasta files

In [84]:
# File downloads
from urllib.request import urlretrieve 
# CFTR gene (Homo sapiens): FASTA URL from Ensembl (UniProt accession P13569).
CFTR_FASTA_path = "data/ENSG00000001626.fasta"
result_location, http_response = urlretrieve("https://rest.ensembl.org/sequence/id/ENSG00000001626.fasta", CFTR_FASTA_path)
print("Downloaded file to: ", result_location)

print("\nDownload metadata and statistics:")
print(http_response)

Downloaded file to:  data/ENSG00000001626.fasta

Download metadata and statistics:
Vary: Content-Type
Vary: Origin
Content-Type: text/x-fasta; charset=UTF-8
Date: Tue, 23 Nov 2021 13:40:33 GMT
X-RateLimit-Limit: 55000
X-RateLimit-Reset: 1167
X-Runtime: 0.280467
Connection: close
X-RateLimit-Period: 3600
X-RateLimit-Remaining: 54999
Content-Length: 436062




In [85]:
# Reading FASTA files. Note that each file can contain multiple records.
from Bio import SeqIO
for record in SeqIO.parse(CFTR_FASTA_path, "fasta"):
    print(record)
    print("Sequence length:", len(record.seq))
    print("GC content:", GC(record.seq))

ID: ENSG00000001626.16
Name: ENSG00000001626.16
Description: ENSG00000001626.16 chromosome:GRCh38:7:117287120:117715971:1
Number of features: 0
Seq('AGGCGGATCACAAGTTCATGAGATCGAGACCATCTTGGCCAACATGGTGAGACC...ACA')
Sequence length: 428852
GC content: 36.895245912342716
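
A FASTA file is plain text: each record starts with a `>` header line, followed by one or more lines of sequence. The sketch below parses that layout by hand to show what a FASTA reader has to do. It is a simplified illustration rather than a replacement for `SeqIO.parse`. `read_fasta` is a hypothetical helper, and the two records are made-up toy data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)  # finish the last record


# Made-up two-record example, not real data.
example = io.StringIO(
    ">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, "| length:", len(sequence))
```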
