# Biopython
![](images/biopython_logo_white.png)

## Overview
Biopython is a general-purpose package for bioinformatics tasks. It has a large number of submodules, including ones for working with:
* sequences
* aligners
* databases including NCBI
* 3D structures
* populations
* phylogenetics

It is not always the best choice (a targeted library for each narrow task is often preferable), but it is good enough for general use.

## Working with sequences in Biopython
Let's look at some of its functionality.

In [1]:
from Bio.Seq import Seq
# Note: Bio.Alphabet was removed in Biopython 1.78, so this notebook needs an older release
from Bio.Alphabet import generic_dna, generic_protein

In [10]:
# Create a sequence
dna = Seq('ATGCCCATATTT', generic_dna)
dna

Seq('ATGCCCATATTT', DNAAlphabet())

In [11]:
# Complement
dna.complement()

Seq('TACGGGTATAAA', DNAAlphabet())

In [12]:
# Reverse complement
dna.reverse_complement()

Seq('AAATATGGGCAT', DNAAlphabet())
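
To make explicit what `complement()` and `reverse_complement()` compute, here is a minimal pure-Python sketch. It is only an illustration and not how Biopython implements these methods: the helper names are hypothetical, and it assumes plain uppercase A/C/G/T without ambiguity codes.

```python
# Illustrative sketch only: handles plain uppercase A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Replace every base with its Watson-Crick partner."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("ATGCCCATATTT"))          # TACGGGTATAAA
print(reverse_complement("ATGCCCATATTT"))  # AAATATGGGCAT
```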

In [13]:
# Transcription
dna.transcribe()

Seq('AUGCCCAUAUUU', RNAAlphabet())

In [15]:
# Translation; accepts additional arguments for more flexible translation to protein
dna.translate()

Seq('MPIF', ExtendedIUPACProtein())

Translation and back transcription (simply a substitution of U with T) are also available for RNA sequences.
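
What these two operations do to an RNA string can be sketched in plain Python. This is a hedged illustration rather than Biopython's own code: the function names are hypothetical, only the standard genetic code is covered, and the input is assumed to be uppercase A/C/G/U.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code; amino acids listed in UCAG order for each codon position
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def back_transcribe(rna: str) -> str:
    """Back transcription is a plain U -> T substitution."""
    return rna.replace("U", "T")


def translate(rna: str) -> str:
    """Translate codon by codon; a trailing partial codon is ignored."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


rna = "AUGCCCAUAUUU"
print(back_transcribe(rna))  # ATGCCCATATTT
print(translate(rna))        # MPIF
```

Stop codons come out as `*` in this sketch.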

A couple of `Seq` attributes and transformations:

In [34]:
# Get associated alphabet
dna.alphabet

DNAAlphabet()

In [36]:
# Convert to string
str(dna)

'ATGCCCATATTT'

## SeqIO
Working with files

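* `SeqIO.parse(filepath, extension)` - reads records from `filepath`, where `extension` is the name of the file format (e.g. `"fasta"`), and returns an iterator over them
* `SeqIO.write(sequences, filepath, extension)` - writes `sequences` to `filepath` in the `extension` format and returns the number of records written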
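
The read-lazily and write-and-count pattern behind these two functions can be sketched without Biopython. The code below is a simplified stand-in, not the library's implementation: `read_fasta` and `write_fasta` are hypothetical names, records are plain `(header, sequence)` tuples, and the input is in-memory text instead of a real file.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one at a time from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO) -> int:
    """Write (header, sequence) pairs and return how many were written."""
    count = 0
    for header, sequence in records:
        handle.write(f">{header}\n{sequence}\n")
        count += 1
    return count


# Hypothetical in-memory data standing in for a file on disk
source = io.StringIO(">seq1 first example\nATGC\nATGC\n>seq2\nTTTT\n")
target = io.StringIO()
written = write_fasta(read_fasta(source), target)
print(written)  # 2
print(target.getvalue())
```

Because the reader is a generator, only one record is held in memory at a time, which is why the result of a parse call can be passed straight to a write call.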

In [37]:
from Bio import SeqIO

In [39]:
for record in SeqIO.parse("seqs.fasta", "fasta"):
    print(record)
    break

ID: Canis
Name: Canis
Description: Canis
Number of features: 0
Seq('ATGAATCATGATTTTCAAGCTCTTGCATTAGAATCTCGGGGAATGGGAGAGCTT...TAA', SingleLetterAlphabet())


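Some useful attributes of a record:
* `seq` - the sequence itself
* `description` - metadata: the description line
* `id` - metadata: the record identifier
* `name` - metadata: the record name
* `letter_annotations` - per-letter annotations, such as quality scores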

In [44]:
record = next(SeqIO.parse("seqs.fasta", "fasta"))

In our case all the metadata fields hold the same value.
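
The fields coincide here because the FASTA header is a single word. As far as I know, Biopython's FASTA parser uses the first whitespace-separated word of the header for both `id` and `name` and keeps the whole header line as `description`. A minimal sketch of that rule, where the helper name and the second header are hypothetical:

```python
def split_fasta_header(header: str) -> dict:
    """Split a FASTA header (without the leading '>') into record-like fields."""
    words = header.split(None, 1)
    identifier = words[0] if words else ""
    return {"id": identifier, "name": identifier, "description": header}


print(split_fasta_header("Canis"))
print(split_fasta_header("Canis lupus familiaris"))
```

With a one-word header all three values match; with a longer header only `description` keeps the full text.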

In [56]:
record.seq

Seq('ATGAATCATGATTTTCAAGCTCTTGCATTAGAATCTCGGGGAATGGGAGAGCTT...TAA', SingleLetterAlphabet())

In [49]:
record.description

'Canis'

In [54]:
record.name

'Canis'

Sequence writing

In [63]:
# Direct variant: pass the iterator returned by SeqIO.parse straight to SeqIO.write
SeqIO.write(SeqIO.parse('seqs.fasta', 'fasta'), 'seqs2.fasta', 'fasta')

13

In [66]:
# Write a subset of sequences from one file to another
# Take the first 3 records
records = list(SeqIO.parse('sample.fastq', 'fastq'))[:3]
records

[SeqRecord(seq=Seq('TTTCCGGGGCACATAATCTTCAGCCGGGCGC', SingleLetterAlphabet()), id='cluster_2:UMI_ATTCCG', name='cluster_2:UMI_ATTCCG', description='cluster_2:UMI_ATTCCG', dbxrefs=[]),
 SeqRecord(seq=Seq('TATCCTTGCAATACTCTCCGAACGGGAGAGC', SingleLetterAlphabet()), id='cluster_8:UMI_CTTTGA', name='cluster_8:UMI_CTTTGA', description='cluster_8:UMI_CTTTGA', dbxrefs=[]),
 SeqRecord(seq=Seq('GCAGTTTAAGATCATTTTATTGAAGAGCAAG', SingleLetterAlphabet()), id='cluster_12:UMI_GGTCAA', name='cluster_12:UMI_GGTCAA', description='cluster_12:UMI_GGTCAA', dbxrefs=[])]

In [67]:
SeqIO.write(records, 'just3sequences.fasta', 'fasta')

3
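
Note that `list(...)[:3]` reads every record into memory before slicing, which can be slow for large files. `itertools.islice` takes the first items lazily from any iterator, so the same call should work on the iterator returned by `SeqIO.parse`. The sketch below uses a hypothetical generator as a stand-in for parsed records:

```python
from itertools import islice


def numbered_records():
    """Hypothetical stand-in for an iterator of parsed records."""
    for i in range(1, 1_000_001):
        yield f"record_{i}"


# Only three items are ever produced; the rest of the stream is never touched
first_three = list(islice(numbered_records(), 3))
print(first_three)  # ['record_1', 'record_2', 'record_3']
```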

In [68]:
# Format conversion
seqs = list(SeqIO.parse('sample.fastq', 'fastq'))[:10]
SeqIO.write(seqs, 'subsample.fasta', 'fasta')

10

In this example we convert FASTQ (which carries more information) to FASTA. This is easy because we just need to drop the quality scores and comments.
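
What dropping the quality amounts to can be shown with a small pure-Python sketch. It is an illustration only, not Biopython's converter: the function name is hypothetical, and it assumes the simple layout where every FASTQ record occupies exactly four lines.

```python
import io
from itertools import islice


def fastq_to_fasta(fastq_handle, fasta_handle) -> int:
    """Copy header and sequence of each 4-line FASTQ record; return the record count."""
    count = 0
    while True:
        block = [line.rstrip("\n") for line in islice(fastq_handle, 4)]
        if len(block) < 4:
            break  # end of input (or an incomplete trailing record)
        header, sequence, _separator, _quality = block
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")  # drop '@', add '>'
        count += 1
    return count


# Hypothetical in-memory FASTQ data
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
fasta = io.StringIO()
converted = fastq_to_fasta(fastq, fasta)
print(converted)  # 2
print(fasta.getvalue())
```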

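For the reverse conversion you need to supply the quality scores yourself by setting the `letter_annotations` attribute on each record read from FASTA, since FASTA carries no quality information.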
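
Since the scores have to come from somewhere, a common workaround is a constant placeholder quality for every base. In Biopython that would mean storing a list of integers under `letter_annotations["phred_quality"]`. The pure-Python sketch below only shows what the resulting FASTQ text looks like: the function name and the default score of 40 are assumptions, and it uses the Sanger-style Phred+33 encoding.

```python
def to_fastq_record(header: str, sequence: str, phred: int = 40) -> str:
    """Format one FASTQ record, giving every base the same placeholder score."""
    if not 0 <= phred <= 93:
        raise ValueError("Sanger-style FASTQ encodes Phred scores 0..93")
    quality = chr(phred + 33) * len(sequence)  # Phred+33 ASCII encoding
    return f"@{header}\n{sequence}\n+\n{quality}\n"


print(to_fastq_record("read1", "ACGT"))
```

Placeholder scores make the file syntactically valid, but they say nothing about the real base-calling accuracy.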