# What I can do with Biopython
- The ability to parse bioinformatics files into data structures usable from Python. Supported formats include BLAST output, FASTA, GenBank, etc.
- Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface
- Code to deal with popular online bioinformatic destinations, such as NCBI and ExPASy
- Interfaces to common bioinformatics programs, such as Standalone Blast from NCBI, Clustalw alignment program and EMBOSS command line
- A standard sequence class that deals with sequences, ids on sequences, and sequence features
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate processes.
- GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc

In [1]:
import Bio
from Bio.Seq import Seq
from Bio.Seq import MutableSeq
print(Bio.__version__)

1.81


# Chapter 1: Sequence objects (the *Seq* object)

The main difference between *Seq* objects and standard Python strings is that they have different methods. A *Seq* object supports many of the same methods as a plain string, but its *translate()* method performs biological translation, and there are additional biologically relevant methods such as *reverse_complement()*.


## *Seq* objects act like strings

You can work with a *Seq* object much as you would with a Python string, for example getting its length or iterating over its elements.

## Using the *enumerate()* function

In [2]:
my_seq = Seq('CGATTAGC')

for index, letter in enumerate(my_seq):
    print('%i %s' %(index, letter))

0 C
1 G
2 A
3 T
4 T
5 A
6 G
7 C


## Accessing elements in the sequence by index

In [3]:
print(my_seq[1])
print(my_seq[-1])

G
C


## *.count()* method in *Seq* object

Just like a Python string, the *Seq* object has a *.count()* method, which returns a non-overlapping count and is **case sensitive**.

In [4]:
print(Seq("AAAAAAAA").count('AA'))
print('AAAAAA'.count('aa'))

4
0
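
Because *.count()* is non-overlapping, it can undercount motifs in repetitive sequence. The sketch below contrasts it with *count_overlap()*, assuming that method is available in your Biopython version, and also builds a GC percentage from two plain *.count()* calls. The sequences are made up for illustration.

```python
from Bio.Seq import Seq

poly_a = Seq("AAAAAAAA")
non_overlapping = poly_a.count("AA")       # AA|AA|AA|AA
overlapping = poly_a.count_overlap("AA")   # one hit per start position

# GC content as a percentage, built from two count() calls
dna = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
gc_percent = 100 * (dna.count("G") + dna.count("C")) / len(dna)

print(non_overlapping, overlapping, gc_percent)
```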


## Slicing a sequence

A *Seq* object can be sliced by index using the syntax *Seq[start:end:step]*. As with Python strings, the start index is included and the end index is excluded.

In [5]:
my_seq = Seq('AGCTCTGATCGATCGATGATCAGTACTAGCTAGTCTACGTAGCTAGCTAGCAT')
print(my_seq[4:12]) # bases at indices 4 through 11 (the end index is exclusive)
print(my_seq[0::3]) # starting from index 0, taking base from every 3 bases
print(my_seq[1::3]) # starting from index 1, taking base from every 3 bases
print(my_seq[2::3]) # starting from index 2, taking base from every 3 bases
print(my_seq[::-1]) # flip the sequence

CTGATCGA
ATGCTAAAAATTATCGAA
GCAGCTTGCGACCATCGT
CTTAGGCTTCGTGGATC
TACGATCGATCGATGCATCTGATCGATCATGACTAGTAGCTAGCTAGTCTCGA
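
A common use of slicing is to split a coding sequence into codons. This is a minimal sketch on a made-up sequence; the variable names are only illustrative.

```python
from Bio.Seq import Seq

coding = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# Step through the sequence three bases at a time, ignoring any trailing partial codon
usable_length = len(coding) - len(coding) % 3
codons = [str(coding[i:i + 3]) for i in range(0, usable_length, 3)]
print(codons)

# A stride slice collects the third base of every codon
third_positions = coding[2::3]
print(third_positions)
```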


## Turning *Seq* objects into strings

A *Seq* object can be converted into a plain string with *str()*. This is handy for writing a *Seq* object out as an **unwrapped** FASTA record, where the whole sequence sits on one line.

In [6]:
print(str(my_seq))
fasta_format_string = '>Name \n%s\n' % my_seq
print(fasta_format_string)

AGCTCTGATCGATCGATGATCAGTACTAGCTAGTCTACGTAGCTAGCTAGCAT
>Name 
AGCTCTGATCGATCGATGATCAGTACTAGCTAGTCTACGTAGCTAGCTAGCAT
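
If you want a line-wrapped record instead, one option is the standard library's *textwrap* module. This is only a sketch with an arbitrary width of 20; for real files *Bio.SeqIO* is the usual route.

```python
import textwrap
from Bio.Seq import Seq

my_seq = Seq("AGCTCTGATCGATCGATGATCAGTACTAGCTAGTCTACGTAGCTAGCTAGCAT")

# textwrap.fill needs a plain str, so convert the Seq first
wrapped = textwrap.fill(str(my_seq), width=20)
fasta_record = ">Name\n%s\n" % wrapped
print(fasta_record)
```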



## Concatenating or adding sequences

*Seq* objects can be concatenated with the *+* operator, in a *for* loop, or with the *.join()* method. Biopython does not check the sequence contents and ***will not*** raise an exception if you concatenate a protein sequence and a DNA sequence (which is likely to be a mistake).

In [7]:
Seq1 = Seq('ACGTAG')
Seq2 = Seq('AGCTAGC')
Seq3 = Seq('TGACTA')
Spacer = Seq('N'*10)
Protein_seq = Seq('EVRNAK')

print(Seq1+Seq2+Seq3) # Concatenating by adding

# Concatenating by for loop
SeqCluster  = [Seq1, Seq2, Seq3]
ConcatenatedSeq = Seq('')
for i in SeqCluster:
    ConcatenatedSeq += i
print(ConcatenatedSeq)

# Concatenating by .join() method
SpacerSeq = Spacer.join(SeqCluster)
print(SpacerSeq)

ACGTAGAGCTAGCTGACTA
ACGTAGAGCTAGCTGACTA
ACGTAGNNNNNNNNNNAGCTAGCNNNNNNNNNNTGACTA
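
The missing type check is easy to see in a short sketch. The sequences are made up, and mixing them like this is shown only to illustrate the behaviour.

```python
from Bio.Seq import Seq

dna = Seq("ACGTAG")
protein = Seq("EVRNAK")

# No exception: a Seq carries no molecule type that could be checked
mixed = dna + protein
print(mixed)

# A plain string can also be appended; the result is still a Seq
extended = dna + "TTT"
print(type(extended).__name__, extended)
```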


## Changing case

Python strings have very useful *upper()* and *lower()* methods for changing case, and *Seq* objects provide the same methods.

In [8]:
DNASeq = Seq('ACGTCGATCacgtacgat')
print(DNASeq.upper())
print(DNASeq.lower())

ACGTCGATCACGTACGAT
acgtcgatcacgtacgat
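
Changing case is mostly useful for case-insensitive matching. A small sketch with a made-up mixed-case sequence:

```python
from Bio.Seq import Seq

dna_seq = Seq("acgtACGT")

# Membership tests are case sensitive, like str
found_as_is = "GTAC" in dna_seq
# Normalising the case first makes the search case-insensitive
found_upper = "GTAC" in dna_seq.upper()
print(found_as_is, found_upper)
```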


## *complement()* and *reverse_complement()* methods in *Seq* object

For nucleotide sequences, you can easily obtain the complement or reverse complement of a *Seq* object with the built-in *complement()* and *reverse_complement()* methods. The result is biologically meaningless if the input is a protein sequence.

In [9]:
my_seq = Seq('AGCTAGCTAGACTAGTGATCAGCTAGCTA')
print(my_seq)
print(my_seq.complement())
print(my_seq.reverse_complement())

AGCTAGCTAGACTAGTGATCAGCTAGCTA
TCGATCGATCTGATCACTAGTCGATCGAT
TAGCTAGCTGATCACTAGTCTAGCTAGCT
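
One practical use is checking whether a site reads the same on both strands, as many restriction sites do. The helper name below is hypothetical.

```python
from Bio.Seq import Seq

def is_reverse_palindrome(seq):
    """Hypothetical helper: True if seq equals its own reverse complement."""
    return seq == seq.reverse_complement()

ecori_site = is_reverse_palindrome(Seq("GAATTC"))   # EcoRI recognition site
other_site = is_reverse_palindrome(Seq("GATTACA"))
print(ecori_site, other_site)
```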


## *transcribe()* and *translate()* method in *Seq* object(Transcription and Translation)

![image.png](attachment:image.png)

Biologically, transcription reads the template strand and produces an mRNA that is its reverse complement. In Biopython, and in bioinformatics generally, we usually work directly with the coding strand, because the mRNA can then be obtained simply by replacing T with U. Biopython provides the *transcribe()* (DNA to mRNA) and *back_transcribe()* (mRNA to DNA) methods for this.

In [10]:
CodingDNA = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(CodingDNA)
MessengerRNA = CodingDNA.transcribe()
print(MessengerRNA)
print(MessengerRNA.back_transcribe())

# To simulate what is really happening in cell
TemplateDNA = CodingDNA.reverse_complement()
MessengerRNA = TemplateDNA.reverse_complement().transcribe() # Enzyme working on Template DNA to get mRNA
print(MessengerRNA)

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG


An mRNA or coding DNA sequence can be translated into a protein sequence with the *translate()* method, which takes several optional arguments:
- *table* selects the translation table. Different organisms and organelles use different genetic codes (translation tables); the default is the standard genetic code from NCBI.
- *to_stop* stops translation at the first in-frame stop codon, and the stop symbol itself is not included in the result. The default value is False.
- *stop_symbol* replaces the default stop codon symbol '*' with a symbol of your choice.
- *cds* tells Biopython to translate an alternative start codon as methionine and to check that your sequence is a valid CDS (an exception is raised if not). The default value is False. Use this argument if you know that your sequence ***starts with a start codon, ends with a stop codon and has no internal in-frame stop codons.***

In [11]:
CodingDNA = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(CodingDNA.translate()) #using translation function
print(CodingDNA.translate(table='Vertebrate Mitochondrial')) # translate with the vertebrate mitochondrial genetic code, selected by name
print(CodingDNA.translate(table=2)) #using different genetic code to do translation by indexing
print(CodingDNA.translate(to_stop=True)) # stop the code when meeting end codon
print(CodingDNA.translate(stop_symbol='@')) # replacing * with @ as end codon

gene = Seq(
"GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
"GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
"AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
"TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
"AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA"
)

print(gene.translate(table='Bacterial'))
print(gene.translate(table='Bacterial',to_stop=True))
print(gene.translate(table="Bacterial", cds=True))


MAIVMGR*KGAR*
MAIVMGRWKGAR*
MAIVMGRWKGAR*
MAIVMGR
MAIVMGR@KGAR@
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR
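
The differences in the output above come from the codon tables themselves. The sketch below, assuming the *Bio.Data.CodonTable* module, looks up the TGA codon in two tables and shows the exception raised by *cds=True* when the sequence has an internal stop codon.

```python
from Bio.Seq import Seq
from Bio.Data import CodonTable

standard = CodonTable.unambiguous_dna_by_name["Standard"]
mito = CodonTable.unambiguous_dna_by_id[2]  # Vertebrate Mitochondrial

# TGA is a stop codon in the standard table but encodes W in table 2
tga_is_standard_stop = "TGA" in standard.stop_codons
tga_in_mito = mito.forward_table.get("TGA")
print(tga_is_standard_stop, tga_in_mito)

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
try:
    coding_dna.translate(cds=True)
    cds_ok = True
except CodonTable.TranslationError as err:
    # The in-frame TGA in the middle means this is not a valid CDS
    cds_ok = False
    print("Not a valid CDS:", err)
```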


## Comparing *Seq* objects

Deciding whether two sequences are equal is harder than it looks, because the same letters can mean different things: the DNA fragment 'ACG' is not the same molecule as the RNA or protein 'ACG'. Biopython does not track molecule types when comparing; sequence comparison looks only at the letters and behaves like a Python string comparison.

In [12]:
Seq1 = Seq('AGCT')
print(Seq1 == "AGCT")

True
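
A short sketch of what that means in practice. The variable names only record the intended molecule type; Biopython does not see it.

```python
from Bio.Seq import Seq

dna_fragment = Seq("ACG")      # intended as DNA
protein_fragment = Seq("ACG")  # intended as Ala-Cys-Gly

# Only the letters are compared, so these count as equal
same_letters = dna_fragment == protein_fragment
# The comparison is case sensitive, like str
same_ignoring_type_but_not_case = Seq("acg") == dna_fragment
print(same_letters, same_ignoring_type_but_not_case)
```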


## Sequence with undefined and partially defined sequence contents

In some cases, the length of a sequence is known but the actual letters are not. You can create such a *Seq* object by passing *None* as the data, followed by the sequence length. The object knows its length, but any attempt to read its contents raises an error.

In [13]:
UnknownSeq = Seq(None, 10)
# print(UnknownSeq) # Bio.Seq.UndefinedSequenceError: Sequence content is undefined
print(len(UnknownSeq))

10
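
Code that might receive an undefined sequence can catch the error named in the comment above. This sketch falls back to a run of N characters, which is an arbitrary choice made for illustration.

```python
from Bio.Seq import Seq, UndefinedSequenceError

unknown_seq = Seq(None, 10)
try:
    text = str(unknown_seq)
except UndefinedSequenceError:
    # Contents are undefined, so use a placeholder of the right length
    text = "N" * len(unknown_seq)
print(text)
```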


Sometimes the sequence contents are defined only for parts of the sequence and undefined elsewhere. For example, each sequence line in a MAF (Multiple Alignment Format) file gives the source name, the start position, the size of the aligned segment, the strand, the total length of the source sequence and the aligned letters, like s rn5.chr4    42326848 36 + 248343840 CTGAAAACCTAAGTAGGAGAGACAGTTAAAGATAAT. We only know a small part of the whole chromosome sequence.

So we can create a partially defined *Seq* object by passing a dictionary as the *data* argument, where each key is the start coordinate of a known sequence segment and each value is the corresponding sequence content.

In [14]:
seq = Seq({117512683: "TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT"}, length=159345973)

seq[1000:2000] #Seq(None, length=1000)
seq[117512690:117512700] #Seq('CCTGAATGTG')
seq[117512670:117512690] # Seq({13: 'TTGAAAA'}, length=20)
seq[117512700:] #Seq({0: 'AGAGTCAGTCAAGGATAGT'}, length=41833273)

Seq({0: 'AGAGTCAGTCAAGGATAGT'}, length=41833273)
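
The same behaviour on a toy scale, as a sketch with a hypothetical 20-letter sequence in which only eight letters are known:

```python
from Bio.Seq import Seq, UndefinedSequenceError

# Hypothetical sequence of length 20 with letters known from index 5 to 12
partial = Seq({5: "ACGTACGT"}, length=20)

known = partial[5:13]      # lies entirely inside the defined segment
print(len(partial), known)

try:
    str(partial[0:5])      # lies entirely in the undefined part
    readable = True
except UndefinedSequenceError:
    readable = False
print(readable)
```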

## *MutableSeq objects*

A *Seq* object is read-only (immutable), so you cannot modify it by assigning to an index. You can convert it into a *MutableSeq* object, which can be edited in place.

In [15]:
mutable_seq = MutableSeq(my_seq)
print(mutable_seq)
print(type(mutable_seq))
mutable_seq[5] = "C"
print(mutable_seq)
mutable_seq.remove("T")
print(mutable_seq)
mutable_seq.reverse()
print(mutable_seq)

AGCTAGCTAGACTAGTGATCAGCTAGCTA
<class 'Bio.Seq.MutableSeq'>
AGCTACCTAGACTAGTGATCAGCTAGCTA
AGCACCTAGACTAGTGATCAGCTAGCTA
ATCGATCGACTAGTGATCAGATCCACGA
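
A typical round trip is to edit a mutable copy and then convert it back to an immutable *Seq*. A minimal sketch with a made-up sequence:

```python
from Bio.Seq import Seq, MutableSeq

read_only = Seq("GCCATTGTAATG")
try:
    read_only[5] = "G"
    assigned = True
except TypeError:
    # Seq does not support item assignment
    assigned = False

# Edit a mutable copy, then convert back to an immutable Seq
editable = MutableSeq(read_only)
editable[5] = "G"
edited = Seq(editable)
print(assigned, read_only, edited)
```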


# Chapter 2:  Sequence annotation objects

This chapter introduces the Biopython classes that sit above the *Seq* class. *SeqRecord* and *SeqFeature* objects attach identifiers and features to a *Seq* object, and they are used throughout the sequence input/output interface *Bio.SeqIO*.

## *SeqRecord* class and its attributes

The *SeqRecord* class lets you associate identifiers and features with a sequence, and it is the basic data type of the *Bio.SeqIO* sequence input/output interface. The class itself is simple and has the following *attributes*:

- *.seq*: The sequence itself, typically a *Seq* object.
- *.id*: The primary ID used to identify the sequence, e.g. an accession number.
- *.name*: A "common" name/id for the sequence. In some cases it is the same as the accession number, but it could also be a clone name.
- *.description*: A human readable description or expressive name for the sequence.
- *.letter_annotations*: Holds per-letter annotations in a restricted dictionary of additional information about the letters in the sequence. The keys are the names of the information (quality scores, secondary structure, methylation pattern, etc.) and the values are Python lists, tuples, or strings with the same length as the sequence itself.
- *.annotations*: A dictionary of additional information about the sequence. The keys are the names of the information, and the information is contained in the values. This allows more "unstructured" information to be attached to the sequence.
- *.features*: A list of *SeqFeature* objects with more structured information about the features on a sequence (e.g. the position of a gene on a genome or of domains on a protein sequence).
- *.dbxrefs*: A list of database cross-references.

### Generating a *SeqRecord* object from scratch

To use *SeqRecord*, you ***start*** with a *Seq* object and then give the attributes values. Attributes that are not given values are set to placeholder strings indicating they are unknown. **If you want to output your SeqRecord to a file, it must have an identifier (.id).**

In [16]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqFeature
from Bio import SeqIO

In [17]:
SimpleSeq = Seq('ATCGGCAT')
SimpleSeqR = SeqRecord(SimpleSeq)

# Giving value to attributes of SeqRecord

SimpleSeqR.id = "AC12345"
SimpleSeqR.description = "Made-up Sequences"

print(SimpleSeqR.seq, SimpleSeqR.id,SimpleSeqR.description)

ATCGGCAT AC12345 Made-up Sequences
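
The attributes can also be passed as keyword arguments when the record is created. This sketch compares a bare record, which shows the placeholder strings, with a fully named one; the values are made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

bare = SeqRecord(Seq("ATCGGCAT"))
print(bare.id, bare.name, bare.description)  # placeholder strings

named = SeqRecord(
    Seq("ATCGGCAT"),
    id="AC12345",
    name="example",
    description="Made-up sequence",
)
print(named.id, named.name, named.description)
```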


*.annotations* can be used for any miscellaneous annotations that do not fit under one of the other, more specific attributes. It is used as follows:

In [18]:
SimpleSeqR.annotations['Evidence'] = 'I made that up.'
SimpleSeqR.annotations['Methylation'] = [1,2,3]

print(SimpleSeqR.annotations)
print(SimpleSeqR.annotations['Evidence'])

{'Evidence': 'I made that up.', 'Methylation': [1, 2, 3]}
I made that up.


Working with per-letter-annotations is similar, *letter_annotations* is a dictionary like attribute which will let you assign any Python sequence (i.e. a string, list or tuple) which has ***the same length*** as the sequence:

In [19]:
SimpleSeqR.letter_annotations['phred_quality'] = [40,30,38,30,40,38,40,30]
print(SimpleSeqR.letter_annotations)
print(SimpleSeqR.letter_annotations['phred_quality'])

{'phred_quality': [40, 30, 38, 30, 40, 38, 40, 30]}
[40, 30, 38, 30, 40, 38, 40, 30]
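
The length restriction is enforced when you assign a value. In the sketch below the rejected key name is hypothetical, and both *TypeError* and *ValueError* are caught because the exact exception type is an assumption here.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("ATCGGCAT"), id="AC12345")
record.letter_annotations["phred_quality"] = [40, 30, 38, 30, 40, 38, 40, 30]

try:
    # Only 3 values for an 8-letter sequence
    record.letter_annotations["too_short"] = [1, 2, 3]
    accepted = True
except (TypeError, ValueError):
    accepted = False
print(accepted, sorted(record.letter_annotations))
```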


### Generating *SeqRecord* objects from FASTA files

*Bio.SeqIO.read()* parses a file that contains exactly one record, such as a single-sequence FASTA (.fna) or GenBank (.gb) file, while *Bio.SeqIO.parse()* iterates over files with many records. The information in the file is stored in the attributes of the resulting *SeqRecord* object. This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI.

In [20]:
Record = SeqIO.read("NC_005816.fna", "fasta")
print(Record)
print(Record.seq)
print(Record.id)
print(Record.name)
print(Record.description)
print(Record.dbxrefs)
print(Record.annotations)
print(Record.letter_annotations)
print(Record.features)

ID: gi|45478711|ref|NC_005816.1|
Name: gi|45478711|ref|NC_005816.1|
Description: gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Number of features: 0
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATGAGTAGCCGGGCGATTGCCAGAGAACTGGGGATCTCCCGCAATACCGTTAAACGTTATTTGCAGGCAAAATCTGAGCCGCCAAAATATACGCCGCGACCTGCTGTTGCTTCACTCCTGGATGAATACCGGGATTATATTCGTCAACGCATCGCCGATGCTCATCCTTACAAAATCCCGGCAACGGTAATCGCTCGCGAGATCAGAGACCAGGGATATCGTGGCGGAATGACCATTCTCAGGGCATTCATTCGTTCTCTCTCGGTTCCTCAGGAGCAGGAGCCTGCCGTTCGGTTCGAAACTGAACCCGGACGACAGATGCAGGTTGACTGGGGCACTATGCGTAATGGTCGCTCACCGCTTCACGTGTTCGTTGCTGTTCTCGGATACAGCCGAATGCTGTACATCGAATTCACTGACAATATGCGTTATGACACGCTGGAGACCTGCCATCGTAATGCGTTCCGCTTCTTTGGTGGTGTGCCGCGCGAAGTGTTGTATGACAATATGAAAACTGTGGTTCTGCAACGTGACGCATATCAGACCGGTCAGCACCGGTTCCATCCTTCGCTGTGGCAGTTCGGCAA

The first word of the FASTA header line is used for both the *.id* and *.name* attributes. The whole header line (without the leading '>') is used as *.description*. All other attributes are left empty.
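
The same mapping can be checked without any file on disk. This sketch parses a hypothetical two-record FASTA text from an in-memory handle with *SeqIO.parse()*.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical two-record FASTA text standing in for a file on disk
fasta_text = ">seq1 first example\nACGTAC\n>seq2 second example\nGGCCTA\n"

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for rec in records:
    print(rec.id, "|", rec.name, "|", rec.description, "|", len(rec.seq))
```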

### Generating *SeqRecord* objects from GenBank files

Just as with FASTA files, we can read a GenBank file into a *SeqRecord* object. A GenBank file normally starts like this:

LOCUS NC_005816 9609 bp DNA circular BCT 21-JUL-2008
DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.
ACCESSION NC_005816
VERSION NC_005816.1 GI:45478711
PROJECT GenomeProject:10638
......

In [21]:
Record = SeqIO.read("NC_005816.gb","genbank")
print(Record)

ID: NC_005816.1
Name: NC_005816
Description: Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Database cross-references: Project:58037
Number of features: 41
/molecule_type=DNA
/topology=circular
/data_file_division=BCT
/date=21-JUL-2008
/accessions=['NC_005816']
/sequence_version=1
/gi=45478711
/keywords=['']
/source=Yersinia pestis biovar Microtus str. 91001
/organism=Yersinia pestis biovar Microtus str. 91001
/taxonomy=['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
/references=[Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The 

In [22]:
print(Record.id)
print(Record.name)
print(Record.description)
print(Record.annotations)
print(Record.letter_annotations)
print(Record.dbxrefs)
print(Record.features)
print(len(Record.features))

NC_005816.1
NC_005816
Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
{'molecule_type': 'DNA', 'topology': 'circular', 'data_file_division': 'BCT', 'date': '21-JUL-2008', 'accessions': ['NC_005816'], 'sequence_version': 1, 'gi': '45478711', 'keywords': [''], 'source': 'Yersinia pestis biovar Microtus str. 91001', 'organism': 'Yersinia pestis biovar Microtus str. 91001', 'taxonomy': ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia'], 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived f

## Feature, location and position objects

### *SeqFeature* object

A *SeqFeature* object attempts to encapsulate as much information as possible about **a region on a parent sequence**, typically a *SeqRecord* object. A *SeqFeature* is defined by a *location* object, typically the range between two positions. It has a number of attributes, which are listed below:

- *.type*: This is a textual description of the type of feature, like CDS and gene.
- *.location*: The location of the *SeqFeature* on the sequence that you are dealing with. It includes a number of shortcut attributes for properties of the location.
  - *.ref*: shorthand for *.location.ref*. Any reference sequence the location is referring to. Usually just None.
  - *.ref_db*: shorthand for *.location.ref_db*. Specifies the database any identifier in *.ref* refers to. Usually just None.
  - *.strand*: shorthand for *.location.strand*. The strand that the sequence that the feature is located on.
    - 1 for the top strand of a dsDNA.
    - -1 for the bottom strand of a dsDNA.
    - 0 for the strand is important but is unknown.
    - None for it does not matter, normally for proteins or ssDNA.
- *.qualifiers*: A Python dictionary of additional information about the feature. The key is a one-word description of the information and the value is the actual information. This mirrors the feature tables in GenBank/EMBL files.

### Position and location

Positions and locations describe where a region lies on a parent sequence. A location object is the range between two positions.

- Position: a single point on the sequence, which may be exact or fuzzy, e.g. 5, 20, >200 and <100 are all positions.
- Location: a region of the sequence bounded by two positions, e.g. 5..20 (from 5 to 20).
  
#### *SimpleLocation* object

**Unless you work with eukaryotic genes**, most *SeqFeature* locations are extremely simple - you just need **start and end** coordinates and a strand. That’s essentially all the basic *SimpleLocation* object does.

#### *CompoundLocation* object

If you have to deal with complex locations made up of multiple regions, *CompoundLocation* is the tool to use. It is normally used to handle 'join' locations in EMBL/GenBank files.
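
A minimal sketch of both location types on a made-up 12-base parent sequence, using the *extract()* method of the location objects:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SimpleLocation, CompoundLocation

parent = Seq("ATGCCCTAAGGG")

simple = SimpleLocation(0, 3, strand=1)
simple_seq = simple.extract(parent)
print(simple, len(simple), simple_seq)

# Two parts joined together, like join(1..3,7..9) in GenBank notation
compound = CompoundLocation(
    [SimpleLocation(0, 3, strand=1), SimpleLocation(6, 9, strand=1)]
)
compound_seq = compound.extract(parent)
print(compound, len(compound), compound_seq)
```

Note that GenBank coordinates are 1-based and inclusive, while Biopython locations use Python's 0-based, end-exclusive convention.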

#### Fuzzy Position:

 For example, a dinucleotide priming experiment might show that an mRNA transcript starts at one of two sites. Biopython's fuzzy positions can represent this kind of uncertainty. The following classes in *Bio.SeqFeature* describe positions with different degrees of certainty:

- *ExactPosition*: Represents a position which is specified as exact along the sequence.
- *BeforePosition*: Represents a fuzzy position that occurs prior to some specified site. In Genbank/EMBL notation, this is represented as something like '<13'.
- *AfterPosition*: Represents a fuzzy position that occurs after some specified site. In Genbank/EMBL notation, this is represented as something like '>13'.
- *WithinPosition*: Occasionally used for Genbank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In Genbank/EMBL notation, this would be represented as '(1.5)', meaning the position is somewhere within the range 1 to 5.
- *OneOfPosition*: Occasionally used for Genbank/EMBL locations, this class deals with a position where several possible values exist.
- *UnknownPosition*: This class deals with a position of unknown location. This is not used in Genbank/EMBL, but corresponds to the '?' feature coordinate used in UniProt.
 

In [23]:
start_position = SeqFeature.AfterPosition(5)
end_position = SeqFeature.BetweenPosition(9,left=8, right=9) #either left or right should be the position
my_location = SeqFeature.SimpleLocation(start_position,end_position)

print(my_location)
print(my_location.start)
print(my_location.end)

[>5:(8^9)]
>5
(8^9)
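
Fuzzy positions still behave like integers, so they can be used for slicing. A short sketch with arbitrary coordinates:

```python
from Bio.SeqFeature import BeforePosition, AfterPosition, ExactPosition, SimpleLocation

start = BeforePosition(5)
end = AfterPosition(20)
fuzzy_location = SimpleLocation(start, end)

# Printing keeps the fuzzy notation
print(fuzzy_location, start, end)
# Converting to int drops the fuzziness and gives plain coordinates
print(int(fuzzy_location.start), int(fuzzy_location.end))
exact_equals_int = ExactPosition(5) == 5
print(exact_equals_int)
```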


If you have a SNP of interest and want to know which features it falls within, you can loop over all the features and test each one with the *in* operator.

In [24]:
my_snp = 4350
Record = SeqIO.read('NC_005816.gb','genbank')
for feature in Record.features:
    if my_snp in feature:
        print("%s %s" % (feature.type, feature.qualifiers.get("db_xref")))

source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
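
The same lookup works on a record built by hand, which is useful when the GenBank file is not available. Everything in this toy record is hypothetical.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, SimpleLocation

# Hypothetical toy record with two annotated regions
toy = SeqRecord(Seq("ATGAAACCCGGGTTTTAG"), id="toy")
toy.features = [
    SeqFeature(SimpleLocation(0, 9, strand=1), type="gene", qualifiers={"gene": ["geneA"]}),
    SeqFeature(SimpleLocation(9, 18, strand=1), type="gene", qualifiers={"gene": ["geneB"]}),
]

snp_position = 10  # 0-based coordinate
hits = [f.qualifiers["gene"][0] for f in toy.features if snp_position in f]
print(hits)
```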


### Sequence described by a feature or location

A *SeqFeature* or location object does not directly contain a sequence. Instead, the location describes how to get the sequence from the parent sequence by [slicing](#slicing-a-sequence).

In [25]:
from Bio.SeqFeature import SeqFeature, SimpleLocation

seq = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
feature = SeqFeature(SimpleLocation(5,18, strand = -1), type = 'gene')

feature_seq = seq[feature.location.start:feature.location.end].reverse_complement()
print(feature_seq)

AGCCTTTGCCGTC


Compound features are messier than this simple example. The *.extract()* method handles them for you: it takes the parent sequence, pulls out every part of the feature and concatenates the result.

In [26]:
feature_seq = feature.extract(seq)
print(feature_seq)

AGCCTTTGCCGTC


## Comparison

Two *SeqRecord* objects cannot be compared directly; trying to do so raises a *NotImplementedError*. You can still compare two records by comparing the attributes you care about.

In [27]:
record1 = SeqRecord(Seq('ACGT'), id = 'test')
record2 = SeqRecord(Seq('ACGT'), id = 'test')

print(record1.seq == record2.seq)
print(record1.id == record2.id)

True
True
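
If you compare records often, the attribute checks can be wrapped in a small function. The helper below is hypothetical, and which attributes count as "equal" is your decision.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def records_match(a, b):
    """Hypothetical helper: compare only the attributes you care about."""
    return a.id == b.id and a.seq == b.seq

record1 = SeqRecord(Seq("ACGT"), id="test")
record2 = SeqRecord(Seq("ACGT"), id="test")

try:
    record1 == record2
    direct_comparison_ok = True
except NotImplementedError:
    direct_comparison_ok = False

matched = records_match(record1, record2)
print(direct_comparison_ok, matched)
```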


## References

Another common annotation related to a sequence is a reference to a journal article or other published work dealing with the sequence. The class *Bio.SeqFeature.Reference* is designed to store the relevant information about a reference as attributes of an object.

## The *.format* method for *SeqRecord* object

The *format()* method of the *SeqRecord* class returns a string containing your record formatted in one of the output file formats supported by *Bio.SeqIO*, such as FASTA:

In [28]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
Seq(
    "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
    "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
    "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
    "SSAC"
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
 )

print(record.format("fasta"))

>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
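
For comparison, *SeqIO.write()* produces the same text but sends it to a handle. This sketch uses an in-memory handle and a shortened, made-up record.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVL"),
    id="example_id",
    description="example protein",
)

# Write to an in-memory handle instead of a file on disk
handle = StringIO()
count = SeqIO.write(record, handle, "fasta")
same_text = handle.getvalue() == record.format("fasta")
print(count, same_text)
```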



## Slicing a *SeqRecord* object

You can slice a *SeqRecord*, to give you a new *SeqRecord* covering just part of the sequence. What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).

In [29]:
record = SeqIO.read('NC_005816.gb','genbank')
print(record.features[21])
print(record.features[20])


type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH']

type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: l

In [30]:
sub_record = record[4300:4800]
print(len(sub_record.features))
print(sub_record.features[0]) # locations are adjusted to fit the new parent sequence
print(sub_record.features[1])

2
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']

type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKK

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotations it is impossible to know whether they still apply to the sub-sequence or not. To avoid guessing, with the exception of the molecule type, the *.annotations* and *.dbxrefs* are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.
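
The effect of slicing can be checked on a small hand-made record. Every value in this toy record is hypothetical.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, SimpleLocation

toy = SeqRecord(Seq("AAAATGCCCTAAGG"), id="toy", annotations={"source": "made up"})
toy.dbxrefs = ["Project:0000"]
toy.letter_annotations["phred_quality"] = list(range(14))
toy.features = [SeqFeature(SimpleLocation(3, 12, strand=1), type="CDS")]

sub = toy[2:13]
print(sub.features[0].location)                  # shifted left by 2
print(sub.letter_annotations["phred_quality"])   # sliced along with the sequence
print("source" in sub.annotations, sub.dbxrefs)  # general annotation is not carried over
```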

## Adding *SeqRecord* objects

You can add *SeqRecord* objects together to get a new *SeqRecord* object. Note that any common per-letter annotations are also added, and all the features are preserved (with their locations adjusted). Common annotations such as the id, name and description are kept when they are the same in both records, but some annotations, such as the database cross references, are lost.
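
Slicing and adding together can shift the origin of a circular sequence. The sketch below does this on a hypothetical toy record; a feature that spans the cut point is lost by the slicing step.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, SimpleLocation

toy = SeqRecord(Seq("AAAATGCCCTAAGG"), id="toy")
toy.letter_annotations["phred_quality"] = list(range(14))
toy.features = [
    SeqFeature(SimpleLocation(3, 12, strand=1), type="CDS"),           # spans index 6
    SeqFeature(SimpleLocation(7, 10, strand=1), type="misc_feature"),  # right of index 6
]

# Treat the record as circular and move the origin to index 6
shifted = toy[6:] + toy[:6]
print(shifted.id, shifted.seq)
print(shifted.letter_annotations["phred_quality"])
print([(f.type, str(f.location)) for f in shifted.features])
```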

## Reverse-complementing *SeqRecord* objects

For the sequence, this uses the Seq object’s reverse complement method. Any features are transferred with the location and strand recalculated. Likewise any per-letter-annotation is also copied but reversed (which makes sense for typical examples like quality scores). However, transfer of most annotation is problematical.

The SeqRecord object’s reverse_complement method takes a number of optional arguments corresponding to properties of the record. Setting these arguments to True means copy the old values, while False means drop the old values and use the default value. You can alternatively provide the new desired value instead.

In [31]:
from Bio import SeqIO

record = SeqIO.read("NC_005816.gb", "genbank")

print("%s %i %i %i %i" %(record.id, len(record.seq), len(record.features), len(record.dbxrefs), 
                         len(record.annotations)))

rc = record.reverse_complement(id = 'TESTING')

print("%s %i %i %i %i" %(rc.id, len(rc.seq), len(rc.features), len(rc.dbxrefs), 
                         len(rc.annotations)))

NC_005816.1 9609 41 1 13
TESTING 9609 41 0 0
