# Sequence Record

A `SeqRecord` is essentially a sequence with database metadata attached to it. Depending on the source, this metadata can be surprisingly rich.

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record =SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKD",
 IUPAC.protein),
 id="YP_025292.1", name="HokC",
 description="toxic membrane protein, small")
print record

ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKD', IUPACProtein())


You get a `SeqRecord` when you:
- parse a file with sequence data in one of the standard formats
- obtain sequence data from an online resource (e.g. NCBI, UniProt)

# Entrez Utilities - Sequence queries

Online access to NCBI's data. The [Bio.Entrez](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html) module makes use of the Entrez Programming Utilities (also known as EUtils), described in detail [here at NCBI](http://www.ncbi.nlm.nih.gov/entrez/utils/). The reply usually comes as XML, which you parse with `Entrez.read()` or `Entrez.parse()`.

**Important:** The Entrez utilities impose usage limits, and you can get yourself blacklisted if you violate them. BioPython keeps an eye on the request rate for you, but you should also fill in a real email address. See the [Frequency section](http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Frequency_Timing_and_Registrati) in the manual.
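
To give a feeling for what respecting a request-rate limit involves, here is a minimal throttling sketch. It is not BioPython's implementation: the class name `Throttle` and the injectable `clock` and `sleep` arguments are hypothetical, and the default interval of a third of a second assumes NCBI's guideline of at most three requests per second for users without an API key.

```python
import time


class Throttle(object):
    """Hypothetical helper: keep calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0 / 3, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle()
for i in range(3):
    throttle.wait()  # a real script would send its Entrez request right after this
    print("request %d allowed" % i)
```

In normal use you do not need anything like this, since `Bio.Entrez` spaces out its own requests. It only matters if you talk to NCBI through some other channel.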

## esearch

`esearch` runs a search and returns the primary IDs of the hits (for use in EFetch, ELink, and ESummary) together with term translations; optionally, it retains the results on the server for later use. `efetch` can then be used to download the actual records.


In [None]:
from Bio import Entrez
Entrez.email = "my_little_self@utu.fi"
handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD")
print "Handle:", handle #A file-like object, you can read it
content=handle.read()
print "\n\nContent:", content

Handle: <addinfourl at 31691048 whose fp = <socket._fileobject object at 0x1e35450>>


Content: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart><IdList>
<Id>377580661</Id>
<Id>156535673</Id>
<Id>156535671</Id>
</IdList><TranslationSet><Translation>     <From>opuntia[ORGN]</From>     <To>"Opuntia"[Organism]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"Opuntia"[Organism]</Term>    <Field>Organism</Field>    <Count>1725</Count>    <Explode>Y</Explode>   </TermSet>   <TermSet>    <Term>accD[All Fields]</Term>    <Field>All Fields</Field>    <Count>161156</Count>    <Explode>N</Explode>   </TermSet>   <OP>AND</OP>  </TranslationStack><QueryTranslation>"Opuntia"[Organism] AND accD[All Fields]</QueryTranslation></eSearchResult>



That's XML all right. Let's parse it. The primary method to do this is `Entrez.read()`, but it expects a file-like object, and we have already consumed our handle: we read it to the end, so it cannot be used again. We could query again, or we can use `StringIO`, a module which simulates file objects on top of strings.

In [None]:
import StringIO #Useful whenever you have your data in a string but a function which expects a file
content_filelike=StringIO.StringIO(content)
equery_result=Entrez.read(content_filelike)
print "Parsed result of equery:", equery_result

So we have a dictionary. The most important entry here is the `IdList` key, which holds the IDs that match our query. Now we can fetch those.
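
To see without any network access both why a handle can only be read once and where the IDs sit in the reply, here is a small sketch using only the standard library. It is not how `Entrez.read()` works internally. The XML is a trimmed copy of the reply printed above, and `io.StringIO` stands in for the network handle.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the esearch reply shown above (IDs taken from that output).
ESEARCH_XML = u"""<eSearchResult><Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart>
<IdList>
<Id>377580661</Id>
<Id>156535673</Id>
<Id>156535671</Id>
</IdList></eSearchResult>"""

handle = io.StringIO(ESEARCH_XML)  # stands in for the network handle
root = ET.parse(handle).getroot()  # parsing reads the handle to the end
leftover = handle.read()           # nothing is left: the handle is consumed

count = int(root.findtext("Count"))
id_list = [id_node.text for id_node in root.findall("IdList/Id")]

print("Left in handle: %r" % leftover)
print("Count: %d" % count)
print("IdList: %s" % id_list)
```

For real work, stick to `Entrez.read()`: it also handles the parts of the reply that this sketch ignores.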

### read() vs. parse()

Throughout BioPython, the function `read()` parses output that contains a single record (of whatever kind) and returns that record; it may even raise an error if the output seems to contain more than one record. `parse()`, on the other hand, parses output that contains several records and returns a generator over them (i.e. something you can for-loop over once, or turn into a list with `list()`).
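
The convention can be sketched with a toy FASTA-like parser. The functions `parse_records` and `read_record` are hypothetical names made up for this illustration and are not BioPython functions. Raising `ValueError` for zero or several records mirrors what `Bio.SeqIO.read()` does, but other modules may behave differently.

```python
def parse_records(lines):
    """Hypothetical parse(): lazily yield (title, sequence) for each FASTA-like record."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def read_record(lines):
    """Hypothetical read(): return the single record, or raise if there is not exactly one."""
    records = parse_records(lines)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found")
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found")


two_records = [">seq1 first", "ACGT", "AC", ">seq2 second", "GGCC"]

records = parse_records(two_records)
print(list(records))   # all records, consumed from the generator
print(list(records))   # empty: a generator can be iterated only once

print(read_record(two_records[:3]))  # exactly one record: returned directly
try:
    read_record(two_records)
except ValueError as err:
    print("read_record failed: %s" % err)
```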

## efetch

Now that we have our IDs, we can fetch the records:

In [None]:
handle=Entrez.efetch("nucleotide",id=equery_result['IdList'],retmode="xml")
efetch_result=Entrez.parse(handle)
print "Efetch result", efetch_result #This gives a "generator". You may want to make this into a list, or loop over it
for r in efetch_result:
    print r

This is **not** what we wanted! What we want is a `SeqRecord` object. The reason: `Entrez.parse()` is a generic function for parsing Entrez XML, whereas we are receiving nucleotide database records, i.e. sequences. For these we want [Bio.SeqIO](http://biopython.org/DIST/docs/_api_159/Bio.SeqIO-module.html), which specializes in parsing the various sequence formats we can meet, in this particular case the *GenBank* format (`gb` is what we want).

In [None]:
import Bio.SeqIO as SeqIO
handle=Entrez.efetch("nucleotide",id=equery_result['IdList'],rettype="gb") #Note: this now says "gb" to get the right format
efetch_result=SeqIO.parse(handle,"gb") #Note: this used to read Entrez.parse(handle)
print "Efetch result", efetch_result #This gives a "generator". You may want to make this into a list, or loop over it
for r in efetch_result:
    print repr(r)
    print
    print

Much better. Now we have SeqRecords. Let us summarize everything into a nice piece of code.

In [None]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "my_little_self@utu.fi"

def fetch_records(query):
    search_result_handle=Entrez.esearch(db="nucleotide", retmax=10, term=query) #Query Entrez
    search_result=Entrez.read(search_result_handle) #Parse the XML data you get in reply
    efetch_result_handle=Entrez.efetch(db="nucleotide", id=search_result['IdList'],rettype="gb") #Now fetch the actual sequences in GenBank format
    seq_records=SeqIO.parse(efetch_result_handle,format="gb") #Now parse the obtained data using SeqIO
    return seq_records

for seq_rec in fetch_records("opuntia[ORGN] accD"):
    print repr(seq_rec)
    


## Search & download recap (1)

- `Entrez.esearch()` to get a reply with list of IDs
- `Entrez.efetch()` to grab the actual records
- `SeqIO.parse()` to obtain a sequence of SeqRecord objects

This is okay for simple, small queries but is strongly discouraged for large ones. NCBI expects you to use the *query history* functionality, which lets them bind your fetch to a previous search and use their internal caching mechanisms. Turn it on with `usehistory="y"` like so:

In [None]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "my_little_self@utu.fi"

search_result_handle=Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD", usehistory="y") #Query Entrez with history on
search_result=Entrez.read(search_result_handle) #Parse the XML data you get in reply
print "Search_result:", search_result


Note the `QueryKey` and `WebEnv` keys. This is the information which identifies your result set at NCBI's server and can be used to fetch the records like so:

In [None]:
efetch_result_handle=Entrez.efetch(db="nucleotide",
                                   query_key=search_result["QueryKey"],
                                   webenv=search_result["WebEnv"],
                                   rettype="gb") #Now fetch the actual sequences in GenBank format. Note how I don't give the IDs!
seq_records=SeqIO.parse(efetch_result_handle,format="gb") #Now parse the obtained data using SeqIO
for r in seq_records:
    print repr(r)
    print
    print
    

## Search & download recap (2)

- `Entrez.esearch(..., usehistory="y")` to get a reply and webenv + query_key
- `Entrez.efetch(..., webenv=, query_key=)` to grab the actual records from the previous search
- `SeqIO.parse()` to obtain a sequence of SeqRecord objects

Great! Now our problem is that if we have a large result set, thousands of IDs, Entrez doesn't let us grab them all at once with a single efetch. We need to get them bit-by-bit. Here's a complete example.

In [None]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "my_little_self@utu.fi"

def fetch_records(query, ret_max=10, fetch_batch=5):
    search_result_handle=Entrez.esearch(db="nucleotide", retmax=ret_max, term=query, usehistory="y") #Query Entrez
    search_result=Entrez.read(search_result_handle) #Parse the XML data I get in reply
    #The only things you care about are the number of hits and the WebEnv+QueryKey data
    total=min(int(search_result["Count"]),ret_max) #Never fetch more than ret_max records
    all_records=[] #This will be our result list
    #Divide into blocks of fetch_batch results, download one at a time
    for start_index in range(0,total,fetch_batch):
        efetch_result_handle=Entrez.efetch(db="nucleotide",
                                           retstart=start_index,
                                           retmax=min(fetch_batch,total-start_index), #The last block may be shorter
                                           webenv=search_result["WebEnv"],
                                           query_key=search_result["QueryKey"],
                                           rettype="gb") #Now fetch the actual sequences in GenBank format
        #add these records into our list
        all_records.extend(SeqIO.parse(efetch_result_handle,format="gb")) #Now parse the obtained data using SeqIO
    return all_records

fetched_records_p53=fetch_records("p53", ret_max=50, fetch_batch=10) #get max 50 records, fetch 10 at a time
print "Fetched", len(fetched_records_p53), "records"
print "First few:", fetched_records_p53[:5]


## Search & download - final version

- `Entrez.esearch(..., retmax=, usehistory="y")` to get a reply and webenv + query_key, searches for retmax records
- `for batch_idx in range(0, total_num_of_results, download_batch)` to chop the results into download_batch sized chunks
- `Entrez.efetch(retstart=, retmax=download_batch, webenv=, query_key=)` to grab one batch at a time
- `SeqIO.parse()` to obtain a sequence of SeqRecord objects for every batch
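
The batching arithmetic in the steps above can be checked offline. In this sketch, `batch_windows` and `fake_efetch` are hypothetical names; `fake_efetch` only returns placeholder strings and stands in for the real `Entrez.efetch()` plus `SeqIO.parse()` pair. Note how the last window is shortened so that the total never exceeds the requested number of records.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that cover `total` records exactly once."""
    for start in range(0, total, batch_size):
        # The last window is shortened so that it never reaches past `total`.
        yield start, min(batch_size, total - start)


def fake_efetch(retstart, retmax):
    """Hypothetical stand-in for Entrez.efetch(): returns placeholder record names."""
    return ["record_%d" % i for i in range(retstart, retstart + retmax)]


all_records = []
for retstart, retmax in batch_windows(total=12, batch_size=5):
    all_records.extend(fake_efetch(retstart, retmax))

print(list(batch_windows(12, 5)))
print("Fetched %d records" % len(all_records))
```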

Let's grab a protein and see what the record has:

In [None]:
tp53_rec=SeqIO.read(Entrez.efetch(db="protein",id=["NP_000537"],rettype="gb"),"gb")
print repr(tp53_rec)


A `SeqRecord` has the following attributes:

- `seq`: the sequence itself, a `Seq` object
- `id`: the ID used to identify the sequence, typically an accession number, a string
- `name`: a name for the sequence, a string
- `description`: a readable description or expressive name for the sequence, a string
- `letter_annotations`: a dictionary of additional information about the letters in the sequence
- `annotations`: a dictionary of additional information about the sequence
- `features`: a list of [SeqFeature](http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html) objects with information about the features on a sequence
- `dbxrefs`: a list of database cross-references as strings

In [None]:
dir(tp53_rec)
help(tp53_rec)

In [None]:
for key in tp53_rec.__dict__:
    if not key.startswith('_'):
        print key, str(tp53_rec.__getattribute__(key))[:50]+"..."
for key,val in tp53_rec.annotations.iteritems():
    print key, " "*15, str(val)[:50]+"..."

Note: the information you get depends on the format. Let's try FASTA and see that we get next to nothing compared to the GenBank format.

In [None]:
print Entrez.efetch(db="protein",id=["NP_000537"],rettype="fasta").read()
tp53_rec_fasta=SeqIO.read(Entrez.efetch(db="protein",id=["NP_000537"],rettype="fasta"),"fasta")
for key in tp53_rec_fasta.__dict__:
    if not key.startswith('_'):
        print key, str(tp53_rec_fasta.__getattribute__(key))[:50]+"..."
for key,val in tp53_rec_fasta.annotations.iteritems():
    print key, " "*15, str(val)[:50]+"..."

## Features

Features are annotated regions along the sequence. They are best seen in the [source](http://www.ncbi.nlm.nih.gov/nuccore/NM_001276697.1). The complete list of possible feature types is [here](http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#FeaturesB).

In [None]:
for f in tp53_rec.features:
    print f.type, f.location, f.qualifiers

In [None]:
#extract all CDS from a gene
tp53_rec_gene=SeqIO.read(Entrez.efetch(db="nucleotide",id=["NM_001276697.1"],rettype="gb"),"gb")
for f in tp53_rec_gene.features:
    if f.type=="CDS":
        print "CDS at", f.location
        extracted=f.extract(tp53_rec_gene)
        print repr(extracted)
        print extracted.seq
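
What `extract()` does for a simple location can be sketched with plain strings. BioPython stores feature locations the way Python slices work, 0-based with an exclusive end, whereas the GenBank file itself shows 1-based inclusive coordinates. The helper `extract_region` below is a hypothetical illustration, not BioPython code; it assumes unambiguous DNA letters and ignores joined (multi-part) locations.

```python
# Complement table for unambiguous DNA letters only (an assumption of this sketch).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def extract_region(sequence, start, end, strand=1):
    """Hypothetical helper: slice [start, end) and reverse-complement if strand is -1."""
    region = sequence[start:end]
    if strand == -1:
        region = "".join(COMPLEMENT[base] for base in reversed(region))
    return region


genome = "AAATGCCCTAA"
print(extract_region(genome, 2, 5))       # forward strand
print(extract_region(genome, 2, 5, -1))   # same span, read from the reverse strand
```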
    

# File I/O

`SeqIO.read()` and `.parse()` also work with a file name or an open file, and `SeqIO.write(sequences,filename,format)` writes to a file in a [number of formats](http://biopython.org/wiki/SeqIO#File_Formats).

In [None]:
print "records written:", SeqIO.write([tp53_rec_gene],"tp53.gb","gb")
tp53_rec_gene_fromfile=SeqIO.read("tp53.gb","gb")
print repr(tp53_rec_gene_fromfile)
print tp53_rec_gene_fromfile==tp53_rec_gene #Why false?
print tp53_rec_gene_fromfile.seq==tp53_rec_gene.seq #Why true?