# How to get gene sequences from a set of GenBank records

For example, I want to perform a multiple sequence alignment on a single gene (glycoprotein E) using sequences from a BioProject with multiple Zika virus GenBank records:

[BioProject PRJNA344504](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA344504) 
from the publication [Zika virus evolution and spread in the Americas](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5563848/#SD1)

GenBank records, however, are complete records containing all the sequence information. How can I extract only the information I need (in this case, all the glycoprotein E sequences) and copy it into a new file?

After all the records have been downloaded into one file, the script below reads that file and will:
1. parse through all the records
2. search each record for a feature and qualifier
3. return the sequence for that feature in FASTA format
4. save it to a file

from Bio import SeqIO

def genbank_qualifier_hunter(gb_file, feature_type, qualifier, qualifier_value):
    """Return FASTA text for every matching feature in a GenBank file."""

    fasta_entries = []

    # SeqIO.parse accepts a filename directly, so no explicit open() is needed
    for gb_record in SeqIO.parse(gb_file, 'genbank'):

        for feature in gb_record.features:
            if feature.type == feature_type:
                if qualifier in feature.qualifiers:
                    # Qualifier values are parsed as lists, e.g. ['envelope']
                    value = feature.qualifiers.get(qualifier)
                    if value == qualifier_value:
                        # Cut the feature's region out of the full record sequence
                        sequence = feature.location.extract(gb_record.seq)
                        header = '>' + gb_record.id + ' | ' + gb_record.description + ' ' + str(feature.location)
                        # str() keeps the result a plain string rather than a Seq object
                        fasta_entries.append(header + '\n' + str(sequence) + '\n\n')

    return ''.join(fasta_entries)

check_qualifiers = genbank_qualifier_hunter('PRJNA344504.gb', 'mat_peptide', 'product', ['envelope'])
print(check_qualifiers)

>KY785485.1 | Zika virus isolate Zika virus/H.sapiens-wt/BRA/2016/FC-DQ192D1-URI polyprotein gene, partial cds [608:2120](+)
NTCAGGTGCATAGGAGTCAGCAATAGGGACTTTGTGGAAGGTATGTCAGGTGGGACTTGGGTTGATRTTGTCTTGGAACATGGAGGTTGTGTCACCGTAATGGCACAGGACAAACCGACTGTCGACATAGAGCTGGTTACAACAACAGTCAGCAACATGGCGGAGGTAAGATCCTACTGCTATGAGGCATCAATATCAGACATGGCTTCKGACAGCCGCTGCCCAACACAAGGTGAAGCCTACCTTGACAAGCAATCAGACACTCAATATGTCTGCAAAAGAACGTTAGTGGACAGAGGCTGGGGAAATGGATGTGGACTTTTTGGCAAAGGGAGCCTGGTGACATGCGCTAAGTTTGCATGCTCCAAGAAAATGACCGGGAAGAGCATCCAGCCAGAGAATCTGGAGTACCGGATAATGCTGTCAGTTCATGGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

As expected, the code does not filter out sequences that contain N's. It only fetches the sequences whose feature matches the characteristics passed as parameters to the genbank_qualifier_hunter function.
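The function above returns the FASTA text as a string and the example only prints it, so step 4 (saving to a file) still has to be done separately. A minimal sketch of that step, assuming the returned string is passed in as `fasta_text`. The helper name `save_fasta` is hypothetical and not part of Biopython:

```python
from pathlib import Path


def save_fasta(fasta_text, out_path):
    """Write FASTA text to out_path and return the number of records written.

    save_fasta is a hypothetical helper, not part of Biopython.
    """
    Path(out_path).write_text(fasta_text)
    # Every FASTA record starts with a '>' header line
    return sum(1 for line in fasta_text.splitlines() if line.startswith('>'))
```

For example, `save_fasta(check_qualifiers, 'envelope.fasta')` would write the output shown above to `envelope.fasta` (a made-up filename) and return how many sequences it contains.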

Some improvements I'm working on:
1. A list of the accession numbers that were not included
2. An option to remove sequences that are not complete, i.e. that contain bases other than A, T, C, or G
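
One possible sketch of both improvements, working on the FASTA text that the function returns rather than on the GenBank file itself. The name `split_complete_records` and the `allowed` parameter are hypothetical, and the sketch assumes the accession number is the first token of each header, as in the output above. It only reports records dropped by the completeness filter. Records that lack the requested feature altogether would have to be tracked inside the parsing loop instead.

```python
def split_complete_records(fasta_text, allowed="ACGT"):
    """Separate complete FASTA records from incomplete ones.

    Returns (kept_fasta_text, skipped_accessions). A record counts as
    complete when its sequence contains only the characters in `allowed`.
    Hypothetical helper; not part of Biopython.
    """
    allowed = set(allowed)

    # Collect [header, sequence] pairs; sequences may span several lines
    records = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif line and records:
            records[-1][1] += line.upper()

    kept, skipped = [], []
    for header, sequence in records:
        if sequence and set(sequence) <= allowed:
            kept.append(">" + header + "\n" + sequence + "\n\n")
        else:
            # Assumption: the accession is the first token of the header
            skipped.append(header.split(" ", 1)[0])

    return "".join(kept), skipped
```

Called as `kept, skipped = split_complete_records(check_qualifiers)`, the KY785485.1 sequence shown above would end up in `skipped`, because it contains N's and other ambiguity codes such as R and K.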