You now have BLAST output (if not, see Notebook 1)… now what?

In Notebook 1, our BLAST search did not return a hit from any reference genome, so we chose a GenBank (gb) genome record (AE014135.4) to examine what function our sequence might have.

First, let's look once more at the BLAST output for the 4th hit (index 3):

In [2]:
from Bio import SearchIO

# The XML file holds a single query, so take the first QueryResult from the parser.
qresults = next(SearchIO.parse('blast_output.xml', 'blast-xml'))
print(qresults[3])  # the 4th hit (index 3)

#print(qresults[3][0])



Query: unknown_yakuba_sequence
  Hit: gi|667678241|gb|AE014135.4| (1348131)
       Drosophila melanogaster chromosome 4
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         0    2955.20    2419       [510:2900]        [436867:439261]
          1         0    2875.86    2246      [3286:5526]        [439380:441601]
          2         0    1462.02    1931      [8097:9964]        [442173:444019]
          3   6.4e-75     297.04     375      [5781:6141]        [379383:379755]
          4   1.1e-65     266.38     372      [5781:6127]        [449438:449790]
          5   2.4e-61     251.96     500      [5694:6138]        [816487:816980]
          6   4.3e-58     241.13     335      [5834:6138]        [195882:196212]
          7   1.8e-56     235.72     272      [5894:6140]      [112113
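
The three E-value 0 rows in the table above sit next to each other on the hit, so they can be collapsed into one interval. The following is a minimal sketch of that step: the rows are copied by hand from the table, and `merged_hit_span` is a hypothetical helper, not part of Biopython. With a live `qresults[3]` object the same tuples could presumably be built from each HSP's `evalue`, `hit_start` and `hit_end` attributes.

```python
# HSP rows copied from the table above: (E-value, hit start, hit end).
hsps = [
    (0.0, 436867, 439261),
    (0.0, 439380, 441601),
    (0.0, 442173, 444019),
    (6.4e-75, 379383, 379755),
    (1.1e-65, 449438, 449790),
]

def merged_hit_span(rows, max_evalue=0.0):
    """Return (start, stop) covering every HSP with E-value <= max_evalue."""
    kept = [(start, end) for evalue, start, end in rows if evalue <= max_evalue]
    if not kept:
        raise ValueError("no HSP passes the E-value cutoff")
    return min(s for s, _ in kept), max(e for _, e in kept)

start, stop = merged_hit_span(hsps)
# Format the interval the way the Gene database query expects it.
print(f"{start}:{stop}[CHRPOS]")
```

Raising `max_evalue` widens the interval to include the weaker HSPs as well.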

The "Hit range" column gives us chromosome positions to examine. For this tutorial we will focus on the three HSPs with E-value = 0, which together span hit positions 436867 to 444019, and see whether any genes correspond to that region:

In [3]:
!esearch -db gene -query "AE014135.4[Nucleotide Accession] AND 436867:444019[CHRPOS]" -sort Chromosome | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Id Name -block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
> genes.txt

# Print a header, then each tab-separated row that xtract wrote to genes.txt.
print('Id   Gene Name   ChrAccVer   ChrStart ChrStop')
with open('genes.txt') as infile:
    for line in infile:
        print(line.rstrip('\n'))

Id   Gene Name   ChrAccVer   ChrStart ChrStop
43791	lgs	NC_004353.4	443910	436956
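
Note that ChrStart (443910) is larger than ChrStop (436956) in this row. A small sketch for turning such a row into something easier to work with is given below. `parse_gene_row` is a hypothetical helper, and reading "start greater than stop" as a minus-strand gene is an assumption about how the Gene document summary reports coordinates.

```python
def parse_gene_row(line):
    """Split one tab-separated row of genes.txt into a small dict."""
    gene_id, name, acc, chr_start, chr_stop = line.rstrip("\n").split("\t")
    start, stop = int(chr_start), int(chr_stop)
    return {
        "id": gene_id,
        "name": name,
        "accession": acc,
        "low": min(start, stop),
        "high": max(start, stop),
        # Assumption: ChrStart > ChrStop marks a gene on the minus strand.
        "strand": "-" if start > stop else "+",
    }

# The row printed above, written out with explicit tab characters.
row = parse_gene_row("43791\tlgs\tNC_004353.4\t443910\t436956\n")
print(row["name"], row["strand"], row["low"], row["high"])
```

Storing `low` and `high` separately from the strand keeps later interval comparisons independent of gene orientation.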



This region of the accession contains only one gene, lgs (legless). Using the gene ID from the first column, we can now use EDirect to link to its protein:

In [4]:
!cat genes.txt | cut -f 1 | xargs -n 1 sh -c 'elink -db gene -target protein -id "$0" | \
efilter -source refseq | efetch -format fasta' > protein.fasta

!cat protein.fasta

>NP_651922.1 legless [Drosophila melanogaster]
MLSTTMPRSPTQQQPQPNSDASSTSASGSNPGAAIGNGDSAASRSSPKTLNSEPFSTLSPDQIKLTPEEG
TEKSGLSTSDKAATGGAPGSGNNLPEGQTMLRQNSTSTINSCLVASPQNSSEHSNSSNVSATVGLTQMVD
CDEQSKKNKCSVKDEEAEISSNKAKGQAAGGGCETGSTSSLTVKEEPTDVLGSLVNMKKEERENHSPTMS
PVGFGSIGNAQDNSATPVKIERISNDSTTEKKGSSLTMNNDEMSMEGCNQLNPDFINESLNNPAISSILV
SGVGPIPGIGVGAGTGNLLTANANGISSGSSNCLDYMQQQNHIFVFSTQLANKGAESVLSGQFQTIIAYH
CTQPATKSFLEDFFMKNPLKINKLQRHNSVGMPWIGMGQVGLTPPNPVAKITQQQPHTKTVGLLKPQFNQ
HENSKRSTVSAPSNSFVDQSDPMGNETELMCWEGGSSNTSRSGQNSRNHVDSISTSSESQAIKILEAAGV
DLGQVTKGSDPGLTTENNIVSLQGVKVPDENLTPQQRQHREEQLAKIKKMNQFLFPENENSVGANVSSQI
TKIPGDLMMGMSGGGGGSIINPTMRQLHMPGNAKSELLSATSSGLSEDVMHPGDVISDMGAVIGCNNNQK
TSVQCGSGVGVVTGTTAAGVNVNMHCSSSGAPNGNMMGSSTDMLASFGNTSCNVIGTAPDMSKEVLNQDS
RTHSHQGGVAQMEWSKIQHQFFEERLKGGKPRQVTGTVVPQQQTPSGSGGNSLNNQVRPLQGPPPPYHSI
QRSASVPIATQSPNPSSPNNLSLPSPRTTAAVMGLPTNSPSMDGTGSLSGSVPQANTSTVQAGTTTVLSA
NKNCFQADTPSPSNQNRSRNTGSSSVLTHNLSSNPSTPLSHLSPKEFESFGQSSAGDNMKSRRPSPQGQR
SPVNSLIEANKDVRFA
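
To work with the FASTA file from Python rather than just printing it, a reader along these lines may be enough. This is a sketch using only the standard library: `read_fasta` is a hypothetical helper, and the two records in `toy` are made-up stand-ins so the example runs without `protein.fasta`. Biopython's `SeqIO.parse` would be the more usual choice in a real notebook.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Toy input (made-up records, not real proteins) standing in for protein.fasta.
toy = """>toy_protein_1 made-up example
MLSTTMPRSP
TQQQ
>toy_protein_2 made-up example
MKLRC
""".splitlines()

for header, seq in read_fasta(toy):
    # Print the identifier (first word of the header) and the sequence length.
    print(header.split()[0], len(seq))
```

For the real file, passing an open handle works the same way, for example `with open('protein.fasta') as handle: records = list(read_fasta(handle))`.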

In [5]:
!cat genes.txt | cut -f 1 | xargs -n 1 sh -c 'elink -db gene -target protein -id "$0" | \
efilter -source refseq | efetch -format uid'



21356901


In [None]:
We can view the results on NCBI conserved 

In [4]:
# Widen the window to 379383:449790 so that it also covers HSPs 3 and 4 from the BLAST table.
!esearch -db gene -query "AE014135.4[Nucleotide Accession] AND 379383:449790[CHRPOS]" -sort Chromosome | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Id Name -block GenomicInfoType -element ChrAccVer ChrStart ChrStop \
> longerpos-genes.txt

# Print a header, then each tab-separated row that xtract wrote to longerpos-genes.txt.
print('Id   Gene Name   ChrAccVer   ChrStart ChrStop')
with open('longerpos-genes.txt') as infile:
    for line in infile:
        print(line.rstrip('\n'))

Id   Gene Name   ChrAccVer   ChrStart ChrStop
43789	dati	NC_004353.4	395604	375639
43791	lgs	NC_004353.4	443910	436956
43792	CaMKI	NC_004353.4	445503	453946
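
With three genes in the wider window, it helps to check which HSP falls in which gene. The sketch below does this with a plain interval-overlap test on values copied from the two tables. `overlapping_genes` is a hypothetical helper, and the comparison treats BLAST hit positions and Gene coordinates as directly comparable, ignoring any off-by-one difference between the two conventions.

```python
# Gene intervals from the table above, stored as (low, high) regardless of strand.
genes = {
    "dati": (375639, 395604),
    "lgs": (436956, 443910),
    "CaMKI": (445503, 453946),
}

# Hit ranges of HSPs 0-4 from the BLAST table for hit AE014135.4.
hsp_ranges = [
    (436867, 439261),
    (439380, 441601),
    (442173, 444019),
    (379383, 379755),
    (449438, 449790),
]

def overlapping_genes(start, end, gene_table):
    """Names of genes whose interval shares at least one position with [start, end]."""
    return [name for name, (low, high) in gene_table.items()
            if start <= high and low <= end]

for index, (start, end) in enumerate(hsp_ranges):
    print(index, overlapping_genes(start, end, genes))
```

On these numbers the three strongest HSPs map to lgs, while HSP 3 lands in dati and HSP 4 in CaMKI.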



In [8]:
!cat longerpos-genes.txt | cut -f 1 | xargs -n 1 sh -c 'elink -db gene -target protein -id "$0" | \
efilter -source refseq | efetch -format fasta' > more-protein.fasta

!cat more-protein.fasta

#!cat longerpos-genes.txt | cut -f 1 | xargs -n 1 sh -c 'elink -db gene -target protein -id "$0" | \
#efilter -source refseq | efetch -format acc'

>NP_001245422.1 datilografo, isoform D [Drosophila melanogaster]
MTLDCEKEHDLQLSRSSSAAAISERTLEECWSTLQRLFMHKSAMQQIQQQIPRVGLGTHGVTGSANLGGS
ITPSSDTKPHQCQQCMKSFSSNHQLVQHIRVHTGEKPYKCSYCDRRFKQLSHVQQHTRLHTGERPYKCHL
PDCGRAFIQLSNLQQHLRNHDAQVERAKNRPFHCNICGKGFATESSLRTHTSKELQLHLGVLQQHAALIG
GPNATSCPVCHKLFLGTEALVDHMKHVHKEKSPPPGGSASSQFSELNQIVTGNGNGTGSSNEQIATQCTT
ESNSHQATVGSSLIDSFLGKRRTANHPCPVCGKHYVNEGSLRKHLACHAENSQLTNSLRMWPCSVCQAVF
THENGLLTHMESMRMDPKHQFAAQYVLSRAAAEQRERESLLAVTLAASSGASTRIGIADAGNVLPTGAHN
SDGSNSKCPSPSANSECSSNGRLSSSTTSDQDQDIDHGLSENENSNQNNIGSSTNNNNNCTSNNNASSHK
MAELRLPGTGQYTMDAELHVANRMSLMAAAAAAVAASRPQDGVDTSAVPSAAVQAAVVNLAAAMRMNNSS
NGATPYQQHHADHTQAHQTHILQHAHPHHHQQQQAPQHQQQQLSHLLVHTHPSNSNSRSQSSPNINVPSL
QNESTAASIAMNMNVHMMRGSLDPDPSLGGMHGIEVLHQHQQQHHHHHHHHTSNYPQHAVTPTNQHTHPH
PQTHQTAHHHTSPETALRMHQAEAILRSHTEAAFRLATGSGPDTGVKCEADQNQLNSNNAGNAGGNSGGN
SQQHRFHASSNHQENQRPSDS
>NP_001245421.1 datilografo, isoform C [Drosophila melanogaster]
MKLRCFNQAQTSHSWHPRSVFRVVSSCPPDRKKRPSPASWQICIGHLLTNYLG