## Running a BLAST search with the Covid-19 proteins

Taken from https://github.com/lanadominkovic/12-days-of-biopython

We will use the longest protein sequences from the Covid-19 genome, which were saved in the previous notebook.

In [None]:
# Install Biopython
!pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


In [None]:
# Read the sequences
from Bio import SeqIO
protein_seqs = list(SeqIO.parse("protein_seq.fasta", "fasta"))

# Print the ID, a truncated view of the sequence, and the length of each record
for protein in protein_seqs:
    print(protein.id)
    print(repr(protein.seq))
    print(len(protein))

covid_protein_1
Seq('QQMFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINLIIKNLSKSLTENKYSQL...EID')
63
covid_protein_2
Seq('AQADEYELMYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNI...LLV')
83
covid_protein_3
Seq('TNMKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNSPFHPLADN...KTE')
123
covid_protein_4
Seq('ASAQRSQITLHINELMDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQA...VPL')
290
covid_protein_5
Seq('CTIVFKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQ...VNN')
2701


Here we use NCBI BLAST. Instead of parsing the BLAST result with the NCBIXML module, we use SearchIO, which is newer than the older Bio.Blast module and provides a more general framework that also handles the output of other sequence search tools.

In [None]:
protein_seqs[4].seq

Seq('CTIVFKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQ...VNN')
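
The cell above picks the query by a hard-coded index (`4`), which happens to be the longest sequence in the listing printed earlier. As a minimal sketch, the same choice could be made by length instead. The helper name `index_of_longest` and the placeholder strings are hypothetical; the lengths are copied from the output above.

```python
def index_of_longest(sequences):
    """Return the position of the longest item; works for anything that supports len()."""
    return max(range(len(sequences)), key=lambda i: len(sequences[i]))

# Placeholder strings standing in for the five records (lengths taken from the output above)
lengths = [63, 83, 123, 290, 2701]
placeholder_seqs = ["X" * n for n in lengths]

longest = index_of_longest(placeholder_seqs)
print(longest)
```

Because the notebook already calls `len(protein)` on each record, the same helper should also accept `protein_seqs` directly.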

In [None]:
# BLAST search (sends the query to the NCBI servers, so it can take a while)
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastp", "pdb", protein_seqs[4].seq) # pdb - protein data bank
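
A remote search is slow, and a file-like result handle can only be read once. One common precaution is to write the result to disk before parsing it. The sketch below shows the idea with a stand-in handle rather than a real BLAST response; the helper `save_result`, the placeholder XML text, and the output file name are all hypothetical.

```python
import io
import tempfile
from pathlib import Path

def save_result(handle, path):
    """Copy the full contents of a one-shot, file-like handle to `path`."""
    text = handle.read()
    Path(path).write_text(text, encoding="utf-8")
    return path

# Stand-in for the handle returned by the search (not a real BLAST response)
fake_handle = io.StringIO("<BlastOutput>placeholder</BlastOutput>")
out_path = Path(tempfile.gettempdir()) / "blast_result_example.xml"
save_result(fake_handle, out_path)

# The handle is now exhausted, but the file can be reopened as often as needed
remaining = fake_handle.read()
saved_text = out_path.read_text(encoding="utf-8")
print(repr(remaining))
print(saved_text)
```

With a real result, the saved file could then be passed to `SearchIO.read` in place of the handle, since it accepts a file name as well.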

In [None]:
# Parse the XML data
# SearchIO.read expects exactly one query and returns a single QueryResult
from Bio import SearchIO
blast_records = SearchIO.read(result_handle, 'blast-xml')

In [None]:
print(blast_records)

Program: blastp (2.16.0+)
  Query: unnamed (2701)
         protein product
 Target: pdb
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  pdb|7D4F|A  Chain A, RNA-directed RNA polymerase [Sever...
            1      1  pdb|6YYT|A  Chain A, nsp12 [Severe acute respiratory sy...
            2      1  pdb|6XEZ|A  Chain A, RNA-directed RNA polymerase [Sever...
            3      1  pdb|7BW4|A  Chain A, RNA-directed RNA polymerase [Sever...
            4      1  pdb|6XQB|A  Chain A, RNA-directed RNA polymerase [Sever...
            5      1  pdb|7BV1|A  Chain A, RNA-directed RNA polymerase [Sever...
            6      1  pdb|7C2K|A  Chain A, RNA-directed RNA polymerase [Sever...
            7      1  pdb|6M71|A  Chain A, RNA-directed RNA polymerase [Sever...
            8      1  pdb|7ED5|A  Chain A, RNA-directed RNA pol
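
The hit IDs in this listing follow a `database|accession|chain` pattern, for example `pdb|7D4F|A`. As a small sketch, they can be split into their parts with plain string handling. The helper name `split_hit_id` is hypothetical, and it assumes every ID has exactly three `|`-separated fields, as the ones shown here do.

```python
def split_hit_id(hit_id):
    """Split a 'database|accession|chain' identifier into its three parts."""
    parts = hit_id.split("|")
    if len(parts) != 3:
        raise ValueError(f"unexpected hit ID format: {hit_id!r}")
    database, accession, chain = parts
    return database, accession, chain

# IDs copied from the listing above
for hit_id in ["pdb|7D4F|A", "pdb|6YYT|A", "pdb|6XEZ|A"]:
    database, accession, chain = split_hit_id(hit_id)
    print(database, accession, chain)
```

The accession (such as `7D4F`) is the part to use when looking up the structure in the Protein Data Bank.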

In [None]:
# Alignment details
# Iterating over a QueryResult yields one Hit per matching database sequence
for hit in blast_records:
    hsp = hit[0]  # first HSP (high-scoring pair) of this hit
    print(f"Query: {hit.query_id}")
    print(f"Sequence ID: {hit.id}")
    print(f"description: {hit.description}")
    print(f"E value: {hsp.evalue}")
    print(f"Bit Score:  {hsp.bitscore}")
    print(f"alignment:\n{hsp.aln}")
    print()

Query: unnamed
Sequence ID: pdb|7D4F|A
description: Chain A, RNA-directed RNA polymerase [Severe acute respiratory syndrome coronavirus 2]
E value: 0.0
Bit Score:  1938.7
alignment:
Alignment with 2 rows and 926 columns
FKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKT...LQA unnamed
LNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKT...LQG pdb|7D4F|A

Query: unnamed
Sequence ID: pdb|6YYT|A
description: Chain A, nsp12 [Severe acute respiratory syndrome coronavirus 2]
E value: 0.0
Bit Score:  1938.31
alignment:
Alignment with 2 rows and 925 columns
FKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKT...VLQ unnamed
LNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKT...VLQ pdb|6YYT|A

Query: unnamed
Sequence ID: pdb|6XEZ|A
description: Chain A, RNA-directed RNA polymerase [Severe acute respiratory syndrome coronavirus 2]
E value: 0.0
Bit Score:  1937.92
alignment:
Alignment with 2 rows and 925 columns
FKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKT...VLQ unnamed
LNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKT...VLQ pdb|6X
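
An E value of 0.0 next to a bit score near 1938 may look odd at first. A rough way to see why is the standard relation between the two, E = m * n * 2^(-S'), where m is the query length, n is the database length, and S' is the bit score. The sketch below evaluates it in log space. It is an approximation only: BLAST uses effective lengths rather than raw ones, and the database size `db_len` is a made-up placeholder, not the real size of the pdb database.

```python
import math

def log10_evalue(bit_score, query_len, db_len):
    """Approximate log10(E) from E = m * n * 2**(-S'), computed in log space."""
    return math.log10(query_len) + math.log10(db_len) - bit_score * math.log10(2)

bit_score = 1938.7      # bit score of the first hit shown above
query_len = 2701        # length of the query protein
db_len = 100_000_000    # hypothetical database size in residues (assumption)

log10_e = log10_evalue(bit_score, query_len, db_len)
print(f"log10(E) is roughly {log10_e:.1f}")
```

Under these assumptions E is on the order of 10^-572, far below the smallest positive value a 64-bit float can hold (about 5e-324), so a reported value of 0.0 is expected.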