# Homologue Identification

Identifying homologous sequences is particularly useful for inferring functional information and phylogeny, among other things. The tool of choice for this procedure is BLAST.
Several ways of running this tool are available, including two web servers (NCBI and UniProt) as well as local command-line and programmatic tools (Biopython).
Because the web versions are the most complete in terms of functionality, we will explore both servers, also to understand how their results differ as a consequence of the databases behind them.
Another decision was to use protein sequences, ??? explain better ???
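
As a rough illustration of the local command-line route mentioned above, the sketch below assembles a BLAST+ `blastp` call that writes XML output (`-outfmt 5`), the format read later with `NCBIXML`. The file names, the database name and the helper `build_blastp_command` are hypothetical, and the sketch assumes the BLAST+ suite is installed and on the `PATH`.

```python
import subprocess


def build_blastp_command(query_fasta, database, out_xml, evalue=1e-10):
    """Assemble the argument list for a local BLAST+ protein search."""
    return [
        "blastp",
        "-query", query_fasta,   # FASTA file with the query protein
        "-db", database,         # name of a locally formatted protein database
        "-evalue", str(evalue),  # E-value cut-off
        "-outfmt", "5",          # 5 = XML output
        "-out", out_xml,
    ]


# Hypothetical file and database names, for illustration only.
command = build_blastp_command("ORF7a.fasta", "my_protein_db", "ORF7a_blast.xml")
print(" ".join(command))

# To actually run the search (requires BLAST+ and the database):
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.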

In [9]:
import os
from Bio.Blast import NCBIXML

# Project root: two levels above the notebook's working directory (%pwd is an IPython magic)
a = %pwd
wd = (a.rsplit('/',2))[0]
seq_id = "MN908947.3"
prot_seq_id = "P0DTC7"
gene = "ORF7a"
file_blast_ncbi = "008SHWMV01R-Alignment.xml"
file_blast_uniprot = "B20210130A94466D2655679D1FD8953E075198DA8030B5FR.fasta"

## NCBI Results

The NCBI tool can return results that exclude sequences from the same organism as the query, which is particularly useful with databases such as nr/nt because of their large number of redundant sequences.
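
For comparison, the same kind of exclusion can be sketched programmatically. The snippet below assumes Biopython's `NCBIWWW.qblast`, whose `entrez_query` argument restricts the search with an Entrez expression. The helper names are hypothetical, and 2697049 is the NCBI taxonomy ID of SARS-CoV-2. It is a sketch only, not the procedure behind the results in this notebook, which were obtained from the web interface.

```python
def exclude_organism_query(taxid):
    """Build an Entrez expression that filters out one taxon."""
    return "all[filter] NOT txid{0}[ORGN]".format(taxid)


def remote_blastp(sequence, taxid_to_exclude, database="nr"):
    """Submit a protein BLAST to NCBI, excluding hits from one organism.

    Needs Biopython and network access; returns a handle with XML results.
    """
    # Imported here so the query helper above works without Biopython installed.
    from Bio.Blast import NCBIWWW

    return NCBIWWW.qblast(
        "blastp",
        database,
        sequence,
        entrez_query=exclude_organism_query(taxid_to_exclude),
    )


SARS_COV_2_TAXID = 2697049
print(exclude_organism_query(SARS_COV_2_TAXID))
```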

In [10]:
# Read the single BLAST record; the context manager closes the file afterwards
with open(os.path.join(wd, "data/homologue", gene, file_blast_ncbi)) as result_handle:
    blast_ncbi = NCBIXML.read(result_handle)

print('Loaded {0} sequences.'.format(len(blast_ncbi.alignments)))

Loaded 44 sequences.


### Summary

In [11]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
import re

query_len = blast_ncbi.query_length
homologue = []
sources = {}

print("Accession", "Identity%")
for aln in blast_ncbi.alignments:
    for HSP in aln.hsps:
        if HSP.score > 80 and HSP.expect < 1.0e-10 and HSP.identities/HSP.align_length > 0.80:
            print(aln.accession,HSP.identities/HSP.align_length)
            # Organism name is the first bracketed text in the hit definition
            species = re.findall(r'\[([^\]]*)\]', aln.hit_def)[0]
            if species not in sources:
                sources[species] = 1
            else:
                sources[species] +=1
            homologue.append(SeqRecord(Seq(HSP.sbjct), aln.title.split( ">", 1)[0], "", ""))


print("Total:",len(homologue))

Accession Identity%
QJS57332 1.0
QHR63305 0.9752066115702479
QIG55950 0.9752066115702479
QIA48619 0.8842975206611571
QIQ54053 0.8842975206611571
AVP78047 0.8842975206611571
AVP78036 0.8760330578512396
Q3I5J0 0.8852459016393442
AIA62335 0.8852459016393442
Q3LZX7 0.8852459016393442
ARO76387 0.8770491803278688
ATO98225 0.8770491803278688
ACU31037 0.8770491803278688
AGZ48838 0.8770491803278688
AAZ41334 0.8770491803278688
AKZ19081 0.8688524590163934
ATO98237 0.8688524590163934
AIA62325 0.860655737704918
ATO98187 0.8688524590163934
AIA62315 0.8688524590163934
ADE34760 0.860655737704918
ABD75319 0.860655737704918
Q0Q470 0.860655737704918
ACZ72114 0.8524590163934426
YP_009825057 0.8524590163934426
AAT52336 0.8524590163934426
AIA62295 0.860655737704918
ANA96033 0.8524590163934426
AHX37563 0.8442622950819673
ABI96964 0.8442622950819673
ACZ72157 0.8442622950819673
AFR58734 0.8442622950819673
AFR58706 0.8442622950819673
AFR58678 0.8360655737704918
ASO66814 0.8360655737704918
ADE34817 0.88524590163
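
The three acceptance criteria used in the loop above (score above 80, E-value below 1e-10, identity above 80% of the alignment length) can be isolated in a small predicate, which makes them easier to test and adjust. This is a minimal sketch; the function name and the example numbers are illustrative and are not taken from the BLAST file.

```python
def is_homologue(score, expect, identities, align_length,
                 min_score=80, max_expect=1.0e-10, min_identity=0.80):
    """Return True when an HSP passes all three thresholds."""
    identity = identities / align_length
    return score > min_score and expect < max_expect and identity > min_identity


# Illustrative HSP values (hypothetical, not from the real results).
print(is_homologue(score=250, expect=1e-80, identities=118, align_length=121))
print(is_homologue(score=250, expect=1e-80, identities=90, align_length=121))
```

The first call passes all thresholds, while the second fails on identity (90/121 is below 0.80).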

In [12]:
for source in sources.keys():
    print(source, ":", sources[source])

SeqIO.write(homologue, os.path.join(wd,"data/homologue", gene, f"{seq_id}_{gene}.fasta"), "fasta")

Severe acute respiratory syndrome coronavirus 2 : 1
Bat coronavirus RaTG13 : 1
Pangolin coronavirus : 3
Bat SARS-like coronavirus : 5
Bat SARS CoV Rp3/2004 : 1
BtRs-BetaCoV/YN2013 : 1
Bat SARS coronavirus HKU3 : 1
Severe acute respiratory syndrome-related coronavirus : 4
SARS coronavirus Rs_672/2006 : 1
Bat SARS-like coronavirus WIV1 : 1
Bat SARS coronavirus HKU3-2 : 1
Bat SARS-like coronavirus YNLF_31C : 1
BtRs-BetaCoV/GX2013 : 1
BtRs-BetaCoV/HuB2013 : 1
Bat SARS coronavirus HKU3-7 : 1
Bat SARS CoV Rf1/2004 : 1
Bat CoV 279/2005 : 1
SARS coronavirus ExoN1 : 1
SARS coronavirus Tor2 : 4
SARS coronavirus LLJ-2004 : 1
BtRf-BetaCoV/HeB2013 : 1
Bat coronavirus : 2
Rhinolophus affinis coronavirus : 1
SARS coronavirus MA15 ExoN1 : 1
Bat SARS coronavirus HKU3-12 : 1
BtRf-BetaCoV/JL2012 : 1
SARS coronavirus Shanhgai LY : 1


40
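
The species tally above relies on the organism name appearing in square brackets in each hit definition. A compact alternative sketch uses `collections.Counter`; the hit definitions below are made-up examples in the usual NCBI style, and hits without a bracketed organism are counted as "unknown" instead of raising an `IndexError`.

```python
import re
from collections import Counter

SPECIES_PATTERN = re.compile(r"\[([^\]]*)\]")


def species_of(hit_def):
    """Return the first bracketed organism name, or 'unknown' if none."""
    match = SPECIES_PATTERN.search(hit_def)
    return match.group(1) if match else "unknown"


# Hypothetical hit definitions, for illustration only.
hit_defs = [
    "ORF7a protein [Severe acute respiratory syndrome coronavirus 2]",
    "ORF7a protein [Pangolin coronavirus]",
    "hypothetical protein [Pangolin coronavirus]",
    "accessory protein 7a",
]

sources = Counter(species_of(hit_def) for hit_def in hit_defs)
for species, count in sources.most_common():
    print(species, ":", count)
```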

## UniProt Results

### Comments

In [13]:
path = os.path.join(wd, "data/homologue", gene, file_blast_uniprot)
blast_uniprot_raw = SeqIO.parse(path, format="fasta")
blast_uniprot = []
blast_uniprot_id = []

for protein in blast_uniprot_raw:
    blast_uniprot.append(protein)
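    # Extract the accession from identifiers such as "sp|P0DTC7|..."; a separate
    # name avoids overwriting the global seq_id defined at the top of the notebook
    accession = re.findall(r'\|([^|]*)\|', protein.id)[0]
    blast_uniprot_id.append(accession)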
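
UniProt FASTA identifiers follow the pattern `db|accession|entry_name`, which is what the regular expression above relies on. The same extraction can be sketched with a plain `split`; the identifier used here is an illustrative example and the helper name is hypothetical.

```python
def accession_from_id(record_id):
    """Return the accession from a 'db|accession|entry_name' identifier."""
    parts = record_id.split("|")
    if len(parts) < 3:
        raise ValueError("unexpected identifier format: {0}".format(record_id))
    return parts[1]


print(accession_from_id("sp|P0DTC7|NS7A_SARS2"))
```

Raising an explicit error on unexpected identifiers makes a malformed FASTA header easier to diagnose than an `IndexError` from an empty match list.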

In [14]:
from wget import download

for accession in blast_uniprot_id:
    url = "https://www.uniprot.org/uniprot/{0}.xml".format(accession)
    path = os.path.join(wd, "data/homologue", gene, "uniprot","{0}.xml".format(accession))
    download(url, path)
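
The download step above depends on the third-party `wget` package and assumes that the `uniprot` output folder already exists. A standard-library sketch of the same idea is shown below; it creates the folder and skips files that are already present. The function names are hypothetical, and the URL pattern is the one used above.

```python
import os
from urllib.request import urlretrieve


def uniprot_xml_url(accession):
    """URL of the UniProt XML record, following the pattern used above."""
    return "https://www.uniprot.org/uniprot/{0}.xml".format(accession)


def fetch_uniprot_xml(accession, out_dir):
    """Download one record into out_dir unless it is already there."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "{0}.xml".format(accession))
    if not os.path.exists(path):
        urlretrieve(uniprot_xml_url(accession), path)  # needs network access
    return path


print(uniprot_xml_url("P0DTC7"))
```

Skipping existing files means the cell can be re-run without fetching every record again.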

In [15]:
import re

blast_matches = []

for accession in blast_uniprot_id:
    path = os.path.join(wd, "data/homologue", gene,"uniprot","{0}.xml".format(accession))
    blast_matches.append(SeqIO.read(path, format="uniprot-xml"))

comments = {}
an_domains = {}
keywords = {}
for match in blast_matches:
    if "keywords" in match.annotations.keys():
        for keyword in match.annotations["keywords"]:
            if keyword in keywords:
                keywords[keyword] = ", ".join([keywords[keyword],match.id])
            else:
                keywords[keyword] = match.id
    for annotation in match.annotations.keys():
        if re.match("(comment_)[a-z]*", annotation):
            comment_type = annotation.split("_",1)[1]
            if comment_type not in comments.keys():
                comments[comment_type] = {}
            comment = match.annotations[annotation][0]
            if comment in comments[comment_type].keys():
                comments[comment_type][comment] = ", ".join([comments[comment_type][comment], match.id])
            else:
                comments[comment_type][comment] = match.id

for comment_type in comments.keys():
    print(comment_type)
    for comment in comments[comment_type].keys():
        print(comment, ":", comments[comment_type][comment])
    print("\n")

print("Keywords")
for keyword in keywords.keys():
    print(keyword, ":", keywords[keyword])
print("\n")

function
Plays a role as antagonist of host tetherin (BST2), disrupting its antiviral effect. Acts by binding to BST2 thereby interfering with its glycosylation. May suppress small interfering RNA (siRNA). May bind to host ITGAL, thereby playing a role in attachment or modulation of leukocytes. : P0DTC7, P59635
Non-structural protein which is dispensable for virus replication in cell culture. : Q3I5J0, Q3LZX7, Q0Q470


subunit
Interacts with the spike glycoprotein. Interacts with M protein. Interacts with E protein. Interacts with the ORF3a protein. Interacts with human SGT. Interacts with host ITGAL. Interacts with host BST2. : P0DTC7
Interacts with the spike glycoprotein, M protein, E protein and the accessory protein 3. : Q3I5J0, Q3LZX7, Q0Q470
Interacts with the spike glycoprotein (PubMed:16840309). Interacts with M protein (PubMed:16580632). Interacts with E protein (PubMed:16580632). Interacts with the ORF3a protein (PubMed:15194747). Interacts with human SGT (PubMed:16580632). I
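
The grouping logic used above, where each distinct annotation text is mapped to the accessions that share it, can be sketched in isolation with `collections.defaultdict`. The records below are a small hand-made stand-in for the parsed UniProt entries, with shortened texts.

```python
from collections import defaultdict


def group_by_text(records):
    """Map each annotation text to the list of accessions carrying it."""
    groups = defaultdict(list)
    for accession, text in records:
        groups[text].append(accession)
    return groups


# Hand-made (accession, annotation text) pairs, for illustration only.
records = [
    ("Q3I5J0", "Non-structural protein"),
    ("Q3LZX7", "Non-structural protein"),
    ("P0DTC7", "Antagonist of host tetherin"),
]

for text, accessions in group_by_text(records).items():
    print(text, ":", ", ".join(accessions))
```

Keeping accessions in a list and joining them only when printing avoids rebuilding a comma-separated string on every match.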

### Features

In [16]:
features = {}
for match in blast_matches:
    for feature in match.features:
        if re.search("(domain|motif|bond|region)", feature.type):
            feature_type = feature.type
            if feature_type not in features.keys():
                features[feature_type] = {}

            feature_desc = str(feature.location)
            if "description" in feature.qualifiers.keys():
                feature_desc = " ".join([feature_desc,feature.qualifiers["description"]])

            if feature_desc in features[feature_type].keys():
                features[feature_type][feature_desc] = ", ".join([features[feature_type][feature_desc], match.id])
            else:
                features[feature_type][feature_desc] = match.id

for feature_type in features.keys():
    print(feature_type)
    for feature in features[feature_type].keys():
        print(feature, ":", features[feature_type][feature])
    print("\n")

transmembrane region
[95:116] Helical : P0DTC7
[96:117] Helical : Q3I5J0, Q3LZX7, P59635, Q0Q470


domain
[15:81] X4e : P0DTC7, Q3I5J0, Q3LZX7, P59635, Q0Q470


short sequence motif
[116:121] Di-lysine motif : P0DTC7
[117:122] Di-lysine motif : Q3I5J0, Q3LZX7, P59635, Q0Q470


disulfide bond
[22:58] : P0DTC7, Q3I5J0, Q3LZX7, P59635, Q0Q470
[34:67] : P0DTC7, Q3I5J0, Q3LZX7, P59635, Q0Q470


topological domain
[15:96] Virion surface : Q3I5J0, Q3LZX7, P59635, Q0Q470
[117:122] Intravirion : Q3I5J0, Q3LZX7, P59635, Q0Q470
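
When comparing these positions with the UniProt website, note that Biopython prints locations as zero-based, end-exclusive intervals, so `[95:116]` corresponds to residues 96 to 116 in one-based numbering. A small sketch of that conversion follows; the helper name is hypothetical and only simple `[start:end]` locations are handled.

```python
import re


def to_one_based(location):
    """Convert a '[start:end]' string to a one-based inclusive (first, last) pair."""
    match = re.fullmatch(r"\[(\d+):(\d+)\]", location)
    if match is None:
        raise ValueError("unsupported location: {0}".format(location))
    start, end = int(match.group(1)), int(match.group(2))
    return start + 1, end


print(to_one_based("[95:116]"))
```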


