# GenBank record

## Checking the annotations for the genes of interest

In [15]:
from scripts_python import gbSearch
import os
from Bio import SeqIO, Entrez

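# %pwd is an IPython magic that returns the notebook's working directory
a = %pwd
# Drop the last two path components to reach the project root
wd = (a.rsplit('/',2))[0]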

seq_id = "NM_004335"
protein = "Q10589"
gene = "BST2"

In [16]:
handle = Entrez.efetch(db="nucleotide",
                       id=seq_id,
                       rettype="gb",
                       retmode="text")

# Save the GenBank record as data/reference/<gene>/<gene>.gb
file_path = os.path.join(wd, "data/reference", gene, f"{gene}.gb")
with open(file_path, 'w') as file:
    file.write(handle.read())
handle.close()

In [17]:
path = os.path.join(wd, "data/reference",gene,f"{gene}.gb")
seq_r = SeqIO.read(path, format="genbank")
for Key, Value in seq_r.annotations.items():
    print(Key,": ", Value)


molecule_type :  mRNA
topology :  linear
data_file_division :  PRI
date :  11-DEC-2020
accessions :  ['NM_004335']
sequence_version :  4
keywords :  ['RefSeq', 'MANE Select']
source :  Homo sapiens (human)
organism :  Homo sapiens
taxonomy :  ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
references :  [Reference(title='Differential Vpu-Mediated CD4 and Tetherin Downregulation Functions among Major HIV-1 Group M Subtypes', ...), Reference(title='Human BST-2/tetherin inhibits Junin virus release from host cells and its inhibition is partially counteracted by viral nucleoprotein', ...), Reference(title='BST2 Promotes Tumor Growth via Multiple Pathways in Hepatocellular Carcinoma', ...), Reference(title='A reference map of the human binary protein interactome', ...), Reference(title='Vpu of a Simian Immunodeficiency Virus Isolated from Greater Spot-Nose

The .gb file contains several annotations, from which we can confirm the organism, its taxonomy, and the bibliographic references associated with this sequence. The *features*, however, tell us more about the sequence itself and let us determine the location of our gene of interest.

## Features

In [18]:
interest_features = []

for feature in seq_r.features:
    if feature.type in ["gene", "CDS"]:
        # .get avoids a KeyError for features that lack a "gene" qualifier
        if feature.qualifiers.get("gene", [None])[0] == gene:
            interest_features.append(feature)
            print(feature.type)
            print(feature.location)
            print(feature.ref_db)
            for Key,Value in feature.qualifiers.items():
                print(Key,": ", Value)
            print(" ")

gene
[0:1001](+)
None
gene :  ['BST2']
gene_synonym :  ['CD317; TETHERIN']
note :  ['bone marrow stromal cell antigen 2']
db_xref :  ['GeneID:684', 'HGNC:HGNC:1119', 'MIM:600534']
 
CDS
[55:598](+)
None
gene :  ['BST2']
gene_synonym :  ['CD317; TETHERIN']
note :  ['NPC-A-7; BST-2; HM1.24 antigen; bone marrow stromal antigen 2']
codon_start :  ['1']
product :  ['bone marrow stromal antigen 2 precursor']
protein_id :  ['NP_004326.1']
db_xref :  ['CCDS:CCDS12358.1', 'GeneID:684', 'HGNC:HGNC:1119', 'MIM:600534']
translation :  ['MASTSYDYCRVPMEDGDKRCKLLLGIGILVLLIIVILGVPLIIFTIKANSEACRDGLRAVMECRNVTHLLQQELTEAQKGFQDVEAQAATCNHTVMALMASLDAEKAQGQKKVEELEGEITTLNHKLQDASAEVERLRRENQVLSVRIADKKYYPSSQDSSSAAAPQLLIVLLGLSALLQ']
 


Filtering these features yields both the gene annotation and the CDS it encodes. The gene feature spans the whole record ([0:1001]) and the CDS lies inside it ([55:598]), so these locations can be used for *slicing* the sequence.
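
As a quick sanity check on these coordinates, the sketch below uses plain Python (no Biopython) to confirm that the CDS span lies inside the gene span and that its length is a multiple of three. The helper names `slice_feature` and `contains` and the toy sequence are illustrative assumptions, not part of the notebook; only the two coordinate pairs come from the output above.

```python
def slice_feature(sequence, start, end):
    """Return sequence[start:end] using 0-based, end-exclusive coordinates."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("feature lies outside the sequence")
    return sequence[start:end]


def contains(outer, inner):
    """True if the (start, end) span `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Coordinates printed for NM_004335 above
gene_span = (0, 1001)
cds_span = (55, 598)

cds_length = cds_span[1] - cds_span[0]   # 543 nucleotides
codon_count = cds_length // 3            # 181 codons

# Toy example (hypothetical sequence): 4 nt of 5' UTR, a 9 nt ORF, 3 nt of 3' UTR
toy_mrna = "GGCC" + "ATGGCTTAA" + "TTT"
toy_cds = slice_feature(toy_mrna, 4, 13)

print(contains(gene_span, cds_span), cds_length, codon_count, toy_cds)
```

The 181 codons are consistent with the 180-residue translation reported in the CDS qualifiers plus one stop codon.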


## Slicing the gene of interest

In [19]:
from Bio.SeqRecord import SeqRecord


# Slice with the gene feature's 0-based, end-exclusive coordinates
gene_location = interest_features[0].location
gene_seq = seq_r.seq[int(gene_location.start):int(gene_location.end)]
gene_OBJ = SeqRecord(gene_seq, f"{seq_id} | Reference sequence {gene} gene", "", "")
SeqIO.write(gene_OBJ, os.path.join(wd,"data/reference", gene, f"{gene}_nu.fasta"), "fasta")

1

### Extracting protein information

We can also learn more about the protein: the ORF that yields its translation, the translated sequence itself and, most importantly, an ID that can be used to search other databases such as UniProt.

In [20]:
from Bio.Seq import Seq

cds = interest_features[1]
prot_seq = Seq(cds.qualifiers['translation'][0])
prot_OBJ = SeqRecord(prot_seq,
                     id=cds.qualifiers['protein_id'][0],
                     name=cds.qualifiers['product'][0],
                     description=cds.qualifiers['product'][0])
SeqIO.write(prot_OBJ, os.path.join(wd,"data/reference", gene, f"{gene}_prot.fasta"), "fasta")

1
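
To make the shape of the resulting file concrete, here is a minimal stdlib-only sketch of a FASTA record: a `>` header line followed by the sequence wrapped at a fixed width. The function `format_fasta` and the 60-character width are assumptions chosen for illustration; this is not how `SeqIO.write` is implemented. The identifier, description and sequence are taken from the CDS qualifiers printed earlier.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record: a '>' header line plus wrapped sequence lines."""
    header = f">{identifier} {description}".rstrip()
    # Cut the sequence into chunks of at most `width` characters
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"


# Values copied from the CDS qualifiers shown above
protein_id = "NP_004326.1"
product = "bone marrow stromal antigen 2 precursor"
translation = (
    "MASTSYDYCRVPMEDGDKRCKLLLGIGILVLLIIVILGVPLIIFTIKANSEACRDGLRAV"
    "MECRNVTHLLQQELTEAQKGFQDVEAQAATCNHTVMALMASLDAEKAQGQKKVEELEGEI"
    "TTLNHKLQDASAEVERLRRENQVLSVRIADKKYYPSSQDSSSAAAPQLLIVLLGLSALLQ"
)

record = format_fasta(protein_id, product, translation)
print(record)
```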

## Extracting knowledge from UniProt

As mentioned earlier, the protein identifier made it possible to find the matching *UniProt* entry for this protein, namely Q10589.

By downloading the XML file and opening it with SeqIO, we can examine the function of this protein and some of its annotations and features.

In [21]:
from wget import download

url = "https://www.uniprot.org/uniprot/{0}.xml".format(protein)
path = os.path.join(wd, "data/reference", gene,f"{protein}.xml")

download(url, path)
prot_ref = SeqIO.read(path, format="uniprot-xml")

Some of the most important elements in these databases are the comments, which include information such as function, interactions, subcellular location, and any domains present.

In [22]:
import re

for annotation in prot_ref.annotations.keys():
    if re.match("(comment_)[a-z]*", annotation):
        print(annotation.split("_",1)[1],":")
        print(prot_ref.annotations[annotation][0], "\n")

function :
IFN-induced antiviral host restriction factor which efficiently blocks the release of diverse mammalian enveloped viruses by directly tethering nascent virions to the membranes of infected cells. Acts as a direct physical tether, holding virions to the cell membrane and linking virions to each other. The tethered virions can be internalized by endocytosis and subsequently degraded or they can remain on the cell surface. In either case, their spread as cell-free virions is restricted (PubMed:22520941, PubMed:21529378, PubMed:20940320, PubMed:20419159, PubMed:20399176, PubMed:19879838, PubMed:19036818, PubMed:18342597, PubMed:18200009). Its target viruses belong to diverse families, including retroviridae: human immunodeficiency virus type 1 (HIV-1), human immunodeficiency virus type 2 (HIV-2), simian immunodeficiency viruses (SIVs), equine infectious anemia virus (EIAV), feline immunodeficiency virus (FIV), prototype foamy virus (PFV), Mason-Pfizer monkey virus (MPMV), human 

This information often points to features on the sequence, as with the domain information found in the previous step.
These features carry different evidence values, which can be used as a filter; here we keep the features whose first evidence value is at least 2.
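
The filter can be sketched in isolation as follows. This assumes the `evidence` qualifier is a space-separated string of integers (for example `"6"` or `"12 14 16 19"`, as in the printed output). The helper names and the `"hypothetical"` entry are made up for illustration. Unlike the notebook cell, the sketch returns `False` for an empty string instead of raising an error.

```python
def evidence_values(qualifier):
    """Split an evidence qualifier such as '12 14 16 19' into a list of ints."""
    return [int(token) for token in qualifier.split()]


def passes_threshold(qualifier, minimum=2):
    """Compare only the first evidence value with the threshold, as the notebook does."""
    values = evidence_values(qualifier)
    return bool(values) and values[0] >= minimum


# Feature types mapped to evidence strings; the last entry is hypothetical
examples = {
    "propeptide": "6",
    "glycosylation site": "12 14 16 19",
    "hypothetical": "1",
}

kept = [name for name, qualifier in examples.items() if passes_threshold(qualifier)]
print(kept)
```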

Note: every location starts one position lower than the corresponding UniProt position, because Biopython reports 0-based, end-exclusive coordinates.
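
The offset can be illustrated with a small conversion sketch. The function names `to_one_based` and `to_zero_based` are assumptions introduced here; the coordinates and the sequence prefix come from the outputs in this notebook.

```python
def to_one_based(start, end):
    """Convert a 0-based, end-exclusive span to 1-based, inclusive positions."""
    return start + 1, end


def to_zero_based(first, last):
    """Convert 1-based, inclusive positions to a 0-based, end-exclusive span."""
    return first - 1, last


# First 65 residues of the BST2 protein, copied from the translation above
prefix = (
    "MASTSYDYCRVPMEDGDKRC"            # cytoplasmic domain, [0:20]
    "KLLLGIGILVLLIIVILGVPLIIFTIKA"    # transmembrane region, [20:48]
    "NSEACRDGLRAVMECRN"               # start of the extracellular domain
)

glyco_site = (64, 65)                          # location as printed by Biopython
residue = prefix[glyco_site[0]:glyco_site[1]]  # Python slicing uses the same convention

print(residue, to_one_based(*glyco_site), to_one_based(161, 180))
```

Because Python slices follow the same convention as Biopython locations, the printed spans can be used directly for slicing; the conversion is only needed when comparing against 1-based positions.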

In [23]:
for feature in prot_ref.features:
    if "evidence" in feature.qualifiers:
        if int(feature.qualifiers["evidence"].split(" ",1)[0]) >= 2:
            print("feature: ", feature.type)
            print("location: ", feature.location)
            print("seq: ", prot_ref.seq[int(feature.location.start):int(feature.location.end)])
            for qual in feature.qualifiers.keys():
                if qual != "type":
                    print(qual,": ", feature.qualifiers[qual])
            print("\n")

feature:  propeptide
location:  [161:180]
seq:  SAAAPQLLIVLLGLSALLQ
description :  Removed in mature form
id :  PRO_0000253552
evidence :  6


feature:  topological domain
location:  [0:20]
seq:  MASTSYDYCRVPMEDGDKRC
description :  Cytoplasmic
evidence :  6


feature:  transmembrane region
location:  [20:48]
seq:  KLLLGIGILVLLIIVILGVPLIIFTIKA
description :  Helical; Signal-anchor for type II membrane protein
evidence :  6


feature:  topological domain
location:  [48:161]
seq:  NSEACRDGLRAVMECRNVTHLLQQELTEAQKGFQDVEAQAATCNHTVMALMASLDAEKAQGQKKVEELEGEITTLNHKLQDASAEVERLRRENQVLSVRIADKKYYPSSQDSS
description :  Extracellular
evidence :  6


feature:  lipid moiety-binding region
location:  [160:161]
seq:  S
description :  GPI-anchor amidated serine
evidence :  6


feature:  glycosylation site
location:  [64:65]
seq:  N
description :  N-linked (GlcNAc...) asparagine
evidence :  12 14 16 19


feature:  glycosylation site
location:  [91:92]
seq:  N
description :  N-linked (GlcNAc...) asparagine
e