# GenBank record

## Checking the annotations for the gene of interest

In [35]:
from scripts_python import gbSearch
import os
from Bio import SeqIO, Entrez

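# %pwd is an IPython magic that returns the notebook's working directory;
# dropping the last two path components gives the project root.
a = %pwd
wd = (a.rsplit('/',2))[0]

seq_id = "NM_001114380"  # RefSeq mRNA accession
protein = "P20701"       # UniProt accession
gene = "ITGAL"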

In [36]:
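# NCBI asks for a contact address on E-utilities requests;
# set Entrez.email to your own address before fetching.
handle = Entrez.efetch(db="nucleotide",
                       id=seq_id,
                       rettype="gb",
                       retmode="text")

file_path = os.path.join(wd, "data/reference", gene, f"{gene}.gb")
with open(file_path, 'w') as file:
    file.write(handle.read())
handle.close()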

In [37]:
path = os.path.join(wd, "data/reference",gene,f"{gene}.gb")
seq_r = SeqIO.read(path, format="genbank")
for key, value in seq_r.annotations.items():
    print(key,": ", value)


molecule_type :  mRNA
topology :  linear
data_file_division :  PRI
date :  12-DEC-2020
accessions :  ['NM_001114380']
sequence_version :  2
keywords :  ['RefSeq']
source :  Homo sapiens (human)
organism :  Homo sapiens
taxonomy :  ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
references :  [Reference(title='Perlecan regulates pericyte dynamics in the maintenance and repair of the blood-brain barrier', ...), Reference(title='Aggregatibacter actinomycetemcomitans leukotoxin causes activation of lymphocyte function-associated antigen 1', ...), Reference(title='The down-regulation of hsa_circ_0012919, the sponge for miR-125a-3p, contributes to DNA methylation of CD11a and CD70 in CD4(+) T cells of systemic lupus erythematous', ...), Reference(title='Direction of actin flow dictates integrin LFA-1 orientation during leukocyte migration', ...), Reference(

The .gb file contains several annotations, from which we can confirm the organism, its taxonomy and the bibliographic references associated with this sequence. The *features*, however, describe the sequence in more detail and let us locate our gene of interest.

## Features

In [38]:
interest_features = []

for feature in seq_r.features:
    # Keep only the gene and CDS features annotated for the gene of interest.
    if feature.type in ["gene", "CDS"]:
        if feature.qualifiers.get("gene", [None])[0] == gene:
            interest_features.append(feature)
            print(feature.type)
            print(feature.location)
            print(feature.ref_db)
            for key, value in feature.qualifiers.items():
                print(key,": ", value)
            print(" ")

gene
[0:4877](+)
None
gene :  ['ITGAL']
gene_synonym :  ['CD11A; LFA-1; LFA1A']
note :  ['integrin subunit alpha L']
db_xref :  ['GeneID:3683', 'HGNC:HGNC:6148', 'MIM:153370']
 
CDS
[96:3357](+)
None
gene :  ['ITGAL']
gene_synonym :  ['CD11A; LFA-1; LFA1A']
note :  ['isoform b precursor is encoded by transcript variant 2; LFA-1A; antigen CD11A (p180), lymphocyte function-associated antigen 1, alpha polypeptide; integrin, alpha L (antigen CD11A (p180), lymphocyte function-associated antigen 1; alpha polypeptide); LFA-1 alpha; integrin gene promoter; CD11 antigen-like family member A; leukocyte adhesion glycoprotein LFA-1 alpha chain; leukocyte function-associated molecule 1 alpha chain; integrin alpha-L']
codon_start :  ['1']
product :  ['integrin alpha-L isoform b precursor']
protein_id :  ['NP_001107852.1']
db_xref :  ['CCDS:CCDS45461.1', 'GeneID:3683', 'HGNC:HGNC:6148', 'MIM:153370']
translation :  ['MKDSCITVMAMALLSGFFFFAPASSYNLDVRGARSFSPPRAGRHFGYRVLQVGNGVIVGAPGEGNSTGSLYQCQSGTGHCLPVT

Filtering the features yields both the `gene` annotation and the CDS it encodes. The two locations are consistent, with the CDS ([96:3357]) lying inside the gene span ([0:4877]), so they can be used to *slice* the gene out of the record.
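
The relationship between the two locations printed above can be checked with a little arithmetic. The sketch below hard-codes the two spans from the output (`[0:4877]` for the gene, `[96:3357]` for the CDS); the helper name `span_contains` is only illustrative and is not part of Biopython.

```python
# Spans copied from the feature output above (0-based, end-exclusive).
gene_span = (0, 4877)
cds_span = (96, 3357)


def span_contains(outer, inner):
    """Return True when `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


cds_length = cds_span[1] - cds_span[0]

print(span_contains(gene_span, cds_span))  # True
print(cds_length, cds_length % 3)          # 3261 0
```

A CDS length that is a multiple of three (here 1087 codons, stop codon included) is what one would expect for a complete open reading frame.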


## Slicing the gene of interest

In [39]:
from Bio.SeqRecord import SeqRecord


# The first feature of interest is the gene; take its integer bounds and slice.
gene_location = interest_features[0].location
gene_seq = seq_r.seq[int(gene_location.start):int(gene_location.end)]
gene_OBJ = SeqRecord(gene_seq, id=f"{seq_id} | Reference sequence {gene} gene", name="", description="")
SeqIO.write(gene_OBJ, os.path.join(wd,"data/reference", gene, f"{gene}_nu.fasta"), "fasta")

1
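
`SeqIO.write` returns the number of records it wrote, hence the `1` above. As a rough illustration of the layout of such a file, the sketch below builds a FASTA entry by hand: a `>` header line followed by the sequence wrapped at a fixed width. The function `format_fasta`, the width of 60 and the toy sequence are assumptions made for this example; they are not taken from Biopython or from the reference record.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA entry: a '>' header line plus wrapped sequence lines."""
    lines = [f">{header}"]
    # Cut the sequence into consecutive chunks of at most `width` characters.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


toy_sequence = "ACGT" * 20  # 80 nt placeholder, not the real ITGAL sequence
entry = format_fasta("NM_001114380 | Reference sequence ITGAL gene", toy_sequence)
print(entry)
```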

### Extracting information about the protein

This gives us a better picture of the protein: the ORF that yields its translation, the protein sequence itself and, most importantly, an ID that can be used to search other databases such as UniProt.

In [40]:
from Bio.Seq import Seq

# The second feature of interest is the CDS, which carries the translation.
cds_q = interest_features[1].qualifiers
prot_seq = Seq(cds_q['translation'][0])
prot_OBJ = SeqRecord(prot_seq, id=cds_q['protein_id'][0], name=cds_q['product'][0], description=cds_q['product'][0])
SeqIO.write(prot_OBJ, os.path.join(wd,"data/reference", gene, f"{gene}_prot.fasta"), "fasta")

1

## Extracting knowledge from UniProt

As mentioned above, the protein identifier allowed us to find the matching entry for this protein in *UniProt*, namely P20701.

Downloading the XML file and opening it with SeqIO lets us examine the function of this protein and some of its annotations and features.

In [41]:
from wget import download

url = "https://www.uniprot.org/uniprot/{0}.xml".format(protein)
path = os.path.join(wd, "data/reference", gene,f"{protein}.xml")

download(url, path)
prot_ref = SeqIO.read(path, format="uniprot-xml")

Some of the most important elements in these databases are the comments, which include information such as function, interactions, subcellular location and any domains present.

In [42]:
import re

for annotation in prot_ref.annotations.keys():
    if re.match("(comment_)[a-z]*", annotation):
        print(annotation.split("_",1)[1],":")
        print(prot_ref.annotations[annotation][0], "\n")

function :
Integrin ITGAL/ITGB2 is a receptor for ICAM1, ICAM2, ICAM3 and ICAM4. Integrin ITGAL/ITGB2 is a receptor for F11R (PubMed:11812992, PubMed:15528364). Integin ITGAL/ITGB2 is a receptor for the secreted form of ubiquitin-like protein ISG15; the interaction is mediated by ITGAL (PubMed:29100055). Involved in a variety of immune phenomena including leukocyte-endothelial cell interaction, cytotoxic T-cell mediated killing, and antibody dependent killing by granulocytes and monocytes. Contributes to natural killer cell cytotoxicity (PubMed:15356110). Involved in leukocyte adhesion and transmigration of leukocytes including T-cells and neutrophils (PubMed:11812992). Required for generation of common lymphoid progenitor cells in bone marrow, indicating a role in lymphopoiesis (By similarity). Integrin ITGAL/ITGB2 in association with ICAM3, contributes to apoptotic neutrophil phagocytosis by macrophages (PubMed:23775590). 

subunit :
Heterodimer of an alpha and a beta subunit (PubMed
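
The regular expression used in the cell above, `(comment_)[a-z]*`, in practice selects every annotation key that starts with `comment_`. A minimal sketch of the same selection on a plain dictionary follows; the dictionary is an invented stand-in that only mimics the shape of `prot_ref.annotations`, and its values are placeholders rather than real UniProt text.

```python
# Hypothetical stand-in for prot_ref.annotations: keys mimic the parser's
# naming, values are placeholder strings rather than real UniProt text.
annotations = {
    "accessions": ["P20701"],
    "comment_function": ["placeholder function text"],
    "comment_subunit": ["placeholder subunit text"],
    "organism": "Homo sapiens",
}

prefix = "comment_"
# Keep only the comment entries, keyed by the comment type without the prefix.
comments = {
    key[len(prefix):]: values[0]
    for key, values in annotations.items()
    if key.startswith(prefix)
}

for name, text in comments.items():
    print(f"{name}:\n{text}\n")
```

Collecting the comments in a dictionary, instead of printing them straight away, makes it easy to look up a single comment type later, for example `comments["function"]`.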

This information often points to features present in the sequence, as is the case for the domain information found in the previous step.
These features carry different evidence levels, which can be used as a filter; here we keep level 2 and above.

Note: every location start is offset by one position, since Biopython reports 0-based coordinates.
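
The one-position offset in the note presumably reflects the difference between 0-based, end-exclusive spans (the Python slicing convention Biopython uses for locations) and the 1-based, inclusive positions shown in sequence databases. The conversion can be sketched as below; the helper names `to_one_based` and `to_zero_based` and the example span are illustrative assumptions.

```python
def to_one_based(start, end):
    """Convert a 0-based, end-exclusive span to 1-based, inclusive positions."""
    return start + 1, end


def to_zero_based(first, last):
    """Convert 1-based, inclusive positions to a 0-based, end-exclusive span."""
    return first - 1, last


example_span = (0, 25)  # a location printed as [0:25]
print(to_one_based(*example_span))  # (1, 25)
print(to_zero_based(1, 25))         # (0, 25)
```

Only the start changes between the two conventions; the end stays the same because an exclusive 0-based end and an inclusive 1-based end fall on the same number.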

In [43]:
for feature in prot_ref.features:
    if "evidence" in feature.qualifiers:
        if int(feature.qualifiers["evidence"]) >= 2:
            print("feature: ", feature.type)
            print("location: ", feature.location)
            print("seq: ", prot_ref.seq[feature.location.nofuzzy_start:feature.location.nofuzzy_end])
            for qual in feature.qualifiers.keys():
                if qual != "type":
                    print(qual,": ", feature.qualifiers[qual])
            print("\n")

feature:  signal peptide
location:  [0:25]
seq:  MKDSCITVMAMALLSGFFFFAPASS
evidence :  10


feature:  topological domain
location:  [25:1090]
seq:  YNLDVRGARSFSPPRAGRHFGYRVLQVGNGVIVGAPGEGNSTGSLYQCQSGTGHCLPVTLRGSNYTSKYLGMTLATDPTDGSILACDPGLSRTCDQNTYLSGLCYLFRQNLQGPMLQGRPGFQECIKGNVDLVFLFDGSMSLQPDEFQKILDFMKDVMKKLSNTSYQFAAVQFSTSYKTEFDFSDYVKRKDPDALLKHVKHMLLLTNTFGAINYVATEVFREELGARPDATKVLIIITDGEATDSGNIDAAKDIIRYIIGIGKHFQTKESQETLHKFASKPASEFVKILDTFEKLKDLFTELQKKIYVIEGTSKQDLTSFNMELSSSGISADLSRGHAVVGAVGAKDWAGGFLDLKADLQDDTFIGNEPLTPEVRAGYLGYTVTWLPSRQKTSLLASGAPRYQHMGRVLLFQEPQGGGHWSQVQTIHGTQIGSYFGGELCGVDVDQDGETELLLIGAPLFYGEQRGGRVFIYQRRQLGFEEVSELQGDPGYPLGRFGEAITALTDINGDGLVDVAVGAPLEEQGAVYIFNGRHGGLSPQPSQRIEGTQVLSGIQWFGRSIHGVKDLEGDGLADVAVGAESQMIVLSSRPVVDMVTLMSFSPAEIPVHEVECSYSTSNKMKEGVNITICFQIKSLIPQFQGRLVANLTYTLQLDGHRTRRRGLFPGGRHELRRNIAVTTSMSCTDFSFHFPVCVQDLISPINVSLNFSLWEEEGTPRDQRAQGKDIPPILRPSLHSETWEIPFEKNCGEDKKCEANLRVSFSPARSRALRLTAFASLSVELSLSNLEEDAYWVQLDLHFPPGLSFRKVEMLKPHSQIPVSCEELPEESRLLSRALSCNVSSPIFKAGHSVALQ

ValueError: invalid literal for int() with base 10: '13 22 25'