# <span style="color:#DBD318">Carlos Michel Mourra Díaz</span>
# <span style="color:#DBD318">Elizabeth Márquez Gómez</span>
# <span style="color:#DBD318">Salvador González Juárez</span>

# <h1 style="color:#50AEDB">Project: Diagnosing Sickle Cell Anemia</h1>

# <h2 style="color:#8F091A">Exercises</h2>
+ <a href="#Ejercicio_1" style="color:#E0AC3D">Exercise 1</a>
+ <a href="#Ejercicio_2" style="color:#E0AC3D">Exercise 2</a>
+ <a href="#Ejercicio_3" style="color:#E0AC3D">Exercise 3</a>
+ <a href="#Ejercicio_4" style="color:#E0AC3D">Exercise 4</a>

<h2 style="color:#8F091A">1. Retrieving DNA and protein sequences with Bio.Entrez</h2>
<h3 style="color:#8F091A">1.1 Search identifiers on NCBI</h3>


In [115]:
from Bio import Entrez

Entrez.email = 'elimqzg@lcg.unam.mx'

# Get the sequence identifiers
handle = Entrez.esearch(db='nucleotide', term='sickle AND homo sapiens AND globin NOT chromosome') 
record = Entrez.read(handle)
print(record['IdList'])

['1868032479', '1515564438', '179408', '224959855', '2168937', '183859', '183844']


<h3 style="color:#8F091A">1.2 Retrieve sequences using identifiers</h3>

In [116]:
from xml.etree import ElementTree as ET

for identifier in record['IdList']:
    
    handle = Entrez.efetch(db='nucleotide', id= identifier, rettype='fasta', retmode='xml')
    resultXML = handle.read().decode()
    #print(resultXML)
    
    tree = ET.ElementTree(ET.fromstring(resultXML))
    root = tree.getroot()
    print((root.find('.//TSeq_defline')).text)
    print((root.find('.//TSeq_accver')).text, end='\n\n')
    

Homo sapiens A-gamma globin (HBG1) gene, exon 1 and partial cds
MN609913.1

Homo sapiens voucher ATGLAB 2018103 hemoglobin beta subunit (HBB) gene, exon 3 and partial cds
MH580289.1

Human sickle cell beta-globin mRNA, complete cds
M25079.1

Homo sapiens A-gamma globin (HBG1) gene, promoter region
FJ766333.1

Part of DNA encoding beta-globin gene
E00658.1

Human hemoglobin DNA with a deletion causing Indian delta-beta thalassemia
M33706.1

Human hemoglobin-related sequence across the breakpoint for Indian delta-beta thalassemia
M37467.1
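
The XML handling in this step can be tried without a network connection. The following is a minimal sketch that assumes an abbreviated TinySeq-style document: `sample_xml` and `summarize_tseq` are illustrative names that are not part of the notebook, and a real `efetch` response carries more fields per `TSeq` element.

```python
from xml.etree import ElementTree as ET

# Abbreviated TinySeq-style document (assumption: a real efetch response
# has more fields, such as the sequence itself, and one <TSeq> per id).
sample_xml = """<TSeqSet>
  <TSeq>
    <TSeq_accver>M25079.1</TSeq_accver>
    <TSeq_defline>Human sickle cell beta-globin mRNA, complete cds</TSeq_defline>
    <TSeq_length>468</TSeq_length>
  </TSeq>
  <TSeq>
    <TSeq_accver>E00658.1</TSeq_accver>
    <TSeq_defline>Part of DNA encoding beta-globin gene</TSeq_defline>
  </TSeq>
</TSeqSet>"""

def summarize_tseq(xml_text):
    """Return (accession.version, definition line) for every TSeq element."""
    root = ET.fromstring(xml_text)
    summaries = []
    for tseq in root.iter('TSeq'):
        # findtext returns the default instead of failing when a tag is absent
        accver = tseq.findtext('TSeq_accver', default='unknown')
        defline = tseq.findtext('TSeq_defline', default='')
        summaries.append((accver, defline))
    return summaries

for accver, defline in summarize_tseq(sample_xml):
    print(accver, defline, sep='\t')
```

Iterating over the `TSeq` elements keeps the code working if several identifiers are fetched in one request, which the loop above avoids by fetching one identifier at a time.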



<h3 style="color:#8F091A">1.3 Retrieve a single GenBank entry</h3>

In [117]:
handle = Entrez.efetch(db='nucleotide', id='M25079.1', rettype='gb', retmode='text')
resultText = handle.read()
print(resultText)

LOCUS       HUMBETGLA                468 bp    mRNA    linear   PRI 27-APR-1993
DEFINITION  Human sickle cell beta-globin mRNA, complete cds.
ACCESSION   M25079
VERSION     M25079.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 468)
  AUTHORS   Marotta,C.A., Forget,B.G., Cohen-Solal,M. and Weissman,S.M.
  TITLE     Nucleotide sequence analysis of coding and noncoding regions of
            human beta-globin mRNA
  JOURNAL   Prog Nucleic Acid Res Mol Biol 19, 165-175 (1976)
   PUBMED   1019344
COMMENT     Original source text: Human sickle cell, cDNA to mRNA.
FEATURES             Location/Qualifiers
     source          1..468
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:

<h3 style="color:#8F091A">1.4 Write the GenBank entry to a file</h3>

In [118]:
with open('./sickle.gb','w') as fileHandler:
    fileHandler.write(resultText)

In [119]:
%%bash
less ./sickle.gb

LOCUS       HUMBETGLA                468 bp    mRNA    linear   PRI 27-APR-1993
DEFINITION  Human sickle cell beta-globin mRNA, complete cds.
ACCESSION   M25079
VERSION     M25079.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 468)
  AUTHORS   Marotta,C.A., Forget,B.G., Cohen-Solal,M. and Weissman,S.M.
  TITLE     Nucleotide sequence analysis of coding and noncoding regions of
            human beta-globin mRNA
  JOURNAL   Prog Nucleic Acid Res Mol Biol 19, 165-175 (1976)
   PUBMED   1019344
COMMENT     Original source text: Human sickle cell, cDNA to mRNA.
FEATURES             Location/Qualifiers
     source          1..468
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:

<h3 style="color:#8F091A">1.5 Retrieve and write multiple GenBank entries</h3>

In [120]:
import time
from urllib.error import HTTPError

def GetRecord(resultados, fileFormat):
    '''
    Fetch the records for every result of a previous Entrez search.
    :param resultados: Bio.Entrez.Parser.DictionaryElement, results of an Entrez search
                       run with usehistory='y'.
    :param fileFormat: str, format passed to efetch as rettype (for example 'gb' or 'fasta').
    :return: str, the records found by the search, in the requested format.
    '''
    
    # Read the values of the search
    num_resultados = int(resultados['Count'])
    webenv = resultados['WebEnv']
    query_key = resultados['QueryKey']
    lote = max(num_resultados, 1)  # avoids a zero step in range() when nothing was found
    records = []

    # Fetch the results in batches (here a single batch holds every result)
    for inicio in range(0, num_resultados, lote):
        
        # Three attempts to fetch each batch
        intento = 1
        while intento <= 3:
            try:
                
                # Fetch the batch in the requested format
                fetch_handler = Entrez.efetch(db='nucleotide', rettype=fileFormat, retmode='text',
                                              retstart=inicio, retmax=lote,
                                              webenv=webenv, query_key=query_key)
                break
            
            # If an error occurs, report what happened
            except HTTPError as err:
                
                # On a server error, wait a moment and try again
                if 500 <= err.code <= 599:
                    print('Server error: {}'.format(err))
                    print('Attempt {} of 3'.format(intento))
                    intento += 1
                    time.sleep(15)
                
                # Any other error stops the program
                else:
                    raise
        else:
            # All three attempts failed, so there is no handle to read
            raise RuntimeError('Could not fetch the records after 3 attempts')

        # Store the batch
        records.append(fetch_handler.read())
        fetch_handler.close()
        
    # Return every batch as a single string
    return ''.join(records)
    
# Initialize an empty output file for the GenBank entries

out_handler = open('./entries.gb', 'w+')
out_handler.write('')
out_handler.close()   

searchHandler = Entrez.esearch(db='nucleotide',term='((beta-globin[Keyword]) AND Homo sapiens[Organism]) AND complete cds[TI] ', usehistory='y')
searchResults = Entrez.read(searchHandler)

    
with open('./entries.gb','a') as docEntriesGb:
    docEntriesGb.write(GetRecord(searchResults,'gb'))
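
The retry logic inside `GetRecord` can be exercised on its own, without contacting NCBI. This is a minimal sketch under stated assumptions: `fetch_with_retries` and `flaky_fetch` are hypothetical helpers introduced only for illustration, and `flaky_fetch` stands in for a call such as `Entrez.efetch` by raising a 503 error twice before succeeding.

```python
import time
from urllib.error import HTTPError

def fetch_with_retries(fetch, attempts=3, wait_seconds=15, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on HTTP 5xx errors."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except HTTPError as err:
            # Client errors (4xx) will not fix themselves, and the last
            # failed attempt has nothing left to retry: re-raise both.
            if not 500 <= err.code <= 599 or attempt == attempts:
                raise
            print('Server error: {} (attempt {} of {})'.format(err, attempt, attempts))
            sleep(wait_seconds)

# Hypothetical stand-in for an Entrez.efetch call: fails twice, then succeeds.
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise HTTPError('https://example.org', 503, 'Service Unavailable', None, None)
    return 'LOCUS ...'

# sleep is replaced by a no-op so the demonstration does not actually wait
result = fetch_with_retries(flaky_fetch, sleep=lambda seconds: None)
print(result, calls['count'])
```

Passing the fetch operation as a function keeps the retry policy separate from the Entrez call, so the same wrapper could be reused for `esearch` or other requests.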
    

In [121]:
%%bash
head ./entries.gb

LOCUS       AH001475                4355 bp    DNA     linear   PRI 10-JUN-2016
DEFINITION  Homo sapiens beta-globin gene, complete cds.
ACCESSION   AH001475 M34058 M34059
VERSION     AH001475.2
KEYWORDS    beta-globin.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.


<h3 style="color:#8F091A">1.6 Optional exercises</h3>
<h4 style="color:#8F091A">1.6.1 Write the entries to a FASTA file (Optional exercise)</h4>

In [122]:
out_handler = open('./entries.fasta', 'w+')
out_handler.write('')
out_handler.close()

with open('./entries.fasta','a') as docEntriesFasta:
    docEntriesFasta.write(GetRecord(searchResults,'fasta'))

In [123]:
%%bash
head ./entries.fasta

>AH001475.2 Homo sapiens beta-globin gene, complete cds
TCTATTTATTTAGCAATAATAGAGAAAGCATTTAAGAGAATAAAGCAATGGAAATAAGAAATTTGTAAAT
TTCCTTCTGATAACTAGAAATAGAGGATCCAGTTTCTTTTGGTTAACCTAAATTTTATTTCATTTTATTG
TTTTATTTTATTTTATTTTATTTTATTTTATTTTGTGTAATCGTAGTTTCAGAGTGTTAGAGCTGAAAGG
AAGAAGTAGGAGAAACATGCAAAGTAAAAGTATAACACTTTCCTTACTAAACCGACATGGGTTTCCAGGT
AGGGGCAGGATTCAGGATGACTGACAGGGCCCTTAGGGAACACTGAGACCCTACGCTGACCTCATAAATG
CTTGCTACCTTTGCTGTTTTAATTACATCTTTTAATAGCAGGAAGCAGAACTCTGCACTTCAAAAGTTTT
TCCTCACCTGAGGAGTTAATTTAGTACAAGGGGAAAAAGTACAGGGGGATGGGAGAAAGGCGATCACGTT
GGGAAGCTATAGAGAAAGAAGAGTAAATTTTAGTAAAGGAGGTTTAAACAAACAAAATATAAAGAGAAAT
AGGAACTTGAATCAAGGAAATGATTTTAAAACGCAGTATTCTTAGTGGACTAGAGGAAAAAAATAATCTG


<h4 style="color:#8F091A">1.6.2 Write each GenBank entry to its own file (Optional exercise)</h4>

In [124]:
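import os
from Bio import SeqIO

# Create the output directory if it does not exist yet
os.makedirs('./betaglobin_gbfiles', exist_ok=True)

# str(gbRecord) writes Biopython's readable summary, not GenBank format
for counter, gbRecord in enumerate(SeqIO.parse('./entries.gb', 'genbank'), start=1):
    with open("./betaglobin_gbfiles/globinRecord_{}.gb".format(counter), "w+") as file:
        file.write(str(gbRecord))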
        
           

In [125]:
%%bash
head ./betaglobin_gbfiles/globinRecord_1.gb

ID: AH001475.2
Name: AH001475
Description: Homo sapiens beta-globin gene, complete cds
Number of features: 13
/molecule_type=DNA
/topology=linear
/data_file_division=PRI
/date=10-JUN-2016
/accessions=['AH001475', 'M34058', 'M34059']
/sequence_version=2


<h4 style="color:#8F091A">1.6.3 Search PubMed for malaria and sickle cell anemia (Optional exercise)</h4>

In [126]:
handle = Entrez.esearch(db="pubmed", term="(malaria) AND (sickle cell anemia)", retmax=100)
record = Entrez.read(handle)

record['IdList']

['33261264', '33238928', '33180011', '33177908', '33008877', '32716940', '32697331', '32646433', '32614730', '32579813', '32491454', '32409247', '32391975', '32348635', '32334583', '32265284', '32211753', '32189304', '32183478', '32043441', '31999737', '31959596', '31937250', '31933570', '31889707', '31815023', '31808910', '31794569', '31703724', '31692839', '31681984', '31656474', '31562022', '31546868', '31518428', '31288760', '31167999', '31084943', '31037173', '31029083', '30923098', '30827499', '30787300', '30665411', '30658602', '30657108', '30605461', '30595467', '30578732', '30541465', '30501550', '30472238', '30455821', '30425067', '30411430', '30393954', '30344128', '30245824', '30222732', '30200838', '30178476', '30165857', '30145110', '30060095', '29946035', '29801034', '29749368', '29609623', '29607473', '29599243', '29588281', '29579313', '29524230', '29489205', '29408573', '29321025', '29318647', '29313430', '29310873', '29260650', '29242203', '29168218', '29127677', '29

<h2 style="color:#8F091A">2 Reading and filtering sequence records with Bio.SeqIO</h2>
<h3 style="color:#8F091A">2.1 Read the GenBank file</h3>

In [127]:
from Bio import SeqIO

for gbRecord in SeqIO.parse('./sickle.gb', 'genbank'):
    print(gbRecord)

ID: M25079.1
Name: HUMBETGLA
Description: Human sickle cell beta-globin mRNA, complete cds
Number of features: 2
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=27-APR-1993
/accessions=['M25079']
/sequence_version=1
/keywords=['']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Nucleotide sequence analysis of coding and noncoding regions of human beta-globin mRNA', ...)]
/comment=Original source text: Human sickle cell, cDNA to mRNA.
Seq('ATGGTNCAYYTNACNCCNGTGGAGAAGTCYGCYGTNACNGCNCTNTGGGGYAAG...TTT')


<h3 style="color:#8F091A">2.2 Print the id, name and description</h3>

In [128]:
for gbRecord in SeqIO.parse('./sickle.gb', 'genbank'):
    print(gbRecord.id)
    print(gbRecord.name)
    print(gbRecord.description)
    sickleCell = gbRecord


M25079.1
HUMBETGLA
Human sickle cell beta-globin mRNA, complete cds


<h3 style="color:#8F091A">2.3 Write the record to a FASTA file</h3>

In [129]:
SeqIO.write(sickleCell, './file_ex2.fasta', 'fasta')


1

In [130]:
%%bash
less ./file_ex2.fasta

>M25079.1 Human sickle cell beta-globin mRNA, complete cds
ATGGTNCAYYTNACNCCNGTGGAGAAGTCYGCYGTNACNGCNCTNTGGGGYAAGGTNAAY
GTGGATGAAGYYGGYGGYGAGGCCCTGGGCAGNCTGCTNGTGGTCTACCCTTGGACCCAG
AGGTTCTTNGANTCNTTYGGGGATCTGNNNACNCCNGANGCAGTTATGGGCAACCCTAAG
GTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGAC
AACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTNCAYGTGGAT
CCTGAGAACTTCAGGCTNCTNGGCAACGTGYTNGTCTGYGTGCTGGCCCATCACTTTGGC
AAAGAATTCACCCCACCAGTGCANGCNGCCTATCAGAAAGTGGTNGCTGGTGTNGCTAAT
GCCCTGGCCCACAAGTATCACTAAGCTNGCYTTYTTGYTGTCCAATTT


<h3 style="color:#8F091A">2.4 Read multiple GenBank entries</h3>

In [131]:
for gbRecord in SeqIO.parse('./entries.gb', 'genbank'):
    print(gbRecord.id)
    print(gbRecord.name)
    print(gbRecord.description, end='\n\n')

AH001475.2
AH001475
Homo sapiens beta-globin gene, complete cds

L26478.1
HUMBETGLOR
Human haplotype D3 beta-globin gene, complete cds

L26477.1
HUMBETGLOP
Human haplotype D2 beta-globin gene, complete cds

L26476.1
HUMBETGLOO
Human haplotype D1 beta-globin gene, complete cds

L26475.1
HUMBETGLON
Human haplotype C2 beta-globin gene, complete cds

L26474.1
HUMBETGLOM
Human haplotype C3 beta-globin gene, complete cds

L26473.1
HUMBETGLOL
Human haplotype C1 beta-globin gene, complete cds

L26472.1
HUMBETGLOK
Human haplotype B6 beta-globin gene, complete cds

L26471.1
HUMBETGLOJ
Human haplotype B5 beta-globin gene, complete cds

L26470.1
HUMBETGLOI
Human haplotype B4 beta-globin gene, complete cds

L26469.1
HUMBETGLOH
Human haplotype B3 beta-globin gene, complete cds

L26468.1
HUMBETGLOG
Human haplotype B2 beta-globin gene, complete cds

L26467.1
HUMBETGLOF
Human haplotype B1 beta-globin gene, complete cds

L26466.1
HUMBETGLOE
Human haplotype A4 beta-globin gene, complete cds

L26465.1
HUM

<h3 style="color:#8F091A">2.5 Filter the entries by description</h3>

In [132]:
import re

for gbRecord in SeqIO.parse('./entries.gb', 'genbank'):
    if not re.search('(isolate)|(vector)', gbRecord.description) and re.search('globin', gbRecord.description):
        print(gbRecord.id)
        print(gbRecord.name)
        print(gbRecord.description, end='\n\n')

AH001475.2
AH001475
Homo sapiens beta-globin gene, complete cds

L26478.1
HUMBETGLOR
Human haplotype D3 beta-globin gene, complete cds

L26477.1
HUMBETGLOP
Human haplotype D2 beta-globin gene, complete cds

L26476.1
HUMBETGLOO
Human haplotype D1 beta-globin gene, complete cds

L26475.1
HUMBETGLON
Human haplotype C2 beta-globin gene, complete cds

L26474.1
HUMBETGLOM
Human haplotype C3 beta-globin gene, complete cds

L26473.1
HUMBETGLOL
Human haplotype C1 beta-globin gene, complete cds

L26472.1
HUMBETGLOK
Human haplotype B6 beta-globin gene, complete cds

L26471.1
HUMBETGLOJ
Human haplotype B5 beta-globin gene, complete cds

L26470.1
HUMBETGLOI
Human haplotype B4 beta-globin gene, complete cds

L26469.1
HUMBETGLOH
Human haplotype B3 beta-globin gene, complete cds

L26468.1
HUMBETGLOG
Human haplotype B2 beta-globin gene, complete cds

L26467.1
HUMBETGLOF
Human haplotype B1 beta-globin gene, complete cds

L26466.1
HUMBETGLOE
Human haplotype A4 beta-globin gene, complete cds

L26465.1
HUM
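
The filter used in this step can be checked on plain strings before running it on parsed records. In this small sketch the first two descriptions are copied from the output above, while the last two are made-up examples of entries the filter is meant to discard; `keep` is a hypothetical helper name.

```python
import re

# The first two descriptions come from the output above; the last two are
# invented examples (assumptions) of entries that should be filtered out.
descriptions = [
    'Homo sapiens beta-globin gene, complete cds',
    'Human haplotype D3 beta-globin gene, complete cds',
    'Homo sapiens isolate X1 beta-globin gene, complete cds',   # hypothetical
    'Cloning vector with human beta-globin insert',              # hypothetical
]

unwanted = re.compile(r'isolate|vector')
wanted = re.compile(r'globin')

def keep(description):
    """True when the description mentions globin but neither unwanted word."""
    return bool(wanted.search(description)) and not unwanted.search(description)

kept = [d for d in descriptions if keep(d)]
print(kept)
```

Compiling the two patterns once avoids rebuilding them for every record, and the alternation `isolate|vector` behaves like the grouped form `(isolate)|(vector)` used in the cell.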

<h3 style="color:#8F091A">2.6 Collect and save the filtered entries</h3>
<h4 style="color:#8F091A">2.6.1 Store the filtered records in a list</h4>

In [133]:
gbs = []
for gbRecord in SeqIO.parse('./entries.gb', 'genbank'):
    if not re.search('(isolate)|(vector)', gbRecord.description) and re.search('globin', gbRecord.description):
        gbs.append(gbRecord)
print(gbs)

[SeqRecord(seq=Seq('TCTATTTATTTAGCAATAATAGAGAAAGCATTTAAGAGAATAAAGCAATGGAAA...AAA'), id='AH001475.2', name='AH001475', description='Homo sapiens beta-globin gene, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('ACCTCCTATTTGACACCACTGATTACCCCATTGATAGTCACACTTTGGGTTGTA...ATC'), id='L26478.1', name='HUMBETGLOR', description='Human haplotype D3 beta-globin gene, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('ACCTCCTATTTGACACCACTGATTACCCCATTGATAGTCACACTTTGGGTTGTA...ATC'), id='L26477.1', name='HUMBETGLOP', description='Human haplotype D2 beta-globin gene, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('ACCTCCTATTTGACACCACTGATTACCCCATTGATAGTCACACTTTGGGTTGTA...ATC'), id='L26476.1', name='HUMBETGLOO', description='Human haplotype D1 beta-globin gene, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('ACCTCCTATTTGACACCACTGATTACCCCATTGATAGTCACACTTTGGGTTGTA...ATC'), id='L26475.1', name='HUMBETGLON', description='Human haplotype C2 beta-globin gene, complete cds', dbxrefs=[]), SeqRecord(seq=Seq('ACCTCCTATTT

<h4 style="color:#8F091A">2.6.2 Write the filtered records to a FASTA file</h4>

In [134]:
out_handler = open('./filter_globin.fasta', 'w+')
out_handler.write('')
out_handler.close()

with open('./filter_globin.fasta','a') as docEntriesFasta:
    for gbRecord in gbs:
        SeqIO.write(gbRecord, docEntriesFasta, "fasta")

In [135]:
%%bash
head ./filter_globin.fasta

>AH001475.2 Homo sapiens beta-globin gene, complete cds
TCTATTTATTTAGCAATAATAGAGAAAGCATTTAAGAGAATAAAGCAATGGAAATAAGAA
ATTTGTAAATTTCCTTCTGATAACTAGAAATAGAGGATCCAGTTTCTTTTGGTTAACCTA
AATTTTATTTCATTTTATTGTTTTATTTTATTTTATTTTATTTTATTTTATTTTGTGTAA
TCGTAGTTTCAGAGTGTTAGAGCTGAAAGGAAGAAGTAGGAGAAACATGCAAAGTAAAAG
TATAACACTTTCCTTACTAAACCGACATGGGTTTCCAGGTAGGGGCAGGATTCAGGATGA
CTGACAGGGCCCTTAGGGAACACTGAGACCCTACGCTGACCTCATAAATGCTTGCTACCT
TTGCTGTTTTAATTACATCTTTTAATAGCAGGAAGCAGAACTCTGCACTTCAAAAGTTTT
TCCTCACCTGAGGAGTTAATTTAGTACAAGGGGAAAAAGTACAGGGGGATGGGAGAAAGG
CGATCACGTTGGGAAGCTATAGAGAAAGAAGAGTAAATTTTAGTAAAGGAGGTTTAAACA


<h4 style="color:#8F091A">2.6.3 Select the records whose id starts with L</h4>

In [136]:
for gbRecord in gbs:
    if (gbRecord.id).startswith("L"):
        print(gbRecord.id)

L26478.1
L26477.1
L26476.1
L26475.1
L26474.1
L26473.1
L26472.1
L26471.1
L26470.1
L26469.1
L26468.1
L26467.1
L26466.1
L26465.1
L26464.1
L26463.1
L26462.1


<h2 style="color:#8F091A">3 Working with sequences: transcription and translation</h2>
<h3 style="color:#8F091A">3.1 Extract the sickle cell sequence</h3>

In [137]:
seqSickle = sickleCell.seq
print(seqSickle)


ATGGTNCAYYTNACNCCNGTGGAGAAGTCYGCYGTNACNGCNCTNTGGGGYAAGGTNAAYGTGGATGAAGYYGGYGGYGAGGCCCTGGGCAGNCTGCTNGTGGTCTACCCTTGGACCCAGAGGTTCTTNGANTCNTTYGGGGATCTGNNNACNCCNGANGCAGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTNCAYGTGGATCCTGAGAACTTCAGGCTNCTNGGCAACGTGYTNGTCTGYGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCANGCNGCCTATCAGAAAGTGGTNGCTGGTGTNGCTAATGCCCTGGCCCACAAGTATCACTAAGCTNGCYTTYTTGYTGTCCAATTT


<h3 style="color:#8F091A">3.2 Transcribe DNA to RNA</h3>

In [138]:
rnaSickle = seqSickle.transcribe()
print(rnaSickle)

AUGGUNCAYYUNACNCCNGUGGAGAAGUCYGCYGUNACNGCNCUNUGGGGYAAGGUNAAYGUGGAUGAAGYYGGYGGYGAGGCCCUGGGCAGNCUGCUNGUGGUCUACCCUUGGACCCAGAGGUUCUUNGANUCNUUYGGGGAUCUGNNNACNCCNGANGCAGUUAUGGGCAACCCUAAGGUGAAGGCUCAUGGCAAGAAAGUGCUCGGUGCCUUUAGUGAUGGCCUGGCUCACCUGGACAACCUCAAGGGCACCUUUGCCACACUGAGUGAGCUGCACUGUGACAAGCUNCAYGUGGAUCCUGAGAACUUCAGGCUNCUNGGCAACGUGYUNGUCUGYGUGCUGGCCCAUCACUUUGGCAAAGAAUUCACCCCACCAGUGCANGCNGCCUAUCAGAAAGUGGUNGCUGGUGUNGCUAAUGCCCUGGCCCACAAGUAUCACUAAGCUNGCYUUYUUGYUGUCCAAUUU


<h3 style="color:#8F091A">3.3 Translate RNA to protein</h3>

In [139]:
protSickle = rnaSickle.translate(to_stop=True)
print(protSickle)

MVHXTPVEKSAVTALWGKVNVDEXGGEALGXLLVVYPWTQRFXXSFGDLXTPXAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVXVCVLAHHFGKEFTPPVXAAYQKVVAGVANALAHKYH
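
The `X` residues in this protein come from ambiguous bases (`N`, `Y`) in the record. The following sketch illustrates the idea in plain Python and is not Biopython's implementation: a codon is expanded into every concrete codon it could stand for, and an amino acid is reported only when all of them agree. It works directly on DNA codons, so the transcription step is skipped; `translate_codon` and `translate` are illustrative names, and only the IUPAC codes that occur in this record are handled.

```python
from itertools import product

BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
# Standard genetic code, keyed by DNA codon
CODON_TABLE = {''.join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
# Assumption: only the IUPAC codes present in the sickle cell record are listed
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T', 'Y': 'CT', 'N': 'ACGT'}

def translate_codon(codon):
    """Return the amino acid, or 'X' when the ambiguity changes the result."""
    options = {CODON_TABLE[''.join(c)] for c in product(*(IUPAC[b] for b in codon))}
    return options.pop() if len(options) == 1 else 'X'

def translate(dna, to_stop=True):
    """Translate whole codons from the start of dna, optionally up to a stop."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = translate_codon(dna[i:i + 3])
        if aa == '*' and to_stop:
            break
        protein.append(aa)
    return ''.join(protein)

# First nine codons of the sickle cell and healthy coding sequences
sickle_start = 'ATGGTNCAYYTNACNCCNGTGGAGAAG'
healthy_start = 'ATGGTGCATCTGACTCCTGAGGAGAAG'
print(translate(sickle_start))   # MVHXTPVEK
print(translate(healthy_start))  # MVHLTPEEK
```

`GTN` still translates to `V` because all four possible codons encode valine, whereas `YTN` covers both leucine and phenylalanine codons and is therefore reported as `X`.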


<h3 style="color:#8F091A">3.4 Extract the healthy beta-globin entry L26462</h3>

In [140]:
with open ('./L26462.gb', 'w+') as handler:
    for gbRecord in SeqIO.parse('./entries.gb', 'genbank'):
        access = gbRecord.annotations['accessions']
        if access[0] == 'L26462':
            L26462 = gbRecord
            handler.write(str(L26462))
        

In [141]:
%%bash
less ./L26462.gb

ID: L26462.1
Name: HUMBETGLOA
Description: Human haplotype C4 beta-globin gene, complete cds
Number of features: 26
/molecule_type=DNA
/topology=linear
/data_file_division=PRI
/date=26-AUG-1994
/accessions=['L26462']
/sequence_version=1
/keywords=['beta-globin']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Molecular and population genetic analysis of allelic sequence diversity at the human beta-globin locus', ...)]
Seq('ACCTCCTATTTGACACCACTGATTACCCCATTGATAGTCACACTTTGGGTTGTA...ATC')

<h3 style="color:#8F091A">3.5 List the sequence features</h3>

In [142]:
print(L26462.features)

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(3002), strand=1), type='source'), SeqFeature(FeatureLocation(ExactPosition(110), ExactPosition(111), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(262), ExactPosition(263), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(272), ExactPosition(273), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(285), ExactPosition(287), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(287), ExactPosition(288), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(294), ExactPosition(296), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(346), ExactPosition(347), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(475), ExactPosition(476), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(499), ExactPosition(500), strand=1), type='variation'), SeqFeature(CompoundLocation([Feat

<h3 style="color:#8F091A">3.6 Extract and concatenate the exons</h3>

In [143]:
completeExon = ''
for feature in L26462.features:
    if feature.type == 'exon':
        print(feature)
    
        print('\tstart: ',feature.location.start)
        print('\tend: ',feature.location.end)
        
        print('\tnofuzzy_start: ',feature.location.nofuzzy_start)
        nfStart = feature.location.nofuzzy_start
        print('\tnofuzzy_end: ',feature.location.nofuzzy_end)
        nfEnd = feature.location.nofuzzy_end
        
        exon = L26462.seq[nfStart:nfEnd]
        print(exon, end='\n--------------------\n')
        completeExon += exon

print(completeExon)

type: exon
location: [<865:957](+)
qualifiers:
    Key: number, Value: ['1']

	start:  <865
	end:  957
	nofuzzy_start:  865
	nofuzzy_end:  957
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
--------------------
type: exon
location: [1087:1310](+)
qualifiers:
    Key: number, Value: ['2']

	start:  1087
	end:  1310
	nofuzzy_start:  1087
	nofuzzy_end:  1310
GCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGG
--------------------
type: exon
location: [2160:>2289](+)
qualifiers:
    Key: number, Value: ['3']

	start:  2160
	end:  >2289
	nofuzzy_start:  2160
	nofuzzy_end:  2289
CTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA
--------------------
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGC
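
A quick sanity check on the exon coordinates printed above is to confirm that the joined exons form a whole number of codons. The sketch below uses only those coordinates; `join_exons` is a hypothetical helper, and the toy sequence is invented purely to show the slicing.

```python
# Exon coordinates as printed above (0-based start, exclusive end)
exon_locations = [(865, 957), (1087, 1310), (2160, 2289)]

def join_exons(sequence, locations):
    """Concatenate the slices of sequence given by (start, end) pairs."""
    return ''.join(sequence[start:end] for start, end in locations)

exon_lengths = [end - start for start, end in exon_locations]
coding_length = sum(exon_lengths)
print(exon_lengths, coding_length)

# A complete coding sequence is a whole number of codons
assert coding_length % 3 == 0
# One codon is the stop codon, so the protein is one residue shorter
print('expected protein length:', coding_length // 3 - 1)

# Toy sequence (hypothetical): lowercase stands for non-coding bases
toy = 'aaATGGTGccccCATCTGttTAA'
print(join_exons(toy, [(2, 8), (12, 18), (20, 23)]))
```

The three exons add up to 444 bases, that is 148 codons including the stop codon, which is consistent with a 147-residue beta-globin protein.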

<h3 style="color:#8F091A">3.7 Transcribe and translate the healthy beta-globin sequence</h3>

In [144]:
seqGlobin = completeExon
rnaGlobin = seqGlobin.transcribe()
print(rnaGlobin)
print('----------')
protGlobin = rnaGlobin.translate(to_stop=True)
print(protGlobin)

AUGGUGCAUCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGUGGAUGAAGUUGGUGGUGAGGCCCUGGGCAGGCUGCUGGUGGUCUACCCUUGGACCCAGAGGUUCUUUGAGUCCUUUGGGGAUCUGUCCACUCCUGAUGCUGUUAUGGGCAACCCUAAGGUGAAGGCUCAUGGCAAGAAAGUGCUCGGUGCCUUUAGUGAUGGCCUGGCUCACCUGGACAACCUCAAGGGCACCUUUGCCACACUGAGUGAGCUGCACUGUGACAAGCUGCACGUGGAUCCUGAGAACUUCAGGCUCCUGGGCAACGUGCUGGUCUGUGUGCUGGCCCAUCACUUUGGCAAAGAAUUCACCCCACCAGUGCAGGCUGCCUAUCAGAAAGUGGUGGCUGGUGUGGCUAAUGCCCUGGCCCACAAGUAUCACUAA
----------
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH


<h3 style="color:#8F091A">3.8 Compare the sickle cell and healthy sequences</h3>

In [145]:
print(seqSickle)
print('----------')
print(seqGlobin)

ATGGTNCAYYTNACNCCNGTGGAGAAGTCYGCYGTNACNGCNCTNTGGGGYAAGGTNAAYGTGGATGAAGYYGGYGGYGAGGCCCTGGGCAGNCTGCTNGTGGTCTACCCTTGGACCCAGAGGTTCTTNGANTCNTTYGGGGATCTGNNNACNCCNGANGCAGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTNCAYGTGGATCCTGAGAACTTCAGGCTNCTNGGCAACGTGYTNGTCTGYGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCANGCNGCCTATCAGAAAGTGGTNGCTGGTGTNGCTAATGCCCTGGCCCACAAGTATCACTAAGCTNGCYTTYTTGYTGTCCAATTT
----------
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA


In [146]:
print(protSickle)
print('----------')
print(protGlobin)

MVHXTPVEKSAVTALWGKVNVDEXGGEALGXLLVVYPWTQRFXXSFGDLXTPXAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVXVCVLAHHFGKEFTPPVXAAYQKVVAGVANALAHKYH
----------
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH


 JUSTIFY
 What differences do you see?
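
One way to approach this question is to compare the two proteins position by position and to separate real substitutions from positions that are merely ambiguous (`X`) in the sickle cell record. This is a sketch, not a definitive analysis: the two strings are copied from the output above, and `compare_proteins` is a hypothetical helper.

```python
# Protein strings copied from the output above
prot_sickle = 'MVHXTPVEKSAVTALWGKVNVDEXGGEALGXLLVVYPWTQRFXXSFGDLXTPXAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVXVCVLAHHFGKEFTPPVXAAYQKVVAGVANALAHKYH'
prot_globin = 'MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH'

def compare_proteins(healthy, sickle):
    """Split mismatches into real substitutions and ambiguous (X) positions."""
    substitutions, ambiguous = [], []
    for position, (h, s) in enumerate(zip(healthy, sickle), start=1):
        if h == s:
            continue
        if 'X' in (h, s):
            ambiguous.append(position)
        else:
            substitutions.append((position, h, s))
    return substitutions, ambiguous

substitutions, ambiguous = compare_proteins(prot_globin, prot_sickle)
print('substitutions:', substitutions)
print('ambiguous positions:', ambiguous)
```

With these inputs the only unambiguous change is glutamate (E) to valine (V) at position 7, counting the initial methionine as position 1; every other mismatch involves an `X` and so says nothing about the actual residue.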

Optional 3.9

<h2 style="color:#8F091A">4 Pattern matching on DNA: restriction sites</h2>
<h3 style="color:#8F091A">4.1 Retrieve a single GenBank entry</h3>

<h3 style="color:#8F091A">4.2 Find the start codon</h3>

In [147]:
match = re.search('ATG', str(seqGlobin))
if match:
    print(match.start(),'-', match.end()-1)

0 - 2


<h3 style="color:#8F091A">4.3 Search for the restriction sites of one enzyme</h3>

In [148]:
def CutDNA(restEnz, seq):
    for m in re.finditer(restEnz, str(seq)):
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

Ddel = re.compile('[ATGC]{2}CT[ATGC]AG')

print('Cuts in the healthy sequence')
CutDNA(Ddel, str(seqGlobin))

print('\nCuts in the sickle cell sequence')
CutDNA(Ddel, str(seqSickle))



Cuts in the healthy sequence
14-21: TCCTGAG
173-180: CCCTAAG
262-269: CACTGAG
299-306: TCCTGAG

Cuts in the sickle cell sequence
173-180: CCCTAAG
262-269: CACTGAG
299-306: TCCTGAG
438-445: CACTAAG


<h3 style="color:#8F091A">4.4 Search for the restriction sites of several enzymes</h3>

In [149]:
reDb = {'Ddel': re.compile('[ATGC]{2}CT[ATGC]AG'),
        'HinfI': re.compile('GT[ATGC]{2}AC'),
        'BceAI': re.compile('ACGGC[ATGC]{13}'),
        'BseRI': re.compile('GAGGAG[ATGC]{10}'),
        'EcoRI': re.compile('GAATTC'),
        'MstII': re.compile('CCT[ATGC]{1}AGG')}

In [150]:
for enzyme in reDb:
    print(enzyme)
    print('Cuts in the healthy sequence')
    CutDNA(reDb[enzyme], str(seqGlobin))

    print('\nCuts in the sickle cell sequence')
    CutDNA(reDb[enzyme], str(seqSickle))
    print('------------\n')

Ddel
Cuts in the healthy sequence
14-21: TCCTGAG
173-180: CCCTAAG
262-269: CACTGAG
299-306: TCCTGAG

Cuts in the sickle cell sequence
173-180: CCCTAAG
262-269: CACTGAG
299-306: TCCTGAG
438-445: CACTAAG
------------

HinfI
Cuts in the healthy sequence
54-60: GTGAAC
102-108: GTCTAC
146-152: GTCCAC

Cuts in the sickle cell sequence
102-108: GTCTAC
------------

BceAI
Cuts in the healthy sequence

Cuts in the sickle cell sequence
------------

BseRI
Cuts in the healthy sequence
18-34: GAGGAGAAGTCTGCCG

Cuts in the sickle cell sequence
------------

EcoRI
Cuts in the healthy sequence
363-369: GAATTC

Cuts in the sickle cell sequence
363-369: GAATTC
------------

MstII
Cuts in the healthy sequence
15-22: CCTGAGG
174-181: CCTAAGG

Cuts in the sickle cell sequence
174-181: CCTAAGG
------------



JUSTIFY
Which restriction enzyme could you use to specifically identify carriers of the sickle cell anemia gene?
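
A useful diagnostic enzyme is one whose site is present in one allele and absent in the other. The sketch below compares site positions between two short fragments. It rests on stated assumptions: `healthy` is the first 30 bases of the healthy coding sequence printed earlier, `sickle` is the same fragment with the single A to T change and no ambiguous bases, and `site_to_regex` and `find_sites` are hypothetical helpers. The MstII site `CCTNAGG` corresponds to the pattern already stored in `reDb`.

```python
import re

IUPAC_REGEX = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T', 'N': '[ACGT]'}

def site_to_regex(site):
    """Turn a recognition site such as 'CCTNAGG' into a compiled regex."""
    return re.compile(''.join(IUPAC_REGEX[base] for base in site))

def find_sites(pattern, sequence):
    """Start positions (0-based) of every non-overlapping match."""
    return [m.start() for m in pattern.finditer(sequence)]

# First 30 bases of the healthy coding sequence printed above
healthy = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCT'
# Assumption: the same fragment carrying only the A->T sickle cell change
sickle = 'ATGGTGCATCTGACTCCTGTGGAGAAGTCT'

mstII = site_to_regex('CCTNAGG')
healthy_sites = find_sites(mstII, healthy)
sickle_sites = find_sites(mstII, sickle)
print('healthy:', healthy_sites)
print('sickle: ', sickle_sites)
print('lost in sickle:', sorted(set(healthy_sites) - set(sickle_sites)))
```

When reading the enzyme-by-enzyme output above, keep in mind that the sickle cell record contains `N` and `Y` codes, which the `[ATGC]` character classes cannot match, so some missing sites may reflect ambiguity in the record and not a real sequence difference.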

<h3 style="color:#8F091A">4.5 Fill in the unknown bases of the sickle cell sequence</h3>

In [151]:
cleanSickleSeq = ''
for i in range(0, len(seqGlobin)):
    if seqSickle[i] == 'N':
        cleanSickleSeq += seqGlobin[i]
    else:
        cleanSickleSeq += seqSickle[i]

print(cleanSickleSeq)

ATGGTGCAYYTGACTCCTGTGGAGAAGTCYGCYGTTACTGCCCTGTGGGGYAAGGTGAAYGTGGATGAAGYYGGYGGYGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTYGGGGATCTGTCCACTCCTGATGCAGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCAYGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGYTGGTCTGYGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA
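
The same replacement can be written more compactly with `zip`, which pairs the bases of both sequences and stops at the shorter one, much as the loop above stops at `len(seqGlobin)`. This is a sketch on short prefixes of the two sequences; `fill_unknown` is a hypothetical helper name.

```python
def fill_unknown(sequence, reference, unknown='N'):
    """Replace every `unknown` base in sequence by the base at the same
    position in reference; other IUPAC codes (such as Y) are left untouched."""
    return ''.join(ref if base == unknown else base
                   for base, ref in zip(sequence, reference))

# Prefixes of the two sequences printed earlier
sickle_prefix = 'ATGGTNCAYYTNACNCCNGTGGAG'
healthy_prefix = 'ATGGTGCATCTGACTCCTGAGGAG'

cleaned = fill_unknown(sickle_prefix, healthy_prefix)
print(cleaned)  # ATGGTGCAYYTGACTCCTGTGGAG
```

Only `N` is replaced, so the `Y` codes remain in the result, as they do in the output above; resolving them would need a further assumption about which pyrimidine is present.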
