# Python for Bioinformatics

Python is a general-purpose programming language. To install it, visit the [download page](https://www.python.org/downloads/); on Windows, run the installer and Python will be available on the system. Alternatively, install [Miniconda](https://docs.conda.io/en/latest/miniconda.html#windows-installers) or [Anaconda](https://www.anaconda.com/products/distribution).

## Algorithm problems in Python
### Dynamic programming

The Fibonacci numbers $0,1,1,2,3,5,8,13,21,34,…$ are generated by the following simple rule:

$F_n = \left\{ \begin{matrix} F_{n-1} + F_{n-2}, & \mbox{n > 1,} \\ 1, & \mbox{n = 1,} \\ 0, & \mbox{n = 0.}\end{matrix}\right.$


$\color{green}{\textbf{Given}}$: A positive integer $n≤25$.

$\color{green}{\textbf{Return}}$: The value of $F_n$.


### Functions in Python

In [93]:
def funcion(a,b,c,d):
    return a + b + c + d

In [96]:
funcion(1,2,3,4)

10

### Recursive programming: computational cost
$O(2^n)$

In [2]:
def Fibonnacci(n):
    if n == 1:
        return 1
    elif n == 0:
        return 0
    else:
        return Fibonnacci(n-1) + Fibonnacci(n-2)

In [10]:
Fibonnacci(39)

63245986
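
The naive recursion solves the same subproblems over and over, so the number of function calls grows exponentially with `n`. A minimal sketch that counts those calls; the helper name `fib_calls` is introduced here for illustration and is not part of the original notebook:

```python
def fib_calls(n):
    """Return (F_n, number of calls made by the naive recursion)."""
    if n < 2:
        return n, 1
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    # One call for this frame plus all calls made below it
    return a + b, calls_a + calls_b + 1

for n in (10, 20, 25):
    value, calls = fib_calls(n)
    print(n, value, calls)
```

Going from `n = 10` to `n = 20` multiplies the number of calls by more than a hundred, which is why `Fibonnacci(39)` takes seconds rather than microseconds.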

### Measuring the running time of the recursive version


In [112]:
from datetime import datetime

start_time = datetime.now()

Fibonnacci(39)

end_time = datetime.now()

print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:12.516821
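
`datetime.now()` reads the wall clock. For timing short pieces of code, the standard library also provides `time.perf_counter()`, a monotonic high-resolution counter. A minimal sketch, assuming a small wrapper named `timed` (a hypothetical helper, not part of the original notebook):

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Toy usage: time the built-in sum over a million integers
result, seconds = timed(sum, range(1_000_000))
print(result, f"{seconds:.6f} s")
```

The same wrapper can be applied to any function in this notebook, for example `timed(Fibonnacci, 25)`.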


### Dynamic programming
$O(n)$

In [8]:
def FibonnacciDinamico(n):
    f=[0,1]
    for i in range(2,n+1):
        f.append(f[i-1]+f[i-2])
    return f[n]

In [15]:
FibonnacciDinamico(30)

832040

In [14]:
from datetime import datetime

start_time = datetime.now()

FibonnacciDinamico(10000)

end_time = datetime.now()

print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:00.006151
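
The list-based version stores every intermediate value, although each step only needs the previous two. A sketch of the same bottom-up idea that keeps just two variables; the name `fibonacci_two_vars` is an assumption made for this example:

```python
def fibonacci_two_vars(n):
    """Bottom-up Fibonacci that keeps only the last two values."""
    previous, current = 0, 1
    for _ in range(n):
        # Slide the window one position: (F_k, F_k+1) -> (F_k+1, F_k+2)
        previous, current = current, previous + current
    return previous

print(fibonacci_two_vars(30))
print(fibonacci_two_vars(39))
```

The number of additions is still linear in `n`; only the extra memory for the list is removed.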


## Algorithm problems in bioinformatics

![](http://biopython.org/assets/images/biopython_logo_xs.png)

`Biopython` is a collection of tools and programs for working and computing with biological data, developed by an international community.

In [27]:
Dna = "AGCCTGCGAC"
type(Dna)

str

In [26]:
print(Dna[3:7][::-1])

CGTC


Biopython relies on object-oriented programming (OOP): a sequence is a `Seq` object rather than a plain string.

In [154]:
from Bio.Seq import Seq
sequence = Seq("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC")
print(sequence)
type(sequence)

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC


Bio.Seq.Seq

## Problem: Counting DNA nucleotides

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

>**Given**: A DNA string s of length at most 1000 nt.

>**Return**: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Sample Dataset

`AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC`

Sample Output

`20` `12` `17` `21`
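
The four counts asked for in the problem can be obtained directly from the string. A minimal sketch using `collections.Counter`; the function name `count_nucleotides` is an assumption made for this example:

```python
from collections import Counter

def count_nucleotides(s):
    """Return the counts of A, C, G and T in s, separated by spaces."""
    counts = Counter(s)
    # Counter returns 0 for symbols that never occur
    return " ".join(str(counts[base]) for base in "ACGT")

sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(count_nucleotides(sample))
```

Run on the sample dataset, this should print the four numbers of the sample output in the order A, C, G, T.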

In [29]:
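def GCcontent(dna):
    """Return the GC content of dna as a percentage."""
    G = 0
    C = 0
    for i in dna:
        if i == "G":
            G += 1
        elif i == "C":
            C += 1
    # Parentheses matter: G + C / len(dna) would divide only C
    return (G + C) / len(dna) * 100

In [35]:
dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(round(GCcontent(dna),2),"%")

41.43 %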


In [42]:
def GCcontent2(dna):
    # Add the two counts first, then divide, and report a percentage
    print(round((dna.count("G") + dna.count("C")) / len(dna) * 100, 2))

In [43]:
GCcontent2(dna)

41.43


## Problem: Reverse Complementing a Strand of DNA

In DNA strings, symbols `A` and `T` are complements of each other, as are `C` and `G`.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of `GTCA` is `TGAC`).

>$\color{green}{\textbf{Given}}$: A DNA string s of length at most 1000 bp.

>$\color{red}{\textbf{Return}}$: The reverse complement $s^c$ of $s$.

Sample Dataset

`AAAACCCGGT`

Sample Output

`ACCGGGTTTT`

### Manual solution

In [45]:
def DNAcomplement(DNA):
    temp = DNA.replace("A","t").replace("C","g").replace("T","a").replace("G","c")
    return temp

In [47]:
print(DNAcomplement("AAAACCCGGT")[::-1].upper())

ACCGGGTTTT
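
Chaining `replace` calls needs the lowercase trick so that a base is not swapped twice. An alternative sketch that maps all four symbols in a single pass with `str.maketrans`; the names `COMPLEMENT` and `reverse_complement` are assumptions made for this example:

```python
# Translation table: each base is mapped to its complement in one pass
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAACCCGGT"))
print(reverse_complement("GTCA"))
```

Because the table is applied to every character at once, there is no need to go through lowercase letters and call `upper()` afterwards.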


### Solution with Biopython

In [48]:
from Bio.Seq import Seq

In [49]:
sequence= Seq("AAAACCCGGT")

In [50]:
print(sequence.reverse_complement())

ACCGGGTTTT


### Transcription

Given a DNA strand, we want to find its mRNA.

In [159]:
dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

In [52]:
def transcript(DNA):
    dnaC = DNAcomplement(DNA)
    temp = dnaC.replace("t","u")
    return temp

In [56]:
rna = transcript(dna).upper()  # this is a plain Python str

### Solution with Biopython


In [57]:
rnaBio = Seq(rna)

In [64]:
print(rnaBio.translate(table="Vertebrate Mitochondrial", stop_symbol=False))

SKSKTDVARYTETHLIFFSQTIV


In [168]:
dna.replace("T", "U")  # the mRNA reads like the coding strand with T replaced by U

'AGCUUUUCAUUCUGACUGCAACGGGCAAUAUGUCUCUGUGUGGAUUAAAAAAAGAGUGUCUGAUAGCAGC'
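
Which operation yields the mRNA depends on which strand is given. A minimal sketch of both cases, assuming every strand is written 5' to 3'; the function names are hypothetical and not part of the original notebook:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe_coding(dna):
    """mRNA from the coding strand: same sequence with T replaced by U."""
    return dna.replace("T", "U")

def transcribe_template(dna):
    """mRNA from the template strand: reverse complement, then T -> U."""
    coding = dna.translate(COMPLEMENT)[::-1]
    return coding.replace("T", "U")

# Toy example: "GGCCAT" is the reverse complement of "ATGGCC"
print(transcribe_coding("ATGGCC"))
print(transcribe_template("GGCCAT"))
```

Both calls describe the same double-stranded fragment, so they should print the same mRNA.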

## Translating DNA into protein

In [66]:
mapping = {1:"A",2:"B"}  # not named dict, which would shadow the built-in type

In [69]:
mapping.keys()

dict_keys([1, 2])

In [174]:
s = "ACAATGGCCATGGCGCCCAGAACTGAGATCAAATGTAGTACATGCTACATGCCTACATGCATGTACATGCTACATGCTACATGCTACATGCTACATGCTACATGCTACATGCGTATTAACGGGTGA"
codon = {
"UUU":"F","CUU":"L","AUU":"I","GUU":"V",
"UUC":"F","CUC":"L","AUC":"I","GUC":"V",
"UUA":"L","CUA":"L","AUA":"I","GUA":"V",
"UUG":"L","CUG":"L","AUG":"M","GUG":"V",
"UCU":"S","CCU":"P","ACU":"T","GCU":"A",
"UCC":"S","CCC":"P","ACC":"T","GCC":"A",
"UCA":"S","CCA":"P","ACA":"T","GCA":"A",
"UCG":"S","CCG":"P","ACG":"T","GCG":"A",
"UAU":"Y","CAU":"H","AAU":"N","GAU":"D",
"UAC":"Y","CAC":"H","AAC":"N","GAC":"D",
"UAA":"Stop","CAA":"Q","AAA":"K","GAA":"E",
"UAG":"Stop","CAG":"Q","AAG":"K","GAG":"E",
"UGU":"C","CGU":"R","AGU":"S","GGU":"G",
"UGC":"C","CGC":"R","AGC":"S","GGC":"G",
"UGA":"Stop","CGA":"R","AGA":"R","GGA":"G",
"UGG":"W","CGG":"R","AGG":"R","GGG":"G"} 

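rna = s.replace("T", "U")  # the codon table is keyed by RNA codons, so convert the DNA first

with open("outProtein.txt", "w") as out:
    for i in range(0, len(rna) - 2, 3):
        out.write(codon[rna[i:i+3]])  # direct dictionary lookup of each codon

with open("outProtein.txt") as f:
    print(f.read())

TMAMAPRTEIKCSTCYMPTCMYMLHATCYMLHATCYMRINGStop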


In [76]:
codon_table = 'KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV\0Y\0YSSSS\0CWCLFLF'
nucleo = {'A': 0, 'C': 1, 'G': 2, 'U': 3}

#f = open('rosalind_prot.txt', 'r')
seq = "AUCGUGCAUGCUGAG"
#f.close()

protein = []
for i in range(0, len(seq) - 2, 3):
    aa = codon_table[nucleo[seq[i]]*16 + nucleo[seq[i+1]]*4 + nucleo[seq[i+2]]]
    if aa == '\0':  # '\0' marks a stop codon: end the translation here
        break
    protein.append(aa)

print(''.join(protein))

IVHAE
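
The index expression `nucleo[a]*16 + nucleo[b]*4 + nucleo[c]` reads each codon as a three-digit number in base 4, with A, C, G, U as the digits 0 to 3. A sketch that makes this ordering explicit by pairing the same 64-character string with `itertools.product`; the name `lookup` is introduced here for illustration:

```python
from itertools import product

codon_table = 'KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV\0Y\0YSSSS\0CWCLFLF'

# product("ACGU", repeat=3) yields AAA, AAC, AAG, AAU, ACA, ...
# which is the same order as the base-4 index used above
lookup = {"".join(codon): aa
          for codon, aa in zip(product("ACGU", repeat=3), codon_table)}

print(lookup["AUG"], lookup["UGG"], repr(lookup["UAA"]))
```

The dictionary trades a little memory for readability: a codon is looked up by name, and the stop codons are the entries whose value is `'\0'`.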


In [81]:
from Bio.Seq import Seq
ADN = Seq('ACAATGGCCATGGCGCCCAGAACTGAGATCAAATGTAGTACATGCTACATGCCTACATGCATGTACATGCTACATGCTACATGCTACATGCTACATGCTACATGCTACATGCGTATTAACGGGTGA')
ADN.translate(to_stop=False)

Seq('TMAMAPRTEIKCSTCYMPTCMYMLHATCYMLHATCYMRING*')

### Connecting to biological databases

In [83]:
from Bio import ExPASy  # access to ExPASy protein resources
from Bio import SwissProt  # parser for Swiss-Prot protein records
from Bio import Entrez  # access to the NCBI databases

In [88]:
Entrez.email = "francisco.ascue@unmsm.edu.pe"
handle = Entrez.esearch(db="Nucleotide",term='"Severe acute respiratory syndrome coronavirus 2"[Organism] OR Sars cov 2[All Fields]', datetype="pdat", mindate='2020/1/25', maxdate='2020/1/30')
record = Entrez.read(handle)

In [92]:
listaIds = record['IdList']

#### Downloading multiple sequences

In [94]:
from Bio import Entrez

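Entrez.email = "francisco.ascue@unmsm.edu.pe"
# Open the output file once: reopening it with "w" inside the loop would keep only the last record
with open("QuerySars.fasta","w") as out:
    for i in listaIds:
        handle = Entrez.efetch(db="Nucleotide", id=i,rettype="fasta",retmode="text")
        out.write(handle.read())
        handle.close()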

In [9]:
from Bio import Entrez

for ID in ["JX398977","JX462669","JN573266","NM_001185098","JX317622","JX205496","JX460804","JX4626702","NM_001009148","NM_001079732"]:
    Entrez.email = "francisco.ascue@unmsm.edu.pe"
    handle = Entrez.efetch(db="Nucleotide", id=ID,rettype="fasta",retmode="text")
    print(handle.read())

>JX398977.1 Triticum aestivum iron-superoxide dismutase (FeSOD) mRNA, partial cds
CTCCTCCCAGTGCGCGGCCTCCCGGCCGCCCCTTCATTCGCCGCTCACCCACCCACACAGCTCCCCCGCG
CCGCCGCCCTCCCTCCTCGGTGCCCGGCGCCGCCGTCCGTCCCGGAGGCTCTCCAAGGTCGTGTCCTACT
ACGGCCTCACCACTCCGCCGTACAAAACCGACGCCCTGGAGCCGTACATGAGCAGGCGGGCGGTGGAGCT
GCACTGGGGCAAGCACCAGCAGGAGTACGTGGACGGGCTCAACAGGCAGCTCGCCATCAGCCCGCTCTAC
GGCCACACCCTCGAGGACCTCATCAAGGAGGCCTACAACAACGGCAACCCGCTGCCCGAGTACAACGACG
CGGCCGAGGTCTGGAACCATCACTTCTTCTGGGAATCGATGCAGCCGGAAGGCGGCGGCTCGCCCGAGGC
GGGCGTGCTGCAGCAGATCGAGAAGGATTTCGGCTCCTTTTTTAATTTCAGGGAGGAGTTCATGCGCTCG
GCGTTGTCGTTGTTGGGGTCTGGTTGGGTTTGGCTTGTCTTGAAGAGAAGCGAGAGGAAGCTCGAGGTGG
TTCACACCCGAAATGCTATCAACCCACTTGCTTTTGGGGATATTCCAATCATCAGCCTAGACTTGTGGGA
GCATGCTTACTACTTAGATTACAAGGATGACAGGCGAACATATGTGTCAAACTTTCTGGATCACCTTGTG
TCTTGGCATACTGTCACTGTACGCATGATGCGCGCGGAGGCTTTTGTTAACCTTGGTGAACCAACTATCC
CAGTGGCATGAGATGATACGGATATGACCTGGAGTTGTCTCTGAATATGTCGATCTCATGAGGGATGCCA
AGGAAAATCCTGAACCACATCCTACTGGATGAGACGGAGAGTTTGCGGGGGGTAGCGTGATGTAAC

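HTTPError: HTTP Error 400: Bad Request

The request most likely fails on `JX4626702`, which has one digit more than the other `JX` accessions in the list.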

#### Extracting regions from a FASTA sequence downloaded from NCBI

In [3]:
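from Bio import Entrez
from Bio import SeqIO

Entrez.email = "francisco.ascue@unmsm.edu.pe"
Covid = Entrez.efetch(db="Nucleotide", id="NC_045512.2",rettype="fasta",retmode="text")
SeqCovid = SeqIO.read(Covid,"fasta")  # parse the single FASTA record into a SeqRecord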


In [197]:
SeqCovid.seq[4000:4010]

Seq('ACTAAGTTCC')

#### Loading a file into Python with Biopython

In [1]:
from Bio import SeqIO
for sequence in SeqIO.parse("NC_045512.2.fasta", "fasta"):
    print(sequence.id)
    print(repr(sequence.seq))
    print(len(sequence))

NC_045512.2
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA')
29903


In [4]:
MultipleSeq = Entrez.efetch(db="Nucleotide", id=("JX398977","JX462669","JN573266"),rettype="fasta",retmode="text")

In [5]:
#count = SeqIO.write(records, output_file, "fasta")
type(MultipleSeq)

_io.TextIOWrapper

In [6]:
DataTr = MultipleSeq.read().split(sep="\n")

In [8]:
DataTr

['>JX398977.1 Triticum aestivum iron-superoxide dismutase (FeSOD) mRNA, partial cds',
 'CTCCTCCCAGTGCGCGGCCTCCCGGCCGCCCCTTCATTCGCCGCTCACCCACCCACACAGCTCCCCCGCG',
 'CCGCCGCCCTCCCTCCTCGGTGCCCGGCGCCGCCGTCCGTCCCGGAGGCTCTCCAAGGTCGTGTCCTACT',
 'ACGGCCTCACCACTCCGCCGTACAAAACCGACGCCCTGGAGCCGTACATGAGCAGGCGGGCGGTGGAGCT',
 'GCACTGGGGCAAGCACCAGCAGGAGTACGTGGACGGGCTCAACAGGCAGCTCGCCATCAGCCCGCTCTAC',
 'GGCCACACCCTCGAGGACCTCATCAAGGAGGCCTACAACAACGGCAACCCGCTGCCCGAGTACAACGACG',
 'CGGCCGAGGTCTGGAACCATCACTTCTTCTGGGAATCGATGCAGCCGGAAGGCGGCGGCTCGCCCGAGGC',
 'GGGCGTGCTGCAGCAGATCGAGAAGGATTTCGGCTCCTTTTTTAATTTCAGGGAGGAGTTCATGCGCTCG',
 'GCGTTGTCGTTGTTGGGGTCTGGTTGGGTTTGGCTTGTCTTGAAGAGAAGCGAGAGGAAGCTCGAGGTGG',
 'TTCACACCCGAAATGCTATCAACCCACTTGCTTTTGGGGATATTCCAATCATCAGCCTAGACTTGTGGGA',
 'GCATGCTTACTACTTAGATTACAAGGATGACAGGCGAACATATGTGTCAAACTTTCTGGATCACCTTGTG',
 'TCTTGGCATACTGTCACTGTACGCATGATGCGCGCGGAGGCTTTTGTTAACCTTGGTGAACCAACTATCC',
 'CAGTGGCATGAGATGATACGGATATGACCTGGAGTTGTCTCTGAATATGTCGATCTCATGAGGGATGCCA',
 'AGGAAAATCCTG

In [228]:
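# Write the first record: its header plus every sequence line up to the next header
with open("JX398977.1.fasta","w") as out:
    out.write(DataTr[0] + "\n")
    for line in DataTr[1:]:
        if line.startswith(">"):
            break
        out.write(line + "\n")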
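
Splitting the downloaded text by hand works because FASTA is a simple format: a header line starting with `>` followed by one or more sequence lines. A minimal sketch of a parser built on that rule; it is a simplification for illustration, not how `SeqIO` is implemented, and the name `parse_fasta` and the toy records are assumptions:

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string
    return {h: "".join(parts) for h, parts in records.items()}

example = ">seq1 first toy record\nACGT\nTTAA\n\n>seq2 second toy record\nGGCC\n"
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```

For real work `SeqIO.parse` remains the better choice, since it also builds `SeqRecord` objects with ids and descriptions.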

In [9]:
CoviSeq = Entrez.efetch(db="Nucleotide", id="NC_045512.2",rettype="fasta",retmode="text")

In [10]:
ReadCoviSeq = SeqIO.read(CoviSeq,"fasta")

In [14]:
ReadCoviSeq.seq[300:320]

Seq('ACACGTCCAACTCAGTTTGC')

In [234]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "francisco.ascue@unmsm.edu.pe"
handle = Entrez.efetch(db="Nucleotide", id="NC_045512.2",rettype="fasta",retmode="text")
CovidS = SeqIO.parse(handle,"fasta")

In [253]:
from Bio import ExPASy
from Bio import SwissProt
from Bio.SwissProt import KeyWList

In [7]:
handle = ExPASy.get_sprot_raw('Q5SLP9')
resumen = SwissProt.read(handle)


In [255]:
handle = ExPASy.get_sprot_raw('Q5SLP9')
records = KeyWList.parse(handle)
for record in records:
    print(record["GO"])

Ignoring: DT   29-MAR-2005, integrated into UniProtKB/Swiss-Prot.
Ignoring: DT   21-DEC-2004, sequence version 1.
Ignoring: DT   25-MAY-2022, entry version 98.
Ignoring: GN   Name=ssb; OrderedLocusNames=TTHA0244;
Ignoring: OS   Thermus thermophilus (strain ATCC 27634 / DSM 579 / HB8).
Ignoring: OC   Bacteria; Deinococcus-Thermus; Deinococci; Thermales; Thermaceae; Thermus.
Ignoring: OX   NCBI_TaxID=300852;
Ignoring: RN   [1]
Ignoring: RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA], AND SUBUNIT.
Ignoring: RX   PubMed=12368464; DOI=10.1099/00221287-148-10-3307;
Ignoring: RA   Dabrowski S., Olszewski M., Piatek R., Brillowska-Dabrowska A., Konopa G.,
Ignoring: RA   Kur J.;
Ignoring: RT   "Identification and characterization of single-stranded-DNA-binding
Ignoring: RT   proteins from Thermus thermophilus and Thermus aquaticus -- new arrangement
Ignoring: RT   of binding domains.";
Ignoring: RL   Microbiology 148:3307-3315(2002).
Ignoring: RN   [2]
Ignoring: RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
