Let's try to read and analyze the genome of E. coli.
We will start with something simple: computing the AT and GC content. This is historically relevant because, before sequencing became available, people could easily measure AT and GC content with biochemistry and characterize different organisms by it. For example, it was discovered that some parasites tend to have AT-rich genomes, while thermophiles tend to have GC-rich genomes. A G-C pair is held together by three hydrogen bonds and an A-T pair by only two, so GC binds more tightly, and in a high-temperature environment the tighter bonds make DNA more stable.

Today we will also revisit classes. We will create a class 'Genome' that reads a fasta file and has a method to calculate the AT:GC content. Unlike the top-down approach on Monday, we will design this class bottom-up: first plain functions that work, then a class that wraps them.

In [None]:
#First, let's compute the AT:GC content of a string

def at_gc(st):
    at = 0 # counter for at content
    gc = 0 # counter for gc content
    st = st.upper()
    for ch in st:
        if ch == 'G' or ch == 'C': 
            gc += 1 # gc = gc + 1
        if ch == 'A' or ch == 'T': 
            at += 1
    return (at, gc) # return a tuple

In [None]:
st = 'GATACCA'
print(at_gc(st))

# what type is the result? integer? string? list? tuple?
#print(type(at_gc(st)))
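
As a cross-check on the loop above, the same counts can be obtained with the built-in `str.count`. This is only a sketch; the name `at_gc_count` is made up here and is not part of the lecture code.

```python
def at_gc_count(st):
    """Hypothetical alternative to at_gc that uses str.count instead of a loop."""
    st = st.upper()
    at = st.count('A') + st.count('T')  # number of A and T characters
    gc = st.count('G') + st.count('C')  # number of G and C characters
    return at, gc                        # still a tuple, like at_gc

print(at_gc_count('GATACCA'))
```

Any other character (for example an 'N' for an unknown base) is ignored by both versions.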

In [None]:
st = 'GATACCA'
at, gc = at_gc(st) # tuple unpacking
print(at, gc)


In [None]:
result = at_gc(st)
print(result)

# tuples are immutable
print(result[0])
result[0] = 5 # change element in index 0 to 5. what happens?
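
The assignment above fails with a TypeError because tuples cannot be changed in place. A small sketch of what happens, and of one way to get a "changed" tuple by building a new one (the variable names `message` and `new_result` are only for illustration):

```python
result = (4, 3)  # the (at, gc) tuple that at_gc('GATACCA') returns

try:
    result[0] = 5            # not allowed: tuples are immutable
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# Build a new tuple instead of modifying the old one
new_result = (5,) + result[1:]
print(new_result)
```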

In [None]:
#Let's now make a function that reads a fasta file and creates a string with the genome
def read_fasta(file_name):
    genome=""
    f = open(file_name, 'rt') # 't' stands for text; the other option is a binary file, 'b'
    ln = f.readline()
    while ln: # at the end of the file ln = '', which evaluates as False
        if ln[0] != '>':
            genome += ln
        ln = f.readline() # data for next iteration
    f.close()
    return genome


In [None]:
read_fasta("ecoli_genome.fasta")

In [None]:
#Let's strip the annoying newline character, \n.
def read_fasta(file_name):
    genome=""
    f = open(file_name, 'rt') # 't' stands for text; the other option is a binary file, 'b'
    ln = f.readline()
    while ln: # at the end of the file ln = '', which evaluates as False
        if ln[0] != '>':
            genome += ln.strip()
        ln = f.readline() # data for next iteration
    f.close()
    return genome

In [None]:
read_fasta("ecoli_genome.fasta")
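
The same reading logic can also be written with a `with` block, which closes the file automatically, and with a list that is joined once at the end. This is a sketch, not the version used in the rest of the notebook: the name `read_fasta_with` and the tiny temporary file are made up so the example runs without `ecoli_genome.fasta`.

```python
import os
import tempfile

def read_fasta_with(file_name):
    """Hypothetical variant of read_fasta: returns the sequence as one string."""
    chunks = []
    with open(file_name, 'rt') as f:      # the file is closed when the block ends
        for ln in f:                       # iterate over the lines of the file
            if not ln.startswith('>'):     # skip the header line
                chunks.append(ln.strip())  # drop the trailing newline
    return ''.join(chunks)

# Write a tiny made-up fasta file so the sketch is self-contained
with tempfile.NamedTemporaryFile('wt', suffix='.fasta', delete=False) as tmp:
    tmp.write('>toy_sequence\nGATA\nCCA\n')
    toy_path = tmp.name

toy_genome = read_fasta_with(toy_path)
print(toy_genome)
os.remove(toy_path)  # clean up the temporary file
```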

In [None]:
ecoli_genome=read_fasta("ecoli_genome.fasta")

In [None]:
at_gc(ecoli_genome)

In [None]:
#Let's make a class Genome
class Genome:
    """This is a class to read, store and analyze a genome."""
    def __init__(self, fasta_filename):
        self.genome=""
        self.read_fasta(fasta_filename) # read_fasta fills self.genome itself and returns None, so we do not assign its result
            
    def read_fasta(self, fasta_filename):
        self.genome=""
        f = open(fasta_filename, 'rt') # 't' stands for text; the other option is a binary file, 'b'
        ln = f.readline()
        
        while ln: # at the end of the file ln = '', which evaluates as False
            if ln[0] != '>':
                self.genome += ln.strip()
            ln = f.readline() # data for next iteration
            
        f.close()


In [None]:
ecoli=Genome("ecoli_genome.fasta")
ecoli.genome

In [None]:
ecoli.read_fasta("ecoli_genome.fasta")
ecoli.genome

In [1]:
#Let's make a class Genome
class Genome:
    """This is a class to read, store and analyze a genome."""
    #def __init__(self):
    #    self.genome=""
            
    def read_fasta(self, fasta_filename):
        self.genome=""
        f = open(fasta_filename, 'rt') # 't' stands for text; the other option is a binary file, 'b'
        ln = f.readline()
        
        while ln: # at the end of the file ln = '', which evaluates as False
            if ln[0] != '>':
                self.genome += ln.strip()
            ln = f.readline() # data for next iteration
            
        f.close()
        
    def at_gc(self):
        self.at = 0 # counter for at content
        self.gc = 0 # counter for gc content
        self.genome = self.genome.upper()
        for ch in self.genome:
            if ch == 'G' or ch == 'C': 
                self.gc += 1 # gc = gc + 1
            if ch == 'A' or ch == 'T': 
                self.at += 1
        return self.at, self.gc 


In [2]:
ecoli=Genome()

In [None]:
ecoli.genome # read_fasta has not been called yet, so the attribute does not exist

AttributeError: 'Genome' object has no attribute 'genome'
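
The AttributeError comes from the fact that, without an `__init__`, the attribute `genome` only exists after a method has assigned it. A toy sketch of the same situation (the class `Demo` and its method `load` are made up for illustration):

```python
class Demo:
    """Hypothetical toy class: the attribute is created only inside load()."""
    def load(self):
        self.genome = "GATACCA"

d = Demo()
before = hasattr(d, 'genome')  # attribute does not exist yet
d.load()
after = hasattr(d, 'genome')   # attribute was created by load()
print(before, after)
```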

In [3]:
ecoli.read_fasta("ecoli_genome.fasta")

In [4]:
ecoli.genome

'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGAT

In [5]:
ecoli.at_gc()

(2284124, 2357528)
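
The two counts are easier to interpret as a fraction. A minimal sketch using the numbers printed above (the variable names are only for illustration):

```python
at, gc = 2284124, 2357528   # the counts returned by ecoli.at_gc() above

total = at + gc             # number of A, T, G and C characters counted
gc_fraction = gc / total    # fraction of the genome that is G or C

print(f"GC content: {gc_fraction:.2%}")
```

For this file the GC content comes out at roughly 50.8%, so the E. coli genome is close to balanced between AT and GC.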

In [9]:
#Let's read the fasta file as soon as the object is created, by calling read_fasta from __init__

class Genome:
    """This is a class to read, store and analyze a genome."""
    def __init__(self, fasta_filename):
        self.genome=""
        self.read_fasta(fasta_filename)
            
    def read_fasta(self, f_filename):
        self.genome=""
        f = open(f_filename, 'rt') # 't' stands for text; the other option is a binary file, 'b'
        ln = f.readline()
        
        while ln: # at the end of the file ln = '', which evaluates as False
            if ln[0] != '>':
                self.genome += ln.strip()
            ln = f.readline() # data for next iteration
            
        f.close()
        
    def at_gc(self):
        self.at = 0 # counter for at content
        self.gc = 0 # counter for gc content
        self.genome = self.genome.upper()
        for ch in self.genome:
            if ch == 'G' or ch == 'C': 
                self.gc += 1 # gc = gc + 1
            if ch == 'A' or ch == 'T': 
                self.at += 1
        return self.at, self.gc 


In [10]:
ecoli2=Genome("ecoli_genome.fasta")

In [11]:
ecoli2.genome

'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGAT