<img src=images/ucsc_banner.png width=500>

## Natural Selection

<img src=images/natural_selection.png>
[src](http://evolution.berkeley.edu/evolibrary/article/evo_25)

## DNA
<img src=images/cell_dna.png>
[src](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Eukaryote_DNA-en.svg/2000px-Eukaryote_DNA-en.svg.png)
 
## Gene Structure
<img src=images/orf_start_stop.jpg>
[src](http://images.slideplayer.com/19/5750865/slides/slide_22.jpg)

<img src=images/centraldogma.jpg>
[src](http://www.lhsc.on.ca/_images/Genetics/centraldogma.jpg)


## Mutations

<img src=images/point_mutation.png width="250">
[src](http://rosalind.info/media/point_mutation.png)

<img src=images/mutation_types.png>
[src](https://upload.wikimedia.org/wikipedia/commons/6/69/Point_mutations-en.png)

<img src=images/block_mutations_med.jpeg>
[src](http://www.vce.bioninja.com.au/_Media/block_mutations_med.jpeg)

<img src=images/genetic_code.jpeg>
[src](http://www.vce.bioninja.com.au/_Media/genetic_code_med.jpeg)

In [40]:
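# Mutation exercise: build a short protein from numbered amino-acid draws
from random import sample

interesting_aas = {'P', 'F', 'E', 'I', 'L', 'H', 'D', 'W', 'Y', 'R'}
# Randomly assign each amino acid a distinct number from 1 to 10
bag = dict(zip(sample(range(1, 11), 10), interesting_aas))
# Numbers to draw from the bag
guesses = [1, 2, 3, 4, 5]
# Every protein starts with methionine (M); '*' marks the stop codon
print('Protein: M{}*'.format(''.join(bag[g] for g in guesses)))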

Protein: MPWEDR*
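
The mutation types pictured above can also be tried out in code. The sketch below is illustrative only: the DNA sequence is a made-up example chosen so that it encodes the protein printed by the exercise, and `translate` and `point_mutate` are hypothetical helper names, not functions from any library. It builds the standard genetic code table and shows how a single base change can be silent, missense, or nonsense.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


def point_mutate(dna, position, new_base):
    """Return a copy of dna with the base at a 0-based position replaced."""
    return dna[:position] + new_base + dna[position + 1:]


# ASSUMPTION: hypothetical sequence, chosen to encode M-P-W-E-D-R-stop.
original = "ATGCCTTGGGAAGATCGTTAA"

silent = point_mutate(original, 5, "C")    # CCT -> CCC, still proline
missense = point_mutate(original, 9, "A")  # GAA -> AAA, glutamate becomes lysine
nonsense = point_mutate(original, 8, "A")  # TGG -> TGA, tryptophan becomes stop

for label, seq in [("original", original), ("silent", silent),
                   ("missense", missense), ("nonsense", nonsense)]:
    print("{:9s} {}".format(label, translate(seq)))
```

Changing the third base of a codon is often silent because several codons map to the same amino acid, while a change that creates a stop codon truncates the protein.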


## Introduction to the dataset

Microbes are ideal organisms for 'Long-term Evolution Experiments' (LTEEs): thousands of generations can be grown and stored in a way that would be virtually impossible for more complex eukaryotic systems. In Blount et al. (2012), 12 populations of Escherichia coli were propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which E. coli cannot metabolize under the aerobic conditions of the experiment.

<img src=images/citric_acid.png>


Sequencing of the populations at regular time points revealed that spontaneous citrate-using mutants (Cit+) appeared in one population of E. coli (designated Ara-3) at around 31,000 generations. Spontaneous Cit+ mutants are extraordinarily rare: the inability to metabolize citrate is one of the defining characteristics of the E. coli species. Eventually, Cit+ mutants came to dominate the population, because the experimental growth medium contained a high concentration of citrate relative to glucose.

Strains from generation 0 to generation 40,000 were sequenced, including both Cit+ and Cit- strains from after generation 31,000.

We want to look at genome size to see whether it is related to the Cit status of the strain. We also want to analyze the sequences to figure out what changes occurred in the genomes to make the strains Cit+.

   1. What is the distribution of genome sizes for all the strains?
       
      The distribution is bimodal, with peaks at 4.63 and 4.78 Mbp. It is not a normal distribution. The Cit+ strains average about 4.77 Mbp, and the Cit- or unknown strains average about 4.62 Mbp.
      
   2. Is there a relationship between genome size and Cit status?
      
      It seems that Cit+ E. coli have genomes about 0.15 Mbp larger than Cit- strains. One possible explanation is that Cit+ E. coli carry extra DNA that lets them produce a protein or other molecule needed to metabolize citrate.
      
   3. How many base pair changes are there between the Cit+ and Cit- strains?
  
   4. What are the base pair changes between strains?
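
For question 1, one way to look at the distribution is to count strains per genome-size bin. The sketch below uses only the standard library and a handful of made-up genome sizes (in Mbp) rather than the real dataset; `bin_sizes` is a hypothetical helper name and the 0.05 Mbp bin width is an arbitrary choice.

```python
from collections import Counter

# ASSUMPTION: illustrative genome sizes in Mbp, not values from the real dataset.
example_sizes = [4.62, 4.60, 4.63, 4.62, 4.75, 4.78, 4.78]


def bin_sizes(sizes_mbp, bin_hundredths=5):
    """Count sizes per bin; bins are keyed by their lower edge in Mbp.

    Sizes are converted to whole hundredths of a Mbp first so that
    floating-point rounding cannot push a value into the wrong bin.
    """
    counts = Counter()
    for size in sizes_mbp:
        hundredths = int(round(size * 100))
        lower_edge = (hundredths // bin_hundredths) * bin_hundredths
        counts[lower_edge / 100.0] += 1
    return counts


# Print a small text histogram, one row per occupied bin.
for lower_edge, count in sorted(bin_sizes(example_sizes).items()):
    print("{:.2f} {}".format(lower_edge, "#" * count))
```

Two well-separated groups of bars with empty bins between them are what a bimodal distribution looks like in this kind of plot.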

**Reference**

Blount, Z.D., Barrick, J.E., Davidson, C.J., Lenski, R.E. (2012). Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature, 489(7417), 513-518.

Data on NCBI SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRA026813

Using materials from: http://www.datacarpentry.org/introduction-genomics/01-intro-to-dataset.html


## Use a pandas dataframe to analyze E. coli data

1. What are the mean and standard deviation of genome size in Cit+ bacteria?
2. What are the mean and standard deviation of genome size in Cit- or Cit unknown bacteria?

In [41]:
import pandas as pd
from IPython.display import HTML, display

# Apply the notebook's table and page styling
css = open('style-table.css').read() + open('style-notebook.css').read()
display(HTML('<style>{}</style>'.format(css)))

# Load the strain metadata (read_csv replaces the removed DataFrame.from_csv)
data = pd.read_csv('data/ecoli_cit.csv')

# Each label is paired with the subset of strains it describes
groups = [
    ("Overall", data),
    ("Cit+", data[data.cit == 'plus']),
    ("Cit-", data[data.cit == 'minus']),
    ("Cit unknown", data[data.cit == 'unknown']),
    ("Cit - and unknown", data[data.cit != 'plus']),
]
for label, subset in groups:
    print("{} mean: {}".format(label, subset.genome_size.mean()))
print("")
print("")
for label, subset in groups:
    print("{} standard deviation: {}".format(label, subset.genome_size.std()))


Overall mean: 4.66266666667
Cit+ mean: 4.76888888889
Cit- mean: 4.61444444444
Cit unknown mean: 4.61916666667
Cit - and unknown mean: 4.61714285714


Overall standard deviation: 0.0732936151584
Cit+ standard deviation: 0.0226077666104
Cit- standard deviation: 0.0133333333333
Cit unknown standard deviation: 0.0215146180045
Cit - and unknown standard deviation: 0.0182051797967
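
The per-group filters above can also be written as a single grouped summary. This is a minimal sketch, assuming pandas is installed. The small table is made up for illustration and stands in for the real `data` frame; only the column names `cit` and `genome_size` come from the notebook.

```python
import pandas as pd

# ASSUMPTION: tiny made-up table standing in for data/ecoli_cit.csv.
example = pd.DataFrame({
    "cit": ["plus", "plus", "minus", "minus", "unknown", "unknown"],
    "genome_size": [4.75, 4.79, 4.60, 4.62, 4.61, 4.63],
})

# One row per Cit status, with mean, sample standard deviation, and group size.
summary = example.groupby("cit")["genome_size"].agg(["mean", "std", "count"])
print(summary)

# Keep 'plus' as is and relabel everything else, to compare Cit+ against the rest.
example["group"] = example["cit"].where(example["cit"] == "plus", "minus_or_unknown")
combined = example.groupby("group")["genome_size"].agg(["mean", "std", "count"])
print(combined)
```

Including the count alongside the mean and standard deviation makes it easier to judge how much weight each group's statistics can bear.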


Is there a relationship between genome size and Cit status?
* Are Cit+ and Cit- genome sizes statistically different? 

In [68]:
from scipy.stats import ttest_ind
# Two-sample t-test comparing genome sizes of Cit- and Cit+ strains
ttest_ind(data[data.cit=='minus'].genome_size, data[data.cit=='plus'].genome_size)

Ttest_indResult(statistic=-17.65301765302652, pvalue=6.4959188034384765e-12)
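
To see where a statistic like this comes from, the sketch below recomputes the equal-variance (Student's) two-sample t statistic from the means and standard deviations printed earlier in this notebook. The group sizes are an assumption: nine strains per group is not stated anywhere in the notebook. `pooled_t_statistic` is a hypothetical helper name.

```python
from math import sqrt


def pooled_t_statistic(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Student's two-sample t statistic from summary statistics.

    Assumes both groups share one variance, estimated by pooling the
    two sample variances weighted by their degrees of freedom.
    """
    pooled_var = (((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2)
                  / (n_a + n_b - 2))
    standard_error = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / standard_error


# Cit- and Cit+ means and sample standard deviations from the output above.
# ASSUMPTION: nine strains in each group.
t_stat = pooled_t_statistic(4.61444444444, 0.0133333333333, 9,
                            4.76888888889, 0.0226077666104, 9)
print(round(t_stat, 3))
```

Under that assumption the result should land very close to the statistic reported above. The difference in means is small in absolute terms, but it is many standard errors wide because the spread within each group is so narrow.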

Using these results, develop a hypothesis for how these bacteria may have obtained the ability to metabolize citrate.

Data set comes from:
https://github.com/datacarpentry/R-genomics/blob/gh-pages/Ecoli_metadata.csv

NIH BD2K Center for Big Data in Translational Genomics, UCSC Genomics Institute