# Q1

a) **A DNA sequence** - str (string), with each letter in the string representing a single nucleotide ('A', 'C', 'G' or 'T'). Another option, somewhat less convenient, is to use a list of single-letter strings (nucleotides).

b) **Names/symbols of genes associated with a disease** - A set of strings, with each element being the name/symbol of a specific gene. The order of the genes is generally irrelevant and each gene is unique (it is meaningless to include the same gene twice), so a set makes more sense than a list.

c) **RNA sequences of all the transcripts expressed in a tissue** - If the order of the transcripts is irrelevant, the most convenient choice is a dictionary of the form `{'name_of_transcript1': 'seq1', 'name_of_transcript2': 'seq2', ...}` (i.e. both the keys and the values of the dictionary are of type str). If the order of the transcripts is relevant, we can use a list of tuples, i.e. `[('name_of_transcript1', 'seq1'), ('name_of_transcript2', 'seq2'), ...]`. If we only care about the sequences themselves (and not about the names of the transcripts), we can use a list of the form `['seq1', 'seq2', ...]`.

d) **Frequencies of amino acids in a given protein sequence** - A dictionary of the form `{'aa1': freq1, 'aa2': freq2, ...}`, for example `{'A': 0.2, 'K': 0.1, 'L': 0.4, 'P': 0.15, 'G': 0.15}`. A dictionary is the most convenient data structure for representing a mapping, as in this case (each amino acid is mapped to its frequency), given that the keys are unique and their order is irrelevant.

e) **Unique amino-acids appearing in a short peptide** - A set of strings, each a single letter representing one amino acid. For example, `{'A', 'K', 'L', 'P', 'G'}`. A set is the right data structure to use for representing unique elements (whose order is irrelevant).
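
The choices in (a) to (e) can be illustrated with a short sketch. All sequences, gene symbols and transcript names below are made up for the example (they are not data from the exercise), and the variable names are only suggestions.

```python
# All values below are made-up examples, not data from the exercise.

# a) A DNA sequence as a string: slicing and counting work directly.
example_dna = 'ATGCGTTAA'
first_codon = example_dna[:3]
n_thymines = example_dna.count('T')

# b) Gene symbols as a set: the duplicate 'BRCA1' collapses automatically.
example_genes = set(['BRCA1', 'TP53', 'BRCA1', 'PTEN'])

# c) Transcripts as a dict (lookup by name) or as a list of tuples (explicit order).
example_transcripts = {'transcript1': 'AUGGCUUAA', 'transcript2': 'AUGCCCUGA'}
ordered_transcripts = [('transcript1', 'AUGGCUUAA'), ('transcript2', 'AUGCCCUGA')]

# d) Amino-acid frequencies as a dict, computed from a protein sequence.
example_protein = 'AAAAKKLLLLLLLLPPPGGG'
aa_counts = {}
for aa in example_protein:
    aa_counts[aa] = aa_counts.get(aa, 0) + 1
aa_freqs = {}
for aa, count in aa_counts.items():
    aa_freqs[aa] = count / len(example_protein)

# e) Unique amino acids in a peptide as a set of one-letter strings.
unique_aas = set('AKLLPGA')

print(first_codon, n_thymines)
print(sorted(example_genes))
print(example_transcripts['transcript2'])
print(aa_freqs)
print(sorted(unique_aas))
```

Note that the set in (b) and (e) discards both duplicates and order, which is exactly why it fits those two cases and not (a) or (c).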

# Q2A

In [1]:
strain_a = 1e04
strain_b = 1e02
day = 0

while strain_a + strain_b < 1e10:
    
    if strain_a + strain_b < 1e08:
        strain_a *= 2
        strain_b *= 3
    else:
        strain_a *= 1.5
        strain_b *= 1.8
    
    day += 1
    total = strain_a + strain_b
    print('After %d days: strain A = %d (%.1f%%), strain B = %d (%.1f%%) [Total = %d]' % (
        day, strain_a, 100 * strain_a / total, strain_b, 100 * strain_b / total, total))

After 1 days: strain A = 20000 (98.5%), strain B = 300 (1.5%) [Total = 20300]
After 2 days: strain A = 40000 (97.8%), strain B = 900 (2.2%) [Total = 40900]
After 3 days: strain A = 80000 (96.7%), strain B = 2700 (3.3%) [Total = 82700]
After 4 days: strain A = 160000 (95.2%), strain B = 8100 (4.8%) [Total = 168100]
After 5 days: strain A = 320000 (92.9%), strain B = 24300 (7.1%) [Total = 344300]
After 6 days: strain A = 640000 (89.8%), strain B = 72900 (10.2%) [Total = 712900]
After 7 days: strain A = 1280000 (85.4%), strain B = 218700 (14.6%) [Total = 1498700]
After 8 days: strain A = 2560000 (79.6%), strain B = 656100 (20.4%) [Total = 3216100]
After 9 days: strain A = 5120000 (72.2%), strain B = 1968300 (27.8%) [Total = 7088300]
After 10 days: strain A = 10240000 (63.4%), strain B = 5904900 (36.6%) [Total = 16144900]
After 11 days: strain A = 20480000 (53.6%), strain B = 17714700 (46.4%) [Total = 38194700]
After 12 days: strain A = 40960000 (43.5%), strain B = 53144100 (56.5%) [Total 

# Q2B

In [2]:
# The fraction of bacteria that develop immunity, for each of the two strains.
IMMUNITY_DEVELOPMENT_RATES = {
    'A': 0.1,
    'B': 0.01,
}

# Number of bacteria at the beginning of the experiment (for each of the 4 populations)
populations = {
    # (strain, is_immune): population_size
    ('A', False): 1e04,
    ('A', True): 0,
    ('B', False): 1e02,
    ('B', True): 0,
}

day = 0
is_phage_active = False

# sum(populations.values()) provides the total number of bacteria (of all 4 populations)
while sum(populations.values()) < 1e10:

    # Determine if the phage should turn active (note that once it activates, it cannot deactivate)
    if sum(populations.values()) >= 1e06:
        is_phage_active = True

    # Determine the base growth rates, according to the total population size
    if sum(populations.values()) < 1e08:
        growth_rate = {'A': 2, 'B': 3}
    else:
        growth_rate = {'A': 1.5, 'B': 1.8}

    # Update the size of each of the 4 populations
    for population_key in populations.keys():
    
        strain, is_immune = population_key
        
        # Multiply the population size by the appropriate growth rate of the strain.
        populations[population_key] *= growth_rate[strain]

        # If the phage is active and the population is not immune, then 40% of the cells will die. Note that it doesn't
        # matter whether we first multiply by the growth rate and then by the phage survival factor or the other way
        # around (because multiplication is commutative).
        if is_phage_active and not is_immune:
            populations[population_key] *= 0.6
                
    if is_phage_active:
        for strain in ['A', 'B']:
            immunity_rate = IMMUNITY_DEVELOPMENT_RATES[strain]
            n_bacteria_developing_immunity = immunity_rate * populations[(strain, False)]
            populations[(strain, True)] += n_bacteria_developing_immunity
            # Note that the number of bacteria that develop immunity should be deducted from the unimmune population!
            populations[(strain, False)] -= n_bacteria_developing_immunity
    
    # Print the report for the end of the day.
    
    day += 1
    
    if is_phage_active:
        print('After %d days [active phage]:' % day)
    else:
        print('After %d days:' % day)
    
    for strain in ['A', 'B']:
        for is_immune in [False, True]:
            
            if is_immune:
                immune_label = 'Immune'
            else:
                immune_label = 'Unimmune'
                
            population_size = populations[(strain, is_immune)]
            population_percentage = 100 * population_size / sum(populations.values())
            print('\t' + '%s %s: %d bacteria (%.1f%%)' % (immune_label, strain, population_size, population_percentage))
            
    print('\t' + 'Total: %d bacteria' % sum(populations.values()))

After 1 days:
	Unimmune A: 20000 bacteria (98.5%)
	Immune A: 0 bacteria (0.0%)
	Unimmune B: 300 bacteria (1.5%)
	Immune B: 0 bacteria (0.0%)
	Total: 20300 bacteria
After 2 days:
	Unimmune A: 40000 bacteria (97.8%)
	Immune A: 0 bacteria (0.0%)
	Unimmune B: 900 bacteria (2.2%)
	Immune B: 0 bacteria (0.0%)
	Total: 40900 bacteria
After 3 days:
	Unimmune A: 80000 bacteria (96.7%)
	Immune A: 0 bacteria (0.0%)
	Unimmune B: 2700 bacteria (3.3%)
	Immune B: 0 bacteria (0.0%)
	Total: 82700 bacteria
After 4 days:
	Unimmune A: 160000 bacteria (95.2%)
	Immune A: 0 bacteria (0.0%)
	Unimmune B: 8100 bacteria (4.8%)
	Immune B: 0 bacteria (0.0%)
	Total: 168100 bacteria
After 5 days:
	Unimmune A: 320000 bacteria (92.9%)
	Immune A: 0 bacteria (0.0%)
	Unimmune B: 24300 bacteria (7.1%)
	Immune B: 0 bacteria (0.0%)
	Total: 344300 bacteria
After 6 days:
	Unimmune A: 640000 bacteria (89.8%)
	Immune A: 0 bacteria (0.0%)
	Unimmune B: 72900 bacteria (10.2%)
	Immune B: 0 bacteria (0.0%)
	Total: 712900 bacteria
Aft

By the end of the experiment, the phage-immune populations appear to have taken over the colony. 
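
One way to check this observation numerically is to wrap the same update rules in a function and compute the immune share of the colony at the end. The sketch below is a condensed restatement of the model in the cell above; the function name `simulate_immune_fraction` and its `target_total` parameter are introduced here only for illustration.

```python
def simulate_immune_fraction(target_total=1e10):
    """Re-run the Q2B model and return (days elapsed, immune fraction at the end)."""
    immunity_rates = {'A': 0.1, 'B': 0.01}
    populations = {
        ('A', False): 1e04, ('A', True): 0.0,
        ('B', False): 1e02, ('B', True): 0.0,
    }
    day = 0
    is_phage_active = False

    while sum(populations.values()) < target_total:
        total = sum(populations.values())

        # The phage switches on once the colony reaches 1e06 cells and stays on.
        if total >= 1e06:
            is_phage_active = True

        # Base growth rates depend on the total colony size.
        if total < 1e08:
            growth_rate = {'A': 2, 'B': 3}
        else:
            growth_rate = {'A': 1.5, 'B': 1.8}

        # Grow every population; non-immune cells lose 40% to an active phage.
        for strain, is_immune in populations:
            populations[(strain, is_immune)] *= growth_rate[strain]
            if is_phage_active and not is_immune:
                populations[(strain, is_immune)] *= 0.6

        # Move the newly immune cells from the non-immune to the immune population.
        if is_phage_active:
            for strain in ['A', 'B']:
                newly_immune = immunity_rates[strain] * populations[(strain, False)]
                populations[(strain, True)] += newly_immune
                populations[(strain, False)] -= newly_immune

        day += 1

    total = sum(populations.values())
    immune_total = populations[('A', True)] + populations[('B', True)]
    return day, immune_total / total


days, immune_fraction = simulate_immune_fraction()
print('After %d days, %.1f%% of the colony is immune' % (days, 100 * immune_fraction))
```

Returning the fraction instead of printing a daily report makes it easy to try other thresholds, for example `simulate_immune_fraction(1e09)`.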

# Q3A

In [3]:
DNA_seq = 'CTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGAGTACTGAAGGGTCAAATAACACTGAAGGAAGTGACACAATCACCCTCCCATGCAGAAT' + \
          'AAAACAAATTATAAACATGTGGCAGAAAGTAGGAAAAGCAATGTATGCCCCTCCCATCAGTGGACAAATTAGATGTTCATCAAATATTACAGGGCTGCTATTAACAAGAGA' + \
          'TGGTGGTAATAGCAACAATGAGTCCGAGATCTTCAGACCTGGAGGAGGAGATATGAGGGACAATTGGAGAAGTGAATTATATAAATATAAAGTAGTAAAAATTGAACCATT' + \
          'AGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGTTCCTTGGGTTCTTGGGAGCAGCAGGAAGCACTATGGG' + \
          'CGCAGCCTCAATGACGCTGACGGTACAGGCCAGACAATTATTGTCTGGTATAGTGCAGCAGCAGAACAATTTGCTGAGGGCTATTGAGGCGCAACAGCATCTGTTGCAACT' + \
          'CACAGTCTGGGGCAT'

# Calculate the reverse complement of the sequence.

COMPLEMENT_NTS = {
    'A': 'T',
    'C': 'G',
    'G': 'C',
    'T': 'A',
}
          
reverse_complement_DNA_seq = []

# Go over the given DNA seq in reversed order, and for each nucleotide, append its complement nucleotide to the resulting 
# DNA sequence (to obtain the reverse complement)
for nt in DNA_seq[::-1]:
    reverse_complement_DNA_seq.append(COMPLEMENT_NTS[nt])
    
# Convert the result from a list of letters into one string
reverse_complement_DNA_seq = ''.join(reverse_complement_DNA_seq)

print('The reverse complement of the given DNA sequence is: %s' % reverse_complement_DNA_seq)


# Translate the resulting sequence into protein.

CODON_TABLE = {
    'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',
    'UAU': 'Y', 'UAC': 'Y', 'UAA': '*', 'UAG': '*',
    'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W',
    'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
    'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M',
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
    'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}

reverse_complement_RNA_seq = reverse_complement_DNA_seq.replace('T', 'U')

protein_seq = []

for i in range(0, len(reverse_complement_RNA_seq), 3):

    codon = reverse_complement_RNA_seq[i:(i + 3)]
    aa = CODON_TABLE[codon]

    if aa == '*':
        break
    else:
        protein_seq.append(aa)

protein_seq = ''.join(protein_seq)
print('The resulted protein sequence: %s' % protein_seq)

The reverse complement of the given DNA sequence is: ATGCCCCAGACTGTGAGTTGCAACAGATGCTGTTGCGCCTCAATAGCCCTCAGCAAATTGTTCTGCTGCTGCACTATACCAGACAATAATTGTCTGGCCTGTACCGTCAGCGTCATTGAGGCTGCGCCCATAGTGCTTCCTGCTGCTCCCAAGAACCCAAGGAACAAAGCTCCTATTCCCACTGCTCTTTTTTCTCTCTGCACCACTCTTCTCTTTGCCTTGGTGGGTGCTACTCCTAATGGTTCAATTTTTACTACTTTATATTTATATAATTCACTTCTCCAATTGTCCCTCATATCTCCTCCTCCAGGTCTGAAGATCTCGGACTCATTGTTGCTATTACCACCATCTCTTGTTAATAGCAGCCCTGTAATATTTGATGAACATCTAATTTGTCCACTGATGGGAGGGGCATACATTGCTTTTCCTACTTTCTGCCACATGTTTATAATTTGTTTTATTCTGCATGGGAGGGTGATTGTGTCACTTCCTTCAGTGTTATTTGACCCTTCAGTACTCCAAGTACTATTAAACCAAGTACTATTAAACAGTTGTGTTGAATTACAGTAG
The resulted protein sequence: MPQTVSCNRCCCASIALSKLFCCCTIPDNNCLACTVSVIEAAPIVLPAAPKNPRNKAPIPTALFSLCTTLLFALVGATPNGSIFTTLYLYNSLLQLSLISPPPGLKISDSLLLLPPSLVNSSPVIFDEHLICPLMGGAYIAFPTFCHMFIICFILHGRVIVSLPSVLFDPSVLQVLLNQVLLNSCVELQ
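
As a possible alternative to the explicit loop over `COMPLEMENT_NTS`, the reverse complement can also be computed with Python's built-in `str.maketrans` and `str.translate`. This is only a sketch; the helper name `reverse_complement` is an assumption made for the example and does not appear in the solution above.

```python
def reverse_complement(dna_seq):
    """Return the reverse complement of a DNA string made of A, C, G and T."""
    # Map each nucleotide to its complement in one pass, then reverse the result.
    complement_table = str.maketrans('ACGT', 'TGCA')
    return dna_seq.translate(complement_table)[::-1]


# The first six nucleotides of the given sequence.
print(reverse_complement('CTACTG'))
```

One difference in behavior is worth keeping in mind: `translate` leaves any character that is not in the table unchanged, whereas the dictionary lookup in the loop version raises a `KeyError` for an unexpected letter.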


# Q3B

In [4]:
codon_counts = {}

for i in range(0, len(reverse_complement_RNA_seq), 3):
    
    codon = reverse_complement_RNA_seq[i:(i + 3)]
    
    if codon in codon_counts:
        codon_counts[codon] += 1
    else:
        codon_counts[codon] = 1
        
total_codons = sum(codon_counts.values())
codon_freqs = {}

for codon, codon_count in codon_counts.items():
    codon_freqs[codon] = codon_count / total_codons
    
print('Codon frequencies: %s' % codon_freqs)

Codon frequencies: {'AUG': 0.015789473684210527, 'CCC': 0.021052631578947368, 'CAG': 0.010526315789473684, 'ACU': 0.042105263157894736, 'GUG': 0.031578947368421054, 'AGU': 0.010526315789473684, 'UGC': 0.042105263157894736, 'AAC': 0.02631578947368421, 'AGA': 0.005263157894736842, 'UGU': 0.031578947368421054, 'GCC': 0.021052631578947368, 'UCA': 0.03684210526315789, 'AUA': 0.031578947368421054, 'CUC': 0.031578947368421054, 'AGC': 0.021052631578947368, 'AAA': 0.010526315789473684, 'UUG': 0.02631578947368421, 'UUC': 0.010526315789473684, 'CCA': 0.031578947368421054, 'GAC': 0.015789473684210527, 'AAU': 0.02631578947368421, 'CUG': 0.021052631578947368, 'ACC': 0.010526315789473684, 'GUC': 0.010526315789473684, 'AUU': 0.042105263157894736, 'GAG': 0.005263157894736842, 'GCU': 0.03684210526315789, 'GCG': 0.005263157894736842, 'CUU': 0.031578947368421054, 'CCU': 0.04736842105263158, 'AAG': 0.010526315789473684, 'AGG': 0.010526315789473684, 'UUU': 0.042105263157894736, 'UCU': 0.015789473684210527, 
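
The counting loop above could also be written with `collections.Counter` from the standard library. The sketch below assumes the same reading frame (non-overlapping codons starting at position 0); the function name `codon_frequencies` and the short example sequence are hypothetical.

```python
from collections import Counter


def codon_frequencies(rna_seq):
    """Return the relative frequency of each codon, reading from position 0."""
    # Split the sequence into consecutive, non-overlapping triplets.
    codons = [rna_seq[i:i + 3] for i in range(0, len(rna_seq), 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}


# A made-up 4-codon sequence: AUG, CCC, AUG, UAG.
example_freqs = codon_frequencies('AUGCCCAUGUAG')
print(example_freqs)
```

`Counter` removes the need for the `if codon in codon_counts` branch, since missing keys are treated as having a count of zero.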

# Q3C

In [5]:
# A dict mapping each aa into a set of its relevant codons
aa_to_codons = {}

for codon, aa in CODON_TABLE.items():
    
    if aa == '*':
        continue
    
    if aa in aa_to_codons:
        aa_to_codons[aa].add(codon)
    else:
        aa_to_codons[aa] = {codon}

for aa, aa_codons in aa_to_codons.items():
    
    aa_count = protein_seq.count(aa)
    
    if aa_count == 0:
        continue
    
    print('Amino-acid %s (%d codons, %d occurrences in the protein sequence):' % (aa, len(aa_codons), aa_count))
    
    for codon in aa_codons:
        codon_count = codon_counts.get(codon, 0)
        codon_freq = codon_count / aa_count
        print('\t' + 'Codon %s: %d occurrences (%d%%)' % (codon, codon_count, 100 * codon_freq))

Amino-acid F (2 codons, 10 occurrences in the protein sequence):
	Codon UUU: 8 occurrences (80%)
	Codon UUC: 2 occurrences (20%)
Amino-acid L (6 codons, 32 occurrences in the protein sequence):
	Codon CUA: 4 occurrences (12%)
	Codon CUC: 6 occurrences (18%)
	Codon UUG: 5 occurrences (15%)
	Codon CUU: 6 occurrences (18%)
	Codon CUG: 4 occurrences (12%)
	Codon UUA: 7 occurrences (21%)
Amino-acid S (6 codons, 18 occurrences in the protein sequence):
	Codon AGU: 2 occurrences (11%)
	Codon AGC: 4 occurrences (22%)
	Codon UCG: 1 occurrences (5%)
	Codon UCU: 3 occurrences (16%)
	Codon UCA: 7 occurrences (38%)
	Codon UCC: 1 occurrences (5%)
Amino-acid Y (2 codons, 3 occurrences in the protein sequence):
	Codon UAC: 1 occurrences (33%)
	Codon UAU: 2 occurrences (66%)
Amino-acid C (2 codons, 14 occurrences in the protein sequence):
	Codon UGU: 6 occurrences (42%)
	Codon UGC: 8 occurrences (57%)
Amino-acid P (4 codons, 19 occurrences in the protein sequence):
	Codon CCC: 4 occurrences (21%)
	Codo

Some codons clearly appear more often than others for a given amino acid. For example, of the 10 occurrences of F in the protein sequence, 8 (80%) are coded by the UUU codon and only 2 (20%) by the UUC codon. To determine whether such deviations from a uniform distribution are significant, we will need statistical tools (which we will only cover later in the course).
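
To make the comparison with a uniform distribution concrete, the sketch below contrasts the observed counts for F (taken from the output above) with the counts expected if both of its codons were used equally often. This is only a descriptive comparison, not a significance test, and the variable names are chosen for the example.

```python
# Observed codon counts for phenylalanine (F), taken from the output above.
observed_counts = {'UUU': 8, 'UUC': 2}

total_count = sum(observed_counts.values())

# Under uniform usage, each synonymous codon would get an equal share.
expected_count = total_count / len(observed_counts)

observed_to_expected = {}
for codon, count in observed_counts.items():
    observed_to_expected[codon] = count / expected_count
    print('Codon %s: observed %d, expected %.1f, ratio %.2f' %
          (codon, count, expected_count, observed_to_expected[codon]))
```

A ratio above 1 means the codon is used more often than uniform usage would predict, and a ratio below 1 means it is used less often.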