# [DS4Bio] BONUS Coding: Python Coding of Genetic Coding
### Data Science for Biology
**Notebook developed by:** *Sarp Dora Kurtoglu, Kinsey Long*<br>
**Supervised by:** *Steven E. Brenner*

### Bonus Questions
You may work together with other students on this. You have two weeks to complete this (note the deadline on bCourses and/or Gradescope). Please have everyone submit the same file on Gradescope.

***
### Extra Credit Mini-Projects

Most assignments in this course will include optional extra credit questions. These questions are designed as starting points for students to explore more free-form mini projects. Therefore, there is no skeleton code and minimal guidance for these questions. Students are welcome to go beyond the scope of the question or adapt the question as necessary to answer their own scientific questions of interest. You are welcome to create as many coding cells as you would like for these mini-projects. In order to get extra credit, students should make a reasonable attempt (as judged by the grader) on at least one question and write a brief report.

Write a summary of your methodology and findings, highlighting key results and any interesting observations. The length of the report does not matter, as long as it answers all of the following questions:
- What was your scientific goal with this project?
- What methods did you use and why?
- What were the key results you found for each method you implemented?
- Were there any limitations in your methods?
- What additional observations or comments can you make on your findings? What is the greater biological relevance or implication?
- Are there any additional questions you would want to explore?


<font color = #d14d0f>**EC Mini-Project A: Alternative Codon Tables**</font>

The codon translation table is not actually universal. The genetic code has evolved over time, so codon tables differ between some species, and even between the nucleus and the organelles of the same species. We therefore need to pay attention to where a sequence comes from and pick the matching codon table. For example, in some ciliated protozoa, the standard stop codons UAA and UAG code for glutamine.

Let's imagine that we try to express the GFP gene in *S. cerevisiae* mitochondria. The standard codon table and the yeast mitochondrial codon table are different. Use NCBI resources to find the yeast mitochondrial codon table and translate the sequence.
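
As a starting point, the two tables can be written out by hand and compared. The sketch below is a minimal illustration, not a required approach: it assumes the amino-acid strings listed on the NCBI genetic codes page for `transl_table=1` (standard) and `transl_table=3` (yeast mitochondrial), and the names `STANDARD`, `YEAST_MITO`, and `translate` are illustrative choices.

```python
from itertools import product

# NCBI lists codons with the bases in the order T, C, A, G.
BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Amino-acid strings assumed from the NCBI genetic codes page
# (transl_table=1 and transl_table=3); '*' marks a stop codon.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
YEAST_MITO_AAS = "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD = dict(zip(CODONS, STANDARD_AAS))
YEAST_MITO = dict(zip(CODONS, YEAST_MITO_AAS))


def translate(dna, table):
    """Translate codon by codon, ignoring any trailing partial codon."""
    dna = dna.upper().replace("U", "T")
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Codons whose meaning differs between the two tables.
differences = {
    codon: (STANDARD[codon], YEAST_MITO[codon])
    for codon in CODONS
    if STANDARD[codon] != YEAST_MITO[codon]
}

for codon, (std, mito) in differences.items():
    print(f"{codon}: standard={std}, yeast mitochondrial={mito}")
```

Calling `translate(sequence, YEAST_MITO)` on the GFP coding sequence would then give the mitochondrial translation for comparison with the standard one.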

Describe (not necessarily implement) ways you could automate the process of retrieving species codon tables and translating using that table, and how you could use computational methods to quantitatively compare your result with the one you previously derived. 
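
One simple way to make the comparison quantitative is percent identity plus a list of the positions that changed. The sketch below assumes both proteins were translated from the same DNA sequence without stopping at stop codons, so they have equal length; `compare_translations` is a hypothetical helper name, and other metrics (for example, alignment scores) would be equally valid.

```python
def compare_translations(protein_a, protein_b):
    """Return (percent identity, mismatches) for two equal-length proteins.

    Each mismatch is a tuple (1-based position, residue in a, residue in b).
    """
    if len(protein_a) != len(protein_b):
        raise ValueError("translations of the same DNA should have equal length")
    mismatches = [
        (position, a, b)
        for position, (a, b) in enumerate(zip(protein_a, protein_b), start=1)
        if a != b
    ]
    if not protein_a:
        return 100.0, mismatches
    identity = 100.0 * (len(protein_a) - len(mismatches)) / len(protein_a)
    return identity, mismatches


# Toy example: a four-residue translation under two different codon tables.
identity, mismatches = compare_translations("MLI*", "MTMW")
print(f"{identity:.1f}% identical; differences: {mismatches}")
```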

As an intermediate project, write code to create a DNA sequence that, when expressed in yeast mitochondria, produces the same protein as the original sequence does under the standard codon table. As an advanced project, implement a function that translates a sequence based on an input species.
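
For the intermediate project, one possible approach is to walk through the sequence codon by codon and swap only those codons whose meaning changes between the two tables. The sketch below is illustrative rather than definitive: it assumes the NCBI amino-acid strings for tables 1 and 3, and `recode_for_target` is a hypothetical function name. It picks the first suitable replacement codon in NCBI order and ignores codon usage.

```python
from itertools import product

CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino-acid strings assumed from the NCBI genetic codes page (tables 1 and 3).
STANDARD = dict(zip(
    CODONS, "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))
YEAST_MITO = dict(zip(
    CODONS, "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))


def translate(dna, table):
    """Translate codon by codon, ignoring any trailing partial codon."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def recode_for_target(dna, source=STANDARD, target=YEAST_MITO):
    """Rewrite dna so that `target` yields the protein `source` gave originally."""
    dna = dna.upper().replace("U", "T")
    if len(dna) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    recoded = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        wanted = source[codon]
        if target[codon] != wanted:
            # Take the first codon (in NCBI order) that encodes the wanted
            # residue, or stop, in the target code.
            codon = next(c for c in CODONS if target[c] == wanted)
        recoded.append(codon)
    return "".join(recoded)


original = "ATGCTTATATGA"  # toy sequence: M L I stop under the standard code
recoded = recode_for_target(original)
print(recoded, translate(recoded, YEAST_MITO))
```

Codons that already mean the same thing in both tables are left untouched, so the recoded sequence stays as close as possible to the original.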

<font color = #d14d0f>**EC Mini-Project B: Codon Patterns**</font>

It is often useful to compare codon usage patterns in transcribed mRNA sequences. Visualize the frequency distribution of codons in the transcribed GFP mRNA. Compare this frequency distribution with that of at least one other relevant mRNA sequence, and justify your choice. Interpret your findings in a biological context, considering any implications about evolutionary history or functional characteristics.
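
Before plotting, the codon frequencies themselves need to be computed. A minimal sketch using only the standard library is shown here; it assumes the sequence starts in frame at its first base, and `codon_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def codon_frequencies(mrna):
    """Return {codon: relative frequency}, most frequent codons first.

    Assumes the reading frame starts at the first base; a trailing
    partial codon is ignored.
    """
    mrna = mrna.upper().replace("T", "U")
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.most_common()}


# Toy example: four codons, one of them repeated.
frequencies = codon_frequencies("AUGGCUGCUUAA")
for codon, freq in frequencies.items():
    print(f"{codon}: {freq:.2f}")
```

The resulting dictionary can be passed to any plotting library, for example as the heights of a bar chart, and two such dictionaries can be compared codon by codon.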

<font color = #d14d0f>**EC Mini-Project C: Reverse Translation**</font>

Some amino acids are encoded by multiple different codons. Implement a reverse translation function that takes a protein sequence and returns the number of DNA sequences that could code for it. How many different DNA sequences could encode the GFP amino acid sequence? Apply your function to other protein sequences too, and explore the relationship between the number of possible DNA reverse-translations and another quantitative variable of your choice. Consider the biological implications of any relationships you might find.
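
Because each residue can be chosen independently, the number of possible coding sequences is the product of the number of codons available for each amino acid. The sketch below assumes the standard code (NCBI table 1); `count_reverse_translations` is a hypothetical name, and the stop codon is only counted if `*` appears in the input.

```python
from collections import Counter
from math import prod

# Amino-acid string assumed from the NCBI standard code (transl_table=1).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Degeneracy: how many codons encode each amino acid (or stop, '*').
DEGENERACY = Counter(STANDARD_AAS)


def count_reverse_translations(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    protein = protein.upper()
    unknown = set(protein) - set(DEGENERACY)
    if unknown:
        raise ValueError(f"unrecognized residues: {sorted(unknown)}")
    # Independent choice at every position, so the counts multiply.
    return prod(DEGENERACY[aa] for aa in protein)


print(count_reverse_translations("MLS"))  # 1 * 6 * 6
```

Python integers have arbitrary precision, so the count stays exact even for long proteins; taking `math.log10` of the result may be more convenient when plotting it against another variable.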

As an advanced project, read literature on how mRNAs are recoded to express more efficiently and implement an approach. Such methods were used for making mRNA vaccines.
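
As one possible starting point, the simplest usage-based heuristic replaces every codon with the synonymous codon that appears most often in a reference coding sequence from the host. The sketch below only illustrates that heuristic under the standard code. The function names and the toy reference sequence are made up for the example, and approaches described in the literature weigh additional criteria that are not modeled here.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino-acid string assumed from the NCBI standard code (transl_table=1).
STANDARD = dict(zip(
    CODONS, "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))


def preferred_codons(reference_cds, table=STANDARD):
    """For each amino acid, pick the synonymous codon most used in reference_cds."""
    reference_cds = reference_cds.upper().replace("U", "T")
    counts = Counter(
        reference_cds[i:i + 3] for i in range(0, len(reference_cds) - 2, 3))
    best = {}
    for codon in CODONS:
        aa = table[codon]
        # Unseen codons count as 0; ties keep the earlier codon in NCBI order.
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best


def recode_protein(protein, preferred):
    """Reverse-translate a protein using one preferred codon per residue."""
    return "".join(preferred[aa] for aa in protein.upper())


# Made-up reference sequence, for illustration only (not real usage data).
toy_reference = "ATGCTGCTGTTAAAGAAGAAATAA"
preferred = preferred_codons(toy_reference)
print(recode_protein("MLK*", preferred))
```

In practice the reference would be a large set of highly expressed host genes or a published codon usage table rather than a single short sequence.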

In [18]:
from Bio import Entrez, SeqIO
from Bio.Seq import Seq
from Bio.Data import CodonTable

Entrez.email = "qamilmirza@berkeley.edu"

def get_mrna_sequence(species_name, gene_name):
    """
    Input: species_name (str) - the name of the species to search for
           gene_name (str) - the name of the gene to search for
    Output: mrna_sequences (str) - a single string containing all mRNA sequences found for the gene in the species
    """
    try:
        search_query = f"{species_name}[Organism] AND {gene_name}[Gene] AND mRNA[Filter]"
        handle = Entrez.esearch(db="nucleotide", term=search_query, retmax=20)  # Fetch max 20 mRNA sequences
        record = Entrez.read(handle)
        handle.close()
        
        if not record["IdList"]:
            return ""
        
        gene_ids = record["IdList"]
        mrna_sequences = []
        
        for gene_id in gene_ids:
            fetch_handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="fasta", retmode="text")
            fasta_data = fetch_handle.read()
            fetch_handle.close()
            
            # Remove the first line (description) from the FASTA data
            fasta_data_lines = fasta_data.split('\n')
            sequence_only = ''.join(fasta_data_lines[1:])  # Join without newlines
            mrna_sequences.append(sequence_only)
        
        # Concatenate all mRNA sequences into a single string, remove all spaces, and replace Ts with Us
        concatenated_mrna_sequences = ''.join(mrna_sequences).replace(" ", "").replace("T", "U").replace("t", "u")
        
        return concatenated_mrna_sequences
    except Exception as e:
        print(f"Error: {e}")
        return ""

def translate_mrna_to_protein(mrna_sequence, codon_table_id):
    """
    Input: mrna_sequence (str) - the mRNA sequence to translate
           codon_table_id (int) - the NCBI genetic code table ID
    Output: protein_sequence (str) - the translated protein sequence
    """
    try:
        # Create a Seq object from the mRNA sequence
        mrna_seq = Seq(mrna_sequence)
        
        # Translate the mRNA sequence into a protein sequence using the specified codon table
        protein_seq = mrna_seq.translate(table=codon_table_id)
        
        return str(protein_seq)
    except Exception as e:
        print(f"Error: {e}")
        return ""

# Example usage
species_name = "Saccharomyces cerevisiae"
gene_name = "ACT1"
mrna_sequence = get_mrna_sequence(species_name, gene_name)

# Nuclear genes of Saccharomyces cerevisiae, such as ACT1, use the standard codon table
# (NCBI genetic code table ID 1); yeast mitochondrial genes use table ID 3 instead
codon_table_id = 1
protein_sequence = translate_mrna_to_protein(mrna_sequence, codon_table_id)

print("mRNA Sequence:\n", mrna_sequence)
print("Protein Sequence:\n", protein_sequence)

mRNA Sequence:
 AUGGAUUCUGAGGUUGCUGCUUUGGUUAUUGAUAACGGUUCUGGUAUGUGUAAAGCCGGUUUUGCCGGUGACGACGCUCCUCGUGCUGUCUUCCCAUCUAUCGUCGGUAGACCAAGACACCAAGGUAUCAUGGUCGGUAUGGGUCAAAAAGACUCCUACGUUGGUGAUGAAGCUCAAUCCAAGAGAGGUAUCUUGACUUUACGUUACCCAAUUGAACACGGUAUUGUCACCAACUGGGACGAUAUGGAAAAGAUCUGGCAUCAUACCUUCUACAACGAAUUGAGAGUUGCCCCAGAAGAACACCCUGUUCUUUUGACUGAAGCUCCAAUGAACCCUAAAUCAAACAGAGAAAAGAUGACUCAAAUUAUGUUUGAAACUUUCAACGUUCCAGCCUUCUACGUUUCCAUCCAAGCCGUUUUGUCCUUGUACUCUUCCGGUAGAACUACUGGUAUUGUUUUGGAUUCCGGUGAUGGUGUUACUCACGUCGUUCCAAUUUACGCUGGUUUCUCUCUACCUCACGCCAUUUUGAGAAUCGAUUUGGCCGGUAGAGAUUUGACUGACUACUUGAUGAAGAUCUUGAGUGAACGUGGUUACUCUUUCUCCACCACUGCUGAAAGAGAAAUUGUCCGUGACAUCAAGGAAAAACUAUGUUACGUCGCCUUGGACUUCGAACAAGAAAUGCAAACCGCUGCUCAAUCUUCUUCAAUUGAAAAAUCCUACGAACUUCCAGAUGGUCAAGUCAUCACUAUUGGUAACGAAAGAUUCAGAGCCCCAGAAGCUUUGUUCCAUCCUUCUGUUUUGGGUUUGGAAUCUGCCGGUAUUGACCAAACUACUUACAACUCCAUCAUGAAGUGUGAUGUCGAUGUCCGUAAGGAAUUAUACGGUAACAUCGUUAUGUCCGGUGGUACCACCAUGUUCCCAGGUAUUGCCGAAAGAAUGCAAAAGGAAAUCACCGCUUUGGCUCCAUCUUCCAUGAAGGUCAAG

**EXTRA CREDIT REPORT: [insert project choice here]** <br>

*DOUBLE-CLICK TO EDIT THIS CELL AND TYPE YOUR REPORT*

***
### Congratulations! You have finished the bonus questions!
***