# Week 7

Hopefully you have already opened this document as a Jupyter Notebook and not simply as a pdf. If not, then you need to download the file to an appropriate folder, and open the file.

To open the file, if you have already installed Anaconda Individual Edition, open Anaconda Navigator, from which you can launch Jupyter Notebook.

(If working in the School of Computing labs, start Anaconda3 / Anaconda Prompt, then enter the following command specifying the path to the folder containing the file, and then open the file from the browser.)

The Jupyter Notebook App is a tool for creating documents (notebooks) containing both live code and text, as well as visualizations, etc. Various programming languages can be used, but here the focus will be on Python.

For more general resources:

- Jupyter Notebook - https://jupyter.org/
- Anaconda - https://www.anaconda.com/
- Python - https://www.python.org/

The rest of this notebook is intended as an introduction to key features of Python and Jupyter Notebooks, irrespective of whether you have used Python before or not. It is not intended to be exhaustive, but will cover material that is relevant to this module. In addition to this material, you are strongly encouraged to develop your knowledge of Python further through other resources such as:

- The Python tutorial - https://docs.python.org/3/tutorial/

Both Python and Jupyter Notebooks are used in various areas, including data science, so it would be well worth your while to develop your skills in this area as much as possible.

## Getting started

One thing you should do at the outset is to save this notebook as a new notebook, so that you can change it as much as you want but still go back to the original file if necessary. Go to File / Make a Copy. This creates a new notebook called Week_1_Lab-Copy1 in a file with the same name. You can change the name of the notebook (and the file) by going to File / Rename and changing it to Week_1_Lab_my_version, for example, or by editing the name of the notebook at the top of the page beside the Jupyter symbol (just above the menu).

## Cell Types

We need to distinguish between different types of cells. This cell is a Markdown cell, whereas the next cell below is a Code cell. Markdown cells are for formatting text rather than for running code. To see how to format headings, use italics and bold font, for example, you can go into edit mode by double-clicking on a cell. Try it for this cell. To execute the cell (and so produce the formatted text), you can go to Cell / Run Cells or use the shortcut Ctrl-Enter. (You can find other keyboard shortcuts under the Help menu.) Markdown is very useful for mathematical notation such as $\sqrt{2}$. For further details on Markdown see Markdown in Jupyter Notebook.

The next cell below is a Code cell. You can also edit and run it as described above, but now it will execute the code and present the output below the cell.

# Task 1

Generates a random DNA sequence and calculates its GC content, which is the percentage of guanine (G) or cytosine (C) bases.

Once an acceptable sequence is found (one with a GC content of at least 50%), the loop stops and the final sequence is printed.

In [1]:
import random

def generate_random_dna_sequence(length):
    """Generates a random DNA sequence of a given length."""
    return ''.join(random.choices('ACGT', k=length))

def calculate_gc_content(dna_sequence):
    """Calculates the GC content of a DNA sequence."""
    gc_count = dna_sequence.count('G') + dna_sequence.count('C')
    return (gc_count / len(dna_sequence)) * 100

# Set the length of the DNA sequence and the GC content threshold
sequence_length = 100
gc_threshold = 50  # 50%

# Initialize variables
gc_content = 0
dna_sequence = ''

# Use a while loop to keep generating sequences until the GC content reaches the threshold
while gc_content < gc_threshold:
    dna_sequence = generate_random_dna_sequence(sequence_length)
    gc_content = calculate_gc_content(dna_sequence)
    print(f"Generated sequence: {dna_sequence} with GC content: {gc_content:.2f}%")

# Print the final acceptable sequence and its GC content
print(f"\nAcceptable sequence found: {dna_sequence} with GC content: {gc_content:.2f}%")


Generated sequence: GATACGGGCTCCAGTCGGTGCCGAGGCTAGTATGAGAATCCATACGTACTACATCGACGGGTGCGCCCAGACACAGCGATGTGCGACGCGTAGATGTAGG with GC content: 58.00%

Acceptable sequence found: GATACGGGCTCCAGTCGGTGCCGAGGCTAGTATGAGAATCCATACGTACTACATCGACGGGTGCGCCCAGACACAGCGATGTGCGACGCGTAGATGTAGG with GC content: 58.00%
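
The loop above relies on chance, so its output differs on every run and, with a stricter threshold, it could run for a very long time. As a minimal sketch (not part of the original lab), the version below assumes a private seeded generator and a cap on the number of attempts. The names `find_gc_rich_sequence`, `max_attempts`, and `seed` are hypothetical and chosen only for illustration.

```python
import random

def generate_random_dna_sequence(length, rng):
    """Return a random DNA string of the given length using the supplied generator."""
    return ''.join(rng.choices('ACGT', k=length))

def calculate_gc_content(dna_sequence):
    """Return GC content as a percentage; an empty sequence counts as 0%."""
    if not dna_sequence:
        return 0.0
    gc_count = dna_sequence.count('G') + dna_sequence.count('C')
    return gc_count / len(dna_sequence) * 100

def find_gc_rich_sequence(length=100, gc_threshold=50, max_attempts=1000, seed=0):
    """Try up to max_attempts random sequences; return (sequence, gc, attempts) or None."""
    rng = random.Random(seed)  # private generator, so results repeat for a given seed
    for attempt in range(1, max_attempts + 1):
        sequence = generate_random_dna_sequence(length, rng)
        gc = calculate_gc_content(sequence)
        if gc >= gc_threshold:
            return sequence, gc, attempt
    return None  # no sequence reached the threshold within the attempt limit

result = find_gc_rich_sequence()
if result is None:
    print("No acceptable sequence found")
else:
    sequence, gc, attempts = result
    print(f"Found after {attempts} attempt(s): GC content {gc:.2f}%")
```

Returning `None` when the cap is reached leaves it to the caller to decide what to do, rather than looping indefinitely.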


# Task 2

Parsing genetic sequence data.

1. Parse through a dictionary of gene sequences.
2. Calculate the GC content, which is the percentage of bases in a DNA sequence that are either guanine (G) or cytosine (C).
3. Search for a specific DNA motif within the sequence.
4. Create the reverse complement of the DNA sequence.
5. Use control structures (if, for, while) efficiently to accomplish the tasks.

In [7]:
# Define a dictionary with gene identifiers as keys and DNA sequences as values
gene_sequences = {
    'gene1': 'ATGCGTACGTAGCTAGCTGACTG',
    'gene2': 'GCGCTATGCTAGCATGCTAGCTG',
    'gene3': 'ATGCGCATATCGTACGATCG',
    # Add more gene sequences as needed
}

# Function to calculate GC content
def calculate_gc_content(sequence):
    gc_count = sequence.count('G') + sequence.count('C')
    return (gc_count / len(sequence)) * 100

# Function to search for a DNA motif
def search_motif(sequence, motif='ATG'):
    return motif in sequence

# Function to create the reverse complement of a DNA sequence
def reverse_complement(sequence):
    complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(complement[base] for base in reversed(sequence))

# Main processing loop
for gene_id, sequence in gene_sequences.items():
    gc_content = calculate_gc_content(sequence)
    motif_present = search_motif(sequence)
    rev_complement = reverse_complement(sequence)
    
    # Print the results for each gene
    print(f"Gene: {gene_id}")
    print(f"  GC Content: {gc_content:.2f}%")
    print(f"  Motif 'ATG' Present: {'Yes' if motif_present else 'No'}")
    print(f"  Reverse Complement: {rev_complement}\n")

    # Additional conditional checks can be added here
    # For example, to process genes with high GC content differently
    if gc_content > 60:
        print(f"  Note: {gene_id} has a high GC content.\n")

# While loop example: Let's say we want to find the first gene with GC content over 50%
# and we want to do it using a while loop for demonstration purposes
index = 0
gene_keys = list(gene_sequences.keys())
while index < len(gene_keys):
    gene_id = gene_keys[index]
    sequence = gene_sequences[gene_id]
    if calculate_gc_content(sequence) > 50:
        print(f"First gene with GC content over 50%: {gene_id}")
        break
    index += 1


Gene: gene1
  GC Content: 52.17%
  Motif 'ATG' Present: Yes
  Reverse Complement: CAGTCAGCTAGCTACGTACGCAT

Gene: gene2
  GC Content: 56.52%
  Motif 'ATG' Present: Yes
  Reverse Complement: CAGCTAGCATGCTAGCATAGCGC

Gene: gene3
  GC Content: 50.00%
  Motif 'ATG' Present: Yes
  Reverse Complement: CGATCGTACGATATGCGCAT

First gene with GC content over 50%: gene1
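
`search_motif` above only reports whether the motif occurs. Where the positions matter, a sketch along the following lines could be used. The helper name `find_motif_positions` is an assumption for illustration, and it reports 0-based start indices, including overlapping matches.

```python
def find_motif_positions(sequence, motif='ATG'):
    """Return the 0-based start index of every occurrence of motif, including overlaps."""
    if not motif:
        raise ValueError("motif must not be empty")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one base after the last hit so overlapping matches are found too
        start = sequence.find(motif, start + 1)
    return positions

gene_sequences = {
    'gene1': 'ATGCGTACGTAGCTAGCTGACTG',
    'gene2': 'GCGCTATGCTAGCATGCTAGCTG',
    'gene3': 'ATGCGCATATCGTACGATCG',
}

for gene_id, sequence in gene_sequences.items():
    print(f"{gene_id}: ATG at positions {find_motif_positions(sequence)}")
```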


# Task 3 

Exercise: You're working with a dataset of genetic sequences where each sequence is represented by a string composed of four nucleotides (adenine (A), cytosine (C), guanine (G), and thymine (T)). You want to perform the following tasks:

1. Create a dictionary where each key is a unique genetic sequence (string) and its value is a tuple containing the count of each nucleotide in the sequence.
2. For each sequence, determine if it has a high GC content. GC content is considered high if it is greater than 60% of the sequence.
3. Create a set that contains only the sequences with high GC content.
4. Write a loop that iteratively reduces the sequences by removing one nucleotide from the end of each sequence until the length of each sequence is less than 10 or no sequences are left.

In [3]:
# Define a function that checks whether a sequence has high GC content (returns True or False)
def calculate_gc_content(sequence):
    gc_content = (sequence.count('G') + sequence.count('C')) / len(sequence)
    return gc_content > 0.6  # Returns True if GC content is higher than 60%

# Example list of genetic sequences
sequences = ['ATCGGCTA', 'GCGCGCGC', 'ATATATAT', 'CGCGATCGA']

# Step 1: Create a dictionary with counts of each nucleotide
sequence_info = {}
for seq in sequences:
    # Count each nucleotide and store in a tuple
    a_count = seq.count('A')
    c_count = seq.count('C')
    g_count = seq.count('G')
    t_count = seq.count('T')
    sequence_info[seq] = (a_count, c_count, g_count, t_count)

# Step 2 and 3: Determine sequences with high GC content and add to a set
high_gc_content_sequences = set()
for seq, counts in sequence_info.items():
    if calculate_gc_content(seq):
        high_gc_content_sequences.add(seq)

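# Step 4: On each pass, trim one nucleotide from the end of every sequence,
# then discard any sequence shorter than 10 nucleotides; stop when none are left
while sequences:  # While the list of sequences is not empty
    sequences = [seq[:-1] for seq in sequences if len(seq) > 1]  # Trim the last nucleotide
    # Discard sequences that are now shorter than 10 nucleotides
    # (every example sequence is already shorter than 10, so the list empties on the first pass)
    sequences = [seq for seq in sequences if len(seq) >= 10]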

# Print the results
print("Nucleotide counts for each sequence:", sequence_info)
print("High GC content sequences:", high_gc_content_sequences)
print("Reduced sequences:", sequences)


Nucleotide counts for each sequence: {'ATCGGCTA': (2, 2, 2, 2), 'GCGCGCGC': (0, 4, 4, 0), 'ATATATAT': (4, 0, 0, 4), 'CGCGATCGA': (2, 3, 3, 1)}
High GC content sequences: {'CGCGATCGA', 'GCGCGCGC'}
Reduced sequences: []
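
Step 4 of the exercise can also be read as asking for the shortened sequences to be kept rather than discarded. The sketch below follows that alternative reading. The function name `trim_sequences`, the `min_length` parameter, and the example sequences are made up for illustration.

```python
def trim_sequences(sequences, min_length=10):
    """Remove one base at a time from the end of each sequence until all are shorter than min_length."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    trimmed = list(sequences)  # work on a copy so the caller's list is unchanged
    rounds = 0
    # Keep looping while at least one sequence is still min_length or longer
    while any(len(seq) >= min_length for seq in trimmed):
        trimmed = [seq[:-1] if len(seq) >= min_length else seq for seq in trimmed]
        rounds += 1
    return trimmed, rounds

example = ['ATCGGCTAGCTAGG', 'GCGCGCGCGC', 'ATAT']  # made-up sequences for illustration
trimmed, rounds = trim_sequences(example)
print("Trimmed sequences:", trimmed)
print("Rounds needed:", rounds)
```

Sequences that are already short enough are left untouched, so the result has the same number of entries as the input.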


# Task 4

Calculates the frequency of each nucleotide in a DNA sequence and identifies the most frequent nucleotide.

In [5]:
# Define a DNA sequence (in practice, this would be read from a file)
dna_sequence = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# Initialize a dictionary to store nucleotide counts
nucleotide_counts = {"A": 0, "C": 0, "G": 0, "T": 0}

# Use a for loop to iterate through each nucleotide in the sequence
for nucleotide in dna_sequence:
    # Check if the nucleotide is one of the expected bases
    if nucleotide in nucleotide_counts:
        # If so, increment the count for this nucleotide in the dictionary
        nucleotide_counts[nucleotide] += 1

# Initialize a list to store nucleotides and their counts as tuples
nucleotide_count_list = []

# Use a for loop to iterate over the dictionary items
for nucleotide, count in nucleotide_counts.items():
    # Append a tuple of nucleotide and its count to the list
    nucleotide_count_list.append((nucleotide, count))

# Sort the list of tuples by the count in descending order
nucleotide_count_list.sort(key=lambda x: x[1], reverse=True)

# Initialize a set to keep track of the most frequent nucleotides
most_frequent_nucleotides = set()

# Use a while loop to find all nucleotides with the highest frequency
# We use the first element's count as the highest count for comparison
highest_count = nucleotide_count_list[0][1]
index = 0
while index < len(nucleotide_count_list) and nucleotide_count_list[index][1] == highest_count:
    # Add the nucleotide to the set of most frequent nucleotides
    most_frequent_nucleotides.add(nucleotide_count_list[index][0])
    index += 1

# Output the results
print("Nucleotide counts:", nucleotide_counts)
print("Sorted nucleotide counts:", nucleotide_count_list)
print("Most frequent nucleotides:", most_frequent_nucleotides)


Nucleotide counts: {'A': 20, 'C': 12, 'G': 17, 'T': 21}
Sorted nucleotide counts: [('T', 21), ('A', 20), ('G', 17), ('C', 12)]
Most frequent nucleotides: {'T'}
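
The same counting and sorting can be written more compactly with `collections.Counter` from the standard library. This is offered as an alternative sketch rather than a replacement for the loop-based version above, and it assumes the sequence contains at least one valid base.

```python
from collections import Counter

dna_sequence = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# Count only the four expected bases; other characters are ignored
nucleotide_counts = Counter(base for base in dna_sequence if base in "ACGT")

# most_common() returns (nucleotide, count) pairs sorted by descending count
sorted_counts = nucleotide_counts.most_common()

# Collect every nucleotide tied for the highest count
highest_count = sorted_counts[0][1]
most_frequent = {base for base, count in sorted_counts if count == highest_count}

print("Sorted nucleotide counts:", sorted_counts)
print("Most frequent nucleotides:", most_frequent)
```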


# Advanced Task 5

## Exercise Description
Given a DNA sequence, perform the following:

#### 1. Transcribe the DNA sequence into RNA.

Ensure that only valid DNA nucleotides are transcribed.
   
   
#### 2. Translate the RNA sequence into a peptide chain.

Create a dictionary representing the codon table.
Use a for loop to iterate through the RNA sequence in codons (triplets) and map each to a peptide.
   
#### 3. Metabolize specific peptides in the sequence into metabolites.

If a specific peptide sequence is encountered, break it down into a set of metabolites.
Store the metabolites in a list, and ensure no duplicates by converting it to a set.

In [13]:
# Define the codon table as a dictionary mapping RNA codons to peptides
codon_table = {
    'AUG': 'Methionine', 'UGG': 'Tryptophan', 'UAC': 'Tyrosine', 'CGG': 'Arginine', # Example codons
    # ... (other codons should be filled in here for a complete table)
    # 'UAA', 'UAG', 'UGA' are stop codons and are not included in the codon table
}

# Transcription function: DNA -> RNA
def transcribe_dna_to_rna(dna_sequence):
    transcription_map = {'A': 'U', 'T': 'A', 'C': 'G', 'G': 'C'}
    rna_sequence = ''
    for nucleotide in dna_sequence:
        if nucleotide in transcription_map:
            rna_sequence += transcription_map[nucleotide]
        else:
            # Handle invalid nucleotides
            raise ValueError(f"Invalid DNA nucleotide: {nucleotide}")
    return rna_sequence

# Translation function: RNA -> Peptide Chain
def translate_rna_to_peptide(rna_sequence):
    peptide_chain = []
    stop_codons = {'UAA', 'UAG', 'UGA'}  # Common stop codons

    # Ensure that the RNA sequence length is a multiple of 3
    if len(rna_sequence) % 3 != 0:
        raise ValueError("RNA sequence length is not a multiple of 3")
    
    for i in range(0, len(rna_sequence), 3):
        codon = rna_sequence[i:i+3]
        # Check if codon is a stop codon
        if codon in stop_codons:
            break  # Stop translation
        if codon in codon_table:
            peptide_chain.append(codon_table[codon])
        else:
            # Handle invalid codons
            raise ValueError(f"Invalid RNA codon: {codon}")
    
    return peptide_chain

# Metabolization function: Peptides -> Metabolites
def metabolize_peptides(peptide_chain):
    metabolizing_peptides = ('Methionine', 'Tryptophan')
    metabolites = []
    i = 0
    while i < len(peptide_chain):
        if peptide_chain[i] in metabolizing_peptides:
            metabolites += ['metabolite1', 'metabolite2']
        i += 1  # Move on to the next peptide in every case
    return set(metabolites)  # Converting to a set removes duplicates

# Main function to run the simulation with a forced metabolization for demonstration
def simulate(dna_sequence):
    rna_sequence = transcribe_dna_to_rna(dna_sequence)
    peptide_chain = translate_rna_to_peptide(rna_sequence)
    
    # Force the addition of 'Methionine' to the peptide chain for demonstration
    peptide_chain.append('Methionine')
    
    metabolites = metabolize_peptides(peptide_chain)
    return rna_sequence, peptide_chain, metabolites

# Run the simulation and print the results
try:
    dna_seq = 'ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG'
    rna_seq, peptides, metabolites = simulate(dna_seq)
    print(f"RNA Sequence: {rna_seq}")
    print(f"Peptide Chain: {peptides}")
    print(f"Metabolites: {metabolites}")
except ValueError as e:
    print(f"An error occurred: {e}")


RNA Sequence: UACCGGUAACAUUACCCGGCGACUUUCCCACGGGCUAUC
Peptide Chain: ['Tyrosine', 'Arginine', 'Methionine']
Metabolites: {'metabolite2', 'metabolite1'}
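
The codon table in the cell above lists only four example codons. As a hedged sketch, all 64 codons of the standard genetic code can be generated with `itertools.product`, using one-letter amino acid codes instead of full names. The names `full_codon_table` and `translate_rna` are hypothetical, and `'*'` is used here to mark stop codons.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the standard genetic code, in UCAG order; '*' marks a stop codon
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() yields codons in the same UUU, UUC, UUA, UUG, UCU, ... order as AMINO_ACIDS
full_codon_table = {
    ''.join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_rna(rna_sequence):
    """Translate codon by codon, stopping at the first stop codon; a trailing partial codon is ignored."""
    protein = []
    for i in range(0, len(rna_sequence) - 2, 3):
        amino_acid = full_codon_table[rna_sequence[i:i + 3]]
        if amino_acid == '*':
            break  # Stop codon reached
        protein.append(amino_acid)
    return ''.join(protein)

print(len(full_codon_table))
print(translate_rna("AUGUGGUACCGGUAA"))
```

In this sketch a codon containing a character other than U, C, A, or G raises a `KeyError`. Converting that to a `ValueError`, as the cell above does, would be a natural extension.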
