# Week 8

Hopefully you have already opened this document as a Jupyter Notebook and not simply as a pdf. If not, then you need to download the file to an appropriate folder, and open the file.

To open the file, if you have already installed Anaconda Individual Edition, open Anaconda Navigator, from which you can launch Jupyter Notebook.

(If working in the School of Computing labs, start Anaconda3 / Anaconda Prompt, then enter the following command specifying the path to the folder containing the file, and then open the file from the browser.)

The Jupyter Notebook App is a tool for creating documents (notebooks) containing both live code and text, as well as visualizations, etc. Various programming languages can be used, but here the focus will be on Python.

For more general resources:

Jupyter Notebook - https://jupyter.org/
Anaconda - https://www.anaconda.com/
Python - https://www.python.org/

The rest of this notebook is intended as an introduction to key features of Python and Jupyter Notebooks, irrespective of whether you have used Python before or not. It is not intended to be exhaustive, but will cover material that will be relevant in this module. In addition to this material, you are strongly encouraged to develop your knowledge of Python further through the use of other resources such as

The Python tutorial - https://docs.python.org/3/tutorial/

Both Python and Jupyter Notebooks are used in various areas, including data science, so it would be well worth your while to develop your skills in this area as much as possible.

## Getting started

One thing you should do at the outset is to save this notebook as a new notebook, so that you can change it as much as you want, but still go back to the original file if necessary. Go to File / Make a Copy. This creates a new notebook called Week_1_Lab-Copy1 in a file with the same name. You can change the name of the notebook (and the file) by going to File / Rename and changing it to Week_1_Lab_my_version for example, or by editing the name of the notebook at the top of the page beside the Jupyter symbol (just above the menu).

## Cell Types

We need to distinguish between different types of cells. This cell is a Markdown cell, whereas the next cell below is a Code cell. Markdown cells are for formatting text rather than for running code. To see how to format headings, use italics and bold font, for example, you can go into edit mode by double-clicking on a cell. Try it for this cell. To execute the cell (and so produce the formatted text), you can go to Cell / Run Cells or use the shortcut Ctrl-Enter. (You can find other keyboard shortcuts under the Help menu.) Markdown is very useful for mathematical notation such as $\sqrt{2}$. For further details on Markdown see Markdown in Jupyter Notebook.

The next cell below is a Code cell. You can also edit and run it as described above, but now it will execute the code and present the output below the cell.
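
As a minimal illustration of what running a Code cell looks like, the sketch below computes the square root of 2 and prints it. The variable name `root_two` is chosen here for illustration and is not part of the original notebook.

```python
import math

# Compute the square root of 2 and print it below the cell
root_two = math.sqrt(2)
print(f"The square root of 2 is approximately {root_two:.4f}")
```

In a notebook, the value of the last expression in a cell is also displayed automatically, so ending the cell with `root_two` on a line of its own would show the value without calling `print`.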

## Task 1

Practice Problem: Python Function with if, for, and while

Write a function called process_numbers that takes a list of numbers as input and performs the following steps:


If the list is empty, return the string: "The list is empty".

Use a for loop to iterate through the list:

    If a number is even, add it to a new list called even_numbers.
    If a number is odd, add it to a new list called odd_numbers.
    If the number 0 appears in the list, print "Zero found in the list".

Use a while loop to calculate the sum of numbers in even_numbers.

Return a dictionary with:

    even_numbers: the list of even numbers.
    odd_numbers: the list of odd numbers.
    even_sum: the sum of the even numbers.


In [2]:
def process_numbers(numbers):
    if not numbers:
        return "The list is empty"
    
    even_numbers = []
    odd_numbers = []
    even_sum = 0
    
    # Use a for loop to classify even and odd numbers
    for num in numbers:
        if num % 2 == 0:
            even_numbers.append(num)
        else:
            odd_numbers.append(num)
        if num == 0:
            print("Zero found in the list")
    
    # Use a while loop to calculate the sum of even numbers
    index = 0
    while index < len(even_numbers):
        even_sum += even_numbers[index]
        index += 1
    
    # Return the results as a dictionary
    return {
        "even_numbers": even_numbers,
        "odd_numbers": odd_numbers,
        "even_sum": even_sum
    }

# Test
numbers = [10, 15, 20, 0, 3, 8, 7]
result = process_numbers(numbers)
print(result)


Zero found in the list
{'even_numbers': [10, 20, 0, 8], 'odd_numbers': [15, 3, 7], 'even_sum': 38}
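
One way to sanity-check the result is to recompute it with Python built-ins. The sketch below is not a replacement for the exercise, which asks for explicit `for` and `while` loops. It uses list comprehensions and `sum()` on the same test list and should give the same values.

```python
numbers = [10, 15, 20, 0, 3, 8, 7]

# Same classification as the for loop, written as list comprehensions
even_numbers = [num for num in numbers if num % 2 == 0]
odd_numbers = [num for num in numbers if num % 2 != 0]

# sum() replaces the explicit while loop
even_sum = sum(even_numbers)

print(even_numbers, odd_numbers, even_sum)
# [10, 20, 0, 8] [15, 3, 7] 38
```

Note that 0 counts as even, because `0 % 2 == 0`, which is why it appears in `even_numbers`.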



## Task 2

#### Objective: Write a Python function to analyze DNA sequences for specific characteristics.

##### Problem:
You are tasked with creating a Python program that:

    Checks whether a DNA sequence is valid (contains only A, T, C, and G).
    Counts the occurrences of each nucleotide in the sequence.
    Finds the complementary sequence using a loop.
    Stops processing when a stop codon (TAA, TAG, or TGA) is detected.
    


In [6]:
def analyze_dna(sequence, stop_codon):
    """
    Analyze a DNA sequence with a user-defined stop codon.

    Parameters:
        sequence (str): The DNA sequence to analyze.
        stop_codon (str): The stop codon provided by the user.

    Returns:
        dict: A dictionary containing analysis results.
    """
    # Validate the DNA sequence
    for base in sequence:
        if base not in "ATCG":
            return {"Error": "Invalid DNA sequence."}

    # Count occurrences of each nucleotide
    nucleotide_count = {"A": 0, "T": 0, "C": 0, "G": 0}
    for base in sequence:
        if base in nucleotide_count:
            nucleotide_count[base] += 1

    # Find the complementary sequence
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    complementary_sequence = ""
    for base in sequence:
        complementary_sequence += complement[base]

    # Scan for the user-defined stop codon in reading frame 0,
    # i.e. codon by codon from index 0 in steps of 3
    stop_index = -1
    i = 0
    while i <= len(sequence) - 3:
        codon = sequence[i:i+3]
        if codon == stop_codon:
            stop_index = i
            break
        i += 3

    if stop_index != -1:
        sequence = sequence[:stop_index]  # Truncate at the stop codon

    # Return the analysis results
    return {
        "Nucleotide Count": nucleotide_count,
        "Complementary Sequence": complementary_sequence,
        "Processed Sequence": sequence
    }

# Get inputs from the user
sequence = input("Enter the DNA sequence: ").upper()
stop_codon = input("Enter the stop codon (e.g., TAA, TAG, TGA): ").upper()

# Validate the stop codon
if len(stop_codon) != 3 or any(base not in "ATCG" for base in stop_codon):
    print("Invalid stop codon. Please enter a valid codon (3 bases: A, T, C, G).")
else:
    # Analyze the DNA sequence
    results = analyze_dna(sequence, stop_codon)
    print(results)


Enter the DNA sequence: AAAAAGGGGGTCA
Enter the stop codon (e.g., TAA, TAG, TGA): TCA
{'Nucleotide Count': {'A': 6, 'T': 1, 'C': 1, 'G': 5}, 'Complementary Sequence': 'TTTTTCCCCCAGT', 'Processed Sequence': 'AAAAAGGGGGTCA'}
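
In the run above, the processed sequence is unchanged even though `TCA` occurs in the input. The `while` loop only compares codons that start at indices 0, 3, 6 and 9, whereas `TCA` starts at index 10, so it is never seen as a codon. The sketch below makes this visible; `split_into_codons` is a hypothetical helper introduced for illustration and is not part of the task solution.

```python
def split_into_codons(sequence):
    """Split a sequence into complete, non-overlapping codons starting at index 0."""
    return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]

sequence = "AAAAAGGGGGTCA"
stop_codon = "TCA"

codons = split_into_codons(sequence)
print(codons)                       # ['AAA', 'AAG', 'GGG', 'GTC']
print(stop_codon in codons)         # False
print(sequence.find(stop_codon))    # 10
```

Entering a sequence such as `AAAGGGTCAAAA`, where `TCA` starts at index 6, would truncate the processed sequence to `AAAGGG`.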


## Task 3

Problem: DNA Sequence Analysis
Write a Python script that includes the following functionality:

    1. A function called validate_dna(sequence) that checks if a given string is a valid DNA sequence. A valid DNA sequence contains only the characters A, T, C, and G. If invalid, the function should return False.

    2. A function called gc_content(sequence) that calculates the GC content (percentage of G and C bases) in a given DNA sequence.

    3. A function called find_motif(sequence, motif) that uses a for loop to find all start positions of a given motif (substring) in the DNA sequence.

    4. A function called reverse_complement(sequence) that calculates the reverse complement of a DNA sequence using a while loop.

In [7]:
# 1. Validate DNA sequence
def validate_dna(sequence):
    """
    Validate if the sequence contains only A, T, C, G.
    Returns True if valid, otherwise False.
    """
    valid_bases = {'A', 'T', 'C', 'G'}
    for base in sequence:
        if base not in valid_bases:
            return False
    return True

# 2. Calculate GC content
def gc_content(sequence):
    """
    Calculate the GC content in a DNA sequence.
    Returns GC content as a percentage (0.0 for an empty sequence).
    """
    if not sequence:
        # Avoid dividing by zero when the sequence is empty
        return 0.0
    gc_count = 0
    for base in sequence:
        if base in {'G', 'C'}:
            gc_count += 1
    return (gc_count / len(sequence)) * 100

# 3. Find all occurrences of a motif
def find_motif(sequence, motif):
    """
    Find all start positions of a motif in the DNA sequence.
    Returns a list of starting indices.
    """
    positions = []
    for i in range(len(sequence) - len(motif) + 1):
        if sequence[i:i + len(motif)] == motif:
            positions.append(i)
    return positions

# 4. Reverse complement of DNA
def reverse_complement(sequence):
    """
    Calculate the reverse complement of a DNA sequence.
    Uses a while loop to process the sequence.
    """
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    reverse_comp = []
    i = len(sequence) - 1
    while i >= 0:
        reverse_comp.append(complement[sequence[i]])
        i -= 1
    return ''.join(reverse_comp)

# Example usage
dna_sequence = "ATGCGTACGTAGCTAGCT"
motif = "CGT"

if validate_dna(dna_sequence):
    print(f"GC Content: {gc_content(dna_sequence):.2f}%")
    print(f"Motif '{motif}' found at positions: {find_motif(dna_sequence, motif)}")
    print(f"Reverse Complement: {reverse_complement(dna_sequence)}")
else:
    print("Invalid DNA sequence!")


GC Content: 50.00%
Motif 'CGT' found at positions: [3, 7]
Reverse Complement: AGCTAGCTACGTACGCAT
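
The reverse complement printed above can be cross-checked with a shorter formulation. The sketch below is an alternative that does not use the `while` loop the task asks for: it relies on `str.maketrans`, `str.translate` and slicing. The names `COMPLEMENT_TABLE` and `reverse_complement_translate` are introduced here for illustration.

```python
# Translation table mapping each base to its complement
COMPLEMENT_TABLE = str.maketrans("ATCG", "TAGC")

def reverse_complement_translate(sequence):
    """Complement every base, then reverse the string with slicing."""
    return sequence.translate(COMPLEMENT_TABLE)[::-1]

dna_sequence = "ATGCGTACGTAGCTAGCT"
result = reverse_complement_translate(dna_sequence)
print(result)  # AGCTAGCTACGTACGCAT

# Taking the reverse complement twice gives back the original sequence
print(reverse_complement_translate(result) == dna_sequence)  # True
```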


## Advanced task

The following Python code is intended to calculate the GC content of a given DNA sequence. The original version contained several bugs, and your task is to debug the code and ensure it works correctly. The cell below shows a corrected version; its comments mark the fixes.


In [8]:
def calculate_gc_content(dna_sequence):
    if not isinstance(dna_sequence, str):
        return "Error: DNA sequence must be a string"
    
    # Convert the sequence to uppercase to handle lowercase input
    dna_sequence = dna_sequence.upper()
    
    gc_count = 0
    total_count = len(dna_sequence)
    
    if total_count == 0:  # Handle empty sequence case
        return "Error: DNA sequence is empty"
    
    for base in dna_sequence:
        if base == "G" or base == "C":
            gc_count += 1  # Corrected: Added assignment operator
        elif base not in ["A", "T", "G", "C"]:
            return "Error: Invalid character in DNA sequence"
    
    # Removed unnecessary while loop
    gc_content = (gc_count / total_count) * 100
    return gc_content

# Test Cases
print(calculate_gc_content("ATGC"))           # Expected: 50.0
print(calculate_gc_content("atgc"))           # Expected: 50.0
print(calculate_gc_content("AGGCGTAA"))       # Expected: 50.0
print(calculate_gc_content(""))               # Expected: Error: DNA sequence is empty
print(calculate_gc_content("ATGXYZ"))         # Expected: Error: Invalid character in DNA sequence
print(calculate_gc_content(12345))            # Expected: Error: DNA sequence must be a string

50.0
50.0
50.0
Error: DNA sequence is empty
Error: Invalid character in DNA sequence
Error: DNA sequence must be a string
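
The three numeric results above can be confirmed with an independent calculation. The sketch below assumes a non-empty string containing only valid bases and uses `str.count` instead of a loop; `gc_percent` is a hypothetical helper, not part of the exercise.

```python
def gc_percent(sequence):
    """GC content as a percentage; assumes a non-empty, valid DNA string."""
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence) * 100

for seq in ["ATGC", "atgc", "AGGCGTAA"]:
    print(seq, gc_percent(seq))
```

For `AGGCGTAA`, for example, there are three G bases and one C base among eight bases, so the GC content is 4 / 8 * 100 = 50.0.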
