# Lecture 2: Working with DNA Sequences
## Python for Biology
**Learning Objectives:**
- Iterate through DNA sequences (strings)
- Count and find specific nucleotides
- Search for patterns in DNA
- Extract and transform DNA sequences
- Work with codons and reading frames

---

## Section 1: Iterating Through DNA Sequences

### Guided Example 1.1: Printing each nucleotide

DNA sequences are stored as strings. We can loop through them just like lists!

In [None]:
# A simple DNA sequence
dna = "ATCG"

print("Nucleotides in the sequence:")
for nucleotide in dna:
    print(nucleotide)

**What's happening here?**
- DNA sequences are strings (text)
- We can loop through strings just like lists
- Each character (nucleotide) is processed one at a time
- `nucleotide` holds one letter each time through the loop

### Practice Example 1.1: Print nucleotides

Print each nucleotide in the sequence below, one per line

In [None]:
dna = "GCTA"

# YOUR CODE HERE: print each nucleotide

### Guided Example 1.2: Counting total nucleotides

Let's count how many nucleotides are in a sequence

In [None]:
dna = "ATCGATCG"

count = 0
for nucleotide in dna:
    count += 1

print(f"The sequence has {count} nucleotides")

# Note: We could also use len(dna) for this!
print(f"Using len(): {len(dna)} nucleotides")

**What's new here?**
- We count nucleotides just like we counted items in a list
- `len(dna)` is a faster way to get the length of a string
- But counting manually helps us understand loops!

### Practice Example 1.2: Count nucleotides

Count how many nucleotides are in this sequence (using a loop)

In [None]:
dna = "ATCGATCGATCG"

# YOUR CODE HERE: count the nucleotides using a for loop
# Print the result

---

## Section 2: Counting Specific Nucleotides

### Guided Example 2.1: Counting one type of nucleotide

Let's count how many times a specific nucleotide appears

In [None]:
dna = "ATCGATCG"

# Count how many A's
a_count = 0

for nucleotide in dna:
    if nucleotide == 'A':
        a_count += 1

print(f"Number of A nucleotides: {a_count}")

**What's happening here?**
- We check each nucleotide with `if nucleotide == 'A'`
- Only count it if it matches
- This is useful for analyzing DNA composition

### Practice Example 2.1: Count G nucleotides

Count how many G nucleotides are in this sequence

In [None]:
dna = "GGATCGATCG"

# YOUR CODE HERE: count the G nucleotides

### Guided Example 2.2: Counting multiple nucleotides

Let's count all four types of nucleotides at once

In [None]:
dna = "ATCGATCGATTAG"

# Initialize counters
a_count = 0
t_count = 0
c_count = 0
g_count = 0

# Count each nucleotide
for nucleotide in dna:
    if nucleotide == 'A':
        a_count += 1
    elif nucleotide == 'T':
        t_count += 1
    elif nucleotide == 'C':
        c_count += 1
    elif nucleotide == 'G':
        g_count += 1

print(f"A: {a_count}")
print(f"T: {t_count}")
print(f"C: {c_count}")
print(f"G: {g_count}")

**What's new here?**
- `elif` means "else if" - it checks another condition
- We use multiple counters, one for each nucleotide type
- This gives us a complete nucleotide composition
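The four separate counters can also be folded into one dictionary. The sketch below is an optional alternative rather than part of the guided example; the name `counts` is an illustrative choice, and any character other than A, T, C, or G is simply skipped.

```python
# Same sequence as the guided example above
dna = "ATCGATCGATTAG"

# One dictionary replaces the four separate counters
counts = {"A": 0, "T": 0, "C": 0, "G": 0}

for nucleotide in dna:
    if nucleotide in counts:      # skip anything that is not A, T, C, or G
        counts[nucleotide] += 1

for base, number in counts.items():
    print(f"{base}: {number}")
```

Adding a fifth symbol (for example `N` for an unknown base) would then only need one extra dictionary entry instead of another `elif` branch.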

### Practice Example 2.2: Count all nucleotides

Count all four types of nucleotides in this sequence

In [None]:
dna = "GCGCATAT"

# YOUR CODE HERE: count A, T, C, and G
# Print the count for each

### Guided Example 2.3: Calculating GC content

GC content is the percentage of G and C nucleotides in a sequence. It matters because G-C pairs are held together by three hydrogen bonds (A-T pairs by two), so GC-rich DNA is more thermally stable!

In [None]:
dna = "ATCGATCG"

# Count G and C
gc_count = 0

for nucleotide in dna:
    if nucleotide == 'G' or nucleotide == 'C':
        gc_count += 1

# Calculate percentage
gc_percentage = (gc_count / len(dna)) * 100

print(f"GC content: {gc_percentage}%")

**What's new here?**
- We count G and C together using `or`
- We calculate percentage: (count / total) × 100
- `len(dna)` gives us the total length
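Two practical details are worth a quick sketch: an empty sequence would cause a division by zero, and the raw percentage can print with many decimal places. The function name `gc_content` below is a hypothetical helper, and it uses `.count()`, which is introduced in Section 4.

```python
def gc_content(dna):
    """Return the GC percentage of a DNA string (0.0 for an empty string)."""
    if len(dna) == 0:
        return 0.0  # avoid dividing by zero
    gc_count = dna.count("G") + dna.count("C")
    return gc_count / len(dna) * 100

# :.1f rounds the printed value to one decimal place
print(f"GC content: {gc_content('ATCGATCG'):.1f}%")
```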

### Practice Example 2.3: Calculate GC content

Calculate the GC content percentage for this sequence

In [None]:
dna = "GCGCATATATAT"

# YOUR CODE HERE: calculate and print the GC content percentage

---

## Section 3: Finding Nucleotide Positions

### Guided Example 3.1: Finding positions of a specific nucleotide

Sometimes we need to know WHERE a nucleotide appears in the sequence

In [None]:
dna = "ATCGATCG"

print("Positions of 'A' nucleotides:")
for index, nucleotide in enumerate(dna):
    if nucleotide == 'A':
        print(f"Found A at position {index}")

**What's happening here?**
- `enumerate(dna)` gives us both the position (index) and the nucleotide
- We print the position when we find an 'A'
- Remember: positions start at 0!
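Sequence databases and most biology texts number positions starting at 1, not 0. As a small sketch, `enumerate` accepts a `start` argument that shifts the counting; the list name `one_based_positions` is only illustrative.

```python
dna = "ATCGATCG"

one_based_positions = []

# start=1 makes enumerate count 1, 2, 3, ... instead of 0, 1, 2, ...
for position, nucleotide in enumerate(dna, start=1):
    if nucleotide == "A":
        one_based_positions.append(position)
        print(f"Found A at 1-based position {position}")
```

Keep in mind that these numbers can no longer be used directly as Python indices: `dna[position - 1]` is the matching character.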

### Practice Example 3.1: Find T positions

Find and print all positions where 'T' appears

In [None]:
dna = "ATCGATTCG"

# YOUR CODE HERE: find and print all positions of 'T'

### Guided Example 3.2: Storing positions in a list

Let's collect all the positions into a list for later use

In [None]:
dna = "ATCGATCG"

# Create empty list to store positions
a_positions = []

for index, nucleotide in enumerate(dna):
    if nucleotide == 'A':
        a_positions.append(index)

print(f"Positions of A: {a_positions}")
print(f"Number of A's: {len(a_positions)}")

**What's new here?**
- We create an empty list: `a_positions = []`
- We append each position where we find 'A'
- Now we have a list of all positions!
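Once the loop version feels comfortable, the same list can be built in one line with a list comprehension. This is an optional shortcut shown as a sketch; it produces exactly the same result as the loop above.

```python
dna = "ATCGATCG"

# Read as: "collect index for every (index, nucleotide) pair where nucleotide is A"
a_positions = [index for index, nucleotide in enumerate(dna) if nucleotide == "A"]

print(f"Positions of A: {a_positions}")
```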

### Practice Example 3.2: Store G positions

Create a list of all positions where 'G' appears

In [None]:
dna = "GCGATAGC"

# YOUR CODE HERE: create a list of G positions and print it

---

## Section 4: Searching for Patterns

### Guided Example 4.1: Checking if a pattern exists

We can check if a short sequence (pattern) exists in our DNA

In [None]:
dna = "ATCGATCG"
pattern = "CGA"

if pattern in dna:
    print(f"Found pattern {pattern} in the sequence!")
else:
    print(f"Pattern {pattern} not found")

**What's happening here?**
- `in` checks if one string is inside another
- `"CGA" in "ATCGATCG"` returns True because CGA is in the sequence
- This is the easiest way to check for patterns!

### Practice Example 4.1: Check for a pattern

Check if the pattern "ATG" (start codon) is in this sequence

In [None]:
dna = "GCGATGATCG"
pattern = "ATG"

# YOUR CODE HERE: check if the pattern is in the DNA and print a message

### Guided Example 4.2: Finding the position of a pattern

Let's find WHERE a pattern starts in the sequence

In [None]:
dna = "ATCGATCG"
pattern = "CGA"

position = dna.find(pattern)

if position != -1:
    print(f"Pattern {pattern} found at position {position}")
else:
    print(f"Pattern {pattern} not found")

**What's new here?**
- `.find(pattern)` returns the starting position of the first occurrence of the pattern
- If the pattern isn't found, it returns -1
- We check if position is not equal to -1 using `!= -1`
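Because `.find()` only reports the first match, later matches need a second argument: the position to start searching from. The short sketch below uses illustrative variable names (`first`, `second`, `third`) to show the idea.

```python
dna = "ATCGATCG"
pattern = "CG"

first = dna.find(pattern)               # search from the beginning
second = dna.find(pattern, first + 1)   # search again, just after the first hit
third = dna.find(pattern, second + 1)   # nothing left to find

print(f"First match:  {first}")
print(f"Second match: {second}")
print(f"Third match:  {third}")
```

The last call returns -1, which is the signal that there are no more matches.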

### Practice Example 4.2: Find pattern position

Find the position of the stop codon "TAA" in this sequence

In [None]:
dna = "ATGCGATAA"
pattern = "TAA"

# YOUR CODE HERE: find and print the position of the pattern

### Guided Example 4.3: Counting pattern occurrences

How many times does a pattern appear in our DNA?

In [None]:
dna = "ATGATGATG"
pattern = "ATG"

count = dna.count(pattern)

print(f"Pattern {pattern} appears {count} times")

**What's new here?**
- `.count(pattern)` counts how many times the pattern appears
- This is much easier than counting manually!
- Note: `.count()` only counts non-overlapping matches, so `"AAAA".count("AA")` is 2, not 3
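To see the overlap issue concretely, the sketch below compares `.count()` with a sliding window that tests every possible start position. The sequence is an illustrative assumption, and the window uses slicing, which is covered in Section 5.

```python
dna = "ATATATA"
pattern = "ATA"

# .count() jumps past each match, so overlapping copies are missed
non_overlapping = dna.count(pattern)

# Sliding window: compare the pattern against every possible start position
overlapping = 0
for i in range(len(dna) - len(pattern) + 1):
    if dna[i:i + len(pattern)] == pattern:
        overlapping += 1

print(f".count() result:        {non_overlapping}")
print(f"Sliding window result:  {overlapping}")
```

Here "ATA" starts at positions 0, 2, and 4, but `.count()` reports only the matches at 0 and 4.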

### Practice Example 4.3: Count pattern occurrences

Count how many times "CG" appears in this sequence

In [None]:
dna = "CGCGATCGCG"
pattern = "CG"

# YOUR CODE HERE: count and print how many times the pattern appears

---

## Section 5: Extracting Subsequences (Slicing)

### Guided Example 5.1: Getting the first N nucleotides

We can extract parts of a DNA sequence using slicing

In [None]:
dna = "ATCGATCGATCG"

# Get first 5 nucleotides
first_five = dna[0:5]

print(f"Original: {dna}")
print(f"First 5: {first_five}")

# Shortcut: can omit the 0
first_five_short = dna[:5]
print(f"First 5 (shortcut): {first_five_short}")

**What's happening here?**
- `dna[0:5]` means "from position 0 to position 5 (not including 5)"
- This gives us positions 0, 1, 2, 3, 4 (5 nucleotides)
- `dna[:5]` is a shortcut that means "from the start to position 5"
- Slicing does NOT change the original string

### Practice Example 5.1: Extract first nucleotides

Extract and print the first 4 nucleotides from this sequence

In [None]:
dna = "GCGATCGATCG"

# YOUR CODE HERE: extract and print the first 4 nucleotides

### Guided Example 5.2: Getting the last N nucleotides

We can also extract from the end of the sequence

In [None]:
dna = "ATCGATCGATCG"

# Get last 4 nucleotides
last_four = dna[-4:]

print(f"Original: {dna}")
print(f"Last 4: {last_four}")

**What's new here?**
- Negative numbers count from the end
- `-4` means "4th position from the end"
- `dna[-4:]` means "from 4th from end to the end"
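A negative index is shorthand for "length minus that number". The following sketch, with illustrative variable names, shows the equivalence along with two related forms.

```python
dna = "ATCGATCGATCG"

last_one = dna[-1]                    # the final nucleotide
last_four = dna[-4:]                  # the last 4 nucleotides
same_last_four = dna[len(dna) - 4:]   # identical slice, written the long way
all_but_last_four = dna[:-4]          # everything except the last 4

print(last_one)
print(last_four)
print(same_last_four)
print(all_but_last_four)
```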

### Practice Example 5.2: Extract last nucleotides

Extract and print the last 3 nucleotides

In [None]:
dna = "ATCGATCG"

# YOUR CODE HERE: extract and print the last 3 nucleotides

### Guided Example 5.3: Extracting a middle region

We can extract any region from the middle of the sequence

In [None]:
dna = "ATCGATCGATCG"

# Extract nucleotides at positions 3 through 6 (the end index 7 is excluded)
middle = dna[3:7]

print(f"Original: {dna}")
print(f"Positions 3-6: {middle}")

**What's new here?**
- `dna[3:7]` extracts positions 3, 4, 5, 6 (not 7!)
- The start is included, the end is not
- This is useful for extracting genes or specific regions
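Gene coordinates in databases are usually given as 1-based positions with both ends included, which does not match Python's 0-based, end-excluded slices. The helper `extract_region` below is a hypothetical function sketching one way to convert between the two.

```python
def extract_region(dna, start, end):
    """Return nucleotides from 1-based position start through end, inclusive."""
    # 1-based inclusive [start, end] is the 0-based slice [start - 1 : end]
    return dna[start - 1:end]

dna = "ATCGATCGATCG"

# 1-based positions 4 through 7 are the same nucleotides as dna[3:7]
region = extract_region(dna, 4, 7)
print(f"Region 4-7 (1-based): {region}")
```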

### Practice Example 5.3: Extract middle region

Extract positions 2 through 6 from this sequence

In [None]:
dna = "GCGATCGATCG"

# YOUR CODE HERE: extract positions 2-6 and print the result

---

## Section 6: DNA Transformations

### Guided Example 6.1: Converting DNA to RNA

To convert DNA to RNA, we replace all T's with U's

In [None]:
dna = "ATCGATCG"

# Replace T with U
rna = dna.replace('T', 'U')

print(f"DNA: {dna}")
print(f"RNA: {rna}")

**What's happening here?**
- `.replace('T', 'U')` replaces all T's with U's
- This creates a new string (the original is unchanged)
- This models transcription in biology: the mRNA has the same sequence as the DNA coding strand, with U in place of T!

### Practice Example 6.1: Transcribe DNA to RNA

Convert this DNA sequence to RNA

In [None]:
dna = "ATGCATGC"

# YOUR CODE HERE: convert to RNA and print both DNA and RNA

### Guided Example 6.2: Creating the complement

The complement has A↔T and G↔C swapped

In [None]:
dna = "ATCG"

complement = ""

for nucleotide in dna:
    if nucleotide == 'A':
        complement += 'T'
    elif nucleotide == 'T':
        complement += 'A'
    elif nucleotide == 'C':
        complement += 'G'
    elif nucleotide == 'G':
        complement += 'C'

print(f"Original:   {dna}")
print(f"Complement: {complement}")

**What's new here?**
- We start with an empty string: `complement = ""`
- We add to it using `+=` (concatenation)
- For each nucleotide, we add its complement
- A pairs with T, C pairs with G
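The `if`/`elif` chain can also be replaced by a lookup dictionary that stores each base's partner. This is an optional sketch; the name `pairs` is illustrative, and a character that is not in the dictionary (such as `N`) would raise a `KeyError`.

```python
# Each nucleotide maps to its pairing partner
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}

dna = "ATCG"
complement = ""

for nucleotide in dna:
    complement += pairs[nucleotide]   # look up the partner instead of using if/elif

print(f"Original:   {dna}")
print(f"Complement: {complement}")
```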

### Practice Example 6.2: Create complement

Create the complement of this DNA sequence

In [None]:
dna = "GCGCATAT"

# YOUR CODE HERE: create and print the complement

### Guided Example 6.3: Reverse complement

The reverse complement is the complement read backwards. This is the sequence of the other DNA strand, read in its own 5'→3' direction!

In [None]:
dna = "ATCG"

# First, create complement (from previous example)
complement = ""
for nucleotide in dna:
    if nucleotide == 'A':
        complement += 'T'
    elif nucleotide == 'T':
        complement += 'A'
    elif nucleotide == 'C':
        complement += 'G'
    elif nucleotide == 'G':
        complement += 'C'

# Then reverse it
reverse_complement = complement[::-1]

print(f"Original:           {dna}")
print(f"Complement:         {complement}")
print(f"Reverse complement: {reverse_complement}")

**What's new here?**
- `[::-1]` reverses a string
- This is a special slicing syntax
- The reverse complement is what you'd read on the opposite DNA strand
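For longer sequences, Python's built-in `str.maketrans` and `.translate` can do the whole complement step in one call. The sketch below wraps this in a hypothetical helper named `reverse_complement`; characters outside A, T, C, G are left unchanged by `.translate`.

```python
# Translation table: A->T, T->A, C->G, G->C
COMPLEMENT_TABLE = str.maketrans("ATCG", "TAGC")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    complement = dna.translate(COMPLEMENT_TABLE)  # swap every base for its partner
    return complement[::-1]                       # then read it backwards

print(reverse_complement("AAGC"))
```

A handy sanity check is that taking the reverse complement twice gives back the original sequence.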

### Practice Example 6.3: Create reverse complement

Create the reverse complement of this sequence

In [None]:
dna = "ATCGATCG"

# YOUR CODE HERE: create and print the reverse complement
# Hint: first create the complement, then reverse it

---

## Section 7: Working with Codons

### Guided Example 7.1: Splitting into codons

Codons are groups of 3 nucleotides. Let's split a sequence into codons!

In [None]:
dna = "ATGCGATAA"

codons = []

# Step through sequence 3 nucleotides at a time
for i in range(0, len(dna), 3):
    codon = dna[i:i+3]
    codons.append(codon)

print(f"Original: {dna}")
print(f"Codons: {codons}")

**What's happening here?**
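- `range(0, len(dna), 3)` gives us the start positions 0, 3, 6 for this 9-nucleotide sequence
- The third parameter (3) is the "step" - how much to increment
- `dna[i:i+3]` extracts 3 nucleotides starting at position i (the last piece is shorter if the length is not a multiple of 3)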
- We collect all codons in a list
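The same sequence can be split into codons in three different ways depending on where you start; these are the three reading frames. The sketch below uses a hypothetical helper `split_codons` with an assumed `frame` parameter (0, 1, or 2), and it drops any incomplete codon at the end.

```python
def split_codons(dna, frame=0):
    """Split dna into complete codons, starting at offset frame (0, 1, or 2)."""
    codons = []
    # Stopping at len(dna) - 2 guarantees every slice has a full 3 nucleotides
    for i in range(frame, len(dna) - 2, 3):
        codons.append(dna[i:i + 3])
    return codons

dna = "ATGCGATAAG"

for frame in range(3):
    print(f"Frame {frame}: {split_codons(dna, frame)}")
```

Notice that the start codon ATG and the stop codon TAA only show up as codons in frame 0.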

### Practice Example 7.1: Split into codons

Split this sequence into codons

In [None]:
dna = "GCGATCTAGCTA"

# YOUR CODE HERE: split into codons and print the list

### Guided Example 7.2: Finding the start codon

Let's find where the start codon (ATG) is located

In [None]:
dna = "GCGATGCGATAA"

start_codon = "ATG"
position = dna.find(start_codon)

if position != -1:
    print(f"Start codon ATG found at position {position}")
    print(f"Translation would begin here: {dna[position:]}")
else:
    print("No start codon found")

**What's new here?**
- We search for the start codon "ATG"
- `dna[position:]` gives us everything from the start codon onward
- This is where translation (making protein) would begin!

### Practice Example 7.2: Find start codon

Find the start codon and extract everything from that point onward

In [None]:
dna = "TTTATGCGATCGTAA"

# YOUR CODE HERE: find ATG and print the sequence from that point

### Guided Example 7.3: Finding stop codons

Let's check if any stop codons (TAA, TAG, TGA) are present

In [None]:
dna = "ATGCGATCGTAA"

stop_codons = ["TAA", "TAG", "TGA"]

print("Checking for stop codons:")
for stop in stop_codons:
    if stop in dna:
        position = dna.find(stop)
        print(f"  Found {stop} at position {position}")

**What's new here?**
- We have a list of stop codons to check
- We loop through each stop codon
- We check if it's in the sequence and find the position of its first occurrence (in any reading frame)
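One caveat: `in` and `.find()` match the three letters anywhere, even when they straddle two codons. A stop codon only ends translation when it lines up with the reading frame. The sketch below, using an illustrative sequence and the assumed name `in_frame_positions`, checks only whole codons counted from position 0.

```python
dna = "ATGATAGCCTAA"
stop_codons = ["TAA", "TAG", "TGA"]

# Searching anywhere finds TGA and TAG, but both straddle codon boundaries
print(f"TGA anywhere: position {dna.find('TGA')}")
print(f"TAG anywhere: position {dna.find('TAG')}")

# Step through whole codons (0, 3, 6, ...) and keep only in-frame stops
in_frame_positions = []
for i in range(0, len(dna) - 2, 3):
    codon = dna[i:i + 3]
    if codon in stop_codons:
        in_frame_positions.append(i)
        print(f"In-frame stop {codon} at position {i}")
```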

### Practice Example 7.3: Find stop codons

Check this sequence for all three stop codons

In [None]:
dna = "ATGCGATCGTAGCTA"

# YOUR CODE HERE: check for TAA, TAG, and TGA
# Print which ones are found and their positions

---

## Section 8: Practice Challenges

Now try some more complex challenges that combine everything you've learned!

### Challenge 1: AT/GC Ratio

Calculate the ratio of AT content to GC content

In [None]:
dna = "ATCGATCGATTAAAGGG"

# YOUR CODE HERE:
# 1. Count A and T nucleotides
# 2. Count G and C nucleotides
# 3. Calculate AT percentage and GC percentage
# 4. Print both percentages
# 5. Calculate and print the AT/GC ratio (AT count divided by GC count)

### Challenge 2: Extract Open Reading Frame

Find the region between the start codon (ATG) and first stop codon (TAA, TAG, or TGA)

In [None]:
dna = "TTTATGCGATCGTAAGGC"

# YOUR CODE HERE:
# 1. Find the position of ATG
# 2. Find the positions of all stop codons after ATG
# 3. Extract the sequence from ATG to the first stop codon
# 4. Print the open reading frame (ORF)

### Challenge 3: Find All CpG Sites

CpG sites are "CG" dinucleotides. Find all their positions (the cytosine at a CpG site can be methylated, which plays a role in gene regulation!)

In [None]:
dna = "ATCGATCGATTCGAA"

# YOUR CODE HERE:
# Find all positions where "CG" appears
# Hint: You'll need to loop through the sequence and check each position
# Store all positions in a list and print it

### Challenge 4: Translate to Amino Acids

Use this simple genetic code table to translate codons to amino acids

In [None]:
dna = "ATGGGCTAA"

# Simple genetic code (just a few codons)
genetic_code = {
    'ATG': 'M',  # Methionine (start)
    'GGC': 'G',  # Glycine
    'TAA': '*',  # Stop
    'TAG': '*',  # Stop
    'TGA': '*'   # Stop
}

# YOUR CODE HERE:
# 1. Split the DNA into codons
# 2. For each codon, look it up in the genetic_code dictionary
# 3. Build a string of amino acids
# 4. Stop when you hit a stop codon (*)
# 5. Print the amino acid sequence

---

## Summary

Congratulations! You've learned how to work with DNA sequences in Python:

**Basic operations:**
- ✅ Iterating through nucleotides
- ✅ Counting specific nucleotides
- ✅ Calculating GC content

**Finding and searching:**
- ✅ Finding positions of nucleotides
- ✅ Searching for patterns
- ✅ Counting pattern occurrences

**Extracting regions:**
- ✅ Using slicing to extract subsequences
- ✅ Getting first, last, and middle regions

**Transformations:**
- ✅ Converting DNA to RNA
- ✅ Creating complement sequences
- ✅ Creating reverse complement

**Working with codons:**
- ✅ Splitting sequences into codons
- ✅ Finding start and stop codons
- ✅ Extracting open reading frames

These skills are essential for bioinformatics! You can now analyze real DNA sequences, find genes, and perform sequence manipulations.

**Next steps:** Practice with real DNA sequences from databases like NCBI, and learn about reading FASTA files!