<a href="https://colab.research.google.com/github/AmirSedaghaati/computational-biology/blob/main/Day2_DNA_Manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Day 2: Advanced DNA Sequence Manipulation
# Date: July 26, 2025
# Focus: String methods, DNA complements, working with multiple sequences

print("🧬 Day 2: Mastering DNA Sequence Manipulation")
print("Today's mission: Build a complete DNA toolkit!")
print("="*60)


🧬 Day 2: Mastering DNA Sequence Manipulation
Today's mission: Build a complete DNA toolkit!
============================================================


In [2]:
# Understanding DNA Base Pairing Rules
print("📚 DNA BASE PAIRING RULES:")
print("A (Adenine)  ↔  T (Thymine)")
print("T (Thymine)  ↔  A (Adenine)")
print("G (Guanine)  ↔  C (Cytosine)")
print("C (Cytosine) ↔  G (Guanine)")
print()

# Why is this important?
print("🔬 BIOLOGICAL SIGNIFICANCE:")
print("- DNA exists as a double helix with complementary strands")
print("- PCR amplification relies on complementary primers")
print("- RNA transcription uses complementary base pairing")
print("- Mutation analysis must account for which strand a sequence was read from")
print()

# Let's see this in action
original_strand = "ATGCGATCG"
print(f"Original DNA strand: 5'-{original_strand}-3'")
print("Complement will be:   3'-?????????-5'")
print("Let's build this step by step...")

📚 DNA BASE PAIRING RULES:
A (Adenine)  ↔  T (Thymine)
T (Thymine)  ↔  A (Adenine)
G (Guanine)  ↔  C (Cytosine)
C (Cytosine) ↔  G (Guanine)

🔬 BIOLOGICAL SIGNIFICANCE:
- DNA exists as a double helix with complementary strands
- PCR amplification relies on complementary primers
- RNA transcription uses complementary base pairing
- Mutation analysis must account for which strand a sequence was read from

Original DNA strand: 5'-ATGCGATCG-3'
Complement will be:   3'-?????????-5'
Let's build this step by step...


In [3]:
# Method 1: Building complement step by step (learning approach)
def create_complement_manual(dna_sequence):
    """
    Create DNA complement using manual base-by-base approach
    Great for understanding the biology!
    """
    print(f"Creating complement for: {dna_sequence}")

    complement = ""  # Start with empty string

    # Process each nucleotide one by one (uppercased so 'a' and 'A' are treated alike)
    for i, nucleotide in enumerate(dna_sequence.upper()):
        if nucleotide == 'A':
            complement_base = 'T'
        elif nucleotide == 'T':
            complement_base = 'A'
        elif nucleotide == 'G':
            complement_base = 'C'
        elif nucleotide == 'C':
            complement_base = 'G'
        else:
            complement_base = 'N'  # For unknown bases

        complement += complement_base
        print(f"Position {i+1}: {nucleotide} → {complement_base}")

    return complement

# Test it
test_seq = "ATGC"
print("\n🧪 TESTING MANUAL METHOD:")
result = create_complement_manual(test_seq)
print(f"Final complement: {result}")


🧪 TESTING MANUAL METHOD:
Creating complement for: ATGC
Position 1: A → T
Position 2: T → A
Position 3: G → C
Position 4: C → G
Final complement: TACG


In [4]:
# Method 2: Efficient complement function (professional approach)
def create_complement_efficient(dna_sequence):
    """
    Create DNA complement using a dictionary lookup instead of an if/elif chain.
    Input is uppercased first; any character outside A/T/G/C/N becomes 'N'.
    """
    # Complement mapping
    complement_map = {
        'A': 'T', 'T': 'A',
        'G': 'C', 'C': 'G',
        'N': 'N'  # For ambiguous bases
    }

    # Convert to uppercase and create complement
    dna_sequence = dna_sequence.upper()
    complement = ''.join([complement_map.get(base, 'N') for base in dna_sequence])

    return complement

# Test both methods
print("\n🔬 COMPARING METHODS:")
test_sequences = ["ATGC", "GGCCAATT", "ATGCGATCGTAGC"]

for seq in test_sequences:
    manual_result = create_complement_manual(seq)
    efficient_result = create_complement_efficient(seq)

    print(f"\nSequence:    5'-{seq}-3'")
    print(f"Complement:  3'-{efficient_result}-5'")
    print(f"Methods match: {manual_result == efficient_result}")


🔬 COMPARING METHODS:
Creating complement for: ATGC
Position 1: A → T
Position 2: T → A
Position 3: G → C
Position 4: C → G

Sequence:    5'-ATGC-3'
Complement:  3'-TACG-5'
Methods match: True
Creating complement for: GGCCAATT
Position 1: G → C
Position 2: G → C
Position 3: C → G
Position 4: C → G
Position 5: A → T
Position 6: A → T
Position 7: T → A
Position 8: T → A

Sequence:    5'-GGCCAATT-3'
Complement:  3'-CCGGTTAA-5'
Methods match: True
Creating complement for: ATGCGATCGTAGC
Position 1: A → T
Position 2: T → A
Position 3: G → C
Position 4: C → G
Position 5: G → C
Position 6: A → T
Position 7: T → A
Position 8: C → G
Position 9: G → C
Position 10: T → A
Position 11: A → T
Position 12: G → C
Position 13: C → G

Sequence:    5'-ATGCGATCGTAGC-3'
Complement:  3'-TACGCTAGCATCG-5'
Methods match: True
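
The dictionary lookup above can also be written with Python's built-in `str.maketrans` and `str.translate`, which apply the whole mapping in a single call. The following is a minimal sketch and not part of the original notebook: the names `COMPLEMENT_TABLE`, `complement_translate`, and `reverse_complement_translate` are hypothetical, and unlike `create_complement_efficient` this version preserves case and leaves characters outside A/T/G/C/N unchanged instead of replacing them with N.

```python
# Hypothetical alternative using only built-in string methods.
# The table maps each base to its partner, for upper- and lowercase input.
COMPLEMENT_TABLE = str.maketrans("ATGCNatgcn", "TACGNtacgn")

def complement_translate(dna_sequence):
    """Return the complement; characters not in the table pass through unchanged."""
    return dna_sequence.translate(COMPLEMENT_TABLE)

def reverse_complement_translate(dna_sequence):
    """Return the reverse complement by complementing, then reversing."""
    return complement_translate(dna_sequence)[::-1]

print(complement_translate("ATGCGATCGTAGC"))    # TACGCTAGCATCG
print(reverse_complement_translate("GAATTC"))   # GAATTC
```

Whether unknown characters should pass through, become N, or raise an error depends on the data; the pass-through behaviour here is only a consequence of how `str.translate` treats unmapped characters.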


In [5]:
# The REVERSE COMPLEMENT - Most important in bioinformatics!
def reverse_complement(dna_sequence):
    """
    Return the complement and the reverse complement of a DNA sequence
    as a tuple: (complement, reverse_complement).
    The reverse complement is what most bioinformatics applications need.

    Example: ATGC → complement: TACG → reverse complement: GCAT
    """
    complement_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}

    dna_sequence = dna_sequence.upper()

    # Step 1: Create complement
    complement = ''.join([complement_map.get(base, 'N') for base in dna_sequence])

    # Step 2: Reverse it
    reverse_comp = complement[::-1]  # Python slice notation for reverse

    return complement, reverse_comp

# Comprehensive testing
print("🔄 REVERSE COMPLEMENT ANALYSIS:")
print("="*50)

biological_sequences = {
    "Gene start": "ATGAAGTGC",
    "Primer": "GCTAGCTA",
    "Restriction site": "GAATTC",  # EcoRI site
    "Random": "AAATTTGGGCCC"
}

for name, sequence in biological_sequences.items():
    complement, rev_comp = reverse_complement(sequence)

    print(f"\n{name}:")
    print(f"Original:         5'-{sequence}-3'")
    print(f"Complement:       3'-{complement}-5'")
    print(f"Reverse complement: 5'-{rev_comp}-3'")

    # Show the biological double helix
    print("Double helix visualization:")
    print(f"5'-{sequence}-3'")
    print(f"3'-{complement}-5'")

🔄 REVERSE COMPLEMENT ANALYSIS:
==================================================

Gene start:
Original:         5'-ATGAAGTGC-3'
Complement:       3'-TACTTCACG-5'
Reverse complement: 5'-GCACTTCAT-3'
Double helix visualization:
5'-ATGAAGTGC-3'
3'-TACTTCACG-5'

Primer:
Original:         5'-GCTAGCTA-3'
Complement:       3'-CGATCGAT-5'
Reverse complement: 5'-TAGCTAGC-3'
Double helix visualization:
5'-GCTAGCTA-3'
3'-CGATCGAT-5'

Restriction site:
Original:         5'-GAATTC-3'
Complement:       3'-CTTAAG-5'
Reverse complement: 5'-GAATTC-3'
Double helix visualization:
5'-GAATTC-3'
3'-CTTAAG-5'

Random:
Original:         5'-AAATTTGGGCCC-3'
Complement:       3'-TTTAAACCCGGG-5'
Reverse complement: 5'-GGGCCCAAATTT-3'
Double helix visualization:
5'-AAATTTGGGCCC-3'
3'-TTTAAACCCGGG-5'
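
In the output above, the EcoRI site GAATTC is identical to its own reverse complement, so it reads the same 5'→3' on both strands. Many restriction enzyme recognition sites share this property. A minimal sketch of a check for it follows; the helper names `revcomp` and `is_reverse_palindrome` are hypothetical, and the code assumes the input contains only A, T, G, and C (any other character raises `KeyError`).

```python
def revcomp(dna_sequence):
    """Return the reverse complement; assumes only A/T/G/C are present."""
    pairs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    # Walk the sequence backwards and swap each base for its partner
    return ''.join(pairs[base] for base in reversed(dna_sequence.upper()))

def is_reverse_palindrome(dna_sequence):
    """True if the sequence equals its own reverse complement."""
    return dna_sequence.upper() == revcomp(dna_sequence)

for site in ["GAATTC", "GCTAGCTA", "AAATTTGGGCCC"]:
    print(f"{site}: {is_reverse_palindrome(site)}")
```

Of the three sequences, only GAATTC passes the check, which matches the "Restriction site" entry in the output.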


In [6]:
# Working with Multiple DNA Sequences - Enter the List!
print("📋 ANALYZING MULTIPLE SEQUENCES")
print("="*40)

# Create a list of DNA sequences (like having multiple genes)
dna_sequences = [
    "ATGAAGTGCAACCTG",      # Gene 1
    "GCTAGCTAGCTAGCT",       # Gene 2
    "TTTTTTTTTTTTTT",        # Poly-T sequence
    "GGGGGGGGGGGGGG",        # Poly-G sequence
    "ATCGATCGATCGAT"         # Alternating sequence
]

# Analyze each sequence
print("Analysis of multiple sequences:")
print("-" * 60)

for i, sequence in enumerate(dna_sequences, 1):
    print(f"\nSEQUENCE {i}: {sequence}")

    # Use our previous functions
    length = len(sequence)
    gc_content = (sequence.count('G') + sequence.count('C')) / length * 100
    complement, rev_comp = reverse_complement(sequence)

    print(f"  Length: {length} bp")
    print(f"  GC content: {gc_content:.1f}%")
    print(f"  Reverse complement: {rev_comp}")

    # Categorize by GC content
    if gc_content > 60:
        category = "GC-rich (as in some high-GC bacterial genomes)"
    elif gc_content < 40:
        category = "AT-rich (as in some promoter elements, e.g. the TATA box)"
    else:
        category = "Moderate GC (common in coding sequences)"

    print(f"  Category: {category}")

📋 ANALYZING MULTIPLE SEQUENCES
========================================
Analysis of multiple sequences:
------------------------------------------------------------

SEQUENCE 1: ATGAAGTGCAACCTG
  Length: 15 bp
  GC content: 46.7%
  Reverse complement: CAGGTTGCACTTCAT
  Category: Moderate GC (common in coding sequences)

SEQUENCE 2: GCTAGCTAGCTAGCT
  Length: 15 bp
  GC content: 53.3%
  Reverse complement: AGCTAGCTAGCTAGC
  Category: Moderate GC (common in coding sequences)

SEQUENCE 3: TTTTTTTTTTTTTT
  Length: 14 bp
  GC content: 0.0%
  Reverse complement: AAAAAAAAAAAAAA
  Category: AT-rich (as in some promoter elements, e.g. the TATA box)

SEQUENCE 4: GGGGGGGGGGGGGG
  Length: 14 bp
  GC content: 100.0%
  Reverse complement: CCCCCCCCCCCCCC
  Category: GC-rich (as in some high-GC bacterial genomes)

SEQUENCE 5: ATCGATCGATCGAT
  Length: 14 bp
  GC content: 42.9%
  Reverse complement: ATCGATCGATCGAT
  Category: Moderate GC (common in coding sequences)
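
The GC calculation in the loop divides by the sequence length, so an empty string would raise `ZeroDivisionError`. One way to make that step reusable is sketched below. The function names `gc_content` and `classify_gc` and the choice to return 0.0 for an empty sequence are assumptions made for illustration; the 40% and 60% cut-offs are the ones used in the cell above.

```python
def gc_content(dna_sequence):
    """Return GC content as a percentage; 0.0 for an empty sequence (assumed convention)."""
    seq = dna_sequence.upper()
    if not seq:
        return 0.0
    return (seq.count('G') + seq.count('C')) / len(seq) * 100

def classify_gc(gc_percent, low=40, high=60):
    """Bucket a GC percentage using the same thresholds as the loop above."""
    if gc_percent > high:
        return "GC-rich"
    if gc_percent < low:
        return "AT-rich"
    return "Moderate GC"

for seq in ["ATGAAGTGCAACCTG", "TTTTTTTTTTTTTT", "GGGGGGGGGGGGGG", ""]:
    gc = gc_content(seq)
    print(f"{seq or '(empty)'}: {gc:.1f}% -> {classify_gc(gc)}")
```

Note that under this convention an empty sequence is reported as AT-rich; raising an error instead may be preferable when empty input indicates a problem upstream.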


In [7]:
# Complete DNA Analysis Toolkit - Day 2 Masterpiece!
def complete_dna_analysis(sequence_list, labels=None):
    """
    Complete analysis of multiple DNA sequences
    This function combines everything we've learned!
    """
    print("🧬 COMPLETE DNA SEQUENCE ANALYSIS TOOLKIT")
    print("="*60)

    if labels is None:
        labels = [f"Sequence_{i+1}" for i in range(len(sequence_list))]

    results = []

    for label, seq in zip(labels, sequence_list):
        print(f"\n📊 ANALYZING {label.upper()}:")
        print("-" * 40)

        # Basic stats
        seq = seq.upper()
        length = len(seq)

        # Nucleotide counts
        counts = {'A': seq.count('A'), 'T': seq.count('T'),
                 'G': seq.count('G'), 'C': seq.count('C')}

        # Percentages (guard against an empty sequence to avoid division by zero)
        percentages = {base: (count/length)*100 if length else 0.0
                       for base, count in counts.items()}

        # GC content
        gc_content = percentages['G'] + percentages['C']

        # Complements
        complement, rev_comp = reverse_complement(seq)

        # Display results
        print(f"Sequence:    5'-{seq}-3'")
        print(f"Length:      {length} bp")
        print(f"Composition: A:{counts['A']} T:{counts['T']} G:{counts['G']} C:{counts['C']}")
        print(f"GC Content:  {gc_content:.1f}%")
        print(f"Rev Comp:    5'-{rev_comp}-3'")

        # Store results
        results.append({
            'label': label,
            'sequence': seq,
            'length': length,
            'gc_content': gc_content,
            'reverse_complement': rev_comp,
            'counts': counts
        })

    return results

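# Test with short illustrative sequences (labels are descriptive only)
real_sequences = [
    "ATGAAGTGCAACCTGCTGGTG",  # ATG-initiated example; not verified against a reference insulin sequence
    "GCTAGCTAGCTAGCTAGCTAG",  # Repetitive sequence
    "GAATTCGCGCGAATTC",       # Contains two EcoRI sites (GAATTC)
    "TTTTTTAAAAAATTTTTT"      # AT-only sequence standing in for an AT-rich regulatory region
]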

sequence_names = [
    "Insulin_gene_start",
    "Repetitive_DNA",
    "EcoRI_containing",
    "AT_rich_region"
]

# Run complete analysis
analysis_results = complete_dna_analysis(real_sequences, sequence_names)

🧬 COMPLETE DNA SEQUENCE ANALYSIS TOOLKIT
============================================================

📊 ANALYZING INSULIN_GENE_START:
----------------------------------------
Sequence:    5'-ATGAAGTGCAACCTGCTGGTG-3'
Length:      21 bp
Composition: A:5 T:5 G:7 C:4
GC Content:  52.4%
Rev Comp:    5'-CACCAGCAGGTTGCACTTCAT-3'

📊 ANALYZING REPETITIVE_DNA:
----------------------------------------
Sequence:    5'-GCTAGCTAGCTAGCTAGCTAG-3'
Length:      21 bp
Composition: A:5 T:5 G:6 C:5
GC Content:  52.4%
Rev Comp:    5'-CTAGCTAGCTAGCTAGCTAGC-3'

📊 ANALYZING ECORI_CONTAINING:
----------------------------------------
Sequence:    5'-GAATTCGCGCGAATTC-3'
Length:      16 bp
Composition: A:4 T:4 G:4 C:4
GC Content:  50.0%
Rev Comp:    5'-GAATTCGCGCGAATTC-3'

📊 ANALYZING AT_RICH_REGION:
----------------------------------------
Sequence:    5'-TTTTTTAAAAAATTTTTT-3'
Length:      18 bp
Composition: A:6 T:12 G:0 C:0
GC Content:  0.0%
Rev Comp:    5'-AAAAAATTTTTTAAAAAA-3'
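
The third test sequence is described as containing EcoRI sites. A short sketch of how such sites could be located with `str.find` follows. `find_sites` is a hypothetical helper that is not part of the notebook; positions are 0-based and overlapping matches are reported.

```python
def find_sites(dna_sequence, site):
    """Return the 0-based start position of every occurrence of site (overlaps included)."""
    seq, site = dna_sequence.upper(), site.upper()
    if not site:
        return []  # an empty pattern has no meaningful positions
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping occurrences are also found
        start = seq.find(site, start + 1)
    return positions

print(find_sites("GAATTCGCGCGAATTC", "GAATTC"))  # [0, 10]
```

Because GAATTC is its own reverse complement, searching one strand is enough here; for a non-palindromic motif the reverse complement of the sequence would need to be searched as well.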


In [8]:
# Day 2 Summary - What I've Accomplished!
print("\n" + "="*60)
print("🎉 DAY 2 ACCOMPLISHMENTS")
print("="*60)

print("\n✅ SKILLS MASTERED:")
skills = [
    "String manipulation for biological sequences",
    "DNA complement calculation (both methods)",
    "Reverse complement generation",
    "Working with lists of sequences",
    "Creating comprehensive analysis functions",
    "Professional code documentation",
    "Biological context understanding"
]

for i, skill in enumerate(skills, 1):
    print(f"{i}. {skill}")

print("\n🧬 BIOLOGICAL CONCEPTS LEARNED:")
concepts = [
    "DNA base pairing rules (A-T, G-C)",
    "Importance of reverse complement in PCR",
    "GC content biological significance",
    "Different sequence types (genes, regulatory, repetitive)"
]

for concept in concepts:
    print(f"• {concept}")

print("\n💻 PROGRAMMING CONCEPTS:")
programming = [
    "String methods and slicing",
    "Dictionary mapping for efficient lookups",
    "List iteration and enumeration",
    "Function parameters and return values",
    "Fallback values for unknown bases with dict.get()"
]

for prog in programming:
    print(f"• {prog}")

print(f"\n📈 PROGRESS: 2/56 days completed ({2/56*100:.1f}%)")
print("🚀 Ready for Day 3: Control structures and file handling!")


============================================================
🎉 DAY 2 ACCOMPLISHMENTS
============================================================

✅ SKILLS MASTERED:
1. String manipulation for biological sequences
2. DNA complement calculation (both methods)
3. Reverse complement generation
4. Working with lists of sequences
5. Creating comprehensive analysis functions
6. Professional code documentation
7. Biological context understanding

🧬 BIOLOGICAL CONCEPTS LEARNED:
• DNA base pairing rules (A-T, G-C)
• Importance of reverse complement in PCR
• GC content biological significance
• Different sequence types (genes, regulatory, repetitive)

💻 PROGRAMMING CONCEPTS:
• String methods and slicing
• Dictionary mapping for efficient lookups
• List iteration and enumeration
• Function parameters and return values
• Fallback values for unknown bases with dict.get()

📈 PROGRESS: 2/56 days completed (3.6%)
🚀 Ready for Day 3: Control structures and file handling!
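
The summary mentions the `.get()` method. `dict.get` returns a fallback value for a missing key instead of raising `KeyError`, so in the functions above an unexpected character is silently turned into N rather than reported. Where invalid input should stop the analysis, an explicit check is one option. The sketch below assumes that stricter behaviour is wanted; `COMPLEMENT_MAP` and `complement_strict` are hypothetical names, not part of the notebook.

```python
COMPLEMENT_MAP = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}

def complement_strict(dna_sequence):
    """Return the complement, raising ValueError on any character outside A/T/G/C/N."""
    seq = dna_sequence.upper()
    # Collect every character that has no entry in the mapping
    invalid = sorted(set(seq) - set(COMPLEMENT_MAP))
    if invalid:
        raise ValueError(f"Unexpected characters in sequence: {invalid}")
    return ''.join(COMPLEMENT_MAP[base] for base in seq)

print(complement_strict("ATGC"))
try:
    complement_strict("ATXG")
except ValueError as error:
    print(f"Rejected: {error}")
```

The lenient `.get(base, 'N')` form suits exploratory work, while the strict form makes typos and file-format problems visible early.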
