<a href="https://colab.research.google.com/github/AmirSedaghaati/computational-biology/blob/main/Day3_Control_Structures_Biology.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Day 3: Control Structures for Biological Data Processing
# Date: July 27, 2025
# Focus: Loops, conditionals, file handling, real bioinformatics algorithms

print("🎯 Day 3: Mastering Control Flow in Bioinformatics")
print("Today's mission: Process real biological data like a pro!")
print("="*65)

# What we'll build today:
print("🛠️ TODAY'S PROJECTS:")
print("1. ORF (Open Reading Frame) Finder")
print("2. Sequence Quality Controller")
print("3. Multi-FASTA processor")
print("4. Codon usage analyzer")
print()

🎯 Day 3: Mastering Control Flow in Bioinformatics
Today's mission: Process real biological data like a pro!
🛠️ TODAY'S PROJECTS:
1. ORF (Open Reading Frame) Finder
2. Sequence Quality Controller
3. Multi-FASTA processor
4. Codon usage analyzer



In [2]:
# Understanding Loops with Biological Context
print("🔄 LOOPS IN BIOINFORMATICS")
print("="*40)

# For loop: Processing each nucleotide
def analyze_sequence_composition(dna_seq):
    """Analyze nucleotide composition using loops"""
    print(f"Analyzing sequence: {dna_seq}")

    nucleotide_counts = {'A': 0, 'T': 0, 'G': 0, 'C': 0, 'N': 0}

    # Loop through each position
    for position, nucleotide in enumerate(dna_seq.upper()):
        print(f"Position {position + 1:2d}: {nucleotide}")

        if nucleotide in nucleotide_counts:
            nucleotide_counts[nucleotide] += 1
        else:
            nucleotide_counts['N'] += 1  # Unknown nucleotides

    return nucleotide_counts

# Test with a short sequence first
test_seq = "ATGCATGC"
print("\n🧪 TESTING WITH SHORT SEQUENCE:")
composition = analyze_sequence_composition(test_seq)
print(f"Final composition: {composition}")

🔄 LOOPS IN BIOINFORMATICS

🧪 TESTING WITH SHORT SEQUENCE:
Analyzing sequence: ATGCATGC
Position  1: A
Position  2: T
Position  3: G
Position  4: C
Position  5: A
Position  6: T
Position  7: G
Position  8: C
Final composition: {'A': 2, 'T': 2, 'G': 2, 'C': 2, 'N': 0}
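
The cell above walks the sequence with a `for` loop. As a complement, here is a minimal sketch of a `while` loop that keeps searching until no further match is found, which is handy when the number of hits is not known in advance. The function name `find_motif_positions` is hypothetical and not part of the notebook; it relies only on the built-in `str.find`.

```python
def find_motif_positions(dna_seq, motif):
    """Return 1-based start positions of every (possibly overlapping) motif hit."""
    seq = dna_seq.upper()
    motif = motif.upper()
    positions = []
    if not motif:
        return positions  # an empty motif would match everywhere; report nothing

    index = seq.find(motif)  # str.find returns -1 when there is no match
    while index != -1:
        positions.append(index + 1)  # 1-based, like the positions printed above
        index = seq.find(motif, index + 1)  # advance one base so overlaps are kept
    return positions


print(find_motif_positions("ATGCATGC", "ATG"))  # [1, 5]
print(find_motif_positions("AAAA", "AA"))       # [1, 2, 3]
```

Advancing by one base rather than by the motif length is a design choice: it reports overlapping hits, which `str.count` would skip.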


In [3]:
# Conditionals: Making biological decisions
def classify_sequence_type(dna_seq):
    """
    Classify DNA sequence based on biological characteristics
    This mimics real bioinformatics decision-making!
    """
    seq = dna_seq.upper()
    length = len(seq)

    # Calculate composition
    gc_content = (seq.count('G') + seq.count('C')) / length * 100 if length > 0 else 0
    at_content = (seq.count('A') + seq.count('T')) / length * 100 if length > 0 else 0

    print(f"\n🔬 CLASSIFYING SEQUENCE: {seq}")
    print(f"Length: {length} bp, GC: {gc_content:.1f}%, AT: {at_content:.1f}%")

    # Multiple conditional decisions
    if length < 20:
        size_class = "Short (primer/oligo)"
    elif length < 100:
        size_class = "Medium (gene fragment)"
    elif length < 1000:
        size_class = "Long (gene/small genome)"
    else:
        size_class = "Very long (chromosome/genome)"

    # GC content classification
    if gc_content > 65:
        gc_class = "High GC (thermophile/bacterial)"
    elif gc_content > 45:
        gc_class = "Moderate GC (typical eukaryote)"
    else:
        gc_class = "Low GC (AT-rich/regulatory)"

    # Pattern detection
    has_start_codon = 'ATG' in seq
    has_stop_codons = any(stop in seq for stop in ['TAA', 'TAG', 'TGA'])

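    # Repetitiveness check (crude heuristic: any triplet that occurs more than
    # twice flags the sequence, so ordinary coding sequences are flagged too)
    is_repetitive = False
    for i in range(len(seq) - 2):  # - 2 so the final triplet is also checked
        triplet = seq[i:i+3]
        if seq.count(triplet) > 2:  # str.count tallies non-overlapping matches
            is_repetitive = True
            break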

    # Final classification
    print(f"📊 CLASSIFICATION RESULTS:")
    print(f"  Size category: {size_class}")
    print(f"  GC category: {gc_class}")
    print(f"  Contains start codon (ATG): {has_start_codon}")
    print(f"  Contains stop codons: {has_stop_codons}")
    print(f"  Repetitive: {is_repetitive}")

    # Overall assessment
    if has_start_codon and has_stop_codons and not is_repetitive:
        overall = "Likely protein-coding gene"
    elif is_repetitive:
        overall = "Repetitive/structural element"
    elif gc_content < 30:
        overall = "Possible regulatory region"
    else:
        overall = "Non-coding sequence"

    print(f"  🎯 PREDICTION: {overall}")
    return overall

# Test with diverse biological sequences
biological_test_sequences = [
    ("ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTAG", "Insulin gene fragment"),
    ("TTTTTTTTTTTTTTTTTTTT", "Poly-T tail"),
    ("GCGCGCGCGCGCGCGCGCGC", "High GC repeat"),
    ("ATGATGATGATGATGATG", "Repetitive ATG"),
    ("GCTAGCTAGCTAGCTAG", "Alternating sequence")
]

for seq, description in biological_test_sequences:
    print(f"\n{'='*60}")
    print(f"Testing: {description}")
    classify_sequence_type(seq)


Testing: Insulin gene fragment

🔬 CLASSIFYING SEQUENCE: ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTAG
Length: 45 bp, GC: 55.6%, AT: 44.4%
📊 CLASSIFICATION RESULTS:
  Size category: Medium (gene fragment)
  GC category: Moderate GC (typical eukaryote)
  Contains start codon (ATG): True
  Contains stop codons: True
  Repetitive: True
  🎯 PREDICTION: Repetitive/structural element

Testing: Poly-T tail

🔬 CLASSIFYING SEQUENCE: TTTTTTTTTTTTTTTTTTTT
Length: 20 bp, GC: 0.0%, AT: 100.0%
📊 CLASSIFICATION RESULTS:
  Size category: Medium (gene fragment)
  GC category: Low GC (AT-rich/regulatory)
  Contains start codon (ATG): False
  Contains stop codons: False
  Repetitive: True
  🎯 PREDICTION: Repetitive/structural element

Testing: High GC repeat

🔬 CLASSIFYING SEQUENCE: GCGCGCGCGCGCGCGCGCGC
Length: 20 bp, GC: 100.0%, AT: 0.0%
📊 CLASSIFICATION RESULTS:
  Size category: Medium (gene fragment)
  GC category: High GC (thermophile/bacterial)
  Contains start codon (ATG): False
  Contains stop cod
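
The repetitiveness check in this cell flags any sequence in which some triplet occurs more than twice, which is why the 45 bp test sequence is labelled repetitive. One length-aware alternative, sketched here under assumptions, is to measure what fraction of all overlapping k-mer windows is taken up by the single most common k-mer. The name `top_kmer_fraction` is hypothetical, and no cutoff is implied by the notebook.

```python
from collections import Counter


def top_kmer_fraction(dna_seq, k=3):
    """Fraction of all overlapping k-mer windows occupied by the most common k-mer."""
    seq = dna_seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return 0.0  # sequence shorter than k: nothing to measure
    kmer_counts = Counter(seq[i:i + k] for i in range(n_windows))
    most_common_count = kmer_counts.most_common(1)[0][1]
    return most_common_count / n_windows


# Three of the test sequences from the cell above
for seq in ("TTTTTTTTTTTTTTTTTTTT",
            "ATGATGATGATGATGATG",
            "ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTAG"):
    print(f"{seq[:20]:<20} top 3-mer fraction: {top_kmer_fraction(seq):.2f}")
```

How high the fraction must be before a sequence counts as repetitive is a judgment call, so the sketch returns the raw value and leaves the cutoff to the reader.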

In [4]:
# ORF (Open Reading Frame) Finder - Real Bioinformatics Algorithm!
def find_orfs(dna_seq, min_length=30):
    """
    Find Open Reading Frames (ORFs) in a DNA sequence
    ORF = Start codon (ATG) → Stop codon (TAA, TAG, TGA) in same reading frame
    This is a fundamental bioinformatics algorithm!
    """
    seq = dna_seq.upper()
    stop_codons = ['TAA', 'TAG', 'TGA']
    orfs = []

    print(f"🧬 ORF FINDER - Analyzing: {seq}")
    print(f"Minimum ORF length: {min_length} bp")
    print("-" * 50)

    # Check the three forward-strand reading frames (the reverse strand is not scanned)
    for frame in range(3):
        print(f"\n📖 Reading Frame {frame + 1}:")

        # Start from the frame position
        for start_pos in range(frame, len(seq) - 2, 3):
            codon = seq[start_pos:start_pos + 3]

            # Found start codon?
            if codon == 'ATG':
                print(f"  Start codon found at position {start_pos + 1}")

                # Look for stop codon in same frame
                for stop_pos in range(start_pos + 3, len(seq) - 2, 3):
                    stop_codon = seq[stop_pos:stop_pos + 3]

                    if stop_codon in stop_codons:
                        orf_length = stop_pos + 3 - start_pos

                        if orf_length >= min_length:
                            orf_seq = seq[start_pos:stop_pos + 3]

                            print(f"    → Stop codon {stop_codon} at position {stop_pos + 1}")
                            print(f"    → ORF length: {orf_length} bp")
                            print(f"    → ORF sequence: {orf_seq}")

                            orfs.append({
                                'frame': frame + 1,
                                'start': start_pos + 1,
                                'stop': stop_pos + 3,
                                'length': orf_length,
                                'sequence': orf_seq,
                                'start_codon': codon,
                                'stop_codon': stop_codon
                            })
                        else:
                            print(f"    → Stop codon {stop_codon} found but ORF too short ({orf_length} bp)")

                        break  # Found stop codon, move to next start

    print(f"\n📊 SUMMARY: Found {len(orfs)} ORFs meeting criteria")
    return orfs

# Test with realistic gene sequence
test_gene = "ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATGCCCAAGTAG"
print("Testing ORF finder with realistic gene sequence:")
found_orfs = find_orfs(test_gene, min_length=30)

# Show detailed results
if found_orfs:
    print("\n🎯 DETAILED ORF ANALYSIS:")
    for i, orf in enumerate(found_orfs, 1):
        print(f"\nORF {i}:")
        print(f"  Reading Frame: {orf['frame']}")
        print(f"  Position: {orf['start']}-{orf['stop']}")
        print(f"  Length: {orf['length']} bp ({orf['length']//3} amino acids)")
        print(f"  Codons: {orf['start_codon']} ... {orf['stop_codon']}")

Testing ORF finder with realistic gene sequence:
🧬 ORF FINDER - Analyzing: ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATGCCCAAGTAG
Minimum ORF length: 30 bp
--------------------------------------------------

📖 Reading Frame 1:
  Start codon found at position 1
    → Stop codon TAG at position 67
    → ORF length: 69 bp
    → ORF sequence: ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATGCCCAAGTAG
  Start codon found at position 58
    → Stop codon TAG found but ORF too short (12 bp)

📖 Reading Frame 2:

📖 Reading Frame 3:

📊 SUMMARY: Found 1 ORFs meeting criteria

🎯 DETAILED ORF ANALYSIS:

ORF 1:
  Reading Frame: 1
  Position: 1-69
  Length: 69 bp (23 amino acids)
  Codons: ATG ... TAG
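
`find_orfs` scans only the three reading frames of the sequence as given. Because a gene can lie on either strand, ORF searches commonly repeat the scan on the reverse complement as well. The sketch below illustrates that idea with the same start/stop logic but without the progress printing. The names `reverse_complement`, `orfs_in_strand`, and `find_orfs_both_strands` are hypothetical helpers, not part of the notebook.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna_seq):
    """Return the reverse complement of a DNA sequence (A<->T, G<->C)."""
    return dna_seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_length=30):
    """Yield (frame, start, end, orf) for ATG...stop ORFs in the three frames of seq.

    start and end are 1-based positions on the strand that was passed in.
    """
    for frame in range(3):
        for start in range(frame, len(seq) - 2, 3):
            if seq[start:start + 3] != "ATG":
                continue
            for stop in range(start + 3, len(seq) - 2, 3):
                if seq[stop:stop + 3] in STOP_CODONS:
                    if stop + 3 - start >= min_length:
                        yield frame + 1, start + 1, stop + 3, seq[start:stop + 3]
                    break  # the first in-frame stop ends this ORF


def find_orfs_both_strands(dna_seq, min_length=30):
    """Collect ORFs from the forward strand and from its reverse complement."""
    forward = dna_seq.upper()
    results = []
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame, start, end, orf in orfs_in_strand(seq, min_length):
            results.append({"strand": strand, "frame": frame,
                            "start": start, "end": end, "sequence": orf})
    return results


# A toy ORF placed on the reverse strand: only the "-" scan can see it.
hidden_gene = reverse_complement("ATGAAACCCGGGTTTAAACCCGGGTTTTAG")
for hit in find_orfs_both_strands(hidden_gene, min_length=30):
    print(hit["strand"], hit["frame"], hit["start"], hit["end"], hit["sequence"])
```

Positions reported for the `-` strand are counted along the reverse complement; converting them back to forward-strand coordinates is left out to keep the sketch short.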


In [5]:
# Simulating File Input - Multiple FASTA sequences
# (In real life, these would come from files, but we'll simulate for now)

print("📁 PROCESSING MULTIPLE SEQUENCES (FASTA-style)")
print("="*55)

# Simulate FASTA file content
fasta_data = """
>gene1|insulin_human|chromosome_11
ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATG
CCCAAGTAG

>gene2|hemoglobin_alpha|chromosome_16
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCG
CACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTAG

>gene3|repetitive_element|unknown
ATGATGATGATGATGATGATGATGATGATGATGATGATG

>gene4|short_sequence|test
ATGAAATAG
"""

def parse_fasta_simulation(fasta_text):
    """
    Parse simulated FASTA data into sequences
    This simulates reading a real FASTA file!
    """
    sequences = []
    current_header = ""
    sequence_parts = []  # collect sequence lines, join once per record

    def save_record():
        """Store the record collected so far (skipped if header or sequence is empty)."""
        sequence = "".join(sequence_parts).replace(' ', '')  # Remove spaces
        if current_header and sequence:
            sequences.append({
                'header': current_header,
                'sequence': sequence,
                'length': len(sequence)
            })

    for line in fasta_text.strip().split('\n'):
        line = line.strip()

        if line.startswith('>'):  # Header line
            save_record()  # Save previous sequence if it exists

            # Start new sequence
            current_header = line[1:]  # Remove '>'
            sequence_parts = []

        elif line:  # Sequence line (blank lines are skipped)
            sequence_parts.append(line)

    # Don't forget the last sequence!
    save_record()

    return sequences

# Parse our simulated FASTA data
sequences = parse_fasta_simulation(fasta_data)

print(f"📊 Parsed {len(sequences)} sequences from FASTA data:")
for i, seq_data in enumerate(sequences, 1):
    print(f"\nSequence {i}:")
    print(f"  Header: {seq_data['header']}")
    print(f"  Length: {seq_data['length']} bp")
    print(f"  Preview: {seq_data['sequence'][:30]}...")

📁 PROCESSING MULTIPLE SEQUENCES (FASTA-style)
📊 Parsed 4 sequences from FASTA data:

Sequence 1:
  Header: gene1|insulin_human|chromosome_11
  Length: 69 bp
  Preview: ATGAAGTGCAACCTGCTGGTGCTGTTCCTG...

Sequence 2:
  Header: gene2|hemoglobin_alpha|chromosome_16
  Length: 99 bp
  Preview: ATGGTGCTGTCTCCTGCCGACAAGACCAAC...

Sequence 3:
  Header: gene3|repetitive_element|unknown
  Length: 39 bp
  Preview: ATGATGATGATGATGATGATGATGATGATG...

Sequence 4:
  Header: gene4|short_sequence|test
  Length: 9 bp
  Preview: ATGAAATAG...
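
The cell above parses a string held in memory. With a real file, the same header/sequence loop can iterate over the open file handle line by line. The following is a minimal sketch, assuming a hypothetical generator named `read_fasta` that accepts any iterable of lines. The demo writes a temporary file so that it runs anywhere; the file contents are illustrative only.

```python
import os
import tempfile


def read_fasta(lines):
    """Yield (header, sequence) pairs from any iterable of FASTA lines."""
    header = None
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)  # finish the previous record
            header = line[1:]  # drop the leading '>'
            parts = []
        elif line and header is not None:
            parts.append(line)  # sequence line belonging to the current record
    if header is not None:
        yield header, "".join(parts)  # the last record has no following header


# Write a small demo file, then read it back through an open file handle.
demo_text = ">gene4|short_sequence|test\nATGAAATAG\n>gene_demo|two_lines\nATGAAA\nTAG\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as handle:
    handle.write(demo_text)
    demo_path = handle.name

with open(demo_path) as fasta_file:
    records = list(read_fasta(fasta_file))
os.remove(demo_path)

for header, sequence in records:
    print(f"{header}: {len(sequence)} bp")
```

Unlike `parse_fasta_simulation`, this sketch also yields records whose sequence is empty; whether to drop them is left to the caller.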


In [6]:
# Batch Processing - Analyze all sequences at once
def batch_sequence_analysis(sequence_list):
    """
    Analyze multiple sequences in batch
    This is how real bioinformatics pipelines work!
    """
    print("🔬 BATCH SEQUENCE ANALYSIS PIPELINE")
    print("="*50)

    results = []

    for i, seq_data in enumerate(sequence_list, 1):
        header = seq_data['header']
        sequence = seq_data['sequence']

        print(f"\n📋 Processing sequence {i}: {header.split('|')[0]}")
        print("-" * 40)

        # Basic statistics
        length = len(sequence)
        gc_content = (sequence.count('G') + sequence.count('C')) / length * 100 if length > 0 else 0

        # Find ORFs
        orfs = find_orfs(sequence, min_length=15)  # Lower threshold for demo

        # Classify sequence
        classification = classify_sequence_type(sequence)

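        # Count ATG and stop-codon triplets anywhere in the sequence
        # (str.count tallies non-overlapping matches in any frame, not in-frame codons)
        start_codons = sequence.count('ATG')
        stop_codons = sum(sequence.count(stop) for stop in ['TAA', 'TAG', 'TGA'])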

        # Store results
        result = {
            'id': i,
            'header': header,
            'sequence': sequence,
            'length': length,
            'gc_content': gc_content,
            'start_codons': start_codons,
            'stop_codons': stop_codons,
            'orfs': orfs,
            'classification': classification
        }

        results.append(result)

        # Display summary
        print(f"  Length: {length} bp")
        print(f"  GC content: {gc_content:.1f}%")
        print(f"  Start codons: {start_codons}")
        print(f"  Stop codons: {stop_codons}")
        print(f"  ORFs found: {len(orfs)}")
        print(f"  Classification: {classification}")

    return results

# Run batch analysis
batch_results = batch_sequence_analysis(sequences)

# Generate summary report
print("\n" + "="*60)
print("📊 BATCH ANALYSIS SUMMARY REPORT")
print("="*60)

total_sequences = len(batch_results)
total_length = sum(result['length'] for result in batch_results)
avg_gc = sum(result['gc_content'] for result in batch_results) / total_sequences if total_sequences > 0 else 0
total_orfs = sum(len(result['orfs']) for result in batch_results)

print(f"Total sequences analyzed: {total_sequences}")
print(f"Total sequence length: {total_length:,} bp")
print(f"Average GC content: {avg_gc:.1f}%")
print(f"Total ORFs found: {total_orfs}")

# Classification summary
classifications = [result['classification'] for result in batch_results]
classification_counts = {}
for classification in classifications:
    classification_counts[classification] = classification_counts.get(classification, 0) + 1

print("\nClassification breakdown:")
for classification, count in classification_counts.items():
    print(f"  {classification}: {count}")

🔬 BATCH SEQUENCE ANALYSIS PIPELINE

📋 Processing sequence 1: gene1
----------------------------------------
🧬 ORF FINDER - Analyzing: ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATGCCCAAGTAG
Minimum ORF length: 15 bp
--------------------------------------------------

📖 Reading Frame 1:
  Start codon found at position 1
    → Stop codon TAG at position 67
    → ORF length: 69 bp
    → ORF sequence: ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATGCCCAAGTAG
  Start codon found at position 58
    → Stop codon TAG found but ORF too short (12 bp)

📖 Reading Frame 2:

📖 Reading Frame 3:

📊 SUMMARY: Found 1 ORFs meeting criteria

🔬 CLASSIFYING SEQUENCE: ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATGCCCAAGTAG
Length: 69 bp, GC: 58.0%, AT: 42.0%
📊 CLASSIFICATION RESULTS:
  Size category: Medium (gene fragment)
  GC category: Moderate GC (typical eukaryote)
  Contains start codon (ATG): True
  Contains stop codons: True
  Repetitive: True
  🎯 PREDICTION: Re
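
The summary report at the end of this cell tallies classifications with a hand-written dictionary. As a hedged alternative, `collections.Counter` performs the same tally, and wrapping the aggregation in a function makes it easy to guard against an empty batch. The function name `summarize_batch` is hypothetical; it assumes result dictionaries with the `length`, `gc_content`, `orfs`, and `classification` keys built by `batch_sequence_analysis`. The toy records use made-up values purely to show the input shape.

```python
from collections import Counter


def summarize_batch(results):
    """Aggregate per-sequence result dicts into one summary dict."""
    n = len(results)
    return {
        "total_sequences": n,
        "total_length": sum(r["length"] for r in results),
        "avg_gc": sum(r["gc_content"] for r in results) / n if n else 0.0,
        "total_orfs": sum(len(r["orfs"]) for r in results),
        "classification_counts": Counter(r["classification"] for r in results),
    }


# Toy records with made-up values, only to show the shape of the input.
toy_results = [
    {"length": 10, "gc_content": 40.0, "orfs": [], "classification": "Non-coding sequence"},
    {"length": 30, "gc_content": 60.0, "orfs": [{}], "classification": "Non-coding sequence"},
]
summary = summarize_batch(toy_results)
print(summary["total_length"], summary["avg_gc"], summary["total_orfs"])
print(summarize_batch([])["avg_gc"])  # empty batch: no ZeroDivisionError
```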

In [7]:
# Advanced: Codon Usage Analysis with Nested Loops
def analyze_codon_usage(sequence_list):
    """
    Analyze codon usage patterns across multiple sequences
    This uses nested loops - common in bioinformatics!
    """
    print("🧬 CODON USAGE ANALYSIS")
    print("="*30)

    # Initialize codon counter
    codon_counts = {}
    total_codons = 0

    # Outer loop: each sequence
    for seq_data in sequence_list:
        sequence = seq_data['sequence'].upper()
        header = seq_data['header']

        print(f"\nAnalyzing codons in: {header.split('|')[0]}")

        # Inner loop: each codon position (reads in frame from the first base)
        for i in range(0, len(sequence) - 2, 3):
            codon = sequence[i:i+3]

            # The range above only yields complete codons; skip ambiguous bases
            if all(base in 'ATGC' for base in codon):
                codon_counts[codon] = codon_counts.get(codon, 0) + 1
                total_codons += 1
                print(f"  Position {i+1:2d}: {codon}")

    # Calculate frequencies and display results
    print(f"\n📊 CODON USAGE SUMMARY ({total_codons} total codons):")
    print("-" * 40)

    # Sort codons by frequency
    sorted_codons = sorted(codon_counts.items(), key=lambda x: x[1], reverse=True)

    for codon, count in sorted_codons:
        frequency = (count / total_codons) * 100 if total_codons > 0 else 0
        print(f"{codon}: {count:3d} times ({frequency:5.1f}%)")

    return codon_counts

# Run codon analysis
codon_results = analyze_codon_usage(sequences)

🧬 CODON USAGE ANALYSIS

Analyzing codons in: gene1
  Position  1: ATG
  Position  4: AAG
  Position  7: TGC
  Position 10: AAC
  Position 13: CTG
  Position 16: CTG
  Position 19: GTG
  Position 22: CTG
  Position 25: TTC
  Position 28: CTG
  Position 31: GGG
  Position 34: CTC
  Position 37: TTC
  Position 40: CTG
  Position 43: TTC
  Position 46: GCC
  Position 49: TTC
  Position 52: CTG
  Position 55: GGC
  Position 58: ATG
  Position 61: CCC
  Position 64: AAG
  Position 67: TAG

Analyzing codons in: gene2
  Position  1: ATG
  Position  4: GTG
  Position  7: CTG
  Position 10: TCT
  Position 13: CCT
  Position 16: GCC
  Position 19: GAC
  Position 22: AAG
  Position 25: ACC
  Position 28: AAC
  Position 31: GTC
  Position 34: AAG
  Position 37: GCC
  Position 40: GCC
  Position 43: TGG
  Position 46: GGT
  Position 49: AAG
  Position 52: GTC
  Position 55: GGC
  Position 58: GCG
  Position 61: CAC
  Position 64: GCT
  Position 67: GGC
  Position 70: GAG
  Position 73: TAT
  Positio
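
The counting loop in this cell can also be expressed with list comprehensions: one to split the sequence into codons and one to filter out codons containing ambiguous bases. This is a minimal sketch, assuming the sequence is read in frame from its first base, as in the cell above. The name `count_codons` is hypothetical.

```python
from collections import Counter


def count_codons(dna_seq):
    """Count complete, unambiguous codons, reading in frame from the first base."""
    seq = dna_seq.upper()
    # Split into complete triplets; a trailing partial codon is never produced
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Keep only codons made entirely of A, T, G, C
    valid = [codon for codon in codons if set(codon) <= set("ATGC")]
    return Counter(valid)


counts = count_codons("ATGAAATAG")
total = sum(counts.values())
for codon, count in counts.most_common():
    print(f"{codon}: {count} ({count / total * 100:.1f}%)")
```

`Counter.most_common()` returns the codons sorted by count, replacing the explicit `sorted(..., key=lambda ...)` call.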

In [8]:
# Day 3 Summary and Achievements
print("\n" + "="*60)
print("🎉 DAY 3 ACHIEVEMENTS - CONTROL STRUCTURES MASTERED!")
print("="*60)

print("\n✅ PROGRAMMING SKILLS LEARNED:")
programming_skills = [
    "For loops with enumerate() for position tracking",
    "While loops for pattern searching",
    "Nested conditionals for biological decision making",
    "Complex if/elif/else chains",
    "Nested loops for multi-dimensional analysis",
    "Dictionary accumulation patterns",
    "List comprehensions for data filtering"
]

for i, skill in enumerate(programming_skills, 1):
    print(f"{i}. {skill}")

print("\n🧬 BIOINFORMATICS ALGORITHMS IMPLEMENTED:")
algorithms = [
    "ORF (Open Reading Frame) Finder",
    "Multi-sequence FASTA parser",
    "Batch processing pipeline",
    "Sequence classification system",
    "Codon usage analyzer",
    "Pattern recognition for biological features"
]

for algo in algorithms:
    print(f"• {algo}")

print("\n🔬 BIOLOGICAL CONCEPTS MASTERED:")
concepts = [
    "Open Reading Frames and gene structure",
    "Start codons (ATG) and stop codons (TAA, TAG, TGA)",
    "Reading frames and translation",
    "Codon usage patterns",
    "Sequence classification by GC content",
    "FASTA format understanding"
]

for concept in concepts:
    print(f"• {concept}")

print(f"\n📈 BOOTCAMP PROGRESS: 3/56 days completed ({3/56*100:.1f}%)")
print("🚀 Tomorrow: Functions, file I/O, and NumPy introduction!")

# Code quality assessment
print("\n💻 CODE QUALITY MILESTONES:")
quality_points = [
    "Functions with proper documentation",
    "Error handling for edge cases",
    "Meaningful variable names",
    "Biological context in comments",
    "Modular, reusable code structure"
]

for point in quality_points:
    print(f"✓ {point}")


🎉 DAY 3 ACHIEVEMENTS - CONTROL STRUCTURES MASTERED!

✅ PROGRAMMING SKILLS LEARNED:
1. For loops with enumerate() for position tracking
2. While loops for pattern searching
3. Nested conditionals for biological decision making
4. Complex if/elif/else chains
5. Nested loops for multi-dimensional analysis
6. Dictionary accumulation patterns
7. List comprehensions for data filtering

🧬 BIOINFORMATICS ALGORITHMS IMPLEMENTED:
• ORF (Open Reading Frame) Finder
• Multi-sequence FASTA parser
• Batch processing pipeline
• Sequence classification system
• Codon usage analyzer
• Pattern recognition for biological features

🔬 BIOLOGICAL CONCEPTS MASTERED:
• Open Reading Frames and gene structure
• Start codons (ATG) and stop codons (TAA, TAG, TGA)
• Reading frames and translation
• Codon usage patterns
• Sequence classification by GC content
• FASTA format understanding

📈 BOOTCAMP PROGRESS: 3/56 days completed (5.4%)
🚀 Tomorrow: Functions, file I/O, and NumPy introduction!

💻 CODE QUALITY MILESTO