# 🧬 String Manipulation for Biological Data

## Mastering Text Processing in Biology

In biology, we constantly work with text data:
- **DNA/RNA sequences**: ATCG, AUGC
- **Protein sequences**: MVLSEGEWQLVLHVWAK...
- **Gene names**: BRCA1, TP53, EGFR
- **Sample IDs**: S001_Control, S002_Treatment
- **File paths**: data/sequences/human_genome.fasta

String manipulation is essential for processing, analyzing, and formatting this biological data. Let's master these skills!

## 🎯 Why String Manipulation Matters in Biology

String operations help you:
- **Clean data**: Remove unwanted characters from sequences
- **Search patterns**: Find restriction sites, motifs, or mutations
- **Format output**: Create proper FASTA files or data tables
- **Parse information**: Extract gene names from complex identifiers
- **Validate data**: Check if sequences are valid DNA/RNA/protein

## 1️⃣ String Basics - Accessing and Slicing

Strings are sequences of characters, so a DNA sequence stored as a string can be indexed to get individual bases or sliced to extract subsequences.

In [None]:
# DNA sequence as a string
dna_sequence = "ATCGATCGTAGCTACG"

# Accessing individual characters (0-indexed!)
first_base = dna_sequence[0]
last_base = dna_sequence[-1]
third_base = dna_sequence[2]

print(f"DNA sequence: {dna_sequence}")
print(f"First base: {first_base}")
print(f"Last base: {last_base}")
print(f"Third base (index 2): {third_base}")

# Length of sequence
print(f"\nSequence length: {len(dna_sequence)} bp")

### Slicing - Extracting Subsequences

In [None]:
# Slicing syntax: string[start:end:step]
gene_sequence = "ATGGCGACCCTGGAAAAGCTGATG"

# Extract start codon
start_codon = gene_sequence[0:3]
print(f"Start codon: {start_codon}")

# Extract first 10 bases
first_ten = gene_sequence[:10]
print(f"First 10 bases: {first_ten}")

# Extract last 10 bases
last_ten = gene_sequence[-10:]
print(f"Last 10 bases: {last_ten}")

# Step slicing: every third base, starting at offset 0, 1, or 2.
# These are codon positions 1, 2, and 3 of the first frame, not reading frames.
pos1 = gene_sequence[0::3]
pos2 = gene_sequence[1::3]
pos3 = gene_sequence[2::3]
print("\nBases by codon position:")
print(f"Position 1: {pos1}")
print(f"Position 2: {pos2}")
print(f"Position 3: {pos3}")
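
Step slicing picks out single bases, which is different from reading a sequence codon by codon. The sketch below splits a sequence into complete codons for each of the three forward reading frames. The helper name `codons_in_frame` is hypothetical and not part of the original notebook.

```python
def codons_in_frame(sequence, frame=0):
    """Split a sequence into complete codons, starting at offset `frame` (0, 1, or 2).

    Hypothetical helper for illustration only.
    """
    # The last valid codon start is len(sequence) - 3, so a trailing
    # partial codon is dropped.
    last_start = len(sequence) - 2
    return [sequence[i:i + 3] for i in range(frame, last_start, 3)]


gene_sequence = "ATGGCGACCCTGGAAAAGCTGATG"

for frame in range(3):
    codons = codons_in_frame(gene_sequence, frame)
    print(f"Frame {frame + 1}: {' '.join(codons)}")
```

Each frame is the same sequence read from a different starting offset, so frames 2 and 3 lose one codon at the end here.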

## 2️⃣ Case Conversion - Upper, Lower, and More

Case conversion is crucial for standardizing biological sequences and identifiers.

In [None]:
# Mixed case sequence (e.g. soft-masked genomes use lowercase for repeats)
mixed_sequence = "ATCGatcgTAGC"
gene_name = "brca1"
species = "homo sapiens"

# Convert to uppercase (standard for sequences)
dna_upper = mixed_sequence.upper()
print(f"Original: {mixed_sequence}")
print(f"Uppercase: {dna_upper}")

# Convert to lowercase
dna_lower = mixed_sequence.lower()
print(f"Lowercase: {dna_lower}")

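# Capitalize only the first letter (binomial names: genus capitalized, species lowercase)
# Note: .title() would capitalize every word and give "Homo Sapiens"
species_formatted = species.capitalize()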
print(f"\nSpecies: {species} → {species_formatted}")

# Gene names often uppercase
gene_formatted = gene_name.upper()
print(f"Gene: {gene_name} → {gene_formatted}")

# Check case
print(f"\nIs uppercase? {dna_upper.isupper()}")
print(f"Is lowercase? {dna_lower.islower()}")

## 3️⃣ String Methods - Find, Replace, Count

These methods are essential for sequence analysis!

### Finding Patterns

In [None]:
# DNA sequence with restriction sites
plasmid = "ATCGAATTCGATCGGATCCTAGCGAATTCGCTA"

# Find restriction site (EcoRI: GAATTC)
ecori_site = "GAATTC"
first_site = plasmid.find(ecori_site)
print(f"Plasmid sequence: {plasmid}")
print(f"First EcoRI site at position: {first_site}")

# Find from a specific position
second_site = plasmid.find(ecori_site, first_site + 1)
print(f"Second EcoRI site at position: {second_site}")

# Check if pattern exists
has_bamhi = "GGATCC" in plasmid
has_pvui = "CGATCG" in plasmid
print(f"\nContains BamHI site (GGATCC)? {has_bamhi}")
print(f"Contains PvuI site (CGATCG)? {has_pvui}")

# Count occurrences
ecori_count = plasmid.count(ecori_site)
print(f"\nNumber of EcoRI sites: {ecori_count}")
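
One detail worth checking when searching: `find()` reports a missing pattern with `-1` rather than an error. A small sketch using the same plasmid and the HindIII site (AAGCTT), which does not occur in it:

```python
plasmid = "ATCGAATTCGATCGGATCCTAGCGAATTCGCTA"

# find() signals "not found" with -1. Since -1 is also a valid index
# (the last character), test for it explicitly before using the result.
hindiii_pos = plasmid.find("AAGCTT")  # HindIII site, absent from this plasmid
if hindiii_pos == -1:
    print("No HindIII site found")
else:
    print(f"HindIII site at position {hindiii_pos}")

# index() behaves like find() but raises ValueError when the pattern is missing.
try:
    plasmid.index("AAGCTT")
except ValueError:
    print("index() raised ValueError for the missing site")
```

Use `find()` when a missing pattern is an ordinary outcome, and `index()` when it should stop the program.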

### Replacing Sequences

In [None]:
# RNA sequence
rna = "AUGCGAUCCUAGUAA"

# Convert RNA to DNA
dna = rna.replace("U", "T")
print(f"RNA: {rna}")
print(f"DNA: {dna}")

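# Simulate a point mutation (A → T). replace() changes every occurrence
# of the pattern, so pass a count of 1 to change only the first match.
wild_type = "ATGGCGATCGTAGC"
mutant = wild_type.replace("GAT", "GTT", 1)  # in frame 1: ATC (Ile) → TTC (Phe)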
print(f"\nWild type: {wild_type}")
print(f"Mutant:    {mutant}")

# Remove unwanted characters from sequence
dirty_sequence = "ATCG ATCG\nTAGC\tGCTA"
clean_sequence = dirty_sequence.replace(" ", "").replace("\n", "").replace("\t", "")
print(f"\nDirty: {repr(dirty_sequence)}")
print(f"Clean: {clean_sequence}")
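
Chained `replace()` calls work, but they need one call per character. As a sketch, two alternatives that remove a whole group of characters at once (the GenBank-style line is made-up example data):

```python
import string

dirty_sequence = "ATCG ATCG\nTAGC\tGCTA\r\n"

# split() with no arguments splits on any run of whitespace, so joining
# the pieces removes spaces, tabs, and newlines in one step.
clean_sequence = "".join(dirty_sequence.split())
print(clean_sequence)

# str.translate with a deletion table removes a chosen set of characters,
# here all whitespace plus digits (such as position numbers).
delete_table = str.maketrans("", "", string.whitespace + string.digits)
numbered_line = "        1 atcgatcgta gctagctagc"  # made-up example line
sequence_only = numbered_line.translate(delete_table).upper()
print(sequence_only)
```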

## 4️⃣ String Splitting and Joining

Essential for parsing biological data files and creating formatted output.

In [None]:
# Splitting strings
fasta_header = ">gi|123456|ref|NM_001234.5| Homo sapiens BRCA1 gene"

# Split by pipe symbol
header_parts = fasta_header.split("|")
print("Header parts:")
for i, part in enumerate(header_parts):
    print(f"  {i}: {part}")

# Extract specific information
gi_number = header_parts[1]
accession = header_parts[3]
print(f"\nGI: {gi_number}")
print(f"Accession: {accession}")

# Split sample data
samples = "Control1,Control2,Treatment1,Treatment2"
sample_list = samples.split(",")
print(f"\nSamples: {sample_list}")

In [None]:
# Joining strings
codons = ["ATG", "GCG", "ACC", "CTG", "GAA", "AAG", "TGA"]

# Join into continuous sequence
full_sequence = "".join(codons)
print(f"Codons: {codons}")
print(f"Joined: {full_sequence}")

# Join with separator
codon_display = "-".join(codons)
print(f"Display: {codon_display}")

# Create CSV line
data = ["Sample1", "25.3", "0.95", "Positive"]
csv_line = ",".join(data)
print(f"\nCSV: {csv_line}")

# Create formatted sequence (groups of 10)
long_seq = "ATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGC"
chunks = [long_seq[i:i+10] for i in range(0, len(long_seq), 10)]
formatted = " ".join(chunks)
print(f"\nFormatted sequence:\n{formatted}")
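
Splitting is also how tab-delimited tables are read line by line. A minimal sketch, assuming a hypothetical three-column sample table (the column names and values are invented for illustration):

```python
# Hypothetical header and data lines from a tab-delimited sample table
header_line = "sample_id\tcondition\tct_value"
data_line = "S001\tControl\t25.3\n"

# Remove only the line ending, then split on tabs.
column_names = header_line.split("\t")
values = data_line.rstrip("\n").split("\t")

# Pair each column name with its value.
record = dict(zip(column_names, values))

# split() always returns strings, so convert numeric fields explicitly.
record["ct_value"] = float(record["ct_value"])
print(record)
```

For real files, Python's built-in `csv` module handles quoting and other edge cases that plain `split()` does not.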

## 5️⃣ String Trimming - Strip, Lstrip, Rstrip

Remove unwanted whitespace and characters from the edges of strings.

In [None]:
# Sequence with whitespace
messy_sequence = "  ATCGATCG\n"
clean = messy_sequence.strip()
print(f"Original: {messy_sequence!r}")  # !r shows the hidden whitespace
print(f"Stripped: {clean!r}")

# Remove specific characters
sequence_with_ns = "NNNNATCGATCGNNNN"
trimmed = sequence_with_ns.strip("N")
print(f"\nWith Ns: {sequence_with_ns}")
print(f"Trimmed: {trimmed}")

# Left and right strip
adapter_seq = "AAAATCGATCGTTT"
no_polyA = adapter_seq.lstrip("A")
no_polyT = adapter_seq.rstrip("T")
print(f"\nOriginal: {adapter_seq}")
print(f"No poly-A: {no_polyA}")
print(f"No poly-T: {no_polyT}")

# Clean file paths
file_path = "  /data/sequences/human.fasta  \n"
clean_path = file_path.strip()
print(f"\nPath: '{clean_path}'")
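
A common pitfall: the argument to `strip()`, `lstrip()`, and `rstrip()` is a set of characters, not a substring. The sketch below shows the difference when removing a leading adapter; the read and adapter strings are example values chosen for illustration.

```python
read = "AGATCGGAAGAGCTTAGGCA"   # example read: adapter followed by TTAGGCA
adapter = "AGATCGGAAGAGC"       # example adapter sequence

# lstrip() removes any leading A, G, T, or C here, because the argument
# is treated as a set of characters. The whole read disappears.
over_trimmed = read.lstrip(adapter)
print(repr(over_trimmed))

# To remove an exact leading substring, test for it and slice it off.
if read.startswith(adapter):
    trimmed_read = read[len(adapter):]
else:
    trimmed_read = read
print(trimmed_read)
```

On Python 3.9 and later, `read.removeprefix(adapter)` does the same as the `startswith` check plus slice.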

## 6️⃣ String Validation - Checking Content

Validate that your biological data is in the correct format.

In [None]:
# Check if string contains only certain characters
def is_valid_dna(sequence):
    """Check if sequence is non-empty and contains only valid DNA bases (A, T, C, G, N)"""
    valid_bases = set("ATCGN")
    return bool(sequence) and all(base in valid_bases for base in sequence.upper())

# Test sequences
seq1 = "ATCGATCG"
seq2 = "ATCGATUG"  # Contains U (RNA)
seq3 = "ATCG123"

print(f"'{seq1}' is valid DNA? {is_valid_dna(seq1)}")
print(f"'{seq2}' is valid DNA? {is_valid_dna(seq2)}")
print(f"'{seq3}' is valid DNA? {is_valid_dna(seq3)}")

# Check string properties
sample_id = "S001"
numeric_id = "12345"
mixed_id = "S001A"

print(f"\n'{sample_id}' is alphanumeric? {sample_id.isalnum()}")
print(f"'{numeric_id}' is numeric? {numeric_id.isnumeric()}")
print(f"'{mixed_id}' starts with 'S'? {mixed_id.startswith('S')}")
print(f"'{mixed_id}' ends with 'A'? {mixed_id.endswith('A')}")
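
A True/False answer says that a sequence is invalid but not why. As a sketch, a hypothetical helper `find_invalid_bases` (not part of the original notebook) that reports where the problem characters are:

```python
def find_invalid_bases(sequence, alphabet="ACGTN"):
    """Return (position, character) pairs for characters outside `alphabet`.

    Hypothetical helper for illustration; positions are 0-indexed.
    """
    allowed = set(alphabet)
    return [
        (position, base)
        for position, base in enumerate(sequence.upper())
        if base not in allowed
    ]


print(find_invalid_bases("ATCGATCG"))  # []
print(find_invalid_bases("ATCGATUG"))  # [(6, 'U')]
print(find_invalid_bases("ATCG123"))
```

An empty list means every character is in the allowed alphabet.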

## 7️⃣ Advanced String Operations for Biology

### Reverse Complement (DNA)

In [None]:
def reverse_complement(dna):
    """Return reverse complement of DNA sequence"""
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    
    # Create complement (uppercase first so lowercase input is handled)
    comp_seq = ""
    for base in dna.upper():
        comp_seq += complement.get(base, base)  # Keep unknown bases (e.g. N) as-is
    
    # Reverse the sequence
    rev_comp = comp_seq[::-1]
    
    return rev_comp

# Test the function
forward = "ATCGATCG"
rev_comp = reverse_complement(forward)

print(f"Forward:    5'-{forward}-3'")
print(f"Rev Comp:   5'-{rev_comp}-3'")  # reversed, so it also reads 5'→3'

# Check a restriction site
ecori = "GAATTC"
ecori_rc = reverse_complement(ecori)
print(f"\nEcoRI site: {ecori}")
print(f"Rev Comp:   {ecori_rc}")
print(f"Palindromic? {ecori == ecori_rc}")
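
The same result can be obtained without a loop by using `str.maketrans()` and `str.translate()`. This is an alternative sketch, not a replacement for the function above; the name `reverse_complement_translate` is hypothetical.

```python
# Translation table covering both cases. Characters not listed
# (such as N) pass through unchanged.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_translate(dna):
    """Return the reverse complement using a translation table (hypothetical helper)."""
    return dna.translate(COMPLEMENT_TABLE)[::-1]


print(reverse_complement_translate("ATCGATCG"))  # CGATCGAT
print(reverse_complement_translate("GAATTC"))    # GAATTC
```

Unlike building a string with `+=` in a loop, `translate()` processes the whole sequence in one call, which tends to matter for long sequences.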

### GC Content Calculation

In [None]:
def calculate_gc_content(sequence):
    """Calculate GC content (%) of a DNA sequence; N and other symbols count toward the length"""
    sequence = sequence.upper()
    gc_count = sequence.count('G') + sequence.count('C')
    total = len(sequence)
    
    if total == 0:
        return 0
    
    gc_percent = (gc_count / total) * 100
    return gc_percent

# Test different sequences
sequences = [
    ("ATCGATCG", "Balanced"),
    ("GCGCGCGC", "High GC"),
    ("ATATATAT", "Low GC"),
    ("ATCGNNNATCG", "With Ns")
]

for seq, description in sequences:
    gc = calculate_gc_content(seq)
    print(f"{description:12} {seq:15} GC: {gc:.1f}%")
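
In the "With Ns" case, the N characters count toward the total length and pull the percentage down. Whether that is desirable depends on the analysis. As a sketch of the alternative, a hypothetical variant that computes GC content over A, C, G, and T only:

```python
def gc_content_unambiguous(sequence):
    """GC percentage over A, C, G, and T only; N and other symbols are ignored.

    Hypothetical variant for illustration.
    """
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    at_count = sequence.count("A") + sequence.count("T")
    called_bases = gc_count + at_count

    if called_bases == 0:
        return 0.0

    return gc_count / called_bases * 100


print(f"{gc_content_unambiguous('ATCGNNNATCG'):.1f}%")  # 50.0%
```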

### Codon Translation

In [None]:
# Simplified codon table
codon_table = {
    'ATG': 'M', 'TGG': 'W', 'TTT': 'F', 'TTC': 'F',
    'TAA': '*', 'TAG': '*', 'TGA': '*',
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R'
}

def translate_sequence(dna):
    """Translate DNA to protein (simplified)"""
    protein = ""
    
    # Process in groups of 3
    for i in range(0, len(dna)-2, 3):
        codon = dna[i:i+3]
        amino_acid = codon_table.get(codon, 'X')  # X for unknown
        protein += amino_acid
        
        # Stop at stop codon
        if amino_acid == '*':
            break
    
    return protein

# Test translation
gene = "ATGGCTCGTTAG"
protein = translate_sequence(gene)

print(f"DNA:     {gene}")
print(f"Codons:  ", end="")
for i in range(0, len(gene), 3):
    print(f"{gene[i:i+3]} ", end="")
print(f"\nProtein: {protein}")
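
The simplified table covers only a few codons, so most real sequences would translate to `X`. One way to build all 64 entries of the standard genetic code is sketched below; it relies on the compact layout used for NCBI translation table 1, where codons are ordered by base in the order T, C, A, G. The variable names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1). Amino acids are listed
# for codons TTT, TTC, TTA, TTG, TCT, ... with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last position fastest, matching the order above.
full_codon_table = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(len(full_codon_table))  # 64
print(full_codon_table["ATG"], full_codon_table["GAT"], full_codon_table["TAA"])  # M D *
```

Assigning `codon_table = full_codon_table` before calling `translate_sequence()` would make it translate any codon made of A, C, G, and T.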

## 🧬 Real-World Example: FASTA File Processing

In [None]:
# Simulate reading a FASTA file
fasta_data = """>seq1|Human|BRCA1|Exon1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAA
AATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTG
>seq2|Mouse|Brca1|Exon1  
ATGGCGTTACTGCCACACGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA
GTCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGA
"""

# Process the FASTA data
lines = fasta_data.strip().split('\n')
sequences = {}
current_id = None
current_seq = ""

for line in lines:
    line = line.strip()
    
    if line.startswith('>'):
        # Save previous sequence
        if current_id:
            sequences[current_id] = current_seq
        
        # Parse new header
        header_parts = line[1:].split('|')
        current_id = header_parts[0]
        current_seq = ""
        
        print(f"Found sequence: {current_id}")
        print(f"  Species: {header_parts[1]}")
        print(f"  Gene: {header_parts[2]}")
    else:
        # Add to current sequence
        current_seq += line.upper()

# Save last sequence
if current_id:
    sequences[current_id] = current_seq

# Analyze sequences
print("\nSequence Analysis:")
print("-" * 50)
for seq_id, sequence in sequences.items():
    gc = calculate_gc_content(sequence)
    print(f"{seq_id}:")
    print(f"  Length: {len(sequence)} bp")
    print(f"  GC Content: {gc:.1f}%")
    print(f"  First 30 bp: {sequence[:30]}")
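
The parsing loop above can be wrapped in a reusable generator so the same logic works on a string, a list of lines, or an open file. This is a minimal sketch under those assumptions; `parse_fasta` is a hypothetical name and the example records are shortened.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; headers are returned without '>'.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1|Human|BRCA1|Exon1
ATGGATTTATCT
gctcttcgc
>seq2|Mouse|Brca1|Exon1
ATGGCGTTACTG
"""

records = list(parse_fasta(example.splitlines()))
for header, sequence in records:
    print(header.split("|")[0], len(sequence))
```

Collecting line pieces in a list and joining once avoids repeated string concatenation on long sequences. For production work, an established parser such as Biopython's `SeqIO` is usually a better choice.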

## 💡 Common String Patterns in Biology

In [None]:
# Useful string manipulation patterns

# 1. Clean sequence data
def clean_sequence(seq):
    """Remove whitespace and convert to uppercase"""
    return seq.replace(" ", "").replace("\n", "").replace("\t", "").upper()

# 2. Format sequence for display
def format_sequence(seq, chunk_size=10, line_length=60):
    """Format sequence in chunks"""
    chunks = [seq[i:i+chunk_size] for i in range(0, len(seq), chunk_size)]
    lines = []
    for i in range(0, len(chunks), line_length//chunk_size):
        line_chunks = chunks[i:i+line_length//chunk_size]
        lines.append(" ".join(line_chunks))
    return "\n".join(lines)

# 3. Extract gene ID from complex header
def extract_gene_id(header):
    """Extract gene ID from FASTA header"""
    # Remove > symbol
    header = header.lstrip('>')
    # Take the first pipe-delimited field, then drop any description after whitespace
    tokens = header.split('|')[0].split()
    return tokens[0] if tokens else ""

# Test these functions
messy = "  ATCG atcg\nTAGC  \t"
print(f"Clean: {clean_sequence(messy)}")

long_seq = "ATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCATCGATCGTAGCTAGCTAGC"
print(f"\nFormatted:\n{format_sequence(long_seq)}")

header = ">NM_001234.5|Homo sapiens BRCA1 mRNA"
print(f"\nGene ID: {extract_gene_id(header)}")

## 🎯 Practice Exercises

### Exercise 1: Sequence Validator

Create a function that validates whether a sequence is valid DNA, RNA, or protein.

In [None]:
def validate_sequence(sequence, seq_type):
    """
    Validate if sequence is valid DNA, RNA, or protein
    
    Args:
        sequence: string sequence to validate
        seq_type: 'DNA', 'RNA', or 'protein'
    
    Returns:
        True if valid, False otherwise
    """
    # TODO: Implement validation logic
    # DNA: only ATCGN
    # RNA: only AUCGN
    # Protein: only valid amino acid letters
    
    pass

# Test cases
test_sequences = [
    ("ATCGATCG", "DNA"),
    ("AUCGAUCG", "RNA"),
    ("ATCGATUG", "DNA"),
    ("MVLSPADKTN", "protein")
]

# Test your function
# for seq, seq_type in test_sequences:
#     result = validate_sequence(seq, seq_type)
#     print(f"{seq_type:8} '{seq}': {result}")

### Exercise 2: Restriction Site Finder

Find all positions of a restriction site in a DNA sequence.

In [None]:
def find_restriction_sites(sequence, site):
    """
    Find all positions of a restriction site in sequence
    
    Args:
        sequence: DNA sequence to search
        site: restriction site pattern
    
    Returns:
        List of positions (0-indexed)
    """
    # TODO: Find all positions where site occurs
    positions = []
    
    # Your code here
    
    return positions

# Test
plasmid = "GAATTCATCGATGAATTCGCTAGAATTC"
ecori = "GAATTC"

# positions = find_restriction_sites(plasmid, ecori)
# print(f"EcoRI sites found at positions: {positions}")

### Exercise 3: Sequence Statistics

Calculate various statistics for a DNA sequence.

In [None]:
def sequence_statistics(sequence):
    """
    Calculate statistics for a DNA sequence
    
    Returns dictionary with:
    - length
    - gc_content
    - at_content  
    - a_count, t_count, g_count, c_count
    """
    stats = {}
    
    # TODO: Calculate all statistics
    # Remember to handle uppercase/lowercase
    
    return stats

# Test
test_seq = "ATCGATCGtagcTAGC"
# stats = sequence_statistics(test_seq)
# for key, value in stats.items():
#     print(f"{key}: {value}")

### Exercise 4: FASTA Header Parser

Parse complex FASTA headers to extract information.

In [None]:
def parse_fasta_header(header):
    """
    Parse a FASTA header and extract information
    
    Example header:
    >gi|123456|ref|NM_001234.5| Homo sapiens BRCA1 (BRCA1), mRNA
    
    Returns dictionary with available fields
    """
    info = {}
    
    # TODO: Parse the header
    # Extract gi number, accession, description, etc.
    
    return info

# Test headers
headers = [
    ">gi|123456|ref|NM_001234.5| Homo sapiens BRCA1 (BRCA1), mRNA",
    ">seq1|Human|Chromosome1|Position:1000-2000"
]

# for header in headers:
#     info = parse_fasta_header(header)
#     print(f"Header: {header}")
#     print(f"Parsed: {info}")
#     print()

### Exercise 5: Sequence Formatter

Format a sequence for nice display or file output.

In [None]:
def format_fasta(sequence, header, line_length=60):
    """
    Format sequence as proper FASTA format
    
    Args:
        sequence: DNA/protein sequence
        header: FASTA header (without >)
        line_length: characters per line
    
    Returns:
        Formatted FASTA string
    """
    # TODO: Create properly formatted FASTA
    # - Add > to header
    # - Split sequence into lines of specified length
    # - Return complete formatted string
    
    pass

# Test
long_sequence = "ATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGC" * 3
header = "test_sequence|Example DNA"

# formatted = format_fasta(long_sequence, header, line_length=50)
# print(formatted)

### Challenge Exercise: Open Reading Frame Finder

Find all open reading frames (ORFs) in a DNA sequence.

In [None]:
def find_orfs(sequence, min_length=100):
    """
    Find all open reading frames in a DNA sequence
    
    An ORF starts with ATG and ends with a stop codon (TAA, TAG, TGA)
    
    Args:
        sequence: DNA sequence to search
        min_length: minimum ORF length in bp
    
    Returns:
        List of tuples: (start_pos, end_pos, orf_sequence)
    """
    orfs = []
    
    # TODO: Implement ORF finding
    # Hints:
    # 1. Find all ATG positions
    # 2. For each ATG, look for stop codons in same frame
    # 3. Check if ORF meets minimum length
    # 4. Consider all three reading frames
    
    return orfs

# Test sequence with known ORF
test_dna = "GGGATGGCTAGCTAAGGGATGAAACCCTGA"
# orfs = find_orfs(test_dna, min_length=9)
# for start, end, orf in orfs:
#     print(f"ORF found: {start}-{end}")
#     print(f"Sequence: {orf}")

## 🎉 Summary

You've mastered essential string manipulation techniques for biological data:

### Core Operations
✅ **Accessing**: Use indexing `[i]` and slicing `[start:end]`  
✅ **Case conversion**: `.upper()`, `.lower()`, `.title()`  
✅ **Finding**: `.find()`, `.count()`, `in` operator  
✅ **Replacing**: `.replace(old, new)`  
✅ **Splitting/Joining**: `.split()`, `.join()`  
✅ **Trimming**: `.strip()`, `.lstrip()`, `.rstrip()`  
✅ **Validation**: `.startswith()`, `.endswith()`, `.isalnum()`  

### Biological Applications
✅ Sequence cleaning and validation  
✅ Reverse complement calculation  
✅ GC content analysis  
✅ Restriction site finding  
✅ FASTA file processing  
✅ Codon translation  

### 🔑 Key Takeaways

1. **Always clean your data**: Remove whitespace, standardize case
2. **Validate input**: Check sequences contain only valid characters
3. **Use appropriate methods**: Each string method has its purpose
4. **Think in patterns**: Many biological tasks follow similar patterns
5. **Format output clearly**: Make results easy to read and use

### 🚀 Next Steps

With these string manipulation skills, you're ready to:
- Work with **conditionals** to make decisions based on sequence content
- Use **loops** to process multiple sequences
- Read and write **biological file formats**
- Build **sequence analysis tools**

**Keep practicing!** String manipulation is fundamental to all bioinformatics work. 🧬💻