# 🧬 Working with FASTA Files in Python

## Learning Objectives:
- Understand the FASTA file format structure
- Parse FASTA data using basic Python file I/O
- Use context managers (`with open()`) for safe file handling
- Store sequence data in dictionaries for easy access
- Apply string manipulation to process biological sequences
- Handle multi-line sequences and complex headers

## 1️⃣ What is FASTA Format?

FASTA is one of the most common formats for storing biological sequences. It's simple, human-readable plain text, and widely supported by bioinformatics tools and databases.

**Structure:**
- **Header line**: Starts with `>` followed by sequence identifier and description
- **Sequence lines**: The actual DNA, RNA, or protein sequence (can span multiple lines)

**Example:**
```
>sequence_id description
ATGGCGACCCTGGAAAAGCTGATG
>another_sequence more info
ATCGATCGTAGCTAGC
```
So to recap, FASTA files always have:
- Header lines starting with `>`
- Sequence lines following the header

But headers? They're the Wild West! Every database and tool formats them differently, as these examples show:

fasta_examples = """
>simple_id
ATCGATCG

>complex|id|with|pipes|UniProt|style This is a description
ATCGATCG

>another_style gene=BRCA1 species="Homo sapiens" method=illumina
ATCGATCG
"""

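Because header conventions vary so much, it can help to separate the ID from the rest first and only then interpret the description. The sketch below is one possible approach, not a standard: `split_header` is a hypothetical helper, and it assumes that `key=value` pairs, when present, follow shell-like quoting so that `shlex.split` keeps `"Homo sapiens"` together as one value.

```python
import shlex

def split_header(header_line):
    """Split a FASTA header into (seq_id, description, attributes).

    Hypothetical helper for illustration; real databases each have
    their own conventions, so adapt the rules to your data.
    """
    text = header_line.lstrip(">").strip()
    # Everything up to the first space is treated as the ID
    seq_id, _, description = text.partition(" ")
    # Assumption: key=value tokens use shell-like quoting
    attributes = {}
    for token in shlex.split(description):
        key, sep, value = token.partition("=")
        if sep:
            attributes[key] = value
    return seq_id, description, attributes

for header in [
    ">simple_id",
    ">complex|id|with|pipes|UniProt|style This is a description",
    '>another_style gene=BRCA1 species="Homo sapiens" method=illumina',
]:
    seq_id, description, attributes = split_header(header)
    print(seq_id, "|", description, "|", attributes)
    # Pipe-delimited IDs can be broken up further if needed
    print("  ID fields:", seq_id.split("|"))
```

Note that `shlex.split` raises `ValueError` on unbalanced quotes, so a description containing a lone apostrophe (for example `5' UTR`) would need different handling.
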
## 2️⃣ Creating Sample FASTA Data

Let's start with a small example dataset to understand the format:

In [None]:
# Create sample FASTA data as a string (triple quotes """ let a string span multiple lines)
sample_fasta = """>gene1 Homo_sapiens BRCA1 Chromosome17
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAA
AATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTG
"""

print("Sample FASTA data:")
print(sample_fasta)

When parsing FASTA data, the first step is to store the header and sequence in an accessible format. We use basic string operations here, mainly `strip()` and `split()`.

In [None]:
# First we strip the whitespace and split the lines to store in a list
lines = sample_fasta.strip().split('\n')

print(lines)
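
One detail worth knowing: `split('\n')` leaves a trailing `\r` on each line when the text uses Windows line endings (`\r\n`), while `str.splitlines()` handles both styles. A small sketch with a made-up string (`windows_text` is illustrative and not part of the sample data):

```python
# Illustrative text with Windows-style line endings
windows_text = ">gene1 demo\r\nATGC\r\nGGTA\r\n"

with_split = windows_text.strip().split("\n")
with_splitlines = windows_text.strip().splitlines()

print(with_split)       # carriage returns remain on all but the last line
print(with_splitlines)  # clean lines
```

For the sample string above both approaches give the same result, so this only matters for text that may come from other platforms.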

Now we can isolate the gene ID, the gene info, and the sequence, and store each under its own key in a dictionary.
The logic works like this:

- Check whether a line starts with `>`
- If it does, take the text up to the first space (without the `>`) as `gene_id` and the rest of the line as `gene_info`
- If not, append the line to the sequence string

In [None]:
gene1 = {
    "gene_id": "",
    "gene_info": "",
    "sequence": ""
}

print("Header analysis:")
print("-" * 50)

for line in lines:
    if line.startswith(">"):
        # Drop the leading '>' and split on the first space only:
        # the first part is the gene ID, the remainder is the description
        header_parts = line[1:].split(" ", 1)
        gene1["gene_id"] = header_parts[0]
        gene1["gene_info"] = header_parts[1] if len(header_parts) > 1 else ""
    else:
        # Strip surrounding whitespace and append the line to the sequence
        gene1["sequence"] += line.strip()

print(f"gene1 id: {gene1['gene_id']}")
print(f"gene1 info: {gene1['gene_info']}")
print(f"gene1 sequence: {gene1['sequence']}")

## 3️⃣ Getting Real FASTA Files: Downloading from Zenodo

Usually we read in FASTA data from files stored on our computer or an external data source. Now let's work with real FASTA files! We'll download ATR gene sequences from Zenodo (a scientific data repository).

In [None]:
# Download FASTA files from Zenodo
# We'll use urllib to download the files directly from the web

import urllib.request

# URLs for the ATR gene sequences
human_atr_url = "https://zenodo.org/records/17223635/files/atr_human.fasta?download=1"
mouse_atr_url = "https://zenodo.org/records/17223635/files/atr_mouse.fasta?download=1"

# Save the files in the current working directory
print("Downloading ATR gene sequences from Zenodo...")

# Download human ATR
urllib.request.urlretrieve(human_atr_url, 'atr_human.fasta')
print("✓ Downloaded atr_human.fasta")

# Download mouse ATR  
urllib.request.urlretrieve(mouse_atr_url, 'atr_mouse.fasta')
print("✓ Downloaded atr_mouse.fasta")

print("\nFiles ready for analysis!")
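
When a notebook is re-run, the files are downloaded again each time, and a network problem surfaces as a raw traceback. The sketch below shows one way to wrap the download, assuming it is acceptable to reuse a file that already exists on disk. `download_if_missing` is a hypothetical helper name, not part of `urllib`.

```python
import os
import urllib.error
import urllib.request

def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Hypothetical helper; returns True if a download happened,
    False if the existing file was reused.
    """
    if os.path.exists(filename):
        print(f"{filename} already present, skipping download")
        return False
    try:
        urllib.request.urlretrieve(url, filename)
    except urllib.error.URLError as error:
        # HTTPError is a subclass of URLError, so this covers both
        raise RuntimeError(f"Could not download {url}: {error}") from error
    print(f"Downloaded {filename}")
    return True
```

With the URLs defined above, this would be called as `download_if_missing(human_atr_url, 'atr_human.fasta')`, and likewise for the mouse file.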

## 4️⃣ Reading FASTA Files with Context Managers

Now let's use the `with open()` statement to safely read these files. This is the **proper way** to handle files in Python!

In [None]:
# First, let's peek at what's in the human ATR file
print("First few lines of human ATR file:")
print("-" * 40)

with open('atr_human.fasta', 'r') as file:
    # Read first 5 lines to see the structure
    for i in range(5):
        line = file.readline().strip()
        if line:  # Only print non-empty lines
            print(f"Line {i+1}: {line}")

print("\n" + "=" * 50)
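
An alternative way to peek at a file is `itertools.islice`, which takes the first `n` items from any iterator and simply stops early if the file is shorter. This is a minimal sketch; `head` is a hypothetical helper name chosen for illustration.

```python
from itertools import islice

def head(filename, n=5):
    """Return the first n lines of a text file, without trailing newlines.

    Hypothetical helper; stops cleanly if the file has fewer than n lines.
    """
    with open(filename, "r") as file:
        return [line.rstrip("\n") for line in islice(file, n)]
```

Calling `head('atr_human.fasta')` would return the same five lines that the loop above prints, as a list that can be inspected or reused.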

## 5️⃣ Parsing Real FASTA Files into Dictionaries

Now let's create a function to parse FASTA files and store the data in dictionaries. We'll use the same logic as before but make it work with files:

In [None]:
def parse_fasta_file(filename):
    """
    Parse a FASTA file into a dictionary
    
    Args:
        filename: path to the FASTA file
    
    Returns:
        dict: {gene_id: {'header': str, 'sequence': str}}
    """
    sequences = {}
    current_gene_id = None
    current_sequence = ""
    
    # Use context manager to safely open and read the file
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()  # Remove whitespace
            
            if line.startswith('>'):
                # Save previous sequence if we have one
                if current_gene_id is not None:
                    sequences[current_gene_id]['sequence'] = current_sequence
                
                # Parse new header
                header_parts = line[1:].split(' ', 1)  # Split on first space only
                current_gene_id = header_parts[0]
                
                # Store header info
                sequences[current_gene_id] = {
                    'header': line[1:],  # Full header without >
                    'sequence': ""
                }
                
                current_sequence = ""  # Reset sequence
                print(f"Found sequence: {current_gene_id}")
                
            else:
                # Add to current sequence (sequences can span multiple lines)
                current_sequence += line.upper()
    
    # Don't forget the last sequence!
    if current_gene_id is not None:
        sequences[current_gene_id]['sequence'] = current_sequence
    
    return sequences

# Parse both ATR files
print("Parsing human ATR file:")
human_atr = parse_fasta_file('atr_human.fasta')

print("\nParsing mouse ATR file:")
mouse_atr = parse_fasta_file('atr_mouse.fasta')

print(f"\n✓ Human ATR: {len(human_atr)} sequences")
print(f"✓ Mouse ATR: {len(mouse_atr)} sequences")
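
`parse_fasta_file` keeps every sequence in memory, which is fine for two genes but can become a problem for very large files. A possible alternative, sketched here under the assumption that records can be processed one at a time, is a generator that yields each record as soon as it is complete. `iter_fasta` is a hypothetical name; it also collects lines in a list and joins them once, which avoids repeated string concatenation.

```python
import io

def iter_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle.

    Sketch of a streaming alternative; header is the full line without '>'.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
        # Lines before the first header are ignored in this sketch
    if header is not None:
        yield header, "".join(chunks)

# Demo on an in-memory file so the example runs without any downloads
demo = io.StringIO(">seq1 first record\nacgt\nACGT\n\n>seq2\nggcc\n")
for header, sequence in iter_fasta(demo):
    print(header, len(sequence))
```

With a real file, the same function would be used inside a context manager: `with open('atr_human.fasta') as file:` followed by a loop over `iter_fasta(file)`.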

## 6️⃣ Analyzing the ATR Gene Sequences

Now let's explore what we've loaded and do some basic sequence analysis:

In [None]:
# Let's examine what we loaded
print("🧬 ATR Gene Analysis")
print("=" * 40)

print("\nHuman ATR sequences:")
for gene_id, data in human_atr.items():
    sequence = data['sequence']
    print(f"  ID: {gene_id}")
    print(f"  Header: {data['header']}")
    print(f"  Length: {len(sequence):,} bp")
    print(f"  First 50 bp: {sequence[:50]}...")
    print()

print("\nMouse ATR sequences:")
for gene_id, data in mouse_atr.items():
    sequence = data['sequence']
    print(f"  ID: {gene_id}")
    print(f"  Header: {data['header']}")
    print(f"  Length: {len(sequence):,} bp")
    print(f"  First 50 bp: {sequence[:50]}...")
    print()
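
Beyond length, a quick look at base composition can reveal unexpected characters such as `N` (an unknown base) before any further analysis. The sketch below uses `collections.Counter` on a toy sequence; `base_composition` is a hypothetical helper, and it assumes the sequence is DNA, so anything other than A, C, G, or T is reported.

```python
from collections import Counter

DNA_BASES = set("ACGT")

def base_composition(sequence):
    """Count each character and list anything outside A, C, G, T.

    Hypothetical helper; treats the sequence as DNA.
    """
    counts = Counter(sequence.upper())
    unexpected = {base: n for base, n in counts.items() if base not in DNA_BASES}
    return counts, unexpected

# Toy sequence for illustration (not the ATR data)
counts, unexpected = base_composition("ATGCNNATGC")
print(dict(counts))
print("Unexpected characters:", unexpected)
```

The same function could be applied to `data['sequence']` inside either loop above.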

## 7️⃣ Exercise: Compare Human and Mouse ATR

Now it's your turn! Let's compare the ATR gene sequences between human and mouse:

In [None]:
# Exercise: Compare ATR sequences
print("🔬 Species Comparison Exercise")
print("-" * 35)

# TODO: Calculate and compare basic statistics
# Hints:
# 1. Calculate sequence length for each species
# 2. Calculate GC content: (G count + C count) / total length * 100
# 3. Check if sequences start with ATG (start codon)
# 4. Find common nucleotide patterns

# Get the first sequence from each species (assuming there's one main sequence)
human_seq_id = list(human_atr.keys())[0]
mouse_seq_id = list(mouse_atr.keys())[0]

human_sequence = human_atr[human_seq_id]['sequence']
mouse_sequence = mouse_atr[mouse_seq_id]['sequence']

print(f"Comparing {human_seq_id} vs {mouse_seq_id}")
print()

# Length comparison
print(f"Human ATR length: {len(human_sequence):,} bp")
print(f"Mouse ATR length: {len(mouse_sequence):,} bp")
print(f"Length difference: {abs(len(human_sequence) - len(mouse_sequence)):,} bp")
print()

# GC content comparison
def calculate_gc_content(sequence):
    """Calculate GC content as percentage"""
    gc_count = sequence.count('G') + sequence.count('C')
    return (gc_count / len(sequence)) * 100 if len(sequence) > 0 else 0

human_gc = calculate_gc_content(human_sequence)
mouse_gc = calculate_gc_content(mouse_sequence)

print(f"Human GC content: {human_gc:.1f}%")
print(f"Mouse GC content: {mouse_gc:.1f}%")
print()

# Start codon check
print(f"Human starts with ATG: {human_sequence.startswith('ATG')}")
print(f"Mouse starts with ATG: {mouse_sequence.startswith('ATG')}")
print()

# First 100 nucleotides comparison
print("First 100 nucleotides:")
print(f"Human: {human_sequence[:100]}")
print(f"Mouse: {mouse_sequence[:100]}")

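# Count identical positions in the first 100 bp (fewer if a sequence is shorter).
# This compares position by position without alignment, so one indel shifts everything after it.
compared = min(100, len(human_sequence), len(mouse_sequence))
identical = sum(1 for h, m in zip(human_sequence[:compared], mouse_sequence[:compared]) if h == m)
percent = identical / compared * 100 if compared else 0
print(f"Identical nucleotides in first {compared} bp: {identical}/{compared} ({percent:.0f}%)")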
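
Hint 4 (finding common nucleotide patterns) is not covered by the code above. One simple way to approach it, sketched here as a suggestion, is to count k-mers (all overlapping substrings of length `k`) in each sequence and keep those that occur in both. `count_kmers` and `shared_kmers` are hypothetical helper names, and the toy sequences are made up for illustration.

```python
from collections import Counter

def count_kmers(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def shared_kmers(seq_a, seq_b, k=3):
    """Return the k-mers found in both sequences, with the smaller count of each."""
    # Counter '&' keeps the minimum count for keys present in both
    return count_kmers(seq_a, k) & count_kmers(seq_b, k)

# Toy sequences for illustration (not the ATR data)
toy_a = "ATGATGCC"
toy_b = "GGATGCAT"
print(shared_kmers(toy_a, toy_b).most_common())
```

For the real data, `shared_kmers(human_sequence, mouse_sequence, k=6).most_common(10)` would list the ten most frequent shared 6-mers; the choice of `k=6` is arbitrary.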

## 8️⃣ Writing Results to a New FASTA File

Let's demonstrate how to write data back to a FASTA file using context managers:

In [None]:
# Create a combined FASTA file with both species
def write_fasta_file(sequences_dict, filename, line_length=80):
    """
    Write sequences to a FASTA file with proper formatting
    
    Args:
        sequences_dict: dictionary with sequence data
        filename: output file name
        line_length: maximum characters per line for sequences
    """
    with open(filename, 'w') as file:
        for gene_id, data in sequences_dict.items():
            # Write header line
            file.write(f">{data['header']}\n")
            
            # Write sequence in chunks of specified length
            sequence = data['sequence']
            for i in range(0, len(sequence), line_length):
                chunk = sequence[i:i + line_length]
                file.write(f"{chunk}\n")

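# Combine both species into one dictionary
# Note: update() overwrites entries that share a key, so this assumes
# the human and mouse sequence IDs are different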
combined_atr = {}
combined_atr.update(human_atr)
combined_atr.update(mouse_atr)

# Write to a new file
output_filename = 'combined_atr_sequences.fasta'
write_fasta_file(combined_atr, output_filename)

print(f"✓ Wrote {len(combined_atr)} sequences to {output_filename}")

# Verify the file was created by reading the first few lines
print("\nFirst few lines of the output file:")
with open(output_filename, 'r') as file:
    for i, line in enumerate(file):
        if i < 5:  # Show first 5 lines
            print(f"  {line.strip()}")
        else:
            break
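
A quick way to gain confidence in a writer is a round trip: format a record, read it back, and check that nothing changed. The sketch below does this in memory on a toy record. `format_record` and `unwrap_record` are hypothetical helpers that mirror the wrapping logic of `write_fasta_file` for a single record.

```python
def format_record(header, sequence, line_length=80):
    """Return one FASTA record as text, wrapping the sequence lines."""
    lines = [f">{header}"]
    lines.extend(sequence[i:i + line_length]
                 for i in range(0, len(sequence), line_length))
    return "\n".join(lines) + "\n"

def unwrap_record(text):
    """Recover (header, sequence) from the text of a single record."""
    first, *rest = text.strip().splitlines()
    return first[1:], "".join(rest)

# Toy record for illustration
record = format_record("demo toy record", "ACGT" * 50, line_length=60)
assert unwrap_record(record) == ("demo toy record", "ACGT" * 50)
print(record)
```

The same idea applies to files: parsing `combined_atr_sequences.fasta` with `parse_fasta_file` should give back a dictionary equal to `combined_atr`.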

## 9️⃣ Summary

In this notebook, you've learned:

### ✅ FASTA Format Basics
- **Structure**: Header lines start with `>`, followed by sequence lines
- **Headers**: Contain sequence identifiers and metadata
- **Sequences**: Can span multiple lines

### ✅ Context Managers (`with open()`)
- **Safe file handling**: Automatically closes files even if errors occur
- **Reading files**: Process line by line for efficiency
- **Writing files**: Create properly formatted output

### ✅ String Processing for Biology
- **Parsing headers**: Extract IDs and metadata
- **Handling sequences**: Join multi-line sequences, convert to uppercase
- **Basic checks**: Test simple properties, such as whether a sequence starts with the `ATG` start codon

### ✅ Dictionary Storage
- **Structured data**: Store sequences with metadata in dictionaries
- **Easy access**: Retrieve sequences by ID
- **Flexible organization**: Handle multiple sequences efficiently

### ✅ Real-World Data
- **Downloaded ATR genes** from Zenodo scientific repository
- **Compared sequences** between human and mouse
- **Calculated statistics** like GC content and position-by-position identity over the first 100 bp

## 🔑 Key Takeaways

1. **Always use `with open()`** for file operations in Python
2. **FASTA files are simple but powerful** for storing biological sequences
3. **Dictionaries are perfect** for organizing sequence data
4. **Real biological data** often requires careful parsing and validation
5. **Context managers prevent file handle leaks** and make code more robust

## 🚀 Next Steps

You can now:
- Work with **FASTA files** from databases like NCBI or Ensembl, adapting the header parsing to each source's conventions
- **Parse and analyze** DNA, RNA, and protein sequences
- **Create workflows** for comparative genomics studies
- **Handle large datasets** efficiently using proper file I/O

**Try downloading other sequences and comparing them!** 🧬💻