
In [2]:
# Working with multiple DNA sequences - using lists
print("Learning to handle multiple sequences with lists")

# I'll create a list of gene sequences I want to analyze
gene_sequences = [
    "ATGAAGTGCAACCTGCTGGTG",  # toy sequence I'm labeling as insulin (not checked against a reference)
    "GCTAGCTAGCTAGCTAGCTAG",  # some repetitive sequence
    "ATGCGATCGTAGCTTAAGGC",   # random sequence
    "TTTTTTTTTTTTTTTTTTTT"    # poly-T sequence
]

print("I have", len(gene_sequences), "sequences to analyze")

# Let me try to analyze each one
for i, sequence in enumerate(gene_sequences):
    print(f"\nSequence {i+1}: {sequence}")

    # basic analysis - I'll calculate length and GC content
    length = len(sequence)
    g_count = sequence.count('G')
    c_count = sequence.count('C')
    gc_content = (g_count + c_count) / length * 100

    print(f"  Length: {length} bp")
    print(f"  GC content: {gc_content:.1f}%")

    # check if it starts with ATG (start codon)
    if sequence.startswith('ATG'):
        print("  This sequence starts with ATG - might be a gene!")
    else:
        print("  No ATG start codon found")

Learning to handle multiple sequences with lists
I have 4 sequences to analyze

Sequence 1: ATGAAGTGCAACCTGCTGGTG
  Length: 21 bp
  GC content: 52.4%
  This sequence starts with ATG - might be a gene!

Sequence 2: GCTAGCTAGCTAGCTAGCTAG
  Length: 21 bp
  GC content: 52.4%
  No ATG start codon found

Sequence 3: ATGCGATCGTAGCTTAAGGC
  Length: 20 bp
  GC content: 50.0%
  This sequence starts with ATG - might be a gene!

Sequence 4: TTTTTTTTTTTTTTTTTTTT
  Length: 20 bp
  GC content: 0.0%
  No ATG start codon found
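
The GC calculation in the loop above is repeated in several later cells. A minimal sketch of how it could be pulled into a helper function follows; the name `gc_content` and the choice to return 0.0 for an empty sequence are my own assumptions, not part of the original notebook.

```python
def gc_content(sequence):
    """Return GC content as a percentage (0.0 for an empty sequence)."""
    sequence = sequence.upper()  # tolerate lowercase input
    if not sequence:
        return 0.0  # assumption: treat empty input as 0% instead of raising
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence) * 100


# two of the sequences from the list above, reused as examples
example_sequences = [
    "ATGAAGTGCAACCTGCTGGTG",
    "TTTTTTTTTTTTTTTTTTTT",
]

for i, sequence in enumerate(example_sequences, start=1):
    print(f"Sequence {i}: GC content {gc_content(sequence):.1f}%")
```

Guarding against an empty string avoids a ZeroDivisionError if a blank sequence ever ends up in the list.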


In [3]:
# Learning to use dictionaries for better data organization
print("Using dictionaries to organize biological information better")

# I think dictionaries will help me keep track of more information
# Let me create a dictionary for each sequence

sequence_data = {}  # empty dictionary to start

# I'll add sequences one by one and learn how dictionaries work
sequence_data["insulin"] = {
    "sequence": "ATGAAGTGCAACCTGCTGGTG",
    "organism": "human",
    "chromosome": "11",
    "function": "hormone"
}

sequence_data["hemoglobin"] = {
    "sequence": "ATGGTGCTGTCTCCTGCCGAC",
    "organism": "human",
    "chromosome": "16",
    "function": "oxygen_transport"
}

# let me see what's in my dictionary
print("My sequence database contains:")
for gene_name in sequence_data:
    info = sequence_data[gene_name]
    print(f"\n{gene_name}:")
    print(f"  Sequence: {info['sequence']}")
    print(f"  Organism: {info['organism']}")
    print(f"  Chromosome: {info['chromosome']}")
    print(f"  Function: {info['function']}")

Using dictionaries to organize biological information better
My sequence database contains:

insulin:
  Sequence: ATGAAGTGCAACCTGCTGGTG
  Organism: human
  Chromosome: 11
  Function: hormone

hemoglobin:
  Sequence: ATGGTGCTGTCTCCTGCCGAC
  Organism: human
  Chromosome: 16
  Function: oxygen_transport
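
The loop above looks up `sequence_data[gene_name]` on every pass. As a sketch of an alternative, `dict.items()` yields each key together with its value, and a second `.items()` loop prints every field without naming them one by one. The variable `report` is only there so the result can be inspected; it is not in the original cell.

```python
# same two entries as in the cell above
sequence_data = {
    "insulin": {
        "sequence": "ATGAAGTGCAACCTGCTGGTG",
        "organism": "human",
        "chromosome": "11",
        "function": "hormone",
    },
    "hemoglobin": {
        "sequence": "ATGGTGCTGTCTCCTGCCGAC",
        "organism": "human",
        "chromosome": "16",
        "function": "oxygen_transport",
    },
}

lines = []
for gene_name, info in sequence_data.items():
    lines.append(f"{gene_name}:")
    for field, value in info.items():
        # capitalize() turns "organism" into "Organism" for display
        lines.append(f"  {field.capitalize()}: {value}")

report = "\n".join(lines)
print(report)

# .get() returns None instead of raising KeyError for a gene that is missing
print(sequence_data.get("myoglobin"))
```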


In [4]:
# Now I want to add analysis results to my dictionary
print("\nAdding analysis results to my sequence database:")

for gene_name in sequence_data:
    sequence = sequence_data[gene_name]["sequence"]

    # calculate basic stats
    length = len(sequence)
    gc_content = (sequence.count('G') + sequence.count('C')) / length * 100

    # add these to my dictionary
    sequence_data[gene_name]["length"] = length
    sequence_data[gene_name]["gc_content"] = gc_content

    # check for start codon
    has_start_codon = sequence.startswith('ATG')
    sequence_data[gene_name]["has_start_codon"] = has_start_codon

    print(f"{gene_name}: {length} bp, GC={gc_content:.1f}%, Start codon: {has_start_codon}")


Adding analysis results to my sequence database:
insulin: 21 bp, GC=52.4%, Start codon: True
hemoglobin: 21 bp, GC=61.9%, Start codon: True


In [5]:
# Let me try to find patterns in my data by looping over the dictionary
print("\nLooking for patterns in my sequence data:")

# I want to find all sequences with high GC content
high_gc_genes = []
for gene_name in sequence_data:
    if sequence_data[gene_name]["gc_content"] > 50:
        high_gc_genes.append(gene_name)

print("Genes with high GC content:", high_gc_genes)

# find all genes on chromosome 11
chr11_genes = []
for gene_name in sequence_data:
    if sequence_data[gene_name]["chromosome"] == "11":
        chr11_genes.append(gene_name)

print("Genes on chromosome 11:", chr11_genes)

# I want to see which organisms I have data for
organisms = []
for gene_name in sequence_data:
    organism = sequence_data[gene_name]["organism"]
    if organism not in organisms:  # avoid duplicates
        organisms.append(organism)

print("Organisms in my database:", organisms)


Looking for patterns in my sequence data:
Genes with high GC content: ['insulin', 'hemoglobin']
Genes on chromosome 11: ['insulin']
Organisms in my database: ['human']
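
The three filters above all follow the same pattern: loop, test, append. A hedged sketch of the same queries written as list comprehensions, with a set comprehension to drop duplicate organisms, is shown below. The GC values are rounded stand-ins for the numbers computed in the earlier cell.

```python
# stand-in data: only the fields needed for the three queries
sequence_data = {
    "insulin": {"organism": "human", "chromosome": "11", "gc_content": 52.4},
    "hemoglobin": {"organism": "human", "chromosome": "16", "gc_content": 61.9},
}

# genes whose GC content is above 50%
high_gc_genes = [
    name for name, info in sequence_data.items() if info["gc_content"] > 50
]

# genes located on chromosome 11
chr11_genes = [
    name for name, info in sequence_data.items() if info["chromosome"] == "11"
]

# a set removes duplicates automatically; sorted() gives a stable list
organisms = sorted({info["organism"] for info in sequence_data.values()})

print("Genes with high GC content:", high_gc_genes)
print("Genes on chromosome 11:", chr11_genes)
print("Organisms in my database:", organisms)
```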


In [7]:
# Now I want to analyze my whole database
# (this needs the length, gc_content and has_start_codon fields
#  that I added to sequence_data in the earlier analysis cell)
print("\nAnalyzing my complete gene database:")

# find the longest gene (on a tie, the first gene seen is kept)
longest_gene = ""
longest_length = 0

for gene_name in sequence_data:
    length = sequence_data[gene_name]["length"]
    if length > longest_length:
        longest_length = length
        longest_gene = gene_name

print(f"Longest gene: {longest_gene} ({longest_length} bp)")

# calculate average GC content
total_gc = 0
gene_count = 0
for gene_name in sequence_data:
    total_gc += sequence_data[gene_name]["gc_content"]
    gene_count += 1

average_gc = total_gc / gene_count
print(f"Average GC content: {average_gc:.1f}%")

# count how many have start codons
genes_with_start = 0
for gene_name in sequence_data:
    if sequence_data[gene_name]["has_start_codon"]:
        genes_with_start += 1

print(f"Genes with start codons: {genes_with_start} out of {gene_count}")


Analyzing my complete gene database:
Longest gene: insulin (21 bp)
Average GC content: 57.1%
Genes with start codons: 2 out of 2
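
The three summaries in this cell (longest gene, average GC content, start codon count) can also be written with Python built-ins. This is only a sketch: the dictionary below is a small stand-in holding the fields added in the earlier analysis cell, with GC values rounded to one decimal.

```python
# stand-in for the database after the analysis fields have been added
sequence_data = {
    "insulin": {"length": 21, "gc_content": 52.4, "has_start_codon": True},
    "hemoglobin": {"length": 21, "gc_content": 61.9, "has_start_codon": True},
}

# max() with a key function; on a tie it returns the first maximal key
longest_gene = max(sequence_data, key=lambda name: sequence_data[name]["length"])
longest_length = sequence_data[longest_gene]["length"]

# sum() over a generator expression replaces the manual running total
average_gc = sum(info["gc_content"] for info in sequence_data.values()) / len(sequence_data)

# True counts as 1 and False as 0, so sum() counts the genes with a start codon
genes_with_start = sum(info["has_start_codon"] for info in sequence_data.values())

print(f"Longest gene: {longest_gene} ({longest_length} bp)")
print(f"Average GC content: {average_gc:.1f}%")
print(f"Genes with start codons: {genes_with_start} out of {len(sequence_data)}")
```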

In [8]:
# Working with multiple sequences like a real FASTA file
print("Learning to handle FASTA-like biological data")

# I'll simulate what a real FASTA file might look like
fasta_text = """
>gene1_insulin_human
ATGAAGTGCAACCTGCTGGTGCTGTTCCTGGGGCTCTTCCTGTTCGCCTTCCTGGGCATG
CCCAAGTACCCGGGCCTGTTCCTGCACATCGTCAAGGTGGAGCGCACCGTGGTGGACCTG

>gene2_hemoglobin_human
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCG
CACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTTGAAGGTGGATGCCAAGAAGCTG

>gene3_short_test
ATGAAATAG

>gene4_no_start_codon
GCTAGCTAGCTAGCTAGCTAGCTAGCTAG
"""

# I need to parse this data - separate headers from sequences
sequences_from_fasta = {}  # dictionary to store parsed data

# split the text into lines and process each line
lines = fasta_text.strip().split('\n')

current_header = ""
current_sequence = ""

for line in lines:
    line = line.strip()  # remove extra spaces

    if line.startswith('>'):  # this is a header line
        # save the previous sequence if we have one
        if current_header and current_sequence:
            sequences_from_fasta[current_header] = current_sequence

        # start a new sequence
        current_header = line[1:]  # remove the '>' character
        current_sequence = ""

    elif line:  # this is a sequence line (not empty)
        current_sequence += line

# don't forget the last sequence!
if current_header and current_sequence:
    sequences_from_fasta[current_header] = current_sequence

print(f"Parsed {len(sequences_from_fasta)} sequences from FASTA data")
for header in sequences_from_fasta:
    sequence = sequences_from_fasta[header]
    print(f"{header}: {len(sequence)} bp")

Learning to handle FASTA-like biological data
Parsed 4 sequences from FASTA data
gene1_insulin_human: 120 bp
gene2_hemoglobin_human: 120 bp
gene3_short_test: 9 bp
gene4_no_start_codon: 29 bp
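
The parsing loop above could be wrapped in a function so it can be reused on other FASTA text. The sketch below is one way to do that; `parse_fasta` is a hypothetical name, and unlike the cell above it keeps a header even when no sequence lines follow it, which is a design choice of mine and not something the original does.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    chunks = []  # sequence lines collected for the current header

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # save the previous record
            header = line[1:]  # drop the '>' character
            chunks = []
        elif header is not None:
            chunks.append(line.upper())

    if header is not None:
        records[header] = "".join(chunks)  # save the last record
    return records


example = """
>gene3_short_test
ATGAAATAG

>gene4_no_start_codon
GCTAGCTAGC
TAGCTAG
"""

parsed = parse_fasta(example)
for header, sequence in parsed.items():
    print(f"{header}: {len(sequence)} bp")
```

Collecting lines in a list and joining once at the end avoids building many intermediate strings, which starts to matter for long sequences.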


In [9]:
# Now I want to analyze all these sequences
print("\nAnalyzing all sequences from my FASTA-like data:")

analysis_results = {}  # store results here

for header in sequences_from_fasta:
    sequence = sequences_from_fasta[header]

    print(f"\nAnalyzing: {header}")

    # basic analysis
    length = len(sequence)

    # count nucleotides
    a_count = sequence.count('A')
    t_count = sequence.count('T')
    g_count = sequence.count('G')
    c_count = sequence.count('C')

    # calculate GC content
    gc_content = (g_count + c_count) / length * 100 if length > 0 else 0

    # look for start and stop codons anywhere in the sequence
    # (plain substring match, so the reading frame is ignored)
    has_start = 'ATG' in sequence
    stop_codons = ['TAA', 'TAG', 'TGA']
    has_stop = any(stop in sequence for stop in stop_codons)

    # store results
    analysis_results[header] = {
        'sequence': sequence,
        'length': length,
        'composition': {'A': a_count, 'T': t_count, 'G': g_count, 'C': c_count},
        'gc_content': gc_content,
        'has_start_codon': has_start,
        'has_stop_codon': has_stop
    }

    print(f"  Length: {length} bp")
    print(f"  GC content: {gc_content:.1f}%")
    print(f"  Start codon: {has_start}, Stop codon: {has_stop}")


Analyzing all sequences from my FASTA-like data:

Analyzing: gene1_insulin_human
  Length: 120 bp
  GC content: 61.7%
  Start codon: True, Stop codon: True

Analyzing: gene2_hemoglobin_human
  Length: 120 bp
  GC content: 63.3%
  Start codon: True, Stop codon: True

Analyzing: gene3_short_test
  Length: 9 bp
  GC content: 22.2%
  Start codon: True, Stop codon: True

Analyzing: gene4_no_start_codon
  Length: 29 bp
  GC content: 51.7%
  Start codon: False, Stop codon: True
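
Two details of the analysis above are worth a small sketch. First, `collections.Counter` can count all four bases in one pass. Second, a substring test such as `'TAG' in sequence` matches in any position, so it can report a stop codon that is not in frame. The helper names `composition` and `codons` are hypothetical, and the 9 bp test sequence is made up for illustration.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def composition(sequence):
    """Count A, T, G and C in one pass over the sequence."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ATGC"}


def codons(sequence, frame=0):
    """Split a sequence into complete codons, starting at offset `frame`."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


# made-up sequence: TAG occurs at positions 5-7, which is out of frame 1
seq = "ATGCTAGCC"

stop_anywhere = any(stop in seq for stop in STOP_CODONS)
stop_in_frame = any(codon in STOP_CODONS for codon in codons(seq, 0))

print("Codons in frame 1:", codons(seq, 0))
print("Stop codon anywhere:", stop_anywhere)
print("Stop codon in frame 1:", stop_in_frame)
print("Composition:", composition("GCTAGCTAGCTAGCTAGCTAGCTAGCTAG"))
```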


In [10]:
# Finding ORFs in my sequences using what I learned
print("Looking for Open Reading Frames (ORFs) in my sequences")

def find_simple_orfs(sequence):
    """Find ORFs in a sequence - my own implementation"""

    orfs_found = []  # list to store ORFs
    sequence = sequence.upper()

    # I'll check each reading frame (0, 1, 2)
    for frame in range(3):
        print(f"  Checking reading frame {frame + 1}")

        # look for start codons in this frame
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]

            if codon == 'ATG':  # found start codon
                print(f"    Found start codon at position {pos + 1}")

                # now look for stop codon
                for stop_pos in range(pos + 3, len(sequence) - 2, 3):
                    stop_codon = sequence[stop_pos:stop_pos + 3]

                    if stop_codon in ['TAA', 'TAG', 'TGA']:  # found stop
                        orf_length = stop_pos + 3 - pos
                        orf_sequence = sequence[pos:stop_pos + 3]

                        # only keep ORFs that are reasonably long
                        if orf_length >= 30:  # at least 10 codons, counting the stop codon
                            orf_info = {
                                'frame': frame + 1,
                                'start': pos + 1,  # biology uses 1-based numbering
                                'stop': stop_pos + 3,
                                'length': orf_length,
                                'sequence': orf_sequence
                            }
                            orfs_found.append(orf_info)
                            print(f"    Found ORF: {orf_length} bp long")

                        break  # stop looking after first stop codon

    return orfs_found

# test my ORF finder on my sequences
for header in sequences_from_fasta:
    sequence = sequences_from_fasta[header]
    print(f"\nLooking for ORFs in {header}:")

    orfs = find_simple_orfs(sequence)
    analysis_results[header]['orfs'] = orfs

    print(f"Found {len(orfs)} ORFs")
    for i, orf in enumerate(orfs):
        print(f"  ORF {i+1}: {orf['length']} bp (frame {orf['frame']}, pos {orf['start']}-{orf['stop']})")

Looking for Open Reading Frames (ORFs) in my sequences

Looking for ORFs in gene1_insulin_human:
  Checking reading frame 1
    Found start codon at position 1
    Found start codon at position 58
  Checking reading frame 2
  Checking reading frame 3
Found 0 ORFs

Looking for ORFs in gene2_hemoglobin_human:
  Checking reading frame 1
    Found start codon at position 1
  Checking reading frame 2
    Found start codon at position 74
    Found start codon at position 107
  Checking reading frame 3
Found 0 ORFs

Looking for ORFs in gene3_short_test:
  Checking reading frame 1
    Found start codon at position 1
  Checking reading frame 2
  Checking reading frame 3
Found 0 ORFs

Looking for ORFs in gene4_no_start_codon:
  Checking reading frame 1
  Checking reading frame 2
  Checking reading frame 3
Found 0 ORFs
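
Every sequence above reports 0 ORFs. For gene3_short_test the only ORF is 9 bp, below the 30 bp cutoff; for the others, either no in-frame stop codon follows a start codon or the resulting ORF is too short. To check that the search itself works, here is a sketch of the same algorithm without the progress prints, run on a made-up sequence built to contain exactly one 30 bp ORF. The name `find_orfs` and the `min_length` parameter are my own additions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_length=30):
    """Return ORFs (ATG to first in-frame stop) of at least min_length bp."""
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        for pos in range(frame, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] != "ATG":
                continue
            # scan forward codon by codon for the first in-frame stop
            for stop_pos in range(pos + 3, len(sequence) - 2, 3):
                if sequence[stop_pos:stop_pos + 3] in STOP_CODONS:
                    length = stop_pos + 3 - pos
                    if length >= min_length:
                        orfs.append({
                            "frame": frame + 1,
                            "start": pos + 1,      # 1-based start
                            "stop": stop_pos + 3,  # 1-based, inclusive end
                            "length": length,
                        })
                    break  # stop at the first stop codon
    return orfs


# made-up test sequence: 2 bp offset, ATG, 8 GCT codons, TAA, 2 bp tail
toy = "CC" + "ATG" + "GCT" * 8 + "TAA" + "GG"

print(find_orfs(toy))
print(find_orfs("ATGAAATAG"))                # 9 bp ORF, filtered out
print(find_orfs("ATGAAATAG", min_length=9))  # kept with a lower cutoff
```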


In [11]:
# Creating a summary of all my analysis results
print("Creating a summary report of my sequence analysis")

print("\n" + "="*50)
print("SEQUENCE ANALYSIS SUMMARY REPORT")
print("="*50)

# collect some statistics
total_sequences = len(analysis_results)
total_length = 0
total_orfs = 0
gc_contents = []

print(f"Total sequences analyzed: {total_sequences}")

for header in analysis_results:
    result = analysis_results[header]
    total_length += result['length']
    total_orfs += len(result['orfs'])
    gc_contents.append(result['gc_content'])

print(f"Total sequence length: {total_length} bp")
print(f"Average sequence length: {total_length / total_sequences:.1f} bp")
print(f"Total ORFs found: {total_orfs}")

# calculate average GC content
average_gc = sum(gc_contents) / len(gc_contents)
print(f"Average GC content: {average_gc:.1f}%")

# find sequences with interesting properties
print("\nSequences with interesting properties:")

for header in analysis_results:
    result = analysis_results[header]

    interesting = []

    if result['gc_content'] > 60:
        interesting.append("high GC")
    if result['gc_content'] < 30:
        interesting.append("low GC")
    if len(result['orfs']) > 1:
        interesting.append("multiple ORFs")
    if result['length'] > 100:
        interesting.append("long sequence")
    if not result['has_start_codon']:
        interesting.append("no start codon")

    if interesting:
        print(f"  {header}: {', '.join(interesting)}")

Creating a summary report of my sequence analysis

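==================================================
SEQUENCE ANALYSIS SUMMARY REPORT
==================================================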
Total sequences analyzed: 4
Total sequence length: 278 bp
Average sequence length: 69.5 bp
Total ORFs found: 0
Average GC content: 49.7%

Sequences with interesting properties:
  gene1_insulin_human: high GC, long sequence
  gene2_hemoglobin_human: high GC, long sequence
  gene3_short_test: low GC
  gene4_no_start_codon: no start codon
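
For the summary statistics, the standard-library `statistics` module offers `mean` and `median`, which can replace the manual sum-and-divide. A minimal sketch follows, using the GC percentages printed above (rounded to one decimal) as stand-in data; the variable names are hypothetical.

```python
from statistics import mean, median

# GC percentages copied from the per-sequence output above
gc_by_header = {
    "gene1_insulin_human": 61.7,
    "gene2_hemoglobin_human": 63.3,
    "gene3_short_test": 22.2,
    "gene4_no_start_codon": 51.7,
}

average_gc = mean(gc_by_header.values())
median_gc = median(gc_by_header.values())

# min()/max() with key=dict.get return the header with the extreme value
lowest_gc = min(gc_by_header, key=gc_by_header.get)
highest_gc = max(gc_by_header, key=gc_by_header.get)

print(f"Average GC content: {average_gc:.1f}%")
print(f"Median GC content: {median_gc:.1f}%")
print(f"Lowest GC: {lowest_gc}, highest GC: {highest_gc}")
```

With only four sequences and one very short outlier, the median is less affected by gene3_short_test than the mean is.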


In [12]:
# Creating a detailed report for each sequence
print("\nDETAILED SEQUENCE REPORTS:")
print("-" * 30)

for header in analysis_results:
    result = analysis_results[header]

    print(f"\nSequence: {header}")
    print(f"Length: {result['length']} bp")

    # composition
    comp = result['composition']
    print(f"Composition: A:{comp['A']} T:{comp['T']} G:{comp['G']} C:{comp['C']}")

    # percentages
    length = result['length']
    if length > 0:
        a_pct = comp['A'] / length * 100
        t_pct = comp['T'] / length * 100
        g_pct = comp['G'] / length * 100
        c_pct = comp['C'] / length * 100
        print(f"Percentages: A:{a_pct:.1f}% T:{t_pct:.1f}% G:{g_pct:.1f}% C:{c_pct:.1f}%")

    print(f"GC content: {result['gc_content']:.1f}%")
    print(f"Has start codon: {result['has_start_codon']}")
    print(f"Has stop codon: {result['has_stop_codon']}")
    print(f"ORFs found: {len(result['orfs'])}")

    # show ORF details
    if result['orfs']:
        print("ORF details:")
        for i, orf in enumerate(result['orfs']):
            amino_acids = orf['length'] // 3 - 1  # the stop codon does not code for an amino acid
            print(f"  ORF {i+1}: {orf['length']} bp ({amino_acids} amino acids), frame {orf['frame']}")


DETAILED SEQUENCE REPORTS:
------------------------------

Sequence: gene1_insulin_human
Length: 120 bp
Composition: A:16 T:30 G:37 C:37
Percentages: A:13.3% T:25.0% G:30.8% C:30.8%
GC content: 61.7%
Has start codon: True
Has stop codon: True
ORFs found: 0

Sequence: gene2_hemoglobin_human
Length: 120 bp
Composition: A:24 T:20 G:47 C:29
Percentages: A:20.0% T:16.7% G:39.2% C:24.2%
GC content: 63.3%
Has start codon: True
Has stop codon: True
ORFs found: 0

Sequence: gene3_short_test
Length: 9 bp
Composition: A:5 T:2 G:2 C:0
Percentages: A:55.6% T:22.2% G:22.2% C:0.0%
GC content: 22.2%
Has start codon: True
Has stop codon: True
ORFs found: 0

Sequence: gene4_no_start_codon
Length: 29 bp
Composition: A:7 T:7 G:8 C:7
Percentages: A:24.1% T:24.1% G:27.6% C:24.1%
GC content: 51.7%
Has start codon: False
Has stop codon: True
ORFs found: 0


In [13]:
# What I learned today - Day 5 summary
print("\n" + "="*50)
print("DAY 5 LEARNING SUMMARY")
print("="*50)

print("\nPython skills I practiced today:")
python_skills = [
    "Working with lists of biological sequences",
    "Using dictionaries to organize gene information",
    "Parsing FASTA-like data format",
    "Building nested data structures",
    "Using loops to process multiple sequences",
    "Creating functions for repetitive tasks",
    "Organizing analysis results"
]

for skill in python_skills:
    print(f"- {skill}")

print("\nBioinformatics concepts I learned:")
bio_concepts = [
    "Handling multiple gene sequences at once",
    "FASTA format structure and parsing",
    "ORF finding across multiple reading frames",
    "Batch analysis of sequence data",
    "Statistical summaries of sequence properties",
    "Data organization for biological research"
]

for concept in bio_concepts:
    print(f"- {concept}")

print("\nChallenges I worked through:")
challenges = [
    "Parsing multi-line FASTA format correctly",
    "Keeping track of multiple pieces of information per sequence",
    "Finding ORFs in different reading frames",
    "Calculating statistics across multiple sequences",
    "Organizing results in a clear way"
]

for challenge in challenges:
    print(f"- {challenge}")

print(f"\nProgress: 5/56 days completed ({5/56*100:.1f}%)")
print("Tomorrow I want to learn about Pandas for handling bigger datasets!")

# Some things I want to improve
print("\nThings I want to get better at:")
improvements = [
    "Making my code more efficient",
    "Better error handling when sequences have problems",
    "More sophisticated statistical analysis",
    "Working with even larger datasets"
]

for improvement in improvements:
    print(f"- {improvement}")


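==================================================
DAY 5 LEARNING SUMMARY
==================================================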

Python skills I practiced today:
- Working with lists of biological sequences
- Using dictionaries to organize gene information
- Parsing FASTA-like data format
- Building nested data structures
- Using loops to process multiple sequences
- Creating functions for repetitive tasks
- Organizing analysis results

Bioinformatics concepts I learned:
- Handling multiple gene sequences at once
- FASTA format structure and parsing
- ORF finding across multiple reading frames
- Batch analysis of sequence data
- Statistical summaries of sequence properties
- Data organization for biological research

Challenges I worked through:
- Parsing multi-line FASTA format correctly
- Keeping track of multiple pieces of information per sequence
- Finding ORFs in different reading frames
- Calculating statistics across multiple sequences
- Organizing results in a clear way

Progress: 5/56 days completed (8.9%)
Tomorrow I want to learn about Pandas for handling bigger datasets!

Thin