## Unique Mutation Simulation and ORF Detection: Full Workflow (File Parsing, Cleaning, Mutation, 3-Frame ORF Analysis, and Execution)

Practice Exercise 1: Unique Mutation Simulation and ORF Detection

Task:
Write a Python script that does the following:

1. Parses a multi-sequence FASTA file into a dictionary (header: sequence).
2. Cleans each sequence so that it contains only 'A', 'T', 'G', and 'C'.
3. Applies 5 unique point mutations to each cleaned sequence (randomly chosen, distinct positions; each new base differs from the original). The script below reads the number of mutations from user input instead of fixing it at 5.
4. Finds all ORFs in the 3 reading frames of each mutated sequence (like the ORF detection we practiced).
5. Prints the header, the mutated sequence, and the details of each ORF found: frame number, start and end positions, and ORF sequence.
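
To make steps 1 and 2 concrete, here is a minimal sketch that parses a FASTA string held in memory and then cleans it. The sample records and the helper names `parse_fasta_text` and `keep_acgt` are made up for illustration; they are not the contents of the exercise file or part of the solution script.

```python
import io

# Hypothetical sample data, NOT the contents of the exercise file.
SAMPLE_FASTA = """>seq_one
ATGNNCTA
GCTAA
>seq_two
ttgXCGC
"""

def parse_fasta_text(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in io.StringIO(text):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join multi-line sequences
    return records

def keep_acgt(seq):
    """Keep only uppercase A, T, G, C; every other character is dropped."""
    return "".join(base for base in seq if base in "ATGC")

records = parse_fasta_text(SAMPLE_FASTA)
cleaned = {header: keep_acgt(seq) for header, seq in records.items()}
print(records)
print(cleaned)
```

Note that a filter on `'ATGC'` also discards lowercase bases, so `seq_two` shrinks to `CGC`. If lowercase input should be retained, one option is to call `seq.upper()` before filtering.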

import random

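def fasta_parsing(path):
    """Parse a multi-sequence FASTA file into a {header: sequence} dictionary."""
    fasta_dict = {}
    # Open the file named by the argument rather than a hard-coded filename.
    with open(path, "r") as file:
        header = None
        sequence_lines = []
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                # A new record begins: store the previous one first.
                if header is not None:
                    fasta_dict[header] = "".join(sequence_lines)
                header = line[1:]
                sequence_lines = []
            else:
                sequence_lines.append(line)
        # Store the final record, which has no following header line.
        if header is not None:
            fasta_dict[header] = "".join(sequence_lines)
    return fasta_dict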

def clean(seq):
    """Return seq with every character other than uppercase A, T, G, C removed."""
    return "".join(base for base in seq if base in 'ATGC')

def unique_mutation(seq, n):
    """Apply n point mutations at n distinct positions of seq."""
    # Without this check the loop below never ends when n > len(seq).
    if n > len(seq):
        raise ValueError("n cannot exceed the sequence length")
    seq_list = list(seq)
    mutated_positions = set()
    bases = ('A', 'T', 'C', 'G')
    while len(mutated_positions) < n:
        position = random.randint(0, len(seq) - 1)
        if position not in mutated_positions:
            original_base = seq_list[position]
            # Choose a replacement that differs from the current base.
            possible_outcomes = [base for base in bases if base != original_base]
            seq_list[position] = random.choice(possible_outcomes)
            mutated_positions.add(position)
    return "".join(seq_list)

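def orf_frames(seq):
    """Find ORFs (ATG through the first in-frame stop codon) in the 3 forward frames.

    Returns a list of (frame, start, end, orf) tuples, where frame is 1-3 and
    start and end are 1-based, inclusive positions in seq.
    """
    start_codon = 'ATG'
    stop_codons = {'TAA', 'TGA', 'TAG'}
    orfs = []
    for frame in range(3):
        i = frame
        while i < len(seq) - 2:
            if seq[i:i+3] == start_codon:
                # Scan downstream codons in the same frame for the first stop.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j+3] in stop_codons:
                        orfs.append((frame + 1, i + 1, j + 3, seq[i:j+3]))
                        i = j + 3  # resume after the stop codon
                        break
                else:
                    # No stop codon found: skip this start codon.
                    i += 3
            else:
                i += 3
    return orfs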

# Step 1: parse the FASTA file.
sequences = fasta_parsing("practice3.txt")

# Step 2: clean every sequence.
clean_sequences = {}
for header, seq in sequences.items():
    clean_sequences[header] = clean(seq)

n = int(input("Number of unique mutations to introduce per sequence:"))

# Steps 3-5: mutate each sequence, find its ORFs, and report them.
for header, seq in clean_sequences.items():
    print(f"Header: {header}\nOriginal Sequence: {seq}\n")
    mutated_seq = unique_mutation(seq, n)
    print(f"Mutated Sequence: {mutated_seq}\n")
    mutated_orfs = orf_frames(mutated_seq)
    if mutated_orfs:
        for f_index, start, end, orf_sequence in mutated_orfs:
            # print() joins its two arguments with a space, hence the
            # leading space before "orf sequence" in the output.
            print(f"frame: {f_index}\nstart index: {start}\nend index: {end}\n",
                  f"orf sequence: {orf_sequence}\n")
    else:
        print("No ORFs found\n")
    print("\n" + "_" * 40 + "\n")
   

    
    

Number of unique mutations to introduce per sequence: 1


Header: Human_sequence
Original Sequence: ATGCTAGCTAGCTAACGATGCTAGCTAGCTGAC

Mutated Sequence: ATGCTATCTAGCTAACGATGCTAGCTAGCTGAC

frame: 1
start index: 1
end index: 15
 orf sequence: ATGCTATCTAGCTAA

frame: 3
start index: 18
end index: 32
 orf sequence: ATGCTAGCTAGCTGA


________________________________________

Header: Mouse_sequence
Original Sequence: TTGCGCGGATCGTAGCTAGCTAGCTAGCTAATGCTA

Mutated Sequence: TTGCGCGGATGGTAGCTAGCTAGCTAGCTAATGCTA

frame: 3
start index: 9
end index: 23
 orf sequence: ATGGTAGCTAGCTAG


________________________________________

Header: Plant_sequence
Original Sequence: GCTAGCTAGCATCGATCGTATAGCTAGCTAGC

Mutated Sequence: GCTAGCTAGCATCGATCGTATAGCTAGCTATC

No ORFs found


________________________________________
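
The reported coordinates can be sanity-checked independently of the search code. The sketch below assumes the convention visible in the output above (frame numbered 1 to 3, start and end as 1-based inclusive positions); `check_orf` is a hypothetical helper, not part of the script, and the sequences are copied from the run shown above.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def check_orf(seq, frame, start, end):
    """Return True if a reported ORF is consistent with seq.

    frame is 1-3; start and end are 1-based, inclusive positions.
    """
    orf = seq[start - 1:end]
    # Codons strictly between the start codon and the final codon.
    interior = (orf[k:k + 3] for k in range(3, len(orf) - 3, 3))
    return (
        (start - 1) % 3 == frame - 1      # start lies in the stated frame
        and len(orf) % 3 == 0             # whole number of codons
        and orf.startswith("ATG")         # begins with the start codon
        and orf[-3:] in STOP_CODONS       # ends with a stop codon
        and not any(codon in STOP_CODONS for codon in interior)
    )

# Mutated sequences and ORF coordinates copied from the run above.
human = "ATGCTATCTAGCTAACGATGCTAGCTAGCTGAC"
mouse = "TTGCGCGGATGGTAGCTAGCTAGCTAGCTAATGCTA"

print(check_orf(human, 1, 1, 15))   # True
print(check_orf(human, 3, 18, 32))  # True
print(check_orf(mouse, 3, 9, 23))   # True
```

All three reported ORFs pass: each starts with ATG in the stated frame, spans five codons, and ends at the first in-frame stop codon.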


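
Because the mutations are random, each run of the script gives a different mutated sequence. As a possible variation (not the approach used above), the sketch below draws the distinct positions in a single call to `random.sample` and accepts an optional seed so that a run can be repeated. The function name `mutate_unique` and the `seed` parameter are assumptions made for this example, and the input is assumed to be an already cleaned sequence.

```python
import random

def mutate_unique(seq, n, seed=None):
    """Return seq with n point mutations at n distinct positions."""
    if n > len(seq):
        raise ValueError("n cannot exceed the sequence length")
    rng = random.Random(seed)  # private generator; a fixed seed repeats the run
    positions = rng.sample(range(len(seq)), n)  # n distinct indices, no retry loop
    bases = list(seq)
    for pos in positions:
        # Pick any base other than the one currently at this position.
        bases[pos] = rng.choice([b for b in "ATCG" if b != bases[pos]])
    return "".join(bases)

original = "ATGCTAGCTAGCTAACGATGCTAGCTAGCTGAC"
mutated = mutate_unique(original, 5, seed=42)
changed = sum(a != b for a, b in zip(original, mutated))
print(mutated)
print(changed)  # 5: every chosen position now holds a different base
```

Sampling without replacement guarantees distinct positions up front, so there is no loop that could run forever when `n` exceeds the sequence length; that case raises `ValueError` instead.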