# V1 DNA Dataset Creation

## Overview

The following code creates a dataset of DNA sequences and mutated versions of each.
Three mutation types are used: insertions, deletions and substitutions. Each mutated sequence contains only one mutation type; the types are not combined in this dataset.

The code relies on functions defined in the file DNA_Sequence_Generator.py.
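To make the three mutation types concrete, here is a minimal sketch of each one applied to a plain string. It is not the code in DNA_Sequence_Generator.py; the function names and the 1-based position convention are assumptions chosen for illustration.

```python
# Illustrative only: hypothetical helpers, not the API of DNA_Sequence_Generator.py.
# Positions are assumed to be 1-based.

def insert_bases(seq, pos, bases):
    """Insert `bases` between position `pos` and position `pos + 1`."""
    return seq[:pos] + bases + seq[pos:]

def delete_bases(seq, start, end):
    """Delete the bases from position `start` to position `end`, inclusive."""
    return seq[:start - 1] + seq[end:]

def substitute_base(seq, pos, new_base):
    """Replace the single base at position `pos` with `new_base`."""
    return seq[:pos - 1] + new_base + seq[pos:]

seq = "ACGTACGT"
print(insert_bases(seq, 2, "TT"))     # ACTTGTACGT
print(delete_bases(seq, 3, 4))        # ACACGT
print(substitute_base(seq, 1, "G"))   # GCGTACGT
```

Insertions lengthen the sequence, deletions shorten it, and substitutions leave its length unchanged.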

The dataset is structured as follows:
- Dataset Directory
    - Excel Spreadsheet with an overview of all Collections and their generation parameters.
    - Collection 1
        - Original Sequence FASTA
        - Sequence with Insertions FASTA
        - Sequence with Deletions FASTA
        - Sequence with Substitutions FASTA
        - Excel Spreadsheet listing the positions and exact nature of each mutation for all the files in the collection.
    - Collection 2
        - Original Sequence FASTA
        - Sequence with Insertions FASTA
        - Sequence with Deletions FASTA
        - Sequence with Substitutions FASTA
        - Excel Spreadsheet listing the positions and exact nature of each mutation for all the files in the collection.
    - Collection n...

The original sequences differ in length between collections: lengths run from 1,000 to 10,000 bases in steps of 1,000.
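As a rough guide to the layout, the sketch below lists the paths one collection would be expected to contain. The directory and file name patterns are taken from the generation loop in this notebook; the helper `expected_collection_files` is hypothetical, and the assumption that the mutated FASTA files carry the `_mut_i`, `_mut_d` and `_mut_s` names is inferred from the column names of the per-collection spreadsheet rather than confirmed from DNA_Sequence_Generator.py.

```python
import os

def expected_collection_files(parent_directory, length):
    """Return the paths one collection is expected to contain (hypothetical helper)."""
    sequence_id = f"seq_length_{length}"
    subdirectory = os.path.join(parent_directory, f"length_{length}")
    names = [
        f"{sequence_id}_original.fasta",   # original sequence
        f"{sequence_id}_mut_i.fasta",      # insertions (assumed file name)
        f"{sequence_id}_mut_d.fasta",      # deletions (assumed file name)
        f"{sequence_id}_mut_s.fasta",      # substitutions (assumed file name)
        f"{sequence_id}_mutations.xlsx",   # record of all mutations
    ]
    return [os.path.join(subdirectory, name) for name in names]

for path in expected_collection_files("DNA_Sequence_Dataset", 1000):
    print(path)
```

A list like this can be compared against `os.path.exists` after generation to confirm that each collection is complete.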

## Imports

General Imports

In [61]:
import os
import pandas as pd
from datetime import datetime

Sequence Generator Imports

In [62]:
from DNA_Sequence_Generator import DNA_Sequence_Generator as DNA
from DNA_Sequence_Generator import DNA_Sequence_Mutations as Mutate 

## Dataset Generation

In [63]:
# Define the sequence lengths and the parent directory for the dataset.
sequence_lengths = list(range(1000, 11000, 1000))
parent_directory = "DNA_Sequence_Dataset"

In [64]:
# Creates the parent directory if it doesn't exist.
os.makedirs(parent_directory, exist_ok=True)


In [65]:
# Specify parameters for the DNA sequence generation and subsequent mutation.
gc_content = 0.5
num_mutations = 3
max_bases_per_mutation = 5
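For readers unfamiliar with the `gc_content` parameter, the sketch below shows one way a random sequence with a fixed GC fraction could be produced. How `DNA.generate_sequence` actually interprets `gc_content` is not shown in this notebook (it may, for example, sample each base independently), so `random_sequence` and `gc_fraction` are hypothetical helpers and the exact-count approach is an assumption.

```python
import random

def random_sequence(length, gc_content, seed=None):
    """Build a random sequence whose G/C count is round(length * gc_content)."""
    rng = random.Random(seed)
    n_gc = round(length * gc_content)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(length - n_gc)]
    rng.shuffle(bases)  # mix G/C and A/T positions
    return "".join(bases)

def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

seq = random_sequence(1000, 0.5, seed=0)
print(len(seq), gc_fraction(seq))
```

With an exact-count approach the GC fraction matches the target up to rounding; a per-base sampling approach would only match it on average.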

In [66]:
# Initializes a list to store all the data for the sequences. 
excel_data = []

In [67]:
# Loop through the sequence lengths, creating a collection for each.
for length in sequence_lengths:
    # Creates a subdirectory for each sequence length.
    subdirectory = os.path.join(parent_directory, f"length_{length}")
    os.makedirs(subdirectory, exist_ok=True)

    # Generates a sequence ID and retrieves the current date.
    sequence_id = f"seq_length_{length}"
    date_of_generation = datetime.today().strftime('%Y-%m-%d')

    # Generates the template (original) sequence.
    original_sequence = DNA.generate_sequence(length, gc_content)

    # Writes the original sequence to a FASTA file.
    original_file_name = f"{sequence_id}_original.fasta"
    original_path = os.path.join(subdirectory, original_file_name)
    DNA.sequence_to_fasta(original_sequence, sequence_id, "original", date_of_generation, file_name=original_path)

    # Applies mutations and writes the mutated sequences to FASTA files.
    insertions = Mutate.insert(original_path, num_mutations, max_bases_per_mutation)
    deletions = Mutate.delete(original_path, num_mutations, max_bases_per_mutation)
    substitutions = Mutate.substitute(original_path, num_mutations)

    # Adds the data for this sequence to the master Excel data list.
    excel_data.append([sequence_id, date_of_generation, gc_content, num_mutations, max_bases_per_mutation, insertions, deletions, substitutions])

    # Creates a DataFrame for this collection with a record of all the mutations.
    max_mutations = max(len(insertions), len(deletions), len(substitutions))
    collection_data = {
        original_file_name: [None] * max_mutations,
        f"{sequence_id}_mut_i.fasta": insertions + [None] * (max_mutations - len(insertions)),
        f"{sequence_id}_mut_d.fasta": deletions + [None] * (max_mutations - len(deletions)),
        f"{sequence_id}_mut_s.fasta": substitutions + [None] * (max_mutations - len(substitutions)),
    }
    collection_df = pd.DataFrame(collection_data)
    collection_df.to_excel(os.path.join(subdirectory, f"{sequence_id}_mutations.xlsx"), index=False)

# Check an example output (the last collection generated, i.e. the longest sequence).
collection_df.head()

| | seq_length_10000_original.fasta | seq_length_10000_mut_i.fasta | seq_length_10000_mut_d.fasta | seq_length_10000_mut_s.fasta |
|---|---|---|---|---|
| 0 | | 9103_9104insA | 4420_4422del | 9636G>A |
| 1 | | 2084_2085insAAC | 5742_5746del | 3457G>T |
| 2 | | 266_267insATCC | 5093_5094del | 997A>C |
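The mutation labels in this output follow three visible patterns: `9103_9104insA`, `4420_4422del` and `9636G>A`. The sketch below parses such labels into dictionaries. The regular expressions are derived only from the examples shown in this notebook, and `parse_mutation` is a hypothetical helper, not part of DNA_Sequence_Generator.py. Reading an insertion as "between two adjacent positions" and a deletion as an inclusive range is an interpretation of the labels; the deletion lengths in the outputs (at most 5 bases) appear consistent with the inclusive reading.

```python
import re

# Patterns derived from the labels shown in the notebook output (assumed format).
PATTERNS = {
    "insertion": re.compile(r"^(\d+)_(\d+)ins([ACGT]+)$"),
    "deletion": re.compile(r"^(\d+)_(\d+)del$"),
    "substitution": re.compile(r"^(\d+)([ACGT])>([ACGT])$"),
}

def parse_mutation(label):
    """Parse one mutation label into a dictionary describing it."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(label)
        if match is None:
            continue
        if kind == "insertion":
            start, end, bases = match.groups()
            return {"type": kind, "start": int(start), "end": int(end), "bases": bases}
        if kind == "deletion":
            start, end = match.groups()
            return {"type": kind, "start": int(start), "end": int(end)}
        position, ref, alt = match.groups()
        return {"type": kind, "position": int(position), "ref": ref, "alt": alt}
    raise ValueError(f"Unrecognised mutation label: {label!r}")

for label in ["9103_9104insA", "4420_4422del", "9636G>A"]:
    print(parse_mutation(label))
```

Parsed records like these make it straightforward to check, for example, that no insertion or deletion exceeds `max_bases_per_mutation`.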


In [68]:
# Creates a DataFrame from the master Excel data list for all the different collections.
df = pd.DataFrame(excel_data, columns=["Sequence ID", "Date of Generation", "GC Content", "Number of Mutations", "Max Bases per Mutation", "Insertions", "Deletions", "Substitutions"])
df.head()

| | Sequence ID | Date of Generation | GC Content | Number of Mutations | Max Bases per Mutation | Insertions | Deletions | Substitutions |
|---|---|---|---|---|---|---|---|---|
| 0 | seq_length_1000 | 2024-04-03 | 0.5 | 3 | 5 | [313_314insGCGG, 533_534insC, 85_86insTA] | [107_111del, 686_686del, 677_681del] | [853A>G, 720G>A, 313T>G] |
| 1 | seq_length_2000 | 2024-04-03 | 0.5 | 3 | 5 | [1415_1416insC, 650_651insATCC, 666_667insGCGG] | [189_189del, 1346_1349del, 697_701del] | [1115A>T, 1807G>T, 1459G>T] |
| 2 | seq_length_3000 | 2024-04-03 | 0.5 | 3 | 5 | [1485_1486insACGCC, 1896_1897insAT, 1588_1589i... | [540_540del, 2136_2140del, 1055_1059del] | [2037G>A, 1491T>G, 2506C>A] |
| 3 | seq_length_4000 | 2024-04-03 | 0.5 | 3 | 5 | [1685_1686insGAAGT, 1467_1468insTA, 2980_2981i... | [2274_2275del, 1327_1327del, 1377_1377del] | [2997T>G, 3682C>G, 2869C>T] |
| 4 | seq_length_5000 | 2024-04-03 | 0.5 | 3 | 5 | [1978_1979insTGCCC, 2251_2252insG, 3537_3538insG] | [288_289del, 2028_2029del, 2136_2139del] | [440G>T, 2642C>A, 376G>T] |


In [69]:
# Saves the DataFrame to an Excel file in the main directory.
excel_file_path = os.path.join(parent_directory, "sequences_info.xlsx")
df.to_excel(excel_file_path, index=False)