# V1 Corrupted SEVA Plasmid Dataset Creation

## Overview

The following code creates a dataset of canonical SEVA plasmids whose T0, T1, or oriT regions have been corrupted.

The regions are corrupted with insertions, deletions, and substitutions. Depending on the corruption parameters, each region can receive any number of mutations.
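As a rough illustration of the three mutation types, the sketch below applies a single insertion, deletion, or substitution inside a region of a plain sequence string. The helper `mutate_region` and its arguments are hypothetical and are not the implementation in the webscraper module. It also assumes that `min_bases` and `max_bases` bound the length of each individual mutation, and that the region is at least `min_bases` long.

```python
import random

BASES = "ACGT"

def mutate_region(sequence, start, end, kind, min_bases=1, max_bases=5, rng=None):
    """Apply one mutation of the given kind inside sequence[start:end].

    Hypothetical helper for illustration only. Returns the mutated
    sequence and a record describing the change.
    """
    if rng is None:
        rng = random.Random()
    # Never mutate more bases than the region contains.
    n = rng.randint(min_bases, min(max_bases, end - start))

    if kind == "insertion":
        pos = rng.randint(start, end)
        new = "".join(rng.choice(BASES) for _ in range(n))
        mutated = sequence[:pos] + new + sequence[pos:]
        record = {"kind": kind, "position": pos, "inserted": new}
    elif kind == "deletion":
        pos = rng.randint(start, end - n)
        record = {"kind": kind, "position": pos, "deleted": sequence[pos:pos + n]}
        mutated = sequence[:pos] + sequence[pos + n:]
    elif kind == "substitution":
        pos = rng.randint(start, end - n)
        old = sequence[pos:pos + n]
        # Choose a different base at every position so the change is never silent.
        new = "".join(rng.choice([b for b in BASES if b != o]) for o in old)
        mutated = sequence[:pos] + new + sequence[pos + n:]
        record = {"kind": kind, "position": pos, "old": old, "new": new}
    else:
        raise ValueError(f"unknown mutation kind: {kind}")
    return mutated, record

# Example: one mutation of each kind in positions 5..15 of a toy sequence.
rng = random.Random(0)
toy_sequence = "ATGCATGCATGCATGCATGC"
for kind in ("insertion", "deletion", "substitution"):
    mutated, record = mutate_region(toy_sequence, 5, 15, kind, rng=rng)
    print(record, "length change:", len(mutated) - len(toy_sequence))
```

Passing a seeded `random.Random` makes the sketch reproducible, which is convenient when the recorded mutation positions need to be checked against the mutated sequence.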

The code relies on functions defined in the file SEVA_Plasmid_Webscraper.py.

Each original plasmid gets its own folder containing the original GenBank file and its corrupted versions, one for each of the seven non-empty combinations of corrupted regions.
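The seven combinations are every on/off assignment over the three regions except the one where nothing is corrupted. A minimal sketch of enumerating them, with `region_combinations` as a hypothetical helper name:

```python
from itertools import product

REGIONS = ("oriT", "T0", "T1")

def region_combinations(regions=REGIONS):
    """Return every on/off assignment over the regions, skipping all-off."""
    combos = []
    for flags in product([False, True], repeat=len(regions)):
        if any(flags):  # drop the case where no region is corrupted
            combos.append(dict(zip(regions, flags)))
    return combos

combos = region_combinations()
print(len(combos))  # 2**3 - 1 = 7
for combo in combos:
    print(" + ".join(name for name, enabled in combo.items() if enabled))
```

Generating the combinations this way scales to additional regions without editing a hand-written list, although the order differs from a manually ordered one.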

The dataset is structured as follows:
- Dataset Directory
    - Excel spreadsheet with an overview of all collections
    - Collection 1
        - Original Sequence GenBank
        - Sequence with T0 Corrupted GenBank
        - Sequence with T1 Corrupted GenBank
        - Sequence with oriT Corrupted GenBank
        - Sequence with T0 + T1 Corrupted GenBank
        - Sequence with T0 + oriT Corrupted GenBank
        - Sequence with T1 + oriT Corrupted GenBank
        - Sequence with T0 + T1 + oriT Corrupted GenBank
        - Excel spreadsheet listing the position and exact nature of each mutation for every file in the collection, along with the generation parameters and time.
    - Collection 2
        - Original Sequence GenBank
        - Sequence with T0 Corrupted GenBank
        - Sequence with T1 Corrupted GenBank
        - Sequence with oriT Corrupted GenBank
        - Sequence with T0 + T1 Corrupted GenBank
        - Sequence with T0 + oriT Corrupted GenBank
        - Sequence with T1 + oriT Corrupted GenBank
        - Sequence with T0 + T1 + oriT Corrupted GenBank
        - Excel spreadsheet listing the position and exact nature of each mutation for every file in the collection, along with the generation parameters and time.
    - Collection n...

Each collection is named after the GenBank accession number of the original canonical SEVA plasmid it contains.
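Once the dataset has been generated, the layout described above can be sanity-checked with a short script. The sketch below assumes each collection holds eight `.gb` files (the original plus seven corrupted variants), that the original is named `<accession>.gb`, and that the spreadsheet is named `<accession>_mutations.xlsx`, matching the file names used in the generation code. `check_collection` is a hypothetical helper, not part of the webscraper module.

```python
from pathlib import Path

EXPECTED_GENBANK_FILES = 8  # 1 original + 7 corrupted variants

def check_collection(collection_dir):
    """Return a list of layout problems found in one collection folder."""
    collection_dir = Path(collection_dir)
    accession = collection_dir.name
    problems = []

    if not (collection_dir / f"{accession}.gb").is_file():
        problems.append("missing original GenBank file")

    n_genbank = len(list(collection_dir.glob("*.gb")))
    if n_genbank != EXPECTED_GENBANK_FILES:
        problems.append(
            f"expected {EXPECTED_GENBANK_FILES} GenBank files, found {n_genbank}"
        )

    if not (collection_dir / f"{accession}_mutations.xlsx").is_file():
        problems.append("missing mutation spreadsheet")
    return problems

# Report problems for every collection, if the dataset directory exists.
dataset = Path("Corrupted_SEVA_Plasmid_Dataset_V1")
if dataset.is_dir():
    for collection in sorted(p for p in dataset.iterdir() if p.is_dir()):
        for problem in check_collection(collection):
            print(f"{collection.name}: {problem}")
```

An empty list means the folder matches the expected structure. A count below eight may indicate that two variants were written to the same file name.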

## Imports

General Imports

In [1]:
import os
import pandas as pd
from datetime import datetime

Sequence Generator Imports

In [2]:
from SEVA_Plasmid_Webscaper import SEVA_Plasmid_Webscraper as SEVA
from SEVA_Plasmid_Webscaper import Sequence_Mutations as Mutate 

## Dataset Generation

In [3]:
# Creates the parent directory.
parent_directory = "Corrupted_SEVA_Plasmid_Dataset_V1"
os.makedirs(parent_directory, exist_ok=True)

In [4]:
# URL of the SEVA canonical plasmids list page.
url = "https://seva-plasmids.com/v2/table-all.php"

In [5]:
# Extract the SEVA table.
seva_df = SEVA.extract_seva_table(url, num_rows=50, include_gadget=False)
seva_df.head()

| | name | resistance | ori | cargo | gadget | genbank_number | genbank_link | developer |
|---|---|---|---|---|---|---|---|---|
| 0 | pSEVA111 | Ap | R6K | MCS-default | | JX560321 | http://www.ncbi.nlm.nih.gov/nuccore/JX560321 | |
| 1 | pSEVA211 | Km | R6K | MCS-default | | JX560326 | http://www.ncbi.nlm.nih.gov/nuccore/JX560326 | |
| 2 | pSEVA311 | Cm | R6K | MCS-default | | JX560331 | http://www.ncbi.nlm.nih.gov/nuccore/JX560331 | |
| 3 | pSEVA411 | Sm/Sp | R6K | MCS-default | | JX560336 | http://www.ncbi.nlm.nih.gov/nuccore/JX560336 | |
| 4 | pSEVA511 | Tc | R6K | MCS-default | | JX560341 | http://www.ncbi.nlm.nih.gov/nuccore/JX560341 | |


In [6]:
# Save the SEVA table to an Excel file in the parent directory.
seva_df.to_excel(os.path.join(parent_directory, "seva_plasmid_list.xlsx"), index=False)

In [None]:
# Iterate over each plasmid in the SEVA table to create collections of mutated plasmids.
for _, row in seva_df.iterrows():
    genbank_number = row['genbank_number']
    folder_name = os.path.join(parent_directory, genbank_number)
    os.makedirs(folder_name, exist_ok=True)

    # Fetch the original GenBank file
    genbank_content = SEVA.get_genbank_file(genbank_number)
    
    # Save the original GenBank file
    original_filename = os.path.join(folder_name, f"{genbank_number}.gb")
    SEVA.write_to_genbank_file(genbank_content, original_filename)

    # Column layout of the per-collection mutation log
    mutation_columns = ['locus', 'oriT_mutations', 'T0_mutations', 'T1_mutations', 'oriT_enabled', 'T0_enabled', 'T1_enabled', 'num_mutations', 'min_bases', 'max_bases', 'generation_time']

    # Start the mutation log with a row describing the unmodified original file
    mutation_details_df = pd.DataFrame([{
        'locus': genbank_number,
        'oriT_mutations': [],
        'T0_mutations': [],
        'T1_mutations': [],
        'oriT_enabled': False,
        'T0_enabled': False,
        'T1_enabled': False,
        'num_mutations': 0,
        'min_bases': 0,
        'max_bases': 0,
        'generation_time': datetime.now()
    }], columns=mutation_columns)

    # Define the combinations of regions to be mutated
    region_combinations = [
        (True, False, False),  # Only oriT
        (False, True, False),  # Only T0
        (False, False, True),  # Only T1
        (True, True, False),   # oriT and T0
        (True, False, True),   # oriT and T1
        (False, True, True),   # T0 and T1
        (True, True, True)     # oriT, T0, and T1
    ]

    # Apply mutations for each combination of regions
    for oriT_enabled, T0_enabled, T1_enabled in region_combinations:
        mutated_content, extension, mutation_df = Mutate.mutate_seva(
            genbank_content, 
            enable_oriT=oriT_enabled, 
            enable_T1=T1_enabled, 
            enable_T0=T0_enabled, 
            enable_insertion=True, 
            enable_deletion=True, 
            enable_substitution=True, 
            num_mutations=1, 
            min_bases=1, 
            max_bases=5
        )
        
        # Save the mutated GenBank file
        mutated_filename = os.path.join(folder_name, f"{genbank_number}{extension}.gb")
        SEVA.write_to_genbank_file(mutated_content, mutated_filename)
        
        # Add the mutation details to the mutation details DataFrame
        for _, mutation_row in mutation_df.iterrows():
            mutation_details_df = pd.concat([mutation_details_df, pd.DataFrame([{
                'locus': mutation_row['locus'],
                'oriT_mutations': mutation_row['oriT_mutations'],
                'T0_mutations': mutation_row['T0_mutations'],
                'T1_mutations': mutation_row['T1_mutations'],
                'oriT_enabled': oriT_enabled,
                'T0_enabled': T0_enabled,
                'T1_enabled': T1_enabled,
                'num_mutations': 1,
                'min_bases': 1,
                'max_bases': 5,
                'generation_time': datetime.now()
            }])], ignore_index=True)
    
    # Save the mutation details DataFrame to an Excel file
    mutation_details_filename = os.path.join(folder_name, f"{genbank_number}_mutations.xlsx")
    mutation_details_df.to_excel(mutation_details_filename, index=False)