# PyVADesign Tutorial

Welcome to the PyVADesign tutorial! 

In this notebook, we will explore the process of designing dsDNA fragments and corresponding primers for a number of selected mutations.

## Tutorial Overview

In this tutorial, we will cover the following topics:

1. Importing required modules and packages
2. Loading the vector sequence and the Gene of Interest
3. Parsing mutation data
4. Designing dsDNA fragments
5. Visualizing fragment regions and corresponding mutations
6. Designing primers that prepare the target plasmid for insertion of the dsDNA fragment as well as sequencing primers

Let's get started with the first step: importing the required modules and packages.

### 1. Importing Required Modules and Packages

In [3]:
import os

from src.mutation import Mutation
from src.sequence import Vector, Gene
from src.eblocks import EblockDesign
from src.primer import DesignPrimers
from src.plot import Plot

### 2. Loading the Vector Sequence and the Gene of Interest

To run the design process, the gene of interest must be provided in FASTA format, and the plasmid sequence must be available in either SnapGene DNA (.dna) or GenBank (.gb) format.
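
To make the expected gene input concrete, here is a minimal sketch of what a FASTA record looks like and how it can be read with plain Python. The helper `read_fasta` and the example record are hypothetical and only illustrate the file format; they are not part of PyVADesign, which handles parsing through `Gene.parse_sequence`.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # header line without the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-line record, used only to show the layout of a FASTA file
example = """>example_gene hypothetical record
ATGGTTAAA
GGGTAA
"""
records = read_fasta(example)
print(records)
```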

For the purpose of this tutorial we will focus on Mycobacterial membrane protein Large 3 (MmpL3) of *Mycobacterium avium*. MmpL3 is a lipid transporter that has become a promising drug target for developing new anti-mycobacterial therapies. 

In [None]:
# Create a gene object and parse the gene sequence from the data directory

sequence_file = os.path.join('tutorial-data', 'A0A0H2ZYQ2.fasta')  # Gene sequence in fasta format
gene_instance = Gene(stopcodon=False)
gene_instance.parse_sequence(sequence_file)

# Create a plasmid object and parse the input plasmid from the data directory

vector_file = os.path.join('tutorial-data', 'pACE_mmpL3-Mav.dna')  # Vector sequence including MmpL3 in dna format
vector_instance = Vector(gene=gene_instance)
vector_instance.parse_vector(vector_file)

### 3. Parsing Mutation Data

The desired mutations should be listed in a text file.

Here, we create a Mutation() object and parse the desired mutations. 

In [11]:
# Create a Mutation object and parse the input mutations from the tutorial-data/ directory

mutations_file = os.path.join('tutorial-data', 'mutations.txt')  # text file containing mutations
mutation_instance = Mutation()
mutation_instance.parse_mutations(mutations_file)

# Print the mutations that were parsed

mutation_instance.print_mutations()

The selected mutations are:
	Mutation  	V22F      
	Insert    	S34-YLV   
	Deletion  	V48-D57   
	Deletion  	T65-T68   
	Combined  	D76F, K116I, D122F, A77E
	Deletion  	R80-N88   
	Combined  	A102R, A146N, D150G, M151R
	Mutation  	L219S     
	Deletion  	I221-G229 
	Combined  	Y233S, L252R, R257H, E260M
	Insert    	E260-LGNSYSIL
	Mutation  	A279F     
	Deletion  	L300-V305 
	Combined  	R336W, G381C, S358N, T345Y
	Mutation  	E377D     
	Mutation  	N388Q     
	Mutation  	P400V     
	Combined  	I477P, C498Y, R514D, V507L
	Mutation  	T481E     
	Combined  	N496L, K529V, P497G
	Mutation  	A521D     
	Combined  	A521I, L542N, D555A, R523L, F561S
	Insert    	E553-TTGIFQCS
	Combined  	K563P, P565A, G584A, S585Y
	Combined  	I610W, R658C
	Mutation  	F622N     
	Combined  	F622A, I632L, E647D
	Mutation  	L634T     
	Deletion  	G732-R740 
	Insert    	N747-MSVPRC
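
The four mutation types in the list above can be told apart by their notation alone. The sketch below classifies an entry with a few regular expressions. It assumes the notation shown in the printed list (for example `V22F`, `S34-YLV`, `V48-D57`, and comma-separated point mutations); the function `classify` is hypothetical, and the parser inside PyVADesign and the exact layout of `mutations.txt` may differ.

```python
import re

POINT = re.compile(r"^[A-Z]\d+[A-Z]$")           # e.g. V22F
DELETION = re.compile(r"^[A-Z]\d+-[A-Z]\d+$")    # e.g. V48-D57
INSERT = re.compile(r"^[A-Z]\d+-[A-Z]+$")        # e.g. S34-YLV


def classify(entry):
    """Return the mutation type suggested by the notation of `entry`."""
    entry = entry.strip()
    parts = [part.strip() for part in entry.split(",")]
    if len(parts) > 1 and all(POINT.match(part) for part in parts):
        return "Combined"
    if POINT.match(entry):
        return "Mutation"
    if DELETION.match(entry):
        return "Deletion"
    if INSERT.match(entry):
        return "Insert"
    raise ValueError(f"Unrecognized mutation notation: {entry!r}")


for entry in ["V22F", "S34-YLV", "V48-D57", "D76F, K116I, D122F, A77E"]:
    print(f"{classify(entry):<10}{entry}")
```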


In [17]:
# Count the number of mutations per type

mutation_types = [m.type for m in mutation_instance.mutations]
num_point_mutations = mutation_types.count('Mutation')
print(f'Number of point mutations: {num_point_mutations}')
num_combined_mutations = mutation_types.count('Combined')
print(f'Number of combined mutations: {num_combined_mutations}')
num_insert = mutation_types.count('Insert')
print(f'Number of insertions: {num_insert}')
num_deletion = mutation_types.count('Deletion')
print(f'Number of deletions: {num_deletion}')
print('-----------------------------------')
total_mutations = num_point_mutations + num_combined_mutations + num_insert + num_deletion
print(f'Total number of mutations: {total_mutations}')

Number of point mutations: 10
Number of combined mutations: 10
Number of insertions: 4
Number of deletions: 6
-----------------------------------
Total number of mutations: 30


We also define an output directory for the generated files.

In [18]:
# Set output directory

output_dir = 'tutorial_output'

### 4. Designing dsDNA Fragments

Next, we create a design instance that runs the design of the dsDNA fragments. Here we choose `cost_optimization` as the optimization method, which aims to use as few base pairs as possible. The alternative is `amount_optimization`, which aims to cluster as many mutations as possible together in order to obtain the lowest number of different dsDNA fragments.
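
The trade-off between the two optimization methods can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the algorithm used by PyVADesign: it assumes that every mutation requires its own synthesized copy of the fragment region it falls in, and that a region spans its mutations plus an overlap on each side, subject to a minimum length. The functions `fragment_length` and `summarize` and the mutation positions are hypothetical; the values 300, 25, and 0.05 mirror `min_eblock_length`, `min_overlap`, and `bp_price` from the settings used in this tutorial.

```python
def fragment_length(positions, min_length=300, min_overlap=25):
    """Length (bp) of a fragment region covering all `positions` (toy model)."""
    span = max(positions) - min(positions) + 1
    return max(min_length, span + 2 * min_overlap)


def summarize(clusters, bp_price=0.05):
    """Return (number of regions, total bp to synthesize, total cost)."""
    # Assumption: one synthesized fragment per mutation in each region
    total_bp = sum(len(cluster) * fragment_length(cluster) for cluster in clusters)
    return len(clusters), total_bp, total_bp * bp_price


# Hypothetical mutation positions (bp) grouped in two different ways
one_region = [[100, 150, 900, 950]]
two_regions = [[100, 150], [900, 950]]

for name, clusters in [("one region", one_region), ("two regions", two_regions)]:
    n_regions, total_bp, cost = summarize(clusters)
    print(f"{name}: {n_regions} region(s), {total_bp} bp, cost {cost:.2f}")
```

In this toy model, grouping everything into a single region gives the fewest distinct fragments but more base pairs overall, while splitting into two regions needs fewer base pairs at the price of one extra region.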

In [28]:
# We use a settings file that contains input parameters for the design class, such as the minimum and maximum length of the dsDNA fragments

settingsfile = os.path.join('tutorial-data', 'dsDNA-Design-settings-CostOpt.txt')

# Create an Eblocks object based on the input mutations and the gene sequence

design_instance = EblockDesign(mutation_instance=mutation_instance,
                               vector_instance=vector_instance,
                               gene_instance=gene_instance,
                               settings_file=settingsfile,
                               output_dir=output_dir)

Now we can run the design method to generate the dsDNA fragments.

In [29]:
design_instance.run_design_eblocks()  # Run the design

mutation_instance   : <src.mutation.Mutation object at 0x7fdb74fc55d0>
vector_instance     : <src.sequence.Vector object at 0x7fdb74fc4970>
gene_instance       : <src.sequence.Gene object at 0x7fdb74fc4e80>
output_dir          : tutorial-output
settings_file       : tutorial-data/dsDNA-Design-settings-CostOpt.txt
cost_optimization   : True
amount_optimization : True
eblock_colors       : {0: '#1f77b4', 1: '#ff7f0e', 2: '#2ca02c', 3: '#d62728', 4: '#9467bd', 5: '#8c564b', 6: '#e377c2', 7: '#7f7f7f', 8: '#bcbd22', 9: '#17becf', 10: '#aec7e8', 11: '#ffbb78', 12: '#98df8a', 13: '#ff9896', 14: '#c5b0d5', 15: '#c49c94', 16: '#f7b6d2', 17: '#c7c7c7', 18: '#dbdb8d', 19: '#9edae5', 20: '#393b79', 21: '#ff7f0e', 22: '#2ca02c', 23: '#8c564b', 24: '#e377c2', 25: '#7f7f7f', 26: '#bcbd22', 27: '#17becf'}
clone_files         : True
verbose             : True
codon_usage         : U00096
bp_price            : 0.05
max_eblock_length   : 1500
min_eblock_length   : 300
min_overlap         : 25
min_order 

AttributeError: 'NoneType' object has no attribute 'start_index'

In [19]:
# TODO Add DnaE1 gene sequence to vector
# TODO What are the other things in the vector that do not have a name?

Our vector contains the SacB gene, an origin of replication, and a chloramphenicol resistance marker (CmR).

In [20]:

# TODO Show eBlocks in vector as well
# TODO Add plasmid visualization of eBlock features


# from Bio import SeqIO
# from Bio.Graphics import GenomeDiagram
# from Bio.SeqFeature import SeqFeature, FeatureLocation

# # Parse the plasmid sequence
# plasmid_seq_record = SeqIO.read("plasmid_sequence.fasta", "fasta")

# # Create a GenomeDiagram object
# gd_diagram = GenomeDiagram.Diagram("Plasmid Map")

# # Add the sequence track
# gd_track = gd_diagram.new_track(1, name="Plasmid")
# gd_feature_set = gd_track.new_set()

# # Add the plasmid sequence
# gd_feature_set.add_feature(SeqFeature(FeatureLocation(0, len(plasmid_seq_record))), color="black")

# # Parse the GFF3 file to extract features
# # Assuming you have a function parse_gff3() that returns feature information
# features = parse_gff3("plasmid_features.gff3")

# # Add the features to the plasmid map
# for feature in features:
#     start = feature.start
#     end = feature.end
#     name = feature.attributes["Name"]
#     gd_feature_set.add_feature(SeqFeature(FeatureLocation(start, end)), color="blue", label=True, label_position="middle", label_size=8, label_angle=0, label_strand=0, name=name)

# # Draw the plasmid map
# gd_diagram.draw(format="linear", pagesize=(15*len(plasmid_seq_record), 400), fragments=1)
# gd_diagram.write("plasmid_map.png", "png")


mutation_instance: <src.mutation.Mutation object at 0x0000017CB2044450>
vector_instance: <src.sequence.Vector object at 0x0000017CAB563F10>
gene_instance: <src.sequence.Gene object at 0x0000017CB3734550>
output_dir: 'tutorial_output'
settings_file: settings\eblock-settings.txt
cost_optimization: True
amount_optimization: True
eblock_colors: {0: '#1f77b4', 1: '#ff7f0e', 2: '#2ca02c', 3: '#d62728', 4: '#9467bd', 5: '#8c564b', 6: '#e377c2', 7: '#7f7f7f', 8: '#bcbd22', 9: '#17becf', 10: '#aec7e8', 11: '#ffbb78', 12: '#98df8a', 13: '#ff9896', 14: '#c5b0d5', 15: '#c49c94', 16: '#f7b6d2', 17: '#c7c7c7', 18: '#dbdb8d', 19: '#9edae5', 20: '#393b79', 21: '#ff7f0e', 22: '#2ca02c', 23: '#8c564b', 24: '#e377c2', 25: '#7f7f7f', 26: '#bcbd22', 27: '#17becf'}
clone_files: True
verbose: True
codon_usage: U00096
bp_price: 0.05
max_eblock_length: 1500
min_eblock_length: 300
min_overlap: 25
min_order: 24
wt_eblocks: []
eblocks: []
most_abundant_codons: {}
Calculating relative codon frequencies, based on t

In [None]:
for i in design_instance.wt_eblocks:
    print(i.name, i.sequence)



During the design, a separate eBlock is created for each mutation, and a GenBank (.gb) file is written so that each clone can easily be viewed in a sequence editor.
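
As a rough illustration of what creating a mutant fragment involves, the sketch below replaces one codon in a DNA sequence to introduce a point mutation. It is a simplified stand-in and not the PyVADesign implementation: the function `apply_point_mutation`, the short example sequence, and the `preferred_codon` mapping are hypothetical, and the codon chosen for phenylalanine is a placeholder rather than a value derived from the `codon_usage` setting.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the three codon positions
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_point_mutation(dna, mutation, preferred_codon):
    """Return `dna` with the codon of a point mutation such as 'V2F' replaced."""
    wild_type, new_residue = mutation[0], mutation[-1]
    position = int(mutation[1:-1])        # 1-based residue number
    start = 3 * (position - 1)            # 0-based index of the codon
    codon = dna[start:start + 3]
    if CODON_TABLE.get(codon) != wild_type:
        raise ValueError(f"Codon {codon!r} at residue {position} does not encode {wild_type}")
    return dna[:start] + preferred_codon[new_residue] + dna[start + 3:]


# Hypothetical 3-codon sequence (Met-Val-Lys) and placeholder codon choice
wild_type_dna = "ATGGTTAAA"
mutant_dna = apply_point_mutation(wild_type_dna, "V2F", {"F": "TTT"})
print(wild_type_dna, "->", mutant_dna)
```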

In [None]:
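# Now that we have designed the eBlocks, we can visualize them using the Plot class.
# Note: plot_instance must be an instance of Plot created beforehand; that step is not shown in this notebook.

plot_instance.plot_eblocks_mutations(figure_length=20,
                                     figure_width=5)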

In [13]:
# TODO Describe the eblocks here, what you can see with each color etc

In [None]:
# Each type of mutation (insertion, deletion, substitution) is represented by a different color; the legend is shown below

plot_instance.plot_mutation_legend()

In [None]:
# To see how many mutations can be made in each eBlock, we can plot a histogram

plot_instance.plot_histogram_mutations()

In [16]:
# TODO Do some explanation here

In [17]:
# TODO Save the eblocks to a file

In [None]:
# TODO (At the end of tutorial) Remake the eBlocks but optimize for amount of eBlocks

design_instance = EblockDesign(mutation_instance=mutation_instance,
                               vector_instance=vector_instance,
                               gene_instance=gene_instance,
                               output_dir=output_dir,
                               verbose=False,
                               cost_optimization=False,
                               amount_optimization=True)

design_instance.run_design_eblocks()
plot_instance.plot_eblocks_mutations(figure_length=20,
                                     figure_width=5)

In [102]:
# Remove all files in the output directory
import os
import shutil

def remove_all_files_and_folders(directory):
    # Check if the directory exists
    if os.path.exists(directory):
        shutil.rmtree(directory)  # Remove the entire directory and its contents
        os.makedirs(directory)      # Recreate the empty directory
    else:
        print(f"The directory {directory} does not exist.")

# Specify your directory here
remove_all_files_and_folders('tutorial_output')


In [None]:
output_dir

In [None]:
# Create a primer design object and run the primer design process for IVA primers to amplify the eblocks

settingsfile = 'settings/primer3-settings.txt'
seq_settingsfile = 'settings/primer3-seq-settings.txt'

primers_instance = DesignPrimers(mutation_instance=mutation_instance,
                                 eblocks_design_instance=design_instance,
                                 primers_settingsfile=settingsfile,
                                 seqprimers_settingsfile=seq_settingsfile,
                                 vector_instance=vector_instance,
                                 output_dir=output_dir)

primers_instance.run_design()

In this tutorial we will randomly design a number of mutations for the replicative DNA polymerase DnaE1 from *Mycobacterium smegmatis* to better understand its function. <br>

The expression plasmid containing Msmeg DnaE1 is XXX and is stored in XXX. <br>

The gene sequence was obtained from Mycobrowser (XXX) and is stored in XXX. <br>

Now we will randomly design some mutations for this gene. We will not generate any mutations in the N- or C-terminal regions, so that we can create a 20 bp overlap with the beginning or end of the gene in our eBlock design. <br>

1. **Single point mutations**

Single point mutations contain a single mutation per eBlock.

2. **Multiple point mutations in same eBlock**

Multiple point mutations contain several mutations in the same eBlock.

3. **Inserts**

Inserts add amino acids within the eBlock.

4. **Deletions**

Deletions remove part of the gene.
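
Random mutations that avoid the terminal regions can be drawn along the following lines. This is a minimal sketch, assuming a margin of 7 residues (roughly the 20 bp overlap expressed in codons) and the notation used earlier in this tutorial (`V22F`, `V48-D57`). The functions `random_point_mutation` and `random_deletion` are hypothetical, and the protein sequence is an arbitrary placeholder, not DnaE1.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_point_mutation(protein, rng, margin=7):
    """Return a point mutation such as 'V22F' (1-based residue numbering)."""
    # Skip the first and last `margin` residues so the termini stay untouched
    index = rng.randrange(margin, len(protein) - margin)
    wild_type = protein[index]
    new_residue = rng.choice([aa for aa in AMINO_ACIDS if aa != wild_type])
    return f"{wild_type}{index + 1}{new_residue}"


def random_deletion(protein, rng, margin=7, max_length=10):
    """Return a deletion such as 'V48-D57' covering 2 to `max_length` residues."""
    length = rng.randint(2, max_length)
    start = rng.randrange(margin, len(protein) - margin - length + 1)
    end = start + length - 1
    return f"{protein[start]}{start + 1}-{protein[end]}{end + 1}"


# Arbitrary placeholder sequence of 70 residues (not DnaE1)
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
rng = random.Random(0)  # fixed seed so the draw is reproducible
print([random_point_mutation(protein, rng) for _ in range(3)])
print([random_deletion(protein, rng) for _ in range(2)])
```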