# Gene Synthesis Final Project

Given a DNA sequence as an input string, this notebook identifies annealing regions of 18-25 bases near the end of each interval of roughly 30 bases. It prefers candidate regions that have no hairpins, no homopolymer runs, balanced nucleotide content, at least one of each nucleotide, and a GC content of 50-65%. It then builds the oligonucleotides from the given strand and from its reverse complement, so that the oligos overlap at their beginning and end in either direction.

From ChatGPT:
The synthesis by synthons, often referred to as the Gibbard method, is a valuable approach in gene synthesis due to its ability to generate longer DNA sequences efficiently. This method involves the stepwise assembly of shorter DNA fragments called synthons, which are typically around 50 to 100 base pairs in length. [NOTE: our synthons are much longer, with a minimum of 400 bp.] These synthons, each synthesized separately, contain overlapping regions that enable their assembly into larger, full-length DNA sequences.

There are several reasons why synthesis by synthons is advantageous:

Error Correction: Synthesis by synthons allows for the correction of errors in individual shorter fragments. If an error is detected in one of the synthons, it can be rectified by synthesizing a new correct fragment, rather than the entire sequence, thereby minimizing the impact of errors in the final assembly.

Modularity and Flexibility: The approach offers modularity, as the DNA sequence can be divided into smaller, more manageable fragments, simplifying the synthesis process. It also provides flexibility in designing and modifying specific regions or segments of the DNA sequence independently, facilitating customization of sequences for different applications.

Enhanced Efficiency: Synthesis by assembling shorter fragments enhances the efficiency of the overall process. It is often faster and more cost-effective than synthesizing longer sequences in a single step, particularly for longer DNA constructs.

Minimized Synthesis Challenges: Synthesis of longer DNA sequences as a single piece can be technically challenging and prone to errors or failures. By breaking down the sequence into smaller units (synthons), each synthesis step becomes more reliable and manageable.

Error Screening and Quality Control: Assembling shorter synthons allows for better error screening and quality control at each step of synthesis. It enables the identification and rectification of errors in individual fragments before their assembly into the final full-length sequence, thereby improving the overall quality of the synthesized DNA.

Facilitation of Complex DNA Constructs: For the construction of complex DNA sequences, such as genes with repeated motifs or intricate structures, synthesis by synthons provides a systematic and controlled approach to build and validate each part separately before assembly.

Overall, the synthesis by synthons or the Gibbard method offers a practical and effective strategy for synthesizing longer DNA sequences, enabling error correction, modularity, efficiency, and enhanced quality control throughout the process. These advantages make it a valuable method in gene synthesis and synthetic biology research.

In [49]:
# Returns the reverse complement of the sequence, read 5' to 3'.
# Strand indices are therefore mirrored: position i on the input strand
# corresponds to position len(sequence) - 1 - i on the returned strand.
def reverse_complement(sequence):
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    reverse_sequence = ''.join(complement[base] for base in reversed(sequence))
    return reverse_sequence

def test_complement():
    sequence = "ATCGATCGATCG"
    expected_result = "CGATCGATCGAT"
    
    result = reverse_complement(sequence)
    
    assert result == expected_result, f"Expected: {expected_result}, but got: {result}"
    print("Test passed!")

# Run the test
test_complement()


Test passed!
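
As an alternative sketch, the same reverse complement can be computed with `str.maketrans` and `str.translate`, which also makes it easy to accept lowercase input and the ambiguity code `N`. The name `reverse_complement_translate` and the handling of `N` and lowercase are assumptions for illustration, not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): reverse complement via a translation table.
_COMPLEMENT_TABLE = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement_translate(sequence):
    """Return the reverse complement, preserving case and passing N through."""
    return sequence.translate(_COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("ATCGATCGATCG"))  # CGATCGATCGAT
```

Unlike the dictionary lookup, `str.translate` leaves unknown characters unchanged instead of raising `KeyError`, so input validation would have to happen separately if it is needed.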


In [50]:
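def has_repeats(sequence):
    # Returns 1 if any base occurs 4 or more times in a row (homopolymer run), else 0.
    i = 0
    while i < len(sequence):
        j = i + 1
        count = 1
        # Extend j to the end of the run of identical bases starting at i
        while j < len(sequence) and sequence[i] == sequence[j]:
            count += 1
            j += 1
        if count >= 4:
            return 1  # Repeat found
        i = j
    return 0  # No run of 4 or more identical bases found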

def has_repeats_test():
    sequence_with_repeat = "ATCCCCCCGATCGCGCGCGATCG"
    sequence_without_repeat = "ATCGATCGATCGATCG"

    assert has_repeats(sequence_with_repeat) == 1, "Repeat should be found in the sequence"
    assert has_repeats(sequence_without_repeat) == 0, "No repeat should be found in the sequence"
    print("Tests passed!")

# Run the test
has_repeats_test()


Tests passed!
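
For debugging it can help to see where the runs are rather than only whether one exists. The sketch below does that with a regular expression back-reference; the function name `find_homopolymer_runs` and the `min_run` parameter are hypothetical and only mirror the threshold of 4 used above.

```python
import re

# Hypothetical helper (not in the original notebook): list homopolymer runs with a regex.
def find_homopolymer_runs(sequence, min_run=4):
    """Return (start, run) tuples for every run of min_run or more identical bases."""
    # ([ACGT]) captures one base; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return [(m.start(), m.group(0)) for m in pattern.finditer(sequence)]

print(find_homopolymer_runs("ATCCCCCCGATCGCGCGCGATCG"))  # [(2, 'CCCCCC')]
print(find_homopolymer_runs("ATCGATCGATCGATCG"))         # []
```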


Ensure that no single nucleotide makes up more than 40% of the sequence.

In [51]:
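def has_balanced_composition(dna_sequence):
    # Returns 1 if any single base makes up more than 40% of the sequence
    # (unbalanced), and 0 if the composition is balanced.
    length = len(dna_sequence)
    base_count = {'A': 0, 'T': 0, 'C': 0, 'G': 0}

    for base in dna_sequence:
        base_count[base] += 1

    for count in base_count.values():
        if count / length > 0.4:
            return 1  # Unbalanced

    return 0  # Balanced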

# Example usage:
dna_sequence = "ATTTTTTTTCGAAAAAATCGGGGGGGGGGGGGATCG"  # Replace this with your DNA sequence
result = has_balanced_composition(dna_sequence)

if result == 0:
    print("The DNA sequence has a balanced composition.")
else:
    print("The DNA sequence composition is not balanced.")


The DNA sequence composition is not balanced.
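
To see which base causes a sequence to fail the 40% check, the per-base fractions can be computed directly. This is a minimal sketch using `collections.Counter`; the helper name `base_fractions` and the short example sequence are assumptions for illustration.

```python
from collections import Counter

# Hypothetical helper (not in the original notebook): per-base fractions of a sequence.
def base_fractions(dna_sequence):
    """Return a dict mapping each of A, T, C, G to its fraction of the sequence."""
    if not dna_sequence:
        raise ValueError("Sequence must not be empty.")
    counts = Counter(dna_sequence)
    return {base: counts[base] / len(dna_sequence) for base in "ATCG"}

fractions = base_fractions("GGGGGATC")
overrepresented = [base for base, frac in fractions.items() if frac > 0.4]
print(overrepresented)  # ['G']
```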


Ensure there aren't hairpins in the oligo:

In [52]:
def has_hairpin(sequence):
    # Returns 1 if any window scores above the hairpin threshold, else 0.
    min_length = 18
    max_length = 25
    threshold = 20  # Adjust this threshold as needed

    if len(sequence) < min_length or len(sequence) > max_length:
        raise ValueError("Sequence length should be between 18 and 25 nucleotides.")

    # Score every window of at least min_length bases within the sequence
    for length in range(min_length, len(sequence) + 1):
        for start in range(len(sequence) - length + 1):
            end = start + length
            window_sequence = sequence[start:end]

            # calculate_hairpin_score is not defined in the cells shown here; it must
            # be supplied separately (a HairpinCounter-style secondary-structure score).
            score = calculate_hairpin_score(window_sequence)

            if score > threshold:
                return 1  # Hairpin structure found

    return 0  # No hairpin structure detected

# Example usage:
input_sequence = "ATCGTTGCCACGATCGAT"  # Replace with your DNA sequence

result = has_hairpin(input_sequence)

# has_hairpin returns 0 when no hairpin is detected
if result == 0:
    print("No hairpin structure found in the sequence.")
else:
    print("Potential hairpin structure detected in the sequence.")


No hairpin structure found in the sequence.
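
The hairpin check relies on `calculate_hairpin_score`, which is not defined in the cells shown. As a rough stand-in, the sketch below scores a sequence by the length of its longest self-complementary stem: a stretch whose reverse complement occurs further downstream, separated by a loop of at least `min_loop` bases. The function name `longest_hairpin_stem`, the `min_loop` default of 3, and the scoring scale are all assumptions; the threshold of 20 used above belongs to a different scale and would need to be recalibrated (for example, to a stem length of a few bases).

```python
# Hypothetical stand-in (not the notebook's calculate_hairpin_score): longest hairpin stem.
_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def longest_hairpin_stem(sequence, min_loop=3):
    """Return the length of the longest stem whose reverse complement
    occurs downstream, separated by a loop of at least min_loop bases."""
    n = len(sequence)
    best = 0
    for i in range(n):
        # Only stems longer than the current best are worth testing.
        for k in range(best + 1, n - i + 1):
            stem = sequence[i:i + k]
            partner = "".join(_COMPLEMENT[b] for b in reversed(stem))
            # Search for the partner only after the stem plus the minimum loop.
            if sequence.find(partner, i + k + min_loop) == -1:
                break  # a longer stem starting at i cannot match either
            best = k
    return best

print(longest_hairpin_stem("GGGGAAAACCCC"))  # 4
print(longest_hairpin_stem("AAAAAAAA"))      # 0
```

This ignores mismatches, bulges, and thermodynamics, so it is only a coarse filter compared with a proper secondary-structure predictor.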


Ensure there are no missing bases.

In [53]:
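def has_all_nucleotides(dna_sequence):
    # Returns 0 if A, T, C and G each occur at least once, and 1 otherwise.
    nucleotides = {'A', 'T', 'C', 'G'}

    present_nucleotides = set(dna_sequence)
    if nucleotides.issubset(present_nucleotides):
        return 0
    return 1  # 1 means at least one base is missing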

# Example usage:
dna_sequence = "ATCGATCGATCG"  # Replace this with your DNA sequence
result = has_all_nucleotides(dna_sequence)

if result == 0:
    print("The DNA sequence contains at least one of each nucleotide.")
else:
    print("The DNA sequence does not contain at least one of each nucleotide.")


The DNA sequence contains at least one of each nucleotide.
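
When the check fails, a set difference shows which bases are absent. The helper name `missing_nucleotides` below is hypothetical and is only meant as a small companion to the pass/fail function above.

```python
# Hypothetical helper (not in the original notebook): which bases are absent?
def missing_nucleotides(dna_sequence):
    """Return the sorted list of A/C/G/T bases that never occur in the sequence."""
    return sorted({"A", "C", "G", "T"} - set(dna_sequence))

print(missing_nucleotides("ATCGATCGATCG"))  # []
print(missing_nucleotides("AATTAATT"))      # ['C', 'G']
```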


The following function is intended for long DNA strands of over 300 bp (preferably around 600 bp).

In [84]:
def design_oligos(dna_sequence):
    assert len(dna_sequence) > 300, "DNA sequence is too short to run synthesize by Gibbard Method"

    interval = 30
    min_annealing_length = 18
    max_annealing_length = 25

    # Return the best-scoring (start, end) indices within [start_index, end_index)
    def find_annealing_region(start_index, end_index):
        best_region = ()
        best_score = 0

        for length in range(min_annealing_length, max_annealing_length + 1):
            for i in range(start_index, end_index - length):
                region = dna_sequence[i:i + length]

                # Base score is the GC fraction; add a bonus for GC content between 50% and 65%
                gc_content = (region.count('G') + region.count('C')) / len(region)
                score = gc_content
                if 0.5 <= gc_content <= 0.65:
                    score += 5

                # Each failed check returns 1 and lowers the score by one point
                score -= has_repeats(region)               # homopolymer run
                score -= has_all_nucleotides(region)       # missing base
                score -= has_balanced_composition(region)  # a base above 40%
                score -= has_hairpin(region)               # hairpin

                if score > best_score:
                    best_score = score
                    # The stored end is always i + max_annealing_length,
                    # whichever window length scored best
                    best_region = (i, i + max_annealing_length)

        # assert best_score > 0, "Wildly bad annealing region options"
        return best_region

    annealing_indices = []
    # Fixed annealing regions at both ends of the sequence, searched regions in between
    annealing_indices.append((0, 18))
    for i in range(18, len(dna_sequence) - 30, interval):
        annealing_region = find_annealing_region(i - 15, i + 15)
        annealing_indices.append(annealing_region)
    # The end index len - 1 leaves the final base out of the last region
    annealing_indices.append((len(dna_sequence) - 18, len(dna_sequence) - 1))

    # Reverse complement of the input; index i maps to len - 1 - i on this strand
    complement = reverse_complement(dna_sequence)

    # Forward oligos run from the start of one annealing region to the end of the next
    forward_oligos = []
    back_oligos = []
    for i in range(0, len(annealing_indices) - 1, 2):
        forward_oligos.append(dna_sequence[annealing_indices[i][0]:annealing_indices[i + 1][1]])

    # Back oligos are the reverse complement of every second annealing region, last region first
    for i in range(len(annealing_indices) - 1, 0, -2):
        start, end = annealing_indices[i]
        back_oligos.append(complement[len(complement) - end:len(complement) - start])

    return forward_oligos + back_oligos
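

The GC part of the scoring in `find_annealing_region` can be looked at in isolation. The sketch below reproduces only that part (GC fraction plus a bonus of 5 inside the 50-65% band) and leaves out the four penalties; the name `gc_score` and the short example regions are assumptions for illustration. Because the penalties subtract at most 4 in total, any window inside the band (at least 5.5 - 4 = 1.5) still outranks every window outside it (at most 1.0).

```python
# Hypothetical illustration of the GC part of the score only (penalties omitted).
def gc_score(region):
    """GC fraction, plus a bonus of 5 when GC content is between 50% and 65%."""
    gc_content = (region.count("G") + region.count("C")) / len(region)
    return gc_content + 5 if 0.5 <= gc_content <= 0.65 else gc_content

print(gc_score("GGCCAATT"))  # 5.5  (50% GC, inside the band)
print(gc_score("AAAATTTT"))  # 0.0  (0% GC)
print(gc_score("GGGGCCCC"))  # 1.0  (100% GC, outside the band)
```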


In [85]:
oligo_list = design_oligos("CTCGATACGTTTAGCACGTTTCTGTCACGTGCGATGTACGTAGCATCGCTCATCGACCGTACTAGCTTCTTTCGACTCGTAGACGGATCCGCGCACTTTACTTTGATCAGCTTCATCAGCTATGCAACGTAGCGCTGGGCTAGCCATACGTTACCTCGAGGATTTCTAGCTACTTTAATGACGTAGCTTCTACTAGGACGTTGCTAGGATAGCTTTGTTGATCAGTGTATCGTAGTAGCAGTCTAGGGTACTTTACGAGGCGGAGTCGACGACGTTACGATGCTACATCGTAGTAGCATCTAGCTCGACTGATTCGTACCGAGAGGAGATCGTCGCTAGTCGTTACGTTCTGATGATCGAGTGCGCTT")

print(oligo_list)

['CTCGATACGTTTAGCACGTTTCTGTCACGTGCGATGT', 'GTAGCATCGCTCATCGACCGTACTAGCTTCTTTCGACTCGTAGACGGATCCGCGCACT', 'TGATCAGCTTCATCAGCTATGCAACGTAGCGCTGGGCTAGCCATACG', 'CCTCGAGGATTTCTAGCTACTTTAATGACGTAGCTTCTACTAGGACGTTGCTAGGAT', 'TCAGTGTATCGTAGTAGCAGTCTAGGGTACTTTACGAGGCGGAGTCGACGACGT', 'CGATGCTACATCGTAGTAGCATCTAGCTCGACTGATTCGTACCGAGAGGAGATCGTCGCTA', 'AGCGCACTCGATCATCA', 'TAGATGCTACTACGATGTAGCATCG', 'CTAGACTGCTACTACGATACACTGA', 'TTAAAGTAGCTAGAAATCCTCGAGG', 'GTTGCATAGCTGATGAAGCTGATCA', 'TAGTACGGTCGATGAGCGATGCTAC']
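
One way to sanity-check a list like this is to map each oligo back onto the input and confirm that it matches either the given strand or its reverse complement. The sketch below does this with plain substring search; `locate_oligos` and the toy sequence are hypothetical, and coordinates are reported on the given (forward) strand.

```python
# Hypothetical check (not in the original notebook): locate each oligo on a strand.
def locate_oligos(dna_sequence, oligos):
    """Return (strand, start, end) per oligo in forward-strand coordinates,
    or None for an oligo found on neither strand."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    placements = []
    for oligo in oligos:
        start = dna_sequence.find(oligo)
        if start != -1:
            placements.append(("+", start, start + len(oligo)))
            continue
        # Reverse-complement the oligo to look for it on the forward strand.
        rc = "".join(complement[base] for base in reversed(oligo))
        start = dna_sequence.find(rc)
        placements.append(("-", start, start + len(rc)) if start != -1 else None)
    return placements

print(locate_oligos("AACCGGTTAC", ["AACC", "GTAA"]))  # [('+', 0, 4), ('-', 6, 10)]
```

Sorting the placements by start position then makes it easy to see which stretches of the sequence are covered by both strands and which are not covered at all.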


In [67]:
def test_design_oligos_short_sequence():
    # Create a short DNA sequence
    short_sequence = "ATCG"  # Assuming this sequence is shorter than 300 nucleotides

    try:
        result = design_oligos(short_sequence)
        print(result)
    except AssertionError as e:
        assert str(e) == "DNA sequence is too short to run synthesize by Gibbard Method", "AssertionError message doesn't match"
        print("Test passed! Assertion error occurred as expected.")
    else:
        raise AssertionError("Expected AssertionError due to short sequence length, but no error was raised.")

# Run the test
test_design_oligos_short_sequence()


Test passed! Assertion error occurred as expected.


In [86]:
def generate_protocol_instructions(oligos, output_file):
    # Header and introduction
    instructions = "Gibbard Method Protocol\n\n"
    instructions += "This protocol outlines the steps to assemble DNA synthons using the Gibbard method in the wet lab.\n\n"

    # Step-by-step instructions
    instructions += "Step 1: Synthesis of DNA Synthons\n"
    instructions += "- Synthesize individual DNA fragments (synthons) with overlapping regions.\n\n"

    instructions += "Step 2: Oligonucleotide Preparation\n"
    instructions += "- Order complementary oligonucleotides for each overlapping region.\n"
    instructions += "- Ensure oligos are purified and free from contaminants.\n\n"

    instructions += "Step 3: Annealing of Synthons and Oligos\n"
    instructions += "- Prepare a reaction mixture with synthons and complementary oligos.\n"
    instructions += "- Perform annealing by heating and gradual cooling to enable hybridization.\n\n"

    instructions += "Step 4: Ligation or Assembly Reaction\n"
    instructions += "- Use DNA ligase or appropriate enzymes for covalent joining of synthons and oligos.\n\n"

    instructions += "Step 5: Purification of Assembled DNA\n"
    instructions += "- Purify the assembled DNA product to remove excess oligos and reaction components.\n\n"

    instructions += "Step 6: Quality Control and Validation\n"
    instructions += "- Analyze the purified DNA product through gel electrophoresis or sequencing.\n"
    instructions += "- Perform functional assays or downstream applications to validate the sequence.\n\n"

    instructions += "Step 7: Iterative Optimization\n"
    instructions += "- If needed, optimize synthesis conditions and repeat the assembly process.\n\n"

    # Ordering instructions for oligos from a company
    instructions += "Ordering Oligonucleotides:\n"
    instructions += "- Use the following list of oligos for the Gibbard method:\n"
    for idx, oligo in enumerate(oligos, start=1):
        instructions += f"{idx}. {oligo}\n"
    instructions += "- Contact a DNA synthesis company and provide them with the list of oligos for ordering.\n\n"

    # Write the instructions to a text file and echo them to the console
    with open(output_file, "w") as f:
        f.write(instructions)
    print(instructions)
    print(f"Protocol instructions written to '{output_file}' successfully.")

# Example:
oligo_list = design_oligos("ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG")

output_filename = "Gibbard_Method_Protocol.txt"
generate_protocol_instructions(oligo_list, output_filename)


Gibbard Method Protocol

This protocol outlines the steps to assemble DNA synthons using the Gibbard method in the wet lab.

Step 1: Synthesis of DNA Synthons
- Synthesize individual DNA fragments (synthons) with overlapping regions.

Step 2: Oligonucleotide Preparation
- Order complementary oligonucleotides for each overlapping region.
- Ensure oligos are purified and free from contaminants.

Step 3: Annealing of Synthons and Oligos
- Prepare a reaction mixture with synthons and complementary oligos.
- Perform annealing by heating and gradual cooling to enable hybridization.

Step 4: Ligation or Assembly Reaction
- Use DNA ligase or appropriate enzymes for covalent joining of synthons and oligos.

Step 5: Purification of Assembled DNA
- Purify the assembled DNA product to remove excess oligos and reaction components.

Step 6: Quality Control and Validation
- Analyze the purified DNA product through gel electrophoresis or sequencing.
- Perform functional assays or downstream applicatio