<a href="https://colab.research.google.com/github/Reclone-org/DNA-Scripts/blob/main/Reclone_Syntax_FlankingSeq_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧬 Type IIS Synthetic Sequence Design Tool

This notebook generates **synthetic DNA sequences** with appended elements for **Type IIS cloning** (BsaI, BbsI, SapI, BsmbI) using the **Reclone syntax standard**.
It ensures the correct 5′ and 3′ overhangs are included, along with the necessary restriction sites, landing pads, and spacers, and outputs a ready-to-order sequence list.

* * *

## 📂 Input: CSV file
Upload a CSV file with the following column headers:

| Name        | Seq                     | 5' Syntax | 3' Syntax | Enzyme |
|-------------|-------------------------|-----------|-----------|--------|
| MyGene1     | ATGCGTACGATCGATCGT...   | A         | B         | BsaI   |
| MyGene2     | ATGAGTCCGATCGTACGA...   | N1        | C         | BbsI   |
| MyGene3     | ATGCGGCTTACGAGTACG...   | D         | N5        | SapI   |

You can generate a template in Step 1 below.

### Column details:
- **Name** → A short identifier for the DNA part (e.g. `GeneX`, `Promoter1`).  
- **Seq** → The full DNA sequence (≥ 20 bp). Characters other than A, C, G, and T (either case) are removed before processing.  
- **5' Syntax** → The syntax tag for the **left end** of the part (e.g. `A`, `B`, `N1`, `N2`, `N5`).  
- **3' Syntax** → The syntax tag for the **right end** of the part.  
- **Enzyme** → Restriction enzyme used (`BsaI`, `BbsI`, `SapI`, `BsmbI`).  

⚠️ **Important**:  
- Sequences must be at least 20 bp long before the flanking elements are appended.  
- Syntax tags must match the Reclone standard (A–F, N1–N5, or N).  
- The overhangs and appended elements differ between the 5′ and 3′ ends; the notebook selects the correct ones automatically.
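
The rules above can be checked on a row before the pipeline runs. The following is a minimal sketch and not part of the notebook: `validate_row`, `ALLOWED_TAGS`, `ALLOWED_ENZYMES`, and `MIN_LENGTH` are hypothetical names, and the 20 bp threshold is taken from the note above.

```python
# Hypothetical pre-flight check for one input row (not part of the notebook).
ALLOWED_TAGS = {"A", "B", "C", "D", "E", "F", "N", "N1", "N2", "N3", "N4", "N5"}
ALLOWED_ENZYMES = {"BsaI", "BbsI", "SapI", "BsmbI"}
MIN_LENGTH = 20  # minimum part length stated in the notes above


def validate_row(name, seq, left_tag, right_tag, enzyme):
    """Return a list of problems found in one row (an empty list means OK)."""
    problems = []
    # Keep only A/C/G/T, mirroring the cleaning rule described for the Seq column
    cleaned = "".join(c for c in seq.upper() if c in "ACGT")
    if len(cleaned) < MIN_LENGTH:
        problems.append(
            f"{name}: sequence has {len(cleaned)} valid bases, need at least {MIN_LENGTH}"
        )
    for label, tag in (("5'", left_tag), ("3'", right_tag)):
        if tag.strip().upper() not in ALLOWED_TAGS:
            problems.append(f"{name}: unknown {label} syntax tag '{tag}'")
    if enzyme.strip() not in ALLOWED_ENZYMES:
        problems.append(f"{name}: unsupported enzyme '{enzyme}'")
    return problems


print(validate_row("MyGene1", "ATGCGTACGATCGATCGTACGATCGTAGCTAGCTAG", "A", "B", "BsaI"))
print(validate_row("TooShort", "ATGCNNNN", "A", "Z", "EcoRI"))
```

Collecting every problem in a list, rather than raising on the first one, lets all issues in a row be reported at once.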

* * *

## ⚙️ What the code does
1. Reads your CSV input.  
2. Looks up the correct **overhangs, scar bases, restriction sites, landing pads, and spacers** for each part.  
3. Appends the correct **5′ and 3′ elements** to the original sequence.  
4. Outputs a final **CSV file** with the synthetic sequences and input details.
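
As a worked illustration of steps 2 and 3, the sketch below assembles one sequence by hand for `MyGene1` (tags `A` and `B`, enzyme BsaI). The constants are copied from the lookup tables in Step 2; the variable names are illustrative only.

```python
# Values copied from the notebook's Step 2 tables for BsaI and tags A and B.
LANDING_PAD = "ATGCG"
SITE = "GGTCTC"
SPACER = "A"         # BsaI cuts one base after its site; the notebook fills that base with A
OVERHANG_5 = "GGAG"  # tag A on the 5' end
OVERHANG_3 = "TACT"  # tag B on the 3' end


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


part = "ATGCGTACGATCGATCGTACGATCGTAGCTAGCTAG"  # MyGene1 from the example table

# 5' addition: landing pad + site + spacer + overhang
five_prime = LANDING_PAD + SITE + SPACER + OVERHANG_5
# 3' addition: overhang + spacer + reverse-complemented site + landing pad
three_prime = OVERHANG_3 + SPACER + reverse_complement(SITE) + LANDING_PAD

synthetic = five_prime + part + three_prime

print(five_prime)   # ATGCGGGTCTCAGGAG
print(three_prime)  # TACTAGAGACCATGCG
```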

* * *

## 📤 Output
After running the notebook you’ll get:

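- **`typeIIS_synthetic_sequences.csv`** → Downloadable file with the following columns (`5' Addition` and `3' Addition` record the flanking elements that were appended; spacer positions are filled with `A`):

| Name                   | Seq                      | 5' Syntax | 3' Syntax | Enzyme | 5' Addition         | 3' Addition          |
|------------------------|--------------------------|-----------|-----------|--------|---------------------|----------------------|
| MyGene1-A-B-synthetic  | ATGCGGGTCTCAGGAG...      | A         | B         | BsaI   | ATGCGGGTCTCAGGAG    | TACTAGAGACCATGCG     |
| MyGene2-N1-C-synthetic | GAAGACGAAGACAACCATg...   | N1        | C         | BbsI   | GAAGACGAAGACAACCATg | ggAATGAAGTCTTCGAAGAC |
| MyGene3-D-N5-synthetic | GTTACGCTCTTCAAGGT...     | D         | N5        | SapI   | GTTACGCTCTTCAAGGT   | ggCGGCAGAAGAGCGTTAC  |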

- Each row represents one synthetic sequence combining the original sequence with the 5′ and 3′ appended elements.
- Names follow the pattern `PartName-5'SyntaxTag-3'SyntaxTag-synthetic`.
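
One way to sanity-check an output sequence is to locate the outermost recognition sites and read off the bases the enzyme would expose. This is a rough sketch, not part of the notebook. It assumes the layout used in Step 2 (site, spacer, overhang on the 5′ side, mirrored on the 3′ side), a 4-nt overhang as in the notebook's tables, and no extra copies of the site beyond the flanks. `exposed_overhangs` is a hypothetical helper.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def exposed_overhangs(seq, site, spacer_len, overhang_len=4):
    """Return the (5', 3') overhangs flanking the part, read on the top strand."""
    seq = seq.upper()
    fwd = seq.find(site)                       # first forward site (5' flank)
    rev = seq.rfind(reverse_complement(site))  # last reverse site (3' flank)
    if fwd == -1 or rev == -1:
        raise ValueError("recognition site not found on both ends")
    left_start = fwd + len(site) + spacer_len  # first base after the 5' spacer
    right_end = rev - spacer_len               # last base before the 3' spacer
    return (seq[left_start:left_start + overhang_len],
            seq[right_end - overhang_len:right_end])


# MyGene1 with tags A and B and BsaI, laid out as the notebook builds it
synthetic = ("ATGCGGGTCTCAGGAG"
             "ATGCGTACGATCGATCGTACGATCGTAGCTAGCTAG"
             "TACTAGAGACCATGCG")
print(exposed_overhangs(synthetic, "GGTCTC", 1))  # ('GGAG', 'TACT')
```

The two returned strings should match the overhangs for the 5′ and 3′ syntax tags in the same row.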

* * *

✅ **You can copy-paste the output sequences directly into your DNA synthesis ordering system.**


## Step 0: Install libraries and packages

In [None]:
# 🧬 Type IIS Synthetic Sequence Design Tool (Colab version)
!pip install biopython
import pandas as pd
from Bio.Seq import Seq
from google.colab import files
import io



## Step 1 [Optional]: Download a template CSV file

In [None]:
# Generate a sample CSV file for users to download
import pandas as pd
from google.colab import files

# Three example parts covering different syntax tags and enzymes
example_data = [
    {"Name": "MyGene1", "Seq": "ATGCGTACGATCGATCGTACGATCGTAGCTAGCTAG", "5' Syntax": "A",  "3' Syntax": "B",  "Enzyme": "BsaI"},
    {"Name": "MyGene2", "Seq": "ATGAGTCCGATCGTACGATCGTACGCTAGCTAGCTA", "5' Syntax": "N1", "3' Syntax": "C",  "Enzyme": "BbsI"},
    {"Name": "MyGene3", "Seq": "ATGCGGCTTACGAGTACGCTAGCTAGCTAACGTCGA", "5' Syntax": "D",  "3' Syntax": "N5", "Enzyme": "SapI"},
]

example_df = pd.DataFrame(example_data)
example_filename = "example_input.csv"
example_df.to_csv(example_filename, index=False)

# Trigger a browser download of the template
files.download(example_filename)

print(f"📥 Example CSV generated: {example_filename}")

📥 Example CSV generated: example_input.csv


## Step 2: Upload CSV and run sequence design pipeline

In [None]:
# Colab-ready script: upload a CSV with columns:
# Name, Seq, 5' Syntax, 3' Syntax, Enzyme
import io
import pandas as pd
from google.colab import files
from typing import Dict

# ---------- 1) Upload CSV ----------
print("📤 Please upload your input CSV with columns: Name, Seq, 5' Syntax, 3' Syntax, Enzyme")
uploaded = files.upload()
if not uploaded:
    raise RuntimeError("No file uploaded.")
# Only the first uploaded file is read; any others are ignored
first_fn = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[first_fn]))

# ---------- 2) Overhang cores & end-specific specs ----------
RECLONE_CANON: Dict[str, str] = {
    "A":  "GGAG",
    "B":  "TACT",
    "N1": "CCAT",
    "N2": "GTCA",
    "N3": "TCCA",
    "C":  "AATG",
    "D":  "AGGT",
    "N4": "TTCG",
    "N5": "CGGC",
    "E":  "GCTT",
    "F":  "CGCT",
    "N":  "TCCA",  # optional alias
}

# Lowercase = extra scar bases outside the canonical 4-nt overhang.
# These strings represent the final fragment ends after digestion.
RECLONE_END_SPECS: Dict[str, Dict[str, str]] = {
    "A":  {"5": "GGAG",    "3": "GGAG"},
    "B":  {"5": "TACT",    "3": "TACT"},
    "N1": {"5": "CCATg",   "3": "tCCAT"},
    "N2": {"5": "GTCA",    "3": "ggGTCA"},
    "N3": {"5": "TCCAtg",  "3": "TCCA"},
    "C":  {"5": "AATG",    "3": "ggAATG"},
    "D":  {"5": "AGGT",    "3": "ggAGGT"},
    "N4": {"5": "TTCG",    "3": "ggTTCG"},
    "N5": {"5": "CGGC",    "3": "ggCGGC"},
    "E":  {"5": "GCTT",    "3": "GCTT"},
    "F":  {"5": "CGCT",    "3": "CGCT"},
    "N":  {"5": "TCCA",    "3": "TCCA"},
}

# ---------- 3) Type IIS enzyme configs ----------
# Adjust landing_pad and spacer_len to your lab standards if needed.
restriction_enzymes = {
    'BsaI':  {'site': 'GGTCTC',  'spacer_len': 1, 'landing_pad': 'ATGCG'},  # GGTCTC N ↓
    'SapI':  {'site': 'GCTCTTC', 'spacer_len': 1, 'landing_pad': 'GTTAC'},  # GCTCTTC N ↓
    'BbsI':  {'site': 'GAAGAC',  'spacer_len': 2, 'landing_pad': 'GAAGAC'},  # GAAGAC NN ↓ (note: this landing pad repeats the recognition site)
    'BsmbI': {'site': 'CGTCTC',  'spacer_len': 1, 'landing_pad': 'TGCAG'},  # CGTCTC N ↓
}

# ---------- 4) Utilities ----------
def reverse_complement(seq: str) -> str:
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(comp)[::-1]

def normalize_token(tok: str) -> str:
    """Case-insensitive map to defined tokens (A–F, N, N1–N5)."""
    t = tok.strip().upper()
    if t not in RECLONE_END_SPECS:
        raise ValueError(f"Unknown syntax token '{tok}'. Allowed: {', '.join(RECLONE_END_SPECS.keys())}")
    return t

# ---------- 5) Row processor ----------
def generate_synthetic_sequences(row):
    name    = str(row['Name']).strip()
    seq_raw = str(row['Seq']).strip()
    left_in = str(row["5' Syntax"]).strip()
    right_in= str(row["3' Syntax"]).strip()
    enzyme  = str(row['Enzyme']).strip()

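    # Clean sequence: keep only A/C/G/T (either case), then uppercase
    sequence = ''.join(c for c in seq_raw if c in 'ACGTacgt').upper()
    # Enforce the documented 20 bp minimum (this also catches empty or missing sequences)
    if len(sequence) < 20:
        raise ValueError(f"{name}: only {len(sequence)} valid bases after cleaning; at least 20 are required.")
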
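    # Enzyme config (matched case-insensitively, so the spelling "BsmBI" is also accepted)
    enzyme_key = next((k for k in restriction_enzymes if k.lower() == enzyme.lower()), None)
    if enzyme_key is None:
        raise ValueError(f"{name}: unsupported enzyme '{enzyme}'. "
                         f"Choose from: {', '.join(restriction_enzymes.keys())}")
    enz = restriction_enzymes[enzyme_key]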

    # Normalize tokens
    left_tok  = normalize_token(left_in)
    right_tok = normalize_token(right_in)

    # Build the synthetic sequence: 5' addition + part + 3' addition
    #   5' addition: landing pad + site + spacer + 5' overhang
    #   3' addition: 3' overhang + spacer + reverse_complement(site) + landing pad
    # Spacer positions (the N bases in the enzyme's cut pattern) are filled with 'A'.
    spacer = 'A' * enz['spacer_len']
    five_prime_addition = enz['landing_pad'] + enz['site'] + spacer + RECLONE_END_SPECS[left_tok]["5"]

    # Only the recognition site is reverse-complemented; the landing pad is reused as-is.
    three_prime_addition = RECLONE_END_SPECS[right_tok]["3"] + spacer + reverse_complement(enz['site']) + enz['landing_pad']

    combined_sequence = five_prime_addition + sequence + three_prime_addition
    combined_name = f"{name}-{left_tok}-{right_tok}-synthetic"

    # Return the output name and sequence, the input columns, and both additions
    return pd.Series([combined_name, combined_sequence, left_in, right_in, enzyme, five_prime_addition, three_prime_addition])


# ---------- 6) Run + export ----------
required = {"Name", "Seq", "5' Syntax", "3' Syntax", "Enzyme"}
missing = required - set(df.columns)
if missing:
    # Sort so the error message is stable from run to run
    raise ValueError(f"CSV is missing required columns: {', '.join(sorted(missing))}")

# Process every row and label the seven output columns
synthetic_sequences_df = df.apply(generate_synthetic_sequences, axis=1)
synthetic_sequences_df.columns = ["Name", "Seq", "5' Syntax", "3' Syntax", "Enzyme", "5' Addition", "3' Addition"]


out_fn = 'typeIIS_synthetic_sequences.csv'
synthetic_sequences_df.to_csv(out_fn, index=False)
files.download(out_fn)

print("✅ Synthetic sequence generation complete. Output saved as", out_fn)

📤 Please upload your input CSV with columns: Name, Seq, 5' Syntax, 3' Syntax, Enzyme


✅ Synthetic sequence generation complete. Output saved as typeIIS_synthetic_sequences.csv
