# TIsigner-batch: Optimize Protein Expression with Accessibility-Based Sequence Design in Colab

## Introduction

TIsigner is a tool that optimizes protein expression by improving the accessibility of the translation initiation region through synonymous changes to the first 9 codons.
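
To make "synonymous changes to the first 9 codons" concrete, the sketch below checks that a designed sequence encodes the same protein as the original and that every changed codon lies within the first nine. It only illustrates the constraint and is not TIsigner's own code; the helper names `codons`, `translate`, and `differs_only_in_first_codons` are hypothetical, and counting the start codon as one of the nine is an assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(seq):
    """Split a DNA sequence into complete codons (trailing bases are dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    """Translate DNA into a protein string, using '*' for stop codons."""
    return "".join(CODON_TABLE[c] for c in codons(seq))


def differs_only_in_first_codons(original, designed, n_codons=9):
    """True if both sequences encode the same protein and every changed
    codon lies within the first n_codons codons (assumed to include ATG)."""
    if len(original) != len(designed):
        return False
    if translate(original) != translate(designed):
        return False
    pairs = zip(codons(original), codons(designed))
    changed = [i for i, (a, b) in enumerate(pairs) if a != b]
    return all(i < n_codons for i in changed)


# Toy fragments for illustration only.
print(differs_only_in_first_codons("ATGCATGCATGC", "ATGCACGCTTGC"))
```

Both toy fragments translate to the same four residues, so the check passes; changing a codon to a non-synonymous one makes it fail.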

This notebook, created by **Logan Hessefort** ([LinkedIn](https://www.linkedin.com/in/logan-hessefort/)), provides a user-friendly interface for batch processing of DNA sequences with TIsigner.

This notebook is based on the excellent work from [Bikash Kumar Bhandari, Chun Shem Lim, and Paul P Gardner (2019-)](https://github.com/Gardner-BinfLab/TISIGNER-ReactJS).

This notebook was created as part of a <b><font color='green'>US Department of Energy SCGSR Fellowship</font></b> ([details](https://science.osti.gov/wdts/scgsr)) at the National Renewable Energy Laboratory with additional support from the <b><font color='DodgerBlue'>US National Science Foundation</font></b> ([grant](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2132183&HistoricalAwards=false)).

## How to Use This Notebook

1. Run the cells in order. The notebook will:
   - Set up the Conda environment
   - Clone the TIsigner-batch repository
   - Install necessary dependencies

2. Prepare your input CSV file (see next section for details)

3. Upload your CSV file when prompted

4. The script will process each sequence and output a consolidated CSV with the results

## Input CSV Structure

Your input CSV should contain the following columns (column names are matched case-insensitively):

- `Sequence Name` (required): A unique identifier for each sequence
- `Sequence` (required): The original DNA sequence
- `Optimized Sequence` (optional): A pre-optimized version of the sequence. When this column is present, TIsigner runs on it instead of `Sequence`; when it is absent, the original sequence is used.

Your CSV should look something like this:

| Sequence Name | Sequence | Optimized Sequence |
|:-------------:|:--------:|:------------------:|
| Gene1 | ATGCATGCATGC... | ATGCACGCTTGC... |
| Gene2 | ATGGCTAGCTAG... | ATGAGCTAAAAG... |
| ... | ... | ... |
| GeneN | ATGTACGTACGT... | ATGTACGTCCGT... |

<font color="cyan">**Note:** All sequences must begin with the ATG start codon and end with the TAA stop codon for TIsigner to run properly.</font>
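
Checking sequences before uploading can save a long run that ends in errors. The sketch below is an independent pre-check, not TIsigner's own `valid_input_seq`: the ATG and TAA checks follow the note above, while the multiple-of-three and A/C/G/T-only checks are extra assumptions. The helper names `check_sequence` and `check_rows` are hypothetical.

```python
def check_sequence(seq):
    """Return a list of problems found in one DNA sequence (empty if none)."""
    seq = str(seq).strip().upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if not seq.endswith("TAA"):
        problems.append("does not end with TAA")
    # The two checks below are assumptions, not rules stated by the notebook.
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if set(seq) - set("ACGT"):
        problems.append("contains characters other than A, C, G, T")
    return problems


def check_rows(rows, column="Sequence"):
    """Map each sequence name to its list of problems, skipping clean rows."""
    report = {}
    for row in rows:
        problems = check_sequence(row[column])
        if problems:
            report[row["Sequence Name"]] = problems
    return report


# Toy rows for illustration only.
rows = [
    {"Sequence Name": "Gene1", "Sequence": "ATGCATGCATGCTAA"},
    {"Sequence Name": "Gene2", "Sequence": "ATGGCTAGCAAATAG"},
]
print(check_rows(rows))  # {'Gene2': ['does not end with TAA']}
```

For a real file, the rows can come from `csv.DictReader` or from `df.to_dict("records")` on a pandas DataFrame.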

## Output

The script will generate a consolidated CSV file containing:

- Sequence Name
- Original Sequence
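- Input Sequence (the sequence passed to TIsigner: the pre-optimized sequence if provided, otherwise the original)
- TISigned Sequence (further optimized by TIsigner, or a message beginning with `Error:` if processing failed)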
- Opening Energy
- Score (if available)
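
Because a failed sequence is recorded as an `Error: ...` message in the `TISigned Sequence` column rather than stopping the run, it helps to separate failures from successes after downloading. The sketch below does this with the standard `csv` module; the helper name `split_results` is hypothetical and the sample rows, including the energy value, are made up to show the layout.

```python
import csv
import io


def split_results(rows):
    """Separate result rows into (succeeded, failed) lists.

    A row counts as failed when its 'TISigned Sequence' field starts
    with 'Error:', which is how the notebook records exceptions.
    """
    succeeded, failed = [], []
    for row in rows:
        if row["TISigned Sequence"].startswith("Error:"):
            failed.append(row)
        else:
            succeeded.append(row)
    return succeeded, failed


# Made-up rows in the output layout, for illustration only.
sample = io.StringIO(
    "Sequence Name,Original Sequence,Input Sequence,"
    "TISigned Sequence,Opening Energy,Score\n"
    "Gene1,ATGCATGCATGCTAA,ATGCATGCATGCTAA,ATGCACGCTTGCTAA,10.5,\n"
    "Gene2,ATGGCTAGC,ATGGCTAGC,Error: example failure message,,\n"
)
succeeded, failed = split_results(csv.DictReader(sample))
print(len(succeeded), "succeeded;", len(failed), "failed")
for row in failed:
    print(row["Sequence Name"], "->", row["TISigned Sequence"])
```

To run it on real output, replace `sample` with a file handle opened on `tisigner_results.csv` (using `newline=""`).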

## Note

This notebook automatically installs all necessary dependencies. The main requirements are:
- Python 3.6+
- ViennaRNA suite
- Pandas
- NumPy

For any issues or questions, please refer to the [TIsigner-batch GitHub repository](https://github.com/Loganz97/TIsigner-batch).
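
If something goes wrong after installation, a quick check of the requirements listed above can narrow down the cause. The sketch below only reports what it finds and changes nothing. The helper name `check_environment` is hypothetical, and looking for the `RNAplfold` executable is an assumption about how the ViennaRNA suite is used here.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether the main requirements appear to be available."""
    return {
        "python>=3.6": sys.version_info >= (3, 6),
        "pandas": importlib.util.find_spec("pandas") is not None,
        "numpy": importlib.util.find_spec("numpy") is not None,
        # Assumption: ViennaRNA is reached through its RNAplfold executable.
        "RNAplfold": shutil.which("RNAplfold") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```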

In [None]:
#@title Download and install Conda
#@markdown <font color="red">**Note**: This cell restarts the session, which Colab reports as an error. This is expected; move on to the next cell once the restart completes.</font>

!pip install -q condacolab
import condacolab
condacolab.install()

In [None]:
#@title Upload your CSV file here ⬇️
#@markdown Running this cell installs TIsigner-batch and then, after about **1-2 minutes**, prompts you to upload a CSV file for TIsigner processing. A high-RAM CPU runtime (available with Colab Pro) is strongly recommended; processing takes about 40 s per sequence on a high-RAM runtime.

!git clone -q https://github.com/Loganz97/TIsigner-batch.git
%cd TIsigner-batch/TIsigner_cmd
!mamba env update -q -n base -f tisigner_env.yaml > /dev/null 2>&1

import subprocess
import glob
import pandas as pd
import os
from google.colab import files
from libs.functions import valid_input_seq
import re

def run_tisigner(sequence, seq_name):
    """
    Run TIsigner with specified parameters and return the path to the output CSV file.
    """
    safe_name = re.sub(r'[^\w\-_\.]', '_', seq_name)
    output_name = safe_name

    command = [
        'python3', '/content/TIsigner-batch/TIsigner_cmd/tisigner.py',
        '-s', sequence,
        '-o', output_name
    ]

    result = subprocess.run(command, capture_output=True, text=True)

    if result.returncode != 0:
        raise Exception(f"TIsigner failed with error: {result.stderr}")

    csv_files = glob.glob(f"/content/TIsigner-batch/TIsigner_cmd/results/{output_name}*.csv")
    if not csv_files:
        raise Exception("No output CSV file found")

    return csv_files[0]

def parse_tisigner_csv(csv_path):
    """
    Parse the TIsigner output CSV file and return only the selected sequence and its properties.
    """
    df = pd.read_csv(csv_path)
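    selected = df[df['Type'] == 'Selected']
    if selected.empty:
        # Fail with a clear message instead of an opaque IndexError from .iloc[0]
        raise ValueError("No 'Selected' row found in the TIsigner output CSV")
    selected_row = selected.iloc[0]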

    return {
        'Sequence': selected_row['Sequence'],
        'Opening Energy': selected_row['Opening Energy'],
        'Score': selected_row.get('Score', None)  # Score might not be present for all hosts
    }

def process_csv(input_file):
    """
    Process the uploaded CSV file and run TIsigner on each sequence.
    """
    df = pd.read_csv(input_file)
    results = []

    # Normalize column names
    df.columns = [col.lower().strip() for col in df.columns]

    # Check for required columns
    if 'sequence name' not in df.columns or 'sequence' not in df.columns:
        raise ValueError("CSV must contain 'Sequence Name' and 'Sequence' columns (case-insensitive)")

    for index, row in df.iterrows():
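        seq_name = str(row['sequence name'])  # Names may be parsed as numbers; re.sub needs a string
        original_seq = row['sequence']
        input_seq = row.get('optimized sequence', original_seq)  # Use original if the column is absent
        if pd.isna(input_seq) or not str(input_seq).strip():
            input_seq = original_seq  # Also fall back when the optimized cell is empty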

        try:
            valid_input_seq(input_seq)
            output_csv = run_tisigner(input_seq, seq_name)
            tisigner_result = parse_tisigner_csv(output_csv)
            results.append({
                'Sequence Name': seq_name,
                'Original Sequence': original_seq,
                'Input Sequence': input_seq,
                'TISigned Sequence': tisigner_result['Sequence'],
                'Opening Energy': tisigner_result['Opening Energy'],
                'Score': tisigner_result['Score']
            })
            os.remove(output_csv)
        except Exception as e:
            results.append({
                'Sequence Name': seq_name,
                'Original Sequence': original_seq,
                'Input Sequence': input_seq,
                'TISigned Sequence': f"Error: {str(e)}",
                'Opening Energy': None,
                'Score': None
            })

        print(f"Processed sequence {index + 1} of {len(df)}")

    return pd.DataFrame(results)

# Main execution
print("Please upload your CSV file with columns: 'Sequence Name' and 'Sequence'. 'Optimized Sequence' is optional.")
uploaded = files.upload()
input_file = list(uploaded.keys())[0]
results_df = process_csv(input_file)
output_file = 'tisigner_results.csv'
results_df.to_csv(output_file, index=False)
files.download(output_file)
print(f"\nResults have been saved to {output_file} and a download should begin shortly.")