# ICOR: Improving Codon Optimization with Recurrent Neural Networks in Colab

## Introduction

ICOR (Improving Codon Optimization with Recurrent neural networks) is a deep learning-based tool that optimizes codon usage for enhanced protein expression in Escherichia coli. By leveraging recurrent neural networks, ICOR learns codon usage patterns and context to improve upon traditional frequency-based optimization methods.

This notebook provides a user-friendly interface for batch optimization of DNA or amino acid (AA) sequences using ICOR, created by <b><font color='Gold'>**Logan Hessefort**</font></b> ([LinkedIn](https://www.linkedin.com/in/logan-hessefort/)).

This notebook is based on the excellent work from [Jain et al. (2023)](https://doi.org/10.1186/s12859-023-05246-8), who developed the original ICOR algorithm.

This notebook was created as part of a <b><font color='green'>US Department of Energy SCGSR Fellowship</font></b> ([details](https://science.osti.gov/wdts/scgsr)) at the National Renewable Energy Laboratory with additional support from the <b><font color='DodgerBlue'>US National Science Foundation</font></b> ([grant](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2132183&HistoricalAwards=false)).

## How to Use This Notebook

1. Run the cells in order. The notebook will:
   - Set up the required environment
   - Clone the ICOR repository
   - Install necessary dependencies

2. Prepare your input CSV file (see next section for details)

3. Set the job name and sequence type (`DNA` or `AA`) in the parameters cell, then upload your CSV file when prompted

4. The script will process each sequence and output a consolidated CSV with the results

## Input CSV Structure

Your input CSV should contain the following columns:

- `Sequence Name`: A unique identifier for each sequence
- `Sequence`: The original DNA or amino acid sequence, matching the sequence type selected in the parameters cell

Your CSV should look something like this:

| Sequence Name | Sequence |
|:-------------:|:--------:|
| Gene1 | ATGCATGCATGC... |
| Gene2 | ATGGCTAGCTAG... |
| ... | ... |
| GeneN | ATGTACGTACGT... |
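
A file with this layout can be written with Python's standard `csv` module. The snippet below is a minimal sketch, not part of ICOR: the helper name `write_input_csv` and the example records are hypothetical, and it checks only what the table above implies (unique names, non-empty sequences).

```python
import csv
import io

def write_input_csv(records, handle):
    """Write (name, sequence) pairs under the two headers the notebook expects."""
    names = [name for name, _ in records]
    if len(set(names)) != len(names):
        raise ValueError("Sequence names must be unique")
    writer = csv.writer(handle)
    writer.writerow(["Sequence Name", "Sequence"])
    for name, seq in records:
        seq = seq.strip().upper()  # normalize whitespace and case
        if not seq:
            raise ValueError(f"Empty sequence for {name}")
        writer.writerow([name, seq])

# Hypothetical example records (short DNA sequences, for illustration only)
records = [("Gene1", "ATGCATGCATGC"), ("Gene2", "atggctagctag")]
buffer = io.StringIO()
write_input_csv(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file for upload instead of an in-memory buffer, pass a handle opened with `open("my_sequences.csv", "w", newline="")` (the file name here is only an example).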

## Output

The script will generate a consolidated CSV file containing:

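- `Sequence Name`: The identifier from the input file
- `Original Sequence`: The sequence exactly as provided
- `Optimized Sequence`: The ICOR-optimized DNA sequence, or `Error: Invalid sequence` if the input could not be processed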
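
A useful sanity check on the results is that each optimized sequence still encodes the same protein as its input. The sketch below does this with the standard library only, assuming the column names written by the processing cell (`Original Sequence`, `Optimized Sequence`). The helpers `translate` and `encodes_same_protein` are hypothetical, and the example rows are hand-written for illustration, not real ICOR output.

```python
import csv
import io
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate DNA to protein, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def encodes_same_protein(original, optimized, sequence_type):
    """Compare the protein encoded by the optimized DNA with the input."""
    protein = translate(original) if sequence_type == "DNA" else original.upper()
    return translate(optimized) == protein

# Illustrative stand-in for a results file (hand-written, not ICOR output)
output_csv = """Sequence Name,Original Sequence,Optimized Sequence
Seq1,ATGAGCGACGTGGCTATTGTGAAGGAG,ATGTCTGATGTTGCGATCGTGAAAGAA
Seq2,ATGXXX,Error: Invalid sequence
"""

checks = {}
for row in csv.DictReader(io.StringIO(output_csv)):
    optimized = row["Optimized Sequence"]
    if optimized.startswith("Error"):
        checks[row["Sequence Name"]] = None  # row failed during optimization
        continue
    checks[row["Sequence Name"]] = encodes_same_protein(
        row["Original Sequence"], optimized, "DNA"
    )
print(checks)
```

A `False` entry would mean the optimized DNA no longer translates to the input protein and deserves a closer look; `None` marks rows that the notebook reported as errors.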


For any issues or questions, please refer to the [ICOR-batch GitHub repository](https://github.com/Loganz97/icor-codon-optimization-batch).

In [None]:
#@title Setup Environment and Clone Repository

#@markdown Run this cell to set up the environment and clone the ICOR repository. This step will:
#@markdown
#@markdown 1. Install required Python packages
#@markdown 2. Clone the ICOR repository
#@markdown 3. Set up the necessary Python path
#@markdown
#@markdown <font color="cyan">**Note**: This cell may take a few minutes to run as it installs dependencies and clones the repository.</font>
%pip install -q biopython numpy onnxruntime==1.12.0 selenium webdriver_manager pandas
!git clone -q https://github.com/Loganz97/icor-codon-optimization-batch/
%cd icor-codon-optimization-batch
import os
import sys
import pandas as pd
from Bio.Seq import Seq
import onnxruntime as rt
import numpy as np

# Add the 'tool' directory to the Python path
tool_dir = os.path.abspath('tool')
if tool_dir not in sys.path:
    sys.path.insert(0, tool_dir)

In [None]:
#@title ICOR Optimization Input Parameters

#@markdown **Job Name:** Name for this optimization run
jobname = "ICOR_Optimization_1" #@param {type:"string"}

#@markdown **Sequence Type:** Choose between DNA or Amino Acid (AA) input
sequence_type = "AA" #@param ["DNA", "AA"]

#@markdown **Output Options:**
download_results = True #@param {type:"boolean"}

#@markdown ---
#@markdown This cell processes sequences using the ICOR algorithm. You'll be prompted to upload a CSV file with 'Sequence Name' and 'Sequence' columns.
#@markdown
#@markdown If you want to try a demo, just click 'Cancel upload' when prompted.

import io
import pandas as pd
from google.colab import files
from Bio.Seq import Seq
import onnxruntime as rt
import numpy as np
import os

# Define the optimize_sequence function
def optimize_sequence(input_seq, sequence_type):
    # Load ONNX model
    model_path = os.path.join(os.getcwd(), 'tool', 'models', 'icor.onnx')
    sess = rt.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name

    # Define categorical labels and aa2int function
    labels = ['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACG', 'ACT', 'AGC', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CCG', 'CCT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GTC', 'GTG', 'GTT', 'TAA', 'TAT', 'TCA', 'TCG', 'TCT', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT', 'ACC', 'CAT', 'CCA', 'CGG', 'CGT', 'GAC', 'GAG', 'GGT', 'AGT', 'GGG', 'GTA', 'TGC', 'CCC', 'CGA', 'CGC', 'TAC', 'TAG', 'TCC', 'AGA', 'AGG', 'TGA']

    def aa2int(seq):
        _aa2int = {'A': 1, 'R': 2, 'N': 3, 'D': 4, 'C': 5, 'Q': 6, 'E': 7, 'G': 8, 'H': 9, 'I': 10, 'L': 11, 'K': 12, 'M': 13, 'F': 14, 'P': 15, 'S': 16, 'T': 17, 'W': 18, 'Y': 19, 'V': 20, 'B': 21, 'Z': 22, 'X': 23, '*': 24, '-': 25, '?': 26}
        return [_aa2int[i] for i in seq]

    # Process input sequence
    if sequence_type == 'DNA':
        try:
            input_seq = str(Seq(input_seq).translate())
        except Exception as exc:
            # Re-raise as ValueError so process_csv can report the row and continue
            raise ValueError(f"Invalid DNA sequence: {input_seq[:20]}...") from exc
    else:  # AA sequence
        if not all(aa in 'ARNDCQEGHILKMFPSTWYV*' for aa in input_seq):
            raise ValueError(f"Invalid amino acid sequence: {input_seq[:20]}...")

    # One-hot encode the amino acid sequence: one column per residue, 26 rows
    oh_array = np.zeros(shape=(26, len(input_seq)))
    for position, aa_index in enumerate(aa2int(input_seq)):
        oh_array[aa_index, position] = 1

    # Prepare input for the ONNX model: float32 array of shape (length, 1, 26)
    y = oh_array.T.astype(np.float32).reshape(len(input_seq), 1, 26)

    # Get prediction
    pred_onx = sess.run(None, {input_name: y})

    # Get the index of the highest probability from softmax output
    pred_indices = [np.argmax(pred) for pred in pred_onx[0]]

    # Convert indices to optimized sequence
    out_str = ''.join([labels[index] for index in pred_indices])

    return out_str

def process_csv(df, sequence_type):
    # Normalize column names: trim surrounding whitespace and lowercase
    df.columns = df.columns.str.strip().str.lower()

    # Check if required columns are present
    required_columns = {'sequence name', 'sequence'}
    if not required_columns.issubset(df.columns):
        raise ValueError(f"CSV file must contain columns: {', '.join(sorted(required_columns))}")

    results = []
    for _, row in df.iterrows():
        name = row['sequence name']
        seq = row['sequence']
        try:
            # Empty cells are read as NaN (a float), which is not a valid sequence
            if not isinstance(seq, str):
                raise ValueError("Sequence is missing or not text")
            # Tolerate stray whitespace and lowercase letters in the input
            optimized_seq = optimize_sequence(seq.strip().upper(), sequence_type)
            results.append({'Sequence Name': name, 'Original Sequence': seq, 'Optimized Sequence': optimized_seq})
        except ValueError as e:
            print(f"Error processing sequence {name}: {str(e)}")
            results.append({'Sequence Name': name, 'Original Sequence': seq, 'Optimized Sequence': 'Error: Invalid sequence'})
    return pd.DataFrame(results)

print("Please upload your CSV file containing sequences for optimization.")
print("If you want to try a demo, just click 'Cancel upload'.")

try:
    uploaded = files.upload()
    filename = list(uploaded)[0]  # raises IndexError if the upload was cancelled
    df = pd.read_csv(io.BytesIO(uploaded[filename]))
    print(f"Successfully uploaded and read file: {filename}")
except Exception:
    print('No file was uploaded or the file could not be read. Using demo data instead...')
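    # Create a demo DataFrame. Seq1 is a DNA sequence and Seq2 is a protein sequence,
    # so only the row matching the selected sequence_type gives a meaningful result.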
    df = pd.DataFrame({
        'Sequence Name': ['Seq1', 'Seq2'],
        'Sequence': ['ATGAGCGACGTGGCTATTGTGAAGGAG', 'MSDVAIVKEGWLHKRGEYIKTWRPRYFLLK']
    })
    filename = 'demo_sequences.csv'

print("\nFirst few rows of the input data:")
print(df.head())

try:
    # Process the CSV file
    result_df = process_csv(df, sequence_type)

    # Save results to CSV
    output_filename = f"{jobname}_optimized_sequences.csv"
    result_df.to_csv(output_filename, index=False)
    print(f"\nResults saved to {output_filename}")

    if download_results:
        files.download(output_filename)

except ValueError as e:
    print(f"\nError: {str(e)}")
    print("Please make sure your CSV file has the correct format with 'Sequence Name' and 'Sequence' columns (case-insensitive).")