<a href="https://colab.research.google.com/github/Hugo-Black/SLICE/blob/main/SLICE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://raw.githubusercontent.com/Hugo-Black/SLICE/main/misc/slice_logo.png" height="110" align="left" style="height:300">


#**SLICE**
###_Signal-Peptide Locating, Identifying, and Cleavage Engine_


---
**Authors**: James Pang and Devanshi Shah

**Date**: May 30th 2025

---


##**Usage Instructions**

_Set up the following directories_

**`path_to_signalp6_tar_gz`**: Path to the `.tar.gz` archive containing the GPU-converted fast and slow-sequential model weights for SignalP-6.0.

**`path_to_input`**: Directory of input FASTA files; each FASTA file is processed as one batch.

**`path_to_output`**: Directory where results are saved. It must already exist.
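
As a minimal sketch of preparing this layout before running the notebook, the snippet below creates the input and output folders and reports whether the weights archive is present. The helper name `prepare_slice_dirs` is hypothetical and not part of SLICE; the notebook itself only checks that the paths exist.

```python
from pathlib import Path


def prepare_slice_dirs(weights_archive, input_dir, output_dir):
    """Create the input/output folders and report what the notebook would find.

    Hypothetical helper for illustration only.
    Returns (archive_exists, sorted list of FASTA file names in input_dir).
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    # parents=True creates intermediate folders; exist_ok=True makes the call repeatable.
    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    # The notebook picks up files ending in .fasta or .fa, one file per batch.
    batches = sorted(
        p.name for p in input_dir.iterdir() if p.suffix in (".fasta", ".fa")
    )
    return Path(weights_archive).is_file(), batches
```

Calling it with the three paths from the setup cell returns a flag for the archive and the list of batches that would be processed.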


_Results_

**`output_mode`** determines whether mature FASTA sequences are exported as one
multi-FASTA file per batch or as one FASTA file per protein.
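
The difference between the two modes can be sketched with plain file writes. This is an illustration under assumptions: the function `write_mature_fastas` and its file naming are hypothetical, and the notebook itself writes FASTA files through Biopython's `SeqIO`.

```python
from pathlib import Path


def write_mature_fastas(records, out_dir, batch_name, output_mode):
    """Write (protein_id, sequence) pairs according to output_mode.

    Hypothetical helper; the file names used here are illustrative only.
    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    if output_mode == "one multifasta per batch":
        # All mature sequences of the batch go into a single file.
        path = out_dir / f"{batch_name}.fasta"
        with open(path, "w") as handle:
            for protein_id, seq in records:
                handle.write(f">{protein_id}\n{seq}\n")
        written.append(path)
    elif output_mode == "one protein per fasta":
        # Each mature sequence gets its own file, named after the protein ID.
        for protein_id, seq in records:
            path = out_dir / f"{protein_id}.fasta"
            path.write_text(f">{protein_id}\n{seq}\n")
            written.append(path)
    else:
        raise ValueError(f"Unknown output_mode: {output_mode!r}")
    return written
```

For large batches, the per-protein mode creates one file per secreted protein, so the multi-FASTA mode is usually the lighter option on Google Drive.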

The notebook also saves the SignalP-6.0 results, a full summary CSV file, and a second summary CSV containing only the proteins predicted to be secreted.
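
To make the cleaving step concrete, here is a small sketch of how a cleavage-site annotation of the form `CS pos: N-M` splits a sequence into signal peptide and mature sequence, and which prediction labels count as secreted. The function names and the example sequence are made up for illustration; the regular expression and label list follow the notebook's processing cell.

```python
import re

# Prediction labels that the summary filter treats as secreted.
SECRETED_TYPES = {"SP", "LIPO", "TAT", "TATLIPO", "PILIN"}


def split_at_cleavage_site(sequence, cs_string):
    """Return (signal_peptide, mature_sequence) for a 'CS pos: N-M' annotation.

    'CS pos: 25-26' means cleavage between residues 25 and 26 (1-based), so the
    signal peptide is the first 25 residues. If no cleavage site is found, the
    signal peptide is empty and the sequence is returned unchanged.
    """
    match = re.search(r"CS pos:\s*(\d+)-(\d+)", cs_string or "")
    if match is None:
        return "", sequence
    last_sp_residue = int(match.group(1))
    return sequence[:last_sp_residue], sequence[last_sp_residue:]


def is_secreted(prediction):
    """True if the prediction label is one of the signal peptide types."""
    return prediction in SECRETED_TYPES


# Made-up example: a 10-residue sequence cleaved after residue 4.
signal_peptide, mature = split_at_cleavage_site("MKKLAQETPG", "CS pos: 4-5. Pr: 0.97")
print(signal_peptide, mature)  # MKKL AQETPG
```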

_SignalP-6.0 Parameters_

| Parameter  | Description                                                                                        |
|:-----------|:---------------------------------------------------------------------------------------------------|
| `organism` | Organism group to use. Options: `other`, `eukarya`.                                                |
| `format`   | Output file format. Options: `none`, `txt`, `png`, `eps`, `all`.                                   |
| `mode`     | Model mode. Options: `fast`, `slow-sequential`.                                                    |
| `gpu_load` | Target GPU load, passed to SignalP-6.0 as `--bsize` (≥ 100 recommended to maximise VRAM usage).    |


For further information on selecting the correct parameters, please refer to
[SignalP-6.0 Installation Instructions](https://github.com/fteufel/signalp-6.0/blob/main/installation_instructions.md).
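
The parameters above end up as flags on the `signalp6` command line. The sketch below assembles such a command as an argument list; the flags mirror the command this notebook runs, while the function name `build_signalp6_command`, its validation, and the example paths are assumptions added for illustration.

```python
import shlex


def build_signalp6_command(fasta_path, output_dir, model_dir,
                           organism="other", output_format="none",
                           mode="fast", gpu_load=100):
    """Assemble a signalp6 command line as a list of arguments.

    Hypothetical helper: the flags follow the notebook's own command,
    with gpu_load passed as --bsize.
    """
    if organism not in ("other", "eukarya"):
        raise ValueError(f"Unknown organism: {organism!r}")
    if output_format not in ("none", "txt", "png", "eps", "all"):
        raise ValueError(f"Unknown format: {output_format!r}")
    if mode not in ("fast", "slow-sequential"):
        raise ValueError(f"Unknown mode: {mode!r}")
    return [
        "signalp6",
        "--model_dir", str(model_dir),
        "--fasta", str(fasta_path),
        "--organism", organism,
        "--output_dir", str(output_dir),
        "--format", output_format,
        "--mode", mode,
        "--bsize", str(gpu_load),
    ]


# Example with placeholder paths; shlex.join quotes arguments that contain spaces.
cmd = build_signalp6_command("input/batch 1.fasta", "output/batch1", "model_weights",
                             mode="slow-sequential", gpu_load=200)
print(shlex.join(cmd))
```

Building the command as a list and quoting it only for display avoids problems with spaces in Google Drive paths.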



##**Acknowledgements**

This tool leverages the open‐source **SignalP-6.0** framework developed by the **Nielsen Lab (DTU Bioinformatics)** (Teufel et al. 2022), under the SignalP6.0 licensing terms.

SLICE was developed under the guidance of **Tarhan Ibrahim** and the **Bozkurt lab**.

##**References**

Teufel, F., Almagro Armenteros, J. J., Johansen, A. R., Gíslason, M. H., Pihl, S. I., Tsirigos, K. D., Winther, O., Brunak, S., von Heijne, G., & Nielsen, H. (2022). SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature Biotechnology, 40(7), 1023–1025. https://doi.org/10.1038/s41587-021-01156-3

In [None]:
#@title **Mount Google Drive**
from google.colab import drive
drive.mount('/content/drive')


In [None]:
#@title **Setting up Path to Model Weights, I/O Paths & Output Mode**
import os
import glob
import pandas as pd
import numpy as np

# Define your input paths:
path_to_signalp6_tar_gz = "/content/drive/MyDrive/FYP/signalp6/signalp6_gpu.tar.gz"  #@param {type:"string"}
path_to_input = "/content/drive/MyDrive/FYP/signalp6/input/"  #@param {type:"string"}
path_to_output = "/content/drive/MyDrive/FYP/signalp6/output/"  #@param {type:"string"}
output_mode = "one multifasta per batch"  #@param ["one protein per fasta", "one multifasta per batch"]


if not os.path.exists(path_to_signalp6_tar_gz):
    raise FileNotFoundError(f"Error: File {path_to_signalp6_tar_gz} not found. Check if it's in the right directory.")

if not os.path.exists(path_to_input):
    raise FileNotFoundError(f"Error: Input path {path_to_input} not found. Check the path.")

if not os.path.exists(path_to_output):
    raise FileNotFoundError(f"Error: Output path {path_to_output} not found. Check the path.")


# Handle batch mode: determine whether the input is a directory or a single FASTA file.
fasta_files = []
if os.path.isdir(path_to_input):
    # For batch mode, expect one multifasta file per batch.
    # This example considers files ending with .fasta or .fa.
    full_paths = glob.glob(os.path.join(path_to_input, "*.fasta")) + glob.glob(os.path.join(path_to_input, "*.fa"))
    if not full_paths:
        raise FileNotFoundError("No FASTA files found in the input directory.")
    else:
        print(f"Found {len(full_paths)} FASTA file(s) in the input directory.")
    # Get only the file names (e.g. ABC.fasta)
    fasta_files = [os.path.basename(f) for f in full_paths]
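elif os.path.isfile(path_to_input):
    fasta_files = [os.path.basename(path_to_input)]
    # Later cells join path_to_input with each file name, so keep only the parent directory.
    path_to_input = os.path.dirname(path_to_input)
    print("Input is a single FASTA file.")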
else:
    raise ValueError("Input path is neither a valid file nor a directory.")

print("FASTA file names for batch processing:", fasta_files)





In [None]:
#@title **Setting Up Environment (~10 - 15 mins)**
# Install dependencies
!pip install biopython


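# Copy and extract SignalP6 (paths are quoted so Drive folders with spaces work).
# The archive is expected to unpack into /content/signalp6_gpu/.
!cp "{path_to_signalp6_tar_gz}" /content/
%cd /content/
!tar -xvzf "/content/{os.path.basename(path_to_signalp6_tar_gz)}"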


# Install signalP6
%cd /content/signalp6_gpu/signalp-6-package/
!pip install .

# Pin NumPy and pandas to fixed versions to work around the NumPy version issue
!pip install "numpy==1.26.4"
!pip install "pandas==2.2.2"


# Copy model files to the appropriate directory
SIGNALP_DIR = os.popen('python3 -c "import signalp; import os; print(os.path.dirname(signalp.__file__))"').read().strip()
!cp -r /content/signalp6_gpu/signalp-6-package/models/* {SIGNALP_DIR}/model_weights/




In [None]:

#@title **Running SignalP6.0h**

#@markdown **SignalP-6.0 Parameters**

import re
organism = "other"  #@param ["other", "eukarya"]
format = "none"  #@param ["none","txt", "png", "eps", "all"]
mode = "slow-sequential"  #@param ["fast","slow-sequential"]
#@markdown **Performance Optimizations**
gpu_load = 200  #@param {type:"integer"}
#@markdown GPU Load ≥ 100 is recommended to maximise GPU VRAM usage.


for fasta_name in fasta_files:
    base_name = re.sub(r'\.fasta$|\.fa$', '', fasta_name, flags=re.IGNORECASE)

    results_dir_file = os.path.join(path_to_output, f"{base_name}_{mode}_results")
    os.makedirs(results_dir_file, exist_ok=True)

    # Build the SignalP6 command using the current FASTA file.
    cmd = (
        f"signalp6 --model_dir {SIGNALP_DIR}/model_weights "
        f"--fasta {os.path.join(path_to_input, fasta_name)} "
        f"--organism {organism} "
        f"--output_dir {results_dir_file} "
        f"--format {format} "
        f"--mode {mode} "
        f"--bsize {gpu_load}"
    )

    print(f"Running SignalP6 on {fasta_name} ...")
    # Execute the command.
    !{cmd}
    print(f"Results for {fasta_name} are in {results_dir_file}\n")



In [None]:
#@title **Batch FASTA Cleaving and Results Summary**


# Temporary workaround: make sure `numpy.rec` resolves to NumPy's records module
import sys, numpy as np
try:
    from numpy.core import records as rec
except ImportError:
    # Assumption: NumPy releases without numpy.core expose the module as numpy._core.records
    from numpy._core import records as rec
np.rec = rec
sys.modules['numpy.rec'] = rec


import os
import re
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Function to extract the cleavage site from a "CS Position" string in SignalP-6.0 output.
def extract_cleavage_site(cs_string):
    match = re.search(r'CS pos:\s*(\d+)-(\d+)', cs_string)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None, None

# Function to extract the prediction probability from the "CS Position" string in SignalP-6.0 output.
def extract_prediction_probability(cs_string):
    match = re.search(r'Pr:\s*([\d\.]+)', cs_string)
    if match:
        return float(match.group(1))
    return None

# Function to process each row in the SignalP-6.0 output to determine the signal peptide and mature sequence details.
def process_row(row):
    protein_id = row['ID'].split()[0]
    full_seq = fasta_dict.get(protein_id)
    full_length = len(full_seq) if full_seq is not None else None

    sp_type = row.get('Prediction', None)

    cs = row['CS Position']
    cleavage_start, cleavage_end = extract_cleavage_site(cs) if pd.notnull(cs) else (None, None)
    prediction_probability = extract_prediction_probability(cs) if pd.notnull(cs) else None

    if full_seq is None:
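        # Sequence not found in the input FASTA: return an empty record as a Series,
        # matching the return type of the normal path below.
        return pd.Series({
            "Protein_ID": protein_id,
            "Full_Length": None,
            "Signal_Peptide_Type": sp_type,
            "Signal_Peptide": None,
            "Mature_Sequence": None,
            "Signal_Peptide_Pos": None,
            "Mature_Sequence_Pos": None,
            "Prediction_Probability": prediction_probability
        })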

    if cleavage_start is not None:
        signal_peptide_seq = full_seq[:cleavage_start]
        mature_seq = full_seq[cleavage_start:]
        signal_peptide_pos = f"1-{cleavage_start}"
        mature_seq_pos = f"{cleavage_start + 1}-{full_length}"
    else:
        signal_peptide_seq = "No Signal Peptide"
        mature_seq = full_seq
        signal_peptide_pos = "None"
        mature_seq_pos = f"1-{full_length}"

    return pd.Series({
        "Protein_ID": protein_id,
        "Full_Length": full_length,
        "Signal_Peptide_Type": sp_type,
        "Signal_Peptide": signal_peptide_seq,
        "Mature_Sequence": mature_seq,
        "Signal_Peptide_Pos": signal_peptide_pos,
        "Mature_Sequence_Pos": mature_seq_pos,
        "Prediction_Probability": prediction_probability
    })

# Cleave sequences and write result summaries for each batch
for fasta_name in fasta_files:
    base_name = re.sub(r'\.fasta$|\.fa$', '', fasta_name, flags=re.IGNORECASE)

    summary_dir = os.path.join(path_to_output, f"{base_name}_{mode}_summaries")
    results_dir = os.path.join(path_to_output, f"{base_name}_{mode}_results")
    prediction_file = os.path.join(results_dir, "prediction_results.txt")

    if not os.path.exists(prediction_file):
        print(f"Warning: no predictions for {fasta_name}, skipping.")
        continue
    print(f"\nProcessing {fasta_name}")

    df = pd.read_csv(prediction_file, sep='\t', skiprows=1)
    df.columns = df.columns.str.lstrip('# ').str.strip()

    # load input FASTA into dictionary
    fasta_path = None
    for ext in ('.fasta', '.fa'):
        p = os.path.join(path_to_input, base_name + ext)
        if os.path.exists(p):
            fasta_path = p
            break
    if fasta_path is None:
        print(f"Warning: no FASTA for {base_name}, skipping.")
        continue
    fasta_dict = {r.id: str(r.seq) for r in SeqIO.parse(fasta_path, "fasta")}

    # apply process_row and create full summary DF
    summary_df = df.apply(process_row, axis=1)

    # filtered DF of only secreted proteins
    summary_df_signal = summary_df[
        summary_df['Signal_Peptide_Type'].isin(['SP','LIPO','TAT','TATLIPO','PILIN'])
    ].reset_index(drop=True)

    # save full & filtered summaries
    os.makedirs(summary_dir, exist_ok=True)
    summary_df.to_csv(os.path.join(summary_dir, f"{base_name}_{mode}_summary.csv"), index=False)
    summary_df_signal.to_csv(os.path.join(summary_dir, f"{base_name}_{mode}_SP_hits_summary.csv"), index=False)
    print(f"Saved summaries to {summary_dir}")

    # build mature FASTA
    records = []
    for _, row in summary_df_signal.iterrows():
        seq = row['Mature_Sequence']
        # Skip rows without a usable sequence (None, NaN, or empty string)
        if not isinstance(seq, str) or not seq:
            continue
        records.append(SeqRecord(Seq(seq), id=row['Protein_ID'], description=""))

    mature_fastas_dir = os.path.join(path_to_output, f"{base_name}_{mode}_mature_fastas")
    os.makedirs(mature_fastas_dir, exist_ok=True)

    if output_mode == "one multifasta per batch":
        out = os.path.join(mature_fastas_dir, f"{base_name}_{mode}_full.fasta")
        SeqIO.write(records, out, "fasta")
        print(f"Wrote {len(records)} sequences to {out}")

    elif output_mode == "one protein per fasta":
        total = len(records)
        for i, rec in enumerate(records, 1):
            # Replace path separators in the ID so it cannot point into a subdirectory
            fn = os.path.join(mature_fastas_dir, f"{rec.id.replace(os.sep, '_')}_cleaved.fasta")
            SeqIO.write([rec], fn, "fasta")
            if i % 1000 == 0 or i == total:
                print(f"{i}/{total} files written")
        print(f"All {total} FASTAs in {mature_fastas_dir}")
