<a href="https://colab.research.google.com/github/RyloByte/TS-Capstone-2025/blob/%2335_3/notebooks/colab_hyperpackage_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperpackage Creation and Assign v1.0

This Colab notebook provides a streamlined, user-friendly interface for running the **Hyperpackage Creation** workflow.  
It is built on top of a Snakemake-based pipeline that extends the capabilities of [TreeSAPP](https://github.com/hallamlab/TreeSAPP) by enabling the automated construction of **composite reference packages** — also known as hyperpackages.

Unlike traditional methods that rely on manually curated protein sets, this workflow clusters sequences by **functional homology**, using identifiers such as **Rhea IDs**, **Enzyme Commission (EC) numbers**, or other biochemical groupings. It integrates protein structure and sequence similarity to construct phylogenetic reference packages that are both robust and scalable.

Use the guided steps in this notebook to:
- Configure and run the Snakemake pipeline
- Build your own hyperpackage based on a selected identifier
- Assign query sequences using TreeSAPP
- Download your resulting assigned hyperpackage

For more details, refer to the original project repository on GitHub: \\
https://github.com/RyloByte/TS-Capstone-2025

In [None]:
#@title Install Dependencies
#@markdown 📦 *Time to build your bioinformatics toolbox!*  \\
#@markdown This step installs everything you need to run the workflow — including **Snakemake**, **Miniconda**, and your **project files** from GitHub 🧬💻  \\
#@markdown Give it a minute or two — we’re setting up your scientific command center! 🚀🧪
%cd /content
!git clone https://github.com/RyloByte/TS-Capstone-2025.git
%cd TS-Capstone-2025
!cp config.yaml.example config.yaml

!apt-get update && apt-get install -y graphviz

!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -bfp /usr/local
!conda install -q -y --prefix /usr/local python=3.10

import sys
import os
sys.path.append("/usr/local/lib/python3.10/site-packages")
os.environ["PATH"] = "/usr/local/bin:" + os.environ["PATH"]

!conda env create -f environment.yaml

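# Check that the conda shell hook loads. Each `!` line runs in its own subshell,
# so this does not keep conda active; later cells source conda.sh themselves.
!eval "$(/usr/local/bin/conda shell.bash hook)"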
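
A quick way to confirm the installation worked is to check that the expected executables are on `PATH`. The sketch below uses only the standard library. The tool list (`conda`, `git`, `dot`) is an assumption based on what the install cell sets up, and `missing_tools` is a hypothetical helper, not part of the pipeline.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# Assumed tool list: conda (Miniconda), git (repository clone), dot (graphviz).
missing = missing_tools(["conda", "git", "dot"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All expected tools are available.")
```

If anything is reported missing, re-running the install cell is usually the simplest fix.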

# Hyperpackage Creation

In [2]:
#@title Configuration
#@markdown 🧪 Choose the type of identifier you're using for your reaction of interest!  \\
#@markdown You can select either a **Rhea ID** (from the [Rhea: Annotated Reactions Database](https://www.rhea-db.org))
#@markdown or an **EC Number** (from the [Enzyme Commission classification system](https://enzyme.expasy.org/)).
#@markdown
#@markdown 🔹 Use **Rhea ID** for specific biochemical reactions (e.g., `10596`)
#@markdown 🔹 Use **EC Number** for general enzyme classifications (e.g., `1.1.1.1`)
#@markdown
#@markdown 💡 Once selected, your input will be used to name your dataset accordingly. Let's get naming! 🏷️
ID_type = 'Rhea-ID' #@param ["Rhea-ID", "EC-Number"]{allow-input: false}
ID = '10596' #@param {type:"string"}

if ID_type == 'Rhea-ID':
    sample = 'rhea_' + ID
elif ID_type == 'EC-Number':
    sample = 'ec_' + ID
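
The naming rule above can also be written as a small function that rejects malformed identifiers before the pipeline runs. This is a minimal sketch: `make_sample_name` is a hypothetical helper, and the two patterns are assumptions (a Rhea ID is all digits; an EC number has four numeric, dot-separated fields). Loosen them if the pipeline accepts partial EC classes.

```python
import re

# Assumed formats, not taken from the pipeline itself.
RHEA_PATTERN = re.compile(r"\d+")
EC_PATTERN = re.compile(r"\d+(\.\d+){3}")

def make_sample_name(id_type, identifier):
    """Build the dataset name used by the notebook, e.g. 'rhea_10596'."""
    identifier = identifier.strip()
    if id_type == "Rhea-ID":
        if not RHEA_PATTERN.fullmatch(identifier):
            raise ValueError(f"Not a valid Rhea ID: {identifier!r}")
        return "rhea_" + identifier
    if id_type == "EC-Number":
        if not EC_PATTERN.fullmatch(identifier):
            raise ValueError(f"Not a valid EC number: {identifier!r}")
        return "ec_" + identifier
    raise ValueError(f"Unknown identifier type: {id_type!r}")

print(make_sample_name("Rhea-ID", "10596"))
print(make_sample_name("EC-Number", "1.1.1.1"))
```

Catching a typo here is much cheaper than discovering it after a long Snakemake run.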

In [None]:
#@title Advanced Configuration - Fully Editable Settings { display-mode: "form" }

#@markdown ✨ Welcome to the config garden! 🌿
#@markdown Here you can fine-tune every step of your workflow — from chunking UniProt data to clustering sequences and assigning taxonomy.
#@markdown Play with the knobs and sliders 🎛️, grow your hyperpackage your way! 💚 \\

#@markdown No need to run this cell if you plan to use the default settings!

#@markdown ---
#@markdown ### 🧱 Cluster Database Settings
#@markdown These settings control how the UniProt data is chunked during preprocessing.
#@markdown - `chunk_size`: Number of sequences per chunk when breaking up the input data.
Chunk_Size = 10_000_000  #@param {type:"integer", min:1}

#@markdown ---
#@markdown ### 🧬 Structure Clustering Settings
#@markdown Controls thresholds for accepting structure-based clusters.
#@markdown - `min_cluster_size`: Minimum number of sequences in a structural cluster to keep.
#@markdown - `max_cluster_size`: Maximum size of a cluster (set to `None` to disable filtering).
Min_StructCluster_Size = 5  #@param {type:"integer", min:1}
Max_StructCluster_Size = None  #@param {type:"raw"}

#@markdown ---
#@markdown ### 🔗 Sequence Clustering Settings
#@markdown Defines behavior for clustering by sequence similarity using MMseqs2.
#@markdown - `mute_mmseqs`: Suppress MMseqs2 log output.
#@markdown - `min_cluster_size`: Minimum number of sequences in a sequence-based cluster.
#@markdown - `max_cluster_size`: Maximum size of a cluster (set to `None` to disable filtering).
Mute_MMSeqs_Output = True  #@param {type:"boolean"}
Min_SeqCluster_Size = 5  #@param {type:"integer", min:1}
Max_SeqCluster_Size = None  #@param {type:"raw"}

#@markdown ---
#@markdown ### ⚙️ MMseqs2 Parameters
#@markdown Fine-tune how MMseqs2 performs sequence comparisons.
#@markdown - `-c`: Coverage threshold; filters weak alignments.
#@markdown - `--min-seq-id`: Minimum sequence identity required.
#@markdown - `--cov-mode`: Defines how coverage is calculated.
#@markdown - `-k`: K-mer size for initial sequence matching.
#@markdown - `--shuffle`: Whether to shuffle input sequences.
#@markdown - `--remove-tmp-files`: Whether to delete temp files.
#@markdown - `--alignment-mode`: Type of alignment (e.g. local/global).
#@markdown - `--realign`: Whether to perform realignment.
Min_Cov_Threshold = 0.8  #@param {type:"number", min:0.0, max:1.0}
Min_Seq_ID = 0.9  #@param {type:"number", min:0.0, max:1.0}
Cov_Mode = 5  #@param {type:"integer", min:0, max:5}
Kmer_size = 15  #@param {type:"integer", min:1}
Shuffle = 0  #@param [0, 1]
Remove_Temp_Files = 0  #@param [0, 1]
Alignment_Mode = 3  #@param {type:"integer", min:0, max:3}
Realign = 1  #@param [0, 1]

#@markdown ---
#@markdown ### 🌲 TreeSAPP Create Settings
#@markdown Configures how TreeSAPP builds the reference package.
#@markdown - `mute_treesapp`: Suppress TreeSAPP creation output.
#@markdown - `extra_args`: Extra command-line flags passed to TreeSAPP.
Mute_TreeSAPP_Output = True  #@param {type:"boolean"}
Extra_Arguments = "--headless"  #@param {type:"string"}

#@markdown ---
#@markdown ### 🌳 TreeSAPP Assign Settings
#@markdown Controls TreeSAPP behavior during sample assignment.
#@markdown - `mute_treesapp`: Suppress TreeSAPP assignment output.
#@markdown - `extra_args`: Extra options like `-m prot`, `--trim_align`, or `-n <threads>`.
Mute_TreeSAPP_Assign_Output = True  #@param {type:"boolean"}
Assign_Extra_Arguments = "-m prot --trim_align -n 2"  #@param {type:"string"}

# ------------------- YAML Generation -------------------
!pip install -q pyyaml
import yaml

config = {
    "cluster_db": {
        "filter_by_sprot": True,
        "chunk_size": Chunk_Size,
    },
    "structure_clustering": {
        "min_cluster_size": Min_StructCluster_Size,
        "max_cluster_size": Max_StructCluster_Size,
    },
    "sequence_clustering": {
        "mute_mmseqs": Mute_MMSeqs_Output,
        "min_cluster_size": Min_SeqCluster_Size,
        "max_cluster_size": Max_SeqCluster_Size,
        "mmseqs_args": [
            f"-c {Min_Cov_Threshold}",
            f"--min-seq-id {Min_Seq_ID}",
            f"--cov-mode {Cov_Mode}",
            f"-k {Kmer_size}",
            f"--shuffle {Shuffle}",
            f"--remove-tmp-files {Remove_Temp_Files}",
            f"--alignment-mode {Alignment_Mode}",
            f"--realign {Realign}",
        ]
    },
    "treesapp_create": {
        "mute_treesapp": Mute_TreeSAPP_Output,
        "extra_args": Extra_Arguments.split() if Extra_Arguments else [],
    },
    "treesapp_assign": {
        "mute_treesapp": Mute_TreeSAPP_Assign_Output,
        "extra_args": Assign_Extra_Arguments.split() if Assign_Extra_Arguments else []
    }
}

# Write YAML to file
with open("config.yaml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)

print("✅ Config file saved as `config.yaml`")
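
Because the maximum cluster sizes are entered as raw values, it can help to sanity-check them before writing the config. The sketch below is one way to do that; `check_cluster_bounds` is a hypothetical helper, and the rule it enforces (the maximum is either `None` or an integer no smaller than the minimum) is an assumption about what the pipeline expects.

```python
def check_cluster_bounds(min_size, max_size):
    """Raise ValueError if the min/max cluster sizes are inconsistent.

    `max_size` may be None, which the form uses to disable the upper limit.
    """
    if not isinstance(min_size, int) or min_size < 1:
        raise ValueError(
            f"min_cluster_size must be a positive integer, got {min_size!r}"
        )
    if max_size is None:
        return
    if not isinstance(max_size, int) or max_size < min_size:
        raise ValueError(
            f"max_cluster_size must be None or an integer >= {min_size}, "
            f"got {max_size!r}"
        )

check_cluster_bounds(5, None)  # form defaults: no upper limit
check_cluster_bounds(5, 500)   # explicit upper limit
print("Cluster size settings look consistent.")
```

In the notebook this would be called once for the structure settings and once for the sequence settings.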

In [None]:
#@title 🌱 Run TreeSAPP Create
#@markdown ⏳ *Time to let the trees grow...*  \\
#@markdown This step will build your beautiful hyperpackages 🌳 — As long as this cell is running, trust the process! It may take **20 minutes or more** and some clusters take longer than others, so feel free to grab a tea 🍵 or read over some of the references at the bottom while you wait!
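# `{sample}` is filled in by IPython from the Configuration cell above.
# The commands are chained on one line because each `!` line runs in its own subshell.
!source /usr/local/etc/profile.d/conda.sh && conda activate snakemake_env && snakemake --cores 2 --use-conda results/hyperpackages/{sample}.refpkg.tar.gz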
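
Once the run finishes, you can confirm that the expected archive exists. This is a small sketch: `hyperpackage_path` is a hypothetical helper that mirrors the Snakemake target pattern used above, and `rhea_10596` is just the sample name produced by the default configuration.

```python
from pathlib import Path

def hyperpackage_path(sample, root="results/hyperpackages"):
    """Return the archive path that the Snakemake target names for `sample`."""
    return Path(root) / f"{sample}.refpkg.tar.gz"

target = hyperpackage_path("rhea_10596")
if target.is_file():
    print(f"Found {target} ({target.stat().st_size} bytes)")
else:
    print(f"{target} not found; check the Snakemake log above.")
```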

In [14]:
#@title 📦 Save Your Hyperpackage
#@markdown 🎁 *Wrapping up your hard work...*  \\
#@markdown 🏡 We’re zipping up your shiny new hyperpackage so you can take it home. \\
#@markdown Run this cell to download when it’s ready and keep it safe!

from google.colab import files
import os
import shutil

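# Folder written by the "Run TreeSAPP Create" step
folder_path = "results/hyperpackages"
if os.path.isdir(folder_path):
  # Zip the whole folder and trigger a browser download
  shutil.make_archive("hyperpackages", "zip", folder_path)
  filepath = "hyperpackages.zip"
  files.download(filepath)
else:
  print(f"⚠️ `{folder_path}` not found. Run the TreeSAPP Create step first.")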
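
To see what ended up in the archive before downloading it, the zip can be inspected with the standard library. A minimal sketch, assuming the archive name `hyperpackages.zip` from the cell above; `list_archive` is a hypothetical helper.

```python
import os
import zipfile

def list_archive(zip_path):
    """Return (name, size in bytes) for every entry in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]

if os.path.isfile("hyperpackages.zip"):
    for name, size in list_archive("hyperpackages.zip"):
        print(f"{size:>12}  {name}")
else:
    print("hyperpackages.zip has not been created yet.")
```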


In [16]:
#@title 🌿 TreeSAPP Assign Steps
#@markdown 🧬 *Ready to assign your sequences to their place in the tree of life?* 🌍   \\
#@markdown Follow these simple steps to get started with **TreeSAPP assign**:  \\
#@markdown 1️⃣ Place your input `.fasta` file into the `TS-Capstone-2025/data/assign_fastas` folder.  \\
#@markdown 2️⃣ Enter the name of your file **without the `.fasta` extension** below 👇


fasta = "geneX"  #@param {type:"string"}

#@markdown ⚙️ Any extra parameters should be added in the **Advanced Configuration** section above under the **TreeSAPP Assign Settings** 🌟
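
Before running the assignment, it is worth checking that the file is where the workflow expects it. The sketch below counts FASTA header lines with the standard library. `count_fasta_records` is a hypothetical helper, and `geneX` is the form default; the folder follows the instructions in step 1.

```python
from pathlib import Path

def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Path is relative to the TS-Capstone-2025 folder, the working directory after setup.
fasta_path = Path("data/assign_fastas") / "geneX.fasta"
if fasta_path.is_file():
    print(f"{fasta_path}: {count_fasta_records(fasta_path)} sequences")
else:
    print(f"{fasta_path} not found; upload it before running TreeSAPP assign.")
```

A count of zero usually means the file is empty or is not in FASTA format.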


In [None]:
#@title 🌿 Run TreeSAPP Assign
#@markdown 🧭 This step runs **TreeSAPP assign** to classify your sequences using the reference hyperpackage you built earlier 🧬🔍
#@markdown
#@markdown It uses the input from the **Advanced Configuration** section and your `.fasta` file name from the previous step — so make sure those are all set!
#@markdown Once you're all set up, run this cell! 🌱✨
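# `{fasta}` and `{sample}` are filled in by IPython from the cells above.
# The commands are chained on one line because each `!` line runs in its own subshell.
!source /usr/local/etc/profile.d/conda.sh && conda activate snakemake_env && snakemake --use-conda --cores 2 results/assigned_hyperpackages/{fasta}/{sample}.refpkg.tar.gz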

In [19]:
#@title 📦 Save Assigned Hyperpackage
#@markdown This step zips up the results from **TreeSAPP assign** so you can download them and keep them safe 🧳🧬 \\
#@markdown Your freshly labeled hyperpackage is now ready to explore, share, or take on new adventures 🌍💌

from google.colab import files
import os
import shutil

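# Folder written by the "Run TreeSAPP Assign" step
folder_path = "results/assigned_hyperpackages"
if os.path.isdir(folder_path):
  # Zip the whole folder and trigger a browser download
  shutil.make_archive("assigned_hyperpackages", "zip", folder_path)
  filepath = "assigned_hyperpackages.zip"
  files.download(filepath)
else:
  print(f"⚠️ `{folder_path}` not found. Run the TreeSAPP Assign step first.")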


In [6]:
#@title 📚 References:
#@markdown 🔗 Here are some helpful links to guide your journey through hyperpackage creation and taxonomy assignment 🌱🔬
#@markdown
#@markdown 🔹 **Hyperpackage Creation Repo**
#@markdown https://github.com/RyloByte/TS-Capstone-2025
#@markdown
#@markdown 🔹 **TreeSAPP Repository**
#@markdown https://github.com/hallamlab/TreeSAPP
#@markdown
#@markdown 🔹 **TreeSAPP Tutorial + Docs**
#@markdown https://educe-ubc.github.io/MICB425/a-tutorial-for-using-treesapp.html