<a href="https://colab.research.google.com/github/BERDOGDU/gcollab/blob/main/run_pharokka_and_phold_and_phynteny.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Pharokka + Phold + Phynteny

[pharokka](https://github.com/gbouras13/pharokka) is a rapid standardised annotation tool for bacteriophage genomes and metagenomes. You can read more about pharokka in the [documentation](https://pharokka.readthedocs.io/).

[phold](https://github.com/gbouras13/phold) is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology. You can read more about phold in the [documentation](https://phold.readthedocs.io/).

phold uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a database of 803k protein structures mostly predicted using [Colabfold](https://github.com/sokrypton/ColabFold).

[phyntney](https://github.com/susiegriggo/Phynteny) uses a long-short term memory model trained on phage synteny (the conserved gene order across phages) to assign hypothetical phage proteins to a PHROG category.

**NOTE: Phynteny will only work if your phage has fewer than 120 predicted genes**

**If this is the case for your phage(s), you should just skip running Phynteny (Cells 5+6)**

The tools are best run sequentially, as Pharokka conducts extra annotation steps like tRNA, tmRNA, CRISPR and INPHARED searches that Phold lacks (for now at least). Pharokka will also (rarely) annotate CDS that Phold can miss. Phynteny can then help annotate remaining hypothetical proteins with a PHROG category.

* **Before you start, please make sure you change the runtime to T4 GPU (or any other kind of GPU if you have $$$), otherwise Phold won't be installed properly**
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator

* To run the cells, press the play button on the left side
* Cells 1 and 2 install pharokka and phold and download the databases/models.
* Once they have been run, you can re-run Cell 3 (to run Pharokka), Cell 4 (to run Phold) and Cell 5+6 (to install and run Phynteny) as many times as you would like



In [1]:
#@title 1. Install pharokka and phold

#@markdown This cell installs pharokka and phold. It will take a few minutes. Please be patient

%%bash

set -e

PYTHON_VERSION="3.10"
PHAROKKA_VERSION="1.7.5"
PHOLD_VERSION="0.2.0"

echo "python version ${PYTHON_VERSION}"

if [ ! -f CONDA_READY ]; then
  echo "installing python"
  wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
  rm Miniconda3-latest-Linux-x86_64.sh
  conda config --set auto_update_conda false
  touch CONDA_READY
fi

if [ ! -f PHAROKKA_PHOLD_READY ]; then
  echo "installing pharokka and phold"
  conda install -y -c conda-forge -c bioconda pip pharokka==${PHAROKKA_VERSION} python=${PYTHON_VERSION} phold==${PHOLD_VERSION} pytorch=*=cuda*
  touch PHAROKKA_PHOLD_READY
fi





python version 3.10
installing python
installing pharokka and phold
Channels:
 - conda-forge
 - bioconda
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | done
Solving environment: - \ | / - \ | / - \ | / - \ | / done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - pharokka==1.7.5
    - phold==0.2.0
    - pip
    - python=3.10
    - pytorch[build=cuda*]


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _openmp_mutex-4.5          |       3_kmp_llvm           7 KB  conda-forge
    about-time-4.2.1           |     pyhd8ed1ab_1          16 KB  conda-forge
    aiohappyeyeballs-2.6.1     |     pyhd8ed1ab_0          19 KB  conda-forge
    aiohtt

In [23]:
#@title 2. Download pharokka phold databases

#@markdown This cell downloads the pharokka then the phold database. It will take some time (5-10 minutes probably). Please be patient.


%%time
import os
import subprocess
print("Downloading pharokka database. This will take a few minutes. Please be patient :)")
#os.system("install_databases.py -o pharokka_db")
print("Downloading phold database. This will take a few minutes. Please be patient :)")
#os.system("phold install -d phold_db")
command = ['phold', 'install', '-d', 'phold_db']

process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

for line in process.stdout:
    print(line, end='')

process.wait()




Downloading pharokka database. This will take a few minutes. Please be patient :)
Downloading phold database. This will take a few minutes. Please be patient :)
2025-05-26 19:39:08.803 | INFO     | phold:install:1119 - You have specified the phold_db directory to store the Phold database and ProstT5 model
2025-05-26 19:39:08.803 | INFO     | phold:install:1131 - Checking that the Rostlab/ProstT5_fp16 ProstT5 model is available in phold_db
2025-05-26 19:39:08.803 | INFO     | phold.features.predict_3Di:get_T5_model:121 - Using device: cpu
2025-05-26 19:39:08.803 | INFO     | phold.features.predict_3Di:get_T5_model:127 - Loading T5 from: phold_db/Rostlab/ProstT5_fp16
2025-05-26 19:39:08.803 | INFO     | phold.features.predict_3Di:get_T5_model:128 - If phold_db/Rostlab/ProstT5_fp16 is not found, it will be downloaded
2025-05-26 19:39:37.042 | INFO     | phold.databases.db:download_zenodo_prostT5:204 - Downloading ProstT5 model backup from https://zenodo.org/records/11234657/files/models--

1

In [17]:
#@title 3. Run Pharokka

#@markdown First, upload your phage(s) as a nucleotide input FASTA file

#@markdown Click on the folder icon to the left and use the file upload button.

#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.

#@markdown Then provide a directory for pharokka's output using PHAROKKA_OUT_DIR.
#@markdown The default is 'output_pharokka'.

#@markdown Then type in a gene prediction tool for pharokka.
#@markdown Please choose either 'phanotate', 'prodigal', or 'prodigal-gv'.

#@markdown You can also provide a prefix for your output files with PHAROKKA_PREFIX.
#@markdown If you provide nothing it will default to 'pharokka'.

#@markdown You can also provide a locus tag for your output files.
#@markdown If you provide nothing it will generate a random locus tag.

#@markdown You can click FAST to turn off --fast.
#@markdown By default it is True so that Pharokka runs faster in the Colab environment.

#@markdown You can click META to turn on --meta if you have multiple phages in your input.

#@markdown You can click META_HMM to turn on --meta_hmm.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier pharokka run has crashed for whatever reason.

#@markdown The results of Pharokka will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHAROKKA_OUT_DIR.zip, where PHAROKKA_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import sys
import subprocess
import zipfile
INPUT_FILE = 'EMS221.fasta' #@param {type:"string"}

if os.path.exists(INPUT_FILE):
    print(f"Input file {INPUT_FILE} exists")
else:
    print(f"Error: File {INPUT_FILE} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

PHAROKKA_OUT_DIR = 'output_pharokka'  #@param {type:"string"}
GENE_PREDICTOR = 'phanotate'  #@param {type:"string"}
allowed_gene_predictors = ['phanotate', 'prodigal', 'prodigal-gv']
# Check if the input parameter is valid
if GENE_PREDICTOR.lower() not in allowed_gene_predictors:
    raise ValueError("Invalid GENE_PREDICTOR. Please choose from: 'phanotate', 'prodigal', 'prodigal-gv'.")

PHAROKKA_PREFIX = 'pharokka'  #@param {type:"string"}
LOCUS_TAG = 'Default'  #@param {type:"string"}
FAST = True  #@param {type:"boolean"}
META = False  #@param {type:"boolean"}
META_HMM = False  #@param {type:"boolean"}
FORCE = True  #@param {type:"boolean"}


# Construct the command
command = f"pharokka.py -d pharokka_db -i {INPUT_FILE} -t 4 -o {PHAROKKA_OUT_DIR} -p {PHAROKKA_PREFIX} -l {LOCUS_TAG} -g {GENE_PREDICTOR}"

if FORCE is True:
  command = f"{command} -f"

if FAST is True:
  command = f"{command} --fast"

if META is True:
  command = f"{command} -m"

if META_HMM is True:
  command = f"{command} --meta_hmm"

# Execute the command
try:
    print("Running pharokka")
    subprocess.run(command, shell=True, check=True)
    print("pharokka completed successfully.")
    print(f"Your output is in {PHAROKKA_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHAROKKA_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHAROKKA_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHAROKKA_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Input file EMS221.fasta exists
Running pharokka
pharokka completed successfully.
Your output is in output_pharokka.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to output_pharokka.zip
CPU times: user 371 ms, sys: 39 ms, total: 410 ms
Wall time: 1min 41s


In [18]:
#@title 4. Run phold

%%time
import os
import subprocess
import zipfile

# phold input is pharokka output
PHOLD_INPUT = f"{PHAROKKA_OUT_DIR}/{PHAROKKA_PREFIX}.gbk"
PHOLD_OUT_DIR = 'output_phold'  #@param {type:"string"}
PHOLD_PREFIX = 'phold'  #@param {type:"string"}
FORCE = True  #@param {type:"boolean"}
SEPARATE = False  #@param {type:"boolean"}

# Construct the command
command = f"phold run -i {PHOLD_INPUT} -t 4 -o {PHOLD_OUT_DIR} -p {PHOLD_PREFIX} -d phold_db"

if FORCE:
  command += " -f"
if SEPARATE:
  command += " --separate"

try:
    print("Running phold")
    # Matplotlib backend uyumsuzluğunu önlemek için:
    os.environ["MPLBACKEND"] = "Agg"

    result = subprocess.run(
        command, shell=True, check=True, capture_output=True, text=True
    )

    print(result.stdout)
    print(result.stderr)

    print("phold completed successfully.")
    print(f"Your output is in {PHOLD_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHOLD_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHOLD_OUT_DIR):
            for file in files:
                filepath = os.path.join(root, file)
                arcname = os.path.relpath(filepath, PHOLD_OUT_DIR)
                zipf.write(filepath, arcname)
    print(f"Output directory has been zipped to {zip_filename}")

except subprocess.CalledProcessError as e:
    print("phold failed with error:")
    print(e.stdout)
    print(e.stderr)
    print(f"Return code: {e.returncode}")


Running phold
phold failed with error:


.______    __    __    ______    __       _______  
|   _  \  |  |  |  |  /  __  \  |  |     |       \ 
|  |_)  | |  |__|  | |  |  |  | |  |     |  .--.  |
|   ___/  |   __   | |  |  |  | |  |     |  |  |  |
|  |      |  |  |  | |  `--'  | |  `----.|  '--'  |
| _|      |__|  |__|  \______/  |_______||_______/ 
                                                   



2025-05-26 19:20:47.074 | INFO     | phold.utils.validation:instantiate_dirs:70 - Checking the output directory output_phold
2025-05-26 19:20:47.074 | INFO     | phold.utils.validation:instantiate_dirs:73 - Removing output_phold because --force was specified
2025-05-26 19:20:47.081 | INFO     | phold.utils.util:begin_phold:72 - phold: annotating phage genomes with protein structures
2025-05-26 19:20:47.081 | INFO     | phold.utils.util:begin_phold:74 - You are using phold version 0.2.0
2025-05-26 19:20:47.081 | INFO     | phold.utils.util:begin_phold:75 - Repository homepage is https:

In [25]:
# import subprocess

# command = ['find', '/', '-type', 'f', '-name', 'db.py']

# # Run the command, streaming stdout live, ignoring stderr (redirected inside the shell command)
# process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)

# for line in process.stdout:
#     print(line, end='')

# process.wait()

import torch
print(torch.__file__)
print(torch.__version__)



/usr/local/lib/python3.11/dist-packages/torch/__init__.py
2.6.0+cu124


In [None]:
#@title 5. Install phynteny

#@markdown This cell installs phynteny and downloads the models. It will take a few minutes. Please be patient
%%bash
PHYNTENY_VERSION="0.1.13"
NUMPY_VERSION="1.26.4"

if [ ! -f PHYNTENY_READY ]; then
  echo "installing phynteny"
  pip install phynteny==${PHYNTENY_VERSION} numpy==${NUMPY_VERSION}
  echo "Downloading phynteny models"
  install_models -o phynteny_models
  touch PHYNTENY_READY
fi


In [None]:
#@title 6. Run Phynteny

#@markdown This cell will run phynteny on the output of cell 4's Phold run to predict the function of remaining hypothetical proteins

#@markdown You do not need to provide any further input files

#@markdown You can now provide a directory for phynteny's output with PHYNTENY_OUT_DIR.
#@markdown The default is 'output_phynteny'.

#@markdown You can click FORCE to overwrite the output directory with .
#@markdown This may be useful if your phynteny run has crashed for whatever reason.

#@markdown The results of Phynteny will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHYNTENY_OUT_DIR.zip, where PHYNTENY_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import subprocess
import zipfile

# phynteny input is pharokka output
PHYNTENY_INPUT = f"{PHOLD_OUT_DIR}/{PHOLD_PREFIX}.gbk"
PHYNTENY_OUT_DIR = 'output_phynteny'  #@param {type:"string"}
FORCE = False  #@param {type:"boolean"}

# Construct the command
command = f"phynteny {PHYNTENY_INPUT} -m phynteny_models -o {PHYNTENY_OUT_DIR}"

if FORCE is True:
  command = f"{command} -f"


# Execute the command
try:
    print("Running phynteny")
    subprocess.run(command, shell=True, check=True)
    print("phynteny completed successfully.")
    print(f"Your output is in {PHYNTENY_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHYNTENY_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHYNTENY_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHYNTENY_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Running phynteny
phynteny completed successfully.
Your output is in output_phynteny.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to output_phynteny.zip
CPU times: user 160 ms, sys: 31.1 ms, total: 191 ms
Wall time: 42.1 s
