# CoreFinder Step by Step

This Jupyter notebook walks through running CoreFinder on your own protein sequence file. Following the steps below runs both the detector and the annotator modules on your FASTA file to identify biosynthetic gene clusters (BGCs).

## Prerequisites

In [2]:
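# Install BGC-Prophet, which we use as the detector
!wget https://github.com/HUST-NingKang-Lab/BGC-Prophet/releases/download/package/bgc_prophet-0.1.0-py3-none-any.whl
!pip install bgc_prophet-0.1.0-py3-none-any.whl

# Download and unpack the BGC-Prophet model files
!wget https://github.com/HUST-NingKang-Lab/BGC-Prophet/files/12733164/model.tar.gz
!tar -xf model.tar.gz

# Install the remaining Python dependencies
!pip install biopython
!pip install transformers==4.41.2

# Fetch the CoreFinder code, move it into the working directory, and remove
# the now-empty clone so the directory name can be reused below
!git clone https://github.com/HUST-NingKang-Lab/CoreFinder.git
!mv CoreFinder/* ./
!rm -rf CoreFinder

# Fetch the CoreFinder model from Hugging Face into ./CoreFinder
!git clone https://huggingface.co/KangHuggingface/CoreFinder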



## Step 1: Detection
In this step, we run BGC-Prophet to determine which genes are BGC genes.
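
Before launching the detector, it can be worth confirming that the input directory holds the FASTA files you expect. The following is a minimal sketch using only the Python standard library. The helper name `count_fasta_records` and the list of accepted file extensions are assumptions made for illustration; neither is part of BGC-Prophet or CoreFinder.

```python
from pathlib import Path

# Hypothetical helper, not part of BGC-Prophet: counts header lines (">")
# in every FASTA-like file located directly inside a directory.
FASTA_SUFFIXES = {".fasta", ".faa", ".fa"}  # assumed extensions


def count_fasta_records(directory):
    """Return {file name: number of sequence records} for a directory."""
    counts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            with path.open() as handle:
                # Each record in a FASTA file starts with a ">" header line
                counts[path.name] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

Calling `count_fasta_records("example")` on the example data should give a count comparable to the number that the detector reports in its `Read ... with N sequences` log line.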

In [3]:
# @title Run BGC-Prophet on your own FASTA files
path_to_genomes_directory = "example" # @param {"type":"string","placeholder":"example"}

threshold = 0.5 # @param {"type":"number","placeholder":"0.5"}
max_gap = 3 # @param {"type":"integer","placeholder":"3"}
min_count = 2 # @param {"type":"integer","placeholder":"2"}


from bgc_prophet.command import piplineCommand
import argparse

# Build an argument parser and simulate the command-line call to the BGC-Prophet pipeline
parser = argparse.ArgumentParser()
piplineCommand().add_arguments(parser)

args = parser.parse_args([
    '--genomesDir', f'{path_to_genomes_directory}',
    '--modelPath', 'model/annotator.pt',
    '--classifierPath', 'model/classifier.pt',
    '--threshold', f'{threshold}',
    '--max_gap', f'{max_gap}',
    '--min_count', f'{min_count}',
])

piplineCommand().handle(args)




Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt" to /root/.cache/torch/hub/checkpoints/esm2_t6_8M_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t6_8M_UR50D-contact-regression.pt" to /root/.cache/torch/hub/checkpoints/esm2_t6_8M_UR50D-contact-regression.pt


Transferred model to GPU
Read example/CoreFinder_example.fasta with 4287 sequences




Processing 1 of 28 batches (590 sequences)
Processing 2 of 28 batches (470 sequences)
Processing 3 of 28 batches (369 sequences)
Processing 4 of 28 batches (296 sequences)
Processing 5 of 28 batches (246 sequences)
Processing 6 of 28 batches (215 sequences)
Processing 7 of 28 batches (189 sequences)
Processing 8 of 28 batches (172 sequences)
Processing 9 of 28 batches (159 sequences)
Processing 10 of 28 batches (147 sequences)
Processing 11 of 28 batches (138 sequences)
Processing 12 of 28 batches (131 sequences)
Processing 13 of 28 batches (125 sequences)
Processing 14 of 28 batches (117 sequences)
Processing 15 of 28 batches (109 sequences)
Processing 16 of 28 batches (102 sequences)
Processing 17 of 28 batches (95 sequences)
Processing 18 of 28 batches (91 sequences)
Processing 19 of 28 batches (86 sequences)
Processing 20 of 28 batches (81 sequences)
Processing 21 of 28 batches (73 sequences)
Processing 22 of 28 batches (66 sequences)
Processing 23 of 28 batches (61 sequences)
...

Organizing: 100%|██████████| 1/1 [00:00<00:00, 31.96it/s]

Saving to csv...



Splitting: 100%|██████████| 1/1 [00:00<00:00, 20.30it/s]


Saving to csv...
output
['Alkaloid', 'Terpene', 'NRP', 'Polyketide', 'RiPP', 'Saccharide', 'Other']


Predict: 100%|██████████| 1/1 [00:01<00:00,  1.00s/it]

['Alkaloid', 'Terpene', 'NRP', 'Polyketide', 'RiPP', 'Saccharide', 'Other']



Classify: 100%|██████████| 1/1 [00:03<00:00,  3.19s/it]
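
This notebook does not explain the `threshold`, `max_gap`, and `min_count` parameters passed to the pipeline. One plausible reading, based only on the parameter names, is that genes scoring at or above `threshold` count as BGC genes, that hits separated by at most `max_gap` other genes are merged into one cluster, and that clusters with fewer than `min_count` hits are dropped. The sketch below illustrates that reading on made-up scores. It is an assumption offered for intuition, not BGC-Prophet's actual implementation, and `group_bgc_genes` is a hypothetical helper.

```python
def group_bgc_genes(scores, threshold=0.5, max_gap=3, min_count=2):
    """Group gene indices into candidate clusters (illustrative only).

    Assumed semantics, inferred from the parameter names:
      - a gene is a hit when its score is >= threshold;
      - consecutive hits stay in one cluster if at most `max_gap`
        non-hit genes lie between them;
      - clusters with fewer than `min_count` hits are discarded.
    """
    hits = [i for i, score in enumerate(scores) if score >= threshold]
    clusters, current = [], []
    for index in hits:
        # index - current[-1] - 1 is the number of non-hit genes in between
        if current and index - current[-1] - 1 > max_gap:
            clusters.append(current)
            current = []
        current.append(index)
    if current:
        clusters.append(current)
    return [cluster for cluster in clusters if len(cluster) >= min_count]


# Made-up per-gene scores, not real model output
scores = [0.9, 0.1, 0.8, 0.2, 0.1, 0.3, 0.2, 0.7, 0.1, 0.6, 0.95]
print(group_bgc_genes(scores))  # [[0, 2], [7, 9, 10]]
```

Under this reading, raising `max_gap` merges nearby clusters, while raising `threshold` or `min_count` makes the detector more conservative. Consult the BGC-Prophet documentation for the authoritative definitions.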


## Step 2: Annotation
In this step, we annotate the functional group of each BGC gene and predict the product of the entire BGC.
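
The annotation cell in this step points at one hard-coded Colab path, `/content/tmp.faa/CoreFinder_example_cluster1.fasta`. Once `prophet2fasta` has written its files, the generated cluster FASTA files can be listed instead of typed by hand. This is a minimal sketch, assuming the files land in a `tmp.faa` directory and follow the `*_cluster*.fasta` naming that the example path suggests. `list_cluster_fastas` is a hypothetical helper, not part of CoreFinder.

```python
from pathlib import Path


def list_cluster_fastas(directory="tmp.faa", pattern="*_cluster*.fasta"):
    """Return the sorted cluster FASTA paths found in `directory`.

    Both defaults are assumptions inferred from the example path used in
    this notebook; adjust them if your files are laid out differently.
    """
    root = Path(directory)
    if not root.is_dir():
        return []  # nothing generated yet, or a different output location
    return sorted(str(path) for path in root.glob(pattern))
```

Each returned path could then be supplied as the `input` value for `annotate`, one cluster at a time.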

In [7]:
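from prophet2fasta import prophet2fasta
import argparse

# Convert the BGC-Prophet results into one FASTA file per predicted cluster
cfg = {
    'prophet': 'output',          # BGC-Prophet results directory from Step 1
    'original_fasta': 'example',  # directory holding the input FASTA files
    'generated_fasta': 'tmp'
}
# Parsing an empty argument list yields an empty namespace to pass along
parser = argparse.ArgumentParser()
args = parser.parse_args([])
prophet2fasta(cfg, args)

from annotate import annotate

# Annotate one generated cluster FASTA file (the path below is Colab-specific)
cluster_fasta = '/content/tmp.faa/CoreFinder_example_cluster1.fasta'
args.input = cluster_fasta
cfg = {
    'input': cluster_fasta,
    'output': 'corefinder_output',
    'model': 'CoreFinder'  # model directory cloned from Hugging Face
}
annotate(cfg, args)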

Extracting BGCs...


100%|██████████| 1/1 [00:00<00:00, 4644.85it/s]


Writing fasta files...


100%|██████████| 1/1 [00:00<00:00, 46.75it/s]
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 116.12 MiB is free. Process 6748 has 14.62 GiB memory in use. Of the allocated memory 14.24 GiB is allocated by PyTorch, and 265.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
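
The run above stops with a CUDA out-of-memory error: almost all of the 14.74 GiB GPU is already in use when the annotator tries to allocate another 320 MiB. One possible mitigation is to release cached GPU memory before annotation and to set the allocator option that the error message itself suggests. The sketch below is a hedged example rather than a guaranteed fix. `release_gpu_memory` is a hypothetical helper, and the environment variable generally needs to be set before PyTorch first uses CUDA, so a fresh runtime is the safest place to run it.

```python
import gc
import os

# Allocator option suggested by the error message. It is only picked up if
# set before PyTorch initializes CUDA, so run this early in a fresh runtime.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def release_gpu_memory():
    """Drop unreferenced Python objects and free PyTorch's cached GPU memory.

    Returns True if a CUDA cache was emptied, False otherwise. Memory held
    by tensors or models that are still referenced is not released.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True
```

If the error persists after calling `release_gpu_memory()`, restarting the runtime and running only Step 2 against the existing `output` directory, or switching to a GPU with more memory, may be necessary.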