
Output file path #10

Closed
sleepwell-zhd opened this issue Jul 6, 2024 · 5 comments

@sleepwell-zhd

Can't I choose the output file location? Every time I run it, it overwrites the previous result.

@yuanwenguang666
Collaborator

Thanks for your question.

Currently, the output file path cannot be specified. However, we plan to release PhaGCN2.3 within the next two months. That update will not only refresh the database to align with the latest ICTV tables, but will also add output file path options and improve the visualization of the network graphs.

@Asa12138

The inability to select an output directory is a major problem: all of the program's output lands in the working directory, the working directory can only be PhaGCN2.0 itself, and multiple tasks cannot run at the same time.

I recommend fixing this first and deferring the ICTV database update.

@Asa12138

@sleepwell-zhd, you can try this approach:

Installation

git clone https://github.com/KennthShang/PhaGCN2.0.git
cd PhaGCN2.0
rm -rf "supplementary file/" __pycache__/ pred/ final_prediction.csv
vi run_KnowledgeGraph.py # comment out line 169, since the database does not need to be rebuilt every run
conda env create -f environment.yaml -n phagcn2

# Prepare the database
cd database
tar -zxvf ALL_protein.tar.gz
diamond makedb --in ALL_protein.fasta -d database.dmnd
diamond blastp --sensitive -d database.dmnd -q ALL_protein.fasta -o database.self-diamond.tab
awk '$1!=$2 {print $1,$2,$11}' database.self-diamond.tab > database.self-diamond.tab.abc
cd ..
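The awk step above converts DIAMOND's default 12-column tabular output (column 11 is the e-value) into a 3-column query/subject/e-value edge list, dropping self-hits. A tiny demonstration on two fabricated hit lines:

```shell
# Demonstration of the self-hit filter above, on fabricated
# DIAMOND tabular lines (12 columns; column 11 is the e-value):
printf 'A\tA\t100\t50\t0\t0\t1\t50\t1\t50\t1e-30\t90\n' >  demo.tab
printf 'A\tB\t95\t50\t1\t0\t1\t50\t1\t50\t1e-20\t85\n'  >> demo.tab
awk '$1!=$2 {print $1,$2,$11}' demo.tab
# prints: A B 1e-20   (the A-A self-hit is dropped)
```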

Usage

$ python run_Speed_up.py -h
usage: run_Speed_up.py [-h] [--contigs CONTIGS] [--len LEN]

manual to this script

optional arguments:
  -h, --help         show this help message and exit
  --contigs CONTIGS
  --len LEN

The program takes two parameters:

  • --contigs is the path to the contigs file.
  • --len is the length of the contigs you want to predict. As shown in the paper, recall and precision increase with contig length, so choose a length that suits your needs. The default is 8000 bp; the shortest supported length is 1700 bp. The output file is final_prediction.csv, which has three columns: "contig_name, median_file_name, prediction".

Example:

conda activate phagcn2
export MKL_SERVICE_FORCE_INTEL=1 # this must be set
python run_Speed_up.py --contigs contigs.fa --len 8000

Note that the program has no output path option: results are written to the current directory, and every rerun overwrites the previous output 😂. Because its environment paths are not absolute either, it can only be run inside the PhaGCN2.0 directory, so multiple tasks cannot run at the same time. The author has not fixed this yet: #10

So it is best to either switch to the desired output directory before running (copying all the run files into it), or move the results to the output directory after each run.

Moving the results after each run still falls short, since it prevents running multiple tasks in parallel, so the former approach is preferable.

Let's look at the concrete steps of run_Speed_up.py:

  1. diamond/blastp database preparation. This only needs to run once; it is unclear why it sits inside run_Speed_up.py and reruns every time, costing tens of minutes at least.
  2. Split the input contigs into subfiles of 1000 sequences each, filter out sequences shorter than 8000 bp, and place them under Split_files/.
  3. Loop over each subfile:
    1. mv the subfile into input/ and run run_CNN.py, which uses the CNN_Classifier/ directory and produces Cyber_data/contig.F.
    2. Run run_KnowledgeGraph.py, which generates the intermediate files single_contig/, all_proteins/, and network/, and writes output to out/ and Cyber_data/.
    3. Run run_GCN.py to obtain prediction.csv.
    4. Collect the subfile's output under pred/ and delete the intermediate files.
  4. Merge all subfile outputs and run run_network.py.
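Step 2 can be illustrated in isolation. Here is a minimal sketch of the length filter (assumed behavior based on the description above, not the actual run_Speed_up.py code; single-word FASTA headers assumed):

```shell
# Keep only FASTA records whose sequence is at least `min` bp long.
cat > demo.fa <<'EOF'
>short_contig
ACGTACGT
>long_contig
ACGTACGTACGTACGTACGT
EOF
awk -v min=10 'BEGIN{RS=">"; ORS=""} NR>1 {
    seq = ""
    for (i = 2; i <= NF; i++) seq = seq $i   # concatenate sequence lines
    if (length(seq) >= min) print ">" $0
}' demo.fa > filtered.fa
# filtered.fa now contains only long_contig
```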

So it is better to write your own pipeline instead of using run_Speed_up.py; that makes it easy to specify the output location and run multiple tasks at once:

#!/bin/bash

# Print usage information
usage() {
    echo "Usage: $0 -p <phaGCN_dir> -i <input_file> -o <output_dir>"
    exit 1
}

# Parse command-line arguments
while getopts ":p:i:o:" opt; do
    case "${opt}" in
        p)
            phaGCN_dir=${OPTARG}
            ;;
        i)
            input=${OPTARG}
            ;;
        o)
            output=${OPTARG}
            ;;
        *)
            usage
            ;;
    esac
done

# Check that all parameters were provided
if [ -z "${phaGCN_dir}" ] || [ -z "${input}" ] || [ -z "${output}" ]; then
    usage
fi

# Convert paths to absolute paths
phaGCN_dir=$(cd "$(dirname "$phaGCN_dir")" && pwd)/$(basename "$phaGCN_dir")
input=$(cd "$(dirname "$input")" && pwd)/$(basename "$input")
output=$(cd "$(dirname "$output")" && pwd)/$(basename "$output")

# Check whether the output directory already exists and is non-empty
if [ -d "$output" ] && [ "$(ls -A "$output")" ]; then
    echo "Error: Output directory $output already exists and is not empty."
    exit 1
fi

# Create the output directory and cd into it
mkdir -p "$output"
cd "$output" || exit

# Copy the Python scripts and the C-related files
cp "${phaGCN_dir}"/*.py ./
cp -r "${phaGCN_dir}/C"* ./

# Symlink the database directory
ln -s "${phaGCN_dir}/database/" ./

# Create the input directory and copy the input file
mkdir input/
cp "$input" input/

# Run the pipeline scripts
echo "Running CNN..."
python run_CNN.py

echo "Running KnowledgeGraph..."
mkdir network
python run_KnowledgeGraph.py

echo "Running GCN..."
python run_GCN.py

echo "All tasks completed."

# Remove the copied scripts and directories
rm -rf *.py C* database
# Copy the above into a file named run_phagcn; replace ~/biosoft/PhaGCN2.0 with your own directory
vi run_phagcn
chmod +x run_phagcn
# Symlink it into a directory on PATH
ln -s ~/biosoft/PhaGCN2.0/run_phagcn ~/miniconda3/envs/phagcn2/bin/

Now the program can be run from any directory, with the output location of your choosing.
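The wrapper can then be invoked from anywhere, e.g. `run_phagcn -p ~/biosoft/PhaGCN2.0 -i contigs.fa -o results/run1` (paths are placeholders). Its required-flag handling can also be exercised on its own; a minimal sketch:

```shell
# Standalone version of the wrapper's getopts pattern; all three
# flags (-p, -i, -o) are required. Values here are placeholders.
parse_args() {
    phaGCN_dir="" input="" output=""
    OPTIND=1   # reset so the function can be called repeatedly
    while getopts ":p:i:o:" opt; do
        case "${opt}" in
            p) phaGCN_dir=${OPTARG} ;;
            i) input=${OPTARG} ;;
            o) output=${OPTARG} ;;
            *) return 1 ;;
        esac
    done
    # succeed only if all three flags were supplied
    [ -n "$phaGCN_dir" ] && [ -n "$input" ] && [ -n "$output" ]
}
parse_args -p /opt/PhaGCN2.0 -i contigs.fa -o results/ && echo "args ok"
# prints: args ok
```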

@yuanwenguang666
Collaborator

I'm sorry for the late reply, and I sincerely appreciate your suggestions and feedback. Support for specifying the output directory and the new database update are both in progress, and we expect to release the updated version of PhaGCN2 within one to two weeks.

In the meantime, if you need to run multiple parallel processes, you can do so by duplicating the folder.

Thank you for your patience.

@yuanwenguang666
Collaborator

This feature has been added in the latest version, 2.3. If you have additional questions, please feel free to contact us.
