problem in detect variants using self-trained model #20

Closed
huangnengCSU opened this issue Jun 2, 2021 · 7 comments

@huangnengCSU

Hi,
I have trained a pileup model (pre-trained + fine-tuned), and the output model appears intact. I renamed the model files to pileup and placed them alongside the provided full_alignment model, then used both models to call variants. The run produces two kinds of errors.
The first kind does not occur at a fixed location or time when the command is re-executed, and sometimes it does not occur at all. The following two log outputs are examples of this kind of error.

Processed 640000 tensors
Processed 660000 tensors
Total process positions in chr18 with chunk 13/17 : 662097
Total time elapsed: 900.00 s
parallel: This job failed:

python3 /public/home/huangneng/code/Clair3/scripts/../clair3.py CallVarBam \
--chkpnt_fn /public/home/huangneng/tools/clair3-model/clair3_finetune/pileup \
--bam_fn /public/home/huangneng/ont-quickstart/input/120G/hg003_120G.bam \
--call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/output/tmp/pileup_output/pileup_chr18_14.vcf \
--sampleName EMPTY \
--ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna \
--extend_bed /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/output/tmp/split_beds/chr18 \
--bed_fn \
--vcf_fn EMPTY \
--ctgName chr18 \
--chunk_id 14 \
--chunk_num 17 \
--platform ont \
--fast_mode False \
--snp_min_af 0.0 \
--indel_min_af 0.0 \
--gvcf False \
--python python3 \
--pypy pypy3 \
--samtools samtools \
--temp_file_dir /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/output/tmp/gvcf_tmp_output \
--pileup

real    228m52.610s
user    9075m11.206s
sys     847m57.569s

[INFO] Merge chunked contigs vcf files
[INFO] 2/7 Filter Hete SNP varaints for Whatshap phasing and haplotag
[INFO] Select phasing quality cut off 17
[INFO] Total hete snp positions pass filtering: chr22: 0
[INFO] Total hete snp positions pass filtering: chr20: 19521
[INFO] Total hete snp positions pass filtering: chr19: 38371
[INFO] Total hete snp positions pass filtering: chr21: 2615
[INFO] Total hete snp positions pass filtering: chr11: 105028
[INFO] Total hete snp positions pass filtering: chrX: 0
[INFO] Total hete snp positions pass filtering: chr18: 63294
[INFO] Total hete snp positions pass filtering: chrY: 0
[INFO] Total hete snp positions pass filtering: chr16: 68964
[INFO] Total hete snp positions pass filtering: chr13: 73883
[INFO] Total hete snp positions pass filtering: chr12: 106369
[INFO] Total hete snp positions pass filtering: chr9: 109798
[INFO] Total hete snp positions pass filtering: chr8: 125622
[INFO] Total hete snp positions pass filtering: chr17: 69591
[INFO] Total hete snp positions pass filtering: chr6: 138924
[INFO] Total hete snp positions pass filtering: chr7: 128983
[INFO] Total hete snp positions pass filtering: chr10: 114092
[INFO] Total hete snp positions pass filtering: chr5: 139377
[INFO] Total hete snp positions pass filtering: chr14: 72464
[INFO] Total hete snp positions pass filtering: chr15: 70208
[INFO] Total hete snp positions pass filtering: chr4: 159925
[INFO] Total hete snp positions pass filtering: chr3: 147225
[INFO] Total hete snp positions pass filtering: chr2: 179619
[INFO] Total hete snp positions pass filtering: chr1: 169383

real    4m40.428s
user    105m48.514s
sys     2m16.621s
Total process positions in chr8 with chunk 18/30 : 659427
Total time elapsed: 606.44 s
[INFO] Delay 2 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr9 with chunk 9/28 : 0
Total time elapsed: 0.01 s
[INFO] No vcf output for file /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/pileup_output/pileup_chr9_10.vcf, remove empty file
parallel: This job failed:
python3 /public/home/huangneng/code/Clair3/scripts/../clair3.py CallVarBam \
--chkpnt_fn /public/home/huangneng/tools/clair3-model/clair3_finetune/pileup \
--bam_fn /public/home/huangneng/ont-quickstart/input/120G/hg003_120G.bam \
--call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/pileup_output/pileup_chr9_10.vcf \
--sampleName EMPTY \
--ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna \
-extend_bed /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/split_beds/chr9 \
--bed_fn \
--vcf_fn EMPTY \
--ctgName chr9 \
--chunk_id 10 \
--chunk_num 28 \
--platform ont \
--fast_mode False \
--snp_min_af 0.0 \
--indel_min_af 0.0 \
--gvcf False \
--python python3 \
--pypy pypy3 \
--samtools samtools \
--temp_file_dir /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/gvcf_tmp_output \
--pileup

real    149m55.489s
user    4785m4.119s
sys     444m19.105s
[INFO] Merge chunked contigs vcf files
[INFO] 2/7 Filter Hete SNP varaints for Whatshap phasing and haplotag
[INFO] Select phasing quality cut off 17
[INFO] Total hete snp positions pass filtering: chr18: 0
[INFO] Total hete snp positions pass filtering: chr15: 0
[INFO] Total hete snp positions pass filtering: chr12: 0
[INFO] Total hete snp positions pass filtering: chr21: 0
[INFO] Total hete snp positions pass filtering: chr13: 0
[INFO] Total hete snp positions pass filtering: chrY: 0
[INFO] Total hete snp positions pass filtering: chr19: 0
[INFO] Total hete snp positions pass filtering: chr11: 0
[INFO] Total hete snp positions pass filtering: chr22: 0
[INFO] Total hete snp positions pass filtering: chr17: 0
[INFO] Total hete snp positions pass filtering: chr16: 0
[INFO] Total hete snp positions pass filtering: chr9: 7006
[INFO] Total hete snp positions pass filtering: chrX: 0
[INFO] Total hete snp positions pass filtering: chr10: 0
[INFO] Total hete snp positions pass filtering: chr14: 0
[INFO] Total hete snp positions pass filtering: chr4: 160039
[INFO] Total hete snp positions pass filtering: chr8: 110917
[INFO] Total hete snp positions pass filtering: chr6: 139017
[INFO] Total hete snp positions pass filtering: chr5: 139442
[INFO] Total hete snp positions pass filtering: chr20: 0
[INFO] Total hete snp positions pass filtering: chr3: 147298
[INFO] Total hete snp positions pass filtering: chr1: 169478
[INFO] Total hete snp positions pass filtering: chr7: 129048
[INFO] Total hete snp positions pass filtering: chr2: 179713

real    2m47.910s
user    60m29.533s
sys     1m19.845s

The second kind of error appears in the step that calls variants using full alignment. Here is the output log.

[INFO] Delay 1 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr1 : 10000
Total time elapsed: 71.62 s
[INFO] Delay 2 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr1 : 10000
Total time elapsed: 72.46 s
parallel: This job failed:

python3 /public/home/huangneng/code/Clair3/scripts/../clair3.py CallVarBam \
--chkpnt_fn /public/home/huangneng/tools/clair3-model/RNN+CNN_finetune/full_alignment \
--bam_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/phase_output/phase_bam/chr1.bam \
--call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/full_alignment_chr1.28_316.vcf \
--sampleName EMPTY \
--vcf_fn EMPTY \
--ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna \
--full_aln_regions /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/candidate_bed/chr1.28_316 \
--ctgName chr1 \
--add_indel_length \
--phasing_info_in_bam \
--gvcf False \
--python python3 \
--pypy pypy3 \
--samtools samtools \
--platform ont

real    1m34.682s
user    58m25.445s
sys     4m16.988s
[INFO] 7/7 Merge pileup vcf and full alignment vcf
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chrY: 56779
[INFO] Full alignment variants proceeded in chrY: 0
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr8: 438770
[INFO] Full alignment variants proceeded in chr8: 0
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr13: 314341
[INFO] Full alignment variants proceeded in chr13: 0

I extracted the failed job and ran it on its own, and it completed without any error. The following is its output.

[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr1 : 10000
Total time elapsed: 33.62 s
[INFO] Delay 6 seconds before starting variant calling ...

@aquaskyline
Member

Hi,

How many threads did you set? And could you paste the output of ulimit -a from your running environment here?

@huangnengCSU
Author

@aquaskyline
I have tested 30, 48, 60, and 80 threads. Here is the ulimit -a output:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 6184280
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@aquaskyline
Member

If you set THREADS=8, does the run finish with no failed jobs in your environment?

@huangnengCSU
Author

huangnengCSU commented Jun 5, 2021

Hi,
I have tested with 8 threads, and the first type of error no longer appears. But the second type of error still occurs.

[INFO] Delay 3 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr10 : 10000
Total time elapsed: 50.98 s
parallel: This job failed:
python3 /public/home/huangneng/code/Clair3_transformer/scripts/../clair3.py CallVarBam     --chkpnt_fn /public/home/huangneng/tools/clair3-model/RNN+CNN_finetune/full_alignment     --bam_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/phase_output/phase_bam/chr10.bam     --call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/full_alignment_chr10.88_183.vcf     --sampleName EMPTY     --vcf_fn EMPTY     --ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna     --full_aln_regions /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/candidate_bed/chr10.88_183     --ctgName chr10     --add_indel_length     --phasing_info_in_bam     --gvcf False     --python python3     --pypy pypy3     --samtools samtools     --platform ont

real    66m30.379s
user    583m45.247s
sys     30m43.783s
[INFO] 7/7 Merge pileup vcf and full alignment vcf
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr9: 386571
[INFO] Full alignment variants proceeded in chr9: 0
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr8: 438770
[INFO] Full alignment variants proceeded in chr8: 0

There may also be a bug in how the thread count is compared with the maximum number of available CPU cores. I set threads=8 and maximum_available_cores=10, and the command changed the thread count to 10.
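
For reference, the behavior expected in the report above amounts to taking the smaller of the two values, so a request of 8 with 10 cores available should stay at 8. A minimal sketch, assuming a hypothetical helper name (`effective_threads` is not a Clair3 function, and this is not Clair3's actual code):

```python
def effective_threads(requested: int, available_cores: int) -> int:
    """Clamp a requested thread count to the available cores.

    Hypothetical helper for illustration only. The requested value is
    only ever lowered, never raised.
    """
    if requested < 1 or available_cores < 1:
        raise ValueError("thread and core counts must be positive")
    return min(requested, available_cores)


if __name__ == "__main__":
    # The case from the report: 8 requested, 10 cores available.
    print(effective_threads(8, 10))
```

With this rule, a request above the core count (say 80 on a 10-core machine) is reduced to 10, while a request below it is left unchanged.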

@zhengzhenxian
Collaborator

zhengzhenxian commented Jun 5, 2021

Hi, Neng,
To help us pinpoint the problem, could you send the run log at ${OUTPUT_DIR}/run_clair3.log to my email address zxzheng@cs.hku.hk? Much appreciated.

@aquaskyline
Member

aquaskyline commented Jun 9, 2021

Some jobs failed because Clair3 was requesting more processes than the user environment allows (ulimit -u). We have added more running-environment checks and automatic retries in v0.1-r3.

Clair3 uses TensorFlow and PyPy. These libraries open quite a few threads in each running instance. The THREADS parameter controls how many Clair3 instances can run concurrently, but each instance, in our observation, consumes up to 40-50 processes at peak. The number of processes a user can create is capped at a limit that can be checked with ulimit -a. On Ubuntu, the limit is usually over 10k (unless it has been reduced), so it is not a problem. But on RedHat or CentOS, which are commonly used in grids and institutions, the limit is usually 1024 or 2048, so setting THREADS above 20 would hit the limit at some point. Raising ulimit -u solves the problem, but that requires root privilege (or a blessing from the system admin team).
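
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the check shipped in Clair3: the function names and the `already_running` parameter are hypothetical, and the default of 50 processes per instance is the upper end of the 40-50 estimate given above. The limit is read with the standard-library `resource` module, which is available on Unix-like systems only.

```python
from typing import Optional


def current_nproc_limit() -> Optional[int]:
    """Return the soft per-user process limit (what `ulimit -u` reports).

    Returns None when the limit is unlimited. Unix-only.
    """
    import resource  # imported here because the module is Unix-only

    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


def max_concurrent_instances(
    nproc_limit: Optional[int],
    procs_per_instance: int = 50,  # assumption: upper end of the 40-50 estimate
    already_running: int = 0,      # assumption: processes the user already owns
) -> Optional[int]:
    """Estimate how many instances fit under the process limit.

    Returns None when there is no limit, and at least 1 otherwise.
    """
    if nproc_limit is None:
        return None
    budget = nproc_limit - already_running
    return max(1, budget // procs_per_instance)


if __name__ == "__main__":
    limit = current_nproc_limit()
    print("ulimit -u:", limit)
    print("suggested upper bound for THREADS:", max_concurrent_instances(limit))
```

With a limit of 1024 and 50 processes per instance, the estimate is 20 concurrent instances, which matches the threshold mentioned above. Processes the user already has running reduce the budget further, so the practical bound can be lower.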

In v0.1-r3, we check ulimit -u and lower THREADS accordingly. We also added automatic retries of failed jobs before reporting them to users.
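
As a rough illustration of the retry idea (not the implementation in v0.1-r3), a failed command can simply be re-run a few times before the failure is reported. This fits the observation earlier in the thread that a failed job succeeded when run again on its own. The function name, attempt count, and delay below are all assumptions made for the sketch.

```python
import subprocess
import sys
import time
from typing import Sequence


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,  # assumption: arbitrary illustrative value
    delay_s: float = 1.0,   # assumption: pause between attempts
) -> int:
    """Run cmd until it exits with status 0, up to max_attempts times.

    Returns the attempt number that succeeded, or raises RuntimeError
    if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(list(cmd))
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_s)  # give transient resource pressure time to ease
    raise RuntimeError(f"command failed after {max_attempts} attempts: {list(cmd)}")


if __name__ == "__main__":
    # Trivial command that always succeeds, using the current interpreter.
    print(run_with_retries([sys.executable, "-c", "pass"]))
```

Retrying only helps with transient failures such as hitting the process limit momentarily. A job that fails deterministically will still fail after the last attempt and should be reported.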

@huangnengCSU
Author

@aquaskyline @zhengzhenxian
I updated to the latest Clair3 and there are no more failed jobs. Thank you for maintaining and providing an excellent tool.

Best,
Neng
