problem in detect variants using self-trained model #20

huangnengCSU · 2021-06-02T01:22:48Z

Hi,
I trained a pileup model (pre-training plus fine-tuning), and the output model appears intact. I renamed the model files to pileup and placed them alongside the provided full_alignment model, then used the two models together to call variants. The command fails with two kinds of errors.
The first kind does not occur at a fixed location or time when the command is re-executed, and on some runs it does not occur at all. The following two log outputs are examples of this kind of error.

Processed 640000 tensors
Processed 660000 tensors
Total process positions in chr18 with chunk 13/17 : 662097
Total time elapsed: 900.00 s
parallel: This job failed:

python3 /public/home/huangneng/code/Clair3/scripts/../clair3.py CallVarBam \
--chkpnt_fn /public/home/huangneng/tools/clair3-model/clair3_finetune/pileup \
--bam_fn /public/home/huangneng/ont-quickstart/input/120G/hg003_120G.bam \
--call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/output/tmp/pileup_output/pileup_chr18_14.vcf \
--sampleName EMPTY \
--ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna \
--extend_bed /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/output/tmp/split_beds/chr18 \
--bed_fn \
--vcf_fn EMPTY \
--ctgName chr18 \
--chunk_id 14 \
--chunk_num 17 \
--platform ont \
--fast_mode False \
--snp_min_af 0.0 \
--indel_min_af 0.0 \
--gvcf False \
--python python3 \
--pypy pypy3 \
--samtools samtools \
--temp_file_dir /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/output/tmp/gvcf_tmp_output \
--pileup

real    228m52.610s
user    9075m11.206s
sys     847m57.569s

[INFO] Merge chunked contigs vcf files
[INFO] 2/7 Filter Hete SNP varaints for Whatshap phasing and haplotag
[INFO] Select phasing quality cut off 17
[INFO] Total hete snp positions pass filtering: chr22: 0
[INFO] Total hete snp positions pass filtering: chr20: 19521
[INFO] Total hete snp positions pass filtering: chr19: 38371
[INFO] Total hete snp positions pass filtering: chr21: 2615
[INFO] Total hete snp positions pass filtering: chr11: 105028
[INFO] Total hete snp positions pass filtering: chrX: 0
[INFO] Total hete snp positions pass filtering: chr18: 63294
[INFO] Total hete snp positions pass filtering: chrY: 0
[INFO] Total hete snp positions pass filtering: chr16: 68964
[INFO] Total hete snp positions pass filtering: chr13: 73883
[INFO] Total hete snp positions pass filtering: chr12: 106369
[INFO] Total hete snp positions pass filtering: chr9: 109798
[INFO] Total hete snp positions pass filtering: chr8: 125622
[INFO] Total hete snp positions pass filtering: chr17: 69591
[INFO] Total hete snp positions pass filtering: chr6: 138924
[INFO] Total hete snp positions pass filtering: chr7: 128983
[INFO] Total hete snp positions pass filtering: chr10: 114092
[INFO] Total hete snp positions pass filtering: chr5: 139377
[INFO] Total hete snp positions pass filtering: chr14: 72464
[INFO] Total hete snp positions pass filtering: chr15: 70208
[INFO] Total hete snp positions pass filtering: chr4: 159925
[INFO] Total hete snp positions pass filtering: chr3: 147225
[INFO] Total hete snp positions pass filtering: chr2: 179619
[INFO] Total hete snp positions pass filtering: chr1: 169383

real    4m40.428s
user    105m48.514s
sys     2m16.621s

Total process positions in chr8 with chunk 18/30 : 659427
Total time elapsed: 606.44 s
[INFO] Delay 2 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr9 with chunk 9/28 : 0
Total time elapsed: 0.01 s
[INFO] No vcf output for file /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/pileup_output/pileup_chr9_10.vcf, remove empty file
parallel: This job failed:
python3 /public/home/huangneng/code/Clair3/scripts/../clair3.py CallVarBam \
--chkpnt_fn /public/home/huangneng/tools/clair3-model/clair3_finetune/pileup \
--bam_fn /public/home/huangneng/ont-quickstart/input/120G/hg003_120G.bam \
--call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/pileup_output/pileup_chr9_10.vcf \
--sampleName EMPTY \
--ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna \
--extend_bed /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/split_beds/chr9 \
--bed_fn \
--vcf_fn EMPTY \
--ctgName chr9 \
--chunk_id 10 \
--chunk_num 28 \
--platform ont \
--fast_mode False \
--snp_min_af 0.0 \
--indel_min_af 0.0 \
--gvcf False \
--python python3 \
--pypy pypy3 \
--samtools samtools \
--temp_file_dir /public/home/huangneng/ont-quickstart/input/120G/clair3_selftraining/debug_output/tmp/gvcf_tmp_output \
--pileup

real    149m55.489s
user    4785m4.119s
sys     444m19.105s
[INFO] Merge chunked contigs vcf files
[INFO] 2/7 Filter Hete SNP varaints for Whatshap phasing and haplotag
[INFO] Select phasing quality cut off 17
[INFO] Total hete snp positions pass filtering: chr18: 0
[INFO] Total hete snp positions pass filtering: chr15: 0
[INFO] Total hete snp positions pass filtering: chr12: 0
[INFO] Total hete snp positions pass filtering: chr21: 0
[INFO] Total hete snp positions pass filtering: chr13: 0
[INFO] Total hete snp positions pass filtering: chrY: 0
[INFO] Total hete snp positions pass filtering: chr19: 0
[INFO] Total hete snp positions pass filtering: chr11: 0
[INFO] Total hete snp positions pass filtering: chr22: 0
[INFO] Total hete snp positions pass filtering: chr17: 0
[INFO] Total hete snp positions pass filtering: chr16: 0
[INFO] Total hete snp positions pass filtering: chr9: 7006
[INFO] Total hete snp positions pass filtering: chrX: 0
[INFO] Total hete snp positions pass filtering: chr10: 0
[INFO] Total hete snp positions pass filtering: chr14: 0
[INFO] Total hete snp positions pass filtering: chr4: 160039
[INFO] Total hete snp positions pass filtering: chr8: 110917
[INFO] Total hete snp positions pass filtering: chr6: 139017
[INFO] Total hete snp positions pass filtering: chr5: 139442
[INFO] Total hete snp positions pass filtering: chr20: 0
[INFO] Total hete snp positions pass filtering: chr3: 147298
[INFO] Total hete snp positions pass filtering: chr1: 169478
[INFO] Total hete snp positions pass filtering: chr7: 129048
[INFO] Total hete snp positions pass filtering: chr2: 179713

real    2m47.910s
user    60m29.533s
sys     1m19.845s

The second type of error appears in the full-alignment variant calling step. Here is the output log.

[INFO] Delay 1 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr1 : 10000
Total time elapsed: 71.62 s
[INFO] Delay 2 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr1 : 10000
Total time elapsed: 72.46 s
parallel: This job failed:

python3 /public/home/huangneng/code/Clair3/scripts/../clair3.py CallVarBam \
--chkpnt_fn /public/home/huangneng/tools/clair3-model/RNN+CNN_finetune/full_alignment \
--bam_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/phase_output/phase_bam/chr1.bam \
--call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/full_alignment_chr1.28_316.vcf \
--sampleName EMPTY \
--vcf_fn EMPTY \
--ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna \
--full_aln_regions /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/candidate_bed/chr1.28_316 \
--ctgName chr1 \
--add_indel_length \
--phasing_info_in_bam \
--gvcf False \
--python python3 \
--pypy pypy3 \
--samtools samtools \
--platform ont

real    1m34.682s
user    58m25.445s
sys     4m16.988s
[INFO] 7/7 Merge pileup vcf and full alignment vcf
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chrY: 56779
[INFO] Full alignment variants proceeded in chrY: 0
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr8: 438770
[INFO] Full alignment variants proceeded in chr8: 0
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr13: 314341
[INFO] Full alignment variants proceeded in chr13: 0

I extracted the failed job and ran it on its own, and it completed without any error. Its output is below.

[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr1 : 10000
Total time elapsed: 33.62 s
[INFO] Delay 6 seconds before starting variant calling ...


aquaskyline · 2021-06-02T02:55:46Z

Hi,

How many threads have you set? And could you paste here the output of ulimit -a in your running environment?

huangnengCSU · 2021-06-02T07:30:54Z

@aquaskyline
I have tested 30, 48, 60 and 80 threads. Here is the ulimit info:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 6184280
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

aquaskyline · 2021-06-02T07:43:09Z

If you set THREADS=8, do all jobs complete without failure in your environment?

huangnengCSU · 2021-06-05T01:50:04Z

Hi,
With 8 threads, the first type of error no longer appears, but the second type still occurs.

[INFO] Delay 3 seconds before starting variant calling ...
[mpileup] 1 samples in 1 input files
Calling variants ...
Total process positions in chr10 : 10000
Total time elapsed: 50.98 s
parallel: This job failed:
python3 /public/home/huangneng/code/Clair3_transformer/scripts/../clair3.py CallVarBam     --chkpnt_fn /public/home/huangneng/tools/clair3-model/RNN+CNN_finetune/full_alignment     --bam_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/phase_output/phase_bam/chr10.bam     --call_fn /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/full_alignment_chr10.88_183.vcf     --sampleName EMPTY     --vcf_fn EMPTY     --ref_fn /public/home/huangneng/ont-quickstart/input/120G/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna     --full_aln_regions /public/home/huangneng/ont-quickstart/input/120G/clair3_modify/debug_output/tmp/full_alignment_output/candidate_bed/chr10.88_183     --ctgName chr10     --add_indel_length     --phasing_info_in_bam     --gvcf False     --python python3     --pypy pypy3     --samtools samtools     --platform ont

real    66m30.379s
user    583m45.247s
sys     30m43.783s
[INFO] 7/7 Merge pileup vcf and full alignment vcf
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr9: 386571
[INFO] Full alignment variants proceeded in chr9: 0
Merge pileup and full alignment VCF/GVCF
[INFO] Pileup variants proceeded in chr8: 438770
[INFO] Full alignment variants proceeded in chr8: 0

There may also be a bug in how the thread count is compared with the maximum number of available CPU cores: I set threads=8 and maximum_available_cores=10, and the command changed the thread count to 10.
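
For reference, the behavior one would normally expect from such a check is a clamp: the effective thread count should never exceed either the requested value or the number of available cores. A minimal Python sketch of that expectation follows. The helper `effective_threads` and its parameter names are hypothetical and are not taken from Clair3's code.

```python
import os

def effective_threads(requested, available=None):
    """Return the thread count to use: the requested value capped at the available cores.

    Hypothetical helper for illustration; `available` defaults to os.cpu_count().
    """
    if available is None:
        available = os.cpu_count() or 1
    if requested < 1:
        raise ValueError("requested must be >= 1")
    # Taking the minimum lowers an oversized request but never raises a small one.
    return min(requested, available)

print(effective_threads(8, 10))   # request is below the core count, so it stays at 8
print(effective_threads(16, 10))  # request exceeds the core count, so it is lowered to 10
```

With this logic, threads=8 on a 10-core allowance would remain 8 rather than being raised to 10.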

zhengzhenxian · 2021-06-05T03:41:15Z

Hi, Neng,
For us to pinpoint the problem, could you send us the running log at ${OUTPUT_DIR}/run_clair3.log to my email address zxzheng@cs.hku.hk? Much appreciated.

aquaskyline · 2021-06-09T02:39:44Z

Some jobs failed because Clair3 was requesting more processes than the user environment allows (ulimit -u). We have added more running-environment checks and automatic retries in v0.1-r3.

Clair3 uses TensorFlow and PyPy, and these libraries open quite a few threads in each running instance. The THREADS parameter controls how many Clair3 instances can run concurrently, but each instance, as we have observed, consumes up to 40-50 processes at peak. The number of processes a user can create is capped at a limit that can be checked using ulimit -a (the "max user processes" line, which is what ulimit -u reports). On Ubuntu, the limit is usually over 10k (unless deliberately reduced), so it is not a problem. But on RedHat or CentOS, which are commonly used in grids and institutions, the limit is usually 1024 or 2048, so setting THREADS above 20 would hit the limit at some point. Raising ulimit -u solves the problem, but that requires root privilege (or a blessing from the system admin team).
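
The arithmetic above can be sketched in a few lines of Python. This is an illustration only, not Clair3's actual check: the helper `max_safe_threads` is hypothetical, and the figure of 50 processes per instance is simply the upper end of the 40-50 estimate given above. It also ignores any other processes the user already has running, which count against the same limit.

```python
import resource

# Assumption: one instance peaks at about 50 processes (upper end of the 40-50 estimate).
PEAK_PROCS_PER_INSTANCE = 50

def max_safe_threads(nproc_limit, per_instance=PEAK_PROCS_PER_INSTANCE):
    """Largest number of concurrent instances that stays under the process limit.

    Returns None when the limit is unlimited. Hypothetical helper for illustration.
    """
    if nproc_limit == resource.RLIM_INFINITY:
        return None
    return max(1, nproc_limit // per_instance)

# RLIMIT_NPROC is the limit that `ulimit -u` reports (Unix only).
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NPROC)
print("max user processes (soft):", soft_limit)
print("suggested THREADS ceiling:", max_safe_threads(soft_limit))
```

For a limit of 1024 this gives 20, which matches the threshold mentioned above.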

In v0.1-r3, we check ulimit -u and lower the THREADS accordingly. We also added automatic retries on failed jobs before handing them to users.
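
As a rough illustration of the retry idea (not the actual v0.1-r3 implementation, which is not shown in this thread), a failed command can be re-run a bounded number of times before the failure is reported. The function name `run_with_retries` and the default of three retries are assumptions made for this sketch.

```python
import subprocess
import sys

def run_with_retries(cmd, max_retries=3):
    """Run `cmd`; on a non-zero exit code, re-run it up to `max_retries` more times.

    Returns the number of attempts used on success and raises RuntimeError
    if every attempt fails. Hypothetical helper for illustration.
    """
    for attempt in range(1, max_retries + 2):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} failed with exit code {result.returncode}",
              file=sys.stderr)
    raise RuntimeError(f"job still failing after {max_retries} retries: {cmd}")

# Example with a trivial command that succeeds on the first attempt.
attempts = run_with_retries([sys.executable, "-c", "pass"])
print("succeeded after", attempts, "attempt(s)")
```

Retries help with transient failures such as briefly hitting the process limit, but a job that fails deterministically will still surface as an error after the last attempt.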

huangnengCSU · 2021-06-11T00:37:02Z

@aquaskyline @zhengzhenxian
I updated to the latest Clair3 and there are no more failed jobs. Thank you for maintaining and providing an excellent tool.

Best,
Neng

aragornwubo mentioned this issue Jun 5, 2021

Step 6 tmp/phase_output/phase_bam/.bam no found error #23

Closed

huangnengCSU closed this as completed Jun 11, 2021
