optimize running steps and README
quentin0515 committed Aug 25, 2022
1 parent 6651d01 commit 38b19a2
Showing 3 changed files with 62 additions and 57 deletions.
33 changes: 16 additions & 17 deletions README.md
@@ -13,40 +13,39 @@ TT-Mars: S**t**ructural Varian**t**s Assess**m**ent B**a**sed on Haplotype-**r**

The main program: run `python ttmars.py -h` for help.

`python ttmars.py output_dir centro_file files_dir/assem1_non_cov_regions.bed files_dir/assem2_non_cov_regions.bed vcf_file reference asm_h1 asm_h2 files_dir/lo_pos_assem1_result_compressed.bed files_dir/lo_pos_assem2_result_compressed.bed files_dir/lo_pos_assem1_0_result_compressed.bed files_dir/lo_pos_assem2_0_result_compressed.bed tr_file searching_interval(1000) num_X_chr`
`python ttmars.py output_dir files_dir centro_file vcf_file reference asm_h1 asm_h2 tr_file num_X_chr`

## Positional arguments

1. `output_dir`: Output directory.
2. `centro_file`: provided centromere file.
3. `tr_file`: provided tandem repeats file.
4. `vcf_file`: callset file callset.vcf(.gz)
5. `reference`: reference file reference_genome.fasta.
6. `asm_h1/2`: assembly files assembly1/2.fa, can be downloaded by `download_asm.sh`.
7. `assem1_non_cov_regions.bed`, `assem2_non_cov_regions.bed`, `lo_pos_assem1_result_compressed.bed`, `lo_pos_assem2_result_compressed.bed`, `lo_pos_assem1_0_result_compressed.bed`, `lo_pos_assem2_0_result_compressed.bed`: required files, downloaded to `./ttmars_files`.
8. `searching_interval(1000)`: the flanking region where TT-Mars searches for the best interval, 1000 is the recommended value.
1. `output_dir`: Output directory.
2. `files_dir`: input files directory, typically `./ttmars_files/sample_name`; the directory that holds the required files after running `dowaload_files.sh`.
3. `centro_file`: provided centromere file.
4. `vcf_file`: callset file callset.vcf(.gz).
5. `reference`: reference file reference_genome.fasta.
6. `asm_h1`: assembly file assembly1.fa, downloaded by running `download_asm.sh`.
7. `asm_h2`: assembly file assembly2.fa, downloaded by running `download_asm.sh`.
8. `tr_file`: provided tandem repeats file.
9. `num_X_chr`: if male sample: 1; if female sample: 2.
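
The nine-argument order listed above can be assembled programmatically. The following is a minimal sketch, not part of TT-Mars: `build_ttmars_command` is a hypothetical helper, and `centromere.bed` is a placeholder file name chosen for illustration. Only the argument order and the `1`/`2` values for `num_X_chr` come from the README.

```python
import shlex


def build_ttmars_command(output_dir, files_dir, centro_file, vcf_file,
                         reference, asm_h1, asm_h2, tr_file, num_x_chr,
                         flags=()):
    """Return the argv list for the nine-positional-argument call.

    Hypothetical helper: it only arranges arguments in the documented
    order and does not run TT-Mars.
    """
    if num_x_chr not in (1, 2):
        raise ValueError("num_x_chr must be 1 (male) or 2 (female)")
    positional = [output_dir, files_dir, centro_file, vcf_file,
                  reference, asm_h1, asm_h2, tr_file, str(num_x_chr)]
    return ["python", "ttmars.py", *positional, *flags]


cmd = build_ttmars_command(
    "out",
    "./ttmars_files/sample_name",
    "centromere.bed",              # placeholder name, not from the README
    "callset.vcf.gz",
    "reference_genome.fasta",
    "assembly1.fa",
    "assembly2.fa",
    "hg38_tandem_repeats.bed",
    1,
    flags=("-s", "-g"),
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` if the command should be executed rather than printed.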

## Optional arguments

`-n/--not_hg38`: if reference is NOT hg38/chm13 (hg19).
`-p/--passonly`: if consider PASS calls only.
`-p/--passonly`: if consider PASS calls only.
`-s/--seq_resolved`: if consider sequence resolved calls.
`-w/--wrong_len`: if count wrong length calls as True.
`-g/--gt_vali`: conduct genotype validation.
`-i/--gt_info`: index with GT info. (For phased callsets)
`-d/--phased `: take phased information. (For phased callsets)

`-v/--vcf_out`: output results as vcf files (tp (true positive), fp (false positive) and na).
`-i/--gt_info`: index with GT info. (For phased callsets)
`-d/--phased `: take phased information. (For phased callsets)
`-v/--vcf_out`: output results as vcf files (tp (true positive), fp (false positive) and na).
`-f/--false_neg`: output recall, must be used together with `-t/--truth_file`.
`-t/--truth_file`: input truth vcf file, must be used together with `-f/--false_neg`.
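
The requirement that `-f/--false_neg` and `-t/--truth_file` appear together can be checked right after argument parsing. This is a sketch under assumptions: it treats `-f` as a boolean switch and `-t` as a path-valued option, and it is not taken from `ttmars.py`, which may or may not enforce the pairing itself.

```python
import argparse


def build_parser():
    """Parser containing only the two paired options (illustrative)."""
    parser = argparse.ArgumentParser(description="sketch of -f/-t pairing")
    # Assumption: -f is a plain on/off switch.
    parser.add_argument("-f", "--false_neg", action="store_true",
                        help="output recall")
    # Assumption: -t takes the path of the truth VCF.
    parser.add_argument("-t", "--truth_file", default=None,
                        help="input truth vcf file")
    return parser


def parse_args(argv):
    """Parse argv and reject the case where only one of -f/-t is given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.false_neg != (args.truth_file is not None):
        # parser.error prints a usage message and exits with status 2.
        parser.error("-f/--false_neg and -t/--truth_file "
                     "must be used together")
    return args


args = parse_args(["-f", "-t", "truth.vcf"])
print(args.false_neg, args.truth_file)
```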

## Example Output

ttmars_combined_res.txt:
|chr| start| end| type| relative length| relative score| validation result| genotype match|
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: |
|chr1| 893792| 893827| DEL| 1.03| 3.18| True| True|
|SV index| relative length| relative score| validation result| chr| start| end| Type| Genotype Match|
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |:----: | :----: |
|0| 1.0| 3.48| True| chr1| 249912| 249912| INS| True|
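
For downstream analysis, rows in the column order of the updated table can be loaded into typed records. The sketch below makes assumptions that are not stated in the README: that `ttmars_combined_res.txt` is tab-delimited and has no header line. `SVResult` and `parse_results` are hypothetical names, and the sample row is the one shown in the table.

```python
import csv
import io
from typing import NamedTuple


class SVResult(NamedTuple):
    sv_index: int
    relative_length: float
    relative_score: float
    validation_result: bool
    chrom: str
    start: int
    end: int
    sv_type: str
    genotype_match: bool


def parse_results(handle, delimiter="\t"):
    """Parse nine-column result rows; the delimiter is an assumption."""
    rows = []
    for fields in csv.reader(handle, delimiter=delimiter):
        if len(fields) != 9:
            continue  # skip blank or unexpected lines
        (idx, rel_len, rel_score, valid,
         chrom, start, end, sv_type, gt) = (f.strip() for f in fields)
        rows.append(SVResult(int(idx), float(rel_len), float(rel_score),
                             valid == "True", chrom, int(start), int(end),
                             sv_type, gt == "True"))
    return rows


# Sample row copied from the example table, joined with tabs.
sample = io.StringIO("0\t1.0\t3.48\tTrue\tchr1\t249912\t249912\tINS\tTrue\n")
results = parse_results(sample)
print(results[0])
```

If the real file uses a different delimiter, pass it through the `delimiter` parameter; a header line, if present, would need to be skipped before parsing.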

## Accompanying Resources

24 changes: 13 additions & 11 deletions run_ttmars.sh
@@ -27,24 +27,26 @@ tr_file=hg38_tandem_repeats.bed
#1: if male sample; 2: if female sample
num_X_chr=1

python ttmars.py "$output_dir" "$centro_file" "$files_dir"/assem1_non_cov_regions.bed "$files_dir"/assem2_non_cov_regions.bed "$vcf_file" "$reference" "$asm_h1" "$asm_h2" "$files_dir"/lo_pos_assem1_result_compressed.bed "$files_dir"/lo_pos_assem2_result_compressed.bed "$files_dir"/lo_pos_assem1_0_result_compressed.bed "$files_dir"/lo_pos_assem2_0_result_compressed.bed "$tr_file" 1000 "$num_X_chr" -s -g -w -d -i -v
python ttmars.py "$output_dir" "$files_dir" "$centro_file" "$vcf_file" "$reference" "$asm_dir"/h1.fa "$asm_dir"/h2.fa "$tr_file" "$num_X_chr" -s -g -w -d -i -v

# positional arguments:
# output_dir output directory
# centromere_file centromere file
# assem1_non_cov_regions_file
# Regions that are not covered on hap1
# assem2_non_cov_regions_file
# Regions that are not covered on hap2
# files_dir input directory that stores files used in tt-mars for the current sample
# Should include:
# assem1_non_cov_regions_file
# Regions that are not covered on hap1
# assem2_non_cov_regions_file
# Regions that are not covered on hap2
# liftover_file1 liftover file hap1
# liftover_file2 liftover file hap2
# liftover_file1_0 liftover file hap1 asm to ref
# liftover_file2_0 liftover file hap2 asm to ref
# centromere_file centromere file, default is provided by tt-mars
# vcf_file input vcf file
# ref_file reference file
# query_file1 assembly fasta file hap1
# query_file2 assembly fasta file hap2
# liftover_file1 liftover file hap1
# liftover_file2 liftover file hap2
# liftover_file1_0 liftover file hap1 asm to ref
# liftover_file2_0 liftover file hap2 asm to ref
# tandem_file tandem repeats regions
# tandem_file tandem repeats regions, default is provided by tt-mars
# region_len_m region_len_m
# {1,2} male sample 1, female sample 2

62 changes: 33 additions & 29 deletions ttmars.py
@@ -8,12 +8,14 @@

parser.add_argument("output_dir",
help="output directory")
parser.add_argument("files_dir",
help="input files directory")
parser.add_argument("centromere_file",
help="centromere file")
parser.add_argument("assem1_non_cov_regions_file",
help="Regions that are not covered on hap1")
parser.add_argument("assem2_non_cov_regions_file",
help="Regions that are not covered on hap2")
# parser.add_argument("assem1_non_cov_regions_file",
# help="Regions that are not covered on hap1")
# parser.add_argument("assem2_non_cov_regions_file",
# help="Regions that are not covered on hap2")
parser.add_argument("vcf_file",
help="input vcf file")
parser.add_argument("ref_file",
@@ -22,25 +24,25 @@
help="assembly fasta file hap1")
parser.add_argument("query_file2",
help="assembly fasta file hap2")
parser.add_argument("liftover_file1",
help="liftover file hap1")
parser.add_argument("liftover_file2",
help="liftover file hap2")
# parser.add_argument("liftover_file1",
# help="liftover file hap1")
# parser.add_argument("liftover_file2",
# help="liftover file hap2")

#needed in interspersed dup validation
parser.add_argument("liftover_file1_0",
help="liftover file hap1 asm to ref")
parser.add_argument("liftover_file2_0",
help="liftover file hap2 asm to ref")
# parser.add_argument("liftover_file1_0",
# help="liftover file hap1 asm to ref")
# parser.add_argument("liftover_file2_0",
# help="liftover file hap2 asm to ref")

parser.add_argument("tandem_file",
help="tandem repeats regions")

##########################################################
##########################################################
parser.add_argument("region_len_m",
type=int,
help="region_len_m")
# parser.add_argument("region_len_m",
# type=int,
# help="region_len_m")

parser.add_argument("no_X_chr",
choices=[1, 2],
@@ -139,29 +141,31 @@
##########################################################
##########################################################


output_dir = args.output_dir + "/"
# if_hg38_input = args.if_hg38_input
centromere_file = args.centromere_file
#input files directory
files_dir = args.files_dir + "/"
#assembly bam files
assem1_non_cov_regions_file = args.assem1_non_cov_regions_file
assem2_non_cov_regions_file = args.assem2_non_cov_regions_file
#avg_read_depth = sys.argv[6]
#read_bam_file = sys.argv[6]
# assem1_non_cov_regions_file = args.assem1_non_cov_regions_file
# assem2_non_cov_regions_file = args.assem2_non_cov_regions_file
assem1_non_cov_regions_file = files_dir + "assem1_non_cov_regions.bed"
assem2_non_cov_regions_file = files_dir + "assem2_non_cov_regions.bed"
vcf_file = args.vcf_file
#ref fasta file
ref_file = args.ref_file
#assembly fasta files
query_file1 = args.query_file1
query_file2 = args.query_file2
liftover_file1 = args.liftover_file1
liftover_file2 = args.liftover_file2
# liftover_file1 = args.liftover_file1
# liftover_file2 = args.liftover_file2
liftover_file1 = files_dir + "lo_pos_assem1_result_compressed.bed"
liftover_file2 = files_dir + "lo_pos_assem2_result_compressed.bed"
tandem_file = args.tandem_file
# if_passonly_input = args.if_passonly_input
# seq_resolved_input = args.seq_resolved_input
# wrong_len_input = args.wrong_len_input
liftover_file1_0 = args.liftover_file1_0
liftover_file2_0 = args.liftover_file2_0
# liftover_file1_0 = args.liftover_file1_0
# liftover_file2_0 = args.liftover_file2_0
liftover_file1_0 = files_dir + "lo_pos_assem1_0_result_compressed.bed"
liftover_file2_0 = files_dir + "lo_pos_assem2_0_result_compressed.bed"

##########################################################
##########################################################
@@ -213,8 +217,8 @@
#max length of allowed interspersed DUP
reg_dup_upper_len = 10000000
#flanking regions for searching
# region_len_m = 1000
region_len_m = int(args.region_len_m)
region_len_m = 1000
# region_len_m = int(args.region_len_m)

#valid types
valid_types = ['DEL', 'INS', 'INV', 'DUP:TANDEM', 'DUP']
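
Because this commit derives six input paths from `files_dir` by string concatenation, a missing file would only surface when it is first opened. The following sketch shows one way to resolve the same file names with `os.path.join` and fail early. `REQUIRED_FILES` and `resolve_input_files` are hypothetical names that do not appear in `ttmars.py`; only the six file names are taken from the diff.

```python
import os

# File names copied from the diff; the dictionary keys mirror the
# variable names used in ttmars.py.
REQUIRED_FILES = {
    "assem1_non_cov_regions_file": "assem1_non_cov_regions.bed",
    "assem2_non_cov_regions_file": "assem2_non_cov_regions.bed",
    "liftover_file1": "lo_pos_assem1_result_compressed.bed",
    "liftover_file2": "lo_pos_assem2_result_compressed.bed",
    "liftover_file1_0": "lo_pos_assem1_0_result_compressed.bed",
    "liftover_file2_0": "lo_pos_assem2_0_result_compressed.bed",
}


def resolve_input_files(files_dir):
    """Map each variable name to its path under files_dir.

    Raises FileNotFoundError listing every required file that is absent.
    """
    paths = {name: os.path.join(files_dir, fname)
             for name, fname in REQUIRED_FILES.items()}
    missing = [p for p in paths.values() if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(
            "missing required files: " + ", ".join(missing))
    return paths
```

Using `os.path.join` also removes the need to append `"/"` to the directory argument, since it inserts a separator only when one is missing.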
