update_bed

dytk2134 edited this page Sep 12, 2018 · 1 revision

Introduction

Update the sequence ID and coordinates of a Bed file using an alignment file generated by the fasta_diff program.

The coordinates are converted using the following formulas:

  • bed_new_start = bed_start - match_old_start + match_new_start
  • bed_new_end = bed_end - match_old_start + match_new_start
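
As a minimal sketch of the formulas above (not the tool's actual source), the conversion is a constant shift applied to both ends of the interval. The function name `shift_interval` and its argument names are illustrative assumptions; the sample values come from the CASE2 example on this page.

```python
def shift_interval(bed_start, bed_end, match_old_start, match_new_start):
    """Shift a Bed interval from old to new coordinates (hypothetical helper)."""
    # Both ends move by the same offset, so the interval length is preserved.
    offset = match_new_start - match_old_start
    return bed_start + offset, bed_end + offset


# Values from CASE2: Scaffold500 2215 777787 -> KK961993.1 0 775572
new_start, new_end = shift_interval(106343, 110442, match_old_start=2215, match_new_start=0)
print(new_start, new_end)  # 104128 108227
```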

A line in the Bed file is removed in either of the following situations:

  • the bed_start and bed_end coordinates are not both contained within a single match_old_start to match_old_end interval
  • the sequence name is not found in the match.tsv file (the output file from fasta_diff)
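
The removal rule can be sketched as a lookup that returns the containing match, or nothing when the line should be removed. This is only an illustration: the function name `find_match`, the tuple layout, and the inclusive boundary comparisons are assumptions, not taken from the tool's source. The match rows are copied from the CASE3 example on this page, and `Scaffold999` is a made-up ID used to show the "not found" case.

```python
def find_match(matches, seq_id, bed_start, bed_end):
    """Return the match row that fully contains the interval, or None."""
    for row in matches:
        old_id, old_start, old_end = row[0], row[1], row[2]
        # Assumption: boundaries are compared inclusively.
        if old_id == seq_id and old_start <= bed_start and bed_end <= old_end:
            return row
    return None  # no containing match: the Bed line would be removed


# Rows copied from the CASE3 match.tsv example.
matches = [
    ("Scaffold423", 43403, 44185, "KK961916.1", 43403, 44185),
    ("Scaffold423", 45136, 48693, "KK961916.1", 45136, 48693),
]

print(find_match(matches, "Scaffold423", 45799, 46062) is not None)  # True: inside the second match
print(find_match(matches, "Scaffold423", 45134, 45845) is not None)  # False: starts before the match
print(find_match(matches, "Scaffold999", 0, 100) is not None)        # False: hypothetical ID, not in matches
```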

BED format

  • header lines (remain the same)
  • first three required BED fields:
    • chrom (updated)
    • chromStart (updated)
    • chromEnd (updated)
  • 9 additional optional BED fields:
    • name (remains the same)
    • score (remains the same)
    • strand (remains the same)
    • thickStart (updated)
    • thickEnd (updated)
    • itemRgb (remains the same)
    • blockCount (remains the same)
    • blockSizes (remains the same)
    • blockStarts (remains the same)
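
To illustrate which columns change, here is a rough sketch that rewrites one tab-separated Bed line. It assumes that thickStart and thickEnd are shifted by the same offset as chromStart and chromEnd (consistent with the examples on this page), and that header lines begin with "track", "browser", or "#". The helper name `update_bed_line` is hypothetical and is not part of the tool.

```python
def update_bed_line(line, new_id, offset):
    """Rewrite the updated columns of one tab-separated Bed line (hypothetical helper)."""
    line = line.rstrip("\n")
    # Assumption: header lines start with one of these prefixes and are kept as-is.
    if line.startswith(("track", "browser", "#")):
        return line
    fields = line.split("\t")
    fields[0] = new_id                       # chrom
    for i in (1, 2, 6, 7):                   # chromStart, chromEnd, thickStart, thickEnd
        if i < len(fields):                  # optional columns may be absent
            fields[i] = str(int(fields[i]) + offset)
    return "\t".join(fields)                 # all other columns are left untouched


# Line from the CASE2 example; offset = match_new_start - match_old_start = 0 - 2215
old = "\t".join(["Scaffold500", "106343", "110442", "JUNC00072459", "61", "-",
                 "106343", "110442", "255,0,0", "2", "99,92", "0,4007"])
print(update_bed_line(old, "KK961993.1", -2215))
```

The printed line should match the updated Bed line shown in CASE2, with only columns 1, 2, 3, 7, and 8 changed.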

Usage:

update_bed -a match.tsv example_file/example.bed

Message Example:

INFO     Reading alignment data from: match.tsv...
INFO       Alignments: 61768
INFO     Processing Bed file: female_Nvit_RNAseq_alignments_junctions.bed...
INFO       Updated lines: 103726
INFO       Removed lines: 8

Result Example:

CASE1: 100% match between sequences

  • Information in match.tsv
old_id old_start old_end new_id new_start new_end
Scaffold1 0 10378279 KK961494.1 0 10378279
  • original Bed file
Scaffold1       124699  125610  JUNC00000001    1       -       124699  125610  255,0,0 2       38,63   0,848
Scaffold1       125687  127004  JUNC00000002    1       -       125687  127004  255,0,0 2       42,59   0,1258
  • updated Bed file
KK961494.1      124699  125610  JUNC00000001    1       -       124699  125610  255,0,0 2       38,63   0,848
KK961494.1      125687  127004  JUNC00000002    1       -       125687  127004  255,0,0 2       42,59   0,1258

CASE2: New sequence is a substring of the old sequence with 100% match

  • Information in match.tsv
old_id old_start old_end new_id new_start new_end
Scaffold500 2215 777787 KK961993.1 0 775572
  • original Bed file
Scaffold500     194     2394    JUNC00072458    1       +       194     2394    255,0,0 2       79,22   0,2178
Scaffold500     106343  110442  JUNC00072459    61      -       106343  110442  255,0,0 2       99,92   0,4007
  • updated Bed file
KK961993.1      104128  108227  JUNC00072459    61      -       104128  108227  255,0,0 2       99,92   0,4007
  • removed Bed file
Scaffold500     194     2394    JUNC00072458    1       +       194     2394    255,0,0 2       79,22   0,2178

CASE3: part of the old sequence was converted into Ns

  • Information in match.tsv
old_id old_start old_end new_id new_start new_end
Scaffold423 43403 44185 KK961916.1 43403 44185
Scaffold423 45136 48693 KK961916.1 45136 48693
  • original Bed file
Scaffold423     42315   43335   JUNC00064280    4       -       42315   43335   255,0,0 2       69,81   0,939
Scaffold423     45134   45845   JUNC00064281    7       -       45134   45845   255,0,0 2       87,94   0,617
Scaffold423     45799   46062   JUNC00064282    6       -       45799   46062   255,0,0 2       85,94   0,169
  • updated Bed file
KK961916.1      42315   43335   JUNC00064280    4       -       42315   43335   255,0,0 2       69,81   0,939
KK961916.1      45799   46062   JUNC00064282    6       -       45799   46062   255,0,0 2       85,94   0,169
  • removed Bed file
Scaffold423     45134   45845   JUNC00064281    7       -       45134   45845   255,0,0 2       87,94   0,617

Running the program with -h prints the following help:

update_bed -h

usage: update_bed [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
                  [-r REMOVED_POSTFIX] [-v]
                  Bed_FILE [Bed_FILE ...]

Update the sequence id and coordinates of a Bed file using an alignment file generated by the fasta_diff program.
Updated Line are written to a new file with '_updated'(default) appended to the original Bed file name.
Line that can not be updated, due to the id being removed completely or the line contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original Bed file name.

Example:
    fasta_diff example_file/old.fa example_file/new.fa | update_bed example_file/example.bed

positional arguments:
  Bed_FILE              List one or more Bed files to be updated

optional arguments:
  -h, --help            show this help message and exit
  -a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        The alignment file generated by fasta_diff, a TSV file
                        with 6 columns: old_id, old_start, old_end, new_id,
                        new_start, new_end (default: STDIN)
  -u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
                        The filename postfix for updated features (default:
                        "_updated")
  -r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
                        The filename postfix for removed features (default:
                        "_removed")
  -v, --version         show program's version number and exit
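
For readers who want to inspect the alignment file themselves, the 6-column TSV described above could be loaded along these lines. This is a sketch under assumptions: the helper name `read_alignments` is hypothetical, and skipping rows whose second column is not an integer is a guess at how to tolerate a possible header row, not documented behavior of update_bed or fasta_diff.

```python
import csv
import io
from collections import defaultdict


def read_alignments(handle):
    """Group 6-column alignment rows by old_id (hypothetical helper)."""
    by_old_id = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: skip blank, malformed, or header-like rows.
        if len(row) != 6 or not row[1].isdigit():
            continue
        old_id, old_start, old_end, new_id, new_start, new_end = row
        by_old_id[old_id].append(
            (int(old_start), int(old_end), new_id, int(new_start), int(new_end))
        )
    return by_old_id


# Rows copied from the CASE3 example, joined with tabs.
sample = io.StringIO(
    "Scaffold423\t43403\t44185\tKK961916.1\t43403\t44185\n"
    "Scaffold423\t45136\t48693\tKK961916.1\t45136\t48693\n"
)
alignments = read_alignments(sample)
print(len(alignments["Scaffold423"]))  # 2
```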