# Introduction to bedtools

Here, we introduce bedtools, a valuable command-line utility for working with genomic coordinates.

## Format of a BED file

BED files are tab-delimited text files that contain genomic coordinate information, and are typically given the .bed extension. The format of a BED file is specified here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1. Only the first three columns are mandatory for BED files, and they contain the following information:

Columns in a BED file:
- Column 1: chromosome
- Column 2: start position (the beginning of the first base is indicated by the start position "0"; the beginning of the 5th base is indicated by the start position "4")
- Column 3: end position (the end of the first base is indicated by the end position "1"; the end of the 5th base is indicated by the end position "5")

People are often confused by the fact that the same base is referred to by a different number depending on whether it marks the start or the end of an interval. A simple way to understand this is to realize that the positions do not number the bases themselves, but the boundaries between bases, as illustrated in the figure below:

<img src="./array_slice_indexing.png">

Note that this convention is also consistent with how array slicing works, as illustrated below:

In [1]:
from __future__ import print_function #for compatibility with both python 2 and python 3

dna_string = "ACCTG"
print(dna_string[0:4])
print(dna_string[1:5])
print(dna_string[0:5])

ACCT
CCTG
ACCTG
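
Because the positions mark boundaries, the length of a BED interval is simply the end minus the start, and converting to the 1-based, inclusive coordinates used by formats such as GFF only requires adding one to the start. A minimal sketch of both calculations follows; the helper names `interval_length` and `to_one_based` are hypothetical and chosen only for illustration:

```python
def interval_length(start, end):
    """Number of bases in a 0-based, half-open BED interval."""
    return end - start


def to_one_based(start, end):
    """Convert BED coordinates to 1-based, inclusive coordinates."""
    return start + 1, end


dna_string = "ACCTG"
start, end = 1, 5  # BED-style interval covering the 2nd through 5th bases

print(dna_string[start:end])        # the bases inside the interval
print(interval_length(start, end))  # number of bases
print(to_one_based(start, end))     # same interval, 1-based inclusive
```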


## Prepare sample BED files

Below, we prepare two sample BED files that we will use to demonstrate bedtools commands.

In [2]:
#\t is a tab character, and \n is a newline character
#the "with" statement closes each file automatically when its block ends
with open('file1.bed', 'w') as file1:
    file1.write("chr1\t0\t100\n")
    file1.write("chr1\t50\t150\n")
    file1.write("chr1\t200\t250\n")
    file1.write("chr2\t100\t250\n")

with open('file2.bed', 'w') as file2:
    file2.write("chr1\t0\t50\n")
    file2.write("chr1\t50\t90\n")
    file2.write("chr1\t90\t110\n")
    file2.write("chr1\t110\t200\n")
    file2.write("chr1\t150\t180\n")
    file2.write("chr2\t400\t450\n")

Let's view the created files:

In [3]:
!echo "file1.bed" #print "file1.bed" to the screen
!cat "file1.bed" #display the contents of file1.bed
!echo "file2.bed"
!cat "file2.bed"

file1.bed
chr1	0	100
chr1	50	150
chr1	200	250
chr2	100	250
file2.bed
chr1	0	50
chr1	50	90
chr1	90	110
chr1	110	200
chr1	150	180
chr2	400	450
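
The same three columns can also be read into Python for inspection. The sketch below is only an illustration: `parse_bed_lines` is a hypothetical helper, it assumes Python 3 and plain three-column, tab-separated lines with no header, and it uses an in-memory string as a stand-in for an opened file1.bed.

```python
import io


def parse_bed_lines(lines):
    """Yield (chrom, start, end) tuples from tab-separated BED lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        yield fields[0], int(fields[1]), int(fields[2])


# Stand-in for open('file1.bed'); the text matches the file written above.
sample = io.StringIO(
    "chr1\t0\t100\nchr1\t50\t150\nchr1\t200\t250\nchr2\t100\t250\n"
)
regions = list(parse_bed_lines(sample))
print(regions)
```

Any iterable of lines works as input, so an open file handle can be passed in place of the `io.StringIO` object.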


## The bedtools intersect command

This command performs an intersection between two files. By default, it reports the overlapping portion of every pair of intervals (one from each file) that overlap.

In [4]:
!bedtools intersect -a file1.bed -b file2.bed

chr1	0	50
chr1	50	90
chr1	90	100
chr1	50	90
chr1	90	110
chr1	110	150


It is equivalently known as the intersectBed command:

In [5]:
!intersectBed -a file1.bed -b file2.bed

chr1	0	50
chr1	50	90
chr1	90	100
chr1	50	90
chr1	90	110
chr1	110	150
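
The default output above can be reproduced with a few lines of Python, which may help clarify what is being reported. This is a naive all-pairs sketch for illustration only, not how bedtools is implemented; the function name `intersect` and the in-memory lists are assumptions that mirror file1.bed and file2.bed.

```python
def intersect(a_regions, b_regions):
    """Return the overlapping portion of every (A, B) pair that overlaps."""
    results = []
    for a_chrom, a_start, a_end in a_regions:
        for b_chrom, b_start, b_end in b_regions:
            if a_chrom != b_chrom:
                continue
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            if start < end:  # book-ended intervals share no bases
                results.append((a_chrom, start, end))
    return results


# Same regions as file1.bed and file2.bed
file1 = [("chr1", 0, 100), ("chr1", 50, 150),
         ("chr1", 200, 250), ("chr2", 100, 250)]
file2 = [("chr1", 0, 50), ("chr1", 50, 90), ("chr1", 90, 110),
         ("chr1", 110, 200), ("chr1", 150, 180), ("chr2", 400, 450)]

for region in intersect(file1, file2):
    print(*region, sep="\t")
```

Note that chr1 50-150 and chr1 150-180 touch at position 150 but share no bases, so no line is reported for that pair.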


To write the output of any command to a file, use the > operator, which redirects the output to a text file:

In [6]:
!bedtools intersect -a file1.bed -b file2.bed > intersection_results.bed

The contents can then be read back from the file, which could then be used as an input file for subsequent commands.

In [7]:
!cat intersection_results.bed

chr1	0	50
chr1	50	90
chr1	90	100
chr1	50	90
chr1	90	110
chr1	110	150


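The help text lists all available options. We recommend you read through all of them to know what is possible. Note that in the version used here (v2.17.0), --help is reported as an unrecognized parameter, so the usage text below is preceded by error messages; the short flag -h requests the same text.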

In [8]:
!bedtools intersect --help


*****ERROR: Unrecognized parameter: --help *****


*****
*****ERROR: Need -a and -b files. 
*****

Tool:    bedtools intersect (aka intersectBed)
Version: v2.17.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

Options: 
	-abam	The A input file is in BAM format.  Output will be BAM as well.

	-ubam	Write uncompressed BAM output. Default writes compressed BAM.

	-bed	When using BAM input (-abam), write output as BED. The default
		is to write output in BAM when using -abam.

	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of over

We go over some key options below. -wa reports the original entry for the -a file involved in the overlap, instead of just the overlapping portion.

In [9]:
!bedtools intersect -a file1.bed -b file2.bed -wa

chr1	0	100
chr1	0	100
chr1	0	100
chr1	50	150
chr1	50	150
chr1	50	150


Adding in the -wb option will also include the entry from "b" involved in the overlap:

In [10]:
!bedtools intersect -a file1.bed -b file2.bed -wa -wb

chr1	0	100	chr1	0	50
chr1	0	100	chr1	50	90
chr1	0	100	chr1	90	110
chr1	50	150	chr1	50	90
chr1	50	150	chr1	90	110
chr1	50	150	chr1	110	200


If you just want to know which features from the -a file overlapped with some feature in the -b file, but do not want repeated features to be reported, use the -u flag:

In [11]:
!bedtools intersect -a file1.bed -b file2.bed -u

chr1	0	100
chr1	50	150


If you want to know, for each region in -a, how many regions from -b overlap it, use the -c flag. Regions in -a with no overlaps are reported with a count of 0:

In [12]:
!bedtools intersect -a file1.bed -b file2.bed -c

chr1	0	100	3
chr1	50	150	3
chr1	200	250	0
chr2	100	250	0


To keep only those entries in -a that have NO overlap with any entry from -b, use the -v flag:

In [13]:
!bedtools intersect -a file1.bed -b file2.bed -v

chr1	200	250
chr2	100	250
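
The -u, -c and -v flags can all be thought of as views of one quantity: the number of -b regions that overlap each -a region. The sketch below illustrates that idea in plain Python; `count_overlaps` is a hypothetical helper, and the lists mirror file1.bed and file2.bed.

```python
def count_overlaps(a_regions, b_regions):
    """For each A region, count the B regions sharing at least one base."""
    counts = []
    for a_chrom, a_start, a_end in a_regions:
        n = sum(
            1
            for b_chrom, b_start, b_end in b_regions
            if a_chrom == b_chrom and a_start < b_end and b_start < a_end
        )
        counts.append(((a_chrom, a_start, a_end), n))
    return counts


# Same regions as file1.bed and file2.bed
file1 = [("chr1", 0, 100), ("chr1", 50, 150),
         ("chr1", 200, 250), ("chr2", 100, 250)]
file2 = [("chr1", 0, 50), ("chr1", 50, 90), ("chr1", 90, 110),
         ("chr1", 110, 200), ("chr1", 150, 180), ("chr2", 400, 450)]

counts = count_overlaps(file1, file2)                         # like -c
with_overlap = [region for region, n in counts if n > 0]      # like -u
without_overlap = [region for region, n in counts if n == 0]  # like -v

print(counts)
print(with_overlap)
print(without_overlap)
```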


-loj (left outer join) reports every feature in the -a file: one line per overlap, or a NULL feature for -b (shown as . -1 -1) if there are no overlaps:

In [14]:
!bedtools intersect -a file1.bed -b file2.bed -loj

chr1	0	100	chr1	0	50
chr1	0	100	chr1	50	90
chr1	0	100	chr1	90	110
chr1	50	150	chr1	50	90
chr1	50	150	chr1	90	110
chr1	50	150	chr1	110	200
chr1	200	250	.	-1	-1
chr2	100	250	.	-1	-1


-wao also reports every feature in the -a file, and adds the number of bases in the overlap as a final column (0 if there is no overlap):

In [15]:
!bedtools intersect -a file1.bed -b file2.bed -wao

chr1	0	100	chr1	0	50	50
chr1	0	100	chr1	50	90	40
chr1	0	100	chr1	90	110	10
chr1	50	150	chr1	50	90	40
chr1	50	150	chr1	90	110	20
chr1	50	150	chr1	110	200	40
chr1	200	250	.	-1	-1	0
chr2	100	250	.	-1	-1	0


To impose a minimum overlap, expressed as a fraction of the size of the region in -a, use the -f flag. Below, an overlap is kept only if it covers at least 40% of the -a region:

In [16]:
!bedtools intersect -a file1.bed -b file2.bed -wao -f 0.4

chr1	0	100	chr1	0	50	50
chr1	0	100	chr1	50	90	40
chr1	50	150	chr1	50	90	40
chr1	50	150	chr1	110	200	40
chr1	200	250	.	-1	-1	0
chr2	100	250	.	-1	-1	0
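
The filtering arithmetic behind -f can be checked by hand: divide the number of overlapping bases by the length of the -a region and compare the result with the threshold. A minimal sketch of that calculation, assuming Python 3 division; the helper name `overlaps_with_min_fraction` is hypothetical:

```python
def overlaps_with_min_fraction(a_region, b_regions, min_fraction):
    """Keep B regions whose overlap covers at least min_fraction of A."""
    a_chrom, a_start, a_end = a_region
    a_length = a_end - a_start
    kept = []
    for b_chrom, b_start, b_end in b_regions:
        if a_chrom != b_chrom:
            continue
        overlap = min(a_end, b_end) - max(a_start, b_start)
        # overlap <= 0 means the two regions share no bases
        if overlap > 0 and overlap / a_length >= min_fraction:
            kept.append(((b_chrom, b_start, b_end), overlap))
    return kept


# Same regions as file2.bed
file2 = [("chr1", 0, 50), ("chr1", 50, 90), ("chr1", 90, 110),
         ("chr1", 110, 200), ("chr1", 150, 180), ("chr2", 400, 450)]

# chr1 0-100 is 100 bases long, so an overlap needs at least 40 bases
print(overlaps_with_min_fraction(("chr1", 0, 100), file2, 0.4))
# chr1 50-150: the 20-base overlap with chr1 90-110 falls below the threshold
print(overlaps_with_min_fraction(("chr1", 50, 150), file2, 0.4))
```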


## The bedtools merge command

The bedtools merge command (also known as mergeBed) combines overlapping intervals in a BED file into single intervals. The regions in the file must be sorted, first by chromosome, then by the start coordinate.

In [17]:
!bedtools merge -i file1.bed

chr1	0	150
chr1	200	250
chr2	100	250
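
The sorting requirement makes sense once merging is viewed as a single pass over the sorted regions, in which each region either extends the previous merged interval or starts a new one. The sketch below illustrates that idea; it is not the bedtools implementation, the function name `merge` is an assumption, and unlike bedtools merge it sorts its input itself for convenience.

```python
def merge(regions):
    """Merge overlapping or book-ended regions in one pass over sorted input."""
    merged = []
    # Tuples sort by chromosome name first, then by start coordinate
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Region touches or overlaps the previous interval: extend it
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Same regions as file1.bed
file1 = [("chr1", 0, 100), ("chr1", 50, 150),
         ("chr1", 200, 250), ("chr2", 100, 250)]

for region in merge(file1):
    print(*region, sep="\t")
```

The `start <= merged[-1][2]` comparison treats book-ended intervals (where one ends exactly where the next begins) as mergeable, matching the default described for -d in the help text.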


Remember that it is possible to use the cat command to concatenate files, as shown below:

In [18]:
!cat file1.bed file2.bed

chr1	0	100
chr1	50	150
chr1	200	250
chr2	100	250
chr1	0	50
chr1	50	90
chr1	90	110
chr1	110	200
chr1	150	180
chr2	400	450


Thus, if we want to merge the regions in file1.bed and file2.bed, we can do so as shown (we will use bedtools sort, a.k.a. sortBed, to sort the concatenated file before merging):

In [19]:
!cat file1.bed file2.bed > concatenated_regions.bed
!bedtools sort -i concatenated_regions.bed > sorted_concatenated_regions.bed
!bedtools merge -i sorted_concatenated_regions.bed

chr1	0	250
chr2	100	250
chr2	400	450


Note that we can use the pipe operator to feed the output of one command directly as the input of another, so that the commands above can actually be executed in one line as shown:

In [20]:
!cat file1.bed file2.bed | bedtools sort | bedtools merge

chr1	0	250
chr2	100	250
chr2	400	450


You can also browse the help for bedtools merge as shown below.

In [21]:
!mergeBed --help


*****ERROR: Unrecognized parameter: --help *****


Tool:    bedtools merge (aka mergeBed)
Version: v2.17.0
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.

Usage:   bedtools merge [OPTIONS] -i <bed/gff/vcf>

Options: 
	-s	Force strandedness.  That is, only merge features
		that are the same strand.
		- By default, merging is done without respect to strand.

	-n	Report the number of BED entries that were merged.
		- Note: "1" is reported if no merging occurred.

	-d	Maximum distance between features allowed for features
		to be merged.
		- Def. 0. That is, overlapping & book-ended features are merged.
		- (INTEGER)

	-nms	Report the names of the merged features separated by semicolons.

	-scores	Report the scores of the merged features. Specify one of 
		the following options for reporting scores:
		  sum, min, max,
		  mean, median, mode, antimode,
		  collapse (i.e., print a semicolon-separated list),
		- (INTEGER)

Notes: 
	(1

## The bedtools closest command

For each region in one file (the -a file), the bedtools closest command (also known as closestBed) finds the nearest region in another file (the -b file). Overlapping features have a distance of zero. This tool is particularly useful when trying to find the nearest gene to an enhancer.

In [22]:
!bedtools closest -a file1.bed -b file2.bed

chr1	0	100	chr1	90	110
chr1	50	150	chr1	110	200
chr1	50	150	chr1	150	180
chr1	200	250	chr1	110	200
chr2	100	250	chr2	400	450


The -d option can be used to report the distance to the feature in the -b file:

In [23]:
!bedtools closest -a file1.bed -b file2.bed -d

chr1	0	100	chr1	90	110	0
chr1	50	150	chr1	110	200	0
chr1	50	150	chr1	150	180	1
chr1	200	250	chr1	110	200	1
chr2	100	250	chr2	400	450	151


The "-D ref" option reports a signed distance relative to the reference genome: negative for -b features that lie earlier ("upstream") than the -a feature, and positive for those that lie later ("downstream"):

In [24]:
!bedtools closest -a file1.bed -b file2.bed -D ref

chr1	0	100	chr1	90	110	0
chr1	50	150	chr1	110	200	0
chr1	50	150	chr1	150	180	1
chr1	200	250	chr1	110	200	-1
chr2	100	250	chr2	400	450	151
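
The distances in the two outputs above follow a simple pattern: overlapping features get 0, book-ended features get 1, and otherwise the distance is the gap between the features plus one. The sketch below reproduces that pattern. The rule is inferred from the v2.17.0 output shown here rather than taken from the bedtools source, the function name `bed_distance` is hypothetical, and both features are assumed to be on the same chromosome.

```python
def bed_distance(a_start, a_end, b_start, b_end, signed=False):
    """Distance from feature A to feature B on the same chromosome.

    With signed=True, a B feature lying before A gets a negative
    distance, mirroring the "-D ref" output shown above.
    """
    if a_start < b_end and b_start < a_end:
        return 0  # the features share at least one base
    if b_start >= a_end:
        return b_start - a_end + 1  # B lies after A
    distance = a_start - b_end + 1  # B lies before A
    return -distance if signed else distance


print(bed_distance(0, 100, 90, 110))                  # overlapping
print(bed_distance(50, 150, 150, 180))                # book-ended, B after A
print(bed_distance(200, 250, 110, 200))               # book-ended, B before A
print(bed_distance(200, 250, 110, 200, signed=True))  # same pair, signed
print(bed_distance(100, 250, 400, 450))               # separated by a gap
```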


Many other options can be browsed in the help menu:

In [25]:
!bedtools closest --help


*****ERROR: Unrecognized parameter: --help *****


*****
*****ERROR: Need -a and -b files. 
*****

Tool:    bedtools closest (aka closestBed)
Version: v2.17.0
Summary: For each feature in A, finds the closest 
	 feature (upstream or downstream) in B.

Usage:   bedtools closest [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

Options: 
	-s	Req. same strandedness.  That is, find the closest feature in
		B that overlaps A on the _same_ strand.
		- By default, overlaps are reported without respect to strand.

	-S	Req. opposite strandedness.  That is, find the closest feature
		in B that overlaps A on the _opposite_ strand.
		- By default, overlaps are reported without respect to strand.

	-d	In addition to the closest feature in B, 
		report its distance to A as an extra column.
		- The reported distance for overlapping features will be 0.

	-D	Like -d, report the closest feature in B, and its distance to A
		as an extra column. Unlike -d, use negative distances t

## Other bedtools commands

We have shown only the tip of the iceberg of what bedtools can do. To get a taste of the other functionality, you can browse the documentation at https://bedtools.readthedocs.io/en/latest/ or simply enter "bedtools --help":

In [26]:
!bedtools --help

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from 