In [1]:
pip install bash_kernel



In [2]:
python -m bash_kernel.install

Installing IPython kernel spec


In [3]:
conda activate base



## 108 Introduction to ChIP-Seq analysis ##


**108.1 What is ChIP-Seq?**


ChIP-Seq stands for chromatin immunoprecipitation followed by sequencing. In a nutshell, the process consists of a laboratory protocol (abbreviated as ChIP) by the end of which the full DNA content of a cell is reduced to a much smaller subset of it. This subset of DNA is then sequenced (abbreviated as Seq) and is mapped against a known reference genome. Since you are now obtaining data from a potentially variable subset of the entire genome, it poses unique challenges to data interpretation, hence a whole subfield of data analytics has grown out around it.

Whereas the name appears to imply that the analysis methods "require" that the data was obtained via chromatin immunoprecipitation (ChIP) this is not necessarily so. Many other laboratory protocols where the net effect is that only a subset of the total DNA is selected for can be processed with ChIP-Seq methods.


**108.2 What are the challenges of ChIP-Seq?**


Perhaps one of the most significant problems of this entire subfield is the insufficiently well-defined terminology. See for example the section "What the heck is a peak" later on. Also, more so than in other analysis methods, there is a "subjectivity creep." Seemingly simple decisions are made all the time - and in the end they accumulate and slowly grow out of control as the analysis proceeds.

Examples of these insufficiently defined terms are for instance the concepts of "upstream" and "downstream" regions which typically refer to the genomic intervals that lie ahead of and after a particular genomic coordinate.

When referring to these ranges, you will need to associate a finite length to them: how far out do you still call a region "upstream"? The choice of this distance may, in turn, significantly alter the outcome of the analysis. You will likely obtain different results when the upstream region is defined to be 500 bp, 1000 bp or 5000 bp long.

Further complicating the picture is that different genomes have varying complexities; in some genomes the genes are densely packed, in others vast tracts are seemingly nonfunctional. Hence there is no "objective" measure by which a "reliable" length for an upstream or downstream region could be determined. The size of the "upstream" region is little more than a handwaving approximation.
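
To make the effect of this choice concrete, here is a minimal sketch. It assumes a hypothetical gene whose start sits at coordinate 12000 on the forward strand of chrI; the function name upstream_region and all coordinates are made up for illustration only.

```bash
# Print the upstream interval (chromosome, start, end) of a gene start
# for a chosen window length. The start is clamped at 0.
upstream_region() {
    # $1 chromosome, $2 gene start, $3 strand (+ or -), $4 window length
    awk -v chrom="$1" -v pos="$2" -v strand="$3" -v len="$4" 'BEGIN {
        if (strand == "+") { start = pos - len; end = pos }
        else               { start = pos; end = pos + len }
        if (start < 0) start = 0
        printf "%s\t%d\t%d\n", chrom, start, end
    }'
}

# The same hypothetical gene start with three different window lengths.
for len in 500 1000 5000; do
    upstream_region chrI 12000 + "$len"
done
```

Each window length yields a different interval, so any peak counted as "upstream" under one definition may fall outside the region under another.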

**108.3 What the heck is a peak?**


You will see how ChIP-Seq analysis is all about peaks - a term that, in our opinion, is adopted widely yet lacks any exact definition. Somewhat ironically speaking, perhaps the best description of a peak is the following:

A "peak" is what a "peak-finder" finds.

What makes the problem complicated is that we all think that we know what a "peak" is (or should be). A peak is a "shape" that rises sharply to a maximum then quickly drops back to zero. In that ideal case the peak consists of the coordinate of the maximum on the genome (an integer number) and a peak height.

As it happens, in reality, most of your signals may not follow such well-behaved patterns - the peaks may be a lot broader.

Now if the peak is wide you may need to compute the peak width that will indicate the range where the signal is present. Once you have a broader peak, representing it with its center may be counterproductive, as it is not clear that the signal will be symmetrical around the summit. Instead, you'll have to keep track of its right and left borders.

Just having a signal, of course, is insufficient. We may want to ensure that the signal rises above the background of the noise levels. Also, most studies are interested in how cells respond to stimuli that in turn will alter the signal. So the changes may occur by the peak rising in a different location or via a changed peak height, often denoted with the term differential peak expression.

You can see how a seemingly innocuous word: "peak" may require managing more information than you'd expect.
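
As a rough illustration of how much bookkeeping a single peak needs, the sketch below derives borders, summit and height from a made-up coverage profile. The values in SIGNAL and the BACKGROUND cutoff are assumptions chosen for the example, and the code presumes a single contiguous region above background; it is not how any particular peak finder works.

```bash
# Hypothetical coverage values at consecutive positions 1..12.
SIGNAL="0 1 1 4 7 9 8 8 3 1 0 0"
# Hypothetical background level; only values above it count as signal.
BACKGROUND=2

PEAK=$(echo "$SIGNAL" | awk -v bg="$BACKGROUND" '{
    for (i = 1; i <= NF; i++) {
        # Track the first and last position above background.
        if ($i + 0 > bg) { if (!left) left = i; right = i }
        # Track the highest value and where it occurs.
        if ($i + 0 > height) { height = $i + 0; summit = i }
    }
    print "left=" left, "right=" right, "summit=" summit, "height=" height
}')

echo "$PEAK"
```

Changing BACKGROUND moves the borders without touching the summit, which is one of the ways the subjectivity mentioned earlier creeps in.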

**108.4 How are RNA-Seq studies different from ChIP-Seq studies?**


Whereas both approaches appear to operate under similar constraints and count reads over intervals, there are several significant differences:

The DNA fragments coming from a ChIP-Seq study are much shorter than a transcript studied in RNA-Seq analysis. Instead of counting measurements distributed over transcripts, a single read may cover most of the fragment (or may even be longer than the original DNA fragment).

The DNA fragments of a ChIP-Seq experiment are less localized than transcripts. The cell machinery accurately locates and transcribes DNA, always starting from the same start locations. The ChIP-Seq protocol is a whole lot more inaccurate. You need to apply a peak-calling process to establish the origin of the fragments from imprecise data.

ChIP-Seq data produces a larger number of false positives. In a typical RNA-Seq experiment only transcripts are isolated, and we can be fairly sure that an observed sequence did exist as RNA. In contrast, ChIP-Seq protocols strongly depend on the properties of the reagents and on the selectivity and specificity of the binding process, and these properties will vary across different proteins and across the genome.

As a matter of fact, ChIP-Seq results need to be compared to a control (baseline) to verify that the protocol worked at all. Whereas in RNA-Seq we can immediately see success by having reads cover transcripts, in a ChIP-Seq study no such certainty exists. Also, when comparing peaks across conditions, each peak first has to pass the comparison to a background, and only those peaks that pass this first stage are compared to one another in a second stage. Since each statistical test is fraught with uncertainties, differential peak expression studies will often exhibit compounding errors.

**108.5 What will the data look like?**


For the record let's enumerate the possible observations for a given genomic location.

1. No signal of any kind.
2. A signal that shows no peak structure and appears to be noise.
3. A "peak like" signal that is reported to be under a background level.
4. A "peak like" signal that is reported to be over a background level.
5. A "peak like" signal that exhibits differential expression across samples.

Many (most) peak finders will start reporting data only from level 4 on - leaving you with many head-scratching moments of the form: "Why is this perfect looking peak not reported?". It all makes sense once you recognize that "peak finding", as the field currently defines it, combines finding the peaks and comparing them to a background. In addition, the statistical methods employed to determine these background levels have very high levels of uncertainty.

**108.6 What does a "peak" represent?**


Most ChIP-Seq guides start by explaining how ChIP-Seq works and how it allows us to identify where proteins were bound to DNA. We believe that it is far more important to first understand what your data stands for in a more straightforward way, then extend that to the laboratory process.

Imagine that we have DNA from a cell that looks like this:

AGTGATACCGGTCTAAGCCTCGCCTCTTCCACAGGGGAAACTAGGTGGCC
Now let's say some locations are "covered," say "protected" from an interaction. Let's visualize those locations via the = symbols.

         =========               ==========
AGTGATACCGGTCTAAGCCTCGCCTCTTCCACAGGGGAAACTAGGTGGCC
         =========               ==========
         
Suppose that we can wash away all unprotected regions. If we did that and then sequenced the remaining DNA, we'd be left with:

1 GGTCTAAGC
1 GGGGAAACTA
The first number indicates how many of those sequences were found. So far so good. Now let's take another cell from the same population. Suppose that the DNA for this new cell is also now covered, but in only one location:

         =========               
AGTGATACCGGTCTAAGCCTCGCCTCTTCCACAGGGGAAACTAGGTGGCC
         =========               
         
After washing away the uncovered regions we add the outcome to the prior results, and we have:


2 GGTCTAAGC
1 GGGGAAACTA
Finally, let's assume we have DNA from a third cell, this one has a protein bound in the second location.

                                 ==========
AGTGATACCGGTCTAAGCCTCGCCTCTTCCACAGGGGAAACTAGGTGGCC
                                 ========== 
Thus the cumulative data that we obtain after washing away unprotected regions from three cells would be:

2 GGTCTAAGC
2 GGGGAAACTA 

In a ChIP-Seq experiment, instead of sequencing the entire genomes from three different cells, we only sequence relatively short pieces of DNA (in reality the sequences are longer than the example sequences above but still radically shorter than, say, a transcript). The alignments that we generate will pass through a "peak finding" program that will attempt to reconstruct which locations had sequences mapping to them, and how many, from each of the small sequences we have observed. If everything goes well, the ideal outcome of our analysis for the data above will produce:

A peak of height 2 that starts at coordinate 10 and ends at coordinate 18
A peak of height 2 that starts at coordinate 34 and ends at coordinate 43

Do note, however, that the DNA from one cell had two regions covered, whereas the two other cells had only one zone each. Our peaks above do not indicate this. They represent a superposition of configurations across all cells. Other configurations leading to the same picture are possible; there is no way to know whether the covered regions appear together or separately in the original DNA molecule.

Essential to remember:

The peaks are cumulative, superimposed measures of the observed configurations across millions of different cells.
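
The tally from the three example cells can be reproduced with a few standard tools. This is only a sketch of the idea, not a peak finder: it assumes each fragment occurs exactly once in the example genome and reports 1-based, inclusive coordinates.

```bash
# The example genome from the text.
GENOME=AGTGATACCGGTCTAAGCCTCGCCTCTTCCACAGGGGAAACTAGGTGGCC

# Fragments recovered from the three example cells, one per line.
FRAGMENTS="GGTCTAAGC
GGGGAAACTA
GGTCTAAGC
GGGGAAACTA"

# Count identical fragments, then report each as: height, start, end.
PEAKS=$(printf '%s\n' "$FRAGMENTS" | sort | uniq -c |
    awk -v genome="$GENOME" '{
        start = index(genome, $2)
        print $1, start, start + length($2) - 1
    }')

echo "$PEAKS"
```

Note that the output contains no trace of which cell contributed which fragment: the information about co-occurrence is lost the moment the fragments are pooled.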

**108.7 Do peaks represent binding?**


Now, up to this point, we avoided the word "binding" altogether - yet the purpose of ChIP-Seq is to measure exactly that. A protein bound to a DNA molecule can "protect" it from being washed away - then in a second "pulldown" stage, only the DNA bound to a particular protein is kept. Hence under the ideal condition, the result of a ChIP-Seq experiment contains DNA that was protected by one specific type of protein.

The reality is a bit different. Remember, your ChIP-Seq data does not detect binding. It recognizes the abundance of DNA fragments in a pool of DNA molecules. Each measurement indicates that a given DNA fragment was present when all the DNA was combined. As it turns out in practice, there may be several other reasons for DNA to be present in your library. Control samples are almost always required (see the question on controls below).

**108.8 How do we determine peaks from sequencing data?**


There is a tendency to gloss over this step, relying on the "peak finder" and then using the results that it produces. Once you think about the problem, you'll see how the peak finding process is more counterintuitive than you'd imagine.

Remember how a sequencing instrument produces a linear sub-measurement of DNA fragments? This measured "read" has a length associated with it, but this length has nothing to do with the original fragment length.

If we mark the fragment with = symbols, then we can see that, depending on the DNA fragment lengths and our measurement (read) lengths, the measured reads could fall into three different configurations.

        ========= DNA FRAGMENT =========
        
        # Reads much shorter than fragment.
        ------->                 <------
        
        # Reads as long as the fragment.
        ------------------>
                     <------------------
         
        # Reads longer than the fragment
        ------------------------------------->
   <-------------------------------------


Three different situations occur.

1. The read length is much smaller than the fragment.
2. The read length is about the same size as the DNA fragment.
3. The read length is larger than the DNA fragment and runs past the fragment ends.

As a result, the analysis methods fall into two categories, corresponding to cases 1 and 3, as the second case occurs very rarely. Often these are called "narrow" peak and "broad" peak methods.

Be advised that a different mindset and approach is required for each situation - methods that fit broad peaks are likely to work sub-optimally for narrow ones and vice versa.

**108.9 What does ChIP-Seq data measure?**


If you refer to the sketch above, only one thing is common across all data sets. The starts of the reads that align on the forward strand indicate the left border of the DNA fragment. The starts of the reads that align on the reverse strand indicate the right border of the DNA fragment.

Hence we can state the fundamental "law" of ChIP-Seq data:

ChIP-Seq data measure the outer borders, the "edges" of the DNA fragments.

Of course, this is not all that great news. It makes the picture a whole lot more complicated. Let me show you what this means. Imagine that you see coverage that looks like the image below. It seems so clear cut: there is one binding event, in the middle, indicated by the highest point of the peak.


Sadly this is misleading. The place where the fragments appear to overlap the most may be a superposition of many different, alternative positions. In this particular case, if you were to break down the data by strand and plot only the coverages of the 5' ends for each strand separately, you would see the following coverages:


It is the superposition of these two lower tracks that gives the top coverage. But when plotted this way you can clearly see at least two distinct peaks. Remember, these tracks now show what the instrument measures: the ends of the DNA fragments. It appears that there are a few distinct fragments (positions) on the DNA. Let me mark these positions.

If we trust that our ChIP-Seq method measures bound locations, then these are the locations occupied by our protein. But note how none of them are bound in the location that the peak maxima indicate. If location accuracy is critical, the peaks must be found by pairing up the peaks called separately from the forward and reverse strands to attempt to reconstitute the original DNA fragments, rather than the center of each peak.
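
The strand-by-strand breakdown described above can be sketched on toy data. The reads below are hypothetical BED-like records (chromosome, start, end, name, score, strand) invented for the example; on real data a dedicated tool would do the same job on the BAM file.

```bash
# Hypothetical aligned reads: chrom, start, end, name, score, strand.
READS=$(printf '%s\n' \
    "chrI 100 150 r1 0 +" \
    "chrI 100 150 r2 0 +" \
    "chrI 130 180 r3 0 +" \
    "chrI 210 260 r4 0 -" \
    "chrI 210 260 r5 0 -" \
    "chrI 240 290 r6 0 -")

# For every read keep only its 5-prime end: the start for forward reads,
# the end for reverse reads. Then count how many reads share each edge.
EDGES=$(printf '%s\n' "$READS" |
    awk '{ edge = ($6 == "+") ? $2 : $3; print $6, edge }' |
    LC_ALL=C sort -k1,1 -k2,2n | uniq -c | awk '{ print $2, $3, $1 }')

# Columns: strand, edge coordinate, number of reads.
echo "$EDGES"
```

In this toy case the forward edges at 100 and 130 would pair up with the reverse edges at 260 and 290, suggesting two overlapping fragments rather than a single event at the midpoint of the pile-up.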


If peak location is not all that important, you may be able to take the midpoint of the original peak.

Much algorithmic effort is spent to create software that automatically builds the "consensus" location out of data. Some tools work better than others - none of them can handle all the situations, and much subjectivity remains. The type of the data and its properties, and whether or not a superposition of locations is present, will all substantially alter the performance of the method you choose.

**108.10 What type of ChIP-Seq studies are performed?**


ChIP-Seq studies fall into two major categories. We describe the extremes here; your study will likely fall somewhere in between but closer to one of them:

- The Theory of Everything: these are large-scale studies, usually studying millions of locations, that often make sweeping, large-scale generalizations on where certain positions are relative to others. The principle "gain some, lose some" drives the methods. The data analysis typically processes all peaks with the same parameters, producing very high rates of false positives and false negatives. The goal of these studies is to beat random chance - which is a very low bar to pass.
- The Theory of One Peculiar Thing: these are focused studies that use ChIP-Seq surveys to attempt to explain one particular phenomenon. The data often gets reduced to just a few locations where very high accuracy is required.

The needs of the two approaches are radically different. As a rule, if you need to accurately determine the "peak" at just a few locations the best tool that you can use is YOUR EYE! We make use of "peak callers" primarily because we can't afford to investigate every single location.

**108.11 Does the data need to be compared to control?**


Yes. The ChIP-Seq protocols have turned out to be affected by many more systematic errors than initially thought. Hence control samples are needed to identify these errors.

Typically two kinds of control samples are used: IgG control and input control. Each control can be thought of as an incomplete ChIP-Seq protocol, where the process intentionally skips a step. These controls attempt to identify regions that are enriched by other processes beyond the protein being bound to the DNA.

- The IgG control is DNA resulting from a "mock" ChIP with an Immunoglobulin G (IgG) antibody, which binds to a non-nuclear antigen.
- Input control is DNA purified from cells that are cross-linked and fragmented, but without adding any antibody for enrichment.

One problem with IgG control is that if too little DNA is recovered after immunoprecipitation (IP), the sequencing library will be of low complexity (less diverse) and binding sites identified using this control could be biased.

Input DNA control is thought to be ideal in most cases. It represents all the chromatin that was available for IP. More information on choosing controls can be found in:

- Biostars post: What Control For Chip-Seq: Input, Igg Or Untagged Strain?
- Nature, 2015: ChIP-Seq: technical considerations for obtaining high-quality data

**108.12 Should my ChIP-Seq aligner be able to detect INDELs?**


Absolutely YES!

One of the most misleading pieces of advice - one that is reiterated even in peer-reviewed publications like Practical Guidelines for the Comprehensive Analysis of ChIP-Seq Data, PLoS Comp. Biology 2014 - stems from a fundamental misunderstanding of how "alignment" is different from "mapping."

We described the duality before in the alignment section, stating that "mapping" refers to finding the genomic location where a read originates from, whereas "alignment" refers to finding the place and the arrangement relative to the genome. Since ChIP-Seq methods need to identify genomic locations, many wrongly assume that alignment methods that cannot handle insertions and deletions are still acceptable, since we don't care about the exact alignment anyway.

It is time to put this misconception out of its misery as it has already caused and continues to produce substantial wasted efforts. So let's make it abundantly clear:

An aligner that cannot handle insertions and deletions will not be able to find the correct coordinates when a measurement contains such differences. The "mapping" produced by such aligners over the reads in question will be severely and systematically affected.

As a rule, aligners that cannot detect insertions and deletions should NEVER be used under ANY circumstances - unless no other aligner can be utilized (very rare cases where other aligners would not work at all).
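
To see why gaps change coordinates, consider the number of reference bases an alignment spans. In the SAM format the CIGAR operations M, D, N, = and X consume the reference, while I, S, H and P do not. The sketch below is a minimal illustration with a hypothetical 50 bp read placed at position 1000; the function name ref_span is ours and not part of any tool.

```bash
# Number of reference bases covered by an alignment, from its CIGAR string.
ref_span() {
    awk -v cigar="$1" 'BEGIN {
        span = 0
        # Peel off one "<number><operation>" pair at a time.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MDN=X]/) span += n   # reference-consuming operations
            cigar = substr(cigar, RLENGTH + 1)
        }
        print span
    }'
}

# Rightmost reference coordinate of a hypothetical read starting at 1000.
POS=1000
echo "ungapped  50M:      $((POS + $(ref_span 50M) - 1))"
echo "deletion  20M5D30M: $((POS + $(ref_span 20M5D30M) - 1))"
echo "insertion 20M5I25M: $((POS + $(ref_span 20M5I25M) - 1))"
```

The three alignments end at different coordinates even though the read is the same length in each case. Since the right edge of a reverse-strand read marks a fragment border, an aligner that cannot represent the gap will report a shifted border for such reads.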



**108.13 What are the processing steps for ChIP-Seq data?**


The general processing steps are as follows:

1. Visualize and, if necessary, correct the quality of the sequencing data.
2. Align sequencing reads to a reference genome. The most popular read aligners are bwa and bowtie.
3. Call peaks from the alignment BAM files.
4. Visualize the resulting peak files and signal files.
5. Find the biological interpretation of the positions where the peaks are observed.



## 109 Aligning ChIP-Seq data ##



As we explained before chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-Seq) is a technique that is used to identify locations in the genome, for example, the footprints of genome-wide transcription factor binding sites and histone modification enriched regions.

We will demonstrate the steps used to process ChIP-Seq data through a re-analysis of a published scientific work.

**109.1 Which paper is going to be re-analyzed?**


To illustrate all the analysis steps, we are going to use data from the following publication:

Genome-Wide Mapping of Binding Sites Reveals Multiple Biological Functions of the Transcription Factor Cst6p in Saccharomyces cerevisiae. published in MBio in 2016

The data corresponding to this paper can be found in the SRA Bioproject:

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA306490
The script that will reproduce this paper is located in http://data.biostarhandbook.com/chipseq/

This section provides an annotated guide to the commands in the chip-exo-redo.sh script. This script makes use of GNU Parallel to execute multiple commands at a time (up to the number of CPUs on the system).


**109.2 How do I obtain the data for project PRJNA306490?**


Make a few directories to store data in, then fetch the run information and the reads:



In [4]:
mkdir -p data
mkdir -p refs
mkdir -p bam

# Get the run information for the project.
esearch -db sra -query PRJNA306490 | efetch -format runinfo > runinfo.csv

# The file contains information for four SRR runs.
cat runinfo.csv | cut -d , -f 1,2,3 | head

# You could run the following command on each SRR run:
fastq-dump -O data --split-files -F SRR3033154

# Instead, store the run ids in a file.
cat runinfo.csv | cut -f 1 -d , | grep SRR > runids.txt

# Download the fastq files for all runs in parallel.
cat runids.txt | parallel --eta --verbose "fastq-dump -O data --split-files -F {}"

Run,ReleaseDate,LoadDate
SRR3033154,2016-04-08 12:47:06,2015-12-18 14:04:38
SRR3033155,2016-04-08 12:47:06,2015-12-18 14:04:03
SRR3033156,2016-04-08 12:47:06,2015-12-18 14:04:53
SRR3033157,2016-04-08 12:47:06,2015-12-18 14:06:06

Read 12222009 spots for SRR3033154
Written 12222009 spots for SRR3033154

**109.3 How do I find out more information about each sample?**


While the runinfo.csv contains quite a bit of information, maddeningly it lacks what arguably would be the most important detail: which SRA number corresponds to the samples mentioned in the paper. We need to run the Entrez Direct tool called esummary to connect run ids to sample names:



In [5]:
esearch -db sra -query PRJNA306490  | esummary > summary.xml


The XML file returned by this query has a nested structure with a wealth of information that can be difficult to extract. Sometimes the easiest approach is to read it by eye and manually match the information. This might work here since there are only four samples. But there are ways to automate this data extraction.

The tool called xtract, also part of EDirect, is meant to facilitate turning XML files into tabular files. It operates by parsing an XML file, then matching, extracting and printing elements from this file.

Now, XML is an industry standard with methods to do just this: there is the so-called XPath query language and the XSL styling language that could be used to turn XML documents into any other format. xtract has nothing to do with either of these - yet it is used to achieve the same. It is, in a way, a re-imagining of what XPath and XSL do - but this time filtered through a bioinformatician's mind.

For example, the following command generates a new row for each element named DocumentSummary, then extracts the acc attribute of the Run element and prints that with the title. Investigate the XML file to see where this information is present.

In [6]:

cat summary.xml | xtract -pattern DocumentSummary -element Run@acc Title


SRR3033157	GSM1975226: Cst6p_ethanol_rep2; Saccharomyces cerevisiae; ChIP-Seq
SRR3033156	GSM1975225: Cst6p_ethanol_rep1; Saccharomyces cerevisiae; ChIP-Seq
SRR3033155	GSM1975224: Cst6p_glucose_rep2; Saccharomyces cerevisiae; ChIP-Seq
SRR3033154	GSM1975223: Cst6p_glucose_rep1; Saccharomyces cerevisiae; ChIP-Seq
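
The titles printed by xtract can be broken down further into a small design table. The sketch below works on a copy of the four output lines shown above rather than on summary.xml, and it assumes that every title follows the same "GSM id: name_condition_replicate; organism; assay" layout.

```bash
# The xtract output from above, reproduced as tab-separated text.
SUMMARY=$(printf '%s\t%s\n' \
    SRR3033157 "GSM1975226: Cst6p_ethanol_rep2; Saccharomyces cerevisiae; ChIP-Seq" \
    SRR3033156 "GSM1975225: Cst6p_ethanol_rep1; Saccharomyces cerevisiae; ChIP-Seq" \
    SRR3033155 "GSM1975224: Cst6p_glucose_rep2; Saccharomyces cerevisiae; ChIP-Seq" \
    SRR3033154 "GSM1975223: Cst6p_glucose_rep1; Saccharomyces cerevisiae; ChIP-Seq")

# Keep the run id, then split the sample name into condition and replicate.
DESIGN=$(printf '%s\n' "$SUMMARY" |
    awk -F '\t' '{
        split($2, part, /[:;] */)   # part[2] is for example Cst6p_ethanol_rep2
        split(part[2], name, "_")   # name[2] is the condition, name[3] the replicate
        print $1, name[2], name[3]
    }' | sort)

echo "$DESIGN"
```

The result lists each run id next to its condition and replicate, which is the mapping needed later when replicates are grouped by condition.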

**109.4 What other information do I need to start a ChIP-Seq analysis?**


You will now need a genome sequence and an annotation file that lists the genes.

Sometimes (often?) it is surprisingly complicated to get the latest information even from websites dedicated to the distribution of data. As you will note for yourself, the data always seems to be either in the wrong format or combined in ways that are inappropriate to your current needs.

The data sources claim that it is a matter of "just downloading it" yet soon you'll be knee-deep in elbow-grease as you wrestle in the mud with uncooperative file formats. For example, SGD claims that the latest genome releases can be downloaded from http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/. As you happily and cheerfully download that data, you realize that the chromosome naming conventions are different across the sequence and the annotations. For example, the FASTA file will have the sequence names listed as:




**109.5 How do I get the "standard" yeast genome data and annotation?**


The paper that we are reanalyzing makes use of the most up-to-date UCSC yeast genome release, code named sacCer3, from the UCSC download site. Some might balk at knowing that this release dates back to 2011. Quite a bit of knowledge could have been accumulated since then. One advantage, though, is that many tools, like IGV, already know of this genomic build and no custom genome building is necessary.

Let's get the genome:

In [None]:
# Reference genome.
REF=refs/saccer3.fa

URL=http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz
curl $URL | tar zxv

# Get the chromosome sizes. Will be used later.
curl http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes > refs/sacCer3.chrom.sizes

# Move the files
mv *.fa refs

# Create genome.
cat refs/chr*.fa > $REF


# Remember to index the reference.
bwa index $REF
samtools faidx $REF

# Align a single sample in single end mode, using file 1 only.
bwa mem $REF data/SRR3033154_1.fastq | samtools sort > bam/SRR3033154.bam

# Align all samples in parallel.
cat runids.txt | parallel --eta --verbose "bwa mem -t 4 $REF data/{}_1.fastq | samtools sort -@ 8 > bam/{}.bam"



**109.6 How do I align the ChIP-Seq data?**


We assume that by this time your fastq-dump programs above have finished. The alignment commands, already listed at the end of the code cell in the previous section, are all of the form bwa mem $REF data/RUN_1.fastq | samtools sort > bam/RUN.bam, where RUN stands for a run id such as SRR3033154.

Note how this data is aligned in single end mode and only file 1 is used. We too were surprised to learn this from the authors. Their rationale is that they are using a ChIP-Exo method that sequences only a short segment of the fragment. We don't quite believe this rationale to be correct - suffice it to say that only half of the data was used and half of the data was tossed out.

We can automate the alignments as a single line: the GNU Parallel command that closes the same code cell.

**109.7 Do I need to trim data to borders?**


As we mentioned before, ChIP-Seq data in general and ChIP-Exo in particular measure the left and right borders of the binding sites rather than their centers. Even though tools should be able to reconstruct these borders algorithmically from the entire alignments, sometimes (often) it is best to not even "show" this spurious data to the algorithm, and to keep only the essential sections of the data - in this case, the borders only.

That is what the authors chose here, trimming 70 bp from the ends of each alignment using the bamUtil tool:

In [None]:

# Trim each bam file.
cat runids.txt | parallel --eta --verbose "bam trimBam bam/{}.bam bam/temp-{}.bam -R 70 --clip"

# We also need to re-sort the alignments.
cat runids.txt | parallel --eta --verbose "samtools sort -@ 8 bam/temp-{}.bam > bam/trimmed-{}.bam"

# Get rid of temporary BAM files.
rm -f bam/temp*

# Reindex trimmed bam files.
cat runids.txt | parallel --eta --verbose "samtools index bam/trimmed-{}.bam"
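
The effect of such trimming on the coordinates can be sketched with toy intervals. This is an illustration only, not bamUtil's implementation: it assumes strand-aware trimming that removes bases from the 3' end of each read, and the function name trim_three_prime and the two records are made up.

```bash
# Keep only the 5-prime border of an alignment after removing `trim` bases
# from its 3-prime end. Input and output: chrom, start, end, strand.
trim_three_prime() {
    awk -v trim="$1" -v OFS='\t' '{
        if ($4 == "+") $3 = $3 - trim      # forward read: cut the right side
        else           $2 = $2 + trim      # reverse read: cut the left side
        if ($2 < $3) print $1, $2, $3, $4  # drop reads trimmed to nothing
    }'
}

# Two hypothetical 100 bp alignments, one per strand.
printf 'chrI\t100\t200\t+\nchrI\t500\t600\t-\n' | trim_three_prime 70
```

After removing 70 bp, each 100 bp alignment shrinks to the 30 bp next to its 5' edge: the left side for the forward read and the right side for the reverse read.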


**109.8 How do I visualize the alignments?**


ChIP-Seq data is all about coverages and locations. As such, the details of alignments are usually less important. Visualizing BAM files directly puts too much strain on visualizers, as often many samples need to be viewed simultaneously.

A bedgraph file is a variant of a BED file that is meant to display coverages over intervals. It is a simple tab-separated format with four columns: chromosome, start, end and coverage value. The commands below create such a file:

In [None]:
# Create a genome file that bedtools will require.
samtools faidx $REF

# Create a sorted bedgraph file out of the BAM file.
bedtools genomecov -ibam bam/SRR3033154.bam -g $REF.fai -bg | LC_ALL=C sort -k1,1 -k2,2n > bam/SRR3033154.bedgraph

# Convert the bedgraph to bigwig. The second argument lists the chromosome sizes.
bedGraphToBigWig bam/SRR3033154.bedgraph $REF.fai bam/SRR3033154.bw


Bedgraph files are smaller than BAM files and may be visualized and processed with many other tools.
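
To show what the rows of a bedgraph mean, the sketch below computes coverage for three made-up alignments on a single chromosome using only awk. It is a toy stand-in for the genomecov step, assuming 0-based starts and ends that are not included in the interval.

```bash
# Hypothetical alignments as intervals: chromosome, start, end.
ALIGNMENTS=$(printf '%s\t%s\t%s\n' \
    chrI 0 5 \
    chrI 3 8 \
    chrI 10 12)

# Count the depth at every base, then merge neighbouring bases of equal
# depth into rows of: chromosome, start, end, coverage.
BEDGRAPH=$(printf '%s\n' "$ALIGNMENTS" | awk -v OFS='\t' '
    {
        chrom = $1
        for (i = $2 + 0; i < $3 + 0; i++) depth[i]++
        if ($3 + 0 > max) max = $3 + 0
    }
    END {
        start = 0
        for (i = 1; i <= max; i++) {
            if (i == max || depth[i] + 0 != depth[start] + 0) {
                # Regions without coverage are skipped.
                if (depth[start] + 0 > 0) print chrom, start, i, depth[start]
                start = i
            }
        }
    }')

echo "$BEDGRAPH"
```

Each row covers a stretch of constant depth, which is why the file carries coverage information only and stays much smaller than the alignments it was derived from.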

Below is a figure of visualizing bedgraph and BAM files simultaneously. The upper track is bedgraph, the lower two panels are the coverage and alignment sections of a BAM file.


As it happens, the bedgraph is not quite efficient either: IGV has to load the entire file into memory to visualize it. Yet another file format had to be invented (and we'll spare you the gory details), one that is called bigwig. A bigwig is a binary format that contains the same information as a bedgraph but allows for efficient access, hence it allows software tools to jump to various locations in the file more quickly. This format, by the way, was invented by Jim Kent, the "benevolent dictator" of the UCSC genome browser.

The format itself is not documented or described properly - a single program code, also written by Jim Kent, was the sole source of understanding how the format works. We invite the readers to reflect on this a bit.

The tool called bedGraphToBigWig in Jim Kent's utilities can perform this transformation from bedgraph to bigwig:



In [None]:

#We can automate these steps as well:

# Create the coverage files for all BAM files.
ls bam/*.bam | parallel --eta --verbose "bedtools genomecov -ibam {} -g $REF.fai -bg | sort -k1,1 -k2,2n > {.}.bedgraph"

# Generate all bigwig coverages from bedgraphs.
ls bam/*.bedgraph | parallel --eta --verbose "bedGraphToBigWig {} $REF.fai {.}.bw"



Let's recapitulate what each file contains and what their sizes are:

- SRR3033154.sam (if generated) would be 5.3GB, plain text, has alignment details, no random access.
- SRR3033154.bam is 423MB, binary compressed format, has alignment details, allows random access.
- SRR3033154.bedgraph is 34MB, plain text, coverage info only, loads fully into memory.
- SRR3033154.bw is 8.1MB, binary format, coverage info only, allows random access.

We went from 5.3GB to 8.1MB, an approximately 650-fold data size reduction. Hence, in general, when we intend to visualize these ChIP-Seq alignment files we transform them from BAM files into bigwig files and subsequently display only these. Note that this same approach should be used when visualizing RNA-Seq coverage data as well.



**109.9 Are there other ways to generate bigwig files?**


There are other tools that can generate bigwig files directly from BAM files. The package that contains them is called deeptools and may be installed with:



In [None]:
# Merge the replicates of each condition, then index the merged files.
samtools merge -r bam/glucose.bam bam/SRR3033154.bam bam/SRR3033155.bam
samtools merge -r bam/ethanol.bam bam/SRR3033156.bam bam/SRR3033157.bam
samtools index bam/glucose.bam
samtools index bam/ethanol.bam

# Generate the coverages for each. Here you may skip making bigwig files since
# these files are not as large and IGV will handle them as is.

bedtools genomecov -ibam bam/glucose.bam  -g $REF.fai -bg > bam/glucose.bedgraph
bedtools genomecov -ibam bam/ethanol.bam  -g $REF.fai -bg > bam/ethanol.bedgraph
