<h1 style="font-size: 40px; margin-bottom: 0px;">7.1 ChIP-seq alignment</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Today, we'll align our datasets to the genome using <code>bowtie2</code>. After the alignments finish, we'll look at the files the aligner generates, and then we'll load them into the Integrative Genomics Viewer (IGV) to see where reads have aligned to the genome. We'll explore how the reads can be visualized, and how we can begin to see reads piling up on either side of the binding site of our protein of interest.

You can follow along by running the command-line steps in this notebook, but some steps will need to run at the same time as others, so you'll also want a Terminal window open where you can run commands in the background.

<strong>Learning objectives:</strong>
<ul>
    <li>Learn to use bowtie2 to perform alignments</li>
    <li>Explore bowtie2 outputs</li>
    <li>Learn to use samtools to work with alignment outputs</li>
    <li>Learn to use IGV to visualize aligned reads</li>
</ul>

<h1 style="font-size: 40px; margin-bottom: 0px;">ChIP-seq alignment with <code>bowtie2</code></h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Today in class, we'll align our four truncated datasets to our reference genome, and your homework this week will be to align the full datasets. Please note that aligning the full datasets can take quite a long time (roughly 3 hours).

Since Zanconato et al 2015 aligned their ChIP-seq data to the human genome build hg19, we'll be doing the same. We'll be making use of an aligner that Dr. Ingolia talked about in lecture, <code>bowtie2</code>. <a href="https://bowtie-bio.sourceforge.net/bowtie2/index.shtml" rel="noopener noreferrer" target="_blank"><u>Documentation for <code>bowtie2</code> can be found here.</u></a>

<h2>Alignment</h2>

We'll do things slightly in reverse today to save time, since even with the truncated dataset the alignment takes about 20-30 minutes to run. We'll first start an alignment run, and then go through the technical details of the alignment while it runs in Terminal.

<h3>Prepare directory to receive alignment outputs</h3>

Let's go ahead and prepare a new directory <code>alignment</code> that can receive our output, which will be our alignment files, and once we've made <code>alignment</code>, let's change into it. 
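One way to do this is sketched below. The <code>-p</code> flag is an addition here, not something the step requires; it simply keeps <code>mkdir</code> from complaining if <code>alignment</code> already exists.

```shell
# Create the output directory (no error if it already exists)
mkdir -p alignment

# Move into it so that later output files land here
cd alignment

# Confirm where we are
pwd
```

Keep in mind that in a <code>%%bash</code> cell, a <code>cd</code> only lasts until the end of that cell, whereas in Terminal it carries over to the commands that follow.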

In [None]:
%%bash

<h3>Align sequence data to hg19 reference genome</h3>

Let's align our first sequence file using <code>bowtie2</code>.

In [None]:
%%bash

<h2>Bowtie2</h2>

While that's running, let's review alignments, specifically looking at <code>bowtie2</code>.

Aligning sequences to the genome can be quite tricky and computationally intensive. Recall from lecture that the sequences may not always be a perfect match, and in some cases, short gaps may need to be introduced to get a good alignment (<strong>Fig 1</strong>). Moreover, the sequence reads that we are trying to align are usually around 50-150 bp long, with millions of reads in each sequence file, and each of those reads needs to be aligned to the reference genome, which for humans is approximately 3 billion base pairs.

<h4 style="text-align: center;"><strong>Fig 1</strong></h4>
<img src="./images/7_1_fig_1.png" style="height: 250px; margin: auto;"/>
<p style="text-align: center;">Image from: Harvard Chan Bioinformatics Core</p>

<h3>Reference genome and index</h3>

As Dr. Ingolia mentioned in lecture, alignment software uses a genome index. While you can, in theory, align directly against the reference genome, converting the reference genome into an organized index enables much faster searches for alignments. As a result, <code>bowtie2</code> can align sequence reads efficiently and relatively quickly. To keep its memory usage low, <code>bowtie2</code> uses an FM index (which is based on the Burrows-Wheeler Transform mentioned in lecture).

Normally, before you perform your alignments, you need to build an index from your reference genome, which can take some time, especially for larger genomes. Fortunately for us, the <code>bowtie2</code> documentation provides a link to a pre-built index for hg19. The index files can be found in the shared folder for this course, and we pointed <code>bowtie2</code> to them when we ran our first alignment.

<h3>Alignment</h3>

To get an idea of how to use <code>bowtie2</code> to align our sequences, let's take a look at how its documentation describes the command:

<pre style="width: 700px; margin-top: 15px; margin-bottom: 15px; color: #000000;background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">&dollar;bowtie2 &lbrack;options&rbrack;* -x &lt;bt2-idx&gt; {-1 &lt;m1&gt; -2 &lt;m2&gt; &vert; -U &lt;r&gt; &vert; --interleaved &lt;i&gt; &vert; -b &lt;bam&gt;} &lbrack;-S &lt;sam&gt;&rbrack;</pre>

So to use the bowtie2 aligner, you first specify the <code>bowtie2</code> command followed by options and arguments.

<h2>Let's break down the code:</h2>

Now let's go back and take a look at the code that we ran to align our <code>10M_ctrl_1.fastq</code> sequence file and break down the input with respect to the information we pulled from the manual.

<code>bowtie2</code> 

This is the command to call up the aligner.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-x ~/shared/2025-fall/courses/1547808/chip-seq/chip-index/hg19</code> 

This argument specifies the file path to where the index files are located along with the basename of the index files.

So for our input, we provided the file path to where the index files were located <code>~/shared/2025-fall/courses/1547808/chip-seq/chip-index/</code> along with the basename of our index files <code>hg19</code>.
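A quick way to see why only the basename is needed is to list every file that starts with it. The sketch below assumes the pre-built index uses bowtie2's usual <code>.bt2</code> file suffixes (or <code>.bt2l</code> for large indexes); <code>list_index_files</code> is a hypothetical helper name, not part of <code>bowtie2</code>.

```shell
# Hypothetical helper: list the files that make up a bowtie2 index,
# given the same path-plus-basename that is passed to -x
list_index_files() {
    ls "$1".*.bt2* 2>/dev/null
}

index=~/shared/2025-fall/courses/1547808/chip-seq/chip-index/hg19

# Prints the index files if the shared folder is available
list_index_files "$index" || echo "No index files found for basename: $index"
```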

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-U ~/shared/2025-fall/courses/1547808/chip-seq/truncated/10M_ctrl_1.fastq</code> 

This argument specifies the file that you want to align. <code>bowtie2</code> can align paired-end reads (<code>-1 &lt;m1&gt; -2 &lt;m2&gt;</code>), unpaired reads (<code>-U &lt;r&gt;</code>), interleaved <code>.fastq</code> files (<code>--interleaved &lt;i&gt;</code>), and unaligned BAM files (<code>-b &lt;bam&gt;</code>).

So in our case, the ChIP-seq dataset that we're working with is unpaired, so we can use <code>-U</code> to specify that we're providing unpaired sequence reads followed by the file name for our sequence file.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-S 10M_ctrl_1.sam</code>

Here, we specify our output file name, and if we want to save the SAM file in a directory that is not our current working directory, we can include the file path as well. But since we'll be saving the output to our current directory, we can just specify the file name.
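Putting the pieces above together, the full command can be sketched as follows. The variable names and the <code>command -v</code> guard are additions for illustration; the guard just prints the command instead of running it when <code>bowtie2</code> or the reads file is not available.

```shell
# Index basename, input reads, and output file, as described above
index=~/shared/2025-fall/courses/1547808/chip-seq/chip-index/hg19
reads=~/shared/2025-fall/courses/1547808/chip-seq/truncated/10M_ctrl_1.fastq
out=10M_ctrl_1.sam

if command -v bowtie2 >/dev/null 2>&1 && [ -f "$reads" ]; then
    # -x: index basename, -U: unpaired reads, -S: SAM output file
    bowtie2 -x "$index" -U "$reads" -S "$out"
else
    echo "bowtie2 or the reads file was not found; the command would be:"
    echo "bowtie2 -x $index -U $reads -S $out"
fi
```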

<h2>Perform alignment for 10M_taz_1</h2>

Now that we've done an example for a control file, let's run an alignment using the 10M_taz_1 dataset <u>in the background</u> by adding an <code>&amp;</code> at the end of our input. Once everyone's alignment is running, we'll continue with the next section, while <code>bowtie2</code> runs in the background.

If time permits, we can also run alignments for the other <code>.fastq</code> files, but for now, we'll just run this one.
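A minimal sketch of a background run is shown below. It assumes <code>10M_taz_1.fastq</code> sits in the same <code>truncated</code> folder as the control file, and the log file name <code>10M_taz_1.log</code> is a made-up choice. Redirecting the output to a log keeps bowtie2's messages from interrupting whatever we type next.

```shell
index=~/shared/2025-fall/courses/1547808/chip-seq/chip-index/hg19
# Assumed location: same folder as 10M_ctrl_1.fastq
reads=~/shared/2025-fall/courses/1547808/chip-seq/truncated/10M_taz_1.fastq

if command -v bowtie2 >/dev/null 2>&1 && [ -f "$reads" ]; then
    # The trailing & sends the job to the background
    bowtie2 -x "$index" -U "$reads" -S 10M_taz_1.sam > 10M_taz_1.log 2>&1 &
    echo "bowtie2 is running in the background (PID $!)"
else
    echo "bowtie2 or $reads not found; nothing was started"
fi
```

In Terminal, <code>jobs</code> lists the background jobs that are still running, and <code>wait</code> pauses until they finish.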

In [None]:
%%bash

<h1 style="font-size: 40px; margin-bottom: 0px;">Exploring <code>bowtie2</code> output</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<h2>Reading the SAM file</h2>

Recall from lecture that SAM files have a specific structure (<strong>Fig 2</strong>): each aligned sequence read retains its original identifying information, but that information is now associated with a specific genomic location and is arranged slightly differently.

<h4 style="text-align: center;"><strong>Fig 2</strong></h4>
<img src="./images/7_1_fig_2.png" style="height: 300px; margin: auto;"/>
<p style="text-align: center;">Image from Dr. Nick Ingolia</p>

Let's take a look at our <code>10M_ctrl_1.sam</code> output file generated by <code>bowtie2</code>.
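Because the SAM file is large, it helps to print only a slice of it. The sketch below uses two hypothetical helper functions (the names are ours) built from <code>grep</code> and <code>head</code>: one counts the header lines, and the other prints the first few alignment records.

```shell
# Count header lines (lines that start with @)
sam_header_count() {
    grep -c '^@' "$1"
}

# Print the first N alignment records (default 5), skipping the header
sam_preview() {
    grep -v '^@' "$1" | head -n "${2:-5}"
}

if [ -f 10M_ctrl_1.sam ]; then
    head -n 20 10M_ctrl_1.sam          # top of the file
    sam_header_count 10M_ctrl_1.sam    # total number of header lines
    sam_preview 10M_ctrl_1.sam 3       # first three alignment records
else
    echo "10M_ctrl_1.sam not found in the current directory"
fi
```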

In [None]:
%%bash

You should see a lot of header lines, each beginning with an &commat;. The alignment records are the lines that follow the header.

A SAM file has a fair number of columns, so a single record will probably wrap onto the next line, making it harder to read. You might have noticed that the values within the SAM file are separated by tabs, so try exporting just the first 100 lines of our SAM file into a tab-separated values (<code>.tsv</code>) file. That way, we can open those 100 lines in Excel and see each column more easily.

Once we've exported the first 100 lines into a <code>.tsv</code> file, let's open it up in Excel to view its contents.
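One possible way to do the export is sketched below. The output name <code>10M_ctrl_1_first100.tsv</code> and the helper name are made up for this example. Keep in mind that header lines count toward the 100, so only part of the exported file will be alignment records.

```shell
# Hypothetical helper: copy the first N lines of a SAM file into a .tsv file
# usage: sam_head_to_tsv <input.sam> <output.tsv> <number of lines>
sam_head_to_tsv() {
    head -n "$3" "$1" > "$2"
}

if [ -f 10M_ctrl_1.sam ]; then
    sam_head_to_tsv 10M_ctrl_1.sam 10M_ctrl_1_first100.tsv 100
    # Check how many lines were written
    wc -l 10M_ctrl_1_first100.tsv
fi
```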

In [None]:
%%bash

Unfortunately, the columns do not come with headers, so <a href="https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#sam-output" rel="noopener noreferrer"><u>we'll have to look into the <code>bowtie2</code> documentation to figure out what each column indicates.</u></a>
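As a convenience, the eleven mandatory SAM columns can be labelled by writing a header row ourselves. The sketch below uses the standard field names from the SAM format (QNAME through QUAL); the helper name and the output file name are made up. Optional tags after column 11 vary from read to read, so they are left unlabelled.

```shell
# Hypothetical helper: print a row of column names, then the first N
# alignment records (default 100) with the @ header lines removed
sam_with_column_names() {
    printf 'QNAME\tFLAG\tRNAME\tPOS\tMAPQ\tCIGAR\tRNEXT\tPNEXT\tTLEN\tSEQ\tQUAL\n'
    grep -v '^@' "$1" | head -n "${2:-100}"
}

if [ -f 10M_ctrl_1.sam ]; then
    sam_with_column_names 10M_ctrl_1.sam 100 > 10M_ctrl_1_labelled.tsv
fi
```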

To figure out what the FLAG column is telling us, we can go back to the <code>bowtie2</code> documentation, or <a href="https://broadinstitute.github.io/picard/explain-flags.html" rel="noopener noreferrer"><u>we can use the Broad Institute's SAM FLAG decoder.</u></a>
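The FLAG value is a sum of bits, so it can also be decoded with a little shell arithmetic. The sketch below checks only two of the bits defined in the SAM format, 4 (read unmapped) and 16 (read aligned to the reverse strand); <code>describe_flag</code> is a hypothetical helper, not a <code>bowtie2</code> or <code>samtools</code> command.

```shell
# Hypothetical helper: describe a SAM FLAG value for an unpaired read
describe_flag() {
    flag=$1
    if [ $(( $flag & 4 )) -ne 0 ]; then
        # bit 4 set: the read did not align
        echo "unmapped"
    elif [ $(( $flag & 16 )) -ne 0 ]; then
        # bit 16 set: the read aligned to the reverse strand
        echo "mapped, reverse strand"
    else
        echo "mapped, forward strand"
    fi
}

describe_flag 0
describe_flag 16
describe_flag 4
```

Paired-end data uses additional bits, so this helper would need more cases before it could describe those flags.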

Definitions for the optional tags can also be found in the <code>bowtie2</code> documentation.

<h2>Sort and index alignments</h2>

Recall from lecture that genome viewers usually assume that the alignment data you give them is sorted by mapped location. The genome viewer we will be using (IGV) requires a sorted and indexed BAM alignment file in order to display the aligned reads.

<h3>Convert SAM to BAM</h3>

The first thing we'll need to do is to convert our SAM file to a BAM file, which is a compressed, binary file. To do this, we'll use the <code>samtools</code> command. <a href="https://www.htslib.org/doc/samtools.html" rel="noopener noreferrer"><u>Documentation for <code>samtools</code> can be found here.</u></a>

In [None]:
%%bash

<h2>Let's break down the code:</h2>

<code>samtools</code>

This is the command to call up <code>samtools</code>.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>view</code> 

This indicates that we want to make use of the sub-command <code>view</code>, which is used to convert SAM to BAM and vice versa.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-b</code> 

This tells the view sub-command that we want it to output a BAM file, so essentially convert the SAM file to a BAM file.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-o 10M_ctrl_1.bam</code> 

This specifies the output file information. Here is where we provide the sub-command with the file name.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>10M_ctrl_1.sam</code> 

This is the name of the file that we are providing to the sub-command to convert to a BAM file.
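Assembled from the pieces above, the conversion can be sketched as follows. The variable names, the <code>command -v</code> guard, and the <code>ls</code> size comparison are additions for illustration.

```shell
sam=10M_ctrl_1.sam
bam=10M_ctrl_1.bam

if command -v samtools >/dev/null 2>&1 && [ -f "$sam" ]; then
    # -b: write BAM output, -o: output file name, last argument: input SAM
    samtools view -b -o "$bam" "$sam"
    # Compare file sizes; the BAM file is compressed
    ls -lh "$sam" "$bam"
else
    echo "samtools or $sam not found; would run: samtools view -b -o $bam $sam"
fi
```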

<h3>Sort the BAM file</h3>

To sort your BAM file, you can use the <code>sort</code> sub-command within the <code>samtools</code> command.

In [None]:
%%bash

<h2>Let's break down the code:</h2>

<code>samtools</code> 

This is the command to call up <code>samtools</code> just like before.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>sort</code> 

This calls up the <code>sort</code> sub-command, which gives us the ability to sort our aligned reads by their chromosomal position.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-o 10M_ctrl_1-sorted.bam</code> 

This specifies the output file information.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>10M_ctrl_1.bam</code> 

This is the file that contains the alignments that we want to sort based on their chromosomal position.
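Assembled from the pieces above, the sorting step can be sketched as follows. The variable names and the guard are additions for illustration, as is the final check, which prints the <code>@HD</code> header line so we can see the sort order that <code>samtools</code> recorded.

```shell
bam=10M_ctrl_1.bam
sorted=10M_ctrl_1-sorted.bam

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # Sort alignments by chromosomal position
    samtools sort -o "$sorted" "$bam"
    # -H prints only the header; the @HD line records the sort order
    samtools view -H "$sorted" | grep '^@HD'
else
    echo "samtools or $bam not found; would run: samtools sort -o $sorted $bam"
fi
```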

<h3>Index the sorted BAM file</h3>

Just as we needed a genome index to allow <code>bowtie2</code> to quickly access the sequence information, we need to index our sorted BAM file to allow quick access to our alignment data. To index our sorted BAM file, we will use the <code>index</code> sub-command and provide it with the BAM file we want it to index:

In [None]:
%%bash

You should now have three outputs from converting, sorting, and indexing:

<ul>
    <li><code>10M_ctrl_1.bam</code></li>
    <li><code>10M_ctrl_1-sorted.bam</code></li>
    <li><code>10M_ctrl_1-sorted.bai</code></li>
</ul>
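A sketch of the indexing step, followed by a check that all three files exist, is shown below. By default, <code>samtools index</code> names its output by adding <code>.bai</code> to the full BAM file name (<code>10M_ctrl_1-sorted.bam.bai</code>); here the index name is passed as a second argument so that it matches the list above. Treat that naming choice as an assumption about how the cell was run in class.

```shell
sorted=10M_ctrl_1-sorted.bam
bai=10M_ctrl_1-sorted.bai

if command -v samtools >/dev/null 2>&1 && [ -f "$sorted" ]; then
    # Second argument: name to give the index file
    samtools index "$sorted" "$bai"
fi

# Report which of the three expected outputs are present
for f in 10M_ctrl_1.bam "$sorted" "$bai"; do
    if [ -f "$f" ]; then
        echo "found:   $f"
    else
        echo "missing: $f"
    fi
done
```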

<h1 style="font-size: 40px; margin-bottom: 0px;">Visualize aligned reads</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

To prepare to visualize your aligned reads in IGV, download the following files:

<ol>
    <li>Your sorted BAM file: <code>10M_ctrl_1-sorted.bam</code></li>
    <li>The index for your sorted BAM file: <code>10M_ctrl_1-sorted.bai</code></li>
</ol>

Now, let's open up IGV and visualize our alignments.

If there's time, we can process our <code>10M_taz_1.sam</code> file for visualization as well.
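The same three steps can be wrapped into one small function so they are easy to repeat for another sample. This is only a sketch: <code>process_sam</code> is a hypothetical helper, and it assumes <code>10M_taz_1.sam</code> is in the current directory and that the index should be named the same way as for the control.

```shell
# Hypothetical helper: convert, sort, and index the SAM file for one sample
# usage: process_sam <sample name without the .sam extension>
process_sam() {
    sample=$1
    samtools view -b -o "$sample.bam" "$sample.sam" &&
    samtools sort -o "$sample-sorted.bam" "$sample.bam" &&
    samtools index "$sample-sorted.bam" "$sample-sorted.bai"
}

if command -v samtools >/dev/null 2>&1 && [ -f 10M_taz_1.sam ]; then
    process_sam 10M_taz_1
else
    echo "samtools or 10M_taz_1.sam not found; nothing was processed"
fi
```

Chaining the steps with <code>&amp;&amp;</code> means a later step only runs if the one before it succeeded.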

<h1 style="font-size: 40px; margin-bottom: 0px;">References</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<p style="padding-left: 20px;"><a href="https://pubmed.ncbi.nlm.nih.gov/19549630/" rel="noopener noreferrer" target="_blank"><u>Whiteford et al 2009 Bioinformatics:</u></a> Swift: primary data analysis for the Illumina Solexa sequencing platform</p>

<p style="padding-left: 20px;"><a href="https://bionumbers.hms.harvard.edu/bionumber.aspx?id=100679&ver=5&trm=gc+content+human+genome&org=" rel="noopener noreferrer" target="_blank"><u>Bionumbers for human genome GC content</u></a></p>

<p style="padding-left: 20px;"><a href="https://www.nature.com/articles/ncb3216" rel="noopener noreferrer" target="_blank"><u>Zanconato et al 2015 Nat Cell Biol:</u></a> Genome-wide association between YAP/TAZ/TEAD and AP-1 at enhancers drives oncogenic growth</p>