<h1 style="font-size: 40px; margin-bottom: 0px;">10.1 RNA-seq alignment</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Last week, we went over the general RNA-seq pipeline and set up our first shell script to begin learning how we can take the commands we enter into Terminal and automate them. As with ChIP-seq, we first need to align our data to a reference genome as a setup for further analysis.

Today, we'll review and perform alignments for the truncated RNA-seq datasets that we generated last week. Everyone will work with their own group's data, so we may all see slightly different outputs/results. As we continue with the RNA-seq module, we'll build up our pipeline, connecting the lessons together into a more automated RNA-seq analysis workflow.

Since we continued with notebook 9-1 into this week, we can merge the processes covered in both 9-1 and 10-1 into a single workflow for our shell script.

<strong>Learning objectives:</strong>

<ul> 
    <li>Learn how to run HISAT2 for aligning RNA-seq data</li>
    <li>Automate 9-1 and 10-1 commands</li>
    <li>Review splice-aware alignment</li>
    <li>Visualize RNA-seq alignments using IGV</li>
</ul>

<h1 style="font-size: 40px; margin-bottom: 0px;">HISAT2 alignment</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<a href="https://daehwankimlab.github.io/hisat2/manual/" rel="noopener noreferrer"><u>Documentation for HISAT2 can be found here.</u></a>

Like we did with <code>bowtie2</code>, let's set up our alignment and have that run in the background as we go over the concepts and code while it runs. Since we're finishing up notebook 9-1 with 10-1, we can merge our alignment commands with our 9-1 exercise script. Then we can run everything all together.

So first, let's set up our commands, including processing the outputs. Once we're done setting up our script, we can run it and then go over the concepts and break down the code. We'll walk through the setup together while thinking through its logic.

<h2>1. Prepare directories to handle outputs</h2>

Since we'll be trying to automate more things, we can prepare some directories to handle our outputs. We can merge these commands with our 9-1 script to keep things tidy. Like with our ChIP-seq alignments, we'll have SAM and BAM files, as well as alignment logs. So let's set up our shell script to keep things organized.
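As a minimal sketch of this step, the commands below create the output directories with `mkdir -p`. The `alignments-temp/sams` and `alignment-logs` folders match the paths used in the `hisat2` breakdown in this notebook; the `alignments-temp/bams` and `alignments` folder names are assumptions for illustration, so adjust them to your own layout.

```bash
# Base directory for this week's work (path taken from the hisat2 examples in this notebook).
week_dir="$HOME/MCB201B_F2025/Week_10"

# -p creates parent directories as needed and does not complain if they already exist.
mkdir -p "$week_dir/alignments-temp/sams"   # SAM output straight from hisat2
mkdir -p "$week_dir/alignments-temp/bams"   # assumed name: unsorted BAMs
mkdir -p "$week_dir/alignments"             # assumed name: sorted BAMs and indexes
mkdir -p "$week_dir/alignment-logs"         # alignment summary files

# List what we just created.
ls "$week_dir"
```

Because `mkdir -p` is safe to repeat, these lines can sit at the top of the shell script and run every time without causing errors.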

In [None]:
%%bash

<h2>2. Alignment using <code>hisat2</code></h2>

For this part, we can set up the command to run our alignment, and once our shell script is running, we can break down how the input is set up in a later section. For this set up, you can work with your own group's data.
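One way to lay this out, as a sketch, is to collect the `hisat2` options in a bash array so that each option sits on its own line and is easy to review. The options and paths are the ones broken down in this notebook for the group 1 control sample; the variable names (`course_dir`, `week_dir`, `sample`, `hisat2_cmd`) are our own choices for illustration. The sketch only prints the command, so it is safe to run before the script is finished.

```bash
# Paths from this notebook's examples; swap in your own group's files.
course_dir="$HOME/shared/2025-fall/courses/1547808/rna-seq"
week_dir="$HOME/MCB201B_F2025/Week_10"
sample="1M_g1_ctrl"

# Each array element is one word of the final command.
hisat2_cmd=(
    hisat2
    -x "$course_dir/rna-index/hg19"                        # index path + basename
    -1 "$course_dir/truncated/g1/${sample}_r1.fastq.gz"    # first mate file
    -2 "$course_dir/truncated/g1/${sample}_r2.fastq.gz"    # second mate file
    --rna-strandness RF
    -p 8
    --mm
    -S "$week_dir/alignments-temp/sams/${sample}.sam"
    --summary-file "$week_dir/alignment-logs/${sample}-alignment.log"
    --new-summary
)

# Dry run: print the assembled command so we can check it by eye.
printf '%s\n' "${hisat2_cmd[*]}"

# In the finished shell script, the command would be executed with:
#   "${hisat2_cmd[@]}"
```

Writing the same options on a single `hisat2 ...` line works just as well; the array form simply makes long commands easier to read and edit.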

In [None]:
%%bash

No need to run anything yet, we'll continue with setting up our shell script.

<h2>3. Check for alignment completion</h2>

Since we're setting up our shell script to be more automated, we can have it check that the alignment completed successfully before proceeding. If the check fails, the script should abort itself so that none of the later steps run on a missing or incomplete output.
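A minimal sketch of such a checkpoint is shown below. It assumes that "successfully completed" can be judged by the summary file existing and being non-empty; `require_file` is a hypothetical helper name, and the demonstration uses a temporary placeholder file rather than a real alignment log.

```bash
# Hypothetical helper: succeed only if the file exists and is non-empty.
require_file() {
    if [[ -s "$1" ]]; then
        echo "Found: $1"
    else
        echo "ERROR: missing or empty file: $1" >&2
        return 1
    fi
}

# In the real script, the check would point at the summary file and
# a failure would stop the script, for example:
#   require_file "$week_dir/alignment-logs/${sample}-alignment.log" || exit 1

# Demonstration with a temporary placeholder file.
demo_log="$(mktemp)"
echo "placeholder summary" > "$demo_log"
require_file "$demo_log"
require_file "${demo_log}.missing" || echo "The script would abort here."
rm -f "$demo_log"
```

Another option is to test the exit status of `hisat2` itself (for example with `if ! hisat2 ...; then exit 1; fi`), which catches failures even if a partial file was written.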

In [None]:
%%bash

<h2>4. Pull out rows from SAM file to look at later</h2>

Let's also pull out some rows from our SAM file to explore a bit more later once we get the output.
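As a sketch, and assuming we want the first 150 alignment records rather than the `@` header lines, we can filter out the header with `grep -v '^@'` and keep the first rows with `head`. The example runs on a tiny made-up SAM file in a temporary directory so that it works anywhere; the real-file version is shown in a comment, and the output file name there is a hypothetical choice.

```bash
# Build a toy SAM file: 2 header lines plus 3 made-up alignment records.
tmp_dir="$(mktemp -d)"
sam_file="$tmp_dir/toy.sam"
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr3\tLN:1000\n' > "$sam_file"
for i in 1 2 3; do
    printf 'read%s\t0\tchr3\t%s\t60\t4M\t*\t0\t0\tACGT\tIIII\n' "$i" "$((i * 10))" >> "$sam_file"
done

# Drop header lines (they start with @), then keep at most n_rows records.
n_rows=150
subset_file="$tmp_dir/toy_first-rows.sam"
grep -v '^@' "$sam_file" | head -n "$n_rows" > "$subset_file"

# Count how many rows we kept (3 here, because the toy file is tiny).
wc -l < "$subset_file"

# On the real data, the same idea would look like:
#   grep -v '^@' "$week_dir/alignments-temp/sams/${sample}.sam" | head -n 150 \
#       > "$week_dir/alignments-temp/sams/${sample}_first-150.sam"
```

This needs to happen before the SAM file is deleted later in the script.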

In [None]:
%%bash

<h2>5. Convert SAM to BAM</h2>

Like with our ChIP-seq alignment outputs, we'll compress our SAM files into BAMs.
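A sketch of the conversion with `samtools view` is shown below. The BAM location and file name are assumptions for illustration, and the `if` guard is only there so the cell can be run safely before the alignment exists; it is not required in the final script.

```bash
week_dir="$HOME/MCB201B_F2025/Week_10"
sample="1M_g1_ctrl"
sam_file="$week_dir/alignments-temp/sams/${sample}.sam"
bam_file="$week_dir/alignments-temp/bams/${sample}.bam"   # assumed location and name

if command -v samtools >/dev/null 2>&1 && [[ -s "$sam_file" ]]; then
    mkdir -p "$(dirname "$bam_file")"
    # -b writes BAM output, -o names the output file,
    # -@ sets the number of additional threads to use.
    samtools view -b -@ 4 -o "$bam_file" "$sam_file"
else
    echo "Skipping conversion: samtools or $sam_file is not available yet."
fi
```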

In [None]:
%%bash

<h2>6. Check for successful conversion and delete SAM</h2>

Then, we can check for successful conversion using the same approach that we took for step 3. This way, we can set up our script to automatically check that we were able to convert our SAM into a BAM before actually deleting our SAM, and we can save server space.
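The sketch below shows one way to tie the deletion to the check, assuming a non-empty BAM file counts as a successful conversion. `remove_if_converted` is a hypothetical helper name, and the demonstration uses placeholder files in a temporary directory instead of real alignments.

```bash
# Hypothetical helper: delete the source file only if the converted file
# exists and is non-empty; otherwise keep the source and report failure.
remove_if_converted() {
    local source_file="$1" converted_file="$2"
    if [[ -s "$converted_file" ]]; then
        rm -f "$source_file"
        echo "Removed $source_file"
    else
        echo "ERROR: $converted_file is missing or empty; keeping $source_file" >&2
        return 1
    fi
}

# Demonstration with placeholder files.
demo_dir="$(mktemp -d)"
echo "placeholder" > "$demo_dir/sample.sam"
echo "placeholder" > "$demo_dir/sample.bam"
remove_if_converted "$demo_dir/sample.sam" "$demo_dir/sample.bam"

# In the real script, a failure would stop everything:
#   remove_if_converted "$sam_file" "$bam_file" || exit 1
```

For a stricter check, `samtools quickcheck` can be run on the BAM file first; it returns a non-zero exit status for files that look truncated or malformed.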

In [None]:
%%bash

<h2>7. Sort BAM by position and index</h2>

To visualize our alignment data in IGV, we'll need to sort our reads based on their position and then index them. Later on, once we get our outputs, we can download them to then load into IGV like we did with our ChIP-seq alignments.
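A sketch of this step using `samtools sort` and `samtools index` follows. The file names and the `alignments` output folder are assumptions for illustration, and the guard simply lets the cell run harmlessly before the unsorted BAM exists.

```bash
week_dir="$HOME/MCB201B_F2025/Week_10"
sample="1M_g1_ctrl"
unsorted_bam="$week_dir/alignments-temp/bams/${sample}.bam"      # assumed name
pos_sorted_bam="$week_dir/alignments/${sample}_pos-sorted.bam"   # assumed name

if command -v samtools >/dev/null 2>&1 && [[ -s "$unsorted_bam" ]]; then
    mkdir -p "$(dirname "$pos_sorted_bam")"
    # samtools sort orders reads by reference position by default.
    samtools sort -@ 4 -o "$pos_sorted_bam" "$unsorted_bam"
    # Indexing writes ${pos_sorted_bam}.bai next to the BAM file.
    samtools index "$pos_sorted_bam"
else
    echo "Skipping sort/index: samtools or $unsorted_bam is not available yet."
fi
```

IGV needs both the sorted BAM and its `.bai` index, so both files should be downloaded into the same folder.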

In [None]:
%%bash

<h2>8. Sort BAM by name</h2>

This is where things start to differ a bit from our ChIP-seq analysis. On Friday, we'll quantify our reads, and to prepare for that, we'll need another BAM file, where instead of sorting the aligned reads by their chromosomal position, we'll sort them based on their read name/ID. That way, each read of a read pair will appear one after another in the BAM file. This helps to keep memory usage down when we do the counting.
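Name sorting uses the same `samtools sort` command with the `-n` option added, as in the sketch below. The file names are again assumptions for illustration, and the guard is only there so the cell can be run before the unsorted BAM exists.

```bash
week_dir="$HOME/MCB201B_F2025/Week_10"
sample="1M_g1_ctrl"
unsorted_bam="$week_dir/alignments-temp/bams/${sample}.bam"        # assumed name
name_sorted_bam="$week_dir/alignments/${sample}_name-sorted.bam"   # assumed name

if command -v samtools >/dev/null 2>&1 && [[ -s "$unsorted_bam" ]]; then
    mkdir -p "$(dirname "$name_sorted_bam")"
    # -n sorts by read name instead of by reference position.
    samtools sort -n -@ 4 -o "$name_sorted_bam" "$unsorted_bam"
else
    echo "Skipping name sort: samtools or $unsorted_bam is not available yet."
fi
```

Note that a name-sorted BAM is not indexed with `samtools index`, since that index is built on reference positions.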

In [None]:
%%bash

<h2>9. Check for successful sorting and indexing (and delete unsorted BAM)</h2>

We can set up another checkpoint to check that everything was properly sorted and indexed. To save server space, we can also delete our original unsorted BAM file. That way, we keep just two BAMs (position-sorted and name-sorted) rather than three copies in various states of sorting.
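As a sketch, this checkpoint can confirm that every expected output exists and is non-empty before removing the unsorted BAM. `all_present` is a hypothetical helper name, and the demonstration uses placeholder files in a temporary directory rather than real BAMs.

```bash
# Hypothetical helper: succeed only if every listed file exists and is non-empty.
all_present() {
    local f
    for f in "$@"; do
        if [[ ! -s "$f" ]]; then
            echo "ERROR: missing or empty file: $f" >&2
            return 1
        fi
    done
}

# Demonstration with placeholder files standing in for the real outputs.
demo_dir="$(mktemp -d)"
for name in unsorted.bam pos-sorted.bam pos-sorted.bam.bai name-sorted.bam; do
    echo "placeholder" > "$demo_dir/$name"
done

if all_present "$demo_dir/pos-sorted.bam" "$demo_dir/pos-sorted.bam.bai" "$demo_dir/name-sorted.bam"; then
    rm -f "$demo_dir/unsorted.bam"
    echo "Sorted outputs found; removed the unsorted BAM."
else
    echo "Sorting or indexing failed; keeping the unsorted BAM." >&2
    # In the real script, we would stop here with: exit 1
fi

ls "$demo_dir"
```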

In [None]:
%%bash

<h2>10. Work in a for loop to automate alignment and processing of both conditions</h2>

We have multiple files that we want to process, so we can work everything together into a for loop to automatically work through both sample conditions. We can also consider how to work in our commands from notebook 9-1 as well to automate the entire process so far.
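The skeleton below sketches how such a loop could be organized. It is a dry run that only prints what would happen for each sample. The first sample name comes from this notebook's examples; the second name (`1M_g1_ko`) is a hypothetical placeholder, so replace both with your own group's file names.

```bash
course_dir="$HOME/shared/2025-fall/courses/1547808/rna-seq"
week_dir="$HOME/MCB201B_F2025/Week_10"

# One entry per condition; the second name is a hypothetical placeholder.
samples=("1M_g1_ctrl" "1M_g1_ko")

planned=()
for sample in "${samples[@]}"; do
    # Everything that changes between samples is derived from $sample.
    mate1="$course_dir/truncated/g1/${sample}_r1.fastq.gz"
    mate2="$course_dir/truncated/g1/${sample}_r2.fastq.gz"
    sam_file="$week_dir/alignments-temp/sams/${sample}.sam"
    log_file="$week_dir/alignment-logs/${sample}-alignment.log"

    echo "[$sample] would align $mate1 and $mate2"
    echo "[$sample] would write $sam_file and $log_file"
    planned+=("$sample")

    # In the real script, this is where each step goes, in order:
    # alignment, completion check, row extraction, SAM-to-BAM conversion,
    # conversion check, position sort + index, name sort, final check.
done

echo "Samples planned: ${#planned[@]}"
```

Deriving every path from a single `sample` variable means adding another sample later only requires adding its name to the `samples` array.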

<h2>Run your shell script</h2>

Once we have our completed shell script, let's go ahead and run it. Then we can troubleshoot if needed and review RNA-seq alignments and our <code>hisat2</code> input.

<h1 style="font-size: 40px; margin-bottom: 0px;">Breaking down <code>hisat2</code> command input</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

While our script runs, let's break down our <code>hisat2</code> input, so we understand how the program operates.

<h2>Let's break down the code:</h2>

<code>hisat2</code>

This calls up the <code>hisat2</code> program.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-x ~/shared/2025-fall/courses/1547808/rna-seq/rna-index/hg19</code>

This functions the same as in <code>bowtie2</code>. With the <code>-x</code> option, we indicate the path to our reference genome index files along with the basename of those files. In this case, our index files are located in <code>~/shared/2025-fall/courses/1547808/rna-seq/rna-index/</code>, and all the index files share the same basename <code>hg19</code>.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-1 ~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1/1M_g1_ctrl_r1.fastq.gz</code>

This is one key difference between our RNA-seq alignment and ChIP-seq alignment. While the ChIP-seq data came from single-end (SE) sequencing, our RNA-seq data was obtained via paired-end (PE) sequencing, resulting in two sequence files associated with each replicate. With the <code>-1</code> option, we tell <code>hisat2</code> that we are providing the first mate file for our PE data.

If our data were unpaired, then we would (like with <code>bowtie2</code>) use the <code>-U</code> option for unpaired or SE reads.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-2 ~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1/1M_g1_ctrl_r2.fastq.gz</code>

With the <code>-2</code> option, we provide <code>hisat2</code> with the second mate file.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>--rna-strandness RF</code>

Recall that our RNA-seq library was generated with a specific strandedness with respect to the original mRNA fragment, so we need to provide this information to the aligner using the <code>--rna-strandness</code> option. The <code>RF</code> argument indicates that our paired-end library is directional on the first strand: the first mate corresponds to the reverse complement of the transcript (<code>R</code>), and the second mate corresponds to the transcript itself (<code>F</code>).

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-p 8</code>

Like with <code>bowtie2</code>, we can specify additional performance options. In this case, we'll process our files using the 8 cores of our Biology Hub server to speed up our alignments.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>--mm</code>

When aligning to the human genome, <code>hisat2</code> is expected to use up to 4.5GB of memory with 8 parallel processes, but it can sometimes spike to 5GB or more during alignment, which will kill our server. We can try to reduce the memory footprint by specifying that we want <code>hisat2</code> to load the index using memory-mapped I/O, so that the alignment processes can share a single in-memory copy of our hg19 index rather than each loading its own.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>-S ~/MCB201B_F2025/Week_10/alignments-temp/sams/1M_g1_ctrl.sam</code>

Like with <code>bowtie2</code>, we can specify that we want a <code>.sam</code> output using the <code>-S</code> option followed by the file path and name of our SAM file that we want <code>hisat2</code> to create.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>--summary-file ~/MCB201B_F2025/Week_10/alignment-logs/1M_g1_ctrl-alignment.log</code>

The option <code>--summary-file</code> tells <code>hisat2</code> that we would like it to output a file containing alignment statistics, which will be generated once the alignment is completed. The argument after <code>--summary-file</code> is the file path and file name that we want the alignment statistics to be saved to. Like when we ran our ChIP-seq alignments, the summary file signals to us that our alignment was successfully completed, which we can incorporate into our script as a checkpoint to verify successful alignment.

<hr style="border: 1px solid; border-color: #BBBBBB;"></hr>

<code>--new-summary</code>

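<p>This option changes the format of the summary file generated by <code>--summary-file</code> to a newer style that the HISAT2 manual describes as more machine-friendly, which makes the alignment statistics easier for downstream tools to parse.</p>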

<h1 style="font-size: 40px; margin-bottom: 0px;">Review of RNA-seq alignment</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<h4 style="text-align: center;"><strong>Fig 1</strong></h4>
<img src="./images/10_1_fig_1.png" style="height: 300px; margin: auto;"/>
<p style="text-align: center;">RNA-seq read alignment strategies (Dr. Ingolia)</p>

Recall from lecture that RNA-seq alignments have special considerations because mRNA is processed to remove introns and splice together exons. Three common approaches are to align reads to a reference transcriptome, to align them to a reference genome, or to assemble transcripts <i>de novo</i> if a reference is unavailable (<strong>Fig 1</strong>).

<h4 style="text-align: center;"><strong>Fig 2</strong></h4>
<img src="./images/10_1_fig_2.png" style="height: 300px; margin: auto;"/>
<p style="text-align: center;">Examples of incorrect read mapping (Kim et al 2013)</p>

Alignment programs that are unable to account for the large gaps in reads that span exon junctions may map those reads incorrectly (<strong>Fig 2</strong>). They may align the short read to only a single exon and penalize the mismatched base pairs, they may incorrectly align the read to a pseudogene that doesn't contain an intervening intron, or they may leave the read unmapped. Splice-aware aligners, such as HISAT2, can properly account for junctions, which you will be able to see when we load our alignments into IGV.

<h1 style="font-size: 40px; margin-bottom: 0px;">Viewing alignment outputs</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<h2>Alignment statistics</h2>

Once your shell script has finished running, you should now have a number of output files to look through. We'll first take a look at our alignment log file containing each sample's alignment statistics. This will give us an idea if there was an issue with how we set up the alignment or potentially an issue with our samples/replicates. Generally, you'll want all your replicates to have similar alignment rates, and if there's a replicate that has drastically lower alignment rates, you may want to take a closer look to see if issues arose during the prep of that replicate.
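To compare samples quickly, one option is to pull the overall alignment rate line out of every log at once. The sketch below assumes the logs follow the `<sample>-alignment.log` naming used in this notebook and that each summary contains a line mentioning "overall alignment rate"; check one of your own logs to confirm the exact wording.

```bash
log_dir="$HOME/MCB201B_F2025/Week_10/alignment-logs"

# Collect the log files; nullglob makes the list empty if nothing matches.
shopt -s nullglob
log_files=("$log_dir"/*-alignment.log)
shopt -u nullglob

if (( ${#log_files[@]} > 0 )); then
    # -i ignores case; -H prints the file name in front of each matching line.
    grep -i -H "overall alignment rate" "${log_files[@]}"
else
    echo "No alignment logs found in $log_dir yet."
fi
```

Opening each log in full (for example with `cat`) is still worthwhile, since the remaining lines break down how the read pairs aligned.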

<h2>Viewing SAM output</h2>

Let's take a look at the 150 rows that we pulled out of our SAM file. At first glance, it will look more or less the same as our SAM files from our ChIP-seq alignment. The notable differences will lie in the FLAG field and the CIGAR string (and the aligner-specific tags).
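To help read these two fields, here is a small sketch that decodes a FLAG value into its paired-end bits, using the bit definitions from the SAM specification. In the same specification, an `N` operation in the CIGAR string marks a skipped region of the reference, which is how a splice-aware aligner can represent an intron. `decode_flag` is a hypothetical helper name, and the FLAG and CIGAR values are illustrative examples, not values taken from our data.

```bash
# Hypothetical helper: list the names of the bits set in a SAM FLAG value.
decode_flag() {
    local flag="$1" names=()
    (( flag & 1 ))   && names+=("paired")
    (( flag & 2 ))   && names+=("proper_pair")
    (( flag & 4 ))   && names+=("unmapped")
    (( flag & 8 ))   && names+=("mate_unmapped")
    (( flag & 16 ))  && names+=("reverse")
    (( flag & 32 ))  && names+=("mate_reverse")
    (( flag & 64 ))  && names+=("first_in_pair")
    (( flag & 128 )) && names+=("second_in_pair")
    echo "${names[*]}"
}

decode_flag 99    # prints: paired proper_pair mate_reverse first_in_pair
decode_flag 147   # prints: paired proper_pair reverse second_in_pair

# Illustrative CIGAR string: 40 matched bases, a 1200-base skipped region, 35 matched bases.
cigar="40M1200N35M"
if [[ "$cigar" == *N* ]]; then
    echo "$cigar contains a skipped region (possible splice junction)"
fi
```

If `samtools` is available, `samtools flags 99` gives a similar breakdown without any custom code.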

<h2>Take a look at read alignments for TAZ and YAP</h2>

Let's take a look at our reads using our position-sorted BAM files and the associated index file. Then we can load them into IGV to see how the reads aligned for TAZ, given that we aimed to KO TAZ expression. <a href="https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=25937" rel="noopener noreferrer"><u>TAZ (WWTR1) is located at 3q25.1.</u></a>

Let's also see how the reads look for YAP, which we didn't KO. <a href="https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=10413" rel="noopener noreferrer" target="_blank"><u>YAP1 is located at 11q22.1.</u></a>

<h3>Color by first-of-pair strand</h3>

We can also visualize the strandedness of our cDNA library in IGV. Right-click the read tracks and select "first-of-pair strand" under "Color alignments by". You should see the alignments colored either red or blue based on the direction of the cDNA library.

For a single gene, an unstranded cDNA library will show a mix of both colors when the alignments are colored by direction, whereas a stranded cDNA library will show alignments that are all (or nearly all) the same color.

<h1 style="font-size: 40px; margin-bottom: 0px;">References</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<p style="padding-left: 20px;"><a href="https://www.nature.com/articles/s41587-019-0201-4" rel="noopener noreferrer"><u>Kim et al 2019 Nat Biotech:</u></a> Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype</p>

<p style="padding-left: 20px;"><a href="https://doi.org/10.1093/bioinformatics/btw354" rel="noopener noreferrer"><u>Ewels et al 2016 Bioinformatics:</u></a> MultiQC: summarize analysis results for multiple tools and samples in a single report</p>

<p style="padding-left: 20px;"><a href="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-4-r36" rel="noopener noreferrer"><u>Kim et al 2013 Genome Biol:</u></a> TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions</p>