<h1 style="font-size: 40px; margin-bottom: 0px;">9.1 RNA-seq analysis pipeline and QC</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Now that you're more familiar with ChIP-seq analysis, and particularly with the Terminal command line, we'll move on to transcriptomics analysis. Specifically, we'll be analyzing the RNA-seq data that we generated during the summer. Much like our ChIP-seq analysis module, the goal of our RNA-seq analysis is to give you additional practice analyzing large datasets. ChIP-seq allows us to infer what genes may be regulated by TAZ based on where TAZ (and TEAD) is binding in the genome. RNA-seq complements that by providing information on the transcriptomic changes that occur following TAZ KO, which gives us a different angle for inferring downstream transcriptional targets: looking at which genes are differentially expressed when we KO TAZ. The end goal is to bring together our analyses from ChIP-seq and RNA-seq to derive information about which genes are direct vs indirect transcriptional targets of TAZ and to get a better sense of how TAZ acts as an oncogene or tumor suppressor gene.

A lot of the broad concepts that we learned for ChIP-seq will be applicable to RNA-seq analysis as well even though the specifics may differ. Our sequence files are in the standard fastq format, and we'll still align our reads to a reference genome. And we can visualize the alignments again using IGV. After that, instead of looking at peaks, we'll be quantifying transcript abundance based on our alignment data, and then using that data to determine which genes are differentially expressed when we KO TAZ.

The setup of this module will be more fun because we'll be working with all the data that we generated in the summer. In class, we'll work with a truncated dataset like we did with our ChIP-seq analysis. Each group will perform the initial QC, alignment, and quantification on their own replicates. Then, the fun part is that we'll bring together all our replicates as a class to play with the data and perform differential expression analysis.

<strong>Learning objectives:</strong>

<ul>
    <li>Continue practicing command line and bash</li>
    <li>Learn to set up shell scripts</li>
    <li>Understand RNA-seq pipeline set up and QC
        <ul>
            <li>Output some reads to look at later</li>
            <li>Perform QC on truncated dataset</li>
        </ul>
    </li>
    <li>Review fastq sequence files and FastQC analysis</li>
</ul>

<h1 style="font-size: 40px; margin-bottom: 0px;">Shell scripts</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Now that you're more familiar with using Terminal and the command line, we'll begin learning how to set up shell scripts, so that we can run a series of commands and programs by executing a single script. This way, you don't need to sit and wait for Terminal to finish what it's doing before telling it what to do next.

We'll first learn to set up and run a basic shell script, then we'll begin adding additional layers of complexity to set ourselves up to build a shell script to analyze our RNA-seq dataset. Along the way, we'll continue to build upon what we know about command line syntax and control flow. We'll be updating our practice example script while also taking a look to see how it changes along the way by running it in Terminal as we make different modifications.

<h2>What is a shell script?</h2>

Shell scripts (or Bash scripts for <strong><u>B</u></strong>ourne <strong><u>a</u></strong>gain <strong><u>sh</u></strong>ell in our case) allow you to automate a series of commands/tasks, so you don't have to input each step separately after the previous one has completed. This allows you to set up a wrapper that brings together the programs in your full analysis set up/pipeline. 

Like Python scripts, shell scripts are just plain text files with a particular extension, and in the case of shell scripts, the file extension is <code>.sh</code>. The script contains all the commands that you want to run as well as any associated comments.

<h2>Structure of a shell script</h2>

The shell script structure is fairly flexible although you'll want to keep in mind conventions as well as proper syntax and control flow. 

<h3>Shebang</h3>

The first line of shell scripts must begin with a hashbang, often referred to as a shebang, which consists of two characters <code>#!</code>. The shebang specifies the interpreter that will be used to execute the script.

<strong>Find path to interpreter</strong>

To find the path to the interpreter that we've been using when we run commands in Terminal, we can call up the variable <code>0</code> and output the interpreter to the standard output using <code>echo</code>.

To retrieve a value assigned to a variable, you can use the special character <code>$</code>. For this example, we can use it to retrieve the value assigned to the variable <code>0</code>.

In [None]:
%%bash

#Output the value of the variable 0, which holds the interpreter
echo $0

Now that we know the path to our interpreter, we can include that alongside our shebang to specify what interpreter we want to use to execute our shell script.

```bash
#!/bin/bash
```


<h2>Comments</h2>

Like Python, we can add comments to our shell script to help others (and you) to understand the code. At the beginning of the shell script, after the shebang, you can include comments that provide a brief overview of the script's contents. Other information that is sometimes included in these top-level comments are copyright, author information, and version history.

Comments are specified by a hash <code>#</code>, and everything in that line after the <code>#</code> is considered a comment.

```bash
# Liebchen was here.
```

Comments can also be added throughout the script to help explain things that might not be super obvious from the code itself or if you want to explain the logic/reasoning of how you set up the code.

<h2>Functions and commands</h2>

Functions are usually defined after the top-level comments, followed by the main code that defines variables, runs commands, and generates outputs. Essentially, all the different commands that we've run in Terminal can be written out as lines of code in our shell script in the same way.

<h1 style="font-size: 40px; margin-bottom: 0px;">Setting up a shell script</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

Let's now set up a simple shell script that we can run in Terminal. 

<h2>Set up <code>.sh</code> file</h2>

First, in the File Browser, navigate to this week's directory. Then open up your Launcher and start up a Text file, which is located under the section "Other". 

Then input the shebang along with the path to our bash interpreter:

```bash
#!/bin/bash
```

Let's then save the file as <code>week-9-script.sh</code>. Make sure that you replace the <code>.txt</code> extension with the <code>.sh</code> extension. When you finish saving the file, you'll notice that the color of the shebang has changed to be like the above example. 

<h2>Make your shell script executable</h2>

In order to run our shell script in Terminal, we'll need to make it an executable file. This is kind of like turning a kitchen appliance on, so that you can use it.

In [None]:
%%bash

chmod +x ./week-9-script.sh

To confirm that our shell script is executable, we can then make use of <code>ls</code> but with an additional option <code>-l</code> to also include file permission information.

In [None]:
%%bash

ls -l

An executable file will have an <code>x</code> bit set within the file permissions information, indicating that it can be run as a command. So now to run our shell script, we can just call it up like we would with a command using Terminal.

In [None]:
%%bash

./week-9-script.sh

There won't be any output since we haven't specified any commands and just added the shebang, but if your script is able to run, you should not get an error.

Now as we make modifications and save our shell script, we can just run our script again in Terminal to test things out and see how it runs. Since this is a bit set for the file, the nice thing is that it stays permanently executable even after our server is shut down and started back up.

<h2>Add a top-level comment</h2>

We'll then go ahead and add a top-level comment that describes what our script will do. In this case, we can just note that this is our test script to play around in.

```bash
# This is a script to play around in and learn.
```

Now let's run our script. We still shouldn't see any output because our script only contains the shebang and comments, but there also shouldn't be any errors.

<h2>Set up a function</h2>

Conventionally, functions are defined after the top-level comments before all the main body of code/commands. For practice purposes, we can set up a simple function to output some text to the standard output stream. Although the syntax for defining a function is different than what we know from Python, the overall concept and purpose is the same, where we define a reusable block of code that we can then call up later on when we need it.

In [None]:
%%bash

#####################
# Meow at me for help
#####################
help_me_pls() {
    echo "MEOWWWWW!"
}

Now let's run our script, and we should still see that there's no output because all we've done is define our function.

<h2>Input a basic command</h2>

Now let's set up the main body of the shell script containing the commands that we want to automate. First, let's test out the function that we defined earlier.

In [None]:
%%bash

help_me_pls

Let's now run our script in Terminal to see if our function and script work. Don't forget to save your script prior to running it in Terminal.

<h3>Automate a series of commands</h3>

Now that we know everything works well, let's set up a quick script to automate a simple task. For this example script, we can pull out the first read from each of our sequence files. The example shown here is for group 1, but feel free to work with your group's data.

Recall that we can use <code>cat</code> to open up files and display them in our <code>stdout</code> stream in Terminal. Since our sequence files are compressed (to help save storage usage), we'll make use of the command <code>zcat</code>, which does the same thing but for our compressed files. Each read in a fastq file takes up four lines, so since we're interested in just the first read, we can pipe <code>&vert;</code> the output from <code>zcat</code> into <code>head -4</code> to pull out just the first four lines, which we can redirect <code>&gt;</code> to a new file.

In [None]:
%%bash

zcat ~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1/1M_g1_ctrl_r1.fastq.gz \
    | head -4 \
    > g1_ctrl_r1_read.txt

zcat ~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1/1M_g1_ctrl_r2.fastq.gz \
    | head -4 \
    > g1_ctrl_r2_read.txt

zcat ~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1/1M_g1_tazko_r1.fastq.gz \
    | head -4 \
    > g1_tazko_r1_read.txt

zcat ~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1/1M_g1_tazko_r2.fastq.gz \
    | head -4 \
    > g1_tazko_r2_read.txt

#You can imagine that this can be set up as a for loop.
#We'll get to that in the subsequent parts of this notebook.

Let's save and then run our script in Terminal.

<h2>Variables</h2>

Like with Python, we can save things to variables, so we can call them up later to use. This is something that we can do both in Terminal and in our shell scripts, and it helps simplify our code, particularly if we're calling up things repeatedly. However, unlike Python, variables are untyped in bash, meaning that they are essentially interpreted as strings. There are some instances, such as integers used in arithmetic, where the type can be inferred based on the context.

We can go ahead and set up a couple of variables, and see how they are understood when we execute our script in Terminal.

In [None]:
%%bash

#Like with Python, we can assign things to variables
meow=hungry
cats=1

#There are different ways of calling up a variable

#Note how meow alone is not enough to indicate a variable
echo meow

#Requires the $ parameter expansion symbol 
echo $meow

#Can add clarity with the curly brackets
echo ${meow}

#And then even more clarity with double quotes 
echo "${meow}"

#Variables are untyped, and type can be inferred
#For example here, arithmetic expansion
#The 1 is then inferred to be an integer type
echo $((cats + cats))

Let's run our script again. Don't forget to save before running.

<h3>Positional parameters</h3>

When we call up commands, we can also provide positional parameters, which are automatically saved to particular variables based on their position in the command line input. The first position following the command is saved to the variable <code>1</code>, and each subsequent position is assigned to the next variable (<code>2</code>, <code>3</code>, and so on). Recall that each position is separated by a space, which is how the bash interpreter is able to identify the command and each positional parameter.

There are also ways to set up named parameters, which we're more familiar with as options and arguments that we provide when inputting commands, but for this example, let's add a positional parameter/variable to our shell script.

In [None]:
%%bash

#Can call up positional variables
echo "${1}"

echo "${2}"


Let's save the file, and now instead of running the script on its own, we can provide an argument for the first position when we run it in Terminal. You should see now that there is an additional output that comes from our first positional parameter. 
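
As a rough, self-contained sketch of what this looks like, the block below writes a tiny throwaway script to a temporary file and runs it with two arguments. The temporary file, the <code>First:</code>/<code>Second:</code> labels, and the arguments <code>meow</code> and <code>purr</code> are all made up for illustration and are not part of our <code>week-9-script.sh</code>.

```bash
# Write a throwaway script (hypothetical stand-in for week-9-script.sh)
demo_script="$(mktemp)"
cat > "${demo_script}" << 'EOF'
#!/bin/bash
echo "First: ${1}"
echo "Second: ${2}"
EOF

# Run it with two positional arguments separated by a space
output="$(bash "${demo_script}" meow purr)"
echo "${output}"

# Clean up the throwaway script
rm "${demo_script}"
```

Here <code>meow</code> lands in the variable <code>1</code> and <code>purr</code> lands in the variable <code>2</code>.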

<h3>Save file paths to variables</h3>

Since variables are generally untyped and handled essentially as strings, we can store our file paths and file names to variables in the same way to simplify our code and make it easier to read.
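
One possible way to set this up is sketched below, using the group 1 directory and file name from the earlier <code>zcat</code> example. The variable names <code>data_dir</code> and <code>file_name</code> are our own choice for illustration, and you would swap in your own group's directory.

```bash
# Directory from the group 1 example; ~ is left outside of quotes so that it expands
data_dir=~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1

# File name from the group 1 example
file_name="1M_g1_ctrl_r1.fastq.gz"

echo "${data_dir}"
echo "${file_name}"
```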

In [None]:
%%bash

Recall that variables are generally handled as strings, so that means we can append additional information onto our file paths, such as additional subdirectories, file names, and file extensions. You have some experience with this already when you added additional paths to your <code>PATH</code> variable so that Terminal is able to find the directory containing your HOMER executables.

```bash
PATH=$PATH:/home/jovyan/homer/bin
```

You can see that the additional string <code>:/home/jovyan/homer/bin</code> is appended to the <code>PATH</code> variable, and we can do the same thing to more conveniently specify subdirectories and files with additional modifications for clarity.
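
A minimal sketch of the same idea applied to our sequence files might look like the following. The variable names <code>data_dir</code>, <code>sample</code>, and <code>fastq_file</code> are assumptions made for this example, while the directory and sample name come from the group 1 example.

```bash
# Directory and sample name from the group 1 example
data_dir=~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1
sample="1M_g1_ctrl_r1"

# Append a file name and extension onto the directory path
# The curly brackets make it clear where each variable name ends
fastq_file="${data_dir}/${sample}.fastq.gz"

echo "${fastq_file}"
```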

In [None]:
%%bash

Let's now take a look at our script output.

<h3>Playing around with variables</h3>

We can make use of a set of operations in bash called parameter expansion and pattern matching, and since our variables are generally handled as strings, this allows us to play around with the content of our variables. We won't go over all the different operations we can do, but we'll make use of one of them to allow us to remove substrings.

The <code>$</code> special character initiates the expansion, which allows us to then make use of pattern matching to play with our variables. We can then make use of the <code>%</code> symbol to pattern match from the end of the string and remove that matching pattern. This can be handy for us to quickly pull out a base name much like we did in our Python scripts.
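
As a quick illustration, here is a sketch that removes different patterns from the end of one of the group 1 file names. The variable names are hypothetical, and the values in the comments are what each expansion should evaluate to.

```bash
file_name="1M_g1_ctrl_r1.fastq.gz"

# Remove only the .gz extension
no_gz="${file_name%.gz}"            # 1M_g1_ctrl_r1.fastq

# Remove the full .fastq.gz extension to get a base name
base_name="${file_name%.fastq.gz}"  # 1M_g1_ctrl_r1

# Patterns can include the wildcard * to match anything after _r1
sample="${file_name%_r1*}"          # 1M_g1_ctrl

echo "${no_gz}"
echo "${base_name}"
echo "${sample}"
```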

In [None]:
%%bash

Let's take a look to see what the output looks like and how it changes when we match to different patterns within a file name.

<h2>Create and assign an array</h2>

Like with Python, we can also work with compound data types in bash, specifically we can create and work with arrays. For our shell script example today, let's create a simple array and assign it to a variable. You'll want to keep in mind the syntax for an array, which is different in bash compared to what we're familiar with from Python.
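
For reference, a bash array is written with parentheses, and its elements are separated by spaces rather than commas. The sketch below assumes a <code>cats</code> array filled with placeholder names that we made up for illustration.

```bash
# Elements are separated by spaces, with no commas
# There are also no spaces around the = sign
cats=("Liebchen" "Mochi" "Biscuit")

# Calling up the array like a regular variable only gives the first element
echo "${cats}"
```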

In [None]:
%%bash

Now let's take a look at how our script runs. Don't forget to save your script.

<h3>Positions in arrays</h3>

Like with the compound data types that we're familiar with from Python, arrays in bash are also indexed based on their position within their array, and the way to call up a particular element in an array is also similar. One thing to note is that bash arrays are zero-indexed (so also not too different from what we know from Python).
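
A small sketch of indexing, again assuming a <code>cats</code> array with made-up placeholder names:

```bash
cats=("Liebchen" "Mochi" "Biscuit")

# The index goes in square brackets inside the curly brackets
first_cat="${cats[0]}"   # first element
third_cat="${cats[2]}"   # third element

echo "${first_cat}"
echo "${third_cat}"
```

Note that the curly brackets are required here. Without them, bash would expand <code>$cats</code> first and treat the square brackets as plain text.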

In [None]:
%%bash

Let's save and then run our script to take a look at how this is being interpreted by Terminal.

<h3>Expand an array</h3>

As you might have seen with <code>echo "${cats}"</code>, we only saw the first element outputted to <code>stdout</code>. We can make use of some particular bash syntax to expand our array, so that we are able to see/output all elements within it at once.
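
One way this could look is sketched below, assuming the same kind of placeholder <code>cats</code> array (the names are made up).

```bash
cats=("Liebchen" "Mochi" "Biscuit")

# Only the first element
echo "${cats}"

# All elements, expanded with [@]
echo "${cats[@]}"

# Save the expanded output to a variable so we can check it
all_cats="$(echo "${cats[@]}")"
```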

In [None]:
%%bash

What you can see is that with <code>&lbrack;@&rbrack;</code>, we can specify that we want to expand our array, essentially pulling out all the elements of the array. We can add <code>!</code> to retrieve the index values of all our array elements, which will show us that the array is by default zero-indexed.
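
A short sketch of retrieving index values, assuming a placeholder <code>cats</code> array with three made-up names:

```bash
cats=("Liebchen" "Mochi" "Biscuit")

# The ! goes right after the opening curly bracket
echo "${!cats[@]}"

# Save the index values to a variable so we can check them
indices="$(echo "${!cats[@]}")"   # 0 1 2
```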

In [None]:
%%bash

<h3>Length of an array</h3>

We can make use of array expansion to also pull the length of an array by simultaneously expanding the array and quantifying the number of elements.
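
As a sketch, the <code>#</code> symbol placed right after the opening curly bracket counts the elements of the expanded array. The <code>cats</code> array and its names are placeholders for illustration.

```bash
cats=("Liebchen" "Mochi" "Biscuit")

# The # counts the number of elements in the expanded array
n_cats="${#cats[@]}"

echo "${n_cats}"   # 3
```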

In [None]:
%%bash

<h3>Pull file names into arrays</h3>

Like how we can prepare a list of file names in Python, we can do the same thing in bash as well, which will allow us to set up a way to then work through all our files that we want to analyze. For example, you'll be setting up a script today to pull out reads from your group's sequence files and run a FastQC analysis on each of your sequence files.
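
One simple (if tedious) way to do this is to type the file names directly into an array, as sketched below with the group 1 file names from earlier. The variable names <code>data_dir</code>, <code>fastq_files</code>, and <code>first_path</code> are assumptions for this example.

```bash
# Directory from the group 1 example
data_dir=~/shared/2025-fall/courses/1547808/rna-seq/truncated/g1

# File names typed out by hand
fastq_files=(
    "1M_g1_ctrl_r1.fastq.gz"
    "1M_g1_ctrl_r2.fastq.gz"
    "1M_g1_tazko_r1.fastq.gz"
    "1M_g1_tazko_r2.fastq.gz"
)

# Number of files in the array
echo "${#fastq_files[@]}"

# Full path to the first file
first_path="${data_dir}/${fastq_files[0]}"
echo "${first_path}"
```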

In [None]:
%%bash

Let's run this and see how our output looks.

More easily than in Python, we can make use of pattern matching to pull particular files that we're interested in based on whether or not they have a specific file extension or contain a specific pattern in their file name. Here, we make use of pattern matching but instead use the wildcard symbol <code>*</code>. This symbol lets us pull in any files matching the specified pattern around the wildcard symbol.
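
To keep the sketch below self-contained, it builds a throwaway directory with empty placeholder files rather than pointing at our real data. The directory and the file names <code>ctrl_r1.fastq.gz</code>, <code>ctrl_r2.fastq.gz</code>, and <code>notes.txt</code> are made up for illustration.

```bash
# Throwaway directory with empty placeholder files (hypothetical names)
demo_dir="$(mktemp -d)"
touch "${demo_dir}/ctrl_r1.fastq.gz" "${demo_dir}/ctrl_r2.fastq.gz" "${demo_dir}/notes.txt"

# The wildcard sits outside of the quotes so that bash expands it
# Only files ending in .fastq.gz are pulled into the array
fastq_files=("${demo_dir}"/*.fastq.gz)

echo "${#fastq_files[@]}"
echo "${fastq_files[@]}"

# Clean up the throwaway directory
rm -r "${demo_dir}"
```

The array should end up holding the two <code>.fastq.gz</code> files and skip <code>notes.txt</code>.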

In [None]:
%%bash

Let's take a look at the files that we pull in this way.

<h2>Control flow</h2>

Our scripts, much like our Python code, are read from the first line down to the last line, and there are particular keywords that we can use to control the flow of code execution. We'll make use of two whose logic you're already familiar with, specifically <code>if</code>-<code>elif</code>-<code>else</code> conditional statements and for loops.

<h3>Conditional statements</h3>

The logic for conditional statements is the same in bash, but you'll want to keep in mind that the syntax is slightly different than what you're used to from Python.

In bash, conditional statements are set up with <code>if</code> followed by a condition specified within double brackets <code>&lbrack;&lbrack; ... &rbrack;&rbrack;</code>. The same line ends with <code>; then</code>, and the subsequent code block contains the commands to be executed if the condition is satisfied/true.

The set up for <code>elif</code> follows the same syntax as for <code>if</code> (kind of like Python), and the final <code>else</code> handles all remaining conditions.

Conditional statements are then ended/closed with a <code>fi</code> keyword, which is apparently just <code>if</code> backwards.

```bash
if [[ your condition here ]]; then
    command to be executed
    
elif [[ another condition here ]]; then
    other command to be executed
    
else
    handle remaining conditions
    
fi
```

Something to keep in mind also is that bash evaluates conditions using exit statuses: an exit status of <code>0</code> represents <code>true</code> and indicates that a command succeeded, whereas any non-zero exit status (<code>1</code> or greater) indicates false or an error/failure.
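
We can see this for ourselves with the special variable <code>?</code>, which holds the exit status of the most recent command. The sketch below is a minimal illustration, and the variable names <code>status_true</code> and <code>status_false</code> are our own.

```bash
# A condition that holds exits with a status of 0
[[ 5 -gt 2 ]]
status_true=$?

# A condition that does not hold exits with a non-zero status
# The || captures the status right away when the condition fails
[[ 5 -lt 2 ]] || status_false=$?

echo "${status_true}"    # 0
echo "${status_false}"   # 1
```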

For our example script, we can set up a conditional statement to handle the first positional parameter, which can set your script up to take in a file path to your group's RNA-seq data and to exit when a help option or no positional argument is specified.
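
One possible shape for this conditional is sketched below. To keep it self-contained, the sketch writes a throwaway script to a temporary file and runs it three different ways. The <code>-h</code> help option, the messages, and the <code>my/data</code> path are assumptions made for illustration rather than the set up we'll necessarily use in class.

```bash
# Throwaway script (hypothetical stand-in for our practice script)
demo_script="$(mktemp)"
cat > "${demo_script}" << 'EOF'
#!/bin/bash
if [[ -z "${1}" ]]; then
    # -z is true when the string is empty, so no argument was given
    echo "No file path provided."
    exit 1

elif [[ "${1}" == "-h" ]]; then
    echo "Usage: provide the path to your data directory."
    exit 0

else
    data_dir="${1}"
    echo "Using data in ${data_dir}"

fi
EOF

# No argument: the script exits with a non-zero status
no_arg="$(bash "${demo_script}")" || no_arg_status=$?

# Help option and a file path
help_arg="$(bash "${demo_script}" -h)"
path_arg="$(bash "${demo_script}" my/data)"

echo "${no_arg} (exit status ${no_arg_status})"
echo "${help_arg}"
echo "${path_arg}"

rm "${demo_script}"
```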

In [None]:
%%bash

Let's take a look at our script to see how it runs.

<h3>For loops</h3>

The logic of for loops remains the same, but you'll just want to keep in mind the slight differences in syntax, like how conditional statements have slightly different syntax in bash as well.

You can set up a for loop with the <code>for</code> keyword followed by a variable, <code>in</code>, and the array you want to loop through, and the same line ends with <code>; do</code>. The subsequent code block then contains the operations/commands to be repeated until there are no more elements to loop through. The end of the for loop is specified by a <code>done</code> keyword.

```bash
for element in "${array[@]}"; do
    command to be repeated
done
```

We can make use of our <code>cats</code> array from earlier to try out a for loop in our script.
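
A minimal sketch of such a loop is shown below. The cat names, the loop variable <code>kitty</code>, and the <code>count</code> variable are placeholders made up for this example.

```bash
cats=("Liebchen" "Mochi" "Biscuit")
count=0

# Each pass through the loop assigns the next element to kitty
for kitty in "${cats[@]}"; do
    echo "${kitty} says meow"
    count=$((count + 1))
done

echo "Looped ${count} times"
```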

In [None]:
%%bash

Let's take a look at our script and see how the for loop runs.

We can also then pull index values or calculate our own range of values (like how we specified <code>np.arange()</code> and <code>range()</code> in our Python for loops) to then iterate through multiple arrays. Let's create a second array and work through both our <code>cats</code> array and the <code>fave_snack</code> array.
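
One way to sketch this is to loop through the index values of one array and use each index to pull the matching element from both arrays. The cat names and snacks below are made up, and the sketch assumes both arrays have the same length.

```bash
cats=("Liebchen" "Mochi" "Biscuit")
fave_snack=("tuna" "chicken" "salmon")

# Loop through the index values 0, 1, 2 instead of the elements
for i in "${!cats[@]}"; do
    echo "${cats[$i]} loves ${fave_snack[$i]}"
done
```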

In [None]:
%%bash

Now let's see how our for loop handles both arrays.

<h1 style="font-size: 40px; margin-bottom: 0px;">Exercise set</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

For this exercise set, you'll be applying what you've learned in this notebook to set up a new shell script called <code>rna-seq-qc.sh</code> that can perform the following tasks:

<ol>
    <li>Make new directory to hold some reads in this week's directory</li>
    <li>Save the file path to your group's RNA-seq truncated data set to a variable</li>
    <li>Pull in your group's file names into an array</li>
    <li>Pull out the first 10 reads from each of your sequence files</li>
    <li>Make a directory to hold FastQC analysis outputs</li>
    <li>Run a FastQC analysis on your group's RNA-seq truncated data files (you can refer back to notebook 6-1 on how to run <code>fastqc</code>)</li>
</ol>

We'll reconvene once everyone has their shell scripts to take a look to see how people set it up.

Once we all have a working shell script, we can then proceed with running them, and while it runs, we'll review the RNA-seq pipeline in preparation for the rest of this module of our course.

<h1 style="font-size: 40px; margin-bottom: 0px;">RNA-seq overview</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

RNA-seq comprises a series of experiments and analysis programs much like ChIP-seq. First, an appropriate sample needs to be obtained and prepared prior to sequencing, followed by next-gen sequencing, then alignment. Where RNA-seq differs from ChIP-seq is that rather than taking the alignments and using them to call peaks, RNA-seq takes the alignment information and quantifies the reads associated with genomic features, specifically genes. The read counts between different conditions are normalized and used for differential expression analysis to determine how the expression profile changes with different experimental manipulations. The resulting dataset can then provide insights into interesting candidates for follow up or provide a broad overview of how the transcriptome is changing between different conditions or treatments.

<h2>RNA collection</h2>

Recall from MCB201A that we collected RNA from our MDA-MB-231 cells to take a closer look at what genes are being regulated by TAZ to gain insights into how TAZ is regulating tumorigenesis. So as Dr. Ingolia mentioned in lecture, one of the important steps to be able to effectively sequence mRNA is to enrich for the mRNA in our total RNA sample. This is because the total RNA within a cell is almost entirely made up of ribosomal RNA (rRNA), which you can see from our Bioanalyzer traces (<strong>Fig 1</strong>), where the very prominent rRNA peaks allow us to assess the overall integrity of our RNA sample.

<h4 style="text-align: center;"><strong>Fig 1</strong></h4>
<img src="./images/9_1_fig_1.png" style="height: 200px; margin: auto;"/>
<p style="text-align: center;">Example Bioanalyzer trace from Group 3</p>

<h2>Library prep</h2>

In order to get enough material to sequence, a cDNA library needs to be prepared from the enriched mRNA. Our unprepped RNA was sent to the FGL for QC and library prep, where they used the KAPA RNA HyperPrep Kit in order to prepare a cDNA library.

<h3>1. mRNA enrichment</h3>

The first step is polyA selection, which enriches for mRNA by hybridizing their polyA tails to oligo-dTs. The oligo-dTs are attached to magnetic beads, which allow your mRNA to be pulled down and any unbound RNAs (your rRNAs, ncRNAs, etc) to be washed away. So this is like any standard pull-down experiment. The mRNA is then dissociated from the beads, leaving you with a sample enriched for mRNA.

<h3>2. mRNA fragmentation</h3>

Following mRNA enrichment, the mRNAs are fragmented using a combination of heat and magnesium, breaking the mRNAs into smaller fragments. This allows short reads to cover the span of the mRNA rather than being localized to just the ends of the mRNA.

<h3>3. Library preparation</h3>

A cDNA is then synthesized from the mRNA fragments and adapters are added onto the ends of the cDNA. Recall that Dr. Ingolia talked about how mRNA is transcribed from one strand of the DNA acting as a template, so mRNA has a specific strandedness derived from which strand of DNA acted as the template for mRNA transcription. This information is lost during cDNA synthesis if steps aren't taken to preserve or retain strandedness.

Genes can be transcribed from either the plus strand or the minus strand of DNA, and in some instances, you may have genes that overlap with each other in terms of their chromosomal location but are transcribed from opposite strands. So in order to reduce ambiguity when quantifying reads and assigning them to genes, cDNA preparation now often includes a slight modification that allows for this information to be preserved/retained.

<h4>RNA-seq strandedness</h4>

The basic idea is that when you create your cDNA library, you can selectively degrade one of your cDNA strands, and the one that is degraded depends on the specific protocol you are using. So when you amplify your cDNA library, you only have sequencing primers that are oriented a specific way, where each primer will only be associated with one strand of the cDNA rather than both (<strong>Fig 2</strong>).

<h4 style="text-align: center;"><strong>Fig 2</strong></h4>
<img src="./images/9_1_fig_2.png" style="height: 500px; margin: auto;"/>
<p style="text-align: center;">Modified image from Azenta Life Sciences</p>

In our case, our library was prepped with dUTPs incorporated into the second cDNA strand. Thus, the second cDNA strand cannot be amplified, and the first cDNA strand is sequenced. Therefore, our sequence data is considered <strong><u>directional on the first strand</u></strong> (UC Berkeley QB3 Functional Genomics Lab).

As Dr. Ingolia talked about during lecture, the strandedness of your library has an impact on downstream alignment and read counting because some RNAs will be coded on the plus strand and others on the minus strand. We'll encounter this later on when we do our analyses, and we'll play around with an alignment where we provide the incorrect strandedness to see what impact that has on our read counts.

<h2>Next-gen sequencing</h2>

Sequencing of cDNA libraries for our RNA-seq experiments is conceptually the same as for ChIP-seq. One key difference in our case is that our samples were sequenced using paired-end sequencing, which means that each fragment was sequenced from both ends.

This means that our sequencing data for each replicate has <strong><u>two</u></strong> <code>.fastq.gz</code> files associated with it, where one corresponds to the reads from the first mate <code>&ast;_r1.fastq.gz</code> and the other from the second mate <code>&ast;_r2.fastq.gz</code>. Other than that, the type of results from sequencing are pretty similar to what we've seen with our ChIP-seq data.
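
Because the two mate files share everything except the <code>_r1</code>/<code>_r2</code> part of their names, we can work out one file name from the other. The sketch below is one possible way to do that with the <code>%</code> pattern matching from our shell script practice, using a group 1 file name. The variable names <code>mate1</code>, <code>sample</code>, and <code>mate2</code> are assumptions for this example.

```bash
# First mate file for one replicate (group 1 example)
mate1="1M_g1_ctrl_r1.fastq.gz"

# Remove the mate-specific ending to get the shared sample name
sample="${mate1%_r1.fastq.gz}"

# Build the file name for the second mate
mate2="${sample}_r2.fastq.gz"

echo "${sample}"
echo "${mate2}"
```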

<h2>FastQC analysis of RNA-seq data</h2>

As you've seen with your shell script, we make use again of FastQC to take a look at whether or not there were issues with our sequencing runs. There are some notable differences in how the FastQC outputs look between ChIP-seq data and RNA-seq data. Like with all our data, you'll want to keep in mind what the biology is that gave rise to the results that we're seeing.

<h3>Per base sequence content</h3>

<h4 style="text-align: center;"><strong>Fig 3</strong></h4>
<img src="./images/9_1_fig_3.png" style="height: 350px; margin: auto;"/>

Following mRNA fragmentation, the mRNA needs a primer to bind for cDNA synthesis to occur. Because all the mRNA fragments will have different "starting" sequences, priming needs to be random. In reality though, random priming is not completely random, and there is often a bias with enrichment for certain bases. So RNA-seq data will often fail this analysis module (<strong>Fig 3</strong>).

<h3>Sequence duplication levels</h3>

<h4 style="text-align: center;"><strong>Fig 4</strong></h4>
<img src="./images/9_1_fig_4.png" style="height: 350px; margin: auto;"/>

Another difference that will show up in RNA-seq FastQC analyses compared to what we saw with ChIP-seq is the output of the analysis module quantifying the sequence duplication levels. Recall that FastQC assumes that an ideal library is a diverse one, which would mean that there shouldn't be overrepresented sequences in your sequencing data. However, for RNA-seq experiments, highly expressed genes will often be overrepresented in your dataset. So a warning or failure for this module isn't usually too big of an issue unless you weren't expecting it based on your experimental setup.

<h2>Alignment of RNA-seq data</h2>

Like with ChIP-seq, we need to align the reads to a reference. This allows us to later use the read alignments for downstream quantification. In ChIP-seq, the read alignments were used to build coverage maps, and in the case of RNA-seq, we use the read alignments to determine the levels of expression for each gene in order to identify differentially expressed genes between our two conditions (<strong>Fig 5</strong>). There are special considerations that we need to keep in mind when aligning data from RNA-seq, which Dr. Ingolia touched upon and which we'll see when we perform the alignments next week.

<h4 style="text-align: center;"><strong>Fig 5</strong></h4>
<img src="./images/9_1_fig_5.png" style="height: 400px; margin: auto;"/>
<p style="text-align:center;">Image from Kukurba and Montgomery 2015</p>

<h2>Quantification of aligned reads (read counting)</h2>

The next step after aligning reads is to quantify the number of reads mapped to each gene to give us an idea of the level of expression of each gene within our cells (<strong>Fig 5</strong>). A matrix can then be created from all the raw counts, which is then fed into a program to determine which genes are differentially expressed between conditions.

<h2>Differential expression analysis</h2>

For differential expression analysis, we'll make use of both Python and R to obtain a dataset of genes and their change in expression relative to controls, reported as log2 fold changes along with significance values. This data can also then be used to generate visualizations to help us get an overview of the results of our analysis (<strong>Fig 6</strong>).

<h4 style="text-align: center;"><strong>Fig 6</strong></h4>
<img src="./images/9_1_fig_6.png" style="height: 350px; margin: auto;"/>

<h2>Clustering</h2>

From there, we can play around with our results to identify potential genes of interest. We can also cluster our data in various ways to see how consistent the replicates are with one another (<strong>Fig 7</strong>). This can also help you identify potential experimental variability, such as batch effects, that may impact your interpretation of the data.

<h4 style="text-align: center;"><strong>Fig 7</strong></h4>
<img src="./images/9_1_fig_7.png" style="height: 350px; margin: auto;"/>

<h2>Functional enrichment and annotation</h2>

We can take our differential expression dataset and take a look at what biological processes, pathways, or functions are enriched in our dataset. This can give us a broad picture of what changes are occurring within the cell based on the changes we find when we look at different broad groupings of our dataset.

<h2>Multi-omics</h2>

Towards the end of the course, we'll also be pulling together the results of our ChIP-seq analysis and our RNA-seq analysis to take a multi-omics view of our data, integrating our two analyses to gain additional insights into how TAZ is regulating tumorigenesis.

<h1 style="font-size: 40px; margin-bottom: 0px;">References</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 98%;"></hr>

<p style="padding-left: 20px;"><a href="https://doi.org/10.1101/pdb.top084970" rel="noopener noreferrer"><u>Kukurba and Montgomery 2015 CSH Protocols:</u></a> RNA Sequencing and Analysis</p>