Example Workflow: Differential expression

This notebook takes data and metadata from refine.bio and identifies genes that are differentially expressed between two groups.

Requirements and usage

This module requires you to install the following software to run examples yourself:

These requirements can be installed by following the instructions at the links above. The example R Notebooks are designed to check if additional required packages are installed and will install them if they are not.


We have prepared a quick guide to RStudio as part of our training content that you may find helpful if you're getting started with RStudio for the first time.

Note that the first time you open RStudio, you should select a CRAN mirror. You can do so by clicking Tools > Global Options > Packages and selecting a CRAN mirror near you with the Change button.

You can install the additional requirements (e.g., tidyverse) through RStudio.

Interacting with R Notebooks

You can open an R Notebook by opening the .Rmd file in RStudio. Note that working with R Notebooks requires certain R packages, but RStudio should prompt you to download them the first time you open one. This will allow you to modify and run the R code chunks. Chunks that have already been included in an example can be run by clicking the green play button in the top right corner of the chunk or by using Ctrl + Shift + Enter (Cmd + Shift + Enter on a Mac). See this guide to using R Notebooks for more information about inserting and executing code chunks.

Obtaining the R Notebooks

To run the example yourself, you will need to clone or download this repository:

Using your own data

For all the examples in this module, the gene expression data and sample metadata are stored in a data/ directory. If you'd like to adapt an example to include data you've obtained from refine.bio, we recommend placing the files in the data/ directory and changing the filenames and paths in the notebooks to match these files. We suggest saving plots and results to plots/ and results/ directories, respectively, as the notebooks create these directories automatically if you move them outside of the GitHub repository structure. We recommend using the limma user guide to help you set up your model in a way that takes into account your experimental setup and hypotheses.

Differential expression analysis with GenePattern

GenePattern contains ready-made analyses. For users who are not comfortable with R Notebooks, the GenePattern modules can be run using a graphical user interface (GUI). To use GenePattern, you have to create an account; we recommend following their guide to get started. For use with GenePattern, data from refine.bio needs to be converted to GenePattern formats. If you would like to perform differential expression analysis with refine.bio data but would prefer using GenePattern, follow the instructions below.

Preparing your data for GenePattern

In order to complete your differential expression analysis using GenePattern, you will need to have:

  1. a GCT file
  2. a CLS file

You can follow the steps below to create these files from your data.

Create a GCT file

To convert a gene expression tab-separated values (TSV) file provided by refine.bio into a 'gene cluster text' (GCT) file for use in GenePattern, download the create_gct_file.R script. To use this script, you will need to open Terminal (on Mac) or Command Prompt (on Windows). Call the script as in the examples below, followed by the --file argument with the name of the dataset TSV file in your current directory that you would like to convert.

Note: This script requires the optparse library. If optparse is not installed, the script will install it for you.

The script accepts the following arguments:

--file: name of the file in your current directory that you would like to convert.
--output: name for the output file, the ".gct" suffix will be added if you do not add it yourself (optional).
--rewrite: if a file with the same name as the output already exists, it will be overwritten (optional).

Examples of usage on the command line:

Below is the basic template for usage of this script.

Rscript scripts/create_gct_file.R \
 --file <PATH TO REFINE.BIO EXPRESSION TSV>

To get an idea of how this script and its arguments work, you can run the following examples in order.

Navigate to the correct directory
Depending on where you have put the refinebio-examples directory on your computer, you will have to change this path in the code chunk below. Be sure to either have the script and input file in your current working directory, or type out the full directory path for the script and/or input file (e.g., /users/Bob/Desktop/scripts/create_gct_file.R). For more guidance on how to navigate directories, we recommend this tutorial.
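Before calling Rscript, it can help to confirm that both paths resolve from wherever your terminal currently is. The following is a minimal sketch; check_inputs is a hypothetical helper written for this illustration and is not part of the repository.

```shell
# Hypothetical helper (not part of refinebio-examples): report whether each
# given file can be found from the current working directory.
check_inputs() {
  missing=0
  for file_path in "$@"; do
    if [ -f "$file_path" ]; then
      echo "found: $file_path"
    else
      echo "missing: $file_path (current directory: $(pwd))"
      missing=1
    fi
  done
  return "$missing"
}

# Run from the directory that contains scripts/ and differential-expression/.
check_inputs scripts/create_gct_file.R differential-expression/data/GSE71270.tsv \
  || echo "Fix the paths above before running Rscript."
```

If a file is reported as missing, either cd into the right directory or pass the full path instead of the relative one.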

Example 1
Here we will convert the file GSE71270.tsv, which is in our refinebio-examples/differential-expression/data directory, into a GCT file. Following the template above, we will replace <PATH TO REFINE.BIO EXPRESSION TSV> with the path to our file, differential-expression/data/GSE71270.tsv.

Rscript scripts/create_gct_file.R \
 --file differential-expression/data/GSE71270.tsv

Note that we have not specified an --output name, so in this case the script will use the original name of our file but replace .tsv with .gct. You should find that the same folder, differential-expression/data/, now contains a file named GSE71270.gct.
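The naming behavior described for --file and --output can be sketched in a few lines of shell. This is only an illustration of the rule as this README states it, not the code inside create_gct_file.R; default_gct_name is a hypothetical helper.

```shell
# Hypothetical helper illustrating the naming rule described in this README:
#   - with no output name, a trailing .tsv on the input is replaced by .gct
#   - with an output name, .gct is appended only if it is missing
default_gct_name() {
  input="$1"
  output="${2:-}"
  if [ -z "$output" ]; then
    output="${input%.tsv}.gct"
  fi
  case "$output" in
    *.gct) ;;                    # suffix already present, keep as-is
    *) output="$output.gct" ;;   # add the missing suffix
  esac
  printf '%s\n' "$output"
}

default_gct_name differential-expression/data/GSE71270.tsv
# prints differential-expression/data/GSE71270.gct

default_gct_name differential-expression/data/GSE71270.tsv differential-expression/GSE71270_special_name
# prints differential-expression/GSE71270_special_name.gct
```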

Example 2
After running the code chunk we showed above, let's try running the same thing a second time:

Rscript scripts/create_gct_file.R \
 --file differential-expression/data/GSE71270.tsv

What you should see is an error message that says this:

differential-expression/data/GSE71270.gct already exists. Use '--rewrite' option if you want this file to be overwritten.

This is telling us that create_gct_file.R will not write over an already existing file unless we explicitly tell it to.
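This kind of refuse-unless-asked guard is a common pattern. A minimal shell sketch of the idea follows; write_guarded is a hypothetical stand-in, and the actual check inside create_gct_file.R is written in R and may differ in detail.

```shell
# Hypothetical stand-in for an overwrite guard. The second argument plays the
# role of the --rewrite flag: pass "yes" to allow overwriting.
write_guarded() {
  out_file="$1"
  rewrite="${2:-no}"
  if [ -e "$out_file" ] && [ "$rewrite" != "yes" ]; then
    echo "$out_file already exists. Use '--rewrite' option if you want this file to be overwritten." >&2
    return 1
  fi
  if [ -e "$out_file" ]; then
    echo "Overwriting file named $out_file"
  fi
  printf 'placeholder contents\n' > "$out_file"   # stands in for the real write
}

demo_dir="$(mktemp -d)"
write_guarded "$demo_dir/example.gct"                          # first run: file is created
write_guarded "$demo_dir/example.gct" || echo "second run refused"
write_guarded "$demo_dir/example.gct" yes                      # explicit opt-in: overwritten
```

Defaulting to refusal means that re-running a command by accident cannot silently destroy an earlier result.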

Example 3
If we want to save over an already existing file, we need to use the rewrite option, just like the error message says.
Let's try that:

Rscript scripts/create_gct_file.R \
 --file differential-expression/data/GSE71270.tsv \
 --rewrite

This will overwrite the file we made in Example 1, but should print a message to tell you it is doing so:

Overwriting file named differential-expression/data/GSE71270.gct

Also note that in bash commands, a \ at the end of a line indicates that the command continues on the next line. Since we put --rewrite on its own line, we needed to add a \ after the --file argument so that the shell knows the command continues.
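To see the line-continuation rule in isolation, the sketch below passes the same arguments to printf once on a single line and once split with trailing backslashes. The file name input.tsv is a placeholder, and printf is used only so the example runs without R installed.

```shell
# A trailing backslash joins the next line to the current command,
# so both invocations receive exactly the same three arguments.
one_line="$(printf '%s\n' --file input.tsv --rewrite)"
continued="$(printf '%s\n' \
  --file input.tsv \
  --rewrite)"

if [ "$one_line" = "$continued" ]; then
  echo "same arguments"
fi
```

The backslash must be the very last character on the line. If a space follows it, the shell treats the line as complete and tries to run the next line as a separate command.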

Example 4
Lastly, if we would like to name the file something besides its original name, we can use the --output argument. Here let's save it directly to the differential-expression folder and call it something different.

Rscript scripts/create_gct_file.R \
 --file differential-expression/data/GSE71270.tsv \
 --output differential-expression/GSE71270_special_name.gct

Now you should see a file called differential-expression/GSE71270_special_name.gct.
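If you want to sanity-check a converted file, it helps to know the GCT layout: a version line (#1.2), a line with the number of rows and the number of samples, a header line starting with Name and Description, and then one row per gene. The sketch below builds that layout from a tiny made-up TSV using awk. It is an illustration of the format, not a replacement for create_gct_file.R; tsv_to_gct and the sample data are hypothetical, and filling the Description column with the gene ID is an assumption made for this example.

```shell
# Hypothetical illustration of the GCT v1.2 layout. Assumes the TSV has a
# header row, gene IDs in the first column, and one column per sample.
tsv_to_gct() {
  in_tsv="$1"
  out_gct="$2"
  awk -F '\t' -v OFS='\t' '
    NR == 1 { n_samples = NF - 1; $1 = "Name" OFS "Description"; header = $0; next }
    NF > 0  { $1 = $1 OFS $1; rows[++n_genes] = $0 }   # assumption: Description = gene ID
    END {
      print "#1.2"                # format version line
      print n_genes, n_samples    # data dimensions
      print header
      for (i = 1; i <= n_genes; i++) print rows[i]
    }
  ' "$in_tsv" > "$out_gct"
}

# Made-up two-gene, two-sample input for demonstration only.
demo_dir="$(mktemp -d)"
printf 'Gene\tsample_1\tsample_2\nGENE_A\t1.5\t2.5\nGENE_B\t3.0\t4.0\n' > "$demo_dir/tiny.tsv"
tsv_to_gct "$demo_dir/tiny.tsv" "$demo_dir/tiny.gct"
cat "$demo_dir/tiny.gct"
```

Opening your own .gct file in a text editor and comparing its first three lines against this layout is a quick way to confirm the conversion worked.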

Create a CLS format file

CLS formatted files provide the sample groups or phenotype information and are necessary for performing differential gene expression analysis using GenePattern. If you've already created a GCT format file from your data, as described above, you can create a CLS format file using GenePattern's online CLSFileCreator.
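For orientation, a categorical CLS file is three lines of plain text: the number of samples, the number of classes, and a literal 1; then a # followed by the class names; then one class index per sample (starting at 0), in the same order as the sample columns of the GCT file. The sketch below writes such a file by hand. write_cls and the group names control and treated are hypothetical, and this is not how CLSFileCreator is implemented.

```shell
# Hypothetical helper: write a categorical CLS file from per-sample group
# labels, given in the same order as the sample columns of the GCT file.
# Labels must not contain spaces.
write_cls() {
  out_cls="$1"
  shift
  printf '%s\n' "$@" | awk '
    !($0 in index_of) { index_of[$0] = n_classes++; names = names " " $0 }
    { sep = (NR > 1) ? " " : ""; labels = labels sep index_of[$0] }
    END {
      print NR, n_classes, 1   # samples, classes, and the constant 1
      print "#" names          # class names, first-seen order
      print labels             # one class index per sample
    }
  ' > "$out_cls"
}

demo_dir="$(mktemp -d)"
write_cls "$demo_dir/groups.cls" control control treated treated
cat "$demo_dir/groups.cls"
```

The sample order is the part that most often goes wrong: the labels on the third line are matched to samples by position only, so they must follow the GCT column order exactly.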

Now log in to GenePattern, select a Differential Expression module, and follow the instructions to upload and analyze your newly created GCT and CLS files.
