# R-based pipeline for calculating the DEGs using DESeq2

## Protocol B-2 - Calculation of Differentially Expressed Genes (DEGs) for an RNA-seq dataset

## Step 1. Set-up

### Install and import libraries
Install the required packages with `BiocManager`. To install a single package manually, run `BiocManager::install("package_name")` in R.

In [None]:
packages <- c("Rsamtools", "GenomicFeatures", "GenomicAlignments", "BiocParallel", "SummarizedExperiment", "txdbmaker", "DESeq2")

# Install BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install packages
for (pkg in packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    BiocManager::install(pkg, update = FALSE, ask = FALSE)  # update = FALSE: do not update already-installed packages
    cat(paste(pkg, "is now installed.\n"))
  } else {
    cat(paste(pkg, "is already installed.\n"))
  }
}

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)

Installing package(s) 'BiocVersion', 'Rsamtools'

also installing the dependencies ‘formatR’, ‘lambda.r’, ‘futile.options’, ‘UCSC.utils’, ‘GenomeInfoDbData’, ‘futile.logger’, ‘snow’, ‘BH’, ‘GenomeInfoDb’, ‘GenomicRanges’, ‘Biostrings’, ‘BiocGenerics’, ‘S4Vectors’, ‘IRanges’, ‘XVector’, ‘zlibbioc’, ‘bitops’, ‘BiocParallel’, ‘Rhtslib’


Old packages: 'askpass', 'commonmark', 'credentials', 'sys', 'xfun'



Rsamtools is now installed.



Installing package(s) 'GenomicFeatures'

also installing the dependencies ‘matrixStats’, ‘abind’, ‘SparseArray’, ‘MatrixGenerics’, ‘S4Arrays’, ‘DelayedArray’, ‘plogr’, ‘png’, ‘SummarizedExperiment’, ‘RCurl’, ‘rjson’, ‘Biobase’, ‘RSQLite’, ‘KEGGREST’, ‘XML’, ‘GenomicAlignments’, ‘BiocIO’, ‘restfulr’, ‘AnnotationDbi’, ‘rtracklayer’




GenomicFeatures is now installed.
GenomicAlignments is already installed.
BiocParallel is already installed.
SummarizedExperiment is already installed.



Installing package(s) 'txdbmaker'

also installing the dependencies ‘filelock’, ‘BiocFileCache’, ‘biomaRt’




txdbmaker is now installed.



Installing package(s) 'DESeq2'

also installing the dependencies ‘locfit’, ‘RcppArmadillo’




DESeq2 is now installed.


In [None]:
suppressPackageStartupMessages(library('Rsamtools'))
suppressPackageStartupMessages(library('GenomicFeatures'))
suppressPackageStartupMessages(library('GenomicAlignments'))
suppressPackageStartupMessages(library('BiocParallel'))
suppressPackageStartupMessages(library('SummarizedExperiment'))
suppressPackageStartupMessages(library('txdbmaker'))
suppressPackageStartupMessages(library('DESeq2'))

### Set the names of the result files

In [None]:
TPM_FILE <- 'TPM.csv'
TOTAL_DEGs_FILE <- 'Total_DEGs_result.csv'
DEGs_FILE <- 'DEGs_result.csv'

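### Set the labels for the control and experimental samples

Enter a short keyword for each group. These keywords define the comparison (experimental versus control) when the DEGs are calculated, so each keyword must match the `Sample_id` prefix used in the datasheet, that is, the sample name without its replicate suffix such as `_1`.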

In [None]:
CONTROL <- 'WT'   # Keyword for the control samples
EXPERIMENT <- 'K/O_RpoS'   # Keyword for the experimental samples
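
DESeq2 prints a note later in this notebook when group labels contain characters other than letters, numbers, `_` and `.` (here, the `/` in `K/O_RpoS`). The note is only a recommendation, but a quick check like the sketch below can flag such labels early. `is_safe_label` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper (not part of the protocol): TRUE when a label contains
# only letters, numbers, '_' and '.'.
is_safe_label <- function(label) {
  grepl("^[A-Za-z0-9_.]+$", label)
}

labels <- c(control = "WT", experiment = "K/O_RpoS")
safe <- vapply(labels, is_safe_label, logical(1))
print(safe)
```

If a label is flagged, renaming it (for example to `KO_RpoS`) in both the keyword and the datasheet would avoid the note; the analysis also runs with the original label.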

## Step 2. Prepare and upload the files

Google Drive cannot be mounted directly from the R kernel in Google Colab. Instead, upload the required files to the Colab environment and set the working directory to the upload location.

### Required files

* Reference genome annotation file (.gff) from **protocol A.1**
* Aligned files (.bam) from **protocol B.1**

### Set the working directory

In [None]:
work_directory <- '/content'  # Enter your working directory.
setwd(work_directory)
getwd()

## Step 3. Prepare Datasheet.csv file

This is a comma-separated file created by the user according to the following template.
Each row describes one sample.
The file has two columns:

* **Sample_id** : A sample name that is easy to identify. Replicate names should end with _1, _2, and so on.
* **BAM** : The path to the aligned BAM file for that sample, either relative to the working directory or absolute.

In [None]:
DATASHEET <- 'RNA-seq_datasheet_example.csv'  # Enter your datasheet name.
sampleTable <- read.csv(DATASHEET)
head(sampleTable)

“incomplete final line found by readTableHeader on 'RNA-seq_datasheet_example.csv'”


| | Sample_id | BAM |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | WT_1 | mini_RNA-seq_Ecoli_mid-37-1.bam |
| 2 | WT_2 | mini_RNA-seq_Ecoli_mid-37-2.bam |
| 3 | K/O_RpoS_1 | mini_RNA-seq_Ecoli_del_rpoS_mid-37-1.bam |
| 4 | K/O_RpoS_2 | mini_RNA-seq_Ecoli_del_rpoS_mid-37-2.bam |
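
The datasheet can also be written from R instead of a spreadsheet editor. The following is a minimal sketch, assuming placeholder BAM file names and a temporary output path (both are illustrative, not part of the protocol). Because `write.csv` ends the file with a newline, reading it back does not trigger the "incomplete final line" warning shown above.

```r
# Build the two-column datasheet; the BAM file names are placeholders.
sample_sheet <- data.frame(
  Sample_id = c("WT_1", "WT_2", "K/O_RpoS_1", "K/O_RpoS_2"),
  BAM = c("wt_rep1.bam", "wt_rep2.bam", "ko_rep1.bam", "ko_rep2.bam"),
  stringsAsFactors = FALSE
)

# Write to a temporary location (assumption: replace with your own path).
sheet_path <- file.path(tempdir(), "datasheet_sketch.csv")
write.csv(sample_sheet, sheet_path, row.names = FALSE)

# Read it back the same way the protocol does.
round_trip <- read.csv(sheet_path, stringsAsFactors = FALSE)
print(round_trip)
```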


## Step 4. Load GFF File

Set `gff` to the reference genome annotation file from ChEAP.  
`makeTxDbFromGFF` loads the GFF file into a transcript database (TxDb).  
`exonsBy` extracts the exons from that database, grouped by gene.

In [None]:
gff <- 'reference_NC_000913.3.gff'  # Enter the name of your reference genome file.
txdb <- makeTxDbFromGFF(gff, format="gff")  # Build the TxDb from the annotation file
exons <- exonsBy(txdb, by="gene")  # Exons grouped by gene (one list element per gene)

Import genomic features from the file as a GRanges object ... 
OK

Prepare the 'metadata' data frame ... 
OK

Make the TxDb object ... 
“the transcript names ("tx_name" column in the TxDb object) imported
  from the "Name" attribute are not unique”
“The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns): b0149,
  b0470, b0484, b1120, b1888, b2592, b3168, b4346, b4795”
OK



## Step 5. Load BAM files

Put the names of your BAM files into a character vector and check that they all exist in your working directory.

The `BamFileList` function prepares the BAM files for processing. The `yieldSize` argument sets how many reads are processed at once (2,000,000 here). Increase it to speed up counting, or decrease it to reduce memory use.

In [None]:
filenames <- as.character(sampleTable$BAM)

# Stop early and report which BAM files are missing, if any.
missing_files <- filenames[!file.exists(filenames)]
if (length(missing_files) > 0) {
  stop("These BAM files do not exist: ", paste(missing_files, collapse = ", "))
}

bamfiles <- BamFileList(filenames, yieldSize = 2000000)
print( "BamFileList has been successfully created." )

bamfiles

[1] "BamFileList has been successfully created."


BamFileList of length 4
names(4): mini_RNA-seq_Ecoli_mid-37-1.bam ...

## Step 6. Count Reads

`summarizeOverlaps` counts the number of reads that overlap each gene in the GFF file. First, we initialize multiprocessing, using the `workers` argument to set the number of cores to use. The `summarizeOverlaps` arguments are as follows:
* `features`: The genomic features (exons grouped by gene) loaded in Step 4
* `reads`: The BAM files listed in Step 5
* `mode`: How to deal with reads that overlap more than one feature. See [HTSeq-count](http://www-huber.embl.de/HTSeq/doc/count.html) documentation.
* `singleEnd`: TRUE if single-end, FALSE if paired-end
* `ignore.strand`: Whether to ignore strand information when counting, based on the library preparation method
    * TRUE: Standard Illumina
    * FALSE: Directional Illumina (Ligation), Standard SOLiD, dUTP, NSR, NNSR
* `preprocess.reads` (optional): A function that modifies the reads before counting
    * invertStrand: Necessary for dUTP, NSR and NNSR library preparation methods
* `fragments`: Whether to count unpaired reads (applies to paired-end data only)

In [None]:
register(MulticoreParam(workers = 4))
se <- summarizeOverlaps(features = exons,
                        reads = bamfiles,
                        mode = "IntersectionStrict",
                        singleEnd = FALSE,
                        ignore.strand = FALSE,
                        preprocess.reads = invertStrand,
                        fragments = FALSE)

In [None]:
se

class: RangedSummarizedExperiment 
dim: 4404 4 
metadata(0):
assays(1): counts
rownames(4404): b0001 b0002 ... b4823 b4824
rowData names(0):
colnames(4): mini_RNA-seq_Ecoli_mid-37-1.bam
  mini_RNA-seq_Ecoli_mid-37-2.bam
  mini_RNA-seq_Ecoli_del_rpoS_mid-37-1.bam
  mini_RNA-seq_Ecoli_del_rpoS_mid-37-2.bam
colData names(0):
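
To make the `mode` argument more concrete, the sketch below assigns toy reads to two overlapping toy genes under simplified versions of the `Union` and `IntersectionStrict` rules. It is an illustration only: the coordinates are made up, `assign_read` is a hypothetical helper, and strand, paired ends and gapped alignments are ignored. The actual counting is done by `summarizeOverlaps`.

```r
# Toy gene models (made-up coordinates); geneA and geneB overlap at 90-100.
features <- data.frame(gene = c("geneA", "geneB"),
                       start = c(1, 90),
                       end = c(100, 200),
                       stringsAsFactors = FALSE)

# Simplified assignment of one read to a gene under two counting modes.
# Returns the gene name, or NA when the read is ambiguous or hits no feature.
assign_read <- function(read_start, read_end, mode) {
  if (mode == "Union") {
    # Any overlap with the feature qualifies.
    hit <- read_start <= features$end & read_end >= features$start
  } else if (mode == "IntersectionStrict") {
    # The read must lie completely within the feature.
    hit <- read_start >= features$start & read_end <= features$end
  } else {
    stop("Unsupported mode: ", mode)
  }
  if (sum(hit) == 1) features$gene[hit] else NA_character_
}

# Three toy reads: inside geneA only, spanning into the overlap, inside the overlap.
reads <- data.frame(start = c(50, 80, 95), end = c(60, 95, 98))
for (mode in c("Union", "IntersectionStrict")) {
  assigned <- mapply(assign_read, reads$start, reads$end,
                     MoreArgs = list(mode = mode))
  cat(mode, ":", paste(assigned, collapse = ", "), "\n")
}
```

In this sketch, the read at 80-95 touches both genes, so the `Union` rule discards it, while the `IntersectionStrict` rule assigns it to geneA because it lies entirely within geneA only. The read at 95-98 lies within both genes and is discarded under both rules.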

## Step 7. Generate a dataframe containing raw counts

The final counts are stored in the [SummarizedExperiment](https://www.bioconductor.org/help/workflows/rnaseqGene/#summarizedexperiment) object. To view the raw counts, use `assay(se)`.

In [None]:
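metadata <- sampleTable
# Attach the datasheet as sample annotations (rows are in the same order as the BAM files)
colData(se) <- DataFrame(metadata)
# Use Sample_id as the column names
colnames(se) <- colData(se)$Sample_id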

head(assay(se))

| | WT_1 | WT_2 | K/O_RpoS_1 | K/O_RpoS_2 |
|---|---|---|---|---|
| b0001 | 0 | 0 | 0 | 0 |
| b0002 | 434 | 423 | 543 | 625 |
| b0003 | 180 | 187 | 274 | 284 |
| b0004 | 245 | 262 | 308 | 396 |
| b0005 | 2 | 1 | 2 | 6 |
| b0006 | 7 | 10 | 10 | 18 |


To assign groups for comparison, remove the replicate suffix (`_1`, `_2`, and so on) from the column names.

In [None]:
colData(se)$group <- sub('_[0-9]+$', '', colnames(se))  # Strip the trailing replicate suffix, e.g. WT_1 -> WT
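
Stripping a fixed number of characters only works while replicate numbers are single digits. The sketch below uses made-up sample names to compare removing the last two characters with removing a suffix pattern anchored at the end of the name.

```r
# Made-up sample names, including a two-digit replicate number.
sample_ids <- c("WT_1", "WT_2", "K/O_RpoS_1", "K/O_RpoS_10")

# Remove a trailing underscore followed by one or more digits.
by_pattern <- sub("_[0-9]+$", "", sample_ids)

# Remove the last two characters regardless of their content.
by_length <- sub(".{2}$", "", sample_ids)

print(data.frame(sample_ids, by_pattern, by_length))
```

With the pattern-based version, all replicates of a condition end up with the same group label even when there are ten or more replicates.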

## Step 8. Convert the dataframe into DESeqDataSet (DDS)

In [None]:
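dds <- DESeqDataSet(se, design = ~group)
nrow(dds)  # Number of genes before filtering
dds <- dds[rowSums(assay(dds)) > 0, ]  # Drop genes with zero counts in every sample
nrow(dds)  # Number of genes after filtering

head(assay(dds))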

“some variables in design formula are characters, converting to factors”
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



| | WT_1 | WT_2 | K/O_RpoS_1 | K/O_RpoS_2 |
|---|---|---|---|---|
| b0002 | 434 | 423 | 543 | 625 |
| b0003 | 180 | 187 | 274 | 284 |
| b0004 | 245 | 262 | 308 | 396 |
| b0005 | 2 | 1 | 2 | 6 |
| b0006 | 7 | 10 | 10 | 18 |
| b0007 | 2 | 1 | 3 | 5 |


## Step 9. Calculate FPKM

Calculate the fragments per kilobase of transcript per million mapped reads (FPKM) value from the DDS object.

In [None]:
fpkm_data <- fpkm(dds)

head(fpkm_data)

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



| | WT_1 | WT_2 | K/O_RpoS_1 | K/O_RpoS_2 |
|---|---|---|---|---|
| b0002 | 1411.29216 | 1357.818042 | 1532.89762 | 1423.26415 |
| b0003 | 1544.16439 | 1583.567724 | 2040.5992 | 1706.15177 |
| b0004 | 1524.11636 | 1608.893657 | 1663.37005 | 1725.14466 |
| b0005 | 53.77515 | 26.541511 | 46.68397 | 112.9747 |
| b0006 | 72.09188 | 101.662858 | 89.4076 | 129.81925 |
| b0007 | 11.19064 | 5.523303 | 14.57244 | 19.59175 |
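
For intuition, the plain FPKM definition can be worked through by hand on a toy matrix. This sketch uses made-up counts and gene lengths and the raw column sums as library sizes. DESeq2's `fpkm` by default (`robust = TRUE`) derives the library size from size factors instead, so its values will not match this simple calculation exactly.

```r
# Toy inputs (made up): raw counts for 3 genes in 2 samples, gene lengths in bp.
counts <- matrix(c(100, 200, 700,
                   50, 150, 300),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 500)

# Per million: divide each column by its library size in millions of fragments.
per_million <- sweep(counts, 2, colSums(counts) / 1e6, `/`)

# Per kilobase: divide each row by the gene length in kilobases.
fpkm_toy <- sweep(per_million, 1, gene_length_bp / 1e3, `/`)
print(fpkm_toy)
```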


## Step 10. Calculate TPM

Convert the calculated FPKM values to transcripts per million (TPM) values.

In [None]:
tpm_data <- sweep(fpkm_data,2,colSums(fpkm_data),`/`)*1e6  # Divide each column by the column sum (times 1e6)
head(tpm_data)

write.csv(tpm_data, file = TPM_FILE)

| | WT_1 | WT_2 | K/O_RpoS_1 | K/O_RpoS_2 |
|---|---|---|---|---|
| b0002 | 1273.02226 | 1256.404386 | 1362.83665 | 1277.474 |
| b0003 | 1392.87647 | 1465.293119 | 1814.21338 | 1531.38441 |
| b0004 | 1374.79262 | 1488.727491 | 1478.83436 | 1548.43179 |
| b0005 | 48.50659 | 24.55916 | 41.50481 | 101.40229 |
| b0006 | 65.02875 | 94.069792 | 79.48864 | 116.52139 |
| b0007 | 10.09425 | 5.110775 | 12.95576 | 17.58489 |
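
The FPKM-to-TPM conversion rescales each sample so that its values sum to one million. A minimal sketch on a made-up FPKM matrix, using the same `sweep` call as the protocol, shows this property:

```r
# Toy FPKM matrix (made-up values): 3 genes x 2 samples.
fpkm_toy <- matrix(c(10, 30, 60,
                     5, 5, 40),
                   nrow = 3,
                   dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Divide each column by its sum, then scale to one million.
tpm_toy <- sweep(fpkm_toy, 2, colSums(fpkm_toy), `/`) * 1e6
print(tpm_toy)

# Every sample now sums to one million.
print(colSums(tpm_toy))
```

Because every sample has the same total, TPM values are easier to compare across samples than FPKM values.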


## Step 11. Find DEGs

Calculate the DEGs from the DDS object using the `DESeq` function provided by the DESeq2 package.

Each column has the following meaning:
* **baseMean**: The mean of the normalized counts of the gene across all samples (both experimental and control groups).
* **log2FoldChange**: The log2 fold change in gene expression between the experimental and control groups. Positive means higher expression in the experimental group, negative means higher in the control.
* **lfcSE**: The standard error of the log2FoldChange.
* **stat**: The Wald test statistic, that is, the log2FoldChange divided by its standard error (lfcSE).
* **pvalue**: The probability of observing a test statistic at least this extreme if the gene were not differentially expressed.
* **padj**: The p-value adjusted for multiple testing (False Discovery Rate; FDR). Smaller values indicate more confidence that the expression difference is real. A missing (NA) value means that the gene was excluded from the adjustment, for example because of a low mean count.

In [None]:
dds <- DESeq( dds )
# Contrast: log2 fold change of EXPERIMENT relative to CONTROL
res <- results(dds,contrast = c( 'group', EXPERIMENT , CONTROL ))
# Convert the results object to a plain data frame
resTable <- as.data.frame( res )
head( resTable )

write.csv(resTable, TOTAL_DEGs_FILE, row.names=TRUE, quote=TRUE)

estimating size factors

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters

final dispersion estimates

fitting model and testing



| | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| b0002 | 495.272229 | 0.09300726 | 0.1260783 | 0.7376943 | 0.4607002 | 0.9995757 |
| b0003 | 225.420845 | 0.25745488 | 0.1897037 | 1.3571418 | 0.1747362 | 0.9995757 |
| b0004 | 294.898225 | 0.11370526 | 0.1610864 | 0.7058649 | 0.4802721 | 0.9995757 |
| b0005 | 2.510672 | 1.0055082 | 1.7566098 | 0.5724141 | 0.5670415 | NA |
| b0006 | 10.733935 | 0.34137971 | 0.8428585 | 0.4050261 | 0.6854583 | 0.9995757 |
| b0007 | 2.557887 | 1.03742772 | 1.7280046 | 0.6003617 | 0.5482652 | NA |
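
The relationship between the columns can be checked by hand. The sketch below takes the `log2FoldChange` and `lfcSE` values of b0002 from the results table above and recomputes the Wald statistic and a two-sided p-value from the standard normal distribution. It assumes the default Wald test settings used in this protocol.

```r
# Values for b0002, copied from the results table above.
log2_fold_change <- 0.09300726
lfc_se <- 0.1260783

# Wald statistic: the estimate divided by its standard error.
wald_stat <- log2_fold_change / lfc_se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(wald_stat))

print(c(stat = wald_stat, pvalue = p_value))
```

The printed values should agree with the `stat` and `pvalue` columns for b0002 up to rounding.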


## Step 12. Identify the DEGs using cut-offs

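To focus on the genes that are most strongly affected by the experimental condition, apply cut-offs and keep only the genes that meet both of them: an adjusted p-value (padj) below 0.05 and an absolute log2 fold change above 1.0 (more than a twofold change). Genes with a missing padj are excluded.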

In [None]:
# Keep genes with non-missing values, padj < 0.05 and |log2FoldChange| > 1
deg <- resTable[complete.cases(resTable[, c('padj', 'log2FoldChange')]) &
                  resTable$padj < 0.05 &
                  abs(resTable$log2FoldChange) > 1.0, ]

write.csv(deg, DEGs_FILE, row.names=TRUE, quote=TRUE)
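
The same kind of filter can be tried on a small table to see how missing values and the sign of the fold change behave. This is a sketch with made-up values. The names `padj_cutoff` and `lfc_cutoff` are illustrative, and the split into up- and down-regulated genes is an optional extra step, not part of the protocol.

```r
# Toy results table (made-up values) with one missing padj.
toy_res <- data.frame(
  log2FoldChange = c(2.5, -1.8, 0.4, 3.0),
  padj = c(0.001, 0.03, 0.01, NA),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

padj_cutoff <- 0.05
lfc_cutoff <- 1.0

# which() drops rows where the condition is NA, so geneD is excluded.
keep <- which(toy_res$padj < padj_cutoff &
                abs(toy_res$log2FoldChange) > lfc_cutoff)
toy_deg <- toy_res[keep, ]

# Split into up- and down-regulated genes by the sign of the fold change.
up <- rownames(toy_deg)[toy_deg$log2FoldChange > 0]
down <- rownames(toy_deg)[toy_deg$log2FoldChange < 0]
print(list(up = up, down = down))
```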