This notebook uses an R kernel.

# Differential Gene Expression using DESeq2

Author: Zhongyi (James) Guo <br>
Date: 10/28/2024

## Import Packages

In [1]:
getwd()

In [2]:
.libPaths()

In [3]:
library(tidyverse)
library(DESeq2)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org

## Import Data

In [4]:
# count matrix
count_clean <- read_tsv('../../result/deseq2/count_clean.tsv') |> 
    as.data.frame() |>
    mutate(across(where(is.double), as.integer))
head(count_clean)

Rows: 58174 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): NAME
dbl (8): Sample_1, Sample_2, Sample_3, Sample_4, Sample_5, Sample_6, Sample_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


,NAME,Sample_1,Sample_2,Sample_3,Sample_4,Sample_5,Sample_6,Sample_7,Sample_8
,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,ENSG00000223972,13,18,3,1,5,11,17,10
2,ENSG00000227232,1087,1002,182,531,200,114,319,172
3,ENSG00000278267,23,33,1,12,6,1,4,1
4,ENSG00000243485,0,3,0,1,2,2,0,2
5,ENSG00000284332,0,0,0,0,0,0,0,0
6,ENSG00000237613,0,0,0,0,0,0,2,0
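
The `as.integer` coercion above silently truncates any fractional part, so it can be worth confirming that the double columns hold whole numbers before converting. A minimal sketch on a toy data frame (the `toy_counts` object and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy counts; stands in for the table returned by read_tsv()
toy_counts <- data.frame(
  NAME     = c("geneA", "geneB", "geneC"),
  Sample_1 = c(13, 1087, 0),
  Sample_2 = c(18, 1002, 3)
)

# Find the numeric columns and test that every value is a whole number
num_cols <- vapply(toy_counts, is.numeric, logical(1))
is_whole <- vapply(toy_counts[num_cols],
                   function(x) all(x == round(x)),
                   logical(1))

# Coerce only when nothing would be truncated
if (all(is_whole)) {
  toy_counts[num_cols] <- lapply(toy_counts[num_cols], as.integer)
}
str(toy_counts)
```

If any column failed the check, the counts are probably not raw integer counts (for example, estimated counts from a quantifier), and rounding would be a deliberate choice rather than a formality.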


In [5]:
# meta data
meta_data <- read_csv('../../result/deseq2/meta_data.csv')
meta_data$status <- as.factor(meta_data$status)
meta_data

Rows: 8 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sample, status

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


sample,status
<chr>,<fct>
Sample_1,IS
Sample_2,IS
Sample_3,IS
Sample_4,IS
Sample_5,C
Sample_6,C
Sample_7,C
Sample_8,C
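
`as.factor` orders levels alphabetically, so `C` becomes the reference level here and DESeq2 reports log2 fold changes for IS relative to C. Setting the level order explicitly makes that choice visible. A small sketch with a stand-in vector (`status_chr` is a hypothetical name):

```r
# Stand-in for the status column as read from the CSV
status_chr <- c("IS", "IS", "IS", "IS", "C", "C", "C", "C")

# Put the control group first so it is the reference level
status <- factor(status_chr, levels = c("C", "IS"))

levels(status)  # the first level is the reference
table(status)   # group sizes
```

In this dataset the explicit ordering gives the same result as `as.factor`, but it guards against a silent flip if the group labels are ever renamed.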


## DESeq2

In [6]:
dds <- DESeqDataSetFromMatrix(countData = count_clean, 
                              colData = meta_data, 
                              design = ~status, 
                              tidy = TRUE)
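
DESeq2 expects the sample columns of the count table and the rows of the sample table to be in the same order, and it does not reorder them. A quick pre-flight check could look like the following sketch, where `toy_counts` and `toy_meta` are hypothetical stand-ins for `count_clean` and `meta_data`:

```r
# Hypothetical stand-ins for count_clean and meta_data
toy_counts <- data.frame(
  NAME     = c("geneA", "geneB"),
  Sample_1 = c(5L, 0L),
  Sample_2 = c(7L, 2L),
  Sample_3 = c(1L, 9L)
)
toy_meta <- data.frame(
  sample = c("Sample_1", "Sample_2", "Sample_3"),
  status = c("IS", "IS", "C")
)

# With tidy = TRUE the first column holds gene IDs, so samples start at column 2
count_samples <- colnames(toy_counts)[-1]

# Stop early if the two tables disagree on sample names or their order
stopifnot(identical(count_samples, toy_meta$sample))
```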

In [7]:
dds <- DESeq(dds)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing



In [8]:
res <- results(dds)
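
By default, `results()` reports Benjamini-Hochberg adjusted p-values in the `padj` column. The adjustment itself can be illustrated with base R's `p.adjust`. The p-values below are made up for illustration, and DESeq2 additionally applies independent filtering before adjusting, which this sketch does not reproduce:

```r
# Hypothetical raw p-values for five genes
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment controls the false discovery rate
padj_bh <- p.adjust(pvals, method = "BH")

data.frame(pvalue = pvals, padj = padj_bh)
```

Adjusted values are never smaller than the raw p-values, which is why a gene can pass `pvalue < 0.05` and still fail `padj < 0.05`.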

In [9]:
sig_gene <- res |>
    as.data.frame() |>
    filter(abs(log2FoldChange) > 1 & padj < 0.05) |>
    rownames_to_column('gene')
sig_gene

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000227232,420.856227,1.877341,0.4063936,4.619515,3.846377e-06,3.423083e-05
ENSG00000278267,8.613466,2.615336,0.8545570,3.060458,2.209988e-03,7.991285e-03
ENSG00000268903,18.716412,5.774899,0.9753198,5.921032,3.199285e-09,6.290752e-08
ENSG00000269981,12.547480,5.752313,1.2084716,4.759990,1.936023e-06,1.891486e-05
ENSG00000239906,3.579557,3.791956,1.4671741,2.584531,9.751165e-03,2.777531e-02
ENSG00000279457,551.484304,2.532600,0.3827524,6.616811,3.670306e-11,1.079563e-09
ENSG00000250575,53.860707,5.050183,0.6465986,7.810384,5.701423e-15,3.577285e-13
ENSG00000225972,5794.202467,2.717834,0.5158510,5.268641,1.374375e-07,1.858197e-06
ENSG00000229344,1565.304483,2.235852,0.6209036,3.600965,3.170388e-04,1.543895e-03
ENSG00000230092,18.129602,-2.704409,0.6059129,-4.463362,8.068343e-06,6.512045e-05


In [10]:
dim(sig_gene)

In [11]:
nrow(sig_gene) / nrow(count_clean)

About 14.27% of the 58,174 genes in the count matrix met both thresholds (adjusted p-value < 0.05 and absolute log2 fold change > 1) in the IS versus C comparison.
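
The ratio computed above divides by every row of the count matrix, including genes that DESeq2 could not test (their `padj` is `NA`, for example when all counts are zero). A sketch of how the two possible denominators differ, using a made-up results table (`toy_res` and its values are hypothetical):

```r
# Hypothetical results table; an NA padj marks a gene that was not tested
toy_res <- data.frame(
  log2FoldChange = c(1.9, -2.7, 0.3, NA, 5.1, -0.4),
  padj           = c(3e-05, 6e-05, 0.40, NA, 0.20, NA)
)

# Same thresholds as the notebook; genes with NA padj count as not significant
is_sig <- with(toy_res,
               !is.na(padj) & padj < 0.05 & abs(log2FoldChange) > 1)

frac_all    <- sum(is_sig) / nrow(toy_res)               # every gene
frac_tested <- sum(is_sig) / sum(!is.na(toy_res$padj))   # tested genes only

c(all_genes = frac_all, tested_genes = frac_tested)
```

Which denominator is more appropriate depends on the question being asked; reporting it alongside the percentage avoids ambiguity.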

## Save Data

In [12]:
saveRDS(dds, file = "../../result/deseq2/dds.rds")
write_csv(sig_gene, "../../result/deseq2/sig_gene.csv")

## Conclusion

In this notebook, we used DESeq2 to test for differential gene expression between the IS and Control (C) conditions. Genes were called significant when their adjusted p-value was below 0.05 and their absolute log2 fold change exceeded 1, which corresponds to more than a two-fold change in expression in either direction.
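
Because the results are reported on a log2 scale, converting back to a linear fold change is a matter of exponentiating. A brief sketch using two log2 fold changes copied from the `sig_gene` table (the `lfc` vector name is only for illustration, and the direction assumes C is the reference level, as the alphabetical factor order implies):

```r
# log2 fold changes for two genes, copied from the sig_gene output
lfc <- c(ENSG00000227232 = 1.877341, ENSG00000230092 = -2.704409)

# Linear fold change of IS relative to C (values below 1 mean lower in IS)
fold_change <- 2^lfc
fold_change

# Boundary values of the |log2FC| > 1 cutoff on the linear scale
2^c(1, -1)
```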

In [13]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] DESeq2_1.42.1               SummarizedExperiment_1.32.0
 [3] Biobase_2.62.0              MatrixGenerics_1.14.0      
 [5] matrixStats_1.3.0           GenomicRanges_1.54.1       
 [7] GenomeInfoDb_1.38.8         IRanges_2.36.0             
 [9] S4Vectors_0.40.2            BiocGenerics_0.48.1        
[11] lubridate_1.9.3             forcats_1.0.0              
[13] stringr_1.5.1               dpl