This notebook uses an R kernel.

# RNA-seq Raw Count Matrix Preprocessing

Author: Zhongyi (James) Guo <br>
Date: 10/28/2024

## Import Packages

In [1]:
getwd()

In [2]:
.libPaths()

In [3]:
library(tidyverse)
library(GEOquery)
library(AnnoProbe)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Loading required package: Biobase

Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:lubridate’:

    intersect, setdiff, union

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Setting options('download.file.method.GEOquery'='auto')

Setting options('GEOquery.inmemory.gpl'=FALSE)

AnnoProbe v 0.1.7  welcome to use AnnoProbe!
If you use AnnoProbe in published research, please acknowledgements:
We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

## Data Wrangling

### Metadata

In [4]:
eSet <- getGEO("GSE248760", destdir = '.', getGPL = FALSE)

Found 1 file(s)



GSE248760_series_matrix.txt.gz



In [5]:
title <- eSet$GSE248760_series_matrix.txt.gz@phenoData$title
title

In [6]:
# Name the column explicitly instead of relying on as.data.frame() to infer it
meta_data <- data.frame(title = title) |>
    separate(title, into = c("sample", "status"), sep = ", ")
meta_data

sample,status
<chr>,<chr>
Sample_1,IS
Sample_2,IS
Sample_3,IS
Sample_4,IS
Sample_5,C
Sample_6,C
Sample_7,C
Sample_8,C
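
R's default contrasts, and differential expression tools built on them, treat the first factor level as the reference group, so it can help to encode `status` as a factor with the control level first. The sketch below uses a hand-typed subset of the titles shown above and base R only, so it runs without the notebook's packages. `titles_example` and `meta_example` are illustrative names, not objects from this notebook.

```r
# Illustrative subset of the sample titles ("<sample>, <status>").
titles_example <- c("Sample_1, IS", "Sample_2, IS", "Sample_5, C", "Sample_6, C")

# Split each title on ", " into its fields.
parts <- strsplit(titles_example, ", ", fixed = TRUE)
stopifnot(all(lengths(parts) == 2))  # every title must have exactly two fields

meta_example <- data.frame(
  sample = vapply(parts, `[`, character(1), 1),
  status = vapply(parts, `[`, character(1), 2)
)

# Put the control group first so it becomes the reference level.
meta_example$status <- factor(meta_example$status, levels = c("C", "IS"))

levels(meta_example$status)
table(meta_example$status)
```

Any status label outside `c("C", "IS")` would become `NA` in the factor, which makes unexpected labels easy to spot with `anyNA()`.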


In [7]:
write_csv(meta_data, '../../result/deseq2/meta_data.csv')

### Count Matrix

The RNA-seq raw count matrix was downloaded over HTTP from the GEO accession page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE248760
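
For a scripted alternative to the manual download, GEOquery provides `getGEOSuppFiles()`. The sketch below only lists a series' supplementary files and assumes the count matrix is one of them, which the notebook does not state. `list_geo_supp_files` is a hypothetical helper name, and the call needs network access, so it is defined here but not run.

```r
# Hypothetical helper: list (without downloading) the supplementary files
# attached to a GEO series. Requires the GEOquery package and network access.
list_geo_supp_files <- function(accession = "GSE248760") {
  # fetch_files = FALSE asks GEOquery to report the files instead of fetching them.
  GEOquery::getGEOSuppFiles(accession, fetch_files = FALSE)
}

# Not run here because it needs a network connection:
# supp <- list_geo_supp_files()
# print(supp)
```

Inspecting the listing first avoids guessing the file name; the file actually used in this notebook should be confirmed against the accession page.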

In [8]:
count <- read.table('../../data/GSE248760_raw_counts.txt', header = TRUE)
head(count)

,NAME,Sample_1,Sample_2,Sample_3,Sample_4,Sample_5,Sample_6,Sample_7,Sample_8
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,CLASS:DOSE,IS,IS,IS,IS,C,C,C,C
2,ENSG00000223972,13,18,3,1,5,11,17,10
3,ENSG00000227232,1087,1002,182,531,200,114,319,172
4,ENSG00000278267,23,33,1,12,6,1,4,1
5,ENSG00000243485,0,3,0,1,2,2,0,2
6,ENSG00000284332,0,0,0,0,0,0,0,0


The first row (`CLASS:DOSE`) labels each sample as ischemic stroke (IS) or control (C). This information is already in the metadata, so we remove the row.
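
Before discarding the label row, it may be worth confirming that it agrees with the metadata. The following is a minimal sketch on toy data shaped like the table above. `count_example`, `meta_example`, and `status_matches` are illustrative names, and only two sample columns are included.

```r
# Toy stand-ins for the raw count table (label row still present)
# and for the metadata table.
count_example <- data.frame(
  NAME     = c("CLASS:DOSE", "ENSG00000223972", "ENSG00000227232"),
  Sample_1 = c("IS", "13", "1087"),
  Sample_5 = c("C", "5", "200")
)
meta_example <- data.frame(
  sample = c("Sample_1", "Sample_5"),
  status = c("IS", "C")
)

# Labels stored in the first row, named by sample column.
row_status <- unlist(count_example[1, -1])

# Compare by sample name rather than by position, so a different
# column order would not cause a false mismatch.
status_matches <- identical(
  unname(row_status[meta_example$sample]),
  meta_example$status
)
status_matches
```

If `status_matches` is `FALSE`, the two sources disagree about at least one sample, and the row should be inspected before it is dropped.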

In [9]:
count <- count |> slice(-1)
count

NAME,Sample_1,Sample_2,Sample_3,Sample_4,Sample_5,Sample_6,Sample_7,Sample_8
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
ENSG00000223972,13,18,3,1,5,11,17,10
ENSG00000227232,1087,1002,182,531,200,114,319,172
ENSG00000278267,23,33,1,12,6,1,4,1
ENSG00000243485,0,3,0,1,2,2,0,2
ENSG00000284332,0,0,0,0,0,0,0,0
ENSG00000237613,0,0,0,0,0,0,2,0
ENSG00000268020,0,0,0,0,0,0,0,0
ENSG00000240361,0,0,0,0,0,0,0,0
ENSG00000186092,0,0,0,0,0,0,0,0
ENSG00000238009,29,8,4,11,57,38,28,13
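
Because the label row was read together with the counts, every sample column is typed `<chr>` in the tables above, and dropping the row does not change that. If the counts are used in the same session, they need converting to numbers first. A minimal base R sketch on toy data follows; `count_example` and `count_matrix` are illustrative names, not objects from this notebook.

```r
# Toy table shaped like `count` after the label row has been removed:
# counts are still stored as character strings.
count_example <- data.frame(
  NAME     = c("ENSG00000223972", "ENSG00000227232", "ENSG00000243485"),
  Sample_1 = c("13", "1087", "0"),
  Sample_5 = c("5", "200", "2")
)

sample_cols <- setdiff(names(count_example), "NAME")

# as.integer() turns anything non-numeric into NA (with a warning),
# so the NA check below catches leftover label rows or stray text.
count_example[sample_cols] <- lapply(count_example[sample_cols], as.integer)

# Integer matrix with gene IDs as row names.
count_matrix <- as.matrix(count_example[sample_cols])
rownames(count_matrix) <- count_example$NAME
stopifnot(!anyNA(count_matrix))

str(count_matrix)
```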


In [10]:
write_tsv(count, '../../result/deseq2/count_clean.tsv')
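
A quick round trip can confirm that the saved file parses as numeric once the label row is gone. The sketch below writes a toy table to a temporary file and reads it back, using base R `write.table()` and `read.delim()` as stand-ins for the notebook's `write_tsv()` and a later `read_tsv()`. `count_example` and `reread` are illustrative names.

```r
# Toy table with counts stored as character strings, as in `count`.
count_example <- data.frame(
  NAME     = c("ENSG00000223972", "ENSG00000227232"),
  Sample_1 = c("13", "1087"),
  Sample_5 = c("5", "200")
)

# Write a tab-separated file without quotes or row names.
tmp <- tempfile(fileext = ".tsv")
write.table(count_example, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Re-read it and let R infer the column types from the text.
reread <- read.delim(tmp, stringsAsFactors = FALSE)
unlink(tmp)

sapply(reread, class)
```

On this toy input the gene ID column stays character and the sample columns are inferred as integers; the same check on the real file would flag any column that still contains text.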

## Conclusion

In this notebook, we extracted the sample metadata from GEO and cleaned the RNA-seq raw count matrix for downstream differential gene expression analysis.

In [11]:
sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS/LAPACK: /home/ubuntu/miniconda3/lib/libopenblasp-r0.3.28.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnoProbe_0.1.7     GEOquery_2.70.0     Biobase_2.62.0     
 [4] BiocGenerics_0.48.1 lubridate_1.9.3     forcats_1.0.0      
 [7] stringr_1.5.1       dplyr_1.1.4         purrr_1.0.2        
[10] readr_2.1.5         tidyr_1.3.1         tibble_3.2.1       
[13] ggplot2_3.5.1       tidyverse_2.0.0