# Normalise gene expression data

Nuha BinTayyash, 2020

This notebook shows how to use the [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) R package to normalize scRNA-seq gene expression data for highly expressed genes in islet $\alpha$ cells from the [GSE87375 single-cell RNA-seq](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87375) dataset.

#### Load the scRNA-seq gene expression data from the [GSE87375 single-cell RNA-seq](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87375) dataset and normalize it using DESeq2

In [1]:
# Read the raw read counts; the first CSV column (Ensembl gene IDs) becomes the row names.
counts <- read.csv(file = 'GSE87375_Single_Cell_RNA-seq_Gene_Read_Count.csv', row.names = 1, header = TRUE)
dim(counts)
counts

Unnamed: 0,Symbol,GeneLength,bE17.5_1_01,bE17.5_1_02,bE17.5_1_03,bE17.5_1_04,bE17.5_1_05,bE17.5_2_01,bE17.5_2_02,bE17.5_2_03,...,aE17.5_2_22,aE17.5_2_23,aE17.5_4_07,aE17.5_4_08,aP0_2_12,aP0_2_13,aP0_2_14,aP0_3_15,aP0_3_16,aP18_2_14
ENSMUSG00000000001,Gnai3,3262,316,410,186,364,439,60,358,285,...,128,297,320,263,91,151,252,138,358,43
ENSMUSG00000000003,Pbsn,902,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSMUSG00000000028,Cdc45,2143,0,0,0,0,0,0,0,0,...,0,0,33,30,0,0,0,0,0,0
ENSMUSG00000000031,H19,2286,0,0,0,0,0,0,1,0,...,95,0,0,0,0,0,0,0,0,0
ENSMUSG00000000037,Scml2,4847,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSMUSG00000000049,Apoh,1190,0,0,0,0,0,0,0,0,...,0,0,0,8,0,0,0,0,0,0
ENSMUSG00000000056,Narf,4395,74,179,62,29,194,116,76,51,...,207,70,112,3,103,0,92,58,42,8
ENSMUSG00000000058,Cav2,2733,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,52
ENSMUSG00000000078,Klf6,4217,1,22,11,4,106,26,95,27,...,88,97,91,21,0,115,1,48,4,41
ENSMUSG00000000085,Scmh1,3544,26,45,22,0,1,25,21,48,...,11,64,56,8,86,0,40,35,54,4


In [2]:
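# names(counts) returns the column (sample) names, none of which contain "ERCC",
# so this selects zero rows. To search the gene identifiers instead, match
# against rownames(counts) or counts$Symbol.
counts[grepl("ERCC", names(counts)), ]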

“number of rows of result is not a multiple of vector length (arg 2)”

Symbol,GeneLength,bE17.5_1_01,bE17.5_1_02,bE17.5_1_03,bE17.5_1_04,bE17.5_1_05,bE17.5_2_01,bE17.5_2_02,bE17.5_2_03,...,aE17.5_2_22,aE17.5_2_23,aE17.5_4_07,aE17.5_4_08,aP0_2_12,aP0_2_13,aP0_2_14,aP0_3_15,aP0_3_16,aP18_2_14
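
Spike-in controls, when a count table includes them, are usually identified by their row identifiers rather than by sample names. The sketch below shows that pattern on a made-up table; `toy_counts`, its row names, and its values are hypothetical, and whether the GSE87375 count file contains any ERCC rows is not established here.

```r
# Toy count table (hypothetical data): gene IDs are row names, cells are columns.
toy_counts <- data.frame(
  Symbol  = c("Gnai3", "Pbsn", "ERCC-00002", "ERCC-00003"),
  cell_01 = c(316, 0, 1200, 85),
  cell_02 = c(410, 0, 1100, 90),
  row.names = c("ENSMUSG00000000001", "ENSMUSG00000000003",
                "ERCC-00002", "ERCC-00003")
)

# Match the pattern against the row names, giving one logical value per row,
# so the index length equals nrow(toy_counts).
is_ercc <- grepl("^ERCC", rownames(toy_counts))

ercc_rows       <- toy_counts[is_ercc, ]   # spike-in rows
endogenous_rows <- toy_counts[!is_ercc, ]  # all remaining genes

nrow(ercc_rows)
nrow(endogenous_rows)
```

Because the logical index has one entry per row, no recycling takes place and the selection is unambiguous.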


In [2]:
alpha_col_data <- read.csv(file = 'alpha_time_points.csv',row.names = 1, header = TRUE)
head(alpha_col_data)

Unnamed: 0,pseudotime
aE17.5_2_09,0.0
aE17.5_2_16,0.005363444
aE17.5_1_11,0.022454001
aE17.5_3_07,0.022891405
aE17.5_4_06,0.030221853
aE17.5_3_04,0.037523365


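Keep the $\alpha$ cells listed in the pseudotime table and filter out genes whose mean read count across those cells is 0.1 or lower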

In [3]:
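# Keep columns whose names start with "a" (alpha cells); anchoring the pattern
# avoids matching any other column that merely contains the letter "a".
alpha_counts <- counts[ , grepl("^a", names(counts))]
# Restrict to the cells that have pseudotime values, in the metadata's row order.
alpha_counts <- alpha_counts[rownames(alpha_col_data)]
# Drop genes whose mean read count across these cells is 0.1 or lower.
keep <- rowMeans(alpha_counts) > 0.1
alpha_counts <- alpha_counts[keep, ]
dim(alpha_counts)
write.csv(alpha_counts, file = "alpha_read_counts.csv")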
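
The filtering step can be tried on a small made-up table to see what each line does. This is a minimal sketch with hypothetical inputs (`toy_counts`, `toy_col_data`, and all values are invented for illustration); it also checks that every cell in the metadata exists in the count table before subsetting.

```r
# Hypothetical toy inputs: 3 genes, 2 beta cells and 3 alpha cells.
toy_counts <- data.frame(
  Symbol      = c("g1", "g2", "g3"),
  bE17.5_1_01 = c(5, 0, 2),
  aE17.5_2_09 = c(3, 0, 0),
  aE17.5_2_16 = c(7, 0, 1),
  aP0_2_12    = c(2, 0, 0),
  bP0_1_01    = c(9, 1, 4),
  row.names   = c("gene1", "gene2", "gene3")
)

# Pseudotime metadata, one row per alpha cell (order differs from toy_counts).
toy_col_data <- data.frame(
  pseudotime = c(0.0, 0.4, 1.0),
  row.names  = c("aE17.5_2_16", "aE17.5_2_09", "aP0_2_12")
)

# Every cell in the metadata should exist as a column of the count table.
stopifnot(all(rownames(toy_col_data) %in% names(toy_counts)))

# Select the metadata's cells and reorder the columns to follow its row order.
toy_alpha <- toy_counts[, rownames(toy_col_data)]

# Keep genes whose mean count across the selected cells exceeds 0.1.
toy_alpha <- toy_alpha[rowMeans(toy_alpha) > 0.1, ]

# Column order now matches the metadata row order.
identical(colnames(toy_alpha), rownames(toy_col_data))
dim(toy_alpha)
```

Indexing by `rownames(toy_col_data)` both drops the non-$\alpha$ columns and aligns the column order with the sample metadata in a single step.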

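Normalize the data using DESeq2 size factors, then run a one-sample test (a likelihood ratio test of the `~pseudotime` model against the intercept-only model `~ 1`)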

In [4]:
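library("DESeq2")
# Build the dataset: columns of alpha_counts correspond to rows of alpha_col_data.
dds <- DESeqDataSetFromMatrix(countData = alpha_counts,
                              colData = alpha_col_data,
                              design = ~pseudotime)
# Estimate one size factor per cell and divide each cell's counts by it.
dds <- estimateSizeFactors(dds)
normalized_alpha_counts <- counts(dds, normalized = TRUE)
dim(normalized_alpha_counts)
write.csv(normalized_alpha_counts, file = "normalized_alpha_counts.csv")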

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min


Attaching package: ‘S4Vectors’

The followin
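
By default, `estimateSizeFactors` uses a median-of-ratios estimator. The sketch below reproduces that idea in base R on a made-up matrix so the arithmetic is visible; `toy` and its values are hypothetical, and this is an illustration of the technique rather than the package's own code.

```r
# Hypothetical toy matrix: 4 genes (rows) x 3 cells (columns).
toy <- matrix(
  c( 10,  20,  40,
    100, 200, 400,
      5,  10,  20,
      0,   3,   7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Log geometric mean of each gene across cells; -Inf if any count is zero.
log_geo_means <- rowMeans(log(toy))

# Size factor per cell: median ratio of its counts to the gene-wise
# geometric means, using only genes with a finite geometric mean.
size_factors <- apply(toy, 2, function(cell_counts) {
  usable <- is.finite(log_geo_means) & cell_counts > 0
  exp(median((log(cell_counts) - log_geo_means)[usable]))
})

# Divide each column by its size factor to obtain normalized counts.
normalized <- sweep(toy, 2, size_factors, "/")

size_factors
normalized
```

In this toy matrix, `gene4` has a zero count and so does not contribute to the size factors. With sparse single-cell counts, only genes detected in every cell enter this estimate, which is worth keeping in mind when interpreting the normalized values.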

In [5]:
# Likelihood ratio test: full model ~pseudotime against the reduced model ~ 1.
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
res <- results(dds)
dim(as.data.frame(res))
write.csv(as.data.frame(res), file = "alpha_DESeq2.csv")

using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
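
The structure of the likelihood ratio test can be illustrated for a single gene with base R. Note the swap: the sketch fits Poisson GLMs with `glm`, whereas DESeq2 fits negative binomial GLMs with estimated dispersions, so only the full-versus-reduced comparison carries over. The simulated counts and all variable names are hypothetical.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not from GSE87375)

# One simulated gene whose expected count rises with pseudotime.
pseudotime  <- seq(0, 1, length.out = 30)
gene_counts <- rpois(30, lambda = exp(1 + 2 * pseudotime))

# Full model: expression depends on pseudotime. Reduced model: constant mean.
full    <- glm(gene_counts ~ pseudotime, family = poisson())
reduced <- glm(gene_counts ~ 1,          family = poisson())

# LRT statistic: drop in deviance from the reduced to the full model, compared
# with a chi-squared distribution whose degrees of freedom equal the
# difference in the number of fitted parameters.
lrt_stat <- deviance(reduced) - deviance(full)
df_diff  <- df.residual(reduced) - df.residual(full)
p_value  <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)

c(statistic = lrt_stat, df = df_diff, p_value = p_value)
```

A small p-value indicates that adding pseudotime improves the fit beyond what a constant mean explains, which is the question the `reduced = ~ 1` argument poses for every gene.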


In [6]:
sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Catalina 10.15.3

Matrix products: default
BLAS/LAPACK: /Users/nuhabintayyash/opt/anaconda3/envs/tensonflow_2.1/lib/libopenblasp-r0.3.7.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.0
 [3] DelayedArray_0.12.0         BiocParallel_1.20.0        
 [5] matrixStats_0.55.0          Biobase_2.46.0             
 [7] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
 [9] IRanges_2.20.0              S4Vectors_0.24.0           
[11] BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] jsonlite_1.6           bit64_0.9-7            splines_3.6.2         
 [4] Formula_1.2-3          latticeExtra_0.6-29    blob_1.2.1            
 [7] Geno