### RNAseq 3.4: Importing gene counts and metadata into DESeq2 for expression analysis
This Jupyter notebook describes the **minimal** process for importing read count data into DESeq2. It provides a backbone workflow that can be extended by adding more samples, factors, etc.

#### Start by running `git pull` in the course directory, then copy the counts files to your preferred location.
#### Keep the gene counts files in a dedicated directory (no other files should be present)

```R
list.files("/mnt/c/Users/Jerry/Applied-Bioinformatics-HW/data/htseq_out/")
```

```R
dir_counts <- "/mnt/c/Users/Jerry/Applied-Bioinformatics-HW/data/htseq_out"
```

#### Then read the filenames into a character vector
```R
counts_files <- list.files(dir_counts)
counts_files
```
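
Because every file in the directory ends up in `counts_files`, a stray file would later be treated as a sample. As a hedged sketch, the `pattern` argument of `list.files()` (a regular expression) can restrict the listing to count files. The temporary directory, the `notes.md` file, and the assumption that count files end in `_counts.txt` are illustrative only.

```R
# Hypothetical demo directory standing in for dir_counts
demo_dir <- file.path(tempdir(), "htseq_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("GSM2580319_counts.txt",
                                            "GSM2580320_counts.txt",
                                            "notes.md"))))

# Keep only files whose names end in "_counts.txt"
demo_files <- list.files(demo_dir, pattern = "_counts\\.txt$")
demo_files
```

Here `notes.md` is left out of `demo_files`, so it cannot be mistaken for a sample later on.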

#### This step allows you to subset your samples if necessary (in this example we use all files)
```R
# Preview the selection, then overwrite counts_files with it
counts_files[1:4]
counts_files <- counts_files[1:4]
```
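
Positional indices depend on how the files happen to sort. A minimal sketch of selecting by accession instead, using `%in%`, is shown below. The vector is a stand-in for `counts_files`, and `GSM0000001_counts.txt`, `demo_counts_files`, and `wanted` are hypothetical names introduced for illustration.

```R
# Stand-in for counts_files, with one hypothetical extra file
demo_counts_files <- c("GSM0000001_counts.txt", "GSM2580319_counts.txt",
                       "GSM2580320_counts.txt", "GSM2580323_counts.txt",
                       "GSM2580324_counts.txt")

# Accessions to keep (hypothetical selection)
wanted <- c("GSM2580319", "GSM2580320", "GSM2580323", "GSM2580324")

# Strip the file suffix to get the accession, then keep the matching files
accession <- sub("_counts\\.txt$", "", demo_counts_files)
selected_files <- demo_counts_files[accession %in% wanted]
selected_files
```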

#### Create data frame for metadata
Follow the specification from DESeq2's help pages:


`?DESeqDataSetFromHTSeqCount`

> **sampleTable**: for htseq-count: a data.frame with three or more columns. Each row describes one sample. The first column is the sample name, the second column the file name of the count file generated by htseq-count, and the remaining columns are sample metadata which will be stored in colData.

```R
# Build the sample table: sample name, count file name, then metadata columns
samplesInfo <- data.frame(samplename = counts_files,
                          filename   = counts_files,
                          group      = c("mock", "ZIKV", "mock", "ZIKV"),
                          stringsAsFactors = FALSE)
```
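
The table above reuses the file name as the sample name. The sketch below, under the assumption that file names end in `_counts.txt`, derives shorter sample names with `sub()` and stores `group` as a factor with `mock` listed first, so the reference level does not depend on how the locale sorts the labels. `samplesInfo_demo` and `demo_files` are illustrative names, not part of the notebook.

```R
demo_files <- c("GSM2580319_counts.txt", "GSM2580320_counts.txt",
                "GSM2580323_counts.txt", "GSM2580324_counts.txt")

samplesInfo_demo <- data.frame(
  samplename = sub("_counts\\.txt$", "", demo_files),  # e.g. "GSM2580319"
  filename   = demo_files,
  group      = factor(c("mock", "ZIKV", "mock", "ZIKV"),
                      levels = c("mock", "ZIKV")),      # mock = reference level
  stringsAsFactors = FALSE
)

# Basic sanity checks before handing the table to DESeq2
stopifnot(!anyDuplicated(samplesInfo_demo$samplename),
          levels(samplesInfo_demo$group)[1] == "mock")
samplesInfo_demo
```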

#### Check the sample table (double check this info against the SRA records)

```R
samplesInfo
```

| samplename | filename | group |
|---|---|---|
| GSM2580319_counts.txt | GSM2580319_counts.txt | mock |
| GSM2580320_counts.txt | GSM2580320_counts.txt | ZIKV |
| GSM2580323_counts.txt | GSM2580323_counts.txt | mock |
| GSM2580324_counts.txt | GSM2580324_counts.txt | ZIKV |


#### Load the DESeq2 package

```R
library("DESeq2")
```

#### Import the read counts and sample table, and specify the design factor (there is only one factor in this example: group)

```R
dds1 <- DESeqDataSetFromHTSeqCount(sampleTable = samplesInfo,
                                   directory = dir_counts,
                                   design = ~ group)
```
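
For orientation, an htseq-count output file is a two-column, tab-separated table of feature IDs and counts, followed by summary rows whose names start with `__` (for example `__no_feature`). The sketch below writes a tiny made-up file and reads it with base R to show the layout. The gene IDs and counts are invented, and this is only an illustration of the file format, not a description of how DESeq2 performs the import internally.

```R
# Write a tiny, invented htseq-count style file to a temporary location
demo_file <- tempfile(fileext = ".txt")
writeLines(c("geneA\t10", "geneB\t0", "geneC\t25",
             "__no_feature\t100", "__ambiguous\t5"), demo_file)

# Read it back as a two-column table (htseq-count files have no header)
demo_counts <- read.table(demo_file, sep = "\t", header = FALSE,
                          col.names = c("feature", "count"),
                          stringsAsFactors = FALSE)

# Separate real features from the "__" summary rows
is_summary <- startsWith(demo_counts$feature, "__")
gene_counts <- demo_counts[!is_summary, ]
gene_counts
```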

#### Next we perform the main DESeq2 analysis step which includes statistical model fitting

```R
dds1_deseq <- DESeq(dds1)
```

```
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
```


```R
class(dds1_deseq)
str(dds1_deseq)
```

```
Formal class 'DESeqDataSet' [package "DESeq2"] with 8 slots
  ..@ design            :Class 'formula'  language ~group
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  ..@ dispersionFunction:function (q)  
  .. ..- attr(*, "coefficients")= Named num [1:2] 0.00454 5.57153
  .. .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
  .. ..- attr(*, "fitType")= chr "parametric"
  .. ..- attr(*, "varLogDispEsts")= num 1.08
  .. ..- attr(*, "dispPriorVar")= num 0.496
  ..@ rowRanges         :Formal class 'CompressedGRangesList' [package "GenomicRanges"] with 5 slots
  .. .. ..@ unlistData     :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. .. ..@ values         : Factor w/ 0 levels: 
  .. .. .. .. .. .. ..@ lengths        : int(0) 
  .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. ..@ ranges
```

```R
dim(dds1_deseq)
```

The output above shows that `dds1_deseq` is a DESeqDataSet object with a complicated structure, but you need no special knowledge of this object class to use DESeq2.
If you can reproduce the above output, you have successfully imported your RNA-seq read counts into DESeq2 and fitted the model.
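
A few accessor functions are usually enough to inspect a DESeqDataSet without reading its slots directly. The sketch below uses `makeExampleDESeqDataSet()`, which ships with DESeq2 and simulates counts, so it runs without the course data. `dds_demo` is an illustrative name, and its values are simulated rather than results from this notebook.

```R
library("DESeq2")

# Simulated dataset: 1000 genes, 4 samples, two conditions (A and B)
dds_demo <- makeExampleDESeqDataSet(n = 1000, m = 4)

dim(dds_demo)              # genes x samples
colData(dds_demo)          # per-sample metadata
head(counts(dds_demo), 3)  # raw counts for the first three genes

# Size factors are available after estimateSizeFactors()
# (DESeq() also estimates them as its first step)
dds_demo <- estimateSizeFactors(dds_demo)
sizeFactors(dds_demo)
```

With the objects from this notebook, the same calls would take `dds1_deseq` in place of `dds_demo`.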
#### From here, many analyses can be performed, including overall transcriptome comparisons, variance analysis, principal component analysis, and differential expression analysis. These will be explored in subsequent classes.