# Differential expression analysis


You have run the nf-core/rnaseq pipeline and checked the initial quality-control metrics of your FASTQ files. This was only the primary analysis, however, and we want to take it further.

Due to the computational demand of the pipeline, you ran it on only two of the 16 samples in the study yesterday. We provide an essential output of the nf-core/rnaseq pipeline in the `data` folder: it contains the combined expression matrix as produced by Salmon, which provides transcript levels for each gene (rows) and each sample (columns).


We would now like to understand exactly how expression differs between our groups of mice.
Which pipeline would you use for this?

The differentialabundance pipeline, because it is intended to analyse differences in abundance between data represented as matrices, which our expression matrices are.

Have a close look at the pipeline's "Usage" page on the [nf-core docs](nf-co.re). You will need to create a samplesheet (based on the column names in the provided matrix).

Please paste here the command you used. You may need to inspect the provided expression matrix more closely and create additional files, like a samplesheet (based on the column names) or a contrast file (there happens to also be one in `data/` that you can use).

In [None]:
!nextflow run nf-core/differentialabundance -profile rnaseq,singularity --input data/samplesheet.csv --contrasts data/contrasts.csv --matrix data/salmon.merged.gene_counts.csv --outdir day3_outputs

Explain all the parameters you set and why you set them in this way. If you used or created additional files as input, explain what they are used for.

For the profiles I chose `rnaseq` because we are analysing RNA sequencing data, and `singularity` as the container engine because it works best with Codespaces on my tablet.

--input data/samplesheet.csv contains the metadata matching the different samples to different conditions in the experiment.
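A samplesheet like this can be derived from the column names of the expression matrix. The following is a minimal sketch, not the method used here: the header values, the annotation column names (`gene_id`, `gene_name`), the output column names (`sample`, `condition`) and the rule that a condition is the sample name without its trailing replicate number are all assumptions for illustration and should be checked against the real matrix and the pipeline's "Usage" page.

```python
import csv
import io
import re

# Hypothetical header; in practice this is the first row of the matrix file.
header = ["gene_id", "gene_name", "Sham_oxy_1", "Sham_oxy_2", "SNI_Sal_1", "SNI_oxy_1"]


def build_samplesheet(columns, annotation_cols=("gene_id", "gene_name")):
    """Return one row per sample column, skipping gene annotation columns."""
    rows = []
    for name in columns:
        if name in annotation_cols:
            continue
        # Assumption: the condition is the sample name minus the replicate suffix.
        condition = re.sub(r"[_-]?\d+$", "", name)
        rows.append({"sample": name, "condition": condition})
    return rows


rows = build_samplesheet(header)

# Write the samplesheet as CSV text (to a string here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample", "condition"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
samplesheet_text = buffer.getvalue()
print(samplesheet_text)
```

Deriving the sample names from the matrix header, rather than typing them by hand, avoids mismatches between the samplesheet and the matrix columns.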

--contrasts data/contrasts.csv specifies which conditions should be compared with each other. Together with the samplesheet this parameter ensures the pipeline knows which samples' values should be contrasted.
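To illustrate how the contrasts file and the samplesheet work together, here is a small sketch that writes a contrasts table and checks that every reference and target level exists among the samplesheet conditions. The column layout (`id,variable,reference,target,blocking`) follows the pipeline's usage documentation as far as I know; the contrast ids, the condition labels and the choice of which group is the reference are assumptions, and the provided `data/contrasts.csv` may differ.

```python
import csv
import io

# Hypothetical condition labels, as they might appear in the samplesheet.
samplesheet_conditions = {"Sham_oxy", "SNI_oxy", "SNI_Sal"}

# Hypothetical contrasts; the reference/target direction is an assumption.
contrasts = [
    {"id": "SNI_oxy_vs_SNI_Sal", "variable": "condition",
     "reference": "SNI_Sal", "target": "SNI_oxy", "blocking": ""},
    {"id": "Sham_oxy_vs_SNI_oxy", "variable": "condition",
     "reference": "SNI_oxy", "target": "Sham_oxy", "blocking": ""},
]


def unknown_levels(contrast_rows, known_conditions):
    """List (contrast id, level) pairs whose level is missing from the samplesheet."""
    problems = []
    for row in contrast_rows:
        for role in ("reference", "target"):
            if row[role] not in known_conditions:
                problems.append((row["id"], row[role]))
    return problems


problems = unknown_levels(contrasts, samplesheet_conditions)

# Serialise the contrasts as CSV text.
buffer = io.StringIO()
fields = ["id", "variable", "reference", "target", "blocking"]
writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
writer.writeheader()
writer.writerows(contrasts)
contrasts_text = buffer.getvalue()
print(contrasts_text)
print("Unknown levels:", problems)
```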

--matrix data/salmon.merged.gene_counts.csv is the matrix of gene expression data, with one row per gene and one column per sample.

--outdir day3_outputs specifies the output directory into which the results of the pipeline should be placed.


What were the outputs of the pipeline?

A report in the form of an html file, which summarizes the results of the pipeline, as well as various supplemental files containing plots and tables that show the analysis results.

In [2]:
#!TODO

Would you exclude any samples? If yes, which and why?

According to the MAD score calculated in the report, none of the samples are outliers, so none should be excluded. However, the PCA plots and the clustering dendrogram suggest that the samples SNI-Sal 2 and 4 diverge from the other samples.
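The idea behind a MAD-based (median absolute deviation) outlier check can be sketched as follows. This is a generic illustration and not necessarily how the report computes its score: the per-sample values, the sample names and the threshold of 5 are hypothetical.

```python
from statistics import median


def mad_outliers(scores, threshold=5.0):
    """Flag samples whose score lies more than `threshold` MADs from the median."""
    values = list(scores.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # All samples (or most of them) are identical; nothing can be flagged.
        return []
    return [name for name, v in scores.items() if abs(v - center) / mad > threshold]


# Hypothetical per-sample summary values (e.g. a correlation with the other samples).
scores = {"s1": 0.98, "s2": 0.97, "s3": 0.99, "s4": 0.98, "s5": 0.60}
flagged = mad_outliers(scores)
print(flagged)
```

Because both the median and the MAD are robust statistics, a single extreme sample does not inflate the scale used to judge the others, which is why a sample can look separated on a PCA plot and still fall below the outlier threshold.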

How many genes were differentially expressed in each contrast? Does this confirm what the paper mentions?

In the SNI-oxy versus SNI-Sal contrast, seventeen genes had lower expression than in the reference, suggesting down-regulation, while one gene had higher expression.

In the Sham-oxy versus SNI-oxy contrast there were seven genes with increased expression and none with reduced expression.
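Counts like these can be recomputed from a differential expression results table. The sketch below assumes DESeq2-style column names (`log2FoldChange`, `padj`) and hypothetical cut-offs (adjusted p-value below 0.05, absolute log2 fold change of at least 1); the rows are made-up examples, and the pipeline's actual thresholds should be taken from its parameters.

```python
def count_de(rows, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return (up, down) counts of genes passing both cut-offs."""
    up = down = 0
    for row in rows:
        padj = row["padj"]
        lfc = row["log2FoldChange"]
        # Genes without an adjusted p-value (e.g. filtered out) are skipped.
        if padj is None or padj >= padj_cutoff:
            continue
        if lfc >= lfc_cutoff:
            up += 1
        elif lfc <= -lfc_cutoff:
            down += 1
    return up, down


# Hypothetical results rows for one contrast.
results = [
    {"gene_id": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene_id": "geneB", "log2FoldChange": -1.7, "padj": 0.02},
    {"gene_id": "geneC", "log2FoldChange": -3.0, "padj": 0.40},
    {"gene_id": "geneD", "log2FoldChange": 0.2, "padj": 0.01},
    {"gene_id": "geneE", "log2FoldChange": 1.5, "padj": None},
]
up, down = count_de(results)
print(f"up: {up}, down: {down}")
```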

The paper mentions differentially expressed genes in three brain regions: the NAc, mPFC and VTA. Briefly explain what these three regions are.

NAc stands for the Nucleus accumbens. It is located at the end of the mesolimbic or reward pathway and plays an important role in the processing of motivation and aversion, as well as addiction and deep sleep.

mPFC stands for the medial prefrontal cortex, which is involved in decision making, memory, and association.

The VTA is the ventral tegmental area, which is the origin of many dopaminergic cells and pathways. As such, it is an important component of the brain's reward and drug circuitry, since these circuits function through dopamine release.

Is there any way for us to know from the paper and its materials and methods which genes are included in these regions?

The paper contains a table with the pathways, their brain regions and differentially expressed genes on page 1236 of Nature/page 8 of the PDF, but only for the comparison between SNI-Sal and SNI-oxy mice. Additionally, the number of differentially expressed genes in the table is far lower than the number shown in the Venn diagram for the same comparison in figure 3.h, meaning that for the vast majority of genes analysed in the study we do not know which region they belong to.

Once you have your list of differentially expressed genes, do you think just communicating those to the biologists would be sufficient? What does the publication state?

Please reproduce the Venn diagram from Figure 3, not taking into account the brain regions but just the contrasts mentioned.
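The numbers that go into a two-set Venn diagram can be obtained with plain set operations. The following is only a sketch: the gene identifiers are placeholders, and the real sets would be the differentially expressed genes of each contrast taken from the pipeline's results tables.

```python
def venn_counts(set_a, set_b):
    """Return the sizes of the three regions of a two-set Venn diagram."""
    return {
        "only_a": len(set_a - set_b),
        "both": len(set_a & set_b),
        "only_b": len(set_b - set_a),
    }


# Placeholder gene sets, one per contrast.
de_contrast_1 = {"geneA", "geneB", "geneC", "geneD"}
de_contrast_2 = {"geneC", "geneD", "geneE"}

counts = venn_counts(de_contrast_1, de_contrast_2)
print(counts)
```

These three counts can then be passed to a plotting library of your choice to draw the diagram itself.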