RNA-Seq pipeline for TCGA and Toil Xena datasets

This pipeline produces differential expression and co-expression matrices for normal and cancer tissues from the TCGA dataset (downloaded from GDC) and from the Toil Xena dataset, which integrates both TCGA and GTEx data.

Folders in this repository

bin: Contains the GDC Data Transfer Tool command-line client for Linux
config: Contains the config.yaml file used by the Snakemake pipeline
input: Contains the GENCODE annotation files originally used for TCGA data (v22, according to this link), GENCODE v37 (April 2021) to filter genes, and BioMart Ensembl Genes 80 to add GC content for QC purposes
workflow: Contains all the rules and scripts for the Snakemake pipeline

The pipeline

These are the main steps in the pipeline:

Get data:
- Get data from Xena: It downloads counts, sample information and annotations directly from the Xena-Toil S3 bucket.
  - Rules file: xena.smk
- Get data from GDC: It queries GDC to create a manifest and uses the gdc-client tool to download the files.
  - Rules file: gdc.smk
  - Script: queryGDC.py
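
As a rough illustration of the GDC step, the sketch below builds a query payload for the GDC `files` endpoint and renders the returned hits as a manifest that gdc-client can read. The helper names (`build_gdc_filters`, `build_query_params`, `manifest_lines`), the choice of filter fields, and the `TCGA-BRCA` project ID used in the usage note are assumptions for illustration; queryGDC.py may build its query differently. No network request is made here.

```python
import json

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_filters(project_id, workflow_type="HTSeq - Counts"):
    """Build a GDC filter selecting gene expression files for one project.

    The field choices are assumptions, not taken from queryGDC.py.
    """
    def is_in(field, values):
        return {"op": "in", "content": {"field": field, "value": values}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", [project_id]),
            is_in("data_type", ["Gene Expression Quantification"]),
            is_in("analysis.workflow_type", [workflow_type]),
        ],
    }


def build_query_params(project_id, size=1000):
    """Return query parameters for a GET request to GDC_FILES_ENDPOINT."""
    return {
        "filters": json.dumps(build_gdc_filters(project_id)),
        "fields": "file_id,file_name,md5sum,file_size,state",
        "format": "JSON",
        "size": str(size),
    }


def manifest_lines(hits):
    """Render file hits as tab-separated manifest lines for gdc-client."""
    columns = ("file_id", "file_name", "md5sum", "file_size", "state")
    lines = ["id\tfilename\tmd5\tsize\tstate"]
    for hit in hits:
        lines.append("\t".join(str(hit[column]) for column in columns))
    return lines
```

The dictionary from `build_query_params("TCGA-BRCA")` could be sent as the query string of an HTTP GET to `GDC_FILES_ENDPOINT`, and the lines from `manifest_lines` written to a file that is then passed to `gdc-client download -m`.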
Get raw matrix: This step integrates, for each tissue, raw counts with their respective annotation to obtain a raw matrix.
- Rules: raw.smk
- Scripts: addAnnotations.R and getRawMatrix.R
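
The integration of raw counts with their annotation can be pictured as a join on the gene identifier. The sketch below is a plain-Python illustration only: the function names, the dictionary-based data layout, and the decision to match on Ensembl IDs with the version suffix removed are all assumptions, and addAnnotations.R and getRawMatrix.R may work differently.

```python
def strip_version(gene_id):
    """Drop the version suffix from an Ensembl gene ID."""
    return gene_id.split(".", 1)[0]


def build_raw_matrix(counts, annotation):
    """Join raw counts with gene annotation on the unversioned gene ID.

    counts: dict mapping gene ID -> list of raw counts (one per sample).
    annotation: dict mapping gene ID -> dict of annotation fields.
    Genes without an annotation entry are dropped (an assumption).
    """
    annot = {strip_version(gene): fields for gene, fields in annotation.items()}
    rows = []
    for gene_id, values in counts.items():
        key = strip_version(gene_id)
        if key in annot:
            rows.append({"gene_id": key, **annot[key], "counts": list(values)})
    return rows
```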
QC: It generates NOISeq plots, filters out genes with low expression (mean raw counts < 10), generates PCA and density plots (expression per sample), and removes samples whose mean expression is more than 2 standard deviations above or below the overall mean.
- Rules: qc.smk
- Scripts: NOISeqPlots.R, filterLowExpression.R, PCA.R, densityPlot.R and filterOutliers.R
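
The two filtering rules in the QC step can be sketched in a few lines of Python. This is only an illustration of the thresholds described above, not a port of filterLowExpression.R or filterOutliers.R: the function names and data layout are hypothetical, and it assumes the standard deviation is computed over the per-sample mean expression values.

```python
from statistics import mean, stdev


def filter_low_expression(counts, min_mean=10):
    """Keep genes whose mean raw count across samples is at least min_mean."""
    return {gene: values for gene, values in counts.items()
            if mean(values) >= min_mean}


def filter_outlier_samples(counts, samples, n_sd=2):
    """Drop samples whose mean expression is more than n_sd SDs from the mean.

    counts: dict mapping gene -> list of values ordered like `samples`.
    Assumption: the SD is taken over the per-sample means.
    """
    # Transpose gene rows into sample columns and average each column.
    sample_means = [mean(column) for column in zip(*counts.values())]
    center = mean(sample_means)
    spread = stdev(sample_means)
    keep = [i for i, m in enumerate(sample_means)
            if abs(m - center) <= n_sd * spread]
    kept_samples = [samples[i] for i in keep]
    kept_counts = {gene: [values[i] for i in keep]
                   for gene, values in counts.items()}
    return kept_counts, kept_samples
```

Note that with very few samples no sample can lie more than 2 standard deviations from the mean, so a rule of this kind only has an effect on reasonably sized cohorts.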
Normalization: In the first run, it performs a test with different normalization methods and integrates the results. When normalization steps are defined for each tissue, it gets normalized data and runs arsyn for batch correction.
- Rules: normalization.smk
- Scripts: normalizationPlots.R, normalizationTest.R, usrNormalization.R, and runArsyn.R
After this step, we get a final expression matrix for normal and cancer tissues, having the same gene set.
Get final output:
- Co-expression computation: It generates PCA and density plots after normalization and runs the ARACNE algorithm, using a Singularity image, to get a co-expression matrix for each tissue and condition. It can also compute a Pearson correlation matrix, if those output files are required.
  - Rules: correlation.smk
  - Scripts: aracne_matrix.py, getPearsonMatrix.R
- Differential expression: This step, like the co-expression computation, can be run after normalization. It computes differential gene expression and generates the associated plots.
  - Rules: deg.smk
  - Scripts: deg.R
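
For the optional Pearson correlation matrix mentioned in the co-expression step, a minimal plain-Python sketch is shown below. The function names and the dictionary layout are hypothetical, and getPearsonMatrix.R may differ in its details. ARACNE itself is based on mutual information and is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant vector.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def pearson_matrix(expression):
    """Symmetric gene-by-gene correlation matrix as a nested dict.

    expression: dict mapping gene -> list of normalized values per sample.
    """
    genes = list(expression)
    matrix = {gene: {} for gene in genes}
    for i, gene_a in enumerate(genes):
        # Only compute the upper triangle and mirror it.
        for gene_b in genes[i:]:
            r = pearson(expression[gene_a], expression[gene_b])
            matrix[gene_a][gene_b] = r
            matrix[gene_b][gene_a] = r
    return matrix
```

A nested loop like this scales quadratically with the number of genes, so it is only meant to show the computation; a vectorized routine would be preferable for full expression matrices.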

How to run it

Unfortunately, the HTSeq raw counts used in this pipeline were removed by GDC in March 2022 (Data Release 32). This means that the GDC query used here (in the Get data from GDC step) will not return any results. However, all the data generated by this pipeline, including raw counts, can be found here:
