ddiannae/tcga-xena-pipeline

RNA-Seq pipeline for TCGA and Toil Xena datasets

This pipeline produces differential expression and co-expression matrices for normal and cancer tissues from two sources: the TCGA dataset, downloaded from GDC, and the Toil Xena dataset, which integrates both TCGA and GTEx data.

Folders in this repository

  • bin: It contains the GDC Data Transfer Tool command line for Linux
  • config: It contains the config.yaml file used for the Snakemake pipeline
  • input: It contains the GENCODE annotation files originally used for TCGA data (v22, according to this link). It also contains GENCODE v37 (April 2021), used to filter genes, and BioMart Ensembl Genes 80, used to add GC content for QC purposes.
  • workflow: It has all the rules and scripts for the Snakemake pipeline

The pipeline

These are the main steps in the pipeline:

  1. Get data:
    • Get data from Xena: It downloads counts, sample information and annotations directly from the Xena-Toil S3 bucket.
      • Rules file: xena.smk
    • Get data from GDC: It queries GDC to create a manifest and uses the gdc-client tool to download the files.
      • Rules file: gdc.smk
      • Script: queryGDC.py
  2. Get raw matrix: This step integrates, for each tissue, raw counts with their respective annotation to obtain a raw matrix.
    • Rules: raw.smk
    • Scripts: addAnnotations.R and getRawMatrix.R
  3. QC: It generates NOISeq plots, filters out genes with low expression (mean raw counts < 10), generates PCA and density plots (expression per sample) and removes samples whose mean expression is more than 2 standard deviations above or below the overall mean.
    • Rules: qc.smk
    • Scripts: NOISeqPlots.R, filterLowExpression.R, PCA.R, densityPlot.R and filterOutliers.R
  4. Normalization: In the first run, it tests different normalization methods and integrates the results. Once the normalization steps are defined for each tissue, it normalizes the data and runs ARSyN for batch correction.
    • Rules: normalization.smk
    • Scripts: normalizationPlots.R, normalizationTest.R, usrNormalization.R and runArsyn.R
    After this step, we get a final expression matrix for normal and cancer tissues that share the same gene set.
  5. Get final output:
    • Co-expression computation: It generates PCA and density plots after normalization and runs the ARACNE algorithm, using a Singularity image, to get a co-expression matrix for each tissue and condition. It can also compute a Pearson correlation matrix if those output files are requested.
      • Rules: correlation.smk
      • Scripts: aracne_matrix.py, getPearsonMatrix.R
    • Differential expression: Like the co-expression computation, this step can be run after normalization. It computes differential gene expression and its associated plots.
      • Rules: deg.smk
      • Scripts: deg.R
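
The two filters described in the QC step can be sketched in plain Python. This is only an illustration of the thresholds stated above, not the code in filterLowExpression.R or filterOutliers.R. The function names, the dictionary layout (gene to per-sample counts) and the use of the sample standard deviation of the per-sample means are all assumptions.

```python
from statistics import mean, stdev


def filter_low_expression(counts, min_mean=10):
    """Keep genes whose mean raw count across samples is at least min_mean.

    counts maps a gene id to a list of raw counts, one per sample
    (hypothetical layout chosen for this sketch).
    """
    return {gene: values for gene, values in counts.items()
            if mean(values) >= min_mean}


def filter_outlier_samples(counts, samples, n_sd=2):
    """Drop samples whose mean expression is more than n_sd standard
    deviations away from the mean of all per-sample means.

    Assumption: the spread is the sample standard deviation (n - 1)
    of the per-sample means.
    """
    sample_means = [mean(values[i] for values in counts.values())
                    for i in range(len(samples))]
    center = mean(sample_means)
    spread = stdev(sample_means)
    keep = [i for i, m in enumerate(sample_means)
            if abs(m - center) <= n_sd * spread]
    kept_samples = [samples[i] for i in keep]
    kept_counts = {gene: [values[i] for i in keep]
                   for gene, values in counts.items()}
    return kept_counts, kept_samples
```

Calling `filter_low_expression` first and passing its result to `filter_outlier_samples` mirrors the order in which the QC step lists the two filters.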

How to run it

Unfortunately, GDC removed the HTSeq raw counts used in this pipeline in March 2022 (Data Release 32). This means that the GDC query used here (in the Get data from GDC step) will no longer return any results. However, all the data generated by this pipeline, including raw counts, can be found here:

About

Project: Evaluation of trans regulation in cancer: a systems biology approach
