GES_2020

Code (data analysis and model simulations) for the Genetic (G) / Epigenetic (E) / Stochastic (S) paper (2020).

The DrugResponse folder includes drug-response count data (csv), annotated by files in the platemap folder. Two scripts are included: functionsDRC.R, which contains functions from the previously published diprate R package (Harris et al., Nat Meth (2016)), and DrugResponse.R, which reproduces the drug-response figures in the paper.

The clonal Fractional Proliferation (cFP) folder includes single-cell data for each sample (one csv = one 384-well plate, each well with single-colony growth rates). The cFP.R analysis script plots the figures in the text associated with drug-induced proliferation (DIP) rate distributions. PopD_trajectories.RData includes data from the DrugResponse folder for a comparison plot. The various trajectories*, distributions*, and bootstrapped (Anderson-Darling test) pvalues* files hold simulated data (see Simulations folder below), which are directly compared to the experimental cFP data.
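As a rough illustration of what a per-colony growth rate means here, the sketch below estimates a DIP rate as the least-squares slope of log2(cell count) versus time. This is a minimal Python sketch under that assumption, not the code in cFP.R or the diprate package; the function name `dip_rate` and the example counts are hypothetical.

```python
import math


def dip_rate(times_h, cell_counts):
    """Least-squares slope of log2(cell count) versus time, in doublings per hour."""
    if len(times_h) != len(cell_counts) or len(times_h) < 2:
        raise ValueError("need at least two matched time points")
    logs = [math.log2(c) for c in cell_counts]
    mean_t = sum(times_h) / len(times_h)
    mean_y = sum(logs) / len(logs)
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("time points must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    return sxy / sxx


# Hypothetical colony (assumption for illustration): doubles every 48 h.
times = [0, 24, 48, 72, 96]
counts = [50 * 2 ** (t / 48) for t in times]
rate = dip_rate(times, counts)

# Hypothetical regressing colony: halves every 48 h, so the slope is negative.
shrinking = [50 * 2 ** (-t / 48) for t in times]
negative_rate = dip_rate(times, shrinking)
```

A positive rate indicates net expansion under drug and a negative rate indicates net regression; a distribution of such rates across wells is the kind of data the cFP plots summarize.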

Joint_functions includes plotting functions shared between the cFP and DrugResponse folders.

The Simulations folder includes the monoclonal (most sublines) and polyclonal (DS8) growth model (M/PGM) and associated parameter scans (.py files), submitted as slurm scripts. Parameter scan data are provided as pickled dataframes (.pkl files). Two scripts (PKLtoCSV*.py) convert the dataframes to csv files (DS*_expansionTest_tile_lowVal.csv), which are plotted by the plotParameterScan.R script. Example model simulation data are plotted in the cFP folder (see above).
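To give a sense of what a stochastic growth simulation of a single clonal population can look like, here is a minimal Python sketch of a birth-death process simulated with the Gillespie algorithm. It is an illustrative stand-in, not the M/PGM code in this folder: the function name, the rate parameters `k_div` and `k_death`, and the example values are all assumptions.

```python
import random


def simulate_birth_death(n0, k_div, k_death, t_end, seed=None):
    """Gillespie simulation of a birth-death process.

    Each cell divides at rate k_div and dies at rate k_death (per hour).
    Returns the event times and the cell count after each event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while n > 0:
        total_rate = (k_div + k_death) * n
        if total_rate <= 0:
            break  # no events can occur
        # Waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total_rate)
        if t > t_end:
            break
        # Choose division or death in proportion to their rates.
        if rng.random() < k_div / (k_div + k_death):
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


# Hypothetical parameters (assumptions): 10 cells, net growth rate 0.02 per hour.
times, counts = simulate_birth_death(n0=10, k_div=0.03, k_death=0.01, t_end=100, seed=1)
```

Repeating such a run many times with different seeds yields a distribution of trajectories, and fitting a growth rate to each one gives a simulated rate distribution that can be compared with experimental data.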

The whole exome sequencing (WES) folder includes bash scripts (txt files) for the conversion of fastq files (see SRA accession number, below) to variant call format (vcf, SRA) and variant effect predictor (vep, supplementary information files) annotated datasets. Also included are summarized metrics from the data (csv files), an .xlm file to identify the canonical ex19del mutation in all cell populations, and the reference genome (rda file). Code for analysis and figure generation is provided (WES.R). The output of this analysis (input to the GO analysis, below) is saved as variants_byCohort.RData (referenced in the scRNAseq folder as well). The WES folder also contains a CNV folder for the reviewer-suggested detection of exome-specific copy number variants, which used the CoNIFER python tool. Input and output files are provided in the subfolder.

The single-cell RNA sequencing (scRNAseq) folder includes three data folders, two scripts, and some resource files. The read_count folder (generated by 10X Genomics CellRanger) includes the gene expression library for all 8 cell populations, multiplexed and sequenced together. The HTO_count bash script (provided as a .txt file) generates the umi_count folder. The umi_count folder includes the hashtag oligonucleotide (HTO) library, created by the CITE-seq Count function on the 'hashed' data. The HTO_identification folder provides additional information on the HTO library. scRNAseq.R creates the figures from the paper, including a functional interpretation analysis in VISION (with a corresponding VISION_gmt subfolder that has several hallmark gene signatures). A csv file for the entire de-multiplexed dataset can be found in the supplementary information files. The differentially expressed genes output can be found in DEGs_byCohort_hg38.RData, which is an input to the GO folder. An inferCNV folder contains the input files and analysis script for figure generation.

The gene ontology (GO) folder has a script (semanticSimilarity.R) that performs a graph-structure-based GO semantic similarity analysis on the mutations and DEGs that define each cell population, in order to find a genetic-to-epigenetic connection between the datasets. An RData file included in the folder (mutations_DEGs-hg38.RData) has all differentially expressed genes and IMPACT mutations for each population tested (created in scRNAseq/scRNAseq.R). A script to create correlation plots of the shared GO terms between modalities is also included (GO_correlation.R).
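The idea behind a graph-structure-based similarity can be sketched as follows: two terms are scored by how much of their ancestry in the ontology graph they share. The Python example below uses a Jaccard index over ancestor sets, which is a deliberately simple stand-in and not necessarily the measure used in semanticSimilarity.R. The toy ontology and its term names are hypothetical, not real GO terms.

```python
def ancestors(term, parents):
    """Return the term plus every term reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def term_similarity(a, b, parents):
    """Jaccard index of the two terms' ancestor sets (1.0 means identical ancestry)."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Hypothetical toy ontology (child -> list of parents); not real GO terms.
toy_parents = {
    "root": [],
    "signaling": ["root"],
    "metabolism": ["root"],
    "kinase_signaling": ["signaling"],
    "receptor_signaling": ["signaling"],
}

siblings = term_similarity("kinase_signaling", "receptor_signaling", toy_parents)
distant = term_similarity("kinase_signaling", "metabolism", toy_parents)
```

Terms under the same parent score higher than terms that only share the root, which is the property that lets a mutated gene and a differentially expressed gene be linked through related annotations even when they share no exact term.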

The RNA sequencing (RNAseq) folder includes a bash script (RNAseq_processing.txt) that generates a count matrix from the fastq files and an R script that runs a simple clustering and visualization of the bulk RNAseq data for all cell populations.

Sequencing data can be accessed in the following databases: SRA (PRJNA631050, PRJNA632351) and GEO (GSE150084).