# Constructing a benchmark dataset

Here, I will be making notes about the construction of my benchmark dataset. This will also list the datasets that I choose to include.

# Guidelines

- Variety of tissue types: brain, spinal cord, muscle, bone, lung, trachea, liver, intestine, pancreas, thyroid, ovary, testis, blood, spleen
- Variety of dataset sizes: e.g. <10, 10-50, >50 samples
- Datasets that include multiple sub-series
- Variety of experimental conditions: time-series data, treatment data, genetic variants. Also consider: case-control, cross-sectional (multiple samples from a single timepoint), cohort studies
- Variety of disease types: e.g. cancers (breast cancer, lung cancer, leukemia), Alzheimer's disease, Parkinson's disease, rheumatoid arthritis, multiple sclerosis, diabetes, obesity, HIV, COVID-19, rare diseases
- Types of data (e.g. single-end vs paired-end reads)
- Common vs rare diseases
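The size guideline above can be made explicit with a small helper. This is only a sketch: the function name `size_bin` and the label strings are my own, and I am assuming the boundaries are inclusive for the 10-50 bin (so exactly 10 and exactly 50 samples both count as medium). The sample counts are the ones recorded for candidates further down in these notes.

```python
def size_bin(n_samples: int) -> str:
    """Map a sample count to one of the three size categories in the guidelines."""
    if n_samples < 1:
        raise ValueError("a dataset needs at least one sample")
    if n_samples < 10:
        return "small (<10)"
    if n_samples <= 50:  # assumption: 10 and 50 are both treated as medium
        return "medium (10-50)"
    return "large (>50)"


# Sample counts as recorded for some of the candidate datasets in these notes
example_counts = {
    "GSE64712": 7,
    "GSE127813": 12,
    "GSE235236": 56,
    "GSE122380": 297,
}

for accession, n in example_counts.items():
    print(f"{accession}: {n} samples -> {size_bin(n)}")
```

Keeping the thresholds in one function means they can be changed later without re-labelling each candidate by hand.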

ChatGPT suggested focusing on datasets with complete data and standardised formats; however, I believe there is merit in investigating how datasets with missing data are handled.

(raw ChatGPT output...)

1. Dataset Selection Criteria

a. Source and Accessibility

	•	Platform: Select datasets exclusively from NCBI Gene Expression Omnibus (GEO) to maintain consistency in data formats and metadata standards.
	•	Accessibility: Ensure that all chosen datasets are publicly available without access restrictions to allow unrestricted use and reproducibility.

b. Tissue Types

	•	Variety: Include datasets spanning a wide range of tissue types to capture diverse biological contexts. Aim for representation from major organ systems such as:
	•	Central Nervous System: Brain, spinal cord.
	•	Musculoskeletal: Muscle, bone.
	•	Respiratory: Lung, trachea.
	•	Digestive: Liver, intestine.
	•	Endocrine: Pancreas, thyroid.
	•	Reproductive: Ovary, testis.
	•	Immune System: Blood, spleen.

c. Dataset Sizes

	•	Sample Diversity: Incorporate datasets with varying number of samples to test LLM performance across different scales:
	•	Small-Scale Studies: <10 samples.
	•	Medium-Scale Studies: 10-50 samples.
	•	Large-Scale Studies: >50 samples.

d. Sub-Series and Experimental Conditions

	•	Multiple Sub-Series: Select datasets that include multiple sub-series or distinct experimental conditions within a single GEO accession. This facilitates evaluation of LLMs in handling complex experimental designs.
	•	Experimental Variations: Ensure that some datasets involve variations such as:
	•	Time-Series Data: Longitudinal studies tracking changes over time.
	•	Treatment Conditions: Different drug treatments or interventions.
	•	Genetic Variants: Comparisons between wild-type and mutant strains.

e. Disease Types

	•	Diversity of Diseases: Include a broad spectrum of diseases to assess LLMs’ ability to handle varied biological and pathological contexts:
	•	Cancer Types: Breast cancer, lung cancer, leukemia, etc.
	•	Neurological Disorders: Alzheimer’s disease, Parkinson’s disease.
	•	Autoimmune Diseases: Rheumatoid arthritis, multiple sclerosis.
	•	Metabolic Disorders: Diabetes, obesity.
	•	Infectious Diseases: HIV, COVID-19.
	•	Rare Diseases: Select a few rare conditions to test LLM adaptability.
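Since every candidate is meant to come from GEO, it may help to normalise accession strings and link each one to its public record. The sketch below is an assumption-laden convenience, not part of any GEO tooling: `geo_series_url` is a hypothetical helper name, it only accepts series accessions of the form `GSE` followed by digits, and it builds the standard GEO accession display URL without making any network request.

```python
import re

# Series accessions look like "GSE213001"; sample (GSM) or platform (GPL) IDs are rejected here
GEO_SERIES_PATTERN = re.compile(r"GSE\d+")
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"


def geo_series_url(accession: str) -> str:
    """Return the GEO landing-page URL for a series accession (no network access)."""
    acc = accession.strip().upper()
    if not GEO_SERIES_PATTERN.fullmatch(acc):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return GEO_ACC_URL.format(acc=acc)


print(geo_series_url("GSE213001"))
```

Whether a record is actually public, or has raw data attached, still has to be checked on the page itself.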

# Candidate datasets

## GSE213001 - Bulk tissue RNA sequencing from human donors with and without lung fibrosis (2022)
- 139 samples
- Idiopathic pulmonary fibrosis (IPF) is a lethal fibrotic lung condition with an unpredictable disease course. The pathology of IPF starts in the lung bases such that a patient’s lung apices are comparatively less fibrotic at the time of transplantation. We hypothesized that RNA sequencing of the lung apices and bases may identify differentially expressed genes that better reflect disease progression in IPF. Samples were derived from both right and left lungs when available and RNA Seq was performed on a total of 139 samples.
- High-quality mRNA was collected from resected lung tissue from 20 patients with IPF, 14 non-diseased control donors whose lungs were deemed unsuitable for transplantation, and 9 patients with end-stage interstitial lung disease other than IPF (non-IPF ILD).

## GSE127813 - High-throughput tissue dissection and cell purification with digital cytometry [healthy adults] (2019)
- 12 samples
- Tissue composition is a major determinant of phenotypic variation and a key factor influencing disease outcomes. Although scRNA-Seq has emerged as a powerful technique for characterizing cellular heterogeneity, it is currently impractical for large sample cohorts and cannot be applied to fixed specimens collected as part of routine clinical care. To overcome these challenges, we extended Cell type Identification By Estimating Relative Subsets Of RNA Transcripts (CIBERSORT) into a new platform for in silico cytometry. Our approach enables the simultaneous inference of cell type abundance and cell type-specific gene expression profiles (GEPs) from bulk tissue transcriptomes. The utility of this integrated framework, called CIBERSORTx, is demonstrated in multiple tumor types, including melanoma, where single cell reference profiles are used to dissect primary clinical specimens, revealing cell type-specific signatures of driver mutations and immunotherapy response. We anticipate that digital cytometry will augment single cell profiling efforts, enabling cost-effective, high throughput tissue characterization without the need for antibodies, disaggregation, or viable cells.
- Whole blood samples were collected from 12 healthy adult subjects and immediately processed to enumerate leukocyte composition by FACS using an FDA-approved in vitro diagnostic test (IVD Multitest 6-color TBNK, Becton Dickinson) and automated hematology analyzer for blood leukocyte differential counts (Sysmex XE-2100) in a CLIA hematology lab setting (Stanford Clinical Laboratories). Aliquots from the same whole blood samples were stored in PAXgene blood RNA tubes (Qiagen) for subsequent RNA extraction and RNA-Seq library preparation.
- NO RAW DATA AVAILABLE

## GSE235236 - Bulk RNAseq expression data of colon biopsies from intestinal mucosa of non-IBD controls and patients with Crohn's disease or ulcerative colitis (2023)
- 56 samples
- Ulcerative colitis and Crohn’s disease are chronic inflammatory intestinal diseases with perplexing heterogeneity in manifestations and response to treatment. While the molecular basis for this heterogeneity remains uncharacterized, single-cell technologies allow us to explore the transcriptional states within tissues at an unprecedented resolution which could further understanding of these complex diseases. Here, we apply single-cell RNA-sequencing to human inflamed intestine and show that the largest differences among patients are present within the myeloid compartment including macrophages and neutrophils. Using spatial transcriptomics in human tissue at single-cell resolution (CosMx Spatial Molecular Imaging) we spatially localized each of the macrophage and neutrophil subsets identified by single-cell RNA-sequencing and unravel further macrophage diversity based on their tissue localization. Finally, single-cell RNA-sequencing combined with single-cell spatial analysis reveals a strong communication network involving macrophages and inflammatory fibroblasts. Our data sheds light on the cellular complexity of these diseases and points towards the myeloid and stromal compartments as important cellular subsets for understanding patient-to-patient heterogeneity.
- Barcoded RNAseq libraries were prepared from total RNA using a TruSeq stranded mRNA kit (Illumina, San Diego, CA, USA) according to the manufacturer’s instructions. Libraries were subjected to single-end sequencing (100 bp) on a HiSeq 3000 platform (Illumina, CA, USA) at the Translational Medicine and Genomics group (Boehringer-Ingelheim GmbH & Co, Biberach, Germany). Quality filtering and adapter trimming were performed using Skewer version 2.2.8. Reads were mapped against the human reference genome using the STAR aligner version 2.5.2a. The genome used was GRCh38.p10, and gene annotation was based on Gencode version 27 (EMBL-EBI, Hinxton, UK). Read counts per gene were obtained using RSEM version 1.2.31 and the Ensembl GTF annotation file (EMBL-EBI, Hinxton, UK).

## GSE226663 - Common Mitochondrial Deletions in RNA-Seq: Evaluation of Bulk, Single-Cell and Spatial Transcriptomic Datasets
- 10 samples (2 bulk RNA-seq, 8 spatial transcriptomics)
- To investigate whether large mitochondrial DNA (mtDNA) deletions can be quantified and detected using the high-throughput bioinformatics tool Splice-Break (https://github.com/brookehjelm/Splice-Break2). This dataset contains 8 human middle temporal gyrus (MTG) samples (from 4 patients) that underwent spatial transcriptomic RNA-Sequencing following 10x Genomics Visium protocols as well as 2 samples (from spatial transcriptomics patient 144107/144108) that underwent bulk RNA-Sequencing (one with a ribosomal depletion library preparation and one without).
- 10x Visium spatial transcriptomics profiles of middle temporal gyrus tissue taken from 4 patient samples (two tissue sections of each), consisting of 2 controls, 1 Alzheimer's Disease (AD), and 1 Parkinson's Disease (PD). Bulk RNA-Sequencing of one patient with no neurodegenerative diagnosis was included for methods comparison.

## GSE64712 - Functional characterization of human T cell hyporesponsiveness induced by CTLA4-Ig (2015)
- 7 samples
- time series
- During activation, T cells integrate multiple signals from APCs and cytokine milieu. The blockade of these signals can have clinical benefits as exemplified by CTLA4-Ig, which blocks interaction of B7 co-stimulatory molecules on APCs with CD28 on T cells. Variants of CTLA4-Ig, abatacept and belatacept are FDA approved as immunosuppressive agents in arthritis and transplantation whereas murine studies suggested that CTLA4-Ig can be beneficial in a number of other diseases. However, detailed analysis of human CD4 cell hyporesponsiveness induced by CTLA4-Ig has not been performed. Herein, we established a model to study effect of CTLA4-Ig on the activation of human naïve T cells in a human mixed lymphocytes system. Comparison of human CD4 cells activated in the presence or absence of CTLA4-Ig, showed that co-stimulation blockade during TCR activation does not affect NFAT signaling but results in decreased activation of NF-kB and AP-1 transcription factors followed by profound decrease in proliferation and cytokine production. The resulting T cells become hyporesponsive to secondary activation and, although capable of receiving TCR signals, fail to proliferate or produce cytokines, demonstrating properties of anergic cells. However, unlike some models of T cell anergy, these cells did not possess increased levels of TCR signaling inhibitor CBLB. Rather, the CTLA4-Ig induced hyporesponsiveness was associated with an elevated level of p27kip1 cyclin-dependent kinase inhibitor.
- Time series. Human resting and activated T cell dUTP mRNA-Seq profiles were generated on Illumina HiSeq2500

## GSE122380 - Dynamic genetic regulation of gene expression during cellular differentiation (2018)
- 297 (!!!) samples
- Time series
- The impact of genetic regulatory elements on gene expression can change during differentiation and across cell types and environments. We mapped expression quantitative trait loci (eQTLs) throughout differentiation to elucidate the dynamics of genetic effects and the associated cell type specific mechanisms. To do so, we generated high resolution time-series RNA-sequencing data, capturing differentiation from induced pluripotent stem cells to cardiomyocytes with 16 time points in 19 human cell lines. Genetic effects on gene regulation in these data show clear temporal structure. We identified hundreds of dynamic eQTLs that change significantly over time, with enrichment in enhancers of relevant cell types. We also found nonlinear dynamic eQTLs, which have effects only during intermediate stages of differentiation. These include a variant associated with body mass index, highlighting that transient genetic effects can contribute to disease.
- We obtained RNA-seq data for 19 Yoruba individuals at 16 time points each during the differentiation from iPSC to cardiomyocyte. There are a total of 297 samples.

## GSE133702 - Next generation sequencing for temporal analysis at high altitude in Indian and Kyrgyz healthy males
- 16 samples
- Time series
- Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The aim of the study is to analyse the time series analysis at high altitude to dynamically capture the complete drift impringes in humans and specific selection of decisive time point for therapeutic interventions
- High Altitude time point analysis to find out the physiological and molecular mechanism of acclimatization in healthy males

## GSE236761 - Translatome analysis in postmortem brain-derived samples from Autism Spectrum Disorder affected individuals (2023)
- single end
- case control
- 20 samples
- Autism spectrum disorder (ASD) is a complex neurodevelopmental condition with a strong genetic link, but no single well-established cause. This suggests that perturbed regulatory network may better explain its heterogeneity of phenotypes and multiple gene associations. Consistently, our previous study provided evidence of abnormal alternative polyadenylation in transcripts from distinct brain regions suggesting a potential imbalance in the protein synthesis in the postsynaptic density. To test this hypothesis, we studied transcriptome-wide alterations of mRNA translation in post-mortem brain samples from neurotypical and ASD-affected young male subjects. To this end, we employed an optimised polysome profiling technique amenable for small tissue samples and analyzed changes in the translatome using the anota2seq algorithm. The analysis revealed bolstered translation of mRNAs whose translational efficiency was previously reported to be sensitive to eIF4E, a key factor for synaptic protein synthesis modulated by Ras/ERK and PI3K/mTOR signaling pathways. This observation is consistent with previous findings linking hyperactive eIF4E to increased translation of neuroligins, a disturbed excitation/inhibition ratio in synapses and autistic-like phenotypes in mice. In summary, we reveal a link between eIF4E-dependent translation and human autism, indicating a potential pharmacy-therapeutical target for the prevention of behavioral impairments in ASD.
- 10 brain samples were collected post-mortem from neurotypical and ASD-affected donors (NIH NeuroBioBank). To minimize study bias, the donors were matched for age (4-9 years old) and sex (male); and samples only originated from Brodmann Area 19.

## GSE119027 - Truncating variant in MYOF gene is associated with limb-girdle type muscular dystrophy and cardiomyopathy (2019)
- single end
- case control
- 4 samples
- Even though genetic studies of individuals with neuromuscular diseases have uncovered the molecular background of many cardiac disorders such as cardiomyopathies and inherited arrhythmic syndromes, the genetic cause of a proportion of cardiomyopathies associated with neuromuscular phenotype still remains unknown. Here, we present a clinical case with a combination of cardiomyopathy and limb-girdle type muscular dystrophy where whole exome sequencing identified myoferlin (MYOF) - a member of the Ferlin protein family and close homolog of DYSF - as the most likely candidate gene. The disease-causative role of the identified variant c.[2576delG; 2575G>C], p.G859fs is supported by functional studies in vitro using primary patient's satellite cells, including both RNA sequencing and morphological studies, as well as recapitulating of muscle phenotype in vivo in zebrafish experiments. We provide the first evidence supporting a role of MYOF in human muscle disease.
- A muscle biopsy taken from patient’s M. deltoideus was used for morphological examination and SMSC isolation. Control cells were obtained from the hip muscles of healthy donors. The cells were placed in proliferation media (DMEM supplemented with 10% FCS) on cell culture dishes and cultured until 80% confluence. Patient and donor cells were differentiated towards myotubes using low serum condition.

## GSE63392 - RNA-Seq of cKIT+ sorted cells from 53-137 day old fetal testes and ovaries and RNA-Seq of TRA-1-81+ H1 and UCLA1 hESCs (2015)
- single end
- time series (?)
- 19 samples
- We performed RNA-Seq analyses on 15 human fetal samples at 53-137 days of development, 9 female and 5 male, and identified the transcriptional changes during the transition of human cKIT+ primordial germ cells (PGCs), the precursors of gametes, to the generation of Advanced Germline Cells. Comparing the transcriptional profile of PGCs to that of H1 and UCLA1 hESCs identifies differences between the two cell types and pinpoints molecules that can be used in the development of in vitro germ cell differentiation protocols starting from human pluripotent stem cells.
- RNA-Seq of cKIT+ cells analyzed from 6 biological samples for testes and 9 samples for ovaries from 53-137 days. 2 biological replicates of TRA-1-81+ cells sorted from H1 and UCLA1 hESCs. WGBS of cKIT+ cells analyzed from 4 biological samples of ovaries and 1 biological sample of testes at 57-137 days of development.

## GSE197976 - Impact of an Immune Modulator Mycobacterium-w on Adaptive Natural Killer Cells and Protection Against COVID-19 (2022)
- MinION
- case control
- 8 samples
- Whole transcriptome Differential Gene Expression (DGE) analysis was carried out on four biological replicates of both Mw (0.1 ml Mw administrated intradermally in each arm) and Control group at 6 months following exposure to Mycobacterium-w. Sequencing was done through Direct cDNA Sequencing (oxford nanopore technologies, Oxford, UK) using RNA isolated from Peripheral blood mononuclear cells (PBMC) by Trizol method. Native barcoding and adaptor ligation was done according to the manufacturer’s instructions. Ligated cDNA was loaded on the flow cell (R9.4.1) in MinION and were sequenced specifying 72 hours protocol. MinKNOW (v21.06.10, Microsoft Windows OS based) was used to generate FAST5 files. FAST5 files were base-called with CPU based Guppy basecaller (v.5.0.11) (ONT) to generate FASTQ files. DGE analysis was done using “pipeline-transcriptome-de” (https://github.com/nanoporetech/pipeline-transcriptome-de) pipeline. DGE analysis confirmed that upregulation of ANK pathway was evident at 6 months in the Mw group. Apart from upregulation of KLRC2 and B3GAT1 and downregulation of KLRC1, the key transcription factor in the ANK pathway, BCL11b, was persistently upregulated. Downregulation of EAT-2 and PLZF further corroborated the classic gene expression signature of ANK cells. Moreover, increased expression of AT-rich interaction domain 5B (ARID5B), as demonstrated in the Mw group plays an important role in enhanced metabolism in ANK cells as well as increased IFN-γ release and survival. DGE analysis also revealed an enhancement of ANK mediated ADCC pathway, with significant upregulation of CD247 along with downregulation of FCER1G, which is a typical signature of ANK-ADCC. Both CD247 and FCER1G are adapter molecules for FCGRIIIA (CD16) with CD247 possessing 3 ITAMs against one ITAM of FCER1G, increasing the ADCC several folds. It is possible that Mw induced augmentation of NK-ADCC might potentiate the efficacy of SARS-CoV2 vaccines as well.
- Comparative whole gene expression analysis between four Mycobacterium-w treated and four control biological replicates

## GSE264091 - Altered TE-derived genes leads to pathogenesis of FXD (note - two subseries) (2024)
- PacBio and Illumina
### Subseries GSE264089 - PacBio
- 84 samples
- We identified known and unknown transposon elements in the cells from healthy donors, FXD patients, and ALL patients by long-read RNA-sequencing. Using these identified TEs, we analyzed the expression of TE-derived genes by short-read (Illumina) RNA-sequencing. We found that expression of some specific TE-derived genes were changed in FXD compared to those in healthy donors. These findings suggested the involvement of modified expressions of TE-derived genes in the pathogenesis of FXD.
- We analyzed total RNA and mRNA in 8 kinds of healthy donors' cells, 9 kinds of FXD patients' cells, and 4 kinds of ALL patients' cells.
### Subseries GSE264090 - Short-read RNA-seq
- Summary: same series-level abstract as subseries GSE264089 above.
- We analyzed mRNA in 8 kinds of healthy donors' cells, 9 kinds of FXD patients' cells, and 4 kinds of ALL patients' cells.
- 63 samples

## GSE242084 - Long read transcriptome sequencing of ccRCC tumour cell line RCC4 (2023)
- 6 samples
- Clear cell renal cell carcinoma (ccRCC) is the most common form of kidney cancer. To date, long-read RNA sequencing has not been applied to kidney cancer. Here, we used ONT long-read Direct RNA sequencing to profile the transcriptomes of ccRCC cell line RCC4, with and without exposure to pro-inflammatory cytokines. Our results revealed differentially expressed genes induced by the pro-inflammatory cytokines. Moreover, results here revealed potential tumour origin of novel isoforms and genes that were discovered in the archival tumour samples by long-read sequencing.
- RCC4 cells (with and without exposure to pro-inflammatory cytokines IFNg&TNF for 24h) were harvested for total RNA. Poly(A)+ RNA molecules were isolated, with 500ng used as input for sequencing library preparation (RNA002). Each library was loaded on individual PromethION flow cell (R9) and sequenced for 72h. Reads with Q>7 were used for subsequent bioinformatic analysis.

## GSE271893 - Unraveling autophagic imbalances and therapeutic insights in Mecp2-deficient models (2024) (mouse)
- 26 samples
- Loss of function mutations in MECP2 are associated to Rett syndrome (RTT), a severe neurodevelopmental disease. Mainly working as a transcriptional regulator, MECP2 absence leads to gene expression perturbations resulting in deficits of synaptic function and neuronal activity. In addition, RTT patients and mouse models suffer from a complex metabolic syndrome, suggesting that related cellular pathways might contribute to neuropathogenesis. Along this line, autophagy is critical in sustaining developing neuron homeostasis by breaking down dysfunctional proteins, lipids, and organelles. Here, we investigated the autophagic pathway in RTT and found reduced content of autophagic vacuoles in Mecp2 knock-out neurons. This correlates with defective lipidation of LC3B-II probably caused by a deficiency of the autophagic membrane lipid phosphatidylethanolamine. The administration of the autophagy inducer trehalose recovers LC3B lipidation, autophagosomes content in knock-out neurons, and ameliorates their morphology, neuronal activity and synaptic ultrastructure. Moreover, we provide evidence for attenuation of motor and exploratory impairment in Mecp2 knock-out mice upon trehalose administration. Overall, our findings open new perspectives for neurodevelopmental disorders therapies based on the concept of autophagy modulation.
- To investigate trehalose effect on Mecp2-KO primary neurons, we performed gene expression profiling analysis on 4 different groups of samples: WT primary neurons, trehalose treated WT primary neurons, Mecp2-KO primary neurons, and trehalose treated Mecp2-KO primary neurons

## GSE253226 - PKP2 Gene Therapy Improves Heart Function and Reduces Mortality in a Pkp2-deficient Mouse Model of Arrhythmogenic Right Ventricular Cardiomyopathy (2024) (human + mouse)
- Three subseries: GSE253223, GSE253224, GSE253225
### GSE253223 - TNYA0016: Time-dependent transcriptional effects of PKP2 mRNA knockdown in wild-type human iPSC-CMs
- Background: Arrhythmogenic right ventricular cardiomyopathy (ARVC) is a familial cardiac disease associated with ventricular arrhythmias and an increased risk of sudden cardiac death. Currently, there are no approved treatments that address the underlying genetic cause of this disease, representing a significant unmet need. Mutations in Plakophilin-2 (PKP2), encoding a desmosomal protein, account for approximately 40% of ARVC cases and result in reduced gene expression. Methods: Our goal is to examine the feasibility and the efficacy of adeno-associated virus 9 (AAV9)-mediated restoration of PKP2 expression in a cardiac specific knock-out mouse model of Pkp2. Results: We show that a single dose of AAV9:PKP2 gene delivery prevents disease development before the onset of cardiomyopathy and attenuates disease progression after overt cardiomyopathy. Restoration of PKP2 expression leads to a significant extension of lifespan by restoring cellular structures of desmosomes and gap junctions, preventing or halting decline in left ventricular ejection fraction, preventing or reversing dilation of the right ventricle, ameliorating ventricular arrhythmia event frequency and severity, and preventing adverse fibrotic remodeling. RNA sequencing analyses show that restoration of PKP2 expression leads to highly coordinated and durable correction of PKP2-associated transcriptional networks beyond desmosomes, revealing a broad spectrum of biological perturbances behind ARVC disease etiology. Conclusions: We identify fundamental mechanisms of PKP2-associated ARVC beyond disruption of desmosome function. The observed PKP2 dose-function relationship indicates that cardiac-selective AAV9:PKP2 gene therapy may be a promising therapeutic approach to treat ARVC patients with PKP2 mutations.
- RNA sequencing on 2 groups of iPSC-CMs with 1. wild-type background: siNeg (n=4 at day 2, 4, 6, and 8, respectively), 2. siPKP2 (n=16 at day 2, 4, 6, and 8, respectively).
- (humans)
- 20 samples
### GSE253224 - TNYA0042: TN-401 dose-dependent efficacy study at about 10 weeks in Pkp2-cKO ARVC mouse model
- Summary: same series-level abstract as GSE253223 above.
- RNA sequencing on 4 groups of C57BL6 mice with 1. wild-type background: WT_HBSS (n=8), 2. cKO_HBSS (n=8), 3. cKO_TN-401_3E13 vg/kg (n=6), 4. cKO_TN-401_6E13 vg/kg (n=6). Right ventricle (RV) and left ventricle (LV) tissues from each animal were analyzed simultaneously.

## GSE122246 - RNA sequencing reveals PNN as a predictive biomarker of clinical outcome in stage III colorectal cancer patients treated with adjuvant chemotherapy (2020)
- Ion Torrent
- 24 samples
- Five-year overall survival of stage III colorectal cancer (CRC) patients treated with standard adjuvant chemotherapy (ACHT) is highly variable. Genomic biomarkers and/or transcriptomic profiles identified lack of adequate validation. Aim of our study was to identify and validate molecular biomarkers predictive of ACHT response in stage III CRC patients by a transcriptomic approach. From a series of CRC patients who received ACHT, two stage III extreme cohorts (unfavorable vs. favorable prognosis) were selected. RNA sequencing was performed from fresh frozen explants. Tumors were characterized for somatic mutations. Validation was performed in stage III CRC patients extracted from two GEO datasets. According to disease free survival (DFS), 108 differentially expressed genes (104/4 up/downregulated in the unfavorable prognosis group) were identified. Among 104 upregulated genes, 42 belonged to olfactory signaling pathways, 62 were classified as pseudogenes (n = 17), uncharacterized noncoding RNA (n = 10), immune response genes (n = 4), microRNA (n = 1), cancer-related genes (n = 14) and cancer-unrelated genes (n = 16). Three out of four down-regulated genes were cancer-related. Mutational status (i.e., RAS, BRAF, PIK3CA) did not differ among the cohorts. In the validation cohort, multivariate analysis showed high PNN and KCNQ1OT1 expression predictive of shorter DFS in ACHT treated patients (p = 0.018 and p = 0.014, respectively); no difference was observed in untreated patients. This is the first study that identifies by a transcriptomic approach and validates PNN and KCNQ1OT1 as molecular biomarkers predictive of chemotherapy response in stage III CRC patients. After a further validation in an independent cohort, PNN and KCNQ1OT1 evaluation could be proposed to prospectively identify stage III CRC patients benefiting from ACHT.
- Two extreme cohort study according to chemotherapy outcome
- (remove Ion Torrent)

## GSE199911 - Gene dysregulation in acute HIV-1 infection – early transcriptomic analysis reveals the crucial biological functions affected (2022)
- 74 samples
- Transcriptomic analyses from early human immunodeficiency virus (HIV) infection have the potential to reveal how HIV causes widespread and lasting damage to biological functions, especially in the immune system. Previous studies have been hampered by difficulties in obtaining early specimens. We studied 29 HIV infected subjects 1 month from presentation and 46 contemporaneous uninfected controls. We used RNA-seq to examine circulating immune cells to describe in detail the profound gene dysregulation observed in early HIV infection. Correlations were sought between viral load and differential gene expression, and biological implications were examined. The most profoundly upregulated biological functions related to cell cycle regulation, DNA repair and replication, microtubule and spindle organization, and immune activation and response. Our findings provide insight into the pathogenic mechanisms of early HIV-induced immune damage, information important for timely interventions to prevent clinical progression, inhibit reservoir seeding, and enable vaccine development.
- Cohort study; this sub-analysis comparing acute HIV infection to uninfected controls

## GSE167665 - Bulk RNA-seq of mice covering the whole lifespan (2 days to 904 days) from four tissues (2022)
- 63 samples, multiple tissues
- Gene expression changes during ageing were shown to oppose developmental trajectories; a reversal pattern previously linked to cellular identity loss. Generating cortex, lung, liver and muscle tissue transcriptomes of 16 mice of different ages, covering development and ageing periods, we found that expression reversals were widespread but tissue-specific. Consistent with this result, we observed an inter-tissue divergence during development and convergence during ageing (DiCo), further confirmed in independent mouse and human datasets. The genes displaying DiCo pattern were enriched among tissue-specific genes that tended to lose developmental expression levels during ageing. Finally, analysing publicly available single-cell transcriptome data, we studied the contribution of cellular composition and cell-autonomous changes to the convergence in ageing. Our results, for the first time, suggest inter-tissue convergence during ageing is widespread and associated with the loss of specialisation at the tissue and possibly also at the cellular level.
- Age-series mRNA expression profiles from 4 tissues (cerebral cortex, lung, liver, skeletal muscle) of 16 mice, covering postnatal development (2 days - 61 days) and ageing (93 days – 904 days) periods.

## GSE70503 - Transcriptomic variation of pharmacogenes in multiple human tissues and lymphoblastoid cell lines (2016)
- 139 samples, multiple tissues
- Variation in the expression level and activity of genes involved in drug disposition and action (“pharmacogenes”) can affect drug response and toxicity, especially when in tissues of pharmacological importance. Previous studies have relied primarily on microarrays to understand gene expression differences, or have focused on a single tissue or small number of samples. The goal of this study was to use RNA-seq to determine the expression levels and alternative splicing of 389 PGRN pharmacogenes across four tissues (liver, kidney, heart and adipose) and lymphoblastoid cell lines (LCLs), which are used widely in pharmacogenomics studies. Analysis of RNA-seq data from different individuals across the 5 tissues (N = 139 samples) revealed substantial variation in both expression levels and splicing across samples and tissue types. This in-depth exploration also revealed 183 splicing events in pharmacogenes that were previously not annotated. Overall, this study serves as a rich resource for the research community to inform biomarker and drug discovery and use.
- mRNA profile of 389 genes in 24 liver, 20 kidney (cortex), 25 heart (left ventricle), 25 adipose (subcutaneous) samples, and 45 lymphoblastoid cell lines (LCLs). The raw data for the LCL samples will be available through dbGaP under phs000481.

## GSE252291 - Meningioma transcriptomic landscape demonstrates novel subtypes with regional associated biology and patient outcome (2024)
- Meningiomas are the most common primary intracranial tumors in humans. While most of these tumors are benign, some are malignant, rapidly recur after multimodal treatment with surgery and radiotherapy, and can ultimately be fatal. The current WHO grade system does not always identify high risk meningiomas, therefore better characterizations of the biology of aggressive tumors are needed. In order to address these challenges, we combined 13 bulk RNA-Seq datasets, corrected for batch effects, and applied Uniform Manifold Approximation and Projection (UMAP) to create a reference landscape of ~1300 meningioma tumors. Our analyses revealed multiple distinct meningioma subtypes with specific biological signatures. Clinical metadata, mutations, copy number alterations, and gene-fusion data effectively correlated with regions of the UMAP. Notably, regional distribution of time to recurrence identified major clusters as well as intra-cluster differences of meningiomas with varying patient outcomes. The most aggressive subtype, characterized by an enrichment of higher WHO grades, frequent tumor recurrences, and shorter time to recurrence, exhibited elevated proliferation rates and RNA expression resembling muscle development. To facilitate clinical applications, we developed a cross-validated nearest-neighbors-based algorithm that accurately maps new patients onto this UMAP landscape. Our study highlights the utility of transcriptomic analysis in discerning meningioma heterogeneity as well as successful combination of multiple datasets from various sources. We provide a valuable tool for understanding the disease, predicting tumor biology and patient prognosis. This resource is accessible via the open source, interactive online tool Oncoscape, where the scientific community can explore the landscape and mine clinical and genomic metadata.
- We collected meningioma tumor samples, extracted total RNA and sequenced. Then we combined the in-house sequenced dataset with other publicly available meningioma RNA-Seq datasets that we collected to generate a UMAP. Clinical and genomic metadata was overlaid on to the UMAP and meningioma subtypes were analyzed.

## GSE182437 - RNA-Seq to validate the effect of substrate and format as drivers of transcriptional variance in human intestinal organoids (2021)
- An initial comparison of RNA-sequencing datasets from 251 experiments on 35 human intestinal organoid lines revealed that differences in transcriptional signatures could be influenced by variations in the format (3D, monolayers on transwells, and monolayers on 96-well plates) as well as substrate (collagen and Matrigel) in which the organoid lines are cultured. While the organoid lines were common to these datasets, there were differences in experimental approaches in different laboratories. To rule out potential confounding variables, we performed an RNA-Seq experiment using 5 jejunal organoid lines cultured at the same time in different formats and substrates to evaluate their roles in baseline transcriptional signatures and determine if key findings from the initial analysis could be validated.
- 25 samples were analyzed (no technical replicates). The cell lines used were J2, J3, J4, J6, J11. Each of these cell lines was plated as follows: 3D on Matrigel; Transwells on Matrigel; Transwells on collagen; Monolayers on Matrigel; Monolayers on collagen
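
The sample layout described for GSE182437 (5 organoid lines, each plated in 5 format/substrate combinations) can be sketched as a small design table. This is only an illustration for checking the expected sample count; the field names (`line`, `format`, `substrate`) and the function `build_design` are my own assumptions and are not taken from the GEO record.

```python
from itertools import product

# Organoid lines and plating conditions as listed in the sample description.
LINES = ["J2", "J3", "J4", "J6", "J11"]
CONDITIONS = [
    ("3D", "Matrigel"),
    ("Transwell", "Matrigel"),
    ("Transwell", "collagen"),
    ("Monolayer", "Matrigel"),
    ("Monolayer", "collagen"),
]


def build_design(lines, conditions):
    """Return one record per (line, condition) combination."""
    return [
        {"line": line, "format": fmt, "substrate": substrate}
        for line, (fmt, substrate) in product(lines, conditions)
    ]


design = build_design(LINES, CONDITIONS)
print(len(design))  # 5 lines x 5 conditions = 25
```

A table like this could be compared against the metadata an LLM extracts for the series, for example to check that every line/condition pair is recovered exactly once.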

# Chosen datasets
1. GSE213001 - Bulk tissue RNA sequencing from human donors with and without lung fibrosis (2022)
2. GSE235236 - Bulk RNAseq expression data of colon biopsies from intestinal mucosa of non-IBD controls and patients with Crohn's disease or ulcerative colitis (2023)
3. GSE64712 - Functional characterization of human T cell hyporesponsiveness induced by CTLA4-Ig (2015)
4. GSE122380 - Dynamic genetic regulation of gene expression during cellular differentiation (2018)
5. GSE236761 - Translatome analysis in postmortem brain-derived samples from Autism Spectrum Disorder affected individuals (2023)
6. GSE63392 - RNA-Seq of cKIT+ sorted cells from 53-137 day old fetal testes and ovaries and RNA-Seq of TRA-1-81+ H1 and UCLA1 hESCs (2015)
7. GSE119027 - Truncating variant in MYOF gene is associated with limb-girdle type muscular dystrophy and cardiomyopathy (2019)
8. GSE133702 - Next generation sequencing for temporal analysis at high altitude in Indian and Kyrgyz healthy males
9. GSE264091 - Altered TE-derived genes leads to pathogenesis of FXD (note - two subseries) (2024)
10. GSE242084 - Long read transcriptome sequencing of ccRCC tumour cell line RCC4 (2023)
11. GSE271893 - Unraveling autophagic imbalances and therapeutic insights in Mecp2-deficient models (2024) (mouse)
12. GSE253226 - PKP2 Gene Therapy Improves Heart Function and Reduces Mortality in a Pkp2-deficient Mouse Model of Arrhythmogenic Right Ventricular Cardiomyopathy (2024) (mouse + human)
13. GSE252291 - Meningioma transcriptomic landscape demonstrates novel subtypes with regional associated biology and patient outcome (2024)
14. GSE199911 - Gene dysregulation in acute HIV-1 infection – early transcriptomic analysis reveals the crucial biological functions affected (2022)
15. GSE167665 - Bulk RNA-seq of mice covering the whole lifespan (2 days to 904 days) from four tissues (2022)
16. GSE70503 - Transcriptomic variation of pharmacogenes in multiple human tissues and lymphoblastoid cell lines (2016)
17. GSE182437 - RNA-Seq to validate the effect of substrate and format as drivers of transcriptional variance in human intestinal organoids (2021)
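
Since this list is maintained by hand, a quick check for repeated accessions may be useful before running the benchmark. The following is a minimal sketch, assuming the list is available as plain text with one `GSE...` accession per line; `find_duplicate_accessions` is a hypothetical helper name, and the example text is an illustrative excerpt with one entry repeated on purpose.

```python
import re
from collections import Counter

# Assumption: GEO series accessions look like "GSE" followed by digits.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")


def find_duplicate_accessions(text):
    """Return accessions found on more than one line, in first-seen order."""
    counts = Counter()
    for line in text.splitlines():
        match = ACCESSION_RE.search(line)  # first accession on the line
        if match:
            counts[match.group()] += 1
    return [accession for accession, n in counts.items() if n > 1]


# Illustrative excerpt only; the first entry is deliberately repeated.
example = """\
1. GSE213001 - Bulk tissue RNA sequencing from human donors with and without lung fibrosis (2022)
2. GSE64712 - Functional characterization of human T cell hyporesponsiveness induced by CTLA4-Ig (2015)
3. GSE122380 - Dynamic genetic regulation of gene expression during cellular differentiation (2018)
4. GSE213001 - Bulk tissue RNA sequencing from human donors with and without lung fibrosis (2022)
"""

print(find_duplicate_accessions(example))
```

Only the first accession on each line is counted, so a note that mentions another series in passing would not trigger a false positive.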