Ensure all datasets have the same transcriptomics normalization metric #154

jjacobson95 · 2024-04-24T20:19:32Z

All transcriptomic, proteomic, etc., data should be harmonized across datasets. This should be done before we explore batch effects for the paper.

jjacobson95 · 2024-04-25T17:33:08Z

@moonchangin I heard you have some details on this. Feel free to share them in this thread.

moonchangin · 2024-04-25T20:47:30Z

Issue Description

The MPNST and BeatAML datasets use different normalization metrics, so their expression values cannot be compared directly.

Details

MPNST Data
- Normalization Metric: TPM (Transcripts Per Million)
- Tool: Salmon-rnaseq workflow (an alignment-free tool)

BeatAML Data
- Normalization Metric: log2(RPKM + 1)
- Source: Nature Article

Proposed Solution

To harmonize the data, we propose adjusting the MPNST's TPM data to match the RPKM used in the BeatAML dataset. The adjustment can be done using the following Python code snippet, which requires the gene lengths in kilobases (kb):

tpm['gene_length_kb'] = tpm['gene_length'] / 1000  # Convert gene length to kilobases
total_RNA_seq_depth = 10**6  # Placeholder for total depth in millions
tpm['rpkm'] = (tpm['transcriptomics'] * total_RNA_seq_depth) / tpm['gene_length_kb']
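
For reference, here is a minimal sketch of how the two metrics relate under their standard definitions. It uses made-up counts and gene lengths for a single sample; all values and variable names are illustrative and not taken from the MPNST data. It suggests that recovering RPKM from TPM needs a per-sample scale factor (the sample's RPKM total divided by one million), which would have to come from the original quantification.

```python
import numpy as np

# Hypothetical toy data for one sample: raw read counts and gene lengths in bases.
counts = np.array([100.0, 500.0, 250.0, 50.0])
gene_length_bp = np.array([1000.0, 2500.0, 500.0, 4000.0])

gene_length_kb = gene_length_bp / 1000.0
reads_per_kb = counts / gene_length_kb

# RPKM: reads per kilobase, per million mapped reads in the sample.
rpkm = reads_per_kb / (counts.sum() / 1e6)

# TPM: reads per kilobase, rescaled so the sample sums to one million.
tpm = reads_per_kb / reads_per_kb.sum() * 1e6

# The two metrics differ by a single per-sample factor: sum(RPKM) / 1e6.
scale = rpkm.sum() / 1e6
rpkm_from_tpm = tpm * scale

assert np.allclose(rpkm_from_tpm, rpkm)
```

A check like this on a small example with known counts may be a useful sanity test for any conversion code before it is run on the full dataset.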

Requirements

Gene Length Information: We need gene lengths from GENCODE version 29, the version previously used in the RNA-seq workflow. The gene length data should be converted from GTF format to a CSV file of gene-symbol and gene-length pairs.
gencode_v29_gene_lengths.csv
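
A rough sketch of the GTF-to-CSV step, in case it helps reviewers. It assumes gene length means the genomic span of each `gene` record (end - start + 1), and the attached code may use a different definition, such as summed exon length. The sample records, the `gene_lengths_from_gtf` helper, and the CSV column names are hypothetical.

```python
import csv
import io
import re

# Hypothetical GTF records standing in for the GENCODE v29 annotation file.
SAMPLE_GTF = (
    '##description: toy example\n'
    'chr1\tHAVANA\tgene\t1001\t3000\t.\t+\t.\tgene_id "ENSG_TOY1.1"; gene_name "GENE_A";\n'
    'chr1\tHAVANA\texon\t1001\t1500\t.\t+\t.\tgene_id "ENSG_TOY1.1"; gene_name "GENE_A";\n'
    'chr2\tHAVANA\tgene\t501\t1000\t.\t-\t.\tgene_id "ENSG_TOY2.1"; gene_name "GENE_B";\n'
)

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_lengths_from_gtf(handle):
    """Return {gene_symbol: length_bp} using the span of each 'gene' record."""
    lengths = {}
    for line in handle:
        if line.startswith('#'):
            continue  # skip header/comment lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'gene':
            continue  # keep only gene-level records
        match = GENE_NAME.search(fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive.
        lengths[match.group(1)] = end - start + 1
    return lengths


lengths = gene_lengths_from_gtf(io.StringIO(SAMPLE_GTF))

# Write gene-symbol / gene-length pairs as CSV (to a string here; a file in practice).
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['gene_symbol', 'gene_length'])
writer.writerows(sorted(lengths.items()))
csv_text = out.getvalue()
```

In practice the `io.StringIO` input would be replaced by an open handle on the annotation file, and the output by a real CSV file.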

Code and resources used for tpm to rpkm conversion

code_for_tpm_to_rpkm.zip

sgosline · 2024-04-29T20:49:32Z

Everything is in TPM except for BeatAML. @jjacobson95 can you please dig up the TPM values for BeatAML? If they are not available, please add a conversion step.
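
One possible shape for such a conversion step, as a sketch only. It assumes the BeatAML values are log2(RPKM + 1) in a genes-by-samples table and that the table holds every gene that went into the original RPKM calculation. The data frame below and its gene and sample names are invented for illustration.

```python
import pandas as pd

# Hypothetical genes x samples matrix of log2(RPKM + 1) values.
log_rpkm = pd.DataFrame(
    {"sample_1": [3.0, 1.0, 0.0], "sample_2": [2.0, 2.0, 1.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Undo the log transform to recover linear-scale RPKM.
rpkm = 2 ** log_rpkm - 1

# Rescale each sample (column) so its values sum to one million, giving TPM.
tpm = rpkm / rpkm.sum(axis=0) * 1e6
```

If the table has been filtered to a subset of genes, the per-sample totals change, and the result is only approximately comparable with TPM computed on the full gene set.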

jjacobson95 added the Harmonization label Apr 24, 2024

sgosline assigned jjacobson95 Apr 29, 2024

sgosline assigned sgosline and unassigned jjacobson95 Jun 20, 2024

sgosline closed this as completed Jul 9, 2024
