04. Proteomics Data

Proteomics Data Loading Instructions

There are two formats for loading protein quantification data as High Dimensional Data (HDD): Mass Spec Proteomics and RBM. Why were proteomics loading formats developed for these two methods in particular? tranSMART is open source, and some of its features were developed and contributed by members of the community. Various HDD data formats were developed by developers hired by Sanofi, based on the Sanofi scientists' specifications. I assume that, at the time, these two methods were popular with the Sanofi scientists. This does not mean that quantitative proteomics data generated by other methods can't be loaded into tranSMART. The RBM format is very specialized, but the format developed for Mass Spec Proteomics is quite generic and can be easily adapted to any other quantitative proteomics data, as long as the method or assay detects and quantifies specific proteins and protein isoforms that have UniProt IDs. Quantification of protein variants, mutations or modifications, where the different entities being quantified map to the same UniProt ID, can also be loaded as High Dimensional data with some creative "probe IDs". Advanced Workflows use either "probe" or "probe-protein name" as legends for graphs and tables.

Loading Mass Spec format Proteomics data requires three files: a Platform file, a Data file and a Data mapping file. ETL uploads Mass Spec Proteomics data from the <ProteinDataToUpload> directory.

About "Platform"

Here I have to digress and explain what a “Platform” file is for HDD data in tranSMART speak, to avert the confusion this term can create for any sane experimental scientist in the context of proteomics or other high-throughput data.

The Platform concept is adopted from the Gene Expression Omnibus (GEO), a repository for microarray, next-generation sequencing, and other forms of high-throughput functional genomics data. According to GEO, “A Platform record is composed of a summary description of the array or sequencer and, for array-based Platforms, a data table defining the array template.” For tranSMART HDD data, a Platform is simply a file that maps the detecting moiety (probe, antibody, peptide, ligand, etc.) to a standard ID of the biological entity being quantified. When a Platform file is loaded, it links the data to the data-specific Dictionary, which should already have been loaded when tranSMART was installed.

The Protein Dictionary includes UniProt ID, Protein Name and Gene Symbol. This makes it possible to search for a specific protein in the HDD pop-up in Advanced Workflows by simply starting to type a gene symbol or UniProt ID and then selecting the corresponding protein name.

Mass Spec Proteomics Data (sample)

Mass Spec Proteomics Data file format

Peptide	Majority protein IDs	Sample1	Sample2
AVLTIDEK	P01009	22.8861	8.92968
LSITGTYDLK	P01009	11.2	7.33
LALGDDSPALK	P17174	6.09	8.98

The Peptide column does not necessarily have to contain a peptide sequence. It can be any other reagent/probe ID used for quantitative protein detection. Majority protein IDs are UniProt IDs. This column is redundant, since the whole point of the Platform file (see below) is to map Peptides/Probes from the data file to UniProt IDs. But this is the format that was specified by the original tranSMART HDD proteomics format developers, and tMDataLoader expects this column to be present.
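The layout described above can be checked before loading with a short script. This is only a sketch in Python, not part of tMDataLoader: the function name `read_ms_data` is made up for illustration, and it assumes a tab-separated file in which every sample cell holds a number.

```python
import csv
import io

# The sample rows shown above, tab-separated.
SAMPLE = (
    "Peptide\tMajority protein IDs\tSample1\tSample2\n"
    "AVLTIDEK\tP01009\t22.8861\t8.92968\n"
    "LSITGTYDLK\tP01009\t11.2\t7.33\n"
    "LALGDDSPALK\tP17174\t6.09\t8.98\n"
)


def read_ms_data(handle):
    """Return (sample_names, rows); each row is (peptide, uniprot_id, {sample: value})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 3:
        raise ValueError("expected Peptide, Majority protein IDs and at least one sample column")
    samples = header[2:]
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields, got {len(fields)}")
        # Assumption: every sample cell is numeric; float() raises otherwise.
        values = {sample: float(cell) for sample, cell in zip(samples, fields[2:])}
        rows.append((fields[0], fields[1], values))
    return samples, rows


samples, rows = read_ms_data(io.StringIO(SAMPLE))
```

The sample names returned here are the values that SAMPLE_CD in the mapping file has to match.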

Data Types (same as for Expression data)

The last character of the data file name before the extension (e.g. Test Study_GSE37425_PROTEIN_Data_R.txt) is one of the following letters:
*R - raw data. Values are raw data, which are transformed to calculate log2 values and z-scores.
*L - log2 data. Values are log2 data; z-scores are calculated from them.
*T and Z - z-score data. Both letters have the same meaning: a value is written as the z-score without modification if it is within the range (-2.5; 2.5), and is truncated to this range otherwise.

NOTE: For Mass Spec data loaded as L, raw values are not restored.
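The naming convention and the z-score truncation can be sketched in a few lines of Python. The helper names `data_type_from_filename` and `clamp_zscore` are hypothetical and the code is not taken from the loader; it only illustrates the rules as stated above.

```python
import os

# Letter before the file extension -> data type, as listed above.
DATA_TYPES = {"R": "raw", "L": "log2", "T": "z-score", "Z": "z-score"}


def data_type_from_filename(filename):
    """Read the data type from the last character of the file name stem."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    letter = stem[-1:]
    if letter not in DATA_TYPES:
        raise ValueError(f"unrecognised data type letter {letter!r} in {filename!r}")
    return DATA_TYPES[letter]


def clamp_zscore(value, limit=2.5):
    """Truncate a z-score to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


kind = data_type_from_filename("Test Study_GSE37425_PROTEIN_Data_R.txt")
```

Whether the loader accepts lowercase letters is not stated here, so the sketch treats the letter as case-sensitive.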

Mass Spec Proteomics Platform

#PLATFORM_ID: MSPROT
#PLATFORM_TITLE: QTRAP_5500
#SPECIES: Homo Sapiens

peptide	majority_protein_id	organism	gpl_id
AVLTIDEK	P01009	Homo Sapiens	MSPROT
LSITGTYDLK	P01009	Homo Sapiens	MSPROT
LALGDDSPALK	P17174	Homo Sapiens	MSPROT

If you wonder why the Proteomics Platform file does not have the same format as the Platform file for Expression data, or why gpl_id (platform id) is included as an additional column even though it is already specified in the first comment line, your guess is as good as mine.

NOTE: A Mass Spec Proteomics Platform can have the same Peptide mapped to several UniProt IDs. It will load without generating an error, and Analysis Workflows will run, but they don't seem to process this situation correctly. To avoid generating misleading results, degenerate peptide rows can be labelled as 1, 2, 3 (see the Test study example).

Unlike Expression data, where values for Probes without a Gene will still be loaded, in Proteomics, data for Peptides without a UniProt ID in the platform will not be loaded.
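Both situations (a peptide mapped to several UniProt IDs, and a peptide with no UniProt ID in the platform) are easy to detect before loading. The following Python sketch is not part of the loader; `check_platform` is a hypothetical helper, and the peptide PEPTIDEX with the IDs X00001 and X00002 is made up purely to show an ambiguous mapping.

```python
from collections import defaultdict


def check_platform(platform_rows, data_peptides):
    """Report ambiguous and unmapped peptides.

    platform_rows: iterable of (peptide, uniprot_id) pairs from the Platform file.
    data_peptides: iterable of peptide/probe IDs found in the data file.
    """
    mapping = defaultdict(set)
    for peptide, uniprot_id in platform_rows:
        if uniprot_id.strip():  # rows without a UniProt ID do not count as mapped
            mapping[peptide].add(uniprot_id.strip())
    # Peptides mapped to more than one UniProt ID.
    ambiguous = {p: sorted(ids) for p, ids in mapping.items() if len(ids) > 1}
    # Peptides present in the data but without a UniProt ID in the platform.
    unmapped = sorted(p for p in set(data_peptides) if p not in mapping)
    return ambiguous, unmapped


platform = [
    ("AVLTIDEK", "P01009"),
    ("LSITGTYDLK", "P01009"),
    ("PEPTIDEX", "X00001"),  # made-up rows illustrating a degenerate peptide
    ("PEPTIDEX", "X00002"),
]
ambiguous, unmapped = check_platform(platform, ["AVLTIDEK", "PEPTIDEX", "NOTINPLATFORM"])
```

Peptides reported as unmapped are the ones whose data would be skipped; peptides reported as ambiguous are candidates for relabelling.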

Mass Spec Proteomics Mapping file

The file contains mapping between samples and corresponding subjects. It also contains additional information about samples - such as tissue type and optional attributes. Finally, it has category_cd which is used to determine path to sample related data.

Columns

The mapping file should contain 10 columns (TRIAL_NAME, SITE_ID, SUBJECT_ID, SAMPLE_CD, PLATFORM, TISSUE_TYPE, ATTRIBUTE_1, ATTRIBUTE_2, CATEGORY_CD, SOURCE_CD).

TRIAL_NAME - study identifier (should be same for all samples)
SITE_ID - sample's site. Optional
SUBJECT_ID - subject identifier
SAMPLE_CD - sample code, should match record from data file
PLATFORM - platform ID. It should be UPPERCASE
TISSUE_TYPE - tissue type (e.g. Blood)
ATTRIBUTE_1 - custom attribute 1. Optional
ATTRIBUTE_2 - custom attribute 2. Optional
CATEGORY_CD - multi-level category, separated by '+' symbol, used to build path in tree
SOURCE_CD - STD
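The column rules above lend themselves to a quick pre-load check. The Python sketch below is an illustration, not the loader's own validation: `validate_mapping` is a hypothetical helper, the subject and study values are made up, and it checks only the rules stated in this section (all ten columns present, one TRIAL_NAME, uppercase PLATFORM, SAMPLE_CD found in the data file).

```python
MAPPING_COLUMNS = [
    "TRIAL_NAME", "SITE_ID", "SUBJECT_ID", "SAMPLE_CD", "PLATFORM",
    "TISSUE_TYPE", "ATTRIBUTE_1", "ATTRIBUTE_2", "CATEGORY_CD", "SOURCE_CD",
]


def validate_mapping(rows, data_samples):
    """Return a list of problems found in mapping rows (dicts keyed by column name)."""
    problems = []
    trial_names = {row.get("TRIAL_NAME", "") for row in rows}
    if len(trial_names) > 1:
        problems.append("TRIAL_NAME differs between samples: " + ", ".join(sorted(trial_names)))
    for number, row in enumerate(rows, start=1):
        missing = [column for column in MAPPING_COLUMNS if column not in row]
        if missing:
            problems.append(f"row {number}: missing columns {missing}")
            continue
        if row["PLATFORM"] != row["PLATFORM"].upper():
            problems.append(f"row {number}: PLATFORM '{row['PLATFORM']}' is not uppercase")
        if row["SAMPLE_CD"] not in data_samples:
            problems.append(f"row {number}: SAMPLE_CD '{row['SAMPLE_CD']}' not found in data file")
    return problems


# Made-up example rows; only the column names come from the format itself.
good = dict(zip(MAPPING_COLUMNS, [
    "TEST_STUDY", "", "SUBJ1", "Sample1", "MSPROT",
    "Blood", "", "", "Protein Data+PLATFORM+TISSUETYPE", "STD",
]))
bad = dict(good, SAMPLE_CD="Sample9", PLATFORM="msprot")
problems = validate_mapping([good, bad], {"Sample1", "Sample2"})
```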

`category_cd` placeholders

Usually category_cd is converted to a path as is. For example, category_cd=Protein Data+QTRAP_5500+Blood produces the following path in the tree under the study root: //Protein Data/QTRAP_5500/Blood/. You can also use special keywords as tokens, which are automatically replaced with the corresponding values.

PLATFORM - this token is replaced with the Platform Title from the Platform file for the Platform ID indicated in the PLATFORM column
TISSUETYPE - value from TISSUE_TYPE column
ATTR1 - value from ATTRIBUTE_1 column
ATTR2 - value from ATTRIBUTE_2 column
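The token replacement can be pictured with a small Python sketch. It is not the loader's implementation: `expand_category_cd` is a hypothetical helper, and it assumes that a token always occupies a whole level between '+' separators, which may be stricter than what the loader actually accepts.

```python
def expand_category_cd(category_cd, row, platform_title):
    """Replace placeholder tokens in category_cd and return the levels joined by '/'.

    row: mapping-file row as a dict; platform_title: the #PLATFORM_TITLE value
    of the platform named in the row's PLATFORM column.
    """
    replacements = {
        "PLATFORM": platform_title,
        "TISSUETYPE": row["TISSUE_TYPE"],
        "ATTR1": row["ATTRIBUTE_1"],
        "ATTR2": row["ATTRIBUTE_2"],
    }
    # Assumption: only whole levels are treated as tokens.
    levels = [replacements.get(level, level) for level in category_cd.split("+")]
    return "/".join(levels)


row = {"TISSUE_TYPE": "Blood", "ATTRIBUTE_1": "", "ATTRIBUTE_2": ""}
path = expand_category_cd("Protein Data+PLATFORM+TISSUETYPE", row, "QTRAP_5500")
```

With the values above, the tokenised category_cd produces the same levels as the literal Protein Data+QTRAP_5500+Blood.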

Again, slight, seemingly unnecessary differences between expression and proteomics data mapping file are historic facts of life and are there just to keep data curators on their toes at all times.

RBM Proteomics Data

RBM stands for Rules-Based Medicine. RBM was a privately held company that developed a proprietary Multi-Analyte Profiling (MAP) technology. It was acquired by Myriad. Myriad sells an array of RBM MAPs, which are basically “ELISA” type protein quantification assays where proteins are captured by specific antibodies.

RBM Protein Data file format

Data file format as specified by the original Sanofi HDD specification document:

The input file has a constrained format with the following columns, but only Sample ID (sampid), Analyte and avalue are loaded: id, rid, sampid, plate, visit_code, Analyte (ana_unit), LDD, avalue, analval, belowLDD, read_low, read_high, logtrans, outlier
Units are loaded along with the analyte name for display purposes, e.g. Agouti-Related Protein (AGRP) (pg/mL)

id	rid	sampid	plate	visit_code	Analyte (ana_unit)	LDD	avalue	analval	beloLDD	read_low	read_hi	logtrans	outlier
1	723	AA800N4Q-07	2	bl	Adiponectin (ng/ml)	0.02	2.5	2.5	0	0	0	1	0

The data file format is most likely determined by the output of the RBM data processing program used by the Sanofi scientists.
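To see which parts of an RBM row actually matter, here is a Python sketch that keeps only the three loaded fields and splits the unit off the analyte name. It is an illustration rather than the loader's code: `split_analyte` and `extract_loaded_fields` are hypothetical helpers, and the sketch assumes the unit is the last parenthesised group of the analyte column.

```python
import csv
import io
import re

# Header and row copied from the sample above, tab-separated.
SAMPLE = (
    "id\trid\tsampid\tplate\tvisit_code\tAnalyte (ana_unit)\tLDD\tavalue\t"
    "analval\tbeloLDD\tread_low\tread_hi\tlogtrans\toutlier\n"
    "1\t723\tAA800N4Q-07\t2\tbl\tAdiponectin (ng/ml)\t0.02\t2.5\t2.5\t0\t0\t0\t1\t0\n"
)

# Name followed by a final "(unit)" group; names may contain parentheses themselves.
ANALYTE_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<unit>[^()]*)\)\s*$")


def split_analyte(analyte):
    """Split 'Name (unit)' into (name, unit); unit is None if there is no final group."""
    match = ANALYTE_RE.match(analyte)
    if match is None:
        return analyte, None
    return match.group("name"), match.group("unit")


def extract_loaded_fields(row):
    """Keep only the sample ID, the analyte (with unit) and avalue."""
    name, unit = split_analyte(row["Analyte (ana_unit)"])
    return {
        "sample_id": row["sampid"],
        "analyte": row["Analyte (ana_unit)"],
        "name": name,
        "unit": unit,
        "value": float(row["avalue"]),
    }


records = [extract_loaded_fields(r) for r in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")]
```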

RBM Protein Platform file format

#PLATFORM_ID: RBM100
#PLATFORM_TITLE: Test RBM Platform
#SPECIES: Homo Sapiens

gpl_id	antigen_name	uniprot	gene_id
RBM100	6Ckine	O00585	112
RBM100	Adiponectin	Q15848	333
RBM100	Aldose Reductase	P15121	222

RBM Protein Mapping file format

The file contains mapping between samples and corresponding subjects. It is similar to the Mass Spec Proteomics Mapping file.

STUDY_ID	SITE_ID	SUBJECT_ID	SAMPLE_ID	PLATFORM	TISSUETYPE	ATTR1	ATTR2	CATEGORY_CD	SOURCE_CD
