04. Proteomics Data

nboukharov edited this page Jan 16, 2017 · 56 revisions

Proteomics Data Loading Instructions

There are two formats for loading protein quantification data as High Dimensional Data (HDD): Mass Spec Proteomics and RBM. Why were proteomics loading formats developed for these two methods in particular? tranSMART is open source, and some of its features were developed and contributed by members of the community. Various HDD data formats were written by developers hired by Sanofi, based on the specifications of Sanofi scientists. I assume that, at the time, these two methods were popular with the Sanofi scientists.

This does not mean that quantitative proteomics data generated by other methods can't be loaded into tranSMART. The RBM format is very specialized, but the format developed for Mass Spec Proteomics is quite generic and can easily be adapted to any other quantitative proteomics data, as long as the method or assay detects and quantifies specific proteins and protein isoforms that have UniProt IDs. Quantification of protein variants, mutations or modifications, where the different entities being quantified map to the same UniProt ID, can also be loaded as High Dimensional data with some creative "probe IDs". Advanced Workflows use either "probe" or "probe-protein name" as legends for graphs and tables.

The Mass Spec format Proteomics data files for loading are a Platform file, a Data file and a Data mapping file. The ETL uploads Mass Spec Proteomics data from the <ProteinDataToUpload> directory.

About "Platform"

Here I have to digress and explain what a “Platform” file for HDD data is in tranSMART-speak, to avert the confusion this term can create, in the context of proteomics or other high-throughput data, for any sane experimental scientist.

The Platform concept is adopted from the data structure of the Gene Expression Omnibus (GEO), a repository for microarray, next-generation sequencing, and other forms of high-throughput functional genomics data. According to GEO, “A Platform record is composed of a summary description of the array or sequencer and, for array-based Platforms, a data table defining the array template.” For tranSMART HDD data, a Platform is simply a file that matches the detecting moiety (probe, antibody, peptide, ligand, etc.) to a standard ID of the biological entity being quantified. When a Platform file is loaded, it links the data to the data-specific Dictionary, which should already have been loaded when tranSMART was installed.

The Protein Dictionary includes UniProt ID, Protein Name and Gene Symbol. This makes it possible to search for a specific protein in the HDD pop-up in Advanced Workflows by simply starting to type a gene symbol or UniProt ID and then selecting the corresponding protein name.

Mass Spec Proteomics Data (sample)

Mass Spec Proteomics Data file format

Peptide Majority protein IDs Sample1 Sample2
AVLTIDEK P01009 22.8861 8.92968
LSITGTYDLK P01009 11.2 7.33
LALGDDSPALK P17174 6.09 8.98

Peptide does not necessarily have to be a peptide sequence. It can be any other reagent or probe ID used for quantitative protein detection. Majority protein IDs are UniProt IDs. This column is redundant: the whole point of the Platform file (see below) is to map Peptides/Probes from the data file to UniProt IDs. However, this is the format that was specified by the original tranSMART HDD proteomics format developers, and tMDataLoader expects this column to be present.
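As an illustration of the layout described above, here is a minimal Python sketch that reads such a data file. It assumes the file is tab-delimited (the header "Majority protein IDs" contains spaces, so whitespace splitting would not work) and that every sample cell holds a number. The function name `read_ms_data` is hypothetical and is not part of tMDataLoader.

```python
import csv
import io


def read_ms_data(handle):
    """Return (sample_names, rows); each row is (peptide, uniprot_id, [values]).

    Assumes a tab-delimited file whose first two columns are
    'Peptide' and 'Majority protein IDs', followed by one column per sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 3:
        raise ValueError("expected two ID columns and at least one sample column")
    samples = header[2:]
    rows = []
    for line in reader:
        if not line:
            continue  # skip blank lines
        peptide, uniprot_id = line[0], line[1]
        rows.append((peptide, uniprot_id, [float(v) for v in line[2:]]))
    return samples, rows


# The sample rows shown above, written with explicit tabs.
example = (
    "Peptide\tMajority protein IDs\tSample1\tSample2\n"
    "AVLTIDEK\tP01009\t22.8861\t8.92968\n"
    "LSITGTYDLK\tP01009\t11.2\t7.33\n"
    "LALGDDSPALK\tP17174\t6.09\t8.98\n"
)
samples, rows = read_ms_data(io.StringIO(example))
```

A quick pre-load check like this catches missing columns or non-numeric cells before the ETL run.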

Data Types (same as for Expression data)

The last character of the data file name before the extension (e.g. Test Study_GSE37425_PROTEIN_Data_R.txt) is one of the following letters:

  • R - raw data. Values are raw and are transformed to calculate the log2 value and z-score.
  • L - log2 data. Values are log2-transformed; the z-score is calculated from them.
  • T and Z - z-score data. Both have the same meaning: the value is written to z-score without modification if it is in the range (-2.5; 2.5); otherwise it is truncated to this range.

NOTE: For Mass Spec data loaded as L, raw values are not restored.
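The data type rules can be sketched in Python as follows. Only the file-name suffix and the truncation of T/Z values to the ±2.5 range come from the description above. How the loader actually computes z-scores is not documented here, so the `zscores` helper (population standard deviation over the values passed in) is purely an assumption, and all function names are hypothetical.

```python
import math
import os


def data_type_from_filename(filename):
    """Return the data type letter: the last character before the extension."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    letter = stem[-1:]
    if letter not in {"R", "L", "T", "Z"}:
        raise ValueError(f"unknown data type suffix in {filename!r}")
    return letter


def clamp_zscore(z, limit=2.5):
    """Truncate a z-score to the range [-limit, limit]."""
    return max(-limit, min(limit, z))


def zscores(values):
    """ASSUMPTION: z-scores over the given values, population std dev."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def process(values, data_type):
    """Turn one row of values into z-scores according to the data type."""
    if data_type == "R":   # raw: log2 first, then z-score
        return zscores([math.log2(v) for v in values])
    if data_type == "L":   # already log2: z-score only
        return zscores(values)
    # T or Z: values are z-scores already, only truncated to the range
    return [clamp_zscore(v) for v in values]
```

For example, `process([3.1, -4.0, 1.2], "Z")` leaves 1.2 untouched and truncates the other two values to the range limits.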

Mass Spec Proteomics Platform

#PLATFORM_ID: MSPROT
#PLATFORM_TITLE: QTRAP_5500
#SPECIES: Homo Sapiens

peptide majority_protein_id organism gpl_id
AVLTIDEK P01009 Homo Sapiens MSPROT
LSITGTYDLK P01009 Homo Sapiens MSPROT
LALGDDSPALK P17174 Homo Sapiens MSPROT

If you wonder why the Proteomics Platform file does not have the same format as the Platform file for Expression data, or why gpl_id (platform ID) is included as an additional column even though it is already specified in the first comment line - your guess is as good as mine.

NOTE: A Mass Spec Proteomics Platform can have the same Peptide mapped to several UniProt IDs. It will load without generating an error, and Analysis Workflows will run, but they don't seem to process this situation correctly. To avoid generating misleading results, degenerate peptide rows can be labelled as 1, 2, 3 (as in the Test study example).

Unlike Expression data, where values for Probes without a Gene will still be loaded, in Proteomics the data for Peptides without a UniProt ID in the platform will not be loaded.
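Both caveats can be checked before loading. The following Python sketch flags peptides that map to more than one UniProt ID and peptides in the data file that have no UniProt ID in the platform. The function name `check_platform` and its input shapes are hypothetical; this is a curator-side sanity check, not something the loader does.

```python
from collections import defaultdict


def check_platform(platform_rows, data_peptides):
    """Return (ambiguous, unmapped).

    platform_rows: iterable of (peptide, uniprot_id) pairs from the Platform file.
    data_peptides: iterable of peptide/probe IDs from the data file.
    ambiguous: peptides mapped to several UniProt IDs.
    unmapped: data peptides with no UniProt ID in the platform
              (their values would not be loaded).
    """
    mapping = defaultdict(set)
    for peptide, uniprot_id in platform_rows:
        if uniprot_id:  # ignore rows with an empty UniProt ID
            mapping[peptide].add(uniprot_id)
    ambiguous = {p: sorted(ids) for p, ids in mapping.items() if len(ids) > 1}
    unmapped = sorted(p for p in set(data_peptides) if p not in mapping)
    return ambiguous, unmapped


platform = [
    ("AVLTIDEK", "P01009"),
    ("LSITGTYDLK", "P01009"),
    ("LSITGTYDLK", "P17174"),  # made-up degenerate mapping, for illustration only
]
ambiguous, unmapped = check_platform(platform, ["AVLTIDEK", "LALGDDSPALK"])
```

Here `ambiguous` reports LSITGTYDLK with both IDs, and `unmapped` reports LALGDDSPALK, which is absent from this illustrative platform.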

Mass Spec Proteomics Mapping file

The file contains mapping between samples and corresponding subjects. It also contains additional information about samples - such as tissue type and optional attributes. Finally, it has category_cd which is used to determine path to sample related data.

Columns

The mapping file should contain 10 columns (TRIAL_NAME, SITE_ID, SUBJECT_ID, SAMPLE_CD, PLATFORM, TISSUE_TYPE, ATTRIBUTE_1, ATTRIBUTE_2, CATEGORY_CD, SOURCE_CD).

  • TRIAL_NAME - study identifier (should be same for all samples)
  • SITE_ID - sample's site. Optional
  • SUBJECT_ID - subject identifier
  • SAMPLE_CD - sample code, should match record from data file
  • PLATFORM - gene platform ID. It should be UPPERCASE
  • TISSUE_TYPE - tissue type (e.g. Blood)
  • ATTRIBUTE_1 - custom attribute 1. Optional
  • ATTRIBUTE_2 - custom attribute 2. Optional
  • CATEGORY_CD - multi-level category, separated by '+' symbol, used to build path in tree
  • SOURCE_CD - STD
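The column rules above lend themselves to a simple pre-load check. The Python sketch below assumes the columns appear in exactly the order listed and that rows have already been split into fields. `validate_mapping` is a hypothetical helper, not part of the loader.

```python
EXPECTED_COLUMNS = [
    "TRIAL_NAME", "SITE_ID", "SUBJECT_ID", "SAMPLE_CD", "PLATFORM",
    "TISSUE_TYPE", "ATTRIBUTE_1", "ATTRIBUTE_2", "CATEGORY_CD", "SOURCE_CD",
]


def validate_mapping(header, rows, data_samples):
    """Return a list of human-readable problems (empty if none were found).

    header: list of column names from the mapping file.
    rows: list of lists, one per sample, in header order.
    data_samples: set of sample names taken from the data file header.
    """
    problems = []
    if list(header) != EXPECTED_COLUMNS:
        # ASSUMPTION: column order must match the documented list exactly.
        return ["header does not match the 10 expected columns"]
    records = [dict(zip(header, row)) for row in rows]
    if len({r["TRIAL_NAME"] for r in records}) > 1:
        problems.append("TRIAL_NAME differs between samples")
    for r in records:
        if r["PLATFORM"] != r["PLATFORM"].upper():
            problems.append(f"PLATFORM not uppercase: {r['PLATFORM']}")
        if r["SAMPLE_CD"] not in data_samples:
            problems.append(f"SAMPLE_CD not in data file: {r['SAMPLE_CD']}")
    return problems
```

The check on SAMPLE_CD is the most useful one in practice, since a sample code that does not match the data file header is easy to miss by eye.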

category_cd placeholders

Usually category_cd is converted to a path as is. So, if you have category_cd=Protein Data+QTRAP_5500+Blood, you should expect the following path in the tree under the study root: //Protein Data/QTRAP_5500/Blood/. You can also use special keywords as tokens, which are automatically replaced with the corresponding values.

  • PLATFORM - this token is replaced with Platform Title from the Platform File for the Platform ID indicated in the Platform column
  • TISSUETYPE - value from TISSUE_TYPE column
  • ATTR1 - value from ATTRIBUTE_1 column
  • ATTR2 - value from ATTRIBUTE_2 column
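The placeholder substitution can be sketched in Python as below. It assumes a token is replaced only when it makes up a whole '+'-separated level; the actual loader may behave differently (for example, replacing tokens inside a longer level name). The function name `resolve_category_cd` is hypothetical.

```python
def resolve_category_cd(category_cd, platform_title, tissue_type, attr1="", attr2=""):
    """Expand placeholder tokens and return the tree path below the study root."""
    tokens = {
        "PLATFORM": platform_title,   # Platform Title from the Platform file
        "TISSUETYPE": tissue_type,    # value of the TISSUE_TYPE column
        "ATTR1": attr1,               # value of the ATTRIBUTE_1 column
        "ATTR2": attr2,               # value of the ATTRIBUTE_2 column
    }
    levels = [level.strip() for level in category_cd.split("+")]
    # ASSUMPTION: only levels that exactly equal a token are replaced.
    return "/".join(tokens.get(level, level) for level in levels)


path = resolve_category_cd("Protein Data+PLATFORM+TISSUETYPE", "QTRAP_5500", "Blood")
```

With the values above, the tokenized category_cd resolves to the same path as the literal category_cd=Protein Data+QTRAP_5500+Blood.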

Again, the slight, seemingly unnecessary differences between the expression and proteomics data mapping files are historic facts of life and are there just to keep data curators on their toes at all times.

RBM Proteomics Data

RBM stands for Rules-Based Medicine. RBM was a privately held company that developed a proprietary Multi Analyte Profiling (MAP) technology. It was acquired by Myriad. Myriad sells an array of RBM MAPs, which are basically “ELISA”-type protein quantification assays where proteins are captured by specific antibodies.

RBM Protein Data file format

Data file format as specified by the original Sanofi HDD specification document:

  • The input file has a fixed format with the following columns, but only Sample ID, Analyte and avalue are loaded: id, rid, sampid, plate, visit_code, Analyte (ana_unit), LDD, avalue, analval, belowLDD, read_low, read_high, logtrans, outlier
  • Units are loaded along with the analyte name for display purposes, e.g. Agouti-Related Protein (AGRP) (pg/mL)

id rid sampid plate visit_code Analyte (ana_unit) LDD avalue analval beloLDD read_low read_hi logtrans outlier
1 723 AA800N4Q-07 2 bl Adiponectin (ng/ml) 0.02 2.5 2.5 0 0 0 1 0

Data file format is most likely determined by the output of the RBM data processing program used by the Sanofi scientists.
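Since only three of the columns are loaded, a curator may want to preview exactly what will end up in tranSMART. The Python sketch below pulls those three fields out of an RBM data file, assuming the file is tab-delimited and uses the column names sampid, Analyte (ana_unit) and avalue as in the sample header. `read_rbm` is a hypothetical helper.

```python
import csv
import io


def read_rbm(handle):
    """Return a list of (sample_id, analyte_with_unit, value) tuples.

    Assumes a tab-delimited file with a header row; all other
    columns are ignored, mirroring what is described as being loaded.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["sampid"], row["Analyte (ana_unit)"], float(row["avalue"]))
        for row in reader
    ]


# The sample header and row from above, written with explicit tabs.
example = (
    "id\trid\tsampid\tplate\tvisit_code\tAnalyte (ana_unit)\tLDD\tavalue\t"
    "analval\tbeloLDD\tread_low\tread_hi\tlogtrans\toutlier\n"
    "1\t723\tAA800N4Q-07\t2\tbl\tAdiponectin (ng/ml)\t0.02\t2.5\t2.5\t0\t0\t0\t1\t0\n"
)
records = read_rbm(io.StringIO(example))
```

Looking columns up by name rather than by position keeps the sketch independent of the exact spelling of the columns that are not loaded.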

RBM Protein Platform file format

#PLATFORM_ID: RBM100
#PLATFORM_TITLE: Test RBM Platform
#SPECIES: Homo Sapiens

gpl_id antigen_name uniprot gene_symbol gene_id
RBM100 6Ckine O00585 112
RBM100 Adiponectin Q15848 333
RBM100 Aldose Reductase P15121 222

RBM Protein Mapping file format

The file contains mapping between samples and corresponding subjects. It is similar to the Mass Spec Proteomics Mapping file.

STUDY_ID SITE_ID SUBJECT_ID SAMPLE_ID PLATFORM TISSUETYPE ATTR1 ATTR2 CATEGORY_CD SOURCE_CD