This repo contains a script and an Rmd file for the pre-processing and
normalization of MaxQuant or DIA-NN output files through the MSstats R
package. The output is a tabular file in wide format (1 row per protein,
1 column per sample/condition) that can be used as input for
statistical analysis with limma or similar tools.
- Download/clone the contents of this repo to your local computer. This should create an R project folder with the script to run the preprocessing.
- Delete the `MSstats_Output_data/` folder and its contents from your local computer.
- Add the following three files to the R project folder:
  - `evidence.txt`
  - `proteinGroups.txt`
  - `annotation.csv` (not included in the MaxQuant txt folder; see below how to create it before executing the script).

  NOTE: These files should be in the same folder as the R script, and this folder should be an initialized RStudio project (there should be a `.Rproj` file in the same folder).
- Open your RStudio project by double-clicking the `.Rproj` file in your newly created R project folder.
- Open the script `mq_to_msstats_formating_normalization_n_prep_for_limma.R`.
- Modify lines 16 to 31 to set the parameters for both the transformation from MaxQuant format to MSstats format and the actual summarization and normalization.
- Execute the script (click ‘Source’ in the top-right corner of the script).
- The script should generate three `.csv` files: `msstats_tabular_data_for_limma_input.csv`, in wide format suitable for downstream analysis with `limma`, and two files in long format within `MSstats_Output_data` with the un-normalized and the normalized feature intensities before and after MSstats pre-processing.
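Before sourcing the script, it can help to confirm that the three input files are where the script expects them. The following is a minimal sketch, not part of the repo's scripts; the helper name `find_missing_inputs` is hypothetical, and it assumes the R working directory is the project folder (as it is when the project is opened through its `.Rproj` file).

```r
# File names taken from the steps above.
required_files <- c("evidence.txt", "proteinGroups.txt", "annotation.csv")

# Hypothetical helper: return the required files that are absent from `dir`.
find_missing_inputs <- function(dir = ".", required = required_files) {
  required[!file.exists(file.path(dir, required))]
}

# Check the current working directory (assumed to be the project folder).
missing_inputs <- find_missing_inputs(".")
if (length(missing_inputs) > 0) {
  message("Missing input files: ", paste(missing_inputs, collapse = ", "))
} else {
  message("All input files found.")
}
```

The same helper could be pointed at the DIA-NN inputs by passing a different `required` vector.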
BE AWARE!!: There is a known issue with the `dataProcess`
function from MSstats that makes it use a lot of RAM with big input
files (> 1 million rows). If you have a big output from DIA-NN and
your R session crashes due to RAM overload, you can execute
this script up to line 105, take the MSstats-formatted
data from
`~/MSstats_Output_data/MSstats_formated_tables/msstas_formated_diann_data_bf_normalization.csv`,
and continue on Galaxy, where RAM shouldn’t be an issue.
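To judge in advance whether an input file is in the size range mentioned above, its rows can be counted without loading the whole file into memory. This is only a sketch using base R; the helper name `count_lines` is hypothetical, and the demo runs on a small temporary file rather than on real DIA-NN output.

```r
# Hypothetical helper: count the lines of a text file in chunks, so the
# file is never held in memory all at once.
count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break
    n <- n + length(chunk)
  }
  n
}

# Self-contained demo: a temporary file with 1 header line and 3 data rows.
demo_file <- tempfile(fileext = ".tsv")
writeLines(c("Run\tIntensity", "a\t1", "b\t2", "c\t3"), demo_file)

n_rows <- count_lines(demo_file) - 1  # subtract the header line
if (n_rows > 1e6) {
  message("Large input: the MSstats processing step may need a lot of RAM.")
}
```

For real data, replace `demo_file` with the path to your own input, for example `"MainOutput.tsv"`.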
- Download/clone the contents of this repo to your local computer. This should create an R project folder with the script to run the preprocessing.
- Delete the `MSstats_Output_data/` folder and its contents from your local computer.
- Add the `MainOutput.tsv` output file from DIA-NN to the R project folder.
- Add your `annotation_diann.csv` file to the same folder.

  NOTE: These files should be in the same folder as the R script, and this folder should be an initialized RStudio project (there should be a `.Rproj` file in the same folder).

  NOTE 2: Check the samples folder for a sample of the `annotation_diann.csv` file and how it should look.
- Open your RStudio project by double-clicking the `.Rproj` file in your newly created R project folder.
- Open the script `diann_to_msstats_formating_normalization_n_prep_for_limma.R`.
- Modify lines 16 to 21 to set the parameters for both the transformation from DIA-NN format to MSstats format and the actual summarization and normalization.
- Execute the script (click ‘Source’ in the top-right corner of the script).
- The script should generate three `.csv` files: `msstats_tabular_data_for_limma_input.csv`, in wide format suitable for downstream analysis with `limma`, and two files in long format within `MSstats_Output_data` with the un-normalized and the normalized feature intensities before and after MSstats pre-processing.
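As an illustration of how the wide-format output might be prepared for downstream analysis, the sketch below turns a wide table into a numeric matrix with proteins as row names. It uses a toy data frame in place of the real file. The assumption that the first column holds the protein identifiers, the column name `Protein`, and the sample names are all hypothetical, so check them against your actual output.

```r
# Toy stand-in for msstats_tabular_data_for_limma_input.csv.
# ASSUMPTION: first column = protein identifiers; all names are hypothetical.
wide <- data.frame(
  Protein  = c("P1", "P2", "P3"),
  sample_A = c(20.1, 18.4, NA),
  sample_B = c(19.8, 18.9, 22.3),
  stringsAsFactors = FALSE
)

# With the real file you would read it instead, for example:
# wide <- read.csv("msstats_tabular_data_for_limma_input.csv", check.names = FALSE)

# Numeric matrix: one row per protein, one column per sample.
expr <- as.matrix(wide[, -1])
rownames(expr) <- as.character(wide[[1]])

dim(expr)             # number of proteins and samples
colSums(is.na(expr))  # missing values per sample
```

A matrix in this shape is the usual starting point for model fitting in packages such as limma.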
You have 2 options to create your annotation file:
- Use the `create_annotation_file.R` script created for this purpose (RECOMMENDED). NOTE: Currently the script only works for label-free samples where every sample corresponds to a different biological replicate. Create your file manually otherwise.
- Manually create your `annotation.csv` file in a spreadsheet editor (such as MS Excel).
- Check that you have `create_annotation_file.R` in your R project folder.
- Go to the Console in your open RStudio project session.
- Type `source("create_annotation_file.R")`.
- Answer the questions as prompted in the Console of your R session.
- Important!: Please verify that your sample names/codes correspond to the desired experimental condition by opening the newly created `annotation.csv` file. It should be in the same folder as your R project.
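The final check can also be done from the R Console. The sketch below is a hypothetical helper (`check_annotation` is not part of the repo) that verifies the four expected columns are present and tabulates samples per condition. The rule that each raw file appears only once is an assumption for the label-free, one-sample-per-replicate case. The demo uses a toy table with made-up sample names.

```r
# Hypothetical helper: basic sanity checks on an annotation table.
check_annotation <- function(annotation) {
  required <- c("Raw.file", "Condition", "BioReplicate", "IsotopeLabelType")
  missing_cols <- setdiff(required, names(annotation))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # ASSUMPTION: one row per raw file (label-free, one sample per replicate).
  if (anyDuplicated(annotation$Raw.file) > 0) {
    stop("Each raw file should appear only once.")
  }
  # Number of samples per condition, for a quick visual check.
  table(annotation$Condition)
}

# Toy example; with the real file use: annotation <- read.csv("annotation.csv")
annotation <- data.frame(
  Raw.file = c("run_01", "run_02", "run_03", "run_04"),
  Condition = c("control", "control", "treated", "treated"),
  BioReplicate = 1:4,
  IsotopeLabelType = "L",
  stringsAsFactors = FALSE
)
samples_per_condition <- check_annotation(annotation)
print(samples_per_condition)
```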
- Open a new spreadsheet (e.g. in MS Excel).
- The first row should contain your column names, as follows: “Raw.file”, “Condition”, “BioReplicate”, “IsotopeLabelType”.
- Fill in the rows with the required information for each sample:
  - For Raw.file: give the name of your Thermo RAW file as it was named when processed by MaxQuant.
  - For Condition: give the experimental or biological condition of the sample.
  - For BioReplicate: give the number of the biological replicate associated with this sample. If every sample came from a different biological source, you can give any number, as long as it is different for each sample.
  - For IsotopeLabelType: give the type of labelling. Since in this case we are working with label-free quantification, set all rows in this column to ‘L’.
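If you prefer to build the same table in R rather than in a spreadsheet, the following sketch produces a file with the four columns described above. The raw file names and conditions are made up for illustration. The file is written to a temporary folder so the example is safe to run as-is.

```r
# Hypothetical raw file names and conditions; replace with your own.
annotation <- data.frame(
  Raw.file = c("sample_01", "sample_02", "sample_03", "sample_04"),
  Condition = c("control", "control", "treated", "treated"),
  BioReplicate = 1:4,        # a different number per biological source
  IsotopeLabelType = "L",    # label-free quantification
  stringsAsFactors = FALSE
)

# Written to a temporary folder for this demo; for real use, point
# out_path at "annotation.csv" inside your R project folder.
out_path <- file.path(tempdir(), "annotation.csv")
write.csv(annotation, out_path, row.names = FALSE)

# Inspect the resulting file: a header row followed by one row per sample.
readLines(out_path)
```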