Tutorial : GenePattern module

Anne Richelle edited this page Jun 25, 2021 · 14 revisions

1. Data Preparation

The gene expression matrix can be provided as a .mat, .csv, or .xlsx file.

  • For .mat files, you need to provide a structure variable (named e.g. data) with a cell field named “genes” containing the NCBI Entrez gene IDs (one gene per cell) and a double field named “value” containing a matrix of gene expression values for each of the samples you want to evaluate (i.e. with rows corresponding to genes and columns corresponding to samples). See the available example dataTest.mat.
  • For .csv or .xlsx files, you need to provide an expression matrix with rows corresponding to genes and columns corresponding to samples. The first column lists the NCBI Entrez gene IDs, and the header row should start with “genes” followed by the name of each sample in your dataset. See the available examples dataTest.csv or dataTest.xlsx.
  • Note 1: The list of genes and associated NCBI Entrez ID present in each reference model can be downloaded here
  • Note 2: Remove any missing values from your dataset or replace any 'NA'/'nan' with blank cells
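
As an illustration of the .csv layout described above, here is a minimal Python sketch that writes an expression matrix with a “genes” header and replaces missing values with blank cells. The function name `write_expression_csv`, the set of tokens treated as missing, and the toy expression values are assumptions made for this example; they are not part of CellFie.

```python
import csv
import io

# Tokens treated as missing values (an assumption for this sketch).
MISSING = {"", "na", "nan", "none"}


def write_expression_csv(handle, sample_names, expression):
    """Write {entrez_gene_id: [value per sample]} as genes x samples CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    # Header row: starts with "genes", followed by one name per sample.
    writer.writerow(["genes", *sample_names])
    for gene_id, values in expression.items():
        if len(values) != len(sample_names):
            raise ValueError(f"gene {gene_id}: expected {len(sample_names)} values")
        # Replace 'NA'/'nan' (and similar) with blank cells.
        cleaned = ["" if str(v).strip().lower() in MISSING else v for v in values]
        writer.writerow([gene_id, *cleaned])


# Toy data: two genes, three samples (values are made up).
buffer = io.StringIO()
write_expression_csv(
    buffer,
    ["sample1", "sample2", "sample3"],
    {"1": [5.2, "NA", 7.1], "2": [0.0, 3.4, float("nan")]},
)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `handle`.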

2. The CellFie GenePattern Module

To access the CellFie module, go to the GenePattern website and click on Use GenePattern. Register a new account or log in to GenePattern in the Amazon Cloud. Search for the CellFie module in the list of available modules.

The following steps will walk you through how to set up the analysis.

2.1. Basic Parameters

CellFie requires three user-defined inputs:

  • Data* - This is where you upload your gene expression dataset. If the input data is on your computer, click on Upload File… and select the .mat or .csv file. If you are using data hosted online (e.g. on GitHub), you may provide the URL by clicking on Add Path or URL…
  • SampleNumber* - This refers to the number of samples present in your data. The example dataTest.csv includes 3 samples.
  • ReferenceModel* - Name of the reference model used to compute metabolic task scores. The drop-down menu will prompt you to choose the species-specific genome-scale metabolic model to be used for the analysis. The example dataTest.csv is a dataset from a human cell, so users would select Human_recon_2_2. If you are unsure which model to select, you can check these lists of NCBI Entrez IDs present in each of the reference models to ensure compatibility with your data.

2.2. Advanced Parameters

Multiple thresholding approaches can be used to compute the metabolic task score. These approaches determine the set of genes that are active in the conditions represented by the experimental dataset. In the original manuscript presenting CellFie, the threshold of a gene is defined as the mean value of its expression over all the samples from the same dataset, with the constraint that the threshold must be higher than or equal to the 25th percentile of the overall gene expression value distribution and lower than or equal to the 75th percentile (i.e. local + minmaxmean with the lower and upper bounds set to 25 and 75, respectively). Without further user input, the thresholding parameters default to these settings. Note that we have recently benchmarked the influence of these choices on the definition of the set of active genes and observed that this parameter combination performed best. For more details, we refer the reader to the following publication: Richelle A, Joshi C, Lewis NE (2019) Assessing key decisions for transcriptomic data integration in biochemical networks. PLOS Computational Biology 15(7): e1007185.
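
The default rule described above (local + minmaxmean with bounds at the 25th and 75th percentiles) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the percentile uses linear interpolation, and all values (including zeros) enter the overall distribution. The actual CellFie implementation may use a different percentile convention or pre-filtering.

```python
from statistics import mean


def percentile(values, q):
    """Percentile by linear interpolation (q in 0..100); assumed convention."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def minmaxmean_thresholds(expression, lower_pct=25, upper_pct=75):
    """Per-gene threshold: mean across samples, clamped to percentile bounds."""
    # Bounds come from the distribution of all expression values in the dataset.
    all_values = [v for values in expression.values() for v in values]
    low = percentile(all_values, lower_pct)
    high = percentile(all_values, upper_pct)
    # Clamp each gene's mean expression into [low, high].
    return {gene: min(max(mean(values), low), high)
            for gene, values in expression.items()}


# Toy dataset: three genes, three samples (values are made up).
toy = {
    "geneA": [0.0, 1.0, 2.0],     # mean 1.0, below the lower bound
    "geneB": [4.0, 5.0, 6.0],     # mean 5.0, inside the bounds
    "geneC": [10.0, 20.0, 30.0],  # mean 20.0, above the upper bound
}
thresholds = minmaxmean_thresholds(toy)
print(thresholds)
```

In this toy dataset the 25th and 75th percentiles are 2.0 and 10.0, so the threshold of geneA is raised to 2.0, that of geneC is lowered to 10.0, and geneB keeps its mean of 5.0.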

Users have the option to change the type of thresholding by modifying the advanced parameters:

  • Thresholding Approach* - Use the drop-down menu to select one of two thresholding methods, either global (the threshold is the same for all the genes) or local (the threshold is gene specific). See here for more details.
  • Percentile Or Value* - The threshold can be defined using a strict value introduced by the user (value) or based on a percentile of the expression value distribution (percentile). See here for more details.

If you selected the global thresholding approach:

  • Global Cutoff* - Enter the value or percentile defining the gene activity threshold. See here for more details.
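
For the global approach, a single threshold is shared by all genes. The sketch below shows the two ways of defining it (strict value or percentile of the overall distribution). The function name `global_threshold` and the `kind` argument are hypothetical, and the percentile is computed with Python's `statistics.quantiles` using the inclusive method, which may differ from the convention used inside CellFie.

```python
from statistics import quantiles


def global_threshold(expression, cutoff, kind="value"):
    """Return one threshold shared by every gene.

    kind="value":      cutoff is used directly as the threshold.
    kind="percentile": cutoff (integer 1..99) is a percentile of all values.
    """
    if kind == "value":
        return float(cutoff)
    if kind == "percentile":
        if int(cutoff) != cutoff or not 1 <= cutoff <= 99:
            raise ValueError("this sketch supports integer percentiles 1..99")
        all_values = [v for values in expression.values() for v in values]
        # quantiles(..., n=100) returns the 99 cut points P1..P99.
        return quantiles(all_values, n=100, method="inclusive")[int(cutoff) - 1]
    raise ValueError("kind must be 'value' or 'percentile'")


# Toy dataset: three genes, three samples (values are made up).
toy = {
    "geneA": [0.0, 1.0, 2.0],
    "geneB": [4.0, 5.0, 6.0],
    "geneC": [10.0, 20.0, 30.0],
}
by_value = global_threshold(toy, 3, kind="value")
by_percentile = global_threshold(toy, 50, kind="percentile")
print(by_value, by_percentile)
```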

If you selected the local thresholding approach:

  • Local Threshold Type* - Use the drop-down menu to define the type of local approach: either mean (the threshold for a gene is the mean expression value of this gene across all the samples) or minmaxmean (the threshold for a gene is the mean of the expression values observed for that gene among all the samples, but the threshold (i) must be higher than or equal to a lower bound and (ii) must be lower than or equal to an upper bound). See here for more details.
  • LowerBound* - If you selected the type minmaxmean for the local thresholding approach, this parameter specifies the lower bound used to define which genes are always inactive (enter a value or a percentile). See here for more details.
  • UpperBound* - If you selected the type minmaxmean for the local thresholding approach, this parameter specifies the upper bound used to define which genes are always active (enter a value or a percentile). See here for more details.
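
To summarize which advanced parameters apply to which choice, the following Python sketch models the settings and checks their consistency. The class `ThresholdSettings`, its field names, and the error messages are hypothetical and only mirror the parameters listed above; the defaults reproduce the combination described for the original manuscript (local, minmaxmean, 25th and 75th percentiles). GenePattern performs its own handling of these inputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThresholdSettings:
    """Hypothetical container mirroring the advanced parameters."""
    approach: str = "local"                  # "global" or "local"
    percentile_or_value: str = "percentile"  # "percentile" or "value"
    global_cutoff: Optional[float] = None    # only used when approach == "global"
    local_type: str = "minmaxmean"           # "mean" or "minmaxmean"
    lower_bound: Optional[float] = 25        # only used with minmaxmean
    upper_bound: Optional[float] = 75        # only used with minmaxmean

    def problems(self) -> List[str]:
        """Return a list of inconsistencies (empty when the settings fit)."""
        found = []
        if self.approach not in ("global", "local"):
            found.append("Thresholding Approach must be global or local")
        if self.percentile_or_value not in ("percentile", "value"):
            found.append("Percentile Or Value must be percentile or value")
        if self.approach == "local" and self.local_type not in ("mean", "minmaxmean"):
            found.append("Local Threshold Type must be mean or minmaxmean")
        # Collect the numeric parameters that matter for the chosen approach.
        if self.approach == "global":
            numbers = {"Global Cutoff": self.global_cutoff}
        elif self.approach == "local" and self.local_type == "minmaxmean":
            numbers = {"LowerBound": self.lower_bound, "UpperBound": self.upper_bound}
        else:
            numbers = {}
        for name, number in numbers.items():
            if number is None:
                found.append(f"{name} is required")
            elif self.percentile_or_value == "percentile" and not 0 <= number <= 100:
                found.append(f"{name} must be a percentile between 0 and 100")
        lower, upper = numbers.get("LowerBound"), numbers.get("UpperBound")
        if lower is not None and upper is not None and lower > upper:
            found.append("LowerBound must not exceed UpperBound")
        return found


print(ThresholdSettings().problems())                   # default settings
print(ThresholdSettings(approach="global").problems())  # cutoff not provided
```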

3. GenePattern Module Outputs

Once your run has finished successfully, you will see a blue checkmark in the top right corner of the window.

3.1 Downloading your results

Results can be downloaded by navigating to the Jobs tab on the right-hand panel, clicking on your CellFie run, and selecting Download Job.

3.2 Description of the output files

The GenePattern tool provides 6 outputs.

  • Stdout.txt - a text file containing the log information of the job submitted

  • taskInfo.csv - File containing the descriptive information about the 195 tasks assessed

  • score.csv - File containing the matrix of relative quantifications of the activity of the 195 metabolic tasks (rows) for each sample (columns) of the input dataset. A negative score for a metabolic task means that the score could not be calculated, either due to the lack of gene expression data or due to the lack of gene information in the reference model used (download the list of supported metabolic tasks for each reference model).

  • score_binary.csv - File containing the binary version of the metabolic task score matrix.

  • detailScoring.csv - File containing 8 columns for each sample with detailed information about the essential reaction scores:

    • Column 1: ID of the sample
    • Column 2: ID of the metabolic task
    • Column 3: Task score for this sample
    • Column 4: Binary task score for this sample
    • Column 5: Essential reaction associated with this task
    • Column 6: Expression score associated with the reaction listed in column 5 (i.e. RAL, Reaction Activity Level)
    • Column 7: Gene used to determine the expression of the reaction in column 5
    • Column 8: Original expression value of the gene listed in column 7

    Information is concatenated horizontally, so columns 9-16 contain the information for sample 2, columns 17-24 contain the information for sample 3, and so on.

  • Cellfieout.mat - MATLAB file containing the information related to the tasks (i.e., taskInfos) and the two score matrices (score and score_binary).
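
The layout of the output files described above can be handled with two small helpers, sketched here in Python. The function names, the field names in `DETAIL_FIELDS`, and the toy rows are assumptions made for this example. The sketch works on rows that have already been parsed (for instance with the `csv` module) and leaves the handling of any header row to the caller.

```python
COLUMNS_PER_SAMPLE = 8

# Hypothetical field names for the 8 columns of detailScoring.csv.
DETAIL_FIELDS = (
    "sample_id", "task_id", "task_score", "binary_task_score",
    "essential_reaction", "reaction_activity_level", "gene", "gene_expression",
)


def split_by_sample(row, n_samples):
    """Split one detailScoring row into one dict per sample (8 columns each)."""
    expected = COLUMNS_PER_SAMPLE * n_samples
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    blocks = []
    for i in range(n_samples):
        start = i * COLUMNS_PER_SAMPLE  # sample 2 starts at column 9, etc.
        blocks.append(dict(zip(DETAIL_FIELDS, row[start:start + COLUMNS_PER_SAMPLE])))
    return blocks


def uncomputed_tasks(task_ids, scores):
    """Return the tasks with a negative score in at least one sample."""
    return [task for task, row in zip(task_ids, scores)
            if any(value < 0 for value in row)]


# Toy detailScoring row for two samples (all values are made up).
detail_row = ["s1", "task1", "2.5", "1", "rxnA", "2.5", "1", "7.0",
              "s2", "task1", "0.4", "0", "rxnA", "0.4", "1", "0.9"]
per_sample = split_by_sample(detail_row, n_samples=2)

# Toy score matrix: tasks in rows, samples in columns.
flagged = uncomputed_tasks(["task1", "task2"], [[1.2, 0.8], [-1.0, -1.0]])
print(per_sample[1]["sample_id"], flagged)
```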