MaayanLab/Enrichment_Sandbox

Project Summary

Enrichment analysis algorithms identify known, annotated gene sets which are most similar to an input gene set. The input gene set might be a list of differentially-expressed genes from a biological experiment (e.g. treating a cancerous cell line with a drug, and identifying genes with the greatest change in expression levels using RNA-seq), while the annotations of the known gene sets might correspond to transcription factor binding sites identified by ChIP-seq, drug targets determined by perturbation signatures, or biological processes mined from scientific publications. By associating the annotations of the most similar known gene sets to the input gene set, a scientist can use enrichment analysis to learn about the biology of the input gene set. For the cancerous cell line example, this means answering questions like "What transcription factors might this drug have activated?", "What drugs have similar effects?", or "What regulatory pathways are playing a role here?".

The primary purpose of this project is to evaluate new algorithms for enrichment analysis. It only considers algorithms whose input is a gene set, not gene expression data. It evaluates these algorithms by measuring how well they can, when given an annotated gene set from one gene set library, return the "matching" annotations from a different gene set library, e.g. those which refer to the same drug, transcription factor, or biological process.

Currently, the most widely used algorithm for this form of enrichment analysis is Fisher's exact test. While simple and effective, it considers each gene independently and each annotation separately. Since biological phenomena interact with one another, there is, in theory, considerable room for improvement.
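
For reference, here is a minimal sketch of the Fisher's exact test step for a single annotation, using scipy.stats.fisher_exact. The function name and the way the background gene universe is passed are illustrative assumptions, not the repository's actual code.

```python
# Minimal sketch of Fisher's exact test enrichment for a single annotation.
from scipy.stats import fisher_exact

def fisher_enrichment_pvalue(input_genes, annotation_genes, background_genes):
    """One-sided Fisher's exact test for overrepresentation of
    annotation_genes within input_genes, relative to background_genes."""
    input_genes = set(input_genes) & set(background_genes)
    annotation_genes = set(annotation_genes) & set(background_genes)

    overlap = len(input_genes & annotation_genes)       # in both sets
    input_only = len(input_genes - annotation_genes)    # input set only
    annot_only = len(annotation_genes - input_genes)    # annotation set only
    neither = len(background_genes) - overlap - input_only - annot_only

    table = [[overlap, input_only],
             [annot_only, neither]]
    _, p_value = fisher_exact(table, alternative='greater')
    return p_value

# Annotations would then be ranked by ascending p-value.
```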

Here is a link to a video explanation of this project from August 2017. Note that some naming conventions have since changed: for example, I've renamed the feature and label libraries as the input and search libraries, respectively.

Algorithms

These enrichment analysis algorithms have been implemented:

  • A control: returns a random ranking of the annotations.
  • Fisher's exact test: for each annotation, computes a p-value for overrepresentation. Then, ranks annotations by ascending p-value.
  • Binomial proportions test: variant of the Fisher's exact test.
  • Z score and Combined score: from the Enrichr API.
  • Gini Impurity and Entropy: for each annotation, use its gene set to split the set of all genes in the search library and, with the objective of separating out the genes in the input gene set, compute the Gini impurity (a measure of misclassification error) or entropy (a measure of information gain) of the split. Then, rank annotations by ascending Gini impurity or entropy. (A sketch of this scoring appears after this list.)
    • A pairwise variant of the Gini Impurity and Entropy: splits on each pair of annotations. Then, ranks each annotation by a score which is a function of the Gini impurity or entropy of the splits in which it occurred.
  • Machine learning methods: build classifiers that predict whether genes in the search library also belong to the input gene set, and then rank annotations by increasing feature importance.
    • RandomForest, ExtraTrees, RandomTreesEmbedding, GradientBoosting and AdaBoost from sklearn.ensemble
    • LinearSVC from sklearn.svm
    • XGBoost from DMLC
    • Recursive variants of the above machine learning methods
    • Removed: Above machine learning methods, but using a classifier that predicts whether or not a gene set is enriched based on Fisher contingency table values and other metrics (e.g. Fisher p value, Gini impurity)
  • Mixtures of the above algorithms, i.e. score averaging and iterative elimination.
  • Removed: Fisher's exact test adjusted with gene-gene correlation data from ARCHS4
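
The Gini-based scoring referenced in the list above can be sketched as follows. This is a hedged restatement of the idea, not the repository's implementation; the function names and the gvm/input_mask interface are assumptions, and both masks are boolean numpy arrays over the same genes.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a boolean label vector."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def split_impurity(input_mask, annotation_mask):
    """Weighted Gini impurity after splitting all search-library genes on
    membership in one annotation's gene set; input_mask marks the genes
    that belong to the input gene set."""
    n = len(input_mask)
    inside = input_mask[annotation_mask]     # labels of genes inside the annotation's set
    outside = input_mask[~annotation_mask]   # labels of genes outside it
    return (len(inside) / n) * gini(inside) + (len(outside) / n) * gini(outside)

# Given a gvm (a boolean genes-by-annotations DataFrame) and a boolean input_mask
# aligned to its rows, annotations would be ranked by ascending split impurity:
#   scores = {a: split_impurity(input_mask, gvm[a].values) for a in gvm.columns}
```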

Libraries

This project supports two types of gene set libraries.

  • Transcription factor libraries:
    • ChEA: infers transcription factor binding sites using ChIP-X experiment data manually collected from scientific publications.
    • ENCODE: infers transcription factor binding sites using ChIP-seq experiments performed by the ENCODE project.
    • CREEDS: single-gene perturbation signatures extracted from Gene Expression Omnibus microarray data by crowdsourcing.
  • Drug libraries:
    • DrugBank: encyclopedia-like drug database compiled from scientific publications
    • Target Central Resource Database ("TargetCentral"): part of the "Illuminating the Druggable Genome" project, which compiles data from multiple sources and emphasizes GPCR, kinase, ion channel and nuclear receptor targets
    • Drug Gene Interaction Database: drug-gene interactions compiled by curation and text-mining from multiple sources
    • Drug Repurposing Hub ("DrugRepHub"): drug screening library containing known clinical drugs and their targets as reported by scientific publications
    • Drug Central: database of drugs approved or regulated by government agencies
    • Drug Target Commons: drug bioactivity database extracted and organized from multiple sources by crowdsourcing
    • STITCH: chemical-protein interaction network obtained from computational predictions, inter-organism knowledge, and other databases
    • CREEDS: human, mouse and rat drug perturbation signatures extracted from the Gene Expression Omnibus by crowdsourcing
    • LINCS: gene expression profiles from flow cytometry experiments with different perturbagens, time points, doses, and cell lines

The average gene set sizes for annotations from all the drug target libraries (i.e. all drug libraries except CREEDS and LINCS) are small. Therefore, these drug libraries were expanded using PPI data from BioGRID or hu.MAP, or coexpression data from ARCHS4. This resulted in three new, expanded libraries for each original library. See the writeup for the DrugLib_Enrichment_Comparison project for a full explanation of how these libraries were expanded.

How the algorithms are evaluated

Algorithms are evaluated using a benchmarking process. First, two libraries whose gene sets have similar annotations are chosen. (Either both have transcription factor annotations, or both have drug annotations.) One is designated as the input library, and the other as the search library. Annotations in the input library whose corresponding transcription factor or drug also corresponds to at least one annotation in the search library are identified. Then, one at a time, the gene sets of these common annotations from the input library are used to perform enrichment against the entire search library. For each input gene set, the enrichment algorithm returns a ranking of all annotations in the search library, from most enriched to least enriched. Since the input and search libraries should agree with one another, the search library annotations corresponding to the same transcription factor or drug as the input annotation should appear toward the top of the rankings. In other words, a good algorithm will consistently rank these matching annotations highly, while a bad algorithm will not. "Bridge plots," explained below, are used to qualitatively evaluate this.

Because the libraries are derived differently (e.g. transcription factor-to-gene libraries might be derived from experimental ChIP-seq data, or from ChIP-seq data mined from publications, or from microarray data), a secondary purpose of this project is to use enrichment analysis to measure the agreement between the libraries. If two libraries agree with one another, then their gene sets for matching annotations should be unusually similar, so enrichment algorithms should be able to consistently perform well. Theoretically, all libraries of the same type should have some agreement, and libraries which use the same or similar techniques to obtain the gene sets should have even more agreement.

Visualizing the results

A bridge plot summarizes the results of enrichment analysis between a single pair of libraries. Here is how it is made:

  • The curve begins at (0,0).
  • At every integer x-value after zero, the y-value decreases at a constant rate.
  • At every integer x-value, the y-value also increases proportionally to the number of matches with ranks occurring at that x-value.
    • (Recall that for each common annotation in the input library, the algorithm has created a ranking of all search library annotations. From this collection of rankings, the rank positions of the "matches" are aggregated, and this is what is used to increase the y-value.)
  • The decrease and increase constants are scaled such that they cancel out, and the curve ends at (n,0), where n is the number of search library annotations.
  • The plot is then normalized. The y-axis is scaled such that the product of the increase constant and the number of matches is 1. (Thus, the highest possible y-value is 1, and this occurs when all match rank positions are at x=0.) The x-axis is scaled such that ranks are evenly-spaced between zero and one.

The result is a curve whose null distribution is similar, but not identical, to a Brownian bridge. It can also be interpreted as the empirical cumulative distribution function of the match rank positions, turned 45 degrees counter-clockwise. A good ranking method will result in a bridge plot with a large area underneath it (AUC), and with a tall supremum occurring at a low x-value. (In the best possible outcome, where all match ranks occur at x=0, the curve is an isosceles right triangle with AUC=0.5 and a supremum of 1 at x=0.)

In the above example, the Fisher's exact test clearly outperforms the RandomForest algorithm, as seen by its taller and earlier supremum. The curve for the Control resembles a Brownian random walk, as expected.
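
As a rough sketch of the construction above (not the repository's plotting code; the function name and the use of 0-based ranks are assumptions), the normalized bridge coordinates can be computed from the aggregated match ranks like so:

```python
import numpy as np

def bridge_curve(match_ranks, n_search_annotations):
    """Normalized bridge plot coordinates from the aggregated (0-based)
    rank positions of the matching search-library annotations."""
    n, m = n_search_annotations, len(match_ranks)
    hits = np.bincount(np.asarray(match_ranks), minlength=n)  # matches per rank position
    up = np.cumsum(hits) / m        # cumulative increase: fraction of matches found so far
    down = np.arange(n) / n         # constant decrease, applied after x = 0
    x = np.concatenate([np.arange(n) / n, [1.0]])  # ranks rescaled onto [0, 1]
    y = np.concatenate([up - down, [0.0]])         # curve returns to 0 at the end
    return x, y

# np.trapz(y, x) gives the AUC; y.max() gives the supremum of the curve.
```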

Plots

These plots are available:

  • All bridge plots for a single library pair, comparing the rankings of different algorithms.
  • A combined plot, showing all bridge plots for all library pairs of a certain type (transcription factor or drug).
  • An array of bridge plots, with each library pair in its own subplot.
  • An array of hexbin plots, with each library pair in its own subplot.
    • These are useful for comparing how two different algorithms rank the same terms.

The bridge plots can be zoomed in to view only the top ranks, and filtered to show only certain algorithms.

How to Use

  1. Clone this repository to your computer.
  2. The enrichment analysis code works on gene vector matrices (gvms). These should be placed in the gvms subfolder. The transcription factor libraries have all been pre-downloaded to the repo in this format. If you would like to use the drug libraries, their gvms can be downloaded from this zip file. If you would like to use your own libraries, the scripts in convert_files_scripts.py might be able to help; for example, they can convert gmt files to gvms (a standalone sketch of this conversion appears after this list). convert_files.py shows how these scripts were used to obtain gvms for the transcription factor libraries from their original gmt files.
  3. Open perform_enrichment.py. Follow the instructions in the prompts to choose which algorithms and libraries to use. Then, run this script.
  4. Open visualize_results.py. Follow the instructions in the prompts to choose how to view the results. Then, run this script.
  • fix.py can be used to make adjustments such as renaming algorithms, resetting the results files, and modifying scores.
  • log.py is a log of experiments I've personally done, for my own reference.
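
For reference, here is an illustrative sketch of the gmt-to-gvm conversion mentioned in step 2. It is independent of the helpers in convert_files_scripts.py; the function name and the example file names are assumptions.

```python
import pandas as pd

def gmt_to_gvm(gmt_fname):
    """Read a gmt file (annotation, description, gene, gene, ...) into a
    boolean DataFrame with genes as rows and annotations as columns."""
    gene_sets = {}
    with open(gmt_fname) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            annotation, genes = fields[0], [g for g in fields[2:] if g]
            gene_sets[annotation] = set(genes)

    all_genes = sorted(set.union(*gene_sets.values()))
    gvm = pd.DataFrame(
        {annot: [gene in members for gene in all_genes]
         for annot, members in gene_sets.items()},
        index=all_genes, dtype=bool)
    return gvm

# Hypothetical usage:
# gvm = gmt_to_gvm('my_library.gmt')
# gvm.to_csv('gvms/my_library_gvm.csv')
```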

Scripts I have used previously are in the old folder. They worked with an older version and probably do not work now:

  • get_pairwise_fisher_pvals.py and analyze_pairwise_fisher_pvals.py were used to perform and explore the results of pairwise enrichment analysis (evaluating the degree of enrichment of a pair of terms at a time, instead of a single term) using the Fisher's exact test.
  • create_abridged_libs.py made a new gvm containing only the first n terms of another gvm. This was used to quickly test enrichment algorithms that take a long time to run, such as the pairwise algorithms.
  • edgelist_to_gmt.py was used to create gmts from drug library edgelists.
  • new_scores_from_old.py was used to create new scores from old score files.
  • get_classifiers.py was a helper script needed when an ML_Fisher_features method was used.
  • get_fisher_adjusted_vars.py was used to get and view the correlation data for the adjusted Fisher's test.

Code details

Description of pipeline

  1. convert_files.py
  • The gmt files are transformed into gene vector matrices (gvms), with column labels being annotations and row labels being genes. Each column is a boolean vector corresponding to a gene set.
  2. perform_enrichment.py
  • For each (input, search) pair of libraries:
    • Get a list of annotations in the input library whose corresponding transcription factor or drug is also found in at least one search library annotation. These will be referred to as the common input library annotations.
    • For each enrichment algorithm:
      • For each common input library annotation:
        • Perform enrichment by using this common annotation's gene set as the input gene set for the enrichment analysis algorithm.
        • The output of the algorithm is an enrichment score for each annotation in the search library.
        • Save this output as a column in a dataframe.
      • The final dataframe is stored as a .csv file. (Columns: common input library annotations. Indices: search library annotations. Cell values: scores given by the enrichment algorithm. File name: "input_" + input library + "into" + search library + ".csv".)
  3. visualize_results.py
  • For each (input, search) pair of libraries:
    • The results files created by perform_enrichment.py are loaded.
    • For each result file (i.e., for each algorithm):
      • For each column (i.e., for each common input library annotation):
        • The ranks for annotations which correspond to the same transcription factor or drug as the input gene set annotation are aggregated.
      • Based on these ranks, the coordinates of the bridge plot are generated, and saved to the rankings file (so that ranks and coordinates only need to be calculated once). Example: rankings_from_ChEA_2016_to_CREEDS.csv
  • Results are plotted as desired.
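
The sketch below gives a hedged outline of the perform_enrichment.py loop structure described above, for a single library pair and a single algorithm. The function name, the clean and enrichment_algorithm callables, and the output file handling are illustrative stand-ins, not the repository's actual interfaces.

```python
import pandas as pd

def benchmark_pair(ilib_gvm, slib_gvm, enrichment_algorithm, clean, out_fname):
    """Score every search-library annotation for every common input annotation."""
    slib_cleaned = {clean(annot) for annot in slib_gvm.columns}
    common_annots = [annot for annot in ilib_gvm.columns if clean(annot) in slib_cleaned]

    results = pd.DataFrame(index=slib_gvm.columns)
    for annot in common_annots:
        input_gene_set = set(ilib_gvm.index[ilib_gvm[annot]])
        # One enrichment score per search-library annotation, e.g. a Fisher p-value.
        results[annot] = enrichment_algorithm(input_gene_set, slib_gvm)
    results.to_csv(out_fname)  # file name built from the input and search library names
    return results
```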

Naming conventions

  • fname : file name
  • df : pandas.DataFrame
  • gmt : a file in gene matrix transposed format
  • gvm : a df in "gene vector matrix" format, where columns are gene set labels i.e. annotations, rows are genes, and cell values are boolean and denote set membership. In this format, each column is a vector describing a gene set.
  • lib : a library
  • ilib : the input library
  • slib : the search library
  • annot : annotation
  • cleaned : an annotation is "cleaned" when the corresponding transcription factor or drug name has been extracted from it. For example, when the annotation "ESR1_15608294_ChIP-ChIP_MCF-7_Human" is cleaned, it becomes "ESR1".
  • match : two annotations match when they refer to the same transcription factor or drug, i.e. they are equivalent after cleaning (a small sketch of cleaning and matching follows this list)
  • common input library annotation : an input library annotation with at least one match in the search library
  • tf : transcription factor
  • method : a machine learning method, for example RandomForestClassifier
  • funct : function - specifically, an enrichment analysis function in enrichment_functions.py
  • params : parameters for a function
  • algorithm : a specific function & parameters combination which is being used to perform enrichment
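
The sketch below illustrates "cleaning" and "matching" for a ChEA-style annotation. The function names and the second example annotation are illustrative, and the real cleaning rules are library-specific, so treat this only as a demonstration of the idea.

```python
def clean(annot):
    """Extract the transcription factor (or drug) name from an annotation."""
    return annot.split('_')[0].upper()

def is_match(annot_a, annot_b):
    """Two annotations match when they refer to the same entity after cleaning."""
    return clean(annot_a) == clean(annot_b)

assert clean('ESR1_15608294_ChIP-ChIP_MCF-7_Human') == 'ESR1'
assert is_match('ESR1_15608294_ChIP-ChIP_MCF-7_Human', 'ESR1_human_ChIP-seq')
```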

Future Directions

  • Continue algorithm development

About

Benchmarking enrichment analysis algorithms using transcription factor-gene and drug-gene libraries.
