Finding statistically significant subgraphs while correcting for multiple testing
C++ Other
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src
.gitignore
LICENSE
README.md

README.md

#Significant Subgraph Mining with Multiple Testing Correction

The first algorithm that efficiently finds all significant subgraphs from graph databases while correcting for multiple testing.

##Summary

Given a graph database (a collection of graphs), class labels of graphs, and significance threshold α, this algorithm performs:

  • efficient computation of the corrected significance threshold for each test that strictly controls the FWER under α
  • enumeration of all subgraphs that are statistically significantly associated with the class membership

Two methods are implemented for computing the corrected significance threshold:

  • Tarone's testability correction (without -w option, default):
    • Computation is fast but FWER control is not optimal (the actual FWER is smaller than α)
  • Westfall-Young permutation correction (with -w option)
    • Slower than the above but optimal FWER control is achieved (the actual FWER is almost α)
    • This method is called Westfall-Young light and its itemset mining version is available

Please see the following papers for the detailed information about this algorithm and refer them in your published research:

  • For Tarone's testability correction:
    • Sugiyama, M., Llinares-López, F., Kasenburg, N., Borgwardt, K. M.: Significant Subgraph Mining with Multiple Testing Correction, Proceedings of the 2015 SIAM International Conference on Data Mining (SDM2015), 199-207, 2015. [PDF]
  • For Westfall-Young permutation correction:
    • Llinares-López, F., Sugiyama, M., Papaxanthos, L., Borgwardt, K. M.: Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing, Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD2015), 725-734, 2015. [PDF]

GASTON is used as a frequent subgraph mining algorithm:

  • Nijssen, S. and Kok, J.: A Quickstart in Frequent Structure Mining Can Make a Difference, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2004), 647-652, 2004.

##Usage

###Data format

The input is composed of two files: a graph database (a list of graphs) and a list of class labels. In a graph database, each graph is described as follows:

# <comment>
t # <transaction id>
v <node id> <label id>
e <source node id> <target node id> <label id>

This is the same as the standard format in frequent subgraph mining softwares, such as gSpan and GASTON. For example:

# start  
t # 1        // transaction id is 1
v 0 1        // node 0 with label 1
v 1 2        // node 1 with label 2
e 0 1 0      // edge from node 0 to 1 with label 0

In the class file, each line contains the class label of the corresponding graph. The class label should be either "0" or "1".

###Compilation

To compile the program, go to the "src" directory and type

make

The "Boost" library is needed to compile it.

###Run

To run the algorithm, type

./sgmine -a <alpha> -i <input_file> -c <input_class_file> -o <output_file>

<alpha> is the target FWER. It is set to be 0.05 if skipped.

###Output format

Resulting significant subgraphs are written to the file <output_file>. The output file has the same format as the input file except for two lines: supports (start from "# s") in minor and major classes and the p-value (start from "# p") of the graph. For example:

# s 22 7       // support of this subgraph in minor (left) and major (right) classes
# p 1.5841e-07 // p-value of this subgraph
t 1
v 0 0
v 1 9
v 2 9
e 0 1 0
e 1 2 0

###Argument list

-w : Perform Westfall-Young permutation correction if specified
-m <maxsize> : Maximum size of each subgraph (default: unlimited)
-a <alpha> : Target FWER (default: 0.05)
-j <perm> : Number of permutations in -w mode (default: 1000)
-r <seed> : Seed for permutations in -w mode (default: 0)
-i <input_file> : Input file of graph database
-c <input_class_file> : Input file of class labels
-o <output_file> : Output file of significant subgraphs

Example

There is an example graph database "Chemical_340" and the corresponding class file "Chemical_340_class". To run the algorithm on this database, type

./sgmine -i Chemical_340 -c Chemical_340_class -o output

You can redirect the resulting statistics like

./sgmine -i Chemical_340 -c Chemical_340_class -o output > stat

If you use Westfall-Young permutation, type

./sgmine -w -i Chemical_340 -c Chemical_340_class -o output

Contact