amaunz / fminer

Molecular feature miner for path- and tree-shaped structures from "Large Scale Graph Mining using Backbone Refinement Classes"

This URL has Read+Write access

fminer /
name age message
file AUTHORS Fri Sep 18 04:36:14 -0700 2009 ADDED AUTHORS [amaunz]
file INSTALL Mon Apr 27 03:17:21 -0700 2009 Added doc [amaunz]
file LICENSE Wed Mar 04 06:06:19 -0800 2009 Added License [amaunz]
file Makefile Mon Dec 21 02:20:13 -0800 2009 Merge branch 'master' of git@github.com:amaunz/... [amaunz]
file README Mon Jul 27 06:26:10 -0700 2009 Added examples [amaunz]
file USAGE Wed Jul 08 06:15:21 -0700 2009 Added USAGE [amaunz]
file configure Tue Dec 08 07:45:32 -0800 2009 Added configure script [amaunz]
file main.cpp Mon Jan 04 07:31:01 -0800 2010 Fixed initialization of fm::regression [amaunz]
README
 Welcome to Fminer.

 This is Fminer (application), available from http://github.com/amaunz/fminer/tree/master .
 It depends on LibFminer, available from http://github.com/amaunz/libfminer/tree/master . 
 The official website with documentation is http://www.maunz.de/libfminer-doc .

 For installation and documentation see INSTALL.
 For license information see LICENSE.


 File Formats:
 =============
 Graphs are allowed in SMILES and gSpan format. This is to support both general graph mining, but to also make life more 
 convenient when mining molecular data. For instance, for SMILES formatted input there is a switch to enable/disable 
 aromatic ring perception, which, in gSpan format, requires re-formatting the input.
  
  SMILES file line format is defined by 
    "ID\t SMILES", e.g.:
     1    COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4

  gSpan file block format (see also: http://illimine.cs.uiuc.edu/resources/readme_plain.php?program=gSpan) is defined 
  by:
    "t # ID" means the Nth graph,
    "v M L" means that the Mth vertex in this graph has label L,
    "e P Q L" means that there is an edge connecting the Pth vertex with the Qth vertex. The edge has label L.

- Activity file line format is defined by 
    "ID\t endpoint\t activity", e.g.:
     1    Salmonella Mutagenicity    1

 The set of IDs in a gSpan or SMILES format file must be a subset of the set of activity IDs in the respective activity 
 format file, i.e., not every Activity ID must be matched by a graph id, but vice versa.


 Switches (run Fminer w/o arguments for allowable parameter ranges, defaults and explanations):
 ========
 Constraint parameters:
 -f Minimum frequency constraint, used for anti-monotonic pruning.
 -l Subgraph type, choices are paths and trees.
 -p Chi-square significance level, used for statistical upper-bound pruning.

 Constraint switches:
 -d Deactivate dynamic adjustment of upper bound (less efficient, use only for performance evaluation). Can only be set 
 when -c is not set.
 -b Switch off mining of only the most significant/most general representative of each backbone.
 -m Switch on mining of maximal trees, as induced by constraints.
 -u Deactivate upper bound pruning for chi-square constraint (less efficient, use only for performance evaluation).

 General switches:
 -s Refine fragments with frequency 1.
 -a Enable aromatic ring perception for SMILES input.
 -r Insert empty vectors in result vector to indicate BBRC borders.
 -n Output line numbers from input file instead of IDs in result file.
 -o Switch off output.


 Environment variables:
 ======================
 Define the FMINER_SMARTS environment variable to produce output in SMARTS format (e.g. export FMINER_SMARTS=1).
 Additionally define the FMINER_LAZAR environment variable to produce output in linfrag format which can be used as 
 input to Lazar (e.g. export FMINER_LAZAR=1).


 Remarks:
 ========
 Structures are output to STDOUT in the specified format.
 Output consists of blocks of gSpan graphs.
 When using SMARTS output (environment variable FMINER_SMARTS is defined), each line is a YAML sequence, containing 
 SMARTS fragment, p-value, and two sequences denoting positive and negative class occurrences (default compound ids, see 
 also switch -n). 
 When also FMINER_LAZAR is set, the linfrag output format is used.


 Examples:
 =========

 In any case, 1-frequent patterns are not refined further, unless you use the -s switch 

 Use BBRC mining (default):

 # BBRC representatives (min frequency 2, min significance 95%), using dynamic UB pruning
 ./fminer <graphs> <activities> 
 # same as above, but much slower (explicitly no dynamic UB pruning)
 ./fminer -d <graphs> <activities>                                        

 Switch off BBRC mining:

 # all 2-frequent and 95%-significant features
 # Note, that the -d is mandatory (no dynamic UB pruning possible here)!
 ./fminer -d -b <graphs> <activities>
 # All 20-frequent patterns (standard frequent pattern mining)
 ./fminer -f 20 <graphs>

 
Andreas Maunz, 2008.