  • HogProf is an extensible and tunable approach to phylogenetic profiling using orthology data. It is powered by MinHash-based data structures and is computationally efficient.
  • Still under major development and may change
  • Magic


  • Uses OrthoXML files and a taxonomy to calculate enhanced phylogenies for each family
  • These are transformed into MinHash signatures and a locality-sensitive hashing (LSH) forest object for search and comparison of profiles
  • Taxonomic levels and evolutionary event types (presence, loss, duplication) can be given custom weights in profile construction
  • Optimization of weights using machine learning
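
To make the idea behind the list above concrete, here is a minimal, unweighted MinHash sketch in plain Python. It is an illustration only and not HogProf's implementation: the function names, the "taxon:event" token format and the example families are all hypothetical, and it builds neither weighted signatures nor an LSH forest. It only shows how two profiles can be compared through fixed-length signatures instead of their full event sets.

```python
import hashlib


def profile_tokens(events):
    """Turn a {taxon: event} mapping into a set of 'taxon:event' tokens.

    The token format is an assumption made for this sketch.
    """
    return {f"{taxon}:{event}" for taxon, event in events.items()}


def minhash_signature(tokens, num_perm=128):
    """Keep the minimum hash value under each of num_perm salted hashes."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty profile")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for tok in tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical profiles: one evolutionary event per taxonomic level.
hog_a = profile_tokens({"Mammalia": "presence", "Aves": "presence",
                        "Fungi": "loss", "Viridiplantae": "duplication"})
hog_b = profile_tokens({"Mammalia": "presence", "Aves": "presence",
                        "Fungi": "loss", "Viridiplantae": "presence"})
hog_c = profile_tokens({"Bacteria": "presence", "Archaea": "presence"})

sig_a, sig_b, sig_c = (minhash_signature(t) for t in (hog_a, hog_b, hog_c))

print("exact a/b:", exact_jaccard(hog_a, hog_b))
print("estimated a/b:", estimate_jaccard(sig_a, sig_b))
print("estimated a/c:", estimate_jaccard(sig_a, sig_c))
```

The estimate gets closer to the exact value as num_perm grows, at the cost of longer signatures. Custom weights per taxonomic level or event type would need a weighted MinHash variant, which this sketch leaves out.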

If you run into any problems feel free to contact me at


$ git clone
$ cd hogprof/pyprofiler
$ pip install -r reqs.txt .

Let's get a current version of the OMA HDF5 file and GAF. This will allow us to use the HOGs and study the functional enrichment of our search results.

$ cd ../..
$ mkdir YourOmaDirectory
$ cd YourOmaDirectory
$ wget
$ wget

We also need to make a location to store our pyprofiler databases

$ cd ..
$ mkdir YourPyProfilerDirectory

Now navigate to the pyprofiler source folder. Open the file in the utils folder and give it the location of your OMA data as well as the folder where you would like to save your pyprofiler databases.

$ cd utils
$ nano

Change these to your parameters. Don't forget the trailing slash on the paths to your directories.

config = {
    "omadir": "YOUROMADIRECTORY/",
    "email": "YOUREMAIL",
}

Your email will be used to identify you to the NCBI when using their API.
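
A forgotten trailing slash is an easy mistake to make, so a small check before building a database can help. The helpers below are hypothetical and not part of pyprofiler. They assume that every key ending in "dir" holds a directory path (a convention inferred from "omadir") and that the email should at least contain an "@".

```python
def with_trailing_slash(path):
    """Return the path ending in exactly one forward slash."""
    return path.rstrip("/") + "/"


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key, value in config.items():
        # Assumption: keys ending in "dir" are directory paths.
        if key.endswith("dir") and not value.endswith("/"):
            problems.append(f"{key}: missing trailing slash ({value!r})")
    if "@" not in config.get("email", ""):
        problems.append("email: does not look like an email address")
    return problems


# Example with the placeholder values still in place.
example = {"omadir": "YOUROMADIRECTORY", "email": "YOUREMAIL"}
for problem in check_config(example):
    print(problem)

# Normalising a path instead of only reporting it.
print(with_trailing_slash("YourOmaDirectory"))
```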

Ok. We're ready! Now let's compile a database containing all HOGs and our desired taxonomic levels using default settings. Launch the lshbuilder script from the pyprofiler folder.

The dbtypes available on the command line are: all, plants, archaea, bacteria, eukarya, protists, fungi, metazoa and vertebrates.

$ python --name YOURDBNAME --dbtype all

This should build a taxonomic tree for the genomes contained in the release and then calculate enhanced phylogenies for all HOGs in OMA.

Once the database is complete, it can be interrogated using a profiler object. Construction and usage of this object are shown in the example notebook searchenrich.ipynb, found in the notebooks folder. It contains an analysis of a known but poorly described protein network. Please feel free to modify it to suit the needs of your own research.


