This is a very basic Jupyter tutorial for the Hereditary Depth First Search (HDFS) algorithm from the pathfinder module.

The aim of this exercise is to identify the optimum subset of elements, where elements could refer to sets of features for ML training or linear regression, or to experimental observables. Given access to a pairwise relation matrix (e.g. Pearson correlation, Fisher information, joint mutual information, ...), one can construct a Binary Acceptance Matrix (BAM) by defining a threshold (T) below which a combination is allowed. The HDFS algorithm efficiently identifies all subsets of elements in which every pairwise relation falls below T. It therefore provides a list of subsets containing minimally 'related' elements.
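As a minimal sketch of the thresholding step described above, the snippet below builds a BAM from a Pearson correlation matrix using NumPy only. The data, the threshold value `T = 0.3`, and all variable names are illustrative assumptions and are not part of the pathfinder module.

```python
import numpy as np

rng = np.random.default_rng(1)            # illustrative data, not from this tutorial
data = rng.normal(size=(200, 4))          # 200 samples of 4 hypothetical features
# Make features 0 and 1 strongly correlated so that the pair is rejected
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=200)

corr = np.corrcoef(data, rowvar=False)    # pairwise Pearson correlation, shape (4, 4)
T = 0.3                                   # assumed threshold
example_bam = np.abs(corr) < T            # True where a pair may be combined
np.fill_diagonal(example_bam, False)      # assumed convention: no self-combination
print(example_bam)
```

Taking the absolute value treats strong negative correlation as 'related' too; whether that is appropriate depends on the relation measure being used.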

In [None]:
# preamble
import numpy as np
import pathfinder as pf
from pathfinder import plot_results

Set the random seed for the pseudo-data generation.

In [None]:
seed = 0
np.random.seed(seed) 
print(f"seed = {seed}")

Here we will create the "Binary Acceptance Matrix" (BAM). \
\
The BAM ($\rho$) is a symmetric Boolean matrix that provides the pairwise combination condition, i.e. element $i$ can be combined with element $j$ if $\rho_{ij}$ = True. \
\
For this example we will create a 'pseudo' BAM by randomly generating matrix elements with Boolean values distributed as follows:

$P(\rho_{ij}=\text{True}) = p$ \
\
$P(\rho_{ij}=\text{False}) = 1-p$

In [None]:
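N = 25                                                                          # Matrix size (number of elements)
p = 0.5                                                                         # Probability that a matrix element is True
pseudo = np.triu(np.random.choice([True, False], size=(N,N), p=[p, 1-p]), 1)    # Random values in the upper triangle (diagonal excluded)
pseudo = pseudo | pseudo.T                                                      # Mirror the upper triangle to make the matrix symmetric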

The aim of this exercise is to identify the optimum subset of elements.
The HDFS algorithm identifies **all** allowed subsets of elements using the Binary Acceptance Matrix.
To help choose the optimum subset, one can provide a list of weights, which gives preference to the paths with the highest total weight sum.
If run without weights, the "**find_paths**" method will return the longest paths, as if every element had a uniform weight of 1.
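To make the ranking criterion concrete, here is a brute-force sketch on a hand-written 4x4 BAM. It enumerates every subset exhaustively, so it is an illustration of the problem being solved and not the HDFS algorithm itself; the matrix, the weights, and the helper `allowed` are hypothetical.

```python
from itertools import combinations
import numpy as np

# Small hand-written BAM (symmetric, False on the diagonal); values are illustrative
small_bam = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=bool)
small_weights = np.array([0.5, 0.2, 0.4, 0.9])

def allowed(subset, matrix):
    """Return True if every pair of elements in the subset is allowed by the BAM."""
    return all(matrix[i, j] for i, j in combinations(subset, 2))

# Enumerate every non-empty subset and keep the allowed ones
candidates = [
    s
    for r in range(1, len(small_weights) + 1)
    for s in combinations(range(len(small_weights)), r)
    if allowed(s, small_bam)
]

# With weights: rank by total weight sum, highest first
ranked = sorted(candidates, key=lambda s: small_weights[list(s)].sum(), reverse=True)
best = ranked[0]

# Without weights (uniform weight of 1): the longest allowed subset wins
longest = max(candidates, key=len)
print("best weighted:", best, " longest:", longest)
```

In this toy case the weighted optimum is the pair (2, 3), while the longest allowed subset is (0, 1, 2), which shows how the weights can change the preferred result. The exhaustive search scales as $2^N$ and is only practical for very small matrices.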



In [None]:
# Generate pseudo weights
weights = np.random.rand(N)
labels = [f'F{i}' for i in range(N)]
# Provide pseudo BAM and Weights to BinaryAcceptance class
bam = pf.BinaryAcceptance(pseudo, weights=weights, labels=labels)
# Plot the BAM
plot_results.plot(bam, size=10)                                                          

Pass the BAM, weights and labels to **find_best_combinations** with `algorithm='hdfs'` to get the top 5 **HDFS** results

In [None]:
hdfs = pf.find_best_combinations(matrix=pseudo,
                                 weights=weights,
                                 top=5, 
                                 allow_subset=False, 
                                 verbose=True,
                                 labels=labels,
                                 algorithm='hdfs')

Pass the same inputs to **find_best_combinations** with `algorithm='whdfs'` to get the top 5 **WHDFS** results

In [None]:
whdfs = pf.find_best_combinations(matrix=pseudo,
                                  weights=weights,
                                  top=5, 
                                  allow_subset=False, 
                                  verbose=True,
                                  labels=labels,
                                  algorithm='whdfs')

#### Expected result
1: Path = [1, 7, 15, 16, 22, 23],  Weight = 4.499650470394825,  

2: Path = [1, 9, 16, 22, 23],  Weight = 3.992302162921413,  

3: Path = [2, 13, 14, 15, 17],  Weight = 3.8649178973867615,  

4: Path = [2, 7, 14, 15, 17],  Weight = 3.707504655554509,  

5: Path = [2, 8, 13, 14, 17],  Weight = 3.5387637906697584  


In [None]:
print("WHDFS Vs HDFS results comparison:")
print(f"Weight comparison: {all(np.isclose(w.weight, h.weight) for w, h in zip(whdfs.res, hdfs.res))} (np.isclose default tolerances: rtol=1e-5, atol=1e-8)")
print(f"Path comparison:   {whdfs.get_paths == hdfs.get_paths}")

## Visualising Results
#### New plot_sorted Parameter

The new `plot_sorted` parameter controls whether paths are plotted in original or sorted (weight-ordered) index space:
- `plot_sorted=False` (default): Plot in original index space - HDFS and WHDFS look the same
- `plot_sorted=True`: Plot in sorted index space - shows weight-ordered structure

In [None]:
# plot results (plot_sorted = False)
axis = plot_results.plot(results=whdfs, size=10, plot_sorted=False, axis_labels=True, highlight_top_path=True)

In [None]:
# Plot sorted results (plot_sorted = True)
axis = plot_results.plot(results=whdfs, size=10, plot_sorted=True, axis_labels=True, highlight_top_path=True)

## Comparing HDFS and WHDFS Results

With `auto_sort=True` (default), WHDFS automatically returns paths in **original index space** via `get_paths`, making comparison with HDFS straightforward.

**Important**: The internal storage (`res` attribute) differs between HDFS and WHDFS:
- `hdfs.res` contains paths in original index space
- `whdfs.res` contains paths in sorted index space (for performance)

Always use the `get_paths` and `get_weights` properties for comparison.
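To illustrate what 'sorted index space' means, the sketch below maps a path between the two spaces with `np.argsort`. It assumes the sorted space orders elements by descending weight; the ordering actually used internally by WHDFS is not stated in this tutorial, so `order`, `path_sorted`, and the example values are hypothetical.

```python
import numpy as np

example_weights = np.array([0.2, 0.9, 0.5, 0.7])    # illustrative values
order = np.argsort(-example_weights)                # sorted position -> original index
# The heaviest element (original index 1) sits at sorted position 0

path_sorted = [0, 2]                                # a path expressed in sorted index space
path_original = sorted(order[path_sorted].tolist()) # the same path in original index space
print(path_original)

# The total weight of the path is the same in either space
weight_sorted = example_weights[order][path_sorted].sum()
weight_original = example_weights[path_original].sum()
print(np.isclose(weight_sorted, weight_original))
```

Because the same path carries different index values in the two spaces, comparing raw stored indices element by element would report a mismatch even when the underlying results agree.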

In [None]:
print("Results match (using ==):", whdfs == hdfs)
print("Paths match (using get_paths):", whdfs.get_paths == hdfs.get_paths)
print("Weights match (using get_weights, within floating-point tolerance):",
      all(np.isclose(whdfs.get_weights, hdfs.get_weights)))
print()
print("Note: whdfs.res != hdfs.res because res contains internal storage")
print("  (sorted indices for WHDFS, original indices for HDFS)")
print("  Always use get_paths/get_weights or == comparison instead!")